Improving source code suggestion with code embedding and enhanced convolutional long short-term memory

Source code suggestion is a highly helpful feature of integrated development environments that quickens software development by suggesting the next possible source code tokens. The source code contains useful semantic information that is ignored or not utilised to its full potential by existing approaches. To improve the performance of source code suggestion, the authors propose a deep semantic net (DeepSN) that makes use of the semantic information of the source code. First, DeepSN uses an enhanced hierarchical convolutional neural network combined with code-embedding to automatically extract the top-notch features of the source code and to learn useful semantic information. Next, the source code's long and short-term context dependencies are captured by using long short-term memory. We extensively evaluated the proposed approach against three baselines on ten real-world projects, and the results suggest that the proposed approach surpasses state-of-the-art approaches. On average, DeepSN achieves 7.6% higher accuracy than the best baseline.


| INTRODUCTION
Source code suggestion is one of the vital tools of modern integrated development environments (IDEs). It helps accelerate software development by suggesting the next likely source code token. However, traditional source code suggestion tools rely on the limited information already present in the IDE, which cannot capture developers' programming patterns well. Recently, source code modelling research has gained much attention by using statistical language models and learning from large-scale codebases (e.g. GitHub). The n-gram is one of the most widely used source code language models [1,4]. More recently, deep learning-based approaches [3,5-7] have shown great potential in source code modelling.
Hindle et al. [2] revealed the naturalness of source code by employing an n-gram statistical language model and demonstrated the effectiveness of their approach on the source code suggestion task. However, n-gram-based models work well only with smaller context sizes and do not consider the semantic information of the source code. Recently, White et al. [7] showed that the recurrent neural network (RNN) can significantly improve the performance of source code suggestion models as compared to n-gram models. Although RNN-based neural language models [3,7] have shown their effectiveness in modelling source code, they suffer from the vanishing gradient problem. Further, recurrent neural models do not consider spatiotemporal information, which can be helpful in source code modelling. Moreover, the source code contains useful semantic information that is ignored or not utilised to its full potential by these approaches.
We propose a deep semantic net (DeepSN) that makes use of the semantic information of the source code to address the above concerns and to improve the performance of source code suggestion. First, we use an enhanced hierarchical convolutional neural network (CNN) combined with code-embedding to automatically extract the top-notch features of the source code and to deep-learn its semantics. A CNN can extract local features by using multiple convolving filters and achieves sound performance on tasks related to semantic parsing [8]. Thus, DeepSN adopts a CNN to extract top-notch features of the source code and learns their deep semantic relationships by using the code-embedding technique. Moreover, DeepSN learns the source code's long and short-term context dependencies by using long short-term memory (LSTM). We extensively evaluated the proposed approach on ten projects (dataset) and the outcome suggests that the proposed approach surpasses state-of-the-art approaches.

| RELATED WORK
Recently, deep learning methods have shown great potential for various source code modelling tasks and have proved useful for problems such as code generation, code completion, API mining, code migration, code categorisation and syntax error correction.
Tu et al. [4] proposed an n-gram language model equipped with a cache. Hellendoorn et al. [9] presented a cache-based model that improved over the simple n-gram-based approach by introducing nested locality. Similarly, Franks et al. [10] proposed an Eclipse plugin (CACHECA) that combines the default suggestions with suggestions returned by a cache language model [4]. A common limitation of n-gram-based approaches is that they work well only with a small context size and fail to capture semantic similarities. Raychev et al. [3] utilised statistical language techniques to synthesise source code by providing completions for a program; they utilised RNN and n-gram language models for this task. Recently, White et al. [7] utilised a neural network-based approach to provide completions for source code and showed that the RNN can outperform cache-based n-gram models on the source code completion task.
Gupta et al. [5] presented a deep learning-based approach to fix C programming language errors. Similarly, Santos et al. [6] proposed an LSTM-based neural model for the detection and correction of syntax errors present in programs. Our focus is on code completion rather than correcting syntax errors present in a file. Mou et al. [11] proposed a tree-based CNN for the classification of source code programs. Yin et al. [12] proposed a syntax-driven neural network for source code generation. Maddison et al. [13] proposed a method for source code generation via structured generation. Similarly, Rabinovich et al. [14] proposed an abstract syntax network-based approach for the generation of source code. Sethi et al. [15] proposed a method for generating source code from research papers by employing neural learning. These works focus on the source code generation task, whereas our work focuses on the prediction of the next token.
Allamanis et al. [16] proposed a probabilistic context-free grammar (PCFG) based model to mine source code idioms and demonstrated that source code idioms occur across software projects. Chan et al. [17] proposed a method that uses sub-graphs for API usage search via simple text phrases. Bielik et al. [18] proposed a probabilistic decision tree model for predicting API elements. Thung et al. [19] proposed a method recommendation system for feature requests stored in bug repositories. Pham et al. [20] proposed a method that utilises byte code for recommending API usage. Nguyen et al. [21] proposed a graph-based statistical language model named GraLan for API code recommendation. Different from Nguyen et al. [21], our work concentrates on the source code suggestion task rather than API code recommendation.
Raychev et al. [22] proposed a decision tree-based approach for code completion in dynamically typed languages. They formulate the problem of learning a probabilistic model over abstract syntax trees of code as learning a decision tree in a domain-specific language. Li et al. [23] utilise different neural networks that leverage token-level and structural information in dynamically typed languages. Similarly, Bhoopchand et al. [24] proposed a sparse pointer network-based approach for code completion in the Python programming language (a dynamically typed language). They utilise a filtered view of a memory of previous identifiers, which helps capture long-range dependencies.
Our work is most closely related to [2,3,7]. Hindle et al. [2] showed that even n-gram-based statistical language techniques can help improve the source code suggestion task, implying that source code is natural. However, n-gram-based models work well only with smaller context sizes and do not consider the semantic information of the source code. Recently, White et al. [7] showed that the RNN can significantly improve the performance of source code models. Although RNN-based neural learning approaches [3,7] have shown their effectiveness in modelling source code, they suffer from the vanishing gradient problem and do not consider spatiotemporal information, which can be helpful in source code modelling. Furthermore, beneficial semantic information present in the source code is ignored or not utilised to its full potential by existing source code suggestion approaches.

| Convolutional neural network
The CNN was initially developed for image recognition, where the model learns an internal representation of the image for classification [25]. The convolution layers correlate spatially adjacent points by transforming local patterns through connected neurons. The same approach can be adopted for text classification [8]. Normally, a CNN consists of two kinds of layers. The first is the convolutional layer, which uses filters to transform the input into feature maps, where each feature corresponds to a filter. The second is the pooling layer, whose purpose is to reduce the spatial dimensions of the feature maps.
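To make the convolution and pooling steps concrete, here is a minimal NumPy sketch (not the paper's implementation) of filters sliding over a token-embedding matrix, followed by max-over-time pooling; all shapes and weights are illustrative assumptions.

```python
import numpy as np

def conv1d_features(embeddings, filters, bias):
    """Slide each filter over the token-embedding matrix and collect
    one feature map per filter (valid convolution + ReLU)."""
    n_tokens, emb_dim = embeddings.shape
    n_filters, fs, _ = filters.shape        # (filters, filter_size, emb_dim)
    steps = n_tokens - fs + 1
    maps = np.empty((n_filters, steps))
    for f in range(n_filters):
        for t in range(steps):
            window = embeddings[t:t + fs]   # local window of fs tokens
            maps[f, t] = np.sum(window * filters[f]) + bias[f]
    return np.maximum(maps, 0.0)            # ReLU non-linearity

def max_over_time(feature_maps):
    """Max-over-time pooling: keep the strongest response per filter."""
    return feature_maps.max(axis=1)

rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 8))      # 20 tokens, embedding size 8
filt = rng.normal(size=(4, 3, 8))   # 4 filters of width 3
pooled = max_over_time(conv1d_features(emb, filt, np.zeros(4)))
print(pooled.shape)                 # one pooled feature per filter
```

Each filter thus produces a feature map over all token windows, and pooling reduces it to a single, position-independent feature.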

| Long short-term memory
The LSTM network [26] is an advanced version of the RNN that helps overcome the vanishing and exploding gradient problems that normally reside in vanilla RNNs. The LSTM consists of a memory cell that accumulates state information to learn long-term dependencies. The LSTM is formalised as

i_t = σ(W_i [h_{t−1}, x_t] + b_i)
f_t = σ(W_f [h_{t−1}, x_t] + b_f)
o_t = σ(W_o [h_{t−1}, x_t] + b_o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c [h_{t−1}, x_t] + b_c)
h_t = o_t ⊙ tanh(c_t)

where i_t is the input gate, f_t is the forget gate, c_t is the cell state, o_t is the output gate and h_t is the hidden state at time t; c_{t−1} and h_{t−1} are the cell and hidden states at time t − 1. σ is the sigmoid and tanh is the hyperbolic tangent function; W denotes the weights and b the biases.
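The update equations above can be sketched as a single NumPy step; the stacked weight layout and the sizes here are illustrative assumptions, not the trained model's parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*hidden, hidden+input), stacking the
    input, forget and output gates and the candidate; b stacks the biases."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    i_t = sigmoid(z[:hidden])                 # input gate
    f_t = sigmoid(z[hidden:2 * hidden])       # forget gate
    o_t = sigmoid(z[2 * hidden:3 * hidden])   # output gate
    g_t = np.tanh(z[3 * hidden:])             # candidate cell state
    c_t = f_t * c_prev + i_t * g_t            # new cell state
    h_t = o_t * np.tanh(c_t)                  # new hidden state
    return h_t, c_t

rng = np.random.default_rng(1)
hidden, inp = 5, 3
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + inp))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
h, c = lstm_step(rng.normal(size=inp), h, c, W, b)
print(h.shape, c.shape)
```

The forget gate f_t scales the old cell state while the input gate i_t admits the candidate, which is what lets gradients flow over long sequences.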

| BATCH NORMALISATION
Batch normalisation [27] mitigates the problem of internal covariate shift and reduces a network's sensitivity to initialisation. Batch normalisation regulates a layer's input by adjusting and scaling the activations, which improves the performance and stability of neural networks. It can be expressed as

μ = (1/m) Σ_{i=1}^{m} x_i,   φ = (1/m) Σ_{i=1}^{m} (x_i − μ)²
x̂_i = (x_i − μ) / √(φ + ε),   y_i = γ x̂_i + β

where (x_i) is the batch with m values, μ and φ are the mean and variance of the batch, ε is a constant, and γ and β are learned during model training.
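As an illustration, here is a minimal NumPy version of this normalisation, with illustrative γ = 1 and β = 0 (in training these would be learned).

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalise a batch to zero mean / unit variance per feature,
    then apply the learned scale (gamma) and shift (beta)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
out = batch_norm(batch, gamma=np.ones(2), beta=np.zeros(2))
print(out.mean(axis=0))  # close to [0, 0]
```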

| PROPOSED APPROACH
This section describes the proposed DeepSN that suggests the next source code token. The purpose is to improve the performance of the source code suggestion task by learning the source code context and semantics from real-world historical codebases. Figure 1 illustrates the general framework of the proposed approach. First, we discuss the preprocessing of the source code, which involves regularisation, tokenisation and vectorization. DeepSN uses the vectorized source code files as input to the enhanced hierarchical CNNs combined with code-embedding to automatically learn the source code context (τ) and semantic relationships. Next, the extracted features are used as input to the LSTM to learn their long and short-term context dependencies. Given a source code program, the trained model is used to automatically suggest the next source code token. The suggestions for the succeeding source code token can be defined as

Y = f(τ)

where Y describes the suggestions for the succeeding source code token, f describes the classification function used to provide predictions for the succeeding token and τ describes the input context. We use the source code of real software projects obtained from GitHub to train and test the models for the next source code suggestion task. The collected dataset (D) can be formalised as

D = <P_1, P_2, …, P_k>,   P_i = <F_{i1}, F_{i2}, …, F_{ij}>

where i = 1, …, k with k ≥ 1 and j ≥ 1. Here, <P_1, P_2, …, P_k> represents the k projects in D and <F_{i1}, F_{i2}, …, F_{ij}> represents the j files in the i-th project.

| Preprocessing
Preprocessing is an important process in machine learning/deep learning-based language modelling. Following common practice, we carry out regularisation, feature extraction and vocabulary building. Each of the preprocessing steps is discussed in the following subsections.

| Regularisation
Commonly, a dataset comprises unnecessary features that are not significant for a specific task, and such features can deeply affect the model's performance. To avoid this issue, we remove all block-level comments, inline comments and blank lines from the source code files. Moreover, we replace constant values with their general types (e.g. 7 = IntVal, 9.2 = FloatVal, 'Hello World' = StringVal).
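A hedged sketch of this regularisation step using simple regular expressions; the exact rules used by the authors may differ, and the patterns here are illustrative.

```python
import re

def regularise(code):
    """Strip block and inline comments and blank lines, then replace
    literal constants with their general types, as described above."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)   # block comments
    code = re.sub(r"//[^\n]*", "", code)                     # inline comments
    code = re.sub(r'"[^"\n]*"', "StringVal", code)           # string literals
    code = re.sub(r"\b\d+\.\d+\b", "FloatVal", code)         # float literals
    code = re.sub(r"\b\d+\b", "IntVal", code)                # int literals
    lines = [ln for ln in code.splitlines() if ln.strip()]   # drop blank lines
    return "\n".join(lines)

src = 'int x = 7; /* init */\ndouble y = 9.2; // ratio\n\nString s = "Hello World";'
print(regularise(src))
```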

| Tokenisation
Next, we perform tokenisation, in which the terms/words are extracted from the dataset (D). To achieve this, we split each source code program (file) into a sequence of space-separated terms. Then each sequence is subdivided into multiple sequences with a fixed-size context (τ = 20) by employing a sliding-window approach.
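The sliding-window step can be sketched as follows; the step size of one is an assumption, as the paper only fixes τ = 20.

```python
def sliding_windows(tokens, tau=20, step=1):
    """Split a token sequence into fixed-size context windows (tau),
    each paired with the next token to be predicted."""
    samples = []
    for start in range(0, len(tokens) - tau, step):
        context = tokens[start:start + tau]
        target = tokens[start + tau]       # next source code token
        samples.append((context, target))
    return samples

tokens = [f"t{i}" for i in range(25)]
pairs = sliding_windows(tokens, tau=20)
print(len(pairs))  # 5 context/target pairs
```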

| Vectorization
It is necessary to construct a vocabulary to train a neural language model. To achieve this, we substitute singleton terms with a unique token UNK, which is a common practice when building a global vocabulary for statistical language models. Next, we construct the vocabulary (V) of all unique tokens (terms) abstracted from the dataset. Finally, all source code tokens are substituted with the positive integer value corresponding to their vocabulary index, which prepares the sequences in a form acceptable for training a neural language model. The abstracted context vectors found in a software project can be expressed as

P_i = <τ_1, τ_2, …, τ_k>

where τ_1, τ_2, …, τ_k describes the k context vectors in the i-th project.
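A minimal sketch of the vocabulary-building and integer-encoding steps; the particular index scheme (starting at 1) is an illustrative assumption.

```python
from collections import Counter

def build_vocab(corpus_tokens):
    """Replace singleton terms with UNK, then map each unique token
    to a positive integer index (0 left free, e.g. for padding)."""
    counts = Counter(corpus_tokens)
    tokens = [t if counts[t] > 1 else "UNK" for t in corpus_tokens]
    vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)), start=1)}
    return vocab, [vocab[t] for t in tokens]

corpus = ["for", "(", "int", "i", "=", "IntVal", ";", "i", "<", "n", ";"]
vocab, encoded = build_vocab(corpus)
print(vocab["UNK"], len(encoded))
```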

| Deep semantic net (DeepSN)
The vectorized source code files are then used to automatically extract the hidden features (deep features) by learning the source code's semantics along with their long and short-term context dependencies. The model first transforms the source code tokens into a high-dimensional vector space (code-embedding) in which similar code tokens are placed closer together. In code-embedding, source code tokens are represented by high-dimensional vectors that capture their precise semantic relationships. An important ingredient of our approach is the simultaneous learning of the code-embedding while training DeepSN, which helps the deep neural net learn the source code's semantics. Figure 2 illustrates an example of transforming source code tokens into code-embedding. For ease of understanding, we use the source code tokens instead of their integer encodings in this illustration (Figure 2).
If there are n tokens, then the code-embedding is an n × K matrix, where K is the size of the vector. Next, this high-dimensional code-embedding is fed into the enhanced CNN to automatically learn the deep features of the source code. The model adopts multiple filters of varying sizes to learn multiple representations of the features simultaneously. Usually, these learned features are pooled by max-over-time pooling [8]. In addition to traditional max-over-time pooling, we apply batch normalisation to the learned features. We use a hierarchical CNN to learn the deep features (deep semantics) of the source code by enforcing batch normalisation on the learned features at each hierarchy. The dimensionality of the features learned from the source code is given by the number of filters (fn) and the filter size (fs). The choice of the number of hierarchies (hn) in the CNN, the number of filters fn and the filter size fs is discussed in Section 6.2. Next, the source code's long and short-term context dependencies are learned by employing the LSTM. Figure 3 illustrates the net structure of DeepSN. Moreover, we want the model to produce a low cross-entropy, that is, a high probability for the accurate suggestions. It can be expressed as

H(T) = −(1/N) Σ_{i=1}^{N} log p(x_i)

where T is the training set, N is the size of the testing set and p(x_i) is the probability of event x_i estimated from the training set.
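A compact NumPy sketch of one forward pass through this pipeline up to the LSTM input: embedding lookup, convolution over token windows and batch normalisation of the feature maps. All sizes and weights here are illustrative, not those of the trained DeepSN.

```python
import numpy as np

rng = np.random.default_rng(2)
V, K, tau = 50, 16, 20          # vocab size, embedding size, context size
fn, fs = 4, 3                   # number of filters and filter size (illustrative)

E = rng.normal(size=(V, K))                   # code-embedding matrix
context = rng.integers(1, V, size=tau)        # integer-encoded context
X = E[context]                                # (tau, K) embedded tokens

# Convolution: each filter responds to a window of fs adjacent tokens.
filters = rng.normal(size=(fn, fs, K))
steps = tau - fs + 1
feat = np.array([[np.sum(X[t:t + fs] * filters[f])
                  for t in range(steps)] for f in range(fn)])

# Batch normalisation over the feature maps (gamma = 1, beta = 0 here).
feat = (feat - feat.mean()) / np.sqrt(feat.var() + 1e-5)

# The normalised feature sequence (steps x fn) is what the LSTM consumes.
seq = feat.T
print(seq.shape)
```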

| EVALUATION
We conduct several experiments to evaluate the effectiveness of DeepSN. This section discusses the evaluation setup in detail. We discuss the research questions in Section 5.1. We describe the data set used to train and evaluate the models in Section 5.2. Section 5.3 discusses model training and testing. Section 5.4 discusses the evaluation metrics in detail. The results are discussed in Section 6.

| Research questions
The goal of the study is to improve the performance of the source code suggestion task. We decompose the evaluation process of DeepSN into three major parts and aim to answer the following research questions (RQs):
• RQ1: How effective is DeepSN as compared to state-of-the-art approaches?
• RQ2: What is the impact of hierarchical feature extraction on model performance?
• RQ3: What is the impact of enhanced CNN on DeepSN's performance?
We compare DeepSN against three state-of-the-art approaches to validate its effectiveness. We choose these baselines [2,7] because they are close to our target task of source code suggestion and are considered state-of-the-art. We train the n-gram [2], RNN [7] and CNN baselines on the same dataset for comparison.

| Dataset
Note that we use the same dataset as previous studies [2,28] for the evaluation of this work. We build the dataset by extracting the intersecting projects from these studies, with their most recent versions available on the corresponding GitHub master branch at the time of collection. The dataset comprises ten real-world Java projects extracted from GitHub. Table 1 presents each project's version, lines of code (LOC), overall token count and distinct code token count. The experiments are repeated on each project separately to evaluate the proposed approach effectively. Each project is randomly divided into ten folds, of which one fold is used for validation, one fold for testing and the rest for model training. Models are trained using the training set, parameter tuning is then done using the validation set to achieve optimal performance and, finally, performance evaluation is done using the test set. It is worth noting that the test set was not utilised during model training or validation and was used exclusively for model evaluation. A brief description of these projects is provided here:
• ANTLR is a parser generator for structured text or binary file manipulation.
• Ant is a Java-based build tool similar to 'make'.
• POI is used to read and write Microsoft Office binary and OOXML file formats.
• Maven is a software project management and comprehension tool.
• JGit is a Java-based Git version control system.
• Batik is a Java-based image manipulation tool for the scalable vector graphics (SVG) format.
• Cassandra is a highly scalable partitioned row store.
• db4o is a Java variant of an object-oriented database.
• JTS is a Java library for generating and manipulating vector geometry.
• iText is a Java-based tool to create, edit and enhance PDF documents.
Table 2 shows the overall architecture of DeepSN. We use a dual-layer CNN (hn = 2) with a constant fn = 200, and fs = 3 and fs = 4 for the first and second layers, respectively. We utilise 300 hidden-layer neurons with a context size of 20, as extensively studied by White et al. [7]. We utilise the Adam optimizer [29] with its default learning rate of 0.001. We use dropout [30] for regularisation and to prevent the model from over-fitting. The model was trained until convergence, using early stopping with a patience of three on the validation loss. All the models are trained using TensorFlow v1.14 on an Intel Xeon Silver 4110 server (2.10 GHz, 32 cores) with 128 GB RAM, equipped with an NVIDIA RTX 2080 GPU.

| Training and prediction
The operating system used is Ubuntu 18.04.2 LTS. It is worth noting that model training is done offline and thus does not impact prediction time. Predicting the next source code token takes approximately 20 milliseconds on an ordinary computer/laptop. To suggest the succeeding source code token for a source code file f, the trained model takes the context τ before the current suggestion position y as input and preprocesses it as discussed in Section 4.1. Next, the preprocessed context vector is fed to the trained DeepSN to predict the most probable succeeding source code tokens Y for the given context τ.
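The prediction flow can be sketched as follows; `model_predict` is a hypothetical stand-in for the trained DeepSN, returning one probability per vocabulary index, and the tiny vocabulary is purely illustrative.

```python
import numpy as np

def suggest(context_tokens, vocab, model_predict, k=5):
    """Encode the context, query the model and return the top-k
    most probable next tokens with their probabilities."""
    inv_vocab = {i: t for t, i in vocab.items()}
    encoded = [vocab.get(t, vocab["UNK"]) for t in context_tokens]
    probs = model_predict(encoded)              # one prob per vocab index
    top = np.argsort(probs)[::-1][:k]           # highest probabilities first
    return [(inv_vocab[i], float(probs[i])) for i in top]

vocab = {"UNK": 0, "<": 1, ">": 2, "i": 3, "j": 4}
fake_model = lambda enc: np.array([0.05, 0.6, 0.1, 0.05, 0.2])
print(suggest(["for", "j"], vocab, fake_model, k=2))  # [('<', 0.6), ('j', 0.2)]
```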

| Metrics
We evaluated the effectiveness of DeepSN by using well-known assessment metrics: accuracy, precision, recall and f-measure. We measure Accuracy@k, the percentage of correct next source code tokens within the top-k ranks. The general formalisation of these metrics is

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-measure = 2 × (Precision × Recall) / (Precision + Recall)

Further, we use the mean reciprocal rank (MRR) ranking metric, in which predictions appearing earlier in the suggestion list receive higher scores. The MRR metric is formalised as

MRR = (1/N) Σ_{i=1}^{N} 1/y_i

where the average is computed over the N source code contexts τ in the test set and y_i represents the index of the first relevant prediction for the i-th context.

| RESULTS

Table 3 presents the Accuracy@k for k = 1, 5, 10 and the MRR results of all models (N-Gram, RNN, CNN and DeepSN). DeepSN outperforms the other models on all ten projects, in both Accuracy@k and MRR. Compared to the best baseline (RNN), DeepSN achieves on average 7.6%, 5.08% and 4.37% higher accuracy for Accuracy@1, Accuracy@5 and Accuracy@10, respectively. Figure 4 presents the MRR results of all four models on all ten projects. From the presented results we can observe that the proposed approach's minimum MRR score is higher than the best baseline's (RNN) maximum MRR score. On average DeepSN (highlighted with a green line) achieves an MRR score of 0.8146, which is higher than N-Gram's 0.4600, RNN's 0.7482 and CNN's 0.7104. DeepSN shows evident improvements, which is significant for source code suggestion because developers prefer to have fewer suggestions with the correct ones ranked on top. The results suggest that the proposed approach is capable of placing the correct prediction at the top index four out of five times. Figure 5 presents the Accuracy@k (k = 1, 2, 3, 5, 10) results of all models (N-Gram, RNN, CNN and DeepSN). Figure 6 illustrates the code-embedding of the POI project learned by DeepSN for qualitative evaluation.
The highlighted regions show semantically similar code tokens placed close together: the region highlighted with a green circle shows arrays, the red circle conditional operators, the blue circle conditions and the orange circle loops. From the illustration, we can observe that DeepSN effectively learned the source code semantics and arranged the related code tokens close to each other. From the results, it is evident that the proposed approach is capable of capturing the source code semantics by using the code-embedding technique and is able to learn long and short-term contexts, resulting in a significant performance boost. Figures 7 and 8 illustrate the precision, recall and f-measure scores of all four approaches (N-Gram, RNN, CNN and DeepSN). The precision and recall scores are presented as bars and the f-measure is illustrated with a line chart. Here again, the proposed approach outperforms all other approaches. The source code suggestion precision of the N-gram model is high with very low recall, meaning it returns few of the relevant tokens, resulting in poor overall performance (low f-measure). Compared to RNN and CNN, DeepSN achieves high precision while also returning the majority of relevant results (high recall) for all ten projects, attaining good model performance (high f-measure).
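For reference, the Accuracy@k and MRR scores used throughout this section can be computed as in this small sketch, given per-context ranked prediction lists and the true next tokens.

```python
def accuracy_at_k(ranked_lists, targets, k):
    """Fraction of cases where the target token appears in the top-k."""
    hits = sum(1 for preds, y in zip(ranked_lists, targets) if y in preds[:k])
    return hits / len(targets)

def mrr(ranked_lists, targets):
    """Mean reciprocal rank of the first relevant prediction (0 if absent)."""
    total = 0.0
    for preds, y in zip(ranked_lists, targets):
        if y in preds:
            total += 1.0 / (preds.index(y) + 1)
    return total / len(targets)

preds = [["<", ">", "=="], ["i", "j", "k"], ["+", "-", "*"]]
gold = ["<", "j", "*"]
print(accuracy_at_k(preds, gold, 1), round(mrr(preds, gold), 3))
```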

| RQ2: Impact of hierarchical feature extraction
This section empirically evaluates the impact of hierarchical feature extraction on model performance by varying the number of filters fn, the number of CNN layers hn and the filter size fs in each layer. To limit the experimental run time, we randomly divide our dataset into two groups, Group A (maven, cassandra, antlr4, db4o, jts) and Group B (ant, poi, jgit, batik, itext), each containing five projects. Group A studies the impact of the number of filters on model performance and Group B studies the impact of the number of CNN layers and the size of the filters in each layer.
In the Group A experiments, we varied the number of filters (fn = 50, 100, 200, 300). Figure 9 presents the results of the Group A experiments. From the illustrated results, we can observe a significant performance boost when the number of filters is increased from 50 to 100. Further, we can observe that increasing the number of filters beyond 200 does not necessarily improve model performance. From the results, we conclude that optimal performance is achieved with fn = (100, 200).
In the Group B experiments, we varied the number of CNN layers with various filter sizes. In a hierarchical CNN, the learned features are convolved over again to capture deeper source code semantics. Based on our preceding findings, we use the number of filters (fn = 100) as a control variable in each hierarchy. We start from a single-layered hierarchy (CNN-1) with a filter size of fs = 3. Next, we experimented with a dual-layered hierarchy (CNN-2), where the first layer is identical to CNN-1 and the second layer uses a filter size of fs = 4. Similarly, we experimented with a triple-layered hierarchy (CNN-3), where the first and second layers are identical to CNN-1 and CNN-2, respectively, and the third layer uses a filter size of fs = 5. Figure 10 illustrates the performance comparison between CNN-1, CNN-2 and CNN-3. From the results, we can observe that CNN-2 performs better than CNN-3 and shows a significant performance boost compared to CNN-1. Further, we found that using more than two layers does not necessarily improve model performance and is prone to under-fitting instead. From the results, we conclude that optimal performance is achieved with filter size fs = 3 and fs = (3, 4) for the single-layered and dual-layered CNN, respectively.

| RQ3: Impact of enhanced CNN on DeepSN
This section evaluates the impact of the enhanced CNN versus the traditional CNN on DeepSN's performance. We train two different versions of each. The first version is trained using a single-layered CNN with fn = 100 filters and filter size fs = 3. The second version is trained using a dual-layered CNN with fn = 200 filters and filter sizes fs = (3, 4). Table 4 presents the comparative Accuracy@k scores of the enhanced CNN and the traditional CNN; the bold values represent the best performance. We find that the performance of the enhanced CNN is significantly better than that of the traditional CNN in both cases. We can also observe that in some cases (ant, antlr4, jts) the model performance drops catastrophically when using the traditional CNN. Further, we observe that with the dual-layered setup the traditional CNN is prone to under-fitting, whereas the enhanced CNN was able to learn deeper source code semantics. To validate the statistical significance of the difference between the enhanced CNN and the traditional CNN, we conduct a one-way ANOVA test on the dual-layered setup. Table 5 presents the results of the ANOVA test, where the target was Accuracy@k for k = 1, 2, 3, 5, 10. From the results, we reject the null hypothesis (p-value < α = 0.05 and F > F-crit), reflecting that a statistically significant difference exists between the enhanced CNN and the traditional CNN. From Tables 4 and 5, we observe that optimal performance is achieved with the dual-layered setup in all cases. On average, the performance of DeepSN is similar regardless of project size. For example, the Accuracy@1 scores of the antlr and POI projects differ by a marginal 0.6% despite the huge difference in their sizes.
Similarly, when comparing the traditional and enhanced CNN, we observed that the performance of DeepSN with the traditional CNN is relatively poor when trained on a smaller amount of information. The traditional CNN achieves very low accuracy scores of 10.10%, 8.47% and 8.31% for the ant, antlr4 and jts projects in the single-layered setup, respectively, while DeepSN with the enhanced CNN achieves optimal performance with a significant improvement in modelling accuracy (ant = 74.14%, antlr4 = 78.31%, jts = 74.14%).

| Suggesting code with DeepSN
We use the DeepVS [32] plugin interface in the Visual Studio Code IDE, which interacts with the trained DeepSN model to provide predictions for the source code suggestion task. Consider the example in Figure 11, in which a software developer is writing a bubble sort algorithm. DeepSN takes the context before the cursor position from the plugin interface and preprocesses it following the steps discussed in Section 4.1. Next, the preprocessed context vector is used as input to the trained DeepSN to predict the next source code token. Here, we can observe that DeepSN provides the top-5 suggestions [<, >=, <=, >, ==] based on their contextual and semantic relevance, ranking < at its first index, which is the most relevant operator in this scenario. Moreover, correlating this example with Figure 6, one can observe that DeepSN effectively places semantically similar code tokens (conditional operators highlighted in the red circle) close together, resulting in better modelling performance. Extending the bubble sort example (Figure 12), we can observe that the model provides the suggestions [j, i, k, IntLiteral, pos], which are closely related, ranking j at its first index, which is the most probable next source code token given the context.

| DISCUSSION
The experimental results show that the proposed approach performs better than the baseline approaches. Several factors helped DeepSN achieve optimal performance.
• First, it simultaneously learns different representations of the source code features by using multiple filters.
• Second, hierarchical feature mapping and capturing the long and short-term context dependencies of the source code helped improve model generality and applicability.
• Third, adopting batch normalisation at each phase of feature mapping helped the model further improve its performance.
• Finally, the training and optimisation of DeepSN along with simultaneous learning of the code-embedding helped the model learn deeper source code semantics.

| Limitations
Although the proposed approach improves the modelling performance, several limitations still need to be addressed. The evaluation of the proposed approach is done on each project independently; it may produce different results when making cross-project/domain predictions. Moreover, neural language models usually suffer from the out-of-vocabulary issue that reduces modelling accuracy, which implies that there are cases where the correct prediction is not even within the top-ten suggestions. These limitations could be mitigated by training the model with more information. Another solution to this problem is to use a local cache along with the neural learner. We take these limitations into account and intend to provide a more rigorous methodology in the future.

| Construct validity
Note that a threat to construct validity is an alteration in model settings or the use of a different split of training, validation and testing sets, which might produce dissimilar results. An additional threat to construct validity is the appropriateness of the metrics used for evaluation. The Accuracy@k [2,7] and MRR [6,28] metrics are commonly used to evaluate neural language-based source code models. Moreover, the proposed approach is evaluated with precision, recall, f-measure and the ANOVA statistical test to alleviate this threat and to validate its effectiveness.

| Internal validity
Here, a threat to internal validity is the implementation of the baselines. The purpose of re-implementation was to build a generalised baseline, as the dataset is an intersection of previous studies with the recent versions available on GitHub. We therefore implemented our versions by following the same procedures as in the original manuscripts. To reduce this risk we double-checked the implementations and the results; nonetheless, there may be some unobserved flaws. The other threat to internal validity is the learning of the code-embedding. There are several other techniques (e.g. CBOW [33] and skip-gram [34]) to learn code-embedding. It would be interesting to test other code-embedding techniques that may enhance DeepSN, or to employ the presented approach for cross-domain recommendation [35-37].

| External validity
The generality of the results is a threat to external validity. The dataset used in this work is gathered from GitHub, which is a renowned platform for source code repositories. The software projects used in this study do not necessarily represent Java code completely, nor other programming languages. Moreover, the selection of hyper-parameters is another threat to external validity. There is no common methodology for acquiring optimal parameters; their selection is thus mainly experimental.

| CONCLUSION
Existing approaches for source code suggestion ignore or underutilise the semantic information present in the source code. To improve the performance of source code suggestion, we proposed DeepSN, which makes use of the semantic information of the source code. First, DeepSN uses an enhanced hierarchical CNN combined with code-embedding to automatically extract the top-notch features of the source code and to learn useful semantic information. Next, the source code's long and short-term context dependencies are captured by using LSTM. We extensively evaluated the proposed approach against three baselines on ten real-world projects and the results suggest that the proposed approach surpasses state-of-the-art approaches. Further, we make the pre-trained code-embeddings of all ten projects publicly available so that they can be used for different source code modelling tasks without requiring retraining. The approach presented here is general and can be extended to other source code modelling tasks. In future work, we will consider evaluating our approach on other source code modelling tasks such as syntax error correction and bug prioritisation.