Mining Stack Overflow for API class recommendation using DOC2VEC and LDA

Moon Ting Su, Department of Software Engineering, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur 50603, Malaysia. Email: smting@um.edu.my

Abstract: To address the lexical gaps between natural language (NL) queries and Application Programming Interface (API) documentations, and between NL queries and programme code, this study developed a novel approach for recommending Java API classes that are relevant to the programming tasks described in NL queries. A Doc2Vec model was trained using question titles mined from Stack Overflow. The model was used to find question titles that are semantically similar to a query. Latent Dirichlet Allocation (LDA) topic modelling was applied to the Java API classes (extracted from code snippets found in the accepted answers of these similar questions) to extract a single topic comprising the Top-10 Java API classes that are relevant to the query. The benchmarking of the proposed approach against the state-of-the-art approaches, RACK and NLP2API, using four performance metrics shows that it is possible to produce comparable API recommendation results using a less complex approach that makes use of some basic machine learning models, in particular Doc2Vec and LDA. The approach was implemented in a Java API class recommender with an Eclipse IDE plug-in serving as the front-end.


| INTRODUCTION
Software development technologies depend heavily on reusable components provided by Application Programming Interfaces (APIs) [1], which include frameworks, libraries, toolkits and software development kits [2]. APIs enable programmers to reuse code written by others [1][2][3]. However, programmers find it difficult to use APIs and they have to spend a substantial amount of time in learning APIs [2].
A number of factors affect the usability of an API [3]: the complexity of the API, naming convention, support of caller's perspective, documentation, consistency, and so on. The complexity of an API is related to its size; the larger the size, the higher the complexity and the lower the usability. In terms of API naming convention, descriptive names are preferred over abbreviated names. To support caller's perspective, an API should explicitly show how to invoke functions or features. API documentations should be clear, complete and up to date. The design of an API should be consistent and adhere to common conventions.
Among the aforementioned factors, the documentation of an API has been found to be the main obstacle developers face in learning a new API [4]. Generally, API documentations are incomplete and not in the desired format; they provide insufficient examples, insufficient information on the high-level aspects of an API (such as its design or rationale), and limited information on how to use an API to achieve specific tasks [1].
To search for relevant functions to be used in their programming tasks, programmers search API documentations by composing search queries. Studies have found three gaps or mismatches between the terms programmers use to describe their programming tasks in their search queries and the terms used in API documentations, which cause searches to be unsuccessful. The inability to find the relevant functions aggravates the difficulty of learning and using APIs. The three gaps are:
• Lexical gap between NL queries and API documentations [6]. The natural language terms used in search queries differ from the terms used in API documentations.
• Lexical gap between NL queries and programme code [7]. Programmers use natural language terms in their search queries to search for code (that contains code terms) in API documentations. A term is a natural language term if it contains only the alphabets from a natural language (such as English) and can be found in a dictionary of that language; a code term also contains numerical symbols or other symbols and cannot be found in a dictionary of a natural language [7].
• Task-API knowledge gap [5]. API documentations typically focus on describing the structure and functionalities of the APIs and do not include information on the concepts involved and the purposes. On the contrary, programmers typically use terms related to concepts and purposes in their search queries.
Looking at the limitations of API documentations, it is unsurprising that programmers look for alternative sources to learn APIs. One of the sources is Community Question and Answer (CQA) websites, such as Stack Overflow (SO) [8]. A CQA website creates a socially mediated form of software documentation, namely, crowd documentation, which is 'a collection of web resources, where a large group of contributors, the crowd, curate and contribute to the collection' [9]. Crowd documentation of APIs has the following advantages over official API documentations [9]: a lot of code examples and explanation on API elements, numerous questions that lead to the same API elements, different opinions on the answers, community voting on answers and questions, and tags for searching. An API element refers to 'a named entity belonging to an API, such as a class, interface, or method' [9]. With a large volume of data, SO is a good source for data mining and analytics related to APIs [9][10][11][12].
This study aimed to address the lexical gap between NL queries and Java API documentations, and the lexical gap between NL queries and Java programme code. To achieve these aims, this study proposed a novel approach for recommending Java API classes by making use of data mined from SO, the Doc2Vec word embedding model, Latent Dirichlet allocation (LDA) topic modelling and heuristic rules. Java was chosen as the focus of the study for the following reasons: it is a long-established and popular programming language [13], it was ranked the fifth most popular technology in the 2019 SO developers survey [14], the number of Java questions asked in SO yearly has been among the top three since 2008 [15], and there is a wide coverage of Java APIs in SO posts [5].
The following shows an example of the two types of lexical gaps addressed by this study: The user NL query is 'How to initialise all values in an array to false'. One possible answer to this query is the following method provided by the Arrays class: static void fill(boolean[ ] a, boolean val). The description provided by the official Java API documentation for this method is 'Assigns the specified boolean value to each element of the specified array of booleans'. There is a lexical gap between the NL query and the Java API documentation since different terms are used in the NL query and in the Java API documentation, in particular, 'initialise' versus 'assigns', 'false' versus 'boolean'. To invoke the fill method, a Java statement, Arrays.fill(a, false);, is needed, where a is a boolean array. The term used in the user NL query is 'initialise' but the corresponding method name is 'fill'. This shows a lexical gap between the NL query and the Java programme code needed.
The main contribution of this article is the novel approach that employs Doc2Vec, LDA and heuristic rules to recommend relevant Java API classes for programming tasks described in NL queries. This approach is less complicated than the state-of-the-art approaches but is able to produce comparable API recommendation results.
This article is structured as follows: Section 2 presents the background related to this study. Section 3 details the proposed approach. Section 4 presents the results of benchmarking the proposed approach against state-of-the-art approaches. Section 5 presents the threats to the validity of results. Section 6 compares this study with the related work. Section 7 concludes the study and outlines future work.

| BACKGROUND
SO: SO is the earliest website created in the Stack Exchange [16] network in 2008 and has become the most popular website for computer programming. Since its inception, SO has been providing a knowledge-sharing platform between inexperienced and experienced programmers, where numerous programming questions have been asked and answered. As of 6 July 2019, SO had 18 million questions asked with 71% answered, 27 million answers, 11 million registered users, 9.2 million visits per day, and a traffic of 6.2k questions being asked daily [17].
The massive volume of crowd-generated data in SO makes it a suitable repository for data mining and analytics on crowd documentation of APIs [9][10][11][12]. One supporting reason is SO posts contain a huge amount of code snippets that are of good quality [11].
Questions in SO covered 77% of the total Java API classes [9]. Approximately 65% of the classes from each of the 11 core Java API packages of Java SE 6 [18] appeared in SO Java posts [10]. The core Java API packages are: java.lang, java.util, java.io, java.math, java.nio, java.applet, java.net, java.security, java.awt, java.sql and javax.swing.
Word Embedding: Harris' distributional hypothesis [19] postulates that 'words that appear in the same contexts tend to have similar semantic meanings' [6]. This enables the use of word co-occurrences in creating vector representations of words (aka word embeddings), where words with similar meanings have similar vectors, and similarity between these vectors is usually defined as their cosine similarity [6]. Word vectors can be used in many NL processing tasks, such as predicting the next word in a sentence: each word is mapped to a unique vector, and the aggregation or average of the vectors serves as the features used to predict the next word [20].
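The cosine similarity mentioned above can be sketched in a few lines of Python. The three-dimensional 'embeddings' below are illustrative values chosen for the example, not learned vectors (real word embeddings typically have hundreds of dimensions):

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Illustrative vectors: 'initialise' and 'fill' are assumed to appear in
# similar contexts, 'banana' in unrelated ones.
vec_initialise = [0.9, 0.1, 0.3]
vec_fill = [0.8, 0.2, 0.4]
vec_banana = [0.1, 0.9, 0.0]

# Semantically related words should score higher than unrelated ones.
assert cosine_similarity(vec_initialise, vec_fill) > cosine_similarity(vec_initialise, vec_banana)
```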
word2vec employs neural network to learn continuous distributed vector representations of words from very big datasets and tries to minimise computational complexity of the process [21]. It provides two model architectures for learning the vector representations of words, namely, continuous bag-of-words (CBOW) and continuous skip-gram. CBOW learns the vector representations of words by predicting the current word for the given context using continuous distributed representation of the context; continuous skip-gram does so by predicting neighbouring words given the current word [21].
Paragraph Vector is an unsupervised algorithm that 'learns continuous distributed vector representations for pieces of texts' with different lengths such as sentences, paragraphs, and documents, instead of individual words [20]. The paragraph vector for a paragraph is concatenated with several word vectors for the same paragraph to predict the surrounding words in contexts sampled from the paragraph. Different paragraphs have different unique paragraph vectors. However, word vectors are shared among paragraphs. Paragraph Vector is called Doc2Vec [22,23] (aka paragraph2vec or sentence embedding) by libraries that implemented it, such as Gensim [22], an open source Python library.
Topic Modelling: A topic model (or latent topic model or statistical topic model) is 'a method designed to automatically extract topics from a corpus of text documents' [24]. A topic or concept (or theme [25]) refers to a collection of words that co-occur frequently in the documents of a corpus, where these words are often semantically related, for example 'mouse, click, drag, right, left' [24]. The uses of the topic models include automatic indexing, searching, clustering, and semantically structuring large corpus of unstructured and unlabelled documents, by representing the documents using the topics within them.
Latent Dirichlet allocation (LDA) [24][25][26] is a generative probabilistic topic model that uses the statistical properties of an unstructured document corpus' word frequencies to discover the corpus' topic structure. The intuitions behind LDA are [25]: a set of topics exists for a collection, and 'all the documents in the collection share the same set of topics but each document exhibits those topics in different proportion'. A topic is 'a distribution over a fixed vocabulary'; for example, the genetics topic encompasses words regarding genetics with high probability [25]. The LDA model assumes that the topics are created before the documents. Each document in the collection is then imagined to be 'generated' in two stages [25]: in the first stage, a distribution over topics is selected randomly; in the second stage, for each word in the document, a topic is chosen from the distribution selected in the first stage, and then a word is chosen randomly from the corresponding topic. Topic modelling uses a set of documents to discover the latent topic structure that likely produced the documents [25]. This can be seen as reversing LDA's generative process, which assumes that topics are created first and are used to generate the documents [25].

| PROPOSED APPROACH
Since the proposed approach would be incorporated into a recommender tool, it was designed by taking into consideration the four major design concerns involved in designing a Recommendation System in Software Engineering (RSSE). An RSSE refers to 'a software application that provides information items estimated to be valuable for a software engineering task in a given context' [27]. Examples of the task are reusing code and writing bug reports [27]. The four major design concerns in designing an RSSE are [27]:
1) Data pre-processing, which involves converting raw data retrieved from data sources into a usable format, for example, parsing source code, abstracting software into a dependency graph, or replacing missing values.
2) Capturing context, which involves capturing the context of the task for which a recommendation is sought. The task context comprises all the details of the task accessible by the recommendation system in producing a recommendation, for example, partial code written for the task, a report that a user is accessing, or an explicit context of the task that is fused with a user query.
3) Producing recommendation, which involves executing recommendation algorithms to select and recommend relevant items.
4) Presenting recommendation, which involves listing items of potential interest with an explanation of why each item is recommended.
Figure 1 shows the overall design of the proposed approach, which comprises the preparation and recommendation phases. The preparation phase was performed only once, to prepare the artefacts (such as the Doc2Vec model and the Answers dataset) needed by the recommendation phase. The recommendation phase is executed each time a query is submitted to the approach. The steps involved in the proposed approach are grouped based on the four major design concerns of RSSE.

| Preparation phase
The preparation phase includes those steps that fall under data pre-processing.
Step 1 - Acquire training data: The training data was extracted from SO using the Stack Exchange Data Explorer (SEDE) [28] provided by Stack Exchange. SEDE is an online web-based query tool that provides easy access to the latest monthly data dumps of the Stack Exchange network, which includes SO. The training data comprises 632,062 SO posts/questions that are tagged with the 'java' tag and have an accepted answer. Specifically, the training data contains the identifier, title, tags and accepted answer of each question, and it is stored in a comma-separated-value (csv) file.
Step 2 - Create datasets: Token splitting or tokenisation, removal of noise (which includes stop words and punctuation marks), and lemmatisation were performed on the titles of the SO questions in the training data, to identify the keywords in each question title. The titles of SO questions are 'a major source of query keywords for code search' [29] and were used in this study to train a Doc2Vec model (Step 4) for use in finding questions (in this case, question titles) that are similar to a query during the recommendation phase. The tokenisation and removal of noise were done using the Natural Language Toolkit (NLTK) [30,31], an open source NLP library. Lemmatisation of the remaining tokens was done using NLTK's WordNetLemmatizer, which makes use of WordNet, a lexical database that describes semantic relationships between words in the English language [32]. Lemmatisation returns a word in its dictionary form, known as a lemma. In comparison, stemming removes derivational affixes from a word to return the stem, which might not be a valid dictionary word [33]. Two datasets were created in this step: a Questions dataset containing the SO question identifiers, question titles and the keywords extracted from the question titles; and an Answers dataset containing the SO question identifiers and the corresponding Java API classes that would be extracted from the code snippets in these questions' accepted answers in Step 3. These two datasets are stored in different csv files.
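The pre-processing in this step can be sketched as follows. This is a simplified, self-contained stand-in: the study used NLTK's tokeniser, stop-word list and WordNetLemmatizer, whereas the toy stop-word set and lemma table below are illustrative only:

```python
import re

# Toy stop-word list and lemma table; the study used NLTK's stop words
# and WordNetLemmatizer, which are far more complete.
STOP_WORDS = {"how", "to", "a", "an", "the", "in", "all", "and", "is"}
LEMMAS = {"values": "value", "arrays": "array", "files": "file"}

def extract_keywords(title):
    # Tokenise on word characters (punctuation is dropped), lower-case,
    # remove stop words, then look up the dictionary-form lemma.
    tokens = re.findall(r"[a-zA-Z0-9']+", title.lower())
    keywords = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in keywords]

print(extract_keywords("How to initialise all values in an array to false"))
# → ['initialise', 'value', 'array', 'false']
```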
Step 3 - Extract Java API classes using heuristic rules and store them in the Answers dataset: This step extracted Java API classes from the code snippets found in the accepted answer of each question in the training data and stored them with the respective question identifier in the Answers dataset. The rationale is that code snippets frequently make use of or call API elements, and a study found that accepted answers use at least two Java API classes [29]. The code snippets were obtained by extracting the content enclosed by the HTML '<code>' tag using BeautifulSoup [34], a Python library for parsing HTML and XML documents. Java API classes were extracted only from code snippets that have at least two lines of code, to take into consideration some context of the usage of these API classes in the code.
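The snippet extraction can be sketched as below. The study used BeautifulSoup; this sketch uses the standard-library `html.parser` instead so it is self-contained, and the answer HTML is a made-up example:

```python
from html.parser import HTMLParser

class CodeSnippetExtractor(HTMLParser):
    """Collects the text enclosed by <code> tags in an answer body."""

    def __init__(self):
        super().__init__()
        self.in_code = False
        self.snippets = []

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self.in_code = True
            self.snippets.append("")

    def handle_endtag(self, tag):
        if tag == "code":
            self.in_code = False

    def handle_data(self, data):
        if self.in_code:
            self.snippets[-1] += data

# Made-up accepted-answer body for illustration.
answer_html = ("<p>Use fill:</p>"
               "<pre><code>import java.util.Arrays;\nArrays.fill(a, false);</code></pre>")
parser = CodeSnippetExtractor()
parser.feed(answer_html)

# Per the study, only snippets with at least two lines of code are kept.
multi_line = [s for s in parser.snippets if len(s.splitlines()) >= 2]
```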
Five heuristic rules implemented in Python scripts were used to extract the Java API classes. The first three rules were used for API classes extraction. The remaining two were used to validate and refine the extracted results.

Heuristic 1 : Extract class names from 'import' statements.
One way to use/access a Java API class/interface in a Java programme is by using an 'import' statement to specify the package where the class/interface resides. Class names extracted from the 'import' statements are classes that might be relevant to the respective question.

Heuristic 2 : Extract class names used as reference types (e.g. the declared type of a variable or parameter).

Heuristic 3 : Extract class names from constructor calls, that is, the class name following a 'new' keyword.

Heuristic 4 : Validate class names produced by the first three heuristic rules.
The class names produced by the first three heuristic rules were compared against class names extracted from the official Java API documentation for Java SE 8 and Java EE 8 that are stored in a file. Any wrong letter cases in the class names were converted to the correct letter cases. User-defined or third-party class names that do not start with a capital letter were discarded, since capitalising the first letter of a class name is a Java coding convention. Redundant class names were then removed to obtain a set of unique class names.
No checking was done on camel cases in class names since the class names were compared with those found in the official Java API documentation and were converted to the correct letter cases where applicable. Unlike RACK studies [10,29] that included only class names that have camel case notation, the proposed approach also included class names that are in the form of a single word. The reason is some class names from Java SE 8 and Java EE 8 are in the latter form (e.g. Boolean class and Number class).
Heuristic 5 : Remove two high occurrence Java API classes, which are, String and ArrayList. These two classes (especially the String class) were removed because they are common classes used in most programs and therefore less likely to be addressing the specific programming task described in the respective SO question.
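A minimal sketch of how Heuristics 1, 3, 4 and 5 could be implemented with regular expressions is shown below. This is not the study's actual scripts: Heuristic 2 (reference types) is omitted for brevity, and the official class list is a toy subset of the Java SE 8/EE 8 lists the study used:

```python
import re

# Toy subset of class names from the official Java API documentation;
# the study used the full Java SE 8 and Java EE 8 class lists.
OFFICIAL_CLASSES = {"ArrayList", "Arrays", "Boolean", "HashMap", "Scanner", "String"}

def extract_classes(snippet):
    names = set()
    # Heuristic 1: class name is the last segment of an 'import' statement.
    for m in re.finditer(r"import\s+[\w.]+\.(\w+)\s*;", snippet):
        names.add(m.group(1))
    # Heuristic 3: class name follows the 'new' keyword (constructor call).
    for m in re.finditer(r"\bnew\s+(\w+)", snippet):
        names.add(m.group(1))
    # Heuristic 4: keep only names starting with a capital letter and
    # validate them against the official class list.
    return {n for n in names if n[0].isupper() and n in OFFICIAL_CLASSES}

snippet = ("import java.util.Arrays;\n"
           "List<String> xs = new ArrayList<>();\n"
           "Arrays.fill(a, false);")

# Heuristic 5: drop the ubiquitous String and ArrayList classes, which
# rarely identify the specific programming task.
classes = extract_classes(snippet) - {"String", "ArrayList"}
```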
Step 4 - Train a Doc2Vec word embedding model.
Questions that do not have Java classes found in the code snippets in their accepted answer were removed from the Questions and Answers datasets. This resulted in only 160,680 questions in the two datasets, which is 25.4% of their original size.
The Questions dataset was used to create a tagged document containing two pieces of information: the indexes of the questions and the keywords of the question titles. The indexes of the questions were created in ascending order based on the order of the questions in the Questions dataset. The order of the questions in the two datasets is the same.
The tagged document was used as the input data (training corpus) to train a Doc2Vec model using the Paragraph Vector algorithm [20] implemented by Gensim's Doc2Vec module [22,23]. The trained Doc2Vec model was saved into a binary file to be used by the recommendation phase in retrieving question titles that are similar to a query.

| Recommendation phase
The recommendation phase involves capturing the context of a query, producing recommendation and presenting recommendation.
Capturing Context: Step 1 -Pre-process user query to obtain query keywords: In this step, a user query is pre-processed to obtain the keywords of the query. This involves tokenisation, removal of stop words and punctuation marks, and lemmatisation. The query keywords provide the context of the task described in the query.
Producing Recommendation: This involves Steps 2 and 3 below.
Step 2 -Retrieve similar question titles and corresponding Java API classes: The query keywords obtained from the user query are used to infer a vector from the Doc2Vec model. The inferred vector is then supplied as an input to the Doc2Vec model to obtain the indexes of the top 100 most similar documents (i.e. most similar question titles in this case) from the Doc2Vec training corpus. The indexes of the top 100 most similar documents with a similarity score of less than 0.7 are discarded. The remaining indexes serve as 'indexes' to locate the corresponding Java API classes for the most similar question titles, from the Answers dataset. This study refers to these Java API classes as candidate classes. The minimal threshold value of 0.7 was decided after studying the impact of different threshold values on the scores of performance metrics by running the experiments using different threshold values. It was found that a threshold value that is below 0.7 would decrease the scores of performance metrics.
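The filtering and lookup in this step can be sketched in plain Python. The similarity scores, question indexes and class names below are hypothetical stand-ins for the Doc2Vec output and the Answers dataset:

```python
# Hypothetical output of the Doc2Vec similarity search:
# (question index, similarity score) pairs for the top matches.
similar = [(7, 0.83), (2, 0.76), (5, 0.64), (9, 0.41)]

# Hypothetical Answers dataset: question index -> API classes found in
# the code snippets of that question's accepted answer.
answers = {
    7: ["Signature", "KeyPairGenerator"],
    2: ["Signature", "CMSSignedData"],
    5: ["Scanner"],
    9: ["File"],
}

THRESHOLD = 0.7  # indexes with a score below 0.7 are discarded

candidate_classes = []
for index, score in similar:
    if score >= THRESHOLD:
        candidate_classes.extend(answers[index])
# candidate_classes now holds the candidate Java API classes for Step 3.
```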
Step 3 -Determine relevant Java API classes and return a ranked list of Java API classes: To determine the relevance of the candidate Java API classes (produced by the previous step) to the respective query, an LDA topic model is created using a corpus comprising of these candidate Java API classes. The LDA topic model is then used to produce an output comprising of a single topic that contains 10 words (Java API classes in this case) that have the highest probability scores. These classes are taken as the top 10 Java API classes that are most relevant to the query.
The rationale for a single topic is, the corpus used to train the LDA model contains Java API classes that correspond to question titles found to be similar to the specific query, and a query generally contains the description of a single programming problem or topic. In addition, a query typically targets one specific programming task [29].
No comparison is made between the extracted unique topic and the respective query; specifically, no comparison is made between the words in the topic and the keywords of the query. The reason is that the relation between the extracted topic and the query is established in a previous step (Step 2), where keywords of the query are compared against the Doc2Vec model (trained using keywords of SO question titles) to find the top 100 most similar question titles, which are used to obtain the Java API classes found in the code snippets of the accepted answers of these question titles. The obtained Java API classes are then used as the corpus to train an LDA topic model, which is used to produce a single topic comprising the 10 Java classes that have the highest probability scores. These 10 Java API classes are the classes most relevant to the query and are recommended to the user.
This study uses Gensim's LDA module [35] in the Python language to create and use the LDA topic model. The LDA model was created as follows:

lda_model = gensim.models.LdaModel(corpus, num_topics=1, id2word=id2word, passes=50, iterations=2000)    (1)

where lda_model is a variable that stores the LDA model; corpus is a variable that stores a list of candidate Java API classes; num_topics refers to the number of latent topics to be extracted from the corpus; id2word refers to a dictionary that is created from the list of candidate Java API classes (a dictionary is a mapping between words, Java API classes in this case, and their integer IDs); passes refers to the number of passes through the corpus during training; iterations refers to the maximum number of iterations through the corpus when inferring the topic distribution of a corpus. As shown in Statement (1), the number of passes and the maximum number of iterations through the corpus during training were set at 50 and 2000, respectively. These values were selected because they produced the best results for the performance metrics. Other values that were tried are 30 and 40 for passes, and 500, 1000 and 1500 for iterations.
The following was used to get a representation for the selected topic:

lda_model.show_topics(num_topics=1, num_words=10, log=False, formatted=False)    (2)

where num_topics refers to the number of topics to be returned; num_words refers to the number of words to be presented for each topic (these words are the most relevant words since they achieved the highest probability for the topic); log specifies whether to log the output besides returning it; formatted, if set to True, causes the topic representations to be formatted as strings; if set to False, the topic representations are returned as (word, probability) tuples. Statement (2) returns a single topic with 10 words, with the topic represented as word-probability pairs, and the topic is not logged.
Presenting Recommendation: The top 10 API classes found relevant for a query are displayed to the user together with the titles of SO questions that are found to be similar to the query. The user can select the title of a SO question to display the question's details retrieved directly from SO.

An example:
The following explains the recommendation phase using an example. The query is 'How to create a digital signature and sign data?'. In Step 1, the query is pre-processed, resulting in five tokens/keywords: 'create', 'digital', 'signature', 'sign' and 'data'. In Step 2a, the five keywords are provided to the Doc2Vec model to obtain the indexes of the top 100 question titles that are most similar to the query. For example, the question title found to be the most similar to the query is 'Sign data using PKCS #7 in JAVA', with a similarity score of 0.758. In Step 2b, its index is used to retrieve the corresponding Java API classes from the Answers dataset.
Step 2b is done for all the top 100 most similar question titles to produce a set of Java API classes for the query. In Step 3, the set of Java API classes found is used to construct an LDA model that returns a single topic comprising 10 Java API classes arranged in decreasing order of probability scores: the 'Signature' class with a probability score of 0.022, the 'InputStream' class with 0.017, the 'CMSSignedData' class with 0.015, and so on. The higher the probability score of a Java API class, the more likely it is relevant to the topic. In this case, the 'Signature' class, with the highest probability score, is the Java API class most relevant to the topic of the query.

| Discussion of techniques used
This section discusses the techniques used in the proposed approach.

Why Doc2Vec? Doc2Vec is able to learn continuous distributed vector representations for pieces of text such as sentences, paragraphs or documents, instead of individual words. In comparison to word2vec, which learns semantic similarities between words, Doc2Vec learns semantic similarities between sentences, paragraphs or documents. We opined that a query, as well as an SO question title, should be analysed as a holistic piece of text instead of at the level of individual words. Following that, we chose to train a Doc2Vec model that was subsequently used to determine the semantic similarities between the query and the SO question titles at a unit of text larger than an individual word. In addition, Doc2Vec has not been used in any existing recommendation studies and this study would like to explore that. The set of keywords of an SO question title in the Questions dataset is treated as a 'piece of text' and the Doc2Vec algorithm learns the vector representation for this 'piece of text'. The trained Doc2Vec model is used by the recommendation phase in retrieving question titles that are similar to a query.

Why topic modelling and LDA? The proposed approach performs topic modelling on the set (or corpus) of candidate Java API classes to find the topic structure of the corpus. The topic model produced is used to determine the topic of the candidate Java API classes. A topic is a set of words that co-occur frequently in the documents of a collection, and these words are often semantically related. In our study, a topic is made up of a set of Java API classes that always appear together and therefore are semantically related and should be recommended together as the results to the respective query.
A topic contains Java API classes that are relevant to a query because these classes are extracted from the code snippets of accepted answers of question titles that are found to be similar to the query. There is a variety of topic modelling techniques, and we chose LDA for our study because it is the most commonly used topic modelling technique. In a survey of 167 articles in the SE domain that used topic models, it was found that nearly two-thirds of the studies employed LDA or its variants as the topic model in mining unstructured software repositories [24]. The survey also found that most of the studies used the basic topic models, and that the second most popular software engineering task where topic models have been used is concept location, where topics discovered by a topic model are deemed equivalent to the conceptual concerns in the respective artefact [24]. This is in line with the use of the topic model in our study, where the single topic represents the concept involved in the programming task described by a query. The Java API classes that constitute the topic are classes that co-occur frequently in the code snippets in the accepted answers of those questions that are similar to the query, and are therefore relevant to the query.
A recent work used LDA to extract topics from lexicon extracted from the source code of 50 Java projects found in the GitHub repository [36]. The work showed that the extracted topics and the extracted lexicon could be used to assist experts in assigning a software system to an application domain. They served as viable code-based substitutes to a system's ReadMe file in such task. Our study also extracted lexicon (in particular Java API classes) from source code (specifically code snippets in the accepted answers of the SO questions in the training data). In our study, LDA modelling is applied on the Java API classes (of the questions found to be similar to a query) to extract a single topic for the query.
Why Heuristic? Our approach employed heuristic to extract Java API classes from code snippets of accepted answers of SO questions during the data pre-processing phase.
Since the Java API classes were extracted from code snippets, these heuristic rules were derived from our knowledge of the Java programming language, in particular those related to object-oriented programming. The first three heuristic rules ('import' statement, reference types, class constructor after a 'new' keyword) are based on the syntaxes of where a Java class name might appear in a Java programme. Heuristic 5 (remove the String and ArrayList classes) is based on the fact that these two classes are common classes used in most programmes and most likely would not be the specific classes required by the programming task described in the respective SO question. Heuristic 4 is similar to the second heuristic used in BIKER's study [5]: we compared the extracted Java API classes against Java API classes from the official Java API documentation, whereas BIKER compared Java API methods.

| Recommender tool
The proposed approach was implemented in a Java API class recommender that runs on a back-end server. A plug-in (APIRecJ) for Eclipse IDE was developed to serve as the front-end client to the recommender.

| BENCHMARKING
This section presents the evaluation of the performance of the proposed approach in recommending Java API classes for NL queries. The evaluation follows the same procedure that the state-of-the-art studies, RACK [10,29] and NLP2API [37], used in evaluating their recommendation approaches. The proposed approach was executed on the evaluation dataset [38] used and published by the NLP2API [37] and RACK [10] studies, and the recommendation results were used to calculate the four performance metrics used by the RACK [10,29] and NLP2API [37] studies. This enabled the proposed approach to be benchmarked against these state-of-the-art approaches. The proposed approach was not compared against BIKER [5], which used only two of the four metrics and recommends Java API methods. The latest RACK study [29] also employed a query effectiveness metric, which determines whether a reformulated query shows any improvement by measuring the rank of the first relevant result in the list retrieved by a query. It is not relevant to this study since the proposed approach is not for query reformulation.

| Evaluation dataset
As mentioned earlier, this study used the evaluation dataset [38] used and published by NLP2API [37] and RACK [10] studies. This evaluation dataset contains 310 code search queries and their corresponding 'ground truth' API classes, namely, API classes 'relevant' to the respective query. Code search queries are queries that are described in NL and used for searching for relevant code snippets [10], and are termed 'natural language queries' in this study.

| Performance metrics
The definitions of the four performance metrics below are from the latest publication on RACK [29]. The word 'item' in the metrics' descriptions below refers to Java API class.
Top-K Accuracy: Top-K Accuracy refers to the percentage of the search queries for each of which at least one relevant item (i.e. one ground truth Java API class) is returned within the Top-K results produced by a recommendation technique/approach. Its formula [29] is given below:

Top-K Accuracy(Q) = (∑_{q∈Q} isCorrect(q, Top-K) / |Q|) × 100%

∑ denotes the sum of the isCorrect function over each q in Q, and Top-K denotes the Top-K Java API classes returned by an approach for query q. The isCorrect function returns a value of one if the approach returns at least one relevant Java API class for query q, and a value of zero if the approach returns none of the relevant Java API classes for query q. A relevant Java API class refers to a Java API class that can be found in the set of the ground truth Java API classes for query q.

Mean Recall@K (MR@K): Recall@K refers to the percentage of ground truth items (i.e. ground truth Java API classes) that are correctly recommended for a query within the Top-K results produced by an approach. MR@K averages the Recall@K measures over all queries q in a dataset Q. Its formula [29] is given below:

MR@K(Q) = (∑_{q∈Q} Recall@K(q) / |Q|) × 100%

Mean Reciprocal Rank@K (MRR@K): Reciprocal Rank@K is the multiplicative inverse of rank(q, K), the rank of the first correct item (i.e. one of the ground truth API classes for q) in a ranked list of size K. rank(q, K) returns ∞ if no correct API class is found within the Top-K positions, and returns one if the correct API class is at the top first position of the ranked list. MRR@K averages the reciprocal ranks over all queries:

MRR@K(Q) = (1 / |Q|) ∑_{q∈Q} 1 / rank(q, K)

The maximum value of MRR@K is one and the minimum is zero. The larger the value of MRR@K, the better the approach.
Mean Average Precision@K (MAP@K): Precision@K refers to the precision at the occurrence of every single relevant item (i.e. a Java API class that exists in the ground truth set for the query) in the ranked list [29]. Average Precision@K (AP@K) averages the Precision@K over all relevant items within the Top-K results for a particular query. Mean Average Precision@K is the mean of AP@K over all queries from dataset Q. Their formulas [29] are given below:

AP@K(q) = (∑_{k=1}^{K} P_k × rel_k) / |RR|

MAP@K(Q) = ∑_{q∈Q} AP@K(q) / |Q|

K refers to the number of top results taken into consideration. rel_k denotes the relevance function of the k-th result in the ranked list, which returns either 1 (relevant) or 0 (irrelevant). P_k denotes the precision at the k-th result. |RR| is the size of the set of relevant results for a query. Q denotes the set of all queries, q denotes a query that is a member of (∈) the set Q, and |Q| denotes the size of the set of queries.
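Under the assumption of per-query ranked result lists and ground-truth sets, the four metrics can be sketched as follows (the query name and classes are hypothetical; values are returned as fractions rather than percentages):

```python
def top_k_accuracy(results, truth, k):
    """Fraction of queries with at least one ground-truth class in the Top-K."""
    hit = sum(1 for q in truth if set(results[q][:k]) & truth[q])
    return hit / len(truth)

def mrr_at_k(results, truth, k):
    """Mean of 1/rank of the first ground-truth class within the Top-K
    (contributes 0 when none appears, matching rank(q, K) = infinity)."""
    total = 0.0
    for q in truth:
        for rank, cls in enumerate(results[q][:k], start=1):
            if cls in truth[q]:
                total += 1.0 / rank
                break
    return total / len(truth)

def mean_recall_at_k(results, truth, k):
    """Mean fraction of each query's ground-truth classes found in the Top-K."""
    return sum(len(set(results[q][:k]) & truth[q]) / len(truth[q])
               for q in truth) / len(truth)

def map_at_k(results, truth, k):
    """Mean over queries of the average precision at each relevant hit."""
    total = 0.0
    for q in truth:
        hits, precisions = 0, []
        for i, cls in enumerate(results[q][:k], start=1):
            if cls in truth[q]:
                hits += 1
                precisions.append(hits / i)
        total += sum(precisions) / len(precisions) if precisions else 0.0
    return total / len(truth)

# Toy example: one query's ranked Top-3 list versus its ground truth.
results = {"q1": ["File", "Scanner", "Pattern"]}
truth = {"q1": {"Scanner", "Pattern"}}
```

For this toy query the first relevant class sits at rank 2, so MRR@3 is 0.5, while both ground-truth classes appear in the Top-3, so Recall@3 is 1.0.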

| Benchmarking results
The benchmarking only considered the Top-10 results returned by the proposed approach. This is because both the state-of-the-art approaches, RACK and NLP2API, achieved the best performance for the four performance metrics when the number of results returned was set at 10.
We used the same evaluation tool [39] published online by the RACK study [10] (which is similar, in terms of the four performance metrics, to the evaluation tool [40] published by the NLP2API study [37]) and the same evaluation dataset [38] to produce the scores of the four performance metrics for RACK, NLP2API and our approach. The performance metric scores for RACK and NLP2API are the same as those published in the NLP2API study [37]. Table 1 shows the benchmarking results. With the highest scores for all four metrics (84.35% for Top-10 Accuracy, 0.57 for MRR@10, 48.93% for MAP@10 and 59.92% for MR@10), the proposed approach demonstrated the best performance. However, the NLP2API approach demonstrated comparable performance, achieving slightly lower scores for all four metrics. The RACK approach demonstrated poorer performance, with the lowest scores for all four metrics.
In comparison to existing approaches such as the two state-of-the-art approaches, which introduced complex mathematical formulas for API class recommendation, our approach employed some widely available machine learning models provided by open-source libraries, together with heuristic rules. The benchmarking results show that it is possible to produce comparable API recommendation results using a less complex approach that makes use of some basic machine learning models, in particular Doc2Vec and LDA, combined with heuristics.
The improvement over NLP2API is small but the proposed approach is less complex than the NLP2API approach as can be seen later in the section describing comparison with the related work.

| Impact of heuristic on performance metric results
The heuristic rules determined which words in the code snippets of accepted answers are Java API class names and therefore affected the content of the Questions and Answers datasets created in the preparation phase of the proposed approach. The Questions dataset was used to train the Doc2Vec model that would be used in the Recommendation phase to retrieve question titles that are similar to a query. After similar question titles are found, Java API classes that are related to these question titles will be retrieved from the Answers dataset. As a result, the heuristic rules used have an impact on the performance metric results and these are discussed below.
Heuristic 1 (Extract class names from 'import' statements) would impact the performance metric results negatively if classes extracted from the 'import' statements are not used in other parts of the programme. A programme might 'import' classes but does not use them anywhere in the programme. In this case, the 'imported' classes are not relevant to the query.
Heuristic 2 (Extract reference types of reference variables) and Heuristic 3 (Extract the name of the constructor method located right after a 'new' keyword) extracted classes that are actually used in a programme and are therefore relevant to the query. However, these classes might not be the most important class(es) the query is looking for. In such cases, Heuristic 2 and Heuristic 3 might affect the performance metric results negatively.
Heuristic 4 (Validate class names produced by the first three heuristic rules) did not remove all custom or user-defined Java classes and third-party API classes and they might appear in the set of candidate Java API classes. These classes might make it into the list of the Top-10 API classes recommended for a query but they might not be 'ground truth' API classes in the evaluation dataset. This would impact the performance metric results negatively.
Heuristic 5 (Remove two high-occurrence Java API classes, namely String and ArrayList) would impact the performance metric results positively. The 'String' and 'ArrayList' classes occur frequently in code snippets but are unlikely to be the essential Java API classes sought by most queries. These two classes would probably occupy the top positions in the ranked list of Java API classes returned. By removing them from the candidate Java API classes used in creating an LDA topic model for the respective query, other relevant Java API classes can be recommended at higher positions in the list, resulting in better MRR@10 and MAP@10. In addition, the result set would comprise more accurate Java API classes, which contributes to the improvement in Top-10 Accuracy and MR@10.

| THREATS TO THE VALIDITY OF RESULTS
This section explains threats that could possibly affect the validity of the results of this study and how some of them were mitigated.
Internal validity: This study focused on Java API classes provided by Java SE 8 and Java EE 8. Nevertheless, during the preparation phase, we did not remove all custom or user-defined Java classes (such as 'Student') and Java classes originating from third-party Java API libraries that are found in the code snippets in the accepted answers for questions in the Answers dataset. We only removed user-defined or third-party Java classes that do not start with a capital letter, since capitalising the first letter of a class name is a Java coding convention. During the recommendation phase, these classes could become part of the set of candidate Java API classes used to build an LDA topic model for the respective query. These classes were not removed because if they (especially those from third-party Java API libraries) appear in the list of the Top-10 Java API classes found for the query, they are relevant to the query and should not be eliminated from the recommendation results. The other reason for not removing them is that the known evaluation dataset [38] that we used for the benchmarking includes third-party API classes and user-defined classes.
Heuristic 5 in Step 3 of the preparation phase of the proposed approach prevented the 'String' and 'ArrayList' classes from becoming part of the set of candidate Java API classes used to build an LDA topic model for a query. These two classes (especially the String class) are common to most programmes and are therefore less likely to address the specific programming task described in the respective query. As a trade-off, the proposed approach cannot cater for queries that are specifically seeking information related to the String and ArrayList classes. This is left as a future improvement.
The evaluation of the proposed approach was done using the evaluation dataset [38] published and used by the NLP2API [37] and RACK [10] studies, and the evaluation tool [39] published by the RACK study [10] (which is similar in terms of the four performance metrics to the evaluation tool [40] published by the NLP2API study [37]). This was done to enable benchmarking of the proposed approach against these state-of-the-art studies. The advantage of using a known evaluation dataset and tool is that the development of the evaluation dataset and tool is free from our possible biases or errors. However, using a known evaluation dataset posed some constraints. Our manual inspection of the evaluation dataset showed that it contains Java API classes provided by third party vendors or user-defined classes. These classes are not a part of the Java SE 8 and Java EE 8 that our study focused on. We did not remove these classes from the evaluation dataset because doing so would introduce bias in the benchmarking since NLP2API [37] and RACK [10] approaches were trained with the dataset that contains Java API classes provided by third party vendors or user-defined classes. Another constraint of using this existing evaluation dataset is that it can only be used to evaluate the performance of recommendation approaches that recommend API classes and not those that recommend API methods.
Construct validity: The proposed approach was benchmarked against NLP2API and RACK using the same four performance metrics used by NLP2API and RACK. Using established metrics employed by state-of-the-art approaches helps to achieve construct validity, since meaningful, established measures were used in the evaluation of the proposed approach.

External validity: The proposed approach could be adapted for programming languages other than Java by mining the corresponding SO posts and changing the heuristic rules to cater for the syntax and API classes of the targeted programming language.

| Related work on mining SO for API recommendation
This section presents existing studies that leverage SO posts in recommending Java API elements (such as classes and methods) for users' queries written in NL. The existing studies closest to this study are RACK [10,29,41], NLP2API [37,42] and BIKER [5,43]. Since all of these studies, including this study, incorporated their techniques/approaches into their respective recommender tools [37,41,42,43], the description of these techniques/approaches in this section is organised based on the four major design concerns involved in designing RSSE (refer to the section on the proposed approach). Although RACK [29] and NLP2API [37,42] are regarded as query reformulation techniques, what they do in identifying candidate API classes for reformulating queries can be regarded as 'producing recommendation' and 'presenting recommendation' in RSSE design, and is therefore explained under the purview of these concerns. These studies used the same name to refer to their technique/approach and the corresponding recommender tool that they built. For example, 'RACK' is used to refer to the technique [10,29] and also to the recommender [41].
RACK (Recommending API using Crowdsourced Knowledge) is an API recommendation technique [10] and a query reformulation technique [29] that returns the Top-K API classes relevant to an NL query for code search (aka a code search query) using token/keyword-API associations mined from SO. The NL tokens or keywords were extracted from the titles of SO questions, and the corresponding API classes were extracted from each question's accepted answer. These token-API pairs are stored in a relational database to be used in the recommendation phase. RACK uses two heuristics, namely Keyword-API Co-occurrence (KAC) and Keyword-Keyword Coherence (KKC), to derive the API Co-occurrence Likelihood and API Coherence metrics that measure the relevance of the candidate API classes to a query. It returns the Top-K API classes ranked based on the aggregated scores of the two metrics.
In subsequent work [41], RACK was implemented in an Eclipse IDE plug-in, as a query recommender as well as a code search engine. It uses the returned API classes as the reformulated query that serves as input to the GitHub code search API to return code snippets from thousands of open source projects. In the latest work [29], a third heuristic (Keyword Pair-API Co-occurrence) is also used to obtain candidate API classes and to calculate the API Co-occurrence Likelihood metric. The explanation of RACK below is based on [10,29], with additional information from [29] being stated.
Data pre-processing: RACK collected 172,043 pairs of Java questions and their accepted answers from SO using SEDE.
To build the token-API pairs database (comprising 126,567 entries), the tokens/keywords were extracted from the question titles using standard NL pre-processing (removal of stop words and punctuation marks, token splitting/tokenisation, and stemming using the Snowball stemmer). To extract the API classes from the code snippets in accepted answers, the Jsoup parser was used to extract the code segments enclosed by <code> tags [10] and <pre> tags [29]. The content in a code segment was split based on punctuation marks and white spaces, and programming keywords were removed. A regular expression for Java classes was used in island parsing to extract API classes having the camel case notation. The latest version of RACK dropped all the API classes (e.g. String, Integer) from the java.lang package since they are generic and often used in code [29].
Capturing context: RACK performs Parts-of-Speech (POS) tagging on a query to extract nouns and verbs. It then applies stop word removal, token splitting, and stemming on them to get the stemmed keywords. The stemmed keywords of the query provide the context of the task described in the query. The stemmed keywords are then processed using the two heuristics to obtain candidate API classes for the query. The relevance of the candidate API classes to the query is estimated using the two metrics.
Producing recommendations: RACK employs two heuristics on the token-API associations, namely Keyword-API Co-occurrence (KAC) and Keyword-Keyword Coherence (KKC), to obtain candidate API classes from the database of token-API pairs [10]. The latest version of RACK also employs a third heuristic, Keyword Pair-API Co-occurrence (KPAC), that exploits coherence among the API classes [29]. KAC uses the frequency of co-occurrences between keywords and API classes to obtain the top 5 [10] or top 10 [29] API classes relevant to each keyword. These top API classes are those that co-occurred most frequently with the respective keyword on SO.
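The KAC idea can be sketched as a co-occurrence table from title keywords to answer API classes; the threads and class names below are hypothetical, and the real RACK pipeline works over a relational database of token-API pairs:

```python
from collections import Counter, defaultdict

def build_kac(threads):
    """Keyword-API Co-occurrence: for each title keyword, count the API
    classes appearing in the same Q & A thread's accepted answer."""
    kac = defaultdict(Counter)
    for keywords, api_classes in threads:
        for kw in keywords:
            kac[kw].update(api_classes)
    return kac

def top_apis_for_keyword(kac, keyword, k=5):
    """Top-k API classes that co-occurred most often with the keyword."""
    return [api for api, _ in kac[keyword].most_common(k)]

# Hypothetical keyword/API-class pairs mined from three SO threads.
threads = [
    ({"read", "file"}, ["FileReader", "BufferedReader"]),
    ({"read", "line"}, ["BufferedReader"]),
    ({"parse", "xml"}, ["DocumentBuilder"]),
]
kac = build_kac(threads)
print(top_apis_for_keyword(kac, "read", k=2))
```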
A query might comprise multiple keywords, and an API class might not be relevant to all the keywords of the same query. Therefore, RACK uses KPAC [29] to ensure that the Top-10 API classes that are concurrently relevant to pairs of keywords of a query are selected as candidates. A keyword pair consists of any two keywords extracted from a query. KPAC [29] considers all the possible keyword pairs of a query and identifies API classes that frequently co-occur with both keywords of each pair in the same context, namely the same Question & Answer (Q & A) thread.
RACK employs KKC, which uses the contextual words of keywords to measure the semantic similarity between any two keywords [29]. A context for each keyword in a query is obtained by finding words that co-occur with the keyword in thousands of SO question titles. These co-occurring words constitute the context of the keyword. KKC uses the context of each keyword to identify coherent (semantically similar) keyword pairs, which are then used to obtain candidate API classes that are functionally coherent for each pair, since the keywords in a pair are themselves coherent.
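A minimal sketch of the coherence measure, assuming each keyword's context is a bag of co-occurring title words and similarity is the cosine between the two context count vectors (the context words are hypothetical):

```python
import math
from collections import Counter

def context_cosine(ctx_a, ctx_b):
    """KKC-style coherence: cosine similarity between the co-occurrence
    (context) word counts of two keywords."""
    a, b = Counter(ctx_a), Counter(ctx_b)
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Hypothetical contexts of two keywords mined from SO question titles:
# two of three context words overlap, so the keywords are fairly coherent.
sim = context_cosine(["file", "read", "line"], ["file", "read", "byte"])
print(round(sim, 3))
```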
RACK then employs the two metrics, API Co-occurrence Likelihood and API Coherence, to measure the relevance of the candidate API classes to the respective query. The first metric estimates the probability of co-occurrence of a candidate API class with an associated keyword from the user query based on KAC [10], or with one or more keywords from the user query based on KAC and KPAC [29]. API Coherence estimates 'the coherence of an API with other candidate APIs for a query' based on KKC [29].
Presenting recommendations: RACK presents a list of Top-K API classes ranked based on the aggregated scores of the two metrics.

NLP2API [37,42] is a query reformulation technique proposed by the same group of researchers who developed RACK. It addresses the limitations of the earlier version of RACK [10], whose heuristics potentially return generic and common API classes (such as String and List) because they rely on co-occurrences alone, and produce false positive results as a result of using all the question-answer pairs from the corpus for each query instead of only the relevant ones.
Data pre-processing: Out of 656,538 Java Q & A threads from the SO public data dump released in March 2018, NLP2API built a corpus of Java Q & A threads that have a <code> tag in either the question or the answer and have an accepted answer [37]. NLP2API applied the standard NL pre-processing (removal of stop words, punctuation marks and programming keywords; and token splitting) to the Q & A threads in its corpus, but not stemming, since they might contain code segments. The corpus was indexed using Lucene, an open source search library that provides APIs for search-related tasks such as analysis of incoming content and queries, indexing and storage, searching/querying, and ancillary tasks such as result highlighting [42]. NLP2API constructed a word2vec model using the fastText learning algorithm with its default parameters to analyse 1.3 million SO Q & A threads. The word2vec model is used to determine the Query-API class proximity score in the recommendation phase. It captures the word embedding for each word in the entire corpus and maps each word to a point in the semantic space of the corpus in such a way that semantically similar words appear close to each other. fastText [44] is an improvement over the word2vec continuous skip-gram model as it also takes into account the morphology of words and can be used to learn representations for rare words. fastText learns representations for character n-grams (sub-words) and represents a word by the sum of its character n-gram vectors.
Capturing context: NLP2API normalises a query using the standard NL pre-processing (stop word removal and token splitting). The resultant tokens provide the context of the task described in the query.

Producing recommendations: NLP2API uses the query tokens to retrieve the Top-M Q & A threads from its corpus using the Lucene search engine. It uses pseudo-relevance feedback (PRF) in reformulating a query. Instead of relying on a query submitter's feedback on the relevance of returned API classes, NLP2API assumes that the Top-M Q & A threads returned for a query are relevant (hence 'pseudo' relevance) and analyses them automatically for query reformulation. The Top-M Q & A threads are analysed to identify candidate API classes that will be used to reformulate the query. The analysis involves: a) extracting code segments (with Jsoup for HTML scraping) and constructing two sets of code segments, the first extracted from questions and the second from answers; b) extracting API classes from both sets of code segments by island parsing using a regular expression; c) weighting the API classes using Term Frequency-Inverse Document Frequency (TF-IDF), where TF refers to the frequency of occurrence of an API class in the collected code segments and document frequency (DF) refers to the number of threads that mention the API class; and d) weighting the API classes using PageRank which, unlike TF-IDF, takes the dependencies among terms into consideration. The PageRank algorithm was used to determine the relative importance of each node (i.e. API class) among all nodes in the API co-occurrence graph constructed for code segments from questions and in the corresponding graph constructed for code segments from answers.
The analysis produced four ranked lists of candidate API classes: one TF-IDF list and one PageRank list of API classes extracted from questions, one TF-IDF list and one PageRank list of API classes extracted from answers. NLP2API uses Borda score and Query-API semantic proximity analysis in narrowing down the candidate API classes from the four lists. Borda count is a popular election method where the voters rank their political candidates in an order of preference [45,46]. NLP2API calculates Borda score for each API class from the four lists based on the intuition that an API class having a higher rank in multiple lists will be more important to the query than those classes having a lower rank or that do not occur in multiple lists.
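A minimal sketch of Borda aggregation over the four ranked lists, assuming the simple convention that the item at rank r in a list of size N earns N - r + 1 points (the candidate lists below are hypothetical):

```python
from collections import defaultdict

def borda(ranked_lists):
    """Borda aggregation: in each list of size N, the item at rank r
    (1-based) earns N - r + 1 points; absent items earn nothing.
    Returns items ordered by total score, highest first."""
    scores = defaultdict(int)
    for lst in ranked_lists:
        n = len(lst)
        for r, item in enumerate(lst, start=1):
            scores[item] += n - r + 1
    return sorted(scores, key=scores.get, reverse=True)

# Four hypothetical candidate lists (TF-IDF and PageRank rankings,
# from question code segments and from answer code segments).
lists = [
    ["Pattern", "Matcher", "String"],
    ["Matcher", "Pattern", "Scanner"],
    ["Pattern", "Scanner", "Matcher"],
    ["Matcher", "Pattern", "String"],
]
ranking = borda(lists)
print(ranking)  # Pattern ranks highest: near the top of all four lists
```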
PRF, PageRank and Borda count cater only for the local contexts of the query keywords within a set of possibly relevant Q & A threads. To analyse the global context of the query keywords, NLP2API uses the word2vec model constructed earlier. Using the word2vec model, NLP2API determines the semantic proximity between a candidate API class and each keyword of the respective query by calculating the cosine similarity between their vectors. The maximum semantic proximity between a candidate API class and any of the keywords of the respective query is taken as the estimate of the relevance of the class to the given query. A final score is obtained for each candidate API class by summing its normalised Borda and semantic proximity scores.

Presenting recommendations: The final scores are used to rank the candidate API classes. NLP2API then reformulates the given query by appending the resulting Top-10 API classes to it.
BIKER (Bi-Information source based KnowledgE Recommendation) is an API recommendation approach that leverages two information sources, namely, SO posts and official API documentation, to address lexical and task-API knowledge gaps between the NL description of programming tasks in queries and the API description in API documentation [5]. To address the lexical gap, BIKER constructed a word2vec model for use in calculating the semantic similarity between two texts (such as, between query and title of SO questions, and between query and the candidate API method's description in API documentation). To address the second gap, BIKER leverages SO posts since they are task centric and can cater for API documentation's lack of description on concepts and purposes.
BIKER extracts the API methods mentioned in the answers of the Top-K SO questions with titles similar to the query. BIKER then ranks the relevance of a candidate API method to the query based on: 1) the query's similarity with similar SO question titles that have the candidate API method mentioned in the questions' answers, and 2) the query's similarity with the candidate API method's description in the official API documentation. BIKER recommends Java API methods but can be customised to recommend API classes. BIKER also provides the methods' supplementary information (such as official API description, links to similar question titles, and code snippets from SO posts) that can help the user in selecting the right API(s) to use. BIKER was implemented in a search engine website [43].
Data pre-processing: BIKER constructed a text corpus comprising the plain text content extracted from 1,347,908 SO questions tagged with 'java' and their answers, obtained from the SO data dump published on 9 December 2017. Pre-processing of these SO posts included removing long code snippets within HTML <pre> tags, using the NLTK package to tokenise the resulting sentences, and stemming the words. The corpus was used to train a word embedding model using Gensim's word2vec, and to build a word Inverse Document Frequency (IDF) vocabulary, both of which would be used later in the calculation of semantic similarities. A word's IDF here refers to the inverse of the number of SO posts in the corpus that mention the word. BIKER also constructed a knowledge base of API-related SO questions comprising 125,847 questions.

Capturing context: BIKER transforms a query into a bag of words. These words provide the context of the task described in the query.
Producing recommendations: BIKER identifies the Top-50 question titles that are similar to a query from its API-related SO questions knowledge base using the semantic similarity score. The similarity between a question title and the query is calculated as a harmonic mean of the asymmetric similarity score from T to Q and from Q to T, where Q denotes the bag of words constructed for the query and T denotes the bag of words constructed for the title. The calculation of the asymmetric similarity scores makes use of: 1) the IDFs of words in the question title and words in the query, determined from the word Inverse Document Frequency (IDF) vocabulary, and 2) the cosine similarities between the word embedding vectors of words in the question's title and words in the query, calculated using the word2vec model built earlier.
BIKER constructed two heuristics based on manual analysis of a large number of API-related questions. The heuristics are used to extract candidate API methods from the answers of each of the Top-50 similar question titles, as these are potentially the correct API methods for the query. The first heuristic uses regular expressions to identify hyperlinks in each answer that link to the official Java API documentation site, and extracts the corresponding API methods' full names from these hyperlinks. The second heuristic checks whether the plain text enclosed in each pair of HTML <code> tags in an answer matches any API method in a dictionary. The dictionary was pre-constructed by populating it with the names of all Java API methods extracted from the official documentation site. The API methods found by the two heuristics are taken as candidate API methods for the query.
The similarity between each candidate API method and the query is calculated as a harmonic mean of SimSO and SimDoc scores. The similarity scores are used to rank the candidate Java API methods for the query. SimSO measures the similarity between the query and a Top-50 similar question title that has the candidate API method mentioned in the question's answer. SimDoc measures the similarity between the query and the description of the API method in the official API documentation.
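The harmonic-mean combination can be sketched as below; the similarity values are hypothetical, and the point of the harmonic mean is that a candidate must score well on both information sources to rank highly:

```python
def harmonic_mean(sim_so, sim_doc):
    """BIKER-style combination of the SO-based (SimSO) and
    documentation-based (SimDoc) similarity scores: the harmonic mean
    rewards candidates that score well on both sources."""
    if sim_so + sim_doc == 0:
        return 0.0
    return 2 * sim_so * sim_doc / (sim_so + sim_doc)

# A candidate similar on both sources outranks one similar on only one,
# even though both pairs have the same arithmetic mean (0.6 vs 0.6).
balanced = harmonic_mean(0.8, 0.4)
lopsided = harmonic_mean(1.0, 0.2)
print(balanced > lopsided)
```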
Presenting recommendations: BIKER displays a list of Top-K API methods (with the classes that implement these methods and the full path to their packages), description of the API methods from the official documentation, hyperlinks of the query's Top-3 similar question titles whose answers mention the API elements, and code snippets extracted from these answers.

| Comparison with the related work
This section explains how this study differs from the existing studies described in the previous section. For simplicity's sake, the name of the developed plug-in (APIRecJ) is also used to refer to the approach proposed by this study in the explanation below.
Purpose: RACK is for API class recommendation [10] and query reformulation [29,41]. NLP2API [37,42] is for query reformulation. The approach proposed by BIKER [5,43] is for recommending API methods. APIRecJ is for recommending API classes.

Data pre-processing (including usage of SO data and corpus built): All the studies perform some standard NL pre-processing (such as removal of stop words, tokenisation, and stemming or lemmatisation) and other pre-processing (such as parsing to extract code segments) on the data they collected for use in their techniques/approaches.
The data pre-processing done by RACK and APIRecJ aimed at extracting tokens/keywords from the titles of the SO questions and API classes from the code snippets of the questions' accepted answers. RACK stored the token-API pairs in a relational database. APIRecJ stored the keywords of the question titles in a Questions dataset and the corresponding Java API classes in an Answers dataset, both as CSV files.
The data pre-processing done by NLP2API aimed at cleaning up the selected Java Q&A threads to be indexed by Lucene. NLP2API uses SO questions in a different way. It did not extract keywords from the SO question titles during data pre-processing. NLP2API extracts Java API classes from the code snippets in the SO questions' descriptions (and in their answers) of the Top-M Q & A threads returned by Lucene for a query during the recommendation phase.
The data pre-processing done by BIKER aimed at constructing a SO text corpus that was used to train a word embedding model (using Gensim's word2vec) and to build a word IDF vocabulary, both of which are used in the calculation of semantic similarities in the recommendation phase.
RACK extracted Java API classes from code snippets using island parsing with regular expressions during data pre-processing. APIRecJ used five heuristic rules to extract Java API classes from code snippets during data pre-processing. NLP2API does the same as RACK but during the recommendation phase. BIKER uses two heuristics to extract candidate API methods from the answers of the Top-50 similar question titles during the recommendation phase.
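The five heuristic rules themselves are not listed in this section; as a rough, simplified illustration of the regex-style island parsing RACK used, candidate API class names could be pulled from a snippet by matching CamelCase type tokens (the regex and function below are a sketch under that assumption, not either study's actual rules):

```python
import re

# CamelCase identifier starting with an uppercase letter, e.g. BufferedReader.
# This is a deliberately simple island parser: it keeps type-like tokens and
# ignores the rest of the snippet, so it will also match user-defined classes.
CLASS_TOKEN = re.compile(r"\b([A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)*)\b")

def extract_candidate_classes(snippet):
    """Return the sorted set of candidate Java class names in a code snippet."""
    return sorted(set(CLASS_TOKEN.findall(snippet)))

snippet = """
BufferedReader reader = new BufferedReader(new FileReader("data.txt"));
String line = reader.readLine();
"""
print(extract_candidate_classes(snippet))
# ['BufferedReader', 'FileReader', 'String']
```

A production extractor would additionally resolve imports and filter against the official Java API class list to discard user-defined types, which is presumably what the heuristic rules address.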
RACK used 172,043 pairs of Java questions and their accepted answers extracted from SO using SEDE to build its token-API pairs database comprising 126,567 entries [29]. NLP2API built its Lucene-indexed corpus, which contains Java Q&A threads that have a <code> tag in either the question or the answer and have an accepted answer, out of 656,538 Java Q&A threads obtained from the SO public data dump released in March 2018. NLP2API also used 1.3 million SO Q&A threads to train a word2vec model using the fastText learning algorithm, to be used in the calculation of semantic similarity between a query's keywords and Java API classes (i.e. the Query-API class semantic proximity score). BIKER used 1,347,908 SO Java questions and their answers (obtained from the SO data dump dated 9th Dec 2017) to construct a text corpus comprising the plain text content extracted from these questions, and a knowledge base of API-related SO questions comprising 125,847 questions. The text corpus was used to construct a word embedding model using Gensim's word2vec and to construct a word IDF vocabulary.
APIRecJ obtained, using SEDE, 632,062 SO posts that are tagged 'java' and have an accepted answer, and identified 160,680 questions that have Java API classes in the code snippets of their accepted answers. APIRecJ used the keywords of these question titles to train a Doc2Vec model using Gensim's Doc2Vec module. The Doc2Vec model is used to measure semantic similarity between a query's keywords and the keywords of SO question titles, in order to determine questions similar to the query for subsequent retrieval of the corresponding API classes.
All four studies leverage the co-occurrences of words in their techniques/approaches, but in different ways. RACK uses the co-occurrences between keywords of SO question titles and API classes (found in the code snippets of the questions' accepted answers), and the co-occurrences among those keywords of SO question titles, to identify API classes relevant to reformulating a query [29]. NLP2API and BIKER use co-occurrences of words in creating their word2vec models, and APIRecJ uses them to create a Doc2Vec model. RACK does not use a word embedding model.
Capturing context: All four studies perform some standard NL pre-processing on a query to obtain keywords/tokens that provide the context of the task described in the query.
Producing recommendations: RACK uses three heuristics, namely Keyword-API Co-occurrence (KAC) [10], Keyword-Keyword Coherence (KKC) [10] and Keyword Pair-API Co-occurrence (KPAC) [29], to derive API Co-occurrence Likelihood and API Coherence metrics that determine the Top-K API classes for a query. NLP2API takes the Top-M Q&A threads returned for a query from its Lucene-indexed corpus as pseudo-relevance feedback, with the assumption that these threads are relevant to the query, and analyses them for query reformulation. NLP2API also uses four measures (TF-IDF, PageRank, Borda score, and the Query-API class semantic proximity score) to determine the Top-10 relevant API classes to be used in the query reformulation. The fourth score takes into consideration the global context of the query keywords.
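As an illustration of one of these measures, combining several ranked lists of candidate API classes via a Borda count can be sketched as follows (this is a generic Borda count under standard assumptions; NLP2API's exact variant is not spelled out here, and the ranked lists are invented):

```python
from collections import defaultdict

def borda_scores(rankings):
    """Combine several ranked lists into a single Borda score per item.

    An item at position i (0-based) in a list of length n earns (n - i)
    points; items absent from a list earn nothing for it. Generic Borda
    count, shown for illustration only.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for i, item in enumerate(ranking):
            scores[item] += n - i
    return dict(scores)

# Hypothetical per-measure rankings of candidate API classes for a query.
tfidf_rank = ["File", "Scanner", "BufferedReader"]
pagerank_rank = ["Scanner", "File", "InputStream"]
combined = borda_scores([tfidf_rank, pagerank_rank])
# File: 3 + 2 = 5; Scanner: 2 + 3 = 5; BufferedReader: 1; InputStream: 1
```

Classes that rank highly under several independent measures accumulate the most points, which is the intuition behind using Borda aggregation alongside TF-IDF and PageRank.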
BIKER uses two heuristics to extract candidate Java API methods from the answers of Top-50 SO question titles that it finds to be similar to the query based on similarity measures it calculates using the word2vec model and the word IDF vocabulary constructed in the data pre-processing phase. The two heuristics are API methods' full names found in hyperlinks present in the answers, and API methods (found in answers) that have a match in a pre-built dictionary containing all Java API methods extracted from the official documentation site. BIKER then uses a similarity score which is a harmonic mean of SimSO (query similarity to similar SO question titles) and SimDoc (query similarity to API method description in the official API documentation) scores to rank candidate API methods. The calculation of the two scores makes use of the word2vec model and word IDF vocabulary.
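The harmonic-mean ranking score described above can be sketched directly (the helper name is hypothetical; SimSO and SimDoc are the similarity scores named in the text):

```python
def biker_style_score(sim_so, sim_doc):
    """Harmonic mean of SimSO and SimDoc used to rank candidate API methods.

    The harmonic mean rewards candidates that score well on *both*
    similarities: a low value on either side drags the combined score down,
    unlike an arithmetic mean.
    """
    if sim_so + sim_doc == 0:
        return 0.0
    return 2 * sim_so * sim_doc / (sim_so + sim_doc)

# A candidate strong on both signals beats one strong on only one:
print(biker_style_score(0.8, 0.8))  # 0.8
print(biker_style_score(0.9, 0.1))  # 0.18
```

This is why a method that closely matches similar SO titles but has an unrelated official description (or vice versa) is ranked low.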
APIRecJ uses the API classes extracted from code snippets in the accepted answers of the Top-100 questions (i.e. question titles in this case) similar to a query (determined using the Doc2Vec model) to create an LDA topic model of a single topic that intuitively represents the concept involved in the programming task described by the query. The Java API classes that constitute the topic are classes that co-occur frequently in the code snippets found in the accepted answers of those questions similar to the query. The Top-10 of these Java API classes, based on their probability scores, are returned as the most relevant Java API classes for the query.
Presenting recommendations: RACK, NLP2API and APIRecJ present the Top-K Java API classes. BIKER presents the Top-K Java API methods together with their classes and the full paths to their packages, plus supplementary information (official API method descriptions, the Top-3 similar question titles as hyperlinks, and code snippets). Summary of comparison: APIRecJ employs relatively fewer specifically derived measures to determine the relevance of Java API classes to a query. It is therefore less complex, and yet achieved comparable benchmarking results. The main differences between APIRecJ and the other approaches/techniques are the use of the five heuristic rules to extract Java API classes from code snippets found in the accepted answers of questions in the training data, the use of a Doc2Vec model to find question titles similar to a query, and the use of the corresponding Java API classes of these similar questions to create an LDA topic model of a single topic to determine the Top-K most relevant Java API classes to be recommended for the query.

| CONCLUSION
This study aimed to address the lexical gap between NL queries and Java API documentation, and the lexical gap between NL queries and Java code. These gaps were addressed by designing and implementing a novel approach for recommending Java API classes for programming tasks described in NL queries, using data mined from SO, a Doc2Vec word embedding model, Latent Dirichlet Allocation (LDA) topic modelling and heuristics. The evaluation of the recommendation performance of the approach followed the way the state-of-the-art studies (RACK [10,29] and NLP2API [37]) evaluated their approaches. This involved executing the proposed approach on the evaluation dataset [38] published by the NLP2API [37] and RACK [10] studies, to obtain the recommendation results used to calculate the four performance metrics used by those studies. Our study used the same performance metrics, evaluation dataset and evaluation tool as the RACK [10,29] and NLP2API [37] studies in order to benchmark our approach against theirs. The trade-offs of using this existing evaluation dataset are: it includes Java API classes provided by third-party vendors, as well as user-defined classes, in evaluating our approach, which was built to focus on Java SE 8 and Java EE 8; and it cannot continue to be used to evaluate our approach if we extend the approach to recommend API methods in addition to API classes.
In comparison to RACK [10,29] and NLP2API [37], our approach uses relatively fewer specifically derived measures to determine the relevance of Java API classes to a query. It is less complex, as it uses Doc2Vec and LDA, which are common machine learning models, yet it demonstrated recommendation performance close to that of the best state-of-the-art approach, NLP2API [37].
The proposed approach was implemented in a Java API class recommender tool that is accessible via an Eclipse IDE plug-in (APIRecJ). Possible future work includes evaluating the performance of the tool in actual use; designing a more generalised and robust method for extracting Java API classes from code snippets; extending the approach to also recommend API methods; and extending the plug-in to include information from the official Java API documentation.
In summary, this study explored the use of a Doc2Vec word embedding model and LDA topic modelling on Stack Overflow crowd knowledge to recommend relevant Java API classes for NL queries. The benchmarking results show that it is possible to produce comparable API recommendation results using a less complex approach that makes use of some basic machine learning models, in particular Doc2Vec and LDA.