Hot question prediction in Stack Overflow

National Key Research and Development Program of China, Grant/Award Number: 2018AAA0102304; State Key Laboratory of Software Development Environment, Grant/Award Number: SKLSDE-2019ZX-05; National Natural Science Foundation of China, Grant/Award Number: 61732019; National Basic Research Program of China (973 Program), Grant/Award Number: 2018YFB1004202; Fundamental Research Funds for the Central Universities, Grant/Award Number: YWF-20-BJ-J-1018

Abstract
Stack Overflow is a very popular programming question and answer community. Some questions become hot and receive many views, and these questions are of widespread concern to developers. Identifying hot questions early makes it possible to recommend potential hot questions to answerers with priority, thereby shortening the response time. Besides, hot question prediction is also helpful for making advertising plans, planning advertising campaigns and estimating costs. Therefore, it is important to predict hot questions. The authors propose the VSAF method, which analyses the View amount changes, Answer amount changes and Score changes soon after questions' creation based on a Fully convolutional neural network. The performance of the VSAF method has been evaluated on a training set and two different test sets. The training set has 1600 hot questions and 1600 cold questions. The random test set has 381 hot questions and 2819 cold questions, while the balanced test set has 400 hot questions and 400 cold questions. The experimental results show that, on the balanced test set, VSAF achieves Accuracy, F1hot and F1cold of 80%, 77.77% and 81.81%, which outperforms the baseline approach by 25.59%, 21.52% and 29.04%, respectively. On the random test set, VSAF achieves Accuracy, F1hot and F1cold of 84.91%, 53.96% and 90.97%, which outperforms the baseline approach by 31.83%, 84.16% and 19.35%, respectively. The VSAF method significantly outperforms the state-of-the-art approach on hot question prediction.


| INTRODUCTION
Developers can post questions about the problems, errors and bugs they find, and seek solutions or workarounds for them, on question and answer sites (Q&A sites) [1]. Stack Overflow is such a Q&A site. It has become an important source for developers to solve various software development-related problems [2,3]. In Stack Overflow, the popularity of a question describes how much attention it receives, which can be measured by its total number of views [1]. Some questions become hot and receive many views. These hot questions are of widespread concern to developers, and their solutions help more developers than those of cold questions. Therefore, it is important to predict hot questions soon after they are posted and to recommend them to answerers with priority, so as to shorten their response time and increase their response rates. Furthermore, Stack Overflow allows companies to post developer-related product advertisements or job advertisements on the website. Product advertisements are charged on a Cost Per Mille (CPM) or Cost Per Click (CPC) basis. Advertisements on hot questions receive more attention than those on cold questions, which helps advertising promotion. The prediction of hot questions is therefore useful for making advertising plans, planning advertising campaigns and estimating costs in Stack Overflow.
In recent decades, popularity prediction has attracted great attention [4][5][6]. There have been several studies on popularity prediction based on temporal process features [7][8][9] or text-based features [10]. Some of these studies used the number of times online content was forwarded, while others used the number of followers of the user who posted the content. However, these attributes are not available in Stack Overflow. Therefore, the methods proposed in the above studies cannot be used to predict the hot questions of Stack Overflow. Stack Overflow has its own attributes, such as questions' answer amounts and scores. It is necessary to explore a more suitable method for hot question prediction in Stack Overflow. Phukan et al. presented the first look at predicting hot questions in Stack Overflow based on text-based features [1]. However, they ignored initial_feedback features, such as the score changes and answer amount changes of questions soon after they were posted. We explore the effectiveness of using initial_feedback features to predict hot questions.
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
Following previous works [1,11], we use the number of views to characterise whether a question is hot or not. A question is either hot or cold, so hot question prediction can be regarded as a binary classification problem. We propose the VSAF method, which analyses questions' initial_feedback features, including the View amount changes, Answer amount changes and Score changes soon after questions' creation, based on a Fully convolutional neural network. According to the initial_feedback features within K hour(s) of a question's posting, we predict whether the question will be hot T days after it is posted. K and T are defined in Section 4. Unless otherwise stated, K and T are set to 3 hours and 7 days, respectively.
We evaluate the performance of our method (VSAF) on a training set and two different test sets. The training set has 1600 hot questions and 1600 cold questions. The random test set has 381 hot questions and 2819 cold questions, while the balanced test set has 400 hot questions and 400 cold questions. The baseline method [1] mainly uses text features, whereas the proposed VSAF method mainly uses temporal features. The experimental results show that, on the balanced test set, VSAF achieves Accuracy, F1 hot and F1 cold of 80%, 77.77% and 81.81%, which outperforms the baseline approach [1] by 25.59%, 21.52% and 29.04%, respectively. On the random test set, VSAF achieves Accuracy, F1 hot and F1 cold of 84.91%, 53.96% and 90.97%, which outperforms the baseline approach [1] by 31.83%, 84.16% and 19.35%, respectively. The VSAF method significantly outperforms the state-of-the-art approach on hot question prediction.
The main contributions of this paper are as follows: (1) Based on a Fully convolutional neural network, we propose the VSAF method to predict hot questions of Stack Overflow. It analyses questions' initial_feedback features, including the view amount changes, answer amount changes and score changes within K hour(s) of a question's posting. (2) We analyse real questions in Stack Overflow, and experiments show that our method VSAF outperforms the baseline approach [1]. Moreover, using initial_feedback features alone works better than combining initial_feedback features with text features.
The remainder of the article is organised as follows. Section 2 presents the background and data collection. Section 3 presents our prediction approach VSAF. Section 4 conducts experiments to evaluate the proposed method. Section 5 discusses threats to validity and Section 6 discusses related works. Finally, Section 7 concludes this paper.

| Research on popularity of web content
Some researchers used social network structures to predict content's popularity. Zhang et al. predicted the popularity of an event through a diffusion model that took the content of the event and developer's information into account [12]. Shengxian et al. proposed a message popularity prediction method based on propagation simulation [13]. Zhang et al. [14] used developer-guided hierarchical attention networks to predict social image popularity based on developer's features and pattern content.
Besides, Piotrkowicz et al. used only the title of articles to predict the popularity of news articles [10]. They analysed features such as the number of entity mentions in headlines of the relevant news outlet. Sanjo et al. used image and short text features in recipes to propose a visual semantic fusion model to predict the popularity of online recipes [15].
The above studies use some features that are not available in Stack Overflow, such as users' followers, articles' publication journals, the number of online content forwarded, the number of entity mentions in headlines of the relevant news outlet, etc. Therefore, the methods proposed in the above studies cannot be directly used to predict the hot questions of Stack Overflow.

| Research on hot questions on Stack Overflow
Few studies currently address hot questions in Stack Overflow. Phukan et al. predicted questions' final popularity based on the initial content of questions on the site, using views as the evaluation metric for popularity [1]. They used the term frequency-inverse document frequency model to extract feature vectors from the processed title, body and tags of each question, and assigned different weights to these three feature vectors. Finally, they used a variety of classification algorithms to classify questions into two categories and thus predict whether they would be hot.
Experiments show that our method VSAF is better than the method proposed by previous work [1].

| Other research on Stack Overflow
Zhou et al. found that questions with bounties were answered in most cases [16]. Profir-Petru et al. proposed the POSIT method, which used abstract syntax tree tags from the CLANG compiler and part-of-speech tags from the Standard Stanford part-of-speech tagger to simultaneously tag natural and programming languages in questions [17]. Haoxiang Zhang et al. investigated how the knowledge in answers became obsolete and identified the characteristics of such obsolete answers [18]. In addition, some scholars studied the obsolescence of APIs in questions [19][20][21]. Xiaoxue Ren et al. designed an automatic open information extraction approach for systematically discovering and summarizing the controversies in Stack Overflow [22]. There have also been many studies on the quality of questions and answers in Stack Overflow [18][23][24][25].
There is much related research on Stack Overflow, but research on the prediction of hot questions is still lacking.

| BACKGROUND AND DATA COLLECTION
In this section, we introduce the background information of Stack Overflow in more detail. In addition, our data collection process is described.

| Background
The Stack Overflow website has been in use since 2008. It is one of the largest Q&A sites where developers can share knowledge and seek experts' advice on a wide range of topics in computer programming. Stack Overflow's developer base is increasing exponentially and many new questions are posted every day [26].
In Stack Overflow, developers can ask questions, browse questions, view answers to questions and give answers [27] to questions. Furthermore, they can vote a question up or down. The question's score is calculated by subtracting the number of negative (down) votes from the number of positive (up) votes. Figure 1 shows an example of a question. This question was viewed 16 times and received 2 answers. Since it received 20 up votes and 6 down votes, its score is 14.

| Data collection
We collect data sets from Stack Overflow. First, we collect the IDs and posting times of new questions. Stack Overflow provides a "Newest" interface, which shows newly created questions. Using this interface, we continuously crawl the IDs and posting times of newly posted questions from November 24, 2019 to December 7, 2019. After obtaining the new questions' IDs, we collect detailed information on these questions. For each question, we collect its title, body and tags. Then we collect each question's views, answers and score every hour within 10 hours of the question being posted. Finally, we collect each question's views 7 days after it was posted.
The site takes some measures to prevent it from being overvisited. Every application is subject to an IP based concurrent request throttle. Therefore, we use an IP proxy pool to break the single IP limit. IP proxy pool gathers the valid proxy IPs acquired. We configure them in our crawler application. When one IP fails, another valid IP is used immediately, thus avoiding the access restrictions of a single IP.

| APPROACH
In this section, we describe our research method VSAF. It is based on the fully convolutional neural network and takes into account the initial_feedback features of questions, including view amount changes, score changes and answer amount changes within K hour(s) of questions' posting. K represents the observation duration of questions. We explore the influence of the setting of K on the prediction effect of hot questions in Subsection 5.4.4. Based on the initial_feedback features within K hour(s) of question's posting, we predict whether the question will be hot T days after it is posted. Developers can adjust K and T according to the actual needs of the project. In this article, if there is no special instruction, K and T are set to 3 hours and 7 days by default. The research framework of this approach is shown in Figure 2. The specific structure of the fully convolutional neural network model is defined in Subsection 4.1.4. Our method is divided into two phases: a training phase and a prediction phase. In the training phase, we aim to build the fully convolutional neural network based on historical information. In the prediction phase, we use the fully convolutional neural network to predict hot questions.
In the training phase, the VSAF method separately obtains the view amount changes, score changes and answer amount changes within K hour(s) of questions' posting, which together form the initial_feedback features of a question. Then, we get the views of the question 7 days after its creation and decide its hot level. If the view amount is larger than or equal to 50, the question is considered hot; otherwise, it is considered cold. From the initial_feedback features and hot levels of the questions, we obtain the prediction model. A fully convolutional neural network is a kind of convolutional neural network that is widely used for classification tasks. It has the advantages of not being prone to overfitting and of adapting to different input sizes [28]. Therefore, we use a fully convolutional neural network to build the prediction model.
In the prediction phase, the VSAF method obtains the view amount changes, score changes and answer amount changes soon after new questions' creation, and combines them into the initial_feedback features. Then, these features are input into the prediction model obtained during the training phase. Finally, we get the prediction results of whether the questions are hot or not.

| View amount changes within K hour(s)
When a question is posted, it can be viewed by developers. Hot questions may receive more views soon after their creation, and the early view amount may be useful for predicting hot questions [29]. Therefore, after the question is posted, we record its number of views $v_i$ every hour until the question has been posted for K hours. The changes in the number of the question's views per hour, $Qvc_i$, are calculated as

$$Qvc_i = (vc_1, vc_2, \ldots, vc_i)$$

where $vc_i$ represents the number of views the question receives between $i-1$ hours and $i$ hours after being posted. It is defined as

$$vc_i = v_i - v_{i-1}$$

with $v_0$ denoting the number of views at posting time.

| Score changes within K hour(s)
After a question is posted, developers can vote it up or down. The question's score is the difference between the number of up votes and the number of down votes. Yao et al. [25] found that high-scoring questions tended to get more attention, because they were clearer and of better quality. Due to their better quality, potential hot questions may have higher scores soon after their creation, and thus a question's initial score may be useful in hot question prediction. After the question is posted, we record its score $s_i$ every hour until the question has been posted for K hours. The changes in the question's score per hour, $Qsc_i$, are calculated as

$$Qsc_i = (sc_1, sc_2, \ldots, sc_i)$$

where $sc_i$ represents the change in the question's score between $i-1$ hours and $i$ hours after being posted. It is defined as

$$sc_i = s_i - s_{i-1}$$

with $s_0$ denoting the score at posting time.

| Answer amount changes within K hour(s)
After a new question is posted, developers can answer it. Neshati et al. [30] found that questions with answers are more attractive to the community. Because many developers visit the website to find solutions to programming questions, questions with answers are more helpful to developers and are more likely to become hot. After the question is posted, we record its number of answers $a_i$ every hour until the question has been posted for K hours. The changes in the question's answers per hour, $Qac_i$, are calculated as

$$Qac_i = (ac_1, ac_2, \ldots, ac_i)$$

where $ac_i$ represents the number of answers the question receives between $i-1$ hours and $i$ hours after being posted. It is defined as

$$ac_i = a_i - a_{i-1}$$

with $a_0$ denoting the number of answers at posting time.
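The three per-hour change sequences can be illustrated with a minimal sketch. It assumes that each attribute is sampled once per hour as a cumulative value, that the value at posting time is 0, and that the three change sequences are simply concatenated. The function names and the sample numbers are hypothetical and are not taken from the authors' implementation.

```python
from typing import List, Sequence


def hourly_changes(snapshots: Sequence[int], initial: int = 0) -> List[int]:
    """Return per-hour changes from cumulative hourly snapshots.

    snapshots[i - 1] is the value observed i hours after posting;
    `initial` is the assumed value at posting time.
    """
    changes = []
    previous = initial
    for value in snapshots:
        changes.append(value - previous)
        previous = value
    return changes


def initial_feedback_features(views: Sequence[int], scores: Sequence[int],
                              answers: Sequence[int], k: int = 3) -> List[int]:
    """Concatenate view, score and answer changes for the first k hours."""
    if min(len(views), len(scores), len(answers)) < k:
        raise ValueError("need at least k hourly snapshots per attribute")
    return (
        hourly_changes(views[:k])
        + hourly_changes(scores[:k])
        + hourly_changes(answers[:k])
    )


# Hypothetical question observed for K = 3 hours.
features = initial_feedback_features(
    views=[12, 20, 31], scores=[1, 1, 0], answers=[0, 1, 2], k=3
)
print(features)
```

Score changes can be negative because down votes lower the score, whereas view and answer changes are normally non-negative.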

| Get fully convolutional neural network
The next step is to build the fully convolutional neural network in the training phase. A fully convolutional neural network is a kind of convolutional neural network that is widely used for classification tasks. Because convolutional neural networks have the disadvantages of consuming a lot of storage space and low computing efficiency [31], Long et al. [32] proposed the Fully Convolutional Neural Network (FCN) algorithm in 2015. We use the fully convolutional neural network to build the prediction model. The specific structure of the model used is shown in Figure 3. In Subsections 3.1.1, 3.1.2 and 3.1.3, we obtain the hourly changes in views $Qvc_i = (vc_1, \ldots, vc_i)$, score $Qsc_i = (sc_1, \ldots, sc_i)$ and answers $Qac_i = (ac_1, \ldots, ac_i)$. We combine them into a large vector for each question; the vectors of all $n$ questions form the input $Q$, where $n$ represents the size of the training set.
We input $Q$ into the convolutional layer and, after three rounds of convolutional layer processing, convert it into the input $P_b$ of the pooling layer. Convolutional layer processing consists of three steps, namely the convolution operation, Batch Normalization (BN) and activation function processing. Next, we describe the convolutional layer processing in detail.
Firstly, we normalise the matrix $Q$ and perform a convolution operation to convert it into $BN$, the input of the BN step. The convolution process is

$$BN(i, j) = (Q * W)(i, j)$$

where $Q$ represents the input matrix, $W$ represents the convolution kernel, $*$ represents the convolution operation, and $i$ and $j$ index the rows and columns of the matrix, respectively.
Secondly, $BN$ is converted into the input $AC$ of the activation function through the BN process. BN is used to make the training process more stable and to ensure that the input of each layer of the neural network maintains the same distribution. Let $s$ be the size of a batch of data. For a batch, we first obtain the mean of the batch data, then calculate the variance and normalise each question's processed features in the batch. Finally, in order to maintain the same distribution of input at each layer of the neural network, scaling and offset operations are carried out.

The mean calculation formula is defined as

$$\mu = \frac{1}{s} \sum_{k=1}^{s} BN_k$$

The variance calculation formula is defined as

$$\sigma^2 = \frac{1}{s} \sum_{k=1}^{s} (BN_k - \mu)^2$$

The normalization method is shown below:

$$\widehat{BN}_k = \frac{BN_k - \mu}{\sqrt{\sigma^2 + \epsilon}}, \qquad AC_k = \gamma \, \widehat{BN}_k + \beta$$

Here $BN_k$ denotes the processed features of the $k$th question in the batch, $\epsilon$ is a small constant that avoids division by zero, and $\gamma$ and $\beta$ are the scaling and offset parameters.

Finally, $AC$ is transformed into the input $P_b$ of the pooling layer through the activation function. After the activation function is introduced, the neural network is no longer a linear combination of its inputs and can approximate arbitrary functions. Traditional activation functions include the Sigmoid and TanH activation functions. Both are computationally expensive because they require evaluating exponentials. The rectified linear unit (ReLU) computes activation values by thresholding, which greatly reduces the calculation complexity and accelerates convergence. Therefore, we use the ReLU activation function. At this point, we have completed the transformation of data from the convolutional layer to the pooling layer.

$P_b$ is transformed into the input $P$ of the Softmax layer through global average pooling. We perform global average pooling to prevent overfitting. Then, $P$ enters the Softmax layer to get the probability that a question belongs to a certain category. Through the Softmax layer, the data $P$ obtained after pooling is converted into the predicted value $\hat{Y}$. The entry $\hat{y}_{ij}$ represents the probability that the $i$th question belongs to the $j$th category. The actual category of the questions in the training data set is $Y$. Each row of $Y$ has exactly one value of 1 and the rest are 0; $y_{ij} = 1$ means that the $i$th question belongs to the $j$th category. For example, if $y_{i1} = 1$, the $i$th question belongs to the first category and is a cold question. Otherwise, if $y_{i2} = 1$, the $i$th question belongs to the second category and is a hot question. From the true category $Y$ and the predicted category $\hat{Y}$ of the questions, we obtain the total cost function of the prediction model, which is computed over all questions and all $cn$ categories, where $cn$ is the number of categories of questions.
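The layer operations described above (convolution, BN, ReLU, global average pooling and Softmax) can be sketched in plain Python. This is only an illustration under stated assumptions: the convolution slides the kernel without flipping it and without padding, as most deep learning libraries do; the names `gamma`, `beta` and `eps` are the usual BN parameter names rather than symbols from the source; and the input sizes and values are hypothetical. It is not the authors' implementation.

```python
import math
from typing import List

Matrix = List[List[float]]


def conv2d_valid(q: Matrix, w: Matrix) -> Matrix:
    """Slide kernel w over q (no padding, stride 1) and sum the products."""
    k_rows, k_cols = len(w), len(w[0])
    return [
        [
            sum(q[i + a][j + b] * w[a][b]
                for a in range(k_rows) for b in range(k_cols))
            for j in range(len(q[0]) - k_cols + 1)
        ]
        for i in range(len(q) - k_rows + 1)
    ]


def batch_norm(batch: List[float], gamma: float = 1.0, beta: float = 0.0,
               eps: float = 1e-5) -> List[float]:
    """Normalise one feature over a batch, then scale (gamma) and shift (beta)."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


def relu(values: List[float]) -> List[float]:
    """Threshold at zero: negative activations become 0."""
    return [max(0.0, v) for v in values]


def global_average_pool(feature_maps: Matrix) -> List[float]:
    """Reduce each feature map (one row per channel) to its mean value."""
    return [sum(fm) / len(fm) for fm in feature_maps]


def softmax(scores: List[float]) -> List[float]:
    """Turn scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 3 x 4 input and 1 x 2 kernel.
q = [[12.0, 8.0, 11.0, 6.0], [1.0, 0.0, -1.0, 2.0], [0.0, 1.0, 1.0, 0.0]]
conv_out = conv2d_valid(q, [[1.0, -1.0]])

normalised = batch_norm([2.0, 4.0, 6.0, 8.0])  # one feature over a batch of 4
activated = relu(normalised)

# Two hypothetical channels, one per category: [cold, hot].
pooled = global_average_pool([[1.0, 3.0], [0.5, 0.5]])
probs = softmax(pooled)
```

Here the number of channels entering the pooling step is assumed to equal the number of categories, so that the pooled vector can be passed to Softmax directly.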
The goal of the training phase is to reduce the errors between predicted values and actual values so as to get a good prediction effect. The smaller the value of the cost function, the better the prediction effect. Therefore, we use a gradient descent optimization algorithm, which is generally used to minimise the value of a cost function.
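The text does not spell out the form of the cost function or of the update rule. As a hedged illustration, assuming the common cross-entropy loss on Softmax outputs and plain gradient descent, the training objective could be written as below. The pre-Softmax scores $p_{ij}$, the parameter vector $\theta$ and the learning rate $\eta$ are notation introduced here for illustration and are not taken from the source.

```latex
\hat{y}_{ij} = \frac{\exp(p_{ij})}{\sum_{c=1}^{cn} \exp(p_{ic})}, \qquad
J(\theta) = -\sum_{i=1}^{n} \sum_{j=1}^{cn} y_{ij} \log \hat{y}_{ij}, \qquad
\theta \leftarrow \theta - \eta \, \nabla_{\theta} J(\theta)
```

Because each row of $Y$ contains a single 1, only the predicted probability of the true category of each question contributes to $J$ under this assumption.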

| Prediction phase
Given a new question, we collect its number of views, score and number of answers per hour soon after its creation, and compute the view amount changes, score changes and answer amount changes. We combine them into a large vector, input it into the prediction model based on the fully convolutional neural network, and predict whether the new question will become hot or not.

FIGURE 3 The structure of the fully convolutional neural network model

| EXPERIMENT
To evaluate the effectiveness of our proposed method VSAF, we conduct several experiments based on real data from Stack Overflow.

| Datasets and setup
Using the data collection method proposed in Section 2.2, we obtain question data. We make the following definitions.
Hot question: a question with at least 50 views (views ≥ 50) 7 days after it is posted.
Cold question: a question with fewer than 50 views (views < 50) 7 days after it is posted.
In our datasets, 12.36% of questions are hot, while 87.64% of questions are cold, so hot questions are far fewer than cold questions. In practice, administrators can set the view threshold ($hot_{viewsAmount}$) and the number of days (T) based on actual needs. For example, a question whose views are greater than or equal to $hot_{viewsAmount}$ T days after it is posted is a hot question; otherwise it is a cold question. T is defined in Section 4.
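The labelling rule can be expressed as a small function. The sketch below uses the default threshold of 50 views from the definitions above; the function and parameter names are hypothetical.

```python
def hot_level(views_after_t_days: int, hot_views_amount: int = 50) -> str:
    """Label a question from its view count T days after posting."""
    return "hot" if views_after_t_days >= hot_views_amount else "cold"


# Default threshold of 50 views, and a stricter hypothetical threshold of 200.
labels = [hot_level(50), hot_level(49), hot_level(120, hot_views_amount=200)]
```

Changing `hot_views_amount` changes the class balance of the data set, so a model trained with one threshold should not be evaluated with labels produced by another.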
Following the method of Liao et al. [11], we use a balanced training set to build the prediction model and two test sets, a balanced test set and a random test set, to evaluate it. The ratio of hot questions to cold questions in the balanced set is 1, while the ratio in the random set is closer to the real situation. Using the balanced training set ensures that there are enough hot questions and helps to prevent overfitting. Using the balanced test set evaluates the effectiveness of the method in detecting hot questions. Using the random test set better reflects the prediction effect of the model in real applications. The questions in both test sets are posted later than those in the training set. The method of building the training set and test sets is described below.
First, we obtain the training set. In order to ensure the balance of the training data, we extract 1600 hot questions and 1600 cold questions as the training set.
Second, we obtain the balanced test set. The later a question is posted, the larger its ID is. Let maxID be the maximum ID of the questions in the training set. We randomly select 400 cold questions and 400 hot questions whose IDs are larger than maxID. This ensures that the questions in the balanced test set are posted later than those in the training set.
Finally, we obtain the random test set. We randomly select 3200 questions with ID > maxID as the random test set. The random test set includes 381 hot questions and 2819 cold questions.
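The construction of the three sets can be sketched as follows. The sketch assumes that the training questions are the earliest ones by ID (the text only states the class counts), that each question is a small record with an `id` and a `label`, and that sampling uses a fixed seed for reproducibility. The function name, the record layout and the tiny synthetic data are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Question = Dict[str, object]  # hypothetical record: {"id": int, "label": "hot" | "cold"}


def build_sets(questions: List[Question], n_train: int, n_balanced: int,
               n_random: int, seed: int = 0
               ) -> Tuple[List[Question], List[Question], List[Question]]:
    """Return (training set, balanced test set, random test set)."""
    rng = random.Random(seed)
    ordered = sorted(questions, key=lambda q: q["id"])  # larger ID = posted later

    # Balanced training set: n_train hot and n_train cold questions.
    train: List[Question] = []
    counts = {"hot": 0, "cold": 0}
    for q in ordered:
        if counts[q["label"]] < n_train:
            train.append(q)
            counts[q["label"]] += 1
        if counts["hot"] == n_train and counts["cold"] == n_train:
            break
    max_id = max(q["id"] for q in train)

    # Both test sets only contain questions posted after the training set.
    later = [q for q in ordered if q["id"] > max_id]
    hot = [q for q in later if q["label"] == "hot"]
    cold = [q for q in later if q["label"] == "cold"]
    balanced = rng.sample(hot, n_balanced) + rng.sample(cold, n_balanced)
    random_set = rng.sample(later, n_random)
    return train, balanced, random_set


# Tiny synthetic example: every fourth question is hot.
questions = [{"id": i, "label": "hot" if i % 4 == 0 else "cold"}
             for i in range(1, 201)]
train, balanced, random_set = build_sets(questions, n_train=10,
                                         n_balanced=5, n_random=30)
```

With the sizes reported in the text, the call would use `n_train=1600`, `n_balanced=400` and `n_random=3200`.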
We use the training set to obtain the prediction model, and use the balanced test set and the random test set to evaluate the performance of the VSAF method. We use the initial_feedback features within K hour(s) of a question's posting to predict whether the question is hot or not. K is defined in Section 4. It represents the observation duration of questions. In the following experiments, K is set to 3 by default unless otherwise stated. We explore the influence of the setting of K on the prediction effect of hot questions in Subsection 5.4.4. The data set used in this experiment is shown in Table 1.

| Evaluation metrics
In order to evaluate our method VSAF, we use metrics Accuracy, Precision, Recall and F1-score. These metrics are commonly used in evaluation of popularity prediction [1,11].
Accuracy is the most common evaluation metric, which is the number of correctly classified samples divided by the total number of samples. The calculation formula is as follows:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

TP indicates the number of hot questions predicted as hot, TN indicates the number of cold questions predicted as cold, FP indicates the number of cold questions predicted as hot and FN indicates the number of hot questions predicted as cold.
Precision is defined as

$$Precision_{hot} = \frac{TP}{TP + FP}, \qquad Precision_{cold} = \frac{TN}{TN + FN}$$

$Precision_{hot}$ is the ratio of correctly predicted hot questions over all the questions predicted as hot. $Precision_{cold}$ is the ratio of correctly predicted cold questions over all the questions predicted as cold.
Recall is a measure of coverage. It is defined as

$$Recall_{hot} = \frac{TP}{TP + FN}, \qquad Recall_{cold} = \frac{TN}{TN + FP}$$

$Recall_{hot}$ is the ratio of correctly predicted hot questions over all actually hot questions. $Recall_{cold}$ is the ratio of correctly predicted cold questions over all actually cold questions. The F1-score takes both Precision and Recall into account. The formula for calculating the F1-score for positive samples is shown below; the F1-score for cold questions is obtained in the same way from $Precision_{cold}$ and $Recall_{cold}$.

$$F1_{hot} = \frac{2 \times Precision_{hot} \times Recall_{hot}}{Precision_{hot} + Recall_{hot}}$$

The higher the F1-score, the more effective the method.
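The per-class metrics can be computed from the four confusion counts as sketched below. The sketch treats hot as the positive class and follows the standard convention in which FP counts cold questions predicted as hot and FN counts hot questions predicted as cold. The counts in the example are hypothetical and are not taken from the paper's tables.

```python
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def _f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return _ratio(2 * precision * recall, precision + recall)


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Hot is the positive class: fp = cold predicted hot, fn = hot predicted cold."""
    precision_hot, recall_hot = _ratio(tp, tp + fp), _ratio(tp, tp + fn)
    precision_cold, recall_cold = _ratio(tn, tn + fn), _ratio(tn, tn + fp)
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision_hot": precision_hot,
        "recall_hot": recall_hot,
        "f1_hot": _f1(precision_hot, recall_hot),
        "precision_cold": precision_cold,
        "recall_cold": recall_cold,
        "f1_cold": _f1(precision_cold, recall_cold),
    }


# Hypothetical counts for a balanced test set of 400 hot and 400 cold questions.
metrics = classification_metrics(tp=280, tn=360, fp=40, fn=120)
```

Reporting F1 for both classes matters on the random test set, where cold questions dominate and Accuracy alone can hide poor performance on hot questions.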

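In order to compare the performance of approaches, we define the gain, which measures how much approach i outperforms approach j. It includes the Accuracy gain, Precision gain, Recall gain and F1-score gain, which are defined as below, where m stands for hot or cold.

$$Gain_{Accuracy} = \frac{Accuracy_i - Accuracy_j}{Accuracy_j} \times 100\%$$

$$Gain_{Precision_m} = \frac{Precision_{m,i} - Precision_{m,j}}{Precision_{m,j}} \times 100\%$$

$$Gain_{Recall_m} = \frac{Recall_{m,i} - Recall_{m,j}}{Recall_{m,j}} \times 100\%$$

$$Gain_{F1_m} = \frac{F1_{m,i} - F1_{m,j}}{F1_{m,j}} \times 100\%$$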
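A minimal sketch of the gain computation follows, assuming that the gain is the relative improvement of approach i over approach j, expressed in percent. This reading is consistent with the reported F1 hot gain of 84.16% on a score of 53.96%, which could not be an absolute difference. The function name and the example values are hypothetical.

```python
def gain(metric_i: float, metric_j: float) -> float:
    """Relative improvement of approach i over approach j, in percent."""
    if metric_j == 0:
        raise ValueError("the baseline metric must be non-zero")
    return (metric_i - metric_j) / metric_j * 100.0


# Hypothetical example: 80% accuracy against a 64% baseline.
accuracy_gain = gain(80.0, 64.0)
```

The same function applies to Precision, Recall and F1-score for either class, because each gain has the same relative form.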

| Research questions
In this study, we answer the following research questions.

| RQ1: How effective is VSAF in hot question prediction?
In order to evaluate the efficiency of our approach VSAF, we compare it with state-of-the-art approach [

| RQ3: What are benefits of attribute combination in hot question prediction?
Our method VSAF combines view amount changes, score changes and answer amount changes soon after questions' creation. We wonder whether all these attributes are necessary in hot question prediction. We compare proposed VSAF method with approaches based on parts of attributes.

| RQ4: How does the observation duration K affect the prediction results?
We use the initial_feedback features within K hour(s) of question's posting to predict whether the question is hot or not. K is defined in Section 4. It represents the observation duration of questions. The task of this RQ is to explore the influence of the setting of K on the prediction effect of hot questions.

| RQ5: How effective is the hot question prediction using fully convolutional neural network?
This article completes the prediction task based on initial_feedback features. Recurrent Neural Network (RNN) is often used for classification tasks based on initial_feedback features [33]. Therefore, we compare the prediction effect of the fully convolutional neural network with that of RNN, Long Short-Term Memory (LSTM), k-Nearest Neighbor (KNN), Decision Tree and Random Forest. An RNN is a neural network that contains recurrence. It takes sequence data as input, and its recurrent units are connected in a chain, so it is well suited to processing sequential data such as speech, text and time series. Because of this chain structure, earlier hidden states affect later hidden states. LSTM is a variant of the RNN structure that addresses the problems of vanishing and exploding gradients during long-sequence training. LSTM can remember information that needs to be kept for a long time and forget unimportant information. The idea of the KNN algorithm is that if most of the k nearest samples of a sample in the feature space belong to a certain category, the sample is also assigned to this category. The decision tree algorithm uses a tree structure to classify data records. The leaf nodes of the tree represent category labels, and the internal nodes represent feature variables. Following a feature selection method, branches are created for the different values of a feature variable, lower nodes and branches are built recursively within each branch subset, and finally a decision tree is generated. The random forest algorithm uses resampling to draw multiple sample sets from all samples, constructs a decision tree for each sample set, and finally combines the predictions of the decision trees by voting to obtain the final prediction.
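Of the compared algorithms, KNN is the simplest to illustrate. The sketch below is a minimal majority-vote classifier that assumes Euclidean distance between feature vectors. The training points and labels are hypothetical, and the sketch is not the configuration used in the experiments.

```python
import math
from collections import Counter
from typing import List, Sequence


def knn_predict(train_x: List[Sequence[float]], train_y: List[str],
                query: Sequence[float], k: int = 3) -> str:
    """Assign `query` to the majority category among its k nearest samples."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical feature vectors: (view change, score change, answer change).
train_x = [(12, 1, 0), (10, 0, 1), (1, 0, 0), (0, 0, 0), (2, -1, 0)]
train_y = ["hot", "hot", "cold", "cold", "cold"]

busy = knn_predict(train_x, train_y, query=(11, 1, 1), k=3)
quiet = knn_predict(train_x, train_y, query=(1, 0, 0), k=3)
```

Because view changes are much larger than score or answer changes, an unscaled Euclidean distance is dominated by views; feature scaling would be a natural refinement.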

| RQ1: How effective is VSAF in hot question prediction?
We contrast our work with the baseline approach [1]. In order to answer RQ1, the evaluation results of these two methods using the balanced test set and the random test set are shown in Tables 2 and 3, respectively. In Table 2 Figure 4. In order to compare VSAF with the baseline approach, we compute Accuracy gains, Precision gains, Recall gains and F1-score gains and describe the results in Tables 4 and 5.
According to previous work [1], we extract the textual context of questions, including their titles, bodies and tags. We use the Natural Language Toolkit (NLTK) to remove punctuation and stop words and to convert the textual context to textual features in word vectors. In Section 4, we define initial_feedback features, including view amount changes, score changes and answer amount changes. We study prediction performance based on three methods: prediction based on textual features, prediction based on initial_feedback features (namely our method VSAF), and prediction based on both textual and initial_feedback features.
The experimental results are shown in Tables 6 and 7. These tables show that the approach using all features performs better than the approach using only text features. Besides, the best Accuracy, F1 hot and F1 cold are achieved on both test sets when only initial_feedback features are analysed. Therefore, we conclude that hot question prediction based on initial_feedback features without text features has the best performance on both test sets. Figure 5 shows a histogram of the prediction accuracy of the methods based on different features on the two test sets.

RQ 2
Hot question prediction based only on initial_feedback features, without textual features, has the best performance.

| RQ3: What are benefits of attribute combination in hot question prediction?
In order to answer RQ3, we use the fully convolutional neural network to separately analyse view amount changes, score changes and answer amount changes for hot question prediction. We compare the performance based on different attributes, and the experimental results are shown in Tables 8 and 9. From these tables, we find that hot question prediction based on view amount changes achieves better performance than prediction based on score changes or answer amount changes. Besides, the approach using all attributes performs better than the approach using only view amount changes. The best Accuracy, F1 hot and F1 cold are achieved on both test sets when all attributes are analysed. Therefore, we conclude that the approach based on all attributes performs best on both test sets. Figure 6 shows a histogram of the prediction accuracy of the methods based on different attributes on the two test sets.

RQ 3
The combination of view amount changes, score changes and answer amount changes is effective for hot question prediction.

In this section, we explore the effect of observation duration on prediction performance. We study how the prediction performance changes as the observation duration K (K = 1, 3, 5, 7, 9) changes. The results are shown in Tables 10 and 11. These tables show that, as the observation duration increases, the Accuracy, F1 hot and F1 cold of our method VSAF generally increase on both test sets. Based on these tables, we draw Figure 7, which shows how the Accuracy changes with increasing observation duration (K) on the balanced test set and the random test set. On both test sets, the Accuracy generally increases as the observation duration increases. The Accuracy increases the most when the observation duration changes from 1 to 3 hours; beyond that point, the rate of increase slows down. In this article, we set the observation duration to 3 hours by default. In practice, administrators can set the observation duration of questions according to their actual needs.
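
To make the role of the observation duration K concrete, the sketch below builds a 3 x K matrix of view, score and answer changes from the first K hours after a question is created. It assumes one cumulative snapshot per hour, which is an assumption made here for illustration; the exact definition of the initial_feedback features is given in Section 4. The function names and the sample numbers are hypothetical.

```python
def hourly_changes(cumulative):
    """Turn cumulative counts into per-hour changes.

    cumulative[0] is the count at creation time and cumulative[h] is the
    count h hours later (assumed sampling scheme, one snapshot per hour).
    """
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def initial_feedback(views, scores, answers, k=3):
    """Build a 3 x K feature matrix from the first K hours after creation."""
    named = (("views", views), ("scores", scores), ("answers", answers))
    for name, series in named:
        if len(series) < k + 1:
            raise ValueError(f"{name} needs at least {k + 1} snapshots for K={k}")
    # Keep only the first K+1 snapshots, i.e. K hourly changes per attribute.
    return [hourly_changes(series[: k + 1]) for _, series in named]


# Hypothetical hourly snapshots for one question (not data from the paper).
views = [0, 12, 30, 41, 47, 50]
scores = [0, 1, 1, 2, 2, 2]
answers = [0, 0, 1, 1, 2, 2]

for k in (1, 3, 5):
    print(k, initial_feedback(views, scores, answers, k))
```

A larger K gives the classifier more evidence per question but delays the prediction, which is the trade-off an administrator would weigh when choosing the observation duration.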

RQ 4
As the observation duration increases, the Accuracy of our method VSAF generally increases on both test sets.

5.4.5 | RQ5: How effective is the hot question prediction using fully convolutional neural network?
To answer RQ5, we try a variety of classification algorithms, including RNN, LSTM, KNN, decision tree and random forest. We study whether prediction with a fully convolutional neural network performs better than prediction with these other algorithms. The experimental results for both test sets are as follows. On the balanced test set, the best Accuracy, F1 hot and F1 cold are achieved with the fully convolutional neural network, which reaches an Accuracy, F1 hot and F1 cold of 80%, 77.77% and 81.81%, respectively. On the random test set, hot question prediction based on the fully convolutional neural network achieves an Accuracy, F1 hot and F1 cold of 84.91%, 53.96% and 90.97%, all of which are higher than those of the other classification algorithms. We therefore use a fully convolutional neural network to predict hot questions. Figure 8 shows a histogram of the prediction accuracy of the methods based on different classification algorithms on the two test sets.

RQ 5
Prediction using a fully convolutional neural network performs better than prediction using RNN, LSTM, KNN, decision tree and random forest.

| THREATS TO VALIDITY
In this section, we discuss threats to external and construct validity.

| External validity
Threats to external validity relate to the generalisation of our study. Our experimental results are based on Stack Overflow, and it is unclear whether they apply to other question and answer communities. In the future, we plan to apply our method to other question and answer communities and evaluate its effectiveness there.

| Construct validity
Threats to construct validity refer to the suitability of the evaluation metrics we use. We use Accuracy, Precision, Recall and F1-score, which previous works have also used to evaluate the effectiveness of popularity prediction methods [1,11]. Therefore, we believe there is little threat to construct validity.
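
For readers who want to see how these metrics behave on an imbalanced test set, the following minimal sketch computes Accuracy and the per-class Precision, Recall and F1-score (F1 hot and F1 cold), plus a gain over a baseline. The labels are hypothetical and are not data from the paper. The `relative_gain` function reads a gain as a relative improvement; this is an assumption, since the gain formula is defined outside this section, although the 84.16% F1 hot gain reported on the random test set suggests it.

```python
def class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 with `positive` treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of questions whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent (assumed formula)."""
    return (ours - baseline) / baseline * 100


# Hypothetical labels for ten questions: two hot, eight cold.
y_true = ["hot", "hot"] + ["cold"] * 8
y_pred = ["hot", "cold", "hot"] + ["cold"] * 7

print("Accuracy:", accuracy(y_true, y_pred))
print("hot  (P, R, F1):", class_metrics(y_true, y_pred, "hot"))
print("cold (P, R, F1):", class_metrics(y_true, y_pred, "cold"))
print("Gain of 80.0 over a hypothetical baseline of 64.0:", relative_gain(80.0, 64.0))
```

With only two hot questions, one missed hot question and one false alarm are enough to pull F1 hot well below F1 cold even though Accuracy stays high, which is why per-class F1 is reported alongside Accuracy.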

| FUTURE APPLICATION SCENARIOS
The usage scenarios of our proposed method are as follows:

| Without method
First, Bob is an experienced expert who can answer many software-related questions. Before our proposed method is applied to Stack Overflow, Bob answers some newly posted questions. However, these questions are of interest to only a small number of users, while the questions that concern many more users remain unanswered. As a result, a large number of users do not get timely answers on Stack Overflow. Second, Jerry is an employee of the Stack Overflow website who is responsible for planning advertising activities. Different advertisers are willing to pay different prices for advertisements on Stack Overflow, and it is difficult for Jerry to decide which advertisements to place on which questions.

| With method
Users and employees of the Stack Overflow website can both benefit once our proposed method is applied to Stack Overflow. First, Bob can easily view newly posted questions that are predicted to be hot and answer some of them. Users who are interested in these hot questions receive answers in time, and Bob therefore helps more users. Second, Jerry can plan advertising according to whether a question is predicted to be hot or not. More specifically, Jerry places high-bid advertisements on predicted hot questions so that they receive more page views, and places low-bid advertisements on predicted cold questions, which receive fewer page views.

A method to predict hot questions in Stack Overflow is studied here. Following previous works [1,11], we use the number of views to characterise whether a question is hot or not. We propose the VSAF method, which uses a fully convolutional neural network to analyse the view amount changes, answer amount changes and score changes soon after a question's creation. We evaluate the performance of the VSAF method using a training set and two different test sets. The experimental results show that hot question prediction based only on initial_feedback features, without textual features, has the best performance. On the balanced test set, the Accuracy of the VSAF method is 80%, which outperforms the baseline approach by 25.59%. On the random test set, the Accuracy of the VSAF method is 84.91%, which outperforms the baseline approach by 31.83%. Therefore, we believe that the VSAF method is useful for predicting hot questions in Stack Overflow.

However, there are still deficiencies in our research work. First, we did not analyse all the factors that may affect the popularity of a question. For example, we did not consider the topic characteristics of the question: some topics may have more followers than others. In the future, we plan to extract more potential influencing factors and perform in-depth and comprehensive explanatory research to analyse which factors have a greater impact on the predicted results. For example, we will add topic analysis of the question and explore its impact on the predicted results. Second, our experimental results are based on Stack Overflow, and it is unclear whether they apply to other question and answer communities. In the future, we plan to apply our method to other question and answer communities and evaluate its effectiveness there.