Short‐term railway passenger demand forecast using improved Wasserstein generative adversarial nets and web search terms

Funding information
The National Social Science Foundation of China (Grant: 18BJY169); The Ministry of Science and Technology of the People’s Republic of China (Grant: 2018YFB1201402); The Fundamental Research Funds for the Central Universities of Central South University (2018zzts167)

Abstract
Accurately predicting railway passenger demand helps managers adjust strategies quickly. It is time-consuming and expensive to collect large-scale traffic data. With the digitization of railway tickets, a large amount of user data has been accumulated. We propose a method to predict railway passenger demand using web search terms data. To improve the prediction accuracy, we improved Wasserstein Generative Adversarial Nets (WGAN), which are good at generating and identifying data, by adding a predictor and adversarial training under supervised learning to predict railway passenger demand. The improved WGAN can generate virtual data to expand the real data, and use the parallel data to predict railway passenger demand. We used the search counts of web search terms on different devices as training data to predict railway passenger demand in Beijing. The results show that changes in railway passenger demand lag behind changes in web search terms data by one month, so the data are suitable for forecasting in advance. Compared with other forecasting methods, the improved WGAN performs better, with a mean absolute percentage error of 1.98%. Because it can use mixed data for training and prediction, it adapts better when the data scale decreases.


INTRODUCTION
The railway plays an important role in transportation systems. Figure 1 shows the trend of railway passenger volume in China. It shows a growth trend in the long term. But in the short term, there is great fluctuation. Generally, January, February, July and August of each year are the peak months of passenger flow. Railway passenger demand prediction analyses passenger demand development dynamically and carries out quantitative calculations based on qualitative analysis. The correct prediction for railway passenger demand has a significant effect on economic development, resource allocation, investment structure, and management of railway enterprises, and it also provides scientific basis for programming passenger transportation projects. Because of the vital role in the basic functions of resources management, it is very significant to predict the volume of railway passenger demand accurately [1].
According to different time periods, the demand forecasting can be divided into long-term and short-term. Long-term demand forecasting is helpful for large passenger transportation project planning, infrastructure construction planning, and long-term investment decisions. Adjustments to train operation plans, passenger flow organization, and resource allocation are more sensitive to short-term passenger demand changes. In recent years, China railway scale has grown rapidly. However, the scale expansion has made operational problems more complicated. Financial costs and fixed asset depreciation challenge railway profitability. Formulating service strategies based on passenger flow can improve service capabilities of railway operators while avoiding wastage of resources. It can not only improve service level, but also help to save operating costs. Therefore, railway passenger demand forecasting has attracted increasing attention.
Figure 1. The trend of railway passenger volume in China
So far, passenger demand prediction methods have been studied from different perspectives. Most of them are quantitative methods, including time series models, neural networks, support vector machines (SVM) and so on. However, it is difficult to accurately forecast passenger demand and its irregular fluctuations based on historical data, because the volume of railway travel is affected by many factors. Therefore, the irregular factors become the key factors to be predicted. The forecasting of railway passenger demand remains a significant problem for the following reasons.
1. Railway passenger flow shows obvious non-linearity, randomness and volatility. Besides conventional factors (legal holidays etc.), it may also be affected by unexpected factors, such as emergencies. For example, from January to March 2020, railway passenger demand decreased by more than 50% year-on-year due to the COVID-19 outbreak. Forecasting methods based on historical data trends struggle to accurately predict railway passenger demand and its irregular fluctuations.
2. Machine learning methods require large data sets for training and testing to improve prediction accuracy. However, in practice, it is time-consuming and expensive to collect large-scale traffic data [2]. Missing or corrupted data further degrade data quality. More convenient data acquisition methods and more time-sensitive data are therefore needed.
With the rapid development of Internet technology, more and more people use it to make reservations, search for information and so on. The Internet has become an important part of most people's lives. According to the Digital in 2019 Q2 Global Overview published by We Are Social and Hootsuite, by the end of April 2019, the number of Internet users worldwide had reached 4.4 billion, out of a total world population of 7.5 billion. Besides, based on the 44th China Internet Development Statistics Report, by the end of June 2019, the number of Internet users in China had reached 854 million, which accounted for 61.2% of China's population. Moreover, 694 million Internet users used search engines, a penetration rate of 81.3%. In the developed cities of China, the Internet penetration rate was higher; it reached about 80% in Beijing. Internet users generate a lot of data when they go online. These data reflect not only online behaviours, but also the personal habits, hobbies and demands of users. Online data are useful for predictors [3]. If these data can be reasonably analysed and utilized, they will produce great benefits.
To promote the study of web search behaviour, Baidu Company and Google Company have launched the Baidu index and the Google trends, respectively. They provide search volume of a specific search term which brings convenience to scholars. Baidu has the biggest market share in China, nearly 90%. The search engine data has been applied to prediction of tourist volume [4], flu trend [5], and public transportation arrivals [6] and so on. Web search engine data can improve performance for passenger demand prediction [7].
According to the latest China Railway statistics, with the completion of the reform of electronic railway ticket and the improvement of online ticket purchasing channels in China, more than 88.4% of people choose to use the Internet to book railway tickets. The first step to book railway tickets online is to search for relevant information via the Internet. Passengers often use the web search engine to obtain related information before travel, such as ticket booking information, departure, destination, railway station etc. The website search engine not only provides information for the passengers, but also records query process and reflects travel willingness of passengers. Therefore, the fluctuation of railway passenger demand can be reflected by the web search data and it is feasible and reasonable to predict the railway passenger demand based on web search data. But not all search terms are relevant to railway passenger demands. This paper proposes a method to select relevant search terms for the railway passenger demand forecast from the search engine database. We analyse the relationship between web search terms data and railway passenger demand, and try to use search times data of web search terms to predict the number of railway passenger demand. The paper provides a more effective and accurate method for railway passenger demand forecasting in the context of electronic ticketing.
Although web search data can reflect the trend of these fluctuations, there is a complex non-linear relationship between those data and railway passenger demand, which requires a model with strong fitting ability. In recent years, neural networks have been widely used because of their strong fitting ability. For example, the most common back propagation (BP) neural network is used for prediction and classification. Since 2012, deep learning has spread rapidly all over the world. In 2014, the appearance of generative adversarial nets (GAN) made models imaginative [8]. The GAN training strategy is to define a game between two competing networks, a generator network and a discriminator network. The generator network maps a source of noise to the input space. The discriminator network receives either a generated sample or a true data sample and must distinguish between the two. The generator is trained to fool the discriminator. When the discriminator cannot determine whether the data come from the real dataset or the generator, the optimal state is reached. At this point, we obtain a generator model that has learned the distribution of the real data. The model can not only fit the existing data better, but also generate new data based on the existing data. The generated data have the same distribution as the real data and can in turn be used to continue training the model. The data generated by GAN, together with real data, have been applied to traffic flow imputation [2]. A collapse problem may occur in the learning process of GAN, in which the generator begins to degenerate: it always generates the same sample points and becomes unable to continue learning. When the generative model collapses, the discriminator also points similar sample points in similar directions, and the training cannot continue. Wasserstein GAN (WGAN) [9] efficiently addressed these problems.
However, the weights of WGAN's discriminator concentrate on two extreme values, which affects the gradient of the generator, and the weight clipping strategy leads to vanishing or exploding gradients. WGAN with gradient penalty (WGAN-GP) does not suffer from these problems [10]. Most successful applications of WGAN focus on generating data, especially images and text, and it cannot be used directly for passenger demand prediction. In this paper, we used the parallel data, namely the generated data and the real data, to augment the original data and help improve the railway passenger demand prediction models. To forecast the railway passenger demand, we formulated a special loss function and adversarial training.
Based on the analysis above, this paper attempts to use web search terms and improved WGAN-GP to predict monthly railway passenger demand. There are mainly three contributions in this paper.
1. We propose a method and an analysis framework for selecting web search term data related to railway passenger demand, and analyse the relevance of these data to railway passenger demand. Web search terms are suitable for forecasting because they are easy to collect, time-sensitive, and reflect fluctuations in railway passenger demand. This enriches the data sources for railway passenger demand forecasting.
2. The improved WGAN-GP is proposed to predict the railway passenger demand. As a generative model, the most direct application of GAN is data generation. We used it not only for generating data, but also for forecasting railway passenger demand. The typical GAN consists of a generator and a discriminator. We added a predictor to the model. A special loss function and adversarial training under supervised learning were designed to predict railway passenger demand. In the numerical experiment, the proposed model outperformed other approaches.
3. We used three datasets from different sources for numerical experiments. The prediction performance of the models is best when data from all devices are used. We used the generated data and the real data to augment the original data and train the model in turn. We analysed the impact of data set reduction on the prediction performance of the improved WGAN-GP. The improved WGAN-GP adapts better when the data scale decreases.
The remainder of the paper is organized as follows. Section 2 reviews the relevant literature; Section 3 briefly describes the model; Section 4 describes how to select web search terms; Section 5 reports and discusses data analysis, data processing and experiment results; Section 6 presents conclusions.

Research on web search data
With the popularity of the Internet and mobile phones, Internet technology is widely used in the transportation industry. As an important platform for information dissemination, the Internet is now used by almost all transportation companies to publish product information and to provide ticket and accommodation booking. The majority of travellers also get relevant information through the Internet, and they use search engines, such as Google and Baidu, to find it. The records of search terms are retained by search engine companies. The study of web search data began with Ginsberg [11], who studied the major public health problem of seasonal influenza and obtained a method of tracking the disease by analysing a large number of Google search queries. Besides, search engine data have been widely used in ranking universities [12], gathering public opinions [13,14], predicting stock market volumes [15], predicting flu trend [5,16], predicting academic fame [12], and forecasting tourist volume [2,17], tourist arrivals [5,18,19] and public transportation arrivals [6]. Search engine data have also been used to forecast general economic indicators such as unemployment rates [20,21] and tourist consumption [22]. Furthermore, search engine data have been applied to other specific consumption categories, like box-office revenue [23] and the organic foods market [24]. These studies indicate that web search terms are related to social behaviour because they reflect public demand, including railway passenger demand. Although web search term data have been used for prediction in many industries, there is still a research gap in the field of railway passenger demand prediction.
The previous studies could be divided into two directions, in which one direction is the analysis of the time series of railway passenger demand [25][26][27] and the other direction is to forecast the railway passenger demand using the economic or industrial indexes [28], such as the number of people, the number of tourists and railway operating mileage. The disadvantage of the first direction is that it would be no longer applicable when the actual situation changes. The disadvantage of the second direction is that its information collection is cumbersome and not timely. Thus it is not timely to predict railway passenger demand using traditional data.

Research on the traffic demand forecast methods
Railway passenger demand is affected by various random factors and shows obvious non-linearity, randomness, and volatility, which increase predictive difficulty to a certain extent. Many methods are used to forecast traffic demand, such as auto-regressive integrated moving average (ARIMA) [29][30][31], neural networks [1][32][33][34], non-parametric regression [35,36], the Kalman filtering model [37], Gaussian maximum likelihood [38], the gray model [39], and the trend analysis method [40]. ARIMA is suitable for fitting and forecasting periodic time series. However, railway passenger demand is affected by many factors, and it is difficult to predict passenger demand based on historical data alone. The parameters of ARIMA are also subject to subjective factors, which affect the predictive results. Non-parametric regression performs well only with large numbers of samples and performs poorly when the sample is small. The Kalman filtering model is suited to linear fitting, whereas the relationship between search terms and the railway passenger demand is complicated, so the Kalman filtering model is unsuitable for predicting railway passenger demand. Similarly, Gaussian maximum likelihood is unsuitable because railway passenger demand does not follow a Gaussian distribution. Neither model is suitable for fitting the complex non-linear relationship between search terms and the railway passenger demand. The gray system theory is simple and fast, and can forecast well, but it is not ideal for dynamic systems, such as railway passenger demand. The trend analysis method is applied in relatively stable systems. The railway passenger demand may be affected by accidental factors, such as important festivals or holidays. Once these happen, the trend analysis method performs poorly. Other methods, like chaotic time series and gray predictions, require stable data.
Among these techniques, neural networks have been frequently adopted as the modelling approach because of their adaptability, non-linearity and arbitrary function mapping capability [41]. However, neural networks also have many disadvantages, such as a tendency to fall into local minima and poor generalization ability. To this end, many methods have been proposed to improve this situation. In recent years, deep learning models [42][43][44], deep belief networks [45,46] and deep autoencoders [47][48][49] have mitigated these weaknesses to a certain extent. These models were mainly used for image recognition, classification tasks and so on. When a large number of training samples are used, they can effectively find the distributed characteristics of the samples, but when fewer samples are available, they generalize poorly.

Research on generative adversarial nets
With the development of neural networks, generative adversarial nets (GAN) [8] were proposed, originally to generate images. The main purpose of this model is to train a generator that can produce realistic-looking images. In the adversarial process, both the images produced by the generator and the real images are judged by the discriminator, which improves its discriminative ability. Recently, GAN has been used for traffic flow prediction [50] and traffic flow imputation [2]. Conditional GAN (CGAN) [51], a derivative form of GAN, performs conditioning by feeding extra information into both the discriminator and the generator as an additional input layer. But its performance is not necessarily better than that of non-conditional GAN [51]. CGAN solves some special problems that must consider extra condition information, such as taxi passenger demand prediction [52]. GAN has many drawbacks, including overly free parameters, difficult training, a lack of diversity in the generated samples, the difficulty of tracking training progress from the value functions of the generator and discriminator, and so on. Given these disadvantages, new methods have been put forward. Deep convolutional GAN (DCGAN) [53] uses convolutional layers instead of fully connected layers, which greatly improves the stability of GAN training and the quality of the generated results. But it does not solve the problem fundamentally, because the training of the generator and the discriminator still needs to be balanced carefully. Wasserstein GAN (WGAN) [9] efficiently addresses these problems. WGAN solves training instability completely, and the training levels of the generator and the discriminator no longer need to be balanced carefully. It addresses mode collapse and ensures the diversity of the generated samples. Last but not least, WGAN does not need an elaborately designed network architecture, and the simplest multilayer network can perform well.
However, later studies found that the weights of WGAN's discriminator concentrate on two extreme values, which affects the gradient of the generator, and that the weight clipping strategy leads to vanishing or exploding gradients. WGAN-GP [10,54,55] was proposed to solve these two problems, and it obtains a more stable training process and better performance in generating pictures.
Most successful examples of WGAN-GP focus on generating data especially in images and texts, but it has not been used for demand forecasting. Better generated data would train better discriminator, and stronger discrimination ability meant better predictive ability.
In summary, the web search terms have relationships with social behaviour because it can reflect public demand, including railway passenger demand. Using web search terms data for railway passenger transport demand forecasting could overcome difficulties in delayed data fluctuation and high data collection costs. But there is a complex non-linear relationship between those data and railway passenger demand, which requires a model with preferable fitting ability. Neural network models have powerful fitting capabilities, but when the number of samples is less, their generalization ability cannot be shown well. WGAN-GP helps solve the problem. It can generate virtual data to expand the real data set and use parallel data to predict railway passenger transport demand. This paper attempts to use web search terms and improve WGAN-GP to predict monthly railway passenger demand. Considering that Baidu search engine has a market share of nearly 90% in China, this paper uses the Baidu index to find the data of search terms related to the railway passenger demand for predicting purposes and attempts to use the advantages of WGAN-GP for predicting railway passenger demand in Beijing. However, GAN is proposed as unsupervised learning [8], which means that it cannot be used to predict. To forecast the railway passenger demand, we formulate special loss function and adversarial training under supervised learning. In this study, the generator and the discriminator use multilayer neural networks, and the hidden layers of the generator use the rectified linear unit (ReLU) activation function. To make discriminator own better discriminative and prediction ability, all activation functions of discriminator use sigmoid function.

METHODOLOGY
The training procedure of a typical GAN corresponds to a minimax two-player game. GAN is a framework that improves the generative model through an adversarial process, in which two models are trained simultaneously: a generator G that captures the data distribution, and a discriminator D that estimates the probability that a sample comes from the training data rather than from the generator. The training procedure for G maximizes the probability of D making a mistake. Here, G and D are defined by multilayer perceptrons. GAN was proposed for generating and discriminating data, not for prediction, and because it is an unsupervised learning method, it cannot be used directly to predict. We improved GAN by adding a predictor P and adversarial training under supervised learning to predict the railway passenger demand. Note that the two neural networks, i.e. the generator G and the discriminator D, could take any form; in this paper, G and D are six-layer neural networks. The purpose of G is to output data that have the same distribution as the real web search terms data x. An input noise variable z is defined, and the mapping to the generated web search terms data space is represented as $x' = G_\theta(z)$, where $G_\theta$ is a differentiable function represented by G with generator parameters $\theta$. Initially, the data distribution of $x'$ is far from that of x. Unlike in a typical GAN, the purpose of D is not only to discriminate whether the input data come from the real data or the generated data, but also to predict the railway passenger demand. The first five layers of D are used to discriminate between generated data and real data, and the last layer is added to predict the railway passenger demand. Therefore, the improved WGAN-GP consists of G, D, and the predictor P. The layout of the improved WGAN-GP is shown in Figure 2. y is the label of x and represents the real railway passenger demand. Then $D_w(x)$ and $D_w(x')$ are defined, where $D_w$ represents D with critic parameters w.
$D_w(x)$ represents the fitted value of railway passenger demand based on real web search terms data. $D_w(x')$ represents the fitted value of railway passenger demand based on generated web search terms data.
The hidden layers of G use the ReLU activation function, given by Equation (1),
where $h_i$ represents the input of the ith layer, $w_{ij}$ is the weight from the ith layer to the jth layer, and $b_j$ is the bias of the jth layer. The output of G uses the linear function in Equation (2).
To give D better discriminative ability, all activation functions of D use the sigmoid function in Equation (3). The output of D uses the linear function in Equation (2).
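Equations (1) to (3) are not reproduced in this text. Assuming the standard definitions of the three activation functions and the symbols defined above, and writing $h_j$ (an assumed name) for the output of the jth unit, they could be written as follows. This is a reconstruction under those assumptions, not necessarily the authors' exact notation.

```latex
\begin{align}
h_j &= \max\Bigl(0,\ \sum_i w_{ij} h_i + b_j\Bigr)
    && \text{ReLU, cf. Equation (1)} \\
h_j &= \sum_i w_{ij} h_i + b_j
    && \text{linear, cf. Equation (2)} \\
h_j &= \frac{1}{1 + \exp\bigl(-\bigl(\sum_i w_{ij} h_i + b_j\bigr)\bigr)}
    && \text{sigmoid, cf. Equation (3)}
\end{align}
```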
The numbers of neurons in the hidden layer of D and G are obtained by the following Equation (4): where node input represents the number of neurons in the input layer of D.
The discriminator D and the predictor P were pre-trained on the training sets with an untrained generator G for a certain number of iterations. To update the weights and biases of G and D, we need to define their loss functions. According to WGAN-GP [10], the loss function of D is given by Equations (5) and (6),
where $E[\cdot]$ represents the expectation, $\lambda$ represents the gradient penalty coefficient, and $\hat{x}$ follows a probability distribution $P_{\hat{x}}$, which is sampled uniformly along straight lines between pairs of points sampled from the distributions $P_x$ and $P_{x'}$ and can be obtained by Equation (6). $\epsilon$ is a random value between 0 and 1. The term $[\|\nabla_{\hat{x}} D_w(\hat{x})\|_p - 1]^2$ constrains the gradient norm of D to be close to 1, which prevents the gradient of D from vanishing or exploding. The closer the data distribution of $x'$ is to that of x, the smaller the value of $L_D$ is. The more the data distribution of $x'$ differs from that of x, the larger the value of $L_D$ is. It is important to note that the noise data z follow the standard normal distribution.
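The interpolation and the gradient penalty term described above can be illustrated with a small pure-Python sketch. It assumes the 2-norm for the gradient (the text only writes a generic p-norm), uses a finite-difference gradient in place of automatic differentiation, and all names (`interpolate`, `gradient_penalty`, `toy_critic`) are hypothetical and not taken from the paper.

```python
import math


def interpolate(x_real, x_fake, eps):
    """Return eps * x_real + (1 - eps) * x_fake, element by element."""
    return [eps * r + (1.0 - eps) * f for r, f in zip(x_real, x_fake)]


def numerical_gradient(critic, x, step=1e-6):
    """Central-difference estimate of the gradient of a scalar critic at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += step
        down[i] -= step
        grad.append((critic(up) - critic(down)) / (2.0 * step))
    return grad


def gradient_penalty(critic, x_real, x_fake, eps):
    """Squared deviation of the critic's gradient norm from 1 at the interpolate."""
    x_hat = interpolate(x_real, x_fake, eps)
    grad = numerical_gradient(critic, x_hat)
    norm = math.sqrt(sum(g * g for g in grad))  # assumption: 2-norm
    return (norm - 1.0) ** 2


def toy_critic(x):
    """Hypothetical linear critic; its gradient is (3, 4), so the norm is 5."""
    return 3.0 * x[0] + 4.0 * x[1]


x_hat = interpolate([1.0, 2.0], [3.0, 6.0], eps=0.25)
penalty = gradient_penalty(toy_critic, [1.0, 2.0], [3.0, 6.0], eps=0.25)
```

In a real implementation the gradient would come from the framework's automatic differentiation and the penalty would be averaged over a batch, with a fresh random interpolation weight drawn for each sample.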
G makes every effort to bring $L_D$ down. Since $E[D_w(x)]$ in Equation (5) does not depend on G, the loss function of G is given by Equation (7).
Because this is unsupervised learning, it cannot be used to predict. To give D predictive ability, supervised learning needs to be added to Equation (5). This paper proposes changing the loss function of D to Equation (8).
where the supervised learning term is weighted by its penalty coefficient and y is the label of x. Since this article uses the improved WGAN-GP for prediction, the purpose of this coefficient is to make the model focus on supervised learning when calculating the gradient of D. It should be noted that when the input data of D is x, the target output is y, and when the input data of D is $x'$, the target output is $D_w(x')$.

ALGORITHM 1 Improved WGAN-GP. We use a gradient penalty coefficient of 0.1, a supervised learning penalty coefficient of 100, $n_D = 5$, $n_G = 15000$, Adam learning rates of 0.00002 for D and 0.000001 for G, $\beta_1 = 0.5$, and $\beta_2 = 0.9$.
Require: The gradient penalty coefficient $\lambda$, the penalty coefficient of supervised learning, the number of critic iterations per generator iteration $n_D$, the number of generator iterations $n_G$, the batch size m, the Adam hyperparameters (the learning rates of D and G, $\beta_1$, $\beta_2$), and the labels of real data y.
While $\theta$ has not converged or $n < n_G$ do
  for t = 1, … , $n_D$ do
    Sample x from the real data, a variable z from the noise data, and a random number $\epsilon \sim U[0, 1]$.
  Sample a batch of variables z from the noise data.

Figure 3. Training and predicting procedure flow chart
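Equation (8) is not reproduced in this text, so the following is only an illustrative sketch of how a supervised term might be combined with a WGAN-GP discriminator loss. It assumes the standard critic loss of [10] (mean score on generated data minus mean score on real data, plus the weighted gradient penalty) and an assumed mean-squared-error term between $D_w(x)$ and the label y; the paper's actual form and sign convention may differ. The function and argument names are hypothetical.

```python
from statistics import fmean


def supervised_critic_loss(d_real, d_fake, grad_norms, y_real, gp_coef, sup_coef):
    """Illustrative batch loss: WGAN-GP critic loss plus a supervised term.

    d_real     -- discriminator outputs D_w(x) on real samples
    d_fake     -- discriminator outputs D_w(x') on generated samples
    grad_norms -- gradient norms of D at the interpolated samples
    y_real     -- labels (real passenger demand) of the real samples
    gp_coef    -- gradient penalty coefficient
    sup_coef   -- penalty coefficient of supervised learning
    """
    wasserstein = fmean(d_fake) - fmean(d_real)
    penalty = fmean([(g - 1.0) ** 2 for g in grad_norms])
    # Assumption: squared error between the output on real data and the label.
    supervised = fmean([(d - y) ** 2 for d, y in zip(d_real, y_real)])
    return wasserstein + gp_coef * penalty + sup_coef * supervised


loss = supervised_critic_loss(
    d_real=[1.0, 3.0],
    d_fake=[0.0, 2.0],
    grad_norms=[1.0, 2.0],
    y_real=[1.0, 2.0],
    gp_coef=0.1,   # value reported with Algorithm 1
    sup_coef=100,  # value reported with Algorithm 1
)
```

With a supervised coefficient that is large relative to the gradient penalty coefficient, the supervised term dominates the gradient of D, which matches the stated purpose of focusing the discriminator on prediction.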
This research uses the Adam optimization algorithm to update the weights and biases of the improved WGAN-GP and improve training performance. The detailed steps for training the improved WGAN-GP are given in Algorithm 1. The training and predicting procedure for the conditional generator G, the conditional discriminator D, and the predictor P is illustrated in Figure 3. In this model, the predictor P is trained not only with real data, but also with generated data $x'$. After the improved WGAN-GP was well trained, we fed the test sets into the trained P to predict the railway passenger demand. Note that railway passengers generally book tickets and query information online some time before travel. Web search terms data may therefore possess lag structures when compared with the actual passenger demand [4], so we can use the current web search terms data to predict future railway passenger demand. The capability of the improved WGAN-GP to forecast multiple steps ahead is discussed in the experiments below.

DATA ANALYSIS OF WEB SEARCH TERMS
Not all search terms are suitable for railway passenger demand prediction. We followed a four-stage process in selecting the candidate terms with the most predictive power, by relying on the related searches function on Baidu Index. The selection steps of the web search terms are shown in Figure 4.
Step 1. According to the search purpose of railway passengers, classify the web search terms and determine the seed terms of each category.
Step 2. Enter the seed terms in Baidu Index as seed queries and retrieve the related queries. We then iteratively obtain the related queries for the second round of queries. Repeat this process for a few rounds until the number of search terms converges. Remove duplicate terms.
Step 3. Analyse the correlations between the search terms and railway passenger demand by Spearman rank correlation analysis. Select the web search terms with good correlation.
Step 4. To predict future railway passenger demand, analyse the leading and lagging relationship between web search term data and railway passenger demand.
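The iterative expansion in Step 2 can be sketched as a simple set-growing loop. This is only an illustration: `related` stands in for a lookup of related queries (the paper uses the related searches function on Baidu Index, whose programmatic interface is not described here), and the toy terms are placeholders rather than real search terms.

```python
def expand_terms(seed_terms, related, max_rounds=10):
    """Collect related queries round by round until no new term appears."""
    collected = set(seed_terms)
    frontier = set(seed_terms)
    for _ in range(max_rounds):
        new_terms = set()
        for term in frontier:
            new_terms.update(related(term))
        new_terms -= collected  # drop duplicates of terms already seen
        if not new_terms:       # the number of terms has converged
            break
        collected |= new_terms
        frontier = new_terms
    return collected


# Hypothetical related-query table used only for this example.
toy_related = {
    "seed A": ["term 1", "term 2"],
    "seed B": ["term 2", "term 3"],
    "term 1": ["seed A", "term 4"],
    "term 4": ["term 1"],
}


def lookup(term):
    """Return the related queries of a term, or an empty list if none are known."""
    return toy_related.get(term, [])


terms = expand_terms(["seed A", "seed B"], lookup)
```

Using a set makes the duplicate removal in Step 2 implicit, and the `max_rounds` cap is an added safeguard for the case where the related queries never stop growing.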
In this paper, the search terms related to railway passenger demand could be found via Baidu Index. It should be noted that data we collected include search times of search terms on a personal computer (PC), mobile device (MD), and both of them, i.e. all devices (AD). The data about search terms were collected from users in Beijing.
We collected web search term data according to the process shown in Figure 4 to prepare for railway passenger demand forecasting. We grouped related search terms into four types: booking software (BS), tickets booking related vocabulary (R), scenic spot (SS), railway station name (RS). The first two are ways to book tickets. The latter two are destinations.
We entered those four types of keywords in Baidu Index as seed terms and retrieved the related terms. We then iteratively obtained the related terms for the second round of terms. We repeated this process for a few rounds. The number of terms converged to a total of 174. Baidu Index (http://index.baidu.com/) provides Baidu daily query volume data, from June 2006 to the present. It does not report the raw volumes for a given search query; however it reports a query index, which displays how frequently a search query has been searched relative to the total search volume from different areas and different dates. We obtained the search frequency for these 174 terms of Beijing users in Baidu Index from May 2013 to July 2017 as the alternative set of web search terms.
Web search terms may possess lag structures when compared with the actual travel activities [4]. We therefore need to analyse the time difference (TD) between the web search terms and the railway passenger demand. From January 2013 to January 2017, the advance booking period for railway tickets in China was 20 to 60 days. For convenience of calculation, the TD was set to 1 month or 2 months in this paper. The correlations between the search terms and railway passenger demand were analysed by Spearman rank correlation analysis, which gives the relational degree between each search term and railway passenger demand. The correlation degree lies in [−1, 1]: the closer it is to 1, the higher the positive similarity of the two time series; the closer it is to 0, the lower the similarity; the closer it is to −1, the higher the dissimilarity of the two time series [49]. The stronger the correlation between the search terms and Beijing railway passenger demand, the better the predictive performance.
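The lag analysis described above can be sketched in plain Python as follows. The Spearman coefficient is computed as the Pearson correlation of average ranks, and a time difference of `lag` months pairs the search series at month t with the demand at month t + lag. The series below are invented toy values, not data from the paper, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def lagged_spearman(search, demand, lag):
    """Correlate search[t] with demand[t + lag] for a lag of at least 1 month."""
    return spearman(search[:-lag], demand[lag:])


# Toy monthly series in which demand follows search with a one-month delay.
toy_search = [5, 9, 3, 8, 2, 7, 4]
toy_demand = [60, 50, 90, 30, 80, 20, 70]
rho_td1 = lagged_spearman(toy_search, toy_demand, lag=1)
rho_td2 = lagged_spearman(toy_search, toy_demand, lag=2)
```

In this toy example the correlation is highest at a lag of one month, which is the kind of pattern the TD comparison is meant to reveal.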
We selected terms with a correlation degree of more than 0.5, and the results are shown in Tables 1, 2 and 3, which give the analysis results between railway passenger demand in Beijing and the search terms based on all devices (AD), personal computers (PC) and mobile devices (MD), respectively. TD = 1 indicates that the TD is 1 month, and TD = 2 indicates that the TD is 2 months. When TD = 1, the time range of the search term data is from May 2013 to August 2017, and that of Beijing railway passenger demand is from June 2013 to September 2017. When TD = 2, the time range of the search term data is from May 2013 to July 2017, and that of Beijing railway passenger demand is from July 2013 to September 2017. From these tables, we can see that the search terms whose correlation degree exceeds 0.5 are all of the SS and RS types, that is, scenic spots and main railway stations. When TD is 1, the correlation degrees of all these search terms are greater than when TD is 2, which is consistent with the analysis that people usually make travel plans within a month.

Therefore, we selected the previous month's search term data as the input of the improved WGAN-GP for predicting Beijing railway passenger demand in the following month. Using search terms with low correlation to forecast railway passenger demand would reduce the prediction performance, so only the search terms with high correlation are retained. The number of RS search terms in Table 2 is greater than the number of SS search terms. The correlation between SS search terms and Beijing railway passenger demand is stronger than that of RS search terms. These results suggest that people are accustomed to using personal computers to search for information about railway stations or place names, and mobile devices to search for information about scenic spots. Finally, we selected 21 search terms in Table 1, 12 search terms in Table 2, and 10 search terms in Table 3 as the training data.

Data of railway passenger demand in Beijing
The Beijing railway passenger demand data come from the Beijing Municipal Bureau of Statistics website, which provides the passenger volume across all railway stations in Beijing. The profile of passenger demand in all Beijing railway stations from June 2013 to September 2017 is shown in Figure 5. The adjustment cycle of the railway transportation plan is different from that of road and air transportation. Considering that the Chinese railway train schedule is not updated every day, we chose monthly data as the time granularity. Although the change in passenger demand shows certain regularities, it also shows small fluctuations and an upward trend on the whole. Since the summer vacation in China is from July to August, a large number of tourists travel to and from Beijing in these two months, which makes railway passenger demand in both months higher than in other months. Besides, because the Spring Festival in China falls in January or February and people tend to go back home early, there is a small increase in December or January. In other months, people are busy at work and have little time to travel, so the railway passenger demand in Beijing is lower during these periods. Last but not least, many unpredictable conferences are held in Beijing. Therefore, it is difficult to accurately predict the railway passenger demand in Beijing.

FIGURE 5
The number of monthly railway passengers demand in Beijing
The data, including Beijing railway passenger demand and the search counts of the search terms, are normalized by Equation (9):

$$y = \frac{(y_{\max} - y_{\min})(x - x_{\min})}{x_{\max} - x_{\min}} + y_{\min} \tag{9}$$

where $y$ represents the normalized value, $y_{\max}$ and $y_{\min}$ represent the normalized maximum and minimum, respectively, $x$ is the original value, and $x_{\max}$ and $x_{\min}$ are the original maximum and minimum, respectively. Data are normalized to $[y_{\min}, y_{\max}]$ by Equation (9). To train and fit better, we set $y_{\max} = 0.9$ and $y_{\min} = 0.1$.
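As a minimal sketch of this normalization, assuming the usual min-max mapping onto [y min, y max] that the variable definitions imply, the following helper scales a series to [0.1, 0.9] and maps predictions back to the original scale. The function names and the sample values are hypothetical.

```python
from typing import List, Sequence


def minmax_scale(xs: Sequence[float], y_min: float = 0.1, y_max: float = 0.9) -> List[float]:
    """Map xs linearly so that min(xs) -> y_min and max(xs) -> y_max."""
    x_min, x_max = min(xs), max(xs)
    if x_max == x_min:
        raise ValueError("cannot normalize a constant series")
    scale = (y_max - y_min) / (x_max - x_min)
    return [y_min + scale * (x - x_min) for x in xs]


def minmax_inverse(ys: Sequence[float], x_min: float, x_max: float,
                   y_min: float = 0.1, y_max: float = 0.9) -> List[float]:
    """Undo minmax_scale, e.g. to express predictions in passenger counts."""
    scale = (x_max - x_min) / (y_max - y_min)
    return [x_min + scale * (y - y_min) for y in ys]


# Made-up monthly values, used only to show the round trip.
raw = [100.0, 150.0, 200.0]
scaled = minmax_scale(raw)
restored = minmax_inverse(scaled, min(raw), max(raw))
```

In a forecasting setting, x min and x max would normally be taken from the training samples only and reused for the test samples; that choice is an assumption here, since the text does not say how the extremes were chosen.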

Prediction experiments
The training samples are the search term data from May 2013 to May 2017 and the railway passenger demand data from June 2013 to June 2017, a total of 49 months. The test samples are the search term data from June 2017 to August 2017 and the railway passenger demand data from July 2017 to September 2017, 3 months in total. The improved WGAN-GP was implemented in Python 3.7.
The data x ′ generated by G when the number of iterations reaches its maximum are shown in Figure 6(a), and Figure 6(b) shows the real normalized training samples x; each line represents one set of data. Since the output layer of G is a linear function, its output can take arbitrary values. Although the generated data are similar to the real data, the data generated by G have different features and do not collapse to the same values, which indicates that the data generated by the improved WGAN-GP are diverse.

Figure 7 shows the change of D's loss function value; different curves are obtained by using AD, PC and MD data. The bottom curve represents the loss function value when training with AD data, the middle curve represents training with PC data, and the top curve represents training with MD data. Evidently, training with AD data makes the loss function converge faster and reach a smaller value. In addition, at the end of the iterations, the loss function value is smallest with AD data and largest with MD data. This suggests that Beijing railway passenger demand can be fitted better with the AD data. Besides, the weight of the supervised learning term in the loss function (i.e. Equation (4)) is large, which means that a better fit yields a smaller loss value. The fitting performance directly affects the predictive performance. Table 4 shows the mean absolute percentage error (MAPE) and the root mean square error (RMSE) for the different data sources. When AD samples are used as the input data, the proposed improved WGAN-GP has the best prediction performance.

For SARIMA modelling, the increasing trend of Beijing railway passenger demand from May 2013 to June 2017 means that the series needs to be differenced. After first-order differencing of railway passenger demand, a stationary time series was obtained.
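As an illustration of the differencing step, the sketch below applies a first-order difference and then a seasonal difference to a synthetic series. The helper name `difference`, the 12-month seasonal lag and the toy series (a linear trend plus a repeating 12-month pattern) are assumptions for illustration only, not the paper's data.

```python
from typing import List, Sequence


def difference(series: Sequence[float], lag: int = 1) -> List[float]:
    """Return series[t] - series[t - lag] for every t where both terms exist."""
    if not 1 <= lag < len(series):
        raise ValueError("lag must be at least 1 and shorter than the series")
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Synthetic monthly series: linear trend plus a pattern that repeats every 12 months.
season = [0, 3, 1, 4, 9, 2, 6, 5, 3, 5, 8, 7]
series = [2 * t + season[t % 12] for t in range(36)]

first_diff = difference(series, lag=1)          # removes the linear trend
seasonal_diff = difference(first_diff, lag=12)  # removes the 12-month pattern
```

For this idealized series the doubly differenced values are all zero, because the trend and the seasonal pattern are exactly regular; real passenger data would leave a noisy residual whose ACF and PACF are then inspected.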
The autocorrelation function (ACF) graph and partial autocorrelation function (PACF) graph of the differenced time series show seasonality, so a seasonal difference is needed. Finally, some candidate models are obtained and their parameters are analysed, as shown in Table 5. We can see that the Akaike information criterion (AIC) and Bayesian information criterion (BIC) of (2, 1, 1) × (0, 1, 1)₁₂ are 457.81 and 465.86, respectively, and the AIC and BIC of (1, 1, 1) × (0, 1, 1)₁₂ are 458.75 and 465.19, respectively. These values are smaller than those of the other candidate models. The ACF and PACF values of (2, 1, 1) × (0, 1, 1)₁₂ fall into the confidence interval, while the ACF and PACF values of (1, 1, 1) × (0, 1, 1)₁₂ do not completely fall into the confidence interval. Besides, the PACF graph of the first-order differenced series of Beijing railway passenger demand is shown in Figure 9. According to this figure, the value of p can be 2. This paper therefore chooses the SARIMA model (2, 1, 1) × (0, 1, 1)₁₂ to predict Beijing railway passenger demand.

FIGURE 9
The PACF graph of the first order differentiated series of Beijing railway passenger demand

The prediction errors using search terms on AD, PC and MD are shown in Tables 6 and 7. We can see from the tables that: (i) When AD samples are used as the input data, the proposed improved WGAN-GP has the best prediction performance compared with other models: MAPE is 1.98% and RMSE is 160 thousand people. This indicates that the AD samples and the improved WGAN-GP are very suitable for predicting the railway passenger demand in Beijing. (ii) From the overall point of view, the prediction performance using AD data is better than that using PC data and MD data. (iii) Irrespective of the type of input data used, the predictive performance of the improved WGAN-GP is better than that of the 3-layer BPNN and the 6-layer BPNN. As mentioned before, the

Sensitivity analysis
Since we modified the loss function to introduce supervised learning in WGAN, we analyse the sensitivity of the model with respect to different hyperparameters, namely the penalty coefficient of the supervised learning term and the number of layers in the discriminator and generator. The results are shown in Table 8. When the numbers of discriminator and generator layers do not change, model prediction performance improves as the penalty coefficient increases. Too low a penalty coefficient leads to over-fitting and high prediction error, but an excessive penalty coefficient increases D's loss function value. When the penalty coefficient is 0.1, the prediction model has a strong generalization ability.
We tested the prediction performance of the model under different numbers of discriminator and generator layers. The prediction performance of the model increases with the number of layers. However, as the number of layers increases, over-fitting may occur: the model training error is very low, but the generalization ability is poor. Although over-fitting can be avoided by optimizing the neural network parameters, doing so does not significantly improve the prediction accuracy. An excessively high number of layers increases the amount of computation and reduces the training speed, so there is no need to increase the number of generator and discriminator layers without limit.

FIGURE 10
The change of MAPE when data are reduced
Since the improved WGAN-GP can generate some fake data for training D to get better predictive performance, we investigated whether a small amount of training data can achieve good predictive effects, which would reduce the computing cost and the trouble of data collection. Therefore, the amount of data used in the research was reduced by gradual decrements of 1/20, and the change of MAPE was observed. Figure 10 presents the MAPE profile for the improved WGAN-GP and a multilayer BP neural network (MBP) when the amount of AD training data is reduced. Here the MBP has the same parameters and structure as D in the improved WGAN-GP. Overall, the MAPE increases as the amount of data is reduced. Figure 10 shows marginal differences between MBP and the improved WGAN-GP at the lower end. At the upper end, when data is less than 16/20, the MAPE of MBP is marginally greater than that of the improved WGAN-GP. Otherwise, the MAPE of the improved WGAN-GP is significantly less than that of MBP. When the amount of data is very small, that is, when it is reduced by more than 16/20, the MAPE of both the improved WGAN-GP and MBP increases sharply, which is due to the lack of data and the insufficient training of the two models. The MAPE increases when the data are reduced because the deleted data are important to the training of the model. The above shows that, compared with MBP, the improved WGAN-GP has better predictive performance when less data is available.
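The two error measures used in these comparisons, and the schedule of the data-reduction experiment, can be sketched as follows. This is an illustrative sketch under assumptions: the helper names are hypothetical, the sample values are made up, and the text does not state which samples were removed at each step, so only the resulting training-set sizes are computed.

```python
import math
from typing import List, Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error, in the units of the data."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def reduced_sizes(n_samples: int, steps: int = 20) -> List[int]:
    """Training-set sizes after removing k/steps of the data, for k = 0 .. steps - 1."""
    return [n_samples * (steps - k) // steps for k in range(steps)]


# Made-up actual and predicted values, only to exercise the two metrics.
actual = [100.0, 200.0]
predicted = [110.0, 190.0]
error_pct = mape(actual, predicted)
error_abs = rmse(actual, predicted)

# 49 training months, reduced in decrements of 1/20 (sizes rounded down).
sizes = reduced_sizes(49)
```

Each size in `sizes` would correspond to one retraining of both models, after which MAPE on the fixed test months is recorded to produce a curve like the one in Figure 10.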

CONCLUSIONS
The forecast of railway passenger demand plays an important role in the rational allocation of railway passenger transport resources. Reasonable allocation of railway passenger transportation resources can promote the sustainable development of railway passenger transportation. Because the volume of railway travel is affected by many factors, it is difficult to accurately forecast passenger demand and its irregular fluctuations based on historical data. Forecasting methods based on machine learning can effectively predict such irregular fluctuations. However, these methods require large data sets for training and testing. In practice, it is time-consuming and expensive to collect large-scale traffic data, and missing or corrupted data degrade its quality. There is a need for more convenient data acquisition methods, as well as for time-sensitive data. This paper proposes a method for selecting relevant search terms for railway passenger demand forecasting from the search engine database. We analysed the relationship between web search term data and railway passenger demand. We also improved WGAN-GP to generate virtual data to expand the real data set, and used parallel data to predict railway passenger demand. We used three different sources of data to test the model. From the comparison with several other forecasting models, the following conclusions are drawn.
1. We have proposed methods and an analysis framework for selecting web search term data related to railway passenger demand, and have analysed its relevance to railway passenger demand. The results show that the change in railway passenger demand lags behind the change in the web search term data by one month, and that the two are well correlated. Web search terms are easy to collect, time-sensitive and can reflect the fluctuation of railway passenger demand. Currently, web search data can be obtained easily by using an index platform. Compared with traditional large-scale data surveys, the advantages of applying web search data to railway passenger demand prediction lie in its timeliness and low cost. More significantly, the results of this study encourage future research on big data sets other than search engine query data, such as blogs and other social media, and on the impact they have on passenger demand prediction and passenger behaviour analysis. The large scale of Internet data could make up for the sample size limitations faced by survey data users, as well as provide a new way to understand passenger demand.

2. The improved WGAN-GP is proposed to predict railway passenger demand. Previous studies concentrated on training neural networks with real data only, and the generalization ability of the neural networks was limited. A GAN is trained not only on real data but also on generated data, which increases the generalization ability and predictive accuracy of the network. As a kind of generative model, the most direct application of GANs is data generation. We used it not only for generating data, but also for forecasting railway passenger demand. A special loss function and adversarial training under supervised learning were designed to predict railway passenger demand. The prediction performance of the improved WGAN-GP, 3-layer BPNN, 6-layer BPNN, ELM, RBF, SVR, LSTM and SARIMA was compared. In the numerical experiment, the improved WGAN-GP outperformed the other approaches. In practical applications, the model can be used to obtain accurate forecast results.

3. We used three datasets from different sources for numerical experiments. The prediction performance of the models is best when data from all devices are used. We trained the model on the generated data and the real data in turn, so that the generated data augment the original data. We analysed the impact of data set reduction on the prediction performance of the improved WGAN-GP. The improved WGAN-GP has stronger adaptability when the data scale decreases.
However, this study has limitations: we focused only on Beijing railway passenger demand as a test case. Future research will focus on three points: (i) finding more relevant search term data, which might improve the prediction results; (ii) applying this method to demand forecasting for air and road travel; and (iii) examining the results obtained with different generative and discriminative models, such as residual neural networks and convolutional neural networks.