Prediction of longwall mining‐induced stress in roof rock using LSTM neural network and transfer learning method

Real‐time monitoring of three‐dimensional stress in the field is an effective approach to detect evolving stress in roof rock and to evaluate rock burst risk. However, sensors or data transmission cables may be damaged in the volatile environment of coal mines, which can lead to the loss of monitoring data, so that critical information for rock burst prediction may be missed entirely. A number of methods that use historical data to predict missing data or future structural states have been proposed. However, these methods perform poorly when the training data are insufficient. To address this issue, a methodology framework is proposed to predict the mining‐induced stress when some monitoring data are missing. The framework uses a long short‐term memory neural network integrated with the transfer learning method. The proposed method transfers the knowledge learned from the complete monitoring data of an adjacent sensor to the target sensor to improve forecasting. A case study has been conducted to evaluate the method. The results show that the developed model can significantly improve the prediction performance for the target domain, and that the performance can be improved further by increasing the amount of target domain training data available.

Three-dimensional (3D) stress monitoring is a crucial and effective approach for revealing the stress state, predicting rock burst events, and adopting countermeasures in coal mines. 21 Generally, long-term stress monitoring is helpful for analyzing and predicting the evolution of stresses in surrounding rock. Therefore, the accuracy and reliability of early disaster warnings rely on the quantity and quality of the in situ monitoring data. 22 However, the sensors and data transmission cables are often damaged in the volatile environment of underground mines, which leads to the loss of monitoring data. 19,23,24 The loss of data greatly affects the early warning of surrounding rock instability. Therefore, it is essential to reconstruct the missing data and predict the stress state over the following few days to enable safety evaluations, reliability analyses, and real-time early warnings of disasters. The reconstruction and completion of missing data in field monitoring can be treated as a time series prediction task, which has been studied widely. Numerous model-based and data-driven approaches have been proposed in the structural health monitoring (SHM) field. These approaches include time series state space modeling, autoregressive modeling, Gaussian process modeling, Bayesian multi-task learning, support vector regression, and long short-term memory (LSTM) neural networks. 25-27 However, the performance of these methods is highly dependent on the quantity of training data available. When monitoring data are lost due to the failure of sensors or optical fiber cables, sufficient training data may not be available to train the models, and conventional methods do not perform well with insufficient data. Improving the prediction accuracy in this situation is therefore a critical issue.
To this end, an LSTM neural network based on transfer learning is proposed for stress prediction. The LSTM neural network is an advanced recurrent neural network (RNN) that is capable of learning long- and short-term patterns from historical time series data. This type of network has been shown to perform very well on time series problems and to outperform other algorithms. 34,41 Recently, LSTM neural networks have been widely used with great success in civil engineering, for example, for the stress-strain behavior of soils, 42 tunneling-induced ground settlement, 43 the performance of EPB shield tunneling, 44,45 and the seismic bearing capacity of foundations. 46 However, few studies have used LSTM neural networks to predict mining-induced stress. More importantly, insufficient training data limit the performance of such a network. Therefore, in this study, we integrate the LSTM neural network and the transfer learning method to predict the mining-induced stress and improve the prediction accuracy. Transfer learning is an important development in machine learning that aims to improve the prediction performance by transferring knowledge learned from source domains to target domains. The source domain is the domain with existing knowledge and more labeled data, from which knowledge is transferred. The target domain is a new but related domain with a small amount of labeled data, to which the knowledge is transferred. 28,29 To test and verify the proposed method, we selected the monitored vertical stress data from a coal mine as a case study. The results show that the developed model can significantly improve the prediction performance for the target domain, and that the performance can be improved further by increasing the amount of target domain training data available.
This paper is organized as follows. Section 2 presents a detailed analysis of a field case study of a stress monitoring scheme and the results of the study. The methodology framework of the proposed model, base theory of the LSTM network, and transfer learning are described in Section 3. In Section 4, a case study of a model application and experiments over two datasets of adjacent monitoring sections are described, and corresponding experimental results and analysis are presented. Section 5 concludes the paper.

Dongtan coal mine
The Dongtan coal mine is located in Ji-ning, Shandong, China (Figure 1). The longwall panel in this study was the 6303 working face. One side of the 6304 panel had been previously extracted. The protective coal pillar between the 6303 panel and the gob was 5 m wide. The panel overburden depth was approximately 660 m, and the panel was 245 m wide and 1400 m long. The immediate roof of the panel comprised mudstone and had an average thickness of 0.8 m. The main roof was fine-grained sandstone with an average thickness of 12.9 m. The immediate floor of the panel was composed of siltstone and was approximately 1.54 m thick. The main floor comprised fine-grained siltstone and was approximately 7.7 m thick. The detailed geological conditions of panel 6303 are shown in Figure 1. As shown in Figure 1, the geological conditions of the monitoring sections from S-1 to S-6 are relatively simple, and the sections are not located in densely faulted areas. To avoid the influence of geological conditions on this study, we selected the monitoring sections S-3 and S-4 as the research subject.

| Monitoring scheme
The working face of the 6303 panel experienced frequent microseismic events during extraction, which were closely related to the high-stress distribution around the mining area. A Fiber Bragg Grating (FBG) borehole deformation sensor was adopted for stress measurement in the coal mine roof rock. The details of the measurement technique can be found in the literature. 39 To ensure accurate monitoring of the long-term stress changes during the mining process, sensors were installed in the intact, homogeneous main roof. This installation avoided the difficulties associated with drilling and installing sensors in the soft coal.
According to the theory of rock mechanics, stress in rock involves both in situ stresses $\sigma^0_{ij}$ and excavation-induced stresses $\Delta\sigma_{ij}$. The in situ stresses in a rock mass depend largely on the geological structure, such as discontinuities, faults, folds, and dikes. Excavation-induced stresses are due to the mining process or nearby activities such as excavation, blasting, or pumping. Therefore, the real stress state of the rock mass $\sigma_{ij}$ is the sum of the in situ stresses and the induced stresses, which can be denoted as follows.
$$\sigma_{ij} = \sigma^0_{ij} + \Delta\sigma_{ij}$$
Therefore, the monitoring of the dynamic evolution of mining-induced stress should include two steps, the investigation of the in situ stresses and real-time investigation of the induced stresses.
First, we measured the in situ stresses using the overcoring stress measurement method for the 6303 working face before long-term monitoring. The in situ stresses measured by overcoring tests are shown in Table 1. As shown in Figures 2 and 3, eight FBG borehole stress sensors were installed in the main roof of the 6303 panel at different section positions ahead of the working face. The sensor installation inclination was approximately 35°, the depth was approximately 15-20 m, and the distance between adjacent sensors was approximately 90 m. The detailed information of these stress monitoring boreholes is shown in Table 2. Figure 4 shows the core with a sensor from the overcoring test and the on-site installation work. The sensors obtained the dynamic 3D stress from the variations of strains, which were calculated from the variations of wavelengths, yielding three normal stresses and three principal stresses within the rock mass. To illustrate the stress evolution law and data characteristics, normal stress measurements were taken from only two sections (S-2 and S-3) for a detailed analysis.

| Variation of normal stresses
The magnitude and distribution of existing in situ stresses around a coal seam are disturbed by a goaf formed due to underground mine excavation. Figure 5 shows typical monitoring results of the changes in the three normal stresses (σ xx , σ yy , σ zz ) in the monitoring sections S-2 and S-3. The three normal stresses vary, as shown in Figures 5(a) and 5(b). The figures show that, at distances >80 m from the working face, the stress states in the monitoring sections are not affected by mining disturbance and are similar to the in situ stresses.
As the working face advances, the three normal stresses increase; the vertical stress increases more rapidly than the two horizontal stresses. At a distance of about 20 m ahead of the working face, all three normal stresses reach a peak and then decrease sharply because the integrity of the roof and coal seam is compromised.
To quantitatively evaluate the degree of disturbance to the roof rock mass caused by mining activities, we define $k_1$, $k_2$, and $k_3$ as the stress concentration coefficients, that is, the ratios of the three normal stresses to the corresponding in situ stresses. The values of the coefficients when the working face is at different distances from the monitoring section can be calculated using the following equation:
$$k_i = \frac{\sigma_i}{\sigma^0_i}, \quad i = 1, 2, 3,$$
where $\sigma_i$ is the ith monitored normal stress and $\sigma^0_i$ is the corresponding in situ stress. Figure 6 shows the variations of the concentration coefficients of the three normal stresses during the coal mining process. According to the values of the coefficients, the area ahead of the working face can be divided into four zones, namely, the in situ stress zone, slightly disturbed zone, violently disturbed zone, and stress relief zone.
As shown above, field measurement directly reflects the distribution and variation law of mining-induced stress, and the monitoring data are essential for rock burst warning and prevention. However, the loss of monitoring data due to sensor malfunction, optical fiber cable damage caused by coal mining activities, and the volatile environment is common and difficult to avoid. 30 Because of lost data, information that is critical for safety evaluation may not be available. Thus, recovering data after a sensor fails, or predicting short-term stress data, can have important implications for the diagnosis and prognosis of disasters in coal mines. For example, relevant destress measures can be adopted to mitigate high-stress concentration, and an early warning can be raised promptly. Therefore, in the next section, a methodology framework of an LSTM neural network that integrates transfer learning is proposed for stress prediction. Figure 7 shows a flow chart of the methodology proposed in this study, and the algorithm of the ensemble of the LSTM and transfer learning method is shown in Table 3.
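As a small illustration of the stress concentration coefficients introduced in this section, the ratio of a monitored normal stress to its in situ value can be computed as follows. This is only a sketch: the function name `stress_concentration` and all stress values are hypothetical and are not taken from the monitoring data.

```python
def stress_concentration(normal_stress, in_situ_stress):
    """Return k = sigma / sigma0 for one normal stress component."""
    if in_situ_stress == 0:
        raise ValueError("in situ stress must be non-zero")
    return normal_stress / in_situ_stress


# Hypothetical stresses in MPa, for illustration only (not field data).
in_situ = {"xx": 10.0, "yy": 12.0, "zz": 16.0}
measured = {"xx": 12.0, "yy": 15.0, "zz": 32.0}

# One coefficient per normal stress component.
coefficients = {
    component: stress_concentration(measured[component], in_situ[component])
    for component in in_situ
}
```

A coefficient close to 1 corresponds to a stress state close to the in situ state, while larger values indicate stress concentration.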

| METHODOLOGY FRAMEWORK
To summarize, the ensemble of the LSTM and transfer learning method can be generalized into the following steps. First, the time series datasets $D_S$ and $D_T$ are collected to serve as the source domain and target domain, respectively. Second, the time series dataset $D_S$ is preprocessed by the rolling window method to obtain time series samples. The LSTM neural network is used to predict the stress; thus, the next step is to construct a base LSTM model using the source domain data. The grid search method is used to optimize the hyperparameters for the base model. Subsequently, the parameter transfer approach is used to complete the knowledge transfer. The weights of the hidden layers and the hyperparameters of the pre-trained base LSTM model act as initialization parameters of the target LSTM model over the target domain data. The model parameters are then fine-tuned according to the test results. The proposed model is used to improve the prediction accuracy and thus overcome the missing data problem resulting from sensor or optical fiber cable damage caused by mining activities.

| LSTM neural network
Long short-term memory is a special RNN that is applicable to time series problems. Compared to standard RNNs or traditional neural networks, it is better able to handle the vanishing gradient and exploding gradient problems of long time series. A single LSTM cell comprises an input gate, a forget gate, an output gate, and the cell state memory (Figure 8). Gates selectively save, add, or delete information using the activation functions, thereby updating the cell state to achieve long-term storage of information and capture the temporal dependence of the time series. More specifically, the input gate controls the flow of input activation into the internal cell state. The forget gate controls the LSTM cell to forget or reset the cell's memory adaptively. The output gate controls the flow of output activation into the LSTM cell output. 31 The activation functions are sigmoid and tanh.
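The gate behavior described above can be sketched as one time step of a single-unit LSTM cell. This is a minimal illustration that assumes the standard LSTM update with scalar input and state; the function names and the parameter layout (`w_x`, `w_h`, `b` per gate) are assumptions made for this sketch, not the implementation used in the study.

```python
import math


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell (scalar input and state).

    params maps each gate name ("f", "i", "c", "o") to a tuple
    (w_x, w_h, b): input weight, recurrent weight, and bias.
    """
    def pre_activation(gate):
        w_x, w_h, b = params[gate]
        return w_x * x_t + w_h * h_prev + b

    f_t = sigmoid(pre_activation("f"))      # forget gate
    i_t = sigmoid(pre_activation("i"))      # input gate
    o_t = sigmoid(pre_activation("o"))      # output gate
    c_hat = math.tanh(pre_activation("c"))  # candidate memory
    c_t = f_t * c_prev + i_t * c_hat        # updated cell state
    h_t = o_t * math.tanh(c_t)              # hidden state output
    return h_t, c_t


# With all weights and biases at zero, every gate outputs 0.5 and the
# candidate memory is 0, so the cell state is simply halved.
zero_params = {gate: (0.0, 0.0, 0.0) for gate in ("f", "i", "c", "o")}
h1, c1 = lstm_step(x_t=1.0, h_prev=0.0, c_prev=1.0, params=zero_params)
```

In a full network, the scalars become vectors and matrices and the step is applied along the whole input window.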
At time step t (t = 1, …, n) and inside the lth LSTM network layer, the input state of the LSTM cell is $x^{(l)}_t$; the forget gate is $f^{(l)}_t$; the input gate is $i^{(l)}_t$; the output gate is $o^{(l)}_t$; the hidden state output is $h^{(l)}_t$; and the memory cell state is $c^{(l)}_t$. At the previous time step t − 1, the cell state memory is $c^{(l)}_{t-1}$ and the hidden state output is $h^{(l)}_{t-1}$. The relationships between these variables are given in the literature 31,32 and are shown schematically in Figure 8, where $W^{(l)}_{\alpha}$ (with $\alpha = \{f, i, c, o\}$) are the weight matrices applied to the input vectors $x_t$ and $h_{t-1}$ within the different gates; $\hat{c}^{(l)}_t$ is a vector of candidate memory cells created by the tanh function; and $\sigma$ and tanh are the sigmoid and tanh activation functions, respectively.
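For reference, the standard LSTM cell update that is consistent with the symbols defined above can be written as below. The concatenated form $[h^{(l)}_{t-1}, x^{(l)}_t]$, the bias vectors $b^{(l)}_{\alpha}$, and the element-wise product $\odot$ follow the common textbook formulation and are assumptions here, since they may differ in detail from the notation of the cited references.

```latex
\begin{aligned}
f^{(l)}_t &= \sigma\left(W^{(l)}_f \left[h^{(l)}_{t-1}, x^{(l)}_t\right] + b^{(l)}_f\right) \\
i^{(l)}_t &= \sigma\left(W^{(l)}_i \left[h^{(l)}_{t-1}, x^{(l)}_t\right] + b^{(l)}_i\right) \\
\hat{c}^{(l)}_t &= \tanh\left(W^{(l)}_c \left[h^{(l)}_{t-1}, x^{(l)}_t\right] + b^{(l)}_c\right) \\
c^{(l)}_t &= f^{(l)}_t \odot c^{(l)}_{t-1} + i^{(l)}_t \odot \hat{c}^{(l)}_t \\
o^{(l)}_t &= \sigma\left(W^{(l)}_o \left[h^{(l)}_{t-1}, x^{(l)}_t\right] + b^{(l)}_o\right) \\
h^{(l)}_t &= o^{(l)}_t \odot \tanh\left(c^{(l)}_t\right)
\end{aligned}
```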

| Rolling window method for data preprocessing
The raw data cannot be fed into the proposed model directly because the LSTM neural network expects input and output sequences. The rolling window method is used to transform the raw data into X (input) and Y (output) sequences. In this method, Δt is defined as the rolling window size, and a series of small samples of equal length is obtained by rolling through the whole sample with a fixed window size. Therefore, to predict the time series value at time t + 1, the rolling window feeds not only the value at time t but also those at times t − 1, t − 2, …, t − Δt to the model. The predicted value at time t + 1 is then appended to the sequence and the window is shifted forward by one step, and so on until the last value has been predicted. 33 The algorithm of the rolling window is shown in Table 4. Specifically, we assume 10 samples in the dataset, T1, T2, …, T10, and set Δt = 6. An example of the transformation of the raw data to time series samples is shown in Figure 9. 34 The appropriate window size for the study is identified in Section 4.
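The transformation described above can be sketched as follows for the 10-sample example with Δt = 6. This is a minimal sketch that assumes each input sample holds exactly Δt consecutive values and the output is the next value; the function name `rolling_window` is hypothetical and the exact indexing convention of Table 4 may differ.

```python
def rolling_window(series, window_size):
    """Split a series into (input window, next value) training pairs."""
    if window_size < 1 or window_size >= len(series):
        raise ValueError("window_size must be in [1, len(series) - 1]")
    inputs, targets = [], []
    for start in range(len(series) - window_size):
        # window_size consecutive values form one input sample ...
        inputs.append(series[start:start + window_size])
        # ... and the value right after the window is its target.
        targets.append(series[start + window_size])
    return inputs, targets


# Ten labeled samples T1..T10 and a window size of 6.
raw = [f"T{k}" for k in range(1, 11)]
X, Y = rolling_window(raw, window_size=6)
```

With these settings, four samples are obtained: the first maps T1 to T6 onto T7, and the last maps T4 to T9 onto T10.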

| Transfer learning
Traditional machine learning (supervised learning) relies on the availability of a large amount of labeled data and the identical distribution of the training and test data. 35 However, the difference in data distribution and the lack of sufficient labeled data are challenges when tackling practical problems. Transfer learning, as opposed to traditional machine learning, uses the ability of a system to recognize and apply knowledge and skills learned in the source domains or tasks to new but related target domains or tasks. This is an important ability that enables solving the problems described above. Figure 10 shows the learning processes comparing traditional machine learning and transfer learning.
Given a source domain D S and a target domain D T , transfer learning aims to improve the target prediction performance by transferring knowledge from the source domain data when only a few fresh training samples are available for time series prediction. Based on the definition of transfer learning and learning patterns, transfer learning can be divided into instance transfer, feature-representation transfer, relational-knowledge transfer, and parameter transfer. In this study, the parameter transfer approach is adopted, in which the parameters of the hidden layers and the hyperparameters of the pre-trained LSTM model act as initialization parameters of another LSTM model over the target domain data.
Some studies in engineering research have recently explored the applicability of deep learning techniques and transfer learning strategies. Li et al. 40 proposed a model to predict dam displacement data based on transfer learning and deep learning, in which transfer learning is used to transfer the knowledge learned from similar sensors to improve the prediction accuracy for the target sensor. Ma et al. 33,34 proposed a method that integrates transfer learning and advanced deep learning to transfer knowledge from existing air quality stations to new stations to predict air quality. For monitoring mining-induced stress, a set of stress sensors is usually installed in different sections. Along the mining direction, the longwall face passes through each monitoring section in turn. When the longwall face passes a monitoring section, the stress sensor records complete monitoring data if the sensor does not fail prematurely. Therefore, according to transfer learning theory, the complete monitoring data of the adjacent stress sensor can be considered the source domain, and the monitoring data of the following stress sensor can be regarded as the target domain.
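The parameter transfer idea described above can be illustrated with a deliberately simple stand-in. The sketch below does not use an LSTM: it fits a one-parameter linear model y = w·x by gradient descent, pre-trains it on toy source data, copies the learned weight to initialize the target model, and fine-tunes it for a few steps on toy target data. All names and data values are hypothetical and serve only to show the mechanism.

```python
import copy


def fit(weight, xs, ys, learning_rate, steps):
    """Gradient descent on the mean squared error of the model y = weight * x."""
    n = len(xs)
    for _ in range(steps):
        grad = 2.0 / n * sum((weight * x - y) * x for x, y in zip(xs, ys))
        weight -= learning_rate * grad
    return weight


xs = [0.5, 1.0]
source_ys = [2.0 * x for x in xs]  # toy source domain (complete data)
target_ys = [2.2 * x for x in xs]  # toy target domain (related, scarce data)

# 1. Pre-train the base model on the source domain.
base = {"weight": fit(0.0, xs, source_ys, learning_rate=0.1, steps=200)}

# 2. Parameter transfer: the target model starts from a copy of the base weights.
transferred = copy.deepcopy(base)

# 3. Fine-tune briefly on the target domain, and compare with training
#    from scratch for the same small number of steps.
fine_tuned = fit(transferred["weight"], xs, target_ys, learning_rate=0.1, steps=5)
scratch = fit(0.0, xs, target_ys, learning_rate=0.1, steps=5)
```

In this toy setting, the fine-tuned weight ends up much closer to the target value of 2.2 than the weight trained from scratch, because the initialization already encodes what was learned from the related source data.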

| Data collection and preprocessing
A detailed analysis of a stress monitoring scheme and its results in a coal mine is presented in Section 2. To test the proposed model, the monitored vertical stress data from the field monitoring case of the Dongtan coal mine were used as an application study. The monitored vertical stress data of S-2 (Figure 5) were used as the source domain data. To avoid the influence of noisy data on the model, we manually removed extreme outliers and inconsistent data and took the average of the data over every 6 h as one data sample during the data collection process. However, the model cannot learn the full process of the stress variation because the data must be split into training and test datasets. To learn enough patterns and knowledge from the source domain time series, especially around the stress peak, the vertical stress data of S-2 are duplicated to enlarge the source domain dataset. This is reasonable because transfer learning only focuses on the performance in the target domain. Then, the rolling window method is used to transform the source domain data into time series samples, which are split into training and test datasets in proportions of 90% and 10%, respectively, to pre-train the base LSTM model.
Figure 9. A preprocessing example to transform the raw data to time series samples

| Performance evaluation indicators
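To evaluate the performance of the proposed model, three widely used evaluation indicators, namely, root mean square error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE), are adopted to measure the prediction error of the model. The equations are as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$
where n is the number of predictions, $y_i$ and $\hat{y}_i$ are the ith actual monitored stress value and the ith predicted value, respectively, and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$ is the mean of the monitored values, which is used later for the coefficient of determination. Low values of the RMSE, MAPE, and MAE indicate high accuracy of the predictions.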
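The three indicators can be computed as in the following sketch. The function names and the sample values are hypothetical; MAPE is expressed in percent, which is an assumption about the scaling convention.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - p) / a) for a, p in zip(actual, predicted))


def mae(actual, predicted):
    """Mean absolute error."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


# Hypothetical stress values in MPa, for illustration only.
actual = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 40.0]
errors = {
    "RMSE": rmse(actual, predicted),
    "MAPE": mape(actual, predicted),
    "MAE": mae(actual, predicted),
}
```

RMSE penalizes large deviations more strongly than MAE, while MAPE relates each error to the magnitude of the monitored value.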

| Rolling window sizes
The size of the rolling window Δt influences the prediction performance. This is because data for previous instances might have a strong or weak lagged effect on the data at the next instance. A small window size cannot guarantee that enough information and sample features will be provided to the LSTM neural network inputs, while a large window size might introduce unrelated information and, thus, increase the computational complexity. 34,36 To determine an appropriate window size, the autocorrelation function 37 is used to measure the temporal correlations within the stress time series. Higher autocorrelation coefficients indicate stronger temporal correlations. For a window size Δt, the autocorrelation coefficient can be calculated as follows:
$$\rho(\Delta t) = \frac{\mathrm{Cov}\left(y(t),\, y(t + \Delta t)\right)}{\sigma\left(y(t)\right)\,\sigma\left(y(t + \Delta t)\right)}$$
where y(t) and y(t + Δt) denote the stress values at times t and t + Δt, respectively, Cov(⋅) represents the covariance, and σ(⋅) is the standard deviation. For the source domain data, Figure 11 shows the autocorrelation coefficients for different window sizes. The autocorrelation coefficients clearly decrease with increasing window size, which confirms that earlier events have a weaker impact on the current status. Following previous studies, 33,38 the window size is chosen from the range in which the autocorrelation coefficients are >0.5, which indicates a high temporal correlation. For our stress data, the coefficients are >0.5 for a window size ≤7. To ensure enough sample features for the model inputs, the window size Δt was therefore set as 7.
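The autocorrelation coefficient used above to choose the window size can be sketched as the correlation between the series and a copy of itself shifted by a given lag. The function name `autocorrelation` and the two toy series are hypothetical; the sketch uses population covariance and standard deviation over the overlapping part of the two series.

```python
import math


def autocorrelation(series, lag):
    """Correlation between y(t) and y(t + lag) over the overlapping samples."""
    if lag < 1 or lag > len(series) - 2:
        raise ValueError("lag must leave at least two overlapping samples")
    x = series[:-lag]  # y(t)
    z = series[lag:]   # y(t + lag)
    n = len(x)
    mean_x = sum(x) / n
    mean_z = sum(z) / n
    cov = sum((a - mean_x) * (b - mean_z) for a, b in zip(x, z)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_z = math.sqrt(sum((b - mean_z) ** 2 for b in z) / n)
    if std_x == 0 or std_z == 0:
        raise ValueError("series segment has zero variance")
    return cov / (std_x * std_z)


# A steadily increasing series is perfectly correlated with its shifted copy,
# whereas an alternating series is perfectly anti-correlated at lag 1.
trend = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
rho_trend = autocorrelation(trend, lag=1)
rho_alternating = autocorrelation(alternating, lag=1)
```

Evaluating this function for a range of lags and keeping the lags whose coefficient exceeds 0.5 reproduces the selection rule described in the text.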

| Learning rate and LSTM structure
In addition to the rolling window size, the LSTM neural network structure and other hyperparameters influence the forecasting performance. To obtain an optimal prediction performance with the base LSTM model, the grid search method is used to optimize the hyperparameters. With the window size fixed at Δt = 7, the ranges of the learning rate (0.0001, 0.01), hidden layers (2, 3), hidden units (5, 100), and epochs (200, 1000) were searched to determine the optimal hyperparameters for the model. The RMSE was used to measure the prediction performance for each parameter combination. In this base LSTM model, the Adam optimizer was used, which can replace the classic stochastic gradient descent method to update the network weights more effectively. Table 5 shows the influence of the learning rate on performance. For the models using the four network structures, the prediction error increases significantly when the learning rate is 0.01 and decreases when the learning rate is <0.001. Therefore, the recommended learning rate for this model is <0.001. Figure 12 shows the prediction accuracy as a function of the different LSTM network structures and the number of iterations (for a learning rate of 0.0001). The choice of network structure has a considerable effect on the prediction error of the LSTM models. The minimum prediction error is reached at around the 400th iteration. The fluctuations in the curves are caused by the inherent stochasticity of training or by poor parameter combinations; they are commonly observed in such cases and cannot be completely eliminated. With suitable hyperparameters (such as learning rates of 0.0005 and 0.0001 and the network structure [20,10,5]), however, the amplitude and frequency of the fluctuations can be reduced. Considering the above problems and the learning efficiency, the learning rate, network structure, and epochs were set as 0.0001, [20,10,5], and 400 for the base LSTM model, respectively.
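The grid search procedure described above can be sketched as an exhaustive loop over all hyperparameter combinations. In this sketch the grid points are hypothetical values within the stated ranges, and `toy_score` is an arbitrary stand-in for "train the base LSTM model and return its test RMSE"; it is constructed so that its minimum is known in advance and it does not reproduce any result of the study.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and keep the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)  # e.g. test RMSE of a trained model
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid points within the ranges given in the text.
grid = {
    "learning_rate": [0.0001, 0.0005, 0.001, 0.01],
    "hidden_units": [(20, 10), (20, 10, 5)],
    "epochs": [200, 400, 1000],
}


def toy_score(params):
    """Arbitrary stand-in for model training; NOT a real RMSE."""
    layer_penalty = 0.0 if len(params["hidden_units"]) == 3 else 0.1
    epoch_penalty = abs(params["epochs"] - 400) / 1000
    return params["learning_rate"] * 100 + epoch_penalty + layer_penalty


best, best_score = grid_search(grid, toy_score)
```

In practice, `score_fn` would train the base LSTM model on the source domain training set for each combination, which makes the cost of the search grow with the product of the grid sizes.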

| Performance comparison with and without transfer learning
To verify the effect of the proposed model, the complete monitoring data of the vertical stress of S-3 were taken as the target domain data. The hyperparameters and hidden-layer weights of the base LSTM model pre-trained on the source domain data (vertical stress of S-2) are transferred to initialize the target LSTM model, which is then trained on the target domain data (vertical stress of S-3). It is assumed that the vertical stress data of S-3 are missing from the point at which the distance to the working face is approximately 33 m. Therefore, the data monitored at distances greater than 33 m are used as the training set, and all remaining data are used as the test set.
In addition, to assess the performance of the proposed model using the transfer learning method, an LSTM model without transfer learning and two commonly used time series prediction models, namely, the Autoregressive Integrated Moving Average (ARIMA) model and the Recurrent Neural Network (RNN), are also used to generate predictions.
The predicted results without transfer learning (LSTM), with transfer learning, with ARIMA, and with RNN are illustrated in Figures 13, 14, 15, and 16, respectively. The 95% confidence intervals calculated using multiple predictions are also shown in these figures. Figures 13, 15, and 16 show that, when the transfer learning method is not used, the predicted vertical stress tends to increase indefinitely and the stress peak cannot be predicted. Figure 14 shows that the peak can be predicted accurately by the proposed method using transfer learning. This has important implications for the diagnosis and prognosis of rock bursts in coal mines. A comparison of the prediction errors by the three evaluation criteria is shown in Table 6, which shows that the prediction error without transfer learning is 2-3 times greater than that obtained by the proposed method. This confirms that the proposed method performs better than the LSTM, ARIMA, and RNN models alone.

| Performance for different sizes of target domain training data
In actual monitoring applications, the monitoring data may be lost at any time due to sensor failure, abnormal data transmission, or human factors. Therefore, the availability and amount of target domain data depend strongly on the practical situation and have a significant influence on the prediction results. To examine this influence, the target domain training data were set at 50%, 55%, 60%, 65%, 70%, and 80% of the total target domain data, and all the remaining data were assigned to the test set. For each case, the same pre-trained LSTM model was used for transfer learning.
In addition, two metrics, the RMSE and the coefficient of determination R 2 , are adopted to describe the prediction performance of the developed model for different amounts of target domain training data. The RMSE has been defined in Section 3.4, and the R 2 can be calculated using the following equation:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
The results showed that, for each case, the stress peak and drop point can be predicted accurately using the transfer learning strategy. The stress drop points varied considerably; however, this was not important as the focus was on predictions of the buildup to the stress peak. The RMSE and R 2 values as a function of the size of the training data are shown in Figure 17. Figure 17 shows that when the training data size was set to 50%, the prediction performance was poor. The prediction performance improved significantly when the availability of the training data was 55%: the RMSE decreased from 7.21 to 3.48 MPa and the R 2 increased from 0.32 to 0.83. The fundamental reason for this trend is that the stress state was still in the virgin state with no excavation disturbance at 66 m from the working face (training data availability 50%), whereas the stress state was in an excavation disturbed zone at 54 m from the working face (training data availability 55%). As the training data size increases further, the prediction performance in terms of the RMSE and R 2 values gradually improves. When the target domain training data availability is >65%, the RMSE value is <2 MPa and the R 2 is >0.9. This experiment demonstrates that the proposed model is effective and can be used to predict future stress and recover missing stress data when the target domain training data availability is >55%. In other words, the proposed model can significantly improve the prediction performance once the target domain training data extend into the stress disturbed zone.
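The experiment above can be sketched with two small helpers: one that splits the target series chronologically by a training fraction, and one that computes the coefficient of determination. The function names and the numeric values are hypothetical and are only meant to show the calculation.

```python
def chronological_split(series, train_fraction):
    """Use the earliest part of the series for training and the rest for testing."""
    cut = round(len(series) * train_fraction)
    return series[:cut], series[cut:]


def r_squared(actual, predicted):
    """Coefficient of determination R^2 = 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# A 20-sample placeholder series with 55% of the data used for training.
train, test = chronological_split(list(range(20)), train_fraction=0.55)

# Hypothetical actual and predicted values, for illustration only.
r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

Because the split is chronological, a larger training fraction means that the training data reach closer to the working face, which is why the availability percentages map onto distances in the discussion above.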

| CONCLUSION
In this study, a framework that integrates an LSTM neural network and the transfer learning method to improve the prediction performance for missing stress data is proposed. To test the proposed model, vertical stress data of two monitoring sections obtained from previous field measurement for stresses in the Dongtan coal mine were selected as a case study for stress prediction. The main conclusions from this study are as follows.
For the base LSTM model, excellent prediction performance is obtained when the network structure is set as three layers comprising [20,15,5] cells, the window size is 7, and the learning rate is 0.0001. Model application and experimental results showed that the proposed model using transfer learning performs better than a model without transfer learning. Moreover, the peak stress can be predicted accurately by the proposed model. As the amount of available target domain training data increases, the prediction performance measured by the RMSE and R-squared values improves. The proposed model can be used to predict future stress and recover missing stress data. The prediction performance improves significantly when the target domain training data availability is >55%, indicating that the data have reached the stress disturbed zone.
The main contribution of the proposed framework is the integration of the LSTM neural network and the transfer learning method to improve the mining-induced stress prediction performance when data are missing. The framework can be applied to reconstruct the missing data or predict the stress state in the next few days, which is crucial for the diagnosis and prognosis of the stability of surrounding rock, and provides an important reference for similar projects. However, in the mining field, many factors such as blasting, tremors generated by wave propagation, and energy release from hard roof fracturing can cause sudden stress changes. The influence of these external factors on stress cannot be accounted for by the proposed model. This challenging and significant work will be carried out in future studies.