Machine Learning Predicts Earthquakes in the Continuum Model of a Rate‐And‐State Fault With Frictional Heterogeneities

Machine learning (ML) has been used to study the predictability of laboratory earthquakes. However, the question remains whether or not this approach can be applied in a tectonic setting where one may have to rely on sparse earthquake catalogs, and where important timescales vary by orders of magnitude. Here, we apply ML to a synthetic seismicity catalog, generated by continuum models of a rate‐and‐state fault with frictional heterogeneities, which contains foreshocks, mainshocks, and aftershocks that nucleate in a similar manner. We develop a network representation of the seismicity catalog to calculate input features and find that the trained ML model can predict the time‐to‐mainshock with great accuracy, from the scale of decades to minutes. Our results offer clues as to why ML can predict laboratory earthquakes and how the developed approach could be applied to more complex problems where multiple timescales are at play.


Introduction
Recent studies demonstrated that machine learning (ML) could predict fast and slow slip events on laboratory faults via instantaneous monitoring of continuous acoustic emission (AE) occurring between main failure events (e.g., Borate et al., 2023;Hulbert et al., 2019;Jasperson et al., 2021;Johnson et al., 2021;Karimpouli et al., 2023;Laurenti et al., 2022;Lubbers et al., 2018;Pu et al., 2021;Rouet-Leduc et al., 2017;Shokouhi et al., 2021).For example, Rouet-Leduc et al. (2017) showed that the statistical characteristic of AEs, notably the evolution of their variance, provides predictability of the time-to-failure in a double direct shear-experiment.Since these ML approaches rely only on real-time, continuous foreshock-like AE observations, they can provide additional information compared to classical approaches based on the recurrence interval of past large events (e.g., Savage, 1994;Schwartz & Coppersmith, 1984).
Despite its successes, notably with laboratory earthquakes, there are at least two major challenges in applying these ML approaches to forecast the timing of earthquakes in nature.First, the timescales of earthquake cycles vary by orders of magnitude, with typical recurrence intervals of large earthquakes being hundreds to thousands of years while the timescales of months to days need to be considered for short-term forecasting to be meaningful.Second, it remains unclear why ML approaches can predict time-to-failure for laboratory earthquakes, making the upscaling to real Earth problems challenging.Given that the variance and amplitude of AEs are related to the preparatory phase of the main slip event (Rouet-Leduc et al., 2017), ML might have captured hidden patterns related to the evolution of certain foreshock characteristics (e.g., Bolton et al., 2021;W. Goebel et al., 2013; • We developed a network representation of a synthetic earthquake catalog to train a machinelearning Machine learning (ML) model • The trained ML model can predict the time remaining before simulated mainshocks with great accuracy • The network representation of seismic moment and event interval enables us to accurately predict the timing of mainshocks Supporting Information: Supporting Information may be found in the online version of this article.Johnson et al., 2013;Kwiatek et al., 2014;Lei et al., 2018;Main et al., 1989;McLaskey & Lockner, 2014;Ohnaka & Mogi, 1982;Rivière et al., 2018;Scholz, 1968;Yamashita et al., 2021).
One way to overcome these challenges is to apply an ML approach to a synthetic earthquake catalog generated by earthquake-sequence simulations in which the timescales vary by orders of magnitude and the mechanism of the earthquake generation is known or understandable (e.g., Beall et al., 2022;Cattania & Segall, 2021;Dublanchet, 2022;Dublanchet et al., 2013;Ito & Kaneko, 2023;Luo & Ampuero, 2018;Schaal & Lapusta, 2019;Skarbek et al., 2012;Yabe & Ide, 2017).For example, Ito and Kaneko (2023) found that in continuum models of a rate-and-state fault with frictional heterogeneities, the slope of a Gutenberg-Richter magnitude-frequency relation, referred to as a b-value, decreases with time toward the mainshock, consistent with the characteristics of seismicity before large earthquakes (e.g., Gulia et al., 2016;Nanjo et al., 2012;Papadopoulos et al., 2010;Schurr et al., 2014;Shimbaru & Yoshida, 2021;Simon et al., 2021).In their model, a temporal decrease in the b-value is caused by the reduction of rupture barrier effectiveness in creeping fault patches.Applying an ML approach to this type of synthetic catalog may enable us to understand the reasons behind why and how an ML model can predict the timing of mainshocks in previous laboratory studies.
In this study, we apply ML to a synthetic seismicity catalog generated by the fault model of Ito and Kaneko (2023).The synthetic catalog contains foreshocks, mainshocks, and aftershocks that nucleate in a similar way, making it difficult to assess which output parameters contain key information for the mainshock predictability (Ito & Kaneko, 2023).In the following, we describe a network representation of the earthquake catalog to train an ML model (Section 2).We describe the results (Section 3) and discuss the implications (Section 4).

Synthetic Earthquake Catalog Generated by a Rate-And-State Fault Model With Frictional Heterogeneities
The fault model that we use to generate a synthetic earthquake catalog contains a 2,400-m long, 1D fault embedded into Earth's crust, loaded by slow tectonic loading (Figures 1a-1c).The numerical approach is based on a boundary integral method (Kaneko et al., 2016;Lapusta et al., 2000), adapted for a two-dimensional (in-plane, Mode II), fully dynamic model of earthquake cycles.The constitutive response of the fault is governed by the rateand-state friction laws with the aging form of state variable evolution (Dieterich, 1978(Dieterich, , 1979;;Ruina, 1983).The fault is characterized by regularly alternating patches of velocity-weakening (VW) and velocity-strengthening (VS) friction (Figure 1c).The details of the model setup and parameters are described in Text S1 and Table S1 in Supporting Information S1.
We use the catalog generated by the simulation during 900 years (∼18,000 earthquakes).The earthquake is characterized by its epicenter x, its origin time t, and its seismic moment M (see details in Text S2 in Supporting Information S1).We categorize mainshocks as earthquakes that rupture the entire fault and all the other earthquakes as foreshocks.For simplicity, we do not distinguish between foreshocks and aftershocks, because of the relatively few aftershocks in this model.Both the mainshocks and foreshocks nucleate in a similar manner within one of the VW patches, as a VW patch is set slightly smaller than a critical nucleation size (Rubin & Ampuero, 2005) with the setting of small characteristic slip distance D RS = 15 μm which is within a typical range of D RS in rock-friction experiments (Ikari et al., 2011;Marone, 1998).Despite the relatively simple heterogeneities, the model results in a complex spatiotemporal pattern of earthquakes (Figures 1d-1f).Foreshocks follow a power-law magnitude-frequency relation (Gutenberg & Richter, 1949) with a magnitude completeness of 2.0 (Figure S1 in Supporting Information S1).To avoid sampling smaller events limited by the size of a VW patch and potential numerical artifacts, we remove events below the magnitude completeness from the catalog (see more details on catalog preparation in Text S2 in Supporting Information S1).

Forecasting Method Using Random Forest
To predict the time to the next mainshock in the synthetic catalog mentioned above, we apply the classical random forest (RF) ML technique, implemented in scikit-learn (Breiman, 2001;Louppe et al., 2013;Pedregosa et al., 2011) (see details in Figures S3 and S4 in Supporting Information S1).We develop a data-network representation to calculate features in the earthquake catalog.A schematic explanation of our data-network representation is shown in Figures 2a and 2b.We refer to a network as a group of a specific number of consecutive earthquakes.In the current network, we compute the changes of each earthquake pair, , where i denotes the ith earthquake in the current network.Hence there are a total of 5 parameters that describe an earthquake pair: X i = (x i , M i , Δx i , Δt i , ΔM i ).We compute the average and variance of X for all the pairs in a network in order to quantify the pattern of earthquake occurrence, as follows: where X j is the average, Var is the variance, and j is the number of earthquakes included in the current network, which we define as the network size.The network size j is determined as percentages of the average foreshock number j = p f, where p is the percentages and f is the average foreshock number in the training set.The network representation conceptually corresponds to adaptive time windows with no arbitrary duration, but with a duration that changes depending on the stage of earthquake-cycles.The percentage p is the free parameter in this method and it is optimized along with the hyperparameters of the RF model (see details in Text S3, S4, and Figure S2 in Supporting Information S1).As an example of feature notation, M 5f means seismic moment averaged over a network of 5% average foreshock number.Note that all the earthquakes are considered by networks "when computing features" and there is no in-advance classification of earthquake-type (e.g., foreshocks, mainshocks, aftershocks, swarms, etc.).
Here we introduce three types of data-network representations: (a) instantaneous, (b) single, and (c) multiple network representation.The instantaneous network representation is analogous to that used in previous studies (e.g., Hulbert et al., 2019;Jasperson et al., 2021;Johnson et al., 2021;Rouet-Leduc et al., 2017) that monitor AEs over a small time window.We define a pair of consecutive earthquakes as one network ( j = 2), thus, the instantaneous network is the minimum one among this representation.The single network representation accounts for a specific number of earthquakes.Depending on the percentage p, the network incorporates the earthquake catalog over different timescales.As representative cases, we show three single networks, where p = 5%, 50%, and 100%.The multiple network representation accounts for multiple single networks over different timescales simultaneously.A similar method is introduced by Karimpouli et al. (2023), showing the benefit of using multiple (albeit, fixed) time windows of different sizes in predicting the time-to-failure of laboratory earthquakes.The network sizes in the multiple network representation are defined as p = (100, 90,80,70,60,50,40,30,25,20,15,10,5)%.All the features that we eventually used are listed in Table S2 in Supporting Information S1 (see details in Text S4 in Supporting Information S1).
Labels are computed as the time remaining before the next mainshock from the latest earthquake in the current network.For networks whose latest earthquake is a mainshock, we compute the time to the next mainshock.The networks move forward each time an earthquake occurs, and the features are associated with the time of the latest earthquake in the current networks.To efficiently train the RF model on the data whose labels have many orders of magnitude (10 8 to 10 1 years), we change the labels to logarithmic values (see details in Text S3 in Supporting Information S1).where n is the number of earthquakes, y i is the true time-to-mainshock, ŷi is the prediction, ȳ is the average of y i .In order to emphasize predictions at different timescales, we use logarithmic values to compute R 2 , whereby the RF approach minimizes the L2 norm cost function with the input of logarithmic labels.

The Results by the RF Model With Instantaneous, Single, and Multiple Network Representation
The model's performances in the test set are R 2 instant = 0.72, R 2 5% = 0.79, R 2 50% = 0.87, R 2 100% = 0.37, and R 2 multi = 0.96.The instantaneous network (Figures S3a and S3b in Supporting Information S1) relies only on two earthquakes on the prediction at a given time, and hence it is highly sensitive to the recurrence interval and the seismic moment.With this simplified network as input, the RF model can roughly classify the time-to-mainshock, but cannot discriminate the detailed time-to-mainshock, leading to unstable predictions and oscillations in a broad time range.The effect of the small amount of data in the network can be seen in the single small network approach (network 5%, Figures S3c and S3d in Supporting Information S1) as well.In contrast, a much larger network size (network 100%, Figures S3g and S3h in Supporting Information S1) captures a global trend, but is insensitive to the trend toward the next mainshock and has difficulty in reliably recognizing an impending mainshock.Therefore, the medium single network (network 50%, Figures S3e and S3f in Supporting Information S1) works the best among the instantaneous and single network representations because it is in between those end-members (network 5% and 100%).In the multiple network representation, these networks of different sizes are used simultaneously, which is equivalent to using the temporal evolution of historical earthquake patterns, and the RF model can achieve a very high accuracy R 2 multi = 0.96 (Figures 2c and 2d).

Reasons Why the Multiple Network Representation Works Effectively
To explain why the multiple network representation enables accurate time-to-mainshock predictions, we visualize the temporal evolution of the features in the different network sizes (Figure 3). Figure S4 in Supporting Information S1 shows the feature importance output by the trained RF model.A small network size (e.g., network 5%) can capture the precursory behaviors, such as the increase of seismic moment and the decrease of recurrence interval when the time-to-mainshock is less than 1 hr (Figures 3a and 3b).However, they are too sensitive to mainshock-like behavior with a large number of earthquakes occurring in The probability density of the average recurrence time from the network of 5%, 20%, 40%, 60%, 80% average foreshock number, respectively.The increase of the seismic moment and the decrease of the recurrence interval over multiple network sizes enable accurate prediction of the time-to-mainshock.Note that different features from different network sizes display predictive power at different time scales.
a short period of time.Since the small networks contain a small number of earthquakes to compute the statistics, the data distribution becomes uncorrelated with time-to-mainshock when the mainshock is more than an hour away (a blue circled region in Figure 3b).This cluster possibly makes the RF model misunderstand that the next mainshock is approaching earlier.The large networks can add constraints in the regions where small networks have difficulty in discriminating the time-to-mainshock.Large networks contain a large number of earthquakes, which means that they tend to show precursory trends only right before the mainshocks.As shown in Figures 3c-3f, the distribution of Δt j=large for the uncorrelated clusters (a blue circled region) tends to shrink in larger networks.When the RF model refers to these multiple networks, it accounts for the difference in feature distribution over various timescales, suppressing the oscillation of prediction and accurately predicting the time-to-mainshock.In other words, useful features present a monotonous and non-flat distribution with respect to the time before mainshock (e.g., as indicated by black arrows in Figure 3).Different network sizes have this monotonous relationship at different parts of the earthquake cycle and at different timescales, and having access to these different network sizes enables the RF model to reconcile these different timescales.In summary, the temporal evolution of the seismic moment and recurrence interval abstracted in networks of specific sizes give effective information for predicting the time-to-mainshock.

On the Accuracy of the Predictions With Training Data Size
Next, we investigate how the training data size affects the accuracy of the predictions.We generate a synthetic catalog during the 2,000 simulation years in the default parameter setting and quantify the accuracy of prediction with varying training set sizes.Specifically, we change the size of the training set from 25 years (2 mainshock cycles) to 1,800 years (115 mainshock cycles) and predict the same test set in Section 3.1 via the multiple network representation.The hyperparameters and free parameters in the RF model are determined with the multiple network approach as in Section 3.1 because these hyperparameters do not affect the result in orders.Comparing the scores of the default scenarios and the end-members, the R 2 multi = 0.96 (default scenario), the R 2 2 cycles = 0.68, and the R 2 115 cycles = 0.97.As expected, the smaller the training set becomes, the smaller the R 2 becomes (Figure 4a).Interestingly, the R 2 with the large training set is not largely improved even if we use about three times larger training set (115 mainshock cycles) than the default data size (41 mainshock cycles).We interpret this stagnation of the scores as the RF model reaching its upper limit and the randomness in the model preventing its accuracy from being higher.In addition, the prediction in the smallest training set (only 2 mainshock cycles) performs satisfactorily with the R 2 2 cycles = 0.68 (Figure 4a).Despite poor performance for prediction at long time scales (Figure 4b), the trained ML model can reasonably predict the time-to-mainshock starting several weeks prior to subsequent mainshocks (Figure 4c).The effect of data size is further examined in Section 3.4.

Application to the Synthetic Earthquake Catalog Containing Aperiodic Mainshock Cycles
To understand the general applicability of the developed multiple network representation, we perform an earthquake-cycle simulation with time-varying loading.Every time the mainshock occurs, we change the loading rate V pl randomly in the range of 1.0 < V pl < 3.0 mm/yr as well as the corresponding stressing rate in the range of 0.010 < τ < 0.030 MPa/yr and keep these values constant until the next mainshock.All the other parameters of the earthquake-cycle simulation are the same as in Section 2.1.The recurrence interval varies in the range of about 10-35 years (Figure 4d).Despite the aperiodicity of the mainshock cycles, the trained ML model in the randomized loading scenario can accurately predict the multi-scaling time-tomainshock R 2 = 0.87 (Figures 4e and 4f).We also examine the effect of limited data size using this catalog (Figure S5 in Supporting Information S1) and find that the ML model, trained on a minimal data set, exhibits a capability to reasonably predict the time-to-mainshocks immediately preceding their occurrence (Figure S5c in Supporting Information S1).Thus, the result indicates that the aperiodicity does not affect the performance of our ML approach, especially right before the next mainshock, supporting our conclusion that the characteristics of features mentioned in Section 3.2 can successfully capture the trend toward the next mainshock time.

Implications for the ML-Based Prediction of Laboratory Earthquakes
From the present results, we provide a potential explanation as to why an ML approach can predict the time-tofailure in laboratory friction experiments.Our results suggest that the evolutions of important features are the increase of average seismic moment and the decrease of average recurrence interval, which is equivalent to the increase in the event's signal amplitude and frequency, consistent with the corresponding b-value decreases with time-to-mainshock (Ito & Kaneko, 2023).In the model of Ito and Kaneko (2023), the temporal decrease of the bvalue emerges due to increasing shear stresses in the localized creeping patches (Figure S6 in Supporting Information S1).If we interpret laboratory AE events as tiny foreshocks, our results are consistent with the laboratory results in which the increasing signal amplitudes or seismic moment provides the predictability of mainshocks (e.g., Rouet-Leduc et al., 2017).By inference, the reason why an ML approach can predict the timeto-failure in rock-friction laboratory experiments may be related to a decrease in b-value (Rivière et al., 2018) and corresponding changes in shear stresses in the localized creeping patches.
Our results also suggest the importance of an ML approach to account for the variations in timescales.Comparing the results by a single network and multiple networks (Figure 2 and Figure S3 in Supporting Information S1), the prediction accuracy by multiple networks is dramatically superior to the prediction by only the most important network (network 50%).Similar inferences were made in the ML results of forecasting laboratory earthquakes (Karimpouli et al., 2023).These results indicate that explicitly teaching the time evolution of features to an ML model in multiple time windows would greatly improve the prediction accuracy for laboratory earthquakes.In addition, our network representation can focus on the timescales in which precursory behavior occurs, and thus, the implementation of the multiple network representation with the ML technique would improve the prediction accuracy for laboratory earthquakes.

Toward Application to a More Realistic Earthquake Catalog
Based on the results presented so far, we discuss potential challenges when applying this technique to a real earthquake catalog.The first and most significant problem is the lack of data.Our method requires at least two mainshock cycles to train the ML model.However, there is often insufficient data or past-earthquake to be used to train an ML model, as real earthquake catalogs are up to several decades, shorter than the recurrence intervals of typical large earthquakes.A potential solution is a transfer learning technique, which has been applied to rock-experiment data in Wang et al. (2021) by combining numerical simulation and experimental data to reduce the experimental data size needed for training.Second, our earthquake-cycle model contains a single fault and does not produce off-fault seismicity.Whether our ML approach would work for a more complex synthetic catalog needs to be tested in the future.Third, there are many different types of earthquake behaviors in nature such as mainshocks with few or no observed foreshocks, earthquake swarms, aftershocks, low-frequency earthquakes, etc., making accurately predicting the timing of mainshocks challenging.
As examples of more complex cases, we test our method with different earthquake-cycle simulation parameters, for two end-member cases: a scenario with very few foreshocks, and a swarm-like scenario (Ito & Kaneko, 2023).Specifically, in the scenario with fewer foreshocks, we decrease the friction parameter a-b on the VS patches from 0.0025 (default scenario) to 0.0020, and in the swarm-like scenario, we increase them to 0.0031.The details in catalog processing are documented in Text S5in Supporting Information S1.
In the case of the scenario with fewer foreshocks, the RF model via multiple network representation can predict the time-to-mainshock accurately with R 2 = 0.89 (Figure S7 in Supporting Information S1).However, to achieve this high score, we need to use a very large training data set (1,800 years earthquake-cycle, containing 105 mainshock sequences) and hence, the effect of the lack of data size is more significant, compared to the default scenario (Figure S8 in Supporting Information S1).In the case of the swarm-like scenario, the testing score is quite poor and barely above a random prediction at R 2 = 0.10 (Figure S9 in Supporting Information S1).The swarm-like scenario has much more complex behavior where a cluster of small earthquakes occurs right before large events that are not large enough to be categorized as the mainshocks, resulting in false-positive errors in the prediction (Figure S9 in Supporting Information S1).This is related to the fact that continuously distributed magnitudes in all the earthquakes in this scenario make the prediction task ill-defined (see details in Text S5 in Supporting Information S1).Therefore, the effort to reduce the required training data size and to develop a method dealing with more complex behaviors (including defining a mainshock in swarm-like scenarios) would be important, and testing our method in a more complex setting will be the topic of future work.

Conclusions
We have developed a machine-learning approach with a multiple network representation and applied it to a synthetic earthquake catalog generated by a continuum model of a rate-and-state fault with frictional heterogeneities.We have found that the ML model with multiple network representation can predict the time remaining before the mainshock with great accuracy, from the scale of decades, long before the upcoming earthquake, down to the scale of hours to minutes, right before the mainshock.We find that quantifying the occurrence patterns of earthquakes by abstracting a catalog in a specific number of groups of earthquakes is key for accurate prediction at all timescales.The important earthquake features for predicting the timing of mainshocks are the evolution of seismic moment and recurrence interval of foreshocks averaged over multiple foreshock network sizes.

Figure 1 .
Figure 1.Fault model with frictional heterogeneities and typical active foreshock behavior in the default scenario.(a) A conceptual illustration of a fault surface with frictional heterogeneities.Gray/white areas represent velocity-strengthening and weakening (velocity-weakening) patches, respectively.(b) A schematic illustrating a strike-slip fault in the model.(c) Spatial distribution of frictional heterogeneities characterized by the rate-and-state friction parameters a-b in the default scenario (d-f) Earthquake behavior over different timescales.Pink circles show the hypocenters and origin times of foreshocks and blue stars are those of mainshocks.The pink area has a higher slip velocity than 1 cm/s and indicates a coseismic slip area of individual earthquakes.

Figure 2 .
Figure 2. (a, b) Schematic of our data-network representation, showing the time and location of foreshocks (the pink circles) and one mainshock (the blue star).The green and orange foreshock groups denote the current and next networks, respectively.Statistically summarized catalog information inside the network is labeled with the time to the next mainshock, counting from the latest earthquake in the network (individually circled).(c) Prediction of the time-to-mainshock by an random forest (RF) model with multiple network representations.The red dots are the true time-to-mainshock and the blue dots are the prediction.Prediction is plotted on a year scale, corresponding to long-term prediction.Each new foreshock gives a new prediction by the RF model, irregularly spaced over time.(d) Logarithmic values of prediction as a function of the foreshock number.This corresponds to short-term prediction, down to the scale of hours to minutes.The axis of the foreshock number can be regarded as the axis of a discrete time series.The 90% confidence interval of prediction is shown in the blue color area and the normalized misfit in the logarithmic space is shown in the green color bar.

Figures
Figures2c and 2dshow the RF predictions based on the multiple network representation, and those via the instantaneous and single network representations are shown in FigureS3in Supporting Information S1.We quantify the accuracy of the prediction by the coefficient of determination R 2 , where a value closer to one means the prediction is more accurate:

Figure 3 .
Figure 3. Feature distribution on the training set as a function of the time-to-mainshock.The contour shows the feature's probability density approximated by kernel density estimation.(a)The probability density of the average seismic moment from the network of 5% average foreshock number (b-f) The probability density of the average recurrence time from the network of 5%, 20%, 40%, 60%, 80% average foreshock number, respectively.The increase of the seismic moment and the decrease of the recurrence interval over multiple network sizes enable accurate prediction of the time-to-mainshock.Note that different features from different network sizes display predictive power at different time scales.

Figure 4 .
Figure 4. Performance of our approach in case of little training data and case of aperiodic earthquakes.(a) R 2 score as a function of training data size in the default scenario.The cross-dashed line denotes the score shown in Figure 2.After ∼300 years (i.e., ∼20 mainshocks) of training data, the prediction becomes very accurate (b, c) The long-and short-term prediction by the minimum training set (i.e., with two mainshock cycles only).(d) The recurrence interval of mainshocks of an earthquake-sequence model scenario with randomized loading rates (e, f) The long-and short-term predictions by the random forest model with multiple network representation in the randomized loading scenario.