Using Deep Learning for Flexible and Scalable Earthquake Forecasting

Seismology is witnessing explosive growth in the diversity and scale of earthquake catalogs. A key motivation for this community effort is that more data should translate into better earthquake forecasts. Such improvements, however, are yet to be seen. Here, we introduce the Recurrent Earthquake foreCAST (RECAST), a deep-learning model based on recent developments in neural temporal point processes. The model enables access to a greater volume and diversity of earthquake observations, overcoming the theoretical and computational limitations of traditional approaches. We benchmark against a temporal Epidemic Type Aftershock Sequence model. Tests on synthetic data suggest that with a modest-sized data set, RECAST accurately models earthquake-like point processes directly from cataloged data. Tests on earthquake catalogs in Southern California indicate improved fit and forecast accuracy compared to our benchmark when the training set is sufficiently long ($>10^4$ events). The basic components in RECAST add flexibility and scalability for earthquake forecasting without sacrificing performance.

Deep learning models provide a foundation to explore the contribution of different geophysical data sets, relax model assumptions, and incorporate more data into current forecasts. Despite these desirable characteristics, it remains unclear whether deep learning is well suited to earthquake data (Mignan & Broccardo, 2020); the lack of significant progress in machine learning forecasts has been attributed to an earthquake record that is highly stochastic and strongly influenced by extreme events. Our goal is to implement an earthquake forecasting model that meets the basic requirements of flexibility and scalability and to assess whether it is well suited for the task. This paper introduces a deep learning model: the Recurrent Earthquake foreCAST (RECAST). Section 2 describes the model and the benchmark for performance, a temporal Epidemic Type Aftershock Sequence (ETAS) model. Section 3 presents the results of four experiments comparing model performance on synthetic and observed earthquake catalogs. Finally, Section 4 discusses the advantages and limitations of the modeling approach in light of our results.

Model Architecture and Benchmark
RECAST builds on recent developments in machine learning known as neural temporal point processes (Du et al., 2016; Omi et al., 2019; Shchur et al., 2020, 2021; Zhou et al., 2021; Zhu et al., 2021). The model uses a general-purpose encoder-decoder neural network architecture that predicts the timing of the next event given the history of past events (Figure 1a) (Du et al., 2016; Mei & Eisner, 2017; Omi et al., 2019; Shchur et al., 2020, 2021; Upadhyay et al., 2018). Note that to facilitate benchmarking, we restrict our analysis to the timing of events and do not predict magnitudes. RECAST is based on the Gated Recurrent Unit neural network architecture (Cho et al., 2014), which encodes the variable-length history of past earthquakes, $\mathcal{H}(t_i) = \{(t_1, M_1), \ldots, (t_{i-1}, M_{i-1})\}$, where $t_i$ and $M_i$ respectively indicate the time and magnitude of the $i$th earthquake, into a fixed-dimensional hidden state vector $h_i$. The hidden state carries the influence of previous events forward. Next, the decoder uses $h_i$ to parametrize the probability density function of the next earthquake origin time,

$$f_{\mathrm{RECAST}}(t_i \mid \mathcal{H}(t_i)) = f(t_i \mid h_i). \quad (1)$$

We model the probability density function of the next event origin time with a Weibull mixture distribution. This choice makes training and simulation efficient and exact because it avoids iterative numerical approximations for evaluating and sampling the output distribution. A mixture distribution with a sufficiently large number of components can approximate any other probability distribution arbitrarily well, provided enough data (Shchur et al., 2020). RECAST output can equivalently be expressed continuously in time in terms of the conditional intensity function $\lambda(t \mid \mathcal{H}(t))$ commonly used in the point process literature. Text S5 in Supporting Information S1 explains this equivalence.
The model's architecture follows the application in Shchur et al. (2020). We extend the original implementation by introducing an encoder for continuous event marks, in this case, earthquake magnitudes. We also use a Weibull mixture instead of a log-normal mixture to ensure that $f(t_i \mid h_i)$ can have a non-zero intercept. Early testing suggested that the model hyperparameters did not need tuning for the earthquake forecasting task. See Text S1 in Supporting Information S1 for the full model description.
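To make the encoder-decoder concrete, the sketch below shows one way such a model could be assembled in PyTorch: a Gated Recurrent Unit encodes (inter-event time, magnitude) pairs into hidden states, and a linear decoder maps each hidden state to the weights, shapes, and scales of a Weibull mixture over the next inter-event time. This is a minimal illustration under our reading of the architecture, not the authors' code; the class name, layer sizes, and number of mixture components are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecastSketch(nn.Module):
    """Illustrative RECAST-style neural temporal point process (not the paper's code)."""
    def __init__(self, hidden_size=64, n_components=8):
        super().__init__()
        # Encoder: GRU over (log inter-event time, magnitude) pairs.
        self.gru = nn.GRU(input_size=2, hidden_size=hidden_size, batch_first=True)
        # Decoder: one linear map to (weight logits, shape, scale) per component.
        self.decoder = nn.Linear(hidden_size, 3 * n_components)

    def forward(self, inter_times, magnitudes):
        # inter_times, magnitudes: (batch, seq_len); log stabilizes the time scale.
        feats = torch.stack([torch.log(inter_times), magnitudes], dim=-1)
        h, _ = self.gru(feats)                    # h[:, i] summarizes events up to i
        logits, k_raw, lam_raw = self.decoder(h).chunk(3, dim=-1)
        log_w = F.log_softmax(logits, dim=-1)     # mixture weights sum to 1
        k = F.softplus(k_raw) + 1e-6              # Weibull shape > 0
        lam = F.softplus(lam_raw) + 1e-6          # Weibull scale > 0
        return log_w, k, lam

def weibull_mixture_log_prob(tau, log_w, k, lam):
    """Exact log-density of the next inter-event time tau under the mixture.
    During training, the state at position i scores the (i+1)-th inter-event time."""
    tau = tau.unsqueeze(-1)
    log_pdf = (torch.log(k) - torch.log(lam)
               + (k - 1.0) * (torch.log(tau) - torch.log(lam))
               - (tau / lam) ** k)
    return torch.logsumexp(log_w + log_pdf, dim=-1)
```

Because the mixture density has a closed form, both likelihood evaluation and sampling are exact, which is the property the text above attributes to the Weibull mixture choice.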
As a benchmark, we model the recurrence of earthquakes with a temporal ETAS model (Figure 1b), a parametric statistical model that calculates the time-dependent earthquake intensity $\lambda(t \mid \mathcal{H}(t))$ as a combination of a steady background rate and aftershock sequences (Aalen et al., 2008; Ogata, 1988). It is assumed that all earthquakes can produce aftershocks, and thus the current earthquake rate is an explicit function of the entire event history $\mathcal{H}(t)$. ETAS yields a probability density function for the next event time $t_i$ that is similar to Equation 1,

$$f_{\mathrm{ETAS}}(t_i \mid \mathcal{H}(t_i)) = \lambda(t_i \mid \mathcal{H}(t_i)) \exp\left(-\int_{t_{i-1}}^{t_i} \lambda(s \mid \mathcal{H}(s))\, ds\right). \quad (2)$$

See Text S2 in Supporting Information S1 for further implementation details.

RECAST and ETAS are both temporal point process models. They are trained by maximizing the joint log-likelihood of the cataloged origin times using the same optimization procedure (see Text S3 in Supporting Information S1). However, RECAST introduces practical benefits. Additional event features, such as locations, source properties, and geophysical data, can be introduced into the model. Furthermore, the model's components are modular and easily adapted using standard machine learning frameworks (Abadi et al., 2016; Paszke et al., 2019).

Another key distinction is computational efficiency. Whereas RECAST processes events sequentially, summarizing the event history into a fixed-dimensional vector (Figure 1a), ETAS references all previous events to determine the intensity function and, in turn, the event likelihood (Figure 1b). As a result, the computational time and space complexity for evaluating the likelihood of a catalog is linear for RECAST and quadratic for ETAS (Figure 1c). To enable the comparisons of this study, we wrote a new implementation of ETAS that utilizes GPU hardware and a more efficient gradient descent algorithm in PyTorch (see Text S2 in Supporting Information S1). Even with these improvements, the quadratic growth is computationally cumbersome, or potentially prohibitive, for the real-time evaluation of earthquake forecasts given catalogs with more than ∼10,000 events. This limitation is mitigated in RECAST, which can be trained on catalogs with more than 1,000,000 events on a single consumer-grade GPU (Figure S1 in Supporting Information S1).
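For contrast with the neural model, a bare-bones temporal ETAS evaluation is sketched below using the standard (mu, K, alpha, c, p) parameterization of the background rate and Omori-law aftershock kernel. It is an illustrative NumPy sketch under common assumptions (in particular p ≠ 1), not the paper's GPU implementation; it makes the quadratic cost explicit, since every event's intensity sums over all earlier events.

```python
import numpy as np

def etas_intensity(t, times, mags, mu, K, alpha, c, p, m0):
    """lambda(t | history): background rate plus Omori-law terms from past events."""
    past = times < t
    trig = K * np.exp(alpha * (mags[past] - m0)) / (t - times[past] + c) ** p
    return mu + trig.sum()

def etas_log_likelihood(times, mags, T, mu, K, alpha, c, p, m0):
    """Point-process log-likelihood on [0, T]:
    sum_i log lambda(t_i | H(t_i)) - integral_0^T lambda(s | H(s)) ds.
    The sum over all prior events inside each intensity makes this O(n^2)."""
    log_lam = sum(np.log(etas_intensity(t, times, mags, mu, K, alpha, c, p, m0))
                  for t in times)
    # Closed-form integral of each Omori kernel from its trigger time to T (p != 1).
    dur = T - times
    integral = (K * np.exp(alpha * (mags - m0))
                * (c ** (1.0 - p) - (dur + c) ** (1.0 - p)) / (p - 1.0))
    return log_lam - mu * T - integral.sum()
```

In this form, evaluating the likelihood of a 10,000-event catalog involves tens of millions of pairwise terms, which illustrates why the quadratic scaling of ETAS becomes cumbersome while RECAST's sequential, fixed-size hidden state does not.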
We recognize that this comparison to a time-only ETAS model poses a minimum performance benchmark. Substantial effort has gone into developing spatial ETAS models (Ogata, 1988, 2017; Zhuang, 2011). Heterogeneous background features complicate consistent benchmarking of RECAST across our multiple data sets. Furthermore, the computational disadvantages of even temporal ETAS required reimplementing the algorithm, as discussed above and in Text S2 in Supporting Information S1. Spatial ETAS would be even more complex to implement in a computationally tractable framework and is thus beyond the scope of this initial study. Here we retain the more transparent time-only benchmark as a consistent and reproducible measure of performance that allows us to highlight the new model's key features on large catalogs.

Synthetic Data
As a first step, we evaluate the performance of RECAST on a collection of synthetic catalogs generated by an ETAS model. The set of synthetic catalogs is split into training (600), validation (200), and test (200) sets. Each individual realization spans 10,000 days and contains ∼1,000 events on average. For both models, the training set is used to optimize the parameters. During training, we monitor performance on the validation set. After training, we recover the weights from the model with the best validation score. Finally, the comparison of model performance is reported for the reserved test set. ETAS will yield, in expectation, the highest possible joint log-likelihood on the test sequences because it is the generative model. If RECAST's performance approaches this limit, it would demonstrate that the model can be trained to capture realistic earthquake temporal clustering from the event data alone.
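The protocol above follows standard early-stopping practice, which can be summarized in a short training loop. In this hedged sketch, `nll_of_catalog` (the negative joint log-likelihood of one catalog under the model) and the catalog containers are assumed placeholders, not names from the paper's code.

```python
import copy
import torch

def fit(model, train_catalogs, val_catalogs, n_epochs=200, lr=1e-3):
    """Optimize on the training set, monitor validation score each epoch,
    and restore the weights with the best validation score before testing."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_val, best_state = float("inf"), None
    for _ in range(n_epochs):
        model.train()
        for catalog in train_catalogs:
            loss = nll_of_catalog(model, catalog)   # negative joint log-likelihood
            opt.zero_grad()
            loss.backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(nll_of_catalog(model, c).item() for c in val_catalogs)
        if val < best_val:                          # new best validation score
            best_val, best_state = val, copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)               # recover the best weights
    return model
```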

The conditional intensity function output by the trained RECAST model successfully captures, for instance, the sharp fluctuations associated with Omori's law (Figures 2a and 2b). The goodness of fit, over both the training and test sets, approaches the limiting performance of ETAS (Figure 2c). Increasing the volume of synthetic training data causes the performance of RECAST to approach that of ETAS asymptotically (Figure 3a). Without tailoring the RECAST architecture for the task, the model learns the statistical laws of seismology that compose ETAS from event data alone. The architecture for the input marks, in this case magnitudes, is general; the successful recovery of standard magnitude-dependent behavior therefore demonstrates the effectiveness of this approach to incorporating additional inputs into the model.

San Jacinto Fault Earthquake Catalog and Scaling
We next consider an actual earthquake catalog from 2008 to 2021 bounding the San Jacinto Fault (Figure S4 in Supporting Information S1) (White et al., 2019). This catalog makes a good test case because of its dense station coverage of a particularly seismically active area (see Text S6 in Supporting Information S1 for detail on data pre-processing). RECAST outperforms ETAS in terms of goodness-of-fit in the reserved test period (Figure 3b).
We again explore the effect of data size by tracking performance on the test set while incrementally extending the training period backward in time. In this experiment, the magnitude of completeness is fixed. ETAS is the preferred model when trained on fewer than 10,000 events; beyond 10,000 events, RECAST is preferred. The ETAS benchmark's performance saturates beyond ∼4,000 training events, whereas the log-likelihood score of RECAST continues to increase in proportion to the logarithm of the training-set length. The crossover appears to reflect fundamentally different scaling between training-set size and performance.
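This scaling experiment reduces to a simple loop: hold the test period fixed, extend the training window backward in increments, refit each model, and record its test score against the training-set size. In the schematic sketch below, `fit_model` and `test_log_likelihood` are assumed placeholders rather than functions from the paper.

```python
def scaling_curve(catalog, cut_times, test_start, model_names=("ETAS", "RECAST")):
    """Test log-likelihood versus training-set size for each model.
    Earlier cut times yield longer training windows; the test period is fixed."""
    test = [e for e in catalog if e.time >= test_start]
    scores = {name: [] for name in model_names}
    for cut in sorted(cut_times, reverse=True):      # step backward in time
        train = [e for e in catalog if cut <= e.time < test_start]
        for name in model_names:
            model = fit_model(name, train)           # placeholder: fit ETAS or RECAST
            scores[name].append((len(train), test_log_likelihood(model, test)))
    return scores
```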

Comparative Performance on Other Data Sets
Comparable improvements arise when considering the entire Southern California Earthquake Data Center catalog from 1981 to 2021 (Hutton et al., 2010) and subregions of the Quake Template Matching Catalog (Ross et al., 2019) (Figures S2-S5 and Table S1 in Supporting Information S1). Improvements are most pronounced in the smaller regions we consider. Consistent out-of-sample improvements on these disparate catalogs suggest that RECAST is robust to multiple methods of data production and to regional variations in seismicity.

Southern California Earthquake Catalog and 14-day Forecasts
We finally compare earthquake forecasts generated using ETAS and RECAST over extended time intervals. Once trained, RECAST provides a straightforward way to simulate potential catalog continuations. We generate a sample earthquake by drawing an origin time from the mixture distribution in Equation 1 and a magnitude from the Gutenberg-Richter distribution (Gutenberg & Richter, 1944). Repeatedly adding events to the catalog yields continuations spanning a range of potential outcomes; we refer to this set as an earthquake forecast (see Text S4 in Supporting Information S1 for implementation details). Specific aspects of a forecast, such as the abundance of events or the probability of a particularly large earthquake, can then be derived from statistics of the set of catalog continuations. Here, we consider 2-week earthquake forecasts for the Southern California earthquake catalog (Hutton et al., 2010), with an additional set of results for the San Jacinto fault catalog (Figures S7 and S8 in Supporting Information S1). Each 2-week forecast comprises 50,000 simulated catalog continuations during the testing period. We note that individual continuations exhibit realistic features, including aftershock decay, secondary bursts of aftershocks, and stochasticity (Figure 4a). To quantitatively compare forecasts, we compute the fraction of simulated continuations, r, that exactly reproduce the total number of events in the interval (Figure 4b and Figure S7b in Supporting Information S1). Forecast accuracy is measured by r, with larger values representing more accurate forecasts (see Text S5 in Supporting Information S1 for further details and rationale).

(Figure 3 caption: Error bars indicate the 95% confidence interval from 1,000 bootstrap samples over five random initializations of RECAST; the models were trained with incrementally longer training and validation sets. The inset in panel (b) shows the seismicity time series for the San Jacinto area (see Figure S3 in Supporting Information S1), and each white bar shows the corresponding periods used to train and validate ETAS and RECAST. RECAST approaches the theoretically optimal performance of the ETAS model as the training set size increases; for the real catalog, a training and validation catalog in excess of ∼10,000 earthquakes (fourth white bar from the top in the inset) results in a test period best modeled by RECAST.)
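The simulation loop behind such a forecast can be sketched compactly: starting from the encoded history, alternately draw the next inter-event time from the model's mixture distribution and a magnitude from the Gutenberg-Richter law, stop at the 14-day horizon, and score the forecast by the fraction r of continuations matching the observed event count. The helpers `model.encode`, `sample_next_inter_time`, and `update_hidden` are hypothetical stand-ins for the model internals; the magnitude draw uses the standard result that Gutenberg-Richter magnitudes above a cutoff are exponentially distributed.

```python
import numpy as np

def simulate_continuation(model, history, horizon_days=14.0, b=1.0, m_min=2.0,
                          rng=None):
    """One simulated catalog continuation over the forecast horizon."""
    rng = rng or np.random.default_rng()
    t_now = history[-1].time                     # time of the last observed event
    h = model.encode(history)                    # hidden state from past events
    t, events = t_now, []
    while True:
        tau = sample_next_inter_time(model, h, rng)   # draw from the Weibull mixture
        t = t + tau
        if t > t_now + horizon_days:
            break
        # Gutenberg-Richter: M - m_min ~ Exponential(rate = b * ln 10).
        mag = m_min + rng.exponential(scale=1.0 / (b * np.log(10.0)))
        events.append((t, mag))
        h = update_hidden(model, h, tau, mag)    # feed the sampled event back in
    return events

def forecast_accuracy(continuations, observed_count):
    """r: fraction of continuations reproducing the observed event count;
    log r is the log-score used to compare forecasts (undefined if r = 0)."""
    counts = np.array([len(c) for c in continuations])
    return float(np.mean(counts == observed_count))
```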
The majority of 2-week intervals in the test set were best forecast by RECAST (138 of 155 14-day intervals; Figures 4c and 4d, Figures S7c and S7d in Supporting Information S1). The best-forecast periods tend to occur during background activity; nonetheless, RECAST also performs better following the largest earthquakes. Another measure of success is the ability to capture the full range of observations: 14-day windows with outcomes outside the 95% confidence interval of the forecasts are far less frequent for RECAST than for ETAS (14% vs. 33%, respectively; Figures S6 and S8 in Supporting Information S1). Outcomes not featured in the entire set of 50,000 simulations occurred only once for the RECAST forecasts and 10 times for the ETAS forecasts (Figure 4c). These complete misses for ETAS occur during the Ridgecrest earthquake aftershock sequence. While RECAST does not foresee the Ridgecrest sequence, it quickly recovers relatively accurate forecasts after the onset of the sequence.

Discussion
Our analysis shows both the advantages and limitations of the RECAST model. Some are inherent to deep learning, while others stem from the choices in the model architecture.
RECAST is fundamentally different from ETAS. The exact functional form that relates catalog data to the likelihood of the next event time is not required. This aspect is well illustrated in our tests on synthetic catalogs. Prior to training, RECAST only assumes that inter-event times will be strictly positive. The RECAST architecture can be trained on a modest-sized data set to learn the relationship between the inputs, in this case timing and magnitude, and the output, the event likelihood. This flexibility to incorporate new data and relationships is a key advantage. In contrast, ETAS is a parametric model and requires known functional relationships. The benefit of the parametric approach is that fewer parameters are required. The drawback is that the forecast is limited by the assumptions of the model, which are often broken either by physical changes in the system or by artifacts introduced in preparing the catalog data. This is why RECAST may have the potential to outperform ETAS on real earthquake catalogs.
The new model requires a sufficiently large earthquake catalog for training. As exemplified by the benchmarks from the San Jacinto Fault zone (Figure 3b), the temporal ETAS model outperforms RECAST for small data sets. In this experiment, RECAST needs a larger catalog ($10^4$ events) before it outperforms the benchmark. Much like ETAS, we anticipate poor performance if the training set differs from the period over which the model is evaluated. The improvements across every other test we conduct suggest that the result is robust for such dense earthquake catalogs. Intriguingly, the relative gap in performance between RECAST and ETAS grows with catalog size, suggesting persistent improvements that may scale with increasingly available data. Unlike ETAS, RECAST's model architecture ensures that the volume of data currently available does not pose computational limits on this trend. Historically, there was little reason for a method to use information from more than a few thousand events; catalogs now offer far more and demand the new technology.

(Figure 4 caption, panels (c) and (d): (c) Test catalog with the evolution of the log-score for the tested 14-day forecast intervals. Empty orange markers and the corresponding annotations indicate the intervals where both model forecasts (1 instance) or only the ETAS model forecast (9 instances) yielded no accurate forecast (r = 0, so log r is undefined); these all occur during the Ridgecrest earthquake sequence. (d) Comparison of the relative accuracy (forecast log-likelihood) of the RECAST and ETAS models. A positive log-score indicates a more accurate RECAST forecast; in most intervals, RECAST is more accurate.)
It is difficult to diagnose the exact cause of the improvements, but we can highlight some diagnostics. First, the more pronounced improvements occur in the smaller subregions (Table S2, Figures S7 and S8 in Supporting Information S1). Second, the temporal ETAS model appears to underfit event sequences and does not account for long-term trends in seismicity. In the case of the San Jacinto data set, ETAS under-predicts during periods of high seismicity and over-predicts during periods of low seismicity (e.g., Figure S8 in Supporting Information S1). In the case of the SCEDC catalog, ETAS forecasts reflect the high base rate of seismicity in the training set and are biased accordingly, consistently over-predicting the base rate in the testing interval (Figure S6 in Supporting Information S1). Although RECAST uses the same training periods in all cases, the model's memory (the hidden state in Equation 1) allows it to account for time-varying trends in the validation and test sets and, as a result, may better track a constantly evolving seismicity rate. Altering the ETAS training procedure to retrain on a fixed window directly before each forecast might address this model limitation.
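A minimal sketch of that sliding-window recalibration idea, under the assumption that `fit_etas` and `forecast` exist as placeholders for refitting and simulation routines (neither is from the paper):

```python
def sliding_window_forecasts(catalog, forecast_starts, window_days=3650.0,
                             horizon_days=14.0):
    """Refit ETAS on a fixed-length window of recent events before each forecast,
    so the background rate tracks current rather than historical seismicity."""
    forecasts = []
    for t0 in forecast_starts:
        window = [e for e in catalog if t0 - window_days <= e.time < t0]
        params = fit_etas(window)            # placeholder: recalibrate on recent data
        forecasts.append(forecast(params, window, t0, horizon_days))
    return forecasts
```

Whether such recalibration closes the gap with RECAST is an open question; the hidden-state memory in RECAST achieves a similar effect without refitting.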
Improvements could stem from uncovering a physical process that governs the predictability of the system. Another interpretation is that deep learning models such as RECAST jointly address failure modes associated with standard parametric models such as ETAS. Current practice entails many moving pieces. It is challenging to introduce and weigh the contributions of new assumptions about the data regarding non-stationarity (Kumazawa & Ogata, 2014; Llenos & Michael, 2019; Muir & Ross, 2023), anisotropy (Ogata, 2011), catalog incompleteness (Mizrahi et al., 2021; Page et al., 2016), magnitude errors (Werner & Sornette, 2008), changes in instrumental coverage (Woessner & Wiemer, 2005), pathologies such as orphaned aftershocks (van der Elst, 2017), criticality (Helmstetter & Sornette, 2001), and missing links (Wang et al., 2010). A subset of these factors, as implemented in state-of-the-art models, may close the performance gap between ETAS and RECAST. In our experiments, it seems most important to address long-term trends in seismicity, either with a non-stationary ETAS implementation (Kumazawa & Ogata, 2014; Llenos & Michael, 2019) or with a sliding window for recalibration. Such prioritization is clear after the fact and may not hold robustly in further experiments. To the best of our knowledge, no model concurrently addresses the full suite of limitations that ETAS models face, mainly due to the complexity of coordinating the competing effects while preserving the benefits of the original model. The advantage of deep learning as an approach may be that modeling builds on well-maintained and rapidly evolving libraries and provides a path forward that relies less heavily on idealizations of the data.

Conclusion
Dense earthquake catalogs reveal a richer window into the seismic cycle. Yet it has been a challenge to translate finer detail and increased data volume into improved forecasts. A deep-learning approach such as RECAST is sufficiently general to recover well-known statistical patterns. Our experiments on synthetic data sets suggest that, given a modest number of events, RECAST performs at least as well as a temporal ETAS model; on real-world catalogs, RECAST may also capture processes that are not explicitly parameterized in an ETAS model.
This study is a proof of concept for the application of neural temporal point process models to earthquake forecasting. More generally, neural temporal point process models and deep learning provide important advantages. They do not require prior knowledge of the functional relationship between earthquake features and forecast probabilities. With increasingly large earthquake data sets, their design enables increasingly complex relationships to be derived directly from the data. It remains to be seen whether the increase in performance is due to more accurately capturing the observational limitations of catalogs or reflects a physical process that controls long-term seismicity evolution. In either case, forecasters can capitalize on the production of enhanced catalogs by leveraging the flexibility and scalability of deep learning.