Forecasting the 2016-2017 Central Apennines Earthquake Sequence with a Neural Point Process

Point processes have been dominant in modeling the evolution of seismicity for decades, with the Epidemic Type Aftershock Sequence (ETAS) model being most popular. Recent advances in machine learning have constructed highly flexible point process models using neural networks to improve upon existing parametric models. We investigate whether these flexible point process models can be applied to short-term seismicity forecasting by extending an existing temporal neural model to the magnitude domain and we show how this model can forecast earthquakes above a target magnitude threshold. We first demonstrate that the neural model can fit synthetic ETAS data, however, requiring less computational time because it is not dependent on the full history of the sequence. By artificially emulating short-term aftershock incompleteness in the synthetic dataset, we find that the neural model outperforms ETAS. Using a new enhanced catalog from the 2016-2017 Central Apennines earthquake sequence, we investigate the predictive skill of ETAS and the neural model with respect to the lowest input magnitude. Constructing multiple forecasting experiments using the Visso, Norcia and Campotosto earthquakes to partition training and testing data, we target M3+ events. We find both models perform similarly at previously explored thresholds (e.g., above M3), but lowering the threshold to M1.2 reduces the performance of ETAS unlike the neural model. We argue that some of these gains are due to the neural model's ability to handle incomplete data. The robustness to missing data and speed to train the neural model present it as an encouraging competitor in earthquake forecasting.


Plain Language Summary
For decades, the Epidemic-Type Aftershock Sequence (ETAS) model has been the most popular way of forecasting earthquakes over short time spans (days/weeks). It is formulated mathematically as a point process, a general class of statistical model describing the random occurrence of points in time. Recently, the machine learning community has used neural networks to make point processes more expressive, calling them neural point processes. In this study we investigate whether a neural point process can compete with the ETAS model. We find that the two models perform similarly on computer-simulated data; however, the neural model is much faster with large datasets and is not hindered if there are missing data for smaller earthquakes. Most earthquake catalogs contain missing data due to varying capability in our detection methods, therefore we need models that are robust to this missingness. We then find that the neural model outperforms ETAS on a new catalog for the 2016-2017 Central Apennines earthquake sequence, which through machine learning detection contains thousands of previously undetected small magnitude events. We argue that some of this improvement can in fact be explained by missing data. These results present neural point processes as an encouraging competitor in earthquake forecasting.

Introduction
The construction of machine learning algorithms for detecting the arrival times of earthquake phases (e.g., Zhu and Beroza (2019)), combined with an accelerated growth in the number of seismic sensors, has meant that the sizes of earthquake catalogs have grown substantially (Kong et al., 2019). With the amount of available seismicity data increasing, current forecasting methods that include the full history of the catalog in the form of all event pairs are increasingly inefficient and might not be flexible enough to incorporate this additional data; thus the need for the application of methods developed in the machine learning community is becoming more apparent. However, there exists a disconnect between the tools used by statistical seismologists and those in the machine learning community that apply their methods to seismic data. This work attempts to bridge that gap (to some extent) by considering a machine learning variant of point processes. Point processes are a class of models that contain the Epidemic-Type Aftershock Sequence (ETAS) model, a widely accepted and used point process model for earthquakes (Ogata, 1988, 1998; Marzocchi et al., 2014; Mancini et al., 2019). In working with a machine learning method with a conditional intensity function (the function that explicitly defines a point process) (Zhuang et al., 2012), we present a model that is directly comparable to ETAS models, the current benchmark for short-term earthquake forecasting, yet now with desirable properties such as flexibility and scalability. The machine learning variant of point processes we introduce is known as a neural point process.
At the heart of neural point processes is the use of a recurrent neural network (RNN) to learn a compact representation of the history of events, first introduced by Du et al. (2016). The sequential nature of the way data pass through RNNs makes them an ideal modeling tool for temporal data. Instead of directly summing over all past events, as in models based on the Hawkes process (Hawkes, 1971), a fixed-length vector representation of the past is learnt and updated at each new time step. Forecasts can then be made by modelling the conditional intensity function on this vector (Xiao et al., 2017; Li et al., 2018; Upadhyay et al., 2018; Huang et al., 2019; Omi et al., 2019), or instead through directly modelling the probability of the next event (Shchur et al., 2019). We direct the reader to Shchur et al. (2021) for a review of neural point process models. None of these models, however, is directly suitable for describing the temporal behaviour of seismicity, which includes a continuous mark space for the magnitude of each earthquake as well as the times. We require a model that is dependent on continuous marks as well as forecasts them. To the best of our knowledge, neither of these requirements is satisfied in the existing temporal point process literature.
In this work we extend the architecture introduced by Omi et al. (2019) so that it may also deal with earthquake magnitudes. For this we require a model that is dependent on previous event magnitudes, can forecast subsequent magnitudes, and can forecast earthquakes above a threshold magnitude. Distinguishing between a threshold for the input magnitude and a threshold for the target earthquakes is a problem-specific requirement for earthquake forecasting, so it does not exist in other works on temporal point processes.
We choose to extend Omi et al. (2019) to give the most flexible representation of the intensity, since they use a fully non-parametric approach compared to other intensity-based methods that use a semi-parametric approach. Working with the intensity function rather than directly modelling the likelihood of the next event provides a model that is closer in interpretation to ETAS and provides a natural way to forecast earthquakes above some target threshold magnitude, detailed in Section 4.2.
Although the use of the ETAS model (Ogata, 1988) has been the dominant way of modelling seismicity in both retrospective and fully prospective forecasting experiments (e.g., Woessner et al., 2011; Taroni et al., 2018; Cattania et al., 2018; Mancini et al., 2019, 2020) as well as in operational earthquake forecasting (Marzocchi et al., 2014; Rhoades et al., 2016; Field et al., 2017), it is restricted to its rigid parametric form. As ETAS only describes the self-exciting nature of seismicity, it cannot capture any kind of inhibition or release of stress, such as captured by stress-release models (Zheng & Vere-Jones, 1991; Xiaogu & Vere-Jones, 1994; Bebbington & Harte, 2003) or models based on elastostatic stress transfer and Coulomb Rate-and-State (CRS) friction (Dieterich, 1994). Furthermore, foreshock activity that differs from ETAS has also been observed (McGuire et al., 2005; Brodsky, 2011; Lippiello et al., 2012; Ogata & Katsura, 2014). Beyond this understanding that ETAS is misspecified, there are also difficulties and inefficiencies with fitting and forecasting. To estimate the intensity, ETAS sums over all previous earthquakes, which requires substantial memory and slows the fitting process and forecasting simulations. For large earthquakes in the past this is important, because their contribution can last more than 100 years (Utsu et al., 1995). However, for smaller earthquakes, particularly those found in enhanced catalogs, one expects the contribution to be close to zero after a far shorter amount of time, making summing over these terms inefficient (Helmstetter & Sornette, 2003; Marsan & Lengline, 2008). Furthermore, a particular difficulty with fitting the ETAS model is that there needs to be a reliable estimate of the completeness across the time span of the catalog, and this needs to be incorporated into the model. Failing to do so will result in biases (Hainzl, 2016b; Zhuang et al., 2017; Seif et al., 2017).
Methods that attempt to do this either truncate the periods of time where the catalog is most incomplete (Kagan, 1991; Hainzl et al., 2008), leading to parameters that can be dominated by a few aftershock sequences, or attempt to model the data incompleteness itself (Omi et al., 2014; Hainzl, 2016a, 2016b; Mizrahi et al., 2021b), but this adds additional computational requirements over a standard ETAS model.
To benchmark our proposed neural point process against ETAS, we design forecasting experiments on both synthetic data and a new enhanced catalog for the 2016-2017 Amatrice-Visso-Norcia (AVN) seismic sequence. The catalog produced by Tan et al. (2021b), containing roughly 900,000 earthquakes, was generated using a deep neural network based phase picker for earthquake arrival times (Zhu & Beroza, 2019). As a result, the size of the catalog increased 10-fold from the routine catalog generated by the Italian National Institute of Geophysics and Volcanology (INGV). The INGV catalog has been used in several retrospective forecasting experiments (Ebrahimian & Jalayer, 2017; Marzocchi et al., 2017; Mancini et al., 2019, 2022), but there has yet to be much development of forecasting models using enhanced catalogs such as this one and, given that they contain considerably more earthquakes, investigations into how we can harness these machine-learning-generated enhanced catalogs are essential. The AVN sequence contains ten M5+ events during a five-month period over an 80 km long normal-fault system (Mancini et al., 2019). The number of large earthquakes as well as the compactness of the region on which they occur make this sequence preferable for testing purely temporal forecasting models which contain no spatial covariates. We do still expect some loss of information by ignoring the spatial covariates, particularly for smaller earthquakes where there is a spatial extent across which earthquakes won't interact.
We seek to understand how taking different magnitude thresholds and temporal partitions of our datasets affects the performance of the two models. By altering these two aspects, we naturally change the amount of data shown to each model so that we may see how sensitive their forecasting performance is to training sample size (Wang et al., 2010). Through partitioning in time we can see how the performance is affected by the number of major earthquakes that each model is trained on. By altering the magnitude threshold of the input catalog, we seek to improve the predictive skill of forecasts under the hypothesis that small earthquakes should help to forecast moderate-to-large earthquakes. Either from a time-independent perspective, where large earthquakes are found to nucleate in areas that have a large density of small events (Kafka & Walcott, 1998; Kafka & Levin, 2000), or in time-dependent forecasting (e.g., ETAS and CRS), where earthquake triggering is believed to exist at all scales (Helmstetter & Sornette, 2003; Marsan, 2005; Nandan et al., 2016), reducing the threshold of the input catalog generally leads to improved forecasting performance of moderate-to-large events (Helmstetter et al., 2007; Werner et al., 2011; Helmstetter & Werner, 2014; Mancini et al., 2022).
In particular, Mancini et al. (2022) consider the same sequence as this study and compare forecasting results from models trained on several different enhanced catalogs (including the Tan et al. (2021b) catalog used in this study). They find that the forecasting of M3+ events by CRS and ETAS models is not improved by training on the enhanced catalogs. When using the same catalog as this study, they see the models increase in performance as the input magnitude threshold is lowered from M5 to M3, but at the lowest two thresholds, M1 and M2, the performance of ETAS is worse than for M3+. Direct comparison of results, however, isn't possible, as they report information gains for spatio-temporal forecasts using the Poisson assumption of earthquake rates in gridded space-time windows. This study hopes to provide some further insight into the performance of forecasting models using the low-magnitude earthquakes found in this catalog and presents neural point processes as a competitive model using such events.

Data
To benchmark our neural point process against ETAS, we conduct forecasting experiments on both real data from the Amatrice-Visso-Norcia sequence and synthetic data generated by ETAS. Since we are making comparisons between temporal models, we remove all spatial covariates from both catalogs.
Figure 1: The magnitudes and times of the AVN sequence 2016-2017 (Tan et al., 2021b) used to evaluate the performance of the neural and ETAS models. The times of the 4 major events of the sequence are marked with dashed red lines. The sizes of the points are plotted on a log scale corresponding to Mw. An estimate of the temporal completeness M_0(t) is plotted using the maximum curvature method (Wiemer & Wyss, 2000).

Amatrice-Visso-Norcia High Resolution Catalog
On the 24th of August 2016, a Mw 6.0 earthquake was recorded near the town of Amatrice in northern Lazio, central Italy. It was followed by a Mw 5.9 event near the town of Visso on the 26th of October and a Mw 6.5 event near the town of Norcia four days later. Finally, in January 2017, four events between Mw 5.0 and Mw 5.5 struck the Campotosto area. Figure 1 depicts the evolution of this seismic sequence over time.
The INGV produced a routine catalog for the 1-year period containing this sequence (Chiaraluce et al., 2017). Their catalog contains roughly 82,000 events with a completeness of Mc = 2.3 (Mancini et al., 2019). With the use of a neural network based phase picker to determine P- and S-wave arrival times, an enhanced catalog has been created for the same earthquake sequence (Tan et al., 2021b). This catalog contains around 900,000 events and has an overall completeness of Mc = 0.2. We estimated the time-varying completeness of this catalog using the maximum curvature method (Wiemer & Wyss, 2000) with samples of 300 events and can see clear variation in completeness, particularly following large-magnitude earthquakes. This approximate method for the completeness is only used to show the variability across the catalog and is not used directly in any modelling.
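The maximum curvature estimate described above can be sketched in a few lines: take the centre of the most populated magnitude bin as Mc, then repeat the estimate over successive windows of events. This is an illustrative sketch, not the code used in the study; the function names, the 0.1 bin width, and the use of non-overlapping windows are our assumptions.

```python
import numpy as np

def maxc_completeness(mags, bin_width=0.1):
    """Maximum curvature estimate of the completeness magnitude Mc:
    the centre of the magnitude bin holding the most events
    (Wiemer & Wyss, 2000)."""
    bins = np.arange(mags.min(), mags.max() + bin_width, bin_width)
    counts, edges = np.histogram(mags, bins=bins)
    peak = np.argmax(counts)
    return 0.5 * (edges[peak] + edges[peak + 1])

def rolling_completeness(mags, window=300, bin_width=0.1):
    """Time-varying Mc from non-overlapping windows of `window` events,
    mirroring the 300-event samples used in the text."""
    return np.array([
        maxc_completeness(mags[i:i + window], bin_width)
        for i in range(0, len(mags) - window + 1, window)
    ])
```

A common refinement adds a fixed correction (e.g. +0.2) to the maximum-curvature value, since the raw estimator is known to be biased low; the text uses the estimate only for visualisation, so the sketch omits this.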

Synthetic Catalog
For the forecasting experiment on synthetic ETAS data, we generate a dataset using the simulator of Mizrahi et al. (2021a), with a uniform background intensity µ and the standard ETAS triggering function. Removing all the spatial covariates yields a dataset of roughly 250,000 events.
We also generate a second synthetic dataset from the first by emulating short-term aftershock incompleteness using the time-dependent completeness function from Helmstetter et al. (2006), M_c(t) = M − 4.5 − 0.75 log_10(t), where M is the mainshock magnitude and t is the time since the mainshock in days. Events below the function are removed, using the six largest events in this synthetic catalog as mainshocks.
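The deletion step can be sketched as follows. This is a minimal illustration assuming the Helmstetter et al. (2006) form M_c(t) = M − 4.5 − 0.75 log_10(t) with t in days; the function names and the exact handling of overlapping aftershock windows are our assumptions, not a description of the study's implementation.

```python
import numpy as np

def mc_of_t(t_days, mainshock_mag):
    """Time-dependent completeness after a mainshock
    (Helmstetter et al., 2006): Mc(t) = M - 4.5 - 0.75*log10(t)."""
    return mainshock_mag - 4.5 - 0.75 * np.log10(t_days)

def emulate_incompleteness(times, mags, n_mainshocks=6):
    """Delete events falling below Mc(t) of any of the largest
    `n_mainshocks` events, emulating short-term aftershock
    incompleteness in an otherwise complete catalog."""
    main_idx = np.argsort(mags)[-n_mainshocks:]
    keep = np.ones(len(times), dtype=bool)
    for i in main_idx:
        dt = times - times[i]          # days since this mainshock
        after = dt > 0
        keep[after] &= mags[after] >= mc_of_t(dt[after], mags[i])
    return times[keep], mags[keep]
```

Note that Mc(t) decays below any fixed threshold within days, so only events shortly after each mainshock are removed, which is exactly the short-term incompleteness the experiment targets.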

Theoretical Background
In this section we briefly introduce the theory of neural point processes. We begin with the basic theory of temporal point processes and show how to use this to construct neural point processes for temporal forecasts only. In Section 4, we extend a temporal neural point process to a marked neural point process.

Point Processes
A temporal point process (Daley et al., 2003) is a stochastic process that generates a sequence of discrete events at times {t_i}_{i=1}^n. The process is characterised by its conditional intensity function λ(t|H_t), which gives the expected number of events in a small interval about t, conditioned on the event history H_t = {t_i : t_i < t}. The intensity function (Rasmussen, 2018) completely defines the point process and can take a variety of functional forms. The most basic form is the stationary Poisson process (Daley & Vere-Jones, 2003), which assumes that all events are independent of each other and that the conditional intensity function is constant. Self-exciting point processes assume that events increase the likelihood of subsequent events, with a popular class of these processes being the Hawkes process (Hawkes, 1971). The Hawkes process is defined by its conditional intensity λ(t|H_t) = µ + Σ_{t_i<t} g(t − t_i), where g(s) is a kernel function defining how past events trigger subsequent events.
With the conditional intensity function specified, we can write an expression for the log-likelihood of observing a sequence of events {t_i}_{i=1}^n:

log L({t_i}) = Σ_{i=1}^n log λ(t_i|H_{t_i}) − ∫_{t_1}^{t_n} λ(t|H_t) dt.    (1)

We assume that the observation interval is [t_1, t_n] to make the algebra in the remainder more compact. It is trivial to relax this to an arbitrary interval [0, T].
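As a concrete illustration of this likelihood, the sketch below evaluates it for a Hawkes process with an exponential kernel g(s) = αβ e^{−βs}. The kernel choice is ours for tractability (ETAS uses an Omori-type kernel); note the nested loop over the full history, which is the O(N²) cost discussed later.

```python
import numpy as np

def hawkes_loglik(times, mu, alpha, beta):
    """Log-likelihood (eq. 1) of event times under a Hawkes process
    with exponential kernel g(s) = alpha * beta * exp(-beta * s)."""
    times = np.asarray(times, dtype=float)
    ll = 0.0
    for i in range(len(times)):
        s = times[i] - times[:i]                 # times since past events
        lam = mu + np.sum(alpha * beta * np.exp(-beta * s))
        ll += np.log(lam)
    # compensator: integral of the intensity over [t_1, t_n],
    # available in closed form for the exponential kernel
    T0, T1 = times[0], times[-1]
    integral = mu * (T1 - T0) + np.sum(
        alpha * (1.0 - np.exp(-beta * (T1 - times)))
    )
    return ll - integral
```

Setting alpha = 0 recovers the homogeneous Poisson log-likelihood, a useful sanity check when implementing the compensator.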
Temporal point processes can be extended to incorporate marks. A marked point process is a stochastic process that generates events paired with a mark, {(t_i, m_i)}_{i=1}^n. In our setting this represents the occurrence times of earthquakes with their corresponding magnitudes. A marked point process is defined by its conditional intensity function λ(t, m|H_t), with the log-likelihood of observing a marked sequence of events given by

log L({t_i, m_i}) = Σ_{i=1}^n log λ(t_i, m_i|H_{t_i}) − ∫_{t_1}^{t_n} ∫_M λ(t, m|H_t) dm dt.    (2)

Neural Point Processes
The most common form of neural point process seeks to obtain a compact representation of the event history through the use of an RNN (Du et al., 2016). In this approach, an input representing the inter-event times τ_i = t_i − t_{i−1} is first fed into the RNN. The hidden state h_i of the RNN is updated as

h_i = σ(W_h h_{i−1} + w_τ τ_i + b_h),

where {W_h, w_τ, b_h} are learnable parameters and σ is an activation function. The conditional intensity function is then formulated as a function of the elapsed time from the most recent event and is dependent on the hidden state of the RNN,

λ(t|H_t) = ϕ(t − t_i, h_i),

where ϕ is a non-negative function referred to as the hazard function and t_i is the time of the most recent event. To avoid having to numerically integrate the intensity function directly, Omi et al. (2019) model the integral of the hazard function, Φ(τ, h_i) = ∫_0^τ ϕ(s, h_i) ds, with a fully connected neural network. With the construction of the model in this way, the log-likelihood of observing a sequence of event times (1) can be written as:

log L({t_i}) = Σ_{i=1}^{n−1} [ log ϕ(τ_{i+1}, h_i) − Φ(τ_{i+1}, h_i) ].
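A minimal numpy sketch of this recurrence and of a one-layer cumulative-hazard network is given below. All weights, sizes, and activation choices here are illustrative random placeholders; Omi et al. (2019) use a deeper network whose weights are learnt by backpropagation, and the derivative of Φ comes from automatic differentiation rather than finite differences.

```python
import numpy as np

rng = np.random.default_rng(1)
H = 8                                   # hidden size (illustrative)
W_h = rng.normal(scale=0.3, size=(H, H))
w_tau = rng.normal(scale=0.3, size=H)
b_h = np.zeros(H)

def rnn_step(h_prev, tau):
    """h_i = sigma(W_h h_{i-1} + w_tau * tau_i + b_h), sigma = tanh."""
    return np.tanh(W_h @ h_prev + w_tau * tau + b_h)

# Cumulative hazard Phi(tau, h): positive weights on tau guarantee
# that Phi is increasing in tau, so phi = dPhi/dtau is non-negative.
v_tau = np.abs(rng.normal(size=H))      # positive weights
V_h = rng.normal(scale=0.3, size=(H, H))
w_out = np.abs(rng.normal(size=H))      # positive output weights

def cum_hazard(tau, h):
    z = np.tanh(v_tau * tau + V_h @ h)
    return np.sum(w_out * np.log1p(np.exp(z)))   # softplus keeps Phi > 0

def hazard(tau, h, eps=1e-5):
    """phi = dPhi/dtau, here by central finite differences; in practice
    this derivative is obtained by automatic differentiation."""
    return (cum_hazard(tau + eps, h) - cum_hazard(tau - eps, h)) / (2 * eps)
```

The key structural point is that each forecast depends on the history only through the fixed-length vector h_i, which is what gives the model its O(N) likelihood evaluation.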

Methods
Since the neural point process introduced by Omi et al. (2019) is purely temporal, to model seismicity we must extend their model to our requirements. In particular, we require that forecasts be dependent on the history of both times and magnitudes, as magnitudes are an important predictor of seismicity (Utsu, 1970, 1971). We also require a forecast over the next magnitude where, unlike the Gutenberg-Richter law (Gutenberg & Richter, 1936) routinely assumed in the ETAS framework, this is also dependent on the history of events. Finally, we require that we may make forecasts of earthquakes above some target magnitude threshold despite a dependence on earthquakes below that target threshold. In Section 4.1, we first extend the structure of the neural network by Omi et al. (2019) to maximise the likelihood of observing a marked sequence of events, including constructing a time-history dependent magnitude distribution. In Section 4.2, we show how we adjust this new structure to target events above a magnitude threshold.
To aid in the development of more flexible forecasting models, we will make both the dataset and models used in this study available after publication on https://github.com/ss15859/Neural-Point-Process.

Continuously Marked Neural Point Process
We begin with the factorisation of the joint conditional intensity function into its marginal intensity and conditional density function, following Daley et al. (2003):

λ*(t, m) = λ*(t) f*(m|t),

where λ*(t) is the marginal conditional intensity function of t, and f*(m|t) is the conditional density function of the mark at time t. Both of these functions are dependent on the history H_t, here denoted by the asterisk *. To construct the likelihood for the marked sequence, we model these two functions separately.
With this factorisation, the expression for the log-likelihood of observing the marked sequence of events (2) becomes

log L({t_i, m_i}) = Σ_{i=1}^n log λ*(t_i) + Σ_{i=1}^n log f*(m_i|t_i) − ∫_{t_1}^{t_n} λ*(t) dt.    (3)

Now with a two-dimensional input, the hidden state of the RNN is updated as a linear combination of the inter-event times and magnitudes,

h_i = σ(W_h h_{i−1} + w_τ τ_i + w_m m_i + b_h),

which is the continuous-mark extension of the RNN update from Du et al. (2016), where w_m is an additional learnable parameter.
The marginal intensity function is formulated as a function of the elapsed time from the most recent event and is dependent on the hidden state of the RNN (Du et al., 2016),

λ*(t) = ϕ(t − t_i, h_i),

where ϕ is a non-negative function referred to as the hazard function and t_i is the time of the most recent event. Following Omi et al. (2019), we model the cumulative hazard function with a fully connected neural network, which allows us to differentiate it with respect to τ to extract the hazard function. The derivative is easily obtained through automatic differentiation (Van Merriënboer et al., 2018), which is available in all neural network packages.
We now also formulate the conditional density function of the mark at time t as a function of the current mark. This is dependent on the time since the most recent event and the hidden state of the RNN,

f*(m|t) = f(m | t − t_i, h_i).

We again model its cumulative distribution, F*(m|t) = ∫_{M_0}^m f*(m′|t) dm′, with a fully connected neural network. Although this integral does not feature in the expression for the log-likelihood, we still opt for this approach over directly modelling the density function with a neural network.
This follows from work on neural density estimation, where positive weights can be enforced in the network to capture the positivity and monotonicity of cumulative distribution functions (Chilinski & Silva, 2020). We can then obtain the density function again through automatic differentiation.
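The positive-weight construction can be illustrated with a toy monotone network: a weighted mixture of sigmoids with positive weights is automatically a valid, increasing CDF, and the density follows by differentiation. The architecture, sizes, and random weights below are our illustrative assumptions (in the actual model the weights are trained and the derivative comes from automatic differentiation).

```python
import numpy as np

rng = np.random.default_rng(2)
K = 16
a = 0.5 + np.abs(rng.normal(size=K))   # positive slopes => F increasing in m
b = rng.normal(size=K)
c = np.abs(rng.normal(size=K))         # positive mixture weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cdf(m):
    """Monotone network for F(m): mixture of sigmoids with positive
    weights, normalised so F(-inf)=0 and F(inf)=1."""
    return np.sum(c * sigmoid(a * m + b)) / np.sum(c)

def pdf(m):
    """f(m) = dF/dm, written analytically here for the toy network;
    the paper obtains it via automatic differentiation."""
    s = sigmoid(a * m + b)
    return np.sum(c * a * s * (1 - s)) / np.sum(c)
```

Because every weight acting on m is positive and every activation is increasing, monotonicity of F holds by construction, which is exactly the property Chilinski & Silva (2020) exploit.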
We can now write the log-likelihood as:

log L({t_i, m_i}) = Σ_{i=1}^{n−1} [ log ϕ(τ_{i+1}, h_i) − Φ(τ_{i+1}, h_i) + log f(m_{i+1} | τ_{i+1}, h_i) ].    (7)

We enforce positive weights in both fully connected sections of the network to capture the positivity and monotonicity that are required by both cumulative functions. In the final component of the network, we formulate the output as the log-likelihood of observing the pair {τ, m}. To construct this output, one backward pass is performed to calculate the derivatives with respect to the next time and the next magnitude found in eq. (7). This output is exactly what is maximised to learn the weight parameters during a second backward pass.

Target Events
The growth of machine-learning-generated catalogs over their predecessors comes from detecting events of magnitude much lower than previously possible. However, operationally, we may not care about forecasting these smaller events if it is the larger ones that are the most hazardous. The aim is therefore to use the smaller events to forecast earthquakes above some target threshold. In this section we outline how this is done for both ETAS and the neural model. We hereafter call events above the target magnitude threshold M_d target events.
Let {(t_i, m_i)}_{i=1}^n ∈ [0, T] × M be the entire sequence of events, complete down to M_0. We seek to make forecasts of events above magnitude M_d. This corresponds to the sequence {(t_j, m_j) : m_j ≥ M_d}. To ease the distinction between 'all events' and the target events, we subscript the former with i and the latter with j.
Let λ_0(t, m|H_t) denote the joint intensity function of all events above M_0. We seek to learn the intensity of events above M_d, denoted λ_d(t, m|H_t), where the history H_t contains all events {(t_i, m_i)}_{i:t_i<t}. The ground intensity above the target threshold is found by marginalising the joint intensity over the target magnitude region,

λ_d(t|H_t) = ∫_{M_d}^∞ λ_d(t, m|H_t) dm.

The log-likelihood of target events is then given by:

log L({t_j, m_j}) = Σ_j log λ_d(t_j, m_j|H_{t_j}) − ∫_0^T λ_d(t|H_t) dt.    (8)

For ETAS, the rate above magnitude M_d is a fraction of the rate above M_0, due to the independent distribution for magnitudes,

λ_d(t|H_t) = p_d λ_0(t|H_t),

where f_GR(m) is the Gutenberg-Richter law and p_d = ∫_{M_d}^∞ f_GR(m) dm is simply the probability that m ≥ M_d. Therefore the expression for the likelihood is relatively simple:

log L({t_j, m_j}) = Σ_j log λ_d(t_j, m_j|H_{t_j}) − p_d ∫_0^T λ_0(t|H_t) dt.

For the neural model, we make use of the fact that the integral of the intensity function between target events, {(t_j, m_j) : m_j ≥ M_d}, can be expressed as a sum of disjoint integrals between all events {(t_i, m_i)}_{i=1}^n:

log L({t_j, m_j}) = Σ_j [ log λ_d(t_j|H_{t_j}) + log f_d(m_j|t_j, H_{t_j}) ] − ∫_0^T λ_d(t|H_t) dt    (9)
= Σ_{i=1}^n I{m_i ≥ M_d} [ log λ_d(t_i|H_{t_i}) + log f_d(m_i|t_i, H_{t_i}) ] − Σ_{i=1}^{n−1} ∫_{t_i}^{t_{i+1}} λ_d(t|H_t) dt,    (10)

where between (8) and (9) we have factorised the joint intensity of events above M_d, λ_d(t, m|H_t), into the ground intensity above the target threshold, λ_d(t|H_t), and the distribution of the next magnitude above the target threshold given the time and the history, f_d(m|t, H_t).
Between (9) and (10) we changed the summation from being over target events to being over all events by adding the indicator function I{m_i ≥ M_d}, while the integral over the whole observation window becomes the sum of integrals in (10) through a decomposition into disjoint integrals between all events {(t_i, m_i)}_{i=1}^n. Thus the neural model may target events by adding the indicator function I{m_i ≥ M_d} to the expression for the log-likelihood. Now the hazard function models the rate above M_d as a function of the time from the last event (of any magnitude).
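The two thresholding mechanisms above can be sketched schematically. For ETAS, with Gutenberg-Richter magnitudes the rate above M_d is just a fixed fraction of the rate above M_0; for the neural model, the event terms are masked by the indicator while the compensator still sums over all inter-event intervals. The per-event arrays below stand in for network outputs and are purely illustrative.

```python
import numpy as np

def gr_fraction(M_d, M_0, b=1.0):
    """ETAS shortcut: with Gutenberg-Richter magnitudes, the rate above
    M_d is p_d times the rate above M_0, with p_d = 10^(-b (M_d - M_0))."""
    return 10.0 ** (-b * (M_d - M_0))

def target_loglik(mags, log_intensity, log_mag_density, cum_hazard_terms,
                  M_d=3.0):
    """Schematic log-likelihood of events above M_d (cf. eq. 10):
    intensity and magnitude terms are masked by the indicator
    1{m_i >= M_d}, while the compensator sums over ALL inter-event
    intervals, so sub-threshold events still shape the forecast."""
    mask = mags >= M_d
    return (np.sum(log_intensity[mask])
            + np.sum(log_mag_density[mask])
            - np.sum(cum_hazard_terms))
```

The important design point visible here is that small events below M_d never contribute event terms, yet they still update the hidden state and partition the compensator, which is how the neural model uses them as predictors.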

Experimental Design
For both the synthetic data and the real data we apply the same training and testing procedure, illustrated in Figure 3. At a fixed point in time along the sequence we set a marker and train on data up to that point. Following that, the remainder of the sequence is used as the test set. For the synthetic catalogs this is done at one single point in time, whereas for the Central Apennines sequence we make three partitions, each just before the Visso, Norcia and Campotosto earthquakes. By making these partitions we can see how the performance of each model is affected by the number of training datapoints as well as the number of major earthquakes.
We seek to understand how different magnitude thresholds affect the performance of each of the models. For each of the partitions, we look at the performance of both models as the magnitude threshold of the input catalog is lowered, a parameter we refer to as Mcut. For the AVN catalog and the incomplete synthetic catalog, this includes lowering Mcut into periods where Mcut < M_0(t). We keep the magnitude of events we wish to target fixed at M_d = 3 Mw.
Both models are trained by maximum likelihood estimation (MLE) on the training dataset.
For ETAS we use the intensity function defined by Ogata (1988) and maximise the likelihood through Nelder-Mead optimisation, chosen for its robustness. For the neural model, we maximise the likelihood defined in equation (11) through ADAM optimisation (Kingma & Ba, 2014), implemented in TensorFlow (Abadi et al., 2015).
We compare the two models' performance through the log-likelihood of the events in the testing set. We separate the temporal and magnitude terms in the likelihood equation (3) to analyse their predictive skill on the two target variables separately. To compare the performance across different magnitude thresholds, we also fit a homogeneous Poisson model. For the temporal log-likelihood we can therefore present the log-likelihood gain over a benchmark Poisson model, whereas for the magnitude forecasts, simply the log-likelihood is reported. We shall refer to both performance metrics as log-likelihood scores and, to make general comparisons across the models, we compare the mean log-likelihood score per earthquake as well as construct a 95% confidence interval to assess the variability. The confidence interval is constructed with 1000 bootstrap samples of the log-likelihood scores.
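The bootstrap interval described above amounts to resampling the per-event scores with replacement and taking percentiles of the resampled means. A minimal sketch (the percentile variant and the function name are our assumptions; the study does not specify the bootstrap flavour):

```python
import numpy as np

def bootstrap_ci(scores, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean per-event
    log-likelihood score: resample the scores with replacement and take
    percentiles of the distribution of resampled means."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(means, [100 * (1 - level) / 2,
                                   100 * (1 + level) / 2])
    return scores.mean(), lo, hi
```

Because the scores are treated as exchangeable, this interval reflects event-to-event variability in predictive skill rather than parameter uncertainty in either model.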

Synthetic Data
Despite the synthetic catalog being complete down to M_0 = 1 Mw, we only lower Mcut down to 1.7, due to the computational time it takes to find the MLE parameters of ETAS for such a large dataset. Figure 4 shows the computation time (CPU hours) to train each of the models as a function of the size of the training set, using a 2.4 GHz Intel E5-2680 v4 (Broadwell) CPU. The neural model is significantly faster to train than ETAS because its likelihood function is not dependent on the full history of the sequence, giving it complexity O(N) (Shchur et al., 2021). ETAS, in contrast, has complexity O(N²) due to the double sum in the likelihood. Figure 5c) shows the magnitude log-likelihood scores for the complete synthetic catalog. For the highest two thresholds, the neural model performs significantly worse than ETAS, and it remains marginally worse for all other values of Mcut. For the magnitude scores on the incomplete data in Figure 5d), ETAS significantly outperforms the neural model at the higher thresholds. As the threshold is lowered, the two perform more similarly, owing to an increase in performance from the neural model and a slight decrease from ETAS.
We can understand the marginal difference in performance between the two models at lower thresholds by looking at their respective distributions. Figure 6 shows five instances of the magnitude distribution learnt by both models at Mcut = 1.7 for the complete catalog. For ETAS we simply learn the b-value of the Gutenberg-Richter (GR) law, whereas for the neural model a history- and time-dependent distribution for the next magnitude is learnt. In these five instances, although the neural model can closely approximate the GR law, allowing it to be time-history dependent means that its predictions vary across different occurrences in the sequence, and therefore in this synthetic example it performs worse than the stationary data-generating distribution. Since the neural model contains orders of magnitude more parameters, this result indicates overfitting (Lawrence et al., 1997).
Figure 6: The magnitudes of the observed events are plotted as points along the log-density for the neural model.

AVN Catalog
Figure 7 shows the log-likelihood scores for both models on each of the testing-training partitions of the AVN catalog, where Figure 7a) is for training both models up to the Visso 5.9 Mw event, Figure 7b) up to the Norcia 6.5 Mw event and Figure 7c) up to the Campotosto events. The magnitude log-likelihood scores in Figures 7d)-7f) show that the time-history dependent magnitude distribution generally cannot match the predictive power of the stationary GR law. The performance of ETAS remains constant for all values of Mcut and testing-training partitions. In Figures 7d) and 7e) the neural model improves in performance as Mcut is lowered, but it only performs close to ETAS at the very lowest threshold. This, and the fact that it performs much closer to ETAS when training up to Campotosto (Figure 7f)), suggests that it is learning and improving when shown more data.
To compare the models' performance as a function of time, Figure 8 displays the cumulative information gain (CIG) of the neural model over ETAS, for both models trained up to the Norcia earthquake. This information gain is simply the difference in the log-likelihood scores, where we subtract the score of ETAS from that of the neural model for both the magnitude and event-time terms of the likelihood. The CIG is plotted per earthquake, but the evolution with time since the Norcia earthquake is also displayed. Figure 8a) shows the CIG for event-time forecasts. Beyond the trend that the neural model improves over ETAS as we lower Mcut, the improvement varies over the testing catalog. For the thresholds that give the largest gain (Mcut = 1.2, 1.4, 1.6, 1.8), the period of time with the greatest amount of gain, indicated by the steepest gradient of the curve, is found within the first 2 hours after the Norcia earthquake. This is followed by a reduced improvement up to 24 hours, beyond which it levels out and remains relatively linear. The forecasted magnitude distribution of each model, shown in Figure 9, provides some insight into the cause of this immediate improvement over ETAS's GR forecast right after Norcia. Figure 9a) shows the magnitude distribution of each model at the time of the Norcia earthquake. The two distributions resemble each other relatively closely. At the time of the next Mw 3+ event, Figure 9b), the neural model has shifted its density towards higher magnitude values, anticipating a lack of observed smaller-magnitude earthquakes due to catalog incompleteness.

Approximating ETAS
The ability to approximate ETAS data using a neural point process is a benchmark goal. Demonstrating a baseline level of expressiveness is essential before any work on real data is done. Given that other neural point process models regularly use univariate or discretely marked Hawkes data as a baseline (Du et al., 2016; Omi et al., 2019), this result provides an example of a fit to continuous bivariate Hawkes data.
Specifically, the merit of this fit to ETAS data is that it uses only a truncated history of events. Simply truncating the ETAS model to sum over the last d events would make the evaluation of the intensity function change dramatically depending on whether a significantly large earthquake occurred d or d+1 events ago; instead, the relationship between the intensity and these truncated events is learnt. Because the past d events themselves exhibit behaviour driven by earlier events, we do not directly specify the contribution from events further back in time; dependence on such events is learnt indirectly.
The reduction in the amount of history each forecast depends upon drastically lowers the computational cost of both learning and evaluating the likelihood. To build forecasting models that keep pace with the growing size of earthquake catalogs generated through machine-learning-based phase picking, we require models that scale well with the data. We show here that we can achieve the same predictive accuracy as ETAS (on a synthetic catalog) but with a smaller computational budget.
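To make the scaling argument concrete, the following sketch (with illustrative, not fitted, parameter values) contrasts a full-history ETAS-style intensity, which costs O(n) per evaluation, with a truncated variant that only looks back d events, costing O(d). Note that the neural model goes further than this: rather than keeping the parametric kernel and simply truncating the sum, it learns the mapping from the last d events to the intensity.

```python
import numpy as np

def omori_kernel(dt, K=0.05, c=0.01, p=1.1):
    # Modified Omori decay of triggering with elapsed time dt (days).
    return K / (dt + c) ** p

def intensity_full(t, times, mags, mu=0.1, alpha=1.0, m0=2.0):
    # Full-history ETAS-style intensity: sums over every past event,
    # so one evaluation costs O(n) in the catalog size.
    past, past_m = times[times < t], mags[times < t]
    return mu + np.sum(np.exp(alpha * (past_m - m0)) * omori_kernel(t - past))

def intensity_truncated(t, times, mags, d=20, mu=0.1, alpha=1.0, m0=2.0):
    # Truncated variant: only the most recent d events enter the sum,
    # so one evaluation costs O(d) regardless of catalog size.
    past, past_m = times[times < t][-d:], mags[times < t][-d:]
    return mu + np.sum(np.exp(alpha * (past_m - m0)) * omori_kernel(t - past))

rng = np.random.default_rng(0)
times = np.sort(rng.uniform(0, 100, 5000))
mags = 2.0 + rng.exponential(1 / np.log(10), 5000)  # GR-like magnitudes, b = 1

lam_full = intensity_full(100.5, times, mags)
lam_trunc = intensity_truncated(100.5, times, mags)
print(lam_full, lam_trunc)
```

Because all the dropped summands are positive, the truncated intensity underestimates the full one; it is exactly this systematic distortion that the learned model compensates for by inferring the influence of older events indirectly.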

Embracing and Ignoring Data Incompleteness
At the previously explored thresholds of this sequence (Ebrahimian & Jalayer, 2017; Marzocchi et al., 2017; Mancini et al., 2019, 2022), the neural point process performs similarly to ETAS. The biggest deviations between the two models are found as the magnitude threshold is lowered into the unexplored regions revealed by this enhanced catalog. The deviations arise because the neural model gradually improves in performance as the threshold is lowered, whereas ETAS drastically decreases in performance.
Below, we offer an interpretation for these results.
We argue that the largest gains made by the neural model are due to its ability to handle the incomplete data immediately following large earthquakes. There are two justifications for this. First, as the magnitude threshold of the input catalog is lowered into regions where there are periods when M0(t) > Mcut, we expect biases in fitting ETAS (Zhuang et al., 2017; Seif et al., 2017; Hainzl, 2016b). The consequences of these biases for ETAS are reflected in the log-likelihood scores on the synthetic incomplete data (Figure 5b). On this synthetic catalog with short-term aftershock incompleteness, ETAS drastically reduces in performance, in contrast to the neural model. Since a basic ETAS model is formulated as if all potentially triggering earthquakes are completely observed, it poorly captures incomplete sequences. The same shape of plot is found in the real data in Figure 7a)-c), where the performance of ETAS decreases as Mcut is lowered. In contrast, other studies have found that decreasing the minimum triggering magnitude improves the performance of ETAS (e.g., Werner et al., 2011; Helmstetter & Werner, 2014).
A consequence of the bias in the ETAS parameters is that its forecasting performance is only competitive with the neural model during intermediate values of incompleteness (Figure 8c). Even during complete periods in the testing catalog, ETAS fails to forecast well because it has been fit on incomplete data.
The second justification comes from considering the process by which the observed data are generated, e.g., as described by Omi et al. (2014). The relationship between the underlying process λ(t|Ht) and the observed process ν(t|Ht) that forms the catalog itself can be written as ν(t|Ht) = r(t|M0) λ(t|Ht), where r(t|M0) is the probability of detection at time t. For ETAS variants that deal with time-varying completeness (e.g., Omi et al., 2014; Hainzl, 2016a, 2016b; Mizrahi et al., 2021b), this function has to be estimated alongside the parameters of ETAS. When fitting a temporal ETAS model to data with temporal incompleteness, bias in the fitted parameters comes from the modeling assumption that ν*(t) = λ*(t). Through the use of a flexible model such as this neural point process, rather than trying to learn both the underlying process and the detection rate, we instead directly learn the observed process. So, in the construction of the model, it is the observation process ν*(t) that is approximated as a function of the recent history and the time elapsed since ti, where ti is the time of the last event. As we lower Mcut into regions of temporal incompleteness, the performance of the neural model, unlike that of ETAS, does not decrease (Figure 5b). This demonstrates the neural model's ability to fit the observed data, as it is not biased by an increasing amount of missingness.
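This generative view can be sketched as a thinning of the underlying process. The completeness curve and the hard-cutoff detection rule below are assumptions for illustration (real detection probabilities are smoother functions of magnitude and time), but they reproduce the key effect: events below M0(t) vanish from the catalog right after a mainshock.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical complete synthetic catalog: times (days) and magnitudes.
n = 2000
times = np.sort(rng.uniform(0, 30, n))
mags = 1.0 + rng.exponential(1 / np.log(10), n)  # GR-like, b = 1

def m0_of_t(t, t_main=10.0, boost=3.0, decay=1.0):
    # Assumed completeness magnitude M0(t): elevated right after a
    # mainshock at t = 10 days, decaying back to the background level 1.0.
    dt = t - t_main
    return 1.0 + np.where(dt > 0, boost * np.exp(-dt / decay), 0.0)

# Hard-cutoff detection r(t|M0): events below M0(t) are missed entirely,
# events above are always observed.
detected = mags >= m0_of_t(times)
obs_times, obs_mags = times[detected], mags[detected]

# The observed process nu(t) = r(t|M0) * lambda(t) is what the catalog
# records; fitting plain ETAS to obs_times under the assumption
# nu = lambda is the source of the parameter bias discussed above.
print(f"complete: {n} events, observed: {detected.sum()} events")
```

The neural model sidesteps estimating r(t|M0) by learning nu(t) directly from catalogs that already contain this gap.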
Learning to model the observation process requires the assumption that the incompleteness process is stationary for future forecasts; thus, if there is a change in data-collection methodology, the model would have to be re-trained on the new data. This is similar to detection-rate-based methods (Omi et al., 2014; Hainzl, 2016a, 2016b; Mizrahi et al., 2021b), which would also need to update their detection function under new observational methodology.
To further test the effectiveness of the neural model, a comparison with other ETAS models that specifically deal with incompleteness is needed. But given that methods dealing with incompleteness only add to the computational requirements of fitting a basic ETAS model, neural point processes could offer a more efficient way to deal with missing data.

Conclusion
We present an initial investigation into the viability of neural point processes for the forecasting of short-term seismicity. The neural point process is formulated in a similar way to the ETAS model, only with a much more flexible representation of the intensity function. Now, with much larger earthquake catalogs, data-driven point process models present an opportunity to investigate whether these new data motivate deviations from the parameterization of ETAS, as well as providing more computationally efficient models that are robust to missing data.
We extend the existing point process model of Omi et al. (2019) so that it also models the magnitudes associated with the events contained in earthquake sequences. We also show how this model can be used to forecast earthquakes above some target threshold magnitude through decomposing the cumulative hazard function between target events.
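A simplified analogue of converting a forecast at the input threshold into one at the target threshold can be sketched with a stationary Gutenberg-Richter law; the paper's actual decomposition operates on the learned, history-dependent magnitude distribution, so the closed form below is an illustrative assumption, not the method itself.

```python
# Under a stationary Gutenberg-Richter law with b-value b, the probability
# that an event above the input threshold Mcut also exceeds the target
# threshold Md is 10**(-b * (Md - Mcut)). Scaling the expected number of
# events above Mcut by this factor converts the forecast to one at Md.
def prob_exceeds(md, mcut, b=1.0):
    return 10 ** (-b * (md - mcut))

# E.g., forecasting M3+ events from a catalog cut at M1.2 (values chosen
# to match the thresholds explored in this study):
p = prob_exceeds(3.0, 1.2)
expected_m3 = 500 * p  # if 500 events above Mcut are expected
print(p, expected_m3)
```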
A notable feature of the presented model is that a forecast depends only on a fixed-length vector representing the history of events, making the evaluation of the likelihood scale linearly with the sample size.
With an experiment on data simulated from the ETAS model, we demonstrate this computational advantage by showing a stark improvement in the time required to train the neural point process compared with the ETAS model, whilst still obtaining a similar likelihood score on test data. We find that defining a more flexible, time-history-dependent magnitude distribution leads to overfitting; consequently, the magnitude likelihood scores are worse than when using a stationary Gutenberg-Richter law.
Through artificially removing events from the synthetic catalog, we create a dataset that mimics short-term aftershock incompleteness. We find that on this altered catalog the performance of ETAS now decreases as the magnitude threshold is lowered. In contrast, the neural model remains constant in performance, suggesting that it is more robust to the missing data found in typical earthquake catalogs.
On real data from the Amatrice-Visso-Norcia sequence, the performance of both models varies with the magnitude threshold of the input catalog. Both models perform similarly at previously explored thresholds (Mw3+), but when the threshold is lowered into magnitude regions revealed by the new catalog, ETAS decreases in performance unlike the neural point process. We argue that these gains are due to the neural model's ability to handle the incomplete data found in this enhanced catalog. This experiment both motivates the need to consider temporal completeness when using enhanced catalogs and motivates further work into what the spatial covariates of this dataset might offer when combined with flexible point process models such as this one.

Open Research
The Amatrice-Visso-Norcia catalog produced by Tan et al. (2021b) is accessible at the Zenodo repository https://doi.org/10.5281/zenodo.4736089 (Tan et al., 2021a). The ETAS simulator used to generate the synthetic data was written for Mizrahi et al. (2021a) and Mizrahi et al. (2021b), and is available at https://github.com/lmizrahi/etas (Mizrahi & Schmid, 2022). Both synthetic and real datasets are found in the reproducibility package, along with the models and the experimental design used in this study (?, ?).
We model both the cumulative hazard function Φ and the conditional distribution function of the marks Ψ using a feed-forward neural network. The network depicted in Figure 2 consists of four component parts. The first is the recurrent section, which finds an encoding of the history of the point process. The output of the recurrent section, hi, passes into two fully connected components. One models the integral of the intensity function; this is a function of the time of the next event τ. The other models the integral of the conditional density function of the next mark; this is dependent on the next time τ, but is a function of the next mark m. Since passing long sequences into recurrent neural networks can often lead to exploding or vanishing gradient problems (Hochreiter, 1998), we do not pass the whole history of the point process into the recurrent section, but only the past d events. This implies that we use only the past d events to forecast the next event; thus, the model learns to estimate the intensity given a recent history of d events. This hyperparameter is kept the same as in Omi et al. (2019) at d = 20; a naive tuning search found no significant improvement at the larger values {50, 100, 200, 500}. This difference to the full-history ETAS model is discussed in Section 6.1.
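A minimal numpy sketch of the cumulative-hazard head, in the style of Omi et al. (2019), is given below. This is not the paper's implementation: the 64-unit recurrent encoding hi is replaced by a random stand-in vector, and the layer sizes are arbitrary. The essential property it demonstrates is that constraining the weights on the τ path to be positive makes Φ monotonically increasing in τ, as a cumulative hazard function must be; the intensity is then recovered as the τ-derivative of Φ.

```python
import numpy as np

rng = np.random.default_rng(2)

def softplus(x):
    # Smooth positive transform, used both as an activation and to
    # constrain weights to be positive.
    return np.log1p(np.exp(x))

d_hidden, d_hist = 16, 64                        # illustrative sizes
W_tau = softplus(rng.normal(size=(d_hidden, 1)))  # positive weights on tau
W_h = rng.normal(size=(d_hidden, d_hist)) * 0.1   # unconstrained history path
w_out = softplus(rng.normal(size=(1, d_hidden)))  # positive output weights

def cumulative_hazard(tau, h):
    # Phi(tau | h): increasing in tau because tanh is increasing and all
    # weights along the tau path are positive.
    z = np.tanh(W_tau @ np.array([[tau]]) + W_h @ h)
    return softplus(w_out @ z).item()

h = rng.normal(size=(d_hist, 1))  # stand-in for the RNN history encoding
phis = [cumulative_hazard(tau, h) for tau in (0.1, 0.5, 1.0, 2.0)]
print(phis)
```

In the full model this head is trained jointly with the recurrent encoder and the magnitude head by maximising the likelihood of observed {τ, m} pairs.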

Figure 2 :
Figure 2: The proposed network comprises four sections. First, the inter-event times and magnitudes of the last d events are fed into a recurrent section consisting of 64 recurrent units. The output of this section is fed into two fully connected sections, where it is combined with the next inter-event time τ for the temporal network and, additionally, with the next magnitude m for the magnitude network. The outputs of both these sections are combined to formulate the log-likelihood of the next inter-event time and magnitude {τ, m}. We can separate the temporal and magnitude terms in this likelihood to give point evaluations of the density of the next inter-event time and the conditional density of the next magnitude. The dependence structure of the point process is expressed by the connections between sections in the network.

Figure 5
Figure 5 shows 95% confidence intervals for the log-likelihood scores on the synthetic catalogs for the varying magnitude thresholds. By varying the value of Mcut, we assess the sensitivity of each model to the magnitude threshold of the input catalog.

Figure 3 :
Figure 3: The synthetic catalog with an outline of the training and testing procedure. We train up to a fixed point in time in the catalog, following which the remainder of the catalog is used for testing. We vary the value of the threshold for the input catalog (Mcut) and keep fixed the value of the target threshold (Md).

Figure 4 :
Figure 4: Time on a single CPU required to train each of the models as the training size increases. Each model is trained by maximising the likelihood of the training data.

Figure 5b )
Figure 5b) shows the temporal log-likelihood scores for the incomplete synthetic catalog. Just as in Figure 5a), ETAS remains constant down to Mcut 2.0. But now, on this incomplete dataset, the performance of ETAS significantly reduces as Mcut is lowered below this threshold. In contrast, the neural model remains constant in performance and significantly outperforms ETAS for all but the highest two thresholds, demonstrating a robustness to the missing data in this catalog. Because synthetically recreating incompleteness removes many datapoints, we can fit ETAS down to a lower Mcut, as we do not encounter the longer training times of the complete catalog.

Figure 5 :
Figure 5: Results from the synthetic tests. 95% confidence intervals for the log-likelihood scores for each model as a function of Mcut (the magnitude threshold of the input catalog). The size of the training set is displayed in the green barplot; the size of the testing set in the legend. a) temporal log-likelihood gain over Poisson for the complete synthetic catalog. b) temporal log-likelihood gain for the incomplete catalog. c) magnitude log-likelihood for the complete catalog. d) magnitude log-likelihood for the incomplete catalog.

Figure 6 :
Figure 6: Five examples of the forecasted magnitude distributions from the complete synthetic catalog tests at Mcut = 1.7 compared with the ETAS Gutenberg-Richter law.
Across all training-testing partitions, Figure 7a)-7c), as Mcut is lowered below Mw 3, the performance of ETAS decreases consistently. The neural model, however, either remains constant in performance or improves as Mcut is lowered. In addition, as the neural model is trained on a longer period of time, its performance improves. For higher values of Mcut, the neural model performs significantly worse than ETAS when trained up to Visso, but with the additional training data leading up to Norcia and Campotosto, it performs similarly to ETAS. For low values of Mcut, the performance of the neural model is significantly better than ETAS. When comparing across all values of Mcut, neither model is significantly better than the other; generally, the neural model is more robust to different values of Mcut than ETAS. Figure S1 of the Supporting Information shows how the fitted ETAS parameters change with Mcut.

Figure 8b )
Figure 8b) shows the CIG for the magnitude forecasts, confirming the loss in average performance of the neural model relative to ETAS. All thresholds decrease fairly steadily for nearly all of the testing period, apart from immediately following the Norcia earthquake. For the lower thresholds, the period following Norcia sees a brief improvement over ETAS before declining.

Figure 8c )
Figure 8c) shows the IG of the neural model over ETAS, but now as a function of the estimated completeness of the testing period. Both models are trained up to Norcia for Mcut = 1.2, 2.0, 2.8. A locally weighted scatterplot smoothing (lowess) regression (Cleveland, 1979) with 95% confidence intervals estimates this relationship. For Mcut = 1.2, the difference between the two models is smallest for intermediate values of the incompleteness (around M0(t) = 2.0), but larger for the most complete (M0(t) = 1.0) and most incomplete periods.

Figure 8 :
Figure 8: a)-b) The cumulative information gain (CIG) of the neural model over ETAS for a range of values of Mcut. The models are trained up to the Norcia earthquake and the plot depicts the evolution of the CIG from the Norcia earthquake to the end of the catalog. The curve is plotted per event; however, the actual time since the Norcia earthquake is displayed on the top axis. a) displays the CIG for event-time forecasts; b) displays the CIG for magnitude forecasts. c) displays the information gain of the neural model over ETAS as a function of the completeness of the testing catalog; both models are trained up to Norcia for Mcut = 1.2, 2.0, 2.8.

Figure 9 :
Figure 9: The forecasted magnitude distribution of each model, a) at the occurrence of the Norcia earthquake, and b) at the occurrence of the next Mw3+ earthquake following Norcia.