Comparison of Deep Learning Techniques for the Investigation of a Seismic Sequence: An Application to the 2019, Mw 4.5 Mugello (Italy) Earthquake

The increase of available seismic data prompts the need for automatic processing procedures to fully exploit them. A good example is aftershock sequences recorded by temporary seismic networks, whose thorough analysis is challenging because of the high seismicity rate and station density. Here, we test the performance of two recent Deep Learning algorithms, the Generalized Phase Detection and Earthquake Transformer, for automatic seismic phases identification. We use data from the December 2019 Mugello basin (Northern Apennines, Italy) swarm, recorded on 13 permanent and nine temporary stations, applying these automatic procedures under different network configurations. As a benchmark, we use a catalog of 279 manually repicked earthquakes reported by the Italian National Seismic Network. Due to the ability of deep learning techniques to identify earthquakes under poor signal‐to‐noise‐ratio (SNR) conditions, we obtain: (a) a factor 3 increase in the number of locations with respect to INGV bulletin and (b) a factor 4 increase when stations from the temporary network are added. Comparison between deep learning and manually picked arrival times shows a mean difference of 0.02–0.04 s and a variance in the range 0.02–0.07 s. The improvement in magnitude completeness is ∼0.5 units. The deep learning algorithms were originally trained using data sets from different regions of the world: our results indicate that these can be successfully applied in our case, without any significant modification. Deep learning algorithms are efficient and accurate tools for data reprocessing in order to better understand the space‐time evolution of earthquake sequences.

The abrupt changes in the seismic signal due to arrivals of earthquake waves can also be characterized using variance and higher order statistics parameters such as skewness and kurtosis (Ross & Ben-Zion, 2014;Saragiotis et al., 2002), which further increase sensitivity with respect to the aforementioned techniques. All the previous approaches, however, suffer from two fundamental drawbacks. The first is that they require a non trivial and accurate tuning of the parameters in order to provide reliable results. As an example, the implementation of the ubiquitous Allen Picker (Allen, 1978) in Earthworm (Johnson et al., 1995), a widespread tool to process real time data, requires 18 parameters to be defined and most of them are highly affected by the type of sensor used, the geometry of the network, the noise level, and other factors (Mele et al., 2010). The second reason is that, being based on energy variations within the time series, it may be non trivial to isolate the earthquake signals from other disturbances. On the other hand, these techniques provide valuable tools for general investigations of time series anomalies.
An alternative approach is inherited from controlled-source seismic imaging, in which some characteristic function parameterizing the seismic time series at the different stations is back-propagated through the propagation medium. The optimal subsurface location and origin time of the earthquake are then given by the space-time coordinates at which all the back-propagated wavefronts interfere constructively (e.g., Drew et al., 2013;Grigoli et al., 2014;Kao & Shan, 2004;Langet et al., 2014). This class of techniques has an advantage when applied to noisy data sets since the direct stacking process enhances the SNR filtering out incoherences in the time series.
Algorithms based on the similarity of waveforms between similarly located earthquakes such as the matched filter (e.g., Peng & Zhao, 2009) partially overcome the limitations of the previously described picking approaches. Indeed, the reprocessing of continuous data streams using cataloged events as a template may improve the detection of P and S waves arrivals by up to an order of magnitude (e.g., Allen, 1982;Ross et al., 2018). The drawback of these methods is that the new found events are duplicates of the original templates. This means that other seismic signals strongly different from the templates still remain undetected, resulting in the new catalogs potentially still being incomplete. The recent introduction of the FAST algorithm (Yoon et al., 2015) has minimized this bias by demonstrating the ability to match the general characteristics between matched waveforms with respect to the specific shape of the template waveform. Additional drawback of the template matching is the potential spatial bias in detection ability which depends on the density of events, and also the character of events, with some regions more prone to have similar events than others.
In recent years there has been a rapid growth in the application of Machine Learning and Deep Learning (DL) techniques to solid Earth geoscience including seismology (Bergen et al., 2019;Kong et al., 2019). Due to the large data volumes and many ubiquitous properties of seismic waveforms independent of setting, seismic phase detections have been an attractive target for DL applications (García et al., 2016;Mousavi et al., 2020;Perol et al., 2018;Ross et al., 2018;Titos et al., 2019;Woollam et al., 2019;Zhu & Beroza, 2018). The extensive work already done by analysts for seismic catalog compilation can be easily converted into high quality labeled data sets suited for the setup of supervised DL methods. The labels are used to train the algorithm to learn relevant features in the seismograms, provided that a sufficient amount of data is available. Once this process is complete, the trained DL model is able to map new and unseen seismic waveforms into target representations that, in the case of phase detection, encode the P or S wave arrivals with a continuous function peaked at the arrival time. Among the various flavors of DL, convolutional neural networks (CNN) and attention-based transformer network have been successfully applied to event detection and seismic phase recognition (Mousavi et al., 2020;Perol et al., 2018;Ross et al., 2018). The picking task is accomplished by a succession of layers, performing first feature extraction and then classification of the targets. The power of DL methods is that the features upon which the target recognition is based are learned implicitly. This reflects the agnostic character of DL as a data driven discipline. In real world applications this results in the possibility of feeding data with minimal pre-processing, which results in very efficient analysis pipelines from the computational point of view. Further, providing the full information included in the waveforms minimizes potential biases due to a priori selections of particular features.
The key role in the success of the learning process is played by the quality, completeness and size of the training data set to ensure the network is learning correctly and without any bias. In seismology this task is particularly facilitated by the fact that a fair amount of information is made publicly available by data providers (e.g., IRIS, FDSN, Ingate, 2008) and also by recent publications of extensive data sets ready for DL applications such as STEAD (Mousavi et al., 2019), LEN-DB (Magrini et al., 2020) and INSTANCE (Michelini et al., 2021).
Among the different efforts in the application of DL to earthquake detection, ConvNetQuake (Perol et al., 2018) represents the first CNN for phase detection and earthquake location. This network classifies the input data into different clusters as a function of earthquake location. Woollam et al. (2019) investigated the performances of that approach to classify seismic phases, when only a relatively small set of events is available for training the network. More recently, Ross et al. (2018) developed Generalized Phase Detection (GPD), a CNN algorithm to identify seismic phases, while Mousavi et al. (2020) designed Earthquake Transformer, an attention-based model to detect events and picks. Both these frameworks take as input 3 components of seismic waveform recordings and return time series of forecast probabilities for P, S waves and noise for GPD or event detection, P and S waves for EQT.
In this paper we select a seismic data set of earthquakes, recorded by a dense network of 22 seismic stations, associated with the Mw 4.5, 9 December 2019 mainshock in the Mugello basin (Northern Apennines, Italy). According to the INGV bulletin (Pagliuca et al., 2020) the sequence consists of about 280 events occurring between 8 and 26 December 2019. However, it is likely that a large number of low magnitude events, preceding and following the mainshock, and potentially overlapping, may have been missed by a standard analysis for seismic monitoring purposes. This data set can be considered a valuable and challenging test case of automatic event detection and seismic phase recognition because of the difficulty to separate earthquakes occurring close in time during an intense aftershocks sequence. Further, a catalog of manually revised seismic phases and earthquake locations is available making possible to compare with advanced phase detection such as those based on DL. Our goal is to analyze this data set, extended in time from 1 to 31 December 2019, using template matching and DL frameworks such as GPD and EQT. We provide insights about the performance of the different algorithms through the study of the resulting catalogs, earthquake statistics, locations, and pick quality.

Location, Instruments, and Data
We study an earthquake sequence that occurred in the Mugello Basin (Figure 1a), located ∼30 km north of Florence. This is a WNW-ESE striking intra-mountain basin that is 25 km long and 15 km wide, and it is part of the Northern Apennines, a NW-SE striking fold-and-thrust belt composed of a pile of NE-verging tectonic units that developed during Cenozoic collision between the European plate (Corso-Sardinian block) and the Adria plate (e.g., Carminati et al., 2012). Seismicity of the Northern Apennines is mostly concentrated within a ∼40 km wide, NW striking swath along the mountain chain. Earthquakes are generally shallow ( 20 km deep), and exhibit normal-fault focal mechanisms on the SW side of the chain (e.g., Piccinini et al., 2014). Larger focal depths (up to 40 km) and dominant thrust-faulting mechanisms are instead observed on the NE side of the chain (Piccinini et al., 2014). This spatial change in seismicity pattern across the Apennines is one of the primary pieces of evidence for the current spatial transition in tectonic style. The Mugello basin has experienced recent NW-striking seismic swarms in 2008 and 2009, near the western margin of the basin (Bonini et al., 2016;Piccinini et al., 2014). The most significant historical earthquakes occurred in 1542 and 1919, with macroseismic magnitudes of M = 5.9 and M = 6.3, respectively (Rovida et al., 2016).
As is the case for the entirety of the Italian territory, seismicity of the study area is monitored through the Italian National Seismic Network (INSN), a permanent infrastructure operated by the Istituto Nazionale di Geofisica e Vulcanologia (INGV hereinafter (INGV Seismological Data Centre, 1997)). In the summer of 2019, nine additional temporary broadband stations were added as part of a collaboration between INGV, the Universities of Southampton (UK) and Florence (IT), and the Italian National Research Council (CNR). These stations, on lease from the SEIS UK facility, were deployed in order to achieve improved coverage of the Mugello basin and adjoining areas (Figure 1a), with the ultimate goal of accurately identifying faults from seismicity and doing high resolution seismic imaging of the crust in this portion of the Northern Apennines. Three of these stations (RINC, RONT and VISG) became part of the INSN (IV network) shortly after their installation, while the remaining have been registered as seismic network 9M (Bruni et al., 2019). A full list of the stations is available in Table S1 of Supporting Information S1.
This new deployment recorded the Mw = 4.5 earthquake that struck the NW margin of the basin on 9 December 2019 (Figure 1b). The mainshock and its aftershocks were clustered along a previously identified epicentral range), causing damage that resulted in the evacuation of more than 150 residents and economic losses of several million. The whole sequence lasted less than one month. During 1-31 December 2019, the INGV catalog (ISIDe Working Group, 2007) reports 279 earthquakes, with local magnitudes (M L ) in between 0.5 and 4.5. We will refer to this catalog as M_IV ( Figure 1b).
For these events we proceeded with manual repicking of P and S wave arrival times (to achieve uniformity in picking made by a single operator) at all the stations from the temporary network and a selected subset of the closest stations from INSN (namely, stations SEI, LMD, MOCL, MPPT and FNVD in Figure 1a). This new set of arrival times was then inverted for source location using the NonLinLoc software package (Lomax et al., 2000(Lomax et al., , 2014 with an improved velocity model specifically derived for the area (Piccinini et al., 2014). Local magnitudes (M L ) of earthquakes are finally computed following Di Bona (2016) and Mele and Quintiliani (2020), who provide parameters and station correction terms for INSN. Details about M L computations are described in Supporting Information S1. Throughout the following, we refer to this catalog as M_9M.
As generally occurs during the early stages of aftershock sequences, earthquakes repeat at tight temporal intervals and often overlap each other ( Figure 2). Under such conditions the classical automatic discrimination procedures such the STA/LTA algorithm (Allen, 1982), and the subsequent phase picking and association, generally fail and a significant portion of events may pass undetected or not correctly associated and located. For this reason the waveform data collected during December 2019 are here processed with template matching and DL based algorithms to explore the efficiency of these methods in finding new events and to better characterize the sequence.

Matched Filter
Matched filter is an earthquake detection method based on waveform cross-correlation for identifying replicas of a known signal in continuous time series data (e.g., Gibbons & Ringdal, 2006). The method takes advantage of the high waveform similarity observed within seismic sequences (Geller & Mueller, 1980), when seismic waves originating from sources of similar location and mechanism travel a similar path resulting in similar waveform shapes at a receiver (Herrmann et al., 2019). This technique facilitates detection of new events characterized by low SNR yielding a better sensitivity than that offered by the common detection techniques such as STA/LTA (e.g., Gibbons & Ringdal, 2006;Schaff, 2008).
We use up to 278 earthquakes reported in catalog M_9M during 1-31 December 2019 as template earthquakes and performed a detection procedure on the continuous seismic data at six stations that are close to the epicentral area and have the most continuous records (namely, stations CRCL, GAGN, MBEN, MOCL, RONT, and SEI; see Table S1 in Supporting Information S1). The exact number of template events does vary (227-278) between stations, since not all the events were picked at all the stations. The matched filter procedure is conducted separately for P-and S-waves. P-waves are scanned over the continuous vertical-component recordings, using 3 s long template waveforms starting 1 s before the picked arrival time. For S-waves, the procedure is iterated over the N and E components, using 5 s long template waveforms starting 1 s before the picked arrival time. Both templates and continuous data streams are band-pass filtered within the [2-15] Hz frequency band. In a first stage, we retained all the matching wavelets having maximum cross-correlation (CC) larger than 0.5. Then, we removed duplicate matches derived from different templates with similar waveforms, keeping only the template-match pair with the highest CC. At all the stations, the number of matches decreases with increasing CC; the rate of decrease, however, is not constant, exhibiting an evident change of slope at CC ≃ 0.7 (see example in Figure 3a). Hence, we choose CC = 0.7 as a threshold for selecting the matched phases to be subsequently used for the location procedure. A sample P-wave template and corresponding matching phases are shown in Figure 3, panels (b and c), respectively. P-and S-waves detected via matched filter are associated by first sorting in ascending order the entire set of arrival times, then declaring an event whenever the time difference between two consecutive times is larger than 5 s; such simple procedure is justified by the narrowness of the epicentral region, and the restricted time interval spanned by the travel times to the different stations. The set of associated arrival times is finally located and M L are computed using the same procedure described for the manual catalog M_9M. The resulting catalog will be referred in the following as M_MATCH.

Deep Learning Analysis
The waveform data from 22 stations from the IV and 9M networks (Figure 1a) are processed in the time domain using GPD and EQT. Here we use these frameworks as published and the model weights as provided in supplements or data publications without any further training or modification, apart from evaluating the detection threshold values most suitable for our data set. These algorithms have been selected for their high sensitivity in detecting P and S phases with low SNRs, for example, when arrivals overlap the coda of previous events, or in case of small magnitude earthquakes, as usually occurs during seismic sequences.
GPD (Ross et al., 2018) is based on a CNN trained on 4.5 million three-component, hand-labeled seismic phases from southern California earthquakes (for a detailed description of the data set, see Ross et al. [2018]). The algorithm requires as input 4 s segments of the three-components ground velocities, which are preprocessed by removing the mean and linear trend, resampled at 100 Hz and band-pass filtered within the [3-20] Hz frequency interval. The sliding time windows are overlapped by 97.5%, corresponding to a step width of 0.1 s. The method returns three time series corresponding to the forecast probability associated with the detection of noise, P and S wave picks. The picks are declared if the forecast probability is greater than a user-defined threshold. The chosen value for the threshold is 0.99. This was determined by studying (a) how the total number of picks changes with different threshold values and (b) the forecast probabilities values for the GPD picks which have a match in the manual catalog. We came up with the value of 0.99 which represents the best compromise to retrieving more than 90% of manual catalog picks. In Supporting Information S1 we describe how we select this value in the GPD threshold choice section. The picking error is formally set to 0.1 s for all the detections, resulting from the step width for the sliding window.
EQT (Mousavi et al., 2020) is the second algorithm we selected for the analysis. It consists of an event detector and picker based on a hierarchical attention mechanism. The neural network has a multi-task structure consisting of one very-deep encoder and three separate decoders mapping the output into three time series of forecast probabilities associated with the earthquake detection, a P phase, and an S phase (Mousavi et al., 2020). First, it performs earthquake detection in a 60 s window of three component seismic data. If an event is flagged, P and S phases are recognized within the analyzed window. The network has been trained on a subset of STEAD, a largescale global data set of 1.2 million hand-labeled waveforms of earthquakes and noise (Mousavi et al., 2019). Events, P and S wave picks are declared upon the exceedance of predefined thresholds, that we fixed to 0.2 for event detection and 0.1 for both P and S waves picking, respectively. In Supporting Information S1 we describe how we select these values in the EQT threshold choice section. For each available station, waveform data are pre-processed by filling gaps, removing mean and linear trend, resampling to 100 Hz and band-pass filtering within the [1-45] Hz frequency band. Data are then fed to EQT in blocks of 60 s with 30% overlap. EQT also provides picking error estimates based on a combination of the forecast probability and the model uncertainty evaluation, accomplished by a procedure equivalent to a Bayesian sampling of the posterior parameters distribution. This uncertainty translates into weights associated with the P and S phases for the subsequent location procedure.
The picks retrieved using the DL techniques are associated into four catalogs, two for GPD and two for EQT, according to the seismic stations defined for catalogs M_IV and M_9M (Figure 1). For both GDP and EQT results, the association of the seismic phases is performed according to a coincidence criterion, requiring that at least three P wave picks are detected within an 8 s long time interval and considering a 2 s trigger-off extension to lengthen the research window in the coincidence sum computation. This window is long enough to include the picks from the furthest station (CSNT ∼60 km away from the seismic swarm). Once the association of P phases has been performed, we search at all the available stations for the S picks in the time window ranging from the 10.1029/2021JB023405 8 of 17 first P onset and a maximum duration computed as 4 times the difference between the first and the last P trigger times. Two cases may occur. If a P pick is available for the station under investigation we search the S pick closest to the P onset, otherwise we add the first available S pick for the time interval considered. If two P arrivals are detected for the same station, a second separate event is created. Despite the simplicity of this phase association method, we are confident in its effectiveness because of the limited spatial extent of the sequence and the high level of accuracy of the DL pickers (Ross et al., 2018). For this purpose, Figure S6 in Supporting Information S1 we show an example of how the picks are associated into events.
The associated P and S phases are inverted to identify the source location using the same code and velocity structure adopted for the locations of catalog M_9M. We require at least three P phases, and a total of 5 phases (P or S) to locate the earthquake. Finally we computed the M L for the located events. From all the located events, we select those whose epicenters lie within a 10 km radius of Figure 1b. The final catalogs are named M_IV GPD, M_9M GPD, M_IV EQT, and M_9M EQT, based on whether they have been assembled using the M_IV and M_9M stations and GDP or ETQ methods, respectively.

Matched Filter
The detection performances of the different methods should ideally be carried out with a comparison to the ground truth, which is not established in an unexploited seismic data set. In Table 1 we show, for the six stations used to determine the matched filter catalog, the total number of picks before associating (T), the arrivals (A, i.e., picks associated and located into events within our study area) and the percentages of picks retrieved as arrivals (% A/T) in the seismic catalogs. This is an indirect way to assess the presence of false positive picks, that is, picks that do not contribute to event location. However, this approach has some limitations since it does not allow us to take into account true positive picks belonging to very low magnitude events that cannot be associated due to low phase count or picks associated to events located outside the selected area (within 10 km radius). Table 1 summarizes and compares the results obtained for Matched Filter and DL methods. On average the percentages obtained for M_MATCH span between 40% and 90%. Two exceptions are constituted by stations CRCL and MOCL, which reported a huge number of detections, of which only 10% and 4%, respectively, are successfully associated and located. This occurrence is related to electronic disturbances and short-duration gaps in the data streams from those two stations, which provoked numerous false detections, particularly for the most impulsive P-wave arrivals. The M_MATCH catalog finally has 930 locations. Taking into account the high cross-correlation thresholding, the reduced number of used stations and their proximity to the hypocentral area, we consider this catalog to be rather robust, thus providing a lower bound on the actual number of detectable events in the sequence.

Deep Learning
The most evident feature in Table 1 is that both DL methods show fairly strong variations among the stations considered. For example, the percentage of GPD P picks turned into arrivals decreases from 56% at MBEN to only 16% at SEI and that for EQT P phase ranges between 87% at MBEN to 28% at SEI. These variations may be partially related to the large number of false detection triggered by data stream problems (as is the case for SEI) and eventually mitigated with an ad-hoc training of the DL models for each station: as an example at MBEN, where few picks are detected by GPD with respect to EQT. Overall, EQT provides the best performance since the arrivals percentage for both P and S phases are consistently above 65%.  Table 1 For

the Stations Used in Matched Filter Computation We Report the Total (Indicated as T) Number of Picks (Sum of Templates and Matches With CC ≥ 0.7), the Number of Arrivals (as A) and and Percentages of Picks Retrieved as Arrivals (as % A/T), for Both P and S Phases, for Matched Filter (MATCH), GPD, and EQT, Respectively
seven catalogs discussed in the previous section. is here reported to characterize each single catalog but not for the purpose of providing a metric to compare the various catalogs since they are obtained using different sets of stations and picks. The most evident achievement attained by DL methods is the large increase in the number of detected earthquakes.
The results, after earthquake location and selection, show that the number of events in the whole sequence increases significantly: while M_IV reports 279 earthquakes, DL based detections increase by about a factor of 3 (833 for M_IV GPD and 740 for M_IV EQT). We remark, however, that INSN is designed to detect events either damaging or possibly felt by population (M L ≥ 2.0) and not to exploit microseismic activity. The effect of enhancing seismic station density in the proximity of the sequence can be appreciated considering the set of M_9M catalogs: the number of located events increases by a factor of about 4 (1,156 for M_9M GPD and 983 for M_9M EQT, in agreement to that attained by M_MATCH). This is due to the denser network, providing better coverage and more detailed information needed to reveal low magnitude earthquakes.
The time evolution of the swarm (Figure 4) shows a synchronous step-like behavior in all the catalogs, corresponding to the occurrence of mainshock and following major events.    Comparing the origin times of DL catalogs with those of M_9M catalogs we find that origin times differ by less than 0.5 s for the 95% of the events. We analyze the remaining 5% of events that have origin times that differ by more than 0.5 s. They are mainly clustered in the hours immediately following the mainshock and major bursts of the seismic sequence on 9, 13, and 17 December. A closer view of the picks for the missing events suggests that the main problems arise with earthquakes occurring close in time, at intervals comparable to the travel time, especially if the second event is more energetic than the first one. In these cases the P wave onset of the second event is masked by the S wave coda of the first event. The most common cases are the following: First, the forecast probabilities of the P wave of the second event does not rise above the chosen threshold, so the P onset is not detected ( Figure S7 in Supporting Information S1). Second, the DL method correctly picks the two P phases and separates the two events ( Figure S8 in Supporting Information S1). In both cases the operator incorrectly declared the S onset of the first event as the P onset of the second one. This error causes the subsequent discrepancy in the origin time computation between manual and DL methods and hypocentral location. In other cases the differences may arise when the DL methods associate a larger number of phases to an event with respect to the manual picking, resulting in a better constrained location but different from that in the M_9M catalog. Other factors that may also contribute to the quality of DL locations are: (a) the choice of a high probability threshold to declare a pick (0.99 for GPD and 0.1 for EQT) that may prevent some detections of true positive picks, (b) the use of pre-trained models on data collected in a different geological environment, and (c) the simple event association approach implemented. Despite these differences, the quality of DL catalogs is supported by the low values of the ( Table 2).
The M L computed for all the catalogs except M_IV, whose values are taken from INGV bulletin, confirms (Figure 5) that the matched filter and DL techniques are very effective in detecting low magnitude earthquakes (M L ≤ 1.5) with respect to those included in M_IV and M_9M. Indeed the frequency histograms in Figure 5 almost overlap for M L ≥ 1.5 while the number of events strongly increases below that value, especially when stations from the temporary network are used. Figure 5a clearly shows that there are few differences in the distributions of the two DL and the IV catalogs especially for events with M L ≥ 2.0. These discrepancies are due to the larger number of stations used in the INGV bulletin which takes data from the whole INSN network while M_IV GPD and M_IV EQT are limited to the subset of stations used in our analysis (see Figure 1a and Table S1 in Supporting Information S1). This implies that the most energetic earthquakes are recorded by a larger number of stations, resulting in a slightly different hypocentral location and, more important, into a larger number of amplitude estimates obtained over a much larger distance range.
We also plot the Gutenberg-Richter distribution of the catalogs ( Figure 6) as a further control of the performance of the DL techniques. Again, the DL methods confirm their effectiveness in detection and characterization of microearthquakes. Using the Goodness-of-Fit technique (Wiemer & Wyss, 2000), we estimate the magnitudes of completeness (Mc) for M_IV,M_9M, M_MATCH, and DL based catalogs. The DL and matched filter catalogs show a significant reduction of Mc. The corresponding b-value is around ≃0.6, very low but consistent with that retrieved from the INSN (M_IV). The low b-value is likely due to the short time span, and limited spatial range, spanned by our catalogs. Standard error (SE) of the frequency-magnitude distribution b-value is computed using the Shi and Bolt (1982) approach. Results are summarized in Table 3.
In Figure 7 we plot the 2D projections and vertical cross sections of earthquake hypocenters for catalogs M_9M, M_MATCH, M_9M GPD, and M_9M EQT, respectively: the color shades reflect the kernel density estimates of the hypocenter distribution, with lighter color indicating higher density; in the cross sections only earthquakes less than 1 km away from the profile are plotted; red stars indicate the mainshock locations. The locations from the four different catalogs all show a main NW-SE alignment, common to the original M_IV catalog (Figure 1b), that suggests a fault plane with a strike of about 120° and steeply dipping (about 70°) SW. This is consistent with one of the two fault planes derived from the Time Domain Moment Tensor solution (see Figure 1), in particular with the plane striking N308° and dipping 46° (http://cnt.rm.ingv.it/event/23558121). In all the catalogs, the mainshock is located at the SE tip of the epicentral alignment, and most of the hypocenters are clustered within the 5-10 km depth interval. The region characterized by highest density of earthquakes shows different geometry according to the methods used to retrieve the catalogs. For M_MATCH most of the earthquakes are located in the westernmost part of the sequence. This may be due to the fact that three out of the six stations used for the matched filter procedure (stations MOCL, CRCL, and GAGN) are positioned in the western part and are closer to the sequence. For this reason the resulting catalog may have a slight spatial bias toward the west. The M_9M GPD and M_9M EQT catalogs show a more consistent distribution of events with an elongated shape in the NW-SE direction. A further, common feature which is particularly evident in the M_MATCH and M_9M EQT catalogs is a secondary alignment at the NW margin of the sequence. This secondary cluster, NNW-SSE oriented, exhibits hypocentral depths concentrated in the shallowest portion (5-8 km) of the main sequence. One question that arises from the comparison of the results of GPD and EQT (Figures 7e, 7f, 7g, and 7h) is whether the application of different methods could lead to different interpretation of the active structures. If we assume that both GPD and EQT are affected in the same way by the network density and geometry of the Mugello area, the differences shown can be attributed to the different implementation of phase picking in the two algorithms. In one case (GPD) only small waveforms windows of 4 s length are considered, while for EQT a more complex, hierarchical attention mechanisms is pursued in 60 s windows. Despite these different approaches, some common features of the seismicity are shown in the results, such as the continuity of the steeply dipping NW-SW fault plane located NW of the mainshock. Nonetheless, the overall picture of the seismicity obtained from EQT ( Figure 7g) is less smeared with respect of the GPD (Figure 7e), showing more details and suggesting the activation of a complex fault system. We expect that these differences may eventually be  reduced by the retraining of GPD and EQT with specific data set reflecting the peculiarities of the seismic activity along the Apennines.
Despite the differences discussed above, the main contribution of the enhanced catalogs based on DL is a better definition of the spatial extent of the structure activated after the mainshock. Both M_9M GPD and M_9M EQT catalogs clearly image a rectangular fault patch of about 5.5 km length by 5 km width at depth, which is confined on the E by the location of the mainshock and on the W by a sharp, NW-SE trending, line (parallel to the shown AB cross section in Figure 7). We finally remark that a significant amount of additional information has been gained with the DL analysis, as can be seen comparing the starting bulletin data from Figure 1b and the results of Figure 7.

Picking Performance Comparison
In this section, our aim is to quantitatively compare the performance of the different methods in the task of phase picking. Further details are available in Supporting Information S1 where we add a set of Figures S9-S12 in Supporting Information S1 showing an example of P and S phase arrival detection at CRCL station in the time interval shown in Figure 2, characterized by repeated foreshock activity.

M_9M Versus DL
We compare the time differences between P and S wave arrival times for events in catalog M_9M and the corresponding arrivals (same event, same stations) from M_9M GPD and EQT catalogs (Figure 8). The results are shown as frequency histograms: in each panel are reported the mean value (μ) and standard deviation (σ) of the corresponding distribution.
Both DL methods are more accurate in determining P wave arrivals with respect to S wave ones; this is likely due to the lower SNR of S wave arrivals, whose onset is contaminated by the P wave coda, or to the difference in the frequency content. The μ and σ values suggest that EQT picks have better quality than those attained with GPD for both P and S waves (e.g., Mousavi et al., 2020, Figure 6). The non zero mean values and the slightly asymmetrical distributions of frequency histogram for P phase detections (Figures 8a and 8c) indicates that both GPD and EQT tend to anticipate manual picks, while the opposite occurs for S phase detection (Figures 8b and 8d). This behavior is likely ascribed to both the bandpass filter used to maximize the SNR and to a feature of the procedures themselves.

EQT Versus GPD
The M_9M GPD and M_9M EQT catalogs have 769 events whose origin times differ by less than 0.5 s. In addition to what is shown in Figure 8, we now examine the arrival time differences for both the P and S phases picked by the two methods at common stations and common events (Figure 9). Both P and S wave differential times exhibit a symmetric distribution, but while there are no systematic differences in P wave timing, GPD detections of the S wave anticipate the EQT ones by a mean value of ∼0.03 s. This slight difference may derive from the different frequency bands in which waveform data are filtered: [3-20] Hz for GPD and [1-45] Hz for EQT, which are the same used in the pre-processing of the respective training data sets.
We also compare how GPD and EQT arrivals are distributed among the stations in shared events. In Figure 10 the number of P and S arrivals are plotted In many cases, the two DL methods identify almost the same number of arrivals for both P and S phases. Despite this generalized similarity there are some relevant exceptions such as the large difference observed at station MBEN: here GPD model detects only 30% of both P and S phase arrivals with respect to EQT. EQT performance is also significantly better for stations CASC and SEI. The opposite is found for the P phases at CRCL (EQT −30% with respect to GPD), and for both P and S phases at stations FNVD, VISG and RONT. We have verified that the difference in the number of picks detected at MBEN could be attributed to the value of the probability threshold defined to declare a pick in GPD algorithm. Lowering this value from 0.99 to 0.96 would allow us to triplicate the number of picks found. This suggests that a fine tuning of the probability threshold at different stations could yield better performance from the DL algorithms.

Summary of Results and Discussion
The main results can be summarized as follows: 1. The DL frameworks used throughout this study, EQT and GPD, show a marked ability to detect low magnitude earthquakes similar to that achieved using matched filter. The DL methods increase by a factor of ∼3 the number of earthquakes with respect to benchmark permanent station derived M_IV catalog. The use of the denser temporary network increases the number of earthquakes by a factor of 4 with respect to the benchmark catalog. 2. The DL algorithm's performance in picking seismic phases is comparable to manual analysis. Both EQT and GPD pick P and S wave onset with high accuracy. However, GPD S picks slightly anticipate both those obtained by EQT and manually determined picks. 3. GPD has a better overall performance in finding low magnitude earthquakes occurring very close in time with respect to EQT. However, EQT shows the larger number of detected phases turned into arrivals suggesting a higher precision of the algorithm. 4. The distributions of the hypocenters from all catalogs show that the aftershocks align with a W-NW direction from the mainshock. 5. DL catalogs definitely improved the definition of the spatial extent of the fault structures activated after the mainshock, suggesting an area of 5.5 km length and 5 km width.
In this paper, we analyze a single seismic sequence limited in time and space. This provides a "controlled environment" to test the effectiveness and the applicability of DL approaches we used to exploit the data set. Due to the lack of previous comparison studies and with the aim of testing how transportable these DL methods are to new data we use the GPD and EQT models out-of-the-box, simply tuning the forecast probability thresholds. While our results show that these methods work very well on our data set we expect to get better results if the DL models are trained on good quality data reflecting the peculiarities of the seismic release in the Apennines (Michelini et al., 2021). DL is a relatively new technique for the analysis of seismic data, and there is a need for a quantitative comparison of different approaches and DL schemes. To this end our paper is a first attempt to provide some insights, but there are ongoing efforts in developing open source frameworks, such as SeisBench , whose aim is to facilitate the comparison of different DL algorithms by a standardized approach .

Conclusions
In this paper, we focus on testing the performance of recent DL algorithms for event detection and seismic phase picking against manually derived picks. As a case study we analyze the seismic swarm associated to the Mw 4.5 occurred on 9 December 2019 in the Mugello basin, which lasted about one month, using seismic data from a dense network of 13 permanent and 9 temporary stations. The DL based analyses implemented on the permanent stations alone detect a factor of 3 more earthquakes than the 279 achieved manually by the INGV bulletin, due mainly to the better capability to identify seismic phases of low magnitude earthquakes with low SNR. When data from the denser temporary network are included a factor of 4 more earthquakes than the INGV bulletin are obtained. The DL algorithms detect phase arrivals with uncertainty similar to the manual picks, leading to better definition of microseismicity in the swarm and providing constraints on the geometry of activated structures. The DL models, generalizing what is learned from different data sets, can be effectively applied in different areas in an accurate, fast way. The adoption of DL methods in reprocessing existing data can lead to significant advances in the study of earthquake sequences, shed light on a variety of processes such as the time evolution of seismic sequences, or for more complex cases of fault interaction.