Machine Learning Approaches to Identifying Changes in Eruptive State Using Multi‐Parameter Datasets From the 2006 Eruption of Augustine Volcano, Alaska

Understanding the timing of critical changes in volcanic systems, such as the beginning and end of eruptive behavior, is a key goal of volcanic monitoring. Traditional approaches to forecasting these changes have used models motivated by the underlying physics of eruption onset, which assume that geophysical precursors will consistently display similar patterns prior to transition in volcanic state. We present a machine learning classification approach for detecting significant changes in patterns of volcanic activity, potentially signaling transitions during the onset or end of volcanic activity, which does not require a model of the physical processes underlying critical changes. We apply novelty detection, where models are trained only on data prior to eruption, to the precursory unrest at Augustine Volcano, Alaska in 2005. This approach looks promising for geophysically monitored volcanic systems which have been in repose for some time, as no eruptive data is required for model training. We compare novelty detection results with multi‐class classification, where models are trained on examples of both non‐eruptive and eruptive data. We contextualize the results of these classification models using constraints from petrological, satellite and visual observations from the 2006 eruption of Augustine Volcano. The transition from non‐eruptive to eruptive behavior we identify in mid‐November 2005 is in agreement with previous estimates of the initiation of dike intrusion prior to the 2006 eruption. We find that models which include multiple types of data (seismic, deformation, and gas emissions) can better distinguish between non‐eruptive and eruptive data than models formulated on single data types.

Previous approaches to identifying critical changes preceding eruptions have focused on incorporating physical understanding of potential drivers of eruption. For example, the Failure Forecast Method (FFM) is based on the assumption that an accelerating rate of a given observable geophysical parameter, such as the rate of seismicity, corresponds to the accumulation of subsurface damage (e.g., Kilburn et al., 2003;Voight, 1988;Voight & Cornelius, 1991). Eruptions may then be forecast by fitting a power-law relationship to the rate of occurrence of the observable parameter. However, physically motivated models rely on the assumption that the mechanics underlying the model are consistent, which may not always be the case. Application of a probabilistic formulation of the FFM to multiple run-ups to explosions in the same eruptive episode at Volcan de Colima, Mexico yielded variable results for run-ups in which the pattern of unrest was more complicated than a single acceleration of seismic rate (Boué et al., 2015). Other physically motivated models may be highly successful but are rather specific to one eruption or class of eruptions and limited in their applicability to a wide range of volcanic systems. For example, Anderson and Segall (2013) presented a physics-based extrusion model combined with a Monte-Carlo inversion approach, and demonstrated retrospectively that the final extruded volume of the Mt St Helens 2004-2008 dome-forming eruption could be successfully forecasted from up to two years prior to the end of the eruption (see also Segall, 2013).
Identifying eruption onset and end using machine learning classification approaches is a growing field. Dempsey et al. (2020) presented a machine learning approach to issuing short-term alerts from real-time seismic data at Whakaari, New Zealand, using supervised random forest classification on features derived from seismic tremor data and retrospectively applied the model to five episodes of eruption. Ren et al. (2020) applied both supervised and unsupervised classification to seismic data at Piton de la Fournaise, La Réunion and identified consistent tremor frequencies between eruptive episodes. Manley et al. (2020) identified transitions based on supervised classification of data from Telica and Nevado del Ruiz volcanoes, and showed that machine learning classification of features derived from seismic data alone held the potential to identify the end of eruptive sequences. Additionally, the timing of transition between eruptive and non-eruptive seismic activity at the end of eruption was found to be 2-4 months after the last visual activity observed at the volcano.
While seismic data has historically been the most common observable used to monitor volcanoes, complementary information is contained within other parameters, including gas and deformation data. Following on from our previous study (Manley et al., 2020) which was limited to seismic data, here we use multiple types of instrumental data, including seismicity, gas emission rate and GPS-derived deformation to classify behavior prior to and during a major eruption at Augustine volcano, Alaska, using a machine learning approach. We hypothesize that the range of physical phenomena captured from an ensemble of observables is much wider than would have been identified using a single data stream, and that the physical processes driving these phenomena will have operated across a range of depths within the volcanic system. Augustine Volcano, Alaska is a stratovolcano located in the Cook Inlet region (Figure 2), ∼280 km SW of Anchorage, and is one of the most historically active volcanoes in Alaska , with major eruptions in 1976 and 1986. The most recent eruption of Augustine began in January 2006, and was documented in detail Figure 1. Illustration of the range and relative depths of physical processes associated with observation of seismicity, gas and geodetic measurements. Arrows are a conceptual representation of movement of material. Gray boxes are a generalised representation of magmatic or hydrothermal activity. Annotations denote the corresponding physical processes outlined in Table 1.

Type
Observed Associated physical processes Seismic -Swarms of low frequency (LF) earthquakes -Volcanic tremor -Accelerating seismic rates for example, Cameron et al. (2018) or increase in real-time Seismic Amplitude Measurement (RSAM) -Declines in seismicity (if close to eruption) due to sealing Endo et al. (1996)  Note. All of these physical processes are conceptually illustrated in Figure 1.  Coombs et al. (2010) described the chronology of deposits from the eruption, inferred from observations made during overflights, and satellite remote sensing. Timing of the large ash explosions which defined the explosive phase of the eruption were constrained using infrasound (Petersen et al., 2006), and these observations which extended over the course of the eruption provide an independent constraint on the timing of transitions in activity at Augustine, and of the physical interpretations of our modeling results detailed below. The 2006 eruption of Augustine represents an ideal case study as the eruption was large, and the availability of comprehensive geophysical and observational datasets constraining the activity provides context for the results of classification modeling.
In order to discuss the transitions between the state of volcanic activity at Augustine, we first define the states of behaviour which we adopt here, namely "repose," "unrest," and "eruption". We use a definition of eruption after Siebert et al. (2015): eruption is characterized by the explosive ejection of fragmented new magma or older previously solidified magma, and/or the effusion of liquid lava. Unrest is defined as any non-eruptive activity above background which occurs outside the eruption, and the term repose is used for a volcano which is neither erupting nor displaying above-background signs of unrest. For the Augustine system specifically, the 2006 activity can also be divided into phases distinguished by their visual products erupted during these periods , and according to the Alaska Volcano Observatory system of rating volcanic activity known as the Level of Concern Color Code system. These classification schemes are described in detail in Section 2.1.
In this paper we compare results from classifying volcanic state using a novelty detection or one-class approach, which does not incorporate information from future days of the time series, to a multi-class approach in which models are trained on examples of both non-eruptive and eruptive data. These models are formulated on features derived from seismic, gas and geodetic data. We present a novel approach for handling missing and irregularly sampled values in SO 2 emission rate time series. We compare the performance of novelty detection and multi-class classification approaches using SO 2 emission rates estimated using either no future data points or with a probabilistic, retrospective approach formed using the full set of data points. We also compare models formulated with all datasets to models formulated using only seismic, gas or geodetic features.

Eruption of 2006
The most recent explosive eruption of Augustine began in January 2006 and is recorded in the Smithsonian Global Volcanism Program (GVP) database as a VEI 3 event. The GVP start-date for activity is December 9, 2005 and the end-date is April 27, 2006, based on reports of changes in seismicity, though the exact days are marked with uncertainty in the GVP record (Global Volcanism Programme, 2006). Figure 3 illustrates the timeline of the visible eruption during the 2006 period of activity. The 2006 eruption of Augustine began with several small, phreatic explosions on December 2, 10, 12, and 15 (Power & Lalla, 2010), which were not thought to contain juvenile material. The main phase of the eruption began on the January 11, 2006, with two large explosions generating substantial ash-bearing plumes. January 11 was also the day of the eruption with the highest seismic event rate. Explosive activity continued until January 28, after which the eruption style mainly comprised low-level ash emission, and small pyroclastic flows .
There are different ways to classify the stages of the event. At the time of its explosive eruption in 2006, Augustine volcano had a system of Concern Color codes which had 4 states: green, yellow, orange and red (Brantley, 1990). The beginning of the 2006 eruption was successfully forecast by the Alaska Volcano Observatory, with the Concern Color code being raised from Green to Yellow 3 days preceding the first phreatic explosions on December 2, 2005 ( Figure 3, Cameron et al., 2018). In the Concern Color Code system in place at the time, the Yellow level implies that an eruption is possible within the next few weeks. Activity associated with the 2006 eruption itself was split into 4 phases by , and activity before the eruption is known as the precursory phase, whereas activity following the designated eruption end-date is known as the post-eruptive phase. The eruption itself comprises an explosive phase, continuous phase, hiatus and effusive phase. The classification schemes associated with the Augustine eruption of 2006 are summarized in Figure 3.

Physical Constraints on the Magmatic System at Augustine
There are two primary lithologies associated with the 2006 eruption (Larsen et al., 2010): High-Silica Andesite pumice (HSA) and Low-Silica Andesite Scoria (LSAS). The volcano summit was located at 1.2 km above sea level (Power & Lalla, 2010). The primary magma storage was located at 3-5 km depth below sea level, as constrained by post-and syn-eruptive earthquake locations (Power & Lalla, 2010). Melt inclusion analyses of HSA (Webster et al., 2010) and lack of significant SO 2 emissions before December 2005 indicated storage at >4 km depth below the edifice . There was a primarily aseismic conduit linking the magma storage to the fracture zone; with few earthquakes located at these depths during the 1970s (Power & Lalla, 2010). The majority of earthquakes during the precursory swarm were located within a network of fractures from 100s of m above sea level to approximately 1 km below sea level (DeShon et al., 2010;Power & Lalla, 2010; Figure 3). Cervelli et al., (2010) suggest that there was no active magma column below the edifice prior to the 2006 eruption, inferred due to low fumarolic temperatures at Augustine. However, Jacobs et al. (2010) report increases in the monthly temperature average from January 2005, measured at the location of the AUS seismic station, which are attributed to a volcanic origin. The conduit which fed the 2006 eruptive episode has been suggested to have formed in the preceding weeks-months to the eruption (Cervelli et al., , 2010.

Data Processing
Before applying the machine learning methods (Section 2.3), we first extract a number of representative features from the seismic, gas and geodetic measurements. We derive a set of 40 features comprising 34 seismic features, 2 gas features and 4 geodetic features, which are outlined in Tables S3-S5 in Supporting Information S1. This full set of features is used to train and test the models described in Section 3.

Catalog Curation
We create a catalog by performing a full network detection on the seismic waveform data from Augustine volcano between April 30, 2005 and May 31, 2006. We use a Short-Term Averaging/Long-Term Averaging (STA/LTA) detection, with detection parameters from DeRoin and McNutt (2012). We perform this detection on the full network of seismic stations (Figure 2), using the GISMO toolbox for MATLAB (Thompson & Reyes, 2018). The detection parameters for identifying a trigger were an STA window of 1s and an LTA window of 8s, with an STA/ LTA threshold of 2 required for event triggering. The STA/LTA algorithm is used in "frozen" mode, meaning that the LTA window is fixed once an event is triggered. The detection was run twice over the data, once using a Black horizontal lines above the top bar represent the non-eruptive and eruptive training periods for the multi-class model (upper) and non-eruptive training period for the novelty detection model (lower). The top bar (a) represents the classification of activity presented in  and the bottom bar (g) represents the Concern Color Code status of Augustine throughout the data period. In the observable eruption bar (b), blue symbols denote days on which visual eruption was confirmed: stars represent days in which explosions were observed, diamonds represent days in which pyroclastic flows were observed and circles represent days in which there was evidence of lava dome growth and/or effusion (compiled from Coombs et al., 2010). In the GPS panel (d), blue dots represent the baseline length changes of the N-components of AV02-AC59 and orange dots represent the AV01 vertical component, which are the station and component combinations used to derive geodetic features. Measurements of daily SO 2 emission rate (c), GPS baseline length changes (d), seismic event rate (which is one seismic feature) (e) and seismic event depth (f) are normalized between 0 and 1 and the true bounds of these quantities are contained within the second y-axes for each plot. Seismic event depths are not used to derive features.
low-frequency filter (0.5-10 Hz) and once using a high-frequency filter (3-18 Hz). For a detection to be included as an event we checked that it was detected on 3 different stations within 10 s of each other. To exclude nearby regional events, we checked the resulting detection times against the USGS earthquake catalog (USGS, 2021). We considered all events occurring in the data period of at least 2.5 magnitude within 500 km of Augustine Island and manually checked events in our detection which coincided with a travel time between 30 and 300s of the USGS catalog, excluding approximately 50 events.
The event rates from the network detection are in agreement with event rates quoted in Buurman and West (2010). A comparison of the two event rates highlights good agreement on days with no events and for days with the highest event rates during the explosive phase of the eruption (January 11 and 13, 2006). For days with medium event rates, our catalog yields a daily event count approximately 1 order of magnitude greater than the AVO catalog. Event rates in the explosive phase onwards were affected by the outages of summit seismic stations, resulting in fewer detected events meeting the criteria for inclusion in the catalog.
Alaska Volcano Observatory (AVO) published an analyst-reviewed catalog comprising 2,367 seismic events associated with the 2006 eruption (Dixon & Stihler, 2009). For an event to be included in the AVO catalog, it must have greater than 3 P phases and greater than 2 S phases in addition to locating an event hypocenter to within 15 km of the volcano and with standard hypocentral errors. Due to these criteria for inclusion, the AVO catalog represents a subset of the total events at Augustine. Buurman and West (2010) find that 39% of the full number of events they detect at Augustine are included in the AVO catalog, and that the criteria for an event to be included mean that small and emergent seismic events are underrepresented in the catalog.

Seismic Feature Extraction
Using event times from both the AVO catalog and our catalog, we derive a set of seismic features from the event waveforms at station AUI, a seismic station which was in place before the eruption and was not disabled during the eruption (listed in Tables S1 and S2 in Supporting Information S1). We use the number of seismic events detected per day, and any days with zero seismic events are omitted from the time series. Though the gaps in the time series could be filled with a substitute value, e.g., the mean value of the feature (Schafer & Graham, 2002), we suggest that no value should be substituted as no events were detected therefore it would not follow to interpolate a value for the event characteristics of non-existent events. Dominant frequency information is derived by recording the 5 peak frequencies from each event in a given day. The ratio of low to high frequency energy in a waveform can be expressed via the band ratio (Buurman & West, 2010;Rodgers et al., 2015) and is calculated by taking the base 2 log of the ratio of energy within two windows of low frequency (1-2 Hz) and high frequency (10-20 Hz). These windows have previously been applied to data from Augustine (Buurman & West, 2010). We calculate peak amplitude based on the maximum amplitude of the event normalized by its mean amplitude. We compute the fast Fourier transform of each waveform and calculate the spectral width of the largest peak of each event.
After calculating features for each event on a given day, we then derive the set of seismic features for each day by taking the mean, median, variance, minimum, maximum, 10th percentile, 90th percentile and change in mean from the previous day for all categories apart from event rate, for which the per-day value and change in value from previous day are used.
We derive two sets of features from the seismic data, and compare the two sets of features in Section 4.2.
1. Features from seismic events from our catalog 2. Features from seismic events from the AVO Augustine catalog

Gas Time Series
SO 2 emission rate measurements for Augustine are available from July 2002 to July 2008. Measurements of SO 2 column abundance were made via airborne traverses, using COSPEC (McGee et al., 2010). The SO 2 emission rates which we use for this analysis were processed by McGee et al. (2010), who processed the raw COSPEC traverses by calculating the area under the SO 2 traverse time series and divided by the approximate width. This average signal was then multiplied by the plume width and average wind speed to determine an SO 2 emission rate in units of t/day. The time series of SO 2 emission rate is irregular, reflecting the challenges of making these airborne measurements. The median interval between measurements is 8 days, and the mean interval between 10.1029/2021JB022323 8 of 19 measurements is 50 days. There are larger intervals between measurements during the pre-eruptive and post-eruptive phases than the eruptive phases as the measurement frequency increased when seismic unrest began in 2005.
To use the gas time series data as input for our models alongside the daily time series from geodetic and seismic datasets, we use interpolation to estimate values of SO 2 emission rate between measurements made at Augustine. Figure 4 illustrates the two methods of interpolation used: 1. Gaussian Process Regression (GPR): a probabilistic approach in which a stochastic model is applied to provide estimates of intervening measurements with confidence intervals, in which no functional form is assumed. 2. Carry Last Value Forward (CLVF): until a new measurement of SO 2 emission rate is made, every day has the same value of emission flux.
To interpolate the gas measurements using the GPR, we fit the model to the measurements of SO 2 emission rate.
The values are first logged as we do not want to fit a model which can permit negative values for SO 2 emission rate, and because the noise in the data set is multiplicative that is, the uncertainties in emission rate increase as the magnitude of the emission rate increases. Before the eruption, there are values of zero in the data set which would be undefined in log space, so we choose a value an order of magnitude smaller than the lowest SO 2 emission rate as a "detection limit" of 10 0 t/day. We use a matérn covariance function (Rasmussen & Williams, 2006), which has previously been used to capture processes which are not smoothly varying in meteorological datasets (Guttorp & Gneiting, 2006).
These methods of interpolation each have advantages and disadvantages. CLVF is a conservative estimate of the daily emission rate, using no future emission rate measurements to interpolate the daily values. However, CLVF does not provide an estimate of uncertainty on the emission rate measurements, and also permits large jumps in the daily value of SO 2 emission rate. GPR does not assume an underlying functional form for the SO 2 emission rate time series and also yields an uncertainty bound on the interpolated daily values. The upper uncertainty bound broadly follows the form of the mean, apart from either side of the unrest and eruption (2004-2005 and 2007) where the uncertainty is higher due to an increase or decrease of measured emission rate respectively. This upper bound is not used as a feature in the models. GPR requires the use of all of the data points to estimate values for the intervening days and thus is a retrospective method. The total SO 2 flux estimated using CLVF interpolation is 4.3 × 10 5 t and the mean total SO 2 flux estimated using GPR is 4.5 × 10 5 t with 95% confidence interval bounds of 1.3 × 10 5 t and 2.0 × 10 6 t. Though neither method is preferable to a complete time series of measured daily values of SO 2 emission rate, the methods used for interpolating intervening daily values presented here represent two contrasting approaches to the problem of missing data in gas time series, which might find applications to estimating appropriate time-averaged SO 2 fluxes as well as for detecting changes in eruptive state as explored here.
Though measurements of CO 2 and H 2 S were attempted at the same time as SO 2 , there are under half as many values obtained for the emission rates of these species compared to SO 2 measurements (McGee et al., 2010). No value of CO 2 or H 2 S emission rate is available for the continuous or hiatus phase during the eruption, and there are significant gaps in the precursory and post-eruptive phases of the eruption. Consistent measurements of SO 2 emission rate of 0 t/day during the precursory and post-eruptive phases could be due to the process of SO 2 scrubbing by groundwater (Symonds et al., 2001). As the precursory phase of the eruption occurred during the Alaskan winter, the presence of snowpack on Augustine could also have contributed to scrubbing and suppressed the quantity of SO 2 measured at the surface. We derive features only from the SO 2 emission rate, and take the data as reported with no modification to account for potential scrubbing processes. The features extracted from the full SO 2 emission rate time series are the daily estimated value and the change in daily value from the previous day.

Geodetic Measurements
A GPS network comprising 6 stations was in place before the 2006 eruption of Augustine ( Figure 2; Cervelli et al., 2010). This network began to detect precursory deformation around July 2005, and the timing of the first dike intrusions is inferred from deformation measurements as occurring in November 2005 . Data availability from stations was variable during the eruption due to the destruction of stations by volcanic deposits during the later stages of eruption.
We use the daily GPS solutions provided by Nevada Geodetic Laboratory (Blewitt et al., 2018) in the International GNSS Service 14 reference frame (Rebischung & Schmid, 2016). We choose stations AV01, AV02, and AC59, which had a complete record of measurements before, during and after the eruption. For station AV01 we extract the daily vertical position. The difference between the maximum and minimum value of AV01 vertical component during the study period is 52.1 mm, a variation which exceeds the typical seasonal signal observed at AV01. For each day we also calculate the variation with respect to the previous day. For station pair AV02-AC59 we compute the daily change of the N-S component of the baseline vector with respect to a fixed chosen day. For the geometry of these two stations the baseline variation is a proxy for N-S extension or shortening across the volcano (e.g., Cervelli et al., 2010;Dixon, 1991;Saballos et al., 2014). The magnitude of change for the AV02-AC59 N-S component of the baseline is 18.9 mm. For each day we also calculate the variation with respect to the previous day. The features we use are daily vertical position, daily change in the N-S component of the baseline, and the difference with respect to the previous day of both the vertical position and the N-S component of the baseline.

Machine Learning Methods
Classification methods aim to categorize input observations into a finite number of categories which are representative of different types of activity of a physical system (Bishop, 2006). The methods we use are supervised, meaning that the model is trained on a selected set of training examples with assigned labels and then subsequently tested on unseen data. During the model training process, some training points (known as validation data) are held back and used to test the model in order to maximize the training accuracy. This validation data is distinct from the testing data which is subsequently used to evaluate the effectiveness of the models in classifying eruptive state. The models presented in Section 3 are trained on the seismic, gas and geodetic features described in Section 2.2. Classification methods used here can be split into one-class and multi-class classification, which are described in further detail below.

Novelty Detection Classification
Novelty detection (also known as one-class classification) methods involve training the model on a single class of observation, and then testing subsequent observations against this model to identify patterns which are anomalous or novel (Pimentel et al., 2014). In this approach, typically the training or "normal" class is better sampled than the abnormal or "novel" data with the objective of identifying the onset of the "novel" data. A novelty detection approach is advantageous for systems in which there may be several possible "abnormal" modes, and in which those abnormal modes are not well characterized. Existing applications of novelty detection include medical diagnostics (e.g., Quinn & Williams, 2007) and jet engine condition monitoring . In the geosciences, novelty detection has been previously applied to remote sensing for the purposes of classifying the occurrence of a single type of land use (Li et al., 2010;Mũnoz-Marí et al., 2010).
Use of novelty detection in volcanology is novel. Volcanic systems are conceptually well-suited to a novelty detection approach, as there is a broad spectrum of volcanic styles (thus failure types) associated with eruption. Moreover, larger eruptions are associated with longer repose times (Passarelli & Brodsky et al., 2012;Pyle, 1998) and thus previous eruptions may not have been fully instrumented (and therefore well-characterized). We use a nearest-neighbor-based approach to novelty detection, using the NDtool novelty detection toolbox (Clifton, 2007;Pimentel et al., 2014). For a nearest neighbor approach, the Euclidean distance between a test sample and its nearest neighbor within the training data is computed. The novelty score is then calculated by comparing this distance to the maximum distance between that nearest neighbor within the training data and its nearest neighbor in the entire training set. Hence, a larger novelty score is achieved where the test sample lies far outside the region of the training data as the distance between the testing sample and its nearest neighbor will greatly exceed the distance between that nearest neighbor and any point in the training data. The novelty threshold is defined as the maximum novelty score calculated within the training data set. Novelty detection models require a model training period which is sufficiently long to capture the distribution of the normal class, and we select the minimum possible training length for successful model training in order to test the model over the months preceding the eruption. We select a training period for the novelty detection model which begins on April 30, 2005, the first day of the data period, and ends on September 6, 2005. We hold back 1/5 of the training points to use as validation data for the model. This training period represents approximately 30% of the total number of days spanned by the data set.

Multi-Class Classification
Multi-class classification methods involve training models on two or more classes of observation, and then testing on unseen data. In this study we apply binary classification, with two classes: "non-eruptive" data and "eruptive" data, and we select representative training examples from both classes. For the non-eruptive training data, we select days at the beginning of the data period in order to train as far away from the hypothesized timing of the start of eruption, from April 30, 2005-August 24, 2005. For the eruptive training period, although the GVP start-date was a month earlier, visible eruption was limited to a short period of steam explosions and therefore we choose a training period from the main explosive phase of the eruption onwards, from January 10, 2006-April 12, 2006. This training period represents approximately 40% of the total number of days spanned by the data set.
We apply a random forest method, in which an ensemble of decision trees is averaged to produce a classification, wherein the decision represents the conversion from model inputs to final classification. Individual decision trees represent a series of operations which compare the features to randomly selected thresholds (Hastie et al., 2001). The aggregation of decision trees means that the final models have good generalization. We use standard 5-fold cross validation in order to test the model performance during the training process (Hastie et al., 2001). Once a final aggregated tree is formulated, the model is then tested on the unseen data. This method was successfully applied in previous work classifying volcanic state from features derived from seismic data (Manley et al., 2020).

Assumptions
The classification approaches applied here involve the underlying assumption that the variables used as input are Independent and Identically Distributed (IID). This assumption implies that features on a given day are mutually independent from features from other days, and features are drawn from the same underlying distribution between days (Cover & Thomas, 2006). Data are rarely perfectly IID in practice; however, we hypothesize that this approximation holds to a first-order and observe whether this hypothesis is supported by the resulting models distinguishing between previously unseen eruptive and non-eruptive data. This assumption was shown to be acceptable to a first order in previous work involving these classification methods (Manley et al., 2020) and we therefore retain this assumption for the seismic features. In this paper, we also assume that the individual gas and geodetic changes will be approximately IID for similar reasons to the seismic data: the generative process which produces measured geodetic and gas emission rates is assumed to not have memory of values on previous days. This assumption is less constrained for the interpolated GPR and CLVF values of SO 2 emission rate as the values interpolated for the missing days depend on the nearest measured values. Nonetheless, interpolation of the missing gas data is preferable to excluding the days with no measurements altogether. Gill et al., (2007) establish a similar methodology to fill missing values in hydrological datasets prior to applying machine learning algorithms, and find significantly better performance for models with interpolated values compared to models formulated with missing values omitted from the data set.

Results
We present results from one-class and multi-class classification models trained and tested independently on time series of features derived from the Augustine seismic, geodetic and gas data. We compare classification results using two end-member approaches for interpolation of the SO 2 emission rate: CLVF and GPR, introduced in Section 2.2.1. Figure 5 illustrates the results from novelty detection methods using features derived from seismic, geodetic and both gas data interpolation methods, which are also summarized in Figure 6c and Table S6 in Supporting Information S1. These models were both formulated using a nearest-neighbor approach to novelty detection. Each day of the model is classified independently, therefore there is no influence of previous or future novelty scores on the current day. There are several observations which can be made from the novelty detection results ( Figure 5):

Novelty Detection Results
1. Both models begin to consistently classify days as eruptive prior to the phreatic explosions in December 2005.
The novelty scores begin to exceed the novelty threshold in mid-November 2005 for both gas interpolation methods (GPR and CLVF). This classification of abnormal behavior precedes the change in Concern Color code at Augustine (Cameron et al., 2018). 2. The overall shape of the novelty score is comparable to the method of SO 2 emission rate interpolation, thus the CLVF novelty scores appear step based and GPR novelty scores are smoothly varying. 3. There is agreement between the models on the end-date of eruption-GPR novelty scores and CLVF novelty scores do not return below novelty threshold within the data period end-date of May 2006. 4. Precursory classification of abnormal behavior in the CLVF model occurs whilst SO 2 emission rates remain at 0 t/day, indicating that changes in the geodetic and seismic features are primarily responsible for the novelty scores. We compare the novelty detection model results to the GVP labels and find 84% testing agreement for the CLVFbased model compared to GVP labels and 58% agreement for GPR-based model compared to GVP labels. The misclassifications with respect to the GVP labels are entirely false positives, that is, more days are classified as eruptive compared to the dates recorded in GVP. The model agreements reported here are computed relative to the labels recorded in GVP, where the beginning of the eruption is recorded as December 9, 2005 and the enddate is April 27, 2006. Labels from GVP are defined by the presence or absence of visual evidence of volcanic activity. However, as discussed in Section 1 and by Manley et al., (2020), a range of subsurface processes are sampled by seismic, gas and geodetic observations which are not necessarily accompanied by surface activity and therefore reflected in the GVP labels.

Multi-Class Results
The results from random forest classification are illustrated in Figure 6 and outlined in Table S6 in Supporting Information S1. As in the one-class models, each day of the model is classified independently thus the classification on any given day is not influenced by the classification on another day. The following key observations can be made from the two random forest approaches: 1. The model formulated using seismic, geodetic and GPR-interpolated gas features (Figure 6a) Table S6 in Supporting Information S1 for more details.
2. There is some agreement between the models on the end-date of eruption-the CLVF-based model returns to non-eruptive classification in early mid May 2006, whereas the GPR-based model end-date is slightly earlier, in early May.
The agreement of the random forest models with GVP labels is higher for both interpolation approaches than the one-class approach. Overall model agreement for the model with CLVF-interpolated gas features is 86% and model agreement for the model with GPR-interpolated gas features is 82%. The multi-class model with CLVF-interpolated gas features begins to classify days as eruptive later than all other novelty detection or multi-class models. However, the classification of eruptive activity still begins approximately a month before the main explosive phase of the eruption, in which the first juvenile material associated with the eruption was ejected.

Physical Interpretation of Model Results
Both CLVF-based and GPR-based novelty detection models and GPR-based multi-class models begin to classify days as eruptive during the precursory phase of the Augustine eruption, around November 2005. For the GPRbased novelty detection model, the novelty score does not return below the novelty threshold for the remainder of the study period whereas the CLVF-based novelty scores have isolated peaks above the novelty threshold. Previous qualitative models of activity during the precursory phase of eruption are consistent with transition at this time. On November 17, 2005, there was a marked offset in the GPS-derived offsets between AV04 and AV05, which Cervelli et al. (2010) attribute to a crack opening at a depth near sea level, and visual observations of surface cracks at the summit. A day later on November 18, the rate of edifice inflation also increased (Cervelli et al., 2010).
Previous modeling of ascent duration based on 40-100 micron reaction rims in amphiboles from the LSA magma is consistent with a ∼20-57 days duration of ascent (Larsen et al., 2010). This petrologically derived constraint is consistent with dike initiation around November 17 and 18, as the shortest ascent duration would be associated with an eruption date around December 9, 2005 (consistent with phreatic explosions) and the longest ascent duration would be associated with an eruption date around January 12, 2006 (consistent with beginning of the explosive phase). Therefore, the transition which we identify using machine learning approaches in the precursory phase could be associated with physical changes associated with dike initiation.
Earthquakes in the early precursory phase of activity at Augustine, prior to dike initiation, are attributed to the accumulation of gas below the conduit (Larsen et al., 2010). The depth of earthquakes which would be generated by this mechanism is consistent with the model of Huppert and Sparks (2016), who model the pressure distribution associated with the rise of compressible vapor through a porous conduit. In this model, the gas pressure scales with the square root of depth, whereas the lithostatic pressure scales linearly with depth. Therefore, a critical overpressure is thought to occur approximately 2/3 of the distance between the source depth (i.e., magma storage zone) and the surface (i.e., the edifice) when the gas pressure exceeds the lithostatic pressure and causes rock fracture, thereby nucleating VT events. The observed depths of precursory VTs during the unrest at Augustine are consistent with this model, as the events are concentrated in a fracture zone 3-4 km above the primary magma storage depth. A similar process has been observed at Soufrière Hills Volcano, Montserrat, where VT swarms occurred in an overpressure zone ahead of the ascending magma . The eruptive transition we identify from our classification modeling could correspond to the source mechanism of precursory earthquakes changing as dike intrusion began. This hypothesis is supported by Syracuse et al. (2011) who relocated seismic events at Augustine with an improved velocity model, and found that from November 2005 onwards seismic event locations began to align in a NW-SE trend consistent with intrusive volumes modeled by Cervelli et al. (2010) and the regional stress field (Buurman et al., 2014).
The transport of magma to the surface via dike intrusion is also associated with changes in gas measurements and ground deformation (e.g., Sparks et al., 2012). Increased emission of SO 2 can represent shallow magmatic ascent of gas-rich magma, as the degassing of SO 2 occurs at lower pressures than CO 2 (Moran et al., 2011). Where gas scrubbing processes have suppressed the surface emission of SO 2 prior to eruption, increases in SO 2 emission rate can represent the boiling and subsequent removal of a hydrothermal system to allow previously scrubbed SO 2 emissions to reach the surface (Pritchard et al., 2019). Magmatic ascent is often associated with deformation which, in the case of the Augustine eruption, was recorded as an increase in edifice inflation rate (Cervelli et al., 2010).

Comparison of Seismic Catalogs
In this study, we derive features from seismic, gas and geodetic data. Figure 7 is a 2-dimensional visualisation of the full set of seismic, gas flux and geodetic features (with 3-dimensional representation in Figure S2 in Supporting Information S1). This visualisation is performed using the t-Distributed Stochastic Nearest Neighbor Embedding (t-SNE), which is a probabilistic machine learning algorithm designed to project high-dimensional data-the set of 40 features-into lower dimensional space such as 2-or 3-dimensions (van der Maaten & Hinton, 2008). This process is performed by calculating a joint probability for points in the high-dimensional space using the conditional probability that a given datapoint would pick another as its neighbor, given a probability distribution described by a Gaussian centered at that point. The low-dimensional joint probability is calculated using the Student t-distribution with one degree of freedom. The final mapping is calculated by minimizing the Kullback-Leibler divergence between these two joint probabilities, such that points which are close in high-dimensional space remain close in the two-dimensional representation. We apply the t-SNE to four sets of features: the two sets of seismic and geodetic features used for classification in Section 3 (with GPR or CLVF derived daily values for gas), and two sets of features derived from the AVO seismic catalog (with the same geodetic and GPR or CLVF derived daily values for gas flux).
Comparison of the t-SNE projections for the set of seismic, gas and geodetic features containing seismic features derived from our seismic event catalog shows that the non-eruptive and eruptive days are clustered in separate areas of the projected space, with little overlap between the two clusters. The transition between non-eruptive and eruptive data is smoother for GPR-based interpolation than CLVF-based interpolation, which reflects the continuous nature of GPR-based interpolation. Novelty detection classification methods rely on the ability to fully characterize the training class of data, so that outliers can be reliably identified. The distinction between clusters in the projection of the Augustine data set illustrates that there are systematic differences between the eruptive and non-eruptive days of the data set which can be exploited by a one-class approach. Comparison of the projections of features from our catalog and AVO catalog-derived features indicates that the differences in eruptive and non-eruptive data are less distinct for the AVO catalog, with no clear separation of data into clusters on this basis. The increased separability of features from our catalog suggests that the smaller, more emergent events not included in the AVO analyst-reviewed catalog contain meaningful information which contributes to distinguishing between eruptive and non-eruptive data.

Comparison of Multi-Parameter and Single-Parameter Models
For comparison against the full sets of features used for models in Section 3, we trained and tested three additional novelty detection models using features derived from only the seismic, gas or geodetic data with the same training dates ( Figure S1 in Supporting Information S1). The models formulated using seismic data and geodetic data alone have a similar trend to the model using the full set of features in the precursory months, with novelty scores from both models exceeding the novelty threshold in November 2005. However, both seismic and geodetic models fail to consistently classify days from the main eruptive sequence above the novelty threshold. In the case of the seismic data, we propose that this model failure arises due to outages in summit seismic stations during the explosive phase of the eruption, which means that smaller events are underrepresented in our catalog during the later stages of the eruption. For the geodetic data, the inflation signal was concentrated in the precursory phase and, once the main phase of the eruption commenced, the daily displacement decreased until similar to pre-inflation values. Models formulated using gas features with a GPR interpolation classified behavior as eruptive as Figure 7. Two-dimensional visualisation of the seismic, gas flux and geodetic features derived from Augustine using the t-distributed stochastic nearest neighbor embedding (t-SNE; van der Maaten & Hinton, 2008). Blue points represent non-eruptive days and red points represent eruptive days according to the Global volcanism program (GVP) database. The top two panels, which include seismic features from our seismic catalog, display a greater separation between non-eruptive and eruptive days than the bottom two panels, which include seismic features from the Alaska volcano observatory (AVO) analystreviewed catalog. A three-dimensional visualisation of these datasets using t-SNE is plotted in Figure S2 in Supporting Information S1. The x and y axes represent non-linearly scaled distances projected into a two-dimensional space and thus have no units, and are not comparable across axes. soon as model training ended, whereas models formulated using gas with a CLVF interpolation fail to identify an appropriate novelty threshold as the interpolated SO 2 emission rates (hence novelty scores) remained very low until significant emissions were measured on December 20, 2005. The GPR failure likely results from the smoothly varying nature of the interpolated time series where there is no stable normal behavior to characterize during model training, whereas the CLVF failure is due to the low surface emission of SO 2 during the precursory phase.
The results of this single data modeling indicate that, in the case of the 2006 Augustine eruption, use of multiple types of data for classification is able to discern between non-eruptive and eruptive data better than single data-streams alone for novelty detection models. When all of the features are included in the model, the precursory increasing trend in novelty score remains similar to the individual seismic and geodetic models and eruptive days are better identified due to the inclusion of the SO 2 emission rate features. Moreover, the single data analysis indicates that the inclusion of multiple datasets can compensate for data quality issues due to instrument failures during a large eruption. Applying multi-class models to single data-streams was not attempted due to the issues identified above pertaining to isolating suitable eruptive and non-eruptive training periods for each of the data-streams separately.

Forecasting Eruptions
Both one-class and multi-class methods, explored in this paper, have potential utility in forecasting. Poland and Anderson (2020) note that machine learning approaches represent an as-yet underutilized resource in eruption forecasting, which is in part due to limited quantities of available training data. The choice to apply one type of machine learning model over another involves consideration of the past volcanic activity and/or monitoring infrastructure available at a given volcanic system, as these factors affect the data available for training models. Here, we discuss possible applications of novelty detection versus multi-class approaches to the detection of eruptive transitions in different volcanic scenarios.
Novelty detection approaches-those in which models are trained only on non-eruptive data-would be applicable to monitored volcanoes which have been in repose for some time (e.g., 10s years), as these systems would not have eruptive data available for training but a large amount of data from repose to characterize the non-eruptive class. This class of volcanoes would include many of those that have experienced major eruptions in recent decades (Pyle & Barclay, 2020). Other potential applications include volcanoes which have displayed several types of activity in the past or have highly variable manifestations of eruptive activity. Here, a novelty detection approach would not assume a consistent mode of eruption. Additionally, volcanoes where the monitoring infrastructure has significantly changed since the previous eruption are a good candidate for novelty detection, as the eruptive data from the previous eruption would not necessarily characterize the eruptive data on the updated monitoring network.
Multi-class approaches-those in which models are trained on both non-eruptive and eruptive data-could be more useful at volcanoes which display consistent signals associated with eruption. Here, eruptive training data available from past eruptions would well-characterize the eruptive behavior of the system in order to identify future eruptive signals. For example, Piton de la Fournaise volcano has a consistent tremor signal that is correlated with eruption (Battaglia & Aki, 2003) and most eruptive episodes are preceded by precursory VT swarms (De Barros et al., 2013), which has made it a good candidate for previous multi-class studies (Ren et al., 2020). Volcanoes with short repose time between eruptions could be better-suited to a multi-class approach, as there would be insufficient time to characterize the non-eruptive class with a one-class approach.
For the purposes of forecasting eruptions, consideration of which methods are used to interpolate values for missing data points are necessary. CLVF-based interpolation involves no future information to fill in missing values, therefore is well-suited to a forecasting context. GPR-based interpolation uses the full set of data points to construct the model from which the daily values for the training period are taken and therefore is a purely retrospective analysis in this study. However, Colopy et al., (2018) demonstrate that Gaussian Processes can be successfully applied in healthcare for forecasting patient vital signs beyond the model training periods. Future work could take Gaussian Processes into the domain of real-time volcanic forecasting.

Elucidating Timings of Transition in Volcanic State
Future applications of classification could extend the two-class approach here to investigate intra-eruptive transitions in activity. Prolonged eruptions may undergo intra-eruptive transitions; for example, dome-forming eruptions often switch between effusive and explosive behavior (e.g., Edmonds & Herd, 2007;Preece et al., 2016). Characterizing and understanding the nature and timing of changes in eruptive intensity and style, and anticipating future transitions in behavior, is critical for managing hazards during eruptive crises (e.g., Barclay et al., 2019;Fearnley & Beaven, 2018). For systems in which the transitions between activity are sufficiently long to generate enough training data, one-class or multi-class methods trained during eruption could help to distinguish intra-eruptive transitions.
Consideration of the methods for interpolating gaps in irregularly sampled time series is necessary. As discussed in Section 3, we observe different end-dates for the 2006 eruption depending on the SO 2 interpolation methods used. We find that models formulated based on GPR interpolation of the SO 2 time series yield an earlier estimate of eruption end-date than models formulated based on CLVF-based interpolation. This observation is explained due to the 'carry forward' nature of CLVF: at the end of the eruption where the SO 2 emission rate is consistently decreasing between observations, CLVF represents an overestimate of the daily emission rate and therefore data is classified as eruptive for a longer period. Consideration of interpolation methods becomes more important as the sampling frequency of the time series decreases, as more points need to be estimated for sparser time series.
Application of the approach presented in this paper is not limited to the data streams used here. Additional ground-based data streams from which features could be derived include infrasound or thermal measurements. Satellite-derived observations of ground displacement (e.g., InSAR) are being made in increasing frequency, and could be incorporated into a machine learning framework. Indeed, there is a growing body of literature examining the automatic detection of unrest in InSAR time series (Anantrasirichai et al., 2018;Biggs et al., 2019;Gaddes et al., 2019;Valade et al., 2019).

Conclusions
Identifying transitions in multi-parameter time series is a challenge in volcanology. We apply novelty detection (one-class) and multi-class machine learning methods to features derived from seismic, gas and geodetic data from the 2006 eruption of Augustine Volcano, Alaska. We present the first study applying a one-class or novelty detection approach to classify volcanic state, which does not require training examples characterizing eruptive behavior. We suggest a framework for processing multi-parameter data to produce daily representative features, including two distinct approaches toward interpolating missing values in SO 2 emission rate time series. These approaches include Carry Last Value Forward (CLVF) interpolation in which the daily value remains constant since the previous measurement of emission rate, and Gaussian Process Regression (GPR) which is a probabilistic approach using the full available data to estimate intervening values with uncertainty bounds. We compare the results of models formulated using seismic, geodetic and both CLVF and GPR estimates of daily gas emission rates. Both novelty detection and multi-class random forest models identify a key change-point from non-eruptive to eruptive state in mid-November. This change-point is in agreement with previous conceptual models at Augustine which suggest the initiation of dike intrusion from the shallow magmatic storage at this time. We find that novelty detection models formulated using multiple types of data are able to distinguish between non-eruptive and eruptive data better than models formulated on single data-streams alone. We suggest possible applications of one-class novelty detection and multi-class modeling and discuss considerations in applying these approaches in a monitoring or retrospective context. Novelty detection methods show promising application for monitoring volcanoes where previous eruptions have not been observed, or where the monitoring infrastructure has significantly changed since previous eruptions.

Data Availability Statement
Augustine seismic data was collected by the AV network (AVO/USGS, 1988). Augustine geodetic data was taken from the continuous GPS network installed as part of the National Science Foundation Plate Boundary Observatory  and can be obtained from Blewitt et al. (2018). Augustine gas data can be obtained from McGee et al. (2010). We used the Generic Mapping Tools version 6 (Wessel et al., 2019). The features and code to run the models from this paper can be found at Manley (2021).