Probabilistic Source Classification of Large Tephra Producing Eruptions Using Supervised Machine Learning: An Example From the Alaska‐Aleutian Arc

Alaska contains over 130 volcanoes and volcanic fields that have been active within the last 2 million years. Of these, roughly 90 have erupted during the Holocene, with many characterized by at least one large explosive eruption. These large tephra‐producing eruptions (LTPEs) generate orders of magnitude more erupted material than a “typical” arc explosive eruption and distribute ash thousands of kilometers from their source. Because LTPEs occur infrequently, and the proximal explosive deposit record in Alaska is generally limited to the Holocene, we require a method that links distal deposits to a source volcano where the correlative proximal deposits from that eruption are no longer preserved. We present a model that accurately and confidently identifies LTPE volcanic sources in the Alaska‐Aleutian arc using only in situ geochemistry. The model is a voting ensemble classifier comprised of six conceptually different machine learning algorithms trained on proximal tephra deposits that have had their source positively identified. We show that incompatible trace element ratios (e.g., Nb/U, Th/La, Rb/Sm) help produce a feature space that contains significantly more variance than one produced by major element concentrations, ultimately creating a model that can achieve high accuracy, precision, and recall on predicted volcanic sources, regardless of the perceived 2D data distribution (i.e., bimodal, uniform, normal) or composition (i.e., andesite, trachyte, rhyolite) of that source. Finally, we apply our model to unidentified distal marine tephra deposits in the region to better understand explosive volcanism in the Alaska‐Aleutian arc, specifically its pre‐Holocene spatiotemporal distribution.

of effort required to generate large amounts of high quality in situ trace (<0.1 wt%) element concentrations from geological materials, subsequently allowing for a new set of high-quality discriminators to be used in tephrochronology (D.J. Lowe et al., 2017).
Tephra studies can be used to develop a complete understanding of the explosive eruptive history at a single volcano (e.g., Mullineaux et al., 1975; Sisson & Vallance, 2009; Wallace & Miller, 2014). Employing consistent methodologies and data handling practices (e.g., sampling, analytical procedures; Wallace et al., 2022), however, also allows tephras to be linked through tephrochronology to produce accurate regional volcanic eruption histories (e.g., Fierstein, 2007; Shane, 2000). Regional studies also have the added benefit of helping constrain eruption magnitudes and volumes by aiding in accurate isopach construction (e.g., Buckland et al., 2020). Furthermore, eruption magnitudes and volumes, along with the frequency at which they occur at a specific volcano, are integral pieces of information required for accurate hazard classification (Ewert, 2007; Ewert et al., 2018). Complicating these classifications, however, is the observation that volcanic eruption magnitude is inversely proportional to frequency (Mason et al., 2004), ultimately necessitating a long-term tephra record to accurately quantify the probability of large eruptions and their source volcanoes for a specific region.

Large Tephra Producing Eruptions in the Alaska-Aleutian Arc
The Alaska-Aleutian arc extends ∼3,000 km from Cook Inlet to the Kamchatka Peninsula (Figure 1). Seismic studies indicate crustal thicknesses ranging from ∼25 km in the western part of the arc (Lizarralde, 2002) to 45 km toward Cook Inlet (Veenstra et al., 2006). The composition of the crust is also quite variable along strike, ranging from largely basaltic west of Unimak Island (Vallier et al., 1994) to highly variable (i.e., comprised of various accreted sedimentary, igneous, and metamorphic terranes) under mainland Alaska (Nokleberg et al., 1994). In addition to changes in crustal composition along strike, there is also highly variable sediment input (Jicha et al., 2004; Singer et al., 2007) and subduction angle (Minster & Jordan, 1978). These characteristics ultimately lead to the production of a spatiotemporally diverse suite of magmas and eruptions across the Alaska-Aleutian arc over the last 46 million years (Jicha & Singer, 2006).
The "modern" (i.e., Quaternary) Alaska-Aleutian arc contains over 140 volcanoes and volcanic fields (Cameron et al., 2022), many of which have produced eruptions that distribute large volumes of volcanic ash over Alaska and northwest Canada and/or create calderas. Many of these calderas were first identified during the arc-wide mapping project conducted throughout Alaska by the U.S. Geological Survey from 1946 to 1954, led by Robert Coats (e.g., Geological Survey Bulletin series 1028). The first published list of Alaska calderas was produced by Powers (1958), then later edited slightly by Miller and Barnes (1976). Continuing to refine our understanding of explosive volcanism in Alaska in both space and time, Miller and Smith (1987) showed that 12 volcanic centers from the eastern Aleutians have produced calderas during the Quaternary, including (from east to west): Kaguyak Crater, Mount Katmai, Ugashik-Peulik volcanic complex, Aniakchak Crater, Black Peak, Mount Veniaminof, Emmons Lake, Fisher Caldera, Makushin Volcano, Akutan Volcano, and Okmok Volcano.

Machine Learning Classification in Tephrochronology
Advancements in personal computing, coupled with increased ease of access to high-powered algorithms (e.g., Buitinck et al., 2013; Pedregosa et al., 2011), have allowed for the implementation of machine learning by non-experts in just about every scientific field, including the geosciences, where machine learning has been applied for more than 50 years (Dramsch, 2020). It has, however, only recently been applied to tephrochronology (Bolton et al., 2020; Petrelli et al., 2017) and petrology (e.g., Boschetty et al., 2022; Caricchi et al., 2020; Jorgenson et al., 2022; Li & Zhang, 2022; Musu et al., 2023; Petrelli et al., 2020) applications. Machine learning tasks can largely be grouped into two categories: regression and classification. In regression, the goal is to use a machine learning algorithm to predict a continuous number (e.g., pressure of mineral formation based on mineral composition; Jorgenson et al., 2022), whereas in classification the goal is to predict a class label (e.g., source volcano based on tephra major element glass chemistry; Bolton et al., 2020). In both instances a machine learning algorithm uses features (explanatory variables) to predict either a target value (regression) or class (classification).
Tephrochronology problems are well-suited for machine learning classification. There are many potential features (geochemical and physical) in tephra data sets, and linking tephras to source volcanoes or other tephras through manual inspection of data can be a difficult and time-consuming task. Furthermore, when geochemical compositions from different tephra units are similar (but not identical), considering only a few dimensions of a data set when discriminating between classes (i.e., tephra units), either through visual inspection of bivariate plots or using dimensionality reduction techniques (e.g., principal component analysis), often leads to non-unique solutions, ambiguous results, and non-robust correlations, as there is typically either significant information loss or not enough variance between classes in reduced dimension space (e.g., Hopkins et al., 2015; Petrelli et al., 2017). Conversely, considering a much larger set of features (n ≫ 2) to create a more complex n-dimensional data set, although more thorough, is challenging to interpret and visualize without the use of machine learning algorithms.
Using major-element geochemical information as feature space for various classification algorithms, Bolton et al. (2020) show that classification of source volcanoes for Alaska-sourced tephras can potentially be extremely accurate for SE Alaska, while Petrelli et al. (2017) show that trace elements may also be useful additions to model feature space. In this work we expand on these two pioneering studies and incorporate the use of common incompatible trace element ratios (e.g., Derkachev et al., 2018; Hopkins et al., 2015) as features to create an ensemble voting classifier that accurately and confidently predicts the volcanic source of regionally distributed tephra deposits in Alaska. We test the ability of our classifier to accurately predict the source of unknown eruptions through the systematic training and testing of source predictions on left-out eruptions from our data set, which we argue is the best way to assess a classifier's true ability to predict the correct answer on data it has not yet seen. We then apply our classifier to a suite of tephras from International Ocean Discovery Program (IODP) sites U1417 and U1418 in the Gulf of Alaska from the last 800 ka to identify their most probable source volcanoes and better understand the spatiotemporal evolution of LTPEs in Alaska. Finally, we compare the results of our model to marine core samples and interpretations from Derkachev et al. (2018) from the Bering Sea. Combined, these two applications help validate our model in both the eastern and western parts of the Alaska-Aleutian arc, demonstrating its utility as a tool for accurately linking tephras to their source volcano throughout the region.

Samples
Samples utilized in this project were gathered by U.S. Geological Survey scientists and partners over decades of fieldwork. Considerable effort has been made to ensure that every sample gathered has the appropriate metadata required for a large, arc-scale project (e.g., Wallace et al., 2022). These metadata (e.g., location, sampling geologist, date, sample description) can be easily tracked through the Alaska Volcano Observatory geologic database of information on volcanoes in Alaska (GeoDIVA; https://www.avo.alaska.edu/geochem/; Cameron et al., 2022) using the unique Alaska Tephra Lab (AT) number assigned to each sample. To maintain a high degree of confidence that every sample in our data set is accurately linked to its source volcano, only proximal tephra samples previously linked to a source volcano are used. When possible, a small chip from approximately 10-12 different airfall pumices from a single sample is used to minimize sampling bias, and three to four analyses are completed on each chip. Chips from each sample are mounted for in situ geochemical analyses in standard 25 mm diameter epoxy rounds that have pre-drilled holes to avoid inter-sample contamination. If a large compositional range exists for a given eruption (e.g., Katmai 1912; Hildreth & Fierstein, 2012), we try to ensure that samples covering this entire range are analyzed. Figure 1 shows the volcanic sources investigated in this study as well as their location within the Alaska-Aleutian arc. A list of all samples used in this project and their basic metadata can be found in Table 1 and Text S1 in Supporting Information S1. We note that this list does not include every volcano with evidence for an LTPE (see Section 1.2 above). This is due to one of two reasons: (a) we do not possess a proximal tephra sample from the inferred LTPE; or (b) we possess a proximal sample from the inferred LTPE, but it does not contain material suitable for quality in situ analyses (e.g., devitrified glass, too microlitic). Samples utilized from IODP Expedition 341 are found in Table 4. More information on the metadata and descriptions of these samples can be found in Jaeger et al. (2014).

Electron Probe MicroAnalyzer
Major elements and their oxides (SiO2, TiO2, Al2O3, FeOT, MnO, MgO, CaO, Na2O, K2O, P2O5, Cl) in tephra matrix glasses were measured using a JEOL JXA-8530F+ Electron Probe Microanalyzer (EPMA) located at the U.S. Geological Survey in Menlo Park, CA. Glass analyses were performed using a 5 μm defocused beam, an accelerating voltage of 15 kV, and a current of 10 nA. Calibration standards, crystal assignments, and count times can be found in Table S1 in Supporting Information S1. Data were collected and processed with Probe for EPMA software (http://www.probesoftware.com/). All cations were obtained from raw intensities using Phi-Rho-Z intensity corrections (Armstrong, 1988), while oxygen was calculated by cation stoichiometry. Accuracy and precision of analyses were determined by measuring natural rhyolite and basalt glass reference materials periodically throughout all analytical sessions, and a summary of these measurements can be found in Table S2 and Figure S1 in Supporting Information S1. Accepted values for Smithsonian glasses are taken from Jarosewich et al. (1980), except for Mg on VG-2 (Helz et al., 2014), and Cl concentrations from various sources compiled on the Smithsonian National Museum of Natural History website. RLS-132 values are reported in Macdonald et al. (1992).
All EPMA measurements can be found at Loewen, Wallace, Coombs, and Mulliken (2023), Loewen, Dietterich, and Rosenkrans (2023), and Lubbers et al. (2023). A long-term assessment of the accuracy and precision of the analytical methods described above can be found at Loewen, Wallace, Lubbers, et al. (2023). EPMA data for IODP samples utilized can be found at Benowitz et al. (2023), and their analytical methods can be found in the Supplementary Appendix.

Laser Ablation Inductively Coupled Plasma Mass Spectrometry
Trace element concentrations in tephra were measured using an Applied Spectra RESOlution SE ArF excimer laser ablation (LA) system connected to an inductively coupled plasma mass spectrometer (ICP-MS). Analyses were conducted with a 24 μm spot, a pulse rate of 10 Hz, a fluence of 7.2 J cm⁻², and an ablation time of 22 s. Elemental concentrations were calculated from analyte raw signals using the methodology of Kent and Ungerer (2006) and Longerich et al. (1996) and the software LaserTRAM-DB (Lubbers et al., 2021), where ²⁹Si was used as the internal standard analyte. GSD-1G and GSE-1G reference glasses were run as primary standards every 15 analyses (∼15 min of analysis time), and GSE-1G was used as the calibration standard for calculating concentrations. GSD-1G, GSE-1G, BCR-2G, ATHO-G, and NIST-612 were run as standard blocks at the beginning and end of each experiment. Accuracy and precision of standard reference materials were determined by comparing measured concentrations to preferred values reported in the GEOREM database (Jochum et al., 2007) and can be found in Table S3 and Figure S2 in Supporting Information S1. All LA-ICP-MS measurements can be found at Loewen, Wallace, Coombs, and Mulliken (2023), Loewen, Dietterich, and Rosenkrans (2023), and Lubbers et al. (2023). LA-ICP-MS data for IODP samples utilized can be found at Benowitz et al. (2023).

Outlier Removal
To mitigate the effect of anomalous analyses influencing our model, we manually inspect data for outliers and analyses with low major-element totals after concentrations are calculated from raw data. This outlier filter is effectively an element-by-element standard deviation filter, with the null hypothesis that an individual tephra chip is homogeneous in matrix glass composition: if numerous elements show significant variance, such that their standard deviation is larger than ∼1%, the analysis is flagged as an outlier, as this means either (a) the analyzed material was a mixture of crystal and matrix glass, or (b) the analysis is not representative of the overall matrix glass composition. The filter is applied at the scale of an individual tephra chip, such that the overall variance within one chip remains small while still preserving the natural variance that may occur within a single sample (i.e., eruption).
In most cases this results in less than 1.0% variance for each oxide in a tephra chip. Of the 1,080 EPMA tephra matrix glass analyses, 61 were rejected for low overall totals and 36 were rejected for being within-chip outliers.
For trace elements, much of the spot rejection is done during the initial data reduction (e.g., analyses that are not tephra matrix glass), lessening the need for manual filtering. We still, however, manually filter for outliers within a single chip using the same logic as for major elements, resulting in 1,719 LA-ICP-MS spot analyses.
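The per-chip standard deviation filter described above can be sketched as follows. This is an illustrative implementation only: the `chip_id` column name, the `n_sigma` threshold, and the pandas workflow are our assumptions, not the authors' code.

```python
import numpy as np
import pandas as pd

def flag_chip_outliers(df, oxide_cols, n_sigma=2.0):
    """Flag analyses that deviate from their chip mean by more than
    n_sigma within-chip standard deviations in any oxide column."""
    grouped = df.groupby("chip_id")[oxide_cols]
    deviation = (df[oxide_cols] - grouped.transform("mean")).abs()
    spread = grouped.transform("std").replace(0.0, np.inf)  # skip constant oxides
    return (deviation > n_sigma * spread).any(axis=1)
```

Flagged rows would then be inspected manually before removal, consistent with the manual review the authors describe.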

Handling Missing Data
Dealing with missing data is a common pre-processing step in machine learning problems, as features that are continuous variables (e.g., geochemical components) will not work if certain observations contain no information for that feature. Missing data occur within our data set in two ways. The first is in major element compositions. Our data set consists of more LA-ICP-MS analyses than EPMA analyses, ultimately resulting in a data set that is not 1:1 (i.e., for each LA-ICP-MS analysis there is not always a corresponding EPMA analysis).
We deal with this by assigning the chip-mean major-element composition (and its standard deviation) to every trace element analysis on that chip. We argue this is justified because the major element variance within a single tephra chip is mostly below 1.0% and often below 0.5% (see Section 2.2.1 above; Text S1 in Supporting Information S1).
The second form of missing data is in low abundance trace elements. Due to both the large compositional range between training volcanoes (i.e., andesite to high-silica rhyolite) and an analyte list (see Section 2.1.3 above) that spans a large range in compatibility, some analyses have analytes that are below detection limit. These can be considered "missing" data because an actual concentration cannot be accurately measured, even though the true value is nonzero. Both data transformation (see Section 2.2.3) and most machine learning algorithms necessitate a data set with no missing values for an observation. While there are many ways of dealing with missing data (e.g., imputation using the chip mean, median, or nearest neighbors), we currently omit these analyses from the model, as this limits the number of assumptions made, and the frequency of these observations is low enough that omission does not reduce the size of the data set significantly (∼7%) or remove entire eruptions from the data set.
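Both steps above (attaching chip-mean major elements to trace analyses, then omitting rows with below-detection analytes) can be sketched as below. The table layout, `chip_id` key, and column names are our illustrative assumptions, not the authors' data model.

```python
import numpy as np
import pandas as pd

def build_feature_table(epma, laicpms, major_cols, trace_cols):
    """Attach the chip-mean major-element composition to every trace
    element analysis, then omit rows with below-detection (NaN) analytes."""
    chip_means = epma.groupby("chip_id")[major_cols].mean()
    merged = laicpms.join(chip_means, on="chip_id")
    return merged.dropna(subset=major_cols + trace_cols)
```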

Transformation of Compositional Data
Geochemical data, as commonly reported, are an example of compositional data, where each measured component is a relative value, not an absolute one. Interpreting their values "as is" may bias data toward spurious negative correlations (Chayes, 1960) or force correlations between components of a data set (Miesch, 1969), ultimately making the distinction between petrologically significant correlations and mathematical artifacts difficult. Furthermore, many common statistical tests require data to be unconstrained (i.e., have values from −∞ to ∞ in Euclidean space), which is not the case with geochemical data, which are plagued by the "constant sum problem" (Aitchison, 1982, 1984, 1986; Chayes, 1960; Miesch, 1969; Rollinson, 1992) and closed between 0 and some constant (e.g., 1, 100%, 10⁶ ppm). A way to transform geochemical data into unconstrained real space, such that multivariate statistical tests can be applied, is to use log-ratio transforms (Aitchison, 1982, 1984, 1986). In our model we utilize the centered log-ratio transform (clr), which is defined as:

clr(x_i) = ln(x_i / g(x)) (Equation 1)

where x_i is the concentration of component i and g(x) is the geometric mean of the composition x (Pawlowsky-Glahn & Egozcue, 2006). This is easily applied to our data set using the pyrolite package in Python (Williams et al., 2020) after all components have been converted to the same units (e.g., SiO2 wt% → Si ppm) and oxygen has been added by difference such that all analyses total 10⁶ ppm. Commonly in machine learning models, to deal with data sets that have varying units or scales, data are re-scaled to have a mean of 0 and unit variance. We opt to forego this step in our data processing for two main reasons: (a) data are already transformed and the orders of magnitude differences between feature values are removed by the clr; (b) this scaling may result in information loss that would inhibit the ability of our model to accurately determine classes (Lipp et al., 2020).
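Because ln g(x) equals the mean of the log-transformed components, Equation 1 amounts to subtracting the row-wise mean of the logs. A minimal NumPy sketch, equivalent in spirit to the pyrolite clr the authors actually use:

```python
import numpy as np

def clr(x):
    """Centered log-ratio: clr(x)_i = ln(x_i / g(x)), with g(x) the
    geometric mean of the composition. Rows of the output sum to zero."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean(axis=-1, keepdims=True)
```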
For the purposes of training and testing our model, we build two data sets referred to hereafter as "training" and "test" data sets.As is common practice in machine learning experiments, they are built by taking one data set (i.e., our in situ data from proximal eruption samples) and randomly splitting it such that the training data set is used for the purposes of model building and cross-validation, while the test data set is left aside and only used for the purpose of assessing model performance.When splitting the data into training and test data sets, we keep the proportion of each class (i.e., volcano) the same as it was in the original data set.This helps ensure that the model "sees" every class during the training process and can assess the performance of each class in the testing process.
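With scikit-learn, such a class-proportion-preserving (stratified) split is one call. The synthetic data below merely stands in for the geochemical feature table; the class weights and sizes are arbitrary illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced three-class stand-in for the volcano-labeled analyses.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           weights=[0.6, 0.3, 0.1], random_state=0)

# stratify=y keeps each class's proportion the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
```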

Supervised Classification Algorithms
Rather than focus on the background theory for each algorithm, as this is outside the scope of this study and an area of active research on its own, we outline the various supervised machine learning algorithms used (Table 2) and the basic syntax for how they were implemented using the Python package scikit-learn (Pedregosa et al., 2011). The reader is referred to Hastie et al. (2009) and references therein for a more complete discussion of each algorithm. A list of common machine learning terms and their definitions used throughout the manuscript can be found in Table 3. The goal of our machine learning classifier model is to accurately predict the source of a tephra sample using learned relationships from our training data set (i.e., in situ geochemical data). This is only achievable when the model can properly generalize the training data distributions, rather than over- or underfitting them. "High variance" models tend to capture noise in the data and incorporate it into the overall learned data distribution. These models perform extremely well on training data sets but exceedingly poorly on new data, making them prone to (a) inaccurate classifications on unknown observations if they are not drawn from the exact data distributions and (b) extreme sensitivity to changes in training data and its associated noise. Conversely, low variance models are not complex enough to capture the overall data distribution and underfit the data. These models, although not as sensitive to training data, have large errors associated with "bias" (i.e., making inaccurate assumptions when the model is fit to training data).
Models prone to overfitting are high variance-low bias, while models prone to underfitting are high bias-low variance. This ultimately leads to a series of tradeoffs (e.g., optimization parameters, complexity of the chosen model, size of training data set) that need to be made such that the model is low enough bias to capture the overall data distribution but also low enough variance that it accurately predicts classes for new data. While each individual algorithm listed in Table 2 can be used on its own for the purpose of assigning various tephra units a source volcano, we approach the bias-variance trade-off by combining the results from each of them, as they are conceptually different machine learning algorithms (e.g., high bias-low variance → high variance-low bias). More specifically, the algorithms in Table 2 are combined via weighted soft voting, where the predicted class ŷ is the class whose weighted sum of predicted probabilities is largest:

ŷ = argmax_i Σ_j w_j p_ij, j = 1, …, n_classifiers

where n_classifiers is the number of classifiers listed in Table 2, w_j is the weight assigned to classifier j, and p_ij is the probability of class i predicted by classifier j; the weights give better performing models more influence in the overall ensemble vote. The overall idea is that by having algorithms that determine class boundaries using different metrics, the ensemble model should (a) better quantify data distributions for each class and (b) be less prone to incorrect predictions, as a "majority rules" approach is employed. Based largely on work originally inspired by the Condorcet Jury Theorem (Condorcet, 1785), which states that the judgment of a group is superior to that of any one individual provided the individuals have reasonable competence (P > 0.5 of being correct), this sort of voting has been proven to be an effective strategy for increasing ensemble classifier performance over any one algorithm (Tax et al., 2000).
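A minimal scikit-learn sketch of a weighted soft-voting ensemble of this kind. The base learners, weights, and synthetic data here are illustrative stand-ins; the authors' model combines the six algorithms listed in their Table 2.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the clr-transformed geochemical feature table.
X, y = make_classification(n_samples=400, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Conceptually different base learners; "soft" voting averages their
# predicted class probabilities, and weights give stronger models more say.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
        ("svm", make_pipeline(StandardScaler(),
                              SVC(probability=True, random_state=0))),
        ("lr", make_pipeline(StandardScaler(),
                             LogisticRegression(max_iter=1000))),
    ],
    voting="soft",
    weights=[2, 1, 1, 1],
)
ensemble.fit(X_tr, y_tr)
proba = ensemble.predict_proba(X_te)  # one probability per class per sample
```

The per-class probabilities returned by `predict_proba` are the same quantities later used to build the "winning probability" distributions described in the performance metrics section.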

Performance Metrics
Quantifying performance is a crucial part of the iterative model building process. Seeing not only how often a given model predicts the correct class, but also where and why it predicts the incorrect class, can help inform hyperparameter tuning as well as feature engineering, ultimately creating a more robust classifier model. In the most basic sense, if true positives, true negatives, false positives, and false negatives are denoted by TP, TN, FP, and FN, respectively, model accuracy is calculated by:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy scores will be between 0 and 1, with 1 being the situation where all predicted labels match true labels in the test data set. Here, accuracy scores are computed during the cross-validation process where, instead of splitting the data solely into training and test data sets, the training data set is further split into f smaller data sets called "folds" (Geisser, 1975; Stone, 1974). For each fold in the cross-validation process, the model is trained on all remaining folds and validated on the left-out one to produce an accuracy score. This produces an f-length vector of accuracy scores, where we compute the uncertainty on the accuracy score using the standard deviation of that vector. While somewhat informative, this only tells us the overall model performance, but does not provide any information on which classes the correct or incorrect predictions came from. Expanding on this, we calculate the precision (i.e., for each class, what proportion of positive classifications were correct):

Precision = TP / (TP + FP)

Similarly, we calculate the model recall (i.e., for each class, what proportion of actual positive observations were correctly identified):

Recall = TP / (TP + FN)

Combining precision and recall into one metric is commonly done by the F1 score:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Similar to accuracy scores, values closer to 1 in precision, recall, and F1 metrics denote higher model performance. In multiclass classification problems, it is common to use a confusion matrix to visualize TP, FP, FN, and TN classifications. This takes the form of an n_class × n_class matrix where TP classifications fall along the diagonal, FP classifications are column sums that exclude the diagonal, FN classifications are row sums excluding the diagonal, and TNs are everything that is not a TP, FP, or FN. As is the case with most real-world data sets, we have imbalanced data classes (i.e., some classes have many more observations than others). To account for this when determining an overall precision or recall score for a given model, we weight each class, k, by its number of observations, n_k:

Precision_weighted = (1/N) Σ_k n_k × Precision_k

where N is the total number of observations (and analogously for recall). The final metric we use to quantify model performance utilizes the class prediction probabilities for a given observation. For each sample in the test data set, the ensemble classifier assigns a non-zero probability to each class, where the highest-probability class becomes the predicted class. In making a distribution of these "winning" probabilities for all correct predictions of a given sample, we can see which volcanic sources have more confident (i.e., their winning probability distribution is centered around values closer to 1) or less confident (i.e., centered around values further from 1) predictions. Furthermore, this allows us to see which eruptions within the training data set have compositions that are more difficult to accurately classify than others.
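All of these metrics are available in scikit-learn; the toy labels below are illustrative, not model results, and the cross-validation example uses synthetic data in place of the study's feature table.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)
from sklearn.model_selection import cross_val_score

# Toy true/predicted labels for three classes.
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]

acc = accuracy_score(y_true, y_pred)      # fraction of correct predictions
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)  # class-size weighted
cm = confusion_matrix(y_true, y_pred)     # TPs fall along the diagonal

# Cross-validated accuracy: one score per fold; report mean +/- std.
X, y = make_classification(n_samples=200, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
```

Note that for multiclass problems the class-size-weighted recall reduces to overall accuracy, which is a useful sanity check when implementing the weighted averages by hand.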

Hyperparameter Tuning
Hyperparameter tuning can be thought of as a series of top-level decisions that determine how the model "learns," ultimately influencing the bias and variance of a given model. While some machine learning algorithms, such as k-nearest neighbors, have only a few hyperparameters (Buitinck et al., 2013), many algorithms have numerous hyperparameters to tune. The random forest classifier algorithm in scikit-learn, for example, currently has 18 hyperparameters available to customize. Despite this, it has been shown that tree-based algorithms found within high-level programming languages perform very well "out of the box" with their default (or minor adjustments to) Application Programming Interface (API) parameters (e.g., Jorgenson et al., 2022; Petrelli et al., 2020) and are robust to over-fitting (Breiman, 2001). Algorithms like the SVM classifier, however, require more significant hyperparameter tuning to perform well on both training and test data (e.g., Bolton et al., 2020; Petrelli et al., 2017).
There are two end-member approaches to hyperparameter tuning: tune everything and tune nothing. On one hand, tuning everything will most likely result in a well generalized model; however, it comes at the cost of computation and development time (e.g., Bischl et al., 2023; Yang & Shami, 2020). Tuning nothing will often result in an inadequate model but will be extremely fast to develop. We take an approach somewhere in the middle, whereby we use previous literature and the user guide within scikit-learn (https://scikit-learn.org/stable/user_guide.html) to intelligently choose which hyperparameters to tune. This is done using a parameter grid that incorporates cross-validation for each model listed in Table 2. Each possible option within the parameter grid receives an accuracy score through a five-fold cross-validation exercise on the training data, and the most optimal set of hyperparameters is chosen as that which produces the best cross-validated accuracy score. These parameters also produce a model that performs well on the test data that has been completely removed from the model building process, implying that they create a model with low enough variance to perform well on unseen data. Once the optimal set of parameters for each algorithm is found, they are then implemented as the parameters for the overall ensemble classifier model.
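A sketch of such a cross-validated grid search in scikit-learn. The SVC grid values below are illustrative placeholders, not the grid actually used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data set.
X, y = make_classification(n_samples=300, random_state=0)

# Small, literature-guided grid; every combination is scored by
# five-fold cross-validated accuracy on the training data.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
best = search.best_params_  # fed into the corresponding ensemble member
```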

Feature Engineering
Features in machine learning classification problems are the explanatory variables that describe a given target variable (i.e., the class the model is trying to predict). In this specific instance they are the geochemical components measured via EPMA or LA-ICP-MS, or any value that can be derived from them (e.g., ratios of components), after they have been transformed by the clr (Equation 1). Feature engineering, then, is the process whereby feature lists are built and tested to determine those that best help the model learn the overall data distribution. In this study we explore a variety of feature combinations, ranging from major elements only (Bolton et al., 2020), to manually selected major and trace elements (Petrelli et al., 2017), manually selected petrologically significant geochemical ratios (e.g., Table 4), and iterative feature scoring methods such as recursive feature elimination (RFE; Guyon et al., 2002).
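Recursive feature elimination repeatedly refits an estimator and discards the lowest-ranked features until the requested number remain. A scikit-learn sketch on synthetic data (the feature counts here are arbitrary illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# 10 informative features hidden among 30 total features.
X, y = make_classification(n_samples=300, n_features=30, n_informative=10,
                           random_state=0)
rfe = RFE(RandomForestClassifier(random_state=0), n_features_to_select=10)
rfe.fit(X, y)
kept = rfe.support_      # boolean mask over the 30 original features
ranking = rfe.ranking_   # rank 1 = retained; higher = eliminated earlier
```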

Alaska-Aleutian LTPE Compositions
We measured in situ major-element glass compositions (see Section 2) for tephra from 71 eruptions spanning 19 different volcanic sources (Figure 1; Text S1 in Supporting Information S1). Compositions range from andesite (e.g., Aniakchak, Okmok, Makushin) to trachyte (e.g., Gareloi) to rhyolite (e.g., Katmai) and overall show trends similar to those found in other arcs. We find that many sources overlap significantly in major element space; however, K2O-SiO2 space shows sufficient variance that individual volcanic sources/eruptions form their own clusters (Figure 2). We also find that some individual eruptions are homogeneous (e.g., Gareloi 1929, all Adagdak eruptions, Churchill White River Ash) with respect to their matrix glass chemistry, while other eruptions (i.e., Semisopochnoi CFE, Katmai 1912, Veniaminof 1 ka dacite, Okmok CFE2) have either a large continuous array of compositions or bimodal clusters. For example, Aniakchak, Augustine, Katmai, and Redoubt are all volcanic sources with more than three known proximal LTPEs whose compositions span a large range in major element space (Figure 2).
Since clean LA-ICP-MS analyses are not possible on microlite-rich tephra (i.e., 5 of the 71 eruptions), our trace element data set includes only 66 eruptions from 18 volcanic sources (Table 1; Text S1 in Supporting Information S1). Trace element compositions show more variance between eruptions and volcanic sources than nearly all major elements (excluding K2O; Figure 2). Furthermore, we find that many of the incompatible trace element ratios (i.e., Nb/U, Ba/Nb, La/Yb) show discrete compositional clusters in clr space (Figure 2, bottom right panel).

Feature Selection
Five different scenarios were tested when looking at the influence of different features on model performance:

1. Major elements: SiO2, TiO2, Al2O3, FeOT, MnO, MgO, CaO, Na2O, K2O
2. Trace element concentrations: Li, Ca, Sc, Ti, Mn, Zn, Rb, Sr, Y, Zr, Nb, Cs, Ba, La, Ce, Pr, Nd, Sm, Eu, Gd, Dy, Er, Yb, Hf, Ta, Pb, Th, U
3. Trace element ratios: Sr/Y, Zr/Y, Zr/Nb, Zr/Hf, Nb/U, Nb/Y, Rb/Sm, Ba/Th, Ba/La, Ba/Nb, La/Nb, Rb/Ba, La/Yb, Th/Yb, Th/La, Ce/Pb (Table 4)
4. Major elements + trace elements + trace element ratios
5. Elements chosen by recursive feature elimination (RFE)

To remain consistent between all tests, and to significantly reduce computational time during this iterative process, an "out of the box" random forest classifier from scikit-learn was used. Random forests have proven accurate for petrological and tephrochronology tasks with minimal tuning (e.g., Bolton et al., 2020; Jorgenson et al., 2022; Petrelli et al., 2020), are relatively robust to overfitting, and have multiple metrics for calculating the importance of each feature in the model (e.g., impurity or permutation based; Breiman, 2002), as well as metrics for class similarities (Breiman, 2001, 2002; Louppe, 2014). We find that, overall, using only major elements to classify Alaska-Aleutian tephras from LTPEs results in more false positive (FP) and false negative (FN) classifications than when trace elements are incorporated as features (Table 5, Figure 3). Furthermore, these FP and FN predictions occur between tephras with similar total alkalis-SiO2 (TAS) classifications (i.e., Aniakchak, Makushin, and Veniaminof; Redoubt, Katmai, Davidof, and Adagdak). This is likely due to the most important features utilized in the major element random forest model being Si and K (Figure 3a).

Transformed data have classes that are more tightly clustered than those in raw concentration space. Note, the linear arrays observed in clr space for major elements arise as an artifact of using chip averages for major element composition. More specifically, because the clr uses the geometric mean of an observation (i.e., an LA-ICP-MS analysis), although the major element composition for all analyses on a chip is the same, the trace element compositions vary, ultimately varying the geometric mean of that observation. This affects the logratio transformed value in such a way that it creates the observed arrays, whereas numerous observations plot on top of one another in untransformed concentration space. We stress that this is an expected result and not a problem for subsequent modeling.
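The out-of-the-box random forest baseline described in this section, together with its impurity- and permutation-based (mean decrease in accuracy) feature importances, can be sketched with scikit-learn roughly as follows. This uses synthetic stand-in data, not the tephra data set or the study's actual pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for clr-transformed glass chemistry (not real tephra data):
# 9 features mimic the nine major-element oxides, 4 classes mimic sources.
X, y = make_classification(n_samples=600, n_features=9, n_informative=5,
                           n_classes=4, n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# "Out of the box" random forest with default hyperparameters.
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Impurity-based importances come for free after fitting; permutation
# importances (mean decrease in accuracy) are computed on held-out data.
impurity = rf.feature_importances_
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(rf.score(X_te, y_te), impurity.argmax(), perm.importances_mean.argmax())
```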

Using RFE on an original feature list containing all major elements, trace elements, and trace element ratios, 32 features were selected as the optimal number such that performance metrics (e.g., overall accuracy) were maximized (Figure S3 in Supporting Information S1). Exploring the relationship between the number of RFE features and model performance, we find that the rate at which model performance increases with increasing RFE features drops off significantly after 6 RFE features (Figure S4 in Supporting Information S1). Doubling this value to 12 RFE features results in less than a 1% difference in precision, recall, and F1 score (Table 5). We use this to determine the final list of features in our voting classifier: Cs, Nb/U, Nb, Si, Ba, Th/La, Sr, Rb/Sm, K, La/Yb, Pb, Sr/Y. Using these features and the same random forest classifier, we find that there are significantly fewer false positives and false negatives (Figure 3b bottom) than when only major elements are used, showing that incompatible trace elements are the most important features in accurately classifying Alaska-Aleutian tephras from LTPEs (Figure 3b top). This final feature list helps ensure a high-performing model while also maintaining an analyte list containing elements that are relatively easy to measure accurately and precisely on intermediate to silicic tephras using the analytical conditions listed in Section 2 above.
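The RFE step, recursively discarding the least important feature until a target count remains (here 12, mirroring the compromise described above), can be sketched with scikit-learn as follows; the data are synthetic stand-ins, whereas in the study the columns would be the clr-transformed major elements, trace elements, and trace-element ratios.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data: 30 candidate features, 5 "volcanic source" classes.
X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           n_classes=5, n_clusters_per_class=1, random_state=1)

# Recursively drop the least important feature (step=1) until 12 remain.
selector = RFE(RandomForestClassifier(random_state=1),
               n_features_to_select=12, step=1).fit(X, y)

# Indices of the surviving features.
kept = [i for i, keep in enumerate(selector.support_) if keep]
print(len(kept))  # 12
```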

Hyperparameter Tuning
Hyperparameter tuning results are summarized in Table 6. Algorithms that create linear decision boundaries between classes, LDA and logistic regression, have default scikit-learn parameters that result in optimal or near-optimal accuracy, precision, and recall scores on test data (Figure S5 in Supporting Information S1). This, however, is not the case with the SVM algorithm, which creates nonlinear decision boundaries. It has default scikit-learn parameter values that perform significantly worse (e.g., default vs. tuned accuracy of 57.5% and 99.7%, respectively) on our test data (Figure S6 in Supporting Information S1). Tree-based models, although technically part of the nonlinear decision boundary family of algorithms, only increase slightly in accuracy, precision, and recall using non-default parameters (Figure S7 in Supporting Information S1).

Figure 3. True negatives are everything that is not a TP, FP, or FN. We compare models where features were major elements only (a) and those chosen using recursive feature elimination (b) and show that a model where only major elements are used as features leads to significantly more false positives and false negatives than when trace elements and their ratios are incorporated. Bar charts in both (a) and (b) are features ranked by their mean decrease in accuracy (i.e., the decrease in accuracy expected if the feature were removed from the model).
While iterating over hyperparameter grid space to find the optimal set of hyperparameters for each model is computationally expensive, with the cost determined by the size of the parameter grid, the computational cost of training the overall ensemble classifier is relatively inexpensive (<1 min on a Dell laptop with an 11th Gen Intel(R) Core(TM) i9-11950H @ 2.60 GHz and 64 GB of RAM), ultimately justifying the hyperparameter tuning process. For example, an overall performance increase of even a few percent could mean the difference between a correctly and an incorrectly classified volcanic eruption. Furthermore, this parameter grid search need not be repeated every time the model is re-trained in the eventuality that more data become available in the future. We find that only one algorithm has training times drastically affected by using optimized hyperparameters: gradient boosting. The best test results were achieved using an ensemble of 1,000 trees; however, for a cost of 0.1% accuracy we can achieve near-identical performance using 250 trees (Table 6; Figure S7 in Supporting Information S1) with a drastic decrease in training time (∼200%). This decrease in training time becomes extremely important when running tests like the one we conduct later in Section 4 that deals with assessing overall model variance.
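A grid search of the kind described above, shown here for the SVM case where tuning matters most, can be sketched as follows. The parameter grid is hypothetical and for illustration only; the study's actual grids are those in Table 6, and the data here are synthetic stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the tuned feature space.
X, y = make_classification(n_samples=400, n_features=12, n_informative=6,
                           n_classes=3, n_clusters_per_class=1, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2, stratify=y)

# Hypothetical grid over the SVM's regularization and kernel-width parameters.
grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": ["scale", 0.01, 0.1]}

# 5-fold cross-validated exhaustive search; probability=True so the tuned
# model can later supply class probabilities to a soft-voting ensemble.
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(probability=True, random_state=2)),
    grid, cv=5).fit(X_tr, y_tr)
print(search.best_params_, search.score(X_te, y_te))
```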

Class Similarities
We use random forest training observation proximities to quantify similarities between classes. Proximity here is defined as the number of times two samples within the training data set reach the same leaf within a decision tree, normalized to the number of trees in the forest (Breiman, 2002; Louppe, 2014), and provides a measure of sample similarity as viewed by the random forest algorithm itself. We sorted our training observations by both class (i.e., volcano) and location along the Alaska-Aleutian arc to test whether geospatially similar volcanoes have higher random forest proximities. Figure 4 shows that the random forest classifier from the feature selection process trained on our final feature list (i.e., Figure 3b bottom) has higher within-class proximities (i.e., more yellow to red colors within each box) than one trained on major element concentrations. This is especially true for volcanic sources that have bimodal or continuous arrays of major element compositions (e.g., Katmai, Aniakchak, Makushin). Where a model trained on major-element concentrations views individual eruptions from these sources as dissimilar (e.g., Katmai 1912 vs. Lethe vs. 23 ka), this is not the case when trained using features that include incompatible trace element ratios (i.e., our final feature list), and there is little to no ambiguity between classes. The exception is tephras from Hayes volcano, which has drastically different major and trace element compositions between its oldest LTPE and younger ones (Figure 2). Volcanic sources that have high out-of-class proximities with one another when trained on major elements are Davidof volcano-Mount Katmai-Ugashik-Peulik volcanic complex-Kaguyak Crater; Makushin Volcano-Mount Veniaminof; Aniakchak Crater-Mount Veniaminof; Hayes-Augustine; Katmai-Redoubt (Figure 4 top), suggesting that FP classifications between these sources would be a likely scenario when only considering major element concentrations as features.
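The proximity definition above, the fraction of trees in which two samples land in the same leaf, can be computed directly from a fitted scikit-learn forest via its `apply` method. A minimal sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training observations.
X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=3)
rf = RandomForestClassifier(n_estimators=200, random_state=3).fit(X, y)

# leaves[i, t] = index of the leaf that sample i reaches in tree t.
leaves = rf.apply(X)

# Proximity = fraction of trees in which two samples share a leaf.
prox = np.zeros((len(X), len(X)))
for t in range(leaves.shape[1]):
    prox += leaves[:, t, None] == leaves[None, :, t]
prox /= leaves.shape[1]

# Every sample is maximally proximate to itself.
print(prox.shape, float(prox[0, 0]))  # (200, 200) 1.0
```

Sorting the rows and columns of `prox` by class (and by arc position) then yields a heatmap analogous to Figure 4.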

Ensemble Voting Classifier
Our ensemble voting classifier model was constructed using the algorithms listed in Table 2, the hyperparameters listed in Table 6, and the features listed above in Section 3.2.1. While it is possible to tune hyperparameters directly within the voting classifier pipeline (Buitinck et al., 2013), we opt not to do this as it results in a drastic increase in model development time and does not allow as easy an inspection of the interplay between each hyperparameter value and model performance. We find that the ensemble voting classifier produces similarly accurate results when trained on either our final feature list or major element concentrations (Figure S8 in Supporting Information S1). This is to be expected, as many individual models trained on these sets of features also produce accurate (but not quite as accurate) results even when their algorithms are not tuned for optimal performance (e.g., Figure 3), such that when they are combined in a voting scheme the accuracy changes little (Tax et al., 2000). The real strengths of the ensemble voting classifier, however, are (a) the ability to combine predictions from numerous conceptually different algorithms into one well-rounded prediction; and (b) the ability to provide probabilistic information about each class for a given observation (i.e., for each observation, the classifier returns a non-zero probability for every class, with the highest probability then being the predicted class). This, in turn, allows us to quantify the "confidence" of an individual prediction or series of predictions from a class. More specifically, prediction probabilities for an observation (or series of observations) closer to 1 are higher confidence, while those closer to 0 are lower confidence, making the distinction between winning and non-winning classes more difficult. This utility is demonstrated in Figure 5 where we compare winning class probabilities for our ensemble voting classifier that is trained on major
elements versus when it is trained on our final RFE-chosen features. This leads to the observation that, while models trained on major elements and those trained on our final feature list produce near-identical predictions, the models trained on features that include trace element ratios have higher overall probabilities and lower-variance probability distributions (Figure 5; Figure S8 in Supporting Information S1).
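A soft-voting ensemble of this kind can be sketched with scikit-learn's `VotingClassifier` as follows. For brevity this sketch combines only three of the conceptually different algorithm families rather than the study's six, and runs on synthetic stand-in data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the final 12-feature space.
X, y = make_classification(n_samples=500, n_features=12, n_informative=6,
                           n_classes=4, n_clusters_per_class=1, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=4, stratify=y)

# Soft voting averages per-class probabilities from each estimator,
# so every prediction carries a full probability vector.
clf = VotingClassifier(
    estimators=[("lda", LinearDiscriminantAnalysis()),
                ("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(random_state=4))],
    voting="soft").fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)   # one probability per class per observation
winning = proba.max(axis=1)       # "confidence" of the winning class
print(proba.shape, float(winning.mean().round(2)) <= 1.0)
```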

Assessing Model Variance
As stated earlier, the goal of our model is to probabilistically link Alaska volcanic sources to regionally distributed unknown or source-ambiguous tephra layers. Having already established that our model performs well on our test and validation sets when training data are built from portions of all eruptions (i.e., has low bias), we explore our model variance by asking the following question: "How likely would we be able to predict the correct volcanic source for an eruption if the model was not trained on that eruption?" For example, could we predict that a tephra sample from the Katmai 1912 eruption was indeed from Mount Katmai if the model is only trained on two previous LTPEs produced by the volcano (the Lethe and 23 ka eruptions)? Or, more broadly, could we predict the correct volcanic source for an unknown tephra if its composition is slightly different from that of the training data from the same source? As volcanoes rarely erupt compositionally identical magmas throughout their entire history, this gives us a better idea of how reliant we are on certain samples within our training data set.

Figure 4. We see that within-class proximities (row-column pairs within white boxes) are much higher, while out-of-class proximities are smaller (fewer areas of lighter colors), when using trace-element ratios (i.e., Table 4) as features rather than major elements alone. This ultimately means tree-based models have an easier time predicting accurate classes for new observations using a model trained including trace element ratios than with only major elements.

Where only one eruption for a given source exists in our LTPE data set (Black Peak, Fisher, Gareloi, Kaguyak, Okmok, Semisopochnoi, Ugashik), that source is no longer a possible option for predictions during the variance test (i.e., it is removed entirely from the training set and treated as an unknown), and results should indicate the next most compositionally similar volcanic source. Furthermore, to test the importance
of certain features in predicting volcanic sources for unknown tephras, we run this test in two scenarios: using our ensemble voting classifier trained on major elements and using one trained on our final feature list. Similar to Figure 5 and Figure S8 in Supporting Information S1, we compare the probability distributions of the results from each model to assess each model's confidence in predicting the correct answer when it has had no prior training on that particular subset of data. By incorporating analytical uncertainty into the pipeline, we argue that we create a more realistic scenario for testing true "unknown" samples, especially those that have a low number of observations (e.g., eruptions from Redoubt, Adagdak, Kaguyak, etc.; Text S1 in Supporting Information S1). More specifically, an unknown sample with n ≫ 1 observations has a better chance of recording the true chemical variability within the eruption (i.e., our Katmai 1912 sample where n = 180). Conversely, when n is small, the chance that this chemical variability is at least partly missed increases greatly as n → 1. In these instances, incorporating analytical uncertainty may help to capture this potential "missing" variability.
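The variance test described above amounts to a leave-one-eruption-out loop: hold out every observation from one eruption, train on the rest, perturb the held-out observations by a nominal analytical uncertainty, and ask whether the median class probability picks the correct source. A minimal sketch using synthetic stand-in data (the group labels, classifier, and 0.05 uncertainty are illustrative assumptions, not the study's values):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(5)

# Synthetic stand-in: classes play the role of volcanic sources.
X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                           n_classes=3, n_clusters_per_class=1, random_state=5)
# Hypothetical "eruption" labels: up to three eruptions per source.
groups = y * 3 + rng.integers(0, 3, size=len(y))

correct = 0
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups):
    clf = RandomForestClassifier(random_state=5).fit(X[train_idx], y[train_idx])
    # Perturb the held-out eruption by a nominal analytical uncertainty.
    X_unknown = X[test_idx] + rng.normal(0, 0.05, X[test_idx].shape)
    # Median class probability across observations picks the winning source.
    median_proba = np.median(clf.predict_proba(X_unknown), axis=0)
    correct += clf.classes_[median_proba.argmax()] == y[test_idx][0]
print(int(correct), logo.get_n_splits(groups=groups))
```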
For eruptions from volcanic sources with more than one eruption in our data set (n = 59/66), the results of this variance test can be grouped into three categories: both major element and RFE trained models dominantly predict the correct source (n = 43/59); both major element and RFE trained models dominantly predict the incorrect source (n = 3/59); and the major element trained model predicts the incorrect source while the RFE trained model predicts the correct source (n = 13/59). We also find, similar to Figure 5, that even where both major element and RFE trained models predict the correct source volcano for an eruption, prediction probabilities are always higher for models trained on features that include trace element ratios (Figure 6). Furthermore, the prediction probability distributions frequently also have higher precision for the RFE feature trained model (Figure 7). Figure 6 also highlights an important observation: for volcanic sources in our data set that have more than one LTPE (n = 11/18), nine have at least one incorrectly classified eruption from a major element trained model in our variance test, ultimately suggesting that volcanic source predictions produced from models that use only major elements as features are less accurate at correctly identifying volcanic sources for unknown tephras than those trained on features that include trace element ratios. While they may perform well on test data sets that contain portions of data from the same samples as the training data set, this accuracy and precision is significantly decreased when the test data are from a data distribution that has not been seen by the model at all (i.e., an unknown distal tephra sample).
Figure 5. Comparing prediction probabilities for each observation (individual analysis) within a single class (source volcano). Y-values are probabilities generated by the voting classifier trained on major-element concentrations and x-values are probabilities generated by the voting classifier trained on our final feature list. The black dashed line is a 1:1 line and shows that the vast majority of prediction probabilities for all classes are higher when trace elements and their ratios are incorporated into the model. Color-symbol combinations denote four different scenarios: both models predicted the correct class (green circles), only the model trained on major elements predicted the correct class (orange squares), only the model trained on recursive feature elimination features predicted the correct class (blue diamonds), and neither model predicted the correct class (red triangles).
We find this is especially the case with many of the large tholeiitic volcanic centers in the Alaska-Aleutian arc that have produced more than one LTPE: Makushin, Veniaminof, and Aniakchak. More specifically, a model that uses solely major elements as features will produce confident (i.e., probability distributions near 1) yet incorrect predictions for an eruption from one of these centers (Figure 8). This illustrates a "worst case scenario" for a machine learning classification model, whereby although predictions may look very confident and accurate, they are simply confidently inaccurate. These volcanoes predominantly erupt magmas that overlap in major element space but not in trace element ratio space (Figure 2), which likely explains why our voting classifier (or any classifier algorithm) trained on major elements has a difficult time distinguishing between them, whereas our voting classifier trained on features that include trace element ratios does not (Figure 8).
Overall, the results of our variance test imply the following: (a) our ensemble voting classifier trained on major elements is significantly less accurate at predicting the correct volcanic source for unknown tephras than when it is trained on our final feature list (Figure 6); (b) caution should be applied when interpreting unknown tephra classifications when only major elements are used as features, as they are prone to extremely confident FP identifications (Figures 3a and 8); and (c) when volcanoes produce drastically different LTPE compositions through time (i.e., Hayes), it may be challenging to correctly identify them using any previous geochemical information from that volcano.

Application to Distal Tephras
The remainder of the paper is devoted to applying our voting classifier to distal unknown tephras found within both the Gulf of Alaska and the Bering Sea. Gulf of Alaska samples include tephra layers from IODP sites U1417 and U1418 and provide a record of volcanic activity in the area going back more than 10 Ma (Jaeger et al., 2014), helping to quantify the spatiotemporal distribution of LTPEs in the region. Here we focus on 44 tephra layers from sites U1417 and U1418 ranging in age from ∼0.100 to 800 ka (Table S4 and Text S1 in Supporting Information S1) that help to reveal the "modern" state of volcanic activity across the arc by documenting pre-Holocene volcanism in the region that has otherwise been wiped away by glacial scouring. Bering Sea tephras are taken from Derkachev et al. (2018) and include samples Br2 and SR2, initially studied to identify their source volcanoes. These two test applications, combined, help us assess the validity of applying our model to tephras found throughout much of the region (i.e., Bering Sea, mainland Alaska and Canada, and Gulf of Alaska).

Figure 6. Plot showing the results of the variance test for models trained using major-element concentrations as features (orange markers) and those trained on features chosen using recursive feature elimination (blue markers). Markers represent the median prediction probability of an eruption when it is treated as an unknown and has its source predicted by a model trained on all other data. Black lines connect the same eruptions so that their prediction probabilities may be easily compared between models. Red boxes indicate where the median prediction probability for the target class (volcano) was not the highest value (i.e., where the model prediction was incorrect). These incorrect model predictions predominantly occur in models trained solely on major elements.

Quantifying and Identifying Discrete Eruptions in IODP Cores
Prior to applying our classifier to distal tephras, we tested to ensure that compositions of the distal tephras fall within the multivariate feature space created by our model. Careful inspection of the different feature distributions in 2D space shows that IODP data generally overlap with the training data, suggesting that they are not novel or outliers relative to our proximal data. Furthermore, looking at certain trace element ratios (e.g., Sr/Y, Nb/U, La/Yb), we see that there are discrete compositional clusters where the number of clusters is less than the number of samples (Figure S9 in Supporting Information S1). This is likely the result of both the same eruption being sampled in numerous IODP cores and the same source producing numerous compositionally similar eruptions through time.

Figure 7. This ultimately shows that a model trained incorporating trace elements and their ratios can generate less ambiguous predictions than one trained on major elements. In these boxplots and others throughout the manuscript, boxes extend from the first to third quartile of the data and whiskers extend to 1.5× the interquartile range from the edge of the box. Points outside the whiskers are determined to be outliers for that distribution and plotted individually.
To go from number of samples to number of discrete eruptions, we look at a combination of age and composition distributions for our samples. In brief, if a given IODP sample from one core is very similar in both age and composition (i.e., part of the same compositional cluster) to a sample in a different core, they are assumed to record the same volcanic event. Sample age determinations can be found in Jaeger et al. (2014). Following this logic, we find that our samples record 20 eruptions and 10 compositional groups (Figure 9; Figure S9 in Supporting Information S1).
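The merging logic above can be sketched in a few lines of plain Python. The sample records, cluster labels, and age tolerance below are entirely hypothetical, chosen only to illustrate the rule "same compositional cluster + overlapping age = same eruption."

```python
# Hypothetical (sample, age in ka, compositional cluster) records; two cores
# recording the same event share a cluster label and agree in age.
samples = [
    ("U1417-A", 27.0, "EM-1"),
    ("U1418-A", 27.3, "EM-1"),    # same cluster, ~same age -> same eruption
    ("U1417-B", 120.0, "KAT-2"),
    ("U1418-B", 415.0, "KAT-2"),  # same cluster, different age -> new eruption
]

def count_eruptions(samples, age_tol=2.0):
    """Merge samples that share a compositional cluster and agree in age."""
    eruptions = []  # list of (age, cluster) representatives
    for _, age, cluster in samples:
        if not any(c == cluster and abs(a - age) <= age_tol
                   for a, c in eruptions):
            eruptions.append((age, cluster))
    return len(eruptions)

print(count_eruptions(samples))  # 3 discrete eruptions, 2 compositional groups
```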
Lastly, before we make interpretations about the most probable source volcanoes to distribute ash in the Gulf of Alaska over the last 800 ka, we highlight two important observations that help confirm the validity of our model. The first pertains to IODP samples that are 100 years old (Jaeger et al., 2014). These, according to our model, are highly likely to be from Katmai and, more specifically, a composition that is similar to the Novarupta 1912 high-silica rhyolite in our training data set rather than compositions that reflect the older Lethe and 23 ka eruptions (Figure S10A in Supporting Information S1). The extreme overlap in age and composition ultimately helps explain why our model predicts that Katmai has a probability of greater than 95% of being the source for all observations from these samples. The second pertains to samples dated to about 27 ka (Jaeger et al., 2014). Our model predicts these to have erupted from the Emmons Lake volcanic center ("Emmons Lake" herein). Mangan et al. (2003) show that the Dawson tephra from Emmons Lake erupted 27 ka, suggesting that our model is correctly identifying this layer as Emmons Lake and that it is, indeed, the Dawson tephra. Furthermore, similar to the 100 yr old samples mentioned above overlapping with Katmai, it overlaps closely with Emmons Lake training data in every feature (Figure S10B in Supporting Information S1), specifically the portion of our Emmons Lake training data that is from the Dawson tephra.
Application of our voting classifier to each eruption allows their most probable volcanic source to be identified. These predictions are summarized in Figures 10 and 11. From this, we find that the most common source of volcanic tephras predicted within our IODP tephra layers (i.e., 7/20 eruptions predicted by the model; Figure 10) is Katmai (Figure 11). Furthermore, we also find that there are multiple compositions (e.g., 1912 high-silica rhyolite-like, Lethe-like, etc.) of Katmai tephras over the last 800 ka (Figures 9 and 10; Figure S9 in Supporting Information S1). Overall, this illustrates that although Katmai has produced relatively frequent LTPEs over the last 800 ka compared to other Alaska volcanoes, its magma system is complex and capable of producing a diverse compositional suite of magmas, aligning with previous research (e.g., Hildreth & Fierstein, 2012).

Figure 10. The predicted class for each eruption has a red filled box and outlier markers. Volcano names are found on the x-axis for the bottom row, with all other x-axes using the established two-letter abbreviation for each volcano found in Table 1.
Not as commonly predicted within our samples as Mount Katmai, but still with relatively high overall cumulative probability, we identify three eruptions each belonging to Redoubt Volcano and Emmons Lake (Figures 10 and 11), the next most likely volcanoes to distribute ash in the Gulf of Alaska during the Quaternary. Perhaps as intriguing are the volcanic sources from known caldera complexes our model does not identify in tephra layers from Gulf of Alaska cores: Veniaminof, Fisher, and Aniakchak. Our model predicts that these sources have a very low probability of having distributed ash over the Gulf of Alaska in this sample suite (Figure 11). While it is possible that there may be eruptions from these volcanoes that deposited ash in this area, as not every tephra layer found within sites U1417 and U1418 was analyzed for this project (44/69 tephra layers from the last 800 ka), our results align with previous studies mapping the distribution of LTPEs from these volcanoes (e.g., Derkachev et al., 2018) in that they tend not to distribute significant amounts of ash in the Gulf of Alaska, but rather over the Bering Sea. Because our data set does not include the complete tephra record from these cores, we stress that further work is needed to definitively establish the most likely volcanoes in the region to distribute ash in the Gulf of Alaska during the Quaternary.
The probabilistic outputs of our model allow for a general assessment of prediction "quality." Within sites U1417 and U1418 this reveals a few model outcomes: high probability predictions that overlap with the training data for that prediction (Figure 12a); numerous lower probability predictions that overlap with more than one volcanic source in the training data (Figure 12b); and predictions that do not overlap significantly with the training data for that prediction (Figure 12c). While it is easy to dismiss an observation from the latter category as "bad," we argue that this should not be the case. Again, this exhibits the real strength of our model: prediction probabilities. In the case of Figure 12c, while the IODP eruption does not exactly overlap with Redoubt training data, Redoubt is the closest volcanic source in our training data set throughout model feature space, and prediction probabilities are assigned accordingly. This same logic should be applied to any prediction, no matter the fit with the training data: with any model prediction it is impossible to say with absolute certainty (i.e., P = 1) that a given tephra was erupted from a specific volcano (e.g., Figure 12c; Redoubt); however, based on all training data sampled from LTPEs throughout the Alaska-Aleutian arc, it is the most probable source.

Figure 12 (caption excerpt). (b) In this scenario both are lower probability than the left column scenario (i.e., P = 0.3 → 0.6). (c) Training data for the model prediction do not overlap IODP data significantly but are near the edge in many features. In this scenario predictions are high probability: even though they do not overlap significantly with the training data for the predicted class, they are very close in almost every feature dimension, whereas other classes may have good fits in some feature dimensions but petrologically improbable misfits in others, ultimately eliminating them as good predictions. This potentially represents a case where the true source may not be represented in the training data. In this case (e.g., column c) it may be more accurate to say that the unknown sample has a "Redoubt-like" composition.

Prediction of Bering Sea Samples
Finally, we apply our model to data from Derkachev et al. (2018) for two tephra layers. This is important for two reasons: (a) it expands our case study locations to the Bering Sea and not just the Gulf of Alaska; and (b) this application uses data not generated by this study, acquired under analytical conditions dissimilar to ours (e.g., different instrumentation and LA-ICP-MS data reduction methods), showing that our model is applicable to data generated by the tephra community at large. On the basis of incompatible trace element ratios, Derkachev et al. (2018) argue that they successfully linked tephras Br2 and SR2 (locations shown in Figure 1) to Aniakchak and Semisopochnoi volcanoes, respectively. These results have helped further our understanding of the timing and magnitudes of Aleutian volcanism, specifically the timing of the post-glacial caldera-forming event at Semisopochnoi volcano, which exists in our training data set as, to our knowledge, the sole LTPE from that volcano. It is also the first time tephra from the ∼3.6 ka Aniakchak LTPE has been documented in the Bering Sea, increasing its areal distribution considerably.
Data from Derkachev et al. (2018) come from their supplementary data and are transformed in the same fashion as our training data and IODP data (e.g., Equation 1; see Section 2). In total there are eight analyses for tephra Br2 and nine analyses for tephra SR2. Because of the low number of analyses, and to mitigate the influence of any one spurious analysis, we also take the average and standard deviation of each sample to generate 1,000 synthetic compositions based on these values. Sample-source probability distributions are shown in Figure 13 for both the individual observations and the synthetic data set. Median prediction probabilities from the synthetic data set are less precise and overall skewed toward lower values than those derived from the individual data set (likely due to the large variance in the original data set inherited by the synthetic one); however, their rankings (i.e., final predictions) are identical.
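Generating the synthetic compositions amounts to drawing from a normal distribution per feature using the sample mean and standard deviation. A minimal sketch; the mean and standard deviation values below are illustrative placeholders, not values from Derkachev et al. (2018).

```python
import numpy as np

rng = np.random.default_rng(6)

# Hypothetical per-feature mean and standard deviation for a tephra sample
# with only a handful of analyses (values are illustrative only).
mean = np.array([2.1, 0.8, -1.4, 0.3])
std = np.array([0.10, 0.05, 0.20, 0.08])

# Draw 1,000 synthetic compositions to propagate analytical scatter
# through the classifier instead of relying on a few raw analyses.
synthetic = rng.normal(loc=mean, scale=std, size=(1000, mean.size))
print(synthetic.shape)  # (1000, 4)
```

The resulting array is then fed through the same clr transform and voting classifier as any other sample, yielding a distribution of prediction probabilities per class.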
Ultimately, however, this illustrates an important observation in tephrochronology: source predictions generated from a low number of observations (i.e., cryptotephras) should be treated with extreme caution when using methods such as machine learning, as there may be numerous sources with similar likelihoods, and a sample size of n = 1 may overlap at the 2σ level with numerous sources (e.g., Figure 12b). In this instance we recommend two solutions: (a) similar to the advice of Bolton et al. (2020), careful inspection of the data should be done in addition to applying machine learning methods; and (b) synthetic data sets should be generated based on the analytical uncertainty of the measurements made on the samples (i.e., the approach taken in our variance analysis above). Like Figure 12, inspection of data in many different combinations of feature space can help validate model predictions (Figure 14). Combining the results of Figure 13 and tests like those shown in Figure 14, the predicted classes for Br2 and SR2 are Aniakchak and Semisopochnoi, respectively, aligning with the findings of Derkachev et al. (2018). Ultimately, these results help show that our model can produce accurate predictions when applied to data sets outside our own and be useful in disciplines that utilize tephrochronology for their own purposes (e.g., volcanology, climate science, archeology, paleoclimatology).

Limitations and Future Directions
All models have limitations, and our voting classifier is no different. Currently, only data from the eruptions listed in Table 1 are included within the training data set. Volcanoes hypothesized to have produced LTPE(s) but not included within the training data set are (Figure 1): Little Sitkin, Seguam, Yunaska, Herbert, Akutan, Roundtop, and Edgecumbe. Data for these volcanoes are not included because either (a) we do not possess a sample of the suspected LTPE from these sources or (b) the sample we do possess does not contain sufficient volcanic glass for LA-ICP-MS analyses (i.e., it is too microlitic or devitrified). Ultimately, this prevents any unknown eruption from these volcanoes from being identified by our model until their data are added to the training data set (an inherent limitation of supervised machine learning in general). However, we argue that their occurrence within IODP cores U1417 and U1418 is likely minimal: many of the predictions generated overlap extremely well with eruptions within our training data set (e.g., Figures 12a and 12b), and many of the volcanoes not currently incorporated lie much further west or east, such that their eruptions depositing ash at U1417 and U1418 should be less likely than eruptions on the Alaska Peninsula, as many of our predictions indicate. Further work using the same methodology as this study is required to add the "missing" volcano LTPEs to our data set.
Looking forward, the framework we have created need not be limited to the Alaska-Aleutian arc. Machine learning classification models have already been shown to be useful with Italian tephras (Petrelli et al., 2017) as well as in other locations within Alaska (Bolton et al., 2020), and we argue there is no reason this methodology cannot be applied to other arcs or groups of volcanoes (e.g., Kamchatka, western North America, the Andes). Furthermore, we envision a system whereby these types of models may be combined with other logic constraints (e.g., spatial filters) and have the potential to become a global tephra classification tool once FAIR (findable, accessible, interoperable, reusable) data practices (Wallace et al., 2022) are implemented.

Conclusions
We have created a model for accurately predicting the volcanic source of unknown regionally distributed tephras within Alaska and the surrounding region. This model uses a diverse range of machine learning classifier algorithms and an approach commonly referred to as "ensemble soft voting" to predict the most probable source volcano for a tephra based on the following geochemical components: Cs, Nb/U, Nb, Si, Ba, Th/La, Sr, Rb/Sm, K, La/Yb, Pb, Sr/Y. We have shown that by incorporating incompatible trace elements and their ratios into model feature space, we can achieve much more precise and confident source volcano predictions than if only major element concentrations are used. Furthermore, we stress that, while this model is specifically tuned for high performance on the Alaska-Aleutian arc, our approach has wide-ranging applications to other arcs as well. Although applying this approach to other arc systems may yield different optimal hyperparameters and features specific to the natural geochemical variance found within a data set, the methodology remains statistically robust and capable of detecting even subtle geochemical differences between volcanic sources such that they may be accurately identified. This justifies in situ trace element measurement methodologies (e.g., LA-ICP-MS) being a staple tool in the tephrochronology pipeline alongside EPMA. Finally, we stress that tephrochronology studies should transition away from thinking about tephra-source volcano relationships in absolutes and toward probabilistic estimates derived from many factors.

Acknowledgments
We would like to thank the field geologists who, over numerous decades, collected the vast majority of samples used in this project. We would also like to thank the open-source coding community for developing the tools and algorithms that made this work possible. We thank Felix Boschetty, Stephen Kuehn, and Lauren Harrison for their constructive reviews on earlier versions of the manuscript, as well as Paul Asimow for editorial handling. Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

Figure 1. Map of the Alaska-Aleutian arc. Green triangles indicate volcanoes with large tephra-producing eruptions (LTPEs) analyzed in this study. Red triangles indicate volcanoes hypothesized to have LTPEs not included in this study. Yellow circles represent the locations of International Ocean Discovery Program cores U1417 and U1418. Orange circles represent cores SO201-2 and Lv63-19-1 from Derkachev et al. (2018).
Training data: Portion of the overall data set used for the purposes of building and validating a machine learning model.
Test data: Portion of the overall data set completely omitted from the model building process and only used for the purposes of testing the machine learning model after it has been trained.
Fold: Individual subset of the training data created during the cross-validation process.
Cross-validation: The process whereby training data are split into a user-defined number of folds. These folds are iterated through, where one is treated as test data and the remaining folds are treated as training data. For each iteration the model is trained and evaluated via performance metrics. Performance is then the average across the individual iterations.
Bias: Metric used to describe the ability of a machine learning algorithm to accurately capture data distributions from training data. High = unable to capture data distributions well.
Variance: Metric used to describe the sensitivity of a machine learning algorithm's predictions to changes in training data. High = algorithm predictions sensitive to changes in training data.

Figure 2. Selected major and trace element relationships for tephras used in this study showing the comparison between concentration (left) and centered logratio transform (clr; right) values. Transformed data have classes that are more tightly clustered than those in raw concentration space. Note, the linear arrays observed in clr space for major elements arise as an artifact of using chip averages for major element composition. More specifically, because the clr uses the geometric mean of an observation (i.e., a LA-ICP-MS analysis), although the major element composition for all analyses on a chip is the same, the trace element compositions vary, ultimately varying the geometric mean of that observation. This affects the logratio-transformed value in such a way that it creates the observed arrays, whereas numerous observations plot on top of one another in untransformed concentration space. We stress that this is an expected result and not a problem for subsequent modeling.
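For reference, the centered logratio transform discussed above can be sketched as follows. This is a generic implementation of the standard clr definition, not the paper's code; the composition values are illustrative.

```python
import numpy as np

def clr(x):
    """Centered logratio transform of a 1-D compositional vector (all values > 0):
    clr(x) = ln(x / g(x)), where g(x) is the geometric mean of the composition."""
    x = np.asarray(x, dtype=float)
    gmean = np.exp(np.log(x).mean())  # geometric mean g(x)
    return np.log(x / gmean)

# Illustrative concentrations (e.g., mixed major and trace element values)
composition = np.array([55.0, 3.2, 0.8, 120.0])
transformed = clr(composition)
```

Because the geometric mean enters every component, changing any single trace element concentration shifts all clr values of that observation, which is exactly why chip-averaged major elements spread into linear arrays in clr space.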

Figure 3. Feature importance (top) and confusion matrix (bottom) for the "out of the box" random forest feature engineering exercise. Here, the confusion matrix takes the form of an 18 × 18 matrix (i.e., there are 18 potential volcanoes the model can possibly predict) where true positive (TP) classifications fall along the diagonal, false positive (FP) classifications are denoted by column sums, and false negative (FN) classifications are denoted by row sums. True negatives are everything that is not a TP, FP, or FN. We compare models where features were major elements only (a) and those chosen using recursive feature elimination (b) and show that a model in which only major elements are used as features leads to significantly more false positives and false negatives than when trace elements and their ratios are incorporated. Bar charts in both (a) and (b) show features ranked by their mean decrease in accuracy (i.e., the decrease in accuracy expected if the feature were removed from the model).
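The confusion-matrix bookkeeping described above (TP on the diagonal, FP from column sums, FN from row sums) can be sketched as follows. The 3 × 3 matrix here is an illustrative placeholder, not the paper's 18 × 18 results.

```python
import numpy as np

# Illustrative multi-class confusion matrix:
# rows = true class, columns = predicted class.
cm = np.array([[50, 2, 1],
               [3, 40, 4],
               [0, 5, 45]])

tp = np.diag(cm)              # true positives on the diagonal
fp = cm.sum(axis=0) - tp      # column sums minus the diagonal
fn = cm.sum(axis=1) - tp      # row sums minus the diagonal

# Per-class precision and recall follow directly.
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```

Averaging these per-class values (weighted by class support) gives the weighted precision and recall used in the model scoring metrics.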

Figure 4. Proximity matrices for random forest models that use major elements (top) and recursive feature elimination-chosen elements (bottom) as features. Individual volcanoes are sorted by location and outlined by a white box. Volcano abbreviations can be found in Table 1. We see that within-class proximities (row-column pairs within white boxes) are much higher, and out-of-class proximities are smaller (fewer areas of lighter colors), when using trace-element ratios (i.e., Table 4) as features rather than major elements alone. This ultimately means tree-based models have an easier time predicting accurate classes for new observations when trained including trace element ratios than with major elements only.
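Random forest proximities like those plotted here are conventionally defined as the fraction of trees in which two observations land in the same leaf. A minimal sketch with illustrative data (not the paper's tephra data) follows, using the scikit-learn `apply` method to retrieve leaf indices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data standing in for the tephra training set.
X, y = make_classification(n_samples=60, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Leaf index of every sample in every tree: shape (n_samples, n_trees).
leaves = rf.apply(X)

# Proximity of samples i and j = fraction of trees where they share a leaf.
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
```

Plotting `prox` as a heatmap, with rows and columns grouped by class, reproduces the kind of block structure shown in Figure 4: high within-class blocks indicate classes the forest separates cleanly.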

Figure 6. Plot showing the results of the variance test for models trained using major-element concentrations as features (orange markers) and features chosen using recursive feature elimination (blue markers). Markers represent the median prediction probability of an eruption when it is treated as an unknown and has its source predicted by a model trained on all other data. Black lines connect the same eruptions so that their prediction probabilities may be easily compared between models. Red boxes indicate where the median prediction probability for the target class (volcano) was not the highest value (i.e., where the model prediction was incorrect). We show here that these incorrect model predictions predominantly occur in models trained solely on major elements.

Figure 7. Probability distribution boxplots for three eruptions in our variance test illustrating that our voting classifier trained on the recursive feature elimination features yields higher-probability, as well as more precise, predictions. This shows that a model trained incorporating trace elements and their ratios can generate less ambiguous predictions than one trained on major elements. In these boxplots, and others throughout the manuscript, boxes extend from the first to third quartile of the data and whiskers extend to 1.5× the inter-quartile range from the edge of the box. Points outside the whiskers are determined to be outliers for that distribution and are plotted individually.

Figure 8. Prediction probabilities from our variance test shown as ternary diagrams for eruptions sourced from Aniakchak (top), Veniaminof (middle), and Makushin (bottom) volcanoes. For each ternary diagram, markers indicate individual predictions from each observation in the target class and demonstrate the ambiguity of classifications derived from the voting classifier trained using only major element concentrations (orange markers). Blue markers show prediction probabilities from the voting classifier trained using our final recursive feature elimination features. Overall, we find these prediction probabilities plot much closer to the target vertex than those calculated from the model trained using major element concentrations.

Figure 9. International Ocean Discovery Program samples analyzed in this study plotted as a function of depth and the site (U1417 or U1418)-core (A, B, C, D, E, F) combination from which they were collected. Samples are grouped according to their chemical population (Figure S9 in Supporting Information S1) to help illustrate how an individual eruption may be found in numerous cores and how the same chemical population may show up with large temporal gaps between occurrences (e.g., populations 1 and 2). Discrete eruptions are labeled on the right and show that, among the 44 samples analyzed, we have 20 discrete eruptions.

Figure 10. Class probabilities for each class in our model for each eruption (i.e., labels in Figure 9) found within International Ocean Discovery Program cores U1417 and U1418, represented as box and whisker plots. The approximate eruption age and number of tephra shards analyzed for each eruption are in the top corner of each plot. The predicted class for each eruption has a red filled box and outlier markers. Volcano names are found on the x-axis for the bottom row, with all other x-axes using the established two-letter abbreviation for each volcano found in Table 1.

Figure 11. Summary of all observations from the "unknown" International Ocean Discovery Program data set showing their cumulative probability of producing an eruption over the age range of the samples analyzed (∼750 ka). Annotated in bold on the map are the top three most likely sources (Katmai Volcano, Redoubt Volcano, and Emmons Lake), illustrating that from the late Quaternary to the present these volcanoes have had the highest probability of producing large tephra-producing eruptions that distribute ash over southcentral Alaska and the Gulf of Alaska. All other volcanoes in the training data set are labeled with their two-letter abbreviation (Table 1).

Figure 12. Bivariate plots for selected trace-element ratios showing model training data (gray circles), individual eruption data from International Ocean Discovery Program (IODP) cores (red circles), training data for the most probable source as predicted by the model (blue circles), and the second most probable source if one is predicted (yellow circles). Contour lines on model predictions encompass 67% and 95% of the training data for that class. Each column represents an individual model scenario encountered with the IODP data. (a) Training data for the model prediction significantly overlap the IODP sample in almost all features; these model predictions are often near P = 1. (b) IODP data fall between the training data for the class predicted by the model and those for the second-place prediction; in this scenario both are lower probability than in scenario (a) (i.e., P = 0.3 to 0.6). (c) Training data for the model prediction do not overlap the IODP data significantly but are near their edge in many features. In this scenario predictions are high probability because, even though they do not overlap significantly with the training data for the predicted class, they are very close in almost every feature dimension, whereas other classes may fit well in some feature dimensions but have petrologically improbable misfits in others, ultimately eliminating them as good predictions. This potentially represents a case where the true source is not represented in the training data; here (e.g., column (c)) it may be more accurate to say that the unknown sample has a "Redoubt-like" composition.

Figure 13. Boxplots of prediction probability distributions from our model when applied to individual observations from the Derkachev et al. (2018) supplementary data (orange) and to a synthetic data set of 1,000 normally distributed values generated from the sample mean and standard deviation. Top: sample Br2, showing that Aniakchak is the most probabilistically likely class given the sample data, although in this instance precision is low. Bottom: sample SR2, showing that Semisopochnoi is the most probabilistically likely class given the sample data. Similar to sample Br2, precision is low, ultimately warranting further inspection of the data in multivariate feature space to compare top-ranked predictions.

Figure 14. Bivariate plots for selected model features showing model training data (gray circles) and observed data points for Derkachev et al. (2018) samples SR2 (left column) and Br2 (right column). Individual measurements are indicated by red open circles; the sample mean and standard deviation by a red closed circle with error bars; training data for the model-predicted class by blue circles; and the model's second-place prediction by yellow circles. Contour lines on model predictions encompass 67% and 95% of the training data for that class. Here we see that while data for the top model predictions may be similar in some features (e.g., Ba/Nb), they differ in others (e.g., Nb/U, Th/La), and the observed data much more closely align with the top model prediction. This ultimately helps validate both our model and the findings of Derkachev et al. (2018).

Table 2. Supervised Machine-Learning Algorithms Used in This Study Along With Their Basic Syntax in the Scikit-Learn Application Programming Interface (API)

Observation: Single analysis in the data set (i.e., a row in a spreadsheet).
Sample: A group of observations (i.e., tephra from a single eruption as defined by a unique Alaska Tephra Lab (AT) number).
Feature variable: Explanatory variable (i.e., geochemical component) used to describe a target variable.
Target variable: Value a machine learning model is trying to predict (i.e., volcano) using feature variables.
Hyperparameter: Changeable algorithm value that, when changed, influences how a given machine learning algorithm makes predictions.
Class label: Individual target label(s) assigned to data prior to algorithm application (i.e., the correct answer; volcano names). A target variable is comprised of n > 1 class labels.

Table 3. List of Machine Learning Terms and Their Definitions as Used by This Study

The algorithms in Table 2 were chosen because two construct linear decision boundaries (Linear Discriminant Analysis (LDA) and logistic regression), two construct non-linear decision boundaries (Support Vector Machine (SVM) and k nearest neighbors), and two are different approaches to an ensemble of decision trees (random forest and gradient boosting trees). This combination creates what is commonly referred to as an "ensemble" model (see review by Kittler & Roli, 2000, and references therein) and employs a weighted voting scheme between each individual algorithm in the ensemble. Here, each algorithm in Table 2 returns a non-zero probability for each predicted class (k; volcano) for each observation in the test data set. Overall model probabilities are then calculated as:

P_k = Σ_i (w_i · P_ki) / Σ_i (w_i)

Table 4. List of Petrologically Significant Ratios Explored in the Feature Engineering Process Along With Their Discriminant Capabilities and References

where w_i is the weight for each classifier in Table 2 and P_ki is the class probability from each individual classifier. Individual model weights (w_i) are determined by their F1 score (see Section 2.4 below), which allows better-performing algorithms to contribute more to the final prediction:

F1 = 2 · (precision · recall) / (precision + recall)

In this instance, precision and recall are the weighted precision and recall.
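The weighted soft-voting scheme described above can be sketched with the scikit-learn API named in Table 2. The six estimator types follow Table 2, but the data, hyperparameters, and F1-derived weights below are illustrative placeholders rather than the paper's tuned values (those deviations from defaults are listed in Table 6).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier, VotingClassifier)
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the tephra training set
# (12 features, mirroring the 12 geochemical components used by the model).
X, y = make_classification(n_samples=300, n_features=12, n_informative=8,
                           n_classes=3, random_state=0)

estimators = [
    ("lda", LinearDiscriminantAnalysis()),
    ("logreg", LogisticRegression(max_iter=1000)),
    ("svm", SVC(probability=True)),  # probability=True enables predict_proba
    ("knn", KNeighborsClassifier()),
    ("rf", RandomForestClassifier(random_state=0)),
    ("gbt", GradientBoostingClassifier(random_state=0)),
]

# Per-classifier weights w_i: cross-validated weighted F1 scores,
# standing in for the F1-based weights described in the text.
weights = [cross_val_score(est, X, y, cv=5, scoring="f1_weighted").mean()
           for _, est in estimators]

# Soft voting returns the weighted average of the per-classifier
# class probabilities, i.e., P_k = sum_i(w_i * P_ki) / sum_i(w_i).
ensemble = VotingClassifier(estimators, voting="soft", weights=weights)
ensemble.fit(X, y)
proba = ensemble.predict_proba(X[:1])  # class probabilities for one observation
```

The highest-probability class is then the predicted source, and the full probability vector supports the kind of probabilistic interpretation advocated throughout the paper.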

Table 5. Each Set of Features Tested in the Feature Engineering Process and Their Various Scoring Metrics Listed in Section 2.4.1, Using an "Out of the Box" Random Forest Model From Scikit-Learn

Table 6. List of Individual Algorithms That Comprise the Voting Classifier and Their Deviations From Default scikit-learn Application Programming Interface Parameters