An Introduction to SGTPPR: Sparse Geochemical Tectono‐Magmatic Setting Probabilistic MembershiP DiscriminatoR

We present a new and easy‐to‐use geochemical tectono‐magmatic setting discriminator to calculate the probability of membership (the Sparse Geochemical Tectono‐magmatic setting Probabilistic membershiP discriminatoR, SGTPPR) that runs in Excel. It outputs the probability of membership for eight different tectono‐magmatic settings (mid‐ocean ridge, oceanic island, oceanic plateau, continental flood basalt province, intra‐oceanic arc, continental arc, island arc, and back‐arc basin) for a given volcanic rock sample based on major and selected trace element contents (SiO2, TiO2, Al2O3, Fe2O3, MgO, CaO, K2O, Na2O, Rb, Sr, Y, Zr, Nb, and Ba). We consider all possible ratios and multiplications of these contents, in addition to the contents themselves, which improves the discrimination accuracy. We use a statistical method called sparse multinomial logistic regression to construct a robust and predictive discrimination model. By imposing the sparsity, only a small number of essential variables are included in the model. The variables are objectively extracted from 287 possible geochemical variables, including all possible ratios and multiplications of the major and trace element contents. The constructed model exhibits a high classification ability, indicating that tectonic discrimination using major and selected trace elements yields a high classification ability when ratios and multiplications are considered. The system outputs the relative weights of the variables (i.e., contents, and ratios and multiplications of contents) of the input geochemical data to the calculated membership probabilities. This information can be used to evaluate and interpret the results. We apply the model to multiple samples of a geological unit, to determine the tectonic setting.


Introduction
Geochemical discrimination between magmas formed in different tectono-magmatic settings using the whole-rock geochemistry of volcanic rocks is an important field of research in igneous geochemistry (e.g., Pearce & Cann, 1973; Pearce & Norry, 1979). This approach allows us to identify the geochemical characteristics of volcanic rock samples, as well as to compare the geochemistry of different samples and identify their tectonic setting.
Discrimination of magmas formed in various tectono-magmatic settings requires the analysis of high-dimensional and large geochemical data sets (e.g., Li et al., 2015). Recent advances in data analysis techniques, including statistical and machine-learning methods, have enabled the analysis of large multi-dimensional geochemical data sets and the objective construction of geochemical discriminators between magmas formed in different tectono-magmatic settings with a high degree of accuracy and predictive ability (Petrelli & Perugini, 2016; Takaew et al., 2024; Ueki et al., 2018, 2022). Although the latest models (Petrelli & Perugini, 2016; Ueki et al., 2018, 2022) have a high classification ability (∼90% accuracy), they are limited in their use because they require data for 8 major elements, 16 trace elements, and 5 isotope ratios. Using statistical and machine learning-based approaches, a classification ability as high as 70%-80% can be achieved when only major elements and a limited number of trace elements are considered (Petrelli & Perugini, 2016; Ren et al., 2019; Zhong et al., 2021). Ueki et al. (2022) showed that an even higher classification ability could be expected by considering all possible combinations (i.e., ratios and multiplications) of major and trace element contents. These results suggest that, with appropriate statistical models and the ratios and multiplications of elements, a high classification ability can be achieved using only selected elements and without isotope ratios as input variables.
In this study, we trained a statistical model that uses the contents of eight major elements and six trace elements as input data and outputs the probability of membership for eight tectono-magmatic settings: mid-ocean ridge (MOR), oceanic island (OI), oceanic plateau (OP), continental flood basalt (CFB), continental arc (CA), island arc (IA), intra-oceanic arc (IOA), and back-arc basin (BAB). We considered eight major elements (SiO2, TiO2, Al2O3, Fe2O3, CaO, MgO, Na2O, and K2O; in wt %) and six trace elements (Rb, Sr, Y, Zr, Nb, and Ba; in ppm), and all possible ratios and multiplications of these major and trace elements. These major and trace elements can be commonly analyzed using conventional analytical techniques such as X-ray fluorescence (XRF) spectrometry, meaning that the discrimination model should be applicable to many data sets. We show that the statistical model, using the 14 selected elements, has a high classification ability. In addition, our system provides the evidence to support its results (i.e., the relative weights of the independent variables of the input data [element contents, and their ratios and multiplications] on the output probabilities), which can be used to evaluate and interpret the output results.
We used a sparse modeling approach to identify a small number of essential signals from a large number of observations, and to obtain a model with high generalization capability (e.g., Kuwatani et al., 2014; Tibshirani, 1996; Ueki et al., 2020). Due to the sparsity and linearity, the discrimination model constructed with our approach is highly interpretable (Ueki et al., 2018, 2022). In addition, due to the linearity of the method, the constructed discrimination model does not require optimization calculations to output the probabilities, can be implemented with minimal computational cost, and works on various platforms. We provide an easy-to-use Excel-based spreadsheet to use the trained discriminator.

Model Tuning
This study used geochemical data for volcanic rocks from known tectono-magmatic settings to train the discrimination model. We followed the approach of Li et al. (2015) in terms of the definition of the tectono-magmatic settings. We considered MOR, OI, OP, CFB, CA, IA, IOA, and BAB. We considered six trace elements (Rb, Sr, Y, Zr, Nb, and Ba) along with eight major elements (SiO2, TiO2, Al2O3, Fe2O3, CaO, MgO, Na2O, and K2O) in the modeling. We selected frequently analyzed trace elements by referring to previous data compilations (Haraguchi et al., 2018; Ueki et al., 2018). For example, in the database of Haraguchi et al. (2018), 92.4% of the samples have complete data for the major elements, and 50.4% (i.e., 2,933 of 5,818 samples) have complete data for the 14 major and trace elements considered in this study. In contrast, only 0.7% of the samples (43 of 5,818 total samples) have complete data for the 14 major and trace elements plus rare earth elements and isotope ratios (e.g., Sr and/or Nd).
Following Ueki et al. (2022), we considered all possible combinations of contents (i.e., ratios and multiplications) of 14 major and trace elements, which means ${}_{14}C_{2} = 91$ combinations for the multiplications and ${}_{14}P_{2} = 182$ permutations for the ratios. Eight intercept terms of the regression ($w_0^{(k)}$ in Equation 1) were also included. Consequently, the maximum number of regression coefficients ($w_p^{(k)}$) is 295 (8 major elements, 6 trace elements, ${}_{14}C_{2}$ multiplications, ${}_{14}P_{2}$ ratios, and 8 intercept terms). The training data set is from Ueki et al. (2022), which included volcanic rock geochemical data compiled from the global geochemical databases PetDB (http://search.earthchem.org/) and GEOROC (https://georoc.eu/). Only Quaternary samples were used for arc settings (CA, IA, and IOA). In terms of selecting sample localities for the data compilation, well-defined and well-described localities were selected following Li et al. (2015). As a result, samples from complex tectonic settings were excluded. We also removed altered samples and noticeable outliers. The training data set includes mafic to silicic compositions (∼78 SiO2 wt %) without filtering; consequently, the trained model can be used for volcanic rocks of all compositional ranges. See Ueki et al. (2018, 2022) for details of the data compilation, basic statistics, and scatter plots of major and trace elements of the compiled data, and a sample location map. A total of 2,063 samples were included in the data set. The data set is available in the Supporting Information of Ueki et al. (2022) and https://doi.org/10.5281/zenodo.10520676 (Hino, 2024).
Geochemistry, Geophysics, Geosystems 10.1029/2023GC011237 UEKI ET AL.
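The variable count stated above (14 contents, 91 multiplications, 182 ratios, 287 in total, and 295 with the eight intercepts) can be checked with a short enumeration. The following Python sketch is illustrative only; the variable-name format (e.g., "Rb/K2O") and the function name are assumptions and are not taken from the published R code or spreadsheet.

```python
from itertools import combinations, permutations

# The 14 input elements named in the text (8 major oxides, 6 trace elements).
ELEMENTS = ["SiO2", "TiO2", "Al2O3", "Fe2O3", "CaO", "MgO", "Na2O", "K2O",
            "Rb", "Sr", "Y", "Zr", "Nb", "Ba"]

def build_variable_names(elements):
    """Return names for single contents, unordered products, and ordered ratios.

    The naming scheme is hypothetical and chosen only for readability.
    """
    singles = list(elements)
    # A*B equals B*A, so multiplications are unordered pairs: 14C2 = 91.
    products = [f"{a}*{b}" for a, b in combinations(elements, 2)]
    # A/B differs from B/A, so ratios are ordered pairs: 14P2 = 182.
    ratios = [f"{a}/{b}" for a, b in permutations(elements, 2)]
    return singles + products + ratios

names = build_variable_names(ELEMENTS)
n_settings = 8  # one intercept term per tectono-magmatic setting

print(len(names))               # 287 candidate variables
print(len(names) + n_settings)  # 295 including intercepts
```

Treating multiplications as unordered pairs and ratios as ordered pairs is what gives the different counts for the two families of variables.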
Following Ueki et al. (2018, 2022), we used a popular statistical method called sparse multinomial logistic regression (SMR) to construct the discrimination model. Multinomial regression is a linear method that uses a weighted linear combination of the input data (i.e., geochemical composition) to model the probability that the observed data belong to a particular class (Nelder & Wedderburn, 1972). Sparse multinomial logistic regression is a sparse version of the multinomial regression (Krishnapuram et al., 2005). Given that we consider $C$ tectonic settings ($C = 8$) and $P$ geochemical variables ($P = 287$), the weighted linear combination of the input vector $\mathbf{x}$ using a set of projection vectors $\mathbf{w}^{(k)}$ is as follows:

$$\mathbf{w}^{(k)\mathsf{T}}\mathbf{x} = w_0^{(k)} + \sum_{p=1}^{P} w_p^{(k)} x_p, \tag{1}$$

where $\mathbf{x}$ denotes the compositional vector that consists of major and trace element contents, ratios, and multiplications of a sample, and $w_0^{(k)}$ denotes the intercept terms of the regression. The probability (Pr) that the input sample belongs to the $k$th class is defined as follows:

$$\Pr\left(c(\mathbf{x}) = k\right) = \frac{\exp\left(\mathbf{w}^{(k)\mathsf{T}}\mathbf{x}\right)}{\sum_{l=1}^{C} \exp\left(\mathbf{w}^{(l)\mathsf{T}}\mathbf{x}\right)}, \tag{2}$$

where $c(\mathbf{x})$ denotes the class of $\mathbf{x}$.
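As a minimal illustration of how a multinomial logistic model turns weighted linear combinations into membership probabilities (Equations 1 and 2), the following Python sketch computes class scores and normalizes their exponentials. The three-class, two-variable weights are hypothetical toy values, not the published coefficients, and the function names are assumptions.

```python
import math

def linear_scores(weights, intercepts, x):
    """Equation 1: one score per class, w0 + sum over p of w_p * x_p."""
    return [b + sum(w_p * x_p for w_p, x_p in zip(w, x))
            for w, b in zip(weights, intercepts)]

def membership_probabilities(scores):
    """Equation 2: exponentials of the class scores, normalized to sum to 1."""
    m = max(scores)  # subtracting the maximum avoids overflow, result unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy model: 3 classes and 2 variables (illustration only).
weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
intercepts = [0.0, 0.0, 0.0]
x = [2.0, 1.0]

scores = linear_scores(weights, intercepts, x)  # [2.0, 1.0, 0.0]
probs = membership_probabilities(scores)
print([round(p, 3) for p in probs])             # [0.665, 0.245, 0.09]
```

Because the calculation involves only sums, products, and exponentials, no optimization is needed at prediction time, which is why such a model can be evaluated in a spreadsheet.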
The SMR approach yields sparse solutions in which most of the recovered regression coefficients ($w_p^{(k)}$) are exactly zero, meaning that only a small number of important variables are included in the model. As such, SMR is a powerful approach to a wide variety of geochemical and geophysical problems, enabling multi-class classification and the extraction of a small number of key geochemical and geophysical features (e.g., Nakao et al., 2022; Ueki et al., 2018, 2022).
Because the details and statistical background of the method are given in Ueki et al. (2018, 2022), the procedure for the model fitting is only briefly described here. First, each input variable (contents, ratios, and multiplications) was normalized to yield a zero mean and unit variance value, and transformed to be close to a normal distribution using a Box-Cox transform (Box & Cox, 1964) before applying the SMR. The number of variables included in the model (i.e., sparsity) was then determined. Ueki et al. (2022) used a sparse modeling approach to identify a small number of fundamental geochemical features from a large number of variables; classification ability was not the focus of that study. In this study, the model was trained based on a sparse modeling approach to achieve a high classification ability. A 10-fold cross-validation (CV) was used to determine the optimal sparsity that achieves the highest predictive ability. In the 10-fold CV, the entire data set is divided into 10 non-overlapping subsets. One of these subsets is retained to test the classification accuracy of the model under the given sparsity, which is trained using the remaining nine subsets. This procedure was repeated 10 times, so that each subset was used once for testing, and the average of these 10 estimates is reported as the predictive ability of the given sparsity. Finally, the set of regression coefficients determined under the sparsity with the highest predictive ability was adopted as the final model.
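The 10-fold CV procedure described above can be sketched in a few lines of Python. This is not the authors' R implementation; `fit_and_score` and `make_scorer` are hypothetical user-supplied callables standing in for training an SMR model at a fixed sparsity and returning its accuracy on the held-out subset.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Split sample indices into k non-overlapping, near-equal subsets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validated_accuracy(n_samples, fit_and_score, k=10, seed=0):
    """Average the held-out accuracy over k train/test splits.

    fit_and_score(train_idx, test_idx) is a hypothetical callable that trains
    a model on train_idx and returns its accuracy on test_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on the other k - 1 subsets, test on the retained one.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / k

def select_sparsity(candidates, n_samples, make_scorer, k=10):
    """Return the candidate sparsity setting with the highest mean CV accuracy.

    make_scorer(c) is a hypothetical factory returning a fit_and_score
    callable for sparsity setting c.
    """
    return max(candidates,
               key=lambda c: cross_validated_accuracy(n_samples, make_scorer(c), k))

folds = k_fold_indices(2063, k=10)
print(sorted(len(f) for f in folds))  # seven subsets of 206 and three of 207
```

With the 2,063 training samples, each of the 10 subsets holds 206 or 207 samples, and every sample is used exactly once for testing.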
Model training was conducted using R, which is an open-source programming language for statistical computing (R Core Team, 2022). The program source code for the model tuning based on 10-fold CV is provided in the Supporting Information of Ueki et al. (2018). The program source code to consider the combinations of major and trace elements is provided in the Supporting Information of Ueki et al. (2022) and https://doi.org/10.5281/zenodo.10520676 (Hino, 2024). Note that the methodology proposed in our series of studies is general: the numbers of categories and input variables are flexible, so the method can be applied to various geochemical data.

Results: Trained Geochemical Discrimination Model
The variables included in the final model and the regression coefficients are shown in Figure 1 and Table S1. As a result of the sparse modeling approach, sparse solutions that yield a high predictability were obtained: 134 variables from the 287 possible variables ($8 + 6 + {}_{14}C_{2} + {}_{14}P_{2}$, excluding intercepts) were included in the model. All eight major and six trace elements were included in some form. Three independent major element contents, SiO2, Al2O3, and Fe2O3, were involved in the model, whereas independent trace element contents were not. Eighty-nine ratios and 42 multiplications were included. Between 16 (OP) and 52 (CA) coefficients are required to discriminate among the different tectono-magmatic settings.
The absolute values of the regression coefficients (Figure 1; Table S1) reflect the importance of each variable in discriminating a particular tectono-magmatic setting from the other settings. A positive coefficient indicates that the setting is characterized by a relatively higher value of the variable, and a negative coefficient indicates a relatively lower value, as compared with the other settings. See Ueki et al. (2018, 2022) for a detailed geochemical discussion of the features extracted using the SMR method and for the geochemical interpretation of the output from the discriminator.
The classification accuracy of the trained model during the 10-fold CV is presented in a confusion matrix form (Figure 2). Columns represent instances in a predicted class and rows represent instances in an actual class; therefore, the diagonal cells of the confusion matrix represent the proportion of data that were correctly assigned. The trained model, which uses only eight major and six selected trace elements as input data, exhibited high classification scores comparable with those of previous studies (Petrelli & Perugini, 2016; Ueki et al., 2018); classification scores of 71%-97% were obtained for all tectono-magmatic settings, except for BAB (42%). This high classification score indicates that ratios and multiplications of major and trace element contents are useful in discriminating different tectono-magmatic settings.
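For readers unfamiliar with the layout, a row-normalized confusion matrix of the kind described above (rows as actual classes, columns as predicted classes) can be built as follows. This Python sketch uses a small hypothetical set of labels and predictions for illustration; it does not reproduce the values in Figure 2.

```python
def confusion_matrix(actual, predicted, labels):
    """Count samples by (actual class = row, predicted class = column)."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        counts[index[a]][index[p]] += 1
    return counts

def row_normalize(counts):
    """Convert each row to proportions so the diagonal gives per-class accuracy."""
    rates = []
    for row in counts:
        total = sum(row)
        rates.append([c / total if total else 0.0 for c in row])
    return rates

# Hypothetical toy data (not the cross-validation results of this study).
labels = ["MOR", "BAB", "IOA"]
actual = ["MOR", "MOR", "BAB", "BAB", "BAB", "IOA", "IOA", "IOA", "BAB", "MOR"]
predicted = ["MOR", "MOR", "BAB", "MOR", "IOA", "IOA", "IOA", "IOA", "BAB", "MOR"]

counts = confusion_matrix(actual, predicted, labels)
rates = row_normalize(counts)
for label, row in zip(labels, rates):
    print(label, [round(r, 2) for r in row])
```

In this toy case the BAB row reads 0.25, 0.5, 0.25, meaning that half of the actual BAB samples were correctly assigned and the rest were split between MOR and IOA.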
BAB had the lowest classification score of all the tectono-magmatic settings. The low classification score of BAB is consistent with previous studies using different statistical methods and input variables (Nakamura, 2023; Petrelli & Perugini, 2016; Ueki et al., 2018, 2022). A significant number of BAB samples were misclassified as IOA and MOR by the different approaches. Therefore, this low classification score for BAB may be related to the tectonic complexity of BAB magmatism rather than a limitation of the method.

SGTPPR: Excel-Based Geochemical Tectono-Magmatic Setting Discriminator
We constructed a spreadsheet that runs on Excel to implement the trained geochemical discrimination model on unknown samples. The spreadsheet automatically calculates the probability of membership for eight tectono-magmatic settings of a volcanic rock sample based on the input geochemical data. The spreadsheet is provided as Supporting Information S2 and is available at https://doi.org/10.5281/zenodo.8323122 (Ueki et al., 2023). We describe the details of the spreadsheet in this section.
Figure 3 shows a screenshot of the sheet for data input and output of the result. By inputting the geochemical composition of the sample on the "input_and_result" sheet, the probability of membership of each tectono-magmatic setting is immediately calculated and displayed both as numbers and as bar and pie plots. Actual calculations are conducted in the "calculations" sheet.
The output from the constructed discrimination model comprises the probability of membership for the eight tectono-magmatic settings (CA, IOA, IA, CFB, OP, OI, BAB, and MOR). Calculated probabilities sum to 100%. Bar plots showing the loadings of variables (contents, ratios, and multiplications) on the final calculated probabilities can be accessed from the "plots" sheet (Figure 4). Each panel gives the loadings of individual variables of the input data on the calculated probability of a given tectono-magmatic setting. The "wx" values in the "calculations" sheet ($w_p^{(k)} x_p$ in Equation 1) are plotted. As shown by Equation 2, the exponential of the sum of the plotted "wx" values of each tectono-magmatic setting, normalized to 100%, is the final probability displayed in the "input_and_result" sheet ($\Pr(c(\mathbf{x}) = k)$ in Equation 2). This means that a positive value indicates that the variable contributes to increasing the probability of membership in a given setting, and a negative value indicates that the variable contributes to decreasing the probability. The plotted loadings show which variables (contents, ratios, and multiplications) of the sample used in the analysis contribute to the calculated probabilities, allowing the user to evaluate and geochemically interpret the results.
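The idea of per-variable loadings can be illustrated with a short Python sketch that multiplies each nonzero coefficient by the corresponding transformed input value and ranks the products by magnitude. The coefficients and input values below are hypothetical and chosen only to show the sign logic; they are not taken from Table S1 or the spreadsheet.

```python
def loadings(weights, x):
    """Per-variable contributions w_p * x_p for one setting.

    Variables with a zero coefficient (excluded by sparsity) are skipped.
    """
    return {name: w * x[name] for name, w in weights.items() if w != 0.0}

def rank_loadings(contrib):
    """Sort contributions from largest to smallest absolute value."""
    return sorted(contrib.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical coefficients for one setting and hypothetical transformed inputs.
weights = {"Al2O3": 0.8, "Rb/K2O": 0.5, "Y/Ba": -0.6, "TiO2": 0.0}
x = {"Al2O3": 1.2, "Rb/K2O": 0.4, "Y/Ba": -1.0, "TiO2": 2.0}

contrib = loadings(weights, x)
for name, value in rank_loadings(contrib):
    print(f"{name:8s} {value:+.2f}")
```

Note that a negative coefficient combined with a below-average (negative) transformed value gives a positive loading, as for Y/Ba in this toy case, so a low value of a variable can raise the probability of a setting.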
Instructions for using the model are described below. A sample that deviates significantly from the compositional range of the training data (see Figure 3 of Ueki et al., 2018) should be excluded. In addition, due to the internal data processing (e.g., ratios and multiplications, normalization, and Box-Cox transformation), even if a sample is within the compositional range of the training data, calculations may not be possible for extremely low or high contents. Calculations using data containing missing or zero values are not possible. Note that the output from the system is probabilistic rather than deterministic; consequently, the user must refer to probability values when interpreting the output. For example, a higher probability for a given setting means that the system considered the sample to have a stronger geochemical signature of that setting. In some cases, a similar probability may be calculated for each of several settings rather than a prominent probability for one setting. There are two statistical possibilities in this case. The first possibility is that the system was unable to identify which of the tectono-magmatic settings the sample belongs to. Second, the sample could belong to a different setting from the assumed eight tectono-magmatic settings. Misclassification rates (Figure 2) should also be referred to when discussing the output from the classification model. Independent geological observations should be used when interpreting and discussing the model output.
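Because missing or zero values prevent the calculation, it can be convenient to screen analyses before entering them. The following Python sketch is a hypothetical pre-check, not part of the spreadsheet; the function name is an assumption, and rejecting negative values is an additional assumption beyond what the text states.

```python
import math

# The 14 required inputs named in the text.
REQUIRED = ["SiO2", "TiO2", "Al2O3", "Fe2O3", "CaO", "MgO", "Na2O", "K2O",
            "Rb", "Sr", "Y", "Zr", "Nb", "Ba"]

def find_input_problems(sample):
    """Return a list of problems; an empty list means all 14 values are usable."""
    problems = []
    for element in REQUIRED:
        value = sample.get(element)
        if value is None or (isinstance(value, float) and math.isnan(value)):
            problems.append(f"{element}: missing")
        elif value <= 0:
            # Zero breaks ratios; negative values are rejected here by assumption.
            problems.append(f"{element}: zero or negative")
    return problems

# Hypothetical placeholder analysis with every value set to 1.0.
complete = dict.fromkeys(REQUIRED, 1.0)
incomplete = dict(complete, Nb=0.0)
del incomplete["Ba"]

print(find_input_problems(complete))    # []
print(find_input_problems(incomplete))  # ['Nb: zero or negative', 'Ba: missing']
```

This check does not test whether a sample lies within the compositional range of the training data, which still needs to be assessed separately.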
We conducted a sensitivity test to evaluate the effects of analytical uncertainties on the calculated probability. We generated 100 artificial data points by perturbing the original XRF analysis (Ueki & Iwamori, 2017), using assumed errors (the repetition error during the XRF analysis of Ueki & Iwamori, 2017), for the sample shown in Figure 3 (sample 03060426; Ueki & Iwamori, 2017). Discrimination analyses were then performed on these 100 artificial data points using the Excel spreadsheet (Supporting Information S2). One hundred independent results, and their average and standard deviation values, are presented in Table S2. The mean probability for the IA setting is 46.18% ± 11.94%.
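A sensitivity test of this kind can be sketched as a simple Monte Carlo loop. The Python example below assumes independent Gaussian errors with a fixed relative standard deviation per element, which is an assumption made here for illustration; the source does not specify the error model. `toy_classifier` is a hypothetical stand-in for the trained discriminator, and the input values are placeholders.

```python
import random
import statistics

def perturb(sample, relative_sd, rng):
    """Draw one synthetic analysis by adding Gaussian noise to each content."""
    return {el: rng.gauss(v, relative_sd[el] * v) for el, v in sample.items()}

def sensitivity(sample, relative_sd, classify, n=100, seed=0):
    """Return mean and standard deviation of classify() over n perturbed copies.

    classify(sample) is a hypothetical callable returning the probability of
    one setting of interest.
    """
    rng = random.Random(seed)
    results = [classify(perturb(sample, relative_sd, rng)) for _ in range(n)]
    return statistics.mean(results), statistics.stdev(results)

def toy_classifier(sample):
    """Placeholder for the trained model: a simple function of two contents."""
    return sample["Rb"] / (sample["Rb"] + sample["Sr"])

# Placeholder contents (ppm) and assumed 2% relative errors.
sample = {"Rb": 10.0, "Sr": 300.0}
relative_sd = {"Rb": 0.02, "Sr": 0.02}

mean_p, sd_p = sensitivity(sample, relative_sd, toy_classifier, n=100)
print(f"{mean_p:.4f} +/- {sd_p:.4f}")
```

Replacing `toy_classifier` with the actual discriminator and the assumed errors with measured repeatability would give the spread of the output probabilities due to analytical uncertainty.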

Example Applications: Evaluating the Geochemical Characteristics of Geological Units
The primary use of the model is to evaluate the geochemical characteristics of a single sample. The most straightforward way to interpret the output probabilities is that the tectono-magmatic setting with the highest probability is the prediction of the trained discriminator, which represents the predicted original tectono-magmatic setting of the sample. Alternatively, the calculated probabilities of the memberships to the eight different settings can be interpreted and used as a geochemical characteristic of the sample.
The trained discrimination model can also be used to characterize a geological unit, such as a single volcano or rock body, based on multiple samples. We propose two types of analyses for multiple samples. The first method is based on a majority vote of multiple samples, which is achieved by counting, across all samples from a given geological unit, how often each setting has the highest probability. The second method is to characterize a geological unit based on the probability of membership for all eight settings. By assuming that the probabilities for the eight settings derived from a single sample are representative of the expected value of the sample, the average of the probabilities of all samples from a single geological unit can represent the expected value of the geological unit itself. In other words, we can determine the membership probabilities for all eight settings of a geological unit simply by averaging the probabilities of all available samples from the geological unit.
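The two multi-sample analyses described above, majority vote and probability averaging, can be sketched in Python as follows. The three sample probability sets are hypothetical values for illustration, and the function names are assumptions.

```python
from collections import Counter

SETTINGS = ["MOR", "OI", "OP", "CFB", "CA", "IA", "IOA", "BAB"]

def full(**probs):
    """Build a probability record for all eight settings, defaulting to 0."""
    return {s: probs.get(s, 0.0) for s in SETTINGS}

def majority_vote(sample_probs):
    """Count, per setting, the samples for which it has the highest probability."""
    votes = Counter(max(p, key=p.get) for p in sample_probs)
    winner = votes.most_common(1)[0][0]  # ties resolve to the first one counted
    return winner, votes

def mean_probabilities(sample_probs):
    """Average the membership probabilities of all samples, setting by setting."""
    n = len(sample_probs)
    return {s: sum(p[s] for p in sample_probs) / n for s in SETTINGS}

# Hypothetical outputs for three samples from one geological unit.
samples = [
    full(IA=0.6, CFB=0.3, IOA=0.1),
    full(IA=0.5, CFB=0.4, BAB=0.1),
    full(CFB=0.6, IA=0.4),
]

winner, votes = majority_vote(samples)
unit_probs = mean_probabilities(samples)
print(winner, dict(votes))           # IA {'IA': 2, 'CFB': 1}
print(round(unit_probs["IA"], 3))    # 0.5
```

The majority vote uses only the top-ranked setting of each sample, whereas the averaged probabilities retain information about secondary settings, so the two summaries can be reported side by side.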
We now demonstrate four examples of applications using geologically well-characterized samples from the Japan arc. The results are briefly described in the main text, and the full results are given in Tables S3-S6. The samples used in the following case studies were not used to train the model.
First, we present an analysis of a single volcano from a known tectono-magmatic setting. Results for the Akita-komagatake volcano, which is an active Quaternary volcano located on the volcanic front of the northeastern Japan island arc, are given in Figures 3 and 4 and Table S3. The data set was originally reported by Ueki and Iwamori (2017). Whole-rock geochemical data were obtained by XRF spectrometry. A classification of IA was obtained for 72.41% (21 of 29) of the samples, meaning that the Akita-komagatake volcano is classified as IA by the model. The probabilities of membership for the Akita-komagatake volcano, based on the average of the total available samples, are IA 46.63%, CFB 26.56%, IOA 10.54%, BAB 8.53%, CA 7.53%, OP 0.12%, OI 0.08%, and MOR 0.01%. Figures 3 and 4 are example screenshots of the spreadsheet showing the output from one sample from the Akita-komagatake volcano. Figure 4 shows that positive Al2O3 and Rb/K2O, and negative Fe2O3 × Rb, Ba × Nb, and Y/Ba of the input sample (sample 03060426) contribute to the increased probability of membership of the IA setting.
The next case study (Table S4) is based on melt inclusions in igneous basement rock samples from the Philippine Sea Plate recovered by the International Ocean Discovery Program (Site U1438; Brandl et al., 2017). The analyzed melt inclusions were considered to have been sourced from the frontal arc section of the proto-Izu-Bonin-Mariana intra-oceanic arc (Brandl et al., 2017; Hamada et al., 2020). Chemical compositions of the melt inclusions were determined by laser ablation-inductively coupled plasma-mass spectrometry (LA-ICP-MS). Of the samples, 91% (41 of 45, excluding compositional outliers that could not be calculated) are classified as IOA. Probabilities of membership for the whole core suite (Site U1438), derived from the average of the total available samples, are IOA 74.28%, BAB 16.08%, IA 6.59%, CA 2.73%, CFB 0.17%, OP 0.15%, and 0.00% for OI and MOR.
We now present results for volcaniclastic fragments sampled in the Cenozoic Shimanto Belt in western Shikoku, southwestern Japan, which formed in a paleo-accretionary prism (Kiminami & Imaoka, 2006) (Table S5). The analyzed samples are fragments of volcanic breccia or conglomerate. Whole-rock geochemical compositions of the volcaniclastic fragments and blocks were analyzed by XRF spectrometry. Samples have undergone varying degrees of alteration, and samples with limited alteration were selected for analysis by Kiminami and Imaoka (2006). Of the samples, 79% (15 of 19 computable samples) are classified as OI. Probabilities of membership, derived from the average of the total available samples, are OI 76.12%, CA 9.98%, CFB 6.70%, MOR 6.16%, OP 0.93%, BAB 0.08%, IA 0.03%, and IOA 0.00%. These volcanic fragments were inferred to have originated in OI settings, based on their geological occurrences and geochemical compositions (Kiminami & Imaoka, 2006). The geochemical database of Haraguchi et al. (2018) was used to retrieve this data set.
The compositions of samples of altered oceanic crust (Kelley et al., 2003) from the western margin of the Pacific Plate, which is subducting in the Izu-Bonin-Mariana arc (Ocean Drilling Program Site 801; Plank et al., 2000), were analyzed to assess the ability of the model to classify altered samples (Table S6). Whole-rock geochemical compositions were analyzed by solution ICP-MS. Amongst the samples described as basalt or altered basalt, 62% (23 of 37 computable samples) are classified as MOR. Probabilities of membership for the whole core suite, based on the average of the total basalt and altered basalt samples, are MOR 57.86%, BAB 18.87%, OP 10.74%, CA 5.46%, IOA 2.52%, OI 2.14%, CFB 1.97%, and IA 0.45%. These results show that the basement core suite is classified as MOR by the model. However, samples from just below the uppermost basement at 461 m below seafloor (mbsf), where lava flows are interbedded with sedimentary materials, were not classified as MOR by the model, along with samples around two hydrothermal zones at 510-530 and 625 mbsf, and samples around a breccia zone at 840-850 mbsf. This indicates that caution is required when applying the model to highly altered samples.

Summary
We have presented a newly trained geochemical tectono-magmatic setting discriminator. It uses the geochemical composition of a volcanic rock consisting of eight major elements and six selected trace elements as input data, and outputs the geochemical characteristics of the sample as the probability of membership of eight different tectono-magmatic settings. The discriminator can be used to assess the geochemical characteristics of a volcanic rock sample of all compositional ranges, as well as those of a geological unit or a single volcano, and to identify the origin of unknown samples. Due to the linearity and sparsity of the statistical method used in this study, the method is highly interpretable. It provides the relative weights of variables (contents, ratios, and multiplications) of the input geochemical data on the calculated membership probabilities. The four case studies suggest that the model is reasonably robust against alteration, especially when the analysis is based on multiple samples, although caution is required for highly altered samples. We also recommend using bivariate plots of the key geochemical features of the tectono-magmatic settings presented by Ueki et al. (2022). We present an easy-to-use Excel spreadsheet to implement the trained geochemical tectono-magmatic setting discriminator as Supporting Information S2. The latest version of the spreadsheet can be downloaded at https://doi.org/10.5281/zenodo.8323122 (Ueki et al., 2023).

Figure 1. Regression coefficients from the discrimination model for different tectono-magmatic settings shown as different colors. Cells shown in gray indicate that the corresponding variable is not included in the model as a result of the sparse modeling-based variable selection. See Table S1 for detailed data.

Figure 2. Confusion matrix showing the classification accuracy of the trained model. Columns represent instances in a predicted class and rows represent instances in an actual class. Classification accuracies were derived using a 10-fold cross-validation.

Figure 3. Screenshot showing the "input_and_result" sheet of the Excel spreadsheet. The geochemical composition of the sample for the analysis is defined on this sheet. The output from the system (i.e., probability of membership for eight tectono-magmatic settings) is displayed on the right.

Figure 4. Screenshot of the "plots" sheet showing the loadings of individual variables (contents, ratios, and multiplications of major and trace elements) of the input data on a final result (i.e., probability of a given tectono-magmatic setting). A positive value indicates that the variable contributes to increasing the probability of membership for the tectono-magmatic setting of interest, and a negative value indicates that the variable contributes to decreasing the probability.