Discriminating Tectonic Setting of Igneous Rocks Using Biotite Major Element chemistry−A Machine Learning Approach

The composition of igneous biotite is a potential indicator of the geologic environment of its host rock. Here, we apply two machine learning models, eXtreme Gradient Boosting (XGBoost) and light gradient boosting machine (LightGBM), to classify biotite in igneous rocks from five tectonic settings (oceanic intraplate, continental intraplate, continental arc, island arc, and rift) using major element chemistry compiled from a global dataset. Both models discriminate successfully among biotite from the five tectonic settings. The classifiers quantitatively search for unique geochemical signatures in the training dataset, mapping each biotite analysis to its corresponding tectonic setting. Both models yielded good classification accuracy (∼90% on average) on an unseen dataset, suggesting that biotite major element chemistry can successfully discriminate the tectonic setting of many igneous rocks. The Shapley Additive exPlanations algorithm, which measures the impact of every element, indicates that Na, Mn, Ba, Mg, Al, Cr, and O2− constitute important geochemical discriminators. According to both models, high Na, low Na, low Ba, high Mn, and low Mn of biotite have the strongest influence on the prediction of continental arc, continental intraplate, island arc, oceanic intraplate, and rift settings, respectively. The classifier models were applied to investigate the Neoproterozoic geodynamics of the Aravalli‐Delhi Belt, northwestern India. The models show that igneous activity related to the Erinpura granites (∼835 Ma) and the Malani Igneous Suite (∼750 Ma) can be ascribed to a continental intraplate setting related to extensional tectonics prior to the break‐up of the Rodinia supercontinent.

Although the binary and ternary discrimination diagrams described above are helpful and easily constructed and visualized, they are based on a limited number of elements and thus cannot describe the full compositional range of biotite from igneous rocks of different tectonic settings. This potentially limits the reliability of these diagrams in discriminating many tectonic settings. Recent advances in data science and artificial intelligence have led to the development of several powerful data-analysis techniques, known as machine learning (ML), which can detect patterns in large datasets automatically (Bishop, 2006). ML classifier tools overcome the limitations of binary or ternary diagrams by considering all the variables simultaneously in a large, multivariate geochemical database of rock/mineral compositions in order to discriminate tectonic settings (e.g., Han et al., 2019; Hasterok et al., 2019; Itano et al., 2020; Petrelli & Perugini, 2016; Qi & Xuelong, 2019; Ueki et al., 2018; Zhong et al., 2021).
In this study, we describe, for the first time, a machine learning approach for discriminating the tectonic setting of igneous rocks using a global geochemical dataset of biotite chemistry from five tectonic environments, namely island arc, continental arc, continental intraplate, oceanic intraplate, and rift-related. ML classifiers can capture the correlations among the elements that constitute the major element chemistry of biotite and quickly rank those that are most useful for the discrimination of tectonic settings. The eXtreme Gradient tree Boosting or XGBoost (Chen & Guestrin, 2016) and light gradient boosting machine or LightGBM (Ke et al., 2017) are two supervised classifier models. These techniques can be used for discrimination together with the interpretation of the important features, their interactions, and their relationships, which control the classification to a large extent. A gradient boosting regression model (Friedman, 2002) was applied for thermo-barometry using clinopyroxene-melt chemistry (Petrelli et al., 2016), and the XGBoost and LightGBM models are being used to solve many geological problems such as lithofacies classification (Dev & Eden, 2019; Merembayev et al., 2021; L. Zhang & Zhan, 2017). However, these techniques have not been used for handling and classifying large geochemical datasets until now. We developed LightGBM and XGBoost ML classifiers for a detailed analysis of biotite composition and compared the results from the two ML models. A synthetic minority over-sampling technique or SMOTE (Chawla et al., 2002) is also applied to achieve class balance in our label-imbalanced training dataset. We demonstrate that the major element chemistry of biotite can be used to discriminate arc, intraplate, and rift-related tectonic environments of igneous rocks to within acceptable limits of confidence.
A case study is presented using primary biotite composition from the Erinpura granite and the Malani Igneous Suite of the Aravalli Delhi Belt (ADB) in northwestern India to comment on the debate surrounding the geodynamic regime of the region in the Neoproterozoic.

Machine Learning Algorithms
In supervised multi-class classification, a classifier model is trained on a dataset with known labels (training set), and after the trained model has been evaluated, the significant discriminant features or elements it has learned are used to classify previously unseen data (test set). Among many supervised machine learning classifiers, two ensemble learning models, XGBoost and LightGBM, have been successfully implemented in many fields.

XGBoost Model
Chen and Guestrin (2016) developed a novel ensemble tree boosting algorithm called eXtreme Gradient Boosting (XGBoost). It uses the gradient boosting decision tree (GBDT) framework (cf. Friedman, 2002), which generates a new tree at every iteration and adds it to the weak trees of the previous iterations in order to minimize the loss function, which measures how good the model is at making predictions, and thereby improve the final output. This process continues until the loss function reaches its optimized value. The loss function is minimized using gradient descent optimization. The XGBoost method handles null values through its own internal procedure. An overly complex trained model tends to overfit the training data, leading to poor performance on new data. A regularization technique is therefore adopted to reduce the model complexity and avoid overfitting.
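The additive, tree-by-tree character of gradient boosting described above can be illustrated with a minimal sketch. It is not the XGBoost implementation: it assumes a single feature, one-split trees (stumps), a squared-error loss, and no regularization, and the data values, function names, and learning rate are hypothetical choices made only for illustration.

```python
# Minimal gradient boosting with one-split trees (stumps) and squared-error loss.
# Illustrative sketch only; XGBoost adds regularization, second-order gradients,
# sparsity-aware splits, and much more.

def fit_stump(x, residuals):
    """Fit a one-split regression tree to the residuals by least squares."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lmean, rmean)
    _, thr, lmean, rmean = best
    return lambda xi: lmean if xi <= thr else rmean


def mse(y, pred):
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)


def gradient_boost(x, y, n_trees=20, learning_rate=0.3):
    base = sum(y) / len(y)              # initial constant prediction
    pred = [base] * len(y)
    trees, losses = [], [mse(y, pred)]
    for _ in range(n_trees):
        # For squared error, the negative gradient is simply the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        # Add the new weak tree to the ensemble, shrunk by the learning rate.
        pred = [pi + learning_rate * tree(xi) for xi, pi in zip(x, pred)]
        losses.append(mse(y, pred))

    def predict(xi):
        return base + learning_rate * sum(t(xi) for t in trees)

    return predict, losses


# Hypothetical one-dimensional data.
x = [0.1, 0.4, 0.5, 0.9, 1.3, 1.7, 2.0, 2.4]
y = [1.0, 1.2, 0.9, 2.1, 2.0, 3.2, 3.0, 3.1]
predict, losses = gradient_boost(x, y)
print(round(losses[0], 4), round(losses[-1], 4))
```

In this sketch the training loss cannot increase from one iteration to the next, because each stump is a least-squares fit to the current residuals and the learning rate lies between 0 and 2.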

LightGBM Model
LightGBM is also a recent, widely used ensemble learning algorithm (Ke et al., 2017) that modifies the GBDT implementation. Unlike other GBDT-based models (e.g., XGBoost), it follows a leaf-wise rather than a level-wise tree growth strategy. In leaf-wise tree growth, LightGBM splits only the leaf that yields the largest reduction in loss and leaves the other leaves at the same level unsplit. It also shortens training time and improves efficiency by reducing the computational complexity.

Construction of Models
A number of steps were carried out for building the ML models. These are described below.

Data Pre-Processing
The machine learning algorithms were executed in Python using the Jupyter Notebook platform. No missing-data imputation was performed during data pre-processing prior to training of the machine learning models. The XGBoost and LightGBM models can learn internally to handle missing values during model training by the 'Sparsity-aware split finding' and 'Exclusive Feature Bundling' techniques, respectively (Chen & Guestrin, 2016; Ke et al., 2017). Normalization or logarithmic transformation of data was not performed before training, because tree boosting algorithms such as XGBoost and LightGBM do not require feature scaling (normalization/standardization). Since decision tree nodes split on a feature by selecting a threshold value, a monotonic transformation of the feature variables does not change the set of possible splits.
The compiled dataset (described in Section 3) was divided into three parts: a training set for model training, a validation set for model tuning, and an unseen test set for model evaluation, comprising 60%, 20%, and 20% of the data, respectively. We use the validation set rather than the test set during model learning to avoid any bias in the prediction of the test set during model evaluation.
In supervised machine learning, a model is trained to learn the criteria relating the features to classification labels on a training dataset. Here, the atoms per formula unit (apfu) of the major cations and anions in biotite are considered as variables/features, and the tectonic setting is the class/label. In our dataset, all the variables are numerical, and the labels, i.e., tectonic settings, are categorical. A synthetic minority over-sampling technique or SMOTE (Chawla et al., 2002) is applied to remove the imbalance in the classes of our training dataset before developing the models. This technique synthesizes new samples of the minority classes by interpolating between existing minority-class instances and their nearest neighbors, thereby balancing the training dataset and improving the performance of the classifier models.
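The invariance of threshold splits under monotonic transformation can be checked directly. The sketch below is an illustration under stated assumptions, not part of the modeling workflow: it uses a single feature, a Gini impurity criterion, and hypothetical Mn apfu values and labels, and `best_split_partition` is a helper name introduced here.

```python
import math


def best_split_partition(values, labels):
    """Return the set of sample indices sent to the left by the best Gini split."""
    def gini(idx):
        n = len(idx)
        counts = {}
        for i in idx:
            counts[labels[i]] = counts.get(labels[i], 0) + 1
        return 1.0 - sum((c / n) ** 2 for c in counts.values())

    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_left = None, None
    for k in range(1, len(order)):
        if values[order[k - 1]] == values[order[k]]:
            continue  # no threshold can separate identical values
        left, right = order[:k], order[k:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(order)
        if best_score is None or score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left


# Hypothetical Mn apfu values and tectonic labels, for illustration only.
mn_apfu = [0.002, 0.010, 0.015, 0.030, 0.045, 0.060]
setting = ["rift", "rift", "rift", "arc", "arc", "arc"]

raw = best_split_partition(mn_apfu, setting)
logged = best_split_partition([math.log(v) for v in mn_apfu], setting)
print(raw == logged)  # the same samples fall on each side of the split
```

Because the logarithm preserves the ordering of the values, every candidate threshold separates the same samples before and after the transformation; only the numerical value of the threshold changes.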
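A 60/20/20 partition of this kind can be sketched as follows. The sketch assumes a simple random shuffle of sample indices with a fixed seed; whether the split used here was stratified by tectonic setting is not stated, and the function name `split_60_20_20` is hypothetical.

```python
import random


def split_60_20_20(n_samples, seed=0):
    """Shuffle sample indices and split them into train/validation/test parts."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = n_samples * 60 // 100   # integer arithmetic avoids rounding surprises
    n_val = n_samples * 20 // 100
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]      # the remainder (about 20%)
    return train, val, test


# 8055 is the number of analyses in the compiled dataset.
train_idx, val_idx, test_idx = split_60_20_20(8055)
print(len(train_idx), len(val_idx), len(test_idx))
```

The three index lists are disjoint, so no analysis used for training or tuning appears in the test set.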
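The core idea of SMOTE, interpolation between a minority-class sample and one of its nearest minority-class neighbors, can be sketched in a few lines. This is a simplified illustration and not the implementation used in this study: `smote_like` is a hypothetical helper, the two-feature samples are invented, and practical work would normally rely on an established library.

```python
import random


def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the chosen sample (fewer if scarce).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: dist2(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical two-feature samples (e.g., Na and Mn apfu) for a minority class.
minority = [(0.10, 0.020), (0.12, 0.030), (0.15, 0.025)]
majority_count = 6
synthetic = smote_like(minority, n_new=majority_count - len(minority))
balanced_minority = minority + synthetic
print(len(balanced_minority))  # now equal to majority_count
```

Each synthetic point lies on the line segment joining two real minority samples, so over-sampling adds no values outside the observed compositional range of that class. Over-sampling is applied to the training set only, so the validation and test sets retain real analyses.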

Model Parameter Tuning
If the model complexity is high enough, the model can fit the training data with high accuracy, resulting in a low-bias model. Often, however, the model overfits the training set and does not achieve good accuracy on the test set (high variance). This usually results from the model having learned the training data, including its noise, so well that it is unable to generalize and predict correctly on a new, unseen dataset. The XGBoost and LightGBM models have specific hyperparameters that can be tuned with the help of a validation dataset during training to achieve the best performance (low bias, low variance) and avoid the pitfall of overfitting.

To obtain the optimum performance of the models, a random search with ten-fold cross-validation is performed over the training data to tune the model parameters. This evaluates the model for different combinations of parameter values within the search range and identifies the best combination of parameters by minimizing the loss function over the training set. The 'early stopping' approach (Prechelt, 2012) is applied to achieve the optimal fit of the model. It stops the training process when the model performance on the validation set starts to decrease or stabilize while the performance on the training set is still increasing. The iteration with the minimum logarithmic loss on the validation set is taken as the best one, i.e., further iterations bring no improvement in the model performance on the validation set. In this way, early stopping avoids overfitting by selecting the optimal number of iterations required to construct a generalized model. The optimized model parameter values and their tuning ranges are summarized in Table S1 in Supporting Information S1.
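The early stopping rule described above can be sketched as a scan over the validation loss recorded at each boosting iteration. The sketch assumes a patience of three iterations and uses invented log-loss values; the function name `best_iteration` is hypothetical and does not correspond to a library call.

```python
def best_iteration(val_losses, patience=3):
    """Return (index, loss) of the best iteration, stopping after `patience`
    consecutive iterations without improvement on the validation set."""
    best_loss = float("inf")
    best_iter = 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_iter = loss, i
        elif i - best_iter >= patience:
            break  # validation loss has stalled: stop training here
    return best_iter, best_loss


# Hypothetical validation log-loss per boosting iteration.
val_logloss = [1.20, 0.95, 0.80, 0.72, 0.70, 0.71, 0.70, 0.73, 0.75, 0.60]
stop_at, loss = best_iteration(val_logloss)
print(stop_at, loss)
```

With these values, training halts at iteration 7 and iteration 4 is retained as the best one; the later value of 0.60 is never reached, which is the usual trade-off of a finite patience.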

Model Evaluation
Once a trained model has been constructed, it can be used to predict the labels of an unseen (test) dataset. The performance of the classification models on the test set is measured with evaluation metrics such as the accuracy score, confusion matrix, precision, recall, f1 score, and precision-recall curve.
The accuracy score measures the proportion of the classes that are predicted correctly. Alternatively, the accuracy of the classifier model for individual classes can be shown in the confusion matrix form, which compares the predicted results with the actual labels. The confusion matrix has columns representing predicted classes and rows representing actual classes. The diagonals of the matrix show the tectonic settings that are correctly discriminated, while the other cells show the percentage of tectonic settings that are misclassified with others.
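As an illustration of the layout described above (rows for actual classes, columns for predicted classes), a row-normalized confusion matrix can be built as follows. The labels and predictions are hypothetical and the function names are introduced here only for the sketch.

```python
def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its total so the diagonal gives per-class accuracy."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


# Hypothetical labels and predictions for ten biotite analyses.
classes = ["arc", "intraplate", "rift"]
actual = ["arc"] * 4 + ["intraplate"] * 4 + ["rift"] * 2
predicted = ["arc", "arc", "arc", "intraplate",
             "intraplate", "intraplate", "intraplate", "intraplate",
             "rift", "arc"]

cm = confusion_matrix(actual, predicted, classes)
norm = normalize_rows(cm)
accuracy = sum(cm[i][i] for i in range(len(classes))) / len(actual)
print(cm, accuracy)
```

Each row of the normalized matrix sums to one, so an off-diagonal cell gives the fraction of one actual setting that was misclassified as another.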
The precision, recall, and f1 score can be calculated over the dataset for each class. The precision of a class is the number of correct predictions of that class out of all predictions made for that class. Recall of a class is defined by the number of correct predictions of that class out of the number that actually belongs to that class. The f1 score can be computed class-wise from the combination of precision and recall. A high f1 score corresponds to better classification quality of the model and vice-versa.
The overall model performance across the entire dataset can be measured by two other metrics: the macro-average and the weighted macro-average of precision, recall, and f1 score. To compute the macro-averaged precision, recall, and f1 score, the metric is calculated for each class and the unweighted mean over all classes is taken (Özgür et al., 2005). The weighted macro-average is preferable under class imbalance, as each class contributes to the average with a weight proportional to the number of instances belonging to that class. The precision-recall curve is also useful for evaluating the model's performance on imbalanced data (Saito & Rehmsmeier, 2015), where a large area under the curve or PR-AUC (maximum = 1) implies high precision as well as high recall.
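The class-wise and averaged metrics defined above can be computed directly from a confusion matrix. The sketch below uses a small hypothetical three-class matrix (rows actual, columns predicted); the function names are introduced for illustration only.

```python
def per_class_scores(cm):
    """Return (precision, recall, f1, support) for each class of a confusion
    matrix whose rows are actual classes and columns are predicted classes."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[r][k] for r in range(n))  # column total
        actual_k = sum(cm[k])                          # row total (support)
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1, actual_k))
    return scores


def macro_and_weighted_f1(scores):
    """Unweighted mean of class f1 scores, and mean weighted by class support."""
    total = sum(s[3] for s in scores)
    macro = sum(s[2] for s in scores) / len(scores)
    weighted = sum(s[2] * s[3] for s in scores) / total
    return macro, weighted


# Hypothetical confusion matrix for three classes.
cm = [[3, 1, 0],
      [0, 4, 0],
      [1, 0, 1]]
scores = per_class_scores(cm)
macro_f1, weighted_f1 = macro_and_weighted_f1(scores)
print(round(macro_f1, 3), round(weighted_f1, 3))
```

In this example the smallest class has the lowest f1 score, so the weighted average exceeds the macro average; this is the sense in which the macro average gives a minority class the same say as a majority class.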

Biotite Major Element Compositional Data Set
The dataset of primary biotite compositions was collected from the GEOROC database (http://georoc.mpchmainz.gwdg.de/georoc) and filtered to retain only primary biotite from igneous rocks. It consists of a total of 8055 major element analyses of magmatic biotite in igneous rocks from five tectonic settings: continental arc (n = 749), continental intraplate (n = 5416), island arc (n = 674), oceanic intraplate (n = 473), and rift-related (n = 743), from all over the world. The geographic locations of the samples from which the biotite was analyzed are shown in Figure 1. The complete dataset, including geographic locations and host rock types, is summarized in Data Set S1. The host rocks of biotite in the dataset vary from intrusive to volcanic types, covering a wide compositional range from ultramafic/mafic to felsic (Table 1). The continental intraplate category includes a diverse set of host rocks/magmas, which could hinder the accurate discrimination of biotite from this tectonic setting. Therefore, based on the host rock types, we have further subdivided biotite from this tectonic setting into three sub-groups: lamprophyre clan-hosted, carbonatite-hosted, and other rocks-hosted. The detailed description of the host rocks, the percentage of volcanic, plutonic, and hypabyssal types, the compositional range of biotite in terms of major element oxides (wt.%), and their missing data counts are tabulated for individual tectonic settings in Table 1.
Our working dataset contains 19 variables, which constitute the atoms per formula unit (apfu) of the tetrahedral site cations T Si, T Al, T Fe3+; octahedral site cations M Al, M Mg, M Fe2+, M Fe3+, M Ti4+, M Cr3+, M Mn2+, M Ni2+; A-site cations A K+, A Na+, A Ba2+, A Ca2+; and W-site anions W F−, W Cl−, W OH−, and W O2−, computed from the major element (wt.% oxide) compositional data of biotite. The compositional ranges of the major element oxides in our dataset (SiO2: 33-46 wt.%, Al2O3: 8-22 wt.%, MgO: 5-28 wt.%, FeOT: 0-25 wt.%, K2O: 7-11 wt.%, TiO2: 0-9 wt.%, BaO: 0-3 wt.%, Na2O: 0-1.2 wt.%, Cr2O3: 0-1 wt.%, MnO: 0-1 wt.%, CaO: 0-0.1 wt.%, NiO: 0-0.14 wt.%, F: 0-6 wt.%, and Cl: 0-0.4 wt.%) are similar to those of Li et al. (2020), and we therefore used the biotite structural formula calculation sheet of Li et al. (2020) to calculate the apfu values of the biotite from their major element oxide data. Our classifier models are therefore considered reliable for biotite that falls within the compositional ranges given above. The major element compositions of the biotite were obtained using an electron probe microanalyzer and/or XRF by different groups of researchers for their own studies. The dataset includes some analyses from both core and rim parts of grains, as well as multiple analyses from a single rock (Data Set S1). Some biotite analyses have missing values (NA/ND) for some of the major oxides, i.e., Cr2O3, NiO, MnO, BaO, CaO, Na2O, Cl, and F (Table 1). Consequently, our working dataset contains some missing data for the elements M Cr, M Ni, M Mn, A Ba, A Ca, A Na, W Cl, and W F, where the oxide concentrations (wt.%) are not reported or are below the detection limit.

Descriptive Statistics and Scatter Plots
Before executing the model on the dataset, it is crucial to visualize and characterize the data distribution, and to compute descriptive statistics such as the mean, standard deviation, and maximum-minimum values of each variable. These are listed in Table 2 with the data counts and missing counts. The variations in biotite composition in terms of their elemental apfu values among different tectonic settings are shown in box-and-whisker plots (Figures 2 and 3). The upper and lower boundaries of each box represent the third and first quartiles of the data, while the horizontal line dividing the boxes represents the median value. The vertical [...]
Figure 1. Locations of biotite-bearing igneous rocks from five different tectonic settings, namely continental arc, continental intraplate, island arc, oceanic intraplate, and rift-related, shown on the world map. Biotite from the continental intraplate setting is further sub-divided into three sub-groups based on host rock association: lamprophyre clan-hosted, carbonatite-hosted, and other igneous rocks-hosted.
[...] (Figures 3a and 3i), whereas those from continental arc settings have higher W Cl and lower A K contents compared to those from other tectonic settings (Figures 3b and 3f). If we consider the three sub-groups of biotite from the continental intraplate setting, significant chemical differences are noted between them. Biotite from rocks belonging to the lamprophyre clan and carbonatite group has higher M Mg, M Cr, A Ca, and lower M Al, M Mn, M Fe2+, M Ti compared to biotite hosted in other continental intraplate rocks as well as those from the other tectonic settings (Figures 2d-2g, 2i, 3h, and 3i).
The heat maps of Spearman correlation between variables (atomic concentration per formula unit) (Figures S1-S8 in Supporting Information S1) indicate several substitution schemes for the incorporation of ions in biotite during its crystallization from the parental magmas. In the scatter plots and heatmaps provided, good [...] (Figures 4a, 4c, 4f, Figures S1-S8 in Supporting Information S1) are noted for biotite from all tectonic settings. In particular, biotite from continental arcs shows strong positive correlations between T Fe3+-W Cl, M Al-W Cl, T Fe3+-M Al, and negative correlations between T Al-W Cl and M Mg-W Cl (Figure 4f, Figure S2 in Supporting Information S1). Biotite from the lamprophyre clan shows a negative correlation between M Ti and M Mg (Figure S3 in Supporting Information S1), while that from the carbonatite group displays negative correlations between M Ti-T Si, M Ti-M Mg, A Na-A K, and a positive correlation between M Ti-T Al (Figure S4 in Supporting Information S1). Positive correlations between M Al-M Fe2+ and A Ba-M Fe3+ and a negative correlation between T Al-W Cl are found in the case of biotite from continental intraplate rocks other than those of the lamprophyre and carbonatite groups (Figure S5 in Supporting Information S1). For island arc biotite, strong negative correlations between M Al-M Mg, A K-W Cl, and positive correlations between A Ba-M Ti, M Al-M Fe2+, M Al-W Cl, M Al-W OH are observed (Figure 4d, Figure S6 in Supporting Information S1). Biotite from oceanic intraplate settings shows negative correlations between M Ti-T Fe3+, M Ti-W OH, and A Na-W OH (Figure S7 in Supporting Information S1). Negative correlations between A Na-M Al, A Na-W OH, M Ti-M Al, M Ti-W OH, and a positive correlation between M Al-W OH are also noticed for biotite from rift-related settings (Figure S8 in Supporting Information S1).
However, none of these binary scatter plots reveals any systematic differences in biotite composition among the five tectonic settings considered, with considerable compositional overlap noticed in the scatter plots (Figure 4).

Principal Component Analysis
Multivariate statistical techniques such as principal component analysis (PCA) are often used to identify similarities and dissimilarities in large datasets and to extract information about features (Carranza, 2008; Howarth & Sinding-Larsen, 1983; Smith, 2002; Zhao et al., 2019). Principal components (PCs) are linear combinations of the original variables that reduce the dimensionality while capturing the maximum variance in multivariate standardized geochemical data (Güler et al., 2002; Hasterok et al., 2019; Iwamori et al., 2017; Rinnen et al., 2015; Sadeghi et al., 2013). The first principal component (PC1) reflects the maximum variance of the variables in the dataset, followed by the second principal component (PC2), and so on. We performed PCA on our high-dimensional dataset. As described in Section 3.1, the working dataset has some missing data. Before performing PCA, missing data (NA/ND) were imputed in the working dataset using the MissForest imputer of Python's missingpy library, which imputes missing values iteratively using a random forest regression model. Following this, the data were standardized to mean 0 and standard deviation 1, i.e., scaled to unit variance. Although all the variables of the dataset have the same scale (i.e., apfu values), the ranges are considerably different for some. With non-standardized data, PCA gives more weight to the variables with higher variance, and the PCA result may be misinterpreted. To avoid such errors, we standardized the data using the StandardScaler class of the scikit-learn library in Python during data pre-processing for PCA. The Spearman correlation heatmap shows the correlation between all variables and PCs (Figure S9-a in Supporting Information S1). The scree plot shows the proportion of variance (in %) explained by each PC, where PC1 and PC2 explain 21.65% and 19.24% of the variance, respectively (Figure S9-b in Supporting Information S1).
In the PCA loading plot (Figure S9-c in Supporting Information S1), it is seen that the octahedral site cations M Al, M Fe2+, M Mg are the main contributors to PC1, while M Ti, W O2−, A Na, M Mg, and W OH are the ones that mainly contribute to PC2. From the PCA biplot (Aitchison & Greenacre, 2002), which combines the sample datapoints and the PCA loading plot (Figure 5), no distinct clustering of the biotite compositional data from the seven tectonic settings (including the three sub-groups of continental intraplate setting biotite) is observed, though biotite from oceanic intraplate settings is well-characterized in the direction of PC2 with positive loadings on M Ti, W O2−, and A Na. Hence, discriminating different tectonic settings for the majority of biotite is not possible using PCA, considering that only 40.9% of the total variance in the data is explained by the PC1 and PC2 axes. Therefore, two machine learning approaches using XGBoost and LightGBM classifier models were executed to test whether these methods can robustly classify the tectonic setting of igneous rocks using biotite chemistry. We analyze the final output from the machine learning models in the next section.
Figure 5. [...] Ni-M Cr, A Ba-A Ca, A K-W OH, A Na-W O2−) indicate a good correlation between elements in the biotite structure. The biplot models 40.9% of the total data variance. Here, biotite is classified according to its groupings in tectonic settings, though the discrimination of different types of biotite by PCA is not distinct.
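The effect of standardization on PCA can be demonstrated with a two-variable example, for which the leading eigenvector of the covariance matrix has a closed form. This is a sketch under stated assumptions rather than the workflow used here: the values are hypothetical, the helper names are introduced for illustration, and the sample standard deviation (n - 1 denominator) is used, which affects the scaling but not the direction of the loadings.

```python
import math


def standardize(values):
    """Scale to mean 0 and unit (sample) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def pc1_loadings(x, y):
    """Unit-length PC1 loading vector of two variables (2x2 covariance matrix)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    # Largest eigenvalue of [[sxx, sxy], [sxy, syy]].
    lam1 = (sxx + syy) / 2.0 + math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    vx, vy = sxy, lam1 - sxx          # eigenvector belonging to lam1
    norm = math.hypot(vx, vy)
    if norm == 0.0:                   # uncorrelated, x has the larger variance
        return 1.0, 0.0
    return vx / norm, vy / norm


# Hypothetical apfu-like values: a wide-ranging and a narrow-ranging variable.
wide = [1.0, 1.5, 2.0, 2.5, 3.0, 1.2]
narrow = [0.050, 0.040, 0.030, 0.020, 0.010, 0.045]

raw_loadings = pc1_loadings(wide, narrow)
std_loadings = pc1_loadings(standardize(wide), standardize(narrow))
print(raw_loadings, std_loadings)
```

Without standardization, PC1 is almost entirely aligned with the wide-ranging variable; after standardization both variables load on PC1 with equal magnitude, which illustrates why unstandardized PCA can be dominated by the variables with the largest variance.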

Output of the XGBoost and LightGBM Classifier Models
During data pre-processing (Section 2.2.1), no missing data were imputed in the working dataset, since XGBoost and LightGBM models can handle missing values automatically during model training. Normalization or logarithmic transformation of data was also not performed before the training of the machine learning classifiers, as all the variables (apfu values of elements calculated from the major element oxides of biotite) considered here have the same unit and, in any case, feature scaling is not required for decision tree-based models. The outputs from the XGBoost and LightGBM models are described in Sections 4.3.1 and 4.3.2, respectively.

XGBoost Classifier
First, the random search method was applied on the training set using a 10-fold cross-validation score to find the optimum values for a range of model hyperparameters. Every configuration sampled from the parameter ranges is evaluated with 10-fold cross-validation. Table S1 in Supporting Information S1 lists the best parameter sets, for which the trained model achieved the minimum loss and the best cross-validation score. The performance of the models on the test set is evaluated with evaluation metrics.
Considering the seven tectonic settings (including the three sub-groups under the broad category of continental intraplate setting), the XGBoost model can discriminate 87.6% (averaged accuracy score) of our test set correctly. The normalized confusion matrix helps to visualize the accuracy of the results for each tectonic setting. The normalized confusion matrix (Figure 6a) shows that the XGBoost model can classify biotite from the lamprophyre clan of the continental intraplate setting with the highest classification accuracy of 93%, with the remaining 7% being misclassified as belonging to other settings. On the other hand, carbonatite-hosted and other rocks-hosted continental intraplate biotite are discriminated with 92% and 85% accuracy, respectively. The XGBoost model is able to correctly classify 88% of biotite from the oceanic intraplate setting, with the remaining 12% being wrongly predicted. Eighty-three percent of rift setting biotite is correctly classified by this model, with 7% and 6% being wrongly predicted as belonging to continental intraplate and continental arc settings, respectively. The model can classify biotite from island arcs with 84% classification accuracy, and that from continental arcs with the lowest accuracy (76%). The class prediction errors are also reflected in the bar diagrams in Figure 6c. This classifier performs well, with precision, recall, and f1 score greater than 85% for individual tectonic settings (except for island arc and continental arc), and their overall macro-average and weighted average are listed in Table S2 of Supporting Information S1. We further plotted the precision-recall curves for each tectonic setting as well as their micro-averaged precision-recall curve, shown in Figure 6e. It is observed that the lamprophyre clan-hosted continental intraplate setting has the highest PR-AUC (=0.98), whereas the continental arc has the lowest PR-AUC (=0.85).
We merged the biotite from the three sub-groups of the continental intraplate setting into a single group and ran the XGBoost classifier model again to test whether it was still able to correctly discriminate the continental intraplate biotite as a single group along with the other four tectonic settings (continental arc, island arc, oceanic intraplate, and rift). Even in this case, the XGBoost model correctly discriminates 89.8% of biotite, a somewhat better averaged accuracy than in the previous case. The normalized confusion matrix (Figure 6b) shows that the model obtains a maximum accuracy of 94% for biotite from the continental intraplate setting, followed by 88% accuracy for the oceanic intraplate setting. The model is able to classify 83% and 82% of biotite from rift settings and island arc settings, respectively. Continental arc biotite is classified with the lowest accuracy of 76%. Bar plots depicting the class prediction errors are also shown in Figure 6d.
Figure 9. SHAP summary plots for individual tectonic settings obtained from the XGBoost classifier. SHAP summary plots of (a) continental arc, (b) continental intraplate, (c) island arc, (d) oceanic intraplate, and (e) rift setting biotite. The SHAP value of each feature calculated across all samples in our test set can be used to explore the impact of each feature on predicting the tectonic setting of interest. The SHAP summary plot of a tectonic setting allows sorting the features according to the sum of the SHAP value magnitudes across all individuals. Additionally, the plot illustrates how a high or low apfu value of an element (represented by red and blue colors, respectively; gray dots represent missing feature values with positive/negative SHAP values) impacts the classifier model output.
This classifier achieves more than 80% precision, recall, and f1 score for all tectonic settings with the exception of island arc and continental arc settings (Table S2 in Supporting Information S1). In Figure 6f, the precision-recall curves for each tectonic setting as well as their micro-averaged precision-recall curve are shown, with the maximum PR-AUC (=0.98) in the case of the continental intraplate setting and the minimum PR-AUC (=0.84) in the case of the continental arc setting.
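The structure of a random search with k-fold cross-validation can be sketched as below. The sketch is schematic: the parameter ranges are hypothetical (the actual ranges are those of Table S1), the helper names are introduced for illustration, and `toy_cv_score` is a placeholder that stands in for training a classifier on each training fold and scoring it on the held-out fold.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Return k (train, held_out) index pairs that partition the samples."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = []
    for i in range(k):
        held_out = idx[i::k]
        held = set(held_out)
        train = [j for j in idx if j not in held]
        folds.append((train, held_out))
    return folds


def random_search(param_ranges, cv_score, n_iter=20, seed=0):
    """Sample parameter sets at random and keep the one with the best CV score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.uniform(lo, hi)
                  for name, (lo, hi) in param_ranges.items()}
        score = cv_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


folds = k_fold_indices(50, k=10)


def toy_cv_score(params):
    # Placeholder objective: a real workflow would fit a model on `_train`
    # and evaluate it on `_held` for every fold, then average the scores.
    per_fold = [-(params["learning_rate"] - 0.1) ** 2
                - (params["subsample"] - 0.8) ** 2
                for _train, _held in folds]
    return sum(per_fold) / len(per_fold)


# Hypothetical search ranges, for illustration only.
ranges = {"learning_rate": (0.01, 0.3), "subsample": (0.5, 1.0)}
best_params, best_score = random_search(ranges, toy_cv_score)
print(best_params, best_score)
```

Random search evaluates only a sample of the possible configurations, so its cost is set by the number of sampled configurations times the number of folds rather than by the size of the full parameter grid.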

LightGBM Classifier
The optimized parameter setting for the LightGBM model is selected through a random search method with 10-fold cross-validation and listed in Table S1 of the Supporting Information S1. When biotite from the seven groups, including the three sub-groups of the continental intraplate setting, is considered, the LightGBM model scores an averaged classification accuracy of 88% over the test set. The normalized confusion matrix (Figure 7a) shows that the LightGBM model is able to classify biotite from the lamprophyre clan, carbonatite-hosted, and other rocks-hosted subsets of the continental intraplate setting with classification accuracies of 94%, 92%, and 87%, respectively. The model can correctly discriminate 88% of biotite from oceanic intraplate settings and 84% of that from rift settings. Biotite from island arc and continental arc settings is discriminated with the lowest accuracy (82% and 76%, respectively). The class prediction errors are presented as bar diagrams in Figure 7c. The LightGBM classifier achieves good precision, recall, and f1 score (>80%) for all tectonic settings (except for island arc and continental arc), and their overall macro-average and weighted average are tabulated in Table S2 of Supporting Information S1. The precision-recall curves for each tectonic setting as well as their micro-averaged precision-recall curve are shown in Figure 7e, where the lamprophyre clan-hosted continental intraplate setting has the highest PR-AUC (=0.99), and the continental arc has the minimum PR-AUC (=0.85) among all the groups.
Figure 10. SHAP summary plots for individual tectonic settings obtained from the LightGBM classifier. SHAP summary plots of (a) continental arc, (b) continental intraplate, (c) island arc, (d) oceanic intraplate, and (e) rift setting biotite. The SHAP value of each feature calculated across all samples in our test set can be used to explore the impact of each feature on predicting the tectonic setting of interest. The SHAP summary plot of a tectonic setting allows sorting the features according to the sum of the SHAP value magnitudes across all individuals. Additionally, the plot illustrates how a high or low apfu value of an element (represented by red and blue colors, respectively; gray dots represent missing feature values with positive/negative SHAP values) impacts the classifier model output.
When the three sub-groups of the continental intraplate setting are combined into one major group, the LightGBM model is still able to discriminate biotite from the five major tectonic settings. An averaged classification accuracy of 90.2% over the test set is achieved with the trained model. The normalized confusion matrix obtained from the LightGBM classifier is near-identical to that of the XGBoost model and is shown in Figure 7b. Ninety-five percent of biotite from continental intraplate, 87% from oceanic intraplate, 84% from rift, 82% from island arc, and 74% from continental arc settings are predicted correctly. The errors on individual class predictions are shown as stacked bar diagrams in Figure 7d. Good precision, recall, and f1 scores are obtained for all tectonic settings except for island arc and continental arc biotite (<80%). Precision, recall, and f1 scores, and their macro-average and weighted average, are listed in Table S2 of Supporting Information S1. The precision-recall curves for each tectonic setting and their micro-averaged precision-recall curve are shown in Figure 7f, where all the tectonic settings have good PR-AUC (range = 0.85-0.99). Overall, the LightGBM classifier provides classification results consistent with the XGBoost classifier. Furthermore, it is clear that both ML classifiers are able to successfully discriminate biotite from the continental intraplate setting, both when it is considered as one group and when it is subdivided into three subgroups, namely lamprophyre clan-hosted, carbonatite-hosted, and others-hosted. Therefore, the XGBoost and LightGBM models developed with the continental intraplate setting as a single group are considered for further interpretation in the next section.

Evaluation of Feature Importance on Model Performance (Mean SHAP Value)
Our dataset contains 19 elements defining the composition of biotite as classification features. It is important to recognize the features that contribute most to the discrimination of the five tectonic settings, and both the XGBoost and LightGBM models allow the importance of each feature to be determined. The Shapley Additive exPlanations (SHAP) algorithm (Lundberg et al., 2018) is a convenient tool for estimating the relative importance of each classification feature in tree-based models: the SHAP value of a feature indicates how much that feature contributes to the model prediction for a given sample. A TreeExplainer is implemented with the XGBoost and LightGBM models to calculate the SHAP values of each feature. In the stacked bar plot of feature importance for our multiclass problem, features are ordered by descending mean SHAP value, i.e., importance (Figure 8). For both XGBoost and LightGBM models (Figures 8a and 8b), the SHAP summary plots show that, considering all tectonic settings together, M Mn and A Na have the highest impacts and M Ni + the lowest impact on the prediction of tectonic settings, in terms of the largest and smallest mean SHAP values, respectively, across the test set data. The mean SHAP value of M Mn for biotite from the oceanic intraplate setting is the highest among all tectonic settings (∼1.2 in the XGBoost and ∼1 in the LightGBM model), which implies that M Mn has a greater influence in predicting biotite from the oceanic intraplate setting than from the other settings.
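The ordering of features in a stacked importance plot can be sketched as follows. This is a minimal illustration and not the code used in this study: it assumes, as is usual for such bar plots, that importance is the mean absolute SHAP value over the test samples, computed per class and then summed across classes. The SHAP values below are invented, and the helper name mean_abs_shap is hypothetical.

```python
def mean_abs_shap(shap_by_class, feature_names):
    """shap_by_class: {class_name: [[SHAP value per feature] per sample]}.

    Returns (ranking, per_class), where per_class[cls][feature] is the mean
    absolute SHAP value and ranking orders features by summed importance.
    """
    per_class = {}
    for cls, rows in shap_by_class.items():
        n_samples = len(rows)
        per_class[cls] = {
            name: sum(abs(row[j]) for row in rows) / n_samples
            for j, name in enumerate(feature_names)
        }
    # Height of the stacked bar = sum of the per-class means.
    total = {
        name: sum(per_class[cls][name] for cls in per_class)
        for name in feature_names
    }
    ranking = sorted(feature_names, key=lambda name: total[name], reverse=True)
    return ranking, per_class


# Invented SHAP values: two classes, three test samples, three features.
features = ["Mn", "Na", "Ni"]
toy_shap = {
    "oceanic intraplate": [[1.5, 0.5, 0.0], [-1.0, 0.25, 0.125], [1.25, -0.75, 0.0]],
    "continental arc": [[-0.25, 1.0, 0.0], [0.5, -1.0, 0.0], [0.0, 0.75, -0.125]],
}
ranking, per_class = mean_abs_shap(toy_shap, features)
```

Taking absolute values before averaging matters: a feature that pushes some samples strongly toward a class and others strongly away from it would otherwise average to near zero and appear unimportant.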
The importance of a feature may not be uniform across all samples in the dataset. The SHAP algorithm computes the local importance of a feature for every sample. In the SHAP summary plots for individual tectonic settings predicted using the XGBoost (Figure 9) and LightGBM (Figure 10) models, each data point represents the SHAP value of a feature for an individual sample in the test set, so the impact of every feature (positive or negative) on discriminating the tectonic setting of interest can be explored. The plots also sort the features according to the sum of the SHAP value magnitudes across all samples. Additionally, they illustrate how a high or low value of a feature (represented by red and blue colors, respectively) impacts the model output. The SHAP dependence plot for each tectonic setting (Figures S10 and S11 in Supporting Information S1 for the XGBoost and LightGBM models, respectively) indicates how the SHAP value of a single feature changes due to its interaction with other features, and is briefly summarized below.
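The local, per-sample nature of SHAP values can be illustrated with a brute-force calculation. The sketch below computes exact Shapley values for a small toy scoring function by enumerating all feature coalitions; it is not the TreeExplainer algorithm, which obtains such attributions efficiently for tree ensembles. As a simplifying assumption, features outside a coalition are set to a single baseline value. The scoring function, feature values (arbitrary units), and helper names are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, sample, baseline):
    """Exact Shapley values of `model` at `sample` relative to `baseline`.

    Features outside a coalition take their baseline value
    (a simplifying assumption for this sketch).
    """
    names = list(sample)
    n = len(names)

    def value(coalition):
        x = {k: (sample[k] if k in coalition else baseline[k]) for k in names}
        return model(x)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        contribution = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = value(set(subset) | {name})
                without_feature = value(set(subset))
                contribution += weight * (with_feature - without_feature)
        phi[name] = contribution
    return phi


def toy_score(x):
    # Hypothetical score with an Na-Fe interaction term (not a fitted model).
    return 2.0 * x["Na"] - 1.0 * x["Mn"] + 4.0 * x["Na"] * x["Fe"]


sample = {"Na": 1.0, "Mn": 0.5, "Fe": 1.0}
baseline = {"Na": 0.0, "Mn": 0.0, "Fe": 0.0}
phi = shapley_values(toy_score, sample, baseline)
```

Two properties are visible in this toy case. The attributions sum to the difference between the score of the sample and the score of the baseline, and the interaction term is shared equally between Na and Fe, which is why the attribution of one feature can depend on the value of another, the effect displayed in dependence plots.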
Continental arc: For both the XGBoost and LightGBM classifiers, the SHAP summary plots (Figures 9a and 10a) for biotite from the continental arc setting indicate that a high value of A Na has the highest positive impact on discriminating continental arc settings. For a few samples, a high W Cl value has a greater positive impact on the prediction (higher SHAP value) than A Na. This implies that W Cl strongly influences the prediction of a small number of biotite analyses, whereas A Na controls the prediction of most continental arc biotite, although with a smaller impact per sample. High values of M Al, A Ba, W F, and low values of M Mn, A K also significantly impact the prediction of the continental arc setting. Considering the most influential feature, i.e., A Na, the SHAP feature dependence plot for the XGBoost model (Figure S10-a in Supporting Information S1) shows that the SHAP value of A Na increases with increasing A Na content in biotite. A Na interacts mostly with M Fe2+, suggesting that a high value of A Na (up to ∼0.13 apfu) increases the SHAP value when associated with high M Fe2+. The LightGBM model-based SHAP feature dependence plot (Figure S11-a in Supporting Information S1) shows that the impact of the most influential feature, A Na, increases with its increasing concentration in biotite, and that A Na interacts mostly with A Ba.
Continental intraplate setting: The SHAP summary plots of biotite from continental intraplate settings for both the XGBoost and LightGBM models (Figures 9b and 10b, respectively) show that a low value of A Na has the highest impact on the correct discrimination of biotite from this tectonic setting, followed by high values of M Mg, A K, and low values of M Al, W Cl, A Ca, M Cr. The SHAP feature dependence plots from both models (Figures S10-b and S11-b in Supporting Information S1, respectively) indicate that the impact of A Na on the prediction of continental intraplate biotite increases as its concentration in biotite decreases. According to the XGBoost and LightGBM models, A Na interacts mostly with W F and M Mg, respectively, where a higher concentration of A Na associated with lower W F and higher M Mg apfu leads to a lower SHAP value.
Island arc: Low value of A Ba has the strongest positive influence on the prediction of biotite from island arc rocks (Figures 9c and 10c). High T Al, M Mn, M Cr, and low A Na, A K also influence many predictions of island arc biotite. The SHAP feature dependence plots for the XGBoost and LightGBM models (Figures S10-c, S11-c in Supporting Information S1) show that the impact of A Ba on the discrimination of island arc biotite decreases with increasing A Ba (>0.02 apfu) and with the association of lower M Cr.
Oceanic intraplate setting: For both XGBoost and LightGBM models, a higher concentration of M Mn in biotite has the largest positive impact on discriminating the oceanic intraplate tectonic setting, followed by high A Na, W O2−, M Ti and low M Al, A Ba apfu values (Figures 9d and 10d). In the SHAP dependence plots (Figures S10-d and S11-d in Supporting Information S1), the impact of M Mn on the discrimination of biotite from the oceanic intraplate setting increases further at higher M Mn concentration when associated with lower A Ba (XGBoost classifier) and lower M Al (LightGBM classifier).
Rift setting: The SHAP summary plots for rift setting biotite obtained from XGBoost and LightGBM models (Figures 9e and 10e) indicate that low M Mn and high A Ca have the highest positive impact on discriminating biotite from rift setting, respectively, followed by low M Mg, and high A Ba. The SHAP dependence plot (Figure S10-e in Supporting Information S1) of the XGBoost model shows that the impact of M Mn decreases with increasing Mn concentration, and also when it associates mostly with lower M Cr. However, the SHAP dependence plot (Figure S11-e in Supporting Information S1) of the LightGBM model shows that higher value of A Ca (<0.006 apfu) increases the impact of A Ca on the prediction of rift setting when coupled with lower M Mg value.

Petrogenetic Significance of the Influential Discrimination Parameters
The machine learning methods employed by us using the major element chemistry of biotite are successfully able to discriminate the five tectonic settings considered (Figures 6-10). This indicates that magmas produced in different tectonic settings have unique major element chemistries and that processes of magma generation/evolution are different in these tectonic settings. More importantly, the unique geochemical character of the magmas from each of these settings is to a large extent mirrored in the composition of biotite that crystallizes from them. Furthermore, the fact that biotite in different rock types from the same tectonic setting is correctly discriminated, suggests that the processes of magma generation have a fundamental and greater control on the composition of the magmas compared to other processes such as magmatic differentiation that tend to modify the primary geochemical signatures.
The two machine learning models, the XGBoost and LightGBM classifiers, can be used to pinpoint the features (elements in the biotite structure, expressed as apfu values) that contribute most to the discrimination of a particular tectonic setting, and these features can then be associated with the petrogenetic processes operating therein. Below we explore the link between the important compositional characteristics of biotite (as identified by the ML models) that help to discriminate the tectonic setting of the host igneous rock, and the broad compositional trends exhibited by magmas from the five tectonic settings considered.
Continental arc: It is observed that high A Na, A Ba, M Al, W F, W Cl, and low A K, M Mn exert major control in discriminating biotite from continental arc rocks (Figures 9a and 10a). Continental arc magmas contain material derived from highly enriched and isotopically evolved sources, including subduction-related components (e.g., Pearce et al., 2005; Shibata & Nakamura, 1997), and crustal material/sediments (e.g., Kimura & Yoshida, 2006). The high Na, Cl, F, and Ba contents of biotite in continental arc rocks may reflect the contribution of slab components in the form of slab-derived melt (Na2O-rich) (Schmidt & Jagoutz, 2017) and slab-derived fluid (Cl, F, Ba-rich) (Huang et al., 2019; Kent et al., 2002; Kovalenko et al., 2010), which are generally involved in magma production in continental arc settings. The high Al in biotite may be indicative of continental sediment assimilation by the parental magma (Shabani et al., 2003).
Continental intraplate setting: Magmatism in continental intraplate settings is diverse and may be linked to a variety of causes including mantle plumes and crustal extension (Hawkesworth & Gallagher, 1993). Low values of A Na, M Al, M Cr, A Ca, and W Cl and high values of M Mg, A K are key indicators of biotite from continental intraplate tectonic settings according to our model results. Continental intraplate magmas are enriched in incompatible elements like K and Ba (Kovalenko et al., 2007), which might allow the incorporation of A K and A Ba into biotite during crystallization. Due to the presence of a thick continental lithosphere, ascending magmas may interact with mantle peridotite, consuming olivine and crystallizing pyroxene and/or garnet. This melt-rock interaction can produce MgO-rich, Al2O3-poor magma with high K2O/Na2O (Liu et al., 2017), which may explain the high M Mg, A K, and lower A Na, M Al of biotite that crystallizes from continental intraplate magmas. Continental crust assimilation can also play an important role in enhancing the K2O/Na2O ratio of the magmas in such settings.
Island arc: Low A Ba, A Na, A K, and high T Al, M Mn, M Cr, W O2− constitute the most impactful features that help to discriminate biotite from island arc rocks. Island arc magmas have several potential source components. These include the depleted mantle wedge as well as the different slab components (Labanieh et al., 2012; Stern, 2002). The mantle wedge component may be variably depleted depending on its previous melt extraction history. Island arc magmatism is generally explained as the result of melting of the mantle wedge fed mainly by cooler slab-derived dehydration fluids/supercritical liquids (with high K2O/Na2O), where involvement of a slab melt component (Na2O-rich) is less common (Schmidt & Jagoutz, 2017). Therefore, the Na2O content, being primarily of mantle origin, is low in island arc magma, allowing the crystallization of biotite with low A Na. However, the low K in biotite may be the result of early fractionation of amphiboles from more primitive island arc magmas (Davidson et al., 2007; Hamada et al., 2014), because of the relative compatibility of K in amphiboles (Tiepolo et al., 2007). Furthermore, the magmas are generally hydrous and oxidized, with a high Fe3+/Fe2+ ratio (Cottrell et al., 2020). The high W O2− content of biotite from rocks of this setting can be explained by the hydrous and oxidized nature of the parental magmas.
Oceanic intraplate setting: Higher concentrations of M Mn, A Na, W O2−, M Ti, and lower concentrations of M Al, A Ba (apfu values) have the strongest influence on the discrimination of biotite in oceanic intraplate rocks. Variation of oceanic lithosphere thickness, referred to as the lid effect, exerts the primary control on the geochemistry of oceanic intraplate rocks such as ocean island basalts on a global scale (Niu et al., 2011). Ocean island magmas tend to have low SiO2 and high TiO2 concentrations relative to magmas from the other tectonic settings. A deep melting source for the magmas is consistent with the high M Ti and low M Al of the biotite, as the Ti concentration of the melt increases while that of Al decreases with increasing pressure (Ueki & Iwamori, 2014).
Since melting is predominantly due to decompression of the asthenosphere, pressure plays a major role both during melting and during crystallization. As the partition coefficient of Mn/Fe between olivine and the melt (Kd[Mn/Fe]) is 0.82 (Foley et al., 2013), early fractionation of olivine may produce residual melts with higher Mn/Fe, resulting in the incorporation of higher M Mn in the structure of biotite from this setting.
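The direction of this effect follows from mass balance and can be checked numerically. The sketch below is a simplified illustration, not a petrological model: it assumes that olivine is the only phase removing Mn and Fe from the melt and that the Mn/Fe exchange coefficient stays constant at the value of 0.82 cited above. Under these assumptions, each increment of olivine takes Mn and Fe in a ratio equal to 0.82 times the current Mn/Fe of the melt, so the melt ratio grows as (Fe/Fe0)^(Kd − 1). The starting amounts, step count, and function name are hypothetical.

```python
def melt_mn_fe_after_olivine(mn0, fe0, fe_fraction_left, kd=0.82, steps=100_000):
    """Remove olivine from a melt in small steps and return the melt Mn/Fe.

    kd = (Mn/Fe)_olivine / (Mn/Fe)_melt, held constant (an assumption).
    fe_fraction_left = fraction of the initial Fe still in the melt at the end.
    """
    mn, fe = mn0, fe0
    d_fe = fe0 * (1.0 - fe_fraction_left) / steps    # Fe removed per step
    for _ in range(steps):
        d_mn = kd * (mn / fe) * d_fe                 # Mn removed along with that Fe
        mn -= d_mn
        fe -= d_fe
    return mn / fe


# Arbitrary starting amounts; half of the melt's Fe is removed into olivine.
initial_ratio = 0.02 / 1.0
numeric_ratio = melt_mn_fe_after_olivine(mn0=0.02, fe0=1.0, fe_fraction_left=0.5)
closed_form = initial_ratio * 0.5 ** (0.82 - 1.0)    # (Mn/Fe)0 * (Fe/Fe0)^(kd - 1)
```

With these assumptions, removing half of the Fe into olivine raises the Mn/Fe of the residual melt by roughly 13%. The increase is modest because the coefficient is close to 1, and other crystallizing phases would change the result.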
Rift setting: Biotite from rift settings can be discriminated by low M Mn, M Mg, and high A Ba, A Ca. Oxygen fugacity is low in rift settings compared to the other tectonic settings (Cottrell et al., 2020), and the magmas tend to be reduced, with high Fe/Mg (low M Mg, high M Fe2+). Such low-fO2 magmas evolve by early fractional crystallization of olivine and/or clinopyroxene, leaving the residual magmas depleted in MnO (Foley et al., 2013; Mullen, 1983), which is probably the cause of the low Mn content in biotite from this setting (Barbarin, 1990).

Geological Background of the Study Area
The Aravalli Delhi Belt (ADB) represents a prominent NE-SW trending composite orogenic belt in northwestern India (Figure 11). The belt comprises an Archean basement (3.3-2.5 Ga; Roy & Kröner, 1996; Roy et al., 2012), referred to as the Banded Gneissic Complex, overlain by intensely deformed and metamorphosed supracrustal rocks belonging to the Aravalli and the Delhi Supergroups (Gupta et al., 1992). Sediments of the Aravalli Supergroup were deposited between 2.1 Ga and 1.6 Ga over the Archean basement and were deformed and metamorphosed during the Aravalli orogeny that gave rise to the Aravalli Fold Belt (McKenzie et al., 2013; Wang et al., 2018). Sediments of the Delhi Supergroup were deposited at the end of the Mesoproterozoic and underwent deformation and metamorphism during the Delhi orogeny related to the assembly of the Rodinia supercontinent (Grenville-age orogeny) at ∼1.0 Ga (Pandey et al., 2013). The Delhi Fold Belt is divided into the North Delhi and South Delhi Fold Belts (NDFB and SDFB) (Sinha-Roy, 1984) based on the purported diachronous nature of sedimentation and granite magmatism in the two belts (Biju-Sekhar et al., 2003; Choudhary et al., 1984). The Delhi orogeny led to the amalgamation of the northern Indian blocks through collisions between the Aravalli-Bundelkhand Craton and the Marwar terrane (Bhowmik et al., 2010). Following this, multiple phases of felsic magmatism have been documented in the SDFB. Of these, the Erinpura granites were emplaced between 873 Ma and 820 Ma (Purohit et al., 2012) along the western flank of the SDFB (Heron, 1953), and are considered to represent a late orogenic stage of the Delhi orogeny. The next major felsic igneous activity, between 770 and 750 Ma (Ashwal et al., 2013; Gregory et al., 2009; Van Lente et al., 2009; Torsvik et al., 2001), is represented by the Malani Igneous Suite (MIS), comprising peraluminous and peralkaline volcanic and granitic rocks with major occurrences at Jalor and Siwana. Equivalent felsic rocks are found in the Tusham ring complex, Jhunjhunu granite, and Nagar Parkar granites further away (Khan et al., 2012, 2017; Kochhar, 2015; Srivastava, 1988).
Several studies have tried to evaluate the geodynamic setting of Neoproterozoic igneous activity in the SDFB using whole-rock chemistry, structural studies, or geophysical tools. However, there is considerable debate, and a number of geodynamic models have been proposed. One group of workers relates the Neoproterozoic magmatism to a subduction/Andean-type active margin (Ashwal et al., 2013; De Wall et al., 2021). Some studies proposed emplacement of the Erinpura granite at a late syn- to post-orogenic stage of the Delhi orogeny (Gupta et al., 1980; Pandit et al., 2011) as a result of subduction along the northern margin of the northwestern Indian craton and the South China block (De Wall et al., 2021). Many workers (e.g., Dhar et al., 1996; Eby & Kochhar, 1990; Kochhar, 1984, 2001; Kumar & Vallinayagam, 2014; Maheshwari et al., 2009; Sharma & Kumar, 2017), on the other hand, support the idea of hot spot-driven magmatism in an extensional, anorogenic, continental intraplate setting for the formation of the subsequent late Neoproterozoic MIS. A third group correlates the MIS magmatic events with intra-cratonic rifting along structural lineaments during Rodinia break-up (Bhushan, 2000; Pareek  Sharma, 2004; Srivastava, 1988), coeval with the Pan-African thermal event and magmatism in Nagarparkar (Pakistan), Madagascar, Seychelles, and South China. Just et al. (2011) suggested that the Erinpura granites were unrelated to the Delhi orogeny, and rather constituted an early phase of MIS magmatism related to the break-up of the Rodinia supercontinent.

Interpretation of Geodynamic Setting
We applied our ML models to biotite from the Erinpura granites and the MIS to constrain the geodynamic regime prevailing in the SDFB during the Neoproterozoic. For this purpose, a total of 233 biotite major element oxide (wt.%) analyses were compiled from plutonic and volcanic rocks of the Erinpura granite (n = 14), the Malani Igneous Suite (n = 219), and their equivalent units, and were discriminated using our machine learning classifiers. Most of the biotite (>90%) is classified as belonging to the continental intraplate setting by both the XGBoost and LightGBM models (Figure 12). The biotite data from the MIS and the Erinpura granite and/or their equivalents used for this case study, along with the predicted tectonic settings, are listed in Data Set 2. Based on the results of our ML models, the Neoproterozoic igneous activity in the southern part of the ADB can be ascribed to an anorogenic, hot spot-related continental intraplate setting, possibly triggered by the crustal extension prior to the rifting that led to the break-up of the Rodinia supercontinent.
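The summary statistic reported for this case study is the share of analyses assigned to each tectonic setting. A minimal sketch of that tally is given below, using an invented list of predicted labels; the counts are placeholders and not the predictions listed in Data Set 2, and setting_shares is a hypothetical helper.

```python
from collections import Counter


def setting_shares(predicted_labels):
    """Return {setting: fraction of analyses assigned to it}, largest first."""
    counts = Counter(predicted_labels)
    total = len(predicted_labels)
    return {setting: count / total for setting, count in counts.most_common()}


# Placeholder predictions for 20 analyses (illustration only).
toy_predictions = ["continental intraplate"] * 18 + ["rift"] + ["continental arc"]
shares = setting_shares(toy_predictions)
dominant_setting = next(iter(shares))   # first key = most frequent setting
```

Interpreting a suite of rocks from the dominant class, rather than from any single analysis, reduces the influence of individual misclassified biotite compositions.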

Web Application for Users of Our Classifier Models
A browser-based application has been built with Streamlit, an open-source Python app framework, and deployed on the Heroku platform. It helps users to visualize our working database and the output of our models. The objective of this web application is to allow any user to determine, using our ML models, the tectonic setting of igneous rocks of interest from the major element chemistry of their biotite (apfu values). Users can upload an Excel sheet containing biotite compositions, select the model option, and click the 'Discriminate' button to have the predicted tectonic setting printed on the screen. The web application can be accessed at https://biotite.herokuapp.com/.

Conclusions
This study demonstrates that both XGBoost and LightGBM machine learning algorithms can successfully discriminate biotite from different tectonic settings based on their major element chemistry, with consistent and comparable performance of both the LightGBM and XGBoost classifier models. The results suggest that biotite crystallizing in igneous rocks in each tectonic setting carries unique geochemical characteristics inherited from the parental magmas, which can be exploited to constrain the tectonic setting of their host rocks. The fact that biotite from a wide variety of rock types from each tectonic setting can be accurately classified suggests that the magmas from which the biotite crystallized retain the signature of their petrogenesis/tectonic setting even when significant magmatic differentiation may have taken place.
The performance of the machine learning classification models was evaluated through classification accuracy score, confusion matrix, precision, recall, f1 score, and precision-recall curve. During the training of the models, parameter tuning processes such as random search and early stopping were applied to select the optimum set of parameters for the best model outcomes. We have achieved reasonable performance of the XGBoost and LightGBM models (averaged classification accuracy 89.8% and 90.2%, respectively). Therefore, our classification models are expected to be able to discriminate any new geochemical dataset of biotite, considering the complexities and range of data in our training set. The classification models also identify the most important geochemical features as significant discriminants of different biotite-forming tectonic settings without requiring a priori information on the tectonic settings. We measure the impact of each element on the discrimination of biotite from the tectonic settings considered using SHAP values and evaluate their petrogenetic significance. To further corroborate the findings of our study, we applied our machine learning discriminator models to investigate the geodynamic setting of Neoproterozoic felsic magmatism in the southern Aravalli Delhi Belt of northwestern India, and inferred a continental intraplate setting for the granitoid rocks of the area.
However, this kind of approach has some limitations. The addition of more biotite data and more features (like trace elements and isotope composition) will possibly help to build a better-trained model and improve the accuracy of classifying biotite from different tectonic settings. Besides, due to the high dimensionality of the biotite compositional dataset, it is difficult to visualize the model-based classification between five tectonic settings in any two or three-dimensional discrimination diagram. For visualization of the output of our model as well as the working dataset, a web application has been developed, which users can use to discriminate the tectonic setting of their own data of biotite major element composition using our classifiers.

Data Availability Statement
Datasets used in this research are available in the Supporting Information S1, or in Zenodo (https://doi.org/10.5281/zenodo.5554902) as Data Set S1 and Data Set S2.