Coping with imbalanced data problem in digital mapping of soil classes

An unsolved problem in the digital mapping of categorical soil variables and soil types is the imbalanced number of observations, which leads to reduced accuracy and the loss of the minority class (the class with a significantly lower number of observations compared to other classes) in the final map. So far, synthetic over‐ and under‐sampling techniques have been explored in soil science; however, more efficient approaches that do not have the drawbacks of these techniques and that guarantee retention of the minority classes in the produced map are needed. The approaches suggested in the present study for digital mapping of soil classes include machine learning models of ensemble gradient boosting, cost‐sensitive learning and one‐class classification (OCC) of the minority class combined with multi‐class classification. In this regard, extreme gradient boosting (XGB) as an ensemble gradient learner, a cost‐sensitive decision tree (CSDT) within the C5.0 algorithm, and a one‐class support vector machine combined with multi‐class classification (OCCM) were investigated to map eight soil great groups with a naturally imbalanced frequency of observations in northwest Iran. A total of 453 profile data points were used for mapping the soil great groups of the study area. The data were split manually for each class separately, resulting overall in 70% of the data for calibration and 30% for validation. Calibration was bootstrapped (10 runs) to produce multiple maps for each model, and the 10 bootstraps were evaluated against the hold‐out validation dataset. The average values of the accuracy measures, including Kappa (K), overall accuracy (OA), producer's accuracy (PA) and user's accuracy (UA), were explored. In addition, the results of this study were compared with a previous study in the same area, in which resampling techniques were used to deal with imbalanced data for digital soil class mapping.
The findings show that all three suggested methods can deal well with the imbalanced classification problem, with OCCM showing the highest Kappa (K = 0.76) and overall accuracy (OA = 82%) in the validation stage. This model can also guarantee the retention of the minority classes in the final map. Comparing the present approaches with the previous study demonstrates that the three newly suggested methods can remarkably increase both overall and individual class accuracy for mapping.


KEYWORDS
class imbalance, cost-sensitive decision tree, digital soil mapping, extreme gradient boosting, imbalanced multi-class classification, machine learning, one-class classification, support vector machine

| INTRODUCTION
Understanding soil classification and the associated constraints, conditions and capabilities is key to sustainable development (Bouma et al., 2022), and one way of enhancing soil knowledge for better land management is through digital mapping of soil classes. However, the applicability of a soil map relies on the accuracy and certainty of the information it provides in illustrating the spatial pedo-diversity.
A very common problem in predictive mapping of soil classes is the imbalance in the number of observations for different soil types, as a skewed distribution of soil classes may occur almost everywhere due to soil-forming and developing factors (Brungard et al., 2015; Heung et al., 2014; Heung et al., 2016; Sharififar, Sarmadian, Malone, & Minasny, 2019; Taghizadeh-Mehrjardi et al., 2020). In predictive mapping, this issue results in digital soil maps with substantially low accuracy and the omission of classes with a significantly lower number of observations, which are called minority classes (Sharififar, Sarmadian, Malone, & Minasny, 2019). Because most machine learning and simulation models assume a balanced distribution of input data, they produce poor outcomes when trained on imbalanced classes. This problem exists in computer and data science as well as in many other fields, such as engineering and the social and human sciences (Burez & Van den Poel, 2009; Haixiang et al., 2017; Zhu et al., 2017). However, it has not been well studied in soil science for any of the related classification tasks: soil class mapping, soil texture classes, salinity and alkalinity classes, the presence or absence of root-restricting horizons, land capability and suitability classes for crop production and water management, or any other categorical soil quality used for classification and prediction purposes. As a result of the class imbalance problem, reduced accuracy and the loss of minority classes in final predictions yield uncertain or misleading maps.
Possible approaches to address class imbalance in modelling include resampling (for example, synthetic oversampling of the minority class and synthetic undersampling of the majority class), ensemble modelling and algorithm tuning (Galar et al., 2012; Liu et al., 2008; Ma & He, 2013; Sáez et al., 2016). Among these, only resampling techniques have been explored in soil science (Taghizadeh-Mehrjardi et al., 2020). Resampling is easy to apply, although it may not offer the most optimal solution to the class imbalance problem: part of the data may be discarded (during undersampling), which results in a loss of information, while artificial oversampling duplicates input data to help the training process but might not necessarily improve correct prediction of the minority classes (López et al., 2013; Sáez et al., 2016). Although resampling has been shown to improve overall accuracy in predictive mapping, it may not guarantee the preservation and accurate prediction of the minority class in the final model outcome (Loyola-González et al., 2016; Ma & He, 2013; Neyestani et al., 2021).
The present study suggests investigating the feasibility of approaches that can potentially guarantee the preservation of the minority soil class in the map and improve the map accuracy further, compared to resampling techniques, without making any change to the input data. These suggestions are: (a) the use of an ensemble model that boosts the training process of imbalanced classes to enhance the learning of relating skewed class

Highlights
• Imbalanced data problems lead to the loss of minority classes and low accuracy in soil class maps.
• Cost-sensitive calibration, gradient boosting and one-class classification can solve the problem.
• The three suggested methods increased accuracy and retained the minority class compared to resampling.
• One-class classification can guarantee the retention of the minority class on the map.
observations to covariates, treating the minority class not as noise in the data but as a signal for better adaptation to the covariates. Extreme gradient boosting (XGB) is a model with such capability: it uses multiple trees in an ensemble algorithm that builds an initial tree and performs self-correction in an iterative cycle to boost learning from imbalanced data (Chen & Guestrin, 2016). This can make the model more efficient at learning from a highly skewed dataset than algorithms that do not benefit from tree pruning (self-correction). (b) The use of cost-sensitive learning: most classification algorithms assume that the misclassification errors (i.e., the costs of false negative and false positive class predictions) are the same, while in most real-world applications this assumption is not true (Thai-Nghe et al., 2010). Hence, misclassifying the minority class as the majority class should not carry the same error or cost for the algorithm as misclassifying the majority class as the minority class. Providing an unequal misclassification cost matrix can result in training that does not treat all classes the same, leading to a modified algorithm aligned with the imbalanced data input. As an example of this approach, a cost-sensitive decision tree (CSDT) (e.g., the C5.0 decision tree [DT] algorithm [Pandya & Pandya, 2015; Quinlan, 1986]) can be helpful for the digital mapping of imbalanced soil classes. (c) The use of one-class classification (OCC), in which a model is trained solely for the minority class and predictions are made for the whole study area; the model is then trained for the other classes in a separate stage, and predictions are again made for the whole study area (Perera et al., 2021). The combined predictions from the two stages yield the final soil map.
This approach can guarantee that the minority class will never be lost in the final map, no matter how skewed the class distribution is (Seliya et al., 2021). An example of this approach is OCC using a support vector machine (SVM) (Meyer et al., 2019; Noble, 2006; Pisner & Schnyer, 2020). In the literature, the XGB model has been used in several examples of digital mapping of continuous soil properties and has been demonstrated to be a powerful learner (Hengl et al., 2017; Reddy & Das, 2023; Zhang et al., 2022). In a categorical-variable example, Zhang, Shi, and Xu (2020) compared five machine learning algorithms, including K-nearest neighbour, multilayer perceptron neural network, random forest (RF), SVM and XGB, for mapping soil texture. They reported high computational efficiency, more meaningful predictions and, in some cases, higher accuracy of XGB compared to the RF model. In an example of soil class mapping, Meier et al. (2018) compared eight models, including RF, XGB, extra trees RF by randomisation, ranger RF, weighted subspace RF, SVM with linear kernel, SVM with polynomial kernel and bagged AdaBoost, and reported the highest kappa coefficient for the XGB model (0.48), though they did not find statistically significant differences between the models. In the particular case of mapping imbalanced soil classes, Taghizadeh-Mehrjardi et al. (2020) applied the XGB model and reported its efficiency relative to the RF model. Other than that example, the XGB model has not been applied to the specific problem of imbalanced soil class mapping.
The DT C5.0 model is a well-established method (Quinlan, 1986;Quinlan, 2004) for digital soil mapping (DSM) and has been reported to perform well for mapping categorical variables, such as soil classes (Lamichhane et al., 2021;Sharififar, Sarmadian, Malone, & Minasny, 2019;Xiao-Lin et al., 2011). However, the DT (C5.0) with cost sensitivity tuning for class predictions with regard to imbalanced class mapping has not been explored.
In the case of one-class classification, this approach has not been studied for DSM so far. On the other hand, OCC using SVM has been used in other studies, such as image classification, and in several other fields of science, such as medicine and engineering, and in fraud, fault and failure detection applications (Ao et al., 2017; Cyganek, 2012; Gao et al., 2020; Muñoz-Marí et al., 2010). OCC also has applications in novelty detection, outlier detection, class rarity, severe class imbalance and noisy data in various fields (Schölkopf et al., 1999; Seliya et al., 2021), and there are several methods for performing OCC (Alam et al., 2020; Wenzhu et al., 2019). One of the most successful OCC methods is the one-class SVM (Alam et al., 2020; Schölkopf et al., 2001). In the case of extreme class imbalance, the one-class SVM seems to have problem-solving capability, as it discriminates the target class from all other classes, building a calibrated model that is well representative of the minority classes.
As a result of the above-mentioned literature survey, the XGB, cost-sensitive C5.0 and one-class SVM models seem to have outstanding potential for dealing with the class imbalance problem in data mining and predictive machine learning (Fernández et al., 2018; He & Garcia, 2009). This study aims to evaluate the usefulness of three approaches, ensemble modelling (using an XGB algorithm), cost-sensitive learning and OCC, for solving the issue of imbalanced observations in digital soil class mapping in an area with a naturally imbalanced distribution of soil types in northwest Iran. In addition, the results of this study are compared to those of a similar study of the same area, in which resampling techniques were utilised to deal with the class imbalance problem in digital soil class mapping using the same data (Sharififar, Sarmadian, Malone, & Minasny, 2019).

| Study area and soil sampling
The study area is a typical semi-arid rural area covering approximately 12,000 ha. On a regular grid sampling scheme, 453 soil profiles were dug to a depth of up to 1.5 m for soil genetic horizon morphology and a range of physico-chemical analyses for taxonomic classification according to the United States Department of Agriculture (USDA) Keys to Soil Taxonomy (USDA, 2010). The soil types are mainly Aridisols and Entisols. The main land uses are agriculture and rangelands for livestock grazing. The mean annual rainfall is 271 mm, and the mean annual temperature is 15°C. The mean altitude is 255 m above sea level. The physiography includes plateaus and hills with piedmont plains. Figure 1 shows the study area position and sampling sites.
According to the USDA Key to Soil Taxonomy, the soils of the study area were classified into eight great groups: (A) Calcigypsids, (B) Argigypsids, (C) Natrigypsids, (D) Haplogypsids, (E) Haplocalcids, (F) Haplocambids, (G) Torrifluvents and (H) Torriorthents (Table 1) (USDA, 2010). The Torrifluvents (G) and Torriorthents (H) classes, with a substantially lower number of observations compared to other soil classes, were considered the minority soil classes. Figure 2 illustrates the frequency of the observed soil classes.

| Digital soil mapping (DSM)
Following the DSM approach of empirical predictive estimation (McBratney et al., 2003), a taxonomic soil class map of the study area is created by relating a set of environmental covariates (soil-forming or developing factors) to the target variable (soil class). To do this, three models, including ensemble XGB, CSDT (the C5.0 model) and one-class SVM combined with regular multi-class SVM classification (OCCM), were used to mathematically relate the selected covariates to soil classes. Figure 3 illustrates a diagram of the holistic study methodology.

FIGURE 1 Study area location and sampling sites. Adapted from Sharififar, Sarmadian, and Minasny (2019) with permission.

The soil profile data were split randomly, with 30% held back for validation and 70% for calibration. The splitting was done for each class separately to ensure that every class exists in both the calibration and validation datasets. Calibration was carried out by bootstrapping for 10 runs, resampling randomly with replacement.
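The per-class split and bootstrap procedure described above can be sketched as follows. This is a Python illustration of the idea only (the study itself used R, and its exact splitting code is not given in the text); the `max(1, …)` floor is an added assumption to keep every class represented in the validation set.

```python
import random

def split_per_class(labels, val_frac=0.3, seed=42):
    """Split sample indices per class so that every class appears in both the
    calibration and validation sets. Returns (cal_idx, val_idx)."""
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    cal_idx, val_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_val = max(1, round(len(idx) * val_frac))  # at least one obs per class held out
        val_idx.extend(idx[:n_val])
        cal_idx.extend(idx[n_val:])
    return sorted(cal_idx), sorted(val_idx)

def bootstrap_runs(cal_idx, n_runs=10, seed=0):
    """n_runs bootstrap resamples (with replacement) of the calibration indices,
    mirroring the 10-run bootstrap calibration used in the study."""
    rng = random.Random(seed)
    return [[rng.choice(cal_idx) for _ in cal_idx] for _ in range(n_runs)]
```

Each bootstrap resample trains one model; the resulting 10 models are then all evaluated against the single hold-out validation set.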
The environmental covariates used in this study include a digital elevation model (DEM), obtained from the freely available ASTER satellite image (ASTER GDEM, METI and NASA; http://earthexplorer.usgs.gov), and several terrain attributes derived from this DEM. These covariates (relative slope position, channel network base level, landform, surface texture and valley depth) were chosen based on the spatial variation and local expert knowledge of the study area. Covariates such as vegetation cover derived from Landsat satellite images were not effective in explaining the soil variation and were not used in the study. Several other covariates related to climate did not show significant variation across the study area and were also not considered. All the covariates have a 32 m × 32 m resolution (Sharififar, Sarmadian, Malone, & Minasny, 2019). SAGA® software was used to derive the covariates from the DEM. The following is a brief description of each covariate derived from the DEM:

Channel network base level: For this covariate, the vertical distance to a channel network base level is calculated. The algorithm in SAGA works by interpolating channel network base level elevations and subtracting the base levels from the original elevations (Conrad et al., 2015).
Landform: Using a topographic position (Weiss, 2001), DEM cell values are compared to the mean value of neighbouring cells. The mean value is calculated based on shapes defined by an algorithm or by the user. Positive values represent features that are higher than the surrounding features, and negative values represent features that are lower than the surrounding features. These calculated ranges are then used to generate classified landscape categories of landforms (Guisan et al., 1999;Wilson & Gallant, 2000).
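The topographic position calculation described above can be made concrete with a minimal sketch: each DEM cell is compared with the mean of its neighbours, with positive values indicating locally high (ridge-like) features and negative values locally low (valley-like) ones. This is a simplified Python illustration of the idea in Weiss (2001), not SAGA's implementation, which supports user-defined window shapes and the subsequent landform classification step.

```python
def topographic_position(dem, r=1):
    """Topographic position index: cell elevation minus the mean elevation of
    its neighbours within a (2r+1) x (2r+1) window (window clipped at edges).
    `dem` is a list of lists of elevations; returns a grid of the same shape."""
    rows, cols = len(dem), len(dem[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            neigh = [dem[x][y]
                     for x in range(max(0, i - r), min(rows, i + r + 1))
                     for y in range(max(0, j - r), min(cols, j + r + 1))
                     if (x, y) != (i, j)]
            out[i][j] = dem[i][j] - sum(neigh) / len(neigh)
    return out
```

The ranges of these values are then thresholded into classified landform categories.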
Relative slope position: This variable is computed using the relative relief position calculation, which computes the vertical distance below a terrain culmination and the distance above a terrain minimum to obtain relative slope positions. More detailed explanations of the algorithm and the fundamentals of this variable can be found in Conrad et al. (2015) and Freeman (1991).
Surface texture: Terrain surface texture emphasises fine versus coarse expressions of topographic spacing. The texture is calculated by extracting grid cells that outline the distribution of valleys and ridges. It is defined by both relief (feature frequency) and spacing in the horizontal distance. Each grid cell value represents the relative frequency (in percent) of the number of pits and peaks within a radius of 10 cells (Iwahashi & Pike, 2007).
Valley depth: It is calculated as the difference between the elevation and an interpolated ridge level. It is obtained by first defining the ridge cells, interpolation of the ridge level and then subtraction of the original elevations from the ridge level (Conrad et al., 2015).

| Extreme gradient boosting (XGB)
XGB is a supervised machine learning algorithm, which is nowadays among the most popular and most efficient boosting algorithms in data science (Asselman et al., 2021; Quinto, 2020; Shwartz-Ziv & Armon, 2022). In a simple representation, it can be defined as:

$$\hat{y}_i = \sum_{n=1}^{N} f_n(x_i), \quad f_n \in F,$$

where $\hat{y}_i$ is the prediction outcome, $f_n(x_i)$ is the input function of the $n$th DT, $N$ is the number of DTs, and $F$ is the set of all possible classification trees. The XGB objective function comprises two parts, the training error at each tree building and a regularisation term:

$$Obj = \sum_{i=1}^{m} l\left(y_i, \hat{y}_i\right) + \sum_{n=1}^{N} \Omega(f_n),$$

where $\sum_{i=1}^{m} l(y_i, \hat{y}_i)$ measures the difference between the predicted value and the real value over the $m$ training observations via the loss function $l$, and $\sum_{n=1}^{N} \Omega(f_n)$ is the regularisation term, with

$$\Omega(f_n) = \gamma T + \tfrac{1}{2}\,\lambda \lVert \omega \rVert^2,$$

where $T$ is the number of leaf nodes, $\omega$ is the score of a leaf node, $\gamma$ is the leaf penalty coefficient and $\lambda$ keeps the score of a leaf node from becoming too large (Chen & Guestrin, 2016).
XGB is a gradient tree learning ensemble algorithm that is an advanced implementation of the gradient boosting framework. In this model, a cycle of repeatedly building the model is performed by growing an initial weak tree, pruning (error check) and growing other trees (leaves) subsequently (Chen & Guestrin, 2016). The cycle begins by taking the previously constructed model and calculating the errors for each observation in the input dataset. Then a new model is built to predict these errors. Next, predictions from this error-predicting model are added to the ensemble of models. With the sequential model building and boosting parameters, this model can be an efficient algorithm for class prediction and dealing with imbalanced data. A gradient-boosting training process allows the model to efficiently learn from categorical data with a highly skewed distribution (Chen & Guestrin, 2016;Zhang, Tong, et al., 2020). It is fast in computation speed due to parallel computing and applies regularisation to prevent overfitting (Chen et al., 2019).
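The boosting cycle described above (compute the current ensemble's errors, fit a new model to those errors, add it to the ensemble) can be made concrete with a toy example. The sketch below uses one-dimensional inputs, squared loss and depth-one stumps; it illustrates the residual-fitting idea only and omits XGB's regularisation, pruning and parallelism.

```python
def fit_stump(x, residuals):
    """Best single-split (depth-one) regressor on 1-D inputs, minimising the
    sum of squared errors over all candidate split thresholds."""
    best = None
    for thr in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda xi: lm if xi <= thr else rm

def boost(x, y, n_trees=20, lr=0.3):
    """Gradient boosting with squared loss: each stump is fitted to the
    current residuals (the errors of the ensemble so far) and added with a
    shrinkage factor `lr`. Returns the fitted predictions for `x`."""
    pred = [sum(y) / len(y)] * len(y)  # initial constant model
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, resid)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return pred
```

Each iteration shrinks the remaining residuals, which is the "slow learning" behaviour the text describes; misclassified (high-residual) points dominate the next stump's fit.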
FIGURE 3 Holistic conceptual diagram of the study methodology.

Also, in the XGB, tree pruning grows the tree up to a maximum depth and then prunes backwards until the improvement in the loss function falls below a threshold, whereas in regular gradient boosting without this self-correction, tree growing stops as soon as a negative loss is encountered. In effect, the XGB converts weak learners into strong learners: trees are grown one after the other to learn slowly from the data and improve predictions in subsequent iterations. Each subsequent tree capitalises on the misclassifications of the previous tree and tries to reduce them, by giving higher weights to the points misclassified by the previous tree. This consideration of misclassification in each iteration is thought to be the most important factor for efficient learning from imbalanced classes (Fernández et al., 2018).
Compared with regular bagging approaches without self-pruning, such as the RF model, in which trees are fully grown to classify every possible category and variance is reduced to mitigate prediction errors, the XGB uses weak learners characterised by high bias and low variance (Niang et al., 2021). Hence, the XGB can potentially learn better from an imbalanced distribution of input classes. In the XGB, an initial tree is grown, and subsequent trees are grown only after pruning and error checking; an ensemble of pruned (error-checked) trees is then formed. Hence, this model can be highly capable of learning from imbalanced observations (i.e., patterns in the data that other algorithms often presume to be noise).
In this study, the xgboost package (Chen et al., 2019) was used in the R programming software (R Core Team, 2022) to carry out soil class prediction within the DSM approach. Like many other algorithms, the tuning parameters of the algorithm function can have a significant impact on the outcomes. As one of the most important parameters, the subsample ratio of the training instances (subsample) was set to 0.7 to decrease overfitting and bias in predictions. The maximum depth of a tree was set to 20, the learning rate (eta) to 0.9, the maximum number of boosting iterations (nrounds) to 15, the minimum loss reduction (gamma) to 0.01 and the subsample ratio of columns (colsample_bytree) to 0.9. These parameters were tuned empirically for our data; after trials, the tuning values were chosen that gave the best prediction accuracy for the imbalanced classes and ensured that the minority classes were predicted. In fact, tuning such parameters in the XGB function makes the model capable of learning more efficiently from the imbalanced classes and yielding the desired predictions for this particular purpose. Readers are referred to the xgboost R package (Chen et al., 2019) for further information on the parameters that can be tuned. Further details on XGB and its potential for imbalanced classification can be found in Chen and Guestrin (2016).

| Cost-sensitive decision tree (CSDT)
The C5.0 function in the C50 package was used in the R programming software to run a CSDT model based on Quinlan's (1986) DT framework and its extensions (Kuhn et al., 2018). This method uses tree structures to build classification models, dividing a dataset into smaller subsets. A leaf node represents a decision within the structure. The DT classifies categories based on their feature values in the tree structure: each node represents a feature of a category to be classified, and each branch represents a value. Classification starts from the root node, and categories are sorted based on their feature values (Pandya & Pandya, 2015; Wu et al., 2008). In the cost-sensitive approach, the algorithm tries to optimise the class predictions by minimising the total misclassification cost (i.e., error). In this study, the DT algorithm was tuned by incorporating sensitivity to classification costs (prediction errors), so that misclassification does not incur an equal cost for every soil class. For example, if the few observations of a minority class are misclassified as another class (e.g., a majority class), the minority class will be lost from the final model predictions (the soil class map). Hence, this misclassification should carry a higher cost than misclassifying a majority class as another class. Therefore, in order to make the fewest misclassifications, the algorithm learns from data subsamples, computing errors as the DT proceeds so as to reduce them within its prediction process. This was made possible by adding a cost matrix to the C5.0 function in the C50 R package. The cost matrix used for this study was defined based on a comparative judgement of the inverse number of observations for each soil class.
In fact, the cost is defined as the penalty associated with an incorrect prediction for a given class. The cost matrix for the eight soil classes in this study is presented in the Appendix of this paper (Table A1).
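As an illustration of the cost-sensitive idea, the sketch below builds a hypothetical cost matrix from inverse class frequencies (the exact Table A1 values are not reproduced here) and then picks the class with the lowest expected misclassification cost rather than simply the most probable class.

```python
def inverse_frequency_costs(counts):
    """Misclassification cost matrix keyed by (true class, predicted class),
    scaled by the inverse frequency of the true class so that losing a rare
    class is penalised more. Correct predictions cost zero. A hypothetical
    construction in the spirit of the paper's inverse-observation judgement,
    not the actual Table A1 matrix."""
    classes = sorted(counts)
    n_max = max(counts.values())
    return {(i, j): (0.0 if i == j else n_max / counts[i])
            for i in classes for j in classes}

def min_cost_class(probs, costs):
    """Choose the class with the lowest expected misclassification cost,
    given class-membership probabilities `probs` (dict class -> p)."""
    classes = sorted(probs)
    def expected(j):
        return sum(probs[i] * costs[(i, j)] for i in classes)
    return min(classes, key=expected)
```

With `counts = {'A': 100, 'G': 5}` and `probs = {'A': 0.8, 'G': 0.2}`, a plain argmax would predict the majority class A, while the minimum-expected-cost rule predicts the minority class G, which is exactly the behaviour a cost-sensitive tree is pushed towards.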

| One-class classification
For dealing with imbalanced soil class mapping, the one-class SVM for the minority classes is combined with an SVM model for the rest of the soil classes. The one-class SVM detects the often-lost minority classes, which most machine-learning calibrated models miss, by separating them from the rest of the classes. It computes an optimal hyperplane such that all training data patterns are categorised into the target minority class. The optimal hyperplane is a linear combination of the training data patterns positioned near or on the hyperplane, which are called support vectors (Lecomte et al., 2011; Xing & Ji, 2018).
The purpose behind using OCC is to calibrate a model that enables the recognition of the minority class, which is otherwise highly likely to be lost in the predictions of a multi-classification model. A calibrated model in OCC represents the minority class (in our case, minority classes G and H separately). The covariates across the whole study area remain constant, and the minority class is trained on the same set of covariates to enable its prediction throughout the study area. The predictions from OCC are then added to the predictions obtained from a regular multi-class prediction using a model trained for the other remaining soil classes. In this regard, one-class SVM was used in two stages, including one-class model training using classes G and H separately and multi-class training using the remaining classes. In this approach, we can make sure that our minority soil class observations are included in the mapping process. Figure 4 illustrates a flowchart for performing OCC for the digital mapping of soil classes in this study.
To carry out OCC in the SVM model, the minority class (the true positive class in the confusion matrix) is labelled as the true class, and all other classes are labelled as the false class. The model is then trained on these two categories. The built model is then used to predict the unvisited pixels of the study area using the geographically corresponding covariate data.
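The labelling and merging logic of the two-stage OCCM procedure can be sketched as below. The precedence rule used here (one-class detections override the multi-class prediction) is an illustrative assumption consistent with the goal of retaining the minority classes, not necessarily the authors' exact merge rule.

```python
def one_class_labels(labels, target):
    """Relabel observations for a one-class run: the target (minority) class
    becomes True and every other class becomes False."""
    return [lab == target for lab in labels]

def combine_predictions(multi_pred, occ_preds):
    """Merge per-pixel maps: start from the multi-class prediction and let
    each one-class detection override it with its minority label.
    multi_pred: {pixel_id: class}; occ_preds: {minority_label: {pixel_id: bool}}."""
    final = dict(multi_pred)
    for label, flags in occ_preds.items():
        for px, detected in flags.items():
            if detected:
                final[px] = label
    return final
```

In this study's terms, the one-class stage would be run separately for classes G and H, and the multi-class stage for the remaining six classes, before merging.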
An SVM is a supervised machine learning algorithm that positions data in an n-dimensional feature space and constructs hyperplanes or lines that separate the data into categories. The characteristics of these categories are used to predict the group to which a class belongs. Support vectors are the data points closest to the hyperplane, which influence its position and orientation. This model can work well on data with linear and non-linear relationships (Wang, 2005), such as soils with non-linear spatial variation. This approach was performed using the svm function (one-classification and C-classification prediction types) in the e1071 package (Meyer et al., 2019) in the R programming software. The kernel applied in this study was the radial basis function. More information about OCC and SVM can be found in Madden (2014, 2009) and Suthaharan (2016); some explanation of the mathematical expression of a one-class SVM can be found in Xing and Ji (2018).

| Validation statistics
To assess the accuracy of the models, four measures, the Kappa coefficient of agreement, overall accuracy, producer's accuracy and user's accuracy (Congalton, 1991), were derived from the validation dataset for each of the 10 bootstrap calibration models. The final accuracy results are the averages of the 10 evaluations. The Kappa coefficient measures the difference between the observed agreement and the agreement expected by chance:

$$K = \frac{p_0 - p_e}{1 - p_e},$$

where $p_0$ is the overall (observed) accuracy and $p_e$ is the expected accuracy, defined as:

$$p_e = \frac{1}{TO^2} \sum_{i=1}^{n} colSum_i \times rowSum_i.$$

Here, $colSum_i$ and $rowSum_i$ are the column and row sums for class $i$ in the confusion matrix, $TO$ is the total number of observations, and $n$ is the number of classes. Overall accuracy is the total number of correctly predicted observations divided by the total number of observations. The producer's accuracy for a class is the number of correct predictions for that class divided by the total number of observations of that class. The user's accuracy is the number of correct predictions in a class divided by the total number of predictions made for that class.
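These four measures can be computed directly from a confusion matrix; the sketch below is a straightforward Python implementation of the definitions above (Congalton, 1991).

```python
def accuracy_measures(confusion, classes):
    """Overall accuracy, Kappa, and per-class producer's/user's accuracy from
    a confusion matrix given as counts keyed by (observed, predicted)."""
    total = sum(confusion.values())
    diag = sum(confusion.get((c, c), 0) for c in classes)
    row = {c: sum(confusion.get((c, p), 0) for p in classes) for c in classes}  # observed totals
    col = {c: sum(confusion.get((o, c), 0) for o in classes) for c in classes}  # predicted totals
    oa = diag / total                                   # overall (observed) accuracy p0
    pe = sum(row[c] * col[c] for c in classes) / total ** 2  # chance agreement
    kappa = (oa - pe) / (1 - pe)
    producers = {c: confusion.get((c, c), 0) / row[c] for c in classes if row[c]}
    users = {c: confusion.get((c, c), 0) / col[c] for c in classes if col[c]}
    return oa, kappa, producers, users
```

For example, a two-class matrix with 40/10 and 5/45 counts gives an overall accuracy of 0.85 and a Kappa of 0.7.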

| Variable importance analysis
In order to find out which covariates were relatively more important in each model outcome, a variable importance analysis (also known as feature importance or covariate importance) was carried out. For the OCC in the SVM model, we used the importance function of the rminer package (Cortez & Cortez, 2016) after fitting the model. For the XGB model, the xgb.importance function in xgboost was used, and for the C5.0 model, we used the C5imp function in the C50 R package.

| RESULTS

Table 2 shows the average Kappa and overall accuracy values on the calibration and validation datasets for the three models: XGB, cost-sensitive C5.0 (CSDT) and combined one-class and multi-class classification support vector machines (OCCM). All the studied models showed high prediction accuracy on both the calibration and validation datasets, with OCCM showing the highest values (Kappa = 0.76 and overall accuracy = 82%) on the validation dataset. Table 3 replicates the results of the DT, multinomial logistic regression (MNLR) and RF models for the same study area and data input, with and without data pre-treatment using resampling techniques (Sharififar, Sarmadian, Malone, & Minasny, 2019). Comparing the results of the present study with the previous study shows a remarkable improvement in the overall prediction accuracies for mapping the same eight soil classes using the newly suggested models. Compared with the data resampling approach, the XGB shows a 464% increase in overall accuracy over the RF model (from the previous study) in the external validation stage. In addition, the CSDT model shows a 176% increase in overall accuracy compared to the regular DT with the data resampling technique. Likewise, the OCCM shows a 183%, 148% and 486% increase in overall accuracy compared to the DT, MNLR and RF models, respectively, from the previous study (Tables 2 and 3). The maps produced by the three models are shown in Figure 5.
Tables 4 and 5 show the producer's and user's accuracy results using the validation dataset for digital mapping of the imbalanced soil classes in the current and previous studies, respectively. The accuracy values for each class show that the present study's approaches provide much higher accuracy for almost all classes compared to the DT, MNLR and RF outcomes using imbalanced data treated with resampling techniques. In particular, the minority class G (Torrifluvents) has been correctly predicted and preserved in the final maps obtained from the XGB, CSDT and OCCM models. It has to be noted that class H (Torriorthents) is problematic, as it has a total of only two observations among the 336 observations in the calibration dataset. Among the three studied models, only OCCM was able to predict this class. The previous study did predict the minority class, but incorrectly.

FIGURE 4 Flowchart of the one-class SVM combined with multi-class SVM (OCCM) procedure. (The figure shows a single-run process of the model. 1: The random sampling was carried out for each class separately; the percentage for each class is shown in Table 1. The overall percentages of the dataset used for validation and calibration were 30% and 70%, respectively. 2: The final map is the average of 10 bootstrap calibration runs. 3: The validation was carried out by evaluating the 10 maps, produced by the 10 bootstrap calibration models, against the hold-out dataset.)

| Prediction accuracy for all classes individually
3.3 | Importance of the covariates for predictive mapping of soil classes Figure 6 shows the plots of the feature importance analysis outcomes. The analysis showed that for the XGB model, valley depth followed by relative slope position were the most important covariates for the digital mapping of soil classes in the study area. For the CSDT model, all covariates showed relatively high importance, with valley depth, channel network base level and the DEM being the most important ones. Also, for the OCCM model, valley depth followed by surface texture were the most important covariates for predicting soil classes.
Variable importance analysis in predictive mapping is useful not only for identifying the covariates that contributed most to the map, but also for interpreting the spatial patterns in the final soil maps and for comparing the spatial patterns and structures created by different models. The spatial patterns of the produced maps are usually similar to the spatial structure of the covariates used for mapping.

4 | DISCUSSION
In the previous study, even after balancing the data, overfitting was observed, with high accuracy values for calibration and relatively low accuracy on the validation dataset. The approaches of the present study have resolved the overfitting issue, as evidenced by the relatively high accuracy values obtained on the validation dataset. This issue is important in mapping soil classes or any other soil characteristic, as overfitting results in misleading soil maps, with some data points (or categories) being over-predicted and an artificially vast area of the map being assigned to a specific class or a given range of continuous values (Sharififar, 2022; Sharififar, Sarmadian, Malone, & Minasny, 2019). In addition, data resampling might not provide the optimal solution, as the minority class(es) may still be lost in the final map.

TABLE 2 Results of overall prediction accuracy (kappa and overall accuracy) for the three suggested models. Notes: (a) OCCM, one-class classification combined with multi-class classification using the support vector machine algorithm. (b) Evaluation was based on random data splitting (30% for validation and 70% for calibration), done manually per class, with repeated calibration using 10 bootstraps; the reported accuracies are averages of evaluating the 10 bootstraps against the hold-out validation dataset. (c) Standard deviations of the average values were computed over the 10 evaluations of each model.

TABLE 3 Overall accuracy results of the predictive models for the balanced (using resampling techniques) and imbalanced datasets of the previous study.
The producer's accuracy is the ratio of correct predictions to the actual observations of a class; it can equal zero but can never be undefined (not a number, NaN). The user's accuracy, in contrast, is the ratio of correct predictions to all predictions made for a given class, which is NaN if that class has not been predicted at all. This can happen in DSM with imbalanced class data, resulting in an uncertain map with some soil types missing. In the previous study, which used the data resampling approach, the minority classes improved from not being predicted at all to being included in the outcome, but the predictions were still incorrect. With the approaches of the present study, however, the minority class G (Torrifluvents) was correctly predicted by all three models and the minority class H (Torriorthents) was correctly predicted by the OCCM model.
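The distinction between the two measures can be made concrete with a short sketch (plain Python; the class labels and counts are illustrative, not the study's data): producer's accuracy divides by the number of actual observations of a class, so it is always defined, whereas user's accuracy divides by the number of predictions of that class and becomes NaN for a class the model never predicts.

```python
def producer_user_accuracy(y_true, y_pred, classes):
    """Producer's accuracy (PA): correct / actual observations of the class.
    User's accuracy (UA): correct / all predictions made for the class;
    UA is NaN when the class was never predicted."""
    pa, ua = {}, {}
    for c in classes:
        n_actual = sum(1 for t in y_true if t == c)
        n_predicted = sum(1 for p in y_pred if p == c)
        n_correct = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        pa[c] = n_correct / n_actual if n_actual else float('nan')
        ua[c] = n_correct / n_predicted if n_predicted else float('nan')
    return pa, ua

# Class H is observed once but never predicted: PA = 0, UA = NaN.
y_true = ['A', 'A', 'A', 'B', 'B', 'H']
y_pred = ['A', 'A', 'B', 'B', 'B', 'A']
pa, ua = producer_user_accuracy(y_true, y_pred, ['A', 'B', 'H'])
```

This is exactly the failure mode discussed above: a map from which a minority class is absent yields an undefined user's accuracy for that class, while its producer's accuracy is simply zero.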
Standard deviations (SD) of the means of the accuracy measures are comparable for different models ( Table 2). The standard deviation values were higher for the XGB model compared with CSDT and OCCM models. The higher SD shows higher variance among bootstraps (10 maps produced for each algorithm), which could be due to inconsistent representation of the minority classes among the bootstraps. The relatively lower SD values for the CSDT and OCCM models could be a result of more consistency in the representation of minority classes and a greater presence of these classes compared with the XGB model. This in turn decreases the variance among bootstrap calibrations (maps produced in each run).
For validation of this study's outcomes, we used data splitting (the hold-out method) with repeated calibration by bootstrapping. This method of validation might not be optimal compared with repeated k-fold cross-validation in terms of sampling probability and bias (Brus et al., 2011). However, in our case, owing to the severe imbalance in soil classes and the small sample sizes of some classes, data splitting was found to be the most suitable and applicable approach for validation. Empirically, data splitting could be considered acceptable, but it might not generalise well.
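The per-class manual split and bootstrap calibration described above can be sketched as follows (a minimal Python illustration; the class counts are toy values, and the study performed its split manually rather than with this exact procedure). Splitting each class separately is what keeps rare classes represented on both sides of the split.

```python
import random

def per_class_split(y, calib_frac=0.7, seed=0):
    """Split sample indices into calibration/validation separately for each
    class, so every class (however rare) appears in the calibration set."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    calib, valid = [], []
    for label, idx in by_class.items():
        rng.shuffle(idx)
        n_cal = max(1, round(calib_frac * len(idx)))  # at least 1 per class
        calib += idx[:n_cal]
        valid += idx[n_cal:]
    return sorted(calib), sorted(valid)

def bootstrap_samples(calib_idx, n_boot=10, seed=1):
    """Draw n_boot bootstrap resamples (with replacement) of the calibration
    indices; one model is fitted per resample, each is evaluated against the
    same hold-out set, and the accuracy measures are then averaged."""
    rng = random.Random(seed)
    return [[rng.choice(calib_idx) for _ in calib_idx] for _ in range(n_boot)]

# Imbalanced toy labels: 100 of A, 10 of B, only 2 of H.
y = ['A'] * 100 + ['B'] * 10 + ['H'] * 2
calib, valid = per_class_split(y)
boots = bootstrap_samples(calib)
```

A single random 70/30 split of the pooled data could easily leave class H with no calibration observations at all; the per-class split guarantees it at least one.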
For a highly skewed soil class distribution, OCC (combined with multi-class classification) can be the solution that guarantees the retention of the minority class in the final map. Also, cost-sensitive learning and an ensemble model with deep trees and sequential gradient learning (i.e., the XGB model) appear to handle the imbalanced class data input in our case study, as they have shown significantly higher general and individual class accuracies compared to DT (non-cost sensitive), MNLR and RF models, even after their data input was treated by resampling.
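The combination logic of the OCCM procedure (Figure 4) can be illustrated schematically: a one-class model trained only on the minority class screens each location first, and everything it rejects falls through to the multi-class model. The sketch below is a deliberately simplified Python stand-in: a centroid-and-radius rule replaces the one-class SVM, and a nearest-centroid rule replaces the multi-class SVM; the data and the `margin` parameter are hypothetical.

```python
def centroid(points):
    return [sum(c) / len(c) for c in zip(*points)]

def dist2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

class OCCMSketch:
    """Schematic OCC + multi-class combination: a one-class detector
    (centroid-and-radius, standing in for a one-class SVM) screens for the
    minority class first; all other samples fall through to a multi-class
    model (nearest centroid, standing in for a multi-class SVM)."""

    def fit(self, X, y, minority, margin=1.5):
        self.minority = minority
        m_pts = [x for x, t in zip(X, y) if t == minority]
        self.m_centroid = centroid(m_pts)
        self.radius2 = margin * max(dist2(x, self.m_centroid) for x in m_pts)
        self.centroids = {c: centroid([x for x, t in zip(X, y) if t == c])
                          for c in set(y) if c != minority}
        return self

    def predict(self, x):
        if dist2(x, self.m_centroid) <= self.radius2:  # one-class stage
            return self.minority
        return min(self.centroids, key=lambda c: dist2(x, self.centroids[c]))

# Majority classes A and B, minority H with only two observations.
X = [(0, 0), (0, 1), (1, 0), (5, 0), (5, 1), (6, 0), (10, 10), (10.5, 10.5)]
y = ['A', 'A', 'A', 'B', 'B', 'B', 'H', 'H']
model = OCCMSketch().fit(X, y, minority='H')
```

Because the minority class is detected by its own dedicated model rather than competing with the majority classes inside one classifier, it cannot be "outvoted" out of the map, which is the sense in which OCC guarantees its retention.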
Comparing the bagging approach (e.g., the RF model) with boosting (e.g., the XGB model): RF is an ensemble model and is known to be a powerful algorithm that reduces variance through data resampling. In XGB, by contrast, an iterative learning cycle is carried out in which the model makes an initial prediction, analyses its own mistakes and then gives more weight in the next iteration to the data points it predicted incorrectly. This iterative process yields predictions with a high degree of certainty that arise not from random chance but from learning the patterns in the data. Boosting usually avoids predictions by random chance, whereas RF bagging, as an ensemble of trees, aggregates the outputs of all the trees, each of which has a high probability of chance predictions that can be affected by local patterns in the data, such as class imbalance. This can very often lead to the loss of a minority class in the final output or to overfitting a majority class (Sharififar, Sarmadian, Malone, & Minasny, 2019). In XGB, trees are pruned rather than grown deeper when further splits yield little gain, which prevents overfitting the majority class and handles the minority class more efficiently than bagging approaches. In RF, the trees are likely to be built on similar samples (such as the majority classes in cases of imbalanced data input), which can overfit a majority class and, in turn, lead to the loss of minority classes in the final map.

FIGURE 5 Produced soil class maps using the three models: extreme gradient boosting (XGB), cost-sensitive decision tree (CSDT) and one-class classification combined with multi-class classification support vector machine (OCCM). Letters A-H refer to the soil classes defined in Table 1.

TABLE 4 Results of individual class prediction accuracies using the validation dataset.

Cost-sensitive learning performed well, with fairly high accuracy, and correctly predicted the minority class G (Torrifluvents). However, training this model for the minority class H (Torriorthents) is hard, even when a high cost is defined for its misclassification. With a substantially lower number of observations for minority classes, fitting the model is a challenge, and assigning relatively high cost values might not solve the problem of relating those few points to the covariates. In such cases, OCC can be an alternative for solving the problem.
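The mechanism by which a misclassification cost shifts predictions towards a minority class can be shown in a few lines (a generic minimum-expected-cost decision rule in Python; the probabilities and the 10x cost factor are illustrative, not the costs used in the CSDT model).

```python
def min_expected_cost(probs, cost):
    """Pick the class whose prediction minimises the expected
    misclassification cost: sum over true classes i of P(i) * cost[i][j],
    where j is the predicted class."""
    classes = list(probs)
    return min(classes,
               key=lambda j: sum(probs[i] * cost[i][j] for i in classes))

# The model is 80% sure of majority class A, 20% of minority class H.
probs = {'A': 0.8, 'H': 0.2}

# Uniform 0/1 costs: the majority class wins.
uniform = {'A': {'A': 0, 'H': 1}, 'H': {'A': 1, 'H': 0}}

# Cost-sensitive: missing a true H is 10x as costly as missing a true A,
# so the decision flips to the minority class.
weighted = {'A': {'A': 0, 'H': 1}, 'H': {'A': 10, 'H': 0}}
```

This also makes the limitation discussed above visible: if the model assigns the minority class a probability near zero (as can happen with only two observations, cf. class H), no realistic cost ratio will flip the decision, which is where OCC becomes the better alternative.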
Few studies have compared or evaluated these or similar approaches for dealing with imbalanced data in soil science; those that address the problem have done so mainly by applying resampling techniques to improve DSM accuracy. For example, synthetic resampling as a data pre-treatment method has been explored for improving soil class mapping (Brungard et al., 2015; Taghizadeh-Mehrjardi et al., 2020). Within an extrapolation procedure (using the Homosoil concept), resampling was also reported to improve soil mapping accuracy (Neyestani et al., 2021). In terms of hyperparameter tuning, the XGB parameters, like those of many other algorithms, can cause significant changes in prediction accuracy (Taghizadeh-Mehrjardi et al., 2021). Parameter tuning needs to be explored in a separate, systematic framework to find the optimal configuration of the models for dealing with class rarity or imbalance in DSM.
In addition to class imbalance and parameter tuning, the choice of covariates can influence the predictive accuracy for minority classes. Depending on their algorithmic structure, different models may apply different selection rules to the covariates as conditioning data and may weight them differently during calibration, producing differences in the final maps. For soil class mapping, covariates can be chosen on the basis of expert knowledge, after relating soil profile investigations to the spatial variation seen in satellite images or in any available maps related to soil formation and development. At the landscape or sub-regional scale (e.g., the present study), the extent of the study area might not be large enough to include a variety of covariates such as climatic variables and vegetation cover, so the choice of covariates is limited. In addition, the availability of covariate maps or images with the desired spatial, temporal and spectral resolutions can be a challenge for some areas. In future studies, the effect of the choice of covariate sets on soil class mapping needs to be evaluated more systematically.

FIGURE 6 Feature importance analysis results for extreme gradient boosting (XGB), cost-sensitive decision tree (CSDT) and one-class classification combined with multi-class classification support vector machine (OCCM) for soil class prediction in the study area.

5 | CONCLUSIONS
This study demonstrated the usefulness of three methods, XGB, CSDT learning and OCC of the minority class combined with multi-class classification (OCCM), for dealing with an imbalanced soil class data distribution. All three models were capable of coping with the problem of losing the minority class in predictive mapping. All the suggested models showed significantly higher accuracy for both overall and individual class predictions compared with the previous study that used resampling techniques. In addition, the newly suggested models do not carry the drawbacks associated with synthetic resampling techniques.
In the validation stage, the OCCM model showed the highest kappa (0.76) and overall accuracy (82%), followed by the CSDT with kappa = 0.74 and overall accuracy = 80%. OCC can be recommended as an optimal method for retaining the minority class in predictive mapping.
Note: Definitions of the letters denoting soil classes are provided in Table 1.

AUTHOR CONTRIBUTIONS
Amin Sharififar: Conceptualization; investigation; writing - original draft; methodology; validation; visualization; writing - review and editing; software; data curation; formal analysis. Fereydoon Sarmadian: Writing - review and editing; resources.