Lithofacies identification of shale formation based on mineral content regression using LightGBM algorithm: A case study in the Luzhou block, South Sichuan Basin, China

Lithofacies form the basis for evaluating shale gas fields and play an important role in gas reservoir enrichment. The accurate identification of shale lithofacies is key for exploration and development. Based on well‐logged data, the accuracy of mineral content prediction using machine‐learning regression models is not ideal. Therefore, feature derivation was introduced to enhance the correlation between minerals and lithofacies and improve the data expression ability. Four machine‐learning models for mineral regression were established based on feature‐derived data sets: LightGBM, XGBoost, artificial neural network, and support vector machine. By calculating the evaluation metrics of each model, we found that LightGBM had the best prediction performance. To compare and confirm the accuracy of the model in identifying lithofacies, this study established a new method, MT‐LightGBM, which combines the LightGBM mineral content regression model with mineral ternary diagrams to identify lithofacies. By using the MT‐LightGBM model and LightGBM classification models to identify the target lithofacies, it was found that the accuracy of lithofacies identification of MT‐LightGBM reached 94%. This accuracy is high and is of great significance for understanding and evaluating underground shale reservoirs.


| INTRODUCTION
Lithofacies are inherent characteristics of shale that contain important geological information about reservoirs. The identification of lithofacies can aid gas exploration personnel in understanding reservoir characteristics, evaluating reservoir performance, and determining oil and gas reserves and production plans. Therefore, it is important to comprehensively understand shale lithofacies. The reliability of traditional manual qualitative and quantitative lithofacies interpretation methods, which are based on the response characteristics of logging curves, depends on the interpreter and on the complexity of the profile. These methods are significantly influenced by human factors, have a low interpretation efficiency, and are limited by the number of mineral components in the formation, making them unsuitable for use in complex lithologic reservoirs. With the widespread adoption of artificial intelligence in various fields, experts and scholars have proposed new approaches to efficiently and accurately identify lithofacies based on geological data combined with machine-learning methods. 1,2
The origin of machine learning can be traced back to the 1950s, with early machine-learning algorithms including linear regression and perceptron models. In the 1970s, machine learning was applied to speech and image recognition with methods such as k-nearest neighbors. With the popularity of the internet, machine learning has reached a new peak, with algorithms such as neural networks, support vector machine (SVM), and random forest being widely used. In the 1980s, these algorithms were introduced to the petroleum industry to improve work efficiency. Machine-learning algorithms find extensive applications across diverse domains and phases within the petroleum industry, including reservoir geology and engineering, as well as oil and gas exploration, development, and production. 3 Integration with machine learning will be a core research topic and hotspot in the construction and development of artificial intelligence in the oil and gas field. 4
Cluster analysis algorithms are primarily used to classify and identify lithologies based on strong correlations with geological features. Wang et al. 5 established a new KNN clustering method by weighting the cosine distance. Its predicted lithology agrees better with the lithological profile than that of traditional KNN clustering when identifying sand, and it fits the lithology model better. Jing et al. 6 proposed a K-means dynamic clustering algorithm that combines rock mechanical properties to identify lithology. 7 Liu et al. 8 used a multiresolution graph clustering (MRGC) method to optimize logging data that were more sensitive to electrical phase clustering analysis. Geophysical features obtained from the electrical phase were linked to the lithology to select the dominant lithology. Subsequently, they proposed a method to divide the lithology of shale sections by combining the MRGC algorithm with the sedimentary structure and mineral composition; this method is suitable for lithology classification research. 9 However, cluster analysis involves high computational complexity and a time-consuming computation process, and the results of lithology identification rely on the sample data volume and initial parameters, making its practical application complex.
Ameur-Zaimeche et al. 10 discussed the application of a multilayer perceptron neural network (MLPNN) method to reconstruct a noncore lithofacies division algorithm. Feng et al. proposed a Bayesian neural network (B-ANN) lithofacies identification algorithm. Sun et al. 11 used a backpropagation (BP) neural network combined with a gradient-descent algorithm to establish a model for the automatic identification of sandstone lithology. Li et al. 12 proposed a memory-recursive neural network method for lithological identification, achieving good results in complex reservoir rock classifications. Although neural network algorithms have better classification performance, the results are highly dependent on the number of sample parameters and feature parameter extraction. 11 A new lithology identification method based on an SVM (LDM KL-SVM) was proposed to conduct multicore deep learning. 9 Al-Anazi and Gates 13 used SVM algorithms to classify sand and mud reservoirs. 10 Gao and Jiao 14 proposed a method for identifying rock types by combining three-dimensional vibration signal mixed-domain features with an SVM. This method transforms complex nonlinear problems into simple linear problems. However, the accuracy of the SVM in identifying rock types depends on the selection of the kernel function and penalty parameters. The improper selection of these two parameters can result in poor recognition performance and low accuracy. 15
Scholars have begun to focus on decision tree ensemble algorithms, which have strong interpretability, low data sample requirements, strong robustness, and suitability for large-scale data, to address the shortcomings of the single-model machine-learning methods mentioned above. Chen and Guestrin 16 elucidated the loss function of decision trees and created an XGBoost model to identify lithology. Zou et al. proposed a method based on ore deposit logging records, in which a gradient-boosting decision tree algorithm was used to establish a corresponding model for lithological identification. Sun et al. 17 experimentally demonstrated that the XGBoost decision tree model performed well in lithology prediction. However, decision tree algorithms have a large memory footprint and are less efficient at processing many variables, which can lead to overfitting. 18 Therefore, a new algorithm, LightGBM, was developed based on XGBoost. It has a small memory footprint, efficient training speed, and accurate prediction ability; it can handle large-scale data and high-dimensional features; and it provides flexible parameter adjustment and parallelization functions, which can effectively improve model performance and efficiency. 19,20
Shale gas has been a hot topic in recent years; however, few scholars have researched the automatic recognition of shale lithofacies. Therefore, more in-depth studies are required. Bhattacharya et al. 21 used SVM, artificial neural network (ANN), SOM, and MRGC algorithms to identify the lithofacies of the Bakken and Mahantango Marcellus shale. The comparison showed that the optimal results were achieved using SVM. 21 Hou et al. 22 proposed the use of MLP, SVM, XGBoost, and RF models to identify the shale lithofacies of Gulong shale in the Songliao Basin. The results showed that XGBoost and RF performed best. Wang et al. 4 used the SHAP method to quantify the importance of logging parameters and established a shale lithofacies identification model based on the RF algorithm. Zhao et al. 23 proposed a shale lithofacies recognition method based on TAN to classify the shale lithofacies of the Longmaxi Formation in the Changning Block. All these studies used logging data to establish lithofacies classification models without analyzing the mineral components. Although they achieved relatively good results, the recognition of shale lithofacies requires further improvement.
This study aims to address the issues of insufficient accuracy and low efficiency in identifying shale lithofacies based on logging data. We adopted the LightGBM ensemble learning algorithm combined with experimental geochemical and logging data and introduced feature engineering to establish a new method, MT-LightGBM. This method combines a mineral regression model and a mineral ternary plot to improve the accuracy and efficiency of lithofacies identification. This model performed well in identifying shale lithofacies in the Luzhou area.

| Geological background
The Sichuan Basin, situated on the northwest edge of the Yangtze Platform, is a crucial geological region extending across multiple fault fold belts, including the Chuanxiang, Longmen Mountains, Micang-Daba Mountains, EmeiWa Mountains, and Lou Mountains. This region exhibits a complex multilevel structure with distinct tectonic characteristics. Based on these characteristics, the area can be divided into six secondary structural belts: the North Sichuan low and gentle structural belt, East Sichuan high and steep structural belt, Central Sichuan gentle structural belt, West Sichuan low and steep structural belt, Southwest Sichuan low and steep structural belt, and South Sichuan low and steep structural belt. 23 The research area is located in the Luzhou area of the southern Sichuan Basin. Structurally, it belongs to the southern Sichuan low-steep structural belt on the southern side of the central Sichuan paleo-uplift, which was formed by the activity of basement faults that developed from deep to shallow layers, including hidden faults that dominate the folds. These fault-fold structures resulted from the uplift of the Qinghai-Tibet Plateau during the Cenozoic. The Wufeng-Longmaxi Formation was deposited in the center of the deepwater shelf, with a depositional thickness of 500-650 m. The reservoir of the Longmaxi Formation mainly develops black or grayish-black thin-layer shale or block shale, whereas the Wufeng and Longmaxi Formations are in conformable contact, primarily developing black or gray-black shale (Figure 1).

| Lithofacies interpretation
In this study, the Longmaxi Formation in the Luzhou Block of the Sichuan Basin was selected as the research subject. We collected core photographs and experimental analysis data from gas wells. To analyze the rock type, mineral composition, and particle shape of shale slices, we conducted quantitative tests on the mineral composition and on the type and content of clay minerals present in the samples using a PANalytical X'Pert PRO MPD X-ray diffractometer. Additionally, the total organic carbon (TOC) content of the samples was analyzed using a LECO CS230 carbon/sulfur analyzer. The core length for our study was 364.35 m, and 968 core test samples were collected at depths ranging from 3400 to 4300 m.
To classify shale lithofacies in the study area, we first established a ternary diagram of lithofacies classification. We then projected the sample data onto this diagram and categorized the shale lithofacies based on the relative contents of siliceous, carbonate, and clay minerals. Our analysis divided the lithofacies of the study area into five categories: calcium-bearing siliceous shale (CBSS), clay-siliceous shale (CSS), calcareous mixed shale (CCMS), clay-mixed shale (CMS), and mixed shale (MS) (Figure 2).
The mineral contents of the five types of lithofacies were determined and a petrographic division scheme suitable for the study area was established according to the distribution range of the minerals and the actual situation of the study area (Table 1).
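The ternary projection described above can be sketched as a small rule-based classifier. This is only an illustrative sketch: the cutoff values `SI_CUTOFF` and `MIXED_CUTOFF` and the order of the rules are hypothetical placeholders, not the boundaries of Table 1, and the function names are invented for this example. The actual division scheme of the study area should be substituted before any use.

```python
from typing import Tuple

# HYPOTHETICAL cutoffs in percent, for illustration only.
# They are NOT the Table 1 boundaries of the study.
SI_CUTOFF = 50.0
MIXED_CUTOFF = 40.0


def ternary_fractions(si: float, car: float, clay: float) -> Tuple[float, float, float]:
    """Renormalize siliceous, carbonate, and clay contents so they sum to 100%."""
    total = si + car + clay
    if total <= 0:
        raise ValueError("mineral contents must sum to a positive value")
    return (100.0 * si / total, 100.0 * car / total, 100.0 * clay / total)


def classify_lithofacies(si: float, car: float, clay: float) -> str:
    """Assign one of the five lithofacies labels from relative mineral contents."""
    s, c, k = ternary_fractions(si, car, clay)
    if s >= SI_CUTOFF:
        # Silica-dominated: split by the more abundant secondary component.
        return "CBSS" if c >= k else "CSS"
    if c >= MIXED_CUTOFF:
        return "CCMS"
    if k >= MIXED_CUTOFF:
        return "CMS"
    return "MS"


if __name__ == "__main__":
    print(classify_lithofacies(60.0, 25.0, 15.0))
    print(classify_lithofacies(30.0, 30.0, 30.0))
```

Keeping the rules in one function makes it straightforward to apply the same classification to measured mineral contents and to mineral contents predicted by a regression model.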

| Data preprocessing
The sample data set in this study consisted mainly of experimental test data of real core samples, logging data, and derived data. Logging and derived data were primarily used as input features, and mineral content was used as the output label; both were indispensable. They were used together in model training to complete the mineral regression modeling.
Logging data are often subject to errors due to the complexity of the logging environment, measurement conditions, and study objects. Nongeological factors affect logging information to varying degrees, resulting in inaccurate data. To provide logging data of reliable quality, consistent depth, and accurate values, it was necessary to calibrate the data for depth, remove invalid values, and normalize the data.
The data normalization formula is as follows:
$$X = \frac{X' - X_{\min}}{X_{\max} - X_{\min}}$$
where $X$ is the normalized logging data, $X'$ is the original log data, $X_{\min}$ is the minimum value of the original log data, and $X_{\max}$ is the maximum value of the original log data.
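A minimal sketch of this min-max normalization for a single logging curve is given below. The function name and the handling of a constant curve (returning zeros to avoid division by zero) are assumptions made for this example; the source does not say how such a case was treated.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale one logging curve to [0, 1] using its own minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant curve carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    # Illustrative values only, not data from the study.
    print(min_max_normalize([80.0, 120.0, 160.0]))
```

In practice the minimum and maximum would be taken from the training data and reused for the test data, so that both are scaled consistently.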

| Parameter selection
A Pearson correlation coefficient heat map was created and the preprocessed log data were subjected to parameter correlation analysis. The Pearson correlation coefficient measures linear correlation and takes values in [−1, 1]. The closer the correlation coefficient is to 1 or −1, the stronger the correlation, and the closer it is to 0, the weaker the correlation. We set a parameter correlation threshold to filter the input parameters. When the correlation coefficient between two input parameters was greater than 0.8, we retained the parameter with the higher correlation with the output label and eliminated the other. Similarly, when the correlation coefficient between an input parameter and the output label was less than 0.2, the input parameter was deleted.
FIGURE 1 Regional structure and stratigraphic profile of the study area.
Based on the results of the analysis, logging parameters such as CAL, DEN, U, and RXO were deleted.Simultaneously, GR, AC, CNL, K, RT, RXO, BRIT, TH, and POR were preliminarily selected to establish a regression model for mineral content (Figure 3).
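The threshold-based screening can be sketched as follows. This is one possible greedy reading of the rule, assuming absolute correlation values are compared against the 0.8 and 0.2 thresholds; the function names and the toy feature names in the demo are hypothetical and do not correspond to the logging curves of the study.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def select_features(
    features: Dict[str, List[float]],
    target: List[float],
    redundancy: float = 0.8,
    relevance: float = 0.2,
) -> List[str]:
    """Keep relevant features and drop the weaker member of each redundant pair."""
    rel = {name: abs(pearson(col, target)) for name, col in features.items()}
    kept: List[str] = []
    # Visit features from most to least relevant, so that the feature with the
    # higher label correlation always wins when a pair is redundant.
    for name in sorted(rel, key=rel.get, reverse=True):
        if rel[name] < relevance:
            continue
        if all(abs(pearson(features[name], features[k])) <= redundancy for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    target = [1.0, 2.0, 3.0, 4.0, 5.0]
    toy = {
        "a": [10.0, 20.0, 30.0, 40.0, 50.0],
        "b": [1.0, 2.0, 3.0, 4.0, 9.0],
        "c": [3.0, 1.0, 4.0, 1.0, 5.0],
        "d": [1.0, -1.0, 0.0, -1.0, 1.0],
    }
    print(select_features(toy, target))
```

Sorting by relevance first avoids having to compare every pair explicitly, at the cost of being a greedy approximation of the pairwise rule.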
To reduce the risk of overfitting and improve the interpretability and performance of the model, parameters were added, combined, transformed, and calculated. Based on conventional logging parameters, characteristic parameters such as U/TH, TOC, and POR were added. The U/TH ratio reflects the redox environment of the water body. The TOC content represents the degree of organic matter development, whereas POR represents porosity. These three factors are closely related to shale minerals and play important guiding roles in understanding shale. 24,25 The derived characteristic parameters constructed for this work area were the relative content of TOC (w(TOC)), 26 the relative content of siliceous minerals (w(Si)), and the relative content of clay minerals (w(Clay)). 27
After data preprocessing, correlation analysis of the logging parameters, and feature derivation, the following parameters were selected to establish the automatic lithofacies identification model: AC, CNL, GR, TH, K, RXO, BRIT, POR, TOC, U/TH, w(Si), and w(Clay).

| Workflow
The workflow of the machine-learning algorithm is as follows. First, the logging data and experimental testing data were collated to create the sample data set. Next, the data were preprocessed by depth calibration, outlier removal, and normalization. A mineral content prediction regression model was developed using the preprocessed data set, where logging parameters were used as input features and mineral content was used as the output label. We subsequently optimized the models by determining the hyperparameters that needed optimization for each algorithm and defining the hyperparameter search space. The Bayesian-based Tree-structured Parzen Estimator (TPE) algorithm was used to identify the optimal hyperparameter combination and thereby enhance the performance of the regression model. Cross-validation was performed to validate the stability and performance of the regression models. Finally, a shale lithofacies recognition model was established by combining the mineral content regression model with the mineral ternary diagram method. The model results were validated using a test data set. The performance of the mineral regression model was evaluated based on the mean absolute error (MAE), root-mean-square error (RMSE), and coefficient of determination (R²). The lithofacies identification effect was evaluated according to the accuracy, recall, and F1-score (Figure 4).

| The implementation of machine-learning models
In this study, four machine-learning algorithms were used to build a prediction model based on sample data: LightGBM, XGBoost, ANN, and SVM. The XGBoost and SVM algorithms were used for comparison, and their principles are not described in detail.
The LightGBM algorithm was based on the XGBoost algorithm.It introduced histogram algorithms and leafwise growth strategies, thereby reducing the number of feature split points, searching for the best split point, limiting the maximum depth of the tree, and achieving the goal of optimizing the algorithm (Figure 5).
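First, we define the expression of a decision tree as
$$f_k(x) = w_{q(x)}$$
where $f_k(x)$ is the output variable of the $k$-th tree, $x$ is the input variable, $w$ is the leaf node weight, and $q$ is the mapping relationship between the sample instance and the leaf node.
The LightGBM model is obtained by the forward accumulation of single trees as follows:
$$\hat{y} = \sum_{k=1}^{K} f_k(x)$$
where $K$ is the number of decision trees and $\hat{y}$ is the accumulation of the $K$ decision trees.
The loss function of the decision tree is
$$L^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{k=1}^{t} \Omega(f_k)$$
where $L$ is the loss function, $n$ is the number of samples, $i$ is the sample index, $t$ is the number of steps in operation, $l(y_i, \hat{y}_i)$ is the empirical loss function, and $\Omega$ is the regularization term used to control the overfitting of the model.
Next, we expand and derive the empirical loss function and regularization term separately to obtain the final loss function equation. Applying a second-order Taylor expansion around the prediction of the previous step, $\hat{y}_i^{(t-1)}$, and dropping the terms that do not depend on the new tree $f_t$, the equation is
$$L^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$
where $g_i$ is the first-order derivative and $h_i$ is the second-order derivative of the empirical loss with respect to $\hat{y}_i^{(t-1)}$.
The regularization term of the decision tree complexity is defined as $\Omega$, which is determined by the number of leaf nodes $T$ and the leaf weights $w$ of a single decision tree. The regularization term can be expressed as
$$\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$
where $T$ is the number of leaves, $\gamma$ is the penalty coefficient on the number of leaves, $\lambda$ is the L2 regularization coefficient, and $w_j$ is the weight of leaf $j$. Subsequently, all leaf nodes of the decision tree are regrouped. All samples belonging to the $j$-th leaf node are collected into the sample set of that leaf node, that is, $I_j = \{ i \mid q(x_i) = j \}$. Thus, the loss function can be rewritten as
$$L^{(t)} \approx \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T$$
Let $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$ denote the sums of the first- and second-order derivatives of the samples contained in the leaf nodes, respectively, all of which are constants. Subsequently, the final loss function is
$$L^{(t)} \approx \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2} \left( H_j + \lambda \right) w_j^2 \right] + \gamma T$$
When the independent leaf nodes of each tree reach the optimal value, the entire loss function reaches the optimum value. When the tree structure is fixed, the optimal leaf weights and the optimal loss value are obtained as follows:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad L^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$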
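To make the derivation concrete, the sketch below uses the standard second-order closed forms, the leaf weight $w^{*} = -G/(H+\lambda)$ and the split gain $\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{G^2}{H+\lambda}\right] - \gamma$, together with a simple histogram scan over bin boundaries. It is an illustration of the idea only, not LightGBM's implementation: the equal-width binning, the function names, and the default parameter values are assumptions made for this example.

```python
from typing import List, Optional, Tuple


def leaf_weight(G: float, H: float, lam: float) -> float:
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + lam)


def _score(G: float, H: float, lam: float) -> float:
    """Structure score G^2 / (H + lambda) of one leaf."""
    return G * G / (H + lam)


def best_histogram_split(
    x: List[float],
    g: List[float],
    h: List[float],
    n_bins: int = 4,
    lam: float = 1.0,
    gamma: float = 0.0,
) -> Tuple[float, Optional[float]]:
    """Return (gain, threshold) of the best split found on bin boundaries."""
    x_min, x_max = min(x), max(x)
    width = (x_max - x_min) / n_bins or 1.0  # guard against a constant feature
    G_bins = [0.0] * n_bins
    H_bins = [0.0] * n_bins
    # One pass over the samples accumulates gradient statistics per bin.
    for xi, gi, hi in zip(x, g, h):
        b = min(int((xi - x_min) / width), n_bins - 1)
        G_bins[b] += gi
        H_bins[b] += hi
    G, H = sum(G_bins), sum(H_bins)
    best_gain, best_threshold = 0.0, None
    G_left = H_left = 0.0
    # Only n_bins - 1 candidate thresholds are scanned, not every sample value.
    for b in range(n_bins - 1):
        G_left += G_bins[b]
        H_left += H_bins[b]
        G_right, H_right = G - G_left, H - H_left
        if H_left == 0 or H_right == 0:
            continue
        gain = 0.5 * (
            _score(G_left, H_left, lam)
            + _score(G_right, H_right, lam)
            - _score(G, H, lam)
        ) - gamma
        if gain > best_gain:
            best_gain, best_threshold = gain, x_min + (b + 1) * width
    return best_gain, best_threshold


if __name__ == "__main__":
    # Squared-error loss with an initial prediction of 0: g = y_hat - y, h = 1.
    x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    y = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]
    g = [0.0 - yi for yi in y]
    h = [1.0] * len(y)
    print(best_histogram_split(x, g, h))
    print(leaf_weight(sum(g[3:]), sum(h[3:]), lam=1.0))
```

Because the gradient sums are accumulated per bin, the number of candidate split points depends on the number of bins rather than on the number of distinct feature values, which is the source of the speed-up attributed to the histogram algorithm.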

| Model validation and evaluation
In this study, 70% of the sample data was selected as the training data set to train the model and 30% was used as the testing data set to test the model's effectiveness. The sample data set was trained using the 10-fold cross-validation method to generate ten independently and identically distributed training and test data sets, on which models were built separately, allowing the models to be adequately trained and validated to obtain the best inversion accuracy (Figure 6). Evaluation metrics are quantitative measures of how well a model works, whether the same data are fed into different algorithmic models or different data are fed into the same algorithmic model; they include regression and classification evaluation metrics.
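The data partitioning can be sketched with index lists as below. This is a minimal sketch under stated assumptions: the folds are drawn from the 70% training portion, samples are shuffled with a fixed seed, and the function names and the seed value are hypothetical choices for this example rather than details given in the study.

```python
import random
from typing import Iterator, List, Tuple


def train_test_split_indices(
    n: int, test_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices and split them into training and testing parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def k_fold(indices: List[int], k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, validation) index lists; each sample is validated exactly once."""
    for fold in range(k):
        val = indices[fold::k]  # every k-th index, offset by the fold number
        val_set = set(val)
        train = [i for i in indices if i not in val_set]
        yield train, val


if __name__ == "__main__":
    train_idx, test_idx = train_test_split_indices(100)
    for fold_train, fold_val in k_fold(train_idx, k=10):
        print(len(fold_train), len(fold_val))
```

Averaging a metric over the ten validation folds gives a less split-dependent estimate than a single hold-out evaluation.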
Regression evaluation metrics: All regression models used MAE, RMSE, and R² to evaluate the performance. MAE is the mean absolute difference between the predicted and actual values, representing the average magnitude of the error in a set of predictions. RMSE is the root mean square of the difference between the predicted and actual values, which indicates the closeness of the observed data points to the model's predicted values. R² reflects the overall degree of fit of the model. The lower the MAE and RMSE values, the better the predictive performance of the model. In contrast, the higher the R² value, the better the prediction performance of the model. These three metrics are defined as follows:
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}$$
where $y$ and $\hat{y}$ are the true values and model predicted values from the data set, $\bar{y}$ is the mean of the true values, and $n$ is the number of samples.
Classification evaluation metrics: A classification model was established for label classification prediction. Accuracy, recall, precision, and F1-score are typically used to evaluate the classification performance. Accuracy is the proportion of all samples that the model classifies correctly. Precision is the proportion of samples predicted to be positive by the model that are actually positive. Recall is the proportion of actual positive samples that the model predicts to be positive. F1-score is the harmonic mean of precision and recall. The confusion matrix is used to explain these metrics and formulas, as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where TP is true positive, TN is true negative, FP is false positive, and FN is false negative.
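The evaluation metrics can be computed directly from their standard definitions, as in the sketch below. The function names are chosen for this example, the classification helper takes binary confusion-matrix counts (for five lithofacies the counts would be taken per class and then averaged, which is an assumption about how the multi-class case is handled), and `r2` assumes that the true values are not all identical.

```python
import math
from typing import Dict, Sequence


def mae(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def rmse(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Root-mean-square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def r2(y: Sequence[float], y_hat: Sequence[float]) -> float:
    """Coefficient of determination; assumes y is not constant."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


if __name__ == "__main__":
    # Illustrative numbers only, not results from the study.
    y_true = [1.0, 2.0, 3.0, 4.0]
    y_pred = [1.0, 2.0, 3.0, 6.0]
    print(mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
    print(classification_metrics(tp=8, tn=8, fp=2, fn=2))
```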

| LightGBM performance
The degree of contribution of the parameters to the model was calculated to determine those that most unidentified, and all five types of petrographic phases were commonly either misidentified or poorly identified (Figure 13).

| Comparison of different methods
SVM, ANN, and XGBoost have achieved satisfactory results in identifying sandstone and carbonate lithofacies. 28 Therefore, this study used the SVM, ANN, and XGBoost algorithms for comparison with the LightGBM algorithm in identifying lithofacies. Comparing the regression prediction results of the four algorithms for the relative mineral content, the LightGBM model had lower MAE and RMSE values than the XGBoost, ANN, and SVM models, with significantly higher R² values and the best fit (Figure 14). The four algorithms were combined with the mineral ternary diagram to automatically identify the lithofacies, and the accuracy of lithofacies identification was found to be 93.79% for the MT-LightGBM model, which was the highest of the tested models. XGBoost was the second most effective, with an identification accuracy of 83.1%. The single-model SVM algorithm was the least effective, with an identification accuracy of only 58.62%. This matches previous findings that single-model machine-learning methods are highly dependent on data and kernel functions, that their performance is not sufficiently stable, and that the effect may vary significantly with a different sample data set. The LightGBM algorithm is an efficient ensemble algorithm optimized based on the XGBoost algorithm and is more accurate effectively improved the performance of the LightGBM algorithm (Figure 15). The results showed that feature engineering is an important link in machine learning and data mining. Based on the original data and domain knowledge, it is necessary to combine, transform, and calculate the features to obtain new features that retain the original feature information and have higher expression and prediction ability. The parameters generated by feature engineering can significantly improve the performance and generalization ability of the model.

| Mineral regression improves recognition performance
When using the LightGBM classification method to classify and predict the lithofacies directly, the accuracy was 85.52%, precision was 84.04%, recall was 85.6%, and F1-score was 84.74%. The accuracy, precision, recall, and F1-score of the identification results of the model built using the MT-LightGBM method were 8.27, 9.12, 7.79, and 8.5 percentage points higher than those of the LightGBM classification model, respectively (Table 7, Figure 16), which was significantly better. The single-well composite bar chart shows that the MT-LightGBM method can accurately identify almost every type of lithofacies. In contrast, the LightGBM classification method could not identify the transition zone as accurately and could not accurately classify the parts of the lithofacies with curve anomalies.
For shale lithofacies identification, the MT-LightGBM method was more accurate than the LightGBM classification method (Figure 17). A comparison with previous research results further supports the reliability and advancement of the model. Bhattacharya et al. 21 achieved accuracies of 87% and 82% for the Bakken and Mahantango Marcellus shale lithofacies, respectively. Hou et al. 22 achieved an accuracy of 87% in identifying Gulong shale lithofacies. The MT-LightGBM method used in this study achieved an accuracy of 93.79%, indicating much better performance. The advantage of the regression-based approach may be due to the complex lithofacies, extreme heterogeneity, and vertically variable distribution of lithofacies, which result in fuzzy characteristics, unclear curve boundaries, or abnormal curve response characteristics corresponding to different lithofacies gradients and mutation zones. When the LightGBM algorithm is used to classify lithofacies, the data are discrete: a sample point corresponds to a specific set of logging data. The results are subject to extremely high data sample requirements, and transition zone lithofacies are not easily distinguishable or are even unrecognizable. In contrast, when LightGBM is used to regress the mineral content, the mineral content is continuous, and one sample point corresponds to a set of mineral content distribution intervals. This can reduce the impact of excessive data feature blurring and extreme anomalies at some points, expand the recognition range, and improve the performance of lithofacies recognition.

FIGURE 2 Mineral ternary diagram method for lithofacies classification.

TABLE 1 Standard for classification of lithofacies: ω(Si) is the content of siliceous minerals; ω(Car) is the content of carbonate minerals; ω(Clay) is the content of clay minerals.

FIGURE 4 Workflow of automatic recognition of lithofacies.

FIGURE 5 Schematic diagram of histogram algorithms and leaf-wise growth strategies.

LIU ET AL.

R² for the training and testing data sets of the LightGBM algorithm: (A) siliceous minerals, (B) carbonate minerals, and (C) clay minerals.

FIGURE 13 MT-LightGBM, XGBoost, artificial neural network (ANN), and support vector machine (SVM) algorithm lithofacies identification single-well composite bar chart.

TABLE 7 Classification evaluation metrics of MT-LightGBM and LightGBM.

FIGURE 16 Confusion matrix for the test data set using MT-LightGBM and LightGBM.

FIGURE 17 Single-well composite bar chart of MT-LightGBM and LightGBM classification lithofacies identification.

R² for training and testing data sets of XGBoost algorithm: (A) siliceous minerals, (B) carbonate minerals, and (C) clay minerals.

TABLE 4 Statistics of evaluation metrics of the ANN regression model.

FIGURE 10 R² values for the training and testing data sets of the artificial neural network algorithm: (A) siliceous minerals, (B) carbonate minerals, and (C) clay minerals.

TABLE 5 Statistics of evaluation metrics of the SVM regression model.

Four machine-learning algorithms were used to establish a lithofacies identification model for shale based on well logging and experimental testing data of the Longmaxi Formation shale in the Luzhou area. The main results are as follows:
1. The LightGBM regression model efficiently and accurately predicted the relative contents of various minerals in shale. Using 10-fold cross-validation reduced the randomness of the model training results, ensuring model stability and reliability. The LightGBM algorithm outperforms the XGBoost, ANN, and SVM algorithms in predicting siliceous, carbonate, and clay minerals, with R² values of 0.86, 0.83, and 0.85, respectively. The MT-LightGBM method, which combines the LightGBM regression model with mineral ternary diagrams, achieved the best metrics and demonstrated excellent recognition performance.
2. Feature engineering effectively enhanced the performance of machine-learning models. Adding feature engineering to the mineral relative content prediction model significantly reduced the MAE and RMSE. In addition, the R² of the three minerals increased by 0.08, 0.07, and 0.06, respectively.
3. The MT-LightGBM lithofacies identification model exhibited better identification metrics than the LightGBM classification lithofacies identification model, with an accuracy of 93.79%, precision of 93.16%, and F1-score of 93.24%. This study shows that, based on logging data, the MT-LightGBM lithofacies identification model is a stable and reliable method for identifying shale lithofacies in the Luzhou area and provides a new idea for shale reservoir research.