The site‐specific selection of the infiltration model based on the global dataset and random forest algorithm

The selection of the infiltration equation is important in various water research and management fields. Different infiltration equations were found to perform better than others in previous studies. Our objective was to use the Soil Water Infiltration Global (SWIG) database for (a) evaluating and comparing infiltration equations and (b) using the random forest algorithm to investigate how soil properties, land use, and infiltration measurement methods influence the infiltration equation selection. The performance of six equations (Horton, Mezencev, Collis‐George, Green–Ampt, Philip, and Swartzendruber) was characterized by the Akaike information criterion obtained after fitting them to 4,830 cumulative infiltration datasets from the SWIG database. Then, the random forest algorithm was applied to predict the best infiltration equation using soil texture class, organic matter content, bulk density, saturated hydraulic conductivity, land use, and infiltration measurement method as inputs. The Horton, Mezencev, and Collis‐George models were the best in 36, 24, and 12% of cases, respectively. Swartzendruber, Philip, and Green–Ampt were the best in 11, 10, and 7% cases, respectively. The different error of predicting the best infiltration equation was observed in different parts of input variable space. The infiltration method was by far the most important predictor for a model being the best across the whole database, followed by the organic C content and land use type. Organic C content and land use type were the most important predictors for the tension infiltrometer datasets. The SWIG database presents the opportunity to forecast which infiltration equation will work best in site‐specific conditions.


INTRODUCTION
Water infiltration is a fundamental land surface process that controls the hydrological cycle at all scales. The dynamics of infiltration determines water regime and balance at various scales, water availability for runoff and plant consump-infiltration rates have to be estimated. Such estimation requires a model of the dependence of infiltration on time, often given in the form of the infiltration equations and dependencies of parameters of this model (these equations) equations on soil and vegetation properties and site-specific conditions water supply rate.
Many equations (not less than 15) were proposed to simulate infiltration (Mishra et al., 2003). In different conditions, different equations performed better. Several studies have compared the accuracy of infiltration equations. Originally, the comparisons were made with data on one or two soils (Davidoff & Selim, 1986;Haverkamp et al., 1988;Lal & Van Doren, 1990;Mbagwu, 1995). Double-ring and singlering infiltrometers were used in these works. Duan et al. (2011) showed that the Mezencev and Horton infiltration equation performed better than Philip and Kostiakov equations for the data from a double-ring infiltrometer in three soils under lawns. Mishra et al. (2003) analyzed data from 243 experiments in the United States and India and concluded that the semiempirical Singh-Yu general equation performed best compared with other popular infiltration equations. Dashtaki et al. (2009) evaluated the Kostiakov, Mezencev, Horton, and Philip infiltration equations with double-ring infiltrometer data from 123 sites in Iran and found that the performance ranking varied across the sites. Shao and Baumgartl (2014) measured infiltration at 28 sites using the rainfall simulator and concluded that the Horton and Holtan equations showed better performance than physically based Green-Ampt and Philip equations. The same infiltration measurement method was used in most of the infiltration equation comparisons. The performance of equations depended on the infiltration measurement method if more than one method was used (Mazloom & Foladmand, 2013).
Pedotransfer functions for infiltration equation coefficients were developed for some regions (e.g., for Green-Ampt equation over the continental United States [Rawls et al., 1983], and for the Philip, Green-Ampt, Kostiakov, and Horton equations over two regions in Brazil [ Van de Genachte et al., 1996]). The creation of large international database such as the Soil Water Infiltration Global (SWIG) database (Rahmati, Weihermüller, Vanderborght, et al., 2018; opens the opportunity to create pedotransfer functions for the variety of infiltration equations. To use them in site-specific conditions, one has to decide which infiltration equation should be used in these conditions. Overall, no attempt has been made so far to use site-specific and measurement-specific data to predict which infiltration equation will perform best. Research in this direction becomes feasible with the development of the SWIG database and the availability of powerful algorithms such as random forest (RF) quantifying the complex relationships in databases.
The objective of this work was to use the SWIG database for (a) evaluating and comparing the suitability of infiltration

Core Ideas
• Selection of the infiltration equation is based on basic soil properties and land use. • Data from the global SWIG database were fitted with six popular infiltration equations. • The infiltration measurement method was the lead predictor for the equation selection. • It was followed by the organic C content and the land use type. • The error of the best infiltration equation predictions depends on soil properties and land use.
equations and (b) using the RF algorithm to investigate how soil properties, land use, and infiltration measurement methods influence infiltration equation selection.

The database
Data in SWIG were collected for the 1976-2017 time span. The SWIG database includes 5,023 datasets from 54 countries. It provides cumulative infiltration data, some soil properties (Table 1), land use (Table 2), and infiltration measurement method (Table 3) (Rahmati, Weihermüller, Vanderborght, et al., 2018). Different soil properties have different frequencies of occurrence. Approximately 76% (3,818 out of 5,023) samples have known land use types. Twenty-two categories of land use types in the SWIG database were reclassified into seven categories as presented in Table 2. Agriculture (arable land) is reported as the most frequently reported land use in the SWIG database with 53%, followed by grassland, pasture, garden, forest, urban, and others. Several infiltration methods were used to measure infiltration rates (Table 3). The disc infiltrometer (disc, minidisc, micro-disc, hood, and tension infiltrometers) is the most frequently used type of infiltration method in the SWIG database; it was applied to obtain with approximately 51% of datasets. The minidisc infiltrometer is the most frequently reported infiltration measurement method in the SWIG database; it was reported for about 23% of all datasets. The double-ring infiltrometer is the second most frequently applied infiltration measurement method in 16% of datasets. The disc infiltrometer (12% of datasets) and the single-ring infiltrometer (11% of datasets) are ranked as the third and fourth most frequently used infiltration measurement methods, respectively.
The cumulative distribution functions (CDFs) for the length of time of cumulative infiltration from each separated infiltration method are shown in Supplemental Figure S1. Less than T A B L E 1 Selected statistics of soil properties in Soil Water Infiltration Global (SWIG) datasets used in this work (Rahmati, Weihermüller, Vanderborght, et al., 2018)

Infiltration equations
Three empirical equations (Horton, Mezencev, and Collis-George) and three theoretical equations (Green-Ampt, Philip, and Swartzendruber) were selected to evaluate and predict their performance in this study. The infiltration equations are shown in Table 4. The software R version 3.53

Reference Infiltration equations
Horton ( Note. F(t) = cumulative infiltration at time t (cm). In the Horton equation, f c = final or equilibrium infiltration rate (cm h -1 ), f 0 = initial infiltration rate (cm h -1 ), k = constant representing exponential rate of the approach of the infiltration rate to the final one (h -1 ). In the Mezencev equation, k (cm h -1 ), a (unitless) and f 0 (cm h -1 ) are empirical constants, k > 0 and 0 < a < 1. In the Collis-George equation, K is the infiltration rate at steady state, = ( ) 1∕2 , and = ∕ ; the sorptivity (S is the measurable soil property. In the Philip equation, S is the sorptivity (cm h -1/2 ), and c is the constant. In the Green-Ampt equation, K is hydraulic conductivity (cm h -1 ), Ψ is the soil suction head at the wetting front (cm), and θ is the volumetric water content. In the Swartzendruber equation, A is a parameter depending on soil properties (h -0.5 ), K is the saturated hydraulic conductivity (cm h -1 ), and S is the soil sorptivity. (R Core Team, 2019) was used to fit equations to datasets using the R nls_search routine. There were approximately 200 datasets in which the cumulative infiltration value oscillated with time. Datasets with more than five oscillations were removed. A total of 4,830 cumulative infiltration data at a given time were used for the analysis in this work.

Application of the RF algorithm
The RF algorithm can model complex relationships between a set of input variables and the output (target) variable. Both input and output variables can be both numerical and categorical. The concept of RF is based on the concept of the decision tree (DT). For a given database, the DT creates the tree-like representation of the relationship between input and output variables. The tree is binary, the data are split into two subgroups, each of those subgroups are split into another two, and so on. For each split, the goal is to create two groups with the most similar target values within the groups and the percentage of all datasets sent to this dataset subgroup. The process ends with having small terminal subgroups. In a DT, each node except terminal nodes represents the test that needs to be performed to send the datasets to one of two branches. A leaf, or terminal node, represents a classification result. The predicted target variable value in these subgroups is found as the average if the target variable is numerical, or by the majority of represented datasets if the target variable is categorical. An example of the DT with data from the SWIG database is shown in Figure 1. The potential input variables for classifications were the infiltration measurement method, the organic C content, the bulk density, the land use, and the textural class. At the graph's bottom, there are leaf nodes that present classification results. Each leaf node shows the infiltration equation name, the fractions of datasets classified correctly and misclassified at this node, and the percentage of the total datasets classified to this node. For example, the leftmost node at the bottom represents datasets obtained with disk infiltrometry (DI) and mini-disk infiltrometry (MD). These datasets constitute 13% of the total number of datasets. Fractions of datasets appearing in this group are equal to 0.02 for datasets having the Collis-George equation as the best, 0.00 for the Green-

Vadose Zone Journal
The split conditions are written under classification results at the intermediate nodes. For example, the first split at the top node occurs by the infiltration measurement method. Not all potential classifier variables participate in tree splits-for example, the textural class was not used in this case.
Decision trees work very well with databases that have distinctly different input-output relationships in different database parts. Transparency of the results made DTs a powerful instrument for soil hydraulic database exploration (Lilly et al., 2008;Rawls & Pachepsky, 2002). Decision trees can be used for predictions. In such cases, available data have to be divided into training and testing datasets, and crossvalidation has to be performed. Many pairs of training and testing datasets have to be created. Predictions are found by averaging or voting on results from all DTs obtained by training, and the metrics of the prediction accuracy and reliability are found from the testing results. In this way, those metrics can be treated as random values, and this creates an opportunity of using standard statistical procedures for model performance comparison. Pedotransfer relationships have been developed with ensembles of DTs (Gebauer et al., 2020;Jorda t al. et al., 2015) The RF algorithm is one of several algorithms using DTs for predictions. It builds multiple DTs and merges their results together to get a more accurate and stable prediction. The testing datasets are randomly extracted from the original one, and the training datasets are created by randomly adding data from the original dataset to the positions vacated after the extraction of the testing dataset. This algorithm allows for existence of the same data in training and testing datasets. However, the testing is done only with data in the testing dataset that are not present in the training dataset (i.e., have not been seen during training). Such data are called "out-of-bag," or OOB, data. The RF algorithm has become popular in pedotransfer work (Araya & Ghezzehei, 2019;Koestel & Jorda, 2014).
The OOB data are used not only for the evaluation of the RF reliability but also for ranking input variables by their importance. A common measure of variable importance is the mean decrease of accuracy, or MDA. After a tree is grown, the OOB samples are passed down the tree, and the prediction accuracy metric is recorded. Then, the values for each input variable are randomly permuted (shuffled) in the OOB samples, and the accuracy metric is again computed. The MDA metric as a result of this permuting is averaged over all trees, normalized by the standard deviation value, and used as a measure of the importance of the permuted variable in the RF (Bhalla, 2019).
The RF algorithm was used to predict the best-performing equation for the dataset's soil texture class, organic C content, bulk density, land use, and infiltration measurement method. The RF built many DTs similar to Figure 1; each tree selects the best model, and the best infiltration equation is the one found in the majority of individual tree outputs (Hastie et al., 2009). The RF algorithm was applied using the randomForest package in R version 3.53 (Liaw & Wiener, 2018). Soil properties (soil texture class, organic C contents, and bulk density), land use class, and the infiltration measurement method were used as inputs, and the best infiltration equation name was the output. The "true" output was the name of the equation with the best performance statistic (see below). Before applying the RF classification, datasets were removed if one of the input variables was missing. A total of 1,698 datasets were used for the best equation predicting study. The split of datasets was 75 and 25% for the RF training and testing, respectively. Such splitting was randomly done 500 times to obtain 500 DTs. The number of votes for the correct equation was the accuracy metric.

Performance statistics
The second-order Akaike information criterion, AICc (Burnham & Anderson, 2002), was used to evaluate the performance of the infiltration equations in this study: where k is the number of estimated parameters in the infiltration equation, N is the sample size (number of infiltration data points), and ( ) and ( ) are estimated and measured cumulative infiltration at the ith experimental points, respectively. The smallest value of AICc indicated the best equation. The accuracy of classification achieved for individual infiltration equations is assessed using the confusion matrix. The element "ij" of this matrix is the number of datasets classified as having the equation i as the best one, whereas in fact, these datasets have the equation j as the best one.
The quality of the equation selection achieved with the RF was assessed by the error values that were computed as where FP (false positive) is the number of incorrect predictions for an equation to be the best, and TP (true positive) is the number of correct predictions for the equation to be the best.
The overall accuracy was computed from TP, FP, the number of correct predictions for the equation to not be the best (TN, true negative), and the number of incorrect predictions for the equation to not be the best (FN, false negative) (Lu Vadose Zone Journal F I G U R E 2 Percentage (bars) and the total numbers (in parentheses) of datasets for which infiltration models appeared to be the best using the Akaike information criterion et al., 2004):

Overall accuracy = TP + TN TP + TN + FP + FN
100% (3) where TN is correct negative prediction and FN is incorrect negative prediction. The additional statistic to evaluate the classification was the Cohen's kappa (κ) statistics (Ben-David, 2008): where 0 is the observed agreement, and is the hypothetical probability of change agreement. According to Landis and Koch (1977), less than 0 κ value indicates no agreement, 0-0.20 is slight, 0.21-0.40 is fair, 0.41-0.60 is moderate, 0.61-0.80 is substantial, and 0.81-1 is almost perfect agreement.

Evaluating the performance of infiltration equations
All six equations were fitted to the cumulative infiltration time series in each of 4,830 datasets, and AICc values were computed for each fit. The cumulative distribution of the AICc for all models are shown in Supplemental Figure S2. Overall, the distributions were not substantially different. For each dataset, the fitting results of six infiltration equations were ranked using AICc values, and the equation with the smallest AICc was defined as the best (Figure 2). The Horton equation was the best in 1,745 out of 4,830 databases (36%). Mezencev equation (24%) and the Collis-George equation (11.5%) were ranked second and third, respectively. Empirical infiltration equations (Horton, Mezencev, and Collis-George) performed better than theoretical infiltration equations (Philip, Green-Ampt, and Swartzendruber). The Swartzendruber and Philip equations were the most suitable in 535 (11%) and 464 (10%) datasets. The Green-Ampt equation appeared to be the least applicable infiltration equations in this study.

Prediction of the best equation and the most influential predictors
The confusion matrix obtained from RF results for the entire dataset is shown in Table 5. The last column provides the error values computed according to Equation 3. The empirical infiltration equations (Horton, Mezencev, and Collis-George) showed prediction accuracy higher than the Philip, Green-Ampt, and Swartzendruber The best and the worst performance (35 and 88%) were found for the Horton and Philip infiltration equations, respectively. The difference between training and testing errors was small for the Horton and the Mezencev equations but became more substantial (12 and 19% for the Collis-George and the Green-Ampt equations, respectively). Similar errors in training and testing were found for the Philip and the Swartzendruber equations. The overall accuracy of predictions over all equations was 42%. The κ values for training and testing were within 0.2 and 0.4, indicating a fair agreement.
The input variable ranking by MDA is shown in Figure 3a. The most important predictor was the infiltration measurement method followed by organic C and soil texture class. Land use type and bulk density were not in the top three predictors. Nevertheless, they were relatively close to land use type with MDA in the range from 30 to 40%.

Dataset obtained with the same infiltration measurement method
Since the infiltration measurement method appeared to be the leading predictor of the best equation, it was deemed interesting to look at leading predictors for dataset groups that were obtained with the same infiltration measurement method. A large group obtained with the same method was the tension infiltrometer group (476 datasets). Table 6 shows the confusion matrix obtained from the prediction of the best infiltration model using the tension infiltrometer dataset. The Collis-George infiltration equation had the best error with training data, and its errors with the testing data were much better than errors of other models. The overall accuracy across all equations was 27%, which was much lower than for the entire dataset in Table 5. The κ values for training and testing were 0.086 and 0.066, which were also lower than for the entire dataset.   The rankings of the best equation predictors were different for the entire dataset and tension infiltrometer datasets. The organic C content was the most important predictor in the tension infiltrometer dataset. Land use and bulk density were equally important (Figure 3b).

DISCUSSION
The comparison of six infiltration equations by the AICc criterion showed that empirical infiltration equations provided a better fit compared with theoretical infiltration equations  Figure 1). Past research provided both similar and contradicting to our conclusions. The Horton equation was found to be the most suitable infiltration equation when compared with the Kostiakov, the modified-Kostiakov, and the Philip equations (Ogbe et al., 2011). Previous research showed that Horton and Mezencev infiltration equations performed better than Philip and Kostiakov (Duan et al., 2011). Shao and Baumgartl (2014) demonstrated that empirical infiltration equations were more accurate than theoretical infiltration equations when Philip, Green-Ampt, Horton, and Holtan equations were compared. Mishra et al. (2003) found that theoretical infiltration equations performed better in a study for evaluation of 14 popular infiltration equations based on the Nash-Sutcliffe efficiency criterion using 243 sets of infiltration test data of various soil types from India and the United States. Jha et al. (2019) concluded that the theoretical infiltration equation (Swartzendruber) provided the best reliable infiltration equation by comparing five infiltration equations using 72 infiltration data. The above studies were conducted with relatively small databases with soil samples from specific and limited locations compared with the SWIG database. The evaluation of suitability for infiltration equations using the SWIG database might result in more broadly applicable conclusions. In general, there is no reason for theoretical equations to be more accurate than empirical equations. The pore space of field soils is more complex than the pore space assumed in theoretical equation development. This study was limited to six infiltration equations, further research is possible to select the most suitable infiltration equations by including other infiltration equations from the literature. It is also worth mentioning that the Horton and Mezencev equations were not only the most accurate but also were the most accurately predicted as the best equations (Table 5). Besides, the prediction error was very close in training and testing datasets. That may mean that the predictor set includes the major determinants for the selection of the best equation.
The infiltration method was the most influential predictor for the selection of the most suitable infiltration equation. There can be several reasons for that, including the scale effect caused by differences in the contact area between infiltrometer and soil. The contact area for the minidisc infiltrometer and double-ring infiltrometer is approximately 16 and 700 cm 2 , respectively. These differences can make measurements sensitive to different levels of soil heterogeneity (Pachepsky et al., 2014). Rahmati, Weihermüller, Vanderborght, et al. (2018) noted that different infiltration equations should be used for measurements made with different methods. Prediction of the infiltration equations can have a different efficiency for different infiltration measurement methods.
When percentages of sand, silt, and clay were used instead of soil texture class (the results are shown as supporting information in Supplemental Tables S1 and S2), the overall accuracy was 44.8 and 34.1%, respectively. The overall accuracy was not much different from overall accuracy when computing RF input soil texture class (suggested by the anonymous reviewer). Given the importance of infiltration equations in land surface modeling , the efficiency of the soil texture class as a predictor of infiltration equation suitability makes possible large-scale maps to inform land surface modeling on selection of the infiltration equation.
The organic C content was found to be one of the important predictors (Figure 3). This happens because soil organic C contents highly are related to soil structure (Haynes & Beare, 1996;Huntington, 2007). The soil bulk density was not found as one of the most important predictors in the results from the entire dataset and from the tension infiltrometer data. The bulk density can reflect different aspects of soil structure caused by compaction (Hakansson & Lipiec, 2000) or aggregation (Aksakal et al., 2019) in the soil. It is also possible that the land use provides enough information for predictions of the best equation and makes the bulk density less critical predictor. Several previous studies stressed out the land use type impact on hydraulic conductivity (Kidwell et al., 1997;Shao & Baumgartl, 2014).
The RF input values that were conducive for the Horton equation to be the best provided a moderately accurate prediction across the whole database. Other input values pointed to other infiltration equations as the best ones, but the error of these predictions was high. Similarly, in measurements with the tension infiltrometer, the RF input values indicated that the Collis-George equation was the best led to the moderate accuracy in predictions. In contrast, low accuracy was found with the input sets pointing to other equations as the best ones. These results of predicting the best model appear to demonstrate a common, but not always analyzed, feature in pedotransfer work-different accuracy of pedotransfer in different parts of the input domain. For example, in studies of scaled pedotransfer functions, Amanabadi et al. (2019) concluded that predictions of water retention curves of fine-textured soil significantly better than that of the coarse-textured soils. Grote and Leverett (2019) compared pedotransfer functions for high-resolution mapping of hydraulic conductivity using GPR. They observed that the GPR-based estimates were more accurate in the sandy loam than in the silty clay loam. Why the ability of some equations to be the best can be predicted better than for other equations presents an interesting research question.
Additional predictors of infiltration equation suitability may be found and used. The initial water content in the soil should be examined as a predictor because it can influence the infiltration rate which was considered for predicting hydraulic conductivity (Araya & Ghezzehei, 2019). The initial water content in the soil was not used in this study, since it was only available for 31% of infiltration data in the SWIG database. Furman et al. (2006) showed that initial and boundary conditions may be influential factors in infiltration equation performance. One may argue that some infiltration equations may be better suited for one-dimensional flow than for threedimensional and vice versa, and that might be incorporated in the analysis. There are two aspects to that. First, some equations are known to work well for both geometries. In particular, the Philip equation (Table 4) was derived from the one-dimensional water flow equation (Philip, 1957). However, Havercamp et al. (1994) showed that the same equation applies to the three-dimensional infiltration in disc infiltrometry for short and medium times. This was confirmed in recent experiments of Kargas et al. (2018), where the linear dependence on time was demonstrated for the difference between three-and one-dimensional cumulative infiltration. Similarly, the Mezencev equation has performed well with data from both one-and three-dimensional flow (Kumar & Sihag, 2019). Second, the performance of an infiltration equation also depends on the stage of the infiltration that has been reached in measurements. In measurements that did not reach the stationary stage, the presumably one-dimensional equations such as Horton's or Swartzendruber's may perform better than the presumably three-dimensional equations for some pore space geometries. The SWIG database includes both relatively short and relatively long measurements for all measurement methods (Supplemental Figure S1). Including the infiltration stage in the analysis might help create a better prediction of the most suitable equation. This can probably be done for small subsets of the SWIG database and will require an algorithm to decide at which stage the infiltration measurements ended. This needs in-depth research, given the observation of Warrick (1992) who researched the equations for disk infiltrometers and pointed out that the approach to the steadystate solution may take considerably longer than what is commonly reported in the literature for field applications. Nevertheless, predicting which equation will be suitable at a particular infiltration stage presents an interesting research avenue.
The analysis is done in this work for tension infiltrometer data infiltration measurement methods presents an example. It shows that, depending on the measurement method more appropriate for the project at hand, the SWIG database allows one to select a subset of data with the soil and landscape properties that are available for this project, and evaluate the precision of the selection of the infiltration equation.

CONCLUSIONS
We evaluated which infiltration equation is more suitable using RF classification and investigate how soil properties, land use, and infiltration methods control infiltration equations. We used 4,830 cumulative infiltration data from the SWIG database. The Horton equation was found as the most suitable equation and Mezencev and Collis-George equations were ranked second and third, respectively. The Green-Ampt equation was found as the least suitable infiltration equation. For the entire dataset, the most important predictor was the infiltration measurement method followed by organic C and land use type. The ranking of predictors for selection the most suitable infiltration equation was different for different infiltration measurement methods. The accuracy of predicting the best model appeared to be different for different parts of the input domain. The SWIG database presents an important resource for selecting the appropriate infiltration equation for project-specific infiltration methods and available predictors.

A C K N O W L E D G M E N T S
Dr. Gülay Karahan thanks the Scientific and Technological Council of Turkey (TUBITAK) for the financial support given under the grant of BIDEP (2219).

C O N F L I C T O F I N T E R E S T
The authors declare no conflict of interest.