Estimation of soil moisture using environmental covariates and machine learning algorithms in Cathedral Peak Catchment, South Africa

Soil moisture (SM) is a fundamental constituent of the terrestrial environment and the hydrological cycle. Owing to its significant influence on catchment hydrological responses, it can be utilized as an indicator of floods and droughts to aid early warning systems. This study aimed to develop a field‐scale method to estimate SM using parametric and machine learning‐based methods to compare whether advanced artificial intelligence methods can give similar results as traditional methods. Considering this, monthly observed SM data (from the top 10 cm), environmental covariates, and remotely sensed data from March 2019 to July 2021 for the Cathedral Peak Research Catchments VI and IX in South Africa were obtained. From the 241 observations obtained across 12 sites, 160 (∼66%) were used for model training, while the remaining 81 (∼34%) were used for model testing. Employing 10‐fold cross‐validation, the individual machine learning models (viz., support vector machine [SVM], random forest (RF), k‐nearest neighbor, classification and regression trees [Rpart], and generalized linear model) displayed a satisfactory performance (R2 = 0.52–0.79; root mean square error = 3.79–5.80). In the validation phase, the RF model displayed a superior performance, followed by the SVM. Subsequent SM estimation using the hybrid model produced satisfactory results in training (R2 = 0.90) and testing (R2 = 0.45). The results obtained from this study can aid in predicting SM variations in catchments with limited monitoring. Furthermore, this model can be applied in drought monitoring, forecasting, and informing agricultural management practices.


INTRODUCTION
Soil moisture (SM) is a fundamental constituent of the terrestrial environment and the hydrological cycle (Li et al., 2022).It significantly influences catchment hydrological responses by moderating processes such as runoff generation (Freebairn et al., 2009;Robinson et al., 2008;Singh et al., 2021) and evapotranspiration (Kim et al., 2016).Furthermore, the effect of SM on biogeochemical processes (Zhang & Wienhold, 2002), ecological functions (Dunn et al., 1985;Mackay & Barber, 1985;C. Wang et al., 2019), and soil behavior and physical properties (Lal & Shukla, 2004) affords its importance in agriculture and hydrology (Songara & Patel, 2022) where it can serve as an indicator of climate change-related risks (viz., droughts and floods).Hence, the accurate quantification and monitoring of the spatiotemporal variations of surface SM are essential for many climate-related studies, such as flood and drought predictions (Dai et al., 2004) and for informing efficient agricultural and land management practices (AbdelRahman et al., 2021).
Relatively recent SM estimation techniques involve the use of instrument-based methods that measure and relate soil dielectric properties to SM content, such as time-domain reflectometry (Černý, 2009) and frequency-domain reflectometry (He et al., 2021).Although these in situ methods provide invaluable SM information, they are susceptible to uncertainties as the direct methods suffer errors in organic soils, and instruments can be sensitive to air gaps and soil salinity (Babaeian et al., 2019).Furthermore, these methods often fail to adequately represent field-scale SM conditions spatially as they only provide point measurements (Petropoulos, 2014).In South Africa, there are few active SM monitoring networks.Those that are somewhat up to date are using either remotely sensed global datasets (Vey et al., 2016) or a combination of remote sensing and or hydrological/land surface models (van Huyssteen et al., 2013;Myeni et al., 2019Myeni et al., , 2022;;Sinclair et al., 2010;Vischel et al., 2008).To date, to the authors' knowledge, no study has undertaken the use of machine algorithms to estimate SM or soil water content in South Africa.
Since establishing dense sensor networks could prove costly, labor-intensive, and destructive to soil profiles (Lekshmi et al., 2014), SM estimation using remote sensing (RS) provides a practical solution to these challenges.RS has emerged as a powerful tool to provide spatially and temporally representative SM data, thus increasing the measurement possibilities, especially for remote catchments with sparse monitoring networks (Cheng et al., 2022;Hillel, 2003;Kundu et al., 2017;Peng et al., 2017).
Remote sensing-based SM products/technology have developed over the past decades.Since the early 1980s, when the Scanning Multichannel Microwave Radiometer, carried by Nimbus-8, began providing SM products,

Core Ideas
• The study focused on conducting soil moisture estimations in KwaZulu Natal, South Africa at a catchment scale.• Google Earth Engine served as a valuable tool to estimate soil mositure levels in KwaZululu Natal, South Africa.• Soil Moisture estimations in South Africa at catchment scale.• The research employed a combination of capacitance probes, remotely sensed imagery, and machine learning algorithms to estimate soil moisture content effectively.
there have been developments to improve the spatial and temporal accuracy and resolution of products (Fan et al., 2022).As such, the market presents numerous remote sensing-based SM projects retrieved from various satellites using various algorithms, for example, the Soil Moisture and Ocean Salinity of the European Space Agency with a spatial resolution of 35-50 km (Xiao et al., 2017), the Soil Moisture Active and Passive of the National Aeronautics and Space Administration with a spatial resolution of 36 km (Fan et al., 2022), and the Advanced Microwave Scanning Radiometer-Earth Observing System (AMSR-E/2), of Japan Aerospace Exploration Agency's; with a spatial resolution of 25 km (Yan & Bai, 2020).The challenge with these SM products is their coarse spatial resolution; hence, an alternative method to estimate SM at a local scale is required.Therefore, machine learning (ML) models could provide an alternative to coarse remote sensing-based SM products to estimate SM that will be useful at a local or field scale.
Accordingly, recent years have seen the increased use of machine learning models for SM retrieval/modeling from RS products.These models are driven by ML algorithms, which have gained popularity for their efficiency in determining complex nonlinear relationships.They have solidified their use in SM estimation using RS owing to the numerous works citing their efficiency and the higher prediction accuracy they can achieve (Nguyen et al., 2022;Sanuade et al., 2020;Yang et al., 2021).
From a mapping point of view, machine learning algorithms have been the prime for SM mapping due to successful applications in predicting soil attributes (Ahmad et al., 2010;Chaudhary et al., 2022;Hachani et al., 2019;Sanuade et al., 2020).The advantages of employing ML algorithms include rapidly producing high-accuracy results and deciphering complex relationships (Khanzode & Sarode, 2020).Numerous literature has cited their success in the prediction of SM from observed and meteorological data (Babaeian et al., 2021;Sanuade et al., 2020) as well as from assimilating RS data (Adab et al., 2020;Chaudhary et al., 2022;Chung et al., 2022).
For example, artificial neural networks allow the extraction of an empirical equation from the model (Sanuade et al., 2020).Furthermore, their ability to draw nonlinear relationships for prediction benefits SM estimation (Ahmad et al., 2010;Twarakavi et al., 2006).Additionally, RF and SVM models have been frontrunners in numerous SM estimation studies (Adab et al., 2020;Chaudhary et al., 2022;Sanuade et al., 2020).This is promising for the current study as it will employ ML algorithms.However, ML algorithms generally require large datasets for adequate training, thus limiting their use in regions with limited data.
Key findings from the literature reveal some limitations and opportunities that may be encountered in implementing the current study.For example, the existing literature informs that using ML algorithms requires large training datasets for improved prediction accuracy.However, the current study will utilize a relatively limited dataset, therefore presenting an opportunity to establish and identify the limit to which this is the case and aid in informing use in other data-limited regions.Despite their high efficiency, another limitation established by the literature is the site-specific nature of ML algorithms which often gives rise to shortcomings in replicating accurate predictions outside of the site they are trained in (Adab et al., 2020).This further presents an opportunity to investigate whether a GLM is more accurate than ML models to overcome site-specific limitations.GLM is driven by geostatistical approaches (e.g., multiple linear regression).These approaches have long been favored owing to their simple but effective applications (Thompson et al., 2006), which can provide predictions of SM for unsampled areas by employing the spatial dependence of the unsampled point in relation to the sampled points around it (Tunçay et al., 2018).Furthermore, the literature identified potential RS products that can aid in refining results, such as land surface temperature (LST) and normalized difference vegetation index (NDVI).This insight will provide the guidelines for the current study.
Therefore, in keeping with the recent literature and emerging trends in SM estimation, this study aimed to test whether one could accurately estimate SM at the field scale using RS data since the current RS products provide fairly coarse spatial resolution estimates, which may be inadequate for various purposes.The following objectives guided the study: (i) to develop simple linear and more complex ML-based models for SM estimation in study sites of varying terrain, (ii) to evaluate the difference between simple linear and more complex ML-based models, and (iii) to develop a model ensemble that combines the strengths of a linear method and numer-ous machine learning algorithms for SM estimation at a field scale.Therefore, this study has the potential to support the few sparse continuous in situ monitoring networks in South Africa and can validate and calibrate RS products by using machine learning algorithms at the catchment scale.

STUDY AREA
The Cathedral Peak research catchments (Figure 1) are longterm monitoring sites where the South African Environmental Observation Network (SAEON) continuously monitors various hydro-meteorological variables.These catchments (naturally grasslands) are situated in the northern part of uKhahlamba-Drakensberg in KwaZulu-Natal, South Africa, and fall under the uKhahlamba-Drakensberg Park World Heritage Site (R. L2; Harrison et al., 2022).Although these catchments comprise 15 distinct hydrological units (viz., I-XV), only catchments VI and IX were utilized for this study due to the present instrumentation.These catchments are situated within the South African summer rainfall region and experience a mean annual precipitation of ∼1200-1600 mm (Bosch, 1979;Harrison & van Tol, 2022), predominantly occurring during October and March (Schulze, 1975).The region's geology is acidic, greatly leached soils (Schulze & George, 1987) with underlying basaltic lavas over Clarens Sandstone (Nänni, 1956;Toucher et al., 2016).

Cathedral Peak Catchment VI
Catchment VI (29˚15′3.77″E; 28˚59′34.607″S) has mainly remained unaltered as a natural Themeda triandra grassland and is subjected to biennial spring burns.The dense grassland minimizes soil evaporation while keeping the surface temperature lower than surrounding exposed areas.The soils act as an efficient storage of organic carbon due to the dense root networks of the grasses (Kunene et al., 2015).This is the principal research catchment of SAEON owing to the density and comprehensive nature of monitoring.The catchment characteristics include a mean annual precipitation (MAP) of 1340 mm, a total area of 0.667 km 2 , an altitude ranging between 1845-2073 meters above sea level (m.a.s.l.) and an north-northwest aspect (Harrison et al., 2022;Toucher et al., 2016).2016).The exclusion of fire in this catchment has allowed the growth of Leucasidea serica, which are evergreen, frostresistant trees.The tree canopy shades the surface, enabling cooler surface temperatures and reduced surface evaporation.The soil is moist and darkly pigmented, indicating high organic content.The catchment characteristics include a MAP of 1257 mm, a total area of 0.645 km 2 , an altitude ranging between 1823 and 1966 m.a.s.l. and an east-northeast aspect (Harrison et al., 2022;Toucher et al., 2016).

Observed soil moisture data
Data were obtained from Catchments IV and IX from SAEON, where SM data had been recorded using the Sentek (Pty Ltd.) Diviner2000 probes (hereafter named, Diviner 2000), which use the frequency-domain reflectometry method of determining SM.The Diviner 2000 can take rapid readings of the SM profile in 10-cm vertical increments for a 5-to 14-cm radius outside of the access tube (Sentek, 2011) up to a depth of 1.6 m and can provide high accuracy in both saturated and oven-dry soils (Sentek, 2009).The Diviner 2000 in Cathedral Peak has been calibrated by submersion in water and subsequent exposure to air while also using the default settings.Monthly SM (%) from March 2019 to June 2021 has been obtained for both catchments.
Figure 2 displays the location of the Diviner probes from which the 2-year observed SM used in this study was obtained.The probes are installed to a depth of 1 m below the ground's surface, and the spacing varies from ∼50 (CP-VI) to 70 m (CP-IX) in lower regions of the catchments.In the middle and upper regions of the catchments, the spacing varied from ∼10 (CP-VI middle) to 15 m in the middle and upper regions of the catchments (CP-VI upper and CP-IX middle/upper).The varying spacing was due to the varying terrain features across the site's landscape.
The probes were installed in pairs in the catchments to provide a more representative image of the variation of SM along the different regions of the catchment.Furthermore, these probes are installed in pairs as a measure to allow the validation of the results obtained by providing a similar point from which to cross-check the SM measurements.

Environmental covariates
A high-resolution (5 m) digital surface model (DSM) obtained from SAEON was generated using light detection and ranging data.The DSM was used to extract terrain-based covariates such as the slope, aspect, relative relief, drainage density, and topographic wetness index.

Sentinel 1 imagery
The European Space Agency's Sentinel-1 C-band SAR twin satellite constellation was used to obtain the 10-m resolution ascending track backscatter coefficients for vertical transmit/vertical receive (VV band) single co-polarization and vertical transmit/horizontal receive (VH band) dual-band cross-polarization (Table 1).

Sentinel 2 imagery
The European Space Agency's Sentinel-2 twin satellite provides 13 spectral bands in the visible/near-infrared and short-wave infrared spectral range.Sentinel 2B was used to  2).

Google Earth Engine and R Studio
Google Earth Engine (GEE) is a cloud-based platform with high computational speed, is easily applicable to broad areas as well as performs successful analyses and visualization of big earth datasets, which can easily be made available online (Mehmood et al., 2021;Piao et al., 2022;Vanama et al., 2020).GEE was used to extract and process various satellite imageries such as Sentinel, Landsat, and LST.Finally, GEE compiled the composite remote sensed images exported as time series in Excel.The data were further processed before use for model development in R Studio.In R Studio, the dataset was divided into training (66%) and testing data (33%).Afterward, machine learning models were applied to the data to estimate SM.

Machine learning models
Random forest.A random forest (RF) model selects the training samples randomly and provides many decision trees (Abbes & Jarray, 2022).The method has been widely used in remote sensing applications because it has a fast training speed (i.e., can handle big data with numerous variables) and has high flexibility and accuracy (Clewley et al., 2017).In this study, the RF model was used because of its flexibility in combining and extending covariates of different types, which is suited for the estimation of spatial variables such as soil properties (Gagkas & Lilly, 2019;Hengl et al., 2018;Tramblay & Quintana Seguí, 2022).(Kemppinen et al., 2018).One of the considerations for the choice of GLM in this study was the advantage GLM has over the conventional linear regression model.For example, it is more flexible because it can assess nonlinear relationships and works when output variables are not continuous (Srivastava et al., 2013).
K-nearest neighbors.K-nearest neighbors (KNN) is a nonparametric method that finds the k number of nearest neighbors from a reference dataset of input attributes ( de Oliveira et al., 2021).The reason for selecting KNN is that it has proved easy to implement, is simple, efficient, and can be used for classification and regression problems (Yamaç et al., 2020).accuracy of derived variables (Greifender et al., 2021;Maia et al., 2022).The GEE platform provides a powerful platform for such analyses, allowing users to integrate various Earth observation datasets and apply advanced algorithms.
The hybrid model was selected in this study to assess the effect of combining more than one machine learning algorithm on the estimation of SM over the chosen study period.

Methods
Figure 3 outlines the approach in which the Digital Soil Mapping framework from Dobos et al. ( 2006) was adopted and applied in this study.

Terrain variables
The high-resolution 5-m DSM (Figure 4) enabled a detailed depiction of the terrain features, improving the spatial accuracy of covariates.Using the Spatial Analyst Tools, the slope and aspect were determined in ArcGIS with the DSM as the input raster.The aspect (Figure 5) is the cardinal direction a slope faces.This is an important parameter as it influ-ences the terrain's exposure to sunlight, thus forming variable microclimate regions across neighboring slopes (Kibirige & Dobos, 2020).For instance, north-facing slopes often receive less exposure to sunlight, and therefore more moisture can be observed owing to a cooler microclimate.The slope (Figure 6) can be considered a measure of the steepness or flatness of the catchment.It is an important parameter as it influences the movement, the rate of runoff generation, and runoff erosivity.Furthermore, the variations in catchment slope can define areas where moisture will pool (viz., the gentle slopes) or rapidly run off (viz., the steep slopes).
Relative relief (Figure 7) can be described as the difference in altitude between the highest and lowest point within a cell of interest.Similar to the slope, this parameter influences the generation and movement of moisture in a catchment.This parameter was obtained by first applying a fishnet over the DSM in ArcGIS, after which the Zonal Statistics as Table tool was used with the fishnet as the input feature zone data and the DSM as the input value raster.The RANGE was selected for the statistic type, and the output range values were subsequently joined to the fishnet labels.Finally, the inverse distance weighted was applied using the fishnet points to obtain relative relief.The topographic wetness index (TWI) (Figure 8) was determined by first using "Spatial Analyst Tools » Hydrology » Flow Direction" with the DSM as the input raster to obtain a flow direction raster.The following steps were performed using the flow direction raster: "Spatial Analyst Tools » Hydrology » Flow Accumulation" to determine the flow accumulation.Then, the slope of each cell of the DSM was determined using "Spatial Analyst Tools » Surface » Slope", after which this raster was used as an input for the raster calculator to determine the slope using Equation (1).The tanslope was then obtained using Equation (2) in the raster calculator.The flow accumulation was then appropriately scaled using Equation (3), after which the TWI could be calculated using Equation (4).

Slope =
"Slope (DSM) " × 1.570796 90 , (1) scaled flow accumulation = ("flow accumulation" + 1) The drainage density (Figure 9) was determined using the following steps: "Spatial Analyst Tools » Density » Line Density" with a previously generated stream raster.Drainage density is the river length per unit of the catchment area.This parameter indicates the degree of the catchment's erosion and enables an understanding of the possible catchment characteristics.For example, a lower drainage density is generally a characteristic of areas of low relief, high vegetation cover, and high permeability soils.

Satellite data processing
The methodology utilized to collect and process RS data was GEE, and subsequent processing in Microsoft Excel, the R statistical program, and RStudio is detailed in Figure 11.Four variables were retrieved in GEE: Elevation, backscatter coefficients (VV and VH bands), and NDVI.They were retrieved at the monthly time step from a constellation of satellites in the GEE, after which Microsoft Excel was used to consolidate monthly observed data and extracted terrain properties.

Sentinel-1 data
Sentinel 1B image ground range detected product was selected because the product consists of SAR data that have been multi-looked detected, multi-looked, and projected to ground range using an Earth ellipsoid model as opposed to a single look complex (Kibirige & Dobos, 2020).Thereafter, the image is subjected to an accurate selection of orbit state vectors, thermal and noise removal, calibration, speckle filtering, terrain correction, conversion of an image into dB using Equation ( 5) and finally, a raster image for each day containing the mean σ(dB) in VV and VH polarizations (Figure 10).

LST data
LST was computed using the Statistical Mono-Window algorithm developed by the Climate Monitoring Satellite Application Facility for deriving LST climate data records from Meteosat First and Second Generation (Ermida et al., 2020;Duguay-Tetzlaff et al., 2015).The approach is based on an empirical relationship between top-of-atmosphere (TOA) brightness temperatures in a single TIR channel and LST and utilizes simple linear regression.The model consists of a linearization of the radiative transfer equation that maintains an explicit dependence on surface emissivity: where T b is the TOA brightness temperature in the TIR channel, and  is the surface emissivity for the same channel.The algorithm coefficients A i , B i , and C i are determined from linear regressions of radiative transfer simulations performed for 10 classes.

Multicollinearity testing
The variance inflation factor (VIF) was utilized to identify the multicollinearity between the derived covariates.Subsequently, the variables with a high correlation to other variables (i.e., those having a VIF greater than 10) were removed (Kutner et al., 2005).The remaining covariates were selected for SM estimation using a geostatistical method and machine learning algorithms.

Model development
The literature reviewed for the study enabled the selection of commonly used ML algorithms (Chaudhary et al., 2022;Gangat et al., 2020;Gao et al., 2014;Han et al., 2018;Yamaç et al., 2020;Yao et al., 2021).The dataset was then separated for training and validation (66.6%) and testing (33.3%) to develop and evaluate models effectively (Figure 11).The training set is used to train the model's parameters (weights and biases) based on the input features and corresponding target variables, that is, SM.The validation set tunes hyperparameters and monitors the model's performance during training.This helps prevent overfitting (when the model performs well on the training data but poorly on new, unseen data).This step would aid in assessing the models' efficiency in predicting using "unseen" data.Thus, the training phase made use of the SVM, RF, KNN, Rpart (better known as CART), and GLM based on the frequency of appearance and good performances in the literature.The R program assessed individual model performances using 10-fold cross-validation with 10 repeats.Each model that produced a satisfactory performance (i.e., R 2 > 0.5) was retained for inclusion in the ensemble models.Following that, two ensemble models were established by utilising RF and GLM techinques for aggregating the individual models selected during the validation process.This step combines the outputs of the models and aids in maximizing the strengths of the individual models for improved prediction.This study selected the bestperforming ensemble as the hybrid model for SM prediction.Individual and ensemble models were assessed using the R 2 and root mean square error (RMSE) values.

RESULTS AND DISCUSSION
The following sections present the results of multicollinearity testing of variables and the assessment of individual and ensemble models.Afterward, the results of algorithm training and testing are presented.

Multicollinearity testing
The study initially identified 10 covariates to assist in the estimation of SM.A correlation plot (Figure 12) was utilized to determine the degree to which the individual covariates correlate.The plot revealed that VV, VH, and NDVI share a high correlation with SM, while LST had the lowest correlation with SM.Additionally, the plot revealed that VV shared a T A B L E 4 Multicollinearity testing using the variance inflation factor (VIF) before and after removing covariates with a high correlation.high degree of correlation with VH and NDVI, and the relative relief and drainage density shared a high correlation with elevation.Thus, VV, relative relief, and drainage density were identified as covariates for removal.

Covariate
To diagnose the multicollinearity between the individual covariates, the VIF was determined in the R program (Table 4).The test revealed the presence of multicollinearity within the dataset.Thus, based on the variable correlation identified in Figure 12, the VV, relative relief, and drainage density were omitted from the predictor variables.Removing these covariates resulted in a lower VIF, enabling reliability in model results.

Assessment of individual models and the ensemble models
A statistical analysis of the individual models (Table 5) revealed that the selected models are suitable for prediction in the current study (i.e., each produced an R 2 > 0.5 in the validation stage).Additionally, the two ensemble models were assessed, the results for which are shown in Table 6.The superior performance of the ensemble stacking using the RF informed its selection as the hybrid model employed in this study.

Model performance in the training phase
The training phase utilized two-thirds of the observed data: the observations recorded from the upper and middle regions of the catchment.For this phase, the overall performance results are provided in Figure 13, where the general hybrid model achieves a satisfactory performance (r = 0.95 or R 2 = 0.91) across the eight sites reserved for training.
It can be noted in Figure 14 that the model had a superior performance in Catchment IX (R 2 = 0.93) than in Catchment VI (R 2 = 0.88) when simulating for "seen" data.Furthermore, it can be observed that fewer points deviate from the linear trendline.
The model can also adequately simulate the general trends in SM fluctuations, as evidenced in Figure 15, occasionally under-simulating the extreme ends of the dataset.This is especially evident in the simulation performed for the middle and  higher and lower values in the observed dataset are undersimulated in the model.This highlights the model's ability to adequately simulate the average SM content and limitations in accounting for extremes in the dataset.

Model performance in the testing phase
The remaining observed data not used during training and validation were utilized for the testing phase.The overall performance achieved for this phase is provided in Figure 18, which shows that the general hybrid model achieves an adequate performance (r = 0.67 or R 2 = 0.49) across the four sites reserved for testing.
Figure 19 illustrates that the model performance experiences a reduction when simulating for "unseen" data, as evidenced by the new r = 0.67 for Catchment VI and R 2 = 0.33 for Catchment IX.The model's performance in Catchment VI is superior to that observed in Catchment IX, contrary to the results obtained in the training phase.For both catchments, more points can be observed to deviate from the linear trendline.
A closer examination of the results in Figure 20 illustrates that the model can adequately anticipate high and low SM periods but has a delayed response.In Catchment VI, the model does not appear to have an under-simulation trend for high SM conditions, as was observed in the training dataset.However, on one occasion in July 2020, the low SM F I G U R E 1 7 Comparison of the observed and simulated 5-number summaries on "seen" data.
conditions under-simulated in both sites 5 and 6, whereas the rest of the trend shows an over-simulation.
In Catchment IX (Figure 21), the model appears to oversimulate in site 5 and under-simulate in site 6, influencing the mean graph to highlight an "under-over-under" simulation pattern.The SM fluctuations are well anticipated in site 5 of Catchment IX; the model predominately over-simulates the "low-moderate" SM conditions.Conversely, the model undersimulates site 6 throughout the study period.Compared to the training dataset, the mean graphs for both Catchment VI and XI show slight improvement; however, the trends in Catchment IX are not constant and fluctuate based on the site 6 poor model performance.
Figure 22 reveals that maximum, mean, and median values are well simulated, while the lower quartile and minimum values are under-simulated.This highlights the model's ability to adequately simulate the high and average SM conditions and limitations in accounting for the lower SM conditions in the dataset.

DISCUSSION
The results demonstrate that the general hybrid model developed in this study reasonably predicts SM from a combination of static (terrain-related) and dynamic (RS-related) attributes in the Cathedral Peak Research catchments.The study benefitted from using the GEE, a free cloud-based computing platform that simplifies the task of processing sizeable geospatial datasets by employing user-friendly, interfacedeveloping algorithms (Gorelick et al., 2017;Mutanga & Kumar, 2019).This enabled quick access to a 2-year record of the RS data used in this study.Furthermore, the study benefitted from using the R coding environment through RStudio to apply the ML models and create a hybrid model.Like the GEE, the RStudio program simplifies applying ML algorithms by using a comprehensive library of ML packages that can easily be added to the coding environment.
A summary of the individual model performances is reported in Table 5 regarding the models initially identified for the study.Furthermore, a summary of the preceding literature on topsoil SM estimation studies and the reported model performances similar to the current is provided in Table 7.
Comparing the results of this study to other studies, it can be noted that the RF had superior performance compared to the other models forming the ensemble, a result comparable to Adab et al. (2020), who utilized terrain-related data and RS data to predict SM in a semi-arid region and found that the RF model exhibited a better performance from the selected models (viz., SVM).Similarly, Sedaghat et al. (2022) also found that the RF method was considerably higher than the MLR method in estimating surface SM using pedotransfer functions and Sentinel 2 imagery.Chaudhary et al. (2022) reported a similar finding, showing that the RF method was the best during the training stage; however, RF was second out of 12 machine learning methods during testing.
Additionally, the SVM model displayed a satisfactory performance that was second to that of the RF, also complementary to the work of Adab et al. (2020), whereas SVM was superior.On the contrary, the work of Uthayakumar et al. (2022) highlighted that linear regression was superior to the SVM method in a very limited dataset (<5 target values).A study by Sanuade et al. (2020) revealed that the SVM model was most suited for the study area situated within the University of Ibadan campus in Nigeria; these results are markedly better than what is achieved in the current study; however, the improved estimation may have been enabled by a smaller catchment area and a denser sampling network (i.e., 75 sampling points in 0.0253 km 2 ).The GLM displayed a good performance compared to Kibirige and Dobos (2020), who used an MLR approach.While the error metrics obtained for the current study are satisfactory and share commonalities with the previous literature, the comparison is difficult given the vastly different catchment land uses, areas, and climatic conditions.
Between the training and testing phases, the hybrid model experiences a 50% performance reduction (i.e., from R = 0.95/R 2 = 0.91 to R = 0.67/R 2 = 0.49).Assuming that the model is not overfitting, this observation may result from a difference in the variability captured in the training dataset versus the variability present in the testing dataset.A closer examination of the results illustrates that the hybrid model performed better in Catchment IX (R 2 = 0.93) than in Catchment VI (R 2 = 0.88) during training, while the opposite was true for the testing phase, with the model achieving an R 2 = 0.33 in Catchment IX and an R 2 = 0.67 in Catchment VI.Based on this, overfitting in this model appears unlikely; instead, the overall performance appears to be reduced by a poor prediction in Catchment IX.
Sites 5 and 6 in Catchment IX display relatively low R 2 values.The poor model performance in site 6 may be attributed to largely different covariates that remain "unseen" by the model in training.Observation of sites 5 and 6 using satellite imagery reveals that the vegetation cover for both sites is relatively similar.Therefore, the main differences identified are only related to the actual terrain.Site 6 has the least steep slope and the highest TWI of any of the sites included in the training, thus providing a possible explanation for the performance reduction observed since the model had not been trained to accommodate such conditions.Both the slope and TWI are related to the topography of the catchment.
Through these results, the dependence of SM on the terrain has allowed topography to be a principal predictor in many mapping applications (McBratney et al., 2003).This finding agrees with the works of Martínez-Murillo et al. ( 2017), who found a significant influence of terrain on the SM, and Chung et al. (2022), who found that the addition of terrain attributes aided in improving model performance in an ML approach for SM estimation.From the correlation plot, it could be assumed that the LST is not a major predictor for SM since it had the lowest correlation to SM, as shown in Figure 12, contrary to the findings of Mohamed et al. (2020) and Ahmadi et al. (2022).However, a sensitivity or principal component analysis would be required to determine the precise influence of this variable on a larger dataset in the catchments.F I G U R E 2 2 Comparison of the observed and simulated 5-number summaries on "unseen" data.

CONCLUSION
This study developed a general predictive model to estimate SM in the Cathedral Peak Research Catchments, South Africa, using an ensemble of ML algorithms (viz., RF, SVM, Rpart, KNN, and GLM).Furthermore, the individual performances of some models were assessed in the model validation before testing the ensemble.Overall, the model successfully used a small sample size and achieved an R = 0.67 and RMSE = 5.82.These results highlighted the efficiency of RF and SVM models in SM prediction studies and re-emphasized the importance of terrain-related covariates for such applications.
The resultant model was able to closely anticipate the fluctuations of SM over time, occasionally under-simulating the low SM conditions and managing to simulate the high SM conditions well.

Vadose Zone Journal
The study explored the limits of a small dataset in ML applications and identified the threshold to which such models may be applicable in data-limited regions.In total, 160 samples were utilized for training, producing a decent testing performance.However, the significant performance reduction in Catchment IX, site 6 may indicate some restrictions that can be anticipated when training with a small dataset.Since the model was trained using data from both catchments, it does not appear to have any site-specific conflict.However, due to the proximity of the catchments, it is difficult to identify how the model would perform in a completely "unseen" catchment.
The study could benefit from better integrating spatial representivity by presenting SM maps to improve the current results.The uses of the model developed in this study could expand into drought forecasting in Cathedral Peak catchments with limited monitoring or aid in infilling SM data for catchments with sparse records.Further, this model used moderate resolution RS products to predict SM for a point by establishing a relationship between respective covariates; thus, this modeling could also be applied in the down-scaling of RS products using SM.Currently, flood forecasting may be an impractical application of the model owing to the coarse temporal resolution.However, if the time scale is adjusted to a finer resolution (daily or hourly), the feasibility of use as a flood warning system may be explored in future works.

A C K N O W L E D G M E N T S
This work was part of a Hons research project at the University of KwaZulu Natal financed by the National Research Foundation (NRF), South Africa.We are also thankful to SAEON for providing us with in situ SM data for the chosen period of the study.

C O N F L I C T O F I N T E R E S T S T A T E M E N T
The authors declare no conflicts of interest.
Specifications of the Sentinel-2B data used in the study.bands 4 (red) and 8 (near infrared; NIR) (Table Hybrid model.A hybrid model combines multiple data sources and machine learning techniques to improve the F I G U R E 3 Flow chart methodology utilized in the current study (adapted from Dobos et al., 2006).
Elevation maps for Catchments VI and IX.
Aspect maps for Catchments VI and IX.F I G U R E 6 Slope maps for Catchments VI and IX.

F I G U R E 7
Relative relief maps for Catchments VI and IX.σ(dB) = log 10(DN), (5) where DN is the raw digital number of each pixel on the image Sentinel-2 data To obtain the NDVI, ESA's Sentinel-2 Multispectral Instrument constellation was employed to retrieve the red (B4) and Vadose Zone Journal F I G U R E 8 Topographic wetness index (TWI) maps for Catchments VI and IX.F I G U R E 9 Drainage density maps for Catchments VI and IX.NIR (B8) bands, each with a 10-m resolution.Due to the sensitivity of vegetation of cloud cover, a filter was used only to select images with <10% cloud cover.After which Equation (6) was utilized to obtain the NDVI.
Sentinel 1 image pre-processing.GRD, ground range detected.F I G U R E 1 1 Summary of methodology and applications utilized in data collection, processing, and subsequent modeling.LST, land surface temperature; RS, remote sensing; SM, soil moisture.

F
Correlation plot of the 10 covariates initially identified.DD, drainage density; VV, vertical transmit/vertical receive band; VH, vertical transmit/horizontal receive band; NDVI, normalized difference vegetation index; LST, land surface temperature; TWI, topographic wetness index.

F
Hybrid model performance on "seen" data.*** refers to no significance.upper regions of Catchment VI, where the simulated data missed high extremes in the upper regions and low extremes in the middle regions.Conversely, the model produces a superior performance in Catchment IX's upper and middle regions (Figure 16), as evidenced by the improved simulation of the extremes in the dataset.However, the model continues under-simulating extreme (low and high) SM conditions from April to July 2020.The observations made in Figures 15 and 16 of both Catchment VI and IX-Mean graphs are reaffirmed by the results presented in Figure 17, which illustrates that the means of the simulated and observed datasets are similar.In contrast, the F I G U R E 1 4 Relationship between the observed and simulated soil moisture (%) utilized in the training phase.Observed and simulated soil moisture utilized in the training phase (Catchment VI).F I G U R E 1 6 Observed and simulated soil moisture utilized in the training phase (Catchment XI).

F
Hybrid model performance on "unseen" data.*** refers to no significance.F I G U R E 1 9 Relationship between the observed and simulated soil moisture (%) utilized in the testing phase.F I G U R E 2 0 Observed and simulated soil moisture utilized in the testing phase (Catchment VI).
Specifications of the Landsat data used in the study.
T A B L E 3 vector machine.Support vector machine (SVM) is a supervised algorithm for classification and regression.It is normally used when the data are not regularly distributed.It is one of the most robust prediction methods because the SVM (Han et al., 2018) builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.The advantages of SVM are it is known not to suffer from overfitting, which is common in many spatial and temporal estimations of soil properties (J.Wang et al., 2022).Classification and regression trees.Classification and regression trees (Rpart/CART) is a predictive algorithm that explains how the target variable values can be predicted based on other variables.One of the features of CART is that in a large area with complex surface features, such as the catchment in Cathedral Peak, the CART algorithm can divide the data into multiple areas with lower complexity and then use piecewise linear fitting (real-valued function of a real variable, whose graph is composed of the straight-line segment) instead of global fitting(Han et al., 2018).The difference between classification and regression is that classification predicts discrete class values, for example, True or False, Spam or Not Spam, that needs to be split into classes that belong to a variable response, whereas regression helps to predict a continuous quantity, for example, price, salary, and age.Generalized linear model.A generalized linear model (GLM) is a non-parametric extension of linear regression models that allow non-normally distributed response variables Individual model performance during model validation.Ensemble model performances in the validation phases using different stacking methods.
T A B L E 5 T A B L E 6Abbreviations: GLM, generalized linear model; MAE, mean absolute error; RF, random forest; RMSE, root mean square error.