Can eXplainable AI Offer a New Perspective for Groundwater Recharge Estimation?—Global‐Scale Modeling Using Neural Network

Due to the difficulties in estimating groundwater recharge and cross‐boundary nature of many aquifers, estimating groundwater recharge at large scale has been called upon. Process‐based models as well as data‐driven models have been established to meet this need. Meanwhile, with the advent of explainable artificial intelligence (XAI) methods, data‐driven machine learning models can take advantage of enhanced explainability while keeping the strength of high flexibility. In this study, an ensemble neural network model was built to check the suitability of the model to predict groundwater recharge and the possibility to gain new insights from large data set. Recent large inputs of groundwater recharge data and additional input for the Arabian Peninsula collated in this study were fed to the model with multiple predictors related to climatology considering seasonality, soil and plant characteristics, topography, and hydrogeology. The model showed higher performance (adjusted R2: 0.702, RMSE: 193.35 mm yr−1) than a recent global process‐based model in predicting groundwater recharge. Using XAI methods as individual conditional expectations and Shapley Additive Explanation interaction values, the model behavior was analyzed and possible linear and non‐linear relationships between the predictors and the groundwater recharge rate were found. Long‐term averaged precipitation and enhanced vegetation index showed non‐linear relationships with groundwater recharge rate, while slope, compound topographic index, and water table depth showed low importance to the model results. Most model behaviors followed the domain knowledge, while multi‐correlation between predictors and data skewness hindered the model from learning.


Introduction
Estimating the groundwater recharge rate has been one of the most important tasks for hydrologists since it constitutes one of the key parameters determining the safe yield-the amount of water that is replenished and thus can be sustainably withdrawn (Blöschl et al., 2019;Oki & Kanae, 2006).Currently, rapid decline of groundwater levels has been observed worldwide and accelerated during the past four decades in 30% of aquifers (Jasechko et al., 2024).This is especially alarming in regions where surface water resources are either generally scarce, not constantly available in time and space, or contaminated, so that the population is primarily dependent on groundwater (Schulz et al., 2024).Many aquifers lie across national boundaries, which require large-scale models for policy decisions.Moreover, groundwater reserves should be considered in the context of global climate change (Gleeson et al., 2021).In this context, hydrological models for analyzing groundwater recharge at the continental or global scale are helpful to consider different interlinkages of climate change and water resource availability.However, regionalization of groundwater recharge on a global scale is challenging, mainly due to the lack of consistent data with high accuracy for groundwater recharge rates.Particularly at larger catchments, estimation of groundwater recharge largely depends on methods under certain assumptions due to the lack of direct measuring methods (Healy & Scanlon, 2010).Moreover, at the large scale, uncertainty still remains regarding the relationships between groundwater recharge and its governing factors (Mohan et al., 2018;Morbidelli et al., 2018;Vereecken et al., 2019).These factors can vary significantly depending on spatial and temporal location and its scales, as the groundwater recharge mechanisms to be considered differ.Also, there are no standardized criteria in practice for assessing the accuracy of estimation (Healy & Scanlon, 2010;MacDonald et al., 2021).
Previously, there have been considerable attempts in developing models to estimate groundwater recharge rates at larger scales.These comprise process-based models (de Graaf et al., 2015;Döll & Fiedler, 2008;Müller Schmied et al., 2021) as well as data-driven models (Berghuijs et al., 2022;MacDonald et al., 2021;Mohan et al., 2018).Usually, process-based models such as land-surface models and hydrological models are targeting on multiple components of the hydrological cycle.However, these models tend to oversimplify the groundwater component, which result in mismatches between field observations and model results (Scanlon et al., 2018;Soltani et al., 2021).Moreover, several model parameters have to be calibrated with a single or a few sets of observation data, which often leads to the problem of non-unique solutions, where the model results in good performance regardless of each component's low fidelity (Beven, 2006;Her & Seong, 2018).On the other hand, there have been data-driven models, which are entirely based on the statistical properties of the data at hand.Different model designs can be applied in this statistical sense.In the study of Mohan et al. (2018), multiple regression models with multiple predictors were used to examine the relationship between predictors and groundwater recharge in a global scale model.This model was based on a data set of 715 sites that were highly skewed to arid and temperate regions.There, multi-collinearity among the predictors was found, but not thoroughly explored in terms of model behavior (Mohan et al., 2018).A linear mixed model and a non-linear regression model with a sigmoid function were applied to build predictive models.Each model performs well at the regional scale (African continent; MacDonald et al., 2021) and global scale (Berghuijs et al., 2022).These studies reassured that groundwater recharge process was firmly constrained by climate factors, especially precipitation.However, the model designs were less flexible since the model structures were predefined before parameters were fitted by the data.Thus, even though other factors such as soil characteristics are expected to affect the groundwater recharge, adding more factors doesn't increase accuracy of the models.
There have been substantial advances of machine learning (ML) in the data science field.For predicting groundwater levels that can be observed relatively easily, ML models have been used successfully at regional scales (Haaf et al., 2023;Wunsch et al., 2022).For the European continent, there is even a groundwater recharge map based on ML, which uses national survey data as training data (Martinsen et al., 2022).However, apart from a few regions with accessible and consistent national survey data, there has been the chronic deficiency of groundwater recharge data.Not only data-driven models but also existing process-based models at the regional and global scale usually have been limited by a relatively small number of field measurements in the subsurface (Seibert et al., 2024) which limits the verifiability and thus constrains the model's credibility.However, recently, comprehensive data sets of ground-based groundwater recharge estimates (i.e., estimates based on field measurements, excluding satellite-based measurements) have been presented by Mohan et al. (2018), Moeck et al. (2020) andMacDonald et al. (2021), which, if used collectively, could address the problem of insufficient data.For large-scale groundwater recharge prediction or its upscaling, various machine learning algorithms can be applied, ranging from classic regression models with low computational cost to computationally expensive neural network (NN) based deep learning models.Thanks to the great increase in computing power in this era, NN models with high computational demands have become more and more common.However, despite their high flexibility NN models have been criticized for being "black boxes" where explaining the model's behavior is difficult.Therefore, great efforts have been made to develop methods for ML model explanation, so-called explainable artificial intelligence (XAI).Using these XAI methods, models with high complexity and high flexibility can also take advantage of high interpretability (Ali et al., 2023).
Given the recent release of large data sets from the aforementioned meta-studies and the advances of ML models in the data science field, two objectives are addressed in this study: (a) The suitability of the ML model is investigated for the model to predict groundwater recharge rates with unevenly distributed, high dimensional data from ground-based groundwater recharge estimates.This is tested using three different validation strategies (Gleeson et al., 2021)

Collection of Groundwater Recharge Estimates
Estimated groundwater recharge rates have been collated from two global-scale meta-studies (Moeck et al., 2020;Mohan et al., 2018) and from a meta-study for the African continent (MacDonald et al., 2021).Additionally, 92 estimates for the Arabian Peninsula have been collected in this study (Jung et al., 2024).Collected data represent naturally occurring recharge from precipitation.Artificial recharges such as from irrigation return flows and managed aquifer recharge are omitted.If several estimates are available for different periods at one location, the value for the longer period is selected.Studies with a period of less than 1 year are also removed from the data to avoid bias due to seasonal effects in the meta-studies by Mohan et al. (2018) and Moeck et al. (2020).After removing duplicated studies, 5541 samples have been obtained in total.Geographic coordinates as well as groundwater recharge rates from each study have been compiled (Figure 1).
Diverse methods have been used to estimate the groundwater recharge, categorized into water balancing methods, water table fluctuation, and methods using environmental tracers such as the chloride mass balance.The use of different methods to determine recharge rates is known to result in varying estimates (Crosbie et al., 2010).In the data collected by Moeck et al. (2020), tracer methods, which are also the dominant estimation method in other meta-studies, account for about 80%.Nevertheless, the estimation method was not considered in the training data set used in this study.This is due to the lack of relevant information.Many studies did not specify the underlying Water Resources Research 10.1029/2023WR036360 method because they obtained data from gray literature in which the estimation method was not specified, or the methods were omitted when collating the data from the previous meta-studies.

Predictors
A large number of factors possibly influencing groundwater recharge were selected and generated.Among them, 20 were finally selected as input predictors for the further neural network development.The selection was based on inspecting the feature importance derived by applying a random forest approach to all possible influencing factors (Breiman, 2001), where all features with non-negligible feature importances were chosen.The final predictors with significant predicting-power for groundwater recharge are listed in Table 1.They can be acquired from publicly available data sets related to climatology, soil characteristics, topography, hydrogeology, and geographic location.Note that the omitted factors are not necessarily unimportant from a hydrological point of view, they are rather redundant from an information-content point of view.Further information on the predictors can also be found in the Supporting Information.
The climate database CRU TS v4.05 includes monthly means of globally interpolated observational data for 1901-2020 (Harris et al., 2020).Both annual mean and monthly mean values are calculated for the 120 years period.Subsequently, the intraannual variability (IAV, Equation 1) of precipitation, temperature, and potential evapotranspiration are calculated.
where IAV f is the intraannual variability of the climate variable f, F m is the monthly mean of f during the month m, and F is the annual mean of f.Water Resources Research 10.1029/2023WR036360 Also, vegetation as well as soil characteristics affect the percolation process.Therefore, the enhanced vegetation index (EVI) is derived based on MODIS Terra v6.1 satellite surface reflectance imagery (Didan, 2021).Monthly means of EVI between February 2000 to April 2022 are obtained, and data are averaged for the whole period.For the water storage capacity of the soil, relevant features such as bulk density, field capacity, profile available water capacity (PAWC) and wilting point are derived from IGBP-DIS set using a worldwide pedon and soil texture database (Global Soil Data Task Group, 2000).To characterize the topography, elevation as well as its secondary data such as the slope and the Compound Topographic Index (CTI), which is defined as the natural logarithm of a specific catchment area per tangent slope (Verdin & Survey, 2017), are derived from a void-filled digital elevation model (SRTM, Shuttle Radar Topography Mission; Jarvis et al., 2008).Water table depth is defined as the distance between ground level and groundwater table and resulted from inverse groundwater flow modeling (Fan et al., 2017).Runoff potential is extracted from hydrologic soil groups that have been compiled for runoff modeling (Ross et al., 2018).

Preprocessing the Data Set
Predictor values for each groundwater recharge estimate are extracted from gridded or vectorized data based on the coordinate of each sample.Data distributions for each predictor are shown in Figure S1 in Supporting Information S1.Since categorical features cannot be processed by neural networks, they are transformed into numerical vectors of zeros and a single one.The vector length equals the number of classes in a categorical predictor.The position of the one in the vector encodes the specific class of the predictor (one-hot encoding).This results in eight features for "Runoff potential," five features for "Köppen-Geiger climate classification" and 14 features for "Lithology."These 14 lithological classes represent the first level information that is the only one available on a global scale (Hartmann & Moosdorf, 2012).To prevent weighting of a feature based on its scale, all features are standardized by subtracting their mean from the values and by dividing by their standard deviation (Zscoring).

Model
Many algorithms exist for regression problems.These include machine learning algorithms such as neural networks (NN), which have been widely applied to numerous geoscientific problems (Reichstein et al., 2019).Their advantages are that they are universal, adaptive, and can well handle high-dimensional nonlinear problems (Irrgang et al., 2021).Therefore, in our approach, a feed forward neural network is chosen for estimating groundwater recharge rates.
Five different NN models (hereafter referred to as "block models") are built from five sets of randomly split training and test data with a ratio of 8:2 (cf.Chollet, 2017;Krizhevsky et al., 2017).Subsequently, the results of the five block models are averaged and hereafter referred to as ensemble model (Figure 2a).This strategy increases the ensemble model's robustness and prevents overfitting (e.g., Breiman, 2001).The hyperparameterconfiguration (i.e., number of hidden layers, number of neurons each layer, learning rate and dropout) of each block model is optimized by a hyperparameter search within the Tensorflow/Keras environment (Figure 2b, Abadi et al., 2016).To increase the model's stability and to favor generality during the model learning, weak signals are neglected by not being conveyed to nodes in the next layer through ReLU (Rectified Linear Unit) activation functions (Chollet, 2017).An option for dropout is used in all layers.The results can be seen in Table 2.
In general, all input layers consist of the 20 predictors (Table 1) followed by a first layer of 35 nodes.An output layer of one node was set for the groundwater recharge rate estimate.

Model Explanation
Two complementary XAI methods, namely aggregated individual conditional expectation (ICE, Goldstein et al., 2015) and Shapley additive explanations (SHAP, Lundberg & Lee, 2017), are used to tackle the "black box" nature of the neural network operator.The two XAI methods are based on different approaches and theories and are mutually supportive to each other's limitations.Aggregated ICEs are used to find out linear and non-linear relationships between the predictors and the groundwater recharge rate estimates and to measure the respective sensitivity within the neural network.SHAP measures the importance of a single predictor based on the performance decline of the model without this predictor.To measure this decline in a computationally feasible way, a predictor is randomly replaced by typical values of itself.Subsequently, the average performances of the models with the original values and the random values are compared.This is repeated for each predictor.
While aggregated ICEs are good at describing the non-linear relationships and the sensitivity of the groundwater recharge rates with respect to the predictors, the method has limitations since independent predictors are assumed but not guaranteed.This limitation is lifted in SHAP analysis as this method explicitly includes interactions between the predictors.At the same time, the limitation of SHAP is that it does not, in contrast to ICEs, approximate the model's behavior outside the range of the data set.
ICEs and SHAP values have been grouped according to the climate zone, and model behavior depending on the climate zone has been analyzed.Due to the small number of samples for cold regions and polar regions, only results in tropical regions, arid regions, and temperate regions are meaningful to explain and discussed mainly.

Model Performance and Stability
The ensemble NN model has provided predictions of the groundwater recharge rates across the globe at a resolution of 0.25°× 0.25°(Figure 3a).The distribution of global groundwater recharge rates generally aligns with Köppen-Geiger climate classification and the trend that has been reported in the previous data-based model ( Berghuijs et al., 2022) and process-based model (Müller Schmied et al., 2021).However, notable discrepancies are identified, particularly in the polar regions, where the ensemble NN model predicts significantly higher recharge rates.This deviation can be attributed to the absence of observation data in permafrost regions, especially in northern latitudes.The model behavior that have led to high groundwater recharge rates in permafrost areas are able to be explained with the support of XAI.The small number of observations in polar regions (3 Water Resources Research 10.1029/2023WR036360 samples) leads the predictor representing polar climate to have a high weight to compensate weak representation during the learning process.On the other hand, all three observations classified to polar area are located in a mountain near the Alpes, where the groundwater recharge rate is around 100 mm/yr.Thus, the model has trained itself radically with other several predictors to offset the high weight of the predictor for polar regions, which can be seen in the aggregated ICEs of polar region (Figure S9 in Supporting Information S1).Unlike the model behavior in the other climate regions, the model has related low EVI with high groundwater recharge rate and shown high sensitivities in elevation and slope.This indicates the risk that the AI model works as a "Mathematical Marionette" with scarce observation.In view of the limited availability of corresponding observational data, areas where the long-term averaged temperature is below 0°C are marked in Figure 3a.
The predicted values of the ensemble model have been compared with the observed values (Figure 3b).On average of all samples, the adjusted R 2 score is 0.702 and the Root mean squared error (RMSE) is 193.35 mm yr 1 with ≤25 mm yr 1 in 51% of samples and ≤100 mm yr 1 in 75% of samples.The model's error tends to be high in regions where high groundwater recharge has been observed, indicating a relative error of our model results rather than an error with a constant magnitude (Figure 3c).
The performance of the ensemble model has been scrutinized in different subgroups of data according to the climate zone, location and lithology (Table 3).Since the data is highly skewed to few dominant properties in most predictors (Figure S1 in Supporting Information S1), it might be expected that there is a higher chance for the model to be fit to only dominant properties.However, the model has shown relatively weak performances particularly over arid regions despite its large number of samples (cf., Oceania in Figure 3d and Table 3).The reason for this can be that, in contrast to the tradeoff often observed in classical regression approaches, a NN can fit several different relationships without jeopardizing each other.Consequently, one of the reasons of high RMSE despite a large data fraction can be that the observational errors, that is, measurement error and data uncertainty, in ground-based estimates of groundwater recharge as well as uncertainties of the predictors are relatively high compared to the scale of observed groundwater recharge in these arid regions.
The block models are established from 5 randomly split sets of training and test data.No big difference among the averaged performances of the block models derived from different train and test data sets has been found, which implies general stability of the model result (Table 2).Compared to the block model 1 (the block model of the best performance), the model's error decreases on most of the properties in the ensemble model, and improvement on less representative properties tends to be larger than the dominant properties.

Model Sensitivity and Non-Linear Behavior
The linearity of the relationship between each predictor and predicted groundwater recharge rate has been analyzed as well as the sensitivity of each predictor (Figure 4).It results that the two most sensitive predictors (normalized slope >100), long-term-averaged precipitation (LTA_P) and enhanced vegetation index (EVI), show non-linear relationships with groundwater recharge rate.Groundwater recharge rate increases exponentially as the average precipitation increases.This follows the domain knowledge that groundwater recharge rate can be estimated by a power function of the precipitation (Keese et al., 2005).The non-linear behavior in tropical regions between vegetation and groundwater recharge rate is shown by a positive relationship up to an EVI of about 0.3 and a subsequent drop in groundwater recharge as EVI increases.Interestingly, the sensitivity of EVI is greatly weaker in arid and temperate regions than in tropical regions.A possible explanation might be that distinctive vegetation is also an indication of sufficient rainfall.However, in the case of very dense vegetation (such as in tropical rainforests), at one point large interception losses caused by the dense canopies as well as uptake from plants reduce the water amount available for deep percolation and thus causing a negative effect on groundwater recharge.Such a relationship, characterized by an initial increase and then from intermediate tree cover on a subsequent decrease in groundwater recharge with an increase of canopy cover in the tropics, is also described by Ilstedt et al. (2016).
The other long-term averaged climate predictors are temperature (LTA_T) and potential evapotranspiration (LTA_PET).Both show high sensitivities and a fairly linear relationship with the groundwater recharge.Especially in the tropics, the groundwater recharge rate increases as temperature and potential evapotranspiration increase, which fits the domain as high temperatures, facilitating also high potential evapotranspiration, are associated with particularly high rainfall amounts in these environments (Adler et al., 2008).Also, the seasonality of weather variables, represented by the intraannual variability of climate predictors, can be a factor in this context as well.While the sensitivity of intraannual variance of temperature (IAV_T) is very low, IAVs of precipitation (IAV_P) and potential evapotranspiration (IAV_PET) show moderate sensitivities on the model.Especially, the sensitivity of the model to IAV_P is relatively high in tropical regions, despite that groundwater recharge has been known to have a seasonal trend outside the tropical region (Healy & Scanlon, 2010).Yet, it might support a recent finding that groundwater recharge in tropical regions is also more likely to occur from single heavy rainfall events rather than light rain (Jasechko et al., 2014;Jasechko & Taylor, 2015).
Another interesting observation is that the model is insensitive to predictors that have been expected to be relevant factors for groundwater recharge.This is noticed for some predictors related to topography and hydrogeology, which are slope (SL), compound topographic index (CTI), and water table depth (WTD).One possible explanation for the low sensitivity of the WTD variable could be that this variable can act in different directions.On the one hand, it is to be expected that high groundwater recharge will also tend to lead to high groundwater levels.On the other hand, high groundwater levels can also have a negative impact on the net groundwater recharge.This is the case when the groundwater water table is above the zero-flux plane, allowing capillary rise and root water uptake from groundwater.In addition, other predictor variables such as elevation or lithology, which can also be assumed to have an influence on WTD, could mask the effect of WTD on groundwater recharge.Accordingly, high sensitivities with linear relationships to groundwater recharge are shown by elevation (DEM).Predictions of groundwater recharge increase as elevation decreases.This behavior has been also observed in a study on groundwater recharge estimation for Europe using machine learning by Martinsen et al. (2022).The direction of the model behavior with respect to elevation, as well as slope (despite the lower sensitivity), suggests that precipitation tends to infiltrate in the plain lowlands, as expected (Jaafarzadeh et al., 2021).Like elevation, bulk density (BD) also shows a relatively high sensitivity and is negatively correlated with groundwater recharge, which is similarly observed by Martinsen et al. (2022).At first, this seems counterintuitive from a physical point of view, since finer materials show usually lower bulk densities than coarser ones and that those coarser materials are characterized by higher hydraulic conductivities, which favor percolation.Contrary to this assumption, however, there are also observations that it can be the other way around.For example, Price et al. (2010) showed in a study on soil hydraulic properties that bulk density can also be negatively correlated with hydraulic conductivity.The explanation probably lies in the fact that the processes and relationships in natural environments are complex and that a direct conclusion about one specific soil characteristic is not readily possible.In fact, bulk density can be partially correlated with porosity and thus water holding capacity, which also have an influence on groundwater recharge.This might also explain why other soil characteristics related to water holding capacity show relatively low sensitivity on the model, which are field capacity (FC), wilting point (WP), and PAWC.Surprisingly, they all show a positive effect on groundwater recharge, although their relationship to each other (PAWC = FC-WP) actually does not allow this.A possible explanation could be given by their low sensitivity caused by being confounded with bulk density.Interaction effect of bulk density and water holding capacity is described in a subsequent Section 3.3.2.
For category-type predictors, such as Köppen-Geiger climate classification (KG), Lithology (Li), and Runoff potential (RP), predicted recharge rates depend on the proportion of samples within a particular class rather than following the domain knowledge.For example, the model sensitivities are high with rare properties such as the climate zones cold and polar climate or the lithologies evaporate and intermediate plutonic rock (Table 3).This can be due to the one-hot encoding and subsequent normalization that often result in extremely high input values for rare properties.This weighting of rare properties could hinder the model from learning the process related to the categorical predictor if the data distribution is highly skewed.Thus, it should be carefully decided whether one-hot encoding and normalization are to be applied on skewed categorical data, depending on the model purpose.

Quantifying Contribution of Predictors
SHAP values, which represent a measure of the contribution of a predictor on the model's prediction, have been calculated for each predictor (Figure S2 in Supporting Information S1) and decomposed to a main effect (diagonal) and interaction effects (off-diagonal) in Figure 5.The tendency of SHAP values depending on a predictor in the samples describes the relationship between groundwater recharge and the predictor solely (main effect) or interaction with the other predictors (interaction effect).The magnitude of SHAP values is a measure of the importance of each predictor under the presumption that the predictor of high SHAP (large contribution to the model) has a high influence on prediction (Figures S4, S5, and S6 in Supporting Information S1).

Climatology
A clear positive relation between long-term averaged precipitation (LTA_P) and groundwater recharge can be observed for tropical and temperate regions (Figure 5 (a)).In arid regions, however, long-term averaged precipitation is not a determinant predictor and not even shows a clear tendency in the direction of impact in the ensemble model.In the neural network model that is only fed by the samples from arid regions (Figure S7 in Supporting Information S1), however, a clearly positive but weaker relationship of long-term averaged precipitation and groundwater recharge is observed as the fifth important predictor following vegetation, latitude, intraannual variance of precipitation, and potential evapotranspiration.
While seasonality (represented by the IAVs) of potential evapotranspiration (Figure 5 (b)) and precipitation (Figure 5 (c)) positively affect groundwater recharge predictions of the ensemble model in arid and temperate regions, the seasonal effect is relatively weak in both main and interaction effects in tropical regions.This corresponds to the domain knowledge, as groundwater recharge tends to follow seasonal weather fluctuations, which however are not or less pronounced in the tropics (Healy & Scanlon, 2010;Mohan et al., 2018).
Regarding long-term averaged temperature (LTA_T), its main effect is not significant in tropical regions, but as temperature increases, the main effect increases in the direction of increasing groundwater recharge in arid and temperate regions (Figure 5 (d)).In tropical regions, the interaction effect of precipitation and temperature is stronger than the main effect of temperature (Figure S4 in Supporting Information S1), and an increase in these predictors also leads to an increase in the predicted groundwater recharge when both precipitation and temperature have high values (Figure 5 (e), (f)).This might indicate that the model uses the temperature predictor at high precipitation rates as an indicator for tropical regions rather than following the expectation that groundwater recharge decreases with higher temperature, as this usually also results in higher evapotranspiration.

Soil Characteristics and Vegetation
First, regarding the main effects driven by one predictor, as bulk density (BD) is low, and as field capacity (FC) and PAWC is low, the main effects of each decrease the prediction in all climates (Figure 5 (g), (h), (i)).Also, for all climates, low bulk density decreases the interaction effects of water holding capacity (Figure 5 (j) and (k)), and vice versa (Figure 5 (l) and (m)).However, related to the interactions of bulk density and water holding capacity (FC and PAWC) with precipitation, they show different behaviors by climate.In tropical regions, in case of low bulk density and high water holding capacity, the interaction effect of precipitation changes the prediction positively, while it affects it negatively in temperate and arid climates (Figure 5 (n), (o), (p)).From this, it can be concluded that the effect of precipitation is affected by other predictors of soil characteristics, that is, precipitation effect decreases the predicted groundwater recharge rates if the soil has low bulk density and high water holding capacity in tropical regions and if high bulk density and low water holding capacity in temperate and arid regions.
Vegetation has been considered an important factor for groundwater recharge (Kim & Jackson, 2012;Moeck et al., 2020;Mohan et al., 2018).In this study, the predictor for vegetation (EVI) is more important in tropical and arid regions, but less important in temperate regions (Figure S2 in Supporting Information S1).However, the main effect of vegetation doesn't show a clear direction in tropical and temperate regions, but positive effect of vegetation is observed in arid region (Figure 5 (q)).In tropical regions, the effect of vegetation by precipitation decreases the prediction as long-term averaged precipitation is high (Figure 5 (r)).

Topography, Hydrogeology, and Lithology
At a global scale, orographic effects can be related to elevation.The main effects in all three climates increase as the elevation (DEM) becomes low (Figure 5 (s)), which may imply that the groundwater recharge increases in lower elevations.However, the interaction effect shows differences depending on the climate zone.In tropical regions, a positive interaction effect with precipitation is found as elevation is lower.In contrast, in arid regions, the interaction effect of precipitation decreases as elevation is getting lower (Figure 5 (t)).
The main effect of DEM might be explained by the mountain front recharge scheme consisting two mechanisms: mountain-block recharge, which occurs through the faults and fractures of the mountain bedrock, and surface mountain front recharge that occurs at the foot of the mountain by accumulated mountain-originated surface water (Markovich et al., 2019).Mountain front recharge is known to be dominant in arid regions (Markovich et al., 2019;Scanlon et al., 2006).Due to the orographic effect, the mountainous areas in arid environments receive a relatively large amount of precipitation (Whitford & Duval, 2020).Although precipitation often cannot infiltrate directly onto the slopes due to high rainfall intensity and the water-repellent effect of dry soils in arid regions, surface runoff collects in depressions and valleys (wadis), where it leads to increased groundwater recharge.Moreover, interaction effects of precipitation and elevation might imply that mountain block recharge can also engage significantly in arid regions.These interpretations are also underpinned by the topographical distribution of the training data (Figure S3 in Supporting Information S1).For example, within arid environments,

Water Resources Research
10.1029/2023WR036360 the sites in the mountain regions show about twice as much annual groundwater recharge (43 mm, n = 137) as in the lowlands (23, n = 1893).For the training data in the other climate zones, the situation is exactly the opposite.
In the mountain regions, the average annual groundwater recharge (140 mm, n = 180) is only about half as high as in the lowlands (359, n = 3166).
The predictors related to surface morphology give mere contributions to groundwater recharge prediction.Especially, slope and curvature have been reported to be sensitive predictors in domain knowledge and local and regional-scaled studies (Jaafarzadeh et al., 2021;Morbidelli et al., 2018).However, at large scales, the role of the slope has been still quite unclear and controversially discussed (Morbidelli et al., 2018;Moeck et al., 2020;Mohan et al., 2018;Vereecken et al., 2019).This might imply that slope and curvature constitute an important predictor for groundwater recharge at a small scale, but not on a larger scale, where elevation itself is of greater relevance.In addition to slope, the depth to water table affects weakly the model of this study, although it has been reported to have an impact on groundwater recharge (Carrera-Hernández et al., 2011;Moeck et al., 2020).
Unconsolidated sediments (Li_su) and metamorphic rocks (Li_mt) show meaningful importance among the boolean lithology predictors.The main effect increases groundwater recharge rates for unconsolidated sediments or metamorphic rocks in all climates (Figure 5 (u), (v)).However, the interaction effect shows differences between climates.In tropical regions, the interaction effect of long-term averaged precipitation tends to incline in case of prevailing unconsolidated sediments and metamorphic rocks.Contrarily, in arid and temperate regions, it tends to decline for unconsolidated sediments or metamorphic rocks (Figure 5 (w), (x)).Also, in arid and temperate regions, the interaction effect of elevation decreases the prediction for metamorphic rocks (Figure 5 (y)).On the other hand, contributions of lithology are affected by elevation.As elevation is low, the interaction effect of unconsolidated sediments is negative (Figure 5 (z)) and the interaction effect of metamorphic rocks is positive (Figure 5 (za)).While it can be assumed that unconsolidated sediments have a favorable effect on infiltration, it's expected that this would be not the case for metamorphic rocks.However, the model reacts opposite with the main effect for metamorphic rocks as expected.The prediction of the groundwater recharge rate depends more on interactions with other predictors rather than on a single predictor.

Causality and Comparison to Process Based Models
XAI can discover the associations required for the predicting model, but does not show necessarily causality.In this study, also associations of predictors that do not correspond to the expected behavior have been found.For example, the direction of the contributions of climate predictors, such as long-term averaged temperature and potential evapotranspiration, and the directions of the contributions of field capacity, profile available water, and wilting point are the opposite of the domain knowledge for some climate zones.This could be due to the fact that the related predictors were inter-correlated and confounded.Especially high multi-correlation has been found between the climate predictors and between field capacity and wilting point (Figure S8 in Supporting Information S1).
In the case of the aforementioned climate predictors, a simple conceptual causal model can be introduced to explain this.In the conceptual causal model, temperature increases potential evapotranspiration, both of which have negative impacts on groundwater recharge since potential evapotranspiration decreases the amount of available water.However, in the model of this study, precipitation, the strongest predictor, is not independent on both.For example, high precipitation is spatially associated with high temperature and high evapotranspiration in the tropics.Since the positive effect of precipitation is mathematically much stronger compared to the negative effect of temperature and potential evapotranspiration on groundwater recharge, groundwater recharge rate can seem to be increased as temperature and potential evapotranspiration increase.Thus, these two predictors could be considered in the model as positive factors in groundwater recharge.Especially, in this exemplary causal model, even though the temperature doesn't have a direct relation to groundwater recharge, it can be considered an effective predictor affecting groundwater recharge positively.Not only this "confoundedness" issue but redundancy among the predictors (e.g., field capacity and wilting point or intraannual variance of temperature and evapotranspiration) might cause underestimation of feature importance.These issues have been noticed and also expected in this study, due to the multi-correlation among the predictors observed in reality.Thus, confoundedness and redundancy are difficult to avoid in data-driven groundwater recharge estimation.Also, the set of predictors cannot be guaranteed to be complete.Not all relevant features for estimating groundwater recharge at a global scale are clearly determined.Even if the complete set of predictors would be determined, not all Water Resources Research 10.1029/2023WR036360 features are measurable.For example, groundwater withdrawal can be an important factor (Shamsudduha et al., 2011), but due to the lack of proper monitoring networks (or their availability) in large parts of the world, it is difficult to obtain reliable spatially and temporally continuous data on it.
Moreover, there are limitations that ground-based data inherently have.For example, uncertainties caused by different spatiotemporal assumptions of estimating methods, Improper application of methods due to lack of sitespecific knowledge or expense, non-existence of standard to evaluate the accuracy of different methods, and so on (Healy & Scanlon, 2010).Also, the size and skewness of the data set for groundwater recharge can be a limitation in the model learning process.Groundwater recharge is known to be a complicated process related to many variables.With increasing dimensionality, where the amount of data required to cover the variability of predictors increases exponentially, capturing all the relationships requires a considerably large amount of data as the number of variables increases.Besides, skewness can largely hamper covering the variability of predictors.Additionally, observational errors in ground-based estimating methods may make the model difficult to learn.Especially, assumptions regarding the temporal and spatial scale of the estimation are different according to the method, which can lead to discrepancies between estimates.For example, in regions where infiltrated water can take up to geological time scales to cause a groundwater system response (Cuthbert et al., 2019), the recharge rate can be greatly misestimated depending on assumptions and spatiotemporal scales of methods (Moeck et al., 2020;Yenehun et al., 2022).
Despite these limitations, the performance of the data-driven model is comparable to conventional process-based hydrological models for predicting groundwater recharge rates on a global scale, even though the model has not been built specifically for high performance.Compared to the estimated naturally occurring groundwater recharge from precipitation of the recently developed process-based hydrological model WaterGAP Global Hydrology Model v2.2d (Müller Schmied et al., 2021), the model of this study shows even better scores in common performance criteria for the prediction of the groundwater recharge rate (Figure 6).Root mean squared error (RMSE) and adjusted R 2 score are 328 mm yr 1 and 0.14 for WaterGAP, and 193 mm yr 1 and 0.70 for the ensemble NN model in this study, respectively.The data-driven model of this study shows its strength in lower variance and in a better fit in predicting the groundwater recharge at higher recharge rates.High performance for higher rates was also observed in another data-driven model (Berghuijs et al., 2022), where moving average over 10% of the data is used for model comparison (RMSE: 218 mm yr 1 in data-driven model while WaterGAP model shows 141 mm yr 1 ).In the study of Berghuijs et al. (2022), three other process-based models are additionally compared and show similar results to those of WaterGAP.However, in this context, it must also be noted that the data-driven model of this study aims at one target hydrological component, while the hydrological model is targeting multiple components in the hydrological system.Also, the hydrological model is built based on its embedded causal model (equations).Therefore, it can be free from misleading results driven by confoundedness and redundancy of the predictors.Also, enough information of the predictors that cannot be measured can still be given to the model through the relationships of predictors in the causal model, so the model can be built with observed data that are relatively easily obtained.Yet, since many components in hydrological models are usually calibrated with a few sets of data (usually river discharge), quite high inaccuracy for single components can exist (e.g., deep percolation), despite their high performance (non-unique solutions).Moreover, processes as described in hydrological models are known to be lacking scale-related theories and not necessarily valid at larger scale (Nearing et al., 2021).

Conclusion
Formerly, groundwater recharge estimating models are largely dependent on models with relatively low flexibility such as process-based models or regression models with a fixed structure.However, with the help of XAI, the advantage of using a model with high flexibility is being kept, while the result of the model is still explainable.XAI can help quantify the importance and sensitivities of predictors as well as diagnose misleading results.Also, the lack of large data has been one of the main obstacles, but recently, there have been large inputs of data for groundwater recharge rate, collated from a few comprehensive meta-studies, and additional data for the Arabian Peninsula were added to the database in this study.Since the model is structured with high flexibility based on the quite large observed data set, inversely, there is now an opportunity to learn the process of target mechanism from the model.The data-driven neural network model for estimating groundwater recharge followed mostly the corresponding domain knowledge on the impact and interaction of predictors.Despite the absence of the groundwater recharge rate estimates in permafrost area, this might imply that the collated data have been sufficient to learn some governing mechanisms of groundwater recharge and achieve predictions with an accuracy comparable to process-based conventional hydrological models.Also, this large data of ground-based estimates for groundwater recharge collated in this study can be used to optimize or validate a hydrological model or further applied for data driven model applications.
However, it needs to be emphasized that, in any case, AI and XAI must be applied with a fair degree of caution and thoughtfulness.Despite the high performance, neural network models can also generate misleading results due to limitations in observational data and predictors such as statistical issues like data skewness, confoundedness and redundancy among predictors.At the same time, deficiencies in the data sets themselves might influence the reliability of predictions.This includes inconsistent spatial and temporal scales of the data as well as data quality.Despite a steady increase and improvement in global data sets, this is an aspect that needs to be considered especially for larger studies outside the relatively few well-monitored regions.Thus, it is recommended that the results of XAI are carefully interpreted using a causal model and validation, taking hydrological process understanding into account, especially when the predictions involve future projections, such as impacts of climate change.
Can XAI offer a new perspective for groundwater recharge estimation?Yes, we think so, especially considering that big data in hydrology have started to become an issue with the advent of high-resolution measuring methods.Successively, data-driven models and machine learning based methods are expected to be seen much more in this research field with big data.By applying XAI, we identify and quantify a large range of possible connections of extensive predictors and groundwater recharge process.In some cases, it's been possible to go beyond the usual measures as correlation.What AI model and XAI can offer to the domain is to point out the direction of further research, and analyzing all these connections is a large amount of work that we intend to do in the further studies.
Furthermore, data-driven models are not dependent on human perceptions that are built on one's experience and carry a certain risk of not being able to foresee all relevant processes-especially when dealing with large spatial scales (Nearing et al., 2021).In this context, learning from data can provide additional insights, although this also requires explanations through causal models.Recently, besides theories in hydrology, causal models are actively being studied to answer this question in the field of data science (Pearl & Mackenzie, 2018).Since the real causal relationships are not known, observational causal inference methods based on data analysis might be applied further to understand causality (Chernozhukov et al., 2018).

•
Estimating groundwater recharge rates at global scale using an ensemble neural network model with 5541 observations and 20 predictors • XAI can quantify the sensitivity and importance of each predictor, showing non-linearities with long-term precipitation and vegetation index • Predictions show higher accuracy than the current process-based model, with most behaviors measured by XAI aligning with domain knowledge Supporting Information: Supporting Information may be found in the online version of this article.
: The model results are compared with additionally available observations (observationbased evaluation), with the knowledge of groundwater recharge processes characterized by different climate zones (expert-based evaluation), and with a previous hydrological model (model-based evaluation).(b) To gain new insights from the model, the model behavior is explored at a global scale for different climate zones using XAI.The significance and sensitivity of the predictors and the bivariate interaction effect among the predictors are quantified.

Figure 2 .
Figure 2. The structure of the (a) ensemble model and (b) the hyperparameter optimization for a set of training and test data.

Figure 3 .
Figure 3. Groundwater recharge predictions (a); model error (residuals) as average values on a 0.25°grid for the entire world, filtered with a 3 × 3 median filter for better visibility (b), and northern Australia (d); probability density functions of the relative errors for the different climate zones (c).

Figure 4 .
Figure 4. Individual conditional expectations of each predictor are aggregated by climate zone (Blue: Tropical, Yellow: Arid, Green: Temperate, Black: Average of all climate zones).5 to 95 percentile of predictor is displayed.95% confidence interval is shown around the average line over climate zones.Sensitivities are represented in each plot by maximum slope S max (mm yr 1 per unit of predictor) and normalized S max in brackets (no unit).

Figure 5 .
Figure 5. Highlight of main and interaction effects in tropical (blue), arid (yellow) and temperate (green) region.Selected features are displayed in the order of absolute value of SHAP.Feature value of the row is illustrated by lightness of color (dark: high value, light: low value).SHAP interaction values for more predictors can be seen in Figures S10, S11, and S12 in Supporting Information S1.

Figure 6 .
Figure 6.Pointwise comparison of modeled groundwater recharge rates from this study and Müller Schmied et al. (2021) with observations, based on the collected data of meta studies and own literature survey (cf.2.1.1.).Aggregating is carried out over an interval of 10 mm y 1 .

Table 2
Design and Performance of the Models for Different Training Data Sets With Number of Nodes Per Layer (L) and the Frequency of Dropout (D) JUNG ET AL.

Table 3
Performance of the Ensemble Model All units are "mm yr 1 " except number of samples and adjusted R 2 .