Revisiting Global Vegetation Controls Using Multi‐Layer Soil Moisture

The productivity of terrestrial vegetation is determined by multiple land surface and atmospheric drivers. Water availability is critical for vegetation productivity, but the role of vertical variability of soil moisture (SM) is largely unknown. Here, we analyze dominant controls of global vegetation productivity represented by sun‐induced fluorescence and spectral vegetation indices at the half‐monthly time scale. We apply random forests to predict vegetation productivity from several hydrometeorological variables including multi‐layer SM and quantify the variable importance. Dominant hydrometeorological controls generally vary with latitudes: temperature in higher latitudes, solar radiation in lower latitudes, and root‐zone SM in between. We find that including vertically resolved SM allows a better understanding of vegetation productivity and reveals extended water‐related control. The deep(er) SM control for semi‐arid grasses and shrubs illustrates the potential of deep(er) rooting systems to adapt to water limitation. This study highlights the potential to infer sub‐surface processes from remote sensing observations.

Reliable observation-based global photosynthesis proxies are available for recent years through satellite-derived sun-induced fluorescence (SIF, Frankenberg et al., 2011;Joiner et al., 2013). SIF data are increasingly used to study the relationships between global vegetation productivity and hydrometeorological drivers (Chen et al., 2020;Jiao et al., 2019;Wagle et al., 2016;Walther et al., 2019;Yang et al., 2015;Zuromski et al., 2018). Besides, spectral vegetation indices and biophysical parameters from multi-spectral satellite instruments are widely used to study drivers of vegetation phenology and productivity (Buermann et al., 2018;Forkel et al. 2015). In this study, we consider SIF alongside two spectral indices (the normalized difference vegetation index, NDVI, Tucker, 1979; and near-infrared reflectance of terrestrial vegetation, NIRv, Badgley et al., 2017), and a comprehensive set of explanatory variables representing energy (temperature; radiation; vapor pressure deficit, VPD) and water availability (precipitation; multi-layer SM) to revisit global photosynthesis and greenness controls.

SIF
SIF is used as a proxy for variations in photosynthesis because it captures radiation emitted by chlorophyll molecules and is related to photosynthetic activity. We use one of the longest available satellite-derived SIF records which is based on the Global Ozone Monitoring Experiment-2 (GOME-2) instrument and ranges from 2007 to 2018 (Köhler et al., 2015). The raw global SIF observations at daily time scale are filtered to remove data based on (i) high solar zenith angles (>70°), (ii) large differences to the normal local overpass time (2 p.m.-8 a.m. in the next day), and (iii) large cloud cover (>50%), as done by Köhler et al. (2015). We note that different levels of cloud filtering of the satellite retrievals yield similar SIF anomalies, such that a cloud cover threshold of 50% is a reasonable compromise between SIF data quality and quantity (Köhler et al., 2015).

Vegetation Indices
To complement the photosynthesis analysis we use NDVI and NIRv as spectral vegetation indices (Badgley et al., 2017;Huete et al., 2002). NIRv is defined as NDVI multiplied by the near-infrared reflectance (Badgley et al., 2017). We obtain red and near-infrared reflectances from MOD13C1 v006 product (https://lpdaac. usgs.gov/products/mod13c1v006/) in an original 16-day and 0.05° resolution. NDVI and NIRv are computed from data with quality flags 0 and 1 representing good and marginal data, thereby ignoring low-quality data.

Hydrometeorological Data
We consider a comprehensive selection of energy and water-related variables from the ERA5 reanalysis. This dataset accounts for spatial variations in soil types and assimilates spaceborne microwave instruments of surface SM Hersbach et al., 2020). Even though SM estimates from deeper layers in ERA5 are less constrained by observations, previous studies illustrated the agreement of ERA5 SM data in each layer with independent observations (Albergel et al., 2012(Albergel et al., , 2013(Albergel et al., , 2018Li, Wu, & Ma, 2020;Liu et al., 2013;Hersbach et al., 2020;Jing et al., 2018; see Text S1 for details). Energy-related variables include temperature at 2-m height (temperature), surface downward solar radiation (solar radiation) and VPD, and the water-related variables are total precipitation (precipitation), SM layer 1 (0-7 cm), layer 2 (7-28 cm), layer 3 (28-100 cm) and layer 4 (100-289 cm). For comparison, we compute total SM by averaging values across the individual layers weighted by their thickness. It is to note that VPD is related to the relative humidity and temperature. Hence, it is an energy-related variable but representing the demand of the water in the atmosphere.

Climate and Vegetation Regimes
To evaluate the results of our analyses, we compute the aridity index for each grid cell as the ratio between the long-term averages of net radiation (expressed in mm) and precipitation using ERA5 data (Budyko, 1974).
We distinguish climate regimes using long-term mean temperatures and aridity index. We use fractional vegetation cover data from the AVHRR vegetation continuous fields products (VCF5KYR, https://lpdaac. usgs.gov/products/vcf5kyrv001/) from 2007 to 2016 to classify the fraction of total vegetation cover, and the fraction of tree cover in total vegetation cover .

Data Pre-processing
The data pre-processing is illustrated in Figure S1. All vegetation data and hydrometeorological data are aggregated to 0.5° spatial and half-monthly temporal resolution where SIF is available, and 16-day original NDVI and NIRv are linearly interpolated to half-monthly resolution. The first half-month consists of the first 15 days of the month, and the second half-month consists of the remaining days in the respective month. The study period is limited to 2007-2018 because of the SIF availability. In all SIF-based analyses we focus on data with SIF > 0.5 mW/m 2 /sr/nm to filter out sparse or dormant vegetation. This filtering is also applied in the NDVI and NIRv analyses, where additionally negative NDVI and NIRv values are filtered out. Grid cells are only considered in the analysis if more than 15 data points are left after filtering, and if the vegetation cover fraction exceeds 5%. For all target and predictor variables, we obtain half-monthly anomalies by subtracting the mean seasonal cycles determined by averaging values from all years for each of the 24 half-monthly periods between the first half of January and the second half of December. As we focus on relationships at short time scales, we remove long-term trends for each grid cell determined by a locally weighted smoothing filter (Cleveland et al., 1979) with a smoothing span of 0.4. Moreover, this helps to filter out any signal introduced by potential satellite sensor degradation.

Identification of Relative Importance of Hydrometeorological Variables Using Random Forests (RF)
RF is a nonparametric regression-based methods requiring no statistical assumptions on predictor and target variables, and is designed to process large amounts of diverse input data (Breiman, 2001). With this flexibility it is better placed than traditional statistical methods to detect nonlinear relationships among shortterm hydrometeorological variations and vegetation productivity. We also note that our RF analysis can indicate plausible governing processes from emergent relationships, but by construction it does not suggest causality. In this study, all hydrometeorological anomalies are used as predictor variables, and anomalies of SIF and vegetation indices are employed as target variables per each grid cell, respectively ( Figure S1). RF training is done using information from each grid cell and the surrounding grid cells (forming 3 × 3 grid cell matrices) to obtain sufficient data. The performance of the RF model is then evaluated by the fraction of explained variance in regression analysis carried out with the linear least squares using cross-validation (hereafter referred to as R 2 ; see details in Text S2.1). Grid cells with R 2 lower than or equal to 0 are filtered out. Finally, two experiments are performed with RF models differing in the SM data used: total versus multi-layer SM alongside precipitation, VPD, solar radiation, and temperature.
Permutation importance is one of the most common methods for RF to measure the relative importance of each predictor variable (Lunetta et al., 2004;Nicodemus, 2011;Zhang & Yang, 2020). It is inferred from the difference of error before and after a temporal permutation applied to the particular variable (Cutler et al., 2012;Gómez-Ramírez et al., 2020). To validate results of permutation importance we employ two more methods and to infer confounding effects we also quantify the sensitivity of SIF response to each predictor variable (see Text S2.1 for variable importance identification methods, Text S2.2 for sensitivity algorithm).
When considering multiple hydro-meteorological variables, the identification of global vegetation controls is challenged by potential high collinearity (Dormann et al., 2013) between some of the variables. Most previous studies did not consider more than three variables, thereby somewhat circumventing this problem while ignoring potentially important variables (Claessen et al., 2019;Garonna et al., 2018;Seddon et al., 2016). Though RF are also challenged by the collinearity in the input data, they yield more robust results with noisy training data and have high interpretability (Zhang & Yang, 2020), for example by inferring the sign of the sensitivity of vegetation productivity to particular predictor variables. As the computed relationships and their signs are consistent with previous literature and physical expectations (see Section 3), we note that RF are applicable in our multi-variate context. Further, collinearity can sometimes be mitigated through a pre-processing of the data by removing long-term trends or the mean seasonal cycle (see Section 2.4.1), as long as the collinearity is mainly driven by them, that is trends or seasonal cycles would be similar between variables while shorter-term dynamics are not.

Model Performance
The spatial patterns of RF model performance are similar between the two experiments using total and multi-layer SM with comparatively high R 2 (>0.3) in the central North America, central Eurasia, southern and eastern Africa, central Asia, and eastern Australia (Figure 1). The predictive performance is improved LI ET AL.  in most regions across the globe when using multi-layer instead of total SM. Vertical SM information improves model performance particularly in semi-arid regions such as Australia, central North America and central Asia (Figures 1c and 1d). In these regions plant rooting systems apparently adapt to compensate for local water deficits which arise from divergent dynamics of surface and root-zone SM across time and space (Berg et al., 2017;Lian et al., 2020;Schlaepfer et al., 2017;Zhang et al., 2016).
Although the performance of SIF prediction is improved with multi-layer SM, the R 2 values are still relatively low, especially in large regions in South America. This relates to a typically significantly decreased model performance in predicting global vegetation productivity for anomalies compared to absolute data and mean seasonal cycles (Kraft et al., 2019). Further, it is related to input data quality where the satellite-based SIF retrievals are strongly impacted by noises in large regions in South America (Joiner et al., 2013;Köhler et al., 2015). Furthermore, using relatively coarser GOME2 pixels to derive the SIF data includes more residual cloud contamination than for example the finer-scale MODIS footprints (Joiner et al., 2013). Therefore, our RF models show much better performance in the case of NDVI and NIRv ( Figure S2) compared with SIF. In this context the performance of RF is generally lower in humid regions compared to drier regions, related to increased cloud cover which degrades the quality of the satellite retrievals as previously showed (Kraft et al., 2019;Linscheid et al., 2020). We include regions with weak model performance in the case of SIF for the subsequent analyses, and we believe that our methodology is robust, because (i) our main goal is to rank the relative importance of predictors instead of accurately capturing their dynamics; (ii) we find strongly similar global patterns of main controlling variables across SIF, NIRv and NDVI; and (iii) controlling patterns are largely in line with previous studies (Madani et al., 2017;Nemani et al., 2003;Seddon et al., 2016).
Additionally, we verify that the efficiency of using multi-layer SM in RF is not an artifact of over-fitting ( Figures S3 and S4), and we find similar results when using the root fractions in each soil layer as weights in the total SM vertical average computation rather than the layer depths ( Figure S5; root fractions are derived from ERA5 data by following ECMWF, 2020). See Text S3 for uncertainties in vegetation data and model tests in details.

Main Hydrometeorological Controls on Global Vegetation Productivity
The global partitioning of water-vs energy-related controls of vegetation productivity significantly varies between the two experiments involving total and multi-layer SM ( Figure 2). Apparently, total SM does not provide sufficient information to the RF model to detect water-controlled regions ( Figure 2a). Overall, in the multi-layer SM analysis (Figure 2b) temperature is identified as the main driver of SIF in the higher northern latitudes, solar radiation dominantly controls SIF in tropical regions, and VPD emerges as a main control in parts of the western Amazon forests (Green et al., 2020), eastern North America, northern Eurasia and eastern Asia. In between the tropics and the higher latitudes, where mostly (semi-)arid climate regimes are prevailing, water-related variables play the dominant role. Precipitation and surface SM (0-7 cm in ERA5) control SIF in central India, western Sahel and in central-southern Africa. Root-zone SM (7-28 cm in ERA5) mainly controls SIF in southern North America, southern Europe, and many parts of Eurasia, India and Australia. In general, root-zone SM (7-28 cm in ERA5) emerges as the most relevant water reservoir for vegetation productivity, while deeper SM (28-100 cm in ERA5) is particularly important in the transitional zones and temperate dry regions, such as central North America and southern Europe. To further test the assumption of energy variables misleadingly being detected as the main control where surface SM is the actual main driver due to the insufficiency of the total SM experiment (energy variables negatively contribute to the variation of SIF), we repeat the analysis from Figure 2 while only considering variables with positive contributions to SIF prediction. The result confirms our hypothesis and illustrates that confounding effects can be minimized using multi-layer SM ( Figure S6; see Text S4.1).
Confirming these SIF-based results, we find similar global patterns of the main controlling variables in the case of NIRv and NDVI ( Figure S7), even though they show extended SM-controlled regions. Furthermore, indicating physical meaningfulness of the obtained global patterns, we find that the sensitivities of SIF to the respective diagnosed hydrometeorological main controls are typically strongly positive ( Figure S8). This is also true for precipitation when it is identified as the most important predictor.
Next, we analyze the prevailing main controls across climate and vegetation regimes. Overall, controlling patterns across climate regimes in Figure 3a are in line with first-order constraints for evapotranspiration from Seneviratne et al. (2010), and from Denissen et al. (2020) in a European study. Root-zone SM (7-28 cm in EAR5) is identified as the main control in (semi-)arid regions. In humid regions energy variables are the most relevant, and overall temperature is the most important while solar radiation also plays a role. Solar radiation best explains SIF variability for forests in humid regions, because SIF is mechanistically linked through absorbed photosynthetically active radiation which depends on the variability of the irradiance . In extreme warm and dry conditions, precipitation is identified as the dominant control. This could be explained as (usually weak) rainfall would not significantly wet the soil surface because of quick evaporation from warm surfaces (Seneviratne et al., 2010) and soil water repellency after long dry periods (Song & Wang, 2019). The rain-induced evaporation in turn mitigates atmospheric water stress (high VPD), and thereby contributes to the precipitation control on SIF. In Figures 3d-3f we show that grasses and shrubs with a low fraction of tree cover are most water-controlled, regions with intermediate tree cover are typically temperature-controlled, and regions with the highest tree cover are mostly radiation-controlled. Energy controls involve a relatively lower vulnerability of tree ecosystems to droughts than other ecosystems (Huang & Xia, 2019), as droughts are typically associated with above-average solar radiation LI ET AL.  Proportions of study area where each variable is the most important factor are shown in (d and e). TP denotes precipitation; TSM denotes total SM; SM1, 2, 3, 4 denote SM in layers 1, 2, 3, 4 respectively; TEM denotes temperature; SSRD denotes solar radiation; VPD denotes vapor pressure deficit. and newly developing leaves that can support photosynthesis (Hutyra et al., 2007;X. Li et al., 2018;Orth & Destouni, 2018;J. Wu et al., 2016;Yan et al., 2019). While the NDVI and NIRv analyses overall confirm the SIF results, they show extended water-related controls in arid regions and in tree-grass mixed regimes, consistent with previous findings (Walther et al., 2019). This is more pronounced for NDVI, potentially due to larger confounding effects of background brightness in NDVI as a response to SM changes, while NIRv contains more information about vegetation canopy structure and partly overcomes this issue (Badgley et al., 2017. Figure S9 finally investigates jointly the role of fraction of tree cover and aridity and shows that the main hydrometeorological controls change in response to both of them. We furthermore analyze variations of hydrometeorological controls between early and late growing seasons. Temperature control can be found in larger regions in the early growing season compared to the later growing season, while the control of SM expanded in the late growing season (Figures S10 and S11) is in line with previous studies (Buermann et al., 2018;Lian et al., 2020;. Overall, patterns for these two sub-periods do not differ much from the results for the entire growing season (See Text S4.2 for details).

Main Water-Related Controls on Global Vegetation Productivity
Focusing exclusively on water-related first-and second-order controls reveals that the most important soil layer varies across climate-vegetation regimes (Figures 4a-4c). Shallow (7-28 cm in ERA5) and deep rootzone SM (28-100 cm in ERA5) are the most relevant layers for semi-arid grasses or shrubs, indicating that plants can adapt to water-scarce conditions at the surface with establishing deeper-reaching rooting systems (Fan et al., 2017). This is in line with previous but smaller-scale study from Yinglan et al. (2019); further studies confirm that in dry surface soils in (semi-)arid regions, root plasticity and morphology support water uptake from deeper soil layers , for instance in local Mediterranean grass (Barkaoui et al., 2016), savannas ecosystems (Hoekstra et al., 2014;Nippert & Holdo, 2014) or central Brazilian savannas (Oliveira et al., 2005). For even drier climate conditions, shallower soil layers become more relevant (Figures 4a-4c), probably because intermittent vegetation productivity mostly benefits from rainfed surface LI ET AL.  SM, or deeper fine roots down to the water table capillary fringe play a role (Fan et al., 2017). Interestingly, toward humid conditions our analysis shows a dominant role of surface SM (0-7 cm in ERA5) and precipitation, while these regions are characterized by high tree cover with expected deep rooting systems. This could be due to frequent precipitation keeping surface soil layers wet such that trees absorb significant fractions of water through the near-surface roots (S. G. Li et al., 2007), while the dependence on deeper layers for trees during short droughts is not reflected here. Furthermore, we note that these regions are characterized by first-order energy controls (Figure 3) such that the results could also partly be an artifact as precipitation and surface SM are expected to co-vary with the dominant energy variables.
We validate our findings on the relative importance of the different soil layers by comparing them with multiple global rooting-depth products from two perspectives, effective root-zone water uptake and physical maximum rooting depths (Figures 4d-4i). In general, rooting depths from Fan et al. (2017) and Schenk and Jackson (2009) although from the physical maximum perspective show similar patterns as our results with deepest roots in semi-arid areas and for nontree vegetation such as grasses and shrubs. Canadell et al. (1996) states that maximum rooting depths of trees, shrubs and grasses can exceed 2 meters, respectively. Our results indicate that grasses and shrubs use their deep roots more often such that we can detect a respective relevance of deep SM reservoirs while trees predominantly use shallow roots as water is often easily accessible in wet surface layers. See Text S5 for data details.
LI ET AL.  To illustrate the robust importance-based analyses we use: (i) Spearman correlation ( Figure S13) and (ii) SHAP feature importance ( Figure S14) and find similar results. Further, we employ alternative SM products, namely GLEAM, MERRA-2 and SoMo.ml (see data details in Text S5; Figure S15), all of which lead to similar results as found with the ERA5 SM. We note that using a suite of SM products derived independently with physical models or machine learning approaches highlights that our results are not an artifact of one single SM model. In particular we find across these datasets that generally shallow soil layers are more relevant in humid regimes and deep(er) soil layers are more important in semi-arid areas and for nontree vegetation such as grasses and shrubs, confirming our previous results, while the layer depths and amounts are different across products and cannot be readily compared. In addition, we convert multi-layer SM to water potential. Thereby, the water potential is computed as the difference between actual SM and the permanent wilting point which is inferred from the soil texture in each grid cell (ECMWF, 2020). We find similar results indicating no major influence of the considered water availability metric on our results ( Figure S16, as the role of different soil types might be diluted at large spatial scales where usually different soil types coincide in individual grid cells. Furthermore, we acknowledge, however, that our analyses do not consider water storage strategies related to hydraulic traits (Matheny et al., 2015) and irrigation effects.

Conclusions
This study illustrates that vegetation productivity relies on water from different soil depths while these characteristic depths vary with climate and vegetation types. In particular, we elucidate at the global scale that semi-arid areas and vegetation types such as grasses and shrubs are controlled by comparatively deep layers as they have deep rooting systems. This complexity is not yet sufficiently acknowledged by dynamic global vegetation models (Guimberteau et al., 2018;Schaphoff et al., 2018;Warren et al., 2015) which apply globally constant soil depths, and do not account for deep rooting strategies or potential physical barriers for vertical soil water transport (Sakschewski et al., 2020). This highlights the relevance of our results, and of our approach illustrating that sub-surface soil processes can be inferred with the help of surface-based Earth observations. Further, we compare the hydrometeorological controls of vegetation productivity obtained with different respective proxy metrics. SIF is more strongly related to photosynthesis, compared with NDVI and NIRv, but SIF data are only available for recent years. Our results show that NDVI and NIRv yield similar spatial patterns and largely confirm the SIF-based results. However, we find extended water-related controls in the cases of NDVI and NIRv, especially in temperate wet regimes, probably induced by changes of soil background reflectance as a response to SM changes.
Overall, our study contributes to an advanced understanding of the global-scale partitioning of hydrometeorological controls of vegetation productivity by benefitting from the ever-growing suite of global eco-hydrological data streams. cal data and the MODIS data, Tina Trautmann for processing the global rooting-depth data, and Gianpaolo Balsamo for guidance on the ERA5 vertical root distribution data. W. Li acknowledges funding from a PhD scholarship from the China Scholarship Council. W. Li acknowledges support from the International Max Planck Research School for Global Biogeochemical Cycles. R. Orth is funded by the German Research Foundation (Emmy Noether grant number 391059971), and S. Walther acknowledges funding by an ESA Living Planet Fellowship "Vad3e mecum". Open access funding enabled and organized by Projekt DEAL.