Predicting ecosystem responses by data‐driven reciprocal modelling

Treatment effects are traditionally quantified in controlled experiments. However, experimental control is often achieved at the expense of representativeness. Here, we present a data‐driven reciprocal modelling framework to quantify the individual effects of environmental treatments under field conditions. The framework requires a representative survey data set describing the treatment (A or B), its responding target variable and other environmental properties that cause variability of the target within the region or population studied. A machine learning model is trained to predict the target only based on observations in group A. This model is then applied to group B, with predictions restricted to the model's space of applicability. The resulting residuals represent case‐specific effect size estimates and thus provide a quantification of treatment effects. This paper illustrates the new concept of such data‐driven reciprocal modelling to estimate spatially explicit effects of land‐use change on organic carbon stocks in European agricultural soils. For many environmental treatments, the proposed concept can provide accurate effect size estimates that are more representative than could feasibly ever be achieved with controlled experiments.


| INTRODUC TI ON
The quantification of cause-and-effect relationships is key to understanding and managing the environment in which we live. Traditionally, such relationships are studied in very small subsets of our environment, at mesocosm and microcosm scale, or under simplified model conditions. In controlled experiments, the effect size of a given treatment is evaluated against a corresponding untreated reference. The setting can be an ecological field trial in which the treatment might be plant diversity and the studied response is ecosystem functioning (paired samples). It could also be a pharmaceutical trial, where a test population is randomly split and one group receives a drug (treatment) while the other receives a placebo (reference), and these unpaired groups are compared. Such controlled experimental designs offer great potential for the accurate quantification of treatment effects. In fact, the more controlled (simplified) the experiment is, the more accurately cause-and-effect relationships can be determined. However, as experimental control is best achieved by simplifying environmental conditions, many cause-and-effect descriptions lack representativeness. Carefully and very accurately quantified effects observed in a single site or population might appear very different at other sites or in populations with altered environmental conditions. Survey programs offer an alternative, data-driven approach for estimating treatment effects. While controlled experiments focus on accurately quantifying treatment effects at the expense of representativeness, the opposite is the case for surveys. Surveys aim to document target variables under the complete range of field conditions, but at the expense of easily being able to unravel causality. With increasing complexity of the studied environment, it is harder to distinguish causality from correlation (Huston, 1997). In the past two decades, machine learning methods have progressively been used to examine associations between target variables and environmental properties. Machine learning has allowed the identification and reproduction of complex associations, which has greatly advanced the ability to predict accurately using unseen data. However, even advanced machine learning methods are unable to distinguish between random associations and causal relationships (Wadoux et al., 2020). What machine learning can do, however, is link target values observed in a treatment group to environmental properties. This link could then be used to predict potential treatment effects for environmental observations that have not been treated (Bastin et al., 2019;Schneider & Don, 2019). The approach of training a model in one group and then applying this model to another group with a different treatment is referred to below as data-driven reciprocal modelling. In ecology and environmental science, only relatively few case studies to date have implemented data-driven reciprocal modelling methods, for example to examine the effect of human activity on the number of trees (Bastin et al., 2019) and soil compactness (Schneider & Don, 2019). However, these pioneering studies tend to insufficiently address two key issues: (i) explanatory variables used for model training should be unaffected by the examined treatment (otherwise treatment effects can be underestimated) and (ii) machine learning models should only be applied within their training space, not beyond it.
In the following, we propose a good practice approach for datadriven reciprocal modelling to estimate complex treatment effects under field conditions. We illustrate this powerful method using the LUCAS soil data set describing the state of European topsoil in the year 2015 (Orgiazzi et al., 2018;Panagos et al., 2020) and examine the effect of agricultural land-use change on soil organic carbon (SOC) stocks. Specifically, we address the question of how much atmospheric carbon (C) European agricultural soils could sequester if today's croplands were converted to grasslands. However, the method presented can also be applied to answer numerous other questionsall that is required is a representative data set that relates values of a chosen target variable (here, SOC stock) to values of the studied treatment (here, land use) along with values of other potential explanatory variables (here, soil, climate, geology and management variables).

| General concept
Data-driven reciprocal modelling can be used to quantify treatment effects in situ, directly in the environment in which the treatment naturally occurs. Instead of controlling the environment, the proposed method aims to describe it as a whole. This is achieved by means of a representative survey that documents values of the target variable along with the examined treatment, and those environmental features that potentially also cause variation in the target variable within the studied region or population. The target variable depends on its associated treatment and environment: where Treatment is a dichotomous factor with two classes: each observation is either of group A or group B. To quantify the response to a change from group B to A, Response B→A , data-driven reciprocal modelling requires (i) at least one observation in group B, (ii) sufficient observations in group A to statistically describe the target variable as a function of Environment and (iii) the Environment of group B to match or be a subset of the Environment of group A. If these criteria are met, data-driven reciprocal modelling can be implemented as follows ( Figure 1): Step 1. In group A, a data-driven statistical model is trained to predict the target variable as a function of environmental properties: This model should reproduce associations between the Target variable and the Environment as accurately as possible (minimal underfitting) while still being generalizable to new observations within the boundaries of the training space (minimal overfitting). In the case of large and complex data sets, this task can often be achieved most efficiently with machine learning, choosing a model or model ensemble with an associated tuning approach that yields optimal performance. As a constraint, predictions should only include environmental variables that, based on expert knowledge, remain unaffected by the examined Treatment. This constraint is important in order to avoid underestimation of effect sizes (Step 3).
Step 2. Next, the method requires a comparison of the environmental properties of groups A and B. The observations of group B that are not comparable with the environmental properties of group A (training data) have to be deleted because they are located outside the model's space of applicability-they would be unknown to the model. The selection can be automated, for example as described by Meyer and Pebesma (2021), who judged the ability of any data-driven model to predict an unseen observation based on the Euclidean distance of this observation to the nearest training data point in the predictor space.
Step 3. Finally, the model that was trained on group A is applied to the remaining observations of group B. The resulting residuals represent case-specific effect size estimates for a change from group B to A: Step 4. In an optional post hoc analysis, the variation among individual effect sizes can be analysed further by relating the effect sizes from Step 3 to environmental properties. For example, this can Response B→A = Target variable predicted − Target variable observed .
be done by training a second machine learning model, which predicts individual effect sizes. This model can then be interpreted, for example by means of partial dependence plots (Molnar, 2019). The constraint concerning environmental variable selection defined in Step 1 can be dropped in this final step, and for model training all potential predictors (based on expert knowledge) can be used, including those that might depend on the Treatment.

| Case study
Minor alterations in global SOC stocks can cause major changes in the concentration of atmospheric carbon dioxide (Minasny et al., 2017). In mineral soils, land-use change typically exhibits the strongest anthropogenic effect on SOC stocks (Paustian et al., 2016).
However, site-specific effects of land-use change on SOC stocks may depend on environmental properties such as climate and soil texture Poeplau et al., 2011). In the present case study, we applied the new data-driven reciprocal modelling framework to provide spatially explicit, quantitative estimates for the effect of agricultural land-use change on SOC stocks in the European Union (EU-27) and the United Kingdom. The agricultural area of the EU-27 and the United Kingdom covers 1.73*10 6 km 2 , of which about two thirds are used as cropland and the remaining area is grassland (Eurostat, 2020). In the LUCAS 2015 Topsoil Survey, soils of these land-use types were sampled representatively to a depth of 20 cm F I G U R E 1 General concept for quantifying treatment effects by data-driven reciprocal modelling.
Step 1: In group A (here, squares; in case study, grassland), train model to predict the target variable as a function of treatment-independent environmental properties (here, colour); the range of environmental properties used to train the model define its space of applicability.
Step 2: In group B (here, circles; in case study, cropland), drop observations with properties outside the model's space of applicability.
Step 3: Apply the model to the remaining observations of group B and determine residuals to obtain modelled responses to a change from B to A (here, arrow direction +length).
Step 4: In an optional post hoc analysis, relate the different responses from Step 3 to environmental properties; in this final step, all potential predictors can be used, including those that were possibly set aside is Step 1 due to the treatment independency constraint and analysed for SOC content, inorganic C content, nutrient contents (available phosphorus, available potassium, total nitrogen), texture and pH (Orgiazzi et al., 2018;Panagos et al., 2020). These data were provided by the European Soil Data Centre (https://esdac.jrc. ec.europa.eu/). Bulk density values of LUCAS sites were estimated with a pedotransfer function previously recommended for European agricultural soils by Hollis et al. (2012). The bulk density values obtained were used to convert SOC contents to stocks (Lugato et al., 2014): where SOC stock is given in Mg ha −1 , SOC content in per cent, BD is bulk density in g cm −3 , Depth equals 20 cm and Coarse represents the percentage of coarse fragments. Additionally, predictors were compiled representing climatic, geological and anthropogenic influences on SOC at the LUCAS sites. The bioclimatic variables of WorldClim (Fick & Hijmans, 2017), elevation, slope and aspect, as readily provided in the Step 1. To estimate how much atmospheric C agricultural soils could theoretically sequester if today's cropland (group B) were converted to grassland (group A), a Random Forest model was trained to predict the SOC stocks of grassland. The following predictors were used for model training: soil texture, pH, C:N ratio, carbonate, soil groups, elevation, slope, aspect and climate data. These particular predictors were chosen because, based on expert knowledge, (i) they represent typical explanatory variables for SOC (Schneider et al., 2021) and (ii) land use can be assumed to have no causal effect on them, that is these predictors are independent of the examined land-use treatment. The resulting model was interpreted by computing permutation importance and illustrating associations of the five most important variables with SOC in partial dependence plots (Molnar, 2019).
Step 2. Second, the predictor space of the grassland model was examined. This space can be thought of as a point cloud with n-dimensions, where each dimension represents one predictor variable and each point in the cloud represents one grassland observation. The location of each point in this training space is defined by its respective value in each of the n predictor variables (here n = 13 continuous variables + one dummy coded soil group variable). After adding the cropland data to this space, (i) each predictor variable was standardized, that is mean-centred values were divided by their respective standard deviation, and (ii) the standardized values were multiplied by their respective variable importance in the grassland SOC model, as suggested by Meyer and Pebesma (2021). This was done to (i) make the predictor variables comparable and (ii) weight the predictors according to their importance in the SOC model. For each cropland observation, the Euclidean distance to its nearest grassland neighbour was calculated, and the result was divided by the average of all pairwise distances in the grassland data. This yielded a standardized index to characterize the dissimilarity of cropland and grassland observations. In the present study, the CAST::aoa function (Meyer, 2020) in R was used to calculate dissimilarity indices and define the associated space of applicability for the grassland model by applying the function's default threshold value of 0.95 (Meyer, 2020;Meyer & Pebesma, 2021). Cropland observations beyond the space of applicability of the grassland model were omitted.
Step 3. Finally, the SOC model, which was trained on grassland (group A), was used to predict SOC in the remaining cropland sites (group B). The resulting residuals represented the required estimates for the site-specific SOC response to converting today's cropland to grassland.
Step 4. In order to explain site-specific differences between these estimated effect sizes (residuals), a new Random Forest model was trained using all the available predictors for SOC, including those predictors that depended on the examined land-use treatment: variables characterizing agricultural activity and NDVI. After calculating permutation importance, selected key variables explaining SOC responses to land-use change were illustrated in partial dependence plots (Molnar, 2019).

| RE SULTS AND D ISCUSS I ON
The main result of the study was the new method of data-driven reciprocal modelling, as outlined in the Methods section. This method can be applied to a wide range of research questions for which large survey-based data are available. It helps to estimate treatment effects without being restricted to controlled experiments that include a control treatment and keeping all other environmental parameters the same, the ceteris paribus principle. In environmental research, this principle is often not achievable or comes at the expense of representativeness when being restricted to a few controlled field experiments. Here, we used the example of a soil survey and how land-use affects soil C stocks as a case study to illustrate the data-driven reciprocal modelling.
The five most influential predictors described climatic conditions, soil texture and the C:N ratio of soils ( Figure 2). Specifically, the model described SOC stocks increasing with precipitation in the driest month, attributing to grassland with more than 40 mm of such precipitation about 10 Mg ha −1 more SOC than grassland in a drier climate. Modelled SOC stocks decreased with increasing maximum temperatures and/or increasing coarse fragment fraction, while they increased with higher C:N ratios and/or higher clay content. Overall, the Random Forest model reflected data patterns that were in good agreement with previous studies explaining the variability of SOC stocks in European agricultural soils (Rial et al., 2017).

| Step 2. Stick to the model's space of applicability
In total, 276 (4%) of cropland sites were located outside the model's space of applicability, and were therefore excluded from further analyses ( Figure 3; Figure 4 crosses). Affected sites were located in regions with strong cropland dominance, and were mostly characterized by a Mediterranean climate with maximum temperatures above 25°C in the hottest month and/or alkaline, clay-rich soil. These conditions were largely unknown to the model, because in Europe, grassland use is rare under such conditions.

| Step 3. Apply model to group B (here, cropland) to estimate effect sizes
In the remaining croplands, the model predicted SOC stocks to be 12.1 Mg C ha −1 (95% CI, 11.8 to 12.5) higher on average than measured values (Figure 4). This number illustrates the average SOC accrual if cropland were converted to grassland. Individual effect sizes differed drastically, but their spatial distribution revealed pronounced regional trends. On a country level, the potential SOC accrual from converting cropland to grassland was predicted to be highest in Belgium (mean 27.  (Poeplau & Don, 2013).

| Step 4: Explain effect sizes
The effects of land-use change on SOC stock were strongly related to climate, mineral N fertilization, C:N ratios of soils and mean annual NDVI values of the land surface ( Figure 5). Cropland with mean annual precipitation below 40 mm in the driest month was predicted to sequester about 6 Mg C ha −1 less by grassland conversion than cropland with precipitation above this threshold value.
This can largely be explained by the relatively low SOC stocks of grassland in dry areas ( Figure 2). Additionally, in the dry areas of the Mediterranean, the widespread occurrence of silvoarable systems, such as olive trees paired with ley or cereals, might make croplands particularly SOC rich relative to adjacent grassland (Eichhorn et al., 2006). Converting such relatively productive silvoarable systems in the Mediterranean to grassland might even result in SOC losses ( Figure 2, brown colours). This interpretation is underlined by the partial dependence of ΔSOC on the NDVI values of cropland: ΔSOC decreased with increasing mean annual NDVI values of cropland ( Figure 5). Some Mediterranean croplands showed even higher mean annual NDVI values than neighbouring grasslands, that is cropland was 'greener' than grassland ( Figure 6, green colours). However, on average this differed across Europe, with cropland showing lower mean annual average NDVI values than neighbouring grassland, particularly in the agricultural area between the Paris basin in France and Belgium ( Figure 6). The same region was characterized by high mineral N fertilization, which was related to larger values of ΔSOC.
Associations between soil properties and ΔSOC were minor, with the exception of the C:N ratio of cropland soil, underlining the close linkage between C and N cycles.

| Pros and cons of data-driven reciprocal modelling
Data-driven reciprocal modelling offers a new tool for case-specific estimates of effect sizes that take local environmental conditions into account. In our case study, the spatial patterns of predicted land-use change effects could be explained with comprehensible driver variables, confirming that most effect size estimates were plausible. However, it is important to note that the case-specific effect size estimates obtained via reciprocal modelling tend to be F I G U R E 3 Density plots characterizing croplands within the model's space of applicability (green; n = 5879) and croplands outside of this space (grey; n = 276 A major bottleneck in the accurate prediction of treatment effects with reciprocal modelling is data availability. In the future, the quantity and quality of data, as well as computational power to train data-driven models, will continue to increase, which will allow further improvements to the accuracy of data-driven effect size estimates. Controlled experiments will always be indispensable as they contribute to process understanding and build the foundation for mechanistic models. Good mechanistic models that correctly describe all relevant processes will always provide better effect size estimates than data-driven reciprocal modelling. In practice, however, many phenomena are too complex to be accurately described with mechanistic models. Data-driven reciprocal modelling allows effect size to be estimated merely by analysing associations, without understanding causation. They provide the best possible quantitative estimate of effects without understanding every detail of their cause while achieving a maximum of representativeness.

ACK N OWLED G EM ENTS
This study was part of the EJPSoil project, which has received fund-

CO N FLI C T O F I NTE R E S T
There were no conflicts of interests.

DATA AVA I L A B I L I T Y S TAT E M E N T
The R code used for data-driven reciprocal modelling in our case study is available on GitHub (https://github.com/Flori anSch ne/