Substantial Differences in Crop Yield Sensitivities Between Models Call for Functionality‐Based Model Evaluation

Crop models are often used to project future crop yield under climate and global change and typically show a broad range of outcomes. To understand differences in modeled responses, we analyzed modeled crop yield response types using impact response surfaces along four drivers of crop yield: carbon dioxide (C), temperature (T), water (W), and nitrogen (N). Crop yield response types help to understand differences in simulated responses per driver and their combinations rather than aggregated changes in yields as the result of simultaneous changes in various drivers. We find that models' sensitivities to the individual drivers are substantially different and often more different across models than across regions. There is some agreement across models with respect to the spatial patterns of response types but strong differences in the distribution of response types across models and their configurations suggests that models need to undergo further scrutiny. We suggest establishing standards in model evaluation based on emergent functionality not only against historical yield observations but also against dedicated experiments across different drivers to analyze emergent functional patterns of crop models.


Introduction
Crop models are often employed to project crop yields under changing conditions such as global warming and associated management change for adaptation (Jägermeyr et al., 2021).Multi-model ensembles are promoted to enhance the robustness of projections (Asseng et al., 2015;Martre et al., 2015), but questions remain on what causes often large differences between projections of individual models (e.g., Jägermeyr et al., 2021;Müller et al., 2021;E. Wang et al., 2022).Global Gridded Crop Models (GGCMs) are especially exposed to this question when applied for assessing climate change impacts (Jägermeyr et al., 2021;Schleussner et al., 2018), adaptation (Franke et al., 2022;Minoli et al., 2019;Zabel et al., 2021), or environmental impacts of agricultural production (W.Liu et al., 2018), because their results are used in downstream analyses, such as in integrated assessment (Ruane et al., 2017) or economic modeling for projecting future land-use change (Stevanović et al., 2016;Wiebe et al., 2015).Even though global gridded crop models are often based on detailed field-scale models or have implemented similar modeling principles in other ecosystem models (Müller et al., 2019) and show similar performance in evaluation against historical, national yield statistics (Franke et al., 2020a(Franke et al., , 2020b;;Müller et al., 2017), models are subject to substantial uncertainties from both model structure and parametrization (Folberth et al., 2019) as well as from calibration and input data quality (Ruane et al., 2021).This uncertainty shows most prominently in future projections under high-emission climate change scenarios, where models are exposed to driving data far outside the evaluation domain and results show large inter-model differences (Jägermeyr et al., 2021;Müller et al., 2021;Rosenzweig et al., 2014).
Climatic conditions (D. Liu et al., 2020) and soil properties (Qiao et al., 2022) determine yield potentials (Mauser et al., 2009;van Ittersum et al., 2013) and the suitability of different technologies, such as cultivars (Couëdel et al., 2021).Areas with similar climate and soil conditions show similar yield responses to variations in weather conditions, which can be monitored and reported using representative sites (Gommes et al., 2016).D. Liu et al. (2020) have identified the most limiting climate variable(s) across global crop production areas, finding that temperature has generally a higher impact on crop yields than precipitation for maize, rice, soybean, and wheat.Climate change is projected to alter climate conditions in many agricultural regions substantially (Franke et al., 2022;Jägermeyr et al., 2021;Ruane et al., 2018).Kummu et al. (2021), for example, find that substantial shares of these areas may be driven out of a climatic envelope suitable for agricultural production.Projections of future climate change demonstrate high levels of agreement on global mean temperature trajectories for given forcing scenarios, such as the "Shared Socioeconomic Pathways with Representative Concentration Pathways" (SSP-RCPs) framework (Tebaldi et al., 2021), but are subject to high levels of uncertainties when it comes to spatial and seasonal changes in temperatures and especially precipitation (e.g., Hawkins & Sutton, 2011;Monerie et al., 2020;Wu et al., 2022).Analyzing the sensitivity of cropping systems to changes in individual climate variables can thus help understand their fragility under changing climate.
Process-based crop models are widely accepted tools to project crop yields under changing climatic or management conditions and can help to inform decision making in direct or indirect ways.Crop models are employed at field to global scale and a large variety of crop models exists (e.g., Asseng et al., 2019;Müller et al., 2017).Model intercomparison projects, such as the Agricultural Model Intercomparison and Improvement Project AgMIP (Rosenzweig et al., 2013) have shed light on the inter-model uncertainty (Asseng et al., 2015;Palosuo et al., 2011;Rosenzweig et al., 2014;Ruane et al., 2017), leading to and following up on a call for a general overhaul of crop models (Rötter et al., 2011).Model development efforts since have led to various improvements of crop models (e.g., T. Li et al., 2017;Maiorano et al., 2017;Olin et al., 2015;von Bloh et al., 2018), disagreement between individual crop models remains high (Asseng et al., 2019;Jägermeyr et al., 2021;Kostková et al., 2021;Müller et al., 2021).
Local environmental conditions determine how individual crops are affected by changes in individual drivers.However, owing to the multiple interactions of drivers and processes in yield formation (Schauberger et al., 2016) and the incomplete implementation of processes in crop models (Boote et al., 2013), models can be expected to differ in crop yield projections and sensitivities to individual drivers.Still, regions with severe drought conditions should show substantial sensitivity to changes in water supply and regions with very little nitrogen availability should be sensitive to changes in nitrogen inputs.AgMIP's Global Gridded Crop Model Intercomparison (GGCMI) has set out to intercompare GGCMs in order to evaluate model performance, describe model uncertainties, identify inconsistencies within the ensemble and underlying reasons, and to ultimately improve models and modeling capacities (Elliott et al., 2015).The GGCMI Phase 2 experiment provides simulation data from a large, structured simulation experiment with regular perturbations of four different drivers of yield formation (atmospheric carbon dioxide concentrations (C), temperature (T), water (W), and nitrogen(N)), referred to as CTWN.The CTWN experiment is very well suited to study models' responses to changes in individual or combined driver dimensions.Modeled yield responses to such regular perturbations in drivers can be used to describe crop yield response types, which vary in space (water is a more important driver in arid environments than in humid ones) and among models.If there were no model uncertainty, crop yield response types would determined by genotype, environment and management (G × E × M) characteristics of each farming system and could be identified with a single crop model.Under given model uncertainty, crop yield response types are, however, a function of the local cropping conditions, but also of model design, functionality, and parameterization (Folberth et al., 2019).Consequently, crop yield response types can describe differences in model behavior and spatial disagreement and can thus help identifying functional differences between models that can guide further model development.Tao et al. (2020) conducted a model intercomparison study with eight barley models for two sites and eight different simulation settings, combining offsets in air temperature, precipitation, irradiation and atmospheric CO 2 .They find that the models' disagreement from different sensitivities to changes in temperatures and CO 2 was largest and could identify modeled dynamics of leaf area index as a process that is responsible for model divergence with respect to simulated evapotranspiration, above ground biomass, and yield.In this study, we are conducting a global analysis of GGCMs sensitivities to individual drivers of crop yields, deriving classes of model response types that allow for intercomparing models and regions, aiming to better understand sources of uncertainties in future crop yield projections with crop models.

The GGCMI Phase 2 Model Output Data Set
The GGCMI Phase 2 experiment is a structured input sensitivity test (Franke et al., 2020a(Franke et al., , 2020b) ) with a modeling protocol that asked for up to 1,404 31-year global simulations at 0.5 arc-degree spatial resolution to assess models' sensitivities to changes in atmospheric carbon dioxide concentrations (C; 4 levels) temperature (T; 7 absolute offset levels, including zero), water supply (W; 9 relative offset levels, including zero), and nitrogen (N; 3 levels), the so-called CTWN experiment (see Table A1) (Franke et al., 2020a(Franke et al., , 2020b)).A fifth dimension in the CTWN Experiment on Adaptation (A) was not considered here, that is, we only used the simulation sets that assumed no change in cultivars (n = 756).Previous work has used emulators trained on the CTWN experiment (Franke et al., 2020a(Franke et al., , 2020b) ) to explore the contribution from crop models to overall uncertainty in crop yield projections driven by climate change projections (Müller et al., 2021) and to explore the role of adaptation to future agricultural production (Zabel et al., 2021) and the latitudinal shifts in breadbasket regions (Franke et al., 2022).We also focused on rainfed growing conditions only, ignoring the settings with unlimited irrigation (W inf ).Of the 12 participating modeling groups, only four supplied all 756 simulation sets (EPIC-TAMU, LPJ-GUESS, LPJmL, and pDSSAT), but five additional modeling groups provided sufficient data to allow for emulation of their yield responses (CARAIB, GEPIC, JULES, PEPIC, and PROMET) and we used the emulators that were build on these simulations (Franke et al., 2020a(Franke et al., , 2020b) ) to gap-fill missing simulation sets that were not provided.The remaining models (APSIM-UGOE, EPIC-IIASA, and ORCHIDEE-crop) are only shown here in the overview figure for completeness, but are not included in the following analyses.The scarcity of simulations provided by these modeling teams (33-44 of 756, see Table 1) does not allow for in-depth analysis and also led to exclusion of these models from the emulator training (Franke et al., 2020a(Franke et al., , 2020b)).The GGCMs' baseline simulations (C = 360, T = 0, W = 0 for rainfed and W = inf for irrigated, N = 200) have been used to evaluate model performance, even though uniform N application rates and atmospheric CO 2 concentrations of 360 ppm are not realistic representations of the current systems.Despite the uniform management assumption, Franke et al. (2020aFranke et al. ( , 2020b) ) show that the GGCMI Phase 2 model ensemble is capable of reproducing some of the observed yield variability, especially in top-producer countries (see reproduced Figure S2 in the Supplementary Information), but also at the global aggregation level (Figures S3-S6 in the Supplementary Information).GGCMs' skills in reproducing observed yield variability at the grid-cell level varies across GGCMs and crops but shows roughly similar skill to the GGCMI Phase 1 model ensemble tested by Müller et al. (2017).

Data Analysis
The analysis conducted here aims at understanding differences in models' sensitivities of simulated crop yield (y) to the CTWN drivers across crops and regions as well as understanding differences among models.We considered current crop-specific cropland extent, making use of the MIRCA2000 cropland data set Portmann et al. (2010).To avoid distortions of marginal production areas, we only considered grid cells (0.5°by 0.5°longitude/latitude, equivalent to 55 km by 55 km at the equator) with at least 200 ha of crop cultivation (rainfed and irrigated area).Spring and winter wheat are not separated in the MIRCA2000 data so we considered total wheat areas for both.MIRCA2000 data were also used for data aggregation to the global scale, using the provided crop-specific Earth's Future 10.1029/2023EF003773 harvested areas as aggregation weights.Globally, there were 21,262 grid cells included for maize, 9,165 for soybean, 11,452 for rice, 17,032 for spring wheat, and 17,032 for winter wheat.With the sheer amount of data of the GGCMI Phase2 experiment (up to 4,368 global simulations, see Table 1), a visual representation of variations in model response is not helpful.We thus structured the analysis to condense the information in a meaningful way so that different response types can be identified and discussed.

Impact Response Surfaces
Impact Response Surfaces (IRSs) have been used to describe crop model behavior under changes in two driver dimensions (e.g., temperature and precipitation) (e.g., Pirttioja et al., 2015) and Fronzek et al. (2018) have used IRS to identify different model response types.Zabel et al. (2021) used IRSs to describe isolines for comparison of adapted and non-adapted global production systems.Here, we were interested in regional differences and thus constructed IRSs for each grid cell i, GGCM g, crop c, and each paired combination of two drivers d1 and d2 of the four CTWN dimensions (i.e., T ∼ W, T ∼ N, C ∼ T, W ∼ N, C ∼ W, C ∼ N).IRSs display yield changes (Δy i,g, c ) for any grid cell i or aggregation of grid cells for combination of any two drivers (d1 and d2) in relation to the average yield across all cases included in the IRS ȳd1 * ,d2 * ,i,g,c ) as described by Equation 1, where d1* and d2* describe the full set of elements in d1 and d2 respectively.We used the average yield (of each respective IRS) rather than the yield at default conditions (y i,g,c,C360,T0,W0,N200 ) as the default conditions were not always directly supplied by all models g.
The other two dimensions, not displayed in the IRS are kept at their default setting (C: 360 ppm, T: 0°C, W: 0%, N: 200 kg ha 1 ).The atmospheric CO 2 concentration of 360 ppm refers to approximately the value of 1,995, the middle of the simulation period 1980-2010 of the GGCMI Phase2 experiment.Note.Some models do not account for nitrogen dynamics, as indicated in column "Nitrogen."Not all models provide data for all crops (indicated by "-" in the respective columns), but always supply the same simulation sets across all crops provided.Earth's Future

10.1029/2023EF003773
Depending on the global extent of cropland, 9,165 (soybean) to 21,262 (maize) of such IRS sets were constructed per CTWN dimension and crop, which cannot be displayed or interpreted as visuals.For illustrative purposes, Figure 1 shows IRS for the T ∼ W responses of globally aggregated maize yield.

Dominant Response Dimensions
IRSs show the response of projected yields for any two drivers (d1 and d2, e.g.T and W).The classification of IRS as proposed by Fronzek et al. (2018), which distinguishes nine cases of maximum yield location per IRS and the strength of the response per dimension, is still too complex for our purposes here, especially if extended from two (TW) to four (CTWN) dimensions of drivers.For a simpler metric to describe the characteristics of IRS, we identified the dominant response dimension, using response ratios (RR).Response ratios describe the relationship of the gradients along the two dimensions, based on minimum and maximum values, that is, ignoring the shape of these gradients (i.e., it does not matter if the minimum (or maximum) is at either end of the row or column).In contrast to the illustrative IRSs, the reference yield ȳ cancels out in the computation of RRs, so we computed RRs based on actual yields (y) rather than yield changes (Δy).Any distortion that may be introduced by using the IRS' mean value rather than a standard simulation set thus does not affect any quantitative analysis here.In order to determine which of the two drivers dominates over the other, we selected the data slice from the CTWN cube that spans the full range of the drivers of interest (d1 and d2) at the default conditions of the other two drivers (e.g., [T 1 .. T +6 ] vs. [P 50 .. P +30 ] at C 360 and N 200 ).Across that selected surface, we computed the range of simulated yields (y) for each grid cell i, model g, crop c for each element j1 of d1 across all elements j2 of d2, computing the average response to those drivers (e.g., R T and R W in Figure S1) by dividing by the number of elements n d1 and n d2 .The average response of the two drivers d1 and d2 are computed as described in Equations 2 and 3 and their combination to compute the response ratio RR d1,d2,i,g,c is described in Equation 4. RR ranges between 0 and 1 and describes the contribution of the first driver to the yield variation across both drivers.If these are perfectly balanced, RR is 0.5, if the first driver is the only driver of yield change, RR is 1, if it has no effect, RR is zero.All data processing and plotting was done in R, version 4.1.2(R Core Team, 2021).The simulations for "W 40%" and "T + 5" were not requested per protocol (Franke et al., 2020a(Franke et al., , 2020b).
Earth's Future We describe the different RR values with the median value and the skewness of their distribution.Skewness was computed with R version 4.1.2(R Core Team, 2021) with the skewness function of the moments R package, version 0.14.1, using Equation 5, with x for the data and n for the number of data points i in x and x for the mean of x.
Skewness values range between positive and negative infinity and values outside the [ 0.4, 0.4] interval can be considered skewed, that is, data are distributed asymmetrically (Doane & Seward, 2011).

Cluster Analysis
RRs take continuous values in the interval [0, 1] and were computed for all six combinations of any two drivers of the CTWN data cube In order to structure RRs into Crop Yield Response Types (YRTs), we use hierarchical clustering, separating RR combinations into clusters so that at least 90% of the overall variance in the total sample is explained by the separation into clusters.The resulting YRT describe differences across models and environments simultaneously.This allows for comparing regions and GGCMs with respect to their sensitivities to changes in the CTWN drivers under the full range of global crop growing conditions.In order to include all GGCMs with sufficient data provision, independent of their ability to provide data on responses to variation in N input (see Table 1), we also conducted the same analysis for the CTW data cube with 3 different combinations of any two drivers (T ∼ W, C ∼ T, C ∼ W), which we refer to as CTW-YRT.We used R version 4.1.2(R Core Team, 2021) with R-package Rclustercpp.hclust(version 0.2.6) for large data sets with standard settings, that is, using euclidean distances and the ward method.For describing the characteristics of the individual clusters, we make use of the median and interquartile range of each RRs distribution within each cluster.

Distribution of RR
The GGCMs show different distributions of RR across all crop-specific cropland.There are differences in the median values, but also in the shape -and skewness -of the distributions.Most RR values per GGCM are not normally distributed but highly skewed or bi-modal (see illustrative Figure 2).The differences in median values illustrate differences between models, as the distributions always refer to the same spatial sample (all grid cells with at least 200 ḣa crop-specific harvested area, according to Portmann et al. (2010)).Median values range substantially across GGCMs, but also across crops.
Maize yield simulations of CARAIB show very little response to changes in water supply in comparison to changes in temperature with a median RR T,W value of 0.84, which is in line with the vertical stripe pattern seen in the IRS for CARAIB in Figure 1.JULES maize yield simulations, on the other hand, show the opposite behavior with a median RR value of 0.26.Specific regional characteristics can also already be detected here.EPIC-TAMU, pDSSAT, and PROMET show a spike in the highest RR bin, indicating that there is a substantial number of grid cells (about 1,000 for EPIC-TAMU, >2,500 for pDSSAT, >2,100 for PROMET), in which changes in water supply have basically no effect in comparison to changes in temperature on simulated yields.JULES and LPJmL hardly have such maize-growing grid cells where water supply matters little in comparison to changes in Earth's Future 10.1029/2023EF003773 temperature.Some distributions are highly skewed or show bi-modal patterns, which is most prominent in the combined distribution across all nine GGCMs.There are too many RR and crop combinations to show all distributions as histograms in figures and we thus present results of variation in median and skewness values in Tables 2 and 3. Table 2 shows the range of median RR values across GGCMs, crops and driver combinations.Median values range between 0.06 (importance of C in comparison to N for maize in GEPIC data) and 1.0 (importance of C in comparison to N for soybean in GEPIC data).For all crops analyzed here, many RRs show a very broad range across GGCMs with differences between min and max values often well above 0.5 (Table 2).
One exception is RR C,T of the C3 crops (other than spring wheat), where the range is only 0.3 or lower.Large differences between GGCM's RRs are particularly pronounced for RR W,N and RR T,N for all crops other than the Nfixing leguminous crop soybean.

Crop Yield Response Types
Identifying crop yield response types (YRT ) can help to illustrate the similarities and differences between RR combinations across GGCMs and regions.The hierarchical clustering combines elements (individual data points (leaves) or clusters) by similarity and dendrograms illustrate the similarity of these elements (Figures S11-S20).
Three (e.g., soybean, Figure S23) to six (e.g., winter wheat, Figure S27) clusters were needed to explain at least 90% of the overall variance in the global crop-specific simulation sets.
As already suggested by the GGCM-specific distributions of RRs (Figure 2, Tables 2 and 3), some GGCMs show substantially different YRTs than others, however, also the regional distribution of YRTs differs between individual GGCMs (see Figures 3 and 4 for maize and Figures S21-S28 for the other crops).Since the clusters are defined by similarity of RR combinations, the interpretation depends on the RR distributions within clusters, as displayed in Figure 3 for maize CTW responses (corresponding to Figure 4).As the C4 crop, maize sees no direct stimulation of photosynthesis through elevated atmospheric CO 2 concentrations, but only improvements in water- Earth's Future 10.1029/2023EF003773 use efficiency.Still, some GGCMs display substantial shares of maize growing areas where C is more dominant than changes in T and similar to changes in W (cluster#2; Figure 4).Temperature dominance (cluster#4, as well as clusters #1 and #3, in which T is dominant or on par with the other drivers) is particularly important in pDSSAT, GEPIC, PEPIC, and LPJmL, even though patterns differ (Figure 4).
For rice simulations, the distribution of different CTW-YRTs is more balanced across GGCMs (Figures S21 and  S22), with JULES and PROMET showing little presence of cluster #4 (T dominance and C dominance over W, Figure S21) and LPJmL with little presence of cluster #2 (W dominance and balanced C vs. T response).Spatial patterns show some similarities with respect to cluster #4 (other than in JULES and PROMET) in the tropics and cluster #2 (other than LPJmL) in more arid regions of Asia, Africa, and south America.
Soybean CTW data are only clustered in three different CTW-YRTs (Figures S23 and S24), where JULES and to some lesser extent LPJmL are mostly characterized by cluster #2 (W dominates and C vs. T is balanced).CARAIB and PROMET show larger shares of cluster #3 (dominance of T and of C over W).There is larger Note.CARAIB and JULES did not supply data for different N levels, LPJ-GUESS did not supply data for rice or soybean, JULES did not supply data for winter wheat.These missing data points are indicated by "NA" (not available).
Earth's Future 10.1029/2023EF003773 agreement (n = 6) on presence of cluster #2 CTW-YRT in Europe and parts of North America and moderate agreement for South America, and parts of Africa.
CTW-YRTs for spring wheat are more mixed (Figures S25 and S26).CARAIB, LPJ-GUESS and LPJmL show mostly clusters #1 (W with little importance and C vs. T balanced, Figure S25) and #2 (all balanced), but CARAIB has these two in approximately equal shares, while LPJ-GUESS has substantially more #1 and LPJmL substantially more #2.EPIC-TAMU, GEPIC and JULES are substantially more sensitive to W, JULES with mostly #3 (W dominates, C vs. T is balanced), GEPIC with mostly cluster #4 (C with little importance and T vs. W balanced).Spatial patterns are also mixed, with little pockets of multi-model agreement across all continents.
Also for winter wheat CTW-YRTs show a mixed picture with 6 distinct clusters (Figures S27 and S28).Here, a divide can be seen along the importance of C: only in four of the eight GGCMs (EPIC-TAMU, GEPIC, pDSSAT, PEPIC), cluster #6 can be found (in which C is of little importance and W dominates T), while cluster #2 (C dominates and T vs. W is balanced) is basically absent in these GGCMs.Cluster #2 is particularly widespread in CARAIB and LPJ-GUESS.Spatial patterns show little consistency across GGCMs.If including the N dimension, the number of GGCMs is reduced to at most seven (Table 1), while the number of combinations of any two drivers increases to six.Still, the hierarchical clustering finds similar number of clusters with the threshold of 10% of overall variance within the clusters.The two models that show very little sensitivity of maize yields to water (CARAIB) or very high (JULES) did not provide any data along the N dimension, but within the reduced ensemble with N, there are again two models that show opposite behavior (Figures 5 and 6): LPJ-GUESS has a very strong response of maize yields to N either in combination with strong response to W (clusters #2) or with combination with strong response to T (cluster #4), while PROMET has little response to N either with strong sensitivity to W (cluster #1) or T (cluster #5).LPJmL and pDSSAT maize yields are dominated  by clusters #3 (with balanced responses, but T, W, and N all dominate C) and #1 (with mostly water dominance and little importance of N).Also GEPIC maize yields show large shares of cluster #3, but in combination with cluster #4.PROMET shows very little sensitivity to N also in rice yields (Figures S29 and S30) with almost all pixels being clustered in cluster #1, while all other GGCMs show strong importance of N in clusters #3 and #4, except for LPJmL, which has basically no occurrence of cluster #3 but of cluster #2 (where dimensions are more balanced but T dominates W and N), which is not very predominant in all other GGCMs.Spatial patterns of EPIC-TAMU, GEPIC, pDSSAT, and PEPIC rice sensitivities show some consistency, including LPJmL for Asia.For soybean, all six GGCMs that provided data, see little importance of N (with soybean being an N-fixing leguminous crop).PROMET soybean yield simulations are mostly in YRT cluster #3 (dominance of T and C), which is basically absent in GEPIC and LPJmL simulations (Figures S31 and S32).These two GGCMs find mostly clusters #2 (everything balanced, unless if compared to N) and #4 (T and C dominance).W and C dominance as in YRT cluster #1 is rare, but there is some cross-GGCM agreement on the regional occurrence of this YRT in SE-Europe and northern USA/Canada.Spring wheat YRTs are more mixed across GGCMs and regions.PROMET shows again little sensitivity to N with clusters #4 (W dominates and little importance of N otherwise) and #5 (T and C dominate).LPJmL also shows large shares of #4, but in combination with #1 (W and N dominate) and #2 (N and C dominate).YRT clusters #1 and #2 are also predominant for spring wheat YRTs of EPIC-TAMU, GEPIC, and PEPIC.For winter wheat YRTs, PROMET yield simulations shows also little sensitivity to N (clusters #3 with W dominance and #4 with T and C dominance), whereas LPJ-GUESS is most sensitive to N (cluster #1 with N dominance and all others balanced).Cluster #3 with W dominance is also found to some larger extent in LPJmL simulations, whereas all other GGCMs show large shares of cluster #2 with W and N dominance.

Emergent Functional Relationships
There are also different emergent functional relationships among GGCMs, that is, changes in functional responses that can be observed (emergent) but that we cannot attribute to actual model code structure or parameterization.Making use of the median RRs, we analyze how these change as a function of the other driver dimensions.Owing to the complexity of the data set, we constrain this analysis to median RR T,W responses to changes in C and N (Figure 7 for spring wheat, Figures S27-S40 for the other crops).
In some models and crops, the median RR T,W is hardly affected by increasing C (e.g., CARAIB, EPIC-TAMU, GEPIC, pDSSAT for spring wheat, Figure 7), whereas there are more pronounced changes in the median spring wheat RR T,W with changes in C for the other models.Similarly, the median RR T,W changes only little under different levels of N supply for some GGCMs (e.g., EPIC-TAMU, LPJ-GUESS, LPJmL, pDSSAT, PROMET for spring wheat, Figure 7) but more strongly in others.Also the direction of change varies across GGCMs.While some show an increasing importance of T versus W with increasing N supply (e.g., EPIC-TAMU, GEPIC, PEPIC for spring wheat, Figure 7), others see the opposite (decreasing importance of T vs. W with increasing N supply) or mixed cases.The combination of changes in C and N can lead to different emergent functional relationships, too: PEPIC spring wheat simulations show an increasing importance of T versus W with increasing C under high N supply, but a substantially lower importance of T versus W at low N supply and also a decreasing trend with higher C. Similar emergent functional relationships can be observed for the other crops analyzed here, but there are also crop-specific differences for some individual models.CARAIB shows always high median RR T,W values with little to no effect from changes in C across all five crops.EPIC-TAMU, GEPIC, and PEPIC all show very strong responses in median RR T,W to changes in N supply for rice (Figure S38), but much less so for winter wheat (Figure S40).LPJmL and PROMET see increasing median RR T,W with C for winter wheat (Figure S40), but less so for other crops (PROMET also increasing values for maize, Figure S37).

Discussion
We find that the crop models contributing to the GGCMI Phase 2 experiment (Franke et al., 2020a(Franke et al., , 2020b) ) show substantial differences in yield responses to drivers along the carbon dioxide (C), temperature (T), water (W), and nitrogen (N) dimensions.These differences are caused by model structure and mechanisms as well as parameterization (Folberth et al., 2019).Because not all GGCMs provided the full set of CTWN simulations (Franke et al., 2020a(Franke et al., , 2020b) (see Table 1), we used the emulators developed on these simulation sets (Franke et al., 2020a(Franke et al., , 2020b) ) to gap-fill missing elements.Even though the emulators show generally good skill in reproducing model results, yield responses along the N dimension were particularly difficult to emulate with the low number of experiments in that dimension (n = 3, see Table A1).Also the number of simulations that needed to be Earth's Future 10.1029/2023EF003773 supplemented by emulated responses affects how well 'true' GGCM responses can be reproduced by the emulators.However, PEPIC, which supplied the smallest number of simulations of the ensemble considered here (Table 1) is in relatively good agreement with the other EPIC-based GGCMs considered here.This mechanism could also be a possible reason for the low N sensitivity of PROMET, which had also supplied only a small number of simulation sets for different N levels (Franke et al., 2020a(Franke et al., , 2020b)).The selection of the CTWN drivers does not cover the full range of climatic drivers of crop yield change (Schauberger et al., 2016) and albeit these are important, further research on additional drivers, such as irradiation as included in the study of Tao et al. (2020), would be helpful (Ruane et al., 2022).
Median RRs show a broad range of values, but some of this is expected.So is the role of CO 2 fertilization not very strong for maize as a C4 plant or the role of N inputs is relatively small for soybean, which can acquire atmospheric N via biological N fixation.The extreme soybean RR values for driver combinations with N (Table 2) can thus be explained by model design, but also indicate that more complex implementations should be implemented, as for example, done in a later LPJ-GUESS version than the one used here (Ma et al., 2022).Apart from such general responses of little C importance for maize yield simulations or little N importance for soybean yield simulations, large differences in the sensitivity to different drivers exist between models.
The skewness of RR distributions is in part determined by the environmental conditions of the spatial sample, that is, the actual crop-growing areas, yet model differences are also dominant here.In some cases, some models find highly negatively skewed distributions, whereas others find highly positive skewed distributions (e.g., 1.42 for CARAIB vs. 1.45 for JULES for the distribution of maize RR T,W , or 1.69 for LPJ-GUESS vs. 1.18 for GEPIC for spring wheat RR C,T ).While we cannot expect normally distributed RR values as the cropland sample may not reflect normally distributed growing conditions, differences in skewness across GGCMs are only attributable to model functionality as they all use the same spatial sample here.Earth's Future 10.1029/2023EF003773 The clustering of RR values to YRTs illustrates that differences in response types can be larger between GGCMs than between regions.The arbitrary choice of leaving 10% of overall variance within clusters led to a small number of clusters (n < 7) that allow for qualitative description of their characteristics and interpretability.
Even though this threshold does not follow any formal definition of the optimal number of clusters, we argue that it is important to only have a small number of clusters for discussing regional and GGCM-specific distributions of YRTs.The dendrograms (Figures S11-S20) show that the number of clusters is not very sensitive to smaller variations of the 10% threshold but that the number of clusters dramatically increases at thresholds <5%.Some models show consistent behavior across different crops (e.g., PROMET is typically not very sensitive to changes in N compared to changes in any other driver and CARAIB is not very sensitive to changes in W compared to other drivers).LPJ-GUESS shows greater sensitivity to C than other models for spring wheat, but greater sensitivity to N than any other model for winter wheat, where CARAIB shows the greatest sensitivity to C. PROMET also shows greatest sensitivity to C of winter wheat, but only from the ensemble that also supplied data on the N dimension, even though it tends to be rather insensitive to N in general.Similarity in spatial patterns of YRTs across some GGCMs and crops suggest that growing environments can be the dominant determinant of model sensitivities, as should be expected for perfect models.Differences in spatial patterns can stem from smaller differences along cluster borders that result in different clusters and suggest significant differences (classification problem) or differences in regional parameterizations, as applied by some GGCMs (Folberth et al., 2019), reflecting how sparsely the global diversity in farming systems (e.g., Jarvis et al., 2008) is reflected in crop models.Nonetheless, differences in spatial patterns across GGCMs suggest that differences in models' sensitivities to environmental drivers needs further attention from model development and application.
It can be expected that crop yields show interacting responses to simultaneous changes in CTWN drivers.If, for example, N limitation is lifted, W limitation may show more clearly and vice versa.The EPIC model has a maximum function approach and only considers the most severe from several stressors in daily biomass gains (Sharpley & Williams, 1990; J. R. Williams, 1990) and indeed, the EPIC-based GGCMs (EPIC-TAMU, GEPIC, PEPIC) show substantial differences in RR T,W under different N levels, except for soybean (as an N-fixing plant) and only to some limited extent for winter wheat.The GGCMs of the GGCMI Phase2 ensemble show a range of emergent functional relationships, varying between no effect (e.g., pDSSAT for soybean, Figure S39), layered but flat effects (e.g., GEPIC for maize, Figure S37), increasing (e.g., LPJmL for winter wheat, Figure S40) or decreasing (e.g., LPJ-GUESS for spring wheat, Figure 7).While it is quite possible that these emergent functional relationships should differ between crops, because of their physiological traits (e.g., C3 vs. C4 photosynthesis) and where they are grown, there should not be substantial differences in the overall RR level or in the direction of change under variations in additional drivers.We here only analyze highly aggregated data (global and temporal aggregation), but aggregation typically leads to more balanced responses with extremes canceling out so that even stronger differences can be expected at the more detailed level (individual sites and years).
We find that in a standard model evaluation, focusing on reproducing grain yield as for example, by Müller et al. (2017) (see Figures S2-S10), model differences are not as prominent as in response types and functional relationships.As such, crop models need to be evaluated not only with respect to reproduce observed yield dynamics, because final yields are affected by a multitude of processes and drivers (Schauberger et al., 2016) and Zhu et al. (2019) showed that error compensation in maize simulations can lead to accurate yield estimates.Different emergent functional relationships have been reported also for model intercomparison studies at site scale (Tao et al., 2020) and for other crops (e.g., E. Wang et al., 2022) and can originate from model parameterization (e.g., through different calibration methods), choice of subroutines (e.g., for potential evapotranspiration (Cammarano et al., 2016;Folberth et al., 2019)) or modeler choices (Albanito et al., 2022;Folberth et al., 2019;E. Wang et al., 2022).Fronzek et al. (2018) attempt to relate process implementation with IRS classes, identifying evapotranspiration models, soil water modules, and heat stress modules as important determinants of similarity between crop models.
Data availability for crop model evaluation on aspects other than yields is still a strong limitation, especially at large-scale applications.Remote sensing products may fill this gap to some extent (e.g., Cetin et al., 2023;Jiang et al., 2023;Jin et al., 2018;Yue et al., 2023).
Testing models for emergent functional properties, as also requested by Tao et al. (2020) could be an alternative approach to model evaluation, which requires knowledge on functional relationships and structured model Earth's Future 10.1029/2023EF003773 experiments, such as the GGCMI Phase2 experiment (Franke et al., 2020a(Franke et al., , 2020b)), even though experimental design targeted to identifying specific functional relationships should drastically reduce the computation demand that was associated with the GGCMI Phase2 experiment.There are some examples of testing model for emergent functional properties (e.g., Schauberger et al., 2017), and dedicated efforts for model evaluation on aspects other than yield (e.g., Kimball et al., 2019, for maize evapotranspiration) but this is typically not integrated into standard model evaluation exercises.Schneck et al. (2022) provide an assessment of the emergent property water-use efficiency of their land surface model across different precipitation and temperature regimes (sampled from a global simulation rather than a stylized experiment design).The Earth System Modeling community has established standards for model evaluation (e.g., ESMValTool v2.0 Eyring et al., 2020), which can provide guidance from a technical and conceptual perspective.Yet, the climate system is described (and evaluated) by many different variables in contrast to the focus on the single end-of-season variable yield in crop modeling, limiting the comparability of evaluation standards in the two research domains.Horak et al. (2021) suggest a process-based evaluation of Intermediate Complexity Atmospheric Research Models that is based on stylized modeling experiments to help models become right for the right reasons.This approach is likely easier to transfer to crop modeling and we suggest that the idea of process-based model evaluation from targeted simulation experiments is pursued in future crop model evaluation efforts.The sensitivity of models to calibration (e.g., E. Wang et al., 2022) and parameterization (e.g., Folberth et al., 2019;E. Wang et al., 2017) indicates that model evaluation needs to be conducted continuously and cannot be substituted by references to model description papers or earlier evaluation efforts.The approach by Brown et al. (2018) (Kelley et al., 2013).Better efforts in the crop modeling community for model testing and evaluation are needed.At the same time, model deficiencies that have been identified by testing models, such as by Y. Li et al. (2019), who found that an ensemble of GGMCs (Müller et al., 2017(Müller et al., , 2019) )  • along a broad nitrogen supply gradient, crop yields should first increase and then level off; • along a soil water content gradient from dry to very wet, crop yields should first increase and then decrease (e.g., Y. Li et al., 2019); • exposure to canopy temperatures above lethal temperature thresholds should lead to an abort of crop growth.

Conclusions
The diversity in RR indicates that GGCMs have very different sensitivities to different climatic drivers and nitrogen supply.This has been discussed in the literature with a strong focus on the role of CO 2 on yield formation (Toreti et al., 2020), as many studies had presented results with and without CO 2 fertilization effects (e.g., Rosenzweig et al., 2014), bringing attention to this particular effect.We find that changes in temperatures, water or nitrogen supply yield similar strong differences when it comes to model sensitivities.Model evaluation should advance to including emergent functional relationships that may tell more about model plausibility and skill than only comparison with yield data from observations.For this, existing knowledge needs to be collected, tested for generalizability, and translated into simple tests that models can be subjected to.Modeling protocols need to be designed to enable such functionality tests rather than only comparisons with yield data.A community effort is needed to bring together knowledge collection and formalization, model experiment design and model testing in order to advance crop modeling toward reduced uncertainty in crop model applications.

Figure 1 .
Figure 1.Illustrative example of crop model Impact Response Surfaces (IRS), here for maize and the temperature (T) and water (W) dimensions with atmospheric CO 2 (C) at 360 ppm and fertilizer input (N) at 200 kg ha 1 .All crop model simulations provided by the modeling groups are displayed as colored rectangles in the IRS.Colors indicate relative yield changes compared to the mean across all data points of the respective IRS.White spaces indicate missing simulation sets.The simulations for "W 40%" and "T + 5" were not requested per protocol(Franke et al., 2020a(Franke et al., , 2020b)).

Figure 2 .
Figure 2. Distributions of maize Response Ratios for the temperature (T) versus water (W) domains (RR T,W ) of the nine different Global Gridded Crop Models (GGCMs), showing the importance of T in comparison to W at default C (360 ppm) and N (200 kg ha 1).Values approaching unity indicate higher sensitivity to T than to W, while values approaching zero indicate higher sensitivity to W than to T (see Equation4). Green vertical lines show the median value, which is also given in the title of each panel.Bottom right-hand panel shows the distribution across all GGCMs.Results are shown for currently cultivated maize cropland.

Figure 4 .
Figure 4. Spatial distribution of Crop Yield Response Types (YRTs) for each of the nine Global Gridded Crop Models (GGCMs) for maize, considering the C, T, and W dimensions, but without consideration of the N dimension, because this was not supplied by all GGCMs.The stacks in the bottom right-hand corner show the grid cell frequency distribution of the YRT clusters for each GGCM.The RR combinations characterizing each cluster are shown in Figure 3 above.

Figure 5 .
Figure 5.As Figure 3 but now for the full CTWN set with nitrogen and six combinations of any two drivers (gray shadings of boxplots, ordered from left to right).

Figure 6 .
Figure 6.Spatial distribution of Crop Yield Response Types (YRTs) for each of the seven Global Gridded Crop Models (GGCMs) for maize including all four response dimensions (i.e., C, T, W, N).The stacks in the bottom right-hand corner show the grid cell frequency distribution of the YRT clusters for each GGCM.The RR combinations characterizing each cluster are shown in Figure 5 above.

Figure 7 .
Figure 7. Spring wheat median RR T,W values under different levels of atmospheric CO 2 (x-axis) and N supply (colors).CARAIB and JULES did not supply data at different levels of nitrogen supply, their generic N response in shown in gray.

Table 1
Number of Global Simulation Sets of Crop Yield (y) Included in the Global Gridded Crop Model Intercomparison Phase 2 Archive per Model and Crop for the Simulation Set

Table 2
Median Values of RR Across Crops and Driver Combinations

Table 3
Skewness of RR Distributions Across Crops and Driver CombinationsNote.CARAIB and JULES did not supply data for different N levels, LPJ-GUESS did not supply data for rice or soybean, JULES did not supply data for winter wheat.These missing data points are indicated by "NA" (not available).Distribution of RRs within CTW-YRT clusters.Within each cluster (colored boxes), three boxplots describe the distribution of RRs for T versus W (dark gray, left boxplot), C versus W (gray, middle boxplot), C versus T (light gray, right boxplot).Horizontal lines indicate the median value, boxes extent across the interquartile range (IQR).Whiskers extend to the most extreme value within 1.5 times the IQR, outliers beyond this threshold are omitted.Colored dots on top of each cluster box indicate what drivers dominates: red for T dominance, blue for W dominance, purple for C dominance, and gray for no dominance.We rate drivers as balanced (i.e., no dominance) if the median RR is between 0.4 and 0.6.
to identify standard tests and include these into the user interface of the crop model APSIM to facilitate better model development is a promising approach.Such easy to access standard tests based on specific experiments can guide model development, but may be more limited for testing different case-specific model parametrizations.While the approach Brown et al. (2018) is model specific, such standard tests can be generalized to crop models in general, as demonstrated by efforts on general model benchmarking in global vegetation modeling fails to reproduce the observed yield penalties under extremely wet conditions, need to incentivize model development.Global-scale crop model applications for high-end climate scenarios can expose crop models to input data combinations that have never been used during model development and may lead to unwanted model behavior.GGCMI data are thus tested for implausible data ranges (see quality check scripts at https://github.com/AgMIP-GGCMI/phase3_quality_check),but functional model test could test for plausibility in response to different drivers, such as.