Parsimony versus predictive and functional performance of three stomatal optimization principles in a big-leaf framework.

Stomatal optimization models can improve estimates of water and carbon fluxes with relatively low complexity, yet there is no consensus on which formulations are most appropriate for ecosystem-scale applications. We implemented three existing analytical equations for stomatal conductance, based on different water penalty functions, in a big-leaf comparison framework, and determined which optimization principles were most consistent with flux tower observations from different biomes. We used information theory to dissect controls of soil water supply and atmospheric demand on evapotranspiration in wet to dry conditions and to quantify missing or inadequate information in model variants. We ranked stomatal optimization principles based on parameter uncertainty, parsimony, predictive accuracy, and functional accuracy of the interactions between soil moisture, vapor pressure deficit, and evapotranspiration. Performance was high for all model variants. Water penalty functions with explicit representation of plant hydraulics did not substantially improve predictive or functional accuracy of ecosystem-scale evapotranspiration estimates and parameterizations were more uncertain, despite having physiological underpinnings at the plant level. Stomatal optimization based on water use efficiency thus provided more information about ecosystem-scale evapotranspiration compared to those based on xylem vulnerability and proved more useful in improving ecosystem-scale models with less complexity.


Introduction
Stomata, pores on the surface of leaves, control the exchange of water and carbon dioxide (CO 2 ), thus playing a key role in regulating terrestrial water and carbon fluxes. Globally, plants recycle over 60% of terrestrial water from the land to the atmosphere through transpiration (Good et al., 2015) and sequester 30% of annual anthropogenic CO 2 emissions through photosynthesis (Friedlingstein et al., 2019). Natural resource management adapted to increasingly variable climatic conditions requires robust modelling of water and carbon cycles and hence of stomatal conductance. Improving our ability to characterize the terrestrial biosphere response to climatemore specifically, accurately translating mechanistic stomatal behavior to larger scalesis essential to reduce uncertainty in water and carbon fluxes (Rogers et al., 2017) and climate projections (Friedlingstein et al., 2014). Under climate change, empirical approaches have limited skill in estimating stomatal conductance (De Kauwe et al., 2013;Liu et al., 2020a). Mechanistic models are needed (Buckley, 2017), which accurately account for hydraulic system feedbacks  and leaf-to-landscape scaling of soil-plant-water transport (Mencuccini et al., 2019a). Complex models, however, have high data requirements and are computationally demanding.
Among mechanistic approaches, stomatal optimization has the potential to supersede empirical models with low complexity (Buckley, 2017). Optimality theory synthesizes stomatal response to environmental change in terms of plants' tradeoffs in carbon assimilation (Cowan & Farquhar, 1977), offering analytical expressions of optimal stomatal conductance. Earth System Models (ESMs) can use these analytical expressions instead of canonical empirical equations for stomatal conductance (Ball et al., 1987;Leuning et al., 1995) because they are functionally similar, with the added benefit of physically meaningful parameters describing ecophysiological tradeoffs (Medlyn et al., 2011). Existing stomatal optimization models differ in how they formulate the cost or penalty of opening stomata, for example water loss vs xylem damage through transpiration (Wang et al., 2020). Global variability of water cost per carbon gained is explained by biophysical characteristics that can be generalized by biome (Lin et al., 2015) and in principle determined as a function of a system's boundary conditions (Cowan & Farquhar, 1977;Manzoni et al., 2013a). A carbon cost associated with xylem damage directly couples stomatal conductance to hydraulic mechanisms (Wolf et al., 2016) and requires no fitting parameters if plant hydraulic traits are known (Sperry et al., 2017;Eller et al., 2018).
The growing number of stomatal optimization approaches and emergent interest in their performance leads to debates over This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
which is most appropriate Dewar et al., 2018;Wang et al., 2020). There is no consensus on the most realistic carbon cost or water penalty formulations and how optimality criteria vary with soil moisture. Further, optimality criteria are often based on instantaneous costs and benefits, while accounting for longer-term opportunity costs is more consistent with observations (Lu et al., 2020). Finally, the functional accuracy of both leaf-and xylem-level optimization theories, including their parameters, in representing ecosystem-scale behavior and in capturing short-and long-term feedbacks between water consumption, plant status, and soil-water depletion is unknown. Stomatal optimization models have been reviewed and compared theoretically and empirically at the leaf level Anderegg et al., 2018b;Dewar et al., 2018;Wang et al., 2020) and their predictive performance at large scales compared to empirical approaches, specifically during drought (Eller et al., 2020;Liu et al., 2020a;Sabot et al., 2020). However, ecosystemscale comparisons of diverse optimality principles and clarity on which formulations improve ESMs are lacking. To this end, robust performance diagnostics compared to ecosystem-scale observations are necessary.
Plant transpiration is simultaneously constrained by soil water supply and atmospheric water demand. Disentangling the effects of these co-varying variables on evapotranspiration and accurately representing their combined effects in models is challenging (Novick et al., 2016;Gentine et al., 2019). Stomatal conductance models should have high predictive performance, that is, should accurately reproduce values of a target observation such as transpiration. Stomatal conductance models should also have adequate functional performance, that it, should accurately reproduce the sensitivity of transpiration to both soil moisture and vapor pressure deficit as well as the sensitivity of their interactions in varying conditions and ecosystems. Tradeoffs between predictive and functional performance can indicate that a model provides accurate estimates for the wrong reasons due to model structural errors (Ruddell et al., 2019), and thus may not capture the correct response of stomatal conductance to environmental conditions. Parameterization performance ensures that data, at the scale of interests, encodes enough information to characterize model relational assumptions, that is, model complexity matches data availability (Feng, 2020).
Information theory can quantify directional and nonlinear functional relations to characterize joint causal interactions in complex ecohydrological systems (Ruddell & Kumar, 2009;Goodwell et al., 2020), such as those between evapotranspiration, soil moisture, and vapor pressure deficit. Information theory also provides performance diagnostics that quantify how much information models extract from observations in a nonparametric way, and can attribute observed errors to specific processes (Weijs et al., 2010;Ruddell et al., 2019). Information quantification is emerging as a powerful technique for model evaluation and hypothesis testing (Nearing et al., 2020). It has been applied to the benchmarking of ESMs (Nearing et al., 2018) but has yet to be used for ecophysiological hypothesis testing.
In addition, quantifying stomatal optimization parameters and characteristics of the soil-plant-atmosphere continuum for ESMs that preserve key ecological mechanisms of an ecosystem is challenging because such parameters cannot be measured directly at large scales (Mencuccini et al., 2019a). Comparisons of observed vs modeled or leaf vs ecosystem water use efficiency reveal uncertainties in upscaling ecophysiology Lavergne et al., 2019). Water and carbon fluxes are sensitive to plant trait diversity within an ecosystem. The challenge of summarizing complex interactions resulting from species coexistence using single parameter values remains unresolved Anderegg et al., 2018a). It is thus important to determine whether any added uncertainty in stomatal optimization parameterizations is justified in terms of information gained about water and carbon dynamics at ecosystem scales and apply 'Occam's razor', that is, the principle that improved predictions should outweigh increased complexity (Weijs & Ruddell, 2020).
We evaluate three analytical equations for optimal stomatal conductance, based on different water penalty functions, using flux tower observations and information theory diagnostics. Our goal is to rank optimality principles based on their ability to provide information about transpiration and its interactions with soil moisture and vapor pressure deficit at ecosystem scales. The ideal optimization principle should minimize information loss in terms of uncertainty in model structure, including uncertainty in transpiration estimates, relations with driving variables, and complexity; that is, it should have high predictive and high functional performance across different plant functional types (PFT) and in wet to dry conditions; and adequate parameterization, with parsimonious and well-constrained calibrated parameters that reflect realistic traits of different PFTs.

Data
We selected 30 sites (Supporting Information Table S1) from the FLUXNET2015 dataset (Pastorello et al., 2020) that met the following criteria: dominance of a single PFT; observations of model variables for at least 150 (nonconsecutive) growing season days to ensure analysis robustness; soil moisture measurements between 10 and 32 cm to represent soil water content at the average rooting depth; and minimum growing season soil water potential (ψ s ) below −1 MPa to include water-limiting conditions. Selected sites comprise six boreal, 17 temperate, and seven Mediterranean climates; and nine needleleaf, eight broadleaf, nine C3 grass, and four C3 crop PFTs (Table S1).
This analysis required half-hourly data for latent heat flux (LE), gross primary production (GPP, based on nighttime respiration partitioning), atmospheric CO 2 concentration (C a ), soil water content (θ), air vapor pressure deficit (D), air temperature, air pressure, wind speed, friction velocity, incoming photosynthetically active radiation, and sensible heat flux. We identified sensor positions, average canopy height, and soil texture from FLUXNET2015 metadata and references (Table S1). We determined soil hydraulic parameters from soil texture and converted θ to ψ s (Brooks & Corey, 1964).
We only analyzed data flagged as observed. We excluded outliers and focused on growing season observations during which transpiration dominates LE, following criteria commonly used to estimate water-use efficiency from flux tower data (Knauer et al., 2018; see Methods S1 for details). The remaining data comprise 151 to 1299 d of observations per site, with on average 15 halfhourly observations per day (Table S1).

Optimal stomatal conductance
The generalized optimality theory for stomatal conductance states that plants seek to maximize net carbon gain, that is, photosynthesis (A, mol m −2 s −1 ) minus a penalty (Θ, mol m −2 s −1 ) (Wolf et al., 2016). Thus, optimal stomatal conductance (g s;opt , mol m −2 s −1 ) varies with environmental conditions so that the marginal photosynthetic gain of transpiration (∂A=∂T ) is equal to the marginal penalty of transpiration or marginal cost of water per carbon gained (∂Θ=∂T ): Although there are a number of existing conceptually different optimality principles, Eqn 1 can be solved using a common mathematical framework. A study by Dewar et al. (2018) provides a generic derivation for g s;opt for any choice of penalty function Θ and leaf photosynthesis model A. Neglecting nonstomatal reductions in photosynthesis and assuming αD ∂Θ ∂T À Á ≪ C a À Γ * for simplicity, the generic analytical solution for g s;opt is ; C a is atmospheric CO 2 concentration (mol mol −1 ); D is the difference in water vapor concentration between the atmosphere and saturated air space in the leaf (mol mol −1 ); Γ * is the CO 2 photorespiratory compensation point (mol mol −1 ); α is the relative diffusivity of water vapor with respect to CO 2 (mol mol −1 ); f 0 is the CO 2saturated rate of photosynthesis (mol m −2 s −1 ); and γ is a kinetic parameter (mol mol −1 ). We represented A by a continuous function embedding limitations from both CO 2 transport (Rubisco activity) and incident light (RuBP regeneration) to avoid a priori selection of light-vs Rubisco-limited conditions and streamline model calculation (Vico et al., 2013). We defined this function and associated f 0 and γ in Methods S2. In addition, Eqn 2 is often recast as a generic relation between g s;opt and A (Medlyn et al., 2011;Wolf et al., 2016;Dewar et al., 2018), which is functionally analogous to commonly used empirical models (Ball et al., 1987;Leuning et al., 1995):
Several other optimality models exist in the literature, including numerical or dynamic stomatal optimization models, which were not evaluated in this study (see reviews by Buckley et al., 2017;Dewar et al., 2018;Mencuccini et al., 2019;Wang et al., 2020). We focus on analytical expressions of g s;opt resulting from WUE, CM, and SOX because they can be applied without numerical optimization. Using each principle, ∂Θ=∂T connects stomatal behavior to the hydraulic system in a more physically meaningful way than commonly used water-limitation constraints (Feddes et al., 1978). Our goal was to evaluate which of the selected optimality principles has the most potential to robustly replace empirical relations between soil moisture, vapor pressure deficit, and stomatal conductance in ESMs without increasing computational costs.

Big-leaf comparison framework
We developed a generic model that provides a level playing field to focus on the comparison of different ∂Θ=∂T formulations without confounding factors. We used a big-leaf framework (Bonan, 2019) to scale g s;opt per unit leaf area, determined based on mechanistic descriptions of leaf-level stomatal behavior, to total surface conductance per unit ground area (G surf , mol m −-2 s−1 ), representing the flux tower scale. We implemented G surf in the Penman-Monteith equation (Penman & Keen, 1948;Monteith, 1965) using forcing data from FLUXNET2015 and calculated ecosystem-scale LE. Model variants refer to the implementation of each of the stomatal optimization principles (WUE, CM, and SOX) in the comparison framework.
We defined surface conductance to water vapor, G surf , as the sum of soil (G soil ) and canopy (G canopy ) components (Li et al., 2019). We expressed G soil as a linear function of soil moisture, bound by a maximum value (G soil; sat ) at soil saturation (θ s ), that is, G soil ¼ G soil; sat θ=θ s . This simple assumption is reasonable because we considered only days with no rain and with no rain during the preceding day, which generally follow water-limited soil water evaporation . The big-leaf framework assumes all leaves in the canopy respond to the same environmental conditions measured by the flux tower. We defined G canopy within the big-leaf framework by recasting the relation between g s;opt and A (Eqn 3) at the ecosystem scale (Lin et al., 2018), assuming air vapor pressure deficit (normalized by air pressure) represents D and substituting A (CO 2 flux per unit leaf area) with GPP (CO 2 flux per unit ground area): For all optimality principles, we calculated ∂Θ=∂T as a function of concurrent ψ s converted to predawn ψ, reflecting a daily optimization timescale. We thus mitigated uncertainties in the instantaneous estimation of canopy and xylem water potentials and avoided iterative calculations, which did not improve the performance of our comparison framework (Fig. S1).
While the big-leaf comparison framework did not explicitly model certain processes, they were included in its structure to some extent by exploiting the availability of GPP data. Our comparison framework was devised to minimize structural errors that are beyond the optimization principles and may affect model comparisons, including uncertainty in scaling from leaf to surface conductance and in estimating photosynthesis. The framework must match information in the data of both LE observations, as the target, in terms of predictions, and the functional relation with GPP. The substitution of A with GPP measurements, despite observational uncertainties, provides constraints to partition soil water evaporation; and scaling factors representing effective ecosystem leaf area index; while accounting for all environmental factors affecting photosynthesis, including irradiance, leaf temperature, and leaf water potential (Figs S1-S4; Methods S2). We could thus focus on comparing selected stomatal optimization principles and how ∂Θ=∂T varied with soil moisture through predawn ψ.

Parameterization
We evaluated the adequacy of the optimization principles for generalized applications, that is, when site-specific or species-level information is not available and parameter values are based on PFTs. We calibrated the parameters in two steps: we first estimated site-specific G soil; sat values, then PFT-level parameter values to express ∂Θ=∂T (Table 1). This two-step parameterization ensures that soil water evaporation is equal in each model variant. We could thus assume that comparisons of model diagnostics mainly reflected differences in ∂Θ=∂T formulations rather than uncertainty in evapotranspiration partitioning resulting from varying G soil; sat calibration in model variants.
We performed the least-square minimization for each model variant (WUE, CM, and SOX) on the full record of each site and averaged the resulting G soil; sat estimates. We then implemented the average site-specific G soil; sat and performed a bootstrapping calibration process for the remaining model-specific parameters we repeated the least-square minimization 200 times on randomly selected data subsamples and initial value guesses. A different 50% of the record in each subsample was randomly split in half for calibration (25%) and validation (25%) to obtain a distribution of 200 parameter estimates. For each site, we determined leave-one-site-out parameters to express ∂Θ=∂T for WUE, CM, and SOX as the median of parameter estimates from all other sites of the same PFT.

Information theory diagnostics
Information theory can measure directional and nonlinear functional relations using nonparametric techniques, thus overcoming limitations of commonly used correlation measures in the analysis of complex environmental processes (Cover & Thomas, 2012). The analysis of information flows is an established technique that quantifies shared information between variables in environmental systems and characterizes functional couplings and joint causal interactions (Ruddell & Kumar, 2009;Goodwell et al., 2020). Information theory is particularly well suited to comparing stomatal optimization principles because metrics only extract information from data, without imposing potentially incorrect assumptions that can lead to misinterpretation of incomplete information (Weijs et al., 2010). Missing or bad information can result from model structural errors (Nearing & Gupta, 2015). An information-theoretical approach especially benefits ecosystem-scale analyses because their observations and processes are often uncertain. We exploited information flows to dissect instantaneous controls on latent heat (LE) by two of its key drivers, soil moisture, θ, and vapor pressure deficit, D; quantify how much information about LE is missing, given a stomatal optimization principle; and evaluate whether model variants under-or over-estimate the interactions between LE, θ, and D.
Measurements of θ and D provide information about LE, that is, knowing θ and/or D reduces LE uncertainty. Information theory quantifies uncertainty in a variable (observed or modeled), by Shannon's entropy and shared information with mutual information, both in units of bits (Cover & Thomas, 2012). Mutual information between LE and one of its key drivers (I ðθ, LEÞ or I ðD, LEÞ, Fig. 1a) includes unique (U θ or U D ) information about LE, provided by θ or D separately and by themselves, and redundant (R) information, provided by knowing either θ or D. Conditional mutual information (I ðθ;LEjDÞ, Fig. 1b) quantifies the reduction in LE uncertainty given knowledge of θ beyond the information provided by D; it includes U θ and synergistic (S) information, provided by knowing both θ and D because they interact in defining LE. Finally, total multi-variate mutual information (I ðθ, D;LEÞ, Fig. 1c,d), the total reduction in uncertainty of LE given knowledge of variables θ and D together, is the sum of four nonnegative partitions U θ þ U D þ S þ R (Williams & Beer, 2010;Goodwell & Kumar, 2017). Three-dimensional information flows thus reveal aspects of nonlinear dependencies of LE on θ and D that are undetectable when considering relations in pairs or with correlation measures. Relative magnitudes of information partitions reflect causal interactions between variables in the system (Goodwell et al., 2018) and are relevant to understanding complex functional ecophysiological relations and accurately representing them in models.
We calculated information flows from half-hourly observed and modeled LE independently, following established methods: we estimated entropy by binning normalized variables and tested statistical significance with shuffled surrogates (Methods S3, S4). We focused on instantaneous controls of θ and D on LE at the half-hourly timescale. We considered I ðθ, D;LEÞ a meaningful measure of functional relations between these variables because it is statistically significant (> 99% confidence level) at all study sites. We normalized the information partitions U θ , U D , S and R by the total information I ðθ, D;LEÞ to measure relative controls of θ and D on LE in units of bits per bit.

Parameterization, predictive and functional performance
Performance diagnostics based on information theory (Methods S5) represent missing or inadequate information in a model compared to observations (Ruddell et al., 2019;Nearing et al., 2020).
Parameterization performance is a measure of how much information a model variant can extract from data about its parameters and model relational assumptions. An overly complex model may require more data than is available to constrain the values of its parameters. We quantified parameter uncertainty or information missing about each parameter given a model variant as the information entropy of the 200 parameter estimates for each site. We evaluated tradeoffs between goodness-of-fit and complexity by comparing the model variants based on their Akaike information criterion (AIC), which is a measure of information loss, taking into account the number of model parameters (Akaike, 1974).
Predictive performance (A p ) is a measure of a model's ability to represent the target variable (LE), similarly to commonly used goodness-of-fit metrics. We calculated A p as the relative fraction of LE uncertainty, observed in the flux tower dataset, which is explained by a model, that is, the relative difference between the entropy of LE observations and mutual information between observed and modeled LE. A value of A p equal to zero indicates that there is no missing information or no uncertainty in the model about LE and all the variability in observed LE is captured by modeled LE; while A p ¼ 1 indicates that observed and modeled LE are independent.
Functional performance (A f ) and its components are a measure of a model's ability to represent the correct strength of the relations between the target variable and its drivers, that is, the total information from θ and D about LE and its partitioning (U θ , U D , S and R) in the model should match that in the observations. We calculated six A f components as the relative difference between observed and modeled information flows: total multivariate mutual information (A f ;T ); its partitioning into unique

Research
New Phytologist negative A f,θ (or A f,D ) values indicate that specifically unique information θ (or D) provided about LE in the model is over/underestimated compared to observations, thus attributing errors to the strength of a particular functional relation. Positive/negative A f,S and A f,R values indicate that controls from θ and D together on LE are inaccurate, thus attributing errors to interactive effects between θ and D represented in the model. A f,P summarizes overall functional accuracy in reproducing the sensitivity of LE to θ and D together, which also takes into account dependencies between the driving variables.

Model ranking and performance scores
We used five criteria to capture a broad range of tradeoffs in performance for each model variant and at each study site: (a) calibration/validation goodness-of-fit, measured by the relative difference between validation and calibration Nash-Sutcliffe efficiencies (NSE) and median absolute percentage errors (MAPE) resulting from the 200 subsamples; (b) low parameter uncertainty measured by the coefficient of variation and the entropy of the 200 parameter estimates; (c) adequate parsimony measured by AIC; (d) predictive accuracy in estimating evapotranspiration measured by A p ; (e) functional accuracy in reproducing the sensitivity of evapotranspiration to θ and D measured by A f,T and A f,P . We quantified and ranked model performance for the full record and for dry, mesic, and wet conditions separately, by calculating AIC, A p , and the six A f components from data below, between, and above the interquartile range of θ observed in each site's record.
We ranked the model variants at each site according to diagnostics for each criterion. We then averaged site-specific and diagnostic-specific ranks and converted them to overall performance scores (0, 1) for each criterion. The performance score translates to the fraction of sites for which (or probability that) the model variant in question is the most desirable. A score > 0.5 indicates that a criterion is statistically better for the considered model variant than a randomly selected variant. A score > 0.67 (or < 0.33) indicates that a model variant is statistically the most (or least) desirable among the three considered. The ranking approach was necessary to summarize results because diagnostic values from sites with heterogeneous record lengths and characteristics are not comparable and absolute values are less meaningful.

Results
We implemented three stomatal optimization principles (WUE, CM, and SOX), based on different water penalty functions in a big-leaf comparison framework to estimate evapotranspiration, and quantified missing or inadequate information in the three model variants using flux tower data. Model ranking according to the five evaluation criteria was largely consistent across biomes (Fig. 2) and soil moisture conditions (Fig. S5).

Calibration/validation goodness-of-fit
The bootstrapping calibration was good for all model variants (Figs 3, 4). SOX scored highest (0.58) and CM lowest (0.38) in terms of calibration/validation goodness-of-fit (Fig. 2a). The median validation NSE and MAPE were 85 and 16%, and the relative difference in NSE and MAPE between calibration and validation subsamples was largely under 5%, for all variants and sites combined (200 per site).

Parameter uncertainty
WUE scored highest (0.65) and CM lowest (0.35) in terms of least parameter uncertainty (Fig. 2b). Parameter coefficients of variation resulting from the bootstrapping calibration, for all sites combined, was < 0.1 for λ ww , β 0 and a (in SOX only), 0.2 for b 0 1 , and > 0.5 for other parameters (Fig. 5a). Entropy of the 200 parameter estimates per site ranged from 0 to 6 bits (Fig. 5b), which translates to the average number of binary questions that need to be asked to determine the parameter value at a site. The range of parameter estimates within each PFT was large (Fig. 6). The relatively uniform parameter distributions for CM, especially ψ 50 and a, indicated that the values were difficult to constrain and that, using CM, ∂Θ=∂T had a low sensitivity to its parameterization. Correlations between parameters were statistically significant (P < 0.01) for all optimality principles. Risks of equifinality appeared highest for SOX and CM, for which Spearman correlations between ψ 50 and other parameters were highest and ranged from 0.43 with b 0 1 to as high as −0.89 with K max .

Adequate parsimony
The AIC scores indicated that WUE (0.63) was more desirable than CM (0.47) and SOX (0.40) at a majority of sites in terms of prediction vs parsimony, especially in dry conditions (Figs 7a,  2c).

Predictive accuracy
Ranges of predictive performance diagnostics among the sites were similarly good for all model variants (Fig. 7b). WUE (0.73) and CM (0.62) scored highest, while SOX (0.15) was the statistically least desirable principle in terms of predictive performance (Fig. 2d). Within a site, predictive performance always decreased in dry conditions, compared to mesic and wet conditions, as reflected by a higher A p value. Missing information was at most 25% greater in dry vs wet conditions, for all sites and models combined. WUE was more robust in dry conditions than CM at a majority of sites (Fig.  S5). The increase in missing information resulting from the use of leave-one-site-out parameters instead of median sitespecific parameters was < 5%, for all sites and models combined (Fig. 7c).

Functional accuracy
Total information from θ and D about LE was overestimated (positive A f,T ) at most sites for all model variants; this overestimation was overall lower under wet conditions compared to mesic and dry conditions (Fig. 8a). Functional performance measured by information partitioning (Fig. 8b-e) indicated that for all optimality principles θ controls on LE were slightly underestimated ( ). We rank model variants within each site using full records and average the 30 site-specific (left) and the vegetation-specific site scores (small plots on the right) for each diagnostic and model variant. We provide details of individual diagnostic scores in each criterion, calculated for the full record, wet, mesic, and dry conditions, in Supporting Information Fig. S5.

Research
New Phytologist underestimated (negative A f,S ); and redundant contributions were accurately represented (A f,R close to zero). WUE (0.79) and CM (0.68) scored highest, while SOX (0.03) was the statistically least desirable principle in terms of functional performance (Fig. 2e).

Predictive and functional performance tradeoffs
Sites with high predictive performance generally also had the highest functional performance (Fig. 9). The fact that the relation between A p and A f,P (Fig. 9a) was stronger than that of A p and jA f ;T j ( Fig. 9b) indicated that A f ;T is insufficient to quantify functional performance. Predictive performance was higher when information from θ and D about LE in the models matched observed information partitions (U θ , U D , S and R) rather than total information amount (I ðθ, D;LEÞ). Tradeoffs between the accuracy of total information and its partitioning (Fig. 9c) indicated that total information accuracy can occur at the expense of more inaccurate model representations of θ and D controls on LE. For example, if a model underestimates U θ and overestimates U D by the same amount of information, performance based on A f,T can be close to zero (modeled and observed multi-variate mutual information match), while A f,P , the sum of partitioning components, will be high (modeled and observed relations between θ, D, and LE do not match). Performance was (b) (a) Fig. 4 Goodness-of-fit between observed and modeled latent heat flux (LE, W m −2 ) at an example site (US-Blo, a seasonally dry evergreen needleleaf forest (Goldstein et al., 2000)), using water use efficiency (WUE, blue), carbon maximization (CM, green), and xylem vulnerability (SOX, yellow) stomatal optimization principles (Table 1). (a) Observed (black squares) and modeled LE, averaged by day of year, using site-specific (colored circles) and leave-onesite-out (colored crosses) calibrated parameters. Soil moisture (θ, cm 3 cm −3 ), averaged by day of year in grey to illustrate seasonal soil-water availability.

Discussion
We evaluated three existing stomatal optimization principles beyond predictive performance using a novel data-driven multicriteria method. We also quantified the extent to which optimization principles were robustly parameterized and accurately captured stomatal response to environmental conditionsin other words, whether they were 'right for the right reasons'. We used a big-leaf framework, taking advantage of flux tower LE and GPP measurements to minimize model structural errors, and created a level playing field to focus on the comparison of stomatal optimization principles. We were thus able to attribute differences in performance metrics between model variants using WUE, CM, and SOX principles to the differences in their respective ∂Θ=∂T , as this was the only component of the comparison framework that changed in each variant. We used information theory to diagnose the multiple aspects of model performance, which proved a simple and effective technique to disentangle nonlinear ecohydrological relations and overcome challenges of ecosystemscale analysis when data and processes are uncertain. Our approach was thus able to rank the ecosystem-scale performance of selected optimality principles in terms of prediction, function and parameterization, and reveal which principle can be most structurally adequate for ecosystem-scale applications beyond the comparison framework.

Parameterization performance for WUE and SOX was superior to CM
Ranges of calibrated parameter values were larger within than between PFTs (Fig. 6), which is a common issue for ecosystem  Table 1. Boxes represent the interquartile range for the 30 study sites, whiskers represent the 5 th and 95 th percentiles, and white lines represent medians.
New Phytologist (2021) Liu et al., 2020b). Stomatal optimization models were nevertheless adequately generalized at the PFT level (Figs 4, 7c), suggesting that all optimization principles are robust and applicable beyond site-specific calibration. Ecosystem-scale parameters were, however, not equally well constrained. For all PFTs, WUE and SOX parameters were easier to calibrate, in relative terms, while CM parameters were more uncertain (Fig. 5). Hydraulic parameters were also more difficult to constrain than marginal water use efficiency in other ecosystem-scale model calibrations (Liu et al., 2020b). Model variants using literature values for a and ψ 50 instead of calibrated values reduced the uncertainty of parameterizations and lowered risks of equifinality; these variants were found to be less desirable using SOX and more desirable using CM in terms of prediction vs complexity tradeoffs; but the variants still did not outperform WUE overall (Fig. S6).
Calibrated WUE parameters generally reflected plant-level patterns (Manzoni et al., 2011;Lin et al., 2015): λ ww was higher for needleleaf than broadleaf trees and for grasses than crops; and marginal water use efficiency was more sensitive to decreasing soil water for needleleaf than for broadleaf trees, and for crops than for grasses (higher β). Calibrated CM parameters were less physiologically meaningful; CM had low sensitivity to a and ψ 50 , and consistency in best-fit b 0 1 and b 0 2 with previous leaf-level parameterizations was uncertain (Anderegg et al., 2018b). Calibrated b 0 1 had similar values to λ ww , and b 0 2 (the key parameter that distinguishes CM from WUE) often approached 0 and had similar variability to β 0 . This suggests that WUE is structurally more consistent with flux tower data than CM, and marginal xylem tension efficiency at ecosystem scales is not sensitive to plant water potential. Equifinality of CM parameters is consistent with leaf-level studies (Zenes et al., 2020). Calibrated values for a in SOX were generally consistent with previous ecosystem-scale applications (Eller et al., 2020) and within broad ranges of measured hydraulic traits (Manzoni et al., 2013a;Mencuccini et al., 2019b): calibrated a was higher (more nonlinear behavior) for needleleaf, closest to 1 for grasses, and overall was smaller than values fit to species-level vulnerability curves. K max was greater for broadleaf than needleleaf trees. Patterns of variation in calibrated K max and ψ 50 values were not always consistent with those of measured traits due to tradeoffs between parameters that lead to similar model fits. While the correlations between hydraulic parameters have ecological significance, equifinality is a challenge for determining generalized parameters for ecosystem-scale models. This equifinality points to a need to parameterize K max independently, for example using long-term climate data (Sabot et al., 2020).

Predictive and functional performance was high overall and inferior for SOX
Predictive and functional performance of all stomatal optimization principles was high (Figs 7, 8), which is expected as model  variants were derived from a common eco-evolutionary theory of optimal plant water-carbon tradeoffs and reflect coordination between stomatal behavior and plant hydraulic strategies (Wang et al., 2020). Functional performance indicated that models were generally overly deterministic, in particular in the sensitivity of LE to D, slightly underestimated θ controls on ∂Θ=∂T , and underestimated synergistic information from D and θ. The positive correlation between predictive and functional performance ( Fig. 9) indicated that optimality principles can generally be calibrated to ecosystem-scale observations at the level of PFT without overfitting parameters for highest prediction accuracy at the expense of poor functional accuracy.
Model performance at sites with herbaceous vegetation was higher, in terms of less missing information and more accurate functional representation of θ and D controls on LE, than at forested sites. Forests create and respond to more heterogeneous environmental conditions (soil moisture at different depths, sunlit vs shaded leaves, aerodynamic resistance, vapor pressure deficit within vs above the canopy), which cannot be represented in a big-leaf framework and can be a source of functional inaccuracy. Nevertheless, these errors affected all variants in the same way. Further comparison framework assumptions, including optimization timescale, evapotranspiration partitioning, leaf-to-canopy scaling, and photosynthesis (Figs S1-S4; Methods S2), do not  (Table 1) to estimate latent heat flux (LE). Functional performance is the relative difference between observed and modeled (a) total multi-variate mutual information from soil moisture (θ) and vapor pressure deficit (D) about LE (A f,T , bits bit −1 ) and its partitioning into (b) unique from soil moisture (A f,θ , bits bit −1 ), (c) unique from atmospheric demand (A f,D , bits bit −1 ), (d) synergistic (A f,S , bits bit −1 ), and (e) redundant (A f;R , bits bit −1 ) information. We define dry, mesic, and wet as data points below, between and above the interquartile range of the observed soil moisture for each site. Boxes represent the interquartile range for the 30 study sites, whiskers represent 5 th and 95 th percentiles, and white lines represent medians. (2021)

Research
New Phytologist affect ranks because inputs, uncertainties and errors were also shared by all model variants. We do not expect that model performance was affected by the use of predawn vs instantaneous water potential because predictive and functional accuracy were inferior when implementing canopy water potential estimates, and the use of predawn water potential led to consistent model ranking (Fig. S1). Additionally, there were no clear discrepancies in functional performance between broadleaf vs needleleaf trees (Fig. 9a), which generally have different stringencies of stomatal control (iso/anisohydry).
The least desirable principle in terms of ecosystem-scale predictive and functional performance was SOX (Fig. 2d,e), despite having a more physiologically grounded penalty function based only on measurable plant parameters (Wang et al., 2020). The SOX optimality principle resulted in the most overly deterministic model variant, which was the least functionally accurate in terms of information partitioning, and it was the only principle to consistently underestimate unique information from θ in dry conditions. Structural differences in SOX lead to lower ∂Θ=∂T values throughout the time series at a majority of sites compared to WUE, and these values were less consistent with observations. Equations for WUE and CM were structurally very similar and emerged as the more robust model variants at the flux tower scale. Predictive and functional performance was higher at more sites using WUE than CM (Fig. 2d,e), although they were not substantially different, in contrast to previous results based on leaflevel gas exchange measurements (Anderegg et al., 2018b;Wang et al., 2020).

Stomatal optimization based on WUE provided more information about evapotranspiration than CM or SOX
The ideal stomatal optimization principle should provide maximum information about ecosystem evapotranspiration and thus minimize uncertainty in both evapotranspiration estimates and model structure. The simplest optimization principle, WUE, was the most adequate for ecosystem applications (Fig. 2), with better-constrained parameterization that characterize realistic behavior of herbaceous and woody PFTs, and higher predictive and functional performance in reproducing the sensitivity of LE to both θ and D as well as the sensitivity of these interactions in different biomes and in wet to dry conditions. In addition, WUE was the principle with the best tradeoff between prediction and parsimony (lowest AIC) and was the most robust in dry conditions (Fig. S5).
Water penalty functions based on xylem vulnerability did not substantially improve predictive or functional accuracy compared to water use efficiency and are also potentially more difficult to parameterize and apply at ecosystem scales. WUE, despite lacking an explicit representation of plant hydraulics, provided physiologically meaningful parameterizations with lower uncertainty than marginal xylem tension efficiency in CM and higher predictive and functional performance than SOX. WUE, including the exponential dependency of ∂Θ=∂T on predawn ψ, is mathematically more consistent with maximization of cumulative carbon over a certain time period (Cowan, 1982), while CM and SOX assume instantaneous (c) Fig. 9 Relations and tradeoffs between performance diagnostics based on information theory. (a) Predictive accuracy, measured in terms of missing information about latent heat flux (LE) in model variants (A p , bits bit −1 ), vs functional accuracy, measured in terms of the accuracy of multivariate mutual information from soil moisture (θ) and vapor pressure deficit (D) about LE partitioned into unique, redundant and synergistic components ( Predictive accuracy (A P , bits bit −1 ) vs functional accuracy, measured in terms of the accuracy of total multi-variate mutual information from θ and D about LE ( A f;T , bits bit −1 ). (c) Tradeoff between the two functional accuracies (A f;T vs A f;P ). The absolute value of functional accuracy is taken because A f components can be positive or negative, where a value close to zero indicates the best match with observations. Markers represent the performance of a study site and are colored according to water use efficiency (WUE, blue circles), carbon maximization (CM, green squares), and xylem vulnerability (SOX, yellow triangles) stomatal optimization principles (Table 1). Small subplots on the right in (a) separate study sites by plant functional type.
Ó 2021 The Authors New Phytologist Ó 2021 New Phytologist Foundation New Phytologist (2021) 231: 586-600 www.newphytologist.com carbon maximization. The higher functional accuracy of WUE may indicate that the associated ∂Θ=∂T functional form represents some longer-term ecosystem optimization reflected in observations (Lu et al., 2020), which are not captured by CM and SOX. As such, stomatal optimization based on xylem vulnerability had limited utility in improving modeled interactions between θ, D, and LE at ecosystem scales. Nevertheless, explicit representation of xylem vulnerability provides information for other research goals, including modeling ecosystem adaptation and mortality (Sperry et al., 2019).

Insights for ecosystem-scale stomatal optimization modelling
Our results complement those of previous empirical, numerical, and theoretical analyses of stomatal optimization models, providing insight into which is most robust for ESMs. While all ∂Θ=∂T formulations performed well, our results raised caution towards formulations based on plant hydraulic traits vs water use efficiency in generalized and larger-scale estimation of evapotranspiration. The physiologically grounded functional relations based on measurable plant-scale parameters can lose their relevance at larger scales; thus, their utility in formulating water penalty functions for ESMs remains empirical and uncertain. Cowan & Farquhar's (1977) original optimality theory, in which ∂Θ=∂T should be solved dynamically from boundary conditions without fitting parameters (Cowan, 1982;Manzoni et al., 2013b;Lu et al., 2016;Mrad et al., 2019), is challenging to approximate into practical analytical equations for stomatal conductance . Hence, the ability of g s;opt to capture these long-term dynamics depends on the adequacy of the empirical dependence of ∂Θ=∂T on ψ. Our results highlight that this problem, at ecosystem scales, remains unsolved by water penalty functions, requiring an increasing number of parameters to characterize plant hydraulic strategies with increasing detail.
Occam's razor is particularly relevant to ecosystem-scale modelling of evapotranspiration, when parameters cannot be measured directly and may be less physically meaningful when aggregated at large scales. While there is a demand for increasingly complex models to assess climate change effects on Earth systems, parameterization and model evaluation is limited by a lack of adequate data (Feng, 2020;Paschalis et al., 2020). An adequate degree of parsimony is required to exploit satellite data and extract ecosystem-scale ecohydrological parameters through inverse modeling , which may provide further opportunities to parameterize stomatal optimization models at large scales and improve ESMs.
The fundamental optimality theory, whether based on water use efficiency or xylem vulnerability, is consistent with data at all scales. Yet, until our approximations (parametrizations) of the theory are accurately translated from leaf or plant to ecosystems, simpler formulations with equivalent performance are preferred according to the principle of Occam's razor.

Fig. S1
Performance is higher for model variants using predawn vs canopy water potential in the big-leaf comparison framework.

Fig. S2
Average soil-to-surface conductance ratios for the study sites.

Fig. S3
Canopy-to-leaf conductance ratios averaged by day of year at the study sites.

Fig. S4
Performance is higher for model variants using photosynthesis parameters at 25°C vs those with temperature dependency in the big-leaf comparison framework.  Methods S1 Data selection criteria. Methods S2 Photosynthesis model and comparison framework assumptions.

Methods S3 Information theory metrics.
Methods S4 Information partitioning.
Methods S5 Model performance diagnostics based on information theory.