One Stomatal Model to Rule Them All? Toward Improved Representation of Carbon and Water Exchange in Global Models

Stomatal conductance schemes that optimize with respect to photosynthetic and hydraulic functions have been proposed to address biases in land‐surface model (LSM) simulations during drought. However, systematic evaluations of both optimality‐based and alternative empirical formulations for coupling carbon and water fluxes are lacking. Here, we embed 12 empirical and optimization approaches within a LSM framework. We use theoretical model experiments to explore parameter identifiability and understand how model behaviors differ in response to abiotic changes. We also evaluate the models against leaf‐level observations of gas‐exchange and hydraulic variables, from xeric to wet forest/woody species spanning a mean annual precipitation range of 361–3,286 mm yr−1. We find that models differ in how easily parameterized they are, due to: (a) poorly constrained optimality criteria (i.e., resulting in multiple solutions), (b) low influence parameters, (c) sensitivities to environmental drivers. In both the idealized experiments and compared to observations, sensitivities to variability in environmental drivers do not agree among models. Marked differences arise in sensitivities to soil moisture (soil water potential) and vapor pressure deficit. For example, stomatal closure rates at high vapor pressure deficit range between −45% and +70% of those observed. Although over half the new generation of stomatal schemes perform to a similar standard compared to observations of leaf‐gas exchange, two models do so through large biases in simulated leaf water potential (up to 11 MPa). Our results provide guidance for LSM development, by highlighting key areas in need for additional experimentation and theory, and by constraining currently viable stomatal hypotheses.

If climate models under-estimate transpiration, they also likely misrepresent: (a) inter-annual variability in the terrestrial carbon cycle-which is primarily driven by water availability of the vegetation (Ahlstrom et al., 2015;Humphrey et al., 2018;Jung et al., 2017); (b) land-surface feedbacks to the boundary layer (Donat et al., 2018;Miralles et al., 2014) and the amplification of some climate extremes (Miralles et al., 2018;Yunusa et al., 2015); and (c) vegetation changes in water use under future atmospheric [CO 2 ] (De Kauwe et al., 2013;Mankin et al., 2019;Swann et al., 2016) and associated impacts on, for example, surface water storage (Trancoso et al., 2017;Ukkola, Prentice, et al., 2016). Resolving the knowledge gap around plant contributions to the carbon and water cycles and to local energy budgets is critical to resolving both long term (terrestrial carbon sink) and short term (amplification of extreme events) model uncertainties.
To address model shortcomings, development of stomatal theory, aided by field and experimental synthesis studies (Choat et al., 2012;Lin et al., 2015;Martin-StPaul et al., 2017a;Mencuccini et al., 2019;Miner et al., 2017), has recently become a major focus of global change ecophysiology. First, g s behavior under well-watered conditions has been shown to vary predictably by species , which can be used to parameterize stomatal conductance schemes on a plant functional type basis . Second, g s can be accurately simulated to decline with decreasing soil moisture empirically (Mäkelä, 1996;Sala & Tenhunen, 1996;Tuzet et al., 2003), or using measured vulnerability curves that represent the progressive impairment of water transport as plant hydraulic conductance is lost (Sperry et al., 2017;Wolf et al., 2016). Third, optimality principles positing that stomatal behavior should balance carbon uptake and water loss (Cowan & Farquhar, 1977) can be used to avoid prescribing the sensitivity of g s to moisture stress and/or atmospheric drivers like vapor pressure deficit Sperry et al., 2017;Wolf et al., 2016).
This lack of systematic multi-model evaluation could be due to difficulties ranging from accessing existing code, to coding new schemes into more sophisticated land surface models (LSMs) that account for energy balance and water-carbon feedbacks. In the first extensive comparison of g s optimization models,  proposed a unified mathematical framework relying on seven theoretical criteria to probe the merits of different optimization models. However, the authors did not assess whether, and to what extent, new optimality-based We aim to identify the g s schemes that improve predictability for use in LSMs (e.g., during periods of water stress, at high vapor pressure deficit) but, also, any aspects of those schemes that might limit current application.

Methods
We addressed our first two questions theoretically, by harmonizing 11 g s models to a single reference model, the Medlyn et al. (2011) model. The Medlyn model was chosen as a baseline expectation of model behavior because it is widely used in climate models (ACCESS-ESM1.3, Kala et al., 2015;CESM2, Lawrence et al., 2019), and it is well understood (i.e., extensively evaluated in the literature), so we can readily diagnose assumption-driven model-to-reference behavioral departures (De Kauwe et al., 2013;Medlyn et al., 2015). Our experimental setup parallels the "perfect model" framework used in climate predictability studies (Hawkins et al., 2011), except our reference model is not intended as a truth. We address our third question by applying all 12 models to observational records of leaf-level gas exchange from 15 tree species spanning a mean annual precipitation (MAP) range of 363-3,286 mm yr −1 . A visual summary of the model experiments (Sections 2.2 and 2.3) is presented in Figure  S1 of Supporting Information S1.
To distinguish between the empirical and optimization approaches, we refer to the former by author name and the latter by optimality criterion. In this study, the approaches are labeled "empirical" if they explicate a functional form for g s (i.e., the sensitivity of g s to both independent and dependent variables is assumed), whereas they are labeled "optimization approaches" if they maximize/minimize function such that the sensitivity of g s emerges from the optimization criteria. Equations 6 and 7 Duursma and Medlyn (2012) and Tuzet et al. (2003) Eller k max -Equations 8 and 9 Eller et al. (2020) Optimization Avoiding primary impairment to water flow: max(A n − CT)

Modeling Framework
We implemented the g s schemes within a simplified LSM  that broadly emulates the Community Atmosphere-Biosphere Land Exchange LSM (CABLEv2.0;. This framework embeds canopy micrometeorology and coupling between energy balance, transpiration, and photosynthesis. We standardized model responses by prescribing soil moisture, thereby removing the feedback between transpiration and root-zone soil moisture. We also opted to ignore leaf-to-canopy scaling, which allows a direct comparison to leaf-level observations, whilst avoiding LSM-specific scaling assumptions (Rogers et al., 2017).

Leaf Level Fluxes and Energy Balance
The net rate of carbon assimilation, A n (μmol m −2 s −1 ), is simulated using the  photosynthesis model (Text S1 in Supporting Information S1).
Leaf transpiration (E; mmol m −2 s −1 ) is assumed to meet the atmospheric demand for water vapor on an instantaneous basis: where g s is expressed on a H 2 O basis in mol m −2 s −1 , g b (mol m −2 s −1 ) is the leaf boundary layer conductance to water vapor, D l (kPa) is the leaf-to-air vapor pressure deficit, P atm (kPa) is atmospheric pressure, and 10 3 converts to mmol.
For the empirical schemes (3 models), leaf-level variables (e.g., leaf temperature) are obtained iteratively until the leaf energy balance converges. The optimization schemes (nine models) avoid iteration (except SOX opt ; see below) by using two-dimensional (water transport vs. photosynthetic) systems of energy-balanced equations to simultaneously optimize g s , [CO 2 ] in the leaf intercellular air spaces (C i ), and leaf water potential (Ψ l ). Note, this implementation difference does not impact the simulations. For more information, refer to the main text and methods S2 of Sabot et al. (2020).

Leaf Water Potential
Ψ l (MPa) is computed by substitution of Equation 1 and the steady-state supply of water from the roots to the leaves (Sperry & Love, 2015): where Ψ is the varying water potential between the root-zone soil water potential Ψ s (MPa) and Ψ l , and k Ψ (mmol m −2 s −1 MPa −1 ) is the associated hydraulic conductance. Ψ cannot drop below Ψ crit , a threshold indicative of critical xylem failure at k crit , and set to match a 95% loss of hydraulic conductivity (i.e., P 95 ). k Ψ is downregulated by a cumulative Weibull distribution representing plant vulnerability to cavitation Sperry et al., 2017): where k max (mmol m −2 s −1 MPa −1 ) is the maximum root-zone-to-leaf hydraulic conductance, and s 1 (MPa) and s 2 (unitless) are the sensitivity and shape of the vulnerability curve derived from measured percentage conductivity loss (Text S2 in Supporting Information S1).
In segmented representations of plant hydraulics, k max (and s 1 and s 2 ) may vary between the rhizosphere, roots, stem, etc. Parameterizing this segmentation presents a challenge for global models, as it requires additional inputs that are not readily available, so we combine all elements into a single hydraulic conductor connecting the rootzone to the leaves.
SABOT ET AL.

The Medlyn Model
The Medlyn g s model (Medlyn et al., 2011), reworked to account for water stress , is an analytical approximation of the water use efficiency hypothesis (cf., WUEH; Cowan & Farquhar, 1977): where C s (μmol mol −1 ) is [CO 2 ] at the leaf surface, 1.57 converts from conductance to CO 2 to conductance to water vapor, g 0 (mol m −2 s −1 ) is the residual conductance to water vapor (assumed negligible), g 1,Med (kPa 0.5 ) is the slope of the sensitivity of g s to A n , and β (unitless) is an empirical moisture stress factor represented by an exponential dependency on Ψ l at predawn (i.e., Ψ l,pd ;Cowan, 1982;Yang et al., 2019;Zhou et al., 2013): where s Med (MPa −1 ) sets the sensitivity of the stomates to Ψ l,pd , Ψ fc is the root-zone water potential at field capacity, and with Ψ l,pd ∼ Ψ s assuming no night-time transpiration.
The choice to attenuate either the ratio of A n : g s (i.e., the "intrinsic" water use efficiency) or E during periods of water stress follows several LSMs, but we acknowledge that alternative approaches exist (cf., Kennedy et al., 2019). Tuzet et al. (2003) empirically linked g s and Ψ l , which Duursma and Medlyn (2012) later simplified to:

The Tuzet Model
where g 1,Tuz (unitless) is an empirical slope parameter and ζ (unitless) characterizes stomatal responses to Ψ l : where Ψ ref (MPa) is a reference water potential and s Tuz (MPa −1 ) is the sensitivity of the stomates to variations in Ψ l . Eller et al. (2020) proposed an analytical approximation to the stomatal optimization based on xylem hydraulics (cf., the SOX opt model below), establishing an empirical link between g s and plant hydraulic vulnerability:

The Eller Model
where C i,col (μmol mol −1 ) is the co-limitation C i -for which biochemical rates of assimilation are equal-and ξ (mol m −2 s −1 ) is a reduction factor arising from xylem conductance loss: where k Ψ is given by Equation 3, P 50 (MPa) is the water potential drop at 50% loss of xylem hydraulic conductivity, 10 3 converts to mmol, and Ψ x,pd is leaf xylem water pressure at predawn, with Ψ x,pd ∼ Ψ s −10 −6 ρgh following Archimede's principle, where ρ (kg m −2 ) is the density of water, g (m s −2 ) is gravitational acceleration, h (m) is canopy height, and 10 −6 converts to MPa.

Optimization Schemes (Nine Models)
The optimization models hypothesize that leaf gas exchange downregulation originates from instantaneous g s controls to avoid primary impairments to water flow (Sperry et al., 2017;Wolf et al., 2016) or assimilation (Eller et al., 2018;, or that it originates from non-stomatal limitations on photosynthetic function induced by water stress (Dewar et al., 2017). Critically, whether impairment is defined as a function of a dynamic variable or not (e.g., Ψ l vs. k Ψ ) and whether it is expressed as a reduction factor (multiplied) or a cost term (subtracted) may affect model responses to changes in the environment.
In the optimization schemes, C i is obtained by solving a system comprising A n from the biochemical photosynthetic model (Text S1 in Supporting Information S1) and the diffusive supply of CO 2 : where C a (μmol mol −1 ) is ambient [CO 2 ], and 1.35 converts from boundary layer conductance to CO 2 to boundary layer conductance to water vapor.
Note, our operational implementation uses the optimization criteria forms (as below) instead of the derivative forms ( Figure S2 and Text S8 in Supporting Information S1), except for the CMax model (see below).

Stomatal Optimizations Avoiding Primary Impairment to Water Flow (Four Models)
The

WUE-f Ψl Model
The long-standing water use efficiency hypothesis (WUEH; Cowan & Farquhar, 1977) posits that plants adjust their carbon uptake given a "carbon cost of water loss" (λ; μmol CO 2 mmol −1 H 2 O), by regulation of their stomates to maximize: However, the WUEH assumes λ is constant over time and fails to describe how Equation 11 changes on the timescale over which soil water changes (Manzoni et al., 2011;Wong et al., 1985).
To address this issue, Wolf et al. (2016) linked the WUEH (Equation 11) with hydraulic function through Ψ l , by equating Equations 1 and 2 and Equation 10 with the biochemical photosynthetic assimilation rates (Text S1 in Supporting Information S1), so leaf gas exchange is downregulated as E declines through falling Ψ s and Ψ l (Equation 2). This proposition keeps λ constant but a priori unspecified, and we henceforth refer to it as the WUE-Ψ model. Wolf et al. (2016) also cast the carbon maximization hypothesis (CMax) as an alternative to the WUEH, replacing λE with a concave-up parabola that increases as Ψ l drops :

The CMax Model
where a (μmol m −2 s −1 MPa −2 ) and b (μmol m −2 s −1 MPa −1 ) and c (μmol m −2 s −1 ) are a priori unknown coefficients of a parabola but knowing c is unnecessary, as the maximization criterion can be satisfied by minimizing the derivative of Equation 12 with respect to Ψ l : a can be interpreted as the carbon cost of a falling Ψ l , whilst |b| sets the minimum cost.

The ProfitMax Model
To avoid the need for unspecified parameters, Sperry et al. (2017) maximized profit (ProfitMax), that is, the net difference between relative carbon gain and relative hydraulic cost: where A n,max (μmol m −2 s −1 ) is the instantaneous maximum A n from Equation 10 over the range of possible Ψ l , g s , and C i .  maximized plant net carbon gain (CGain) given a "carbon cost per recovered unit of xylem conductance" (ϖ; μmol m −2 s −1 ). The CGain hypothesis can account for lagged xylem recovery from embolism but we assume that refilling cavitated conduits comes at no extra cost (Text S4 in Supporting Information S1), to remain consistent across models:

The CGain Model
Here, ϖ is constant in time and a priori unspecified, akin to λ in the WUEH.

The SOX opt Model
Inspired by the ProfitMax model, the stomatal optimization based on xylem hydraulics (SOX opt ; Eller et al., 2018) maximizes carbon uptake whilst A n is directly impaired by a hydraulic reduction factor. This scheme does not rescale hydraulic cost as Ψ s drops (i.e., k max is not evaluated at Ψ s , unlike Ψ in ProfitMax), which removes the need for normalization of A n : Eller et al. (2018) made SOX opt 's system of equations one-dimensional (instead of two-dimensional as in other optimization schemes), by setting the leaf temperature, directly linking Ψ l to A n , C i , g s , and g b through Equations 1, 2, and 10. In this implementation, SOX opt thus iterates leaf temperature until the leaf energy balance converges.

The ProfitMax2 Model
Instead of a conductance reduction factor,  used a transpiration-based factor to increasingly impair A as E increases. The optimization criterion weights carbon uptake by proximity to hydraulic failure, maximizing carbon profits (ProfitMax2): where E crit (mmol m −2 s −1 ) is calculated from Equation 2 and Ψ crit triggers a 95% loss of hydraulic conductivity, with Ψ crit < Ψ < Ψ s .

The LeastCost Model
The least cost hypothesis (LeastCost;  posits that plants minimize the costs of maintaining both transpiration and carboxylation capacity, which is equivalent to maximizing A n reduced by a factor of the maintenance costs: where η (μmol CO 2 mmol −1 H 2 O) is the carbon cost of photosynthetic proteins maintenance through transpiration, and V cmax (μmol m −2 s −1 ) is the maximum rate of carboxylation.
As in the WUE-Ψ model, leaf gas exchange is downregulated as E declines through falling Ψ s and Ψ l (Equation 2) and we assume η (a priori unspecified) constant over time.

Non-Stomatal Optimizations (The CAP and MES Models)
Non-stomatal limitations assume that gas exchange downregulation arise outside of the stomates, from biochemical limitations to photosynthesis or diffusion limitations into the mesophyll cells. Dewar et al. (2017) proposed an optimality framework where hydraulic stress curtails either carboxylation capacity (CAP) or mesophyll conductance (MES), so carbon uptake is maximized at an optimal Ψ l .
Hydraulic stress is represented through a unitless linear factor: where Ψ φ,lim (MPa) is an a priori unknown leaf water potential under which no photosynthesis can occur.
In CAP, hydraulic stress directly reduces V cmax and J max (μmol m −2 s −1 ), the maximum rate of electron-transport, affecting the biochemical photosynthetic assimilation rates (Equations S1.2 and S1.4 in Supporting Information S1): Instead, MES assumes that downregulation of mesophyll conductance from hydraulic stress yields inequity between [CO 2 ] in the chloroplast and C i , so the assimilation rates (Equations S1.2 and S1.4 in Supporting Information S1) account for: where Γ * (μmol mol −1 ) is the photorespiration compensation point.

Theoretical Model Comparisons
To perform controlled model comparisons and experiments, we used a weather generator and created 4 weeks of synthetic half-hourly atmospheric forcing during the northern hemisphere's growing season (Figures S3a-S3c in Supporting Information S1; see code). Two synthetic soil moisture profiles were also created ( Figure S3d in Supporting Information S1) to contrast well-watered conditions with moderately stressed conditions. Timeseries of g s associated to the synthetic atmospheric forcing were simulated using the Medlyn model ( Figure S4 in Supporting Information S1), depending on soil moisture, and with g b set to infinity (i.e., assuming perfect leaf-atmosphere coupling).
The other g s models were calibrated to this idealized reference g s timeseries only ("Idealized calibrations") before we perturbed the environmental variables to compare model-specific sensitivities ("Sensitivity analysis"). Calibrating to g s only (here and in the "Evaluation against observations"), follows standard LSM practice as these models then internally scale fluxes from the leaf to the canopy Fisher & Koven, 2020;Franks et al., 2017). Importantly, this approach does not preclude evaluating how representations of g s impact the prediction of other fluxes, like A n and E.
The model experiments aim to isolate model differences, so driving variables do not always co-vary realistically, for example, air temperature (T a ) and atmospheric vapor pressure deficit (D a ) are coupled, but photosynthetic photon flux density (PPFD) and soil moisture are not. Table 2 summarizes the ranges of environmental conditions used. We assumed a canopy height of 20 m (this information is only used by the Eller model) and set the hydraulic vulnerability (P 50 = −3.1 MPa, P 88 = −4.9 MPa; Maherali et al., 2006) and photosynthetic (V cmax,25 = 35 μmol m −2 s −1 , J max,25 = 58.5 μmol m −2 s −1 ; Ellsworth, 2000) traits and g 1,Med (2.5 kPa 0.5 ; Lin et al., 2015) to those of Pinus taeda. s Med was set to 2 MPa −1 . Tables S1 and S2 in Supporting Information S1 list all other model parameters.

Calibrated Parameters
Within the LSM framework, we calibrated a maximum of two parameters per scheme (Table 1; see Table S3 in Supporting Information S1 for the sampled parameter spaces) to the reference g s , which avoids overfitting and limits autocorrelation. Apart from the reference g s model (Medlyn), all the other g s schemes considered make use of a k max parameter (Equation 3). Although k max should be directly measurable (Mencuccini et al., 2019;see De Kauwe et al. [2020] for an application within a LSM), we chose to calibrate it because models might assume additional/different contributions to k max related to different sensitivities to environmental drivers (cf., . We assumed that the optimization schemes that avoid primary impairment to water flow (WUE-Ψ , CMax, ProfitMax, and CGain) would all share similar k max , owing to their optimality criteria of the form max(A n −CT) that implies cost terms (CT; proportional to k max ) comparable to A n in magnitude. Thus, upon calibrating the WUE-Ψ , CMax, and CGain models, we set k max to the value obtained for the ProfitMax model (Equation 14). Note, also, that these three models' additional cost parameters (e.g., λ in WUE-Ψ ) buffer behavioral departure from the ProfitMax's k max . The Tuzet model adopted the ProfitMax k max here too (but not in the "Evaluation against observations", see below), as the three-parameter combination of k max − g 1,Tuz − Ψ ref is equifinal.

Calibration Method
To answer "how well constrained are g s scheme parameterizations?," we calibrated each scheme via seven non-linear least-square minimizers from the LMFIT python package (Newville et al., 2019), ranging from local solvers to global stochastic minimizers capable of handling complex multimodal objective functions (Table S4 in Supporting Information S1). Using a suite of solvers, instead of a single calibration method, avoids favoring specific model objective functions.
We compared calibrations under well-watered conditions ("wet" soil moisture profile) with moderately stressed conditions ("stressed" soil moisture profile); this allows us to test whether parameter predictability and model agreement differ with soil moisture availability. To further characterize solver skill and parameter predictability, we performed "wet" versus "stressed" calibrations on three randomly selected 7-day-long subsets within the weather-generated forcing.
Within data subsets, solver skill was quantified through the Bayesian information criterion (BIC) and given the proximity between the parameter value(s) estimated by each solver and the median parameter value(s) across  solutions. The best parameterizations (see Figure 1), determined by the most skilled solvers for the complete forcing timeseries, were then used to run idealized simulations (Appendix A and "Sensitivity analysis").

Sensitivity Analysis
To understand the physiological responses produced by the models-that is, to answer: "do model behaviors differ in response to abiotic variations?"-we carried out a variance-based global sensitivity analysis. We used Sobol's method (Sobol', 2001; see below), which relies on a probabilistic framework to quantify the sensitivity of a model's outputs over the full domain of variation of its inputs and is particularly robust when dealing with non-linear model responses (Marzban, 2013). Here, the method probes to what degree variability in key environmental variables (PPFD, T a , D a , C a , and Ψ s ) affects modeled g s , Ψ l , and C i .
The models were parameterized using the best parameterizations from the "wet" calibrations and forced with 1,008,000 combination sets of PPFD, T a , D a , C a , and Ψ s . The Saltelli cross-sampling method (Saltelli et al., 2007) implemented in the SALib python package (Herman & Usher, 2017) was used to generate forcing samples as uniformly distributed as possible (Table 2). Sobol' sensitivity indices (Saltelli et al., 2010;Sobol', 2001) were calculated for g s , Ψ l , and C i . Variance in modeled variables is estimated via the quasi-Monte Carlo method before being apportioned to each individual driver (first-order index, S1), to synergies between pairs of drivers (second-order index, S2), or to each driver plus all their synergic terms with other drivers (total-order index, ST).
For example, if g s ' S1 by T a is 30%, its S1 by D a 40%, and its S2 by T a -D a 20%, then 90% of g s variability is caused by variability in T a and D a and their direct interactions. Inaccuracy in these two drivers will be responsible for most of the uncertainty in simulated g s . By contrast, the ST is a summary measure of the overall contributions of one driver to the variation in a modeled variable. For instance, if T a interacts with other drivers than D a (if g s ' S2 by T a -Ψ s is 5%), then besides the first-and second-order contributions, the ST by T a will also account for third-order contributions of T a (T a -D a -Ψ s synergies) to the total variability in g s .

Evaluation Against Observations
To answer our third question, we evaluated the models against leaf-level observations originating from Australia, France, Panama, and the USA. The original data sets include field observations for 26 sites × species combinations , as well as measurements for four species grown in a common garden experiment . After processing the observations (Text S5 in Supporting Information S1), measured g s , E, and A n were available for 16 sites × species, Ψ l for 15 sites × species, and C i for 12 sites × species. Observed PPFD, T a , D a , C a , g b , and Ψ l,pd (used as a proxy for Ψ s ) were also included as inputs to drive the g s schemes. All sites × species specifics are listed in Table 3 and Table S5 in Supporting Information S1.
We used the four most skilled solvers from our theoretical "Idealized calibrations" ( Figure S5 in Supporting Information S1) to calibrate the 12 schemes to the observations of g s , for each individual site × species. These calibrations were performed within the LSM framework, using measured leaf temperature instead of T a to avoid energy-balance assumptions at the calibration stage.
Calibrating the Tuzet model, however, required a different approach than that used for the idealized calibrations, as not only was there no unique link between g s and Ψ l over all environmental conditions, but the observations were also inconsistently distributed in time, with variable tree age and size classes sampled. The Ψ ref and s TUZ parameters (Equation 7) were directly fitted from the observed g s and Ψ l (Text S6, Figures S6a-S6o in Supporting Information S1), and Ψ ref and s TUZ were then pre-set upon calibration of both g 1,TUZ and k max within the model framework. Unlike in the "Idealized calibrations," calibration of a k max specific to the Tuzet model is needed because Ψ ref and s TUZ were not calibrated within the model framework.
Predictive performance was appraised using four metrics (Text S7 in Supporting Information S1). Pearson's correlation coefficient (r) evaluates the temporal relationship between modeled and observed variables. The Nash-Sutcliffe Efficiency index (NSE) measures a model's ability to outperform the average observation. The Mean Absolute Scaled Error (MASE) gauges the accuracy of a model's dynamics, by assessing whether its predictions are more skillful than preceding observations. Finally, the ranked Bayesian Information Criterion (rBIC) provides information on parsimonious accuracy within each site × species.
SABOT ET AL.

Results
We first evaluate our idealized model calibrations, then explore theoretical model sensitivities, and finally examine whether and/or which theoretical findings are supported by observations.

How Well Constrained
Are g s Scheme Parameterizations? Figure 1 shows the spread in parameter values calibrated under "wet" and "stressed" soil moisture conditions, normalized by the median parameter value within each calibration subset. Where the numerical solvers converge, the parameters are well defined and the plots display a single mode (e.g., ProfitMax, SOX opt ), typically within ±10% of the median parameter value, and with standard errors <10%. Where the solvers do not converge, there can be large spreads in the distributions: these reveal multimodal problems (CAP, MES), low influence parameters (b in CMax), and/or recurring failure from a solver (e.g., LeastCost, WUE-Ψ ) or a class of solvers (e.g., stuck in local minimums).
Regardless of soil moisture conditions, parameters that set the magnitude of g s and E (e.g., k max in ProfitMax, SOX opt , ProfitMax2, Eller; g 1,Tuz in Tuzet) are identifiable across solvers, with few outliers to none. In CAP and MES, though, k max is not readily constrained, perhaps because of empirical dependencies (see below) and because reductions in g s and E are not directly driven by stomatal responses.
The Tuzet model exemplifies a set of well-constrained empirical parameters. Contingent upon a careful solver selection (cf., yellow hatched overlay), the optimization schemes' carbon costs (ϖ in CGain, η in LeastCost, λ in WUE-Ψ , a in CMax) are also among empirical parameters that can be constrained adequately. By contrast, parameters that empirically link to Ψ l in optimization schemes-Ψ φ,lim in CAP and MES, and b (but not a) in CMax-are difficult to calibrate, and range between >0.05-20 times the median parameter values.
To address this issue, CMax's b could be fixed or omitted, given its little influence (Zenes et al., 2020). However, this cannot be advocated for CAP and MES whose parameter pairs often show equifinality and high correlation (from covariance matrices: c. 0.8 in >¾ of CAP's parameter estimates and c. 0.7 in ½ of MES"), thus representing trade-offs in optimization strategies (Appendix A; cf., Figure A1 vs. A2).
We gain two important insights from these results. First, combining optimality principles with empirically defined functions risks reducing predictable behavior by adding degrees of freedom. Second, numerical solver choice matters to the calibration of multi-parameter optimization models. Adopting global or Bayesian solvers (as opposed to commonly used local solvers) would be preferable in those cases. Figure 2 shows the (mostly out-of-sample) influence of atmospheric drivers and soil moisture (Ψ s ) on g s , Ψ l , and C i given by total-order Sobol' sensitivity indices. The total-order Sobol' sensitivity indices align with the first-order indices ( Figure S7 in Supporting Information S1) and give a fuller picture of model variance. Parameter names appear to the right of the half violin plots and corresponding model names to their left; the models appear from the least (bottom) to the most (top) uncertain in their parameter estimates. Each half violin plot comprises 28 parameter estimates, obtained by calibration to four synthetic data sets (one 4-week data set plus three 7-day data sets) using seven calibration methods. Parameter values are normalized to the median of each calibration data set. Hatched overlays highlight the parameters from the three most skilled calibration methods for each model, when those successfully constrain estimates. The best parameter estimates among those calibrated on the 4-week data set are represented by a star. Note, the scale is non-linear.

Do Model Behaviors Differ in Response to Abiotic Variations?
SABOT ET AL.

10.1029/2021MS002761
13 of 30 In Figure 2 and Figure S7 in Supporting Information S1, where the polynomials have a limited coverage over a given axis, variance is low because model responses are highly deterministic. For example, model behaviors are largely unaffected by variability in PPFD (S1 < 0.04, ST < 0.08), which only yields variability in C i , g s , and Ψ l when photosynthesis is limited by ribulose-1,5-bisphosphate regeneration (c. 23% of occurrences, on average across models; see Text S1 in Supporting Information S1 for the biochemical limitation rates of assimilation). Even under these conditions, model responses to PPFD are monotonic (Equations S1.4 and S1.5 in Supporting Information S1) and generate little feedback, except immediately near the CO 2 compensation point and the co-limitation point where biochemical rates of A n are equal.
Inclusion of plant hydraulics (i.e., excluding the Medlyn model) confers the models the following features: (a) variability in g s is primarily driven by variability in D a , with smaller contributions from T a , Ψ s , and C a that vary by model; and (b) variability in C a , and to a lesser extent in D a and T a , induces variability in C i . By contrast, using a β downregulation factor in Medlyn leads to Ψ s having a major impact on both g s and C i .
The picture is more contrasted when looking at the Ψ l responses (cf., Appendix A). Eller, CMax, and LeastCost are mostly unaffected by variability in Ψ s but quite sensitive to variability in D a and T a , and CMax and LeastCost are even affected by variability in C a . Other models are sensitive to Ψ s in three ways: (a) responses to variability in Ψ s are smaller than to variability in D a and/or T a (Tuzet, WUE-Ψ ); (b) sensitivity to Ψ s either slightly dominates Figure 2. Total-order Sobol' sensitivity indices of the stomatal conductance (g s ), leaf water potential (Ψ l ), and CO 2 concentration in the leaf intercellular air spaces (C i ) to variability in environmental drivers for the 12 models parameterized under well-watered conditions. The environmental drivers are: atmospheric vapor pressure deficit (D a ); soil water potential (Ψ s ); ambient CO 2 concentration (C a ); photosynthetic photon flux density; and air temperature (T a ). The concentric circles mark 0.25 increments on a scale of 0-0.75, with 0 signifying no influence and 0.75 high influence; the axes extend up to 1.
14 of 30 or is on par with sensitivity to other drivers (ProfitMax, CGain, SOX opt ); and (c) variability is primarily driven by Ψ s (ProfitMax2, CAP, MES). We expect that the latter scenario would become prevalent if Ψ s were dropping further than the range tested here, and it should happen sooner for scenario (b) than scenario (a).
Sensitivity to Ψ s appears as the root-cause for inter-model differences in exerted hydraulic controls (i.e., high sensitivity to Ψ s signals tight hydraulic controls, and vice versa; see Appendix A). We also note that contrasting early morning/evening versus afternoon g s (Appendix A) most likely arises from high sensitivity to D a (and T a ), with little contribution from PPFD.

Predictive Performance
We used four metrics to assess the modeled variables against observations. Pooling all sites × species, the Profit-Max model reproduces observed behaviors better than observational averages (highest overall NSE, Figure 3a), possesses accurate dynamics (lowest overall MASE, Figure 3b), and is highly parsimonious (lowest overall rBIC, Figure 3d), whilst the MES model shows high correlation with the observations (highest overall r, Figure 3c). It is worth noting that the demanding MASE benchmark of 1-that marks one-step forecasts of the observationsis met by four models (MES, LeastCost, SOX opt , ProfitMax) for g s across sites × species. Additionally, half the new models (Eller, ProfitMax, SOX opt  Given the models were calibrated to observations of g s only, evaluating them for their ability to predict other variables degrades performance, even though some individual site × species are highly predictable (e.g., Figure S8 in Supporting Information S1; cf., Figure S9 in Supporting Information S1). Performance degradation can be stark in the tropics, where the models are less skilled at matching the g s behavior than in the extra-tropics ( Figure 4). Finally, among variables, Ψ l is responsible for the largest performance degradation, particularly for the Eller and LeastCost models, which we examine below.

Functional Relationships
Across forest types, Ψ l estimates are mostly within observational ranges ( Figure 5), but not for the Eller and LeastCost models (and WUE-Ψ , to a degree) which simulate unrealistically low Ψ l as ecosystems get drier, confirming our theoretical expectation of weak hydraulic controls in these models. Other models are either prone to overestimation of Ψ l at the wetter sites (Tuzet,ProfitMax2,CAP,MES) or to underestimation at the xeric sites (CMax, CGain, ProfitMax, SOX opt ). Ψ l bias directly follows from bias in the parameterization of k max ( Figure S6p in Supporting Information S1), so WUE-Ψ , CMax, and CGain share ProfitMax's bias direction (i.e., they use its k max and are also biased low at the xeric sites), with WUE-Ψ and CMax also displaying more variable errors (Figures 5d and 5e).
Predictions of E and A n by sites × species generally fall within observational ranges (Figure 6 and Figure S10 in Supporting Information S1), but the models fail to vary sufficiently, or accurately enough, throughout the whole range of E and A n . Indeed, each model occupies its own intrinsic WUE space, depending on its ability to regulate gas exchange with D a for example (Medlyn, Tuzet, WUE-Ψ , CMax and CGain do not capture the behavior at high D a in Figure 6d). Therefore, individual models can be biased with regards to E, A n , or both, and may keep one of the two variables within a narrow range whilst mostly varying the other. Although a proportion of the biases can be attributed to the forcing (at the Many Peaks Range, the forcing originates from the nearest weather station), to heterogeneity in the data sets, or to uncertain parameterization (not allowing seasonal variations in V cmax,25 may cause A n estimates to be concentrated around the median observation in Figure 6a  ; leaf morphology and flow rate could affect E), discerning the causes of model-specific biases requires examining the co-variation of g s , C i and Ψ l .
The models generally agree on the magnitude of C i per site × species (Figure 7 and Figure S11 in Supporting Information S1), noting that the models that simulate relatively lower C i simulate marginally lower A n (Eller SABOT ET AL.

10.1029/2021MS002761
15 of 30 compared to ProfitMax2 in Figures 6a and 7a), and vice versa. Similarly, relatively lower E is associated with relatively lower g s (LeastCost compared to Tuzet in Figures 6a and 7a), or vice versa. Whereas other models tend toward similar magnitude overestimation of g s and C i , or higher g s than C i , CAP and MES overestimate C i more than they do g s , in fact largely so given their estimates of A n (Figures 6 and 7; Figures S10 and S11 in Supporting Information S1; see also Appendix A). There are four sets of characteristics that describe g s to Ψ l relationships: (a) sensitivities and ranges are consistent with the observations (CAP, MES in Figure 7d and Figures S12l-S12o in Supporting Information S1); (b) sensitivities are consistent with the observations but Ψ l are offset (ProfitMax, SOX opt in Figure 7d; Figures S12l, S12m, S12o, and S12p in Supporting Information S1); (c) models have the wrong sensitivity (ProfitMax2 in Figure 7d; Figures S12l, S12m, S12o, S12p in Supporting Information S1); (d) models have the wrong sensitivity and are offset (LeastCost in Figures 7c and 7d and Figure S12 in Supporting Information S1). Moreover, for WUE-Ψ and particularly CMax, g s (Ψ l ) can split into "branches" at low g s (below c. 0.5 × g s,norm in Figure 7d; Figures S12e, S12o, and S12p in Supporting Information S1), which suggests diverging g s closure at low Ψ s versus high D a .
Average g s sensitivities to D a are within observational ranges across models and sites × species (Figure 8a). At low D a , there is a relatively large spread in model behaviors, as per our idealized findings (Appendix A). At high D a , schemes start to diverge from the average observations between 4 and 5 kPa (but CMax diverges from the observations over the whole range of D a ). Patterns of stomatal closure (Figure 8b) reveal that the modeled rates of g s closure are between thrice faster and six times slower than observed. Three models agree with the observed patterns of stomatal closure (CGain, SOX opt , ProfitMax), whilst one closes prematurely at high D a (CMax, with g s closure c. 45% too high), and the remaining (majority) are prone to largely delayed closure (the stomates are c. 25%-75% too open). Consistency in simulation of g s (Ψ l ) compared to the observations (i.e., accurate functional shape and variability, regardless of bias) appears key to representing stomatal closure as atmospheric aridity increases. For example, CGain, SOX opt , and ProfitMax are characterized by greater Ψ l biases than the Tuzet and ProfitMax2 models in Figures 5d and 5e, yet their functional forms better match the observed g s (Ψ l ) ( Figure S12 in Supporting Information S1). This implies that Tuzet and ProfitMax2 might be compensating for an overestimated sensitivity to soil moisture by underestimation of the sensitivity to D a . By contrast, CAP and SABOT ET AL.

10.1029/2021MS002761
18 of 30 MES's g s (Ψ l ) are consistent with the observations, so it is likely they keep g s high at high D a to prevent declines in A n in the absence of water stress.
Our "Evaluation against observations" highlights differences in internal trade-offs. For the non-stomatal models, compensation between overestimated C i and underestimated A n can result in high predictive performance. More generally, misrepresentation of the relationship between g s and Ψ l does not always degrade the simulation of gas exchange (LeastCost, Eller); however, it may lead to discrepant responses at high D a .

Discussion
Climate models routinely simulate responses to a climate space without a historical analog (Reu et al., 2014), yet the LSMs included in climate models markedly diverge from observations as soil water availability becomes limiting (Martínez-de la Torre et al., 2019;Ukkola, De Kauwe, et al., 2016). This divergence is, in part, associated with uncertainties in soil hydraulic processes (Van Looy et al., 2017) and processes governing vegetation water stress . Stomatal optimization approaches that account for plant hydraulic functions hold promise for LSMs (Eller et al., 2020;Sabot et al., 2020) because they incorporate more predictive capacity than empirical model parameterizations based on historical behaviors and are simpler to parameterize than detailed representations of plant hydraulics (De Kauwe et al., 2020; Kennedy et al., 2019). In this study, we reviewed the assumptions and mechanisms underpinning empirical and (non-)stomatal optimization hypotheses, with a view to guiding model development. Our theoretical results were determined using the Medlyn g s model as reference, but we expect them to be robust to alternative empirical reference choices (e.g., Ball et al., 1987;Leuning, 1995), with subtle differences in the sensitivity to D a (Medlyn et al., 2011).

Limits to Empirical Approaches
An argument can be made in favor of calibrating soil moisture stress functions in state-of-the-art LSMs, as opposed to changing their complex structure (Leplastrier, 2002;Raupach & Finnigan, 1988). Here, using an exponential dependency on Ψ l,pd (or Ψ s ) to down-regulate g s yielded reasonable skill when the Medlyn et al. (2011) model was evaluated against observations. Nonetheless, empirical moisture stress functions can hamper g s model performance (see notes S2 and Figure S6 of Sabot et al., 2020) or be implausible when applied outside their calibration sample (Verhoef & Egea, 2014). Finally, as the Medlyn g s model estimates are not contingent upon estimating Ψ l , its prognostic use beyond the simulation of gas exchange is limited.
An alternative approach to using moisture stress factors is to make g s an empirical function of leaf water potential, where Ψ l is obtained by accounting for plant hydraulics (Tuzet et al., 2003). The problem with this approach is Figure 7. Functional relationships between stomatal conductance (g s ), CO 2 concentration in the leaf intercellular air spaces (C i ), and leaf water potential (Ψ l ) predicted by 12 models at a wet (panels [a and c]) and a dry (panels [b and d]) site × species. Panels (a and b) show the encircled interquartile ranges of simulated to observed g s ratios against simulated to observed C i ratios, such that a "perfect" model would be concentrated at the intersection of the 1:1 lines. Panels (c and d) show the modeled decline in g s with decreasing Ψ l , fitted via a generalized additive model, compared to the observations (light gray crosses). The functional forms were made comparable by normalizing g s by its model-specific maximum per site × species, and Ψ l by the critical leaf water potential indicative of total xylem failure (P 95 in this study). Curves were not fitted if they did not monotonically decrease (e.g., WUE-Ψ and CMax in panel [d]), or where the models operate at, or beyond, the P 95 (e.g., the Eller model in panel [d]).
SABOT ET AL. Recently, Eller et al. (2020) proposed an empirical g s model based on xylem hydraulics which approximates an optimality criterion (i.e., SOX opt ; Eller et al., 2018). Here, the Eller model showed high skill in the simulation of g s and other leaf gas-exchange variables ( Figure 3) but produced unrealistically low Ψ l ( Figure 5). To approximate an analytical solution to SOX opt , Eller et al. (2020) replaced SOX opt 's dependency on Ψ l with a numerically simpler dependency on 0.5(Ψ l,pd + Ψ l ), incidentally weakening its hydraulic reduction factor and impacting its analytical solution. Therefore, to achieve similar Ψ l as predicted by SOX opt , all of the Eller model's plant hydraulic parameters (not only k max , but also the vulnerability curve traits) need to be refitted (see Figure S3 of Eller et al. [2020]). At Puéchabon (a site also considered in this study), parameterizing the Eller model within the Joint UK LSM (Clark et al., 2011) resulted in a P 50 of −1.8 MPa ("BET-Te" in Table 2 of Eller et al. [2020]), whereas measured P 50 is between −7 and −4.5 MPa for Quercus ilex (Martin-StPaul et al., 2014, 2017b. Besides, the Eller model was derived from SOX opt assuming a different representation of plant vulnerability to cavitation than that given by Equation 3 (cf., Manzoni et al., 2013), which could have aggravated its tendency to underestimate Ψ l in our modeling framework.

Added Value and Potential Drawbacks of Optimal Schemes
Optimization approaches avoid reliance on empirical corrections to gas-exchange as water supply becomes limited, instead relying on measurable hydraulic traits (ProfitMax, SOX opt , ProfitMax2), or traits plus parameters specific to the optimization hypothesis (WUE-Ψ , CMax, CGain, LeastCost, CAP, and MES). We found three models (CMax, CAP, and MES) to be difficult to constrain (Figure 1), and one model (CMax) to contain a low . The darker gray shading shows the average observed gs, with lighted shading spanning ±1 S.D. around it. The boxed area defines the selection of g s data shown in (b) and includes a minimum D a of 2 kPa to account for models that predict relatively early g s closure. Panel (b) shows ridge plots of the differences between rates of simulated g s decline (i.e., simulated g s normalized by its model-specific maximum per site × species and scaled to the maximum observation) and rates of observed g s decline, relative to observed g s decline from within the boxed area from (a). Negative spread and peaks signify early stomatal closure compared to the observations, and vice versa.
SABOT ET AL.
10.1029/2021MS002761 21 of 30 influence parameter that could be fixed or omitted (cf., Zenes et al., 2020). At the ecosystem scale, Bassiouni and Vico (2021) confirmed that the less physiologically meaningful parameters of CMax were the most uncertain among those of three g s optimization models. From an operational perspective, constraining multi-parameter models can require over three times as many function evaluations than for one-parameter models. This makes "easy to calibrate" models attractive because global parameterizations would come at a considerable computational cost for multimodal optimization schemes and/or schemes sensitive to the calibration conditions ( Figure 1).
Optimization was an improvement over the well-established Medlyn model for five optimization models evaluated against leaf-level observations (ProfitMax,SOX opt ,ProfitMax2,LeastCost,and MES), and the CGain model was largely on par with the Medlyn model (Figure 3). Contrary to previous findings by , CMax was not a large improvement over WUE-Ψ in our modeling framework. CMax was not among the top performing optimization schemes either, whereas the LeastCost model was, at odds with the findings of . Possible explanations for apparent discrepancies between past findings and ours include the following: 1. The calibration methods are different. For instance,  calibrated their models to studentized distributions of A n , E, and Ψ l . Instead, we calibrated our selected models to observed g s only, which avoids spreading errors across variables and could be critical for models less skilled at capturing Ψ l . 2. Whenever possible, we implemented the optimization criteria forms instead of their derivative forms ( Figure  S2 and Text S8 in Supporting Information S1). 3. Our model simulations include energy-balance feedbacks between the atmosphere and the leaves, whereas previous studies used leaf-level forcing variables. 4. We evaluated the models' ability to match the observed leaf-level fluxes and water potentials independently, using four metrics that outline various aspects of model performance, as opposed to a single metric averaged across variables. 5. The evaluations use different model selections and data sets, for example, ours includes more Eucalyptus species.
Discrepancies aside, our results agree with  in that WUE-Ψ and (mostly) CMax may behave unrealistically at high D a . We also concur that ProfitMax and SOX opt perform to a similar standard, but we find that a slightly more conservative mode of stomatal regulation (i.e., marginally lower gas exchange under well-watered conditions) and less sensitivity to water stress (e.g., higher Ψ l in Figures 5d and 5e) give ProfitMax the edge over SOX opt .

Non-Stomatal Photosynthetic Limitations
The non-stomatal optimization approaches evaluated here (the CAP and MES models) resulted in lower photosynthetic estimates than observed ( Figure 6 and Figure S10 in Supporting Information S1), through mechanisms challenged by the observations of co-occurring g s and C i ( Figure S11 in Supporting Information S1). At the same time, both models proved highly skilled at representing the dynamics of g s , particularly for the Quercus ilex sites × species ( Figure S9c in Supporting Information S1) for which seasonally varying non-stomatal limitations have been demonstrated . One possible explanation is that, despite clear evidence of apparent V cmax,25 downregulation with drought (Keenan et al., 2010;Zhou et al., 2013), the evidence does not unequivocally support instantaneous linear downregulation of photosynthetic capacity with Ψ l . Additionally, robust inclusion of non-stomatal limitations in LSMs would require novel fitting of biochemical photosynthetic traits (i.e., in situ rather than apparent V cmax,25 ; Bahar et al., 2018;Crous et al., 2013;Sun et al., 2014) as well as their temperature dependencies (Knauer et al., 2019). We argue that inclusion of non-stomatal limitations in LSMs through optimality principles would be premature, as our understanding of the timescales at which non-stomatal processes interact with plant hydraulic function, let alone how they arise (Yang et al.,

Estimating k max
k max is a key parameter of hydraulics-enabled models that is measurable yet rarely documented in the plant hydraulics literature. In the absence of measured k max , we calibrated model-specific k max parameters from g s data. These calibrated k max were not always plausible (i.e., outside ranges inferred from observed E and Ψ l , as well as vulnerability curves; Figure S6p in Supporting Information S1) for Tuzet, Eller, LeastCost, CAP, and MES. Furthermore, k max appeared: (a) low at the xeric sites but adequate elsewhere (ProfitMax, SOX opt ); (b) or reasonable at the xeric sites but high at the mesic sites (ProfitMax2, MES); (c) or mostly biased (Tuzet, Eller, LeastCost, CAP).
We opted to use the ProfitMax's k max when parameterizing the WUE-Ψ , CMax, and CGain models (see "Idealized calibrations" for the rationale). An alternative (but more computationally costly) method would have been to calibrate each of these models' k max . Testing (not shown) revealed that calibrated parameter spreads for the WUE-Ψ and CGain models would remain analogous to the LeastCost model's in Figure 1, whereas constraining CMax's parameters would be harder still. The predictive performance (Figures 3 and 4) and functional relationships ( Figures 5-8) of CGain and CMax would see no substantial change in terms of the gas exchange variables. However, their Ψ l would risk further departure from the observations through less realistic k max (compared to the ranges shown in Figure S6p of Supporting Information S1). Finally, had we calibrated its k max , WUE-Ψ would resemble LeastCost in all aspects. That is, WUE-Ψ 's ability to capture the observations of gas exchange would increase at the expense of Ψ l and k max .
The inclusion of reduction/cost parameters in optimality criteria (e.g., λ in WUE-Ψ , η in LeastCost, Ψ φ,lim in CAP or MES) should be considered carefully, as these additional parameters did not confer a consistent advantage for the simulation of gas exchange and/or degraded the simulation of Ψ l compared to one-parameter models which required calibrating k max only.

Future Challenges
Realistic estimates of water potential and embolism are needed for models to represent drought-related mortality (but see De  and Venturas et al. (2020) for the associated challenges) and drought legacy effects. Ideally, models should link Ψ l to g s through hydraulic reduction factors/cost terms that rely on realistic k max . Newly available sap flow measurements from the SAPFLUXNET database , together with existing hydraulic trait databases and aligned with leaf-level observations, offer an opportunity to infer relevant parameterizations of k max and xylem failure thresholds (i.e., Ψ crit , k crit ) that could reduce model bias in Ψ l estimates. Explicit rhizosphere (Venturas et al., 2018;Wang et al., 2019) and/or symplastic limitations (De Cáceres et al., 2021) under dry conditions could also be explored, noting associated parameterization uncertainties (Xu & Trugman, 2021).
Adding complexity might not suffice where the relationship between g s and Ψ l is mischaracterized (e.g., unrealistically shaped) and/or where magnitude biases are very large (Eller, LeastCost, and WUE-Ψ to a lesser extent), so reformulation of optimality criteria cost terms/regulation factors might be necessary. Critically, uncertainty in the formulation of plant vulnerability to cavitation (Ogle et al., 2009), in hydraulic vulnerability measurements for long vessel species (Cochard et al., 2013), and in Ψ l measurements (which can be significantly decorrelated in time from gas-exchange measurements, especially when made in the field) should be accounted for: (a) when selecting parameterizations from the literature (and parameterizations could be constrained using less artifact-prone turgor loss data); (b) when resolving optimality criteria formulations; and (c) in model predictions of Ψ l and plant water status.

Conclusion
Over half the models compared performed well against leaf-level observations of gas exchange. However, care should be taken when analyzing performance metrics, as seemingly good performance can result from problematic internal compensation effects, and/or mischaracterization of model sensitivities to soil moisture and vapor pressure deficit. Only one of the better performing models required calibration of a parameter explicating stomatal sensitivity to the soil moisture conditions (Medlyn), confirming the viability of hydraulics-based stomatal 23 of 30 models for predicting vegetation responses in novel climate spaces. Notably, the better constrained optimization approaches (e.g., ProfitMax, SOX opt ), both in terms of parameterization and functional representations, embed the potential for acclimation and plasticity in the plant hydraulic and photosynthetic responses Sperry et al., 2019), with a link to plant carbon allocation patterns (Potkay et al., 2021;Trugman et al., 2019) of major significance for the future of global coupled climate-vegetation modeling. Figure A1 shows the average diurnal behavior for a suite of fluxes calibrated under "wet" conditions, highlighting how calibration does not ensure agreement throughout the day, despite producing comparable average estimates. Figure A2 shows the same thing under "stressed" conditions. Average daily g s (Figures A1a and A2a) are within ±10% of Medlyn across calibration conditions, but single day average differences can exceed 20% under "wet" conditions and reach up to c. 85% under "stressed" conditions. Figure A1. Average diurnal behavior of leaf-level water (panels a, c, and e) and carbon (panels b and d) fluxes predicted by the 12 models under well-watered conditions for 26 simulated data points per day. The Medlyn model's simulation of stomatal conductance (g s , panel a) was used as a reference to calibrate the other models. Other simulated fluxes shown are: (b) the CO 2 concentration in the leaf intercellular air spaces (C i ) as a proportion of the ambient CO 2 concentration (C a ); (c) the leaf water potential (Ψ l ); (d) the net rate of carbon assimilation (A n ); and (e) the transpiration (E). Figure A2. Average diurnal behavior of leaf-level water (panels a, c, and e) and carbon (panels b and d) fluxes predicted by the 12 models under mild soil moisture stress for 26 simulated data points per day. The Medlyn model's simulation of stomatal conductance (g s , panel a) was used as a reference to calibrate the other models. Other simulated fluxes shown are: (b) the CO 2 concentration in the leaf intercellular air spaces (C i ) as a proportion of the ambient CO 2 concentration (C a ); (c) the leaf water potential (Ψ l ); (d) the net rate of carbon assimilation (A n ); and (e) the transpiration (E). Figure A1a, five optimal models (CMax, ProfitMax, CGain, SOX opt , and MES) are characterized by higher and earlier g s peaks in the morning when T a , D a , and PPFD are low (see Figure S3 in Supporting Information S1), resulting in asymmetric biases between a few simulated points in the early morning/ evening and the rest of the day (>15% greater g s in SOX opt than in Medlyn before 10 a.m. and after 5 p.m. vs. <10% smaller g s otherwise; >5% greater g s in CMax before 10 a.m. and after 5 p.m. vs. >14% smaller g s otherwise). For WUE-Ψ and LeastCost, g s peaks occur later in the morning than for Medlyn. Throughout the day, the Eller model is the steadiest, possibly because it assumes biochemical co-limitation (Equation 8), which smooths photosynthetic transition effects (see below), and because its downregulation factor depends on Ψ x,pd and P 50 rather than Ψ l (Equation 9), which limits D a effects.

Compared to Medlyn in
CAP and MES excepted (see below), well-watered estimates of diurnal C i ( Figure A1b) vary together with g s , so higher g s relative to Medlyn also means higher C i . Under "stressed" conditions, marginal adjustments in water use may result in higher C i versus lower g s than Medlyn, at various points of the day (Eller,ProfitMax2 in Figures A2a and A2b). Marked transitions in C i :C a result from transitions between RuBP-regeneration limited photosynthesis (morning and evening) and Rubisco-limited photosynthesis (afternoon; see Method S1). Schemes that simulate higher morning and evening C i :C a often transition half an hour later than Medlyn-particularly in the morningand those that simulate lower C i :C a transition about half an hour earlier-particularly in the evening.
In the early morning and evening, relative differences in g s and C i make most optimization schemes more profligate water spenders than Medlyn (up to >20% increase in E; Figure A1e) for negligible carbon gains (<3% increase in A n ; Figure A1d), and vice versa in the afternoon. Having a somewhat riskier strategy, WUE-Ψ (both under "wet" and "stressed" conditions) and LeastCost (under "stressed" conditions) can transpire more in the afternoon (up to c. 10%) for modest carbon gains (up to <4%) that risk being offset by photorespiration when investments in E are insufficient. Note, neither WUE-Ψ nor LeastCost explicitly account for hydraulics in their optimization criteria formulation. CAP and MES stand out, always simulating lower A n than Medlyn, because their hydraulic controls directly downregulate V cmax and J max (Equation 21) or mesophyll conductance (Equation 22). Consequently, both models must maintain high C i ( Figure A1b) to prevent stark reductions in A n .
Simulation of Ψ l ( Figure A1c) is where schemes most diverge. LeastCost and Eller simulate lower Ψ l than all other schemes, even beyond the P 88 . These models' calibrated k max is smaller than the ProfitMax's (Table S6 in Supporting Information S1), so LeastCost and Eller have lower Ψ l at the same magnitude g s than other models, pointing to almost no hydraulic controls. Conversely, ProfitMax2, CAP, and MES exert tight hydraulic controls, relying on higher calibrated k max than ProfitMax, and so, yielding higher Ψ l . Both CGain and SOX opt behave similarly to ProfitMax, with ever so marginally lower Ψ l . Under "stressed" conditions, CAP and MES ( Figure A2c) operate in a contrasting mode of optimization compared to the other models, where k max is over three times that of the "wet" calibrations, so Ψ l barely drops throughout the day. Finally, CMax simulates stomatal closure in the afternoon, accompanied by a reversal of the diurnal leaf water potential drop (most visible in Figure A2c), when T a and D a rise.