Corresponding author: K. Schaefer, National Snow and Ice Data Center, Cooperative Institute for Research in Environmental Sciences, University of Colorado at Boulder, Boulder, Colorado, USA. (firstname.lastname@example.org)
Accurately simulating gross primary productivity (GPP) in terrestrial ecosystem models is critical because errors in simulated GPP propagate through the model to introduce additional errors in simulated biomass and other fluxes. We evaluated simulated, daily average GPP from 26 models against estimated GPP at 39 eddy covariance flux tower sites across the United States and Canada. None of the models in this study match estimated GPP within observed uncertainty. On average, models overestimate GPP in winter, spring, and fall, and underestimate GPP in summer. Models overpredicted GPP under dry conditions and for temperatures below 0°C. Improvements in simulated soil moisture and ecosystem response to drought or humidity stress will improve simulated GPP under dry conditions. Adding a low-temperature response to shut down GPP for temperatures below 0°C will reduce the positive bias in winter, spring, and fall and improve simulated phenology. The negative bias in summer and poor overall performance resulted from mismatches between simulated and observed light use efficiency (LUE). Improving simulated GPP requires better leaf-to-canopy scaling and better values of model parameters that control the maximum potential GPP, such as εmax (LUE), Vcmax (unstressed Rubisco catalytic capacity) or Jmax (the maximum electron transport rate).
 Terrestrial gross primary productivity (GPP) is the total photosynthetic uptake or carbon assimilation by plants and is a key component of terrestrial carbon balance. GPP is the main carbon input to terrestrial ecosystems, noting relatively minor inputs by dissolved organic carbon, as well as deposition by rainwater and sedimentation [Chapin et al., 2006]. GPP depends on climate, climate variability, disturbance history, water and nutrient availability, soil type, species composition, and community structure. Understanding how these factors influence GPP remains a challenge due to complex interactions and the difficulty in quantitatively measuring GPP directly at various temporal and spatial scales. Estimates of GPP are only available at eddy covariance flux tower sites for the past decade, so we depend on models to estimate GPP over long periods of time at regional and global scales, and to project future changes in GPP in response to climate change.
 Any error in simulated GPP will propagate through the model, introducing errors in simulated biomass and fluxes. If simulated GPP is too low or too high, then predicted leaf area index, wood biomass, crop yield, and soil biomass may also be too low or high [Schaefer et al., 2008]. Net ecosystem exchange (NEE) is total ecosystem respiration (Reco) minus GPP, with a positive NEE indicating a net release of CO2 to the atmosphere. Autotrophic respiration depends on GPP and heterotrophic respiration depends on soil conditions and dead plant biomass, so errors in GPP readily propagate to errors in Reco and simulated diurnal and seasonal cycles of NEE. Through representation of stomatal control on GPP and transpiration, errors in GPP introduce errors in simulated latent and sensible heat flux, which in turn can introduce error in simulated atmospheric circulation. GPP is a key carbon flux that needs to be simulated as accurately as possible to ensure the most reliable values of simulated biomass and surface fluxes.
 The modeling community currently lacks a quantitative evaluation of multiple GPP models to gauge overall performance across different ecosystems and help prioritize long-term model development. Many modeling teams compare simulated NEE to observed NEE measured at single points using eddy covariance techniques [Baldocchi et al., 2001; Grant et al., 2010]. Others compare against NEE for large regions estimated from transport inversions that are optimally consistent with observations of atmospheric CO2 concentration [Gurney et al., 2002; Peters et al., 2010]. However, comparisons with observed NEE do not distinguish between Reco and GPP and provide little information on model performance relative to GPP. Fortunately, NEE measured by eddy covariance techniques can be partitioned into Reco and GPP: a temperature function is tuned to nighttime Reco, the function is used to calculate daytime Reco, and the estimated GPP is the daytime Reco minus the daytime NEE [Desai et al., 2008; Lasslop et al., 2010]. There are assessments in the literature of how well terrestrial biosphere models simulate GPP, but they focused on a single or small number of models compared to GPP estimated from eddy covariance data at a small number of towers [e.g., Thornton et al., 2002; Schaefer et al., 2008; Verbeeck et al., 2008]. These studies used different techniques to estimate GPP from observed NEE [Desai et al., 2008], making it difficult to differentiate between errors in the partitioning technique and true model-data mismatches. The evaluations were run at different sites and used different input weather, making it difficult to isolate input errors from true model-data mismatches. The performance measures used in these evaluations are difficult to compare because most used qualitative performance criteria while those with quantitative performance measures used different statistical techniques and quantities. 
None of these model evaluations account for uncertainty in estimated GPP due to uncertainty in the eddy covariance data and partitioning techniques. The actual GPP value lies within the range defined by uncertainty, so the ideal performance target of any model is to match the observed GPP within uncertainty. These studies provide insight into the performance of individual models. However, the differences among evaluations make it very difficult to compare and synthesize the results to identify strengths and weaknesses common to all terrestrial biosphere models and to determine what changes will provide the greatest improvements in simulated GPP.
 We hypothesize that model performance depends on 1) model structure and 2) how models simulate GPP response to changing environmental conditions. Model structure refers to differences in how models represent various physical and biological processes, such as LUE versus EK models. To test our hypotheses, we compared simulated GPP from 26 models against estimated GPP at 39 eddy covariance flux tower sites in the North American Carbon Program (NACP) site-level interim synthesis. Our analysis includes observation uncertainty to determine if the models hit the desired performance target: matching observed GPP within uncertainty. The number and variety of models and sites in the NACP site synthesis are sufficient to identify the strengths and weaknesses common to all GPP models and, most importantly, how to improve the models.
2.1. Estimated GPP
 Our analysis used daily average GPP estimated at 39 eddy covariance flux tower sites (Table 1). Observed NEE at all towers were processed and partitioned into GPP and Reco using standard techniques [Barr et al., 2004]. The NACP site synthesis included 47 tower sites, but we did not include those sites where the GPP partitioning was not done or the algorithm failed to converge. The chosen sites represented eight major biome types across North America except tundra, with 24 sites from the AmeriFlux network and 15 sites from the Fluxnet Canada Research Network/Canadian Carbon Program. GPP partitioning was not done for tundra sites due to large data gaps in winter and the lack of nighttime data in summer to train the Reco model. NACP site synthesis used the International Global Biosphere Program (IGBP) biome classifications [Loveland et al., 2000]. Some models were designed for specific biome types, such as forest or agriculture sites only, so not every model simulated all sites, resulting in a total of 627 simulations and an average of 16 simulations per site.
Observed, hourly NEE at each site was gap-filled and decomposed into hourly Reco and GPP using a standard procedure [Barr et al., 2004] and converted into 24-hour daily averages. Before processing, observed NEE was screened to remove outliers [Papale et al., 2006] and exclude data during periods of low turbulence based on a friction velocity threshold (A. G. Barr et al., Use of change-point detection for u*–threshold evaluation for the North American Carbon Program interim synthesis, manuscript in preparation, 2012). Reco was set equal to observed, nighttime NEE and fitted to an empirical model based on air and soil temperature using a moving window approach. The function was then used to calculate the daytime Reco, and GPP was estimated as the daytime Reco minus the daytime NEE. Finally, gaps in GPP were filled using an empirical model that was tuned to the estimated GPP values. GPP was set to zero when the soil was frozen and the air temperature was below 0°C, assuming air temperature represents the temperature of the entire system. The partitioning procedure occasionally produced negative GPP values at dawn and dusk at most sites and occasionally for entire days at the dry, grassland sites. The negative GPP values typically greatly exceeded estimates of random uncertainty and probably resulted from errors in the Reco temperature response or the fact that the partitioning algorithm did not account for influences of soil moisture on GPP. We set all negative GPP values to zero, transferred the flux to Reco to maintain the observed NEE, and recalculated the daily average GPP. One third of the models used a daily time step, so we calculated the daily average as the average rate of GPP over a 24-hour period using both estimated and gap-filled values.
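The last steps of this procedure can be sketched in Python. This is a minimal illustration under stated assumptions, not the Barr et al. [2004] algorithm: the function names are hypothetical, Reco is assumed to have been modeled already, and the sign convention is NEE = Reco − GPP.

```python
def partition_gpp(nee, reco):
    """Estimate hourly GPP from hourly NEE and modeled Reco.

    Uses GPP = Reco - NEE. Where this gives a negative GPP, GPP is set
    to zero and the flux is transferred to Reco, so that Reco - GPP
    still equals the observed NEE.
    """
    gpp_out, reco_out = [], []
    for n, r in zip(nee, reco):
        g = r - n
        if g < 0.0:
            # Clip GPP at zero; with GPP = 0 the adjusted Reco equals NEE.
            g, r = 0.0, n
        gpp_out.append(g)
        reco_out.append(r)
    return gpp_out, reco_out


def daily_average(hourly):
    """Average rate over a 24-hour period (hypothetical helper)."""
    if len(hourly) != 24:
        raise ValueError("expected 24 hourly values")
    return sum(hourly) / 24.0
```

Because the adjustment changes Reco and GPP by the same amount, the observed NEE is preserved for every hour, which is the property the text requires.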
Although we treat these GPP values as “observed,” they are not strictly observed, but rather estimated from observed tower-based NEE. The GPP includes all the strengths, weaknesses, and assumptions of the original NEE observations. The energy budget at flux towers does not balance because the eddy covariance technique captures small-scale turbulent fluxes less than 1 km, but can underestimate latent and sensible heat fluxes due to large-scale eddies on the order of 10 km [Foken, 2008]. Assuming similarity with the lack of surface energy balance closure, the measured NEE might be between 15 and 20% less than the actual values [Foken, 2008], indicating a potential underestimate of GPP. Eddy covariance techniques can underestimate nighttime Reco, depending on the threshold used to filter out Reco under stable conditions, also resulting in an underestimate of GPP. The empirical formulation for Reco used to estimate GPP from NEE assumed air temperature represents the temperature of the entire system, underestimating GPP when the canopy temperature is greater than zero and the air temperature is less than zero. Last, the partitioning algorithm did not account for how soil moisture and other factors control Reco, resulting in errors in the estimated GPP (as evidenced by occurrences of negative GPP at some grassland sites).
 Total GPP uncertainty included gap filling algorithm uncertainty, partitioning uncertainty, random uncertainty, and threshold friction velocity (u*) uncertainty, summed in quadrature. Summing in quadrature assumes that these sources of error are uncorrelated. We did not correct for potential biases due to the lack of energy closure or underestimates of Reco at night. Random and u* filtering uncertainty was estimated using a Monte Carlo technique [Richardson and Hollinger, 2007; A. G. Barr et al., manuscript in preparation, 2012]. A. G. Barr et al. generated a synthetic flux time series using the gap-filling algorithm, randomly introduced artificial gaps, added noise, and then refilled the gaps. Repeating the process 1000 times for each site-year produced probability distribution functions with the 2.5 and 97.5 percentiles representing uncertainty. Assuming the GPP uncertainty due to the gap-filling algorithm was the same as that for NEE, the GPP gap filling uncertainty was the fraction of filled GPP values for each day times the standard deviation of multiple algorithms [Moffat et al., 2007]. Partitioning uncertainty was based on the standard deviation of multiple partitioning algorithms [Desai et al., 2008]. Total uncertainty generally increased with the magnitude of GPP and varied from a minimum of ∼1 μmol m−2 s−1 in winter to 2–4 μmol m−2 s−1 in summer. Random uncertainty dominated over other sources of uncertainty, ranging from ∼90% of total uncertainty in summer months to ∼50% of total uncertainty in winter months.
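As a hedged illustration of how the four uncertainty terms combine, the sketch below sums them in quadrature and computes the daily gap-filling term as the filled fraction times the spread among algorithms. The function and argument names are hypothetical, and the numbers in the usage note are invented.

```python
import math


def gap_fill_uncertainty(n_filled, n_total, algorithm_std):
    """Fraction of filled values in a day times the standard deviation
    of multiple gap-filling algorithms."""
    return (n_filled / n_total) * algorithm_std


def total_uncertainty(gap_fill, partitioning, random_unc, ustar):
    """Sum the uncertainty terms in quadrature, which assumes the
    sources of error are uncorrelated."""
    return math.sqrt(gap_fill**2 + partitioning**2 + random_unc**2 + ustar**2)
```

For example, a day with 6 of 24 hours filled and an algorithm spread of 2.0 μmol m−2 s−1 would carry a gap-filling term of 0.5 μmol m−2 s−1 before it is combined with the other three terms.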
2.2. Modeled GPP
Our analysis used simulated GPP from 24 different models (Table 2) plus two model averages. We used the model characteristics in Table 2 as covariates to determine if the different representations of physical and biological processes produce statistically significant differences in model performance relative to GPP. All the EK models included the effects of stomatal conductance and all but two used the Ball-Berry stomatal conductance model [Ball et al., 1987]. The leaf-to-canopy column indicates whether the strategy for scaling from a single leaf to the entire canopy accounts for the effects of diffuse light on shaded leaves (2-leaf) or not (big-leaf). To determine if resolving the diurnal cycle improved performance, we included the ensemble average of all models and the average of all models that resolved the diurnal cycle. Including the two ensemble averages at each site, we have a total of 627 simulations or 4242 site-years of model output with an average of 23 simulations per model.
Zero soil layers indicate the model does not have a prognostic submodel for soil temperature and moisture.
Observed phenology means the model uses remote sensing data to determine leaf area index (LAI) and gross primary productivity (GPP). Semi-prognostic means that remote sensing data is used to specify either LAI or GPP, but not both.
GPP model types: EK (enzyme kinetic) and LUE (light use efficiency).
We included estimates of MODIS GPP from Collection 5.0 and Collection 5.1 [Heinsch et al., 2003; Running et al., 2004]. We extracted a 3 by 3 pixel window of 8-day maximum composite GPP values at 1 km2 spatial resolution, with the center pixel containing the tower site [Distributed Active Archive Center for Biogeochemical Dynamics, 2010]. We filtered out low quality pixels using the simple binary quality control flag to remove the effects of potential cloud contamination and averaged the rest of the pixels to represent the GPP at each site. We linearly interpolated between 8-day composite values to obtain daily GPP.
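The quality filtering and interpolation steps can be illustrated with a short Python sketch. It is an assumption-laden example rather than the processing code used here: the function names, the boolean form of the quality flag, and the day numbering are all hypothetical.

```python
def window_mean(values, passes_qc):
    """Average the pixels of a 3 by 3 window that pass the binary
    quality control flag; return None if no pixel passes."""
    kept = [v for v, ok in zip(values, passes_qc) if ok]
    return sum(kept) / len(kept) if kept else None


def interpolate_daily(composite_days, composite_gpp):
    """Linearly interpolate composite GPP values to every day between
    the first and last composite day (days given as integers)."""
    daily = {}
    points = list(zip(composite_days, composite_gpp))
    for (d0, g0), (d1, g1) in zip(points, points[1:]):
        for day in range(d0, d1 + 1):
            daily[day] = g0 + (g1 - g0) * (day - d0) / (d1 - d0)
    return daily
```

With composites of 2.0 and 6.0 on days 1 and 9, for instance, the interpolated value halfway between them on day 5 is 4.0.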
The MODIS GPP from Collection 5.0 and Collection 5.1 was not based on observed meteorology, so we also calculated GPP using gap-filled observed weather and the MODIS algorithm [Heinsch et al., 2003; Running et al., 2004]:
$$\mathrm{GPP} = \varepsilon_{max}\, F_{SW}\, f_{PAR}\, S_{VPD}\, S_{T}$$
where εmax is the maximum light use efficiency, FSW is the incident shortwave radiation flux, fPAR is the absorbed fraction of PAR, SVPD is the VPD scaling factor, and ST is the air temperature scaling factor. SVPD represents the GPP response to drought and humidity stress and ST represents the GPP response to temperature, with both varying between zero and one. We used the MODIS Biome-Property-Look-Up-Table [Zhao and Running, 2010] and daily fPAR values interpolated from the monthly mean GIMMSg NDVI data set [Tucker et al., 2005]. Although the complexity and sophistication vary widely, all the models have a GPP formulation similar to MODIS: a peak potential rate times the amount of absorbed light, multiplied by a series of scaling factors representing how GPP responds to changing environmental conditions. The scaling factors represent the ratio of actual to a reference or optimal GPP and vary between zero and one.
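The structure shared by these formulations, a peak potential rate times absorbed light times scaling factors bounded by zero and one, can be sketched as follows. This is an illustration only: the linear ramp form of the scaling factors, the function names, and all numerical values are assumptions, and any conversion from shortwave radiation to PAR is assumed to be folded into `eps_max`.

```python
def ramp_up(x, x_min, x_max):
    """Scaling factor rising linearly from 0 at x_min to 1 at x_max
    (assumed form, e.g. for a temperature response)."""
    return min(1.0, max(0.0, (x - x_min) / (x_max - x_min)))


def ramp_down(x, x_min, x_max):
    """Scaling factor falling linearly from 1 at x_min to 0 at x_max
    (assumed form, e.g. for a VPD response)."""
    return 1.0 - ramp_up(x, x_min, x_max)


def lue_gpp(eps_max, f_sw, f_par, s_vpd, s_t):
    """Peak light use efficiency times absorbed light times the
    humidity and temperature scaling factors."""
    return eps_max * f_sw * f_par * s_vpd * s_t
```

Because each scaling factor is bounded by zero and one, the simulated GPP can never exceed the maximum potential value set by `eps_max` and the absorbed light.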
 All models used gap-filled observed weather from each tower site [Ricciuto et al., 2009; http://www.nacarbon.org/nacp/] with input parameters and biophysical characteristics derived from local observations, such as soil texture. Missing air temperature, atmospheric humidity, shortwave radiation, and precipitation data were filled using DAYMET [Thornton et al., 1997] or the nearest available climate station in the National Climatic Data Center's Global Surface Summary of the Day (GSOD) database. Daily GSOD and DAYMET data were temporally downscaled to hourly or half-hourly values using the phasing from observed mean diurnal cycles calculated from a 15-day moving window. When station data were unavailable, a 10-day running mean diurnal cycle was used [Ricciuto et al., 2009; http://nacp.ornl.gov/docs/Site_Synthesis_Protocol_v7.pdf]. The models were run for as many years as required, repeating the gap-filled weather, until they reached steady state initial conditions where Reco balances GPP and the average NEE over the entire simulation is near zero. The steady state assumption influences Reco, but has little or no effect on simulated GPP. All models used their standard values for various biophysical parameters except LoTEC, which used optimized parameter values obtained through data assimilation [Ricciuto et al., 2011].
2.3. Model Performance
We quantified model performance through a statistical analysis of model-data residuals of daily average GPP. We first calculated residuals:
$$r_i = \mathrm{GPP}_{si} - \mathrm{GPP}_{oi}$$
where ri is the residual for the ith model-data pair, GPPsi is simulated daily average GPP, and GPPoi is estimated daily average GPP. Bias is the residual mean and the Root Mean Square Error (RMSE) is the residual standard deviation. X2 is the mean of the squared residuals normalized by uncertainty:
$$X^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{r_i}{\varepsilon_i}\right)^2$$
where n is the number of residuals and εi is the uncertainty for the ith daily average estimated GPP. We filtered out ∼0.1% of daily estimated GPP values with εi ≤ 0.3 μmol m−2 s−1 which produced extreme outlier X2 values that skewed our results. Such unrealistically small εi values occasionally occurred when GPP was near zero. We did not filter out daily average GPP values based on the number of filled values per day because these values have higher uncertainty and a proportionally lower influence on X2.
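The three performance measures can be computed together, as in the following Python sketch. It is a minimal illustration under stated assumptions: the function name is hypothetical, residuals are taken as simulated minus estimated GPP, RMSE follows the definition in the text (the standard deviation of the residuals, here the population form), and X2 is taken as the mean squared ratio of residual to uncertainty.

```python
import math


def performance_measures(simulated, estimated, uncertainty,
                         min_uncertainty=0.3):
    """Return (bias, rmse, chi2) for paired daily average GPP values.

    Pairs whose uncertainty is at or below min_uncertainty are dropped
    before any statistic is computed.
    """
    pairs = [(s - o, e)
             for s, o, e in zip(simulated, estimated, uncertainty)
             if e > min_uncertainty]
    if not pairs:
        raise ValueError("no model-data pairs left after filtering")
    n = len(pairs)
    bias = sum(r for r, _ in pairs) / n
    # RMSE as defined in the text: spread of residuals about their mean.
    rmse = math.sqrt(sum((r - bias) ** 2 for r, _ in pairs) / n)
    # Mean of squared residuals normalized by uncertainty.
    chi2 = sum((r / e) ** 2 for r, e in pairs) / n
    return bias, rmse, chi2
```

Defined this way, RMSE is insensitive to a constant offset between model and data, while X2 responds to both bias and scatter.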
The ideal target for any model is X2 < 1.0, which means, on average, the residuals are less than uncertainty or the model matches observations within measurement uncertainty. Variations of X2 within this target range have no meaning relative to model performance. A model with an X2 value of 0.8, for example, is not “better” than a model with an X2 value of 0.9, since both models show no statistically significant differences with observations. Consequently, we identified performance categories based on ranges of X2 values. An X2 value of ≤1.0 indicated good model performance. An X2 value between 1.0 and 2.0 indicated marginal model performance, where the model-data mismatch is on the order of two times the observation uncertainty. An X2 value of >2.0 indicated poor model performance, where the model-data mismatch is several times the observation uncertainty. An X2 value of 9, for example, indicates that the model-data mismatch is, on average, three times the uncertainty.
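These categories amount to a simple threshold rule, sketched below in Python. The function names are hypothetical, and reading the square root of X2 as the typical mismatch in units of uncertainty is an interpretation in a root-mean-square sense.

```python
import math


def performance_category(chi2):
    """Map an X2 value to the performance categories used in the text."""
    if chi2 <= 1.0:
        return "good"
    if chi2 <= 2.0:
        return "marginal"
    return "poor"


def mismatch_in_uncertainty_units(chi2):
    """Typical model-data mismatch expressed as a multiple of the
    observation uncertainty (root-mean-square sense)."""
    return math.sqrt(chi2)
```

An X2 value of 9 therefore falls in the poor category and corresponds to a mismatch of about three times the uncertainty.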
To test our hypothesis that model structure influences performance, we aggregated the performance measures by model, model characteristic, site, and month-of-year. To identify any statistically significant differences in performance based on how the models represented various physical and biological processes, we aggregated performance measures by the model structural characteristics listed in Table 2. To evaluate seasonal variation in performance parameters, we aggregated by month-of-year, where January is the average of all Januaries, February the average of all Februaries, etc.
To test our hypothesis that model performance depends on how models represent the GPP response to changing environmental conditions, we compared observed and simulated environmental response curves. We sorted the daily average GPP values into bins based on daily average values of input driver variables and calculated the mean, standard deviation, and uncertainty of the daily average GPP for each bin. We focused on downwelling shortwave radiation flux, air temperature, and relative humidity. The relative humidity response function reflects the reduction in GPP due to stomatal closure under drier atmospheric conditions. VPD, the difference between saturated and actual water vapor pressure, also reflects stomatal closure, but varies with temperature such that the range and magnitude vary among sites. Relative humidity always varies between zero and one and greatly simplifies our analysis by allowing easy comparison among sites. More importantly, 16 of the 20 models in this analysis that account for stomatal conductance used the Ball-Berry stomatal conductance model, which is based on relative humidity. Each model's mathematical formulation and associated parameter values determined the shape of the simulated response curves. We compared simulated and observed shape characteristics, such as slope, to isolate those model formulations or parameters that determine model performance. The response function for each driver variable differs slightly from site to site and is only weakly correlated to response functions for other driver variables. For example, the optimal temperature for GPP is different for each site but does not depend on humidity or light. Consequently, we made no attempt to remove covariance between driver variables.
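The binning behind a response curve can be sketched as follows. This is an illustrative Python example, not the analysis code: the function name, the fixed bin width, and the use of the population standard deviation are assumptions, and the per-bin uncertainty is omitted for brevity.

```python
import math
import statistics


def response_curve(driver, gpp, bin_width):
    """Bin daily average GPP by a driver variable.

    Returns a dict mapping each bin's lower edge to a tuple of
    (mean GPP, standard deviation of GPP, number of days).
    """
    bins = {}
    for x, g in zip(driver, gpp):
        index = math.floor(x / bin_width)
        bins.setdefault(index, []).append(g)
    return {
        index * bin_width: (statistics.mean(values),
                            statistics.pstdev(values),
                            len(values))
        for index, values in sorted(bins.items())
    }
```

Applying the same function to estimated and simulated GPP with a shared driver series yields curves whose shapes, such as the slope, can be compared bin by bin.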
3.1. Performance Summary
None of the models in this study achieved a good overall performance of X2 less than one for all sites (Figure 1). LoTEC, which was optimized against flux data, and DLEM achieved marginal performance, while SSIB2 and TECO had large X2 values due to large biases. Generally speaking, higher RMSE resulted in higher X2, but the relationship was weak because X2 accounts for uncertainty and bias while RMSE does not. There was no relationship between model bias and either RMSE or X2. The RMSE fell within a narrow performance range, with a mean and standard deviation of 2.8 ± 1.0 μmol m−2 s−1. On average, the models underestimated GPP (negative bias) with a mean bias of −0.3 μmol m−2 s−1, but the spread between models was very large compared to the mean, with a bias standard deviation of ±0.6 μmol m−2 s−1.
Figure 2 shows that models, on average, underestimated GPP in summer (negative bias) and overestimated GPP in winter, spring, and fall (positive bias). These seasonal biases of opposite sign tended to cancel, resulting in the lower overall biases seen in Figure 1. Figure 2 shows the average monthly bias, but every month showed both positive and negative biases for individual models with standard deviations ranging from two to ten times the mean bias. The estimated GPP is smaller in spring and fall, with correspondingly smaller uncertainties, which magnified the relatively small model biases to produce slight peaks in X2 in spring and fall. The models performed worst in the summer and the best in winter, indicating the models properly shut down GPP during winter, but the real challenge is to capture GPP dynamics during the growing season.
 The models generally performed the best at forest sites and the worst at crop, grassland, and savanna sites (Figure 3). The models did not show good overall performance at any site, but did show marginal performance at seven sites. The models performed best for deciduous broadleaf, mixed forest, and evergreen needleleaf biome types and all of the ten sites with the best overall performance were forest. The spread in performance within biome types was broad: two of the ten sites with the worst performance were evergreen needleleaf forest sites. However, seven of the ten sites with the worst model performance were crop, grassland or savanna sites.
The models showed a large spread in both the magnitude and timing of the simulated GPP seasonal cycle, as indicated by the large spread in Figure 4. The models captured the basic observed seasonal pattern in GPP with near-zero values in winter and a peak value in mid-summer, so the standard deviation of the seasonal cycle is an alternative measure of seasonal amplitude. Consequently, the radial distance in Figure 4 is effectively the ratio of simulated to estimated seasonal amplitude of GPP, an alternative measure of bias. Ratios less than 1.0 indicate the model underestimates the GPP seasonal amplitude (negative bias). The models had standard deviation ratios ranging from 0.5 to 1.4, which means the simulated GPP ranged from 50% to 140% of estimated GPP. The models showed correlations between 0.6 and 0.9, indicating they varied widely in how well they captured the timing of the estimated GPP seasonal cycle, regardless of how well they captured the magnitude. For example, LoTEC, ISOLSM, and LPJ all had standard deviation ratios near one, consistent with the low biases seen in Figure 1. However, these three models showed progressively smaller correlations and correspondingly larger X2, indicating that simulated phenology and associated phasing of the GPP seasonal cycle played an important role in determining overall model performance.
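The two quantities discussed here, the standard deviation ratio and the correlation, can be computed as in the Python sketch below. The function name is hypothetical and the population form of the variance is an assumption made for illustration.

```python
import math


def seasonal_cycle_statistics(simulated, estimated):
    """Return (standard deviation ratio, correlation) between simulated
    and estimated GPP series of equal length."""
    n = len(simulated)
    mean_s = sum(simulated) / n
    mean_o = sum(estimated) / n
    var_s = sum((s - mean_s) ** 2 for s in simulated) / n
    var_o = sum((o - mean_o) ** 2 for o in estimated) / n
    cov = sum((s - mean_s) * (o - mean_o)
              for s, o in zip(simulated, estimated)) / n
    # Ratio below 1.0: the simulated seasonal amplitude is too small.
    std_ratio = math.sqrt(var_s / var_o)
    # Correlation measures agreement in timing, independent of magnitude.
    correlation = cov / math.sqrt(var_s * var_o)
    return std_ratio, correlation
```

A simulated cycle with exactly half the estimated amplitude but identical timing would give a ratio of 0.5 and a correlation of 1.0, which shows how the two measures separate magnitude from phasing.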
The two model means showed the highest correlation with observations. This indicates that model-data mismatches associated with the timing of the GPP seasonal cycle can partly cancel out when averaging the results from multiple models. Essentially, the ensemble mean gave better results than any single model alone. Although errors in timing cancel, an ensemble mean does not eliminate overall bias. The standard deviation ratios for the two model means in Figure 4 reflected the overall, average negative bias of all the models in our analysis. Although there was positive bias in winter, spring, and fall, the negative bias in summer dominated and, on average, the models as a whole underestimated GPP by 20%–30%.
 These results complement and extend previous analyses from the NACP site synthesis. Richardson et al.  found that overestimation of GPP in spring and fall resulted in models predicting the start of spring uptake about two weeks earlier than observed and the end of uptake in fall about two weeks later than observed. Schwalm et al.  found models simulate NEE better at forest sites than grassland sites. The positive biases in spring and fall can help explain the decreased model performance relative to NEE in spring and fall [Schwalm et al., 2010]. Underestimating GPP in summer can explain the peaks in the spectral signature of NEE residuals at the annual time scale [Dietze et al., 2011]. Our results and those from prior studies indicate that seasonal biases in simulated GPP can help explain problems in simulating seasonal changes in NEE.
Overall, there was a very large spread in model performance. On average, a single model showed good performance at 12% of the sites, marginal performance at 26%, and poor performance for the rest. Nearly every model had one “outlier” site where it performed considerably worse than the other sites, with X2 values often exceeding 20. Conversely, nearly all models showed good or marginal performance at at least one site. The spread in performance across sites was equally broad, with three outlier sites where none of the models performed well. The spread among models at a single site was also wide: each site, on average, had two good simulations, four marginal simulations, and two outlier simulations with X2 > 20.
3.2. Model Structure
Model performance did not depend on model structure, as defined by the model characteristics in Table 2. We did not find any statistically significant relationships between performance and how models represent various physical and biological processes (Table 3). In all cases, the difference in mean values between model groups was much smaller than the standard deviation within groups such that none of the differences were statistically significant. For example, LUE models performed better than EK models, but when excluding SSIB2 and TECO, which had large biases, EK models performed better than LUE models. Essentially, EK and LUE models performed equally well in simulating observed GPP. The performance difference for a daily versus hourly time step was consistent with the difference between EK and LUE models, since nearly all EK models use an hourly time step. The same was true for the other model structural characteristics: models that include a nitrogen cycle, a soil model, shaded leaves, or prognostic phenology performed as well as models that do not.
Table 3. Differences in Performance Based on Model Structural Characteristics (Value ± Standard Deviation) for All 627 Simulations
Characteristic | X2 | RMSE (μmol m−2 s−1) | Bias (μmol m−2 s−1)
GPP Model Type | 4.2 ± 3.9 | 2.4 ± 1.2 | 0.1 ± 1.2
GPP Model Type | 3.2 ± 2.3 | 2.8 ± 1.7 | −0.9 ± 1.3
 | 3.7 ± 3.2 | 2.6 ± 1.5 | −0.3 ± 1.4
 | 4.2 ± 3.8 | 2.4 ± 1.2 | −0.1 ± 1.3
 | 3.1 ± 2.5 | 2.3 ± 1.3 | −0.3 ± 1.2
 | 4.8 ± 4.1 | 2.7 ± 1.4 | −0.1 ± 1.4
 | 3.3 ± 2.6 | 2.6 ± 1.7 | −0.6 ± 1.4
 | 4.2 ± 3.8 | 2.4 ± 1.2 | 0.0 ± 1.3
 | 4.1 ± 3.6 | 2.5 ± 1.4 | −0.1 ± 1.3
 | 3.7 ± 3.3 | 2.5 ± 1.4 | −0.3 ± 1.3
 | 3.5 ± 3.0 | 2.7 ± 1.6 | −0.7 ± 1.3
 | 4.3 ± 3.8 | 2.4 ± 1.2 | 0.2 ± 1.2
We found no significant relationships, but this does not mean that model structure does not influence performance. Sprintsin et al., for example, clearly demonstrate that accounting for diffuse light and changing from a big-leaf to a 2-leaf formulation improved BEPS performance. However, SiB is a big-leaf model and performed just as well as BEPS. The lack of significant relationships means that model performance is dominated by some other aspect of model design not represented by the model characteristics in Table 3, such as how models simulate GPP responses to changing environmental conditions.
3.3. Light Response
The poor overall performance and the negative bias in summer resulted from mismatches between simulated and observed LUE. The light response curve is GPP as a function of downwelling shortwave radiation and its slope is the LUE. Figure 5 shows a light response curve based on daily average GPP for US-Me2, which we chose because it had a large number of simulations and was typical of all sites. The uncertainty in Figure 5 was dominated by gap-filling and partitioning uncertainty because the bin averaging tended to greatly reduce the random uncertainty. US-Me2 is an evergreen needleleaf site with simulations from 19 models with a marginal overall performance (X2 = 1.9). Five models had good performance (X2 ≤ 1.0), two showed marginal performance (1.0 < X2 ≤ 2.0), and the rest showed poor performance. Four of the five models with good performance had LUEs that matched observed values within uncertainty. We saw no clear pattern in bias, with each model underpredicting GPP at some sites and overpredicting at others, but the spread in simulated GPP among models at US-Me2 was typical of all sites.
Bias decreased as the ratio of simulated to observed LUE approached one (Figure 6). We calculated the observed and simulated LUE as the regression of GPP versus shortwave radiation flux with a Y-intercept forced to be zero. Both the observed and simulated light response curves were noisy, so we forced the Y-intercept to be zero to guarantee that GPP was zero for zero incident shortwave light. The LUE ratio was simulated LUE divided by observed LUE, with a ratio less than one indicating the model underestimated GPP (negative bias). The LUE ratios formed a diagonal line with near zero bias when the LUE ratio was one. Plots of LUE ratios for subgroups defined by the model structural characteristics in Table 2, individual models, and individual sites all showed the same pattern as in Figure 6: when the LUE ratio was one, the bias was zero.
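A regression with the intercept forced to zero has a closed-form slope, the sum of the products divided by the sum of the squared predictor. The Python sketch below uses that ordinary least squares form to compute an LUE and an LUE ratio; the function names and the sample numbers in the usage note are hypothetical.

```python
def lue_slope(shortwave, gpp):
    """Least squares slope of GPP versus shortwave radiation with the
    Y-intercept forced to zero: sum(x * y) / sum(x * x)."""
    numerator = sum(x * y for x, y in zip(shortwave, gpp))
    denominator = sum(x * x for x in shortwave)
    if denominator == 0.0:
        raise ValueError("shortwave radiation is zero for all days")
    return numerator / denominator


def lue_ratio(shortwave, gpp_simulated, gpp_observed):
    """Simulated LUE divided by observed LUE; a value below one means
    the model underestimates GPP per unit of incident light."""
    return lue_slope(shortwave, gpp_simulated) / lue_slope(shortwave, gpp_observed)
```

If the simulated GPP were exactly half the observed GPP at every light level, the ratio would be 0.5, corresponding to a negative bias.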
To improve performance in simulated GPP, model developers should focus first on those parameters that determine the simulated LUE. The LUE is determined by the leaf-to-canopy scaling and a small number of parameters that define the maximum potential GPP. For the MODIS algorithm described above, for example, the LUE is determined by εmax, so a better value of εmax will improve performance. For other models, Vcmax (the unstressed Rubisco carboxylation rate), α (quantum yield), or Jmax (the maximum electron transport rate) determine LUE. These maximum potential GPP parameters are either held constant for all sites, as in ECOSYS [Grant et al., 2009], or vary with biome or plant functional type (PFT), as in SiB3 [Baker et al., 2008]. Models account for changing environmental conditions by multiplying the maximum potential GPP by temperature, moisture, and humidity scaling factors that represent the ratio of actual to peak GPP.
 How a model scales from a single leaf to an entire canopy also influences the simulated LUE. The maximum potential GPP parameters typically represent peak or optimal values for a single leaf at the canopy top. The leaf-to-canopy scaling factor represents the ratio of GPP for a single leaf to GPP for the entire canopy. A model assumes a distribution of leaf nitrogen and light levels within the canopy and integrates from canopy top to bottom to calculate a leaf-to-canopy scaling factor. SiB and BEPS, for example, both assume the distribution of light is governed by Beer's law [Sellers et al., 1996; Sprintsin et al., 2012]. Unfortunately, the leaf-to-canopy scaling and the maximum potential GPP parameters are coupled and can compensate for each other, so a model has to get both right to get the correct GPP.
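As a sketch of the integration step, assume that light, and with it the contribution of each leaf layer, declines with cumulative leaf area index L as exp(-k L), following Beer's law. Integrating from the canopy top (L = 0) to the bottom (L = LAI) gives (1 - exp(-k LAI)) / k. The extinction coefficient `k`, its default value, and the function name are assumptions for illustration and are not taken from SiB or BEPS.

```python
import math


def canopy_integral(lai, k=0.5):
    """Integral of exp(-k * L) from L = 0 (canopy top) to L = lai."""
    if k <= 0.0 or lai < 0.0:
        raise ValueError("k must be positive and lai non-negative")
    return (1.0 - math.exp(-k * lai)) / k


# Two hypothetical models that reach the same canopy total in different ways:
# model A pairs a smaller leaf-level maximum with a larger canopy integral,
# model B pairs a larger leaf-level maximum with a smaller canopy integral.
integral_a = canopy_integral(lai=4.0, k=0.5)
integral_b = canopy_integral(lai=4.0, k=1.0)
leaf_max_a = 10.0
leaf_max_b = leaf_max_a * integral_a / integral_b
canopy_a = leaf_max_a * integral_a
canopy_b = leaf_max_b * integral_b
```

The two canopy totals are equal even though neither the leaf-level maximum nor the scaling agrees between the models, which illustrates how the two can compensate.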
 Our results indicate that better LUE parameter values and leaf-to-canopy scaling will improve overall performance in simulated GPP, although we could not delve into individual models to identify the correct parameters and the best values. The number, nomenclature, definition, and units of the various parameters that define LUE differ widely among models. Model developers can use data assimilation of, for example, eddy covariance data to estimate parameter values [Hanson et al., 2004]. The LOTEC model illustrates the potential to improve performance by estimating the maximum potential GPP parameters with data assimilation. However, due to differences among models, parameter values estimated for one model may not work in another. The TRY database of plant characteristics [Kattge et al., 2011] includes observations of Vcmax, Jmax, and α compiled from many studies that could potentially minimize these inter-model differences. The leaf-to-canopy scaling depends on the assumed variation of light levels and parameter values within the canopy, which may require additional field observations. For example, measuring leaf-level nitrogen content, which determines Vcmax, is relatively easy, but what models need is canopy-level nitrogen content, which, unfortunately, is much more difficult to measure. Changes to how models treat the distribution of light within the canopy could improve the leaf-to-canopy scaling, such as better canopy radiative transfer models coupled to the GPP models or separating sunlit and shaded leaves, but developers have to demonstrate that such changes improve GPP performance. The observed LUE shows strong variability within PFT classes and the biases may result from the fact that the models assume constant parameter values for each PFT. Although spatially explicit maps of LUE parameter values currently do not exist, remote sensing of canopy nitrogen shows promise [Ollinger et al., 2008]. However, nitrogen in Rubisco is the variable of interest and relating total nitrogen to a canopy Vcmax in a way that will work for all models will require new theoretical development.
3.4. Humidity Response
 We found that difficulties in simulating GPP under dry conditions can explain why models performed worse at grassland and savanna sites than at forest sites. Figure 7 shows normalized humidity response curves for the US-Var grassland and US-Ha1 deciduous broadleaf forest sites. We normalized the humidity response curves to emphasize the shape of the curves, which were typical of all sites: low GPP under dry conditions, an optimal GPP at 70%–80% relative humidity, and a decrease for higher humidity associated with colder temperatures. Models calculate lower GPP under stressed conditions using scaling factors that represent the ratio of stressed to optimal GPP. The scaling factors determine the shape of the response curves in Figure 7, but the simulated LUE determines the GPP magnitude. Decreased GPP at low relative humidity can be caused by humidity stress reducing stomatal conductance, by high temperature stress, or by drier soils with reduced water availability (drought stress). Half of the models overpredicted GPP at both sites under low humidity conditions (relative humidity less than 60%). Such dry conditions occurred only 23% of the time at US-Ha1, but occurred 46% of the time at US-Var, which has dry summers with near zero growth. Even though half the models did not capture GPP under dry conditions at both sites, the effect on performance was much stronger at US-Var because the dry periods occurred twice as often as at US-Ha1. This explains the poor performance at the evergreen needleleaf forest sites CA-SJ1 and CA-SJ2, where the dry periods occurred nearly as often as at US-Var. Essentially, the more often the dry periods occurred, the worse the performance.
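A normalized response curve and the frequency of dry conditions can be computed along the following lines. This is a sketch under assumptions: the 10% bin width, normalization by the largest bin mean, and the function names are illustrative choices rather than the procedure used for Figure 7, and the data are made up. Only the 60% relative humidity threshold comes from the text.

```python
def binned_response(rel_humidity, gpp, bin_width=10.0):
    """Mean GPP per relative-humidity bin, normalized by the largest bin mean."""
    last_bin = int(100 // bin_width) - 1  # put 100% humidity in the top bin
    sums, counts = {}, {}
    for rh, g in zip(rel_humidity, gpp):
        b = min(int(rh // bin_width), last_bin)
        sums[b] = sums.get(b, 0.0) + g
        counts[b] = counts.get(b, 0) + 1
    means = {b: sums[b] / counts[b] for b in sums}
    peak = max(means.values())
    if peak <= 0.0:
        raise ValueError("no positive GPP to normalize by")
    # Keys are the lower edge of each bin in percent relative humidity
    return {b * bin_width: m / peak for b, m in sorted(means.items())}


def dry_fraction(rel_humidity, threshold=60.0):
    """Fraction of records with relative humidity below the threshold."""
    return sum(1 for rh in rel_humidity if rh < threshold) / len(rel_humidity)


# Illustrative daily values, not observations
rh = [35.0, 45.0, 55.0, 65.0, 75.0, 85.0, 95.0]
gpp = [1.0, 2.0, 4.0, 6.0, 8.0, 6.0, 3.0]
curve = binned_response(rh, gpp)
frac_dry = dry_fraction(rh)
```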
 The NEE partitioning algorithm can partly explain the model-data mismatch at drier sites. The algorithm was not designed for drier sites where soil moisture has greater influence on Reco than temperature. Consequently the algorithm either did not converge or produced negative GPP, which we changed to zero as described above. However, filtering out these zero GPP values did not change model performance at these sites. The partitioning algorithm could be improved to account for moisture, but the models also need to improve simulated GPP under dry conditions.
 Determining exactly how to improve simulated GPP under dry conditions was not possible in our analysis because the effects of drought and humidity stress are intertwined. Periods of low rainfall simultaneously reduce both soil moisture and atmospheric humidity, making the associated effects of drought and humidity stress on GPP difficult to separate. Models that account for drought stress typically calculate a GPP scaling factor using either input precipitation or plant water availability from simulated soil moisture [Schaefer et al., 2008; Potter et al., 1993]. Models that account for humidity stress either calculate a GPP scaling factor based on humidity or directly reduce stomatal conductance [Heinsch et al., 2003; Baker et al., 2008]. Thus, overpredicting GPP under dry conditions could result from problems with the simulated soil moisture, the calculation of plant water availability, or the representation of humidity stress.
 To complicate matters, a model's representation of humidity stress can compensate for poor representation of drought stress, and vice versa. For example, the MODIS algorithm above does not account for drought stress at all, but the humidity response was strengthened to compensate [Heinsch et al., 2003], such that MODIS reproduced the shape (but not magnitude) of the observed humidity response curves. Determining whether models should improve simulated soil moisture, drought stress, or humidity stress requires a simultaneous analysis of simulated and observed soil moisture, latent heat flux, and GPP, which is beyond the scope of our analysis.
3.5. Temperature Response
 Figure 8 shows a typical temperature response function for the evergreen needleleaf forest site, US-Ho1, which had simulations from 23 models and a marginal overall performance (X2 = 1.8). US-Ho1 was the site closest to the “average” temperature response function for all sites. The observed optimal temperature for GPP was 19°C and the average across all sites was 20 ± 5°C. Differences between simulated and observed GPP near the peak or optimal temperature reflected differences in simulated and observed LUE in summer, as described above. For US-Ho1, GPP shut down for daily average temperatures below −6.5°C and the average low temperature cutoff across all sites was −6 ± 3°C. Cutoff temperatures below zero reflected conditions in spring and fall where daytime temperatures were above freezing to allow photosynthesis while the nighttime temperatures were below freezing, resulting in a negative daily average air temperature. The average winter season at US-Ho1, defined as the time with daily average temperature below 0°C, was 92 days and the average for all sites was 75 ± 47 days.
 Temperature was the dominant control of the seasonal variation of GPP at most sites. The simulated start of GPP in spring and stop of GPP in fall represent phenology. Models primarily use temperature to control phenology, but how this is done varies widely. Some models use growing degree days to predict bud burst in spring while others simply shut down GPP below a lower temperature limit. How models simulate GPP at temperatures near 0°C determined the start and stop of the simulated growing season.
 The models tended to overpredict GPP at low temperatures: half of all simulations predicted more than double the observed GPP at temperatures below 0°C. The observations indicated that no more than 0.8 ± 0.6% of total annual GPP occurred in winter, but the models simulated anywhere from 0% to 15% of the annual GPP in winter. Part of this may have resulted from the partitioning algorithm itself, which set GPP to zero when the soil and air temperatures were below zero. However, observations indicate this is realistic, since the recovery of photosynthesis after freezing temperatures can be delayed for weeks with repeated exposure to frost, and cold and frozen soils limit root uptake of water and stomatal conductance [Strand and Öquist, 1985; Waring and Winner, 1996]. Overpredicting GPP under cold conditions explained the positive biases in winter, spring, and fall, which in turn resulted in uptake starting earlier than observed in spring and ending later than observed in fall.
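The winter share of annual GPP can be computed directly from daily values, using the definition of winter given above (daily average temperature below 0°C). The sketch below is illustrative only: the function name is hypothetical and the five daily values are made up rather than taken from any site.

```python
def winter_gpp_fraction(daily_temp_c, daily_gpp):
    """Fraction of total GPP on days with average temperature below 0 degC."""
    total = sum(daily_gpp)
    if total <= 0.0:
        raise ValueError("total GPP must be positive")
    winter = sum(g for t, g in zip(daily_temp_c, daily_gpp) if t < 0.0)
    return winter / total


# Illustrative values: two sub-zero days with almost no uptake
temps = [-5.0, -1.0, 3.0, 15.0, 20.0]
gpp = [0.0, 0.1, 1.0, 4.0, 4.9]
fraction = winter_gpp_fraction(temps, gpp)
```

Applying the same function to simulated and estimated GPP gives a single number per site that isolates the cold-season bias from the summer LUE bias.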
 Better low temperature inhibition functions will improve simulated GPP in winter, spring, and fall and improve simulated phenology. Most models use an exponential or “Q10” response function to represent the effects of low temperature on GPP:
where ST is a temperature scaling factor applied to GPP, T is temperature (°C), and Tref is a reference temperature (°C). All models based on the Farquhar et al. EK model, for example, use this type of formulation. Most models have a second exponential function with separate Q10 and Tref values to reflect reduced GPP for high temperatures. Combined, the low and high temperature functions produce an optimal temperature for simulated GPP. The combined temperature scaling factor represents the ratio of actual to optimal GPP and varies between zero and one. The simulated LUE determined GPP magnitude under optimal conditions in mid-summer, but the temperature scaling determined the seasonal cycle in simulated GPP and simulated phenology. The exact values of Q10 and Tref vary widely between models and we made no attempt to determine which values are correct.
 The positive bias in winter, spring, and fall resulted from the fact that the Q10 function alone will never reach zero no matter how cold the temperature, so those models using this type of formulation without a frost inhibition function overpredicted GPP at low temperatures. The frost inhibition function is an additional scaling factor that shuts down GPP below a specified threshold temperature [Kucharik and Twine, 2007; Li et al., 2010]. In addition, some models include a GPP recovery period after the frost event [Baker et al., 2008; Schaefer et al., 2008]. Data on photosynthesis at low temperatures are relatively scarce, so developing a low-temperature inhibition function to incorporate the effects of nutrient and water availability in partially frozen soils may require more observations. Improving the modeled low temperature inhibition function will improve simulated GPP in spring and fall, and thus simulated phenology.
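The behavior described above can be illustrated with a generic Q10 function. The sketch assumes the common form Q10 raised to the power (T − Tref)/10, which may differ from the formulation in any given model, and adds a hard cutoff as the simplest possible frost inhibition. The default values (Q10 = 2, Tref = 20°C, cutoff at −6°C), the cap at one, and the function names are assumptions for illustration; only the −6°C figure echoes the average cutoff reported above.

```python
def q10_scale(temp_c, q10=2.0, t_ref=20.0):
    """Generic Q10 response: positive at every temperature, never zero."""
    return q10 ** ((temp_c - t_ref) / 10.0)


def low_temp_scale(temp_c, q10=2.0, t_ref=20.0, t_frost=-6.0):
    """Q10 response with a hard frost cutoff, capped at one (assumed cap)."""
    if temp_c < t_frost:
        return 0.0  # frost inhibition shuts GPP down completely
    return min(1.0, q10_scale(temp_c, q10, t_ref))


# At -20 degC the plain Q10 factor is still positive; the cutoff removes it
without_frost = q10_scale(-20.0)
with_frost = low_temp_scale(-20.0)
```

A hard cutoff is only one option; a smooth inhibition function or a recovery period after frost would follow the same pattern of multiplying the Q10 factor by an additional term.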
 None of the models in this study match estimated GPP within the range of uncertainty of observed fluxes. On average, the models achieved good performance for only 12% of the simulations. Two models achieved overall marginal performance, matching estimated GPP within roughly two times the uncertainty. Our first hypothesis proved false: we found no statistically significant differences in performance due to model structure, mainly due to the large spread in performance among models and across sites. The models in our study reproduced the observed seasonal pattern with little or no GPP in winter and peak GPP in summer, but did not capture the observed GPP magnitude. We found, on average, that models overestimated GPP in spring and fall and underestimated GPP in summer. Our second hypothesis proved true: model performance depended on how models represented the GPP response to changing environmental conditions. We identified three areas of model improvement: simulated LUE, low temperature response function, and GPP response under dry conditions.
 The poor overall model performance resulted primarily from inadequate representation of observed LUE. Simulated LUE is controlled by the leaf-to-canopy scaling strategy and a small set of model parameters that define the maximum potential GPP, such as εmax (light use efficiency), Vcmax (unstressed Rubisco catalytic capacity) or Jmax (the maximum electron transport rate). The temperature, humidity, and drought scaling factors determined temporal variability in simulated GPP, but the LUE parameters determined the magnitude of simulated GPP. To improve simulated GPP, model developers should focus first on improving the leaf-to-canopy scaling and the values of those model parameters that control the LUE.
 Many models overpredicted GPP under dry conditions. The effect on model performance increases at sites where dry conditions occur more frequently. Because dry conditions occur more frequently at grassland and savanna sites than at forest sites, models on average performed worse at grassland and savanna sites than at forest sites. Improving how models simulate soil moisture, drought stress, or humidity stress can improve simulated GPP under dry conditions.
 Many models overpredicted GPP under cold conditions, partly explaining the positive bias in simulated GPP in winter, spring, and fall. The estimated GPP completely shut down for daily average temperatures less than −6°C, but the Q10 formulation used by many models did not shut down GPP under cold or frozen conditions. The simulated GPP started too early in spring and persisted too late in fall, resulting in a positive bias and phasing errors in phenology. Using an ensemble mean can cancel out errors in phenology, but does not cancel out bias. Improving or imposing a low temperature inhibition function in the GPP model will resolve the problem.
 We thank the North American Carbon Program Site-Level Interim Synthesis team and the Oak Ridge National Laboratory Distributed Active Archive Center for collecting, organizing, and distributing the model output and flux observations required for this analysis. We thank Dennis Baldocchi, Lawrence Flanagan, Ni Golaz, Gabrial Katul, Kim Novick, Paul Stoy, and Shashi B. Verma for providing valuable data and advice during the development of this paper. This research was partly funded by NOAA Award NA07OAR4310115 and U.S. National Science Foundation grant ATM-0910766; funding was also provided by the U.S. Department of Energy's Office of Science for AmeriFlux Science Team research to develop measurement and data submission protocols and conduct quality assurance of measurements for AmeriFlux investigators (Grant DE-FG02-04ER63911).