A model-data intercomparison of CO2 exchange across North America: Results from the North American Carbon Program site synthesis



[1] Our current understanding of terrestrial carbon processes is represented in various models used to integrate and scale measurements of CO2 exchange from remote sensing and other spatiotemporal data. Yet assessments are rarely conducted to determine how well models simulate carbon processes across vegetation types and environmental conditions. Using standardized data from the North American Carbon Program we compare observed and simulated monthly CO2 exchange from 44 eddy covariance flux towers in North America and 22 terrestrial biosphere models. The analysis period spans ∼220 site-years, 10 biomes, and includes two large-scale drought events, providing a natural experiment to evaluate model skill as a function of drought and seasonality. We evaluate models' ability to simulate the seasonal cycle of CO2 exchange using multiple model skill metrics and analyze links between model characteristics, site history, and model skill. Overall model performance was poor; the difference between observations and simulations was ∼10 times observational uncertainty, with forested ecosystems better predicted than nonforested. Model-data agreement was highest in summer and in temperate evergreen forests. In contrast, model performance declined in spring and fall, especially in ecosystems with large deciduous components, and in dry periods during the growing season. Models used across multiple biomes and sites, the mean model ensemble, and a model using assimilated parameter values showed high consistency with observations. Models with the highest skill across all biomes all used prescribed canopy phenology, calculated NEE as the difference between GPP and ecosystem respiration, and did not use a daily time step.

1. Introduction

[2] There is a continued need for models to improve consistency and agreement with observations [Friedlingstein et al., 2006], both overall and under more frequent extreme climatic events related to global environmental change such as drought [Trenberth et al., 2007]. Past validation studies of terrestrial biosphere models have focused on only a few models and sites, typically in close proximity and primarily in forested biomes [e.g., Amthor et al., 2001; Delpierre et al., 2009; Grant et al., 2005; Hanson et al., 2004; Granier et al., 2007; Ichii et al., 2009; Ito, 2008; Siqueira et al., 2006; Zhou et al., 2008]. Furthermore, assessing model-data agreement relative to drought requires, in addition to high-quality observed CO2 exchange data, a reliable drought metric as well as a natural experiment across sites and drought conditions.

[3] Drought is a recurring phenomenon in all climates [Larcher, 1995] and is characterized by a partial loss in plant function due to water limitation and heat stress. For terrestrial CO2 exchange, drought typically reduces photosynthesis more than respiration [Baldocchi, 2008; Ciais et al., 2005; Schwalm et al., 2010], resulting in decreased net carbon uptake from the atmosphere. In the recent past, drought conditions have become more prevalent globally [Dai et al., 2004] and in North America [Cook et al., 2004b]. Both the incidence and severity of drought [Seager et al., 2007] as well as heatwaves [Meehl and Tebaldi, 2004] are expected to further increase in conjunction with global warming [Houghton et al., 2001; Huntington, 2006; Sheffield and Wood, 2008; Trenberth et al., 2007].

[4] In this study, we evaluate model performance using terrestrial CO2 flux data and simulated fluxes collected from 1991 to 2007. This timeframe included two widespread droughts in North America: (1) the turn-of-the-century drought from 1998 to 2004 that was centered in the western interior of North America [Seager, 2007] and (2) a smaller-scale drought event in the southern continental United States from winter of 2005/2006 through October 2007 [Seager et al., 2009]. During these events Palmer Drought Severity Index values [Cook et al., 2007; Dai et al., 2004] and precipitation anomalies [Seager, 2007; Seager et al., 2009] were highly negative over broad geographic areas. Ongoing eddy covariance measurements [Baldocchi et al., 2001], active throughout the aforementioned drought periods, provided flux data across gradients of time, space, seasonality, and drought. We use these data to examine model skill relative to site-specific drought severity, climatic season, and time. We also link model behavior to model architecture and site-specific attributes. Specifically, we address the following questions: Are current state-of-the-art terrestrial biosphere models capable of simulating CO2 exchange subject to gradients in dryness and seasonality? Are these models able to reproduce the seasonal variation of observed CO2 exchange across sites? Are certain characteristics of model structure coincident with better model-data agreement? Which biomes are simulated poorly/well?

2. Methods

2.1. Observed and Simulated CO2 Exchange

[5] Modeled and observed net ecosystem exchange (NEE, net carbon balance including soils where positive values indicate outgassing of CO2 to the atmosphere) data were analyzed from 21 terrestrial biosphere models (Table 1) and 44 eddy covariance (EC) sites spanning ∼220 site-years and 10 biomes in North America (Table 2). All terrestrial biosphere models analyzed simulated carbon cycling with process based formulations of varying detail for component carbon fluxes. Simulated NEE was based on model-specific runs using gap-filled observed weather at each site and locally observed values of soil texture according to a standard protocol [Ricciuto et al., 2009] (http://www.nacarbon.org/nacp/), including a target NEE of zero integrated over the last 5 years of the simulation period. In addition, a mean model ensemble (hereafter: MEAN) was also analyzed. MEAN was calculated as the mean monthly value across all simulations. Furthermore, in contrast to other models, the parameter values used in the model LoTEC were optimized using data assimilation [Ricciuto et al., 2008]. LoTEC simulations were however retained when calculating MEAN as their effect on model skill was negligible due to the relatively small number of site-months simulated.

Table 1. Summary of Model Characteristics
Model AttributeModel 
Temporal ResolutionHalf-hourlyDailyDailyHalf-hourlyHalf-hourlyDailyDailyHourlyHalf-hourlyMonthly 
Vegetation Pools4473463998 
Soil Pools7947339945 
Soil Layers11317321015910 
Canopy PhenologyPrognosticSemiprognosticPrognosticPrognosticPrognosticSemiprognosticPrognosticPrognosticPrognosticPrognostic 
Nitrogen CycleYesYesYesYesYesYesYesYesYesYes 
Gross Primary Productivity (GPP)Enzyme Kinetic ModelEnzyme Kinetic ModelStomatal Conductance ModelEnzyme Kinetic ModelEnzyme Kinetic ModelStomatal Conductance ModelLight Use Efficiency ModelEnzyme Kinetic ModelEnzyme Kinetic ModelLight Use Efficiency Model 
Heterotrophic Respiration (HR)First or Greater Order ModelAir Temperature Soil Temperature Precipitation Soil Moisture Evaporation Soil Carbon Soil NitrogenSoil Temperature Soil Moisture Soil CarbonFirst or Greater Order ModelFirst or Greater Order ModelDecay Methane Air Temperature Soil Temperature Litter and Soil Carbon Soil Nitrogen Soil MoistureDecay Methane Soil Temperature Precipitation Soil Moisture Soil Carbon Vegetation Carbon Soil NitrogenDecay Methane CO2 Diffusion Dissolved Carbon Loss Soil Temperature Soil Moisture Surface Incident Shortwave Radiation Surface Incident Longwave Radiation Soil Carbon Vegetation Carbon Soil Nitrogen Leaf NitrogenSoil Temperature Soil Moisture Soil Carbon Soil NitrogenSoil Temperature Soil Moisture Soil Carbon Dissolved Carbon Loss Vegetation Carbon Soil Nitrogen 
Autotrophic Respiration (AR)Air Temperature Soil Temperature Precipitation Soil Moisture Surface Incident Shortwave Radiation Surface Incident Longwave Radiation Vegetation CarbonAir Temperature GPPAir Temperature Vegetation Carbon Leaf NitrogenAir Temperature Soil Temperature Precipitation Soil Moisture Surface Incident Shortwave Radiation Surface Incident Longwave Radiation Vegetation CarbonFraction of Instantaneous GPPAir Temperature Vegetation Carbon Leaf Nitrogen GPPSoil TemperatureAir Temperature Soil Temperature Vegetation Carbon Leaf NitrogenAir Temperature Soil Temperature Vegetation Carbon Leaf Nitrogen GPPProportional to Growth 
Ecosystem Respiration (R)AR + HRAR + HRAir Temperature Soil Temperature Soil Moisture Soil Carbon Vegetation Carbon LAIAR + HRAR + HRAR + HRAR + HRAR + HRAR + HRAR + HR 
Net Primary Production (NPP)GPP - ARGPP - ARSurface Incident Shortwave Radiation Vapor Pressure Deficit CO2 Vegetation Carbon Leaf Nitrogen LAIGPP - ARFraction of Instantaneous GPPGPP - ARAir Temperature Precipitation Soil Moisture Potential Evaporation Vegetation Carbon Soil Nitrogen Leaf Nitrogen fPARGPP - ARGPP - ARAir Temperature Precipitation Soil Carbon Soil Nitrogen Soil Moisture Vegetation Carbon Leaf Nitrogen LAI 
Net Ecosystem Exchange (NEE)NPP - HRNPP - HRSoil Temperature Soil Moisture Surface Incident Shortwave Radiation Vapor Pressure DeficitNPP - HRGPP - RNPP - HRNPP - HRGPP - RNPP - HRNPP - HR 
Biomes SimulatedCroplands6810910Croplands1066 
Sites Simulated510362731335392510 
Months Simulated192945200119782082224619224501684658 
SourceKucharik and Twine [2007]Liu et al. [1999]Thornton et al. [2005]Williamson et al. [2008]Arain et al. [2006]Tian et al. [2010]Li et al. [2010]Grant et al. [2005]Medvigy et al. [2009]Liu et al. [2003] 
Model AttributeModel
Temporal ResolutionDailyHalf-hourlyHalf-hourlyDailyHalf-hourlyHalf-hourly10 minHalf-hourlyHalf-hourlyHourlyHalf-hourly
Vegetation Pools30438084030
Soil Pools01528051050
Soil Layers15014201015103100
Canopy PhenologyPrognosticPrescribedPrognosticPrognosticPrognosticPrescribedPrescribedPrognosticPrescribedPrognosticPrescribed
Nitrogen CycleYesNoNoNoNoYesNoYesNoNoNo
Gross Primary Productivity (GPP)NilStomatal Conductance ModelEnzyme Kinetic ModelStomatal Conductance ModelEnzyme Kinetic ModelEnzyme Kinetic ModelEnzyme Kinetic ModelEnzyme Kinetic ModelStomatal Conductance ModelStomatal Conductance ModelStomatal Conductance Model
Heterotrophic Respiration (HR)CO2 Diffusion Dissolved Carbon Loss Air Temperature Soil Temperature Precipitation Soil MoistureFirst or Greater Order ModelSoil Temperature Soil Moisture Soil CarbonSoil Temperature Soil Moisture Soil CarbonSoil Temperature Soil Moisture Soil CarbonZero-order ModelSoil Temperature Soil Moisture Soil CarbonSoil Temperature Soil CarbonZero-order ModelFirst or Greater Order ModelFirst or Greater Order Model
Autotrophic Respiration (AR)NilFraction of Instantaneous GPPAir Temperature Soil Temperature Soil Moisture Vegetation Carbon GPPAir Temperature Soil Moisture Vegetation CarbonAir Temperature Vegetation CarbonFraction of Instantaneous GPPAir Temperature Soil Moisture Vegetation CarbonAir Temperature Vegetation Carbon GPPAir Temperature Soil Moisture Surface Incident Shortwave Radiation Relative Humidity LAI fPAR CO2Air Temperature Vegetation CarbonFraction of Annual GPP
Ecosystem Respiration (R)AR + HRAR + HRAR + HRAR + HRAR + HRForced Annual BalanceAR + HRForced Annual BalanceForced Annual BalanceAR + HRAR + HR
Net Primary Production (NPP)Light Use Efficiency ModelNilGPP - ARGPP - ARGPP - ARGPP - ARAir Temperature Soil Moisture CO2 Relative HumidityGPP - ARGPP - ARGPP - ARFraction of Instantaneous GPP
Net Ecosystem Exchange (NEE)NPP - HRGPP - RNPP - HRNPP - HRGPP - RGPP - RGPP - RGPP - RGPP - RGPP - RGPP - R
Biomes SimulatedCroplands569101010Croplands10103
Sites SimulatedU.S.-Ne391029353135544357
Months Simulated48909825212623322258240219228002414291
SourceCausarano et al. [2007]Riley et al. [2002]Hanson et al. [2004]Sitch et al. [2003]Krinner et al. [2005]Baker et al. [2008]Schaefer et al. [2009]Lokupitiya et al. [2009]Zhan et al. [2003]Weng and Luo [2008]Zhou et al. [2008]
Table 2. Summary of Site Characteristicsa
Site ID | Name | Priority | Country | Latitude | Longitude | Elevation (m a.s.l.) | IGBP Class | Köppen-Geiger Climate Classification

a Sources: IGBP classification, Loveland et al. [2001]; Köppen-Geiger, Peel et al. [2007]; LAI for USA sites, http://public.ornl.gov/ameriflux/; LAI for Canadian sites, Chen et al. [2006] and Schwalm et al. [2006]. Annual precipitation and mean annual air temperature are measurement period averages of meteorological inputs used to drive model simulations. NEE values show yearly integrals and associated error: one standard deviation based on uncertainty due to random noise and the friction velocity threshold aggregated to yearly values and summed in quadrature [Barr et al., 2009]. Data coverages are percentages of half-hourly NEE measurements that satisfied quality control standards (friction velocity threshold) for day- and nighttime separately. Priority: (1) Primary sites with complete (includes ancillary and biological data templates) records; (2) Secondary chronosequence sites. Standardized Precipitation Index available only for Priority 1 sites excluding US-Atq, US-Brw, US-Dk2, US-IB1, and US-Shd. CA-TP3, US-Atq, US-Brw, US-Dk2, and US-Me4 sites used postprocessing protocol from the La Thuile and Asilomar FLUXNET Synthesis data set (http://www.fluxdata.org/) [Moffat et al., 2007; Papale et al., 2006] and lack NEE uncertainties. Biome is a combination of IGBP class and Köppen-Geiger climate, with the following exceptions: US-Atq and US-Brw, Arctic wetlands classified as tundra biome; CA-WP1, treed fen (IGBP mixed forest) grouped with wetlands biome; US-Los, shrub wetland site (IGBP closed shrublands) grouped with wetlands biome; US-SO2, closed shrublands grouped with shrublands (open or closed) biome. IGBP class and biome codes: CRO, croplands; CSH, closed shrublands; GRA, grasslands; ENF, evergreen needleleaf forest; ENFB, evergreen needleleaf forest-boreal zone; ENFT, evergreen needleleaf forest-temperate zone; DBF, deciduous broadleaf forest; MF, mixed (deciduous/evergreen) forest; WSA, woody savanna; SHR, shrublands; TUN, tundra; WET, wetlands.

CA-Ca1 | British Columbia - Campbell River - Mature Forest Site | 1 | Canada | 49.87 | −125.33 | 300 | ENF | Maritime temperate
CA-Ca2 | British Columbia - Campbell River - Clearcut Site | 2 | Canada | 49.87 | −125.29 | 180 | ENF | Maritime temperate
CA-Ca3 | British Columbia - Campbell River - Young Plantation Site | 2 | Canada | 49.53 | −124.90 | 165 | ENF | Maritime temperate
CA-Gro | Ontario - Groundhog River - Mature Boreal Mixed Wood | 1 | Canada | 48.22 | −82.16 | 300 | MF | Warm summer continental
CA-Let | Lethbridge | 1 | Canada | 49.71 | −112.94 | 960 | GRA | Warm summer continental
CA-Mer | Eastern Peatland - Mer Bleue | 1 | Canada | 45.41 | −75.52 | 70 | WET | Warm summer continental
CA-Oas | Sask. - SSA Old Aspen | 1 | Canada | 53.63 | −106.20 | 530 | DBF | Continental subarctic
CA-Obs | Sask. - SSA Old Black Spruce | 1 | Canada | 53.99 | −105.12 | 629 | ENF | Continental subarctic
CA-Ojp | Sask. - SSA Old Jack Pine | 1 | Canada | 53.92 | −104.69 | 579 | ENF | Continental subarctic
CA-Qfo | Quebec Mature Boreal Forest Site | 1 | Canada | 49.69 | −74.34 | 382 | ENF | Continental subarctic
CA-SJ1 | Sask. - 1994 Harvested Jack Pine | 2 | Canada | 53.91 | −104.66 | 580 | ENF | Continental subarctic
CA-SJ2 | Sask. - 2002 Harvested Jack Pine | 2 | Canada | 53.94 | −104.65 | 518 | ENF | Continental subarctic
CA-SJ3 | Sask. - SSA 1975 Harvested Young Jack Pine | 2 | Canada | 53.88 | −104.64 | 511 | ENF | Continental subarctic
CA-TP3 | Ontario - Turkey Point Middle-aged White Pine | 2 | Canada | 42.71 | −80.35 | 219 | ENF | Warm summer continental
CA-TP4 | Ontario - Turkey Point Mature White Pine | 1 | Canada | 42.71 | −80.36 | 219 | ENF | Warm summer continental
CA-WP1 | Western Peatland - LaBiche-Black Spruce/Larch Fen | 1 | Canada | 54.95 | −112.47 | 540 | MF | Continental subarctic
U.S.-ARM | OK - ARM Southern Great Plains Site - Lamont | 1 | USA | 36.61 | −97.49 | 310 | CRO | Humid subtropical
U.S.-Atq | AK - Atqasuk | 1 | USA | 70.47 | −157.41 | 16 | WET | Tundra
U.S.-Brw | AK - Barrow | 1 | USA | 71.32 | −156.63 | 1 | WET | Tundra
U.S.-Dk2 | NC - Duke Forest - Hardwoods | 1 | USA | 35.97 | −79.10 | 160 | DBF | Humid subtropical
U.S.-Dk3 | NC - Duke Forest - Loblolly Pine | 1 | USA | 35.98 | −79.09 | 163 | ENF | Humid subtropical
U.S.-Ha1 | MA - Harvard Forest EMS Tower (HFR1) | 1 | USA | 42.54 | −72.17 | 303 | DBF | Warm summer continental
U.S.-Ho1 | ME - Howland Forest (Main Tower) | 1 | USA | 45.20 | −68.74 | 60 | ENF | Warm summer continental
U.S.-IB1 | IL - Fermi National Accelerator Laboratory - Batavia (Agricultural Site) | 1 | USA | 41.86 | −88.22 | 227 | CRO | Hot summer continental
U.S.-IB2 | IL - Fermi National Accelerator Laboratory - Batavia (Prairie Site) | 1 | USA | 41.84 | −88.24 | 227 | GRA | Hot summer continental
U.S.-Los | WI - Lost Creek | 1 | USA | 46.08 | −89.98 | 480 | CSH | Warm summer continental
U.S.-MMS | IN - Morgan Monroe State Forest | 1 | USA | 39.32 | −86.41 | 275 | DBF | Humid subtropical
U.S.-MOz | MO - Missouri Ozark Site | 1 | USA | 38.74 | −92.20 | 219 | DBF | Humid subtropical
U.S.-Me2 | OR - Metolius - Intermediate Aged Ponderosa Pine | 1 | USA | 44.45 | −121.56 | 1253 | ENF | Dry-summer subtropical
U.S.-Me3 | OR - Metolius - Second Young Aged Pine | 2 | USA | 44.32 | −121.61 | 1005 | ENF | Dry-summer subtropical
U.S.-Me4 | OR - Metolius - Old Aged Ponderosa Pine | 2 | USA | 44.50 | −121.62 | 915 | ENF | Dry-summer subtropical
U.S.-Me5 | OR - Metolius - First Young Aged Pine | 2 | USA | 44.44 | −121.57 | 1183 | ENF | Dry-summer subtropical
U.S.-NR1 | CO - Niwot Ridge Forest (LTER NWT1) | 1 | USA | 40.03 | −105.55 | 3050 | ENF | Continental subarctic
U.S.-Ne1 | NE - Mead - Irrigated Continuous Maize Site | 1 | USA | 41.17 | −96.48 | 361 | CRO | Hot summer continental
U.S.-Ne2 | NE - Mead - Irrigated Maize - Soybean Rotation Site | 1 | USA | 41.16 | −96.47 | 361 | CRO | Hot summer continental
U.S.-Ne3 | NE - Mead - Rainfed Maize - Soybean Rotation Site | 1 | USA | 41.18 | −96.44 | 361 | CRO | Hot summer continental
U.S.-PFa | WI - Park Falls/WLEF | 1 | USA | 45.95 | −90.27 | 485 | MF | Warm summer continental
U.S.-SO2 | CA - Sky Oaks - Old Stand | 1 | USA | 33.37 | −116.62 | 1392 | CSH | Dry-summer subtropical
U.S.-Shd | OK - Shidler- Oklahoma | 1 | USA | 36.93 | −96.68 | 350 | GRA | Humid subtropical
U.S.-Syv | MI - Sylvania Wilderness Area | 1 | USA | 46.24 | −89.35 | 540 | MF | Warm summer continental
U.S.-Ton | CA - Tonzi Ranch | 1 | USA | 38.43 | −120.97 | 177 | WSA | Dry-summer subtropical
U.S.-UMB | MI - University of Michigan Biological Station | 1 | USA | 45.56 | −84.71 | 234 | DBF | Warm summer continental
U.S.-Var | CA - Vaira Ranch - Ione | 1 | USA | 38.41 | −120.95 | 129 | GRA | Dry-summer subtropical
U.S.-WCr | WI - Willow Creek | 1 | USA | 45.81 | −90.08 | 520 | DBF | Warm summer continental
Site ID | Annual NEE (g C m−2) | Annual NEE Error (g C m−2) | Daytime Data Coverage (%) | Nighttime Data Coverage (%) | LAI | Annual Precipitation (mm) | Mean Annual Air Temperature (°C) | Measurement Period | Biome | Source
CA-Ca1−244.361.199266.112568.71998–2006ENFTSchwalm et al. [2007]
CA-Ca2571.731.596234.412228.82001–2006ENFTSchwalm et al. [2007]
CA-Ca391.237.991272.215549.52002–2006ENFTSchwalm et al. [2007]
CA-Gro−36.533.593344.14273.32004–2006MFMcCaughey et al. [2006]
CA-Let−132.914.396460.73356.51997–2006GRAFlanagan et al. [2002]
CA-Mer−68.521.679561.39356.21999–2006WETLafleur et al. [2003]
CA-Oas−158.028.594563.84602.31997–2006DBFBarr et al. [2004]
CA-Obs−56.316.189455.64701.62000–2006ENFBGriffis et al. [2003]
CA-Ojp−29.916.691503.44611.52000–2006ENFBGriffis et al. [2003]
CA-Qfo−13.721.0934048192.72004–2006ENFBBergeron et al. [2007]
CA-SJ128.015.387310.83440.62002–2005ENFBZha et al. [2009]
CA-SJ2117.06.189471.35370.12003–2006ENFBZha et al. [2009]
CA-SJ3−82.017.792344.36940.82004–2005ENFBZha et al. [2009]
CA-TP4−133.229.595433.59598.62002–2007ENFTPeichl and Arain [2007]
CA-WP1−195.816.496502.74811.72003–2007WETSyed et al. [2006]
U.S.-ARM−128.474.489363.162915.62000–2006CROFischer et al. [2007]
U.S.-Atq−12.8-50221.1118−10.61999–2006TUNOberbauer et al. [2007]
U.S.-Brw−72.0-49291.5108−10.91999–2002TUNHarazono et al. [2003]
U.S.-Dk2−718.1-4817109115.12003–2005DBFSiqueira et al. [2006]
U.S.-Dk3−350.0139.075375.6112614.71998–2005ENFTSiqueira et al. [2006]
U.S.-Ha1−217.465.978343.3811227.91991–2006DBFUrbanski et al. [2007]
U.S.-Ho1−223.033.470475.28186.61996–2004ENFTRichardson et al. [2009]
U.S.-IB1−269.031.392461.2971810.12005–2007CROPost et al. [2004]
U.S.-IB2−86.042.080495.3881810.42004–2007GRAPost et al. [2004]
U.S.-Los−78.019.282544.246663.82000–2006WETSulman et al. [2009]
U.S.-MMS−346.166.397464.9110912.41999–2006DBFSchmid et al. [2000]
U.S.-MOz−305.748.994333.9173013.32004–2007DBFGu et al. [2006]
U.S.-Me2−536.065.863463.624347.62002–2007ENFTThomas et al. [2009]
U.S.-Me3−198.032.783280.524238.52004–2005ENFTVickers et al. [2009]
U.S.-Me4−612.3-55412.16418.31996–2000ENFTIrvine et al. [2004]
U.S.-Me5−206.010.697481.13507.61999–2002ENFTIrvine et al. [2004]
U.S.-NR1−37.227.089444.26632.51998–2007ENFTBradford et al. [2008]
U.S.-Ne1−424.041.893426.583211.12001–2006CROVerma et al. [2005]
U.S.-Ne2−382.041.896516.582310.82001–2006CROVerma et al. [2005]
U.S.-Ne3−258.043.394556.262710.92001–2006CROVerma et al. [2005]
U.S.-PFa45.041.185304.057365.11997–2005MFDavis et al. [2003]
U.S.-SO222.425.68730369513.81998–2006SHRLuo et al. [2007]
U.S.-Shd−75.522.096495.9117914.81997–2001GRASuyker et al. [2003]
U.S.-Syv48.534.753514.17004.42001–2006MFDesai et al. [2005]
U.S.-Ton−67.852.077250.654916.42001–2007WSAMa et al. [2007]
U.S.-UMB−132.042.486394.236297.41998–2006DBFSchmid et al. [2003]
U.S.-Var7.3110.680222.456315.92001–2007GRAMa et al. [2007]
U.S.-WCr−222.654.148555.367125.31998–2006DBFCook et al. [2004a]

[6] Gaps in the meteorological data record occurred at EC sites due to data quality control or instrument failure. Missing values of air temperature, humidity, shortwave radiation, and precipitation data, i.e., key model inputs, were filled using DAYMET [Thornton et al., 1997] before 2003 or the nearest available climate station in the National Climatic Data Center's Global Surface Summary of the Day (GSOD) database. Daily GSOD and DAYMET data were temporally downscaled to hourly or half-hourly values using the phasing from observed mean diurnal cycles calculated from a 15 day moving window. The phasing used a sine wave assuming peak values at 1500 local standard time (LST) and lowest values at 0300 LST. In the absence of station data a 10 day running mean diurnal cycle was used [Ricciuto et al., 2009] (http://nacp.ornl.gov/docs/Site_Synthesis_Protocol_v7.pdf).
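The sine-wave phasing described above can be illustrated with a minimal sketch. It assumes, for illustration only, that the daily record supplies a minimum and a maximum (as for air temperature); the function name and arguments are hypothetical and are not taken from the NACP protocol.

```python
import math

def downscale_daily(day_min, day_max, steps_per_day=24):
    """Spread a daily minimum/maximum over the day with a single sine wave
    that peaks at 1500 LST and reaches its lowest value at 0300 LST.

    steps_per_day=24 gives hourly values, 48 gives half-hourly values.
    """
    mean = 0.5 * (day_max + day_min)
    amplitude = 0.5 * (day_max - day_min)
    values = []
    for step in range(steps_per_day):
        hour = 24.0 * step / steps_per_day
        # The cosine equals +1 at hour 15 and -1 at hour 3.
        phase = 2.0 * math.pi * (hour - 15.0) / 24.0
        values.append(mean + amplitude * math.cos(phase))
    return values
```

Variables without a daily range (e.g., precipitation totals) would need a different treatment that this sketch does not cover.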

[7] EC data were produced by AmeriFlux and Fluxnet-Canada investigators and processed as a synthesis product of the North American Carbon Program (NACP) Site Level Interim Synthesis (http://www.nacarbon.org/nacp/). The observed NEE were corrected for storage, despiked (i.e., outlying values removed), filtered to remove conditions of low turbulence (friction velocity filtered), and gap-filled to create a continuous time series [Barr et al., 2004]. The time series included estimates of random uncertainty and uncertainty due to friction velocity filtering [Barr et al., 2004, 2009]. In this analysis, NEE was aggregated to monthly values using only non-gap-filled data, i.e., observed values deemed spurious and subsequently infilled were not considered. Coincident modeled NEE values were similarly excluded. This removed the influence of gap-filling algorithms in the comparison of observed and modeled NEE.
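The exclusion of gap-filled values can be sketched as follows. The record layout and function name are hypothetical, and the sketch assumes monthly aggregation by averaging; the text does not state whether means or sums were used.

```python
def monthly_means(records):
    """Aggregate paired observed/modeled NEE to monthly means.

    records: iterable of (year, month, nee_obs, nee_sim, gap_filled) tuples
    (a hypothetical layout, not the NACP file format).
    Returns {(year, month): (mean_obs, mean_sim)}.
    """
    sums = {}
    for year, month, obs, sim, gap_filled in records:
        if gap_filled:
            # Drop the infilled observation and the coincident model value.
            continue
        entry = sums.setdefault((year, month), [0.0, 0.0, 0])
        entry[0] += obs
        entry[1] += sim
        entry[2] += 1
    return {key: (o / n, s / n) for key, (o, s, n) in sums.items()}
```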

[8] Drought level was quantified using the 3 month Standardized Precipitation Index (SPI) [McKee et al., 1993]. Monthly SPI values were taken from the U.S. Drought Monitor (http://drought.unl.edu/DM/) whereby each tower was matched to nearby meteorological station(s) indicative of local drought conditions given proximity, topography, and human impact. This study used three drought levels: dry (SPI < −0.8), wet (SPI > +0.8), and normal (all other values). Climatic season was defined by four seasons of 3 months each with winter given by December, January, and February.
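As a minimal sketch of these two classifications (function names are hypothetical; the spring, summer, and fall months follow from winter being December through February):

```python
def drought_level(spi):
    """Classify a 3 month SPI value into the three drought levels."""
    if spi < -0.8:
        return "dry"
    if spi > 0.8:
        return "wet"
    return "normal"

def climatic_season(month):
    """Map a calendar month (1-12) to its climatic season."""
    if month in (12, 1, 2):
        return "winter"
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    if month in (9, 10, 11):
        return "fall"
    raise ValueError("month must be in 1..12")
```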

2.2. Model Skill

[9] Model-data mismatch was evaluated using normalized mean absolute error (NMAE) [Medlyn et al., 2005], the reduced χ2 statistic (χ2) [Taylor, 1996] as well as Taylor diagrams and skill (S) [Taylor, 2001]. The first metric quantifies bias, the “average distance” between observations and simulations in units of observed mean NEE:

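$$\mathrm{NMAE} = \sum_{ijkl} \frac{\left| \mathrm{NEE}_{sim} - \mathrm{NEE}_{obs} \right|}{n\,\overline{\mathrm{NEE}}_{obs}}$$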

where the overbar indicates averaging across all values, n is sample size, the subscript obs is for observations and sim is for modeled estimates. The summation is for any arbitrary data group (denoted by subscripts on the summation operator only) where subscript i is for site, j is for model, k is for climatic season, l is for drought level.
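A minimal sketch of the NMAE calculation for one data group, assuming the metric is the sum of absolute model-data differences divided by n times the mean observed NEE (the function name is hypothetical). Because mean observed NEE is negative for a net sink, the result is negative, and values closer to zero indicate better agreement.

```python
def nmae(obs, sim):
    """Normalized mean absolute error for paired monthly NEE values."""
    n = len(obs)
    mean_obs = sum(obs) / n
    total_abs_diff = sum(abs(s - o) for o, s in zip(obs, sim))
    return total_abs_diff / (n * mean_obs)
```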

[10] The second metric used to evaluate model performance was the reduced χ2 statistic. This is the squared difference between paired model and data points over observational error normalized by degrees of freedom:

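$$\chi^2 = \frac{1}{n} \sum_{ijkl} \left( \frac{\mathrm{NEE}_{sim} - \mathrm{NEE}_{obs}}{2\,\delta_{\mathrm{NEE}}} \right)^2$$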

where δNEE is the uncertainty of monthly NEE (see section 2.3), the factor of 2 normalizes the uncertainty in observed NEE to correspond to a 95% confidence interval, and the summation is across any arbitrary data group (denoted by subscripts on the summation operator). χ2 values are linked to model-data mismatch, where a value of unity indicates that model and data are in agreement relative to data uncertainty.
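A minimal sketch of the reduced χ2 calculation for one data group, assuming each model-data difference is divided by twice the monthly uncertainty, squared, summed, and divided by the sample size n (the function name is hypothetical):

```python
def reduced_chi2(obs, sim, delta_nee):
    """Reduced chi-squared for paired monthly NEE values.

    delta_nee holds one standard error of observed monthly NEE per month;
    the factor of 2 widens it to an approximate 95% confidence interval.
    """
    n = len(obs)
    total = sum(((s - o) / (2.0 * d)) ** 2
                for o, s, d in zip(obs, sim, delta_nee))
    return total / n
```

For example, a model that is off by exactly 2 δNEE in every month returns a value of one.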

[11] A final characterization of model performance used Taylor diagrams [Taylor, 2001], visual displays based on pattern matching, i.e., the degree to which simulations matched the temporal evolution of monthly NEE. Taylor plots are polar coordinate displays of the linear correlation coefficient (ρ), centered root mean squared error (RMSE; pattern error without considering bias), and the standard deviation of NEE (σ). Taylor diagrams were constructed for the mean model ensemble (MEAN) and across-site mean model performance using the full data record for each combination of site and model (ranging from 7 to 178 months). More generally, each polar coordinate point for any arbitrary data group can be scored:

$$S = \frac{2\,(1 + \rho)}{\left( \sigma_{norm} + 1/\sigma_{norm} \right)^2}$$

where S is the model skill metric bound by zero and unity where unity indicates perfect agreement, and σnorm is the ratio of simulated to observed standard deviation [Taylor, 2001].
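A minimal sketch of the skill score, assuming the form S = 2(1 + ρ)/(σnorm + 1/σnorm)^2 from Taylor [2001] with a maximum attainable correlation of one. The function name is hypothetical, and population (1/n) standard deviations are used as an assumption.

```python
import math

def taylor_skill(obs, sim):
    """Taylor skill S in [0, 1] for paired monthly NEE values."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mean_sim = sum(sim) / n
    sd_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in obs) / n)
    sd_sim = math.sqrt(sum((s - mean_sim) ** 2 for s in sim) / n)
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim)) / n
    rho = cov / (sd_obs * sd_sim)      # linear correlation coefficient
    sigma_norm = sd_sim / sd_obs       # simulated over observed variability
    return 2.0 * (1.0 + rho) / (sigma_norm + 1.0 / sigma_norm) ** 2
```

A simulation with perfect correlation but twice the observed variability scores 0.64, which shows how the metric penalizes amplitude errors even when timing is correct.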

[12] To scale model skill metrics across gradients of site, biome, model, seasonality, and dryness level we aggregated across data groups weighting each by sample size. For example, χ2 for model I, denoted by subscript j = I, is given by

$$\chi^2_{j=I} = \frac{1}{n_{j=I}} \sum_{ikl} n_{ikl}\,\chi^2_{ikl}$$

where the summation is over all sites, seasons, and levels of dryness where model I was used as denoted by subscripts i, k, and l, respectively; n_{j=I} is the total site-months simulated with model I; and χ2_{j=I} is aggregated χ2 for model I. We did not evaluate model performance for any data group with n < 3. In sum, Taylor displays and skill examined models' ability to mimic the monthly trajectory of observed NEE, the calculation of NMAE quantified bias in units of mean observed NEE, and χ2 values quantified how well model-data mismatch scales with flux uncertainty.
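The sample-size weighting can be sketched as follows, assuming each data group contributes its metric value multiplied by its number of months and that groups with n < 3 are dropped before aggregation (names are hypothetical):

```python
def aggregate_metric(groups, min_n=3):
    """Sample-size-weighted mean of a skill metric across data groups.

    groups: iterable of (n, metric) pairs, one per data group.
    Groups with fewer than min_n months are not evaluated.
    """
    kept = [(n, metric) for n, metric in groups if n >= min_n]
    total_n = sum(n for n, _ in kept)
    if total_n == 0:
        raise ValueError("no data group has enough samples")
    return sum(n * metric for n, metric in kept) / total_n
```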

2.3. Observational Flux Uncertainty

[13] We calculated the standard error of monthly NEE (δNEE) [Barr et al., 2009] by combining random uncertainty and uncertainty associated with the friction velocity threshold (u*Th), a value used to identify and reject spurious nighttime NEE measurements. Random uncertainty was estimated following Richardson and Hollinger [2007]: (1) generate synthetic NEE data using the gap-filling model [Barr et al., 2004, 2009] for a given site-year, (2) introduce gaps as in the observed data with u*Th filtering, (3) add noise, (4) infill gaps using the gap-filling model, and (5) repeat the process 1000 times for each site. The random uncertainty component of δNEE was then the standard deviation across all 1000 realizations aggregated to months.

[14] The u*Th uncertainty component of δNEE was also estimated using Monte Carlo methods. Here 1000 realizations of NEE were generated using 1000 draws from a distribution of u*Th. This distribution was based on binning the raw flux data with respect to climatic season, temperature, and site-year and estimating u*Th in each bin [Papale et al., 2006]. The standard deviation across all realizations gave the u*Th uncertainty component of δNEE. Both components were combined in quadrature to one standard error of monthly NEE (= δNEE) [Barr et al., 2009].
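Combining the two components in quadrature can be sketched as below. The inputs stand for the Monte Carlo realizations of a single monthly NEE value; the function name is hypothetical, and the population standard deviation is an assumption, since the text does not specify the estimator.

```python
import math
import statistics

def combined_uncertainty(random_realizations, ustar_realizations):
    """One standard error of monthly NEE from two Monte Carlo ensembles."""
    sd_random = statistics.pstdev(random_realizations)  # random noise component
    sd_ustar = statistics.pstdev(ustar_realizations)    # u*Th component
    return math.sqrt(sd_random ** 2 + sd_ustar ** 2)
```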

2.4. Relating Model Skill to Model Structure and Site History

[15] The models evaluated here range widely in their emphasis and structure (Table 1). Some focus on biophysical calculations (SiB3, BEPS), whereas others emphasize biogeochemistry (DLEM) or ecosystem dynamics (ED2). However, as terrestrial biosphere models simulate carbon cycling with hydrological variables, most models contain both biophysics and biogeochemistry. This motivated characterizing model structure with definite attributes, e.g., prognostic versus prescribed canopy phenology, number of soil pools, and type of NEE algorithm (Table 3). To resolve how such characteristics and site history impacted model skill we calculated S for all observed combinations of site, model, seasonality, and drought level and cross-referenced these with 13 site history variables and 14 model attributes (Table 3). Only 20 models were available for this exercise; MEAN and the optimized LoTEC were excluded. We used S as it is bound by zero (no agreement) and unity (perfect agreement) in contrast to NMAE and χ2 which are unbound. The Taylor skill metric (S) was first discretized into three classes based on terciles. These classes, representing three tiers of model-data agreement, were then related to biome, climatic season, drought level, site history, and model structure using regression tree analysis (RTA) as a supervised classification algorithm. RTA is a form of binary recursive partitioning [Breiman et al., 1984] that successively splits the data (Taylor skill classes as the response; all other attributes as predictors) into subsets (nodes) by minimizing within-subset variation. The result is a pruned tree-like topology whereby predicted values (Taylor skill metric class) are derived by a top-to-bottom traversal following the rules (branches) that govern subset membership until a predicted value is reached (terminal node). The splitting rules at each node as well as its position allow for a calculation of relative variable importance [Breiman et al., 1984] with the most important variable given a score of 100. Variables of high importance were further analyzed using conditional means, i.e., comparing mean values for each predictor value, with statistical differences determined using Bonferroni corrections for multiple comparisons [Hochberg and Tamhane, 1987].
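The tercile discretization that precedes the tree analysis can be sketched as follows. This covers only the preparation of the response variable, not the tree fitting or pruning; the class labels, the function name, and the quantile interpolation (the Python standard library default) are assumptions.

```python
import statistics

def tercile_classes(skills):
    """Assign each Taylor skill value to a low, middle, or high tercile class."""
    # quantiles(..., n=3) returns the two cut points separating the terciles.
    lower, upper = statistics.quantiles(skills, n=3)
    classes = []
    for s in skills:
        if s <= lower:
            classes.append("low")
        elif s <= upper:
            classes.append("middle")
        else:
            classes.append("high")
    return classes
```

The resulting labels would serve as the response in a classification tree, with biome, season, drought level, site history, and model attributes as predictors.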

Table 3. Model Structural and Site History Predictors Used to Classify Taylor Skill With Regression Tree Analysisa
Model temporal resolution | Daily, half-hourly or less, hourly, monthly
Canopy | Prognostic, semiprognostic, prescribed. Prescribed canopy from remote sensing; semiprognostic has some prescribed input into canopy leaf biomass but calculates phenology with other prognostic variables.
Number of vegetation pools | Number of pools, both dynamic and static
Number of soil pools | Number of pools, both dynamic and static
Number of soil layers | Number of layers
Nitrogen | True if the model has a nitrogen cycle; otherwise false.
Steady state | True if the simulated long-term NEE integral approaches zero; otherwise false.
Autotrophic respiration (AR) | Fraction of annual GPP, fraction of instantaneous GPP, explicitly calculated, nil, proportional to growth
Ecosystem respiration (R) | AR + HR, explicitly calculated, forced annual balance
Gross primary productivity (GPP) | Enzyme kinetic model, light use efficiency model, nil, stomatal conductance model
Heterotrophic respiration (HR) | Explicitly calculated, first or greater order model, zero-order model
Net ecosystem exchange (NEE) | Explicitly calculated, GPP - R, NPP - HR
Net primary productivity (NPP) | Explicitly calculated, fraction of instantaneous GPP, GPP - AR, light use efficiency model
Overall model complexity | Low, average, high. Values correspond to terciles of the total amount of first-order functional arguments for the following model-generated variables/outputs: AR, canopy leaf biomass, R, evapotranspiration, GPP, HR, NEE, NPP, soil moisture.
Site history | True if the below listed management activity or disturbance or event occurred on site; otherwise false. Grazed, fertilized, fire, harvest, herbicide, insects and pathogens, irrigation, natural regeneration, pesticide, planted, residue management, thinning
Stand age class | Young, intermediate, nil, mature, multicohort. Values based on stand age in forested sites; stands without a clear dominant stratum are treated as multicohort; nonforest types have nil.

3. Results

3.1. Model-Data Agreement Relative to Climatic Season, Dryness, and Biome

[16] Overall agreement across n = 31025 months was better in forested than nonforested biomes; both NMAE (Table 4) and χ2 values (Table 5) were closer to zero and unity, respectively. At the biome level, model skill was loosely ranked in five tiers: evergreen needleleaf forests in the temperate zone, mixed forests > deciduous broadleaf forests, evergreen needleleaf forests in the boreal zone > grasslands, woody savannahs > croplands, shrublands, wetlands > tundra. These rankings were robust across models used in the majority of biomes, although some divergence was apparent for croplands and shrublands (Figure 1). Relative to seasonality and drought level, models were most consistent with observations during periods of peak biological activity (climatic summer) and under dry conditions (Figure 2). However, across the three levels of dryness, changes in model-data agreement were negligible for NMAE (∼4% change, Table 4) but more pronounced for χ2 (from 8.10 to 12.72, Table 5). Averaged over just the warm season (excluding climatic winter), dry conditions coincided with worse model-data agreement; e.g., NMAE was −0.99, −0.91, and −0.84 for dry, normal, and wet conditions, respectively. In biomes with a clear seasonal cycle in leaf area index (LAI), a loss of model skill occurred during climatic spring and fall (Tables 4 and 5), especially for NMAE.

Figure 1.

Normalized mean absolute error (NMAE) by biome for each model. Biomes in ascending order based on model-specific NMAE; biomes on the left show better average agreement with observations. NMAE is normalized by mean observed flux. Across all sites, seasons, and drought levels within a given biome this value is negative (NEE < 0), indicating a sink. NMAE values closer to zero coincide with a higher degree of model-data agreement. Woody savannahs and shrublands not shown: only one site each. Tundra (n = 2 sites) has NMAE < −10 for all models. CN-CLASS croplands value is off-scale (= −8.98). Black cross, no observations; white circle, undersampled (n < 100 months).

Figure 2.

Normalized mean absolute error (NMAE) by climatic season and drought level. NMAE is normalized by mean observed flux such that most values are negative (NEE < 0), indicating a sink. Positive values indicate a source (NEE > 0). These occur in winter for all models as well as spring and fall for all crop only models: AgroIBIS, DNDC, EPIC, SiBcrop. Such values are displayed on the same color bar but with opposite sign. Off-scale values: AgroIBIS and SiBcrop in fall are −7.1 and −11.1, respectively. DNDC in fall and spring is −11.4 and −8.7, respectively. Black cross, no observations; white circle, undersampled (n < 100 months).

Table 4. Normalized Mean Absolute Error by Climatic Season, Drought Level, and Biome(a)
Biome(b) | Climatic Season | Drought Level | Overall
(a) Drought level was based on monthly values of the 3 month Standardized Precipitation Index (SPI): dry values were < −0.8; wet values were > +0.8. Otherwise normal conditions existed.
(b) Biome codes: CRO, cropland; GRA, grassland; ENFB, evergreen needleleaf forest-boreal zone; ENFT, evergreen needleleaf forest-temperate zone; DBF, deciduous broadleaf forest; MF, mixed (deciduous/evergreen) forest; WSA, woody savanna; SHR, shrubland; TUN, tundra; WET, wetland.
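
The drought classification in the table note can be sketched in a few lines. This is a minimal illustration using the ±0.8 thresholds stated above; the function name and the sample SPI values are hypothetical and not taken from the study.

```python
def drought_level(spi3, threshold=0.8):
    """Classify one monthly value of the 3 month SPI.

    Strict inequalities follow the table note: values exactly at
    -0.8 or +0.8 are treated as normal conditions.
    """
    if spi3 < -threshold:
        return "dry"
    if spi3 > threshold:
        return "wet"
    return "normal"


# Illustrative SPI values only (not data from the study).
monthly_spi = [-1.4, -0.8, 0.1, 0.8, 1.2]
levels = [drought_level(v) for v in monthly_spi]
```

Each site-month can then be grouped by its drought level before computing skill metrics.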

Table 5. Reduced χ2 Statistic by Climatic Season, Drought Level, and Biome(a)
Biome(b) | Climatic Season | Drought Level | Overall
(a) Drought level was based on monthly values of the 3 month Standardized Precipitation Index (SPI): dry values were < −0.8; wet values were > +0.8. Otherwise normal conditions existed.
(b) Biome codes: CRO, cropland; GRA, grassland; ENFB, evergreen needleleaf forest-boreal zone; ENFT, evergreen needleleaf forest-temperate zone; DBF, deciduous broadleaf forest; MF, mixed (deciduous/evergreen) forest; WSA, woody savanna; SHR, shrubland; WET, wetland.


3.2. Skill Metrics by Model

[17] Regardless of metric, model skill was highly variable. Of the three model skill metrics, NMAE was related to both Taylor skill and χ2 (ρ = −0.65; p < 0.0001). Jointly, high Taylor skill co-occurred with NMAE and χ2 values closer to zero and unity, respectively (Figure 3). Across models, NMAE ranged from −0.42 (LoTEC) to −2.18 (DNDC) times the overall mean observed flux. Values of χ2 varied from 2.17 for LoTEC to 29.87 for CN-CLASS. Put differently, the degree of model-data mismatch (the distance between observations and simulations) was at least 2.17 times the observational flux uncertainty. Similarly, Taylor skill showed a high degree of scatter (Figure 4), although two crop only models (SiBcrop and AgroIBIS), LoTEC, and ISOLSM were more conservative and showed a generally high degree of consistency with observations.
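
The two error metrics discussed here are defined in the methods section, which is not part of this excerpt. The sketch below therefore rests on assumptions: NMAE is taken as the mean absolute error divided by the mean observed flux (consistent with the figure captions, which state that NMAE is normalized by mean observed flux and is negative for a sink), and the reduced χ2 is taken as the mean squared residual after scaling each residual by an observational uncertainty. The function names, the `unc` argument, and the sample values are hypothetical; any additional scaling of the uncertainty used in the study is not reproduced.

```python
def nmae(obs, sim):
    """Mean absolute error normalized by the mean observed flux (assumed form)."""
    n = len(obs)
    mean_obs = sum(obs) / n
    mae = sum(abs(s - o) for o, s in zip(obs, sim)) / n
    return mae / mean_obs  # negative when the mean observed NEE is a sink


def reduced_chi2(obs, sim, unc):
    """Mean squared residual in units of observational uncertainty (assumed form)."""
    n = len(obs)
    return sum(((s - o) / u) ** 2 for o, s, u in zip(obs, sim, unc)) / n


# Illustrative monthly NEE values and uncertainties (not data from the study).
obs = [-2.0, -1.0, -3.0]
sim = [-1.0, -1.5, -2.0]
unc = [0.5, 0.5, 0.5]

example_nmae = nmae(obs, sim)
example_chi2 = reduced_chi2(obs, sim, unc)
```

Under these assumptions, a χ2 of unity means the residuals are, on average, the same size as the observational uncertainty, and an NMAE of zero means no absolute error.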

Figure 3.

Model skill metrics for all 22 models. Skill metrics are Taylor skill (S; equation (3)), normalized mean absolute error (NMAE), and reduced χ2 statistic (χ2). Better model-data agreement corresponds to the upper left corner. Benchmark represents perfect model-data agreement: S = 1, NMAE = 0, and χ2 = 1. Gray interpolated surface added and model names jittered to improve readability.

Figure 4.

Boxplots of Taylor skill by model and site. Taylor skill (S; equation (3)) is a single value summary of a Taylor diagram where unity indicates perfect agreement with observations. Panels show interquartile range (blue box), median (solid red line), range (whiskers), and outliers (red cross; values more than 1.5 × interquartile range from the median). (top) Only models (n = 21) used on at least two sites shown. (bottom) Only sites (n = 32) simulated with at least 10 unique models, excluding the mean model ensemble (MEAN) and the assimilated LoTEC, shown. Models and sites sorted by median Taylor skill.

[18] Among crop models, SiBcrop and AgroIBIS performed well, especially in climatic spring and during wet conditions. In contrast, the crop only DNDC model exhibited poor model-data agreement with χ2 > 15 in climatic spring and summer as well as across all drought levels. Although four crop only simulators were analyzed, the best agreement in croplands (NMAE and χ2 closer to zero and unity, respectively) was achieved by SiB3 and Ecosys, models used in multiple biomes. Based on all three skill metrics the LoTEC model (NMAE = −0.42, χ2 = 2.17, S = 0.95) was most consistent with observations across all sites, dryness levels, and climatic seasons. This platform was optimized using a data assimilation technique, unique among model runs evaluated here, and was applied at 10 sites. In addition, the mean model ensemble (MEAN) performed well (NMAE = −0.74, χ2 = 3.35, S = 0.80). For individual models (n = 12) used at a wider range of sites (at least 24 sites), model consistency with observations was highest for Ecosys (NMAE = −0.69, χ2 = 7.71, S = 0.94) and lowest for CN-CLASS (NMAE = −1.50, χ2 = 29.87, S = 0.48).
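
As an illustration of how a mean model ensemble such as MEAN can be formed, the following sketch averages monthly simulated NEE across models. It assumes that each month is averaged over whichever models supplied a value; the handling of missing values in the study is not described in this excerpt, and the model names and numbers are hypothetical.

```python
def ensemble_mean(simulations):
    """Average monthly NEE across models.

    simulations: dict mapping model name -> list of monthly NEE values,
    with None where a model has no value for that month. All lists are
    assumed to have the same length.
    """
    n_months = len(next(iter(simulations.values())))
    mean = []
    for m in range(n_months):
        values = [series[m] for series in simulations.values()
                  if series[m] is not None]
        # A month with no simulated value from any model stays missing.
        mean.append(sum(values) / len(values) if values else None)
    return mean


# Hypothetical model output for three months (not data from the study).
runs = {
    "model_a": [-1.0, -2.0, None],
    "model_b": [-3.0, None, None],
}
mean_nee = ensemble_mean(runs)
```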

[19] Site-level model-data agreement also showed a high degree of variability (Figure 4). At three cropland sites (US-Ne1, US-Ne2, and US-Ne3) Taylor skill ranged from zero to unity. Both NMAE and χ2 exhibited similar scatter by site (not shown). Even for the best predicted site (US-Syv), S ranged from 0.19 to 0.95. Only two forested sites (CA-Qfo and CA-TP4) were predicted well (S > 0.5) by all models, whereas only one tundra site (US-Atq) was consistently poorly predicted (S < 0.5). Despite the wide range in model performance, model skill (NMAE, χ2, and S) was not correlated with the number of sites (p > 0.5) or biomes (p > 0.3) simulated, i.e., using a more general rather than a specialized model did not result in a loss in model performance. Also, model-data agreement was not better at sites with longer data records (p > 0.1).

[20] The steady state protocol had negligible effect on model skill. Long-term simulated NEE by site and model varied from −2904 to 2227 g C m−2 yr−1 with 90% of all values between −600 and 100 g C m−2 yr−1. The extreme values were primarily croplands simulated outside of crop only models. Overall, only 5 models achieved steady state (simulated NEE → 0) over the full simulation: Biome-BGC, LPJ, SiBCASA, SiB3, and TECO. Similar to simulated values, observed annual integrals at the 44 sites examined did not show steady state (Table 1) and varied from −718 to 571 g C m−2 yr−1. Nonetheless, model skill was not related to how closely model spinup and initial conditions approximated steady state or how close a given site was to an observed NEE of zero. All three skill metrics were uncorrelated with long-term observed or simulated average annual NEE (p > 0.05). However, two models did show significant relationships: for Ecosys, χ2 increased (decrease in model skill) and S decreased as observed or simulated NEE approached zero; a system closer to steady state was coincident with less model-data agreement. BEPS was similar, showing lower S and more negative NMAE (decrease in model skill) for sites closer to steady state.

3.3. Model and Site-Specific Consistency With Observations Using Taylor Diagrams

[21] Average model performance (both across-site and across-model) was evaluated using Taylor diagrams based on all simulated and observed monthly NEE data. Better model performance was indicated by proximity to the benchmark, representing the observed state. The benchmark was normalized by observed standard deviation such that the distance of σ and RMSE from the benchmark was in observed σ units. Similar to model skill metrics, forested sites were better predicted than nonforested ones. The MEAN model showed ρ ≥ 0.2, apart from CA-SJ2 and US-Atq, but generally (33 of 44 sites) underpredicted the variability associated with monthly NEE at forested (Figure 5) and nonforested (Figure 6) sites. Similarly, 40 of 44 sites were predicted with RMSE < σ. Also, 8 of the 44 sites (6 forested and 2 cropland sites: CA-Obs, CA-Qfo, CA-TP4, US-Ho1, US-IB1, US-MMS, US-Ne3, US-UMB) were predicted with ρ ≥ 0.95 and RMSE < 1. The worst predicted site was CA-SJ2 with ρ = −0.67, σ = 4.3, and RMSE = 5.1.
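
The three quantities plotted on a Taylor diagram are linked geometrically, which the sketch below illustrates. It assumes the RMSE shown is the centered (bias-removed) RMSE of a standard Taylor diagram, and that Taylor skill S (equation (3), not reproduced in this excerpt) takes the common form 2(1 + ρ)/(σ + 1/σ)² with σ normalized by the observed standard deviation. Function names are hypothetical.

```python
import math


def taylor_stats(obs, sim):
    """Return (rho, normalized sigma, normalized centered RMSE)."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_s = sum(sim) / n
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs) / n)
    sd_s = math.sqrt(sum((s - mean_s) ** 2 for s in sim) / n)
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(obs, sim)) / n
    rho = cov / (sd_o * sd_s)
    # Centered RMSE: means are removed before differencing.
    crmse = math.sqrt(
        sum(((s - mean_s) - (o - mean_o)) ** 2 for o, s in zip(obs, sim)) / n
    )
    return rho, sd_s / sd_o, crmse / sd_o


def centered_rmse(rho, sigma_norm):
    """Law-of-cosines relation on a normalized Taylor diagram."""
    return math.sqrt(sigma_norm ** 2 + 1.0 - 2.0 * sigma_norm * rho)


def taylor_skill(rho, sigma_norm):
    """Assumed form of Taylor skill; equals 1 at the benchmark."""
    return 2.0 * (1.0 + rho) / (sigma_norm + 1.0 / sigma_norm) ** 2


# Statistics reported for the two outlying boreal sites.
rmse_sj2 = centered_rmse(-0.67, 4.3)
rmse_sj1 = centered_rmse(0.81, 3.9)
```

Under these assumptions, the reported ρ and σ for CA-SJ2 and CA-SJ1 give normalized RMSE values close to the reported 5.1 and 3.1, with small differences attributable to rounding of the published inputs.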

Figure 5.

Taylor diagram of normalized mean model performance for forested sites. Each circle (n = 26 sites) is the site-specific mean model ensemble (MEAN). Benchmark (red square) corresponds to observed normalized monthly NEE; units of σ and RMSE are multiples of observed σ. Color coding of site letter and circles indicates biome: evergreen needleleaf forest-temperate zone (red), deciduous broadleaf forest (brown), mixed (deciduous/evergreen) forest (blue), evergreen needleleaf forest-boreal zone (black). Outlying sites (evergreen needleleaf forest-boreal zone) not shown: CA-SJ1 (ρ = 0.81, σ = 3.9, RMSE = 3.1) and CA-SJ2 (ρ = −0.67, σ = 4.3, RMSE = 5.1).

Figure 6.

Taylor diagram of normalized mean model performance for nonforested sites. Each circle (n = 16 sites) is the site-specific mean model ensemble (MEAN). Benchmark (red square) corresponds to observed normalized monthly NEE; units of σ and RMSE are multiples of observed σ. Color coding of site letter and circles indicates biome: croplands (red), grasslands (brown), wetlands (blue), all other biomes (black).

[22] Overall model performance, aggregated across sites, was similar (Figure 7). Most models underpredicted variability and showed RMSE < σ. Of all 22 models only DNDC exhibited ρ < 0.2. Based on proximity to the benchmark, i.e., a high S value (Figure 3), the best models were: EPIC (crop only model used on one site), ISOLSM (used on 9 sites), LoTEC (data assimilation model), SiBcrop and AgroIBIS (crop only models), EDCM (used on 10 sites), Ecosys and SiBCASA (models used on most sites, 39 and 35, respectively), and MEAN (mean model ensemble for all 44 sites). All of these “best” models had ρ > 0.75, RMSE < 0.75 and slightly underpredicted variability; except the crop only models and Ecosys where variability was overpredicted. Models whose average behavior was furthest away from the benchmark were DNDC followed by BEPS.

Figure 7.

Taylor diagram of normalized across-site average model performance. Model σ and RMSE were normalized by observed σ. Each circle (n = 22 models) corresponds to the mean across all sites. Benchmark (red square) corresponds to observed normalized monthly NEE; units of σ and RMSE are multiples of observed σ. Color coding of model letter and circles indicates generality of model performance: specialist models used only in croplands (n ≤ 5 sites; black), generalist models used across a range of biomes and sites (n ≥ 30 sites, blue), all other models (red). The correlation for DNDC (ρ = −0.13) is displayed as zero for readability.

3.4. Links Between Model Skill, Model Structure, and Site History

[23] Biome classification was the most important factor in the distribution of model skill (Figure 8) sampled across all combinations of site, model, climatic season, and drought (n = 3132 groups). Climatic season and stand age, the highest scored site-specific attribute, followed biome as lead determinants of model skill. Of the 12 evaluated site disturbances (Table 3) only grazing, which occurred on croplands, grasslands, and woody savannahs, achieved an importance score of at least 25. Apart from drought and grazing activity, the remaining determinants were model-specific: the number of soil layers, vegetation pools, canopy phenology, and soil pools. Two carbon flux calculations also had a variable score > 25, with NEE being the highest.

Figure 8.

Variable importance scores for model-specific (blue) and site-specific (green) predictors. Scores were generated from a regression tree with the Taylor skill classes based on terciles (n = 3132) as the response. Only the 12 of 28 predictors with score > 25 shown; see Table 3 for complete listing of evaluated model structural and site attributes.
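
The response variable for the regression tree is a tercile class of Taylor skill. A minimal sketch of that binning step follows; the tree itself and its importance scores are not reproduced. The quantile method (Python's default "exclusive" interpolation) and the class labels are assumptions, and the skill values are illustrative only.

```python
import statistics


def tercile_classes(skill):
    """Assign each Taylor skill value to a tercile-based class."""
    # statistics.quantiles with n=3 returns the two tercile cut points.
    lower, upper = statistics.quantiles(skill, n=3)
    labels = []
    for s in skill:
        if s <= lower:
            labels.append("low")
        elif s <= upper:
            labels.append("medium")
        else:
            labels.append("high")
    return labels


# Illustrative skill values (not data from the study).
skill = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
classes = tercile_classes(skill)
```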

[24] Comparing mean S for these relatively important model attributes (Figure 9) revealed three instances where model structure showed a statistically significant relationship with model skill: prescribed canopy phenology, a daily time step, and calculating NEE as the difference between GPP and ecosystem respiration. Models using canopy characteristics and phenology prescribed from remotely sensed products achieved higher skill (S = 0.54) than either prognostic or semiprognostic models (S = 0.43; p < 0.05). Using a daily time step showed lower model skill (S = 0.40) relative to nondaily time steps (S = 0.50; p < 0.05). Finally, calculating NEE as the difference between GPP and total ecosystem respiration showed greater skill (S = 0.50) than other calculation methods (S = 0.42; p < 0.05). None of the other model attributes we studied showed statistically significant relationships between model structure and skill.

Figure 9.

Bar graphs of mean Taylor skill by model attribute. Whiskers represent one standard error of the mean. Only model-specific attributes with variable importance scores >25 shown. Note y axis on right panels starts at 0.4.
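
The bar heights and whiskers described in this caption correspond to a group mean and one standard error of the mean. The sketch below shows that calculation for a single attribute; it assumes the sample standard deviation (n − 1 denominator), and the attribute levels and skill values are hypothetical.

```python
import math
from collections import defaultdict


def mean_and_sem(records):
    """records: iterable of (attribute level, Taylor skill) pairs.

    Returns a dict mapping each level to (mean, standard error of the mean).
    """
    groups = defaultdict(list)
    for level, s in records:
        groups[level].append(s)
    summary = {}
    for level, values in groups.items():
        n = len(values)
        mean = sum(values) / n
        if n > 1:
            sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
            sem = sd / math.sqrt(n)
        else:
            sem = float("nan")  # standard error is undefined for one value
        summary[level] = (mean, sem)
    return summary


# Illustrative values only (not data from the study).
records = [
    ("prescribed", 0.5), ("prescribed", 0.6),
    ("prognostic", 0.4), ("prognostic", 0.4),
]
summary = mean_and_sem(records)
```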

[25] While not statistically significant, both vegetation pools and soil layers exhibited a weak pattern whereby the simplest and most complex models showed higher skill than models of intermediate complexity (Figure 9). Models with no soil model (zero soil layers) or no vegetation pools showed greater skill than models with the simplest soil model or smallest number of vegetation pools. As the number of soil layers or pools increased, so did model skill, indicating that a more comprehensive treatment of biological and physical processes can improve model skill. For vegetation pools, there was a limit where increased complexity beyond eight pools did not improve model-data agreement.

[26] Despite these effects, model attributes were of secondary importance. The change in S relative to biome varied from 0.28 to 0.55; a much larger range than seen for model attributes. Similarly, the high variable importance scores for biome and climatic season, as well as the lower score for drought level, corroborated the relationships between these factors and model skill as seen with NMAE and χ2. While the regression tree algorithm achieved an accuracy of 68.5% for predicting Taylor skill class, the site history and model characteristics considered here did not explain the underlying cause of biome and seasonal differences in model skill.

4. Discussion

4.1. Effect of Parameter Sets on Model Performance

[27] Model parameter sets are a large source of variability in terms of model performance [Jung et al., 2007b]. They influence output and accuracy [Grant et al., 2005] and are more important for accurately simulating CO2 exchange than capturing effects of interannual climatic variability [Amthor et al., 2001]. For at least some of the models studied here this can be related to the use of biome-specific parameters relative to within-biome variability [Purves and Pacala, 2008]. A corollary occurs in the context of EC observations as tower footprints can exhibit heterogeneity, particularly in soils, that is not reproduced in site-specific parameters [Amthor et al., 2001].

[28] The importance of model parameter sets was visible in this intercomparison in two ways. First, biome had the highest variable importance score. Insomuch as models rely on biome-specific parameter values, this finding indicates that model parameter sets are a key factor in the distribution of model skill. This extends to plant functional types due to the high degree of overlap between the two. Furthermore, the variability (Figure 4) in model skill across parameter sets, i.e., across models, underscores that biomes may be too heterogeneous in time [Stoy et al., 2005, 2009] and space to be well-represented by constant parameters relative to, e.g., within-biome climate variability [Hargrove et al., 2003]. Second, the generally high degree of site-specific variation in model skill (Figure 4) suggested that model parameter sets may need to be refined to capture local, site-specific realities.

4.2. Effect of Model Structure on Model Performance

[29] In general, models with the highest model-data agreement all used prescribed canopy phenology, calculated NEE as the difference between GPP and ecosystem respiration, and did not use a daily time step. Models that exhibited all of these structural characteristics (SiBCASA, SiB3, and ISOLSM) showed high degrees of model-data agreement across all three skill metrics. Similarly, Ecosys, which used a prognostic canopy but otherwise had structural characteristics similar to those of SiBCASA, also performed well. Relative to model complexity, consistency with observations was highest in those models with either the simplest structure (e.g., one soil carbon pool in ISOLSM) or the most complex (e.g., SiBCASA with 13 carbon pools). Models with a prognostic canopy seem to perform better with more carbon pools and soil layers (e.g., Ecosys). No model with a prognostic canopy and a low number of carbon pools and soil layers placed in the top tercile of model skill for any skill metric, except SiBcrop and AgroIBIS for Taylor skill in croplands. Using multimodel ensembles (MEAN) or data assimilation to optimize model parameter sets (LoTEC) can compensate for differences in model structure to improve model skill.

[30] The relationships between model structure and model skill were consistent across all biomes. As a whole, the models performed better at forested sites than nonforested sites, but the same models showed the highest consistency with observations in each biome (Ecosys and SiB3). This is true even for agriculture sites, where Ecosys and SiB3 scored as high as crop only models. This suggests that any model with requisite structural attributes can successfully simulate carbon flux in all types of ecosystems.

4.3. Links Between Model Performance and Environmental Factors

[31] Model skill was only weakly linked to drought, showing high variability across dryness level by biome and model. Only during the warm season (all climatic seasons excluding winter) did aggregate model skill decline under drought conditions. While this points to process uncertainty [Sitch et al., 2008], ecosystem response to longer-term drought can exhibit lags and positive feedbacks [Arnone et al., 2008; Granier et al., 2007; Thomas et al., 2009; Williams et al., 2009] that were not explicitly included in the drought metric used here but did influence simulation behavior through model structure, e.g., soil moisture model and soil resolution.

[32] In spring and fall, especially for biomes with a significant deciduous component, models showed a decline in model skill (Table 4) relative to periods of peak biological activity (climatic summer) [see also Morales et al., 2005]. While this was more pronounced for NMAE (Table 4) than χ2 (Table 5), phenological cues are known to influence the annual carbon balance at multiple scales [Barr et al., 2007; Delpierre et al., 2009; Keeling et al., 1996]. The loss of model skill seen in this study during spring and fall was likely linked to poor treatment of leaf initiation and senescence as well as season-specific effects of soil moisture and soil temperature on canopy photosynthesis [Hanson et al., 2004]. In this study seasonality was second only to biome in driving model skill (Figure 8). This and the lack of a link between model skill and site history strongly implicate phenology as a needed refinement of terrestrial biosphere simulators.

[33] The evergreen needleleaf forest biome diverged in performance based on whether the sites were located in the temperate or boreal zones. A similar divergence was reported using Biome-BGC, LPJ, and ORCHIDEE to simulate gross CO2 uptake across a temperature gradient in Europe [Jung et al., 2007a]; average relative RMSE was higher for evergreen needleleaf forests in the boreal zone. This was linked to an overestimation of LAI at the boreal sites and relationships between resource availability and leaf area [Friedlingstein et al., 2006; Jung et al., 2007a; Sitch et al., 2008]. Additionally, recent observations in the circumboreal region, where all boreal evergreen needleleaf forested sites are located, suggest that transient effects of climate change, e.g., increased severity and intensity of natural disturbances (fire, pest outbreaks) and divergence from climate normals in temperature, have already occurred [Soja et al., 2007] and influence resource availability. We speculate the loss of model skill in boreal relative to temperate evergreen needleleaf forests was linked to insufficient characterization of cold temperature sensitivity of metabolic processes and water flow in plants as well as freeze-thaw dynamics [Schaefer et al., 2007, 2009] and that this was exacerbated by the effects of transient climate change.

4.4. Effects of Site History and Protocol on Model Evaluation

[34] Disturbance regime and how a model treats disturbance are known to impact model performance [Ito, 2008]. In this study, stand age impacted model skill whereas site history was of marginal importance (Figure 8). However, CA-SJ2, the worst predicted site (Figure 5), was harvested in 2000 and scarified in 2002, and US-SO2, a second poorly predicted shrubland site (Figure 6), suffered catastrophic wildfire during the analysis period. The poor model performance for recently disturbed sites followed from assumed steady state as used in some simulations and the absence of modeling logic to accommodate disturbance. However, the distribution of site history metrics was skewed; only a few sites were burned, harvested, or in the early stages of recovery from disturbance when NEE is more nonlinear relative to established stands. Furthermore, age class was biased toward older stands; of the 17 forested sites only one was classified as a young stand. Other site characteristics were also unbalanced; all nonforested biomes occurred on five or fewer sites, with only one site each for shrublands and woody savannahs. While regression trees are inherently robust, additional observed and simulated fluxes in rapidly growing young forested stands, recently burned or harvested sites, and undersampled biomes are desirable to better characterize model performance.

[35] Aspects of the NACP site synthesis protocol and analysis framework also influenced the interpretation of our results. First, this analysis focused solely on non-gap-filled data to allow the model-data intercomparison to inform model development. However, the low turbulence (friction velocity) filtering removed more data at night than during the day. Average data coverage across all sites was 82% for daytime and 39% at night (Table 2), so our analysis is skewed toward daytime conditions. Second, each model that used remotely sensed inputs (such as LAI) repeated an average seasonal cycle calculated from site-specific time series based on all pixels within 1 km of the tower site. This likely deflated relevant variable importance scores (Figure 8) and precluded a full comparison of prescribed versus prognostic LAI. While only a few models used such inputs (Table 1), removing the inherent bias of an invariant seasonal cycle over multiple years may improve model performance. Incorporating disturbance information to recreate historical land use and disturbance, especially for recent site entries, could also improve model performance. Last, despite the model simulation protocol's emphasis on steady state, this condition was not achieved for most sites (Table 2), even when discounting observational uncertainty, or most models. None of the four crop only models achieved steady state. This followed from site history of croplands in general where active management precluded any system steady state, e.g., DNDC allowed for prescribed initial soil carbon pools. For those models (5 of the 21 evaluated) that achieved steady state in initialization this resulted in an inherent bias between simulated and observed NEE for all sites regardless of site history. However, as biome and seasonality largely governed the distribution of model skill, this bias was too small to manifest itself in this study. Relaxing the steady state assumption [Carvalhais et al., 2008] or initializing using observed wood biomass and the quasi-steady state assumption [Schaefer et al., 2008] could improve these models' performance.
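
The invariant seasonal cycle used for remotely sensed inputs can be illustrated as follows: a multiyear monthly series is collapsed to one mean value per calendar month, and that cycle is then repeated for every simulation year. This is a sketch under the assumption of a complete monthly series starting in January; the function names and LAI values are hypothetical, and the spatial averaging over pixels near the tower is not shown.

```python
def mean_seasonal_cycle(monthly, months_per_year=12):
    """Average a multiyear monthly series into one value per calendar month."""
    cycle = []
    for m in range(months_per_year):
        # Every 12th value starting at index m belongs to calendar month m.
        values = [v for v in monthly[m::months_per_year] if v is not None]
        cycle.append(sum(values) / len(values))
    return cycle


def repeat_cycle(cycle, n_years):
    """Tile the mean cycle so each simulated year sees identical inputs."""
    return cycle * n_years


# Two illustrative years of monthly LAI (not data from the study);
# the second year is uniformly 2.0 higher than the first.
year1 = [float(m) for m in range(12)]
year2 = [m + 2.0 for m in range(12)]
cycle = mean_seasonal_cycle(year1 + year2)
forcing = repeat_cycle(cycle, n_years=2)
```

In this example the difference between the two years disappears from the repeated forcing, which is the loss of interannual variability noted in the paragraph above.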

5. Conclusion

[36] We used observed CO2 exchange from 44 eddy covariance towers in North America with simulations from 21 terrestrial biosphere models and a mean model ensemble to examine model skill across gradients in dryness, seasonality, biome, site history, and model structure. Models' ability to match observed monthly net ecosystem exchange was generally poor; the mean squared distance between observations and simulations was ∼10 times observational error. Overall, forested sites were better predicted than nonforested sites. Weaknesses in model performance concerned model parameter sets and phenology, especially for biomes with a clear seasonal cycle in leaf area index. Drought was weakly linked to model skill with abnormally dry conditions during the growing season showing marginally worse model-data agreement compared to nondry conditions. Sites with disturbances during the analysis period and undersampled biomes (grasslands, shrublands, wetlands, woody savannah, and tundra) also showed a large divergence between observations and simulations. The highest degree of model-data agreement occurred in temperate evergreen forests in all climatic seasons and during summer across all biomes. Overall skill was higher for models that estimated net ecosystem exchange as the difference between gross primary productivity and ecosystem respiration, used prescribed canopy phenology, and did not use a daily time step. The model ensemble (mean simulated value across all models) and an optimized model (parameters tuned using data assimilation) also performed well. Models with preferred structural attributes included generalist models (models used at multiple sites and biomes, e.g., SiB3, Ecosys) that exhibited high degrees of model-data agreement across all biomes, indicating that a single model can successfully simulate carbon flux in all types of ecosystems. That is, different model architectures were not needed for different types of ecosystems and model choice is recast as a function of ease of parameterization and initialization.


[37] C.R.S., C.A.W., and K.S. were supported by the U.S. National Science Foundation grant ATM-0910766. We would like to thank the North American Carbon Program Site-Level Interim Synthesis team, the Modeling and Synthesis Thematic Data Center, and the Oak Ridge National Laboratory Distributed Active Archive Center for collecting, organizing, and distributing the model output and flux observations required for this analysis. This study was in part supported by the U.S. National Aeronautics and Space Administration (NASA) grant NNX06AE65G, the U.S. National Oceanic and Atmospheric Administration (NOAA) grant NA07OAR4310115, and the U.S. National Science Foundation (NSF) grant OPP-0352957 to the University of Colorado at Boulder.