Ecosystem models are important tools for diagnosing the carbon cycle and projecting its behavior across space and time. Despite the fact that ecosystems respond to drivers at multiple time scales, most assessments of model performance do not discriminate different time scales. Spectral methods, such as wavelet analyses, present an alternative approach that enables the identification of the dominant time scales contributing to model performance in the frequency domain. In this study we used wavelet analyses to synthesize the performance of 21 ecosystem models at 9 eddy covariance towers as part of the North American Carbon Program's site-level intercomparison. This study expands upon previous single-site and single-model analyses to determine what patterns of model error are consistent across a diverse range of models and sites. To assess the significance of model error at different time scales, a novel Monte Carlo approach was developed to incorporate flux observation error. Failing to account for observation error leads to a misidentification of the time scales that dominate model error. These analyses show that model error (1) is largest at the annual and 20–120 day scales, (2) has a clear peak at the diurnal scale, and (3) shows large variability among models in the 2–20 day scales. Errors at the annual scale were consistent across time, diurnal errors were predominantly during the growing season, and intermediate-scale errors were largely event driven. Breaking spectra into discrete temporal bands revealed a significant model-by-band effect but also a nonsignificant model-by-site effect, which together suggest that individual models show consistency in their error patterns. Differences among models were related to model time step, soil hydrology, and the representation of photosynthesis and phenology but not the soil carbon or nitrogen cycles. 
These factors had the greatest impact on diurnal errors, were less important at annual scales, and had the least impact at intermediate time scales.
 Ecosystem models remain our most important tool for diagnosing and forecasting carbon cycle dynamics across space and time. These models also play a critical role in understanding the potential responses of ecosystems to global change [VEMAP Members, 1995]. A large number of ecosystem models are currently in use, but each makes different assumptions and was developed using different study sites and data sets. Observational and experimental data sets that provide detailed information about carbon cycle dynamics are increasingly available for almost every biome [Baldocchi, 2008] and represent an important source of data for testing ecosystem models. Despite numerous efforts to test multiple models at single sites [e.g., Hanson et al., 2004] or single models at multiple sites [e.g., Krinner et al., 2005; Poulter et al., 2009], there has not been a major synthesis effort to evaluate the performance of the community of ecosystem models at multiple sites using a standardized protocol.
 In order to address these discrepancies, the North American Carbon Program (NACP) has initiated an intercomparison between ecosystem models and eddy covariance flux observations from multiple study sites. Recent NACP analyses have demonstrated that, across many models, the error in monthly NEE was lowest in the summer and for temperate evergreen forests, and was highest in the spring, fall, and during dry periods [Schwalm et al., 2010a]. These analyses also suggest that skill is higher in (1) models where canopy phenology is prescribed by remote sensing rather than prognostic, (2) models where NEE is driven by a calculation of GPP-Re rather than NPP-Rh, and (3) models with a subdaily rather than a daily time step [Schwalm et al., 2010a]. These same analyses have also demonstrated that regardless of model structure there is substantial room for improvement in the performance of all models, which suggests that parameter error is at least as important to model performance as structural errors [Schwalm et al., 2010a].
 It is difficult to diagnose the mechanisms responsible for the lack of agreement between model and measurement using conventional model fitting statistics alone. Summary statistics or even residual analyses may not sufficiently explain the model failure that results if one or more carbon cycle processes are not correctly formulated. Model errors are challenging to resolve because ecosystem processes respond to climatic trends at multiple temporal scales. For example, GPP responds not just to the diurnal and annual course of solar radiation, but also to the seasonality in leaf area, short- and long-term drought effects, canopy development and nutrient availability over years to decades, and disturbance and migration at decadal to centennial time scales [Falge et al., 2002; Arnone et al., 2008; Beer et al., 2010; Schwalm et al., 2010b]. In general, models are broadly similar in how they respond to solar radiation, variable in how they represent leaf area index and phenology, and quite diverse in the ways they represent stand development and drought responses. Conventional metrics of model performance, such as root mean squared error, are typically calculated at a single time scale, which for eddy covariance data is often either 30 min or monthly. Such metrics cannot easily separate processes operating at different time scales, and most implicitly emphasize model performance at the fast time scales that dominate model variance at the cost of neglecting performance at longer time scales (e.g., annual-to-decadal variability and trends). An alternative step for model evaluation and improvement is to understand when a model output fails at multiple temporal scales by assessing error in the frequency domain rather than in the time domain [Braswell et al., 2005; Siqueira et al., 2006; Williams et al., 2009; Vargas et al., 2010; Mahecha et al., 2010].
 The goal of this study is to evaluate the performance of ecosystem models at multiple time scales at select NACP sites using wavelet decomposition [Torrence and Compo, 1998; Katul et al., 2001]. A multimodel analysis using wavelet decomposition has not been performed to date and introduces new challenges in interpretation, especially given the NACP goal of explicitly considering uncertainty in observations and model output. Rather than attempting to identify “winners” and “losers,” we direct our analysis to provide information useful for model improvements by identifying the time scales at which models fail, thus giving insight into the processes responsible for model/measurement mismatch. Previous research has identified diurnal and annual time scales as being disproportionately responsible for the variance in surface atmosphere CO2 flux observations [Baldocchi et al., 2001; Katul et al., 2001; Richardson et al., 2007; Stoy et al., 2009]. Model-data comparisons conducted using individual models at small numbers of sites have identified both intermediate time scales (weeks to months) and interannual time scales as those in which models tend to fail [Braswell et al., 2005; Stoy et al., 2005; Siqueira et al., 2006; Vargas et al., 2011]. However, it is unclear whether these patterns would be expected to hold over a much wider range of biomes and against the breadth of different model structures considered in the NACP, which range from simple flux-and-pool matrix models to next generation terrestrial biosphere models [Schwalm et al., 2010a]. Based on these previous findings, we test two interrelated hypotheses:
 1. Models will accurately replicate flux variability at the daily and annual time scales, as important biological processes are regulated by diel and seasonal variation (e.g., the photosynthetic response to solar radiation) that is thought to be correctly formulated in the models.
 2. Models will have difficulty in representing variation at intermediate time scales (weeks to months) because synoptic weather events and lagged responses in plant physiology and ecosystem biogeochemistry regulate variation in fluxes at these intermediate temporal scales.
 To evaluate these hypotheses we focus on an analysis of NEE model-data residuals to highlight where there is still room for improvement rather than on where models and data agree. We also develop a novel method for assessing the contribution of flux-tower observation error via a Monte Carlo approach and demonstrate that a failure to account for flux observation error leads to a qualitative misidentification of the modes of model failure.
2.1. Models and Data
 The NACP site-level model-data intercomparison encompasses 21 ecosystem models and 32 North American eddy covariance flux tower sites. Not all models were run at all sites; in total there are 463 out of 672 possible model-site combinations. What is statistically problematic is that the missing model-site combinations are not randomly distributed, but rather reflect the choices of individual modeling groups, which are undoubtedly influenced by model skill. For example, tundra and wetland model runs are strongly underrepresented, suggesting that the models not run at these sites are likely to perform worse than those which were run. The fact that runs are not statistically “missing at random” has the potential to introduce biases into any statistical interpretation. Therefore, since most models were run for nine high-priority sites that had been identified a priori in the original model protocol, we restricted our analyses to these sites (Table 1).
Table 1. Sites and Models Used in Intercomparison
 Model runs at each site followed a prescribed protocol to facilitate intercomparison. Each model used a standardized meteorological forcing data set based primarily on the observed meteorology at each flux tower. Meteorological data were gap-filled using a combination of nearby met station data and the DAYMET reanalysis as documented by Ricciuto et al. Ancillary data such as soil texture and management history were available via NACP Biological-Ancillary-Disturbance-Methodology (BADM) templates [Law et al., 2008] to ensure that all models were making the same assumptions about the local environment at each site. In addition, a standard subset of the GIMMS Normalized Difference Vegetation Index data set [Tucker et al., 2005] was provided for the subset of models that are diagnostic rather than prognostic. Models were expected to be run to steady state using their standard parameter set; site-specific model tuning was prohibited. The exception to this was the LoTEC model, which was run using a data assimilation scheme. The performance of LoTEC relative to other models thus highlights the contribution of parameter error, rather than model structural error, to model performance. The full modeling protocol can be found online (http://nacp.ornl.gov/docs/Site_Synthesis_Protocol_v7.pdf).
 This analysis focuses on the comparison of observed and simulated net ecosystem exchange (NEE) of CO2. All analyses were conducted on the finest temporal resolution available, which was 60 min at US-Ha1, US-Ne3, and US-UMB and 30 min for all other sites (Table 1). Models with a daily time step used the daily mean value for all values within a 24 h period and thus we refrain from interpreting the results from these models at time scales less than 2 days. For each model at each site we calculated the normalized residual error (εs,m,t) in NEE between models and data as
$$\varepsilon_{s,m,t} = \frac{\mathrm{NEE}^{\mathrm{model}}_{s,m,t} - \overline{\mathrm{NEE}^{\mathrm{model}}_{s,m}}}{\sigma^{\mathrm{model}}_{s,m}} - \frac{\mathrm{NEE}^{\mathrm{obs}}_{s,t} - \overline{\mathrm{NEE}^{\mathrm{obs}}_{s}}}{\sigma^{\mathrm{obs}}_{s}} \qquad (1)$$
with subscripts s = site, m = model, and t = time and a bar indicating an average over the full length of the time series. This error metric was designed to highlight the synchrony of the model with the data rather than identify persistent model biases, which are generally a reflection of errors in model parameterization not model structure and are reported on in detail elsewhere [Schwalm et al., 2010a]. Data and model output were mean-centered to eliminate biases in the cumulative flux and divided by the standard deviation (σ) across the entire record to normalize the amplitude of variability.
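The normalized residual described here (mean-center each series, divide by its standard deviation over the full record, then difference) can be sketched in a few lines of Python. This is an illustration only, not the code used in the study: the function name `normalized_residual`, the model-minus-data sign convention, and the use of the population standard deviation are assumptions.

```python
from statistics import fmean, pstdev


def normalized_residual(model_nee, obs_nee):
    """Return the z-scored model series minus the z-scored observed series.

    Sign convention (model minus data) is an assumption; wavelet power of
    the residual does not depend on it.
    """
    if len(model_nee) != len(obs_nee):
        raise ValueError("model and observation series must have equal length")

    def zscore(series):
        mean = fmean(series)
        sd = pstdev(series)  # population SD over the whole record (assumed)
        if sd == 0:
            raise ValueError("series has zero variance")
        return [(value - mean) / sd for value in series]

    return [m - o for m, o in zip(zscore(model_nee), zscore(obs_nee))]
```

Because both series are standardized first, a model that differs from the data only by a constant offset or a constant amplitude factor yields a residual of zero, which is why this metric emphasizes synchrony rather than bias.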
 As the continuous wavelet transform does not accommodate missing data, flux data were gap-filled using Marginal Distribution Sampling (MDS) [Reichstein et al., 2005], which is a standard FLUXNET data product [Moffat et al., 2007]. An estimate of NEE observation error at every time point was generated by Barr et al., accounting for uncertainties associated with U* filtering and random measurement error [Richardson et al., 2006; Richardson and Hollinger, 2007]. These uncertainties are incorporated in the spectral null model, rather than in the error metric, as described below.
2.2. Spectral Analysis
 Spectral analyses are based on the premise that a time series can be decomposed into an additive series of wave functions that have different time scales in a way directly analogous to how a Taylor series decomposes a function into a series of polynomials. These analyses allow one to identify the time scales that dominate a signal because wave functions that match the fluctuations in the data will explain the most variance (i.e., power). In contrast to traditional methods such as Fourier spectra, which are based on using sinusoidal waves, wavelet analyses are based on using wave functions that are finite in length but moved over the time series in a way conceptually similar to a moving window. In this way wavelet analyses are able to identify not only the time scales that dominate a signal but also when in time those time scales are strongest. Wavelet analyses are typically plotted on what is referred to as the wavelet half plane, where time is along the x axis, time scale is along the y, and spectral power is indicated by color, for example with hot (red) colors indicating high power and cool (blue) colors indicating low power. Wavelet analysis has been widely applied in the geosciences [Torrence and Compo, 1998] for quantifying the spectral characteristics of time series that may be nonstationary and heteroscedastic, thereby offering an improvement over traditional Fourier decomposition (e.g., see the demonstration by Scanlon and Albertson). A continuous wavelet transform was computed in R using the dplR library [Bunn, 2008] with the Morlet wavelet basis function, setting the wave number (k0) to six and calculating four suboctaves per octave (four voices per power of two). The Morlet wavelet looks like a sine wave centered on zero that decays rapidly to extinction in both directions. Wavelet power was corrected for biases following Liu et al. to ensure a consistent definition of power in order to enable comparisons across spectral peaks.
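To make the transform concrete, the following is a minimal pure-Python sketch of a Morlet continuous wavelet transform with k0 = 6 and four voices per octave. It is not the dplR implementation used in the study: it convolves directly in the time domain, truncates the wavelet at four scales, applies no padding or cone-of-influence masking, and takes the bias correction to mean dividing power by scale. All function names are hypothetical.

```python
import cmath
import math

K0 = 6.0    # Morlet wave number, as stated in the text
VOICES = 4  # suboctaves per octave, as stated in the text


def morlet(eta, k0=K0):
    """Morlet mother wavelet: a complex sinusoid under a Gaussian envelope."""
    return math.pi ** -0.25 * cmath.exp(1j * k0 * eta) * math.exp(-0.5 * eta * eta)


def scales(dt, n_octaves, voices=VOICES):
    """Geometric scale ladder starting at 2*dt (assumed smallest scale)."""
    return [2.0 * dt * 2.0 ** (j / voices) for j in range(n_octaves * voices + 1)]


def fourier_period(scale, k0=K0):
    """Equivalent Fourier period of a Morlet scale (about 1.03 * scale for k0 = 6)."""
    return 4.0 * math.pi * scale / (k0 + math.sqrt(2.0 + k0 * k0))


def cwt_power(x, dt, scale_list, k0=K0):
    """Bias-corrected power |W|^2 / scale, indexed as power[scale][time]."""
    n = len(x)
    power = []
    for s in scale_list:
        norm = math.sqrt(dt / s)
        # The Gaussian envelope is negligible beyond about four scales.
        half = min(n - 1, int(math.ceil(4.0 * s / dt)))
        kernel = [norm * morlet(k * dt / s, k0).conjugate()
                  for k in range(-half, half + 1)]
        row = []
        for t in range(n):
            w = 0j
            for k in range(-half, half + 1):
                if 0 <= t + k < n:  # no padding: edges see a truncated wavelet
                    w += x[t + k] * kernel[k + half]
            row.append(abs(w) ** 2 / s)
        power.append(row)
    return power


def global_spectrum(power):
    """Time-average of the power at each scale."""
    return [sum(row) / len(row) for row in power]
```

As a sanity check, a pure sinusoid with a 16-step period should produce a global spectrum that peaks at the scale whose Fourier period is closest to 16 steps. The direct convolution is slow for long records; an FFT-based implementation would be the usual choice in practice.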
 The challenge when interpreting the spectral characteristics of model error is to determine when model-data mismatch is statistically significant. For this to be useful it is important that the spectra be compared to the appropriate null model, which for eddy covariance data requires that the null spectra account for the errors in the flux observations. Thus, conventional significance tests using, for example, Monte Carlo analyses on colored noise spectra, are inadequate [Torrence and Compo, 1998; Grinsted et al., 2004; Stoy et al., 2009]. Determining appropriate null spectra is further complicated because flux measurement error is nonnormally distributed; NEE measurement errors have a double exponential distribution, which is more fat-tailed than the normal, and is highly heteroscedastic, with error increasing linearly with the absolute magnitude of the flux [Hollinger and Richardson, 2005; Richardson et al., 2006, 2008; Lasslop et al., 2008]. Because of this error distribution, even a perfect model that exactly predicted the true flux would have a strong diurnal and seasonal error spectrum when compared to data because the magnitude of observation error in the data increases systematically during the day and during the growing season when NEE tends to be larger.
 To account for this observation uncertainty we developed a novel Monte Carlo approach to generating the null wavelet spectrum. A stack of 1000 wavelet spectra was calculated using 1000 Monte Carlo replicate “pseudo-data” time series for each site. Replicate data sets were generated using the methods of Barr et al. that account for both uncertainty in the gap-filling algorithm and measurement uncertainty, and sample over the distribution of both simultaneously. The relative error between the pseudo-data and the original data, and the wavelet spectra of this error, were both calculated in the exact same way as model error. From the 1000 replicate spectra, the mean, median, and a quantile-based confidence interval were calculated and the distribution of these replicate spectra was used as the null model for the model spectra.
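The logic of the null model can be sketched as follows. This is a simplified stand-in, not the Barr et al. procedure: it ignores gap-filling uncertainty and simply perturbs each observation with double exponential noise whose scale grows linearly with the absolute flux. The noise coefficients `a` and `b`, the function names, and the pluggable `spectrum_fn` argument (any function that maps an error series to a list of power values) are all hypothetical.

```python
import random
from statistics import fmean, median, pstdev


def laplace_noise(rng, scale):
    """Zero-mean double exponential draw (difference of two exponentials)."""
    if scale <= 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def zscore(series):
    mean, sd = fmean(series), pstdev(series)
    return [(value - mean) / sd for value in series]


def pseudo_data(obs, a, b, rng):
    """Perturb each flux with noise whose scale is a + b * |flux| (assumed form)."""
    return [value + laplace_noise(rng, a + b * abs(value)) for value in obs]


def null_spectra(obs, spectrum_fn, n_rep=1000, a=0.5, b=0.2, seed=0):
    """Summarize the spectra of pseudo-data error across Monte Carlo replicates."""
    rng = random.Random(seed)
    z_obs = zscore(obs)
    stack = []
    for _ in range(n_rep):
        z_rep = zscore(pseudo_data(obs, a, b, rng))
        # Same error metric as for a model: difference of standardized series.
        error = [r - o for r, o in zip(z_rep, z_obs)]
        stack.append(spectrum_fn(error))
    summary = {"mean": [], "median": [], "upper95": []}
    for column in zip(*stack):  # one column per time scale
        ranked = sorted(column)
        summary["mean"].append(fmean(ranked))
        summary["median"].append(median(ranked))
        index = min(len(ranked) - 1, int(0.95 * len(ranked)))
        summary["upper95"].append(ranked[index])  # simple empirical quantile
    return summary
```

The key design point carried over from the text is that pseudo-data are pushed through exactly the same error metric and spectral calculation as a model, so that the resulting distribution describes what the error spectrum of a "perfect" model would look like given noisy observations.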
 Rather than present all wavelet half plane diagrams for all model-site combinations, results are summarized in three ways. First, the global power spectra were calculated for all model-site combinations. To account for the data uncertainty, the peaks in the global spectrum for each model at each site were tested for significance by comparison to the distribution of the global spectra of the 1000 Monte Carlo pseudo-data sets for that site. To simplify presentation, the model spectrum at each site was divided by the one-sided 95% confidence bound generated from the 1000 Monte Carlo replicates. In this approach, any part of the spectrum that is greater than one is interpreted as the model error falling outside the range of data uncertainty. To enable comparisons among models, the global spectrum for each model averaged across sites was calculated as the median of these null-corrected spectra. The spectrum for the multimodel mean was calculated by first calculating the ensemble average time series in the time domain, and then treating this like any other model, rather than averaging spectra across models in the wavelet domain.
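A small Python sketch of this summary step is given below, assuming each spectrum is a list of power values on a scale ladder shared by the model and null spectra at a site, and by all sites being combined. The function names and the dictionary layout are hypothetical.

```python
from statistics import fmean, median


def null_corrected(model_spectrum, null_upper95):
    """Ratio of model error power to the null's upper 95% bound at each scale."""
    return [p / u for p, u in zip(model_spectrum, null_upper95)]


def significant_scales(ratio):
    """Indices of scales where model error exceeds the range of data uncertainty."""
    return [i for i, r in enumerate(ratio) if r > 1.0]


def across_site_median(ratio_by_site):
    """Scale-by-scale median of the null-corrected spectra across sites."""
    return [median(column) for column in zip(*ratio_by_site.values())]


def ensemble_mean_series(series_by_model):
    """Multimodel mean computed in the time domain, before any transform."""
    return [fmean(column) for column in zip(*series_by_model.values())]
```

For example, `null_corrected([2.0, 1.0, 9.0], [4.0, 1.0, 3.0])` gives `[0.5, 1.0, 3.0]`, so only the third scale is flagged. Averaging the models in the time domain, rather than averaging their spectra, lets uncorrelated errors cancel before power is computed.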
 Second, to synthesize the proportion of variance in the error metric that was attributable to different time scales, we extracted power from five spectral bands for each site by model combination. These bands and their time scales were the following: subdaily (<0.5 day), daily (0.5–2 days), synoptic (2–180 days), annual (180–700 days), and interannual (>700 days). Bandwidths were determined by examining the wavelet spectra of sinusoidal waves with a “pure” diurnal and annual signal. The analysis of this simple synthetic time series also served to verify that the analytical methods were functioning as expected. The five bands were then summarized on both a by-site and by-model basis in terms of the relative contribution of each band to the overall spectra. As before, all spectra were normalized by the upper confidence interval of the null spectra, which in addition to providing a metric to determine significance, also allows spectra from different sites to be compared despite differences in time series length and total variance. The proportion of spectral energy in each band was compared using three-way ANOVA with site, model, and spectral band as covariates and including all pairwise interactions. Within the ANOVA the interannual time scale was dropped as a response variable because of edge effects within the cone of influence and because its proportional energy is linearly determined by the other four bands. Because this analysis was unbalanced, a second ANOVA was performed on just the 15 models that have completed all runs at the three deciduous forest sites and three conifer forest sites (Table 1). Within this analysis we also included biome as a covariate to test for differences among the conifer and deciduous sites.
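The band partitioning can be illustrated with a short Python sketch. The band edges are those given in the text; treating each band as a half-open interval, and summing power over the discrete scales that fall in it without any further weighting, are assumptions, as are the names used.

```python
# (name, lower edge, upper edge) in days; intervals are taken as [lower, upper).
BANDS = [
    ("subdaily", 0.0, 0.5),
    ("daily", 0.5, 2.0),
    ("synoptic", 2.0, 180.0),
    ("annual", 180.0, 700.0),
    ("interannual", 700.0, float("inf")),
]


def band_proportions(periods_days, spectrum):
    """Fraction of total spectral power that falls in each temporal band."""
    totals = {name: 0.0 for name, _, _ in BANDS}
    for period, power in zip(periods_days, spectrum):
        for name, lower, upper in BANDS:
            if lower <= period < upper:
                totals[name] += power
                break
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("spectrum has no power")
    return {name: value / grand_total for name, value in totals.items()}
```

Because the five proportions sum to one, any one of them is determined by the other four, which is the reason given in the text for dropping the interannual band from the ANOVA.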
 The third way model performance was summarized was to examine the across-model composite wavelet spectra for each site. Each model-site spectrum was first normalized so that total power sums to one before calculating the across-model average of the full wavelet spectra for each site (i.e., averaging was performed in the spectral domain). This analysis was done in order to discern the presence or absence of consistent temporal patterns in model performance within a site in order to quantify when models consistently fail when challenged by data from each site. Errors that are common across models are expected to have higher spectral energy than errors that are unique to a single model because random errors will cancel in the resulting power spectra.
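As a sketch of this compositing step, assume each model's wavelet power at a site is stored as a nested list indexed as `power[scale][time]`, with all models sharing the same scales and time steps. The function names are hypothetical.

```python
def normalize_total_power(power):
    """Rescale a wavelet power array so that all of its entries sum to one."""
    total = sum(sum(row) for row in power)
    if total == 0:
        raise ValueError("spectrum has no power")
    return [[value / total for value in row] for row in power]


def composite_spectrum(power_by_model):
    """Average the normalized spectra across models, cell by cell."""
    normalized = [normalize_total_power(p) for p in power_by_model.values()]
    n_models = len(normalized)
    n_scales, n_times = len(normalized[0]), len(normalized[0][0])
    return [
        [sum(p[i][j] for p in normalized) / n_models for j in range(n_times)]
        for i in range(n_scales)
    ]
```

Normalizing first means that a model with very large overall error cannot dominate the composite; each model contributes only the shape of its error spectrum.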
 In order to identify phenological errors in the wavelet spectrum the beginning and end of the growing season were marked on the wavelet half plane for each site. Phenological boundaries were estimated based on a 10 day moving average of tower-based GPP. We used a threshold of 20% maximum GPP, which gives very similar results to the more common zero NEE threshold at most sites, but was less sensitive to noise for the few sites where the zero NEE threshold gave unrealistic results.
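The threshold rule can be sketched in Python as follows. The sketch assumes a single year of daily GPP values, takes the maximum from the smoothed series, and uses a roughly centered window that shrinks at the ends of the record; none of these details are specified in the text, and the function names are hypothetical.

```python
def moving_average(values, window=10):
    """Roughly centered moving average; the window shrinks near the record ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i - half + window]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def growing_season(daily_gpp, window=10, fraction=0.2):
    """Return (start, end) day indices where smoothed GPP first and last
    reaches the given fraction of its maximum."""
    smoothed = moving_average(daily_gpp, window)
    threshold = fraction * max(smoothed)
    above = [i for i, value in enumerate(smoothed) if value >= threshold]
    return above[0], above[-1]
```

For a multiyear record the same function would be applied year by year, and the resulting indices give the vertical lines that delimit the growing season on the half plane.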
 Below we demonstrate the wavelet-based uncertainty analysis using the output from one model, the Ecosystem Demography (ED2) model [Moorcroft et al., 2001; Medvigy et al., 2009], at one site, Howland forest (US-Ho1) [Hollinger et al., 2004]. We use this example to explain the Monte Carlo analysis with pseudo-data and discuss one model-site comparison in more detail. We then analyze the spectra from all sites and models and partition the relative error for each site and model among the different time scales (hourly to interannual) that the length of the data records permit us to interpret.
3.1. Wavelet Decomposition of Eddy Covariance and Ecosystem Model Time Series With Explicit Error Accounting
Figure 1 displays the wavelet half plane spectra for one site (US-Ho1, Figure 1a), one model (ED2, Figure 1b) run at this site, and the normalized residual error between the model and the observations (Figure 1c) using data from 1996 to 2003. It is important to note that this analysis focuses on the normalized residual error spectra (Figure 1c), not the spectra of the data (Figure 1a) or the model (Figure 1b) itself, and that this residual (equation (1)) is calculated in the time domain (i.e., it is not the difference between the wavelet coefficients displayed in Figures 1a and 1b). In Figure 1, time is along the x axis, time scale is along the y, and spectral power is indicated by the intensity of color on a logarithmic scale with warm colors (dark red) indicating the highest spectral power, which can be interpreted as the strongest match between a Morlet wavelet and the time series. Importantly, the warm colors in Figure 1c indicate regions in the frequency domain of substantial data-model disagreement.
 From Figure 1, the dynamics of both the model and the data are dominated by a diurnal signal (1 day) and an annual signal (365 days). We also observe that while the annual signal is present and relatively constant across time, the spectral power for time scales between ∼1 h and ∼2 weeks is considerably stronger (colors toward red) during the growing season and weaker (blue) during the winter, especially at the 1 day time scale. More subtly, comparing the data and model spectra suggests that the model may have less variability (visualized as less red) than the observed NEE at both the subdaily time scale and at the time scales between daily and annual (henceforth called the intermediate time scales). The lower variability in models at subdaily time scales is a reflection of the fact that the models do not include measurement noise, which arises from instrument error, the stochastic nature of turbulent eddies, and variation in the flux tower footprint. In addition, many of the models can only predict NEP, which is less variable than NEE due to the absence of canopy CO2 storage, though in all cases we are only considering ecosystem-atmosphere CO2 fluxes, not dissolved carbon or organic trace gases. When we look at the residual error spectra (Figure 1c), we should be encouraged by the fact that the wavelet coefficients have a smaller magnitude, suggesting a degree of correspondence between the model and the data. However, clear signals of model-data mismatch at the annual and diurnal time scales remain.
 These mismatches between model and measurement in Figure 1c can be further interpreted by accounting for the uncertainties in the observations as in Figure 2. Figure 2a (black line) provides an example of the global power spectrum for the error of the ED2 model at Howland, which is simply the marginal distribution of the full error spectrum in Figure 1c, in comparison to the Monte Carlo estimate of the spectra of the observation error (red line: solid line = mean, dashed lines = 95% CI). In order to facilitate the comparison of these spectra we divided the model-data error spectra by the upper 95% CI of the observation error spectra for each time scale (Figure 2b). In this context any time scale that falls above the horizontal line (>1) indicates a model residual error that is “significantly” higher than the uncertainty in the observed data. This shows, for example, that while the error in the model is greatest at the diurnal time scale, the data uncertainties are also very high at this time scale. By contrast, the absolute error at the annual scale is lower, but the random uncertainty in the net carbon flux at this time scale is also considerably lower, such that the normalized peak at the annual time scale dominates the overall spectra.
3.2. Global Model Spectra
 Prior to correcting for the null spectra, the global power spectra, averaging each model across all sites (Figure 3a), suggest that the error in all of the models considered is dominated by errors in the diurnal cycle. The large majority of models also show a second peak at the annual time scale that is almost the same magnitude as the diurnal peak. Based on these overall spectra, there appears to be little consistent structure to model error at the subdaily or intermediate time scales. Power in the error spectra declines at interannual time scales, but this likely reflects the limited length of the time series at these time scales rather than a confirmation of model performance at capturing long-term trends. This interpretation is supported by the large error bounds in the null spectra at this time scale (Figure 2a, dashed red line).
 When compared to the null model spectra, which correct for the observation error in the flux data, there are substantial changes in the overall pattern of model performance (Figure 3b). While models continue to have significant error at the diurnal time scale, this is no longer the dominant peak of the spectra because there is significant structure to the observation error at this time scale. This implies that much, but by no means all, of the dominant diurnal peak in the nonnormalized spectra results from the noise in the data (Figure 3a). Once corrected for observation error, the overall error for most models is dominated by error at the annual time scale and the greatest variability in model performance comes at the intermediate time scales. For a subset of models (AgroIBIS, LoTEC, SiBcrop, SiBCASA), as well as for the ensemble mean, model performance at the daily to monthly time scales falls at or within the uncertainty bounds of the data. Of these, AgroIBIS and SiBcrop are crop models that were only run at the crop site, LoTEC was run using a data assimilation routine, and thus would be expected to have a lower error rate, and SiBCASA is driven in part using remote-sensing products and thus has more information than prognostic models. For another set of models (BEPS, BIOME-BGC, DLEM, LPJ-wsl) model error shows a dip immediately after the diurnal peak but then rises rapidly during the first part of the intermediate scale (daily to monthly) to reach levels that are comparable to error in the daily time scale. The common feature of this group of models is that it consists of all of the noncrop models with a daily time step. The remaining models also show a dip after the diurnal peak but then error stays lower throughout these time scales, with errors slightly larger than would be expected by chance between the daily and monthly scale. All models show substantial increase in error in the second half of the intermediate scale (monthly to seasonal).
 In order to further diagnose the drivers of the variability among models within specific time scales, we tested for the effects of model structure on the integrated power within the diurnal, intermediate, and annual bands using ANOVA (Table 2). This analysis was restricted to the group of models that operate at a subdaily time step, as we already identified a distinct pattern for daily models, and repeated both for all models and sites and for just the set of complete forest runs. We also excluded LoTEC because it employed a data assimilation scheme. The inclusion of multiple soil layers in the soil moisture model had a significant effect of increasing error at the diurnal time scale for both all sites and the forest sites. Within the forest sites this effect was also significant at the annual scale and marginally significant at the intermediate time scale. The representation of soil carbon pools was in general not significant across scales and sites with the exception of the diurnal scale when considering all sites, in which case models with multiple pools performed worse than models with one carbon pool or no explicit representation of soil carbon. Canopy phenology (prognostic versus prescribed or semiprognostic) had a significant effect at the diurnal time scale at both all sites and forest sites, with fully prognostic models showing larger error than those which used some amount of external information to control phenology. The choice of photosynthesis scheme (enzyme kinetic versus stomatal conductance) was highly significant at the diurnal time scale, with enzyme kinetic models [e.g., Farquhar et al., 1980] having lower error. For both phenology and photosynthesis these effects were also seen at the annual scale when looking across all models, but not at the forest sites nor at the intermediate scale for either set of sites. 
Finally, the inclusion of an explicit nitrogen cycle did not have a significant effect on the spectral power at any time scale regardless of whether one considers all sites or just the forest sites. Error spectra on a model-by-model basis (Figure S1) and model structural characteristics (Table S1) are provided in the auxiliary material.
Table 2. The p Values for ANOVAs Assessing the Impact of Model Structural Covariates on the Spectral Power Within Each of the Three Time Scales
Structural covariates are all categorical and take on the following states: soil water layers (0, 1, >1), multiple soil carbon pools (yes/no), phenology (prognostic, prescribed), photosynthesis (enzyme kinetic, stomatal conductance), and soil nitrogen cycle (yes/no).
All sites: all study sites and models were used.
Forest sites: restricted to the six forest study sites and the models that were run at all of these sites.
Row labels: Soil H2O layers; Soil C pools
3.3. The Proportion of Model Error at Different Time Scales
 Model error was binned into temporal bands representing subdaily, daily, intermediate, annual, and interannual time scales (Figure 4). The full ANOVA suggests that the differences among sites (p < 0.001, F = 5.34, df = 8) and the site by band interactions (p < 0.001, F = 4.82, df = 24) were significant. These results were also significant within the forest-only ANOVA (site: p < 0.001, F = 5.16, df = 5; site-by-band: p < 0.001, F = 4.70, df = 15). Figure 4a, which shows the overall relative error partitioning by site and band, shows that the error at most sites was dominated by the intermediate and annual time scales. There is a comparatively large amount of spectral power in the interannual band at the CA-Oas site but almost none at US-Ne3, the latter of which is a reflection of the fact that the crop site was only a 3 year time series and thus almost all of the interannual band falls outside the cone of influence. Differences among sites do not show any obvious pattern between the three deciduous sites (Figure 4a, left), the three conifer sites (Figure 4a, middle), and the three nonforested sites (Figure 4a, right). This was consistent with both the forest-only ANOVA, which did not find a significant biome effect, and with a subsequent post hoc analysis that did not find interactions between biome and any other term.
 The full ANOVA also suggests that there were significant differences among the spectral bands (p < 0.001, F = 943.43, df = 3) and in the band-by-model interaction (p < 0.001, F = 6.06, df = 57). These results were likewise consistent with the forest-only analysis (band: p < 0.001, F = 531.80, df = 3; band-by-model: p < 0.001, F = 6.02, df = 42). Figure 4b shows the overall relative error partitioning by model and band. The agroecosystem models (AgroIBIS, EPIC, SiBcrop, TRIPLEX) stand out because of their lack of interannual variability, but as noted above this is a characteristic of the crop site, not of the crop models per se. Among the remaining models, EDCM and SSiB2 were dominated by errors in the annual cycle, while BEPS, ED2, and LPJ had the largest fraction of error at the interannual time scale and the smallest fraction at the annual time scale compared to other models. Interestingly, LoTEC-DA, the one model to employ data assimilation, was not unusual in terms of the relative contributions among the different bands. Finally, neither the model effect nor the model-by-site interaction was significant in either the full ANOVA or the forest-only analysis. Because of the lack of a model effect, no model structural variables were tested.
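 The relative error partitioning described above can be illustrated with a minimal sketch. It assumes that spectral power is available as a list of values indexed by period in days; the function name `band_fractions`, the band edges in `EXAMPLE_BANDS`, and the power values are all hypothetical and are not the cut points or results used in this study.

```python
def band_fractions(periods_days, power, bands):
    """Return each band's share of the total spectral power.

    bands is a list of (name, lower_edge, upper_edge) tuples in days,
    with the lower edge inclusive and the upper edge exclusive.
    """
    totals = {name: 0.0 for name, _, _ in bands}
    for period, value in zip(periods_days, power):
        for name, lower, upper in bands:
            if lower <= period < upper:
                totals[name] += value
                break
    grand_total = sum(totals.values())
    return {name: total / grand_total for name, total in totals.items()}


# Hypothetical band edges in days, chosen only for illustration.
EXAMPLE_BANDS = [
    ("subdaily", 0.0, 0.75),
    ("daily", 0.75, 2.0),
    ("intermediate", 2.0, 180.0),
    ("annual", 180.0, 550.0),
    ("interannual", 550.0, float("inf")),
]

# Made-up error power at six periods (days).
periods = [0.5, 1.0, 10.0, 70.0, 365.0, 730.0]
power = [1.0, 4.0, 2.0, 5.0, 6.0, 2.0]
fractions = band_fractions(periods, power, EXAMPLE_BANDS)
```

 Normalizing by the total makes the bands comparable across sites and models whose absolute error differs, which is the quantity an ANOVA on relative partitioning would operate on.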
3.4. The Mean Normalized Spectra of Multiple Models
 The across-model error spectra for each site can help ascertain consistent temporal patterns of model failure on a site-by-site basis (Figure 5). Strong diurnal and annual error spectral signals appear at all sites. Diurnal error is highest during the growing season (delineated by vertical black lines) and lowest during the winter at all sites, but it can also be nontrivial outside the growing season, suggesting that these errors cannot be isolated to the GPP calculations within the models. The fact that this seasonal variation appears stronger at the deciduous and nonforested sites (Figures 5 (top) and 5 (bottom), respectively) but is not absent at coniferous sites (Figure 5, middle) suggests that phenology/LAI may contribute to the error. The elevated error during the growing season is not isolated to the diurnal cycle; it is also present, at lower magnitude, across the intermediate time scale at all sites.
 The magnitude of the annual error tends to be large and consistent across seasons, with some variability within and among sites. For example, a large annual and seasonal signal is apparent for the agricultural site (US-Ne3) for 2003. This corresponds to the year maize was planted in a maize-soy rotation and reflects the fact that the majority of the models consistently underpredicted maize GPP and NEE (E. Lokupitiya et al., Evaluation of model-predicted carbon and energy fluxes from cropland ecosystems, submitted to Global Change Biology, 2011). Similarly, the strong seasonal to annual signal at the grassland site (CA-Let) for 2002–2006 corresponds to a series of years that had appreciably greater NEE [Flanagan et al., 2002].
 Within the intermediate time scale there are also clear indications of brief periods of elevated model error during the growing season across all sites. These error “events” show up as patches or vertical plumes of red and orange in Figure 5. Because these periods tend to be brief and irregular, their contribution to overall error is smaller than that of the dominant annual and diurnal cycle errors, but they do point to systematic errors that are shared across models. These events were investigated further on a model-by-model basis by plotting the wavelet power for individual time scales and comparing this to model and data smoothed to the same time scales. As an example, we return to our previous case of the ED2 model at Howland Forest and investigate the dynamics at two intermediate time scales, 10 days and 70 days (Figure 6). We see that in all cases the intermediate time scale “events” identified by the wavelet analysis correspond to times when there was greater variability in the data than in the model. Examples from other models and other sites (data not shown) generally confirmed that most models were noticeably smoother than the data. As a reminder, these are discrepancies in variability on the order of weeks to months and are thus unlikely to arise from random measurement error in the data, though this does not rule out the possibility of systematic errors in instrumentation. We have not diagnosed the environmental and biotic drivers of these many small events, as this is beyond the scope of this study, but useful examples of this approach can be found in the literature [Mahecha et al., 2010]. Finally, it is worth noting that there does not appear to be any correspondence between error “events” in the intermediate period and the phenological boundaries identified from the tower flux data.
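 The comparison of model and data smoothed to a common time scale can be sketched as follows. This is an illustration only: a centered moving average stands in for whatever smoother was actually applied, the helper names `moving_average` and `variance_ratio` are hypothetical, and the two series are synthetic daily values rather than tower or model output.

```python
import math
from statistics import pvariance


def moving_average(series, window):
    """Centered moving average over an odd number of samples.

    Only positions with a full window are returned, so the output is
    shorter than the input by window - 1 samples.
    """
    half = window // 2
    return [
        sum(series[i - half : i + half + 1]) / (2 * half + 1)
        for i in range(half, len(series) - half)
    ]


def variance_ratio(observed, modeled, window):
    """Variance of the smoothed model divided by that of the smoothed data.

    Values below 1 indicate a model that is smoother than the data at
    the chosen time scale.
    """
    smooth_obs = moving_average(observed, window)
    smooth_mod = moving_average(modeled, window)
    return pvariance(smooth_mod) / pvariance(smooth_obs)


# Synthetic daily series: the "model" has half the amplitude of the "data".
observed = [2.0 * math.sin(2.0 * math.pi * t / 40.0) for t in range(200)]
modeled = [0.5 * x for x in observed]
ratio = variance_ratio(observed, modeled, window=11)  # roughly a 10 day scale
```

 In this synthetic case the ratio is 0.25, because halving the amplitude quarters the variance. Repeating the calculation at several window lengths would show at which scales a model is too smooth.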
4. Discussion and Conclusions
 Our first hypothesis was that models would perform well at the daily and annual time scales because biological processes at both these scales are driven by a solar radiation cycle and corresponding changes in temperature. In contrast to our expectations, model error was overwhelmingly dominated by the annual cycles and also showed a clear diurnal signal. Models captured a significant amount of variability at these time scales (Figure 1b), but these time scales are nonetheless responsible for such a large fraction of the overall variability in NEE (Figure 1a) that errors in their representation dominate the error spectrum and drive overall model performance. Our analysis further reveals that model error on the diurnal cycle predominantly occurs during the growing season regardless of biome, which is not surprising given the larger magnitude of summertime fluxes. These results suggest that further model development focus first and foremost on correctly replicating flux variability and magnitude on the annual and diurnal time scales. This recommendation runs counter to recommendations from site-specific model wavelet analyses where, for example, flux variability was correctly replicated at most time scales but interannual variability was captured for the wrong reasons [Siqueira et al., 2006]. This discrepancy arises because at a single site models can often be calibrated to match observations, but these calibrations may not hold when applied to other sites. It should also be noted that the spectra in this previous analysis [Siqueira et al., 2006] were not corrected for observation error and on visual inspection appear very similar to the uncorrected spectra in this analysis (Figure 3a).
 Probing deeper into the contribution of model structure to diurnal and annual error (Table 2) reveals that model structure is particularly important at the diurnal scale. Within the diurnal scale, the choice of photosynthetic scheme had the greatest impact, with enzyme kinetic models performing best. Counterintuitively, the choice of phenology scheme also had a strong impact at the diurnal scale, though not surprisingly models that predict their own phenology performed worse than those that relied fully or in part on external phenological information. The representation of soil moisture also had a modest, though significant, effect on diurnal errors, with the somewhat surprising result that the inclusion of multiple soil moisture layers increased error. This may result either from uncertainties in the soil texture causing errors in the predicted depth distribution of moisture itself, or from errors associated with rooting depth distributions and the ability of plants to take up moisture from different layers, neither of which is an issue within a simple single-bucket approach. Finally, the effects of soil carbon representation were modest and inconsistent, while soil nitrogen representation was nonsignificant. At the annual scale both soil C and N representation remain nonsignificant, while the importance of other structural factors was consistent with the diurnal patterns, but the effects were generally weaker and significance varied between the forest-only analysis and the all-sites analysis.
 Our second hypothesis, that models would have difficulty capturing intermediate time scale processes, was supported by the analysis. The intermediate time scale is difficult for models to capture due to the stochastic nature of weather events and the presence of within-season biotic feedbacks. Once data uncertainties were accounted for, error at the intermediate time scale constituted a nontrivial contribution to overall error (Figure 4). What was not predicted a priori was that there are actually two different domains within the intermediate time scales, split at a time scale of approximately 20 days (Figure 3b). This 20 day time scale is only slightly longer than the time scale at which the influence of radiation variability was found to decline and vapor pressure deficit variability became more important for modeling carbon flux variability in a coniferous stand in the Duke Forest [Stoy et al., 2005]. Likewise, variability in leaf area index in a deciduous stand became disproportionately important for describing NEE variability at approximately a 20 day time scale [Stoy et al., 2005]. More generally, this split in time scales also corresponds to the approximately 3 week duration of synoptic weather patterns. Unlike the predictable error structure in the annual and diurnal cycles, the error at intermediate time scales was much more variable, with periods of large error appearing within stretches where models performed well (Figure 5). Investigations into model dynamics during these error “events” suggest that in general models are not variable enough and tend to smooth over within-season variability. These intermediate scale failures are more important than their overall contribution to model error would suggest because it is the climatic variability at this scale that gives us insight into a model's capacity to capture stress responses, which are critical for forecasting global change.
 The composite full spectra (Figure 5) indicate that these discrete intermediate-scale error “events” appear to be shared among many of the models, suggesting shared structural errors. Such structural errors may arise due to both the sharing of mathematical formulations among models and due to shared false assumptions arising from our incomplete understanding of ecosystem dynamics. That said, there were no significant correlations between model structure and the performance at the intermediate time scale. The hints of structural effects are related to soil processes, specifically the number of soil layers, which is consistent with our expectation that soil moisture plays an important role in synoptic scale responses. However, the fact that models with multiple soil layers performed worse suggests that additional model complexity does not guarantee superior model performance. Further diagnosis of the environmental drivers of these intermediate-scale errors and the structural characteristics of models that avoid them is clearly warranted, as are empirical analyses of these systems at the scales relevant to resolving ubiquitous model uncertainties. Interestingly, the mean of the ensemble of models had lower error in the spectral domain than almost all of the individual models, suggesting that while there may be shared model errors, there are also many errors across models that average out in the ensemble. This result reiterates the common finding in the time domain that a multimodel ensemble frequently has the best predictive skill [Bates and Granger, 1969; Schwalm et al., 2010a].
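 The way unshared errors average out in an ensemble can be shown with a small time-domain example. This is a sketch with made-up numbers and hypothetical model names, not output from the intercomparison. Two of the three members carry opposite biases that cancel in the mean, while the third carries small irregular errors.

```python
from math import sqrt


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return sqrt(sum(squared) / len(squared))


# Made-up observations and three hypothetical model runs.
obs = [0.0, 1.0, 2.0, 1.0, 0.0]
models = {
    "model_a": [0.5, 1.5, 2.5, 1.5, 0.5],    # constant positive bias
    "model_b": [-0.5, 0.5, 1.5, 0.5, -0.5],  # constant negative bias
    "model_c": [0.2, 0.7, 2.4, 0.6, 0.3],    # small irregular errors
}

# Ensemble mean at each time step.
ensemble_mean = [sum(values) / len(values) for values in zip(*models.values())]

member_rmse = {name: rmse(run, obs) for name, run in models.items()}
ensemble_rmse = rmse(ensemble_mean, obs)
```

 Here the ensemble mean has a lower RMSE than every member. That outcome depends on the errors being at least partly independent; an error shared by all members, such as the intermediate-scale “events” discussed above, would pass through the average unchanged.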
 Also noteworthy at the intermediate time scales is an absence of error peaks at the beginning and end of the growing season. The representation of phenological cycles is a known challenge for models [Richardson et al., 2011], but on average this error was not found to dominate on intermediate time scales and instead error is consistently elevated across the growing season. The absence of a clear phenological signal may be due to the lack of synchrony in phenological errors among models or because phenological errors are showing up as part of the larger annual error.
 By observing the error contribution across temporal bands, the strong model × band effect combined with the nonsignificant model × site effect suggests that individual models are consistent in their error patterns. This is encouraging because it suggests that model failures are not idiosyncratic and site specific. This implies that model improvements are likely to translate to many sites, as opposed to improvements at some sites coming at the expense of reduced performance at others. The significant effects of site and band × site suggest that, as expected, the models taken as a group are performing differently at different sites. A previous analysis of model error in the time domain showed that absolute error varies with biome [Schwalm et al., 2010a], with the smallest errors in well studied biomes such as deciduous and evergreen forests, and the largest errors in less intensively studied systems, such as tundra and shrublands. The current frequency domain analysis suggests that the relative impacts of different time scales do show consistency among models for a given site. The current pattern of site-to-site differences appears a bit idiosyncratic at this point and the forest-only analysis failed to show a significant difference between deciduous and evergreen sites (Figure 4a). At one deciduous forest site, CA-Oas, models demonstrated an unusually large fraction of error at the interannual time scale. Further investigation showed that the interannual error was over five times greater than average at this site, while the annual error was less than 18% below average, suggesting that high interannual error rather than low annual error drove the pattern at this site. Future work with a larger number of sites may be able to clarify site-to-site differences but within the NACP analysis this requires addressing the nontrivial statistical problem that missing model/tower combinations are not random. However, the dominance of error in the annual time scale across sites and models is so clear that increasing the sample size would not provide much additional guidance on how to improve models.
 One of the most novel and important aspects of this analysis was the inclusion of observation error estimates in the evaluation of model-data mismatch across time scales. Observation error is not randomly distributed (Figure 2a), but has a strong spectral signature that follows flux magnitude [Richardson et al., 2008]. Failing to include the magnitude of observation errors would have resulted in qualitatively different conclusions about the significance and relative importance of the different spectral bands. Specifically it would have resulted in an overestimation of the importance of the diurnal cycle and an underestimation of the importance of both the annual cycle and the longer half of the intermediate time scale.
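 The logic of testing model error against observation error can be sketched with a simple Monte Carlo null. The sketch below is not the procedure used in this study. It substitutes a single-frequency Fourier periodogram for the wavelet transform, and it assumes double-exponential flux noise whose scale grows linearly with flux magnitude. The function names and the `base` and `slope` noise parameters are hypothetical.

```python
import cmath
import math
import random


def power_at_period(series, period):
    """Periodogram power of a series at one period (in samples)."""
    n = len(series)
    coeff = sum(
        x * cmath.exp(-2j * math.pi * t / period) for t, x in enumerate(series)
    )
    return abs(coeff) ** 2 / n


def laplace_noise(rng, scale):
    """Double-exponential draw, as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noise_power_threshold(obs, period, n_draws=200, base=0.5, slope=0.2,
                          quantile=0.95, seed=0):
    """Upper quantile of power expected from observation noise alone.

    The noise scale is base + slope * |flux| (both values are assumptions).
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        noise = [laplace_noise(rng, base + slope * abs(x)) for x in obs]
        draws.append(power_at_period(noise, period))
    draws.sort()
    return draws[int(quantile * (n_draws - 1))]


# Synthetic hourly example: the model underestimates the diurnal amplitude.
hours = range(240)
obs = [10.0 * math.sin(2.0 * math.pi * t / 24.0) for t in hours]
model = [6.0 * math.sin(2.0 * math.pi * t / 24.0) for t in hours]
residual = [o - m for o, m in zip(obs, model)]

residual_power = power_at_period(residual, 24.0)
threshold = noise_power_threshold(obs, 24.0)
is_significant = residual_power > threshold
```

 Because the noise scale follows flux magnitude, the null threshold rises at the time scales where fluxes are largest. This is why ignoring observation error would shift the apparent importance of the different bands.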
 One time scale that has received little attention in this analysis is the role of interannual to decadal time scales in model error [Stoy et al., 2009]. We are only beginning to have tower data records long enough to assess the ability of models to capture decadal variability and longer term dynamics [Urbanski et al., 2007]. For spectral methods this is particularly problematic at long time scales because edge effects restrict the usable data to a region known as the “cone of influence,” which means that a valid inference about interannual variability can only be made for a fraction of the time series. Given that the applications of most models are focused on longer scales, an intercomparison of model performance at longer time scales is critical but largely beyond the length of most existing eddy covariance data records and the protocol of the NACP site-level intercomparison. This indicates a critical data need for long-term records at single sites. Also of large value for assessing long-term dynamics are multitower chronosequence studies [Bond-Lamberty et al., 2004; Stoy et al., 2008], though such sequences cannot be explicitly combined in a spectral analysis along the temporal domain of the chronosequence as there is a substitution of time for space.
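 The cost of edge effects at long time scales can be approximated with a short calculation. The sketch assumes a Morlet wavelet, for which the standard e-folding time of the edge effect is the square root of 2 times the scale, and it treats scale and period as interchangeable. Whether these assumptions match the wavelet used in this study is not stated in this section, and the function name `usable_fraction` is hypothetical.

```python
from math import sqrt


def usable_fraction(series_length, scale):
    """Fraction of a record that is free of edge effects at a given scale.

    Both arguments are in the same time units. The edge width assumes
    the Morlet e-folding time of sqrt(2) * scale at each end of the record.
    """
    edge = sqrt(2.0) * scale
    return max(0.0, 1.0 - 2.0 * edge / series_length)


# A 3 year record examined at the annual scale (units: days).
three_year_annual = usable_fraction(3 * 365, 365)

# The same record at a 2 year scale, and a 10 year record at the annual scale.
three_year_biennial = usable_fraction(3 * 365, 2 * 365)
ten_year_annual = usable_fraction(10 * 365, 365)
```

 Under these assumptions a 3 year record retains only about 6% of its length at the annual scale and nothing at a 2 year scale, whereas a 10 year record retains roughly 72% at the annual scale. This is consistent with the near absence of usable interannual power at the 3 year crop site.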
 It is difficult to generalize about why certain classes of models fail. Coarse classifications of model structure provided insight into the variation in the diurnal cycle but proved to be largely uninformative about the annual and intermediate-scale errors that dominate the current analysis. Details of model function and parameterization are model specific and a model-by-model diagnosis is beyond the scope of this study. While it is possible that the NACP intercomparison simply failed to identify the model structural characteristics that drive model performance, especially at longer time scales, it is also important that efforts to diagnose model structural errors account for model parameter uncertainties. The current intercomparison protocol makes it difficult to distinguish models that failed due to misparameterization from those that failed due to inherent structural limitations. The intercomparison did include one model (LoTEC-DA) that made use of data assimilation methods, and not surprisingly this model had the lowest absolute error [Schwalm et al., 2010a]. Likewise, previously published spectral analyses of single models optimized to a single site found very little diurnal or annual error [Braswell et al., 2005], in contrast with the dominant pattern across models in the current analysis. Both these observations suggest that model parameter error may currently be dominating structural error or that many structural errors can be overcome with sufficient parameter flexibility.
 In conclusion, spectral analysis helps clarify when and where models fail, and provides guidelines for prioritizing efforts to improve our collective modeling capacity. Annual errors dominate model error and thus should be the first diagnostic upon which modelers focus. Afterward, modelers should aim to capture the growing season diurnal cycles. Finally, modelers should focus on the identification and attribution of synoptic error events.
 We would like to thank the North American Carbon Program Site-Level Interim Synthesis team and the Oak Ridge National Laboratory Distributed Active Archive Center for collecting, organizing, and distributing the model output and flux observations required for this analysis. Research by K.S. was partly funded by NOAA award NA07OAR4310115. C.K. was supported by the U.S. Department of Energy's Office of Science through the Midwestern Regional Center for the National Institute for Climatic Change Research at Michigan Technological University under award DE-FC02-06ER64158.