Motivated by the need to improve the modeling of land-atmosphere carbon exchange, this study examines the extent to which continuous atmospheric carbon dioxide (CO2) observations can be used to evaluate flux variability at regional scales. The net ecosystem exchange estimates of four terrestrial biospheric models (TBMs) are used to represent plausible scenarios of surface flux distributions, which are compared in terms of their resulting atmospheric signals. The analysis focuses on North America using the nine towers of the continuous observation network that were operational in 2004. Four test cases are designed to isolate the influence on the atmospheric observations of (1) overall flux differences, (2) magnitude differences in flux across large regions, (3) differences in the flux patterns within ecoregions, and (4) flux variability in the near and far field of observation locations. The CO2 signals generated from the different representations of surface flux distribution are compared using a Chi-square test of variance. Differences found to be significant are driven primarily by differences in flux magnitude over large scales, and the fine-scale (primarily temporal) variability of fluxes within the near field of observation locations. Differences in the spatial distribution of fluxes within individual ecoregions, on the other hand, do not translate into significant differences in the observed signals at the towers. Thus, given the types of variation in flux represented by the four TBMs, the atmospheric data may be most informative in the evaluation of aggregated fluxes over large spatial scales (e.g., ecoregions), as well as in the improvement of how the diurnal cycle of fluxes is represented in TBMs, particularly in areas close to tower locations.
 Regional estimates of the imbalance in terrestrial carbon sources and sinks (net ecosystem exchange, NEE) generally have large associated uncertainties, due in part to the spatial complexity of the individual processes controlling carbon exchange at large scales. This is compounded by the fact that, at the global and regional scales, land-atmosphere carbon exchange cannot be measured directly [Cramer et al., 1999]. Climate change predictions and carbon management decisions, however, depend on the ability to appropriately assess and model carbon uptake and release across various spatial scales. As a result, two main modeling approaches have been developed to estimate NEE at regional and continental scales: (1) terrestrial biospheric models (TBMs), which are based on current mechanistic understanding of how carbon is exchanged within ecosystems; and (2) atmospheric inverse models, which use measured atmospheric concentrations of CO2, coupled with a transport model to infer surface flux distributions. In some cases, a combination of these approaches is used to optimize TBM parameters using atmospheric observations of CO2.
TBMs have become an integral tool for better understanding the mechanisms controlling carbon exchange across terrestrial ecosystems [Waring and Running, 2007]. Although TBMs can be used to link carbon sources and sinks to explicit ecosystem processes, they depend heavily on their simplifying assumptions, environmental driving data and initial conditions, as well as the way in which the processes controlling carbon exchange are formulated and scaled within the model. A particular model may do well at reproducing fluxes at a given location; however, TBMs often differ considerably in their estimates of fluxes over larger regions [e.g., Schimel et al., 1997; Cramer et al., 1999; D. N. Huntzinger et al., North American Carbon Project (NACP) Regional Interim Synthesis: Terrestrial Biospheric Model Intercomparison, manuscript in preparation, 2011]. In addition, current understanding of the processes controlling carbon exchange is not sufficient to rank models in terms of which is “best” at representing current fluxes or predicting carbon exchange under future climate conditions [Heimann et al., 1998; McGuire et al., 2001; Melillo et al., 1995]. One way to assess model performance is to evaluate TBMs against atmospheric CO2 observations, where an atmospheric transport model is used to transport surface fluxes to observation locations. The signal from the surface fluxes can be compared to the available observations, for example in terms of the depth and timing of the seasonal cycle, as well as interannual variations due to climatic factors, land-use change, and disturbances [Heimann et al., 1998; Nevison et al., 2008; Dargaville et al., 2002; Denning et al., 2003]. This type of evaluation approach depends, in part, on the ability of the atmospheric observations to detect differences between the surface flux distributions of various models.
 Conversely, flux estimates from atmospheric inverse models are more comprehensive, in the sense that all ecosystem sources and sinks, fossil fuel emissions, and any other processes emitting or absorbing CO2 are, in principle, captured in the atmospheric signal [Global Change Project (GCP), 2010]. On the other hand, although inverse models predict fluxes that are quantitatively consistent with atmospheric measurements, atmospheric mixing, coupled with the sparseness of observations, leaves the problem ill-posed and frequently underconstrained [e.g., Enting, 2002]. Thus, multiple sets of surface flux estimates may be consistent with a single record of observed CO2 concentrations [e.g., Kaminski and Heimann, 2001; Enting, 2002]. Generally, an additional constraint, such as explicit prior flux estimates from a TBM, must be included within the inversion to make the problem more tractable [e.g., Kaminski et al., 1999; Rödenbeck et al., 2003; Gurney et al., 2004; Baker et al., 2006]. In underconstrained regions, given the sparseness of the atmospheric network, many regional estimates from atmospheric inversions tend to revert to these explicit prior flux estimates, and, as a result, flux estimates can strongly reflect the characteristics of the specific TBM used within the inversion [e.g., Gurney et al., 2003; Butler et al., 2011]. Thus, the accuracy of inversion results depends on a number of factors, including the resolution (spatial and temporal) of the fluxes being estimated [Kaminski et al., 2001; Peylin et al., 2005], the accuracy of the transport model and prior estimates of flux [e.g., Gurney et al., 2003; Rödenbeck et al., 2003; Baker et al., 2006], the density of the monitoring network, and the sensitivity of available atmospheric observations to the underlying flux distribution [e.g., Gloor et al., 2001; Gerbig et al., 2009; GCP, 2010]. 
To reduce aggregation errors, there has been an increasing focus in inversions on estimating fluxes at finer spatial and temporal resolutions in order to better resolve the potential responses of various vegetation types and the impact of human activities on regional carbon budgets [e.g., Gerbig et al., 2003a; Rödenbeck et al., 2003; Michalak et al., 2004; Peylin et al., 2005; Peters et al., 2007; Lauvaux et al., 2008; Gourdji et al., 2010]. However, it remains uncertain, at finer spatial scales, whether a unique surface flux distribution can be derived from concentration measurements given the diffusive nature of atmospheric transport.
 The objective of the work presented here is to determine how much information atmospheric CO2 observations can provide in either estimating surface flux distributions at regional scales (e.g., from inversions), or evaluating preexisting sets of surface flux estimates (e.g., from TBMs) across North America. This work is motivated by the need to improve both forward and inverse models. For TBMs, there is a need to validate flux estimates against observational data; thus, we examine the extent to which atmospheric CO2 concentration data can be used to evaluate TBM model performance. In order to evaluate the relative merits of two or more TBMs, or to validate a single TBM using atmospheric data, however, the atmospheric data must be able to discriminate between models. Therefore, this manuscript evaluates whether atmospheric data can be used to do this, given the degree of variability between fluxes as estimated by different models and the errors associated with the ability to reproduce observations (i.e., model-data mismatch). Similarly, the recent focus in inverse modeling studies on finer-scale flux estimation raises questions about the ability of the atmospheric network to provide sufficient information to accurately infer fine-scale flux variations beyond those specified by any explicit prior flux estimates.
To address the above objectives, the NEE estimates of four TBMs are used to represent plausible scenarios of surface flux distributions and magnitude, which are compared in terms of their resulting atmospheric signals. The analysis focuses on North America and uses the nine towers of the 2004 continuous monitoring network. Although the network has expanded since that time, the 2004 network is representative of the types of towers that are part of the larger, current observational network. Therefore, the 2004 network can be used to identify more general rules about what a tower is likely to “see” or detect, depending on its height, location, regional meteorology, and the ecoregions or land-cover types that surround it.
 Comparing the CO2 signals originating from different surface flux distributions only indicates whether the concentration measurements, in general, can distinguish between different representations (e.g., model estimates) of flux. It does not provide information as to what is driving the differences (e.g., spatial or temporal variations in flux estimates, differences in flux magnitude), and to what extent areas far away from towers contribute to these differences. A biospheric modeler may be interested in not only knowing whether CO2 measurements could be used to validate a given model, but over what regions such an analysis is appropriate and what type of information the CO2 measurements can provide (e.g., spatial variability, temporal variability). Similarly, an inverse modeler may be interested in knowing how much information can be extracted from the CO2 measurements, and how those measurements inform flux estimates over various spatial and temporal scales.
Therefore, we use four case studies to examine the spatial (and, to some extent, temporal) features of the distribution of surface carbon flux that can be detected by continuous tower observations of CO2. Case 1 examines whether the atmospheric CO2 measurements can detect overall differences between fluxes from different biospheric models, which is a necessary, but not sufficient, condition for evaluating fluxes with atmospheric CO2 data. Then, Cases 2 through 4 involve manipulating the surface flux distribution estimated by the models to isolate the specific influence of ecosystem-scale variability, subecosystem-scale variability, and variability in the near versus far field of observation locations on CO2 observations. Combined, the results from the different case studies are used to evaluate the extent to which atmospheric data can be used to evaluate and/or infer flux variability at various scales.
 To meet the objectives outlined earlier, synthetic CO2 observations were generated for the highest sampling elevation of the nine towers that were collecting continuous CO2 concentration measurements in North America during 2004 (Table 1). The highest measurement locations were used because they sample the most well mixed air and, thereby, better represent the influence of fluxes over the largest spatial footprint. These synthetic CO2 signals were generated using different representations of surface carbon flux from four TBMs, and are influenced only by fluxes occurring within the domain of study. Therefore, there is no need to consider CO2 concentrations of air coming into the North American domain, or the impact of fossil fuel emissions on the synthetic CO2 signals. The signals were compared and evaluated using a combination of case studies (section 2.3) and statistical significance testing (section 2.4).
Table 1. The 2004 North American Continuous CO2 Monitoring Network Tower Locations, Heights, and Estimated Model-Data Mismatch Variances
Park Falls, Wisconsin, USA
Moody, Texas, USA
Sable Island, Nova Scotia, Canada
Barrow, Alaska, USA
Norman, Oklahoma, USA
Petersham, Massachusetts, USA
Argyle, Maine, USA
Fraserdale, Ontario, Canada
Candle Lake, Saskatchewan, Canada
2.1. Sensitivity of CO2 Observations to Surface Fluxes
Atmospheric transport was modeled using the Stochastic Time-Inverted Lagrangian Transport Model (STILT) [Lin et al., 2003] driven by analyzed winds from the Weather Research and Forecasting (WRF) model version 2.2 [Skamarock et al., 2005]. STILT simulates the influence of upwind fluxes on observations by tracking the evolution of an ensemble of air parcels backward in time [Lin et al., 2003]. The sensitivities of 3-hourly averaged atmospheric measurements at the towers to 3-hourly varying upwind surface fluxes at a 1° by 1° resolution over North America were derived following the methods outlined by Lin et al. and Gourdji et al. The resulting footprint describes how a unit flux in a particular grid cell of the domain at a particular time affects the CO2 concentration at the tower. The integrated sensitivities or footprints for each tower have units of ppm/(μmol m−2 s−1), and represent the influence of fluxes (in μmol m−2 s−1) that occurred up to 10 days prior to the measurement on CO2 concentrations (ppm) at the tower. An example of the combined footprint for all 9 towers is shown in Figure 1a.
2.2. Synthetic Observations
 The concentration footprints quantify the sensitivity of atmospheric observations to upwind fluxes, and are independent of the surface fluxes themselves. To simulate synthetic CO2 signals at the towers, the footprints were multiplied by a set of surface fluxes. Distinct representations of land-atmospheric flux distribution were defined using the 3-hourly, 1° × 1° surface flux distributions predicted by four TBMs: the Simple Biosphere Model (SiB3.0) [Sellers et al., 1986; Baker et al., 2008], Carnegie-Ames-Stanford-Approach as configured for the Global Fire Emissions Database v2 project (CASA GFEDv2) [Potter et al., 1993; Randerson et al., 1997; van der Werf et al., 2006], Organizing Carbon and Hydrology in Dynamic Ecosystems (ORCHIDEE) [Krinner et al., 2005], and Vegetation Global Atmosphere and Soil (VEGAS2) [Zeng, 2003; Zeng et al., 2005]. Given the aims of this study, the choice of TBM is somewhat subjective. However, these TBMs were chosen because they (1) provide distinct temporally and spatially variable maps of land-atmospheric surface flux; (2) have been widely applied across a variety of regions; and (3) have been used as prior estimates in inverse modeling studies [e.g., Gurney et al., 2003; Chevallier et al., 2006; Peters et al., 2007; Wang et al., 2007]. Monthly VEGAS2 and CASA GFEDv2 fluxes were temporally downscaled to 3-hourly resolution using the methods of Olsen and Randerson  and net shortwave radiation and near-surface temperature from the NASA Global Land Data Assimilation System (GLDAS) [Rodell et al., 2004]. The different flux representations from the four TBMs are shown in Figure 2, along with an example of the overall differences in their flux estimates, shown as the July average across-model standard deviation in 3-hourly fluxes.
 Three-hourly, synthetic observation signals were generated for each tower in Table 1. For the two tall and MBL towers, observations were generated throughout the day and night. Atmospheric transport models generally have difficulty in simulating the nocturnal planetary boundary layer (PBL) height [Geels et al., 2007], which can result in biased flux estimates when using nighttime data. Therefore, following Carouge et al.  and Gourdji et al. , only afternoon (1800–2400 UTC) footprints were used to generate concentrations at the shorter towers (≤100 m) (Table 1).
 Although errors in the transport model may impact the resultant concentrations, the same atmospheric transport model (i.e., same footprints) is used to create the four synthetic concentration signals, and differences in the signals are, therefore, due solely to the underlying fluxes. While a different transport model could yield slightly different conclusions, WRF/STILT was selected here because it has been applied in several studies aimed at estimating CO2 sources and sinks within North America [e.g., Gerbig et al., 2003b; Gourdji et al., 2010; Lin et al., 2004].
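The multiplication of footprints by surface fluxes described in this section can be illustrated with a minimal sketch. The function name, the flattened list layout (one footprint row per observation, one entry per grid cell and 3-hourly flux period), and all numbers are assumptions made for illustration; they are not taken from the WRF/STILT setup used in the study.

```python
def synthetic_signal(footprints, fluxes):
    """Convolve footprints with surface fluxes to get synthetic CO2 signals.

    footprints[i][j]: sensitivity of observation i to flux element j,
                      in ppm per (umol m-2 s-1)
    fluxes[j]:        flux element j (one grid cell at one 3-hourly period),
                      in umol m-2 s-1
    Returns one concentration signal value (ppm) per observation.
    """
    signal = []
    for row in footprints:
        if len(row) != len(fluxes):
            raise ValueError("footprint row and flux vector differ in length")
        signal.append(sum(h * s for h, s in zip(row, fluxes)))
    return signal


# Hypothetical toy case: 2 observations, 3 flux elements.
H = [[0.5, 0.25, 0.0],
     [0.0, 0.5, 0.25]]
flux_model_a = [-2.0, -4.0, 0.0]   # stand-in for one TBM
flux_model_b = [-1.0, -2.0, 4.0]   # stand-in for another TBM

# The same footprints are applied to both flux fields, so any difference
# between the two signals comes from the fluxes alone.
signal_a = synthetic_signal(H, flux_model_a)
signal_b = synthetic_signal(H, flux_model_b)
```

Because the footprint matrix is shared, transport error cancels when the two signals are compared, which is the property the paragraph above relies on.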
2.3. Case Studies
 Four test cases were designed to isolate the influence on the atmospheric measurements of: (1) overall flux differences (Case 1); (2) magnitude differences in flux across large regions (Case 2); (3) differences in flux pattern within ecoregions (Case 3); and (4) flux variability in the near versus far field of measurement locations (Case 4). Cases 2 through 4 require manipulation of the surface flux distributions from the four TBMs in order to isolate specific influences, as described below and illustrated in Figure 3.
2.3.1. Case 1: Differences in Both the Distribution and Magnitude of Fluxes
 Case 1 examines the overall combined influence of surface flux magnitude and spatial distribution on CO2 concentrations by using the unique flux distribution from each TBM to generate synthetic observations at the 9 tower locations (Figure 3a). As such, Case 1 examines whether the concentration measurements resulting from flux distributions as estimated by different TBMs are different from one another. Cases 2 through 4 are then used to determine the relative importance of various components of across-model flux variability on the observed CO2 concentration variability in the synthetic observation signals.
2.3.2. Case 2: Subecoregion-Scale Variability Removed
 To isolate the influence of differences in regional flux magnitude on the generated CO2 signals, the flux distributions of the TBMs are normalized to remove subecosystem-scale variability, i.e., the spatial variability within each ecoregion (Figure 3b). North America is divided into spatially contiguous ecoregions based on the work of Olson et al.  and the model-specific weekly mean flux is calculated for each ecoregion and TBM. These mean fluxes are then applied to every 3-hourly period and 1° by 1° cell within that ecoregion over the weekly period, thereby removing subecoregion-scale spatial and subweekly temporal variability, while preserving differences in regional flux magnitude.
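The Case 2 manipulation can be sketched as follows. This is an illustrative sketch only: the `flux[t][c]` layout, the ecoregion labels, the use of area weights in the mean, and the 56-step week (eight 3-hourly periods per day over seven days) are assumptions, and the function name is hypothetical.

```python
from collections import defaultdict

STEPS_PER_WEEK = 56  # 8 three-hourly steps per day x 7 days


def ecoregion_weekly_mean(flux, ecoregion, area, steps_per_week=STEPS_PER_WEEK):
    """Replace every flux value by the mean of its ecoregion and week.

    flux[t][c]   : flux at 3-hourly step t and grid cell c (umol m-2 s-1)
    ecoregion[c] : ecoregion label of cell c
    area[c]      : area weight of cell c (assumed weighting, any consistent unit)
    """
    smoothed = [[0.0] * len(row) for row in flux]
    for start in range(0, len(flux), steps_per_week):
        steps = range(start, min(start + steps_per_week, len(flux)))
        weighted_sum = defaultdict(float)
        weight = defaultdict(float)
        # Accumulate the area-weighted sum per ecoregion over the week.
        for t in steps:
            for c, value in enumerate(flux[t]):
                weighted_sum[ecoregion[c]] += area[c] * value
                weight[ecoregion[c]] += area[c]
        # Assign the weekly ecoregion mean to every step and cell.
        for t in steps:
            for c in range(len(flux[t])):
                region = ecoregion[c]
                smoothed[t][c] = weighted_sum[region] / weight[region]
    return smoothed


# Hypothetical toy case: 4 time steps, 3 cells, 2-step "weeks".
toy_flux = [[1.0, 3.0, 10.0], [3.0, 5.0, 20.0], [0.0, 0.0, 0.0], [2.0, 2.0, 4.0]]
toy_regions = ["forest", "forest", "crop"]
toy_area = [1.0, 1.0, 1.0]
case2_flux = ecoregion_weekly_mean(toy_flux, toy_regions, toy_area, steps_per_week=2)
```

After this step, the only differences left between models are differences in weekly ecoregion-mean flux.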
2.3.3. Case 3: Normalized Net Ecoregion-Scale Flux
 To examine how the distribution of fluxes within ecoregions influences CO2 concentrations, the flux distribution of each TBM is normalized to have the same net area-weighted weekly flux by ecoregion; however, the distribution of fluxes within each of these ecoregions remains unique to each model (Figure 3c); that is, the 3-hourly temporal and 1° by 1° spatial variations in surface flux remain defined by the TBMs.
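The text does not state how the normalization is carried out, so the sketch below shows one possible approach under stated assumptions: each model's fluxes within a single ecoregion and week are shifted by a constant so that every model shares the same area-weighted mean, with the across-model mean used as the common target. Both the additive shift and the choice of target are assumptions of this sketch, and the function names are hypothetical.

```python
def weighted_mean(block, area):
    """Area-weighted mean of block[t][c] over all time steps and cells."""
    total = sum(a * v for row in block for a, v in zip(area, row))
    return total / (len(block) * sum(area))


def normalize_net_flux(blocks, area):
    """Shift each model's block so all models share one area-weighted mean.

    blocks: dict mapping model name -> block[t][c] for ONE ecoregion and
            ONE week (3-hourly steps t, grid cells c).
    The constant shift leaves each model's own spatial and temporal
    pattern around its mean untouched.
    """
    means = {name: weighted_mean(block, area) for name, block in blocks.items()}
    target = sum(means.values()) / len(means)  # assumed common target
    return {
        name: [[v + (target - means[name]) for v in row] for row in block]
        for name, block in blocks.items()
    }


# Hypothetical toy case: two models, 2 time steps, 2 equal-area cells.
toy_blocks = {
    "model_a": [[1.0, 3.0], [3.0, 5.0]],
    "model_b": [[-2.0, 0.0], [0.0, -2.0]],
}
toy_area = [1.0, 1.0]
case3_blocks = normalize_net_flux(toy_blocks, toy_area)
```

A multiplicative rescaling would be an alternative, but it is ill-defined when a model's net ecoregion flux is close to zero, which is why an additive shift is used here.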
2.3.4. Case 4: Isolating Influence of Near- Versus Far-Field Fluxes
 The near field and far field are terms used in the literature to refer to those areas within close proximity to atmospheric observations and those areas that are farther away, respectively. For example, Gerbig et al.  defines the near field as the area within about 50 km from the measurement location. Their work indicates that the spatial variations of surface fluxes in the near field of measurement locations contribute significantly to the observed variability in CO2 concentrations [Gerbig et al., 2003b, 2009]. While observations may be more sensitive to the spatial and temporal variability of fluxes within the near field of tower locations, the ability of the observations to detect differences in the spatial distribution of fluxes in the far field is less well known. Thus, Case 4 was designed to assess the impact of the subecosystem-scale variability beyond the near field of the towers.
 Here, the near field is defined as those 1° by 1° grid cells contributing to the greatest or upper 30% of the sensitivities of the atmospheric observations (Figure 1b), resulting in a near field that is much larger than that discussed by Gerbig et al. . In contrast to the high-resolution grid (20 km) near the towers used by Gerbig et al. [2006, 2009], here we examine the influence of 1° by 1° regional flux variations on CO2 concentrations. Thus, the definition of the near field must be based on this grid. Furthermore, depending on the wind regime and the height of the boundary layer, a given measurement location is sensitive to different regions (i.e., fluxes) over time. As a result, the spatial extent and shape of the near field is expected to change with atmospheric transport. Therefore, here, the cells defining the near field of observation locations are allowed to vary by week, and we define the near field in terms of those areas to which the observations are most sensitive, rather than specifying a fixed area or distance from the tower.
 For each tower and each week, the land cells are sorted in terms of how sensitive the tower measurements are to surface fluxes from that cell. Cells are ordered from those having the greatest influence to those having the least. The influences (i.e., sensitivities) are cumulatively summed and divided by the overall or total sensitivity for that tower to the entire domain. Cells that contributed to the top 30% of the sensitivity are defined as the near field for that tower and that week. Figure 1b shows an example of the combined near field for all of the 9 towers in the 2004 network. The far field is defined as all other land cells outside the near field. Based on the above criteria for defining the near field, the near field constitutes, on average, 14% of the land cells in North America.
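The sorting and cumulative-sum procedure above can be sketched in a few lines. The function name and the input layout (one weekly integrated sensitivity per land cell for a single tower) are assumptions, as is the choice to include the cell that crosses the 30% threshold.

```python
def near_field_cells(sensitivity, fraction=0.30):
    """Return indices of the cells that make up the top `fraction` of
    a tower's total sensitivity for one week.

    sensitivity[c]: weekly integrated footprint of the tower for land cell c.
    Cells are taken in order of decreasing sensitivity until the cumulative
    share reaches `fraction`; the cell that crosses the threshold is kept.
    """
    total = sum(sensitivity)
    order = sorted(range(len(sensitivity)),
                   key=lambda c: sensitivity[c], reverse=True)
    near, cumulative = [], 0.0
    for c in order:
        if cumulative >= fraction * total:
            break
        near.append(c)
        cumulative += sensitivity[c]
    return near


# Hypothetical sensitivities for five land cells around one tower.
toy_sensitivity = [1.0, 4.0, 2.0, 0.5, 2.5]
near = near_field_cells(toy_sensitivity)
far = [c for c in range(len(toy_sensitivity)) if c not in near]
```

Repeating this per tower and per week, then taking the union over towers, gives a near field whose extent and shape follow the transport rather than a fixed radius.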
 To isolate the impact of the far field, an across-model, weekly mean is applied to every 3-hourly period and cell within the near field of the tower or observation locations. The subecosystem and temporal variability of the far field is then defined as described in Case 3 (Figure 3d).
2.4. Significance Testing
 One of the goals of this analysis is to examine the potential of using real atmospheric CO2 measurements to help validate or compare different TBMs. However, even if surface fluxes were perfectly known, a mismatch between the modeled (i.e., surface flux convolved with an atmospheric transport model) and observed CO2 observations is expected, termed “model-data mismatch” in the inverse modeling literature [e.g., Bousquet et al., 1999; Gurney et al., 2002; Peylin et al., 2002; Michalak et al., 2005]. This mismatch is primarily due to transport model errors, but also includes aggregation, representation, and measurement errors [Kaminski et al., 2001; Engelen et al., 2002], as well as possible errors introduced by choices in the inversion setup (e.g., correlation structure of fluxes).
In order to assess whether the differences between the synthetic CO2 concentrations modeled in this analysis using the different TBMs are significant, the synthetic signals are compared within the context of expected or estimated model-data mismatch error. Thus, the analysis takes into account the sources of error described in the previous paragraph, and the impact these uncertainties can have on the resultant CO2 signals seen at the towers. In essence, comparing signals within the context of the expected model-data mismatch errors makes it possible to assess how different the signals have to be before this difference can confidently be attributed to differences in flux rather than errors inherent in the system. Although the current study factors out model-data mismatch errors by using the same transport model to generate synthetic measurements for the four TBMs, the resulting signals are still compared in light of the types of uncertainties that would be present in comparisons with real atmospheric data. Model-data mismatch can be estimated in a number of ways, such as obtaining values from the literature coupled with analysis of the atmospheric measurements [e.g., Schuh et al., 2010; Butler et al., 2011], assessing the temporal variability in the data around a smoothed curve and assuming that this variability is representative of model-data mismatch errors [e.g., Bousquet et al., 1999; Gurney et al., 2002], or using a statistical optimization method with the atmospheric measurements [e.g., Michalak et al., 2005; Gourdji et al., 2010]. A more detailed summary of how model-data mismatch variance is quantified in inversion studies is provided by Michalak et al.
In order to obtain realistic estimates of model-data mismatch error, it is necessary to use real atmospheric measurements. Because this study compares synthetic concentration signals generated using a single transport model, using real atmospheric concentration measurements is the only way to estimate transport model error and any aggregation error below the 3-hourly and 1° by 1° temporal and spatial resolutions. For the current study, we use 3-hourly real atmospheric observations taken during the 2004 growing season at the towers included in this study (Table 1), and optimize the model-data mismatch (σR2) using the Restricted Maximum Likelihood (RML) approach as implemented by Gourdji et al. at a 3-hourly flux resolution. This provides a conservative estimate of model-data mismatch variances for biospheric fluxes, because model-data mismatch evaluated using real data also includes the influence of uncertainty in fossil fuel emissions. Although the model-data mismatch variances are specific to WRF/STILT as estimated using the RML approach, we expect these to be similar to variances obtained for other contemporary atmospheric transport models.
In general, model-data mismatch tends to be relatively low at tall and marine boundary layer towers (Table 1), where the air is relatively well mixed, whereas model-data mismatch can be much higher at shorter towers, where local effects tend to have a greater influence on measured concentrations. This is particularly true of shorter towers close to areas with large variability in the flux distribution.
Differences among pairs of synthetic observation signals from different TBMs are quantified using their mean squared difference (MSD). The MSD incorporates both the variance and the bias or offset between the two signals, and thus quantifies how different, on average, the model signals are from one another. As such, it is directly comparable to model-data mismatch variance, σR2, or how different, on average, we would expect modeled CO2 concentrations to be from the true concentrations. If the MSD between two synthetic concentration signals is greater than the estimated model-data mismatch variance (σR2) at a given tower, then differences in the underlying flux distribution are more likely to be detectable by the real atmospheric observations at that tower. Being able to detect these differences is important for distinguishing between competing biospheric models.
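The statement that the MSD combines variance and bias can be written out explicitly. The notation below is introduced here for illustration and is an assumption: y_i^A and y_i^B are the 3-hourly synthetic concentrations from two TBMs at one tower, d_i is their difference, and n is the number of observations in the period.

```latex
\[
d_i = y_i^{A} - y_i^{B}, \qquad
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i
\]
\[
\mathrm{MSD} = \frac{1}{n}\sum_{i=1}^{n} d_i^{2}
 = \underbrace{\frac{1}{n}\sum_{i=1}^{n}\left(d_i - \bar{d}\right)^{2}}_{\text{variance of the difference}}
 + \underbrace{\bar{d}^{\,2}}_{\text{squared bias}}
\]
```

Two signals with identical variability but a constant offset therefore still have a nonzero MSD, which is why the MSD can be set against σR2 directly.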
The statistical significance of the difference between pairs of synthetic observation signals is quantified using an upper-tailed, Chi-square test of variance, where the Chi-square test statistic (χ2) is defined as
$$\chi^2 = \frac{\nu \, \mathrm{MSD}}{\sigma_R^2}$$
where ν is the number of degrees of freedom (n-1), n is the number of 3-hourly synthetic concentration observations over the examined period t at a particular tower, MSD (ppm2) is the mean squared difference in 3-hourly concentrations between a pair of the synthetic observation time series over that same period, and σR2 (ppm2) is the model-data mismatch variance for a given tower.
The analysis is performed monthly, on all possible pairs of synthetic observations in each test case. Statistical significance is determined within a hypothesis testing framework, where the null hypothesis is that the MSD between signals at a given tower is equal to the estimated model-data mismatch variance (σR2) for that tower. The alternative hypothesis is that the MSD between the signals is greater than σR2. For each pair of signals, a p value for the test statistic (χ2) is calculated using a Chi-square distribution with ν degrees of freedom. The p value represents the significance of the difference between two synthetic concentration signals, and, therefore, the lower the p value, the more significant the result. Thus, for each time period, the test establishes the likelihood that the difference between any two synthetic CO2 signals would be detectable.
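A minimal sketch of the test for one tower and one month is given below. It assumes the standard form of the variance test statistic, χ2 = ν · MSD / σR2 with ν = n − 1, built from the quantities defined above. To stay within the Python standard library, the upper-tail p value is computed with the Wilson-Hilferty normal approximation rather than the exact Chi-square distribution; this substitution, the function names, and the numbers are assumptions of the sketch.

```python
import math


def mean_squared_difference(signal_a, signal_b):
    """Mean squared difference (ppm^2) between two equal-length signals."""
    if len(signal_a) != len(signal_b) or not signal_a:
        raise ValueError("signals must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(signal_a, signal_b)) / len(signal_a)


def chi_square_statistic(msd, sigma_r2, n):
    """Test statistic nu * MSD / sigma_R^2 with nu = n - 1 (assumed form)."""
    return (n - 1) * msd / sigma_r2


def upper_tail_p_value(chi2, nu):
    """Approximate P(X >= chi2) for X ~ Chi-square(nu).

    Uses the Wilson-Hilferty approximation: (X / nu)^(1/3) is roughly
    normal with mean 1 - 2/(9 nu) and variance 2/(9 nu). This is an
    approximation, adequate for the large nu of a month of 3-hourly data.
    """
    if chi2 <= 0:
        return 1.0
    spread = 2.0 / (9.0 * nu)
    z = ((chi2 / nu) ** (1.0 / 3.0) - (1.0 - spread)) / math.sqrt(spread)
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical month: 31 days x 8 three-hourly observations.
n_obs = 248
msd_pair = 6.0      # ppm^2, hypothetical MSD between two synthetic signals
sigma_r2 = 4.0      # ppm^2, hypothetical model-data mismatch variance
chi2 = chi_square_statistic(msd_pair, sigma_r2, n_obs)
p_value = upper_tail_p_value(chi2, n_obs - 1)
```

Where SciPy is available, `scipy.stats.chi2.sf(chi2, nu)` returns the exact upper-tail probability and can replace the approximation.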
3. Results and Discussion
 The weekly averaged synthetic CO2 signals generated using the different TBMs for the examined test cases are shown for four of the nine towers in the 2004 North American continuous atmospheric monitoring network in Figure 4. These four towers were chosen to provide an example of a tall tower (LEF); a marine boundary layer tower (SBL); a tower in an agricultural region (ARM); and one located in a highly productive and spatially variable forested region close to heavily populated areas with large fossil fuel sources (HFO) (see Table 1 for details). Synthetic observations at the temporal resolution used in the case studies (3-hourly) are presented for July 2004 in the auxiliary material Figure S1. The results from each case study are discussed below, along with the implications of these results for the evaluation and comparison of TBM estimates using atmospheric data, as well as the potential implications of the results for the inference of fluxes through atmospheric inversions.
3.1. Case 1: Differences in Both the Distribution and Magnitude of Fluxes
 Case study 1 is designed to assess the ability of the atmospheric network to detect differences in the CO2 signal resulting from different representations of surface flux (section 2.3, Figure 3a). If the synthetic signals generated from these different flux distributions are not statistically significantly different from each other, then atmospheric CO2 measurements may not be useful in evaluating or comparing flux estimates from biospheric models. It would also suggest that inversions cannot be used to infer the types of differences in fluxes represented by the TBMs examined in this study.
The unique flux distributions from the four TBMs yield different weekly averaged synthetic observation signals, both in terms of the depth and timing of their seasonal cycles (Figure 4). Using 3-hourly concentrations (e.g., auxiliary material Figure S1), the six possible signal pairs are compared to evaluate the overall differences in atmospheric signals generated by the examined TBMs. On a weekly basis, the MSD between the 3-hourly synthetic CO2 time series among these pairs is quite variable (Figure 5). The significance of these differences is examined on a monthly time interval using the model-data mismatch variance estimated from real-concentration data as described in section 2.4. At most towers, the four TBMs generate statistically different 3-hourly synthetic CO2 time series during most months of the year (Figure 6). The differences among the synthetic CO2 signals are less significant in the winter months, when there is less temporal variability in the 3-hourly concentrations (and fluxes), and smaller overall differences in the spatial distribution of fluxes among the models. Differences between synthetic signals are not as significant at BRW, where the land cells surrounding the tower generally have a smaller influence on atmospheric CO2 concentrations measured at the tower (e.g., Figure 1). Differences are also less detectable at towers with a higher model-data mismatch variance. Towers such as HFO and AMT are located in areas with stronger variations in fluxes (e.g., highly productive forested regions or towers close to large fossil fuel sources) and tend to have a higher model-data mismatch variance due to greater representation and/or aggregation errors [Kaminski et al., 2001; Gerbig et al., 2009; Gourdji et al., 2010].
 The magnitude of the estimated model-data mismatch variance has a direct impact on the significance testing results. This can be visualized by imagining moving the black line up or down in Figure 5 while keeping the MSD among the six pairs of synthetic signals constant. The closer the black line moves to zero, the more often the signal differences exceed the model-data mismatch variance, and therefore, the more often these differences would be detectable by the measurements. For a given grid or spatial resolution of fluxes, it is expected that as the ability to model atmospheric transport improves (i.e., σ_R² decreases), differences in atmospheric concentrations, and therefore the underlying fluxes, will become more detectable.
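 The dependence of detectability on the model-data mismatch variance can be illustrated with a minimal sketch. It is not the implementation used in this study (the test is specified in section 2.4). It assumes that the pointwise differences between two 3-hourly signals are independent with variance σ_R² under the null hypothesis, so that n × MSD / σ_R² is compared with a chi-square critical value with n degrees of freedom. The function names, the 95% level, and the Wilson-Hilferty approximation to the critical value are choices made only for this illustration.

```python
from statistics import NormalDist


def mean_squared_difference(a, b):
    """MSD between two equally sampled concentration series (ppm^2)."""
    if len(a) != len(b) or not a:
        raise ValueError("series must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def chi2_critical(dof, p=0.95):
    """Approximate chi-square quantile (Wilson-Hilferty approximation)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3


def signals_differ(a, b, sigma_r2, p=0.95):
    """True if the MSD exceeds what sigma_r2 alone would explain.

    Assumption for this sketch: independent differences, so that
    n * MSD / sigma_r2 follows a chi-square law with n degrees of freedom.
    """
    n = len(a)
    statistic = n * mean_squared_difference(a, b) / sigma_r2
    return statistic > chi2_critical(n, p)


# One month of 3-hourly values (30 days x 8 steps), constant 2 ppm offset.
signal_a = [0.0] * 240
signal_b = [2.0] * 240
print(signals_differ(signal_a, signal_b, sigma_r2=1.0))
print(signals_differ(signal_a, signal_b, sigma_r2=9.0))
```

 In this toy example the same 2 ppm offset is flagged as significant when the assumed mismatch variance is 1 ppm² but not when it is 9 ppm², which mirrors the effect of moving the black line in Figure 5. Real 3-hourly residuals are autocorrelated, so the effective number of degrees of freedom would be smaller than n.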
 In general, the results from Case 1 are encouraging, in that the atmospheric measurements can detect overall differences in CO2 concentrations resulting from competing flux distributions (e.g., different TBMs). The differences in flux among the TBMs examined in this analysis are assumed to be comparable to the discrepancy between some “true” flux distribution and flux as represented by a given TBM. The degree to which the atmospheric data can be used to evaluate models (e.g., Is one model better than another? Are the fluxes predicted by a given model compatible with the atmospheric observations?) depends on whether the atmospheric data can detect differences among competing flux distributions. The ability to evaluate models also depends on the scales for which the atmospheric data are most informative about the underlying flux distribution (e.g., close to the tower, large regional flux differences). Below, the remaining cases examine how the temporal and spatial variability in surface flux translates into the variability observed in atmospheric CO2 concentration seen in Case 1.
3.2. Case 2: Subecoregion-Scale Variability Removed
 Land-atmosphere carbon fluxes exhibit variability at different scales. Case 2 examines the ability of atmospheric concentration measurements to detect weekly differences in large-scale (net) carbon flux (section 2.3, Figure 3b). A weekly time period was chosen for averaging fluxes to preserve the large-scale seasonal cycle of fluxes, while isolating magnitude differences in flux across large regions.
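 The kind of aggregation described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of section 2.3: every grid cell is assumed to have equal area, each cell carries a single ecoregion label, and all names (`smooth_to_ecoregion_weekly_mean`, `region_of_cell`) are hypothetical.

```python
from collections import defaultdict

STEPS_PER_WEEK = 7 * 8  # eight 3-hourly steps per day


def smooth_to_ecoregion_weekly_mean(flux, region_of_cell,
                                    steps_per_week=STEPS_PER_WEEK):
    """Replace each 3-hourly cell flux with its ecoregion's weekly mean.

    flux: dict mapping cell id -> list of 3-hourly fluxes (equal lengths)
    region_of_cell: dict mapping cell id -> ecoregion label
    Assumes equal-area cells; a real grid would need area weights.
    """
    if not flux:
        return {}
    cells_in_region = defaultdict(list)
    for cell in flux:
        cells_in_region[region_of_cell[cell]].append(cell)

    n_steps = len(next(iter(flux.values())))
    smoothed = {cell: [0.0] * n_steps for cell in flux}
    for start in range(0, n_steps, steps_per_week):
        stop = min(start + steps_per_week, n_steps)
        for cells in cells_in_region.values():
            # Pool all cells and time steps of this region and week.
            block = [v for c in cells for v in flux[c][start:stop]]
            mean = sum(block) / len(block)
            for c in cells:
                smoothed[c][start:stop] = [mean] * (stop - start)
    return smoothed


flux = {
    "a1": [0.0] * 56 + [2.0] * 56,
    "a2": [4.0] * 56 + [6.0] * 56,
    "b1": [1.0] * 112,
}
regions = {"a1": "A", "a2": "A", "b1": "B"}
smoothed = smooth_to_ecoregion_weekly_mean(flux, regions)
print(smoothed["a1"][0], smoothed["a1"][56])
```

 After this step, all cells within an ecoregion share the same value in a given week, so any remaining differences between models reflect only the weekly, ecoregion-scale flux magnitude.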
 Overall, the towers are able to detect differences in flux magnitude for large regions (Figure 5). However, the MSD between the 3-hourly concentration signals in Case 2 is smaller than observed in Case 1 (Figure 5). As a result, fewer pairs of synthetic CO2 signals are significantly different from one another, particularly in the late winter and early spring months (February through May). However, in early winter (December and January) and late summer (August and September), the differences in regional-scale net flux between the TBMs have a greater impact on tower observations relative to other months. These seasonal variations in differences are likely due to the larger tower footprints in the winter months (i.e., the towers sampling air from a larger region because the PBL may be below the sampling height at taller towers and/or because wind patterns change more frequently in the winter thereby allowing a given tower to sample a larger area), as well as larger net differences in flux during August and September among the models.
 The results from Case 2 indicate that differences in modeled flux magnitude over large ecoregions are generally detectable by atmospheric measurements, and these observations can therefore be used, for at least some regions and seasons, to discriminate among large-scale fluxes as predicted by different TBMs. This is also encouraging for atmospheric inversions that estimate fluxes over large regions, because the atmospheric data appear to provide sufficient information to help constrain fluxes at these scales. In fact, many inversions prescribe flux patterns within large regions (similar to the ecoregions used in this study), and adjust an initial or a priori estimate of the overall magnitudes of fluxes at the regional scale [e.g., Peylin et al., 2001; Law et al., 2002; Peters et al., 2007].
 Using inversions to scale fluxes for large regions assumes that the atmospheric data are providing information about the large-scale regional variations in flux. Due to spatial aggregation errors, however, there is a debate within the inversion community about the number of estimation regions to use, or whether they should be used at all (e.g., estimate fluxes at grid scale rather than by region) [e.g., Kaminski and Heimann, 2001; Kaminski et al., 2001; Peylin et al., 2001]. Although the atmospheric data appear to provide information about large-scale fluxes that can be used to scale regions, any errors in the prescribed subecosystem-scale variability of the a priori fluxes within the regions can lead to significant errors in the inversion result, particularly in regions close to the tower. Cases 3 and 4 evaluate the influence of surface flux variability and the scales (near versus far field) at which flux differences are detectable by the atmospheric data.
3.3. Case 3: Normalized Net Ecoregion-Scale Flux
 Case 3 is designed to evaluate whether the atmospheric network can detect differences in CO2 observation signals caused by how fluxes are distributed within ecoregions (section 2.3, Figure 3c). If substantially different source/sink configurations do not yield substantially different synthetic CO2 time series, then the atmospheric data cannot be used to evaluate the spatial distribution (e.g., 1° × 1° in this study) of fluxes from TBMs. The implications for inversions are slightly more complicated, particularly when the flux distribution of the prior estimate is prescribed or fixed by the inversion, as with inversions that estimate fluxes for regions (e.g., ecoregions, continents) larger than the resolution of the atmospheric transport model. For example, if the flux patterns within ecoregions do not significantly impact atmospheric observations, then, although an inversion that prescribes patterns based on a particular prior set of fluxes is able to reproduce observations, this does not imply that the prior had the correct patterns. This would imply that the results of such an inversion could not be interpreted at subecoregion scales.
 Results from Case 3 suggest that the atmospheric measurements can detect differences in the 3-hourly synthetic signals due to subecoregion-scale differences in the flux distribution (Figures 5 and 6) for some seasons and regions. Differences among the signals are seen primarily during the growing season (May through September), when both the magnitude and spatial heterogeneity of fluxes are greatest, and particularly in forested and northern regions of North America. The towers with larger summertime footprints (e.g., CDL, FRD, LEF) tend to have greater overall observation sensitivities to surface fluxes in regions farther from the tower location (Figure 1). A combination of larger tower footprint, along with higher across-model differences in surface flux distributions (Figure 1b), contributes to greater differences in the synthetic signals at these towers (Figure 6).
 While the results from Case 3 indicate that the atmospheric measurements are able to detect differences in the fine-scale (spatial and temporal) variability of flux, what is not clear is the relative importance of the spatial distribution of fluxes versus differences in the diurnal cycle of those fluxes (e.g., timing, strength). In order to further investigate the impact of differences in fine-scale flux distributions, Case 3 was repeated with the diurnal cycle of fluxes removed. Thus, the distribution of fluxes within each ecoregion remained unique to each model, but the temporal variability contained in the diurnal cycle was removed. When diurnal variability is removed, the significance of the difference between signals disappears for many model pairs and towers (Figure 6). Thus, the differences in signals observed in Case 3 are driven primarily by differences in the diurnal cycle of fluxes between the TBMs (original Case 3) rather than by differences in how the fluxes are spatially distributed among the models (modified Case 3).
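 One simple way to remove diurnal variability while preserving the daily net flux is to replace each 3-hourly value with the mean of its day. The sketch below illustrates that idea only; the exact procedure used for the modified Case 3 is not specified in this section, and the function name and the eight-steps-per-day layout are assumptions.

```python
STEPS_PER_DAY = 8  # 3-hourly resolution


def remove_diurnal_cycle(flux_3hourly, steps_per_day=STEPS_PER_DAY):
    """Replace each 3-hourly flux with the mean of its day.

    The daily net flux is preserved; only the within-day variation
    (timing and strength of the diurnal cycle) is discarded.
    """
    flattened = []
    for start in range(0, len(flux_3hourly), steps_per_day):
        day = flux_3hourly[start:start + steps_per_day]
        daily_mean = sum(day) / len(day)
        flattened.extend([daily_mean] * len(day))
    return flattened


# Day 1 has a pronounced daytime uptake; day 2 is already constant.
series = [1.0, 1.0, 0.0, -2.0, -4.0, -2.0, 0.0, 2.0] + [2.0] * 8
flat = remove_diurnal_cycle(series)
print(flat[:8], flat[8:])
```

 Two models that differ only in the timing or amplitude of their diurnal cycles become identical after this step, so any remaining difference in the resulting concentration signals must come from the spatial distribution or the day-to-day variation of the fluxes.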
 The impact of the diurnal cycle prescribed by the TBMs on the differences seen in the synthetic CO2 signals at the towers has several potential implications for atmospheric inversions. Most importantly, if the magnitude and/or timing of the diurnal cycle in any prescribed fluxes are incorrect in atmospheric inversions, significant errors could be introduced into the inversion results. In order to avoid large temporal aggregation errors (and biases), inversions that use continuous observations have to account for the diurnal cycle in some way [Law et al., 2004]. For example, the inversion could be allowed to adjust the diurnal variability in fluxes in order to properly account for the observed high-frequency variability in concentrations [e.g., Gourdji et al., 2010], or the diurnal variability in the prior could simply be assumed to be correct. The latter approach is the most common among inversions, and is often implemented by either subtracting from the atmospheric observations the signal generated from a forward simulation run with the prior that includes the diurnal cycle [e.g., Peters et al., 2007], or by relying on the prior to define the high-frequency time variations in surface flux [e.g., Schuh et al., 2010] and thus the concentration data. However, in cases when the diurnal cycle is fixed by the a priori flux estimate, any residuals caused by a mismatch between the “true” diurnal cycle seen by the tower and that prescribed by the fixed diurnal cycle of fluxes can be aliased onto other areas or regions in the inversion estimation.
3.4. Case 4: Isolating Influence of Near- and Far-Field Fluxes
 Case 4 examines the influence of flux variability in the far field on the synthetic observation time series (section 2.3, Figure 3d). Atmospheric data are not equally sensitive to an entire ecoregion or all areas surrounding the tower. Instead, observations tend to be most strongly influenced by sources and sinks within the immediate proximity of the towers [Gerbig et al., 2009]. Case 4 examines the extent to which the results in Case 3 are impacted by differences in fluxes in areas farther away from the tower (e.g., outside of the tower's near field). The ability of the tower to detect flux differences in the far field depends on the tower's height, location, and the weather patterns or atmospheric transport, and will have a strong impact on whether a tower's observations can be used to evaluate flux distributions beyond its near field. Thus, Case 4 examines the spatial range seen by the towers, and therefore whether the atmospheric observations can be used to evaluate fluxes across a large domain, or only those areas very close to tower locations.
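 Separating near-field from far-field grid cells can be sketched as a distance test around a tower. The near-field definition actually used in Case 4 is given in section 2.3 and is not reproduced here; the great-circle distance criterion, the 300 km radius, the tower coordinates, and the function names below are placeholders chosen only for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def split_near_far(cell_centers, tower, radius_km):
    """Partition (lat, lon) cell centers by distance from a tower."""
    near, far = [], []
    for lat, lon in cell_centers:
        distance = great_circle_km(tower[0], tower[1], lat, lon)
        (near if distance <= radius_km else far).append((lat, lon))
    return near, far


# Hypothetical tower and two 1-degree cell centers.
tower = (45.0, -90.0)
cells = [(45.5, -90.5), (40.5, -100.5)]
near, far = split_near_far(cells, tower, radius_km=300.0)
print(near, far)
```

 With such a partition, the variability of the fluxes in one set of cells can be smoothed or removed while the other set is left unchanged, which isolates the contribution of each field to the synthetic signal at the tower.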
 Overall, many of the strong summertime differences among the signals in Case 3 appear to originate from flux variability (both spatial and temporal) within the near field (Figure 6). Although differences among signals still remain at many of the towers in Case 4, fewer pairs of signals have statistically significant differences. Thus, when near-field variability is removed in the TBM fluxes, the resultant signals are much less statistically significantly different. This result is consistent with other studies that examined the influence of near- and far-field fluxes on observations [e.g., Gerbig et al., 2009]. However, at towers with larger footprints (e.g., LEF, CDL, and FRD), differences among fluxes in the far field do appear to translate into significant differences among the synthetic CO2 signals. These towers are also located near regions with greater across-model variability in surface flux estimates among TBMs (Figure 2b) and have the lowest model-data mismatch variances.
 As in the modified version of Case 3, the diurnal cycle was removed from the far-field fluxes to isolate the impact of fine-scale spatial and temporal differences in estimated fluxes from the TBMs. Contrary to the modified Case 3, however, the significance of the differences between concentration signals remained largely unchanged, indicating that most of the differences in signals observed in Case 4 are driven by differences in the spatial distribution of fluxes in the far field (Figure 6), rather than differences in the far-field diurnal cycle as represented by each TBM. This is likely due to the fact that transported fluxes from regions farther away from tower locations are more likely to be well-mixed and diffuse, compared to near-field fluxes. Thus, differences in the diurnal cycle of fluxes in the far field are likely less detectable at the towers.
 The combined results from Cases 2 through 4 suggest that the observed differences in the CO2 signal are driven primarily by differences in flux magnitude over large scales, and the diurnal cycle of fluxes within the near field of tower locations. Thus, the scale at which TBMs can be evaluated using atmospheric CO2 observations may be limited to areas within the near field of tower observations, or aggregated net fluxes over large regions.
 The results from Case 4 are also important for inversions that use prior flux estimates derived from TBMs. Diurnal variability in surface fluxes within the near field of tower observations appears to have a significant impact on the high-frequency variations in the atmospheric data. Thus, in order to accurately account for the variability in the atmospheric data, the prior has to accurately resolve the flux at fine temporal scales over regions in close proximity to tower locations, or the inversion needs to be able to adjust the temporal (and spatial) variability of the prior.
4. Summary and Conclusions
 This study examined the ability of atmospheric measurements to detect differences in the 1° by 1° land-atmosphere carbon fluxes from four different TBMs. Motivated by the need to improve both forward and inversion models, this study examined three main applications using atmospheric data: (1) evaluating the relative merits of two or more TBMs; (2) validating a single TBM using a transport model and atmospheric concentrations; and (3) assessing the ability of the atmospheric network to provide sufficient information to accurately infer fine-scale flux variations in the context of atmospheric inversion.
 Using a Chi-square test of variance, the CO2 signals generated using the different representations of surface flux were compared, and the results suggest that there is sufficient information in the atmospheric record to evaluate or validate at least some aspects of surface flux estimates from TBMs. Given the types of differences in flux represented by the four TBMs, atmospheric data may be most informative in evaluating aggregated TBM fluxes over large spatial scales (e.g., ecoregions), as well as in the improvement of how the diurnal cycle of fluxes is represented in TBMs, particularly in areas close to tower locations. As the density of the atmospheric network increases, more areas will fall within the near field of tower locations, and the monitoring network will thereby provide more information for model evaluation.
 The high sensitivity of the signals to slight differences in the diurnal cycle of fluxes stresses the importance of accurately accounting for small-scale temporal variability in fluxes in models, both in inversions and process-oriented TBMs. This becomes particularly important when inversions use prior flux estimates derived from TBMs. In order to accurately account for the high-frequency variations in the atmospheric data, the prior either needs to be very good at capturing the true flux variability at fine temporal scales, or the inversion needs to be able to adjust this variability. Whereas studies focused on evaluating inversion setups have shown the potential impact of spatial aggregation errors on estimates [e.g., Kaminski et al., 2001; Schuh et al., 2009], the results of this study confirm that the time domain runs the same risk of aggregation errors [e.g., Law et al., 2004; Peylin et al., 2005; Gourdji et al., 2010]. Results demonstrate that the impact of temporal aggregation may be as important as the impact of aggregation in the spatial domain.
 The high sensitivity of concentration observations to the near field of tower locations further highlights the importance of potential aggregation errors in inversions. For example, to avoid spatial aggregation errors, some inversion approaches optimize fluxes at small spatial scales [e.g., Gerbig et al., 2003a; Rödenbeck et al., 2003; Michalak et al., 2004; Peylin et al., 2005; Gourdji et al., 2008; Schuh et al., 2010]. Such flexibility is important considering the results from Case 4, which suggest that measurements at many towers are highly influenced by small-scale flux variability in the near field. Unless an inversion can adjust small-scale variability, biased estimates may be expected at large scales. Furthermore, by properly accounting for the near-field, small-scale variability, inversions are likely to recover large-scale fluxes more accurately.
 One of the limitations of this study is that the results apply to individual towers, and do not quantify the compounding benefit of multiple towers. Thus, although two signals may not be statistically significantly different if looked at from the perspective of any one tower, the results might be different if the signals at multiple towers are considered together.
 Finally, the results from the significance testing are strongly influenced by the estimates of model-data mismatch variance used in this study. As the ability to model atmospheric transport improves (i.e., lower model-data mismatch variance), more subtle differences in surface flux representation will be detectable through the atmospheric data. Conversely, as TBMs improve, models may converge toward a similar solution, making the differences between models more difficult to detect.
 This work was supported by the National Aeronautics and Space Administration (NASA) under grant NNX06AE84G “Constraining North American Fluxes of Carbon Dioxide and Inferring their Spatiotemporal Covariances through Assimilation of Remote Sensing and Atmospheric Data in a Geostatistical Framework” issued through the ROSES A.6 North American Carbon Program. We would especially like to thank Janusz Eluszkiewicz, John Henderson, and Thomas Nehrkorn at Atmospheric and Environmental Research (AER) Inc. for their work in customizing WRF-STILT. The WRF v2.2 model was initialized with, and used analysis nudging from, the 32 km North American Regional Reanalysis from the National Centers for Environmental Prediction. Finally, the authors would like to thank the following scientists for sharing their biospheric model output: Ning Zeng (University of Maryland), Ian Baker (Colorado State University), Nicolas Viovy (LSCE), and James Randerson (University of California, Irvine). We also thank Matthew Rodell for making the GLDAS data set available to the community, which was used to downscale the CASA-GFEDv2 and VEGAS2 fluxes to 3-hourly temporal resolution.