Improved Constraints on Northern Extratropical CO2 Fluxes Obtained by Combining Surface‐Based and Space‐Based Atmospheric CO2 Measurements

Top‐down estimates of CO2 fluxes are typically constrained by either surface‐based or space‐based CO2 observations. Both of these measurement types have spatial and temporal gaps in observational coverage that can lead to differences in inferred fluxes. Assimilating both surface‐based and space‐based measurements concurrently in a flux inversion framework improves observational coverage and reduces sampling related artifacts. This study examines the consistency of flux constraints provided by these different observations and the potential to combine them by performing a series of 6‐year (2010–2015) CO2 flux inversions. Flux inversions are performed assimilating surface‐based measurements from the in situ and flask network, measurements from the Total Carbon Column Observing Network (TCCON), and space‐based measurements from the Greenhouse Gases Observing Satellite (GOSAT), or all three data sets combined. Combining the data sets results in more precise flux estimates for subcontinental regions relative to any of the data sets alone. Combining the data sets also improves the accuracy of the posterior fluxes, based on reduced root‐mean‐square differences between posterior flux‐simulated CO2 and aircraft‐based CO2 over midlatitude regions (0.33–0.56 ppm) in comparison to GOSAT (0.37–0.61 ppm), TCCON (0.50–0.68 ppm), or in situ and flask measurements (0.46–0.56 ppm) alone. These results suggest that surface‐based and GOSAT measurements give complementary constraints on CO2 fluxes in the northern extratropics and can be combined in flux inversions to improve constraints on regional fluxes. This stands in contrast with many earlier attempts to combine these data sets and suggests that improvements in the NASA Atmospheric CO2 Observations from Space (ACOS) retrieval algorithm have significantly improved the consistency of space‐based and surface‐based flux constraints.


Introduction
Observations of atmospheric CO 2 provide a constraint on the net surface-atmosphere CO 2 flux and are critical for monitoring carbon flux changes. This has motivated observational programs that measure atmospheric CO 2 , including a global network of surface-based in situ and flask monitoring sites, the Total Carbon Column Observing Network (TCCON) of ground-based spectrometers (Wunch et al., 2011) and several satellite missions (Crisp et al., 2004;Yokota et al., 2009). These observations have provided many insights into the terrestrial and ocean carbon cycle (Bacastow, 1976;Bolin & Keeling, 1963;Bowman et al., 2017;Chatterjee et al., 2017;Keeling, 1960;Keeling et al., 1996;Liu et al., 2017;Tans et al., 1989). However, current measurement programs are unable to continuously monitor CO 2 with global coverage, resulting in observational gaps. These spatial and temporal gaps in observations of atmospheric CO 2 can introduce artifacts into NEE estimates, leading to difficulties in constraining carbon fluxes on regional scales (Basu et al., 2018;Byrne et al., 2017;Collatz et al., 2014). Different observing systems have different gaps in the observational coverage. Space-based measurements retrieve atmospheric CO 2 from measurements of reflected sunlight. This results in highly seasonal observational coverage in extratropical regions. Seasonal differences in observational coverage are further exacerbated by challenging retrievals over snow  and seasonal variations in cloud cover. In contrast, surface-based measurements of atmospheric CO 2 typically have comparatively uniform temporal coverage but heterogeneous spatial coverage. Surface measurement sites most densely cover the northern extratropics (particularly North America and Europe) but have sparse coverage elsewhere (Byrne et al., 2017).
In the northern extratropics, surface-based and space-based atmospheric CO 2 measurements provide complementary observational coverage in space and time. Yet few studies have attempted to combine surface-based and space-based atmospheric CO 2 measurements to obtain top-down constraints on fluxes across the northern latitudes. Nassar et al. (2011) combined surface flask CO 2 measurements with space-based measurements from the Tropospheric Emission Spectrometer (TES) in an inversion system and found improved constraints on CO 2 fluxes, paticularly in the tropics. Chevallier et al. (2011) found consistency between the surface air sample-based and the TCCON-based inversions, suggesting that flux inversions combining both data sources could be performed. Maksyutov et al. (2013) performed a combined inversion of monthly mean-gridded Greenhouse Gases Observing Satellite (GOSAT) X CO2 and GLOBALVIEW-CO 2 , finding that posterior fluxes were in close agreement with fluxes from a GLOBALVIEW-CO 2 -only flux inversion in regions that are well constrained by GLOBALVIEW-CO 2 sampling network but showed considerable differences in other regions. Houweling et al. (2015) performed a series of CO 2 flux inversions assimilating measurements from GOSAT and surface-based CO 2 measurements. They found that comparisons between posterior CO 2 fields and aircraft data did not show significant differences between inversions assimilating surface-based or space-based measurements and that the largest differences were driven by the inversion setup. However, they also found that the two data sets gave large differences in the spatial distribution of the CO 2 sink, with GOSAT flux inversions having increased uptake in the northern extratropics by ∼1 PgC. When both data sets were combined, they found that the posterior fluxes did not recover the observed meridional gradient in CO 2 (which was also found for the GOSAT flux inversions), suggesting that the biases in retrieved GOSAT X CO2 could be adversely impacting the results. Another study (Wang et al., 2018) assimilated both GOSAT measurements and surface-based atmospheric CO 2 measurements in a batch Bayesian synthesis inversion. They found that the differences in observational coverage of the ground-based and space-based data sets were complementary, resulting in smaller posterior uncertainty estimates when both data sets are assimilated than either data set alone. Similarly, in a set of regional Observing System Simulation Experiments (OSSEs), Fischer et al. (2017) showed reduced uncertainty in biosphere and fossil fuel emissions in California by combining space-based X CO2 and surface-based flask and in situ measurements.
In this study, we further investigate combining ground-based and space-based measurements of atmospheric CO 2 to provide estimates of NEE globally, but we focus on northern extratropical regions where surfacebased and aircraft-based measurements are most densely concentrated. We perform a series of 6-year flux inversions (2010-2015, inclusive) assimilating surface-based measurements from the in situ and flask measurement network, TCCON column-averaged dry-air CO 2 mole fractions (X CO2 ), GOSAT X CO2 measurements, and all three data sets combined. For each set of measurements, we perform three flux inversions applying different prior NEE flux and error constraints. From the spread in posterior fluxes due to differences in prior constraints, we quantify the precision to which these data sets constrain posterior fluxes. Spatial structures in the posterior fluxes are examined through comparisons between posterior NEE-simulated X CO2 and X CO2 measurements from Orbiting Carbon Observatory 2 (OCO-2) and GOSAT. The accuracy of posterior NEE-simulated CO 2 is examined through comparisons with aircraft-based CO 2 measurements.
The paper is outlined as follows. Section 2 describes the measurements used in this study, and section 3 describes the flux inversion setup. The posterior CO 2 fields obtained by the flux inversions are compared with OCO-2 and aircraft-based measurements in section 4.1. We then examine the 6-year mean seasonal cycle and annual net fluxes (section 4.2) and interannual variability (IAV) (section 4.3) obtained by the flux inversions. Finally, the implications of the results are discussed in section 5, and conclusions are given in section 6.

Surface-Based In Situ and Flask Measurements
Surface-based measurements of boundary layer atmospheric CO 2 can be performed using an in situ gas analyzer or by taking a flask sample, which is then returned to a lab and analyzed. A number of different groups from around the world collect surface CO 2 observations. We assimilate measurements from version 4.1 of the GLOBALVIEW plus package (Cooperative global atmospheric data integration project, 2018; Masarie et al., 2014) and the Japan-Russia Siberian Tall Tower Inland Observation Network (JR-STATION) of nine tower sites in Siberia (Sasakawa et al., 2010(Sasakawa et al., , 2013. The GLOBALVIEW v4.1 package incorporates data from many observing sites around the world and is specifically prepared for use in data assimilation studies. We include measurements from the Integrated Carbon Observation System (ICOS RI, 2019) in our analysis. We assimilate GLOBALVIEW v4.1 measurements from surface in situ and flask sites, tower sites, and ship-based measurements. Data are only assimilated if the measurements are assimilated by NOAA's CarbonTracker, version CT2017 (CT_assim = 0). Measurements are assimilated at the intake height above the model surface over land and at the intake height above sea level for ocean grid cells. For surface-based flask and in situ measurements, most of the measurement error applied for assimilation is due to representativeness errors (inability to model these measurements). We use the model-data mismatch (mdm) as the measurement errors. This is the error value placed on each measurement in the assimilation system and is meant to express the statistics of simulated minus-observed CO 2 residuals expected if CarbonTracker was using perfect surface fluxes.
JR-STATION is a network of nine towers (http://www.cger.nies.go.jp/en/climate/pj1/tower/). On these towers, high inlet measurements are obtained over 17-20 min of each hour, and the low inlet data are obtained from 37-40 min of each hour; these 3-min averages are then taken to be representative of the hourly means for each inlet. We filter the measurements by removing all measurements where the vertical gradient in CO 2 exceeds 0.5 ppm (to remove measurements when the boundary layer is not well mixed) and use the measured value at the highest intake for the measurement. For each site, the errors (in ppm) are prescribed to be constant throughout a given month; the errors range from 3 ppm in winter to 7 ppm in summer to account for both measurement and representativeness errors. These error estimates were chosen because they are comparable to the error estimates for tower sites in the GLOBALVIEW plus v4.1 package.
We remove outliers and poorly modeled measurements by filtering out measurements for which the difference between the prior NEE-simulated measurements and actual measurements exceeds three standard deviations of the measurement uncertainty (see section 3 for details on the forward model simulations). We also remove measurements for which the difference between prior simulated CO 2 and measurement exceeds 10 ppm, as these are assumed to be poorly simulated by the model. This filtering removes ∼8% of the measurements. For each site, the data are only assimilated between 11 a.m. and 4 p.m. local time.

Aircraft-Based Measurements
Aircraft measurements are used for the evaluation of posterior atmospheric CO 2 fields. Aircraft data are obtained from the version 4.1 of the GLOBALVIEW plus data set. Comparisons between measured and modeled atmospheric CO 2 are performed over three distinct regions: East Asia, North America, and Alaska/Arctic ( Figure S1 in the supporting information). Aircraft measurements over East Asia come exclusively from the Comprehensive Observation Network for Trace gases by Airliner (CONTRAIL) program (Machida et al., 2008. Aircraft data over Alaska/Arctic and North America originate from the NOAA Global Greenhouse Gas Reference Network's aircraft program (Sweeney et al., 2015) and HIAPER Pole-to-Pole Observations (HIPPO) (Wofsy, 2011). The number of hourly mean measurements per month between 3 and 8 km in altitude above sea level (asl) is shown in Figure S2.

TCCON Measurements
TCCON is a network of ground-based Fourier transform spectrometers that record solar absorption spectra in the near infrared from which, among other gases, X CO2 is estimated (Wunch et al., 2011). CO 2 abundances are retrieved using a nonlinear least squares approach from absorption lines in the near-infrared spectral region. The column-averaged dry-air mole fractions of CO 2 (X CO2 ) is calculated by taking the ratio of the column abundance of CO 2 to O 2 (scaled by the mean O 2 concentration), resulting in high-precision (<0.25% in CO 2 ) X CO2 measurements. The TCCON strives to achieve the best site-to-site precision and accuracy possible. Systematic biases that are consistent throughout the network are accounted for by scaling the TCCON retrieval results to the WMO scale via aircraft and AirCore profiles (Wunch et al., 2010). Moreover, the TCCON sets guidelines to ensure that the instrumentation at each site is as similar as possible and that the retrieval software, including the spectroscopic line lists and line shapes, is identical for each site. However, site-specific differences (e.g., instrumental line shape) can cause residual site-to-site biases (Wunch et al., 2010) which might introduce biases in flux inversions.
For this study, TCCON data were obtained from the TCCON Data Archive, hosted by CaltechDATA (https://tccondata.org). We include data from TCCON sites that have mean biases of less than 0.5 ppm relative to both the OCO-2 target-mode X CO2 and the posterior-simulated X CO2 from the surface-only flux inversions. The sites included in this study, which provide data during the years 2010-2015, are given in Table 1. Sites that are excluded from this study are excluded due to several factors that cause apparent biases to be greater than 0.5 ppm. These factors may include: proximity to large CO 2 sources (e.g., cities), proximity to large topographic variability, and in a few cases, known TCCON instrument biases for which a solution either has been applied, or will be applied in an upcoming TCCON data version. Understanding the exact reason for the biases is beyond the scope of this study. Note that the threshold of 0.5 ppm is somewhat arbitrary. This value was set because most sites outside of this threshold are in heavily observed regions (e.g., Europe), which are expected to be well constrained by other data sets (Byrne et al., 2017), or in the Southern Hemisphere and not expected to have a large impact on the performance of the flux inversions in the northern midlatitudes.
In this study, the TCCON data are filtered to remove measurements with solar zenith angles greater than 70°. Measurements are then binned into hourly medians for each site. Only hours with five or more measurements are included. Measurements are only assimilated between 11 a.m. and 3 p.m. local time for the flux inversions, to minimize potential biases relating to errors in the prescribed diurnal cycle of NEE.

Space-Based Measurements
We assimilate X CO2 measured by the Thermal And Near-infrared Sensor for carbon Observations Fourier Transform Spectrometer (TANSO-FTS) aboard GOSAT. GOSAT was launched in February 2009 in a Sun-synchronous orbit, with a repeat cycle of 3 days that produces 44 separate ground track repeats (Yoshida et al., 2013). The footprint of the GOSAT measurements has a diameter of about 10 km. Since August 2010, TANSO-FTS has been measuring with a 3-point cross-track pattern with 263-km cross-track separation, resulting in a swath of 526 km. Measurements have an along-track separation of 283 km (Crisp et al., 2012). We use version 7.3 of the NASA Atmospheric CO 2 Observations from Space (ACOS) GOSAT measurements in this analysis. A detailed description of ACOS retrieval algorithm is available in O'Dell et al. (2012) and Crisp et al. (2012), with recent updates described in Eldering et al. (2017) and O'Dell et al. (2018). We only assimilate high gain (H-Gain) nadir measurements from the TANSO-FTS shortwave infrared (SWIR) band. We assimilate all H-Gain measurements that pass the quality flag requirement. These measurements are only over land.
Measurements from OCO-2 are used for comparisons with the posterior CO 2 fields. OCO-2, launched in July 2014, is a space-based spectrometer in a Sun-synchronous orbit that measures reflected solar radiation to infer X CO2 with a footprint of about 3 km 2 . It has a repeat cycle of 16 days, resulting in 233 separate ground track repeats. OCO-2 has a swath of 10 km and collects eight adjacent, spatially resolved samples every 0.333 s, resulting in roughly 24 soundings per second. We downloaded version 9 of the ACOS OCO-2 lite files from the CO 2 Virtual Science Data Environment (https://co2.jpl.nasa.gov/). Measurements are averaged into super-obs at 1°× 1°resolution grids following Liu et al. (2017), with the additional requirement that there must be a minimum of eight OCO-2 observations within each 1°× 1°grid box. We combine land nadir and land glint measurements for the analysis.

Flux Inversions
Flux inversions are performed with the Greenhouse Gas Framework -Flux (GHGF-Flux) inversion system. GHGF-Flux is a flux inversion system developed under the NASAs Carbon Monitoring System (CMS) project. The GHGF is capable of jointly assimilating multiplatform observations of CH 4 , CO, CO 2 , and OCS. The GHGF inherits the chemistry transport model from the GEOS-Chem and the adjoint model from the GEOS-Chem adjoint.
Chemical transport is driven by the Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2) meteorology produced with version 5.12.4 of the GEOS atmospheric data assimilation system (Gelaro et al., 2017). To perform tracer transport, these fields are regridded to 4°× 5°horizontal resolution and archived with a temporal resolution of 6 hr except for surface quantities and mixing depths, which have a temporal resolution of 3 hr. Tracer transport is performed at 30-min time steps with 47 vertical levels.
For all inversions, we optimize 14-day scaling factors for daily net NEE and ocean fluxes, except for the final temporal grouping of each year, which is padded with 1-2 days so that the groupings cover the same day-ofyear increments for each year. We use an assimilation window of approximately 18 months (7 October to 1 April two years later) and keep posterior fluxes for one year (1 January to 31 December) then shift the inversion window forward one year. Using this method, we optimize NEE spanning 2010-2015. Initial conditions are generated by performing a two-year inversion of surface in situ and flask measurements spanning 1 January 2008 to 31 December 2009. The stratosphere is then adjusted to match the zonal mean structure of Diallo et al. (2017) for October 2009. This is performed by interpolating the climatology of Diallo et al. (2017) to the latitude-longitude grid of GEOS-Chem. Then, a scaling factor is applied each longitude grid cell for a given latitude-longitude location so that the zonal mean mole fraction matched the climatology. The magnitude of this correction was a few parts per million.
Prior NEE fluxes and errors differ between inversions and are generated from three different models: the simple biosphere model (SiB3), the Carnegie-Ames-Stanford Approach (CASA) model, and FLUXCOM. The motivation for using three different priors is to investigate the sensitivity of posterior fluxes to model structure differences in the prior fluxes and assumed prior uncertainties in each model (Philip et al., 2019). As shown in Huntzinger et al. (2017), the model structure differences in biogeochemical models are still a major source uncertainty. This method follows the Monte Carlo approach and is analogous to other flux inversion ensembles, such as those reported in Crowell et al. (2019). We sample three different prior NEE estimates, perform the flux inversion with each, and analyze the distribution of posterior fluxes. Due to the small ensemble, this method is less computationally expensive than methods that estimate the posterior errors (Chevallier et al., 2007). This method also has the advantage that it provides a more realistic error estimate in the case that the prescribed prior errors are not reflective of the true prior uncertainties (i.e., smaller errors are applied due to regularization requirements). As our uncertainty quantification is atypical, we provide a brief comparison of the magnitude of prior model spread and uncertainty. Figure S3 compares the relative magnitudes of the prior model spread (full-width) and 1 σ mean model uncertainty (σ mean = (σ CASA + σ SiB3 + σ FLUXCOM )/3). The root-mean-square (RMS) ratio of model spread to model uncertainty gives a metric of the relative magnitudes of model spread and uncertainty. We find that the ratio is temporally and spatially heterogeneous. On average, the model uncertainty exceeds the model spread for 4°× 5°grid cells. (median ratio of 0.73 on 1 February, 0.97 on 1 May, and 0.89 on 1 August). However, model uncertainties are assumed to be spatially and temporally uncorrelated, whereas model structural differences are more coherent, such that the ratio of model spread to uncertainty increases with aggregation. For example, we obtain ratios of 3.3 for 1 February, 3.5 for 1 May, and 3.7 for 1 August for temperate North America. Thus, the magnitudes of model spread and model uncertainty are comparable, suggesting that the model spread employed here provides a reasonable estimate of uncertainty.
For all prior fluxes the annual total net flux has been adjusted to 4.6 PgC year −1 , to match the mean atmospheric CO 2 growth rate. This ensures that surface-based observations would not be prefiltered out (section 2.1) because the prior NEE has growth rate very different from observations. In addition, using prior that is consistent with observed global growth rate saves computation cost, since the inversion process will only need to adjust the NEE components that we do not have prior knowledge of. Details on the modeled NEE fluxes and prior errors are given in Appendix A. The diurnal cycle in NEE is prescribed using the modeled diurnal cycle from SiB3 for the SiB3 flux inversions and the diurnal cycle from CASA for the CASA and FLUXCOM inversions. Sensitivity tests found that the flux inversions were not sensitive to the prescribed diurnal NEE cycle. The ECCO-Darwin-V1 model (Brix et al., 2015;Dutkiewicz et al., 2009;Menemenlis et al., 2008) estimates are used as the prior ocean CO 2 exchange for all inversions, and prior errors were taken to be 100% of the flux. Fossil fuel, biofuel, and biomass burning CO 2 emissions are prescribed using the Open-source Data Inventory for Anthropogenic CO 2 , version 2018 (Oda & Maksyutov, 2011;Oda et al., 2018) with downscaling to hourly emissions based on Nassar et al. (2013), CASA-GFED4-FUEL, and Global Fire Emission Database, version 4 (GFED4) (Randerson et al., 2018) inventories, respectively.
Prior error covariance matrices are taken to be diagonal, such that there are no spatial or temporal covariances. The prior NEE errors are generated based on the NEE fluxes provided by the models. It is first taken to be 60% of the NEE flux. This is then increased by scaling up the errors at times and grid cells that have active vegetation but small net fluxes. For example, the uncertainty is scaled up during the spring (source to sink) and fall (sink to source) transition periods when the 14-day NEE flux is small but the summer 14-day NEE fluxes are much larger. We also inflate the uncertainty for grid cells in which the flux is small for a given model but is much larger for the other models. The final errors range from 100% to 500% of the NEE flux. Additional details are provided in Appendix A.
A series of flux inversions are performed that assimilate different data sets. This allows us to quantify the influence of different observational data sets on the posterior fluxes. We perform flux inversions that assimilate only ground-based in situ and flask measurements (referred to as surface-only), only TCCON measurements (TCCON-only), only GOSAT data (referred to as GOSAT-only), and all data sets simultaneously (referred to as GOSAT + surface + TCCON). For each data assimilation setup, we perform flux inversions with each of the three prior NEE fluxes and errors. Therefore, we perform a total of 12 flux inversions.

Evaluation of Posterior NEE-Simulated CO 2
Large spatial structures in the posterior-simulated CO 2 fields are compared with GOSAT and OCO-2 X CO2 , while the accuracy of the fluxes is evaluated against aircraft-based CO 2 measurements. Rather than describing the data-model differences for all 12 inversions, the posterior fluxes are grouped by the data set assimilated, and the mean posterior fluxes are evaluated. Tables giving the data-model mismatch between the individual flux inversions and aircraft measurements are provided as supplementary materials (Tables S1  and S2).

Comparison of Posterior CO 2 Against Space-Based X CO2
Space-based X CO2 measurements have broad spatial coverage on the timescale of a month. This allows for comparisons between modeled and measured X CO2 data over large spatial scales. Here, the data-model mismatch between the posterior CO 2 fields and space-based measurements from GOSAT and OCO-2 is examined. Figure 1 shows the zonal mean data-model mismatch as a function of latitude and time for the mean prior fluxes and mean posterior fluxes for the TCCON-only inversions, surface-only inversions,

Journal of Geophysical Research: Atmospheres
GOSAT-only inversions, and GOSAT + surface + TCCON inversions. Note that there are gaps due to GOSAT's observational coverage in the tropics and at high latitudes. The mean prior flux gives larger data-model standard deviations against GOSAT (0.59 ppm) and OCO-2 (0.67 ppm) than all of the flux inversions, implying that the flux inversions improve the variance of the data-model mismatch. The CO 2 fields simulated with the prior fluxes tend to be biased low relative to GOSAT and OCO-2 during the winter and spring and biased high during the summer and fall in the northern extratropics, suggesting that the prior fluxes underestimate the magnitude of the seasonal cycle. Comparing the posterior CO 2 fields against GOSAT, the surface-only and TCCON-only flux inversions give the largest mean data-model standard deviations, which is expected as these were the only inversions that do not assimilate GOSAT data.
Comparing to OCO-2, all of the flux inversions give similar differences. Mean differences range from −0.11 ppm to 0.07 ppm and standard deviations range over 0.41-0.48 ppm, suggesting that all of the flux inversions recover the global X CO2 fields with similar accuracy and precision. However, north of 40°N, the GOSAT + surface + TCCON flux inversion shows better agreement with OCO-2 (RMS = 0.30 ppm) than the other flux inversions (RMS = 0.36-0.41 ppm). Differences between posterior-simulated X CO2 and the OCO-2 measurements are largest in the northern subtropics, where the assimilated data sets have sparse observational coverage. Thus, it is unclear whether the differences in the subtropics are due to gaps in the observational coverage or possible biases between assimilated observations and OCO-2 retrievals.
The spread in simulated X CO2 among the inversions gives a metric of the precision to which the flux inversion recovers atmospheric CO 2 . Figure 2 shows the range of simulated GOSAT X CO2 for the prior and posterior fluxes due to the different prior NEE fluxes and errors applied in the inversions. The largest range is obtained for the prior fluxes (1.37 ppm). The range for the TCCON-only and surface-only fluxes are reduced by 42% (0.79 ppm) and 64% (0.50 ppm) relative to the prior, respectively. Globally, the range for GOSAT-only and GOSAT + surface + TCCON inversions are reduced by 72% and 78%, respectively, relative to the prior. The decrease relative to the prior is largest in the northern extratropics. Differences in range between the GOSAT-only and GOSAT + surface + TCCON inversions are generally quite small. The most notable differences is that the GOSAT + surface + TCCON inversions have a smaller range in the northern extratropics during the fall. GOSAT measurements do not have high sensitivity to northern extratropical fluxes during this time of year (Byrne et al., 2017); thus, it appears that the surface-based measurements provide the additional information necessary to better constrain fall NEE in the northern extratropics. It is notable that the posterior-simulated GOSAT X CO2 range is smallest for the GOSAT + surface + TCCON flux inversion, and that posterior data-model differences against GOSAT are similar for the GOSAT + surface + TCCON flux inversions and the GOSAT-only flux inversions. This implies consistent data quality among data sets and transport model reasonably captures vertical and boundary layer mixing.
For the surface-based in situ and flask flux inversions, and TCCON flux inversions, the posterior range increases in the tropics, where there is sparse observational coverage. For the GOSAT and GOSAT + surface + TCCON, the posterior range is generally reduced relative to the prior at all latitudes, although small increases in model spread occur in the northern tropics during the spring and southern tropics during the fall.

Evaluation of Posterior CO 2 Against Aircraft-Based Measurements
Aircraft-based measurements of atmospheric CO 2 provide a constraint on atmospheric CO 2 that is independent of the surface-based and space-based data sets assimilated. Therefore, aircraft-based CO 2 measurements offer a data set that modeled atmospheric CO 2 can be evaluated against. Evaluation of posterior CO 2 against aircraft-based measurements has previously been employed by a number of studies (Basu et al., 2014;Chevallier et al., 2019;Crowell et al., 2019;Frankenberg et al., 2016;Liu & Bowman, 2016;Peylin et al., 2007;Pickett-Heaps et al., 2011;Stephens et al., 2007). Here, we evaluate the atmospheric CO 2 fields simulated using the prior and posterior fluxes against aircraft measurements over three regions with intensive sampling: East Asia, North America, and Alaska/Arctic. We only use aircraft data between 3 and 8 km in altitude asl. Differences between measured and modeled CO 2 are due to both model transport errors and surface flux errors. We have found that the differences are strongly influenced by model transport errors for individual measurements but that the impact of representativeness errors on data-model mismatches is reduced with temporal aggregation; thus, we aggregate data-model mismatches to monthly means.
The GOSAT + surface + TCCON flux inversions generally show the best agreement with the aircraft-based CO 2 measurements. Figure 3 shows the monthly mean aircraft measurements and modeled CO 2 for the three regions examined here. The mean monthly range in data-model mismatch due to prior constraints is shown for each case, as are the mean, standard deviation, and RMS of data-model mismatch. This allows us to quantify the data-model mismatch relative to the spread due to prior constraints. For aircraft measurements over East Asia, North America, and Alaska/Arctic, we find that the prior range is large (1.59-2.55 ppm) relative to the RMS differences (0.68-1.14 ppm) indicating that aircraft observations are within the range of the prior simulated CO 2 . The posterior-simulated CO 2 measurements have much reduced ranges relative to the prior. For example, the GOSAT + surface + TCCON inversions give ranges of 0.24-0.40 ppm and RMS differences of 0.33-0.82 ppm across the regions, indicating that data-model differences are comparable to the precision of the posterior estimates.
Comparing the assimilated data sets, the GOSAT + surface + TCCON flux inversions give the smallest RMS difference against aircraft-based CO 2 in East Asia (0.33 ppm) and North America (0.56 ppm, equivalent to surface-only mismatch), and south of 30°N (0.61 ppm, Figure S4). The GOSAT-only flux inversions give the smallest RMS difference over the Alaska/Arctic region (0.77 ppm), although all of the flux inversions give larger RMS differences over this region relative to the midlatitude regions, suggesting that none of the flux inversions fully recover NEE at high latitudes. These aircraft measurements are also sensitive to fluxes over Siberia ( Figure S5), which is poorly observed by all data sets. However, the larger RMS  Despite these differences, the data-model biases against the aircraft-based measurements are generally similar between flux inversions. For example, all of the flux inversions give positive biases for East Asia (0.11-0.20 ppm, difference ≤ 0.09 ppm) and North America (0.45-0.51 ppm, difference ≤ 0.06 ppm) but negative biases for the Alaska/Arctic region (−0.07 to −0.01 ppm, difference ≤ 0.06 ppm). Therefore, the differences in posterior fluxes between inversions do not appear to be the largest driver of data-model biases. Transport model errors and representativeness errors may be primary drivers of regional data-model biases. We provide a lower bound estimate of transport model biases by regridding the fluxes and performing the evaluation against aircraft measurements at 2°× 2.5°spatial resolution ( Figure S6). Note that this is a lower bound estimate because there will be transport model errors common to both the 2°× 2.5°and 4°× 5°versions of GEOS-Chem (Yu et al., 2018). We find that model-data biases for the flux inversions change by 0.00-0.01 ppm for East Asia, 0.07-0.08 ppm for North America, and 0.09-0.10 ppm for Alaska/Arctic. These differences are similar to the magnitude of data-model differences between flux inversions, suggesting that transport model errors limit the ability of evaluating CO 2 flux estimates with aircraft-based measurements. This result is in contrast to Chevallier et al. (2019), who found that data-model mismatches were not strongly impacted by the version of Laboratoire de Météorologie Dynamique (LMDz) transport model employed.

Mean fluxes 4.2.1. Seasonal Cycle
In the northern extratropics, the seasonal cycle of NEE produces a large annual oscillation in atmospheric CO 2 , giving seasonal variations of ∼10 ppm in X CO2 . This provides the largest signal of ecosystem carbon dynamics in atmospheric CO 2 . In this section, we examine the seasonal cycle of NEE recovered by the flux inversions in the northern extratropics grouped by the assimilated data set. Figure 4 shows the seasonal cycle for the entire northern extratropics and five subcontinental regions (the spatial extent of the subcontinental regions are shown in Figure 5). We examine (1) the consistency in the seasonal cycle between the data sets and (2) the consistency of the posterior fluxes due to different prior assumptions.
The posterior seasonal cycles of the flux inversions show seasonal cycles with reduced spread relative to the prior for all assimilated data sets. The GOSAT + surface + TCCON NEE fluxes most closely match the GOSAT-only NEE fluxes during the summer, as GOSAT has dense observational coverage. During the winter, the GOSAT + surface + TCCON NEE fluxes most closely match the surface-only fluxes, particularly over temperate North America and Europe where the surface-based measurements are most densely concentrated.
The spread for each set of flux inversions shows the range in posterior fluxes due to differences in the prior fluxes and errors applied. This provides a metric of the precision to which the assimilated observations can constrain NEE. The spread is generally largest for the surface-only flux inversions outside of the winter. This is particularly notable over East Asia, where there is comparatively sparse observational coverage leading to a large spread among surface-only flux inversions. The spread is smallest for the GOSAT + surface + TCCON flux inversion, as expected. The small spread for the GOSAT + surface + TCCON flux inversions shows that the observational constraints provided by combining GOSAT, TCCON, and surface in situ and flask CO 2 measurements are sufficient to constrain the seasonal cycle of NEE on these subcontinental scales. These results suggest that the seasonal cycle is recovered by top-down flux inversions and suggests that analysis of the seasonal cycle of NEE, such as that presented by Byrne et al. (2018), could be extended to these regional scales. In the tropics and southern extratropics, the benefit of combining these data sets is somewhat reduced as the majority of information comes from GOSAT due to the sparsity of surface-based observations ( Figures S7 and S8).

Annual Net Fluxes
Here, we examine the annual net fluxes obtained for the flux inversions over the northern extratropics. Figure 6 shows

Journal of Geophysical Research: Atmospheres
(range of −3.21 to −2.89 PgC year −1 ) for the GOSAT + surface + TCCON flux inversions. It is notable that the prior assumptions applied to the flux inversions introduce substantial differences into the posterior fluxes. The range in the northern extratropical sink due to applying different prior NEE fluxes and errors is 0.32-1.03 PgC year −1 , depending on the assimilated data set.
On regional scales, there is generally overlap in the range of net annual fluxes between the TCCON-only, surface-only, GOSAT-only, and GOSAT + surface + TCCON flux inversions. This suggests that these observational data sets provide a consistent constraint on regional net annual NEE, within the considerable uncertainty introduced through prior assumptions. It is notable that the spread in net annual NEE due to prior constraints is as large as signals that have previously been reported between surface-based and space-based CO 2 measurements. For example, a larger European carbon sink of ∼0.5 PgC year −1 has previously been obtained for GOSAT flux inversions relative to surface flask and in situ CO 2 flux inversions (Chevallier et al., 2014;Houweling et al., 2015;Reuter et al., 2014). However, the results found here show that the differences in sink between the surface-only and GOSAT-only are highly dependent on the prior constraints. For example, using FLUXCOM or SiB3 as priors result in similar European sinks. However, a larger sink is recovered in the GOSAT-only inversion if CASA is employed as the prior. Similar results are found in the tropics and Southern Hemisphere, although with increased spread (Figures S7 and S9).

Interannual Variability
IAV in NEE provides a measure of the response of ecosystems to climate variability. Here, we examine the IAV recovered by the flux inversions, where IAV is calculated to be the anomaly from the 6-year mean. Figure 7 shows the IAV in NEE for the entire northern extratropics and five extratropical regions at 14-day temporal resolution, after performing a 3-point (42-day) running mean to filter out high-frequency variability. The posterior NEE IAV is not sensitive to the prior NEE constraints applied in the flux inversion, such that similar posterior NEE IAV is recovered for each set of prior fluxes with a given assimilated data set. This is illustrated by the small range obtained for each set of colored curves. Similar results were found by Baker et al. (2006), who found that IAV was more robust than other measures. However, the posterior NEE IAV is sensitive to the assimilated data set, such that we find disagreement in NEE IAV for the TCCON-only, surface-only, and GOSAT-only flux inversions. Similar results are found in the tropical and Southern Hemisphere regions ( Figure S10).
Differences in IAV between flux inversions can partially be explained by differences in the observational coverage of the data sets. As an example, let us consider the differences in IAV between the surface-only and GOSAT-only flux inversions in 2011 over temperate North America (Figure 8). This year was characterized by a "dipole" flux anomaly over North America, with a drought in the south (Texas-Mexico region) but with normal to above normal productivity in the northern USAsouthern Canada Byrne et al., 2019). Therefore, this provides a good case study to examine the ability of the inversion to capture structure in CO 2 fluxes. Figure 8a shows the monthly CO 2 anomalies observed by GOSAT and the surface in situ and flask network over the summer of 2011. GOSAT X CO2 measurements are distributed uniformly across North America, while surface in situ and flask measurements are located south of Lake Superior. This observational coverage is reflected in the posterior fluxes. The GOSAT-only posterior NEE anomalies

Journal of Geophysical Research: Atmospheres
On an annual basis, we find mixed agreement between flux inversions in year-to-year variations. Figure 9shows IAV in annual net NEE anomalies for the entire northern extratropics. In general, IAV in annual net fluxes are consistent for a given set of assimilated data, suggesting that the results are not sensitive to the prior fluxes and errors used. Note that the prior NEE fluxes did not contain IAV, which has previously been shown to have a substantial impact on posterior NEE IAV (Byrne et al., 2019). However, posterior IAV is quite variable between different assimilated data sets. The cause of these differences between the flux inversions is likely partially due to differences in the observational coverage between data sets. It is possible that differences between data sets are also partially due to changes in the observational coverage over time, which has previously been shown to have an impact on inferred fluxes (Bruhwiler et al., 2011;Gurney et al., 2008;Rödenbeck et al., 2003).

Consistency in Surface-Based and Apaced-Based Flux Constraints
The results generally show good agreement between the flux inversions assimilating different data sets. The agreement between the surface-only and GOSAT-only flux inversions may seem surprising in the context of a number of previous studies that have shown substantial differences between surface-based and space-based flux estimates (Basu et al., 2013;Chevallier et al., 2014;Houweling et al., 2015). However, more recent studies have shown improved agreement between surface-based and space-based flux inversions. Chevallier et al. (2019) found that flux inversions assimilating OCO-2 ACOS version 9 measurements gave similar net annual fluxes to those assimilating surface-based measurements and that both compared well against aircraft measurements. Interestingly, Chevallier et al. (2019) also found that GOSAT OCO Full Physics (OCFP) v7.1 XCO 2 retrievals did not compare as well against aircraft measurements. Comparisons between the ACOS 7.3 and OCFP v7.1 (downloaded from the Copernicus Climate Change Service: https://climate.copernicus.eu/) show substantial differences in zonal mean X CO2 ( Figure S12). Furthermore, GOSAT ACOS 7.3 retrievals are found to give better agreement with posterior-simulated CO 2 from the surface-only flux inversion ( Figure S13). This suggests that the specific retrieval algorithm used can have a large impact on the posterior fluxes, such that the improved agreement between surface-based and space-based measurements found in recent studies may be primarily due to improvements in the ACOS X CO2 retrieval algorithm. Miller and Michalak (2020) have also argued that recent improvements in the ACOS algorithm have substantially increased the reliability of OCO-2 X CO2 measurements in flux inversions studies (for version 8 in particular). Substantial work has gone into refining the ACOS retrieval algorithm over the past decade (Crisp et al., 2012;Eldering et al., 2017;Kiel et al., 2019;Nelson & O'Dell, 2019;O'Dell et al., 2012;O'Dell et al., 2018). Thus, the improved agreement between surface-based and space-based CO 2 constraints is likely best explained by improvements in the ACOS retrieval algorithm.
A consistent 6-year mean northern extratropical sink is obtained by all observational data sets. This result is in contrast to several previous studies that found substantial differences in the annual net NEE flux of CO 2 in the northern extratropics between flux inversions assimilating surface-based and space-based measurements (Basu et al., 2013;Chevallier et al., 2014;Reuter et al., 2014;Saeki et al., 2013). The reason why

Journal of Geophysical Research: Atmospheres
we obtain a more consistent annual net flux between data sets than some earlier studies is not immediately clear but could be due to advancements in the retrieval algorithm (e.g., ACOS 3.3 and earlier versions were used in Houweling et al., 2015) or due to the fact that we look at a multiyear mean, while earlier studies looked at shorter time periods (e.g., Houweling et al., 2015only examined June 2009to June 2010. In fact, we find that the surface-only inversion suggests weaker uptake in 2010 than average (by 0.40-0.49 PgC year −1 ), while the GOSAT flux inversion suggests near average uptake (see section 4.3), suggesting that the difference in inferred fluxes between these two data sets may have been unusually large for 2010. However, it is important to note that differences in annual net fluxes do not imply biases in the measurements. There are aspects of the inversion setups that can lead to differences. For example, differences in the distribution of observations can lead to significant differences in annual net fluxes (Basu et al., 2018;Byrne et al., 2017;Collatz et al., 2014). Thus, one should not necessarily expect consistent annual net fluxes from observational data sets with spatial and temporal gaps in observational coverage.
It is notable that regional-scale annual net posterior NEE is generally found to be consistent between data sets, but IAV is not found to be consistent between data sets. This may be due to differences in the observational coverage between data sets, such that IAV in NEE is attributed to different regions.

Does Combining Data Sets Improve Flux Inversions?
Is it possible to conclude that the GOSAT + surface + TCCON flux inversions improve flux estimates relative to the flux inversions that assimilate a single data set? Of course, the answer to this question depends on how "improve" is defined. The GOSAT + surface + TCCON flux inversions generally show a small reduction in model-data differences against independent aircraft-based CO 2 and OCO-2 X CO2 (north of 40°N). This suggests that combining these data sets in a flux inversion framework produces NEE fluxes that better recover the true atmospheric CO 2 fields than any data set alone. However, confounding factors in evaluating these fluxes remain a significant concern. Model transport errors appear to be a main driver of data-model differences for aircraft-based CO 2 measurements and obscure the source of data-model differences. Evaluating optimized fluxes against OCO-2 is also problematic because these retrievals are known to have their own biases.
The GOSAT + surface + TCCON flux inversions improve the precision of the posterior NEE fluxes relative to the flux inversions assimilating one data set. This is found to be the case at seasonal, annual, and interannual scales. The GOSAT + surface + TCCON flux inversions closely resemble the GOSAT-only NEE fluxes during the summer and surface-only fluxes during the winter for five northern extratropical regions. This is expected given the spatiotemporal distribution of GOSAT and surface-based CO 2 measurements and suggests that the GOSAT + surface + TCCON posterior NEE fluxes are better constrained by the observations than the GOSAT-only or surface-only flux inversions. Therefore, the GOSAT + surface + TCCON flux inversions are less likely to be impacted by biases in the observational coverage, such that, from an observational coverage perspective, we can conclude that the GOSAT + surface + TCCON flux inversions are better constrained than the GOSAT-only or surface-only flux inversions.
An important concern in combining CO 2 data sets within a single flux inversion system is that there could be relative biases in the atmospheric CO 2 constraints provided by the different data sets. Any inconsistency in flux constraints between data sets has the potential of introducing artifacts into the posterior fluxes. Biases in the observations could be present due to errors in the X CO2 retrieval algorithm, representativeness errors (Agustí-Panareda et al., 2019) or model transport errors. Several previous studies have suggested that unrealistically large uptake over Europe (∼1.5 PgC year −1 ) is recovered in posterior fluxes due to biases in the GOSAT retrieval algorithm (Basu et al., 2013;Chevallier et al., 2014), although the ACOS retrieval algorithm has undergone significant development since these studies O'Dell et al., 2018) resulting in reduced biases (Miller & Michalak, 2020). Similarly, a number of studies have pointed out systematic transport errors in GEOS-Chem Yu et al., 2018), as well as biases in reanalysis winds (e.g., vertical mixing, Parazoo et al., 2012). We do not find clear evidence for biases between the surface-based and GOSAT constraints, although these biases may be challenging to identify. We also note that we perform filtering of TCCON and surface flask and in situ measurements to remove outliers, which makes the data sets more consistent than they would be without filtering. However, we do see the impact 10.1029/2019JD032029

Journal of Geophysical Research: Atmospheres
of model transport errors in comparisons between the posterior-simulated CO 2 and aircraft measurements. Ideally, this analysis should be performed with two different transport models so that transport related errors could be more easily identified.

Conclusions
This study presented a series of flux inversions assimilating surface-based flask and in situ CO 2 measurements, TCCON X CO2 , GOSAT X CO2 , or all data sets combined. All of the flux inversions showed improved agreement with independent aircraft-based CO 2 measurements relative to prior flux estimates. The GOSAT + surface + TCCON flux inversion gave the smallest RMS differences against aircraft-based CO 2 measurements over East Asia and North America, and OCO-2 X CO2 measurements (north of 40°N), suggesting that combining the data sets improves flux estimates. However, the data-model mismatches were strongly impacted by transport model spatial resolution, which makes robust evaluations of posterior surface fluxes challenging.
We found that all observing systems generally give reduced spread in posterior NEE fluxes relative to the prior fluxes. This suggests that these data sets provide consistent information on NEE. The GOSAT + surface + TCCON posterior NEE most closely resembles the GOSAT-only posterior NEE during the summer and surface-only posterior NEE during the winter, consistent with the temporal variations in the observational constraints. This suggests that the GOSAT + surface + TCCON flux inversions benefit from the improved spatiotemporal distribution of measurements, providing posterior fluxes that are better informed by measurements throughout the year.
The results of this study suggest that surface-based and space-based atmospheric CO 2 constraints provide consistent constraints on NEE fluxes and can be combined in a flux inversion framework. This result stands in contrast to earlier attempts to combine these data sets (Houweling et al., 2015) and suggests that the improved consistency between the data sets has been made possible by the considerable effort spent refining the ACOS retrieval algorithm Eldering et al., 2017;Kiel et al., 2019;Miller & Michalak, 2020;O'Dell et al., 2018).

A1. Simple Biosphere Model
SiB3 was originally designed as a lower boundary for General Circulation Models with explicit treatment of biophysical processes. The ability to ingest satellite phenology was later introduced , and further refinements included a prognostic canopy air space (Vidale & Stöckli, 2005), more realistic soil and snow (Baker et al., 2003) and modifications to calculations of root water uptake and soil water stress  These fluxes are adjusted to obtain a global net drawdown equal to 4.6 PgC year −1 . To do this, the annual net flux at each grid cell and global total annual net drawdown are calculated. The annual net flux at each grid cell is then scaled so that the annual net flux is 4.6 PgC year −1 . The difference between the original and scaled annual net flux at each grid cell is then calculated. From this difference, an adjustment at each grid cell for each 14-day period is performed so that the annual net flux then equals the scaled annual net flux at each grid cell.
The prior NEE errors are generated based on the NEE fluxes provided by the models. It is first taken to be 60% of the NEE flux. This is then increased by scaling up the errors if the mean flux for a given grid cell is large but the flux is small at a given time. For example, the uncertainty is scaled up during the fall. We also inflate the uncertainty where the flux is small for SiB3 but large for CASA and FLUXCOM. The final errors range from 100% to 500% of the NEE flux. This results in a global annual error of 0.42 PgC year −1 .

A2. CASA
The version of the model used here, CASA-GFED3, was modified from Potter et al. (1993) as described in Randerson et al. (1996) and van der Werf et al. (2006). It is driven by MERRA reanalysis and satellite-observed NDVI to track plant phenology. We use the same fluxes as are used for the CarbonTracker 2016 (https://www.esrl.noaa.gov/gmd/ccgg/carbontracker/) prior. CASA outputs monthly fluxes of Net Primary Productivity (NPP) and heterotropic respiration (R H ). From these fluxes, GPP and ecosystem respiration (R e ) are estimated to be GPP=2NPP and R e =R H −NPP. Temporal downscaling and smoothing were performed from monthly CASA fluxes to 90-min fluxes using temperature and shortwave radiation from the ECMWF ERA-interim reanalysis (note this method differs from Olsen and Randerson (2004). GFED_CMS is used for global fire emissions (http://nacp-files.nacarbon.org/nacp-kawa-01/). We use average model fluxes by averaging the fluxes for 2007-2012.
These fluxes are adjusted to obtain a global net drawdown equal to 4.6 PgC year −1 . To do this, the annual net flux at each grid cell and global total annual net drawdown are calculated. The annual net flux at each grid cell is then scaled so that the annual net flux is 4.6 PgC year −1 . The difference between the original and scaled annual net flux at each grid cell is then calculated. From this difference, an adjustment at each grid cell for each 14-day period is performed so that the annual net flux then equals the scaled annual net flux at each grid cell.
The prior NEE errors are generated based on the NEE fluxes provided by the models. It is first taken to be 60% of the NEE flux. This is then increased by scaling up the errors if the mean flux for a given grid cell is large, but the flux is small at a given time. For example, the uncertainty is scaled up during the fall. We also inflate the uncertainty where the flux is small for CASA but large for SiB3 and FLUXCOM. The final errors range from 100% to 500% of the NEE flux. This results in a global annual error of 0.31 PgC year −1 .

A3. FLUXCOM
FLUXCOM products are generated using upscaling approaches based on machine learning methods that integrate FLUXNET site level observations, satellite remote sensing, and meteorological data (Jung et al., 2017;Tramontana et al., 2016). Jung et al. (2017) generate R e products using several machine learning methods. For this study, we downloaded the products generated using random forests (RFs), multivariate regression splines (MARS), and artificial neural networks (ANNs) at daily resolution from the Data Portal of the Max Planck Institute for Biochemistry (https://www.bgc-jena.mpg.de). The mean seasonal cycle over 2008-2012 is calculated for each product.
These fluxes are adjusted to obtain a global net drawdown equal to 4.6 PgC year −1 . For FLUXCOM, we only adjust fluxes south of 35°N because the northern extratropical NEE fluxes have been heavily informed by FLUXNET sites. For grid cells south of 35°N, the annual net flux at each grid cell and global total annual net drawdown are calculated. The annual net flux at each grid cell is then scaled so that the annual net flux is 4.6 PgC year −1 . The difference between the original and scaled annual net flux at each grid cell is then calculated. From this difference, an adjustment at each grid cell for each 14-day period is performed so that the annual net flux then equals the scaled annual net flux at each grid cell.
The prior NEE errors are generated based on the NEE fluxes provided by the models. It is first taken to be 60% of the NEE flux. This is then increased by scaling up the errors if the mean flux for a given grid cell is large but the flux is small at a given time. For example, the uncertainty is scaled up during the fall. We also inflate the uncertainty where the flux is small for FLUXCOM but large for SiB3 and CASA. The final errors range from 100% to 500% of the NEE flux. This results in a global annual error of 0.30 PgC year −1 .