##### 2.3.1. NOAA Surface Flask Observations

[14] Currently, the NOAA ESRL surface network consists of over 50 surface stations worldwide at which CO mixing ratios are measured weekly, with high analytical precision, from flask samples [*Novelli et al.*, 2003]. However, model simulations on a coarse grid are difficult to compare one-to-one with these flask observations, mainly because of model representativeness errors. For example, in the model the emissions are prescribed per grid box and time step and are instantaneously mixed over the grid volume. In reality, the subgrid-scale variability of the emissions leads to a heterogeneous distribution of CO mixing ratios within that box. Hence, a station located downwind of an emission source would observe higher CO mixing ratios than the model predicts. Furthermore, strong gradients in CO mixing ratios due to passing pollution plumes are much sharper in reality than the model can represent.

[15] For these reasons, inverse modeling studies deweight or reject some stations before assimilation to prevent biased results. For example, in CarbonTracker Europe (optimization of CO_{2} fluxes using Kalman filtering [*Peters et al.*, 2010]), two stations are explicitly excluded from assimilation and stations in strong emission regions are assigned large fixed errors (of 7.5 ppm CO_{2}). *Bergamaschi et al.* [2010] (4D-Var optimization of methane fluxes) give a detailed description of the model representativeness error in TM5. The total observational error *σ*_{obs} is the sum of a measurement error and the model representativeness error, the latter consisting of errors due to local emissions, modeled 3-D gradients and variations in time. It was shown that observational errors calculated in this way vary widely from station to station and, for a given station, can vary throughout the year.

[16] In this study we first apply a quantitative criterion to select the stations that we assimilate in the system and then apply the scheme of *Bergamaschi et al.* [2010] to estimate the overall observational error. The criterion is based on a model simulation with prior sources for the year 2004. The idea is that stations with a large diurnal cycle, most likely caused by nearby sources in the model, are excluded, whereas background stations and stations influenced by seasonal emissions (e.g., from biomass burning) are retained. With a model time step of 45 minutes, the model samples each station 32 times per day. From these modeled CO mixing ratio series we compute a daily standard deviation and use the mean daily standard deviation over the whole year as a measure of the diurnal variation. If this measure exceeds a certain threshold (set to 3.5 ppb in this study), the station is not assimilated in the 4D-Var system. As an illustration, the modeled simulation and the mean daily standard deviation for three stations are presented in Figure 2. For comparison, the standard deviation of the complete model time series (annual standard deviation) is also given. For station Sede Boker (Figure 2, top), the mean daily standard deviation amounts to 8.7 ppb and the station is not assimilated. In contrast, although station Barrow, Alaska (Figure 2, middle) has an annual standard deviation of 23.5 ppb, its mean daily standard deviation is only 2 ppb. This station is retained in the assimilation because the model is expected to reproduce the seasonal cycle more accurately than the diurnal cycle. For comparison, station South Pole, Antarctica shows only a very small spread throughout the year (6.9 ppb) and no daily spread, as there are no sources of CO nearby. We acknowledge that, since the criterion is based on a model simulation, the choice of stations to be assimilated depends strongly on the emissions used in that simulation. Here we used the prior emissions described in section 2.2, and we believe that we mainly assimilate stations for which the coarse model can reproduce the observations. The locations of the 34 stations retained in the assimilation are shown in Figure 3.
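
The selection criterion above can be sketched in a few lines of code. The following is an illustrative sketch (the function and variable names are ours, not taken from the TM5 code), assuming the modeled series is stored as one value per 45-minute time step:

```python
import numpy as np

SAMPLES_PER_DAY = 32   # 24 h / 45 min model time step
THRESHOLD_PPB = 3.5    # rejection threshold used in this study

def mean_daily_std(co_series_ppb):
    """Mean over all days of the within-day standard deviation.

    co_series_ppb: 1-D array of modeled CO mixing ratios (ppb),
    with length n_days * SAMPLES_PER_DAY.
    """
    daily = np.asarray(co_series_ppb).reshape(-1, SAMPLES_PER_DAY)
    return daily.std(axis=1).mean()

def assimilate_station(co_series_ppb, threshold=THRESHOLD_PPB):
    """True if the station passes the diurnal-variability criterion."""
    return mean_daily_std(co_series_ppb) <= threshold

# Synthetic example: a station with a seasonal cycle but weak diurnal cycle
# passes, while a station with a strong diurnal cycle is rejected.
t = np.arange(365 * SAMPLES_PER_DAY)
seasonal = 100 + 20 * np.sin(2 * np.pi * t / (365 * SAMPLES_PER_DAY))
diurnal = 100 + 15 * np.sin(2 * np.pi * t / SAMPLES_PER_DAY)
```

This illustrates why Barrow is retained despite its large annual standard deviation: a seasonal cycle contributes almost nothing to the within-day spread.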

[17] Compared to our previous study, the measurement error has been increased to 3 ppb, because *Hooghiemstra et al.* [2011] found that a measurement error of 1.5 ppb was too conservative, in particular in the remote SH. This was likely due to an underestimate of the model error in this region, as potential chemistry and transport errors were not included. As a consequence, a large fraction of the observations was not assimilated in the second cycle (see below). With the enhanced observational errors, the total observational error typically ranges from 3 to 20 ppb. Close to emission regions, the error can become as large as 100 ppb. In the clean remote SH, the observation error is dominated by the measurement component of 3 ppb. In contrast, in the polluted NH, where most surface CO is released, the model representativeness error is the dominant term (e.g., see the black bars representing the total observational error in Figure 7).

[18] Each flask observation contributes to the observational part of the cost function. The costs for a mismatch are defined as *J* = (*y*_{m} − *y*)^{2}/*σ*_{obs}^{2}, where *y*_{m} is the mean modeled CO mixing ratio during a 3 hour period, *y* is the observed CO mixing ratio and *σ*_{obs} is the observation error. We assume the errors to be uncorrelated, leading to a diagonal observational error covariance matrix **R**.
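
With a diagonal error covariance matrix, the observational cost function reduces to a sum of independent squared, error-weighted mismatches. A minimal sketch (not the actual TM5 4D-Var implementation):

```python
import numpy as np

def obs_cost(y_model, y_obs, sigma_obs):
    """Sum over observations of ((y_m - y) / sigma_obs)^2, i.e. the
    observational cost with a diagonal error covariance matrix."""
    y_model, y_obs, sigma_obs = map(np.asarray, (y_model, y_obs, sigma_obs))
    return float(np.sum(((y_model - y_obs) / sigma_obs) ** 2))

# A mismatch of twice the observation error contributes 4 to the cost:
cost = obs_cost([110.0], [100.0], [5.0])   # -> 4.0
```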

[19] We perform the inversion using surface flask observations in two cycles, following the approach of *Bergamaschi et al.* [2005]. After the first inversion cycle, all observations outside a 3*σ* interval are excluded from the second cycle to prevent single outliers from biasing the emission estimates. In our previous work, the fraction of rejected data points was 15–20%, which influenced the inferred emissions regionally from cycle 1 to cycle 2. With the larger observation errors used in this study, the fraction of rejected data points is reduced to around 8% for a preduc factor of 100 as used by *Hooghiemstra et al.* [2011]. Moreover, for a preduc factor of 10^{6}, this number further reduces to 4%, since the model fits the observations more accurately. As a result, the difference in inferred emissions from cycle 1 to cycle 2 becomes much smaller. We acknowledge that a rejection rate of 4% is still large, given that a Gaussian 3*σ* range should statistically lead to a rejection of less than 1% of the data. However, this 4% is mainly caused by a few stations that remain difficult to fit, most likely due to transitions from polluted to very clean air masses that the coarse model cannot resolve and that are difficult to capture in a representativeness error.
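
The screening between the two cycles can be illustrated as follows (a hypothetical sketch; the names are ours):

```python
import numpy as np

def keep_mask(y_model, y_obs, sigma_obs, n_sigma=3.0):
    """Boolean mask: True for observations retained in the second cycle,
    i.e. those whose model-data mismatch lies within n_sigma * sigma_obs."""
    mismatch = np.abs(np.asarray(y_model) - np.asarray(y_obs))
    return mismatch <= n_sigma * np.asarray(sigma_obs)

keep = keep_mask([100.0, 100.0], [105.0, 140.0], [3.0, 3.0])
# first observation kept (mismatch 5 <= 9), second rejected (40 > 9)
```

Note that enlarging `sigma_obs` widens the acceptance interval, which is why the larger observation errors in this study reduce the rejection rate.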

##### 2.3.2. MOPITT V4 CO Total Columns

[20] The MOPITT instrument was launched in December 1999 on board NASA's Terra satellite. Although a cooler failure in May 2001 disabled one side of the instrument, MOPITT has now been supplying valuable CO observations for 11 years. The instrument measures upwelling radiances in a thermal-infrared (TIR) spectral band near 4.7 *μ*m and in a shortwave infrared (SWIR) spectral band near 2.3 *μ*m. An optimal estimation technique is used to derive CO profiles [*Deeter et al.*, 2003]; a priori information is supplied because the optimization problem is ill-conditioned. In this paper we use MOPITT Version 4, Level 3 data [*Deeter et al.*, 2010], which are based exclusively on TIR observations. These data come as a daily product, gridded at 1° × 1° resolution, supplying the a priori profile, the retrieved profile and the corresponding averaging kernel matrix. As for the extensively validated MOPITT V3 [e.g., *Emmons et al.*, 2009], we use daytime observations between 65°S and 65°N only. Except in regions of strong thermal contrast, the MOPITT TIR-based V4 product is mainly sensitive to free tropospheric CO at altitudes of 4–7 km, and on average fewer than two independent pieces of information are inferred per profile. Since the total column is generally retrieved more accurately than a single level [*Deeter et al.*, 2003], we only use the CO tropospheric mean mixing ratio expressed in ppb. Figure 4 shows the differences in the CO tropospheric mean mixing ratio between MOPITT V3 and V4 for the months of March and September 2004. In general, MOPITT V4 is significantly lower than MOPITT V3. In the NH, differences of up to 30 ppb are observed. In the SH, differences are smaller (up to 20 ppb in September), but the relative differences are as large as in the NH due to the north–south gradient in CO mixing ratios.

[21] So far, inversion studies assimilating MOPITT columns have always used all MOPITT pixels over both ocean and land surfaces. However, *de Laat et al.* [2010] and *Hooghiemstra et al.* [2011] showed that MOPITT columns over deserts are biased high. *De Laat et al.* [2010] compared observed MOPITT V3 total columns in a latitude band over the Sahara desert to model columns and SCIAMACHY observed columns (taking into account the averaging kernels for MOPITT and assuming the SCIAMACHY averaging kernels to be unity [*de Laat et al.*, 2010]). They found that, while all three agreed well over the Atlantic Ocean, MOPITT observed CO total columns increased sharply at the land-ocean boundary, whereas the model and SCIAMACHY data showed no such increase. Over the Sahara desert, MOPITT total columns were on average 25% higher than model and SCIAMACHY columns. Moreover, *Deeter et al.* [2010] showed that MOPITT V4 at 700 hPa was 10–30 ppb higher than observations at the NOAA station Assekrem, Algeria. *Hooghiemstra et al.* [2011] conducted a global inversion for the year 2004 using NOAA surface flasks and compared both the prior and the posterior simulation with MOPITT V4 columns. They found differences of 15% between MOPITT and the model over the Sahara desert for both the prior and the posterior simulations. Figure 5 shows the mean modeled and observed total columns between 15°N and 26°N for all longitudes, together with the differences (in black, on the right axis) from the columns simulated using the prior emissions. MOPITT columns are up to 20% (and 20 ppb) higher over the Sahara desert and the Arabian Peninsula, located between longitudes 15°W and 55°E. This discrepancy cannot be explained by emissions or by transport. Therefore, we decided not to assimilate MOPITT land pixels in our 4D-Var system in this study. One might expect an unbalanced or even biased system from this approach, as the SH contains more ocean surface than the NH and is thus more heavily constrained by the observations. However, a sensitivity study using all MOPITT observations (including land pixels) showed large differences in inferred emission estimates only for Africa (see Figure 13). Moreover, this inversion was not able to completely reduce the prior mismatch over the Sahara desert, due to the lack of emissions in this region in the prior emission inventory. The sensitivity studies are discussed in detail in section 4.

[22] The contribution of a MOPITT observation to the observational part of the cost function is in principle calculated in the same way as for the surface stations described above. Thus, the costs are defined as *J* = (*y*_{m} − *y*)^{2}/*σ*_{obs}^{2}; *y*_{m} and *σ*_{obs} are detailed below. In contrast to the MOPITT V3 retrievals, which resulted in CO profiles in volume mixing ratios (VMR), the MOPITT V4 retrieved profiles are expressed as log(VMR) values. The V4 averaging kernels describe the sensitivity of retrieved log(VMR) to true atmospheric log(VMR) values. According to the MOPITT V4 user guide [*Deeter*, 2009], an in situ or model profile should be transformed using the averaging kernel and a priori profile, resulting in a pseudo profile:

log *ŷ* = log *y*_{a} + **A** (log *x*_{m} − log *y*_{a}),     (2)

where *ŷ* is the resulting pseudo profile, *y*_{a} is the MOPITT V4 a priori profile, **A** is the MOPITT V4 averaging kernel and *x*_{m} is the modeled CO profile, interpolated to the MOPITT pressure grid. The logarithms in equation (2), however, require the arguments to be positive. Due to the large Gaussian prior errors assigned to the emissions, negative emissions may arise during the iterative 4D-Var process. These negative emissions may occasionally lead to negative model profiles and hence invalid logarithms in equation (2). Another disadvantage of the formulation in equation (2) is that it leads to a nonlinear observation operator in the 4D-Var framework and prevents us from using the conjugate gradient method. This method has the important advantage that posterior emission uncertainties can be computed easily [*Meirink et al.*, 2008b]. Therefore, we have chosen to approximate the averaging kernel to first order (derivation in Appendix A), resulting in an averaging kernel **Ã**, with elements *Ã*_{ij} = (*y*_{a,i}/*y*_{a,j})*A*_{ij}, that can be used in the following way to construct the pseudo profile:

*ŷ* = *y*_{a} + **Ã** (*x*_{m} − *y*_{a}).     (3)
[23] From the pseudo profile *ŷ*, the scalar tropospheric mean mixing ratio *y*_{m} is computed as

*y*_{m} = (1/*p*_{surf}) ∑_{l=1}^{N_{lev}} *ŷ*_{l} Δ*p*_{l},     (4)
where *p*_{surf} is the surface pressure, *N*_{lev} is the number of levels of the MOPITT profile (10 or fewer, depending on orography) and Δ*p* is the vector of layer thicknesses in pressure units. We analyzed the differences in modeled CO columns computed with equation (3) compared to equation (2). The global monthly mean differences are typically within 2% (1 ppb). However, larger regional differences of up to 10% (15 ppb) may occur, as shown in Figure 6. These differences also vary over the year: in March 2004 the linearized approach leads to higher model columns in the SH and the NH tropics, but slightly lower columns in the NH midlatitudes (Figure 6, top). For September 2004, higher model columns are found over much of the NH. In the SH, both larger and smaller model columns occur when using the linearized averaging kernel compared to the formulation of equation (2). Hence, this approach may introduce a small bias and thus slightly biased emission estimates. However, a sensitivity study in which we explicitly corrected for the difference between equations (3) and (2), by subtracting the difference from the model columns in the prior simulation, led to optimized emissions well within the error bounds of the base inversion for each emission category (see Table 5).
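
For illustration, the log-space transformation of the model profile, its first-order linearization, and the pressure-weighted tropospheric mean can be sketched as below. This is an illustrative sketch under the definitions in the text (the function names are ours, and the averaging kernel and profiles are assumed to be given on a common pressure grid):

```python
import numpy as np

def pseudo_profile_log(x_model, y_prior, avk):
    """Exact log-space transformation: requires strictly positive profiles."""
    return 10.0 ** (np.log10(y_prior)
                    + avk @ (np.log10(x_model) - np.log10(y_prior)))

def pseudo_profile_linear(x_model, y_prior, avk):
    """First-order (linearized) averaging kernel: accurate for small
    departures from the prior, and safe for negative model profiles."""
    avk_lin = avk * np.outer(y_prior, 1.0 / y_prior)  # (y_a,i / y_a,j) * A_ij
    return y_prior + avk_lin @ (x_model - y_prior)

def tropospheric_mean(profile, dp, p_surf):
    """Pressure-weighted mean mixing ratio: sum_l profile_l * dp_l / p_surf."""
    return float(np.dot(profile, dp) / p_surf)
```

When the model profile equals the MOPITT prior, both transformations return the prior unchanged; for small perturbations the two agree closely, while for profiles far from the prior (or negative ones) only the linearized form remains well behaved.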

[24] For multiple 1° × 1° MOPITT profiles in the same 6° × 4° model grid box (up to 24), we use the same model profile to compute *ŷ* in equation (3) and *y*_{m} in equation (4). However, since every MOPITT retrieval has its own prior and averaging kernel, the values of *y*_{m} within the same model grid box will differ. Due to the varying orography within a grid box, the surface pressure defined on the 6° × 4° grid may differ from the retrieved surface pressure of the MOPITT observation, which is given at 1° × 1°. To address this, a surface pressure filter adopted from *Bergamaschi et al.* [2009] is used: only observations whose retrieved surface pressure is within 25 hPa of the model surface pressure are assimilated.
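
The surface pressure filter amounts to a simple tolerance check per observation; a minimal sketch (names are ours):

```python
import numpy as np

def pressure_filter(p_surf_obs_hpa, p_surf_model_hpa, tolerance_hpa=25.0):
    """Boolean mask of 1x1 degree observations accepted for assimilation:
    retrieved surface pressure within `tolerance_hpa` of the model value
    on the coarse 6x4 degree grid."""
    diff = np.abs(np.asarray(p_surf_obs_hpa) - p_surf_model_hpa)
    return diff <= tolerance_hpa

mask = pressure_filter([1010.0, 950.0], 1000.0)   # -> [True, False]
```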

[25] For the MOPITT observations we specify an observation error (*σ*_{obs}) for each (1° × 1°) observation. This error consists of a model error *σ*_{mod} and two types of measurement error (*σ*_{unc} and *σ*_{var}) such that

*σ*_{obs} = (*σ*_{mod}^{2} + *σ*_{unc}^{2} + *σ*_{var}^{2})^{1/2}.
The *σ*_{var} is given in the MOPITT product and represents the variability of all MOPITT profiles falling in the 1° × 1° box. The model error *σ*_{mod} is nonzero only if multiple MOPITT observations fall in the same 6° × 4° model grid box and is defined as the standard deviation of the modeled CO total columns *y*_{m} within that grid box. The dominant part of the observation error is *σ*_{unc}, which represents the uncertainty in the MOPITT observation. The resulting *σ*_{obs} is approximately 10% per observation. As for the surface flask observations, we do not include correlations between observations in the observation error covariance matrix. However, correlations are present both in the observations (as roughly the same air mass may be sampled more than once) and in the modeled columns. So far, similar studies have ignored correlations between observations by rebinning the observations on larger spatial scales, e.g., the model resolution. *Chevallier* [2007] performed an Observing System Simulation Experiment (OSSE) for CO_{2} using simulated OCO measurements binned to the 3° × 2° model resolution and investigated the effect of different treatments of the observations on the inferred emissions. It turned out that the best results were obtained by inflating the observation errors as an approximation to taking all correlations between observations into account. More recently, *Mukherjee et al.* [2011] introduced the statistical CAR model to account for observation correlations. The statistical model is described by a few parameters that are jointly optimized with the emissions in the inversion. In addition, this approach is capable of filling in missing observations. Although they showed this approach to be appealing, it was only applied in a so-called big-region approach [*Stavrakou and Müller*, 2006], in which the length of the state vector remains small. However, *Mukherjee et al.* [2011] state that the approach is scalable to the larger state vectors typically used in 4D-Var systems.

[26] *Chevallier* [2007] used an arbitrary error inflation factor of 2 in combination with observations binned to the 3° × 2° model resolution. Had *Chevallier* [2007] assimilated the observations at 1° × 1° resolution, the number of observations would have scaled roughly by a factor of 6, and hence the observational part of the cost function would also be expected to increase by a factor of 6. Therefore, since we assimilate the MOPITT columns at 1° × 1° resolution, we should compensate by inflating the error by an additional factor of √6. Hence, we initially used an inflation factor of 2√6, but this led to unrealistic emission estimates as the observations were overfitted (possibly due to the grid-scale emission error of 250%). We ultimately chose a considerably larger inflation factor, with which the observational cost function value for the MOPITT data set was roughly twice that of the corresponding cost function for the stations-only inversion. This rather large factor is justified by the fact that there are unknown correlations between the MOPITT observations. Moreover, in a future joint assimilation one needs to balance the observational costs of the individual data sets; otherwise, the system may fit mainly the satellite observations, and the fit with the stations might deteriorate significantly in the SH, as reported in previous studies [e.g., *Arellano et al.*, 2006; *Kopacz et al.*, 2010; *Fortems*-*Cheiney et al.*, 2011].
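
The inflation argument (the observational cost scales as the number of observations divided by the squared error) can be checked numerically; a sketch with synthetic residuals:

```python
import numpy as np

# If assimilating data at 1x1 instead of 3x2 degrees multiplies the number
# of observations by ~6, inflating the observation error by sqrt(6) keeps
# the observational cost function at its original magnitude.
rng = np.random.default_rng(0)
sigma = 10.0
residuals = rng.normal(0.0, sigma, size=600)          # 6x more observations

cost_inflated = np.sum((residuals / (np.sqrt(6.0) * sigma)) ** 2)
cost_original = np.sum((residuals[:100] / sigma) ** 2)  # 1/6 of the data
# both costs are ~100 in expectation (the effective number of observations)
```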