The goal of this study is to determine how H2O and HDO measurements in water vapor can be used to detect and diagnose biases in the representation of processes controlling tropospheric humidity in atmospheric general circulation models (GCMs). We analyze a large number of isotopic data sets (four satellite, sixteen ground-based remote-sensing, five surface in situ and three aircraft data sets) that are sensitive to different altitudes throughout the free troposphere. Despite significant differences between data sets, we identify some observed HDO/H2O characteristics that are robust across data sets and that can be used to evaluate models. We evaluate the isotopic GCM LMDZ, accounting for the effects of spatiotemporal sampling and instrument sensitivity. We find that LMDZ reproduces the spatial patterns in the lower and mid troposphere remarkably well. However, it underestimates the amplitude of seasonal variations in isotopic composition at all levels in the subtropics and in midlatitudes, and this bias is consistent across all data sets. LMDZ also underestimates the observed meridional isotopic gradient and the contrast between dry and convective tropical regions compared to satellite data sets. Comparison with six other isotope-enabled GCMs from the SWING2 project shows that biases exhibited by LMDZ are common to all models. The SWING2 GCMs show a very large spread in isotopic behavior that is not obviously related to that of humidity, suggesting water vapor isotopic measurements could be used to expose model shortcomings. In a companion paper, the isotopic differences between models are interpreted in terms of biases in the representation of processes controlling humidity.
 Despite continuous improvements in climate models, uncertainties in the predicted magnitude of climate change and associated feedbacks remain high [Randall et al., 2007]. Processes controlling tropical and subtropical tropospheric humidity are involved both in the water vapor and cloud feedbacks. The former is one of the largest feedbacks in magnitude [e.g., Soden and Held, 2006], while the latter are the largest source of spread in climate change projections [Bony and Dufresne, 2005; Bony et al., 2006]. Atmospheric general circulation models (GCMs) must therefore simulate the processes that control tropospheric humidity correctly for their climate change predictions to be credible.
 Tropical and subtropical tropospheric humidity results from a subtle balance between different processes: large-scale radiative subsidence [e.g., Sherwood, 1996; T. Schneider et al., 2006; Folkins and Martin, 2005], detrainment of condensate from convective clouds and its subsequent evaporation [e.g., Wright et al., 2009], evaporation of the falling precipitation [e.g., Folkins and Martin, 2005] and lateral mixing [e.g., Zhang et al., 2003]. In models, an approximately correct humidity simulation could arise from compensating errors in the representation of these processes. Thus humidity observations alone are insufficient for verifying that all relevant processes are properly represented in the models.
 As a first step, we synthesize a large number of isotopic data sets (four satellite, sixteen ground-based remote-sensing, five surface in situ and three aircraft) and use them to evaluate the spatiotemporal isotopic distribution in the GCM LMDZ (Laboratoire de Météorologie Dynamique-Zoom). We focus on model strengths and weaknesses which can be reliably diagnosed from the ensemble of data, given limitations imposed by current deficiencies in remote-sensing measurement calibration and validation. Then, we compare LMDZ with six other isotopic GCMs from the SWING2 (Stable Water INtercomparison Group phase 2) project, to investigate whether the shortcomings evidenced in LMDZ are common to other models. We characterize the difference in the simulated isotopic composition between GCMs and assess whether isotopic measurements can discriminate between models in their representation of processes controlling humidity. In P2, the differences in the simulated isotopic composition between SWING2 models are exploited to understand the causes of humidity biases.
 We present the LMDZ GCM and the SWING2 database in section 2, and the various data sets and the model-data comparison methodology in section 3. In section 4, we extract features that are the most robust across the different data sets and we use LMDZ to understand the differences between data sets. In section 5, we use the different data sets to evaluate the spatiotemporal isotopic ratio distribution in LMDZ. In section 6, we compare the different SWING2 models. We conclude and propose perspectives for future work in section 7.
 To compare with various data sets that have been collected since 1965, LMDZ is forced by observed sea surface temperatures (SST) and sea ice following the AMIP (Atmospheric Model Intercomparison Project) protocol [Gates, 1992] from 1965 to 2009. The year 2010 is forced by NCEP (National Centers for Environmental Prediction) SST [Kalnay et al., 1996] because the AMIP SSTs were not yet available. We checked that the SST data set used has little impact on the simulations, by comparing LMDZ outputs forced by AMIP and NCEP SSTs for 2009: root mean square errors on monthly outputs are lower than 0.5 K for SST and lower than 10‰ for tropospheric δD. Horizontal winds at each vertical level are nudged by ECMWF reanalyses [Uppala et al., 2005] as detailed by Risi et al. [2010c], forcing the simulations toward the actual meteorology and hence enabling direct comparison with observations on a daily basis.
2.2. SWING2 Models
 We compare eight simulations by six other GCMs participating in the SWING2 inter-comparison project (http://people.su.se/∼cstur/SWING2/). Some of them are nudged by reanalyses, while others are free-running (Table 1). Since daily outputs are not available, we cannot compare these models directly to the data as rigorously as we can for LMDZ. Instead, we compare all models to LMDZ as a common reference. One of the SWING2 models is a slightly different version of LMDZ (as described by Risi et al. [2010c]), where the second order advection scheme [Van Leer, 1977] was replaced by a simple upstream scheme (P2).
Table 1. List of the SWING2 Models Used in This Study and Their Respective Simulations^a
^a“Free-running” refers to standard AMIP-style simulations [Gates, 1992] forced by observed sea surface temperatures, and whose winds are not nudged.
 We focus on evaluating the HDO/H2O ratio as quantified by the variable δD in ‰: $\delta D = (R / R_{SMOW} - 1) \cdot 1000$, where $R$ is the HDO/H2O ratio of the water vapor and $R_{SMOW}$ is the Vienna Standard Mean Ocean Water (VSMOW) isotopic ratio [Craig, 1961]. To evaluate the simulated three-dimensional water vapor δD distribution from the surface to tropopause, we combine various data sets that sample different parts of the atmosphere. We use several satellite data sets which provide a global coverage (Table 2): SCIAMACHY (a short-wave infra-red spectrometer) mainly sensitive to the lower troposphere; TES (a nadir-viewing thermal infrared spectrometer) mainly sensitive to the mid-troposphere; ACE-FTS (an infrared solar-occultation instrument); and MIPAS (a limb infrared sounder). Both ACE-FTS and MIPAS are sensitive in the upper troposphere. In addition, we use ground-based remote-sensing data sets derived from mid-infrared and near-infrared solar absorption spectra acquired at 14 stations (Tables 3 and 4) and in situ measurements made at the surface and by aircraft (Table 5).
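The δD definition above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the numerical value of the VSMOW HDO/H2O ratio (taken here as twice a D/H ratio of 155.76 × 10⁻⁶) is an assumption of this example rather than a value quoted in the text.

```python
# Assumed VSMOW HDO/H2O ratio: twice a D/H ratio of 155.76e-6 (illustrative value).
R_VSMOW = 2 * 155.76e-6


def delta_d(hdo, h2o, r_std=R_VSMOW):
    """Return deltaD in permil from HDO and H2O amounts given in the same units."""
    return (hdo / h2o / r_std - 1.0) * 1000.0


def hdo_from_delta_d(delta, h2o, r_std=R_VSMOW):
    """Invert delta_d: HDO amount for a given deltaD (permil) and H2O amount."""
    return (delta / 1000.0 + 1.0) * r_std * h2o
```

With these helpers, vapor whose HDO/H2O ratio is 10% below the standard has δD = −100‰, and a sample at the standard ratio has δD = 0‰.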
Table 2. The Different Data Sets of Water Vapor Isotopic Composition Retrieved by Satellite Instruments That We Use^a
Spatial Coverage or Location
^aThe period indicates that used in our analysis, which can be shorter than the total data set.
Table 3. Sites of the NDACC Network for Which δD Profiles Up to 10 km Have Been Retrieved as Part of the MUSICA Project^a
Spatial Coverage or Location
^aThe period indicates that used in our analysis, which can be shorter than the total data set. For all these data sets, model outputs were both collocated and transformed by averaging kernels.
Arrival Heights (Antarctica)
77.82°S, 166.65°E
November 2002–December 2010, no winter measurements
Lauder (New Zealand)
December 2001–December 2010
August 2007–December 2010
Izaña (Canary Islands)
June 1999–December 2010
April 1996–December 2010
May 2010–December 2010
May 1996–December 2010
August 2006–December 2010, no winter measurements
Table 4. Sites of the TCCON Network for Which Total Column δD Have Been Retrieved^a
Spatial Coverage or Location
^aThe period indicates that used in our analysis, which can be shorter than the total data set. For all these data sets, model outputs were both collocated and transformed by averaging kernels. Precision estimates are described in Appendix B7.
 Remote-sensing measurements of δD are a very new development. Absolute measurement calibration is dependent on accurate spectroscopy, while retrieval validation requires in situ profiling capability. Currently, both are lacking. In the interim, a comparison methodology that is insensitive to absolute calibration uncertainties (i.e. characterization of spatiotemporal variability) is necessary. While it is possible that the observed variability is erroneous, the use of multiple data sets helps to ensure that the conclusions we draw are robust. Measurement principles, observing geometries and spectroscopic regions differ widely from data set to data set; hence, common errors are unlikely.
 Since each remote-sensing system of δD has its own sensitivity and is subject to sampling biases, we follow a model-to-satellite approach to estimate what the instruments would observe if measuring the model fields. First, to take into account the spatiotemporal sampling of the data, we collocate the outputs with the data at the daily scale. Second, to take into account the sensitivities of the different remote-sensing instruments to the true state, we apply the appropriate averaging kernels to the model outputs. Averaging kernels define the sensitivity of the retrieval at each level to the true state at each level.
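The two steps of the model-to-satellite approach can be sketched as follows. This is only an illustration under stated assumptions: the function names and data layout are hypothetical, and the kernel step uses the generic linear retrieval relation (retrieved state = a priori + kernel × (true state − a priori)). The actual variables and transformations used for each instrument are described in the paper's appendices and are not reproduced here.

```python
def collocate(model_daily, obs_keys):
    """Pick model values at the (day, grid_cell) keys where observations exist.

    model_daily: dict mapping (day, grid_cell) -> model value (hypothetical layout).
    obs_keys: iterable of (day, grid_cell) keys at which an observation was made.
    """
    return [model_daily[key] for key in obs_keys if key in model_daily]


def apply_averaging_kernel(x_model, x_prior, kernel):
    """Return x_prior + A (x_model - x_prior) for a square kernel A given as rows.

    Assumes the generic linear retrieval relation; each row of A weights the
    departure of the model profile from the a priori profile at every level.
    """
    diff = [m - p for m, p in zip(x_model, x_prior)]
    return [p + sum(a * d for a, d in zip(row, diff))
            for p, row in zip(x_prior, kernel)]
```

An identity kernel returns the model profile unchanged, while a kernel of zeros returns the a priori profile, which is why weakly sensitive retrievals stay close to their prior.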
 The SCIAMACHY (Scanning Imaging Absorption Spectrometer for Atmospheric Chartography) instrument on board ENVISAT (European Space Agency environmental research satellite) measures short-wave infrared spectra from reflected sunlight from which precipitable water and δD integrated over the entire atmospheric column are retrieved [Frankenberg et al., 2009]. It is mainly sensitive to the lower troposphere, since about 90% of the atmospheric water is found below 500 hPa and 55% below 800 hPa. The data are currently available from 2003 to 2005. The spatial footprint of SCIAMACHY δD is 120 km (across track) and 30 km (along-track) [Frankenberg et al., 2009]. Precision for a single measurement has been estimated as 40–100‰, but statistical uncertainty in the mean is reduced by averaging in space and time [Frankenberg et al., 2009] (Appendix B).
 To avoid potential isotopic biases related to the presence of clouds or sampling of an incomplete atmospheric column, we discarded all retrievals associated with a cloud fraction higher than 10% or with a retrieved precipitable water differing from ECMWF reanalyses by more than 10%, selecting only about one third of the measurements [Risi et al., 2010b]. The clear-sky sampling bias is further discussed in section 5.1. We sampled LMDZ-iso daily outputs coincident with observations and re-gridded the data on the LMDZ grid (2.5° in latitude × 3.75° in longitude).
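The screening described above can be written as a simple predicate. This is a sketch under assumptions: the function and argument names are hypothetical, and the precipitable water difference is taken relative to the reanalysis value, which the text does not specify.

```python
def keep_retrieval(cloud_fraction, pw_retrieved, pw_reanalysis,
                   max_cloud=0.10, max_pw_rel_diff=0.10):
    """Return True if a retrieval passes both screening criteria.

    cloud_fraction is a fraction between 0 and 1. The relative difference in
    precipitable water is normalized by the reanalysis value (an assumption).
    """
    rel_diff = abs(pw_retrieved - pw_reanalysis) / pw_reanalysis
    return cloud_fraction <= max_cloud and rel_diff <= max_pw_rel_diff
```

A retrieval is kept only when both conditions hold, which is equivalent to discarding it when either threshold is exceeded.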
 The TES (Tropospheric Emission Spectrometer) instrument [Worden et al., 2006, 2007] on board the Aura satellite measures high-resolution thermal infrared spectra from which the lower mid-tropospheric δD is retrieved. The footprint is 5.3 km × 8.5 km and the precision is about 10–15‰ (Appendix B2). We use the F04 version over the 2004–2008 period. The degrees of freedom for signal (DOFS) is a scalar metric indicating the sensitivity of the measurement to the true state. We selected only retrievals for which the DOFS was higher than 0.5, to ensure a significant sensitivity to the true state [Worden et al., 2006].
 A recent calibration study (see section 3.7) suggests that the raw TES δD is biased high [Worden et al., 2010]. This may arise from spectroscopic errors. Following Worden et al. and Lee et al., we corrected the HDO data by reducing it by approximately 4% (depending on the averaging kernels, Appendix A). This correction leads to a decrease of δD of about 40‰ in the tropics on average and of about 25‰ in dry subtropical regions, so it significantly affects zonal gradients. Since this correction results from only one calibration campaign, there remains significant uncertainty in the absolute value of the TES δD and in its meridional gradients. We re-interpolated the TES data on the LMDZ grid and analyzed results at 600 hPa, where the HDO sensitivity is at a maximum.
 To mimic the TES temporal and geographic sampling pattern, we sampled LMDZ-iso daily outputs coincident with observations. Due to limited instrument sensitivity and vertical resolution, the TES retrieval at each level reflects the δD over a broader range of altitudes, and is sensitive to the a priori information. These effects are represented by the averaging kernels, which depend on geographical location and atmospheric state, including the presence of clouds. To make the closest possible comparison, we apply the same averaging kernels to the simulated profiles. To do so, we calculated monthly mean averaging kernels for each LMDZ grid box, and applied these kernels together with the a priori constraint to the model outputs [Worden et al., 2006] (Appendix C2). In doing so, we neglect the day-to-day variability of the averaging kernels, such as that related to clouds. We determined that calculating monthly mean averaging kernels over total sky conditions or over clear sky conditions only leads to differences of less than 6‰ in kernel-weighted δD (Appendix C2), consistent with similar results for other chemical species [Aghedo et al., 2011]. On average, applying the averaging kernels leads to a δD increase of up to 10‰ in convective regions and of up to 30‰ in dry subtropical regions, demonstrating the importance of accounting for the sensitivity of the TES retrievals for a rigorous model-data comparison.
 The ACE-FTS (Atmospheric Chemistry Experiment Fourier Transform Spectrometer) instrument on board the ACE satellite measures δD profiles from the stratosphere to about 400 hPa depending on cloud cover [Nassar et al., 2007]. As an occultation sounder, it has better vertical resolution than nadir sounders, but a coarser horizontal resolution. We use the v2.2_HDO_Update over the 2003–2008 period. We discarded measurements with errors in H2O and HDO higher than the retrieved values. This leads to a slight bias toward measurements when H2O content is higher. In addition, we applied a 3 median absolute deviation filter to remove outliers. We checked that this method does not distort the mean and median [Jones et al., 2011]. The precision for individual measurements varies from about 20‰ in the tropics to 60‰ in midlatitudes (Appendix B3).
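The outlier screen can be sketched as below, reading it as a filter that rejects values lying more than three median absolute deviations (MAD) from the median. The function name is hypothetical, and the use of an unscaled MAD is an assumption of this example; the exact procedure is the one described by Jones et al. [2011].

```python
from statistics import median


def mad_filter(values, n_mad=3.0):
    """Keep values within n_mad median absolute deviations of the median.

    Uses the raw (unscaled) MAD, which is an assumption of this sketch.
    If the MAD is zero, only values equal to the median are kept.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return [v for v in values if abs(v - med) <= n_mad * mad]
```

For example, in the sample `[-200, -210, -190, -205, -195, 400]` the median is −197.5 and the MAD is 7.5, so only the value 400 is rejected.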
 Given the low number of solar occultation measurements per day, we re-gridded the data on a 10° × 100 hPa latitude/height grid. We sampled LMDZ-iso daily outputs coincident with observations. ACE does not use optimal estimation, and averaging kernels are not computed. To take into account the vertical resolution of the data, we applied a triangular kernel of base 3 km to the model outputs [Dupuy et al., 2009].
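A triangular smoothing of a model profile can be sketched as follows. The function name is hypothetical, and the sketch assumes a roughly uniform vertical grid, since the weights are not scaled by layer thickness; the exact implementation used for ACE is not given in the text.

```python
def triangular_smooth(heights_km, values, base_km=3.0):
    """Smooth a profile with a normalized triangular kernel of the given base width.

    The weight of level z for a target level z0 is max(0, 1 - |z - z0| / (base/2)).
    Assumes a roughly uniform vertical grid (weights ignore layer thickness).
    """
    half = base_km / 2.0
    smoothed = []
    for z0 in heights_km:
        weights = [max(0.0, 1.0 - abs(z - z0) / half) for z in heights_km]
        total = sum(weights)  # always > 0 because the weight at z0 itself is 1
        smoothed.append(sum(w * v for w, v in zip(weights, values)) / total)
    return smoothed
```

A constant profile is left unchanged, while a sharp feature confined to one level is spread over its neighbors within 1.5 km.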
 The MIPAS instrument on board the ENVISAT satellite is a limb sounder measuring δD profiles down to about 300 hPa [Payne et al., 2007; Steinwagner et al., 2007, 2010], at 10:00 am and 10:00 pm local time. We use the V3O_HDO_5 data between September 2002 and March 2004. We discarded data with the visibility flag equal to zero and with diagonal elements of the averaging kernels lower than 0.03. This rejects about 70% of the data in the tropics at 13 km. The precision is about 60‰ in the tropics and 150‰ in midlatitudes.
 We sampled LMDZ-iso daily outputs coincident with observations. Based on 216 MIPAS profiles collected during three days representative of different seasons, we calculated 21 averaging kernels representative of different tropopause heights from 7 km to 17 km in bins of 0.5 km (more details in Appendix C2). We applied these representative averaging kernels to model outputs depending on the observed tropopause height. As for TES, applying the kernel is crucial for the model-data comparison, as the convolution increases the δD by up to 300‰ in the tropical upper troposphere.
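Selecting one of the 21 representative kernels from the observed tropopause height could look like the sketch below. The function name is hypothetical, and treating the bins as centered on 7.0, 7.5, ..., 17.0 km, with out-of-range heights clipped to the end bins, is an assumption of this example.

```python
def kernel_index(tropopause_km, z_min=7.0, z_max=17.0, step=0.5):
    """Return the index (0 to 20) of the representative kernel for a tropopause height.

    Bins are assumed to be centered on z_min, z_min + step, ..., z_max, and
    heights outside that range are clipped to the nearest end bin.
    """
    clipped = min(max(tropopause_km, z_min), z_max)
    return int(round((clipped - z_min) / step))
```

With a 0.5 km step between 7 km and 17 km, this yields (17 − 7) / 0.5 + 1 = 21 possible indices, matching the number of kernels.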
 Biases related to incomplete scanning of the atmospheric column when clouds are present are discussed in Appendix B5, and are shown not to significantly affect our results.
3.5. Ground-Based FTIR at MUSICA Sites
 High resolution ground-based Fourier Transform Infrared (FTIR) spectrometers have been measuring solar absorption spectra in the mid-infrared region (750–4200 cm−1) since the 1990s at about 15 globally distributed sites that are part of the Network for the Detection of Atmospheric Composition Change (NDACC, www.acd.ucar.edu/irwg [Kurylo and Zander, 2000]). In the mid-infrared spectral region, there are several spectral microwindows with well-isolated and strong H2O and HDO signatures.
 In the framework of the project MUSICA (Multiplatform remote-sensing of Isotopologues for investigating the Cycle of Atmospheric water, www.imk-asf.kit.edu/english/musica), a dedicated water isotopologue retrieval algorithm is applied. It consists of a simultaneous optimal estimation of H2O and HDO as well as δD [M. Schneider et al., 2006, 2010b]. With this retrieval technique, tropospheric H2O and δD column abundances and profiles with a modest vertical resolution can be produced from the NDACC spectra. In this paper we use the ground-based MUSICA data version v101220_Ca.0. These data are retrieved using signatures in the 2650–3050 cm−1 spectral region [Schneider et al., 2010a].
 The uncertainties are estimated in detail by Schneider et al. [2010a]. Concerning column-integrated data, the precision is about 5‰ and biases can reach 100‰. Concerning profile data, the precision is about 10–25‰ and biases can reach 25–50‰ (Appendix B6). The most important source of both systematic biases and random errors in the profiles is uncertainty in modeling the shape of the high-resolution absorption lines.
 The ground-based MUSICA data version v101220_Ca.0 is currently available for 8 FTIR NDACC sites: Eureka (Canadian Arctic), Kiruna (Northern Sweden), Karlsruhe (Germany), Jungfraujoch (Switzerland), Izaña (Canary Islands), Wollongong (Australia), Lauder (New Zealand), and Arrival Heights (Antarctica) (Table 3). We sampled the LMDZ H2O and HDO profiles coincident with the observations and applied the averaging kernels and a priori profiles corresponding to each measured profile (Appendix C3).
3.6. Ground-Based FTIR at TCCON Sites
 The Total Carbon Column Observing Network (TCCON) [Wunch et al., 2010, 2011] is a network of very high quality ground-based FTIR systems recording solar absorption spectra in the near infrared spectral region (3800–9000 cm−1). In the near infrared there are strong and well isolated H2O absorption signatures, but HDO signatures are significantly weaker than the interfering absorption of H2O and CH4. We use data obtained at the TCCON sites Ny Alesund (Spitzbergen Island), Bremen (Germany), Park Falls, Lamont and Pasadena (United States), Lauder (New Zealand) and Darwin and Wollongong (Australia) (Table 4). δD is inferred from the retrieved total columns of H2O and HDO. The TCCON HDO data have not been evaluated for spectroscopic errors. Note that total column δD values derived from measurements at TCCON and NDACC sites have different error characteristics and sensitivities, due to different spectroscopic errors and retrieval methodologies.
 We sampled the LMDZ H2O and HDO profiles coincident with the observations and estimated the model-equivalent columns using the averaging kernels and a priori profiles, before calculating total column δD (Appendix C3). Averaging kernels were parameterized as a function of solar zenith angle alone. The uncertainty associated with using these kernels compared with using individual kernels is lower than 3‰ (Appendix C3).
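Because total column δD is computed from the ratio of the HDO and H2O columns, it is weighted by humidity and therefore dominated by the moist lower layers. The sketch below illustrates this for a model profile. It omits the averaging kernels and a priori profiles used in the actual comparison, and the function name and the VSMOW ratio value are assumptions of this example.

```python
# Assumed VSMOW HDO/H2O ratio: twice a D/H ratio of 155.76e-6 (illustrative value).
R_VSMOW = 2 * 155.76e-6


def column_delta_d(h2o_layers, delta_d_layers, r_std=R_VSMOW):
    """Total column deltaD (permil) from per-layer H2O amounts and deltaD values.

    HDO partial columns are rebuilt from deltaD, summed, and divided by the
    H2O column, so each layer is weighted by its water content.
    """
    hdo_layers = [(d / 1000.0 + 1.0) * r_std * q
                  for q, d in zip(h2o_layers, delta_d_layers)]
    ratio = sum(hdo_layers) / sum(h2o_layers)
    return (ratio / r_std - 1.0) * 1000.0
```

For two layers holding 90% and 10% of the water with δD of −100‰ and −300‰, the column value is −120‰, much closer to the moist layer than the unweighted mean of −200‰.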
3.7. In Situ Surface Measurements
 Two kinds of surface vapor measurements are used in this study. First, we use δD values from vapor samples obtained by cryogenic sampling: samples collected at GNIP (Global Network for Isotopes in Precipitation) vapor stations in Vienna, Ankara and Manaus (as in work by Risi et al. [2010c]), daily samples collected at Rehovot, Israel [Angert et al., 2008] and at Saclay, France (described by Risi et al. [2010c]), and samples collected during cruises in January 2004 by Uemura et al. δD was measured by mass spectrometers with precision better than 1‰, and tied to the absolute scale using reference standards.
 Second, we use continuous data collected by a Picarro instrument in Hawaii at about 680 hPa [Johnson et al., 2011; Noone et al., 2011]. δD values have been corrected to match values obtained by laboratory analysis of whole air vapor samples simultaneously collected with flasks. These data have been used for estimating the bias correction applied to the TES data [Worden et al., 2010] (section 3.2). We selected only the Picarro data during the nighttime, which represent the free troposphere [Worden et al., 2010]. Since LMDZ cannot see any land in the grid point of Hawaii, the model results at 680 hPa are representative of the free troposphere only, so we discarded observed daytime data that are representative of boundary layer air.
 For each measurement, we sampled LMDZ outputs coincident with observations and re-gridded all data on the model grid.
3.8. In Situ Aircraft Measurements
 Three aircraft data sets are used in this study. The first is data collected between 1–2 km and 9–11 km in Nebraska, around Santa Barbara and in Death Valley [Ehhalt, 1974]. Samples were collected by cryogenic sampling and analyzed on a mass spectrometer. Although the precision of the mass spectrometer is 1‰, some large errors can arise from the contamination in the sampling tubes. The data have recently been corrected for this effect but some unquantified and potentially large errors may remain especially in the upper troposphere [Ehhalt et al., 2005]. We sampled LMDZ outputs coincident with observations.
 The second data set was collected by the ICOS instrument [Sayres et al., 2009] during the Costa Rica Aura Validation Experiment (CR-AVE) campaign near Costa Rica in winter 2006, and the third was collected by the ICOS and Hoxotope instruments [St. Clair et al., 2008] during the Tropical Composition, Cloud and Climate Coupling (TC4) campaign in the same region in summer 2007. These data sets are both described by Sayres et al. We applied the same data quality-filtering and processing as in work by Sayres et al., including screening of potentially contaminated data. The measurement precisions are about 17‰ for ICOS and 50‰ for Hoxotope. Both ICOS and Hoxotope were calibrated through laboratory experiments. We sampled LMDZ outputs coincident with observations and re-gridded all data on the model grid. We show data only in grid boxes where data were sampled during both campaigns, to calculate seasonal variations and representative annual means.
4. Comparison Between Data Sets
 In this section, we compare the different data sets to extract the most robust features. Then we use LMDZ to understand and quantify the sources of differences between the data sets.
4.1. Robust Features Among Data Sets
Figure 1 synthesizes the data as zonal, annual means at different altitudes throughout the free troposphere. We show in situ data at the surface; SCIAMACHY, MUSICA and TCCON data for total column δD; TES, MUSICA, Hawaii and Ehhalt data at 600 hPa; ACE, MIPAS, MUSICA, CR-AVE/TC4 and Ehhalt data at 350 hPa; and ACE, MIPAS and CR-AVE/TC4 at 250 hPa.
 The different data sets show large differences in δD (Figures 1a–1e). To better visualize the meridional gradients, we subtract the tropical average (Figures 1f–1j). We find that the meridional gradient is qualitatively robust across data sets up to 350 hPa: in all data sets and at all levels, δD decreases poleward. This is qualitatively predicted from a simple Rayleigh distillation associated with decreasing temperature toward the poles and at higher altitudes (i.e. the temperature effect given by Dansgaard). For example, Figure 1g shows that the meridional gradients observed in total column δD are very consistent between SCIAMACHY, MUSICA and TCCON data sets. The meridional gradient from 0 to 60° increases with altitude: about −50‰ at the surface, between −80 and −100‰ in the lower and mid troposphere (ground-based FTIR, SCIAMACHY, TES) and between −120 and −350‰ at 350 hPa (ACE and MIPAS). The weaker meridional gradient at the surface than at higher altitude can be explained by the evaporative recycling near the surface, which partly counteracts the temperature effects as air masses move poleward [e.g., Werner et al., 2001; Noone, 2008]. Further, since the tropopause height decreases with latitude, the water vapor reaches low δD values at lower altitudes in high latitudes.
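The qualitative Rayleigh argument can be made concrete with a small sketch: as an air mass cools and loses water by condensation, the remaining vapor becomes progressively depleted. The function name is hypothetical, and the constant fractionation coefficient of 1.08 is an illustrative assumption, since in reality it depends on temperature.

```python
def rayleigh_delta(delta0, fraction_remaining, alpha=1.08):
    """deltaD (permil) of the remaining vapor after Rayleigh distillation.

    Uses R = R0 * f**(alpha - 1), where f is the fraction of the initial vapor
    that remains and alpha > 1 is the liquid/vapor fractionation coefficient.
    alpha = 1.08 is an assumed, temperature-independent illustrative value.
    """
    return (delta0 + 1000.0) * fraction_remaining ** (alpha - 1.0) - 1000.0
```

Starting from −80‰, the vapor δD decreases monotonically as the remaining fraction drops, which is the sense of the poleward and upward decrease described above.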
Figures 1k–1o show the June–July–August (JJA) minus December–January–February (DJF) differences at all levels. At all levels and in all data sets, δD is higher in summer than in winter in the subtropics and in midlatitudes. In all data sets, the seasonality reaches its maximum between about 30 and 50°. The amplitude varies between data sets and levels: 20–60‰ at the surface, 20–50‰ in the lower troposphere in SCIAMACHY and TES, 100–150‰ in the lower troposphere in ground-based FTIR, 100–200‰ from Ehhalt, 50–100‰ in the upper troposphere in ACE and about 200‰ in the upper troposphere in MIPAS. Note that at Wollongong (marker at 34.41°S), the JJA-DJF variations at 600 and 350 hPa are weak, but this is due to the fact that the seasonal cycle in δD at Wollongong is shifted by two months. April–May–June (AMJ) minus October–November–December (OND) variations are −60‰ and −43‰ at 600 and 350 hPa respectively, which are more consistent with the other data sets. For remote-sensing data sets with averaging kernels, we checked that this seasonality was not simply an artifact of the instrument sensitivity: when the averaging kernels are applied to a constant δD profile, no such seasonality appears in the kernel-weighted profiles. To summarize, despite differing amplitudes amongst data sets, the sign of the δD seasonality is very robust. Therefore, δD seasonality is a robust observed property that can be used to evaluate models.
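The seasonal diagnostic used here is a simple difference of three-month means. A minimal sketch, with a hypothetical function name and input layout (one mean δD value per calendar month):

```python
def jja_minus_djf(monthly):
    """JJA mean minus DJF mean of deltaD.

    monthly: dict mapping month number (1 to 12) to the monthly mean deltaD (permil).
    """
    jja = [monthly[m] for m in (6, 7, 8)]
    djf = [monthly[m] for m in (12, 1, 2)]
    return sum(jja) / 3 - sum(djf) / 3
```

In the Northern Hemisphere a positive value means higher δD in summer than in winter; in the Southern Hemisphere the same behavior gives a negative value, as with the negative AMJ minus OND values quoted for Wollongong.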
 While the equator-to-pole gradient and the seasonal differences are robust features, there are large differences in absolute values and variation magnitudes. The possible reasons for this are explored below.
4.2. Understanding Data Set Differences Using LMDZ
 The differences between the data sets can be explained by (1) spatiotemporal sampling, (2) instrument sensitivity, and (3) systematic biases in each data set. These three sources of differences are difficult to quantify directly since the data coverage is insufficient to quantify the effect of spatiotemporal sampling, and since vertical profiles through the troposphere are not available to explore the effect of instrument sensitivity. Therefore, we use LMDZ to quantify these three sources of differences, as explained in Appendix D. In doing so, we assume that LMDZ simulates the spatiotemporal δD patterns sufficiently well to quantify the spatiotemporal sampling and instrument sensitivity effects. We decompose the difference between each pair of data sets (ΔδDobs) as:
$$\Delta\delta D_{obs} = \Delta\delta D_{colloc} + \Delta\delta D_{convol} + \Delta\delta D_{errors}$$
where ΔδDcolloc is the effect of spatiotemporal sampling as predicted by LMDZ, ΔδDconvol is the effect of instrument sensitivity as predicted by LMDZ and ΔδDerrors is the combined effect of systematic biases in the data sets, of possible problems in the spatiotemporal patterns simulated by LMDZ and of sub-daily or sub-grid sampling that our collocation does not resolve (Appendix D). Results of this decomposition are shown in Table 6.
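In practice the decomposition is bookkeeping: the observed difference is the sum of the sampling term, the instrument sensitivity term and a residual, so the residual is obtained by subtraction. The sketch below uses hypothetical function names and made-up numbers, and does not reproduce the detailed definitions of Appendix D.

```python
def residual_term(delta_obs, delta_colloc, delta_convol):
    """Residual (errors) term: observed difference minus the two modeled effects."""
    return delta_obs - delta_colloc - delta_convol


def explained_fraction(delta_term, delta_obs):
    """Fraction of the observed difference accounted for by one term."""
    return delta_term / delta_obs
```

For instance, with an observed difference of 100‰, a sampling effect of 10‰ and an instrument sensitivity effect of 60‰ (illustrative values only), the residual is 30‰ and instrument sensitivity accounts for 60% of the difference.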
Table 6. Differences of Annual, Zonal Mean δD Between Pairs of Data Sets (ΔδDobs = δDi,obs,data − δDj,obs,data), and Their Relative Contributions^a
Total Difference (ΔδDobs) (‰)
Spatiotemporal Sampling Effect (ΔδDcolloc) (‰)
Instrument Sensitivity Effect (ΔδDconvol) (‰)
Residual (ΔδDerrors) (‰)
^aIncluded are the effect of spatiotemporal sampling (ΔδDcolloc = ΔδDi,colloc,model − ΔδDj,colloc,model), the effect of instrument sensitivity (ΔδDconvol = ΔδDi,convol,model − ΔδDj,convol,model) and the residual (i.e., due to measurement errors or problems in the spatiotemporal distribution simulated by LMDZ: ΔδDerrors = ΔδDi,error,data − ΔδDj,error,data + ϵi − ϵj). See Appendix D for a detailed explanation of these different terms and notations. When we compare two satellite data sets, we average both data sets over the same band of latitude noted in the Location column: 30°S–30°N for the tropics and 80°S–80°N for global mean. When we compare a satellite data set with a ground-based or aircraft data set, the ground-based/aircraft observations are averaged over the region of observation indicated in the Location column, and the satellite observation is averaged zonally at the same latitude as the ground-based or aircraft observations. This makes the δD difference in the Total Difference column consistent with what we can see in Figure 1. Ground-based stations at which no satellite data are available are not shown (e.g., no TES data at Arrival Heights).
 At most stations, δD measured by ground-based FTIR is higher than that measured by SCIAMACHY or by TES. Spatiotemporal sampling and instrument sensitivity effects are sometimes positive and sometimes negative, and so cannot explain this systematic difference. The difference is thus likely due to systematic biases in the data or problems in simulated patterns. Except for Jungfraujoch (high altitude) and Eureka, Ny Alesund and Arrival Heights (high latitude), the residual term is very consistent (between 30‰ and 87‰) across the 11 other FTIR stations, which span various climate conditions. This suggests that the systematic difference between ground-based FTIR and satellite data is mainly due to systematic biases in the data. The δD in SCIAMACHY and TES might be too low, or that of ground-based FTIR too high.
 Aircraft measurements by Ehhalt have a consistently higher δD than TES at 600 hPa. This is explained partly by spatiotemporal sampling and by the TES instrument sensitivity. Overall, some systematic biases make δD 50–75‰ higher in aircraft data than in TES at two of the three sites. These aircraft data also have a systematically higher δD than ACE, by 280 to 325‰, which is also mainly due to measurement biases in one of the data sets. In contrast, aircraft data measured during CR-AVE and TC4 usually have a lower δD than ACE at 250 hPa, which could be due to a systematic bias of about 100‰ in ACE or to systematic differences between clear sky conditions (which ACE requires) and cloudy conditions (which the aircraft may sample) that LMDZ does not resolve. Finally, MIPAS has a systematically higher δD than ACE. This difference is mainly (47 to 91%) due to the difference in instrument sensitivity, which can be taken into account in the model through convolution. However, the remaining 9–53% could be due to systematic biases in one of the data sets, or to different clear-sky sampling biases [Lossow et al., 2011].
 If one were to assume that aircraft in situ data provide calibrated δD values, then the data by Ehhalt and ICOS are inconsistent, since the former has higher δD than ACE and the latter has lower δD than ACE, even after accounting for spatiotemporal effects. This points to systematic biases in the data, to problems in the δD patterns simulated by LMDZ, or even possibly to problems in the δD patterns observed by ACE. Therefore, even though we are using some in situ data, we remain cautious with all absolute values and we will thus only focus on spatiotemporal variations that are consistent across all data sets.
 A similar decomposition approach can be used to understand the differences in meridional gradient and seasonality between data sets. In particular, in the upper troposphere, the ACE and MUSICA data exhibit a seasonality two to four times smaller than MIPAS in the subtropics and midlatitudes. This is explained mainly by the instrument sensitivity (e.g. 80% of the MIPAS-ACE difference in the Northern Hemisphere, 95% of the MIPAS-MUSICA difference between Izaña and Eureka). Similarly, ACE and MUSICA also exhibit a meridional gradient that is two to three times smaller than in MIPAS. The instrument sensitivity explains about 30% of the MIPAS-ACE difference and up to 70% of the MIPAS-MUSICA difference. This is consistent with the small sensitivity of the MUSICA retrievals to δD in the upper troposphere [M. Schneider et al., 2006], suggesting that MUSICA, and thus maybe also ACE, may underestimate the meridional gradient and seasonality of δD.
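 The bookkeeping behind this decomposition can be illustrated with a minimal sketch. It assumes that the residual is simply what remains of the total difference once the sampling and sensitivity terms are removed; the authoritative definitions are those of Appendix D. The function names and the numbers are hypothetical and are not taken from the paper's table.

```python
def decompose_difference(total, colloc, convol):
    """Split a delta-D difference between two data sets (all values in permil).

    total  : observed difference between data sets i and j
    colloc : part attributed to spatiotemporal sampling
    convol : part attributed to instrument sensitivity
    Assumption: the residual is whatever the two modeled effects leave unexplained.
    """
    residual = total - colloc - convol
    return {"colloc": colloc, "convol": convol, "residual": residual}


def fraction_explained(term, total):
    """Fraction of the total difference explained by one term."""
    if total == 0:
        raise ValueError("total difference is zero; fraction undefined")
    return term / total


# Illustrative numbers only (hypothetical, not from the paper).
budget = decompose_difference(total=100.0, colloc=10.0, convol=60.0)
share = fraction_explained(budget["convol"], 100.0)
```

 With these made-up values, instrument sensitivity would account for 60% of the difference and the residual for 30‰.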
5. Model-Data Comparison
 In this section, we compare LMDZ simulations to the different data sets, and focus on model-data differences that are robust across data sets. Before this comparison, we summarize below the different sources of model-data differences.
5.1. Sources of Model-Data Differences
 In addition to the biases in the data found in section 4.2, some sources of model-data differences can arise from our comparison methodology.
 The effect of spatiotemporal sampling is taken into account by collocating model outputs with the data at the daily scale. The root-mean-square error associated with the spatiotemporal sampling effect on zonal mean δD is lower than 5‰ for ground-based FTIR data that have a high measurement frequency, but is about 10–20‰ for SCIAMACHY, TES and ACE satellite data sets, and up to 50‰ for MIPAS (Figure 2, black). This shows the importance of taking this effect into account in the model-data comparison. The spatiotemporal sampling effect usually increases (decreases) δD in regions of large-scale ascent (descent) for SCIAMACHY, decreases δD in ACE and in MIPAS at 350 hPa, and has little coherent effect for other data sets. Our collocation method ignores spatial variations at small scales that could lead to differences between δD in a small instrument footprint and δD in the 2.5° × 3.75° GCM grid box. We also ignore sub-daily temporal variability. Some satellites sample the atmosphere once or twice a day at the same local time every day, which may have a systematic effect on δD. These additional sources of model-data differences are difficult to quantify with a GCM.
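 As an illustration of daily collocation, the sketch below samples daily model output at the day and grid box of each observation. It is a simplified stand-in, not the procedure actually used for LMDZ: the grid layout (boxes tiling the globe from the south pole and from 0° longitude), the function names and the dictionary-based storage are all assumptions.

```python
DLAT, DLON = 2.5, 3.75  # grid spacing quoted in the text (degrees)


def grid_index(lat, lon):
    """Return (row, col) of the grid box containing a point.

    Assumption: boxes tile [-90, 90] x [0, 360) starting at the south pole
    and at 0 deg longitude; the real model grid may be laid out differently.
    """
    n_lat = int(180 / DLAT)   # 72 rows
    n_lon = int(360 / DLON)   # 96 columns
    row = min(int((lat + 90.0) // DLAT), n_lat - 1)
    col = int((lon % 360.0) // DLON) % n_lon
    return row, col


def collocate(observations, model_daily):
    """Sample daily model output at each observation's day and grid box.

    observations : iterable of (day, lat, lon) tuples
    model_daily  : dict mapping (day, row, col) -> simulated delta-D (permil)
    Observations without a matching model value are skipped.
    """
    samples = []
    for day, lat, lon in observations:
        key = (day, *grid_index(lat, lon))
        if key in model_daily:
            samples.append(model_daily[key])
    return samples
```

 Averaging the collocated samples, rather than the full model field, reproduces the sampling of the instrument; sub-grid and sub-daily variability remain unresolved, as noted above.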
 The effect of instrument sensitivity is taken into account by applying the averaging kernels to the model outputs. The root-mean-square error associated with this effect is about 10–30‰ for TES and larger than 40‰ for MIPAS, showing the importance of taking this effect into account in the model-data comparison (Figure 2, red). The instrument sensitivity effect increases δD in deep convective regions, in the subtropics and at higher latitudes for TES, and strongly increases δD for MIPAS (by about 250‰ in annual tropical average). An additional source of model-data difference can arise if the atmospheric conditions (especially the presence of clouds) in the data and the model are sufficiently different to affect the kernels used in the convolution. We estimated this effect for TES (Appendix C1) and found it to be small (6‰).
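 The convolution step can be sketched with the usual optimal-estimation smoothing relation, x̂ = xa + A (xmodel − xa), where xa is the a priori profile and A the averaging kernel. This is a generic sketch and not the retrieval-specific code: whether x stands for a mixing ratio or its logarithm depends on the instrument, and the function name is hypothetical.

```python
def apply_averaging_kernel(kernel, prior, model_profile):
    """Smooth a model profile as the instrument would see it.

    Implements x_hat = x_a + A (x_model - x_a).
    kernel        : n x n list of lists (averaging kernel A)
    prior         : length-n a priori profile x_a
    model_profile : length-n model profile on the retrieval levels
    Assumption: the caller supplies x in the space used by the retrieval
    (e.g. logarithm of the mixing ratio if that is what the kernel expects).
    """
    n = len(prior)
    if (len(kernel) != n or len(model_profile) != n
            or any(len(row) != n for row in kernel)):
        raise ValueError("kernel, prior and profile must have matching sizes")
    diff = [m - a for m, a in zip(model_profile, prior)]
    return [prior[i] + sum(kernel[i][k] * diff[k] for k in range(n))
            for i in range(n)]
```

 An identity kernel returns the model profile unchanged, whereas a zero kernel (no sensitivity) returns the a priori profile, which is why weak sensitivity pulls kernel-weighted δD toward the prior.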
 The remote-sensing instruments used in this study preferentially sample clear-sky conditions. This sampling effect is taken into account by the collocation if the model simulates clouds exactly at the right place and time. If not, then the clear-sky bias exhibited by the data will be underestimated by the collocated outputs. To estimate an upper bound for this source of uncertainty, we performed a test in which we rejected the cloudiest 30% of scenes among all collocated outputs. Then we examined the difference between monthly δD with and without this additional cloud mask. We assume that the arbitrary 30% threshold is sufficiently high to give an upper bound estimate for the cloud effect. The root-mean-square errors associated with this effect are about 5–10‰ in SCIAMACHY, TES, ACE and MIPAS, and can reach up to 20‰ at some ground-based FTIR stations (Figure 2, green). Hereafter we will thus focus on signals that are larger than those values. The clear-sky sampling bias effect increases (decreases) δD in large-scale ascent (descent) regions in lower tropospheric measurements, and has little coherent effect in the upper troposphere.
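 The cloud-mask test described above can be sketched as follows. The representation of a scene as a (cloud fraction, δD) pair and the function names are assumptions made for illustration; the 30% threshold is the one quoted in the text.

```python
def clear_sky_subset(scenes, reject_fraction=0.30):
    """Drop the cloudiest scenes.

    scenes          : list of (cloud_fraction, delta_d) pairs from collocated output
    reject_fraction : share of scenes to discard, cloudiest first
    """
    ranked = sorted(scenes, key=lambda s: s[0])  # clearest scenes first
    n_keep = len(ranked) - int(round(reject_fraction * len(ranked)))
    return ranked[:n_keep]


def mean_delta_d(scenes):
    """Mean delta-D (permil) over a non-empty list of scenes."""
    return sum(d for _, d in scenes) / len(scenes)


def cloud_mask_effect(scenes, reject_fraction=0.30):
    """Mean delta-D with the extra cloud mask minus mean delta-D without it."""
    masked = clear_sky_subset(scenes, reject_fraction)
    return mean_delta_d(masked) - mean_delta_d(scenes)
```

 A positive value means that restricting the sample to the clearest scenes raises the mean δD, a negative value that it lowers it.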
5.2. Meridional Gradient
Figures 3a, 3d, 3g, 3j, and 3m show model-data differences at different levels. The model-data agreement is quite good at the surface, with model-data differences within 30‰ (Figure 3a). Model-data differences increase with altitude, reaching 200‰ in the upper troposphere compared to several data sets. There are systematic offsets between the model and the data, and the signs of the offsets differ between data sets. For example, simulated total column δD is higher than observed by SCIAMACHY, but lower than observed by ground-based FTIR (Figure 3d). This arises from systematic errors in the data sets (section 4.2), so we focus on spatiotemporal variations.
Figures 3b, 3e, 3h, 3k and 3n show the model-data differences from which we subtracted the annual tropical average. The simulated meridional gradient is slightly too strong at the surface in mid and high latitudes. In contrast, in the free troposphere, simulated meridional gradients are too weak compared to all satellite and ground-based FTIR data sets. For example, the model-data difference for total column δD is about 20‰ higher in the midlatitudes (i.e. 45°N or 45°S) than in the tropics compared to both SCIAMACHY and ground-based FTIR (Figure 3e). The meridional gradient between the equator and midlatitudes is thus about 30% weaker in LMDZ than in the data. Similarly at 350 hPa, the model-data difference is 70‰ (respectively 300‰) higher in the midlatitudes and in the subtropics than in the tropics compared to ACE (respectively MIPAS) (Figures 3k and 3n).
 The fact that LMDZ underestimates the meridional gradient throughout the troposphere but not at the surface suggests that an overestimated evaporative recycling along poleward trajectories is not responsible for the model bias. Rather, the bias could be due to overestimated vertical mixing in midlatitudes (which transports high δD values upward), to overestimated mixing between the tropics and midlatitudes (which smooths the gradient), or to underestimated convective detrainment of condensate in the tropics (which increases upper tropospheric δD in the tropics [Moyer et al., 1996]; see also P2).
 In the upper troposphere, the strong underestimate of the simulated δD meridional gradient could be partly due to the poor representation of stratospheric δD, associated with mis-representation of the tropopause level, or of dynamical and chemical processes in the stratosphere. In MIPAS for example, we find that poleward of 45°S or 45°N, stratospheric δD accounts for more than 40% and 60% of the signal at 350 and 250 hPa respectively. However, within 25°S–25°N where much of the discrepancy in meridional gradients takes place, the model-data difference is completely due to discrepancies in tropospheric values. As a test, we replaced simulated δD by observed δD everywhere above the tropopause and applied the kernels: the impact on kernel-weighted δD values was smaller than 10‰ at both 350 and 250 hPa.
5.3. Seasonality
Figures 3c, 3f, 3i, 3l, and 3o show the model-data differences for zonal mean seasonal δD variations. In the subtropics and midlatitudes of both hemispheres, LMDZ underestimates the observed seasonality. For example, at about 30°N, the δD seasonality is underestimated by about 20‰ (corresponding to a relative underestimation of −50%) compared to SCIAMACHY, by 80‰ (−75%) compared to MUSICA at Izaña both for total column and at 600 hPa, by 60‰ (−75%) compared to ACE at 350 hPa and by 80‰ (−50%) compared to MIPAS at 350 hPa. This underestimate is larger than all sources of model-data differences that we have previously described.
 Although the magnitude of the underestimate of the simulated δD seasonality varies depending on altitude and data set, the simulated bias in δD seasonality is robust in sign at all levels and compared to almost all data sets, with just two exceptions. The first exception is TES in the Northern Hemisphere, where model-data differences are very small. The second exception is at Wollongong at 600 and 350 hPa, where we have mentioned that the observed seasonality was shifted by two months. When AMJ-OND variations are considered instead, LMDZ underestimates the seasonality there too, consistent with the other data sets. Therefore, the consistency between almost all data sets and the large magnitudes of the model-data differences give us confidence that the underestimated δD seasonality is a robust and significant model bias.
 The most likely reason for this bias is investigated in detail in P2 and is shown to be overestimated vertical mixing, which preferentially increases δD in dry regions and during dry seasons.
5.4. Spatial Patterns
 To focus on spatial patterns, we show δD maps after subtracting the tropical mean δD of each data set and model output (Figures 4a–4d). In the lower troposphere, in the annual mean, SCIAMACHY and TES show δD maxima over deep convective regions over land, and lower δD over dry subtropical regions, such as the subsidence regions off the West coast of continents, and to a lesser extent over the Sahara. LMDZ captures these spatial patterns reasonably well, with model-data correlations of 0.63 for SCIAMACHY and 0.94 for TES within 45°S–45°N. However, LMDZ underestimates the depletion over the driest regions. This model bias was also noticed in the GSM GCM [Frankenberg et al., 2009], showing that LMDZ is not the only GCM exhibiting this problem. As for seasonality, a likely reason for this is excessive vertical mixing in dry regions (P2).
 In the lower troposphere, observed seasonal variations are much larger over land than over ocean and are largest over the driest land regions such as the Sahara (Figures 5a–5d). LMDZ reproduces this feature very well. LMDZ simulates a slightly lower δD during the wet season in the Warm Pool and South-East Asian region, which is not the case in TES. This model-data difference was already noticed in the CAM GCM [Lee et al., 2009], though this problem is much less pronounced in LMDZ. Sensitivity tests performed both by Lee et al. and with LMDZ (P2) suggest that this problem is very sensitive to the fraction of precipitation arising from convection versus large-scale condensation.
 In the upper troposphere, only MIPAS has sufficient sampling to investigate spatial patterns, although the large discrepancies between ACE and MIPAS suggest that we need to be very cautious when interpreting these results. MIPAS shows maximum δD over deep convective regions, and minima in the subtropics and high latitudes (Figure 4). There is a slight secondary δD maximum over the jet stream regions, for example in the Northern midlatitude Atlantic. LMDZ does not capture the spatiotemporal variations of δD in the upper troposphere. Compared with the data, the simulated maximum δD in deep convective regions is too weak, consistent with the underestimated meridional gradient. The model has a maximum δD at about 30°N and 20°S, whereas the observed δD has a local minimum at these latitudes. In addition, LMDZ captures the seasonal cycle very poorly in all regions. Note that LMDZ would probably compare better to ACE, given the weaker meridional and seasonal variations in this data set.
 To better visualize the δD contrast between wet and dry regions or seasons in the tropics, we classify the data and model outputs into dynamical regimes based on large-scale vertical velocity at 500 hPa (ω500), as suggested by Bony et al. [2004] (Figure 6). We use ω500 from the ECMWF reanalysis for both the data and simulations, since simulations were nudged by the ECMWF winds. In the lower troposphere, for ω500 < 15 hPa/day, observed δD decreases as the dynamical regime becomes more convective in SCIAMACHY (consistent with the amount effect) and remains almost constant in TES. For ω500 > 15 hPa/day, observed δD strongly decreases as subsidence increases. This behavior is very well captured by LMDZ, although the ω500 threshold between the two regimes is slightly higher for LMDZ than for SCIAMACHY, explaining the overestimated δD in very dry regions. In the upper troposphere, observed δD increases as the dynamical regime becomes more convective for ω500 < 20 hPa/day. LMDZ reproduces this behavior very poorly.
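 The regime classification amounts to compositing δD in bins of ω500. A minimal sketch is given below; the 10 hPa/day bin width and the function name are assumptions, not values taken from the paper.

```python
def composite_by_regime(omega500, delta_d, bin_width=10.0):
    """Average delta-D (permil) in bins of 500 hPa vertical velocity (hPa/day).

    Negative omega500 corresponds to ascent (convective regimes), positive
    omega500 to subsidence. Returns {bin_lower_edge: mean delta-D}.
    Assumption: a fixed bin width; the paper's binning may differ.
    """
    sums, counts = {}, {}
    for w, d in zip(omega500, delta_d):
        edge = (w // bin_width) * bin_width  # lower edge of the bin holding w
        sums[edge] = sums.get(edge, 0.0) + d
        counts[edge] = counts.get(edge, 0) + 1
    return {edge: sums[edge] / counts[edge] for edge in sorted(sums)}
```

 Applying the same function to observations and to collocated, kernel-weighted model output gives curves that can be compared regime by regime.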
5.5. Link With Biases in Humidity
 The goal of this section is to investigate to what extent model-data differences in δD are linked to those in humidity, to assess the added value of δD measurements compared with humidity measurements alone. Traditionally, the isotopic distribution has been interpreted in terms of progressive depletion in deuterium as specific humidity (q) decreases, following Rayleigh distillation [e.g., Dansgaard, 1964; Worden et al., 2007]. Spatially at the global scale [e.g., Worden et al., 2007], or temporally in dry regions [e.g., Galewsky et al., 2007; Brown et al., 2008], δD has been shown to be strongly related to q. However, this does not necessarily mean that δD provides the same information as q. In this section, rather than looking at the relationship between q and δD, we look at the link between biases in q and biases in δD. Slight deviations from the Rayleigh-like behavior can arise from mixing between different air masses [e.g., Galewsky and Hurley, 2010], from detrainment of condensate [e.g., Moyer et al., 1996; Dessler and Sherwood, 2003] or from rainfall evaporation [Worden et al., 2007; Field et al., 2010]. If these processes are not appropriately simulated, then biases in δD differ from those in q.
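 For reference, the Rayleigh behavior mentioned above can be written as R = R0 f^(α−1), where f is the fraction of the initial vapor remaining and α the fractionation factor. The sketch below expresses this in δ notation. It assumes a constant α (the value 1.08 is purely illustrative, since α depends on temperature), and the function name is hypothetical.

```python
def rayleigh_delta_d(q, q0, delta0, alpha=1.08):
    """delta-D (permil) of vapor after Rayleigh distillation from (q0, delta0) to q.

    q, q0  : current and initial specific humidity (same units)
    delta0 : initial delta-D of the vapor (permil)
    alpha  : condensate/vapor fractionation factor, held constant here as a
             simplifying assumption
    """
    if not 0 < q <= q0:
        raise ValueError("expected 0 < q <= q0")
    f = q / q0  # fraction of the initial vapor remaining
    return (delta0 + 1000.0) * f ** (alpha - 1.0) - 1000.0
```

 Under this idealization δD is a monotonic function of q, so a bias in q would map directly onto a bias in δD; departures from such a one-to-one relationship are what the mixing, detrainment and rain evaporation processes listed above introduce.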
Figures 7a–7d show zonal, annual mean biases in δD as a function of zonal, annual mean biases in q. Correlations are either weak or negative. This means that in LMDZ, the underestimated meridional gradient in δD compared to all satellite data sets is not associated with an underestimated meridional gradient in q. Regions where δD is most over-estimated compared to the global mean are not regions where q is most over-estimated. LMDZ overestimates q at all free tropospheric levels compared to all satellite data sets. This bias has been noticed in many GCMs for more than a decade [Soden and Bretherton, 1994; Roca et al., 1997; Allan et al., 2003; Brogniez et al., 2005; Pierce et al., 2006; John and Soden, 2007; Chung et al., 2011]. This moist bias is most pronounced in tropical and sub-tropical regions, whereas the high bias in δD is most pronounced in the subtropics and in midlatitudes. Therefore, the fact that LMDZ underestimates the meridional gradient in δD provides additional information on the model representation of physical processes compared to the information derived from q only.
Figures 7e–7h show the relationship between biases in q seasonality and biases in δD seasonality. We use the relative seasonality in q (i.e. we normalize it by annual mean q) for both the data and for LMDZ. In this plot we would obtain an approximately straight line if δD and q were linked by pure Rayleigh distillation. In the lower troposphere, the correlation between bias in δD seasonality and bias in q seasonality is 0.61 (Figure 7b). This means that LMDZ underestimates the δD seasonality more so in regions where it also underestimates the relative seasonality in q. Therefore, the underestimated relative seasonality of q could partly contribute to the underestimated δD seasonality. However, the correlation is weaker in the mid-troposphere and is negative in the upper troposphere. Therefore, the underestimated seasonality in δD cannot be explained by the underestimated relative seasonality in q in the mid and upper troposphere. P2 suggests that overestimated vertical mixing explains this underestimated seasonality.
 The spatial correlation between biases in q and biases in δD is weak for all satellite data sets for both annual mean and seasonal variations: for example, within 45°N–45°S, the correlation between annual mean biases in δD and in q is 0.01 for SCIAMACHY and 0.02 for TES, and the correlation between seasonal variation biases in δD and in q is 0.10 for SCIAMACHY and 0.21 for TES. Therefore, model-data differences in δD provide different information on model shortcomings than humidity measurements alone, showing the added value of isotopic measurements.
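 The correlations quoted here are correlations between two bias fields. A minimal sketch of that calculation follows, assuming that model and observed values have already been collocated and flattened into lists of equal length; the function names are hypothetical and no area weighting is applied.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bias_correlation(model_q, obs_q, model_dd, obs_dd):
    """Correlation between the humidity bias and the delta-D bias.

    Each argument holds one value per grid point (unweighted, an assumption
    made for simplicity).
    """
    q_bias = [m - o for m, o in zip(model_q, obs_q)]
    dd_bias = [m - o for m, o in zip(model_dd, obs_dd)]
    return pearson(q_bias, dd_bias)
```

 A value near zero, as found above, indicates that grid points with a large humidity bias are not systematically those with a large δD bias.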
 To summarize this section, compared to satellite data sets, LMDZ successfully reproduces spatial patterns of δD in the lower troposphere, but underestimates meridional gradients and contrasts between dynamical regimes in the upper troposphere. The most robust model-data difference is the underestimated δD seasonality in the subtropics and midlatitudes, which occurs at all levels in the vast majority of data sets.
6. Common Model-Data Differences Among Models
 Having established the robust features and biases in LMDZ using all data sets, we compare the spatiotemporal distribution of δD simulated by a set of 7 GCMs from the SWING2 project. The goal is threefold: (1) identify model-data differences that are robust across models; (2) assess the isotopic spread among models to see to what extent water isotopic measurements could help discriminate between models in their representation of processes controlling humidity; and (3) explore the link between the spread in δD and that in humidity, to evaluate the added value of isotopic measurements.
6.1. Meridional Gradient
Figures 8a, 8d, 8g, and 8j show the meridional gradients simulated by the different SWING2 models. At the surface, all models give similar results within 20‰ between 60°S and 60°N. The spread in δD increases with altitude, reaching more than 200‰ at 250 hPa. The smaller spread at the surface may be explained by the fact that δD variations at the surface are partially damped by oceanic evaporation, whose δD is predicted by the same simple equation [Craig and Gordon, 1965] in all models.
 The spread in absolute δD values between models is not significantly related to that in q (Figures 9a and 9b). Although all models have a moist bias, models with the strongest moist bias are not those with the highest δD, showing again that isotopic composition provides additional information compared to q.
 Even after subtracting the tropical average (Figures 8b, 8e, 8h, and 8k), models show a very wide spread in meridional gradients. The difference between δD at the equator and at 60°S at 350 hPa varies from 60‰ in GISS to 220‰ in CAM. The models disagree on the sign of the gradient at 250 hPa. We showed in section 5.2 that LMDZ underestimated the meridional gradient compared to all remote-sensing data sets, and Figure 8 shows that it is quite typical of the set of models. Models with the strongest meridional gradients in δD are not the models with the strongest gradient in humidity (Figures 9c and 9d).
6.2. Seasonality
 The simulated JJA-DJF differences in the SWING2 models agree quite well with each other at the surface, but again the spread increases with altitude (Figures 8c, 8f, 8i, and 8l). In the subtropics, at all free tropospheric levels, half of the models have higher δD values in summer, while the other half have higher δD values in winter. We showed in section 5.3 that the control simulation with LMDZ underestimated the seasonality in the subtropics at all levels compared to almost all data sets by at least 50%. However, compared to SWING2 models, LMDZ has amongst the strongest seasonality. Therefore, all models are affected by this underestimated δD seasonality, and several even get the wrong sign. This may reveal a problem in the representation of humidity processes common to all GCMs. P2 suggests that all models overestimate vertical mixing.
 In the lower troposphere, models with the strongest seasonality in δD are not those with the strongest seasonality in q, again showing the added value of isotopic measurements to reveal GCM shortcomings. In the upper troposphere in contrast, models with the strongest seasonality in δD also have the strongest seasonality in q, suggesting that part of the isotopic behavior could be explained by that of q.
6.3. Spatial Patterns
Figure 10 summarizes the spatial patterns in the tropics by classifying the model outputs based on ω500. The data are also shown for reference but not for direct comparison, since model outputs were neither collocated nor kernel-weighted. Models show a very large spread in δD contrasts between convective and dry regions, both quantitatively and qualitatively. For example, at one extreme, at all levels, Had-AM simulates a strong increase in δD as ω500 decreases, with maximum δD in convective regions. At the other extreme, at all levels, GSM simulates a strong decrease in δD as ω500 decreases from 20 hPa/day to below −70 hPa/day, with a pronounced δD minimum in convective regions. In some models like ECHAM, δD in convective regions is lower in the lower troposphere but higher in the upper troposphere than in subsidence regions. The behavior of LMDZ is within the model spread.
7. Conclusions and Perspectives
 We have evaluated the control version of the LMDZ GCM using a number of water vapor isotopic data sets from satellite, ground-based remote-sensing and in situ techniques. This is the first time that so many data sets have been brought together and compared, and that an isotopic GCM has been so comprehensively evaluated for its water vapor using such a rigorous model-data comparison methodology. The different data sets show some consistent features: at all levels, δD decreases with latitude, and at all levels, δD is higher during summer in the subtropics and in midlatitudes. There are however significant differences between data sets regarding absolute values of δD and the magnitude of meridional gradients and seasonality. We show that some of these differences can be explained by (1) different averaging kernels and a priori values used in the retrievals of remote-sensing data sets, and (2) spatial and temporal sampling. A rigorous model-data comparison needs to take into account both these effects. We also show that systematic biases in the data are likely the major source of δD difference between some data sets. The use of several data sets is therefore necessary to ensure the robustness of our conclusions. The lack of absolute calibration of remote-sensing data (e.g. due to the lack of aircraft validation campaigns for HDO) and possible discrepancies in the calibration of some in situ data sets further restrict our analysis to spatiotemporal variations.
 The simulated spatiotemporal distribution of δD shows strengths and weaknesses that are consistent across the different data sets used for the comparison. At the surface, in the lower and mid-troposphere, the simulation reproduces the observations reasonably well. In the upper troposphere, model-data differences are much larger, although the discrepancies between the MIPAS and ACE data sets prevent us from drawing conclusions about the magnitude of the model biases. The most consistent weakness of LMDZ is the underestimated seasonality at all levels in the subtropics and midlatitudes of both hemispheres, compared to almost all data sets. Also, compared to all remote-sensing data sets, the model underestimates the meridional gradient of δD and the contrast in δD between convective and dry tropical regions. These biases in δD are not linked with those in humidity, which confirms that isotopic measurements provide additional information that can be used to expose model biases.
 Some of the problems that we have exposed in LMDZ are common to all SWING2 models. All models underestimate the seasonality in the subtropics at all levels, underestimate the meridional gradient compared to satellite data sets, and underestimate the δD contrast between convective and subsidence regions in the upper troposphere. However, the spread between models is very large, both quantitatively and qualitatively. For example, models disagree on the sign of the δD contrast between convective and dry regions at all tropospheric levels. These differences between models in δD are not obviously related to those in specific humidity, suggesting that isotopic measurements provide additional information compared to humidity measurements. In P2, these inter-model differences will be interpreted in terms of physical processes and will be used to expose the causes of biases in models.
 Unresolved differences between observed data sets (e.g. near-IR ground-based FTIR and SCIAMACHY or ACE and MIPAS) highlight the need for absolute calibration and improved measurement error characterization. While this is currently very difficult to achieve in practice, such a validation and calibration activity would enable GCM evaluation to be extended to include assessment of absolute values of δD, in addition to the spatiotemporal variability considered in this study. Our analysis shows that the upper troposphere is where GCMs disagree the most, so improving the calibration and frequency of measurements in this region would be particularly valuable.
Appendix A: Bias Correction in TES
 As discussed by Worden et al., TES data are known to be affected by a slight spectroscopic bias. Recent calibration studies against in situ measurements at Mauna Loa [Worden et al., 2010] suggest that this bias may be corrected as follows:
where xci and xdi are the corrected HDO mixing ratio and the HDO mixing ratio provided in the data files, at level i, respectively; Axxik is the averaging kernel for HDO and n is the number of TES levels.
Appendix B: Error Estimates for the Different Data Sets
 Error estimates are either provided as part of the data sets, or calculated as a function of various retrieved quantities. When error estimates are given for H2O and HDO independently and unless stated otherwise, we calculate the δD error based on the standard propagation of errors, as if the retrieval errors were independent:
where EδD is the error on the δD retrieval, qobs and xobs are the observed H2O and HDO mixing ratios at a given level, respectively, and qerr and xerr are the errors on the observed H2O and HDO mixing ratios, respectively. The same applies for total column amounts instead of mixing ratios. In practice, retrieval errors are positively correlated, so this estimate represents an upper bound for the error in δD.
 When n samples are averaged, the random part of the error, when known, is divided by √n.
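 The two rules above can be illustrated with a short sketch. It assumes that δD is defined as (x/q)/Rstd − 1, expressed in ‰, with Rstd the HDO/H2O ratio of VSMOW, so that independent relative errors on q and x add in quadrature; the exact expression used for each data set may differ (e.g. if HDO is already normalized by the standard ratio). Function names are hypothetical.

```python
import math

# HDO/H2O ratio of VSMOW, i.e. twice the D/H ratio of 155.76e-6.
R_STD = 3.1152e-4


def delta_d(q_obs, x_obs, r_std=R_STD):
    """delta-D in permil from H2O (q) and HDO (x) amounts."""
    return (x_obs / q_obs / r_std - 1.0) * 1000.0


def delta_d_error(q_obs, x_obs, q_err, x_err, r_std=R_STD):
    """delta-D error (permil), treating the errors on q and x as independent.

    Assumption: first-order propagation through delta-D = (x/q)/r_std - 1.
    """
    rel = math.sqrt((q_err / q_obs) ** 2 + (x_err / x_obs) ** 2)
    return (delta_d(q_obs, x_obs, r_std) + 1000.0) * rel


def error_of_mean(random_error, n):
    """Random error on the average of n samples."""
    return random_error / math.sqrt(n)
```

 For instance, relative errors of 3% on H2O and 4% on HDO at δD = 0‰ give an error of about 50‰ on a single retrieval, which would fall to 10‰ for an average of 25 samples if the error were purely random.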
B1. SCIAMACHY Data
 The total error on observed δD, EδD, is provided for each measurement as part of the data set. In zonal mean, annual average, the average error ranges from 50‰ in the tropics to 130‰ in midlatitudes (about 60°). If this error were random, the error on the zonal means would be reduced to about 2‰ in the tropics and up to 30‰ in midlatitudes. However, no detailed error budget is available for this data set, so the random part of the error is unknown. Systematic biases are not documented.
B2. TES Data
 In the TES data, the error on δD is smaller than that calculated from the errors on H2O and HDO contents independently, due to the benefits of the joint retrieval [Worden et al., 2006]. The TES retrievals provide data to calculate separately the random and non-random parts of the δD error. In the F04 version of the TES data, the total error on δD averaged over n samples at level j, EδD, is calculated as:
where Cnon−rand,i and Crand,i are the non-random and random parts of the H2O/HDO error covariance respectively for retrieval i and where Ri is the retrieved H2O/HDO ratio for retrieval i at level j. Cnon−rand,i and Crand,i are taken as the j-th diagonal terms of the corresponding error covariance matrices (named “HDO_H2ORatioMeasurementErrorCovariance” and “HDO_H2ORatioSystematicErrorCovariance” in the data files).
 We find that the precision is about 10–15‰ for individual measurements but is reduced down to 1–2‰ when calculating zonal averages.
B3. ACE Data
EδD was calculated for each profile following the standard propagation of errors described earlier. As for SCIAMACHY, the random part of the error is unknown. In zonal, annual means, the errors range from about 20‰ in the tropics to about 50‰ in midlatitudes if none of the error is random, and are between 1 and 2‰ if all of the error is random.
B4. MIPAS Data
 As for TES, the error on δD is smaller than that calculated from the errors on H2O and HDO contents independently, due to the benefits of the joint retrieval [Steinwagner et al., 2007]. The different terms of the total error are given by Steinwagner et al. Data files include random errors due to measurement noise only, which we consider in the following. The general formulation for the δD error for a given profile at level i, EδD, is:
where qobs and xobs are the observed H2O and HDO mixing ratio at level i respectively, qerr and xerr are the errors on the observed H2O and HDO mixing ratio respectively, and ri is calculated as:
where SHDO/H2O, SHDO/HDO and SH2O/H2O are the i-th diagonal terms of the covariance matrices for HDO/H2O, HDO/HDO and H2O/H2O respectively, provided for each profile. In practice, ri was found to be of the order of 10⁻². We thus assumed ri = 0 to avoid manipulating very large volumes of data for little benefit.
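 To see why neglecting ri has little effect, one can use the textbook expressions for a correlation coefficient and for the relative error of a ratio of two correlated quantities. The sketch below is an assumption about the general form involved, not a transcription of the formulation used for MIPAS, and the function names are hypothetical.

```python
import math


def correlation_coefficient(s_xq, s_xx, s_qq):
    """Correlation between HDO and H2O errors from (co)variance terms."""
    return s_xq / math.sqrt(s_xx * s_qq)


def relative_ratio_error(q_obs, x_obs, q_err, x_err, r=0.0):
    """Relative error on x/q for errors with correlation coefficient r.

    Assumption: first-order propagation; r = 0 recovers the independent case.
    """
    a, b = x_err / x_obs, q_err / q_obs
    return math.sqrt(max(a * a + b * b - 2.0 * r * a * b, 0.0))
```

 With relative errors of 4% and 3%, moving from r = 0 to r = 0.01 changes the result by less than 1% of its value, which supports setting ri = 0 when ri is of the order of 10⁻².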
 The measurement noise errors for individual measurements are about 60‰ in the tropics and 150‰ in midlatitudes. Once this measurement noise error is divided by the square root of the number of measurements, the error for zonal mean δD is between 1 and 3‰. Parameter errors (errors in the imperfectly known forward model parameters used in the retrieval) are of the order of 100‰ in the upper troposphere [Steinwagner et al., 2007].
B5. Systematic Bias in MIPAS Related to Clouds
 In the absence of clouds, MIPAS observes from 70 km tangent altitude down to about 6 km tangent altitude with 3-km steps. In the presence of clouds, the spectra for the cloud-contaminated altitudes are discarded. However, propagation of systematic errors (e.g. due to undetected clouds) localized at lower altitudes (i.e. 6 km) may lead to systematic errors in δD profiles at higher altitudes [Steinwagner et al., 2010].
 To estimate this effect, all scans which go down to the tangent altitude of 9 km were selected and test retrievals of these scans were performed omitting artificially the lowermost tangent height (i.e. 9 km). These test retrievals were then compared to the original results. This comparison shows that the retrievals from measurements which start at an altitude of 12 km instead of 9 km are biased high by 50 to 100‰.
 As an upper bound estimate of the impact of this systematic error on the observed δD distributions, we tried subtracting 100‰ from profiles that go down to 12 km or higher, 50‰ from profiles that go down to 9 km, and leaving the δD unchanged for profiles that go down to 6 km. At 200 hPa, convective regions become slightly more depleted by about 15‰ in annual mean, and the δD seasonality in the subtropics becomes less pronounced by up to 30‰. This impact is much smaller than the model-data differences that we look at, and thus is unlikely to affect our conclusions.
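 The sensitivity test described above reduces to an altitude-dependent offset, which can be sketched as follows. The function names are hypothetical, and the handling of lowest tangent altitudes between the nominal 3-km steps (treated with "at or above" thresholds) is an assumption.

```python
def cloud_bias_offset(lowest_tangent_km):
    """Offset (permil) subtracted from a profile in the upper-bound test.

    100 permil for scans reaching down only to 12 km or higher,
    50 permil for scans reaching 9 km, nothing for scans reaching 6 km.
    """
    if lowest_tangent_km >= 12.0:
        return 100.0
    if lowest_tangent_km >= 9.0:
        return 50.0
    return 0.0


def corrected_delta_d(delta_d, lowest_tangent_km):
    """delta-D (permil) after removing the assumed cloud-related high bias."""
    return delta_d - cloud_bias_offset(lowest_tangent_km)
```

 Recomputing the maps and seasonal cycles from the corrected profiles and comparing them with the originals gives the impact quoted above.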
B6. Ground-Based FTIR at MUSICA Sites
 The MUSICA profiles and column-integrated amounts are produced by an H2O and δD optimal estimation approach. For H2O, the ground-based NDACC FTIR systems are mainly sensitive up to about 10–15 km. The vertical resolution (full width at half maximum of the kernels) is about 3 km in the lower troposphere, 6 km in the middle troposphere, and 10 km in the upper troposphere. The δD retrieval is mainly sensitive in the first 10 km above the surface, and its vertical resolution is 3 km in the lower troposphere and 10 km in the middle/upper troposphere.
 Measurement noise, uncertainties in the alignment of the instrument, detector non-linearities, uncertainties in the applied atmospheric temperature profiles, and uncertainties in the applied spectroscopic parameters are considered as the error sources. The propagation of these sources is estimated by a full treatment. Two retrievals are performed: a first with a correct parameter and a second with an erroneous parameter (e.g. 2 K increased lower tropospheric temperature, application of a 1% higher H2O line strength parameter, etc.). The systematic and the random error are then given by the mean and the standard deviation of the difference between the two retrievals [Schneider et al., 2010a]. For the ground-based MUSICA v101220_Ca.0 data used in this study, uncertainties in the applied atmospheric temperature profiles and the applied spectroscopic parameters are the leading random error sources, whereby uncertainties of the instrument's alignment are of secondary importance. The δD random error can reach 5‰ for column-integrated data and 15–25‰ for profiles. Systematic errors are dominated by uncertainties in the applied spectroscopic parameters. They can be 10‰ for column-integrated δD and 25–50‰ for δD profiles.
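 The error estimate from paired retrievals can be sketched in a few lines. The sketch assumes that each list holds the retrieved δD for the same set of measurements, once with the correct parameter and once with the perturbed parameter; the function name and the sample numbers are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption.

```python
from statistics import mean, stdev


def perturbation_errors(reference_retrievals, perturbed_retrievals):
    """Return (systematic, random) error from paired retrievals.

    systematic = mean of (perturbed - reference)
    random     = standard deviation of (perturbed - reference)
    """
    if len(reference_retrievals) != len(perturbed_retrievals):
        raise ValueError("retrieval lists must have the same length")
    differences = [
        perturbed - reference
        for reference, perturbed in zip(reference_retrievals, perturbed_retrievals)
    ]
    return mean(differences), stdev(differences)


# Hypothetical delta-D values (permil) for four measurements.
reference = [-200.0, -180.0, -220.0, -210.0]
perturbed = [-190.0, -172.0, -208.0, -202.0]
systematic, random_error = perturbation_errors(reference, perturbed)
print(systematic, random_error)
```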
B7. Ground-Based FTIR at TCCON Sites
 The estimated measurement repeatability (1-σ) was calculated for each profile following the standard propagation of errors as described earlier. Annual mean measurement repeatability varies between 5‰ and 22‰ depending on the site. This is the random part of the error. The absolute calibration error cannot be readily estimated.
Appendix C: Applying Averaging Kernels to the Model Outputs
 The averaging kernel matrix defines the sensitivity of the retrieval at each level to the true state at each level. For a fair model-data comparison, it is necessary to take into account this sensitivity. This is done by applying to the model output the same averaging kernels as those calculated as part of the retrieval process.
 Let q and x be the volume mixing ratios of H2O and HDO respectively. Subscripts denote the values simulated by LMDZ and interpolated on the TES retrieval grid (s), prescribed as the a-priori profile (p), or that would be measured by TES if TES were flying in LMDZ above an atmosphere similar to that predicted by the model (m). Then at level i, qmi is calculated as [Worden et al., 2006]:
where Aqq is the averaging kernel for H2O provided in the TES data set.
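 As an illustration of this step, the sketch below applies an averaging kernel to a simulated profile in logarithmic space, which is the usual form for TES-type retrievals. The exact expression is an assumption here, namely ln(qm,i) = ln(qp,i) + Σj Aqq,ij [ln(qs,j) − ln(qp,j)]; the function name and the kernel layout (a list of rows) are hypothetical.

```python
import math


def apply_log_kernel(simulated, a_priori, kernel):
    """Smooth a simulated profile with an averaging kernel in log space.

    Assumed form: ln(q_m[i]) = ln(q_p[i]) + sum_j A[i][j] * (ln(q_s[j]) - ln(q_p[j])).
    All profiles are given on the retrieval grid and must be strictly positive.
    """
    n_levels = len(simulated)
    log_departure = [
        math.log(simulated[j]) - math.log(a_priori[j]) for j in range(n_levels)
    ]
    smoothed = []
    for i in range(n_levels):
        weighted = sum(kernel[i][j] * log_departure[j] for j in range(n_levels))
        smoothed.append(a_priori[i] * math.exp(weighted))
    return smoothed


# Hypothetical three-level mixing ratio profiles.
q_simulated = [1.0e-2, 4.0e-3, 5.0e-4]
q_a_priori = [1.2e-2, 3.0e-3, 6.0e-4]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(apply_log_kernel(q_simulated, q_a_priori, identity))
```

 Two limiting cases make the role of the kernel explicit: with an identity kernel the result is the simulated profile, and with a kernel of zeros (no sensitivity) the result is the a-priori profile.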
 The HDO mixing ratio is calculated similarly but involves cross terms due to the fact that H2O and HDO are jointly retrieved. The isotopic mixing ratio R = x/q is thus calculated as:
where Axx is the averaging kernel for HDO and Axq and Aqx are the cross kernels.
 Averaging kernels depend on surface temperature and atmospheric state, including the presence of clouds [Lee et al., 2011]. However, since the averaging kernels for individual profiles are computationally voluminous, and because the atmospheric conditions associated with a particular kernel in TES may be different in LMDZ despite the nudging, we did not attempt to use individual averaging kernels for each profile. Instead, we averaged averaging kernels for each month and each LMDZ grid box. The root mean square error between monthly mean model outputs transformed through individual kernels and model outputs transformed with monthly mean kernels is about 6‰ on average between 45°S–45°N. Therefore, using monthly mean kernels, and thus neglecting the day-to-day variability in the kernels, is a reasonable simplification.
 Another possible problem in comparisons is that differences in cloud properties as observed by TES and as simulated by LMDZ can also contribute to some uncertainty in the exact kernels to use. To quantify this effect, we made a test in which we applied to the model output the monthly mean kernels that were calculated after eliminating 30% of the cloudiest scenes in TES. The root mean square difference between monthly mean model outputs transformed by all-sky kernels and monthly mean model outputs transformed by clear-sky kernels is about 6‰ on average between 45°S–45°N. Therefore, the sources of model-data difference associated with kernel convolution are much smaller than the isotopic signals we look at in this paper. However, to examine smaller signals, for example to compare isotopic signatures associated with clear-sky and different cloud conditions, a better account of cloud effects will be needed. This could be achieved, for instance, by taking advantage of the CALIPSO cloud data set [Winker et al., 2007], which can be collocated with TES, and whose observations can be emulated by the model [Chepfer et al., 2008]. Incorporating this type of calculation in the present set of comparisons is beyond the scope of this paper.
 For MIPAS, the convolution is similar to equations (C1) and (C2) except that logarithms are not used and there are no cross terms. As for TES, we did not use individual averaging kernels for each profile. However, because averaged averaging kernels may lead to a broadening of the kernels, we used representative averaging kernels instead, as described below.
 The main source of kernel variability in MIPAS is the tropopause height. We binned individual profiles based on their tropopause height from 7 km to 17 km by bins of 0.5 km. The tropopause height was defined following the World Meteorological Organization as the lower bound of a layer in which the lapse rate is lower than 2 K/km, provided that this layer is at least 2 km thick. Then for each bin, we calculated an average profile of the diagonal elements of the kernels, and we selected as the most “representative” kernel the one whose diagonal elements are closest to this average profile. This calculation was performed on 216 profiles collected during three days representative of different seasons. We checked on these 216 samples that using representative kernels rather than the individual kernels leads to an average error of about 50‰ only on the kernel-weighted δD at 200 hPa, which is within the measurement uncertainty.
 These representative averaging kernels were then applied to LMDZ output depending on the observed tropopause height for the time and location of each measurement. Due to systematically higher tropopause in LMDZ, we chose to select the appropriate representative averaging kernels depending on observed, and not simulated, tropopause height. This allows us to focus on the biases related to the isotopic composition rather than biases due to the tropopause height.
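 The tropopause-based selection of kernels can be sketched as follows. The sketch implements the simplified tropopause criterion stated above (a layer at least 2 km thick in which the lapse rate stays below 2 K/km), the 0.5-km binning between 7 and 17 km, and a selection of the kernel whose diagonal is closest to the bin-average diagonal. The function names are hypothetical, and the squared-difference distance used for the selection is an assumption.

```python
def tropopause_height_km(heights_km, temperatures_k,
                         max_lapse_rate=2.0, min_thickness_km=2.0):
    """Lowest level above which the lapse rate stays below the threshold
    over a layer at least min_thickness_km thick (heights strictly increasing)."""
    n_levels = len(heights_km)
    for base in range(n_levels - 1):
        top = base
        while top + 1 < n_levels:
            dz = heights_km[top + 1] - heights_km[top]
            lapse_rate = -(temperatures_k[top + 1] - temperatures_k[top]) / dz
            if lapse_rate >= max_lapse_rate:
                break
            top += 1
            if heights_km[top] - heights_km[base] >= min_thickness_km:
                return heights_km[base]
    return None


def tropopause_bin(height_km, lowest_km=7.0, highest_km=17.0, width_km=0.5):
    """Index of the 0.5-km bin containing height_km, or None if out of range."""
    if height_km is None or not (lowest_km <= height_km < highest_km):
        return None
    return int((height_km - lowest_km) // width_km)


def most_representative(kernel_diagonals):
    """Index of the kernel whose diagonal is closest (least squares, an
    assumed metric) to the mean diagonal of all kernels in the bin."""
    n_kernels = len(kernel_diagonals)
    n_levels = len(kernel_diagonals[0])
    mean_diagonal = [
        sum(diag[k] for diag in kernel_diagonals) / n_kernels
        for k in range(n_levels)
    ]

    def distance(diag):
        return sum((diag[k] - mean_diagonal[k]) ** 2 for k in range(n_levels))

    return min(range(n_kernels), key=lambda idx: distance(kernel_diagonals[idx]))


# Idealized profile: 6.5 K/km lapse rate up to 12 km, isothermal above.
heights = [float(z) for z in range(21)]
temperatures = [288.0 - 6.5 * min(z, 12.0) for z in heights]
height = tropopause_height_km(heights, temperatures)
print(height, tropopause_bin(height))
```

 In practice the search would probably need to be restricted to levels above the boundary layer, since a thick near-surface inversion also satisfies this simplified criterion; here the 7–17 km binning discards such cases.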
C3. Ground-Based FTIR at MUSICA Sites
 For the ground-based FTIR data produced in the framework of MUSICA, the convolution is the same as for TES except that the cross terms involve the H218O and H217O mixing ratios as well. Since the H217O distribution simulated by a GCM has not yet been evaluated, we calculate the H217O mixing ratio from the H218O mixing ratio, assuming a 17O-excess of 20 permeg (consistent with the orders of magnitude given by Landais et al. and Luz and Barkan). The effect of this assumption on kernel-weighted δD profiles can be neglected.
C4. Ground-Based FTIR at TCCON Sites
 For the ground-based FTIR at TCCON sites, the convolution transforms simulated profiles of specific humidity into total column water mass (Q) that would be observed by the instrument [Rodgers and Connor, 2003]:
where Ak is the column averaging kernel profile at level k, qsk and qpk are respectively the simulated and a priori specific humidity at level k, ΔPk is the level thickness in the retrieval grid and g is the acceleration due to gravity. After applying a similar equation for HDO, total column δD is finally calculated.
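 A minimal sketch of this column convolution is given below. It assumes the usual column form Q = Σk [qpk + Ak (qsk − qpk)] ΔPk / g, built from the quantities defined above; this expression, the function name, the constant value of g and the sample profiles are assumptions made for illustration.

```python
GRAVITY = 9.81  # m s-2, assumed constant with height


def kernel_weighted_column(q_simulated, q_a_priori, column_kernel,
                           layer_thickness_pa, gravity=GRAVITY):
    """Total column water mass (kg m-2) that the instrument would report.

    Assumed form: Q = sum_k (q_p[k] + A[k] * (q_s[k] - q_p[k])) * dP[k] / g,
    with specific humidities in kg/kg and layer thicknesses in Pa.
    """
    return sum(
        (q_p + a_k * (q_s - q_p)) * d_p / gravity
        for q_s, q_p, a_k, d_p in zip(
            q_simulated, q_a_priori, column_kernel, layer_thickness_pa
        )
    )


# Hypothetical two-layer atmosphere.
q_s = [0.010, 0.004]
q_p = [0.012, 0.003]
d_p = [50000.0, 50000.0]
print(kernel_weighted_column(q_s, q_p, [1.0, 1.0], d_p))
print(kernel_weighted_column(q_s, q_p, [0.0, 0.0], d_p))
```

 With a column kernel of ones the result is the simulated column, and with a kernel of zeros it is the a priori column, which is the behavior expected from a smoothing operator of this kind.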
 A priori profiles are provided for every day. Once again, individual kernels are computationally voluminous. Since averaging kernels depend mainly on the solar zenith angle, we calculated a set of representative averaging kernels for different bins of solar zenith angles. To assess the error on the kernel-weighted δD resulting from using these representative averaging kernels rather than the individual kernels for each measurement, we tried using individual averaging kernels at the Lauder site over 2004–2007 (40% of all measurements). The difference between δD transformed by individual kernels and with representative kernels is lower than 3‰. This confirms that using these representative averaging kernels is a good approximation.
Appendix D: Using LMDZ to Quantify Sources of Differences Between Data Sets
Figure 1 shows large differences in δD between the different data sets. These differences can be explained by (1) spatiotemporal sampling, (2) instrument sensitivity, and (3) errors in each data set. These three sources of differences are difficult to quantify directly since the data coverage is insufficient to quantify the effect of spatiotemporal sampling, and since vertical profiles through the troposphere are not available to explore the effect of instrument sensitivity. Therefore, we use LMDZ to quantify these three sources of differences between data sets.
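 Hereafter, we consider averages over a given spatiotemporal domain. The average δD observed by instrument i, noted δDi,obs,data, can be expressed as:
$$\delta D_{i,\mathrm{obs,data}} = \delta D_{\mathrm{real,data}} + \Delta\delta D_{i,\mathrm{colloc,data}} + \Delta\delta D_{i,\mathrm{convol,data}} + \Delta\delta D_{i,\mathrm{error,data}}$$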
where δDreal,data is the real average δD, which is independent of the instrument and will never be exactly known, ΔδDi,colloc,data is the effect of spatiotemporal sampling, which can be taken into account in the model by collocation, ΔδDi,convol,data is the effect of instrument sensitivity, which can be taken into account in the model by kernel convolution, and ΔδDi,error,data are all the errors (e.g. spectroscopic) affecting the measurement.
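 Similarly, in the model,
$$\delta D_{i,\mathrm{obs,model}} = \delta D_{\mathrm{real,model}} + \Delta\delta D_{i,\mathrm{colloc,model}} + \Delta\delta D_{i,\mathrm{convol,model}}$$
with
$$\Delta\delta D_{i,\mathrm{colloc,model}} = \delta D_{i,\mathrm{colloc,model}} - \delta D_{\mathrm{real,model}}$$
and
$$\Delta\delta D_{i,\mathrm{convol,model}} = \delta D_{i,\mathrm{obs,model}} - \delta D_{i,\mathrm{colloc,model}}$$
where δDreal,model is the average raw simulated δD, δDi,obs,model is the average simulated δD after both collocation and convolution and δDi,colloc,model is the average simulated δD after collocation only.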
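 We assume that LMDZ reproduces spatiotemporal δD patterns sufficiently well to predict correctly the effects of spatiotemporal sampling and instrument sensitivity:
$$\Delta\delta D_{i,\mathrm{colloc,data}} = \Delta\delta D_{i,\mathrm{colloc,model}} + \epsilon_{i,\mathrm{colloc}}$$
$$\Delta\delta D_{i,\mathrm{convol,data}} = \Delta\delta D_{i,\mathrm{convol,model}} + \epsilon_{i,\mathrm{convol}}$$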
where ϵi,colloc and ϵi,convol are possible effects (hopefully small) of problems in the simulated δD patterns and of sub-daily and sub-grid sampling effects not resolved by our collocation. Their sum is noted ϵi.
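 The difference between two data sets i and j can thus be decomposed as:
$$\delta D_{i,\mathrm{obs,data}} - \delta D_{j,\mathrm{obs,data}} = \left(\Delta\delta D_{i,\mathrm{colloc,model}} - \Delta\delta D_{j,\mathrm{colloc,model}}\right) + \left(\Delta\delta D_{i,\mathrm{convol,model}} - \Delta\delta D_{j,\mathrm{convol,model}}\right) + \left(\Delta\delta D_{i,\mathrm{error,data}} - \Delta\delta D_{j,\mathrm{error,data}} + \epsilon_i - \epsilon_j\right)$$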
The first term on the right hand side is the effect of spatiotemporal sampling and the second is the effect of instrument sensitivity. These two terms are calculated from LMDZ outputs. The third term, calculated as a residual, combines errors in each data set, possible problems in simulated δD patterns, and sub-daily and sub-grid sampling effects.
 These terms are evaluated in Table 6. In the table headers and in section 4.2, for brevity we note ΔδDobs = δDi,obs,data − δDj,obs,data, ΔδDcolloc = ΔδDi,colloc,model − ΔδDj,colloc,model, ΔδDconvol = ΔδDi,convol,model − ΔδDj,convol,model and ΔδDerrors = ΔδDi,error,data − ΔδDj,error,data + ϵi − ϵj.
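 The decomposition can be illustrated with a short sketch that follows the definitions of ΔδDobs, ΔδDcolloc, ΔδDconvol and ΔδDerrors given above, with the last term computed as a residual. The class and function names are hypothetical, and the numbers in the example are invented for illustration only; they are not values from Table 6.

```python
from dataclasses import dataclass


@dataclass
class ModelAverages:
    """Simulated domain-average delta-D (permil) for one instrument."""
    collocated: float  # after collocation only
    convolved: float   # after collocation and kernel convolution


def decompose_difference(obs_i, obs_j, model_raw, model_i, model_j):
    """Split the observed difference between data sets i and j into
    sampling, sensitivity and residual (errors) contributions."""
    colloc_i = model_i.collocated - model_raw
    colloc_j = model_j.collocated - model_raw
    convol_i = model_i.convolved - model_i.collocated
    convol_j = model_j.convolved - model_j.collocated

    delta_obs = obs_i - obs_j
    delta_colloc = colloc_i - colloc_j
    delta_convol = convol_i - convol_j
    # Residual: data errors plus whatever the model fails to capture.
    delta_errors = delta_obs - delta_colloc - delta_convol
    return {
        "obs": delta_obs,
        "colloc": delta_colloc,
        "convol": delta_convol,
        "errors": delta_errors,
    }


# Invented example values, in permil.
terms = decompose_difference(
    obs_i=-180.0,
    obs_j=-150.0,
    model_raw=-150.0,
    model_i=ModelAverages(collocated=-160.0, convolved=-170.0),
    model_j=ModelAverages(collocated=-140.0, convolved=-145.0),
)
print(terms)
```

 Because the raw simulated average is common to both instruments, it cancels in the sampling term; it is kept as an argument only to mirror the definitions.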
Acknowledgments
 The ACE mission is supported mainly by the Canadian Space Agency. Level-1b data of MIPAS have been provided by ESA. U.S. funding for TCCON comes from NASA's Terrestrial Ecology Program, the Orbiting Carbon Observatory project and the DOE/ARM Program. The Lauder TCCON measurements are funded by New Zealand Foundation for Research, Science and Technology contracts CO1X0204 and CO1X0406. We thank J. Robinson, who acquires the FTS data at the Lauder site, and B. Connor, who was instrumental in setting up the Lauder TCCON measurements. TCCON measurements at Wollongong and Darwin are supported by Australian Research Council grant DP0879468. The Karlsruhe FTIR experiment has been funded by the Federal German Ministry of Education and Research (BMBF) via its program “Ausbau der wissenschaftlichen Infrastruktur für die Klima-Initiative (HALO)”. IMK-ASF would like to thank U. Raffalski, IRF, Kiruna, for assistance with the FTIR experiment in Kiruna. Research at the University of Liège has primarily been supported by the A3C project funded by the Belgian Science Policy Office (BELSPO, Brussels). Emmanuel Mahieu is Research Associate with the F.R.S.-FNRS. We further acknowledge the International Foundation High Altitude Research Stations Jungfraujoch and Gornergrat (HFSJG, Bern) for supporting the facilities needed to perform the FTIR observations. The Bruker 125HR measurements at Eureka were made at the Polar Environment Atmospheric Research Laboratory (PEARL) by the Canadian Network for the Detection of Atmospheric Change (CANDAC), led by James R. Drummond, and in part by the Canadian Arctic ACE Validation Campaigns.
They were supported by the Atlantic Innovation Fund/Nova Scotia Research Innovation Trust, Canada Foundation for Innovation, Canadian Foundation for Climate and Atmospheric Sciences, Canadian Space Agency, Environment Canada, Government of Canada International Polar Year funding, Natural Sciences and Engineering Research Council, Northern Scientific Training Program, Ontario Innovation Trust, Polar Continental Shelf Program, and Ontario Research Fund. The authors wish to thank Rodica Lindenmaier, Rebecca Batchelor, PEARL site manager Pierre F. Fogal, the CANDAC operators, and the staff at Environment Canada's Eureka weather station for their contributions to data acquisition, and logistical and on-site support. The mid-infrared FTIR retrievals have been performed in the framework of the project MUSICA (http://www.imk-asf.kit.edu/english/musica), which is funded by the European Research Council under the European Community's Seventh Framework Programme (FP7/2007–2013)/ERC grant agreement 256961. We thank the Anderson Group at Harvard University for providing ICOS and Hoxotope in situ aircraft data. We thank all SWING2 members for producing and making available their model outputs. SWING2 was supported by the Isotopic Hydrology Programme at the International Atomic Energy Agency (more information on http://people.su.se/~cstur/SWING2). This work was supported by NASA Energy and Water-cycle Study (07-NEWS07-0020) and NASA Atmospheric Composition program (NNX08AR23G). We thank all reviewers for their fruitful comments.