Quantification of global land evapotranspiration (ET) has long been associated with large uncertainties due to the lack of reference observations. Several recently developed products now provide the capacity to estimate ET at global scales. These products, partly based on observational data, include satellite-based products, land surface model (LSM) simulations, atmospheric reanalysis output, estimates based on empirical upscaling of eddy-covariance flux measurements, and atmospheric water balance datasets. The LandFlux-EVAL project aims to evaluate and compare these newly developed datasets. Additionally, an evaluation of IPCC AR4 global climate model (GCM) simulations is presented, providing an assessment of their capacity to reproduce flux behavior relative to the observations-based products. Though differently constrained with observations, the analyzed reference datasets display similar large-scale ET patterns. ET from the IPCC AR4 simulations was significantly smaller than that from the other products for India (up to 1 mm/d) and parts of eastern South America, and larger in the western USA, Australia and China. The inter-product variance is lower across the IPCC AR4 simulations than across the reference datasets in several regions, which indicates that uncertainties may be underestimated in the IPCC AR4 models due to shared biases of these simulations.
1. Introduction
 Land evapotranspiration (ET) is a common component in the water and energy cycles, and provides a link between the surface and the atmosphere. Accurate global-scale estimates of ET are critical for better understanding climate and hydrological interactions. Local scale ET observations are available from the FLUXNET project [Baldocchi et al., 2001]. However, dense global coverage by such point measurements is not feasible and the representativeness of point-scale in-situ measurements for larger areas is a subject of active research.
 To address this limitation, several alternative global multi-year ET datasets have been derived in recent years. These datasets include satellite-based estimates, land surface models driven with observations-based forcing, reanalysis data products, estimates based on empirical upscaling of point observations, and atmospheric water balance estimates. The LandFlux-EVAL project (see http://www.iac.ethz.ch/url/research/LandFlux-EVAL) aims at evaluating and comparing these currently available ET datasets. The effort forms a key component of the Global Energy and Water Cycle Experiment (GEWEX) LandFlux initiative, a GEWEX Radiation Panel program that seeks to develop a consistent and high-quality global ET dataset for climate studies. Knowledge of the uncertainties in available ET products is a prerequisite for their use in many applications, in particular for the evaluation of climate-change projections [e.g., Boé and Terray, 2008; Seneviratne et al., 2010]. We provide here an analysis of 30 observations-based multi-year global ET datasets for the 1989–1995 time period, focusing on inter-product spread in various river basins. In addition, we analyze ET in 11 coupled atmosphere-ocean-land GCMs from the IPCC Fourth Assessment Report (AR4). A complementary analysis for a three-year period (1993–1995) by Jimenez et al.  focuses on sensible and latent heat fluxes in a subset of twelve satellite-based, LSM and reanalysis datasets.
2. Data and Methods
 The analyzed datasets are subdivided into four categories (Table 1). In the ‘diagnostic datasets’ category, we include datasets that specifically derive ET from combinations of observations or observations-based estimates, together with relatively simple or empirically-derived formulations. The remaining categories provide ET estimates as a byproduct. The second category includes LSM products driven with observations-based surface meteorological data, while the third includes several atmospheric reanalyses. These first three categories are referred to collectively as ‘reference datasets’ in the context of assessing the IPCC AR4 estimates. IPCC AR4 simulations from 11 GCMs form the fourth category. An overview of the datasets can be found in Table 1. For a detailed description, the reader is referred to the auxiliary material.
Table 1 (excerpt, IPCC AR4 category): AR4 simulations (20c3m) from 11 global climate models
 The subdivision of the datasets in the first three categories is somewhat arbitrary, since they are all based to some degree on observations and modeling assumptions. Thus, it cannot be inferred a priori that one category of datasets may be closer to actual ET. In addition, several datasets are not independent, since they use common calibration or forcing datasets, and/or common model assumptions (ET parametrization).
 The analyses are performed for the common period 1989–1995. The interquartile ranges (IQR) and standard deviations presented below are calculated within the categories (see Table 1), giving each dataset equal weight. Only land pixels that are common to all datasets (excluding Greenland and the Sahara, where ET values are generally low) are considered for the analyses.
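The restriction to pixels common to all datasets can be illustrated with a minimal sketch. It assumes that each dataset is a small nested list on a shared grid and that missing or non-land cells are marked with NaN; the function names `common_mask` and `masked_mean`, the toy values, and the omission of area weighting are illustrative assumptions, not the processing used in this study. Excluding Greenland and the Sahara would additionally require a region mask, which is not shown.

```python
import math

def common_mask(fields):
    """Return a boolean grid that is True where every dataset has a finite value."""
    n_rows, n_cols = len(fields[0]), len(fields[0][0])
    return [
        [all(math.isfinite(f[i][j]) for f in fields) for j in range(n_cols)]
        for i in range(n_rows)
    ]

def masked_mean(field, mask):
    """Unweighted mean over cells where mask is True (no area weighting)."""
    values = [v for row, mrow in zip(field, mask) for v, m in zip(row, mrow) if m]
    return sum(values) / len(values)

NAN = float("nan")
# Hypothetical 2x2 ET fields in mm/d; NaN marks cells a dataset does not cover.
dataset_a = [[1.0, 2.0], [NAN, 4.0]]
dataset_b = [[1.5, NAN], [3.0, 3.0]]

mask = common_mask([dataset_a, dataset_b])
mean_a = masked_mean(dataset_a, mask)  # uses only the two shared cells
mean_b = masked_mean(dataset_b, mask)
```

Applying one shared mask before any statistic is computed ensures that differences between datasets reflect their ET values rather than differences in spatial coverage.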
3. Results and Discussion
3.1. Annual Means and Global Patterns
Figure 1a displays the mean annual land ET values of each analyzed dataset, as well as the means and the standard deviations within each category. The global mean values are around 1.59 ± 0.19 mm/d (46 ± 5 W m−2), close to the reanalysis estimates given by Trenberth et al.  for two different time periods. The standard deviation of the IPCC AR4 simulations (0.16 mm/d or 4.6 W m−2) is lower than those of the reference datasets (standard deviations ranging from 0.17 to 0.19 mm/d or 4.9 to 5.6 W m−2). The standard deviation of the GSWP LSMs (0.12 mm/d or 3.6 W m−2) is smaller still than that of the IPCC AR4 simulations.
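The correspondence between ET expressed in mm/d and the latent heat flux in W m−2 can be sketched as follows. The latent heat of vaporization of 2.5 × 10^6 J/kg and the water density of 1000 kg/m^3 are assumed values; the constants actually used for the conversions quoted here are not stated in the text.

```python
# Assumed constants (not specified in the text).
LATENT_HEAT_VAPORIZATION = 2.5e6  # J/kg
SECONDS_PER_DAY = 86400.0

def mm_per_day_to_w_per_m2(et_mm_per_day):
    """Convert ET in mm/d to latent heat flux in W/m^2.

    1 mm of water over 1 m^2 has a mass of 1 kg (density 1000 kg/m^3),
    so mm/d equals kg m^-2 d^-1.
    """
    return et_mm_per_day * LATENT_HEAT_VAPORIZATION / SECONDS_PER_DAY

def w_per_m2_to_mm_per_day(flux_w_per_m2):
    """Inverse conversion, from W/m^2 to mm/d."""
    return flux_w_per_m2 * SECONDS_PER_DAY / LATENT_HEAT_VAPORIZATION

global_mean_flux = mm_per_day_to_w_per_m2(1.59)  # close to 46 W/m^2
```

With these assumed constants, 1 mm/d corresponds to roughly 29 W m−2.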
 Global patterns of ET for 1989–1995 are displayed in Figures 1b–1p. The mean values of the four categories (first column) reveal high congruence (for example high ET in the tropics, and lower ET in higher latitudes), and nearly no regions with significant differences (5% level, Wilcoxon rank-sum test) are found in respective comparisons with the mean of all reference datasets (third column), except for the IPCC AR4 category. In this category, the ET values compare well with the reference datasets in many regions (Figures 1k–1o), but ET values are significantly lower in the IPCC AR4 simulations in India and South America, and significantly higher in semi-arid regions such as western Australia, western China and the western USA. Overall, the IPCC AR4 simulations appear to underestimate ET gradients within continents (e.g., in North and South America, in Asia north and south of the Himalaya, and in Australia), which could be related to the generally coarse resolution of the models.
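A rank-sum comparison of two groups of datasets at a single grid cell might look like the sketch below. It uses the normal approximation without a tie correction, which is an assumption: the variant of the test used for the figures is not specified in the text, and an exact test may be preferable for small groups. The sample values are hypothetical.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def rank_sum_test(sample_a, sample_b):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no tie correction)."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    w = sum(ranks[:n_a])  # rank sum of the first sample
    expected = n_a * (n_a + n_b + 1) / 2.0
    std = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12.0)
    z = (w - expected) / std
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical ET values (mm/d) at one grid cell for two dataset groups.
gcm_values = [1.0, 1.1, 0.9, 1.2, 1.05, 0.95]
reference_values = [1.8, 2.0, 1.9, 2.1, 1.7, 2.2]

z_score, p_value = rank_sum_test(gcm_values, reference_values)
significant = p_value < 0.05  # 5% level
```

Repeating the test independently at every grid cell yields a map of significant differences; such a map does not account for multiple testing.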
 The relative IQR (IQR divided by the median, second column) of the LSMs is lower than those of the other categories in Australia and in tropical regions, probably because many of the LSMs share a common forcing (GSWP, GLDAS), but higher in, e.g., most of Europe. The IQR of the diagnostic datasets is, compared to the other reference datasets, high in, e.g., Australia, and southern and central Africa, but much smaller in Europe.
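The relative IQR at one grid cell can be sketched as below, with each dataset in a category contributing one value. The quantile interpolation method ("inclusive" in Python's `statistics.quantiles`) is an assumption, since the text does not state how the quartiles were computed, and the input values are hypothetical.

```python
import statistics

def relative_iqr(values):
    """Interquartile range divided by the median of the same values."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / median

# Hypothetical ET values (mm/d) from five datasets at one grid cell.
cell_values = [1.0, 2.0, 3.0, 4.0, 5.0]
spread = relative_iqr(cell_values)  # quartiles 2.0, 3.0, 4.0 give 2/3
```

Because the IQR is normalized by the median, the measure becomes unstable where median ET approaches zero, so very dry cells would need separate handling.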
 The IPCC AR4 simulations display higher inter-model deviations than the reference datasets in semi-arid regions such as Australia, India, South Africa, and parts of the Tibetan plateau (Figures 1l and 1o). Accordingly, the IQR of the models in these regions (Figure 1p) is much higher. On the other hand, some regions show markedly less inter-model spread across the IPCC AR4 simulations than would be expected based on the uncertainties inferred from the reference datasets (e.g., tropical Africa, East Asia, central Europe, eastern USA). Thus, climate models may share common biases in these regions, either related to biases in forcing (precipitation, clouds, radiation) or in the representation of land hydrology.
3.2. Basin-Scale Analysis
 Multi-year ET values of all analyzed datasets are displayed in Figure 2 as the deviation from the reference datasets' mean for selected basins (Mississippi, Amazon, central European basins, Volga, Nile, Changjiang, Murray-Darling). The catchment definitions from Hirschi et al.  are used for the computation (see Figure 2, bottom). Plots for individual seasons (March to May (MAM), June to August (JJA), September to November (SON), and December to February (DJF)) are provided in the auxiliary material. Datasets are sorted into the four categories (separate bars). Additionally, ET estimated from the difference between precipitation (P) derived from the Global Precipitation Climatology Project (GPCP) and runoff (R) from local measurements is shown for multi-year means (ET = P − R is not generally valid for shorter time scales) in the Mississippi, central European, Volga, Changjiang and Murray-Darling basins. The P − R values can be seen as a long-term constraint on ET (indicated with red lines where available), although multi-year anomalies of terrestrial water storage cannot be excluded in some regions. Overall, the P − R values are found to be close to the reference datasets in the Mississippi, central European and Murray-Darling basins.
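The P − R constraint follows from the basin water balance dS/dt = P − ET − R, where S is terrestrial water storage: if the multi-year mean storage change is negligible, ET = P − R. A minimal sketch, with hypothetical basin values and the assumed function name `water_balance_et`, is:

```python
def water_balance_et(precip, runoff, storage_change=0.0):
    """Basin-mean ET (mm/d) from multi-year mean P and R (mm/d).

    storage_change is the mean rate of change of terrestrial water storage
    (mm/d); the default of 0.0 encodes the long-term assumption ET = P - R.
    """
    return precip - runoff - storage_change

def mean(values):
    return sum(values) / len(values)

# Hypothetical annual basin means in mm/d (not taken from GPCP or gauge data).
annual_precip = [2.0, 2.2, 1.8]
annual_runoff = [0.6, 0.7, 0.5]

et_constraint = water_balance_et(mean(annual_precip), mean(annual_runoff))
```

The optional `storage_change` argument makes explicit why the estimate is only valid for multi-year means: on seasonal time scales the storage term is generally not negligible.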
 The absolute intra-category spreads are largest in the Amazon basin, where the highest ET rates occur. The second largest spreads are found in the Murray-Darling basin during SON and DJF, most pronounced in the IPCC AR4 simulations (see auxiliary material). Comparing the four dataset categories, the intra-category spreads are similar. However, the values can differ considerably between basins. In the Changjiang basin for example, the reanalyses and IPCC AR4 simulations display notably higher ET rates than the other dataset categories (up to 0.75 mm/d on average during MAM; see auxiliary material). The intra-category spreads of the IPCC AR4 simulations are much larger than those of the other categories in the semi-arid Nile and Murray-Darling basins. ET is water (precipitation) limited in these regions, and since the calculation of ET in the IPCC AR4 simulations is based on modeled precipitation (as compared to observed precipitation in the case of reference datasets), the high variability of ET may be partly explained by the large uncertainties in modeled precipitation.
 Despite overall similarities of the ET values within these analyzed dataset categories, individual datasets stand out in some regions and seasons. For example, during MAM and in the annual mean, the NCEP reanalysis exhibits above average ET values in the Mississippi, central European, Volga and the Amazon basins. The GFDL IPCC simulation stands out in the Amazon basin during SON (auxiliary material). Note that outliers among the reference datasets are not necessarily erroneous. Indeed, congruence across ET datasets may be induced by the use of common data forcing or model algorithms, rather than the correct representation of ET, as several of the considered products are not independent (see next section).
3.3. Cluster Analysis
 To study the inter-relationship between the individual datasets, a hierarchical cluster analysis of the multi-year mean ET values is performed (Figure 3). The cluster analysis sorts the datasets into groups such that the degree of association between two datasets belonging to the same group is maximal. The criterion used for our analysis is the Euclidean distance between datasets on each land grid cell. Datasets in the same branch of the cluster tree share similar global patterns. The strongest dataset cluster is built by the GSWP simulations (with GS-COLA being the only GSWP model outside the cluster). Most of the IPCC models also form a common branch in the cluster tree. However, the diagnostic datasets and reanalyses are separated into two different main branches of the cluster tree. This indicates that these datasets, although based on observations, exhibit distinct spatial patterns. All the reanalysis datasets are constrained by different exogenous data and some of them are on different main branches of the tree. Note also that simulations using the same model but a different forcing (Mosaic, driven with both GSWP and GLDAS forcing) are separated into two main branches. These findings suggest that forcing can be critical for the resulting ET patterns.
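An agglomerative clustering of this kind can be sketched as follows, with each dataset represented by a vector of multi-year mean ET over the common land cells. Only the Euclidean distance is taken from the description above; the average-linkage rule, the function names, and the four toy datasets are assumptions made for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long vectors of grid-cell values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(fields):
    """Average-linkage clustering; returns merges as (members_a, members_b, distance)."""
    clusters = [(name,) for name in fields]
    merges = []

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        pairs = [euclidean(fields[a], fields[b]) for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical datasets: mean ET (mm/d) on three common land cells.
fields = {
    "lsm_a": [1.0, 2.0, 3.0],
    "lsm_b": [1.1, 2.1, 3.1],
    "gcm_a": [2.0, 2.0, 2.0],
    "gcm_b": [2.1, 1.9, 2.0],
}
merges = agglomerate(fields)  # closest pairs merge first; the last merge joins both groups
```

In this toy case the two "gcm" vectors merge first, then the two "lsm" vectors, and the final merge joins the two branches, mirroring how datasets with similar spatial patterns end up on a common branch of the tree.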
4. Conclusions
 This study provides an overview and evaluation of 41 global land ET datasets for the 1989–1995 time period. Comparing IPCC AR4 GCM simulations with datasets which include some observational information (reference datasets), similarities can be found regarding their global patterns and level of uncertainty (interquartile ranges) in most regions. In their global average, the IPCC AR4 simulations show a smaller spread than the categories and groups that are partly based on observations, except for LSMs from the GSWP, which are driven with common forcing data. In addition, climate models display a narrower inter-model range than the reference datasets in some regions, which may suggest shared biases. However, the uncertainty of the observational datasets prevents evaluation of the magnitude of this bias.
 To reduce uncertainty in ET estimates, besides improving ET models, further collection of ‘ground truth’ observations to validate and force the models continues to be essential, especially in data-poor regions. More refined analyses may allow a reduction of the uncertainty range in observations-based ET products, by identifying whether given outliers can be excluded based on physical considerations [e.g., McCabe et al., 2008]. Such analyses should nonetheless also consider the lack of independence among certain products, which may lead to an underestimation of ET uncertainty. This is well illustrated by the analysis of the GSWP simulations, which, e.g., form a strong cluster in the cluster analysis performed for global ET values of all datasets. Further analyses of the datasets collected as part of the LandFlux-EVAL project will allow addressing some of these questions.
 The GPCP precipitation data were provided by NASA GSFC's Laboratory for Atmospheres. NCEP reanalysis data were retrieved from www.esrl.noaa.gov/psd. The JRA-25 data are from the Japan Meteorological Agency and the Central Research Institute of Electric Power Industry. We acknowledge the Global Modeling and Assimilation Office and the GES DISC for the dissemination of MERRA. We acknowledge the modeling groups, the Program for Climate Model Diagnosis and Intercomparison (PCMDI) and the WCRP's Working Group on Coupled Modelling for making available the WCRP CMIP3 dataset. Runoff data were obtained from the GRDC in Koblenz (Germany), the U.S. Geological Survey, the South Australian Surface Water Archive and the Changjiang Water Resource Commission in Wuhan, China.
 Paolo D'Odorico thanks two anonymous reviewers.