Uncertainty in the prediction of continental surface climates is identified by the Intergovernmental Panel on Climate Change (IPCC) as limiting our confidence in projecting future climates. To reduce this uncertainty, the Atmospheric Model Intercomparison Project (AMIP) Diagnostic Subproject 12 (DSP 12) and the Project for Intercomparison of Land-surface Parameterisation Schemes (PILPS) have used a substantially improved experimental design, coupled with a greater variety of land-surface schemes (LSSs) represented in AMIP II, to investigate whether the thirty years of effort in land surface modelling has led to improvements in simulations of continental surface processes. In AMIP II, we find a clear chronological sequence of: First generation ‘no canopy’; second generation ‘SiBlings’; and ‘recent schemes’. We conclude that three decades have improved continental surface modelling capability but that full confidence in our ability to project land-surface quantities using climate models remains elusive, in part due to uncertainties in surface observations.
 Two key processes must be encapsulated by the land-surface component of a climate model: The partitioning of available energy between sensible and latent heat and the partitioning of available water between evaporation and runoff. The quantification of improvements in land-surface simulation requires isolating the role of differences in the atmospheric forcing (rainfall, incoming solar radiation etc.) from the changes in model parameterisations and from the feedbacks between atmospheric and land-surface processes [Pitman et al., 1999]. We use the AMIP II Atmospheric General Circulation Models' (AGCMs') results because the experiments are very well constrained in terms of experimental design and model results have been quality controlled [Gates, 1992; Gates et al., 1999]. As our aim is to identify the overall strengths and weaknesses in the community's land-surface models and strategies against a background of imperfect validation data, we do not identify individual models.
 The simulated 17-year mean latent heat flux (LH) and sensible heat flux (SH) of 20 AMIP II AGCMs (letters A–T) and three reanalyses are compared globally (GLS) and for eight de Martonne climates in Figure 1. The diagonal lines show the location of the mean LH + SH of all models (solid) and the reanalyses (dashed). (The de Martonne aridity index has the form $I_a = P/(T + 10)$, where $P$ is mean annual precipitation (mm) and $T$ is mean air temperature (°C), plus a ‘Polar’ class for temperatures below −5°C [Henderson-Sellers et al., 2002]. Classification uses the 1979–1995 precipitation from Xie and Arkin and the average of the near-surface air temperature from the three reanalyses (ECMWF, NCEP-NCAR and NCEP-DOE).) Scatter along the diagonal is due to differences in partitioning the surface available energy (Ea) into LH and SH. Scatter perpendicular to the diagonals is due to different predictions of Ea.
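The climate classification described above can be illustrated with a minimal sketch of the aridity index, assuming the standard de Martonne form Ia = P/(T + 10) and the −5°C ‘Polar’ rule quoted in the text. The function names are hypothetical, and the class boundaries between Arid and Extremely Humid are not given here, so the sketch does not assign class names.

```python
def de_martonne_index(precip_mm: float, temp_c: float) -> float:
    """Return Ia = P / (T + 10) for mean annual P (mm) and mean T (deg C)."""
    if temp_c <= -10.0:
        # The denominator is zero or negative; the index is not meaningful here.
        raise ValueError("index undefined for T <= -10 deg C")
    return precip_mm / (temp_c + 10.0)


def is_polar(temp_c: float, threshold_c: float = -5.0) -> bool:
    """Apply the 'Polar' rule from the text: mean temperature below -5 deg C."""
    return temp_c < threshold_c


# Illustrative (made-up) grid-cell values, not data from the paper.
ia_example = de_martonne_index(precip_mm=600.0, temp_c=10.0)
polar_example = is_polar(-8.0)
```

In practice the Polar test would be applied first, since the index loses meaning as the mean temperature approaches −10°C.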
 On the global average (GLS), Ea for 11 of the AGCMs is within the range of the reanalyses, but this number varies among the eight de Martonne climates. Agreement is better with ECMWF [ERA15: Gibson et al., 1997] in wetter and cold climates and with NCEP-DOE [Kanamitsu et al., 2002] in drier climates. In Mediterranean to Humid climates, the models' LH lies outside the reanalyses' range because the reanalyses agree closely with one another, which leaves that range narrow. The long-term mean LH (arrowed) from VIC [Liang et al., 1994] is lower than that of almost all models and reanalyses.
 The horizontal lines in Figure 1 show the magnitude of the energy residual (dEa) of each model, assuming that the net 17-year change in the surface energy store is negligible, i.e.

$$dE_a = R_{net} - SmH - (LH + SH)$$

where Rnet is surface net radiation and SmH is the snow melt equivalent energy. For three AGCMs (L, G and P), dEa is large. These residuals, which differ among the de Martonne climates, may be due to an energy flux at the lower soil boundary, a hypothesis not tested here. AGCM P, which fails to close its surface energy budget anywhere, has (Rnet − SmH) considerably greater than all the reanalyses. Model P's mean downward longwave radiation is around 30 W m−2 greater than the AMIP average, corresponding to an atmospheric temperature deviation of about 6 K that is not found in P's reported fields.
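The budget check and the longwave-to-temperature conversion above can be sketched as follows. This is an illustration only: it assumes the residual is Rnet − SmH − (LH + SH), and it estimates the temperature deviation with a linearized Stefan-Boltzmann law at an assumed reference temperature of 288 K and unit emissivity, neither of which is stated in the text. The function names and flux values are hypothetical.

```python
SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W m^-2 K^-4


def energy_residual(rnet: float, smh: float, lh: float, sh: float) -> float:
    """Surface energy residual dEa (W m^-2), assuming no net change in storage."""
    return rnet - smh - (lh + sh)


def temperature_shift_for_longwave(delta_lw: float, t_ref: float = 288.0,
                                   emissivity: float = 1.0) -> float:
    """Linearized estimate: d(eps*sigma*T^4) = 4*eps*sigma*T^3 * dT."""
    return delta_lw / (4.0 * emissivity * SIGMA * t_ref ** 3)


# Made-up fluxes (W m^-2) for a model that nearly closes its budget.
residual = energy_residual(rnet=100.0, smh=2.0, lh=60.0, sh=35.0)

# Temperature deviation implied by a 30 W m^-2 excess in downward longwave.
delta_t = temperature_shift_for_longwave(30.0)
```

Under these assumptions the 30 W m−2 excess maps to a deviation of between 5 and 6 K, the same order as the value quoted above.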
2. Clusters of Land-Surface Schemes
Figure 1 shows that some models systematically underestimate LH (e.g. M), underestimate available energy (e.g. O) or overestimate SH (e.g. Q). (Comparative terms such as underestimate and overestimate are used here for descriptive clarity, not to indicate any absolute achievement.) Models M, Q and E frequently lie towards the top left (small LH, large SH) and O, P, G and S frequently lie towards the lower part of the distribution (small SH and/or large LH). E and M use SiB [Sato et al., 1989]; Q uses SSiB [Xue et al., 1991]; while both O and P use variants of the bucket hydrology model [Manabe, 1969]. G and S are two-layer soil hydrology models but, like O and P, they do not model the canopy explicitly. All the LSS outliers in Figure 1 are from two groups: A one- or two-layer hydrology with no explicit canopy (‘no canopy’) and a SiB-based [Sellers et al., 1986] LSS (SiBlings). Model L's LSS is also a SiBling [SSiB: Xue et al., 1991]. Unlike the classical bucket [Manabe, 1969], P and O introduce a pre-infiltration runoff as a linear or nonlinear function (respectively) of relative soil moisture and do not predict as large an LH as the original bucket model [e.g. Shao and Henderson-Sellers, 1996]. The SiBlings' small LH (closest to VIC) results either from unfavorable atmospheric forcing or from the high sensitivity of the scheme's canopy resistance to atmospheric humidity. Model O's low LH (except in the Extremely Humid climate) is due to its abnormally small surface available energy.
 Generalizing, variations among the models' LH and SH are relatively smaller in wetter climates than in drier ones, and the reanalyses agree more closely with each other in intermediate climates, although all overestimate LH compared to VIC. (High quality observed global datasets of land-surface variables such as evapotranspiration are lacking because these variables are not directly observable at scales appropriate to atmospheric models. Both reanalysis data and the pseudo-observations generated by forcing the VIC land-surface scheme off-line and constraining its results by known parameters such as large river discharges are known to have limitations.) While the two-layer soil models with no explicit canopy (G and S) simulate high LH in almost all climates, the bucket model including canopy resistance (A) simulates much lower LH.
 The surface variables and forcings show no clustering that matches, or simply explains, these clusters (P. Irannejad, S. Sharmeen, and A. Henderson-Sellers, Importance of land surface parameterization for latent heat simulation in global atmospheric models, submitted to Geophysical Research Letters, 2003). For example, the SiBlings' low LH does not arise because their host models supply low surface available energy or low surface available water: Q has large precipitation (Pr) and Rnet whereas E, M and L have much smaller values of both. Many models have too much downward shortwave radiation at the surface but their net surface energy is less than observed [Henderson-Sellers et al., 1995].
3. Separating Forcing Effects From Parameterization Differences
 Differences in available moisture and energy in the AGCMs or feedbacks between surface and atmosphere might be responsible for the clustering of models in Figure 1. Figure 2 illustrates the effect of excluding the differences due to energy availability (scatter across the diagonal lines in Figure 1) by scaling SH and LH against the reanalysis ensemble Ea:

$$LH_s = LH_m \, \frac{E_{a,r}}{E_{a,m}}, \qquad SH_s = SH_m \, \frac{E_{a,r}}{E_{a,m}}$$

where m, r and s stand for model, ensemble reanalyses and scaled, respectively. This scaling assumes linearity in the apportioning of an increment of available energy between sensible and latent heat, but this limitation is unlikely to affect our conclusions. While NCEP-NCAR has similar LH to NCEP-DOE in the two wettest climates (Figure 1), its scaled LH is smaller than that of NCEP-DOE in all climates. The AMIP II AGCMs are scattered around the NCEP-DOE reanalysis in dry climates; around the NCEP-NCAR reanalysis in the Mediterranean to Humid climates; and around the ECMWF reanalysis in the Very Humid and Extremely Humid climates.
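The energy scaling described above can be sketched in a few lines. The sketch assumes that a model's available energy is Ea = LH + SH (as in the Figure 1 description) and that both fluxes are multiplied by the ratio of the reanalysis ensemble Ea to the model Ea, which is what linear apportioning implies. The function name and the flux values are hypothetical.

```python
def scale_fluxes(lh_m: float, sh_m: float, ea_r: float) -> tuple[float, float]:
    """Rescale a model's LH and SH so that their sum equals the reanalysis Ea.

    lh_m, sh_m : model latent and sensible heat fluxes (W m^-2)
    ea_r       : reanalysis ensemble available energy (W m^-2)
    """
    ea_m = lh_m + sh_m  # model available energy, assumed equal to LH + SH
    if ea_m == 0.0:
        raise ValueError("model available energy is zero; cannot scale")
    factor = ea_r / ea_m
    return lh_m * factor, sh_m * factor


# Made-up example: the model has 100 W m^-2 available, the reanalyses 80.
lh_s, sh_s = scale_fluxes(lh_m=40.0, sh_m=60.0, ea_r=80.0)
```

The scaling leaves each model's LH:SH partition unchanged and moves every model onto the reanalysis diagonal, so the remaining scatter reflects partitioning alone.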
 The largest scaled LH globally is associated with LSSs with no explicit canopy. However, the relative position of their scaled LH varies among climates (possibly due to changes in the availability of water): while O and S are among the most strongly evaporating models in all climates, P has this attribute in drier (Arid to Humid) climates and G in wetter (Mediterranean to Extremely Humid) climates. The relatively low values of scaled LH predicted by P compared to O in more humid climates may be due to its smaller soil depth and hence smaller water holding capacity. E, L, M and Q, which use variants of the SiB scheme, are among the least evaporating models in almost all climates, while H, using a scheme philosophically similar to SiBlings [Dickinson et al., 1986], predicts similar scaled LH.
 Next, we try to account for precipitation differences among the LSS clusters in Figures 1 and 2. To do this, we adjust the mean LH of each model using observed precipitation and mean reanalyses' surface energy as:

$$LH_{ms} = LH_m \, \frac{E_{a,r}}{E_{a,m}} \, \frac{Pr_o}{Pr_m}$$

where LHms is the scaled (adjusted) latent heat flux for each cluster, m, of LSSs (e.g. SiBlings, reanalyses) using both surface available energy (Ea) and precipitation (Pr), and r and o represent mean reanalyses and observations. The ensemble results for SiBlings, no-canopy schemes and other LSS types are shown in Figure 3a. Artificial correction of soil moisture (nudging) means that the reanalyses' precipitation is not an appropriate measure of available water. An additional adjustment is therefore shown in Figure 3a, which assumes that surface water is conserved and uses the sum of the mean reanalyses' evaporation and runoff as a substitute for precipitation.
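One plausible reading of this adjustment can be sketched as below. It is a sketch under assumptions, not the authors' code: it takes the adjustment to be multiplicative, scaling LH by the ratio of reanalysis to cluster Ea and by the ratio of observed to cluster precipitation. For the reanalyses it substitutes evaporation plus runoff for precipitation, as the text describes. All names and numbers are hypothetical.

```python
def adjust_latent_heat(lh_m: float, ea_m: float, pr_m: float,
                       ea_r: float, pr_o: float) -> float:
    """Adjust a cluster's mean LH for differences in energy and water supply.

    Assumed form: LH_ms = LH_m * (Ea_r / Ea_m) * (Pr_o / Pr_m).
    """
    if ea_m == 0.0 or pr_m == 0.0:
        raise ValueError("cluster Ea and Pr must be non-zero")
    return lh_m * (ea_r / ea_m) * (pr_o / pr_m)


def water_supply_from_closure(evaporation: float, runoff: float) -> float:
    """Available water for a nudged reanalysis, assuming surface water is conserved."""
    return evaporation + runoff


# Made-up cluster means: fluxes in W m^-2, water terms in mm/day.
lh_ms = adjust_latent_heat(lh_m=50.0, ea_m=100.0, pr_m=2.0, ea_r=80.0, pr_o=1.5)

# For a reanalysis, use evaporation + runoff in place of its own precipitation.
pr_substitute = water_supply_from_closure(evaporation=1.2, runoff=0.6)
```

A cluster that is both more energetic and wetter than the reference, as in this example, has its LH reduced by both ratios.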
Figure 3a clearly confirms the suggestion in Figure 1: On average, even when the very large differences in atmospheric forcing of energy and water are removed, SiBlings evaporate less and no-canopy schemes more than other schemes. On an annual basis over the different climate zones, with only a few exceptions, the differences between the model groups are significant at the 5% level (using a t-test).
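The text does not say which form of t-test was used. As an illustration only, the sketch below computes a two-sample Welch statistic (unequal variances) and its approximate degrees of freedom for two groups of scaled LH values. The choice of Welch's test, the function name and the sample values are all assumptions.

```python
import math
import statistics


def welch_t(sample_a: list[float], sample_b: list[float]) -> tuple[float, float]:
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of each group mean (sample variance / n).
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, dof


# Made-up scaled LH values (W m^-2) for two hypothetical scheme groups.
siblings = [30.0, 32.0, 31.0, 29.0, 33.0]
no_canopy = [40.0, 41.0, 39.0, 42.0, 38.0]
t_stat, dof = welch_t(siblings, no_canopy)
```

The difference would be judged significant at the 5% level if the absolute t statistic exceeds the two-sided critical value for the computed degrees of freedom.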
4. Land-Surface Climate Prediction Evolution
 Scaled land-surface fluxes exhibit an evolutionary trend (Figure 3b). The no-canopy schemes evaporate too much because they neglect canopy resistance; the second generation, SiBlings, emphasize canopy parameterization and predict lower evaporation; and the most recent land-surface schemes have reached a compromise, or central tendency, between the two. These results show that AGCMs using recent LSSs simulate latent heat fluxes close to the ensemble reanalyses, after scaling for differences in surface available energy and water (Figure 3a).
 Belief systems pertaining to ‘validation’ data change. While we hold no brief for particular evaluation data, we note that VIC conserves water in all climates whereas, due to nudging, the reanalyses do not [Roads and Betts, 2000; cf. Maurer et al., 2000]. VIC results have been challenged recently, although we employ VIC's water balance mode [Nijssen et al., 2001a, 2001b], not the energy balance mode used by A. Robock et al. (Evaluation of the North American Land Data Assimilation System over the Southern Great Plains during the warm season, submitted to Journal of Geophysical Research, 2003). As all validation datasets have weaknesses at the land surface, convergence towards the most recently available dataset may not be desirable.
 Our AMIP ensemble results demonstrate that the prediction and partition of land-surface fluxes have evolved over time in a pattern traceable to the manner of implementation of land surface physics; that land-surface parameterization schemes can and do capture the expected wide range of behaviors; and that not all schemes currently simulate all characteristic climate behaviors equally well.
 We thank Dr S. Sharmeen for the calculations for this paper and Dr T. Phillips for co-ordination with the PCMDI AMIP II team.