Complexity and resolution of global climate models are steadily increasing, yet the uncertainty of their projections remains large, particularly for precipitation. Given the impacts precipitation changes have on ecosystems, there is a need to reduce projection uncertainty by assessing the performance of climate models. A common way of evaluating models is to consider global maps of errors against observations for a range of variables. However, depending on the purpose, feature-based metrics defined on a regional scale and for one variable may be more suitable to identify the most accurate models. We compare three different ways of ranking the CMIP3 climate models: errors in a broad range of climate variables, errors in the global field of precipitation, and regional features of modeled precipitation in areas where pronounced future changes are expected. The same analysis is performed for temperature to identify potential differences between variables. The multimodel mean is found to outperform all single models in the global field-based rankings but is only average in the feature-based ranking. Selecting the best models for each metric reduces the absolute spread in projections. If anomalies are considered, the model spread is reduced in a few regions, while the uncertainty can be increased in others. We also demonstrate that the common attribution of a lack of model agreement in precipitation projections to different model physics may be misleading. Agreement is similarly poor within different ensemble members of the same model, indicating that the lack of robust trends can be attributed partly to a low signal-to-noise ratio.
 In the discussion on climate change, trends in the hydrological cycle are of particular interest since they are expected to have severe consequences for societies and ecosystems. End users of climate model output with an interest in hydrological changes therefore need information about the quality of the predictions. However, model disagreement about precipitation is large, in particular on a regional scale. Although climate models are getting constantly more complex, unambiguous statements about future changes in precipitation patterns are still difficult to provide [Trenberth et al., 2003]. The aim of this study is to define new metrics to evaluate the ability of current climate models to simulate regional precipitation and to investigate if future projection uncertainty can be reduced when considering the best models in these regions.
The literature available on the evaluation of climate models is broad and many ways of assessing model performance have been proposed. Although each individual method provides interesting information, so far no widely accepted suite of metrics to evaluate the performance of climate models exists for precipitation or any given climate variable in general [Räisänen, 2007; Intergovernmental Panel on Climate Change, 2007; Knutti et al., 2010a]. Several studies [Lambert and Boer, 2001; Reichler and Kim, 2008; Gleckler et al., 2008; Pincus et al., 2008] evaluate the performance of climate models for a range of climate variables and on a global scale by using statistical measures to quantify the errors. Reichler and Kim [2008] ranked the climate models based on a single performance index, defined as the aggregated errors in simulating the observed climatological mean states of several climate variables. Gleckler et al. [2008] and Pincus et al. [2008] used straightforward statistical measures (e.g., root-mean-square error, correlation, bias or standard deviation) to evaluate models against observations on a global scale for given variables. All three studies conclude that the multimodel mean (hereafter MMM) shows better agreement with the observations than any single model.
However, evaluations on a global scale summarized for many variables are not useful in some specific cases. A model performing well for a given variable, season and region might perform poorly for another variable, season and region [Whetton et al., 2007]. Gleckler et al. [2008] also stress the fact that in their evaluation, the relative merits of each model in simulating individual processes or variables are lost. Gleckler et al. [2008] and Pincus et al. [2008] further state that all models have distinctive weaknesses in simulating specific variables.
 A range of studies concentrated their evaluation on precipitation. As a response to anthropogenic forcing, temperature is expected to increase in all regions of the globe while precipitation is expected to increase in the tropics and high latitudes and decrease in the midlatitudes [Allen and Ingram, 2002]. The regional character of the expected changes suggests a need for a model evaluation on that scale. Giorgi and Mearns  divided the area over land into regions and calculated for each a measure combining information on model performance and convergence. Tebaldi et al.  performed an evaluation based also on the criteria model bias and model convergence. Both studies aim at reducing the uncertainty range for future regional precipitation by weighting the models according to the criteria mentioned above. However, no information on individual model performance is delivered, which would be of interest for end users of climate model output from other scientific communities. For example, precipitation projections are needed as input for hydrological models and the large model disagreement is an issue. Information about the quality of the predictions/simulations of each model might be a way out, although recent studies tend to show that a good performance during a given time period does not guarantee a good performance in a future time period [Jun et al., 2008; Knutti et al., 2010b].
Finally, some studies concentrated on one region of interest. Phillips and Gleckler [2006] evaluated the ability of the models to simulate the seasonal cycle of precipitation globally and in certain regions. They show that while the MMM outperforms any single model at simulating continental precipitation on a global scale, in some regions, this is less clearly the case. Pierce et al. [2009] found that over the western United States and for a detection and attribution purpose, forming the MMM is a better way to make use of the information than selecting the best models. Contrary to this, Perkins and Pitman [2009] as well as Smith and Chandler see a reduction in future projection uncertainty by selecting the best models for precipitation for regions over Australia.
Knutti et al. [2010b] showed that most statistical metrics like the root mean square error do not correlate strongly with future projections, and they suggest that feature-based evaluations could provide additional useful information. A feature-based metric considers regional changes that are robust and can be understood physically. This is different from the approach chosen by several regional studies cited above, where the regions were defined quite arbitrarily to partition the land part of the Earth. Here, we define such feature-based metrics and evaluate the models' ability to reproduce them compared to observations. This study consequently aims to provide information on the individual performance of the CMIP3 models for the present climate, as well as information about the persistence of these performance measures in a future climate. Further, the results obtained for precipitation are compared with the ones obtained for temperature in order to identify potential differences between variables.
 The data are briefly presented in section 2, while section 3 provides a pointwise evaluation of the modeled precipitation during the observation period. Section 4 investigates reasons for the lack of model agreement in the future. The definition of the metrics used for the model evaluation along with the results presented as a ranking are shown in section 5. The projections of the corresponding precipitation indices are discussed in section 6 and conclusions are provided in section 7.
The model simulations for precipitation and temperature used in this study stem from 24 of the global coupled atmosphere-ocean general circulation models (AOGCMs) made available by the World Climate Research Programme (WCRP) Coupled Model Intercomparison Project phase 3 (CMIP3) [Meehl et al., 2007a] (see http://www-pcmdi.llnl.gov/about/index.php for further information). One ensemble member of each model of the precipitation field from the simulation of the 20th century and the scenario A1B is used, and they are equally weighted for the multimodel mean. During the observation period (1979–2004), the models are evaluated against a merged product of precipitation with global coverage, the Global Precipitation Climatology Project (GPCP) Version-2 monthly precipitation analysis [Adler et al., 2003]. Another global precipitation product exists and is used as a secondary reference data set, the Climate Prediction Center's (CPC) Merged Analysis of Precipitation (CMAP) [Xie and Arkin, 1998]. GPCP is the reference data set for the evaluation performed in section 5 because CMAP uses atoll data over oceans, which leads to artifacts in trends [Yin et al., 2004]. As a comparison, the same evaluation of the CMIP3 models is performed for temperature. Here, the ERA40 reanalysis data set [Uppala et al., 2005] is used as reference for the time period 1979–2001.
3. Pointwise Evaluation
Figure 1 summarizes the modeled and observed precipitation mean values as well as the bias of the MMM for boreal winter and summer. The mean precipitation of the MMM cannot be compared directly with mean precipitation of GPCP since the former is an average of multiple realizations and the latter represents only one realization. However, the main features of the precipitation patterns are captured by the MMM but with errors in their amplitude and exact location. The Spearman rank correlation coefficients between the MMM and GPCP are highly significant, ρ = 0.9 in DJF and ρ = 0.89 in JJA. However, the bias of the MMM, expressed in percent compared to the mean values of GPCP, is in general large over both oceans and land. Reasons for that are probably a combination of model errors and observational uncertainties plus a contribution of internal variability. As a comparison, the bias of the CMAP data set with respect to the GPCP data set is nonnegligible and in some regions, on the same order of magnitude as the one of the MMM [Yin et al., 2004].
The biases are shown twice on the third and fourth rows of Figure 1, each time with a different criterion for stippling. In the third row of Figure 1, grid points are stippled where the observations lie outside ±2 standard deviations of the 24 CMIP3 models (“Bias stippling 1”). For those grid points, the biases are larger than one would expect given the internal variability of the models and their structural differences. The stippled area is 6.1% of the globe in DJF and 6.7% in JJA, which is only slightly more than what one would expect to occur by chance with the criterion of 2 standard deviations. This outcome tends to indicate that the observations are indistinguishable from the models (see discussion in section 6.2) given the large model errors. The criterion for stippling in the fourth row of Figure 1 is defined from an estimate of the internal variability of the models for the period 1979–2004. The CMIP3 models that have more than 4 ensemble members (i.e., CGCM3.1(T47), CCSM3, ECHAM5/MPI-OM, MRI-CGCM2.3.2 and PCM) are selected, and for each of these models, the standard deviation of their ensemble members is calculated. Then the average of these standard deviations is used as a measure of internal variability. Grid points are stippled where the absolute value of the bias is at least twice as large as this average standard deviation (“Bias stippling 2”). The fact that 72.5% in DJF and 82.6% in JJA of the whole globe is stippled according to this criterion implies that the observations are inconsistent in many areas with respect to the modeled range of natural internal variability, either because of observational errors, model errors or because the models underestimate internal variability.
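The two stippling criteria described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the code used for Figure 1: the function names, the array layout (model, latitude, longitude), the centering of the ±2 standard deviation range on the MMM, the use of the sample standard deviation (ddof=1) and the cosine-latitude area weighting are all assumptions that the text does not specify.

```python
import numpy as np

def stipple_model_spread(obs, models, n_std=2.0):
    """Bias stippling 1 (sketch): observations outside MMM +/- n_std * model std.

    obs: (nlat, nlon) observed climatology.
    models: (nmodel, nlat, nlon), one climatology per model.
    Centering the range on the MMM is an assumption.
    """
    mmm = models.mean(axis=0)
    spread = models.std(axis=0, ddof=1)
    return np.abs(obs - mmm) > n_std * spread

def stipple_internal_variability(obs, mmm, ensembles, factor=2.0):
    """Bias stippling 2 (sketch): |bias| >= factor * average within-model std.

    ensembles: list with one array per multi-member model,
    each of shape (nmember, nlat, nlon).
    """
    sigma = np.mean([e.std(axis=0, ddof=1) for e in ensembles], axis=0)
    return np.abs(mmm - obs) >= factor * sigma

def stippled_fraction(mask, lat):
    """Fraction of the globe that is stippled, weighted by cos(latitude).

    The area weighting is an assumption; lat is in degrees, shape (nlat,).
    """
    w = np.cos(np.deg2rad(lat))[:, None] * np.ones(mask.shape)
    return float((w * mask).sum() / w.sum())
```

With real data, `stippled_fraction` applied to the two masks would give numbers comparable in kind to the percentages quoted in the text, although the exact values depend on the assumptions listed above.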
The observed and modeled precipitation trends for the observation period (1980–2004) as well as the modeled precipitation trends for a 100 year time period in the future (2000–2099) are represented in Figure 2. In the observations (GPCP), many small-scale structures in the trends can be seen, and finding a physical explanation for them is not straightforward. Significant drying (at the 95% confidence level) seems to dominate in the polar regions as well as over the west coasts of the continents while significant wettening is mostly located over Greenland, the Northern Territories and in the Indian Ocean. The trends of the MMM from 1980 to 2004 are shown in Figure 2 (middle). If 18 out of the 24 CMIP3 models agree on the sign of significant change, the grid point is stippled, which is never the case during the observation period. On the one hand, this criterion minimizes the possibility that models have the same sign of the trend just by chance; on the other hand, it is not so stringent that it is never met, which would not be very informative in the case of future trends. For the MMM, a weak wettening of the high latitudes and the equatorial region can be recognized, while some regions in the midlatitudes experience a slight drying, but these features are not robustly simulated by the models. Again, the trends in precipitation of the MMM are not expected to agree perfectly with the trends of GPCP for the same reasons described above. The amplitude in the trend patterns is smaller for the MMM than in the observations because natural variability is reduced in the MMM by model averaging [Räisänen, 2007]. Possible reasons for a discrepancy between observed and modeled precipitation trends are manifold: low signal-to-noise ratio, observation uncertainties, inadequate parameterizations in the models as well as incomplete representation of the forcings or too low spatial resolution. In addition, it must be stressed that the precipitation trends are nonsignificant in many regions for GPCP and the CMIP3 models, which indicates that at such short time scales, natural variability dominates.
For the future time period, a wettening of the high latitudes and of the equatorial region, along with a drying of the midlatitudes can be recognized. Model agreement (stippling if 18 out of the 24 CMIP3 models agree on the sign of significant change) is generally confined to the high latitudes during the cold season. It is further interesting to note that the drying is a less robust feature than the wettening. The agreement criterion is rarely reached in regions where a drying is expected because there, variability is large and mean precipitation low, which leads to highly variable percentage changes in the CMIP3 models. However, the chosen agreement criterion is severe, and robust precipitation changes can still be expected in areas where there is no stippling in the bottom row of Figure 2 (see Meehl et al. [2007b, Figure 10.9] for an alternative criterion for model agreement).
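The agreement criterion used for the stippling in Figure 2 can be written compactly. The sketch below assumes that a trend field and a boolean significance mask are already available for each model; the function name and array layout are hypothetical, and how significance is tested is left outside the example.

```python
import numpy as np

def sign_agreement(trends, significant, min_models=18):
    """Flag grid points where at least min_models models show a
    significant trend of the same sign.

    trends: (nmodel, nlat, nlon) trend per model.
    significant: boolean array of the same shape (True where the
    trend passes the chosen significance test).
    """
    n_pos = ((trends > 0) & significant).sum(axis=0)
    n_neg = ((trends < 0) & significant).sum(axis=0)
    return np.maximum(n_pos, n_neg) >= min_models
```

The threshold is kept as a parameter so that the same function could be applied to ensembles of a different size.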
4. Model Agreement
As shown in section 3, large uncertainties are associated with changes in the precipitation patterns in a warmer climate since model agreement is poor compared to temperature for example. Spread in projections is caused by the differences between the models and by internal variability. To investigate whether the poor model agreement is actually due to differences between the models or rather caused by the nature of precipitation itself, the trends from 2000 to 2099 of the 7 available runs of the CCSM3 model and of a subset of 7 reasonably independent CMIP3 models (CGCM3.1(T47), CSIRO-Mk3.5, GFDL-CM2.1, INGV-SXG, MIROC3.2(medres), ECHAM5/MPI-OM, MRI-CGCM2.3.2) are computed (see Figure 3). The conclusions however do not depend on the exact choice of the subset but likely hold for all possible subsets. While the exact locations of the spatial patterns of significant precipitation change are slightly different between the 7 CCSM3 runs and the 7 CMIP3 models, the wettening of the high latitudes and the equatorial region along with drying in some areas in the midlatitudes are captured by both. Again, a model agreement criterion is defined: grid points are stippled if at least 6 out of 7 runs/models agree on a significant sign of change. The percentage of area stippled is larger in the 7 CCSM3 runs (13.6% in DJF and 11.8% in JJA) compared to the 7 CMIP3 models (9% in DJF and 5.8% in JJA), as can be expected. Nevertheless, the area stippled for the 7 CCSM3 runs is surprisingly small and still in the same range as for the 7 CMIP3 models. This result indicates that even if the uncertainty caused by model differences is eliminated, internal variability still contributes strongly to the lack of agreement in precipitation projections.
The relative importance of internal variability compared to model differences can be further quantified. The ratio of the standard deviations of the CMIP3 subset compared to the CCSM3 runs is computed in Figure 3 (bottom). The global average of this ratio is roughly 3 (2.7 in DJF and 2.95 in JJA), meaning that the contribution of model differences is around 3 times larger than that of internal variability in terms of standard deviation. While model differences dominate, this does not imply that reducing model uncertainty in future projections will necessarily improve the significance of the projected trends. For single grid points where variability is large, the signal may not be significant even in a perfect model.
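A minimal sketch of this ratio is given below. It assumes trend fields stacked along a leading axis for the multi-model subset and for the ensemble members of a single model; the function names, the sample standard deviation (ddof=1) and the cosine-latitude weighting of the global average are assumptions, since the text does not state how the average is formed. Note that the spread across different models also contains internal variability, so the ratio compares "model differences plus noise" with "noise alone".

```python
import numpy as np

def spread_ratio(model_trends, member_trends):
    """Ratio of the across-model std to the across-member std.

    model_trends: (nmodel, nlat, nlon), one run per model.
    member_trends: (nmember, nlat, nlon), runs of a single model.
    """
    s_models = model_trends.std(axis=0, ddof=1)
    s_members = member_trends.std(axis=0, ddof=1)
    return s_models / s_members

def global_mean(field, lat):
    """Area-weighted global mean with cos(latitude) weights (assumption)."""
    w = np.cos(np.deg2rad(lat))[:, None] * np.ones(field.shape)
    return float((w * field).sum() / w.sum())
```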
 In this section, the climate models are evaluated on a regional scale using feature-based metrics. These metrics are designed to focus on areas that reveal a clear signal of change in precipitation over the time period considered. This has also been the motivation, at least to some extent, of previous studies [Pitman et al., 2004; Pierce et al., 2009; Perkins and Pitman, 2009] but with the difference that they concentrated only on one region of interest. Here the aim is to go one step further by defining metrics in different regions of the globe over land and ocean parts and to compare the performance of the individual models using several feature-based metrics.
The selected features are regions where the predicted precipitation change is robust. They are identified with the map of the future trends in precipitation of the MMM discussed in section 3 for two different seasons (DJF and JJA; see Figure 2, bottom). Eight metrics, which we refer to as precipitation indices (see Table 1 for definitions), are chosen based on the significance of the trends and on the scientific understanding of the physical processes responsible for these changes. It is important to emphasize that the eight precipitation indices have to be regarded as examples and not as the only set of feature-based metrics possible.
Table 1. Precipitation indices. DJF (JJA) denotes the mean precipitation during DJF (JJA) in the corresponding domain.
African index: AFI = JJA
Amazonian index: AMI = JJA; 24°S–1°N, 31°W–59°W, land only
Asian index: ASI = JJA − DJF
Australian index: AUI = JJA
Central American index: CAI = JJA
High-latitudes index: HLI = DJF; 52°N–71°N, land only
Mediterranean index: MEI = JJA
Storm tracks index: STI = DJF_B − DJF_A; zone A (35°S–46°S)/zone B (49°S–60°S)
The eight precipitation indices are defined in Table 1. The storm tracks index (STI) is designed to detect the poleward shift of the storm tracks in the Southern Hemisphere, where zone A refers to the preferred region of cyclone activity in the past and zone B to the region where the storms are expected to pass by in the future [Hoskins and Hodges, 2005; Previdi and Liepert, 2007]. The African index (AFI) and the Australian index (AUI) capture the precipitation decrease over the midlatitudes of the Southern Hemisphere that is related to the positive trend of the Southern Annular Mode (SAM) index prevailing since the climate shift of the mid-1970s [Thompson and Solomon, 2002]. The Asian index (ASI) depicts the expected decrease of precipitation during the dry season and the increase of precipitation during the wet season in Southeast Asia. In the context of global warming, more warming over land than over the ocean is expected, leading to a northward shift of the lower tropospheric monsoon circulation and consequently to an increase in mean precipitation during the Asian summer monsoon [Dairaku and Emori, 2006; Sun and Ding, 2010]. Changes in the location of the ITCZ are also expected to reduce precipitation during June, July and August (JJA) over the Amazon Basin (the Amazonian index, AMI) [Christensen et al., 2007]. For the Northern Hemisphere, the high-latitudes index (HLI) captures the increase in precipitation during December, January and February (DJF) over the continents [Previdi and Liepert, 2007]. In a warmer climate, moisture convergence toward the convection zones will increase and, as a consequence, moisture divergence in the midlatitudes will be enhanced, causing a decrease in precipitation [Neelin et al., 2006]. The most prominent features of this subtropical/lower midlatitude drying in the Northern Hemisphere are the JJA precipitation decrease over the Caribbean/Central American region (captured by the Central American index, CAI) and the one over the Mediterranean region (captured by the Mediterranean index, MEI), which is also associated with the soil moisture feedback over land [Rowell and Jones, 2006; Seneviratne et al., 2006].
Due to the heterogeneity of primary data, the quality of the merged gauge-satellite monthly precipitation products, GPCP and CMAP, cannot be expected to be equally good in the 8 regions where the feature-based metrics are defined. In general, GPCP and CMAP are more similar over land than over oceans, simply due to the availability of gauge measurements [Yin et al., 2004]. Consequently, the quality of the observation data sets is expected to be better for precipitation indices defined mainly over land, like the AFI, AMI, AUI and MEI. As can be seen in Figure 5, GPCP is slightly different from CMAP for the ASI and CAI since these indices are mainly defined over oceans. The largest differences between the two data sets are however encountered for the two metrics defined in the high latitudes, namely the HLI and STI, where the data sets use different input data [Yin et al., 2004]. Despite the inherent uncertainties, the GPCP and CMAP data sets can be regarded as best estimate data sets of precipitation patterns.
The precipitation indices make it possible to identify whether some models clearly perform better than average in regions where significant changes are expected and where the physical processes responsible for the changes are thought to be understood. The spatial pattern of precipitation within a region is however not evaluated. It is further interesting to investigate if the good models in a given region also perform well in this region but for another variable [Whetton et al., 2007]. We therefore compare the results obtained for precipitation with temperature. Temperature is chosen because its signal of change does not strongly depend on the region considered and the field is relatively homogeneous. For this reason, temperature indices can be defined in the same region and for the same season as the precipitation indices and still be meaningful. The ASI and STI are exceptions because they describe processes that exist for precipitation but not for temperature. The ASI and STI for temperature are therefore simply defined as a temperature average for one season (see Table 2).
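Table 2. Temperature indices. DJF (JJA) denotes the mean temperature during DJF (JJA) in the corresponding domain.
African index: AFI = JJA
Amazonian index: AMI = JJA; 24°S–1°N, 31°W–59°W, land only
Asian index: ASI = JJA
Australian index: AUI = JJA
Central American index: CAI = JJA
High-latitudes index: HLI = DJF; 52°N–71°N, land only
Mediterranean index: MEI = JJA
Storm tracks index: STI = DJF_AB; zone AB (35°S–60°S)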
The eight index trends of the MMM from 1980 to 2079 are significant at the 0.01 level for both variables. Unfortunately, the observational period is short and in the case of precipitation, trends are not significant as discussed in section 3, making an evaluation of the trends meaningless. Consequently, the CMIP3 models and the MMM are ranked according to their ability to simulate the mean value of each index during the observation period (precipitation index ranking and temperature index ranking hereafter). The errors of each model at simulating the mean index value are simply calculated as the difference between the observed index mean and the modeled index mean and do not include information about discrepancies in the spatial structure within the index domain. To compare and aggregate these performance metrics, they are converted to a common ranking system. A rank of 1 is attributed to the model with the smallest error on the metric considered, a rank of 2 to the second-best model, etc. While some quantitative information is lost in this ranking method, it has the advantage that indices with different scales and units can readily be compared in an aggregated form. Finally, to summarize the performances of the models over the eight indices, the ranks obtained for each of the eight indices are summed up, and this sum is ranked again (lines “Prec ALL” and “Temp ALL” in Figure 4). The sum of the ranks for the precipitation indices and the temperature indices is finally ranked in Figure 4c (“indices”). The motivation for summing the ranks over different regions and variables is to test if the index-based results gradually converge to the widely used broad-brush metrics that summarize performances for a large range of climate variables on a global scale.
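The ranking procedure just described (rank per index, sum of ranks, rank of the sum) can be sketched as follows. The function names and the input layout are hypothetical. Ranking by the absolute value of the error and breaking ties by input order are assumptions; the text does not say how ties are handled.

```python
import numpy as np

def rank_errors(errors):
    """Return ranks (1 = smallest absolute error) for a 1-D array.

    Ties are broken by input order, which is an assumption.
    """
    order = np.argsort(np.abs(errors), kind="stable")
    ranks = np.empty(len(errors), dtype=int)
    ranks[order] = np.arange(1, len(errors) + 1)
    return ranks

def aggregate_ranks(error_table):
    """Rank models per index, sum the ranks, and rank the sums again.

    error_table: (nindex, nmodel) array holding, for each index, the
    observed index mean minus the modeled index mean.
    Returns (per_index_ranks, overall_rank).
    """
    per_index = np.array([rank_errors(row) for row in error_table])
    overall = rank_errors(per_index.sum(axis=0))
    return per_index, overall
```

For example, with two indices and three models, errors of `[[0.1, -0.5, 0.3], [-2.0, 1.0, 0.5]]` give per-index ranks `[1, 3, 2]` and `[3, 2, 1]`, rank sums `[4, 5, 3]`, and an overall ranking of `[2, 3, 1]`.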
 The index ranking is first compared to a ranking performed on global scale, again for both variables, precipitation and temperature, only. The root mean square error (rmse) of each model with respect to the observations and the spatial correlation between simulated and observed precipitation and temperature (referred to as the rmse/corr ranking hereafter) are calculated separately for each variable. The model having the lowest rmse (highest correlation coefficient) ranks first. In Figure 4c, the “rmse/corr” depicts a ranking of the sum of the ranks obtained for both variables on the rmse and the corr ranking, again to identify if by doing so, the outcomes of the broad-brush metrics can be reproduced.
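As an illustration of the two global measures, the sketch below computes an rmse and a spatial correlation between a modeled and an observed field on a regular latitude-longitude grid. The cosine-latitude area weighting and the use of a Pearson-type pattern correlation are assumptions, as the text does not specify either choice; the function name is hypothetical.

```python
import numpy as np

def rmse_and_corr(model, obs, lat):
    """Area-weighted rmse and spatial (pattern) correlation.

    model, obs: (nlat, nlon) fields on the same grid.
    lat: (nlat,) latitudes in degrees, used for cos(latitude) weights.
    """
    w = np.cos(np.deg2rad(lat))[:, None] * np.ones(obs.shape)
    w = w / w.sum()
    rmse = np.sqrt((w * (model - obs) ** 2).sum())
    m_mean = (w * model).sum()
    o_mean = (w * obs).sum()
    cov = (w * (model - m_mean) * (obs - o_mean)).sum()
    var_m = (w * (model - m_mean) ** 2).sum()
    var_o = (w * (obs - o_mean) ** 2).sum()
    return float(rmse), float(cov / np.sqrt(var_m * var_o))
```

Models would then be ranked from the lowest to the highest rmse and from the highest to the lowest correlation.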
Finally, the index and the rmse/corr rankings are compared to a ranking performed with a broad-brush metric, which is a version of the ranking on a broad range of climate variables performed by Reichler and Kim [2008] (RK08 ranking hereafter), updated with more variables and using four seasons (T. Reichler, personal communication, 2009).
 In summary, the RK08 ranking identifies the model performance on a global scale summarized for different climate variables, the rmse/corr ranking provides a picture of the models' spatial error with respect to the precipitation and temperature data on a global scale and the index ranking allows for the identification of the models that best simulate local precipitation and temperature features expected to change in the future due to anthropogenic forcing.
5.2. Results and Discussion
 The results of the index rankings for both precipitation and temperature are summarized in Figure 4a. At first glance, none of the CMIP3 models appears to consistently outperform the rest. This is particularly obvious in the precipitation index ranking. Here, each model performs at least once better and worse than the average, while the MMM performs average for all indices. The results for the temperature index ranking are only slightly different: again the models can perform better and worse than average for different indices, except ECHAM5/MPI-OM and the MMM, which always perform better than average. It is further interesting to note that there are no significant correlations among the eight indices for each variable nor between precipitation and temperature for each index (not shown). The performances of each model are summarized in the lines Prec ALL and Temp ALL (see Figure 4). For Prec and Temp ALL, the MMM ranks eleventh and third, respectively.
For the interpretation of the results, it is important to keep in mind that even though in evaluation studies the MMM is often considered as just another model, it is actually not. By definition, the MMM can perform only from average up to best but cannot be worse than average, while each individual model can occupy any place from the worst up to the best (see also Figures 5 and 6). The results depicted in the lines Prec ALL and Temp ALL illustrate the fact that the more indices are included, the better the performance of the MMM. This outcome is similar to the findings of Pierce et al. [2009]. This continuously improving performance of the MMM is partly due to the fact that it never performs below average, in contrast to the individual models. However, this does not mean that the MMM is better at simulating individual index mean values, but is rather an artifact that arises when more indices are considered. The MMM never has to compensate for a below-average performance, and it is favored by the fact that there is no correlation between the different indices for both variables. The above-average models are therefore difficult to identify because when considering regional features for a given variable, none of the CMIP3 models consistently outperforms the rest.
Figure 4 also shows the results of the ranking performed on the global scale for precipitation and temperature, termed the rmse/corr ranking. Here, and in agreement with previous studies [Phillips and Gleckler, 2006; Gleckler et al., 2008; Pincus et al., 2008], the MMM clearly performs best for both variables. Averaging the individual models smooths out variations and small-scale biases of the precipitation field, so that errors partly cancel out in the MMM [Phillips and Gleckler, 2006; Pierce et al., 2009]. Consequently, the MMM is favored by global statistical metrics because of its relatively small magnitude of biases over the whole globe and its good representation of the spatial pattern, while feature-based metrics favor a single model capable of displaying the area mean precipitation over a given region. Except for the performance of the MMM, the rmse/corr ranking is quite different for precipitation and temperature, illustrating that also globally, a model performing well for one variable might perform poorly for another. Further, it is interesting to compare the index ranking with the rmse/corr ranking for each variable individually. The Spearman's rank correlation coefficient between Prec ALL and Prec rmse is nonsignificant at the 95% level while it is significant between Prec ALL and Prec corr (ρ = 0.49). For temperature, the correlations are also low but significant: ρ = 0.56 between Temp ALL and Temp rmse and ρ = 0.4 between Temp ALL and Temp corr. This indicates that when the performances of the individual models on a few chosen regional features are summarized, the results of a ranking based on the rmse or correlation on a global scale can be approached.
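For two rankings without ties, Spearman's rank correlation coefficient is simply the Pearson correlation of the rank vectors, which allows a very small sketch. The function name is hypothetical, the inputs are assumed to be ranks already (for instance the output of a ranking step like the one used for Prec ALL), and the significance test is not included.

```python
import numpy as np

def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two tie-free rank vectors.

    Equivalent to 1 - 6 * sum(d**2) / (n * (n**2 - 1)) when there are
    no ties, where d is the rank difference for each model.
    """
    a = np.asarray(rank_a, dtype=float)
    b = np.asarray(rank_b, dtype=float)
    return float(np.corrcoef(a, b)[0, 1])
```

As a quick check, the rankings `[1, 2, 3]` and `[2, 1, 3]` differ by `d = [1, 1, 0]`, so that ρ = 1 − 6·2/(3·8) = 0.5.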
Finally, the errors obtained for all precipitation and temperature indices are summed to obtain the line “indices” in Figure 4c, which can be compared with the rmse/corr ranking and the broad-brush metric ranking RK08 (Figure 4c). In this final index ranking, the MMM ranks fifth and it is reasonable to assume that including more regional features and/or more variables will contribute to improve the rank of the MMM, which will eventually rank first. It is obvious that the MMM ranks first for the rmse/corr ranking since it was already the case for the rmse and corr ranking of each individual variable. The MMM also ranks first in the RK08 ranking for the same reason. By definition the error of the MMM at each grid point cannot be larger than the mean error of the models and consequently, the errors of the MMM are the smallest when averaged globally. Nevertheless, the three ways of ranking presented here share similarities. The Spearman's rank correlations are significant between the three rows in Figure 4c: ρ = 0.44 between “indices” and “RK08,” ρ = 0.61 between “indices” and “rmse/corr” and ρ = 0.78 between “RK08” and “rmse/corr.” This shows that when summarizing the errors at simulating the mean of several feature-based metrics for different variables, the performance of the individual models is partly the same as when evaluating the models with measures of the global spatial distribution and including more (e.g., RK08 ranking) or less (e.g., rmse/corr ranking) climate variables.
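The statement that the grid point error of the MMM cannot exceed the mean error of the models follows from the triangle inequality, |mean(x_i) − o| ≤ mean(|x_i − o|). The short check below illustrates this with purely synthetic random fields; the data, array sizes and function name are made up for the illustration and have nothing to do with the CMIP3 output.

```python
import numpy as np

def mmm_error_check(models, obs):
    """Return (|MMM error|, mean |model error|, max |model error|) per grid point.

    models: (nmodel, nlat, nlon); obs: (nlat, nlon).
    """
    abs_err = np.abs(models - obs)
    err_mmm = np.abs(models.mean(axis=0) - obs)
    return err_mmm, abs_err.mean(axis=0), abs_err.max(axis=0)

# Synthetic example: 24 "models" scattered around a random "observation".
rng = np.random.default_rng(0)
obs = rng.normal(size=(10, 20))
models = obs + rng.normal(size=(24, 10, 20))
err_mmm, mean_err, max_err = mmm_error_check(models, obs)

# Small tolerance guards against floating-point rounding only.
print(bool(np.all(err_mmm <= mean_err + 1e-12)))
```

The inequality holds at every grid point whatever the data, which is why the MMM is systematically favored once errors are averaged over many grid points.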
To summarize, an evaluation of the models using global statistical measures like the RK08 ranking does not capture the merely average performance of the MMM at simulating the mean precipitation amounts in a given region. Such evaluation techniques rather reflect the fact that the MMM has the smallest errors as soon as the domain size exceeds several grid points because, by definition, it cannot have the maximal error at a grid point (in contrast to the individual models). When the ranks obtained for the eight precipitation and temperature indices are summed up and ranked again, a similar result is seen: the MMM does not have to compensate for poor rankings, and its performance becomes gradually better the more regions and variables are summed up. The information that single models are better than the MMM at simulating regional mean precipitation amounts for a given season can be relevant for impact studies but is hidden in evaluations using a global broad-brush approach. In addition, the results obtained for temperature suggest that the worse performance of the MMM for the feature-based metrics compared to global summary statistics is not a particularity of precipitation but is likely to hold for most variables. The interpretation of the MMM is further discussed at the end of section 6.
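Summing ranks over several indices and ranking again can be sketched as follows. The model names, the error values, and the tie-breaking rule (alphabetical by name) are assumptions made for this illustration and are not taken from the study.

```python
def rank_models(scores):
    """Map each name to its rank (1 = smallest score).

    Ties are broken alphabetically by name, which is an arbitrary
    choice made for this sketch.
    """
    ordered = sorted(scores, key=lambda name: (scores[name], name))
    return {name: position + 1 for position, name in enumerate(ordered)}


def aggregate_ranking(errors_by_index):
    """Rank models per index, sum the ranks, then rank the sums again."""
    rank_sums = {}
    for errors in errors_by_index.values():
        for name, rank in rank_models(errors).items():
            rank_sums[name] = rank_sums.get(name, 0) + rank
    return rank_models(rank_sums)


# Hypothetical errors: each single model is best for one index and poor
# for another, while the MMM is always second.
errors_by_index = {
    "index_1": {"model_a": 0.1, "model_b": 0.9, "model_c": 0.5, "MMM": 0.3},
    "index_2": {"model_a": 0.9, "model_b": 0.1, "model_c": 0.5, "MMM": 0.3},
    "index_3": {"model_a": 0.5, "model_b": 0.9, "model_c": 0.1, "MMM": 0.3},
}
final_ranking = aggregate_ranking(errors_by_index)
print(final_ranking)
```

In this constructed example the MMM never ranks first for a single index, yet it ranks first after aggregation because it has no poor rank to compensate for.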
6. Future Projections
 Once the models performing best for a given regional feature are identified, the question arises whether these models will still be the best performing ones in the future. The assumption that the models simulating the present climate accurately will also simulate well the future climate is often made [e.g., Tebaldi et al., 2004]. While it is impossible for obvious reasons to perform a model evaluation with feature-based metrics for the future to check if this assumption is correct, investigating the convergence of the models on future predictions can partly answer this question. If a subset of models (chosen based on agreement with observations) shows considerably smaller spread, then the observations can be regarded as useful to distinguish between models. This is equivalent to a correlation between biases in the present-day simulation and the predicted change. The assumption is that such correlation is not just an artifact of all models making similar assumptions, but rather that it reflects an underlying physical process or feedback that influences both the base state of a model as well as the simulated change. In practice such correlations unfortunately are relatively low in many cases [Knutti et al., 2010b; Whetton et al., 2007], probably partly as a consequence of the observations being used already in the model development process.
The time evolution of the absolute values and the anomalies of the eight precipitation and temperature indices for the 100 year period 1980–2079 is shown in Figures 5 and 6 using an 11 year average. For the precipitation indices, the time series of GPCP and CMAP for 1980–2004 are also presented. Similarly, the index time series of ERA-40 are shown alongside the modeled index time series of temperature. In addition, for each variable and each index, the five models performing best are identified and represented by dark blue lines. The model spread at the end of the period (2079) for all models and for the five best models is represented by an error bar at the right of each panel. The error bars represent the mean value ±1 standard deviation.
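The two operations described above, an 11 year average and a spread expressed as mean ±1 standard deviation, can be sketched as follows. The function names, the use of a moving window, and the choice of the sample standard deviation are assumptions of this sketch; the text does not specify these details.

```python
from statistics import mean, stdev


def running_mean(series, window=11):
    """Moving average over `window` consecutive values.

    Only fully covered windows are returned, so the output is
    window - 1 values shorter than the input.
    """
    return [mean(series[i:i + window]) for i in range(len(series) - window + 1)]


def spread(values):
    """Return (mean - 1 std, mean, mean + 1 std).

    The sample standard deviation is used here (an assumption).
    """
    centre = mean(values)
    sd = stdev(values)
    return centre - sd, centre, centre + sd


# Hypothetical annual index values for one model over 20 years.
smoothed = running_mean(list(range(20)))

# Hypothetical end-of-period index values for all models and a best subset.
all_models_end = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
best_models_end = [2.0, 3.0, 4.0]
print(spread(all_models_end))
print(spread(best_models_end))
```

Comparing the two tuples corresponds to comparing the two error bars at the right of each panel.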
6.2. Results and Discussion
Figure 5 shows the modeled absolute and anomaly time series of each precipitation index from 1980–2079 along with the observed absolute and anomaly time series from 1980–2004. The model spread is large for all indices. For example, in the case of the absolute values of the ASI, the projections vary by a factor of 5, which makes it difficult to give clear statements about future precipitation amounts in this region. The reason for the average performance of the MMM in the index ranking presented above becomes evident by looking at the time series. The MMM by definition lies in the middle of the model spread, while for most indices the observational data sets lie at one end of the model spread. Many models have similar biases and averaging models therefore does not reduce the biases, which explains why the MMM cannot perform best. In addition, the individual models capture the natural variability of regional precipitation patterns better than the MMM. This is because averaging all 24 CMIP3 models to construct the MMM automatically removes natural variability.
As already mentioned, regional trends over the relatively short observational period (25 years) are often dominated by natural variability, which is why the evaluation is only performed on the ability of the models to simulate the index mean value. Still, it is central that climate models are able to correctly simulate the trends. A source of concern in the case of the HLI is the inability of most CMIP3 models to reproduce the DJF precipitation decrease during the observational period. Further, the discrepancies between the two observational data sets CMAP and GPCP are very large for the HLI and STI. While for the rmse/corr ranking these differences have only a marginal influence (not shown), using CMAP as the reference data set for the feature-based ranking described above will lead to different outcomes. In certain regions the identification of the best models is therefore currently ambiguous, partly because of uncertainties in the observational data sets. The implication is that the difficulties in defining model performance are not only a problem of agreeing on a metric but are also seriously limited by observational uncertainties. This underscores the need for continuous, global and homogeneous observations at high resolution.
Considering only the five best models for each index narrows the range of predicted absolute values (dark blue lines in Figure 5), as expected. However, if anomalies are considered (see right-hand plots in Figures 5a–5h), the model spread is reduced for only three indices (AFI, HLI and STI; see error bars in Figure 5); it remains approximately the same for the AUI and CAI and even increases for the AMI, ASI and MEI. The results of the temperature index time series are shown in Figure 6. In contrast to precipitation, the signal of temperature change dominates the natural variability and model agreement is larger. However, in terms of anomalies, only a minority of indices (AMI and HLI) see a reduction of model spread.
The way the MMM was calculated in this study can be referred to as an “equal weighting” because each model has one “vote.” A more sophisticated approach consists in assigning more weight to the “good” models. Several studies see some improvement in future projections when using “optimum weighting” approaches [e.g., Perkins and Pitman, 2009; Räisänen et al., 2010]. On the other hand, Santer et al. find that an “optimum weighting” does not affect the results of their detection and attribution study for water vapor. For the feature-based metrics presented here, applying an “optimum weighting” to the models according to the ranking presented in section 6.1 will likely reduce the uncertainty for only a few indices, and these indices differ between precipitation (AFI, HLI and STI) and temperature (AMI and HLI). Moreover, an “optimum weighting” would keep the model uncertainty constant or even increase it for the rest of the indices. In addition, the differences in model spread found between the five best models and all models are highly time dependent: calculating the standard deviation for the year 2059 or 2099 would have led to slightly different outcomes in terms of the indices showing a reduction of spread, but the conclusion would remain the same. A further critical issue is the sampling of small subsets. The standard deviation may also change by picking a random subset of the models, even if the criterion for picking the models has no relevance at all. For 5 out of 24 models, there is a probability of about 5% for the spread (standard deviation) to increase or decrease by 50% or more in a random subset. In other words, at least a 50% change in the spread can be considered significant and very unlikely to arise by chance. Only the AFI for precipitation and the HLI for temperature show such large changes. In most indices the change in the spread after selecting the subset of models is well within what one would expect from randomly picking a subset. The results presented here are in agreement with Weigel et al., who argued that even if for some cases the “optimum weighting” outperforms the “equal weighting,” the risk that the former is worse than the latter is large. In cases where there is currently no agreement on which skill measure to use in order to identify the best models, it is indeed more transparent to weight the models equally. However, Weigel et al. also showed that not considering those models known for lacking key mechanisms needed to provide meaningful projections might be justified in some cases.
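The sampling argument for small subsets can be explored with a simple Monte Carlo experiment. The sketch below is not the calculation used in the study, which is not described in the text. It assumes that the 24 model values are independent draws from a normal distribution, uses the sample standard deviation, and counts changes in either direction; the function name and all parameters are hypothetical. Under different assumptions the estimated probability will differ, so it need not reproduce the value quoted above.

```python
import math
import random


def sample_std(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    n = len(values)
    centre = sum(values) / n
    return math.sqrt(sum((v - centre) ** 2 for v in values) / (n - 1))


def chance_of_large_spread_change(n_models=24, subset_size=5, threshold=0.5,
                                  n_trials=20000, seed=0):
    """Estimate how often a random subset changes the spread by >= threshold.

    Each trial draws a synthetic ensemble from a standard normal
    distribution (an assumption), picks a random subset without
    replacement, and compares the two standard deviations.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        ensemble = [rng.gauss(0.0, 1.0) for _ in range(n_models)]
        subset = rng.sample(ensemble, subset_size)
        ratio = sample_std(subset) / sample_std(ensemble)
        if ratio <= 1.0 - threshold or ratio >= 1.0 + threshold:
            hits += 1
    return hits / n_trials


probability = chance_of_large_spread_change()
print(probability)
```

A subset selected by a performance metric can then be compared against this null distribution: a spread change that random subsets rarely produce is more likely to reflect a real constraint.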
Further, Knutti et al. [2010b] showed that means and trends are generally not well correlated. In the case of the precipitation indices, a significant Pearson correlation coefficient between index mean (1980–2004) and trend (2020–2079) is only found for the STI (ρ = −0.43), an index for which the model uncertainty is reduced when considering only the five best models. For the temperature indices, a significant correlation is found for the ASI (ρ = 0.43) and the CAI (ρ = 0.48), which are indices that do not experience a reduction of model spread when selecting only the five best models. However, given that these correlations are low and that there is no obvious physical explanation for them, they should not be overinterpreted. Rather, the fact that the indices with significant correlations between means and trends do not always correspond to the indices with a reduction of model spread when considering the five best models indicates that feature-based metrics are not more useful than other metrics for reducing the uncertainty in terms of anomalies. Nevertheless, many end users are interested in absolute precipitation amounts, and in this case feature-based metrics are a simple way to identify the models that have some skill in a region and also those that obviously have no skill.
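The correlation between index mean and trend across models can be sketched as follows. The trend is estimated here as an ordinary least-squares slope, which is an assumption since the text does not state how trends were computed; the function names and the synthetic data are hypothetical, and no significance test is included.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def linear_trend(years, values):
    """Least-squares slope of values against years (units per year)."""
    n = len(years)
    my, mv = sum(years) / n, sum(values) / n
    num = sum((y - my) * (v - mv) for y, v in zip(years, values))
    den = sum((y - my) ** 2 for y in years)
    return num / den


def mean_trend_correlation(series_by_model, years,
                           mean_period=(1980, 2004), trend_period=(2020, 2079)):
    """Correlate each model's present-day mean with its future trend."""
    means, trends = [], []
    for series in series_by_model.values():
        base = [v for y, v in zip(years, series)
                if mean_period[0] <= y <= mean_period[1]]
        future = [(y, v) for y, v in zip(years, series)
                  if trend_period[0] <= y <= trend_period[1]]
        means.append(sum(base) / len(base))
        trends.append(linear_trend([y for y, _ in future], [v for _, v in future]))
    return pearson(means, trends)


years = list(range(1980, 2080))
# Synthetic ensemble in which wetter models dry faster (illustration only).
series_by_model = {
    f"model_{k}": [base - 0.01 * base * (y - 1980) for y in years]
    for k, base in enumerate([2.0, 3.0, 4.5, 6.0])
}
r = mean_trend_correlation(series_by_model, years)
print(round(r, 3))
```

The synthetic ensemble is constructed so that mean and trend are perfectly anticorrelated; real model output would typically give much weaker correlations, as reported above.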
Finally, it should be noted that the interpretation of the MMM has been the subject of some debate. In particular, different interpretations of model independence, model robustness and of the ensemble of models itself are possible and lead to different interpretations of future model uncertainty [Pirtle et al., 2010; Knutti et al., 2010b, 2010a; Annan and Hargreaves, 2010]. On the one hand, climate models can be considered as “random samples from a distribution of possible models centered around the true climate” [Jun et al., 2008]. Consequently, when averaging all models to construct the MMM, the errors are expected to decrease and the MMM to approach the truth [Tebaldi et al., 2004]. On the other hand, the statistically indistinguishable ensemble paradigm is an alternative way to interpret ensembles, in which the truth is a sample from the same distribution as each model of the ensemble [e.g., Tebaldi and Sanso, 2009]. Annan and Hargreaves compared both paradigms and found that the CMIP3 ensemble generally provides a good sample under the statistically indistinguishable paradigm. Assessing the statistical nature of the CMIP3 ensemble is beyond the scope of this study. However, the results from section 3, as well as those for the eight feature-based metrics for precipitation, suggest that the ensemble of models is not centered around the truth but appears biased. Therefore, the MMM is not closer than any other model to the observations, which seem to be statistically indistinguishable from the ensemble members. For temperature, the CMIP3 ensemble also appears biased but to a smaller extent than for precipitation. Nevertheless, there is a need for further studies focusing on how to interpret results from multiple models.
 The motivation for ranking the models is to specify which one(s) can provide the most reliable projections. Until now, model simulations have often been evaluated with statistical measures and on large spatial scales, where the MMM was found to perform best. As an alternative evaluation method, we provide eight feature-based performance metrics for precipitation and temperature. Feature-based metrics are designed to capture a robust signal of change in a particular variable that can be explained physically. As a first step, the causes behind the large projection uncertainty for precipitation are investigated. In large regions of the world, differences between the models contribute more to the total spread in projections than internal variability. However, agreement in the sign of trend among several runs of the same model is only slightly larger than among different models, indicating that even if differences between models are reduced, internal variability will still cause a large lack of agreement in precipitation projections.
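Agreement in the sign of the trend among a set of simulations can be quantified in several ways. The sketch below uses one simple choice, the fraction of simulations that share the majority sign, which is an assumption and not necessarily the measure used in the study; the function name and the trend values are hypothetical.

```python
def sign_agreement(trends):
    """Fraction of simulations whose trend has the majority sign.

    Zero trends count toward neither sign.
    """
    positive = sum(1 for t in trends if t > 0)
    negative = sum(1 for t in trends if t < 0)
    return max(positive, negative) / len(trends)


# Hypothetical precipitation trends at one location (arbitrary units).
runs_of_one_model = [0.2, 0.1, -0.3, 0.4]
different_models = [0.3, -0.2, -0.1, 0.5]
print(sign_agreement(runs_of_one_model))
print(sign_agreement(different_models))
```

If the value for several runs of the same model is only slightly higher than the value across different models, internal variability rather than model physics limits the agreement.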
For the regional feature-based metrics, the models performing best are different for each region and variable, and the choice of the observational data set is important in the case of precipitation. Averaging the models is more effective on aggregated metrics than on small scales and features. This is illustrated by the fact that when summarizing the performances of the models for all indices and both variables, the MMM ranks better than for each index individually. When the performances for the feature-based metrics are summarized, they correlate with the ranking obtained with statistical measures of errors and with the global field-based measures of Reichler and Kim. This is in agreement with earlier studies [Boer, 1993; Gleckler et al., 2008; Pincus et al., 2008], which find that the MMM outperforms any individual model if enough metrics or grid points are evaluated and aggregated. We also tested a further way of ranking the models based on their ability to simulate the spatial correlation between the mean precipitation and temperature pattern in the index regions (not shown). It was found that the performance of the MMM for this regional correlation ranking is between its performance in the index ranking and the corr ranking for both precipitation and temperature. The MMM ranked first in ∼35% of the cases, which is better than for the index ranking, where it ranks average, but worse than for the corr ranking, where it clearly ranks first. These findings confirm our hypothesis that the more grid cells, metrics or variables are aggregated, the better the performance of the MMM becomes.
In a second part, the convergence of the projections of the best performing models for each index is investigated. On the one hand, and in particular for precipitation, the projections of the five best models in terms of absolute values appear more realistic than those of the models performing below average since, for most indices, the observations lie at one end of the model spread. However, when considering the anomalies, it is found that regardless of the variable, the majority of the indices see no reduction or even an increase in future uncertainty. These results suggest that on a regional scale, weighting the models might improve the projections only in a few cases. In the absence of a process-based argument, given the small number of existing models and the chosen subsets of 5 models, only a reduction in model spread by more than 50% is an indication of a successful constraint (see section 6.2). Model weighting should therefore be performed carefully. Our results tend to support previous findings showing that a good performance in the present does not guarantee skill in the future [Jun et al., 2008; Reifen and Toumi, 2009]. On the other hand, there are a few cases where past and future performance in models are clearly related and physically well understood, for example, past greenhouse gas attributable warming scaling linearly with future transient greenhouse gas warming [Allen and Ingram, 2002; Stott et al., 2006]. Such relationships are routinely used and widely accepted to constrain or calibrate projections with simple and intermediate complexity models [e.g., Knutti et al., 2002; Forest et al., 2002; Meinshausen et al., 2009]. Another prominent example is the Arctic, where models underestimating past sea ice decline also show much weaker sea ice loss in the future [Boe et al., 2009b] and where performance in simulating the current Arctic climate is related to projected future response in that region [Boe et al., 2009a; Mahlstein and Knutti, 2011]. In such obvious cases we argue that observed evidence should not be ignored when synthesizing models.
Evaluating the models is a central task in climate science. The current lack of agreement on a standard way to perform an evaluation reflects the fact that, on the one hand, the connection between present-day and future performance is poorly understood and, on the other hand, the appropriate evaluation depends on the purpose. While hydrologists need assessments of the best performing models on a regional scale and primarily for precipitation and temperature, some model developers are more interested in summarizing the performance of climate models for many variables and over all regions of the globe, as for example in work by Reichler and Kim. For specific applications and predictions, defining metrics not only based on mean biases but also on regional or temporal characteristics (e.g., distributions of daily rainfall) or on physical processes [e.g., Eyring et al., 2005] may be more promising. It is evident that the index ranking presented here is partly subjective due to the choice of the eight indices. The indices should therefore be regarded as examples and, depending on the purpose, other sets of indices can be defined. We also point out that the results are at most valid for precipitation and temperature and do not allow for any evaluations of the model performance on other variables or on a global scale. Further considerations of alternative ways of evaluating climate models in order to make best use of their predictions are encouraged.
 We acknowledge the modeling groups, the Program for Climate Model Diagnosis and Intercomparison (PCMDI) and the WCRP's Working Group on Coupled Modeling (WGCM) for their roles in making available the WCRP CMIP3 multimodel data set. Support for this data set is provided by the Office of Science, U.S. Department of Energy.