Quantifying Progress Across Different CMIP Phases With the ESMValTool

More than 40 model groups worldwide are participating in the Coupled Model Intercomparison Project Phase 6 (CMIP6), providing a new and rich source of information to better understand past, present, and future climate change. Here, we use the Earth System Model Evaluation Tool (ESMValTool) to assess the performance of the CMIP6 ensemble compared to the previous generations CMIP3 and CMIP5. While CMIP5 models did not capture the observed pause in the increase in global mean surface temperature between 1998 and 2013, the historical CMIP6 simulations agree well with the observed recent temperature increase, but some models have difficulties in reproducing the observed global mean surface temperature record of the second half of the twentieth century. While systematic biases in annual mean surface temperature and precipitation remain in the CMIP6 multimodel mean, individual models and high‐resolution versions of the models show significant reductions in many long‐standing biases. Some improvements are also found in the vertical temperature, water vapor, and zonal wind speed distributions, and root‐mean‐square errors for selected fields are generally smaller with reduced intermodel spread and higher average skill in the correlation patterns relative to observations. An emerging property of the CMIP6 ensemble is a higher effective climate sensitivity with an increased range between 2.3 and 5.6 K. A possible reason for this increase in some models is improvements in cloud representation resulting in stronger shortwave cloud feedbacks than in their predecessor versions.


Introduction
Climate model simulations have been coordinated as part of the World Climate Research Programme's (WCRP) Coupled Model Intercomparison Project (CMIP) since the early 1990s (Eyring, Bony, et al., 2016; Meehl et al., 1997, 2000, 2005, 2007; Taylor et al., 2012). The main objective of CMIP is to better understand past, present, and future climate variability and change arising from natural, unforced variability and in response to changes in radiative forcing in a multimodel context. Model simulations are defined and carried out by the participating modeling groups under common forcings and forcing scenarios. CMIP not only defines common model simulations but also aims at making a wide range of model output available to the research community in order to better learn from a large model ensemble. In this sense, CMIP3 marked a paradigm shift in the climate science community by making model output from state-of-the-art climate change simulations broadly accessible (Meehl et al., 2007). CMIP model simulations and the associated publications analyzing the multimodel data set constitute the state of climate science and have thus been assessed in the Intergovernmental Panel on Climate Change (IPCC) Assessment Reports. CMIP3 supported the Fourth Assessment Report (AR4) (Solomon et al., 2007).
The next phase of CMIP was CMIP5, which comprised an integrated set of experiments (Taylor et al., 2012), was used in numerous peer-reviewed studies, and provided the basis for the IPCC Fifth Assessment Report (AR5) (Stocker et al., 2013). Flato et al. (2013) pointed out that there were significant variations in skill across the CMIP5 ensemble when measured against reanalyses and observations, and that some systematic biases remained over several generations of CMIP model ensembles. The current phase of CMIP, CMIP6 (Eyring, Bony, et al., 2016), therefore includes the quantification and understanding of systematic biases as one of its three central scientific questions. There are 23 CMIP6-Endorsed Model Intercomparison Projects (MIPs) that define an additional set of simulations targeting other specific scientific questions. For example, the new High-Resolution Model Intercomparison Project (HighResMIP; Haarsma et al., 2016), which we also evaluate here, assesses the robustness of improvements in the representation of important climate processes using weather-resolving global model resolutions (∼25 km or finer).
An important question to answer is how these new simulations compare to previous generations of CMIP ensembles, and whether systematic biases detected earlier are reduced or remain. A thorough assessment of the models' skill in reproducing observed past and present climate is also an essential prerequisite to assess and interpret the results from model projections (Eyring et al., 2019). Known systematic biases include (i) a too strong intertropical convergence zone (ITCZ) in the Southern Hemisphere (SH) which often persists through the seasonal cycle; problems simulating the Walker circulation and the associated dry Amazon bias also seen in many models; (ii) poor simulation of tropical and subtropical low-level clouds, particularly the persistent stratocumulus decks over the eastern parts of ocean basins, which are related to too warm sea surface temperatures (SSTs); (iii) an overly deep tropical thermocline; (iv) the tendency to simulate land surfaces too warm and too dry during summertime, and (v) the northward shift in the position of the SH atmospheric jet which leads to poor simulation of the surface wind stress over the Southern Ocean and to errors in the vertical structure of the water masses in the Southern Ocean (Stouffer et al., 2017).
Although not sufficient on its own, a systematic evaluation of model results through comparison with observations and reanalysis data is commonly seen as an important prerequisite for building confidence in the models' future climate projections (Flato et al., 2013). This more general assessment of model performance is complemented by approaches that use observations to constrain the uncertainty in multimodel projections or feedbacks (Eyring et al., 2019). This aspect is not covered here and requires further analysis of how the different generations of CMIP ensembles compare in this respect. Some initial studies exist: for example, Schlund et al. (2020) compare emergent constraints on effective climate sensitivity (ECS) between CMIP5 and CMIP6, and Tokarska et al. (2020) constrain future warming based on the ability of the models to reproduce past temperature trends.
Many CMIP6 modeling groups have already reported improvements in their model's ability to simulate past and present-day climate compared to their CMIP5 predecessor versions (Danabasoglu et al., 2020; Gettelman et al., 2019; Mulcahy et al., 2018; Swart et al., 2019). Typically, model developments consist of including more detailed Earth system processes as well as improvements in existing parameterizations or higher horizontal and vertical resolution. However, a systematic assessment of the CMIP6 ensemble in comparison to previous generations is still missing. In order to evaluate how well the models perform, we compare the performance of the CMIP3, CMIP5, and CMIP6 ensembles by evaluating the historical simulations forced by common boundary conditions in each phase against observations or reanalysis data. We apply the Earth System Model Evaluation Tool (ESMValTool) Version 2 (Righi et al., 2020) for a consistent assessment of the CMIP ensembles across phases. The ESMValTool is a community-developed open-source diagnostic and performance metrics tool to evaluate Earth system models (ESMs) with observations and reanalysis data.
Section 2 describes the model ensembles and observations used in this study as well as the ESMValTool. The surface temperature record of the three model ensembles CMIP3, CMIP5, and CMIP6 is discussed in section 3, and the multimodel mean biases of some important climate variables such as temperature and precipitation and of selected meteorological variables such as zonal wind are compared across the model ensembles in section 4. Section 5 gives an overview of the general model performance in comparison with observations based on performance metrics and pattern correlations. In section 6 we discuss the high ECS in some CMIP6 models compared with results from previous phases of CMIP, and we close with a summary in section 7.

Models and Observations
In this study we use model simulations from CMIP Phases 3, 5, and 6, organized by the WCRP CMIP Panel under the auspices of the Working Group on Coupled Modelling (WGCM). The model data (see Tables 1-3) from CMIP3, CMIP5, and CMIP6 are freely available on servers of the Earth System Grid Federation (ESGF), which is an international collaboration that manages the decentralized database of CMIP output. In order to assess the models' improvements in reproducing observed climate parameters, we use several observational data sets and reanalyses summarized in Table 5.
In CMIP5 most of the models had a higher spatial resolution than the CMIP3 models, with 0.5° to 4° for the atmosphere component and 0.2° to 2° for the ocean component (Taylor et al., 2012). In CMIP6 the spread of the models' spatial resolutions shifts again to finer grids. For the first time, a new "input4MIPs" activity (https://pcmdi.llnl.gov/mips/input4MIPs/) has been initiated in CMIP6 to encourage adoption of common data standards and to create an archive of the forcing data sets and boundary conditions needed for the CMIP6 simulations, available via ESGF. Many of the new forcing data sets are improved versions of the ones used in CMIP5 (see summary available at http://goo.gl/r8up31).

CMIP6
CMIP6 consists of three main elements (Eyring, Bony, et al., 2016): (1) a set of common experiments, the DECK (Diagnostic, Evaluation and Characterization of Klima) and CMIP historical simulations (1850 to near present) that are used here to document the basic characteristics of the models across different phases of CMIP; (2) common standards, coordination, infrastructure, and documentation that facilitate the distribution of model output and the characterization of the model ensemble; and (3) an ensemble of CMIP-Endorsed MIPs that are specific to a particular phase of CMIP (now CMIP6) and that build on the DECK and CMIP historical simulations to address a large range of specific scientific questions and help fill the scientific gaps of previous CMIP phases. CMIP6 models have an increased number of degrees of freedom because they include more processes and couplings, primarily aimed at better simulating future feedbacks (e.g., nitrogen effects on terrestrial carbon uptake or permafrost processes). In this study we use the CMIP6 carbon dioxide (CO2) concentration-driven historical simulations (historical) over the time period 1850-2014 (Table 1). The common forcing data sets defined for the CMIP6 historical simulations are largely based on observations and include land-use changes (Ma et al., 2019), emissions and concentrations of long-lived greenhouse gases (Meinshausen et al., 2017) and of short-lived species (Hoesly et al., 2018), stratospheric aerosol from volcanoes (Zanchettin et al., 2016), biomass burning emissions (van Marle et al., 2017), and solar forcing (Matthes et al., 2017). It should, however, be noted that forcings in the model simulations can differ according to the complexity of the model. For example, some models are forced with a parameterization of anthropogenic aerosol optical properties and an associated Twomey effect, while others treat aerosols interactively and therefore prescribe emissions of aerosols and their precursors instead (Hoesly et al., 2018). We only consider one ensemble member per model ("r1i1p1f1," if available).
Table 3 (fragment of the CMIP3 model list). National Center for Atmospheric Research (NCAR), USA: Collins et al. (2006); Washington et al. (2000). ukmo_hadcm3, ukmo_hadgem1: Hadley Centre for Climate Prediction and Research/Met Office, UK: Gordon et al. (2000); Martin et al. (2006); Pope et al. (2000).
In order to calculate ECS, we also use the simulations forced by an abrupt quadrupling of CO2 (abrupt-4xCO2) and the preindustrial control simulations (piControl).
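How ECS is derived from these two experiments is not spelled out at this point in the text. As a minimal sketch, assuming the widely used linear regression approach (often called the Gregory method), the annual global mean top-of-atmosphere net radiation anomaly of the abrupt quadrupling run is regressed on the corresponding near-surface temperature anomaly, both taken relative to piControl. The function name, the variable names, and the synthetic input data below are illustrative assumptions and are not taken from the ESMValTool.

```python
import numpy as np


def gregory_ecs(delta_t, delta_n):
    """Estimate ECS by linear regression (assumed Gregory-type method).

    delta_t: annual global mean near-surface temperature anomaly (K),
             abrupt CO2 quadrupling run minus piControl
    delta_n: annual global mean TOA net downward radiation anomaly (W m-2)
    Returns (ecs, forcing_4x, feedback).
    """
    delta_t = np.asarray(delta_t, dtype=float)
    delta_n = np.asarray(delta_n, dtype=float)
    # Ordinary least squares fit: delta_n = feedback * delta_t + forcing_4x
    feedback, forcing_4x = np.polyfit(delta_t, delta_n, 1)
    # The x-intercept is the equilibrium warming for 4xCO2;
    # dividing by 2 gives the value for a CO2 doubling.
    ecs = -forcing_4x / (2.0 * feedback)
    return ecs, forcing_4x, feedback


# Synthetic, purely hypothetical 150-year series for illustration
years = np.arange(150)
delta_t = 7.4 * (1.0 - np.exp(-years / 30.0))   # K
delta_n = 7.4 - 1.0 * delta_t                   # W m-2
ecs, forcing_4x, feedback = gregory_ecs(delta_t, delta_n)
print(f"ECS = {ecs:.2f} K, F_4x = {forcing_4x:.2f} W m-2, "
      f"lambda = {feedback:.2f} W m-2 K-1")
```

With real model output, the regression would be applied to anomalies from which any piControl drift has been removed; that step is omitted here.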

CMIP5
For CMIP5 (Taylor et al., 2012), we use the results from up to 48 models (Table 2) for the historical simulations depending on data availability for a specific variable. The historical simulations are twentieth-century simulations covering the time period 1850-2005 and are performed using the then best available record of natural and anthropogenic climate forcing (Cionni et al., 2011; Lamarque et al., 2010). In case there are multiple ensemble members available for a given model, we only consider the first ensemble member "r1i1p1" in our analysis. Again, we use the idealized abrupt four times CO2 and the preindustrial control simulations to calculate ECS from the models.

CMIP3
The CMIP3 model simulations analyzed are the twentieth century runs (1860-1999) with natural and anthropogenic forcings (20C3M experiments). Again, in case there are multiple ensemble members available for a given model, we only analyze the first ensemble member "run1." In total, there are up to 22 CMIP3 models considered in our analyses depending on data availability for a specific variable (Table 3).

HighResMIP
HighResMIP (Haarsma et al., 2016) applies, for the first time, a multimodel approach to systematically investigate the impact of horizontal resolution on the results of global ESMs. A coordinated set of experiments over the time period 1950-2014 has been designed to assess both a standard and an enhanced horizontal resolution simulation in the atmosphere and ocean of each participating model (Table 4). To make the highest-resolution models computationally affordable, some compromises were necessary. The experiment design incorporates only a short (30-50 year) spin-up from 1950 initial conditions before the control and historic-future simulations. Therefore, a direct comparison to the CMIP6 historical simulations that start in 1850 is not always possible. In this study, we therefore compare the lower-resolution and high-resolution model versions within HighResMIP, both starting in 1950, in order to assess possible improvements due to higher horizontal model resolution. In HighResMIP, physical models with few ESM components are used, and the aerosol optical properties are specified over time using the MACv2-SP scheme.

Observations and Reanalysis Data
The observations and reanalysis data used for the model evaluation and the assessment of the progress made during the different phases of CMIP are summarized in Table 5, including the type of observation, variables used, time period covered, and main reference(s). Where available, we use observational data sets from the observations for Model Intercomparison Projects (obs4MIPs; Ferraro et al., 2015; Waliser et al., 2020), which can be downloaded freely from the ESGF and, because they are provided in the same file format as the output from the CMIP6 models including all relevant metadata, can be used directly with the ESMValTool.
Table 4 note. In each column the name of the high-resolution version (first line) and the corresponding low-resolution version (second line) is given.

ESMValTool
The ESMValTool is a community diagnostics and performance metrics tool specifically developed for the evaluation of ESMs contributing to CMIP (Righi et al., 2020). ESM results from single or multiple models can be compared with their predecessor versions and against observations. The diagnostics available in the ESMValTool cover a wide range of scientific themes focusing on selected essential climate variables, a range of known systematic biases common to ESMs, meteorology, clouds, tropospheric aerosols, ocean variables, land processes, etc. All diagnostics are grouped in sets of standard "recipes" for each scientific topic, reproducing diagnostics or performance metrics that have demonstrated their importance in ESM evaluation in the peer-reviewed literature. The main aim of the ESMValTool is to facilitate and improve ESM evaluation beyond the state of the art and to support activities within CMIP and at individual modeling centers. This includes the provision of well-documented diagnostics and source code as well as ensuring reproducibility and traceability of the results (provenance). The ESMValTool is an open-source project and can be found on GitHub at https://github.com/ESMValGroup/ESMValTool, with contributions from the community very welcome. Contributions could include, but are not limited to, documentation improvements, bug reports, new or improved diagnostic code, scientific and technical code reviews, infrastructure improvements, mailing list and chat participation, community help/building, education, and outreach. For more information on contributing to the ESMValTool, general guidelines, code style, etc., we refer to the ESMValTool user's guide available at https://docs.esmvaltool.org. A general overview of the ESMValTool is given by Eyring, Righi, et al. (2016), technical details of the latest version (v2.0) can be found in Righi et al. (2020), and diagnostics and metrics newly added to v2.0 are described in three companion papers (Eyring et al., 2020; Lauer et al., 2020; Weigel et al., 2020). The ESMValTool is fully integrated into the ESGF infrastructure at the Deutsches Klimarechenzentrum (DKRZ), where all the model output and the observations are stored in a local replica and the tool is run. All diagnostics used for this paper will be made available in the ESMValTool after acceptance of this publication, and the figures can be reproduced with the newly added recipe "recipe_bock20jgr.yml."

Surface Temperature Record
Figure 1 shows the time series of anomalies in annual global mean near-surface temperature simulated by CMIP3, CMIP5, and CMIP6 models. The time period 1850-1900 has been used as the reference period to calculate the temperature anomalies (1870-1900 for CMIP3 models starting in 1870). The reference data set for comparison with the models is HadCRUT4 (Morice et al., 2012). In general, the models of all CMIP phases are able to reproduce the observed temperature record reasonably well, within a range of ±0.9°C, showing an increase in globally averaged annual mean near-surface temperature since 1850, including an accelerated warming beginning in the 1970s and temporary cooling following large volcanic eruptions such as Krakatoa in 1883 or Agung in 1963. The temperature changes since the late nineteenth century are driven by a number of factors, including increasing atmospheric greenhouse gas concentrations, changes in aerosol amounts, changes in solar activity, volcanic eruptions, and changes in land use. Natural variability also plays an important role, particularly on shorter timescales, for example in the observed slowdown ("hiatus") in the rate of global surface warming during the time period 1998-2013, although ocean heat content continued to increase over the same period (Yin et al., 2018).
The CMIP3 multimodel mean already captured the observed surface temperature change quite well, with a warming for the years 1990 to 1999 in the range of 0.45°C to 0.73°C compared to 0.38°C to 0.74°C for the observations, even though some outliers lead to a rather large intermodel spread. A similarly large spread exists in mean absolute temperatures simulated by CMIP3 models, and that spread persists in CMIP5 and CMIP6 (see insets in Figure 1). Figure 2 shows the intermodel spread of the three CMIP ensembles as ±1 standard deviation around the multimodel means in comparison to the uncertainty estimates of the global temperature anomalies from HadCRUT4. The observed uncertainty estimates are the 5% and 95% percentiles of the confidence interval of the combined effects of uncertainties from measurement and sampling as well as bias and coverage (Morice et al., 2012). All models have been sampled according to the temporal and spatial data availability from HadCRUT4 and therefore include similar sampling and coverage uncertainties as the observations. The intermodel spread for temperature anomalies, which are less uncertain in observations than absolute values (P. D. Jones et al., 1999), is slightly reduced in CMIP5 and CMIP6, with standard deviations of 0.16°C and 0.17°C, respectively, after the reference period compared to 0.19°C for CMIP3. Particularly from the second half of the twentieth century onward, the intermodel spreads in all CMIP ensembles are larger than the HadCRUT4 uncertainty estimates and do not narrow with time. This suggests that, besides natural variability, model uncertainty is an important contribution to the intermodel spread in all three CMIP phases. Since the intermodel spread does not change substantially among the different CMIP phases, this further suggests that model uncertainties remain important factors determining the intermodel spread throughout the observed time period.
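The anomaly and spread calculation described above can be sketched in a few lines. This is a minimal illustration, assuming annual global mean temperatures are already available as a (model, year) array; the function names and the synthetic data are hypothetical, and the subsampling with the HadCRUT4 data mask is not reproduced here.

```python
import numpy as np


def anomalies(series, years, ref_start=1850, ref_end=1900):
    """Subtract each model's own mean over the reference period (inclusive)."""
    series = np.asarray(series, dtype=float)
    years = np.asarray(years)
    in_ref = (years >= ref_start) & (years <= ref_end)
    ref_mean = np.nanmean(series[..., in_ref], axis=-1, keepdims=True)
    return series - ref_mean


def ensemble_spread(anoms):
    """Multimodel mean and the +/-1 standard deviation envelope per year."""
    mean = np.nanmean(anoms, axis=0)
    std = np.nanstd(anoms, axis=0, ddof=1)
    return mean, mean - std, mean + std


# Three hypothetical models with different absolute temperatures (deg C)
# but a shared warming trend plus random year-to-year variability
years = np.arange(1850, 2015)
rng = np.random.default_rng(0)
offsets = np.array([[13.5], [14.0], [14.6]])
trend = 0.006 * (years - 1850)
tas = offsets + trend + 0.1 * rng.standard_normal((3, years.size))

anoms = anomalies(tas, years)
mean, lower, upper = ensemble_spread(anoms)
```

Taking anomalies removes the differences in absolute temperature between the models, which is why the spread in the anomalies can be much smaller than the spread in the absolute values shown in the insets of Figure 1.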
There were discussions focusing on the observed reduction in the rate of surface warming during the hiatus period, which was apparently not reproduced by the CMIP5 models (Flato et al., 2013; Meehl et al., 2014). It has subsequently been shown that the slowdown in the rate of global warming in the early 2000s was likely predominantly due to internal variability from the negative phase of the Interdecadal Pacific Oscillation (IPO) (England et al., 2014; Fyfe et al., 2016; Xie & Kosaka, 2017), with some contributions from aerosol forcing from a collection of moderate-sized volcanic eruptions (Santer et al., 2015) and perhaps partly from anthropogenic aerosol forcing (D. M. Smith et al., 2016), though such a role for anthropogenic aerosols is still being debated (Oudar et al., 2018). Thus, a multimodel mean of uninitialized climate models, averaged across multiple ensemble members to remove the effects of internal variability, cannot be expected, by definition, to reproduce a phase of internal variability in the single realization of the observations. However, a small number of CMIP5 model realizations were, by chance, able to simulate an internally generated slowdown at the same time as in the observations, and those simulations were also characterized by a negative phase of the IPO. This strongly suggests that the models do indeed include the processes that can produce decadal slowdowns or accelerations, but this presents a challenge for interpreting multimodel ensemble averages when comparing to observed decadal-timescale variability from the single realization of the observations. As the historical CMIP6 simulations extend beyond the hiatus period, we find that the multimodel mean and the observed temperature record converge again until the year 2014. However, the CMIP6 multimodel mean tends to simulate reduced warming over the period 1950-1990 (with a mean bias of −0.07°C), which is probably at least partly related to an overestimation of the cooling in response to large increases in anthropogenic emissions of primary aerosol and precursors in the 1950s in some models (Dittus et al., 2020; Flynn & Mauritsen, 2020; Hoesly et al., 2018). The lack of simulated warming in that period (Figure 1) could be caused by a high aerosol effective radiative forcing (ERF) in these models. Dittus et al. (2020) support that explanation by varying the strength of aerosol ERF in the CMIP6 version of the HadGEM3 climate model. They find that temperature trends over the period 1951-1980 are significantly more sensitive to the strength of aerosol ERF than those in the 30 previous and following years, when temperature trends were driven by greenhouse gas increases. Aerosol ERF measures imbalances in the Earth's energy budget due to anthropogenic aerosols, including aerosol-radiation interactions and aerosol-cloud interactions and their rapid adjustments (Sherwood et al., 2015). Several models reduced the strength of their simulated aerosol radiative forcing during their development phase to ensure that total anthropogenic radiative forcing remained positive (Danabasoglu et al., 2020; Mulcahy et al., 2018).
Figure 1 (caption, beginning missing). ... (Morice et al., 2012). All models have been subsampled using the HadCRUT4 observational data mask (see Jones et al., 2013). Inset: The global mean surface temperature for the reference period 1850-1900 of the subsampled fields. CMIP6 models marked with an asterisk are either tuned to reproduce observed warming directly, or indirectly by tuning equilibrium climate sensitivity.
Potentially as a result of overly sensitive aerosol-cloud-radiation coupling, individual CMIP6 models may underestimate the observed global temperature anomalies in the 1960s to 1980s by up to 0.5°C, while being much closer to the observations during the rest of the historical period. By correlating each model's aerosol ERF for 2014 (C. J. Smith et al., 2020) with its simulated warming trend between 1945 and 1970, we find some evidence to support the hypothesis that CMIP6 models with particularly strong negative aerosol forcing show a larger surface cooling trend in the mid to late twentieth century, with this relationship clearest when temperature trends for the Northern Hemisphere (NH) extratropics are considered. We note that the C. J. Smith et al. (2020) aerosol ERF for 2014 is not always representative of the aerosol ERF experienced by models over the time period 1945-1970 because models could have different aerosol ERF histories. We do not, however, expect this to have a large impact on the strength or sign of the relation found between aerosol ERF and temperature trend, as preliminary results from the RFMIP piClim-histaer simulation suggest that the aerosol ERF values for midcentury and present day typically scale rather similarly among the models. In addition to the forcing itself, details of how individual models respond to this negative forcing also play a role in determining their overall historical temperature record. The very high warming rates in the last part of the twentieth century of some models such as CanESM5 and UK-ESM, as well as their strong cooling after volcanic eruptions, are reflected in very large climate sensitivity values (see further discussion in section 6).
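The correlation between aerosol ERF and the simulated temperature trend can be illustrated as follows. This is only a sketch under stated assumptions: a least-squares linear trend over 1945-1970 and a Pearson correlation across models are assumed choices, and the ERF values and temperature series are made-up numbers for four hypothetical models, not values from C. J. Smith et al. (2020) or from any CMIP6 model.

```python
import numpy as np


def linear_trend_per_decade(years, values):
    """Least-squares linear trend, converted from per year to per decade."""
    slope, _ = np.polyfit(years, values, 1)
    return slope * 10.0


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D samples."""
    return float(np.corrcoef(x, y)[0, 1])


years = np.arange(1945, 1971)

# Hypothetical aerosol ERF for 2014 (W m-2), one value per model
erf = np.array([-0.4, -0.8, -1.2, -1.6])

# Hypothetical temperature anomalies (K) in which a more negative
# ERF goes along with a stronger cooling trend
tas = 0.01 * erf[:, None] * (years - 1945)[None, :]

trends = np.array([linear_trend_per_decade(years, ts) for ts in tas])
r = pearson_r(erf, trends)
print("trends (K per decade):", np.round(trends, 3))
print("correlation across models:", round(r, 3))
```

In this constructed example the relation is exactly linear, so the correlation is 1; with real model output, internal variability and differences in climate response would weaken it.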
When evaluating model simulations of historical temperature change, it is important to keep in mind that good agreement with the long-term twentieth century trend of observed surface temperature changes is expected for models that are directly or indirectly tuned to reproduce observed twentieth-century warming (Hourdin et al., 2017;Mauritsen et al., 2012). Tuning itself means an objective process of parameter estimation to fit a predefined set of observations (Hourdin et al., 2017). However, the tuning is not time-dependent so the decadal variability of the time evolution of global temperature relies on how the models respond to external forcings such as volcanic eruptions, solar variability, and time-evolving anthropogenic aerosols. Thus, there is no significant difference in the multimodel mean anomaly time series of near-surface temperature obtained for models that have been tuned toward the observed warming rates or for models that have not (not shown). The anomaly time series for surface temperature for the tuned models (marked with asterisks in the legend of Figure 1) is too cold in the second half of the twentieth century, just like models that are not tuned to twentieth century warming.

Systematic Biases
Climate models are known to exhibit a number of different and partly long-standing biases in reproducing observed climate (Stouffer et al., 2017). In order to address one of the key scientific questions in CMIP6, "What are the origins and consequences of systematic model biases?" (Eyring, Bony, et al., 2016), a first step is to identify which of the systematic model biases are still present in the CMIP6 historical simulations. A second step is then to assess potential progress and improvements in the models' performance compared with older model generations that contributed to CMIP3 and CMIP5 throughout the last two decades. Here, we are not specifically aiming at tracking the performance of individual models but rather the performance of generations of climate models. We therefore compare multimodel means of CMIP Phases 3, 5, and 6 against observations and against each other in order to identify remaining biases and assess potential progress in reproducing the observed climate state of the last decades.

Surface Temperature
One of the prognostic variables of climate models that is most commonly used and downloaded from the CMIP archives is surface temperature. Figure 3a shows that the CMIP6 multimodel mean is able to simulate the key characteristics of the observed global surface temperature pattern. The dominant feature of the climatology (1995-2014) is the meridional gradient from high temperatures at the equator to low values at the poles. High-elevation regions like the Himalayas, the Andes, or Antarctica are significantly cooler than the latitudinal average temperature. Seasonal changes in temperatures are also generally well reproduced (Flato et al., 2013).
All CMIP ensembles reproduce the large-scale annual mean patterns from the reference reanalysis data set ERA5 quite well (pattern correlations of 0.99 or larger for all CMIP models, see Figure 7). The global mean bias improves from CMIP3 (−0.451°C) to values near zero for CMIP5 and CMIP6 (Figures 3b-3d). The global mean root-mean-square difference (RMSD) also decreases continuously across the CMIP ensembles, but some long-standing biases remain in certain regions (Figures 3b-3d). These biases include surface temperatures that are too high by up to several °C in the upwelling regions of the subtropical oceans. One possible reason for this warm bias is an underestimation of the stratocumulus cloud fraction in these regions. Biases in the high-elevation regions are also still apparent in CMIP6 but typically somewhat smaller than in CMIP3. This also applies to biases along the edge of the North Atlantic sea ice field. The positive temperature bias over the Southern Ocean, however, seems to have gotten worse over time (Hyder et al., 2018), with the CMIP6 multimodel mean showing larger biases than in the two previous CMIP phases. Regional absolute biases in surface temperature of up to 6°C as seen in CMIP5 and some pre-CMIP6 models (Lauer et al., 2018) are still present in CMIP6.
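The global mean bias and RMSD quoted above are summary statistics of the difference field between model and reference. A minimal sketch is given below, assuming both fields are on the same regular latitude-longitude grid and that grid cells are weighted by the cosine of latitude; whether the RMSD in the figures includes the mean bias is not stated in the text, so the uncentered form used here is an assumption. All function names and the synthetic fields are illustrative.

```python
import numpy as np


def area_weights(lat, nlon):
    """Cosine-of-latitude weights on a regular grid, normalized to sum to 1."""
    w = np.cos(np.deg2rad(lat))[:, None] * np.ones((1, nlon))
    return w / w.sum()


def global_bias_rmsd(model, ref, lat):
    """Area-weighted global mean bias and root-mean-square difference."""
    diff = np.asarray(model, dtype=float) - np.asarray(ref, dtype=float)
    w = area_weights(lat, diff.shape[1])
    bias = float(np.sum(w * diff))
    rmsd = float(np.sqrt(np.sum(w * diff**2)))
    return bias, rmsd


# Synthetic 5-degree grid: a reference climatology (K) that cools toward
# the poles and a hypothetical model that is uniformly 0.5 K too warm
lat = np.arange(-87.5, 90.0, 5.0)
nlon = 72
ref = (288.0 - 40.0 * np.sin(np.deg2rad(lat)) ** 2)[:, None] * np.ones((1, nlon))
model = ref + 0.5

bias, rmsd = global_bias_rmsd(model, ref, lat)
print(f"bias = {bias:.3f} K, RMSD = {rmsd:.3f} K")
```

A near-zero global mean bias does not imply a small RMSD, since regional warm and cold biases can cancel in the mean but not in the RMSD.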
There are many different, model-dependent causes for biases in modeled surface temperature. Common causes include biases in downward shortwave radiation at the surface because of errors in simulated cloud properties (Hyder et al., 2018;Lauer et al., 2018), errors in oceanic circulation (Kuhlbrodt et al., 2018), errors in the simulation of trade winds (Lauer et al., 2018), and errors in surface albedo and moisture propagated from the vegetation schemes (Séférian et al., 2016). Even though the multimodel mean of the surface temperature bias shows only small improvements in CMIP6, some individual models made significant progress (Danabasoglu et al., 2020).
It is noteworthy that some of the long-standing biases seem to be related to horizontal model resolution. After increasing the horizontal resolution, as done in HighResMIP (Figures 3e and 3f), some of the biases were reduced or even disappeared compared to the mean bias of the corresponding lower-resolution versions of the same models simulating the same time period. Both the global mean bias and the RMSD decrease with higher horizontal resolution. There is a clear improvement in many of these regional biases, particularly in the stratocumulus regions, where typical biases in the high-resolution models of HighResMIP are below 1°C compared to up to 3-4°C in the multimodel mean of the low-resolution models of HighResMIP (M. J. Roberts et al., 2019). Improvements can be seen in the upwelling regions off the west coasts of South America and Africa and also over the northern Atlantic (Docquier et al., 2019). The cold bias along the equator in the Pacific Ocean with too cold SSTs extending too far west (Lauer et al., 2018) disappeared in the high-resolution versions (Roberts et al., 2018). A notable exception to the bias improvements from higher resolution is again the Southern Ocean, where biases increase at these eddy-permitting ocean resolutions. It should, however, be noted that most models in the current group of HighResMIP models include a NEMO-based ocean, so there is little ocean model diversity. Also, because of the shorter simulation period starting from observed initial conditions in 1950 in HighResMIP, compared to starting in 1850 from a preindustrial spun-up state for the CMIP6 historical simulations, better agreement of the HighResMIP simulations with observations can be expected. Because of this, the performance of the HighResMIP simulations is not directly comparable to that of the CMIP6 historical simulations.

Journal of Geophysical Research: Atmospheres, 10.1029/2019JD032321

Precipitation
The multimodel mean of the CMIP6 ensemble shows the well-known large-scale features of the global precipitation pattern (Figure 4a). Precipitation near the equator is high due to frequently occurring deep convection connected with the Intertropical Convergence Zone (ITCZ). In the subtropical subsidence regions precipitation rates are low, and they increase again in midlatitudes due to precipitation by frontal systems (midlatitude storm tracks). The cold temperatures and the associated low water vapor saturation ratio at the poles lead to a relatively low amount of precipitation in high latitudes. Pattern correlations between the modeled and observed geographical distribution of annual mean precipitation range from 0.69 to 0.87 for CMIP3, 0.79 to 0.88 for CMIP5, and 0.80 to 0.92 for CMIP6 models (Figure 7).

The comparison of the multimodel mean with the global precipitation data set from the Global Precipitation Climatology Project (GPCP; Adler et al., 2003) shows that the global mean climatology of the CMIP ensembles is slightly too wet, but the global RMSD decreases from CMIP3 to CMIP6 (Figures 4b-4d). However, some long-standing systematic model biases persist throughout the different CMIP phases. The largest precipitation biases of up to 3.5 mm day⁻¹ are found in the tropics. They include the occurrence of a double ITCZ in the tropical Pacific and a southward shifted ITCZ in the equatorial Atlantic with rather little progress from CMIP3 to CMIP6. A double ITCZ is often driven by incorrect simulation of the meridional gradients in SST across the equator (Oueslati & Bellon, 2015) and is thus a complex problem of the coupled atmosphere-ocean system. In general, the amplitude and geographical pattern of the precipitation biases in CMIP6 are quite similar to those from CMIP5. There is some improvement, however, in CMIP6 compared with CMIP3 and CMIP5 in the overly intense Indian Ocean ITCZ and the too dry South American continent (excluding the Andes) by about 1 mm day⁻¹. There have also been progressive improvements in the extratropical representation of precipitation from CMIP3 to CMIP6. The CMIP6 models have an improved zonal tilt of the North Atlantic winter storm track (Priestley et al., 2020), which may have contributed to the decrease in the dry bias over that region. Also, the equatorward and zonal mean bias in the SH midlatitudes has been largely reduced. These improvements have been attributed to model horizontal resolution in the NH and to model physics in the SH (Priestley et al., 2020).
The multimodel mean bias of the high-resolution versions in comparison to the multimodel mean bias of their corresponding low-resolution counterparts in HighResMIP (Figures 4e and 4f) shows some improvements. There is a strong decrease in the precipitation bias in the tropical Atlantic by about 1-2 mm day⁻¹ as well as a near disappearance of the dry bias in the equatorial Pacific. A possible explanation for this improvement could be that, together with the improved SST biases (Figures 3e and 3f), the seasonal mean circulation and ITCZ migration are better represented with higher horizontal resolution (Vannière et al., 2019), leading to smaller biases.

Figure 5 shows climatological annual means of zonally averaged temperature, specific humidity and zonal wind (u-component) from the CMIP multimodel means compared with data from the ERA-Interim reanalysis (Dee et al., 2011). Prominent, well-known biases in the simulated vertical temperature distribution throughout all CMIP phases include a cold bias of several K in the extratropical upper troposphere at around 200 hPa and a warm bias of about 1-2 K in the tropics at about 100 hPa (John & Soden, 2007). The cold bias is somewhat reduced from maximum values of around 8 K in CMIP3 to about 6 K in CMIP6. Additionally, the warm bias in the upper tropical troposphere is reduced in its extent and magnitude from CMIP3 (up to 3 K) to CMIP6 (up to 2 K). The same is true for the cold bias in the lower stratosphere in the tropics with a reduction from about 3 K in CMIP3 to about 1 K in CMIP6. An improvement from CMIP5 to CMIP6 can also be seen in the cold bias throughout most of the troposphere in the SH that is reduced from 1-2 K in CMIP5 to about 1 K in CMIP6.

Meteorology
The concentration of water vapor in the atmosphere spans several orders of magnitude. Therefore, Figure 5 shows relative biases of simulated specific humidity instead of absolute differences in order to facilitate assessment of how well the CMIP multimodel means reproduce the observationally based reference data set ERA-Interim. Consistent with the cold bias in the extratropical upper troposphere at around 200 hPa in both hemispheres, water vapor is underestimated in the CMIP models. While there is little change in this dry bias from CMIP3 to CMIP5, this bias is clearly improved in CMIP6 with bias values now ranging between −10% and −30%, down from −20% to more than −45%. Similarly, the wet bias in the middle- and high-latitude upper troposphere/lower stratosphere is improved from CMIP3 through CMIP5 to CMIP6. Throughout most of the stratosphere, the CMIP3 multimodel mean shows a strong dry bias (−40% to −130%). This bias has been reduced in CMIP5, with the dry bias now being mostly confined to high latitudes, and even further reduced in the CMIP6 multimodel mean, with values now mostly below −20% to −30% compared to ERA-Interim (Figure 5, middle column).
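A small sketch may help to show why relative rather than absolute biases are used for specific humidity. The values below are hypothetical and chosen only to span several orders of magnitude; they are not taken from any model or from ERA-Interim.

```python
import numpy as np


def relative_bias_percent(model, reference):
    """Relative bias in percent: 100 * (model - reference) / reference."""
    model = np.asarray(model, dtype=float)
    reference = np.asarray(reference, dtype=float)
    return 100.0 * (model - reference) / reference


# Hypothetical specific humidity (kg/kg), from the lower troposphere
# up to the stratosphere.
reference_q = np.array([1.2e-2, 3.0e-3, 2.0e-4, 3.0e-6])
model_q = np.array([1.25e-2, 2.8e-3, 1.5e-4, 2.4e-6])

absolute = model_q - reference_q  # dominated by the lowest levels
relative = relative_bias_percent(model_q, reference_q)  # comparable at all levels
```

In this example the absolute difference at the uppermost level is orders of magnitude smaller than near the surface, although the relative dry bias there (−20%) is larger in magnitude than the near-surface bias.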
The simulated multimodel mean zonal wind speed (u-component) from CMIP6 models shows a reduction in the positive bias in the midlatitude stratosphere of both hemispheres (bias up to 4-5 m s⁻¹) compared to CMIP3 (bias up to 9 m s⁻¹) and also compared to CMIP5 (bias up to 6 m s⁻¹). The negative bias in zonal wind speed found above the tropical tropopause of up to several m s⁻¹ in CMIP5 is also clearly reduced in the CMIP6 multimodel mean by about 0.1 m s⁻¹. Reasons for that could be the increased vertical resolution in the upper (tropical) troposphere and lower stratosphere and better resolved processes in the stratosphere, for example, gravity wave parameterizations (Manzini et al., 2014). In addition, the better representation of zonal mean specific humidity (Figure 5, middle column) might contribute to the improvement of the zonal wind speed climatology in CMIP6.

Figure 5. Climatological annual means of zonally averaged temperature (left), specific humidity (middle) and zonal wind (m s⁻¹, right) from CMIP6 (1995-2004). Also shown are the absolute (temperature and zonal wind) and the relative (specific humidity) deviations from ERA-Interim for (from top to bottom) CMIP6 (1995-2004), CMIP5 (1995-2004) and CMIP3 (1980-1999) multimodel means. Stippled areas show differences that are statistically significant at a 95% confidence level.

Quantification of Model Performance Across the CMIP6 Ensemble and CMIP Phases
In this section, the performance of the three different generations of climate models from CMIP3, CMIP5, and CMIP6 is assessed across different variables using multiple diagnostic fields. For every diagnostic field considered, model performance is compared to one or multiple observational reference data sets, and the quality of the simulation is summarized in a single number such as a correlation coefficient or an RMSD. The general model improvements can then be quantified by simultaneously assessing a number of different performance indices. The use of such performance metrics in the model development phase potentially introduces the risk of tuning models to reproduce a set of metrics while ignoring deficiencies elsewhere. This is why such performance metrics should mostly be seen as a possible starting point for more in-depth process-oriented evaluation that allows the identification of compensating errors (Eyring et al., 2005).
Performance metrics such as the portrait diagram shown in Figure 6 or the summary plot of pattern correlations for different variables shown in Figure 7 offer a quick overview of model performance. They can be used either as a starting point for more in-depth evaluation of individual variables or climate parameters with observations (Flato et al., 2013) or as one possible summary of overall model performance. Figure 6 is an extended and updated version of Figure 9.7 of Flato et al. (2013) that is based on Gleckler et al. (2008). It shows the normalized relative space-time RMSD of the climatological seasonal cycle from model simulations compared with observations for selected variables. Here, RMSD values are normalized with the centered median RMSD, that is, by subtracting the median RMSD from the RMSD of an individual model and then dividing by the median RMSD. The median RMSD for each variable used for normalization is calculated across all models from all CMIP phases to make the grading of the models directly comparable across CMIP3, CMIP5, and CMIP6. Thus, positive and negative values are possible, with positive values indicating a model performance worse than the median RMSD and negative values a performance better than the median RMSD. Here, all RMSD values are averaged over the whole globe. Where available, the model results are compared not only to one observational(ly based) reference data set but also to a second alternative data set to get an estimate of the observational uncertainty. This is indicated by diagonally divided boxes in Figure 6. All model data are masked according to data availability from the reference data sets and averaged over the same years for which observational data are available.

Figure 6. Relative space-time root-mean-square deviation (RMSD) calculated from the climatological seasonal cycle of the CMIP3, CMIP5, and CMIP6 simulations (1980-1999) compared to observational data sets (Table 5). A relative performance is displayed, with blue shading being better and red shading worse than the median RMSD of all model results of all ensembles. A diagonal split of a grid square shows the relative error with respect to the reference data set (lower right triangle) and the alternative data set (upper left triangle), which are marked in Table 5. White boxes are used when data are not available for a given model and variable. Updated and expanded from Figure 9.7 of Flato et al. (2013).
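The centered median normalization described for Figure 6 can be written as a short sketch. The function name and the RMSD values are hypothetical; missing model and variable combinations (white boxes in Figure 6) are represented here by NaN, which is an assumption of this example rather than a description of the ESMValTool implementation.

```python
import numpy as np


def normalize_rmsd(rmsd):
    """Relative RMSD for one variable: (RMSD - median) / median.

    The median is taken over all models of all CMIP phases; NaN marks
    models without data for this variable.
    """
    rmsd = np.asarray(rmsd, dtype=float)
    median = np.nanmedian(rmsd)
    return (rmsd - median) / median


# Hypothetical space-time RMSD values of one variable, pooled over
# models from all three CMIP phases.
rmsd_all_phases = np.array([2.4, 2.0, 1.6, 1.8, np.nan, 1.2])
relative = normalize_rmsd(rmsd_all_phases)
```

Negative values (blue in the portrait diagram) mark models that perform better than the pooled median, positive values (red) mark models that perform worse. Because the median is pooled over all phases, values from CMIP3, CMIP5, and CMIP6 share a common scale.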

Figure 6 shows that model performance varies across the models and across the variables, with no single model outperforming the other models for all variables. Nevertheless, there are model families whose members perform quite similarly, for example, the CMIP6 GFDL or CMIP6 GISS models. This is, however, not true for all model families; for example, the CMIP6 models MIROC-ES2L and MIROC6 show quite different performances.
In general, there are clear improvements from CMIP3 to CMIP6, with the majority of CMIP3 models showing on average more red (positive values) boxes (CMIP3 ensemble median RMSD over all diagnostics = 0.127; 25%/75% percentiles = 0.003/0.283) than CMIP5 (CMIP5 median RMSD = 0.022; 25%/75% percentiles = −0.069/0.146) and the CMIP6 models showing the most blue (negative values) boxes (CMIP6 median RMSD = −0.064; 25%/75% percentiles = −0.146/0.048). Radiation fields have already shown improvements from CMIP3 to CMIP5, and this development continues in CMIP6, as the models agree quite well with the CERES-EBAF observations. The same applies to total cloud cover (clt) and precipitation (pr). The seasonal cycle of near-surface air temperature is not represented very well in CMIP3 (median RMSD = 0.191), but there were many improvements through CMIP5 (median RMSD = 0.014) to CMIP6 (median RMSD = −0.069). Moreover, the dynamical fields, sea level pressure (psl) and the geopotential height at 500 hPa (zg500), show improvements from CMIP3 (median RMSD for zg500 = 0.357) to CMIP6 (median RMSD for zg500 = −0.121), even though some individual models still have problems in specific regions. Also, wind fields simulated by the CMIP6 models are in better agreement with observations than those from previous CMIP phases (see also section 4.3). The results for the temperature fields at 200 and 850 hPa show quite a large range in the RMSD for the different models in CMIP3 (median RMSD = 0.166), CMIP5 (median RMSD = 0.017) and also in CMIP6 (median RMSD = −0.050).
Using centered pattern correlations for selected fields (here: near-surface air temperature, precipitation, outgoing top of the atmosphere (TOA) longwave radiation, TOA shortwave cloud radiative forcing, and sea level pressure), Figure 7 shows significant improvements from the CMIP3 ensemble to the CMIP6 ensemble. Little progress was found for fields that were already quite well simulated such as near-surface air temperature and TOA outgoing longwave radiation. For precipitation, the intermodel spread is reduced from CMIP3 to CMIP5 and CMIP6, particularly because the worst performing models improved significantly. Additionally, there is a continuous improvement of the pattern correlation from CMIP3 to CMIP6 in all variables. The shortwave cloud radiative effect shows large improvements in CMIP6 regarding the correlation and also the multimodel spread. In CMIP3 and CMIP5, the shortwave cloud radiative effect was relatively poorly simulated with a large intermodel spread. Concerning sea level pressure, there is an improvement from CMIP5 to CMIP6, but the wide intermodel spread has not been reduced significantly.
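A centered pattern correlation can be sketched as follows. The sketch assumes that both fields are already on a common regular 2.5° × 2.5° grid and that cosine-latitude area weights are used; the function name and the synthetic fields are hypothetical and do not reproduce the ESMValTool code.

```python
import numpy as np


def centered_pattern_correlation(model, reference, lat_deg):
    """Area-weighted correlation of two (lat, lon) fields after removing
    each field's own area-weighted global mean ("centered")."""
    w = np.cos(np.deg2rad(lat_deg))[:, None] * np.ones_like(model)
    w = w / w.sum()
    m_anom = model - np.sum(w * model)
    r_anom = reference - np.sum(w * reference)
    cov = np.sum(w * m_anom * r_anom)
    return float(cov / np.sqrt(np.sum(w * m_anom**2) * np.sum(w * r_anom**2)))


# Synthetic fields on a regular 2.5 x 2.5 degree grid.
lat = np.arange(-88.75, 90.0, 2.5)
lon = np.arange(1.25, 360.0, 2.5)
reference = np.cos(np.deg2rad(lat))[:, None] + 0.1 * np.sin(np.deg2rad(lon))[None, :]

rng = np.random.default_rng(0)
noisy_model = reference + 0.2 * rng.standard_normal(reference.shape)

r_perfect = centered_pattern_correlation(2.0 * reference + 3.0, reference, lat)
r_noisy = centered_pattern_correlation(noisy_model, reference, lat)
```

Because the global mean is removed and the result is normalized by the spatial standard deviations, the metric is insensitive to a uniform offset or a scaling of the pattern. It therefore complements, rather than replaces, the bias and RMSD measures.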

Effective Climate Sensitivity
Since the release of the first CMIP6 simulations, one of the most discussed topics has been the higher ECS reported in some of the models (Forster et al., 2019; Meehl et al., 2020). ECS is an important metric for assessing the future warming sensitivity of the climate system to increasing concentrations of CO2, which is an important constraint on the total amount of greenhouse gases, in particular CO2, that can be emitted before a given global mean warming target is exceeded. ECS provides a single number, defined as the change in global mean surface air temperature resulting from a doubling of atmospheric CO2 concentration compared to preindustrial conditions, once the climate has reached a new equilibrium (Gregory et al., 2004). For this study, we used the common approach of the Gregory method, which extrapolates the relationship between the changes in near-surface temperature and the changes in the net downward radiation flux at TOA (Gregory et al., 2004). This method is unable to represent nonlinearities in the climate response and tends to underestimate the true ECS obtained from equilibrating the climate models (Rugenstein et al., 2020).

Figure 7. Centered pattern correlations between models and observations over the period 1980-1999. Results are shown for individual CMIP3 (black), CMIP5 (blue), and CMIP6 (brown) models as short lines, along with the corresponding ensemble averages (long lines). The correlations are shown between the models and the reference observational data set listed in Table 5. In addition, the correlations between the reference and alternate observational data sets are shown (solid gray circles, marked in Table 5). To ensure a fair comparison across a range of model resolutions, the pattern correlations are computed after regridding all data sets to a resolution of 2.5° in longitude and 2.5° in latitude. Only one realization is used from each model from the CMIP3, CMIP5, and CMIP6 historical simulations.
However, since only a small subset of the CMIP models provides the long-running simulations necessary for the calculation of the true ECS, we use the Gregory method for an approximate, yet consistent ECS assessment for all climate models. The ESMValTool offers the flexibility to adjust the ECS calculation, for example, by changing the first year and the length of the time period used for calculating the slope of the Gregory relationship. This allows the study to be repeated with different settings if needed.
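The Gregory regression can be sketched in a few lines. This is a minimal illustration, not the ESMValTool implementation: it assumes annual global mean anomalies from an abrupt CO2 quadrupling experiment relative to the control run (so that the extrapolated warming is divided by two to refer to a CO2 doubling), and the function name, the parameter names `first_year` and `n_years`, and the synthetic forcing and feedback values are all hypothetical.

```python
import numpy as np


def gregory_ecs(delta_t, delta_n, first_year=0, n_years=None, co2_factor=4.0):
    """Estimate ECS by linear regression of the TOA net downward flux
    anomaly (W m-2) on the surface temperature anomaly (K).

    The regression line N = F + lambda * dT is extrapolated to N = 0;
    the resulting warming is scaled to one CO2 doubling.
    """
    delta_t = np.asarray(delta_t, dtype=float)
    delta_n = np.asarray(delta_n, dtype=float)
    stop = None if n_years is None else first_year + n_years
    slope, intercept = np.polyfit(delta_t[first_year:stop], delta_n[first_year:stop], 1)
    doublings = np.log2(co2_factor)  # 2 for a quadrupling of CO2
    ecs = -intercept / slope / doublings
    return ecs, slope, intercept


# Synthetic, perfectly linear example (illustrative numbers only).
years = np.arange(150)
forcing_4x = 7.4   # W m-2, assumed forcing for quadrupled CO2
feedback = -1.0    # W m-2 K-1, assumed net feedback parameter
delta_t = (forcing_4x / -feedback) * (1.0 - np.exp(-(years + 1) / 30.0))
delta_n = forcing_4x + feedback * delta_t

ecs_full, slope_full, _ = gregory_ecs(delta_t, delta_n)
ecs_late, _, _ = gregory_ecs(delta_t, delta_n, first_year=20, n_years=100)
```

For this idealized linear response both regression windows return the same value. In real model output the relationship is usually not exactly linear, so the estimate depends on the chosen first year and period length, which is why these settings are worth varying.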
The modeled range of ECS of 2.1 to 4.4 K in CMIP3, which was quite similar in CMIP5 (2.1 to 4.7 K), has increased to 1.8 to 5.6 K in CMIP6 (Figure 8). Consistent with this, the ECS multimodel mean has also increased from 3.2 K in CMIP3 and CMIP5 to almost 3.8 K in CMIP6. The increased range in ECS in CMIP6 suggests an increased uncertainty in this metric compared to previous CMIP phases, which might lead to reduced trust in the models' projections of future climate by some stakeholders and decision makers. It is therefore critically important to understand the reasons for the increased span in ECS given by the latest generation of CMIP models. In addition to Meehl et al. (2020), several modeling groups have already published studies confirming higher ECS values in their CMIP6 models (Gettelman et al., 2019; Wyser et al., 2019).
Numerous improvements to the underpinning physical, chemical, and biological processes have been developed and implemented in the new CMIP6 models. These result in models that are capable of representing the coupled climate system in more detail. Some of these improvements influence the ECS in the models (Forster et al., 2019). Meehl et al. (2020) give possible explanations for the occurrence of high ECS values in some of the models, with coupled cloud microphysical and aerosol developments potentially being a common factor. Besides cloud feedbacks, other main contributors to ECS are, for example, the water vapor-lapse rate feedback and the snow/ice albedo feedback. Cloud feedbacks play a particularly important role because (i) they remain the largest contributor to the spread of ECS across models (Flato et al., 2013; Zelinka et al., 2020) and (ii) a number of models have specifically increased the degree of complexity/detail with respect to mixed-phase clouds (Gettelman et al., 2019; Mulcahy et al., 2020; Williams et al., 2020). Further studies are required to better understand the higher ECSs in CMIP6 relative to CMIP5.

Figure 8. Effective climate sensitivity (ECS) calculated for CMIP3 (blue), CMIP5 (orange), and CMIP6 (green) models using the method from Gregory et al. (2004). The ensemble means are indicated by a darker shading of the corresponding bars.

Improvements to the representation of mixed-phase clouds in some CMIP6 models reduce a long-standing model bias of too little supercooled liquid water (and conversely too large an amount of ice crystals) in low-level, midlatitude clouds, particularly over the relatively pristine Southern Ocean (Bodas-Salcedo et al., 2016; McCoy et al., 2016). These developments also improve the representation of both cloud microphysical structure and cloud radiative impacts in these regions (Hyder et al., 2018; Kay et al., 2016). Earlier models, such as in CMIP5 and CMIP3, exhibited a large negative SW cloud feedback over the Southern Ocean as predominantly ice clouds melted to become liquid clouds as the simulated climate warmed (McCoy et al., 2015). For a given water content, a cloud consisting of (physically smaller) liquid droplets will be more reflective to solar radiation than the "same" cloud composed of (larger) ice crystals. Furthermore, a predominantly liquid cloud will also tend to precipitate less than a cloud composed of both ice and liquid, resulting in more water staying in the liquid cloud. Earlier (CMIP3/CMIP5) models exhibited a widespread (erroneous) tendency over the Southern Ocean to go from predominantly ice clouds in the present-day period to liquid clouds in the future. This cloud phase change provided a relatively strong negative shortwave feedback on warming (through an increase in cloud reflectivity in the simulated future). This negative feedback is removed (or significantly reduced) in those CMIP6 models that simulate predominantly liquid clouds for the present day over the Southern Ocean. This negative SW cloud feedback acts to balance other (mainly tropical and subtropical) positive cloud feedbacks, reducing the overall global net cloud feedback. The size of this negative (cloud phase change) feedback has long been questioned due to the known systematic cloud phase bias seen in many models (McCoy et al., 2015; Tan et al., 2016). Improving the microphysical structure of mixed-phase clouds acts to reduce this negative SW feedback as the climate warms, increasing the net global cloud feedback (Tan et al., 2016) and the resulting ECS in those models (Gettelman et al., 2019). Figure 9 shows that in CMIP6, the simulated TOA shortwave cloud radiative effect agrees better with observations than in previous CMIP phases. The main improvement in CMIP6 compared with previous phases of CMIP is a reduced (less negative) bias in the tropics and over the Southern Ocean. The latter is collocated with the aforementioned cloud phase negative shortwave feedback.

Figure 9. Simulated TOA shortwave cloud radiative effect from CMIP3 (1980-1999), CMIP5 (starting 1986), and CMIP6 (1995-2014) compared against the Clouds and the Earth's Radiant Energy System (CERES) Energy Balanced and Filled (EBAF) Version 2.7 data set (Loeb et al., 2012). Left column: Geographical distributions of the differences between the multimodel means and CERES-EBAF (bias). Right column: Zonal averages from the individual models (gray lines), the multimodel mean (red lines) and the observational data set (black lines).
The geographical distribution of the net cloud feedback parameter, defined as changes in the sum of shortwave and longwave cloud radiative effect per degree of surface warming, is dominated in many regions by the shortwave component (Figures 10a and 10c). The sign change at around 60°S seen in the shortwave cloud feedback is indicative of where models are switching, in their preindustrial and present-day experiments, from simulating clouds almost totally composed of liquid droplets to clouds with an increasing ice component. With increasing latitude there is an increasing ice component in model clouds that will support a negative shortwave feedback on warming. Figure 10d supports the results of Zelinka et al. (2020) that there is an increase in the shortwave cloud feedback parameter over the Southern Ocean in CMIP6 compared with CMIP5 (in many regions understood as a decrease, or even sign change, in the size of a negative shortwave cloud feedback). Zelinka et al. (2020) found that the distribution of net cloud feedback is shifted toward larger positive values in CMIP6 due to a stronger positive (reduced negative) low-level cloud feedback, mainly in the extratropics. The CMIP6 models show weaker increases in extratropical low-level cloud cover and associated liquid water content with increasing surface temperature than previous model generations. This primarily arises from an increase in the liquid condensate fraction (LCF) simulated in these clouds for the preindustrial and present-day periods (Zelinka et al., 2020), leading to the aforementioned reduction in cloud phase change on warming. A higher cloud feedback contributes to an increase in climate sensitivity and could be one possible explanation for the high climate sensitivity values of some CMIP6 models.
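The definition of the cloud feedback parameter given above can be illustrated with a simple difference-based sketch. It assumes climatological cloud radiative effect (CRE) fields from a control and a warmed climate and a single global mean surface warming; the function name and all numbers are hypothetical, and the calculation actually used for Figure 10 may differ (for example, by using regression over time).

```python
import numpy as np


def cloud_feedback(sw_cre_ctrl, lw_cre_ctrl, sw_cre_warm, lw_cre_warm, delta_t_global):
    """Shortwave, longwave and net cloud feedback parameter (W m-2 K-1):
    change in cloud radiative effect per degree of global surface warming."""
    sw = (np.asarray(sw_cre_warm) - np.asarray(sw_cre_ctrl)) / delta_t_global
    lw = (np.asarray(lw_cre_warm) - np.asarray(lw_cre_ctrl)) / delta_t_global
    return sw, lw, sw + lw


# Two hypothetical grid cells: one where clouds become less reflective with
# warming and one where they become more reflective (cloud phase change).
sw_ctrl = np.array([-60.0, -45.0])  # W m-2, negative values = cooling effect
sw_warm = np.array([-58.0, -47.0])
lw_ctrl = np.array([25.0, 30.0])
lw_warm = np.array([25.5, 29.5])

sw_fb, lw_fb, net_fb = cloud_feedback(sw_ctrl, lw_ctrl, sw_warm, lw_warm, delta_t_global=4.0)
```

In this example the shortwave component sets both the sign and most of the magnitude of the net feedback in each grid cell, mirroring the statement that the net cloud feedback is dominated in many regions by its shortwave part.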

Summary
In this study, we evaluated multimodel ensembles from three different phases of CMIP, namely CMIP3, CMIP5 and CMIP6. Improvements or changes in model performance from one CMIP phase to the next are typically a combination of different factors such as an increasing spatial and vertical resolution, a more complete and more detailed representation of individual ESM components, and the inclusion of additional Earth system processes that could be added in recent years as increasing computing power became available. In addition, input data including prescribed emissions and forcings were continuously refined and further developed (Eyring, Bony, et al., 2016; Taylor et al., 2012). These changes, in combination with modifications of the experiment design over time, make a direct one-to-one comparison of the model results among different CMIP phases difficult if not impossible. We therefore focused on ensemble average results as one possible representation of state-of-the-art climate modeling at the time of a particular CMIP phase in order to assess the general progress in the field over the last two decades. For this, we compared the model results from CMIP3, CMIP5, and CMIP6 for present-day climate with observations that serve as one possible benchmark for the overall model performance. The main aim was to assess the different generations of climate models as a whole instead of tracking the progress made by individual models. To this end, we analyzed data from the historical CMIP6 simulations published to the ESGF in comparison with observations and reanalyses as well as with results from CMIP3 and CMIP5. Additionally, we evaluated some results from HighResMIP to assess the potential improvements achieved by increasing the horizontal model resolution.
To analyze how the performance of different generations of CMIP models compared to observations has changed relative to each other, we have used the ESMValTool for the production of all figures in this study. It enables a comprehensive evaluation of the models and, as open-source software, ensures provenance and traceability. One of the topics widely discussed even outside of the climate science community was the apparent "failure" of the CMIP5 models to reproduce the warming hiatus seen in observations of the global mean warming rates from 1998 to 2013. Because of the high attention this topic received, there were even potential implications for the public perception of the trustworthiness of climate models and climate projections in general. It has been shown that the hiatus was likely predominantly a result of internal climate variability, with the phase of the IPO playing an important role. The uninitialized historical CMIP5 model runs cannot be expected to reproduce the exact timing of effects caused by internal variability as seen in observations. In fact, a small number of CMIP5 model simulations were, by chance, in a negative IPO phase at the right time and able to simulate the observed pause in the increase in global warming rates. Now, CMIP6 models show the observed accelerated temperature increase in recent years and agree quite well with the observed mean global warming in the 2010s. Some CMIP6 models, however, also show a cooling in the second half of the twentieth century and too large an increase in near-surface temperatures in recent years, which might be related to a too strong aerosol-related ERF. This needs to be further investigated in order to fully understand the driving mechanisms of this potentially overestimated sensitivity to the prescribed aerosol emissions.
The CMIP6 results currently available show that the latest generation of CMIP models has a similar or even slightly higher skill in reproducing observed large-scale mean surface temperature and precipitation patterns than its CMIP3 and CMIP5 predecessors. CMIP6 models have an increased degree of freedom by including more processes and couplings, primarily aimed at being able to better simulate future feedbacks (e.g., nitrogen effects on terrestrial carbon uptake or permafrost processes). All these additions make the models better "fit for purpose," if the purpose is simulating future global change. But the increased degree of freedom has the potential to increase model biases. A reduction of some of the long-standing systematic model biases, for instance over high-elevation regions, the North Atlantic and Southern Ocean, and upwelling regions, is found particularly in the high horizontal resolution models contributing to HighResMIP. Other biases, however, notably in the Southern Ocean, seem to be more stubborn. Vertical distributions of key variables such as temperature, water vapor and zonal wind speed also show improvements throughout the three different CMIP phases. While most of the long-standing model biases are still present in CMIP6, their amplitude is often smaller than in CMIP3 and CMIP5.
The performance metrics (portrait diagram) and the correlation patterns of some important fields such as TOA radiative fluxes, temperature, precipitation, and sea level pressure show some overall improvements across the different CMIP ensembles with a reduced intermodel spread and higher average skill of the CMIP6 ensemble (RMSD, pattern correlation).
A perhaps surprising result from CMIP6 is the high ECS in some of the models, resulting in an even larger spread in ECS than the large range of values obtained from previous generations of climate models. This has already been discussed in first studies, and the exact, probably model-specific, reasons need to be understood in detail, as the increased spread in ECS potentially indicates an increased uncertainty in this important climate metric. First studies suggest that the causes might be improvements in the representation of mixed-phase clouds, which lead to changes in cloud feedbacks and in the shortwave component of the cloud feedback in particular. It is noteworthy that cloud-radiation interactions and in particular the shortwave cloud radiative effect in CMIP6 models are closer to observations than in previous generations of climate models. As ECS depends on numerous and interacting feedbacks, improvements in one specific variable or physical process can potentially lead to less error compensation and thus more spread in such complex quantities as ECS. A realistic representation of clouds, however, remains a challenge in current climate modeling. Here, further model improvements, stemming from higher resolution (Palmer & Stevens, 2019) or completely novel approaches to parameterize clouds and convection in climate models such as, for instance, machine learning based cloud parameterizations (Rasp et al., 2018), are required to make further progress toward more realistic simulations of clouds with climate models.

Data Availability Statement
CMIP model data are available freely and publicly from the Earth System Grid Federation (ESGF, https://esgf.llnl.gov) and listed in Tables 1-3. Observations used in the evaluation are detailed in Table 5 of the manuscript. Observational data sets available through the observations for Model Intercomparisons Project (obs4MIPs, https://esgf-node.llnl.gov/projects/obs4mips/) can be downloaded freely from the ESGF and used directly with the ESMValTool. For all other observational data sets, the ESMValTool provides a collection of scripts (NCL and Python) with downloading and processing instructions to recreate the data sets used in this publication. The ESMValTool and its workflow manager/preprocessor (ESMValCore) are developed on the GitHub repositories available at https://github.com/ESMValGroup (last access: 5 October 2020).

We acknowledge the World Climate Research Programme, which, through its Working Group on Coupled Modelling, coordinated and promoted CMIP. We thank the climate modeling groups for producing and making available their model output, the Earth System Grid Federation (ESGF) for archiving the data and providing access, and the multiple funding agencies who support CMIP6 and ESGF. We thank Kevin Debeire (DLR) for his help with the portrait diagram and the entire ESMValTool development team for the great work. The computational resources of the Deutsches Klimarechenzentrum (DKRZ, Hamburg) were essential for developing and testing the ESMValTool used to produce the figures in this paper as well as to do the analysis presented here. We would also like to acknowledge the