An Evaluation of the Large‐Scale Atmospheric Circulation and Its Variability in CESM2 and Other CMIP Models

The Community Earth System Model 2 (CESM2) is the latest Earth System Model developed by the National Center for Atmospheric Research in collaboration with the university community and is significantly advanced in most components compared to its predecessor (CESM1). Here, CESM2's representation of the large‐scale atmospheric circulation and its variability is assessed. Further context is provided through comparison to the CESM1 large ensemble and other models from the Coupled Model Intercomparison Project (CMIP5 and CMIP6). This includes an assessment of the representation of jet streams and storm tracks, stationary waves, the global divergent circulation, the annular modes, the North Atlantic Oscillation, and blocking. Compared to CESM1, CESM2 is substantially improved in the representation of the storm tracks, Northern Hemisphere (NH) stationary waves, NH winter blocking, and the global divergent circulation. It ranks within the top 10% of CMIP class models in many of these features. Some features of the Southern Hemisphere (SH) circulation have degraded, such as the SH jet strength, stationary waves, and blocking, although the SH jet stream is placed at approximately the correct location. This analysis also highlights systematic deficiencies in these features across the new CMIP6 archive, such as the continued tendency for the SH jet stream to be placed too far equatorward, the North Atlantic westerlies to be too strong over Europe, the storm tracks as measured by low‐level meridional wind variance to be too weak, and a lack of blocking in the North Atlantic sector.


Introduction
The Community Earth System Model, Version 2 (CESM2), is the second generation Earth System Model developed by the U.S. National Center for Atmospheric Research (NCAR), in collaboration with university researchers (Hurrell et al., 2013). Prior to the first incarnation of CESM (CESM1), the history of development of this model can be traced through the Community Climate System Model, Versions 4 (CCSM4, Gent et al., 2011), 3 (CCSM3, Collins et al., 2006), 2 (CCSM2, Kiehl & Gent, 2004), the Climate System Model 1 (CSM1, Boville & Gent, 1998), and, before that, the Community Climate Model, Versions 3 (CCM3, Kiehl et al., 1998), 2 (CCM2, Hack et al., 1993), 1 (CCM1, Williamson et al., 1987), and 0 (CCM0, Washington, 1982; Williamson, 1983). As such, CESM2 represents the current state-of-the-art in Earth System Modeling from this center, incorporating model development contributions from over four decades of research and the efforts of countless individuals.
Over this development history, the array of complex atmospheric, oceanic, hydrologic, cryospheric, and biogeophysical processes represented by this model has made CESM2 one of the most comprehensive and complex Earth System Models (ESMs) available. Given its fundamental role in the Earth System, the large-scale atmospheric circulation has been represented with some realism, relatively speaking, since the earliest days of climate modeling. Nevertheless, persistent biases remain in certain aspects and, as our models increase in complexity, we must continue to strive for the greatest accuracy possible in the representation of this underpinning feature of the Earth System.
In this study we present an evaluation of basic features of the large-scale atmospheric circulation and its variability in CESM2. We provide context by assessing changes compared to its predecessor (CESM1) and by placing it within the wider distribution of Earth System Models as represented by those participating in the Coupled Model Intercomparison Project, Phases 5 and 6 (CMIP5, Taylor et al., 2012, and CMIP6, Eyring et al., 2016). The range of atmospheric circulation features presented here is not exhaustive, and the primary focus is on the global climatology of the divergent circulation and stationary waves, midlatitude jet streams and storm tracks, and aspects of extratropical variability. Separate studies in this special issue provide an assessment of tropical intraseasonal variability, monsoons (Meehl et al., 2020), and El Niño-Southern Oscillation (ENSO) variability and its teleconnections (Capotondi et al., 2020).
Rather than taking the traditional approach of providing an overall introduction, methodological description and summary of the results, we instead provide a self-contained introduction and methodology within each results section for each feature considered, such that a reader can easily find all the relevant information in one place for their feature of interest. This diagnostic analysis is intended primarily as a resource for CESM2 users but also serves as a concise summary of the representation of these features in CMIP5 and CMIP6 models.
We begin by describing the model simulations and observational data sets in section 2, followed by a description of the error metrics used and the uncertainty assessments performed in section 3. In section 4 we discuss the representation of jet streams and storm tracks, in section 5 we discuss stationary waves and the global divergent circulation, and in section 6 we assess the annular modes, North Atlantic Oscillation (NAO), and blocking. Summary and conclusions are provided in section 7.

Model Simulations and Observation-Based Data Sets
For each of the model historical simulations and reanalyses described below, our primary focus is on the period from 1979 to 2014 and on monthly and daily averaged fields of zonal wind (ua), meridional wind (va), geopotential height (zg), and sea level pressure (slp). Note that here we are using variable names as specified by CMIP as opposed to those used in CESM2. Each of these fields is first regridded to a common 2° horizontal grid using bilinear interpolation before any other fields or metrics are derived. Only the summer and winter seasons are considered in the main text, but equivalent figures are shown for the spring and autumn in the supporting information.
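The regridding step can be illustrated with a minimal sketch. It assumes a regular latitude-longitude source grid with increasing coordinates and performs bilinear interpolation as two successive linear interpolations (longitude, then latitude). The function name `regrid_bilinear`, the grid definitions, and the test field are hypothetical; the text does not state which regridding tool was actually used.

```python
import numpy as np

def regrid_bilinear(field, lat, lon, lat_out, lon_out):
    """Bilinearly interpolate field(lat, lon) onto (lat_out, lon_out).

    Assumes increasing lat and lon on a regular global grid (an assumption
    of this sketch, not a statement about the paper's tooling).
    """
    # Append a wrap-around column so target longitudes past the last
    # source longitude can still be interpolated.
    lon_ext = np.append(lon, lon[0] + 360.0)
    field_ext = np.concatenate([field, field[:, :1]], axis=1)
    # Linear interpolation along longitude for every source latitude row...
    tmp = np.array([np.interp(lon_out, lon_ext, row) for row in field_ext])
    # ...then along latitude for every target longitude column.
    return np.array([np.interp(lat_out, lat, col) for col in tmp.T]).T

# Hypothetical 1-degree source grid and 2-degree target grid.
lat = np.linspace(-90.0, 90.0, 181)
lon = np.linspace(0.0, 359.0, 360)
lat_out = np.linspace(-90.0, 90.0, 91)
lon_out = np.linspace(0.0, 358.0, 180)

field = lat[:, None] * np.ones((1, lon.size))   # field that is linear in latitude
regridded = regrid_bilinear(field, lat, lon, lat_out, lon_out)
```

Because the interpolation is linear in each direction, a field that varies linearly with latitude is reproduced exactly on the target grid, which gives a simple sanity check.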

CESM2
In its default configuration, CESM2 simulates the global coupled Earth System at approximately 1° horizontal resolution. It contains interactive components for the atmosphere, land, ocean, sea ice, river transport, and land ice. CESM2 represents a significant advance over CESM1 in many ways (see Danabasoglu et al., 2019, for more details). As the updates within the atmosphere component (Community Atmosphere Model 6, CAM6) are likely to be the most relevant, we summarize some of those major changes here, but readers are referred to Bogenschutz et al. (2018) and Gettelman et al. (2019) for a more detailed description of CAM6 and the high-top atmospheric component (Whole Atmosphere Community Climate Model, WACCM6), respectively.
In the transition from CAM5 to CAM6, almost every physical parameterization within the atmosphere has been updated, with the exception of radiation. A major change is that the boundary layer, shallow convection, and cloud macrophysics are combined within the new Cloud Layers Unified By Binormals (CLUBB) scheme (Golaz et al., 2002), resulting in a more consistent representation of boundary layer turbulence (Bogenschutz et al., 2013). The prognostic cloud microphysics scheme (MG2, Gettelman & Morrison, 2015) has been updated from its predecessor (MG1) with a major change being the inclusion of prognostic precipitation. Finally, and of relevance to some of the following results, there have been major updates to the representation of orographic drag. The orographic gravity wave drag scheme now includes a representation of the orientation of subgrid orography (ridges) and the effects of mesoscale orographic blocking (MOB). Furthermore, the turbulent orographic form drag (TOFD) scheme has been updated from the Turbulent Mountain Stress (TMS, Richter et al., 2010) parameterization to that of Beljaars et al. (2004).
Our primary focus will be on four CESM2 historical ensembles that differ in the vertical extent of the atmospheric component and in the presence or absence of coupling to the fully dynamic ocean model. These ensembles are summarized in the lower left of Table 1 and a more detailed description is provided in Table 2. Eleven members make up the ensemble BCAM6 in which the low-top atmosphere model (CAM6), with 32 layers in the vertical extending to ∼40 km, is coupled to the ocean model. A three-member coupled ensemble with the high-top WACCM6 (Gettelman et al., 2019), which has 70 levels in the vertical extending to ∼130 km, will be referred to as BWACCM6. Spatial maps in the main text are only shown for BCAM6, but equivalent figures for BWACCM6 are shown in supporting information Figures S16 and S17. In addition, there are three member ensembles, with prescribed historical sea surface temperatures (SSTs) (Hurrell et al., 2008), referred to as FCAM6 and FWACCM6, for CAM6 and WACCM6, respectively. In this naming convention, B refers to the CESM B-component set, which includes coupling to the ocean model, while F refers to the CESM F-component set where SSTs and sea ice are prescribed (i.e., AMIP-type simulations).
These simulations are run under historical forcings (Hoesly et al., 2018; van Marle et al., 2017) until 2014. The coupled simulations are each initialized from different years from a preindustrial (i.e., perpetual year 1850 forcing) control that has been spun-up for over 1,000 years (Danabasoglu et al., 2019), while the prescribed SST simulations begin in 1950. For each ensemble we will only consider the period from 1979 to 2014 for comparison with modern reanalyses over the satellite era.
Table 1 note. The top portion of the left three columns depicts the model number (used to depict the model in each plot), model name, and number of members of each CMIP5 model. The right-hand columns show the same for CMIP6. An "*" at the end of the CMIP model name depicts whether that model is used in the analyses requiring daily ua or va, and a "+" depicts whether a model is used in analysis requiring daily zg. The lower portion of the left columns summarizes the CESM1 and CESM2 simulations. The period from 1979 to 2014 is used for all simulations.
In addition to these four ensembles of simulations which are contributed to the CMIP6 archive, we will make use of the following simulations that are designed to isolate the underlying cause of some of the changes found in CESM2. FCAM6MOD is an historical simulation with CAM6 with prescribed SSTs but with the SSTs taken from one of the coupled BCAM6 members, as opposed to observations. We will use 1979-2014 of this simulation to explore the role of SST differences versus the lack of coupling in explaining differences between BCAM6 and FCAM6. To explore the influence of changes in orographic parameterization schemes, four single-member experiments, under historical forcings from 1979 to 2005, with SSTs prescribed to observations will be considered. FCAM6* is an uncoupled simulation with prescribed observed SSTs, very similar to FCAM6 described above, but with biogeochemistry in the land turned off. While the issue of land biogeochemistry is not important for our purposes, we use this rather than FCAM6 for like-with-like comparison with each of the following experiments that also have biogeochemistry turned off. FCAM5 is a simulation performed in the same way as FCAM6* (same forcings, boundary conditions, and land model) but with CAM5 physics used instead of CAM6. This allows for an assessment of the overall influence of the atmospheric physics package in isolation, which can then be compared with the following two experiments to isolate the orographic influence. FCAM6_TMS is as FCAM6* but with the new TOFD scheme of Beljaars et al. (2004) replaced by the older Turbulent Mountain Stress (TMS) parameterization of CAM5. A comparison of FCAM6* with FCAM6_TMS demonstrates the influence of this change in TOFD. FCAM6_NOMOB is as FCAM6* but without the new MOB parameterization included. A comparison between FCAM6* and FCAM6_NOMOB indicates the influence of the new MOB scheme.

CESM1
To examine the changes that have arisen as a result of the developments in advancing from CESM1 to CESM2 and to provide an indication of the sampling uncertainty in each metric as a result of internal variability, we will compare with the CESM1 large ensemble (Kay et al., 2014). This 40-member ensemble of simulations is initialized in 1920 from a single state, with ensemble spread introduced through a round-off level noise perturbation added to the temperature field at initialization. The initial state is that of a single realization that was branched from an 1850s control simulation and run until 1920 under historical forcings (Lamarque et al., 2010). The 40-member ensemble is then run under historical forcings to 2005 and RCP8.5 forcings thereafter (Meinshausen et al., 2011). We will assess the 1979-2014 period using the historical and RCP8.5 simulations combined and this will be referred to as LENS.

CMIP5
As with LENS, we will combine years 1979-2005 of the historical simulations with years 2006-2014 of the RCP8.5 simulations for the 35 CMIP5 models listed in Table 1. For monthly data we make use of all available ensemble members that have both historical and RCP8.5 components, resulting in ensemble sizes ranging from 1 to 10 members (third column of Table 1). We will always show the ensemble mean of a metric for each model when multiple members are available. Error metrics are first calculated for individual members before the ensemble averaging is performed so as to avoid comparing smoother ensemble mean spatial fields with the noisier fields of individual members. For metrics that involve daily ua and va data, we use one member from the 16 models (highlighted with an * in Table 1), and for daily zg data, we use one member from the 15 models (highlighted with a + in Table 1). For models that have more than one member available, we use the member with the lowest realization number. Daily fields are obtained by averaging 6-hourly pressure level fields. In each figure, a CMIP5 model can be identified by the model number given in the left column of Table 1.

CMIP6
For CMIP6, we make use of 1979-2014 of the historical simulations, run under the same forcings as the CESM2 simulations described above. At the time of writing, 42 models are available with ensemble sizes ranging from 1 to 32 (Table 1, right three columns). While the BCAM6 and BWACCM6 ensembles are contributed to the CMIP6 archive, we consider them separately here. Only a subset of 27/20 models, highlighted with a */+ in Table 1, have daily averaged (ua, va)/zg data available and for each of these we only use one member (the member with the lowest realization number).
In each figure, a CMIP6 model can be identified by the model number given in the third from right column of Table 1. We only show error metric summaries for the CMIP6 models in the main text, but ensemble mean spatial bias maps along with indications of model consensus are provided in Appendix A.

Observation-Based Data Sets
Our primary observational comparison will be with atmospheric reanalyses. The new ERA5 reanalysis (C3S, 2019) will be taken as the observational baseline and all simulations and other reanalysis products will be compared to that. Three other modern reanalyses that assimilate a wide array of observations will also be shown: ERA-Interim (Dee et al., 2011), MERRA2 (Gelaro et al., 2017), and JRA55 (Kobayashi et al., 2015). Two twentieth century reanalysis products: ECMWF's ERA20C (Poli et al., 2016) and NOAA's twentieth century reanalysis, 20CR (Compo et al., 2011), are considered, partly for the purpose of assessing those reanalysis products compared to others and also for the purpose of providing a longer-term context for the observational record in certain metrics. These twentieth century reanalyses are only constrained by surface pressure observations (and marine surface winds in the case of ERA20C) and, therefore, lack the additional constraint arising from the multitude of other observations that are assimilated in the other products.
For the most part, we only use 1979 to 2014 for these reanalysis products for direct comparison with the model simulations. MERRA2 only starts in 1980, so for that product we use 1980 to 2014, and ERA20C only extends to 2010, so for that product we use 1979 to 2010. For the North Atlantic, where a relatively large number of surface pressure observations constrain the twentieth century reanalyses back to 1900, we provide an assessment of the variability of metrics using all overlapping 36-year segments between 1900 and 2014 (2010 for ERA20C). For metrics involving daily data, we do not make use of the twentieth century reanalyses.
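As an illustration of the overlapping-segment approach, the following minimal sketch enumerates every 36-year window within a record and evaluates a metric in each one. It assumes the windows advance one year at a time, which the text does not state explicitly; the helper name `overlapping_segments` and the placeholder annual series are hypothetical.

```python
import numpy as np

def overlapping_segments(first_year, last_year, length=36):
    """List (start, end) years of every window of `length` years in the record.

    Assumes windows are shifted by one year at a time (an assumption of
    this sketch).
    """
    starts = np.arange(first_year, last_year - length + 2)
    return [(int(s), int(s + length - 1)) for s in starts]

segments = overlapping_segments(1900, 2014)          # e.g., 20CR
segments_era20c = overlapping_segments(1900, 2010)   # ERA20C ends in 2010

# Placeholder annual metric; in practice this would be a jet or NAO metric.
years = np.arange(1900, 2015)
index = np.sin(0.1 * years)
seg_means = [index[(years >= a) & (years <= b)].mean() for a, b in segments]
metric_range = (min(seg_means), max(seg_means))
```

The spread of `seg_means` then gives a longer-term context for the single 1979 to 2014 value.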

Normalized Mean Square Error Metric
When assessing the error in a spatial field (X), we will use the Normalized Mean Square Error (NMSE) metric proposed by Williamson (1995). This metric has been applied to the geopotential height field in evaluations of previous NCAR models (Kiehl et al., 1998; Neale et al., 2013), but here we apply it to all spatial fields considered. The NMSE of the model field $X_m$ is given by
$$\mathrm{NMSE}(X_m) = \frac{\overline{(X_m - X_o)^2}}{\overline{(X_o')^2}},$$
where $X_o$ refers to the "observed" field (in our case ERA5); the overbar refers to the area weighted spatial average and the prime refers to the deviation therefrom. To give some indication of where the errors are coming from, the NMSE can be further decomposed into three components as follows:
$$\mathrm{NMSE}(X_m) = \underbrace{\left(\frac{\overline{X_m} - \overline{X_o}}{\sigma_o}\right)^2}_{U} + \underbrace{\left(r_{mo} - \frac{\sigma_m}{\sigma_o}\right)^2}_{C} + \underbrace{\left(1 - r_{mo}^2\right)}_{P},$$
where $\sigma_o$ and $\sigma_m$ refer to the spatial standard deviation of the observations and model, respectively, and $r_{mo}$ refers to the spatial correlation between the model and observations. For a derivation of this, see Murphy (1988), their Equation 10, which is essentially the same as this but without normalization of the mean squared error. The first term, U, is the unconditional bias, which is a nondimensional measure of the overall bias in the spatial mean of a field. The second term, C, is the conditional bias, which is nondimensional and arises through both amplitude and phase errors. It is zero only if $r_{mo} = \sigma_m/\sigma_o$, that is, if the regression of $X_o$ onto $X_m$ yields a slope of 1, as is the case, for example, if $X_m$ is perfectly correlated with $X_o$ and their spatial variances are equal. The third term is the phase error P, which arises only from errors in the phasing of the spatial variations.
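The decomposition can be checked numerically. The sketch below assumes the three components take the form U = ((mean(X_m) - mean(X_o))/σ_o)², C = (r_mo - σ_m/σ_o)², and P = 1 - r_mo², which is the normalized form of the Murphy (1988) decomposition, and it assumes cos(latitude) area weights. The function name `nmse_components` and the synthetic fields are hypothetical.

```python
import numpy as np

def nmse_components(xm, xo, w):
    """Return (NMSE, U, C, P) for model field xm against reference xo.

    w holds area weights with the same shape as the fields (assumed here).
    """
    w = w / w.sum()                                   # weights sum to 1
    mean_m, mean_o = np.sum(w * xm), np.sum(w * xo)   # area-weighted means
    dev_m, dev_o = xm - mean_m, xo - mean_o           # the "primed" fields
    sig_m = np.sqrt(np.sum(w * dev_m**2))
    sig_o = np.sqrt(np.sum(w * dev_o**2))
    r_mo = np.sum(w * dev_m * dev_o) / (sig_m * sig_o)
    nmse = np.sum(w * (xm - xo)**2) / sig_o**2
    U = ((mean_m - mean_o) / sig_o)**2                # unconditional bias
    C = (r_mo - sig_m / sig_o)**2                     # conditional bias
    P = 1.0 - r_mo**2                                 # phase error
    return nmse, U, C, P

# Synthetic example on a hypothetical 2-degree grid.
lat = np.linspace(-89.0, 89.0, 90)
lon = np.arange(0.0, 360.0, 2.0)
lon2d, lat2d = np.meshgrid(np.deg2rad(lon), np.deg2rad(lat))
weights = np.cos(lat2d)                               # cos(latitude) weighting
x_obs = np.cos(2 * lat2d) + 0.3 * np.sin(3 * lon2d)
# "Model" with a mean offset, an amplitude error, and a phase shift.
x_mod = 0.5 + 1.2 * np.cos(2 * lat2d) + 0.3 * np.sin(3 * lon2d + 0.4)
nmse, U, C, P = nmse_components(x_mod, x_obs, weights)
```

With weights that sum to one, the three components add up to the NMSE to within rounding error, and all of them vanish when the model field equals the reference field.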
If $r_{mo} = 1$, that is, the phase error is 0, the interpretation of any conditional bias is straightforward; it arises if the amplitude of the spatial variations is too large or too small. When the phase error is nonzero, the interpretation of the conditional bias is less straightforward as it arises through both amplitude and phase errors. Furthermore, the conditional bias can be artificially reduced through a lack of spatial variance. Interpretation can, therefore, be aided by consideration of the scaled variance ratio (SVR) given by
$$\mathrm{SVR} = \left(\frac{\sigma_m}{\sigma_o}\right)^2 \mathrm{NMSE}(X_m),$$
which indicates whether conditional bias arises from too much (SVR > NMSE) or too little (SVR < NMSE) spatial variance. When SVR < NMSE, it also provides caution that the conditional bias component may be artificially reduced through the lack of spatial variance.
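A short sketch illustrates how the SVR separates too much from too little spatial variance. It assumes the SVR is the ratio of model to observed spatial variance multiplied by the NMSE, following Williamson (1995); the function name `nmse_and_svr` and the one-dimensional test fields with uniform weights are hypothetical.

```python
import numpy as np

def nmse_and_svr(xm, xo, w):
    """Return (NMSE, SVR), with SVR assumed to be (var_m / var_o) * NMSE."""
    w = w / w.sum()
    dev_m = xm - np.sum(w * xm)
    dev_o = xo - np.sum(w * xo)
    var_m = np.sum(w * dev_m**2)       # spatial variance of the model
    var_o = np.sum(w * dev_o**2)       # spatial variance of the reference
    nmse = np.sum(w * (xm - xo)**2) / var_o
    svr = (var_m / var_o) * nmse
    return nmse, svr

# Reference pattern and two "models": one too weak, one too strong.
x_obs = np.sin(np.linspace(0.0, 2.0 * np.pi, 200))
w = np.ones_like(x_obs)
nmse_weak, svr_weak = nmse_and_svr(0.5 * x_obs, x_obs, w)
nmse_strong, svr_strong = nmse_and_svr(2.0 * x_obs, x_obs, w)
```

The damped field gives SVR < NMSE and the amplified field gives SVR > NMSE, even though both are perfectly correlated with the reference.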
Altogether, U, C, P, and SVR provide an indication of the roles of an overall mean bias, biases that arise due to errors in the amplitude of spatial variations and biases that arise due to phasing errors. Figure 1 provides an explanatory key for how this will be represented in each figure. The NMSE will be depicted in each plot with a vertical bar composed of three different colored components for U, C, and P, while the SVR will be depicted by a circular symbol. When the SVR symbol lies above/below the bar the SVR is greater/less than 1. For models with a spatial mean (unconditional) bias with a magnitude greater than 10% of the ERA5 spatial mean, we depict whether that bias is positive or negative by shading the SVR symbol red or blue, respectively.

Assessment of Uncertainty due to Internal Variability
The 36-year observational record that the models are being compared to will be subject to uncertainty due to the sampling of internal variability. To provide some indication of the magnitude of this effect, or the significance of differences between the model and the reanalysis, we take a number of approaches:
• When assessing the bias of BCAM6 or LENS relative to ERA5 in map form, we provide an assessment of whether ERA5 lies within the distribution of the 11 (40) ensemble members for BCAM6 (LENS) and assume that where this is not the case, there is a significant difference between the real world and the model. This is equivalent to a significant difference at the ∼9% (2.5%) level for BCAM6 (LENS) by a one-sided nonparametric test. Regions where ERA5 does not lie outside of the model ensemble spread will be shaded gray in each figure.
• For each metric we show the minimum to maximum range of that metric from the 40 LENS members, giving an indication of the range of values that can arise due to internal variability within CESM1.
• For each metric we show each of the individual BCAM6, BWACCM6, FCAM6, and FWACCM6 members, giving an indication of the range of values that can arise due to internal variability for CESM2, although this is limited by the relatively small ensemble sizes for these simulations.
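The first of these approaches amounts to a grid point range test, which can be sketched as follows. The function name `outside_ensemble_range`, the array shapes, and the random data are hypothetical. The quoted significance levels are consistent with roughly 1/N for an ensemble of N members (1/11 ≈ 9%, 1/40 = 2.5%); that reading is an assumption of this sketch rather than a formula given in the text.

```python
import numpy as np

def outside_ensemble_range(obs, members):
    """True where obs lies outside the min-max range of the ensemble.

    members has shape (n_members, nlat, nlon); obs has shape (nlat, nlon).
    """
    lo = members.min(axis=0)
    hi = members.max(axis=0)
    return (obs < lo) | (obs > hi)

rng = np.random.default_rng(0)
members = rng.normal(size=(11, 4, 5))       # stand-in for 11 BCAM6 members
obs = members.mean(axis=0)                  # inside the spread everywhere...
obs[0, 0] = members[:, 0, 0].max() + 1.0    # ...except at one grid point

significant = outside_ensemble_range(obs, members)
gray_mask = ~significant                    # points that would be shaded gray
approx_level = 1.0 / members.shape[0]       # assumed ~1/N significance level
```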

SH
The Southern Hemisphere (SH) midlatitude jet stream is important for the representation of weather and surface climate in the SH and also has global implications through the wind stress influence on Southern Ocean upwelling (Gent, 2016; Marshall & Speer, 2012) and the leakage of warm salty water from the Indian Ocean into the Atlantic through the Agulhas current (Biastoch et al., 2009; Tim et al., 2019). Yet, the SH jet has proven notoriously difficult to model with accuracy, with the majority of models in previous generations positioning the jet stream too far equatorward (Bracegirdle et al., 2013; Fyfe & Saenko, 2006; Russell et al., 2006; Wilcox et al., 2012). Less consistency has been found in model biases in SH jet strength (Russell et al., 2006), although many models exhibit a weaker jet stream and wind stress than observed in the Pacific sector (Bracegirdle et al., 2013; Fyfe & Saenko, 2006). Aside from its implications for the representation of present-day climate, the modeled position of the SH westerly jet stream correlates with its projected poleward shift in response to both ozone depletion (Son et al., 2010) and greenhouse gas forcing (Kidston & Gerber, 2010), primarily during the winter season (Simpson & Polvani, 2016). Given that models with lower-latitude jets that sit further from the observed location tend to show a larger poleward shift under forcing, this implies that many models may be overpredicting such responses, although the dynamics behind this remain to be understood (Simpson & Polvani, 2016). Intrinsically linked to the SH jet stream is the SH storm track, composed of transient synoptic scale baroclinic eddies that are responsible for the high- and low-pressure systems that bring day-to-day weather variability to the midlatitudes and are an important contributor to hemispheric energy, momentum, and moisture transports (Chang et al., 2002; Shaw et al., 2016).
An equatorward bias in the SH storm track accompanied that of the SH jet stream in both CMIP3 (Chang et al., 2013) and CMIP5 (Chang et al., 2012) models. In terms of storm track strength, the modeled representation has been found to be rather varied in previous intercomparisons but with a preponderance for the SH storm track to be too weak (Chang et al., 2012, 2013). Inaccuracies in the simulation of the SH jet stream and storm tracks can have many origins: Errors in the representation of clouds and radiation can impact on the equator-to-pole temperature gradient and hence baroclinicity (Ceppi et al., 2012); orographic processes have been shown to affect jet latitude (Pithan et al., 2016); and increasing resolution is often accompanied by an increase in the storm track strength (Chang et al., 2013; Hertwig et al., 2015). Meehl et al. (2019) recently assessed the influence of resolution and model physics in a series of CESM1 configurations and found comparable effects on storm track intensity arising from changes in both resolution and the representation of cloud radiative effects.
[Figure caption fragment: ... The SVR is depicted by the circle. Where the SVR lies above the bar, as in the first, third, and fifth bars here, the SVR is greater than 1 and vice versa. When the magnitude of the spatial mean bias is more than 10% of the ERA5 spatial mean, we consider that to be a "large" ...]
Characteristics of the tropospheric zonal mean midlatitude westerlies can first be assessed from Figure 2 (see Gettelman et al. (2019) for an assessment of the stratospheric zonal winds). Here, the jet latitude (speed) is defined as the location (magnitude) of the maximum 850-hPa zonal mean zonal wind, determined by a quadratic fit to the winds at the grid point maximum and the two adjacent grid points. During the summer (December-February, DJF), LENS exhibited zonal mean westerlies that were around 2 m s⁻¹ too strong on the poleward side of the jet and around 1 m s⁻¹ too weak on the equatorward side of the jet (Figure 2d). This leads to a jet that was slightly too far poleward and slightly too strong (Figure 2i, red). The strong bias on the poleward side of the jet is still present in CESM2 (Figure 2c) but is shifted slightly equatorward, such that the CESM2 jet stream is actually stronger than that in LENS, but is in approximately the correct location (Figure 2i, green and blue).
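The jet latitude and speed definition used here can be sketched as follows: a parabola is fitted through the grid point maximum and its two neighbors, and the vertex of the parabola gives the jet latitude and speed. The function name `jet_lat_speed` and the idealized wind profile are hypothetical, and the clamping of the stencil at the array edges is a choice made for this sketch.

```python
import numpy as np

def jet_lat_speed(u, lat):
    """Jet latitude and speed from a quadratic fit around the grid point maximum."""
    i = int(np.argmax(u))
    i = min(max(i, 1), len(u) - 2)      # keep the three-point stencil in range
    x = lat[i - 1:i + 2] - lat[i]       # latitudes relative to the central point
    a, b, c = np.polyfit(x, u[i - 1:i + 2], 2)
    # Vertex of a*x**2 + b*x + c (assumes a != 0, i.e., a genuine maximum).
    x_max = -b / (2.0 * a)
    return lat[i] + x_max, c - b**2 / (4.0 * a)

# Idealized SH zonal mean 850-hPa wind peaking between grid points.
lat = np.arange(-80.0, -19.0, 2.0)
u850 = 10.0 - 0.05 * (lat + 51.3)**2
jet_lat, jet_speed = jet_lat_speed(u850, lat)
```

The fit allows the jet latitude to fall between grid points, so that on a 2° grid the estimate is not restricted to multiples of 2°.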
The representation of the SH winter (June-August, JJA) jet stream in LENS was excellent, with the jet latitude and speed being very close to observed and the only bias of note in the zonal mean ua being subtropical westerlies that were too strong (Figure 2h). In CESM2, this bias in the subtropical westerlies remains, but in addition, a substantial westerly bias of around 2 m s⁻¹ is now found in the midlatitudes (Figure 2g). This leads to a wintertime jet stream that is stronger than observed but located at the correct latitude (Figure 2j, green and blue).
Figures 2i and 2j demonstrate the prevalence of an equatorward bias in the jet position across both the CMIP5 and CMIP6 models. Curtis et al. (2020) recently argued that, for the annual mean, the equatorward bias in the CMIP6 models was considerably reduced compared to that of CMIP5. For the models and time period considered here, which differ slightly from that in Curtis et al. (2020), we indeed find that the CMIP6 (including BCAM6 and BWACCM6) ensemble mean equatorward bias in the annual mean is only around 1.1° compared to around 2.6° in CMIP5. However, in the JJA season alone, not considered by Curtis et al. (2020), we still find an ensemble mean equatorward bias in the CMIP6 models of around 3.3°, which can be compared with a value of around 4.3° for the CMIP5 models. Thus, while there are some improvements, a substantial equatorward bias still remains in CMIP6, in JJA in particular, and there are still some models that place the SH westerlies about 16° too far equatorward in this season (Figure 2j). These biases are very large and can be compared with, for example, projected poleward shifts of the jet stream under climate change ranging between 0° and 6° (Barnes & Polvani, 2013; Simpson & Polvani, 2016). LENS had an excellent representation of the SH westerlies in comparison to other CMIP models and in CESM2 the fidelity of the jet position has been maintained but, unfortunately, the SH westerlies are now around 2-3 m s⁻¹ too fast in all seasons (Figures 2 and S1). A tendency toward zonal mean westerlies that are too strong is common among the CMIP models (Figures 2i and 2j), although not universal. However, CESM2 lies on the strongest end of the CMIP scale.
A local view of the SH jet stream and storm track is provided in Figure 3. Here 850-hPa ua is used to depict the jet stream (contours) and 850-hPa 10-day high-pass-filtered meridional wind variance (va′va′), using a Lanczos filter with 91 weights, is used to depict the storm track (shading). During the summer (Figure 3a), the SH westerlies and storm track are more zonally symmetric and located closer to the pole than their wintertime counterpart (Figure 3e) (Hoskins & Hodges, 2005). CESM2 broadly captures the main characteristics of the jet stream and storm track (Figures 3b and 3f), but during the summer, the jet stream exhibits a zonally symmetric westerly bias at the latitude of the Drake Passage (Figure 3c). This bias is similar to that in LENS (Figure 3d) but is larger in magnitude and extends farther equatorward. The CMIP6 models exhibit a similar westerly bias around New Zealand and similar easterly biases in the subtropical Atlantic and Pacific to those found in CESM2, but they do not uniformly exhibit a similar westerly bias around the Antarctic continent (Figure A1a).
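The storm track measure can be illustrated with a minimal sketch of a 91-weight Lanczos high-pass filter with a 10-day cutoff applied to a daily time series. The weights follow one common form of the Lanczos low-pass filter (a truncated sinc multiplied by the Lanczos sigma factor), normalized to sum to one and subtracted from a unit impulse; the exact formulation and normalization used for the paper are not stated, so these are assumptions. The function names and the sinusoidal test series are hypothetical.

```python
import numpy as np

def lanczos_highpass_weights(n_weights, cutoff_freq):
    """High-pass Lanczos weights; cutoff_freq is in cycles per time step."""
    n = (n_weights - 1) // 2
    k = np.arange(-n, n + 1)
    # Low-pass weights: ideal sinc response times the Lanczos sigma factor.
    low = 2.0 * cutoff_freq * np.sinc(2.0 * cutoff_freq * k) * np.sinc(k / n)
    low /= low.sum()                 # unit response at zero frequency
    high = -low
    high[n] += 1.0                   # high pass = unit impulse minus low pass
    return high

def highpass_variance(v_daily, weights):
    """Time mean of the squared high-pass-filtered series (v'v')."""
    v_hp = np.convolve(v_daily, weights, mode="valid")   # drops 45 days per end
    return np.mean(v_hp**2)

weights = lanczos_highpass_weights(91, 1.0 / 10.0)       # 10-day cutoff, daily data
t = np.arange(720)
var_fast = highpass_variance(np.sin(2.0 * np.pi * t / 4.0), weights)   # 4-day wave
var_slow = highpass_variance(np.sin(2.0 * np.pi * t / 40.0), weights)  # 40-day wave
```

A synoptic-scale (4-day) oscillation passes through almost unchanged, while a 40-day oscillation is almost entirely removed, which is the behavior wanted for isolating storm track variability.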
Substantial local biases in the jet stream also exist in CESM2 in winter (Figure 3g), and this represents a considerable degradation compared to LENS (Figure 3h). The westerlies are too strong to the south of Australia (also common to other CMIP6 models; Figure A1b) and to the south of South America, leading to the jet stream being less localized in the Indian Ocean sector in CESM2 than in observations.
Turning now to the representation of lower tropospheric storm track activity, improvements are seen in all seasons in CESM2 compared to LENS. LENS was characterized by a hemispheric lack of storm track activity, resulting in an unconditional bias in this field (salmon pink component of the LENS bar in Figures 3, S2j, and S2l). An unconditional bias in the sense of a lack of storm track activity is common to the CMIP5 and CMIP6 models (the prevalence of blue circles in Figures 3j and 3l). In CESM2 this unconditional bias has been substantially reduced, leading to an overall reduction in NMSE (green and blue bars in Figures 3, S2j, and S2l). In summer, the storm track activity was too weak over the whole of the southern midlatitudes in LENS (Figure 3d) and while a weak bias still remains in the ocean basins in CESM2, it is reduced (Figure 3c). In winter, LENS exhibited a rather dramatic low bias in the lee of the Andes as well as in the Atlantic and Indian Ocean basins, features that are common to other CMIP6 models (Figure A1d). These biases are substantially reduced in CESM2, although the hemispheric increase in storm track activity has now resulted in the Pacific sector exhibiting too much (Figure 3g). The role of changes to the orographic drag and blocking schemes in the improvement in the lee of the Andes is discussed further in section 4.3. Too little winter storm track activity off the coast of Antarctica, around the dateline, is a common feature to LENS (

NH
[Figure caption fragment: ... Bars depict the U, C, and P contributions to the NMSE (Equation 3) as described in the legend and circles depict the SVR. Circles are shaded red/blue when the unconditional bias corresponds to the magnitude of the mean bias being greater than 10% of the ERA5 spatial mean. Reanalyses are from left to right: ERA-Interim, MERRA2, and JRA55 (and ERA20C and 20CR for ua). The light red range that extends across each panel (i-l) depicts the minimum to maximum range of NMSE from the LENS members.]
The structure of the North Atlantic jet stream and storm track has received considerable attention in prior model intercomparisons, given the importance of this feature for the weather and climate of Western Europe. During winter, the North Atlantic jet stream and storm track are tilted from southwest to northeast and many prior model assessments have demonstrated that models are deficient in the representation of this tilting and instead have an overly zonal jet stream that intercepts the European continent too far south (Doblas-Reyes et al., 1998; Woollings, 2010) along with a lack of storm track activity in the Norwegian Sea (Ulbrich et al., 2008; Zappa et al., 2013). It has been argued that this aspect of the circulation is sensitive to resolution (Doblas-Reyes et al., 1998; Greeves et al., 2007) and the representation of orographic drag and blocking (Pithan et al., 2016; van Niekerk et al., 2017). In addition, metrics of the North Atlantic jet structure averaged over the basin indicate that models often exhibit an Atlantic jet stream that is too far south during the winter (Barnes & Polvani, 2013; Delcambre et al., 2013; Woollings & Blackburn, 2011) and a general tendency toward storm tracks that are too weak in summer and winter, both in terms of cyclone number and intensity (Chang et al., 2012; Zappa et al., 2013).
As demonstrated by Woollings et al. (2010), the climatology of the North Atlantic jet stream in winter is really the average over three preferred states. These states are apparent in the probability distribution function (PDF) of daily jet latitude, which indicates three preferred jet locations: the southernmost is thought to be associated with the occurrence of Greenland blocking (Hannachi et al., 2012); the central is associated with the positive phase of the East Atlantic pattern (Hannachi et al., 2012); and the northern position has recently been argued to be related to the occurrence of Greenland tip-jet events. Both the CMIP3 and CMIP5 models were shown to be deficient in their representation of this trimodal structure, instead exhibiting a jet latitude PDF that is too narrow and too peaked in the central location (Hannachi et al., 2013; Iqbal et al., 2018).
Compared to the Atlantic, previous model intercomparisons have exhibited less consistency in Pacific jet stream biases. In terms of jet location, Barnes and Polvani (2013) showed the CMIP5 models to be rather evenly distributed around the observed location in the annual mean. In the upper troposphere, Delcambre et al. (2013) demonstrated widely varying representation of the jet stream across CMIP3 models, particularly in the jet exit region, but again with a lack of consistency in the nature of the biases.
Given the larger land fraction and prevalence of stationary waves in the NH, we consider metrics for the Pacific and Atlantic jet streams separately. The local jet latitude and speed are defined as the latitude and magnitude of the maximum ua at 850 hPa, between 20°N and 65°N, at each longitude, determined by a quadratic fit to the grid point maximum and the two adjacent grid points. For the Atlantic basin, the jet latitude and speed are defined as the mean of the local jet latitudes and speeds over the grid points between 60°W and 10°W but we note that the same conclusions can be drawn for the latitude and speed of the zonal wind averaged over the Atlantic sector first. The Atlantic jet tilt, in degrees latitude, is defined as the angle, relative to the zonal direction, of the best fitting straight line to the local jet latitudes between 60°W and 10°W. For the Pacific, these calculations are performed over 150°E to 130°W.
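The jet metrics defined above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names `local_jet` and `sector_jet_metrics` are hypothetical, the input is assumed to be a climatological 850-hPa ua field already restricted to the sector longitudes, and the tilt is returned as the latitude change of the fitted line across the sector, which is only one possible reading of "tilt in degrees latitude".

```python
import numpy as np

def local_jet(u, lat):
    """Latitude and speed of the maximum of u(lat), refined by a parabola
    through the grid-point maximum and its two neighbours."""
    i = int(np.argmax(u))
    if i == 0 or i == len(u) - 1:
        # Maximum sits on the edge of the search band: no refinement possible.
        return float(lat[i]), float(u[i])
    x = lat[i - 1:i + 2] - lat[i]          # centre latitudes for a well-conditioned fit
    a, b, c = np.polyfit(x, u[i - 1:i + 2], 2)
    if a >= 0.0:
        # Flat or degenerate triplet: fall back to the grid-point value.
        return float(lat[i]), float(u[i])
    # Vertex of the parabola a*x**2 + b*x + c.
    return float(lat[i] - b / (2.0 * a)), float(c - b * b / (4.0 * a))

def sector_jet_metrics(u850, lat, lon, lat_band=(20.0, 65.0)):
    """u850 has shape (lat, lon) and covers only the sector longitudes.
    Returns (mean jet latitude, mean jet speed, tilt across the sector)."""
    band = (lat >= lat_band[0]) & (lat <= lat_band[1])
    fits = np.array([local_jet(u850[band, j], lat[band]) for j in range(lon.size)])
    slope = np.polyfit(lon, fits[:, 0], 1)[0]   # degrees latitude per degree longitude
    tilt = slope * (lon[-1] - lon[0])           # assumed definition: change across sector
    return fits[:, 0].mean(), fits[:, 1].mean(), tilt
```

For the Atlantic, `lon` would span 60°W to 10°W; for the Pacific, 150°E to 130°W, with longitudes expressed so that they increase monotonically across the sector.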
For the North Atlantic during winter, we will assess the PDF of daily jet latitude as defined by Woollings (2010). The location of the jet maximum between 15°N and 75°N is obtained for the 10-day low-pass-filtered (Lanczos filter with 91 weights) daily 850-hPa ua, averaged between 60°W and the Greenwich Meridian. This jet latitude time series is then deseasonalized by removing the annual mean and the first three harmonics of the seasonal cycle. The jet latitude PDF is presented using the kernel estimate of Silverman (1981) (their Equation 1) with the smoothing parameter $h = 1.06\,\sigma\,n^{-1/5}$, where σ is the standard deviation and n is the sample size, as in Woollings (2010).
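The kernel estimate with this smoothing parameter can be sketched as follows. The Gaussian kernel and the function name `jet_latitude_pdf` are assumptions made for illustration; the text specifies only the bandwidth rule, and the filtering and deseasonalizing steps are taken as already done.

```python
import numpy as np

def jet_latitude_pdf(samples, grid):
    """Kernel density estimate of daily jet latitude anomalies.

    Uses a Gaussian kernel (an assumption) with bandwidth
    h = 1.06 * sigma * n**(-1/5), as quoted in the text.
    """
    x = np.asarray(samples, dtype=float)
    g = np.asarray(grid, dtype=float)
    n = x.size
    h = 1.06 * x.std(ddof=1) * n ** (-0.2)
    z = (g[:, None] - x[None, :]) / h           # shape (grid, samples)
    kernel_sum = np.exp(-0.5 * z ** 2).sum(axis=1)
    return kernel_sum / (n * h * np.sqrt(2.0 * np.pi))
```

The probability of the jet sitting in a given latitude band then follows by integrating the returned density over that band.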
During winter, the ERA5 North Atlantic jet stream is located at 46.6°N (Figure 4b) with a mean speed of 9.8 m s⁻¹ (Figure 4c) and is tilted by ∼16° latitude (Figure 4d), although both LENS and the twentieth century reanalyses indicate substantial uncertainty on these metrics with a 36-year record. Nevertheless, Figures 4a-4d demonstrate that even given the sampling uncertainty, many of the CMIP5 and CMIP6 models exhibit a North Atlantic jet stream that is located too far south, is too fast and is lacking in the southwest to northeast tilt, in agreement with the aforementioned previous studies. The ensemble spread in BCAM6 encompasses the observed value in both jet latitude and tilt and it intercepts the continent at close to the correct location (over the United Kingdom), while the majority of other CMIP models intersect the continent over France (Figure 4a). The CESM2 wintertime North Atlantic jet is, however, too fast by about 2 m s⁻¹ in all configurations (Figure 4c). Interestingly, the representation of the North Atlantic jet stream is degraded in WACCM6 compared to CAM6, in that it does not exhibit the correct southwest to northeast tilt in the east Atlantic, meaning that the jet stream intersects the continent at lower latitudes over France, like many of the other CMIP5 and CMIP6 models (Figure 4a). Given that this difference between the high-top and low-top configuration is consistent between the coupled and uncoupled configurations, it is likely a robust feature of the difference between CAM6 and WACCM6 but remains to be understood.
Compared to winter, the summer North Atlantic jet stream is situated further north, is weaker and is less tilted (Figures 4e-4h). The CMIP5 and CMIP6 models are relatively evenly distributed around the observed values with no clear tendency toward a bias of one sign or the other (Figures 4f-4h, gray points). LENS simulated these characteristics of the North Atlantic jet extremely well with its range encompassing ERA5 in all three metrics (Figures 4f-4h, red range). While CESM2 captures the summertime jet tilt correctly, it now places the jet stream too far north and the jet speed is about 1 m s⁻¹ too fast. In contrast to the wintertime, the summertime jet structures in CAM6 and WACCM6 are very similar to each other.
The PDF of DJF daily jet latitude for the Atlantic jet is shown in Figure 4i. The reanalyses exhibit three preferred jet latitude locations: the southern location about 12° south of the mean jet; the central location, about 3° south of the mean jet; and the northern location about 8° north of the mean jet. Interestingly, both ERA5 and JRA55 exhibit an additional local maximum in the PDF around 2° north of the mean jet, a feature not present in ERA-Interim or MERRA2. To assess the representation of this PDF for each of the models, the probability of the jet being located in the 5° latitude bands centered on the southern, central, and northern ERA5 PDF peaks (gray shading in Figure 4i) is shown in Figures 4j-4l. The probability of occurrence in the central locations is well represented by CESM2, LENS and many of the CMIP5 and CMIP6 models. The probability of occurrence in the northern location is also represented well in CESM2 and LENS while many of the CMIP5 and CMIP6 models exhibit a reduced probability in this region, consistent with Hannachi et al.'s (2012) conclusions that the PDF is too narrow. The two features of concern for CESM2's representation of the jet latitude PDF are that there is a reduced probability compared to observations of the jet being situated in the southern location, with a compensating enhanced probability of it sitting between the central and northern locations (Figure 4i). The reduced probability of occurrence in the southern location is a feature that is common to many CMIP models (Figure 4j) and LENS (Kwon et al., 2018) and is consistent with the reduced number of Greenland blocking events compared to observations. This aspect looks to be slightly improved in CESM2 compared to LENS, which is consistent with an improvement in Greenland blocking to be discussed in section 6.4.1. An interesting difference between CAM6 and WACCM6 is that, relative to CAM6, WACCM6 exhibits an increased probability of the jet being situated in the central latitudes and a reduced probability of it being situated in the northern latitudes (compare green and blue in Figure 4i). This is a robust difference present in both the coupled and uncoupled runs, but it is not clear whether WACCM6 or CAM6 is closer to observed.

Table S1. (i) The PDF of daily jet latitude with gray vertical shaded regions depicting the latitude range used for the probabilities shown in (j)-(l). (j-l) The probability of the jet latitude being in the southern, central, and northern regions (gray regions in (i)). Green and blue shading in (a), (e), and (i) depict the minimum to maximum range for the 11 BCAM6 and 3 BWACCM6 ensemble members, respectively. ERA20C and 20CR are not shown for the daily metrics.
For the Pacific jet stream (Figure 5), during winter, all three jet metrics are rather well simulated in CESM2 (Figures 5b-5d). Many of the CMIP5 and CMIP6 models and LENS have the Pacific jet stream placed too far equatorward, but this is not true in coupled CESM2, and the uncoupled CESM2 runs actually exhibit a Pacific jet stream that is too far poleward (Figure 5b). During the summer, the Pacific jet is placed too far poleward in all CESM2 configurations, a problem that was also present in LENS (Figure 5f). In addition, the Pacific jet stream has now become a bit too strong in CESM2, while its speed was well represented in LENS (Figure 5g).
Taking a broader view of the spatial structure of the jet streams and storm tracks by considering 850-hPa zonal wind and 10-day high-pass-filtered eddy meridional wind variance in Figure 6, it is clear that the climatology during the wintertime is greatly improved in both aspects in CESM2. The large biases in 850-hPa ua that were present in the Pacific in LENS, leading to a jet stream that was placed too far equatorward, are now gone and the biases that extended across the Atlantic basin are reduced (Figure 6c compared to 6d). The only bias of note in 850-hPa ua that remains is the westerlies that are too strong over western Europe and an accompanying easterly bias over North Africa (Figure 6c), which is a feature common to the majority of CMIP6 models (Figure A1e). These climatological biases in CESM2 of around 3 m s⁻¹ over Europe and North Africa are larger than the average projected end-of-century changes in this region, which are of the order 1 m s⁻¹ (Woollings & Blackburn, 2011, their Figure 1a). Nevertheless, consideration of the overall NMSE for DJF 850-hPa ua (Figure 6i) makes clear that CESM2 is a top ranking model in this aspect and is improved compared to LENS.
Massive improvements are also found in the representation of the NH winter storm tracks (Figures 6c and 6d). As measured by the 850-hPa 10-day high-pass-filtered eddy meridional wind variance (va′va′), LENS exhibited storm tracks in the Atlantic, Pacific, and the lee of the Rockies that were too weak, in some regions by a very large fraction (>50%). A tendency toward weaker storm tracks than observed in the lee of the Rockies, over the North Atlantic and in the central Pacific is also common among the CMIP6 models (Figure A1g) and when a large unconditional bias is present in the CMIP models, it is typically negative, that is, the storm tracks are too weak (the prevalence of blue circles in Figure 6j). The biases that were present in LENS are now greatly reduced in CESM2, and this appears as a substantial reduction in NMSE arising from both an elimination of the unconditional bias and a reduction in the phase error (Figure 6j), leaving CESM2 as one of the highest ranking CMIP6 models. The role of changes in the orographic drag formulation in the improvements in the lee of the Rockies is discussed in section 4.3 below.
During the summertime, there has been a degradation in the representation of the overall 850-hPa ua field in CESM2 compared to LENS, largely due to strengthened westerlies around the Arctic circle. This degrades even further in the uncoupled simulations. Despite the degradation in ua, large improvements are seen in the NH storm tracks in JJA, again through the elimination of the negative unconditional bias that was present in LENS and is found in many other CMIP5 and CMIP6 models (light gray (salmon pink) component of the CMIP (LENS) bars in Figure 6l).

The Influence of New Orographic Schemes on Storm Track Activity in the Lee of Large Mountain Ranges
As discussed above, in the lee of both the Andes in the SH and the Rockies in the NH, there are large improvements in the representation of lower tropospheric meridional wind variance on synoptic timescales. These improvements are more notable in the winter season of each hemisphere and are further highlighted in Figure 7 for the Rockies during DJF (a-c) and for the Andes during JJA (i-k). The large increase in meridional wind variance in the lee of the mountains in the coupled simulations is also present in the uncoupled difference between FCAM6* and FCAM5, that is, where the CAM6 physics package is replaced with CAM5 (Figures 7d versus 7e and 7l versus 7m). A large fraction of the increase in storm track activity in the lee of these mountain ranges can be explained by the sum of the influences of the new TOFD and MOB schemes.
See panels (f) and (n) for their sum and (g) and (h) and (o) and (p) for their individual contributions. The TOFD scheme is by far the dominant influence in the lee of the Rockies. In the lee of the Andes, both play a role, but the sum of their individual influences falls short of the total change found in going from CAM5 to CAM6 physics. We do not investigate the additional influences on the downstream storm tracks here as the change in storm track activity over the ocean basins to the east is more muted in the CAM6* − CAM5 difference than in the BCAM6-LENS difference, but it is something that would be interesting to address in coupled simulations in a subsequent study.

Stationary Waves
Stationary waves, or zonal asymmetries, in the atmospheric circulation arise through interactions between the zonal mean flow, orography, diabatic heating, and transient vorticity and divergence fluxes (e.g., Held et al., 2002; Wang & Ting, 1999). They play an important role in determining regional climate (e.g., Broccoli & Manabe, 1992; Rodwell & Hoskins, 2000) and in projected future regional climate change (e.g., Seager et al., 2014; Shaw & Voigt, 2015; Wills et al., 2019) and are, therefore, an important aspect to simulate correctly. Notable biases that have been discussed in previous model intercomparison studies are the inability of models to accurately capture the trough/ridge pattern over the North Atlantic in winter (Boyle, 2006) and the inability of models to capture the localized anticyclonic circulation to the south of New Zealand in summer (Simpson et al., 2013). These issues are related to the overly zonal jet stream and storm track in the North Atlantic and the zonally extended SH jet stream discussed above. Here, we assess the representation of the climatological stationary waves using 300-hPa eddy stream function (ψ*), but the extratropical features have an equivalent barotropic structure and similar conclusions can be drawn throughout the depth of the troposphere there.
A comparison of Figures 8b and 8c indicates substantial improvements in the representation of DJF NH stationary waves in CESM2 compared to LENS. The large biases in LENS over the subtropical and midlatitude Atlantic and Pacific oceans have been substantially reduced in accordance with the improvements in the jet streams discussed above. The primary bias that remains is the anticyclonic circulation centered over the Mediterranean and cyclonic anomalies to the south and north of this. These features are common among the CMIP6 models (Figure A1i) and are associated with the westerly (easterly) biases over Europe (North Africa) discussed above. We also note that the high latitude stationary wavenumber 1 bias present in CCSM4 discussed in Shaw et al. (2014) has been alleviated in both CESM1 and CESM2 (not shown). In terms of the NH winter NMSE, CESM2 is one of the highest ranking models in the CMIP6 archive and is improved over LENS through reductions in both the conditional bias and the phase error (Figure 8g).
During JJA, the representation of the stationary waves in the Pacific-North American sector is quite comparable to that in LENS (Figure 8e compared to 8f). The positive ψ* bias that was present over the Middle East and North Africa in LENS has been reduced, leading to an overall reduction in the NMSE (Figure 8h) and making CESM2 one of the top ranking models in this regard in the CMIP6 archive. However, compared to the coupled simulations, the representation of the NH summer stationary waves is degraded in the simulations with prescribed observed SSTs (hatched bars in Figure 8h) and the role of coupling versus different underlying SSTs in this will be discussed in section 5.3 below.
Unfortunately, the representation of SH stationary waves has not fared so well under the CESM2 developments. In the summer, the SH stationary wave representation in CESM2 is rather comparable to that in LENS (Figures 8b, 8c, and 8i). The SH winter stationary waves were represented extremely well in LENS, and it ranked among the top of all CMIP5 and CMIP6 models (Figure 8j). In CESM2, along with the degradation of the SH westerlies discussed above, the SH winter stationary waves have degraded across the whole of the midlatitudes with, most notably, a large cyclonic bias to the south of Australia. This has resulted in a substantial increase in NMSE for SH stationary waves in JJA, although CESM2 still sits roughly in the middle of the CMIP6 range in this metric (Figure 8j).

The Global Divergent Circulation
The global divergent circulation is intrinsically connected to the hydroclimate of the tropics and the forcing of extratropical stationary waves (Sardeshmukh & Hoskins, 1988) and is, therefore, a feature that should be accurately simulated in order to give a reasonable representation of the climate system, but it is more challenging to observationally constrain than the extra-tropical rotational flow.
We begin our analysis of the divergent circulation by assessment of bulk measures of the low latitude, zonally symmetric, mean meridional circulation, or Hadley circulation, given by the mean meridional stream function, where va here refers to the zonal mean meridional wind, p refers to pressure level, φ to latitude, and g to the acceleration due to gravity. Figures 9a and 9b, and 9e and 9f indicate that CESM2 depicts the summer and winter Hadley cells with reasonable fidelity. A noticeable difference is present in DJF at low levels in the equatorial region where the clockwise overturning circulation extends too far south, a feature that was also present in LENS (Figures 9c and 9d). In JJA, the anticlockwise cross equatorial Hadley cell is now too strong (Figure 9g), whereas it was slightly too weak in LENS (Figure 9h). In the equinox seasons, there are notable biases in the mean meridional circulation on the equator, which reflect the

Table S2. Panels (e)-(h) are as in (a)-(d) but for JJA. Green and blue shaded ranges in (a) and (e) depict the minimum to maximum range for the 11 BCAM6 and 3 BWACCM6 members.

This is obtained at 500 hPa by the linear interpolation between the two grid points either side of the zero crossing (Adam et al., 2018). The scatter of the reanalysis points illustrates the challenge in observationally constraining this measure, but overall, CESM2 lies within the range of observational uncertainty. In both DJF and JJA, it is extremely close to the ERA5 value. Hadley cell extent varies by about 5° latitude across the CMIP5 and CMIP6 models in each season, but they are evenly distributed around the observed value, with no systematic tendency toward a bias of one sign or the other.
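For reference, the mean meridional stream function used to characterize the Hadley circulation is conventionally written as below. This is the standard textbook form, offered as an assumption rather than a reproduction of the authors' equation: the symbol Ψ and the Earth's radius a are not defined in the text, while va, p, φ, and g are the quantities listed there.

```latex
\Psi(\varphi, p) = \frac{2\pi a \cos\varphi}{g} \int_{0}^{p} v_a \,\mathrm{d}p'
```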
We define the Hadley cell intensity as the maximum magnitude of the mean meridional stream function at 500 hPa, determined by a quadratic fit to the grid point maximum of the absolute value of stream function and the two adjacent grid points. The CMIP models exhibit wide ranging values for this metric of Hadley cell intensity with a slight propensity for an overly strong Hadley circulation (Figures 9i and 9j, bottom). Similar degrees of intermodel spread have been found in older model intercomparisons and Caballero (2008) related this to the varied representation of extratropical eddy driving. The Hadley cell intensity is well represented in CESM2 and lies within the reanalysis ranges in both seasons.
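The Hadley cell diagnostics described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the function names are hypothetical, the values of the Earth's radius and g are assumed, the stream function is taken in its conventional form (vertical integral of the zonal mean meridional wind), and the top pressure level is assumed to sit at or near zero pressure.

```python
import numpy as np

A_EARTH = 6.371e6  # Earth's radius in m (assumed value)
G = 9.81           # gravitational acceleration in m s^-2 (assumed value)

def mass_streamfunction(v, p, lat):
    """Mean meridional stream function in kg/s.

    v: zonal mean meridional wind, shape (plev, lat), in m/s.
    p: pressure in Pa, increasing downward from the top level.
    """
    layers = 0.5 * (v[1:] + v[:-1]) * np.diff(p)[:, None]   # trapezoidal layers
    integral = np.vstack([np.zeros((1, v.shape[1])), np.cumsum(layers, axis=0)])
    return 2.0 * np.pi * A_EARTH * np.cos(np.deg2rad(lat)) / G * integral

def first_zero_crossing(psi, lat, start, step=1):
    """Walk from index `start` in direction `step` and return the latitude at
    which psi first changes sign, linearly interpolated between grid points."""
    i = start
    while 0 <= i + step < len(lat):
        j = i + step
        if psi[i] * psi[j] <= 0.0 and psi[i] != psi[j]:
            w = psi[i] / (psi[i] - psi[j])       # fractional distance to the zero
            return float(lat[i] + w * (lat[j] - lat[i]))
        i = j
    return float("nan")                           # no sign change found
```

For a Northern Hemisphere cell, `start` would be the index of the 500-hPa stream function extremum and `step` would point poleward; the intensity can be refined with the same three-point quadratic fit used for the jet latitude.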
We provide a more local view of the upper level divergent flow in Figure 10 with the 200-hPa velocity potential (χ). Velocity potential is proportional to the inverse Laplacian of divergence and is related to ua and va such that the divergent winds flow perpendicular to isolines of χ, out of regions of negative χ and into regions of positive χ. During DJF, the global structure of χ depicts divergence in the western tropical Pacific and South Pacific convergence zone and convergence over the subtropical Atlantic, Mediterranean, and eastern sides of the southern subtropical ocean basins (Figure 10a). In JJA, the maximum divergence is present over Asia and the north tropical Pacific, with convergence in the southern subtropical Atlantic (Figure 10d).

The NMSE and its components (see legend) along with the SVR (circles) for NH DJF, NH JJA, SH DJF, and SH JJA, respectively. The light red range that spans each panel (g-j) depicts the minimum to maximum range of NMSE of the 40 LENS members and the ordering of the reanalyses from left to right is ERA-Interim, MERRA2, JRA55, ERA20C, and 20CR. Note there is no unconditional bias for ψ* since its average around the longitude circle is 0 by definition.
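The relationship between the velocity potential and the horizontal wind is commonly written as below. This is the standard convention that matches the flow direction stated in the text (out of regions of negative χ, into regions of positive χ); it is given here as an assumption, and the symbols for the divergent wind components, the Earth's radius a, and longitude λ are not taken from the text.

```latex
\nabla^{2}\chi = \nabla \cdot \mathbf{u}, \qquad
\mathbf{u} = (u_a, v_a), \qquad
(u_\chi, v_\chi) = \nabla\chi
= \left( \frac{1}{a\cos\varphi}\frac{\partial \chi}{\partial \lambda},\;
         \frac{1}{a}\frac{\partial \chi}{\partial \varphi} \right)
```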
During DJF, LENS exhibited too much divergence over the western Indian Ocean (Figure 10c), which is a common feature among CMIP6 models (Figure A1k) and was associated with too much precipitation in that region. In CESM2, the excess divergence is reduced and shifted further west and accompanies a reduction in the excess precipitation bias (Danabasoglu et al., 2019). Overall, the errors in the 200-hPa velocity potential field have been reduced in DJF, making CESM2 one of the top ranking models in this metric (Figure 10g).
During JJA, the representation of 200-hPa velocity potential is also substantially improved in CESM2 compared to LENS (Figures 10e and 10f) with the large biases that were present over the Indian Ocean, East Asia, tropical Atlantic, and Amazonia now greatly reduced, in association with precipitation improvements in these regions (Meehl et al., 2020). CESM2 is a top ranking model in the JJA global divergent flow field and, rather remarkably, the NMSE in CESM2 is not that much larger than for ERA-Interim relative to ERA5 (Figure 10h). That being said, the difference between ERA5 and ERA-Interim is not in the same location as the biases found in CESM2, so there are still improvements that can be made. Furthermore, as with the JJA NH stationary waves, Figure 10h makes clear that the simulation of the global divergent circulation is degraded in uncoupled simulations with prescribed observed SSTs, a feature that will now be discussed in more detail.

The Role of Ocean-Atmosphere Coupling and SST Differences in the Difference Between Coupled and Uncoupled Simulations
Compared to the coupled simulations, the representation of the NH stationary waves and global divergent circulation in JJA is degraded when observed SSTs are prescribed in both CAM6 and WACCM6 (Figures 8h and 10h). There are two possible reasons for this. One is that the lack of ocean-atmosphere coupling in the prescribed SST simulations could be responsible for the degradation. If that were the case, then we should not be concerned about this degradation as the presence of coupling is more realistic. The other possibility is that the different SSTs in the real world, compared to the model, are responsible. If this is the case, then it would indicate that compensating biases, one of which lies in the SST field, are responsible for the improvements in the coupled simulation over the uncoupled simulations.

along with the SVR (circles) for DJF and JJA, respectively. The light red range that extends across panels (g) and (h) depicts the minimum to maximum range of the LENS ensemble members and the ordering of the reanalyses from left to right is ERA-Interim, MERRA2, JRA55, ERA20C, and 20CR. Note there is no unconditional bias for χ since its global average is 0.
To assess which of these is true, a comparison with the simulation FCAM6MOD, where SSTs from the model are prescribed (Table 2), is provided in Figure 11. This reveals that the biases in both χ and NH ψ* in FCAM6MOD are similar to those in BCAM6 (compare Figure 11d with 8e and Figure 11g with 10e) and smaller than those in FCAM6 (Figures 11c and 11f). FCAM6 exhibits larger biases than both FCAM6MOD and BCAM6 in χ over the Atlantic and Indian Ocean sectors (Figure 11f) and in ψ* over much of North Africa and Eurasia. The implication of this is that the degradation in the uncoupled compared to coupled simulations arises through differences in the SSTs and not the lack of coupling, indicating that compensating errors are contributing to the fidelity of the coupled simulation. Compared to the modeled SSTs, the observed SSTs are cooler in the tropical Atlantic (Figure 11a), which is likely responsible for the reduced divergence in that region in FCAM6 and compensating reduced convergence over the Middle East and Asia (Figure 11h) and the degradation in the stationary wave representation over North Africa and Eurasia (Figure 11e).

Extratropical Variability: SAM, NAM, NAO, and Blocking
In addition to representing the climatological features of the large-scale atmospheric circulation with fidelity, ESMs must also accurately represent the higher frequency variability of the climate system. In this realm, there are many features that could be assessed. Here we focus on the dominant modes of extratropical atmospheric variability along with the representation of atmospheric blocking, given the attention that has been paid to these features in previous model intercomparisons. Note that we have also presented a climatological view of extratropical storm track statistics and North Atlantic jet variability in section 4 above, which could also fall under this category of extratropical variability.

The SAM
The Southern Annular Mode (SAM), also known as the Antarctic Oscillation, is the dominant mode of variability in the SH extratropical circulation (Gong & Wang, 1999;Thompson & Wallace, 2000). It can be defined using zonal wind, geopotential height, or sea level pressure on daily to seasonal timescales and is most commonly identified as the first empirical orthogonal function (EOF) of variability in these fields. In the troposphere, the positive phase of the SAM represents a poleward shifting of the eddy-driven midlatitude westerlies, while in the stratosphere it represents a strengthening of the stratospheric polar vortex.
A model's representation of the tropospheric SAM is of particular interest, not only because of its influence on regional surface temperature and hydroclimate variability (Gillett et al., 2006) but also because the SH midlatitude circulation response to many forcings, such as ENSO (L'Heureux & Thompson, 2006), ozone depletion (Thompson & Solomon, 2002), and increasing GHG concentrations (Gillett & Fyfe, 2013;Miller et al., 2006), projects strongly onto the SAM. The dominance of the SAM in both internal variability and the response to forcings in the troposphere is thought to arise because of a positive eddy-mean flow feedback that occurs in response to SAM-like circulation anomalies (Lorenz & Hartmann, 2001), although this interpretation has recently been questioned (Byrne et al., 2016).
One metric of SAM variability that was thought to provide an indication of the strength of eddy-mean flow feedbacks is its persistence timescale (Gerber & Vallis, 2007), although this is complicated by the fact that additional forcings, for example, stratospheric variability, also impart additional persistence on the SAM (Baldwin et al., 2003; Simpson et al., 2011). Prior model intercomparison studies have demonstrated the tendency for models to exhibit an overly persistent SAM (Gerber et al., 2010; Kidston & Gerber, 2010; Son et al., 2010), raising concern that the eddy-mean flow feedbacks in such models may be too strong and, as a result, they may produce a response to external forcings that is too large. Indeed, in support of this viewpoint, Kidston and Gerber (2010) argued that CMIP3 models with a more persistent SAM also exhibited a larger response to GHG forcing. However, Simpson and Polvani (2016) have since shown, with a larger number of CMIP5 models and through consideration of individual seasons, that there is no significant relationship between a model's SAM timescale and its response to external forcings. So, while there is good theoretical reasoning to expect the SAM timescale to have bearing on a model's response to external forcing (Leith, 1975), and there is ample evidence for such a correspondence from idealized modeling studies, evidence for the connection between SAM persistence and the circulation response to forcings in the more comprehensive system is currently lacking.

Here, we follow Gerber et al. (2010) and define the SAM for the full year as the first EOF of daily zonal mean geopotential height after subtracting the global mean, deseasonalizing, and linearly detrending. The EOF is calculated on each vertical level separately, using cosine weighted data from the equator to 90°S and the SAM index is the accompanying Principal Component (PC) time series. To determine the SAM timescale, the lagged autocorrelation function of the SAM index is first determined for each day of the year according to Equation 1 of Simpson et al. (2011). This is then smoothed over a 181-day window using a Gaussian filter with a full width at half maximum of around 42 days and the SAM timescale is the e-folding timescale of a least squares exponential fit to this smoothed autocorrelation function out to a lag of 50 days.

Figure 12a presents the zonal mean structure of the SAM on selected pressure levels for BCAM6 and BWACCM6, weighted by cosine of latitude so that it represents a mass displacement. These height anomalies are in geostrophic balance with a poleward shifting of the midlatitude westerlies in the troposphere and a strengthening of the polar vortex in the stratosphere. Following Gerber et al. (2010), in Figure 12b, we characterize the SAM structure by the latitude of the node of the dipolar zg anomalies, that is, the midlatitude zero-crossing point as illustrated in Figure 12a. In Figure 12c we show the latitude weighted, root-mean-square amplitude of the annular mode structure. These are shown as a function of height for each CESM2 simulation and the reanalysis and only at 500 hPa for all other simulations.
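The two core steps of this procedure, the leading EOF and the e-folding timescale, can be sketched as below. This is a simplified illustration under stated assumptions: the function names are hypothetical, the square root of cosine latitude is assumed as the weighting so that the covariance is area weighted, the exponential fit is done by a brute-force search over candidate timescales, and the day-of-year dependence and Gaussian smoothing of the autocorrelation function are omitted.

```python
import numpy as np

def leading_eof(anom, lat):
    """Leading EOF of anomalies with shape (time, lat).

    Returns (pattern, pc): pc is standardized and pattern is the regression
    of the unweighted anomalies onto it (units of anom per std of pc).
    """
    cosphi = np.clip(np.cos(np.deg2rad(lat)), 0.0, None)
    w = np.sqrt(cosphi)                          # assumed area weighting
    u, s, _ = np.linalg.svd(anom * w, full_matrices=False)
    pc = u[:, 0] * s[0]
    pc = (pc - pc.mean()) / pc.std()
    pattern = anom.T @ pc / pc.size
    return pattern, pc

def efolding_timescale(acf, lags, taus=None):
    """Least squares fit of exp(-lag / tau) to an autocorrelation function,
    by direct search over candidate timescales (in the units of `lags`)."""
    if taus is None:
        taus = np.linspace(0.5, 200.0, 19951)    # 0.01-day spacing (assumed range)
    model = np.exp(-lags[None, :] / taus[:, None])
    sse = ((model - acf[None, :]) ** 2).sum(axis=1)
    return float(taus[np.argmin(sse)])
```

In practice `lags` would run from 0 to 50 days and `acf` would be the smoothed autocorrelation for a given day of the year, giving a timescale for each calendar day.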

CESM2 captures the amplitude of the SAM very well, with the reanalysis amplitude sitting within the range of the 10 BCAM6 members for all levels except 10 hPa, where the amplitude of reanalysis variability is slightly lower than that found in the model (Figure 12c). Most CMIP models also simulate a SAM amplitude at 500 hPa that is close to observed. The location of the SAM node is well captured in the troposphere in CESM2 (Figure 12b), which is in contrast to many CMIP models where the node is placed too far equatorward in association with the equatorward bias in their climatological jet location (section 4.1). In the lower stratosphere, the location of the SAM node in CESM2 is displaced slightly equatorward of that found in reanalysis.
The latitude-longitude structure of geopotential height variability that accompanies the zonal mean SAM can be assessed in Figures 12d-12g, which show the regression of 500-hPa geopotential height (after deseasonalizing and detrending) onto the SAM index. While the SAM is predominantly zonally symmetric, some zonal asymmetries do exist in its structure and these are more pronounced in the reanalysis than in CESM2 (Figure 12d versus 12e). In the reanalysis, a localized intensification of the Antarctic low occurs over the Amundsen-Bellingshausen Sea region, which leads to poleward flow toward the west of the Antarctic Peninsula and contributes to warmer surface temperatures over West Antarctica during the positive phase of the SAM as well as regional sea ice anomalies (Lefebvre et al., 2004; Sen Gupta & England, 2006). This feature is largely absent in CESM2 as evidenced by the large positive height bias off West Antarctica in Figure 12f, but it was well reproduced in LENS (Figure 12g). In addition, CESM2 is lacking in the local intensifications of the positive height anomalies to the southeast of New Zealand, off the east coast of South America and in the southern Indian Ocean. Similar biases are common among the CMIP6 models (Figure A2a), but this was rather well represented in LENS (Figure 12g). Nevertheless, when considering the NMSE of the 500-hPa SAM structure (Figure 12k), despite being degraded slightly compared to LENS, CESM2 still ranks highly among the CMIP5 and CMIP6 models.
CESM2 accurately captures the enhanced SAM persistence during spring in the stratosphere and the associated enhanced SAM persistence in the spring and early summer in the troposphere (Figures 12h-12j), which arise due to variability in the timing of the polar vortex breakdown and its downward influence (Baldwin et al., 2003; Byrne & Shepherd, 2018; Simpson et al., 2011). However, as is commonly found in models, the enhanced tropospheric persistence extends too late into January and February. Given the importance of stratospheric variability in SAM persistence, rather than comparing with LENS, we compare BCAM6 with BWACCM6 in Figure 12j. The SAM persistence in BCAM6 is rather comparable to that in BWACCM6 and the greater springtime stratospheric persistence in BWACCM6 is not significant. A comparison across models of the DJF averaged SAM persistence at 500 hPa is provided in Figure 12l. This is the season when prior studies have demonstrated an overly persistent SAM in models, while during JJA, all models compare well with observations (not shown). Many of the CMIP5 and CMIP6 models exhibit SAM persistence that is greater than observed, although LENS indicates that the sampling uncertainty is large. At least from the data currently available, none of the CMIP6 models exhibit extreme SAM persistence like that found in a few CMIP5 models. In particular, the IPSL model, which was extremely persistent in CMIP5, is now improved in CMIP6. All CESM2 configurations have a reasonable representation of SAM persistence.
Overall, the structure and persistence of the SAM are well represented in CESM2, but there are some notable degradations in the representation of zonal asymmetries in the SAM structure around the Antarctic continent, which may have implications for the representation of surface temperature and sea ice variability.

The NAM
The Northern Annular Mode (NAM), also known as the Arctic Oscillation, is the NH equivalent to the SAM and is the dominant mode of NH zonal mean extratropical circulation variability. It is accompanied by surface signatures in temperature and precipitation and forced circulation responses are often found to project onto the NAM (Gillett & Fyfe, 2013;Miller et al., 2006;Thompson & Wallace, 1998). Previous model intercomparison studies that assessed the monthly NAM have shown a systematic tendency for models to underestimate the amplitude of the centers of action in the Atlantic, while overestimating the centers of action in the Pacific, and to overestimate the negative height/surface pressure anomaly over Siberia (Gong et al., 2017;Stoner et al., 2009;Zuo et al., 2013). In addition, Miller et al. (2006) argue that the NAM represents too large a fraction of temporal variability in the NH in models.
We present an assessment of daily NAM variability using exactly the same methodologies as described above for the SAM, but calculating the EOF using data from the equator to 90°N. The overall zonal mean structure of the NAM is well represented in CESM2 throughout the troposphere and lower stratosphere (Figure 13a), with the NAM node being very well represented in WACCM6 but slightly too far poleward (equatorward) in the troposphere (stratosphere) in CAM6 (Figure 13b). The tropospheric NAM amplitude is slightly too large in all configurations and an interesting difference between CAM6 and WACCM6 appears higher up in the stratosphere, where the amplitude of NAM variability in WACCM6 is reduced, and further from the observations, compared to that in CAM6 (Figure 13c). This is unexpected given the enhanced representation of the stratosphere in WACCM6 compared to CAM6. Since the model top in CAM6 is rather low (∼2.26 hPa) and the layers below that are influenced by the sponge layer, it is likely that the apparently better representation of the NAM amplitude in CAM6 compared to WACCM6 arises for spurious reasons, although this remains to be fully understood.
The horizontal 500-hPa NAM structure in CESM2 broadly resembles that in the observations (Figure 13e versus 13d). As in CESM1 (Figure 13g) and other CMIP6 (Figure A2b) and CMIP5 models (Zuo et al., 2013), there is a significant overestimation of the amplitude of the negative height anomalies over Northern Russia (Figure 13f). Compared to LENS, the representation of the NAM structure in the western Atlantic is substantially improved in CESM2, but it is degraded in the Pacific, where the amplitude of the positive height anomalies in CESM2 is now larger than observed (Figure 13g versus 13f). This overestimation of the amplitude of the Pacific center of action is a similar bias to that found in many other CMIP6 models (Figure A2b). Overall, the structure of the NAM, as quantified by the NMSE, appears to be slightly improved in CESM2 compared to LENS since a number of the CESM2 members have a lower NMSE than the LENS range (Figure 13k).
Both the low-top and high-top CESM2 configurations exhibit enhanced stratospheric NAM persistence and the concomitant increase in tropospheric NAM persistence during the winter, when longer timescale stratospheric variability plays an important role and imparts persistence to the troposphere (Baldwin et al., 2003; Simpson et al., 2011). The ensemble mean NAM persistence suggests enhanced timescales in BCAM6 compared to BWACCM6 during JFM. However, a closer inspection of the individual ensemble members during DJF reveals a large sampling uncertainty in this metric (Figure 13l) and indicates that this difference is likely not significant. The spread across the BCAM6 members is also as large as the CMIP5 and CMIP6 intermodel spreads, again demonstrating the substantial uncertainty in this metric due to internal variability and calling into question its usefulness for model validation.

The NAO
The NAO is the dominant mode of variability in the atmospheric circulation of the North Atlantic sector. It is characterized by fluctuations in the sea level pressure difference between the Iceland low and the Azores high and concomitant variations in the strength and location of the North Atlantic jet stream, with implications for surface weather in the North Atlantic sector (Bladè et al., 2012;Hurrell, 1995). The NAO is highly correlated with the Arctic Oscillation, or NAM, but reflects variability that is more localized to the North Atlantic sector and is argued by Ambaum et al. (2001) to be a more physically relevant mode of variability for the NH than the zonally symmetric NAM. To minimize the duplication of information from the NAM calculation above, here we assess the representation of the NAO on lower frequencies using monthly data, although similar conclusions can be drawn for the NAO calculated using daily data (not shown).
In general, prior assessments of the representation of the NAO have indicated that models are capable of simulating the structure of the wintertime NAO reasonably well (Cohen et al., 2005; McHugh & Rogers, 2005; Stephenson & Pavan, 2003; Stephenson et al., 2006; Stoner et al., 2009). The same is true of the summer NAO, although Bladè et al. (2012) emphasize inadequacies in the representation of associated precipitation anomalies over the Mediterranean, with the amplitude of the precipitation signal being too weak, especially in the east.
While models can capture the structure of the NAO and interannual timescale variability reasonably well, a number of recent studies have indicated that models underestimate the amplitude of multidecadal NAO and jet stream variability (Kim et al., 2018; Kravtsov, 2017; Simpson et al., 2018; Wang et al., 2017). Given that our focus is on the representation of atmospheric circulation as compared to satellite-era reanalyses, we do not address the issue of multidecadal variability here.
We define the NAO as the first EOF of monthly mean sea level pressure variability, area weighted, over the North Atlantic domain (90°W-40°E, 20-80°N) (Hurrell, 1995). Monthly slp data were deseasonalized and detrended, and the seasonal EOF was calculated for the concatenated time series of the individual component months of the season for all years. The PC time series and EOF patterns are constructed such that the PC time series has a standard deviation of 1 and the EOF has slp units. For CMIP, the NAO pattern and variances are calculated separately for each ensemble member and then averaged. For the models, to account for the possibility of the NAO not appearing as the dominant mode of variability, the first three EOFs are calculated, and the one that has the highest spatial pattern correlation with the observed NAO is considered to be the NAO. Only a few model members have the NAO as the second EOF.
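The EOF procedure described above can be illustrated with a minimal NumPy sketch. It is not the code used for this analysis. The function names (`leading_eofs`, `pattern_correlation`, `pick_nao`) are hypothetical, and several choices are assumptions: area weighting is applied as the square root of the cosine of latitude, the pattern correlation is centered and cosine-latitude weighted, and the arbitrary sign of each EOF is handled by comparing absolute correlations and then flipping the sign to match the observed pattern.

```python
import numpy as np

def leading_eofs(anom, lat, n_modes=3):
    """Leading EOFs of anomalies with shape (ntime, nlat, nlon).

    Returns unit-variance PCs, patterns in the units of `anom`
    (per standard deviation of the PC), and variance fractions.
    """
    anom = anom - anom.mean(axis=0)              # enforce zero time mean
    nt = anom.shape[0]
    # Assumed area weighting: sqrt(cos(lat)) so that variance scales with area.
    w = np.sqrt(np.cos(np.deg2rad(lat)))[None, :, None]
    X = (anom * w).reshape(nt, -1)
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    pcs = U[:, :n_modes] * S[:n_modes]
    pcs = pcs / pcs.std(axis=0)                  # PCs have standard deviation 1
    # Regress the unweighted anomalies onto each PC to get physical units.
    patterns = np.einsum("tm,tyx->myx", pcs, anom) / nt
    var_frac = S[:n_modes] ** 2 / np.sum(S ** 2)
    return pcs, patterns, var_frac

def pattern_correlation(a, b, lat):
    """Centered, cos(lat)-weighted spatial correlation of two (nlat, nlon) maps."""
    w = np.cos(np.deg2rad(lat))[:, None] * np.ones_like(a)
    am = a - np.average(a, weights=w)
    bm = b - np.average(b, weights=w)
    return np.sum(w * am * bm) / np.sqrt(np.sum(w * am**2) * np.sum(w * bm**2))

def pick_nao(patterns, obs_pattern, lat):
    """Choose the EOF most correlated with the observed pattern (sign-corrected)."""
    r = np.array([pattern_correlation(p, obs_pattern, lat) for p in patterns])
    k = int(np.argmax(np.abs(r)))
    return k, np.sign(r[k]) * patterns[k]
```

In use, `anom` would hold the deseasonalized, detrended monthly slp anomalies for the season of interest, and `obs_pattern` the reanalysis NAO pattern computed in the same way on the same grid.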
The structure of the wintertime NAO is relatively well represented in CESM2 (Figure 14b versus 14a). CESM1 exhibited an Azores high anomaly that was too weak (Figure 14d) and this is now improved in CESM2, although the magnitude of the accompanying Pacific SLP anomaly is now larger (Figure 14c). A salient feature of the NMSE intercomparison (Figure 14e) is the large uncertainty in this metric across ensemble members of the LENS (red range) and across 35-year overlapping segments of the twentieth century reanalyses (black ranges). These ranges encompass almost the whole of the intermodel spread, which renders any intercomparison somewhat meaningless, as the sampling uncertainty on the real world and across members of an individual model are of comparable magnitude to the intermodel spread. One bias feature that is common to all CMIP6 models (Figure A2c) and CESM2 is a positive SLP anomaly centered over the British Isles and Scandinavia, the region where the North Atlantic jet stream is too zonally extended. The fraction of SLP variance explained by the winter NAO is well represented in CESM2 and most models, given the sampling uncertainty range (Figure 14f).
The summertime NAO is also well represented in CESM2 (Figures 14g and 14h) with some minor biases that are rather similar to those in LENS. The sampling uncertainty on the NMSE is smaller but still covers a substantial fraction of the intermodel spread and CESM2 performs reasonably well and within approximately the same range as displayed by LENS (Figure 14k). The percent variance explained by the summer NAO is well represented in CESM2 and the majority of CMIP5 and CMIP6 models (Figure 14l).

Blocking
Blocking refers to quasi-stationary circulation anomalies that block the midlatitude westerly flow for several days or more (see, e.g., Woollings et al., 2018, for a recent review). Blocks are associated with extreme cold events in winter and heat waves in summer, through their effects on thermal advection and radiative fluxes (Buehler et al., 2011; Brunner et al., 2018; Pfahl & Wernli, 2012; Sillmann et al., 2011; Trigo et al., 2004). Accurate representation of atmospheric blocking is, therefore, necessary for the accurate simulation of extreme events and their projected future changes, yet blocking statistics are often misrepresented in models.
Many blocking indices exist (see, e.g., Woollings et al., 2018, for intercomparisons of a variety of blocking indices). Here, we adopt the two-dimensional procedure of Masato et al. (2013a). This method identifies persistent reversals of the meridional gradient of the 500-hPa geopotential height field and was used in CMIP5 comparisons in both the NH (Masato et al., 2013b) and SH (Patterson et al., 2019).
Daily 500-hPa zg data on a 2° longitude × latitude grid are used to calculate a blocking index B, as follows. At each grid point between 25° and 75° north or south, two integrals, $Z_P$ and $Z_E$, are evaluated with $\Delta\phi$ = 30° latitude; that is, $Z_P$ and $Z_E$ represent the integrals of geopotential height over the 15° latitude range poleward and equatorward of the grid point latitude $\phi_0$, respectively. The blocking index is then given by $B = Z_P - Z_E$, such that when a large-scale reversal of the meridional gradient of zg occurs, B > 0. For each day, local positive maxima in B are identified within the latitude range 40° to 70°. Each local grid point maximum and all adjacent positive values of B that are contiguous to that grid point are considered to be part of the same blocking event. For each day, other than the first day of the season, if a local maximum in B lies within ±18° longitude and ±14° latitude of a blocking center identified on the previous day, then it is considered to be a continuation of that previous day's blocking event. An event is considered to end if there is no local positive B maximum within ±27° longitude and ±20° latitude of the local maximum of the first day of that block. In this way, at each grid point the number of days where B > 0 in association with persistent blocking events of a given duration can be quantified. The metric we use for evaluation is the overall fraction of days in the time series that are blocked, expressed as a percentage. To identify the blocking climatologies of the NH, we consider only blocking events that last 5 days or more. For the SH, we consider blocking events that last 4 days or more, to account for the greater transience of the SH (Berrisford et al., 2007). An assessment of NH blocking using slightly different metrics can also be found in Gettelman et al. (2019), their Figure 15, and Benedict et al. (2019).
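As an illustration of the first step only, the following sketch computes B for a single day of NH data. It is a simplified reading of the description above, not the implementation used here. The function name `blocking_index_nh` is hypothetical, each integral is approximated by the mean of zg over the grid points inside the 15° window (an assumed normalization that does not affect the sign of B), and the event tracking and duration filtering are not included.

```python
import numpy as np

def blocking_index_nh(zg, lat, half_width=15.0):
    """Return B = Z_P - Z_E for one day of NH 500-hPa geopotential height.

    zg  : array (nlat, nlon), geopotential height
    lat : array (nlat,), latitudes in degrees north, ascending
    Rows where a 15-degree window would leave the grid are set to NaN.
    """
    zg = np.asarray(zg, dtype=float)
    lat = np.asarray(lat, dtype=float)
    B = np.full(zg.shape, np.nan)
    for j, phi0 in enumerate(lat):
        if phi0 + half_width > lat[-1] or phi0 - half_width < lat[0]:
            continue
        # In the NH, "poleward" means larger latitude.
        poleward = (lat >= phi0) & (lat <= phi0 + half_width)
        equatorward = (lat <= phi0) & (lat >= phi0 - half_width)
        z_p = zg[poleward].mean(axis=0)      # approximates the normalized integral
        z_e = zg[equatorward].mean(axis=0)
        B[j] = z_p - z_e                     # B > 0 marks a gradient reversal
    return B
```

For the SH the two windows would swap roles, since poleward then corresponds to decreasing latitude.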
Since the blocking index used here relies on the absolute threshold that the meridional gradient of zg must reverse, both climatological biases in the mean zg gradient and biases in the representation of the synoptic variability that gives rise to blocks can contribute to a bias in blocking frequency. The implications for the representation of surface climate variability could be rather different depending on which of these dominate. For example, someone looking to use the model to investigate heat waves associated with blocking events will be less concerned if a low bias in blocking occurs because the climatological geopotential gradient is too steep, so it is difficult to reverse, as opposed to if it were due to the model being unable to produce persistent anticyclones. To provide an indication of the relative importance of each of these contributions to the overall bias, we adopt the procedure of Scaife et al. (2010). The geopotential height field is divided up into the seasonal mean climatology and the deviations therefrom for both the model and ERA5, that is,

$$zg_{mod}(t) = \overline{zg}_{mod} + zg'_{mod}(t) \tag{8}$$

$$zg_{era}(t) = \overline{zg}_{era} + zg'_{era}(t)$$

where $\overline{(\cdot)}$ refers to the seasonally averaged climatology from 1979-2014, $(\cdot)'$ refers to deviations therefrom, and subscripts mod and era refer to quantities from the model and ERA5, respectively.
To assess the impact that a mean flow bias is having on the blocking statistics, we can replace $\overline{zg}_{mod}$ with $\overline{zg}_{era}$ in (8), recalculate the blocking statistics, and assess the extent to which the bias goes away. This is referred to as "Mean fixed" in the figures. Conversely, to assess the impact of the variability bias, we can replace $zg'_{mod}(t)$ in (8) with $zg'_{era}(t)$ and recalculate the blocking statistics. This will be referred to as "Var fixed" in the figures. There are many situations in which the interpretation of this decomposition will not be straightforward or quantitative. For example, the sum of the "Var fixed" and "Mean fixed" errors may be greater than the actual error due to compensating errors in both aspects. Furthermore, since the mean and the transients are intimately coupled, in a situation where both are in error, the decomposition gives no indication of the ultimate origins of the bias. For example, both the mean flow and transients may contribute, but it is possible that all that needs to be fixed is the transients and the mean flow will follow, or vice versa. Nevertheless, in situations where the contributions add up to the total error, this can provide a meaningful assessment of the relative influence of mean state biases versus biases in the representation of transient systems.
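The construction of the "Mean fixed" and "Var fixed" fields can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the figures. The function names are hypothetical, the seasonal climatology is taken as a single time mean over all days of the season pooled across years, and both data sets are assumed to be on the same grid.

```python
import numpy as np

def split_mean_and_deviation(zg):
    """Split zg (ntime, nlat, nlon) into its time mean and deviations from it."""
    clim = zg.mean(axis=0)
    return clim, zg - clim

def swapped_fields(zg_mod, zg_era):
    """Build the hybrid fields fed to the blocking calculation.

    mean_fixed : ERA5 climatology + model deviations   ("Mean fixed")
    var_fixed  : model climatology + ERA5 deviations   ("Var fixed")
    """
    clim_mod, dev_mod = split_mean_and_deviation(zg_mod)
    clim_era, dev_era = split_mean_and_deviation(zg_era)
    mean_fixed = clim_era + dev_mod
    var_fixed = clim_mod + dev_era
    return mean_fixed, var_fixed
```

The blocking statistics would then be recalculated on `mean_fixed` and `var_fixed` in exactly the same way as for the original model field, and each result compared with the ERA5 blocking climatology.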

NH
Systematic errors in the representation of blocking were first recognized in the numerical weather prediction context (Tibaldi & Molteni, 1990) and subsequently in early model intercomparisons (D'Andrea et al., 1998). Over the last two decades, substantial improvements in the modeled representation of Pacific blocking have occurred, but a reduced blocking frequency compared to reanalysis continues to be a pervasive issue in the North Atlantic, particularly during the winter (Anstey et al., 2013; Dunn-Sigouin & Son, 2013; Masato et al., 2013b; Vial & Osborn, 2012). Scaife et al. (2010) argued that this issue does not arise from the inability of models to capture the synoptic systems, but rather from biases in their mean state that prevent such systems from triggering the gradient reversal thresholds used in many blocking indices. The importance of mean state biases has now been identified in many studies for this region, but it is often not the only contributor (Anstey et al., 2013; Davini & D'Andrea, 2016; Vial & Osborn, 2012). Figures 15a and 15b). In CESM1 (LENS), there was too much blocking over Eastern Siberia and Alaska (Figure 15d) and this issue is now fixed in CESM2, but this improvement is due to an improvement in the mean state as opposed to any difference in the nature of synoptic anticyclone variability in this region (not shown). As is common with many models, including those from CMIP6 (Figure A2e), both CESM1 and CESM2 have reduced blocking over the European and Greenland regions compared to observed. The Greenland blocking bias has been reduced in CESM2, and Gettelman et al. (2019) showed this improvement is larger in WACCM than in CAM (their Figure 15). We see the same improvement in the Greenland blocking sector in WACCM with the blocking metric used here but larger biases occur in WACCM in the North Pacific (Figure S17e), and so it does not appear as an improvement in our hemispheric error metric (Figure 15i).
Figures 17a-17c show that the BCAM6 NH winter biases can be roughly linearly decomposed into contributions that arise from the bias in the mean state and contributions that arise from the bias in the variability. This reveals that if the mean state bias is removed by artificially replacing CAM's mean state with ERA5's before calculating the blocking statistics, much of the European bias and roughly half of the Greenland bias goes away (Figure 17b). Similar conclusions can be drawn for the European blocking bias in the CMIP6 ensemble mean (Figures A2e and A2i). This indicates that a large fraction of the problem in this region lies in the mean state and the fact that it limits the potential for persistent anticyclones to reverse the zg gradient, as opposed to there being a problem with the variability itself. Nevertheless, the lack of variability is still contributing to roughly half of the Greenland bias and around a quarter of the bias off the coast of Europe in CESM2. Overall, in DJF, CESM2 has a very good representation of NH blocking statistics, is improved over CESM1 and is one of the top ranking models in this regard (Figure 15i).
During the summer, CESM2 has unfortunately seen some degradations in the representation of blocking compared to LENS (Figure 15g versus 15h). Too much blocking occurred in the highest latitudes in LENS and this issue remains in CESM2, but now this is accompanied by a severe lack of blocking in a latitude band to the south. This includes a lack of blocking in the observed centers of action over Russia, Iceland, and Alaska. If this degradation were due to a change in the representation of the variability, then this would be of concern for the simulation of heat waves in these regions. However, Figures 17d-17f make clear that this particular bias is associated with a mean state bias, as opposed to a variability bias and, indeed, it is the mean state change between LENS and CESM2 as opposed to a variability change that has given rise to this degradation. The zg gradient has strengthened around the Arctic circle, in association with the westerly bias discussed in section 4.2. This limits the ability of synoptic variability to induce a zg gradient reversal. A similar lack of summertime blocking over Northern Europe and western Russia is seen in the CMIP6 models (Figure A2f), which can be partially attributed to a mean state bias and partially to a variability bias (Figures A2j and A2n). Overall, the representation of summertime blocking in CESM2 has degraded, but this is largely a result of mean state changes and the errors due to the representation of the synoptic variability itself are relatively minimal (Figure 17e).

Figure caption (partial): ... and SVR (circles) for DJF and JJA, respectively. The light red range that spans (i) and (j) depicts the minimum to maximum range for LENS and the reanalyses (barely visible) are ordered from left to right: ERA-Interim, MERRA2, and JRA55. When the unconditional bias corresponds to a bias in the spatial mean of more than 10% of the ERA5 spatial mean, the red (blue) SVR circles depict positive (negative) biases.

Figure caption (partial): (i and j) NMSE (see legend) and SVR (circles) for JJA and DJF, respectively. The light red range that spans (i) and (j) indicates the minimum to maximum ranges for LENS and the reanalyses are ordered from left to right: ERA-Interim, MERRA2, and JRA55. When the unconditional bias corresponds to a bias in the spatial mean of more than 10% of the ERA5 spatial mean, the red (blue) SVR circles depict positive (negative) biases.

SH
During the wintertime (Figure 16a), blocking occurs primarily in the Pacific sector (Berrisford et al., 2007; Sinclair, 1996), where the SH jet stream splits, giving rise to blocking-favorable conditions (Trenberth & Mo, 1985). During the summertime, blocking occurs primarily in the west Pacific/New Zealand sector but is less prevalent overall (Figure 16e). In comparison to the NH, relatively little attention has been paid to the representation of blocking in the SH. Ummenhofer et al. (2013) assessed the representation of blocking in the New Zealand sector of an older version of CAM (CAM3) and found the preferred blocking locations and seasonality to be well represented but with a systematic lack of blocking occurrence, which they related to the overly zonal flow in that region. Recently, Patterson et al. (2019) assessed the representation of SH blocking in the CMIP5 models and found that, while individual models can be substantially biased, there was a lack of consistency in the sense of the biases across models, except in the region south of Australia during summer where the majority of models exhibit a lack of blocking.
An assessment of SH blocking is provided in Figure 16. The blocking index identifies substantial activity around the Antarctic continent, but this is in a region of relatively weak westerly winds and Berrisford et al. (2007) argue that it is debatable whether these should be considered blocking events. Therefore, we focus on midlatitude blocking events by masking out the regions poleward of 60°S and evaluate the blocking characteristics between 30°S and 60°S.
During SH winter, the representation of blocking has unfortunately degraded in CESM2 compared to LENS. LENS only had a slight underestimation of blocking in the New Zealand sector (Figure 16d), but CESM2 now also substantially underestimates blocking in the East Pacific center of action (Figure 16c). This lack of blocking can be almost entirely ascribed to a degradation of the mean state (Figures 17g-17i) and is related to the fact that the westerlies have become too strong in the East Pacific, preventing anticyclonic circulations from triggering the zg gradient reversal threshold (Figure 3g). This is not an issue that is common to the CMIP6 models (Figure A2g), but it seems that other CMIP6 models may have compensating errors that lead to a reasonable blocking climatology in this region, as Figure A2k shows that if the CMIP6 variability were placed on top of the ERA5 climatology, the models would consistently underestimate blocking in this region. Overall, the NMSE for JJA (Figure 16i) should be viewed with caution for some models and for CESM2, since the SVR is considerably less than one, indicating that the NMSE might be artificially reduced through the lack of spatial variance. Nevertheless, it reveals that it is more common for the CMIP6 models to overestimate blocking in the winter, rather than underestimate it like CESM2 does (the prevalence of red circles in Figure 16i). Some models indicate too much spatial variance while others exhibit too little and there are hardly any models where the SVR is close to one.

Figure 17. (a-c) A decomposition of the NH DJF blocking bias in BCAM6 (a) into a contribution that is present when the seasonal mean climatology of zg is replaced by that of ERA5 (b) and a contribution that is present when the deviations from the seasonal mean climatology of zg are replaced by those of ERA5 (c), that is, (b) shows the bias that would be present if the mean state were improved but the variability left unchanged and (c) shows the bias that would be present if the variability were improved but the mean state left unchanged. Panels (d)-(f), (g)-(i), and (j)-(l) are as in (a)-(c) but for the NH during JJA, the SH during JJA, and the SH during DJF, respectively.
CESM2 also lacks SH blocking during the summer season in the West Pacific (Figure 16g), which is rather different from the bias that was present in LENS (Figure 16h). Again, this can be approximately linearly decomposed into contributions from errors in the mean state and errors in the variability (Figures 17j-17l). South of New Zealand, biases in the mean state dominate, where the anomalously strong westerlies and associated strong zg gradient prevent anticyclones from overturning the zg gradient. In contrast, the deficiency in latitudes north of New Zealand is clearly dominated by a lack of variability in that region. The lack of blocking to the north of New Zealand is also common across the CMIP6 models (Figure A2h), but biases in the mean state seem to play a more important role in this in CMIP6 (Figures A2l and A2p). Again, the SVR in this season indicates that the NMSE should be viewed with caution as many models, including CESM, underestimate the spatial variance, while some others overestimate it (Figure 16j).

Figure 18. A summary of the representation of various (a) NH features, (b) SH features, and (c) aspects of the global divergent circulation in CESM2-CAM6 (BCAM6), CESM2-WACCM6 (BWACCM6), and CESM1 (LENS) as compared to the distribution of CMIP5 and CMIP6 models combined. Red point and range = the LENS ensemble mean and the range from the worst to best individual ensemble member. Green = BCAM6 ensemble mean. Blue = BWACCM6 ensemble mean. Y axis displays the ranking of these CESM configurations among the CMIP5 and CMIP6 models combined, expressed as a percentile from the worst at the bottom to the best at the top. The CMIP distribution consists of 77 models for monthly fields, 43 models for daily ua and va metrics, and 35 models for daily zg metrics.

Summary
A summary of CESM2's representation of many of the features discussed above, as compared to other CMIP models, is provided in Figure 18. Here, all CMIP5 and CMIP6 models have been grouped together (excluding CESM2 versions). The ensemble mean LENS, coupled CESM2-CAM6 and coupled CESM2-WACCM6 have been given a ranking within this CMIP distribution based on either proximity to the observed value for metrics such as jet latitude, or the magnitude of the NMSE for spatial fields. For LENS, a minimum to maximum range is provided based on the ranking of the poorest and highest performing of the 40 individual members. Here, we highlight some of the main conclusions of this analysis: • SH jet stream: CESM2 ranks highly in SH jet position, as did CESM1. It is unusual in having the jet stream in the correct place, while the majority of CMIP5 and CMIP6 models place it too far equatorward. Degradations have occurred in the representation of SH jet speed in CESM2 as the westerlies have become too strong in all seasons. This places CESM2 at the poorest end of the model distribution in the SH jet speed and 850hPa zonal wind assessments ( Figure 18b) and is an aspect that warrants particular attention in future development, given the global importance of southern ocean wind stress. Preliminary analysis indicates that parameter changes within the ice microphysics scheme are the primary contributor to this strengthening of the SH westerlies. • NH jet streams: Substantial improvements in the representation of the NH wintertime 850-hPa zonal wind are found in CESM2 compared to CESM1. The only biases of note that remain in this field are westerlies that are too strong over Europe and easterlies that are too strong over Africa-an issue that is very common among the CMIP6 models ( Figure A1e). 
The NH summertime jet streams have degraded slightly with a westerly bias that has developed around the Arctic circle leading to an Atlantic jet stream that is too fast and too poleward and a Pacific jet stream that is too fast. • Storm tracks: A major advance in CESM2 over CESM1 is an improvement in the representation of storm tracks in both hemispheres and all seasons, as represented by 850-hPa 10-day high-pass-filtered eddy meridional wind variance (Figures , 3,6, S2, and S5). LENS and many other CMIP models exhibit a hemispheric lack of storm track activity (an unconditional bias) in both the NH and SH. In CESM2 this has been alleviated, leaving only smaller remaining phase errors, making CESM2 a high-ranking model in this aspect. A particularly notable improvement is found in the lee of the Andes and the Rockies. Here, the real world exhibits substantial lower tropospheric meridional wind variance that was almost entirely absent in LENS but is now represented with great fidelity in CESM2. Most of this improvement has arisen from the changes in the representation of TOFD and, to a lesser extent, MOB (see section 4.3). • Stationary waves: Given the importance of stationary waves for the representation of regional climate, it is important that they be simulated with accuracy. CESM2 is one of the highest ranking models in the representation of NH stationary waves in both winter and summer ( Figure 18a) and is substantially improved compared to CESM1. This comes with the additional caveat that compensating errors contribute to the summertime representation, as it degrades when observed SSTs are prescribed (see section 5.3). In the SH, along with the degradation of the zonal mean state, the representation of stationary waves has degraded in CESM2 compared to CESM1, but it still ranks at roughly the middle of the CMIP range ( Figure 18b). 
• Divergent circulation: The divergent circulation is closely connected to tropical precipitation and represents an important forcing of extra-tropical stationary waves. CESM2 has a remarkable representation of the upper tropospheric velocity potential in both summer and winter (Figure 18c) with an NMSE that is almost as small as the difference between ERA5 and ERA-Interim reanalysis during summer. Again, for the summertime, this does come with the caveat that its representation is degraded when observed SSTs are prescribed, pointing to compensating errors. Nevertheless, this is a field that has been substantially improved compared to CESM1. Any changes in Hadley cell metrics (Figure 18c, right) should be viewed with caution given the lack of consistency across reanalysis data sets, but overall, CESM2 lies close to the reanalysis range.
• SAM: The SAM is the dominant mode of variability in the extratropical SH circulation and future projected circulation change is expected to project strongly onto the SAM. Thus, the SAM should be simulated accurately in order to represent both variability and future climate change with fidelity. The zonal mean structure of the daily zg SAM and the SAM persistence are represented well in CESM2. The latitude-longitude structure of zg anomalies associated with the SAM has seen some degradation with potential consequences for surface temperature and sea ice variability around the Antarctic peninsula. Despite this degradation, CESM2 still ranks highly in terms of SAM structure (Figure 18b, right).
• NAM: Much like the SAM, the NAM is the dominant mode of extratropical zonal mean circulation variability in the NH and future predicted circulation changes are expected to project strongly onto the NAM. The NAM structure and persistence are also well represented in CESM2 with slight improvements compared to CESM1. A ranking is not provided for NAM persistence in Figure 18a because of the large sampling uncertainty that is present for this quantity.
• NAO: The NAO is closely connected to the NAM but is more localized in the North Atlantic sector. Accurate simulation of the NAO is important for the representation of climate variability over much of the North Atlantic sector. CESM2 represents the structure of the winter and summer NAO well, but the large sampling uncertainty on individual members and in the observations means it is challenging to be quantitative about this.
• Blocking: Blocking is relevant for the simulation of extreme weather events in the midlatitudes. CESM2 is one of the highest ranking models in terms of NH blocking (Figure 18a). While there is still an underestimation of blocking frequency in the Greenland and European sectors in winter, a large fraction of this can be ascribed to errors in the mean state that prevent persistent anticyclones from overturning the zg gradient, as opposed to an error in the variability itself. While summertime blocking has degraded in CESM2 compared to CESM1, this is primarily due to a degradation of the mean state as opposed to a change in the nature of the synoptic variability. We choose not to rank SH blocking in this figure given the issues that arise with the NMSE calculation resulting from the widely varying representation of spatial variance across the models. SH blocking is poorly represented in CESM2, but again, this is primarily because of biases in the mean state zg gradient as opposed to the variability being in error, except for in the region north and east of New Zealand during DJF.
The majority of the features assessed here were not considered directly during the tuning process, with only some attention paid to the representation of the climatological slp, wind stress, and precipitation fields. Many of the improvements (and degradations) are, therefore, emergent properties of this new model, presumably arising from the upgraded physical parameterizations or improved tuning of basic aspects of the mean climate such as precipitation and slp. Overall, despite some degradations, CESM2 exhibits many improvements over CESM1. It is a high-ranking model in most aspects of the atmospheric circulation and will be a useful tool for the study of many aspects of climate variability and change.

Appendix A: Common Biases in the CMIP6 Models
For many of the features discussed, systematic biases in CMIP5 have been assessed in the various studies cited in the text. Given that such studies are not yet available for the newer CMIP6 archive, we provide an overview of systematic biases in the CMIP6 models in the features considered here in Figures A1 and A2. In these figures, the ensemble mean bias relative to ERA5 is shown by the color shading where more than 75% of models agree on the sign of the bias. Green contours outline regions where more than 95% of models agree on the sign of the bias. CESM2-CAM6 and CESM2-WACCM6 have not been included in this analysis.
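The agreement criterion described above can be illustrated with a minimal sketch in Python with NumPy. This is not the code used to produce Figures A1 and A2. The function names, the array layout, and the choice to measure agreement against the sign of the ensemble mean bias are assumptions made here for illustration only.

```python
import numpy as np

def sign_agreement(model_fields, reference):
    """Ensemble-mean bias and fraction of models agreeing on its sign.

    model_fields: array (n_models, nlat, nlon), one climatology per model.
    reference:    array (nlat, nlon), e.g. a reanalysis climatology.
    (Hypothetical layout; all fields are assumed to share one grid.)
    """
    bias = np.asarray(model_fields, dtype=float) - np.asarray(reference, dtype=float)
    mean_bias = bias.mean(axis=0)
    # Fraction of models whose bias has the same sign as the ensemble-mean bias.
    same_sign = np.sign(bias) == np.sign(mean_bias)
    agreement = same_sign.mean(axis=0)
    return mean_bias, agreement

def agreement_masks(agreement, shade_threshold=0.75, contour_threshold=0.95):
    """Boolean masks: where to shade the bias and where to draw the contour."""
    return agreement > shade_threshold, agreement > contour_threshold

# Toy example with 20 hypothetical models on a 1 x 2 grid:
# all models are biased positive at point 0; 16 of 20 are at point 1.
reference = np.zeros((1, 2))
models = np.ones((20, 1, 2))
models[:4, 0, 1] = -1.0
mean_bias, agreement = sign_agreement(models, reference)
shade, contour = agreement_masks(agreement)
```

In this toy case both grid points pass the 75% threshold and would be shaded, but only the first passes the 95% threshold and would be enclosed by a contour.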
For the SH westerlies, there is a clear systematic bias toward westerlies that are too strong to the south of New Zealand and easterly trade winds that are too strong over much of the hemisphere in DJF (Figure A1a). In JJA, there is strong model agreement on westerlies that are too strong in the low-latitude Pacific and more than 75% consensus on the westerlies being too strong to the south of Australia. For the NH westerlies, the major systematic bias across the CMIP6 models is westerlies that are too strong over Europe, with easterly biases to the south over North Africa in DJF (Figure A1e). These accompany an anticyclonic bias centered over the Mediterranean (Figure A1i).
For the storm tracks, as measured by the 850-hPa 10-day high-pass meridional wind variance, there is a systematic tendency toward storm tracks that are too weak in both hemispheres and all seasons (Figures A1, S14c, S14d, S14g, and S14h). This is particularly true downstream of the Andes and the Rockies during winter (Figures A1d and A1g). This lack of meridional wind variance was also present in CESM1 but has now been improved in CESM2.
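To make the storm track metric concrete, the following minimal sketch computes a high-pass-filtered variance from daily meridional wind. It is an assumption-laden stand-in: the filter used in this study is not specified in this section, and here the high-pass component is approximated by subtracting a 10-day running mean. The function name and array layout are hypothetical.

```python
import numpy as np

def highpass_variance(v, window=10):
    """Variance of v after removing a running mean along the time axis (axis 0).

    v: array (ntime, ...) of daily meridional wind, e.g. at 850 hPa.
    window: running-mean length in days; a crude stand-in for a 10-day
            high-pass filter, not the filter used in the paper.
    """
    v = np.asarray(v, dtype=float)
    kernel = np.ones(window) / window
    # Running mean along time, keeping only fully covered windows.
    smooth = np.apply_along_axis(
        lambda x: np.convolve(x, kernel, mode="valid"), 0, v
    )
    # Align the raw series with the running mean (half a day off for even windows).
    start = (window - 1) // 2
    eddy = v[start:start + smooth.shape[0]] - smooth
    return eddy.var(axis=0)

# Synthetic check: a 2-day oscillation should be retained,
# while a 120-day oscillation should be almost entirely removed.
days = np.arange(369)
fast = np.where(days % 2 == 0, 1.0, -1.0)
slow = np.sin(2 * np.pi * days / 120.0)
var_fast, var_slow = highpass_variance(np.stack([fast, slow], axis=1))
```

A running mean has poor spectral properties compared with a purpose-built filter, so this sketch only conveys the idea of isolating synoptic timescales before taking the variance.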
For the upper level divergent flow, there is less consensus among the models on the bias relative to ERA5, but there is strong agreement on too much divergence over the Indian Ocean in DJF (Figure A1k), as well as convergence over the tropical Atlantic and divergence over Asia that are too weak during JJA (Figure A1l).
For the variability, there is a strong consensus that the three localized positive centers of action in the zg SAM structure in the South Pacific, Southern Indian Ocean, and South Atlantic are too weak (Figure A2a) and that the negative anomaly in zg of the NAM structure is too strong over Northern Russia, as found in [...]

For blocking, models systematically underestimate European blocking during DJF (Figure A2e) and underestimate JJA blocking over Eastern Europe/Western Russia, while producing too much blocking over Eastern Russia (Figure A2f). However, these blocking biases are predominantly associated with a mean state bias that limits the ability of synoptic anticyclones to reverse the zg gradient, as opposed to there being a systematic lack of persistent anticyclones in these regions (see the decomposition in Figures A2i, A2j, A2m, and A2n). In the major blocking centers of the SH, there is no model agreement on systematic biases in JJA (Figure A2g), while there is reasonable model agreement on a lack of blocking in the New Zealand sector and western Pacific during DJF (Figure A2h).

Figure A1. CMIP6 ensemble mean bias relative to ERA5. Gray shading depicts regions where less than 75% of the models agree on the sign of the bias and green contour shows where more than 95% of the models agree on the sign of the bias.

Figure A2. CMIP6 ensemble mean biases relative to ERA5 for variability metrics. Gray shading depicts regions where less than 75% of the models agree on the sign of the bias and green contour shows where more than 95% of models agree on the sign of the bias. (a and b) The 500-hPa zg structure associated with the SAM and NAM. (c and d) The winter and summer NAO. (e-h) The representation of blocking. (i-l) As in (e)-(h) but replacing each model's climatology with that of ERA5 before calculating the blocking statistics. (m-p) As in (e)-(h) but replacing each model's transient variability (deviations from climatology) with that of ERA5 before calculating the blocking statistics.
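The decomposition referred to in the Figure A2 caption, in which either the climatology or the transient variability of a model is replaced with that of ERA5 before the blocking statistics are calculated, can be sketched as follows. This is a minimal illustration and not the analysis code. The function names, the (nyears, ndays, ...) array layout, and the definition of the climatology as a simple multi-year mean for each calendar day are assumptions. The blocking index itself is not implemented here and would be applied to each hybrid field afterwards.

```python
import numpy as np

def split_clim_anom(zg):
    """Split zg (nyears, ndays, ...) into a day-of-year climatology and anomalies."""
    zg = np.asarray(zg, dtype=float)
    clim = zg.mean(axis=0, keepdims=True)
    return clim, zg - clim

def swap_components(zg_model, zg_ref):
    """Build the two hybrid fields used to attribute a blocking bias.

    Returns (ref_clim_plus_model_anom, model_clim_plus_ref_anom).
    Inputs must share their (ndays, ...) dimensions; the number of
    years may differ between the model and the reference.
    """
    clim_m, anom_m = split_clim_anom(zg_model)
    clim_r, anom_r = split_clim_anom(zg_ref)
    return clim_r + anom_m, clim_m + anom_r

# Synthetic example: hypothetical 500-hPa zg values on a tiny grid.
rng = np.random.default_rng(0)
zg_model = 5500.0 + rng.normal(0.0, 50.0, size=(4, 90, 3, 5))
zg_ref = 5550.0 + rng.normal(0.0, 80.0, size=(6, 90, 3, 5))
mean_swapped, variability_swapped = swap_components(zg_model, zg_ref)
```

If the blocking bias largely disappears in the first hybrid field (reference climatology with model variability), the bias is attributable mainly to the mean state, which is the interpretation given in the text above.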

Data Availability Statement
All CMIP5