Evaluating Climate Models’ Cloud Feedbacks Against Expert Judgment

The persistent and growing spread in effective climate sensitivity (ECS) across global climate models necessitates rigorous evaluation of their cloud feedbacks. Here we evaluate several cloud feedback components simulated in 19 climate models against benchmark values determined via an expert synthesis of observational, theoretical, and high‐resolution modeling studies. We find that models with the smallest feedback errors relative to these benchmark values generally have moderate total cloud feedbacks (0.4–0.6 W m−2 K−1) and ECS (3–4 K). Those with the largest errors generally have total cloud feedback and ECS values that are too large or too small. Models tend to achieve large positive total cloud feedbacks by having several cloud feedback components that are systematically biased high rather than by having a single anomalously large component, and vice versa. In general, better simulation of mean‐state cloud properties leads to stronger but not necessarily better cloud feedbacks. The Python code base provided herein could be applied to developmental versions of models to assess cloud feedbacks and cloud errors and place them in the context of other models and of expert judgment in real time during model development.


in which three semi-independent lines of evidence (process studies, historical climate record, and paleoclimate record) were brought together in a Bayesian framework to place robust bounds on Earth's climate sensitivity (Sherwood et al., 2020).
Our goals in this work are several-fold. First, we evaluate GCM cloud feedback components against those assessed in Sherwood et al. (2020). This allows us to answer several questions, including: Do models with extremely large or small climate sensitivities have cloud feedback components that are erroneous? If so, which component(s)? How are cloud feedbacks in CMIP6, and their biases with respect to expert assessment, changing from CMIP5? Are some models getting the "right" total cloud feedback via erroneous components that compensate?
Second, we investigate whether the fidelity with which models simulate present-day cloud properties is linked to their cloud feedbacks and to the fidelity with which their cloud feedbacks agree with expert judgment. A key question is whether better simulation of present-day cloud properties leads to cloud feedbacks that are better aligned with expert judgment. This is particularly relevant because aspects of the cloud simulation in many high-ECS CMIP6 models are in many cases considered superior to those in CMIP5 (Bodas-Salcedo et al., 2019; Gettelman et al., 2019), yet holistic aspects of the climate simulation in these models appear inferior to their lower-ECS counterparts (Nijsse et al., 2020; Tokarska et al., 2020; Zhu et al., 2020).
Finally, we provide a code base to compute cloud feedbacks and error metrics for all of the assessed categories, and visualize them in a multimodel context. This will allow, for example, model developers to evaluate cloud feedbacks in developmental versions of their models against expert judgment, other models, and other variants of their model, providing them with detailed information about a key process affecting their model's climate sensitivity.

Data and Methods
We are primarily interested in cloud feedbacks in response to CO2-induced global warming, so we make use of abrupt CO2 quadrupling (abrupt-4xCO2) experiments conducted with fully coupled GCMs in CMIP5 and CMIP6. We first compute cloud-radiative anomalies at the top-of-atmosphere (TOA) by multiplying cloud fraction anomalies with cloud-radiative kernels (Zelinka et al., 2012a, 2012b). The cloud fraction anomalies needed for this calculation are reported in a matrix of seven cloud top pressure (CTP) categories by seven visible optical depth (τ) categories matching the categorization of the International Satellite Cloud Climatology Project (ISCCP; Rossow & Schiffer, 1999). These matrices are produced by the ISCCP simulator (Klein & Jakob, 1999; Webb et al., 2001), referred to as clisccp in CMIP parlance. Cloud-radiative kernels quantify the sensitivity of top-of-atmosphere radiative fluxes to small cloud fraction perturbations in each of these 49 cloud types. Hence the product of the two yields the radiation anomaly from each cloud type, which can be summed over the entire matrix to provide the total cloud-radiative anomalies at a given location. Because of the reliance on clisccp, we are limited in this study to those models (listed in Table 1) that have successfully implemented the Cloud Feedback Model Intercomparison Project (CFMIP) Observation Simulator Package (COSP; Bodas-Salcedo et al., 2011). As will be evident below, these models exhibit cloud feedbacks spanning nearly the full range of values produced in the full ensemble of CMIP5 and CMIP6 models analyzed in Zelinka et al. (2020), and we therefore consider this subset to be a sufficiently representative sample of model diversity.
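The kernel calculation described above reduces to an element-wise product of two 7 x 7 histograms summed over all 49 cloud types. A minimal sketch for a single grid cell, using randomly generated stand-in arrays rather than real clisccp or kernel data:

```python
import numpy as np

# Hypothetical stand-ins for real clisccp/kernel fields at one grid cell.
rng = np.random.default_rng(0)

# Cloud-radiative kernel: sensitivity of TOA net flux to a 1% cloud-fraction
# change in each of the 7 CTP x 7 tau ISCCP cloud types (W m-2 %-1).
kernel = rng.normal(0.0, 0.5, size=(7, 7))

# Cloud-fraction anomaly (perturbed minus control) per cloud type (%),
# as reported by the ISCCP simulator (clisccp).
dclisccp = rng.normal(0.0, 0.2, size=(7, 7))

# Radiation anomaly contributed by each of the 49 cloud types...
dR_per_type = kernel * dclisccp

# ...summed over the entire histogram to give the total cloud-radiative
# anomaly at this location (W m-2).
dR_total = dR_per_type.sum()
```

In practice both arrays carry additional month/latitude/longitude dimensions, and the product is summed only over the two histogram axes.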
Anomalies are computed with respect to the contemporaneous preindustrial control (piControl) simulation, with three exceptions: CNRM-CM6-1, CNRM-ESM2-1, and IPSL-CM6A-LR-INCA did not archive clisccp from the piControl simulation, so we take this field from piClim-control, a 30-year long atmosphere-only simulation that uses sea-surface temperatures (SSTs) and sea ice concentrations fixed at the model-specific piControl climatology (Pincus et al., 2016).

[Table 1 note: CMIP5 and CMIP6 models are indicated with lower-case and upper-case symbols, respectively. Years within the abrupt-4xCO2 simulation with data available to analyze are indicated.]

We compute cloud feedbacks by regressing annual mean cloud-radiative anomalies on annual and global mean surface air temperature anomalies over the duration of the 150-year abrupt-4xCO2 experiment containing all necessary data. In CMIP6, clisccp output is available throughout the full duration of the run, whereas in CMIP5 it is typically only available for two noncontiguous 20-year periods, one at the beginning and one at the end of the run (Table 1).

Zelinka et al. (2012a) validated cloud feedbacks computed using the cloud-radiative kernel (CRK) methodology against independent estimates derived as the adjusted change in cloud-radiative effect (ΔCRE_adj; Shell et al., 2008; Soden et al., 2008) for six CMIP3 models. Here we update this comparison using the CMIP5 and CMIP6 models analyzed in this study. We compare CRK-derived cloud feedbacks with the ΔCRE_adj and approximate partial radiative perturbation (APRP; Taylor et al., 2007) derived values computed in Zelinka et al. (2020). Six ΔCRE_adj feedbacks are provided based on the adjustments from the noncloud radiative kernels of Soden et al. (2008), Shell et al. (2008), Block and Mauritsen (2013), Huang et al. (2017), Pendergrass et al. (2018), and Smith et al. (2018).
APRP provides only the SW component, but it additionally provides estimates of SW cloud amount, scattering, and absorption feedbacks, allowing us to compare to the CRK-derived SW amount and optical depth components. Figure S1 in Supporting Information S1 shows the multimodel mean zonal mean SW and LW cloud feedbacks from these three techniques, along with their across-model correlations, and Figure S2 in Supporting Information S1 scatters the global mean CRK-derived and non-CRK-derived feedback values against each other. The CRK-derived feedbacks are in excellent agreement with the ΔCRE adj and APRP feedbacks, for both the spatial characteristics of the multimodel mean and the across-model correlation of the zonal and global means. This confirms the validity of the CRK technique for estimating cloud feedback.
We focus in this study on feedbacks estimated from abrupt-4xCO2 experiments so as to stay consistent with Sherwood et al. (2020), but have repeated all calculations using Atmospheric Model Intercomparison Project (amip) experiments with imposed +4K SST perturbations that are spatially uniform (amip-p4K) and patterned (amip-future4K), as described in the CFMIP protocol (Webb et al., 2017). Feedbacks in these simulations were computed as cloud-radiation anomalies normalized by global mean surface air temperature anomalies between the +4K experiments and the control amip experiment. All basic conclusions reported in this study are insensitive to whether we consider feedbacks diagnosed in amip-p4K, amip-future4K, or abrupt-4xCO2 experiments.
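The two feedback estimates described above, a regression slope in abrupt-4xCO2 and a normalized difference in the amip +4K experiments, can be sketched with synthetic time series (the 0.5 W m−2 K−1 "true" feedback and all numbers below are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# --- abrupt-4xCO2: regression over a synthetic 150-year run ---------------
years = 150
dT = 3.0 * (1.0 - np.exp(-np.arange(years) / 30.0))   # global-mean warming (K)
assumed_fb = 0.5                                      # W m-2 K-1 (assumed)
dR = assumed_fb * dT + rng.normal(0.0, 0.1, years)    # cloud-radiative anomaly

# Feedback = slope of annual-mean cloud-radiative anomalies regressed on
# annual-, global-mean surface air temperature anomalies.
fb_regress = np.polyfit(dT, dR, 1)[0]

# --- amip-p4K: +4K minus control difference, normalized by delta-T --------
dR_amip, dT_amip = 2.1, 4.2        # hypothetical global means (W m-2, K)
fb_amip = dR_amip / dT_amip
```

With internal variability present, the regression slope recovers the underlying feedback only approximately, which is one reason the text checks that conclusions hold across all three experiment types.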
To distinguish feedbacks occurring in regions of large-scale ascent from those occurring in regions of large-scale descent over tropical oceans, we aggregate (with area weighting) all monthly control and perturbed climate fields over the tropical oceans into 10-hPa wide bins of 500 hPa vertical pressure velocity (ω500) following Bony et al. (2004). Anomalies between perturbed and control climates are then computed in ω500 space rather than geographic space when computing tropical marine ascent/descent feedbacks. The resulting feedbacks can be further broken down into dynamic, thermodynamic, and covariance terms (see Bony et al., 2004), but for the purposes of this study, we will consider only their sum, and will further aggregate these to "ascent regions" where ω500 < 0 and "descent regions" where ω500 ≥ 0.
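The regime compositing above can be sketched as follows; the ω500 values, the cloud field, and the uniform area weights are all synthetic placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000                                   # grid-cell x month samples

w500 = rng.normal(10.0, 25.0, n)           # omega_500 (hPa/day; > 0 = descent)
cre = -30.0 + 0.3 * w500 + rng.normal(0.0, 5.0, n)   # some cloud field
area = np.full(n, 1.0 / n)                 # area weights (uniform here)

# 10-unit-wide omega_500 bins spanning the sampled circulation regimes.
edges = np.arange(-80.0, 91.0, 10.0)
idx = np.digitize(w500, edges) - 1

# Area-weighted composite of the field in each dynamical-regime bin.
composite = np.full(len(edges) - 1, np.nan)
for b in range(len(edges) - 1):
    mask = idx == b
    if mask.any():
        composite[b] = np.average(cre[mask], weights=area[mask])

# Aggregate to ascent (omega_500 < 0) and descent (omega_500 >= 0) regions.
ascent = np.average(cre[w500 < 0], weights=area[w500 < 0])
descent = np.average(cre[w500 >= 0], weights=area[w500 >= 0])
```

Differencing perturbed-minus-control composites bin by bin, rather than point by point in geographic space, is what makes the resulting feedbacks regime-based.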
Following Zelinka et al. (2016), we separately quantify feedbacks arising from low, boundary-layer clouds and from nonlow, free-tropospheric clouds, hereafter referred to as "low" and "high" cloud feedbacks, respectively. This is done by performing the cloud feedback calculations using only restricted parts of the clisccp histogram: CTPs > 680 hPa for low clouds and CTPs ≤ 680 hPa for high clouds. Within these subsets, the cloud feedback is further broken down into (1) the "amount" component due to the change in total cloud fraction holding the CTP and τ distribution fixed; (2) the "altitude" component due to the change in CTP distribution holding the total fraction and τ distribution fixed; and (3) the "optical depth" component due to the change in τ distribution holding the total fraction and CTP distribution fixed (Zelinka et al., 2016). Passive satellite-based measurements, like those mimicked by the ISCCP simulator used in this study, provide unobscured cloud fractions visible from space. This means that low clouds may be hidden and revealed by changes in high-cloud cover, which complicates interpretation of low-cloud feedbacks, since high-cloud changes are aliased to an unknown extent into low-cloud feedbacks. To avoid this potential source of misinterpretation, we express the standard low-level cloud feedbacks as a sum of three terms following Scott et al. (2020) and Myers et al. (2021). The first, low_unobsc, is the "true" low-cloud feedback occurring in regions that are not obscured by upper-level clouds and are unaffected by changes in obscuration; we further break it down into amount, altitude, optical depth, and residual components. The second, Δobsc, is the "obscuration-induced" component of low-cloud feedback arising entirely from changes in upper-level cloud fraction that reveal or hide low-level clouds. It is therefore by definition solely an "amount" component, so we absorb it into the high-cloud amount feedback. The third, the covariance term cov, is typically very small.
To summarize, the total cloud feedback can be expressed as

λ_cloud = Σ_i (λ_i^high + λ_i^low_unobsc), i ∈ {amount, altitude, optical depth, residual},

where the high-cloud amount component includes the Δobsc component.
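The histogram split at 680 hPa and the amount/altitude/optical-depth/residual partition can be sketched as below. This is a loose paraphrase of the Zelinka et al. (2016) decomposition, not the authors' exact code, and all inputs are synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)

ctp = np.array([90.0, 245.0, 375.0, 500.0, 620.0, 740.0, 925.0])  # bin centers
low = ctp > 680.0            # boundary-layer cloud rows
high = ~low                  # free-tropospheric cloud rows

c0 = np.abs(rng.normal(1.0, 0.3, size=(7, 7)))   # control histogram (%)
dc = rng.normal(0.0, 0.1, size=(7, 7))           # anomaly (%)
kernel = rng.normal(0.0, 0.5, size=(7, 7))       # net CRK (W m-2 %-1)

def decompose(c0, dc, kernel):
    """Partition kernel-weighted anomalies into the four components."""
    d_amt = (dc.sum() / c0.sum()) * c0            # total-amount change only
    d_star = dc - d_amt                           # remainder
    # CTP-marginal change spread over the control tau distribution (altitude),
    # and tau-marginal change spread over the control CTP distribution.
    d_alt = d_star.sum(1, keepdims=True) * c0 / c0.sum(1, keepdims=True)
    d_tau = d_star.sum(0, keepdims=True) * c0 / c0.sum(0, keepdims=True)
    d_res = d_star - d_alt - d_tau
    parts = {"amount": d_amt, "altitude": d_alt,
             "optical depth": d_tau, "residual": d_res}
    return {k: (kernel * v).sum() for k, v in parts.items()}

low_fb = decompose(c0[low], dc[low], kernel[low])
high_fb = decompose(c0[high], dc[high], kernel[high])

# By construction the components sum back to the undecomposed anomaly.
total = sum(low_fb.values()) + sum(high_fb.values())
```

The closure property (components summing exactly to the undecomposed kernel-weighted anomaly) is a useful sanity check when implementing any such partition.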
In Table 2, we list the central value and 1σ uncertainty of the cloud feedback components assessed in Sherwood et al. (2020) and describe how we compute them in GCMs. We also provide a matrix in Figure S3 in Supporting Information S1 to help visualize the feedback components that are computed in this study. A large amount of observational evidence, based mainly on interannual variability, was used to provide quantitative values for the assessed total cloud feedback and several of its individual components. In addition, process-resolving models in the form of large eddy simulations were a key piece of evidence for the strength of tropical marine low-cloud feedback, while guidance from theoretical understanding underlies the assessed high-cloud altitude, tropical anvil, and land-cloud amount feedbacks. Many of the expert-assessed cloud feedbacks are independent of any GCM results, but the assessed central value and uncertainty for the high-cloud altitude, land-cloud amount, and middle-latitude marine low-cloud amount feedbacks were derived at least partially from GCMs, albeit a collection that included pre-CMIP5 models that are excluded here and that excluded some recently published CMIP6 models that are included here. Comparing GCM results to expert-assessed values can therefore be thought of as a quick and economical way of evaluating model feedbacks against the very wide body of evidence that forms the basis of the expert-assessed cloud feedbacks.
Values of effective climate sensitivity (ECS) are taken from Zelinka et al. (2020), updated to include recently available models. These ECS values are computed in a manner consistent with the cloud feedbacks, by regressing global and annual mean TOA net radiative flux anomalies on global and annual mean surface air temperature anomalies over the 150-year duration of the abrupt-4xCO2 experiment. Anomalies are computed with respect to the contemporaneous piControl simulation, except in IPSL-CM6A-LR-INCA, for which we use piClim-control because no piControl fields are available.
[Table 2: central values and 1σ uncertainties of the cloud feedback components assessed in Sherwood et al. (2020) (in W m−2 K−1), and description of how each component is computed in GCMs in this study. Note: Feedbacks are computed at each spatial location (or ω500 bin as appropriate), then summed over the region of interest with weighting by the fractional area of the globe represented. As explained in the text, high-cloud amount feedbacks include the Δobsc term and all low-cloud feedbacks are computed using low_unobsc components.]

Finally, for each model we compute a radiatively relevant cloud property error metric, E_NET. First, we compute differences between climatological monthly mean ISCCP simulator cloud fraction histograms from amip simulations and the ISCCP HGG observational climatology (Young et al., 2018). Both modeled and observed climatologies are computed over the 26-year period January 1983 to December 2008, when all model simulations and observations overlap, but error metrics are very insensitive to the time period considered. Second, these errors are multiplied by net (LW + SW) cloud radiative kernels, thereby weighting them by their corresponding net TOA radiative impact. Third, this product is aggregated into six cloud types: optically intermediate and thick clouds at low, middle, and high levels. These are then squared, averaged over the six categories, summed (with area weighting) over month, longitude, and latitude between 60°S and 60°N, and the square root is taken. Finally, this scalar value is normalized by the accumulated space-time standard deviation of observed radiatively relevant cloud properties, defined analogously. This process yields a single scalar error metric, E_NET, in each model that quantifies the spatiotemporal error in climatological cloud properties for clouds with τ > 3.6, weighted by their net TOA radiative impact. We acknowledge that evaluation against ISCCP observations is a limited viewpoint on the quality of models' cloud simulations, one that may change if using other cloud data sets, like those derived from active sensors.
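The E_NET recipe (square, average over cloud types, weighted space-time sum, square root, normalize by the observed variability) can be sketched with synthetic fields standing in for the kernel-weighted model-minus-ISCCP errors:

```python
import numpy as np

rng = np.random.default_rng(4)
ntime, nlat, nlon, ntypes = 12, 10, 20, 6   # months x grid x 6 cloud types

# Hypothetical kernel-weighted climatological errors and the observed
# reference field, already aggregated into the six cloud types
# (intermediate/thick at low, middle, and high levels).
err = rng.normal(0.0, 2.0, size=(ntime, nlat, nlon, ntypes))
obs = rng.normal(0.0, 4.0, size=(ntime, nlat, nlon, ntypes))

# Area weights: cos(latitude) between 60S and 60N, normalized so the
# weighted sum over month/lat/lon behaves like a mean.
lat = np.linspace(-60.0, 60.0, nlat)
w = np.cos(np.deg2rad(lat))[None, :, None]
w = w / (w.sum() * ntime * nlon)

def weighted_rms(x, w):
    # Square, average over the 6 types, weighted space-time sum, square root.
    return np.sqrt((w * (x ** 2).mean(axis=-1)).sum())

# Normalize by the analogous space-time standard deviation of observations.
E_net = weighted_rms(err, w) / weighted_rms(obs - obs.mean(), w)
```

Because the error fields here are drawn with half the spread of the observed variability, E_net comes out near 0.5 by construction; real values depend entirely on the model-minus-observation fields supplied.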

GCM Cloud Feedbacks Evaluated Against Expert-Assessed Values
In Figure 1, cloud feedbacks from 7 CMIP5 and 12 CMIP6 models are compared with the assessed values for feedback categories listed in Table 2. Each feedback value is scaled by the fractional area of the globe occupied by that cloud type such that summing all components yields the global mean feedback. Each marker is color-coded by its ECS, with the color boundaries corresponding to the 5th, 17th, 83rd, and 95th percentiles of the Baseline posterior PDF of ECS from Table 10 of Sherwood et al. (2020). In Table 3, we list the GCM values and highlight any values that lie outside of the very likely (90%) and likely (66%) confidence intervals of expert judgment with double and single asterisks, respectively. Figures S4-S22 in Supporting Information S1 are identical to Figure 1, but with individual models highlighted in each figure for better discrimination.
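The single/double-asterisk convention used in Table 3 can be reproduced from a central value and 1σ width if the expert PDF is treated as Gaussian; the thresholds are standard-normal quantiles and the function below is our own illustrative helper, not the authors' code:

```python
# Standard-normal quantiles bounding the central 66% and 90% intervals.
Z66, Z90 = 0.954, 1.645

def flag(value, center, sigma):
    """Return '**' if value lies outside the very likely (90%) interval,
    '*' if outside the likely (66%) interval, '' otherwise."""
    z = abs(value - center) / sigma
    return "**" if z > Z90 else "*" if z > Z66 else ""
```

For example, with an assessed feedback of 0.2 ± 0.3 W m−2 K−1, a model value of 0.9 would be double-flagged while 0.5 would receive a single asterisk.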
All but seven models fall within the likely range assessed for the high-cloud altitude feedback and the multimodel means are very close to the central assessed value. However, some models have weak high-cloud altitude feedbacks that lie below the lower bound of the likely (MRI-CGCM3 and MIROC6) and very likely (MIROC5 and MIROC-ES2L) confidence intervals, and some have strong high-cloud altitude feedbacks that lie above the upper bound of the likely (HadGEM2-ES and CanESM5) and very likely (E3SM-1-0) confidence intervals. This feedback component has the greatest number of models (3) lying outside of the assessed very likely range; these are the same three models that lie outside the assessed very likely range for total cloud feedback. Such wide intermodel variation is noteworthy for a feedback having a strong theoretical basis and both observational and high-resolution modeling support. All models lie within the assessed likely range for the land-cloud amount feedback, while all but five models (MIROC5, HadGEM3-GC31-LL, MIROC-ES2L, MIROC6, and UKESM1-0-LL) lie within the assessed likely range of the middle-latitude marine low-cloud amount feedback.

Whereas the central estimate of the high-latitude low-cloud optical depth feedback from the assessment is 0, all models simulate a negative feedback. All but two models (MIROC-ESM and MPI-ESM-LR) fall within the likely assessed range, however. In the multimodel average, the negative feedback values are more than halved in CMIP6 relative to CMIP5, bringing CMIP6 models into better agreement with expert judgment. This may be related to a weakened cloud phase feedback owing to improved simulation of mean-state cloud phase (Bodas-Salcedo et al., 2019; Gettelman et al., 2019; Flynn & Mauritsen, 2020; Zelinka et al., 2020). The intermodel spread in this feedback component has also dramatically decreased in CMIP6.
The unassessed feedback is near zero on average across all models, consistent with it being assigned a value of zero in the expert assessment. However, its across-model standard deviation and its CMIP5-to-CMIP6 increase in multimodel average are larger than all other individual components except the high-cloud altitude feedback. Contributors to this feedback will be discussed in greater detail in Section 3.5.
The sum of all six assessed feedback components is positive in all but two models (MIROC5 and MIROC-ES2L) and exhibits substantially more intermodel spread than any individual component comprising it. Its standard deviation (σ = 0.27 W m−2 K−1) is also larger than would exist if the feedback components comprising it were uncorrelated across models (σ = 0.20 W m−2 K−1 if individual uncertainties are summed in quadrature), as discussed further in Section 3.2. While the multimodel mean value is close to the expert-assessed value, some models lie below the lower bound of the assessed likely (CCSM4 and MIROC6) and very likely (MIROC5 and MIROC-ES2L) confidence intervals, and E3SM-1-0 lies above the upper bound of the assessed likely confidence interval.
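The quadrature comparison above can be demonstrated with a toy calculation; the per-component standard deviations and the equicorrelation value are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)
sigmas = np.array([0.10, 0.08, 0.07, 0.06, 0.09, 0.05, 0.08])

# Spread of the sum if the components were uncorrelated across models:
# quadrature sum of the individual spreads.
sigma_quad = np.sqrt((sigmas ** 2).sum())

# Equicorrelated components (rho = 0.4) built from a shared factor.
n, rho = 10000, 0.4
shared = rng.normal(0.0, 1.0, n)[:, None]
noise = rng.normal(0.0, 1.0, (n, 7))
comps = sigmas * (np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * noise)

# Positive inter-component correlations add covariance terms, so the
# actual spread of the sum exceeds the quadrature estimate.
sigma_actual = comps.sum(axis=1).std()
```

This mirrors the text's finding that the actual spread (0.27 W m−2 K−1) exceeds the uncorrelated expectation (0.20 W m−2 K−1).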
The total cloud feedback, which is the sum of assessed and unassessed components, has a larger standard deviation than would occur if these two components were uncorrelated. Owing to this correlation, all but four models  (MIROC-ESM, MPI-ESM-LR, CNRM-ESM2-1, and MRI-ESM2-0) exhibit degraded agreement with expert assessment once accounting for their unassessed feedbacks. In addition to the models that fell outside the likely and very likely ranges for the sum of assessed feedbacks, there are now four new models (CanESM5, IPSL-CM6A-LR, IPSL-CM6A-LR-INCA, and UKESM1-0-LL) that lie above the upper bound of the assessed likely confidence interval, and E3SM-1-0 has now moved above the upper bound of the assessed very likely confidence interval.
Unsurprisingly, models with larger total cloud feedback tend to have higher ECS. All five models with total cloud feedbacks above the upper limit of the expert-assessed likely range (CanESM5, E3SM-1-0, IPSL-CM6A-LR, IPSL-CM6A-LR-INCA, and UKESM1-0-LL) are part of CMIP6. These models also have ECS values above 3.9 K, the upper limit of the expert-assessed likely ECS range, and all but IPSL-CM6A-LR and IPSL-CM6A-LR-INCA have ECS values above 4.7 K, the upper limit of the very likely ECS range. However, two models with ECS > 3.9 K (HadGEM2-ES, MIROC-ESM) and even three with ECS > 4.7 K (CNRM-CM6-1, CNRM-ESM2-1, and HadGEM3-GC31-LL) have total cloud feedbacks within the likely range, indicating that other noncloud feedbacks are pushing these models to very high ECS. No models considered here, even those whose cloud feedbacks lie below the lower limit of the likely and very likely total cloud feedback confidence bounds, have ECS values below 2.6 K, the lower limit of the Sherwood et al. (2020) assessed likely range. In general, too-large cloud feedbacks seem to guarantee too-large ECS, but too-small cloud feedbacks do not guarantee too-small ECS. Also, too-large ECS can arise even without too-large cloud feedbacks.
Turning now to the multimodel mean cloud feedback components, we see that the mean total cloud feedback is roughly twice as large in CMIP6 as in CMIP5, qualitatively consistent with Zelinka et al. (2020), who assessed a much larger collection of models. This occurs because the high-cloud altitude, midlatitude marine low-cloud amount, high-latitude low-cloud optical depth, and unassessed feedbacks all become more positive, on average, in CMIP6. The other feedbacks change very little on average.
All multimodel mean assessed feedback components lie within the respective expert-assessed likely range. They also lie very close to the central assessed values, with two exceptions: The tropical marine low-cloud feedback averaged across all models (0.12 ± 0.07 W m −2 K −1 ) is about half as large as assessed (0.25 ± 0.16 W m −2 K −1 ), and the tropical anvil cloud area feedback averaged across all models is close to zero (−0.04 ± 0.06 W m −2 K −1 ), whereas it was assessed to be moderately negative (−0.20 ± 0.20 W m −2 K −1 ). For these two components, GCM values were not used to inform the expert judgment value, but rather they were based upon observations and, in the case of tropical marine low-cloud feedbacks, large eddy simulations that resolve many of the cloud processes that must be parameterized in GCMs (see Table 1 of Sherwood et al., 2020).

Correlations Among GCM Cloud Feedbacks
The previous section provided several indications that models with large positive total cloud feedbacks tend to have systematically higher cloud feedbacks for all components rather than having a single anomalously strong positive component, and vice versa for models with small or negative total cloud feedbacks. We quantify this more rigorously in this section by diagnosing the correlation structure among the individual components.
All individual cloud feedback components are positively correlated with the total cloud feedback, especially the high-cloud altitude, midlatitude marine low-cloud amount, and unassessed feedbacks (Figure 2a, column 1). While the tropical marine low-cloud feedback is significantly correlated with the total, it is markedly weaker than for several other components, which is surprising given previous findings that low latitude marine low clouds in regions of moderate subsidence drive intermodel spread in climate sensitivity (Bony & Dufresne, 2005). The discrepancy may arise from the relatively small subset of models considered here, but it also may be related to the precise definition of low-cloud types: Taking the sum of stratocumulus and trade cumulus cloud feedbacks diagnosed in Myers et al. (2021) using different meteorological criteria than employed here as an alternative estimate of tropical marine low-cloud feedback, we find a larger correlation (r = 0.80) with total cloud feedback.
The positive correlations between individual components and the total cloud feedback are expected: If all the models were distributed randomly for each feedback component, one would expect the models with largest total cloud feedback to be the ones that most consistently lie on the positive tail of all components. To demonstrate this, we generated normal distributions with 10,000 samples matching the multimodel mean and standard deviation for each of the six assessed and one unassessed components and repeated the above calculations on these random data. All individual components are significantly positively correlated with their sum, with correlation strengths proportional to the individual component variances (Figure 2b, column 1).
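The random-data null test described above can be reproduced directly; the component means and standard deviations below are hypothetical stand-ins for the multimodel values:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10000
means = np.array([0.20, 0.12, 0.09, 0.08, -0.04, 0.00, 0.05])
stds = np.array([0.10, 0.07, 0.06, 0.05, 0.06, 0.10, 0.12])

# Independent normal samples: the "no inter-component correlation" null.
comps = rng.normal(means, stds, size=(n, 7))
total = comps.sum(axis=1)

# Even with zero inter-component correlation, each component correlates
# positively with the sum, in proportion to its share of the variance:
# corr(X_i, sum) = sigma_i / sigma_total for independent components.
r = np.array([np.corrcoef(comps[:, i], total)[0, 1] for i in range(7)])
expected = stds / np.sqrt((stds ** 2).sum())
```

The analytic expectation makes the point of the null test explicit: positive component-sum correlations are guaranteed by construction, so only correlations among the components themselves (or component-sum correlations exceeding this baseline) indicate shared behavior across models.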
The prevalence of strong and significant positive correlations among individual feedback components seen in the actual model data is, however, not expected from chance. This leads to (a) individual components being more strongly correlated with the total cloud feedback and (b) a wider spread in the total cloud feedback than would occur if individual components were uncorrelated. Models with large positive total cloud feedbacks tend to have systematically larger-than-average cloud feedbacks across multiple components rather than being generally near-average but having a single large component. E3SM-1-0, for example, has the largest positive total cloud feedback, and its feedback values are among the largest values in all categories except the land-cloud feedback ( Figure S14 in Supporting Information S1 and Table 3). Conversely, models like MIROC5 with negative total cloud feedbacks tend to have cloud feedbacks on the left tail of the distribution for all components ( Figure S8 in Supporting Information S1 and Table 3). Consistent with this, we find that most models with near-average total cloud feedbacks have components that are systematically near-average rather than having several components with extreme values of opposing sign that counter each other. One exception is CNRM-ESM2-1, which has feedbacks on the high tail of the model distribution for some components and on the low tail for others ( Figure S12 in Supporting Information S1 and Table 3).
That all of the significant correlations in Figure 2a are positive might suggest that they are linked by a physical mechanism rather than arising from tuning artifacts. As will be shown in Section 3.5, high-cloud feedbacks are among the largest components of the unassessed feedback; hence it is plausible that the positive correlations among the unassessed, high-cloud altitude, and anvil feedbacks reflect a shared physical mechanism involving high clouds. Other large positive correlations (e.g., between high-cloud altitude and tropical and middle-latitude marine low-cloud amount) are harder to rationalize. We discuss further implications of all of these correlations in Section 3.4.

Metrics of Overall Cloud Feedback Errors
To assess the overall skill of each model in matching the expert-assessed cloud feedback components, we compute a single cloud feedback error metric for each model as the root mean square error (RMSE) with respect to the central expert judgment value over all six assessed feedback components of Sherwood et al. (2020). Each model's cloud feedback RMSE is provided in Table 3 and is plotted against total cloud feedback in Figure 3. CMIP5 and CMIP6 models exhibit both high and low cloud feedback RMSE values, and the multimodel mean RMSE values are the same for both ensembles (Table 3). Although the three best-performing models in this measure are CMIP6 models, there is no systematic tendency for CMIP6 models to perform better than CMIP5 models with respect to expert judgment. For models from the same modeling centers that can be tracked between the two generations, the same number of models show degraded performance as improved performance in this measure.

The seven models with smaller-than-average cloud feedback errors (i.e., RMSE ≤ 0.11 W m−2 K−1) have moderate (0.4-0.6 W m−2 K−1) total cloud feedbacks, except for CanESM5 [J], which has a total cloud feedback of 0.8 W m−2 K−1. All but three of these models have moderate (3-4 K) ECS values, the exceptions being HadGEM2-ES [c], MIROC-ESM [d], and CanESM5 [J], which have ECS values above 4.5 K. This makes sense given that the expert-assessed value of total cloud feedback, which has the greatest leverage on ECS, led to moderate values of ECS in Sherwood et al. (2020). Of the seven models with below-average feedback errors, GFDL-CM4 [L], MRI-ESM2-0 [R], and CanESM2 [b] are the only ones for which all assessed feedbacks lie within the expert likely range (Figures S15, S21, and S5 in Supporting Information S1, respectively; Table 3). Put simply, they get the right answer for the right reasons.
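The RMSE metric is a straightforward calculation once model and expert component values are in hand. In the sketch below, the central values for the tropical marine low-cloud (0.25), tropical anvil (-0.20), and high-latitude optical depth (0.0) components follow the text; the remaining numbers are illustrative stand-ins for the Table 2 entries:

```python
import numpy as np

# Expert-assessed central values for the six assessed components
# (W m-2 K-1; partly illustrative, see lead-in).
expert = {
    "high-cloud altitude": 0.20,
    "tropical marine low cloud": 0.25,
    "tropical anvil cloud area": -0.20,
    "land cloud amount": 0.08,
    "midlatitude marine low-cloud amount": 0.12,
    "high-latitude low-cloud optical depth": 0.00,
}

def feedback_rmse(model):
    """RMSE of a model's six assessed components vs. expert central values."""
    d = np.array([model[k] - expert[k] for k in expert])
    return float(np.sqrt((d ** 2).mean()))

# A model biased high by 0.1 in every component has an RMSE of 0.1.
biased = {k: v + 0.1 for k, v in expert.items()}
```

Note that because the metric uses only central values, a model can score well while still sitting at the edge of several assessed uncertainty ranges; the asterisk flags in Table 3 carry that complementary information.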
Models with too-large or too-small total cloud feedbacks and ECS tend to have larger-than-average cloud feedback RMSE values. That is, the models that lie farthest from the horizontal dashed line tend to be located on the right side of Figure 3. All five models with small total cloud feedback (< 0.2 W m−2 K−1) and small ECS (< 3 K) have cloud feedback components that are systematically biased low relative to expert judgment, giving them larger-than-average RMSE. Most models with large total cloud feedback and large ECS have cloud feedback components that are systematically biased high relative to expert judgment, also giving them larger-than-average RMSE (see Figure S4 in Supporting Information S1 and Table 3).
Two models (CNRM-CM6-1 [H] and CNRM-ESM2-1 [I]) have total cloud feedbacks very close to the central value of the expert assessment but larger-than-average RMSE values. They achieve reasonable total cloud feedbacks partly through having low-biased tropical marine low-cloud feedbacks that counteract their high-biased tropical anvil cloud area feedbacks (Figures S11 and S12 in Supporting Information S1; Table 3). Put simply, they get the right answer for the wrong reasons.
GFDL-CM4, CanESM5, MRI-ESM2-0, and CanESM2 remain the four models with lowest RMSE regardless of whether we use feedbacks derived from abrupt-4xCO2 or amip-p4K experiments.

[Figure 3 caption fragment: models are listed in Table 3 and are colored according to their (a) effective climate sensitivity (ECS) values and (b) net radiatively relevant cloud property error metric, E_NET.]

Relationship Between Cloud Feedbacks and Mean-State Cloud Property Errors
The fidelity with which models simulate mean-state radiatively relevant cloud properties is strongly and significantly anticorrelated with total cloud feedback (Figure 4a): models with smaller cloud property errors tend to have larger total cloud feedbacks. We show this result for the net radiatively relevant cloud property error (E_NET), but it is also strong and significant for the SW-radiation error as well as for the cloud property error without radiative weighting (not shown). This result is consistent with Figure 11 of Klein et al. (2013), but now the relationship holds across two ensembles of models (CMIP5 and CMIP6). Given that E_NET is an aggregated metric, we also tested whether the anticorrelation persists when considering relationships between individual cloud feedbacks and cloud-type-specific E_NET values (e.g., between midlatitude marine low-cloud amount feedback and mean-state errors for midlatitude marine low clouds). The anticorrelation continues to hold for all but the land-cloud amount feedback, albeit with weaker correlation coefficients (not shown). While caution is necessary given the relatively small sample size, an important question is why better simulation of present-day cloud properties is associated with larger cloud feedbacks. We leave this as an open question for future research.
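Because only ~19 models are available, significance statements like those above rest on small-sample correlation tests. A minimal sketch of such a check, with synthetic 19-model data (the relationship built into the data is an assumption for illustration):

```python
import numpy as np

def pearson_r_t(x, y):
    """Pearson r and its t statistic (compare to Student-t with n-2 dof)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    r = np.corrcoef(x, y)[0, 1]
    n = len(x)
    t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    return r, t

rng = np.random.default_rng(7)
e_net = rng.uniform(0.8, 1.5, 19)                    # mean-state error metric
fb = 0.9 - 0.5 * e_net + rng.normal(0.0, 0.05, 19)   # total cloud feedback

r, t = pearson_r_t(e_net, fb)
# |t| > ~2.11 (two-sided 95% critical value, 17 dof) => significant at 95%.
```

With only 17 degrees of freedom, even moderately strong correlations can fall short of significance, which is why the weak RMSE-versus-E_NET relationship discussed below does not pass this threshold.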
On average, mean-state cloud properties are simulated better in CMIP6 than in CMIP5 (Figure 4a and Table 3). Six CMIP6 models now have smaller error values than the smallest exhibited in CMIP5. For models from the same modeling center that can be tracked, all but one have improved in this measure from CMIP5 to CMIP6; marked improvement is seen, for example, from CanESM2 to CanESM5.

It is often implicitly assumed by model developers and model analysts that the degree to which a model's clouds resemble reality can be used as a basis to trust their response to climate change. In Figure 4b, we test this assumption by comparing agreement with expert judgment for cloud feedbacks (encapsulated in RMSE) to agreement with observations of the present-day climatological distribution of clouds and their properties (encapsulated in E_NET). While the correlation between these two metrics is positive, it is very weak and not significant at 95% confidence. Moreover, many models with small mean-state cloud errors have cloud feedback errors as large as or larger than those of models with large mean-state errors, indicating that improved simulation of mean-state cloud properties does not necessarily lead to improved cloud feedbacks with respect to expert judgment. The weak correlation also holds for relationships between RMSE and components of E_NET corresponding to individual cloud feedbacks (not shown).
In Figure 3b, models are color-coded by E_NET, allowing for a simultaneous assessment of how well models simulate mean-state cloud properties and match expert judgment of total cloud feedback and its components. From this, it is evident that most of the models with small mean-state errors (yellow shading) have large cloud feedback errors, and several lie above the upper limit of the likely range of total cloud feedback (i.e., in the top-right portion of the diagram). The one exception is GFDL-CM4 [L], which achieves a low cloud feedback RMSE, low values of E_NET, and total cloud feedback near the central value of expert judgment.

[Figure 4 caption (fragment): models are listed in Table 3 and colored green for CMIP5 and purple for CMIP6; expert likely and very likely ranges of total cloud feedback are indicated with horizontal shading in (a); correlations significant at 95% confidence are indicated with an asterisk.]
While realistic mean-state cloud properties may not guarantee that a model simulates more reliable cloud feedbacks, the models with the worst mean-state cloud properties (i.e., E_NET > 1.3) all have poor agreement with the expert-assessed total cloud feedback and/or its components (see models at the top right of Figure 4b). This is also evidenced by the fact that most of the models with large mean-state errors (purple/black shading) have large cloud feedback RMSE and lie below the lower limit of the likely range of total cloud feedback (i.e., in the bottom-right part of Figure 3b). This suggests that poor simulation of mean-state cloud properties precludes a model from simulating cloud feedbacks in agreement with expert judgment. In other words, better simulation of mean-state cloud properties may be a necessary but insufficient criterion for simulating more trustworthy cloud feedbacks.
This finding has support in the recent literature. Mülmenstädt et al. (2021) showed that a model with better mean-state cloud properties could have greater biases in its climate responses owing to compensating errors in cloud and precipitation processes. As noted in that study, fidelity in simulating mean-state clouds alone is an insufficient constraint on a model's feedback because of the many different combinations of process representations that can lead to equally valid representations of mean-state clouds. Since these process representations can all differ in their sensitivity to warming, the cloud feedback is not uniquely determined by mean-state properties, and improving the representation of the mean state (especially at the expense of the process level) does not guarantee that feedbacks will be more reliably simulated. This notion is supported by the fact that the set of model parameters driving the variance in mean-state extratropical cloud-radiative effect across members of the HadGEM3-GA7.05 perturbed physics ensemble differs from those driving the variance in its cloud feedback (Tsushima et al., 2020). A corollary to this is found in the many examples in which models with better "bottom-up" process representation more poorly satisfy "top-down" constraints like the observed historical global mean temperature evolution (Suzuki et al., 2013), the expert-assessed magnitude of aerosol indirect effects (Jing & Suzuki, 2018), or paleoclimate states (Zhu et al., 2020, 2021).

Unassessed Cloud Feedbacks

Sherwood et al. (2020) only assessed quantitative values for a selection of well-studied cloud feedbacks, so it is important to know whether any of the unassessed feedbacks are substantial. Examining these feedback components is important as it may guide where future research with observations, process-resolving models, and theory is needed to further constrain GCMs' cloud feedbacks. Figure 5 shows a breakdown of explicitly computed feedbacks that were not assessed in Sherwood et al. (2020).
There are an infinite number of ways of breaking down these components, but our strategy was to quantify those that complement the assessed feedbacks, either in altitude or in geographic space, to the extent possible. For example, we quantify the low-cloud altitude feedback since the high-cloud altitude feedback is an assessed category, and we quantify the low-cloud optical depth feedback between 30° and 90° latitude but excluding the 40–70° zone where it was already assessed. The sum of these closely reproduces the implied unassessed feedbacks in Figure 1 (not shown). See Figure S3 in Supporting Information S1 for a matrix that helps to visualize and rationalize this discretization.
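The complementary-band bookkeeping described above can be illustrated with a small sketch. The latitude bands follow the text, but the zonal-mean feedback values and the cos-latitude area weighting are illustrative assumptions.

```python
# Sketch of the zonal discretization strategy: a feedback defined on a
# latitude grid is split into the assessed 40-70 degree band and the
# complementary 30-40 plus 70-90 degree bands, using cos-latitude area
# weights. The zonal-mean feedback values are hypothetical placeholders.
import numpy as np

lat = np.arange(-89.5, 90.0, 1.0)            # grid-cell center latitudes
fb = np.full_like(lat, 0.1)                  # hypothetical zonal-mean feedback
w = np.cos(np.deg2rad(lat))                  # area weights

abs_lat = np.abs(lat)
assessed = (abs_lat >= 40) & (abs_lat < 70)                       # assessed band
unassessed = ((abs_lat >= 30) & (abs_lat < 40)) | (abs_lat >= 70)  # complement

def area_mean(values, weights, mask):
    """Contribution of the masked region to the global mean."""
    return np.sum(values[mask] * weights[mask]) / np.sum(weights)

# the two bands exactly partition 30-90 degrees, so their contributions sum
total_3090 = area_mean(fb, w, assessed) + area_mean(fb, w, unassessed)
full_3090 = area_mean(fb, w, abs_lat >= 30)
print(np.isclose(total_3090, full_3090))
```

The non-overlap of the masks is what guarantees that assessed and unassessed components add up without double counting, mirroring the property that the sum of the discretized components reproduces the implied unassessed feedback.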

GCM Cloud Feedbacks in Unassessed Categories
The multimodel mean unassessed cloud feedback transitions from 0.01 W m−2 K−1 in CMIP5 to 0.08 W m−2 K−1 in CMIP6. The largest shift occurs for the multimodel mean extratropical high-cloud optical depth component, which transitions from a negative to a weakly positive value. This component, along with the tropical marine ascent low-cloud amount plus optical depth component, exhibits the largest intermodel spread among all unassessed categories, and both may be worthwhile targets for future expert assessment.
There are a few models whose unassessed feedbacks sum to a value that is large relative to their total and/or combined assessed feedbacks, and these are worth examining in greater detail. MIROC5, MIROC-ES2L, and MIROC6 exhibit strong negative unassessed cloud feedbacks (with values near −0.10 W m−2 K−1) that are comparable in magnitude to the sum of their assessed feedbacks. MIROC5 and MIROC6 have strong negative low-cloud amount plus optical depth components in tropical marine ascent regions, while MIROC-ES2L has strong negative high-cloud amount and optical depth components in tropical marine subsidence regions. All three of these models also have moderately negative extratropical high-cloud optical depth feedbacks. Two CMIP6 models (CanESM5 and E3SM-1-0) have positive unassessed feedbacks that exceed 0.15 W m−2 K−1 (the multimodel mean plus one standard deviation). This occurs because of several systematically positive components, the largest of which is the 0.11 W m−2 K−1 extratropical high-cloud optical depth component in E3SM-1-0.
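The outlier criterion used here (a summed unassessed feedback exceeding the multimodel mean plus one standard deviation) can be sketched as follows, with invented model names and values standing in for the actual ensemble.

```python
# Sketch of the outlier criterion: flag models whose summed unassessed
# feedback exceeds the multimodel mean plus one standard deviation.
# Model names and values (W m-2 K-1) are hypothetical placeholders.
import numpy as np

unassessed_sum = {
    "ModelA": 0.02, "ModelB": -0.12, "ModelC": 0.18,
    "ModelD": 0.05, "ModelE": 0.20, "ModelF": -0.01,
}

vals = np.array(list(unassessed_sum.values()))
threshold = vals.mean() + vals.std(ddof=1)  # sample standard deviation
flagged = [name for name, v in unassessed_sum.items() if v > threshold]
print(flagged)
```

The same one-sided test with a negative threshold would flag models with anomalously negative unassessed sums, such as the MIROC family discussed above.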

Discussion and Conclusions
We have evaluated cloud feedback components simulated in 19 CMIP5 and CMIP6 models against benchmark values determined via an expert synthesis of observational, theoretical, and high-resolution modeling studies (Sherwood et al., 2020). We found that, in general, models that most closely match the expert-assessed values across several cloud feedback components have moderate total cloud feedbacks (0.4–0.6 W m−2 K−1) and moderate ECS (3–4 K). In contrast, models with the largest feedback errors with respect to expert assessment generally have total cloud feedbacks and climate sensitivities that are too large or too small.
There is no evidence that CMIP6 models simulate cloud feedbacks in better agreement with expert judgment than do CMIP5 models. While the three best models in our error metric are CMIP6 models, all models with total cloud feedbacks above the upper limit of the expert-assessed likely range are part of CMIP6 and have ECS values above 3.9 K, the upper limit of the expert-assessed likely ECS range. However, the converse is not true: several models with high ECS have total cloud feedbacks within the likely range. This means that a large cloud feedback ensures a high ECS, but a high ECS can emerge even with moderate cloud feedbacks, a result consistent with Webb et al. (2013) for CMIP3 models. More generally, having 2xCO2 radiative forcing and feedbacks in agreement with expert judgment does not guarantee that a model's ECS will be in agreement with expert judgment, because the latter is further constrained by evidence from the paleoclimate and historical records (Sherwood et al., 2020).

[Figure 5 caption (fragment): as in Figure 1, but for cloud feedback components that were not assessed in Sherwood et al. (2020). Note that the x axis spans a range that is only a third of that in Figure 1.]
On average, and for most individual modeling centers, mean-state cloud properties are better simulated in CMIP6. Better simulation of mean-state cloud properties is strongly and significantly correlated with larger total cloud feedback. The reasons for this remain to be investigated, but it is consistent with emergent constraint studies involving mean-state properties of clouds or their environment, nearly all of which point to higher-than-average cloud feedbacks and climate sensitivities (Brient et al., 2016; Fasullo & Trenberth, 2012; Sherwood et al., 2014; Siler et al., 2018; Tian, 2015; Trenberth & Fasullo, 2010; Volodin, 2008).
But more skillful simulation of mean-state cloud properties does not guarantee more skillful simulation of cloud feedbacks, and many models with small mean-state errors have large cloud feedback errors with respect to expert judgment. In general, better simulation of mean-state cloud properties leads to stronger but not necessarily better cloud feedbacks. GFDL-CM4, which has the smallest cloud feedback error, small mean-state cloud property error, and a total cloud feedback near the expert-assessed central value, is the exception to this rule. Skill at simulating mean-state cloud properties appears to be a necessary but not sufficient criterion for simulating realistic cloud feedbacks.
Models with large positive total cloud feedbacks tend to have systematically higher cloud feedbacks for all components rather than having a single anomalously strong positive component, and vice versa for models with small or negative total cloud feedbacks. This means, for example, that there is no single feedback that all high-ECS models are exaggerating. However, if there is some physical relationship causing the correlation between individual feedback components, this may imply that constraining one component would have knock-on effects across several components. In this case, feedbacks from multiple cloud types could be constrained with less evidence than would be needed if they were uncorrelated, and changing one aspect of a model might systematically change the feedbacks from multiple cloud types, making it easier to improve its cloud feedbacks. Establishing and understanding the physical basis of correlations among feedback components and their potential linkages with mean-state cloud properties is important future work.
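One simple way to quantify such co-variation is a correlation matrix of feedback components across models; strongly positive off-diagonal entries would indicate that components rise and fall together. The 4-model-by-3-component array below is a hypothetical placeholder.

```python
# Sketch of quantifying correlations among cloud feedback components
# across models. Rows are models, columns are feedback components;
# all values (W m-2 K-1) are hypothetical placeholders.
import numpy as np

components = np.array([
    [0.30, 0.25, 0.10],
    [0.10, 0.05, -0.05],
    [0.20, 0.15, 0.05],
    [0.00, -0.05, -0.10],
])

# component-by-component correlation matrix (columns as variables)
corr = np.corrcoef(components, rowvar=False)
print(np.round(corr, 2))
```

If such off-diagonal correlations are robust and physically grounded, constraining one component observationally would carry information about the others, as argued in the text.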
The high-latitude low-cloud optical depth feedback has shifted from being robustly negative across CMIP5 models, with some models simulating moderately strong negative feedbacks below the expert-assessed likely range, to a much weaker negative feedback in CMIP6, with the models tightly clustered about it. This represents a shift toward better agreement with expert judgment (also seen in Myers et al., 2021), and may be tied to reductions in supercooled liquid biases in the latest models (Bodas-Salcedo et al., 2019; Gettelman et al., 2019; Zelinka et al., 2020).
Results from several individual cloud feedback components raise important questions and motivate future investigation:

1. The high-cloud altitude feedback strength varies widely across models, despite its firm theoretical basis and support from observational analyses and high-resolution modeling. This motivates further work to pin down causes of intermodel spread and to eliminate sources of bias in this feedback.

2. Although we found that the tropical marine low-cloud feedback simulated by most models lies at the low end of the expert-assessed likely range, recent observational constraints support slightly lower values (Ceppi & Nowack, 2021; Cesana & Del Genio, 2021; Myers et al., 2021), owing in part to a better discrimination between strong stratocumulus feedbacks and weaker trade cumulus feedbacks. If incorporated into a future assessment, the expert value of this feedback could be revised downward, likely resulting in better alignment between it and the multimodel mean. To the extent that the assessed confidence bounds also narrow, however, the models with very weak tropical marine low-cloud feedbacks may still lie below the expert judgment range.

3. Despite the wide uncertainty in its expert-assessed value, eight models have positive tropical anvil cloud feedbacks that place them above the upper bound of the assessed likely confidence interval. This discrepancy between models and expert judgment can be traced to the disagreement between models and observations in the sensitivity of tropical TOA radiation and deep convective cloud properties to interannual fluctuations in surface temperature found in the studies of Mauritsen and Stevens (2015) and Williams and Pierrehumbert (2017), which were influential in establishing the expert-assessed value. Much uncertainty remains surrounding the processes controlling tropical anvil cloud fraction and its changes with warming, and the fidelity with which GCMs can simulate them (Gasparini et al., 2021; Hartmann, 2016; Seeley et al., 2019; Wing et al., 2020).

4. Cloud feedback components that were not assessed in Sherwood et al. (2020), though summing to roughly zero on average across models, have substantial intermodel spread and partly drive the increase in multimodel average cloud feedback from CMIP5 to CMIP6. Of these, the extratropical high-cloud optical depth component exhibits the largest increase. This, along with the aforementioned uncertainties surrounding high-cloud altitude and anvil cloud feedbacks, highlights the need for further observational analyses, process-resolving modeling, and theoretical studies targeting high-cloud feedbacks.
We have provided Python code that performs all calculations and generates all visualizations presented in this study. The code is also easily modified to accommodate comparisons between GCM cloud feedbacks and the similar but not identical breakdown of cloud feedback components used in the Sixth Assessment Report of the IPCC. We envision that this code could be applied to perturbed parameter or perturbed physics ensembles and to developmental versions of models to assess cloud feedbacks and cloud errors and place them in the context of other models and of expert judgment in real time during model development. This may be particularly valuable in less computationally expensive prescribed-SST perturbation experiments that are routinely performed during model development. Despite their simpler design, these "Cess-type" experiments effectively capture the feedbacks present in fully coupled experiments (Ringer et al., 2014). Doing so could help modelers to identify and correct erroneous cloud feedbacks that lead to biased climate sensitivity before the model is frozen, thereby increasing the reliability of the model for policy-relevant climate projections (e.g., Voosen, 2021).
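As one hypothetical illustration of the envisioned workflow, a developmental model's total cloud feedback could be placed in the context of an existing ensemble and the expert likely range. The function name, ensemble values, and likely-range bounds below are all assumptions for illustration, not part of the released code base.

```python
# Hypothetical sketch of wiring the evaluation into model development:
# place a developmental model's total cloud feedback relative to an
# ensemble and an assumed expert likely range. All names and values
# are invented for illustration.
import numpy as np

def evaluate_dev_model(dev_total, ensemble_totals, likely_range=(0.2, 0.9)):
    """Return a small diagnostic dict placing dev_total in context."""
    lo, hi = likely_range
    ens = np.asarray(ensemble_totals)
    return {
        "within_expert_likely": bool(lo <= dev_total <= hi),
        "ensemble_percentile": float((ens < dev_total).mean() * 100),
    }

# e.g., a candidate model version against six hypothetical ensemble members
report = evaluate_dev_model(0.55, [0.3, 0.4, 0.5, 0.6, 0.8, 1.0])
print(report)
```

Run after each cheap prescribed-SST experiment, such a diagnostic would flag a drifting cloud feedback long before the configuration is frozen.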

Data Availability Statement
Python code to perform all calculations and produce all figures and tables in this manuscript is available at https://doi.org/10.5281/zenodo.5206838 (M. Zelinka, 2021a) and is being incorporated into the PCMDI Metrics Package (Doutriaux et al., 2018).