Evaluating Cloud Feedback Components in Observations and Their Representation in Climate Models

This study quantifies the contribution of individual cloud feedbacks to the total short‐term cloud feedback in satellite observations over the period 2002–2014 and evaluates how they are represented in climate models. The observed positive total cloud feedback is primarily due to positive high‐cloud altitude, extratropical high‐ and low‐cloud optical depth, and land cloud amount feedbacks partially offset by negative tropical marine low‐cloud feedback. Seventeen models from the Atmosphere Model Intercomparison Project of the sixth Coupled Model Intercomparison Project are analyzed. The models generally reproduce the observed moderate positive short‐term cloud feedback. However, compared to satellite estimates, the models are systematically high‐biased in tropical marine low‐cloud and land cloud amount feedbacks and systematically low‐biased in high‐cloud altitude and extratropical high‐ and low‐cloud optical depth feedbacks. Errors in modeled short‐term cloud feedback components identified in this analysis highlight the need for improvements in model simulations of the response of high clouds and tropical marine low clouds. Our results suggest that skill in simulating interannual cloud feedback components may not indicate skill in simulating long‐term cloud feedback components.

This approach has some limitations. First, results from GCMs were used to inform the quantitative values of several expert-assessed long-term cloud feedbacks, which reduced their independence as a gauge of GCM performance. Second, S20 approximated the total cloud feedback as the sum of six components, and hence the model evaluation conducted in Z22 was limited to these six assessed cloud feedback values. Other cloud feedback components not discussed in S20 could be large in nature and/or strongly biased in models. For instance, Z22 showed that the inter-model spread of the feedbacks that were not assessed in S20 is nonnegligible and that their sum is a major contributor to the increase in multi-model mean cloud feedback between the latest generation of GCMs and the previous generation (Zelinka et al., 2022). Third, for some feedbacks, such as the tropical anvil cloud area feedback, S20's assessed value relied heavily on a single observational study, calling into question its utility as a benchmark for the long-term feedback.
Finally, S20's assessed feedback values were derived using expert judgment, which is informed by the literature available at the time. This means, for example, that the recent spate of evidence of weak trade cumulus feedbacks (e.g., Cesana & Del Genio, 2021; Myers et al., 2021; Vial et al., 2023; Vogel et al., 2022)-which would presumably decrease the assessed tropical marine low cloud feedback-is not accounted for, since these studies were published after S20. Still, we note that other recent studies support the overall cloud feedback values assessed in S20 (Ceppi & Nowack, 2021), and opinions differ on whether the assessed values of low cloud feedbacks should indeed be updated in light of the aforementioned studies (Sherwood & Forest, 2023).
In this study, we evaluate models against a different source of evidence: satellite observations. Here we estimate short-term cloud feedbacks in response to interannual variability inferred from the observations and evaluate the Coupled Model Intercomparison Project Phase 6 (CMIP6; Eyring et al., 2016) models' ability to simulate the individual components of the cloud feedback. This study not only investigates the assessed cloud feedbacks but also analyzes the cloud feedback components that are not explicitly identified in S20. The comparison against observations provides an in-depth evaluation that is aligned with real-world climate conditions. While avoiding the limitations of Z22 noted above, we acknowledge several caveats with our analysis. First, the cloud feedback in response to short-term climate fluctuations may differ from the long-term feedback relevant to climate sensitivity. This is not a limitation per se, since it is important to assess whether models can simulate cloud responses on all timescales, not just in response to CO₂-induced global warming. Furthermore, in Section 3.8 we explicitly assess whether model skill in simulating short-term cloud feedbacks implies skill in simulating long-term cloud feedbacks. Second, the cloud radiative response to forcing agents (e.g., aerosol-cloud interactions)-which may vary considerably across models and between models and observations-is not explicitly accounted for. We expect this to be a minor effect, as we quantify the response of clouds to interannual fluctuations in surface temperature, which are unlikely to have a coincident aerosol signature (unlike, for example, if our analysis relied on secular trends during this time). Indeed, we derive statistically indistinguishable feedbacks in atmospheric simulations with and without historical forcing variations, confirming that forcing variations make a minor contribution to the interannual cloud feedback during 2002-2014 (not shown). Third, the diagnosis of cloudiness is
similar but not identical in observations and models, as the observations measure two daytime samples per day (Aqua and Terra satellite overpasses) while the models' cloud simulator output is based on cloudiness from all sunlit timesteps. We anticipate this to be a minor effect as well, since studies have shown that temporal sampling has limited effect on cloudiness estimates at the global scale (e.g., Pincus et al., 2012). Finally, our analysis is subject to potential observational uncertainty arising from satellite instrument or calibration errors. However, the CERES and MODIS instrument calibration is stable, as shown in previous studies (e.g., CERES Science Team, 2022; Loeb et al., 2016; Yue et al., 2017), and recent studies found that the cloud feedbacks estimated from four different satellite data sets are consistent, suggesting that the measurement error is small (Myers et al., 2021; Scott et al., 2020).

Satellite Observations
The cloud feedback is estimated as the change in cloud-induced top-of-atmosphere (TOA) radiation anomalies (ΔR_cloud) per degree of global average surface temperature change. ΔR_cloud is calculated as the product of cloud fraction anomalies and cloud radiative kernels, both of which are a function of cloud top pressure (CTP) and cloud visible optical depth (τ). Two data sets of observed monthly cloud fraction are considered in this study: the Clouds and the Earth's Radiant Energy System (CERES) Flux By Cloud Type (FBCT) Ed4A product (Sun et al., 2022) and the Moderate Resolution Imaging Spectroradiometer (MODIS) COSP Level-3 product (Pincus et al., 2023), both available starting July 2002. Both products provide cloud properties from the MODIS instrument on the Terra and Aqua spacecraft but use different cloud retrieval algorithms: Minnis et al. (2011) for CERES-FBCT and Platnick et al. (2017) for MODIS-COSP. To be consistent with the CERES-FBCT product, which only provides cloud properties with an optical depth above 0.25, we discard the first τ bin (0-0.3) in the MODIS-COSP product, such that both have seven CTP bins and six τ bins. MODIS-COSP reports cloud fraction joint histograms for fully cloudy and partly cloudy scenes, and we sum them to obtain the total cloud fraction following previous studies (e.g., Wall et al., 2022). The conclusions of the paper and the cloud feedback values are consistent when including or excluding partly cloudy cloud fraction histograms. Note that both data sets utilize only daytime observations to ensure consistency of cloud-type classification (Pincus et al., 2023; Sun et al., 2022).
Cloud radiative kernels (Zhou et al., 2013) quantify the sensitivity of TOA radiation to CTP- and τ-stratified cloud fraction changes. The first τ bin (0-0.3) in the radiative kernels is also discarded for consistency, yielding six τ bins in all data sets. Multiplying the kernel by the change in cloud fraction histogram yields the TOA radiative impact of each cloud type. The total ΔR_cloud is then the summation of ΔR_cloud over the 42 cloud categories. ΔR_cloud consists of shortwave (SW) and longwave (LW) components and is defined as positive downward (i.e., planetary heating).
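As a hedged sketch of this calculation, the kernel-weighted summation over the 7 CTP × 6 τ histogram can be written as below. The array shapes, magnitudes, and variable names are illustrative assumptions, not values from the CERES-FBCT or MODIS-COSP products.

```python
import numpy as np

# Illustrative dimensions: (time, 7 CTP bins, 6 tau bins, lat, lon)
rng = np.random.default_rng(0)
ntime, nctp, ntau, nlat, nlon = 24, 7, 6, 3, 4

dC = rng.normal(0, 0.5, (ntime, nctp, ntau, nlat, nlon))     # cloud fraction anomalies (%)
kernel_sw = rng.normal(-1.0, 0.3, (nctp, ntau, nlat, nlon))  # SW kernel (W m-2 %-1), synthetic
kernel_lw = rng.normal(0.5, 0.2, (nctp, ntau, nlat, nlon))   # LW kernel (W m-2 %-1), synthetic

# Radiative impact of each of the 42 cloud categories, then sum over CTP and tau
dR_sw = (kernel_sw[None] * dC).sum(axis=(1, 2))  # (time, lat, lon)
dR_lw = (kernel_lw[None] * dC).sum(axis=(1, 2))
dR_cloud = dR_sw + dR_lw                         # net anomaly, positive downward
```

In the actual analysis the same product would be formed per cloud category before summation, which is what allows ΔR_cloud to be decomposed by cloud type later.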
The cloud feedback is calculated separately using the CERES-FBCT and MODIS-COSP products. These are then averaged to produce a single combined observational estimate of the feedback, with 90% and 66% confidence intervals calculated as the square root of the sum of the squared uncertainties from the two data sets, divided by two. The uncertainties are averaged under the assumption that the two data sets have uncorrelated errors. Uncertainty intervals are very similar if calculated using a bootstrapping method that does not rely on a priori assumptions about the distribution of uncertainties from the two data sets. Feedbacks estimated using the two observational data sets have quantitative differences that are discussed further in Section 3.6, but the main conclusions of the paper are unchanged if using either data set in isolation.
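For concreteness, the combination rule can be written as follows. The numerical values are illustrative placeholders, not the paper's estimates.

```python
import numpy as np

# Placeholder estimates (W/m²/K) and 90% CI half-widths for the two products
fb_fbct, u_fbct = 0.45, 0.30    # CERES-FBCT (illustrative)
fb_modis, u_modis = 0.35, 0.26  # MODIS-COSP (illustrative)

# Combined estimate: simple average of the two central values
fb_combined = 0.5 * (fb_fbct + fb_modis)

# Combined CI half-width: sqrt of the sum of squared uncertainties, divided by two,
# which assumes the two data sets have uncorrelated errors
u_combined = np.sqrt(u_fbct**2 + u_modis**2) / 2
```

Under the uncorrelated-errors assumption, the combined uncertainty is smaller than either individual uncertainty, which is why averaging the two products tightens the observational constraint.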

CMIP6 Models
One of the main goals of this study is to evaluate climate models' performance in simulating cloud feedbacks measured from real-world climate variability. We analyze Atmospheric Model Intercomparison Project (AMIP) simulations in CMIP6, where the models are forced with anthropogenic and natural forcing and observed fields of sea surface temperatures and sea ice concentrations. The use of AMIP models ensures a fair comparison with the observations and allows us to focus on the representation of the cloud radiative response in the atmospheric model. This is especially important given the substantial influence of sea surface temperature patterns on the sign and strength of individual feedbacks, which is discussed in Section 3.7.
The calculation of individual cloud feedbacks in models requires cloud fraction anomalies stratified by cloud top pressure and cloud optical depth, which are produced from the ISCCP simulator (Klein & Jakob, 1999; Webb et al., 2001). This limits our analysis to 17 climate models (listed in Table S1 in Supporting Information S1) that participated in the Cloud Feedback Model Intercomparison Project (CFMIP; Webb et al., 2017). We calculated modeled cloud feedback from all available members of each climate model. Since the MODIS instrument cannot retrieve thin clouds with τ < 0.25, we exclude the first τ bin (0-0.3) in modeled cloud fraction for consistency. The modeled cloud feedback components computed excluding or including the first τ bin are highly correlated (r = 0.96), suggesting that excluding the first τ bin does not significantly affect the study's key findings (Figure S1 in Supporting Information S1).

Decomposition of Cloud Feedback Into Contributions From Individual Cloud Types
To decompose the contributions of different cloud types to total ΔR_cloud, we follow Z22. ΔR_cloud is first separated into contributions from low clouds (CTP > 680 hPa) and high clouds (CTP ≤ 680 hPa). Within each regime, we further break down ΔR_cloud into the components that are solely due to changes in cloud amount, altitude, and optical depth, along with a residual term (Zelinka et al., 2012, 2013, 2016).
Changes in low clouds observed by passive sensors may be due to changes in high clouds that mask or reveal low clouds. To avoid this ambiguity, we separate the low cloud feedback into three components: that due to changes in unobscured low clouds, that due to changes in obscuration by upper-level clouds, and a small covariance term (see Equation 4 in Scott et al. (2020) and also Myers et al. (2021)). The unobscured low cloud component is treated as the true low cloud feedback and is further decomposed into cloud amount, altitude, and optical depth components. The obscuration component refers to changes in the fraction of low-level clouds that occur exclusively because of variations in the fraction of higher-level clouds that can cover or uncover the lower clouds beneath them. As in Z22, this component is combined with the high-cloud amount component, treating them as a single entity.
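The spirit of this partitioning can be illustrated with synthetic data. The sketch below assumes the reported low-cloud fraction equals the unobscured low-cloud fraction times one minus the upper-level cloud fraction; the variable names and the exact linearization are illustrative assumptions, not a reproduction of Equation 4 of Scott et al. (2020).

```python
import numpy as np

rng = np.random.default_rng(1)
ntime = 150
U = 0.4 + 0.05 * rng.standard_normal(ntime)       # upper-level cloud fraction (synthetic)
Ltilde = 0.3 + 0.04 * rng.standard_normal(ntime)  # unobscured low-cloud fraction (synthetic)
L = (1 - U) * Ltilde                              # what a passive sensor would report

dU = U - U.mean()
dLt = Ltilde - Ltilde.mean()
dL = L - L.mean()

# Three-way split: true low-cloud term, obscuration term, covariance term
term_low = (1 - U.mean()) * dLt                   # unobscured low-cloud changes
term_obsc = -Ltilde.mean() * dU                   # covering/uncovering by upper clouds
term_cov = -(dU * dLt - (dU * dLt).mean())        # small covariance residual

# The three terms reconstruct the reported low-cloud anomaly exactly here
assert np.allclose(term_low + term_obsc + term_cov, dL)
```

In this toy setup the obscuration term depends only on upper-cloud variations, which is why the paper can fold it into the high-cloud amount component.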
In the tropics (30°N-30°S), stratocumulus and trade cumulus dominate the marine descent regions, while anvil clouds dominate the marine ascent regions. To distinguish the clouds in different large-scale circulation regimes, we compute tropical marine ascent/descent feedbacks separately following Bony et al. (2004). Cloud radiative kernel and cloud fraction fields over the tropical ocean are first aggregated into 10-hPa/day wide bins of monthly averaged 500-hPa vertical velocity (ω500) with area weighting. ΔR_cloud is then calculated in ω500 space rather than geographic space and is further decomposed into cloud amount, altitude, and optical depth components in ω500 space. Tropical ascent and descent components are computed by summing over the bins where ω500 < 0 and ω500 ≥ 0, respectively.
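A minimal sketch of this regime compositing follows, using synthetic inputs. The field values, area weights, and bin range are illustrative assumptions, not the actual ERA5 or satellite data.

```python
import numpy as np

rng = np.random.default_rng(2)
npts = 5000
omega500 = rng.normal(10, 30, npts)   # monthly-mean 500-hPa vertical velocity (hPa/day)
dR = rng.normal(0.0, 1.0, npts)       # ΔR_cloud at tropical-ocean grid points (W/m²)
area_w = rng.random(npts) + 0.1       # grid-cell area weights (illustrative)

edges = np.arange(-100, 101, 10)      # 10-hPa/day wide omega500 bins
idx = np.digitize(omega500, edges) - 1
nbins = len(edges) - 1

dR_binned = np.full(nbins, np.nan)
pdf = np.zeros(nbins)                 # area-weighted frequency of each dynamical regime
for b in range(nbins):
    m = idx == b
    if m.any():
        dR_binned[b] = np.average(dR[m], weights=area_w[m])
        pdf[b] = area_w[m].sum()
pdf /= pdf.sum()

# Ascent (omega500 < 0) and descent (omega500 >= 0) contributions
centers = 0.5 * (edges[:-1] + edges[1:])
ascent = np.nansum(np.where(centers < 0, dR_binned * pdf, 0.0))
descent = np.nansum(np.where(centers >= 0, dR_binned * pdf, 0.0))
```

Weighting each bin's composite by the bin's area-weighted frequency is what makes the ascent and descent contributions sum back to the tropical-ocean mean.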
Lastly, the individual components of the cloud feedback are estimated by regressing ΔR_cloud components at each grid point (or each ω500 bin) against the global average surface air temperature anomaly. Both SW and LW components are calculated, and their sum is what we report below. For observed cloud feedbacks, the surface temperature and ω500 are obtained from ERA5 reanalysis (Hersbach et al., 2020). Monthly average data are used for all calculations, and the anomalies are deviations from the climatological seasonal cycle. While the cloud feedback is defined as the change in cloud radiative anomalies per degree of global temperature change, this does not imply that global average surface temperature directly controls changes in clouds in every location. Rather, individual cloud responses at a given location are responding to their proximate controlling factors (e.g., inversion strength, circulation, and humidity), which themselves change as global temperature changes. All feedbacks reported in this study are scaled by the fractional area of the globe over which they operate, so they represent relative contributions to the global mean total cloud feedback. We analyze the cloud feedback over the globe to be consistent with the definitions in S20 and Z22. Confining the analysis to 60°N-60°S, where FBCT sampling is better, yields similar results (Table S4 in Supporting Information S1).
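The deseasonalization and regression steps can be sketched as follows, using a synthetic monthly series. The series length, noise level, and "true" slope are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
nyears, lam_true = 12, 0.6  # roughly the record length; lam_true is a synthetic slope (W/m²/K)

Ts = 0.3 * rng.standard_normal(nyears * 12)             # global-mean T anomaly proxy (K)
dR = lam_true * Ts + 0.2 * rng.standard_normal(nyears * 12)  # ΔR_cloud component (W/m²)

def deseasonalize(x):
    """Remove the climatological seasonal cycle from a monthly series."""
    x = x.reshape(-1, 12)             # (years, months)
    return (x - x.mean(axis=0)).ravel()

Ts_a, dR_a = deseasonalize(Ts), deseasonalize(dR)

# Feedback = OLS slope of the radiative anomaly on the temperature anomaly
feedback = np.polyfit(Ts_a, dR_a, 1)[0]  # W/m²/K
```

In practice this regression is performed per grid point (or per ω500 bin) and per feedback component, and the slopes are then area-aggregated.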
In summary, in both models and observations we compute LW, SW, and net cloud feedbacks at each geographic location over both land and ocean.Over the tropical oceans, we additionally distinguish feedbacks within regimes of ascent and descent.All feedbacks are broken down into amount, altitude, and optical depth components for both high and low clouds (with careful accounting of obscuration effects).While this allows for a wide variety of possible groupings in which to compare modeled and observed feedbacks, we adhere to the groupings used in Z22 since-in the case of the assessed feedbacks-they represent relatively well-studied feedbacks with known governing mechanisms.

Significance Test for Biased Cloud Feedback Components
This study identifies biased cloud feedback components in AMIP simulations by considering the uncertainty in both observations and models. We perform all significance tests using a bootstrapping method, as it avoids a priori assumptions about the shape of the data distribution. First, the distribution of each cloud feedback component is estimated through Monte Carlo sampling (e.g., Chao & Dessler, 2021; Dessler & Forster, 2018; Donohoe et al., 2014). For each sample, we calculate the feedback by randomly selecting a number of points (with replacement) from the time series equal to the effective sample size adjusted for temporal autocorrelation (Santer et al., 2000) and conducting linear regression. This process is repeated 100,000 times for both observations and individual model members. Next, a new distribution representing the difference between modeled and observed feedbacks is generated through Monte Carlo sampling. Biased cloud feedback components are identified if the differences are statistically significant (i.e., significantly depart from zero) at the 90% or 66% confidence level.
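A hedged sketch of one such bootstrap follows. The series, noise levels, and resample count are illustrative (the paper uses 100,000 resamples), and the lag-1 effective-sample-size adjustment shown is one common form of the Santer et al. (2000) correction.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 150
Ts = 0.3 * rng.standard_normal(n)             # synthetic temperature anomalies (K)
dR = 0.5 * Ts + 0.3 * rng.standard_normal(n)  # synthetic radiative anomalies (W/m²)

def lag1_autocorr(x):
    """Lag-1 autocorrelation of a series."""
    x = x - x.mean()
    return (x[:-1] * x[1:]).sum() / (x * x).sum()

# Effective sample size adjusted for temporal autocorrelation
r1 = max(lag1_autocorr(dR), 0.0)
n_eff = max(int(n * (1 - r1) / (1 + r1)), 3)

nboot = 2000  # 100,000 in the paper; reduced here for speed
slopes = np.empty(nboot)
for i in range(nboot):
    k = rng.integers(0, n, n_eff)             # draw n_eff points with replacement
    slopes[i] = np.polyfit(Ts[k], dR[k], 1)[0]

lo, hi = np.percentile(slopes, [5, 95])       # 90% confidence interval on the feedback
```

The model-minus-observation test then draws from the model's and the observations' bootstrap distributions and checks whether the difference distribution significantly departs from zero.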

Model Evaluation Strategy
When evaluating models' short-term cloud feedbacks against observations, one has to carefully account for uncertainty in the feedback estimates arising from meteorological noise. As shown below, this noise can be substantial for feedbacks derived from interannual variability, because meteorological fluctuations that are not systematically dependent on global surface temperature contaminate the regression, especially when using only 12.5 years of data. This means that for the following analyses, we do not compare central feedback estimates between models and observations but rather the distributions of both, following the aforementioned bootstrapping method. Moreover, we consider all available realizations of the individual models' AMIP simulations, which provides another indication of how much weather noise can affect the feedback estimates. Because inter-member differences in central feedback estimates within any given model can be comparable to the biases with respect to observations, our ability to conclusively determine that any particular model is biased is limited. This is further complicated by the fact that many models provide only a single realization and could therefore be unfairly judged as biased or unbiased owing to a small sample size. We therefore focus hereafter on identifying biases that are systematic across models and that are relatively robust across the multiple realizations of any given model.

Comparisons of Total Cloud Feedback in Models Against Observations
Figure 1a shows the short-term total cloud feedback inferred from the observations and from CMIP6 AMIP simulations. We first focus on the period of July 2002 to December 2014, determined by the overlap between the satellite observations and the CMIP6 AMIP experiments. All models have at least one member agreeing with the observations within the 90% confidence intervals, and only one model member has a value that is biased at 90% confidence. We note that the inter-member spread arising from internal meteorological variations can be large for specific models (one standard deviation across ensemble members varies across models from 0.10 to 0.57 W/m²/K). These internal variations remain in longer analyzed periods (e.g., the 20-year cloud feedback in AMIP simulations) and in other simulations (e.g., 12-year and 20-year segments of the piControl simulation) (not shown).
We also compare the spatial pattern of total cloud feedback between observations and models. The CMIP6 AMIP multi-model mean qualitatively reproduces some spatial features of the observed total cloud feedback, such as the positive feedback over the northeastern Pacific, equatorial Pacific, and southwestern Pacific (Figure S2 in Supporting Information S1). The multi-model mean is obtained by averaging the first member of each model to ensure consistent treatment of all models, but the results are similar if including multiple members.
The total cloud feedback is further decomposed into the individual assessed feedback components that are closely tied to well-studied processes (i.e., assessed cloud feedbacks, which are discussed individually in Section 3.2) and the feedback components that are not assessed in S20 (i.e., unassessed cloud feedbacks, which are discussed individually in Section 3.3). A description of each cloud feedback component is provided in Table 1. Table 2 summarizes the values of cloud feedback components estimated from the observations, along with the number of models significantly differing from the observations. The values for individual models and members are listed in Tables S1 and S2 in Supporting Information S1. The breakdown is rationalized in Z22 (see their Figure S3), and the summation of all cloud feedback components equals the total cloud feedback, with no overlap in their definitions. Most of the variance in the total cloud feedback is accounted for by the sum of the assessed components. All models except CESM2 and MIROC-ES2L simulate a sum of assessed cloud feedbacks that matches well with the observations. However, for the sum of unassessed cloud feedbacks, the majority of models (13 of 17) have at least one member that exhibits significant bias at 66% confidence, and several of these are biased at 90% confidence. All of these significant biases are negative except one realization of IPSL-CM6A-LR. Note that there are multiple ways to interpret the number of models that are biased, and in this study we adopt a more stringent approach to minimize the risk of underestimating biased models.

Comparisons of Assessed Cloud Feedback Components
In this section, we evaluate models' ability to capture the individual assessed feedback components (Figure 2). The observational value of the high-cloud altitude feedback is +0.40 ± 0.22 W/m²/K (90% confidence intervals; Figure 2a). The altitude of high-cloud tops increases as temperature increases, which hinders the system's ability to radiatively damp warming and results in a positive feedback (Hartmann & Larson, 2002; Zelinka & Hartmann, 2010). Models are systematically low-biased, with at least one member from every model significantly underestimating the feedback magnitude at 66% confidence, and many members underestimating it at 90% confidence. Not a single member of any model overestimates the observed value of the high cloud altitude feedback. Inclusion of clouds in the thinnest optical depth category only slightly increases the multi-model mean value (i.e., from +0.02 ± 0.12 W/m²/K to +0.06 ± 0.13 W/m²/K (one standard deviation)), suggesting that the bias does not simply arise from ignoring the thinnest clouds in models. Note that the magnitude of the high-cloud altitude feedback is sensitive to the choice of cloud radiative kernel, but models are significantly biased low with respect to the observations regardless of which radiative kernel is used (Figure S9a in Supporting Information S1). Sensitivity of results to the choice of cloud radiative kernel is discussed further in Section 3.5. This bias is not seen in the long-term high-cloud altitude feedback in Z22, where the CMIP6 multi-model mean matches the expert-assessed value well and 11 of 12 models have best estimates that fall within expert judgment's 90% confidence intervals. Note that the cloud radiative kernels used in Z22 (Zelinka et al., 2012) and in this study (Zhou et al., 2013) are comparable, as they are constructed using the same methodology, including the same radiative transfer model. The multi-model mean spatial pattern of the high-cloud altitude feedback inferred from CMIP6 AMIP models is similar to the major observed spatial features, notably over the Pacific Ocean, but models fail to reproduce positive values observed over several regions including the Indian Ocean, North Atlantic, Eurasia, and Australia (Figures S3a and S3e in Supporting Information S1).
Six models (CanESM5, E3SM-1-0, GFDL-CM4, IPSL-CM6A-LR, MIROC6, and MRI-ESM2-0) show at least one member with a moderate to strong negative high-cloud altitude feedback. For these six models, the response of cloud fraction from the ISCCP simulator indicates a decrease in high-cloud altitude with increasing temperature, even when including clouds in the thinnest optical depth category. In contrast, cloud fraction profiles taken directly from the models (cl) or from the CALIPSO simulator (clcalipso) show an increase in high-cloud altitude. Future work is needed to reconcile the inconsistent high-cloud responses between various cloud fraction diagnostics in AMIP simulations, which are not seen in corresponding long-term warming simulations (i.e., abrupt-4xCO2). The tropical marine low-cloud feedback is −0.51 ± 0.28 W/m²/K in this 12-year period, primarily due to a strong negative SW cloud amount component that arises from increases in low-cloud cover as temperature increases.
Note that this feedback component is highly sensitive to the time period considered, which will be discussed in Section 3.7. Ten of seventeen models have at least one member matching the sign of the observed feedback, but the models are generally biased too positive for this feedback. Notably, at least one member from fourteen of the models has a tropical marine low-cloud feedback that is significantly high-biased at 66% confidence, with many being high-biased at 90% confidence. No member of any model has a negative tropical marine low-cloud feedback value that is significantly larger in magnitude than what is observed. Marked disagreement is apparent among models and between models and the observed tropical marine low cloud feedback in regimes of weak descent (Figure S4a in Supporting Information S1).
The tropical anvil cloud feedback inferred from the observations is weakly negative (−0.06 ± 0.09 W/m²/K), as the negative high cloud optical depth feedback is largely offset by the positive high cloud amount feedback in regions of tropical oceanic ascent. As temperature increases, the high-cloud fraction decreases over ascending regions, which results in less SW reflection (not shown). At the same time, the optical thickness of high clouds increases, reflecting more SW radiation, and these two effects nearly perfectly counteract each other (not shown).
In both the amount and optical depth feedbacks, the LW component counteracts the SW component with opposite sign but smaller magnitude. The majority of AMIP models exhibit a weak to moderate negative feedback, with nine models having at least one member exhibiting negative anvil cloud feedbacks that are strong enough to be significantly low-biased with respect to observations at 66% confidence. However, the bias is less systematic than for other feedback components, as two models have at least one member with positive anvil feedbacks that are high-biased at 66% confidence. These anvil cloud feedback biases (of either sign) are mainly driven by the high cloud amount component, with a minor contribution from the optical depth component (not shown).
The observed land cloud amount feedback is +0.14 ± 0.13 W/m²/K, to which low and high clouds contribute around 60% and 40%, respectively. The positive feedback is consistent with previous literature based on GCMs and observations, in which relative humidity decreases over land as temperature increases, leading to a reduction in cloud cover (e.g., Kamae et al., 2016; Zhang & Klein, 2013). Every member of each model has a positive feedback, in agreement with observations. However, eleven models have at least one member exhibiting land cloud amount feedbacks that are significantly high-biased at 66% confidence, and no models have any members with feedbacks that are significantly low-biased with respect to observations. The regions contributing to positive feedback values include western Asia/eastern Europe, southern Australia, the southeastern U.S., and northern South America. The CMIP6 multi-model mean agrees with the sign of the observations over subtropical regions but fails to reproduce the strong positive values over Eurasia and the southeastern U.S., while also overestimating the positive feedback over Australia and Africa (Figures S3b and S3f in Supporting Information S1).
The majority of models have mid-latitude (30-60°N/S) marine low-cloud amount feedbacks that agree reasonably well with the small value estimated from observations (+0.04 ± 0.12 W/m²/K), with all but three models having at least one member that lies within the statistical uncertainty of the observed value. While nine models have at least one member that is significantly biased with respect to observations at 66% confidence, these biases are not systematic across models, with some biased low and others biased high. The general agreement between observations and models is also apparent in the spatial pattern, where the multi-model mean captures the positive values over the northeastern and southwestern Pacific, as well as the negative values over the northwestern and southeastern Pacific (Figures S3c and S3g in Supporting Information S1).
In contrast to the neutral high-latitude (40-70°N/S) low-cloud optical depth feedback assessed in S20 on long-term time scales, the observations suggest a positive short-term feedback of +0.09 ± 0.08 W/m²/K, in close agreement with a previous observational analysis (i.e., +0.10 ± 0.03 W/m²/K (1.645 standard deviations); Wall et al. (2022)). In observations, low clouds transition from optically thick to thin as global mean temperature increases, leading to a reduction of SW reflection and a positive optical depth feedback. Changes in low cloud optical depth arise from a complex mix of temperature-dependent processes, including changes in the moist adiabatic lapse rate, inversion strength, boundary layer decoupling, phase partitioning, and adiabaticity (e.g., Terai et al., 2019), which are not further explored here. The CMIP6 models are systematically low-biased for this feedback, with every model having at least one member (and in many cases, all members) exhibiting a low-biased feedback that is significant at 66% confidence. No members of any model have a positive high-latitude low-cloud optical depth feedback that is significantly larger than estimated from observations. This is consistent with Terai et al. (2016), who found that low cloud optical depth increases too strongly as temperature increases in most CMIP5 AMIP models.
In summary, despite substantial uncertainty arising from weather noise, several systematic biases in the modeled assessed cloud feedback components are apparent. The positive high cloud altitude and high-latitude low cloud optical depth feedbacks are systematically underestimated in models, with many model realizations actually having the wrong sign and no realizations having values that are significantly larger than observed. Several model realizations also exhibit negative anvil cloud feedbacks that are too large in magnitude relative to the small negative anvil cloud feedback in observations. These low biases are countered somewhat by high biases in other components. Notably, most model realizations underestimate the strength of the negative tropical marine low cloud feedback, with many simulating a positive feedback instead, and some model realizations significantly overestimate the strength of the positive land cloud amount feedback. Hence most of the individual assessed components are systematically biased in models, but these biases closely compensate, resulting in an overall lack of systematic errors in the sum of assessed feedbacks and the total cloud feedback (Figures 1a and 1b).

Comparisons of Unassessed Cloud Feedback Components
S20 quantified a set of six cloud feedback components that are well studied and/or important, and model evaluation in Z22 was done only for those components. An advantage of our approach is that we are not restricted to S20's feedbacks and can evaluate feedback components that were unassessed in S20. As mentioned above, the observations show a positive unassessed cloud feedback on the interannual timescale (significant at 90% confidence). However, the majority of CMIP6 AMIP simulations underestimate its magnitude (Figure 1c).
We decompose the unassessed cloud feedback into six components plus a residual term (Table 2 and Figure 3). Note that the unassessed components could be grouped in many ways. Here we start from the components used in Z22 and further aggregate those that share similar characteristics for simplicity. The observed unassessed cloud feedback receives the largest contribution from the extratropical (30-90°N/S) high-cloud optical depth feedback (i.e., +0.20 ± 0.10 W/m²/K; Figure 3a). This component is also one of the main contributors to the model-observation discrepancy. Every model has at least one realization that is biased low at 66% confidence, with many of the realizations being significantly biased at 90% confidence. Not a single realization has an extratropical high cloud optical depth feedback that is significantly larger than the observed value. As global temperature increases, the high clouds shift from optically thick to optically thin, thereby reflecting less SW radiation. As shown in Figures S5a and S5d in Supporting Information S1, this comes mainly from clouds over land in the northern hemisphere (around 52%) and over the southern Pacific (around 30%). Underestimation over these two regions contributes to a weaker positive or even weakly negative value in CMIP6 AMIP simulations.
The extratropical low-cloud optical depth component is weakly positive in the observations (+0.09 ± 0.05 W/m²/K; Figure 3b). Here extratropical is defined as poleward of 30° latitude but excluding the 40-70° band that is already counted in the assessed high-latitude low-cloud optical depth feedback. The mechanism driving positive values is similar to that for the high-latitude low-cloud optical depth feedback, with cloud optical depth decreasing at 30-40° latitude as temperature increases and cloud fraction changing little among optical depth bins at 70-90° latitude. The models systematically underestimate the feedback, with all but one model having at least one realization that is significantly low-biased at 66% confidence and many low-biased at 90% confidence. Not a single realization has a feedback that exceeds the observed value. The underestimation occurs at nearly every location within the domain (Figures S5b and S5e in Supporting Information S1). Dominated by high clouds, the tropical land cloud optical depth feedback is +0.04 ± 0.06 W/m²/K in the observations (Figure 3c). Most CMIP6 models (13 of 17) have at least one member that agrees within the observed 90% confidence intervals. Ten models have at least one member that significantly differs from the observations, but the biases are not systematic: seven models exhibit low-biased feedbacks while three exhibit high-biased feedbacks. The observed positive feedback mainly originates from East Asia and the northern part of South America, but these spatial features are not well captured in the CMIP6 multi-model mean (Figures S5c and S5f in Supporting Information S1).
The observations show a positive tropical marine descent high-cloud feedback (+0.11 ± 0.13 W/m²/K; Figure 3d). This is mainly due to a positive optical depth component from high clouds becoming optically thinner with warming. In contrast, the net high cloud amount component is small despite increases in high cloud cover with warming, owing to opposing LW and SW effects (not shown). All models and members exhibit a feedback value that agrees with the observations, except for four models that have at least one member that is low-biased at 66% confidence.
The tropical marine ascent low-cloud feedback is −0.11 ± 0.10 W/m²/K in the observations (Figure 3e), due to a negative cloud optical depth component partially offset by a positive cloud amount component. The low clouds shift from optically thin (τ = 0.3-3.6) to moderate optical thickness (τ = 3.6-23), with little change in the optically thick regime (τ > 23) (not shown). Meanwhile, the low-cloud fraction decreases as temperature increases, reflecting less SW radiation. Around half of the models (9 of 17) have at least one member that significantly overestimates the feedback at 66% confidence, while no model simulations are significantly low-biased.
The term "Others" includes the extratropical (30-90°N/S) marine high-cloud amount, 60-90°N/S marine low-cloud amount, and low-cloud altitude feedbacks, as these three components have smaller magnitudes in both observations and CMIP6 AMIP simulations. Together they make a minor contribution to the unassessed cloud feedback, and the model values are in reasonable agreement with the observations (Table 2 and Figure 3f).
The unassessed cloud feedback includes three residual terms: the cloud radiative kernel decomposition residual (Zelinka et al., 2013, 2016), a covariance term from the low-cloud obscuration calculation (Myers et al., 2021; Scott et al., 2020), and a residual term from computing tropical marine cloud feedbacks in ω500 space (Bony et al., 2004) (Tables 1 and 2). The first two terms have a small magnitude in both observations and models, but the last one, due to nonlinearity when aggregating geospatial information into ω500 space, has a relatively large magnitude and inter-model spread (not shown). We have also tested an alternative method in which feedbacks are computed in geographic space and then aggregated over ascent and descent regions based on the climatological 500-hPa vertical velocity at each month and each grid point. The residual term vanishes in the latter method, but our default method better ensures that we capture feedbacks due to cloud property changes within specific vertical motion regimes, which is desirable for connecting the feedbacks to their underlying physical processes.
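The difference between the two aggregation orders discussed above can be illustrated with a toy calculation. This is a minimal sketch on synthetic data: the grid dimensions, the sign convention (positive ω500 denoting descent), and the simple OLS slope are illustrative assumptions, not the study's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
nt, ny, nx = 150, 12, 24                    # months, lat, lon (toy grid)
w500 = rng.normal(0, 30, (nt, ny, nx))      # 500-hPa pressure velocity (hPa/day)
dR = rng.normal(0, 2, (nt, ny, nx))         # cloud-induced flux anomalies (W/m^2)
Tglb = rng.normal(0, 0.2, nt)               # global-mean T anomaly (K)

def slope(y, x):
    """OLS regression slope of y on x along axis 0."""
    x = x - x.mean()
    y = y - y.mean(axis=0)
    return np.tensordot(x, y, axes=(0, 0)) / (x ** 2).sum()

# Method 1 (the default in the text): composite anomalies into the descent
# regime month by month, then regress the regime-mean anomaly on Tglb.
descent = w500 > 0                              # subsidence mask, per month
regime_series = np.nanmean(np.where(descent, dR, np.nan), axis=(1, 2))
fb_regime = slope(regime_series, Tglb)

# Method 2 (the alternative in the text): regress locally in geographic
# space, then aggregate over the climatological descent region.
fb_map = slope(dR.reshape(nt, -1), Tglb).reshape(ny, nx)
clim_descent = w500.mean(axis=0) > 0
fb_geo = fb_map[clim_descent].mean()

# The residual the text attributes to nonlinearity of regime aggregation:
residual = fb_regime - fb_geo
```

Because the regime mask moves from month to month in Method 1, the two orders of operation generally disagree, which is the source of the residual term described above.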
In summary, the sum of unassessed cloud feedbacks is significantly positive in observations during the 2002-2014 period, and the optical depth components over four cloud regimes are the major contributors: extratropical (30-90°N/S) high cloud, extratropical low cloud, tropical land, and tropical high cloud over oceanic descent regions (Figures 3a-3d). The CMIP6 ensemble tends to underestimate the magnitudes of the first three of these optical depth feedbacks, particularly the extratropical high- and low-cloud optical depth feedbacks. This highlights a possible bias in the models' representation of high and low cloud optical depth and its response to warming.

Sources of Model-Observation Discrepancies
To evaluate the skill of individual models in reproducing observed cloud feedbacks, we compute the mean squared error (MSE) over all cloud feedback components for each model (Figure 4). The use of MSE allows us to quantify the relative contribution from each cloud type to the total MSE, since the sum of the squared errors from individual cloud feedback components (divided by the number of components) equals the total MSE for a model. As shown in Figure 4, the model-observation discrepancies in most models come from three main components: the tropical marine low-cloud (light blue), high-cloud altitude (dark blue), and extratropical high-cloud optical depth (dark red) feedbacks, with minor contributions from the extratropical low-cloud optical depth (dark green and light green) and tropical anvil cloud (dark orange) feedbacks. We highlight the components that are significantly biased high and low (90% confidence) with stippling and hatching, respectively. The three main components mentioned above contribute over 50% of the MSE for all models except MIROC-ES2L, in which a wider diversity of components contributes.
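The MSE decomposition described above amounts to a few lines of arithmetic. The component names and all numbers below are hypothetical placeholders, not values read off Figure 4; only the decomposition itself reflects the text.

```python
import numpy as np

# Hypothetical feedback values (W/m^2/K) for one model vs. observations,
# one entry per cloud feedback component (names are illustrative only).
components = ["high_altitude", "trop_marine_low", "extratrop_hi_tau",
              "extratrop_lo_tau", "anvil", "land_amount"]
model = np.array([0.15, 0.05, -0.02, 0.01, -0.10, 0.12])
obs = np.array([0.30, -0.15, 0.20, 0.09, -0.04, 0.07])

sq_err = (model - obs) ** 2        # squared error per component
mse = sq_err.mean()                # total MSE over components

# The additive decomposition used in the text: each component's squared
# error, divided by the number of components, sums exactly to the MSE.
contributions = sq_err / len(components)
assert np.isclose(contributions.sum(), mse)

# Fractional contribution of each component to the model's total error,
# i.e., the quantity plotted per color in a Figure 4-style bar chart.
frac = contributions / mse
```

This additivity is what makes MSE (rather than, say, RMSE or mean absolute error) convenient for attributing a model's total error to individual cloud types.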
Our results suggest that caution is needed when comparing a single climate model simulation with a single observational record in either the 12-year or 20-year analysis (i.e., the current longest satellite observational record), since both modeled and observed results could be affected by internal variability. Despite quantitative differences, individual realizations of a particular climate model tend to resemble each other in terms of their error properties and are typically distinct from other models (Figure 4). For instance, CESM2's bias overwhelmingly comes from an overestimated tropical marine low-cloud feedback, with little role for a high-cloud altitude bias. In contrast, the high-cloud altitude feedback is significantly low-biased in all members of IPSL-CM6A-LR, while their tropical marine low-cloud feedback is close to observations. Additionally, in contrast to most models, a diverse mix of components other than the tropical marine low-cloud or high-cloud altitude feedbacks is biased in MIROC-ES2L. One can also discern systematic improvements across generations of models and between model configurations. For example, E3SM-2-0 and E3SM-2-0-NARRM show marked improvement in all categories over their predecessor, E3SM-1-0. And the large biases in the tropical marine low-cloud feedback in CESM2 are much weaker and insignificant in CESM2-FV2.

Sensitivity of Model-Observation Comparison to Choice of Cloud Radiative Kernel
In this section, we test the sensitivity of our results to the choice of cloud radiative kernel. Figure S6 in Supporting Information S1 compares the observed cloud feedback using the cloud radiative kernels of Zhou et al. (2013, 2022), Zhang et al. (2021), and CERES FBCT (Sun et al., 2022). The Zhou et al. (2013, 2022) kernels are derived using reanalysis thermodynamic fields supplied to the Fu-Liou and RRTM radiative transfer models, respectively. The Zhang et al. (2021) kernel is calculated from the GISS model radiation code that is used in constructing the International Satellite Cloud Climatology Project (ISCCP) H data sets. The CERES FBCT kernel is derived empirically using clear-sky and overcast fluxes from CERES observations coincident with each MODIS cloud type. Cloud feedbacks inferred from the first three products are generally consistent, while the CERES FBCT-estimated cloud feedbacks have smaller values. This is most apparent for the LW high cloud altitude and amount components, consistent with the fact that the FBCT LW kernel shows a weaker increase in strength with altitude than the radiative transfer-based kernels (Figure S7 in Supporting Information S1). The SW components, in contrast, are in better agreement among the kernels (not shown). Further investigation is needed to determine whether this discrepancy implies that radiative transfer model-based kernels give a biased estimate of the radiative impact of certain cloud types (perhaps due to assumptions made in constructing them) or that the empirical nature of the CERES FBCT kernel introduces errors (e.g., sampling at most two times per day during the sunlit MODIS overpasses, with many cloud types rarely observed from space). Repeating all analyses using the FBCT kernel yields quantitative differences in feedback magnitudes and MSE values, including a smaller positive total cloud feedback that is closer to the cloud feedback derived from the cloud radiative effect (i.e., the difference between all-sky and clear-sky radiative fluxes) adjusted for non-cloud effects (Raghuraman et al., 2023). Yet the choice of a different kernel does not change any of our conclusions: the models still overestimate the tropical marine low-cloud feedback and underestimate the high-cloud altitude and extratropical high- and low-cloud optical depth feedbacks (Figures S8-S11 in Supporting Information S1).

Sensitivity of Model-Observation Comparison to Choice of Cloud Fraction Product
In addition to being sensitive to the choice of cloud radiative kernel, cloud feedback values estimated from observations depend on which cloud fraction product is used (Figures 1-3; Table S4 in Supporting Information S1).
Compared to the moderately positive cloud feedback derived using CERES-FBCT, the MODIS-COSP-derived feedback is near zero. This is because all but two feedback components are less positive or more negative in this data set (Table S4 in Supporting Information S1). The two exceptions are the positive high-cloud altitude and middle-latitude marine low-cloud amount feedbacks, which are stronger in MODIS-COSP. None of the differences in the individual cloud feedback components or the total cloud feedback between the two observational products are statistically significant at 90% confidence (comparing cloud feedbacks from the two data sets using the Monte Carlo method described in Section 2.4; not shown). We find that the main conclusions of the study remain robust, independent of whether the models are evaluated against CERES-FBCT or MODIS-COSP observations (Table S4 in Supporting Information S1). Since both CERES-FBCT and MODIS-COSP report cloud properties measured by the MODIS instrument, the differences in feedbacks must be related in some way to differences in their cloud retrieval algorithms. Understanding the nature and causes of these differences is important for future work but is beyond the scope of this study.

Observed Cloud Feedback Inferred From Different Periods
This study also analyzes the full period of available observations, from July 2002 to December 2022. The best estimates of the total cloud feedback inferred from the 12-yr and 20-yr observations are both positive (0.33 and 0.57 W/m²/K), implying that the radiative response of clouds amplifies interannual temperature fluctuations. The regression uncertainty decreases when analyzing the longer period, and the likelihood of a negative cloud feedback is reduced from 16% to less than 0.1%. This analysis indicates that several observed cloud feedback components vary in strength and even sign across time periods. With more recent years included, the tropical marine low-cloud feedback switches from a strongly negative to a weakly negative value (Figure 2b), the positive extratropical high-cloud optical depth feedback becomes smaller (Figure 3a), and the tropical marine ascent low-cloud feedback switches from moderately negative to weakly positive (Figure 3e). Changes in these feedback components between the two periods are significant at 90% confidence (following the methodology described in Section 2.4 but comparing the 12-yr and 20-yr observations). While it is difficult to fully rule out random meteorological noise, we can ask whether feedback differences between pairs of realizations within a single model, subject to the same SST boundary conditions within the 2002-2014 period, are ever statistically significant at 90% confidence. If they are, then one cannot rule out noise as the reason for statistically significant differences in feedbacks. This is the case for the tropical marine ascent low-cloud feedback. In contrast, not a single pair of realizations of any model exhibits significant differences at 90% confidence in the tropical marine low-cloud and extratropical high-cloud optical depth feedbacks (not shown). From this we conclude that the observed differences in these two components exceed what can reasonably be expected to arise from noise, and instead reflect a true dependence on time period.
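The comparison of regression-slope feedbacks across periods can be sketched as a simple bootstrap. This is an illustrative stand-in for the Monte Carlo method of Section 2.4, whose exact construction we do not reproduce; the data, sample sizes, and noise levels below are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def feedback(dR, dT):
    """Feedback as the OLS regression slope of radiative anomalies on
    global-mean temperature anomalies (W/m^2/K)."""
    x = dT - dT.mean()
    return (x * (dR - dR.mean())).sum() / (x ** 2).sum()

def slope_diff_significant(dR1, dT1, dR2, dT2, nboot=2000, conf=0.90):
    """Resample each record with replacement, build the distribution of
    slope differences, and check whether the confidence interval
    excludes zero (a sketch, not the paper's exact test)."""
    n1, n2 = len(dT1), len(dT2)
    diffs = np.empty(nboot)
    for k in range(nboot):
        i = rng.integers(0, n1, n1)
        j = rng.integers(0, n2, n2)
        diffs[k] = feedback(dR1[i], dT1[i]) - feedback(dR2[j], dT2[j])
    lo, hi = np.quantile(diffs, [(1 - conf) / 2, (1 + conf) / 2])
    return bool(lo > 0 or hi < 0)

# Toy example: two records with clearly different underlying feedbacks.
dT = np.linspace(-0.3, 0.3, 20)
dR_a = 0.6 * dT + rng.normal(0, 0.05, 20)     # ~ +0.6 W/m^2/K
dR_b = -0.4 * dT + rng.normal(0, 0.05, 20)    # ~ -0.4 W/m^2/K
different = slope_diff_significant(dR_a, dT, dR_b, dT)
```

The same machinery applies whether the two records are two observational periods or two realizations of one model, which is how the noise-floor argument above is framed.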
The spatial pattern of temperature response per degree of interannual global warming differs between the July 2002 to December 2014 and July 2002 to December 2022 periods (Figure S12 in Supporting Information S1), suggesting that the dependence of these cloud feedbacks on the choice of period is related to the pattern effect (Stevens et al., 2016), which describes how the radiative response depends not only on global average surface warming but also on the spatial pattern of warming. Specifically, studies have shown that the pattern effect is mainly due to the low-cloud feedback, which is strongly modulated by variations in inversion strength that are themselves tied to large-scale gradients in surface temperature (e.g., Andrews & Webb, 2018; Ceppi & Gregory, 2017; Zhou et al., 2016). Recent studies have demonstrated the impacts of the pattern effect on total climate feedback and cloud feedback during the satellite era (e.g., Chao et al., 2022; Loeb et al., 2020). In this study, we confirm the substantial impact of the pattern effect on the low-cloud feedback, but also highlight other cloud feedbacks that exhibit substantial dependence on the time period considered. Future work should clarify the mechanism underlying the temporal variations in the strength of the extratropical high-cloud optical depth feedback, and how it may be related to variations in the surface warming pattern.

Relations Between Short-Term and Long-Term Cloud Feedbacks and Their Errors
How do cloud feedbacks estimated on the interannual timescale relate to those in response to long-term warming? Plotting the six assessed cloud feedback values computed on interannual timescales against those in response to 4xCO2 (taken from Z22) for the 11 models that have both sets of simulations indicates that there is no significant relation (at 95% confidence) between the two timescales for any individual assessed component or the total cloud feedback, except for a significant negative correlation for the sum of assessed feedbacks (Figures 5a-5h). This result suggests that cloud feedback estimates derived from interannual global temperature fluctuations may be of limited value for long-term climate projections or for constraining ECS. A more fruitful approach to leveraging observations may be to perform cloud controlling factor analyses, in which the sensitivity of cloud properties to environmental controlling factors is determined from observed fluctuations and then scaled by model-projected changes in these factors (e.g., Ceppi & Nowack, 2021; Cesana & Del Genio, 2021; Gordon & Klein, 2014; Klein et al., 2017; Myers et al., 2021; Myers & Norris, 2016; Qu et al., 2015; Terai et al., 2016).
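The two-step logic of a cloud controlling factor analysis can be sketched as follows. The choice of SST and estimated inversion strength (EIS) as the controlling factors, and every numerical value below, are illustrative assumptions on synthetic data, not results from any of the cited studies.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy observed interannual anomalies: a low-cloud radiative anomaly dR
# driven by two controlling factors, SST and EIS (units are schematic).
n = 120
sst = rng.normal(0, 0.3, n)                       # K
eis = rng.normal(0, 0.4, n)                       # K
dR = 1.0 * sst - 0.6 * eis + rng.normal(0, 0.2, n)  # W/m^2

# Step 1: estimate the sensitivities dR/dx from observed fluctuations
# via multiple linear regression on the centered predictors.
X = np.column_stack([sst, eis])
sens, *_ = np.linalg.lstsq(X - X.mean(axis=0), dR - dR.mean(), rcond=None)

# Step 2: scale the observed sensitivities by model-projected long-term
# changes in the controlling factors per degree of global warming
# (hypothetical numbers standing in for GCM output).
dsst_dT, deis_dT = 1.0, 0.2                       # K per K of warming
predicted_feedback = sens[0] * dsst_dT + sens[1] * deis_dT  # W/m^2/K
```

The appeal of this approach, as the text notes, is that the sensitivities come from observations while only the controlling-factor trajectories come from models, sidestepping the weak short-term/long-term feedback correlation found here.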
Do models that perform better in simulating the short-term cloud feedback also perform better in simulating the cloud feedback in response to long-term warming? We compute the RMSE (model minus observations) across the six assessed short-term cloud feedbacks and scatter it against its 4xCO2 counterpart (model minus expert judgment) taken from Z22. There is no significant correlation between the skill in simulating cloud feedback on short-term and long-term timescales (Figure 5i). Models with better skill in matching long-term cloud feedback components show diverse performance in simulating the short-term cloud feedback, and vice versa. The models that fall in the lower-left quadrant have smaller-than-average errors at both interannual and long-term timescales (e.g., GFDL-CM4), and the models in the upper-right quadrant have larger-than-average errors at both timescales, such as E3SM-1-0 and IPSL-CM6A-LR. Two of the models that perform best with respect to S20 on the long timescale are among the worst performers here on the short timescale (CanESM5 and MRI-ESM2-0).
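The skill metric behind Figure 5i reduces to an RMSE over the six assessed components per model and per timescale. All feedback values below are hypothetical placeholders; only the formula reflects the text.

```python
import numpy as np

# Hypothetical feedback values (W/m^2/K) for the six assessed components
# for one model; none of these numbers are from the study.
obs_short   = np.array([0.30, -0.15, 0.10, 0.20, -0.04, 0.07])  # observed
expert_long = np.array([0.20, 0.19, 0.12, 0.10, -0.20, 0.08])   # S20 values
model_short = np.array([0.10, 0.05, 0.12, 0.05, -0.12, 0.15])
model_long  = np.array([0.22, 0.30, 0.10, 0.08, -0.35, 0.10])

def rmse(a, b):
    """Root mean squared error over the assessed components."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

short_err = rmse(model_short, obs_short)      # x-axis of a Figure 5i point
long_err = rmse(model_long, expert_long)      # y-axis of a Figure 5i point
```

Repeating this for each model yields one point per model, and the (in)significance of the correlation across those points is what the quadrant discussion above interprets.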
Note that the scatter plot should be interpreted with caution, as several uncertainties can influence the results. First, we assess the skill of simulating cloud feedback based only on the six assessed cloud types. Including unassessed cloud feedbacks does not affect the short-term feedback RMSE much (not shown), but it is unknown whether the long-term RMSE could be altered by including more components. Given that several currently unassessed cloud feedbacks are not negligible in most models' abrupt-4xCO2 simulations (Z22), it is likely that the long-term RMSE would change if one or more of these components were quantified in a future assessment. Second, the calculation of the long-term feedback RMSE relies on the expert assessment in S20; should another assessment exercise be conducted in the future, the expert-assessed values of the cloud feedback components may change. Lastly, the interannual cloud feedback is estimated during a relatively short period (i.e., 12.5 years).
Extending the AMIP experiments beyond 2014 to more closely match the longer record of CERES observations (e.g., Raghuraman et al., 2021, 2023; Schmidt et al., 2023) will allow a more robust comparison between modeled and observed short-term cloud feedbacks.

Discussion and Conclusions
This study quantifies individual short-term cloud feedback components using satellite observations and CMIP6 AMIP simulations. Building upon Zelinka et al. (2022)'s evaluation of modeled long-term cloud feedback, this work provides another test of the CMIP models by comparing them to observed cloud feedbacks in response to interannual temperature fluctuations. We present a novel analysis of cloud feedback components observed in recent decades and assess how well the models match individual cloud feedback components.
In observations, we find a positive total cloud feedback that is mainly due to positive high-cloud altitude, land cloud amount, and extratropical high- and low-cloud optical depth components, partially offset by a negative tropical marine low-cloud amount component. The extratropical high- and low-cloud optical depth feedbacks, which were not assessed in S20, are both positive and non-negligible on the interannual timescale. Our results also show evidence of the pattern effect on the tropical marine low-cloud and extratropical high-cloud optical depth feedbacks, as their values differ significantly between the 2002-2014 and 2002-2022 periods, which are characterized by different patterns of warming.
CMIP6 AMIP models simulate a total cloud feedback that generally agrees with the observations. However, the models systematically overestimate the tropical marine low-cloud and land cloud amount feedbacks and underestimate the high-cloud altitude and extratropical high- and low-cloud optical depth feedbacks. The model-observation discrepancies mainly come from the tropical marine low-cloud, high-cloud altitude, and extratropical high-cloud optical depth feedbacks.
The biases in AMIP models present a number of important issues that warrant future investigation. Many previous studies focused on the uncertainty in the tropical marine low-cloud feedback over subsidence regions. This study finds that the substantial uncertainty in this feedback persists for the short-term cloud feedback in AMIP simulations, primarily due to an inability to accurately simulate the response of the marine low-cloud fraction to temperature changes. Additionally, our results highlight the need to investigate the high-cloud feedback and the mechanisms driving high clouds, in agreement with the key findings of Z22. Despite a well-developed theoretical background, AMIP models tend to underestimate the increase of high-cloud altitude as global mean temperature increases. Moreover, the unassessed cloud feedback is biased low in most AMIP models, mainly due to a systematically weaker extratropical high-cloud optical depth feedback. More work is needed to better understand the processes governing high-cloud optical depth in nature (including those that determine cloud phase, total water content, particle sizes, and other cloud microphysical properties) and its changes in response to interannual temperature variations. This echoes recent studies arguing for the importance of examining high-cloud optical depth feedbacks (Gasparini et al., 2023; Li et al., 2019; McKim et al., 2024; Zelinka et al., 2022). The consistent underestimation of high cloud feedbacks motivates further work to analyze how high clouds respond to their cloud controlling factors.
This study also demonstrates that the magnitude of short-term cloud feedback components depends on several factors that must be carefully accounted for when evaluating models against observations. First, statistical noise arising from meteorological fluctuations that are not systematically dependent on global mean temperature is substantial, limiting the robustness of any quantitative estimate derived from a single short observational data set and the rigor of any model evaluation that relies on a small number of realizations. Second, the magnitudes of short-term cloud feedbacks vary depending on the choice of cloud radiative kernel and observed cloud fraction data set. While the main conclusions of this paper are largely insensitive to these choices, caution is needed when quantifying the short-term cloud feedback from a single radiative kernel or observational cloud data set. Finally, cloud feedbacks estimated from coincident interannual cloud and temperature fluctuations are not correlated across models with their long-term counterparts in response to greenhouse warming. It is therefore inappropriate to view feedbacks quantified on observable timescales as interchangeable with the long-term feedbacks that largely determine equilibrium climate sensitivity. Targeted analyses using cloud controlling factors may be necessary to more rigorously establish the link between observable cloud sensitivities on short timescales and the uncertain cloud feedback in response to long-term warming.
One of the ultimate goals in climate science is narrowing the range of equilibrium climate sensitivity in GCMs. While we do not find that model skill in simulating short-term feedbacks translates into skill in simulating long-term feedbacks, the evaluation of climate models against observations on interannual timescales provides an apples-to-apples comparison that avoids evaluation against partly subjective expert judgment and helps reveal compensating errors in modeled cloud feedback components, a crucial step toward further improving the representation of clouds in climate models.
Table 1 (fragment): …High-Cloud | High-cloud amount and optical depth | Ocean descent | 30°S−30°N
Tropical Marine Ascent Low-Cloud | Low-cloud amount and optical depth | Ocean ascent | 30°S…

Figure 1 .
Figure 1. Short-term (a) total, (b) sum of assessed, and (c) sum of unassessed cloud feedbacks inferred from satellite observations from 2002 to 2014 and 2002 to 2022, and from CMIP6 AMIP simulations between 2002 and 2014. For the observations (CERES and MODIS), the best-estimate value and 66% (90%) confidence intervals are indicated by the black circles and thick (thin) error bars. Gray shading represents the 90% confidence intervals for observations from 2002 to 2014. The observations are the average of CERES-FBCT (blue) and MODIS-COSP (light blue) as described in the text. Individual model ensemble member values are shown as markers. Components that are significantly biased relative to the observations at 90% (66%) confidence are denoted with red (pink) markers.

Figure 2 .
Figure 2. As in Figure 1, but for the assessed cloud feedback components. Note that the range of the x axis is smaller than in Figure 1.

Figure 3 .
Figure 3. As in Figure 1, but for the unassessed cloud feedback components. Note that the range of the x axis is smaller than in Figure 1.

Figure 4 .
Figure 4. The contribution from each cloud feedback component to the mean squared error (MSE). Components that are significantly biased high and low at 90% confidence are identified with stippling and hatching, respectively.

Figure 5 .
Figure 5. Scatterplots of (a-f) individual assessed, (g) sum of assessed, and (h) total cloud feedbacks (in W/m²/K) inferred from the short and long timescales for each model common to this analysis and that of Z22. (i) Root mean squared error (RMSE; in W/m²/K) of cloud feedbacks on the long timescale scattered against those on the short timescale. See text for the description of the RMSE on the two timescales. For each model, the realization used in both Z22 and this analysis is shown as a filled marker, while other realizations are shown as open markers. Correlation coefficients calculated from the filled markers are shown, with correlations significant at 95% confidence (p-value < 0.025) marked by asterisks.

Table 1
The Definition of Individual Cloud Feedback Components Used in This Study

Table 2
The Individual Cloud Feedback Components and 90% Confidence Intervals (W/m²/K) Inferred From Satellite Observations (Combined CERES-FBCT and MODIS-COSP) From 2002 to 2022 and 2002 to 2014, and the Number of Models (Out of 17) Having at Least One Member That Is Significantly Biased Low or High at 90% (66%) Confidence With Respect to the Observations (2002 to 2014)