Comment on “Advanced Testing of Low, Medium, and High ECS CMIP6 GCM Simulations Versus ERA5‐T2m” by N. Scafetta (2022)

Scafetta (2022, https://doi.org/10.1029/2022gl097716) purports to test Coupled Model Intercomparison Project Phase 6 (CMIP6) climate models through a comparison of temperature changes over three decades. Unfortunately, the paper contains numerous conceptual and statistical errors that undermine all of its conclusions. First, no uncertainty is given for the observational temperature difference, making it impossible to assess compatibility with any model result. Second, the CMIP6 data used are the ensemble means for each model, but the metric being tested is sensitive to internal variability, and so the full ensemble for each model must be used. When this is corrected, the conclusion that “all models with ECS > 3.0°C overestimate the observed global surface warming” is not sustained. Third, the statistical test in Section 2 would reject all models even in a perfect-model setup given sufficient ensemble members; thus the second conclusion, “that spatial t‐statistics rejects the data‐model agreement,” is also not sustainable.


Global Mean Comparisons With Observations
We downloaded the global mean surface air temperature (SAT) from the European Centre for Medium-Range Weather Forecasts Reanalysis version 5 (ERA5) (Hersbach et al., 2020; Simmons et al., 2021) directly from the Copernicus data store. We calculate the temperature difference between the two full 11-year periods 1980-1990 and 2011-2021. We note that this is not substantively different from the period used in Scafetta (2022) (January 2011-June 2021). We compare the same period in the models, again noting that this is not substantively different from the average of the 2011-2020 and 2011-2021 periods used in Scafetta (2022). These differences simplify the calculations without affecting the issues. Additionally, we compare the models to the GISTEMP observations (Lenssen et al., 2019) for the same periods.
Uncertainty in the expected temperature difference arises because of the random nature of internal variability (such as the timing of El Niño events), and the standard error can be estimated using the residuals of the annual data points, that is,

σ_E = √( Σ_i (T_i − T̄)² / (N(N − 1)) ),

where T_i is the set of annual anomalies from 2011 to 2021 baselined to 1980-1990, T̄ is their mean, and N is the number of years. We estimate that the mean and 95% confidence interval (±1.96 × σ_E) for the difference is then 0.58 ± 0.10°C for ERA5, and a very similar 0.57 ± 0.10°C for GISTEMP. The three-decade period used in Scafetta's analysis is simply too short for internal variability to be ignored.
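As a sketch, the calculation above can be reproduced with a few lines of code; the annual anomaly values here are synthetic placeholders, not the actual ERA5 or GISTEMP series:

```python
import numpy as np

# Hypothetical annual global-mean anomalies (°C) for 2011-2021, baselined to
# the 1980-1990 mean. Real values would come from ERA5 or GISTEMP.
rng = np.random.default_rng(0)
t_recent = 0.58 + rng.normal(0.0, 0.12, size=11)  # synthetic stand-in data

n = len(t_recent)
delta = t_recent.mean()                      # mean temperature difference
sigma_e = t_recent.std(ddof=1) / np.sqrt(n)  # standard error from annual residuals
ci95 = 1.96 * sigma_e                        # 95% confidence half-width

print(f"{delta:.2f} ± {ci95:.2f} °C")
```

With realistic interannual variability of order 0.1°C and only N = 11 years, the 95% half-width is of order 0.1°C, which is why the observational uncertainty cannot be neglected in a three-decade comparison.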
We summarize the results of the comparison in Figure 1, which can be contrasted with the right-hand panels in Scafetta's Figure 1. First, even with just the ensemble means from each model, there are three models with ECS well above 3°C that cannot be statistically distinguished from the observations. More importantly, looking at the full ensemble, we find that 49 ensemble members from 18 models are compatible with the ERA5 result. Of those 18 models, 9 have ECS above 3°C. This is in direct contradiction to the claims made in Scafetta (2022).
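The logic of this compatibility count can be illustrated with a minimal sketch; the model names, ECS values, and member warmings below are entirely hypothetical, and a run is counted as compatible when its warming falls inside the observational 95% interval:

```python
import numpy as np

obs_delta, obs_ci = 0.58, 0.10  # ERA5 difference and 95% half-width (from the text)

# Hypothetical (model, ECS, warming of one ensemble member) tuples.
members = [
    ("ModelA", 2.5, 0.55), ("ModelA", 2.5, 0.63),
    ("ModelB", 3.9, 0.66), ("ModelB", 3.9, 0.80),
    ("ModelC", 4.8, 0.91),
]

compatible = [(name, ecs) for name, ecs, dt in members
              if abs(dt - obs_delta) <= obs_ci]
models_passing = {name for name, _ in compatible}
high_ecs = {name for name, ecs in compatible if ecs > 3.0}
print(len(compatible), sorted(models_passing), sorted(high_ecs))
```

The key point is that individual members of a high-ECS model can fall inside the observational interval even when other members of the same model do not, which is why testing only the ensemble mean discards essential information.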

Spatial Comparisons and Statistical Test
Spatial patterns of change in models and observations are more affected by internal variability than the global mean. Thus even more care must be taken to compare like with like. It is a common error to compare the multi-model mean and its standard error with the observations. This test is essentially meaningless because we know a priori that they will not be equal (see Santer et al., 2008, for a discussion). The test used by Scafetta (2022) also fails this check. In their Equation 2, the denominator scales as 1/√N (the standard error of the ensemble mean), which means that as the number of ensemble members N increases, so will the rejection rate. Thus the test is simply ill-formed.
A more appropriate test would be something that includes both observational uncertainty and ensemble spread, such as

d = (<T_m> − T_o) / √( s{T_m}² + s{T_o}² ),

where <T_m> is the model ensemble-mean temperature difference, s{T_m} is the standard deviation of the model temperature differences T_m, and s{T_o} is the standard error of the observed temperature difference T_o, following Santer et al. (2008) (their Equation 12, with a single model). This tests whether the observations are plausibly a sample from the distribution of model runs, rather than for an exact equality between the ensemble mean and the observations that physical considerations tell us will not hold. In other words, it is a test of whether the observations are statistically exchangeable with the model runs. For the models with more than 2 ensemble members, 16 out of 24 models pass this test at the 5% significance level, with sensitivities ranging from 2.3 to 5.1°C. It may also be important to account for the spatial correlation and for the rate at which spurious results would be generated by chance with so many tests being performed at the gridbox level.
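A minimal sketch of this style of test for a single model, using synthetic ensemble warmings and a placeholder observed value (the numbers are illustrative, not taken from any CMIP6 model):

```python
import numpy as np

# Exchangeability-style test: is the observation plausibly a draw from the
# distribution of ensemble members? Denominator combines ensemble spread
# with observational uncertainty, rather than using the (shrinking) standard
# error of the ensemble mean.
rng = np.random.default_rng(1)
ensemble = 0.70 + rng.normal(0.0, 0.08, size=10)  # synthetic member warmings (°C)
obs, obs_se = 0.58, 0.05                          # placeholder observation and its SE

s_model = ensemble.std(ddof=1)  # ensemble spread s{T_m}
d = (ensemble.mean() - obs) / np.sqrt(s_model**2 + obs_se**2)

# |d| below roughly 2 indicates the observation is statistically
# exchangeable with the model runs at about the 5% level.
print(round(d, 2))
```

Note the contrast with the ill-formed test: because the denominator here does not shrink as members are added, adding ensemble members refines the estimate of the spread rather than guaranteeing rejection.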
Given that an incorrect and misleading test (as has been long discussed in the literature) is being applied, we are confident that the conclusions drawn from the spatial tests in Scafetta (2022) are spurious or, at best, grossly exaggerated.

Additional Issues
There are a number of additional issues that, while minor relative to the two raised above, should nonetheless be acknowledged. First, it is important to note the forcing uncertainty over the historical period. For instance, the CESM2 model has been shown to have a noticeable sensitivity to changes in the source and frequency of biomass burning emission fields (Fasullo et al., 2022). A spurious global warming of up to 0.2°C was identified as a result of decadal mean biomass burning inputs being replaced by annually varying inputs, which led to a rectified effect on global temperature through a non-linear response to black carbon aerosols. Other forcings, such as ozone or solar activity, are also imperfectly known, and this makes simple comparisons between the hindcasts and observations more complicated. Differences may arise between them not because of anything intrinsic to the model processes, but rather because of the uncertainty in the drivers. Second, the number of ensemble members for many of the models is insufficient to estimate their forced response and the magnitude of internal variability, which limits the extent to which comparisons with those models will be informative.
In critiquing the tests in this particular paper, we are not suggesting that hindcast comparisons should not be performed, nor are we claiming that all models in the CMIP6 archive perform equally well. Indeed, there are multiple papers that demonstrate that CMIP6 models with high ECS values (above around 4.5°C) do not perform well in historical hindcasts (Ribes et al., 2021; Tokarska et al., 2020) or paleoclimate tests (Zhu et al., 2021). However, the claims in Scafetta (2022) are simply not supported by an appropriate analysis and should be withdrawn or amended.

Figure 1. The difference between 1980-1990 and 2011-2021 in the global mean surface air temperature for the Coupled Model Intercomparison Project Phase 6 ensemble, plotted against each model's Equilibrium Climate Sensitivity. Green triangles represent the ensemble mean for each model or variant, while black dots represent (up to 25) other ensemble members. The pink shading represents the 95% uncertainty in the ERA5 estimate.
Consider an ideal climate model, with perfect representation of the relevant physics and unlimited spatial and temporal resolution. An individual run of this model will not exactly match the observations because the initial conditions of the model run will not exactly match those of the real Earth. The model run will have the same forced response, but a different realization of the internal variability. Initial condition ensembles are therefore used to capture the statistical distribution of the effects of internal variability, with a better estimate of that distribution arising as more ensemble members are added. The standard error of the ensemble mean continuously decreases as more ensemble members are used, which means that the statistical test used in Scafetta (2022) is essentially guaranteed to reject a perfect model ensemble, and is therefore inappropriate.
A statistically significant difference between an individual model run held out from the ensemble and the mean of the remaining members would obviously not indicate a "model failure"; nor would it be an indication of an "inconsistency": a model run cannot be inconsistent with the model from which it was generated. This provides a practical sanity check for any proposed statistical test; if a model ensemble is used to estimate the forced response, a test at the 5% significance level should not reject individual held-out ensemble members more than 5% of the time on average. This was the key problem with the test used in Douglass et al. (2008) that Santer et al. (2008) addressed.
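This perfect-model sanity check can be demonstrated with a short simulation sketch (all numbers synthetic): a held-out member is tested against the ensemble mean using a statistic whose denominator is the standard error of the ensemble mean, analogous to the ill-formed test criticized above, and the rejection rate is seen to grow with ensemble size even though the "model" is perfect by construction:

```python
import numpy as np

rng = np.random.default_rng(2)

def rejection_rate(n_members, n_trials=2000, sigma=0.1):
    """Fraction of held-out members rejected at |stat| > 1.96 when the
    denominator is the standard error of the ensemble mean."""
    rejections = 0
    for _ in range(n_trials):
        runs = rng.normal(0.0, sigma, size=n_members)  # perfect-model draws
        held_out, ensemble = runs[0], runs[1:]
        se_mean = ensemble.std(ddof=1) / np.sqrt(len(ensemble))
        if abs(held_out - ensemble.mean()) / se_mean > 1.96:
            rejections += 1
    return rejections / n_trials

small, large = rejection_rate(5), rejection_rate(50)
print(small, large)  # the rejection rate grows well beyond 5% as N increases
```

A well-formed test at the 5% level would hold the rejection rate near 0.05 regardless of ensemble size; here the rate instead rises toward certainty as members are added, which is precisely the pathology of dividing by the standard error of the ensemble mean.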