Reply to “Comment on ‘Advanced Testing of Low, Medium, and High ECS CMIP6 GCM Simulations Versus ERA5‐T2m’ by N. Scafetta (2022)” by Schmidt et al. (2023)

Schmidt, Jones, and Kennedy's (SJK) (2023, https://doi.org/10.1029/2022GL102530) critique of Scafetta (2022, https://doi.org/10.1029/2022GL097716) is flawed. Their estimate of the error of the ERA5-T2m 2011–2021 mean (≈0.10°C) is overestimated by a factor of 5–10 and contradicts the published literature. SJK confused natural variability with random noise and mistook the error of the mean of a temperature chronology for the stochastic error of the regression parameter M of a nonphysical isothermal climate model (T(t) = M). SJK's allegations regarding the internal variability of the models, the role of the global climate model ensemble members, and other issues were partially addressed in Scafetta (2022, https://doi.org/10.1029/2022GL097716) and, later, more extensively in Scafetta (2023a, https://doi.org/10.1007/s00382-022-06493-w), where Scafetta's (2022) conclusions were confirmed.

SJK2023 did not acknowledge that S2022's goal was to test three macro-GCMs, nor that S2022 examined three average simulations for each model (when available), which partially accounted for the uncertainty related to the GCMs' internal variability, nor that Scafetta (2023a) extensively analyzed all CMIP6 GCM simulations available on KNMI Climate Explorer, which included 143 GCM ensemble average records and 688 GCM member simulations. Finally, Scafetta (2023b) examined the 175 CMIP6 GCM simulations that Schmidt proposed as indicative of the CMIP6 GCMs (Hausfather et al., 2022, Supplementary file). Scafetta (2023a, 2023b) confirmed the warm bias of the Medium- and High-ECS macro-GCMs.

The Error of the Mean
According to SJK2023, S2022 overlooked the ERA5-T2m error of the mean from 2011 to 2021. SJK claimed that, in order to calculate the error of the mean (95% confidence), one must calculate the standard deviation of the annual temperature anomalies around their 11-year mean and multiply it by $2/\sqrt{N}$:

$$\sigma_{\mu,95\%} = \frac{2}{\sqrt{N}}\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(T_i-\mu\right)^2} \qquad (1)$$

where N = 11 and

$$\mu = \frac{1}{N}\sum_{i=1}^{N} T_i \qquad (2)$$

is their mean over the 11-year period: see Figure 1a.
However, as Scafetta (2023a, Appendix) already explained, Equation 1 is wrong because the decadal-mean global surface temperature uncertainty over 2011–2021 is typically estimated to be about 0.02°C: see Figure 1c. In fact, Equation 2 is a function of N separate quantities rather than the mean of a distribution of N repeated measurements of the same quantity. Thus, as explained below, its error cannot be calculated with the standard deviation of the mean (SDOM) of a (here-nonexistent) distribution of one quantity (Taylor, 1997, chapter 4), but must be computed with the Gaussian error propagation formula (GEPF) of a function of several quantities (Taylor, 1997, chapters 3 and 9).
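The two estimates can be contrasted numerically. The following sketch uses purely illustrative annual values with an assumed independent 0.02°C (1σ) measurement error per year — not the actual ERA5-T2m data — to show how the SDOM-style formula (Equation 1) inflates the error relative to the propagated measurement errors (Equation 8):

```python
import numpy as np

# Illustrative annual anomalies (degC) -- assumed values, not ERA5-T2m data --
# with an assumed independent 1-sigma measurement error of 0.02 degC per year.
T = np.array([0.48, 0.51, 0.55, 0.60, 0.72, 0.78, 0.70, 0.66, 0.73, 0.80, 0.65])
sigma = np.full(T.size, 0.02)
N = T.size

# SDOM-style estimate (Equation 1): treats interannual variability as noise.
sdom_95 = 2 * T.std(ddof=1) / np.sqrt(N)

# GEPF estimate for independent errors (Equation 8): only the measurement
# errors of the N separate quantities propagate into the mean.
gepf_95 = 2 * np.sqrt(np.sum(sigma**2)) / N

print(round(sdom_95, 3), round(gepf_95, 3))
```

With these assumptions, the SDOM-style figure is several times larger than the propagated measurement error, mirroring the 5–10× discrepancy discussed in the text.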

Distinction Between the "Mean" and the Regression Isothermal Model "T(t) = M"
The GEPF establishes that the error of a generic function $q(z_1,\dots,z_N)$ of N stochastic variables is

$$\sigma_q^2 = \sum_{i=1}^{N}\sum_{j=1}^{N}\frac{\partial q}{\partial z_i}\frac{\partial q}{\partial z_j}\,C_{ij} \qquad (3)$$

where $C_{ij}$ is the covariance between $z_i$ and $z_j$: see https://en.wikipedia.org/wiki/Propagation_of_uncertainty; JCGM Member Organizations (2008, Equation 13); Taylor (1997, Equation 9.9). Note that $C_{ii}=\sigma_i^2$ is the variance of each variable $z_i$, and the partial derivatives are calculated at the average values $\bar{z}_i$. Equation 3 derives from a first-order Taylor series expansion of the function $q(z_1,\dots,z_N)$ applied to stochastic variables. Applying Equation 3 to the mean, for which $\partial q/\partial z_i = 1/N$, yields

$$\sigma_\mu^2 = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} C_{ij} \qquad (4)$$

When the record is given as an ensemble of K alternative realizations $T_i^{(k)}$, the covariances can be estimated empirically, giving

$$\sigma_\mu^2 = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\frac{1}{K-1}\sum_{k=1}^{K}\left(T_i^{(k)}-\bar{T}_i\right)\left(T_j^{(k)}-\bar{T}_j\right) \qquad (5)$$
Therefore, if $T_i \pm \sigma_i$, for i = 1, 2, …, N, denote the global surface temperatures of 11 different years and their errors $\sigma_i$ are independent ($C_{ij}=0$ for $i\ne j$), their mean is given by Equation 2 and its error is

$$\sigma_\mu = \frac{1}{N}\sqrt{\sum_{i=1}^{N}\sigma_i^2} \qquad (6)$$

or, equivalently,

$$\sigma_\mu^2 = \frac{1}{N^2}\sum_{i=1}^{N}\sigma_i^2 \qquad (7)$$

Thus,

$$\sigma_\mu = \frac{\bar{\sigma}}{\sqrt{N}} \qquad (8)$$

where

$$\bar{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}\sigma_i^2 \qquad (9)$$

Equation 8 differs from Equation 1 because $\bar{\sigma}^2$ is not defined as the variance of the data set around its mean, but as the mean of the variances of the data-points (Equation 9).
Equation 5 should be used when the temperature record is represented by an ensemble of K alternative records. The statistics resulting from different temperature records are addressed in Scafetta (2023a). In general, if the record permits evaluation of the data-point error covariance, $\sigma_\mu$ must be calculated with Equation 5, which establishes that

$$\sigma_\mu \le \frac{1}{N}\sum_{i=1}^{N}\sigma_i \qquad (10)$$

(Taylor, 1997, Equations 3.17 and 3.48).
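A minimal sketch of the covariance-aware calculation, under an assumed illustrative uniform error correlation between years (the correlation value is an assumption for demonstration, not an estimate for any real record):

```python
import numpy as np

# Sketch of Equations 4-5: error of the mean with data-point error covariance.
# The covariance matrix C is an assumed illustration (uniform correlation rho).
N = 11
sigma_i = 0.02                        # assumed 1-sigma annual error (degC)
rho = 0.5                             # assumed year-to-year error correlation
C = np.full((N, N), rho * sigma_i**2)
np.fill_diagonal(C, sigma_i**2)

# For q = mean, every partial derivative is 1/N, so sigma_mu^2 = sum(C)/N^2.
sigma_mu = np.sqrt(C.sum()) / N

# Sanity check: the result lies between the independent case (Equation 8)
# and the fully correlated upper bound (Equation 10).
assert sigma_i / np.sqrt(N) <= sigma_mu <= sigma_i
print(round(sigma_mu, 4))
```

Positive error correlation pushes the error of the mean above the independent-error value of Equation 8, but never above the bound of Equation 10.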

Definition of the Error of the Isothermal Regression Parameter M: "σ M,95% "
In fact, Equation 1 indicates something conceptually very different from the error of the mean of N different temperature quantities. Equation 1 indicates the standard deviation $\sigma_M$ of the coefficient "M" of the regression model

$$T(t) = M \qquad (11)$$

which postulates a specific physical temporal evolution of the global surface temperature from 2011 to 2021.
The regression coefficient M is numerically equal to μ, but their standard errors differ from each other because of the specific physical meaning that Equation 11 has. In fact, Equation 11 represents an isothermal climatic system. A model of this kind postulates that, from 2011 to 2021, the "true" global surface temperature was constant and equal to M for each year and, consequently, that the temperature anomalies from M,

$$\Delta T_i = T_i - M \qquad (12)$$

were N independent random numbers belonging to a zero-mean Gaussian distribution with standard deviation equal to

$$\sigma_{\Delta T} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\Delta T_i^2} \qquad (13)$$

Under such a hypothesis, the error of M is

$$\sigma_M = \frac{\sigma_{\Delta T}}{\sqrt{N}} \qquad (14)$$

which is Equation 1. If each data-point is also affected by a measurement error $\sigma_i$, Equation 14 is rewritten as

$$\sigma_M = \sqrt{\frac{\bar{\sigma}^2 + \sigma_{\Delta T}^2}{N}} \qquad (15)$$

where $\bar{\sigma}^2$ is given by Equation 9. Equation 15 shows that the error of the regression parameter M is different from, and can also be significantly larger than, the error of the mean μ as deduced from the data uncertainties (Equation 8). Thus, postulating Equation 11 and that the anomalies $\Delta T_i$ are random noise artificially increases the error-variance of the temperature data-points from $\bar{\sigma}^2$ to $\bar{\sigma}^2 + \sigma_{\Delta T}^2$ and, therefore, also increases the uncertainty of their mean. Equation 13 can also be interpreted as a particular case of Equation 3 when N = 1, q(z) = z and ∂q/∂z = 1; that is, when there are only K repeated measurements of the "same" physical quantity z. Then, Equation 14 (or Equation 1) is the SDOM of the distribution of the K measurements of z (Taylor, 1997, Equations 4.9 and 4.14).
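The gap between the two error definitions can be illustrated with assumed numbers (both standard deviations below are arbitrary placeholders chosen only for demonstration):

```python
import numpy as np

# Assumed illustrative numbers comparing the two error definitions.
N = 11
sigma_bar = 0.02      # mean 1-sigma measurement error (degC), assumed
sigma_dT = 0.05       # std of the anomalies T_i - M around the isothermal model

sigma_mu = sigma_bar / np.sqrt(N)                           # Equation 8
sigma_M = np.sqrt(sigma_bar**2 + sigma_dT**2) / np.sqrt(N)  # Equation 15

# Folding the interannual anomalies into the error budget inflates the error.
print(round(sigma_mu, 4), round(sigma_M, 4))
```

Whenever the interannual variability dominates the measurement errors, σ_M is several times larger than σ_μ, which is the inflation the text attributes to SJK2023's procedure.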

Discussion
The error of the 2011–2021 temperature mean is an observable that solely depends on the given errors of the temperature data: if the data are error-free, so is their mean. This error is unrelated to the standard deviation of the constant coefficient of any hypothetical regression model of the data (e.g., Equation 11), whose error varies as the model changes.
SJK2023 assumed that the temperature anomalies $\Delta T_i$ (Equation 12) were some kind of random noise produced by a mysterious stochastic process that they labeled "random nature." The annual standard error of this unidentified stochastic process was claimed to be $\sigma_{year,95\%} = 0.1\sqrt{11} \approx 0.33$°C, which is one order of magnitude greater than the known annual stochastic error of these data (cf. Table 1). Such an assumption physically implies that from 2011 to 2021 the actual climate was isothermal, T(t) = M, which is nonphysical. The same assumption statistically implies that the 11 annual global surface temperatures from 2011 to 2021 are 11 repeated stochastic measurements of their 11-year mean, which is incorrect.
Equation 1 can only be used when dealing with repeated measurements of the "same" physical quantity (Taylor, 1997, chapter 4). This occurs, for example, when several thermometers simultaneously measure the temperature of the same location. In such instances, the measurement discrepancies among the thermometers can be physically interpreted as generated by stochastic processes. Yet, 11 numbers, each indicating the annual global surface temperature of a different year, are not 11 repeated measurements of their 11-year mean; different years reflect separate physical states of the climate, each with its own temperature, and a 1-year period is not an 11-year period. Thus, the 11 temperatures do not represent a stochastic distribution of one quantity but 11 different quantities. Therefore, the error of their mean cannot be calculated with the SDOM (Equation 1), but only with the GEPF (Equation 6, or, in general, Equation 5).
The interannual temperature variability is not "noise" or some kind of "error" because it is due to well-known physical mechanisms such as ENSO fluctuations, volcanic eruptions, variations in solar activity, anthropogenic and natural warming trends, etc. Equation 11 can only be replaced with a realistic physical model of the type

$$T(t) = M + \Delta T_{physical}(t) \qquad (16)$$

where M is the mean and the function $\Delta T_{physical}(t)$ captures the actual interannual physical component of the record, which should coincide with the experimental measurements within the errors of measure. $\Delta T_{physical}(t)$ can derive from external forcings and/or internal mechanisms. Scafetta (2023a, Appendix) showed that, when this exercise is done, the error of M approaches the error of the mean (Equation 8). In fact, all temperature records show very similar fluctuations, indicating that those changes are primarily physical signal and not noise. As a result, if no other kind of random error can be physically demonstrated, the natural interannual variability of ERA5-T2m cannot increase the statistical error of its 2011–2021 temperature mean.
Scafetta (2023a) outlined other serious contradictions in applying Equation 1. For example, by replacing the annual (N = 11) with the monthly (N = 132) temperature record, the value of $\sigma_{\mu,95\%}$ calculated by Equation 1 changes from about 0.108 to 0.034°C. However, for records made of genuine stochastic Gaussian fluctuations, $\sigma_{\mu,95\%}$ is independent of the time-scale of the data. Thus, statistics rules out that ERA5-T2m consists of random Gaussian fluctuations around a decadal mean, and SJK2023's calculation is arbitrary because Equation 1 depends strongly on the temporal resolution of the data. This inconsistency persists even if SJK2023's isothermal model (Equation 11) is replaced with a linear model (a referee's proposal): by simply interpolating the data with additional points, the error of its constant coefficient converges to zero as N increases (Taylor, 1997, Equation 8.16), because the data variance around the model prediction does not change much since the data-points are physically inter-correlated.
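The time-scale argument can be checked with synthetic data: for white Gaussian noise, Equation 1 gives (statistically) the same error at monthly and annual resolution, whereas for a smooth, physically inter-correlated signal it shrinks by roughly √12 when monthly values are used. A sketch, where the 3.7-year sinusoid is an arbitrary stand-in for interannual variability (not an ENSO model):

```python
import numpy as np

rng = np.random.default_rng(42)

def eq1_95(x):
    # SJK2023-style Equation 1 applied to a record x of length N
    return 2 * x.std(ddof=1) / np.sqrt(len(x))

# Case 1: genuine white Gaussian noise sampled monthly.
monthly_noise = rng.normal(0.0, 0.1, size=132)
annual_noise = monthly_noise.reshape(11, 12).mean(axis=1)
ratio_noise = eq1_95(monthly_noise) / eq1_95(annual_noise)  # ~1 up to sampling

# Case 2: a smooth, physically inter-correlated signal
# (an arbitrary 3.7-year sinusoid standing in for interannual variability).
t = np.arange(132) / 12.0
physical = 0.1 * np.sin(2 * np.pi * t / 3.7)
annual_phys = physical.reshape(11, 12).mean(axis=1)
ratio_phys = eq1_95(physical) / eq1_95(annual_phys)  # shrinks by ~sqrt(12)

print(round(ratio_noise, 2), round(ratio_phys, 2))
```

The first ratio stays near 1, while the second drops to roughly 1/√12 ≈ 0.3, reproducing the 0.108°C versus 0.034°C inconsistency discussed above.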
Figure 1a shows another severe logical, physical, and mathematical inconsistency: SJK2023 estimated $\sigma_{M,95\%}$ (Equation 1 or 14) just for the ERA5-T2m temperature record and showed it with a ±0.10°C pink bar. However, for the 208 (36 green + 172 black dots) GCM hindcasts, they calculated $\sigma_{\mu,95\%}$ (Equation 8). The latter are correctly represented by dots, since the simulations are unaffected by any statistical error while showing significant interannual variability like ERA5-T2m, as Figure 1b demonstrates. Yet, the same algorithm must be used for both the real and synthetic temperature records.
It must also be mentioned that evaluating the 2011–2021 temperature mean and its error does not depend on the GCMs' performance in modeling the observed interannual climatic variability, nor on the possibility of predicting the "timing of El-Niño events." Moreover, these events do not possess any "random nature," as SJK2023 claimed by confusing a weakly chaotic system, like the climate, with a stochastic process. Only the quantum world presents instances of true randomness. In fact, while forecasting the "timing of El-Niño events" may be challenging (chaotic system), an El-Niño event is a well-established fact once historically recorded. Yet, Equation 1 erroneously implies that the annual temperature data are prone to stochastic errors so excessively large ($\sigma_{year,95\%}$ = 0.33°C) that it might not even be possible to determine (stochastic process) whether or not an El-Niño event occurred in a given year.
In other words, since we are interested in determining what happened from 2011 to 2021, the 2011–2021 ERA5-T2m interannual variability, which represents the actual climatic chronology that occurred, cannot be replaced by random data and, accordingly, Equation 1 cannot be applied to it. SJK somehow misunderstood that only the internal variability of the GCMs can generate stochastic ensembles of interannual hindcasts by randomly varying their initial conditions or internal parameters. However, ERA5-T2m is not a GCM, but the given temperature chronology, which produces only one 2011–2021 temperature mean with its own error of measure.

Estimation of the Likely "Error of the Mean" for the ERA-T2m 2011-2021 Record
S2022 did not report the error of the mean because the ERA5-T2m monthly or annual uncertainties had not been explicitly published (Hersbach et al., 2020). However, such an error was expected to be negligible because, as Scafetta (2023a) explained, over the last 50 years the annual temperature uncertainties are estimated to be 0.03–0.05°C (95% confidence) by all global surface temperature records (Lenssen et al., 2019; Morice et al., 2021; Rohde & Hausfather, 2020) and, if their error-covariance is ignored, such uncertainties should be divided by √11 to get the error at the 11-year scale (Equation 8).
However, the decadal-mean temperature uncertainties were explicitly published for the Berkeley Earth Land/Ocean temperature record, which reports $\sigma_{decadal{\text -}\mu,95\%} \approx 0.02$°C since 1970 relative to the 1951–1980 period (Rohde & Hausfather, 2020). This publication directly contradicts SJK2023's result derived from Equation 1. Similarly, I processed the provided K = 200 HadCRUT5 ensemble members as 1980–1990 temperature anomalies with Equation 5, which also evaluates the covariance among the individual members: the 2011–2021 mean was 0.58°C and $\sigma_{\mu,95\%} \approx 0.02$°C (also including the coverage uncertainty). This error estimate is one-fifth of that given by Equation 1, and it is well-defined because the same result is obtained with both the annual (N = 11) and monthly (N = 132) records.
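The ensemble calculation described above can be sketched as follows; the synthetic members are a stand-in for the K = 200 HadCRUT5 realizations (the real analysis would load the published ensemble files, and the warming signal and error level below are assumptions):

```python
import numpy as np

# Synthetic stand-in for an ensemble of K alternative records of N annual
# values (the real input would be the K = 200 HadCRUT5 ensemble members).
rng = np.random.default_rng(1)
K, N = 200, 11
truth = np.linspace(0.5, 0.7, N)                  # illustrative warming signal
members = truth + rng.normal(0.0, 0.02, size=(K, N))

# Empirical N x N covariance of the annual values across the K members,
# plugged into Equation 5 for q = mean (all partial derivatives = 1/N).
C = np.cov(members, rowvar=False)
mu = members.mean()
sigma_mu_95 = 2 * np.sqrt(C.sum()) / N

print(round(mu, 2), round(sigma_mu_95, 3))
```

Because `np.cov` retains any correlation present among the members, the same code covers both the independent case (Equation 8) and the general covariance-aware case (Equation 5).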

The Internal Variability of the Models
According to SJK2023, S2022 ignored the internal variability of the models and insisted that "the full ensemble for each model must be used." As mentioned in Section 2, SJK2023's claim is incorrect because S2022 characterized each GCM using three independent average simulations. That variability already provided an approximate estimate of the hindcast range related to each model's internal variability. Moreover, SJK2023 overlooked that S2022's ultimate goal was to test three macro-GCMs, where each GCM ensemble average served as a sample of the "internal variability" of the respective macro-GCM.

Table 1. ERA5-T2m (1980–1990 anomalies) and HadCRUT.5.0.1.0 infilled (1961–1990 anomalies) annual mean temperatures with their published statistical errors (https://www.metoffice.gov.uk/hadobs/hadcrut5/data/current/download.html); $\sigma_{\mu,95\%}$ = 0.033°C (Equation 8).

Figures 2b and 2c depict the statistical analysis of the distribution of GCM hindcasts due to model internal variability, using the Monte Carlo methodology described in Scafetta (2023a), where each average simulation (three per GCM, Figure 2a) was assumed to present an additional stochastic dispersion given by a Gaussian process with $\sigma_{95\%}$ = 0.10°C (like the ERA5-T2m interannual variability calculated by SJK2023), which was simulated by several hundred thousand random values. The analysis uses the average simulation data from S2022's Table 1. Only the Low-ECS macro-GCM is optimally compatible with the temperature data within the 95% confidence, whereas the Medium- and High-ECS macro-GCMs show a clear warm bias because the distribution of their hindcasts is warmer than the data. In fact, the z-score

$$z = \frac{\mu_{GCM} - \mu_{obs}}{\sqrt{\sigma_{obs}^2 + \sigma_{GCM}^2 + 0.05^2}}$$

where $\mu_{obs}$ = 0.56°C is the observed temperature mean, $\mu_{GCM}$ is the macro-GCM-hindcast-distribution mean, $\sigma_{obs}^2 = 0.01^2$ is the data variance, $\sigma_{GCM}^2$ is the macro-GCM variance, and $0.05^2$ is the assumed additional variance related to each GCM's internal variability, yields the following percentiles of obtaining GCM hindcasts larger than 0.56 ± 0.01°C: 95% (High-ECS macro-GCM), 97% (Medium-ECS macro-GCM), and 57% (Low-ECS macro-GCM). Therefore, the Medium- and High-ECS macro-GCMs perform rather poorly in hindcasting the warming from 1980–1990 to 2011–2021 and should not be used to formulate public policies. Indeed, when a set of GCMs with comparable ECS properties is evaluated as a macro-GCM, as done in S2022, even if a few models or some of their simulations roughly hindcasted the observation (as some black dots shown in Figure 1 do), it would not demonstrate that the macro-GCM performs well as a whole. In order to compare the actual data with the macro-GCM hindcasts, statistical analyses like those shown in Figure 2 are required.
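A hedged sketch of the z-score screening follows. The macro-GCM parameters below are illustrative placeholders, chosen only so that the outputs fall near the percentile range quoted in the text; they are not the actual distribution values behind Figure 2:

```python
import math

def pct_warmer_than_obs(mu_gcm, sigma_gcm, mu_obs=0.56, sigma_obs=0.01,
                        sigma_iv=0.05):
    """Probability that a macro-GCM hindcast exceeds the observed mean,
    combining data variance, macro-GCM variance and the assumed
    internal-variability variance (0.05**2)."""
    z = (mu_gcm - mu_obs) / math.sqrt(sigma_obs**2 + sigma_gcm**2 + sigma_iv**2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF

# Placeholder macro-GCM distribution parameters (NOT the values of Figure 2):
p_low = pct_warmer_than_obs(0.58, 0.10)    # near-unbiased case
p_high = pct_warmer_than_obs(0.80, 0.12)   # strongly warm-biased case
print(round(p_low, 2), round(p_high, 2))
```

A near-unbiased macro-GCM gives a percentile close to 50%, while a warm-biased one pushes the percentile well above 95%, which is the pattern reported for the Medium- and High-ECS macro-GCMs.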

1 Here, I am only concerned with the 11-year uncertainty of one record, not the uncertainty resulting from variations among different temperature records. In any case, all available global surface temperature records show a 1980–1990 to 2011–2021 warming between 0.52 and 0.59°C (Scafetta, 2023a), which could also be significantly overestimated according to alternative climatic records (Scafetta, 2023a, 2023b; Spencer et al., 2017; Zou et al., 2023). The real 2011–2021 warming could be 0.4–0.6°C, which is well below the Medium- and High-ECS macro-GCM distributions. Scafetta (2023b) proposed an ideal macro-GCM composed of 17 GCMs (from a collection of 41 GCMs) by selecting the top-performing models based on their ECS and transient climate response (TCR) rankings. Because of their low TCR, certain medium- and high-ECS models could be added to the low-ECS category in this situation.

The Areal Student t-Test Analysis
SJK2023 also critiqued the second analysis that S2022 proposed, which was based on areal t-test statistics. The analysis was based on these equations:

$$\overline{\Delta T}_j = \frac{1}{N}\sum_{i=1}^{N}\left(T_{m,ij} - T_{o,j}\right) \qquad (17)$$

$$t_j = \frac{\sqrt{N}\,\left|\overline{\Delta T}_j\right|}{\sigma_j} \qquad (18)$$

where j = 1, …, M are the grid cells, i = 1, …, N are the models, $t_j$ represents the Student t-test variable, and $\sigma_j$ is the standard deviation of the distribution of the GCM hindcasts; m indicates "model," and o indicates "observation." SJK2023 claimed that √N should be replaced by 1.
SJK2023 did not propose a "more appropriate test" but only a complementary one that does not change S2022's conclusion that, as the ECS lowers, the agreement between model hindcasts and actual observations increases, on average, also at the local scale (Scafetta, 2022, Figure 5). In fact, S2022's areal Student t-test maps can simply be rescaled by 1/√N if a reader is interested in studying the alternative analysis. The test without √N was adopted for the global scale (Figure 2), where a dynamical comparison among the various regions was not needed.
It is perplexing that SJK2023 asserted that the Student t-test would not be adequate for evaluating the performance of the climate models, given that it is a well-known and reliable statistical instrument of analysis. The major goal was to assess the local performance of the three macro-GCMs. If N increases, the rejection rate rises only if the models' reconstruction of the temperature changes exhibits a systematic bias. In fact, for reliable models, the mean divergence from the observed values should converge to zero ($|\overline{\Delta T}_j| \to 0$) faster than 1/√N, and Equation 18 will keep $t_j$ small ($t_j \to 0$). Additionally, the (rather weak) dependence of the t-test equation on N can easily be handled by standardizing the proposed test with a fixed number of models (e.g., N = 10) and, if necessary, averaging the areal t-test results among these subsets.
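The behavior of Equation 18 in the two scenarios discussed here (unbiased scatter around the observation versus a systematic warm bias) can be sketched with synthetic hindcasts for a single grid cell; all numbers are assumed, not actual GCM output:

```python
import numpy as np

# Synthetic hindcasts (assumed values) for one grid cell j.
rng = np.random.default_rng(7)
N = 10                      # number of models in the macro-GCM
T_obs = 0.6                 # observed warming in cell j (degC), assumed

def t_cell(T_models, T_obs):
    # Equation 18: t_j = sqrt(N) * |mean difference| / sigma_j, where sigma_j
    # is the std of the hindcast distribution (= std of the differences here).
    delta = T_models - T_obs
    return np.sqrt(len(T_models)) * abs(delta.mean()) / delta.std(ddof=1)

unbiased = rng.normal(T_obs, 0.1, size=N)       # models scatter around obs
biased = rng.normal(T_obs + 0.2, 0.1, size=N)   # systematic warm bias

t_unbiased = t_cell(unbiased, T_obs)
t_biased = t_cell(biased, T_obs)
print(round(t_unbiased, 2), round(t_biased, 2))
```

For the unbiased set, the mean difference shrinks like 1/√N and t stays small; the systematic bias keeps t large, which is precisely what makes the √N-scaled test sensitive to structural model errors.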
In fact, the GCMs should hindcast the primary dynamical patterns produced by the climate system, including those generated by the global atmospheric and oceanic circulation. The fact that they cannot yet do so does not invalidate Equation 18. If the GCMs were not intended to reconstruct the global circulation, even at timescales longer than decadal, simple energy balance models could just replace them. S2022's test successfully brought attention to persistent problems in climate dynamics that need to be solved in order to improve the models.
On the other hand, the "more appropriate test" proposed by SJK2023 would be inadequate for detecting specific dynamical biases in the model ensemble because, for instance, one GCM simulation might accurately reproduce the warming over China but not over Africa, whereas another simulation might accurately reproduce the warming over Africa but not over China (and so on). If the scaling provided by the factor √N in Equation 18 is not used and the two simulations are simply superimposed, the analysis would give the false impression that the GCM reconstructs the temperatures in both China and Africa, which would be misleading because none of the actual simulations does so. S2022's recommended test is stringent, but statistical tests are generally more useful when they reveal physical inconsistencies rather than when they hide them under the umbrella of large stochastic error-bands.

Conclusion
SJK2023 misused statistics and physics because they evaluated $\sigma_\mu$ of 11 different annual temperatures by using the SDOM of a (here-nonexistent) distribution of one quantity instead of the GEPF of an N-quantity function like the "mean," and they confused natural interannual chaotic variability with random noise.
Statistical equations only apply to the stochastic component of a signal, not to its physical one, which is regulated by the laws of physics. The uncertainty of any function of a temperature chronology (such as its mean) arises only from the stochastic errors of the data-points and their error-covariance (Equation 5). Whether the El-Niño fluctuations are caused by internal mechanisms or external forcings, or how well they can be predicted by modern GCMs, is irrelevant for calculating the data errors at the decadal or any other timescale. The observational errors must not be confused with the errors of the regression parameters of a model-interpretation of the data.
More specifically, SJK2023 inflated the error of the ERA5-T2m 2011–2021 mean by 5–10 times by confusing it with the error of the regression parameter M of an isothermal climate model, which postulates that the interannual temperature variability is random noise and is therefore nonphysical. Their $\sigma_{\mu,95\%} \approx 0.10$°C 11-year error estimate is contradicted by published temperature data (e.g., Figure 1c) and was also arbitrarily calculated using annual average values as opposed to monthly ones, which produce with Equation 1 a very different result, $\sigma_{\mu,95\%} \approx 0.03$°C.
Additionally, SJK2023 ignored that S2022's main result was confirmed by three different sets (one for each SSP) of GCM average simulations and that S2022's primary goal was to test three macro-GCMs characterized by different ECS ranges.Finally, Scafetta (2023a) answered all of SJK2023's questions about the hindcast uncertainty of the models and confirmed that the most likely ECS should be lower than 3°C, as Figure 2, which complements S2022, shows once again.

Figure 1. (a) SJK2023's figure displaying the actual and simulated temperature changes from 1980–1990 to 2011–2021. The green dots represent a small subset of the model ensemble averages examined in S2022. (b) The inset shows that both the actual temperature record and the computer simulations present interannual variability. My critiques (highlighted in the blue comments) are detailed in the text. (c) Screenshot of the Berkeley Earth global surface temperature record with its estimated (95% confidence) errors (Rohde & Hausfather, 2020, 2023). The red box highlights the typical error of the 10-year mean (±0.020°C) centered in 2016, which contradicts SJK2023's Equation 1 (±0.10°C).