Robustness of CMIP6 Historical Global Mean Temperature Simulations: Trends, Long‐Term Persistence, Autocorrelation, and Distributional Shape

Multi‐model climate experiments carried out as part of different phases of the Coupled Model Intercomparison Project (CMIP) are crucial to evaluate past and future climate change. The reliability of models' simulations is often gauged by their ability to reproduce the historical climate across many time scales. This study compares the global mean surface air temperature from 29 CMIP6 models with observations from three datasets. We examine (1) warming and cooling rates in five subperiods from 1880 to 2014, (2) autocorrelation and long‐term persistence, (3) models' performance based on probabilistic and entropy metrics, and (4) the distributional shape of temperature. All models simulate the observed long‐term warming trend from 1880 to 2014. The late twentieth century warming (1975–2014) and the hiatus (1942–1975) are replicated by most models. The post‐1998 warming is overestimated in 90% of the simulations. Only six out of 29 models reproduce the observed long‐term persistence. All models show differences in distributional shape when compared with observations. Varying performance across metrics reveals the challenge to determine the “best” model. Thus, we argue that models should be selected, based on case‐specific metrics, depending on the intended use. Metrics proposed here facilitate a comprehensive assessment for various applications.


Introduction
The decade of the 2010s was the hottest yet in more than 150 years of global mean temperature measurements (NOAA, 2020). The key climate change signatures include intensifying the water cycle (Markonis et al., 2019), modifying the frequency of precipitation extremes (Papalexiou & Montanari, 2019), affecting the water availability (Gosling et al., 2011;Miao & Ni, 2009;Sheffield et al., 2012), alternating the regime of heat waves Perkins et al., 2012) and flooding (Min et al., 2011;Pall et al., 2011;Rahmstorf & Coumou, 2011), and impacts on human health (Gosling et al., 2009), food security (Piao et al., 2010), ecology (Allen et al., 2010;García et al., 2018), and species biodiversity (Nowak, 2010;Wake & Vredenburg, 2008). The multi-model climate experiments carried out as a part of various phases of the Coupled Model Intercomparison Project (CMIP) are designed to advance our understanding and knowledge of climate variability and climate change. These experiments simulate the current climate and project future conditions (IPCC, 2013). Specifically, the CMIP models are forced with multiple scenarios of greenhouse gas concentrations, representing the forced variability (IPCC, 2013). The CMIP models are initialized using multiple initial states; the different initial states result in different trajectories of the climate system, representing the internal or non-forced variability. The previous phases of CMIP have been very useful to both understand changes in the climate system and help inform climate policy (Aloysius et al., 2016;Forster et al., 2013;Knutti & Sedláček, 2013;Li et al., 2014;Miao et al., 2014;Sillmann et al., 2013;Toreti & Naveau, 2015;Wuebbles et al., 2013). The latest phase of CMIP, CMIP6, is developed to overcome limitations of the CMIP5 and continue to address the ongoing scientific challenges (Eyring et al., 2016(Eyring et al., , 2019. • Global evaluation of mean surface air temperature from 29 CMIP6 models with observations from three data sets • CMIP6 assessment based on warming and cooling rates, autocorrelation and long-term persistence, entropic and distributional shape metrics • CMIP6 reproduce well the 1880-2014 warming trend yet simulations and observations differ in distributional and long-term persistence metrics Supporting Information: • Supporting Information S1 • Figure S1 • Figure S2 • Figure S3 • Figure S4 • Figure S5 • Figure S6 • Figure S7 • Figure S8 • Figure S9 • Figure S10 • Figure S11 • Figure S12 • Figure S13 • Figure S14 • Figure S15 • Figure S16 • Figure S17 • Figure S18 • Figure S19 • Figure S20 • Figure S21 • Figure S22 Correspondence to: S. M. Papalexiou, sm.papalexiou@usask.ca The credibility of climate projections is typically defined by how accurately climate models represent the historical variability and trends (Miao et al., 2014). Previous work has focused on quantifying the forced climate signal, obtained by averaging multiple ensemble members across multiple climate models (Lyu et al., 2015;Stott & Forest, 2007), and understanding the non-forced variability (the internal variability), obtained by studying alternative trajectories of the climate system (Brown et al., 2015;Deser et al., 2012). This work is critical for both climate science and climate policy; yet, most studies in the literature focus on basic metrics (mean and variance) and do not provide much insight on the statistical character of the simulated global temperature time series (e.g., the temporal clustering and severity of climate impacts associated with global temperature variability). A credible representation of higher-order statistics in climate models will benefit users of climate model output.
The purpose of this paper is to investigate the historical global mean temperature simulations from CMIP6 models and assess their performance at multiple time scales using as reference three observations data sets. Specifically, we calculate the global temperature anomalies from 267 simulations from 29 CMIP6 models and (1) assess the overall trends and models' ability to capture warming, cooling, and hiatus periods; (2) evaluate the autocorrelation structure (ACS) and long-term persistence; (3) rank the models based on different criteria, such as model deviance, entropy measures, and relative entropy; and (4) use distributional-shape metrics (L-moments) to evaluate simulations in terms of the shape of the probability distribution. This study offers a detailed evaluation of the new phase of CMIP model simulation and aims to go beyond the basic metrics used in assessing models; time series properties such as correlations, long-term persistence (LTP), or distributional shape characteristics reveal valuable information on the dynamics of the models affecting their ability to create clustering of anomalies, local increasing or decreasing trends, or the potential to observe extreme anomalies. The analysis shows that CMIP6 simulations reproduce the historical observed trends but also reveals that not all simulations reproduce autocorrelations, LTP, and distributional shape.

CMIP6 and Observations
The monthly mean near-surface air temperature (tas) outputs from 29 models were obtained from the World Climate Research Program (WCRP) CMIP6 (https://esgf-node.llnl.gov/projects/cmip6/). Here we consider historical simulations (Eyring et al., 2016) corresponding to different variants based on realizations (r), initialization (i) schemes, different physics (p), and forcing ( f ) indexes. We consider a total of 267 simulations from 29 models provided by 19 modeling groups (see Table 1 for details). Historical simulations refer to the 1850-2014 period except for the AWI-CM-1-1-MR model .
To consider the uncertainty in observations, we use three data sets of observed monthly anomalies: (1) GISS Surface Temperature Analysis (GISTEMP v4; https://data.giss.nasa.gov/gistemp), built by combining the NOAA GHCN v4 database for land (meteorological stations), and the ERSST v5 for ocean areas (Hansen et al., 2010;Lenssen et al., 2019); (2) Berkeley Earth Surface Temperatures (Berkeley; http://berkeleyearth.org/data/), a gridded reconstruction of land surface air temperature records spanning 1850-present, obtained by combining the land air temperature with the HadSST3 sea surface temperature reconstruction (Rohde et al., 2014); and (3) HadCRUT4 (https://crudata.uea.ac.uk/cru/data/temperature/), a global temperature data set, obtained by merging the CRUTEM4 and HadSST3 for the land and ocean, respectively (Jones et al., 2012;Osborn & Jones, 2014). To match the periods between the observed data sets and the historical simulation we consider the period 1880-2014, a total of 135 years.

Methods
In this study we examine (1) warming and cooling rates in several sub-periods, (2) LTP, (3) multi-model performance ranking based on deviance metrics and entropy, and (4) indices that quantify the distributional shape (L-moments). All of the analyses are based on temperature anomalies, obtained as follows: for a given model we (1)  LTP (introduced in the 1940s by Kolmogorov, studied extensively by Mandelbrot, Wallis, and Van Ness in the 1960s, and popularized in geosciences by Beran, 1998) occurs when autocorrelation slowly approaches zero. Such behavior is characterized by persistence around specific values (not necessarily close to the longer-term mean) for extended periods. For example, if global temperature variability is a process with strong persistence, then this could imply that temperature anomalies persist at low or high values for several decades. During such periods, temperature might not be representative of a "true" longer-term climatology, the forced response, or any other longer-term climatological attractor-a challenge inherent to any detection and attribution study. For these reasons, competent models should not only reproduce the probability laws of a process but also its correlation structure (Papalexiou, 2018;Papalexiou, Markonis, et al., 2018;Papalexiou & Serinaldi, 2020) Therefore, the comparison of autocorrelation properties between simulations and observations is crucial. LTP in models and observations is quantified by the Hurst coefficient (H) for the 1880-2014 period; H < 0.5 indicates anti-persistence, H = 0.5 denotes time independence, and as H increases from 0.5 to 1 the persistence also increases. Several methods exist to calculate the LTP (for complete account see Beran, 1998;Beran et al., 2013); here we use the slope of the fitted line in a logarithmic plot between time scale (k) and standard deviation of scaled time series (σ (k) ). For a time series {x t | t = 1,…,n} the jth value of the scaled time series, at scale k, is given by the following.
The ACSs from models and observations, used to assess how well models replicate the observed ACS, are estimated by calculating the celebrated Pearson correlation coefficient up to lag 10 years. We rank each The deviance and entropic measures used in this study are (1) Model Deviance Criteria (MDC), (2) relative entropy, or the Kullback-Leibler divergence criteria (KLDC) (Kullback & Leibler, 1951), (3) Approximate Entropy (ApEn) (Pincus, 1991), and (4) Sample Entropy (SampEn) (Richman & Moorman, 2000). The MDC calculates the deviance of the inverse cumulative distribution functions (also known as quantile functions) between models and observations by estimating the Root Mean Square Error (RMSE) given by where u varies from 0.01 to 0.99 in steps of 0.01 and Q o (u) and Q s (u) denote the quantile values of observed and simulated anomalies for the corresponding u. The lower the RMSE the better the match between observations and simulations.
The KLDC also calculates the difference between the probability distribution of model and observations; the lower the KLDC values the better agreement in the probability distribution of simulations and observations (Shukla et al., 2006). Here we calculate the relative entropy following Delsole and Tippett (2007) and Kleeman (2002), that is, where μ and σ are mean and variance, respectively, and subscripts o and s indicate observations and simulations. This equation can be decomposed into dispersion and signal components; it evaluates both the predictability of the spread (dispersion) and the evolution of the mean (signal) of the model prediction.
The third and fourth criteria are based on entropy measures. Approximate Entropy (Pincus, 1991) (ApEn) and sample entropy (SampEn) are measures of persistence or regularity in a system (time series), thereby assessing the randomness of data without any previous knowledge about the source generating the data set. Low entropy reflects a persistent or a system that is predictable, with patterns (or values) that repeat throughout the series, while high entropy means a stochastic system or independence in the data series. More details on entropy measures, information theory and complexity are found in Delgado-Bonal and Marshak (2019). Let x n , n = 1,2,…,N, be a time series of length N. For a nonnegative integer m (also called as the embedded dimension), blocks are defined such that Each distance is compared with a threshold r (also called as resolution or tolerance), and the correlation sum C m i r ð Þ is calculated as This is repeated with all reconstructed blocks, and the probability of a pattern (of length m) in the time series is obtained by the following.
The ApEn is defined as ApEn(m, r, N) = ϕ m (r) − ϕ m+1 (r), with m ≥ 1. Low values of ApEn imply that the system is highly repetitive and predictable; in contrast, high values imply randomness and unpredictability. ApEn in some cases might be biased as it strongly depends on the sample size; Sample Entropy (SampEn), however, is more consistent in this respect (Chen et al., 2005;Yentes et al., 2013). The SampEn is defined as where A m (r) is the probability that two sequences are similar for m points and is given by the following.
A typical value for m is 2, while r is estimated by identifying the r resulting in highest entropy (Delgado-Bonal & Marshak, 2019) (see also supporting information Figure S21; here r = 0.28 times the standard deviation of the time series).
Finally, we compare models and observations using L-moment ratios (L-skewness [τ 3 ] and L-kurtosis [τ 4 ]), which offer more robust estimates than their product-moments counterparts (skewness and kurtosis); the sample estimate are based on formulas found in the literature (Hosking, 1990;Papalexiou & Koutsoyiannis, 2016;Vogel & Fennessey, 1993). L-ratios allow exploring important statistical properties of the simulated or observed time series and provide information regarding the underlying distribution. While the MDC or KLDC compares the overall distribution, L-moment ratios (τ 3 , τ 4 ) focus on properties that are more intuitive and conceptually easy to understand. Particularly, L-skewness reveals if the sample has positive or negative skewness showing if the distribution has longer right or left tail, respectively. L-kurtosis shows how heavy the distribution tails are; this is crucial as distribution with heavy (or heavier) tails can generate values that deviate significantly from the mean value.

Comparison of Observed and Simulated Temperature Evolution
The comparison between observed and simulated anomalies from the CMIP6 models, and their multi-model mean (Figure 1), shows general agreement in the temporal evolution of global mean temperatures. The CMIP6 models capture the variability and the overall trend in the observed anomalies over the historical period. Most major episodic cooling and warming periods in observations are also evident in models: intermittent cooling after volcanic eruptions (Flato et al., 2013); the mid-twentieth century plateau during 1941-1950, attributed to sulfate aerosol increases (Meehl et al., 2004) and ENSO activity (Brönnimann, 2005); and the strong warming over the 1964-2004 period, attributed primarily to anthropogenic forcing (Giorgi et al., 2011;IPCC, 2007;Meehl et al., 2004). These features are also visible at monthly and seasonal resolution and are reproduced qualitatively by all models (Figures S1-S16).

Reproduction of Warming, Cooling, and Hiatus Periods in CMIP6 Simulations
Superimposed on the secular warming trend since 1880 are periods of accelerated and slowed warming relative to the long-term trend. Examples of accelerated warming periods include the early twentieth century and late twentieth century. Examples of slowed warming periods, or "hiatus" periods, include the mid-twentieth century and the 2000s (Li & Baker, 2016;Medhaug et al., 2017;Trenberth & Fasullo, 2013). We evaluate whether models reproduce the warming rates during these periods. The historical time period is divided into four slices following Zhu et al.    (Knutson et al., 2016). The hiatus from 1942 to 1975 (Time Slice 3), attributed to a combination of external forcing (anthropogenic aerosols, volcanic eruptions, solar variability) and internal variability, is replicated by most of the models (except AWI-CM-1-1MR, BCC-ESM 1, E3SM-1-0, and NorESM2-LM for which none of the associated simulations [with at least 3] are within the range of observed trends) with near zero trend over the period. For these two periods (Time Slices 3 and 4), the previous versions of CMIP are also in agreement with the observations (IPCC, 2013).
Interestingly, the early twentieth century warming (Time Slice 2) tends to be underestimated by most models. This acceleration, potentially resulting from increase in greenhouse gases, natural forcing, and internal climate variability (Delworth & Knutson, 2000;Hegerl et al., 2018;Meehl et al., 2004), ranges from 0.13°C to 0.15°C per decade for observational data sets, and only 6% of all simulations capture this range. The simulated warming rate of the early twentieth century was also lower in CMIP3 and CMIP5 simulations compared to the observed one (IPCC, 2007(IPCC, , 2013. While CMIP6 on average shows a weak warming trend during the late nineteenth century (Time Slice 1), observations are ambiguous with regard to the trend given their uncertainties for that time period (mostly originating from reduced global coverage); a decreasing rate is reproduced in 20% of the models' simulations. The observed rates for these two periods are not simulated well by most models in the previous versions of CMIP (IPCC, 2013;Knutson et al., 2016;Zhu et al., 2018).

Discrepancies in Autocorrelations and Persistence Between Observations and Simulations
We first investigate autocorrelation properties of global temperature anomalies, as LTP is characterized by the temporal decay of ACS. Marked differences are observed among the ACSs of the observed and individual models (Figure 3). Even though the simulations track the observations relatively closely (as seen in sections 3.1 and 3.2), most individual simulations underestimate the observed ACS. We rank each model using SRE, and for models with more than one simulation, we calculate the ensemble ACS by averaging the ACS of all simulations in the ensemble. The four best and worst performing models are shown in Figure 3 (results  Figures S18 and S19).
for all models are presented in Table S1 and Figure S18). Note that there are large differences among the simulations; some perform exceptionally well, and others show unrealistic ACSs. In some models all simulations are consistent in reproducing the observed ACS (e.g., EC-Earth3-Veg and AWI-CM-1-1-MR); however, there are models for which some but not all simulations are consistent (e.g., CanESM5 and EC-Earth3) and also models with no consistent simulations (e.g., MIROC6 and NorESM2-LM). We stress that the autocorrelation properties are affected by the presence of trends (most prominently, the slow decay rate of the ACS). The results in terms of model validation could potentially also be affected if the trends, either among the models or between models and observations, differ significantly; yet, the preceding analysis has shown that these trends are very similar. For models with sufficient ensemble members (Milinski et al., 2019), one could detrend the data with the ensemble mean, but for observations this is not possible without making subjective choices on the trend. Further, we used two available large ensembles with relatively different trends (MPI-ESM and CanESM5) to assess the effect of detrending on the results and find no systematic difference in ACS between models before and after detrending (see Figure S20). We thus choose to compare ACS without detrending.
Next, we investigate the LTP of anomalies as quantified by the scaling of standard deviation versus time scale. LTP is crucial as it governs how slowly the autocorrelation decays with lag or how standard deviation decreases at larger scales which reveals the fluctuation potential at multidecadal scales ( Figure 3). For example, for a given annual standard deviation σ, the decadal σ (10) , assuming LTP with H = 0.95, is expected to be 254.8% larger than the decadal standard deviation assuming independence. Observed anomalies show a strong LTP with H = 0.95 on an average; model simulations show H is ranging from 0.75 to 0.96 ( Figure 3 shows for four best and worst models; see Figure S19 for all models). Results show that there are models reproducing high persistence as seen in observations. This indicates that such models replicate the long-term variability due to several factors such as volcanic, anthropogenic activities, and natural variability. Models reproducing observed autocorrelations are expected to better mimic climate's temporal dynamics and allow for realistic local (decadal) trends and clustering.

Probabilistic and Entropic Ranking of Climate Models
Since the inception of CMIP, models have been ranked according to their performance in representing aspects of the climate system deemed important for determining the forced response, such as emerging constraints based on past warming, ocean heat uptake, or cloud and radiation characteristics (Brient & Schneider, 2016;Jiménez-de-la-Cuesta & Mauritsen, 2019;Steinacher & Joos, 2016); this practice is considered to lead to more trustworthy projections (Wang et al., 2019). The previous section revealed discrepancies between the CMIP6 simulations in reproducing the observed autocorrelations. Equal-weight multi-model means are widely used; however, some simulations clearly represent observations-in this case global temperature variability-more realistically than others, providing motivation for a performance-based model weighting or ranking.
Here models are ranked by four probabilistic and entropic criteria in comparison with the observations (best performing model is ranked with 1). The first two, the MDC and KLDC, compare the probability distributions between observations and simulations; two time series might have the same MDC or KLDC values but different autocorrelation properties. The other two, ApEn (Pincus, 1991) and SmpEn (Richman & Moorman, 2000), are measures of temporal complexity (Delgado-Bonal & Marshak, 2019). They measure the "information flow" by quantifying the regularity or the unpredictability of fluctuations; two time series might have the same ApEn and SampEn values but different probability distributions. Note that most models have many simulations (except for four models having one simulation; see Table 1), and we use three observational data sets; thus, the number of total combinations is of the order of 10 21 ; thus, we randomly select one simulation of each model and compared with a randomly selected observational data set. We repeat this process thousands of times and eventually calculate the average rank for each model.
Analysis shows that different criteria rank different models as best or worst ( Figure 4). For example, based on the MDC and KLDC the best performing models are the NorCPM1 and NESM3, respectively; while based on ApEn and SampEn measures, the best performing models are the GISS-E2-1-H and SAM0-UNICON. In general, the relative entropy (KLDC) and the MDC methods are consistent with each other; this is expected as both methods compare the observed and simulated anomalies without considering the temporal structure. Likewise, the entropy-based methods (ApEn and SampEn) that focus on temporal patterns are also 10.1029/2020EF001667 ranking models similarly. Since the first two methods have different foci than the last two, their results are not expected to be the same. Six models show contradicting results across the different performance metrics; for example, INM-CM4-8 is considered to be one of the best models according to the relative entropy and MDC methods; yet, ApEn and SampEn rank this model poorly. Perhaps unsurprisingly, no model performs best for all methods; if we consider the lowest average rank from all methods as an overall performance metric then the NorCPM1 would appear as best performing. Other well-performing models with low average ranks are the GISS-E2-1-H, NESM3, MPI-ESM 1-2-HR, INM-CM4-8, and FGOALS-g3.

Inconsistencies in Distributional Shape
The higher-order L-moment ratios (L-skewness [τ 3 ], L-kurtosis [τ 4 ]) quantify the distributional shape properties (see also section 2.2). The metrics τ 3 and τ 4 are measures of skewness and kurtosis and indicate the potential of a process to generate values that differ notably from the mean and appear as extremes. To understand the distributional properties of the anomalies, we estimate the τ 3 and τ 4 in simulations and observations. For models with more than one ensemble member, we calculate the τ 3 and τ 4 metrics by averaging across all members of the ensemble.
The τ 3 and τ 4 values have large variations across the models ( Figure 5). Specifically, τ 3 ranges from 0.03 (for CAMS-CSM1-0) to 0.3 (for SAM0-UNICON), indicating symmetric and positively skewed distributions, respectively; τ 4 varies from 0.11 (for EC-Earth3) to 0.24 (for E3SM-1-0) suggesting that some models might have heavier tails than the normal distribution. Interestingly, the observed (τ 3 ,τ 4 ) points for the observations  10.1029/2020EF001667 simulated points in the τ 3 − τ 4 plane (Figure 5a; see Figure S22 for monthly scales). We calculate L-moments for the aggregated time series (aggregated using equation 1), that is, timeseries showing multiyear anomalies up to decadal scale. We note that τ 3 and τ 4 change with aggregation scale in both the observations and simulations (see Figure 5b, time scales ranging from annual [green] to decadal [blue]). Typically, temperature anomalies are assumed to follow the normal distribution under stationarity (theoretically supported by central limit theorem); yet, the noted variation could be explained by nonstationarity and local trends, such as external forcings discussed in section 1. Some models have the same τ 3 as the observations, indicating similar skewness; yet, they do not match the observations' τ 4 , specifically at low scales (green color; Figure 5b). Investigation of the (τ 3 , τ 4 ) points in the τ 3 − τ 4 plane (at the aggregated scales) shows a clear clustering of the observation-based points (triangles) and mean of all models (circles). However, some individual models (given by the range around the multi-model mean estimate in Figure 5b) match well the observations of GISTEMP.

Conclusions
Climate models are developed to help advance our knowledge and understanding of climate variability and climate change. A rigorous evaluation of model simulations reveals their credibility. However, performance assessment depends on the evaluation criteria used; for example, typically, no single model is top-performing in all criteria. More specifically, models are assessed by comparing their historical simulations with observations and focusing on properties such as the mean, standard deviation, or the deviance between probability density functions. However, other distributional measures such as shape characteristics, and important joint distribution properties such as the autocorrelation and LTP at higher scales, are not generally examined.
Here, an extensive assessment of global mean surface air temperature is presented. Studying changes in global mean surface temperature is important for analyzing global response of various natural, managed, and human systems to warming rate (IPCC, 2018). The IPCC Special Report 15 (SR15) highlights the impacts of 1.5°C global warming above preindustrial levels and illustrates the impacts and risks of different levels of global warming for people, economies, and ecosystems across different sectors and regions. Given the importance of studying the changes in global mean surface temperature, and the ability to simulate historical and future trends by climate models, the simulations have to be assessed with observed ones in a comprehensive manner. We therefore evaluate the global mean surface temperature from the sixth phase of the Coupled Model Intercomparison Project (CMIP6) in comparison 10.1029/2020EF001667 with three observation data sets (GISTEMP, Berkeley and HadCRUT4) at annual and monthly scales. We use 267 simulations from 29 models provided by 19 modeling groups and assess the proximity of each model with three observation data sets using performance measures and robust criteria in terms of distributional properties. Particularly, we focus on (1) warming and cooling trends, (2) ACS and LTP, (3) probabilistic and entropic measures, and (4) distributional shape characteristics as quantified by L-moments.
1. The overall historical warming is captured by most of the models and by the multi-model mean. The rapid warming in the observed anomalies during the late twentieth century and the warming hiatus from 1942 to 1975 is well replicated by most of the models, confirming the important role of external forcing for producing these variations around the secular warming trend. Also, between 1998 and 2014, the observed trend is smaller (0.1°C per decade) than most of the models (90% of simulations overestimated this trend). 2. Only six out of 29 models reproduce LTP accurately. The ACS decays slowly, which is primarily due to the long-term trend; yet even so, it is generally underestimated by most models. 3. The models were ranked, from best to worst, using the MDC and three entropy-based criteria. No single model performs best in all four criteria, leaving some subjective choices up to the model users when determining the feasibility of models for a given application. 4. The simulated L-skewness and L-kurtosis are found to be higher in most models compared to observational estimates. This indicates that improvement is needed to capture the distributional shape characteristics. The current tendency of models to produce higher skewness could be related to their tendency to warm more than observations over the late twentieth century (Tokarska et al., 2020). Their tendency for higher kurtosis is harder to diagnose. It could be related to a combination of lack of low frequency variability paired with a tendency to overestimate the cooling after volcanic eruptions, although the latter issue could well be due to the exceptional coincidence of El Nino events during observed volcanic eruptions during the twentieth century (Lehner et al., 2016).
This study offers a detailed comparison on CMIP6 historical global temperature simulations. It provides additional metrics and performance assessments of the global temperature anomalies and reveals the challenge to determine the "best" model. Rather, it appears that models need to be chosen for a purpose, and their performance should be assessed based on that purpose. These metrics and assessment criteria are not limited to temperature and can be used for any hydro-meteorological variable (such as precipitation and wind speed) and at any spatial and temporal scales. Many real-world applications might demand accurate reproduction of the probability distribution-that should be the case if extremes are investigated. Other cases may need to focus on the temporal evolution of time series that can affect the clustering of high or low values. The fact is that informed decisions and model selection should not only be based on the typically used model comparison measures, such as mean and variance, but go beyond. Here we propose alternative measures that we deem will prove valuable in comparing and selecting simulations for actual applications; that could be risk assessment studies using temperature, precipitation, and so forth, at local or regional scales, or studies aiming to develop climate change adaptation strategies. As in all model evaluation efforts, uncertainties in observations themselves have to be considered carefully.