Decadal Predictability of Seasonal Temperature Distributions

Decadal predictions regularly focus on the predictability of single values, such as means or extremes. In this study we investigate the prediction skill for the full underlying surface temperature distributions on global and European scales. We investigate initialized hindcast simulations of the Max Planck Institute Earth system model decadal prediction system and compare the distributions of seasonal daily temperatures with estimates of the climatology and with uninitialized historical simulations. Our analysis shows that the initialized prediction system has advantages in particular in the North Atlantic area, allowing reliable predictions of the whole temperature spectrum for two to 10 years ahead. We also demonstrate that the capability of initialized climate predictions to predict the temperature distribution depends on the season.


Introduction
Changes in distributions of temperature are a common theme in the analysis of effects of climate change (Meehl et al., 2000; Samset et al., 2019; Scherrer et al., 2005). Distributions and their changes are therein characterized by their averages and variability (Meehl et al., 2000), with considerable differences between regions and seasons (Samset et al., 2019). Studies in decadal prediction regularly focus on the predictability and the prediction skill of single values, like means or percentiles of extremes. These studies have shown considerable success in the past years in predicting surface temperatures years in advance. Therein, it is common to compare initialized prediction systems with their uninitialized counterparts, demonstrating significant increases in skill, in particular on regional scales (Marotzke et al., 2016; Meehl et al., 2009; Merryfield et al., 2020). Nevertheless, this advantage appeared reduced within the most recent CMIP6 model simulations (Borchert et al., 2021).
Long-term climate predictions are made for time spans of multiple years, from which a single value is estimated as an average, extreme, percentile, or variability measure (Boer et al., 2022). The underlying reason for this is that, on these time scales, generally only the prediction of statistical properties can be achieved (Schneider & Griffies, 1999). While single summarizing values offer a good first insight into predictions, for some applications the full underlying distribution might be of interest for stakeholders from different sectors, for example, for the planning and adaptation of energy distribution, agricultural business, building activities, or health services. In the case of a normal distribution, the distribution is determined by two values (mean and variance) alone, while in the case of non-parametric distributions, stakeholders can profit from additional knowledge about the occurrence of single-day values. They might thus benefit not only from information on the amplitude of an extreme, but also on its frequency.
With this study we demonstrate the general concept of an approach to predict and verify the full distribution of seasonal daily temperatures on decadal time scales. To compare the predictions of different simulations, the difference between two non-parametric distributions is estimated. Different measures exist for this, and each measure highlights different properties of the differences, showing strengths and weaknesses in comparing distributions depending on the application (Düsterhus & Hense, 2012; Thorarinsdottir et al., 2013). With such a difference measure it is possible to create scores for predictions, but also to simply rank different simulations by their distance to a reference (Düsterhus, 2020).
In this study we investigate daily 2 m air temperature predictions with the Max Planck Institute Earth system model (MPI-ESM) for 10 lead years. After bias correcting the discrete distributions of temperature, anomalies are estimated for each three-year period of each season. Initialized decadal predictions, uninitialized historical simulations, and the climatology are then compared to a reference created from the corresponding MPI-ESM assimilation simulation. For the comparison we apply a metric based on the integrated quadratic distance (IQD) (Düsterhus, 2020; Thorarinsdottir et al., 2013). The results demonstrate that prediction skill for distributions can be identified around the globe, especially in the North Atlantic sector. Additionally, we show that while the annual patterns of predictability are generally similar, seasons show distinct differences, especially in the northern latitudes.

Model Data
All simulations used in this study were done with version 1.2 of the Max Planck Institute Earth system model in its low resolution setup (MPI-ESM-LR, Mauritsen et al. (2019)). It primarily consists of the atmospheric component ECHAM6 (Stevens et al., 2013), with a horizontal resolution of 1.85° and 47 vertical levels between the surface and 0.1 hPa, and the oceanic component MPIOM (Jungclaus et al., 2013), with a horizontal resolution of nominally 1.5° and 40 vertical levels. All simulations use external forcing according to the Coupled Model Intercomparison Project Phase 6 (CMIP6, Eyring et al. (2016)), that is, historical forcing until 2014 and SSP2-4.5 scenario forcing after 2014. Both the ensemble assimilation and the initialized decadal retrospective forecasts ("hindcasts") are taken from the 80-member MPI-ESM large ensemble decadal prediction system (Hövel et al., 2022; Krieger et al., 2022), with hindcasts initialized on November 1st every year from the ensemble assimilation. The system uses an oceanic ensemble Kalman filter (Brune et al., 2015) to assimilate temperature and salinity profiles from EN4 (Good et al., 2013) with the Parallel Data Assimilation Framework (Nerger & Hiller, 2013), and atmospheric nudging to the ERA40, ERA-Interim, and ERA5 reanalyses (Dee et al., 2011; Hersbach et al., 2020; Uppala et al., 2005). Due to this atmospheric nudging, the 2 m air temperature in the ensemble assimilation very closely resembles that of the ERA reanalyses. We therefore chose to use the assimilation simulation to create our references. The uninitialized simulations ("historical") used in this study are part of the MPI CMIP6 grand ensemble (MPI-GE CMIP6, Olonscheck et al. (2023)). For consistency, we use the first 16 members of each simulation set, because the assimilation ensemble consists of 16 members in total. For all of the historical and hindcast simulations and lead years, we investigate the time period 1971 until 2017 (for the boreal winter season DJF, until February 2018).

Method
In a first step, the daily data of the assimilation, the historical simulation, and each lead year of the hindcast for each season (MAM, JJA, SON, DJF) are averaged over the time period 1981-2010 to create the seasonal mean for each simulation. This mean is then subtracted from each corresponding data set to create anomalies, thereby performing a bias correction. The distributions are created by a histogram with 200 classes between anomalies of −20°C and +20°C (step size 0.2°C). This approach is applied to the assimilation simulation for each season between 1981 and 2010 to generate the climatology. For the hindcast simulation, the data for three lead years at a time, for example, lead years 2-4, 3-5, etc., is transformed to a distribution by the same approach. For the historical simulation, the distributions are created for the same 3-year windows. To create the reference for the comparisons, the same 3-year spans are used on the assimilation data. As a consequence, each distribution for a simulation in a specified season has the same number of entries (number of days in a season (DJF: 90; MAM and JJA: 92; SON: 91) times number of years (3) times number of ensemble members (16)).
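The anomaly and histogram construction described above can be sketched as follows. This is a minimal illustration under our own assumptions; the function and variable names are hypothetical and not taken from the authors' code:

```python
import numpy as np

def anomaly_distribution(daily_temps, seasonal_mean, n_classes=200, lo=-20.0, hi=20.0):
    """Bias-correct daily temperatures by subtracting the climatological seasonal
    mean, then bin the anomalies into 200 classes of 0.2 degC width."""
    anomalies = np.asarray(daily_temps, dtype=float) - seasonal_mean
    # Anomalies outside [lo, hi] would fall outside the histogram range.
    counts, edges = np.histogram(anomalies, bins=n_classes, range=(lo, hi))
    pmf = counts / counts.sum()  # normalize to a discrete probability distribution
    return pmf, edges

# Pooling one JJA 3-year window over 16 members: 92 days x 3 years x 16 members entries.
rng = np.random.default_rng(0)
sample = rng.normal(loc=15.0, scale=3.0, size=92 * 3 * 16)
pmf, edges = anomaly_distribution(sample, seasonal_mean=15.0)
```

With the step size of 0.2°C, the 200 class edges span exactly the −20°C to +20°C anomaly range used in the study.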
To compare two distributions we apply the integrated quadratic distance (IQD) (Düsterhus, 2020; Thorarinsdottir et al., 2013). Here, the distance between two discretized distributions (f and g) at given points v_i is measured by the squared distance of their cumulative distribution functions (cdfs, F and G). The distance is then given by

IQD(F, G) = Σ_i (F(v_i) − G(v_i))².
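As a sketch, the IQD for two discretized distributions on a common grid can be computed from their cumulative sums. This is our own minimal implementation, not the authors' code; a constant bin-width factor is omitted since it does not affect the ranking of distances:

```python
import numpy as np

def iqd(f, g):
    """Integrated quadratic distance between two discrete distributions f and g
    defined on a common grid: sum of squared differences of their CDFs."""
    F = np.cumsum(f)  # cumulative distribution function of f
    G = np.cumsum(g)
    return float(np.sum((F - G) ** 2))

# Identical distributions have zero distance; distance grows the further
# probability mass has to travel between the two distributions.
f = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
g = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
```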
In this study we compare the climatological, hindcast, and historical distributions at each grid point against the corresponding distribution from the reference. Thus a distance value is created for each 3-year period for each simulation at each grid point. Our evaluation is based on the comparison of two simulations for each grid point and for each 3-year time period. To evaluate over the whole time period for each grid point, two simulations are compared and the number of years is counted in which the distance of one simulation to the reference distribution is lower than that of the other. Significance at the 5% level is calculated by comparing the result to a random walk (DelSole & Tippett, 2016). This approach prevents a single outlier year from dominating our analysis and answers, for each grid point, the question how often a specific simulation is better than another over the evaluation period.
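The counting evaluation and the random-walk significance check can be sketched as follows. All names are our own, and the 1.96·√n bound is a normal approximation in the spirit of the DelSole and Tippett (2016) criterion, not their exact procedure:

```python
import numpy as np

def count_wins(dist_a, dist_b):
    """Number of periods in which simulation A is closer to the reference than B,
    given the per-period IQD values of both simulations."""
    a, b = np.asarray(dist_a), np.asarray(dist_b)
    return int(np.sum(a < b)), a.size

def random_walk_significant(wins, n, z=1.96):
    """Under the null hypothesis of equal skill, the win-loss difference behaves
    like a fair random walk with standard deviation sqrt(n) after n steps;
    exceeding z*sqrt(n) is significant at the (two-sided) 5% level."""
    diff = wins - (n - wins)  # wins minus losses
    return bool(abs(diff) > z * np.sqrt(n))
```

For example, a 40-7 outcome over 47 periods gives a win-loss difference of 33, far outside the ±1.96·√47 ≈ ±13.4 band, while a 25-22 outcome does not.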

Summer Prediction
To demonstrate the overall capability of the simulations, we investigate in Figure 1 the prediction skill of temperature distributions on the global scale for June-July-August (JJA). In a first step, we compare the hindcast prediction for different lead times against the climatology. For lead years 2-4 (Figure 1a) we see that the hindcast has advantages compared to the climatology in the North Atlantic area, the Indian Ocean area, and the Western Pacific. We also identify that, especially in the Arctic region, the constant climatology assumption is superior.
For longer lead times (Figures 1b and 1c) the general patterns persist and only minor changes occur. When we look at the historical simulation (Figures 1d-1f), which in this setup is the same for all lead times, the patterns are comparable, but show distinct differences in shape in some areas. Investigating the effect of initialization by comparing the hindcasts with the historical simulations (Figures 1g-1i) shows an advantage of the hindcasts mainly in the Southern Ocean and in the North Atlantic, while the tropical areas are mainly indecisive. In the Arctic region, the historical simulation is at first superior in lead years 2-4 (Figure 1g) before becoming neutral in lead years 5-7 (Figure 1h). At lead years 8-10 (Figure 1i), the hindcast is then significantly superior to the historical simulations. Land areas where the hindcast is superior to the historical simulation are India, West Arabia, and North-Eastern Brazil. We can also see some added skill in Northern Europe.
As the North Atlantic area and Europe are among the prime areas of skill, we investigate this area in more detail in Figure 2. For the hindcast (Figures 2a-2c) and historical (Figures 2d-2f) simulations we see vast areas that are better predictable when compared to the climatology. This is especially true for the subpolar gyre region and Central Europe. The climatology dominates especially in areas affected by the Gulf Stream and on the western side of Northern Africa. When we compare the hindcast and the historical simulations, we identify for lead years 2-4 (Figure 2g) the northern part of the North Atlantic and Northern Europe as areas where the initialization shows particularly clear advantages. Skill reduces over longer lead times, and for lead years 5-7 (Figure 2h) we can identify an area in the North East Atlantic as predictable. The skill in Southern Scandinavia as well as the British and Irish Isles persists up to lead years 8-10 (Figure 2i).

Predictions for Different Seasons
After focusing on the boreal summer (JJA), we now investigate the skill for different seasons. Here we concentrate in Figure 3 on the comparison between hindcast and historical simulations for Europe. In the boreal spring (MAM, Figures 3a-3c), we see a vast area east of Greenland over the Denmark Strait where the initialization creates additional value, which can be attributed to sea ice. Also, compared to the predictions for the boreal summer, the Eastern North Atlantic area is more predictable with the hindcast simulation for lead years 2-4 (Figure 3a). While this prediction skill reaches down to 35°N for the early lead years, it covers only down to about 45°N for lead years 8-10 (Figure 3c). In contrast, the prediction skill of the hindcast simulation is much less pronounced over the land areas. Especially in Scandinavia, much less area is covered by significant skill than in the summer. Just like the boreal spring, the boreal autumn (SON, Figures 3d-3f) also demonstrates a pattern in the Denmark Strait, but for a much narrower band along the Greenland coast. In contrast to the boreal spring, there is a clear indication that on the west side of Greenland, over Baffin Bay, the historical simulations show more skill than the hindcasts. Also in the lower latitudes the historical simulations show their advantages. The pattern in the subpolar gyre region is much more comparable to that of the boreal summer. In Scandinavia, neither of the two simulations shows any clear indication of significance. The boreal winter (DJF, Figures 3g-3i) is more comparable to the boreal autumn on the west side of Greenland, while on the east side it resembles the pattern of the boreal spring. In the subpolar gyre region, significant prediction skill is seen for all three seasons and all lead times, but with different patterns. For longer lead years, such as lead years 8-10 (Figure 3i), it is the only time that the historical simulations outweigh the hindcasts over Scandinavia. Overall, it can be shown that there is a seasonal cycle in the shown differences between hindcast and historical simulation over the North Atlantic sector.
The results in the mentioned regions in the higher latitudes might be explained by differences within the simulations in their capacity to represent snow and sea ice, in particular during melting and freezing times throughout the year.

Example From the North Atlantic
To dive further into the comparison of the different predictions, we chose a grid point in the North East Atlantic (20°W, 60°N) from the boreal summer prediction. This point is taken from a region where the hindcast prediction is significantly better than the other two simulations for lead years 2-4. For each year, Figure 4 shows the distribution, with the inter-quartile range (25%-75%) shown by a block and the distance toward the 5th and 95th percentiles by whiskers. The reference assimilation shows a generally upward trajectory, but also demonstrates some variability over time. In the mid-1970s and mid-1990s there are weak negative dents in the trend, while in the late 2000s and early 2010s a positive anomaly can be identified. For most times, highlighted especially around the year 2000, the assimilation exhibits lower whiskers that are longer than the upper ones, indicating a non-symmetric distribution. In general, this evolution over time is better represented in the hindcasts than in the historical simulations. By analyzing in how many years one simulation is better than the other, we get a good indication of the quality of a particular prediction. The analysis shows that the hindcast (40-7) and the historical simulation (43-4) are clearly better than the climatology at this grid point. We also find that the hindcast shows better results than the historical simulation in the majority of the years (37-10). Comparing hindcast, historical, and climatology in a three-way comparison, we find that the hindcast performs best in 35 years, the historical simulation in 10 years, and the climatology in only two years (35-10-2). This grid point example demonstrates a potential application, comparable to the common scatter plots used for a deeper understanding of correlation-based measures. We would like to highlight that analyzing single grid points without a physical context is not sufficient to understand the implications of predictability for a given area.
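The per-year ranking behind such tallies can be sketched as a simple argmin over the three distances to the reference. Names and toy values below are illustrative only, not the study's data:

```python
import numpy as np

def three_way_tally(d_hindcast, d_historical, d_climatology):
    """For each period, find which simulation has the smallest distance to the
    reference and tally the wins per simulation."""
    d = np.column_stack([d_hindcast, d_historical, d_climatology])
    best = np.argmin(d, axis=1)  # column index of the winning simulation per period
    labels = ("hindcast", "historical", "climatology")
    return {lab: int(np.sum(best == i)) for i, lab in enumerate(labels)}

# Toy example with three periods: hindcast wins twice, climatology once.
tally = three_way_tally([0.1, 0.2, 0.9], [0.5, 0.5, 0.5], [0.9, 0.9, 0.1])
```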

Discussion and Conclusion
This study demonstrates the ability of initialized decadal simulations to predict 2 m air temperature reliably better than the climatology, on a global scale and in particular in the North Atlantic region. Furthermore, it shows improved prediction skill compared to uninitialized historical simulations in the northern parts of the North Atlantic and Europe. This has already been shown with predecessor systems based on MPI-ESM (Marotzke et al. (2016); Polkova et al. (2019); Brune and Baehr (2020)). Compared to their investigations, which were based on the commonly used anomaly correlation (ACC), the bias and its correction play an important role in our investigation. Our metric, the 1D-IQD, takes the deviation from the mean into account for the evaluation, similar to, for example, an analysis of the root-mean-square error. By using a simple bias correction scheme as done in this study, we can focus on the prediction of the variability of the target variable. The correction might be challenging for variables with specified boundaries, like precipitation, on short time scales. The bias correction is also the main explanation for the increased relative skill of initialized over uninitialized simulations that we see with increasing lead times. Such phenomena can especially be observed when bias correction is applied to data sets and regions with different trends or cycles. While such results are generally counter-intuitive, Düsterhus and Brune (2024) have recently demonstrated that the assumption that an initialized hindcast regresses back to the uninitialized historical simulation is not necessarily applicable, not even over longer lead times. We note that for metrics based on correlation the results can be dominated by single outliers. This is prevented in our approach by the yearly evaluation of the results, which is then aggregated by a sum. In situations of Gaussian or symmetrically distributed variables, it can be assumed that the analysis will reach similar results to an ACC analysis on seasonal mean values (see Figures S1A-S3 in Supporting Information S1). Nevertheless, we also find areas with considerable differences; for example, over Scandinavia the difference in prediction skill between initialized decadal predictions and uninitialized historical simulations based on the 1D-IQD (Figures 2g-2i) is generally higher than the difference in skill based on the ACC (Figures S1d-S1f in Supporting Information S1). We highlight that this is an effect not only of the non-normality of the temperature distributions, but also of the different skill measures. Furthermore, we have analyzed the skill at the grid point level. It is therefore important to state that statistical significance can only be a first step in the determination of predictability, while subsequent lines of argument have to be drawn by physical reasoning.
We apply the integrated quadratic distance to investigate the difference between two distributions, which evaluates how much probability needs to be shifted to transform one distribution into another. This approach allows us to focus not only on an ensemble mean, but on all ensemble members. It is important to take into account that the assimilation simulation we use as a reference has, by design, a smaller variability than the other simulations, even when we use the same number of ensemble members. Nevertheless, it is still a valid approach, as the integrated quadratic distance investigates not each bin separately, but the transfer of probability (also known as the binning problem, Düsterhus and Hense (2012)). This is also an important advantage of the IQD compared with other metrics that compare distributions with each other. One major difference to the correlation-based metrics is our evaluation procedure, which gives the overall result for a time series as the number of three-year periods in which one simulation is better than the other. As a consequence we do not receive a continuous result, but a categorical one, posing a challenge to existing controls for false alarm rates, such as Wilks (2016). Having tested our results with this test, we conclude that the outcomes can hardly be interpreted the same way as correlation-based metrics.
Daily mean 2 m air temperature, shown here as a target variable, is only a first step toward a better understanding of how well simulations, be it initialized decadal predictions or uninitialized historical simulations, are able to predict the full distribution of the climate system. For stakeholders, absolute temperature distributions on time scales shorter than a daily mean might be of interest, as currently provided for minimum or maximum temperature. Other variables could be of interest for stakeholders such as the renewable energy sector, especially wind. The full advantage of this approach can be expected for generally non-normally distributed variables. The advantage of this non-parametric approach is that it can be applied to most variables predicted by the climate model and is only limited by the capacity of the evaluating computer system and the ability to properly bias-correct the simulations.
Supporting Information may be found in the online version of this article.

Figure 1. Global comparison between initialized decadal prediction and uninitialized historical simulation applying the 1D-continuous-IQD score on daily 2 m temperature distributions in the boreal summer. Analyzed are the years between 1971 and 2017 over three different lead times (lead years 2-4 (1st column), 5-7 (2nd column), and 8-10 (3rd column)); comparisons between hindcast and climatology (1st row), historical and climatology (2nd row), and hindcast and historical simulation (3rd row) are shown. The results show the number of years in which the score of the distribution of the daily mean temperature anomaly was better compared to the assimilation simulation in the first two rows, and, in the last row, in which the initialized decadal prediction system was better than the uninitialized historical simulations. Black dots indicate significance at the 5% significance level, estimated by comparison to a random walk process.

Figure 2. Comparison between decadal prediction and historical simulation over the North Atlantic sector applying the 1D-continuous-IQD score on daily 2 m temperature distributions in the boreal summer. Analyzed are the years between 1971 and 2017 over three different lead times (lead years 2-4 (1st column), 5-7 (2nd column), and 8-10 (3rd column)); comparisons between hindcast and climatology (1st row), historical and climatology (2nd row), and hindcast and historical simulation (3rd row) are shown. The results show the number of years in which the score of the distribution of the daily mean temperature anomaly compared to the observations was better in the decadal prediction system than in the historical simulation. Black dots indicate significance at the 5% significance level, estimated by comparison to a random walk process.

Figure 3. Comparison between decadal prediction and historical simulation over the North Atlantic sector applying the 1D-continuous-IQD score on daily 2 m temperature distributions for three seasons (MAM (1st row), SON (2nd row), DJF (3rd row)). Analyzed are the years between 1971 and 2017 over three different lead times (lead years 2-4 (1st column), 5-7 (2nd column), and 8-10 (3rd column)). The results show the number of years in which the score of the distribution of the daily mean temperature anomaly compared to the assimilation was better in the initialized decadal prediction system than in the uninitialized historical simulation. Black dots indicate significance at the 5% significance level, estimated by comparison to a random walk process.

Figure 4. Comparison of the distributions of daily temperature data in boreal summer for a grid point in the North Atlantic (20°W; 40°N). Shown are, for each year between 1971 and 2017, the distributions of the hindcast (blue) with lead times 2-4 years, the historical simulation (black), the climatology (cyan), and the assimilation reference (red). Dots at the top demonstrate the ranking (top is higher) of hindcast (blue), historical (black), or climatology (cyan) for each year applying the 1D-continuous-IQD score. The upper left legend gives the statistics of the analysis.