A Data Set for Intercomparing the Transient Behavior of Dynamical Model‐Based Subseasonal to Decadal Climate Predictions

Climate predictions using coupled models in different time scales, from intraseasonal to decadal, are usually affected by initial shocks, drifts, and biases, which reduce the prediction skill. These arise from inconsistencies between different components of the coupled models and from the tendency of the model state to evolve from the prescribed initial conditions toward its own climatology over the course of the prediction. Aiming to provide tools and further insight into the mechanisms responsible for initial shocks, drifts, and biases, this paper presents a novel data set developed within the Long Range Forecast Transient Intercomparison Project, LRFTIP. This data set has been constructed by averaging hindcasts over available prediction years and ensemble members to form a hindcast climatology, that is a function of spatial variables and lead time, and thus results in a useful tool for characterizing and assessing the evolution of errors as well as the physical mechanisms responsible for them. A discussion on such errors at the different time scales is provided along with plausible ways forward in the field of climate predictions.


10.1029/2021MS002570
2 of 23 interactions, are generally affected by different types of errors and biases (e.g., Hawkins et al., 2014;Meehl et al., 2014;Mulholland et al., 2015), which may severely reduce predictive skills (Doblas-Reyes et al., 2011;Kim et al., 2017;Shukla et al., 2018;Smith et al., 2013;Yeager et al., 2018). These errors are commonly described as initial shock and model drift. The first term refers to the fast adjustment of the model immediately after initialization caused by inconsistencies between different components of the coupled model (Balmaseda & Anderson, 2009;Mulholland et al., 2015), whereas the model drift defines the tendency of the model state to evolve from the prescribed initial conditions toward the model's own climatology over the course of the prediction (Mulholland et al., 2015), which differs from that of observations due to biases arising from systematic model errors (Zadra et al., 2018). Conceptually, shocks have been viewed as including initial imbalances that excite rapidly varying degrees of freedom such as gravity waves on the fast manifold, whereas drift represents balanced evolution on the slow manifold (Ballabrera-Poy et al., 2009;Cotter, 2013;Puri, 1981). As yet, however, there are no widely accepted definitions or diagnostics that can precisely identify and distinguish shocks from drifts in complex climate models. Therefore, these phenomena are here collectively referred to as transient behaviors of the initialized prediction.
The purpose of this paper is to describe a community resource developed to enable intercomparative studies across a range of prediction systems of transient behavior, which bears on several crucial aspects of climate prediction. One such aspect is that systematic errors associated with model transients must be corrected in order to reduce errors in predictions. These corrections are applied a posteriori to the raw outputs from the models and range from a simple correction of the mean error to more sophisticated corrections that take into account other properties of the distributions or the joint distribution of variables so as to maintain physical consistency. Bias reduction is typically accomplished by using retrospective forecasts or "hindcasts" (e.g., Arribas et al., 2011;Choudhury et al., 2017;García-Serrano & Doblas-Reyes, 2012;Saha et al., 2014) to evaluate systematic errors as a function of the lead time of the prediction, which are robustly estimated for typical ensemble sizes (Manzanas, 2020). Such ensemble hindcasts provide the basis for the data set described here.
Another important aspect of transient behavior in initialized predictions relates to strategies for avoiding it. The method used to initialize dynamical forecasts is clearly important, with the full-field approach usually leading to larger shocks and drifts than anomaly initialization (e.g., Hazeleger et al., 2013;Kim et al., 2012;Smith et al., 2013;Volpi et al., 2017). Additional initialization approaches have been developed that aim to minimize transients. These include reduced-dimension implementations of four-dimensional variational data assimilation (He et al., 2017(He et al., , 2020 and anomaly initialization (Polkova et al., 2019), anomaly initialization taking into account biased model responses to external forcings (Chikamoto et al., 2019) and various formulations of coupled data assimilation (e.g., Mulholland et al., 2015). An alternative approach is to adjust the forecast model runs by introducing additive bias correction terms in the model equations, either as two-dimensional flux corrections at the atmosphere-ocean interface (Magnusson et al., 2013;Polkova et al., 2014), or as three-dimensional corrections derived from analysis increments for atmospheric state variables .
Analysis of the physical mechanisms associated with model transients is motivated by their potential to provide insights for improving initialization procedures, as well as clues to the origins of model systematic errors and hence guidance for reducing them. Such analyses have focused on a range of time scales, processes, and locations. The Transpose-AMIP II experiment examined the development of systematic errors in the atmospheric components of several climate models on weather prediction time scales out to five days. This type of approach has been used to elucidate the development of shortwave radiation biases over the Southern Ocean (Williams et al., 2013), marine stratocumulus clouds (Brient et al., 2019), and the Indian Summer Monsoon (Keane et al., 2019). On subseasonal time scales, Shonk et al. (2018)  At longer seasonal to decadal time scales, studies have largely focused on the development of sea surface temperature (SST) biases caused by model misrepresentations of the Atlantic Meridional Overturning Circulation and other ocean circulations (Huang et al., 2015;Wang et al., 2015), oceanic heat uptake (e.g., Hawkins et al., 2014) (Sánchez-Gómez et al., 2016), and combinations of atmospheric and oceanic processes (Exarchou et al., 2018;Siongco et al., 2020;Vannière et al., 2013). Once developed, the maintenance of SST biases is often attributable to ocean-atmosphere coupling (e.g., Toniazzo & Woolnough, 2014).
Additional studies have related biases that develop following initialization to those in long-term simulations unconstrained by initialization (Hermanson et al., 2018;Ma et al., 2014Ma et al., , 2020 and inconsistencies between sea ice and ocean initial conditions (Cruz-García et al., 2021).
A further approach to identifying sources of model error leading to transient behavior and long-term biases is to assess systematic tendencies in model equations integrated from observationally constrained states. This can be accomplished either by averaging initial tendencies near the start of a set of shortterm forecasts (Rodwell & Palmer, 2007) or by estimating systematic tendencies from averages of analysis increments when data assimilation is applied (Batté & Doblas-Reyes, 2015;Bhargava et al., 2018;Crawford et al., 2020). Because models are not allowed to evolve far from observational states, these methods primarily highlight errors attributable locally to model parameterizations of "fast physics." Such an approach can be extended to diagnose nonlocal, longer time scale influences of model error by applying tendency bias corrections as forcing terms in specified regions. For example, Schubert et al. (2019) used this method to quantify the relative roles of remote (teleconnection-driven) versus local influences on the total bias.
Most efforts thus far to describe, diagnose, and reduce systematic errors leading to transient behavior in subseasonal to decadal predictions have focused on individual prediction systems and have thus generated conclusions that, while informative, are specific to the models considered. By contrast, major efforts to intercompare the behavior of climate simulations  and predictions (Tompkins et al., 2017) across diverse sets of models have led to broader perspectives. In a similar spirit, this paper introduces a recently developed data set that aims to facilitate multi-model intercomparisons of transient behavior across numerous subseasonal, seasonal, and decadal prediction systems, leading to a better understanding of the transient behavior of forecast models and the mechanisms involved in the origin and growth of such errors. An overview of this data set is provided in Section 2, and several intercomparisons of transient behavior in subseasonal, seasonal, and decadal prediction models derived from it are presented in Section 3. Section 4 contains concluding remarks and recommendations for enabling further progress in characterizing and understanding physical mechanisms of transient behavior in climate predictions.

The Long-Range Forecast Transient Intercomparison Project Data Set
Aiming at increasing and improving our knowledge of the physical processes related to the origin and growth of initial transient behavior (i.e., shocks and drifts, leading ultimately to biases), the World Climate Research Programme (WCRP)'s Working Group on Subseasonal to Interdecadal Prediction (WGSIP) has led the Long-Range Forecast Transient Intercomparison Project, LRFTIP. Its purpose is to facilitate multi-model intercomparison studies of the transient behavior of coupled long-range forecast models evolving from observation-based initial conditions, the term "long-range forecast" standing for predictions from 30 days up to years according to the World Meteorological Organization definition. The main goals of LRFTIP can be summarized as follows: 1. Developing a publicly available multi-model online archive of hindcast climatologies and related diagnostics from systems contributing to projects such as the Subseasonal to Seasonal Prediction Project (S2S; Vitart et al., 2012), the Climate-system Historical Forecasting Project (CHFP; Tompkins et al., 2017), and the Decadal Climate Prediction Project (DCPP; Boer et al., 2016). 2. Consolidating a standard set of diagnostics in order to consistently characterize transient behavior in initialized prediction models. 3. Addressing scientific questions, including the influence of different initialization methods on transient behavior of climate model components and the identification of any impacts on forecast quality. 4. Including hindcast initialization experiments, using the same model but different initialization methods, in contributions to items 1-3.
LRFTIP has developed a data set that is expected to be useful to the scientific community in order to address such goals. It encompasses a large set of subseasonal to decadal hindcasts that have been processed in the following manner: For a particular model and start date, hindcasts are averaged over available years and ensemble members to form a hindcast climatology that is a function of spatial variables and lead time. When available, climatologies for the same model are also constructed for the free-running model (ideally, from CMIP historical simulations, averaging over multiple ensemble members) and for hindcast initial conditions represented by assimilating model runs, analyses used for initialization or other observation-based reference states. The objective of the data set is to facilitate studies aimed at better characterizing and intercomparing systematic transient evolution away from the initial conditions of a prediction model toward the model's own climate. Notwithstanding, drifts are typically nonstationary, particularly at longer time scales, meaning that they could be shaped by the prevailing conditions during initialization as has been shown for the Atlantic Meridional Overturning Circulation (Bilbao et al., 2021) or the Niño 3.4 index (Teng et al., 2017). This problem has been illustrated for globally averaged temperature for the Decadal Prediction Large Ensemble (DPLE) with the Community Earth System Model version 1 (CESM1) whereby the initialized hindcasts drift toward the uninitialized model climate, but drifts early in the hindcast period are larger than later in the period (Meehl et al., 2021, Figure 4). Thus, when removing bias by subtracting the drifted model climatology from individual drifted hindcasts, errors will be introduced particularly for hindcasts early in the period and late in the hindcast period. This poses a challenge for interpreting hindcast skill measures, and studies on interannual or decadal time scales making use of the data set described here might need to take this issue into consideration.
The LRFTIP data archive covers three different time scales: subseasonal, with daily data out to 30/60 days; seasonal, with daily data out to 30/60 days plus monthly means through the forecast range (3-6 months); and decadal, which includes daily, monthly, and annual means, as available. Files within the data set are CF-compliant NetCDF4 and variable naming follows the CMIP/ESGF conventions. The archive can be accessed at http://crd-data-donnees-rdc.ec.gc.ca/CCCMA/products/LRFTIP/.
As of July 2021, the data set includes hindcast climatologies of 6 subseasonal forecast models from the S2S project, data archives from 21 seasonal forecast models from CHFP and ENSEMBLES (Doblas-Reyes et al., 2010), and output from 16 decadal model predictions from CMIP5, CMIP6, and ENSEMBLES. Detailed information on the model names, numbers of atmospheric, oceanic, land and sea ice variables, and temporal frequencies available can be found in Tables 1-3 for subseasonal, seasonal, and decadal hindcasts, respectively.
A wide range of scientific questions can be addressed with the LRFTIP data set. These potentially involve characterizing and better understanding the sources of initial shock, describing the evolution of drifts toward the preferred states of the models and assessing the relative roles of the oceans and the atmosphere in the maintenance of the long-term biases. Examples of such topics are presented in the next section following a discussion of different methods that may be used to quantify transient behavior. These topics include an analysis of initial shocks in subseasonal predictions of total cloud cover (TCC), SST, and precipitation (PR), followed by drifts in seasonal predictions of SST and PR, and finishing with the evolution of SST errors in decadal forecasts.
Note. The number of variables in the reference or analysis data set (in the hindcast climatology) are indicated in first place (second place).
The number of variables in the reference or analysis data set (in the seasonal hindcast climatology) are indicated in first place (second place). Note that ENSEMBLES is not a model itself but a set of 5 models. a Daily values also available. b Daily values also available for hindcast climatology only. c 5 models as detailed in Doblas-Reyes et al. (2010).

Quantifying Transient Behavior
The purpose of the LRFTIP data set is to enable quantification and intercomparisons of transient behavior in initialized climate predictions. However, there is no unique way to characterize such transient evolution. This nonuniqueness is illustrated in Figure 1a, which schematically depicts model transient behavior for a particular variable as a function of lead time t L . At the start of the prediction, the model climatology (MOD) coincides with that of the initial conditions (IC), which may depart from observed climatology (OBS) if anomaly initialization is used, or if initial states are otherwise not strongly constrained by observations. MOD subsequently evolves away from IC, and for large t L asymptotically approaches the climatology of the free-running model represented here by that of uninitialized historical simulations (HIST) subject to the same external forcings as the predictions. It should be noted that for simplicity, this diagram omits complicating factors such as time dependences of climatologies other than MOD, the possibility that MOD does not monotonically approach HIST (e.g., Hermanson et al., 2018) and so forth.
For scalar quantities such as global mean temperature, it may be straightforward to depict transient evolution in plots such as Figure 1a. However, for spatial fields it is more convenient to depict transient evolution as maps representing differences from some reference climatology. Which differences are shown may depend on what aspect of transient behavior the analysis is attempting to capture, and what reference data are available. The standard practice of characterizing model errors in terms of climatological differences from observations (dependent in this case on t L ) may be appropriate if IC is close to OBS and the necessary observational data are available, noting OBS should be constructed using the same temporal sampling as MOD. We label such MOD minus OBS differences, that is, the evolving model bias, as DIFF-OBS ( Figure 1b CanCM4 25-16-6 25-26-17 13-13-13 4-3-2 2-2-2 CanESM5 21-21-21 26-26-26 11-11-11 Note. The number of variables in the reference or analysis data set, in the decadal hindcast climatology, and in the historical run climatology are indicated in first, second, and third place, respectively. If IC is not close to OBS or OBS is not available, then it may be desirable to consider the evolving departure from initial conditions, that is, MOD minus IC, labeled as DIFF-IC in Figure 1c. However, this requires the availability of IC, which is not always the case for hindcast data sets even though some, such as the DCPP CMIP6 archive, do request data reflecting model initial states (Boer et al., 2016). An alternative way to quantify transient evolution particularly if OBS and/or IC are not available is to consider differences in MOD at successive values of t L (Figure 1d). Although this method has the limitation that such differences cannot be separated from changes rooted in the seasonal cycle, it can be useful for examining large transients during the first few days of hindcasts when daily or higher-frequency data are available and is labeled here as DELTA.

Table 3 List of Model Names and Number of Atmospheric (ATM), Oceanic (OCE), Land (LND) and Sea-Ice (SIC) Variables With Daily (dly), Monthly (mly) and Yearly (yly) Data From Decadal Hindcasts
Alternatively, when long temporal ranges are considered as for decadal predictions, the approach to asymptotic model climatology, that is, MOD minus HIST, may be of greater interest than the departure from OBS or IC. This measure, labeled DIFF-HIST in Figure 1e, requires the availability of ensembles of uninitialized simulations that are congruent with the initialized hindcasts, as is the case for some CMIP5 and CMIP6 decadal prediction systems. In instances where these simulations are unavailable or such analysis is otherwise inconvenient, differencing from the MOD lead-time average (LTA) over the entire hindcast range may be approximately representative of convergence to HIST if maximum lead time is sufficiently long that LTA is much closer to HIST than to IC. This approach is labeled DIFF-LTA and is represented in Figure 1f.
Each diagnostic illustrated in Figures 1b-1f can be constructed in two ways. The first, referred to here as Method 1, is to consider t L increasing from a common initial time t i , often the first day of a particular calendar month or sometimes a fixed set of start dates (e.g., Mulholland et al., 2016). In this case, the target interval t T →t T + Δt T where Δt T is typically one week, month, or season evolves forward with t L according to t T = t i + t L . Although this approach is common, one disadvantage is that because t T changes with t L , simply tracking the evolution of MOD with t L as with the DELTA approach in Figure 1d does not separate transient evolution from the climatological seasonal cycle. As a result, the DELTA approach is only useful when initialization transients greatly exceed seasonal changes as may occur for very short lead times of a few days or if a particular month or season is considered in consecutive years of a multi-year prediction. In the second way of constructing these diagnostics, referred to here as Method 2, the target month (or other averaging period) is held fixed and t L is varied by adjusting t i . For example, t L can be advanced N months by moving the initialization date N months earlier, as in Figure 1 of Manzanas (2020). An advantage is that transient behavior as a function of t L does not have to be separated from the climatological seasonal cycle. However, a disadvantage is that to examine drift at intervals of Δt L requires hindcasts that are initialized at the same interval. This precludes, for example, the consideration of monthly Δt L in decadal hindcasts initialized once per year or daily Δt L in seasonal hindcasts initialized once per month.
The examples described below sample some of the measures of transient behavior depicted in Figure 1, in all cases using Method 1.

Initial Transients in Subseasonal Predictions of TCC, SST and PR
We first focus on the origin of initial shocks in subseasonal hindcasts of TCC from the S2S project. The analysis of this variable is of particular interest in this context, given that it has a large dependence on model physics. The ERA-Interim reanalysis (Dee et al., 2011) is used as a proxy of observations against which the predictions are compared as a function of lead time. Subseasonal hindcasts within the LRFTIP data set span prediction ranges of 30-60 days. Figure 2 shows the zonal mean TCC in hindcasts initialized in May for the first 7 forecast days. Note that ERA-Interim shows very little day-to-day variation in the zonal-mean TCC, so only day 1 is shown for reference. A noticeable feature from this figure is the difference in Météo-France-S2S and UKMO-S2S (Figures 2d and 2f) at day 1 of the prediction compared to the following days, which is mostly absent in the other models. In the case of UKMO-S2S, this singularity at the beginning of the prediction quickly vanishes at day 2 and little further adjustment occurs after that lead time. Likewise, Météo-France-S2S depicts a clear difference between day 1 and the following days in TCC over the Southern Hemisphere, but over the Northern Hemisphere (particularly north of 45°N) the sudden jump in zonal mean TCC from day 1 to day 2 is followed by a gradual trend back toward the day 1 pattern afterward. These differences arise from IC 7 of 23 being ERA-Interim in the case of UKMO-S2S and Météo-France-S2S rather than from their own analyses. In addition, ECCC-S2S and JMA-S2S show TCC that is much lower than the ERA-Interim at day 1 at high northern and southern latitudes, but with little evolution between day 1 and day 7, except at high northern latitudes in JMA-S2S. These models are both uncoupled and have physical formulations that are motivated primarily by accurately predicting atmospheric circulation out to two weeks or so rather than simulating a stable climate.
The zonal-mean patterns in ECMWF-S2S ( Figure 2b) and NCEP-S2S (Figure 2e; excluding the North Pole) are very similar to those derived from the ERA-Interim, even though larger differences are found in the other forecast systems. Still, the initialization shock is largely gone by day 2 in all the models except for the mid-to high-latitudes of the Northern Hemisphere where some continued evolution is evident. Differences between successive days of the subseasonal predictions, that is, the DELTA metric depicted in Figure 1d, are shown in Figure 3. In line with the previous results, differences between days 1 and 2 in Météo-France-S2S and UKMO-S2S are much larger than in the other forecast systems, with both models exhibiting an increase of about 10%-20% in mean TCC (Figures 3d, 3f and 3g). The evolution of day i+1 minus day i differences in global-mean (Figure 3g) as well as 60°N-90°N averaged (Figure 3h) TCC largely ceases after day 2, once again highlighting the initial shock effect on the predictions. Regarding the models that do not display such a clear initialization shock, it is worth noting that there are indeed differences in TCC between the first and second days of the predictions but such differences are not uniform over the globe (Figures 3a-3c, and 3e). ECMWF-S2S is the only model displaying a systematic underestimation of TCC over the Tropical Pacific in day 1 of the prediction (Figure 3b), followed by a drop toward a more stable condition that lasts for the rest of the forecast (Figure 2b). (a-f) Differences between days 1 and 2 (day 2 minus day 1) in subseasonal predictions of total cloud cover (TCC) initialized in May. (g-h) Evolution of the absolute difference in global mean TCC for each forecast system (colors) between two subsequent days, from day 1 to day 10, for (g) the global mean and (h) the region between 60°N and 90°N. Values in %. Initial transient evolution of SST in key regions where models tend to exhibit large SST biases is illustrated for the ECMWF-and NCEP-S2S models in Figure 4, where the Optimal Interpolation Sea Surface Temperature version 2 data set (OISSTv2; Reynolds et al., 2002) is employed as the observational reference to compute DIFF-OBS (Although only two models are considered here due to space limitations, the LRFTIP data set enables this analysis easily to be extended to the additional S2S models in Table 1). Biases on days 1 and 21 of hindcasts initialized near the start of November are shown for the North Atlantic region in panels a-d and for the central and eastern tropical Pacific in panels e-h. In all cases, biases on day 1 tend to be relatively small as should be expected when SST is initialized realistically, given that SST adjusts considerably less rapidly than TCC. However, even on day 1 the biases, though modest, bear some imprints of the more highly developed biases on day 21, for example, off the US east coast and in the tropical Pacific. Note that the cold SST bias in ECMWF-S2S over the equatorial Pacific, which is similar in May to that shown for November in Figure 4e could explain its negative drift in TCC between day 1 and day 2 for May initialization (Figure 3b). By day 21, particularly large biases have appeared off the northeastern US coast, where many models exhibit unrealistically warm ocean temperatures due to failing to simulate the separation of the Gulf Stream from the coast at Cape Hatteras (e.g., Ezer, 2016). The development of this bias as a function of forecast time in hindcasts initialized near the start of November and May is illustrated in Figure 5a, showing daily values of DIFF-OBS averaged in the Mid-Atlantic Bight region (box in Figure 4a), which is especially affected by the misrepresentation of the Gulf Stream separation. The warm SST bias in this region is seen to develop more rapidly in the ECMWF hindcasts, having substantially reached its eventual magnitude at ≈2°C already by day 10 for hindcasts initialized in May. By contrast, the bias continues to develop more gradually in November-initialized NCEP hindcasts until at least day 41, when it has become comparable to the ECMWF bias at ≈3°C. The NCEP bias also shows greater seasonality, reaching only ≈1°C by day 41 in hindcasts initialized in May, whereas the May-initialized ECMWF bias at day 41 approaches that from November initialization. In the southeast tropical Pacific region indicated in Figure 4e, the large warm bias in the November-initialized NCEP hindcasts develops steadily throughout the first 41 days, whereas the bias is considerably weaker in May-initialized NCEP and ECMWF in both seasons (Figure 5b). Pervasive warm SST biases in this and other eastern tropical ocean regions have been ascribed to the underrepresentation of stratocumulus clouds and coastal upwelling in such regions (e.g., Richter, 2015).
Another example of transient behavior is demonstrated with daily PR over the North Atlantic (defined here as 70°W-10°E, 45°N-70°N) for subseasonal hindcasts initialized near the start of November. Figure 6 presents deviations from GPCPdd v1.3 daily observation data (a part of GPCP2.3 project; see Adler et al., 2018) for area-averaged PR in ECMWF-S2S, NCEP-S2S, and UKMO-S2S for the first 7 forecast days. The observed values for this region during this period vary between 3.7 and 4.4 mm day −1 , although it can be seen that the three models depict different behaviors. NCEP-S2S shows a positive bias at day 1 while UKMO-S2S shows the opposite behavior, and ECMWF-S2S yields no bias. The negative bias in UKMO-S2S at day 1 is consistent with the similar deficit of TCC in this forecast system, suggesting that the first-day biases in these two variables could be connected. Also worth noting is that the ECMWF-S2S system did not apply coupled atmosphere-ocean sea-ice data assimilation at the time of the LFRTIP data set collection, which could have an impact on the initial shock in the atmospheric model (e.g., Browne et al., 2019). Beyond day 3 of the predictions, it is interesting to note that biases in the forecast systems tend to follow a similar evolution pattern.

Bias and Drift in Seasonal Predictions of PR and SST
Seasonal predictions included in the LRFTIP data set extend up to 12 months into the future and have been initialized at least two times per year, in May and in November. These months are close to the beginning of summer in the Northern and Southern hemispheres, respectively, and near the onset of the rainy season of monsoon circulations over several land areas of each hemisphere. Indeed, seasonal forecasts started in May are commonly used to predict the behavior of the Southeast Asia or the Indian monsoons, while the start date in November can provide useful information on the likely evolution of the South American monsoon or on the strength and location of the South Pacific Convergence Zone. The following analysis on the evolution of the biases in seasonal predictions of PR from 18 seasonal forecast systems in the CHFP archive is tackled by (at least) two target issues: the drift of the models toward their own climate and the season-dependent evolution of biases related to problems with the representation of the monsoon circulation and associated rainfall. Figure 7 displays the difference (prediction minus observational reference, i.e., DIFF-OBS as in Figure 1b) in seasonal hindcasts of PR for month 1 of the prediction initialized in May, while a similar analysis for month 3 of the prediction (July) is shown in Figure 8. The reference data set here is GPCPv2.3. The first outcome from the comparison of both figures is the growth of the differences between prediction months 1 and 3, which is more noticeable over the Northern Hemisphere, following the intraseasonal migration of the Intertropical Convergence Zone (ITCZ), suggesting that systematic problems exist regarding the simulation of its location and/or strength. Some models also show large differences over areas affected by monsoon rainfall such as JMA-MRI-CGCM2, which shows a strong negative bias over India in July (Figure 8l). Overall, the smallest mean absolute errors within the latitudinal belt 50°N-50°S at months 1 and 3 of the predictions are found for the two versions of the ECMWF and JMA models, with UKMO-DePreSys also performing well at month 3 (Figures 7g, 7h, 7k and 7l and Figures 8g, 8h, 8k 8l, and 8q). A similar analysis for predictions initialized in November (Figures 9 and 10) shows that the largest differences are now displaced to the Southern Hemisphere, with most models attaining maxima in differences over the Maritime Continent, parts of the Indian Ocean, and over central and southeastern South America at month 3 of the predictions (i.e., January). There is a tendency for most models to overestimate PR over the tropical Indian Ocean and far western Pacific, whereas there is less consistency in the sign of the biases in the tropical Atlantic and central-eastern Pacific, in agreement with the analysis by Scaife et al. (2019) of PR forecasts in these regions from a similar set of models. The overall smallest differences are found for the same models as for May initialization, highlighting their capability to simulate the seasonal climatology of PR (i.e., the ITCZ). Open questions, which could be explored in the future include whether the lead time-dependent difference in rainfall in the models is explained by the parameterization schemes used to simulate cloud microphysics and PR or by the differences in the dynamics. Figure 11 shows the time evolution of the mean absolute error averaged over 50°N-50°S in PR from the different hindcast data sets for predictions initialized in May (Figures 11a) and November (Figures 11b).
Beyond the point related to the smallest differences occurring in the ECMWF and JMA forecast systems as already discussed above (and which can also be seen here), most of the models depict a fairly stable pattern of differences from the first until the third prediction month. However, some of them display particular 10.1029/2021MS002570 13 of 23 patterns of variation: for instance, GloSea5 shows an increase in PR difference from month 1 to month 2, followed by a decrease in month 3, in hindcasts initialized in both May and November. Some others, such as L38GloSea4, show a maximum difference in month 1 of the November hindcasts, followed by declines in months 2 and 3. This could be related to large errors at the beginning of the simulation caused by initialization shock followed by a gradual drift toward smaller errors, but further analysis is needed to confirm that.
A similar analysis focused on SST predictions from 5 CHFP models is shown in Figures 12 and 13 for hindcasts initialized in May and November, respectively, and considering prediction months 1, 3, and 5. OISSTv2 is used as the reference data set in computing DIFF-OBS. As for PR, JMA-MRI-CGCM1 shows relatively small biases throughout the prediction, with little dependence on lead time. By contrast, other models are affected by substantial errors at each lead time (MIROC5), a decrease and reversal of sign of the biases after month 1 (JMA-MRI-CGCM2), or a notable increase in magnitude at longer lead times (CanCM3 and Can-CM4). Comparisons between hindcasts initialized in May and in November show the seasonal dependency of biases in some forecast systems, as is the case of JMA-MRI-CGCM2. These results motivate the analysis of the physical mechanisms of these bias patterns in order to disentangle their causes.

Evolution of Transients in Decadal Prediction of SST
The LRFTIP data set also allows for multi-model intercomparison of transient behavior on multi-annual time scales. Figure 14 shows Hovmoller diagrams of the difference (hindcast minus historical, i.e., DIFF-HIST in Figure 1e) in monthly mean SST over the Equatorial Pacific (defined here as 120°E−90°W, 5°S-5°N) in various decadal forecast systems. In this case, the reference data set consists of CMIP historical (free-running) simulations for the same models, that is, without initialization. The differences thus represent departures from the free-running climatology of the models, which one may expect the hindcasts to evolve toward over the 10-year range of the forecasts. The Hovmoller diagram becomes useful for the type of analysis pursued in this article, since it can synthesize information on the fast response of the model at the beginning of the prediction (first few months) as well as any zonal propagation and more gradually emerging transients.
Among the models considered, the most persistent difference occurs in BCC-CSM1.1 (Figure 14a), which exhibits positive (warm) differences relative to the free-running climatology between 180°W and 90°W from immediately after initialization until the end of the forecast range. The cause of this persistent difference is not obvious, considering that the BCC-CSM1.1 decadal predictions and historical simulations were subject to the same external forcing (Xin et al., 2019). In the cases of CanCM4 and CCSM4 (Figures 14b-14d), decadal hindcasts for CMIP5 were performed using two different initialization methods, labeled i1 and i2 (Karspeck et al., 2015;Merryfield et al., 2013). Although specifics of the initialization methods differ, for both models method i1 assimilates subsurface ocean data, whereas i2 applies surface ocean forcings only with no subsurface data constraints. With method i1 both models start from a condition very close to the reference data set (i.e., small differences in the first months of the predictions) but then evolve toward a temporary cold difference from year 2 to 4 in CanCM4_i1 and from year 1 to 3 in CCSM4_i1 (Figures 14b  and 14d). The onset of both patterns apparently results from the emergence of cold biases that are present in sub-surface ocean temperatures at the start of the prediction (not shown). With method i2, CanCM4_i2 exhibits general warm differences (Figure 14c) somewhat like BCC-CSM1.1, whereas CCSM4_i2 shows warm differences in years 1 and 2 and cold differences in years 3 and 4 (Figure 14e), which have been attributed to spurious trends in the NCEP/NCAR reanalysis surface winds used to initialize the ocean component (Teng et al., 2017). CNRM-CM5 (Figure 14f) is affected by positive differences in the first year of the simulation that fade away around forecast year 3. Regarding the other models, EC-Earth and HadCM3 (Figures 14f  and 14h) both show hints of zonal propagation of the errors from the coasts of South America westward in the first three years following initialization. The Hovmoller diagram suggests that these errors take around 12 months to cross the basin. After the third year, the differences in these models largely subside, although HadCM3 shows signs of warm differences emerging again east of 180° after year 8.
Having hindcast climatologies from a particular model initialized in multiple ways present opportunities for exploring the physical nature of transient behavior such as that for CanCM4 (Figures 14b and 14c), as an example. Figure 15 shows CanCM4 i1-i2 differences in SST and depth of the 20°C isotherm (which provides an indication of the depth of the equatorial Pacific thermocline). CanCM4 hindcasts using method i1 (which assimilates subsurface ocean temperatures) are slightly warmer than those using method i2 (no Figure 11. Evolution of mean absolute error (MAE) of seasonal predictions of monthly mean precipitation against the GPCPv2.3 reference data set, averaged over the 50°N-50°S region from month 1 to month 3 of the predictions. Units are mm month −1 . subsurface assimilation) in the first year ( Figure 15a). However, at the end of the second year, a strong cold difference suddenly emerges (seen less distinctly in Figure 14b), which is followed by similar but gradually diminishing behavior in the next four years. Clues to the origins of these differences in transient behavior are provided by a comparable plot of differences in the 20°C-isotherm depth (Figure 15b). Late in the second year, a sudden upward, eastward propagating displacement of the thermocline appears, followed by similar but progressively weakening displacements in the next four years. These anomalous thermocline displacements resemble equatorial upwelling Kelvin waves, which propagate eastward across the Pacific in two to three months and are associated with cooling in the eastern equatorial Pacific and the onset of La Niña episodes (Sarachik & Cane, 2010). Determining the origin of these waves, which can be excited by either convergent surface wind stress or ocean dynamical imbalances as might exist in the initial conditions, requires further analysis.

Concluding Remarks and Ways Forward
The development of a data set comprising hindcast climatologies derived from subseasonal, seasonal, and decadal predictions through the LRFTIP project provides an opportunity for the climate prediction research community to systematically characterize and better understand the physical mechanisms associated with transient behavior in initialized forecast systems. Although drawn from existing hindcast sources, the LRFTIP data set removes the necessity of accessing all ensemble members for all hindcast years in order to compute the lead time-dependent climatology for a given variable, and thus reduces associated data volumes by orders of magnitude. In addition to providing precomputed hindcast climatologies for several variables from many models for time scales ranging from daily to decadal, the data set in most instances provides one or more reference data sets consisting of similarly gridded observations, initial conditions, and/ or uninitialized historical simulations subject to the same temporal averaging as the hindcasts enabling the rapid calculation of metrics such as those illustrated in Figure 1. This article aimed to describe and show some examples of the usage of the LRFTIP data set as an illustration of the types of analyses enabled by it. Any major progress in this field could prove extremely useful for both model developers (through providing guidance for improving model parameterizations and numerics) and to developers of climate services responsible for providing trustworthy climate prediction information across time scales.
In the context of open data policies and following examples such as the S2S, CHFP, and CMIP data sets, the LRFTIP data set aims to support community climate prediction and predictability studies. The usage of this data set is, therefore, strongly encouraged and indeed sought, as are feedbacks to and interactions with the WGSIP group.

Data Availability Statement
The data used in this study is freely available at the following link: http://crd-data-donnees-rdc.ec.gc.ca/ CCCMA/products/LRFTIP/ and is permanently registered as doi:10.18164/5dc6d3ae-8eb4-4292-b181-6a96104e63c5.  Figure 14, but showing i1-i2 differences between CanCM4 decadal predictions, where i1 and i2 refer to two different initialization methods in the CMIP5 archive, for (a) sea surface temperature (SST) and (b) depth of the 20°C isotherm, over the Equatorial Pacific. suggestions helped improve and clarify several aspects of the manuscript, as well as Editor Dr. Stephen Griffies. In addition, Hai Lin and Yuhei Takaya provided helpful comments about the behavior of the ECCC-S2S and JMA-S2S models, respectively. The assistance of Marina Trubina in constructing S2S hindcast climatologies for the LRFTIP data set is also kindly acknowledged.