Assessment of Model Drifts in Seasonal Forecasting: Sensitivity to Ensemble Size and Implications for Bias Correction

Despite its systematic presence in state-of-the-art seasonal forecasts, model drift (leadtime-dependent bias) has seldom been studied to date. To fill this gap, this work analyzes its spatiotemporal distribution and its sensitivity to the ensemble size in temperature and precipitation forecasts. Our results indicate that the model continues to drift well beyond the first month after initialization, leading to significant, highly space- and time-varying drifts over vast regions of the world. Nevertheless, small ensembles (fewer than 10 members) are enough to robustly estimate the mean model drift and its year-to-year fluctuations in skillful regions. In contrast, in regions of low model skill, larger ensembles are required to appropriately characterize this interannual variability, which is often larger than the drift itself. This points to the need for new strategies to deal efficiently with model drift, especially when bias correcting seasonal forecasts, since most of the techniques used to this end rely on the assumption of stationary model errors. We demonstrate here that the use of moving windows can help to remove not only the mean forecast bias but also the unwanted effects arising from the drift, which can lead to important intraseasonal biases if not properly taken into account. The results from this work can help to identify the nature and causes of some of the systematic errors in current coupled models and can have large implications for the wide community of users who need long, continuous, unbiased seasonal forecasts to run their impact models.


Introduction
Seasonal forecasts have an enormous impact on different socioeconomic sectors such as agriculture, tourism, energy, and health (see, e.g., Doblas-Reyes et al., 2013; Hill, 2002, and references therein). Nowadays, these forecasts are routinely produced (and delivered) by a number of World Meteorological Organization-designated Global Producing Centres for Long Range Forecasts (http://www.wmo.int/pages/prog/wcp/wcasp/gpc/gpc.php), based on different state-of-the-art global ocean-atmosphere coupled models. However, due to the important simplifications that must be made when building these models, which can lead to deficient representations of circulation, energy exchanges, etc., seasonal forecasts are known to present important errors at both regional and local scales (see, e.g., Manzanas et al., 2019). In particular, in addition to the systematic mean error or standard bias (mean deviation from observations for a particular target period and location), a second-order bias arises in seasonal forecasting that depends on the leadtime, the time that passes from the moment in which the model is initialized to the start of the target period to be predicted. The latter, known as drift, is a consequence of having initial conditions inconsistent with the model dynamics (Alves et al., 2004; Fernández et al., 2009) and can be defined as the tendency of the model to evolve from the initial (observed) state to its own attractor (see, e.g., Collins et al., 2006; Delworth et al., 2006; Doblas-Reyes et al., 2013; Magnusson et al., 2013). This tendency, which should not be confused with a seasonal climate signal, can lead to important leadtime-varying errors that can considerably affect the quality of the forecasts (Smith et al., 2013; Vannitsem et al., 2018).
In this respect, many previous works have documented substantial errors in key fields such as precipitation when the coupled models are initialized with observed sea surface temperatures (see, e.g., Troccoli et al., 2008). Moreover, the origins and causes of model drift are not obvious, and the drift is highly dependent on the variable and the geographical area analyzed (Bedia et al., 2018).
Despite all this, only a few studies have paid attention to the issue of model drift in the context of seasonal forecasting (see, e.g., Stockdale, 1997), with most previous work focused on decadal predictions and, more specifically, on finding ways to reduce the long-term drift by performing some kind of correction on the initial conditions (see, e.g., Fučkar et al., 2014; Kharin et al., 2012; Sánchez-Gomez et al., 2016; Zhang, 2011; Zhang et al., 2012). At seasonal timescales, Shonk et al. (2018) found a tendency of the simulated Intertropical Convergence Zone to move northward (as compared to observations); they refer to this effect as a spatial drift. More recently, Hermanson et al. (2018) analyzed model biases and drifts at different timescales, including both decadal and seasonal forecasts. Even though that work provides essential knowledge to better understand model drift, it focused only on a few regions (with extratropical latitudes underrepresented) and assessed only average model drifts, without addressing their temporal variability or the importance of the ensemble size for characterizing those drifts. Moreover, it did not propose any strategy for correction. Therefore, to meet the existing need for a complete global diagnosis of model drifts (Hermanson et al., 2018), this work analyzes their spatiotemporal distribution for seasonal forecasts of temperature and precipitation worldwide. This can help to identify specific model deficits and offers the possibility of targeted improvement of the formulation, resolution, and parametrization of certain processes (Ehret et al., 2012).
Nevertheless, it is important to note that identifying and understanding the mechanisms of the physical processes involved in model drift is beyond the scope of the present study; for this particular aspect, the reader is referred to a few relevant previous papers on the topic (see, e.g., Carrassi et al., 2014; Magnusson et al., 2012, 2013). Furthermore, there is an open discussion on the ensemble size required for a proper statistical correction of model errors in seasonal forecasting. Using as many members as possible is usually the preferred option. However, Manzanas et al. (2019) have recently shown that small ensembles are enough to robustly correct the mean model biases. We test here whether this also holds for the case of model drift.
Finally, we explore the suitability of considering moving windows (Bedia et al., 2018) for the application of a standard quantile-mapping technique as a way to minimize the unwanted effects arising from model drifts. Note that most of the state-of-the-art techniques used for bias correction of raw seasonal forecasts (see Manzanas et al., 2019, for a review) rely on the assumption of stationary model errors, and their application in nonstationary circumstances remains unclear (see, e.g., Anderson, 2011). To do this, we focus on the Philippines, where several key sectors could greatly benefit from the use of suitable, unbiased seasonal forecasts.
The paper is organized as follows: the data and the methodology used are described in section 2, results are presented in section 3, and the most important conclusions are given in section 4.

Seasonal Forecasts and Definition of Drift Used
Daily temperature and precipitation from the European Centre for Medium-Range Weather Forecasts (ECMWF) System 4 (Molteni et al., 2011) were considered over the entire globe at their original spatial resolution (0.75°). System 4 is based on the atmospheric model IFS (Cycle 36r4) and the oceanic model NEMO and provides the longest-to-date seasonal hindcast, covering the period 1981-2010, which is fully used here. Whereas IFS is initialized with ERA-Interim (Dee et al., 2011) data, the ocean data assimilation system NEMOVAR is used to initialize NEMO. By producing a set of perturbed initial conditions, an ensemble of 15 members is generated (in particular, five members originate from perturbations of the surface wind in the ocean initial conditions, whereas the other 10 members originate from sea surface temperature perturbations and stochastic physics) on the first day of each month and run for 7 months. Note therefore that there are seven possible leadtimes for predicting each calendar target month. For instance, mean conditions for August can be computed considering the forecasts initialized on the first of that same August (i.e., Leadtime 0), the previous July (Leadtime 1), and so on back to February (Leadtime 6); leadtime is expressed in months hereafter.
For a particular gridbox and target month M, the mean drift d is defined as the difference between the climatologies p of M at different leadtimes, as given by the ensemble mean over 1981-2010 (see Figure 1). For instance, for leadtimes LT_i and LT (with LT > LT_i),

d = p^M_{LT} − p^M_{LT_i}.     (1)

Note the convenience of this way of defining the drift: since it does not involve the use of any verification data, it eliminates the issue of observational uncertainty (Herrera et al., 2018; Kotlarski et al., 2017); different results might be obtained if the drift were defined relative to observations.

Figure 2 shows the mean value of the drift (equation (1)) for temperature (top) and precipitation (bottom). For brevity, results are only shown for four illustrative target months: February, May, August, and November. Only drifts significantly (α = 0.05) different from 0 are displayed; white gridboxes identify areas where no significant values were found. To compute the significance of the drift, a bootstrap approach was followed (Mason & Graham, 2002). In particular, 1,000 different ensembles of 15 members each were built by random selection among the 15 members available (allowing for repeated members). The mean drift and its confidence intervals were then computed from the 1,000 bootstrapped values derived from these ensembles. Note also that drifts are only shown for a selection of incrementally increasing leadtimes. In particular, the left/middle/right column corresponds to the drift between Leadtimes 1 and 0/3 and 1/6 and 3. We do this in order to properly assess how the drift varies along the entire model run.
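As an illustration, the drift definition of equation (1) and the member-resampling bootstrap described above can be sketched as follows. This is a minimal NumPy sketch with our own array layout and function names (not part of the original analysis), assuming arrays of shape (members, years) holding the forecasts of one target month at two leadtimes:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_drift(fcst_lt, fcst_lt_i):
    """Mean drift d (equation (1)) for one gridbox and target month M.

    fcst_lt, fcst_lt_i : arrays of shape (n_members, n_years) with the
    forecasts of the same target month issued at leadtimes LT and LT_i
    (LT > LT_i). Since the climatology p is the ensemble mean averaged
    over the hindcast years, d reduces to a difference of overall means.
    """
    return fcst_lt.mean() - fcst_lt_i.mean()

def bootstrap_drift(fcst_lt, fcst_lt_i, n_boot=1000, alpha=0.05):
    """Bootstrap the drift by resampling members with replacement."""
    n_members = fcst_lt.shape[0]
    boots = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n_members, size=n_members)  # repeats allowed
        boots[b] = fcst_lt[idx].mean() - fcst_lt_i[idx].mean()
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    significant = not (lo <= 0.0 <= hi)  # drift differs from 0 at level alpha
    return boots.mean(), (lo, hi), significant
```

In the setting of the paper, the two input arrays would hold the 15-member, 30-year System 4 hindcast of a given calendar month at the two leadtimes being compared.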

Statistical Assessment of Model Drifts
Although providing a detailed description of the patterns found is not the aim here, there are some important conclusions that must be mentioned. First, significant drifts are found worldwide (especially for temperature). Second, model drifts vary considerably both in space and time. Third, although the drifts shown in the first column might be expected to be larger than those in Columns 2 and 3, due to the rapid adjustment processes that occur during the first days of the model run when the atmosphere and the ocean are initialized from different systems, this figure shows that the model continues to drift significantly well beyond the first month after the initialization moment. This is particularly the case for temperature, for which the model drifts are even larger during the last part of the run. For this variable, drifts are mostly negative and are present over vast parts of the globe (with absolute values exceeding 1.5 K in some regions). Moreover, they are stronger over land (especially in large portions of the Northern Hemisphere such as North America and Siberia) than over the oceans. In contrast, for precipitation, drifts are mainly located in tropical latitudes and are more pronounced over the oceans (with absolute values above 50 mm per month in many cases). Note that, in some regions, these drifts can be close to or larger than the underlying climatology, which may seriously reduce potential model skill (Smith et al., 2013).
For a better characterization of the drifts shown in Figure 2, we also analyzed their interannual variability.
In particular, Figure 3 shows the standard deviation of the year-to-year drifts, simply referred to as σ(d) hereafter, divided by their mean absolute value (see Figure 2), for temperature (top) and precipitation (bottom). Therefore, green (brown) colors identify those gridboxes where σ(d) is smaller (larger) than the mean drift itself. In other words, brown colors correspond to regions where the model drift is far from stationary.
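The ratio mapped in Figure 3 can be computed per gridbox as follows. This is a minimal sketch under the same assumed (members, years) array layout; the function name is ours:

```python
import numpy as np

def drift_stationarity_ratio(fcst_lt, fcst_lt_i):
    """sigma(d) / |mean(d)| for one gridbox: values > 1 ("brown" in Figure 3)
    flag drifts whose year-to-year variability exceeds the mean drift itself.

    fcst_lt, fcst_lt_i : (n_members, n_years) forecasts of the same target
    month at the two leadtimes being compared.
    """
    # year-to-year drift: difference of the ensemble means, one value per year
    d = fcst_lt.mean(axis=0) - fcst_lt_i.mean(axis=0)
    return d.std(ddof=1) / abs(d.mean())
```

A time-invariant correction is only safe where this ratio stays well below 1.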
According to this figure, temperature drifts might only be safely removed under the assumption of stationary model errors in the tropics (green areas). The situation is even more problematic for precipitation, for which brown colors predominate in most regions (with some exceptions, such as the Gulf of Guinea), and therefore time-invariant corrections would not be optimal to correct the year-to-year drifts. These results suggest that there is a need to develop new strategies that allow for efficiently dealing with model drifts. In this regard, we test in section 3.3 whether the use of moving windows can help to remove not only the mean bias from raw seasonal forecasts but also the unwanted effects that may appear due to model drift.

Sensitivity to Ensemble Size
Given the high computational costs involved in the generation of large ensembles for seasonal forecasting, it is important to have some estimate of the number of members (n in the following) needed to robustly characterize particular model properties, for instance, the drift. Therefore, Figure 4 shows how the mean drift obtained for temperature and precipitation (left and right columns, respectively) varies with n for the case of February (similar results are obtained for the rest of the months). The analysis is performed for two illustrative regions, Europe and El Niño 3 (shadowed areas in the top and bottom rows, respectively), for which spatially averaged time series are considered. Note that whereas low-to-moderate seasonal predictive skill is generally acknowledged for the former, high model skill has been documented for the latter (see, e.g., Manzanas et al., 2014). Ensembles of increasing size (n = 1, …, 15) were built based on bootstrapping (Mason & Graham, 2002). In particular, for each n, 1,000 different ensembles were constructed by randomly selecting n members out of the 15 available ones (repeated members are allowed). From these 1,000 ensembles, the mean drift value and its standard deviation (error bars) were obtained.
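The bootstrap experiment behind Figure 4 can be sketched as follows (again a simplified illustration with our own naming, not the original code):

```python
import numpy as np

rng = np.random.default_rng(1)

def drift_vs_ensemble_size(fcst_lt, fcst_lt_i, n_boot=1000):
    """Mean drift and its bootstrap spread as a function of ensemble size n.

    For each n = 1, ..., n_members, n_boot ensembles of n members are drawn
    with replacement from the available members, and the drift is computed
    for each; returns {n: (mean, std)} over the bootstrapped drift values.
    """
    n_members = fcst_lt.shape[0]
    out = {}
    for n in range(1, n_members + 1):
        boots = np.empty(n_boot)
        for b in range(n_boot):
            idx = rng.integers(0, n_members, size=n)  # repeats allowed
            boots[b] = fcst_lt[idx].mean() - fcst_lt_i[idx].mean()
        out[n] = (boots.mean(), boots.std(ddof=1))
    return out
```

The mean of the bootstrapped drifts is expected to be nearly flat in n, while the spread (the error bars of Figure 4) shrinks as the ensemble grows.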
Interestingly, this figure reveals that the mean drift is almost independent of n, and therefore, small ensembles would be sufficient for a robust estimation of model drifts. For instance, the drift patterns obtained for n = 5 are very similar to those presented in Figure 2 (not shown).
Additionally, we also assessed how the standard deviation of the year-to-year drift, σ(d), depends on n. Taking into account that σ²(ax − by) = a²σ²(x) + b²σ²(y) − 2ab·cov(x, y) for any two samples from random variables x and y and two scalars a and b, where σ² is the variance and cov the covariance operator, note from equation (1) that

σ²(d) = σ²(p^M_{LT}) + σ²(p^M_{LT_i}) − 2·cov(p^M_{LT_i}, p^M_{LT}).     (2)

Taking also into account that cov(x, y) = σ(x)·σ(y)·ρ(x, y), where ρ is the correlation coefficient, this can be expressed as

σ²(d) = σ²(p^M_{LT}) + σ²(p^M_{LT_i}) − 2·σ(p^M_{LT_i})·σ(p^M_{LT})·ρ(p^M_{LT_i}, p^M_{LT}),     (3)

where σ²(p^M_{LT}) and σ²(p^M_{LT_i}) represent the variance of the year-to-year ensemble mean predictions, p, at leadtimes LT and LT_i, respectively. However, according to a Bartlett test (Snedecor & Cochran, 1989), which checks the null hypothesis of equal variances across different samples, we found that σ²(p^M_{LT}) ≈ σ²(p^M_{LT_i}) in about 99% of global gridboxes, for both temperature and precipitation, and for all target months (not shown). The only leadtime for which the null hypothesis of the Bartlett test can be rejected (α = 0.05) in a considerable number of gridboxes is 0, which is due to the aforementioned rapid adjustment processes that occur during the first days of the model run in seasonal forecasting. Therefore,

σ²(p^M_{LT}) ≈ σ²(p^M_{LT_i}) ≡ σ²(p^M),     (4)

and equation (3) can be rewritten as

σ²(d) = 2·σ²(p^M)·[1 − ρ(p^M_{LT_i}, p^M_{LT})].     (5)

And, finally,

σ(d) = σ(p^M)·√(2·[1 − ρ(p^M_{LT_i}, p^M_{LT})]),     (6)

which indicates that the interannual variability of the drift is determined by a first term, σ(p^M) (the interannual variability of the ensemble mean at any leadtime except 0), and a second term related to the persistence of the model. With respect to the latter, note that high values of ρ would reflect that the model provides consistent (similar) predictions independently of the leadtime considered. It is reasonable to think that such situations will occur in cases for which a persistent predictability signal exists, and thus skillful predictions could be expected.
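The identity relating σ(d) to σ(p) and ρ can be checked numerically with synthetic data. The following sketch (our own, with an assumed true correlation of 0.8) simulates two correlated, equal-variance leadtime series and compares the directly computed σ(d) with the closed-form expression:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic year-to-year ensemble-mean predictions p at two leadtimes,
# built to have (approximately) equal variance and correlation rho_true.
n_years, rho_true = 100_000, 0.8
z = rng.normal(size=(2, n_years))
p_lt_i = z[0]
p_lt = rho_true * z[0] + np.sqrt(1.0 - rho_true**2) * z[1]

d = p_lt - p_lt_i                                     # year-to-year drift
sigma_p = np.sqrt(0.5 * (p_lt.var() + p_lt_i.var()))  # common sigma(p)
rho = np.corrcoef(p_lt_i, p_lt)[0, 1]

lhs = d.std()                                         # sigma(d), directly
rhs = sigma_p * np.sqrt(2.0 * (1.0 - rho))            # closed-form expression
```

Up to sampling noise, `lhs` and `rhs` agree, confirming that under equal variances the drift variability is fully set by σ(p) and the leadtime-to-leadtime correlation ρ.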
Figure 5 sheds some light on the contribution of each of these two terms for the case of temperature over Europe and El Niño 3 for February (similar conclusions are obtained for the rest of the months, as well as for precipitation). The numbers in the upper corners inside the panels reflect a low (high) intermember consistency in Europe (El Niño 3), leading to low (high) correlations between the predictions at Leadtimes 1 and 3, either for the ensemble mean or for a randomly chosen member (see the numbers in the lower right corner). Therefore, according to equation (6), and taking into account that σ(x̄) = σ(x)/√N, where N is the sample size of the random variable x, the interannual variability of the drift should decay with n as

σ(d) ≈ (σ(p^{1memb}_M)/√n)·√(2·[1 − ρ(p^M_{LT_i}, p^M_{LT})])     (7)

in regions of low model skill such as Europe, whereas it should be solely determined by σ(p^{1memb}_M) and ρ,

σ(d) ≈ σ(p^{1memb}_M)·√(2·[1 − ρ(p^M_{LT_i}, p^M_{LT})]),     (8)

in skillful regions such as El Niño 3.
To check this premise, Figure 6 shows the standard deviation of the year-to-year drifts as a function of n for the same example as in Figure 5. Again, for each n (with n = 1, …, 15), 1,000 different ensembles were constructed by randomly selecting n members out of the 15 available ones (allowing for repeated members). The mean value (solid line) and the standard deviation (error bars) of the drift were obtained from these 1,000 bootstrapped ensembles. Additionally, dashed (dotted) lines show the theoretical equation (7) (equation (8)) for regions of low (high) model skill.
The experimental results are very close to the theoretical ones, especially in El Niño 3. Yet, in this region, the interannual variability of the drift decays with n (especially for the first few members), which implies that either σ(p^{1memb}_M) or ρ(p^M_{LT_i}, p^M_{LT}) should vary with n. Indeed, Figure 7, which shows how the two terms in equation (6) vary with the ensemble size (left and right columns, respectively) over Europe and El Niño 3 (top and bottom rows, respectively; solid lines and error bars represent the mean value and standard deviation obtained from 1,000 bootstrapped samples), evidences that ρ(p^M_{LT_i}, p^M_{LT}) increases with n (markedly for the first few members) in El Niño 3, which would explain the decaying curve obtained in Figure 6 for this region. Moreover, Figure 7 also confirms that, in Europe, the interannual variability of the drift depends exclusively on n, since neither σ(p^{1memb}_M) nor ρ(p^M_{LT_i}, p^M_{LT}) varies with the ensemble size there.

In summary, our results indicate that small ensembles (n < 10) are enough to robustly estimate the mean value of model drifts (computed over a sufficiently long period), independently of the variable and/or the region being considered. However, whereas such small ensembles also allow capturing the representative year-to-year fluctuations of these drifts in skillful regions, larger ensembles are required to this end in regions of low model skill. Analyses such as the one undertaken here may help to determine the optimum number of members needed in the different regions of the world.

Implications for Bias Correction
In order to better understand the role that the model drifts shown so far may play when standard state-of-the-art techniques are used to bias correct raw model seasonal forecasts, we focus on the Philippines, a moderately skillful region for which this type of prediction is key for various sectors, e.g., rice production (Koide et al., 2012). In particular, we used the 42 gauge stations made available by the Philippine Atmospheric, Geophysical and Astronomical Services Administration (PAGASA: http://www.pagasa.dost.gov.ph), which cover the four climatic types present in the country (Coronas, 1920; Flores & Balagot, 1969; Kintanar, 1984) and have already been used in previous studies (see, e.g., Manzanas et al., 2015). Instead of the usual 3-month-long seasons (DJF, MAM, JJA, and SON), we focus here on extended 6-month-long ones (DJFMAM, MAMJJA, JJASON, and SONDJF), which allow for better illustrating the unwanted effects introduced by model drifts. One-month lead predictions of temperature and precipitation from the ECMWF System 4 were first bilinearly interpolated to the 42 PAGASA stations. Then, an empirical quantile-mapping method participating in the VALUE downscaling intercomparison initiative (Maraun et al., 2017), referred to as EQM hereafter, was applied to correct them. This method, which has recently been applied in the context of seasonal forecasting, consists of calibrating the predicted empirical probability density function by adjusting a number of quantiles based on the empirical observed one (see, e.g., Déqué, 2007). In particular, for each gridbox, we adjusted the 1st to 99th percentiles and linearly interpolated between every two consecutive percentiles inside this range. Outside this range, a constant extrapolation (using the correction obtained for the 1st or 99th percentile) was applied. All members were corrected independently, using the mapping built on their joint distribution. For the case of precipitation, the frequency adaptation proposed by Themeßl et al. (2012) was applied to account for possible cases in which the predicted frequency of dry days is larger than the observed one. To avoid overfitting, a leave-one-out cross-validation scheme (Lachenbruch & Mickey, 1968), in which each year was separately considered for test while keeping the rest for training, was applied. Beyond the standard implementation of the method, in which the total test set in each iteration is corrected at once (based on the total train set available), we also assessed here the suitability of using moving windows, which allow for independently correcting consecutive subperiods (e.g., days or weeks) based on a collection of data centered around the target period being corrected. For instance, if the method is applied on a daily basis and we want to correct the forecast for 16 January 1981, we may use all January days from the period 1982-2010 for the mapping. This configuration would correspond to a 31-day window, and it is the one used here. This choice for the width of the window is based on Bedia et al. (2018), who found, in the context of seasonal forecasting, that a 31-day window ensures smooth daily transitions while being narrow enough to encompass periods over which the possible trends introduced by model drift can be safely neglected.

Figure 8 shows, for the illustrative case of temperature, the mean bias obtained for the different extended seasons (in columns) at the 42 PAGASA stations, as given by the raw model forecasts (top row) and a standard EQM in which moving windows are not considered (bottom row).
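The moving-window EQM configuration described above can be sketched as follows. This is a simplified, single-series illustration with our own naming; the actual method additionally handles cross-validation, the member ensemble, and the precipitation frequency adaptation:

```python
import numpy as np

def eqm_moving_window(obs, fcst, doy_obs, doy_fcst, half_width=15,
                      quantiles=np.arange(0.01, 1.0, 0.01)):
    """Empirical quantile mapping applied in a 31-day moving window.

    obs, fcst : 1-D daily series (training observations and forecasts).
    doy_obs, doy_fcst : day of year (1..365) of each entry.
    For each target day, the mapping is built from all days falling within
    half_width days (circularly) of it, and the correction obtained for the
    outermost percentiles is extrapolated as a constant outside their range.
    """
    corrected = np.empty_like(fcst, dtype=float)
    for day in np.unique(doy_fcst):
        # circular distance to the target day of year
        dist_o = np.minimum(np.abs(doy_obs - day), 365 - np.abs(doy_obs - day))
        dist_f = np.minimum(np.abs(doy_fcst - day), 365 - np.abs(doy_fcst - day))
        oq = np.quantile(obs[dist_o <= half_width], quantiles)
        fq = np.quantile(fcst[dist_f <= half_width], quantiles)
        today = doy_fcst == day
        # interpolate the correction (oq - fq) between adjacent percentiles;
        # np.interp clamps outside fq's range -> constant extrapolation
        corrected[today] = fcst[today] + np.interp(fcst[today], fq, oq - fq)
    return corrected
```

Because each calendar day draws on its own 31-day training pool, the mapping can follow the leadtime-dependent drift instead of averaging it over the whole season.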
As expected by construction, the biases are basically negligible after applying the EQM method. However, if these results are individually analyzed for each month within the season, important sign-varying intraseasonal (i.e., monthly) biases appear as a consequence of the evolving model drifts (see section 3.1). An example is provided in the top row of Figure 9 for MAMJJA.
In this case, the use of moving windows in the application of the EQM method virtually eliminates the unwanted effects of model drift, leading to negligible biases for all months within the season (bottom row). Moreover, as shown in Figure 10, this also holds for the rest of the seasons and for precipitation. Nevertheless, despite their suitability for correcting the intraseasonal biases, it must also be noticed that moving windows do not improve the interannual skill of the corrected predictions. An example of this can be seen in Figure 11, where the interannual correlation with observations is shown for predictions of monthly temperature given by the EQM method when moving windows are not/are considered (top/bottom row).
Despite its inability to improve the predictive skill, the results found in this work regarding the application of moving windows might still be key to many sectors that need long, continuous unbiased climate forecasts (e.g., hydrology or crop modeling).

Conclusions
Despite its systematic presence in state-of-the-art seasonal forecasts, a rigorous statistical characterization of model drift (leadtime-dependent bias) is still lacking for this type of prediction. To fill this gap, the present work analyzes the spatiotemporal distribution of model drifts for global seasonal forecasts of temperature and precipitation, the most important variables in users' applications, and their sensitivity to the ensemble size.
Our results indicate that the model continues to drift well beyond the first month after the initialization moment, leading to significant, highly space- and time-varying drifts over vast regions of the world (especially for temperature). Nevertheless, small ensembles (fewer than 10 members) are enough to robustly estimate the mean value of these drifts, independently of the variable and/or the region being considered, and also allow capturing their year-to-year fluctuations in regions with good model predictive skill. This important finding suggests that costly approaches for seasonal impacts forecasting (e.g., dynamical downscaling) might benefit from drift removal strategies involving smaller ensemble sizes in skillful regions.
In contrast, in regions of low model predictive skill, larger ensembles are required to appropriately characterize the year-to-year variability of model mean drifts, which is found to be larger than the drift itself in many cases (especially for precipitation). This points to the need to develop new strategies that allow for efficiently dealing with model drifts, especially when bias correcting seasonal forecasts; note that most of the state-of-the-art techniques used to this end rely on the assumption of stationary model errors.
In this regard, we demonstrate here that the use of moving windows can help to remove not only the mean bias but also the unwanted effects arising from the drift. This is illustrated for the Philippines, where the use of moving windows virtually eliminates the intraseasonal biases that emerge when entire seasons are corrected at once (as is usually done). This can have important practical implications for a broad community of users who need long, continuous, unbiased seasonal climate forecasts to run their impact (e.g., hydrology or crop) models. Also, although it is beyond the scope of this study, the results shown here can help to better understand how state-of-the-art coupled models work and, particularly, to identify the nature and causes of some of their most important accompanying errors, which is still a major task in seasonal forecasting. This could be the aim of a future paper.
Finally, it is worth noting that all the analyses presented in this paper rely on a single forecasting model, the ECMWF System 4. To further test the robustness of the results found, in particular regarding the usefulness of moving windows to provide long, unbiased seasonal forecasts, we plan to extend this study by including a number of newer forecasting systems.

Acknowledgments
This study was supported by the EU projects EUPORIAS (EUropean Provision Of Regional Impact Assessment on Seasonal-to-decadal timescales) and SPECS (Seasonal-to-decadal climate Prediction for the improvement of the European Climate Services), funded by the European Commission's Seventh Framework Research Programme through Grant Agreements 308291 and 308378, respectively. The ECMWF System 4 data used here can be retrieved from the User Data Gateway (UDG), a THREDDS-based service from the Santander Climate Data Service that provides access to a wide catalogue of popular climate data sets: http://meteo.unican.es/tds5/catalogs/system4/System4Datasets.html (new users need to register first: http://meteo.unican.es/udg-tap/home). The observational data considered for the Philippines have been privately provided by PAGASA. Please contact PAGASA staff directly (http://bagong.pagasa.dost.gov.ph/aboutus/key-officials) for new requests. Finally, the author wants to acknowledge J. M. Gutiérrez for his constructive comments and continuous support during the elaboration of this work.