Multiyear climate predictions with two initialization strategies are systematically assessed in the EC‒Earth V2.3 climate model. In one ensemble, an estimate of the observed climate state is used to initialize the model. The other uses estimates of observed ocean and sea ice anomalies on top of the model climatology. The ensembles show similar spatial characteristics of drift related to the biases in control simulations. As expected, the drift is smaller with anomaly initialization. The full field initialization overshoots to a colder state, which is related to cold biases in the tropics and North Atlantic that are associated with oceanic processes. Despite the different amplitude of the drift, both ensembles show similar skill in multiyear global temperature predictions, but regional differences are found. On multiyear time scales, initialization with observations enhances both deterministic and probabilistic skill scores in the North Atlantic. The probabilistic verification shows skill over the European continent.
 The Earth's climate features pronounced multiyear variability. It is partly controlled by changes in external drivers, such as incoming solar radiation and anthropogenic emissions of greenhouse gases, and partly due to inherent variability of the climate system. On seasonal to multidecadal time scales, oceanic processes play an important role. The role of external forcing and oceanic processes implies potential predictability beyond persistence [Boer, 2004]. Operational dynamical prediction systems show nontrivial prediction skill on seasonal time scales. Skillful decadal predictions have been published as well [e.g., Smith et al., 2007; Keenlyside et al., 2008; van Oldenborgh et al., 2012; Hazeleger et al., 2013].
 Here we assess multiyear predictability using hindcasts with EC‒Earth, which is a climate model that has been derived from an operational seasonal prediction system [Hazeleger et al., 2010]. Prediction systems that use general circulation models are initialized with estimates of the observed state of the climate. The models have systematic biases, and therefore such predictions tend to show strong initial drifts. The drift may affect the variability simulated by the prediction system and hence may reduce skill. It may be corrected using retrospective forecasts, assuming that the drift is independent of the climate state. Alternatively, only estimates of anomalies of the observed climate can be initialized on top of the biased model climate. This is meant to reduce the strong initial drift. We systematically address the differences between the two strategies using the same prediction system and the same atmosphere, ocean, and sea ice analysis.
 We employ the EC‒Earth V2.3 atmosphere‒land‒ocean‒sea ice model. The model has a horizontal resolution of T159 in the atmosphere and 1° in the ocean with equatorial and high‒latitude refinement. We refer to Hazeleger et al.  for details of the model. The current version uses a different aerosol optics scheme and rescaled inhomogeneity factor for clouds.
 We use the protocol of the Coupled Model Intercomparison Project 5 (CMIP5, Taylor et al. ). Every 5 years, a hindcast is performed, starting on 1 November 1960, 1965, ..., 2005. Each hindcast runs for 10 years. We use external forcing of historical data sets provided by CMIP5. The RCP4.5 emission scenario is used for the period after 2005.
 The initial states of the hindcasts are determined from the ECMWF ORAS4 data for the ocean [Mogensen et al., 2012], for sea ice we use data from an ocean‒sea ice model forced with Drakkar Forcing Set V4.3 surface fluxes [Brodeau et al., 2010], and for the atmosphere and land surface we use ERA‒40 [Uppala et al., 2005] data up to 1979 and ERA‒interim data thereafter [Dee et al., 2011].
 Two ensembles were performed that differ in the way the ocean and sea ice are initialized and in initial perturbations that were applied. In one ensemble (FULL‒AO, 10 members), we use the full oceanic state (all dynamic and thermodynamic variables) of the five members of ORAS4 and perturb the atmosphere to create the ensemble [see Du et al., 2012]. In the second ensemble (ANOM‒O, five members), we initialize only the anomalous oceanic state with respect to its climatology on top of a model climatology. Monthly averages from 1960 to 2005 of ORAS4 were used to create the ocean climatology and the anomalies. The climatology of EC‒Earth is determined from individual historical simulations that were started from a preindustrial spinup in 1850. Three‒dimensional anomalies of the horizontal velocities, temperature, and salinity were added to the five EC‒Earth control climatologies. Sea ice anomalies have been provided by the forced ocean‒sea ice model. We check for consistency in the sea ice variables (depth, fraction, temperature, snow, velocity). If after adding the anomalies sea ice extent becomes negative, all variables are set to zero. If the extent is positive but other parameters are negative, they are replaced by climatological values. In ANOM‒O, the atmosphere was not perturbed.
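The anomaly initialization and the sea ice consistency check described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the code used for the hindcasts. The function names, the dictionary layout, the use of ice fraction as the measure of "extent", and the choice of which variables must be non-negative (`depth` and `snow`) are all assumptions made for the example.

```python
import numpy as np

def anomaly_initial_state(obs_state, obs_clim, model_clim):
    """Place the observed anomaly on top of the model climatology."""
    return model_clim + (obs_state - obs_clim)

def fix_sea_ice(fields, clim, extent_key="fraction",
                nonnegative=("depth", "snow")):
    """Apply the two consistency rules to a dict of sea ice arrays.

    Assumptions (hypothetical, for illustration only):
    - fields[extent_key] plays the role of the sea ice extent per grid cell;
    - only the variables listed in `nonnegative` are checked for negative
      values, since temperature and velocity can legitimately be negative.
    """
    extent = fields[extent_key]
    negative_extent = extent < 0
    fixed = {}
    for name, values in fields.items():
        out = np.array(values, dtype=float)
        if name in nonnegative:
            # Rule 2: positive extent but negative value -> climatology
            bad = (extent > 0) & (out < 0)
            out[bad] = clim[name][bad]
        # Rule 1: negative extent -> every variable is set to zero
        out[negative_extent] = 0.0
        fixed[name] = out
    return fixed

# Small usage example with made-up numbers
fields = {"fraction": np.array([-0.1, 0.5, 0.8]),
          "depth": np.array([0.3, -0.2, 1.0])}
clim = {"fraction": np.array([0.0, 0.6, 0.7]),
        "depth": np.array([0.1, 0.4, 0.9])}
fixed = fix_sea_ice(fields, clim)
```

In this example the first grid cell has a negative fraction, so both of its variables are zeroed; the second cell keeps its fraction but takes the climatological depth.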
 We compare FULL‒AO and ANOM‒O with an 11‒member model ensemble forced with historical external forcing (greenhouse gases, aerosols, solar forcing, land use; NOINIT) and initialized from a spinup at 1850.
 We focus on surface and near‒surface temperature only. In those variables, differences are found most clearly. The observations that are used for verification are a combination of land surface temperatures derived from the GHCN/CAMS data set [Fan and van den Dool, 2008], sea surface temperatures from the NCDC‒ERSSTv3b data set [Smith et al., 2008], and, north of 60°N, GISTEMP [Hansen et al., 2010].
 When verifying the hindcasts, ANOM‒O, FULL‒AO, and NOINIT are bias‒corrected with the same methodology. The climatology is defined as a function of leadtime by averaging the hindcasted sea surface temperature (SST) or corresponding observations across the starting dates when data are available. The skill is measured using correlations or root mean square error (RMSE). The confidence interval for the correlation is computed after a Fisher Z‒transformation. For the RMSE, it relies on a χ² distribution. In both cases, the autocorrelation of the time series is accounted for [see, e.g., Du et al., 2012].
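A minimal sketch of the leadtime‒dependent bias correction and of a Fisher Z confidence interval is given below, in Python with NumPy. The array layout (start dates by leadtimes), the function names, and the 95% critical value are assumptions for illustration. The effective sample size `n_eff`, which would carry the autocorrelation adjustment, is simply passed in here and is not computed.

```python
import numpy as np

def leadtime_anomalies(data):
    """Remove a leadtime-dependent climatology.

    data: array of shape (n_start_dates, n_leadtimes), with NaN where
    no data are available. The climatology at each leadtime is the
    average across start dates, ignoring missing values.
    """
    data = np.asarray(data, dtype=float)
    clim = np.nanmean(data, axis=0, keepdims=True)
    return data - clim

def fisher_ci(r, n_eff, z_crit=1.96):
    """Confidence interval for a correlation via the Fisher Z transform.

    n_eff is the (assumed, externally supplied) effective sample size.
    """
    z = np.arctanh(r)
    se = 1.0 / np.sqrt(n_eff - 3.0)
    return np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)

# Usage with made-up numbers: two start dates, two leadtimes
anoms = leadtime_anomalies([[1.0, 2.0], [3.0, 6.0]])
lo, hi = fisher_ci(0.5, n_eff=10)
```

The same `leadtime_anomalies` call would be applied to the hindcasts and to the matching observations, so that both are expressed as anomalies with respect to their own leadtime‒dependent mean.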
 We also assess the Brier skill score (BSS), which measures the skill of a probabilistic forecast by comparing predicted probability of events to a climatological reference forecast. The BSS depends on the ensemble size for small ensembles. We use an analytical expression developed by Ferro  to estimate the BSS for different ensemble sizes. It allows us to compute a theoretical BSS for infinite ensemble sizes. We present Attributes diagrams as well. Further details on the BSS and Attributes diagrams in the context of decadal predictions can be found in Corti et al. .
3.1 Initial Drift
 The main difference between the ensembles is the temporal drift due to the shock caused by the distance between initial state and mean model climate. The smallest global mean drift is found in NOINIT and the largest in the FULL‒AO ensemble, and there are distinct regional variations (Figure 1, and Figure S1 in the auxiliary material). This result is expected because the initial state is least in balance with the model climatology in FULL‒AO and most in NOINIT.
 A seasonal cycle is visible in all ensembles, with maximum bias in late boreal summer (Figure S1). This is due to the cold bias of the model [Hazeleger et al., 2012]. Here the bias is defined as the difference between the (quasi) equilibrium of the model and the observations, which corresponds to the bias in NOINIT. This bias is strongest over the tropical regions and in the North Atlantic subpolar gyre. In boreal winter, the bias reduces due to a warm bias in the Southern Ocean and in the western boundary current extensions in the Northern Hemisphere [Sterl et al., 2012].
 All ensembles generate similar spatial patterns of drift, but the amplitude differs. Surprisingly, the ANOM‒O and NOINIT ensembles have a smaller globally averaged bias after 1 month leadtime than the FULL‒AO ensemble, despite the fact that the FULL‒AO ensemble is initialized from the observed state represented in the ocean and sea ice analysis. This is due to regional differences in development of the bias as explained in the next paragraph.
 All ensembles have a warm bias in the Southern Ocean. It is present in the model climatology, and it develops quickly in FULL‒AO. The fast response points to atmospheric sources of the bias and is partly caused by excessive short‒wave radiation at the surface. We do not find large differences in mixed layer heat content in the Southern Ocean at these short leadtimes (see Figure S2). The ANOM‒O and NOINIT are colder than FULL‒AO in the first months due to cold biases in the North Atlantic and tropics that compensate the warm bias in the Southern Ocean. The cold bias exists already in the climatology of the coupled model and takes time to develop in FULL‒AO.
 After the initial globally averaged warm bias, the FULL‒AO ensemble overshoots to a state that is colder than the ANOM‒O and NOINIT ensembles. Even with a leadtime of 10 years, the difference between the ensembles is substantial. It is caused by a very strong La Niña‒like pattern that develops after 1 year. This pattern is stronger in FULL‒AO than in ANOM‒O. Beyond 1 year, this cold pattern stretches out over the entire tropics. In ANOM‒O, the Southern Ocean is even warmer than in FULL‒AO. Also, the North Atlantic cold bias is larger. Overall, this causes the ANOM‒O ensemble to be warmer than the FULL‒AO ensemble at longer leadtimes.
3.2 Deterministic Verification
 Figure 2 shows the RMSE for annual and global mean surface air temperature as a function of leadtime. The RMSE of the ANOM‒O ensemble seems to be systematically lower than those of the FULL‒AO and NOINIT ensembles. In the first year, NOINIT has the most skill. However, the error bands overlap strongly. We conclude that at global scale, the external forcing that is common to all ensembles dominates the skill.
 At regional scales, the differences between the ensembles are more pronounced. We assess the skill for selected regions where skill is expected due to coupled ocean atmosphere interactions. Sampling is limited due to the small number of start dates and ensemble members. Therefore, we applied a 4 year running mean to the results. This means that we will only focus on multiyear variability.
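The 4 year running mean can be illustrated with a short Python sketch. It assumes annual mean values ordered by forecast year and uses only complete windows; how the ends of the series were actually handled is not stated in the text.

```python
import numpy as np

def running_mean(series, window=4):
    """Running mean over `window` consecutive values (complete windows only)."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(series, dtype=float), kernel, mode="valid")

# Ten annual values give seven 4 year averages (years 1-4, 2-5, ..., 7-10)
smoothed = running_mean(np.arange(1.0, 11.0))
```

With a 10 year hindcast this leaves seven overlapping multiyear averages per start date, which is why only multiyear variability can be assessed.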
 In the equatorial Pacific, the initialization has hardly any effect on the skill on multiyear time scales. The anomaly correlations seem to show more skill in NOINIT, with lower RMSE as well, but error bands overlap (Figures 3a, 3e, and S3).
 To compare the North Atlantic results to other studies, we detrended SST with global mean SST (60°N–60°S). The North Atlantic shows enhanced skill in ANOM‒O and FULL‒AO with respect to NOINIT, with FULL‒AO showing more skill than ANOM‒O (Figures 3b and 3f). The poorer skill in ANOM‒O may be due to differences in the observed and model climatology. These differences can project on the AMO. This is the subject of further study. It is encouraging that initialization has a positive effect in accordance with what was found in previous studies [e.g., van Oldenborgh et al., 2012; Hazeleger et al., 2013].
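The text does not spell out how the global mean SST is removed. One common approach, assumed here purely for illustration, is to regress the regional series on the global mean series and keep the residual. The function name and the made‒up numbers below are hypothetical.

```python
import numpy as np

def remove_global_signal(regional, global_mean):
    """Residual of a regional series after linear regression on the global mean.

    This is an assumed detrending method, not necessarily the one used
    for the results in the figures.
    """
    regional = np.asarray(regional, dtype=float)
    global_mean = np.asarray(global_mean, dtype=float)
    slope, intercept = np.polyfit(global_mean, regional, 1)
    return regional - (slope * global_mean + intercept)

# Made-up example: regional SST = 2 * global mean + 1 + small residual
g = np.array([0.0, 1.0, 2.0, 3.0])
noise = np.array([0.1, -0.1, -0.1, 0.1])
residual = remove_global_signal(2.0 * g + 1.0 + noise, g)
```

Because the residual in this example is uncorrelated with the global mean series, the regression recovers it unchanged.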
 In the Gulf of Guinea in the equatorial Atlantic, correlations are high when the global mean trend is included (Figures 3c and 3g) but low when it is subtracted (Figure S3). Since the different ensembles overlap strongly, this points to skill due to the joint external forcing. There is an apparent rise of correlation skill and reduction of RMSE at a leadtime of 5 years. As shown by García‒Serrano and Doblas‒Reyes , this can be due to the limited number of start dates. The skill is affected by events that occur in most of the third and eighth forecasted years.
 The Indian Ocean results are similar to the equatorial Atlantic (Figures 3d and 3h). The external forcing provides much skill. Unlike the equatorial Atlantic, when detrending the data with the global mean SST change, some impact of the initialization is found, with similar scores for ANOM‒O and FULL‒AO (Figure S3). The impact of initialization is larger than found by Guemas et al.  in an ensemble of different models.
3.3 Probabilistic Verification
 We assess the characteristics of the prediction system by a probabilistic evaluation using the BSS, which is a quadratic measure of probabilistic forecast skill. A positive value of BSS indicates a forecast that is better than climatology. The BSS was computed for the binary events here defined as the anomaly below (above) the lower (upper) tercile for near‒surface air temperature in different regions.
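A basic Brier skill score for such a tercile event can be sketched as follows, in Python with NumPy. The forecast probability is taken as the fraction of ensemble members below a threshold, and the reference is a constant climatological probability (1/3 for a tercile event). The function names are hypothetical, and the sketch deliberately omits the ensemble‒size adjustment mentioned earlier.

```python
import numpy as np

def event_probability(ensemble, threshold):
    """Fraction of members below `threshold` for each start date.

    ensemble: array of shape (n_start_dates, n_members).
    """
    return np.mean(np.asarray(ensemble, dtype=float) < threshold, axis=1)

def brier_skill_score(prob_forecast, event_observed, prob_clim=1.0 / 3.0):
    """BSS = 1 - BS / BS_ref, with a constant climatological reference."""
    p = np.asarray(prob_forecast, dtype=float)
    o = np.asarray(event_observed, dtype=float)
    bs = np.mean((p - o) ** 2)
    bs_ref = np.mean((prob_clim - o) ** 2)
    return 1.0 - bs / bs_ref

# Made-up example: the event occurs in the first of three start dates
observed = [1, 0, 0]
perfect = brier_skill_score([1.0, 0.0, 0.0], observed)
no_skill = brier_skill_score([1.0 / 3.0] * 3, observed)
```

A perfect forecast gives a score of 1, and a forecast that always issues the climatological probability gives 0, consistent with positive values indicating a forecast better than climatology.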
 For the North Atlantic and Europe, clear differences between the ensembles were found.
 Figure 4 shows that beyond 1 year, FULL‒AO is more skillful than NOINIT and ANOM‒O for the lower tercile of near‒surface temperature predictions. In the first year, no distinction can be made, which indicates that there is no additional skill from initialization at seasonal time scales. FULL‒AO initialization has a positive impact on predicting SST at longer time scales though. Apparently the low frequency variability, which is thought to be related to the ocean circulation in this region, can be better predicted by initialization.
 Interestingly, the BSS is positive over European land regions beyond 1 year, in particular for the upper tercile. Scores are worse for the lower tercile, but here the initialization provides extra skill, and the uncertainty is smaller. The differences between the lower and higher terciles can be due to biases in the simulated (nonlinear) trends. Attributes diagrams, which illustrate the reliability of probabilistic forecasts, were computed as well and confirm these results (see Figure S4).
 EC‒Earth climate prediction simulations with a full field initialization strategy (FULL‒AO) and an anomaly initialization strategy (ANOM‒O) are compared to simulations with historical forcing and no knowledge of the observed natural variability in the initial state (NOINIT).
 The main difference is the initial drift due to the initialization shock. As expected, the drift is highest in the FULL‒AO ensemble. The characteristics of the drift could be traced back to biases in the climate of EC‒Earth, in particular the warm bias in the Southern Ocean and cold bias in the tropics and North Atlantic. The warm bias develops very fast and is likely associated with atmospheric processes; the cold bias develops slower and indicates that oceanic processes are at play.
 The different behavior of the drift did not clearly affect prediction skill. Magnusson et al.  also find similar prediction skill for different initialization strategies at decadal time scales. Here we show that most skill originates from the external forcing. Initialization improves predictions on multiyear time scales in the North Atlantic region, with somewhat better skill scores for FULL‒AO than for ANOM‒O. Compared to earlier studies, we added a probabilistic verification. Again, the North Atlantic stands out as a region where probabilistic skill can be obtained. This probabilistic skill extends even to Europe.
 This work has been supported by the EU‒funded THOR (212643), COMBINE (226520), QWeCI (243964), and CLIM‒RUN (265192) projects, the MICINN‒funded RUCSS (CGL2010‒20657) project, and the Catalan Government. The authors acknowledge resources provided by the Red Española de Supercomputación (RES) and ECMWF, and thank M. Asif for assistance.