The capability of seasonal forecasting of global drought onset at local scales (1°) has been investigated using multiple climate models with 110 realizations. Climate models increase the global mean probability of drought onset detection over the climatological forecast by 31%–81%, but increase the equitable threat score by only 21%–50% because of a high false alarm ratio. The multimodel ensemble increases the drought detectability over some tropical areas where individual models perform better, but adds little over most extratropical regions. On average, less than 30% of the global drought onsets can be detected by climate models. The missed drought events are associated with low potential predictability and a weak antecedent El Niño–Southern Oscillation signal. Given the high false alarm ratio, reliability is very important for a skillful probabilistic drought onset forecast. This raises the question of whether seasonal forecasting of global drought onset is essentially a stochastic forecasting problem.
 Drought is nothing more than a water deficit, but it can have devastating impacts on society, the economy and environment because of its association with heat waves [Mueller and Seneviratne, 2012], food insecurity [Haile, 2005], and ecosystem degradations [Choat et al., 2012]. Drought is a slowly developing process and usually begins to impact a region without much warning once the water deficit reaches a certain threshold [Mo, 2011; Sheffield and Wood, 2011]. Therefore, predicting the drought onset a few months in advance will benefit a variety of sectors by allowing sufficient lead times for drought mitigation efforts.
 In fact, dynamical seasonal forecasting systems that are based on coupled atmosphere-ocean-land general circulation models (CGCMs) have been widely used for drought early warning in recent years [Luo and Wood, 2007; Dutra et al., 2012; Yuan et al., 2013]. The rationale for predicting drought through CGCMs is that the atmospheric models should have reasonable response both to the remote sea surface temperature (SST) anomalies through ocean–atmosphere teleconnections and the local terrestrial water (e.g., soil moisture and/or snow) anomalies via land-atmosphere coupling. While persistent SST anomalies have been acknowledged as major drivers for large-scale drought conditions [Hoerling and Kumar, 2003] that should lead to drought predictability, the land surface anomalies could also strengthen or weaken the drought intensity and thus affect the drought duration and recovery [Schubert et al., 2007]. However, the above mechanisms may not be sufficient for predicting all drought cases, or for all CGCMs, because drought could sometimes arise without strong external forcings [Kumar et al., 2013]. Therefore, seasonal forecasting of drought with multiple climate models and multiple ensembles is not only useful for deriving drought outlooks and their associated uncertainties that are necessary for decision makers but also offers the potential for more robust understanding of the climatic mechanisms of drought occurrence.
 Currently, there is multiagency support for a North American Multimodel Ensemble (NMME) intraseasonal to interannual prediction experiment [Kirtman et al., 2013], where several decades of hindcast data from multiple climate forecast models developed in North America are made fully available to the public, providing a promising opportunity for model comparison and combination, skill assessment and mechanism attribution for drought seasonal forecasting. In this paper, we use the NMME data to calculate the meteorological drought index, to investigate the global drought onset detectability and its potential mechanism with El Niño–Southern Oscillation (ENSO), and to analyze the probabilistic drought onset forecast skill and characteristics.
 This study differs from previous research in that (1) we focus on the seasonal forecasting of individual drought onset events based on the joint distribution of the forecast and observation rather than assessing the forecast skill for drought indices [Quan et al., 2012; Yoon et al., 2012; Sohn et al., 2013]; and (2) we comprehensively investigate the capability of dynamical models in seasonal forecasting of global drought onset within the last three decades using multiple state-of-the-art climate forecast models, rather than analyzing a single drought event [Luo and Wood, 2007; Dutra et al., 2012] or using a single climate model [Kumar et al., 2013; Yuan et al., 2013] at regional scales.
2 Data and Method
 The NMME phase-I project consists of seven models developed at the University of Miami (Rosenstiel School of Marine and Atmospheric Science/Community Climate System Model, Version 3 (CCSM3)), the Geophysical Fluid Dynamics Laboratory (GFDL/CM2.2), the International Research Institute for Climate and Society (IRI/ECHAMA, ECHAMD), the National Aeronautics and Space Administration (NASA/Goddard Earth Observing System Model, Version 5 (GEOS5)), and the National Centers for Environmental Prediction (NCEP/Climate Forecast System Version 1 (CFSv1), CFSv2). During phase II of the project, the two IRI models and NCEP/CFSv1 stopped producing real-time forecasts, but two models developed at the Canadian Meteorological Centre (CMC/CanCM3, CanCM4) joined the NMME. The overlapping hindcast period for the above nine models (see Kirtman et al. [2013] and Yuan and Wood for full references) is 1982–2009, and the monthly mean precipitation forecasts at 1° are downloaded from the IRI website (http://iridl.ldeo.columbia.edu/SOURCES/.Models/.NMME/). The hindcasts start from each calendar month during 1982–2009, with forecast horizons between 8 and 12 months. The number of hindcast ensemble members for each model varies from 6 to 24, with a total of 110 members. The observed monthly precipitation used in this paper for bias correction and validation is processed from a 63 year (1948–2010) 1° data set described in Sheffield et al. and recently updated to 2010.
 The biases of monthly predicted precipitation are corrected through quantile mapping, and the cumulative distribution functions for each model have been generated separately, using all the ensemble members of each model. An extreme value type-III Weibull distribution and a type-I Gumbel distribution are used for correcting those precipitation forecasts that exceed the lower and upper bounds of the observed climatological distributions, respectively [Wood et al., 2002; Yoon et al., 2012]. The bias-corrected model precipitation is then appended to the antecedent observed precipitation for calculating the 6 month Standardized Precipitation Index (SPI6) [McKee et al., 1993], an index that is widely used for quantifying meteorological drought at seasonal scales [Mo, 2011; Quan et al., 2012; Yuan et al., 2013].
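 As an illustration of the bias-correction step, the following minimal Python sketch maps a forecast value onto the observed climatology through empirical quantile mapping. The function names and the plotting position i/(n + 1) are assumptions made for this example. For brevity the sketch clamps values that fall outside the climatological range instead of fitting the Weibull and Gumbel tails described above, so it should not be read as the procedure actually used in this study.

```python
from bisect import bisect_right


def empirical_cdf(sample, x):
    """Non-exceedance probability of x with plotting position i / (n + 1)."""
    s = sorted(sample)
    rank = bisect_right(s, x)
    # Clamp the rank so the probability stays strictly inside (0, 1).
    rank = min(max(rank, 1), len(s))
    return rank / (len(s) + 1)


def empirical_quantile(sample, p):
    """Inverse empirical CDF with linear interpolation between order statistics."""
    s = sorted(sample)
    pos = p * (len(s) + 1) - 1          # fractional index into the sorted sample
    pos = min(max(pos, 0.0), len(s) - 1.0)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    frac = pos - lo
    return s[lo] + frac * (s[hi] - s[lo])


def quantile_map(forecast, model_clim, obs_clim):
    """Replace a model value by the observed value at the same quantile.

    model_clim: hindcast values pooled over all members of one model
    obs_clim:   observed values for the same calendar month
    (both are hypothetical inputs for this sketch)
    """
    p = empirical_cdf(model_clim, forecast)
    return empirical_quantile(obs_clim, p)
```

 In this sketch a forecast equal to the model median is mapped to the observed median, which removes a systematic wet or dry bias while preserving the rank of the forecast within the model climatology.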
 Given an observed SPI6 time series for a 1° grid cell, the drought event is selected when the SPI6 is below a specified threshold for at least 3 months, and the drought onset month is the first month that SPI6 falls below the threshold [Mo, 2011]. The threshold used in this study is −0.8 following Svoboda et al. , which results in about 8.7 drought events during 1982–2009 as averaged over global land grid cells (excluding Greenland, Antarctica, and desert regions with a mean annual precipitation less than 50 mm), with a maximum number of drought events up to 21. A similar procedure can be applied to the SPI6 time series that blends antecedent observation and the seasonal forecast.
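 The event definition above can be sketched in a few lines of Python. The function name and the example series are hypothetical; the threshold of −0.8 and the minimum duration of 3 months follow the text. A run that begins in the first month of the series is counted as an onset here, although its antecedent condition is unknown, which is a simplification of this sketch.

```python
def find_drought_onsets(spi, threshold=-0.8, min_duration=3):
    """Return the indices of drought onset months in a monthly SPI series.

    A drought event is a run of at least `min_duration` consecutive months
    with SPI below `threshold`; the onset is the first month of that run.
    """
    onsets = []
    run_start = None
    for t, value in enumerate(spi):
        if value < threshold:
            if run_start is None:
                run_start = t          # a new dry spell begins
        else:
            if run_start is not None and t - run_start >= min_duration:
                onsets.append(run_start)
            run_start = None           # the dry spell has ended
    # Handle a dry spell that lasts until the end of the series.
    if run_start is not None and len(spi) - run_start >= min_duration:
        onsets.append(run_start)
    return onsets


# Illustrative values only, not data from this study.
example_spi = [0.2, -0.9, -1.1, -1.0, 0.1, -0.9, -0.85, 0.3,
               -1.2, -1.3, -0.9, -1.0]
example_onsets = find_drought_onsets(example_spi)
```

 In the example series the 2 month dry spell is ignored because it is shorter than the minimum duration, so only the first and last spells yield onset months.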
 For the drought onset forecasting in this paper, we are particularly interested in answering questions like: given a nondrought condition (SPI6(t − 1) > −0.8), what is the probability for a drought occurrence in the next 3 months based on the model forecast (SPI6(t) < −0.8, SPI6(t + 1) < −0.8, and SPI6(t + 2) < −0.8)? Or given a forecasted drought onset, what is the probability that it turns out to be a false alarm? These probabilities can be quantified using a 2 × 2 contingency table that is based on the joint distribution of the forecasts and observations [p(y, o); Wilks, 2011]. Using the definitions of SPI6 and drought onset, the first 3 month forecasts are used in this study, where the antecedent observations could be considered as initial conditions of SPI6 similar to the soil moisture drought forecast.
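 The two conditional probabilities posed above follow directly from the counts in the 2 × 2 contingency table. The sketch below is a minimal Python illustration with hypothetical function names; it takes paired yes/no drought onset forecasts and observations and returns p(y1|o1) and p(o0|y1).

```python
def contingency_table(forecast, observed):
    """Count hits, false alarms, misses, and correct negatives.

    forecast, observed: sequences of booleans (True = drought onset).
    """
    hits = false_alarms = misses = correct_negatives = 0
    for f, o in zip(forecast, observed):
        if f and o:
            hits += 1
        elif f and not o:
            false_alarms += 1
        elif o:
            misses += 1
        else:
            correct_negatives += 1
    return hits, false_alarms, misses, correct_negatives


def probability_of_detection(hits, misses):
    """p(y1|o1): fraction of observed onsets that were forecast."""
    total = hits + misses
    return hits / total if total else float("nan")


def false_alarm_ratio(hits, false_alarms):
    """p(o0|y1): fraction of forecast onsets that did not occur."""
    total = hits + false_alarms
    return false_alarms / total if total else float("nan")
```

 Both functions return NaN when the conditioning event never occurs, for example when a grid cell has no observed drought onset in the hindcast period.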
 Figure 1 shows the probability of detectable drought onset, p(y1|o1), calculated from the hindcasts; that is, given the observed drought onset events during 1982–2009, what percentage of them were successfully forecasted by the seasonal climate forecast model. The Ensemble Streamflow Prediction (ESP) [Twedt et al., 1977] method is used here as a reference forecast, where the corresponding SPI6 can be calculated similarly to the climate models by replacing the forecast with climatology for the target month. As discussed by Lyon et al. , all forecast information from ESP comes from initial condition, and thus, it represents a baseline SPI forecasting skill. To be consistent with probabilistic forecast in the later part of this paper, the ESP SPI6 forecast in this study consists of 20 ensemble members, i.e., 20 years randomly selected from the historical observation during 1948–2010 (excluding the target forecast year). The results shown in Figure 1 are based on the ensemble mean forecasts. Without any forecast information, the ESP approach can detect less than 20% of the observed onset drought events over most of global land areas (Figure 1a). The probability of drought onset detection is higher over parts of North America, Europe, West Russia, and East Africa (Figure 1a). An interesting feature is that areas with relatively higher drought detectability are not located over either humid or arid regions in terms of annual mean precipitation, suggesting that the initial condition for drought onset forecast is more important in transition zones, a finding consistent with Shukla et al. . On average, globally, the drought detectability of the ESP approach is about 0.16 (Table 1).
Table 1. Global Mean Values of Probability of Detection, p(y1|o1), Probability of False Alarm, p(o0|y1), and Equitable Threat Score (ETS) for Drought Onset Forecasting
 After bias correction, the individual climate forecast models show higher probability of detection than ESP, especially over the central U.S., northeastern Brazil, East Africa, Europe, southern China, and Australia (Figures 1b–1j). Climate models developed at the same institution have similar patterns for the drought onset detectability, such as the two IRI models (Figures 1d–1e), the two NCEP models (Figures 1g–1h), and the two CMC models (Figures 1i–1j). This is consistent with the model similarity study by Yuan and Wood , though both wet/dry events are used for calculating a similarity index in that study. Regional differences in drought detectability exist for the models developed in different institutions. For instance, GFDL/CM2.2 has the highest drought detectability over U.S. and northeastern part of South America (Figure 1c), NASA/GEOS5 has the highest detectability over the Murray-Darling basin in eastern Australia (Figure 1f), and CMC/CanCM4 is the best over Europe, West Asia, and East Africa (Figure 1j). Table 1 shows that on average, globally, CMC/CanCM4 has the highest detectability, followed by GFDL/CM2.2, NASA/GEOS5, and CMC/CanCM3; CCSM3 has the lowest detectability. However, the models with high drought onset detectability usually have a high false alarm ratio, p(o0|y1). Therefore, the performances of the forecast models are not that different in terms of Equitable Threat Score (ETS) [Wilks, 2011], a skill score that takes into account both the detectability and reliability (false alarm).
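 To make the trade-off between detectability and false alarms concrete, the sketch below computes the ETS from contingency table counts. It assumes the common definition given by Wilks [2011], in which the number of hits expected by chance is (a + b)(a + c)/n, with a hits, b false alarms, c misses, and d correct negatives; the function name is hypothetical.

```python
def equitable_threat_score(hits, false_alarms, misses, correct_negatives):
    """ETS = (a - a_ref) / (a + b + c - a_ref), a_ref = (a + b)(a + c) / n."""
    n = hits + false_alarms + misses + correct_negatives
    if n == 0:
        return float("nan")
    # Hits expected from a random forecast with the same marginal totals.
    hits_random = (hits + false_alarms) * (hits + misses) / n
    denominator = hits + false_alarms + misses - hits_random
    if denominator == 0:
        return float("nan")
    return (hits - hits_random) / denominator
```

 Because false alarms enter the denominator, a model that raises its hit count only by issuing many more drought forecasts gains little in ETS, which is the behavior described above.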
 Figures 1k–1l show the drought onset detectability with two multimodel ensemble methods. The NMME1 SPI6 is based on the precipitation time series averaged over the 110 bias-corrected ensemble members from the nine climate models, while the NMME2 SPI6 is generated by transferring the 110 precipitation members to the normal space through quantile mapping based on individual model ensemble distributions, calculating mean values in the normal space, transforming the mean values back to the original space, and then calculating the SPI6 time series. The performance of NMME1 falls between the best and worst individual models (Figure 1k). The NMME2 approach averages the model ensembles in normal space, so the ensemble spread is greatly reduced. Figure 1l shows that the latter method significantly increases the drought onset detectability, especially for the areas where individual models have high detectability, such as U.S.-Mexico, East Africa, and Australia. However, similar to the comparison for individual models, the false alarm ratio is also high for the second ensemble method, although its ETS does increase (Table 1).
 To diagnose the asymmetric performances of the forecast models in terms of drought onset detectability and false alarms (i.e., the models with higher detectability also have worse reliability), the relationships between drought forecast skill and potential predictability as well as antecedent SST signals are investigated. Potential predictability measures how well a model can predict itself [Koster et al., 2004]. It usually assumes that the model is perfect and that all errors come from chaotic atmospheric dynamics acting on very small errors in the initial conditions. Investigating a model's potential predictability for drought onset therefore serves to diagnose the model's internal variability. Here the standard deviation of ensemble members (ensemble spread) normalized by the total spread across all forecasts, σfcst/σtotal, is used as an indicator of potential predictability, where σfcst and σtotal are calculated using realizations for a specific forecast event and all hindcast events, respectively. The smaller the normalized ensemble spread, the higher the potential predictability. The mean normalized spreads for the drought events conditional on the joint distribution of the forecast and observation are calculated for each land grid cell. The frequency distributions of those mean values over all land grid cells for each forecast model are plotted in Figure 2a. For example, the green lines represent the frequency distributions of normalized spread averaged over those forecast events where the model issues a drought onset forecast (fcst = T) but the drought onset does not occur in the observations (obs = F). All models consistently show that they tend to issue a drought onset forecast (red and green lines in Figure 2a) when the potential predictability is high (ensemble spread is small), while they fail to capture the drought onset if the potential predictability is low (blue lines in Figure 2a). Therefore, forecast models tend to predict a drought onset event only when they respond very deterministically to the initial conditions. For those drought onsets that arise from internal atmospheric variability [e.g., Kumar et al., 2013], forecast models have both lower potential and lower actual predictability. Figure 2a also shows that the frequency distributions of the ensemble spread for the successful (red lines) and false alarm (green lines) events are quite similar, which indicates that high potential predictability alone cannot guarantee a correct drought onset forecast. This could partly explain the asymmetric performances for detectability and false alarms, where in this study, a high detectability usually comes with a low reliability (high false alarm ratio).
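 The normalized spread σfcst/σtotal can be illustrated with the short Python sketch below. The variable on which the spread is computed (for example, the forecast SPI6 of each member), the use of the population standard deviation, and the function name are assumptions of this example and are not specified in the text.

```python
from statistics import pstdev


def normalized_spread(event_members, all_members):
    """sigma_fcst / sigma_total for one forecast event.

    event_members: member values for a single forecast event
    all_members:   member values pooled over all hindcast events
    """
    sigma_fcst = pstdev(event_members)
    sigma_total = pstdev(all_members)
    return sigma_fcst / sigma_total


# Three hypothetical forecast events with three members each.
hindcasts = [
    [-1.0, -1.2, -0.8],   # members agree on a dry state: small spread
    [0.5, 1.5, -0.5],     # members disagree: large spread
    [0.0, 0.2, -0.2],
]
pooled = [value for event in hindcasts for value in event]
spreads = [normalized_spread(event, pooled) for event in hindcasts]
```

 In this example the first event has the smaller ratio and would be classified as having the higher potential predictability.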
 To further investigate the drivers of drought onset, the frequency distributions of the absolute anomaly of antecedent seasonal mean Niño3.4 SST [Rayner et al., 2003], conditional on the joint distribution of forecast and observation (3 months before forecasted or observed drought onset), are plotted in Figure 2b. These results show that for weaker ENSO signals, the models have a higher probability of missing the drought onset (blue lines in Figure 2b). However, similar to the potential predictability analysis above, a strong ENSO signal could also lead to high false alarms. This indicates that the SST-drought relationship in the models is less nonlinear than in the actual climate system. Other atmospheric dynamic processes and local land-atmosphere coupling need further investigation to offset the apparently over-represented SST-drought relationship in the models.
 Given the limited skill in deterministic drought onset forecast, one may wonder whether probabilistic forecasts are more useful. The Brier Score (BS) and its decomposition into reliability (Rel) and resolution (Res) terms are used to assess the quality of probabilistic drought onset forecast [Wilks, 2011; Yuan et al., 2013]. They are defined as $BS = \frac{1}{n}\sum_{k=1}^{n}(y_k - o_k)^2 = \frac{1}{n}\sum_{i=1}^{I} N_i (y_i - \bar{o}_i)^2 - \frac{1}{n}\sum_{i=1}^{I} N_i (\bar{o}_i - \bar{o})^2 + \bar{o}(1 - \bar{o}) = Rel - Res + Unc$, where $n$ is the number of forecast-observation pairs, $y_k$ and $o_k$ are the probability of drought occurrence from forecast and observation for the kth event, respectively, $I$ is the discrete number of allowable forecast values (e.g., $I = 7$ if the forecast has six realizations), $N_i$ is the number of times each forecast value $y_i$ is identified in the hindcasts, $\bar{o}_i = \frac{1}{N_i}\sum_{k \in N_i} o_k$, and $\bar{o} = \frac{1}{n}\sum_{k=1}^{n} o_k$. The last term, $Unc = \bar{o}(1 - \bar{o})$, depends only on the observations. So the smaller the BS value and reliability term (Rel), as well as the larger the resolution term (Res), the better the probabilistic forecast. Another measure of the forecast attribute is the sharpness (Shp), which is defined as $Shp = \frac{1}{n}\sum_{i=1}^{I} N_i (y_i - \bar{y})^2$, and $\bar{y} = \frac{1}{n}\sum_{k=1}^{n} y_k$. Although the sharpness by itself does not contribute to the skill (BS) directly, we show the results here because of its connection with the detectability analysis above.
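 As a worked illustration, the following Python sketch computes the BS and its reliability, resolution, and uncertainty terms following the standard decomposition in Wilks [2011]. Forecast probabilities are derived from the number of ensemble members that predict a drought onset, so that an ensemble of six members gives seven allowable forecast values. The function name, the input layout, and the treatment of sharpness as the variance of the forecast probabilities are assumptions of this sketch.

```python
from collections import defaultdict


def brier_decomposition(drought_member_counts, outcomes, n_members):
    """Brier Score with reliability, resolution, uncertainty, and sharpness.

    drought_member_counts: members forecasting onset, one integer per event
    outcomes:              observed onset per event (1 = yes, 0 = no)
    n_members:             ensemble size (gives n_members + 1 forecast values)
    """
    n = len(outcomes)
    probs = [count / n_members for count in drought_member_counts]
    bs = sum((y - o) ** 2 for y, o in zip(probs, outcomes)) / n

    # Group the observed outcomes by the discrete forecast value issued.
    groups = defaultdict(list)
    for count, o in zip(drought_member_counts, outcomes):
        groups[count].append(o)

    o_bar = sum(outcomes) / n          # climatological onset frequency
    y_bar = sum(probs) / n             # mean forecast probability
    rel = res = shp = 0.0
    for count, obs in groups.items():
        y_i = count / n_members
        n_i = len(obs)
        o_bar_i = sum(obs) / n_i       # observed frequency given forecast y_i
        rel += n_i * (y_i - o_bar_i) ** 2
        res += n_i * (o_bar_i - o_bar) ** 2
        shp += n_i * (y_i - y_bar) ** 2   # assumed definition of sharpness
    unc = o_bar * (1.0 - o_bar)
    return {"BS": bs, "Rel": rel / n, "Res": res / n, "Unc": unc, "Shp": shp / n}
```

 Because the forecasts take only discrete values, the identity BS = Rel − Res + Unc holds exactly for the returned terms, which provides a convenient check on any implementation.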
 Figure 3a shows that NMME has the most skillful forecast in terms of the BS. For individual forecast models (note that ECHAMA, CFSv1, and CanCM3 are not shown here because of their similarity to their successors), only CFSv2 performs significantly better than ESP. Some models with high drought detectability (e.g., GFDL/CM2.2 and CMC/CanCM4) even perform worse than ESP. The reason is that the probabilistic forecasts from these models are much less reliable than those from ESP (Figure 3b), although all forecast models (except CCSM3) have better resolution (Figure 3c). Figure 3d shows that CM2.2 and CanCM4 produce the sharpest forecasts, but that does not help with probabilistic predictive skill. In contrast, CFSv2 has the worst sharpness among the individual climate models (Figure 3d), but its reliability is comparable to that of the climatological forecast, i.e., ESP (Figure 3b). In fact, for the hindcasts, CFSv2 uses an ensemble generation procedure that is quite different from that of the other models. Most climate models used in this study initiate the ensemble forecast on the first day of the target month, with perturbations to the atmospheric or oceanic states, whereas the CFSv2 hindcasts are initiated at 00Z, 06Z, 12Z, and 18Z every 5 days, resulting in ensemble forecasts with different lead times. It seems that the CFSv2-type ensemble generation method is able to capture more atmospheric variability and to produce more reliable drought onset forecasts, although it reduces the sharpness of the forecast. The probabilistic forecast assessment in this study illustrates that reliability is very important for a skillful drought onset forecast.
4 Concluding Remarks
 Using 28 year (1982–2009) North American Multimodel Ensemble (NMME) hindcast data with 110 ensemble members in total, both deterministic and probabilistic forecasting skill for global drought onset, including detectability, have been investigated. The relationships between forecast skill and potential predictability and ENSO have also been diagnosed. After precipitation bias correction, the forecast models increase the global mean probability of drought onset detection over the reference ESP forecast by 31%–81%, but increase ETS by only 21%–50% because of their high false alarm ratios. The multimodel ensemble increases the probability of drought onset detection over the areas where individual models have relatively high detectability, such as U.S.-Mexico, East Africa, and Australia, but could not increase skill beyond individual models where the detectability is low.
 A diagnostic study indicates that the forecast models tend to predict a drought condition (regardless of whether it is a correct forecast or a false alarm) when the potential predictability is high, whereas the potential predictability is low for those drought events that the models cannot capture. A similar feature is found for the Niño3.4 SST anomaly, conditional on the joint distribution of observed and forecasted drought events, where smaller antecedent SST anomalies are associated with drought events that are likely to be missed by the climate forecast models. Given the high false alarm ratio, increasing the reliability is very important for obtaining a skillful probabilistic drought onset forecast.
 Although drought detectability could be as high as 0.5–0.7 for specific regions (Figure 1), even the multimodel ensemble can only detect about 30% of the global drought occurrences on average (Table 1). Large-scale SST anomalies such as those associated with ENSO may trigger a drought that can be predicted by the climate models, but there are many drought onset events that do not need extreme forcings and thus are missed by the models (Figure 2b). Reducing the apparent over-representation of the SST-drought relationship as diagnosed in this study might be a direction for reducing false alarms, but this would not necessarily resolve problems related to limited drought detectability, especially for extratropical regions. This raises the question of whether seasonal forecasting of global drought onset at local scale (e.g., 1° in this study) is essentially a stochastic forecasting problem.
 We would like to thank IRI for making the NMME forecast information available. We thank Kingtse Mo for the SPI code, and Michael Tippett, Gabriel Vecchi, and Rich Gudgel for the GFDL hindcast data. The research presented in the paper was supported by the NOAA Climate Program Office through grants NA10OAR4310246 and NA12OAR4310090.
 The Editor thanks two anonymous reviewers for their assistance in evaluating this paper.