This study explores the performance of a suite of off-line, global (hydrological and land surface) models in mapping spatial and temporal patterns of large-scale hydrological droughts in Europe from simulated runoff in the period 1963–2000. Consistent model behavior was found for annual variability in mean drought area, whereas high model dispersion was revealed in the weekly evolution of contiguous area in drought and its annual maximum. Comparison with nearly three hundred catchment-scale streamflow observations showed an overall tendency to overestimate the number of drought events and hence underestimate drought duration, whereas persistence in drought-affected area (weekly mean) was underestimated, notably for one group of models. The high model dispersion identified in the temporal and spatial persistence of drought implies that care should be taken when analyzing drought characteristics from only one or a limited number of models, unless these have been validated specifically for hydrological drought.
Droughts are regional events that have a wide range of environmental and socio-economic impacts and thus it is vital that global (hydrological and land surface) models correctly simulate drought characteristics in a future climate. Most studies define drought locally (at a grid cell) by an anomaly percentile or standardized value [e.g., Sheffield and Wood, 2007; Vidal et al., 2010]. Subsequently, regional-scale drought statistics can be derived [e.g., Hisdal and Tallaksen, 2003; Andreadis et al., 2005; Wang et al., 2009; Lloyd-Hughes, 2012] and characteristics of future droughts assessed [e.g., Burke et al., 2010; Dai, 2013]. Due to the lack of easily accessible, up-to-date large-scale hydrological data sets for regional analyses, observation-based off-line model simulations are currently considered the best estimate of terrestrial variables like soil moisture and runoff [e.g., Andreadis and Lettenmaier, 2006; Burke et al., 2010] and commonly serve as reference data sets for comparison with global climate model simulations to evaluate the uncertainty in future projections. However, recent studies have revealed large model differences in reproducing terrestrial hydrology even when forced with the same climate input [Gudmundsson et al., 2012b]. Hence, the ability and robustness of these global data sets, i.e., off-line model simulations, to represent correctly spatial and temporal drought characteristics require further investigation.
Studies that have compared across several models suggest that different land surface models largely contain the same temporal variation in soil moisture and are able to reproduce anomalies and seasonal variability in observations [e.g., Koster et al., 2009]. It remains to be evaluated how well these models capture hydrological droughts. Some studies have suggested a higher uncertainty, i.e., larger spread among models, in simulating runoff under dry conditions and in dry regions due to high sensitivity to model structure and parameterization [Wang et al., 2009; Gudmundsson et al., 2012a; Stahl et al., 2012; Van Huijgevoort et al., 2013]. This study adds to previous work by analyzing spatial and temporal characteristics of large-scale runoff droughts in Europe as simulated by a suite of off-line global models. It aims to test the hypothesis that there is a high consistency among models in capturing the extent and timing of droughts, implying that any model would provide an acceptable data set for hydrological drought impact studies. The study is supported by a unique validation experiment of model simulations against streamflow observations.
The model ensemble consists of seven global (hydrological and land surface) models run with the same simulation setup developed in a joint effort within the WATCH (www.eu-watch.org/) project. Details of the models, i.e., GWAVA, HTESSEL, JULES, LPJml, MPI-HM, Orchidee, and WaterGAP, can be found in Haddeland et al.  and Gudmundsson et al. [2012a], including an overview of the schemes used for evapotranspiration, runoff generation, and snowmelt. All models were run at the 0.5° spatial resolution defined by the Climate Research Unit land mask and forced with the same input data, i.e., the bias-corrected WATCH Forcing Data [Weedon et al., 2011]. The models nevertheless vary considerably in structure and parameterization. With the exception of WaterGAP, which has undergone a limited calibration procedure, they rely solely on input maps for parameter estimation, and different maps of land properties were used by the individual models. In this study we analyzed daily total runoff (sum of fast and slow components) simulated for each grid cell in Europe (4425 land grid cells) for the period 1963–2000.
Streamflow observations from a data set of small, near-natural catchments from the European Water Archive (EWA) were used to validate spatial and temporal drought patterns. Most catchments are smaller than the area of the grid cells and only cells holding one or more stations were used (in case of more than one station, the largest catchment was used). In total, 293 cells with paired series of observed and simulated runoff (in mm) resulted (see Stahl et al.  for details of the data set).
3 Areal Drought Indices
Simulated and observed daily (7 day backward-smoothed) runoff series for each grid cell were first transformed into nonparametric anomalies (i.e., percentiles of their empirical distribution on each calendar day) for a comparison across models, independent of potential biases in simulated runoff. A grid cell is considered to be in drought if the runoff is below q20 (i.e., the 20% nonexceedance frequency of that day) in line with former studies [e.g., Hisdal and Tallaksen, 2003; Sheffield et al., 2012]. For lower thresholds, model performance decreases notably [Stahl et al., 2011].
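The transformation described above can be illustrated with a minimal sketch, under stated assumptions. The function names (`backward_smooth`, `empirical_quantile`, `drought_indicator`) are hypothetical, the quantile uses simple linear interpolation, and the first days of the record are smoothed over a shorter window; the authors' exact choices are not specified here and may differ.

```python
from collections import defaultdict

def backward_smooth(series, window=7):
    """Backward moving average: mean of the current and preceding values.

    Assumption: the first window-1 days use however many values exist.
    """
    out = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def empirical_quantile(values, p):
    """Quantile of a sample by linear interpolation (p in [0, 1])."""
    s = sorted(values)
    pos = p * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def drought_indicator(runoff, day_of_year, p=0.2, window=7):
    """Return 1 where smoothed runoff is below that calendar day's
    p-quantile (q20 for p=0.2), else 0."""
    smooth = backward_smooth(runoff, window)
    # Pool the smoothed values of all years by calendar day.
    by_day = defaultdict(list)
    for q, d in zip(smooth, day_of_year):
        by_day[d].append(q)
    threshold = {d: empirical_quantile(v, p) for d, v in by_day.items()}
    return [1 if q < threshold[d] else 0 for q, d in zip(smooth, day_of_year)]
```

Because the threshold is computed separately for each calendar day and each series, a biased but correctly ranked simulation yields the same indicator series as an unbiased one, which is what makes the comparison across models independent of runoff bias.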
The mean annual drought area, i.e., the average of the daily total area in drought, is used to characterize the overall dryness of a calendar year. The annual maximum drought cluster area, i.e., the area of the largest cluster of spatially contiguous cells in drought within a year (following minor spatial smoothing, including bridging the sea areas between the UK, the continent, and Scandinavia), is chosen as a measure of the severity of a given drought. A drought event is defined as a run of consecutive days in drought, and the total number of drought events is counted over the entire record.
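The three indices can be sketched as follows. This is an illustration only: it assumes equal-area cells (area is expressed as a fraction of land cells), 4-connectivity for contiguity, and sea cells marked as `None`, and it omits the spatial smoothing and sea bridging mentioned above. All names are hypothetical.

```python
def largest_cluster(grid):
    """Size (in cells) of the largest 4-connected cluster of cells equal to 1."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    best = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Flood fill from this unvisited drought cell.
            stack = [(r, c)]
            seen[r][c] = True
            size = 0
            while stack:
                i, j = stack.pop()
                size += 1
                for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                    if (0 <= ni < rows and 0 <= nj < cols
                            and grid[ni][nj] == 1 and not seen[ni][nj]):
                        seen[ni][nj] = True
                        stack.append((ni, nj))
            best = max(best, size)
    return best

def annual_indices(daily_grids):
    """Return (mean drought area, maximum cluster area) for one year,
    both as fractions of the number of land cells."""
    n_land = sum(v is not None for row in daily_grids[0] for v in row)
    in_drought = sum(sum(v == 1 for row in g for v in row) for g in daily_grids)
    mean_area = in_drought / (n_land * len(daily_grids))
    max_cluster = max(largest_cluster(g) for g in daily_grids) / n_land
    return mean_area, max_cluster

def count_runs(indicator):
    """Number of drought events: runs of consecutive 1s in a 0/1 series."""
    return sum(1 for i, v in enumerate(indicator)
               if v == 1 and (i == 0 or indicator[i - 1] == 0))
```

On a regular 0.5° grid the cell area shrinks toward the poles, so an area-weighted version would be closer to a true area fraction; the cell-count version is kept here for brevity.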
4 Simulated Large-Scale Droughts
Annual time series of mean drought area reveal a high consistency among models in reproducing the annual variability (Figure 1). (Note that the mean area in drought over all cells and for the whole time series will approach 20% for all models.) A marked decadal variability is evident with two rather dry periods in the 1970s and 1990s, interrupted by a wetter period in the 1980s. Two droughts stand out, namely 1976 and 1989–1990, as also reported in previous studies for central Europe [Briffa et al., 2009] and northwestern Europe [Zaidman et al., 2002]. All models rank 1976 the driest (six models) or second driest (one model) year during the period of record. The mean area in drought in 1976 is approximately 30% of the European domain, whereas it is slightly less in 1989 and 1990. Figure 1 further shows that maximum drought cluster area is notably higher, with an average of about 40%. Most models rank 1990 the most severe drought year, followed by 1976. Overall, a less consistent ranking is found for maximum cluster area, mainly because it is a more extreme index than mean annual drought area. Still, most models reproduce a similar dynamic pattern, i.e., interannual variability.
Box plots for each model confirm the discrepancy between the two areal drought indices (Figure 1). The high model consistency observed in mean annual drought area can be expected given the 20% threshold; however, large differences are seen in the interannual variability, with three models (labeled blue) showing noticeably lower spread. Annual maximum cluster area, on the other hand, also displays large differences in the median, with two groups of models emerging: Group I (GWAVA, HTESSEL, JULES, and WaterGAP) and Group II (labeled blue; LPJml, MPI-HM, and Orchidee). Group I has an overall lower area coverage; still, its higher interannual variability means that the largest maximum cluster areas, in 1976 (~60%) and 1989/1990 (~70%), are found for models in Group I.
We further looked at model consistency in mapping location and temporal evolution of drought area for the droughts of 1976 and 1989–1990. The maps in Figure 2 show grid cells in drought on a given day (all, some, or none of the models). The 1976 drought (upper row) developed during spring and summer over western Europe, centered on NW France to SE England. Throughout May and June, it spread northward and eastward, resulting in a contiguous cluster over central Europe that peaked early in July 1976. Severe conditions persisted until September 1976, which was the start of a very wet fall. The 1989–1990 drought (lower row) originated in SE Europe during the winter of 1989–1990, and when it peaked in mid-May 1990, it affected most of Europe, except the very south and north. Later in summer, the event's center moved east again before the drought ended in fall. Although the maximum cluster area was larger in 1990 than in 1976, the event was less coherent. This is reflected in larger model dispersion in mapping drought area. For both events, it is evident that the uncertainty resulting from using only one model is high, although the models perform slightly better when the drought is more extreme.
5 Validation Against Streamflow Observations
The observed and simulated runoff series were first transformed into drought indicator series labeled 1 or 0, depending on whether the runoff was below q20 (in drought) or not [Stahl et al., 2011]. Then, four performance measures were defined based on the agreement of the indicator series for each of the 293 paired grid cells. The spatial agreement index is a measure of how well the two paired daily indicator series (observed and simulated) agree (location) at a given time step (i.e., fraction of cells with the same indicator value on that day). The temporal agreement index compares the agreement of the daily indicator series over the entire record (timing) for each grid cell separately. The run ratio is the ratio of the simulated to the observed number of drought events (continuous runs below the threshold) in the entire record, so that values >1 indicate too many simulated events, whereas the correlation coefficient compares the annual series of the number of drought days in each year. The spatial agreement index has one value for each day in the record, whereas the other measures have one value for each grid cell pair (in total 293). Apart from the run ratio, all indices range between 0 and 1.
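A minimal sketch of the four measures, following the wording above literally, is given below. The function names are hypothetical, the run ratio is taken as simulated over observed (an assumption, chosen so that values >1 mean too many simulated events), and the exact formulations in Stahl et al. [2011] may differ. Indicator series are stored as `obs[cell][day]` and `sim[cell][day]`.

```python
from math import sqrt

def spatial_agreement(obs, sim):
    """One value per day: fraction of cells whose indicators match that day."""
    n_cells, n_days = len(obs), len(obs[0])
    return [sum(obs[k][t] == sim[k][t] for k in range(n_cells)) / n_cells
            for t in range(n_days)]

def temporal_agreement(obs, sim):
    """One value per cell: fraction of days on which the indicators match."""
    return [sum(o == s for o, s in zip(ob, si)) / len(ob)
            for ob, si in zip(obs, sim)]

def count_runs(indicator):
    """Number of drought events: runs of consecutive 1s in a 0/1 series."""
    return sum(1 for i, v in enumerate(indicator)
               if v == 1 and (i == 0 or indicator[i - 1] == 0))

def run_ratio(obs_cell, sim_cell):
    """Simulated / observed number of events (assumed direction).

    Raises ZeroDivisionError if the observed series has no drought event.
    """
    return count_runs(sim_cell) / count_runs(obs_cell)

def annual_drought_days(indicator, year_of_day):
    """Number of drought days per year, returned in chronological order."""
    counts = {}
    for v, y in zip(indicator, year_of_day):
        counts[y] = counts.get(y, 0) + v
    return [counts[y] for y in sorted(counts)]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)
```

For a single cell pair, the correlation measure would then be `pearson(annual_drought_days(obs_cell, years), annual_drought_days(sim_cell, years))`, where `years` gives the calendar year of each day.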
The results are shown for all models in the box plots in Figure 3, now also including the multimodel median. The temporal and spatial agreement indices both show rather low agreement for all models (~0.4) and small intermodel differences, with larger spread for the spatial index. Small model differences are also seen for the correlation coefficient (with the exception of GWAVA), and the models perform rather well (correlation coefficients ~0.7). The run ratio, on the other hand, shows quite large differences among the models and a tendency to overestimate the number of drought events (values >1). The three models in Group II produce about twice as many events as observed. Consequently, duration will be underestimated, whereas maximum cluster area will likely be larger than observed (cf. higher area coverage in Group II; note this is not the case for the two most extreme droughts, i.e., 1976 and 1989/1990, when models tend to agree better). One reason may be that models in Group II respond more quickly and more synchronously to precipitation (or the lack of it) across all grid cells, so that more cells will be in drought at the same time, resembling the scale of the synoptic forcing, whereas models in Group I respond more slowly and more heterogeneously. This is further supported by higher interannual variability for models in Group I for both indices (Figure 1). The multimodel median performs equivalently to the model average for all measures.
Differences in the partitioning of precipitation into evapotranspiration and runoff (surface and subsurface) produce different model dynamics and thus drought characteristics. One reason for the lack of agreement in simulated drought characteristics could be that the models are generally too responsive, i.e., persistence in the hydrological system is not adequately accounted for, as also suggested by previous studies [e.g., Gudmundsson et al., 2012a; Wang et al., 2009]. Persistence occurs as a result of memory in the investigated system, reflecting the role of stores such as extensive aquifer systems and high soil water holding capacity. A high frequency of droughts may result from low persistence and thus low serial correlation (autocorrelation) in the drought series, both at the grid cell scale (number of events) and for the affected area. Hence, the autocorrelation function was derived for weekly mean drought area for lags up to one year (k = 1,...,52) for the whole of Europe as well as for the fraction of grid cells in drought for the 293 paired cells (Figure 4). Both sets of curves show a decline in autocorrelation with increasing lag time; however, there is considerable dispersion among the models, with Group I showing higher autocorrelation, hence persistence, in drought area. The fraction of drought cells for the 293 paired cells has overall steeper curves than the area in drought for the whole European domain, which is likely a result of these cells being mainly fast-responding headwater cells. The observed curve closely follows the models in Group I, particularly WaterGAP (and GWAVA), suggesting that these models are better able to capture hydrological drought. Prudhomme et al.  found WaterGAP (out of three global models) to be best suited to reproduce regional characteristics of high and low flows in Europe, with a runoff response somewhere in between JULES (large storages and slowly responding runoff) and MPI-HM (fast responding runoff).
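The persistence analysis can be illustrated with a short sketch of a sample autocorrelation function for a weekly series. The estimator below is the common biased form (lagged products divided by the lag-0 sum of squares); which estimator the authors used is not stated, so this is an assumption, as are the function names.

```python
def weekly_means(daily, days_per_week=7):
    """Non-overlapping weekly means; a trailing partial week is dropped."""
    w = days_per_week
    return [sum(daily[i:i + w]) / w for i in range(0, len(daily) - w + 1, w)]

def autocorrelation(series, max_lag=52):
    """Sample autocorrelation r_k for k = 1..max_lag.

    r_k = sum_t (x_t - m)(x_{t+k} - m) / sum_t (x_t - m)^2,
    where m is the mean of the whole series.
    """
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    var = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / var
            for k in range(1, max_lag + 1)]
```

Applied to a weekly series of drought area, a curve that decays slowly with lag indicates high persistence, whereas a steep decay indicates a more responsive system.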
In Gudmundsson et al. [2012a], the same set of models was evaluated with respect to their ability to capture the interannual variability of high and low flows (aggregated for the whole of Europe) and the three best performing models identified are identical to Group I above. No single performance metric could be identified that would explain why some models performed better than others, and in a follow-up study [Gudmundsson et al., 2012b], it was concluded that it was not possible to perform a systematic model evaluation due to the many choices made by the modeler during model setup. Rather, this would require a specially designed model-testing framework, which is an important area for further research.
This study evaluated seven global (hydrological and land surface) models in their ability to reproduce hydrological (runoff) drought patterns across Europe and revealed significant model dispersion in simulating temporal and spatial persistence of drought. Comparison with observations showed a tendency to overestimate the number of events, and hence underestimate drought duration, and to underestimate persistence in weekly drought area, particularly for one group of models. Results herein suggest that these models are too responsive and therefore are not able to satisfactorily capture hydrological drought. Nonetheless, another group of models performed quite well, which could guide further model development.
For drought management and early warning, an improved conceptualization of the dynamic behavior of large-scale models, in particular the representation of stores and hydrological response, is vital to enable a better simulation of prolonged dry periods and large drought extent. Hence, conclusions should be drawn with care when only one or a limited number of models are used to derive design estimates of large-scale droughts or to predict the impacts of droughts in a future climate. Model deficiencies such as those identified here may have serious implications for model-informed future drought preparedness in Europe, and we strongly urge more validation studies to build trust in global models for hydrological impact studies.
The work was supported partly by the European Union FP7 funded project DROUGHT-R&SPI (contract 282769) and partly by the FP6 funded project WATCH (contract 036946). It contributes to the work of the UNESCO-IHP VII FRIEND program. The authors thank the WATCH modelers for their contribution of the forcing and model data.
The Editor thanks Adriaan Teuling and an anonymous reviewer for assistance evaluating this manuscript.