Identification of hydrologic drought triggers from hydroclimatic predictor variables



[1] Drought triggers are patterns in hydroclimatic variables that herald upcoming droughts and form the basis for mitigation plans. This study develops a new method for identification of triggers for hydrologic droughts by examining the association between the various hydroclimatic variables and streamflows. Since numerous variables influence streamflows to varying degrees, principal component analysis (PCA) is utilized for dimensionality reduction in predictor hydroclimatic variables. The joint dependence between the first two principal components, that explain over 98% of the variability in the predictor set, and streamflows is computed by a scale-free measure of association using asymmetric Archimedean copulas over two study watersheds in Indiana, USA, with unregulated streamflows. The M6 copula model is found to be suitable for the data and is utilized to find expected values and ranges of predictor hydroclimatic variables for different streamflow quantiles. This information is utilized to develop drought triggers for 1 month lead time over the study areas. For the two study watersheds, soil moisture, precipitation, and runoff are found to provide the fidelity to resolve amongst different drought classes. Combining the strengths of PCA for dimensionality reduction and copulas for building joint dependence allows the development of hydrologic drought triggers in an efficient manner.

1. Introduction

[2] Drought, as a prolonged state of water deficit, is perceived as one of the most expensive and least understood natural disasters. In monetary terms alone, a typical drought costs American farmers and businesses $6–8 billion each year, more than the damages incurred from floods and hurricanes [Federal Emergency Management Agency, 1995]. The consequences tend to be more severe in areas where agriculture is a major economic driver. Dracup et al. [1980] stated that the proper definition of drought depends on the nature of the water deficit relevant to the study area. As water moves through the hydrologic cycle, precipitation deficits (meteorological droughts) lead to low soil moisture levels (agricultural droughts) that translate into low streamflows and low reservoir and/or groundwater levels (hydrologic droughts).

[3] The occurrence and magnitude of hydrologic droughts are heralded by triggers that may be manifested in specific patterns of hydroclimatic variables. Identification of these triggers at appropriate lead times is necessary for devising effective drought mitigation plans. Estimates of water deficits and drought categories at weekly, monthly, seasonal, and annual lead times are needed for scheduling irrigation events and managing the water resources of a region. Drought characterization is currently accomplished by indices such as the standardized precipitation index (SPI), Palmer drought severity index, crop moisture index, surface water supply index, and reclamation drought index. Drought indices are typically designed for assessing current conditions and have little predictive capability. Large-scale oceanic and atmospheric indicators such as the El Niño-Southern oscillation phases, North Atlantic oscillations, Pacific North American index, Atlantic multidecadal oscillations, and Pacific decadal oscillations are used as long-term precursors to annual/seasonal forecasts of precipitation [Ropelewski and Halpert, 1996; McHugh and Rogers, 2001; Maity and Nagesh Kumar, 2008a]. However, for many parts of the world, including Indiana, USA, these indicators have been found to have little to no influence [Charusombat and Niyogi, 2011]. Further, their inability to provide short-term predictions (several weeks to 6 months) renders them unsuitable as drought triggers for such time scales. We hypothesize that hydrological droughts, reflected in unregulated streamflows, would have precursors in local hydrometeorologic variables related to rainfall and soil moisture over the corresponding watersheds. McKay et al. [1989] suggested that accurate drought predictions will need models that link climate and weather factors to streamflows and river stage data.

[4] Several considerations come into play for the development of drought triggers, including drought types, data availability, choice of hydrologic variables (precipitation, temperature, streamflows, storage levels, etc.), temporal scales, and validity of the trigger. Over the past two decades, drought triggers have been developed by several states and utilities [Steinemann, 2003]. However, these have met with limited success because of (i) inconsistencies between results from different drought indicators and (ii) the lack of sufficiently long records for proper model development and validation. Moreover, these triggers are often defined as preset thresholds to be crossed by various drought indices at the same instant for which drought status is being analyzed. Thus, they may not recognize early warning signals that may be present in the record.

[5] Though droughts are fundamentally triggered by insufficient precipitation, the evolution of water deficits from precipitation to soil moisture and to streamflows is not instantaneous and is controlled by complex physical mechanisms. As hydrologic droughts are based on abnormally low flows, estimation of streamflows is, therefore, a necessary prerequisite to drought analysis. Since a drought trigger governs the level of future response, it is important that the trigger be based on methods that convey predictive uncertainty. There are many methods available for estimation of streamflows, classified mainly into physics-based, conceptual, and data-driven approaches. Several watershed models have been developed that rely upon the physical knowledge of the watershed and the hydrological cycle, often resulting in complex representations that require intensive computer effort for model calibration and corroboration. Data-driven techniques do not require detailed understanding of the inherent physical mechanisms, but have shown comparable accuracy for streamflow prediction as physics-based models [Wu et al., 2009]. The time scale of 1 month lead forecasts is particularly challenging because physics-based models (HEC-HMS, MIKE-SHE, etc.) are not able to project using input data beyond several hours to days without a disaggregation procedure. Process-based models such as SWAT perform simulations at a daily time step [Srinivasan and Arnold, 1994], and model outputs have to be aggregated to obtain monthly values. However, the strength of such models lies in examining long-term consequences of management practices rather than monthly forecasts. There are many conceptual lumped-parameter models developed in the last four decades, mainly for flood forecasting, with one day or shorter time resolutions [Xu and Singh, 2004], but their predictive capabilities are very limited if the time horizon exceeds several days.

[6] Statistical approaches have been utilized to model the complex relationships between streamflows and large-scale atmospheric circulation phenomena [Anmala et al., 2000; Maity and Nagesh Kumar, 2008b]. The predictors used in the majority of these data-driven approaches were hydroclimatic variables such as mean temperature, mean sea-level pressure, soil moisture, precipitation, runoff, and wind speed. While these studies have stressed the importance of hydroclimatic variables for enhancing streamflow prediction, they were primarily targeted toward long-range forecasting [Salas et al., 2011]. Even with the predictor set identified, new approaches are needed for achieving short-term (few weeks to months) forecasts. The use of advanced statistical models based on Markov properties [e.g., Mallya et al., 2013] has helped in the probabilistic classification of drought states and alleviated the need for user-specified thresholds for drought categorization. Thus, though robust models exist for forecasting streamflows and upcoming hydrologic droughts, these models are not suitable for the development of triggers, which requires identification of the ranges of predictor variables that herald a particular drought.

[7] The joint probability density function between streamflows and hydroclimatic predictor variables is needed to identify and develop drought triggers. Copulas are a natural choice for this task [Nelsen, 2006]. They allow the dependence structure to be modeled without any restriction on the distributions of the marginals [Genest and Favre, 2007] and have been gaining popularity with hydrologic applications. Favre et al. [2004] used Frank and Clayton 2-copulas to model the dependence between streamflow peaks and volumes. Salvadori and De Michele [2004] adopted copulas in their study of the return period of hydrological events. Zhang and Singh [2006] used copulas to determine bivariate distributions between flood peaks, volumes and durations, and employed them to define joint and conditional return periods needed for hydrologic design calculations. The joint distribution of intensity, duration, and severity of droughts was modeled using copulas by Shiau et al. [2007], Wong et al. [2010], and Madadgar and Moradkhani [2013]. Maity and Nagesh Kumar [2008a] analyzed the dependencies among the teleconnected hydroclimatic variables using copulas for the prediction of response variables using large-scale oceanic and atmospheric indicators. Kao and Govindaraju [2010a] utilized copulas to construct an intervariable drought index, where the dependence structure of precipitation and streamflow marginals was preserved. The review by Mishra and Singh [2010] highlights the expanding role of copulas in drought assessment studies.

[8] Given the large number of potential hydroclimatic variables in the predictor set, the direct use of copulas to model their joint dependence with streamflows is impractical because of the mathematical complexity in constructing higher-dimensional copulas. If the dependence between all the interacting variables cannot be represented by multivariate Gaussian (or meta-elliptical) copulas, then models at even the trivariate level can be very challenging [Kao and Govindaraju, 2008, 2010b]. Moreover, with multiple interacting variables, the curse of dimensionality adds further challenges to estimation of model parameters from limited record lengths. While many options exist for modeling bivariate dependence between variables, models for higher dimensions are not easily available.

[9] Principal component analysis (PCA) provides an elegant way of projecting the precursor hydroclimatic variables onto a feature space, and representing the original data through a reduced number of effective features called principal components (PCs) [Jolliffe, 1986; Preisendorfer, 1988]. If the first few (two in this case) features are able to explain most of the variability (>90%) in the original data set, then substantial dimensionality reduction may be achieved through unsupervised learning. PCA is recognized as the most widely used tool for dimensionality reduction for multivariate data problems. Lins [1985] utilized PCA to construct parsimonious models for multisite streamflows. Maurer et al. [2004] showed the effectiveness of PCA for both reducing the dimensionality of large data sets and better graphical representation of the modes of variability in streamflows. Tripathi and Govindaraju [2008] developed algorithms for data compression using PCA for data sets with noise. PCA was adopted by Keyantash and Dracup [2004] to achieve dimensionality reduction for developing an aggregate drought index.

[10] The goal of this paper is two-fold. The first goal is to model the joint distribution of streamflows and the important principal components of precursor hydroclimatic variables using an appropriate copula family for two study watersheds in Indiana, USA. This copula model is tested for its capability to forecast low streamflows that are of concern for hydrologic droughts. The second goal is to utilize the PCA-copula framework to develop drought trigger information. While copulas and PCA have been widely used individually, to the best of our knowledge, no prior studies exist for identifying drought triggers in this fashion. The details of the study watersheds are provided in section 'Study Area and Data Used'. The methodology adopted in the study, with details of principal components analysis, copula models, and drought trigger analysis, is explained in section 'Methodology'. These are followed by results and discussion in section 'Results and Discussion', and the summary and conclusions of the study in section 'Summary and Conclusions'.

2. Study Area and Data Used

2.1. Study Area

[11] The study was carried out over two watersheds in the state of Indiana, USA. Both watersheds form a part of the Ohio River Basin. The first watershed (WS I), extending from 38°34′N to 39°49′N and 85°24′W to 86°31′W, spreads over 6259 square kilometers. The second watershed (WS II) lies between 40°47′N and 41°24′N and between 85°08′W and 86°20′W, and extends over an area of 1657 square kilometers. The two watersheds are shown in Figure 1. The land use in these watersheds consists mainly of agricultural and forest lands, followed by public and urban built-up lands. With agriculture being the major economic activity in WS I and WS II, high irrigation water demands exist during the growing season. The choice of the watersheds was governed by the need to conduct drought analyses for locations where streamflows were not influenced by human activities.

Figure 1.

Map of the study watersheds WS I and WS II.

2.2. Data Used

[12] The 30 m resolution digital elevation models (DEMs) obtained from the USGS National Elevation Data set were used to delineate the watersheds. Though the choice of a coarser resolution affects the identification of drainage features in low relief landscapes, processing a 30 m DEM requires substantially less computational effort than processing a high-resolution DEM. Modeling the dependencies and analyzing drought triggers require a long record of historic observations. Therefore, monthly data with a minimum record length of 50 years were adopted in the present study. The various hydroclimatic variables used in the study are listed in Table 1. The 0.5° × 0.5° climate prediction center (CPC) global monthly data sets [Huang et al., 1996; Fan and van den Dool, 2004], available from 1948 onwards, were used. The land model was treated as a one-layer “bucket” water balance model when generating the CPC data sets. The data used in our study include modeled monthly soil moisture values, modeled monthly runoff values, observed monthly precipitation values, observed monthly temperature values, and modeled monthly evaporation values. The locations of the CPC grid points are marked by circles in Figure 1. Given the small watershed sizes determined by the need for unregulated streamflows, the number of CPC grid points directly over the study areas is quite small. The variables sea-level pressure, u-wind, and v-wind were obtained from the NCEP/NCAR Reanalysis-1 project data, at a spatial resolution of 2.5° × 2.5° [Kalnay et al., 1996]. The resultant of the u-wind and v-wind components was adopted as the wind speed variable in the present study. The time of concentration for these watersheds is on the order of days, which is short relative to the monthly time scale chosen for this study. Thus, variables were multiplied by the Thiessen weights at different grid points to obtain their spatially averaged values over the study watersheds.
The US Geological Survey (USGS) monthly streamflow data from 1958 to 2010 recorded at the USGS 03371500 (East Fork White River near Bedford, Indiana) were used for WS I, while the data at USGS streamflow gage 03328500 (Eel River near Logansport, Indiana) from 1948 to 2010 were used for WS II.
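The Thiessen-weighted spatial averaging described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the polygon areas, and the sample values are hypothetical, and the only assumption is that each grid point's weight is its Thiessen polygon area within the watershed divided by the total area.

```python
import numpy as np

def thiessen_average(values, polygon_areas):
    """Area-weighted mean over grid points.

    values: array of shape (n_months, n_points), one column per grid point.
    polygon_areas: Thiessen polygon area of each grid point inside the watershed.
    """
    weights = np.asarray(polygon_areas, dtype=float)
    weights = weights / weights.sum()          # Thiessen weights sum to 1
    return np.asarray(values, dtype=float) @ weights

# Hypothetical example: two months of precipitation at three grid points
precip = np.array([[10.0, 20.0, 30.0],
                   [40.0, 40.0, 40.0]])
areas = [1.0, 1.0, 2.0]                        # hypothetical polygon areas
basin_mean = thiessen_average(precip, areas)
```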

Table 1. List of Variables Used in the Study
Sl. No. | Variable | Unit | Period of Data
1 | Soil moisture | mm | 1948–2010
6 | Sea-level pressure | mbar | 1948–2010
7 | Wind speed | m/s | 1948–2010

3. Methodology

3.1. Dimensionality Reduction Using Principal Components Analysis

[13] The formulation of a dependence model between the seven predictor variables in Table 1 and streamflows is impractical even when using copulas. PCA was performed to transform the correlated n-dimensional (n = 7 here) predictor set into another set of n-dimensional uncorrelated vectors (called principal components). The PCs are arranged in order of their ability to explain the variability in the data. The conventional or standard PCA, which is formulated as an eigenvalue problem, was used for unsupervised dimensionality reduction [Jolliffe, 1986]. Prior to extracting the principal components, the mean value was subtracted from each of the predictors to obtain a series of predictor anomalies. The covariance matrix was obtained for the anomaly data sets, and the eigenvalues and eigenvectors of this covariance matrix were computed. The degree of dimensionality reduction achieved in the predictor set was determined by the variance explained by the first two principal components.
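The PCA steps described above (anomalies, covariance matrix, eigen-decomposition, explained variance) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name is hypothetical and the data are synthetic stand-ins for the seven predictors, not the study data.

```python
import numpy as np

def pca_covariance(X):
    """PCA of an (n_samples, n_vars) array via the covariance eigenvalue problem."""
    anomalies = X - X.mean(axis=0)             # subtract the mean of each predictor
    cov = np.cov(anomalies, rowvar=False)      # (n_vars, n_vars) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()        # fraction of variance per component
    scores = anomalies @ eigvecs               # principal components (uncorrelated)
    return eigvals, eigvecs, explained, scores

# Synthetic data (assumption): seven correlated predictors driven by two latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(600, 2))
mixing = rng.normal(size=(2, 7))
X = latent @ mixing + 0.05 * rng.normal(size=(600, 7))

eigvals, eigvecs, explained, scores = pca_covariance(X)
pcs_12 = scores[:, :2]                         # first two PCs retained for modeling
```

For a testing period, the same eigenvectors would be applied to new anomalies rather than refitting the decomposition.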

3.2. Asymmetric Archimedean Class of Copulas

[14] A copula is a function that models the dependence between multiple random variables, regardless of their marginals. A d-dimensional copula is a multivariate cumulative distribution function (CDF) $C$ defined on the unit d-dimensional space $[0,1]^d$ with uniform margins $U(0,1)$ and with the following properties: (i) $C(\mathbf{u}) = 0$ if at least one coordinate of $\mathbf{u}$ is equal to 0, and $C(\mathbf{u}) = u_k$ if all the coordinates of $\mathbf{u}$ are equal to 1 except $u_k$; (ii) for every $\mathbf{a}, \mathbf{b} \in [0,1]^d$ such that $\mathbf{a} \le \mathbf{b}$, $V_C([\mathbf{a}, \mathbf{b}]) \ge 0$, where $V$ is the C-volume [Nelsen, 2006]. The copula approach to dependence modeling has its roots in the theorem by Sklar [1959], according to which a d-dimensional CDF $H$ with univariate margins $F_1, \ldots, F_d$ is defined by:

$$H(x_1, \ldots, x_d) = C\big(F_1(x_1), \ldots, F_d(x_d)\big) = C(u_1, \ldots, u_d) \tag{1}$$

where $u_k = F_k(x_k)$ for $k = 1, \ldots, d$, with $U_k \sim U(0,1)$ if $F_k$ is continuous.

[15] Archimedean copulas are very popular, with both symmetric and asymmetric forms available in the literature [Joe, 1997; Nelsen, 2006]. They possess closed form expressions and allow modeling of a variety of different dependence structures. An Archimedean symmetric d-copula is of the form:

$$C(u_1, \ldots, u_d) = \varphi^{-1}\left(\sum_{k=1}^{d} \varphi(u_k)\right) \tag{2}$$

where the function $\varphi$ (called the generator of the copula) is a continuous strictly decreasing function from $[0,1]$ to $[0,\infty]$, such that $\varphi(0) = \infty$ and $\varphi(1) = 0$, and its inverse $\varphi^{-1}$ is completely monotone on $[0,\infty)$, that is, $\varphi^{-1}$ has derivatives of all orders which alternate in sign [Nelsen, 2006]:

$$(-1)^k \frac{d^k}{dt^k} \varphi^{-1}(t) \ge 0 \tag{3}$$

for all $t$ in $(0,\infty)$ and $k = 0, 1, 2, \ldots$

[16] In equation (2), if a certain $u_k$ is assigned the value 1, then the joint distribution of $(u_1, \ldots, u_{k-1}, u_{k+1}, \ldots, u_d)$ is obtained. Since $\varphi(u_k) = 0$ when $u_k = 1$, the (d − 1)-dimensional marginal of the symmetric Archimedean copula is also an Archimedean copula. The expressions for these (d − 1)-dimensional copulas are identical regardless of the choice of k. As a result, only one Archimedean 2-copula is required to model all mutual dependencies among the variables. This exchangeability property of symmetric copulas limits the nature of the dependence structures that can be modeled. Since the study took into account correlated variables such as streamflows and principal components that possess different bivariate dependence structures, a more general multivariate extension of the Archimedean 2-copula, namely the fully nested or asymmetric copula as described in Whelan [2004], was adopted here. This copula is given by d − 1 distinct generating functions as:

$$C(u_1, \ldots, u_d) = C_1\Big(u_d, C_2\big(u_{d-1}, \ldots, C_{d-1}(u_2, u_1) \ldots\big)\Big) = \varphi_1^{-1}\Big(\varphi_1(u_d) + \varphi_1 \circ \varphi_2^{-1}\big(\varphi_2(u_{d-1}) + \cdots + \varphi_{d-2} \circ \varphi_{d-1}^{-1}\big(\varphi_{d-1}(u_2) + \varphi_{d-1}(u_1)\big) \cdots \big)\Big) \tag{4}$$

[17] For example, in a fully nested 3-copula, two variables $u_1$ and $u_2$ are coupled using copula $C_2$, and the copula of $u_1$ and $u_2$ is coupled with $u_3$ by copula $C_1$. In general, there are $d!/2$ ways of coupling d variables. When the bivariate joint probability of two variables conditioned on the third variable is computed, different dependence structures are obtained based on the conditioning variable. Grimaldi and Serinaldi [2006] used asymmetric Archimedean copulas to model the trivariate joint distribution of flood peaks, volumes, and durations. A nested 3-copula was adopted in the present study to model the dependence between the monthly streamflow anomaly and the first two principal components of a set of predictor variables. There are two parameters for the nested 3-copula model, $\theta_1$ and $\theta_2$, such that $\theta_1 \le \theta_2$, implying a higher degree of dependence for the inner nested variables. It has been found that only two dependence structures can be reproduced for the three possible pairs [Grimaldi and Serinaldi, 2006]. When two variables $u_1$ and $u_2$ are both correlated with the third one $u_3$, and the degree of dependence between $u_1$ and $u_2$ is stronger than that of either $u_1$ or $u_2$ with $u_3$, the asymmetric three-dimensional model may be applied. The dependence between the variables is expressed in terms of Kendall's correlation coefficient, τ. Kendall's τ for a random vector $(X, Y)^T$ is simply the probability of concordance minus the probability of discordance [Embrechts et al., 2003]:

$$\tau(X, Y) = P\big[(X - \tilde{X})(Y - \tilde{Y}) > 0\big] - P\big[(X - \tilde{X})(Y - \tilde{Y}) < 0\big] \tag{5}$$

where $(\tilde{X}, \tilde{Y})^T$ is an independent copy of $(X, Y)^T$.
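As a small illustration of the concordance-minus-discordance definition of Kendall's τ, the sketch below counts concordant and discordant pairs directly and compares the result with `scipy.stats.kendalltau`. The helper name and the sample data are hypothetical, and the direct count assumes no ties in the data.

```python
import numpy as np
from itertools import combinations
from scipy.stats import kendalltau

def kendall_tau_naive(x, y):
    """Sample Kendall's tau: (concordant - discordant) / number of pairs (no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1        # pair ordered the same way in x and y
        elif sign < 0:
            discordant += 1        # pair ordered in opposite ways
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical data: 5 concordant pairs and 1 discordant pair out of 6
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]
tau_naive = kendall_tau_naive(x, y)
tau_scipy, _ = kendalltau(x, y)
```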

[18] The various asymmetric Archimedean copula families selected for the study, their permissible θ values, and dependence ranges are listed in Table 2.

Table 2. Asymmetric Archimedean Copula Families Used in the Study
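Family | Nested Copula $C(u_1, u_2, u_3)$ | Range of θ ($\theta_1 \le \theta_2$) | Range of τ | Reference
M3 | $-\theta_1^{-1} \log\Big\{1 - (1 - e^{-\theta_1})^{-1} \Big(1 - \big[1 - (1 - e^{-\theta_2})^{-1} (1 - e^{-\theta_2 u_1})(1 - e^{-\theta_2 u_2})\big]^{\theta_1/\theta_2}\Big) (1 - e^{-\theta_1 u_3})\Big\}$ | [0,∞) | [0,1] | Joe [1997]
M4 | $\big[(u_1^{-\theta_2} + u_2^{-\theta_2} - 1)^{\theta_1/\theta_2} + u_3^{-\theta_1} - 1\big]^{-1/\theta_1}$ | [0,∞) | [0,1] | Joe [1997]
M5 | $1 - \Big\{\big[(1 - u_1)^{\theta_2} \big(1 - (1 - u_2)^{\theta_2}\big) + (1 - u_2)^{\theta_2}\big]^{\theta_1/\theta_2} \big(1 - (1 - u_3)^{\theta_1}\big) + (1 - u_3)^{\theta_1}\Big\}^{1/\theta_1}$ | [1,∞) | [0,1] | Joe [1997]
M6 | $\exp\Big\{-\Big(\big[(-\log u_1)^{\theta_2} + (-\log u_2)^{\theta_2}\big]^{\theta_1/\theta_2} + (-\log u_3)^{\theta_1}\Big)^{1/\theta_1}\Big\}$ | [1,∞) | [0,1] | Joe [1997] and Embrechts et al. [2003]
M12 | $\Big\{\Big(\big[(u_1^{-1} - 1)^{\theta_2} + (u_2^{-1} - 1)^{\theta_2}\big]^{\theta_1/\theta_2} + (u_3^{-1} - 1)^{\theta_1}\Big)^{1/\theta_1} + 1\Big\}^{-1}$ | [1,∞) | [0.333,1] | Embrechts et al. [2003]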
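To make the nesting concrete, the following sketch evaluates a fully nested 3-copula of the M6 family, assuming M6 denotes the nested Gumbel-Hougaard form with generator $\varphi_i(t) = (-\log t)^{\theta_i}$. The function names are hypothetical. The relation τ = 1 − 1/θ used in the helper is the standard result for the bivariate Gumbel-Hougaard copula; under this nesting it applies with $\theta_2$ for the inner pair and $\theta_1$ for the two outer pairs.

```python
import numpy as np

def gumbel_m6_cdf(u1, u2, u3, theta1, theta2):
    """Nested Gumbel-Hougaard 3-copula C1(u3, C2(u1, u2)) with 1 <= theta1 <= theta2."""
    if not (1.0 <= theta1 <= theta2):
        raise ValueError("M6 requires 1 <= theta1 <= theta2")
    # Inner copula C2 couples u1 and u2 with the stronger parameter theta2
    inner = (-np.log(u1)) ** theta2 + (-np.log(u2)) ** theta2
    # Outer copula C1 couples C2(u1, u2) with u3 using theta1
    outer = inner ** (theta1 / theta2) + (-np.log(u3)) ** theta1
    return np.exp(-outer ** (1.0 / theta1))

def gumbel_tau(theta):
    """Kendall's tau of a bivariate Gumbel-Hougaard copula with parameter theta."""
    return 1.0 - 1.0 / theta

# Hypothetical parameter values, chosen only for illustration
c_value = gumbel_m6_cdf(0.3, 0.6, 0.8, theta1=1.2, theta2=2.0)
```

Setting $\theta_1 = \theta_2 = 1$ recovers the independence copula $u_1 u_2 u_3$, which is a convenient sanity check.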

3.3. Parameter Estimation

[19] Several copula parameter estimation methods are available in the literature namely, the method of moments, canonical maximum likelihood (CML) method, and inference from margins method. When one-parameter bivariate copulas are adopted, the popular approach is the simple method of moments based on inversion of Spearman's or Kendall's rank correlation [Genest and Favre, 2007]. In the multivariate-multiparameter case, this method becomes less elegant and may lead to inconsistencies. In such instances, a more natural estimation technique is the CML method [Genest et al., 1995; Kojadinovic and Yan, 2011]. The parameters of the five nested 3-copula families used in this study were estimated using the CML method. This method performs a nonparametric estimation of the marginals by using the respective scaled ranks. The dependence parameters θ1 and θ2 are obtained by maximizing the log-likelihood function l(θ) given by:

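$$l(\theta) = \sum_{i=1}^{n} \log c_{\theta}\big(\hat{u}_{i1}, \ldots, \hat{u}_{id}\big) \tag{6}$$

where $c_\theta$ denotes the density of the copula $C_\theta$, n is the number of observations, and $\hat{u}_{ik}$ (also denoted as $u_k$) is the rank-based nonparametric marginal probability of the ith observation $X_{ik}$ of the kth variable, given by:

$$\hat{u}_{ik} = \frac{1}{n+1} \sum_{j=1}^{n} I\big(X_{jk} \le X_{ik}\big) \tag{7}$$

where I(·) is the indicator function, returning 1 if the argument is true and 0 otherwise.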
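The rank-based marginal probabilities (scaled ranks) used by the CML method can be computed as in the sketch below. The function name and the sample array are hypothetical; `scipy.stats.rankdata` with `method="max"` is used so that each rank equals the number of observations less than or equal to the value, and the ranks are scaled by n + 1.

```python
import numpy as np
from scipy.stats import rankdata

def pseudo_observations(X):
    """Scaled ranks R_ik / (n + 1) for each column of an (n, d) data array."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # method="max" gives, for each value, the count of observations <= that value
    ranks = rankdata(X, method="max", axis=0)
    return ranks / (n + 1)

# Hypothetical data: three observations of two variables
sample = np.array([[3.0, 10.0],
                   [1.0, 30.0],
                   [2.0, 20.0]])
U = pseudo_observations(sample)
```

Dividing by n + 1 rather than n keeps every value strictly inside (0, 1), which avoids evaluating a copula density on the boundary of the unit cube.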

3.4. Goodness-of-Fit Tests for Asymmetric Copulas

[20] When more than one feasible copula family satisfies the dependence range for the given data, the final selection of a suitable copula is based on the best fit to observations. This fit can be assessed graphically by comparing the scatter plots of observed and simulated data in the case of bivariate distributions, but this becomes difficult for higher dimensions. Goodness-of-fit tests examine the null hypothesis $H_0: C \in \mathcal{C}_0$ for a copula class $\mathcal{C}_0$ against $H_1: C \notin \mathcal{C}_0$. These tests compare the distance between the empirical copula $C_n$ and an estimate $C_{\theta_n}$ of $C$ obtained under $H_0$ [Genest et al., 2009]. Formally, the goodness-of-fit tests are based on the statistic:

$$\mathbb{C}_n = \sqrt{n}\,\big(C_n - C_{\theta_n}\big) \tag{8}$$

where the empirical copula of the rank-based data $\hat{\mathbf{u}}_i = (\hat{u}_{i1}, \ldots, \hat{u}_{id})$, $i = 1, \ldots, n$, is defined by Deheuvels [1981] as:

$$C_n(\mathbf{u}) = \frac{1}{n} \sum_{i=1}^{n} I\big(\hat{u}_{i1} \le u_1, \ldots, \hat{u}_{id} \le u_d\big) \tag{9}$$

[21] In this study, the rank-based versions of the Cramér-von Mises and Kolmogorov-Smirnov statistics were used for testing the goodness-of-fit of the nested copulas. The Cramér-von Mises statistic Sn has been a popular goodness-of-fit test procedure for copula models [Genest et al., 2009]. The statistic Sn was determined using equation (10), substituting the value of the fitted copula $C_{\theta_n}$ evaluated from the copula expression at the rank-based observations $\hat{\mathbf{u}}_i$.

$$S_n = \sum_{i=1}^{n} \big\{C_n(\hat{\mathbf{u}}_i) - C_{\theta_n}(\hat{\mathbf{u}}_i)\big\}^2 \tag{10}$$

where Cn is the empirical copula computed as per equation (9).

[22] The Kolmogorov-Smirnov statistic Tn uses the maximum absolute distance between the empirical copula Cn and the copula $C_{\theta_n}$ evaluated with the estimated parameters to measure the fit of the copulas, as shown below [Genest et al., 2009].

$$T_n = \sqrt{n}\, \sup_{\mathbf{u} \in [0,1]^d} \big|C_n(\mathbf{u}) - C_{\theta_n}(\mathbf{u})\big| \tag{11}$$

[23] Additionally, the probability plots of the empirical distribution and the nested copula families were compared to assess the performance of copulas. The family providing the best fit based on the above criteria was selected for subsequent analysis.
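A minimal sketch of the two rank-based goodness-of-fit statistics is given below. The function names are hypothetical, the data are synthetic, and the supremum in the Kolmogorov-Smirnov statistic is approximated by the maximum over the sample points only, which is an assumption of this sketch. Two reference copulas (independence and comonotone) stand in for fitted families.

```python
import numpy as np

def empirical_copula(U, points):
    """Empirical copula C_n at each row of `points`, from pseudo-observations U (n, d)."""
    U = np.asarray(U)
    points = np.asarray(points)
    # below[i, j] is True when observation j is componentwise <= point i
    below = np.all(U[None, :, :] <= points[:, None, :], axis=2)
    return below.mean(axis=1)

def gof_statistics(U, copula_cdf):
    """Cramér-von Mises S_n and Kolmogorov-Smirnov T_n evaluated at the sample points."""
    n = U.shape[0]
    diff = empirical_copula(U, U) - copula_cdf(U)
    s_n = np.sum(diff ** 2)
    t_n = np.sqrt(n) * np.max(np.abs(diff))    # sup approximated over the sample
    return s_n, t_n

# Synthetic independent data (assumption), converted to scaled ranks
rng = np.random.default_rng(1)
raw = rng.normal(size=(300, 3))
n = raw.shape[0]
U = (np.argsort(np.argsort(raw, axis=0), axis=0) + 1) / (n + 1)

independence = lambda u: np.prod(u, axis=1)    # C(u) = u1 * u2 * u3
comonotone = lambda u: np.min(u, axis=1)       # C(u) = min(u1, u2, u3)
s_ind, t_ind = gof_statistics(U, independence)
s_com, t_com = gof_statistics(U, comonotone)
```

Because the synthetic data are independent, the independence copula should yield the smaller statistics, mirroring how the best-fitting family is chosen.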

3.5. Streamflow Forecasting and Drought Analysis

[24] The joint dependence modeled using the best copula was employed to estimate 1 month ahead streamflows. The probabilistic predictions of streamflows at different quantiles were made using the copula function. The expected values of monthly streamflows during the model development and model testing periods were computed. The range of forecasts was quantified by estimating predictions at 2.5% and 97.5% probabilities, i.e., the 95% confidence interval for the prediction. The forecasts of streamflow were analyzed to identify the occurrence of extremes, particularly droughts, in the study area. Given the focus on streamflows in this study, hydrological droughts were characterized by the standardized streamflow index, which is similar to the SPI introduced by McKee et al. [1993] for meteorological drought analysis. The long-term streamflow record was fitted to a gamma probability distribution and then transformed to a standard normal distribution through the quantiles, so that the mean standardized index for a certain location and particular period (1 month) is zero [Edwards and McKee, 1997]. A positive value of the index shows the degree of wetness, while a negative value indicates the severity of streamflow deficit. The ranges of this drought index for different hydrological conditions, labeled exceptionally dry (D4) to exceptionally wet (W4), are presented in Table 3. This drought severity classification is based on SPI values. The streamflows estimated using the copula were used for the prediction of droughts in the study areas.

Table 3. Range of Drought Index for Different Hydrological States
State | Description | Drought Index
D4 | Exceptional drought | −2 or less
D3 | Extreme drought | −1.6 to −1.9
D2 | Severe drought | −1.3 to −1.5
D1 | Moderate drought | −0.8 to −1.2
D0 | Abnormally dry | −0.5 to −0.7
Normal | Normal condition | −0.4 to 0.4
W0 | Abnormally wet | 0.5 to 0.7
W1 | Moderately wet | 0.8 to 1.2
W2 | Severely wet | 1.3 to 1.5
W3 | Extremely wet | 1.6 to 1.9
W4 | Exceptionally wet | 2 or more
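The standardized index and the classification in Table 3 can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the flows are synthetic, the gamma location parameter is fixed at zero, the index is rounded to one decimal before classification, and a single gamma distribution is fitted to the whole series (a per-month analysis would apply the same function to each calendar month separately).

```python
import numpy as np
from scipy import stats

def standardized_index(flows):
    """Fit a gamma distribution, then map its CDF values to standard normal quantiles."""
    flows = np.asarray(flows, dtype=float)
    shape, loc, scale = stats.gamma.fit(flows, floc=0)     # location fixed at 0 (assumption)
    cdf = stats.gamma.cdf(flows, shape, loc=loc, scale=scale)
    cdf = np.clip(cdf, 1e-6, 1 - 1e-6)                     # avoid infinite quantiles
    return stats.norm.ppf(cdf)

def drought_state(index):
    """Return the Table 3 class for an index value rounded to one decimal."""
    z = round(float(index), 1)
    dry = [(-2.0, "D4"), (-1.6, "D3"), (-1.3, "D2"), (-0.8, "D1"), (-0.5, "D0")]
    wet = [(2.0, "W4"), (1.6, "W3"), (1.3, "W2"), (0.8, "W1"), (0.5, "W0")]
    for bound, label in dry:
        if z <= bound:
            return label
    for bound, label in wet:
        if z >= bound:
            return label
    return "Normal"

# Synthetic monthly flows (assumption), not the USGS records used in the study
rng = np.random.default_rng(42)
flows = rng.gamma(shape=2.0, scale=50.0, size=1000)
z = standardized_index(flows)
states = [drought_state(v) for v in z]
```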

3.6. Analysis for Drought Triggers

[25] The occurrence of hydrological extremes in the study areas was highly correlated with the local hydroclimatic variables at 1 month lead times, and as such short-term predictions of droughts could be achieved. The joint dependence information contained in the copula was exploited to obtain the expected values of the climate precursor anomalies conditioned on a streamflow anomaly. This allowed for identification of patterns in the precursors that could trigger hydrological droughts of different categories.

4. Results and Discussion

4.1. Principal Components Analysis

[26] The anomalies of hydroclimatic predictors and streamflows at monthly scale were obtained by subtracting their respective monthly means. The dependence between the streamflow anomalies and the first two principal components of the predictor anomalies was represented by a joint asymmetric copula in the present study and was used to predict streamflows. The data from January 1958 to December 1993 were used for developing the statistical model for WS I, whereas the model development period for WS II was from January 1948 to December 1990. Thus, two thirds of the data were used for model training and the remainder was used for evaluating model performance.

[27] Starting from the large suite of potential predictors, PCA was used for dimensionality reduction. The results of principal components analysis performed on the predictor variables for the two watersheds are given in Table 4. As the first two components (PCs) were found to explain more than 98% of the variance, only these were selected for modeling streamflows. Next, the correlation values of different pairs (streamflow anomaly and two PCs) for different lags (1–3 months) were computed. PCs from predictor variables lagged by only 1 month were adopted for streamflow forecasting, as significant correlations were observed at this lag for both WS I and WS II.

Table 4. Principal Components and the Explained Variance
Principal Component | Eigenvalues | Explained Variance (%)

4.2. Analysis of Asymmetric Archimedean Copula

[28] Modeling the joint dependence between the streamflow anomaly, PC-1, and PC-2 requires that the nature of the association between them be identified. The scatter plots of the pairs of predictand and predictor variables indicated a higher degree of dependence between the streamflow anomaly and PC-1, with correlations of 0.43 and 0.37 for WS I and WS II, respectively. The correlation between the streamflow anomaly and PC-2 is 0.08 and 0.02 for WS I and WS II, respectively, whereas the first two PCs are uncorrelated by nature. Correlations between higher order PCs are very close to zero.

[29] The scatter plots show that the pairs of variables have different bivariate dependence structures that cannot be modeled by symmetric copulas. The scatter plots are not included in the paper for the sake of brevity. The Kendall's τ values of the various pairs of these variables are listed in Table 5. Given this nature of dependence, a class of asymmetric Archimedean copulas was adopted wherein the streamflow anomaly and PC-1 were coupled by a copula C2, and this structure was then associated with PC-2 by another copula C1.

Table 5. Kendall's τ and Parameter θ for Different Copulas
τ12 | τ13 | τ23 | Nested Copula Family | Maximum Likelihood Estimate θ1 | Maximum Likelihood Estimate θ2 | Maximum Likelihood Value

[30] From the streamflow anomaly values and the two PCs, their rank-based nonparametric marginal probabilities u1, u2, and u3, respectively, were calculated for modeling the copula function. The properties of asymmetric Archimedean copulas are described in section 'Asymmetric Archimedean Class of Copulas'. As the study data set did not conform to the requirement of the M12 nested 3-copula family that $\tau_{12}, \tau_{13}, \tau_{23} \in [0.333, 1]$ (Table 2), this copula family was rejected for both study watersheds.

4.3. Parameter Estimation

[31] The parameters of the nested copula were estimated using the CML method [Genest et al., 1995; Kojadinovic and Yan, 2011]. The parameter values must conform to the range specified for each class of copula. The condition that the more nested variables have a stronger degree of dependence among them, that is, $0 \le \theta_1 \le \theta_2$, was satisfied by the M3 and M4 families, and the condition $1 \le \theta_1 \le \theta_2$ was satisfied by the M5 and M6 families of copula. The estimated values of the copula parameters and the maximum likelihood value obtained for each of the copula families are listed in Table 5.

4.4. Goodness-of-Fit Tests

[32] From the copula families evaluated in the study, the best copula was selected using popular goodness-of-fit measures. The probability distribution functions of the different copula families and the empirical copula are plotted in Figure 2. The performance statistics computed between the empirical and estimated copulas are given in Table 6. The M6 copula family was found to have the lowest values of the Sn and Tn statistics for WS I. The goodness-of-fit for this copula family is also evident from Figure 2a. The lowest values of Sn and Tn were also obtained for the M6 copula in the case of WS II, where it provided the best distribution fit among all copula models in Figure 2b. Plots in Figure 3 show the performance of only the M6 copula for different months, suggesting that the dependence structure of the first two principal components of anomalies of the hydroclimatic variables and streamflow anomalies could be modeled by the same M6 copula family for all months in both study watersheds.

Figure 2.

Comparison plots of probability distributions of different copula families used in (a) WS I and (b) WS II.

Table 6. Goodness-of-Fit Test Statistics for Different Copulas
Nested Copula Family | Sn | Tn
Figure 3.

Plots showing M6 copula fit for each month in (a) WS I and (b) WS II.

4.5. Streamflow Prediction Using Copula

[33] Given u2 and u3 (the rank-based values of PCs extracted from the predictors), the probability distribution of u1 (derived from streamflow anomalies) was generated using the M6 copula model (Table 2). The streamflow anomalies corresponding to different quantiles were calculated from this CDF. The rank-based nonparametric marginal probabilities at 0.025, 0.5, and 0.975 quantiles were calculated and transformed into the streamflow anomaly values; subsequently, the estimates of streamflows for the next month were obtained. Streamflows simulated for the model development period were compared with the observed flows for evaluating model performance.
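The last step of this procedure, mapping a marginal probability back to a streamflow anomaly and then to a streamflow, might look like the sketch below. It assumes that the marginal is inverted empirically (linear interpolation between sorted historical anomalies) and that anomalies are additive departures from the long-term monthly mean. The conditional probabilities that the copula model would supply are replaced here by hypothetical numbers.

```python
import numpy as np

def anomaly_from_probability(p, historical_anomalies):
    """Empirical inverse CDF: anomaly with non-exceedance probability p."""
    return np.quantile(np.asarray(historical_anomalies, dtype=float), p)

def streamflow_forecast(u1_values, historical_anomalies, monthly_mean):
    """Add the long-term monthly mean back to the anomalies (assumed additive)."""
    anomalies = anomaly_from_probability(np.asarray(u1_values), historical_anomalies)
    return monthly_mean + anomalies

# Hypothetical inputs: historical anomalies (cumecs), the u1 values at which
# the conditional CDF reaches 0.025, 0.5, and 0.975, and a monthly mean flow
historical = [-12.0, -7.5, -3.0, -1.0, 0.5, 2.0, 4.5, 9.0, 15.0]
u1_at_quantiles = [0.10, 0.40, 0.80]
lower, expected, upper = streamflow_forecast(u1_at_quantiles, historical, 30.0)
```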

[34] The model developed for WS I was tested for the period January 1994 to December 2010, while the model for WS II was tested for the period 1991–2010. The PCA coefficients obtained for the predictors during the model development period were also used to obtain the PCs for the testing period. The predicted streamflow values for the model development and testing periods are compared with the corresponding observed flows in Figures 4a, 4b, 5a, and 5b for the two watersheds. The uncertainty in the predictions is quantified by the interquantile range of the predicted streamflows. In WS I, most of the observed flows lie within the predicted range during the model development period. In particular, low flows in the late 1960s and 1970s are in close agreement with the expected values of streamflows obtained from the model (Figure 4a). The low flows during the testing period, especially in the 1990s, also match well with the expected values (Figure 4b). However, this is not the case with high flows in WS I during either the training or the testing period, where the 1 month lead forecasts underestimate the observed peaks. In WS II, the recorded flows fall within the range of probabilistic predictions offered by the developed model. In Figure 5a, the predicted low flows in the 1950s, 1960s, and 1980s conform to observations. The model also performed well for low flow predictions during the testing period (Figure 5b). The peak flows for both training and testing periods were typically underestimated, perhaps because of the small number of training samples in this range. Additionally, the box plots for the model development and testing periods in WS I and WS II (Figures 4c and 5c, respectively) indicate that although the model performance is not satisfactory for high flows, low flows are estimated well. Overall, the predictive capability of the model was found to favor low flow conditions, prompting us to explore the development of droughts over the two study watersheds. The coefficient of determination (R2) values obtained were 0.64 and 0.53 for the model development and testing periods, respectively, in WS I, and 0.58 and 0.50, respectively, in WS II. Comparisons with state-of-the-art statistical models [Tripathi and Govindaraju, 2008] using the same set of predictors for streamflow showed similar performance, but the results are not reported here for brevity.
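Two of the performance checks mentioned above can be sketched as follows: the coefficient of determination and the share of observations that fall inside the 0.025 to 0.975 prediction band. The text does not state which definition of R2 was used, so the squared Pearson correlation is assumed here. The flow values are hypothetical.

```python
import numpy as np

def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted flows."""
    r = np.corrcoef(observed, predicted)[0, 1]
    return float(r**2)

def interval_coverage(observed, lower, upper):
    """Fraction of observations inside [lower, upper]."""
    observed, lower, upper = map(np.asarray, (observed, lower, upper))
    return float(np.mean((observed >= lower) & (observed <= upper)))

# Hypothetical monthly flows (cumecs) and prediction band, illustration only
obs = [12.0, 30.0, 55.0, 20.0, 8.0]
pred = [14.0, 26.0, 40.0, 22.0, 9.0]
low = [6.0, 15.0, 25.0, 12.0, 4.0]
high = [25.0, 45.0, 52.0, 35.0, 16.0]
r2 = r_squared(obs, pred)
coverage = interval_coverage(obs, low, high)
```

In this toy example, the one observation outside the band is the peak flow, which mirrors the underestimation of high flows described above.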

Figure 4.

(a, b) Comparison plots of observed and predicted streamflows in WS I during (a) model development period and (b) model testing period (lower and upper quantile curves correspond to 0.025 and 0.975 quantiles respectively), and (c) box plots for observed and predicted (expected) values of monthly streamflows during model development and testing periods in WS I. On each box, the central mark is the median, the edges of the box are the 25th and 75th percentiles, the whiskers extend to the most extreme data points not considered outliers, and outliers are plotted individually with a plus symbol.

Figure 5.

(a, b) Comparison plots of observed and predicted streamflows in WS II during (a) model development period and (b) model testing period (lower and upper quantile curves correspond to 0.025 and 0.975 quantiles respectively), and (c) box plots for observed and predicted (expected) values of monthly streamflows during model development and testing periods in WS II. On each box, the central mark is the median, the edges of the box are the 25th and 75th percentiles, the whiskers extend to the most extreme data points not considered outliers, and outliers are plotted individually with a plus symbol.

4.6. Drought Analysis

[35] The results of the drought analysis carried out for the model development period (January 1948 to December 1993) for WS I are shown in Figure 6a. There were few occurrences of the D3 and D4 classes of drought during the model development period, and mild (D0) and moderate (D1) droughts prevailed in most of the drought months. The drought index values obtained from the expected streamflows provided good forecasts of dry as well as wet conditions. The drought analysis was then carried out for the testing period and compared with the observed conditions. The testing period was marked by few occurrences of the D1 and D2 classes of drought. Wet conditions dominated during this period, and most of them were underestimated by the model (Figure 6b). The drought indices calculated for WS II (Figures 7a and 7b) also indicate that the drought categories were predicted better than the wet categories. The sequences of drought months in different subperiods of the model development and testing periods were also well predicted.

Figure 6.

(a) Drought index values during the model development period in WS I and (b) drought index values during the model testing period in WS I.

Figure 7.

(a) Drought index values during the model development period in WS II and (b) drought index values during the model testing period in WS II.

[36] Apart from visual inspection, the model performance for multiple-category classification of streamflows was assessed by computing the contingency coefficient C proposed by Pearson [1904]. This coefficient measures the degree of association between multiple categories in a contingency table classifying N samples [Gibbons and Chakraborti, 2011] and is expressed mathematically as:

$$C = \sqrt{\frac{Q}{Q + N}} \tag{12}$$

where Q is a statistic that tests the null hypothesis that there is no association between observed and predicted categories. It is expressed as:

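$$Q = \sum_{i=1}^{r} \sum_{j=1}^{k} \frac{\left(N X_{ij} - X_{i\cdot} X_{\cdot j}\right)^{2}}{N X_{i\cdot} X_{\cdot j}} \tag{13}$$

where r and k are the numbers of categories, $X_{ij}$ is the number of cases falling in the ith observed and jth predicted category, $X_{i\cdot} = \sum_{j=1}^{k} X_{ij}$, and $X_{\cdot j} = \sum_{i=1}^{r} X_{ij}$.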

[37] The statistic Q approximately follows a chi-square distribution with degrees of freedom (dof) equal to (r − 1)(k − 1). Thus, the null hypothesis (no association) can be rejected if the p value is very low. Higher values of C correspond to better association. The value of C cannot exceed 1 theoretically and has an upper bound of $C_{max} = \sqrt{(t - 1)/t}$, where $t = \min(r, k)$ [Gibbons and Chakraborti, 2011]. The ratio C/Cmax is often used as a measure of degree of association.
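The quantities Q, C, and C/Cmax can be computed from a contingency table as sketched below. The table values are illustrative, not those of Table 7, and the function names are hypothetical. The sketch uses the equivalent form of Q, the sum over cells of (observed − expected)²/expected with expected = (row total × column total)/N, and assumes every row and column total is positive. The p value helper uses a closed form that holds only for even degrees of freedom, which covers the 3 × 3 case (dof = 4).

```python
import math
import numpy as np

def contingency_association(table):
    """Return Q, C, C/Cmax, and dof for an r x k contingency table."""
    X = np.asarray(table, dtype=float)
    N = X.sum()
    expected = X.sum(axis=1, keepdims=True) * X.sum(axis=0, keepdims=True) / N
    Q = float(((X - expected) ** 2 / expected).sum())
    C = math.sqrt(Q / (Q + N))
    t = min(X.shape)
    C_max = math.sqrt((t - 1) / t)
    dof = (X.shape[0] - 1) * (X.shape[1] - 1)
    return Q, C, C / C_max, dof

def chi2_sf_even_dof(q, dof):
    """Upper-tail chi-square probability; closed form for even dof only."""
    if dof <= 0 or dof % 2:
        raise ValueError("closed form requires a positive even dof")
    half = q / 2.0
    return math.exp(-half) * sum(half**j / math.factorial(j) for j in range(dof // 2))

# Illustrative 3 x 3 table (rows: observed dry/normal/wet, columns: predicted)
example = [[40, 12, 3],
           [10, 60, 15],
           [2, 14, 44]]
Q, C, ratio, dof = contingency_association(example)
p_value = chi2_sf_even_dof(Q, dof)
```

A table with all counts on the diagonal gives C/Cmax = 1, while a table with identical rows gives Q = 0 and no evidence of association.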

[38] In order to ensure sufficient data for robust statistics, a contingency table with three categories (dry, normal, and wet) was prepared. The extreme categories were merged to ensure that enough observations and predictions are available in every category. The contingency tables for WS I and WS II are shown in Table 7. Thus, both r and k are 3, and dof is 4. The statistic Q, the contingency coefficient C, and the measure of degree of association C/Cmax are shown at the end of Table 7. The low p values for the statistic Q indicate that the null hypothesis of no association between observed and predicted categories should be rejected. The degree of association was found to be reasonable for both watersheds during the model development as well as the testing periods.

Table 7. Contingency Table and Degree of Association Between Observed and Predicted Drought Categories for WS I (Top) and WS II (Bottom)
WS I
Predicted Category | Observed Category, Model Development Period (1958–1993) | Observed Category, Model Testing Period (1994–2010)
p value | <0.0001 | <0.0001
WS II
Predicted Category | Observed Category, Model Development Period (1948–1990) | Observed Category, Model Testing Period (1991–2010)
p value | <0.0001 | <0.0001

4.7. Analyzing the Drought Triggers

[39] Using the modeled asymmetric copula dependence function, the conditions that trigger hydrological droughts or extremes in the watershed were examined. The triggers for various streamflow conditions were generated using the conditional copula. The procedure is illustrated as follows. Given a certain streamflow anomaly quantile α, let inline image and inline image correspond to the first and second PCs conditioned on the streamflow anomaly value. The quantities inline image and inline image are obtained from the M6 copula for the particular watershed. Since these two PCs explain over 98% of the total variation, the other principal components remain unaffected by the choice of the streamflow quantile. Our goal is to find the expected values of the precursor variables inline image that would correspond to this particular streamflow quantile. If aij are the PCA coefficients for the data set, then the following equation provides the conditional expectation of the precursor variables:

display math(14)

where aij is the ijth element of the matrix [A], inline image and inline image are computed from the M6 copula, and inline image are simply the expected values of the principal components (≈0).
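The back-transformation from conditional PC values to predictor anomalies can be sketched as below. The orientation of the coefficient matrix is an assumption: here the columns of `A` are the eigenvectors of the covariance matrix of the predictor anomalies, so that PCs are obtained as (X − mean) A and anomalies are recovered as mean + A z. The remaining PCs are set to their expected value of zero. The data and function names are hypothetical, and the conditional PC values would come from the copula model rather than from the placeholder numbers used here.

```python
import numpy as np

def fit_pca(X):
    """Return the mean, eigenvector matrix A (columns sorted by variance), eigenvalues."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending order for symmetric matrices
    order = np.argsort(eigvals)[::-1]        # largest variance first
    return mean, eigvecs[:, order], eigvals[order]

def expected_predictors(mean, A, pc1, pc2):
    """Expected predictor anomalies given conditional PC-1 and PC-2 values."""
    z = np.zeros(A.shape[1])                 # PCs beyond the second stay at 0
    z[0], z[1] = pc1, pc2
    return mean + A @ z

# Hypothetical anomaly matrix: 120 months x 7 predictors
rng = np.random.default_rng(42)
anomalies = rng.normal(size=(120, 7))
mean, A, eigvals = fit_pca(anomalies)
x_alpha = expected_predictors(mean, A, pc1=-1.2, pc2=0.4)  # placeholder PC values
```

The same `mean` and `A` from the development period would be reused to project the testing-period predictors, consistent with the procedure described in section 4.5.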

[40] The expected values of PC-1 and PC-2 conditioned on various streamflow anomaly quantiles (corresponding to different α values) are shown in Table 8 for both watersheds.

Table 8. Expected Principal Component Values for Various Quantiles of Streamflow
Streamflow Anomaly Quantile | Streamflow Anomaly (cumecs) | Expected PC-1 Value | Expected PC-2 Value

[41] The expected anomaly values of all the predictor variables corresponding to different streamflow anomalies are shown in Table 9. Low flows correspond to smaller values of soil moisture, temperature, precipitation, evaporation, and runoff of the previous month in both watersheds. The sea-level pressure anomaly varied inversely with the streamflow anomaly for WS I and WS II, suggesting that an increase in sea-level pressure above the long-term mean can enhance the chances of drought in these regions. An increase in wind speed was found to trigger droughts in WS I, in contrast to the trend observed for WS II. The dissimilar trends in some variables suggest that drought triggers are likely to be specific to each watershed.

Table 9. Conditional Expectations (in Terms of Anomalies of Hydro-Climatic Variables) Associated With Streamflow Anomaly Values
Expected Streamflow Anomaly (cumecs) | Hydro-Climatic Triggers in Terms of Expected Values of Anomalies
Soil Moisture Anomaly (mm) | Temperature Anomaly (°C) | Precipitation Anomaly (mm) | Evaporation Anomaly (mm) | Sea-Level Pressure Anomaly (mbar) | Wind Speed Anomaly (m/s) | Runoff Anomaly (mm)
WS I
WS II

[42] The conditional expectations of anomalies of different precursors corresponding to different streamflow quantiles (Table 9) were utilized to develop potential triggers for each drought category. The long-term monthly means of hydroclimatic variables were added to their expected anomaly values to carry out this analysis. The resulting precursor values were then associated with the 1 month lead drought index values. From the expected streamflow anomaly, streamflows for each month were computed and corresponding drought indices were calculated. The trigger analysis is limited to low flow conditions corresponding to droughts reflecting the better model performance for flows in this range. The plots in Figures 8a and 8b show the expected precursor range in each month obtained for different drought classes for WS I and WS II, respectively. If the values of the hydroclimatic variables fall within the suggested range for any class of drought, then that drought would likely occur in the succeeding month. For WS I, soil moisture, precipitation, and runoff are able to offer a range of predictor values for different drought categories as shown in Figure 8a. Some months (May to July) do not show any range of potential predictor values for certain drought classes, implying the likelihood of such droughts being very low in those periods in WS I. While soil moisture, precipitation, and runoff show some variability with drought classes in WS II, the other variables stay within a very tight band for any given month (Figure 8b). Thus, only these three variables are capable of resolving amongst different drought classes for the study watersheds. Low variability is manifested in the expected anomaly values of temperature, evaporation, sea-level pressure, and wind speed in Table 9.
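Applying such trigger ranges operationally amounts to checking, for a given month, which variables fall inside the range associated with each drought class. A minimal sketch follows. The ranges, variable names, and observed values are invented for illustration and are not read from Figure 8.

```python
def matching_triggers(observed, trigger_ranges):
    """For each drought class, list the variables whose value lies in its range."""
    result = {}
    for drought_class, ranges in trigger_ranges.items():
        result[drought_class] = [
            name for name, (low, high) in ranges.items()
            if name in observed and low <= observed[name] <= high
        ]
    return result

# Hypothetical trigger ranges for one month (illustrative numbers only)
ranges_for_month = {
    "D0": {"soil_moisture_mm": (380.0, 420.0),
           "precipitation_mm": (60.0, 75.0),
           "runoff_mm": (18.0, 24.0)},
    "D1": {"soil_moisture_mm": (340.0, 380.0),
           "precipitation_mm": (45.0, 60.0),
           "runoff_mm": (12.0, 18.0)},
}
observed_this_month = {"soil_moisture_mm": 360.0,
                       "precipitation_mm": 50.0,
                       "runoff_mm": 20.0}

hits = matching_triggers(observed_this_month, ranges_for_month)
# Class supported by the largest number of variables
most_supported = max(hits, key=lambda c: len(hits[c]))
```

Counting how many variables agree on a class is one simple way to act on the idea that several variables falling within their ranges together give more confidence than any single variable.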

Figure 8.

Contour plots showing expected ranges of different hydroclimatic variables as precursors to droughts in (a) WS I and (b) WS II.

[43] The precursor ranges developed in this manner were validated by means of scatter plots between the observed and modeled values of variables over the model development and testing periods (Figures 9a and 9b) for all classes of droughts. These scatter plots demonstrate good agreement between the observed and modeled triggers in both watersheds. The scatter is less in the case of soil moisture, precipitation, runoff, evaporation, and temperature in both watersheds. Among the predictors, wind speed shows the most scatter making it the least reliable precursor for both watersheds. The modeled triggers for soil moisture, precipitation, and runoff values are underpredicted compared to observations during calibration as well as validation. Additionally, correlation values for all the trigger variables were calculated and tabulated in Table 10. High correlations in some predictors (for example, temperature and evaporation in WS I and WS II), however, were not useful as they were found incapable of resolving among the different drought categories.

Figure 9.

Scatter plots of different hydroclimatic precursors (modeled versus observed) for model development and testing periods in (a) WS I and (b) WS II.

Table 10. Correlation Values Between Observed and Modeled Drought Precursors
Hydroclimatic Precursor | WS I | WS I | WS II | WS II
Soil moisture | 0.57 | 0.58 | 0.41 | 0.44
Sea level pressure | 0.58 | 0.43 | 0.50 | 0.52
Wind speed | 0.45 | 0.56 | 0.48 | 0.52

[44] The results indicate that drought trigger information retrieved in this manner has potential for applications in hydrologic drought preparedness. Even though individual variables show scatter, if multiple variables fall close to their trigger values, the confidence in their effectiveness as hydrologic drought triggers will improve. Hence, the combined behavior of predictor variables needs to be considered when estimating potential drought triggers.

5. Summary and Conclusions

[45] This study provides a novel method for developing drought triggers by combining the strengths of PCA for dimensionality reduction and copulas for modeling the joint dependence between variables. The first two PCs were found capable of explaining over 98% of the variability in the anomaly set of predictor variables for both study watersheds. The joint dependence of the streamflow anomaly and the two principal components was modeled by a scale-free association using a suitable asymmetric 3-copula selected based on goodness-of-fit statistics. The developed model was first tested for forecasting streamflows in the two study watersheds. The study focused on 1 month lead predictions because correlations between the principal components and the streamflow anomaly diminished rapidly beyond a lag of 1 month. Underprediction of peak flows was observed in the results for both watersheds, but low streamflows were reasonably predicted, allowing hydrologic drought studies. Drought index values based on standardized flows were computed to identify the occurrences of droughts during the model development and testing periods in the two study regions.

[46] The conditional dependence of the principal components PC-1 and PC-2 on streamflow anomaly was used to determine the drought triggers in the two watersheds. The precursors to droughts were expressed in terms of the anomaly values of the climatic variables. Negative anomalies of soil moisture, precipitation, evaporation, temperature, and runoff, and increased sea-level pressure and wind speeds were obtained as potential drought triggers for WS I. Similarly, increased sea level pressure conditions and reduced soil moisture, precipitation, evaporation, temperature, runoff, and wind speeds from their respective long-term means led to drought conditions in WS II.

[47] Further, the patterns of various hydroclimatic variables as potential precursors to different categories of droughts were examined for the two watersheds. The ranges of predictor values that led to different drought conditions were estimated from the expected precursor values for low streamflow quantiles. The trigger analysis results were validated by comparing the observed hydroclimatic variables with their expected trigger values for the model development and testing periods. The correlation values computed indicated that the analysis could yield reliable information on the pattern of drought triggers for both the watersheds.

[48] The following conclusions are derived from this study:

[49] 1. Drought triggers are likely to be specific to watersheds. Even though the two study watersheds are located in the same part of the world and have similar land use distribution, local conditions influence streamflows especially at monthly time scales.

[50] 2. Using copulas, conditional expectations of first two PCs based on different quantiles of streamflow anomalies provide a method for estimating drought triggers. Among all the precursors, soil moisture, precipitation, and runoff showed the greatest potential for assessing different classes of droughts for both watersheds. The other variables, despite showing strong seasonal trends, demonstrated little capability for resolving the different drought classes.

[51] 3. Validation results for triggers over all drought classes show results with different degrees of variability. Even with the scatter present for single (individual) variables, if triggers for multiple variables fall within expected ranges, the confidence in the trigger would improve. Hence, it is recommended that precursors for droughts be examined in combination by using multiple input variables.

[52] Even though the results and conclusions are specific to the study watersheds, the method shows promise for application to other watersheds. An important limitation is that the level of dimensionality reduction that can be achieved in different watersheds cannot be known a priori. If multiple predictors were to be important, the model for constructing the joint distribution would be too complex for practical purposes except in limited cases modeled using Gaussian copulas. Data limitations also continue to be a serious challenge for many hydrologic studies. Large amounts of data are needed to capture trigger behavior in drought studies. Although the model development and testing periods were short in this study, the methodology performed reasonably well for the record lengths available. Future efforts employing more hydroclimatic variables and different watersheds will help develop a better understanding of trigger mechanisms for droughts.


[53] The first author, as a BOYSCAST Fellow, acknowledges the support of Department of Science and Technology (DST), Govt. of India, for offering the fellowship. He also wishes to acknowledge the support of School of Civil Engineering, Purdue University, USA where this study was undertaken. Studies of the second and third authors were supported in part by the National Science Foundation under Grants ACI 0753116 and AGS 1025430, and by USDA NIFA award number 2011-67019-21122. This support is gratefully acknowledged. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or the USDA.