## 1. Introduction

[2] The generation of synthetic hydrological time series for model testing or water resource planning is an important research area in stochastic hydrology. This is because such data sets permit more robust testing of hydrological models, can be used to construct water planning scenarios, and can also be used to place confidence limits on hydrological forecasts based on a bootstrap methodology. For univariate time series, there are a great many applications of autoregressive (AR) and autoregressive moving average (ARMA) models [*Box et al.*, 1994] in hydrology, as well as nonparametric methods that avoid imposing the assumption of Gaussianity [*Lall and Sharma*, 1996]. Of course, for many hydrological and hydroclimatological series, it is the relation between discharges on neighboring rivers or between rainfall and discharge that is of significant interest (e.g., for determining flood risk due to the sequencing of peak discharges on different tributaries or lags in the catchment system from input to output, respectively). Hence, multivariate generalizations of these methods are required that, at least, retain the correlation/covariance structure between series. Classical approaches to this problem have a long history [*Pegram and James*, 1972; *Valencia and Schaake*, 1973; *Grygier and Stedinger*, 1988].

[3] However, given the difficulty of estimating the parameters of a multivariate ARMA model, a useful trick is to transform the original data into decorrelated time series, meaning that univariate estimation may be attempted on the transformed series. Hence, one procedure for multivariate synthetic data generation is as follows:

[4] Transform the multivariate data array, **Z**, into a set of decorrelated series, **PC**, using a technique such as principal component analysis (PCA), which is discussed in section 1.1;

[5] Perform some form of randomization method (e.g., bootstrapping) on each decorrelated series;

[6] Invert the PCA to generate a set of synthetic series, where the covariance structure in the original data will be largely retained, as the covariance matrix underpins the PCA method.
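The three-step procedure can be sketched in a few lines of numpy (an illustrative stand-in only: the synthetic data, the plain bootstrap resampler, and all variable names are our own assumptions, not an implementation from the works cited):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: m = 3 correlated series of length N = 500,
# standing in for the hydrological data matrix Z.
N, m = 500, 3
mix = np.array([[1.0, 0.5, 0.2],
                [0.5, 1.0, 0.3],
                [0.2, 0.3, 1.0]])
Z = rng.standard_normal((N, m)) @ mix

# Step 1: PCA -- decorrelate via the eigenvectors of the covariance matrix.
Zc = Z - Z.mean(axis=0)            # mean-centre the columns
C = np.cov(Zc, rowvar=False)       # m x m covariance matrix
eigvals, E = np.linalg.eigh(C)     # unit-norm eigenvectors in columns of E
order = np.argsort(eigvals)[::-1]
E = E[:, order]                    # descending eigenvalue order
PC = Zc @ E                        # decorrelated principal component series

# Step 2: randomize each decorrelated series independently
# (a plain bootstrap here; any univariate resampler could be substituted).
PC_boot = np.column_stack(
    [rng.choice(PC[:, j], size=N, replace=True) for j in range(m)]
)

# Step 3: invert the PCA so the synthetic series largely retain the
# covariance structure of the original data.
Z_syn = PC_boot @ E.T + Z.mean(axis=0)
```

Because the bootstrap acts on each component separately, only the covariance structure encoded by the PCA rotation is reimposed; higher-order dependence is not, which motivates the ICA refinement discussed below.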

[7] An important development of this framework was introduced by *Westra et al.* [2007], who replaced PCA with independent component analysis (ICA), which is described in section 1.2. One thus moves from decorrelated variables to independent variables, which means that potential alternative forms of association between the series are extracted and, therefore, may be reimposed during the final part of the algorithm. As shown by *Westra et al.* [2007] and in section 2, the ICA method is clearly superior to the PCA approach, which captures neither the joint nor the marginal behavior appropriately. To provide some insight into why ICA is advantageous, we briefly outline the PCA and ICA techniques.

### 1.1. Principal Component Approach to Synthetic Data Generation

[8] *Jolliffe* [2004] provides a detailed explanation of PCA and its use in the geosciences. Consider a matrix **Z** consisting of *m* hydrological time series (columns), each of length *N* (rows). If we find the columnwise mean values, $\bar{z}_j$, $j = 1, \ldots, m$, then these may be subtracted from **Z** to give the mean-centered matrix $\tilde{\mathbf{Z}}$. The covariance matrix is then $\mathbf{C} = \tilde{\mathbf{Z}}^{\mathsf{T}} \tilde{\mathbf{Z}} / (N - 1)$, and singular value decomposition may be used to derive the unit-norm eigenvectors, $\mathbf{e}_i$, ordered such that their eigenvalues, $\lambda_i$, are in descending rank order (i.e., $\mathbf{e}_1$ contains the rotations associated with the axis in the principal component space that explains the greatest variability). Each principal component may then be extracted using the following equation:

$$\mathbf{PC}_i = \tilde{\mathbf{Z}} \mathbf{e}_i$$
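As a concrete check of these relations, the following numpy sketch (our own illustration, using synthetic data) derives the eigenvectors via singular value decomposition of the mean-centered matrix and confirms that the extracted components are mutually uncorrelated, with variances equal to the eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative data matrix with N rows and m correlated columns.
N, m = 400, 3
Z = rng.standard_normal((N, m)) @ np.array([[1.0, 0.8, 0.1],
                                            [0.0, 1.0, 0.5],
                                            [0.0, 0.0, 1.0]])
Zc = Z - Z.mean(axis=0)   # mean-centred matrix

# SVD of the mean-centred matrix yields the eigenvectors of C directly:
# the columns of V are the unit-norm eigenvectors, and the singular values
# give the eigenvalues via lambda_i = s_i**2 / (N - 1), already in
# descending rank order.
U, s, Vt = np.linalg.svd(Zc, full_matrices=False)
E = Vt.T
eigvals = s ** 2 / (N - 1)

PC = Zc @ E   # each column is one principal component, PC_i = Zc e_i
```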

### 1.2. Independent Component Approach to Synthetic Data Generation

[9] The PCA method produces components that are uncorrelated, but other forms of dependence may still exist. To produce components that are truly independent, a more advanced method is required. ICA [*Jutten and Herault*, 1991] uses a PCA with the variances of the extracted components normalized to unity (a whitening transform) as a precursor step. With **E** organized such that each eigenvector occupies a different column, and **D** the diagonal matrix of the eigenvalues of **C**, the whitening matrix is as follows:

$$\mathbf{W} = \mathbf{E} \mathbf{D}^{-1/2} \mathbf{E}^{\mathsf{T}}$$

The whitened data, $\mathbf{w}$, may be obtained by multiplying the mean-centered data, $\tilde{\mathbf{Z}}$, by **W**, and the independent components, **s**, are derived from $\mathbf{w} = \mathbf{A}\mathbf{s}$, where **A** is known as the mixing matrix. The central limit theorem states that a linear combination of independent random variables tends toward a Gaussian distribution. Given the linear transform used to derive $\mathbf{w}$, it then follows that a means must be found to optimize the non-Gaussianity of the extracted components to move from merely uncorrelated variables to independent ones. Hence, it is assumed that the components, **s**, are maximally non-Gaussian, and a mixing matrix, **A**, is sought that yields appropriate **s**. The most common approach to characterizing non-Gaussianity in ICA is the minimization of mutual information or, equivalently, the maximization of the negentropy [*Comon*, 1994], which is the difference between the entropy for a Gaussian variable with the same covariance matrix and that for the random variable, $\mathbf{y}$:

$$J(\mathbf{y}) = H(\mathbf{y}_{\mathrm{Gauss}}) - H(\mathbf{y})$$

where

$$H(\mathbf{y}) = -\int p(\mathbf{y}) \log p(\mathbf{y}) \, d\mathbf{y}$$

and $p(\mathbf{y})$ is the probability distribution for $\mathbf{y}$, although for computational simplicity, an approximation to $J(\mathbf{y})$ is often adopted [*Hyvarinen and Oja*, 1998]. A worked example of the technique in a hydrological context is provided by *Westra et al.* [2007].
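The whitening precursor and a contrast-function approximation to negentropy can be combined in a compact FastICA-style fixed-point scheme. The sketch below is our own illustration (the mixed uniform sources, the tanh contrast, and all names are assumptions, not code from the sources cited): it whitens the data with $\mathbf{W} = \mathbf{E}\mathbf{D}^{-1/2}\mathbf{E}^{\mathsf{T}}$ and then extracts maximally non-Gaussian directions one at a time by deflation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two independent, non-Gaussian (uniform, unit-variance) sources, mixed linearly.
N = 2000
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(N, 2))
A = np.array([[1.0, 0.7],
              [0.4, 1.0]])
X = S @ A.T

# Whitening precursor step: W = E D^{-1/2} E^T.
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)
d, E = np.linalg.eigh(C)
W_white = E @ np.diag(1.0 / np.sqrt(d)) @ E.T
Xw = Xc @ W_white                     # whitened data: identity covariance

def fastica(Xw, n_comp, n_iter=200):
    """One-unit FastICA with the tanh contrast (a standard approximation
    to negentropy) and deflation to keep the extracted rows orthogonal."""
    n = Xw.shape[1]
    Wrows = np.zeros((n_comp, n))
    for i in range(n_comp):
        w = rng.standard_normal(n)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            wx = Xw @ w
            g, g_prime = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
            # Fixed-point update: w+ = E[x g(w'x)] - E[g'(w'x)] w
            w_new = (Xw * g[:, None]).mean(axis=0) - g_prime.mean() * w
            # Deflation: orthogonalize against previously found directions.
            w_new -= Wrows[:i].T @ (Wrows[:i] @ w_new)
            w_new /= np.linalg.norm(w_new)
            converged = abs(abs(w_new @ w) - 1.0) < 1e-10
            w = w_new
            if converged:
                break
        Wrows[i] = w
    return Wrows

W_ica = fastica(Xw, 2)
s_est = Xw @ W_ica.T   # estimated independent components
```

Each estimated component should align (up to sign and order) with one of the original uniform sources, which is precisely the property the bootstrap-and-invert framework exploits.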

### 1.3. Bootstrap-Based Approaches to Generating Hydroclimatological Data

[10] So far we have reviewed classical ARMA-type models and their multivariate representation in the form of principal or independent components. Alternative approaches to producing appropriate generators of hydroclimatological variables are commonly based on either the use of a Markovian representation of the temporal behavior or the notion of a bootstrap. Thus, in the former case, the problem of estimating a full ARMA model over multiple sites is dealt with by assuming that the value for a hydrometeorological parameter on day *t* is conditional on the value for day *t* − 1, rather than on, potentially, the full previous record [*Mehrotra and Sharma*, 2007]. Such models may be implemented in various ways, as evaluated by *Mehrotra et al.* [2006].
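A minimal illustration of the Markovian idea is a lag-1 nearest-neighbour bootstrap, in the spirit of the nonparametric methods cited above (this sketch is our own: the AR(1) test series, the choice of k = 10, and the function name are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Historical record standing in for a daily hydrometeorological series:
# an AR(1) process, so successive days are genuinely dependent.
N = 3000
hist = np.zeros(N)
for t in range(1, N):
    hist[t] = 0.8 * hist[t - 1] + rng.standard_normal()

def knn_markov_generate(record, length, k=10, rng=rng):
    """Lag-1 k-nearest-neighbour bootstrap: each new value is resampled
    from the successors of the k historical days most similar to the
    current day, so dependence on day t-1 (and nothing earlier) is kept."""
    out = np.empty(length)
    out[0] = rng.choice(record)
    preds, succs = record[:-1], record[1:]
    for t in range(1, length):
        idx = np.argsort(np.abs(preds - out[t - 1]))[:k]
        out[t] = rng.choice(succs[idx])
    return out

syn = knn_markov_generate(hist, 2000)
```

The synthetic series inherits the day-to-day persistence of the historical record without any distributional assumption, but, by construction, only first-order (Markovian) dependence is reproduced.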

[11] Our method combines autocorrelative and bootstrap principles, and, hence, it is worthwhile briefly reviewing recent bootstrap approaches, although note that our approach is not restricted to a Markovian assumption. In one example of this approach, *Clark et al.* [2004a] first constructed a synthetic record by selecting, in a uniformly random way, the current day's value from the 15 days of the historical record that extend to ±7 days on either side of the current day. The appropriate cross-correlative structure was then reimposed using a “Schaake shuffle” method [*Clark et al.*, 2004b]:

[12] A matrix, **G**, is formed from each of these synthetic series, for each variable and each field station, meaning that randomized versions of the original data are generated for each station, but that cross-correlative properties (within the 15 day window) are destroyed;

[13] An additional matrix, **H**, is formed from the original observations, with the difference being that the same date is used to populate records across all stations and variables, thereby retaining the cross-correlative structure. Thus, while values in a given column of **G** are sampled independently of other columns, values in **H** retain correlations between columns (data sets);

[14] Variants of **G** and **H** are produced by placing each columnwise set of data into descending rank order, denoted here as $\hat{\mathbf{G}}$ and $\hat{\mathbf{H}}$. A further matrix, $\mathbf{B}$, contains the positions of the elements of $\hat{\mathbf{H}}$ in the original matrix **H**;

[15] Given these matrices, the final step is to generate the synthetic matrix by reshuffling $\hat{\mathbf{G}}$ with respect to $\mathbf{B}$, so that the ranked synthetic values occupy the same row positions as their counterparts in **H**.

[16] See the graphical example in Figure 2 of *Clark et al.* [2004a] for a visual explanation. A similar rank-order matching technique is employed in the methods used in this article, although it is combined with a Fourier spectrum method to ensure better preservation of all the periodicities in a time series.
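The rank-order matching at the heart of the shuffle can be sketched as follows (our own minimal illustration with synthetic station data and our own function name, not the authors' code): independently randomized columns are reordered so that their ranks occupy the same positions as the ranks of the observed matrix, restoring the cross-correlative structure.

```python
import numpy as np

rng = np.random.default_rng(4)

# H: observations for 4 stations (columns), strongly cross-correlated.
N = 500
base = rng.standard_normal(N)
H = np.column_stack([base + 0.3 * rng.standard_normal(N) for _ in range(4)])

# G: independently generated values for each station, so any
# cross-correlation between columns is absent.
G = rng.standard_normal((N, 4))

def rank_order_match(G, H):
    """Reorder each column of G so that its ranks occupy the same row
    positions as the ranks of the corresponding column of H."""
    out = np.empty_like(G)
    for j in range(G.shape[1]):
        G_sorted = np.sort(G[:, j])
        ranks = np.argsort(np.argsort(H[:, j]))  # rank of each element of H
        out[:, j] = G_sorted[ranks]
    return out

syn = rank_order_match(G, H)
```

The synthetic matrix keeps the marginal values of **G** but inherits the between-column rank structure of **H**, which is why the cross-correlations reappear.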

### 1.4. An Alternative Approach to the Generation of Synthetic Hydrological Time Series

[17] Our approach to generating synthetic series has two major advantages compared to the methods reviewed above:

[18] Improved preservation of the full cross-correlation function, rather than the simple variable intercorrelation, and

[19] The introduction of a control parameter that can be used to tune the extent to which higher-order properties of the original data (which may be selected independently) are preserved in the synthetic data.

[20] The following section of this article reviews the mathematical and algorithmic basis for the relevant techniques used to develop our approach and goes on to develop a multivariate version of these techniques in section 2.3. This method is tested and compared to PCA and ICA methods in section 3 and then applied to a data set of daily discharge data from 107 U.S. gauging stations in section 4.