Water Resources Research

Exploratory functional flood frequency analysis and outlier detection

Authors


Abstract

[1] The prevention of flood risks and the effective planning and management of water resources require river flows to be continuously measured and analyzed at a number of stations. For a given station, a hydrograph can be obtained as a graphical representation of the temporal variation of flow over a period of time. The information provided by the hydrograph is essential to determine the severity of extreme events and their frequencies. A flood hydrograph is commonly characterized by its peak, volume, and duration. Traditional hydrological frequency analysis (FA) approaches focused separately on each of these features in a univariate context. Recent multivariate approaches considered these features jointly in order to take into account their dependence structure. However, all these approaches are based on the analysis of a number of characteristics and do not make use of the full information content of the hydrograph. The objective of the present work is to propose a new framework for FA using the hydrographs as curves: functional data. In this context, the whole hydrograph is considered as one infinite-dimensional observation. This context allows us to provide more effective and efficient estimates of the risk associated with extreme events. The proposed approach contributes to addressing the problem of lack of data commonly encountered in hydrology by fully employing all the information contained in the hydrographs. A number of functional data analysis tools are introduced and adapted to flood FA with a focus on exploratory analysis as a first stage toward a complete functional flood FA. These methods, including data visualization, location and scale measures, principal component analysis, and outlier detection, are illustrated in a real-world flood analysis case study from the province of Quebec, Canada.

1. Introduction

[2] Extreme hydrological events such as floods, droughts and rain storms may have significant economic and social consequences. Hydrological frequency analysis (FA) procedures are essential and commonly used for the analysis and prediction of such extreme events, which have a direct impact on reservoir management and dam design. Flood FA is based on the estimation of the probability $p = P(X \ge x_T)$ of exceedance of the event $x_T$ corresponding to the quantile of a given return period $T = 1/p$, e.g., T = 10, 50 or 100 years. The random variable X is commonly taken to be the peak of the flood, which is the maximum of the daily streamflow series during a hydrological year or season. Relating the magnitude of extreme events to their frequency of occurrence, through the use of probability distributions, is the principal aim of FA [Chow et al., 1988].

[3] The accurate estimation of the risk associated with the design and operation of water infrastructures requires a good knowledge of flood characteristics. Indeed, an overestimation of the design flood leads to oversizing of hydraulic structures and would therefore involve additional costs, while an underestimation of the design flood leads to material damage and the loss of human lives. Flood FA is commonly employed to study this risk. It has traditionally been carried out for the analysis of flood peaks in a univariate context. The reader is referred to Cunnane [1987] and Rao and Hamed [2000].

[4] In general, a flood is described through a number of correlated characteristics, e.g., peak, volume and duration. The univariate treatment of each flood characteristic ignores their dependence structure. Consequently, the univariate framework is less representative of the phenomenon and reduces the accuracy of risk estimation. Several authors have therefore focused on the joint treatment of flood characteristics through the use of multivariate techniques such as multivariate distributions and copulas [e.g., Yue et al., 1999; Shiau, 2003; Zhang and Singh, 2006; Chebana and Ouarda, 2011a]. Multivariate studies contributed to the improvement of estimation accuracy and provided information concerning the dependence structure between flood characteristics. The multivariate framework has been applied to several kinds of hydrological events, such as floods, droughts and storms. In flood studies, for instance, it is used for hydraulic structure design and extreme event prediction purposes (see Chebana and Ouarda [2011a] for recent references).

[5] Despite their usefulness, univariate and multivariate FA approaches have a number of limitations and drawbacks. The separate or joint use of hydrograph characteristics constitutes a major simplification of the real phenomenon. Furthermore, the way these characteristics are determined is neither unique nor objective (in particular, flood starting and ending dates). In addition, each flood characteristic can be seen as a real-valued transformation of the hydrograph, e.g., the peak is its maximum. For hydrological applications, the bivariate setting is largely employed to treat two hydrological variables. A limited number of studies deals with the trivariate case, e.g., those of Serinaldi and Grimaldi [2007] and Zhang and Singh [2007]. Trivariate models generally suffer from reduced representativeness and increased formulation complexity. Note that, in general, the number of associated parameters grows rapidly with the dimension of the model, and the generated uncertainty therefore increases. In addition, higher dimensions are not considered in hydrological practice. Finally, given the lack of data in hydrology, working with a limited number of extracted characteristics represents a loss of information in comparison to the overall available series.

[6] The main data source in FA is the daily streamflow series, which over a year constitutes a hydrograph, from which the univariate and multivariate variables are extracted. The total information that is available in a hydrograph is necessary for the effective planning of water resources and for the design and management of hydraulic structures. The entire hydrograph, as a curve with respect to time, can be considered as a single observation within the functional context. In the univariate and the multivariate settings an observation is a real value and a vector, respectively. Therefore, the functional framework, which treats the whole hydrograph as a functional observation (function or curve), is more representative of the real phenomenon and makes better use of available data. Figure 1 illustrates and summarizes the three frameworks.

Figure 1.

Illustration of the different approaches, (a) univariate, (b) multivariate, and (c) functional, with the corresponding types of variables, series, and a number of references.

[7] In the hydrological literature, there have been some efforts toward a representation of the hydrograph as a function, such as in the study of the design flood hydrograph [e.g., Yue et al., 2002] and in the flow duration curve study [e.g., Castellarin et al., 2004] where the mean, median and variation are presented as curves. These studies underlined the importance of considering the shape of the hydrograph, which is necessary, for instance, for water resources planning, design and management. The shape of flood hydrographs for a given river may change according to the observed storm or snowmelt events. More practical issues and examples related to the hydrograph are discussed by, e.g., Yue et al. [2002] and Chow et al. [1988]. Note that the main flood characteristics, i.e., peak, volume and duration, cannot completely capture the shape of the hydrograph. The study of hydrographs by Yue et al. [2002], and similar studies, are simplistic and limited, as they approximated the flood hydrograph using a two-parameter beta density and considered only single-peak hydrographs. On the other hand, the flow duration curve approach [Castellarin et al., 2004] belongs to the univariate setting, and the presented functional elements (e.g., mean and median curves) are important but remain limited. These studies show the need to introduce a statistical framework to study the whole hydrograph and to perform further statistical analysis. The functional framework is more general and more flexible and can represent a large variety of hydrographs.

[8] Functional data are becoming increasingly common in a variety of fields. This has sparked growing attention in the development of adapted statistical tools that allow us to analyze such kinds of data. For instance, Ramsay and Silverman [2005], Ferraty and Vieu [2006], and Dabo-Niang and Ferraty [2008] provided detailed surveys of a number of parametric and nonparametric techniques for the analysis of functional data. In practice, the use of functional data analysis (FDA) has benefited from the availability of the appropriate statistical tools and high-performance computers. Furthermore, the use of FDA allows us to make the most of the information contained in the functional data. The aims of FDA are mainly the same as in classical statistical analysis, e.g., representing and visualizing the data, studying variability and trends, comparing different data sets, as well as modeling and predicting. The majority of classical statistical techniques, such as principal component analysis, linear models, confidence interval estimation and outlier detection, have been extended to the functional context [e.g., Ramsay and Silverman, 2005]. The application of FDA has been successfully carried out, for instance, in the case of the El Niño climatic phenomenon [Ferraty et al., 2005] and radar wave curve classification [Dabo-Niang et al., 2007]. Dabo-Niang et al. [2010] proposed a spatial heterogeneity index to compare the effects of bioturbation on oxygen distribution. Delicado et al. [2008] and Monestiez and Nerini [2008] considered spatial functional kriging methods to model different temperature series. Sea ice data were treated in the FDA context by Koulis et al. [2009].

[9] The functional methodology constitutes a natural extension of univariate and multivariate hydrological FA approaches (see Figure 1). This new approach uses all available data by employing the whole hydrograph as a functional observation. In other words, FDA makes it possible to analyze hydrological data exhaustively by conducting one analysis on the whole data set instead of several univariate or multivariate analyses. In addition, the approach proposed by Yue et al. [2002] can be generalized in the FDA context, where it becomes more flexible and includes hydrographs with different shapes, such as multipeak ones.

[10] Given the above arguments, for hydrological applications, the functional context can be seen as an alternative framework to the univariate and multivariate ones, or it can be employed as a parallel complement to bring additional insights to those obtained with the two other frameworks. The main objective of the present paper is to attract attention to the functional nature of data that can be used in all statistical techniques for hydrological applications through the FDA framework. A second objective is to introduce some of the FDA techniques, point out their advantages and illustrate their applicability in the hydrological framework. In the present paper, we focus on hydrological FA.

[11] Four main steps are required in order to carry out a comprehensive hydrological FA: (1) descriptive and exploratory analysis and outlier detection, (2) verification of FA assumptions, i.e., stationarity, homogeneity and independence, (3) modeling and estimation, and (4) evaluation and analysis of the risk. Step 1 is commonly carried out in univariate hydrological FA as pointed out, e.g., by Rao and Hamed [2000], Kite [1988], and Stedinger et al. [1993], whereas in the multivariate framework it was investigated recently by Chebana and Ouarda [2011b]. Contrary to the univariate setting, exploratory analysis in the multivariate and functional settings is not straightforward and requires more effort. Table 1 summarizes the four FA steps and their status in each one of the three frameworks. It indicates that the specific aim of the present paper is to treat step 1, which deals with data visualization, location and scale measures as well as outlier detection. A new nongraphical method to detect functional outliers is also proposed in the present paper. The presented techniques are applied to floods on the basis of daily streamflow series from a basin in the province of Quebec, Canada.

Table 1. FA Steps in the Three Frameworks^a

FA Steps | Univariate | Multivariate | Functional
1. Exploratory analysis and outlier detection | Large body of literature: Cunnane [1987], Kite [1988], Stedinger et al. [1993], Rao and Hamed [2000] | Very sparse body of literature: Chebana and Ouarda [2011b] | The specific aim of the present paper
2. Checking the FA assumptions: stationarity, homogeneity, independence | Large body of literature: Yue et al. [2002], Kundzewicz et al. [2005], Khaliq et al. [2009] | Very sparse body of literature: Chebana et al. [2010] | To be developed
3. Modeling and estimation | Large body of literature: Cunnane [1987], Bobée and Ashkar [1991] | Large body of recent literature: Shiau [2003], Zhang and Singh [2006], Salvadori et al. [2007] | To be developed
4. Risk evaluation and analysis | Large body of literature: Chow et al. [1988] | Sparse but growing body of literature: Shiau [2003], Chebana and Ouarda [2011a] | To be developed

^a Note that in the univariate framework, step 1 is straightforward and is generally not treated separately. The references are given only as examples from the literature because of space limitations.

[12] Exploratory data analysis as a preliminary step of FA is useful for the comparison of hydrological samples and for the selection of the appropriate model for hydrological variables. It consists in a close inspection of the data to quantify and summarize the properties of the samples, for instance, through location and scale measures. Outliers can have negative impacts on the selection of the appropriate model as well as on the estimation of the associated parameters. In order to base the inference on the right data set, detection and treatment of outliers are also important elements of FA [Barnett and Lewis, 1998]. Therefore, it is essential to start with the basic analysis (step 1) in order to perform a complete functional FA.

[13] This paper is organized as follows. The theoretical background of functional statistical methods is presented in section 2 in its general form. In section 3, the functional framework is adapted to floods. The functional FA methods are applied, in section 4, to a real-world case study representing daily streamflows from the province of Quebec, Canada. A discussion as well as a comparison with multivariate FA are also reported in section 4. Conclusions and perspectives are presented in section 5.

2. Functional Data Analysis Background

[14] This section presents the general functional techniques. It is composed of four parts representing the FDA phases: first, data smoothing is discussed; second, location and scale parameters are introduced; then functional principal component analysis (FPCA) is described; and finally data visualization and outlier detection methods are presented.

[15] Data are generally measured in discrete time steps such as hours or days. Therefore, the first phase in FDA consists in the conversion of observed discrete data to functional data. Once the discrete data are transformed to curves, they can be analyzed in the functional framework. In a descriptive statistical study, it is of interest to obtain estimates of the location and scale parameters within FDA. The next phase in the considered FDA is to extract information from functional data using FPCA, where the scores corresponding to these components are the basis for visualization and outlier detection.

2.1. Data Smoothing

[16] The objective of this step is to prepare data to be used in the FDA context. As a data preparation step, it is analogous to the extraction of peak series in univariate FA or of peak and volume series in multivariate FA. Note that the statistical object of FDA is a function (curve) as shown in Figure 1. However, the curves themselves are not observed; instead, only discrete measurements of the curves are available. In the case where data series are of good quality and have long enough records, one can simply interpolate the measurements to obtain the curves, e.g., for rainfall series. Otherwise, smoothing can be required. This is typically the case for diffusive processes like in the present study of floods. However, even in the first case, smoothing could be necessary depending on the goal of the study [e.g., Ramsay and Silverman, 2005].

[17] Let $y_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, T$ be a set of $n$ discrete observations, where each $t_j$, $j = 1, \ldots, T$ is the $j$th record time point from a given time subset $\mathcal{T}$. For a fixed observation $i$, the set of measurements $(y_{i1}, \ldots, y_{iT})$ is converted to a functional datum (curve), denoted $x_i(t)$, by using a smoothing technique, where the index $t$ covers the continuous subset $\mathcal{T}$. To this end, we suppose that the discrete observations $y_{ij}$ are fitted using the regression model

$$ y_{ij} = x_i(t_j) + \varepsilon_{ij}, \qquad i = 1, \ldots, n, \quad j = 1, \ldots, T, \tag{1} $$

where the $\varepsilon_{ij}$ are the errors and the functions $x_i$ are linear combinations of basis functions $\phi_k$ that permit to explain most of the variation contained in the functional observations:

$$ x_i(t) = \sum_{k=1}^{p} c_{ik}\,\phi_k(t). \tag{2} $$

[18] The functional data sets $\hat{x}_i(t)$ are then given by

$$ \hat{x}_i(t) = \sum_{k=1}^{p} \hat{c}_{ik}\,\phi_k(t), \tag{3} $$

where the estimated coefficients $\hat{c}_{ik}$ are obtained by minimizing the following sum of squared errors:

$$ \mathrm{SSE}_i = \sum_{j=1}^{T} \Big[ y_{ij} - \sum_{k=1}^{p} c_{ik}\,\phi_k(t_j) \Big]^2. \tag{4} $$

[19] For more details, the reader is referred, for instance, to Ramsay and Silverman [2005]. A number of possible types of basis $\phi_k$ have been presented in the literature. Most practical situations are treated with well-known bases such as polynomial, wavelet, Fourier and the various spline versions. Fourier and B-spline bases are the most widely employed in the FDA context. The functional representation uses Fourier series for periodic or nearly periodic data. When the data are far from being periodic, spline approximations are commonly used in FDA [Ramsay and Silverman, 2005]. Splines are more flexible but more complicated than Fourier series. The latter allow capturing the seasonal variability, while spline series capture high and low values of the data [Ramsay and Silverman, 2005; Koulis et al., 2009]. In general, the choice of basis functions or smoothing method should be based on objective considerations depending mainly on the nature of the data to be studied. Fourier basis functions $\phi_k$ are defined by

$$ \phi_0(t) = 1, \qquad \phi_{2r-1}(t) = \sin(r\omega t), \qquad \phi_{2r}(t) = \cos(r\omega t), \qquad r = 1, 2, \ldots, \tag{5} $$

where the frequency $\omega$ determines the base period $2\pi/\omega$.

[20] Splines are piecewise polynomials defined on subintervals of the range of the observations. In each subinterval, the spline is a polynomial function with a fixed degree but possibly a different shape. For instance, when the polynomial degree is three, we speak of cubic splines. For a comprehensive review of splines, the reader is referred to de Boor [2001].

[21] Note that the aim of using the above expansion (3) is to obtain smooth functions to be employed as observations in FDA. In this case, the expansion series need not be interpreted, since the interest is not to extract a signal from the whole series. However, the number of basis functions to be selected is important: a large number leads to overfitting of the data, while a small number leads to underfitting. Hence, the smoothing degree of the functions to be employed as observations depends on the aim of the analysis; e.g., in principal component analysis, the aim is to capture a large part of the variability rather than to reach the peaks. For more flexibility, a penalty term can be added to (4) to ensure the regularity of the smoothed functions. More details are given by, for instance, Ramsay and Silverman [2005] and Wahba [1990].
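To make the smoothing step concrete, here is a minimal Python/numpy sketch of the least-squares basis fit (1)–(4) with the Fourier basis (5). All names are ours; the default of 53 basis columns (a constant plus 26 sine/cosine pairs, i.e., 52 nonconstant functions) and the synthetic flows are illustrative assumptions, not the original computations.

```python
import numpy as np

def fourier_basis(t, p, period):
    """Evaluate p Fourier basis functions at the time points t:
    column 0 is the constant, followed by sine/cosine pairs as in (5)."""
    omega = 2.0 * np.pi / period
    B = np.ones((len(t), p))
    for k in range(1, p):
        r = (k + 1) // 2                     # harmonic number
        if k % 2 == 1:
            B[:, k] = np.sin(r * omega * t)  # odd columns: sin(r*omega*t)
        else:
            B[:, k] = np.cos(r * omega * t)  # even columns: cos(r*omega*t)
    return B

def smooth_curves(Y, p=53, period=365.0):
    """Fit each row of Y (one year of daily flows) by least squares on a
    common Fourier design matrix, i.e., minimize (4) for every year."""
    t = np.arange(1, Y.shape[1] + 1)
    B = fourier_basis(t, p, period)              # (n_days, p) design matrix
    C, *_ = np.linalg.lstsq(B, Y.T, rcond=None)  # coefficients c_ik
    return C.T, (B @ C).T                        # coefficients, fitted curves

# Illustrative use on synthetic "hydrographs" (26 years x 365 days).
rng = np.random.default_rng(0)
base = 500 + 800 * np.exp(-0.5 * ((np.arange(365) - 120) / 15.0) ** 2)
Y = base + rng.normal(0.0, 20.0, size=(26, 365))
coefs, curves = smooth_curves(Y)
```

Solving one least-squares problem per year against a shared design matrix mirrors the minimization of (4); a roughness penalty would simply add a term to these normal equations.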

2.2. Location and Scale Parameters for Functional Variables

[22] In a descriptive statistical study, we generally begin by looking for centrality and dispersion properties of a given sample. A location parameter summarizes the data and indicates where most of the data are located. Scale parameters are useful to measure the dispersion of a sample and also to compare different samples. These notions are useful in hydrology since they appear in almost all commonly employed probability distributions and models. In hydrology, location curves can also be used to characterize a given basin and to proceed to comparison or grouping of a set of basins. The scale measures can be used in a similar way but at a second level. In the setting of real or multivariate random variables, this is usually done through the mean, median, mode, variance, covariance and correlation. To avoid the possibility of missing important information, it is generally recommended to employ more than one measure for each feature. For instance, by looking only at the mean of the sample one might miss a possible heterogeneity in the population which would be captured by the mode. Obviously, these same considerations also apply when one studies a sample composed of curves $x_1(t), \ldots, x_n(t)$. In this setting, it is straightforward to define the mean curve $\bar{x}(t)$ of the sample as

$$ \bar{x}(t) = \frac{1}{n} \sum_{i=1}^{n} x_i(t). \tag{6} $$

[23] One has to use this mean curve carefully, according to the shape of the data. For instance, if the data exhibit a high degree of roughness, the mean curve could be less informative.

[24] Robust and efficient alternatives to the sample mean are the median and the trimmed mean [e.g., Ouarda and Ashkar, 1998]. In the functional context, both measures are based on the statistical notion of depth function, which was initially introduced in the multivariate context. The aim of depth functions is to extend the notion of ranking to a multivariate sample. These functions were first introduced by Tukey [1975] and were applied to water sciences by Chebana and Ouarda [2008]. Recently, the notion of depth has been extended to functional data [e.g., Fraiman and Muniz, 2001; Febrero et al., 2008]. The reader is referred to Chebana and Ouarda [2011b] for hydrological applications and a brief review, and to Zuo and Serfling [2000] for a general and detailed description.

[25] Fraiman and Muniz [2001] presented the definition of trimmed means in the functional context, which is based on the empirical $\alpha$-trimmed functional region. The latter is defined, for $0 < \alpha < 1$, as the set $R_{n,\alpha}$ of the $n - \lfloor n\alpha \rfloor$ deepest sample curves, where depth is measured by an empirical functional depth function $D_n$, such as the various ones defined, e.g., by Fraiman and Muniz [2001] and Febrero et al. [2008], where the corresponding formulations are explicitly given. A depth-based functional trimmed mean can be defined as the average over the $x_i$ that belong to the empirical trimmed region $R_{n,\alpha}$:

$$ \hat{\mu}_{n,\alpha}(t) = \frac{1}{\#(R_{n,\alpha})} \sum_{x_i \in R_{n,\alpha}} x_i(t), \tag{7} $$

where $\#(A)$ is the cardinal of the set $A$. For functional observations, the median curve is the deepest function in the sample $x_1, \ldots, x_n$. It maximizes the depth function $D_n$:

$$ \mathrm{med}_n = \underset{x_i,\; i = 1, \ldots, n}{\arg\max}\; D_n(x_i), \tag{8} $$

where $\arg\max_{a \in A}\, g(a)$ stands for the element of the set $A$ that maximizes the function $g$.
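As an illustration of (8), the sketch below computes a functional median with the Fraiman–Muniz depth, in which the depth of a curve is the time average of the univariate depth $1 - |1/2 - F_t(x_i(t))|$, with $F_t$ the empirical distribution function of the sample values at time $t$. The function names are ours and ties are handled naively; this is a didactic sketch, not the paper's implementation.

```python
import numpy as np

def fraiman_muniz_depth(X):
    """Fraiman-Muniz functional depth for curves sampled on a common grid.
    X has shape (n_curves, n_times)."""
    n = X.shape[0]
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1  # ranks at each t
    F = ranks / n                           # empirical cdf values F_t(x_i(t))
    pointwise = 1.0 - np.abs(0.5 - F)       # univariate depth at each t
    return pointwise.mean(axis=1)           # integrate (average) over time

def functional_median(X):
    """Deepest curve in the sample, i.e., the median of equation (8)."""
    return X[np.argmax(fraiman_muniz_depth(X))]
```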

[26] From a theoretical point of view, the mode as a location measure, when it exists, is the value that locally maximizes the probability density $f$ of the underlying variable. Developments and applications related to nonparametric density estimation in this context are given by Dabo-Niang et al. [2007]. An estimator of the modal curve can be obtained by

$$ \hat{x}_{\mathrm{mode}} = \underset{x_i,\; i = 1, \ldots, n}{\arg\max}\; f_n(x_i), \tag{9} $$

where $f_n$ is an estimate of the density $f$.

[27] The median and mode given in (8) and (9), respectively, are natural extensions of their multivariate counterparts. However, they are rarely used in practice because of their complex computations. Alternatively, they are commonly defined on the basis of the bivariate scores of a functional principal component analysis of the curve observations, as described in section 2.4 below.

[28] Variability is one of the important quantities to be evaluated and analyzed in statistics. For multivariate data, the reader is referred to Liu et al. [1999] and Chebana and Ouarda [2011b]. The simplest way to define a variance function in the functional context is by

$$ \mathrm{Var}_X(t) = \frac{1}{n-1} \sum_{i=1}^{n} \left[ x_i(t) - \bar{x}(t) \right]^2. \tag{10} $$

[29] The covariance function summarizes the dependence structure across different argument values:

$$ \mathrm{Cov}_X(s,t) = \frac{1}{n-1} \sum_{i=1}^{n} \left[ x_i(s) - \bar{x}(s) \right]\left[ x_i(t) - \bar{x}(t) \right]. \tag{11} $$

[30] The variability of the functional sample is analyzed by plotting the surface $\mathrm{Cov}_X(s,t)$ as a function of $s$ and $t$, as well as the corresponding contour map.
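On a common discrete time grid, (10) and (11) reduce to ordinary sample moments computed across curves. A minimal numpy sketch (function name ours) is:

```python
import numpy as np

def var_cov_surface(X):
    """Sample variance function (10) and covariance surface (11) for
    curves X of shape (n_curves, n_times) sampled on a common grid."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)          # center at each time point
    cov = Xc.T @ Xc / (n - 1)        # Cov_X(s, t) on the grid
    return np.diag(cov), cov         # Var_X(t) = Cov_X(t, t), full surface
```

Plotting the returned surface over (s, t), or its contour map, reproduces the kind of display discussed above.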

[31] Note that, for functional observations, several types of variability can occur such as the variability within the same observation or between the different observations. In addition, functional principal component analysis is also employed to explore the variability between observations. The reader is referred to Ramsay and Silverman [2005] and section 2.3 for a presentation of the functional principal component analysis.

2.3. Functional Principal Component Analysis

[32] Principal component analysis, as a multivariate procedure, is usually employed to reduce dimensionality by defining new variables, linear combinations of the original ones, which capture the maximum of the data variance. After converting the data into functions, functional PCA allows us to find new functions that reveal the most important types of variation in the curve data. Note that these new functions are not the Fourier or spline basis functions, since their aim is not to smooth the data but rather to produce a reasonable summary of it by maximizing the captured variability. The FPCA method maximizes the sample variance of the scores (defined below) subject to orthonormality constraints. It decomposes the centered functional data in terms of an orthogonal basis, as described in the following.

[33] Let $x_1(t), \ldots, x_n(t)$ be the functional observations obtained by smoothing the observed discrete data $y_{ij}$.

[34] The mean curve represents a mode of variation common to most curves, which can be isolated by centering. Let $x_i^c(t) = x_i(t) - \bar{x}(t)$ be the centered functional observations, where $\bar{x}(t)$ is the mean function of the $x_i(t)$ given by (6). A FPCA is then applied to the $x_i^c(t)$ to create a small set of functions, also called harmonics, that reveals the most important types of variation in the data.

[35] The first principal component of the $x_i^c$, denoted $w_1$, is a function such that the variance of the corresponding real-valued scores $s_{i1}$, written as

$$ s_{i1} = \int_{\mathcal{T}} w_1(t)\, x_i^c(t)\, dt, \qquad i = 1, \ldots, n, \tag{12} $$

is maximized under the constraint $\int_{\mathcal{T}} w_1(t)^2\, dt = 1$. The next principal components $w_k$, $k = 2, 3, \ldots$, are obtained by maximizing the variance of the corresponding scores $s_{ik}$:

$$ s_{ik} = \int_{\mathcal{T}} w_k(t)\, x_i^c(t)\, dt, \qquad i = 1, \ldots, n, \tag{13} $$

under the constraints $\int_{\mathcal{T}} w_k(t)^2\, dt = 1$ and $\int_{\mathcal{T}} w_k(t)\, w_m(t)\, dt = 0$ for $m < k$. As in the multivariate setting, the interpretation of the principal component function $w_k$ is somewhat difficult, as it depends on the type of data being used and may require nonstatistical considerations. A useful approach consists in examining plots of the overall mean function and of perturbations around the mean based on $w_k$. The perturbation functions are obtained as suitable multiples of the considered $w_k$, namely,

$$ \bar{x}(t) \pm c\,\sqrt{\lambda_k}\; w_k(t), \tag{14} $$

where $\sqrt{\lambda_k}$ is the square root of the variance (eigenvalue) of the corresponding $k$th principal component and $c$ is a suitable constant. This presentation allows us to isolate the perturbations about the mean across time and then assess the variability of the observations. Note that the principal components $w_k$ are optimal, according to the maximization in (12) or (13), but are not unique. Therefore, any rotation of the $w_k$ by an orthogonal matrix is also optimal and orthonormal. A well-known choice of such matrices is VARIMAX. These rotated components can be useful for interpretation. More technical details are given, for instance, by Ramsay and Silverman [2005]. On the other hand, the regularity of the harmonics $w_k$ can be controlled: Rice and Silverman [1991] and Silverman [1996] extended this traditional functional PCA to the regularized FPCA (RFPCA), which maximizes the sample variance of the scores subject to penalized constraints.
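In discretized form, FPCA amounts to an eigendecomposition of the sample covariance of the centered curves, conveniently obtained through a singular value decomposition. The following sketch (names ours; an equally spaced grid is assumed so that the integrals in (12)–(13) reduce to sums up to a constant factor) returns the harmonics, their scores and the explained variance fractions:

```python
import numpy as np

def fpca(X, n_components=4):
    """Discretized FPCA of curves X with shape (n_curves, n_times).
    Returns the harmonics w_k (rows of W), the scores s_ik of
    (12)-(13), and the fraction of variance each component explains."""
    Xc = X - X.mean(axis=0)                      # centered curves x_i^c
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components]                        # orthonormal harmonics
    scores = Xc @ W.T                            # s_ik = <x_i^c, w_k>
    explained = s**2 / np.sum(s**2)              # variance fractions
    return W, scores, explained[:n_components]
```

The first two columns of `scores` are the bivariate points $z_i$ on which the visualization and outlier detection tools of section 2.4 operate.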

2.4. Functional Data Visualization and Outlier Detection Methods

[36] In general, outliers represent gross errors, inconsistencies or unusual observations and should be detected and treated [Barnett and Lewis, 1998]. Univariate outliers are well defined and their detection is straightforward [e.g., Hosking and Wallis, 1997; Rao and Hamed, 2000]. The topic is also relatively well developed in the multivariate setting [e.g., Dang and Serfling, 2010]. The identification and treatment of outliers constitute an important component of data analysis before modeling. For hydrologic data, outlier detection is a common problem which has received considerable attention in the univariate framework. In the multivariate setting, the problem is well established in statistics; in the hydrologic field, however, the concepts are much less established. A pioneering work in this direction was recently presented by Chebana and Ouarda [2011b]. As in the univariate and multivariate settings, outliers may have a serious effect on the modeling of functional data.

[37] In this section, we focus on visualization methods that help to explore and examine certain features, such as outliers, that might not be apparent from summary statistics. Different outlier detection methods exist in the functional context literature [e.g., Hardin and Rocke, 2005; Febrero et al., 2007; Filzmoser et al., 2008]. However, Hyndman and Shang [2010] showed, on the basis of real data, that their methods are better able to detect outliers and are computationally faster. The methods proposed by Hyndman and Shang [2010] are graphical and consist first in visualizing functional data through the rainbow plot, and then in identifying functional outliers using the functional bagplot and the functional highest-density region (HDR) box-plot. The latter two methods can detect outlier curves that may lie outside the range of the majority of the data, or that may be within the range of the data but have a very different shape. These methods can also exhibit curves having a combination of these features. In practice, depending on the nature of the data, the two outlier detection methods can give different results.

[38] As pointed out by Jones and Rice [1992] and Sood et al. [2009], a considerable amount of the information contained in the original functional data is captured by the first few principal components and scores. The outlier identification methods of Hyndman and Shang [2010] considered here are based on the first two score vectors. Let $x_i(t)$, $w_k(t)$ and $s_{ik}$ be the smoothed observed curves, the principal component curves, and the corresponding scores obtained from the FPCA decomposition (section 2.3), respectively. Let $s_1 = (s_{11}, \ldots, s_{n1})$ and $s_2 = (s_{12}, \ldots, s_{n2})$ be the first two score vectors and $z_i = (s_{i1}, s_{i2})$, $i = 1, \ldots, n$, the bivariate score points. At the end of this section, a nongraphical outlier detection method based on the $z_i$ is proposed.

2.4.1. Rainbow Plot

[39] The rainbow plot, proposed by Hyndman and Shang [2010], is a simple presentation of all the data, with the only added feature being a color palette based on an ordering. In the functional context, this ordering is based either on functional depth or on data density indices. These indices are evaluated from the bivariate score depths and kernel density. The bivariate score depth is given by

$$ OT_i = D_T(z_i), \qquad i = 1, \ldots, n, \tag{15} $$

where $D_T$ is the half-space depth function introduced by Tukey [1975]. Tukey's depth at $z_i$ is defined as the smallest number of data points included in a closed half space containing $z_i$ on its boundary. The observations are ordered in decreasing order of their depth values $OT_i$. The first ordered curve represents the median curve, while the last curve can be considered as the outermost curve in the sample. As indicated in section 2.2, this median curve, based on the Tukey depth of the bivariate principal scores $z_i$, will be used in the following adaptation to floods. Let $z_{\mathrm{med}} = \arg\max_{z_i} D_T(z_i)$ define this Tukey bivariate depth median. If there are several maximizers, the Tukey bivariate depth median can be taken as their center of gravity.

[40] The second way of ordering functional observations is based on the kernel density estimate [e.g., Scott, 1992] evaluated at the bivariate principal component scores:

$$ OD_i = \hat{f}(z_i) = \frac{1}{n} \sum_{j=1}^{n} \prod_{l=1}^{2} \frac{1}{h_l}\, K\!\left( \frac{z_{il} - z_{jl}}{h_l} \right), \qquad i = 1, \ldots, n, \tag{16} $$

where $K(\cdot)$ is the kernel function and $h_l$ is the bandwidth for the $l$th coordinate of the bivariate score points $z_i$. The functional data $x_i(t)$ are then ordered in decreasing order with respect to $OD_i$. Hence, the first curve, with the highest OD, is considered as the modal curve, while the last curve, with the lowest OD, can be considered as the most unusual curve. This modal curve will also be used in the following application.

[41] The smoothed observations are presented with colors according to the values of OT and OD. The curves close to the center are red while the most outlying curves are violet.
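A sketch of the density ordering (16) with a product Gaussian kernel and a simple Scott-type rule-of-thumb bandwidth follows; these are our illustrative choices, whereas the study relies on cross-validated bandwidths (see section 2.4.3).

```python
import numpy as np

def density_order(Z, h=None):
    """Order observations by the kernel density estimate (16) evaluated
    at their own bivariate score points Z of shape (n, 2); the first
    index returned corresponds to the modal curve."""
    n = Z.shape[0]
    if h is None:
        h = Z.std(axis=0) * n ** (-1.0 / 6.0)   # Scott-type rule for d = 2
    diff = (Z[:, None, :] - Z[None, :, :]) / h  # pairwise scaled differences
    K = np.exp(-0.5 * (diff ** 2).sum(axis=2))  # product Gaussian kernels
    od = K.mean(axis=1) / (2.0 * np.pi * h.prod())  # OD_i = f_hat(z_i)
    return np.argsort(-od)                      # decreasing OD ordering
```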

2.4.2. Functional Bagplot

[42] The bivariate bagplot was introduced by Rousseeuw et al. [1999] and is based on the half-space depth function. It was employed by Chebana and Ouarda [2011b] for multivariate hydrological data. The functional bagplot version is obtained from the bivariate bagplot of the first two principal scores $z_i$ given in section 2.3. Each curve in the functional bagplot is associated with a point in the bivariate bagplot. Similar to the bivariate bagplot, the functional bagplot is composed of three elements: the Tukey median curve, the functional inner region and the functional outer region. The inner region includes 50% of the observations, whereas the outer region covers either 95% or 99% of the observations. The outer region is obtained by inflating the inner region by a factor $\rho$. Hyndman and Shang [2010] suggested that the factor $\rho$ could take the values 1.96 or 2.58 in order to include 95% or 99% of the curves in the outer region, respectively. These values of $\rho$ correspond to the case where the bivariate scores follow the standard normal distribution. Finally, points outside the outer region are considered as outliers.
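Indeed, under this normality assumption the two suggested inflation values are the familiar two-sided standard normal quantiles,

$$ \Phi^{-1}(0.975) \approx 1.96, \qquad \Phi^{-1}(0.995) \approx 2.58, $$

where $\Phi$ denotes the standard normal distribution function.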

2.4.3. Functional HDR Box-Plot

[43] The functional HDR box-plot corresponds to the bivariate HDR box-plot of Hyndman [1996] applied to the first two principal component scores $z_i$. The bivariate HDR box-plot is constructed using the bivariate kernel density estimate $\hat{f}$ defined by (16). An HDR of order $\alpha$ is defined as

$$ R_{\alpha} = \left\{ z : \hat{f}(z) \ge f_{\alpha} \right\}, \tag{17} $$

where $f_{\alpha}$ is the largest constant such that the region $R_{\alpha}$ has probability coverage $1 - \alpha$ with respect to $\hat{f}$. An HDR can be seen as a density contour whose coverage decreases as $\alpha$ increases. The associated bandwidths $h_l$ in $\hat{f}$ are selected by a smooth cross-validation procedure [Duong and Hazelton, 2005].

[44] The functional HDR box-plot is composed of the mode, defined as $z_{\mathrm{mode}} = \arg\max_{z} \hat{f}(z)$, the 50% inner region $R_{0.5}$ and the 99% outer highest-density region $R_{0.01}$. For an HDR with a 95% outer region, one can take $R_{0.05}$ instead of $R_{0.01}$. Curves excluded from the outer functional HDR are considered as outliers.
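Computationally, the threshold $f_{\alpha}$ in (17) can be approximated by the empirical $\alpha$ quantile of the estimated density values at the observations, so that points with lower density fall outside the $(1 - \alpha)$ region. A minimal sketch (names ours) is:

```python
import numpy as np

def hdr_outliers(density_at_points, alpha=0.01):
    """Indices of score points outside the (1 - alpha) highest-density
    region (17): their estimated density f_hat(z_i) falls below the
    empirical alpha quantile f_alpha of all density values."""
    f_alpha = np.quantile(density_at_points, alpha)
    return np.flatnonzero(density_at_points < f_alpha)
```

With alpha = 0.01 and alpha = 0.05 this mimics the 99% and 95% outer regions discussed above.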

[45] The difference between detecting outliers with the bagplot and with the HDR box-plot lies mainly in the way the inner and outer regions are established. The bagplot uses a depth function (Tukey) and the estimated median curve (based on the Tukey depth of the first bivariate scores $z_i$), while the HDR box-plot uses the density estimate of the $z_i$ and its mode. Hence, the outliers flagged by the HDR box-plot are unusual compared to the modal curve, whereas those detected by the bagplot are unusual with respect to the median curve.

[46] In connection with the multivariate setting, as indicated by Chebana and Ouarda [2011b], the points outside the fence of the bagplot should be considered as extremes rather than outliers. Chebana and Ouarda [2011b] considered the approach proposed by Dang and Serfling [2010] to detect outliers. This approach is based on the evaluation of the outlyingness of each observation and the determination of a threshold; observations whose outlyingness exceeds this threshold are identified as outliers. The outlyingness values are simple decreasing transformations of depth functions. In the present study, we propose to apply this approach on the basis of the first two scores. A brief presentation of the approach is given by Chebana and Ouarda [2011b, section 2.6].
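As a hypothetical sketch of this nongraphical approach, the Mahalanobis version of the outlyingness can be computed from the score points as one minus the Mahalanobis depth; the exact transformations and threshold calibration of Dang and Serfling [2010] differ in detail, so the helper names and the empirical quantile threshold below are our illustrative assumptions.

```python
import numpy as np

def mahalanobis_outlyingness(Z):
    """Outlyingness of each bivariate score point, computed as
    1 - Mahalanobis depth = d^2 / (1 + d^2), where d is the Mahalanobis
    distance of z_i to the sample mean of the scores."""
    mu = Z.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(Z, rowvar=False))
    diff = Z - mu
    d2 = np.einsum('ij,jk,ik->i', diff, S_inv, diff)  # squared distances
    return d2 / (1.0 + d2)

def flag_outliers(outlyingness, q=0.97):
    """Flag observations whose outlyingness exceeds the empirical
    q quantile used as threshold."""
    return np.flatnonzero(outlyingness > np.quantile(outlyingness, q))
```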

[47] The above graphical approaches should be considered as preliminary indications for suspected observations; the latter could be seen as extremes rather than outliers [see, e.g., Chebana and Ouarda, 2011b]. In addition, the approach of Dang and Serfling [2010] is based on outlyingness criteria, and the corresponding threshold is empirical and not necessarily based on normality (unlike the central region inflation values of 1.96 or 2.58).

3. Adaptation to Floods

[48] The first and most important adaptation for floods lies in the nature of hydrological data. The main data source in hydrology is the daily flow from a given station. Flows can also sometimes be available on an hourly, instantaneous or other time scale. In the following, we focus on daily data, which we assume are recorded during $n$ years of measurement: $y_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, T$, with $T = 365$ days, where $y_{ij}$ is the flow measured on day $t_j$ of the $i$th year. The time subset $\mathcal{T}$ is then the interval $[1, 365]$. According to this kind of data, we have $n$ discrete observations $(y_{i1}, \ldots, y_{iT})$, where the observation $(y_{i1}, \ldots, y_{iT})$ denotes the daily flows of the $i$th year. A functional observation constitutes a year starting from January 1st to December 31st. However, it can be cut out in different ways according to the seasonal characteristics of the geographical area of interest. For instance, for most parts of Canada, it is possible to define the March–June season for spring floods and the July–October season for fall floods.
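In practice, this amounts to rearranging the observed daily series into an n × 365 matrix, one functional observation per row. A sketch of such a rearrangement (our own helper, with Feb 29 dropped so every year shares the same grid) is:

```python
import calendar
import numpy as np

def to_annual_matrix(flow, dates):
    """Arrange a daily flow series into an (n_years, 365) matrix of
    discrete functional observations y_ij, one row per year."""
    years = sorted({d.year for d in dates})
    row = {y: i for i, y in enumerate(years)}
    Q = np.full((len(years), 365), np.nan)
    for q, d in zip(flow, dates):
        if d.month == 2 and d.day == 29:
            continue                          # drop Feb 29
        doy = d.timetuple().tm_yday
        if calendar.isleap(d.year) and doy > 60:
            doy -= 1                          # re-align days after Feb 29
        Q[row[d.year], doy - 1] = q
    return np.array(years), Q
```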

[49] The discrete observed data $y_{ij}$ are converted to smooth curves $x_i(t)$ as temporal functions with a base period of $T = 365$ days and $p = 52$ nonconstant basis functions (one per week) as in (2). This smoothing can be obtained with the two well-known Fourier and B-spline bases. Usually, the flow data of the whole series present some seasonal variability and periodicity over the annual cycle; therefore, Fourier bases are preferred. Although the two smoothing methods do not give identical results, the differences between them in this adaptation are generally too small to affect interpretations. The choice of $p = 52$ is justified by the need to capture the flow variation within each week. Since the peak value is important in flood studies, it may be recommended to consider values of $p$ greater than 52 in order to ensure that the smooth curves reach the associated peaks. Nevertheless, this could lead to irregular curves which do not reasonably capture the overall flow variation.

[50] The nonparametric approach presented in section 2.4.3, using the kernel density estimate of the $z_i$, is employed for curve ordering and outlier detection, not for estimation purposes. Note that even though nonparametric approaches have been employed in hydrological FA in the univariate context [see, e.g., Adamowski and Feluch, 1990; Ouarda et al., 2001], they are still of limited use for the hydraulic design of major structures [Singh and Strupczewski, 2002]. In addition, the mode as a location measure is useful to detect the presence of inhomogeneity in the data. In hydrological FA, the mode is not commonly used since, generally, data should pass a homogeneity test. Therefore, the fitted models should be unimodal.

[51] Generally, in hydrology, there are two main sources of outliers. The data may be incorrect and/or the circumstances around the measurement may have changed over time [Hosking and Wallis, 1997]. However, a detected outlier can also represent true but unusual observed data. In the present functional context, outlier curves have different magnitudes and shapes compared to the rest of the observed curves.

4. Case Study

[52] The methods described in section 2 are applied to hydrological data series using the adaptation presented in section 3. In the following, the data are described, and functional as well as analogous multivariate results are presented and discussed. More precisely, the conversion of the discrete data into continuous functions is the first preliminary step. Then, the different location functions are obtained, and the variability of the sample is studied both directly and through FPCA. The principal component scores are also used for data visualization and as a preliminary tool to identify outliers. The suspected outliers are checked with the previously presented approaches and interpreted on the basis of meteorological data. Section 4.3 provides results using multivariate approaches for comparison purposes.

4.1. Data Description and Smoothing

[53] The data series consists of daily flows (m3 s−1) from the Magpie station, with reference number 073503. The station is located at the outflow of Magpie lake in the Côte-Nord region of the province of Quebec, Canada. The area of the drainage basin is 7230 km2 and the flow regime is natural. Data are available from 1979 to 2004. Figure 2 indicates the geographical location of the Magpie station.

Figure 2.

Geographical location of the Magpie station.

[54] For the present data set, we have $n = 26$ discrete observations. The $i$th discrete observation $(y_{i1}, \ldots, y_{i,365})$ denotes the daily flow measurements of the $i$th year, which are converted to a smooth curve $x_i(t)$. This is done through the technique based on the Fourier series expansion. This smooth representation of the flow data uses a 365 day base period (T = 365 days) and 52 nonconstant weekly basis functions (p = 52). The obtained functional observations are given in Figure 3 with the corresponding univariate and bivariate samples. Figure 3 allows us to have an overall view of the data within the three frameworks.

Figure 3.

Data for each one of the three frameworks: univariate, bivariate, and functional.

[55] Figure 4a illustrates the whole daily flow series. It shows that the data are nearly periodic. As indicated, this periodicity can justify the use of the Fourier basis. A number of observed hydrographs with the corresponding Fourier and B-spline smoothing curves are also presented in Figures 4b–4e. They show that the Fourier and B-spline smoothings are similar and indicate also that the peaks are generally reached. Figure 5 displays the standard deviation over $j$ of the residuals $y_{ij} - \hat{x}_i(t_j)$ after smoothing the flow data. It gives the residual variations across days, within each year. We observe that these errors are generally very small and do not exceed 32 m3 s−1. The highest errors are associated with the years 1981, 1999, and 2002.

[56] Note that other values of p, both smaller and larger than 52, e.g., 4, 12, 90, 122, 182, 300, were also tested. Even though large values of p, e.g., close to the number of observations per year (here 365), allow us to reach almost all the daily flow points, including the peaks, the obtained curves are not smooth or regular enough and also do not allow us to capture enough of the variance with the first few principal components. Small values of p, e.g., 4 or 12, give a very poor quality of smoothing, where a large number of daily flow points, particularly the high and low values, are not reached. Therefore, it is reasonable to choose a number p which combines a good quality of smoothing (related to (4)) with a high percentage of explained variance in the PCA analysis. In the present application, the choice p = 52 fits the discrete data reasonably well, except for some extreme points corresponding to a number of years (e.g., 1980, 1989 and 1993), where the resulting differences between the real peaks and the smooth ones are less than 150 m3 s−1 (see Figure 3).

Figure 4.

(a) The representation of all the data and (b–e) illustration of discrete hydrographs and the corresponding smoothing curves (Fourier in blue and B-splines in red) for some selected years.

Figure 5.

Standard deviations of the residuals from the smooth flows across days.

4.2. Functional Results

[57] Figure 6 presents the smooth location curves (mean, median, and mode). It shows that generally the maximum flow occurs in late April and early May, followed by a recession during May and June. This phenomenon is common in Canada, where floods are mainly caused by snowmelt during the spring season. On the right tail of the curves, we observe a small flood which occurs in the autumn and which is generally caused by liquid precipitation. This kind of flood is exhibited by the mode. For both floods, spring and autumn, we observe that the mode is always higher than the mean and the median. The mean is more regular and cannot reach high flow values. Therefore, it is useful to consider all these location measures. These location curves lead to a characterization of the basin through the whole event, rather than just some of its parts or summaries, and therefore allow for comprehensive basin comparisons.

Figure 6.

Fourier smoothed location curves: the mean, the median, and the mode.

[58] The bivariate (temporal) variance-covariance surface obtained from (11) and the corresponding contour map are presented in Figure 7. We observe that the main part of the variability occurs in the middle of the year and that it is negligible elsewhere. That is, the highest variability occurs approximately between April and late June. This period corresponds approximately to the highest flows. This measure has the advantage of providing information concerning the variance structure as well as the time at which variability occurs.

Figure 7.

(a) Estimated variance-covariance surface of the flow curves for years 1979–2004 and (b) the corresponding contour map.

[59] The principal components are obtained by FPCA applied to the centered observations $x_i^c(t)$. The variance rates accounted for by each one of the first four principal components are inline image, inline image, inline image, and inline image, respectively. These components account for inline image of the total variance of the flow. The centered principal components are presented in Figure 8a. The perturbations of these first four principal components about the mean, as given in (14), are presented in Figure 8b.

Figure 8.

First four smoothed principal components: (a) centered components and (b) components shown as variation about the mean curve. Negative and positive perturbations are indicated by the minus and plus symbols, respectively.

[60] From Figure 8, where the first two principal components accumulate inline image of the total variance, one can observe that the station flow is most variable between April and July. This variation dominates the variation occurring between July and the end of the year, which is associated with the third and fourth components and represents 19.9% of the total variance. This finding is, for all practical purposes, consistent with the one obtained from the variance-covariance surface (Figure 7). More precisely, the first two principal components w1 and w2 are representative of the spring floods, whereas w3 and w4 are more likely to represent autumn floods.

[61] The scores corresponding to the first four principal components are given in Table 2. Given the high variation rate captured by the first principal components, the corresponding variation indicates that the years for which the first or the second principal score is higher (lower) have higher (lower) flows from April to July. Therefore, the year 1981 presents the highest variability during this period, followed by the year 1999. On the other hand, the smallest variability of the flow during April–July is associated with the year 1987. Other years with low flow variability could also be considered, such as 1986 and 2002. The flow variability associated with the years 1981, 1986, 1987, 1999 and 2002 is unusual; some of the corresponding curves (1981, 1987, 1999) are displayed with the location curves in Figure 9.

Figure 9.

Curves corresponding to the suspected years (based on principal component scores) with the mean, median, and mode curves.

Table 2. First Four Principal Component Scores^a

Year | z1 | z2 | z3 | z4
1979 | −1180.34 | 1174.70 | 1457.11 | 373.28
1980 | 42.34 | 329.59 | −115.61 | 121.06
1981 | 2613.26 | 2046.54 | −448.91 | 225.00
1982 | 1947.70 | −704.01 | 500.88 | −600.22
1983 | −1673.67 | 1095.67 | 1936.68 | −133.87
1984 | 843.83 | 822.51 | −10.20 | −238.71
1985 | 903.82 | −1171.34 | 0.07 | −419.37
1986 | −1737.16 | −185.44 | 466.40 | 36.39
1987 | −1671.02 | −1615.92 | 285.93 | 10.93
1988 | −130.59 | 393.79 | −732.23 | 18.73
1989 | −633.73 | −176.40 | −663.42 | 87.96
1990 | −669.66 | −519.85 | −375.08 | −324.09
1991 | 529.43 | −604.15 | −281.41 | −52.87
1992 | −465.75 | −725.79 | −342.08 | 944.50
1993 | −374.77 | −753.53 | −315.64 | 19.32
1994 | 1058.79 | 68.18 | 451.04 | 1281.09
1995 | −268.43 | 463.72 | −933.12 | −268.28
1996 | 748.05 | 235.06 | 560.82 | −575.86
1997 | 1085.56 | −516.08 | 447.61 | 482.64
1998 | −1557.15 | 428.41 | −504.03 | −165.38
1999 | −1306.02 | 1809.15 | −621.20 | −641.45
2000 | 879.38 | −20.22 | 145.23 | −178.76
2001 | −1173.90 | −702.48 | −837.66 | 133.55
2002 | 1134.07 | −1692.79 | 796.87 | −433.89
2003 | −67.68 | 120.028 | −1087.29 | 315.00
2004 | 1123.65 | 400.67 | 219.25 | −16.70

^a The bold entries indicate the largest and the smallest values for the first and the second components.

[62] In order to check the above unusual years, the outlier detection methods described in section 2 are employed. Other functional methods were also tested, such as the functional depth method of Febrero et al. [2007] and the integrated squared error method of Hyndman and Ullah [2007]. However, these two methods gave either too many or no outliers; hence, the corresponding results are omitted.

[63] Figure 10 presents the rainbow plots based on the bivariate depth ordering and the density ordering indices (15) and (16), respectively. The colors indicate the ordering of the curves, where the blue curves are the closest to the center. The red and black outlier curves correspond to 1981 and 1999, respectively. Results show that both methods lead to a similar ordering, especially for the years associated with high or low ordering.

Figure 10.

Rainbow plots of the flow curves for years 1979–2004 using (a) the bivariate score depth and (b) the kernel density estimate.

[64] The bivariate bagplot associated with the first two principal scores, as well as the corresponding functional bagplot, are presented in Figure 11 for both 95% and 99% probability coverage. We observe that the point corresponding to the year 1981 is outside the outer bivariate bagplot region in both the 95% and 99% cases. It corresponds to the red curve in the associated functional bagplot (Figures 11c and 11d). Hence, this year is considered as an outlier according to the Tukey depth, as described in section 2. When considering the 95% bagplot, an additional outlier curve is detected, the one corresponding to 1987, as shown in Figure 11b. Note that, generally, when outliers are relatively near the median, the functional bagplot is not a good way to detect them [Hyndman and Shang, 2010]. Even though this is not the case here, it is also more appropriate to use the functional HDR box-plot.

Figure 11.

Bivariate score bagplot with (a) 99% and (b) 95% probability coverage and the corresponding functional bagplot with (c) 99% and (d) 95% probability coverage. The solid black curve shows the median curve, and its 95% or 99% pointwise confidence intervals are presented in blue; in Figures 11a and 11b the red asterisk is the Tukey median of the bivariate principal scores.

[65] The bivariate HDR and the associated functional HDR box-plots of the smooth flow curves are presented in Figure 12 for both 95% and 99% probability coverage. The only outlier detected with 99% coverage probability is 1981, which lies outside the bivariate HDR outer region. In the present case, we can deduce that the flow corresponding to the year 1981 is the most outlying, has a different magnitude and shape compared to the other curves, and is not near the median. Hence, we can conclude that 1981 is an effective outlier according to the HDR box-plot. When the probability coverage is 95%, another outlier is detected, corresponding to the year 1999, as shown in Figure 12b. This curve is closer to the median than the curve corresponding to 1981 (Figure 9), which is why the functional HDR box-plot is better able to detect it as an outlier than the functional bagplot.

Figure 12.

Bivariate score HDR box-plot with (a) 99% and (b) 95% probability coverage and the corresponding functional HDR box-plots with (c) 99% and (d) 95% probability coverage. The solid black curve shows the modal curve, and its 95% or 99% pointwise confidence intervals are presented in blue. In Figures 12a and 12b the red asterisk is the mode.

[66] As discussed in section 2.4, the HDR box-plot and the bagplot are graphical outlier detection methods and their thresholds are based on normality. Therefore, the years detected above can be considered as extreme curves that could be outliers. The approach developed by Dang and Serfling [2010] is applied to the first two functional principal component scores Z of the data set. We evaluated the spatial, Mahalanobis, and Tukey outlyingness functions for the bivariate score series. The corresponding thresholds are obtained by setting the ratio of false outliers to 15% and the true number of outliers to 5 (the same choices as in Chebana and Ouarda [2011b] and in section 4.3 below). Hence, the threshold corresponds to the 0.97 quantile of the outlyingness values. Figure 13 presents the detected outliers. We observe that the Tukey outlyingness function detects several years as outliers (including 1981, 1987, 1999 and 2002), whereas the year 1981 is detected by all three outlyingness functions. In addition, the year 1987 corresponds to the second largest spatial and Mahalanobis outlyingness values, and its Mahalanobis value is very close to that of 1999. Note that the Tukey outlyingness is not recommended by Dang and Serfling [2010]. Therefore, the year 1981 can be considered as an effective outlier to be checked. The years 1987 and 1999 could be detected by the spatial and Mahalanobis outlyingness when considering a true number of outliers larger than 5 (with values of 5%, 10% and 20% for the ratio of false outliers, the results remain the same). Note that the above suspected years 1986 and 2002 can be considered as extremes and not outliers.

Figure 13.

Outlier detection using the outlyingness approach applied on the first two scores.

[67] Even though the curve of 1981 is the only effective outlier, in the following we also examine the years 1987 and 1999, since they are close to the thresholds. We observe from Figure 9 that the curves of 1981, 1987 and 1999 are clearly different from the location curves and from the general shape of the curves. Indeed, on the basis of the corresponding hydrographs, the curve of 1981 is characterized by a very high peak and volume, whereas 1987 seems to correspond to a dry year, since the flow was the lowest during the spring season. The flood corresponding to the year 1999 also has a high peak, although lower than the one corresponding to 1981.

[68] The detected outliers can be explained on the basis of meteorological data. The following interpretations are drawn from the data available on Environment Canada's Web site http://www.climat.meteo.gc.ca/climateData/canada_f.html. For 1981, which corresponds to the most important flood for this basin, an important amount of snow was accumulated in early winter (October–November to January), followed by thaw and rain during February–March. For the outlier corresponding to 1987, the comparison with the preceding and following years reveals that during the fall of 1987 there was much less rain and the temperatures were very cold, whereas the end of winter was warm; hence, all the snow melted earlier compared to other years. The curve of 1999 is relatively higher than the location curves and corresponds to an important quantity of snow on the ground together with high temperatures in March. In conclusion, the detected years seem to correspond to actually observed events and not to incorrect measurements or changes in circumstances over time. Hence, these observations should be kept and employed for further analysis. However, it is recommended to use robust statistical methods to avoid sensitivity of the obtained results (e.g., modeling and risk evaluation) to outliers.

4.3. Multivariate Results

[69] For comparison purposes, a multivariate study based on the work of Chebana and Ouarda [2011b] is carried out on the present data set. We focus here on the flood peak Q and the flood volume V as they are among the most important and most studied flood characteristics [e.g., Yue et al., 1999; Shiau, 2003]. The bivariate series (Q, V), given in the first three columns of Table 3, are obtained from the daily flow series using an automated version of the algorithm of G. W. Pacher (Détermination objective des paramètres des hydrogrammes, personal communication, 2006). Note that the multivariate approaches presented by Chebana and Ouarda [2011b] are mainly based on depth functions. The Tukey depth function is considered in the present section. The corresponding depth values of each bivariate observation are reported in the fourth column of Table 3. The location and scale results are presented in Table 4. Results with other measures (such as the trimmed mean and dispersion) are obtained but not presented because of space limitations and in order to maintain coherence with the FDA approach.
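As an illustration of the depth computation, a short R sketch is given below; it uses the depth package (an assumed choice, not necessarily the implementation used by the authors) on the (Q, V) series of Table 3.

```r
# Sketch: Tukey depth of each (Q, V) observation with respect to the
# sample, as reported in the fourth column of Table 3. Only the first
# three years are shown here; the full series is in Table 3.
library(depth)

qv <- data.frame(Peak   = c(886.67, 849.67, 1456.67),
                 Volume = c(2088.92, 2357.02, 3909.14))

td <- apply(qv, 1, function(pt) depth(pt, qv, method = "Tukey"))
```

With 26 observations, the exact bivariate Tukey depth takes values of the form k/26, consistent with the entries of Table 3 (e.g., 0.0385 = 1/26).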

Table 3. Multivariate Results for Flood Peak and Volume^a

Year    Peak      Volume    TD      MO      SO      TO
1979     886.67   2088.92   0.2692  0.0571  0.1361  0.4615
1980     849.67   2357.02   0.3846  0.1971  0.1567  0.2308
1981    1456.67   3909.14   0.0385  0.8851  0.9563  0.9231
1982    1270.00   2443.15   0.0385  0.8032  0.6246  0.9231
1983     974.67   3012.18   0.0769  0.6700  0.8500  0.8462
1984    1056.67   2751.69   0.1154  0.4713  0.6857  0.7692
1985     787.00   1574.21   0.1538  0.4623  0.4815  0.6923
1986     610.33   1536.34   0.1154  0.5306  0.6026  0.7692
1987     344.33   1069.86   0.0385  0.8225  0.9204  0.9231
1988     843.33   2374.49   0.3077  0.2390  0.2455  0.3846
1989     678.67   1534.53   0.1923  0.4534  0.5395  0.6154
1990     506.33   1752.06   0.0769  0.7223  0.5603  0.8462
1991     740.00   2260.57   0.1538  0.4461  0.3003  0.6923
1992     710.80   1128.71   0.0385  0.7223  0.8923  0.9231
1993     666.80   1407.32   0.1538  0.5400  0.6964  0.6923
1994     932.90   2722.55   0.1538  0.4802  0.6113  0.6923
1995     868.77   2192.44   0.3462  0.0068  0.0324  0.3077
1996     886.90   2476.36   0.3077  0.2644  0.3562  0.3846
1997     697.30   2665.87   0.0385  0.7817  0.6607  0.9231
1998     825.00   1843.60   0.3077  0.1963  0.2717  0.3846
1999    1306.67   2652.26   0.0385  0.8042  0.7450  0.9231
2000     858.90   2492.65   0.2308  0.3526  0.4095  0.5385
2001     732.50   1188.92   0.0769  0.7053  0.8076  0.8462
2002     999.60   1485.36   0.0385  0.8045  0.6758  0.9231
2003    1004.93   1883.80   0.1538  0.6236  0.4102  0.6923
2004     842.57   2802.32   0.0769  0.6783  0.7252  0.8462

^a Bold entries indicate the values of the outlying measure corresponding to the detected outliers. TD, Tukey depth; MO, Mahalanobis outlyingness; SO, spatial outlyingness; TO, Tukey outlyingness.
Table 4. Multivariate Results for Flood Peak and Volume: Location and Scale Parameters

                         Peak        Volume
Mean (vector)            859.15      2138.70
Tukey median (vector)    847.72      2216.22
Dispersion (matrix)      57316.61    113915.10
                         113915.10   457040.80

[70] We observe that the Q and V of the bivariate median correspond to those of the median curve obtained in section 4.2. Indeed, in both the multivariate and functional frameworks, the median corresponds to the year 1980. However, the Q and V of the bivariate mean vector are quite different from those resulting from the mean curve. The mean vector is (Q = 859.15, V = 2138.70), whereas, when applying Pacher's (personal communication, 2006) algorithm to the mean curve, the corresponding Q and V are 673.09 and 2230.46, respectively. We also observe that the difference is larger for the peak than for the volume. This result could be explained by the effect of the detected outliers on the mean, an effect to which the median is not subject. Note that the outliers do not necessarily have the same impact in the multivariate and functional frameworks.
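A short R sketch of this comparison, again assuming the depth package and the (Q, V) data frame qv built from Table 3:

```r
# Sketch: bivariate Tukey median (deepest point) versus componentwise
# mean for the (Q, V) sample; the values are those reported in Table 4.
library(depth)
med(qv, method = "Tukey")$median   # Tukey median: (847.72, 2216.22)
colMeans(qv)                       # mean vector:  (859.15, 2138.70)
```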

Figure 14. Bivariate results: (a) bagplot with the median and some particular years and (b) scalar scale function.

[71] Figure 14a presents the bivariate (Q, V) bagplot where the median, the central region, and the outer region are indicated, as well as some particular observations (corresponding to the years suspected as outliers in section 4.2). Note that the outer region is obtained by inflating the central region by a factor of 3 instead of 1.96 or 2.58 as in the functional bagplot (Figures 11a and 11b). We observe that the shape of the bivariate (Q, V) bagplot is not similar to that of the functional bagplot or of the HDR box-plot based on the first two functional principal component scores zi. The values in Tables 2 and 3, as well as the corresponding figures (Figures 11a, 11b, and 14a), indicate that the first two functional principal component scores zi capture the information of the hydrograph in a different way than do (Q, V). The former are based on an optimization procedure, whereas the latter have a physical significance. Nevertheless, both are useful to understand flood dynamics and should be used in a complementary manner. This finding should be studied more thoroughly in future research by considering a number of case studies.
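A plot similar to Figure 14a can be produced, for instance, with the bagplot function of the aplpack R package (an assumed choice; the inflation factor of 3 mentioned above is its default):

```r
# Sketch: bivariate (Q, V) bagplot as in Figure 14a; factor = 3 inflates
# the central region (bag) to obtain the outer region (fence).
library(aplpack)
bagplot(qv$Peak, qv$Volume, factor = 3,
        xlab = "Flood peak", ylab = "Flood volume")
```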

[72] The bivariate (Q, V) variability is evaluated both in matrix form (Table 4) and by using a scalar scale function (Figure 14b). Note that the variability is particularly useful when comparing at least two data sets of the same kind of series (e.g., the same variable or the same vector). It is appropriate to compare the univariate peak scale with the functional one since the flood peak has the same unit and scale as the daily flow, which is not the case for the volume. Hence, we observe from Table 4 and Figure 7 that the peak variance has the same magnitude as in the functional context. One can also appropriately standardize the Q and V variables in order to compare the variances of the vector (Q, V) with those of the functional context.
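One possible way to carry out the suggested standardization, sketched in R:

```r
# Sketch: standardize Q and V (zero mean, unit variance) so that the
# vector variability can be compared with a similarly standardized
# functional variability on a common scale.
qv_std <- scale(qv)
cov(qv_std)   # reduces to the (Q, V) correlation matrix
```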

[73] The procedure employed by Chebana and Ouarda [2011b] for outlier detection is based on depth outlyingness measures and the corresponding thresholds. The reader is referred to Chebana and Ouarda [2011b] or Dang and Serfling [2010] for more details about the outlyingness expressions and the threshold determination. In the present section, three outlyingness measures are evaluated on the (Q, V) series, i.e., Tukey (TO), Mahalanobis (MO), and spatial (SO). Their values are presented in the last three columns of Table 3. To obtain the threshold that the outlyingness of an outlier should exceed, we set the ratio of false outliers to 15% and allowed 5 true outliers (the same choices as in Chebana and Ouarda [2011b]). Hence, the threshold corresponds to the empirical 97% quantile of the outlyingness values. The obtained threshold values are 0.9231, 0.8676, and 0.9462 for TO, MO, and SO, respectively. Consequently, 1981 is detected by all measures, whereas 1987 is detected only by TO; it also has the second highest outlyingness value by MO and SO, although smaller than the corresponding thresholds. The measure TO detects several other outliers, including 1999 and 2002, which all have the same TO value (equal to the threshold). However, if a quantile of order higher than 97% is considered, by modifying the parameters related to the threshold, TO does not detect any outliers. These results are consistent with those of the functional framework in the sense that the most unusual observations are detected in both frameworks. However, the proposed approach, which consists in applying the Dang and Serfling [2010] technique to the first two score series zi, seems to be justified and more reliable.
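As a check, the reported thresholds can be applied directly to the outlyingness columns of Table 3 (a sketch in R, with tab an assumed data frame holding Table 3 with columns Year, MO, SO, and TO):

```r
# Sketch: flag the years whose outlyingness reaches the reported
# thresholds; '>=' is used so that TO values equal to the threshold
# (e.g., 1999 and 2002) are flagged, as described in the text.
thr <- c(MO = 0.8676, SO = 0.9462, TO = 0.9231)
for (m in names(thr)) {
  cat(m, "detects:", tab$Year[tab[[m]] >= thr[m]], "\n")
}
```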

5. Summary and Concluding Remarks

[74] The first aim of the present paper is to introduce the functional framework to hydrological applications on the basis of the curve nature of the data to be employed and analyzed. The FDA framework can be seen as a natural extension of multivariate FA, which is gaining popularity and usefulness in meteorological and hydrological studies. In the present study we introduced a number of FDA notions and techniques and adapted them to the hydrological context, more specifically to floods. The techniques within this first functional FA step deal with visualization, location estimation, variability quantification, principal component analysis and outlier detection. A new nongraphical (numerical) outlier detection method is proposed, which combines multivariate and functional techniques.

[75] An application is carried out to demonstrate the potential of employing FDA techniques in hydrology. The application deals with the natural streamflow series of the Magpie station in the province of Quebec, Canada. Results regarding location measures, such as the mean, median and modal curves, are obtained. The variability is studied as a bivariate function (surface) and also by using principal component analysis. Outlier curves are identified by the most efficient methods, and interpretations are given on the basis of meteorological data. For comparison purposes, a brief bivariate study of flood peak and volume is carried out. Even though FDA is an extension of multivariate analysis, it is recommended to apply both approaches in order to obtain a complete understanding of floods and to make the appropriate decisions.

[76] From the elements discussed in section 1 and the results obtained in the case study, the following concluding remarks can be drawn, and a number of limitations and perspectives are given.

5.1. Drawbacks of Previous Approaches

[77] The following drawbacks represent the motivation and the need to introduce the functional framework in hydrological applications.

[78] 1. The separate or joint use of hydrograph characteristics constitutes a major simplification of the real phenomenon.

[79] 2. Given the lack of data in hydrology, working with a limited number of extracted characteristics represents a loss of part of the available information.

[80] 3. The way these characteristics are determined is neither unique nor objective.

[81] 4. The multivariate analysis is a simplification of the hydrological phenomena since it is based on flood characteristics which are simple transformations of the hydrograph.

[82] 5. In the multivariate setting, the complexity of the models, the fitting and estimation difficulty, the number of parameters and the associated uncertainty increase with the dimension.

[83] 6. The importance of the hydrograph shape is shown in studies such as the one by Yue et al. [2002], where the approaches approximating the flood hydrograph by probability densities are limited, for instance, to single-peak hydrographs.

[84] 7. The main flood characteristics, peak, volume and duration, cannot completely capture the shape of the hydrograph.

[85] 8. Even though a number of functional elements, such as mean and median curves, are presented in flow duration curve studies [e.g., Castellarin et al., 2004], they are limited and lack a functional statistical foundation.

5.2. Conceptual Advantages of the Functional Framework

[86] The functional framework presents some general advantages which contribute to overcoming the previous drawbacks at different levels.

[87] 1. The functional framework treats the whole hydrograph as a functional observation (a function or curve), which is more representative of the real phenomenon.

[88] 2. It employs the maximum of the available information in the data, thereby reducing the impact of the lack of data in hydrology.

[89] 3. The functional framework is more general and more flexible and can represent a large variety of hydrographs.

[90] 4. The functional methodology constitutes a natural extension of univariate and multivariate hydrological FA approaches.

[91] 5. The location curves and functional scale measures can be used to characterize a given basin and to proceed to comparison or grouping of a set of basins.

[92] 6. FDA allows us to perform a single analysis on the whole data set instead of several univariate or multivariate analyses.

[93] 7. The approaches dealing with hydrograph shape, e.g., the one proposed by Yue et al. [2002], can be generalized in the FDA context using smoothing techniques.

[94] 8. The functional setting avoids the definition and evaluation of flood characteristics. Therefore, it does not require specific extraction algorithms, avoids subjective evaluations, and reduces the associated uncertainty in the analysis.

5.3. Concluding Remarks From the Application

[95] The following points are drawn as specific results of the FDA application to the case study.

[96] 1. The location curves (mean, median and mode) give more information concerning the hydrological regime of the basin than the univariate and multivariate approaches by adding temporal aspects. These curves summarize the information contained in the data for a given basin, and hence allow comparisons between basins and the grouping of basins with similar regimes.

[97] 2. The bivariate (temporal) variance-covariance surface, as well as the corresponding contour plot, gives additional insight into the variability of the hydrological regime compared to the real-valued or matrix scale measures of the univariate and multivariate contexts.

[98] 3. In addition to quantifying the variability, functional scale measures indicate when it occurs.

[99] 4. The case study results show that the mode is useful to characterize high flood values, that the variability is very high during the spring season, and that the principal components describe the variability of spring and autumn floods. The detected outliers are verified to be real observations; it is therefore suggested to use robust methods in any further analysis.

[100] 5. The first two functional principal components capture the information of the hydrograph in a different way than do the flood characteristics (Q, V). Nevertheless, both are useful to understand flood dynamics and should be used in a complementary manner.

[101] 6. FPCA represents a new way to distinguish the different flood events in a given year. Indeed, the first few principal components can be used to identify where in the hydrograph the variation dominates and hence to characterize flood events; e.g., the first two principal components are representative of the spring floods whereas the next two represent autumn floods.

[102] 7. In the functional context, outlier curves have different magnitudes and shapes compared to the rest of the observed curves. In the univariate and multivariate settings, the shape is not considered and cannot be captured even by using several variables.

[103] 8. The functional results obtained in this study are generally coherent with those of the multivariate analysis but give more insight into the hydrological phenomena, for instance in terms of location measures, variability and principal components.

5.4. Limitations and Perspectives of the Functional Framework

[104] The present study presented exploratory functional tools that are important on their own; it also constitutes a basis for the next steps toward a reliable FDA-based hydrological FA, especially in terms of model selection and risk evaluation. Several promising perspectives can be pursued in future research.

[105] 1. Although the study focused on floods, the presented FDA methodology can be adapted and applied to other hydrometeorological variables such as droughts, precipitation, storms and heat waves.

[106] 2. FDA relies on the smoothing step. Therefore, a careful inspection of the resulting curves is recommended, for instance, to ensure the regularity of the smoothed functions, to preserve special points such as peaks, or to capture enough of the variance with the first few principal components (a minimal sketch of such checks is given after this list). Even though a number of elements in this direction are given in the present study, it could be of interest to develop general criteria and objective choices depending on the objective of the analysis.

[107] 3. The classification of the curves of a given site as well as the clustering of sites in a region on the basis of the full hydrograph are also topics of interest.

[108] 4. Inferential aspects, such as modeling for prediction purposes, also represent important issues for future research efforts.

[109] 5. Future investigations should also deal with hypothesis testing as well as regression modeling.
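As an indication of what the inspection suggested in item 2 could look like, a minimal R sketch using the fda package is given below; the data object flow_matrix (daily flows, 365 x n) and the basis size are illustrative assumptions, not recommendations.

```r
# Sketch of post-smoothing checks: regularity of the curves, preservation
# of peaks (visual inspection), and variance captured by the first
# functional principal components.
library(fda)
day     <- 1:365
basis   <- create.bspline.basis(rangeval = c(1, 365), nbasis = 40)
flow_fd <- Data2fd(argvals = day, y = flow_matrix, basisobj = basis)
plot(flow_fd)                 # visual check: regularity, peaks
pca <- pca.fd(flow_fd, nharm = 4)
cumsum(pca$varprop)           # cumulative variance of first harmonics
```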

Acknowledgments

[110] Financial support for this study was graciously provided by the Natural Sciences and Engineering Research Council (NSERC) of Canada and the Canada Research Chair Program. The authors would also like to thank the authors of the fda and rainbow R packages. The authors wish to thank the Editor, the Associate Editor, and three anonymous reviewers for their useful comments, which led to considerable improvements in the paper.
