Simulation of the entire range of daily precipitation using a hybrid probability distribution



[1] Underestimation of extreme values is a widely acknowledged issue in daily precipitation simulation. Nonparametric precipitation generators have inherent limitations in representing extremes. Parametric generators can realistically model the full spectrum of precipitation amount through compound distributions. Nevertheless, fitting these distributions suffers from numerical instability, supervised learning, and computational demand. This study presents an easy-to-implement hybrid probability distribution to model the full spectrum of precipitation amount. The basic idea for the hybrid distribution lies in synthesizing low to moderate precipitation by an exponential distribution and extreme precipitation by a generalized Pareto distribution. By forcing the two distributions to be continuous at the junction point, the threshold of the generalized Pareto distribution can be implicitly learned in an unsupervised manner. Monte Carlo simulation shows that the hybrid distribution is capable of modeling heavy tailed data. Performance of the distribution is further evaluated using 49 daily precipitation records across Texas. Results show that the model is able to capture both the bulk and the tail of daily precipitation amount. The maximum goodness-of-fit and penalized maximum likelihood methods are found to be reliable complements to the maximum likelihood method, in that generally they can provide adequate goodness-of-fit. The proposed distribution can be incorporated into precipitation generators and downscaling models in order to realistically simulate the entire range of precipitation without losing extreme values.

1. Introduction

[2] Precipitation simulation is one of the key features of hydrological models, agricultural models, and climate impact studies [Kleiber et al., 2011]. Sufficiently long series of precipitation records are needed for catchment water management, drought characterization and prediction, and crop growth simulation. However, historical records of precipitation with desired spatial and temporal resolution are almost always insufficient. Moreover, it is difficult to quantify the uncertainty of model results from only a single sequence of realizations. On the other hand, there is considerable discussion these days that climate change is contributing to the increase in frequencies and magnitudes of precipitation extremes, leading to floods or droughts, and hence evaluating changes in precipitation extremes is receiving significant attention [Solomon et al., 2007; Lenderink and Meijgaard, 2008; Hardwick Jones et al., 2010]. Therefore, realistically modeling the full spectrum of precipitation is desired.

[3] Precipitation simulation dates back to the 1950s. Over the past decades, many simulation techniques have been developed [e.g., Gabriel and Neumann, 1962; Katz, 1974, 1977; Todorovic and Woolhiser, 1975; Richardson, 1981; Stern and Coe, 1984; Lall and Sharma, 1996; Wilks, 1998; Rajagopalan and Lall, 1999; Parlange and Katz, 2000; Yan et al., 2002; Harrold et al., 2003a, 2003b; Chandler, 2005; Mehrotra and Sharma, 2007a, 2007b; Furrer and Katz, 2007; Zheng and Katz, 2008a, 2008b; Brissette et al., 2007]. Typically, daily precipitation is represented as a mixture of two distributions in a parametric, nonparametric, or semiparametric framework. One is a discrete binary distribution modeling the wet or dry state of a given day, and the other is a continuous distribution modeling nonzero precipitation amounts on wet days. A recent review of precipitation simulation is given by Sharma and Mehrotra [2010]. Overall, there are two acknowledged challenges in daily precipitation simulation. One is referred to as overdispersion [Katz and Zheng, 1998]. The other is the loss of extreme precipitation events. The first problem concerns both the occurrence and amount processes of precipitation, whereas the second mainly concerns the amount process. This paper focuses on the second problem.

[4] Since daily precipitation amount always shows a skewed distribution with a bias toward low values, it is usually modeled by right-skewed distribution families [Hundecha et al., 2009]. Different distributions, such as Kappa [Mielke, 1973], exponential [Todorovic and Woolhiser, 1975; Roldan and Woolhiser, 1982], gamma [Ison et al., 1971; Katz, 1977; Schoof et al., 2010], mixed exponential [Roldan and Woolhiser, 1982; Wilks, 1998, 1999], and truncated and power transformed normal distributions [Bárdossy and Plate, 1992; Hutchinson, 1995] have been used to model daily precipitation amount. The aforementioned families perform reasonably well in reproducing average characteristics of precipitation. Nevertheless, none of them necessarily performs well in simulating extremes [Wilks, 1999; Furrer and Katz, 2008]. Besides parametric approaches, nonparametric approaches have also been used for daily precipitation simulation, in which synthetic precipitation is sequentially sampled from historical observations with replacement. Several limitations inherent to this sampling scheme, especially with respect to extremes [Furrer and Katz, 2008], have been recognized and addressed via the nonparametric kernel density estimator (KDE) [Lall and Sharma, 1996; Rajagopalan and Lall, 1999]. Nevertheless, the likelihood for extremes to be generated remains low [Markovich, 2007], leading to underestimated extreme rainfall. Reproducing the entire range of precipitation in synthetic series has been identified as a critical research need in both simulation and downscaling, and has inspired a recent flurry of research, such as Vrac and Naveau [2007], Furrer and Katz [2008], and Hundecha et al. [2009], in which compound distributions are used for modeling precipitation amount. Problems involved in fitting these distributions include numerical instability, data sensitivity, supervised learning, and computational demand.

[5] The objective of this study therefore is to develop an efficient, reliable and relatively easy-to-implement method for simulating both the low to moderate and extreme rainfall. To that end, the specific objectives are to: (1) examine if the existing distributions are reliable to model precipitation amount in different climate divisions, (2) present a hybrid distribution to model the full spectrum of daily precipitation, (3) validate the hybrid distribution model, and (4) describe approaches for estimating parameters of the hybrid distribution. It is noted that the proposed distribution is not intended, however, to replace existing distributions, like those developed by Vrac and Naveau [2007] and Furrer and Katz [2008], but rather to complement them and to provide a more efficient, reliable, and less complicated approach to model the heavy tail distribution of precipitation amount without losing any goodness-of-fit.

[6] The paper is organized as follows. Following the objectives formulated in section 1, the data used are briefly described in section 2. Fundamental to the simulation of daily rainfall is the choice of a probability distribution, which is discussed in section 3. A hybrid probability distribution is presented in section 4. Its evaluation is presented in section 5. Based on problems raised from real cases, three estimation approaches are presented in section 6. The paper is concluded in section 7.

2. Data Sets

[7] Daily precipitation records from 49 weather stations across Texas, listed in Table 1, were used. These stations are spread across the 10 climate divisions delineated by the National Weather Service. The United States Historical Climatology Network (USHCN) provides high quality precipitation records [Mishra and Singh, 2010]. This study concerns only precipitation amount. Therefore, all nonzero records were valued pieces of information. Missing values have little influence on the fitted distributions as long as sufficient data are available. To gather as much useful information as possible, all nonzero precipitation records from 1940 to 2009 were used without any treatment of missing values. Ignoring missing values is harmless here, since there were at least 1600 nonzero records at each station.

Table 1. ID, Label, Location, Annual Mean Precipitation (P) and Temperature (T) of Weather Stations in Texas Used in This Study^a
Station ID    Station Label    Longitude (deg)    Latitude (deg)    P (mm)    T (°F)

^a Annual mean precipitation and temperature are computed using data over 1960 to 2009.


3. Evaluation of Commonly Used Distributions for Daily Precipitation

[8] Fundamental to the simulation of daily precipitation is the use of an appropriate probability distribution for precipitation amount. To that end, the question arises: What are the typical distributional characteristics of daily precipitation? Then, one should search for a distribution that captures these characteristics. Since there are several forms of possible distributions, the next question to be addressed is one of evaluating these distributions and selecting the appropriate one. The selection of a distribution involves enumerating distribution properties and estimation of distribution parameters. These issues are discussed in what follows.

3.1. Rainfall Characteristics

[9] First, typical characteristics of a nonzero daily precipitation distribution were explored, using, as an example, the central station ID24, which is located in the western part of Texas. Among all the 49 stations, ID24 had the maximum entropy and was hence considered as “central” [Krstanovic and Singh, 1992]. Without considering missing values, there were 3049 wet days from 1 May 1940 to 31 December 2009. A histogram of nonzero values, together with summary statistics, is shown in Figure 1(a), which exhibits a representative shape of the distribution of daily precipitation amount. Two typical properties seen from the histogram are: (1) right skewness, indicated by the median being lower than the mean and most of the data being clustered around the left end of the distribution; and (2) a heavy tail, represented by sparse observations toward the tail end of the distribution and a slower than exponential decay to zero, which is clearly seen in Figure 1(b). Distributions that can reproduce these two properties should be used. Extensively employed distributions can generally be divided into single and compound types.

Figure 1.

(a) Histogram of precipitation amount of station ID24 with the fitted exponential PDF (solid line), and (b) the empirical survival function together with that of the fitted exponential distribution (solid line), plotted on log-log scale.

3.2. One-Component Distributions

[10] Commonly used one-component distributions include Kappa [Mielke, 1973], exponential [Todorovic and Woolhiser, 1975; Woolhiser and Roldan, 1982], and gamma distributions [Ison et al., 1971; Katz, 1977; Schoof et al., 2010]. Let X denote the nonzero daily precipitation amount, and let subscript capital letters, say math formula (for Kappa distribution), distinguish different distributions. The probability density functions (PDF) for these distributions are now presented.

[11] Kappa distribution:

display math

where math formula, and math formula, math formula, and math formula are the shape and scale parameters, respectively.

[12] Exponential distribution:

display math

where math formula and math formula is the scale or intensity parameter.

[13] Gamma distribution:

display math

where math formula, and math formula and math formula are the shape and scale parameters, respectively.

[14] Other distributions where the focus is on modeling only the upper tail include generalized stretched exponential distribution and generalized Pareto (GP) distribution [Coles, 2001; Katz et al., 2002; Koutsoyiannis, 2004a, 2004b; Wilson and Toumi, 2005; Naveau et al., 2005]. Since the objective in this study was to simulate not only the “tail” but also the “bulk,” our interest is only in distributions which are widely used in stochastic weather generators or those that can model both the aforementioned precipitation distribution properties.

3.3. Two-Component Distributions

[15] Commonly used compound distributions include mixed exponential distribution [Roldan and Woolhiser, 1982; Wilks, 1998, 1999], dynamic mixture of gamma and generalized Pareto distribution [Vrac and Naveau, 2007; Hundecha et al., 2009], and hybrid gamma and generalized Pareto distribution [Furrer and Katz, 2008]. These compound distributions are discussed below.

[16] Mixed exponential distribution (ME):

display math

where math formula, p is the mixing factor, and math formula, math formula are, respectively, the scale parameters of the two components.
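As an illustration of the ME distribution, the sketch below assumes the common scale parameterization, in which the density is a weighted sum of two exponential densities with mixing factor p and scales beta1 and beta2. The function names are hypothetical, and the parameter values are arbitrary rather than fitted.

```python
import math
import random

def me_pdf(x, p, beta1, beta2):
    """Mixed exponential density (assumed scale parameterization)."""
    return (p / beta1 * math.exp(-x / beta1)
            + (1 - p) / beta2 * math.exp(-x / beta2))

def me_cdf(x, p, beta1, beta2):
    """Mixed exponential cumulative distribution function."""
    return 1.0 - p * math.exp(-x / beta1) - (1 - p) * math.exp(-x / beta2)

def me_sample(rng, p, beta1, beta2):
    """Draw one value: pick a component with probability p, then sample it."""
    scale = beta1 if rng.random() < p else beta2
    return rng.expovariate(1.0 / scale)

# Arbitrary illustrative parameters: a frequent light component and a
# less frequent heavier one.
p, beta1, beta2 = 0.7, 3.0, 20.0
rng = random.Random(1)
draws = [me_sample(rng, p, beta1, beta2) for _ in range(20000)]
print("sample mean:", sum(draws) / len(draws))
print("theoretical mean:", p * beta1 + (1 - p) * beta2)
```

Both components decay exponentially, so the mixture tail is still exponential in the limit, which is in line with the later observation that the ME distribution improves on the single exponential but not enough for very high extremes.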

[17] Dynamic mixture of gamma and GP distribution (DM):

display math

where math formula, and math formula is the mixing function expressed as

display math

with location parameter math formula and scale parameter math formula. The mixing function monotonically increases from 0.5 to 1 as x increases from 0 to math formula such that the bulk of the distribution is dominated by gamma and the tail is dominated by the GP distribution.

[18] The other ingredients in this distribution are the gamma density math formula parameterized by math formula and math formula, the GP density math formula located at 0 ( math formula) with shape parameter math formula and scale parameter math formula, and the normalization constant math formula. After easy algebraic operations, the normalization constant is given as

display math

The advantages of this DM distribution are: (1) it can model the full range of precipitation; and (2) it circumvents the selection of a threshold. Threshold selection is a challenging task in practice, since for a given data set it is difficult to pinpoint the level above which extreme value theory (EVT) applies, and usually a subjective trial and error exercise is needed [Frigessi et al., 2002; Carreau and Bengio, 2009].

[19] Hybrid gamma and GP distribution:

[20] Considering the discontinuity of DM distribution in the limiting case ( math formula is 0 or close to 0) and the difficulty of incorporating covariates, Furrer and Katz [2008] proposed a hybrid distribution where a GP distribution replaces the tail of a gamma distribution. For simplicity we use FK08 to denote this distribution. The PDF of FK08 distribution is

display math

where math formula is the indicator function and math formula is the normalization factor. This factor ensures that the integral of the density over its support is unity. To force the hybrid density to be continuous at the threshold math formula, it is necessary that math formula, which yields that the scale parameter math formula of the GP distribution is exactly the reciprocal of the gamma hazard function, i.e.,

display math

Therefore, this distribution can be fully represented by math formula.
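The continuity constraint can be illustrated numerically. The sketch below assumes that the tail piece of the hybrid density is the GP density weighted by the gamma exceedance probability at the threshold, so that continuity at the threshold gives a GP scale equal to the reciprocal of the gamma hazard, as stated above. The gamma CDF is computed with the standard power series of the lower incomplete gamma function, and all function names are hypothetical.

```python
import math

def gamma_pdf(x, shape, scale):
    """Gamma density for x > 0."""
    return math.exp((shape - 1) * math.log(x) - x / scale
                    - math.lgamma(shape) - shape * math.log(scale))

def gamma_cdf(x, shape, scale, terms=500):
    """Gamma CDF via the power series of the lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0 / shape
    total = term
    for n in range(1, terms):
        term *= z / (shape + n)
        total += term
        if term < 1e-16 * total:
            break
    return total * math.exp(shape * math.log(z) - z - math.lgamma(shape))

def fk08_gp_scale(threshold, shape, scale):
    """GP scale implied by continuity: reciprocal of the gamma hazard."""
    hazard = (gamma_pdf(threshold, shape, scale)
              / (1.0 - gamma_cdf(threshold, shape, scale)))
    return 1.0 / hazard

# Arbitrary illustrative values (not fitted to any station).
print(fk08_gp_scale(threshold=20.0, shape=0.7, scale=10.0))
# With shape = 1 the gamma is exponential and the hazard is constant (1/scale).
print(fk08_gp_scale(threshold=10.0, shape=1.0, scale=5.0))
```

The series loses accuracy far in the tail, where one minus the CDF suffers from cancellation, so this sketch is only meant for moderate thresholds.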

[21] In both the above compound distributions, the GP distribution acts as a tail component

display math

where math formula, and math formula, math formula, and math formula are the shape, scale, and location parameters, respectively. The popularity of GP distribution can be explained by EVT [Coles, 2001; Castillo et al., 2005], which states that precipitation exceedances over a threshold math formula can be asymptotically approximated by a GP distribution, given that the threshold is sufficiently large. EVT has a deficiency of overlooking small values since the threshold should be sufficiently high and since only exceedances are involved in the analysis. Therefore, it is not suitable for modeling the full spectrum of precipitation. Then it would seem intuitive to model the bulk of precipitation by gamma or exponential distribution and take care of the tail by GP.
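For reference, a minimal sketch of the GP distribution in its standard form with positive shape is given below, with the symbols xi, sigma, and u standing for the shape, scale, and location parameters. The function names are hypothetical; sampling uses the inverse CDF.

```python
import math
import random

def gp_pdf(x, xi, sigma, u=0.0):
    """GP density for xi > 0; zero below the location u."""
    z = (x - u) / sigma
    if z < 0:
        return 0.0
    return (1.0 + xi * z) ** (-1.0 / xi - 1.0) / sigma

def gp_cdf(x, xi, sigma, u=0.0):
    """GP cumulative distribution function for xi > 0."""
    z = max((x - u) / sigma, 0.0)
    return 1.0 - (1.0 + xi * z) ** (-1.0 / xi)

def gp_quantile(p, xi, sigma, u=0.0):
    """Inverse of gp_cdf for 0 <= p < 1."""
    return u + sigma / xi * ((1.0 - p) ** (-xi) - 1.0)

# Arbitrary illustrative parameters: exceedances over a 25 mm threshold.
xi, sigma, u = 0.2, 12.0, 25.0
rng = random.Random(2)
sample = [gp_quantile(rng.random(), xi, sigma, u) for _ in range(5)]
print([round(v, 1) for v in sample])
print("99.9% quantile:", round(gp_quantile(0.999, xi, sigma, u), 1))
```

Every simulated value lies above u, which illustrates why the GP distribution alone says nothing about amounts below the threshold.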

3.4. Estimation of Distribution Parameters

[22] The maximum likelihood (ML) method was used to estimate distribution parameters. To estimate parameters of the DM distribution, the normalization constant must be computed and the log likelihood function maximized. Since the integral in the normalization constant has no closed-form antiderivative, numerical integration must be employed. In this study, the MATLAB function quadgk, which is based on the adaptive Gauss-Kronrod quadrature algorithm, was used for the numerical integration. The negative log likelihood function was minimized with fminsearch, which implements the Nelder-Mead simplex method. To obtain a highly precise normalization constant, Frigessi et al. [2002] suggested rewriting the integral from 0 to math formula as a sum of integrals from 0 to 1, 1 to 2, and so on, computing each integral, and truncating the summation when the last integral no longer leads to a significant change in the sum. This approach is computationally expensive, since numerical experiments show that a large number of steps are generally needed and since the normalization constant must be recomputed at each step of the log likelihood maximization procedure. To speed up the procedure, an alternative is to compute the integration directly from 0 to math formula or to a large value and then use Frigessi's method. The fitting approach works satisfactorily when the integrand behaves well, as shown in the upper plot of Figure 2, but deteriorates when the integrand is not smooth, as shown in the bottom plot. In this case, a normalization constant with low accuracy is obtained, which in turn introduces false local maxima of the log likelihood function and finally confounds the optimization algorithm.

Figure 2.

Two representative behaviors of the integrand function in the normalization constant of the DM distribution.
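The segment-wise computation of the normalization constant can be sketched as follows. This is an illustration under stated assumptions, not the MATLAB implementation used in this study: the mixing function is taken to be the Cauchy-CDF weight of Frigessi et al. [2002], the unnormalized density is the weighted sum of the gamma and GP densities, and composite Simpson quadrature replaces quadgk. The gamma shape is chosen larger than 1 so that the integrand stays bounded at zero; all names and parameter values are hypothetical.

```python
import math

def gamma_pdf(x, shape, scale):
    """Gamma density; returns 0 at x <= 0 (valid for shape > 1)."""
    if x <= 0:
        return 0.0
    return math.exp((shape - 1) * math.log(x) - x / scale
                    - math.lgamma(shape) - shape * math.log(scale))

def gp_pdf(x, xi, sigma):
    """GP density located at 0 with xi > 0."""
    return (1.0 + xi * x / sigma) ** (-1.0 / xi - 1.0) / sigma

def weight(x, m, tau):
    """Assumed mixing function: Cauchy CDF with location m and scale tau."""
    return 0.5 + math.atan((x - m) / tau) / math.pi

def dm_unnormalized(x, shape, scale, xi, sigma, m, tau):
    w = weight(x, m, tau)
    return (1.0 - w) * gamma_pdf(x, shape, scale) + w * gp_pdf(x, xi, sigma)

def simpson(f, a, b, n=50):
    """Composite Simpson rule with n (even) subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(a + i * h)
    return s * h / 3.0

def dm_normalization(params, tol=1e-8, max_segments=100000):
    """Sum integrals over [0,1], [1,2], ... until a segment is negligible."""
    total, k = 0.0, 0
    while k < max_segments:
        piece = simpson(lambda x: dm_unnormalized(x, *params), k, k + 1)
        total += piece
        if piece < tol * total:
            break
        k += 1
    return total

# shape, scale, xi, sigma, m, tau: arbitrary illustrative values.
params = (2.0, 5.0, 0.2, 10.0, 20.0, 5.0)
z = dm_normalization(params)
print("normalization constant:", z)
```

With a heavy GP tail the segment contributions shrink only polynomially, so several hundred unit segments are summed before the stopping rule triggers, and this whole computation would be repeated at every step of a likelihood search.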

[23] It is difficult to fit the FK08 distribution directly by the ML method. This distribution was therefore estimated following the procedure suggested by Furrer and Katz [2008]: (1) estimating the gamma parameters math formula and math formula by the ML method with all the data; (2) determining a reasonable threshold math formula and then estimating the GP parameters math formula and math formula by the ML method from the data above math formula; and (3) adjusting the estimated scale parameter math formula obtained in step 2 using equation (6b) to achieve a continuous density. The estimation procedure underscores that a reasonable threshold must be predetermined.

3.5. Evaluation of Single Component Distributions Using QQ Plots

[24] The ML method was used to fit the aforementioned distributions to precipitation data from the sample station ID24. The QQ plot was chosen as the goodness-of-fit criterion. QQ plots corresponding to the Kappa, exponential, and gamma distributions, given in Figure 3, show that despite an acceptable fit for low to moderate values, all of these distributions provided a rather poor fit for higher values. The upper tails of the exponential and gamma distributions are not heavy enough, thus underestimating the likelihood of heavy precipitation, whereas the Kappa distribution exhibits a much too heavy tail, leading to overestimated extreme precipitation.

Figure 3.

QQ plots of observed versus Kappa, exponential, gamma, and ME modeled precipitation quantiles of station ID24.

3.6. Evaluation of Compound Distributions Using QQ Plots

[25] As seen from Figure 3, the ME distribution, which is the most extensively used compound distribution for precipitation simulation, offered somewhat of an improvement in capturing the tail behavior but not enough. This observation empirically confirms that the ME distribution performs well only when precipitation extremes are not very high [Wilks, 1999; Hundecha et al., 2009].

[26] Since the quantile function of the DM distribution cannot be expressed analytically and since we want to use the QQ plot to check the agreement between the data and the fitted distribution, a parametric bootstrap method was used to construct the QQ plot [Efron and Tibshirani, 1993; Gomes and Oliveira, 2001; Castillo et al., 2005]. First, a large number of samples were drawn from the fitted distribution and the corresponding estimates for quantile math formula were computed. These estimates were then used to obtain an empirical cumulative distribution function (CDF) for the estimated quantiles math formula. The bootstrap based QQ plot can be constructed as a scatterplot of math formula versus math formula, and the corresponding confidence interval, say 95%, can be computed as math formula.
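The bootstrap construction can be sketched as follows. This is an illustration only: it assumes that the modeled quantile is taken as the median of the bootstrap estimates of each order statistic and that the interval is formed from their empirical percentiles. An exponential model stands in for the fitted distribution, and the function names and data are hypothetical.

```python
import random

def percentile(sorted_vals, q):
    """Linear-interpolation percentile of a sorted list, q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def bootstrap_qq(observed, sampler, n_boot=300, level=0.95, seed=0):
    """Return (observed, modeled median, lower, upper) per order statistic."""
    rng = random.Random(seed)
    n = len(observed)
    obs_sorted = sorted(observed)
    # Each replicate: n draws from the fitted model, sorted so that
    # position i estimates the i-th order statistic.
    reps = [sorted(sampler(rng) for _ in range(n)) for _ in range(n_boot)]
    alpha = (1.0 - level) / 2.0
    rows = []
    for i in range(n):
        col = sorted(rep[i] for rep in reps)
        rows.append((obs_sorted[i], percentile(col, 0.5),
                     percentile(col, alpha), percentile(col, 1.0 - alpha)))
    return rows

# Synthetic "observations" and a fitted exponential model as a stand-in.
data_rng = random.Random(3)
observed = [data_rng.expovariate(1 / 10.0) for _ in range(200)]
scale_hat = sum(observed) / len(observed)
rows = bootstrap_qq(observed, lambda rng: rng.expovariate(1.0 / scale_hat))

for obs, med, lo, hi in rows[-3:]:
    print(f"observed {obs:6.1f}  modeled {med:6.1f}  95% CI [{lo:6.1f}, {hi:6.1f}]")
```

Only a sampler for the fitted distribution is needed, which is why the approach suits distributions such as the DM whose quantile function has no analytical form.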

[27] In the quest to achieve this goal, one problem still to be addressed is how to simulate random samples from the DM distribution. Due to its functional complexity, direct random number simulation method is no longer feasible. A simulation approach, introduced by Frigessi et al. [2002], was therefore employed. The step by step procedure and the pseudo code are given in the Appendix.

[28] Figure 4 presents the QQ plots and their 95% confidence intervals for three sample stations, ID11, ID24, and ID44. The figure reveals two main points: (1) similar to commonly used single component distributions, the DM distribution provides an adequate fit for the “bulk” of precipitation amount, and (2) its performance in capturing high values depends on the data. It models the full range of precipitation well at station ID11, but for the other two stations it leads to an overly heavy tail, which may be caused by the distributional property of the data (as at station ID24) or by the low accuracy of the normalization constant (as at station ID44). Although the DM distribution can be a viable choice to model the full spectrum of precipitation, its application is constrained by several problems, such as functional complexity, numerical instability, and computational expense.

Figure 4.

QQ plots of observed versus the DM distribution modeled precipitation quantiles of stations ID11 (left), ID24 (middle), and ID44 (right). The dashed lines represent boundaries of the 95% confidence intervals.

[29] QQ plots of the fitted FK08 distribution corresponding to different thresholds are presented in Figure 5, showing that its performance is determined by the threshold, which should be neither too small nor too large. Too small a threshold (for example, math formula) places too much emphasis on the GP distribution and leads to an overly heavy tail, whereas too large a threshold (say, math formula) places too little emphasis on the GP distribution and results in an underrepresented tail. A suitable threshold (like math formula) does model both low to moderate and extreme values well. The threshold must be selected manually by trial and error, which is laborious and often subject to the preference of the practitioner. Taking math formula and math formula as examples, peaks over threshold (PoT) analysis indicates that both values are reasonable for modeling the exceedances by the GP distribution, as shown in the upper plots of Figure 5. However, the fitted FK08 distributions differ significantly from each other in their performance over the full range.

Figure 5.

QQ plots of observed versus the GP distribution modeled quantiles of exceedances of precipitation over different thresholds (upper) and QQ plots of observed versus the FK08 distribution modeled full range of precipitation (lower) for station ID24.

4. Proposed Hybrid Distribution

[30] The above discussion shows that (1) a single distribution is inadequate to model the full range of daily precipitation and (2) fitting the available compound distributions suffers from functional complexity, numerical instability, supervised learning, and computational demand. For daily precipitation simulation, both the bulk and the tail should be taken into account. At the same time, a computationally efficient model is attractive. To simulate the full range of precipitation, it is desirable to circumvent the threshold selection and reduce model complexity with little or no loss of modeling ability. Taking these considerations into account, this study took a course between the DM and the FK08 distributions and built a hybrid distribution by coupling an exponential distribution and a GP distribution. The presented hybrid distribution has its origin in the one introduced by Carreau and Bengio [2009], where Gaussian and GP distributions were stitched together.

[31] The PDF of the hybrid exponential and GP distribution (HEG) is given as

display math

and the CDF as

display math

The p-quantile function is

display math

The normalization constant math formula assures that the hybrid density is integrated to one over its support and thus is given by

display math

CDFs of exponential and GP distributions are involved in the HEG distribution:

display math
display math

To enforce the continuity of the hybrid density, i.e., math formula, the threshold math formula is defined as the junction point of the exponential and the GP distributions and therefore can be explicitly expressed as a function of scale parameters of the two distributions, i.e.,

display math

Evidently, the number of free parameters reduces to three. Thus, this distribution can be fully represented by math formula. Equation (8g) successfully bypasses the need for an explicit threshold selection. It is cautioned, however, that since only the PDF is forced to be continuous, the fitted density may not be smooth at the junction point, which nevertheless does not significantly influence the simulation and estimation. To remove this lack of smoothness, one can also force the derivative of the density to be continuous at the junction point, at the cost of reduced flexibility of the distribution.
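A minimal sketch of the HEG construction is given below. It assumes one plausible form consistent with the description above: an exponential density with scale beta below the junction point, a GP density with shape xi and scale sigma located at the junction point above it, and a common normalization constant equal to one plus the exponential CDF at the junction. Under this assumption, continuity gives theta = beta ln(sigma / beta), which requires sigma > beta. The class and symbol names are hypothetical, and the exact equations of this paper may be parameterized differently.

```python
import math

class HEG:
    """Hybrid exponential / generalized Pareto distribution (assumed form)."""

    def __init__(self, beta, xi, sigma):
        # beta: exponential scale; xi, sigma: GP shape and scale.
        if not (beta > 0 and xi > 0 and sigma > beta):
            raise ValueError("need beta > 0, xi > 0 and sigma > beta")
        self.beta, self.xi, self.sigma = beta, xi, sigma
        # Continuity at the junction: exp(-theta/beta)/beta = 1/sigma.
        self.theta = beta * math.log(sigma / beta)
        self.f_theta = 1.0 - beta / sigma   # exponential CDF at theta
        self.z = 1.0 + self.f_theta         # normalization constant

    def pdf(self, x):
        if x < 0:
            return 0.0
        if x <= self.theta:
            return math.exp(-x / self.beta) / (self.beta * self.z)
        y = (x - self.theta) / self.sigma
        return (1.0 + self.xi * y) ** (-1.0 / self.xi - 1.0) / (self.sigma * self.z)

    def cdf(self, x):
        if x <= 0:
            return 0.0
        if x <= self.theta:
            return (1.0 - math.exp(-x / self.beta)) / self.z
        y = (x - self.theta) / self.sigma
        gp = 1.0 - (1.0 + self.xi * y) ** (-1.0 / self.xi)
        return (self.f_theta + gp) / self.z

    def quantile(self, p):
        if not 0.0 <= p < 1.0:
            raise ValueError("p must be in [0, 1)")
        s = p * self.z
        if s <= self.f_theta:
            return -self.beta * math.log(1.0 - s)
        tail = 1.0 - (s - self.f_theta)     # GP survival probability
        return self.theta + self.sigma / self.xi * (tail ** (-self.xi) - 1.0)

heg = HEG(beta=10.0, xi=0.15, sigma=60.0)   # arbitrary illustrative values
print("junction point:", round(heg.theta, 2))
print("99.9% quantile:", round(heg.quantile(0.999), 1))
```

Because the quantile function is available in closed form under this assumption, synthetic amounts can be drawn by evaluating `quantile` at uniform random numbers, with no rejection step or numerical integration.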

[32] The HEG distribution adopts an exponential distribution, rather than a gamma distribution, for the low to moderate values. This choice is made mainly for computational simplicity. It is recognized that directly learning the threshold math formula by maximizing the log likelihood is challenging [Frigessi et al., 2002; Carreau and Bengio, 2009]. A feasible approach is to learn it implicitly by expressing math formula as a function of the other parameters [Carreau and Bengio, 2009]. With the exponential distribution as a hybrid component, the threshold math formula can be formulated as a closed-form function of the HEG parameters. It is difficult to do so with the gamma distribution, in which case the threshold would have to be selected in a supervised manner, as by Furrer and Katz [2008]. The HEG distribution aims to offer an efficient and relatively simple option for modeling the full spectrum of precipitation amount. The price is reduced flexibility in modeling the bulk of the data, which is a limitation of the proposed distribution. Fortunately, the performances of the exponential and gamma distributions do not diverge much for low to moderate values.

[33] One may note that in the DM distribution, data below the location parameter math formula are modeled by the gamma distribution and those above it by the GP distribution when math formula is forced to 0. In this case the DM distribution reduces to an analog of HEG, i.e., a hybrid gamma and GP distribution. The problem with this limiting distribution is the discontinuity at the location math formula [Furrer and Katz, 2008]. A discontinuous density function is difficult to interpret in practice and represents an unrealistic feature in precipitation [Vrac and Naveau, 2007]. Removing this discontinuity again runs into the difficulties described above, assuming that one wants to avoid setting the threshold a priori.

[34] Another point worth noting concerns the shape parameter math formula, which determines the tail property of the HEG distribution. Negative, zero, and positive values of math formula imply, respectively, bounded, light, and heavy tails. In the HEG distribution, math formula is forced to be positive, considering that daily precipitation is widely acknowledged to be heavy tailed [Koutsoyiannis, 2004a, 2004b; Vrac and Naveau, 2007; Furrer and Katz, 2008]. The shape parameter math formula can be restricted to positive values in one of two ways: with a constrained optimization algorithm, or by applying an exponential function to map the search space of math formula from the real line onto the positive real line. This study adopts the latter. Moreover, we caution that in the case of zero math formula, the GP distribution reduces to an exponential distribution and equation (8d) no longer holds.
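The exponential mapping can be illustrated with a GP negative log likelihood written in unconstrained coordinates. In this sketch the optimizer would search over eta on the whole real line while the model always sees xi = exp(eta) > 0. Applying the same device to the scale parameter is an added assumption made here for convenience, and the function name and data values are hypothetical.

```python
import math

def gp_negloglik(eta, log_sigma, exceedances):
    """Negative GP log likelihood in unconstrained coordinates."""
    xi = math.exp(eta)            # any real eta gives xi > 0
    sigma = math.exp(log_sigma)   # same device for the scale (assumption)
    return sum(math.log(sigma) + (1.0 / xi + 1.0) * math.log1p(xi * y / sigma)
               for y in exceedances)

# Made-up exceedances over a threshold, for illustration only.
exceedances = [0.4, 1.3, 2.2, 5.0, 11.7, 30.2]

for eta in (-3.0, -1.0, 0.0, 1.0):
    nll = gp_negloglik(eta, math.log(5.0), exceedances)
    print(f"eta = {eta:5.1f}  xi = {math.exp(eta):.3f}  nll = {nll:.3f}")
```

An unconstrained simplex search such as Nelder-Mead can then be applied directly to `(eta, log_sigma)`, and the fitted shape is recovered as `exp(eta)`.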

5. Evaluation of Hybrid Distribution

[35] Since parameters of the hybrid distribution were estimated by the ML method and since the purpose is to model the full range of precipitation, one must answer the following two questions: (1) Is the ML estimator of the hybrid distribution asymptotically consistent and efficient? (2) Does the hybrid distribution perform better than or at least comparable to other distributions?

5.1. Asymptotic Property of the ML Estimator

[36] To answer the first question, random sample sets of increasing size were generated from the HEG distribution with parameters math formula, parameter estimates math formula were computed using the ML method for each sample set, and the asymptotic behavior of the ML estimator was empirically investigated. The sample size was increased by a varying factor so that the asymptotic behavior was exhibited efficiently. For each sample size, the random sampling and parameter estimation procedures were repeated 100 times to account for statistical variability.

5.1.1. Asymptotic Consistency of the ML Estimator

[37] The mean square error (MSE) between estimated and actual parameters was used to assess asymptotic consistency. If the MSE decreased to zero as the sample size approached infinity, then the estimator was said to be asymptotically consistent. The MSE values of the ML estimates for each parameter are shown as bar plots against increasing sample size in Figure 6, which indicates that the MSE values for all parameters become smaller as the sample size increases. From this decreasing pattern, one can expect that as the sample size increases to a sufficiently large value the MSE decays to zero or at least to a negligible value, which is what asymptotic consistency of the ML estimators requires.

Figure 6.

Behaviors of MSE of the ML estimators of the HEG distribution (parameterized by math formula) as the sample size increases.
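The structure of this Monte Carlo check can be sketched as follows. Because fitting the HEG distribution requires a numerical optimizer, the sketch uses an exponential distribution as a stand-in, for which the ML estimate of the scale is simply the sample mean. The sample sizes, true scale, and function name are illustrative assumptions; only the 100 repetitions follow the text.

```python
import random

def mc_mse_and_variance(true_scale, n, repeats, rng):
    """MSE and variance of the ML scale estimate over repeated samples."""
    estimates = []
    for _ in range(repeats):
        sample = [rng.expovariate(1.0 / true_scale) for _ in range(n)]
        estimates.append(sum(sample) / n)   # exponential ML estimate
    mean_est = sum(estimates) / repeats
    mse = sum((e - true_scale) ** 2 for e in estimates) / repeats
    var = sum((e - mean_est) ** 2 for e in estimates) / repeats
    return mse, var

rng = random.Random(42)
true_scale = 10.0
sizes = [100, 1000, 10000]
results = {n: mc_mse_and_variance(true_scale, n, 100, rng) for n in sizes}

for n in sizes:
    mse, var = results[n]
    print(f"n = {n:6d}  MSE = {mse:.5f}  variance = {var:.5f}")
```

For this stand-in the MSE is expected to shrink roughly in proportion to 1/n, and the same loop applies to the HEG distribution once the sample mean is replaced by a numerical ML fit.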

5.1.2. Asymptotic Efficiency of the ML Estimator

[38] The variance of estimates was used to indicate the asymptotically efficient behavior. If the variance of estimates decayed to zero as the sample size tended to infinity, then the estimator was considered asymptotically efficient. The distributions of parameters estimated by the ML method are shown by box plots in Figure 7. The span of the box plot shows a clear decreasing trend as the sample size increases. Note that when the sample size is large enough the estimated parameters are clustered within a very narrow interval centered or almost centered at the true values. One can also expect that the interval will become much narrower and even negligible as the sample size grows sufficiently large. This observation only roughly indicates the asymptotic efficiency of the ML estimator, since asymptotic efficiency concerns not only whether the variance of estimates decays to zero but also the rate at which it does so.

Figure 7.

Distributions of the ML estimators of the HEG distribution (parameterized by math formula) as the sample size increases.

5.2. Preliminary Evaluation of Performance in Modeling Heavy Tail Data

[39] To answer the second question, the same scheme was used as in section 5.1. The parent distribution, however, was changed to an FK08 distribution parameterized as math formula, which was obtained by fitting the model to the data from station ID24. The simulation experiment was conducted as follows: generate different training sets of increasing size and a test set with a fixed size of 1000 from the parent FK08 distribution; fit the HEG distribution and other widely used candidates to each training set; compute the estimated probability densities for elements in the test set; and then compare them with the actual values, which can be calculated directly from the PDF of the FK08 distribution. Four candidate models were chosen: the HEG distribution, the ME distribution, the DM distribution, and the nonparametric KDE with Gaussian kernel.

[40] As in the previous simulation experiment, the size of training sets was increased with varying factor and for each size the above procedure was repeated 50 times. A relatively small number of repeats (50) was used mainly because of the high computation cost of the DM distribution. The FK08 distribution was excluded to avoid selecting the threshold in a supervised manner.

5.2.1. Evaluation of Performance Using Relative Log Likelihood

[41] To quantitatively assess different candidate models, the relative log likelihood (RLL) was computed as

display math

where math formula and math formula, respectively, are the theoretical and estimated probability densities for elements in the test set. RLL is a measure of the divergence between the parent and the estimated distributions. The smaller the RLL is, the better the density estimator performs [Carreau and Bengio, 2009]. Summary statistics of RLL are listed in Tables 2, 3, 4, and 5, corresponding to the HEG distribution, the ME distribution, the DM distribution, and the KDE, respectively. The mean of RLL is a measure of the goodness-of-fit of the estimated density, whereas the coefficient of variation (CV) is a stability indicator.
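A minimal Python sketch of an RLL-type comparison is given here under stated assumptions. It takes the RLL to be the mean log ratio of the true to the estimated density over the test set; the exact normalization used in the study is not reproduced. A plain exponential distribution stands in for both the parent and the fitted model, and all function names are hypothetical.

```python
import math
import random


def exp_pdf(x, scale):
    """Density of the exponential distribution with the given scale."""
    return math.exp(-x / scale) / scale


def relative_log_likelihood(true_pdf, est_pdf, test_set):
    """Mean log ratio of true to estimated density over the test set.

    Assumed form; using a sum instead of a mean would only rescale the value.
    Returns inf when the estimated density is zero at any test point.
    """
    total = 0.0
    for x in test_set:
        f_est = est_pdf(x)
        if f_est <= 0.0:
            return math.inf
        total += math.log(true_pdf(x) / f_est)
    return total / len(test_set)


rng = random.Random(42)
true_scale = 8.0  # hypothetical parent parameter
test_set = [rng.expovariate(1.0 / true_scale) for _ in range(1000)]
for n in (100, 1000, 10000):
    train = [rng.expovariate(1.0 / true_scale) for _ in range(n)]
    fitted_scale = sum(train) / n  # ML fit of the stand-in model
    rll = relative_log_likelihood(
        lambda x: exp_pdf(x, true_scale),
        lambda x: exp_pdf(x, fitted_scale),
        test_set,
    )
    print(n, rll)
```

The infinite return value mirrors the failure mode described for the KDE, where estimated densities that are far too small in the upper tail drive the RLL toward infinity.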

Table 2. Summary Statistics of RLL for the HEG Distribution With Different Training Set Sizes n
n (×1000) | Failures | Min. | Mean | Max. | Cor. Coef.
Table 3. Summary Statistics of RLL for the ME Distribution With Different Training Set Sizes n
n (×1000) | Failures | Min. | Mean | Max. | Cor. Coef.
Table 4. Summary Statistics of RLL for the DM Distribution With Different Training Set Sizes n
n (×1000) | Failures | Min. | Mean | Max. | Cor. Coef.
Table 5. Summary Statistics of RLL for the Gaussian KDE With Different Training Set Sizes n
n (×1000) | Failures | Min. | Mean | Max. | Cor. Coef.

[42] The number of failures for each model is also summarized in these tables. The DM distribution failed because of numerical instability caused by an inaccurate, and sometimes invalid, normalization constant. The resulting density was then meaningless, for example, a complex value arising from a complex normalization constant, especially when the sample was small. The KDE failed when the estimated densities for large quantiles were so small compared with the true values that the RLL tended to infinity.

[43] Table 4 shows that the number of failures for the DM distribution was large, especially for small samples. When the sample size was less than 2000, the failure rate varied from 26% to 40%. The high failure rate indicates the existence of numerical problems due to the functional complexity of the distribution. Failures of the KDE appeared mainly when the sample size was large, since more large values then emerge in the sample. This observation confirmed that the KDE was reliable only for modeling low to moderate values and was unable to appropriately simulate the upper tail behavior [Markovich, 2007; Carreau and Bengio, 2009].

[44] Based on the mean of RLL without considering the failure rate, the DM distribution ranked first, the HEG distribution second, the KDE third, and the ME distribution fourth. In terms of stability, as indicated by the CV values, the HEG distribution performed the best. The CV values for the HEG distribution were smaller than those of the DM distribution, except when the sample size was less than 2000. These exceptions might be due to the high failure rate of the DM distribution: only 30 and 37 of the 50 repeats entered the computation of CV, which makes the resulting values questionable and potentially misleading.

[45] From the viewpoint of minimum RLL, the DM distribution seemed to be the best choice for heavy tailed data. It should be noted, however, that the DM model has six free parameters, whereas the HEG distribution has only three. Thus, it is argued that the small RLL is probably caused by overfitting.

5.2.2. Evaluation of Performance Using Information Criterion

[46] To further evaluate the distributions, the Akaike information criterion (AIC) [Akaike, 1974] and Bayesian information criterion (BIC) [Schwarz, 1978] were used. The information criteria take into account not only the goodness-of-fit but also the model complexity, by penalizing distributions with too many parameters. The smaller the AIC or BIC, the better the distribution.

[47] In this evaluation, the nonparametric KDE was excluded since it is broadly accepted that KDE is not suitable for heavy-tailed density estimation, which was also empirically verified by the observations in section 5.2.1. The frequencies of selection of the three candidate models by AIC and BIC are summarized in Table 6. Results show that the DM distribution is penalized for its functional complexity, and one can conservatively conclude that its small RLL values resulted from overfitting. For small samples, the ME distribution also performed well. This is not difficult to understand, because the majority of data in small samples come from the bulk of the distribution. As the sample size increased, more and more large quantiles were sampled and the inability of the ME distribution to model large values became apparent. Both the AIC and BIC criteria indicated the HEG distribution to be the best choice.
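As a small worked illustration, the two criteria can be computed from a maximized log likelihood with the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of free parameters and n the sample size. The log likelihood values in this Python sketch are hypothetical and are not results from this study.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2.0 * n_params - 2.0 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# Hypothetical maximized log likelihoods for a three-parameter and a
# six-parameter candidate fitted to the same 1000 observations.
candidates = {
    "three-parameter model": (-2510.0, 3),
    "six-parameter model": (-2508.5, 6),
}
n_obs = 1000
for name, (loglik, k) in candidates.items():
    print(name, aic(loglik, k), bic(loglik, k, n_obs))
```

In this made-up case the six-parameter model attains the larger likelihood, yet both criteria favor the three-parameter model because the gain in fit does not offset the penalty for the extra parameters.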

Table 6. Frequencies of Selections of the Three Parametric Candidate Models by AIC and BIC Obtained From Different Simulations
n (×1000) | HEG | ME | DM

5.3. Further Evaluation Using Precipitation Amount Records

[48] Does the preliminary simulation show that the HEG distribution is the best choice to model daily precipitation amount? This question cannot be answered with a simple yes or no, at least at the current stage, because of the limited scope of the simulation study and the diverse distributional patterns of real precipitation, and also because the parameter estimation method influences the performance of the distribution. In section 5.3, different distributions were fitted to daily precipitation records from the 49 stations in order to examine (1) whether one can expect the HEG distribution to be widely used to model daily precipitation amount; and (2) whether there are any exceptions where this distribution does not fit the data well.

[49] The tail of the ME distribution was too light to capture extreme precipitation, as shown in the left column of Figure 8. Although QQ plots for only three sample stations are shown here, this conclusion is valid for all other stations. Therefore, the remaining discussion focuses only on the two compound distributions. To quantitatively measure the goodness-of-fit, the average distance (AD) from the scattered points to the 1:1 reference line in the QQ plot was used, i.e.,

display math

Results showed that for only 11 of the 49 stations were the AD values of the HEG distribution greater than those of the DM distribution, even though the DM distribution has six free parameters rather than the three of the HEG distribution. As indicated by the QQ plots, the larger AD values associated with the DM distribution are mainly caused by its overly heavy tail.
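One plausible reading of the AD measure is sketched here in Python: the perpendicular distance of a QQ point (x, q) from the 1:1 line is |x − q|/√2, and AD is taken as the mean of these distances. The plotting positions i/(n + 1), the function names, and the exponential quantile function used for illustration are assumptions of this sketch and are not taken from the study.

```python
import math


def average_distance(observed, model_quantile):
    """Mean perpendicular distance of QQ points from the 1:1 line.

    observed: sample values; model_quantile: function p -> modeled quantile.
    The plotting positions i / (n + 1) are an assumption of this sketch.
    """
    xs = sorted(observed)
    n = len(xs)
    total = 0.0
    for i, x in enumerate(xs, start=1):
        q = model_quantile(i / (n + 1))
        total += abs(x - q) / math.sqrt(2.0)
    return total / n


def exp_quantile(p, scale):
    """Quantile function of the exponential distribution (stand-in model)."""
    return -scale * math.log(1.0 - p)


# Hypothetical nonzero daily amounts, compared with two candidate scales.
amounts = [0.5, 1.2, 2.0, 3.3, 4.1, 6.8, 9.5, 14.0, 22.0, 41.0]
for scale in (5.0, 10.0):
    print(scale, average_distance(amounts, lambda p: exp_quantile(p, scale)))
```

A smaller value indicates that the modeled quantiles lie closer to the observed ones over the whole range, so a few poorly fitted extremes can dominate the measure.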

Figure 8.

Representative patterns of the QQ plots of observed versus the ME distribution (left), the DM distribution (middle), and the HEG distribution (right) modeled quantiles of precipitation.

[50] For further investigation, Figure 8 presents three representative patterns of QQ plots. The first pattern is shown in the top row, where the HEG distribution performs better than the DM distribution. In this case, the unsatisfactory goodness-of-fit of the DM distribution resulted mainly from an inaccurate normalization constant. The second pattern is shown in the middle row, where both compound distributions fitted the data well. The DM distribution performed slightly better than the HEG distribution with respect to minimum AD and maximum likelihood. This observation is expected, considering that the DM distribution has twice as many parameters. In this case, the AIC and BIC values were also computed. Neither of the two criteria indicated overfitting, which agrees with the results of Vrac and Naveau [2007]. It seems wise to choose the DM distribution if its computational expense can be neglected. The presented HEG distribution is not intended, however, to replace the DM distribution but rather (1) to complement it in situations where it may have problems, as at stations ID7 and ID44; and (2) to provide a relatively efficient, reliable, and simple way to model the full range of precipitation without losing goodness-of-fit.

[51] The third pattern, shown in the bottom row of Figure 8, signifies one disappointing fact: there are some situations in which both the DM distribution and the HEG distribution fail to capture extreme values. Noisy data, characterized by a few extremely large values appearing far away from the bulk of observations, are one of the reasons for this problem. Another probable reason is the use of a suboptimal parameter estimation method, if one recalls that the upper tails of both distributions are dominated by the GP distribution, which is outside of the exponential family. The ML estimator performs well only when the random variable belongs to the exponential family [Castillo and Hadi, 1995]. The following empirical study provides justification for this possible explanation.

6. Other Parameter Estimation Approaches

[52] If the ML estimator of the HEG distribution becomes invalid, how can one proceed? This is a common problem when working with the GP distribution, which is the tail component of the HEG distribution. Properly fitting the GP distribution has been addressed by several researchers [Hosking et al., 1985; Castillo and Hadi, 1995; Singh and Guo, 1995, 1997; Castillo et al., 2005; Luceno, 2006; Brazauskas and Kleefeld, 2009]. Two representative approaches are the elemental percentile (EP) method [Castillo and Hadi, 1995] and the maximum goodness-of-fit (MGF) method [Luceno, 2006]. The EP method is not an efficient option for the HEG distribution because it is cumbersome to express the model parameters as functions of selected sample data and their empirical percentiles.

6.1. Two-Step Quantile Least Squares Method

[53] A two-step quantile least squares (QLS) method is described here. In the first step a number of samples are drawn with replacement from the original set and parameters are estimated based on the QLS method. For convenience, we denote these estimated parameter vectors as math formula, math formula, … , math formula, where r is the number of samples. The QLS estimates can be obtained as

display math

where math formula and math formula are the observed and estimated ith order statistics, respectively. Since there is an explicit formula for the HEG quantile function, math formula can be computed from equation (8c) without resorting to any numerical method. The simple quantile function is another desirable property of the HEG distribution compared with the DM distribution, because it allows (1) straightforward random number simulation and (2) multiple options for parameter estimation.

[54] Since the elemental QLS estimator is sensitive to samples, the second step is to obtain a final robust estimator by applying a robust function, for example the median (MED). Thus the final estimator of math formula can be expressed as

display math

Goodness-of-fit analysis showed that QLS is a good alternative to the ML method when the latter has problems. However, the two-step QLS is a bootstrap based method and is therefore computationally expensive. This erodes the computational efficiency advantage of the HEG distribution and thus works against one of our major goals, i.e., to search for an efficient way to simulate the full spectrum of precipitation.
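The two-step structure can be sketched in Python as follows. Since the HEG quantile function of equation (8c) is not restated in this section, the sketch assumes an exponential quantile function as a stand-in, for which the least squares fit has a closed form; the plotting positions i/(n + 1), the function names, and the number of resamples are illustrative assumptions.

```python
import math
import random
import statistics


def qls_scale_exponential(sample):
    """Least squares fit of exponential quantiles to the order statistics.

    Minimizes sum_i (x_(i) - scale * q_i)^2 with q_i = -log(1 - p_i),
    which has the closed form scale = sum(x_(i) * q_i) / sum(q_i^2).
    """
    xs = sorted(sample)
    n = len(xs)
    qs = [-math.log(1.0 - i / (n + 1)) for i in range(1, n + 1)]
    return sum(x * q for x, q in zip(xs, qs)) / sum(q * q for q in qs)


def two_step_qls(sample, n_resamples, rng):
    """Step 1: QLS on bootstrap resamples. Step 2: median of the estimates."""
    estimates = []
    for _ in range(n_resamples):
        resample = [rng.choice(sample) for _ in range(len(sample))]
        estimates.append(qls_scale_exponential(resample))
    return statistics.median(estimates)


rng = random.Random(7)
data = [rng.expovariate(1.0 / 6.0) for _ in range(300)]  # hypothetical sample
print(two_step_qls(data, n_resamples=200, rng=rng))
```

Each resample requires a full refit, so the cost grows linearly with the number of resamples; for a model without a closed-form least squares solution every refit would itself be an iterative optimization, which is the computational burden noted above.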

6.2. Maximum Goodness-of-Fit Method

[55] The maximum goodness-of-fit (MGF) method is another alternative. The basic idea for the MGF method is to obtain parameter estimates by maximizing the goodness-of-fit of the distribution. The right-tail Anderson-Darling (RAD) statistic:

display math

is an efficient and robust choice for the HEG distribution. Placing more emphasis on the right tail of the distribution is reasonable for two reasons. First, the bulk of the HEG distribution is the exponential distribution, which is easily fitted. Second, the tail is the GP distribution, and the ML method may have problems fitting this part. In addition, the QQ plots in Figure 8 also show that the lack of fit always occurred in the tail.

[56] The MGF method was used to fit each precipitation set. Equation (13) was optimized with the MATLAB function fminsearch. AD values were computed and compared with those obtained from the ML method, as given in Table 7. The AD values decreased significantly after the MGF method was applied, with relative decreases ranging from 2.86% to 67.53%. The only exception was station ID33, where the AD obtained by the ML method was smaller. The improvement in goodness-of-fit can be seen more easily in Figure 9. These significant improvements empirically confirmed the aforementioned inference that the failure of the HEG distribution to capture extremes was caused by the suboptimal parameter estimation method. Using a goodness-of-fit statistic (AD) for comparison is unfair to the ML method, because maximizing goodness-of-fit is the objective of the MGF method. The purpose here, however, is to solve the problem of “how to proceed” when the ML method has problems. The MGF method is not intended to replace the ML method but rather to complement it in troublesome situations.
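A minimal Python sketch of the MGF idea is given here under stated assumptions. It uses the computational form of the right-tail Anderson-Darling statistic reported by Luceno [2006], R = n/2 − 2 Σ z_i − (1/n) Σ (2i − 1) ln(1 − z_(n+1−i)), with z_i the fitted CDF at the ith order statistic, and minimizes it for the scale of an exponential stand-in model rather than for the HEG distribution. A golden-section search takes the place of fminsearch, which is adequate only because this stand-in has a single parameter; all function names are hypothetical.

```python
import math
import random


def right_tail_ad(sorted_x, survival):
    """Right-tail Anderson-Darling statistic for a fitted survival function.

    survival(x) returns 1 - F(x); working with it directly avoids
    log(0) caused by rounding F(x) to 1 for large x.
    """
    n = len(sorted_x)
    s = [survival(x) for x in sorted_x]
    sum_cdf = sum(1.0 - si for si in s)
    # s[n - i] is the survival value at the (n + 1 - i)th order statistic.
    sum_log = sum((2 * i - 1) * math.log(s[n - i]) for i in range(1, n + 1))
    return n / 2.0 - 2.0 * sum_cdf - sum_log / n


def golden_section_min(f, lo, hi, tol=1e-6):
    """Minimize a unimodal function f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2.0


def fit_scale_mgf(sample):
    """Exponential scale that minimizes the right-tail AD statistic."""
    xs = sorted(sample)
    mean = sum(xs) / len(xs)

    def objective(scale):
        return right_tail_ad(xs, lambda x: math.exp(-x / scale))

    return golden_section_min(objective, 0.1 * mean, 10.0 * mean)


rng = random.Random(3)
sample = [rng.expovariate(1.0 / 4.0) for _ in range(500)]  # hypothetical data
print(fit_scale_mgf(sample), sum(sample) / len(sample))  # MGF fit vs. ML fit
```

Because the statistic weights discrepancies by 1/(1 − F), the largest observations influence the fit more than they do under ML, which is the intended emphasis on the tail.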

Figure 9.

QQ plots of observed versus the HEG distribution modeled quantiles of precipitation using different model fitting methods.

Table 7. Average Distance (AD) Values for the ML and MGF Fitted Distribution for Each Station
Station ID | AD (MLE) | AD (MGF) | Decrease Percentage (%) | Station ID | AD (MLE) | AD (MGF) | Decrease Percentage (%)

[57] Now one question arises: is the MGF method sensitive to samples, given that it directly maximizes the agreement between the model and the data? To answer this question, a simple simulation experiment was performed. A sample station, ID9, for which the ML estimator was optimal, was selected. The MGF method was used to fit the nonzero observations. Then 500 random sets of fixed size, equal to the number of nonzero observations, were generated from the estimated HEG distribution. Finally, the MGF method was used to fit each set, and its sensitivity was analyzed by assessing the variability of the parameter estimates and quantile estimates. For comparison, the sample sets were also fitted using the ML method.

[58] Results are shown in Figure 10 by box plots. Compared with the ML estimator, the MGF estimator was more sensitive to the sample data, but only moderately so. Another point worth noting is that the means of the parameters estimated by the MGF method are close to the true values. Therefore, it can be concluded that the MGF estimator is close to the ML estimator in the sense of root mean square error, even when the ML method is optimal. True versus estimated quantiles are plotted in Figure 11. The patterns are similar to those of the estimated parameters. The mean values of the estimated quantiles were almost the same as the theoretical values. The quantiles estimated by the MGF method were also more sensitive to samples, but less so than the parameter estimates. Taken together, these observations indicate that the MGF estimator is a reliable alternative to the ML estimator, but with more variance.

Figure 10.

Box plots for HEG parameters fitted to 500 random samples by MLE, MGF, and PMLE methods, respectively. The random samples were drawn from the same parent distribution. True values are highlighted by deep pink lines.

Figure 11.

QQ plots of real and estimated quantiles by HEG distributions fitted to 500 random samples by MLE, MGF, and PMLE methods, respectively. The random samples were drawn from the same parent distribution. Boxes indicate the distribution of each estimated quantile.

6.3. Penalized Maximum Likelihood Method

[59] To take advantage of the lower variance of the ML estimator while reducing the chance that ML converges to an overly large shape parameter, the penalized maximum likelihood (PML) method is an appealing option. The PML method has been used for fitting extreme value distributions [Coles and Dixon, 1999]. The idea of the PML method is to restrict the search space of the shape parameter by applying a penalty function. The adopted penalty function is of the form

display math

As math formula increases from 0 to 1, the penalty function decreases from 1 to 0. Compared with the one recommended by Coles and Dixon [1999], the additional parameter math formula provides more flexible control over the rate of decrease. After numerical experiments, it was found that the combination math formula, math formula, and math formula was suitable for the precipitation records used in this study, as shown in Figure 12. The PML estimates are obtained by applying fminsearch to the penalized likelihood (or log likelihood), which is given by

display math

Figure 12.

Penalty function of the shape parameter math formula used in this study (solid line) and the one from Coles and Dixon [1999] (dashed line).
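For orientation, the reference penalty of Coles and Dixon [1999] can be sketched in Python. The form used here, P(ξ) = exp{−λ[1/(1 − ξ) − 1]^α} for 0 < ξ < 1 with P = 1 for ξ ≤ 0 and P = 0 for ξ ≥ 1, and the default α = λ = 1, are given as we recall them from that paper and should be checked against it; the three-parameter penalty adopted in this study is not reproduced. The function names are hypothetical.

```python
import math


def coles_dixon_penalty(shape, alpha=1.0, lam=1.0):
    """Penalty on the shape parameter: 1 for shape <= 0, 0 for shape >= 1."""
    if shape <= 0.0:
        return 1.0
    if shape >= 1.0:
        return 0.0
    return math.exp(-lam * (1.0 / (1.0 - shape) - 1.0) ** alpha)


def penalized_log_likelihood(log_likelihood, shape):
    """Log of (likelihood times penalty); -inf where the penalty vanishes."""
    penalty = coles_dixon_penalty(shape)
    if penalty == 0.0:
        return -math.inf
    return log_likelihood + math.log(penalty)


for shape in (-0.2, 0.0, 0.25, 0.5, 0.75, 0.95):
    print(shape, round(coles_dixon_penalty(shape), 4))
```

Multiplying the likelihood by such a penalty leaves fits with nonpositive shape untouched and increasingly discourages shape values approaching 1, which is how the search space of the shape parameter is restricted.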

[60] QQ plots of the PML fitted HEG distributions for the sample stations are included in Figure 9. One can see that PML can adequately fit the precipitation records. The only exception among the 49 series appeared at station ID14, where the PML estimator performed worst among the three estimators. The same simulation experiment as for the sensitivity analysis of the MGF estimator was performed to investigate the influence of samples on the PML estimator. As expected, the PML estimator was less sensitive to samples than the MGF estimator and was comparable to the ML estimator, as presented in Figure 10. Similar observations were made for the quantile estimates, as shown in Figure 11. It is cautioned, however, that the reduced variance of the PML estimator is obtained at the expense of a negatively biased shape parameter estimate, as was also remarked by Coles and Dixon [1999]. The negatively biased shape parameter will in turn lead to positively biased scale and location parameters and negatively biased extreme quantiles. In terms of bias and variance, the PML estimator appears to be at least competitive with the other three and can be used as an alternative to the ML estimator.

[61] The four estimation methods described are all feasible options for fitting the HEG distribution; nevertheless, none of them has been shown to be better than the others in every respect and for every data set. The choice of method involves a tradeoff between bias and variance. In the practice of precipitation simulation, we do not promote one estimator over another, but note that both MGF and PML are reliable estimators in that they can generally provide adequate goodness-of-fit. Last but not least, we want to point out that the bootstrap based two-step estimating framework, as used in the two-step QLS estimator, might be used to reduce the variance of the MGF estimator, provided that computational efficiency is of lesser importance. In this sense, the MGF method is preferred.

7. Conclusions

[62] This paper first examines the performance of existing distributions in modeling daily precipitation. Commonly used single component distributions cannot realistically model extreme rainfall events in many cases, but compound distributions can do so adequately. However, the fitting of these compound distributions is plagued by several drawbacks, exemplified by functional complexity, numerical instability, data sensitivity, and supervised learning. In view of these drawbacks, we present a hybrid exponential and generalized Pareto distribution. This distribution is tested on 49 records across Texas. Results show that it is relatively simple and reliable in modeling the full spectrum of precipitation. By expressing the threshold parameter of the generalized Pareto component in the hybrid model as a function of the other parameters, the threshold can be implicitly learned in an unsupervised manner. Therefore the difficulty of threshold selection encountered in conventional peaks over threshold analysis can be circumvented. Moreover, owing to its functional simplicity, random numbers from the hybrid distribution can be easily simulated by entering uniform random variates into its quantile function, which can be explicitly expressed.

[63] Parameters of the hybrid exponential and generalized Pareto distribution can be estimated using different methods, for instance the maximum likelihood, two-step quantile least squares, maximum goodness-of-fit, and penalized maximum likelihood methods. In most cases, the maximum likelihood estimator is an optimal choice because of its desirable properties, such as statistical consistency and efficiency. The other three alternatives can be decent remedies when the maximum likelihood method is troublesome due to a few extremely large values appearing far away from the bulk of observations. As to which method should be used in practice, we do not promote one over another, but note that both the maximum goodness-of-fit and the penalized maximum likelihood methods are reliable in that they can generally provide adequate goodness-of-fit for both the “bulk” and the “tail” of the data. The proposed hybrid exponential and generalized Pareto distribution might be incorporated into stochastic weather generators to provide an efficient way to realistically simulate and downscale the full spectrum of daily precipitation. To ease the calculation effort and to reproduce the results reported in this study, a suite of MATLAB functions has been developed.

[64] Several important issues recognized in the rainfall modeling community need more research in the future. One is to study the spatial distribution of the model parameters across a region so as to regionalize the model and make it applicable at any location in the area. This issue is expected to be addressed through an approach inspired by Wilks [2008, 2009] and Kleiber et al. [2011]. In addition, all the reported analysis was limited to Texas. Evaluating the applicability and performance of the hybrid exponential and generalized Pareto distribution in diverse areas worldwide is another important task that merits further effort.


[65] Following Frigessi et al. [2002], the random numbers from the DM distribution can be sampled as follows:

[66] 1. Draw a uniform variate math formula from math formula.

[67] 2. If math formula, then draw a random variate x from the Weibull distribution and evaluate the mixing function math formula at x; retain x with probability math formula or reject it with probability math formula. If x is rejected, resample.

[68] 3. If math formula, then draw a random variate x from the Pareto distribution and evaluate the mixing function math formula at x; retain x with probability math formula or reject it with probability math formula. If x is rejected, resample.

[69] 4. Repeat steps 1 to 3 until the desired number of random variates has been sampled.

[70] The pseudo code for the sampling procedure, written following MATLAB conventions, is shown in Table A1.

Table A1. Pseudo Code for Random Numbers Simulation of the DM Distribution
Set a, b, μ, τ, κ, σ
while true
 u ← sample from uniform U[0,1]
 if 0.5 − u > eps
  x ← sample from gamma distribution parameterized by a and b
  p ← mixing function evaluated at x
  r ← sample from uniform U[0,1]
  if 1 − p − r > eps
   retain x and break
  end
 else
  x ← sample from Pareto distribution parameterized by κ and σ
  p ← mixing function evaluated at x
  r ← sample from uniform U[0,1]
  if p − r > eps
   retain x and break
  end
 end
end
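The accept/reject procedure of steps 1 to 4 can be sketched in Python under several assumptions. The bulk component is taken to be a Weibull distribution as in step 2 (Table A1 names a gamma distribution; the accept/reject structure is the same for either), the tail component is a generalized Pareto distribution with zero location drawn by inverse transform, and the mixing function is assumed to be a Cauchy CDF with location μ and scale τ. The function names and all parameter values are hypothetical.

```python
import math
import random


def cauchy_cdf_mixing(x, mu, tau):
    """Assumed mixing function: Cauchy CDF with location mu and scale tau."""
    return 0.5 + math.atan((x - mu) / tau) / math.pi


def gp_variate(rng, shape, scale):
    """Inverse-transform draw from a generalized Pareto distribution (location 0)."""
    u = rng.random()
    if shape == 0.0:
        return -scale * math.log(1.0 - u)
    return scale / shape * ((1.0 - u) ** (-shape) - 1.0)


def sample_dm(rng, n, wb_scale, wb_shape, mu, tau, gp_shape, gp_scale):
    """Draw n variates from the dynamic mixture by accept/reject."""
    out = []
    while len(out) < n:
        if rng.random() < 0.5:
            # Bulk candidate, retained with probability 1 - p(x).
            x = rng.weibullvariate(wb_scale, wb_shape)
            accept_prob = 1.0 - cauchy_cdf_mixing(x, mu, tau)
        else:
            # Tail candidate, retained with probability p(x).
            x = gp_variate(rng, gp_shape, gp_scale)
            accept_prob = cauchy_cdf_mixing(x, mu, tau)
        if rng.random() < accept_prob:
            out.append(x)
    return out


rng = random.Random(2012)
draws = sample_dm(rng, 1000, wb_scale=5.0, wb_shape=0.8,
                  mu=20.0, tau=3.0, gp_shape=0.2, gp_scale=8.0)
print(len(draws), min(draws), max(draws))
```

Because candidates from the two components are proposed with equal probability and retained with probabilities 1 − p(x) and p(x), the accepted draws follow a density proportional to the mixture of the two component densities weighted by the mixing function, without the normalization constant ever being computed.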


[71] This work was financially supported in part by the United States Geological Survey (USGS, Project ID: 2009TX334G) and TWRI through the project ‘Hydrological Drought Characterization for TX under Climate Change, with Implications for Water Resources Planning and Management.’ The constructive comments raised by the anonymous reviewers and the associate editor are gratefully acknowledged.