Two Tweedie distributions that are near-optimal for modelling monthly rainfall in Australia



Statistical models for total monthly rainfall used for forecasting, risk management and agricultural simulations are usually based on gamma distributions and variations. In this study, we examine a family of distributions (called the Tweedie family of distributions) to determine if the choice of the gamma distribution is optimal within the family. We restrict ourselves to the exponential family of distributions as they are the response distributions used for generalized linear models (GLMs), a framework with numerous advantages. Further, we restrict ourselves to distributions where the variance is proportional to some power of the mean, as these distributions also have desirable properties. Under these restrictions, an infinite number of distributions exist for modelling positive continuous data and include the gamma distribution as a special case. Results show that for positive monthly rainfall totals in the data history for a particular station, monthly rainfall is optimally or near-optimally modelled using the gamma distribution by varying the parameters of the gamma distribution; using different distributions for each month cannot improve on this approach. In addition, under the same model restrictions, monthly rainfall totals that include zeros are also well modelled by the same family of distributions. Hence monthly rainfall can be suitably modelled using one of two Tweedie distributions depending on whether exact zeros appear in the rainfall history. We propose a slight variation of the gamma distribution for use in practice. This model fits the data almost as well as the gamma distribution but admits the possibility that future months may have zero rainfall. Copyright © 2010 Royal Meteorological Society

1. Introduction

Rainfall models are important for forecasting and simulation purposes with extended applications in modelling runoff, soil water content and for forecasting drought and flood (Toth et al., 2000; Aubert et al., 2003). Appropriate rainfall models assist in developing better climate-related risk management and decision-making capabilities. For modelling purposes, two different aspects of rainfall are common on any given timescale: the occurrence of rainfall and the amount of rainfall (Dunn, 2004). Rainfall models exist for different timescales such as hourly, daily, weekly, monthly, seasonal or annual (Boer et al., 1993; Sharda and Das, 2005; Aksoy, 2006; Tilahun, 2006).

To model the occurrence of daily rainfall, first-order (Gabriel and Neumann, 1962; El-seed, 1987), higher order (Katz, 1977; Deni et al., 2009), hybrid (Wilks, 1999) and hidden (Robertson et al., 2003) Markov chain models have been used. First-order Markov chain models assume that the occurrence of rainfall on a day depends on the occurrence of rainfall on the previous day. Higher order Markov chains assume that the occurrence of rainfall on a day depends on the occurrence of rainfall on two or more previous days. Higher order Markov models are more complex, but perform marginally better (Deni et al., 2009). Hybrid Markov chains consider different orders for wet and dry days while hidden Markov chains consider some hidden states. Chandler (2005) used logistic regression to model dry or wet days as a function of site altitude, North Atlantic Oscillation, seasonality and autocorrelation (indicators for rain on each of the previous 5 days, plus persistence indicators for rain on both of the previous 2 days and on all of the previous 7 days).

Sometimes, modelling the amount of rainfall is more important than modelling the occurrence of rainfall. Using rainfall data from New South Wales, Boer et al. (1993) used linear regression to model the amount of seasonal and annual rainfall as a function of longitude, latitude and altitude of the stations. Chowdhury and Sharma (2007) used a linear regression model to quantify the effect of the El Niño Southern Oscillation on the amount of monthly rainfall. Considering nonlinear effects of some covariates on monthly rainfall amount, Zaw and Naing (2008) used polynomial regression to model the amount of monthly rainfall in Myanmar. One of the basic assumptions of the above-mentioned models is that the amount of rainfall is normally distributed with constant variance. For some stations, the amounts of rainfall on some timescales (e.g. annual) approximately follow a normal distribution, in which case the use of normal distributions is appropriate. However, the amount of monthly, weekly or daily rainfall usually does not follow a normal distribution and is right skewed, and so alternative distributions are needed to model the amount of rainfall on shorter timescales.

To model the right-skewed daily rainfall amounts, distributions that have been employed include the gamma (Aksoy, 2006), truncated gamma (Das, 1955), kappa (Meilke, 1973), generalized log-normal (Swift and Schreuder, 1981), mixed exponential (Chapman, 1997; Wilks, 1998, 1999) and mixed gamma (Jamaludin and Jemain, 2008). Jamaludin and Jemain (2008) used exponential, gamma, mixed exponential and mixed gamma distributions to describe the daily rainfall amount in Malaysia, and based on the Akaike Information Criterion (AIC), they showed that the mixture distributions are better than single distributions for describing the amount of daily rainfall.

Comparing log-normal, gamma, Weibull and log-logistic distributions on the non-zero weekly rainfall data from Dehradun, India, Sharda and Das (2005) showed that the Weibull distribution fits best (on the basis of the Anderson-Darling test). Using data from 29 stations, Sen and Eljadid (1999) showed that, for monthly rainfall amounts, the gamma distribution fits well. Compared with other Pearsonian distributions (Pearsonian I and Pearsonian IX), the gamma distribution fits best for modelling the amount of monthly rainfall in the Asian summer monsoon (Mooley, 1973). Tilahun (2006) compared five different distributions (normal, log-normal, gamma, Weibull and Gumbel) for modelling the amount of rainfall in wet months at eight rainfall stations in Ethiopia and found that none was optimal for every station.

Another alternative for modelling right-skewed rainfall amounts is to use distributions from a special family called the exponential dispersion model (EDM) family of distributions (Jorgensen, 1997). The EDM family of distributions are the response distributions for generalized linear models (GLMs) (McCullagh and Nelder, 1989) and include common distributions such as the binomial, Poisson, gamma and normal distributions. The models are widely used as the GLM framework is already in place for fitting models based on the EDM family of distributions and for diagnostic testing. In addition, covariates are easily incorporated into the modelling procedure (Jorgensen, 1987). GLMs have been used for fitting models to climatological data such as rainfall by numerous researchers (Coe and Stern, 1982; Wilks, 1999; Chandler, 2005).

The common models used in modelling monthly, weekly or daily rainfall amount have difficulty with the mixture of discrete (exact zero when no rainfall is recorded) and continuous (rainfall amount when non-zero rainfall is recorded) data. To overcome the difficulty, some authors used logistic regression (Chandler and Wheater, 2002) or Markov chains (Richardson and Wright, 1984; Stern and Coe, 1984; Laux et al., 2009) to model the occurrence of wet or dry days, then gamma distributions to model the amount of rainfall on wet days. For example, Das et al. (2006) used Markov chains for rainfall occurrence and a gamma distribution to model the amount of weekly rainfall in Bihar, India.

An alternative approach was adopted by Glasbey and Nevison (1997), who applied a monotonic transformation of rainfall data to define a latent Gaussian variable with zero rainfall corresponding to censored values below some threshold (1.05 mm). Husak et al. (2007) used a conditional distribution by accumulating probabilities conditional on the presence of rainfall. This is combined with a mixture coefficient used to account for the probability of no rain to create the probability distribution. Yoo et al. (2005) used a mixed gamma distribution for modelling the amount of daily rainfall for both wet and dry periods. The distribution has two parts: one is the probability of having a dry day and the second is the probability of having a wet day multiplied by a gamma distribution describing the amount of rainfall on a wet day. Dunn (2004) used Poisson-gamma distributions to model the occurrence and amount of rainfall simultaneously. The distributions in the Poisson-gamma family belong to the EDM family of distributions (Jorgensen, 1997), upon which the GLMs are based.

Clearly, numerous probability models exist for modelling rainfall over various timescales. Numerous studies have fitted particular distributions to monthly rainfall, using the same distribution for each month but varying the parameters, such as the mean and the variance, for each month (Mooley, 1973; Husak et al., 2007; Piantadosi et al., 2009). The amount of rainfall in different months may follow different distributions rather than the same distribution with varying parameters. We explore the possibility that different distributions are needed for each month by considering a broad family of distributions. To do so, we restrict ourselves to the EDM family of distributions as these distributions are the response distributions for GLMs. Further, we consider EDMs where the variance is proportional to some power of the mean (often called the Tweedie family of distributions), as these distributions have properties useful for rainfall modelling (discussed in Section 3). The Tweedie family includes distributions suitable for modelling positive continuous data (such as the gamma) and also for modelling positive continuous data with exact zeros (such as Poisson-gamma).

We first discuss the data (Section 2), and then introduce the Tweedie distributions and their properties (Section 3). The results and discussion (Section 4) are followed by some concluding comments (Section 5).

2. Data

To study the different features of rainfall distribution, the monthly rainfall data from four Australian rainfall stations were taken as case studies (Figure 1) covering the period from 1910 to 2007 and were obtained from the Australian Bureau of Meteorology. Two were dry stations, Bidyadanga and Yoweragabbie, in Western Australia; and two were wet stations, Cowal in Queensland and Clarence in Victoria. Table I shows the summary statistics of the rainfall distribution for the four rainfall stations. The dry stations had high percentages of months with zero rainfall (34.4% and 17.5% of all months for Bidyadanga and Yoweragabbie, respectively). The wet stations, Cowal and Clarence, each had less than 1% of months with no rainfall. As examples, consider the monthly rainfall distribution for Bidyadanga (Figure 2) and Cowal (Figure 3): all monthly rainfall distributions are highly skewed to the right. For Bidyadanga, the distributions are quite different: summer months (December to March) clearly receive larger amounts of rainfall than the other months in general. For Cowal, the rainfall distributions are similar over the different months. Another 98 stations from different parts of Australia were also studied (Figure 1).

Figure 1.

Locations of the stations studied

Figure 2.

Monthly rainfall distribution for Bidyadanga station

Figure 3.

Monthly rainfall distribution for Cowal station

Table I. Summary statistics of the monthly rainfall for Bidyadanga, Yoweragabbie, Cowal and Clarence from 1910 to 2007

Statistic                                Bidyadanga   Yoweragabbie   Cowal   Clarence
Mean (mm)                                42.3         19.9           92.3    89.7
Median (mm)                              2.3          11.4           68.3    66.4
Percentage of months with no rainfall    34.4         17.5           0.9     0.5
IQR(a) (mm)                              46.2         24.9           91.2    85.0
Standard deviation (mm)                  86.4         26.2           90.7    80.1
Coefficient of variation (%)             204.5        131.7          98.3    89.3

(a) IQR, interquartile range.

3. Methodology: Tweedie densities

Figures 2 and 3 show that the monthly rainfall distributions are highly skewed to the right, and hence researchers have used a variety of distributions for modelling monthly rainfall Y, including the log-normal, Weibull, generalized log-normal, gamma and mixed gamma distributions (Mooley, 1973; Sen and Eljadid, 1999; Tilahun, 2006). In all of the above-mentioned literature, one type of distribution is used to model the rainfall amount for all months. As foreshadowed in Section 1, we explore the possibility that different distributions may be required for each month. We do this by embedding the candidate distributions in a broad family, called the EDM family of distributions, whose probability functions have the form

$$f(y; \mu, \phi) = a(y, \phi)\,\exp\left\{\frac{y\theta - \kappa(\theta)}{\phi}\right\} \tag{1}$$

where µ is the mean of the distribution, ϕ > 0 is the dispersion parameter, a(y, ϕ) is a normalizing function, and θ = θ(µ) and κ(θ) are known functions. [Correction added on 4 November 2010 after original online publication: Equation 1 has been corrected.] Since these distributions are the response distributions for GLMs, they are useful for modelling and simulation using the extensive GLM literature and software already in place (McCullagh and Nelder, 1989). For EDMs, the mean is µ = dκ(θ)/dθ and the variance is Var[Y] = ϕ d²κ(θ)/dθ². As µ is a one-to-one function of θ, the second derivative can be written as a function of the mean, V(µ) = d²κ(θ)/dθ², called the variance function, which characterizes the distribution in the class of EDMs.

Within the class of EDMs, we restrict ourselves to those distributions with the variance function V(µ) = µ^p for p ∉ (0, 1), where the index p specifies the particular distribution. These distributions are often called the Tweedie family of distributions (Jorgensen, 1987, 1997; Dunn and Smyth, 2005). While these conditions may appear restrictive, special cases include many popular distributions, including the normal (p = 0), Poisson (p = 1), gamma (p = 2) and inverse Gaussian (p = 3) distributions. Apart from these special cases, the probability functions for the Tweedie distributions have no closed form. For p ≥ 2, the distributions are suitable for modelling positive, right-skewed data. For 1 < p < 2, the Tweedie distributions are suitable for modelling positive continuous data with exact zeros and are sometimes called the Poisson-gamma distributions.
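As a worked check of the variance function, the following sketch writes out θ and κ(θ) for a Tweedie distribution with p ≠ 1, 2 and confirms that they reproduce the mean and variance relations stated for EDMs. The closed forms are the standard ones for this family; they are not quoted from the text above.

```latex
\theta = \frac{\mu^{1-p}}{1-p}, \qquad
\kappa(\theta) = \frac{\mu^{2-p}}{2-p}, \qquad p \neq 1, 2.

% Mean: differentiate both with respect to mu and apply the chain rule.
\frac{d\kappa}{d\theta}
  = \frac{d\kappa/d\mu}{d\theta/d\mu}
  = \frac{\mu^{1-p}}{\mu^{-p}}
  = \mu .

% Variance function: the second derivative is d(mu)/d(theta).
\frac{d^{2}\kappa}{d\theta^{2}}
  = \frac{d\mu}{d\theta}
  = \frac{1}{\mu^{-p}}
  = \mu^{p},
\qquad\text{so}\qquad
\operatorname{Var}[Y] = \phi\,\mu^{p}.
```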

The case 1 < p < 2 deserves special mention. This case corresponds to a Poisson sum of gamma distributions and has an interesting interpretation in the context of rainfall modelling. Assume the amount of rainfall on day i = 1, 2, …, N is R_i, where N is the number of days with non-zero rainfall in the respective month and each R_i follows a gamma distribution. Suppose also that N has an approximately Poisson distribution. Note that there will be months with no rainfall events (when N = 0). The total monthly rainfall Y is the Poisson sum of the gamma random variables, so that Y = R_1 + R_2 + … + R_N, defining Y = 0 when N = 0. The probability function of Y is complicated and cannot be written in a closed form (Dunn and Smyth, 2005).
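The Poisson sum of gamma amounts described above can be simulated directly. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the values µ = 40, ϕ = 8 and p = 1.6 in the demonstration are illustrative assumptions, not estimates from the stations studied. The mapping from (µ, ϕ, p) to the Poisson mean and the gamma shape and scale is the standard one for this family, chosen so that E[Y] = µ and Var[Y] = ϕµ^p.

```python
import math
import random


def poisson_gamma_params(mu, phi, p):
    """Map Tweedie (mu, phi, p), with 1 < p < 2, to Poisson-gamma parameters."""
    if not 1.0 < p < 2.0:
        raise ValueError("the Poisson-gamma form requires 1 < p < 2")
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))   # mean number of rainfall events N
    shape = (2.0 - p) / (p - 1.0)               # gamma shape of each amount R_i
    scale = phi * (p - 1.0) * mu ** (p - 1.0)   # gamma scale of each amount R_i
    return lam, shape, scale


def draw_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the modest lam used here."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_monthly_total(mu, phi, p, rng):
    """Draw one monthly total Y = R_1 + ... + R_N, with Y = 0 when N = 0."""
    lam, shape, scale = poisson_gamma_params(mu, phi, p)
    n_events = draw_poisson(lam, rng)
    if n_events == 0:
        return 0.0
    # A sum of n independent gamma(shape, scale) amounts is gamma(n * shape, scale).
    return rng.gammavariate(n_events * shape, scale)


if __name__ == "__main__":
    rng = random.Random(2010)  # fixed seed so the run is repeatable
    totals = [simulate_monthly_total(40.0, 8.0, 1.6, rng) for _ in range(20000)]
    print("sample mean:", sum(totals) / len(totals))
    print("share of zero months:", sum(t == 0.0 for t in totals) / len(totals))
```

The share of simulated zero months should be close to exp(−λ), the Poisson probability of no rainfall events.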

There are some important properties of the Tweedie distributions that make them particularly appealing for use in rainfall modelling (Dunn, 2004):

  • There is some intuitive appeal for the models, considering total rainfall as a sum of rainfall on smaller timescales (outlined above).
  • These distributions belong to the exponential family of distributions, upon which GLMs are based. Consequently, there is a framework already in place for fitting models based on the Tweedie distributions and for diagnostic testing. In addition, covariates can be incorporated into the modelling procedure.
  • They provide a mechanism for understanding the fine-scale structure in coarse scale data (Dunn, 2004).
  • All exponential dispersion models that are closed with respect to scale transformations are Tweedie models (Jorgensen, 1997; Jiang, 2007); that is, if Y is from a particular Tweedie distribution with index p and c is some constant, then cY is also from the same Tweedie distribution.
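The scale-closure property in the last item can be made concrete. The sketch below assumes the notation Tw_p(µ, ϕ) for a Tweedie distribution with index p, mean µ and dispersion ϕ; this notation is introduced here for illustration only. Rescaling changes the mean and dispersion but not the index, so a model fitted in millimetres keeps the same p if the data are re-expressed in other units.

```latex
Y \sim \mathrm{Tw}_p(\mu, \phi)
\quad\Longrightarrow\quad
cY \sim \mathrm{Tw}_p\!\left(c\mu,\; c^{\,2-p}\phi\right), \qquad c > 0.

% Check against the variance function V(mu) = mu^p:
\operatorname{Var}[cY] = c^{2}\phi\,\mu^{p}
  = \left(c^{\,2-p}\phi\right)(c\mu)^{p}.
```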

While these are important considerations, the Tweedie distributions also fit the data well, in practice, as shown below.

4. Results and discussion

To determine the appropriate Tweedie distribution for a monthly rainfall distribution, the mean–variance relationship defined by the index parameter p must be determined. The mean–variance relationship can be studied informally. For each calendar month, compute the mean and variance of the rainfall totals over all years. Plotting the log of the variance against the log of the mean (Figure 4) shows an approximately linear relationship between the group means and group variances for all four rainfall stations. The variance of the amount of rainfall is clearly not constant but depends on the mean; express the relationship between the mean and variance of the amount of rainfall as log(group variance) = α + p log(group mean). Rearranging,

$$\text{group variance} = \exp(\alpha)\,(\text{group mean})^{p} \tag{2}$$

that is, the approximately linear relationship implies a variance function of the form V(µ) = µ^p, precisely the variance function for the Tweedie distributions. More formally, estimating the index parameter requires sophisticated numerical techniques (Dunn and Smyth, 2005, 2008), as implemented in the tweedie.profile function of the R (R Development Core Team, 2008) package tweedie (Dunn, 2008).
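The informal estimate of p described above is an ordinary least-squares fit on the log scale. The following is a minimal sketch with hypothetical function names. The rainfall values in the demonstration are made up for illustration and are not station data. This is only the informal check, not the likelihood-based method of tweedie.profile.

```python
import math


def group_mean_variance(values):
    """Sample mean and (n - 1)-denominator variance of one month's totals."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance


def fit_log_log_line(means, variances):
    """Least-squares fit of log(variance) = alpha + p * log(mean)."""
    xs = [math.log(m) for m in means]
    ys = [math.log(v) for v in variances]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    p = sxy / sxx                 # slope: informal estimate of the index p
    alpha = y_bar - p * x_bar     # intercept
    return alpha, p


if __name__ == "__main__":
    # Made-up totals (mm) for three calendar months over five years.
    rainfall_by_month = {
        "Jan": [120.0, 15.0, 230.0, 60.0, 5.0],
        "Apr": [30.0, 2.0, 55.0, 12.0, 0.0],
        "Jul": [4.0, 0.0, 9.0, 1.0, 0.0],
    }
    stats = [group_mean_variance(v) for v in rainfall_by_month.values()]
    alpha, p = fit_log_log_line([m for m, _ in stats], [s for _, s in stats])
    print("intercept alpha:", alpha, "slope p:", p)
```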

Figure 4.

Scatterplots showing the mean–variance relationship (measured on log scale) of monthly rainfall for the stations Bidyadanga, Yoweragabbie, Cowal and Clarence from 1910 to 2007, for all months. Each point represents the mean and variance of the amount of rainfall for a single month

The slopes of the lines in mean–variance plots approximately determine the p-indices, and hence the distributions in the Tweedie family. For Cowal and Clarence, the mean–variance relationship is not the same for all months (Figure 5), implying different Tweedie distributions are appropriate for different months. However, for Bidyadanga and Yoweragabbie, the mean–variance relationship is similar for most of the months.

Figure 5.

The mean–variance relationships for the monthly rainfall distributions at Bidyadanga, Yoweragabbie, Cowal and Clarence stations. Each line represents the mean–variance relationship for a different month computed from 1910 to 2007

More formally, estimate p (the estimate is denoted by p̂) using profile maximum likelihood for each month, along with the 95% confidence intervals. This is possible using the function tweedie.profile in the R package tweedie (Dunn, 2008). The results for the four case studies are shown in Figure 6. By embedding the candidate distributions in the Tweedie family, we determine the distribution that is optimal or nearly optimal for modelling the rainfall total for each month. For months in which exact zeros are recorded (Y ≥ 0), necessarily 1 < p < 2. For months with Y > 0, p ≥ 2 would be expected, but this is not required; in a very small number of cases, 1 < p̂ < 2 is observed when Y > 0. For Bidyadanga in January, p̂ ≈ 2, hence a gamma distribution is near-optimal for modelling the amount of January rainfall. Apart from January, the values of p̂ for all months are very similar, and all the confidence intervals contain p = 1.6. The confidence intervals for p for Yoweragabbie contain p = 1.6 for all months. For Cowal, p̂ ≈ 1.6 for the months from June to November. For the other months, the confidence intervals for p span p = 2.0, indicating that the use of the gamma distribution is appropriate. For Clarence, p̂ ≈ 1.6 for February, March, July, August, September and November; for the other months, the gamma distribution is near-optimal.
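The idea of a profile likelihood for p can be sketched on a coarse grid for the case 1 < p < 2. This is not the algorithm used by tweedie.profile. It is a simplified illustration with hypothetical function names that evaluates the Poisson-gamma density by summing over the number of rainfall events, fixes µ at the sample mean (the maximum likelihood estimate for an EDM fitted to a single sample), and maximizes over a user-supplied grid of ϕ values. The approximate 95% interval keeps the grid values of p whose profile log-likelihood lies within 1.92 of the maximum (half the 0.95 quantile of a chi-squared distribution on one degree of freedom).

```python
import math


def tweedie_logpdf(y, mu, phi, p):
    """Log density of a Poisson-gamma Tweedie law (1 < p < 2).

    Returns log P(Y = 0) when y == 0.
    """
    lam = mu ** (2.0 - p) / (phi * (2.0 - p))   # Poisson mean
    shape = (2.0 - p) / (p - 1.0)               # gamma shape per event
    scale = phi * (p - 1.0) * mu ** (p - 1.0)   # gamma scale per event
    if y == 0:
        return -lam
    # Sum over the number of events n >= 1 on the log scale. The terms are
    # unimodal in n, so stop once they fall far below the largest term seen.
    terms, largest = [], -math.inf
    for n in range(1, 10001):
        term = (-lam + n * math.log(lam) - math.lgamma(n + 1)
                + (n * shape - 1.0) * math.log(y) - y / scale
                - n * shape * math.log(scale) - math.lgamma(n * shape))
        terms.append(term)
        if term > largest:
            largest = term
        elif term < largest - 40.0:
            break
    return largest + math.log(sum(math.exp(t - largest) for t in terms))


def profile_loglik(data, p_grid, phi_grid):
    """For each p, maximize the log-likelihood over phi with mu fixed at the mean."""
    mu_hat = sum(data) / len(data)
    profile = []
    for p in p_grid:
        best = max(sum(tweedie_logpdf(y, mu_hat, phi, p) for y in data)
                   for phi in phi_grid)
        profile.append((p, best))
    return profile


def estimate_p(profile, drop=1.92):
    """Grid maximizer of the profile and the range of p within `drop` of the top."""
    p_hat, top = max(profile, key=lambda pair: pair[1])
    inside = [p for p, loglik in profile if loglik >= top - drop]
    return p_hat, min(inside), max(inside)
```

A call such as `estimate_p(profile_loglik(january_totals, p_grid, phi_grid))`, where `january_totals` is a hypothetical list of one month's totals over all years, returns the grid estimate of p and the endpoints of the approximate interval. Finer grids give smoother profiles at greater computational cost.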

Figure 6.

The 95% confidence intervals of p-indices for different months (1 = January, 2 = February and so on) for the stations Bidyadanga, Yoweragabbie, Cowal and Clarence

Quantile residuals were used to assess how well the distributions fit the original data (Dunn and Smyth, 1996). These have an exact standard normal distribution (apart from sampling error) provided that the correct distribution is used. Six sample QQ-plots of the quantile residuals (Figure 7) show, in all cases, that the Tweedie distributions fit the total monthly rainfall well. Other residuals, such as deviance and Pearson residuals, have difficulty with the exact zeros (Dunn and Smyth, 1996).

Figure 7.

Sample QQ-plots of the quantile residuals after fitting Tweedie distributions to monthly rainfall totals for different months for the stations Bidyadanga, Yoweragabbie, Cowal and Clarence. An ideal plot would show the points falling on the solid line, which corresponds to the standard normal distribution

On the basis of the results from the 102 stations studied, the following observations are made. For months with Y ≥ 0, the Tweedie distributions are appropriate with p̂ = 1.6. For months with Y > 0 for all years, the gamma distribution is almost always the optimal or near-optimal choice of distribution within the Tweedie class of distributions; no other Tweedie distribution is a better choice for modelling such data. However, using the gamma distribution explicitly excludes the possibility of any months in the simulated future receiving zero rainfall; this may be unrealistic.

In a small number of cases, months with Y > 0 are optimally modelled using p̂ ≈ 1.6. Very rarely, we found months with Y > 0 where p̂ → 1. In these cases, the profile plots are unhelpful because of an artefact of the data that has been noted by numerous authors (Jorgensen, 1987; Jorgensen and Paes de Souza, 1994; Gilchrist and Drinkwater, 1999) and has been illustrated and described technically by Dunn and Smyth (2005). In these rare cases, we propose using p̂ = 1.6; a QQ-plot based on p̂ = 1.6 shows no problems with the model. Most stations are modelled using p̂ = 1.6 for most months; stations where p̂ = 2 are generally concentrated on the coastline (Figure 8).

Figure 8.

Maps of Australia with black and grey dots representing the value of p for different months for 102 rainfall stations. Black dots represent p ≈ 1.6 and grey dots represent p ≈ 2

5. Conclusions

We have considered the Tweedie family of distributions for modelling total monthly rainfall. These distributions have many desirable properties: they belong to the EDM family and so can be used in the GLM framework with all the advantages this brings; there is intuitive appeal as a Poisson sum of gamma distributions; they provide mechanisms for understanding the fine-scale structure of the data; and they are all closed with respect to scale transformations. Under these conditions, we have shown that the gamma distribution (the Tweedie distribution with p = 2) is almost always the optimal or near-optimal distribution for modelling positive monthly rainfall amounts in Australian stations; no other distribution in the Tweedie class is a better choice. In a small number of cases, p̂ = 1.6 seems appropriate. For months in which some years record exactly zero rainfall, Poisson-gamma distributions are shown to fit well using p̂ = 1.6.

These results mean that the gamma distribution is almost always the optimal distribution among those studied for modelling monthly rainfall totals when Y > 0; in rare cases, p̂ = 1.6 is suitable. Consequently, for most cases, simulation studies can do no better (within the class of distributions studied) than to use the gamma distribution as a basis for simulation for any month of the year with Y > 0 in the available data history.

We propose that simulations instead should be based on a value of p slightly less than 2, say p = 1.99. This has the advantage of still modelling the data well (compare the QQ-plots of quantile residuals in the top panels of Figure 7 with the QQ-plots of Figure 9), yet admitting the possibility of zero rainfall in the simulated future. The values p = 2 and p = 1.99 are both within all of the 95% confidence intervals for p, so either choice is sensible based on the profile likelihood plots.

Figure 9.

The QQ-plots of the quantile residuals after fitting Tweedie distributions to monthly rainfall totals for the months in the upper panel of Figure 7, with p = 1.99
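The reason a value of p slightly below 2 admits zero rainfall can be written out from the Poisson-gamma representation. The sketch below assumes the standard parameterization with mean µ and dispersion ϕ; the symbol λ for the Poisson mean is introduced here and does not appear in the text above.

```latex
P(Y = 0) = \exp(-\lambda),
\qquad
\lambda = \frac{\mu^{2-p}}{\phi\,(2-p)}, \qquad 1 < p < 2.

% At p = 1.99:
\lambda = \frac{100\,\mu^{0.01}}{\phi},
\qquad
P(Y = 0) = \exp\!\left(-\frac{100\,\mu^{0.01}}{\phi}\right) > 0.

% As p approaches 2 from below, lambda grows without bound and
% P(Y = 0) tends to 0, which recovers the gamma case.
```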


The authors acknowledge Dr Leigh Findlay for editorial assistance with the manuscript. The comments of a reviewer are gratefully acknowledged; they improved the flow, interpretation and understanding of the paper.