### ABSTRACT

- Top of page
- ABSTRACT
- 1. Introduction
- 2. Data
- 3. Method
- 4. Results
- 5. Discussion of alternative methodologies
- 6. Conclusions
- Acknowledgements
- References

This paper proposes a method for describing the distribution of observed temperatures on any day of the year such that the distribution and summary statistics of interest derived from the distribution vary smoothly through the year. The method removes the noise inherent in calculating summary statistics directly from the data, thus easing comparisons of distributions and summary statistics between different periods. The method is demonstrated using daily effective temperatures (DET) derived from observations of temperature and wind speed at De Bilt, Holland. Distributions and summary statistics are obtained from 1985 to 2009 and compared to the period 1904–1984. A two-stage process first obtains parameters of a theoretical probability distribution, in this case the generalized extreme value (GEV) distribution, which describes the distribution of DET on any day of the year. Second, linear models describe seasonal variation in the parameters. Model predictions provide parameters of the GEV distribution, and therefore summary statistics, that vary smoothly through the year. There is evidence of an increasing mean temperature, a decrease in the variability in temperatures mainly in the winter, and more positive skew (more warm days) in the summer. In the winter, the 2% point, the value below which 2% of observations are expected to fall, has risen by 1.2 °C; in the summer the 98% point has risen by 0.8 °C. Medians have risen by 1.1 and 0.9 °C in winter and summer, respectively. The method can be used to describe distributions of future climate projections and other climate variables. Further extensions to the methodology are suggested.

### 1. Introduction

In the last IPCC report (Solomon *et al.*, 2007) the evidence for changes in the variability and extremes of temperature, and changes in mean temperatures, was examined. Typically this evidence was based on the comparison of summary statistics derived from the empirical probability distribution function (EPDF) of observed or projected temperatures.

The EPDF of, for example, historical daily mean temperatures on a *specific* day over a particular period of years is obtained from the set of daily mean temperatures that are recorded on that day of the year for each of the years in that period. With daily long-term records of temperature, or with climate projections on a daily time step, it is possible to obtain an EPDF for each day of the year. It is also possible to obtain an EPDF for each month of the year (Barrow and Hulme, 1996), for a set of months (Beniston and Stephenson, 2004), or for the whole year (Ballester *et al.*, 2010) by selecting the set of days to be included in its construction.
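As a minimal sketch of this construction, the Python below groups dated records by calendar day over a chosen span of years and reads an empirical percentile from one day's sample. The function names, the `(date, value)` record layout and the interpolation rule for the percentile are illustrative assumptions, not taken from the paper, and the toy values are made up.

```python
from collections import defaultdict
from datetime import date

def daily_samples(records, first_year, last_year):
    """Group (date, value) records by calendar day for years in [first_year, last_year]."""
    samples = defaultdict(list)
    for day, value in records:
        # Missing values (None) do not contribute to the day's sample.
        if first_year <= day.year <= last_year and value is not None:
            samples[(day.month, day.day)].append(value)
    return dict(samples)

def empirical_quantile(values, p):
    """Empirical p-quantile by linear interpolation between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no observations")
    position = p * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight

# Toy records (made-up values): three years of data for 15 January.
records = [
    (date(2001, 1, 15), -2.0),
    (date(2002, 1, 15), 1.0),
    (date(2003, 1, 15), 4.0),
    (date(2003, 1, 16), None),  # missing value is skipped
]
by_day = daily_samples(records, 2001, 2003)
median_15_jan = empirical_quantile(by_day[(1, 15)], 0.5)
```

Widening the grouping key from a single day to a month or a set of months gives the monthly or seasonal EPDFs mentioned above.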

Summary statistics can be derived for each day of the year from a daily EPDF and seasonal variation is observed in these statistics. Two types of summary statistics are typically derived from the EPDF. The first type characterizes the shape of the whole distribution, for example Ballester *et al.* (2010) summarize the mean, variance and skewness of the annual EPDF. The second type describes characteristics such as the percentiles of the distribution: for example, the temperatures below which 5%, say, of observations fell. For example, Yan *et al.* (2002b) calculated a number of percentiles for each day of the year for the period 1748–1998 in Uppsala and compared these with the same percentage points in Beijing for the period 1915–1997. These types of summary statistics are useful not only for researchers but can also be useful for users of climate data, such as farmers, hydrologists, or energy companies. For climate users in particular, these summary statistics should provide useful information about the climate and be relatively simple to calculate.

There are two difficulties for a climate user, especially when the interest is in daily statistics derived from the daily EPDF. First, summary statistics derived for two consecutive days may differ substantially, so that a sequence of daily summary statistics over the year, such as the 5% point, may demonstrate a large amount of day-to-day variability as well as seasonal variation. Although the observed day-to-day variability is real, it may mask the seasonal pattern. If comparing two locations or periods it may mask the underlying similarity or difference in the climate that these two distributions represent: in effect the signal may be masked by noise. This is especially true when the period of the EPDF is short; that is, not many observations contribute to each EPDF. A second difficulty is that for each new summary statistic that is required, the raw data must be re-examined.

A solution to the first problem is to fit a smooth function to the summary statistics through the year. In essence, the smoothed function ‘borrows’ information from adjacent days to return a set of summary statistics that varies smoothly over the year. For example, Yan *et al.* (2002b) fitted an 11 point binomial filter to the 5% values for each day derived from the EPDF. However, each separate summary statistic of interest must be smoothed separately, and even if the summary statistic is only required for 1 day, the statistic must be calculated on other days to obtain the smoothed result.
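To illustrate what such a filter does, the sketch below applies an 11-point binomial filter to a yearly sequence of daily statistics. Wrapping the filter around the ends of the year is an assumption made here for simplicity; the end treatment used by Yan *et al.* (2002b) is not described in this text, and the function names are hypothetical.

```python
from math import comb

def binomial_weights(n_points):
    """Normalised binomial coefficients C(n-1, k), k = 0..n-1."""
    coeffs = [comb(n_points - 1, k) for k in range(n_points)]
    total = sum(coeffs)  # equals 2 ** (n_points - 1)
    return [c / total for c in coeffs]

def circular_binomial_filter(series, n_points=11):
    """Smooth a yearly cycle, wrapping around the ends of the series."""
    if n_points % 2 == 0:
        raise ValueError("n_points must be odd")
    weights = binomial_weights(n_points)
    half = n_points // 2
    n = len(series)
    # Each smoothed value is a weighted average of the day itself and
    # the `half` days on either side, taken modulo the series length.
    return [
        sum(w * series[(i + k - half) % n] for k, w in enumerate(weights))
        for i in range(n)
    ]
```

The sketch makes the limitation in the text concrete: the smoothed value for a single day needs the raw statistic on the ten surrounding days, and the whole calculation must be repeated for every statistic of interest.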

A solution to the second problem is to describe the EPDF by fitting a suitable theoretical probability distribution to the data. Given a choice of theoretical distribution that is a good fit to the EPDF, all relevant summary statistics, such as percentiles, can then be calculated from the theoretical distribution using the estimated parameters without needing to revisit the data. Climate users then only need to store the estimated parameters (typically two or three) of the distribution, rather than all the observed values, for each day of the year.
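For the GEV distribution used later in the paper, this calculation has a closed form. The block below is a sketch using the standard parameterization given by Coles (2001); the symbols for location, scale and shape ($\mu$, $\sigma$, $\xi$), and $p$ for the non-exceedance probability, are notation assumed here rather than taken from this section.

```latex
G(x) = \exp\left\{ -\left[ 1 + \xi \, \frac{x - \mu}{\sigma} \right]^{-1/\xi} \right\},
\qquad 1 + \xi \, \frac{x - \mu}{\sigma} > 0, \quad \sigma > 0,

x_p = \mu + \frac{\sigma}{\xi} \left[ (-\ln p)^{-\xi} - 1 \right] \quad (\xi \neq 0),
\qquad
x_p = \mu - \sigma \ln(-\ln p) \quad (\xi = 0).
```

Setting $G(x_p) = p$ and solving for $x_p$ gives the second line, so any percentile follows from the three stored parameters alone.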

Ideally, the theoretical distribution and, therefore, summary statistics derived from the distribution, should capture the seasonal variability in the data and mask the day-to-day variation. This could be achieved if the estimated parameters of the theoretical distribution varied smoothly through the year. This is not achieved directly because the distribution is fitted separately to the EPDF for each day of the year, and the EPDF is noisy. An alternative to smoothing the summary statistics directly is to smooth the estimated parameters of the theoretical distribution. Summary statistics, derived from the theoretical distribution using smoothed parameters, will then also vary smoothly through the year capturing the seasonal variability in the data.

A common approach to describing the distribution of daily weather data so that the distribution varies smoothly through the year is to use Generalized Linear Models (GLMs) (McCullagh and Nelder, 1989). For example, Yan *et al.* (2002a) describe the distribution of maximum wind speed on any day of the year using a gamma distribution, and the amount of rainfall on a wet day is typically modelled using the gamma or exponential distribution (Stern and Coe, 1984). The distributions of daily temperatures (in general maximum, minimum and mean) were modelled by Richardson (1981) and more recently by Furrer and Katz (2007), assuming that the temperature distribution on any day is a normal distribution.

In northern Europe, daily temperatures in winter can be below 0 °C and the EPDF is left skewed because of extreme cold events. In the summer, temperatures are typically above 0 °C and extreme warm events result in an EPDF that is right skewed. Although the GLM framework is quite general, it is constrained to modelling distributions from the exponential family. There are no distributions in the exponential family that are skew and take both positive and negative values and so this approach is not available for modelling these temperature data. Furthermore, in a GLM it is a function of the mean of the distribution that can vary through the year. Other characteristics of the distribution, such as the variance, have less flexibility.

Because of these limitations there appears to be little work on modelling temperature distributions in northern Europe. Jones *et al.* (1999) focused on relative rather than absolute distributions and changes in the distribution of the Central England temperature data by fitting gamma distributions to temperature anomalies relative to a smoothed mean on each day of the year for two different periods. Barrow and Hulme (1996) tested the suitability of different theoretical probability distributions to describe monthly EPDFs of daily maximum or minimum surface air temperatures, recorded in Fahrenheit, at nine locations in the UK. Different distributions were chosen depending on the month and location, and no overall preferences were found.

The present paper demonstrates a pragmatic method for describing how the daily distribution of temperatures: (1) varies through the year so that parameters of the distribution and therefore any summary statistics vary smoothly throughout the year, and, (2) has changed between two periods. The method fits the generalized extreme value (GEV) distribution, a distribution with three parameters, to the EPDF for each day of the year in two different periods. A separate linear model is then used to describe how each of the three estimated parameters varies smoothly through the year to obtain smoothed parameter estimates for each day of the year in each period. Using the smoothed parameter estimates to obtain new GEV distributions for each day of the year, relevant summary statistics can be derived and compared between the two periods.

The method is illustrated using daily effective temperature data, temperature adjusted for wind speed (see Section '2. Data' for definition) from De Bilt, Holland, for the two periods 1904–1984 and 1985–2009. One set of users of these data wanted to know the probability that the temperature exceeds certain thresholds to assist in planning the maintenance of utility services, especially during the winter. The users wanted to be able to calculate the probability for any threshold in the future, rather than being given results for pre-specified thresholds, and without returning to the raw data each time. A long time series of data was available, but because of concerns over changes in temperature, only the last 25 years were thought relevant. However, there is information in the previous 80 years of data, and an additional feature of this paper is that the smoothing of parameters in the more recent period uses, where possible, information from the earlier period, thus making maximal use of the available data. Finally, the paper describes how this methodology provides a baseline for the development of more sophisticated methods that directly model the correlation in the data and the parameters and provide measures of uncertainty in relevant summary statistics.
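The users' threshold question can be sketched as follows, under the assumption (suggested by later sections) that the GEV distribution is fitted to the negated DET, that is, DET multiplied by −1. The function names are hypothetical and any parameter values passed in are for illustration only; they are not estimates from the paper.

```python
import math

def gev_cdf(x, mu, sigma, xi):
    """P(X <= x) for a GEV(mu, sigma, xi) variable, sigma > 0."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    if abs(xi) < 1e-12:          # Gumbel limit as xi -> 0
        return math.exp(-math.exp(-z))
    t = 1.0 + xi * z
    if t <= 0:                   # x lies outside the support
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))

def prob_det_below(threshold, mu, sigma, xi):
    """P(DET < threshold) when the GEV describes the negated DET, -DET."""
    # DET < threshold  <=>  -DET > -threshold
    return 1.0 - gev_cdf(-threshold, mu, sigma, xi)

# Illustrative call with made-up parameters for one winter day.
p_cold = prob_det_below(-5.0, mu=0.0, sigma=3.0, xi=0.1)
```

The probability that DET exceeds the threshold is one minus this value, so a user holding only the three parameters for a day can evaluate any threshold on demand.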

### 2. Data

The data are 38 351 records of daily effective temperature (DET) from De Bilt, Holland, from 1 July 1904 to 30 June 2009, with missing values for all of April 1945 and the last 8 days in August 1915. Daily effective temperature on day *t*, *T*_{t}, is calculated using the definition of Wever (2008) as:

- (1)

where *T* is the observed daily mean temperature on day *t* measured in °C, and *W*_{t} is the average wind speed on day *t* measured in m s^{−1}. The daily mean temperatures, *T*, and average wind speeds, *W*_{t}, were obtained from the Royal Netherlands Meteorological Institute website (http://www.knmi.nl). To account for variability in the altitude at which the wind speed was measured over the 105 years, the wind speeds were corrected using Wever's (2008) adjustment.

The data are split into two periods: the reference period, from 1 July 1904 to 30 June 1984, and the current period, from 1 July 1984 to 30 June 2009. The distribution of daily effective temperatures in the 25-year current period is of most interest to users of summary statistics, as is the change between any two periods. As shown in Figure 1, the EPDF of DET in the reference period is relatively well defined because it is based on 80 observations for most days of the year, whereas in the current period there are only 25 observations *per* day.

There is evidence of seasonal variation in the EPDF (Figure 1). It is negatively skewed in the winter, positively skewed in the summer and almost symmetric in spring and autumn. Overall, there is more evidence of negative skew than positive skew in the distribution of DET. This is because the distribution of observed temperatures in the winter is negatively skewed and high winds further decrease temperatures. In the summer, high winds decrease the effect of extreme high temperatures, reducing the positive skew of the observed temperatures. Combining the DET values over all years and all days results in a bimodal distribution reflecting the positive and negative skew in the data.

Key summary statistics of the daily EPDF (mean, standard deviation and skew) are shown in Figure 2 for the reference period. These summaries are based on the negative temperatures, DET multiplied by −1, because the focus is on winter temperatures. Further justification is given in Section '3.1. Stage 1: describing a separate distribution for each day in each period'. For the negative mean temperature there is strong evidence of seasonal variation and little evidence of day-to-day variability: it is high in the winter and low in the summer (Figure 2(a)). The standard deviation and skew show more day-to-day variability and some seasonal variability. The temperatures vary least in the autumn (smallest standard deviation) and most in the winter (Figure 2(b)). The negative temperatures have a positive measure of skew in the winter, indicating a right-skewed distribution (equivalent to cold extremes), and negative measures of skew in the summer, indicating left skew (hot extremes). The winter distributions are more skewed than the summer distributions: 12 days had a right skew of more than 90 in the winter, compared to 4 days with a left skew of less than −90 in the summer. The two distributions at the bottom represent 2 days with very different characteristics. The EPDF of the negative temperatures on 15 January has a high mean, high variance and positive skew, and on 15 July it has a low mean, lower variance and negative skew.
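As a rough illustration of these three summaries, the sketch below negates one day's DET values and computes their mean, standard deviation and skewness. The measure of skew behind Figure 2 is not defined in this section, so the moment coefficient of skewness is used here as an assumption and its scale need not match the values quoted above; the function name and the toy data are hypothetical.

```python
from statistics import fmean, pstdev

def negated_summary(det_values):
    """Mean, standard deviation and moment skewness of -DET for one day."""
    negated = [-v for v in det_values]
    mean = fmean(negated)
    sd = pstdev(negated, mu=mean)  # population standard deviation
    if sd == 0:
        raise ValueError("skewness is undefined for constant data")
    skew = fmean([((v - mean) / sd) ** 3 for v in negated])
    return mean, sd, skew

# Made-up winter-like sample: one very cold day among milder ones.
mean, sd, skew = negated_summary([-10.0, 0.0, 1.0, 2.0])
```

A cold outlier in DET becomes a large positive value after negation, which is why cold extremes appear as right skew in the negated data.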

### 5. Discussion of alternative methodologies

The two-stage methodology used in this paper fits a separate GEV distribution to the DET for each day of the year in each of the two periods to give estimates of the three parameters of the GEV distribution for each day in each period. Each of the three parameters is then smoothed to describe how it varies through the year and between the two periods. Smoothing is carried out by fitting a weighted general linear model with Fourier series functions as covariates. This method of smoothing: (1) gives a smooth transition between the end of one year (30 June) and the beginning of the next year (1 July); (2) shares information between the two time periods so that common features can be fitted together, which is especially important for the current period where estimates are based on many fewer observations; (3) gives more weight to those parameters that are more precisely estimated in defining the pattern through the year; and (4) provides a straightforward way to predict the distribution for any day of the year in either period, because all that needs to be stored are the three linear models and their estimated parameters. It is difficult to see how other smoothing methods, such as binomial filters, which are commonly used to model time series data (for example Yan *et al.*, 2002b), would meet all four of these requirements.
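The smoothing step for a single parameter in a single period can be sketched as a weighted least-squares fit on Fourier covariates. This is a simplified illustration, not the paper's fitted model: the number of harmonics, the 365-day period, the choice of weights (for example inverse variances of the daily estimates) and all function names are assumptions, and the sharing of terms between the two periods is omitted.

```python
import math

def fourier_row(day, n_harmonics, period=365.0):
    """Covariates [1, cos, sin, ...] for one day of the year."""
    row = [1.0]
    for k in range(1, n_harmonics + 1):
        angle = 2.0 * math.pi * k * day / period
        row += [math.cos(angle), math.sin(angle)]
    return row

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_weighted_fourier(days, estimates, weights, n_harmonics=2):
    """Weighted least squares: minimise sum w_i (y_i - x_i' beta)^2."""
    rows = [fourier_row(d, n_harmonics) for d in days]
    p = len(rows[0])
    # Weighted normal equations (X'WX) beta = X'Wy.
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
             for j in range(p)] for i in range(p)]
    xtwy = [sum(w * r[i] * y for r, w, y in zip(rows, weights, estimates))
            for i in range(p)]
    return solve(xtwx, xtwy)

def predict(beta, day):
    """Smoothed parameter value for a given day of the year."""
    n_harmonics = (len(beta) - 1) // 2
    return sum(b * x for b, x in zip(beta, fourier_row(day, n_harmonics)))
```

Because the covariates are periodic, the prediction for day 0 equals the prediction for day 365, which is the smooth year-end transition listed as point (1); only the fitted coefficients need to be stored to predict any day, as in point (4).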

Further development of the methodology used in this paper could enable confidence intervals for the summary statistics to be obtained, which would be useful for comparing between periods or through the year. For this, the covariance of pairs of parameter estimates, for example , as well as the variance of each of the smoothed parameter estimates, are required (Coles, 2001). It may be possible to obtain these by fitting a multivariate normal distribution to all three fitted parameters together so that covariances as well as variances are estimated. However, this is a non-trivial estimation exercise, which moves away from the pragmatic approach described here, and it is unclear how weighted multivariate regressions, particularly the weighted covariance structure, can be implemented. If it were implemented, it would only provide a first estimate of the variability of the summary statistics because it only captures the uncertainty about the smoothed parameters; the variability of the fitted parameters, for example , would not be captured by this method and so confidence intervals for summary statistics would appear more precise than they should. Further work is required to extend these methods, and potential avenues with their major challenges are described below.
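To show why the covariances are needed, the delta-method approximation described by Coles (2001) for the variance of an estimated GEV quantile can be sketched as below. The notation is assumed here: $x_p$ is the quantile with non-exceedance probability $p$, $(\mu, \sigma, \xi)$ are the location, scale and shape parameters, and $V$ is the $3 \times 3$ variance-covariance matrix of their estimates, with the gradient evaluated at the estimates.

```latex
\operatorname{Var}(\hat{x}_p) \approx \nabla x_p^{\mathsf{T}} \, V \, \nabla x_p,
\qquad
x_p = \mu + \frac{\sigma}{\xi} \left[ y_p^{-\xi} - 1 \right], \quad y_p = -\ln p,

\nabla x_p^{\mathsf{T}} =
\left[
  \frac{\partial x_p}{\partial \mu},\;
  \frac{\partial x_p}{\partial \sigma},\;
  \frac{\partial x_p}{\partial \xi}
\right]
=
\left[
  1,\;
  \frac{y_p^{-\xi} - 1}{\xi},\;
  -\frac{\sigma}{\xi^{2}} \left( y_p^{-\xi} - 1 \right)
  - \frac{\sigma}{\xi} \, y_p^{-\xi} \ln y_p
\right].
```

The quadratic form involves every off-diagonal element of $V$, so variances of the individual smoothed parameters alone are not sufficient.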

More elegant and statistically appealing approaches are used in the analysis of extremes. These strategies combine the two stages so that the daily observations of DET are modelled directly to obtain estimates of the parameters of the GEV distribution that vary smoothly through the year. For example, Menéndez *et al.* (2007) fitted a one-stage model to high sea levels in which the parameters of the GEV distribution vary seasonally through the year, and Maraun *et al.* (2009) apply a similar model to describe the annual cycle of precipitation using the maximum precipitation in each month. Coles and Tawn (2005) carried out a Bayesian analysis which accounted for seasonality and long-term trends in the parameters of the GEV distribution to describe extreme sea surges on the UK east coast. A Bayesian analysis is appealing because all sources of uncertainty are automatically included in the analysis.

However, the data structures to which these models are applied are simpler than the one described in this paper. First, they only model the distribution of either maximum values or values over a threshold, so there is little or no serial correlation to account for. Applying these methods to the data in this paper would require the serial correlation in the DET between consecutive days in the same year to be accounted for, and it is challenging to extend the one-stage methods described above to do this. With the two-stage method described in this paper the serial correlation does not need to be accounted for because separate distributions are fitted to each day.

Furthermore, the distributions of the extremes or peaks over thresholds are always right-skewed: this is much simpler to fit than data with both left and right skew. A possible alternative approach is to switch from modelling the distribution of negated DET to DET at some point in the year. Determining two suitable points to do this may be problematic.

A further limitation of the analysis in this paper is that it focuses on the difference between two periods, because separate GEV distributions need to be fitted for each time period. A more sophisticated analysis, such as Coles and Tawn (2005), could provide scope for modelling a long-term trend in DET, for example by using generalized additive models (Underwood, 2009) through the years once the issues of autocorrelation and changing the direction of skew were solved.

The analysis here has focused on DET and the issues surrounding the modelling of negatively and positively skewed data; other climate variables could be modelled using a similar strategy. Depending on the shape of the EPDF, distributions other than the GEV might be appropriate. In addition to describing changes in the distribution between two historical periods, comparisons between two locations or between the present day and projections could also be explored.

### 6. Conclusions

This paper describes a methodology for easily obtaining summary statistics of the distribution of DET that vary smoothly through the year. Furthermore, it demonstrates how long time series of data over many years can be used to help inform summaries for a smaller number of years, e.g. the most recent years. The method describes the EPDF of DET on any day as a GEV distribution, and uses simple equations to describe the seasonal variation in the parameters of this theoretical distribution for each time period. Given these equations, the practitioner can easily calculate the parameters of the GEV distribution, and from these calculate relevant summary statistics, such as the 5% point, for any day of the year in either time period without recourse to the original data.

The methods have been applied to DET in De Bilt, Holland. They show a clear warming, as measured by DET, from the reference to the current period, particularly in the winter temperatures. The median DET has increased by 1.1 °C in the winter and by 0.9 °C in the summer; the lower 2% point has warmed by 1.2 °C in the winter and the upper 2% point has risen by 0.8 °C in the summer. These results are consistent with the analysis of Brown *et al.* (2008), which looked at extreme values in the maximum and minimum temperatures from January 1950 to 2004 and showed a 1.1 °C warming in the maximum daily maximum temperature in Europe and a 1.6 °C warming in the minimum daily minimum temperature in Europe. However, Brown *et al.*'s (2008) analysis uses observations above a pre-specified threshold in each year. The analysis in this paper focuses on describing the distribution of all observations, which provides additional insights into how temperatures are changing through time.

Methods used in the analysis and description of extreme events may be useful when calculating confidence intervals if issues of correlated data and distributions with both left and right skew can be solved. The methodology described in this paper is a simple pragmatic approach to obtaining summaries that can be applied to other daily observations where distributions are both positively and negatively skewed.

Abbreviations: DET: daily effective temperature; EPDF: empirical probability distribution function; GEV: generalized extreme value