## Introduction

Ecological data are often discrete counts – the number of individuals or species in a trap, quadrat, habitat patch, on an island, in a nature reserve, on a host plant or animal, the number of offspring, the number of colonies or the number of segments on an insect antenna. Densities of individuals are often counts too: a count in an area of unit size (in the analysis of these data, the actual area of a count can be included as an offset; see below). Even though textbooks on statistical methods in ecology (e.g. Sokal & Rohlf 1995; Zar 1999; Crawley 2003; Maindonald & Braun 2007) recommend the use of the square-root transformation to normalize count data, such data are often log-transformed for subsequent analysis with parametric test procedures (e.g. Gebeyehu & Samways 2002; Magura, Tóthmérész, & Elek 2005; Cuesta *et al.* 2008). The reasons for log-transforming count data are not clear, but perhaps have to do with the common use of log-transformations on all kinds of data, and the fact that textbooks usually deal with the log-transformation first, before evaluating other transformation techniques.

The main purpose of a transformation is to get the sampled data in line with the assumptions of parametric statistics (such as ANOVA, *t*-test and linear regression) or to deal with outliers (see Zuur, Ieno, & Smith 2010; Zuur, Ieno, & Elphick 2009a). These assumptions include that the residuals from a model fit are normally distributed with a homogeneous variance. In addition, regression assumes that the relationship between the covariate and the expected value of the observation is linear. Classical parametric methods deal with continuous response variables (weights, lengths, concentrations, volumes and rates) with few ‘zero’ observations. As such, a log-transformation may successfully ‘normalize’ such continuous data for use in parametric statistics.

Discrete response variables, such as count data, on the other hand, often contain many ‘zero’ observations (see Sileshi, Hailu, & Nyadzi 2009) and are unlikely to have a normally distributed error structure. The question arises: can, or should, count data that include zeroes be transformed to approximate normality so that they can be analysed with parametric statistics? Maindonald & Braun (2007) argued that generalized linear models (GLMs) have largely removed the need for transforming count data, yet the practice is still widespread in the ecological literature (see above).

Classically, response variables are transformed to improve two aspects of the fit: linearity of the response and homogeneity of the variance (homoscedasticity). This can be done in an exploratory manner (e.g. Box & Cox 1964), but transformations often have sensible interpretations, e.g. the log-transformation implies that the mechanisms are multiplicative on the scale of the raw data. Clearly, there is no reason to expect that a single transformation will behave optimally for both linearity and homoscedasticity; so, some compromise is often needed.

More recently, GLMs have been developed (McCullagh & Nelder 1989). These allow the analyst to specify the distribution that the data are assumed to have come from, which implicitly defines the relationship between the mean and variance. The distribution can be chosen based on an understanding of the underlying process that is assumed to have generated the data, e.g. a constant rate of capture of individual members of a large population implies a Poisson distribution. If the capture rate varies randomly, the data look clumped, with more zeroes but also more sites with large counts. In generalized linear modelling terminology this is ‘overdispersion’, which can be handled in several ways, the most popular of which is to specify the response as coming from a quasi-Poisson or negative binomial distribution.
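The link between a randomly varying capture rate and clumped counts can be illustrated with a small simulation. The sketch below is an illustration only, not part of the analysis reported here: it assumes that the capture rate at each site follows a gamma distribution (which gives negative binomial counts), and the function names, the mean of 5, the shape of 1 and the number of sites are all hypothetical choices.

```python
import math
import random
import statistics


def poisson_draw(rate, rng):
    """Draw one Poisson count with Knuth's multiplication method (adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_counts(mean, n_sites, rng, shape=None):
    """Simulate counts at n_sites.

    shape=None keeps the capture rate constant (Poisson counts).
    Otherwise each site's rate is drawn from a gamma distribution with the
    given shape and the same mean (an assumption made for this sketch),
    which yields negative binomial counts.
    """
    counts = []
    for _ in range(n_sites):
        rate = mean if shape is None else rng.gammavariate(shape, mean / shape)
        counts.append(poisson_draw(rate, rng))
    return counts


def summarise(counts):
    """Return (mean, variance, proportion of zeroes) for a list of counts."""
    zero_fraction = counts.count(0) / len(counts)
    return statistics.mean(counts), statistics.variance(counts), zero_fraction


rng = random.Random(1)  # arbitrary seed, for reproducibility only
poisson_counts = simulate_counts(mean=5.0, n_sites=20000, rng=rng)
clumped_counts = simulate_counts(mean=5.0, n_sites=20000, rng=rng, shape=1.0)

for label, counts in [("constant rate", poisson_counts), ("varying rate", clumped_counts)]:
    m, v, z = summarise(counts)
    print(f"{label}: mean={m:.2f} variance={v:.2f} zeroes={z:.3f}")
```

With the same mean, the varying-rate counts should show a variance well above the mean, a larger proportion of zeroes and a longer upper tail than the constant-rate counts.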

Here, we are interested in comparing how well the two approaches work when analysing count data. An additional wrinkle with the traditional approach of log-transforming is that log(0) = −∞; so, a value (usually 1) is added to the count before transformation. We are not aware of any justification for adding 1, rather than any other value, and this may bias the fit of the model. Zeroes do not present any problems in GLMs, as there it is the expected value that is log-transformed.
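To make the dependence on the added constant concrete, the following sketch averages log(*y* + *c*) for a handful of made-up counts and back-transforms the result for several values of *c*. The counts, the constants and the function name are hypothetical and serve only as an illustration.

```python
import math
import statistics


def back_transformed_mean(counts, constant):
    """Average log(y + constant), then undo the transformation."""
    mean_log = statistics.fmean(math.log(y + constant) for y in counts)
    return math.exp(mean_log) - constant


counts = [0, 0, 1, 2, 3, 10]  # made-up counts, for illustration only
arithmetic_mean = statistics.fmean(counts)
constants = [0.001, 0.5, 1.0]  # hypothetical choices of the added value
estimates = {c: back_transformed_mean(counts, c) for c in constants}

print(f"arithmetic mean: {arithmetic_mean:.3f}")
for c, estimate in estimates.items():
    print(f"log(y + {c}): back-transformed mean = {estimate:.3f}")
```

The back-transformed value changes with the constant and stays below the arithmetic mean in every case, because it is a shifted geometric mean. A log-link GLM avoids the choice altogether, since it models the log of the expected count rather than the expected log of the count.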

Zeroes can also be handled by using zero-inflated models (e.g. Sileshi *et al.* 2009; Zuur *et al.* 2009, chapter 11; Zuur, Ieno, & Elphick 2010). When modelling counts, both zero-inflated models and overdispersed models can account for a large number of zero counts, and there may be little advantage in fitting the zero-inflated model.

To address this problem of data transformation, we simulated data from a negative binomial distribution (as count data in ecology are often clumped, producing an expected variance that is greater than the mean; see McCullagh & Nelder 1989; White & Bennetts 1996; Dalthorp 2004), which we then subjected to various transformations [square root and log(*y* + *n*)]. The transformed data were analysed using parametric methods and compared with an analysis of untransformed data in which the response variable was defined as following either a Poisson distribution with overdispersion (i.e. a quasi-Poisson distribution) or a negative binomial error distribution.
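The general shape of such a comparison can be sketched in a few lines. This is a deliberately simplified illustration under stated assumptions, not the simulation design used in this study: it considers a single group with no covariates, fixes *n* = 1, and uses the sample mean as a stand-in for the count-model fit (for an intercept-only Poisson or quasi-Poisson model with a log link, the fitted mean equals the sample mean). The true mean, shape parameter, sample size, number of replicates and all function names are hypothetical.

```python
import math
import random
import statistics


def poisson_draw(rate, rng):
    """Draw one Poisson count with Knuth's multiplication method (adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def negative_binomial_sample(mean, shape, size, rng):
    """Negative binomial counts, drawn as a gamma mixture of Poisson counts."""
    return [poisson_draw(rng.gammavariate(shape, mean / shape), rng) for _ in range(size)]


def estimate_mean(counts):
    """Three estimates of the mean count from one sample."""
    log_mean = statistics.fmean(math.log(y + 1) for y in counts)
    sqrt_mean = statistics.fmean(math.sqrt(y) for y in counts)
    return {
        "log(y+1)": math.exp(log_mean) - 1,       # back-transformed log(y + 1)
        "sqrt(y)": sqrt_mean ** 2,                # back-transformed square root
        "sample mean": statistics.fmean(counts),  # intercept-only log-link fit
    }


def simulate_bias(true_mean, shape, sample_size, n_replicates, seed=0):
    """Average each estimate over replicates and subtract the true mean."""
    rng = random.Random(seed)
    totals = {}
    for _ in range(n_replicates):
        counts = negative_binomial_sample(true_mean, shape, sample_size, rng)
        for name, value in estimate_mean(counts).items():
            totals[name] = totals.get(name, 0.0) + value
    return {name: total / n_replicates - true_mean for name, total in totals.items()}


bias = simulate_bias(true_mean=5.0, shape=1.0, sample_size=50, n_replicates=500)
for name, value in bias.items():
    print(f"{name}: bias = {value:+.2f}")
```

In this toy setting the back-transformed estimates should fall below the true mean while the sample mean stays close to it. A fuller comparison would fit the models with covariates in a statistical package and vary the mean and the degree of clumping.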