t test and related methods.
The t test compares means between two samples. Although it assumes that the samples are drawn from normal distributions with equal variances, the results are surprisingly robust to departure from these assumptions (Heeren & d’Agostino 1987). Nevertheless, it is liable to break down when ‘skew is severe or when population variances and sample sizes both differ’ (Boneau 1960; Stonehouse & Forrester 1998). At least one of these two circumstances is likely to pertain with count data. Skewness has already been mentioned, whereas a difference in means implies a difference in variances if the data are actually drawn from, for example, a Poisson or negative binomial distribution. Hence the t test cannot be recommended for untransformed parasite data. The same caveats apply to regression and analysis of variance, because these are all mathematically similar. Nevertheless, such analyses were carried out on untransformed data in at least 10% of the articles in the literature review (Table 3).
The performance of these methods may be improved by first transforming the data. In particular, if there are no zeros in the data, then a logarithmic transformation may suffice. Moreover, the t test and related techniques will then yield ratios of geometric means and so can easily be interpreted. Such methods were used in at least 15% of the articles in Table 3. As noted above, their robustness means that the normal distribution does not need to be a perfect fit for the results to be reliable. Figure 2 shows a histogram of malaria parasite densities from Dunyo et al. (2006) on a log scale. The shape is similar to that of the superimposed fitted normal curve, suggesting that the t test and related methods are likely to be applicable to the log-transformed data.
Figure 2. Histogram, on a log scale, of numbers of Plasmodium falciparum asexual blood stages on one high power field of a thick blood film in a clinical trial (Dunyo et al. 2006). The dashed line is the fitted normal distribution.
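The link between a t test on log-transformed data and a ratio of geometric means can be sketched in a few lines. The example below is illustrative only: the two samples are hypothetical positive densities, not the data of Dunyo et al. (2006), and the equal-variance t statistic is computed by hand rather than with a statistics package.

```python
import math
import statistics

# Hypothetical parasite densities (all positive, so logarithms are defined).
treated = [120, 45, 300, 80, 150, 60, 210, 95]
control = [400, 150, 900, 260, 520, 310, 700, 180]


def pooled_t_on_logs(a, b):
    """Equal-variance two-sample t statistic computed on natural logs."""
    log_a = [math.log(v) for v in a]
    log_b = [math.log(v) for v in b]
    n_a, n_b = len(log_a), len(log_b)
    diff = statistics.mean(log_a) - statistics.mean(log_b)
    pooled_var = ((n_a - 1) * statistics.variance(log_a)
                  + (n_b - 1) * statistics.variance(log_b)) / (n_a + n_b - 2)
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return diff, se, diff / se


diff, se, t_stat = pooled_t_on_logs(treated, control)

# Exponentiating the difference in mean logs gives the ratio of geometric means.
ratio_from_logs = math.exp(diff)
ratio_of_geometric_means = (statistics.geometric_mean(treated)
                            / statistics.geometric_mean(control))

print(f"t = {t_stat:.2f} on {len(treated) + len(control) - 2} degrees of freedom")
print(f"ratio of geometric means = {ratio_from_logs:.3f}")
```

Exponentiating the limits of a confidence interval for the difference in mean logs would, in the same way, give a confidence interval for the ratio of geometric means.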
The logarithmic transformation is not applicable, however, in the presence of zeros. Various other options are available although, in most cases, they lack the same ease of interpretation. One option is to add 1, or another value, before taking logarithms. Table 3 shows that at least 13% of the articles in the literature review used a transformation other than the simple logarithmic, whereas in 15% it was not clear what, if any, transformation was used. For negative binomial distributions, the value k/2 can be added instead of 1 before taking logarithms, or an inverse hyperbolic sine transformation can be used (Beall 1942; Anscombe 1948; Laubscher 1961; Elliott 1977). Compared with the basic logarithmic transformation, these alternatives have at least two disadvantages. First, with few exceptions, the results cannot be interpreted easily. Secondly, the transformation ‘cannot remove the clump’ of zeros, if present, and so the data will not be normalised (Hallstrom 2010). In general, if zeros are present, then other techniques, in particular generalised linear models (GLMs) (see below), are likely to be preferable to normal-theory methods on transformed data (O’Hara & Kotze 2010).
Non-parametric methods.
Non-parametric methods generally use the ranks of the data rather than the original values. This means that, for example, a single data value much greater than the others will not greatly affect the results as it would for a t test. One of the simplest non-parametric methods is the Mann–Whitney test, which compares data from two independent samples. In the literature review, non-parametric methods were used in 23% of those articles which used any inferential method (Table 3). Nevertheless, they do have some disadvantages:
If the sample size is low and the distribution of the data is close to normal, then they are likely to have considerably lower power than parametric methods such as the t test (Bland 1995). In these – rather limited – circumstances, a parametric method is likely to be preferable.
Contrary to a common misconception, they cannot, in general, be interpreted as comparisons of medians (Hart 2001). For example, the Mann–Whitney test (also known as the Wilcoxon test) assesses whether two distribution functions are unequal for at least one value (Conover 1980). The test can only be interpreted as a comparison of medians if the samples are drawn from populations whose distributions have the same shape, i.e., when represented graphically, they can be laid exactly on top of each other simply by shifting one along the horizontal axis by an amount Δ: a ‘shift alternative’ (Bauer 1972). This is unlikely to apply to count data but, if it were, it is likely that a t test would be applicable in the first place (Bland 1995). When software provides a confidence interval, it is not necessarily for the difference in medians.
They are not generally capable of complex analyses, in particular those with multiple explanatory (predictor) variables.
Overall, non-parametric methods are suitable for simple analyses of skewed count data, although they are not as fool-proof as they may appear.
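The dependence on ranks rather than values can be illustrated with a minimal sketch of the Mann–Whitney U statistic. The counts are hypothetical and the function is a direct pairwise count written for illustration; in practice a statistics package would be used and would also supply a P-value.

```python
import statistics


def mann_whitney_u(a, b):
    """U statistic for sample a: number of (a, b) pairs with a > b, ties counting one half."""
    u = 0.0
    for value_a in a:
        for value_b in b:
            if value_a > value_b:
                u += 1.0
            elif value_a == value_b:
                u += 0.5
    return u


# Hypothetical counts; the second version of group_b has one extreme value.
group_a = [0, 2, 3, 5, 8, 12]
group_b = [1, 4, 6, 9, 15, 20]
group_b_outlier = [1, 4, 6, 9, 15, 20000]

u_original = mann_whitney_u(group_a, group_b)
u_with_outlier = mann_whitney_u(group_a, group_b_outlier)

# The difference in means, on which a t test is based, changes drastically.
mean_diff_original = statistics.mean(group_b) - statistics.mean(group_a)
mean_diff_outlier = statistics.mean(group_b_outlier) - statistics.mean(group_a)

print(u_original, u_with_outlier)
print(mean_diff_original, mean_diff_outlier)
```

Replacing the largest value by a much larger one leaves every rank, and hence U, unchanged, whereas the difference in means is dominated by that single value.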
Generalised linear models.
We have seen techniques which assume that the distribution of the data is normal (Gaussian) and others which do not depend on it having any particular form. Another option is to find a distribution which does fit the data adequately: this is carried out in generalised linear modelling. This can be thought of as a kind of regression in which the distribution of the data is not necessarily normal. For count data, the Poisson or negative binomial, for example, may be suitable.
We model a function of the mean, not necessarily the mean itself. This function is called the link function and, for present purposes, it is likely to be the logarithm. It is conventional here for statistical packages to use the natural (base e) logarithms.
Parasite counts are usually overdispersed relative to Poisson, requiring a distribution such as the negative binomial, which can accommodate this. If a Poisson distribution is used regardless, the results can be misleading (Barker & Cadwell 2008). However, it is possible to use the ‘sandwich’ estimator [e.g. with the ‘vce(robust)’ option in Stata] or ‘quasi-likelihood’ models, both of which effectively multiply the standard errors by a factor estimated from the data (White 1980; Sileshi 2006; Noe et al. 2010), but which do not fully characterise a specific distribution for the data.
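As a rough illustration of the quasi-likelihood idea, the sketch below estimates a dispersion factor for an intercept-only Poisson model and uses it to inflate the standard error of the log mean. The counts are hypothetical, and the Pearson chi-squared statistic divided by the residual degrees of freedom is assumed as the dispersion estimate; it is one common choice, not the only one.

```python
import math
import statistics

# Hypothetical egg counts with a clump of zeros and a long right tail.
counts = [0, 0, 0, 0, 0, 1, 2, 3, 5, 8, 13, 40]

n = len(counts)
mean = statistics.mean(counts)

# Pearson chi-squared for an intercept-only Poisson model, divided by the
# residual degrees of freedom (n - 1). Values well above 1 indicate overdispersion.
dispersion = sum((y - mean) ** 2 / mean for y in counts) / (n - 1)

# Under the Poisson model the standard error of the log mean is 1 / sqrt(total count).
se_poisson = 1 / math.sqrt(sum(counts))

# The quasi-Poisson standard error multiplies this by the square root of the dispersion.
se_quasi = math.sqrt(dispersion) * se_poisson

print(f"dispersion = {dispersion:.1f}")
print(f"SE of log mean: Poisson {se_poisson:.3f}, quasi-Poisson {se_quasi:.3f}")
```

In this intercept-only case the dispersion estimate is simply the sample variance divided by the mean, so the Poisson standard error is too small by a factor equal to the square root of that ratio.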
Once the distribution has been chosen, a GLM is fitted by maximising the likelihood – i.e. the probability of the data, considered as a function of the model parameters (Everitt 1995) – over possible values of the regression coefficients. A couple of points are worth stressing.
First, it is the fitted mean, not the data, which is transformed in this approach. Hence, a logarithmic link function models the logarithm of the arithmetic mean, not the geometric mean as with an ordinary regression on the log-transformed data.
Secondly, the data must have values which are feasible for the chosen distribution. The Poisson and negative binomial distributions, for example, apply to whole-number (integer) data, so we must model the actual counts. If we have, for example, a set of parasite counts based on varying numbers of replicates, e.g. Kato-Katz slides, we should not input the average values to the GLM, because these are not necessarily whole numbers. Rather, the log number of replicates should be included as an offset in the model. This is an explanatory variable with its regression coefficient constrained to equal 1. In the example of multiple slides per person, let μ be the mean total count for a person and n the number of slides. If there is a single explanatory variable x with coefficient β, then we model the log mean total count per person as log_e(μ) = α + βx + log_e(n), so log_e(μ/n) = α + βx. We can then obtain exp(β) as the ratio of average counts per slide associated with a one-unit increase in x, even though we specified the total count as the response variable. Similarly, when analysing the incidence of events when the follow-up time varies between people, using log-time as an offset enables the analysis to yield rate ratios. Figure 3 shows mean hookworm egg count fitted as a function of age. As with other regression models, different forms of dependence can be included (e.g. polynomial), as can multiple explanatory variables.
Figure 3. Negative binomial regression of faecal egg count on age, the sloped solid line showing the fitted values. The × symbols are the means by age group, with vertical solid lines showing the 95% confidence intervals. Although age groups are shown, the fitted model uses the exact age for each person. The curved dashed lines are the pointwise 95% confidence intervals based on the whole dataset.
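The role of the offset can be checked numerically. The sketch below fits a Poisson model with a logarithmic link, rather than the negative binomial used for Figure 3, because its likelihood equations are simple enough to solve with a few lines of Newton–Raphson. The data, the binary explanatory variable x and the function name are all hypothetical.

```python
import math

# Hypothetical data: total egg count y over n slides per person,
# with a binary explanatory variable x (0 = unexposed, 1 = exposed).
y = [12, 30, 7, 40, 18, 66]
n = [1, 2, 1, 2, 1, 3]
x = [0, 0, 0, 1, 1, 1]


def fit_poisson_with_offset(y, x, n, max_iter=50):
    """Newton-Raphson fit of log(mu_i) = alpha + beta * x_i + log(n_i)."""
    alpha = math.log(sum(y) / sum(n))  # start at the overall log rate
    beta = 0.0
    for _ in range(max_iter):
        mu = [ni * math.exp(alpha + beta * xi) for xi, ni in zip(x, n)]
        # Score vector (first derivatives of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix (2 x 2), inverted explicitly.
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        step_alpha = (h11 * g0 - h01 * g1) / det
        step_beta = (h00 * g1 - h01 * g0) / det
        alpha += step_alpha
        beta += step_beta
        if abs(step_alpha) + abs(step_beta) < 1e-12:
            break
    return alpha, beta


alpha, beta = fit_poisson_with_offset(y, x, n)
rate_ratio = math.exp(beta)

# Crude ratio of average counts per slide, exposed versus unexposed.
per_slide_exposed = (sum(yi for yi, xi in zip(y, x) if xi == 1)
                     / sum(ni for ni, xi in zip(n, x) if xi == 1))
per_slide_unexposed = (sum(yi for yi, xi in zip(y, x) if xi == 0)
                       / sum(ni for ni, xi in zip(n, x) if xi == 0))
crude_ratio = per_slide_exposed / per_slide_unexposed

print(f"exp(beta) = {rate_ratio:.4f}, crude per-slide ratio = {crude_ratio:.4f}")
```

With a single binary explanatory variable, exp(β) from the offset model reproduces the ratio of total counts per slide in the two groups, although the response supplied to the model was the total count per person.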
O’Hara and Kotze (2010) compared GLMs with transformation-based methods and found that the latter perform poorly – in terms of bias and sampling error – unless dispersion is small and the mean counts are large. Goodness of fit measures for GLMs, and negative binomial regression in particular, include checking that the residual deviance is similar to the residual degrees of freedom (Hilbe 2007), and plotting the residuals against fitted values to check that they have random scatter with constant variance (McCullagh & Nelder 1983). Chi-squared tests can also be carried out to compare the numbers in various categories of the outcome variable with those expected under the given distribution. However, it should be borne in mind that, with a large sample size, a departure from the assumed distribution may be large enough to be detected by the goodness-of-fit test, but too small to affect materially the results of the main analysis. I tend to rely on visual assessment of the overall distribution as shown in Figure 4.
Figure 4. Histogram of numbers of hookworm eggs counted on two Kato-Katz slides per person in the baseline survey of a prospective study in Brazil. The axis labels show the original values although the scale has been eighth-root transformed. The long-dashed line is the fitted negative binomial distribution and the short-dashed line the fitted Poisson.
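A comparison of the kind shown in Figure 4 can be approximated numerically by setting the observed number of zeros against the numbers expected under fitted Poisson and negative binomial distributions. The sketch below uses hypothetical counts and, as a simplifying assumption, a method-of-moments estimate of the negative binomial parameter k in place of maximum likelihood.

```python
import math
import statistics

# Hypothetical egg counts (not the Brazilian data of Figure 4).
counts = [0, 0, 0, 0, 0, 1, 2, 3, 5, 8, 13, 40]


def poisson_pmf(y, mean):
    """Poisson probability of count y."""
    return math.exp(-mean + y * math.log(mean) - math.lgamma(y + 1))


def negbin_pmf(y, mean, k):
    """Negative binomial probability of count y, parameterised by mean and k."""
    log_p = (math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
             + k * math.log(k / (k + mean))
             + y * math.log(mean / (k + mean)))
    return math.exp(log_p)


mean = statistics.mean(counts)
variance = statistics.variance(counts)

# Moment estimate of k; only valid when the variance exceeds the mean.
k = mean ** 2 / (variance - mean)

n = len(counts)
observed_zeros = counts.count(0)
expected_zeros_poisson = n * poisson_pmf(0, mean)
expected_zeros_negbin = n * negbin_pmf(0, mean, k)

print(f"k = {k:.3f}")
print(f"zeros: observed {observed_zeros}, "
      f"Poisson {expected_zeros_poisson:.2f}, "
      f"negative binomial {expected_zeros_negbin:.2f}")
```

The same functions can be evaluated over ranges of counts to obtain the expected numbers in each category for a chi-squared comparison, or to overlay fitted curves on a histogram.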
Other distributions for skewed data include versions of the Poisson and negative binomial in which the proportion of zeros is defined by additional parameters. These are called zero-inflated distributions. The zero-inflated negative binomial may be a better fit than the negative binomial in some cases (Walker et al. 2009). If zeros were not recorded in the dataset, then a zero-truncated distribution may be suitable, while if counting stopped at a maximum value – for example no more than 500 Ascaris eggs were counted per slide by Cundill et al. (2011) – then censored distributions are available (Hilbe 2007). There are yet other distributions which have been little used (Grenfell et al. 1990; Shoukri et al. 2004; Hoshino 2005), or not used at all (Dobbie 2001; Massé & Theodorescu 2005; Shmueli et al. 2005), for parasite data.
Other methods with potential for greater use.
There are several statistical methods which have not been described above, but which have scope for greater use with count data. The bootstrap was notable by its absence from the articles reviewed. This approach is based on re-sampling the data and observing the sampling variation in any outcome of choice, such as a difference or ratio in medians or means (Efron & Tibshirani 1993). For simple comparisons, this would seem to have some advantages over non-parametric methods. However, it is not recommended for small sample sizes because the discreteness of the sampling distribution may make the inference unreliable.
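A minimal sketch of a percentile bootstrap for a ratio of means is given below. The data, the number of replicates and the seed are all hypothetical choices, and the simple percentile interval is assumed for brevity; other bootstrap intervals exist and may perform better.

```python
import random
import statistics

# Hypothetical counts; the control group has no zeros, so its resampled
# mean (the denominator of the ratio) is always positive.
treated = [0, 0, 1, 2, 3, 5, 8, 12, 20, 45]
control = [1, 3, 4, 6, 9, 14, 22, 30, 55, 120]


def bootstrap_ratio_of_means(a, b, replicates=2000, seed=1):
    """Percentile 95% interval for mean(a) / mean(b), resampling each group with replacement."""
    rng = random.Random(seed)  # fixed seed so that the result is reproducible
    ratios = []
    for _ in range(replicates):
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        ratios.append(statistics.fmean(resample_a) / statistics.fmean(resample_b))
    ratios.sort()
    lower = ratios[int(0.025 * replicates)]
    upper = ratios[int(0.975 * replicates) - 1]
    return lower, upper


observed_ratio = statistics.fmean(treated) / statistics.fmean(control)
lower, upper = bootstrap_ratio_of_means(treated, control)

print(f"ratio of means = {observed_ratio:.3f}, "
      f"95% bootstrap interval {lower:.3f} to {upper:.3f}")
```

The same function could be adapted to a difference or ratio of medians by replacing the mean with another summary, which is the flexibility referred to above.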
The median or other percentile of the counts can be modelled as a function of explanatory variables using quantile regression (Yu et al. 2003). Transformation of the counts may be advisable but, because the usual functions maintain rank order (they are monotonic), this will not affect the interpretation of the results.
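The point about monotonic transformations can be verified directly for a sample median. In the sketch below the counts are hypothetical and log(1 + count) is assumed as the transformation. With an odd number of observations the median is a single data value, so back-transforming the median of the transformed data recovers the median of the original data, which is not true of the mean.

```python
import math
import statistics

# Hypothetical counts; an odd sample size makes the median a single observation.
counts = [0, 0, 1, 3, 4, 9, 250]
transformed = [math.log1p(c) for c in counts]  # log(1 + count), a monotonic function

median_raw = statistics.median(counts)
median_back_transformed = math.expm1(statistics.median(transformed))

mean_raw = statistics.mean(counts)
mean_back_transformed = math.expm1(statistics.mean(transformed))

print(median_raw, median_back_transformed)  # these agree
print(mean_raw, mean_back_transformed)      # these do not
```

With an even sample size the conventional median averages the two middle values, so the agreement is only approximate.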
In many cases, the negative (zero) instances may warrant different treatment, whether for purely statistical reasons or due to clinical or scientific distinctions between them and the positive group. There is a well-developed literature on models which treat these two groups distinctly within a single overall analysis. These may be referred to as two-part, mixture or bivariate models (Lachenbruch 2002; Moulton et al. 2002; Hallstrom 2010). When the proportion of zeros is modelled as a function of covariates, then zero-inflated distributions are in this category. The results for the zero and non-zero parts of the model can be presented separately or, depending on the kind of model, it may be possible to synthesise them to report the overall mean (Burton et al. 2003).
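In the simplest two-part summary, without covariates, the synthesis of the two parts into an overall mean is just a product. The sketch below uses hypothetical counts and is intended only to show the arithmetic, not any of the cited models.

```python
import math

# Hypothetical counts with a substantial proportion of negatives (zeros).
counts = [0, 0, 0, 0, 0, 1, 2, 3, 5, 8, 13, 40]

positives = [c for c in counts if c > 0]

# Part 1: proportion of people who are positive.
proportion_positive = len(positives) / len(counts)

# Part 2: mean count among the positives only.
mean_among_positives = sum(positives) / len(positives)

# Synthesis: the overall mean is the product of the two parts.
overall_mean = proportion_positive * mean_among_positives

print(f"proportion positive = {proportion_positive:.3f}")
print(f"mean among positives = {mean_among_positives:.2f}")
print(f"overall mean = {overall_mean:.2f}")
```

In a model with covariates, each part would have its own regression equation, and the product would be evaluated at chosen covariate values.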
Finally, when no single parameter, such as the median or mean, is found to be an adequate summary of the data (Montresor 2007), infection intensity classes (e.g., negative, mild, moderate, heavy) can be analysed as ordered categories (Agresti 1999).