Keywords:

  • statistical data analysis;
  • parasitology;
  • statistics;
  • non-parametric;
  • regression analysis

Abstract

Objective  To review methods for the statistical analysis of parasite and other skewed count data.

Methods  Statistical methods for skewed count data are described and compared, with reference to a 10-year period of Tropical Medicine and International Health (TMIH). Two parasitological datasets are used for illustration.

Results  The review of TMIH found 90 articles, of which 89 used descriptive methods and 60 used inferential analysis. A lack of clarity is noted in identifying the measures of location, in particular the Williams and geometric means. The different measures are compared, emphasising the legitimacy of the arithmetic mean for skewed data. In the published articles, the t test and related methods were often used on untransformed data, which is likely to be invalid. Several approaches to inferential analysis are described, emphasising (1) non-parametric methods, while noting that they are not simply comparisons of medians, and (2) generalised linear modelling, in particular with the negative binomial distribution. Additional methods with potential for greater use, such as the bootstrap, are described.

Conclusions  Clarity is recommended when describing transformations and measures of location. It is suggested that non-parametric methods and generalised linear models are likely to be sufficient for most analyses.

82, 83, 84…

87 weevils today, Mr. Victor.

That’s above average, but the trend is down.

In a wartime detention camp, a boy counts the vermin in his food rations (from the film Empire of the Sun, based on the autobiographical novel by JG Ballard).

Introduction

Counting items such as parasites results in non-negative integer data: 0, 1, 2, etc. The aim of this article is to review methods for statistical analysis of such data in medical research. The focus will be on parasite counts although most of the same methods are applicable to other count data such as numbers of insects, plaques or disease episodes.

Descriptive analysis of such data usually includes summary numbers which aim to convey average values of the data. In statistics, these are called ‘measures of location’. For count data, even these simple measures prove problematic. This is because the data are often skewed: their ‘long tail’ (Mumpower & McClelland 2002) means that a small proportion of people can, for example, harbour a large proportion of the parasites. In basic statistics, it is often taught that such skewness precludes the use of the arithmetic mean because it is ‘overly’ influenced by the high values (Kirkwood & Sterne 2003). A possible alternative is to analyse the logarithms of the data instead of the original values. However, for count data, this is only feasible in the absence of zeros, because the logarithm of zero is not a finite number. This has led to controversy over what measures of location are appropriate, and to basic terms such as ‘geometric mean’ being used inconsistently.

Much of this controversy results from a reliance on statistical methods that use the normal (Gaussian) distribution, whose symmetry usually precludes a good fit to count data. However, this dependence is now largely outdated, owing to the availability of approaches based on skewed distributions such as the negative binomial (Anderson & May 1991).

This article aims to review the statistical methods currently being used for such data and to describe their advantages and disadvantages. It concentrates on the most common kinds of analyses, leaving aside more specialised ones such as spatial patterns, repeatability and reproducibility. Descriptive methods are addressed first, then inferential methods, the latter being those which compare counts between groups or correlate them with other variables.

Methods and results

Descriptive methods

Measures of location: median and means.

Location is “the notion of central or ‘typical value’ in a sample distribution” (Everitt 1995). The most common measures of location are the median and the various types of mean. The median is the middle value of the sorted data. If the number of values is even, then there are two middle values, and the convention is to define the median as the value half way between them. The median is zero if more than half of the values are zero.

By default, ‘the mean’ usually refers to the arithmetic mean (the sum of the values divided by n, the number of values), although there are other types of mean which may be useful for count data. In particular, the geometric mean is obtained by multiplying together all the data values and then taking the nth root (Borowski & Borwein 1989; Everitt 1995). The geometric mean is always less than the arithmetic mean, unless all values in the dataset are identical, in which case the two are equal (Borowski & Borwein 1989). The geometric mean is zero whenever any of the individual values is zero. This makes it rather meaningless for datasets with zeros, e.g. those that include uninfected people.

The geometric mean is related to logarithmic transformation of the data. If there are no zero values, then the geometric mean equals the exponential of the mean of the log-values. Any base can be used for the exponentiation, provided it equals the base of the logarithm, e.g. 10 to the power of the mean of the log10-values. This suggests the following modification of the geometric mean to accommodate zero values: add 1 to all the data values, then take the geometric mean, then subtract 1 again. This is known as the Williams mean (Williams 1937). There are also other types of mean. For example, the square mean root is the square of the mean of the square root-transformed data. This measure has applications in meta-analysis (Bushman & Wang 1995) and in assessing the agreement of counts between observers or methods (Alexander et al. 2007). Other powers can also be used – in general, these are called power means or algebraic means (Bonferroni 1950).
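The definitions above can be sketched in a few lines of Python. This is only an illustration using the standard library: the function names and the small `counts` sample are hypothetical and are not taken from the article's datasets.

```python
import math
import statistics


def geometric_mean(values):
    """nth root of the product; zero as soon as any value is zero."""
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


def williams_mean(values):
    """Add 1, take the geometric mean, then subtract 1 again."""
    return math.exp(sum(math.log(v + 1) for v in values) / len(values)) - 1


def square_mean_root(values):
    """Square of the mean of the square-root-transformed values."""
    return (sum(math.sqrt(v) for v in values) / len(values)) ** 2


# Hypothetical skewed counts, including zeros (uninfected people).
counts = [0, 0, 1, 3, 8, 20, 150]

arithmetic = statistics.mean(counts)
geometric = geometric_mean(counts)
williams = williams_mean(counts)
smr = square_mean_root(counts)
median = statistics.median(counts)

print(arithmetic, geometric, williams, smr, median)
```

For this sample the measures differ widely, which is the reason the text asks authors to state clearly which one they report.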

Confusion between geometric and Williams means.

In the medical research literature, the Williams mean is sometimes called the geometric mean, as in the onchocerciasis community microfilarial load (CMFL) (Remme et al. 1986). Nevertheless, the two are not equal. One disadvantage of the Williams mean is its dependence on the choice of units. For example, if skin snips weigh 2 mg, the CMFL per snip is not double the CMFL per mg, although both units are used in the literature (Marshall et al. 1986; Remme et al. 1986). This scale dependence can be avoided by adding and subtracting a value with units e.g. 1/mg rather than a dimensionless 1 (Alexander et al. 2005).
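The unit dependence can be checked numerically. The sketch below assumes hypothetical microfilarial counts per 2 mg skin snip; `williams_mean` and its `delta` argument are illustrative names, with `delta` playing the role of the value that is added and then subtracted.

```python
import math


def williams_mean(values, delta=1.0):
    """Geometric mean of (x + delta), minus delta."""
    mean_log = sum(math.log(v + delta) for v in values) / len(values)
    return math.exp(mean_log) - delta


# Hypothetical counts per snip, each snip weighing 2 mg.
per_snip = [0, 2, 4, 10, 60]
per_mg = [c / 2 for c in per_snip]

# Adding a dimensionless 1 on both scales: the results are not in a 2:1 ratio.
dimensionless_snip = williams_mean(per_snip, delta=1.0)
dimensionless_mg = williams_mean(per_mg, delta=1.0)

# Adding 1 per mg, i.e. 2 per 2 mg snip: the 2:1 ratio is restored.
scaled_snip = williams_mean(per_snip, delta=2.0)
scaled_mg = williams_mean(per_mg, delta=1.0)

print(dimensionless_snip, 2 * dimensionless_mg)
print(scaled_snip, 2 * scaled_mg)
```

The second pair agrees because multiplying every value, including the added constant, by 2 simply doubles the geometric mean.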

The geometric mean is not a clever way to estimate the arithmetic mean.

Fulford explains that the Williams mean (he calls it the geometric mean) should not be used to estimate the arithmetic mean (Fulford 1994). Similarly, Dobson et al. (2009) define helminth vaccine efficacy as a fold reduction in arithmetic mean egg count, then show that this is not well estimated by a ratio of Williams means. The points made by these articles are valid, although they may seem rather tautological. The different types of mean differ mathematically, so it should not be a surprise that one of them cannot properly estimate another. Their values are no more commensurable than, for example, weight-for-height and weight-for-age in nutrition.

Choice of measure of location: arithmetic vs. geometric mean.

Compared to the arithmetic mean, the geometric mean is said to be ‘not overly influenced by the very large values in a skewed distribution’ (Kirkwood & Sterne 2003). How much is ‘overly’? We can try to answer this question objectively by going back to whichever health outcomes motivated the study.

A clear-cut example would be noise pollution. Human perception of sound is approximately logarithmic, in that order-of-magnitude changes in sound pressure (measured in pascals) are perceived as approximately uniform increments on a linear scale. This is known as Weber’s law (Wojtczak & Viemeister 2008) and is why sound volumes are expressed in decibels (log-ratio of pressure over baseline). Hence, averaging is more aligned with human perception if carried out on the logarithmic scale: geometric mean of pascals or arithmetic mean of decibels (Olayinka & Abdullahi 2009). This reasoning is independent of the statistical distribution of the data: even if the decibel values were skewed, their arithmetic mean would still be more relevant than the geometric mean.

The choice of measure of location may not always be so straightforward. If we consider malaria parasite density, then we could imagine either the arithmetic or the geometric mean having more interest, depending on the purpose of the study. If the intensity of transmission is of interest, it may be reasonable to assume that this is proportional to parasite density, in which case the arithmetic mean would be relevant. This would correspond to the use of the annual transmission potential in filarial diseases, which is based on the arithmetic mean number of filariae per mosquito (Bockarie et al. 1996). If, by contrast, the objective of the study relates to the number of clinical cases, the geometric mean may be more appropriate because the relation of parasite density to the probability of being a case (as opposed to asymptomatic) is non-linear, the slope reducing at higher densities (Smith et al. 1994). In this case, we would need to exclude non-parasitaemic people from the calculation of the geometric mean, although this is not problematic for this purpose because, by definition, they cannot have malaria.

In other situations, another measure of location, such as the median, may be preferable. The general point is to ‘deconstruct’ the problem (Hand 1994), so that the choice of measure is subordinated to the study’s objectives, with purely statistical concerns – for example achieving a normal distribution – being secondary.

Measures of dispersion.

The most commonly used measures of dispersion are probably the standard deviation, range and interquartile range. The standard deviation is the square root of the variance, which is the average squared deviation from the mean. More specifically, the standard deviation equals $\sqrt{\sum (x_i - \bar{x})^2 / (n - 1)}$, where $\sum$ means the sum over the data values $x_i$ and $\bar{x}$ is the sample mean. The range is the interval between the minimum and maximum values of the data. This tends to increase with larger sample sizes. Percentiles are values below which a certain percentage of the data lie, after they have been sorted into ascending order. The median, for example, is the 50th percentile. The interquartile range is the interval between the 25th and 75th percentiles: this interval contains the middle half of the data.

As the distributions of count data are usually asymmetric, using the standard deviation for error bars, or in ± notation about the mean, is not usually helpful. This is because the resulting interval will often imply negative values whereas, of course, for count data, the lowest possible value is zero. For example, looking at the mean and standard deviation of the hookworm egg counts in Table 1, quoting ‘126 ± 322’ would be rather meaningless. The interquartile range would be one preferable option. Similarly, error bars based on standard deviations are likely to go below zero, as in, for example, the figure showing schistosomiasis faecal eggs per gram by age in one of the articles in the review described below (Seto et al. 2007). Again, plotting the interquartile range, or superimposing box and whisker plots (Kirkwood & Sterne 2003), would be better options.
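A short sketch may make the point concrete. The sample below is hypothetical (it is not the hookworm dataset), and the quartiles use the default method of Python's `statistics.quantiles`, which is one of several conventions.

```python
import statistics

# Hypothetical skewed counts with a long right tail.
counts = [0, 0, 0, 1, 2, 4, 7, 12, 30, 95, 410]

mean = statistics.mean(counts)
sd = statistics.stdev(counts)  # sample standard deviation (divisor n - 1)

# Lower end of a 'mean ± SD' interval: negative, so impossible for a count.
lower_bar = mean - sd

# Quartiles: the interquartile range runs from q1 to q3 and stays non-negative.
q1, median, q3 = statistics.quantiles(counts, n=4)

print(f"mean ± SD: {mean:.0f} ± {sd:.0f} (lower end {lower_bar:.0f})")
print(f"median {median:.0f}, interquartile range {q1:.0f} to {q3:.0f}")
```

Here the interquartile range describes values that actually occur in the data, whereas the standard-deviation interval extends well below zero.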

Table 1.   Descriptive statistics of example datasets

| | Hookworm eggs | Plasmodium falciparum asexual blood stages |
|---|---|---|
| Sample size (male:female) | 1237 (603:634) | 477 (247:230) |
| Mean age in years (range) | 26 (0–95) | 4.9 (1–10) |
| Mass or volume of each sample | Two Kato-Katz faecal slides, totalling approximately 1/12 g | 1 high power field of a thick blood film, approximately 0.001–0.0025 μl (Warrell & Gilles 2002) |
| Per cent positive | 76 | 100* |
| Mean | 126 | 139 |
| Median | 22 | 74 |
| Geometric mean | 0† | 59 |
| Williams mean | 18 | 61 |
| Range | 0–4803 | 1–4941 |
| Inter-quartile range | 1–103 | 23–181 |
| Standard deviation | 322 | 266 |
| Reference | (Cundill et al. 2011) | (Dunyo et al. 2006) |

*A positive malaria slide was an entry criterion for the study.
†The geometric mean is necessarily zero because the dataset contains zero values (i.e., negative people).

For count data, the greatest utility of the standard deviation or variance may be to assess how homogeneous a distribution the data have. The Poisson distribution results from homogeneous processes, which are the simplest models for count data. The mean and variance of the Poisson distribution are equal whereas, if the variance of the data is considerably greater than the mean, there is said to be overdispersion (Elliott 1977; Mwangi et al. 2008). The variance of the negative binomial distribution, for example, equals $\mu + \mu^2/k$, where μ is the mean and k is a positive-valued parameter. For small values of k, the variance is much greater than the mean, reflecting a high degree of overdispersion. Such distributions are sometimes also called ‘contagious’, even if no infectious disease is involved, because they reflect clustering in the data (Elliott 1977). These two situations, homogeneity and overdispersion, are illustrated in Figure 1. It is also possible for the variance to be less than the mean. This occurs when distances between events tend to be similar, with both smaller and larger distances being rare. This is uncommon in practice but would be the case, for example, for spatial events on an even grid.

Figure 1.  Simulated spatial data showing homogeneous and clustered processes in the upper and lower panels respectively. The dashed lines divide each area into a 5 × 5 grid. For the homogeneous process, the variance and mean of the 25 counts are similar: 3.88 and 3.94 respectively, consistent with a Poisson distribution. For the clustered process, the variance is much larger than the mean: 28.4 vs. 4.0, indicating overdispersion. Although these data are spatial, the same principle applies to counts per person, for example. The data were generated using the ‘rpoispp’ and ‘rMatClust’ functions of the ‘spatstat’ package in the R software.
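The variance-to-mean comparison can be sketched as follows. The method-of-moments estimate of k is not described in the article; it is obtained here simply by rearranging the negative binomial variance formula, and the function names are illustrative. The only numbers used are the mean and variance quoted for the clustered panel of Figure 1.

```python
import statistics


def variance_to_mean_ratio(counts):
    """About 1 for Poisson-like data; well above 1 suggests overdispersion."""
    return statistics.variance(counts) / statistics.mean(counts)


def negbin_variance(mu, k):
    """Negative binomial variance: mu + mu^2 / k."""
    return mu + mu ** 2 / k


def moment_estimate_k(mean, variance):
    """Solve variance = mean + mean^2 / k for k (needs variance > mean)."""
    if variance <= mean:
        raise ValueError("no overdispersion: k cannot be estimated this way")
    return mean ** 2 / (variance - mean)


# Clustered panel of Figure 1: mean 4.0, variance 28.4.
k_clustered = moment_estimate_k(4.0, 28.4)
print(f"variance/mean = {28.4 / 4.0:.1f}, moment estimate of k = {k_clustered:.2f}")
```

A small value of k, as here, corresponds to the strong clustering visible in the lower panel.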

Example datasets

Hookworm eggs.

These are baseline data from a longitudinal study of hookworm in Minas Gerais state, Brazil (Cundill et al. 2011). Table 1 shows summary statistics for the total egg count from two Kato-Katz thick smears, each pair being prepared from a single stool sample. Those with missing data on age, gender or either egg count have been excluded.

Plasmodium falciparum asexual blood stages.

Children participating in a trial of anti-malarial drugs had a thick blood film read microscopically (Dunyo et al. 2006). The parasite counts from a single high power field are shown in Table 1.

Literature review

Statistical methods used for parasite count data were reviewed over 10 years of Tropical Medicine and International Health. The following search was carried out in the PubMed online database: (parasit* OR malaria* OR helmint* OR filar*) AND (count* OR intensit* OR densit*) AND trop med int health.

For the years 2001–2010 (volumes 6–15), the search returned 156 articles. They were retained in the review if they were found to include either (i) descriptive analysis of parasite count data or (ii) inferential analysis in which count data constituted an outcome variable. Articles using correlation coefficients – which do not distinguish between outcome and explanatory variables – were also included under the second category. Of the 156 articles, 90 were found to include such analyses: 89 descriptive and 60 inferential (see supporting information).

Methods for descriptive and inferential analysis are shown in Tables 2 and 3. The former includes measures of location, of which the geometric mean was the most commonly used (51% of articles). In some articles, the Williams mean was calculated but presented as the geometric mean. It is likely that there were other instances of this which could not be unambiguously identified from the text. In other words, the Williams mean category (13%) was difficult, in practice, to separate from the geometric mean. The arithmetic mean was also commonly used (35%) and the median less so (7%).

Table 2.   Descriptive measures of location for count data used in 89 articles in Tropical Medicine and International Health, 2001–2010

| Descriptive measure of location | Number of articles (%)* |
|---|---|
| Arithmetic mean | 31 (35) |
| Geometric mean†, or arithmetic mean of logarithms: | |
| – Zeros absent | 25 (28) |
| – Zeros present | 7 (8) |
| – Unclear whether zeros present or not | 13 (15) |
| Williams mean or similar†‡ | 12 (13) |
| Median | 6 (7) |
| Prevalence by category of infection intensity | 20 (22) |
| Other§ | 1 (1) |

*The percentages add to more than 100 because some articles used more than one measure.
†The term ‘geometric mean’ is sometimes used in these articles to refer to the Williams mean. Unambiguous examples of this, e.g., evidenced by an equation in the article, are included under ‘Williams mean or similar’. However, it is likely that more of the ‘geometric means’ are, in fact, Williams means, especially where the data included zeros.
‡Williams mean = (geometric mean of (x + δ)) − δ, for δ = 1. Here, ‘similar’ measures include those which: used values of δ other than 1; added 1 without then subtracting it; or took the mean of log(x + 1) without then exponentiating.
§The arithmetic mean restricted to parasite-positive individuals.

Table 3.   Inferential statistical methods used in 60 articles in Tropical Medicine and International Health, 2001–2010

| Inferential method | Number of articles (%)* |
|---|---|
| Distribution-based: Normal (Gaussian)†, on untransformed data | 6 (10) |
| Distribution-based: Normal (Gaussian)†, after logarithmic transformation | 9 (15) |
| Distribution-based: Normal (Gaussian)†, after other transformation | 8 (13) |
| Distribution-based: Normal (Gaussian)†, unclear what, if any, transformation was used | 9 (15) |
| Distribution-based: Negative binomial | 6 (10) |
| Distribution-based: Poisson with allowance for overdispersion | 1 (2) |
| Non-parametric | 14 (23) |
| Other‡ | 8 (13) |
| Unclear | 7 (12) |

*The percentages add to more than 100 because some articles used more than one method.
†Including t test, analysis of variance (anova) and ordinary least squares regression.
‡Comprising: ratios of parasite densities as response variable (2); χ2 test for trend on infection categories (1); normal distribution-based analysis of village-level indices (1); inference on overlap of confidence intervals (1); Wagstaff index of aggregation of parasites among people (1); logistic regression of high vs. low parasite intensity (1); review citing previous analysis (1).

Inferential analysis

t test and related methods.

The t test compares means between two samples. Although it assumes that the samples are drawn from normal distributions with equal variances, the results are surprisingly robust to departure from these assumptions (Heeren & d’Agostino 1987). Nevertheless, it is liable to break down when ‘skew is severe or when population variances and sample sizes both differ’ (Boneau 1960; Stonehouse & Forrester 1998). At least one of these two circumstances is likely to pertain with count data. Skewness has already been mentioned, whereas a difference in means implies a difference in variances if the data are actually drawn from, for example, a Poisson or negative binomial distribution. Hence the t test cannot be recommended for untransformed parasite data. The same caveats apply to regression and analysis of variance, because these are all mathematically similar. Nevertheless, such analyses were carried out on untransformed data in at least 10% of the articles in the literature review (Table 3).

The performance of these methods may be improved by first transforming the data. In particular, if there are no zeros in the data, then a logarithmic transformation may suffice. Moreover, the t test and related techniques will then yield ratios of geometric means and so can easily be interpreted. Such methods were used in at least 15% of the articles in Table 3. As noted above, their robustness means that the normal distribution does not need to be a perfect fit for the results to be reliable. Figure 2 shows a histogram of malaria parasite densities from Dunyo et al. (2006) on a log scale. The shape is similar to that of the superimposed fitted normal curve, suggesting that the t test and related methods are likely to be applicable to the log-transformed data.

Figure 2.  Histogram, on a log scale, of numbers of Plasmodium falciparum asexual blood stages on one high power field of a thick blood film in a clinical trial (Dunyo et al. 2006). The dashed line is the fitted normal distribution.
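The link between a t test on log-transformed data and the ratio of geometric means can be sketched as follows. This is a minimal illustration assuming a pooled-variance two-sample t statistic and data without zeros; the function name and the two small samples are hypothetical. It stops at the t statistic, since the Python standard library has no t distribution from which to obtain a p-value.

```python
import math
import statistics


def log_scale_comparison(a, b):
    """Return (ratio of geometric means, pooled-variance t statistic).

    Both samples must be strictly positive, because log(0) is undefined.
    """
    log_a = [math.log(v) for v in a]
    log_b = [math.log(v) for v in b]
    diff = statistics.mean(log_a) - statistics.mean(log_b)
    n_a, n_b = len(log_a), len(log_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(log_a)
        + (n_b - 1) * statistics.variance(log_b)
    ) / (n_a + n_b - 2)
    std_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    # Exponentiating the difference in mean logs gives the ratio of geometric means.
    return math.exp(diff), diff / std_error


# Hypothetical parasite densities with geometric means 8 and 4.
group_a = [2, 8, 32]
group_b = [1, 4, 16]
ratio, t_statistic = log_scale_comparison(group_a, group_b)
print(ratio, t_statistic)
```

The back-transformed difference is a ratio of geometric means, not of arithmetic means, which is the interpretation given in the text.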

The logarithmic transformation is not applicable, however, in the presence of zeros. Various other options are available although, in most cases, they lack the same ease of interpretation. One option is to add 1, or another value, before taking logarithms. Table 3 shows that at least 13% of the articles in the literature review used a transformation other than the simple logarithmic, whereas in 15% it was not clear what, if any, transformation was used. For negative binomial distributions, the value k/2 can be added instead of 1 before taking logarithms, or an inverse hyperbolic sine transformation can be used (Beall 1942; Anscombe 1948; Laubscher 1961; Elliott 1977). Compared with the basic logarithmic transformation, these have at least two disadvantages. With few exceptions, the results cannot be interpreted easily, and the transformation ‘cannot remove the clump’ of zeros, if present, and so the data will not be normalised (Hallstrom 2010). In general, if zeros are present, then other techniques, in particular generalised linear models (GLMs) (see below), are likely to be preferable to normal-theory methods on transformed data (O’Hara & Kotze 2010).

Non-parametric methods.

These methods generally use the ranks of the data rather than the original values. This means that, for example, a single data value much greater than the others will not greatly affect the results as it would for a t test. One of the simplest non-parametric methods is the Mann–Whitney test, which compares data from two independent samples. In the literature review, non-parametric methods were used in 23% of those articles which used any inferential method (Table 3). Nevertheless, they do have some disadvantages:

  • If the sample size is low and the distribution of the data is close to normal, then they are likely to have considerably lower power than parametric methods such as the t test (Bland 1995). In these rather limited circumstances, a parametric method is likely to be preferable.

  • Contrary to a common misconception, they cannot, in general, be interpreted as comparisons of medians (Hart 2001). For example, the Mann–Whitney test (also known as the Wilcoxon rank-sum test) assesses whether two distribution functions are unequal for at least one value (Conover 1980). The test can only be interpreted as a comparison of medians if the samples are drawn from populations whose distributions have the same shape, i.e., when represented graphically, they can be laid exactly on top of each other simply by shifting one along the horizontal axis by an amount Δ: a ‘shift alternative’ (Bauer 1972). This is unlikely to apply to count data but, if it were, it is likely that a t test would be applicable in the first place (Bland 1995). When software provides a confidence interval, it is not necessarily for the difference in medians.

  • They are not generally capable of complex analyses, in particular those with multiple explanatory (predictor) variables.

Overall, non-parametric methods are suitable for simple analyses of skewed count data, although they are not as fool-proof as they may appear.
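The point about medians can be illustrated with a small sketch. The two samples below are hypothetical and were chosen to have the same median; the function computes the Mann–Whitney U statistic for the first sample, scaled by the number of pairs, without any p-value.

```python
import statistics


def probability_of_superiority(a, b):
    """Mann-Whitney U for sample a, divided by len(a) * len(b).

    Counts the pairs in which the value from a exceeds the value from b,
    with tied pairs counting one half. A value of 0.5 means no tendency
    for either sample to be larger.
    """
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u / (len(a) * len(b))


# Hypothetical counts: both samples have median 10, but different shapes.
group_a = [0, 0, 0, 10, 10, 10, 10]
group_b = [10, 10, 10, 10, 50, 60, 70]

print(statistics.median(group_a), statistics.median(group_b))
print(probability_of_superiority(group_a, group_b))
```

The scaled statistic is far from 0.5 even though the medians coincide, so a test based on it is responding to something other than a difference in medians.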

Generalised linear models.

We have seen techniques which assume that the distribution of the data is normal (Gaussian) and others which do not depend on it having any particular form. Another option is to find a distribution which does fit the data adequately: this is carried out in generalised linear modelling. This can be thought of as a kind of regression in which the distribution of the data is not necessarily normal. For count data, the Poisson or negative binomial, for example, may be suitable.

We model a function of the mean, not necessarily the mean itself. This function is called the link function and, for present purposes, it is likely to be the logarithm. It is conventional here for statistical packages to use the natural (base e) logarithms.

Parasite counts are usually overdispersed relative to Poisson, requiring a distribution such as the negative binomial, which can accommodate this. If a Poisson distribution is used regardless, the results can be misleading (Barker & Cadwell 2008). However, it is possible to use the ‘sandwich’ estimator [e.g. with the ‘vce(robust)’ option in STATA] or ‘quasi-likelihood’ models, both of which effectively multiply the standard errors by a factor estimated from the data (White 1980; Sileshi 2006; Noe et al. 2010), but which do not fully characterise a specific distribution for the data.

Once the distribution has been chosen, a GLM is fitted by maximising the likelihood – i.e. the probability of the data, considered as a function of the model parameters (Everitt 1995) – over possible values of the regression coefficients. A couple of points are worth stressing.

First, it is the fitted mean, not the data, which is transformed in this approach. Hence, a logarithmic link function models the logarithm of the arithmetic mean, not the geometric mean as with an ordinary regression on the log-transformed data.

Secondly, the data must have values which are feasible for the chosen distribution. The Poisson and negative binomial distributions, for example, apply to whole-number (integer) data, so we must model the actual counts. If we have, for example, a set of parasite counts based on varying numbers of replicates, e.g. Kato-Katz slides, we should not input the average values to the GLM, because these are not necessarily whole numbers. Rather, the log number of replicates should be included as an offset in the model. This is an explanatory variable with its regression coefficient constrained to equal 1. In the example of multiple slides per person, if there is a single explanatory variable x with coefficient β, and n is the number of slides for that person, then we model the log mean total count per person as $\log_e(\mu) = \alpha + \beta x + \log_e(n)$, so $\log_e(\mu/n) = \alpha + \beta x$. We can then obtain exp(β) as the ratio of the average counts per slide associated with a one-unit increase in x, even though we specified the total count as the response variable. Similarly, when analysing the incidence of events when the follow-up time varies between people, using log-time as an offset enables the analysis to yield rate ratios. Figure 3 shows mean hookworm egg count fitted as a function of age. As with other regression models, different forms of dependence can be included (e.g. polynomial), as can multiple explanatory variables.

Figure 3.  Negative binomial regression of faecal egg count on age, the sloped solid line showing the fitted values. The × symbols are the means by age group, with vertical solid lines showing the 95% confidence intervals. Although age groups are shown, the fitted model uses the exact age for each person. The curved dashed lines are the pointwise 95% confidence intervals based on the whole dataset.
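The role of the offset can be sketched with a small hand-written fit. To keep the example within the Python standard library, it uses a Poisson likelihood rather than the negative binomial recommended in the text, so it illustrates the offset mechanism only and is not a substitute for a statistical package. The data, the variable names and `fit_poisson_offset` are all hypothetical.

```python
import math


def fit_poisson_offset(y, x, n, iterations=25):
    """Newton-Raphson fit of log(mu) = alpha + beta * x + log(n).

    y: total count per person; x: explanatory variable; n: number of slides.
    log(n) is the offset, i.e. a term whose coefficient is fixed at 1.
    """
    alpha = math.log(sum(y) / sum(n))  # start from the overall count per slide
    beta = 0.0
    for _ in range(iterations):
        mu = [ni * math.exp(alpha + beta * xi) for xi, ni in zip(x, n)]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # Information matrix (2 x 2), inverted explicitly.
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta


# Hypothetical data: total egg counts, a binary exposure, and 1 or 2 slides each.
total_counts = [12, 30, 7, 45, 60, 20]
exposed = [0, 0, 0, 1, 1, 1]
slides = [1, 2, 1, 1, 2, 1]

alpha, beta = fit_poisson_offset(total_counts, exposed, slides)
print(math.exp(beta))  # ratio of mean counts per slide, exposed vs. unexposed
```

With a single binary explanatory variable, exp(β) from this fit coincides with the ratio of total count to total slides in the two groups, which shows that the offset has put the comparison on a per-slide basis even though the totals were modelled.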

O’Hara and Kotze (2010) compared GLMs with transformation-based methods and found that the latter perform poorly – in terms of bias and sampling error – unless dispersion is small and the mean counts are large. Goodness of fit measures for GLMs, and negative binomial regression in particular, include checking that the residual deviance is similar to the residual degrees of freedom (Hilbe 2007), and plotting the residuals against fitted values to check that they have random scatter with constant variance (McCullagh & Nelder 1983). Chi-squared tests can also be carried out to compare the numbers in various categories of the outcome variable with those expected under the given distribution. However, it should be borne in mind that, with a large sample size, a departure from the assumed distribution may be large enough to be detected by the goodness-of-fit test, but too small to affect materially the results of the main analysis. I tend to rely on visual assessment of the overall distribution as shown in Figure 4.

Figure 4.  Histogram of numbers of hookworm eggs counted on two Kato-Katz slides per person in the baseline survey of a prospective study in Brazil. The axis labels show the original values although the scale has been eighth-root transformed. The long-dashed line is the fitted negative binomial distribution and the short-dashed line the fitted Poisson.

Other distributions for skewed data include versions of the Poisson and negative binomial in which the proportion of zeros is defined by additional parameters. These are called zero-inflated distributions. The zero-inflated negative binomial may be a better fit than the negative binomial in some cases (Walker et al. 2009). If zeros were not recorded in the dataset, then a zero-truncated distribution may be suitable, while if counting stopped at a maximum value (for example, no more than 500 Ascaris eggs were counted per slide by Cundill et al. 2011), then censored distributions are available (Hilbe 2007). There are yet other distributions which have been little used for parasite data (Grenfell et al. 1990; Shoukri et al. 2004; Hoshino 2005), or not used at all (Dobbie 2001; Massé & Theodorescu 2005; Shmueli et al. 2005).
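The relationship between the negative binomial and its zero-inflated and zero-truncated versions can be sketched as follows. The parameterisation by mean mu and dispersion k is the usual one in parasitology, but the function names, the extra-zero proportion pi_zero and the parameter values are assumptions made for illustration only.

```python
import math

def nb_pmf(y, mu, k):
    """Negative binomial probability of count y, with mean mu and dispersion k."""
    log_p = (math.lgamma(y + k) - math.lgamma(k) - math.lgamma(y + 1)
             + k * math.log(k / (k + mu)) + y * math.log(mu / (k + mu)))
    return math.exp(log_p)

def zinb_pmf(y, mu, k, pi_zero):
    """Zero-inflated version: a proportion pi_zero of extra zeros is added."""
    base = (1.0 - pi_zero) * nb_pmf(y, mu, k)
    return pi_zero + base if y == 0 else base

def zero_truncated_nb_pmf(y, mu, k):
    """Zero-truncated version: zeros cannot be observed, so rescale the rest."""
    if y < 1:
        raise ValueError("zero-truncated counts start at 1")
    return nb_pmf(y, mu, k) / (1.0 - nb_pmf(0, mu, k))

# Illustrative (invented) parameter values.
mu, k, pi_zero = 5.0, 0.5, 0.2

p_zero_nb = nb_pmf(0, mu, k)
p_zero_zinb = zinb_pmf(0, mu, k, pi_zero)
total_zinb = sum(zinb_pmf(y, mu, k, pi_zero) for y in range(3000))
total_truncated = sum(zero_truncated_nb_pmf(y, mu, k) for y in range(1, 3000))
print(p_zero_nb, p_zero_zinb, total_zinb, total_truncated)
```

The zero-inflated form always puts more probability on zero than the plain negative binomial with the same mu and k, which is why it can fit better when zeros are more frequent than the negative binomial alone predicts.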

Other methods with potential for greater use.

There are several statistical methods which have not been described above, but which have scope for greater use with count data. The bootstrap was notable by its absence from the articles reviewed. This approach is based on re-sampling the data and observing the sampling variation in any outcome of choice, such as a difference or ratio in medians or means (Efron & Tibshirani 1993). For simple comparisons, this would seem to have some advantages over non-parametric methods. However, it is not recommended for small sample sizes because the discreteness of the sampling distribution may make the inference unreliable.
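A minimal sketch of the idea is given below, using the percentile method for the ratio of two arithmetic means. The data are invented and deliberately small so that they fit on the page (too small for the bootstrap to be advisable in practice), and the function name, the number of resamples and the fixed seed are all assumptions. Other bootstrap intervals, such as bias-corrected ones, are described by Efron & Tibshirani (1993) and may be preferable.

```python
import random

def bootstrap_ratio_of_means(group_a, group_b, n_resamples=2000, seed=1):
    """Percentile bootstrap 95% interval for mean(group_a) / mean(group_b)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    ratios = []
    while len(ratios) < n_resamples:
        # Resample each group with replacement, keeping its original size.
        a = rng.choices(group_a, k=len(group_a))
        b = rng.choices(group_b, k=len(group_b))
        if sum(b) == 0:
            # Ratio undefined when the denominator mean is zero.
            # Skipping such resamples is a simplification of this sketch.
            continue
        ratios.append((sum(a) / len(a)) / (sum(b) / len(b)))
    ratios.sort()
    lower = ratios[int(0.025 * n_resamples)]
    upper = ratios[int(0.975 * n_resamples) - 1]
    return lower, upper

# Invented counts for two groups.
group_a = [0, 0, 2, 5, 0, 13, 48, 1, 0, 7]
group_b = [0, 4, 9, 0, 22, 130, 3, 0, 16, 61]

observed_ratio = (sum(group_a) / len(group_a)) / (sum(group_b) / len(group_b))
lower, upper = bootstrap_ratio_of_means(group_a, group_b)
print(observed_ratio, lower, upper)
```

The same function could be adapted to a difference or ratio of medians by changing the statistic computed on each resample.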

The median or other percentile of the counts can be modelled as a function of explanatory variables using quantile regression (Yu et al. 2003). Transformation of the counts may be advisable but, because the usual functions maintain rank order (they are monotonic), this will not affect the interpretation of the results.

In many cases, the negative (zero) instances may warrant different treatment, whether for purely statistical reasons or due to clinical or scientific distinctions between them and the positive group. There is a well-developed literature on models which treat these two groups distinctly within a single overall analysis. These may be referred to as two-part, mixture or bivariate models (Lachenbruch 2002; Moulton et al. 2002; Hallstrom 2010). When the proportion of zeros is modelled as a function of covariates, then zero-inflated distributions are in this category. The results for the zero and non-zero parts of the model can be presented separately or, depending on the kind of model, it may be possible to synthesise them to report the overall mean (Burton et al. 2003).
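In the simplest case, with no covariates, the synthesis of the two parts into an overall mean is just the proportion positive multiplied by the mean among the positives. The sketch below shows this with invented counts. The function name is hypothetical, and real two-part models would estimate each part by regression.

```python
def two_part_summary(counts):
    """Return (proportion positive, mean among positives, overall mean)."""
    positives = [c for c in counts if c > 0]
    prevalence = len(positives) / len(counts)
    mean_positive = sum(positives) / len(positives)
    # Zeros contribute nothing to the total, so the overall mean is the product.
    return prevalence, mean_positive, prevalence * mean_positive

# Invented counts: six negatives and four positives.
counts = [0, 0, 0, 0, 0, 0, 4, 10, 36, 150]

prevalence, mean_positive, overall_mean = two_part_summary(counts)
arithmetic_mean = sum(counts) / len(counts)
print(prevalence, mean_positive, overall_mean, arithmetic_mean)
```

Reporting the two parts separately shows whether a difference between groups arises from fewer infected people, from lighter infections among those infected, or from both.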

Finally, when no single parameter, such as the median or mean, is found to be an adequate summary of the data (Montresor 2007), infection intensity classes (e.g., negative, mild, moderate, heavy) can be analysed as ordered categories (Agresti 1999).

Discussion

The statistical challenges of count data result not so much from extreme skewness per se as from its combination with zero values. If there are no zero values, then analysis of log-transformed data by common methods such as the t test may be applicable, yielding results in terms of geometric means. However, when zeros are present, the log transformation is not readily applicable, and the geometric mean is zero, complicating the choice of measure of location for count data.

The Williams mean is an attempt to retrieve the geometric mean by applying it to the counts plus one, then subtracting one again. This approach is commonly used in analysis of parasite counts, although more for expediency than for clinical or biological relevance. Moreover, because it is sometimes misleadingly called the geometric mean, it is often difficult to decide from the text of an article what was actually done.
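To make the distinction concrete, the sketch below computes the arithmetic, geometric and Williams means of the same invented set of counts. The function names are hypothetical. The counts were chosen so that the counts plus one are powers of two, which makes the Williams mean easy to verify by hand.

```python
import math

def arithmetic_mean(counts):
    return sum(counts) / len(counts)

def geometric_mean(counts):
    """Geometric mean; zero whenever any count is zero."""
    if any(c == 0 for c in counts):
        return 0.0
    return math.exp(sum(math.log(c) for c in counts) / len(counts))

def williams_mean(counts):
    """Geometric mean of (count + 1), minus 1."""
    return math.exp(sum(math.log(c + 1) for c in counts) / len(counts)) - 1.0

# Invented counts including zeros: counts + 1 are 1, 1, 4, 16 and 256.
counts = [0, 0, 3, 15, 255]

am = arithmetic_mean(counts)
gm = geometric_mean(counts)
wm = williams_mean(counts)
print(am, gm, wm)
```

For these data the three measures differ widely: the geometric mean is zero, the Williams mean is 2^2.8 − 1 (about 6) and the arithmetic mean is 54.6, so stating which one was used matters.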

There is some resistance to using the arithmetic mean as a measure of location, given that basic statistics teaching usually brands it unsuitable for skewed data. Nevertheless, in some circumstances, the mean is the most relevant outcome measure. In health economics, for example, the mean cost per patient is proportional to the total cost, and for that reason, it is more relevant than other measures such as the median (Barber & Thompson 2000). This should not, and need not, be trumped by concerns over difficulty of analysis. Similarly, in parasitology, the arithmetic mean per person may be of interest because, for example, the total community burden in terms of parasite numbers is proportional to the arithmetic mean. Methods are available which model the arithmetic mean while accommodating skewness in the data. The median can also be used, although it will be zero if the prevalence is <50%. Non-parametric methods are commonly said to assess differences between medians, although this is not generally the case. It is possible to do sample size calculations and statistical analysis in terms of arithmetic means via suitable distributions such as the negative binomial (Alexander et al. 2011).

Choosing the most relevant measure of location will affect how any further analysis is carried out. Given the powerful statistical software now readily available, it should be possible to choose a method not solely on statistical considerations – finding one whose assumptions are justified – but based on how closely it responds to the original research question (Hand 1994).

The current article emphasises GLMs, in particular with the negative binomial distribution. Fuller explications of GLMs are available elsewhere (Gardner et al. 1995; Wilson & Grenfell 1997; Coxe et al. 2009; McElduff et al. 2010). The negative binomial often provides a good fit to parasite data (Anderson & May 1991), although other distributions may be better for particular datasets (Walker et al. 2009). One review of analysis methods for entomological data recommended negative binomial and quasi-likelihood approaches (Sileshi 2006), whereas another of aquatic organisms favoured the latter (Noe et al. 2010). One concern with the negative binomial is that, empirically, the dispersion parameter (k) is often found to increase with the mean (Alexander et al. 2011), yet basic models assume it to be constant. On the other hand, a simulation study has found likelihood-based analysis to be robust to this (Aban et al. 2008). There are also alternative parameterisations of the negative binomial with different variance–mean dependence (Hilbe 2007). GLM techniques for taking account of clustering can also be used in cohort studies with multiple disease episodes or repeated measures per person, although the issue of time dependence is likely to arise and this is beyond the scope of the current article.
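The variance–mean dependence referred to above can be written out explicitly. In the usual parameterisation, with mean μ and dispersion parameter k, the variance is a quadratic function of the mean, and the Poisson is recovered as k becomes large. A linear alternative is also shown. The labels NB2 and NB1 and the symbol α are conventions from the wider literature, not notation used in this article, and should be read as assumptions here.

```latex
% Standard negative binomial (often called NB2): variance quadratic in the mean
\operatorname{Var}(Y) = \mu + \frac{\mu^{2}}{k},
\qquad \operatorname{Var}(Y) \to \mu \ \text{ as } k \to \infty .

% Linear alternative (often called NB1): variance proportional to the mean
\operatorname{Var}(Y) = \mu\,(1 + \alpha), \qquad \alpha > 0 .
```

If k in the first form is itself proportional to μ, then μ²/k is proportional to μ and the variance has the linear form, which is one way of seeing why a dispersion parameter that increases with the mean points towards an alternative parameterisation.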

When describing any numerical information, there is a trade-off to be made between the conciseness of any summary measure and the information it conveys. For parasite data, the case has been made for summarising intensity in the form of categories rather than any single value (Montresor 2007). This is indeed more informative, but less concise. For some purposes, such as clinical trials, a single measure per arm – such as a mean – is likely to be preferred. If categories are used in clinical trials, they will probably be further summarised into a single measure, such as the proportion with heavy infection.

Acknowledgements

This work was supported financially by United Kingdom Medical Research Council grant number G7508177 to the Tropical Epidemiology Group and by the Human Hookworm Vaccine Initiative (HHVI) of the Sabin Vaccine Institute, which receives funding from the Bill and Melinda Gates Foundation. The studies which generated the hookworm and malaria datasets presented here were supported respectively by (i) the HHVI and (ii) the Wellcome Trust (Project 061910), the Gates Malaria Partnership and the United Kingdom Medical Research Council. I am grateful to Christian Bottomley, Paul Milligan and an anonymous referee for useful comments on the manuscript.

References

  • Aban IB, Cutter GR & Mavinga N (2008) Inferences and power analysis concerning two negative binomial distributions with an application to MRI lesion counts data. Computational Statistics & Data Analysis 53, 820–833.
  • Agresti A (1999) Modelling ordered categorical data: recent advances and future challenges. Statistics in Medicine 18, 2191–2208.
  • Alexander ND, Solomon AW, Holland MJ et al. (2005) An index of community ocular Chlamydia trachomatis load for control of trachoma. Transactions of the Royal Society of Tropical Medicine and Hygiene 99, 175–177.
  • Alexander N, Bethony J, Corrêa-Oliveira R, Rodrigues LC, Hotez P & Brooker S (2007) Repeatability of paired counts. Statistics in Medicine 26, 3566–3577.
  • Alexander N, Cundill B, Sabatelli L et al. (2011) Selection and quantification of infection endpoints for trials of vaccines against intestinal helminths. Vaccine 29, 3686–3694.
  • Anderson RM & May RM (1991) Infectious Diseases of Humans: Dynamics and Control, 1st edn. Oxford University Press, Oxford.
  • Anscombe FJ (1948) The transformation of Poisson, binomial and negative-binomial data. Biometrika 35, 246–254.
  • Barber JA & Thompson SG (2000) Analysis of cost data in randomized trials: an application of the non-parametric bootstrap. Statistics in Medicine 19, 3219–3236.
  • Barker L & Cadwell BL (2008) An analysis of eight 95 per cent confidence intervals for a ratio of Poisson parameters when events are rare. Statistics in Medicine 27, 4030–4037.
  • Bauer DF (1972) Constructing confidence sets using rank statistics. Journal of the American Statistical Association 67, 687–690.
  • Beall G (1942) The transformation of data from entomological field experiments so that the analysis of variance becomes applicable. Biometrika 32, 243–262.
  • Bland M (1995) An Introduction to Medical Statistics, 2nd edn. Oxford University Press, Oxford.
  • Bockarie M, Kazura J, Alexander N et al. (1996) Transmission dynamics of Wuchereria bancrofti in East Sepik Province, Papua New Guinea. American Journal of Tropical Medicine and Hygiene 54, 577–581.
  • Boneau CA (1960) The effects of violations of assumptions underlying the t test. Psychological Bulletin 57, 49–64.
  • Bonferroni CE (1950) Sulle medie multiple di potenze [On multiple algebraic means]. Bollettino dell’Unione Matematica Italiana, serie 3 5, 267–270.
  • Borowski EJ & Borwein JM (1989) Dictionary of Mathematics, 1st edn. HarperCollins Publishers, London.
  • Burton MJ, Holland MJ, Faal N et al. (2003) Which members of a community need antibiotics to control trachoma? Conjunctival Chlamydia trachomatis infection load in Gambian villages. Investigative Ophthalmology and Visual Science 44, 4215–4222.
  • Bushman BJ & Wang MC (1995) A procedure for combining sample correlation coefficients and vote counts to obtain an estimate and a confidence interval for the population correlation coefficient. Psychological Bulletin 117, 530–546.
  • Conover WJ (1980) Practical Nonparametric Statistics, 2nd edn. John Wiley & Sons, New York.
  • Coxe S, West SG & Aiken LS (2009) The analysis of count data: a gentle introduction to Poisson regression and its alternatives. Journal of Personality Assessment 91, 121–136.
  • Cundill B, Alexander N, Bethony J, Diemert D, Pullan RL & Brooker S (2011) Rates and intensity of re-infection with human helminths after treatment and the influence of individual, household, and environmental factors in a Brazilian community. Parasitology 138, 1406–1416.
  • Dobbie MJ (2001) Modelling Correlated Zero-inflated Count Data. Australian National University, Canberra.
  • Dobson RJ, Sangster NC, Besier RB & Woodgate RG (2009) Geometric means provide a biased efficacy result when conducting a faecal egg count reduction test (FECRT). Veterinary Parasitology 161, 162–167.
  • Dunyo S, Ord R, Hallett R et al. (2006) Randomised trial of chloroquine/sulphadoxine-pyrimethamine in Gambian children with malaria: impact against multidrug-resistant P. falciparum. PLoS Clinical Trials 1, e14.
  • Efron B & Tibshirani R (1993) An Introduction to the Bootstrap, 1st edn. Chapman and Hall, New York.
  • Elliott JM (1977) Some Methods for the Statistical Analysis of Samples of Benthic Invertebrates, 2nd edn. Freshwater Biological Association, Ambleside.
  • Everitt B (1995) Cambridge Dictionary of Statistics in the Medical Sciences. Cambridge University Press, Cambridge.
  • Fulford AJC (1994) Dispersion and bias: can we trust geometric means? Parasitology Today 10, 446–448.
  • Gardner W, Mulvey EP & Shaw EC (1995) Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychological Bulletin 118, 392–404.
  • Grenfell BT, Das PK, Rajagopalan PK & Bundy DAP (1990) Frequency distribution of lymphatic filariasis microfilariae in human populations: population processes and statistical estimation. Parasitology 101, 417–427.
  • Hallstrom AP (2010) A modified Wilcoxon test for non-negative distributions with a clump of zeros. Statistics in Medicine 29, 391–400.
  • Hand DJ (1994) Deconstructing statistical questions (with discussion). Journal of the Royal Statistical Society. Series A, (Statistics in Society) 157, 317–356.
  • Hart A (2001) Mann–Whitney test is not just a test of medians: differences in spread can be important. BMJ 323, 391–393.
  • Heeren T & d’Agostino R (1987) Robustness of the two-independent samples t-test when applied to ordinal scale data. Statistics in Medicine 6, 79–90.
  • Hilbe JM (2007) Negative Binomial Regression, 1st edn. Cambridge University Press, Cambridge.
  • Hoshino N (2005) Engen’s extended negative binomial model revisited. Annals of the Institute of Statistical Mathematics 57, 369–387.
  • Kirkwood BR & Sterne JAC (2003) Essentials of Medical Statistics, 2nd edn. Blackwell Scientific Publications, Oxford.
  • Lachenbruch PA (2002) Analysis of data with excess zeros. Statistical Methods in Medical Research 11, 297–302.
  • Laubscher NF (1961) On stabilizing the binomial and negative binomial variance. Journal of the American Statistical Association 56, 143–150.
  • Marshall TF, Anderson J & Fuglsang H (1986) The incidence of eye lesions and visual impairment in onchocerciasis in relationship to the intensity of infection. Transactions of the Royal Society of Tropical Medicine and Hygiene 80, 426–434.
  • Massé JC & Theodorescu R (2005) Neyman type A distribution revisited. Statistica Neerlandica 59, 206–213.
  • McCullagh P & Nelder JA (1983) Generalized Linear Models, 1st edn. Chapman and Hall, London.
  • McElduff F, Cortina-Borja M, Chan S-K & Wade A (2010) When t-tests or Wilcoxon–Mann–Whitney tests won’t do. Advances in Physiology Education 34, 128–133.
  • Montresor A (2007) Arithmetic or geometric means of eggs per gram are not appropriate indicators to estimate the impact of control measures in helminth infections. Transactions of the Royal Society of Tropical Medicine and Hygiene 101, 773–776.
  • Moulton LH, Curriero FC & Barroso PF (2002) Mixture models for quantitative HIV RNA data. Statistical Methods in Medical Research 11, 317–325.
  • Mumpower JL & McClelland G (2002) Measurement error, skewness, and risk analysis: coping with the long tail of the distribution. Risk Analysis 22, 277–290.
  • Mwangi TW, Fegan G, Williams TN, Kinyanjui SM, Snow RW & Marsh K (2008) Evidence for over-dispersion in the distribution of clinical malaria episodes in children. PLoS ONE 3, e2196.
  • Noe DA, Bailer AJ & Noble RB (2010) Comparing methods for analyzing overdispersed count data in aquatic toxicology. Environmental Toxicology and Chemistry 29, 212–219.
  • O’Hara RB & Kotze DJ (2010) Do not log-transform count data. Methods in Ecology & Evolution 1, 118–122.
  • Olayinka OS & Abdullahi SA (2009) An overview of industrial employees’ exposure to noise in sundry processing and manufacturing industries in Ilorin metropolis, Nigeria. Industrial Health 47, 123–133.
  • Remme J, Ba O, Dadzie KY & Karam M (1986) A force-of-infection model for onchocerciasis and its applications in the epidemiological evaluation of the Onchocerciasis Control Programme in the Volta River basin area. Bulletin of the World Health Organization 64, 667–681.
  • Seto EY, Lee YJ, Liang S & Zhong B (2007) Individual and village-level study of water contact patterns and Schistosoma japonicum infection in mountainous rural China. Tropical Medicine & International Health 12, 1199–1209.
  • Shmueli G, Minka T, Kadane JB, Borle S & Boatwright PB (2005) A useful distribution for fitting discrete data: revival of the Conway–Maxwell–Poisson distribution. Journal of the Royal Statistical Society: Series C (Applied Statistics) 54, 127–142.
  • Shoukri MM, Asyali MH, VanDorp R & Kelton D (2004) The Poisson inverse Gaussian regression model in the analysis of clustered counts data. Journal of Data Science 2, 17–32.
  • Sileshi G (2006) Selecting the right statistical model for analysis of insect count data by using information theoretic measures. Bulletin of Entomological Research 96, 479–488.
  • Smith T, Armstrong Schellenberg J & Hayes R (1994) Attributable fraction estimates and case definitions for malaria in endemic areas. Statistics in Medicine 13, 2345–2358.
  • Stonehouse JM & Forrester GJ (1998) Robustness of the t and U tests under combined assumption violations. Journal of Applied Statistics 25, 63–74.
  • Walker M, Hall A, Anderson RM & Basanez MG (2009) Density-dependent effects on the weight of female Ascaris lumbricoides infections of humans and its impact on patterns of egg production. Parasites & Vectors 2, 11.
  • Warrell DM & Gilles HM (eds) (2002) Essential Malariology. Arnold, London.
  • White H (1980) A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 48, 817–838.
  • Williams CB (1937) The use of logarithms in the interpretation of certain entomological problems. The Annals of Applied Biology 24, 404–414.
  • Wilson K & Grenfell BT (1997) Generalized linear modelling for parasitologists. Parasitology Today 13, 33–38.
  • Wojtczak M & Viemeister NF (2008) Perception of suprathreshold amplitude modulation and intensity increments: Weber’s law revisited. Journal of the Acoustical Society of America 123, 2220–2236.
  • Yu K, Lu Z & Stander J (2003) Quantile regression: applications and current research areas. Statistician 52, 331–350.

Supporting Information

Data S1 Supporting information.

Filename                         Format  Size  Description
TMI_2987_sm_DataS1_reviewv2.doc  doc     63K   Supporting info item

Please note: Wiley Blackwell is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.