On plotting species abundance distributions

Authors


John S. Gray, Marine Biodiversity Research Group, Department of Biology, University of Oslo, PB 1066 Blindern, 0316 Oslo, Norway

Summary

  • 1. There has been a revival of interest in species abundance distribution (SAD) models, stimulated by the claim that the log-normal distribution gave an underestimate of the observed numbers of rare species in species-rich assemblages. This led to the development of the neutral Zero Sum Multinomial distribution (ZSM) to better fit the observed data.
  • 2. Yet plots of SADs, purportedly of the same data, showed differences in frequencies of species and of statistical fits to the ZSM and log-normal models due to the use of different binning methods.
  • 3. We plot six different binning methods for the Barro Colorado Island (BCI) tropical tree data. The appearances of the curves are very different for the different binning methods. Consequently, the fits to different models may vary depending on the binning system used.
  • 4. There is no agreed binning method for SAD plots. Our analysis suggests that a simple doubling of the number of individuals per species in each bin is perhaps the most practical one for illustrative purposes. Alternatively, rank-abundance plots should be used.
  • 5. For fitting and testing models exact methods have been developed and application of these does not require binning of data. Errors are introduced unnecessarily if data are binned before testing goodness-of-fit to models.

Introduction

Hubbell (2001) has revived interest in studies of species abundance distributions (SADs) that were popular in the 1960s following the pioneering work of Fisher, Corbet & Williams (1943), Preston (1948, 1962), MacArthur (1957), MacArthur & MacArthur (1961) and Williams (1964). Hubbell's study was based on detailed analyses of SAD patterns of data from tropical trees, mainly from Barro Colorado Island (BCI), Panama (Condit et al. 2002). Hubbell's basic thesis was that the SAD patterns did not follow a log-normal distribution because there was an excess of species with low abundances. He therefore developed the ZSM model to account for this overabundance of rare species. Yet analyses of purportedly the same data set by McGill (2003) and Volkov et al. (2003) showed that a log-normal distribution (Preston 1948) fitted the data as well as the ZSM model. Volkov et al. (2003) argued that a slightly lower value of the chi-square goodness-of-fit statistic favoured the ZSM, but both models fitted within the accepted statistical bounds. However, the plots shown by McGill (2003) and Volkov et al. (2003) differed from those shown by Hubbell (2001).

Preston's (1948) method of binning the data to obtain a log-normal was derived before the advent of computers and was aimed at turning discrete data into a continuous distribution. He erected doubling classes of abundances (log2), which he called octaves. While many have claimed to have used Preston's original (1948) plotting method, this is not in fact the case: most have used a modified version (described later in the Methods section), which was first suggested by Williams (1964). In addition to making plots of a great variety of data using this modified Preston method, Williams (1964) often used another binning method, that of × 3 classes, to fit a log-normal distribution. Others have used a hybrid method (Gray 1987; Hubbell 2001; Plotkin & Muller-Landau 2002; O’Hara & Oksanen 2003; Chave 2004; Hubbell & Borda-de-Agua 2004). In discussing the log-normal distribution Magurran (2004, p. 32) stated that ‘It is not however, necessary to use log2; any base is valid and log3 and log10 are common alternatives’. Magurran & Henderson (2003) used the log10 base for their analysis of SAD patterns in a fish assemblage from a British estuary. Williamson & Gaston (2005) also plotted data on a log10 scale, using probability plots to compare models. While it is of course reasonable to plot using any logarithmic base, the plots that one obtains will vary not only with the base of the logarithm but also with the type of binning system used (Hubbell & Borda-de-Agua 2004). As a consequence, different interpretations may be derived for the same data, especially where testing of fit to the model is done after binning. Here we plot the BCI data using a variety of different binning methods and show that the binning methods greatly influence the shapes of the plots produced.

Methods

We attempted to obtain the same plots used by Hubbell (2001), McGill (2003) and Volkov et al. (2003) by using the same BCI 50 ha data. We understand that the data set used was the 1995 set with 21 457 individuals and 225 species. (Another 1995 data set has 21 455 individuals and 227 species.) The data set is available in the supplementary information to McGill's (2003) paper.

Preston (1948) used as a binning system a doubling of the number of individuals per species, e.g. 1, 2, 4, 8, 16, etc. However, he suggested that it was more convenient to use such numbers as the boundaries rather than the mid-points of the bins, giving ranges of 0–1, 1–2, 2–4, 4–8, 8–16, etc. Species that occur with abundances exactly on a boundary were divided equally between the given bin and the next lower bin. Preston suggested that for natural assemblages the log-normal distribution was truncated at the one individual per species bin and that not all species were sampled. Preston called the truncation the ‘veil-line’ and showed that as sample size increased more of the full log-normal distribution was unveiled. In any sample of a natural assemblage studied there is a point of truncation, and species ‘found’ behind the veil-line are represented by fractions of a species. With the binning system he proposed, Preston never plotted the 0–1 individual bin: he argued that, because this bin covered the range 0–1 individuals per species, the actual number of species in it was unknown and so could not be plotted (Preston 1948). His first bin, bin 1, was half the number of species with one individual plus half the number of species with two individuals per species. Bin 2 is then half the number of species with two individuals, all species with three individuals and half the number of species with four individuals; bin 3 covers four to eight individuals per species, again with abundances that fall exactly on a boundary divided equally between the given bin and the next lower bin, and so on.
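The boundary-splitting rule described above can be sketched in a few lines of Python. This is only an illustration of our reading of Preston's (1948) rule, not code used for the analyses in this paper; the function name `preston_octaves`, the `keep_bin0` flag and the example abundances are all hypothetical.

```python
from collections import defaultdict


def preston_octaves(abundances, keep_bin0=False):
    """Count species per octave with boundaries at 1, 2, 4, 8, ...

    Bin k covers the range 2**(k-1) to 2**k individuals per species.
    A species whose abundance is exactly a power of two sits on a
    boundary and is split equally between the two adjoining bins.
    """
    counts = defaultdict(float)
    for n in abundances:
        if n < 1:
            raise ValueError("abundances must be positive integers")
        k = n.bit_length() - 1          # floor(log2(n)) for integer n
        if n == 1 << k:                 # exactly on the boundary 2**k
            counts[k] += 0.5            # half to the lower bin
            counts[k + 1] += 0.5        # half to the upper bin
        else:
            counts[k + 1] += 1.0        # strictly inside bin k + 1
    if not keep_bin0:
        # Bin 0 (0-1 individuals) holds only half of the singletons;
        # Preston did not plot it.
        counts.pop(0, None)
    return dict(sorted(counts.items()))


# Hypothetical abundances (individuals per species), for illustration only.
example = [1, 1, 2, 3, 4, 8]
octaves = preston_octaves(example)
```

Setting `keep_bin0=True` retains the halved count of singletons as a separate bin 0, which corresponds to the modified version of the method attributed to Williams (1964).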

Here we applied six different methods of binning to the BCI data. The binning methods used are:

  • Method 1: modified Preston method devised by Williams (1964) and used by McGill (2003) and Volkov et al. (2003). Bin 1 = half the number of species with 1 individual per species. From there the Preston (1948) method is followed, with species whose abundances fall exactly on a boundary divided equally between the given bin and the next lower bin.
  • Method 2: Log2 classes, a simple doubling of the upper limit (in individuals per species) of each bin. Bin 1 = 1 individual per species, bin 2 = 2, bin 3 = 3–4, etc.
  • Method 3: modified log2 classes: bin 1 = number of species with 1 individual per species, bin 2 = number of species with 2–3 individuals per species, bin 3 = 4–7, bin 4 = 8–15, etc., i.e. the interval is on a log2 scale … (Gray 1987; Hubbell 2001).
  • Method 4: Log3 scale, bin 1 = 1 individual per species, bin 2 = 2–3, bin 3 = 4–9, bin 4 = 10–27, etc.
  • Method 5: Log10 classes, bin 1 = 1 individual per species, bin 2 = 2–10, bin 3 = 11–100 … (Magurran & Henderson 2003).
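Methods 2 to 5 assign every species to exactly one bin, so they can be written as simple index rules. The sketch below is a minimal illustration under that reading of the bin definitions listed above; the function names and the example abundances are hypothetical, and method 1 is omitted because it splits species between bins.

```python
from collections import Counter


def upper_limit_bin(n, base):
    """Methods 2, 4 and 5: bin k holds abundances up to base**(k - 1).

    base=2 gives 1, 2, 3-4, 5-8, ...; base=3 gives 1, 2-3, 4-9, 10-27, ...;
    base=10 gives 1, 2-10, 11-100, ...
    """
    k, limit = 1, 1
    while limit < n:        # integer arithmetic avoids rounding in logs
        limit *= base
        k += 1
    return k


def lower_limit_bin(n):
    """Method 3: bin k holds abundances 2**(k - 1) to 2**k - 1."""
    return n.bit_length()


def sad_histogram(abundances, bin_of):
    """Number of species per bin for a given binning rule."""
    counts = Counter(bin_of(n) for n in abundances)
    return dict(sorted(counts.items()))


# Hypothetical abundances (individuals per species), for illustration only.
sample = [1, 1, 2, 3, 4, 5, 9, 10, 27, 100]
method2 = sad_histogram(sample, lambda n: upper_limit_bin(n, 2))
method3 = sad_histogram(sample, lower_limit_bin)
method4 = sad_histogram(sample, lambda n: upper_limit_bin(n, 3))
method5 = sad_histogram(sample, lambda n: upper_limit_bin(n, 10))
```

Even for these ten made-up species the four histograms have different numbers of bins and different modal bins, which is the effect the plots of the BCI data are meant to show.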

There are some modifications to these methods that have been used. Williamson & Gaston (2005) suggested that for method 3 class boundaries should be set at: 0·71, 1·41, 2·83, 5·66, 11·31, 22·63 … bin 1 = 0–1 individual per species, bin 2 = 2 individuals, bin 3 = 3–5, bin 4 = 6–11, bin 5 = 12–21 … This is equivalent to adding 0·5 to the bin boundaries and was used by Magurran (2004) in her worked examples for plotting log-series and log-normal distributions. Williams (1964) used a slight variant of method 4 (the × 3 classes method) where the bin boundaries were set at 1, 4, 13, 40, 121, etc., i.e. the intervals between bins (and not the boundaries as in method 4) are on a log3 scale.
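Both boundary sequences quoted above appear to follow simple closed forms, which makes them easy to regenerate. The expressions below are our reading of the quoted numbers, not formulas stated by Williamson & Gaston (2005) or Williams (1964), and the variable names are hypothetical.

```python
# Boundaries half an octave below each power of two: 2**(k - 0.5).
half_octave_boundaries = [round(2 ** (k - 0.5), 2) for k in range(6)]

# Boundaries whose successive differences are 3, 9, 27, 81, ...:
# the partial sums 1 + 3 + ... + 3**(k - 1) = (3**k - 1) / 2.
times_three_boundaries = [(3 ** k - 1) // 2 for k in range(1, 6)]

# Differences between successive x 3 boundaries, to check the log3 spacing.
times_three_steps = [b - a for a, b in
                     zip(times_three_boundaries, times_three_boundaries[1:])]
```

The first list reproduces 0·71, 1·41, 2·83, 5·66, 11·31, 22·63 and the second reproduces 1, 4, 13, 40, 121.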

Bulmer (1974) showed that the correct way to fit a log-normal to species-abundance data was to use a maximum likelihood method and to fit a Poisson log-normal, which is statistically more correct than Preston's (1948) method as it takes Poisson errors into account. Bulmer's method has been widely adopted (recent reviews by Etienne & Olff 2004; Williamson & Gaston 2005), and here we apply software written in ‘R’ to fit a Poisson log-normal to the data by maximum likelihood methods (O’Hara & Oksanen 2003). (The software is documented at http://cc.oulu.fi/~jarioksa/softhelp/vegan/html/fisherfit.html and is available as Fisherfit (vegan) v 1·6–10 from the Comprehensive R Archive Network, http://lib.stat.cmu.edu/R/CRAN/.) The problem with truncation at the left-hand edge of the curve has been taken into account by O’Hara & Oksanen (2003), who record, ‘[Preston's] practice makes data look more log-normal by reducing the usually high lowest octaves, but is too unfair to be followed. Therefore the octaves used in this function include the upper limit.’ We did not test goodness-of-fit as this was not the purpose of this paper. Some researchers choose to bin the data and then fit a log-normal. To illustrate the problems with such an approach we plot the log-normal fitted to the full data set and the log-normal fitted after binning.
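To make concrete what a maximum likelihood fit of the Poisson log-normal works with, the sketch below evaluates the zero-truncated Poisson log-normal log-likelihood directly from unbinned abundances. It is a minimal pure-Python illustration, not the R software used here: the function names, the crude trapezoidal integration, its grid settings and the toy abundances are all assumptions made for the example, and no optimizer is included.

```python
import math
from collections import Counter


def pln_pmf(n, mu, sigma, points=2000, width=8.0):
    """P(N = n) when N is Poisson with a log-normally distributed mean.

    Integrates over x = ln(lambda), with x ~ Normal(mu, sigma), using the
    trapezoidal rule on mu +/- width * sigma (a crude but simple choice).
    """
    lo = mu - width * sigma
    step = 2.0 * width * sigma / points
    log_const = (-math.lgamma(n + 1)
                 - math.log(sigma * math.sqrt(2.0 * math.pi)))
    total = 0.0
    for i in range(points + 1):
        x = lo + i * step
        z = (x - mu) / sigma
        # Poisson term exp(n*x - e**x) / n! times the normal density.
        term = math.exp(n * x - math.exp(x) - 0.5 * z * z + log_const)
        total += 0.5 * term if i in (0, points) else term
    return total * step


def truncated_pln_loglik(abundances, mu, sigma):
    """Log-likelihood of observed abundances, conditioning on n >= 1."""
    log_seen = math.log1p(-pln_pmf(0, mu, sigma))   # log P(N >= 1)
    counts = Counter(abundances)
    return sum(c * (math.log(pln_pmf(n, mu, sigma)) - log_seen)
               for n, c in counts.items())


# Toy abundances and two arbitrary parameter pairs, for illustration only.
toy = [1, 1, 1, 2, 2, 3, 5, 8, 20, 50]
ll_a = truncated_pln_loglik(toy, 1.0, 1.5)
ll_b = truncated_pln_loglik(toy, 4.0, 0.5)
```

A fit would search for the values of `mu` and `sigma` that maximize this quantity. The point relevant here is that the likelihood is computed from the individual abundances, so no binning step is involved.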

Results

Figure 1 shows the histograms for the BCI data. The plot in Fig. 1(a) is taken from Hubbell (2001, fig. 5·7, p. 135) for trees > 10 cm d.b.h. and Fig. 1(b) is taken from McGill's (2003) paper. (The plot of Volkov et al. (2003) is identical to McGill's and so is not shown.)

Figure 1.

Plots using different binning methods (see Methods for a detailed description) applied to the 1995 BCI data. (a) Hubbell's plot (2001). (b) Method 1: modified Preston method; bin 0 = half the number of species with one individual per species; bin 1 = half the number of species with one individual plus half the number of species with two individuals per species; bin 2 = half the number of species with two individuals, all the species with three individuals and half the number of species with four individuals, etc. (c) Method 2: log2 plot, bin 1 = 1; bin 2 = 2; bin 3 = 3–4; bin 4 = 5–8 individuals per species, etc. (d) Method 3: bin 1 = 1; bin 2 = 2–3; bin 3 = 4–7; bin 4 = 8–15 individuals per species, etc. (e) Method 4: log3 scale, bin 1 = 1; bin 2 = 2–3; bin 3 = 4–9; bin 4 = 10–27 individuals per species, etc. (f) Method 5: log10 bins, bin 1 = 1; bin 2 = 2–10; bin 3 = 11–100; bin 4 = 101–1000 individuals per species. Note that for consistency we use bin 1 to include numbers of species with 1 individual per species. In method 1 the number of species with one individual per species is halved so we call this bin 0. The curves show the maximum likelihood fits of the Poisson log-normal to the complete data (complete line) and to data after binning (broken line).

The discrepancy between the plots (Fig. 1a,b) is due to the use of different binning methods. Hubbell (2001) used method 3 and McGill (2003) and Volkov et al. (2003) method 1. [This discrepancy was also noted by Williamson & Gaston (2005), who wrongly thought that the difference was due to Hubbell having used the modified version of method 3 described above. Hubbell & Borda-de-Agua (2004) describe the binning method used in Hubbell (2001).] In method 1 the number of species in bin 0 is simply halved (Fig. 1b). Thus it is not surprising that McGill (2003) and Volkov et al. (2003) found that the log-normal distribution fitted the BCI SAD (Fig. 1b), whereas Hubbell suggested the log-normal was a poor fit to his data (Fig. 1a). Hubbell noted that the poor fit was due to an overabundance of rare species compared with that expected from the log-normal curve. We do not believe it is correct to simply halve the number of species in a bin and we recommend that all the data should be used to make plots. (Note that Preston's (1948) original method is simply method 1 with bin 0 in Fig. 1b eliminated, and so is not plotted here.)

The problem with the log2 binning method (method 2, Fig. 1c) is that counts are made of the number of species with one individual and with two individuals per species, and only at bin 3 does a doubling of the number of individuals per species in a bin occur. We believe this is not in keeping with what biologists regard as an appropriate logarithmic scale, a point noted by O’Hara & Oksanen (2003).

Application of the binning methods that use higher logarithmic bases (method 4, log3, Fig. 1e; method 5, log10, Fig. 1f) greatly reduces the number of bins and thus makes it difficult to apply goodness-of-fit tests, owing to the large reduction in degrees of freedom. Therefore, these methods are not recommended.

In the remaining plot (method 3, Fig. 1d) the bins are made by doubling the number of individuals per species binned (1, 2–3, 4–7, etc.), which to us seems the most logical way to transform the data to a geometric scale. However, Williams (1964) cautioned against using such an approach as the bin boundaries are at 0·5, 1·5, 3·5, 7·5, 15·5, etc. and are not equal on a logarithmic scale. The error from the unevenly spaced logarithmic bins occurs only in the first bins, is very small and does not affect the integer counts of numbers of species. In our view the more logical doubling of bins, compared with the other methods, outweighs the problem of plotting the bin boundaries exactly.
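The size of the unevenness noted by Williams (1964) can be checked by computing the width of each method 3 bin on a log2 scale from the boundaries 0·5, 1·5, 3·5, 7·5, etc. The short sketch below does only that arithmetic; the variable names are hypothetical.

```python
import math

# Continuous boundaries of the method 3 bins: 0.5, 1.5, 3.5, 7.5, ...
edges = [2 ** k - 0.5 for k in range(8)]

# Width of each bin in octaves; a perfectly even log2 scale would give 1.0.
widths = [math.log2(upper / lower) for lower, upper in zip(edges, edges[1:])]

for bin_number, width in enumerate(widths, start=1):
    print(f"bin {bin_number}: {width:.3f} octaves")
```

The widths start at about 1·58 octaves for bin 1, fall to about 1·22 and 1·10 for bins 2 and 3, and approach 1 thereafter, so the departure from an even logarithmic scale is confined to the first few bins.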

The two curves in Fig. 1 show the maximum likelihood fit of the Poisson log-normal (Bulmer 1974) to the data for the complete data set (complete line) and for a similar fit after binning (broken line). Not surprisingly there are differences between the two curves; binning gives a less precise estimate of the log-normal fit and thus fits to the curve should always be done on the full data set. Plots of binned data should simply be used for illustrative purposes.

Discussion

We do not claim that these results are particularly novel, but we do believe there is much confusion in recent papers that have used different binning methods to interpret SAD patterns. Our results raise two issues. First, what data are being used to test the ZSM against other models? Much effort has been devoted to testing the ZSM using the BCI data (Hubbell 2001; Plotkin & Muller-Landau 2002; McGill 2003; Volkov et al. 2003; Etienne & Olff 2004; McKane, Alonso & Sole 2004). Yet few of the many data sets available have been used for analyses; most studies are based on one set alone. Etienne & Olff (2004), however, used all the data sets in their analyses and this is clearly to be preferred to inferences made on subsets.

The differences between McGill's (2003) and Hubbell's (2001) plots are due to the use of different binning methods. However, in Fig. 1 plots (a) and (d) used identical binning methods and should be exactly the same. The differences are small and are probably due to the fact that Hubbell (2001) used the database as it was in 2001, and it has been continuously upgraded since, with small changes in the species identities and abundances recorded. Even though the BCI data come from a very large sample covering all tree species encountered within a 50 ha plot, Fig. 1 shows that a fitted log-normal model is still truncated.

The second issue is that, while it is of course not surprising that plots differ depending on which binning method is used (Fig. 1), we feel that there is a need for agreement on a standardized binning method. Method 3 seems to be the most appropriate. This method has been widely applied (Gray 1987; Hubbell 2001; Plotkin & Muller-Landau 2002; O’Hara & Oksanen 2003; Chave 2004; Hubbell & Borda-de-Agua 2004; Connolly et al. 2005). The advantage of such a plotting method is that it is very simple to generate, does not involve splitting species between bins and the resulting plots closely resemble the original Preston plot (Fig. 1b excluding bin 0 compared with Fig. 1d). This method also does not require elimination of species from the plots, as in method 1 (Williams 1964). Binning should, however, only be used to illustrate the shape of the curves and not for fitting and testing of alternative models.

An alternative procedure to binning that has been standard practice for many years is to use rank-abundance plots, which utilize all the data and do not require binning (see McGill 2003 and Williamson & Gaston 2005 for recent examples of the utilization of such methods). Such plots have clear advantages over binning and are recommended. However, an additional point that needs to be considered is that, as O’Hara & Oksanen (2003) caution, because the log-normal is truncated at both ends it will differ from other models fitted using rank-abundance plots.
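The data behind a rank-abundance plot need nothing more than a sort. The sketch below is a minimal illustration with hypothetical names and made-up abundances; it returns, for each species, its rank, its abundance and its relative abundance, with the most abundant species ranked first.

```python
def rank_abundance(abundances):
    """Return (rank, abundance, relative abundance) for every species.

    Species are ranked from most to least abundant. Ties keep distinct
    consecutive ranks, as is usual for this kind of plot.
    """
    ordered = sorted(abundances, reverse=True)
    total = sum(ordered)
    return [(rank, n, n / total)
            for rank, n in enumerate(ordered, start=1)]


# Made-up abundances (individuals per species), for illustration only.
points = rank_abundance([3, 1, 6])
```

Plotting abundance (usually on a logarithmic axis) against rank then shows every species individually, so no choice of bin boundaries or logarithmic base for bins has to be made.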

Methods for fitting and testing goodness-of-fit of the log-normal and the ZSM have received much attention recently. [Whether or not the log-normal is an appropriate SAD model (Williamson & Gaston 2005) is not discussed here.] Preston (1948) realized that binning prior to testing goodness-of-fit of the log-normal did not give an adequate test. He states ‘It is true that we might use analytical methods rather than graphical ones, as being more powerful tools, but statistical fluctuations are not thereby prevented from confusing the issue’ (p. 203). Today most researchers follow Bulmer (1974), who suggested that the log-normal model should be fitted using maximum likelihood methods taking account of Poisson errors, a suggestion that has been adopted generally. In a recent development, Etienne & Olff (2004) have derived methods for fitting the zero-truncated multivariate Poisson log-normal distribution (MPLN) and a new exact analytical expression for the ZSM. They suggest that the most appropriate goodness-of-fit tests should utilize a Bayesian approach, which they illustrate by testing and fitting the MPLN and ZSM to the BCI data (and they also test a variant of the Broken Stick DBS model).

Alternative fitting methods have been used by Engen & Lande (1996a), who developed population dynamic models that generated the log-normal species abundance distribution. Their log-normal model (and a gamma distribution model, Engen & Lande 1996b) was fitted to a wide variety of data using parametric bootstrapping methods (Diserud & Engen 2000). Likewise, in a recent study Connolly et al. (2005) compared fits of the log-series and log-normal models (and, surprisingly, not the ZSM) to data on corals and coral reef fish at different spatial scales. Goodness-of-fit was also tested using parametric bootstrapping, and which model gave the most parsimonious fit was assessed using Akaike's Information Criterion, AIC (Akaike 1981). Clearly these recent approaches that use all the data and afford reliable tests of fit to alternative models by use of the AIC should be adopted generally if progress is to be made towards understanding the key ecological aspects of SADs.
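Once each candidate model has been fitted by maximum likelihood to the unbinned data, the AIC comparison is simple arithmetic: AIC = 2k − 2 ln L, where k is the number of fitted parameters and ln L the maximized log-likelihood, and the model with the lowest AIC is preferred. The sketch below illustrates this; the log-likelihood values and parameter counts are entirely hypothetical and are not results from any of the studies cited.

```python
def aic(log_likelihood, n_parameters):
    """Akaike's Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_parameters - 2 * log_likelihood


# Hypothetical maximized log-likelihoods and parameter counts.
candidates = {
    "log-series": (-312.4, 1),
    "log-normal": (-308.9, 2),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)

# Differences from the best model are what is usually reported.
delta = {name: score - scores[best] for name, score in scores.items()}
```

Because the penalty term depends on the number of parameters, a model with more parameters is preferred only if it raises the log-likelihood by more than one unit per extra parameter.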

Acknowledgements

This paper has been greatly improved following constructive and insightful comments made by Professor Anne Magurran and an anonymous reviewer to whom we offer our thanks. We thank the University of Oslo for financial support to the Marine Biodiversity Research Program. This paper is a contribution to the European Union's Network of Excellence in Marine Biodiversity MarBEF.
