A simple method for estimating species abundance from occurrence maps


  • Deyi Yin,

    1. SYSU-Alberta Joint Lab for Biodiversity Conservation, Guangdong Key Laboratory for Biodiversity Dynamics and Conservation, State Key Laboratory of Biocontrol and School of Life Sciences, Sun Yat-sen University, Guangzhou, China
  • Fangliang He

    Corresponding author
    1. SYSU-Alberta Joint Lab for Biodiversity Conservation, Guangdong Key Laboratory for Biodiversity Dynamics and Conservation, State Key Laboratory of Biocontrol and School of Life Sciences, Sun Yat-sen University, Guangzhou, China
    2. Department of Renewable Resources, University of Alberta, Edmonton, AB, Canada


  1. The issue of how to estimate species abundance from presence/absence maps has attracted much attention. Several methods have been developed to address this problem. However, those methods either overlook the structure of spatial autocorrelation of species distribution, thus leading to underestimation, or they demand extra information besides presence/absence maps.

  2. This study first developed a new method that takes account of spatial autocorrelation and requires only occurrence maps, without any extra information. The method was then improved by incorporating a correction factor into it. We used an index defined by joint counts of occupied and unoccupied cells to measure spatial autocorrelation and to correct the underestimation of the random placement model.

  3. The performance of our method was compared against four other major methods (random placement model, negative binomial model, Conlisk et al.'s method and Solow & Smith's method) using both simulated and empirical data. The results showed that our method performs comparably with the other methods while requiring less, and more readily obtained, input data, a property important for real applications.

  4. We suggest that this simple, data-parsimonious method is a useful alternative to the currently available methods for estimating abundance from occurrence.


Introduction

Abundance is perhaps the most important ecological quantity for understanding the dynamics of populations and for decision-making in biological management and conservation, for example assessing extinction risk of endangered species (Mace & Lande 1991; Mace et al. 2008), monitoring invasive species (Veldtman, Chown & McGeoch 2010) and managing species populations, particularly threatened species (Tosh, Reyers & van Jaarsveld 2004; Figueiredo & Grelle 2009). In practice, however, data on species abundance are often not available or too expensive to collect. In such cases, distribution data have to be used as an approximate surrogate for abundance (Wilson et al. 2004; Cardillo et al. 2008; Mace et al. 2008). However, this surrogate approach can be improved. Widespread evidence has shown that species' abundance and distribution in space are positively correlated (Gaston & Blackburn 2000). This correlation offers a possibility to estimate species abundance from distribution data, thus providing more accurate data for management, monitoring and conservation programmes. But how to estimate species abundance from distribution maps has been a challenging problem (Kunin 1998; He & Gaston 2000a; Conlisk et al. 2009; Hui et al. 2009; Solow & Smith 2010; Hwang & He 2011; Azaele, Cornell & Kunin 2012). Currently, the methods that have been developed to address this challenge can be divided into two groups. One is the occupancy–area relationship that models species occupancy as a function of spatial scale (or mapping resolution, i.e., ‘area’) (Kunin 1998; Hui et al. 2009). Equipped with this relationship, one can derive occupancy for any given mapping scale. The other is the occupancy–abundance relationship that directly models abundance in terms of species occupancy (He & Gaston 2000a; Hui et al. 2009). Although both approaches aim to derive species abundance from distribution, the method in the first group does not genuinely estimate abundance but scales occupancy from one mapping scale to another. This study does not address this group of methods but focuses on the occupancy–abundance relationship.

Methods based on the occupancy–abundance relationship commonly involve spatial statistical models. Two major types of abundance estimation methods fall into this group, depending on how spatial data are used. The first type assumes that occurrence maps are the only data available. This type includes the methods developed from the random placement model and negative binomial distribution (He & Gaston 2000a; Hwang & He 2011). The second type requires extra information on species distribution in addition to occurrence maps. This includes methods developed by Conlisk et al. (2009) and Solow & Smith (2010).

Among methods of the first type, the random placement model is a null model that assumes random distribution of species. Although the model only requires presence/absence data of a single distribution map of a species, it substantially underestimates abundance when applied to empirical data because, in reality, species are rarely randomly distributed. The second method of this type was proposed by He & Gaston (2000a) based on the negative binomial distribution (NBD) to correct the underestimation. The NBD model requires occurrence data on two distribution maps at two different spatial scales; the coarse-scale map is generated by aggregating the finer-scale map (Kunin 1998; He & Gaston 2000a). In addition, the NBD model requires that the aggregation parameter k be constant across the two spatial scales. That this assumption is only approximate was recognized by He & Gaston (2000a) and Conlisk et al. (2009), and it was criticized as problematic by Conlisk, Conlisk & Harte (2007) because the NBD k is proportional to cell size by definition and empirical species are often more aggregated than the NBD predicts.

Among methods of the second type, Conlisk et al. (2009) took a regression approach to model species abundance in terms of the number of occupied cells, the NBD k (describing within-cell clustering) and Moran's I (describing between-cell clustering, denoted as C in Conlisk et al. (2009)). A single regression model is estimated for all species in a study area (i.e. each species contributes one data point to the regression model). Although this approach provides much flexibility to model the variation in spatial distribution of species at different cell sizes, the regression requires measures of both within- and between-cell clustering. Also of the second type, Solow & Smith (2010) took a very different approach by proposing a method that requires not only the number of occupied cells but also the number of cells containing exactly one individual. The improvement of this method over other methods results from the additional information on the number of cells occupied by a single individual. Both methods are data demanding.

In this study, we first developed a new method that is based on the benchmark random placement model but takes account of spatial autocorrelation in species distribution. This new method recognizes that spatial autocorrelation in binary maps is widespread and should be incorporated in abundance estimation to reduce the underestimation of existing methods (Conlisk et al. 2009; Hwang & He 2011; Azaele, Cornell & Kunin 2012). This method belongs to the first type of occupancy–abundance models, which only require occurrence data. However, results from empirical data showed that this method still underestimated abundance to a certain degree. We thus further improved the new method by incorporating a correction into it, which makes the new method a special case of the corrected method. Simulation and empirical tests both showed that the new method and its corrected form perform well and demand less input data than other methods. These methods offer a simple and quick tool for deriving abundance from distribution.

Materials and methods

The spatial autocorrelation in an occurrence map is not considered in the random placement and NBD models. The new method we developed is based on the random placement model but takes account of autocorrelation.

The random placement model assumes all the N individuals of a species are randomly distributed in a given area A. The method that estimates N in terms of occurrence data has the form (He & Gaston 2000a),

$$\hat{N}_0 = \frac{\ln\left(1 - A_a/A\right)}{\ln\left(1 - a/A\right)} \qquad \text{(eqn 1)}$$

where Aa is the occupied area, and a is the area of a grid cell.
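
As a minimal sketch of how (eqn 1) could be evaluated, the Python function below inverts the random placement occupancy relation, assuming it takes the form A_a/A = 1 − (1 − a/A)^N. The function and argument names are illustrative and are not part of the original study.

```python
import math


def random_placement_abundance(occupied_area: float, cell_area: float,
                               total_area: float) -> float:
    """Estimate abundance from occupancy under random placement.

    Assumed form: N0 = ln(1 - A_a / A) / ln(1 - a / A), where A_a is the
    occupied area, a the area of one grid cell and A the total area.
    """
    if not 0.0 < cell_area < total_area:
        raise ValueError("cell_area must lie strictly between 0 and total_area")
    if not 0.0 <= occupied_area < total_area:
        # A saturated map (no empty cells) gives ln(0) and cannot be inverted.
        raise ValueError("occupied_area must satisfy 0 <= occupied_area < total_area")
    occupancy = occupied_area / total_area       # A_a / A
    cell_fraction = cell_area / total_area       # a / A
    return math.log(1.0 - occupancy) / math.log(1.0 - cell_fraction)


# Hypothetical example: 1000 of 2500 cells of 20 x 20 m occupied in a 100-ha plot.
n0 = random_placement_abundance(occupied_area=1000 * 400.0,
                                cell_area=400.0,
                                total_area=1_000_000.0)
```

The guard against saturated maps reflects the fact that no abundance can be recovered when every cell is occupied.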

Model (1) is an unbiased estimator of abundance for randomly distributed species (He & Gaston 2000a). In reality, however, the majority of species are not randomly distributed but aggregated. To correct the underestimation associated with (eqn 1), we must take account of spatial autocorrelation in the occurrence of species. This autocorrelation can be quantified in a number of ways, for example using Moran's I (Conlisk et al. 2009; Hwang & He 2011). However, Moran's I is a Pearson type of correlation coefficient, which is not appropriate for modelling correlation in binary data. Here, we took a different approach to measure spatial autocorrelation by considering the number of joints between neighbouring cells (Fig. 1). (Strictly speaking, this index measures the inverse of spatial autocorrelation.) This is a simple but germane measure of spatial autocorrelation for binary maps (Cliff & Ord 1973). There are two ways to define joints. One is the joint between two neighbouring cells sharing a common boundary (called the first-order neighbourhood). The other is the joint between two cells sharing a common boundary or vertex (called the second-order neighbourhood). As shown in Fig. 1, occupied and empty cells are represented by black (B) and white (W) colours, respectively. For the first-order neighbourhood, a joint is referred to as BW if exactly one of the two adjacent cells is occupied (Fig. 1, left panel). Similarly, for the second-order neighbourhood, the joints, denoted as DBW joints, comprise those between horizontally and vertically adjacent cells plus those between diagonally adjacent cells (D represents the diagonal joint) (Fig. 1, right panel). The degree of BW association can be measured by the index $I_{BW}$ as

$$I_{BW} = \frac{O(BW)}{E(BW)} \qquad \text{(eqn 2)}$$

where O(BW) is the observed number of BW joints, and E(BW) is the expected number of BW joints for randomly distributed species.

Figure 1.

The binary map of a species where the occupied and empty cells are denoted as black (B) and white (W). The horizontal and vertical arrows represent BW joint in the first-order neighbourhood (the left panel), while the horizontal and vertical cells plus the diagonal cells represent the second-order neighbourhood (the right panel).

For the second-order neighbourhood, the joint counts index IDBW is similarly defined as:

$$I_{DBW} = \frac{O(DBW)}{E(DBW)} \qquad \text{(eqn 3)}$$

where O(DBW) and E(DBW) represent the observed and expected numbers of the DBW joints in the binary map, respectively.
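
The joint-count indices described above can be sketched in Python as follows, taking each index to be the ratio of observed to expected BW joints. The text does not state how the expected number of joints was computed; this sketch assumes the expectation under random permutation of the observed occupied and empty cells (non-free sampling), which is one standard choice for join-count statistics. The function name is hypothetical.

```python
import numpy as np


def join_count_index(occ, second_order=False):
    """Ratio of observed to expected black-white joints on a binary grid.

    occ: 2-D array-like, True/1 for occupied (B) cells, False/0 for empty (W).
    second_order=False uses joints across shared edges only;
    second_order=True also counts joints across shared vertices (diagonals).
    """
    occ = np.asarray(occ, dtype=bool)
    # Each tuple pairs every cell with one of its neighbours.
    pairs = [(occ[:, :-1], occ[:, 1:]),      # horizontal joints
             (occ[:-1, :], occ[1:, :])]      # vertical joints
    if second_order:
        pairs += [(occ[:-1, :-1], occ[1:, 1:]),   # diagonal joints
                  (occ[:-1, 1:], occ[1:, :-1])]   # anti-diagonal joints
    total_joints = sum(a.size for a, _ in pairs)
    observed = sum(int(np.count_nonzero(a != b)) for a, b in pairs)

    n_cells = occ.size
    n_black = int(occ.sum())
    n_white = n_cells - n_black
    # Assumption: a random joint is BW with probability
    # 2 * n_black * n_white / (n_cells * (n_cells - 1)).
    expected = 2.0 * total_joints * n_black * n_white / (n_cells * (n_cells - 1))
    if expected == 0.0:
        raise ValueError("index undefined for fully occupied or fully empty maps")
    return observed / expected
```

Under this assumption, a map whose occupied cells form one compact block returns a value below 1, and a checkerboard returns a value above 1, matching the interpretation of the indices given in the text.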

These two indices, $I_{BW}$ and $I_{DBW}$ (written I where the distinction is not needed), can be used to indicate the distribution pattern of a species. For a randomly distributed species, I is expected to be 1. I < 1 indicates aggregation, while regularly distributed species have I > 1. The farther I departs from 1, the more strongly a species is aggregated (or regularly distributed). This feature of autocorrelation provides a simple approach for estimating abundance for species of nonrandom distribution. This can be achieved by dividing the estimate of (eqn 1) by the autocorrelation index I:

$$\hat{N} = \frac{\hat{N}_0}{I} \qquad \text{(eqn 4)}$$

If the first-order neighbourhood index $I_{BW}$ is used, (eqn 4) is called the first-order neighbourhood method (denoted as $\hat{N}_{BW}$). If $I_{DBW}$ is used, it is called the second-order neighbourhood method ($\hat{N}_{DBW}$).

Analysis of a large number of real species (see below for the test species) revealed that the joint statistic $I_{BW}$ or $I_{DBW}$ is not sufficient to reflect the degree of spatial autocorrelation of empirical data. Moreover, spatial aggregation tends to increase with species abundance (He, Legendre & LaFrankie 1997). These properties suggest that (eqn 4) may still underestimate N. To improve the accuracy of the estimation, a smaller (or bigger) I than that of $I_{BW}$ or $I_{DBW}$ should be applied for aggregated (or regular) species. Such an I, denoted as I', can be empirically proposed as in the following (eqn 5) by penalizing I according to the abundance of a species. With this corrected I', $\hat{N}$ in (eqn 4) is estimated by replacing I by I':

$$I' = I^{N^{c}} \qquad \text{(eqn 5)}$$

where N is the abundance of a species, and c is a correction parameter. I is either $I_{BW}$ or $I_{DBW}$, defined by (eqn 2) or (eqn 3). Accordingly, we denote I' as $I'_{BW}$ or $I'_{DBW}$ (and their corrected abundance estimators as $\hat{N}'_{BW}$ or $\hat{N}'_{DBW}$). c in (eqn 5) is an adjustable constant. I is a special case of I' where c = 0. In this sense, I' is a generalization of I. For an appropriately chosen c, it can provide highly accurate estimation of N for a given species. Determining c for a specific species is not our interest here. Instead, our interest is to provide a general c value that would improve the performance of model (4). For that purpose, we recommend an empirical value of 0·05 based on a performance test for more than 1400 species from four forests studied in this paper. However, other c values can also be used (see the simulation below). Because N in (eqn 5) is unknown (to be estimated), N0 estimated from (eqn 1) is used instead.
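
A minimal Python sketch of the corrected estimator follows. It assumes that the correction takes the form I' = I^(N0^c), which is the form that reproduces the Xylopia macrantha interval quoted in the Discussion (N0 = 460·028, first-order index 0·537, c = 0·01 and 0·1). The function name is illustrative only.

```python
def corrected_abundance(n0: float, index: float, c: float = 0.05) -> float:
    """Abundance estimate N0 / I' with the assumed correction I' = I ** (N0 ** c).

    n0:    random placement estimate (eqn 1), used in place of the unknown N
    index: joint-count index (first- or second-order)
    c:     correction parameter; c = 0 gives the uncorrected estimate N0 / I
    """
    if n0 <= 0.0 or index <= 0.0:
        raise ValueError("n0 and index must be positive")
    corrected_index = index ** (n0 ** c)
    return n0 / corrected_index


# Values quoted in the Discussion for Xylopia macrantha in the BCI plot.
low = corrected_abundance(460.028, 0.537, c=0.01)
high = corrected_abundance(460.028, 0.537, c=0.1)
print(f"estimated abundance interval: ({low:.1f}, {high:.1f})")
```

Because the exponent N0^c exceeds 1 whenever N0 > 1 and c > 0, the correction lowers I for aggregated species (I < 1) and raises it for regular ones (I > 1), in line with the penalization described in the text.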

We tested the performance of these proposed new methods using simulated and empirical data. We first generated five data sets with the degree of clustering varying from random to moderately aggregated to highly aggregated using the R package ‘spatstat’ (Baddeley & Turner 2005). The spatial distributions of species were simulated using the Thomas point process function, ‘rThomas’, of the package. The function has three parameters, κ, μ and σ. The first two parameters define the density of clusters (cluster centres per unit area) and the expected number of trees in each cluster, respectively, while σ is the standard deviation of the dispersal kernel around a cluster centre and controls the spatial aggregation of species. Figure 2 shows examples of the simulated random, moderately and highly aggregated patterns. The simulated point patterns were gridded into occurrence maps. Equation 4 and the four existing models (the random placement model, the NBD model, Conlisk et al.'s regression model and Solow & Smith's method) were then used to estimate abundance from the occurrence maps. We set c = 0·02 when using (eqn 5) to estimate abundance for the simulated species. We report results for simulations of the distribution of 2000 trees in a 100-ha (1000 × 1000 m) area. The simulation was repeated 500 times. We also tested the methods by simulating species distributions of different abundances, but that did not change the relative performance of the methods and those results are thus not reported.
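
The simulation and gridding steps can be illustrated with the following Python sketch. It is not the ‘spatstat’ code used in the study but a simplified NumPy re-implementation of a Thomas process under stated assumptions: parents follow a homogeneous Poisson process on a window padded by 4σ to limit edge effects, each parent has a Poisson(μ) number of offspring, and offspring are displaced by an isotropic Gaussian with standard deviation σ. The function names and the padding rule are illustrative choices.

```python
import numpy as np


def simulate_thomas(kappa, sigma, mu, side=1000.0, rng=None):
    """Simulate offspring of a Thomas cluster process in a side x side window."""
    rng = np.random.default_rng(rng)
    pad = 4.0 * sigma                      # assumed buffer for parents outside the window
    extended = side + 2.0 * pad
    n_parents = rng.poisson(kappa * extended * extended)
    parents = rng.uniform(-pad, side + pad, size=(n_parents, 2))
    counts = rng.poisson(mu, size=n_parents)           # offspring per parent
    centres = np.repeat(parents, counts, axis=0)
    points = centres + rng.normal(0.0, sigma, size=centres.shape)
    inside = np.all((points >= 0.0) & (points < side), axis=1)
    return points[inside]


def occurrence_map(points, side=1000.0, cell=20.0):
    """Grid points into a boolean presence/absence map (side must be a multiple of cell)."""
    n_cells = int(round(side / cell))
    idx = np.floor(points / cell).astype(int)
    occ = np.zeros((n_cells, n_cells), dtype=bool)
    occ[idx[:, 0], idx[:, 1]] = True
    return occ


# Parameter values taken from the Fig. 2 caption (sigma = 30 m, moderately aggregated).
trees = simulate_thomas(kappa=0.0002, sigma=30.0, mu=10.0, rng=1)
occ = occurrence_map(trees, cell=20.0)
```

With κ = 0·0002 per m² and μ = 10, the expected number of trees in the 100-ha window is 2000, although the realized count varies between runs in this sketch.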

Figure 2.

Illustrative distributions of four simulated species representing random, moderately and highly aggregated patterns in a 1000 × 1000 m area. Each species has abundance = 2000. Function ‘rThomas’ was used to simulate the spatial distributions. The function and its parameters were set as: rThomas(κ = 0·0002, σ, μ = 10, win = owin(c(0, 1000), c(0, 1000))). For different simulations, σ (in metre) had different values: σ = 500 (A), 100 (B), 30 (C) and 15 (D). More strongly aggregated patterns have smaller σ values, and σ = 500 generates a random distribution.

We further tested these methods using four empirical data sets of different types of forest plots from tropical, subtropical and temperate zones. The first data set comprises the distributions of 302 tree/shrub species (243,541 stems) in the 50-ha stem-mapped plot (1995 census) on Barro Colorado Island (BCI), Panama (He & Condit 2007). The second comprises the distributions of 816 tree/shrub species (320,904 stems) in the 50-ha plot in Pasoh Nature Reserve, Malaysia (He, Legendre & LaFrankie 1997). The third is from a newly established 50-ha plot in Heishiding (HSD) Nature Reserve in Southern China, located in a subtropical area, with 236 tree/shrub species and 218,838 stems. The last plot is a 25-ha temperate forest with 52 tree/shrub species (38,730 stems) in Changbaishan (CBS) Nature Reserve in north-eastern China. The trees/shrubs in all these plots were free-standing individuals with diameter at breast height ≥1 cm.

The performance of the tested methods was compared using the relative root mean squared error (rRMSE):

display math(eqn 6)

where xi is the predicted abundance for species i, oi is the observed abundance of the species, and n is the total number of species whose abundances are estimated.

In addition to rRMSE, we also calculated R2 values to indicate the correlation of the predicted and observed abundance of the different methods.

Because occurrence is scale-dependent (He & Gaston 2000b; Gaston & He 2010), it is important to evaluate the effect of the cell size at which species are mapped on the performance of the methods. We tested each method at four cell sizes: 10 × 10, 20 × 20, 25 × 25 and 50 × 50 m. No method is available to estimate abundance for saturated maps (i.e. a species fully occupies the maps leaving no empty cells). The number of saturated maps increases with the increase in the mapping cell size. Following Conlisk et al. (2009), we excluded species with >95% occupancy in each forest plot from further analysis.
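
Two of the map-handling steps mentioned in the text, building a coarser occurrence map by merging cells and excluding species with more than 95% occupancy, can be sketched as below. The helper names and the dictionary layout (species name mapped to a boolean grid) are assumptions made for illustration.

```python
import numpy as np


def coarsen(occ, factor):
    """Merge factor x factor blocks of cells; a block is occupied if any cell is."""
    occ = np.asarray(occ, dtype=bool)
    rows, cols = occ.shape
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    blocks = occ.reshape(rows // factor, factor, cols // factor, factor)
    return blocks.any(axis=(1, 3))


def keep_unsaturated(maps, max_occupancy=0.95):
    """Drop species whose fraction of occupied cells exceeds max_occupancy."""
    return {name: occ for name, occ in maps.items()
            if occ.mean() <= max_occupancy}


# Hypothetical example: one sparse species and one that fills every cell.
sparse = np.zeros((4, 4), dtype=bool)
sparse[0, 0] = True
maps = {"sparse_species": sparse, "saturated_species": np.ones((4, 4), dtype=bool)}
retained = keep_unsaturated(maps)
coarse = coarsen(sparse, 2)
```

Because occupancy can only stay the same or increase when cells are merged, a species retained at a fine cell size may still be excluded at a coarser one, which is why the number of species analysed falls as cell size grows.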


Results

The results for the simulated species shown in Table 1 indicate that all the methods work very well in estimating abundance of randomly distributed species (the first row). This is consistent with the results of the previous studies for randomly distributed species (He & Gaston 2000a; Conlisk et al. 2009; Solow & Smith 2010) and is expected because the random placement model is a special case of the NBD model and, under random distribution, $I_{BW}$ and $I_{DBW}$ are expected to be 1. However, for aggregated distributions, the random placement model underestimates the abundance and the underestimation becomes worse with the increase in aggregation (Table 1). Although the NBD model improves on the random placement model, it still underestimates the highly aggregated species (e.g. σ = 15, Table 1). This is also the case for the second-order neighbourhood method ($\hat{N}_{DBW}$). In contrast, Conlisk et al.'s method overestimates the abundance. Solow & Smith's method and our new methods perform similarly well. They have overlapping confidence intervals that contain the true abundance.

Table 1. Means (and standard deviations) of the abundances estimated for the simulated random, moderately aggregated and highly aggregated distributions of 2000 trees (true abundance) in a 100-ha (1000 × 1000 m) plot. The algorithm and examples of simulated distributions are shown in Fig. 2. The abundance for each distribution was estimated from the grid cell size of 20 × 20 m that divided the 100-ha plot and was averaged from 500 replications of simulations for each pattern. The negative binomial model required two maps (fine and coarse maps) for estimating abundance. The coarse scale map was generated by merging 4 cells (2 × 2 cells) from the finer (20 × 20 m) base binary map. Conlisk et al.'s regression method used the regression coefficients they provided (their eqn 10). The first- and second-order neighbourhood methods ($\hat{N}_{BW}$ and $\hat{N}_{DBW}$) and the corrected first- and second-order neighbourhood methods ($\hat{N}'_{BW}$ and $\hat{N}'_{DBW}$) are given by (eqn 4). For estimating $\hat{N}'_{BW}$ and $\hat{N}'_{DBW}$, c = 0·02 in (eqn 5) was used.

| Sigma | Random placement model | Negative binomial model | Conlisk et al.'s method | Solow–Smith's method | $\hat{N}_{BW}$ | $\hat{N}_{DBW}$ | $\hat{N}'_{BW}$ | $\hat{N}'_{DBW}$ |
|---|---|---|---|---|---|---|---|---|
| 500 (random) | 2000·24 (31·31) | 2039·56 (42·40) | 2054·56 (65·90) | 2001·03 (20·92) | 2023·49 (36·18) | 2011·70 (34·48) | 2027·35 (37·76) | 2013·60 (35·73) |
| 100 | 1973·89 (33·24) | 2016·98 (44·24) | 2088·87 (71·29) | 2002·09 (22·73) | 2015·37 (40·00) | 2008·51 (40·28) | 2022·28 (42·71) | 2014·28 (43·16) |
| 30 | 1713·03 (37·06) | 2013·71 (62·83) | 2376·87 (92·45) | 2021·66 (33·65) | 2079·67 (55·10) | 2057·86 (49·65) | 2145·60 (64·18) | 2119·49 (57·32) |
| 20 | 1457·12 (38·40) | 1944·48 (75·14) | 2467·42 (89·63) | 2092·74 (49·81) | 2065·53 (60·77) | 1990·57 (54·12) | 2181·88 (71·50) | 2090·51 (62·57) |
| 15 | 1226·41 (37·15) | 1750·62 (78·70) | 2313·91 (82·91) | 2220·42 (83·16) | 1920·77 (60·53) | 1790·41 (59·22) | 2057·27 (71·14) | 1897·12 (59·22) |

The results for the empirical data of BCI, Pasoh, HSD and CBS (Table 2) show that no single method works consistently better across plots and scales, although, not surprisingly, Solow & Smith's method overall performs better than the other methods. The general result is that rRMSEs are very close among the different methods in most cases. The similar performance of the methods is also evident from Fig. 3, showing the predicted versus the observed abundances for the BCI plot. In most situations, the first- and second-order neighbourhood methods ($\hat{N}_{BW}$ and $\hat{N}_{DBW}$) perform comparably to Conlisk et al.'s and Solow & Smith's methods, but they all underestimate the common species, as shown in Fig. 3. The corrected first- and second-order neighbourhood methods ($\hat{N}'_{BW}$ and $\hat{N}'_{DBW}$) largely eliminate this bias, but a certain degree of bias remains for very common species. The fact that the new methods, Solow & Smith's method and Conlisk et al.'s method have similarly high R2 values except at the largest 50 × 50 m scale suggests the comparable performance of the methods. At the scale of 50 × 50 m, none of the methods performs satisfactorily, although Solow & Smith's performs better (Table 2). Solow & Smith's method performs very well for BCI and Heishiding species but not as well for the Pasoh species (Table 2). In contrast, the first- and second-order neighbourhood methods ($\hat{N}_{BW}$ and $\hat{N}_{DBW}$) appear to perform best for Pasoh species but less well for BCI, HSD and CBS species. The corrected first- and second-order neighbourhood methods ($\hat{N}'_{BW}$ and $\hat{N}'_{DBW}$) consistently do better than $\hat{N}_{BW}$ and $\hat{N}_{DBW}$ except for a few cases (Table 2). Conlisk et al.'s method seems not as sensitive as other methods to mapping resolution; the increase in its rRMSE with cell size is relatively slow (Table 2). All the methods become progressively worse with the increase in cell size.

Table 2. Comparison of the performance of different methods for estimating abundances for species in 50-ha plots of Barro Colorado Island (BCI), Pasoh, Heishiding (HSD) and a 25-ha plot of Changbaishan (CBS) reserves at four different cell sizes (10 × 10, 20 × 20, 25 × 25 and 50 × 50 m). The values in the table are the relative root mean squared error (rRMSE) as given by (eqn 6). R2 values for each method are shown in parentheses. For estimating $\hat{N}'_{BW}$ and $\hat{N}'_{DBW}$, c = 0·05 in (eqn 5) was used. The numbers in parentheses under cell size are the number of species whose abundances were estimated at that cell size.

| Plot | Cell size, m (# species) | Random placement model | Negative binomial model | Conlisk et al.'s method | Solow–Smith's method | $\hat{N}_{BW}$ | $\hat{N}_{DBW}$ | $\hat{N}'_{BW}$ | $\hat{N}'_{DBW}$ |
|---|---|---|---|---|---|---|---|---|---|
| BCI | 10 × 10 (300) | 0·1895 (0·86) | 0·1598 (0·94) | 0·1335 (0·97) | 0·1061 (0·98) | 0·1284 (0·95) | 0·1423 (0·94) | 0·1288 (0·97) | 0·1394 (0·97) |
| | 20 × 20 (294) | 0·2697 (0·74) | 0·2397 (0·81) | 0·1758 (0·95) | 0·1245 (0·96) | 0·1954 (0·89) | 0·2233 (0·88) | 0·1859 (0·93) | 0·2140 (0·92) |
| | 25 × 25 (290) | 0·2950 (0·70) | 0·2352 (0·87) | 0·2029 (0·94) | 0·1547 (0·96) | 0·2237 (0·86) | 0·2532 (0·86) | 0·2131 (0·91) | 0·2414 (0·91) |
| | 50 × 50 (238) | 0·3804 (0·29) | 0·3181 (0·48) | 0·2572 (0·68) | 0·2802 (0·87) | 0·2981 (0·49) | 0·3302 (0·48) | 0·2859 (0·56) | 0·3187 (0·54) |
| Pasoh | 10 × 10 (816) | 0·1461 (0·89) | 0·1160 (0·94) | 0·0974 (0·98) | 0·0598 (0·99) | 0·0807 (0·98) | 0·0913 (0·98) | 0·1080 (0·97) | 0·1070 (0·98) |
| | 20 × 20 (811) | 0·2357 (0·74) | 0·2017 (0·83) | 0·1484 (0·94) | 0·1351 (0·91) | 0·1411 (0·91) | 0·1571 (0·90) | 0·1384 (0·93) | 0·1505 (0·93) |
| | 25 × 25 (810) | 0·2657 (0·63) | 0·2005 (0·73) | 0·1695 (0·89) | 0·1769 (0·47) | 0·1649 (0·87) | 0·1815 (0·85) | 0·1568 (0·93) | 0·1070 (0·92) |
| | 50 × 50 (688) | 0·3700 (0·28) | 0·3015 (0·40) | 0·2627 (0·70) | 0·2719 (0·62) | 0·2839 (0·50) | 0·2986 (0·47) | 0·2689 (0·58) | 0·2838 (0·55) |
| HSD | 10 × 10 (233) | 0·2744 (0·67) | 0·2121 (0·84) | 0·1418 (0·93) | 0·1097 (0·98) | 0·1550 (0·89) | 0·1636 (0·87) | 0·1397 (0·94) | 0·1424 (0·93) |
| | 20 × 20 (233) | 0·4022 (0·40) | 0·3510 (0·55) | 0·2163 (0·82) | 0·1978 (0·93) | 0·2790 (0·65) | 0·3058 (0·63) | 0·2419 (0·76) | 0·2724 (0·72) |
| | 25 × 25 (231) | 0·4365 (0·27) | 0·3413 (0·49) | 0·2449 (0·72) | 0·2346 (0·92) | 0·3159 (0·52) | 0·3420 (0·50) | 0·2780 (0·64) | 0·3078 (0·60) |
| | 50 × 50 (188) | 0·5294 (0·04) | 0·4511 (0·18) | 0·3643 (0·41) | 0·3386 (0·62) | 0·4446 (0·25) | 0·4663 (0·22) | 0·4194 (0·33) | 0·4445 (0·30) |
| CBS | 10 × 10 (51) | 0·2866 (0·76) | 0·2360 (0·90) | 0·2032 (0·98) | 0·0856 (0·92) | 0·2085 (0·88) | 0·2160 (0·87) | 0·1914 (0·93) | 0·1986 (0·91) |
| | 20 × 20 (46) | 0·3690 (0·60) | 0·3508 (0·62) | 0·2496 (0·86) | 0·2210 (0·95) | 0·2819 (0·78) | 0·2982 (0·77) | 0·2661 (0·84) | 0·2844 (0·82) |
| | 25 × 25 (44) | 0·3937 (0·42) | 0·3333 (0·56) | 0·2521 (0·77) | 0·2021 (0·97) | 0·3048 (0·61) | 0·3257 (0·59) | 0·2829 (0·68) | 0·3081 (0·65) |
| | 50 × 50 (36) | 0·4170 (0·31) | 0·3365 (0·66) | 0·2731 (0·86) | 0·3893 (0·59) | 0·3245 (0·65) | 0·3609 (0·60) | 0·3087 (0·72) | 0·3491 (0·67) |

Figure 3.

Predicted versus observed abundances estimated by different methods for 294 species in BCI with cell size = 20 × 20 m. The diagonal 1–1 line is for the predicted abundance equal to the observation.


Discussion

Data on species distribution on the regional and global scales have been increasingly documented (Mitchell-Jones et al. 1999; Jetz, McPherson & Guralnick 2011). They have been widely used to determine species endangerment status (IUCN-SSC 2004; Cardillo et al. 2008; Mace et al. 2008; He 2012) and to assess the impact of climate and land-use change on species diversity (Thomas et al. 2004; Wilson et al. 2004; Jetz, Wilcove & Dobson 2007). These uses of species distribution data have substantially contributed to understanding the spatial and temporal distribution of species and strengthening our ability to predict the impact of environmental change and human activities on the well-being of species. However, studies and practice can be much improved if information on species abundance, in addition to distribution, is also known. The challenging question is how to obtain estimates of species abundance on the regional or global scale so as to reduce the uncertainty in modelling species distribution and in assessing species' responses to environmental change. Any data, even an approximation, on species abundance can enhance the power of ecological inferences and predictions. Although much effort has been devoted to estimating abundance from distribution, the available methods vary in their success in estimation and in their practical applicability (Kunin 1998; He & Gaston 2000a; Conlisk et al. 2009; Hui et al. 2009; Solow & Smith 2010; Hwang & He 2011; Azaele, Cornell & Kunin 2012). It is interesting to observe that the development of several of these methods was motivated by the effort to improve the performance of the power-law model (Kunin 1998) or the NBD model (He & Gaston 2000a), the first two studies on the subject.

In this study, we proposed a simple method for estimating abundance from distribution maps. This method is based on the random placement model but corrects for spatial autocorrelation in species distribution. Although the usefulness of this method for estimating abundance for distributions on the regional scale remains to be tested (as is the case for the other methods), it performs well at a relatively small scale of 100 ha at which ecological studies are usually conducted. Similar to other existing methods, the methods we developed ($\hat{N}_{BW}$, $\hat{N}_{DBW}$ and $\hat{N}'_{BW}$, $\hat{N}'_{DBW}$) substantially outperform the NBD method (Table 2). Furthermore, they have comparable performance with Conlisk et al.'s and Solow & Smith's methods in many cases, although overall Solow & Smith's performs best. However, in cases where (eqn 4) performs worse, the difference is relatively small (Table 2 and Fig. 3). The advantage of (eqn 4) and (eqn 5) is that they are data parsimonious and only require species distribution maps and, in the case of (eqn 5), a simple correction parameter. In contrast, other methods need not only presence/absence data but also extra information (e.g. the number of cells occupied by one individual for Solow & Smith's method and information on both within-cell and between-cell correlation for Conlisk et al.'s). The parsimonious nature of our new method is very useful in real applications. The requirement for extra input data makes other methods less practical. Based on our results, we would recommend the first-order neighbourhood estimator $\hat{N}_{BW}$ and its corrected form $\hat{N}'_{BW}$ for applications. Here, it is worthwhile mentioning that c in (eqn 5) is adjustable and offers an easy way to improve abundance estimation. c = 0·05 is a universally acceptable correction factor that works well for all the real species across the four vastly different plots and scales (Table 2). The reason that c = 0·02 was used in the simulation test is that the Thomas process simulates distributions that have clusters (parent trees) randomly distributed in space (Fig. 2). This uniformity calls for a smaller c than is needed for real species distributions, which are often more heterogeneous than those simulated by the Thomas process. It is important to note that if the interest is to estimate the abundance of a specific species, there always exists an optimal c value that can make (eqn 4) out-compete all other methods; neither a high nor a low c value would be optimal. According to our simulated and empirical results, c typically varies between 0·01 and 0·1. In real applications where c cannot be determined a priori, we suggest estimating the abundance interval by applying c = 0·01 and 0·1. There should be a high confidence that the interval would include the true abundance. For example, Xylopia macrantha (Annonaceae) is a species of intermediate abundance in the BCI plot and its true abundance is 1166. With N0 = 460·028 and $I_{BW}$ = 0·537, the estimated abundance interval of the corrected first-order neighbourhood method is $\hat{N}'_{BW}$ = (891·0, 1449·8) with c = 0·01 and 0·1.

We noticed that rRMSE is consistently larger for the HSD and CBS data than for the BCI and Pasoh data, regardless of which method is used, and no single model is consistently superior across scales (Table 2). This observation reflects the complexity of the spatial distribution of species that may arise from the underlying habitat heterogeneity of plots and biotic processes such as dispersal (Seidler & Plotkin 2006; Cáceres et al. 2012). These variations among plots suggest the importance of incorporating habitat heterogeneity into the existing methods for improving abundance estimation (Hui et al. 2009). For this purpose, it may be useful to consider integrating niche modelling (Peterson et al. 2011) with occupancy–abundance models.

He & Gaston (2007) suggested two major research directions in estimating abundance from occurrence. The first is how to incorporate spatial correlation in species distribution to improve abundance estimation. This study is a contribution to that direction. The second is how to develop methods for estimating abundances at the regional scale. All the methods so far derived perform reasonably well for relatively small areas, for example the 100-ha area used in this study, but may fail to provide a reasonable estimation for much larger areas of, for example, 100 km2. The development of methods for estimating regional abundance requires better understanding of species distribution on landscapes and scale-dependent functions describing such distribution. Another challenge is that abundance data on the landscape scale are rarely, if ever, available to verify any methods that may be developed for regional scales. Despite these difficulties, methods for estimating abundance from the regional distribution of species are urgently needed for conservation and management planning.

It is useful to note that current methods of abundance estimation are commonly developed by assuming that any absence in a distribution map is a true absence. However, in reality, most data are detection data where an absence in a grid cell may be a false absence due to certain sampling effects or incomplete survey (MacKenzie et al. 2002). The abundance for a detection/nondetection map would probably be underestimated using the methods developed for a true presence/absence map. There are two possible solutions to this challenge. One is to develop abundance estimation methods specifically for the detection/nondetection map (Hui et al. 2011). The other is to find a way to turn the detection/nondetection map into a true presence/absence map (MacKenzie et al. 2002) so that the existing methods based on true maps could be used. Detection probability is a key factor here. However, the detection probability is seldom known and usually varies from species to species and from grid to grid. In addition, repeated observations are usually not available. All these make the problem really difficult. Nevertheless, any solution to this challenge is expected to greatly expand the application of abundance estimation methods to the real world.


Acknowledgements

The work was supported by the Guangdong Key Laboratory for Biodiversity Dynamics and Conservation, Sun Yat-sen University. FH was supported by NSERC (Canada). No conflict of interest was involved in this study.