An effort-based index of beta diversity



  1. Beta diversity provides a link between species diversity in a region and species diversity within sites. Many metrics have been proposed for its estimation, each reflecting a different aspect of the phenomenon. Many of them are more informative of the variation in alpha diversity among samples and of the sampling effort (number of sampling units) than of the species assemblage differentiation. Here, we propose an index that is based on the sampling effort needed to fulfil a certain criterion and that accounts for the relationship between the initial slope of the species accumulation curve and mean alpha diversity.

  2. For defining the index, we consider a set of n samples with S(n) species among R(n) total occurrences. The shared species occurrences are I(n) = R(n) − S(n). The proposed new index (N*) is defined as the point, in terms of n, where I(n) crosses S(n). Samples taken beyond that point (N*) contribute, cumulatively, more to the shared occurrences than to new species. For the estimation of N*, we provide the R function ‘Nstar’, based on the specaccum function in the vegan package.

  3. We tested the properties of N* on simulated datasets with known community assembly patterns and on a dataset of plant diversity of Greek Natura 2000 protected areas.

  4. N* is not mathematically confounded with alpha or gamma diversity. It depends on the relationship between gamma and alpha (Whittaker's index) but, furthermore, it reflects the variation in species occupancies. Thus, if a number of random samples, sufficient for the reliable estimation of mean alpha diversity and the species accumulation curve, is collected, N* converges to a value that does not change by the inclusion of more samples from the same region. N* depends primarily on the proportion of new species expected in each next sample.

  5. N* declines with nestedness in community structure, while it increases with species turnover. This holds for plant diversity of Greek Natura 2000, where N* exhibited lower values in island compared to continental regions reflecting the increased nestedness of island communities.


Beta diversity provides a link between species diversity in a region (gamma diversity) and species diversity within sites (alpha diversity) (Whittaker 1960, 1972). However, there is no consensus as to what precisely constitutes beta diversity, and therefore, many metrics for its estimation have been proposed (Koleff, Gaston & Lennon 2003; Jost 2006; Tuomisto 2010a,b; Anderson et al. 2011), each quantifying a different aspect of it (Anderson et al. 2011).

General definitions of beta diversity produce a beta with a hidden alpha dependency (Jost 2007). However, beta should be independent of mean alpha (Baselga 2010a; Ricotta 2010; Veech & Crist 2010; Chase et al. 2011), although there might be empirical cases where a statistical dependence is observed (Jost 2010). If a beta diversity measure is by definition dependent on alpha, then it fails to discern between spatial turnover and nestedness (Baselga 2010b; Baselga & Orme 2012). However, there is no consensus about the appropriate way of estimating beta diversity without the outcomes being influenced by the interrelationship between alpha, beta and gamma (Jost 2006, 2010; Baselga 2010a; Ricotta 2010).

Many beta diversity indices are sensitive to variation in sampling effort expressed in number of sampling units (Lennon et al. 2001; Kallimanis et al. 2008). The classic Jaccard and Sørensen indices of compositional dissimilarity are notoriously sensitive to sampling effort, especially for assemblages with numerous rare species (Chao et al. 2005; Cardoso, Borges & Veech 2009; Chase et al. 2011). Therefore, estimates based on different datasets or regions are not directly comparable.

In this paper, we propose a new measure of beta diversity, named N*. We will consider the first observation of any species as ‘useful’ information and any subsequent observation of the same species as ‘redundant’ (for estimating richness, but informative of beta diversity). Obviously, all the species recorded from the first drawn sample contribute to the species richness, while the species of subsequent samples can either add to the species richness or the redundant set of already found species. Up to a specific number of sampling units (hereafter referred to as samples), the accumulated number of species is higher than the number of species occurrences in the redundant set (Fig. 1a). Taking one more sample, the cumulative number of species in the redundant set becomes higher than the accumulated species richness, and therefore, there is a point where these are equal. We define N* as the average sampling effort at which the total number of species occurrences in the redundant set becomes equal to the species richness. We explore and test the basic properties of N* on simulated data and examine its applicability on a dataset of plant diversity from Greek Natura 2000 protected areas.

Figure 1.

(a) Definition of N*: S(n) is the average species accumulation curve, R(n) is the total number of species occurrences and I(n) = R(n) − S(n) is the accumulated shared species occurrences. N* is the sampling effort (n) in sampling units where I(n) crosses S(n). Up to N*, the samples contribute cumulatively more to the increase of S than to the increase of I. The opposite happens beyond N*. (b) N* estimation method: the species accumulation curve and its upper and lower bounds (average ± standard deviation) were estimated using the specaccum function of the vegan package in R with the dune data included in vegan. The line y = (ā/2)n crosses the species accumulation curve as well as its upper and lower bounds, defining the upper and lower bounds of N*. (c) An example where N* is not estimable due to undersampling. In this case, the number of samples is less than the actual value of N*.

Definitions and assumptions

Species richness is a nonadditive variable when aggregating across samples (He & Legendre 2002). In the simple case of two samples A1 and A2 having S(A1) and S(A2) species each, the total species richness is S(A1 ∪ A2) = S(A1) + S(A2) − S(A1 ∩ A2) (see Table 1 for notations). Setting R = S(A1) + S(A2) and I = S(A1 ∩ A2), we get S = R − I (eqn 1), where R is the total number of species occurrences and I is the shared species occurrences. Adding more samples, the analytic expression for I becomes complicated, but eqn 1 holds, so for n samples, I(n) = R(n) − S(n). For large n, the total number of species occurrences is R(n) = ān (ā = average number of species per sample) and the species richness S(n) is the average species accumulation curve (SAC), a concave function of n (power and logarithmic functions are the most frequently used, e.g. Kagiampaki et al. 2011; Triantis, Guilhaumon & Whittaker 2012). Consequently, as R(n) is a linear function of n and I(n) = R(n) − S(n), the dependence of I(n) on n is a convex function (Fig. 1a). I(n) is a measure of the species distribution overlap among the n samples. It increases monotonically with n, and there is no upper bound for its value.
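The bookkeeping behind eqn 1 can be illustrated with a minimal Python sketch. It follows a single ordering of samples rather than the average over orderings used for the SAC, and the input format (a list of species sets) and the function name `accumulate` are assumptions made for illustration only.

```python
def accumulate(samples):
    """Return cumulative S, R and I for samples taken in the given order.

    `samples` is a list of sets of species labels (hypothetical input format).
    """
    seen = set()
    occurrences = 0
    S, R, I = [], [], []
    for sample in samples:
        seen |= sample               # first observations add to richness
        occurrences += len(sample)   # every presence counts as an occurrence
        S.append(len(seen))
        R.append(occurrences)
        I.append(occurrences - len(seen))  # eqn 1 rearranged: I = R - S
    return S, R, I


# Three hypothetical samples of three species each
samples = [{"a", "b", "c"}, {"b", "c", "d"}, {"a", "d", "e"}]
S, R, I = accumulate(samples)
print(S, R, I)  # [3, 4, 5] [3, 6, 9] [0, 2, 4]
```

In this toy case R grows linearly (three occurrences per sample), S grows ever more slowly, and I grows faster with each added sample, which is the concave/convex contrast described above.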

Table 1. Notations of the parameters and indices used in this article

| Notation | Parameters of species presence–absence matrix |
| --- | --- |
| N | Total number of samples |
| n | Number of samples (sampling effort) |
| γ | Total species richness, gamma diversity |
| α | Species richness per sample, alpha diversity |
| ā | Mean species richness per sample, mean alpha diversity |
| S(n) | Species richness of n samples |
| R(n) | Total number of species occurrences of n samples |
| I(n) | Total number of species occurrences shared in n samples |
| f | Proportion of the species occurrences matrix that is occupied, the inverse of Whittaker's beta |
| p_i | Proportion of new species expected in the ith sample |

| Notation | Indices of beta diversity |
| --- | --- |
| βw | Whittaker's original index of beta diversity, γ/ā |
| β−1 | Harrison's index, (γ/ā − 1)/(N − 1) |
| N* | The proposed index |
| NR* | Theoretically expected N* for random distribution of species |

The degree of overlap of species distributions is an important aspect of beta diversity, and I can serve as an index of it after a proper rescaling into a conventional range, say 0–1. Doing so, one derives Harrison's beta β−1 = (γ/ā − 1)/(n − 1) (Harrison, Ross & Lawton 1992), originally proposed as a modification of Whittaker's beta βw = γ/ā to correct for the strong dependence of the latter on n (Chiarucci et al. 2003). Some ways to derive β−1 from I are provided in the Appendix.
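One possible route from I to β−1, given here only as a sketch and not necessarily one of the derivations in the Appendix, is to scale I by its largest attainable value for the given ā and n. That maximum is reached when all n samples are identical, so that γ = ā:

```latex
\begin{align*}
I(n) &= \bar{a}\,n - \gamma, \qquad
I_{\max}(n) = \bar{a}\,n - \bar{a} = \bar{a}\,(n-1),\\[4pt]
1 - \frac{I(n)}{I_{\max}(n)}
  &= \frac{\bar{a}\,(n-1) - \bar{a}\,n + \gamma}{\bar{a}\,(n-1)}
   = \frac{\gamma - \bar{a}}{\bar{a}\,(n-1)}
   = \frac{\gamma/\bar{a} - 1}{n-1} = \beta_{-1}.
\end{align*}
```

Under this reading, β−1 is the complement of the relative overlap: it equals 0 when all samples are identical and 1 when no species is shared (γ = ān).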

The problem is that I is an unbounded convex function of n, and β−1 attempts to scale it on the unit linear segment (0–1). The symptom of this nonlinear scaling is that the scaled index β−1 is still dependent on n; as n increases, β−1 decreases. As gamma diversity is a finite number, if n increases then, in the limit (n→∞), β−1 tends to zero (see Vellend 2001 for a similar argument).

The N* index

Instead of rescaling I, we can derive an I-based index of species overlap by just taking the value of I at a characteristic, well-defined sampling effort (N*) at which a certain criterion is fulfilled. We define N* as the average sampling effort required so that I(n) = S(n), the intersection point of average S(n) and I(n), as shown in Fig. 1a. Up to N*, samples contribute cumulatively more to the increase of S than to the increase of I. The opposite happens beyond that point.

From I(n) = R(n) − S(n) = ān − S(n), it follows that for n = N* we have I(N*) = S(N*) ⇒ S(N*) = āN* − S(N*) ⇒ S(N*) = āN*/2. That is, N* is also defined as the intersection point of S(n) with the line y = (ā/2)n (Fig. 1b). We use this latter definition for the estimation of N* from data, as there are well-documented methods for the estimation of the SAC (Colwell et al. 2012), and there exist many implementations of those methods [EstimateS (Colwell 2012) and a number of R functions]. As an electronic supplement (Appendices S3, S4), we provide the R (R Development Core Team 2012) function ‘Nstar’ to estimate N* based on the specaccum function of the vegan package (Oksanen et al. 2012). An example of its use is given in Fig. 1b with the dune data included in vegan (Jongman, Ter Braak & Van Tongeren 1987). Specaccum estimates the average and the standard deviation of S(n). The standard deviation defines an upper and a lower bound of the SAC that intersect the line y = (ā/2)n, providing a lower and an upper bound for N* (Fig. 1b). Note that the standard deviation of the SAC does not directly translate to a standard deviation of N*. However, the bounds of N* could be used as a measure of uncertainty. The bounds of S(n) depend on the variation in alpha that eventually affects the upper and lower bounds of N*, but not its average value.
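The intersection definition can be sketched in a few lines of Python. This is not the ‘Nstar’ function of the supplement: it is an illustrative stand-in that assumes the SAC is taken as the analytical expectation of sample-based rarefaction (computed from species occupancy counts) and that the crossing point is located by linear interpolation between consecutive integer efforts. The function names and the example data are hypothetical.

```python
from math import comb


def expected_sac(occupancy, n_samples):
    """Expected richness S(n) for n = 1..N from per-species occupancy counts.

    Uses the sample-based rarefaction expectation
    S(n) = sum over species of [1 - C(N - k, n) / C(N, n)],
    where k is the number of samples in which the species occurs.
    """
    N = n_samples
    return [sum(1 - comb(N - k, n) / comb(N, n) for k in occupancy)
            for n in range(1, N + 1)]


def n_star(occupancy, n_samples):
    """Effort at which S(n) meets (a_bar / 2) * n; None if never reached."""
    sac = expected_sac(occupancy, n_samples)
    a_bar = sum(occupancy) / n_samples       # mean alpha = R(N) / N
    # d(n) = S(n) - (a_bar / 2) * n starts positive, since S(1) = a_bar
    diff = [s - 0.5 * a_bar * n for n, s in enumerate(sac, start=1)]
    for i in range(1, n_samples):
        if diff[i] <= 0:
            # linear interpolation between efforts i and i + 1 (an assumption)
            return i + diff[i - 1] / (diff[i - 1] - diff[i])
    return None


# Hypothetical occupancy counts for 12 species across 10 samples
occupancy = [10, 9, 7, 6, 4, 3, 2, 2, 1, 1, 1, 1]
print(n_star(occupancy, 10))
```

With identical samples the sketch returns 2, and with completely distinct samples it returns None, because the line y = (ā/2)n never catches up with S(n) within the available effort.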

There are cases where N* is not estimable from the available data as it is shown in Fig. 1c where we used only four of the 20 samples of the dune dataset to estimate N*. In this example, the line y = (ā/2)n has not yet exceeded S(n) and the intersection point has not been reached. However, there are subsets of four samples for which N* was estimated to be less than four, an underestimate of the true N* value. For a reliable estimation of N*, a ‘sufficient’ number of samples is required. Fig. 2 shows the way N* and ā depend on n. For any given number of samples n (3–19), we formed 500 random subsets and estimated average N* and ā from only those subsets where N* was estimable. Using three samples, N* was estimable in only 4·6% of the sets, and its average value was 2·74. The corresponding estimate of ā was very high (11·6) indicating that N* was estimable only when the species-rich samples were included. Using more than seven samples, all subsets provide an estimate and with ten or more, both ā and N* converge to a value. Furthermore, the use of subsets allows for the estimation of meaningful statistics of spread. We used this approach for the estimation of average, standard deviation and quartiles of N* in all our empirical and simulated datasets.
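The subset procedure described above can be sketched as follows. The code is a hedged illustration, not the authors' implementation: the estimator `n_star` is a simplified stand-in based on the analytical expected SAC with linear interpolation, the data are randomly generated placeholders rather than the dune dataset, and all names are hypothetical.

```python
import random
from math import comb


def n_star(samples):
    """N* for a list of non-empty species sets; None if the crossing is not reached."""
    N = len(samples)
    occupancy = {}
    for sample in samples:
        for species in sample:
            occupancy[species] = occupancy.get(species, 0) + 1
    a_bar = sum(occupancy.values()) / N
    prev = None
    for n in range(1, N + 1):
        s_n = sum(1 - comb(N - k, n) / comb(N, n) for k in occupancy.values())
        d = s_n - 0.5 * a_bar * n            # positive at n = 1, since S(1) = a_bar
        if prev is not None and d <= 0:
            return (n - 1) + prev / (prev - d)   # interpolated crossing point
        prev = d
    return None


def subset_summary(samples, n, draws=500, seed=0):
    """Mean N* over random subsets of size n, and the share of subsets where it is estimable."""
    rng = random.Random(seed)
    values = []
    for _ in range(draws):
        estimate = n_star(rng.sample(samples, n))
        if estimate is not None:
            values.append(estimate)
    share = len(values) / draws
    mean = sum(values) / len(values) if values else None
    return mean, share


# Placeholder data: 20 samples of 8 species drawn from a pool of 40
rng = random.Random(1)
data = [set(rng.sample(range(40), 8)) for _ in range(20)]
for n in (3, 6, 10, 15):
    print(n, subset_summary(data, n, draws=200))
```

Reporting the share of estimable subsets alongside the mean makes it visible when an average rests only on the few subsets that happened to reach the crossing point.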

Figure 2.

Relationship of N* and mean alpha with the number of sampling units used for their estimation. For any given number of sampling units n (3–19), we formed 500 sets by selecting n of the 20 available samples and estimated N* and mean alpha taking into account only those cases (among the 500) that provide an estimate for N*.

N*, defined as the intersection point between y = (ā/2)n and S(n), will always converge to a value as long as ā and S(n) stabilize. The convergence of ā is also required for the estimation of βw and other related indices. The number of samples required obviously depends on the variation in alpha.

The intersection point between S(n) and y = ān/2 exists if S(n) has an upper bound or even if S(n) is a linear function of n with a slope less than ā/2, as S(1) = ā > ā/2. However, there might be cases where the two curves do not intersect. An extreme such case is when all sites are completely distinct and there is an infinite number of species. Then, I(n) = 0, S(n) = R(n) = ān, βw = n, β−1 = 1, and N* is always higher than n. When dealing with an infinite species pool, both βw and N* tend to infinity. If, however, the species pool (γ) is finite, then, provided n is large enough to reach γ, the lines y = γ and y = (ā/2)n intersect at N* = 2γ/ā = 2βw. This is the highest possible value of N*. The lowest possible value of N* is 2 and is obtained when S(n) = ā, that is, when all samples contain exactly the same species.

Factors expected to affect N*

N* increases with the slope of S(n) relative to ā. For S(n), we can write the recursive formula S(n) = S(n − 1) + p_n α_n, where α_n is the number of species in the nth sample and p_n is the proportion of new species expected among those α_n. Although α_n is not constant, for large enough n and under a randomization procedure for estimating S(n), α_n = ā = S(1) and the equation S(n) = S(n − 1) + p_n ā has the solution S(n) = ā Σ_{i=1}^{n} p_i, given that p_1 = 1.

At n = N*, S(N*) = ā Σ_{i=1}^{N*} p_i = (ā/2)N*, so N* = 2 Σ_{i=1}^{N*} p_i = 2S(N*)/ā; that is, N* depends only on the p_i's and consequently on βw estimated at n = N*. This means that N* is not mathematically confounded with alpha diversity but depends on the proportion of new species expected among the species of the next collected sample. In summary, N* is the sampling effort at which S(N*) = I(N*) = (ā/2)N*, and its value equals two times Whittaker's index calculated using N* samples.

N* accounts for the overlap of species distributions, so it is expected to depend on the fill (f) of the species presence matrix. The fill is the proportion of the matrix that is occupied and equals the inverse of the βw index (f = āN/SN = ā/S). Assuming a random pattern of species distribution among samples, the value of N* (name it NR*) depends only on f according to NR* = (2/f)[1 − (1 − f)^NR*]. This recursive formula is derived analytically (Appendix II), and a good initial value for starting the recursion is NR* = 0·5 + 1·6/f. The ratio N*/NR* standardizes for the effect of fill, so its values depend on factors affecting N* other than the fill. N*/NR* actually reflects the degree to which the species assemblage deviates from randomness.
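A recursion of this kind can be sketched under one explicit assumption, which may differ from the derivation in Appendix II: every cell of the presence matrix is occupied independently with probability f. The expected SAC is then S(n) = γ[1 − (1 − f)^n] with ā = fγ, and the condition S(n) = (ā/2)n reduces to the fixed point n = (2/f)[1 − (1 − f)^n]. The function name `n_r_star` is hypothetical; the starting value is the one quoted in the text.

```python
def n_r_star(f, tol=1e-10, max_iter=1000):
    """Fixed-point iteration for N under independent occupancy with probability f.

    Solves N = (2 / f) * (1 - (1 - f) ** N), starting from 0.5 + 1.6 / f.
    """
    if not 0 < f <= 1:
        raise ValueError("fill must lie in (0, 1]")
    x = 0.5 + 1.6 / f                  # initial value suggested in the text
    for _ in range(max_iter):
        nxt = 2.0 * (1.0 - (1.0 - f) ** x) / f
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    return x                           # last iterate if tolerance was not met


for fill in (0.01, 0.05, 0.2, 1.0):
    print(fill, n_r_star(fill), 0.5 + 1.6 / fill)
```

At f = 1 the iteration returns 2, the lower bound of N* for identical samples, and for small f the result stays close to the suggested starting value.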

For a fixed f, the degree of species overlap depends on the way the species are distributed and eventually on the variation in species occupancy. If a species is recorded in k of n samples, then it contributes 1 to S, k to R and k − 1 to I. Widely distributed species, contributing mostly to I, tend to decrease N* while rare species, contributing relatively more to S, tend to increase it. For a fixed f and a given species occupancy distribution (SOD), N* is expected to depend on the way rare and common species are assembled in the community (nested pattern vs. turnover).

Materials and methods

To explore the effect of community structure on N*, we used ‘ideal’ patterns of community organization including (a) nested subsets, (b) ‘random’ patterns, (c) Clementsian gradients and (d) Gleasonian gradients, following Leibold & Mikkelson (2002). We produced the patterns keeping the total number of species (S), the total number of samples (N) and the fill (f), and hence ā, constant.

In the nested pattern, sites with lower species richness constitute subsets of species-richer samples (Patterson & Atmar 1986). In our simulation, the first species occupies the whole range of samples (N). Any subsequent species with rank (r) from 2 to S occupies the samples ranging from 1 to n_r, where n_r = 1 + (N − 1)e^(dr) with d < 0, rounded to the nearest integer. That is, species ranges decline exponentially with species rank, but even the rarest species has at least one occurrence.
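The nested construction can be sketched as below, reading the range formula as n_r = 1 + (N − 1)e^(dr) with a negative decay parameter d. The function name, the row-per-species layout and the parameter values are illustrative assumptions; Python's `round` resolves exact halves to the even integer, which may differ from the rounding used in the original simulations.

```python
from math import exp


def nested_matrix(n_species, n_samples, d):
    """Presence matrix (rows = species, columns = samples) with nested ranges.

    The first species occupies all samples; a species of rank r >= 2 occupies
    samples 1..n_r with n_r = round(1 + (n_samples - 1) * exp(d * r)), d < 0.
    """
    if d >= 0:
        raise ValueError("d must be negative")
    rows = []
    for r in range(1, n_species + 1):
        if r == 1:
            n_r = n_samples                  # the first species is everywhere
        else:
            n_r = round(1 + (n_samples - 1) * exp(d * r))
        rows.append([1] * n_r + [0] * (n_samples - n_r))
    return rows


matrix = nested_matrix(10, 8, -0.3)
for row in matrix:
    print("".join(str(cell) for cell in row))
```

Because every range starts at the first sample, each species' set of samples is contained in that of any more widespread species, which is what makes the pattern perfectly nested.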

In the Gleasonian gradient, species composition changes smoothly across a gradient and there is a constant degree of species overlap. Each species occupies exactly n samples. The first species occupies the first n samples and each subsequent species is shifted by m places with m < n; that is, each species and the next one in the matrix coexist in n − m samples.

The Clementsian gradient is characterized by compartmentalization in species composition resulting in discrete communities. We simulated the pattern setting the number of communities to k. Each community includes S/k species, each species occupies exactly N/k samples and all samples in the community have identical species composition.
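The two gradient patterns can be generated with a short sketch. Several details are assumptions for illustration: the Gleasonian matrix is given exactly as many samples as the shifted ranges need (no wrap-around), the Clementsian version requires k to divide both the number of species and the number of samples, and the function names are hypothetical.

```python
def gleasonian_matrix(n_species, width, shift):
    """Each species occupies `width` consecutive samples, shifted by `shift` per species."""
    if not 0 < shift < width:
        raise ValueError("shift must satisfy 0 < shift < width")
    n_samples = width + (n_species - 1) * shift   # just enough samples for all ranges
    rows = []
    for j in range(n_species):
        start = j * shift
        rows.append([1 if start <= c < start + width else 0
                     for c in range(n_samples)])
    return rows


def clementsian_matrix(n_species, n_samples, k):
    """Block-diagonal matrix: k communities with identical composition within each."""
    if n_species % k or n_samples % k:
        raise ValueError("k must divide both the number of species and of samples")
    species_block, sample_block = n_species // k, n_samples // k
    return [[1 if r // species_block == c // sample_block else 0
             for c in range(n_samples)]
            for r in range(n_species)]


gleason = gleasonian_matrix(5, 4, 1)
clements = clementsian_matrix(6, 9, 3)
for name, mat in (("Gleasonian", gleason), ("Clementsian", clements)):
    print(name)
    for row in mat:
        print("".join(str(cell) for cell in row))
```

In both outputs every species has the same occupancy, so the two patterns differ only in how the occupied ranges are arranged relative to one another.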

In the random pattern, species were randomly allocated to samples, while controlling for the total number of species, samples and matrix fill.
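A random pattern with a fixed number of species, samples and fill can be sketched as follows. The text does not state how empty rows were avoided, so the step that first gives every species one occurrence is an assumption made here to keep the total number of species constant; the function name and parameter values are hypothetical.

```python
import random


def random_matrix(n_species, n_samples, fill, seed=0):
    """Random presence matrix with a fixed number of occupied cells.

    Every species first receives one occurrence in a random sample (an
    assumption, so that no species is lost); the remaining occurrences are
    placed uniformly at random among the empty cells.
    """
    rng = random.Random(seed)
    target = round(fill * n_species * n_samples)
    if target < n_species:
        raise ValueError("fill too low to give every species one occurrence")
    matrix = [[0] * n_samples for _ in range(n_species)]
    for row in matrix:
        row[rng.randrange(n_samples)] = 1
    empty = [(r, c) for r in range(n_species) for c in range(n_samples)
             if matrix[r][c] == 0]
    for r, c in rng.sample(empty, target - n_species):
        matrix[r][c] = 1
    return matrix


rand = random_matrix(20, 10, 0.3)
print(sum(map(sum, rand)), "occupied cells out of", 20 * 10)
```

The sketch does not constrain the column totals, so empty samples are possible at very low fill.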

From each pattern, we produced a series of variants differing in the distribution of row and column totals implementing row or column transformations. The transformations affect the exact position of species occurrences in the target row/column but maintain either the row or the column totals, respectively. Row transformations change the distribution of species among samples leaving the species occurrence distribution (SOD) unaffected, while column transformations change the SOD but preserve the species richness per sample (α). The variation of column and row totals was assessed by the Shannon's equitability index (J). The index assumes values between 0 (extreme differentiation) and 1 (no variation), among column/row totals.
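The text does not spell out the formula for J. The sketch below assumes the usual definition of Shannon's equitability, J = H/ln(k), where H = −Σ q_i ln q_i is computed on the proportions q_i of the k row (or column) totals; the function name is hypothetical.

```python
from math import log


def equitability(totals):
    """Shannon equitability J = H / ln(k) of a list of non-negative totals."""
    k = len(totals)
    total = sum(totals)
    if k < 2 or total <= 0:
        raise ValueError("need at least two totals with a positive sum")
    # zero totals contribute nothing to H (limit of q * ln q as q -> 0)
    h = -sum((t / total) * log(t / total) for t in totals if t > 0)
    return h / log(k)


print(equitability([5, 5, 5, 5]))    # equal totals: no variation, J = 1
print(equitability([10, 1, 1, 1]))   # uneven totals: J below 1
```

Applied to row totals this gives JRow (variation in species occupancy), and applied to column totals it gives JCol (variation in alpha).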

As f is an important determinant of the average species overlap, we also produced a series of matrices differing in f. The variation in f was generated by varying ā, keeping S constant. We used multiple regression to determine the dependence of N* on f, ā, S, the variation in alpha (column totals) and the variation in species occupancy (row totals). For the multiple regression, we used the GLM module in R (Gaussian family with identity link), with a stepwise backward selection based on AIC.

The degree of nestedness was estimated by the NODF nestedness metric (Almeida-Neto et al. 2008), using ANINHADO (Guimarães & Guimarães 2006). Higher values of NODF imply more nested matrices.

Empirical data

We used data on vascular plants obtained from the project ‘Identification and Description of Habitat Types in Areas of Interest for the Conservation of Nature, 1999–2001’ for the design of the Greek Natura 2000 network of protected areas (209 sites) to test the performance of N*. The data refer to the species presences in each sample. In our analysis, 5651 samples (relevés) were considered, with surface area ranging between 50 and 400 m2 and containing a total of 4063 species or subspecies of higher plants. We explored the effect of grain size (that is, the surface area of the plot) on N* using the whole dataset (Greece), and the effect of extent using only samples of surface area 100 m2 for Greece and its 13 phytogeographical regions according to Flora Hellenica (Strid & Tan 1997).


Simulated data

The community structure affected the values of N* (Fig. 3). Average N* increased from nested to species turnover structure. The nested dataset displayed the lowest average (13·3) but also the greatest range of N* values (Fig. 3a) due to high variation in alpha. The Gleasonian pattern (Fig. 3d) displayed the highest average N* (19·64) followed by the Clementsian (19·33) (Fig. 3c). Finally, the average value of N* (17·81) for the random pattern (Fig. 3b) lies between the values of totally nested communities and totally ordered ones. The random pattern has the narrowest range of values.

Figure 3.

Estimation of N* for four simulated community structures: (a) nested, (b) random, (c) Gleasonian, that is, smooth gradient species turnover and (d) Clementsian, that is, compartmentalized community structure. The continuous vertical line represents the average N*, while the dotted lines represent the upper and lower bound of N*. Visual representation of the presence–absence matrix of community structures (= 30, = 200) is provided as an inset.

Fig. 4 shows the way N* relates to 1/f (the Whittaker's index) and to the variation in species occupancy as expressed by the Shannon's evenness (JRow). The data were obtained from a series of ‘ideal’ patterns differing in f (generated by varying ā) and the associated matrices produced by column and row transformations (fixed ā, S and N). While f is the most significant determinant of N*, the variation in species occupancies is also significant (P < 0·001); however, the variation in alpha between samples is not significant. Higher N* values are obtained when the fill is small (high 1/f) and for a given fill when the variation in species occupancy is small (high JRow).

Figure 4.

Relationship of N* with the inverse of matrix fill (1/f) and the variation of species occupancies (row totals) expressed by the Shannon's evenness (JRow). Matrix fill is the proportion of the species presence–absence matrix that is occupied which equals the inverse of Whittaker's original index of beta diversity. The coefficients of the multiple linear regression model fitted (estimates, standard errors, t-values and significance) are given in the inset table.

Empirical data

In Table 2a, we show the effects of grain size (surface area of the sample) on N*. As grain size (A) increases from 50 to 400 m2, so do ā, f and NODF, while N* decreases according to N* = 1125 A^(−0·64) with R2 = 0·99. N* correlates strongly with the degree of nestedness (R2 = 0·93) but not with S. N*/NR* is not correlated with grain size but decreases linearly with the extent of sampling (R2 = 0·989) and the sampling density (R2 = 0·994). It also increases with JRow (R2 = 0·978) as the latter decreases linearly with extent (R2 = 0·98).
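The power-law dependence of N* on grain size can be checked approximately with an ordinary least-squares fit on log-transformed values. The sketch below uses the grain sizes and average N* values listed in Table 2a; the fitting method is an assumption and may differ from the one behind the reported coefficients, so small differences in the estimates are to be expected.

```python
from math import exp, log

# Grain size A (m2) and average N* as listed in Table 2a
area = [50, 100, 200, 300, 400]
n_star_obs = [91.8, 55.5, 40.3, 29.2, 23.2]

x = [log(a) for a in area]
y = [log(v) for v in n_star_obs]
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Least-squares slope and intercept of log(N*) on log(A)
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
coefficient = exp(mean_y - slope * mean_x)

print(f"N* ~ {coefficient:.0f} * A^({slope:.2f})")
```

The fitted exponent comes out close to the reported −0·64; the multiplicative constant is more sensitive to the fitting method.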

Table 2. Data on vascular plants of Greece from the survey of plant diversity of the Greek Natura 2000 network of protected areas. Sampling characteristics [Total sampled area of the region, Sampled extent used in the analysis, Sampling density and Number of samples (N)] and species diversity indices [mean alpha (ā), total number of species (S), matrix fill (f), Shannon equitability index for the variation in species occurrences (JRow) and in alpha diversity (JCol), nestedness metric NODF, average N* (1st and 3rd quartiles of N*) and ratio (N*/NR*) of observed N* to that theoretically expected for a random distribution of species (1st and 3rd quartiles of the ratio)]. The dataset of the entire country was divided into 13 subsets according to the phytogeographical region in which the samples were collected. The first part of the table examines the grain effect on N* estimation (constant extent, Greece). The second part examines the effect of different phytogeographical regions (constant grain, 100 m2)
(a) Effects of grain: dataset all Greece, total sampled area of the regions = 1172324·25 m2

| Sample size (m2) | Sampled extent (m2) | Sampling density | N | ā | S | JRow | JCol | f | NODF | N* (1st–3rd quartile) | N*/NR* (1st–3rd quartile) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 50 | 58 650 | 0·05 | 1173 | 16·95 | 2270 | 0·89 | 0·97 | 0·007 | 2·08 | 91·8 (87·2–96·9) | 0·43 (0·41–0·46) |
| 100 | 233 600 | 0·20 | 2336 | 23·76 | 3000 | 0·85 | 0·98 | 0·008 | 3·65 | 55·5 (44·1–65·5) | 0·28 (0·22–0·32) |
| 200 | 249 000 | 0·21 | 1245 | 28·17 | 2546 | 0·85 | 0·99 | 0·011 | 3·5 | 40·3 (38·5–42·5) | 0·28 (0·27–0·30) |
| 300 | 180 000 | 0·15 | 600 | 34·1 | 1812 | 0·86 | 0·99 | 0·019 | 4·42 | 29·2 (28·1–30·4) | 0·34 (0·32–0·36) |
| 400 | 118 800 | 0·10 | 297 | 36·52 | 1341 | 0·88 | 0·99 | 0·027 | 4·64 | 23·2 (22·2–24·1) | 0·39 (0·38–0·40) |

(b) Diversity estimates for Greece and 13 phytogeographical regions of Greece

| Region | Total sampled area (m2) | Sampled extent (m2) | Samp. density | N | ā | S | JRow | JCol | f | NODF | N* (1st–3rd quartile) | N*/NR* (1st–3rd quartile) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Greece | 1 172 324 | 233 600 | 0·20 | 2336 | 23·76 | 3000 | 0·85 | 0·98 | 0·008 | 3·65 | 55·9 (52·1–59·1) | 0·28 (0·22–0·32) |
| North East Greece | 185 885 | 13 400 | 0·07 | 134 | 26·43 | 918 | 0·93 | 0·96 | 0·029 | 6·08 | 38·9 (37·2–41·9) | 0·70 (0·69–0·71) |
| West Aegean islands | 31 127 | 6100 | 0·2 | 61 | 24·34 | 500 | 0·94 | 0·98 | 0·049 | 7·23 | 27·5 (26·7–28·5) | 0·83 (0·82–0·84) |
| North Central Greece | 128 476 | 4500 | 0·04 | 45 | 24·85 | 461 | 0·94 | 0·98 | 0·054 | 7·27 | 29·4 (27·7–30·3) | 0·98 (0·96–1·01) |
| Peloponnisos | 166 770 | 58 200 | 0·35 | 582 | 19·92 | 1190 | 0·89 | 0·99 | 0·017 | 3·69 | 42·8 (41·3–44·4) | 0·45 (0·43–0·46) |
| Southern Pindos | 84 766 | 26 600 | 0·31 | 266 | 20·61 | 858 | 0·90 | 0·98 | 0·024 | 4·79 | 34·9 (34·4–35·3) | 0·52 (0·51–0·53) |
| Sterea Ellas | 157 216 | 28 400 | 0·18 | 284 | 25·32 | 1224 | 0·90 | 0·98 | 0·021 | 4·07 | 40·7 (40·2–41·3) | 0·53 (0·51–0·55) |
| Northern Pindos | 72 644 | 11 100 | 0·15 | 111 | 19·89 | 516 | 0·91 | 0·99 | 0·039 | 6·12 | 26·1 (25·6–26·7) | 0·62 (0·61–0·63) |
| Ionian islands | 24 670 | 14 400 | 0·58 | 144 | 15·49 | 398 | 0·89 | 0·98 | 0·039 | 7·02 | 22·9 (22·4–23·5) | 0·55 (0·54–0·56) |
| East Aegean islands | 70 178 | 20 200 | 0·29 | 202 | 26·34 | 675 | 0·88 | 0·98 | 0·039 | 8·77 | 18·5 (18·2–18·8) | 0·45 (0·44–0·46) |
| Kriti Karpathos | 88 711 | 31 500 | 0·36 | 315 | 32·58 | 879 | 0·88 | 0·97 | 0·037 | 8·66 | 17·8 (17·6–18·1) | 0·34 (0·31–0·36) |
| Cyclades | 48 436 | 11 700 | 0·24 | 117 | 26·9 | 606 | 0·90 | 0·97 | 0·044 | 8·76 | 19·4 (18·9–19·9) | 0·39 (0·38–0·40) |
| East Central Greece | 107 529 | 5600 | 0·05 | 56 | 30·00 | 556 | 0·93 | 0·99 | 0·054 | 7·27 | 23·8 (23·1–24·5) | 0·54 (0·53–0·56) |
| North Aegean islands | 5916 | 1900 | 0·32 | 19 | 11·47 | 61 | 0·89 | 0·98 | 0·188 | 21·75 | 6·3 (6·0–6·7) | 0·69 (0·68–0·71) |

In Table 2b, we present diversity estimates for Greece and its 13 phytogeographical regions for the fixed grain size of 100 m2. In all cases, the number of available samples (N) was enough for N* to converge, and the estimated N* was lower than N/2 with the exception of North Central Greece. The highest value of N* and the lowest value of NODF were estimated for Greece. Among the phytogeographical regions, islands and East Central Greece demonstrated relatively low values of N* (6·3 < N* < 22·9) and high degrees of nestedness (7·02 < NODF < 21·75), with the noticeable exception of the West Aegean Islands, having N* = 27·5 and NODF = 7·23. The West Aegean Islands include Evvia, the second largest island of the Aegean, separated from the mainland by only a few hundred metres, and thus have more mainland than island characteristics. The continental regions exhibited high values of N* (26·1 < N* < 42·8), with the highest value observed in Peloponnisos.

N* significantly (P < 0·01) relates to NODF (R2 = 0·90), f (R2 = 0·84) and S (R2 = 0·83) for the given grain (100 m2). In all cases, the relationship is a power law, negative for NODF and f and positive for S. However, there is a strong linear dependence of NODF on f (R2 = 0·94) and a log–log relationship between S and f (R2 = 0·85). Using AIC-based variable selection, the best model accounting for the variation of N* is log(N*)=−6·63 − 0·81 log(f)+7·96 JRow (R2 = 0·95). Similar to the ‘ideal’ patterns, variation in f explained most of the variation of N* (deviance = 3·12), followed by the variation in species occurrences (deviance = 0·45).

The ratio N*/NR* standardizes N* for the effect of f and depends mostly on JRow (N*/NR* = −4·89 + 6·1 JRow, R2 = 0·70). It also depends on regional extent (N*/NR*=−0·2216 Ext4·524, R2 = 0·6). As extent increases, JRow decreases (JRow = 1·06 Ext^(−0·017), R2 = 0·47). In our case, JRow increases with the proportion of singletons and doubletons (JRow = 0·8 + 0·002 (%singletons + %doubletons), R2 = 0·73). So, the dependence of N*/NR* on the regional extent is due to the decrease of the proportion of singletons and doubletons and eventually of JRow, with increasing extent. The proportion of singletons ranges from about 23% (whole Greece and Peloponnisos) to more than 45% (North and West Aegean islands, North and East Central Greece). The proportion of doubletons follows a similar pattern.

For the different phytogeographical regions, the ratio N*/NR* varied between 0·34 and 0·98. North East Greece, North Central Greece and West Aegean Islands showed the highest values of N*/NR* indicating a community structure approaching a random pattern. The ratio tends to increase from South to North, and this holds for both islands and the continental regions.


In this paper, we advocate an index of beta diversity, N*, which is conceptually different from most beta diversity indices in use and exhibits conceptual affiliations to Whittaker's (1972) original concept of half changes and Soininen, McDonald & Hillebrand's (2007) halving distance. It is defined as the average sampling effort at which the total number of species occurrences shared (I) becomes equal to species richness (S). It is estimated as the intersection point between the line (ā/2)n and S(n), the average species accumulation curve (SAC). The point (N*, S(N*)) has the unique properties that S(N*) = I(N*) = (ā/2)N* and N* = 2 Σ_{i=1}^{N*} p_i, where p_i is the proportion of new species expected in the ith sample. The proposed index quantifies the differentiation among sites and is not mathematically confounded with alpha diversity.

N* depends on the initial slope of S(n) compared to ā, a relationship recognized as an indicator of beta diversity by Olszewski (2004). Thus, N* can be estimated even when species richness does not approach an asymptote, as is the case of tropical insects. This is an important property, as in most cases the total species richness of any taxon present in any region remains unknown. This does not imply that similar N* values are expected, when comparing a species-rich region to a species-poor one having equal ā and JRow. In the species-poor region, a higher average species overlap, for the given ā, and thus a lower N* value is expected.

For a reliable estimation of N*, both ā and average S(n) should be accurately estimated, and therefore, a sufficient number of samples is required. Given this sufficient number of samples, N* converges to a value that is not affected by the inclusion of more samples from the same region (extent). This property is very important, as it ensures that our index is more informative of the species assemblage differentiation than of n (Vellend 2001; Chao et al. 2005).

If the number of available samples is not sufficient, N* is either not estimable at all (the Nstar function reports that ‘n < N*’ in that case) or it is underestimated. An indication of the latter case is that the proportion of valid N* estimations reported by the Nstar function is less than 100%. From our experience, a number of samples greater than 2N* is enough for N* to converge. If N* > n/2, an underestimation is possible. In any case, to ensure that N* is correctly estimated, the use of random subsets, as in the example illustrated in Fig. 2, is recommended. Even in cases where N* is not estimable due to undersampling, extrapolation of the sample-based SAC can help in estimating it (Colwell et al. 2012). However, extrapolation of SACs might be unreliable (Olszewski 2004), and even if it is not, ā must be reliably estimated.

When using N* to compare different regions, the sampling units should be of equal area, equal time duration or otherwise standardized (e.g. phytosociological units), as the grain size is the measurement unit of N* and it is expected to affect assemblage structure (Mac Nally, Fleishman & Bulluck 2004). Given that the gamma diversity of a region is constant, as grain size increases, ā, f and the average species distribution overlap increase and N* decreases. This was confirmed by the analysis of the empirical data.

The strong relation between N* and 1/f = βw, observed in both simulated and empirical data, implies that our index shares some properties with the sample-based estimate of Whittaker's βw. However, for a given f, the value of βw is fixed, not affected by the spatial covariation of species distribution (Lira-Noriega et al. 2007; Arita et al. 2008). On the contrary, N* is significantly correlated not only with f but also with JRow (variation in species occupancy frequencies). N* increases as species distribution overlap decreases; hence, it increases with turnover and decreases with nestedness, with randomly ordered data lying between these two extremes. This was confirmed by simulation analyses using species distribution patterns with known properties. In the nested pattern, the variation in species occupancy is the highest (JRow is the lowest) and the average N* value the lowest. In ideal Clementsian and Gleasonian patterns, all species have the same occupancy frequency (JRow = 1) and therefore very similar N* values, but slightly higher for the Gleasonian one, as the average species overlap of the latter is generally lower.

Because, for a given f, N* is strongly related to the SOD, factors affecting the SOD indirectly affect N*. The form of the SOD depends on the grain and extent of sampling (McGeoch & Gaston 2002), as is the case for many other observed diversity patterns (Lennon et al. 2001; Kallimanis et al. 2008).

N* is more affected by f than by the SOD. If the underlying pattern of species distribution is of interest, then N*/NR* can be used. NR* is the N* value of an ideal random pattern having the same fill as the data analysed. NR* depends only on f. So, the ratio N*/NR* depends mostly on JRow and is more informative of the community structure; values close to one indicate randomness.
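The dependence of NR* on f alone can be made concrete with a short Python sketch. It rests on an assumption that is ours, not necessarily the authors' construction: the "ideal random pattern" is taken to mean that every cell of the sample-by-species matrix is occupied independently with probability f. Under that assumption S(n)/γ = 1 − (1 − f)^n and R(n)/γ = n·f, so the crossing I(n) = S(n) satisfies n·f/2 = 1 − (1 − f)^n, which is solved below by bisection. The function name `n_star_random` is hypothetical.

```python
def n_star_random(fill, iterations=200):
    """Illustrative NR* for an independent random pattern with the given fill f.

    Solves n * f / 2 = 1 - (1 - f) ** n for n >= 1 by bisection.
    Assumes each matrix cell is occupied independently with probability f.
    """
    if not 0 < fill <= 1:
        raise ValueError("fill must lie in (0, 1]")

    def g(n):
        # g(n) = [I(n) - S(n)] / gamma under the random model
        return n * fill / 2 - 1 + (1 - fill) ** n

    lo, hi = 1.0, 2.0 / fill  # g(lo) < 0 and g(hi) >= 0 for 0 < f <= 1
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return hi


nr_star = n_star_random(0.5)
print(nr_star)  # lies between 3 and 4 for f = 0.5
```

Dividing an observed N* by the value returned for the same fill gives a ratio of the kind described above, with the caveat that a finite matrix with a fixed fill may deviate slightly from this independent-cell approximation.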

The effects of extent on N* and N*/NR* are evident when comparing Greece with its phytogeographical regions. For a fixed grain, as the extent increases, so do gamma diversity and the average distance between randomly drawn samples, leading to an increase in compositional dissimilarity and thus to a higher N*. In islands, N* values were lower than those of adjacent continental regions, reflecting the higher nestedness of the former. With increasing extent, JRow and N*/NR* decreased owing to the decrease in the proportion of singletons and doubletons. N*/NR* was well below one in almost all cases, the exceptions being North East Greece, West Aegean Islands and North Central Greece, indicating deviation from randomness due to uneven species occupancies.

As the extent of the region is an important determinant of N* (and of other diversity indices), the sampling within the region should be random and as even as possible, to represent the regional diversity. If most of the samples are drawn from a small sub-region, then they will not be a good reflection of the regional species diversity and they will underestimate N*. Furthermore, random sampling, evenly distributed in the region, is necessary to account for the effect of spatial autocorrelation of species distributions on the average overlap and eventually on N*.

In conclusion, we propose the use of N* as an index of beta diversity, as it reflects the heterogeneity in species composition among different species assemblages, and of the ratio N*/NR* as a complementary index of community structure. N* depends on f and JRow, while N*/NR* depends mostly on JRow. If a sufficient number of samples is available for its estimation, N* is not affected by the inclusion of further samples from the same region. Thus, it allows direct comparisons among regions with different sampling effort, as is usually the case with readily available biodiversity datasets. It shares some properties with Whittaker's index βw, but unlike βw it is affected by the variation in species occupancies among samples and eventually by the SOD, a property that needs to be considered carefully in beta diversity studies.


This work is part of ML's PhD thesis. We thank P. Dimopoulos and I. Tsiripidis for constructive discussions and criticism. We also thank J.M. Halley and K. Touloumis for valuable comments and linguistic improvements on a previous version of our manuscript, and Mick Ashcroft and the anonymous reviewers for their very helpful comments.

This research has been co-financed by the European Union (European Social Fund – ESF) and Greek national funds through the Operational Program ‘Education and Lifelong Learning’ of the National Strategic Reference Framework (NSRF)–Research Funding Program: Heracleitus II. Investing in knowledge society through the European Social Fund.