## Introduction

Beta diversity provides a link between species diversity in a region (gamma diversity) and species diversity within sites (alpha diversity) (Whittaker 1960, 1972). However, there is no consensus as to what precisely constitutes beta diversity, and therefore, many metrics for its estimation have been proposed (Koleff, Gaston & Lennon 2003; Jost 2006; Tuomisto 2010a,b; Anderson *et al*. 2011), each quantifying a different aspect of it (Anderson *et al*. 2011).

General definitions of beta diversity produce a beta with a hidden alpha dependency (Jost 2007). However, beta should be independent of mean alpha (Baselga 2010a; Ricotta 2010; Veech & Crist 2010; Chase *et al*. 2011), although there might be empirical cases where a statistical dependence is observed (Jost 2010). If a beta diversity measure is by definition dependent on alpha, then it fails to discern between spatial turnover and nestedness (Baselga 2010b; Baselga & Orme 2012). However, there is no consensus about the appropriate way of estimating beta diversity without the outcomes being influenced by the interrelationship between alpha, beta and gamma (Jost 2006, 2010; Baselga 2010a; Ricotta 2010).

Many beta diversity indices are sensitive to variation in sampling effort expressed in number of sampling units (Lennon *et al*. 2001; Kallimanis *et al*. 2008). The classic Jaccard and Sørensen indices of compositional dissimilarity are notoriously sensitive to sampling effort, especially for assemblages with numerous rare species (Chao *et al*. 2005; Cardoso, Borges & Veech 2009; Chase *et al*. 2011). Therefore, estimates based on different datasets or regions are not directly comparable.

In this paper, we propose a new measure of beta diversity, named *N**. We will consider the first observation of any species as ‘useful’ information and any subsequent observation of the same species as ‘redundant’ (for estimating richness, but informative of beta diversity). Obviously, all the species recorded from the first drawn sample contribute to the species richness, while the species of subsequent samples can either add to the species richness or the redundant set of already found species. Up to a specific number of sampling units (hereafter referred to as samples), the accumulated number of species is higher than the number of species occurrences in the redundant set (Fig. 1a). Taking one more sample, the cumulative number of species in the redundant set becomes higher than the accumulated species richness, and therefore, there is a point where these are equal. We define *N** as the average sampling effort at which the total number of species occurrences in the redundant set becomes equal to the species richness. We explore and test the basic properties of *N** on simulated data and examine its applicability on a dataset of plant diversity from Greek Natura 2000 protected areas.

### Definitions and assumptions

Species richness is a nonadditive variable when aggregating across samples (He & Legendre 2002). In the simple case of two samples *A*_{1} and *A*_{2} having *S*(*A*_{1}) and *S*(*A*_{2}) species, respectively, the total species richness is *S* = *S*(*A*_{1} ∪ *A*_{2}) = *S*(*A*_{1}) + *S*(*A*_{2}) − *S*(*A*_{1} ∩ *A*_{2}) (see Table 1 for notations). Setting *R* = *S*(*A*_{1}) + *S*(*A*_{2}) and *I* = *S*(*A*_{1} ∩ *A*_{2}), we get *S = R − I* (eqn 1), where *R* is the total number of species occurrences and *I* is the number of shared species occurrences. When more samples are added, the analytic expression for *I* becomes complicated, but eqn 1 still holds, so for *n* samples, *I(n) = R(n) − S(n)*. For large *n*, the total number of species occurrences is *R(n) = ān* (*ā* = average number of species per sample) and the species richness *S(n)* is the average species accumulation curve (SAC), a concave function of *n* (power and logarithmic functions are the most frequently used, e.g. Kagiampaki *et al*. 2011; Triantis, Guilhaumon & Whittaker 2012). Consequently, as *R(n)* is a linear function of *n* and *I(n) = R(n) − S(n)*, the dependence of *I(n)* on *n* is a convex function (Fig. 1a). *I(n)* is a measure of the species distribution overlap among the *n* samples. It increases monotonically with *n*, and there is no upper bound for its value.
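As a minimal illustration of eqn 1, the following Python sketch computes *S*, *R* and *I* for samples represented as sets of species. The function name `overlap_counts` and the toy species labels are hypothetical and serve only to make the bookkeeping concrete.

```python
def overlap_counts(samples):
    """Return (S, R, I) for a list of samples, each a set of species labels."""
    S = len(set().union(*samples))      # species richness of the pooled samples
    R = sum(len(s) for s in samples)    # total number of species occurrences
    I = R - S                           # redundant (shared) occurrences, eqn 1
    return S, R, I


# Hypothetical toy samples
a1 = {"sp1", "sp2", "sp3"}
a2 = {"sp2", "sp3", "sp4", "sp5"}

S, R, I = overlap_counts([a1, a2])
# For two samples, I is simply the number of species in the intersection.
assert I == len(a1 & a2)
print(S, R, I)
```

For more than two samples, *I* no longer has a simple set expression, but the same subtraction applies.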

Notation | Parameters of species presence–absence matrix
---|---
N | Total number of samples
n | Number of samples (sampling effort)
γ | Total species richness, gamma diversity
α | Species richness per sample, alpha diversity
ā | Mean species richness per sample, mean alpha diversity
S(n) | Species richness of n samples
R(n) | Total number of species occurrences of n samples
I(n) | Total number of species occurrences shared in n samples
f | Proportion of the species occurrences matrix that is occupied, the inverse of Whittaker's beta
p_{i} | Proportion of new species expected in the ith sample
**Notation** | **Indices of beta diversity**
β_{w} | Whittaker's original index of beta diversity, γ/ā
β_{−1} | Harrison's index, (γ/ā − 1)/(N − 1)
N* | The proposed index
N_{R}* | Theoretically expected N* for random distribution of species

The degree of overlap among species distributions is an important aspect of beta diversity, and *I* can serve as an index of it after a proper rescaling into a conventional range, say 0–1. Doing so, one derives Harrison's beta β_{−1} = (*γ/ā* − 1)/(*n* − 1) (Harrison, Ross & Lawton 1992), originally proposed as a modification of Whittaker's beta β_{w} = *γ/ā* to correct for the strong dependence of the latter on *n* (Chiarucci *et al*. 2003). Some ways to derive β_{−1} from *I* are provided in Appendix .

The problem is that *I* is an unbounded convex function of *n*, and β_{−1} attempts to scale it on the unit linear segment (0–1). The symptom of this nonlinear scaling is that the scaled index β_{−1} is still dependent on *n*; as *n* increases, β_{−1} decreases. As gamma diversity is a finite number, if *n* increases then, in the limit (*n→∞*), β_{−1} tends to zero (see Vellend 2001 for a similar argument).
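The dependence of β_{−1} on *n* can be checked numerically. The sketch below implements the two indices as defined in the text and, as an assumption, one possible rescaling of *I* that reproduces Harrison's index: dividing *I* by its value for *n* identical samples, *ā*(*n* − 1), and subtracting from 1. This rescaling is not taken from the appendix and may differ from the derivations given there; the function names and numeric values are hypothetical.

```python
import math


def whittaker_beta(gamma, mean_alpha):
    """Whittaker's beta: total richness over mean richness per sample."""
    return gamma / mean_alpha


def harrison_beta(gamma, mean_alpha, n):
    """Harrison's beta as given in the text: (gamma/mean_alpha - 1)/(n - 1)."""
    return (whittaker_beta(gamma, mean_alpha) - 1) / (n - 1)


def harrison_from_overlap(gamma, mean_alpha, n):
    """Assumed rescaling: 1 - I/I_max, with I_max the overlap of n identical samples."""
    overlap = mean_alpha * n - gamma      # I(n) = R(n) - S(n), with S(n) = gamma
    overlap_max = mean_alpha * (n - 1)    # every sample holds the same species
    return 1 - overlap / overlap_max


gamma, mean_alpha = 60, 12.0              # hypothetical finite species pool
for n in (5, 10, 50, 500):
    b = harrison_beta(gamma, mean_alpha, n)
    # Both routes give the same value, and it shrinks as n grows.
    assert math.isclose(b, harrison_from_overlap(gamma, mean_alpha, n))
    print(n, round(b, 4))
```

With *γ* and *ā* held fixed, the printed values decrease towards zero as *n* grows, which is the behaviour described above.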

### The *N** index

Instead of rescaling *I*, we can derive an *I*-based index of species overlap by just taking the value of *I* at a characteristic, well-defined sampling effort (*N**) at which a certain criterion is fulfilled. We define *N** as the average sampling effort required so that *I*(*n*) = *S*(*n*), that is, the intersection point of the average *S(n)* and *I(n)* curves, as shown in Fig. 1a. Up to *N**, samples contribute cumulatively more to the increase of *S* than to the increase of *I*. The opposite happens beyond that point.

From *I(n) = R(n) − S(n) = ān − S(n)*, it follows that for *n = N** we have *I(N*) = S(N*) ⇒ S(N*) = āN* − S(N*) ⇒ S(N*) = āN*/2*. That is, *N** is also defined as the intersection point of *S(n)* with the line *y = (ā/2)n* (Fig. 1b). We use this latter definition for the estimation of *N** from data, as there are well-documented methods for the estimation of the SAC (Colwell *et al*. 2012) and many implementations of those methods exist [EstimateS (Colwell 2012) and a number of R functions]. As an electronic supplement (Appendices S3, S4), we provide the R (R Development Core Team 2012) function ‘Nstar’ to estimate *N** based on the *specaccum* function of the vegan package (Oksanen *et al*. 2012). An example of its use is given in Fig. 1b with the dune data included in vegan (Jongman, Ter Braak & Van Tongeren 1987). *Specaccum* estimates the average and the standard deviation of *S(n)*. The standard deviation defines an upper and a lower bound of the SAC that intersect the line *y = (ā/2)n*, providing a lower and an upper bound for *N** (Fig. 1b). Note that the standard deviation of the SAC does not directly translate to a standard deviation of *N**. However, the bounds of *N** could be used as a measure of uncertainty. The bounds of *S(n)* depend on the variation in alpha, which eventually affects the upper and lower bounds of *N** but not its average value.
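The intersection-based definition can be sketched in a few lines of Python. This is not the ‘Nstar’ function of the supplement and does not use vegan. As assumptions, it takes the SAC to be the exact expectation of richness over all random subsets of *n* samples, and it locates the crossing with *y = (ā/2)n* by linear interpolation between consecutive integer values of *n*. The function names and the occupancy vector are hypothetical.

```python
from math import comb


def exact_sac(occupancy, n_samples):
    """Expected richness S(n), n = 1..n_samples, for n samples drawn at random
    without replacement; occupancy[j] = number of samples containing species j."""
    return [
        sum(1 - comb(n_samples - k, n) / comb(n_samples, n) for k in occupancy)
        for n in range(1, n_samples + 1)
    ]


def n_star(occupancy, n_samples):
    """Intersection of S(n) with y = (mean_alpha/2) * n, or None if not reached."""
    mean_alpha = sum(occupancy) / n_samples
    if mean_alpha == 0:
        return None
    sac = exact_sac(occupancy, n_samples)
    prev = sac[0] - mean_alpha / 2              # S(1) = mean_alpha, so this is > 0
    for n in range(2, n_samples + 1):
        cur = sac[n - 1] - mean_alpha * n / 2
        if cur <= 0:
            # Assumed: linear interpolation between n - 1 and n.
            return (n - 1) + prev / (prev - cur)
        prev = cur
    return None                                  # the line has not yet crossed S(n)


# Hypothetical occupancies of 8 species across 10 samples
print(n_star([10, 5, 5, 2, 2, 1, 1, 1], 10))
```

A return value of `None` corresponds to the case in which the available samples are too few for the line to overtake the SAC.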

There are cases where *N** is not estimable from the available data, as shown in Fig. 1c, where we used only four of the 20 samples of the dune dataset to estimate *N**. In this example, the line *y = (ā/2)n* has not yet exceeded *S(n)* and the intersection point has not been reached. However, there are subsets of four samples for which *N** was estimated to be less than four, an underestimate of the true *N** value. For a reliable estimation of *N**, a ‘sufficient’ number of samples is required. Fig. 2 shows the way *N** and *ā* depend on *n*. For any given number of samples *n* (3–19), we formed 500 random subsets and estimated average *N** and *ā* from only those subsets where *N** was estimable. Using three samples, *N** was estimable in only 4·6% of the sets, and its average value was 2·74. The corresponding estimate of *ā* was very high (11·6), indicating that *N** was estimable only when the species-rich samples were included. Using more than seven samples, all subsets provide an estimate, and with ten or more, both *ā* and *N** converge to a value. Furthermore, the use of subsets allows for the estimation of meaningful statistics of spread. We used this approach for the estimation of the average, standard deviation and quartiles of *N** in all our empirical and simulated datasets.
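The subset procedure can be sketched as follows. The code is an illustration under stated assumptions, not the procedure used for Fig. 2: `n_star` is a hypothetical helper that uses the exact expected SAC and linear interpolation, and the samples are randomly generated placeholders rather than the dune data.

```python
import random
from math import comb


def n_star(samples):
    """N* for a list of samples (sets of species); None if not estimable."""
    N = len(samples)
    species = set().union(*samples)
    occupancy = [sum(sp in s for s in samples) for sp in species]
    mean_alpha = sum(occupancy) / N
    if mean_alpha == 0:
        return None
    prev = mean_alpha / 2                        # S(1) - (mean_alpha/2), S(1) = mean_alpha
    for n in range(2, N + 1):
        s_n = sum(1 - comb(N - k, n) / comb(N, n) for k in occupancy)
        cur = s_n - mean_alpha * n / 2
        if cur <= 0:
            return (n - 1) + prev / (prev - cur)  # assumed linear interpolation
        prev = cur
    return None


def subset_summary(samples, size, draws=500, seed=0):
    """Share of random subsets where N* is estimable, and the mean N* over them."""
    rng = random.Random(seed)
    values = []
    for _ in range(draws):
        value = n_star(rng.sample(samples, size))
        if value is not None:
            values.append(value)
    share = len(values) / draws
    mean = sum(values) / len(values) if values else None
    return share, mean


# Hypothetical data: 20 samples drawn from a pool of 30 species
rng = random.Random(1)
pool = list(range(30))
samples = [set(rng.sample(pool, rng.randint(5, 12))) for _ in range(20)]
for size in (3, 5, 10):
    print(size, subset_summary(samples, size))
```

Keeping the list of per-subset values instead of only their mean would also give the standard deviation and quartiles mentioned above.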

*N**, defined as the intersection point between *y *= (*ā/2*)*n* and *S(n)*, will always converge to a value as long as *ā* and *S(n)* stabilize. The convergence of *ā* is also required for the estimation of *β _{w}* and other related indices. The number of samples required obviously depends on the variation in alpha.

The intersection point between *S(n)* and *y = ān/2* exists if *S(n)* has an upper bound or even if *S(n)* is a linear function of *n* with a slope less than *ā/2*, as *S(1) = ā > ā/2*. However, there might be cases where the two curves do not intersect. An extreme such case is when all sites are completely distinct and there is an infinite number of species. Then, *I(n) = 0*, *S(n) = R(n) = ān*, *β*_{w} = *n*, *β*_{−1} = 1, and *N** is always higher than *n*. When dealing with an infinite species pool, both *β*_{w} and *N** tend to infinity. If, however, the species pool (*γ*) is finite, then, provided *n* is large enough to reach *γ*, the lines *y = γ* and *y = (ā/2)n* intersect at *N** = 2*γ*/*ā* = 2*β*_{w}. This is the highest possible value of *N**. The lowest possible value of *N** is 2 and is obtained when *S(n)* = *ā*; all samples contain exactly the same species.

#### Factors expected to affect *N**

*N** increases with the slope of *S(n)* relative to *ā*. For *S(n)*, we can write the recursive formula *S(n) = S(n − 1) + p*_{n}*α*_{n}, where *α*_{n} is the number of species in the *n*th sample and *p*_{n} is the proportion of new species expected among those *α*_{n}. Although *α*_{n} is not constant, for large enough *n* and under a randomization procedure for estimating *S(n)*, *α*_{n} = *ā = S(1)* and the equation *S(n) = S(n − 1) + p*_{n}*ā* has the solution

$$S(n) = \bar{a} \sum_{i=1}^{n} p_i ,$$

given that *p*_{1} = 1. Substituting this solution into *S(N*) = āN*/2* gives

$$\sum_{i=1}^{N^*} p_i = \frac{N^*}{2} ,$$

that is, *N** depends only on the *p*_{i}'s and consequently on *β*_{w} estimated at *n = N**. This means that *N** is not mathematically confounded with alpha diversity but depends on the proportion of new species expected among the species of the next collected sample. In summary, *N** is the sampling effort at which *S(N*) = I(N*) = (ā/2)N** and its value equals twice Whittaker's index calculated using *N** samples.
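The independence from *ā* can be illustrated with a short Python sketch built on the recursion *S(n) = S(n − 1) + p*_{n}*ā*. The geometric sequence of *p*_{i} values is purely hypothetical, the function name is illustrative, and the crossing with *y = (ā/2)n* is located by linear interpolation as an assumption.

```python
import math
from itertools import accumulate


def n_star_from_p(p, mean_alpha=1.0):
    """N* for S(n) = mean_alpha * (p[0] + ... + p[n-1]); p[0] must equal 1."""
    sac = [mean_alpha * c for c in accumulate(p)]   # S(1), S(2), ...
    prev = sac[0] - mean_alpha / 2
    for n in range(2, len(p) + 1):
        cur = sac[n - 1] - mean_alpha * n / 2
        if cur <= 0:
            return (n - 1) + prev / (prev - cur)     # assumed linear interpolation
        prev = cur
    return None


# Hypothetical proportions of new species: each sample adds 80% as many as the last
p = [0.8 ** i for i in range(30)]

low = n_star_from_p(p, mean_alpha=1.0)
high = n_star_from_p(p, mean_alpha=50.0)
# Rescaling mean alpha leaves N* unchanged (up to floating-point error).
assert math.isclose(low, high)
print(low)
```

Because *ā* multiplies both *S(n)* and the line *y = (ā/2)n*, it cancels from the crossing point, so only the *p*_{i} sequence matters.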

*N** accounts for the overlap of species distributions, so it is expected to depend on the fill (*f*) of the species presence matrix. The fill is the proportion of the matrix that is occupied and equals the inverse of the *β*_{w} index (*f = āN/SN = ā/S*). Assuming a random pattern of species distribution among samples, the value of *N** (name it *N*_{R}*) depends only on *f*, according to a recursive formula that is derived analytically (Appendix II). A good initial value for starting the recursion is *N*_{R}* = 0·5 + 1·6/*f*. The ratio *N**/*N*_{R}* standardizes for the effect of fill, so its values depend on factors affecting *N** other than the fill. *N**/*N*_{R}* actually reflects the degree to which the species assemblage deviates from randomness.
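The recursive formula itself is not reproduced in this text, so the sketch below rests on an assumed form and may differ from the one derived in Appendix II. If every species occupies every sample independently with probability *f*, then *S(n) = γ*[1 − (1 − *f*)^n] and *ā = fγ*, and the condition *S(N) = (ā/2)N* becomes *N* = (2/*f*)[1 − (1 − *f*)^N]. This form is consistent with the stated starting value 0·5 + 1·6/*f*, which lies close to its fixed point for small *f*. The function name and tolerances are hypothetical.

```python
import math


def n_star_random(fill, tol=1e-10, max_iter=1000):
    """Fixed-point iteration for the ASSUMED relation N = (2/f) * (1 - (1 - f)**N)."""
    if not 0 < fill <= 1:
        raise ValueError("fill must lie in (0, 1]")
    n = 0.5 + 1.6 / fill                      # starting value given in the text
    for _ in range(max_iter):
        new = (2 / fill) * (1 - (1 - fill) ** n)
        if abs(new - n) < tol:
            return new
        n = new
    return n                                  # last iterate if tolerance was not met


for f in (1.0, 0.5, 0.1, 0.01):
    value = n_star_random(f)
    # The result satisfies the assumed fixed-point relation.
    assert math.isclose(value, (2 / f) * (1 - (1 - f) ** value), rel_tol=1e-8)
    print(f, round(value, 3))
```

Under this assumption, a completely filled matrix (*f* = 1) gives the minimum value of 2, in agreement with the lower bound of *N** stated earlier.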

For a fixed *f*, the degree of species overlap depends on the way the species are distributed and eventually on the variation in species occupancy. If a species is recorded in *k* of *n* samples, then it contributes 1 to *S*, *k* to *R* and *k − 1* to *I*. Widely distributed species, contributing mostly to *I*, tend to decrease *N**, while rare species, contributing relatively more to *S*, tend to increase it. For a fixed *f* and a given species occupancy distribution (SOD), *N** is expected to depend on the way rare and common species are assembled in the community (nested pattern vs. turnover).