### Introduction

Beta diversity provides a link between species diversity in a region (gamma diversity) and species diversity within sites (alpha diversity) (Whittaker 1960, 1972). However, there is no consensus as to what precisely constitutes beta diversity, and therefore, many metrics for its estimation have been proposed (Koleff, Gaston & Lennon 2003; Jost 2006; Tuomisto 2010a,b; Anderson *et al*. 2011), each quantifying a different aspect of it (Anderson *et al*. 2011).

General definitions of beta diversity produce a beta with a hidden alpha dependency (Jost 2007). However, beta should be independent of mean alpha (Baselga 2010a; Ricotta 2010; Veech & Crist 2010; Chase *et al*. 2011), although there might be empirical cases where a statistical dependence is observed (Jost 2010). If a beta diversity measure is by definition dependent on alpha, then it fails to discern between spatial turnover and nestedness (Baselga 2010b; Baselga & Orme 2012). However, there is no consensus about the appropriate way of estimating beta diversity without the outcomes being influenced by the interrelationship between alpha, beta and gamma (Jost 2006, 2010; Baselga 2010a; Ricotta 2010).

Many beta diversity indices are sensitive to variation in sampling effort expressed in number of sampling units (Lennon *et al*. 2001; Kallimanis *et al*. 2008). The classic Jaccard and Sørensen indices of compositional dissimilarity are notoriously sensitive to sampling effort, especially for assemblages with numerous rare species (Chao *et al*. 2005; Cardoso, Borges & Veech 2009; Chase *et al*. 2011). Therefore, estimates based on different datasets or regions are not directly comparable.

In this paper, we propose a new measure of beta diversity, named *N**. We will consider the first observation of any species as ‘useful’ information and any subsequent observation of the same species as ‘redundant’ (for estimating richness, but informative of beta diversity). Obviously, all the species recorded from the first drawn sample contribute to the species richness, while the species of subsequent samples can either add to the species richness or the redundant set of already found species. Up to a specific number of sampling units (hereafter referred to as samples), the accumulated number of species is higher than the number of species occurrences in the redundant set (Fig. 1a). Taking one more sample, the cumulative number of species in the redundant set becomes higher than the accumulated species richness, and therefore, there is a point where these are equal. We define *N** as the average sampling effort at which the total number of species occurrences in the redundant set becomes equal to the species richness. We explore and test the basic properties of *N** on simulated data and examine its applicability on a dataset of plant diversity from Greek Natura 2000 protected areas.

#### Definitions and assumptions

Species richness is a nonadditive variable when aggregating across samples (He & Legendre 2002). In the simple case of two samples *A*_{1} and *A*_{2} containing *S*(*A*_{1}) and *S*(*A*_{2}) species, respectively, the total species richness is *S* = *S*(*A*_{1} ∪ *A*_{2}) = *S*(*A*_{1}) + *S*(*A*_{2}) − *S*(*A*_{1} ∩ *A*_{2}) (see Table 1 for notation). Setting *R* = *S*(*A*_{1}) + *S*(*A*_{2}) and *I* = *S*(*A*_{1} ∩ *A*_{2}), we get *S* = *R* − *I* (eqn 1), where *R* is the total number of species occurrences and *I* is the number of shared species occurrences. With more samples, the analytic expression for *I* becomes complicated, but eqn 1 still holds, so for *n* samples, *I*(*n*) = *R*(*n*) − *S*(*n*). For large *n*, the total number of species occurrences is *R*(*n*) = *ān* (*ā* = average number of species per sample), and the species richness *S*(*n*) is the average species accumulation curve (SAC), a concave function of *n* (power and logarithmic functions are the most frequently used, e.g. Kagiampaki *et al*. 2011; Triantis, Guilhaumon & Whittaker 2012). Consequently, as *R*(*n*) is a linear function of *n* and *I*(*n*) = *R*(*n*) − *S*(*n*), *I*(*n*) is a convex function of *n* (Fig. 1a). *I*(*n*) is a measure of the overlap of species distributions among the *n* samples. It increases monotonically with *n*, and its value has no upper bound.
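As a minimal illustration of eqn 1, the sketch below counts *S*, *R* and *I* for samples stored as Python sets. The function name `overlap_counts` and the species labels are hypothetical, introduced only for this example.

```python
def overlap_counts(samples):
    """Return (S, R, I) for a list of samples, each given as a set of species."""
    richness = len(set().union(*samples))        # S: number of distinct species
    occurrences = sum(len(s) for s in samples)   # R: total species occurrences
    shared = occurrences - richness              # I = R - S (eqn 1)
    return richness, occurrences, shared


# Two hypothetical samples sharing two species (sp2 and sp3).
a1 = {"sp1", "sp2", "sp3"}
a2 = {"sp2", "sp3", "sp4", "sp5"}
print(overlap_counts([a1, a2]))  # (5, 7, 2)
```

For two samples, *I* equals the size of the intersection; for more samples, the same subtraction still applies even though the set expression for *I* is no longer a single intersection.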

Table 1. Notations of the parameters and indices used in this article

| Notation | Parameters of species presence–absence matrix |
|---|---|
| *N* | Total number of samples |
| *n* | Number of samples (sampling effort) |
| *γ* | Total species richness, gamma diversity |
| *α* | Species richness per sample, alpha diversity |
| *ā* | Mean species richness per sample, mean alpha diversity |
| *S(n)* | Species richness of *n* samples |
| *R(n)* | Total number of species occurrences of *n* samples |
| *I(n)* | Total number of species occurrences shared in *n* samples |
| *f* | Proportion of the species occurrences matrix that is occupied, the inverse of Whittaker's beta |
| *p*_{i} | Proportion of new species expected in the *i*th sample |
| **Notation** | **Indices of beta diversity** |
| *β*_{w} | Whittaker's original index of beta diversity, *γ/ā* |
| *β*_{−1} | Harrison's index, *(γ/ā − 1)/(N − 1)* |
| *N** | The proposed index |
| *N*_{R}* | Theoretically expected *N** for random distribution of species |

The degree of overlap among species distributions is an important aspect of beta diversity, and *I* can serve as an index of it after a proper rescaling into a conventional range, say 0–1. Doing so, one derives Harrison's beta *β*_{−1} = (*γ/ā* − 1)/(*n* − 1) (Harrison, Ross & Lawton 1992), originally proposed as a modification of Whittaker's beta *β*_{w} = *γ/ā* to correct for the strong dependence of the latter on *n* (Chiarucci *et al*. 2003). Some ways to derive *β*_{−1} from *I* are provided in an appendix.

The problem is that *I* is an unbounded convex function of *n*, and *β*_{−1} attempts to scale it onto the unit interval (0–1). The symptom of this nonlinear scaling is that the scaled index *β*_{−1} is still dependent on *n*: as *n* increases, *β*_{−1} decreases. As gamma diversity is a finite number, in the limit *n* → ∞, *β*_{−1} tends to zero (see Vellend 2001 for a similar argument).
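The dependence of *β*_{−1} on *n* can be illustrated numerically. The sketch below assumes a hypothetical region with *γ* = 100 and *ā* = 20 (values chosen only for illustration) and evaluates Harrison's index for increasing *n*; the function names are illustrative.

```python
def whittaker_beta(gamma, mean_alpha):
    """Whittaker's beta: gamma diversity divided by mean alpha diversity."""
    return gamma / mean_alpha


def harrison_beta(gamma, mean_alpha, n):
    """Harrison's beta: (gamma / mean_alpha - 1) / (n - 1), for n > 1 samples."""
    return (gamma / mean_alpha - 1) / (n - 1)


# Hypothetical values: a finite species pool and a fixed mean alpha.
gamma, mean_alpha = 100, 20
for n in (5, 10, 50, 500):
    print(n, whittaker_beta(gamma, mean_alpha), round(harrison_beta(gamma, mean_alpha, n), 4))
```

With *γ* and *ā* held fixed, the printed values of *β*_{−1} shrink towards zero as *n* grows, while the ratio *γ/ā* stays at 5.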

#### The *N** index

Instead of rescaling *I*, we can derive an *I*-based index of species overlap by taking the value of *I* at a characteristic, well-defined sampling effort (*N**) at which a certain criterion is fulfilled. We define *N** as the average sampling effort required so that *I*(*n*) = *S*(*n*), that is, the intersection point of the average *S(n)* and *I(n)*, as shown in Fig. 1a. Up to *N**, samples contribute cumulatively more to the increase of *S* than to the increase of *I*. The opposite happens beyond that point.

From *I*(*n*) = *R*(*n*) − *S*(*n*) = *ān* − *S*(*n*), it follows that for *n* = *N** we have *I*(*N**) = *S*(*N**) ⇒ *S*(*N**) = *āN** − *S*(*N**) ⇒ *S*(*N**) = *āN**/2. That is, *N** is also defined as the intersection point of *S(n)* with the line *y = (ā/2)n* (Fig. 1b). We use this latter definition for the estimation of *N** from data, as there are well-documented methods for the estimation of the SAC (Colwell *et al*. 2012) and many implementations of those methods exist [EstimateS (Colwell 2012) and a number of R functions]. As an electronic supplement (Appendices S3, S4), we provide the R (R Development Core Team 2012) function ‘Nstar’, which estimates *N** based on the *specaccum* function of the vegan package (Oksanen *et al*. 2012). An example of its use is given in Fig. 1b with the dune data included in vegan (Jongman, Ter Braak & Van Tongeren 1987). *Specaccum* estimates the average and the standard deviation of *S(n)*. The standard deviation defines an upper and a lower bound of the SAC, which intersect the line *y* = *(ā/2)n* and so provide a lower and an upper bound for *N** (Fig. 1b). Note that the standard deviation of the SAC does not directly translate into a standard deviation of *N**. However, the bounds of *N** could be used as a measure of uncertainty. The bounds of *S(n)* depend on the variation in alpha, which eventually affects the upper and lower bounds of *N**, but not its average value.
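The intersection definition can be sketched in a few lines of Python. This is not the ‘Nstar’ function of the supplement: it assumes a presence–absence matrix with species in rows and samples in columns, uses the analytic expectation of the SAC under sampling without replacement in place of permutation-based averaging, and locates the crossing with *y = (ā/2)n* by linear interpolation between consecutive integer *n*. The names `mean_sac` and `n_star` are hypothetical.

```python
from math import comb


def mean_sac(matrix):
    """Expected richness S(n), n = 1..N, for samples drawn without replacement.

    matrix: list of species rows, each a list of 0/1 values over N samples.
    """
    n_samples = len(matrix[0])
    occupancies = [sum(row) for row in matrix if sum(row) > 0]
    sac = []
    for n in range(1, n_samples + 1):
        total = comb(n_samples, n)
        # A species with k occurrences is missed with probability C(N-k, n)/C(N, n).
        sac.append(sum(1 - comb(n_samples - k, n) / total for k in occupancies))
    return sac


def n_star(matrix):
    """Interpolated n at which S(n) meets (mean alpha / 2) * n, or None if not reached."""
    n_samples = len(matrix[0])
    mean_alpha = sum(sum(row) for row in matrix) / n_samples
    sac = mean_sac(matrix)
    prev_diff = sac[0] - mean_alpha / 2          # positive, because S(1) equals mean alpha
    for n in range(2, n_samples + 1):
        diff = sac[n - 1] - mean_alpha / 2 * n
        if diff <= 0:
            # Linear interpolation between n - 1 and n (an assumption of this sketch).
            return (n - 1) + prev_diff / (prev_diff - diff)
        prev_diff = diff
    return None                                  # the line has not yet reached S(n)


identical = [[1, 1, 1, 1]] * 3                   # three species present in all four samples
distinct = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(n_star(identical))  # 2.0
print(n_star(distinct))   # None
```

The two toy matrices are limiting cases: when every sample holds the same species the curves meet at *n* = 2, and when no species is shared the line never reaches *S(n)* within the available samples.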

There are cases where *N** is not estimable from the available data, as shown in Fig. 1c, where we used only four of the 20 samples of the dune dataset to estimate *N**. In this example, the line *y* = *(ā/2)n* has not yet exceeded *S(n)* and the intersection point has not been reached. However, there are subsets of four samples for which *N** was estimated to be less than four, an underestimate of the true *N** value. For a reliable estimation of *N**, a ‘sufficient’ number of samples is required. Fig. 2 shows the way *N** and *ā* depend on *n*. For any given number of samples *n* (3–19), we formed 500 random subsets and estimated the average *N** and *ā* from only those subsets where *N** was estimable. Using three samples, *N** was estimable in only 4·6% of the sets, and its average value was 2·74. The corresponding estimate of *ā* was very high (11·6), indicating that *N** was estimable only when the species-rich samples were included. Using more than seven samples, all subsets provided an estimate, and with ten or more, both *ā* and *N** converged to a value. Furthermore, the use of subsets allows for the estimation of meaningful statistics of spread. We used this approach for the estimation of the average, standard deviation and quartiles of *N** in all our empirical and simulated datasets.

*N**, defined as the intersection point between *y *= (*ā/2*)*n* and *S(n)*, will always converge to a value as long as *ā* and *S(n)* stabilize. The convergence of *ā* is also required for the estimation of *β*_{w} and other related indices. The number of samples required obviously depends on the variation in alpha.

The intersection point between *S(n)* and *y = ān/2* exists if *S(n)* has an upper bound, or even if *S(n)* is a linear function of *n* with a slope less than *ā/2*, as *S(1) = ā > ā/2*. However, there might be cases where the two curves do not intersect. An extreme case is when all sites are completely distinct and there is an infinite number of species. Then, *I(n) = 0*, *S(n) = R(n) = ān*, *β*_{w} = *n*, *β*_{−1} = 1, and *N** is always higher than *n*. When dealing with an infinite species pool, both *β*_{w} and *N** tend to infinity. If, however, the species pool (*γ*) is finite, then, provided *n* is large enough to reach *γ*, the lines *y = γ* and *y = (ā/2)n* intersect at *N** = 2*γ*/*ā* = 2*β*_{w}. This is the highest possible value of *N**. The lowest possible value of *N** is 2 and is obtained when *S(n)* = *ā*, that is, when all samples contain exactly the same species.

##### Factors expected to affect *N**

*N** increases with the slope of *S(n)* relative to *ā*. For *S(n)*, we can write the recursive formula *S(n) = S(n − 1) + p*_{n}*α*_{n}, where *α*_{n} is the number of species in the *n*th sample and *p*_{n} is the proportion of new species expected among those *α*_{n}. Although *α*_{n} is not constant, for large enough *n* and under a randomization procedure for estimating *S(n)*, *α*_{n} = *ā = S(1)*, and the equation *S(n) = S(n − 1) + p*_{n}*ā* has the solution $S(n) = \bar{a}\sum_{i=1}^{n} p_i$, given that *p*_{1} = 1. At *n* = *N**, where *S*(*N**) = (*ā*/2)*N**, this gives $N^{*} = 2\sum_{i=1}^{N^{*}} p_i$;

that is, *N** depends only on the *p*_{i}'s and consequently on *β*_{w} estimated at *n* = *N**. This means that *N** is not mathematically confounded with alpha diversity but depends on the proportion of new species expected among the species of the next collected sample. In summary, *N** is the sampling effort at which *S*(*N**) = *I*(*N**) = (*ā*/2)*N**, and its value equals twice Whittaker's index calculated using *N** samples.

*N** accounts for the overlap of species distributions, so it is expected to depend on the fill (*f*) of the species presence matrix. The fill is the proportion of the matrix that is occupied and equals the inverse of the *β*_{w} index (*f = āN/SN = ā/S*). Assuming a random pattern of species distribution among samples, the value of *N** (denoted *N*_{R}*) depends only on *f*, through a recursive formula that is derived analytically (Appendix II); a good initial value for starting the recursion is *N*_{R}* = 0·5 + 1·6/*f*. The ratio *N**/*N*_{R}* standardizes for the effect of fill, so its values depend on factors affecting *N** other than the fill. *N**/*N*_{R}* actually reflects the degree to which the species assemblage deviates from randomness.
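One way to see why a recursion arises is sketched below. It assumes that each species occurs in each sample independently with probability *f* (an assumption made here for illustration, not necessarily the derivation of Appendix II), so that the expected SAC is *S(n)* = *γ*[1 − (1 − *f*)^{n}] with *ā* = *fγ*. Setting *S(N)* = (*ā*/2)*N* then gives the fixed-point equation *N* = (2/*f*)[1 − (1 − *f*)^{N}], which the code iterates from the starting value 0·5 + 1·6/*f*. The name `n_star_random` is hypothetical.

```python
def n_star_random(fill, tol=1e-10, max_iter=1000):
    """Fixed-point solution of N = (2 / f) * (1 - (1 - f) ** N) for a fill f in (0, 1].

    Assumes independent occurrences with probability f (an illustrative model only).
    """
    if not 0 < fill <= 1:
        raise ValueError("fill must lie in (0, 1]")
    n = 0.5 + 1.6 / fill                      # starting value quoted in the text
    for _ in range(max_iter):
        n_next = (2 / fill) * (1 - (1 - fill) ** n)
        if abs(n_next - n) < tol:
            return n_next
        n = n_next
    return n                                  # last iterate if tolerance was not met


for f in (1.0, 0.5, 0.1):
    print(f, round(n_star_random(f), 2))
```

Under this model a completely filled matrix (*f* = 1) returns 2, the lowest possible value, and for small *f* the solution stays close to the suggested starting value.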

For a fixed *f*, the degree of species overlap depends on the way the species are distributed and eventually on the variation in species occupancy. If a species is recorded in *k* of *n* samples, then it contributes 1 to *S*,* k* to *R* and *k − 1* to *I*. Widely distributed species, contributing mostly to *I*, tend to decrease *N** while rare species, contributing relatively more to *S*, tend to increase it. For a fixed *f* and a given species occupancy distribution (SOD), *N** is expected to depend on the way rare and common species are assembled in the community (nested pattern vs. turnover).

### Materials and methods

To explore the effect of community structure on *N**, we used ‘ideal’ patterns of community organization including (a) nested subsets, (b) ‘random’ patterns, (c) Clementsian gradients and (d) Gleasonian gradients, following Leibold & Mikkelson (2002). We produced the patterns keeping the total number of species (*S*), the total number of samples (*N*) and the fill (*f*), and hence *ā*, constant.

In the nested pattern, sites with lower species richness constitute subsets of species-richer samples (Patterson & Atmar 1986). In our simulation, the first species occupies the whole range of samples (*N*). Any subsequent species with rank (*r*) from 2 to *S* occupies the samples ranging from 1 to *n*_{r} where *n*_{r}* = 1 + (N − 1)*e^{dr} with *d *<* *0, rounded to the nearest integer. That is, species ranges decline exponentially with species rank, but even the rarest species has at least one occurrence.
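The nested construction can be sketched as follows, assuming a matrix with species in rows and samples in columns. The function name `nested_matrix` and the example value of *d* are hypothetical; in practice *d* would have to be tuned to reach a target fill. Python's `round` rounds exact halves to the nearest even integer, which may differ from other rounding conventions.

```python
from math import exp


def nested_matrix(n_species, n_samples, d):
    """Nested presence-absence matrix: species of rank r occupies samples 1..n_r."""
    if d >= 0:
        raise ValueError("d must be negative")
    matrix = [[1] * n_samples]                       # rank 1 occupies every sample
    for r in range(2, n_species + 1):
        n_r = round(1 + (n_samples - 1) * exp(d * r))
        matrix.append([1] * n_r + [0] * (n_samples - n_r))
    return matrix


m_nested = nested_matrix(10, 20, -0.3)               # hypothetical parameter values
for row in m_nested:
    print("".join(str(v) for v in row))
```

Because every species starts at the first sample and ranges only shrink with rank, each row is a subset of the row above it.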

In the Gleasonian gradient, species composition changes smoothly across a gradient and there is a constant degree of species overlap. Each species occupies exactly *n* samples. The first species occupies the first *n* samples and each subsequent species is shifted by *m* places with *m < n*; that is, each species and the next one in the matrix coexist in *n − m* samples.
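A minimal sketch of this construction follows. It assumes no wrap-around at the end of the gradient, so the number of samples is derived as *n* + (*S* − 1)*m* rather than fixed in advance; this choice, and the name `gleasonian_matrix`, are assumptions of the example.

```python
def gleasonian_matrix(n_species, span, shift):
    """Each species occupies `span` consecutive samples, shifted by `shift` per species."""
    if not 0 < shift < span:
        raise ValueError("shift must satisfy 0 < shift < span")
    n_samples = span + (n_species - 1) * shift       # no wrap-around (assumption)
    matrix = []
    for r in range(n_species):
        row = [0] * n_samples
        start = r * shift
        row[start:start + span] = [1] * span
        matrix.append(row)
    return matrix


m_gleason = gleasonian_matrix(5, 4, 1)               # hypothetical parameter values
for row in m_gleason:
    print("".join(str(v) for v in row))
```

Consecutive rows share exactly `span − shift` samples, which is the constant overlap described above.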

The Clementsian gradient is characterized by compartmentalization in species composition resulting in discrete communities. We simulated the pattern setting the number of communities to *k*. Each community includes *S/k* species, each species occupies exactly *N/k* samples and all samples in the community have identical species composition.
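The Clementsian pattern corresponds to a block-diagonal matrix, sketched below with species in rows and samples in columns. The sketch assumes that both *S* and *N* are divisible by *k*; the name `clementsian_matrix` is hypothetical.

```python
def clementsian_matrix(n_species, n_samples, k):
    """Block-diagonal matrix with k discrete communities of equal size."""
    if n_species % k or n_samples % k:
        raise ValueError("n_species and n_samples must be divisible by k")
    species_per_block = n_species // k
    samples_per_block = n_samples // k
    matrix = []
    for r in range(n_species):
        block = r // species_per_block               # community of species r
        row = [0] * n_samples
        start = block * samples_per_block
        row[start:start + samples_per_block] = [1] * samples_per_block
        matrix.append(row)
    return matrix


m_clements = clementsian_matrix(6, 9, 3)             # hypothetical parameter values
for row in m_clements:
    print("".join(str(v) for v in row))
```

With equal-sized blocks the fill of the matrix is 1/*k*, so the number of communities also fixes *f* in this idealized case.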

In the random pattern, species were randomly allocated to samples, while controlling for the total number of species, samples and matrix fill.
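One possible way to generate such a matrix is sketched below. To keep the number of species fixed, it first gives every species one occurrence in a randomly chosen sample and then fills the remaining cells at random until the target fill is reached. This two-step scheme is an assumption of the example, not a description of the procedure actually used, and it does not prevent empty samples at very low fill; the name `random_matrix` is hypothetical.

```python
import random


def random_matrix(n_species, n_samples, fill, seed=None):
    """Random presence-absence matrix with a fixed number of occupied cells."""
    rng = random.Random(seed)
    n_cells = round(fill * n_species * n_samples)
    if not n_species <= n_cells <= n_species * n_samples:
        raise ValueError("fill is incompatible with one occurrence per species")
    matrix = [[0] * n_samples for _ in range(n_species)]
    for row in matrix:                               # one guaranteed occurrence per species
        row[rng.randrange(n_samples)] = 1
    empty = [(r, c) for r in range(n_species) for c in range(n_samples)
             if matrix[r][c] == 0]
    for r, c in rng.sample(empty, n_cells - n_species):
        matrix[r][c] = 1                             # remaining occurrences placed at random
    return matrix


m_random = random_matrix(30, 20, 0.2, seed=1)        # hypothetical parameter values
print(sum(map(sum, m_random)) / (30 * 20))           # realized fill
```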

From each pattern, we produced a series of variants differing in the distribution of row and column totals by implementing row or column transformations. The transformations affect the exact position of species occurrences in the target row/column but maintain either the row or the column totals, respectively. Row transformations change the distribution of species among samples, leaving the species occurrence distribution (SOD) unaffected, while column transformations change the SOD but preserve the species richness per sample (*α*). The variation of column and row totals was assessed by Shannon's equitability index (*J*). The index assumes values between 0 (extreme differentiation) and 1 (no variation) among column/row totals.
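For reference, the equitability of a set of totals can be computed as *J* = *H*/ln *K*, where *H* is the Shannon entropy of the totals expressed as proportions and *K* is the number of rows or columns. The sketch below assumes this standard form of the index and a matrix with species in rows and samples in columns; the name `equitability` and the toy matrix are hypothetical.

```python
from math import log


def equitability(totals):
    """Shannon equitability J = H / ln(K) of a list of non-negative totals."""
    if len(totals) < 2:
        raise ValueError("at least two totals are required")
    grand_total = sum(totals)
    proportions = [t / grand_total for t in totals if t > 0]
    shannon = -sum(q * log(q) for q in proportions)  # zero totals contribute nothing
    return shannon / log(len(totals))


# Toy matrix: rows are species, columns are samples.
toy = [[1, 1, 1, 1],
       [1, 1, 0, 0],
       [1, 0, 0, 0]]
row_totals = [sum(row) for row in toy]               # species occupancies
col_totals = [sum(col) for col in zip(*toy)]         # species richness per sample
print(round(equitability(row_totals), 3), round(equitability(col_totals), 3))
```

Equal totals give *J* = 1, and *J* approaches 0 as a single row or column comes to dominate the occurrences.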

As *f* is an important determinant of the average species overlap, we also produced a series of matrices differing in *f*. The variation in *f* was generated by varying *ā*, keeping *S* constant. We used multiple regression to determine the dependence of *N** on *f*,* ā*,* S*, the variation in alpha (column totals) and the variation in species occupancy (row totals). For the multiple regression, we used the GLM module in R (Gaussian family with identity link), with a stepwise backward selection based on AIC.

The degree of nestedness was estimated by the *NODF* nestedness metric (Almeida-Neto *et al*. 2008), using ANINHADO (Guimarães & Guimarães 2006). Higher values of *NODF* imply more nested matrices.

#### Empirical data

We used data on vascular plants obtained from the project ‘Identification and Description of Habitat Types in Areas of Interest for the Conservation of Nature, 1999–2001’ for the design of the Greek Natura 2000 network of protected areas (209 sites) to test the performance of *N**. The data refer to the species presences in each sample. In our analysis, 5651 samples (relevés) were considered, with surface area ranging between 50 and 400 m^{2}, containing a total of 4063 species or subspecies of higher plants. We explored the effect of grain size on *N**, that is, the surface area of the plot, using the whole dataset (Greece), and the effect of extent using only samples of surface area 100 m^{2} for Greece and its 13 phytogeographical regions according to Flora Hellenica (Strid & Tan 1997).

### Discussion

In this paper, we advocate an index of beta diversity, *N**, which is conceptually different from most beta diversity indices in use and exhibits conceptual affiliations to Whittaker's (1972) original concept of half changes and Soininen, McDonald & Hillebrand's (2007) halving distance. It is defined as the average sampling effort at which the total number of species occurrences shared (*I*) becomes equal to species richness (*S*). It is estimated as the intersection point between the line *y* = *(ā/2)n* and *S(n)*, the average species accumulation curve (SAC). The point (*N**, *S*(*N**)) has the unique properties that *S*(*N**) = *I*(*N**) = (*ā*/2)*N** and $N^{*} = 2\sum_{i=1}^{N^{*}} p_i$, where *p*_{i} is the proportion of new species expected in the *i*th sample. The proposed index quantifies the differentiation among sites and is not mathematically confounded with alpha diversity.

*N** depends on the initial slope of *S(n)* compared to *ā*, a relationship recognized as an indicator of beta diversity by Olszewski (2004). Thus, *N** can be estimated even when species richness does not approach an asymptote, as is the case for tropical insects. This is an important property, as in most cases the total species richness of any taxon present in any region remains unknown. This does not imply that similar *N** values are expected when comparing a species-rich region to a species-poor one having equal *ā* and *J*_{Row}. In the species-poor region, a higher average species overlap for the given *ā*, and thus a lower *N** value, is expected.

For a reliable estimation of *N**, both *ā* and the average *S(n)* should be accurately estimated, and therefore, a sufficient number of samples is required. Given this sufficient number of samples, *N** converges to a value that is not affected by the inclusion of more samples from the same region (extent). This property is very important, as it ensures that our index is more informative of the differentiation of the species assemblage than of *n* (Vellend 2001; Chao *et al*. 2005).

If the number of available samples is not sufficient, *N** is either not estimable at all (the *Nstar* function reports that ‘*n < N**’ in that case) or it is underestimated. An indication of the latter case is that the proportion of valid *N** estimations reported by the Nstar function is less than 100%. From our experience, a number of samples greater than 2*N** is enough for *N** to converge. If *N** > *n*/2, an underestimation is possible. In any case, to ensure that *N** is correctly estimated, the use of random subsets, as in the example illustrated in Fig. 2, is recommended. Even in cases where *N** is not estimable due to undersampling, extrapolation of the sample-based SAC can help to estimate it (Colwell *et al*. 2012). However, extrapolation of SACs might be unreliable (Olszewski 2004), and even if it is not, *ā* must be reliably estimated.

When using *N** to compare different regions, the sampling units should be of equal area, equal time duration or otherwise standardized (e.g. phytosociological units), as the grain size is the measurement unit of *N** and is expected to affect assemblage structure (Mac Nally, Fleishman & Bulluck 2004). Given that the gamma diversity of a region is constant, as grain size increases, *ā*, *f* and the average species distribution overlap increase and *N** decreases. This was confirmed by the analysis of the empirical data.

The strong relation between *N** and *1/f* = *β*_{w}, observed in both simulated and empirical data, implies that our index shares some properties with the sample-based estimate of Whittaker's *β*_{w}. However, for a given *f*, the value of *β*_{w} is fixed, not affected by the spatial covariation of species distribution (Lira-Noriega *et al*. 2007; Arita *et al*. 2008). On the contrary, *N** is significantly correlated not only with *f* but also with *J*_{Row} (variation in species occupancy frequencies). *N** increases as species distribution overlap decreases; hence, it increases with turnover and decreases with nestedness, with randomly ordered data lying between these two extremes. This was confirmed by simulation analyses using species distribution patterns with known properties. In the nested pattern, *J*_{Row} is the highest and the average *N** value the lowest. In ideal Clementsian and Gleasonian patterns, all species have the same occupancy frequency (*J*_{Row} = 1) and therefore very similar *N** values, but slightly higher for the Gleasonian one, as the average species overlap of the latter is generally lower.

As, for a given *f*,* N** is strongly related with the SOD, factors affecting the SOD indirectly affect *N**. The form of the SOD is defined on a certain grain and extent of sampling (McGeoch & Gaston 2002), as happens with many other observed diversity patterns (Lennon *et al*. 2001; Kallimanis *et al*. 2008).

*N** is more affected by *f* than by the SOD. If the underlying pattern of species distribution is of interest, then *N**/*N*_{R}*** can be used. *N*_{R}*** is the *N** value of an ideal random pattern having the same fill as the data analysed. *N*_{R}*** depends only on *f*. So, the ratio *N*/N*_{R}*** depends mostly on *J*_{Row} and is more informative of the community structure; values close to one indicate randomness.

The effects of extent on *N** and *N**/*N*_{R}* are evident when comparing Greece with its phytogeographical regions. For a fixed grain, as the extent increases, so do gamma diversity and the average distance between randomly drawn samples, leading to an increase of compositional dissimilarity and thus to a higher *N**. In islands, *N** values were lower than those of adjacent continental regions, reflecting the higher nestedness of the former. With increasing extent, *J*_{Row} and *N**/*N*_{R}* decreased due to the decrease of the proportion of singletons and doubletons. *N**/*N*_{R}* was well below one in almost all cases except North East Greece, West Aegean Islands and North Central Greece, indicating deviation from randomness due to uneven species occupancies.

As the extent of the region is an important determinant of *N** (and of other diversity indices), the sampling within the region should be random and as even as possible, to represent the regional diversity. If most of the samples are drawn from a small sub-region, then they will not be a good reflection of the regional species diversity and they will underestimate *N**. Furthermore, random sampling, evenly distributed in the region, is necessary to account for the effect of spatial autocorrelation of species distributions on the average overlap and eventually on *N**.

In conclusion, we propose the use of *N** as an index of beta diversity, as it reflects the heterogeneity in species composition among different species assemblages, and of the ratio *N**/*N*_{R}* as a complementary index of community structure. *N** depends on *f* and *J*_{Row}, while *N**/*N*_{R}* depends mostly on *J*_{Row}. If a sufficient number of samples is available for its estimation, *N** is not affected by the inclusion of further samples from the same region. Thus, it allows direct comparisons among regions with different sampling effort, as is usually the case with readily available datasets of biodiversity. It shares some properties with Whittaker's index *β*_{w}, but unlike *β*_{w} it is affected by the variation in species occupancies among samples and eventually by the SOD, a property that needs to be considered carefully in beta diversity studies.