DAR (diversity–area relationship): Extending classic SAR (species–area relationship) for biodiversity and biogeography analyses

Abstract I extend the classic SAR, which has achieved status of ecological law and plays a critical role in global biodiversity and biogeography analyses, to general DAR (diversity–area relationship). The extension was aimed to remedy a serious application limitation of the traditional SAR that only addressed one aspect of biodiversity scaling—species richness scaling over space, but ignoring species abundance information. The extension was further inspired by a recent consensus that Hill numbers offer the most appropriate measures for alpha‐diversity and multiplicative beta‐diversity. In particular, Hill numbers are essentially a series of Renyi's entropy values weighted differently along the rare‐common‐dominant spectrum of species abundance distribution and are in the units of effective number of species (or species equivalents such as OTUs). I therefore postulate that Hill numbers should follow the same or similar law of the traditional SAR. I test the postulation with the American gut microbiome project (AGP) dataset of 1,473 healthy North American individuals. I further propose three new concepts and develop their statistical estimation formulae based on the new DAR extension, including: (i) DAR profile—z–q relationship (DAR scaling parameter z at different diversity order q), (ii) PDO (pair‐wise diversity overlap) profile—g–q relationship (PDO parameter g at order q, and (iii) MAD (maximal accrual diversity: D max) profile—D max‐q. While the classic SAR is a special case of our new DAR profile, the PDO and MAD profiles offer novel tools for analyzing biodiversity (including alpha‐diversity and beta‐diversity) and biogeography over space.

The Hill numbers, originally introduced as an evenness index from economics by Hill (1973) who was apparently inspired by Renyi's (1961) general entropy of order, has not received the attention it deserves until recent years. Chao et al. (2012) further clarified Hill's numbers for measuring alpha diversity as (with slightly different symbols and notations): where S is the number of species, q is the order number.
The Hill number is undefined for q=1, but its limit as q approaches to 1 exists in the following form: The parameter q determines the sensitivity of the Hill number to the relative frequencies of species abundances. When q=0, the species abundances do not count at all and 0 D=S, i.e., species richness. When q=1, 1 D equal the exponential of Shannon entropy, and is interpreted as the number of typical or common species in the community. When q=2, 2 D equal the reciprocal of Simpson index, i.e., which is interpreted as the number of dominant or very abundant species in the community (Chao et al. 2012).
The general interpretation of diversity of order q is that the community contains q D=x equally abundant species. Then, the diversity of a community can be measured with a series of Hill numbers, possibly plotted on a single graph as a 'continuous' function of the parameter q. Chao et al (2012) termed the series of plots "community diversity profile" that characterizes the species-abundance distribution of a community and offers complete information on its diversity. Since all Hill numbers are in units of species, and in fact, they are referred to as the effective number of species or as species equivalents, therefore, intuitively, Hill numbers should follow the same or similar pattern of SAR.
into the product of alpha and beta, in which both alpha and gamma diversities are measured with the Hill numbers. For example, Jost (2007) demonstrated that the partition of Hill numbers into independent alpha (within community) and beta (between communities) is necessarily multiplicative such as: 4) This beta diversity derived from the above partition takes the value of 1 if all communities are identical, the value of N (the number of communities) when all the communities are completely different from each other (there are no shared species). With Whittaker (1972) words, this beta diversity measures "the extent of differentiation of communities;" or with Jost (2007) words, "the effective number of completely distinct communities." Indeed, recent advances (Jost 2007, Ellison 2010, Chao 2012, Chao & Jost 2015 have made a convincing case that Hill numbers and multiplicative beta-partition offer to date the most generally consistent and appropriate, yet simple solution to investigate diversity. For this we stick to the Hill numbers and multiplicative beta-diversity in this study. To compute the Hill numbers of beta diversity with Eq. (4), we first must compute the Hill numbers of gamma diversity of multiple local communities (regional or metacommunity). Let us consider a fixed set of N local communities. Assume that there are S species in the pooled assemblages (meta-community), € y ij is the abundance of the i-th species in the j-th local community. Let be the total abundance in the meta-community, and let ∑ be the community size of the j-th local community. Then we may denote the value € y ij as € y ij = y ++ (y + j / y ++ )(y ij / y + j ) = y ++ w j p ij , where € p ij = y ij y + j is the relative abundance of the i-th species in the j-th community, and € w j = y + j y ++ is the relative community size or the weigh of the j- For the gamma community, we pool the species abundances across local communities, and let be the total value of the i-th species in the metacommunity.
Then the alpha diversity and gamma diversity of the meta-community can be computed as (5) & (6) respectively, When q=1, the corresponding q D γ , and their detailed formulae can be found in Chiu et al. (2014) and we omit them here.
In this article, we compute diversities until q=3, i.e., to the third order, which includes traditional species richness (q=0), the exponential of Shannon index (q=1), the reciprocal of Simpson index (q=2) and one additional set of indexes for q=3.

(ii) The DAR models and DAR profiles
See the main text, no supplements needed.

(iii) Sampling schemes to fit DAR models
Multiple sampling schemes or types of SAR have been devised to test the SAR theory including strictly nested (Type I), contiguous quadrats grid (Type-II), non-contiguous quadrats grid (Type-III), and areas of varying size (islands, Type-IV) (Scheiner et al. 2003(Scheiner et al. , 2011. The first three types are termed aggregate SARs and the fourth type is termed independent SAR. According to Whittaker & Triantis (2012), the most critical distinction is whether cumulative totals are computed across a set of areas or whether the actual number of species detected in each area is directly utilized for constructing SAR. Accordingly, some researchers refer to the former as the species accumulation curves (SACs) and the latter as island species area-relationships (ISARs). We use the more general and popular term species-area relationship (SAR) in the study, although our approach belongs to the former category (i.e., SAC). It is noted that island (Type-IV) appears to be a reasonable sampling scheme in our study, but the islands in the human microbiome are obviously of approximately the same size. According to the above-mentioned Whittaker & Triantis (2012) distinction, the island model is hardly applicable to the study of human microbiome.
In this study, we adopt the Scheiner (2003Scheiner ( , 2011 Type-III sampling scheme, non-contiguous quadrats grid. Strictly speaking, quadrats (the same size) are arranged in a regular but noncontiguous grid. Since the quadrat (individual subject), i.e., our body (habitats for our microbiome) is highly mobile, figuring out a definite spatial coordinate relationship, not to mention a grid of quadrats, is hardly possible. Nevertheless, Type-III is obviously the most natural choice we can make. To deal with the lack of spatial order (grid), we adopt the following three-step scheme.
Step (a): generate all possible sequences of the subjects, i.e., the total permutations of all subjects in the HMP dataset.
Step (b): randomly choose 100 (or 1000) sequences (as 100 or 1000 repetitions) from the total permutations, and each (sequence) of the 100 (1000) sequences of the subjects is treated as a 'grid' of subjects. Each 'grid' of subjects is utilized to accumulate diversity and fit a DAR curve.
Step (c): the average parameters from the 100 (1000) repetitions (DAR curves) are adopted as the model parameters of the final DAR model selected for the dataset.
Here, we further explain our sampling scheme for constructing DAR models. As explained previously, unlike most studies in macro-ecology, where there is often a natural spatial sequence (or arrangement) among the communities sampled, the community samples in our HMP data are pretty much 'random.' In other words, there is not a naturally occurring spatial sequence (arrangement) among the communities of the HMP samples we use for modeling DAR. To avoid the potential bias from an arbitrary order of the community samples, we totally permutated the orders of all the community samples under investigation, and then randomly choose 100 (or 1000) orders of the communities generated from the permutation operation. That is, rather than taking an arbitrary order for accruing community samples in one-time fitting to the DAR model, we repeatedly perform the DAR model-fitting 100 (1000) times, each time the community samples used to build the DAR model was randomly chosen from the total permutation of all the community samples under investigation. It is noted that the 100 (1000) times of sampling from the total permutation were conducted randomly without repetition (replacement), i.e., a specific permutation is used at most once for DAR model fitting. Finally, the averages of the model parameters from the 100 (1000) times of DAR fittings are adopted as the model parameters of the DAR for the set of community samples under investigation.
Occasionally, there may be the cases when the DAR model parameters do not make sense ecologically. For example, in the case of PLEC model [eqn. (6)], the sign of parameter z and d must be opposite to have A max >0 [eqn. (11)], otherwise, A max <0 the model does not have an extreme value. In this kind of exceptional case, we simply discard the specific permutation that generated the exception, and add another round of model fitting by randomly taking another order of samples from the total permutation; the additional fittings may be repeated as needed until a positive exponent is obtained. Note that we only remove exceptions that do not make sense ecologically such as A max <0 (negative number of accrual is not possible). For another example, again in the case of PLEC, negative z and positive d are possible for DAR beyond species richness (q=0), which are considered as valid permutations.
(iv) The accrual of diversities to fit DAR models After selecting the sampling schemes, we need to specify the scheme to accrue diversity based on the diversity formulae listed above [Eqns.
(1-6)]. Although the accumulation of species in traditional SAR is well defined and there is no ambiguity on how the species counts are computed once the scheme for area accrual is decided, the computation of the diversity accrual is still largely an uncharted area, and there may be more than one scheme to accumulate diversity across space, especially for the accrual of beta-diversity.
To devise what we believe to be the most appropriate and also natural methods to accrue diversity, we follow the following three principles. The first is to use the Hill numbers, or what Jost (2007) termed the true diversity; the second is to follow the essence of SAR, as captured by the word "accumulation" or "aggregate," i.e., species (diversity) are accumulated for the accrued areas; the third is that the diversity scaling model should be useful for predicting diversity at different levels of areas accumulated. We consider these three principles as axioms in traditional SAR and we believe that any extension from SAR to DAR should not violate them.
The last principle is a major reason why SAR has been a central theme of both community ecology and conservation biology, and is obviously important for understanding the biogeography of human microbiome. One important advantage for us to stick to the three principles, which are embodied in the traditional SAR theory, is that our new DAR should inherit many of the insights and applications that traditional SAR has reveled and offered. Based on these three principles, we construct the following procedures to accumulate diversity over areas.
To accrue alpha diversity across areas, at each accrual step, we first simply add up all the OTU lists (rows) in the OTU table of the communities up to the accrual step. Assuming there are N communities, at accrual step i=1, 2,…, N-1, we simply add up the rows in the standard OTU table corresponding to the first i communities, until the last row is added when the accrual is completed. For each of the aggregate (accumulated) community, we compute its alpha-diversity with Eq. (1) or Eq.
(2) (for q=1) with the added-up OTU lists for the accrual step. The resulted pairs of accrued alpha diversities and areas are regressed to fit the DAR model.
An alternative accrual scheme is to use Eq. (5), but the output with Eq. (5) would be a measure of average single community (local community) diversity, which is not a cumulative quantity and is hence not a measure that we are interested in to investigate DAR relationship. This is because Eqn. (5) is constrained by its mission to partition gamma diversity [Eq. (6)] multiplicatively into alpha diversity [Eq. (5)] and betadiversity [Eq. (4)]. Because it is a measure of the average alpha diversity of the local communities that constitute the regional meta-community, rather than the alpha-diversity of aggregate (pooled or accumulated) communities, therefore, this alternative scheme contradicts the second and third principles we identified above.
Yet another alternative accrual scheme is to use Eq. (6), which calculates the gamma diversity of pooled (aggregate or accumulated) regional or meta-community. Although conceptually it computes the gamma diversity of the aggregate community, it is actually equal to the alpha diversity we calculated for the aggregate community with Eqns. (1) or (2) (when q=1) when the community weight (w j ) is equal for all N communities (i.e., w j =1/N). Since the equal weight community assumption is largely true even if the weights are not exactly equal, we can expect that the difference between the results from Eqns. (1) and (6) should not to have significant influence on DAR modeling results. Therefore, we do not see an advantage with this alternative accrual scheme, especially the potential confusion in terminology. In this case, whether it is called alpha or gamma diversity matters little, and from the perspective of SAR (DAR), we prefer to use the term alpha-diversity.
To the best of our knowledge, scaling of beta-diversity has not been approached in the existing literature. Therefore, we have a benefit to prioritize the potential utilization that the beta-diversity scaling relationship may possess. We believe that the three axioms (principles) we identified above should also guide the extension of SAR to beta-DAR. Furthermore, the capability to predict beta-diversity of aggregate community, as designated in principle 3, is also the priority that the beta-DAR should pursue, in our belief. Through trialand-error exploration, we found that the following scheme for accruing beta-diversity over areas best satisfy the three principles we set for developing beta-DAR.
Formally, to accrue beta-diversity across areas, we start the computational procedure with two local communities (samples) by using the formulae specified by Eq. (4)-(6), from which the first beta-diversity value is computed for the two initial communities. For each newly added community (sample) at the accrual step i (i=3, 4, … N), we simply run the same computation procedure (same equations) with 3, 4, … N samples, until all N communities are accrued for their beta-diversity. With each newly added local community, we obtain a new Hill numbers of beta-diversity, until all N communities are accrued for their beta-diversity. The series of beta-diversities are regressed with their respective areas accrued at each step. Obviously, this accrual scheme calculates the accumulated beta-diversity of N communities (of individuals in the case of human microbiome). It is also the maximal difference among N communities in terms of betadiversity.

(v) Predicting MAD (Maximal Accrual Diversity) with PLEC-DAR models
No supplement to this sub-section is need.
(vi) Self-similarity property of PL-DAR and the prediction of pair-wise diversity overlap No supplement to this sub-section is need.
Supporting information to Section of "Results" (Tables S1-S3):  Table S3. Fitting the beta-Diversity Area Relationship (beta-DAR) with 100 times of sampling of the 1473-Subjects AGP datasets (without removing any failed fitting cases to calculate the statistics of model parameters)