Accounting for uncertainty in species delineation during the analysis of environmental DNA sequence data




1. Defining species boundaries represents a significant challenge in biodiversity studies, especially as these studies increasingly rely on high-throughput DNA sequencing technologies. A promising approach for delineating species in environmental sequence data combines phylogenetics and coalescence theory to estimate species boundaries from distributions of lineage birth rates within multispecies coalescent trees.

2. Existing methods for interpreting these models utilize hypothetico-deductive reasoning to identify thresholds associated with a mixed speciation-coalescent model that fits the data better than a null model. Here, I describe an alternative approach that ranks and assigns weights to models based on their fit to the data using information criteria and then uses model averaging to estimate parameters and species probabilities.

3. This approach is applied to data from two independent studies that address (i) patterns of cospeciation in an aphid–bacterial symbiosis and (ii) diversity of bacterial communities associated with the human gut. In both of these cases, accounting for uncertainty during model selection allowed greater flexibility to detect variable (with respect to time) speciation-coalescent thresholds among lineages.

4. The precision of the predicted species boundaries varied among the studies, and the variance-to-mean ratio for richness estimates ranged from 0.023 to 0.079. Sample-based estimates of gut bacteria richness revealed that accounting for uncertainty during species delineation increased the variance in the estimates of population means (by individual from which the samples were taken or by sex of the individuals) by up to 7.5%.

5. In ecological and evolutionary studies, conclusions are highly dependent on the classification system that is adopted; given the uncertainty in species boundaries observed here, ignoring this source of error (as is common practice) likely results in inflated type I error rates. The approach described here represents an objective, theory-based method for predicting species boundaries and explicitly incorporates uncertainty in the classification system into biodiversity estimation, thus allowing researchers to better address the causes and consequences of biodiversity.


What constitutes a species is a unifying question in biology. Recognizing the boundaries between species has long been the goal of systematics and evolutionary biology and has been of practical significance to environmental biologists. However, the importance of this issue has intensified owing to the need for a rapid accounting of species before they are lost to extinction (Myers et al. 2000). This, as well as the challenging task of delineating species within cryptic taxa (Hebert et al. 2004a) and the inability to study the majority of microbial organisms in their natural environment or even in the laboratory (Liesack & Stackebrandt 1992), has led many to argue that DNA sequences should be used in a primary role during species discovery and identification, as opposed to their use as a complement to traditional taxonomic approaches (Blaxter 2004; Savolainen et al. 2005; Vogler & Monaghan 2007). Therefore, a major imperative in bioinformatics is the development of theoretical and practical approaches for generating biodiversity estimates from DNA sequence data.

Most existing approaches for biodiversity estimation from DNA sequences rely on clustering algorithms that estimate the relative similarity between pairs of DNA sequences; these algorithms then form groups (meant to approximate species and usually referred to as operational taxonomic units) that contain only those individuals whose DNA differs less than a strict cut-off (e.g. Stackebrandt & Goebel 1994; Acinas et al. 2004; Hebert et al. 2004b). The validity of using strict cut-offs has been challenged owing to the high likelihood that not all species within a taxon share a common cut-off (Nilsson et al. 2008; Monaghan et al. 2009), particularly when the taxon under study is phylogenetically broad and is likely to contain lineages whose rates of molecular evolution vary. Where uncertainty exists, researchers will often present several biodiversity estimates, each using a different strict cut-off (e.g. Sogin et al. 2006; Amend et al. 2010). However, investigators eventually have to decide on a cut-off in order to interpret these estimates, especially when making comparisons between samples. How boundaries between species clusters are defined is likely to influence the interpretation of these data, and differences are usually attributed to variation in the strength of environmental filtering and trait conservatism at increasing taxonomic levels (Koeppel et al. 2008; Phillipot et al. 2010; Burke et al. 2011). These differences may also arise if the cut-off that best approximates the species boundary varies among lineages, but it is not possible to estimate and explicitly account for this source of error in downstream analyses using strict cut-offs.

Recently, the general mixed Yule-coalescent (GMYC) method, a model-based likelihood approach that combines phylogenetics and coalescence theory, was proposed to estimate species boundaries from DNA sequence data (Pons et al. 2006; Fontaneto et al. 2007). In addition, previous studies have shown that this approach can identify genetic clusters that generally correspond with existing classification schemes (Monaghan et al. 2009) and highlight patterns of cospeciation between an endosymbiont and its host (Jousselin, Desdevises & Coeur d'acier 2009) while also performing better than operational clustering approaches at accounting for the environmental associations of taxa (Powell et al. 2011). The procedure estimates lineage birth rates associated with both Yule diversification (speciation events; Yule 1924) and neutral coalescent (merging of lineages within a population; Kingman 1982) processes from a multispecies coalescent tree, calculating the likelihood assuming a threshold age between these processes at each node in the phylogeny. The point of maximum-likelihood (ML) of this mixed model estimates the transition between these processes and can thus be used to infer the species boundary. The model may assume that the depth of the transition from speciation processes to coalescent processes is fixed across all lineages (the single-threshold GMYC model) or that this transition varies across some or all lineages (the multiple-threshold GMYC model; Monaghan et al. 2009).

Currently, the method relies on a two-step hypothetico-deductive procedure to make decisions regarding the location and addition of transition points within the phylogeny. In the first step, a likelihood ratio test is employed to determine whether the estimation of parameters associated with the ML single-threshold model results in a significant increase in likelihood (the estimate for the ML model) over a model that only estimates parameters associated with the coalescent process. In the second step, the ML multiple-threshold model is compared to the ML single-threshold model (the new null model, in this case) in a likelihood ratio test to determine whether the addition of parameters (thresholds) is justified. As a result, a single model is chosen to represent the predicted species boundaries within a taxon, and speciation is assumed to have occurred during the waiting interval(s) leading up to the node(s) specified by the ML model and along the other concurrent waiting intervals associated with other lineages (Fig. 1). Any uncertainty in the selection of this model is usually represented by a confidence interval (within two log-likelihood units of the ML model) in the predicted number of independent lineages. The ML single-threshold model usually provides a significantly better fit to the data than the null model (e.g. Pons et al. 2006; Fontaneto et al. 2007; Ahrens, Monaghan & Vogler 2007; Barraclough et al. 2009; Jousselin et al. 2009; Leliaert et al. 2009; Monaghan et al. 2009; Bode et al. 2010; Powell et al. 2011), but the ML multiple-threshold model usually does not provide a better fit to the data than the ML single-threshold model (personal observation).

Figure 1.

 Illustration of the GMYC procedure using the bacterial phylogeny from the study of Jousselin et al. (2009). Lineage-through-time (a) and single-threshold GMYC likelihood profile (b) plots for the phylogeny in (c). The model with the highest likelihood in panel b is that in which the 24th divergence event represents the earliest population coalescent event; this node is indicated by the black circle, and the timing of this event is represented by the vertical dotted lines in panels a and c. In this model, speciation events are assumed to have occurred at some point along the waiting intervals intersecting the vertical dotted line, represented by the dotted branches. During the 24th waiting interval, there are 24 lineages (‘species’): 12 clusters (red) and 12 singletons (black). Immediately after the end of this interval, there are two lineages in the eighth cluster (‘population’) and one in all other clusters, while the number of lineages within a cluster following the 54th (and final) divergence event is equal to the number of tips associated with that cluster. The lineage birth rate and scaling parameter associated with the speciation process (λs, ps) are estimated from the timing of the 23 divergence events to the left of the threshold in panel a. Divergences within clusters are used to estimate the lineage birth rate and scaling parameters associated with the coalescent process (λc, pc; estimated from the rate of accumulation of branches to the right of the threshold in panel a). One example for the multiple-threshold model (the third-ranked model for bacteria in Table 1) is also given (d, e). In this case, the earliest population coalescent event is represented by the red dot, positioned at the end of the 22nd waiting interval. For most clusters (those in red), speciation events are assumed to have occurred along the branches intersecting the red dotted vertical line. 
However, for the eighth cluster, speciation is assumed to have occurred at some point along the 24th waiting interval, isolating this cluster from its sister ‘species’ represented by the neighbouring singleton. The predicted species boundaries are similar to those predicted by the single-threshold model, except that one singleton lineage has been merged into the seventh cluster; note that allowing this merger while retaining all other boundaries would not be possible under the single-threshold scenario. Here, λs and ps are estimated from the timing of all divergences to the left of the blue dotted vertical line, excluding the coalescent event at the end of the 22nd waiting interval. λc and pc are again estimated from the timing of divergences within all clusters, regardless of the threshold.

In fact, several models in the set of models fit during the estimation of likelihoods may fit the data well enough to fulfil the one requirement for selection imposed on the ML model (a significant increase in likelihood over a single null model), highlighting a significant limitation of such a hypothetico-deductive approach. In many cases, the analysis is being employed in an exploratory framework where the goal is not to test hypotheses with regard to species boundaries but to generate predictions regarding the probable placement of these boundaries. Multimodel inference and model averaging based on information-theoretic approaches are better suited to decision-making in this context (Burnham & Anderson 2002). Models are ranked based on their ability to efficiently account for variation in the data (i.e. models with more parameters suffer greater penalties), and the contribution of each model to the estimation of parameters is weighted by these ranks. The boundaries associated with clusters predicted by the model can be estimated based on these rankings, as can the uncertainty associated with these boundaries. Biodiversity estimates, therefore, are associated with an estimated level of uncertainty related to how isolated DNA sequences may be classified into different clusters, and this uncertainty can then be explicitly accounted for in downstream applications.

Here, I describe an approach for summarizing GMYC model results using formal inference from multiple models based on the Akaike Information Criterion (AIC; Akaike 1974). This approach was chosen because it is well suited to the current GMYC modelling framework, which does not require a priori hypotheses regarding the taxonomic affiliation of isolates (the scenario in which this method is most likely to be used); thus, all potential models are equally likely to represent the true model prior to analysis. I then apply this approach to two real data sets that represent the two ways in which ecologists and evolutionary biologists are most likely to use it and that highlight its empirical benefits: (i) in the delineation of species within taxa sampled from a broad geographical range (while estimating probabilities associated with individuals belonging to these species) and (ii) in the estimation of biodiversity and community structure from DNA samples extracted directly from the environment (while explicitly representing uncertainty in species boundaries in the variance associated with these estimates).

Materials and methods

The GMYC Model

Detailed descriptions of the GMYC model and its derivation can be found in the studies of Pons et al. (2006), Fontaneto et al. (2007) and Monaghan et al. (2009). An empirical example of the GMYC model is illustrated in Fig. 1. Briefly, the procedure has two components (the neutral coalescent model and the Yule model) for modelling the distribution of waiting times between diversification events. Each component of the model has two parameters that are estimated from the data (a branching parameter, λ, and a scaling parameter, p); the fifth parameter indicates the threshold transition point between the two model components. Likelihoods associated with waiting interval i in phylogeny x are calculated as:

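$$L(x_i) = b^{*} e^{-b^{*} x_i} \qquad \text{(eqn 1)}$$

$$b^{*} = \lambda_s \, n_{i,s}^{\,p_s} + \sum_{m} \lambda_c \left( n_{i,c} \left( n_{i,c} - 1 \right) \right)^{p_c} \qquad \text{(eqn 2)}$$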

The phylogeny contains m species, and each species consists of a population with size nj. The first component of b* represents the Yule process s associated with ni,s (interspecific) lineages at the end of waiting interval i; the branching rate parameter, λs, is a constant and represents the speciation rate but can change as lineages accumulate according to the scaling parameter ps. The second component represents the coalescent process c associated with each population m with ni,c (intraspecific) lineages at the end of waiting interval i. Here, the branching rate parameter λc is a constant (but is scaled to pc as lineages accumulate), represents the birth rate within a population and is estimated from the distribution of waiting times across all clusters.
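The calculation described above can be illustrated with a minimal Python sketch. It assumes the standard GMYC form, in which each waiting time is exponentially distributed with rate b*, a Yule term λs·n^ps and a coalescent term λc·(n(n−1))^pc summed over clusters. The function names and the toy numbers are hypothetical, and this is not the implementation in the ‘splits’ package.

```python
import math

def branching_rate(n_s, n_c, lam_s, p_s, lam_c, p_c):
    """Total branching rate b* for one waiting interval (assumed GMYC form).

    n_s : number of interspecific (Yule) lineages during the interval
    n_c : list with the number of lineages in each coalescent cluster
    """
    yule = lam_s * n_s ** p_s
    # Clusters with a single lineage cannot coalesce, so they contribute nothing.
    coalescent = sum(lam_c * (n * (n - 1)) ** p_c for n in n_c if n > 1)
    return yule + coalescent

def log_likelihood(waiting_times, n_s_per_interval, n_c_per_interval,
                   lam_s, p_s, lam_c, p_c):
    """Sum of log(b* exp(-b* x_i)) over all waiting intervals."""
    total = 0.0
    for x, n_s, n_c in zip(waiting_times, n_s_per_interval, n_c_per_interval):
        b = branching_rate(n_s, n_c, lam_s, p_s, lam_c, p_c)
        total += math.log(b) - b * x
    return total

# Toy example (hypothetical values): two intervals, the second containing
# one cluster with two lineages.
example = log_likelihood(
    waiting_times=[0.5, 0.1],
    n_s_per_interval=[2, 3],
    n_c_per_interval=[[], [2]],
    lam_s=1.0, p_s=1.0, lam_c=2.0, p_c=1.0,
)
```

In a full analysis the four rate and scaling parameters would be optimized for each candidate threshold; the sketch only evaluates the likelihood for fixed values.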

In the single-threshold version, a complete search is performed in which model likelihoods are estimated assuming that the threshold transition point (T) occurs at each node in the phylogeny; the optimal T is therefore associated with the ML model among this set and is used to predict the series of waiting intervals in which speciation is assumed to have occurred. In the multiple-threshold version of the model, the algorithm uses a heuristic search in which clusters are merged or split by moving thresholds within groups of lineages either closer to or further away from the root node, respectively; therefore, predicted speciation events are not restricted to concurrent waiting intervals. Because the branching rate and scaling parameters are estimated across all clusters, the number of parameters in the model increases by one for each threshold that is added to the model.

Multimodel Inference and Model Averaging

As stated in the introduction, the hypothetico-deductive approach is suboptimal in this context where the focus is on predicting probable species boundaries and the process is largely exploratory. At each step of the GMYC procedure, a statistical model is generated that predicts waiting times based on four continuous variables (the two λs and ps) and one or more categorical variables indicating the type of event (speciation/coalescent branching) occurring at the end of each waiting time, depending on what side of T it falls on. Thus, ML parameter estimates are obtained for each model given the particular threshold(s) at that step. It is then possible to rank and assign weights to models based on their AIC scores. The approach is similar in concept to that described by Carstens & Dewey (2010). Their approach is based on the explicit a priori specification of clade membership followed by species tree estimation; AIC is used to compare models with different numbers of parameters (distinct clades).

Following the estimation of likelihoods for GMYC models using single and multiple thresholds, a modified AIC score, corrected for small sample size (AICc; McQuarrie & Tsai 1998), is calculated for each model in the set:

$$\mathrm{AIC_c} = -2 \ln L + 2k + \frac{2k(k+1)}{l - k - 1} \qquad \text{(eqn 3)}$$

where k is the number of parameters in the model (including branching rates, scaling factors and unique thresholds), l is the number of observations in the model (nodes in the phylogeny) and L is the model likelihood. The model with the lowest AICc score corresponds to the best model, which most efficiently accounts for variation in the data. Akaike weights (wi; Akaike 1978; Burnham & Anderson 2002) are assigned based on the difference of the score (Δi) between each model i and the score of the best model, standardized by the sum of all differences (Δr):

$$w_i = \frac{\exp(-\Delta_i / 2)}{\sum_{r=1}^{R} \exp(-\Delta_r / 2)} \qquad \text{(eqn 4)}$$
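As a hedged illustration, the scoring and weighting steps can be sketched in Python using the standard small-sample AICc and Akaike weight formulas. The log-likelihoods and parameter counts are the aphid values reported in Table 1; the number of observations (l = 55, the number of internal nodes in a bifurcating tree with 56 tips) is an assumption, chosen because it reproduces the tabulated AICc scores. Function names are hypothetical.

```python
import math

def aicc(log_lik, k, l):
    """Small-sample AIC: -2 ln L + 2k + 2k(k + 1) / (l - k - 1)."""
    return -2.0 * log_lik + 2.0 * k + (2.0 * k * (k + 1)) / (l - k - 1)

def akaike_weights(scores):
    """Relative likelihood of each model, normalized to sum to one."""
    best = min(scores)
    rel = [math.exp(-0.5 * (s - best)) for s in scores]
    total = sum(rel)
    return [r / total for r in rel]

# Aphid models from Table 1: best single-threshold model, Yule null,
# coalescent null (log-likelihood, number of parameters).
models = [(253.99, 5), (242.15, 2), (241.47, 2)]
scores = [aicc(ll, k, 55) for ll, k in models]   # l = 55 is assumed
weights = akaike_weights(scores)
```

Because only three of the fitted models are included here, the resulting weights are not comparable with those in Table 1, which were calculated over the full model set.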

Akaike weights are then employed to calculate model-averaged estimates of the number of GMYC species ($\hat{\bar{\theta}}$) based on the number of GMYC species predicted by each model ($\hat{\theta}_i$),

$$\hat{\bar{\theta}} = \sum_{i=1}^{R} w_i \hat{\theta}_i \qquad \text{(eqn 7)}$$

and variances associated with these parameters, by calculations of sums of squares:

image(eqn 8)

Here, the conditional variance

image(eqn 9)

for the parameters within each model equals zero because the position of the nodes is fixed (i.e. the model is fit to a single tree).
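A minimal Python sketch of the averaging step follows. The weighted mean is the Akaike-weighted sum of the per-model species counts. The variance is written here as the weighted sum of squared deviations from that mean, with the conditional variance set to zero as stated above; this form is an assumption made for illustration, and Burnham & Anderson (2002) also describe an alternative estimator. The function name and toy values are hypothetical.

```python
def model_average(weights, estimates):
    """Akaike-weighted mean and (assumed) variance of per-model estimates."""
    mean = sum(w * e for w, e in zip(weights, estimates))
    # Conditional variance within each model is zero (single fixed tree),
    # so only the between-model term remains.
    variance = sum(w * (e - mean) ** 2 for w, e in zip(weights, estimates))
    return mean, variance

# Toy example: three models predicting 22, 23 and 25 GMYC species.
mean_species, var_species = model_average([0.5, 0.3, 0.2], [22, 23, 25])
```

With these toy inputs the averaged estimate is 22·9 species with a variance of 1·29.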

To generate model-averaged estimates of diversity within individual samples (

image(eqn 10)

), the individual/isolate–sample matrix is converted into a species–sample matrix using species boundaries predicted by each model. Diversity indices are then calculated for each model from the species abundance distribution within each sample or, in cases where richness is being estimated, the species–sample matrix is used to determine presence/absence within each sample. Finally, model-averaged estimates for each sample are calculated using the Akaike weights associated with each model. Estimates from N samples can then be used to estimate the population average,

image(eqn 11)

and variance,

image(eqn 12)


image(eqn 13)

is the conditional variance within each sample associated with uncertainty in defining the species boundaries.
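The conversion from isolates to species within a sample can be sketched as follows. Each model is represented by a hypothetical dictionary mapping isolates to species labels, richness is the number of distinct species present in the sample under that model, and the per-model values are combined with Akaike weights. The data structures, names and the weighted sum-of-squares variance are assumptions for illustration.

```python
def sample_richness(isolate_counts, assignment):
    """Number of species present in one sample under one model.

    isolate_counts : dict, isolate -> number of reads or individuals in the sample
    assignment     : dict, isolate -> species label predicted by the model
    """
    return len({assignment[i] for i, n in isolate_counts.items() if n > 0})

def averaged_richness(isolate_counts, assignments, weights):
    """Model-averaged richness and its between-model variance for one sample."""
    per_model = [sample_richness(isolate_counts, a) for a in assignments]
    mean = sum(w * r for w, r in zip(weights, per_model))
    variance = sum(w * (r - mean) ** 2 for w, r in zip(weights, per_model))
    return mean, variance

# Toy example: four isolates, one absent from the sample, two candidate models.
sample = {"a": 3, "b": 1, "c": 0, "d": 2}
model_1 = {"a": "s1", "b": "s1", "c": "s2", "d": "s3"}   # a and b lumped
model_2 = {"a": "s1", "b": "s2", "c": "s2", "d": "s3"}   # a and b split
richness, richness_var = averaged_richness(sample, [model_1, model_2], [0.75, 0.25])
```

Other diversity indices could be substituted for richness by replacing `sample_richness` with a function of the species abundance distribution.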

Where the focus is on creating a virtual taxonomy from this clustering approach, Akaike weights are used to estimate probabilities that two individuals or isolates belong to the same cluster. For each model in the set, each pair of isolates is scored (pi) as being predicted to belong (1) or not belong (0) to a common cluster. The probability ($\bar{p}$) that each pair of isolates belongs to a common cluster is calculated as the Akaike weighted sum of these scores:

$$\bar{p} = \sum_{i=1}^{R} w_i \, p_i \qquad \text{(eqn 15)}$$

with R being the number of models in the set.
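The pairwise calculation can be sketched in Python as below. Each model is again a hypothetical dictionary mapping isolates to cluster labels; a pair scores 1 under a model when both isolates share a label, and the scores are summed using the Akaike weights. Names and toy data are assumptions.

```python
from itertools import combinations

def coclustering_probabilities(assignments, weights):
    """Akaike-weighted probability that each pair of isolates shares a cluster."""
    isolates = sorted(assignments[0])
    probs = {}
    for a, b in combinations(isolates, 2):
        # Sum the weights of the models that place a and b in the same cluster.
        probs[(a, b)] = sum(
            w for assign, w in zip(assignments, weights) if assign[a] == assign[b]
        )
    return probs

# Toy example: two candidate models for four isolates.
model_1 = {"a": "s1", "b": "s1", "c": "s2", "d": "s3"}
model_2 = {"a": "s1", "b": "s2", "c": "s2", "d": "s3"}
pair_probs = coclustering_probabilities([model_1, model_2], [0.75, 0.25])
```

Here isolates a and b co-occur with probability 0·75, b and c with probability 0·25, and all remaining pairs with probability 0.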

Data Analyses

The approach was evaluated using phylogenetic trees generated from two published studies that represent two of the most common ways in which DNA sequence data are generally used in biodiversity studies: (i) in the delineation of species within a phylogenetic framework and (ii) in the estimation of species diversity and other aspects of community structure in environmental surveys. The first study was a survey of co-evolutionary relationships between aphids in the genus Brachycaudus and obligate symbiotic bacteria, Buchnera aphidicola (Jousselin et al. 2009). In this study, aphid and bacterial DNA were amplified from 56 specimens of Brachycaudus sampled in Europe and Australia. I used the ultrametric phylogenetic trees in Fig. 2 of the original publication, which were provided by Emmanuelle Jousselin (Center for Biology and Management of Populations, Campus International de Baillarguet, France), to estimate probabilities of isolates forming genetic clusters. The phylogeny of Brachycaudus was derived from a maximum-likelihood analysis of a partitioned data set including the CytB and COI genes and the ITS2 region (Coeur d'acier et al. 2008). The phylogeny of Buchnera was estimated from a ML analysis of a partitioned data-set including the TrpB gene and the intergenic regions between (i) the hupA and rpoc genes and (ii) the ssb and dnaB genes (Jousselin et al. 2009). The authors obtained ultrametric trees from these phylogenies using the relaxed Bayesian method for multilocus data in MULTIDIVTIME (

Figure 2.

 Model-averaged GMYC predictions of pairwise probabilities that aphid individuals (top panel) or bacterial symbionts (middle panel) co-occur within a species cluster. The bottom panel indicates the absolute difference in cluster probabilities when comparing the top and middle panels, demonstrating the congruence in species delineation between the two symbionts. Labels indicate the individual aphid from which host and bacterial DNA was isolated and are ordered to coincide with the Brachycaudus phylogeny in Fig. S1.

The second study described bacterial communities associated with human body habitats using pyrosequencing (Costello et al. 2009). In this study, the authors sampled various habitats (stool, body cavities, and hair and skin surfaces) on multiple individuals and at multiple times. The authors used PCR to amplify variable region 2 of bacterial 16S rRNA genes present within each sample and characterized bacterial communities following pyrosequencing. I analysed a subset of these data, derived from stool samples collected from six individuals on each of 4 days, to generate model-averaged estimates of ‘species’ richness and diversity within each sample. These data were made available by Patrick Schloss (Department of Microbiology & Immunology, University of Michigan, USA), who obtained them from the Short Read Archive data repository and used them in a tutorial (accessed 28 September 2010) for a bioinformatics software project, mothur (Schloss et al. 2009). Using mothur v.1.13.0, sequences were aligned against the SILVA reference alignment, and low-quality (quality score ≤35, homopolymer >8), short (≥150 bp before position 6333 following alignment) and chimeric sequences (tested against the SILVA Gold sequence database) were trimmed from the data. A pre-clustering step that seeks to reduce sequencing noise was employed to filter sequences that differed by <1% from dominant sequence types (Huse et al. 2010). Following these steps, the majority of unique sequences were classified as belonging to the phylum Firmicutes (2074), and the remainder were classified as belonging to a variety of other phyla (1057), when compared against the SILVA rdp6 taxonomy outline. Pairwise distances were calculated between all unique sequences in mothur; strings of gaps (including terminal gaps) were treated as single insertions. 
The distance matrix was then imported into R (R Development Core Team 2009), and for each of these two groups of unique sequences (Firmicutes and other bacteria), I generated ultrametric phylogenetic trees using the ‘upgma’ function in the ‘phangorn’ package (Schliep 2011). The Firmicutes phylogeny was then split into two subtrees containing 1112 and 962 tips, respectively. I divided the sequences into these three groups to facilitate more rapid GMYC model calculations; each of these trees was large enough to provide sufficient statistical power (Powell et al. 2011).

For each of the five phylogenetic trees between the two studies, I estimated model parameters and likelihoods using tools from the ‘splits’ package (Ezard, Fujisawa & Barraclough 2009) for R; these analyses were carried out on Bioportal, a web-based bioinformatics utility sponsored by the University of Oslo. I used a modified version of the ‘gmyc’ function that also estimates model parameters assuming that each isolate belongs to its own unique species; this model represents a second null model (Yule), in addition to the null model in the original version for which all isolates are assumed to belong to a single coalescent population (coalescent). Akaike weights and model-averaged estimates were then calculated as described above using the code provided in the supplementary online materials. To facilitate rapid calculations, estimates were based on a subset of all models with the sum of weights, cumulated from largest to smallest, just ≥ 0·99; other approaches are also valid, such as using δAIC thresholds that provide a desired level of empirical support (Burnham & Anderson 2002).
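The subsetting rule can be sketched as follows: models are sorted by Akaike weight, and the smallest set whose cumulative weight reaches 0·99 is retained. Renormalizing the retained weights so that they sum to one is an assumption of this sketch; the supplementary code may treat this step differently. The function name and example weights are hypothetical.

```python
def confidence_set(weights, cutoff=0.99):
    """Indices of the top-weighted models whose cumulative weight reaches `cutoff`.

    Returns a dict mapping model index -> renormalized weight (an assumed step).
    """
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += weights[i]
        if total >= cutoff:
            break
    kept_total = sum(weights[i] for i in kept)
    return {i: weights[i] / kept_total for i in kept}

# Toy example: the last model falls outside the 0.99 cumulative set.
subset = confidence_set([0.5, 0.3, 0.15, 0.045, 0.005])
```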


Clustering Probabilities for Brachycaudus and Symbiotic Buchnera

The species boundaries predicted by this approach were in general qualitative agreement with those predicted by the ML single-threshold model and presented by Jousselin et al. (2009). The ML single-threshold model in each phylogeny predicted a slightly higher number of independent lineages in Buchnera (24) than in Brachycaudus (22). However, accounting for uncertainty in model selection resulted in the convergence of these estimates (mean ± sd; Brachycaudus: 22·8 ± 1·2, Buchnera: 23·3 ± 0·9). The congruence of the classification of the symbionts to GMYC species is depicted in the bottom panel of Fig. 2, which explicitly shows the differences in pairwise clustering probabilities when comparing model predictions for the aphids to those for the bacteria. Here, it is clear that the model predicts patterns consistent with cospeciation by aphids and their bacterial symbionts in most cases, but highlights a few samples (individual aphids) where this is not clear.

For the analysis of the aphid isolates, three of the single-threshold models contributed a majority of weight (0·523) to the model (Table 1), suggesting that no single model best represents species boundaries for these data. An additional four models containing variable speciation-coalescent transitions (two thresholds in each) contributed weights >0·05, with the remainder of the models contributing <0·05 (Tables 1 and S2). The genetic clusters predicted by this method were robust, with most clades having probabilities of >0·80 or <0·05 (Figs 2 and S1). Two clades had intermediate probabilities of representing genetic clusters: s242 and its related ingroup (0·272), and b1730 and its related ingroup (0·650).

Table 1.   Summary of GMYC models contributing Akaike weights >0.01 and null models for the aphid and bacterial phylogenies. Asterisks indicate maximum-likelihood single- and multiple-threshold models
Symbiont   Model rank   Likelihood   Parameters   AICc      δAICc   Akaike weights   Model type
Aphid      1            253·99       5            −496·76   0·00    0·239            Single*
           36           242·15       2            −480·06   16·70   0·000            Null (Yule)
           41           241·47       2            −478·72   18·05   0·000            Null (coalescent)
Bacteria   1            232·70       5            −454·18   0·00    0·213            Single*
           58           219·40       2            −434·58   19·61   0·000            Null (Yule)
           65           218·63       2            −433·03   21·15   0·000            Null (coalescent)

For the analysis of the bacterial isolates, two single-threshold models provided the best fit to the data out of all models in the set (i.e. had the lowest AICc score; Table 1). These two models shared a common speciation-coalescent threshold age because each used a node at either end of a zero-length branch as the transition (b1483 and ingroup/b1760 and ingroup), and together they contributed a weight of 0·426 to the model-averaged estimates. Additional models in the set contributed uncertainty in these estimates, with two models each based on a single threshold and another model containing variable speciation-coalescent transitions (two thresholds; not the ML multiple-threshold model) each contributing weights of 0·089 or greater (Table 1). All other models contributed weights of <0·05 (Tables 1 and S1). Probabilities associated with genetic clusters were generally >0·90 or <0·05, with two exceptions (Figs 2 and S2): s317 and its related ingroup (0·376), and b1747 and its related ingroup (0·464).

Uncertainty in the Richness of Bacterial Communities

For each subtree from the human stool bacteria data set, allowing the speciation-coalescent transition to vary among lineages improved the fit of the model to the data; the ML single-threshold model provided a relatively poor fit to the data in each case (δAICc = 23·96, 54·38, and 64·49; Tables S3–S5). However, there was variable support for simply choosing the ML multiple-threshold model as the best representation of speciation patterns. This model provided the best relative fit to the data in one subtree, fit the data reasonably well in another subtree (δAICc = 0·45), and fit the data relatively poorly in the third subtree (δAICc = 6·88).

In general, no one model provided a much better fit to the data than the other models, with estimates being weighted by several models within the set (Fig. 3). As such, there is no support for choosing a single model to inform species boundaries in these data. For two of the three subtrees, several models contributed Akaike weights >0·01 (44 and 30 models in each subtree), and maximum weights were <0·04 (0·027 and 0·038, respectively). For the third subtree, the majority of weight (0·533) was derived from three models in the set, with an additional five models contributing weights >0·01 to the estimates.

Figure 3.

 Frequency distribution of Akaike weights calculated for GMYC models fit to bacterial phylogenies. Phylogenies were reconstructed from bacterial DNA isolated from human stool samples. a–c represent distributions from each of the three subtrees that were analysed.

Estimating bacterial richness by averaging across all models in this subset revealed the effects of ignoring uncertainty in species boundaries. The GMYC method predicted 932·84 (sd = 7·99) species of bacteria in all stool samples collected from the six individuals included here. The approach resulted in estimates of bacterial richness in each sample that varied in precision (median variance-to-mean ratio = 0·048, range = 0·023–0·079). Bacterial richness varied among individuals when averaged across the four sample dates, but also among samples for some individuals (Table 2), while on average, males harboured bacterial communities similar in richness to those of females (Table 2). Accounting for uncertainty in the clustering algorithm increased the variance associated with these estimated means by an average of 2·6% (range: 0·4–7·5%).

Table 2.   Richness of bacterial communities in human stool sampled from six individuals on 4 days. Estimates for each day represent model-averaged estimates predicted by GMYC models; numbers in parentheses represent variances associated with these estimates. Richness was aggregated by subject and sex, and two estimates of standard error (SE) are provided; asterisks indicate that the estimate accounts for uncertainty during model selection
                                                                          Subject                 Sex
           Day 1         Day 2         Day 3         Day 4          Mean    SE*    SE      Mean    SE*    SE
Female 1   141·2 (8·0)   140·8 (6·6)   192·1 (9·9)   202·8 (12·9)   169·2   16·6   16·5    123·1   27·4   27·3
Female 2   90·0 (4·7)    92·9 (3·7)    103·2 (6·3)   110·7 (5·6)    99·2    4·9    4·8
Female 3   106·5 (3·6)   103·0 (2·8)   85·3 (3·0)    109·0 (2·6)    101·0   5·4    5·4
Male 1     83·8 (2·7)    76·6 (1·7)    146·9 (7·7)   152·7 (9·6)    115·0   20·2   20·2    105·9   20·6   20·5
Male 2     85·3 (4·7)    65·9 (2·6)    115·8 (8·0)   106·8 (5·0)    93·5    11·3   11·2
Male 3     117·3 (7·4)   97·9 (3·3)    99·8 (7·9)    121·6 (5·7)    109·2   6·2    6·0


Inherent Advantages of the Multimodel GMYC Approach

In studies that require the prediction of species boundaries from DNA sequence data, model-based approaches represent the state of the art. The premises on which these approaches are based are sometimes designed to be simple and practical, such as the ‘97%’ rule for bacteria (Stackebrandt & Goebel 1994; Hanage, Fraser & Spratt 2006) or the ‘10 ×’ rule for animals (Hebert et al. 2004b), in which the threshold for assigning an individual to a species is set according to the percentage similarity of that individual's DNA to that of individuals within that species. The GMYC approach is similar in that it is based on thresholds, but it has two advantages: a threshold can be detected from the data and does not need to be specified beforehand (Pons et al. 2006), and it does not have to assume that the evolutionary process progresses at the same rate in all taxa (Monaghan et al. 2009).

However, the hypothetico-deductive approach that is usually used to interpret GMYC modelling outcomes can be biased against multiple-threshold models whose likelihoods are lower than the maximum, which reduces the benefit of the second advantage. As shown here, these models often fit the data as well as or better than the ML single- and multiple-threshold models when the number of parameters estimated by the model is taken into account. In addition, the information-theoretic approach described here places the GMYC approach within a more appropriate theoretical framework. Averaging parameters across multiple models explicitly accounts for uncertainty during model selection and provides, given the assumptions of the model, a probability that any two individuals (sequences) belong to the same species (as shown in the first study) and/or an estimate of the variance associated with species diversity within and between samples (as shown in the second study). Thus, this approach represents a useful extension to enhance the utility of the GMYC model for many of the questions asked by evolutionary biologists and ecologists.
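
The probability that two sequences belong to the same species can be illustrated with a short sketch: for each pair, sum the Akaike weights of the models that place both sequences in the same cluster. The cluster assignments, weights and the function name `comembership_probabilities` are hypothetical and serve only to show the calculation.

```python
from itertools import combinations

def comembership_probabilities(assignments, weights):
    """Model-averaged probability that each pair of sequences is conspecific.

    assignments: one dict per model, mapping sequence id -> cluster label.
    weights: Akaike weights of the models, in the same order, summing to 1.
    """
    ids = sorted(assignments[0])
    probs = {}
    for a, b in combinations(ids, 2):
        # Add up the weight of every model that groups a and b together
        probs[(a, b)] = sum(
            w for assign, w in zip(assignments, weights) if assign[a] == assign[b]
        )
    return probs

# Hypothetical example: three models delimiting three sequences
assignments = [
    {"s1": 0, "s2": 0, "s3": 1},   # s1 and s2 grouped, s3 separate
    {"s1": 0, "s2": 1, "s3": 2},   # every sequence in its own cluster
    {"s1": 0, "s2": 0, "s3": 0},   # all sequences in one cluster
]
weights = [0.6, 0.3, 0.1]

probs = comembership_probabilities(assignments, weights)
for pair, p in probs.items():
    print(pair, round(p, 3))
```

A pair supported only by a poorly fitting model receives a low probability, whereas a pair grouped together by most of the well-supported models approaches 1.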

Appropriate Uses for the Multimodel GMYC Approach in Taxonomy

From the perspective of taxonomic applications, the multimodel GMYC approach is particularly suited to situations where the a priori specification of hypothesized species boundaries, as employed by other model-based approaches for species delineation (Rach et al. 2008; Yang & Rannala 2010; Leaché & Fujita 2010; Carstens & Dewey 2010), is inconvenient. This is particularly important for taxa that lack a practical species-level taxonomic framework within which biologists can work. Fitting these models to the data results in probabilistic taxonomic hypotheses, which can then form a basis for narrowing down a pool of a priori hypotheses to test with multiple approaches. In cases where established taxonomies exist but may be in doubt (e.g. Fontaneto et al. 2007; Jousselin et al. 2009), the approach can be used to make quantitative predictions regarding clades that may require further investigation.

From a broader perspective, particularly (but not exclusively) relevant for microbial taxa for which a classical species concept is difficult to apply, the question remains whether these clusters represent species or taxa at a different hierarchical level. The model makes predictions regarding the timing of speciation events from theories based on evolutionary and population genetics and compares these predictions to patterns in the data. These predictions have been observed to correspond with existing species limits in some higher taxa (Monaghan et al. 2009). This is a correlative approach, however, and other processes besides speciation may be behind the patterns observed here. For example, these clusters may represent populations that are particularly suited to the present environmental conditions and therefore exhibit rapid population growth while accumulating genetic mutations (ecotypes, Koeppel et al. 2008). This issue, with regard to bacteria, was partly addressed by Barraclough et al. (2009); they referred to the genetic clusters identified by the GMYC model as ‘evolutionarily significant units’ owing to the fact that they represent independently evolving lineages (at least at the level of the individual gene locus) as well as biological phenomena that need to be explained even if they are not representative of species, per se. The fact that these clusters exist and that they correspond with biological/ecological characteristics of the taxa in which they are found (Jousselin et al. 2009; Powell et al. 2011) suggests that they are representative of some fundamental evolutionary process.

Estimating Uncertainty in Sample Diversity Using the Multimodel GMYC Approach

Another advantage arising from this extension of the GMYC framework relates to how we perceive data derived from community metagenomic surveys. The GMYC approach, when utilized in an information-theoretic framework, explicitly characterizes the uncertainty associated with diversity estimates in these surveys. This represents a significant departure from current approaches, which are acknowledged to be biased and subject to stochasticity but are largely evaluated on their ability to arrive rapidly at a single accurate solution (e.g. Sun et al. 2009; Hao, Jiang & Chen 2011). In the analysis of DNA sampled from bacteria in human stool samples, the uncertainty associated with these estimates did not prevent the detection of differences among the individuals from which the samples were taken, or among sample times within each individual. However, even in this case where the uncertainty associated with species boundaries was low (on average, this variance scaled to approximately 5% of the mean across samples), it increased the variance in the estimates of the population means (by individual or sex) by up to 7·5%. In situations where this uncertainty is large (indicating that the interpretation of community structure is highly dependent on the classification system that is adopted), ignoring this uncertainty will result in the inflation of type I error rates.

Computational efficiency is especially important for the analysis of metagenomic data (like the gut bacteria data analysed here) generated with next-generation sequencing approaches, which tend to contain tens to hundreds of thousands of sequences. The effort required to run the GMYC algorithm is greater than that required by existing clustering approaches employing strict cut-offs, partly because the topology and divergence times in the phylogeny must be estimated before the model is fitted, but also because optimization is required at each step of the algorithm. The multiple-threshold approach is particularly computationally intensive, and the time required to estimate a series of likelihoods increases in a more or less exponential fashion with the size of the phylogeny. Here, this issue was avoided by splitting the data set into three subtrees, each containing between 962 and 1112 tips. This allowed the subtrees to be analysed simultaneously, with a total run time of approximately 3 days. However, as the average read lengths associated with these sequencing techniques increase and data sets become larger, it will become more important to identify ways to rapidly and efficiently estimate GMYC model parameters and likelihoods.

Here, I focused on estimating species diversity within DNA sequence data while accounting for unclear species boundaries. Extensions of this approach will address distance-based estimates of community composition variation among samples because this is a major focus of research on microbial biodiversity (Kuczynski et al. 2010). In principle, the approach is the same except that Akaike weights will be applied to parameters estimated in multivariate space (for example, average community dissimilarity) from the species–sample matrices predicted by the model.
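
A minimal sketch of this extension is given below. It assumes that each candidate model yields its own species–sample matrix, uses Bray–Curtis dissimilarity purely as an example measure (no particular measure is specified here), and summarizes uncertainty as the weighted between-model variance. All matrices, weights and function names are hypothetical.

```python
from itertools import combinations

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two samples of species counts."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den if den else 0.0

def mean_dissimilarity(matrix):
    """Average pairwise dissimilarity among the samples (rows) of one matrix."""
    pairs = list(combinations(matrix, 2))
    return sum(bray_curtis(x, y) for x, y in pairs) / len(pairs)

def model_averaged_dissimilarity(matrices, weights):
    """Akaike-weighted mean dissimilarity and its between-model variance."""
    per_model = [mean_dissimilarity(m) for m in matrices]
    avg = sum(w * d for w, d in zip(weights, per_model))
    var = sum(w * (d - avg) ** 2 for w, d in zip(weights, per_model))
    return avg, var

# Hypothetical example: two samples under two delimitation models
split_model = [[10, 0, 5], [0, 10, 5]]   # three species recognised
lumped_model = [[10, 5], [10, 5]]        # first two species merged into one
weights = [0.75, 0.25]                   # placeholder Akaike weights

avg, var = model_averaged_dissimilarity([split_model, lumped_model], weights)
print(avg, var)
```

Because each matrix is summarized before averaging, the models may recognise different numbers of species; only the number of samples has to match across models.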


In summary, the advantages of using the GMYC model for predicting species boundaries in environmental DNA sequence data, and in generating hypotheses for species boundaries in taxonomic studies, are enhanced by the use of multimodel inference during model selection. In particular, accounting for uncertainty during model selection allows greater flexibility to detect variable (with respect to time) speciation-coalescent thresholds among lineages and explicitly incorporates the uncertainty associated with unclear species boundaries into evolutionary studies and the analysis of ecological communities.


Thanks to Michael Monaghan, Tancredi Caruso, Matthias Rillig, Maarja Öpik, Tim Barraclough, Emmanuel Paradis and three anonymous reviewers for helpful discussions and comments. Also, thanks to Emmanuelle Jousselin for providing the Brachycaudus and Buchnera phylogenies and to Rob Knight and Patrick Schloss for making the gut bacteria data available. Funding was provided by a Marie Curie International Incoming Fellowship.