The use of DNA barcodes to estimate phylogenetic diversity in forest communities of southern China

Abstract To elucidate potential ecological and evolutionary processes associated with the assembly of plant communities, there is now widespread use of estimates of phylogenetic diversity that are based on a variety of DNA barcode regions and phylogenetic construction methods. However, relatively few studies consider how estimates of phylogenetic diversity may be influenced by single DNA barcodes incorporated into a sequence matrix (conservative regions vs. hypervariable regions) and the use of a backbone family‐level phylogeny. Here, we use general linear mixed‐effects models to examine the influence of different combinations of core DNA barcodes (rbcL, matK, ITS, and ITS2) and phylogeny construction methods on a series of estimates of community phylogenetic diversity for two subtropical forest plots in Guangdong, southern China. We ask: (a) What are the relative influences of single DNA barcodes on estimates of phylogenetic diversity metrics? and (b) What is the effect of using a backbone family‐level phylogeny to estimate topology‐based phylogenetic diversity metrics? The combination of more than one barcode (i.e., rbcL + matK + ITS) and the use of a backbone family‐level phylogeny provided the most parsimonious explanation of variation in estimates of phylogenetic diversity. The use of a backbone family‐level phylogeny showed a stronger effect on phylogenetic diversity metrics that are based on tree topology compared to those that are based on branch lengths. In addition, the variation in the estimates of phylogenetic diversity that was explained by the top‐rank models ranged from 0.1% to 31% and was dependent on the type of phylogenetic community structure metric.
Our study underscores the importance of incorporating a multilocus DNA barcode and the use of a backbone family‐level phylogeny to infer phylogenetic diversity, where the type of DNA barcode employed and the phylogenetic construction method used can serve as a significant source of variation in estimates of phylogenetic community structure.


| INTRODUCTION
Plant DNA barcodes, based on either single or multilocus regions of the chloroplast and/or nuclear genomes, have been applied to questions in community ecology (Kress et al., 2009; Valentini, Pompanon, & Taberlet, 2009). Estimates of phylogenetic diversity can be used to quantify the evolutionary and ecological processes associated with community assembly, composition, and structure at different spatiotemporal scales (Cavender-Bares, Kozak, Fine, & Kembel, 2009; Helmus, Bland, Williams, & Ives, 2007; Mouquet et al., 2012; Webb, 2000). The branch lengths and topology of community phylogenies can influence estimates of phylogenetic diversity in different ways (Boyle & Adamowicz, 2015; Mazel et al., 2016; Swenson, 2009). For estimates based on DNA barcodes, the metric used to assess phylogenetic diversity may be influenced by the evolutionary rate of the barcode(s) employed. For example, DNA barcode regions that are phylogenetically conservative or hypervariable may under- or overestimate phylogenetic diversity, respectively. The effect of barcode regions (and their combinations) on estimates of phylogenetic diversity metrics has not been empirically tested and may be a potential source of variation that requires consideration when assessing community phylogenetic structure.
Estimates of phylogenetic diversity may also be influenced by the type of phylogenetic construction method employed. Typically, tree topologies at deep phylogenetic nodes (e.g., family level) that have been inferred with a limited set of barcodes are largely incongruent with broadly accepted patterns of taxonomic relationships (e.g., APG IV; Byng et al., 2016). To constrain deep phylogenetic nodes and follow broadly accepted phylogenetic patterns, supertree methods (Bininda-Emonds & Sanderson, 2001; Webb & Donoghue, 2005) can be combined with DNA barcode sequence data (Erickson et al., 2014; Kress et al., 2010) to provide more accurate depictions of topology. Furthermore, the incorporation of a backbone phylogeny can provide more accurate estimates of the branch lengths (Boyle & Adamowicz, 2015) and potentially affect the metrics of phylogenetic community diversity (Swenson, 2009). Since ecological and evolutionary processes might operate at different phylogenetic depths (Mazel et al., 2016), it seems reasonable that phylogenetic diversity metrics that are sensitive to processes operating at deep phylogenetic depths may be strongly influenced by combining supertree methods with DNA barcode sequence data, whereas those estimates that largely capture diversity at the tips of the phylogeny may be less influenced, although this remains to be tested.
In contrast to the branch length-based metrics, several phylogenetic diversity metrics (e.g., PAE, the relationship between species evolutionary distinctiveness and abundance; IAC, the imbalance of abundances at higher clades) have been developed to capture information on both the topology and branch lengths of phylogenies connecting the species of a community (Cadotte et al., 2010; Krajewski, 1994; Vane-Wright, Humphries, & Williams, 1991). These topology-based metrics have also been shown to be valuable for predicting patterns of abundance, community composition, and ecosystem functioning (Cadotte et al., 2010; Liu et al., 2018; Liu, Zhang, et al., 2015), but are seldom evaluated in terms of how the branch lengths and topologies of community phylogenies may affect estimates of community phylogenetic diversity. Here, we predict that the use of a backbone phylogeny will have a strong influence on topology-based metrics.
To assess the potential variance in estimates of phylogenetic diversity associated with DNA barcodes and phylogeny construction methods, we first constructed a series of phylogenies, using Bayesian tree inference, for two distinct subtropical forest communities that vary in elevation in the Dinghushan National Nature Reserve, Guangzhou, China. Specifically, we sampled two plastid gene regions (rbcL + matK) and the nuclear ribosomal internal transcribed spacers (ITS, and ITS2, which is part of the ITS region but has considerable power in species identification and resolution; see Chen et al., 2010) for all trees in each plot and constructed a series of phylogenies using different barcode combinations. To investigate the effects of supertree methods on estimates of phylogenetic diversity, we constructed another series of phylogenies with backbone family-level phylogenies based on APG IV (Byng et al., 2016). Taking a multi-model comparative approach, we assessed the relative contribution of single and multilocus barcodes, family-level backbone, and their combinations to predict the variance in estimates of phylogenetic diversity metrics. We address the following questions: (a) What are the relative influences of single DNA barcodes on estimates of phylogenetic diversity metrics? (b) What is the effect of using a backbone family-level phylogeny to estimate phylogenetic diversity metrics?

| Study sites
Both study plots are located at the Dinghushan National Nature Reserve, where temperatures range from −0.2°C to 38.1°C and mean annual rainfall is 1,927 mm (Liu, Yan, et al., 2015). One plot is located in a subtropical mountain evergreen forest (600 m a.s.l.), while the other plot is located in a subtropical valley rain forest (100 m a.s.l.). Both plots have the same sampling area (1 ha) and similar arboreal species richness. There were a total of 114 tree species across both plots, and the abundance of each species was calculated by counting the number of individuals with a diameter at breast height >10 cm. The mountain evergreen forest plot had 75 species with 41 unique species, and the valley rain forest plot had 73 species with 39 unique species. We list the detailed species information in Supporting Information Table S1.

| Community phylogenies
An exhaustive description of the methods for DNA extraction, PCR amplification, and sequencing can be found in Liu, Yan, et al. (2015).
Here, we briefly describe the methods for phylogenetic construction.
For the 114 species across both plots, we aligned rbcL and matK using MAFFT (Katoh & Standley, 2013) and then eliminated divergent regions using Gblocks (Castresana, 2000). We aligned ITS and ITS2 using SATé (Liu et al., 2012). We then concatenated subsets of the rbcL, matK, ITS, and ITS2 sequences to generate a total of seven supermatrices, including (a) rbcL + matK, (b) rbcL + ITS, (c) rbcL + ITS2, (d) matK + ITS, and (e) matK + ITS2. To assess the influence of a constrained family-level backbone on community phylogenetic diversity metrics, we constructed a total of fourteen species-level phylogenies based on the seven supermatrices: one set based on Bayesian phylogenies and a second set based on Bayesian phylogenies with a constrained backbone topology at the family level based on the APG IV system (Byng et al., 2016). We then selected the best model of nucleotide substitution based on the lowest Akaike information criterion (AIC) for each barcode region using the function "modelTest" in the phangorn library (Schliep, 2011) in R (R Core Team, 2016). For all barcode combinations, modelTest found that the best model was the generalized time reversible (GTR) model with a gamma distribution parameter describing among-site rate variation and a proportion of invariant sites parameter. We constructed all Bayesian phylogenies in MrBayes 3.2.5 (Ronquist et al., 2012) using four chains with 1,000,000 generations, a sampling and diagnostic frequency of 100, and a 20% burn-in. We chose one representative of an early diverging gymnosperm lineage, Cunninghamia lanceolata, as the root for the Bayesian phylogenies. We then used a semi-parametric rate-smoothing method to transform each phylogeny to an ultrametric tree using the "chronopl" function with a λ value of 1,000 in the R ape library (Paradis, Claude, & Strimmer, 2004).
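The concatenation step can be illustrated with a minimal Python sketch. The function name `concatenate_alignments`, the toy sequences, and the choice to pad a species that lacks a locus with gap characters are assumptions made for illustration; they are not taken from the pipeline described above, which relied on MAFFT, Gblocks, and SATé for the alignments themselves.

```python
def concatenate_alignments(alignments, loci):
    """Concatenate per-locus alignments into one supermatrix.

    alignments: dict mapping locus name -> dict of species -> aligned sequence.
    loci: ordered list of locus names to include.
    Species missing from a locus are padded with gaps ('-'); this padding
    rule is an assumption of the sketch.
    """
    species = sorted({sp for locus in loci for sp in alignments[locus]})
    matrix = {sp: "" for sp in species}
    for locus in loci:
        seqs = alignments[locus]
        lengths = {len(s) for s in seqs.values()}
        if len(lengths) != 1:
            raise ValueError(f"{locus} is not aligned: unequal sequence lengths")
        width = lengths.pop()
        for sp in species:
            matrix[sp] += seqs.get(sp, "-" * width)
    return matrix


# Toy aligned sequences (hypothetical, far shorter than real barcodes).
toy = {
    "rbcL": {"sp_A": "ATGC", "sp_B": "ATGA"},
    "matK": {"sp_A": "GG-T", "sp_B": "GGCT", "sp_C": "GGCA"},
}
supermatrix = concatenate_alignments(toy, ["rbcL", "matK"])
```

Every row of the resulting matrix has the same length, which is the property a downstream tree-inference program would require.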
For Bayesian phylogenies without a family-level backbone constraint, we ranked all the posterior topologies by their symmetric distance to the backbone topology at the family level based on the APG IV system using the function "treedist" in the R phangorn library (Schliep, 2011). We then selected the 500 top-ranking topologies for further analysis. For comparison, we also randomly selected 500 topologies from the Bayesian phylogenies with the backbone. We used these selected topologies to estimate the posterior probabilities of the nodes for the Bayesian phylogenies and the Bayesian phylogenies with backbone, respectively (Figures S1-S7).
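The ranking of posterior topologies by symmetric distance can be sketched as follows. This is a simplified illustration, not the phangorn implementation: it assumes rooted trees written as nested tuples on the same set of tips and counts the clades found in only one of the two trees. The function names and the five-tip toy trees are hypothetical.

```python
def clades(tree):
    """Return the non-trivial clades (frozensets of tip labels) of a rooted
    tree written as nested tuples, e.g. (("A", "B"), "C")."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        tips = frozenset().union(*(visit(child) for child in node))
        found.add(tips)
        return tips

    root_tips = visit(tree)
    found.discard(root_tips)  # the root clade is shared by every tree
    return found


def symmetric_distance(tree_a, tree_b):
    """Count the clades present in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))


def closest_topologies(posterior, reference, keep):
    """Return the `keep` posterior trees closest to the reference topology."""
    return sorted(posterior, key=lambda t: symmetric_distance(t, reference))[:keep]


reference = ((("A", "B"), "C"), ("D", "E"))
posterior = [
    ((("A", "C"), "B"), ("D", "E")),  # differs from the reference in one clade
    ((("A", "B"), "C"), ("D", "E")),  # identical to the reference
    ((("A", "D"), "E"), ("B", "C")),  # shares no clade with the reference
]
best = closest_topologies(posterior, reference, keep=2)
```

In the analysis described above, `keep` would correspond to the 500 retained topologies and the reference to the APG IV family-level backbone.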

| Phylogenetic diversity metrics
For each of the fourteen Bayesian phylogenies (7 supermatrices with and 7 supermatrices without the backbone), we calculated several measures of phylogenetic diversity for all plants in the data set as well as at the plot level: Faith's PD, which sums all phylogenetic branch lengths (Faith, 1992); mean pairwise distance (MPD), which is the average distance separating all pairs of species of a community on the phylogenetic tree (Webb, Ackerly, McPeek, & Donoghue, 2002); and mean nearest taxon distance (MNTD), which is the average of the shortest phylogenetic distance for each species to its closest relative in the assemblage (Webb et al., 2002). We calculated MPD and MNTD using a species presence/absence matrix as well as a species abundance matrix. We denote the abundance-weighted versions of these metrics as MPD_ed and MNTD_ed, respectively.
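The three branch length-based metrics can be made concrete with a small Python sketch on a toy tree. The tree, the community, and the abundance weighting (pairs weighted by the product of abundances for MPD, species weighted by their abundance for MNTD) are assumptions for illustration; the weighting used in the original analysis may differ.

```python
from itertools import combinations

# Toy ultrametric tree (hypothetical): child -> (parent, branch length).
TREE = {
    "A": ("n1", 1.0), "B": ("n1", 1.0),
    "n1": ("root", 2.0), "C": ("root", 3.0),
}


def path_to_root(tip, tree):
    """Return {node: length of the branch above it} from a tip to the root."""
    path = {}
    node = tip
    while node in tree:
        parent, length = tree[node]
        path[node] = length
        node = parent
    return path


def faith_pd(tips, tree):
    """Faith's PD: total length of the branches connecting the tips to the root."""
    branches = {}
    for tip in tips:
        branches.update(path_to_root(tip, tree))
    return sum(branches.values())


def distance(a, b, tree):
    """Patristic distance: branches lying on exactly one of the two root paths."""
    pa, pb = path_to_root(a, tree), path_to_root(b, tree)
    only_a = sum(pa[n] for n in pa.keys() - pb.keys())
    only_b = sum(pb[n] for n in pb.keys() - pa.keys())
    return only_a + only_b


def mpd(abund, tree, weighted=False):
    """Mean pairwise distance; optionally weight each pair by a_i * a_j."""
    total = weight_sum = 0.0
    for a, b in combinations(sorted(abund), 2):
        w = abund[a] * abund[b] if weighted else 1.0
        total += w * distance(a, b, tree)
        weight_sum += w
    return total / weight_sum


def mntd(abund, tree, weighted=False):
    """Mean nearest taxon distance; optionally weight each species by a_i."""
    total = weight_sum = 0.0
    for a in abund:
        nearest = min(distance(a, b, tree) for b in abund if b != a)
        w = abund[a] if weighted else 1.0
        total += w * nearest
        weight_sum += w
    return total / weight_sum


community = {"A": 5, "B": 1, "C": 2}  # species -> number of individuals
pd_value = faith_pd(community, TREE)
mpd_value = mpd(community, TREE)
mntd_value = mntd(community, TREE)
```

Because MPD averages over all pairs, it is dominated by the deep split between C and the other two species, whereas MNTD reflects only the distance from each species to its closest relative.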
In addition, we calculated (a) a metric of phylogenetic-abundance evenness (PAE), which evaluates the relationship between the abundance and the distribution of terminal branch lengths (Cadotte et al., 2010) and (b) the imbalance of abundances at higher clades (IAC), which encapsulates the distribution of individuals across the nodes in the phylogeny (Cadotte et al., 2010). These diversity measures were chosen because of their wide use in ecology and conservation and because they represent measures of diversity that are based upon either branch lengths or tree topology.

| Linear mixed-effects models
To determine the effects of single and multilocus DNA barcodes, family-level backbone, and their combinations on each measure of phylogenetic diversity, we constructed a series of linear mixed-effects models using the "lme" function in the nlme library in R (Pinheiro, Bates, DebRoy, Sarkar, & R Core Team, 2016). The general form of the GLMM is as follows:
$$\log(\text{phylogenetic diversity}) = \alpha + \delta_{\text{plot}} + \beta_1\,\text{rbcL} + \beta_2\,\text{matK} + \beta_3\,(\text{ITS}/\text{ITS2}) + \beta_4\,\text{backbone},$$
where rbcL, matK, ITS, ITS2, and the use of a family-level backbone phylogeny were set as fixed effects (not including the global intercept, α), and the plots (100 and 600 m) are random effects.
We modeled the plots as random intercepts (δ_plot) to account for plot-level differences in measures of phylogenetic diversity that were unrelated to the particular barcodes and the use of a family-level backbone phylogeny. To meet the assumptions of normality, we log-transformed all measures of phylogenetic diversity.
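One way to arrive at the candidate model set is a full factorial over the fixed effects, with ITS and ITS2 treated as mutually exclusive alternatives. The sketch below enumerates that set; the factorial structure is an assumption that happens to reproduce the total of 24 models (including the intercept-only model) reported in the Results, and the function names and formula strings are hypothetical.

```python
from itertools import product


def candidate_models():
    """List every fixed-effect combination as a tuple of term names."""
    models = []
    for rbcl, matk, spacer, backbone in product(
        (False, True),          # rbcL absent / present
        (False, True),          # matK absent / present
        (None, "ITS", "ITS2"),  # no spacer, ITS, or ITS2 (never both)
        (False, True),          # family-level backbone absent / present
    ):
        terms = []
        if rbcl:
            terms.append("rbcL")
        if matk:
            terms.append("matK")
        if spacer is not None:
            terms.append(spacer)
        if backbone:
            terms.append("backbone")
        models.append(tuple(terms))
    return models


def as_formula(terms, response="log(PD)"):
    """Write a fixed-effects formula; the empty tuple is the intercept-only model."""
    return f"{response} ~ " + (" + ".join(terms) if terms else "1")


models = candidate_models()
formulas = [as_formula(m) for m in models]
```

Each formula would then be fitted with the plot as a random intercept, once per phylogenetic diversity metric.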
We evaluated model support using Akaike's Information Criterion corrected for small sample sizes (AICc; Burnham & Anderson, 2002). To describe the proportion of variance explained by just the fixed factors and by the fixed and random factors together, we used the function "r.squaredGLMM" in the library MuMIn in R to calculate marginal R² and conditional R², respectively (Barton, 2018; Nakagawa & Schielzeth, 2013). To check the robustness of the multi-model inferences to random sampling, we randomly resampled the suites of estimates of phylogenetic diversity 100 times and reran the multi-model inference for each random dataset. The model ranks were consistent among random samples. Here, we only present the multi-model inference results based on the original measures of phylogenetic diversity. To provide a relative rank of the importance of the main predictors, we calculated standardized coefficients (β_n/SE_n) for each term n in the models featured in each subset, averaged these across all models based on AICc weights (wAICc) (re-calculating ΣwAICc = 1 over the models in which each term appeared), and then calculated the mean and confidence intervals (95%) of standardized coefficients for each term.
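The model-ranking arithmetic can be sketched in a few lines of Python. The formulas for AICc and Akaike weights are the standard ones (Burnham & Anderson, 2002); the function names, the toy AICc scores, and the toy coefficients are hypothetical and do not come from the fitted models.

```python
import math


def aicc(log_lik, k, n):
    """AIC corrected for small samples: AIC + 2k(k + 1) / (n - k - 1)."""
    aic = -2.0 * log_lik + 2.0 * k
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)


def akaike_weights(scores):
    """Convert AICc scores into weights that sum to 1 across the model set."""
    best = min(scores)
    rel = [math.exp(-0.5 * (s - best)) for s in scores]
    total = sum(rel)
    return [r / total for r in rel]


def averaged_standardized_coef(term, models, weights):
    """Average beta/SE for `term` over the models that contain it, with the
    weights renormalized to sum to 1 over those models only."""
    pairs = [(w, m[term][0] / m[term][1])
             for m, w in zip(models, weights) if term in m]
    norm = sum(w for w, _ in pairs)
    return sum(w * z for w, z in pairs) / norm


# Toy model set (hypothetical): term -> (coefficient, standard error).
toy_models = [
    {"matK": (0.4, 0.1)},
    {"matK": (0.6, 0.2), "ITS": (0.3, 0.1)},
    {},  # intercept-only model
]
toy_scores = [100.0, 102.0, 110.0]  # hypothetical AICc values
weights = akaike_weights(toy_scores)
matk_effect = averaged_standardized_coef("matK", toy_models, weights)
```

Renormalizing the weights over the models that contain a term keeps the averaged coefficient from being shrunk simply because the term is absent from part of the model set.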

| RESULTS
Of the 24 general linear mixed-effects models that were constructed (including the intercept-only model), models that included multilocus barcodes ranked higher than those that included single barcodes (Table 1; Supporting Information Tables S2-S8). The top-rank models for all metrics except IAC included rbcL, matK, ITS, and family-level backbone (wAICc = 0.999 for PD, MPD, MPD_ed, MNTD, MNTD_ed, and PAE) (Table 1 and Supporting Information Tables S2-S7), which accounted for >9% of the variance explained for each estimate.
By contrast, ITS2 instead of ITS was included in the most parsimonious models for IAC (wAICc = 0.999; Table 1 and Supporting Information Table S8). However, IAC estimates were much less dependent on the combination of DNA sequence data and phylogeny [...]
FIGURE 1 Map of the study sites on Dinghu Mountain, Guangzhou, Guangdong Province, China
[...] were based on tree topology (Figure 2f,g), but the direction of its effect depended on the topology-based metric. ITS was the most influential factor for estimates of PAE (Figure 2f), whereas family-level backbone was the most influential factor for IAC (Figure 2g).

| DISCUSSION
Our results reveal that multilocus barcodes outperform single-locus barcodes in explaining maximum variation in estimates of phylogenetic diversity regardless of the phylogenetic reconstruction methods used, both in terms of model ranking and model-averaged, standardized effects. This result is in line with previous meta-analytical and experimental evidence that suggests a combination of more than one DNA barcode locus, including a phylogenetically conservative coding locus and one or more rapidly evolving barcode regions, is essential for inferring robust phylogenetic relationships among plants (Burgess et al., 2011; Fazekas et al., 2008; Kress & Erickson, 2007; Kress et al., 2009; Li et al., 2011; Liu, Yan, et al., 2015). Here, the complementary influence of DNA barcodes with different rates of evolution might be due to different ecological and evolutionary processes operating at different evolutionary time scales, which contribute differently to plant community phylogenetic structure (Mazel et al., 2016). For example, conserved DNA barcodes might provide important insight into the processes acting at long evolutionary time scales, whereas rapidly evolving barcodes might signal more recent speciation events (Webster, Payne, & Pagel, 2003). Of the universal barcodes that were used in our models, the effects of both chloroplast DNA regions (rbcL and matK) were evident across all phylogenetic diversity metrics. Notably, matK tended to be a more important factor for inferring phylogenetic diversity metrics that are based on branch length methods, while rbcL had a greater influence on topology-based metrics. This result suggests that rbcL and matK might be useful for estimating phylogenetic diversity for subtropical plant communities in China by establishing deep phylogenetic branches and terminal branches, respectively.
Indeed, there is increasing evidence that matK is the most variable coding region of the angiosperm plastome and as such, in most genera, matK has higher species discriminatory power compared to rbcL (Hilu et al., 2003;Hollingsworth, Clark, et al., 2009;Liu, Yan, et al., 2015).
However, our results also suggest that the influences of DNA barcodes on measures of phylogenetic diversity might be inconsistent with their discriminatory success and more dependent on the methods and metrics used, both sources of variation that will likely have broad implications for future studies.
In this study, ITS had a stronger effect than ITS2 on estimates of community phylogenetic diversity except for IAC. This result is not surprising given that ITS2 is only one of three partitions in the ITS gene region (ITS1, 5.8S, ITS2; Coleman, 2003). Collectively, the effects of ITS on estimates of phylogenetic diversity that were based on branch length methods were comparable to those of regions of the plastid genome (i.e., matK) using model-averaged, standardized coefficients. Although this finding implies that ITS does not estimate branch lengths better than matK, ITS does show a much stronger influence on estimates of PAE than barcode regions of the plastid genome. Because PAE stresses the phylogenetic-abundance distributions among terminal branches (Cadotte et al., 2010), our results indicate that ITS may be a better estimator, over the other three DNA barcodes, of phylogenetic relationships at the tips of the community phylogeny. Of the four DNA barcode markers used in this study, ITS has been shown to have the highest species discriminatory power due to its ability to differentiate closely related, congeneric species (Li et al., 2011). Meanwhile, ITS2 outperformed ITS in predicting the variation in IAC, which stresses the topology at deep nodes of the phylogeny. This result suggests that ITS2 may be a better tool for estimating phylogenetic relationships at deep clades.
We found that models that included a backbone family-level phylogeny had the highest support but only showed a strong effect for measures of phylogenetic diversity that are based on the topology of the community phylogeny. In our study system, the use of a backbone family-level phylogeny was required to explain maximum variation in topology-based metrics, which is consistent with our expectation that a limited number of DNA barcodes might generate inconsistent relationships deep within the phylogeny compared to broadly accepted patterns (Erickson et al., 2014). Such inconsistency might have a significant influence on measures of phylogenetic diversity, which are more sensitive to the basal topology of phylogenies.
Indeed, we found evidence that the use of a backbone family-level phylogeny had a stronger effect on estimates of IAC than on those of PAE, given that PAE measures the phylogenetic-abundance distribution among terminal branches and IAC quantifies the imbalance of abundances at deeper clades (Cadotte et al., 2010). However, the effects of enforcing a backbone for deeper relationships in the phylogeny were negligible for estimates of branch length-based metrics.
Although the optimal combination of DNA barcodes (e.g., rbcL + matK + ITS) and the use of a backbone family-level phylogeny served as a consistent and accurate predictor for the metrics of phylogenetic diversity considered here, the explanatory power of the top-rank models varied depending on how phylogenetic diversity was measured. For example, mean pairwise distance (MPD) attained the highest proportion (>31%) of the explained variation, whereas the imbalance of abundances at higher clades (IAC) attained the lowest proportion (0.1%). It is generally agreed that different ecological processes (i.e., environmental filtering and limiting similarity) and evolutionary processes (i.e., local adaptation, speciation, extinction) operating at different spatiotemporal scales can contribute to community structure (Cavender-Bares et al., 2009; Swenson, 2011). However, determining the relative contribution of ecological versus evolutionary processes to community patterns can be difficult (Cavender-Bares et al., 2009). Among the metrics of phylogenetic diversity considered, MPD was "best" explained by the combination of DNA barcodes, which is consistent with previous studies. For example, Mazel et al. (2016) showed that MPD is more sensitive to long-term evolutionary processes compared to PD and MNTD. However, our results for IAC, which is independent of the DNA barcodes and phylogeny construction methods, suggest that ecological processes are mainly generating community phylogenetic patterns at our sites. Given that combinations of DNA barcodes, the use of a backbone family-level phylogeny, and the plots together accounted for substantial proportions (R²c > 32%) of variation in a series of estimates of phylogenetic diversity, this result is in line with the view that both ecological and evolutionary processes are influencing biodiversity at our sites.
Future studies should consider the inclusion of functional, environmental, and demographic data to further elucidate the underlying ecological and evolutionary mechanisms contributing to community structure.
This study examined the influence of different combinations of four core DNA barcodes and community phylogeny reconstruction methods on a series of estimates of community phylogenetic diversity metrics for two subtropical forest plots in Guangdong, southern China. There are, however, a number of additional DNA fragments including coding regions (i.e., rpoB and rpoC1) and noncoding spacer regions (i.e., atpF-atpH, trnH-psbA, and psbK-psbI) that have been proposed as candidates for universal plant DNA barcodes. Furthermore, our power to detect influences of plant DNA barcodes and tree construction methods on estimates of phylogenetic diversity was limited by the small set of phylogenetic diversity metrics considered. Although these factors should be considered in future studies, our study provides insight into the magnitude of the influence of single barcodes, or their combination, and phylogeny reconstruction methods on community phylogenetic patterns. Notably, our study underscores the complexity of explaining community phylogenetic patterns, where future studies should evaluate the sensitivity of phylogenetic diversity metrics to the