Clade composition of a plant community indicates its phylogenetic diversity

Abstract Phylogenetic diversity quantification is based on indices computed from phylogenetic distances among species, which are derived from phylogenetic trees. This approach requires phylogenetic expertise and available molecular data, or a fully sampled synthesis‐based phylogeny. Here, we propose and evaluate a simpler alternative approach based on taxonomic coding. We developed metrics, the clade indices, based on information about clade proportions in communities and species richness of a community or a clade, which do not require phylogenies. Using vegetation records from herbaceous plots from Central Europe and simulated vegetation plots based on a megaphylogeny of vascular plants, we examined fit accuracy of our proposed indices for all dimensions of phylogenetic diversity (richness, divergence, and regularity). For real vegetation data, the clade indices fitted phylogeny‐based metrics very accurately (explanatory power was usually higher than 80% for phylogenetic richness, almost always higher than 90% for phylogenetic divergence, and often higher than 70% for phylogenetic regularity). For phylogenetic regularity, fit accuracy was habitat and species richness dependent. For phylogenetic richness and divergence, the clade indices performed consistently. In simulated datasets, fit accuracy of all clade indices increased with increasing species richness, suggesting better precision in species‐rich habitats and at larger spatial scales. Fit accuracy for phylogenetic divergence and regularity was unreliable at large phylogenetic scales, suggesting inadvisability of our method in habitats including many distantly related lineages. The clade indices are promising alternative measures for all projects with a phylogenetic framework, which can trade‐off a little precision for a significant speed‐up and simplification, such as macroecological analyses or where phylogenetic data is incomplete.

amount of time since the most common ancestor of a pair of species), which are derived from dated phylogenies. Researchers have developed more than 70 metrics for quantifying alpha (within-site) and beta (among-site) phylogenetic diversity, which are summarized under several frameworks (Scheiner, Kosman, Presley, & Willig, 2017; Tucker et al., 2017). There is no agreement on the best or most suitable metric. Phylogenetic diversity reflects diversification of lineages, geographic movement of lineages, and deep-past and present assembly processes (Gerhold, Carlucci, Proches, & Prinzing, 2018; Webb et al., 2002; Yguel et al., 2016) that can be lineage specific (Elliott, Waterway, & Davies, 2016; Ndiribe et al., 2013). Given such complexity, phylogenetic patterns in communities cannot be captured by a single number. This plethora of metrics is therefore inevitable, because each metric was designed to capture a specific aspect of phylogenetic diversity. Fortunately, various phylogenetic diversity metrics tend to correlate (Swenson, 2014; Vellend, Cornwell, Magnuson-Ford, & Mooers, 2011), suggesting that some of them are redundant; thus, there has been an attempt to select a leading measure for each dimension of phylogenetic diversity (richness, divergence, and regularity; sensu Tucker et al., 2017; Table 1).
Constructing dated phylogenies requires considerable effort, and the whole process is affected by methodological biases and subjective decisions (Jantzen et al., 2019; Li et al., 2019). Further, calculated phylogenetic diversity metrics depend on the attributes of phylogenies, such as the degree of balance, diversification rate, resolution, taxon sampling, or tree reconstruction methods (Jantzen et al., 2019; Park, Worthington, & Xi, 2018; Swenson, 2009; Vellend et al., 2011). Here, we propose and evaluate an approach that treats species phylogeny as a categorical variable (i.e., affiliation to a phylogenetic clade) rather than a continuous one (i.e., phylogenetic distances among species). A similar approach based on taxonomic relatedness (derived from a hierarchical Linnaean classification with applied taxonomic weights proportional to the level of the taxonomic rank two species hold in common, i.e., genus, family, or order) has proven useful for estimating biodiversity patterns in fish communities (Campbell, Neat, Burns, & Kunzlik, 2010; Hall & Greenstreet, 1998; Warwick & Clarke, 1995). There is also a clear parallel in functional ecology: clades can be considered analogous to plant functional types (PFT), and their proportions can be used to indicate the phylogenetic diversity of a community. Such a categorical approach to phylogeny might be a tool for ecologists who are not specialists in phylogenetics and might be useful in communities where some taxa do not have available DNA sequences or in studies where a little precision can be traded off for a significant speed-up and simplification.
This framework certainly causes a loss of information, as we basically introduce a polytomy at the node of a defined clade; that is, the categorical approach still separates species according to their clade affiliation, but it ignores phylogenetic information within clades. On the other hand, there is some indirect support that this loss of phylogenetic information within clades would have only a marginal effect. Li et al. (2019) compared purpose-built phylogenies (estimated from sequence data) with published synthesis-based supertrees (which usually have more polytomies than the former) and showed that phylogenetic diversity metrics computed from both types of phylogenies were highly correlated. Cadotte (2015) also demonstrated that changing branch lengths did not strongly affect relationships between phylogenetic diversity and ecosystem function, suggesting that phylogenetic diversity measures are not very sensitive to the branch lengths of the phylogeny as long as the topology is right. One important criterion for choosing among metrics is their conceptual and mathematical simplicity (Vellend et al., 2011). Therefore, if the categorical approach provides values that correlate sufficiently with phylogeny-based measures, then its use can be justified in order to simplify and speed up phylogenetic diversity estimation.
The phylogenetic categorical approach cannot rely on phylogenetic distances, but we can include information about how clades are represented in a community (presence and relative abundance) to estimate its phylogenetic diversity. Consider a simple example phylogeny of 10 species (Figure 1a), which covers all major clades of the whole species pool of our first case study (Figure S1). We simulated 1,000 communities in which these 10 species occurred, letting their proportions vary randomly. For each community, we estimated phylogenetic richness, divergence, and regularity. Phylogenetic divergence was relatively high when all defined clades (i.e., monocots, Ranunculales, superrosids, and superasterids) had equal proportions (Figure 1c). Finally, phylogenetic regularity was relatively high (i.e., the variance of phylogenetic distances was low) when the proportions of the defined clades matched their relative species richness in the species pool (Figure 1d).
Based on the conclusions from the conceptual example described above, we propose here three alternative measures, the clade indices that do not require dated phylogenies for their computation, but instead they utilize information about clade proportions in a community and species richness of a community or defined clades (Table 2). We assessed their fit accuracy for leading phylogeny-based measures of the three dimensions of phylogenetic diversity: richness, divergence, and regularity (sensu Tucker et al., 2017).
To do so, we examined the performance of the proposed clade indices in two case studies: first, a dataset with a purpose-built phylogeny (sensu Li et al., 2019) and a relatively small number of taxa in the species pool, and second, a dataset with a synthesis-based phylogeny (sensu Li et al., 2019) and a relatively large number of taxa in the species pool. In the first case study, we also examined which clade resolution (super-order, order, or family level) for the clade index definition is the most suitable in terms of fit accuracy for phylogeny-based measures. Secondly, we used simulated community matrices based on a megaphylogeny of 31,389 vascular plants (Qian & Jin, 2016) to demonstrate how the clade indices perform at various phylogenetic scales (Graham, Storch, & Machac, 2018), at different species pool sizes, and along a species richness gradient.

TABLE 1 Summary of three dimensions of phylogenetic diversity (defined by Tucker et al., 2017)
Species-rich communities | Clade-rich communities | Communities with low asymmetric competition

FIGURE 1 A conceptual example demonstrating how clade proportions (relative cover) affect values of leading metrics of all dimensions of phylogenetic diversity (Faith's PD = richness, MPD = divergence, and VPD = regularity). (a) We randomly selected 10 species: two monocots (Agrostis capillaris L. and Bromus erectus Huds.), one Ranunculales (Ranunculus repens L.), three superrosids (Fragaria viridis Weston, Trifolium pratense L., and Vicia cracca L.), and four superasterids (Aegopodium podagraria L., Centaurea jacea L., Campanula patula L., and Plantago major L.) in order to cover all major clades of the whole species pool (Figure S1). The number of species in each clade approximately reflects the relative species richness of clades in the species pool of the case study in species-rich grasslands. Then, we simulated 1,000 communities using all 10 species and let their proportions vary randomly. Phylogenetic richness, divergence, and regularity were estimated for each simulated community. (b) Faith's PD particularly increased with increasing proportion of R. repens (i.e., the most phylogenetically distant species relative to the rest). Distant branches contribute more to phylogenetic richness as they are longer, suggesting that an increase in their weight (reflecting species proportion in a community) also increases the phylogenetic richness of a community.

| Data collecting
The focus of the case studies was on herbaceous terrestrial systems. First, we used data from species-rich grasslands located in  Chytrý & Rafajová, 2003). This dataset included 16,542 plots and 1,608 species and covered 26 Central European herbaceous habitats (see Table S2 for a habitat classification). We limited our analysis to herbaceous angiosperms that dominate all systems used in this study. In the grassland dataset, tree taxa were omitted in the initial phase of the vegetation recording, but this most likely did not affect estimation of phylogenetic diversity as we found only a few tree seedlings in a few plots. We deleted Pteridophyta from both datasets, whereas gymnosperms did not occur in any dataset.

| Phylogenetic inference and molecular dating
Prior to the phylogenetic analysis, we checked species lists and edited some species names in order to follow the NCBI nomenclature. For the species-rich grasslands, we constructed a molecular-based phylogeny for our 171 species using 20 orthologous loci downloaded from GenBank (Benson et al., 2017) via the online tool OneTwoTree (Drori et al., 2018). We used Piper nigrum L. from the Magnoliids group (a sister clade to the clades occurring in our dataset; APG IV, 2016) as an out-group. Due to missing sequence data, we replaced Potentilla heptaphylla L. with a relatively close congener, Potentilla crantzii (Crantz) Beck ex Fritsch (Dobeš, Rossa, Paule, & Hülber, 2013), that had available DNA data. Sequences were aligned using the fast option (FFT-NS-2) in MAFFT (Katoh & Standley, 2013) under the default settings available at the OneTwoTree website (6mer pairwise alignment method). The alignment was then cured using the Gblocks online tool (under less stringent selection settings; Castresana, 2000).
We constructed the dated tree using BEAST version 1.10.4  in the CIPRES portal (Miller, Pfeiffer, & Schwartz, 2010). To do so, we manually set constraints according to the APG IV angiosperm phylogeny (APG IV, 2016) and set the uncorrelated relaxed clock as a clock model, Yule process as a speciation model and GTR+G+I (with four gamma categories) as a nucleotide substitution model. To translate genetic distances into absolute times, we exploited the TimeTree database (Kumar, Stecher, Suleski, & Hedges, 2017) and set several time priors with normally distributed errors (median and standard deviation computed from all studies available in the TimeTree database reporting a given divergence time estimate). We performed three independent runs (with different starting seeds) for 100 million generations each. Finally, we checked convergence in Tracer v1.7.1 (Rambaut, Drummond, Xie, Baele, & Suchard, 2018) and combined all runs (10% generations as a burn-in). The dated maximum clade credibility tree ( Figure S1) was sampled from 30,000 trees (10% trees as a burn-in).
For the species in the dataset from the CNPD, we extracted species phylogeny from the dated supertree of the European flora (Durka & Michalski, 2012) and followed their nomenclature.

| Phylogenetic diversity dimensions and metrics
We applied the framework of Tucker et al. (2017) and selected three leading metrics describing three phylogenetic diversity dimensions: richness, divergence, and regularity (Table 1). Faith's PD (Faith, 1992) describes the amount of evolutionary history across species (sum of branch lengths) and is a leading measure of phylogenetic richness. Mean phylogenetic distance between each pair of species (MPD; Webb et al., 2002) is a leading measure of phylogenetic divergence. Variation of pairwise phylogenetic distances between each pair of species (VPD; Clarke & Warwick, 2001) is a leading measure of phylogenetic regularity (lower variation indicates higher regularity). We also identified species richness in each plot.

TABLE 2 Summary of the proposed clade indices
(a) Clade richness index. Description: Species-rich clades are penalized as they get lower weight proportional to their clade richness. Higher proportions of species-poor clades increase the clade richness index values. Rationale: Species from species-poor clades have a higher probability of being relatively phylogenetically distant from the rest of a community, and their increasing proportion increases the phylogenetic richness of a community (Figure 1b).
(b) Clade divergence index. Description: Larger deviations from optimal proportions (i.e., 1/number of defined clades in the whole species pool) decrease the value of the clade divergence index. Scales from 0 to 1. Rationale: Phylogenetic divergence tends to be close to its peak when a community consists of all clades of a species pool and their proportions are equal (Figure 1c).
(c) Clade regularity index. Description: Larger deviations from the optimal proportions (i.e., clade species richness/total species pool richness) decrease the value of the clade regularity index. Scales from 0 to 1. Rationale: Phylogenetic regularity tends to be close to its peak (the lowest VPD) when a community consists of all clades of a species pool and their proportions are proportional to their relative clade richness given a species pool (Figure 1d).
Note: S = species richness of a plot; p_i = proportion of the ith clade in a plot; CR_i = species richness of the ith clade in the whole species pool (all species in the dataset); CR_SP = the number of all defined clades in the whole species pool; S_SP = species richness of the whole species pool.
According to Vellend et al. (2011), one can distinguish two qualitatively different types of phylogenetic diversity indices. Faith's PD, MPD, and VPD are type II metrics, which are calculated using a subset phylogeny of a focal subset of species (e.g., a vegetation plot). Type I indices are based on the whole species pool phylogeny; a distinctness score is calculated for each species. These scores are then used to calculate a phylogenetic diversity measure of a plot (for example, summed evolutionary distinctiveness; Redding & Mooers, 2006). However, type I indices are highly correlated with Faith's PD (Vellend et al., 2011), suggesting they are closely related to the phylogenetic richness dimension, and so we did not consider them. We calculated indices using functions (pd and mpd) from the picante package. To compute VPD, we modified the mpd function to calculate the variation of pairwise phylogenetic distances (not the mean as in the original function). All metrics were abundance weighted by percentage cover. To calculate abundance-weighted Faith's PD (Barker, 2002), we used the R function of Swenson (2014).
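The relationship between MPD and VPD can be illustrated with a minimal Python sketch. It is not the picante code or the authors' modified function: the helper name `weighted_pairwise_stats`, the pair weighting by the product of the two species' covers, the use of a population (not sample) variance, and the toy distances are all assumptions made for illustration.

```python
from itertools import combinations


def weighted_pairwise_stats(dist, cover):
    """Return (MPD, VPD) for one plot (hypothetical helper).

    dist  : dict mapping frozenset({sp_a, sp_b}) -> phylogenetic distance
    cover : dict mapping species -> percentage cover (any abundance measure)
    Assumption: each species pair is weighted by the product of the covers.
    """
    present = [sp for sp, c in cover.items() if c > 0]
    pairs = list(combinations(present, 2))
    if not pairs:
        raise ValueError("need at least two species with non-zero cover")
    weights = [cover[a] * cover[b] for a, b in pairs]
    dists = [dist[frozenset((a, b))] for a, b in pairs]
    total = sum(weights)
    # Weighted mean of pairwise distances (divergence dimension).
    mpd = sum(w * d for w, d in zip(weights, dists)) / total
    # Weighted variance of pairwise distances (regularity dimension).
    vpd = sum(w * (d - mpd) ** 2 for w, d in zip(weights, dists)) / total
    return mpd, vpd


# Toy distances, invented for illustration only.
dist = {
    frozenset(("A", "B")): 20.0,
    frozenset(("A", "C")): 100.0,
    frozenset(("B", "C")): 100.0,
}
even = weighted_pairwise_stats(dist, {"A": 1, "B": 1, "C": 1})
skewed = weighted_pairwise_stats(dist, {"A": 8, "B": 1, "C": 1})
```

In this toy case, shifting cover toward species A lowers the weighted MPD, because the most heavily weighted pairs then include the closely related pair A and B.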

| Definition of the clade indices
Species affiliation to a clade was based on the recent APG IV classification (APG IV, 2016). The proposed clade indices are summarized in Table 2. They all need information about clade proportions in a community (e.g., relative cover, biomass or abundances). The key idea behind the clade richness index is to penalize proportions of species-rich clades (by reverse clade species richness) because species from species-rich clades are unlikely to be relatively distantly related to the rest of co-occurring species in a community.
By chance, more species from a species-rich clade can occur in a community, which would decrease phylogenetic richness as these species are relatively closely related. Species richness can be a very good indicator of phylogenetic richness on its own (Swenson, 2014; Vellend et al., 2011); hence, it is useful to include it in the equation (Table 2a). For phylogenetic divergence, when clades are equally abundant in a community, phylogenetic divergence is close to its peak (Figure 1c). Thus, any deviations from these equal proportions should decrease phylogenetic divergence (Table 2b). For instance, if all clades are present and have equal (i.e., optimal) proportions, the clade divergence index equals one. Finally, the clade regularity index has a similar computation to the clade divergence index, but the optimal proportions are proportional to the relative clade species richness (Table 2c). An R script for computation of the clade indices is stored in the supplemental dataset (https://data.mendeley.com/datasets/gbv472pxsb/1).
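The idea of penalizing deviations from optimal clade proportions can be sketched in Python. The exact formulas of Table 2 are not reproduced in this text, so the function below is a hypothetical stand-in, not the authors' clade divergence or regularity index: it assumes the penalty is the total absolute deviation, rescaled by its largest possible value so that the result runs from 0 to 1. The name `deviation_index` and the toy clade richness values (taken from the 10-species example of Figure 1) are illustrative assumptions.

```python
def deviation_index(proportions, optimum):
    """1 minus the rescaled total absolute deviation from optimal proportions.

    HYPOTHETICAL formula for illustration; not the published clade indices.
    proportions : dict clade -> proportion (or cover) in a plot
    optimum     : dict clade -> optimal proportion, summing to 1
    """
    if len(optimum) < 2:
        raise ValueError("at least two clades must be defined")
    if abs(sum(optimum.values()) - 1.0) > 1e-9:
        raise ValueError("optimal proportions must sum to 1")
    total = sum(proportions.get(c, 0.0) for c in optimum)
    if total <= 0:
        raise ValueError("plot has no cover in the defined clades")
    # Normalize so that clade proportions in the plot sum to 1.
    p = {c: proportions.get(c, 0.0) / total for c in optimum}
    deviation = sum(abs(p[c] - optimum[c]) for c in optimum)
    # Largest possible deviation: all cover in the clade with the smallest optimum.
    worst = 2.0 * (1.0 - min(optimum.values()))
    return 1.0 - deviation / worst


# Species richness of each clade in a toy species pool (Figure 1 example).
pool_richness = {"monocots": 2, "Ranunculales": 1, "superrosids": 3, "superasterids": 4}
n_clades = len(pool_richness)
s_pool = sum(pool_richness.values())

# Optimum for divergence: equal proportions (1 / number of defined clades).
equal_opt = {c: 1.0 / n_clades for c in pool_richness}
# Optimum for regularity: clade richness / total species pool richness.
richness_opt = {c: r / s_pool for c, r in pool_richness.items()}

plot = {c: 0.25 for c in pool_richness}  # all clades equally abundant
divergence_like = deviation_index(plot, equal_opt)
regularity_like = deviation_index(plot, richness_opt)
```

With equal clade proportions the divergence-like value reaches one, consistent with the statement above, while the regularity-like value is lower because the optimum for regularity follows clade richness instead.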

| Performance of the clade indices: case studies
We did all statistical analyses and data simulations in we detected during the model diagnostics.

| Performance of the clade indices: simulated datasets
Simulation workflow was specifically designed to cover several aspects that can affect phylogenetic diversity estimation, that is, taxon sampling (Park et al., 2018), the number of taxa included in the regional phylogeny (Jantzen et al., 2019) or species richness of a community (Sandel, 2018;Swenson, 2014). Thus, these factors could also affect fit accuracy of the clade indices for all dimensions of phylogenetic diversity. The simulation workflow is summarized in Figure S2.
Simulation was based on a megaphylogeny of vascular plants (Zanne et al., 2014, updated by Qian & Jin, 2016). We set three phylogenetic scales: vascular plants, angiosperms, and superasterids. For each phylogenetic scale, we set three species pool sizes: 2,000, 500, and 250 species. These species pools were created by randomly sampling species from a given phylogeny (vascular plants, angiosperms, or superasterids). For each combination of phylogenetic scale and species pool size, we generated community matrices under several species richness ranges: 10-160, 10-80, 10-40, 10-20, 5-10, and 2-5 species per community. For each species richness range, we generated 50 community matrices with 240 sites (same data size as in the grassland case study). Species proportions in communities were random, but their sums were always one. In total, we generated 2,700 unique species pools with 2,700 unique corresponding community matrices (900 for each phylogenetic scale).
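The factorial design described above can be sketched in Python to check the counts. This is an illustration under assumptions, not the authors' simulation script: the function `random_community_matrix`, the uniform draw of site richness within each range, the uniform random weights behind the proportions, and the placeholder species names are all hypothetical.

```python
import itertools
import random

SCALES = ("vascular plants", "angiosperms", "superasterids")
POOL_SIZES = (2000, 500, 250)
RICHNESS_RANGES = ((10, 160), (10, 80), (10, 40), (10, 20), (5, 10), (2, 5))
REPLICATES = 50   # community matrices per species richness range
N_SITES = 240     # sites per community matrix

# One entry per simulated species pool and community matrix.
design = list(itertools.product(SCALES, POOL_SIZES, RICHNESS_RANGES, range(REPLICATES)))


def random_community_matrix(pool, richness_range, n_sites, rng):
    """Return a list of sites, each a dict species -> proportion summing to 1.

    Assumption: site richness is uniform within the range and proportions
    come from normalized uniform random weights.
    """
    low, high = richness_range
    matrix = []
    for _ in range(n_sites):
        richness = rng.randint(low, high)
        members = rng.sample(pool, richness)
        raw = [1.0 - rng.random() for _ in members]  # values in (0, 1]
        total = sum(raw)
        matrix.append({sp: x / total for sp, x in zip(members, raw)})
    return matrix


rng = random.Random(1)
pool = [f"sp{i}" for i in range(250)]  # placeholder names for a 250-species pool
example = random_community_matrix(pool, (10, 160), N_SITES, rng)
```

The length of `design` is 3 x 3 x 6 x 50 = 2,700, with 900 entries per phylogenetic scale, which matches the totals reported in the text.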
For each community matrix, we computed both phylogeny-based metrics (Faith's PD, MPD and VPD)

| RESULTS
For all phylogenetic diversity dimensions, fit accuracy of the clade indices increased with fineness of phylogenetic resolution in species-rich grasslands (Table S3) Table S4.
Heteroscedasticity was mainly apparent in the CNPD dataset, which showed changeable fit accuracy across habitats (Table S5); the heteroscedasticity issues at the left end were mainly caused by several habitats (Figure S5), such as C1 (surface standing waters) or C2 (surface running waters). Partly, the broader taxon sampling in the CNPD phylogeny was the reason for the large range of VPD values (approximately three times larger than in species-rich grasslands).
The variance of VPD values was largest at the left end, where the clade regularity index explained VPD less accurately (Figure 2f).
Nevertheless, R² increased markedly (72.3%) when we included only plots with a clade regularity index higher than 0.2 (93.8% of all plots). For phylogenetic richness and divergence, fit accuracy across habitats was usually similar (more than 70% for phylogenetic richness and more than 90% for phylogenetic divergence), with several exceptions with lower R² values, such as H2 (screes) or E4 (alpine and subalpine grasslands). Fit accuracy in all habitats is given in Table S5.
Simulated datasets revealed that species richness range was the most important determinant of fit accuracy of the clade richness index, while phylogenetic scale mainly affected fit accuracy of the clade divergence and regularity indices, followed by species richness (Table 3). Species pool size did not influence fit accuracy for any phylogenetic diversity dimension (Table 3). For phylogenetic richness and regularity, R² values increased with increasing species richness range (Figure 3a, Figure S6d,e). For phylogenetic divergence and regularity, fit accuracy increased with decreasing phylogenetic scale; R² was highest in community matrices sampled from the phylogeny of superasterids (Figure 3b,c), while the clade indices for these two dimensions were less reliable at the largest phylogenetic scale, that is, vascular plants (Figure 3b,c). At smaller phylogenetic scales (angiosperms and superasterids), fit accuracy for phylogenetic regularity also increased with increasing species richness range (Figure S6d,e), but this was not the case when we sampled community matrices using the whole phylogeny of vascular plants, that is, the largest phylogenetic scale considered (Figure S6f).
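Fit accuracy expressed as R² can be illustrated with a short Python sketch. It assumes a simple linear regression of a phylogeny-based metric on a clade index, which may differ from the models actually fitted in the study; the function name `r_squared` and the toy values are invented for illustration.

```python
def r_squared(x, y):
    """Coefficient of determination of a simple linear regression of y on x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    # With one predictor, R squared equals the squared Pearson correlation.
    return sxy ** 2 / (sxx * syy)


# Toy plots (invented values): VPD is noisy where the index is low.
index = [0.05, 0.10, 0.15, 0.30, 0.50, 0.70, 0.90]
vpd = [900.0, 200.0, 1500.0, 700.0, 500.0, 320.0, 100.0]

r2_all = r_squared(index, vpd)

# Keep only plots with an index above 0.2, as in the threshold used above.
kept = [(i, v) for i, v in zip(index, vpd) if i > 0.2]
r2_subset = r_squared([i for i, _ in kept], [v for _, v in kept])
```

In this toy example the subset above the 0.2 threshold is fitted more tightly than the full set, mirroring the pattern reported for the clade regularity index.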

| DISCUSSION
We have shown that simple taxonomic coding at the family level can be used to accurately indicate phylogenetic diversity in plant communities. We propose three simple surrogates of phylogenetic diversity, the clade indices, which only require information about species affiliation to a clade and clade proportions in samples, while phylogenetic distances among species are not necessary (Table 2). The larger CNPD phylogeny with a broader taxonomic sampling created an almost three times larger range of VPD values in the CNPD compared to the grassland dataset. Due to this issue, we particularly encountered problems with heteroscedasticity. In species- and clade-poor habitats, the fit was generally poor (Table S5, Figure S5). For example, water habitats (C1 and C2) or carr and fen scrubs (F9.2) usually host specialized species from very few clades (e.g., Alismataceae or Salicaceae, respectively). Phylogenetic regularity of communities in these habitats will be highly dependent on the presence of other arms from the angiosperm radiation, as more distantly related lineages decrease phylogeny balance (that is, the degree to which branch points define subgroups of equal size; Heard, 1992) more than closely related ones. Vellend et al. (2011) provide relevant discussion of the effect of tree imbalance on phylogenetic diversity assessment. Thus, we suggest using the clade regularity index in relatively species-rich communities where its values are higher than 0.2, and recommend estimating phylogenetic regularity with phylogeny-based measures in communities where the clade regularity index ranges from 0 to 0.2.
For phylogenetic richness and divergence, fit accuracy of the clade indices was consistent across all the studied habitats (Table S5, Figure S6d,e). This suggests lower reliability of our method at very small spatial scales where plots consist of few species (<10). In contrast to species richness, increasing phylogenetic scale increases the possible range of phylogenetic distances because more distantly related species can occur in a community.
As expected, fit accuracy for phylogenetic divergence and regularity was better at smaller phylogenetic scales (superasterids and angiosperms). For phylogenetic divergence, we observed a disparity in fit accuracy between case studies (substantial R² values) and simulated community matrices (moderate R² values). This can probably be attributed to the simulation protocol. Simulated community matrices were completely random in terms of species selection and species proportions, which does not reflect nonrandom assembly processes in nature. Sometimes, fit accuracy was greatly improved by log-transforming MPD values, but this mainly depended on the generated community matrix and we did not find consistent im-

Qian and Jin (2016). Species pool size indicates the number of species in a regional phylogeny (2,000, 500, or 250). Species richness range indicates a range restricting the number of species in artificial communities (2-5, 5-10, 10-20, 10-40, 10-80, and 10-160). In total, 2,700 unique species pools and corresponding community matrices were generated.

FIGURE 3 Major determinants of fit accuracy of the clade indices in simulated communities (species richness range for phylogenetic richness and phylogenetic scale for divergence and regularity; Table 3)

Figure S7). To account for possible bias due to species richness variation, null models or rarefaction is recommended (Miller et al., 2017; Sandel, 2018; Swenson, 2014). Both tools can be used to treat species richness-dependence of clade indices. Finally, phylogenetic resolution influences the performance of the clade-based approach.
As expected, our results indicate that increasing fineness of phylogenetic resolution increases the tightness of the relationship between phylogeny-based measures and clade indices (Table S3). This agrees with case studies and simulated phylogenies that showed a lower impact of the lack of resolution or poorly estimated branch lengths at more recent nodes on phylogenetic diversity (Swenson, 2009). Naturally, our method can be prone to taxonomic errors, as it assumes proper species assignments to defined taxonomic groups.
Our goal was to show the link between clade composition and phylogenetic diversity. Our results suggest that the clade indices proposed here, which are based on taxonomic resolution at the family level, are a good indicator of all phylogenetic diversity dimensions in angiosperm-dominated habitats with 10 or more species per sampling unit (e.g., 1 m² or larger plots in grasslands). Even though this study focused on vascular plants, our results should generalize to any taxonomic group with a well-developed taxonomic classification supported by molecular data. In general, if the taxonomic classification of a group reflects current molecular phylogenies, we should expect close correlations between taxonomy-based metrics (e.g., this study, Warwick & Clarke, 1995) and molecular-based phylogenetic metrics. Our approach has potential in studies working with many taxa, where phylogenetic reconstruction might be very time- and money-consuming.

ACKNOWLEDGMENTS
We are grateful to Martin Kočí, Erika Lošáková, and David Opálka for help with field measurements in species-rich grasslands and Milan

CONFLICT OF INTEREST
None declared.

AUTHOR CONTRIBUTIONS
MB and PM conceived the ideas, designed the study, and analyzed the data. MB conducted phylogenetic analysis and wrote the manuscript with help from RJP and MD. All authors discussed the results, contributed critically to the drafts and gave final approval for publication.

DATA AVAILABILITY STATEMENT
All data supporting the results (accession numbers, alignment matrices, BEAST.xml file, phylogenetic trees, plot data, species lists, and simulation results) are archived in the Mendeley Data depository