Phylogenetic diversity metrics from molecular phylogenies: modelling expected degree of error under realistic rate variation

Phylogenetic diversity or phylo‐diversity measures use information about evolutionary history and relationships to inform conservation priorities. These metrics are usually derived from the branches of molecular phylogenies. But inferring phylogenetic timescale from molecular data relies on many assumptions about the evolutionary process, most of which are based on statistical convenience rather than biological information. Here we ask whether known patterns of variation in rate of molecular evolution can lead to errors in phylo‐diversity measures.


| Introduction
Conservation planning may involve quantifying the contribution to conservation goals made by different areas, populations, species or other biological groups, and comparing the effectiveness of different conservation strategies in achieving those goals (Brooks et al., 2006). Conservation goals may be aimed at maximizing representativeness and persistence of biodiversity across a planning region (Margules & Pressey, 2000). But "biodiversity" is a rich and complex concept that is challenging to express in a way that can be easily measured and compared across ecosystems, areas or time periods (Lean, 2016;Maclaurin & Sterelny, 2008).
Phylogenetic measures of diversity, or phylo-diversity, have been developed as a way of deriving an objective, quantitative metric that reflects different intrinsic values of species or areas for the purposes of conservation prioritization (Crozier, 1997; Tucker et al., 2017). These measures have in common that they are derived from some function of the branch lengths of a phylogeny, as distinct from taxonomic measures that depend on the number of nodes or species in the tree (Tucker et al., 2017). Metrics may use just a single edge from a phylogenetic tree (such as the tip length connecting a species to its nearest relative), a path between tips in the phylogeny, or the sum of edges for a clade or sample of tips (Kondratyeva et al., 2019).
Faith's phylogenetic diversity (FPD; Faith, 1992) is one of the most widely used phylogenetic measures of biodiversity.
It is calculated as the sum of the branch lengths of the minimum spanning tree connecting the members of a community or assemblage. This measure was proposed as a means of representing the diversity of features among a group of species. A feature is any aspect of a species which may in turn reflect its ecological role, distinct characteristics or traits of value to human society (Forest et al., 2015;Maclaurin & Sterelny, 2008). Feature diversity is difficult to measure directly, hard to compare among taxonomic groups and requires decisions about which features should be selected as salient for conservation worth (Kelly et al., 2014). Moreover, there is a desire not only to conserve known feature diversity but to maximize the probability of retaining features that could be recognized as important in the future, referred to as "option value" (Faith, 2007), though this usage differs somewhat from the economic origins of the term (Maier, 2012). Note that the unqualified term "phylogenetic diversity" may refer to either FPD in particular or the more general concept. To avoid confusion, we use the term "phylo-diversity" for the latter throughout.
Branch lengths have been considered an indicator of feature diversity on the assumption that a longer branch represents more opportunity for evolutionary change and the development of unique characteristics (Faith, 2018). This view has attracted some controversy. Empirical evidence for the relationship between phenotypic or ecological diversity and FPD remains conflicted (Devictor et al., 2010; Flynn et al., 2011; Fritz & Purvis, 2010; Tucker et al., 2019), and some empirical studies have shown that species sets chosen to maximize FPD may not reliably conserve more feature diversity than random species sets in practice (Kelly et al., 2014; Mazel et al., 2018). Studies based on theoretical models of trait evolution are also divided on this issue (Letten et al., 2014; Mazel et al., 2017; Tucker et al., 2018). Partly in response to these controversies, authors have proposed alternative interpretations of FPD reflecting other qualities relevant to conservation. For example, some scholars have argued that "evolutionary heritage" is worth protecting independently of any link with feature diversity (Mooers, Heard, & Chrostowski, 2005; Rosauer & Mooers, 2013; Winter et al., 2013). Alternatively, FPD is sometimes interpreted as an indicator of future evolutionary potential (Muenchow et al., 2018).
FPD is typically used to compare the relative value of biological assemblages or communities, by summing over all branch lengths connecting the species. An extension of the notion of evolutionary history as a target for conservation is that phylogenetic branch lengths can be used to quantify the unique phylo-diversity contribution of particular species. Species with few close relatives, and therefore long branch lengths, are frequently identified as conservation priorities on the grounds that they represent a unique suite of traits (Isaac et al., 2007;Safi et al., 2013). For example, during the recent Australian bushfire crisis, extraordinary efforts were expended to save the Wollemi pine, a critically endangered species separated from its nearest living relative by over 150 million years of evolution (Laity et al., 2015). Phylogenetic measures of evolutionary distinctiveness (ED) have also been combined with indicators of global endangerment (GE) to produce the EDGE score, which is used to identify priority species for conservation in mammals (Isaac et al., 2007), amphibians (Isaac et al., 2012), corals (Curnick et al., 2015;Huang et al., 2016), birds (Jetz et al., 2014), sharks, rays and chimaeras (Stein et al., 2018), and reptiles (Gumbs et al., 2018). These efforts have increased public awareness of many lesser-known threatened species through top 100 priority lists maintained by the Zoological Society of London's (ZSL) "EDGE of Existence" programme (ZSL, 2014).
FPD and associated measures such as ED have become increasingly popular in conservation biology and are being promoted as potential tools for conservation decision-making (Brooks et al., 2015;Hendry et al., 2010;Isaac & Pearse, 2018). Despite the aforementioned debates regarding the relationship of FPD with various salient qualities for conservation, the degree to which we can rely upon molecular branch lengths to represent any of these qualities has received less attention. While few studies have addressed this question, some have found that FPD estimation can be affected by factors such as whether branch lengths are resolved in units of time or genetic difference (Elliott et al., 2018) and whether the tree is subsampled from a larger phylogeny (Park et al., 2018). These findings suggest that the nature of phylogenetic inference can impact metrics used in conservation prioritization. Time-resolved branch lengths from molecular data are historical inferences based on a complex set of assumptions, and their interpretation depends on the interplay of time, rates and genetic differences (Bromham, 2019). Even where the rate of molecular evolution is assumed to be the same in all lineages, uncertainty in branch lengths arises from the need to infer past changes that cannot be directly observed and to infer the age of lineages using fossils whose age and position in the phylogeny cannot be known with absolute precision (Yang & Rannala, 2005). This uncertainty leads to a background level of error in estimating evolutionary quantities even where rates are constant (Wertheim & Sanderson, 2011). But the situation is much more complicated when rates of molecular evolution vary among lineages. Given the growing evidence that evolutionary rates are influenced by a wide range of species traits, as well as by environmental factors and macroevolutionary processes, we should expect that rates will vary across the phylogeny (Bromham, 2009;Bromham et al., 2015;Lanfear et al., 2010).
There are a large number of phylogenetic methods that allow rates of molecular evolution to vary over the phylogeny, such as "relaxed clock" models that allow limited random variation in rates among lineages (dos Reis et al., 2016;Ho, 2014;Kumar, 2005;Lepage et al., 2007). These methods make many assumptions that allow for statistical tractability. For example, most methods rely on modelling substitution rates based on simple parametric distributions, either independently for each lineage (Drummond et al., 2006) or determining the rate for each new lineage from the rate of its parent (Ho et al., 2005;Thorne et al., 1998). Changing these assumptions can affect the branch lengths produced by the method (Battistuzzi et al., 2010), and the effect is larger when there is more overall variation (Sarver et al., 2019). Error in inferred branch lengths can lead to erroneous conclusions about macroevolutionary processes, such as the speciation rate, where rates are associated with an evolving continuous trait (Duchêne et al., 2017) or discrete character (Shafir et al., 2020). Therefore, we might reasonably expect that choices made in phylogenetic analyses could influence measurements of phylo-diversity.
Does the inherent uncertainty of phylogenetic branch length estimation produce uncertainty in phylo-diversity measurements? We examine the performance of relaxed clock methods in estimating Faith's PD (FPD) and Evolutionary Distinctiveness (ED) from nucleotide data simulated under biologically realistic models of rate variation. We ensure that our results are relevant to current practice in the field using four strategies. Firstly, we conduct a systematic study of the literature to inform both our simulation design and our analysis, selecting the most common phylogenetic methods and phylo-diversity metrics. Secondly, since correlates of variation in substitution rates have been well studied in birds (Berv & Field, 2018; Eo & DeWoody, 2010; Lanfear et al., 2010; Nabholz et al., 2016), we use published studies of birds to set realistic parameters for rate variation. Thirdly, we simulate biologically realistic patterns of rate variation based on observed associations between substitution rates and both species traits and diversification rates. Fourthly, we want our phylogenies to represent the kind of trees typically analysed in phylo-diversity studies, so we generate sample phylogenies by drawing assemblages of taxa from known locations, then deriving the tree connecting those taxa from a global phylogeny. In this way, our simulations are conditioned on realistically sized data sets, realistic tree shapes and realistic patterns of rate variation.
We investigate two biologically informed models of rate variation. The first is a species-based trait model. There is growing evidence that particular species traits are associated with rate variation, for example with generation time, longevity, body size and fecundity (Bromham, 2011;Hua et al., 2015;Qiu et al., 2014;Thomas et al., 2010;Welch et al., 2008;Wong, 2014). Therefore, rates of molecular evolution may evolve along phylogenies as species traits. The second model is based on the observed association between net diversification rate and rate of molecular evolution, a relationship that has been found for a wide range of taxa, including birds, reptiles and plants (Barraclough & Savolainen, 2001;Bromham et al., 2015;Eo & DeWoody, 2010;Lanfear et al., 2010;Webster et al., 2003).
While the causation of this relationship remains unknown, proposed mechanisms include elevated substitution rates speeding the development of reproductive isolation in populations or founder effects generating elevated rates following speciation (Pagel et al., 2006).
Once we have evolved DNA sequences under both trait-based and speciation-associated rate variation, we then apply the most commonly used phylogenetic methods to reconstruct the phylogenetic history from the sequences and apply a number of phylo-diversity metrics to the reconstructed phylogeny. This allows us to describe the likely scale of errors when estimating FPD and ED for real data and considers the implications for potential applications of these measures to conservation planning and prioritization.

| Survey of common practice
We used a literature survey to gauge common practice in recent studies that use measures of phylo-diversity to provide information relevant to conservation (Table 1; Figure 1). This was not an exhaustive review of all published papers, but a representative sample of 131 recent studies (2015-2018), in order to gather information on metric use, size of phylogeny, amount of data, phylogenetic methods used and type of phylo-diversity measures (see Table S4 available as Supplementary Information for the full list of studies).
Many of the studies use previously published supertrees to derive a phylogeny for the taxa in the study; for example, a mammal supertree (Bininda-Emonds et al., 2007), a global bird phylogeny (Jetz et al., 2012) and various angiosperm phylogenies included in the package Phylocom (Webb et al., 2008). Some studies use the TimeTree of Life database, which is based on summaries of published dates (Hedges and Kumar, 2009). Others use molecular phylogenies freshly inferred from sequence data. The two most common approaches to phylogeny reconstruction across both pre-published trees and newly inferred phylogenies were the Bayesian inference package BEAST (Bouckaert et al., 2014) with branch lengths estimated using an uncorrelated lognormal clock (UCLN), and the maximum likelihood package RAxML (Stamatakis, 2014) with branch lengths estimated using non-parametric rate smoothing (NPRS), for example using r8s (Sanderson, 2003) or treePL (Smith & O'Meara, 2012). Relatedness Index (NRI). Because MPD and NRI are similar to PD in using aggregates of branch lengths, we chose to focus on FPD and ED in our simulation study as two distinct ways of using phylogenetic branch lengths in conservation prioritization.
We also examined the intended use of the phylo-diversity metric (Table 1). The modal use was to quantify biodiversity changes due to environmental impacts, such as agriculture and habitat loss.
However, the next three most frequent uses (identifying hotspots, optimizing reserve designs and assessing adequacy of existing reserves) all related to ranking or comparing phylo-diversity values among either discrete sites or among grid cells on a map, with the ultimate aim of informing prioritization of conservation resources.
Meanwhile, studies using ED were largely focused on identifying priority species for conservation. Our evaluation of phylo-diversity metrics reflects these primary proposed uses of the relevant metrics.

| Design of simulation studies
We investigated the effects of realistic rate variation on the estimation of phylo-diversity metrics by simulating DNA sequences under empirically informed conditions. We conditioned our simulations on studies in which a phylogeny is constructed directly, rather than drawn from a pre-existing tree. Of the studies that used BEAST, the median number of species was 147, so we set our simulation tree size to 150 tips. Of the studies that provided information on alignment lengths, the median value was 4733 nucleotide sites. As a practical limitation, we chose to simulate sequence lengths of 2000 variable sites. This number can be thought of as equivalent to a protein-coding alignment of about 6000 bases where the majority of signal comes from the third codon position.
Randomly generated trees for 150 taxa might not have realistic distributions of branch lengths, particularly for FPD measures that are applied to geographic assemblages which are not all closely related members of a single clade. Therefore, to construct our simulations we first generated 100 "community assemblages" by randomly selecting locations from the world map and sampling 150 species from that area. We used bird species to generate our sample datasets as a data-rich case study. We then extracted a phylogeny for each assemblage by taking the subtree that incorporated all sampled taxa from a published time-scaled supertree. For each of the resulting 100 trees, we simulated average substitution rates for each branch under each of two biologically informed patterns of rate variation (Figure 2). Finally, we simulated DNA sequences along these phylogenies under these rate variation patterns. A full description of our assemblage sampling and sequence simulation procedures is available as Supplementary Information (Extended Methods).
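The subtree extraction step can be illustrated with a minimal Python sketch. It prunes a toy tree to a sampled set of tips and merges branches through nodes that no longer split. The tree, the node names and the `prune_to_tips` helper are hypothetical and are not taken from the study, which worked in R on a published bird supertree; the actual procedure is described in the Supplementary Information.

```python
# Hypothetical toy tree: each node maps to (parent, length of the branch above it).
# The root itself has no entry.
FULL_TREE = {
    "A": ("n2", 1.0),
    "B": ("n2", 1.0),
    "n2": ("n1", 1.0),
    "C": ("n1", 2.0),
    "n1": ("root", 1.0),
    "D": ("root", 3.0),
}

def prune_to_tips(tree, keep):
    """Return the subtree spanning the tips in `keep`, with unbranched nodes merged."""
    keep = set(keep)
    # Collect every node on a path from a sampled tip towards the root.
    retained = set()
    for tip in keep:
        node = tip
        while node in tree and node not in retained:
            retained.add(node)
            node = tree[node][0]
    # Count how many retained children each node has.
    n_children = {}
    for node in retained:
        parent = tree[node][0]
        n_children[parent] = n_children.get(parent, 0) + 1

    def is_vertex(node):
        # A node survives pruning if it is a sampled tip or still splits.
        return node in keep or n_children.get(node, 0) >= 2

    pruned = {}
    for node in retained:
        if not is_vertex(node):
            continue
        parent, length = tree[node]
        # Merge branches through nodes that now have a single retained child.
        while parent in tree and not is_vertex(parent):
            length += tree[parent][1]
            parent = tree[parent][0]
        # Drop the stem above the most recent common ancestor of the sample.
        if is_vertex(parent):
            pruned[node] = (parent, length)
    return pruned

assemblage_tree = prune_to_tips(FULL_TREE, {"A", "C", "D"})
```

In this toy case the branch leading to tip A absorbs the length of the removed internal node `n2`, so path lengths between the sampled tips are unchanged by pruning.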

| Biologically informed patterns of rate variation
We have two ways of modelling realistic patterns of rate variation. One models the influence of species traits on rates of molecular evolution. Another models the association between rate of molecular evolution and net speciation rates (Webster et al., 2003). All simulations were undertaken in the R programming environment (R Core Team, 2019). Details of the simulation methods are available as electronic Supplementary Material.

TABLE 1 Uses of phylo-diversity metrics from a survey of 150 conservation biology studies published between 2015 and 2018: see Table S4 available as Supplementary Information for details

| Phylo-diversity metric use | Number of studies |
| --- | --- |
| Quantifying environmental impacts (e.g. agriculture, forest clearance) | 41 |
| Identifying geographic patterns and hotspots | 39 |
| Optimizing reserve designs/Prioritizing potential conservation sites | 25 |
| Adequacy of existing reserves | 20 |
| Testing against other metrics | 17 |
| Identifying priority species | 9 |
| Capacity to capture salient feature (e.g. ecosystem function, niche space) | 5 |
| Identifying drivers of biodiversity | 5 |
| Quantifying loss due to extinctions | 4 |
| Association with threat status | 2 |
| Predicting environmental impacts (e.g. climate change) | 2 |
| FPD-area relationship as a measure of ecosystem response | 2 |
| Evaluating collection strategies | 2 |
| Comparing different data sources | 2 |
| Other | 6 |
In the first scenario (trait-based variation), we start with our phylogeny of 150 taxa. We evolve a continuous trait value as a Brownian motion along the tree and then use this to assign substitution rates to branches. Substitution rates are generated from trait values via a linear relationship between rates and body mass, which is derived from a published regression analysis of bird body sizes and third codon substitution rates (Nabholz et al., 2016).
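A minimal Python sketch of the trait-based scenario follows, under several stated assumptions. The toy tree, the regression coefficients and the function names are hypothetical. The sketch also assumes that the trait is log10 body mass, that the regression is applied on a log10 scale (which keeps rates positive), and that each branch takes the mean of the trait values at its two ends. None of these details are confirmed by the text above, and the study itself used R.

```python
import random

# Hypothetical toy tree: node -> (parent, branch duration).
# Parents are listed before their children so a single pass can fill in values.
TREE = {
    "n1": ("root", 2.0),
    "C": ("root", 3.0),
    "A": ("n1", 1.0),
    "B": ("n1", 1.0),
}

def simulate_bm_trait(tree, root_value, sigma, rng):
    """Evolve a continuous trait down the tree by Brownian motion."""
    values = {"root": root_value}
    for node, (parent, duration) in tree.items():
        # Under Brownian motion the change along a branch is normal,
        # with variance proportional to the branch duration.
        step_sd = sigma * duration ** 0.5
        values[node] = values[parent] + rng.gauss(0.0, step_sd)
    return values

def branch_rates(tree, values, intercept, slope):
    """Map trait values to branch substitution rates via a linear relationship.

    Assumption: the line is fitted on a log10 scale, and the branch trait is
    the mean of the values at the parent and child nodes.
    """
    rates = {}
    for node, (parent, _) in tree.items():
        mean_trait = 0.5 * (values[parent] + values[node])
        rates[node] = 10 ** (intercept + slope * mean_trait)
    return rates

rng = random.Random(1)  # fixed seed so the sketch is repeatable
traits = simulate_bm_trait(TREE, root_value=2.0, sigma=0.1, rng=rng)
rates = branch_rates(TREE, traits, intercept=-2.0, slope=-0.2)  # hypothetical coefficients
```

Setting `sigma` to zero removes all trait change, so every branch receives the same rate; this is a convenient check that rate variation in the sketch comes only from trait evolution.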
In the second scenario (speciation-based variation), we model evolutionary rate variation that is correlated with variation in the rate of lineage diversification. We first use a combination of methods to infer speciation rates for every lineage in the bird phylogeny from which our sampled assemblages are drawn (see Supplementary Methods). Each branch in the sampled assemblage phylogenies is given the speciation rate of the branch in the full bird phylogeny that shares the same ancestral node. As in the trait-based scenario, we then simulate an average substitution rate for each branch under the assumption that the substitution rate is linearly related to the speciation rate.
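The branch-matching step of the speciation-based scenario can be sketched in Python as follows. Everything here is illustrative: the toy trees, the speciation rates, the intercept and slope, and the helper names are hypothetical. The sketch reads "shares the same ancestral node" as meaning the full-tree branch that leaves the same parent node in the direction of the sampled descendant, which is an interpretation rather than a confirmed detail of the method.

```python
# Hypothetical full tree: child -> parent.
FULL_PARENT = {"A": "n2", "B": "n2", "n2": "n1", "C": "n1", "n1": "root", "D": "root"}

# Hypothetical speciation rate for each full-tree branch, keyed by its child node.
FULL_SPECIATION = {"A": 0.30, "B": 0.30, "n2": 0.20, "C": 0.10, "n1": 0.15, "D": 0.05}

def matching_full_branch(full_parent, ancestor, descendant):
    """Find the full-tree branch that starts at `ancestor` and leads towards `descendant`.

    Raises KeyError if `descendant` does not lie below `ancestor`.
    """
    node = descendant
    while full_parent[node] != ancestor:
        node = full_parent[node]
    return node

def substitution_rates(sample_branches, full_parent, speciation, intercept, slope):
    """Assign each sampled branch a rate that is linear in the matched speciation rate."""
    rates = {}
    for ancestor, descendant in sample_branches:
        full_branch = matching_full_branch(full_parent, ancestor, descendant)
        rates[(ancestor, descendant)] = intercept + slope * speciation[full_branch]
    return rates

# An assemblage containing only tips A and D has two branches, both leaving the root.
sample_branches = [("root", "A"), ("root", "D")]
rates = substitution_rates(sample_branches, FULL_PARENT, FULL_SPECIATION,
                           intercept=0.001, slope=0.01)
```

Here the sampled branch from the root to A spans three branches of the full tree, and it inherits the speciation rate of the first of them (the branch to `n1`).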

| Reconstructing phylogenies from simulated alignments
The two simulation methods (trait-based rates and speciation-based rates) each produce 100 alignments of 150 sequences along a known phylogeny. How accurately can we infer the true history by reconstructing the phylogeny using only the sequences from the tips? We

| Phylo-diversity metrics
FIGURE 1 Summary of findings from a survey of 150 studies published between 2015 and 2018 that apply phylo-diversity measures to conservation biology. (a) Methods used to reconstruct branch lengths for use in phylo-diversity estimation. Bar heights show the number of times a phylogeny using a particular method is used among all of the surveyed studies. We do not distinguish whether the study itself used this method or whether it made use of pre-published branch lengths estimated by this method. UCLN: Uncorrelated lognormal clock. ACLN: Autocorrelated lognormal clock. NPRS: Non-parametric rate smoothing. The "interpolated" category refers to methods that distribute undated nodes uniformly between dated nodes. The "unconstrained" category refers to methods that estimate branch lengths in units of genetic divergence rather than separating rates and dates. (b) Phylo-diversity measures used and number of studies using each measure. The full list of studies is available as Supplementary Information (Table S4). Abbreviations are those used in Table 1. Note that many studies used multiple measures. Both regular and standardized versions of FPD are listed under FPD, while "sesMNTD" is grouped together with NTI and "sesMPD" is grouped with NRI.

For each of 200 alignments (100 under each of two simulation methods), these procedures give us four reconstructed phylogenies, using two different methods (Bayesian/BEAST and maximum likelihood/NPRS) for each of two calibration schemes (Figure 2). We now have 100 phylogenies that represent the history of our samples (along which sequences were simulated), which we will refer to as the "true tree," and 100 phylogenies reconstructed from the simulated sequences for each combination of simulation scenario, reconstruction method and calibration condition, which we refer to as "reconstructed trees."
Now we apply two different phylo-diversity measures to each of these reconstructed phylogenies and compare the values to those calculated on the true tree used to simulate each data set.
We calculate Faith's PD score (FPD; Faith, 1992) for each reconstructed tree and each corresponding true tree. This score is the sum of the branch lengths of the entire tree (the tree length). Evolutionary distinctiveness (ED) values were calculated for each terminal branch in each true and reconstructed phylogeny. The ED of a tree tip is a weighted sum of all branch lengths connecting that tip to the root, where the weight of each branch is given by the reciprocal of the number of tips descending from that branch. FPD and ED calculations used the "picante" package for R (Kembel et al., 2010). We used the "pd" function for FPD and the "evol.distinct" function for ED using the "fair proportions" metric, which distributes the length of each branch evenly among its descendants. We used these values to compare phylo-diversity measures made on reconstructed trees to the values on the corresponding true trees and determine the degree of associated error.
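The two definitions above can be made concrete with a small Python sketch. This is not the picante implementation; the toy tree, node names and function names are hypothetical, and the code simply follows the verbal definitions of FPD (tree length) and fair-proportions ED given in the text.

```python
# Hypothetical toy tree: each node maps to (parent, length of the branch above it).
TREE = {
    "A": ("n1", 1.0),
    "B": ("n1", 1.0),
    "n1": ("root", 2.0),
    "C": ("root", 3.0),
}

def tips(tree):
    """Tips are nodes that are never listed as a parent."""
    parents = {parent for parent, _ in tree.values()}
    return sorted(node for node in tree if node not in parents)

def faith_pd(tree):
    """FPD as the sum of all branch lengths in the tree."""
    return sum(length for _, length in tree.values())

def descendant_tip_counts(tree):
    """Number of tips descending from each branch (keyed by the branch's child node)."""
    counts = {node: 0 for node in tree}
    for tip in tips(tree):
        node = tip
        while node in tree:
            counts[node] += 1
            node = tree[node][0]
    return counts

def fair_proportion_ed(tree):
    """ED of each tip: root-to-tip sum of branch lengths, each divided by its tip count."""
    counts = descendant_tip_counts(tree)
    ed = {}
    for tip in tips(tree):
        total, node = 0.0, tip
        while node in tree:
            parent, length = tree[node]
            total += length / counts[node]
            node = parent
        ed[tip] = total
    return ed

fpd = faith_pd(TREE)
ed = fair_proportion_ed(TREE)
```

In the toy tree the internal branch of length 2 is shared equally by tips A and B, so each receives 1 + 2/2 = 2, while C keeps its full terminal branch of 3. Because every branch is divided among its descendants, the ED values sum to the tree length.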

FIGURE 2 Schematic for the simulation study design. 100 assemblages are sampled from randomly chosen grid cells on a global map with occurrence data. The phylogeny for each assemblage is then extracted from a supertree, generating 100 trees. Patterns of evolutionary rate variation are simulated for each tree under two scenarios: one in which molecular rates are correlated with speciation rates, and one in which molecular rates are correlated with a life-history trait. Each of these two patterns is applied to each tree, and sequences are then simulated along the phylogeny under each rate scenario, producing 200 sequence data sets. Each data set is then used to reconstruct a phylogeny, including tree topology and branch lengths, using either Bayesian inference with an uncorrelated lognormal relaxed clock (UCLN) or maximum likelihood reconstruction using non-parametric rate smoothing (NPRS). Faith's phylogenetic diversity (FPD) and taxon-specific evolutionary distinctiveness (ED) values are calculated for each of the resulting 800 reconstructed trees. Finally, these "reconstructed" FPD and ED values are compared with FPD and ED values calculated from the original 100 "true" trees.

| Estimation accuracy
FPD estimates were correlated with the true values with Pearson coefficients of 0.5 or higher across all simulation scenarios, calibration availabilities and reconstruction methods, but the mean error in estimated FPD was in all cases between 6% and 14% (

| Rank order stability
Ranking errors are also observable in a comparison of rankings generated from true and reconstructed FPD values among our 100 true trees (Figure 4; Figure S7 available as Supplementary Information).
Because FPD is often used to rank areas or assemblages by their relative diversity, here we are ranking all 100 sampled assemblages by their FPD values, hypothetically in order to identify the top-ten assemblages that have the highest phylo-diversity values. Across all of our simulation scenarios and reconstruction methods, the ranking of the assemblages based on reconstructed FPD differed from that using the true FPD of the assemblage, leading to some assemblages

| Analysis of the largest errors
We asked whether the cases with the largest FPD and ED errors were due to failures of phylogenetic inference, in which errors in topology or divergence times might be large enough to be noticed by inspection of the tree. To test this, we asked whether the trees with the greatest error in FPD and ED were those for which the reconstruction was very different from the true tree, as indicated by larger tree distances between the true and reconstructed trees. We therefore plotted topological and branch score errors for each reconstructed phylogeny (Figure 7; see Supplementary Methods). While these largest FPD errors did cluster at the upper end of the distribution of branch length errors, for topology errors they are observed throughout the plot. The greatest ED ranking errors were not clustered within the upper tail of either the distribution of topology errors or the distribution of branch length errors.
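As an illustration of how topological disagreement between a true and a reconstructed tree can be scored, the Python sketch below computes a clade-based (rooted) variant of the Robinson-Foulds distance on toy trees written as nested tuples. The trees and function names are hypothetical, and the rooted variant is an assumption made for simplicity; the exact distance calculations used in the study are described in its Supplementary Methods.

```python
def clades(tree):
    """Return every clade of a nested-tuple tree as a frozenset of tip names."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            # A tip contributes its own name but is not recorded as a clade.
            return frozenset([node])
        members = frozenset().union(*(visit(child) for child in node))
        found.add(members)
        return members

    visit(tree)
    return found

def rooted_rf(tree_a, tree_b):
    """Count clades found in one tree but not the other (symmetric difference)."""
    return len(clades(tree_a) ^ clades(tree_b))

# Hypothetical example: the reconstruction swaps the positions of tips B and C.
true_tree = ((("A", "B"), "C"), "D")
reconstructed_tree = ((("A", "C"), "B"), "D")
distance = rooted_rf(true_tree, reconstructed_tree)
```

The clade containing all tips appears in both trees and cancels out, so only the conflicting groupings (A with B versus A with C) contribute to the distance.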

| Discussion
Phylogenetic diversity (FPD) or Evolutionary Distinctiveness (ED) measures aim to produce universal metrics that reflect important properties for conservation prioritization. These measures do not directly target species characteristics or assemblage properties that are considered desirable for conservation, but instead aim to produce objective measures that will scale with important characteristics of biodiversity. These metrics are commonly taken to represent evolutionary time and/or opportunity for acquisition of unique traits. There has been a vigorous debate on whether phylo-diversity measures are appropriate tools for conservation prioritization, and the accuracy with which various metrics represent phylogenetic properties has been tested. But most such tests assume that the phylogeny is known without error. Given that phylo-diversity measures are usually based on molecular phylogenetic branch lengths, it is important to consider the accuracy of phylogenetic reconstruction. Here we focus on only one potential source of uncertainty, arising from lineage-specific patterns of molecular rate variation, but other issues such as calibration error or gene tree incongruence could also lead to inaccuracy.
Ideally, researchers would consider the variance arising from all aspects of molecular phylogenetic inference (e.g. Bromham et al., 2018).
In order to examine how well inferred FPD represents true evolutionary history, we have simulated data sets that mimic realistic studies, based on a survey of the literature and known patterns of rate variation. We then ask how close FPD measures on phylogenies inferred for these data are to the true underlying history. We have found that the two phylogenetic reconstruction methods most commonly used in studies of phylo-diversity are expected to lead to average levels of error in FPD estimates of 6 to 14% under realistic models of molecular rate variation, and up to 24-38% error, depending on the reconstruction method and number of calibrations (Table 2). These levels of error can impact ranking of assemblages

FIGURE 5 Histograms showing the number of taxa incorrectly excluded from the top 100 ED rankings when ED is calculated from reconstructed branch lengths. The number of excluded taxa is calculated from 100 replicate simulations for each of 4 combinations of simulation scenarios and calibration regimes. All branch lengths were estimated by Bayesian reconstruction in BEAST 2 (for NPRS results see Figure S3 available as Supplementary Information).

How much impact would phylogenetic error of this magnitude have on conservation prioritization? If prioritization of limited conservation resources, or selection of relatively few sites for protection, was based on the selection of top-ranked areas based on FPD alone, then we would in some cases miss making the optimal decision due to error in the reconstruction of branch lengths. While some recommendations for priority conservation areas have been made using simple scoring and ranking schemes, typically FPD is considered in combination with other facets of biodiversity such as taxonomic and functional diversity (e.g. Mazel et al., 2014; Soutullo et al., 2005).
Recommendations for incorporating phylo-diversity metrics into conservation plans commonly make use of complementarity-based reserve selection algorithms that aim to achieve a predetermined biodiversity target with maximum efficiency (e.g. Brum et al., 2017; Morales-Barbero & Ferrer-Castán, 2019; Pollock et al., 2015; Rosauer et al., 2017). Whether measures of phylo-diversity are used as the sole input to conservation ranking or combined with other information to set priorities, it is appropriate to ask whether the expected degree of error in these measures could affect prioritization processes. Our literature survey suggests that the aim of many studies is to provide a relative ranking of conservation priorities, either of individual species or of species assemblages, by their inferred phylogenies (Table 1). However, the degree to which such fine distinctions would affect real-world conservation planning is unclear.
For example, FPD and ED may form part of frameworks that incorporate estimates of extinction risk and prioritize conservation action based on the expected loss of phylo-diversity. For FPD, this is formalized in the "expected" or "probabilistic" diversity framework, which prioritizes combinations of taxa by the average amount of phylo-diversity that could be lost if they are not protected. For ED, the "EDGE of Existence" project combines phylogenetically derived ED scores with the conservation status of the taxon, as represented by its classification under the IUCN scale of extinction risk. In essence, a taxon with few close relatives (high ED) that is also globally endangered (GE) will get the highest ranking. Because EDGE ranking combines phylogenetic information with extinction risk evaluation, error in phylogenetic branch length is unlikely to change the ranking of most cases. But we have shown that uncertainty in ED can change the relative weighting of species based on phylogenetic information, which leaves open the possibility for error in cases where threat status is less of a factor.
One way of seeing this issue is that FPD or ED can serve as a way to order the relative conservation value of species within a given cat-   (Jetz et al., 2014). These rankings could change if EDGE scores were based on phylogenies inferred under different models or assumptions.
Levels of phylogenetic error reported in this study are unlikely to impact high-profile cases concerning the most evolutionarily distinct species in a given EDGE list, such as the Wollemi pine. However, they may impact which critically endangered species with lower ED scores are included in top 100 lists, with implications for recognition and prioritization. We have shown that the ED scores in the middle of the distribution are more vulnerable to ranking error (Figure 4).
In fact, many EDGE species with higher global endangerment values have ED values in the middle of the range for their taxon group (Figure 8). This could mean that rankings among these more highly endangered species are especially vulnerable to ED estimation error caused by branch length reconstruction. The assemblage phylogenies used in this study are much smaller and less taxonomically complete than the large trees used in many EDGE studies (e.g. Bininda-Emonds et al., 2007). But here we demonstrate that realistic levels of error in ED scores can lead to changes in rankings and are therefore worth investigating when using ED scores to set conservation priorities.
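The kind of ranking error discussed here can be counted with a few lines of Python. The sketch below is illustrative only: the species names, scores and function names are hypothetical, and ties are broken alphabetically as an arbitrary assumption. It reports which taxa belong in a top-N list according to the true scores but are left out when the list is built from estimated scores.

```python
def top_n(scores, n):
    """Names of the n highest-scoring taxa (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return set(ranked[:n])

def incorrectly_excluded(true_scores, estimated_scores, n):
    """Taxa in the true top-n list that are missing from the estimated top-n list."""
    return top_n(true_scores, n) - top_n(estimated_scores, n)

# Hypothetical ED values for four species on a true and a reconstructed tree.
true_ed = {"sp1": 10.0, "sp2": 8.0, "sp3": 6.0, "sp4": 4.0}
estimated_ed = {"sp1": 9.5, "sp2": 5.5, "sp3": 6.2, "sp4": 4.1}

missed = incorrectly_excluded(true_ed, estimated_ed, n=2)
```

In this example the most distinctive species keeps its place, while an underestimate for `sp2` lets `sp3` displace it from the top two, which mirrors the pattern of mid-ranked scores being more vulnerable to ranking error.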
A theme of conservation planning is the necessity of properly characterizing the risk arising from potential worst-case outcomes, given the irreplaceability of biological resources (Brooks et al., 2006).
Accounting for this risk requires accounting not only for the expected average level of error, but also for the worst-case scenarios (Faith, 2008; Faith, 2015). Since the goal of this study is to examine the potential for error in FPD and ED outcomes arising from common molecular phylogenetic inference on realistic data sets, it is instructive to examine the characteristics of the largest errors we encountered. When estimating FPD, the least accurate cases based on Bayesian phylogenies involve errors of up to 32.8% of the true FPD value (Table 2). In the data sets with the greatest ED ranking errors, one fifth of the membership of a top 100 list could be incorrectly included, if most or all of the species were at the highest level of endangerment (Figure 5). Whether errors of this magnitude will have significant real-world impact on prioritization will depend on how they are estimated and applied in any given case study. Where the outcomes of such studies are considered to have important implications for conservation, the potential impact of phylogenetic error should be considered. Notably, FPD and ED scores are commonly applied to very large trees (Arregoitia et al., 2013; Jetz & Pyron, 2018) or to smaller phylogenies extracted from supertrees (approximately 60% of the sampled studies used supertrees: Table S4, available as supplementary information). Errors in branch lengths may thus be propagated through many studies that all sample the same supertree.
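The two error measures discussed here, the absolute percentage error in FPD and the number of taxa misplaced in a top-N list, can be sketched as follows. This is a minimal illustration and not the analysis code used in the study: the function names and the example scores are hypothetical, and ties are broken by input order.

```python
def fpd_percent_error(true_fpd, est_fpd):
    """Absolute error in estimated FPD as a percentage of the true value."""
    return abs(est_fpd - true_fpd) / true_fpd * 100.0


def top_n(scores, n):
    """Set of the n taxa with the highest scores (ties broken by input order)."""
    return set(sorted(scores, key=scores.get, reverse=True)[:n])


def top_n_misplaced(true_scores, est_scores, n):
    """Number of taxa in the true top-n list that are missing from the
    estimated top-n list. Because both lists have n members, this equals
    the number of taxa incorrectly included in the estimated list."""
    return len(top_n(true_scores, n) - top_n(est_scores, n))


# Hypothetical scores for four taxa under the true and an estimated tree.
true_scores = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 2.0}
est_scores = {"a": 5.0, "b": 2.5, "c": 3.0, "d": 3.2}

pct_error = fpd_percent_error(100.0, 67.2)
misplaced = top_n_misplaced(true_scores, est_scores, n=2)
print(round(pct_error, 1), misplaced)
```

Reporting the upper tail of these quantities across simulated data sets, rather than only their mean, is what distinguishes a worst-case assessment from an average-case one.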
It is important to note that errors in phylo-diversity measures may be significant even where there are no obvious errors in the inferred phylogeny. Our results show that the largest errors in FPD and ED are not necessarily associated with major topology errors or large-scale errors in inferring branch lengths. This suggests that it will not always be possible to detect errors from obvious inconsistencies in […]

FIGURE 7
Distribution of the 95th percentile of most severe FPD and ED estimation errors across different levels of phylogenetic error. Phylogenetic error is calculated either by the Robinson-Foulds distance, emphasizing topological differences, or the branch score distance, emphasizing branch lengths. The 95th percentile of most severe FPD and ED errors are marked by diamonds; severity is given as the absolute percentage error for FPD and as the number of taxa incorrectly excluded from the top 100 for ED. Data are shown for the speciation scenario under BEAST 2 with 3 calibrations.

This discussion also touches on a broader issue in the testing and validation of evolutionary models. An average rate of error that is considered acceptable for some uses, such as phylogeny estimation or molecular dating, might be considered unacceptably high for practical applications, such as conservation prioritization. Here, the number of measurements is generally small; for example, community metrics may be calculated for a small number of potential reserves in a particular locality, or species may be ranked based on their individual EDGE scores from a single estimate of a phylogeny. Therefore, a level of error such as that found in these simulations could have a significant impact on conclusions of a prioritization study. We need to know not only which qualities of a phylogeny most align with the things we wish to conserve, but also how reliably our measurements are able to determine those qualities.
Where the outcomes of phylogenetic studies are expected to have real-world impacts on conservation, it is vital that clear statements can be made about the accuracy and precision of such measures. Importantly, we must model the error arising from all stages of the estimation process, including the uncertainty in phylogenetic reconstruction, because even "best practice" phylogenetic methods are not perfect.

| Conclusions and recommendations for future research
We have characterized the likely levels of error in estimating phylo-diversity metrics when rates of molecular evolution vary between species, even when a rate-variable method is used to infer the phylogeny. This level of error may be acceptable for many practical applications of biodiversity metrics, but it demonstrates that phylogeny should be considered as a potential source of error when designing prioritization methods. The levels of phylogenetic error reported under biologically realistic levels of rate variation, using common phylogenetic methods, could be sufficient to have an impact on prioritization rankings based on phylogenetic measures. Future studies could consider the potential impact of other types of clock model misspecification on phylo-diversity estimates, as well as studying the impacts of this and other error sources in the practical application of phylogenetic diversity to conservation.

ACKNOWLEDGEMENTS
The authors wish to thank Professor Kim Sterelny and Dr. Chris Lean for input in the early stages of the project.

FIGURE 8
Illustration of the log ED values for EDGE top species lists. These represent the top 100 EDGE rankings for mammals, reptiles and birds. The grey curve indicates the distribution of log ED values for all taxa in the respective phylogeny. Vertical lines mark the position in the distribution of top 100 species. Line style indicates endangerment status: solid, critically endangered (CR); long dashed, endangered (EN); and dotted, vulnerable (VU). Each EDGE list includes a large component of endangered and critically endangered species with ED scores that fall within the middle of the distribution. This is an example of a situation in which critically endangered taxa could be excluded from the rankings if scores used in the EDGE lists were to suffer from ranking errors like those observed in our study. Data were sourced from https://www.edgeofexistence.org/edge-lists (accessed 13 Dec 2019).

PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1111/ddi.13179.

DATA AVAILABILITY STATEMENT
Data and code associated with this analysis are available via Mendeley Data at https://data.mendeley.com/datasets/52ht8hr93p/draft?a=a1face2f-9d75-4075-b8a6-e9f503cf145d.