Measuring inferential importance of taxa using taxon influence indices

Abstract Assessing the importance of different taxa for inferring evolutionary history is a critical, but underutilized, aspect of systematics. Quantifying the importance of all taxa within a dataset provides an empirical measurement that can establish a ranking of extant taxa for ecological study and/or quantify the relative importance of newly announced or redescribed specimens to enable the disentangling of novelty and inferential influence. Here, we illustrate the use of taxon influence indices through analysis of both molecular and morphological datasets, introducing a modified Bayesian approach to the taxon influence index that accounts for model and topological uncertainty. Quantification of taxon influence using the Bayesian approach produced clear rankings for both dataset types. Bayesian taxon rankings differed from maximum likelihood (ML)‐derived rankings from a mitogenomic dataset, and the highest ranking taxa exhibited the largest interquartile range in influence estimate, suggesting variance in the estimate must be taken into account when the ranking of taxa is the feature of interest. Application of the Bayesian taxon influence index to a recent morphological analysis of the Tully Monster (Tullimonstrum) reveals that it exhibits consistently low inferential importance across two recent treatments of the taxon with alternative character codings. These results lend support to the idea that taxon influence indices may be robust to character coding and therefore effective for morphological analyses. These results underscore a need for the development of approaches to, and application of, taxon influence analyses both for the purpose of establishing robust rankings for future inquiry and for explicitly quantifying the importance of individual taxa. 
Quantifying the importance of individual taxa refocuses debates in morphological studies from questions of character choice/significance and taxon sampling to explicitly analytical techniques, and guides discussion of the context of new discoveries.


| INTRODUCTION
A fundamental question in systematics centers on the importance of different taxa for inferring phylogenetic relationships. However, quantifying taxon importance has hinged on varying definitions of the term across many biological disciplines. In conservation biology and ecology, clades have traditionally been assigned values for "phylogenetic diversity [PD]" (Faith, 1992a,b) and taxa have been assigned estimates of "originality/evolutionary distinctiveness [ED]" (Pavoine, Ollier, & Dufour, 2005; Redding et al., 2008; and sources therein), both defined using combinations of character change reconstruction or branch lengths, and node counting across clades or between taxa of interest. Computational biology has built upon these definitions and has cast importance in combinatorial terms, employing PD and ED as measures in a constrained optimization problem (the "Noah's Ark Problem (NAP)," a subset of the knapsack problem) to solve for the amount of unique evolutionary history that can be preserved in a subset of taxa given assumptions on the amount of funding available and the relationship between funding allocated and probability of survival (Billionnet, 2013; Hartmann & Steel, 2006; Nee & May, 1997; Weitzman, 1998).
An alternative approach, suggested by Mariadassou, Bar-Hen, and Kishino (2012), is instead a total-taxa approach that assigns a value called taxon influence to all taxa within a dataset based on a leave-one-out taxon jackknifing and reinference procedure. This approach provides a relative measure to generate ranked lists of a full set of taxa, rather than acting as a cutoff method, like rogue taxon analysis, or on subtrees, like leaf or taxon stability indices. Because a taxon influence value is derived from independent reanalysis of the nearly complete original data compared to the full original data, it is a phylogenetic inference-based reframing of a distinctiveness measure that is derived from a full analysis rather than partitioning of a single analysis. Additionally, the generality of taxon influence methods makes them applicable to many underassessed species, for which character data, either DNA or morphology, may be the only thing known (Mace, Gittleman, & Purvis, 2003). Furthermore, unlike ED/PD measures, taxon influence analyses do not require time-calibrated phylogenies, which frequently necessitate a degree of knowledge of the fossil and/or biogeographic record unavailable for many groups of interest. Given this broad applicability and minimal assumptions, taxon influence approaches stand to potentially bridge the gap between definitions of importance in conservation and systematics by generating minimal-assumption taxon rankings based on whole tree inference, which may subsequently guide the acquisition of data for clades of interest that lack the kind of information necessary for NAP approaches. Furthermore, such rank lists may be useful to track changes in character data as more analyses at phylogenomic (Bragg, Potter, Bi, & Moritz, 2016; Faircloth et al., 2012) and phenomic (e.g., Copes, Lucas, Thostenson, Hoekstra, & Boyer, 2016; Goswami, 2015; O'Leary & Kaufman, 2011) scales increase in size.
Similarly, because taxon influence values are estimated for all taxa in a dataset, the relative position of a taxon of interest in the ranking of taxa may be useful for explicitly quantifying hypotheses of taxon importance implicit in many announcements of newly discovered or redescribed taxa. For example, in publications of new taxa based on phenomic data generated by tomographic methods, it remains a standard procedure to place these specimens using a parsimony analysis and to present character optimizations and contextualization of the new taxon based on its inferred position relative to other known groups on either an optimal or consensus topology (e.g., Giles, Friedman, & Brazeau, 2015; McCoy et al., 2016; Van Roy, Daley, & Briggs, 2015; Zhu et al., 2013). Such announcements are effectively verbal hypotheses of taxon importance. Despite this fact, existing inferential methods are insufficient for testing these hypotheses, because taxon importance is a relative measure that must account for both the importance of the other taxa and the effects of the characters used to infer the phylogeny.
However, two problems exist with current taxon influence implementations. First, existing implementations are based on maximum likelihood, which infers a single optimized tree topology. Influence values for a taxon derived from trees estimated using ML are therefore based on a comparison of only two topologies that are assumed to be fixed estimates. These estimates thus neglect uncertainty, a value as important as the tree itself (Huelsenbeck & Rannala, 2004), and this omission stands to significantly affect the inferred influence values and rankings generated by the taxon influence procedure.
Second, existing taxon influence procedures discussed in Mariadassou et al. (2012) utilize either the Robinson-Foulds metric (RF; Robinson and Foulds, 1981) or branch score difference (BSD; Kuhner and Felsenstein, 1994) to quantify differences between trees. Both values are derived from the computational literature and are agnostic to the issue of influential taxa. For example, the RF metric can produce maximal values for trivial rearrangements of a single taxon pair (Böcker, Canzar, & Klau, 2013; Lin, Rajan, & Moret, 2012), making it likely susceptible to the effects of rogue taxon behavior. The BSD, although accounting for both branch length and topological differences, is based on the RF metric and likely inherits this problem. Additionally, the interaction of differences in topology and branch lengths in the BSD may counteract one another in cases where short branch lengths and topological differences occur simultaneously (Kuhner & Felsenstein, 1994). A tree distance specific to questions of taxon influence remains an outstanding problem.
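The sensitivity of the RF metric to a small rearrangement can be illustrated with a toy case. The sketch below is not taken from the cited works: it assumes an unrooted caterpillar (fully pectinate) tree described only by its leaf order, and the function names `caterpillar_splits` and `rf_distance` are hypothetical. Moving one leaf from one end of the caterpillar to the other is a single prune-and-regraft move, yet the two trees share no nontrivial bipartition.

```python
def caterpillar_splits(leaf_order):
    """Nontrivial bipartitions of an unrooted caterpillar tree.

    The tree is assumed to have cherries at both ends of leaf_order.
    Each split is stored as the side containing a fixed reference taxon,
    so that equal splits compare equal.
    """
    taxa = frozenset(leaf_order)
    reference = min(taxa)
    splits = set()
    for k in range(2, len(leaf_order) - 1):
        side = frozenset(leaf_order[:k])
        splits.add(side if reference in side else taxa - side)
    return splits


def rf_distance(splits_a, splits_b):
    """Robinson-Foulds distance: number of splits found in only one tree."""
    return len(splits_a ^ splits_b)


# Toy taxa labels, invented for illustration only.
original = list("ABCDEF")
moved = list("BCDEFA")  # taxon A pruned and regrafted at the other end

rf = rf_distance(caterpillar_splits(original), caterpillar_splits(moved))
max_rf = 2 * (len(original) - 3)  # largest RF value for binary unrooted trees
print(rf, max_rf)
```

In this toy case the RF distance equals its maximum even though only one taxon changed position, which is the behavior that motivates a distance less sensitive to single-taxon moves.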
To address these issues and to demonstrate the utility of taxon influence analysis for both robust ranking and taxon rank placement, we apply a modified version of the original taxon influence index (TII) approach of Mariadassou et al. (2012) to three published datasets: a complete mitogenomic dataset of reptiles (Jonniaux & Kumazawa, 2008), here referred to as JK2008, and two recently published datasets debating the placement of the unusual fossil taxon Tullimonstrum in a phylogenetic context (McCoy et al., 2016; Sallan et al., 2017). We account for tree uncertainty using a Bayesian approach to TII calculation discussed, but not implemented, by Mariadassou et al. (2012), and also present a novel tree distance to circumvent problems with the RF metric and BSD invoked in the original publication.

| Phylogenetic analyses
Bayesian phylogenetic analyses were conducted in MrBayes v.3.2.6 (Ronquist et al., 2012). The JK2008 dataset was analyzed using the same model parameterization (GTR + I + Γ) as in Mariadassou et al. (2012). Analysis was run using a single chain of 10 million generations, with a 20% burn-in. The Tullimonstrum datasets were analyzed using the Mkv + Γ model (Lewis, 2001) with six discrete classes, using a single chain of 20 million generations, with a 50% burn-in. In both cases, the number of generations required to reach a sufficient topological ESS was determined by calculation of approximate ESS values in the R package rwty (Warren, Geneva, & Lanfear, 2017). Because the TII approach is a single taxon-pruning procedure, all jackknifed analyses were assigned the parent number of generations.
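The single taxon-pruning design can be sketched in a few lines. The following is a minimal illustration, not the authors' scripts: it assumes the character matrix is held as a dictionary mapping taxon names to character strings, and the helper names `leave_one_out` and `discard_burn_in`, as well as the toy matrix, are hypothetical. Each reduced matrix would then be analyzed with the same settings as the full dataset.

```python
def leave_one_out(matrix):
    """Yield (dropped_taxon, reduced_matrix) for every taxon in the matrix."""
    for dropped in matrix:
        reduced = {taxon: chars for taxon, chars in matrix.items()
                   if taxon != dropped}
        yield dropped, reduced


def discard_burn_in(samples, fraction):
    """Drop the first `fraction` of an ordered list of MCMC samples."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    return samples[int(len(samples) * fraction):]


# Toy matrix with invented taxa and characters, for illustration only.
toy_matrix = {"taxon_a": "0101", "taxon_b": "0111", "taxon_c": "1100"}

jackknifed = list(leave_one_out(toy_matrix))
kept = discard_burn_in(list(range(10)), 0.2)  # a 20% burn-in on ten samples
print(len(jackknifed), kept)
```

A dataset of n taxa therefore requires n + 1 analyses in total: one for the full matrix and one for each jackknifed matrix.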

| Taxon influence measurement
The taxon influence index (TII), the expected distance between pairs of trees in the posterior distribution, was calculated according to Mariadassou et al. (2012): where T* is the posterior distribution of trees from analysis using all taxa, T′ is a posterior distribution of trees in which a focal taxon is dropped before analysis, T′_i is a phylogenetic tree from a posterior T′ for which taxon i was dropped before analysis, T*_i is a tree from the posterior T* in which taxon i was dropped a posteriori for comparison with T′_i, w_i is the posterior probability of a tree
Second, given the potential issues with both the RF metric and BSD regarding influential taxa, informative distances between trees were defined as the ratio of the distance between the trees to the size of the shared tree. This criterion was satisfied by a value referred to here as the SPR excess: an SPR distance (the minimum number of subtree-pruning and regrafting rearrangements required to turn one tree into another; e.g., Goloboff, 2008) scaled by the number of taxa in the maximum agreement subtree (MAST; Gordon, 1979; Finden & Gordon, 1985; Valiente, 2009; and see Ge, Wang, and Kim, 2005, for an example of the implications of deviation in tree shapes between a difference and similarity measure in the context of molecular data).
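One way to read "expected distance between pairs of trees" is as a probability-weighted finite sum over tree pairs. The sketch below is an interpretation under stated assumptions rather than the authors' implementation: it assumes the two posterior samples are independent, that the tree probabilities in each sample sum to one, and that the pairwise distances have already been computed by other software. The function names `expected_distance` and `spr_excess` and all numeric values are hypothetical.

```python
def spr_excess(spr_distance, mast_size):
    """SPR distance scaled by the number of taxa in the agreement subtree."""
    if mast_size <= 0:
        raise ValueError("mast_size must be positive")
    return spr_distance / mast_size


def expected_distance(weights_full, weights_pruned, distances):
    """Weighted sum of distances over all pairs of trees.

    weights_full[j]   : probability of tree j from the full analysis
                        (focal taxon pruned afterwards)
    weights_pruned[k] : probability of tree k from the jackknifed analysis
    distances[j][k]   : precomputed distance between those two trees
    Assumes independent posteriors, so pair weights are products.
    """
    total = 0.0
    for w_full, row in zip(weights_full, distances):
        for w_pruned, d in zip(weights_pruned, row):
            total += w_full * w_pruned * d
    return total


# Invented numbers for illustration only.
weights_full = [0.5, 0.5]
weights_pruned = [0.25, 0.75]
distances = [[0.0, 0.2],
             [0.1, 0.4]]

tii = expected_distance(weights_full, weights_pruned, distances)
print(tii, spr_excess(2, 8))
```

Keeping the distance matrix as an input separates the averaging step from the choice of tree distance, so the same sum could be evaluated with any distance computed between the pruned trees.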
Finally, for comparison to TII estimates, a rogue taxon analysis (Aberer, Pattengale, & Stamatakis, 2010; Aberer et al., 2013) using the Mkv + Γ model was conducted in RAxML v8.2.9 (Stamatakis, 2014). To standardize the comparison to a fixed set of trees, the post-burn-in distribution of trees from the Bayesian analysis, rather than a collection of bootstrap trees, was used. All TII calculations were conducted using scripts written by the authors (Supplementary Information) in the R environment (R Core Team 2016) using the ape (Paradis, Claude, & Strimmer, 2004), phangorn (Schliep, 2011), stringr (Wickham, 2015), and gespeR (Schmich et al., 2015) packages. Differences in taxon influence-based rankings between the two Tullimonstrum datasets, and differences in rank by proportion of missing data, were calculated for this dataset using rank-biased overlap (Webber, Moffat, & Zobel, 2010), for which significance was assessed using a permutation procedure against the null hypothesis of dissimilar rankings.
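The ranking comparison can be sketched as follows. This is not the gespeR routine used for the analyses: it is a minimal illustration that assumes two rankings of equal length over the same items with no ties, uses the extrapolated form of rank-biased overlap for that case, and pairs it with a simple one-sided permutation test. The function names, the persistence value p = 0.9, the number of permutations, and the toy rankings are all assumptions made for the example.

```python
import random


def rbo_ext(ranking_a, ranking_b, p=0.9):
    """Extrapolated rank-biased overlap for two equal-length, untied rankings."""
    k = len(ranking_a)
    if k == 0 or k != len(ranking_b):
        raise ValueError("rankings must be non-empty and of equal length")
    seen_a, seen_b = set(), set()
    weighted_sum = 0.0
    overlap = 0
    for depth in range(1, k + 1):
        seen_a.add(ranking_a[depth - 1])
        seen_b.add(ranking_b[depth - 1])
        overlap = len(seen_a & seen_b)      # shared items in the top `depth`
        weighted_sum += (overlap / depth) * p ** depth
    return (overlap / k) * p ** k + ((1 - p) / p) * weighted_sum


def permutation_p_value(ranking_a, ranking_b, p=0.9, n_perm=999, seed=0):
    """Share of shuffled rankings at least as similar as the observed pair."""
    rng = random.Random(seed)
    observed = rbo_ext(ranking_a, ranking_b, p)
    shuffled = list(ranking_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if rbo_ext(ranking_a, shuffled, p) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy rankings with invented labels, for illustration only.
rank_one = list("ABCDE")
rank_two = list("ABCDE")
observed, p_value = permutation_p_value(rank_one, rank_two)
print(observed, p_value)
```

Because the weights decay geometrically with depth, agreement near the top of the two lists contributes more than agreement near the bottom, which suits comparisons where the highest-ranked taxa are the main interest.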

| Taxon influence values
TII analysis of the JK2008 dataset produced mostly well-separated median values, with a small number of downwardly directed outliers, an apparent negative relationship between taxon influence and the interquartile range of the TII estimate, and no apparent relationship between TII estimate and the skewness of the distribution (Figure 3). TII analysis of the Tullimonstrum datasets produced well-separated values (Figure 4a,b, lower), with a small number of extreme and directionally biased outliers that comprised no more than 10% of each taxon's TII estimates (Figure 4a,

| DISCUSSION
Inference of well-separated TII values for two contrasting data types-molecular data and morphological data-and for differing degrees of phylogenetic signal suggests the Bayesian-based approach presented here is robust and applicable for ranking taxa with different data properties. Additionally, the stability in rank location of a focal taxon (Tullimonstrum) using our approach suggests the method may be beneficial for contextualizing hypotheses of the importance of individual taxa using analytical rank results.

| Molecular dataset
The difference in taxon ranks between the present analysis and the original ML analysis underscores the important distinction between the two methods. Although the two approaches exhibited some overlap in highly ranked taxa (

| Morphological dataset
The ranking of Tullimonstrum using methods like taxon influence is significant because it reframes the debate in the recent literature on the taxon (Clements et al., 2016; McCoy et al., 2016; Sallan et al., 2017) from a conceptual one of character choice/significance and taxon sampling to an explicitly analytical one of the inferential importance of the taxon relative to other taxa. Specifically, based on the present results (Figures 4 and 5

| Methodological implications and future directions
The bounds around the resampling results (Figures 3 and 4) suggest that the finite sum approximation utilized in this study generates reproducible rankings of taxon influence and may thus be an effective approximation for calculating taxon influence based on the SPR excess distance measure, from posterior distributions of trees for which the probabilities were calculated using the standard approach.
The causes of the directional outliers in the studied datasets (Figures 3 and 4) are currently unclear, but they may be an artifact of either the number of finite sums or an interaction between the probabilities of trees and the SPR distances between them.
Although we have focused on several standard parametric models for nucleotide substitution and morphological character transformation, other posterior distributions of trees are possible. It may be useful to, for example, explore the distribution of parsimony-score-ranked trees under the Bayesian approach using the TS97/no common mechanism model (Tuffley & Steel, 1997)
posteriors using information-theoretic measures (Larget, 2013; Lewis et al., 2016) and nonuniform tree priors, which may reveal a more universal metric for taxon influence assessment. Finally, although our method assesses the influence of individual taxa using a leave-one-out jackknifing approach as an intuitive method for generating ranked lists of taxa based on what is essentially the "main effect" of each taxon, the contributions of higher-order "interaction" effects, such as pairwise- or clade-based influence, have yet to be addressed by the taxon influence approach. Approaches for estimating clade stability have been discussed by several authors, including Pol and Escapa (2009), for reduced positional congruence, and Gatesy (2000), for linked branch support. In these cases, analyses were conducted on complete-taxon datasets and sets of most parsimonious trees, rather than via a taxon jackknifing approach. The theory for, and effect of, pairwise or higher-order interactions on taxon influence values is currently unclear. Future work expanding the taxon influence method through a leave-k-out approach may be beneficial, although direct interpretation of the results of complex multi-taxon interaction may be difficult.
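The computational cost of a leave-k-out extension can be gauged by counting the taxon subsets that would each need a separate reanalysis. The sketch below is only an illustration of that count, not a proposed implementation: the function name `leave_k_out_sets` and the toy taxon labels are hypothetical.

```python
from itertools import combinations
from math import comb


def leave_k_out_sets(taxa, k):
    """All sets of k taxa that could be dropped together before reanalysis."""
    return list(combinations(taxa, k))


# Toy taxon labels, invented for illustration only.
toy_taxa = list("ABCDE")
pairs = leave_k_out_sets(toy_taxa, 2)
print(len(pairs), comb(len(toy_taxa), 2))
```

The number of reanalyses grows as the binomial coefficient of n taxa choose k, so exhaustive pairwise or higher-order designs become costly much faster than the leave-one-out case, which needs only n reanalyses.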

ACKNOWLEDGMENTS
We thank Katherine St. John, Klaus Schliep, John G. Maisey, Dean C. Adams, Liam Revell, Sven Templer, and Lauren Sallan for discussion of the concept of taxon influence and for code suggestions that improved the analytical components of the manuscript.

CONFLICT OF INTERESTS
We declare we have no competing interests.

AUTHOR CONTRIBUTIONS
J.S.S.D. conceived the study, wrote the original taxon influence code, and drafted the manuscript. E.W.G. wrote the tree parser, refined the code, ran cluster-based analyses, and edited the manuscript. Both authors approved submission.

DATA ACCESSIBILITY
All scripts and functions for calculating taxon influence and associated analyses are provided as Supplementary Data accompanying this paper.

ETHICAL STATEMENT
The research complies with all national and international ethical requirements.