How to get the most out of phylogenetic imputation without abusing it

Phylogenies are viewed as potentially powerful resources to predict missing values in trait datasets, but they are often misused. Critically, many of the imputed values that completely or partially rely on phylogenetic information are trusted without convincingly demonstrating that the data meet the requirements for the predictions to be at least minimally valuable. I argue that phylogenetic signal, the mainstay of phylogenetic imputation, is often interpreted as 'strong' because the outcome of randomization tests has prevailed over the actual strength of the signal in determining whether it is strong or not. This circumstance has led many researchers to infer conclusions based on 'strong' signals that are actually far more labile than a phylogenetic random walk (i.e. Brownian motion). Although trait evolutionary trajectories that nearly fit Brownian motion are typically considered strongly conserved, the Brownian process is subject to high levels of stochasticity that may yield spurious predictions under some circumstances. To my knowledge, very few studies (if any) that rely on phylogenetically imputed information have rigorously evaluated the expected accuracy of individual predictions, even though among-lineage variability in prediction accuracy can be dramatic even for strongly conserved traits. Here, I advocate for a Monte-Carlo approach based on trait simulations to assess the prediction accuracy that is expected for each missing value in the traits of interest, whether continuous or discrete. The framework is presented in a detailed step-by-step R tutorial conceived to let non-specialized researchers identify highly likely spurious predictions without the need for advanced technical and statistical skills. Although phylogenetic imputation has important limitations, I suggest that leveraging advances in our understanding of such hindrances and using the technique with caution and restraint will allow trait-based research to progress further while sampling efforts continue replacing imputed data.


| INTRODUCTION
Trait-based research is essential for improving our understanding of the ecological and evolutionary drivers that shape biodiversity (Green et al., 2022; Keddy & Laughlin, 2022). However, the field is currently constrained by the scarcity of available information for many relevant traits (Sandel et al., 2015; Taugourdeau et al., 2014).
Although most scholars would agree that efforts should be directed to compiling species' trait values in the field (Etard et al., 2020), collecting data is time-consuming and financially expensive, and it might even be impossible for very large datasets that include elusive or endangered species. This circumstance has favoured the widespread use of phylogenetic imputation, alone or in combination with other imputation techniques, to tentatively fill in the gaps until field-based measurements become available (Debastiani et al., 2021; Johnson et al., 2021; Penone et al., 2014). Roughly, the idea is to take advantage of the fundamental principle that species are overall more similar to their close relatives than to distant ones (a pattern commonly referred to as 'phylogenetic signal') to predict missing trait values from extant data (see Swenson, 2014 for a seminal review). Notably, phylogenetic imputation has made it possible to explore many questions that would otherwise have remained obscure (e.g. James et al., 2021), and the prospect is to continue filling in the gaps with imputations as ecological research is extended to data-deficient taxonomic groups (González-del-Pliego et al., 2019; Jetz & Freckleton, 2015).
Despite these promising advances, a closer look at the literature reveals that phylogenetic imputation is often misused. For example, Cantwell-Jones et al. (2022) applied the technique to propose a list of 1044 edible species as promising key sources of B vitamins, but the accuracy of most of their predictions was indistinguishable from that expected under mere randomness (Molina-Venegas et al., 2023). This is just one of many examples that could be given to illustrate how phylogenetically imputed values are often trusted without convincingly demonstrating that the data meet the requirements for the predictions to be at least minimally valuable. Indeed, very few studies (if any) that rely on phylogenetically imputed information have rigorously evaluated the expected accuracy of individual predictions, even though among-lineage variability in prediction accuracy can be dramatic even for strongly conserved traits (Molina-Venegas et al., 2018), and other studies have failed even to report the strength of phylogenetic signal in the traits under consideration (e.g. Carmona et al., 2021; Hernández-Hernández & Wiens, 2020; Méndez et al., 2022; Toussaint et al., 2021).
Previous findings showed that, overall, incorporating phylogenetic correlation structure into imputation exercises can improve the accuracy of the predictions (Debastiani et al., 2021; Penone et al., 2014). However, extremely inaccurate predictions may lie concealed among more precise ones even under scenarios of 'strong' phylogenetic signal (Molina-Venegas et al., 2018, 2023), and this caveat has been pervasively overlooked since the technique became popular.
Here, I aim to reverse this trend by providing in-depth insight into the factors that determine the accuracy of phylogenetic imputations.
The piece is also accompanied by a detailed user's guide that allows non-specialized researchers to critically assess the reliability of phylogenetic predictions on continuous and discrete traits, in accordance with the goals of their study.

| A critical interpretation of phylogenetic signal
The predictive capability of phylogenetic imputation is primarily determined by the amount of phylogenetic signal in the observed data, that is, the extent to which closely related species share similar values in the traits of interest (Blomberg et al., 2003). Thus, predictions based on weak phylogenetic signals will always be valueless. This, however, leads to a fundamental question: what is a sufficiently strong phylogenetic signal for imputing missing values? The literature is highly convoluted and inconsistent regarding what constitutes 'strong' phylogenetic signal. For example, some authors assume 'strong phylogenetic signal' even if the traits show evolutionary trajectories that are considerably more labile than expected under Brownian motion (e.g. Liu et al., 2015), and others consider that phylogenetic signal is strong only if trait evolution is more conserved than the Brownian expectation (CaraDonna & Bain, 2016).
Lying between these opposing views, the prevailing idea is that phylogenetic signal is 'strong' when evolutionary trajectories nearly fit Brownian motion and significantly deviate from a 'white noise' model (i.e. pure random evolution; Münkemüller et al., 2012).
Phylogenetic signal is typically evaluated using Pagel's λ model (Pagel, 1999) and/or Blomberg's K metric (Blomberg et al., 2003). Both λ and K equal 1 when Brownian motion fits the data well, and less conserved evolutionary trajectories yield lower values, down to a minimum of 0 (complete lack of phylogenetic signal).
The λ metric has a natural scale between 0 and 1, meaning that it cannot capture evolutionary trajectories that are more conserved than Brownian motion. In contrast, the K metric may successfully capture the latter pattern (K is >1 if evolutionary trajectories are more conserved than Brownian motion), and therefore both metrics provide complementary information (note that λ and K have different scales and are not directly comparable unless λ = K = 0 or λ = K = 1). The statistical significance of λ and K is typically evaluated using randomization tests, which often results in significant signals (i.e. p < 0.05 in the randomization tests) but observed λ and K much smaller than 1 (Swenson, 2019). This is because the randomization approach is more akin to asking whether there is more signal than expected from a 'white noise' model (Münkemüller et al., 2012), a condition that can be met even with extremely labile traits (Molina-Venegas et al., 2023). Unfortunately, the outcome of randomization tests has prevailed over the strength of phylogenetic signal in determining whether it is strong or not, a circumstance that has led many researchers to infer conclusions based on 'strong' signals that are far more labile than the Brownian expectation.
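To make the distinction between signal strength and test outcome concrete, the following minimal sketch (assuming 'tree' is an ape phylo object and 'trait' a named numeric vector of observed values; object names are illustrative) estimates both metrics with the phytools package and reports their values alongside the p-values:

```r
# A minimal sketch: estimate lambda and K and report the strength of the
# signal itself, not only the test outcomes.
library(phytools)

# Pagel's lambda (maximum likelihood) with a likelihood-ratio test vs lambda = 0
sig_lambda <- phylosig(tree, trait, method = "lambda", test = TRUE)

# Blomberg's K with a tip-shuffling randomization test
sig_K <- phylosig(tree, trait, method = "K", test = TRUE, nsim = 999)

# A significant p-value with lambda or K well below 1 still indicates a
# trait that is more labile than Brownian motion
c(lambda = sig_lambda$lambda, p_lambda = sig_lambda$P,
  K = sig_K$K, p_K = sig_K$P)
```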
It is important to note that, although Brownian evolution is often considered an indication of strong phylogenetic signal, this pattern is just the result of a random walk up the branches of a phylogenetic tree, a process that is subject to high levels of stochasticity (Box 1). I stress that such Brownian stochasticity should serve as a warning for researchers to carefully assess whether the observed phylogenetic signal is sufficiently strong to trust phylogenetic predictions in accordance with the aims of the study.

BOX 1 Traitgrams showing the evolutionary trajectories of three simulated traits with contrasting phylogenetic signals. To simulate the traits, a Brownian motion model of evolution (state at the root = 0, instantaneous variance = 0.1) was propagated along the branches of the phylogeny using (a) a delta transformation (delta = 0.1, generating more phylogenetic signal than expected under Brownian motion), (b) the actual phylogeny (Brownian motion) and (c) a lambda transformation (lambda = 0, white noise). I then fitted Pagel's delta and lambda models to the simulated traits and repeated the procedure iteratively until delta = 0.1 ± 0.025 for trait (a), lambda = 1 ± 0.025 for trait (b) and lambda < 0.025 for trait (c). Phylogenetic signal was measured using Blomberg's K metric and tested with a randomization test (p = 0.001, 0.003 and 0.919 for (a), (b) and (c), respectively). Note that, contrary to expectation, Brownian motion can lead to higher resemblance between distant relatives in some cases (e.g. sp4-sp13, sp1-sp16). Moreover, even for the strongly conserved trait (a), the premise that species are more similar to their close relatives than to distant ones is not strictly met in all cases (e.g. sp2-sp7 show higher resemblance than sp7-sp8). Simulations were conducted with the 'geiger' R package (Pennell et al., 2014).
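The sketch below (a minimal illustration, again assuming 'tree' is an ape phylo object) makes the same point numerically: traits repeatedly simulated under one and the same Brownian process realize widely varying strengths of signal.

```r
# Simulate 100 traits under identical Brownian motion and measure the
# realized Blomberg's K of each replicate
library(phytools)  # fastBM
library(picante)   # Kcalc

set.seed(42)
bm_traits <- replicate(100, fastBM(tree, sig2 = 0.1))

# K scatters widely around 1 even though the generating process is the same
K_values <- apply(bm_traits, 2, Kcalc, phy = tree)
summary(K_values)
```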
For example, if the goal is to use imputed values along with observed data to compute a summarizing metric of functional diversity at the macro-ecological scale (e.g. Swenson et al., 2017), the researcher may take a less conservative stance and assume that a phylogenetic signal close to Brownian motion (e.g. λ > 0.9) may suffice to fill in the gaps, because prediction errors will be diluted by mean aggregation. However, if the researcher intends to use each predicted value separately to make an inference of any kind (e.g. Bellot et al., 2022; Cantwell-Jones et al., 2022), a more conservative stance is desirable. In this case, predictions may only be tentatively trusted if phylogenetic signal is stronger than Brownian motion.

| Looking beyond phylogenetic signal: predictive distances and imputation accuracy
Phylogenetic signal is a major, but not the only, factor determining the accuracy of phylogenetic imputations. As such, even if the traits show strongly conserved evolutionary trajectories (e.g. K ≫ 1), accurate predictions are expected only when the phylogenetic distance between (i) the terminal node representing the closest relative with a known trait value of a target species and (ii) the internal node representing their most recent common ancestor is relatively short (hereafter 'predictive distance', Figure 1). Thus, the longer the predictive distance, the smaller the difference between phylogenetic imputations and random predictions (Molina-Venegas et al., 2018, 2023). A high incidence of missing values in the dataset will result in longer predictive distances owing to a higher probability of phylogenetic clumping of the missing data (Figure 1), hence reducing prediction accuracy. It follows that, in addition to ensuring that phylogenetic signal is sufficiently strong for the purpose of the study, phylogenetic imputation exercises should always be accompanied by a rigorous evaluation of the prediction accuracy that is expected for each target species.

FIGURE 1 Predictive distances (double arrows) between scenarios of lower (a) and higher (b) incidence of missing data that are also less and more phylogenetically clumped, respectively. Predictive distances are defined by the distance between (1) the terminal nodes (species) representing the closest relative with a known trait value of the target species (question marks) and (2) the internal node representing their most recent common ancestor (MRCA). Overall, the longer the predictive distances, the lower the prediction accuracy.
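The following minimal sketch computes predictive distances, assuming an ultrametric ape phylo object 'tree' and a named trait vector 'trait' with NA for the target species (object names are illustrative). On an ultrametric tree, the distance from the closest relative with a known value to the MRCA it shares with the target equals half their patristic distance.

```r
library(ape)

patristic <- cophenetic(tree)            # pairwise tip-to-tip distances
targets   <- names(trait)[is.na(trait)]  # species with missing values
known     <- names(trait)[!is.na(trait)] # species with observed values

# Predictive distance: closest known relative to the shared MRCA
pred_dist <- sapply(targets, function(sp) min(patristic[sp, known]) / 2)

# Long predictive distances flag target species whose imputations will
# tend towards random predictions
sort(pred_dist, decreasing = TRUE)
```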

| OPENING THE PHYLOGENETIC IMPUTATION 'BLACK BOX' WITH THE MONTE-CARLO APPROACH
A common procedure to validate imputation exercises is conducting leave-one-species-out cross-validation trials on the observed values of the traits. Roughly, the observed values are dropped one at a time, and the remaining values are used to make a prediction. Then, the overall error across all the predictions is assessed using a prediction coefficient (Cantwell-Jones et al., 2022; Guénard et al., 2013; Vaitla et al., 2018). Another closely related procedure consists of dividing the observed values into a training set and a test set, then using the former to parametrize a predictive model and the latter to assess model performance independently of parametrization (e.g. Bellot et al., 2022). However, such overall evaluations of imputation performance may not capture the variability in prediction accuracy among phylogenetic tips (Molina-Venegas et al., 2018); that is, prediction accuracy is not guaranteed for all target species even if the overall error assessed in the cross-validation trials is low (Figure 2). Here, I advocate for a Monte-Carlo approach based on trait simulations to directly assess the prediction accuracy that is expected for each target species (see Appendix 1 for a detailed step-by-step R tutorial).

FIGURE 2 Hypothetical examples where leave-one-species-out cross-validation trials will not reflect the performance of phylogenetic imputation. The size of the leaves on the phylogenetic tips represents the value of the trait for each species (same trait in both scenarios but different distribution of missing data). In example (a), the overall error assessed across all the predictions in the cross-validation trial will be lower than in (b), and yet phylogenetic imputation will perform poorly because predictive distances (red arrows to the right of the missing values) are long. In contrast, phylogenetic imputation will perform better in (b) due to shorter predictive distances, even though a higher overall error in the cross-validation trial is expected because of the longer cross-validation predictive distances (blue arrows). Note that if two observed values share the same cross-validation predictive distance, only one arrow is depicted to avoid plot crowdedness.
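For concreteness, the sketch below shows the kind of overall leave-one-species-out validation described above, assuming 'tree' is an ape phylo object and 'dat' a data frame with a 'species' column and a numeric 'trait' column (NA for missing values); object names are illustrative, and the final coefficient is one common form of the overall prediction coefficient (cf. Guénard et al., 2013).

```r
library(Rphylopars)

obs_sp <- dat$species[!is.na(dat$trait)]

cv_pred <- sapply(obs_sp, function(sp) {
  d <- dat
  d$trait[d$species == sp] <- NA                 # drop one observed value
  fit <- phylopars(trait_data = d, tree = tree)  # refit and impute
  fit$anc_recon[sp, "trait"]                     # prediction for that tip
})

# Overall prediction coefficient across the trial; note that a low
# overall error can still mask very inaccurate individual predictions
obs <- dat$trait[match(obs_sp, dat$species)]
1 - sum((obs - cv_pred)^2) / sum((obs - mean(obs))^2)
```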
The foundational idea of phylogenetic imputation is to parametrize the evolutionary trajectory of a trait of interest based only on its observed values, and then to propagate the model on a phylogenetic tree containing all the species with known and unknown values (hereafter 'complete phylogeny') to make predictions on the latter. Likewise, the user may simulate a set of traits (e.g. N = 100) on the complete phylogeny using the parameters of the model that best fits the observed data (typically Pagel's lambda or Pagel's delta, see Appendix 1 for details), and then use the simulated values of each trait that match the positions of the observed data to make a prediction on the target species (Figure 3). Because the values that are missing in the real trait are known in the simulated ones, we can estimate the accuracy that is expected for each target species across the simulated traits with a prediction coefficient:

$$P^2_{ij} = 1 - \frac{\left(y_{\mathrm{obs},ij} - y_{\mathrm{pred},ij}\right)^2}{s^2_{y_j}}$$

where $y_{\mathrm{obs},ij}$ and $y_{\mathrm{pred},ij}$ are the observed (simulated) and predicted values, respectively, for target species i and simulated trait j, and $s^2_{y_j}$ is the variance of simulated trait j across all species. $P^2$ varies between 1 (prediction perfectly matching the observation) and minus infinity (note that there is no theoretical limit to how badly a predictive model may perform), and the $P^2_i$ scores averaged across all the simulated traits (the median $P^2_i$) summarize the prediction accuracy that is expected for target species i at the estimated model parameters (Figure 3). Each distribution of $P^2_i$ scores can be compared to a null distribution of $P^2_{\mathrm{null}}$ values that is computed across all the species in the phylogeny by making individual predictions based simply on trait means (this equates to a 'white noise' model in which phylogeny is completely irrelevant for trait evolution). If the median $P^2_i$ is significantly greater than the median $P^2_{\mathrm{null}}$, then the data meet the requirements for the prediction on target species i to be at least minimally valuable (see Appendix 2 for an assessment of the performance of the Monte-Carlo approach). Moreover, the researcher can use alternative null models to assume more conservative stances and seek predictions that are more valuable than the minimum. For example, if the goal is making an inference on each imputed value separately (e.g. Cantwell-Jones et al., 2022) and/or using imputations in situations where the loss of fine-scale information can have major consequences, Brownian motion could be used instead as the null model to draw the distribution of $P^2_{\mathrm{null}}$ scores (provided that phylogenetic signal in the observed values of the trait of interest is above Brownian motion; otherwise a Brownian null will not make sense). If the loss of fine-scale information is not particularly severe (as in large-scale studies where errors are often diluted by mean aggregation), the user could opt for any intermediate null model between white noise and Brownian motion (see Appendix 1 for details).
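The sketch below outlines the three steps of the approach for a continuous trait under Pagel's lambda. It is a simplified illustration under stated assumptions ('tree' an ultrametric ape phylo object, 'trait' a named vector with NA for the target species; object names are illustrative), not the exact implementation of the Appendix 1 tutorial.

```r
library(ape)        # drop.tip, Ntip
library(phytools)   # phylosig, fastBM
library(geiger)     # rescale
library(Rphylopars) # phylopars

nsim    <- 100
targets <- names(trait)[is.na(trait)]
known   <- names(trait)[!is.na(trait)]

# (1) Fit the model to the observed values and retain its parameters
tree_obs <- drop.tip(tree, targets)
lam      <- phylosig(tree_obs, trait[known], method = "lambda")$lambda
sim_tree <- rescale(tree, model = "lambda", lam)  # lambda-transformed tree

# (2) Simulate traits at the fitted parameters and predict each target
# species from the simulated values at the observed positions
P2     <- matrix(NA, nsim, length(targets), dimnames = list(NULL, targets))
P2null <- matrix(NA, nsim, Ntip(tree))
for (j in seq_len(nsim)) {
  sim  <- fastBM(sim_tree, sig2 = 0.1)             # one simulated trait
  d    <- data.frame(species = names(sim),
                     trait   = ifelse(names(sim) %in% known, sim, NA))
  fit  <- phylopars(trait_data = d, tree = tree, model = "lambda")
  pred <- fit$anc_recon[targets, "trait"]          # imputed target values
  P2[j, ]     <- 1 - (sim[targets] - pred)^2 / var(sim)
  P2null[j, ] <- 1 - (sim - mean(sim[known]))^2 / var(sim)  # 'white noise'
}

# (3) Per-species contrast: is the expected accuracy greater than the null?
p_values <- sapply(targets, function(sp)
  wilcox.test(P2[, sp], as.vector(P2null), alternative = "greater")$p.value)
apply(P2, 2, median)  # expected accuracy per target species
p_values
```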
Phylogenetic imputation is often combined with trait-correlation based techniques, typically using phylogenetic eigenvectors (Debastiani et al., 2021; Penone et al., 2014). In this case, I recommend assessing the worthiness of phylogenetic information alone using the Monte-Carlo method on each trait separately. If a phylogenetic prediction cannot be shown to be at least minimally valuable for a given species, the researcher should remove the phylogenetic eigenvectors from the trait matrix before predicting the corresponding missing value. Note that there is no theoretical limit to how badly a phylogenetic correlation structure can predict missing data (Guénard et al., 2015), and therefore adding phylogenetic eigenvectors blindly could become counterproductive. Finally, it is important to note that the evolutionary models underlying phylogenetic imputation readily lend themselves to quantifying uncertainty around the predicted values (e.g. prediction variances and standard errors in the Rphylopars (Goolsby et al., 2022) and picante (Kembel et al., 2010) R packages, respectively), which may serve to preliminarily flag potentially unreliable predictions. As such, the greater the uncertainty around the predictions, the lower the probability that they are minimally valuable. Moreover, prediction uncertainty could potentially be accommodated in downstream analyses to increase the robustness of statistical inferences (Johnson et al., 2021). However, the vast majority of the most widely used methods among eco-evo researchers require filling in the gaps with one single trait value per species, hence the convenience of the Monte-Carlo based perspective for trusting or disregarding predictions based on individual contrasts for each target species.
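As a simplified illustration of this screening, the sketch below obtains phylogenetic eigenvectors by PCoA of the patristic distance matrix and combines them with a random-forest imputer. Here 'traits' (a data frame of numeric traits with species as rownames) and 'unreliable' (the species whose phylogenetic predictions failed the Monte-Carlo contrast) are assumed objects, and the per-trait, per-species logic of the main text is coarsened to a two-pass imputation; other decompositions (e.g. the PVR package) could be used instead.

```r
library(ape)        # pcoa, cophenetic
library(missForest) # random-forest imputation of incomplete data

# Phylogenetic eigenvectors via PCoA of the patristic distance matrix
eigvec <- pcoa(as.dist(cophenetic(tree)))$vectors[, 1:10]

# Impute once with and once without the phylogenetic eigenvectors
with_phylo <- missForest(cbind(traits, eigvec[rownames(traits), ]))$ximp
no_phylo   <- missForest(traits)$ximp

# Keep phylogenetically informed imputations only where they passed the
# Monte-Carlo contrast; fall back on trait correlations alone elsewhere
imputed <- with_phylo[, colnames(traits), drop = FALSE]
imputed[unreliable, ] <- no_phylo[unreliable, ]
```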

| CONCLUDING REMARKS
Despite the promising prospect of using phylogenetic imputation to complete trait databases, the technique has important caveats that have been pervasively overlooked in the literature. The use of phylogenetic information in ecological research remains mired in controversy for several reasons (Cadotte et al., 2017; Gerhold et al., 2015), and particularly due to a misguided use of the methodology (Davies, 2021). I stress that avoiding the 'blind' use of phylogenetic imputation is a step forward in mitigating the phylogenetic credibility crisis. Although the technique has important limitations, leveraging advances in our understanding of such hindrances and using the available tools with caution and restraint will allow trait-based research to progress further while sampling efforts continue replacing imputed data. In this regard, the Monte-Carlo based perspective presented here can help researchers identify highly likely spurious phylogenetic predictions without the need for advanced technical and statistical skills, thus allowing non-specialized users to get the most out of phylogenetic imputation without taking an excessive risk.
FIGURE 3 Workflow of the Monte-Carlo approach for measuring and testing the prediction accuracy that is expected for each target taxon. Firstly, an evolutionary model (Pagel's lambda or Pagel's delta, see Appendix 1) is fitted to the data and the model parameters are retained (1). Secondly, the parameters of the model are used to simulate a set of traits (e.g. N = 100) along the branches of the phylogeny iteratively, and the simulated values of each trait that match the positions of the observed data are used to make a prediction on the target taxa (2a). The prediction accuracy per target taxon and trait (prediction coefficient $P^2$ for continuous traits) is assessed and stored to finally obtain a distribution of scores for each target taxon. In addition, a null distribution of prediction scores is drawn by dropping all simulated trait values, one at a time, and making a prediction per taxon and trait based on a null approach, for example a 'white noise' model (2b). Finally, statistical tests are conducted to assess whether the prediction accuracy of imputations conducted with the evolutionary model that best fits the observed data is greater than that obtained with the null (3).