Exploring the effects of weighting against homoplasy in genealogies of palaeontological phylogenetic matrices

Although simulations have shown that implied weighting (IW) outperforms equal weighting (EW) in phylogenetic parsimony analyses, weighting against homoplasy lacks extensive usage in palaeontology. Iterative modifications of several phylogenetic matrices in the last decades resulted in extensive genealogies of datasets that allow the evaluation of differences in the stability of results for alternative character weighting methods directly on empirical data. Each generation was compared against the most recent generation in each genealogy because it is assumed that it is the most comprehensive (higher sampling), revised (fewer misscorings) and complete (lower amount of missing data) matrix of the genealogy. The analyses were conducted on six different genealogies under EW, IW and extended implied weighting (EIW) with a range of concavity constant values (k) between 3 and 30. Pairwise comparisons between trees were conducted using Robinson-Foulds distances normalized by the total number of groups, distortion coefficient, subtree pruning and regrafting moves, and the proportional sum of group dissimilarities. The results consistently show that IW and EIW produce results more similar to those of the last dataset than EW in the vast majority of genealogies and for all comparative measures. This is significant because almost all of these matrices were originally analysed only under EW. Implied weighting and EIW do not outperform each other unambiguously. Euclidean distances based on a principal components analysis of the comparative measures show that different ranges of k-values retrieve the most similar results to the last generation in different genealogies. There is a significant positive linear correlation between the optimal k-values and the number of terminals of the last generations.
This could be employed to inform about the range of k-values to be used in phylogenetic analyses based on matrix size but with the caveat that this emergent relationship still relies on a low sample size of genealogies. © 2024 The Authors. Cladistics published by John Wiley & Sons Ltd on behalf of Willi Hennig Society.


Introduction
The analysis of the phylogenetic relationships among species has represented one of the main research efforts of palaeontology in the last 30 years, especially in the study of vertebrate fossils. Phylogenetic analyses using maximum parsimony as optimality criterion have been and currently still are the prevailing method chosen by palaeontologists, although probabilistic approaches (mainly Bayesian inference analyses) have become more common in the last decade (Lee and Worthy, 2012; Wright and Hillis, 2014). Within this context, the weighting of characters has been one of the most debated topics of phylogenetic parsimony analyses. For example, an old debate was whether parsimony precluded the weighting of characters or not, but all characters require a weight for their optimization. The long-standing discussion is whether all characters should be equally weighted or not (e.g. Farris, 1969, 1983; Goloboff, 1991, 1993, 1995; Turner and Zandee, 1995; Kluge, 1997; Miller and Hormiga, 2004). Differential weighting of characters is a way to account for the reliability of characters, in which a higher amount of homoplasy in a character is negatively related to its reliability to reconstruct phylogenetic relationships (Goloboff, 1993, 2022). Farris (1969) developed a successive weighting procedure (known as Successive Approximations Weighting), in which the amount of homoplasy of the characters was calculated on an initial sample of most parsimonious trees (MPTs) and subsequently a new tree search was conducted using reassigned character weights based on those of the previous analysis. This procedure was repeated until a stable result was reached. Later, Goloboff (1993) and Goloboff et al. (2008) proposed another procedure, implied weighting (IW), to estimate character weights during the tree searches with the aim of penalizing steps of homoplasy (h, the number of additional steps over the minimum number of possible steps). Parsimony under IW uses as optimality criterion the maximization of the total character fit and the character fits (f) are defined by the following equation: $f = k / (k + h)$, in which k is the concavity constant value.
The degree of concavity of this function, and hence character fit, is adjusted using the concavity constant k, which determines how strongly a character is downweighted based on its level of homoplasy. If the k-value is low, the downweighting of homoplastic characters is stronger, whereas higher k-values result in weaker weighting against homoplasy because they produce less concave functions, more closely resembling equal weighting (EW) (readers are referred to Goloboff, 1993; Goloboff et al., 2008; Goloboff, 2022 for a detailed description of the method). More recently, Goloboff (2014) proposed extended implied weighting (EIW) to take into account the missing data in the matrices, in which different concavities are determined based on the amount of missing data in each character (see Goloboff, 2014, 2022 for a detailed description).
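The effect of the concavity constant can be illustrated numerically. The following is a minimal sketch, assuming the standard implied-weighting fit function of Goloboff (1993), f = k / (k + h); the function names `character_fit` and `step_cost` are hypothetical and are not part of TNT. It prints the fit for 0 to 4 steps of homoplasy and the cost of the fifth extra step relative to the first one, which approaches 1 (as under EW) when k increases.

```python
def character_fit(h, k):
    """Implied-weighting fit of a character with h extra steps: f = k / (k + h)."""
    if k <= 0 or h < 0:
        raise ValueError("k must be positive and h must be non-negative")
    return k / (k + h)


def step_cost(h, k):
    """Fit lost when a character that already has h extra steps gains one more."""
    return character_fit(h, k) - character_fit(h + 1, k)


for k in (3, 12, 30):
    fits = [round(character_fit(h, k), 3) for h in range(5)]
    # Under equal weighting every step costs the same, so this ratio would be 1.
    relative_cost = step_cost(4, k) / step_cost(0, k)
    print(f"k = {k:2d}  fits for h = 0..4: {fits}  "
          f"cost of 5th step / cost of 1st step: {relative_cost:.3f}")
```

Low k-values make the later steps of a homoplastic character much cheaper than its first step, which is the stronger downweighting described above.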
In the case of palaeontology, where the vast majority of phylogenetic analyses are conducted under a cladistic framework, weighting against homoplasy is far less commonly used than equal character weighting. For example, a Google Scholar search (February 2023) found 14 600 papers that included 'palaeontology', 'phylogenetic analysis' and 'parsimony' in their text between 2009 and present. However, only 5.86% of them also included 'implied weighting', 'implied weights' or 'implied weighted' in their text. In contrast, 16.78% of the 14 600 papers also included 'equally weighted', 'equal weights' or 'equal weighting' in their text in the same period. Thus, this shows a conspicuous asymmetry favouring analyses using all equally weighted characters in palaeontological papers (it is relatively common for papers that use EW in their phylogenetic analyses not to provide a statement about character weighting and the opposite is true when they use IW; therefore, this asymmetry has probably been underestimated in this search). The higher frequency of parsimony analyses under EW is probably rooted in the common belief that differential character weighting implies additional assumptions over analyses using EW (Kluge, 1997; Grant and Kluge, 2003, 2005; Congreve and Lamsdell, 2016). However, analyses under EW assume that each character has the same weight as the others, while IW allows the weights to be all equal or different, depending on the data; EW seems thus to make a stronger assumption than IW (Goloboff, 2022). In addition, some authors have regarded character weighting on the basis of homoplasy merely as a way to resolve polytomies found under EW, so that IW and EIW would be warranted only as auxiliary methods to improve the resolution of consensus trees (Turner and Zandee, 1995). Nevertheless, it is now clear that analyses weighting characters on the basis of homoplasy can recover phylogenetic relationships not recovered under EW (Goloboff et al., 2018).
Recent studies have found that parsimony under IW outperforms EW (i.e. tree comparison measures under IW were more similar to the target tree(s) than under EW; this criterion is followed here when referring to a method that outperforms another) in different simulation-based analyses (e.g. Goloboff et al., 2008, 2018; Puttick et al., 2019). Although other analyses also based on simulations have concluded the opposite (Congreve and Lamsdell, 2016; O'Reilly et al., 2016; Puttick et al., 2017), those criticisms of IW have been refuted on the basis of methodological problems in the analyses and/or misinterpretations of the IW method (Goloboff et al., 2018). Furthermore, studies conducted on empirical data matrices found similar stratigraphic congruence for their fossil taxa when they were analysed under EW or IW (Sansom et al., 2018).
Those comparisons between equally and differentially weighted analyses can be complemented by studies of how EW, IW and EIW behave throughout empirical modifications of datasets (Goloboff et al., 2008), but no such comparison has been carried out before. It is worth noting that it is difficult to determine whether simulated datasets actually capture all salient complexities of empirical morphological data and thus it is desirable to complement them with analyses on empirical datasets. Iterative modifications of several phylogenetic matrices in the last decades resulted in extensive genealogies of datasets that allow evaluation of the effect of different methods for character weighting directly in empirical data. The use of genealogies of data matrices to explore how characteristics of a phylogenetic matrix or its resultant trees change through time is not common, but it has been used recently to analyse the evolution of the distribution of the amount of character homoplasy in three empirical genealogies (Murphy et al., 2021).
Genealogies of palaeontological datasets are particularly interesting for comparison between weighting methods because of the considerably higher amount of missing data present in comparison with neontological matrices (missing entries can create problems for IW that EIW is intended to solve; Goloboff, 2014).
The aim of this exploratory study is therefore to assess the influence of EW, IW and EIW on the stability of results, for a series of genealogies of empirical datasets. In general, it could be expected that a method of analysis that consistently recovers trees that are closer to the actual historical tree produces more stable results as the dataset is modified by successive studies. Note that stability is a necessary but not a sufficient condition for an analytical method to produce results closer to the correct tree; the results of the analysis of a series of datasets could be stable yet potentially completely wrong, but unstable results can be correct only occasionally. The last dataset in the series is used here as the yardstick for comparison, because it should represent the state of the art in terms of knowledge about morphology, characters and taxa for the group in question (incorporating knowledge from the previous datasets in the case of direct genealogical lines of matrices). The final goal of this research is therefore to determine whether there is a general pattern of more similar topological results to the last generation, and thus higher topological stability, throughout the generations in the trees recovered under EW, IW and EIW within a broad range of k-values. It is expected that the results presented here would be relevant for deciding on the use of character downweighting on the basis of homoplasy, with a special focus on palaeontological datasets.

Genealogies
The effects of EW, IW and EIW (Goloboff, 1993, 2014; Goloboff et al., 2008) were tested on iterative modifications of direct genealogical lines of different palaeontological data matrices (i.e. divergences that would produce horizontal ramifications of the genealogy were excluded; Fig. 1). The search of genealogies was limited to those with at least eight direct-line generations, trying to cover a disparate range of sampled terminals (23-229 species; all matrices provided as Data S1). This search resulted in six genealogies focused on different fossil vertebrate clades. In all cases, the non-applicable scorings (-) of the data matrices were replaced by an asterisk (*), which TNT distinguishes from missing entries. This was done in order to not artificially inflate (Congreve and Lamsdell, 2016; Goloboff, 2022) the number of missing data in the matrices; EIW is based on assuming missing entries would have as much homoplasy as observed entries, but entries for inapplicable characters cannot have homoplasy and thus should not be counted as "missing". The following six different genealogies of matrices were used:
1. The Complete Archosauromorph Tree Project (CoArTreeP) matrix genealogy: this dataset is part of a project that aims to build a phylogenetic dataset that includes all valid species of Permian-Early Jurassic archosauromorphs (Ezcurra, 2022). After multiple iterative modifications, this matrix has been expanded from 82 to 229 active terminals and from 600 to 907 active characters. Here, a direct genealogical line of 10 generations of this data matrix was used (1, Ezcurra, 2016; 2, Nesbitt et al., 2017; 3, Ezcurra et al., 2017; 4, Ezcurra and Butler, 2018; 5, Ezcurra et al., 2019; 6, Butler et al., 2019; 7, Ezcurra et al., 2020; 8, Ezcurra and Sues, 2021; 9, Ezcurra et al., 2023a; 10, this paper; Fig. 1).
2. The Theropod Working Group (TWiG) matrix genealogy: this genealogy of matrices represents one of the oldest attempts at building an extensive palaeontological phylogenetic data matrix by iterative expansion through time (Norell et al., 2001). Since the first version of the TWiG matrix, the dataset has been expanded from 44 to 170 terminals and from 205 to 855 characters.

Tree searches
Each generation of each matrix genealogy was analysed in TNT 1.6 (Goloboff and Morales, 2023) under maximum parsimony using EW, IW, and EIW. In the case of the analyses weighting against homoplasy, a range of concavity constant (k) values between 3 and 30 was used here. Searches with k-values below 3 were avoided because the character downweighting of these functions is too strong (Goloboff et al., 2008). In order to reduce the computational time of the tree searches and comparisons between generations, searches with k-values higher than 30 were not conducted because they gradually converge with EW. If the data matrix had more than 80 terminals, the tree search strategy initially used a combination of the tree-search algorithms sectorial searches, drifting, ratchet and tree fusing, until 100 hits of the same minimum tree length were achieved. On the other hand, if the matrix had 80 or fewer terminals, a search of 1000 replications of Wagner trees (with random addition sequence) followed by TBR branch swapping (holding 10 trees per replicate) was performed. Regardless of the tree-search algorithms used, the shortest trees obtained were then subjected to a final round of TBR branch swapping in all cases. Zero-length branches in any of the recovered MPTs were collapsed. All multistate characters that were considered as additive in their original analyses were also treated as such here. TNT was set to retain up to 10 000 trees in memory during each of the searches in order to reduce computational times. Homoplasy indices were calculated with a script that does not take into account a priori deactivated terminals (STATSb.run; see the supplementary material of Spiekman et al., 2021). Tree statistics (i.e. length, fit, retention index, consistency index) were saved as output files for each analysis. The tree searches were conducted in one of the clusters of the Centro de Computación de Alto Desempeño of the Universidad Nacional de Córdoba (Argentina) using custom scripts written here for TNT 1.6 (Data S1).

Tree comparisons
All of the tree comparisons were automated in scripts written for TNT 1.6 (Data S1) and conducted in a cluster of the Centro de Computación de Alto Desempeño of the Universidad Nacional de Córdoba (Argentina). Pairwise comparisons were conducted between the trees recovered in each generation and those of the youngest generation of each genealogy of matrices because it is here assumed that the latter is the most comprehensive (more complete taxon and character sampling), revised (lower proportion of misscorings and clearer and less ambiguous character formulations) and complete (lower amount of missing data in terminals present in previous iterations) matrix of the genealogy. Thus, the latest generation of a genealogy of phylogenetic data matrices would be the most reliable, whereas older generations act as increasingly stronger empirical perturbations of this last generation. As a consequence, the trees of the last generation of the genealogies are the target of all of the tree comparisons, quantifying the topological stability of results along phylogenetic genealogies. These results could be stable around a correct tree, but obviously also around an incorrect one. That is, stability is a necessary but not a sufficient condition to recover correct results along a phylogenetic genealogy. Although stability is expected to increase in later generations if the data matrices tend to become more reliable through iterations, this is not an assumption that should be necessarily satisfied in the genealogies analysed in this study because it is likely that they are still far from a taxonomic sampling that is comprehensive enough to reach such stability (the number of species sampled in the fossil record is probably a minor percentage of the whole taxonomic richness of the clade).
If these tree comparisons are conducted against a tree with strong independent support, such as a molecular tree, this would result in a different study because the comparisons would focus directly on the (assumed) correctness of the trees inferred rather than on their stability. Furthermore, comparisons against a molecular tree are impossible to carry out for most of the datasets included in this study because they only include species based on fossil specimens. Similarly, the use of stratigraphic congruence as a comparative measure (e.g. Sansom et al., 2018) would not determine tree stability (as in the case of a molecular tree) and it can introduce complex biases intrinsic to the sampling quality and preservation potential of each different taxonomic group.
Four comparative measures were used to quantify topological similarity/dissimilarity: Robinson-Foulds distances normalized by total number of groups (RF, TNT command rfdistp; Robinson and Foulds, 1981); the complement of the number of subtree pruning and regrafting moves divided by the number of taxa minus 2 (SPR, TNT command sprsim; Goloboff, 2007); distortion coefficient (DC, TNT command symcoeff; Farris, 1973); and the complement of the sum of the similarities of each group divided by the number of nodes [GrComp, TNT code provided by Goloboff et al., 2018].
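As an illustration of the first of these measures, the sketch below computes a Robinson-Foulds distance normalized by the total number of groups for two small rooted trees. It is a simplified example and not the TNT rfdistp implementation: each tree is represented, by assumption, as a set of groups (clades) written as frozensets of taxon names, and the function name `rf_normalized` is hypothetical.

```python
def rf_normalized(groups_a, groups_b):
    """Number of groups present in only one of the two trees,
    divided by the total number of groups in both trees (0 = identical)."""
    a, b = set(groups_a), set(groups_b)
    total = len(a) + len(b)
    if total == 0:
        return 0.0  # two fully unresolved trees are treated as identical
    return len(a ^ b) / total


# Two toy trees on taxa A-E that differ only in the position of B and C.
tree_1 = {frozenset("AB"), frozenset("ABC"), frozenset("DE")}   # (((A,B),C),(D,E))
tree_2 = {frozenset("AC"), frozenset("ABC"), frozenset("DE")}   # (((A,C),B),(D,E))

print(rf_normalized(tree_1, tree_1))
print(rf_normalized(tree_1, tree_2))
```

In this toy case two of the six groups are unshared, so the second value is one third; lower values indicate more similar trees, as in the RF results reported below.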
In addition to the four tree comparison measures used here, other measures also quantify tree similarity (e.g. generalized RF distances, number of quartets, information-theoretic RF distances; Smith, 2019, 2020; Asher and Smith, 2022). However, those other measures were not implemented in TNT when I conducted the analyses of this study (2023), and (given the organization of the scripts used) including those measures in the workflow would have been difficult. Furthermore, although alternative tree comparison measures exist, there is no reason to think that the measures used here, if perhaps imperfect, could bias the results in one or other direction.
The MPTs recovered in the generation under comparison and those of the youngest generation were read, their order was randomized, respectively, with the TNT command randtrees/, and the first 1000 trees of each of the two generations under comparison were kept in memory. The randomization of the tree orders aimed to not limit the pairwise comparisons to certain, restricted areas of the treespace. The reduction of the sample to 1000 trees in each generation implies a maximum of 1 000 000 possible pairwise comparisons in order to reduce computational times. The values of each comparison between genealogies and for the four different comparison metrics were saved as different sets of output files.
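The subsampling scheme described above can be sketched as follows. This is only an illustration in Python, not the TNT scripts of Data S1: the function names, the fixed random seed and the toy similarity measure `shared_group_fraction` are assumptions made for the example, and trees are again represented as sets of groups.

```python
import random
from itertools import product
from statistics import mean


def subsample(trees, limit, rng):
    """Shuffle a copy of the tree list and keep at most `limit` trees."""
    shuffled = list(trees)
    rng.shuffle(shuffled)
    return shuffled[:limit]


def pairwise_values(trees_generation, trees_last, measure, limit=1000, seed=0):
    """Apply `measure` to every pair (tree of a generation, tree of the last generation)."""
    rng = random.Random(seed)
    older = subsample(trees_generation, limit, rng)
    latest = subsample(trees_last, limit, rng)
    # With limit = 1000 this yields at most 1 000 000 comparisons.
    return [measure(a, b) for a, b in product(older, latest)]


def shared_group_fraction(tree_a, tree_b):
    """Toy similarity (hypothetical): shared groups over all distinct groups."""
    union = tree_a | tree_b
    return len(tree_a & tree_b) / len(union) if union else 1.0


t1 = frozenset({frozenset("AB"), frozenset("ABC"), frozenset("DE")})
t2 = frozenset({frozenset("AC"), frozenset("ABC"), frozenset("DE")})
t3 = frozenset({frozenset("BC"), frozenset("ABC"), frozenset("DE")})

values = pairwise_values([t1, t2, t3], [t1, t2], shared_group_fraction, limit=2)
print(len(values), round(mean(values), 3))
```

Shuffling before truncation is what keeps the retained trees from being drawn only from the first region of treespace visited by the search.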
The tree comparisons were conducted between the MPTs to avoid different problems that can be created by using consensus trees. For example, the topological congruence between consensus trees can be overestimated because of polytomies (e.g. O'Reilly and Donoghue, 2018; Smith, 2019; Goloboff, 2022) when using the RF distance: the difference from tree A to a tree B with conflictive groups will be larger than the difference from tree A to a consensus tree C with a polytomy that involves that conflictive group, although this lack of resolution could be the result of more dissimilar relationships in that conflictive region of the tree between tree A and the MPTs that generated the consensus tree C. Admittedly, the use of individual MPTs has its own potential problems: an analysis producing trees with two types of topology, one topology A occurring in a single tree and another topology B occurring in many trees by virtue of a single rogue taxon jumping among different positions in the B topology, would lead to the second type of topology being overinfluential. While this is in principle true, it is not clear that using consensus trees overcomes this problem: a strict consensus tree would be too unresolved to be useful, and a majority rule consensus tree is itself a midpoint (Barthélemy and McMorris, 1986) that is affected by similar problems. Using pruned consensus trees would help in this regard, but automating the calculation of optimal pruned consensus trees for each of hundreds of phylogenetic analyses is far from trivial. In summary, while the comparison between individual MPTs used here may not be ideal and may introduce noise in the comparisons, the same is true for consensus trees, and as there is no reason to think that comparing individual MPTs should consistently bias the comparison in favour of either EW, IW, or EIW, it is used as the best compromise solution available.

Character treatment comparisons
The means of the pairwise comparisons within each generation were plotted against generation number (time variable) for EW, IW and EIW in each dataset and for the four tree comparison metrics. The trees of the last generation were pairwise compared among themselves and their mean was also included in the plot in order to account for the topological variation present in that tree sample (topological precision), which is expected to be larger under EW than under IW and EIW (Goloboff et al., 2008, 2018). In addition, boxplots were built using all of the values calculated for each generation (or a random subsample of 10 000 values if the number of comparisons exceeded that number), to the exclusion of the youngest generation, for the different weighting strategies. A principal components analysis (PCA) was used to ordinate the information of a matrix composed of the mean of the means of the values of each generation for the four comparative measures for each weighting strategy. Pairwise Euclidean distances, using the first two principal components, were calculated between each weighting strategy and the ideal values of each tree metric (i.e. RFn = 0, SPR = 1, DC = 1, GrComp = 0). These distances helped to identify the strategies that showed the most stable and similar topologies across generations. An additional PCA was conducted using the best tree comparison values calculated for EW within each generation and for each metric (EW*). The Euclidean distance to the ideal tree metrics of the mean of EW* for each generation was included in the previous plots to compare the performance of IW and EIW against the best scores calculated for EW.
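The ordination and distance calculation can be sketched as below. The study used the R function prcomp; this Python version is only an approximation written for illustration, and several points are assumptions: the ideal point is added as an extra row before the PCA, the variables are centred but not scaled, the principal axes are approximated by power iteration, and the metric values are invented rather than taken from the results.

```python
import math
import random


def leading_axes(cov, n_axes, iterations=500, seed=0):
    """Approximate the leading eigenvectors of a symmetric matrix
    by power iteration with deflation."""
    p = len(cov)
    rng = random.Random(seed)
    cov = [row[:] for row in cov]
    axes = []
    for _ in range(n_axes):
        v = [rng.random() for _ in range(p)]
        for _step in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-15:  # no variance left for a further axis
                v = [0.0] * p
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(p) for j in range(p))
        # Deflation: remove the variance explained by the axis just found.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(p)] for i in range(p)]
        axes.append(v)
    return axes


def pca_scores(rows, n_axes=2):
    """Centre the columns and project each row on the first principal axes."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    axes = leading_axes(cov, n_axes)
    return [[sum(c[j] * a[j] for j in range(p)) for a in axes] for c in centred]


def distances_to_ideal(strategies, ideal):
    """Euclidean distance, on the first two components, from each strategy to the ideal point."""
    names = list(strategies)
    scores = pca_scores([strategies[name] for name in names] + [ideal])
    target = scores[-1]
    return {name: math.dist(score, target) for name, score in zip(names, scores)}


# Invented means for (RF, SPR, DC, GrComp); the ideal values follow the text.
metrics = {
    "EW": (0.30, 0.70, 0.980, 0.30),
    "IW k=3": (0.25, 0.76, 0.984, 0.24),
    "IW k=20": (0.15, 0.85, 0.992, 0.14),
}
distances = distances_to_ideal(metrics, ideal=(0.0, 1.0, 1.0, 0.0))
for name, value in sorted(distances.items(), key=lambda item: item[1]):
    print(name, round(value, 3))
```

The strategy with the shortest distance is the one whose four metrics are jointly closest to the ideal values, which is how the best-performing k-values are identified in the Results.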
Character treatments whose results are topologically less disparate are more likely to show better tree comparison metrics than those with higher variability (note that tree precision is not necessarily correlated with tree accuracy; see e.g. O'Reilly et al., 2016). A higher variability will produce means more distant to the ideal comparison metrics and it is unusual for a generation to have trees more similar to those of the last generation than the last generation with itself. Thus, the topological variability of the sample of trees of the last generation will act as a minimum threshold for the tree comparison metrics throughout the genealogy, which will be higher in less variable tree samples. Analyses under EW are more likely to recover a larger number of different trees than IW and EIW (Goloboff et al., 2008). Thus, if the ordination of the dataset includes the generation values without taking into account the variability of the last generation, EW would have a potential disadvantage against IW and EIW. As a result, the PCAs were also conducted after subtracting the mean of the last generation from the mean of the previous generations [note that this was conducted for the tree comparison metrics whose best result is 0 (RFn and GrComp), whereas the '1-mean' of the last generation was added to the mean of the previous generation in those metrics whose best result is 1 (SPR and DC)]. Nevertheless, a higher variation of tree topologies also implies a higher ratio of wrong groups (as there is only one true phylogeny). As a consequence, the results obtained after subtracting the mean of the last generation should be interpreted in the light of how large the variation in that last generation was, mainly if there are cases in which EW performs slightly better.
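The correction for the variability of the last generation can be written compactly. The sketch below is a hypothetical helper, not the author's R script, and the generation mean of 0.30 in the first call is an invented value used only for illustration.

```python
BEST_IS_ZERO = {"RF", "GrComp"}   # lower values indicate more similar trees
BEST_IS_ONE = {"SPR", "DC"}       # higher values indicate more similar trees


def adjust_for_last_generation(metric, generation_mean, last_generation_mean):
    """Remove the within-sample variability of the last generation from a generation mean."""
    if metric in BEST_IS_ZERO:
        return generation_mean - last_generation_mean
    if metric in BEST_IS_ONE:
        return generation_mean + (1.0 - last_generation_mean)
    raise ValueError(f"unknown metric: {metric}")


# RF: an (invented) generation mean of 0.30 against a last-generation mean of 0.11.
print(round(adjust_for_last_generation("RF", 0.30, 0.11), 3))
# DC: a generation mean of 0.989 against a last-generation mean of 0.990.
print(round(adjust_for_last_generation("DC", 0.989, 0.990), 3))
```

After the adjustment, a generation that is as similar to the last generation as the last generation is to itself scores exactly the ideal value (0 or 1) for that metric.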
As a complementary analysis to rule out that the results were limited to a central tendency present among the MPTs, the worst metric value calculated for IW or EIW for each k-value was subtracted from the best metric value of EW in each generation. This aimed to account for the complete range of topological variation between methods throughout genealogies. The resultant value was subtracted from 1 in those cases where the optimal metric value is 0 (RF and GrComp), but not in those where the optimal metric is 1 (SPR and DC). Thus, the final value was <0 if EW showed a better performance than IW or EIW, >0 if IW or EIW showed a better performance, and equal to 0 if methods were tied. This procedure was repeated for the combination of the other three possible comparisons between extreme values, i.e. best IW/EIW value - best EW value, worst IW/EIW value - worst EW value, and best IW/EIW value - worst EW value, for each metric and each generation. The mean for all of the genealogy was calculated for each method and metric. Thus, these four means for each method establish the upper and lower limits of the range where EW performs better with respect to IW/EIW and the other way around. This dataset was ordinated with a PCA and the Euclidean distances to the ideal tree metrics for EW (i.e. -1, -1, -1, -1) and IW/EIW (i.e. 1, 1, 1, 1), respectively, were calculated for each method. These distances were used to compare the range of best performance across methods.
Histograms were built based on the comparison between the means of GrComp for different weighting strategies after subtracting the mean of the last generation. The highest frequencies on the negative or positive sides of the plot determine which weighting strategy outperforms the other and values more distant from 0 denote a larger ratio of group similarities in favour of that treatment. Other histograms were built based on the consistency index (Kluge and Farris, 1969) and the steps of homoplasy of each character (i.e. optimized length of the character - minimum possible length of the character for active terminals; TNT code: 'length[0 n]-minstepsact[n]', where 0 is the first MPT in memory and n is the character number). The consistency index and steps of homoplasy were calculated on a sample of 1000 previously randomized MPTs recovered in the last generation of each genealogy under EW. Finally, histograms were built for the distribution of the fit of each character with the k that retrieved the overall best performance through the generations and also k = 3, 9 and 30. The fit was also calculated on a sample of 1000 previously randomized MPTs, but found under IW with the respective k-value for which the fit is being calculated. TNT decomposes additive characters into binary characters under IW and the fit should be calculated for each bin. Thus, character fit was calculated only for non-additive characters because the calculation of the fit for each bin is time consuming. The fit of each non-additive and active character was calculated with the following TNT code, in which the argument %1 is the number of the selected k-value: The histograms were built using the mean of each non-additive character in all cases. All of the graphics (with the exception of Fig. 1), PCA (R function prcomp) and linear regressions were conducted in the software environment R 4.2.1 (R Development Core Team, 2022). p-Values <0.05 have been considered statistically significant. The R and TNT scripts to conduct these graphics and analyses are provided as Data S1.
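The per-character quantities behind these histograms can be illustrated as follows. This is a minimal Python sketch rather than the TNT or R scripts of Data S1: it assumes the usual definitions, in which the steps of homoplasy are the optimized length minus the minimum possible length and the consistency index of a character is the minimum possible length divided by the optimized length. The function names and the toy lengths are hypothetical.

```python
from statistics import mean


def homoplasy_steps(length, min_steps):
    """Extra steps of a character on a tree: optimized length minus minimum length."""
    return length - min_steps


def consistency_index(length, min_steps):
    """Consistency index of a variable character: minimum length over optimized length."""
    if length <= 0:
        raise ValueError("the character must have at least one step on the tree")
    return min_steps / length


def mean_per_character(lengths_by_tree, min_steps, statistic):
    """Average a per-character statistic over a sample of MPTs.

    lengths_by_tree: one list per tree with the optimized length of each character.
    min_steps: minimum possible length of each character.
    """
    return [
        mean(statistic(tree[c], min_steps[c]) for tree in lengths_by_tree)
        for c in range(len(min_steps))
    ]


# Toy sample: two MPTs and three characters.
lengths = [[1, 3, 4], [1, 2, 4]]
minimum = [1, 1, 2]

print(mean_per_character(lengths, minimum, homoplasy_steps))
print([round(ci, 3) for ci in mean_per_character(lengths, minimum, consistency_index)])
```

Averaging over the sampled MPTs gives one value per character, which is the quantity plotted in the histograms.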

Results
The Complete Archosauromorph Tree Project genealogy
Normalized Robinson-Foulds distances. The mean RF values under EW show a rather constant behaviour through the generations, with the exception of a conspicuous peak in generation 4 (Fig. 2a). The mean RF values calculated for the trees found under IW and EIW in generations 1-8 are all distinctly lower (i.e. more similar trees to the youngest generation) than those for EW. In generation 9 there are more disparate results, with mean RF values under IW k = 3-8, 22 and 23 and EIW k = 3-10 and 18 showing more different topologies from their last generation than in the trees found under EW. In the last generation, the mean RF values of the IW and EIW strategies are 0.07, whereas the mean RF value for EW is 0.11. This denotes a larger variation among the EW trees of the last generation with respect to those found under IW/EIW. In the boxplots, the median RF and interquartile range of the analyses under IW k = 12-21 and 24-30, and EIW k = 19, 20, 22-27, 29 and 30, show lower values than in the analyses under EW (Fig. 3a). In particular, the lowest medians and interquartile ranges are those for IW k = 20, 23 and 27 and EIW k = 19, 23, 24 and 26.
Subtree pruning and regrafting moves. The mean SPR values show a similar behaviour throughout the genealogy to that described above for RF. EW has rather stable mean SPR values through generations 1-7, but there is a peak in generations 8 and 9 (Fig. 2b). IW and EIW show in the vast majority of cases higher mean SPR values (i.e. more similar to the trees of the youngest generation) than EW throughout the genealogy, but this is reversed in generation 9 for IW k = 3-7, 9, 22, 23 and EIW k = 3-10 and 18. The pattern of the last generation is similar to that of the means of RF, with the presence of low variation and good performance for all of the strategies that downweight character homoplasy, whereas EW shows a relatively worse mean. In particular, the mean for EW in the last generation is lower than those of generations 8 and 9, indicating an increase of the variation of the tree topologies towards the end of the genealogy that does not occur under IW and EIW. In the boxplots, all of the interquartile ranges for IW k = 12-21 and 24-30, and EIW k = 11, 15 and 19-30, are higher than the ranges of EW (Fig. 3b).
Distortion coefficient. The mean DC values show a behaviour throughout the generations very similar to those of the means of SPR for EW, IW and EIW, including higher values (i.e. trees more similar to those of the last generation) for IW and EIW than for EW throughout generations 1-8, and IW k = 8-21 and 24-30, and EIW k = 11-17 and 19-30, for generation 9 (Fig. 2c). The DC values of the character downweighting strategies all converge to 0.993-0.994 in the last generation, whereas the mean of DC for EW is lower (0.990), but very similar to or overlapping those calculated for EW in previous generations (0.989-0.990). The boxplots show the highest medians and interquartile ranges for IW k = 16, 18-21 and 24-30, and EIW k = 19-30, which are all considerably higher than for EW (Fig. 3c).

Group comparisons. The means of group comparisons (GrComp) recovered for IW and EIW for all k-values are lower (i.e. higher sum of group similarities per number of nodes) than those for EW throughout generations 1-8 (Fig. 2d). In generation 9, the mean GrComp values calculated for IW and EIW are more diverse, being slightly higher to slightly lower than those found for EW (ca. 0.301). Mean GrComp values converge to 0.05 for all IW and EIW k-values in the last generation, whereas the mean GrComp for EW is slightly higher (0.07). The boxplots show that the lowest median of GrComp is for EW, but at the same time it has multiple outliers with very high GrComp values that are larger than any of those calculated for the strategies that downweight homoplasy. The interquartile lower limits of IW k = 17-30 and EIW k = 26-30 are slightly lower than those of EW (Fig. 3d). The histograms that compare the means of GrComp show a clear outperformance of IW and EIW over EW either when the mean GrComp of the last generation is subtracted or not from that of previous generations (Fig. S1). When equivalent histograms are plotted between the means of GrComp IW k = 20 and other IW and EIW k-values, the former outperforms the others (Fig. S2).
Ordinated data. The PCA shows a cluster of IW and EIW strategies that distinctly departs from those of EW and is closer to the ideal tree metrics (Fig. 4a). The lowest Euclidean distances to the ideal tree metrics are recovered for a k-value of 20 in both IW and EIW, whereas the distance between the ideal tree metrics and EW is considerably longer than that for any k-value for both IW and EIW (Fig. 4c,d). In particular, the Euclidean distance calculated with the best metrics found for EW in each generation (EW*) is still larger than those of the IW and EIW strategies with the best performance. There is a pattern in which the treatments that more strongly penalize homoplasy (k = 3-18) are those that generally perform worst for both IW and EIW. The PCA without subtracting the means of the last generation shows extremely similar results, but, as expected, the Euclidean distance from EW and IW/EIW to the ideal tree metrics increases (Fig. S3). When the extreme values for each generation are compared, IW and EIW show lower ranges of Euclidean distances, and hence the best performance, compared with EW for all k-values with the exclusion of IW k = 4 and EIW k = 3 and 9 (Fig. 4e,f). The performances of IW and EIW are extremely similar to each other for this genealogy of data matrices.
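As a minimal sketch of how a distance to the ideal tree metrics can be obtained, the Python snippet below standardizes four comparison metrics and measures the Euclidean distance of each weighting strategy to an ideal point. Several elements are assumptions made only for illustration and are not taken from the analyses reported here: the ideal point (RF = 0, SPR = 1, DC = 1, GrComp = 0), the metric values, the function name `distances_to_ideal`, and the use of standardized metrics in place of principal component scores. Because a PCA is an orthogonal rotation, both approaches yield the same distances only when all components are retained.

```python
import math
import statistics

# Assumed ideal point: identical trees would give RF = 0, SPR = 1, DC = 1, GrComp = 0.
IDEAL = (0.0, 1.0, 1.0, 0.0)

def distances_to_ideal(metrics):
    """Return the Euclidean distance of each strategy to the ideal point.

    `metrics` maps a strategy name to a tuple (RF, SPR, DC, GrComp).
    Each metric is z-scored across all strategies plus the ideal point.
    """
    rows = list(metrics.values()) + [IDEAL]
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Guard against a constant column, which would have zero spread.
    spreads = [statistics.pstdev(col) or 1.0 for col in columns]

    def standardize(row):
        return [(v - m) / s for v, m, s in zip(row, means, spreads)]

    ideal_z = standardize(IDEAL)
    return {name: math.dist(standardize(row), ideal_z)
            for name, row in metrics.items()}

# Hypothetical mean metrics for three strategies (not values from this study).
example = {
    "EW": (0.30, 0.80, 0.95, 0.30),
    "IW k=20": (0.15, 0.92, 0.99, 0.10),
    "EIW k=20": (0.16, 0.91, 0.99, 0.11),
}
distances = distances_to_ideal(example)
closest = min(distances, key=distances.get)
```

In this toy input the strategy with the smallest distance is the one whose metrics are closest to the ideal point in every dimension, which mirrors how the lowest Euclidean distances are read in the ordination plots.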

The Theropod Working Group matrix genealogy
Normalized Robinson-Foulds distances. The mean RF values calculated for the topologies recovered under IW and EIW are all distinctly lower than those for EW, with the exception of a higher mean RF for EIW k = 3 and 4 in generation 12 (Fig. 5a). Also, the mean RF distances of EW show considerably more disparate values throughout the generations than under IW and EIW, which are more stable. In particular, EW has two distinct peaks of mean RF values in generations 4 and 5 and 9-11, respectively. In the last generation, the mean RF values of IW and EIW converge to a value close to 0.15, showing some degree of variation among the MPTs. However, a higher mean (0.18), depicting more variation among trees, is calculated for the last generation of EW. In the boxplots, all of the medians of the IW and EIW methods (with the exception of EIW k = 3) are lower than that of the analysis under EW (Fig. 6a), and the lowest medians are those of IW k = 10, 26, 27 and 29 and EIW k = 20 and 23. Similarly, the upper limit of the interquartile range is lower than the lower limit of EW in most of the analyses that downweight homoplasy to the exclusion of IW k = 3-5, 7, 17 and 25 and EIW k = 3-10 and 12-19.
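To make the RF values reported above easier to interpret, the following Python sketch computes a Robinson-Foulds distance normalized by the total number of groups in the two trees. It assumes that each tree is represented as a set of clades (each clade a frozenset of terminal names); the function name `normalized_rf` and the toy trees are hypothetical, and the exact normalization used by the software employed in this study may differ in detail.

```python
def normalized_rf(groups_a, groups_b):
    """Robinson-Foulds distance divided by the total number of groups.

    Each argument is a collection of clades, each clade a frozenset of
    terminal names. Returns 0.0 for identical sets of groups and 1.0
    when the two trees share no group.
    """
    a, b = set(groups_a), set(groups_b)
    total = len(a) + len(b)
    if total == 0:
        return 0.0  # two fully unresolved trees are treated as identical
    return len(a ^ b) / total  # groups present in only one of the trees

# Hypothetical trees: ((A,B),C),(D,E) versus ((A,B),D),(C,E)
tree_1 = {frozenset("AB"), frozenset("ABC"), frozenset("DE")}
tree_2 = {frozenset("AB"), frozenset("ABD"), frozenset("CE")}
rf_value = normalized_rf(tree_1, tree_2)
```

Here the two trees share only the group (A, B), so four of the six groups are unique to one tree and the normalized distance is 4/6.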
Subtree pruning and regrafting moves and distortion coefficient. The mean SPR and DC values show very similar behaviours throughout the genealogy and are reported together. Both metrics also have similar patterns to that of the means of RF, in which the trees recovered under IW and EIW are in the vast majority of cases more similar to those of their respective last generation and the treatments have more stable results through time than in EW (Fig. 5b). Exceptions of slightly lower values than those of EW occur only in a few generations under IW and EIW k ≤ 9 (Fig. 5b). For both metrics in generation 11 EW has a conspicuous valley. The mean SPR and DC of all of the strategies that downweight character homoplasy converge at distinctly higher values than those of the last generation of EW. The highest medians and upper limits of the interquartile ranges are those found under IW k = 8-11, 14-17 and 26-30 and EIW k = 20, 21 and 23 for SPR and IW k = 29 and EIW k = 23 for DC (Fig. 6b). In all of these cases the complete interquartile ranges are higher than that of EW.
Group comparisons. The mean GrComp values are in the vast majority of cases lower under IW and EIW than under EW, but there are more exceptions than in the previous tree comparison metrics, most of them for IW k ≤ 20 in generations 1-3, 6-8 and 12 (Fig. 5d). Nevertheless, the mean GrComp values under EW are always higher than those under IW k = 21-30 and EIW k = 9-30. Differences among the means of GrComp are particularly conspicuous in generations 4, 5 and 9, in which the values calculated under EW are at least twofold those under IW and EIW, which are very close to 0.1 for all k-values. The mean GrComp values of the last generation are rather similar between strategies: 0.088 under IW, 0.090 under EIW and 0.105 under EW (Fig. 5d). The boxplots show that the lowest medians of GrComp are those for IW k = 14-16, 21, 25-27, 29 and 30 and EIW k = 20, 21 and 23. The interquartile ranges overlap among all of the strategies, but the upper whisker of EW is considerably higher than those of all of the strategies with character homoplasy penalization (Fig. 6d). The histograms of the differences between the mean GrComp values show that the treatments that downweight homoplasy clearly outperform EW in all cases, even after subtracting the mean of the last generation (Fig. S4). Among the GrComp values calculated under character downweighting, EIW k = 5 outperforms treatments with other k-values in the vast majority of cases (Fig. S5), with the exception of EIW k = 8, which recovered lower mean GrComp values (although by a very minor difference) in 7 of the 12 generations (Fig. S5f). The lowest mean GrComp values were calculated for IW k = 15 and EIW k = 5, respectively (Fig. 7b, Fig. S6b). EIW shows generally lower mean GrComp values than IW, but this difference is rather marginal (≤0.016).
Ordinated data. The first two components of the PCA show three distinct groups, composed of (i) EW, (ii) IW k = 3-5 and EIW k = 3 and 4 and (iii) all of the other IW and EIW treatments and the ideal tree metrics (Fig. 7a). When the means of the last generations are not considered, all of the IW and EIW treatments are clustered close to each other, whereas EW and the ideal tree metrics are strongly separated from each other and also from the cluster formed by character homoplasy penalization strategies (Fig. S6a). The Euclidean distance of EW to the ideal tree metrics is about twice that for any k-value under both IW and EIW. Among the character downweighting treatments, the largest Euclidean distances are those for IW k = 3-5 and EIW k = 3 and 4 regardless of whether the means of the last generation are considered or not (Fig. 7c,d, Fig. S6d). The lowest Euclidean distance for EIW is that for k = 5 when the mean of the last generation is considered (Fig. 7d) and for k = 8 when it is not (Fig. S6d). The Euclidean distances calculated from the extreme values of each generation show at least a partial overlap between all character weighting strategies to the exclusion of the comparisons between EIW k = 6 and 7 vs. EW (Fig. 7e,f). However, this overlap with the best EW values is minimal for IW k = 3-5 and for all EIW k-values. The lower ranges of the Euclidean distances calculated for the best values of all of the strategies that downweight homoplasy are distinctly lower than those for the best EW values.

"Tawa" matrix genealogy
Normalized Robinson-Foulds distances. The RF of the analyses under EW has an overall trend towards lower values through the generations, but this is interrupted by a peak along generations 4 and 5 (Fig. 8a). The mean RF values for IW k = 3 and 4 and EIW k = 3-6 are higher than or similar to those calculated under EW in most generations (Fig. 8a). In contrast, the mean RF values for IW k = 5-30 and EIW k = 7-30 show better performance and share a similar pattern, in which the values are lower than those under EW in generations 1-4, similar in generations 5 and 7, and higher in generation 6. The mean RF for the different k-values converge at a value of 0.040 under IW and 0.16 under EIW in the last generation, whereas the mean RF value is intermediate for EW (0.025). Thus, there is a higher variability among the IW trees of the last generation than in those found under EW and, especially, EIW. The boxplots show that the medians of RF under IW k = 5-30 are all considerably lower than that of EW, but with overlapping interquartile ranges (Fig. 9a). The strategies with the worst performances are IW k = 3 and 4 and the former is the only strategy with an interquartile range that does not overlap others. A very similar pattern occurs under EIW, but the worst performances are extended between k = 3-6, in which the interquartile ranges of the two former k-values do not overlap those of the other strategies. When the boxplots of IW and EIW are compared between each
Subtree pruning and regrafting moves. The mean SPR under EW has an overall trend towards higher values as the genealogy progresses in time, but this is interrupted by a deep valley of values along generations 4-6 (Fig. 8b). In most cases, IW k ≥ 5 and EIW k ≥ 7 have higher mean SPR values than those under EW in generations 1-4 and 7. In particular, the strategies with k = 5-9 also outperform EW in generation 5. In contrast, the strategies that strongly penalize homoplasy (IW k < 5 and EIW k < 7) have lower mean SPR values than EW throughout the genealogy. In the last generation, all of the IW strategies converge to a mean SPR value of 0.968 and EIW strategies to a value of 0.989. As occurred in the RF, EW shows an intermediate value to those of IW and EIW in the last generation (0.981) (Fig. 8b). The boxplots show that the medians of SPR under IW are higher than that under EW, with the exception of IW k = 3, 4 and 9 (Fig. 9b). The analyses under IW k = 10, 21, 22 and 25 have the highest medians and the interquartile ranges of all of the strategies that penalize homoplasy overlap that of EW to the exclusion of IW k = 3. The boxplots of EIW show a similar pattern to those of IW, but analyses with k = 3-6 show a worse or similar performance to EW (Fig. 9b). All of the analyses under EIW with higher k-values show very high medians, very similar to those of the IW strategies with best performance, but in most cases with higher upper interquartile ranges.
Distortion coefficient. The median of DC under EW only outperforms those for IW k = 3 and EIW k = 3 and 4, whereas all of the other strategies show distinctly higher medians (Fig. 9c). In the biplot, the behaviour of DC under EW is very similar to that of SPR for the same strategy (Fig. 8c). Across most of the genealogy IW k = 3 and 4 and EIW k = 3-6 show a lower or similar mean DC than EW. An overall different pattern emerges under IW k = 5-30 and EIW k = 7-30, in which generations 1-5 and 7 have higher values and generation 6 a distinctly lower value than EW. As in the previous metrics, the mean DC for EW is an intermediate value between those for IW and EIW in the last generation (Fig. 8c).
Group comparisons. The means of GrComp are very high (i.e. ca. 0.20-0.55) under EW, IW k = 3-4, and EIW k = 3-6 in generations 1-5 (Fig. 8d). In particular, the means of GrComp of these strategies that downweight homoplasy are considerably higher than those under EW in generations 2-4. In contrast, the mean GrComp values are distinctly lower (<0.10) for IW k = 5-30 and EIW k = 7-30 in generations 1-4. In generation 5, all of the treatments converge around a value of 0.20-0.30 and subsequently drop towards generation 7. All of the treatments that downweight homoplasy show higher mean GrComp values than under EW in generation 6, whereas all of the strategies converge to a value around 0.1 in generation 7. The median of GrComp for EW is lower than that for IW k = 3 and EIW k = 3 and 4, only slightly higher than that of IW k = 4 and EIW k = 5 and 6, and considerably higher than all of the other character downweighting strategies (Fig. 9d). The lowest medians are those recovered for IW k = 10, 21, 22 and 25 and EIW k = 11 and 16-18, but the lowest lower limit of the interquartile ranges is recovered for IW k = 6 and EIW k = 8-10.
The histograms show that trees found under IW and EIW have a distinctly higher number of generations in which their mean GrComp values outperform those under EW, with the exception of IW k = 3 (Fig. S7). The overall mean of GrComp reaches the lowest value for IW k = 6 and EIW k = 8, respectively, in which the IW strategy has marginally better metrics than EIW (Figs 9 and 10b). This overall mean ratio is considerably higher for EW (ca. 0.24 vs. <0.07), but even worse values are those for IW k = 3 and 4 and EIW k = 3-6 (>0.32). The frequencies of mean GrComp values found under IW k = 6 clearly outperform those of IW k = 3 and 4 (Fig. S8a,b) and have only marginally better performances when compared with those of IW k = 5 and 7 and EIW k = 7-11 (Fig. S8c-i).
Ordinated data. The PCA generates three groups in the first two components (in decreasing order of Euclidean distance to the ideal tree metrics): (i) IW k = 3 and 4 and EIW k = 3-6; (ii) EW; and (iii) all of the other IW and EIW strategies (Fig. 10a,c,d). This same pattern also occurs when the means of the last generation are not subtracted from the data (Fig. S9a,c,d). The Euclidean distances reach valleys under IW k = 5 and 6 and EIW k = 7-11, in which the lowest distances are those of IW k = 6 and EIW k = 11, respectively. The lowest IW Euclidean distances are only marginally lower than those under EIW and this difference decreases even more if the means of the last generation are not considered (Fig. 10, Fig. S9). The Euclidean distance calculated from the best EW metrics (EW*) is higher than those of IW k = 5-30 and EIW k = 7-30. When the extreme values for each generation are compared, IW and EIW show considerably lower ranges of Euclidean distances, and hence better performance, than EW for all k-values to the exclusion of IW k = 3 and 4 and EIW k = 3-6 (Fig. 10e,f).

Odontoceti matrix genealogy
Normalized Robinson-Foulds distances. The mean RF values calculated under EW have their maximum in generation 1, reach their minimum in generation 2, and the values are intermediate in subsequent generations with a mild valley in generation 5 (Fig. 11a). The mean RF values for IW k = 3-4, 7-11, 13, 14 and 16 and EIW k = 4-11, 13 and 15 have considerably lower values through most of the generations than EW. In contrast, higher or similar mean RF values occur in the strategies with other k-values. The mean RF values of all of the strategies that downweight homoplasy converge at a value of 0 in the last generation, indicating the presence of all identical trees, whereas the mean RF is considerably higher in EW (0.16), indicating a considerable diversity of trees at the end of this genealogical line. In the boxplots, IW k = 3, 4, 7-11, 13, 14, 16 and 20-23 and EIW k = 4-11, 13, 15 and 17-24 have distinctly lower medians than that of EW (Fig. 12a), whereas the only strategies with interquartile ranges that are completely lower than that of EW are those for IW k = 4, 8 and 9 and EIW k = 4-11. All of the other strategies have higher medians and interquartile ranges that overlap or exceed those of EW.
Subtree pruning and regrafting moves and distortion coefficient. The behaviours of the mean SPR and DC values are very similar to each other in this genealogy and, as a result, they are reported together (Fig. 11b,c). The lowest and highest mean values for EW are those of generations 1 and 2, respectively. Subsequently, these values are The boxplots show that IW k = 4, 7-11, 13, 14 and 16 and EIW k = 4-11, 13 and 15 have SPR and DC medians close to 1 and distinctly higher than that for EW (Fig. 12b,c). In particular, the complete interquartile ranges of IW k = 4, 8 and 9 and EIW k = 4-11 are higher than that of EW.

Group comparisons. The mean GrComp values of EW show a low slope towards lower values throughout the genealogy (Fig. 11d). In the vast majority of generations IW k = 3, 4, 7-11, 13, 14 and 16 and EIW k = 4-11, 13 and 15 have mean values that are distinctly lower than those under EW. The mean of EW in the last generation is 0.009, whereas this value converges to 0 in all of the strategies that downweight homoplasy. The medians and the complete interquartile ranges of IW k = 4, 8 and 9 and EIW k = 4-11 are distinctly lower than those of EW (Fig. 12d). In the histograms, it is clear that all of the mean GrComp values of the treatments that penalize character homoplasy outperform that of EW (Fig. S10). The mean GrComp values of best performance are recovered for IW k = 4 and EIW k = 7, respectively, in which the latter is the lowest, but marginally, among all of the strategies (Fig. 13b). Indeed, the histograms between IW and EIW treatments show that the difference between the mean GrComp values per generation is extremely small in most cases (Fig. S11). The same pattern remains when the mean of the last generation is not subtracted from the previous generations (Fig. S12).
Ordinated data. The PCA shows two distinct groups among the treatments that downweight homoplasy (Fig. 13a,c,d, Fig. S12a,c,d). When the mean of the last generation is subtracted EW is positioned slightly closer to the group with values more similar to the ideal tree metrics (Fig. 13a,c,d), but EW falls within the group with worst metrics when the last generation is not considered (Fig. S12a,c,d). The Euclidean distances have a distinct valley in IW and EIW k = 4-11, but there are distinctly high outlier distances for IW k = 5 and 6. The Euclidean distances become gradually larger towards higher k-values in both IW and EIW, but there is a considerable variation of results in the range of k = 11-17. The Euclidean distance calculated from the best EW metrics (EW*) is distinctly larger than those of the k-values with valleys of best performance (Fig. 13c,d). Throughout the range of studied k-values IW and EIW show very similar Euclidean distances. The Euclidean distance ranges calculated from the most extreme metric values show that IW k = 3, 4, 7-11, 13, 14 and 16 and EIW k = 4-11, 13 and 15 distinctly outperform EW, whereas the opposite occurs for the remaining strategies (Fig. 13e,f).
the genealogy. These values are in almost all cases and generations considerably higher under IW and EIW than under EW, in which the only exceptions are marginally lower values under EIW k = 3, 4 and 6 in generation 7 (Fig. 14b). There is a clear tendency under IW and EIW of an increase of the mean SPR values from generation 1 to 6, but there is a clear drop in generation 7. The only exception for this overall pattern is under EIW k = 5, in which all of the mean SPR values remain almost constant between generations 1-7 (0.827-0.829). As occurred for RF, there is a remarkable difference between the values of EW (0.732) and IW/EIW (0.983) in the last generation. All of the medians and most interquartile ranges are higher under IW and EIW than those calculated under EW, in which the only exceptions are slightly overlapping ranges in IW k = 3, 13, 17, 22, 23, 26 and 30 and EIW k = 3, 4, 6, 10, 23 and 26-29 (Fig. 15b).
Distortion coefficient. The overall pattern of the mean DC values throughout the generations resembles that of SPR, but the IW and EIW values are considerably closer to those calculated under EW (Fig. 14c). Through generations 1-3 IW and EIW show similar mean DC values, there is a valley of values in generation 4, they increase considerably in generation 6 to reach a peak, and subsequently decrease conspicuously to reach an overall valley in generation 7. There are multiple IW and EIW strategies that have lower mean DC values than EW in generations 1, 4 and 7, but all of the analyses with the different k-values have higher mean DC values than EW in generations 2, 3, and 6. There is a very large difference between the IW/EIW mean DC values and the lower EW value in the last generation (Fig. 14c). All of the medians calculated under IW and EIW are higher than that of EW to the exclusion of EIW k = 8 (Fig. 15c). Only IW k = 4 and EIW k = 5 have complete interquartile ranges higher than that of EW.
Group comparisons. The mean GrComp values calculated for the trees found under EW are in the vast majority of cases considerably lower (i.e. a higher proportion of group similarities per node) than those calculated under IW and EIW (Fig. 14d). In general, there is a trend towards lower mean GrComp values between generations 1 and 6 under IW and EIW, but this is interrupted by a peak of values under IW k = 6-30 and EIW k = 6-30 in generation 5. In generation 6, the mean of GrComp for all of the IW and EIW strategies converge to very low values (0.03-0.28). Subsequently, all of the IW and EIW strategies increase their mean GrComp values in generation 7, reaching levels similar to those of generations 1 and 2. The mean GrComp value for EW is distinctly higher than those of all of the IW and EIW treatments in the last generation. The medians and interquartile ranges of all of the IW and EIW strategies, with the exception of IW k = 3 and 4 and EIW k = 3-5, are distinctly higher than those calculated for the trees recovered under EW (Fig. 15d). In particular, the interquartile range of EIW k = 5 is the only one that is entirely lower than that of EW among the strategies that penalize homoplasy. In agreement with these results, the mean GrComp values of EW outperform those calculated for each generation under IW and EIW (Figs S13 and S14). The strategies that more strongly downweight homoplasy are those that recovered lower GrComp values (e.g. IW k = 3 and 4 and EIW k = 3-5) for the pairwise comparisons between generations (Fig. 15d) or for the overall mean of the genealogy (Fig. 16b, Fig. S15b). If the mean of the last generation is subtracted from previous generations, the mean of GrComp for EW clearly outperforms all of the IW and EIW strategies; the only exception is EIW k = 5 when the mean of the last generation is not considered (Fig. S15b).
Ordinated data. The PCA shows three distinct clusters in the first two components, one represented by EW, another by EIW k = 5, and the last one composed of all of the other IW and EIW strategies (Fig. 16a). This pattern is recovered whether or not the means of the last generation are subtracted (Fig. S15a). The Euclidean distances to the ideal tree metrics are considerably longer in all of the IW and EIW treatments than under EW and EW* when the mean of the last generation is considered (Fig. 16c,d). If the mean of the last generation is not considered, IW k = 3 and 4 and EIW k = 3-5 show lower Euclidean distances than EW (Fig. S15c,d). The Euclidean distances have a valley at IW k = 3 and 4 and EIW k = 3-5, in which EIW k = 5 is the character downweighting treatment that reaches results that are closer to the ideal tree metrics (Fig. 16, Fig. S15). The Euclidean distance ranges generated from the extreme metric values show a broad overlap between the best EW, IW and EIW results (Fig. 16e,f). In particular, IW k = 3 and 4 and EIW k = 3-5 reach lower distances than EW, but EW has lower distances than all of the other strategies that more mildly penalize homoplasy.
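The contrast between strong and mild penalization of homoplasy can be illustrated with the standard implied weighting fit function, f = k / (k + h), where h is the number of extra steps of a character and k is the concavity constant. The Python sketch below is only an illustration: the function name `implied_weight_fit` is hypothetical, and the additional adjustments applied under EIW (e.g. for missing entries) are not modelled.

```python
def implied_weight_fit(extra_steps, k):
    """Fit of a character under implied weighting: k / (k + h).

    `extra_steps` (h) is the homoplasy of the character on a given tree
    and `k` is the concavity constant. A character without homoplasy has
    a fit of 1; the fit decreases as homoplasy increases.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if extra_steps < 0:
        raise ValueError("extra_steps cannot be negative")
    return k / (k + extra_steps)

# Fit of a character with three extra steps under different k-values.
fits = {k: implied_weight_fit(3, k) for k in (3, 12, 30)}
```

With three extra steps, the fit is 0.5 under k = 3 but about 0.91 under k = 30, so low k-values downweight homoplastic characters much more strongly than high k-values.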

Crocodylomorpha matrix genealogy
Normalized Robinson-Foulds distances. The mean RF values of EW remain stable during the first five generations; subsequently they increase gradually up to a peak in generation 10, and finally decrease towards the end of the genealogy (Fig. 17a). The mean RF values of the strategies that downweight character homoplasy are in most cases distinctly lower than those under EW, but generally follow the same pattern, with higher mean RF values around generations 9-14. Nevertheless, there are some strategies that present this peak displaced towards generations 3-7 (e.g. IW k = 8, 10, 12, 16). In the last generation, all of the character homoplasy treatments converge to similar RF values: EW = 0.094, IW = 0.081, and EIW = 0.086. All of the medians calculated for the results recovered under IW are lower than that of EW and the interquartile ranges are entirely lower than that of EW only for IW k = 5, 6 and 17-20 (Fig. 18a). In the case of the analyses under EIW, the interquartile ranges of all of the results with different k-values overlap that of EW, but the medians are in most cases lower in the strategies that downweight homoplasy.
Subtree pruning and regrafting moves and distortion coefficient. The SPR and DC are reported together because their behaviours are very similar throughout the genealogy (Fig. 17b,c). The mean values of both metrics for EW are approximately stable through the first five generations and they decrease to reach a valley in generation 11; subsequently, they increase again towards the end of the genealogy. As was the case for the RF values, the vast majority of the mean SPR and DC values under IW and EIW outperform those under EW, with a valley of values around generations 10-13. In a few cases, this valley is displaced earlier in the genealogy, around generations 3-7 (IW k = 8, 10, 12, 14, 16). In The medians calculated for the IW analyses are consistently higher than that for EW in both SPR and DC, but in the case of EIW, the exceptions are those with k = 4, 7, 8 and 10-20 for SPR and k = 4, 10-13, 17-20, 25 and 27 for DC (Fig. 18b,c). In particular, the interquartile ranges of SPR and DC are entirely higher than those of EW in IW k = 5, 6 and 17-20 and there is at least a partial overlap between EW and all of the EIW k-values.
Group comparisons. The behaviour of the mean GrComp values under EW resembles those of the other tree comparison metrics, with low, stable values in generations 1-5 and more suboptimal values around generations 6-14, in which the peak is reached at generation 11 (Fig. 17d). Also, GrComp resembles the other metrics in the presence of IW and EIW mean values that indicate that most trees of most generations are more similar to those of their respective last generation than under EW. Some of the IW and all of the EIW treatments have a peak of more suboptimal values around generations 9-14 (IW k = 3, 5, 7, 9, 11, 13, 15, 17, 19, 21-30), whereas this peak occurs around generations 2-7 in the remaining IW k-values. GrComp converges to relatively low, similar mean values under the different character homoplasy treatments (EW = 0.052, IW = 0.054, and EIW = 0.056). The median of GrComp calculated under EW is considerably higher than that of all of the IW k-values and most EIW strategies to the exclusion of k = 4, 7, 10 and 18-20 (Fig. 18d). The interquartile ranges of EW and the strategies that downweight homoplasy overlap in all cases, although marginally in IW k = 17 and 18.
The histograms comparing the mean GrComp values of the different strategies clearly show that IW and EIW outperform EW (Fig. S16). The mean GrComp values between the different IW and EIW treatments show variable results, in which there are several treatments that outperform others in seven or more generations, but with very low GrComp differences (<0.05; e.g. Fig. S17c,f-i). The means of the GrComp values through generations 1-14 show that IW k = 5 and 20 have the best performances with respect to generation 15 (Fig. 19b). In the case of EIW, the lowest mean GrComp value is recovered under k = 30, but this value is only marginally higher than the most optimal strategies under IW. This same pattern is recovered whether or not the mean of the last generation is considered (Fig. 19b, Fig. S18b).
Ordinated data. The PCA shows that the different treatments under IW and EIW are positioned considerably closer to the ideal tree metrics than EW, either after subtracting the means of the last generation or not (Fig. 19a, Fig. S18a). The Euclidean distances reach overall valleys under IW k = 5 and 6 and EIW k = 5-9, respectively (Fig. 19c,d, Fig. S18c,d). Between k = 17 and 20, IW has a second, suboptimal valley, which is absent under EIW. Compared with the lowest ones calculated under EIW, IW has marginally lower Euclidean distances. The Euclidean distance calculated for the best tree metrics of each generation under EW (i.e. EW*) is considerably larger than those of all character downweighting methods. The ranges of Euclidean distances built based on the extreme values of each generation show small overlaps between the best EW, IW and EIW values, to the exclusion of the non-overlapping ranges of IW k = 4 and 5 (Fig. 19e,f). Nevertheless, all of the ranges calculated for IW and EIW reach considerably smaller Euclidean distances to the ideal tree metrics than those for EW.

Equal weighting vs. differential downweighting
The tree comparison metrics clearly show that the results recovered under IW and EIW throughout the generations are, for the majority of k-values and with the exception of one genealogy (i.e. the Procolophonidae genealogy; see below), topologically more similar to the trees of their last generation, and more stable, than those recovered under EW. These results are particularly significant because, as expected from the frequent bias among palaeontologists of preferring EW to IW or EIW, the vast majority of these empirical matrices were originally analysed only under EW.
The analyses conducted here are based on empirical data and, as a result, the true phylogenetic relationships among the taxa are unknown. This contrasts with analyses based on simulations in which the results can be compared with an a priori known tree ('model tree'; e.g. Goloboff et al., 2008, 2018; Wright and Hillis, 2014; Congreve and Lamsdell, 2016; O'Reilly et al., 2016; Puttick et al., 2017). However, simulations cannot capture the highly complex variability intrinsic to empirical data and thus it is desirable to complement them with analyses on empirical datasets. Although the true phylogeny of non-simulated datasets is unknown, it is logical to assume that the matrices of the last generation of each genealogy are the most reliable descriptors of the similarities and differences among taxa, because they have undergone several iterative modifications that have improved their taxon and character sampling, as well as scoring accuracy and completeness (see Material and methods). In order to explore this assumption through the different genealogies analysed, GrComp was calculated between a random sample of 1000 trees found under EW in each generation and 1000 trees of those found in the last generation of the k-value with the best performance for that genealogy under IW or EIW (hereafter 'EW GrComp*'). As a comparative reference, the GrComp between the trees of the k-value with best performance under IW or EIW in each generation and those of its last generation was used as the standard. The results show that there is no case in which the mean of EW GrComp* increases throughout the genealogies and, conversely, these values decrease in the last generations of each genealogy (Fig. 20). There is no increase in the ratio of group dissimilarities between EW GrComp* and the character downweighting methods throughout the generations and the respective increase in the number of terminals of each matrix in each generation does not decrease the proportional number of group similarities. In contrast, divergent results between EW GrComp* and IW/EIW would indicate increasingly more conflictive phylogenetic signals and a decrease of the reliability of the matrices through time for at least one of the methods. Thus, the results recovered are not in conflict with the hypothesis that iterative modifications of the matrices produce more reliable phylogenetic datasets and, hence, more accurate results.
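The sampling step behind EW GrComp* can be sketched in Python as follows. This is a hedged illustration and not the procedure actually scripted for this study: it assumes that every tree of one random sample is compared against every tree of the other sample, it takes the dissimilarity measure as a user-supplied function (GrComp itself is not implemented here), and the names `mean_cross_dissimilarity` and `toy_dissimilarity` are hypothetical.

```python
import random
from itertools import product
from statistics import fmean

def mean_cross_dissimilarity(trees_a, trees_b, dissimilarity,
                             sample_size=1000, seed=0):
    """Mean dissimilarity between random samples of two lists of trees.

    Up to `sample_size` trees are drawn without replacement from each
    list, and `dissimilarity(a, b)` is averaged over all cross pairs.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample_a = rng.sample(trees_a, min(sample_size, len(trees_a)))
    sample_b = rng.sample(trees_b, min(sample_size, len(trees_b)))
    return fmean([dissimilarity(a, b)
                  for a, b in product(sample_a, sample_b)])

def toy_dissimilarity(tree_a, tree_b):
    """Stand-in measure: proportion of groups not shared by two trees."""
    total = len(tree_a) + len(tree_b)
    return len(tree_a ^ tree_b) / total if total else 0.0

# Hypothetical trees encoded as frozensets of group labels.
ew_trees = [frozenset({1, 2}), frozenset({1, 3})]
iw_last_trees = [frozenset({1, 2})]
mean_value = mean_cross_dissimilarity(ew_trees, iw_last_trees,
                                      toy_dissimilarity)
```

In the toy input one EW tree is identical to the reference tree (dissimilarity 0) and the other shares half of its groups (dissimilarity 0.5), giving a mean of 0.25.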
These graphics (Fig. 20) also show relevant information about the ability of EW to recover group similarities with respect to the trees of its own last generation and those found in the last generation under IW or EIW (with the k-value of best performance). In the case of the Tawa genealogy, the EW GrComp* shows a trend towards lower mean values as the generation number increases and these values are very similar to those under IW k = 6 in the last four generations (Fig. 20c). This shows a clear convergence of the tree topologies between EW and IW k = 6 as the matrices of the genealogy have been iteratively modified through the years. The biplot of GrComp of the Tawa genealogy (Fig. 8d) indicates that this convergence is probably driven by stronger phylogenetic signals in the last generations that favour certain groupings that were more prone to topological differences between different character homoplasy treatments in earlier generations (e.g. EW, IW k = 3, EIW k = 3 and 6). Regarding the CoArTreeP, TWiG and Crocodylomorpha genealogies, the patterns observed between the EW GrComp* and each character downweighting treatment (IW k = 20 for CoArTreeP, EIW k = 5 for TWiG and IW k = 5 for Crocodylomorpha) are similar. These patterns are characterized by closer mean values between the different character weighting strategies early and late in the genealogies, when there are smaller taxon samplings and when the matrices are based on supposedly more reliable data, respectively (Figs 2d, 5d, 17d, 20a,b,f).
The Odontoceti genealogy maintains a similar range of mean EW GrComp*, around 0.35-0.45, with respect to the trees found in the last generation under IW k = 4 (Fig. 20d). This range of mean GrComp values for EW is lower (around 0.17-0.32) when calculated against the trees found in the last generation under EW (Fig. 11d). The mean GrComp values for IW k = 4 are extremely low (<0.01) when calculated against the trees of the last generation of the same character homoplasy treatment (Fig. 20d). This is expected if we consider that the Euclidean distances with k-values of best performance and the mean of the means of GrComp for this k-value are very close to 0 (Fig. 13, Fig. S12). This is the genealogy with the strongest topological differences between EW and the character homoplasy treatments with best performance (generally those with a k-value range = 4-11; Fig. 13c, d) among those analysed here. However, it should be noted that the character homoplasy treatments that perform worse through the generations (generally those with a k ≥ 17; Fig. 13c, d) have higher mean GrComp values than EW (Fig. 13a, b). Indeed, these mean GrComp values are within the range of those calculated when EW is compared against the trees of the last generation of IW k = 4 (Figs 11d and 20d). These two distinct clusters of mean GrComp values, ≤0.105 and ≥0.23 respectively (Fig. 13b), seem to be the result of the structure intrinsic to the matrices of this genealogy. They differ from the other genealogies analysed here, as well as a large sample of other empirical datasets (Goloboff et al., 2018: fig. 1a), in the presence of an unusually long and relatively high tail of the distribution of homoplastic characters and this is probably the main driver of its distinct behaviour through the generations (Figs 21 and 22; see below).
As mentioned above, the Procolophonidae genealogy has the peculiarity that EW outperforms all of the character downweighting treatments tested here (Fig. 16, Fig. S15). However, the mean GrComp values calculated throughout the genealogy for EW against the latest generation of EIW k = 5 and EW, respectively, have an overall pattern and distance between them rather similar to those of the CoArTreeP, TWiG and Crocodylomorpha genealogies. The relatively good performance of EW in this genealogy seems to be a result of intrinsic factors that distinguish its matrices from those of the previous genealogies, including a worse adjustment of the homoplasy distribution to an exponential model, which is not completely skewed towards zero (Fig. 21c). An exponential distribution of homoplasy has been found to be the most common model among a large sample of empirical datasets and it makes sense in phylogeny because the branches are independent of each other after cladogenesis, following a Markov process (Goloboff et al., 2018). In contrast, the Procolophonidae genealogy has a non-exponential distribution of character homoplasy and this could contribute to a poor performance in the analyses under IW and EIW (see below, but this topic deserves a detailed analysis that goes beyond the scope of this study). This genealogy is peculiar in containing the smallest matrices analysed here, in terms of both taxonomic and character sampling, which is when IW is least advisable (Goloboff and Arias, 2019; Goloboff, 2022). It is particularly striking that the character homoplasy treatments that perform best in this genealogy are EW and IW and EIW with very low k-values, which are those that more strongly penalize homoplasy. The histogram of the frequency of character homoplasy of the Procolophonidae genealogy shows that its mean is very close to a single step of homoplasy (Fig. 21e) and 73% of the characters are non-homoplastic in the MPTs of the last generation. The presence of a high proportion of non-homoplastic characters in the Procolophonidae genealogy differs from the condition present in the other genealogies, mainly with respect to the larger datasets with well skewed exponential distributions of the homoplasy. These latter distributions, which are sound from an evolutionary point of view (Markov process), are probably favoured by the more common practice of the last decades of building character lists (mostly) irrespective of their expected homoplasy (Goloboff et al., 2018; contra Puttick et al., 2017).
The k-values of best performance strongly vary among the different genealogies studied here (Figs 4, 7, 10, 13, 16, 19, Figs S3, S6, S9, S12, S15, S18). This is in agreement with the notion that there is no generalized optimal k-value or k range that could be chosen a priori (or at least without considering matrix size; see below) at the time of analysing phylogenetic matrices with different characteristics (Goloboff et al., 2008, 2018). Also, the behaviour of the Euclidean distances among the range of k-values analysed here is not homogeneous through the datasets. In some genealogies (e.g. CoArTreeP, TWiG, Tawa), the k-values that more strongly penalize homoplasy (k = 3-5 or higher values in some genealogies) are those that recover the results that more distinctly differ from those of the last generations, but these low k-values generally outperform those that gradually more mildly penalize homoplasy in other genealogies (e.g. Procolophonidae, Odontoceti, Crocodylomorpha).
Beyond this variability of the ranges of k-values that perform best in the different genealogies, it is clear that the character treatments that penalize homoplasy show a better performance than EW in terms of topological similarity to the trees of the last generations in the analyses conducted here (with the exception of the Procolophonidae genealogy). These results indicate that, in five of the six genealogies composed of empirical palaeontological datasets, the analyses under IW or EIW (with all or most k-values) recovered tree topologies more similar to those found in the last, and most reliable, generation of the matrices earlier in the genealogy than the analyses under EW. This cannot be attributed to the fact that analyses under IW or EIW more frequently recover better resolved consensus trees than analyses under EW (Goloboff et al., 2008) because the pairwise comparisons were conducted directly between MPTs. Thus, if simulations that found that character downweighting methods recover a higher proportion of group similarities over dissimilarities than EW are taken into consideration ("a group found by implied but not equal weights is more likely to be correct than wrong, and a group found by equal but not implied weights is (slightly) more likely to be wrong than correct"; Goloboff et al., 2018: 425), the higher stability of the tree topologies throughout the generations can be interpreted as an additional positive outcome of IW and EIW over EW.
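As an illustration of what a pairwise comparison between two trees involves, the following minimal sketch computes a Robinson-Foulds distance normalized by the total number of groups. It assumes that each tree is represented simply as a set of rooted groups (clades), which is a simplification chosen here for illustration and not necessarily how the comparisons were implemented in this study; the function names and the two five-taxon trees are hypothetical.

```python
def clade(*taxa):
    """Represent a group as an immutable set of terminal names."""
    return frozenset(taxa)


def normalized_rf(groups_a, groups_b):
    """Robinson-Foulds distance divided by the total number of groups.

    Assumption: trees are given as collections of rooted groups; the
    distance counts groups present in one tree but not in the other.
    """
    a, b = set(groups_a), set(groups_b)
    total = len(a) + len(b)
    if total == 0:
        return 0.0  # two completely unresolved trees are identical
    return len(a ^ b) / total


# Hypothetical trees: (((A,B),C),(D,E)) and ((A,B),((C,D),E))
tree_1 = {clade("A", "B"), clade("A", "B", "C"), clade("D", "E")}
tree_2 = {clade("A", "B"), clade("C", "D"), clade("C", "D", "E")}

print(normalized_rf(tree_1, tree_2))  # four unshared groups out of six
print(normalized_rf(tree_1, tree_1))  # identical trees give 0.0
```

In this sketch a value of 0 means that both trees contain exactly the same groups and a value of 1 means that they share none.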

Implied weighting vs. extended implied weighting
The Euclidean distances and GrComp values recovered from the different genealogies are almost identical (CoArTreeP, Odontoceti and Tawa genealogies) or only marginally different (TWiG and Crocodylomorpha genealogies) between IW and EIW for their k-values with the best performance, with only the exception of the Procolophonidae genealogy. For example, the TWiG genealogy shows at first glance a distinctly lower mean GrComp under EIW k = 5 than for IW k = 15, but when the values are compared the difference between them is only 0.01 (Fig. 7b). A similar situation occurs in the Crocodylomorpha genealogy, in which the metrics calculated in the analyses under IW outperform those under EIW, but the differences in the values are <0.01 for mean GrComp and ca. 0.02 for the Euclidean distances. The differences between the values for the CoArTreeP, Odontoceti and Tawa genealogies are even smaller. Goloboff et al. (2018) found in simulations including missing entries that the trees recovered under EIW are generally closer to the model tree than those found under IW, including a trend to recover more correct and fewer incorrect groups. The analyses conducted here (based on empirical datasets) do not show an unambiguous result favouring a higher stability among tree topologies for either IW or EIW, at least in the limited sample of genealogies of this study. However, they show that EIW performs at least as well as IW in these empirical datasets. These results bolster the responses of Goloboff et al. (2018) to the criticisms raised against EIW by Congreve and Lamsdell (2016), which were based on mistaken ideas on the assignment of expected homoplasy to missing data and the treatment of inapplicable scorings as missing scorings (which can be treated as different types of scorings since the method was first implemented in TNT, Goloboff, 2014; missing and inapplicable data were treated differentially here). Indeed, the genealogies studied here have a high proportion of missing data, which is common among vertebrate palaeontology datasets (TWiG = 66%, CoArTreeP = 57%, Crocodylomorpha = 56%, Tawa = 47%, Procolophonidae = 42%, Odontoceti = 34%; proportions calculated for the last generation of each genealogy), and in addition, one of them includes a substantial proportion of inapplicable character states (CoArTreeP = 4.3% in the last generation). These are scenarios in which Congreve and Lamsdell (2016) considered that EIW would face methodological problems and would be expected to perform worse than IW. This is clearly not the case for these genealogies, in which both IW and EIW show extremely similar tree comparison metrics through the generations and in two of them (TWiG and Procolophonidae genealogies) EIW even performs marginally better in terms of topological similarity across time.
Implied weighting and matrix size
Goloboff et al. (2008) recovered poorer results under IW in very large molecular datasets than in smaller matrices. They interpreted that these poor performances occurred because of weighting too strongly against homoplasy in these large datasets rather than a more general problem inherent to character downweighting itself (Goloboff et al., 2018). Goloboff (1993) suggested that it was logical that matrices with different numbers of taxa required different k-values. The results of Goloboff et al. (2008) bolstered the possible relationship between reasonable ranges of k-values, steps of homoplasy and numbers of taxa, but this issue remains mostly unexplored. In the analyses conducted here, the IW and EIW k-values with the best performance are not concentrated around the same value or ranges of values among the different genealogies, but show considerable variation (Table 1). The k-values of the ranges that perform best in each of these genealogies, as well as the k-values that perform best as a whole in each genealogy, seem to increase with the number of terminals present in their latest generation. This is particularly clear in the case of the analyses under IW. In order to test this relationship, linear regressions were conducted between the mean of the optimal range of k-values around the value that recovered the lowest mean GrComp value vs. the number of terminals of the last generation and the mean of the overall optimal range of k-values (which may or may not include the k-value that retrieved the lowest mean GrComp value) vs. the number of terminals of the last generation. Each of these two regressions was conducted for IW and EIW, respectively.
The linear regression between the mean of the optimal IW k range (which included the lowest mean GrComp value) vs. the number of terminals resulted in a statistically significant positive correlation (y = 0.08501x - 0.34501, R² = 0.89 and p-value = 0.005; but the regression for the mean of the overall optimal range of k-values was marginally non-significant: y = 0.10853x - 1.12437, R² = 0.66 and p-value = 0.051). Regarding EIW, the regression between the mean of the overall optimal EIW k range vs. the number of terminals was statistically significant (y = 0.05597x + 2.80303, R² = 0.68 and p-value = 0.042; but the regression for the mean of the optimal range of k-values around the value that recovered the lowest mean GrComp value was not significant: y = 0.05833x + 6.19117, R² = 0.21 and p-value = 0.35) (Fig. 23). There is also a significant linear correlation between the IW k-values (for the range of k-values that includes the lowest mean GrComp value and the range that does not necessarily include it) and the number of characters of the last generation (R² = 0.85/0.80 and p-value = 0.01/0.02 in each regression, respectively), but not in the case of EIW (R² = 0.12/0.60 and p-value = 0.50/0.07 in each regression, respectively).
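The two statistically significant regressions can be evaluated directly for a matrix of a given size. The sketch below is only an illustration: the coefficients are those reported above, whereas the function name and the example numbers of terminals are chosen here. It returns the point estimate of the regression line, not the 95% confidence interval, which cannot be derived from the coefficients alone.

```python
# Coefficients of the two significant regressions reported in the text:
# IW: mean of the optimal k range that includes the lowest mean GrComp value.
IW_SLOPE, IW_INTERCEPT = 0.08501, -0.34501
# EIW: mean of the overall optimal k range.
EIW_SLOPE, EIW_INTERCEPT = 0.05597, 2.80303


def predicted_k(n_terminals, slope, intercept):
    """Point estimate of the mean optimal k for a matrix with n_terminals."""
    return slope * n_terminals + intercept


# Illustrative matrix sizes (number of terminals).
for n in (50, 116, 190):
    k_iw = predicted_k(n, IW_SLOPE, IW_INTERCEPT)
    k_eiw = predicted_k(n, EIW_SLOPE, EIW_INTERCEPT)
    print(f"{n} terminals: IW k ~ {k_iw:.1f}, EIW k ~ {k_eiw:.1f}")
```

Given the small number of genealogies behind these regressions, such estimates are best read as rough starting points for a range of k-values rather than as single recommended values.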
Linear regressions between the mean of the homoplasy distribution or its 95% quantile vs. the optimal k-values were not significant (R² ≤ 0.62 and p-value > 0.25). In addition, the maximum possible ratios for implied weights (i.e. the weight of the characters that are most penalized) in the last generation of each genealogy (which depend on the maximum possible number of steps among characters, which in turn depends on the number of taxa) were calculated for the k-value ranges of best performance (Table 1). There is a tendency for the characters to be penalized more strongly in the genealogies with larger taxon samplings, which is an expected result based on the presence of longer tails in the distribution of the steps of homoplasy of larger matrices (see below). However, the linear regression between these variables is not significant and has poor adjustment for IW (R² = 0.39 and p = 0.181) and is marginally non-significant for EIW (R² = 0.64 and p = 0.056). Nevertheless, it should be kept in mind that the poor adjustments found here could be related to the low number of genealogies under analysis. Future analyses with an increased sample may find a significant relationship between maximum implied weight ratios and matrix size. A significant relationship between maximum implied weight ratios and a matrix parameter could be useful to set the k-value indirectly with the TNT command piwe<r, in which the cost of adding a step to a homoplasy-free character is no more than r times (where r > 1) the cost of adding a step to a character with the maximum possible number of extra steps (Goloboff, 2022: 111).
The positive correlation between k-values and taxon sampling size (also for character sampling size, but only for IW) can be explained by a lower influence of homoplasy in increasingly large phylogenetic datasets. Indeed, the histograms showing the frequency of homoplasy in the characters show that those matrices with a larger number of terminals have longer and shallower tails, reaching more extreme, but considerably infrequent, steps of homoplasy (e.g. CoArTreeP, TWiG and Crocodylomorpha genealogies; Fig. 21). This is expected because a larger number of terminals increases the probability of sampling independent acquisitions of character states (Goloboff, 1991). If the weighting against homoplasy is too strong in these large datasets, the distribution of the fit of the characters is considerably less skewed and its mean moves towards lower fit values (Fig. 22, Figs S19 and S20). Thus, a very high proportion of the characters of the matrix are strongly downweighted; for example, under k = 3, 55% of the characters of the CoArTreeP genealogy, 28% of those of the TWiG genealogy, 27% of those of the Crocodylomorpha genealogy, 22% of those of the Tawa genealogy and 59% of those of the Odontoceti genealogy have a fit of <0.5 (i.e. their weight is reduced to less than half of that under EW; Figs S19 and S20). These strongly penalized characters are probably relevant to inform about the relationships of, mainly, the shallower branches of the trees and their strong downweighting is expected to contribute to the poor performance of these matrices under these very low k-values. On the other hand, low k-values result in a strong penalization of homoplasy in a proportionally lower frequency of characters in small matrices with shorter tails in the distribution of their steps of homoplasy; for example, under k = 3, only 3% of the characters of the Procolophonidae genealogy have a fit of <0.5. The different k-values of best performance result in histograms of the distribution of the fit of the characters that resemble each other among the different genealogies, i.e. skewed to the right and with a mean around 0.7-0.9, regardless of their number of terminals in the last generation (with the exception of the Odontoceti genealogy).
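The link between the concavity constant and the proportion of strongly downweighted characters can be sketched with the standard implied weighting fit function, f = k / (k + h), where h is the number of extra steps of a character (Goloboff, 1993). Under this function a character has a fit of <0.5 exactly when h > k. The homoplasy counts below are hypothetical and only mimic a long-tailed and a short-tailed distribution; they are not taken from any of the matrices analysed here, and the sketch ignores the adjustment for missing entries made under EIW.

```python
def implied_weight_fit(extra_steps, k):
    """Fit of a character with `extra_steps` of homoplasy: k / (k + h)."""
    return k / (k + extra_steps)


def proportion_below(homoplasy, k, threshold=0.5):
    """Proportion of characters whose fit is lower than `threshold`."""
    fits = [implied_weight_fit(h, k) for h in homoplasy]
    return sum(f < threshold for f in fits) / len(fits)


# Hypothetical extra steps per character (not real data).
long_tail = [0, 0, 1, 2, 4, 6, 9, 12, 15, 20]   # mimics a large matrix
short_tail = [0, 0, 0, 0, 0, 0, 0, 1, 1, 2]     # mimics a small matrix

for k in (3, 10, 20):
    print(k, proportion_below(long_tail, k), proportion_below(short_tail, k))
```

With these made-up counts, k = 3 pushes most of the long-tailed characters below a fit of 0.5 while leaving the short-tailed set untouched, which is the qualitative pattern described above.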
In addition, the proportion of characters with a fit of <0.5 is below 15% in all of these datasets excluding the Odontoceti genealogy (14% for Crocodylomorpha with k = 5; 3% for Tawa with k = 6; 1% for CoArTreeP with k = 20; 0% for TWiG with k = 15; and 1% for Procolophonidae with k = 4). In the particular case of the Odontoceti genealogy, 52% of the characters have a fit of <0.5 with k = 4. This seems to be a consequence of its homoplasy distribution, with a long tail with a very shallow slope of homoplastic characters (Fig. 21d), in which less than half (46%) of the characters have fewer than five steps of homoplasy. In contrast, 56-100% of the characters have fewer than five steps of homoplasy in all of the other genealogies. Puttick et al. (2019) concluded on the basis of their simulations that IW performs well in datasets with large numbers of consistent characters, but not when homoplasy is high [consistency index (CI) = 0.0-0.4] because it gives higher weight to homoplastic characters rather than consistent characters. However, Goloboff (2022) showed that CI values < 0.4 imply a higher homoplasy than that expected in a random dataset and that IW performs better in simulations with amounts of homoplasy lower than in random datasets (Goloboff, 2022: fig. 4.6). The results of the analyses conducted here based on empirical data do not support the conclusions of Puttick et al. (2019) either. The last generations of the genealogies analysed here and in which IW/EIW outperform EW have high homoplasy indices for individual characters (mean CI = 0.31 for CoArTreeP; mean CI = 0.38 for Odontoceti; mean CI = 0.47 for TWiG; mean CI = 0.48 for Crocodylomorpha; mean CI = 0.49 for Tawa). In contrast, the single genealogy in which EW outperformed strategies that downweight character homoplasy has a considerably higher ratio of consistent characters (mean CI = 0.64 for Procolophonidae). Thus, it seems that the poor performance of IW in the results of Puttick et al. (2019) is due more to their rescaling of the RF distance and the use of a very low k = 2 (they also used k = 10 and 20, but pooled together all of the results recovered under IW) for degrees of homoplasy worse than those expected at random, than to a general problem of IW.
The significant positive correlation recovered here between the mean of the optimal k-values and matrix size could be useful to inform the choice of the range of k-values to be used in analyses under IW and EIW. Previous studies have shown a positive relationship between dataset size and homoplasy that is primarily caused by an increase in the number of taxa and not by an increase in the number of characters (Goloboff, 1991). Thus, the linear models that considered the number of terminals of the last generations as a variable were used to build a series of ranges of k-values based on the 95% confidence interval of the regressions (Table 2; the results after subtracting the mean of the last generation were not considered here because IW and EIW are not being compared with the more variable results of EW). This correlation should be considered as a first approach, bearing in mind the caveat of the low number of independent genealogies that have been analysed here. The aim of the present study was to evaluate the behaviour of EW and IW/EIW in comprehensive samples of MPTs through long direct genealogical lines of empirical palaeontological data matrices. Thus, although expanding the sample of genealogies goes beyond the goal of this study, it is something that should be done in the future in order to further test the correlation between k-values and the number of terminals in parsimony analyses conducted under IW and EIW. Nevertheless, beyond these limitations, it seems appropriate to provide these predicted ranges of k-values based on the number of terminals of the phylogenetic matrix as a preliminary guide (Table 2) because there has been a huge disparity in the k-value choices and the length of the ranges of parsimony analyses that used IW (and less frequently EIW) in the last two decades. In several cases, the choice of the k-values to be used seems to be related not to characteristics intrinsic to each dataset (e.g. character homoplasy distribution, matrix size), but more to subjective decisions of the researchers.
For example, some researchers used an IW or EIW k-value = 3 because this is the default setting of TNT, regardless of the characteristics of the analysed data matrix (e.g. Freudenstein and Rasmussen, 1999; Harbach and Kitching, 2005; Johnson and Musetti, 2012; Gonçalves, 2016; Boessenecker et al., 2017; Averianov and Lopatin, 2021; Longrich et al., 2024). Other authors decided to use long ranges of k-values with the aim of being conservative and to consider a broad spectrum of results under character homoplasy penalization. Ezcurra et al. (2019) and Trotteyn and Ezcurra (2020) used a broad range of IW k = 3-18 when analysing different iterations of the CoArTreeP genealogy composed of 116 and 115 terminals, respectively. The confidence interval calculated here for the regression that uses the mean of the optimal IW k range that includes the lowest GrComp value (Fig. 23a) predicts a reasonable range of k-values between 6 and 13 for these matrices (Table 2). This narrows the range of k-values used by these authors and, most importantly, excludes very low k-values that strongly penalize homoplasy and perform worse in medium- to large-sized datasets. Indeed, it is clear in Ezcurra et al. (2019: fig. 23) that the most disparate resampling frequencies were calculated under IW k = 3-4. More recently, Ezcurra and Sues (2021) explored results alternative to EW using three k-values (k = 3, 7 and 10) under IW in an iteration of the CoArTreeP genealogy that included 190 terminals. Table 2 suggests a range of IW k = 12-20 for this number of terminals, which is outside of the k-values used by the latter authors. Finally, another example is the analysis of the dataset of Pérez et al. (2022), which was composed of 50 extinct and living gastropod species and was analysed with a range of IW k = 1-100. The relationships between species were shown in the MPT found under IW k = 54 (Pérez et al., 2022: figs 4 and 6) and the results with other IW k-values were used to discuss the presence/absence of groups in the former tree (Pérez et al., 2022: fig. 4). This dataset has a strongly skewed exponential distribution of the character homoplasy and Table 2 suggests using a considerably narrower range of IW k-values = 3-8 for a matrix of this size.

Table 2
Range of concavity constant values predicted on the basis of the 95% confidence intervals of the statistically significant linear regressions (see Fig. 23).
Columns: Number of terminals; IW k ranges*; EIW k ranges†

Conclusions
In contrast to mainstream practice in (especially vertebrate) palaeontology, this exploratory study suggests that IW and EIW should be used when analysing phylogenetic morphological datasets. In five of the six genealogies analysed here, IW and EIW show results that are more stable and more similar to those of the last, and assumed most reliable, generation than EW. This indicates that it is more likely that groups present in the most recent iterations of a matrix will be recovered earlier in the genealogies analysed under IW and EIW than under EW. Nevertheless, one of the empirical genealogies analysed here raises a warning flag: an exception could be the cases in which the homoplasy of the characters departs from a well skewed exponential distribution. In these cases, EW may outperform methods that downweight characters on the basis of their homoplasy. Thus, it seems desirable to evaluate the distribution of homoplasy of the characters in an optimal tree found under EW to help in the choice of analysing the phylogenetic datasets under EW and/or IW/EIW, mainly in small-sized matrices (<50 terminals). If a method that penalizes character homoplasy is chosen, the significant positive linear correlations presented here between the optimal k-values and the number of terminals of the last generations could be employed to inform which range of k-values to use in analyses under IW and EIW based on matrix size. However, it should be kept in mind that this emergent relationship still relies on a low sample size of genealogies. Similar studies with an increased number of empirical genealogies (and also including taxonomic groups other than vertebrates) should be conducted in the future. In addition, it would be interesting to conduct analyses similar to those performed here using other tree comparison measures, as well as using consensus trees instead of the individual optimal trees.

Fig. 2. Tree comparison metrics vs. generation number for the CoArTreeP genealogy. (a) Robinson-Foulds normalized distance. (b) Number of subtree pruning and regrafting moves (SPR) similarity. (c) Distortion coefficient. (d) GrComp (proportion of the complement of the sum of group similarities divided by the number of nodes).

Fig. 3. Boxplots of the tree comparison metrics for the CoArTreeP genealogy. (a) Robinson-Foulds normalized distance. (b) SPR similarity. (c) Distortion coefficient. (d) GrComp (proportion of the complement of the sum of group similarities divided by the number of nodes).

Fig. 4. Ordination of the tree comparison metrics of the CoArTreeP genealogy. (a) Principal components analysis (PCA) showing the distribution of different homoplasy treatment methods in the first two components. (b) Mean of GrComp with respect to trees of the last generation (equal weighting (EW) is not shown because it represents a long outlier value). Euclidean distances to the ideal tree metrics calculated from the mean of the tree comparison metrics for (c) implied weighting (IW) and (d) extended implied weighting (EIW), and calculated from the difference between the extreme tree comparison metric values of IW/EIW and EW for (e) the best IW and EW values and (f) the best EIW and EW values.

Fig. 7. Ordination of the tree comparison metrics of the TWiG genealogy. (a) Principal components analysis (PCA) showing the distribution of different homoplasy treatment methods in the first two components. (b) Mean of GrComp with respect to trees of the last generation (EW is not shown because it represents a long outlier value). Euclidean distances to the ideal tree metrics calculated from the mean of the tree comparison metrics for (c) IW and (d) EIW, and calculated from the difference between the extreme tree comparison metric values of IW/EIW and EW for (e) the best IW and EW values and (f) the best EIW and EW values.

Fig. 10. Ordination of the tree comparison metrics of the Tawa genealogy. (a) Principal components analysis (PCA) showing the distribution of different homoplasy treatment methods in the first two components. (b) Mean of GrComp with respect to trees of the last generation. Euclidean distances to the ideal tree metrics calculated from the mean of the tree comparison metrics for (c) IW and (d) EIW, and calculated from the difference between the extreme tree comparison metric values of IW/EIW and EW for (e) the best IW and EW values and (f) the best EIW and EW values.
intermediate between those of the first two generations in generations 3-7. Throughout most of the genealogy IW k = 3-4, 7-11, 13, 14 and 16 and EIW k = 4-11, 13 and 15 have higher mean SPR and DC values than EW. Other strategies with different k-values have generally similar or lower mean values than those for EW. In the last generation, all of the mean SPR and DC values of the character downweighting strategies converge to 1 and those for EW are 0.895 and 0.998, respectively.

Fig. 13. Ordination of the tree comparison metrics of the Odontoceti genealogy. (a) Principal components analysis (PCA) showing the distribution of different homoplasy treatment methods in the first two components. (b) Mean of GrComp with respect to trees of the last generation. Euclidean distances to the ideal tree metrics calculated from the mean of the tree comparison metrics for (c) IW and (d) EIW, and calculated from the difference between the extreme tree comparison metric values of IW/EIW and EW for (e) the best IW and EW values and (f) the best EIW and EW values.

Fig. 16. Ordination of the tree comparison metrics of the Procolophonidae genealogy. (a) Principal components analysis (PCA) showing the distribution of different homoplasy treatment methods in the first two components. (b) Mean of GrComp with respect to trees of the last generation. Euclidean distances to the ideal tree metrics calculated from the mean of the tree comparison metrics for (c) IW and (d) EIW, and calculated from the difference between the extreme tree comparison metric values of IW/EIW and EW for (e) the best IW and EW values and (f) the best EIW and EW values.

Fig. 19. Ordination of the tree comparison metrics of the Crocodylomorpha genealogy. (a) Principal components analysis (PCA) showing the distribution of different homoplasy treatment methods in the first two components. (b) Mean of GrComp with respect to trees of the last generation (EW is not shown because it represents a long outlier value). Euclidean distances to the ideal tree metrics calculated from the mean of the tree comparison metrics for (c) IW and (d) EIW, and calculated from the difference between the extreme tree comparison metric values of IW/EIW and EW for (e) the best IW and EW values and (f) the best EIW and EW values.

Fig. 20. Biplots showing the mean of GrComp calculated with respect to trees of the last generation for the k-value with best performance for that k-value and method (colour) and EW (black), respectively, vs. generation number. (a) CoArTreeP for IW k = 20. (b) TWiG for EIW k = 5. (c) Tawa for IW k = 6. (d) Odontoceti for IW k = 7. (e) Procolophonidae for EIW k = 5. (f) Crocodylomorpha for IW k = 5. Vertical dotted lines represent the standard deviation.

Fig. 21. Histograms showing the distribution of the homoplasy of the characters in the last matrix of the genealogies analysed here. (a) CoArTreeP genealogy. (b) TWiG genealogy. (c) Tawa genealogy. (d) Odontoceti genealogy. (e) Procolophonidae genealogy. (f) Crocodylomorpha genealogy. The vertical dotted red line represents the mean of the distribution.

Fig. 22. Histograms showing the distribution of the fit of the non-additive characters in the last matrix of the genealogies analysed here. (a) CoArTreeP genealogy. (b) TWiG genealogy. (c) Tawa genealogy. (d) Odontoceti genealogy. (e) Procolophonidae genealogy. (f) Crocodylomorpha genealogy. The vertical dotted red line represents the mean of the distribution.

Fig. 23. Linear regressions between the mean of optimal IW/EIW k-values and the number of terminals of the last generations of each genealogy. (a) Regression between the mean of the optimal IW k range that includes the lowest GrComp value vs. the number of terminals. (b) Regression between the mean of the optimal EIW k range that includes the lowest GrComp value vs. the number of terminals. (c) Regression between the mean overall optimal IW k range vs. the number of terminals. (d) Regression between the mean overall optimal EIW k range vs. the number of terminals. The blue line represents the linear regression and the grey shadow the 95% confidence interval.

Fig. S12. Ordination of the tree comparison metrics of the Odontoceti genealogy without subtracting the means of the last generation.

Fig. S13. Histograms showing the frequency of the difference between GrComp calculated for EW and a selected sample of k-values under IW and EIW in the Procolophonidae genealogy.

Fig. S14. Histograms showing the frequency of the difference between GrComp calculated for EW or a selected sample of IW/EIW k-values and EIW k = 5 in the Procolophonidae genealogy.

Fig. S15. Ordination of the tree comparison metrics of the Procolophonidae genealogy without subtracting the means of the last generation.

Fig. S16. Histograms showing the frequency of the difference between GrComp calculated for EW and a


Fig. S19. Histograms showing the distribution of the fit of the characters in the last matrix of the (a-c) CoArTreeP and (d-f) TWiG genealogies under different k-values.

Fig. S20. Histograms showing the distribution of the fit of the characters in the last matrix of the (a-c) Odontoceti and (d-f) Procolophonidae genealogies under different k-values.

Table 1
Mean values calculated for the optimal k ranges recovered for the six empirical genealogies and their number of terminals in the last generation.
* Significant linear correlation with number of terminals.