Bias in phylogenetic measurements of extinction and a case study of end‐Permian tetrapods

Extinction risk in the modern world and extinction in the geological past are often linked to aspects of life history or other facets of biology that are phylogenetically conserved within clades. These links can result in phylogenetic clustering of extinction, a measurement comparable across different clades and time periods that can be made in the absence of detailed trait data. This phylogenetic approach is particularly suitable for vertebrate taxa, which often have fragmentary fossil records, but robust, cladistically‐inferred trees. Here we use simulations to investigate the adequacy of measures of phylogenetic clustering of extinction when applied to phylogenies of fossil taxa while assuming a Brownian motion model of trait evolution. We characterize expected biases under a variety of evolutionary and analytical scenarios. Recovery of accurate estimates of extinction clustering depends heavily on the sampling rate, and results can be highly variable across topologies. Clustering is often underestimated at low sampling rates, whereas at high sampling rates it is always overestimated. Sampling rate dictates which cladogram timescaling method will produce the most accurate results, as well as how much of a bias ancestor–descendant pairs introduce. We illustrate this approach by applying two phylogenetic metrics of extinction clustering (Fritz and Purvis's D and Moran's I) to three tetrapod clades across an interval including the Permo‐Triassic mass extinction event. These groups consistently show phylogenetic clustering of extinction, unrelated to change in other quantitative metrics such as taxonomic diversity or extinction intensity.

be highly variable across topologies. Clustering is often underestimated at low sampling rates, whereas at high sampling rates it is always overestimated. Sampling rate dictates which cladogram timescaling method will produce the most accurate results, as well as how much of a bias ancestor-descendant pairs introduce. We illustrate this approach by applying two phylogenetic metrics of extinction clustering (Fritz and Purvis's D and Moran's I) to three tetrapod clades across an interval including the Permo-Triassic mass extinction event. These groups consistently show phylogenetic clustering of extinction, unrelated to change in other quantitative metrics such as taxonomic diversity or extinction intensity.
C O M P A R I S O N S of palaeontological data on extinction from different time periods are complicated by profound contrasts in timescale, the volume and quality of available data, approaches to analysis, and the intensity with which different geographical areas and taxonomic groups have been studied (Jablonski 2008;Fritz et al. 2013;de Vos et al. 2014;Payne et al. 2016). These problems are especially acute for vertebrates, which are of considerable interest to biologists but have an incomplete palaeontological record in comparison to shelly marine invertebrates (Foote & Raup 1996;Foote & Sepkoski 1999). Despite these limitations, the fossil record can offer a natural laboratory for testing hypotheses about how extinction dynamics might change or be maintained in times of extreme ecological stress (Jablonski 1994(Jablonski , 2005Finnegan et al. 2015). This deep-time perspective is becoming increasingly important to contemporary biological research as extinction rates increase and biodiversity declines (McKinney 1997;Erwin 2009;Barnosky et al. 2011).
Two approaches dominate studies of extinction: measuring selectivity with respect to different biological, life history or extrinsic traits (Bielby et al. 2006;Cardillo et al. 2008;Turvey & Fritz 2011;Harnik et al. 2012) and measuring extinction intensity and turnover rates. The latter has been the usual focus of quantitative analyses of extinction in the geological past (Raup 1994;Alroy 1996;Stanley 1998;Alroy et al. 2001Alroy et al. , 2008Jablonski 2008). Ideally, the fossil record might be used to identify traits which may make taxa vulnerable to extinction (Jackson & Erwin 2006;Purvis 2008;Fritz et al. 2013). Some highresolution fossil records have indeed been used to investigate selection against a particular trait, or vulnerability to a particular pressure. Previous studies have shown extinction selectivity related to body size (Harnik 2011;Tomiya 2013), feeding strategy (Jeffery 2001), geographical range (Kiessling & Aberhan 2007;Payne & Finnegan 2007;Jablonski 2008), morphology (Liow 2007;Friedman 2009) and clade richness (Smith & Roy 2006), among others. Unfortunately even this basic level of trait data is not immediately accessible for much of the fossil record.
Phylogenetic approaches can lessen some of the biases introduced by imperfect sampling, while simultaneously providing results from different data and scales that can be directly compared across clades and through time (Purvis 2008;Fritz et al. 2013;Harnik et al. 2014). Many previous studies, focusing on a variety of different questions and methods, have demonstrated that application of phylogenetic data to study of the fossil record can be important in obtaining valid, statistically unbiased results (Felsenstein 1985;Grafen 1989;Norell 1992;Rabosky 2010;Pennell & Harmon 2013;Sakamoto et al. 2016). Studies of extinction can also be augmented by the incorporation of phylogeny, which provides additional information that cannot be accessed through taxonomic or stratigraphic approaches, or from measuring turnover rates alone (Hardy et al. 2012). For example, phylogenetic measurements of extinction can be used to find the presence or absence of taxon-independent selection against traits (Tomiya 2013), measure loss of evolutionary history (Huang et al. 2015), or understand the origin of phylogenetic community structure (Fraser et al. 2015).
Intuitively, we might expect that extinction is selective with respect to the relationships between taxa (i.e. phylogeny), given that some traits may make taxa vulnerable or resistant to extinction, and that these traits might be phylogenetically conserved (Hunt et al. 2005;Green et al. 2011;Smits 2015). In other words, due to their shared ancestry, closely related taxa are more likely to share similar characteristics, and the probability of a taxon becoming extinct might in turn be related to those characteristics (Fig. 1B). When this is the case the phylogenetic clustering of extinction (i.e. whether closely related taxa become extinct at the same time) might act as a proxy for selection for or against particular traits in the fossil record. This proxy could be studied in situations where a phylogeny is available, but detailed morphological or life history information is lacking. This approach broadly assumes that a Brownian motion-like model of trait evolution adequately reflects changes in the features that are relevant to extinction risk (Freckleton et al. 2002;Harmon et al. 2010). In such a case clustered extinction is indicative of selection with respect to phylogenetically conserved traits, whereas phylogenetically random extinction is indicative either of selection with respect to phylogenetically labile traits, or of extinction that is not selective with respect to particular traits (Fritz & Purvis 2010).
Although phylogenetic methods offer advantages over approaches based on taxonomy or extinction intensity, incorporating fossil taxa into phylogenies potentially introduces its own set of biases. For example, range extensions in a phylogeny are asymmetrical; they can predate fossil occurrences, thereby extending a taxon's range into the past, but the length of unsampled history after the last fossil occurrence of a taxon cannot easily be estimated. There have been studies on the effect on downstream analyses of several of the features that are more acute in phylogenies of fossil taxa than those of extant groups (e.g. uncertain divergence dates (Bapst 2014;Halliday & Goswami 2016), missing character data causing tree misspecification (Stone 2011) and a higher proportion of soft polytomies (Garland & Diaz-Uriarte 1999;Housworth & Martins 2001;Davis et al. 2012)). However, the effect of the overall 'degraded' nature of a palaeontological phylogeny has not yet been fully investigated, particularly with respect to the phylogenetic structure of extinction.
Here we use simulations to examine the efficacy of Fritz and Purvis' D (Fritz & Purvis 2010; a metric of the clustering of binary traits across a phylogeny) when applied to phylogenies of fossil taxa to measure the phylogenetic clustering of extinction given evolution of relevant traits under a Brownian motion model of change. We investigate the ways in which results from this analysis of simulated fossil (i.e. degraded) data are biased with respect to true evolutionary patterns, and identify the likely causes of such bias. This provides a general guide for the use of these analyses on fossil data. We illustrate this approach to studying the clustering of extinction with an empirical example based on tetrapods during the Permian-Triassic mass extinction (PTME).

Phylogenetic clustering of extinction
Both the simulation study and analysis of real data require measurement of the phylogenetic clustering of extinctions of lineages. Here we treat extinction and survival as a binary trait within a time bin (Fig. 1). There are several methods by which the phylogenetic or taxonomic clustering of a binary trait may be measured, but here we focus on Fritz and Purvis' D (Fritz & Purvis 2010). This metric is scaled to random and Brownian motion expectations of trait distribution. A random expectation is where extinctions and survivals are randomly scattered across the tips of the phylogeny within the time bin (Fig. 1A). The Brownian expectation is the pattern of extinctions and survivals across the tips that is obtained if a continuous trait evolves under a Brownian motion (random walk) model of evolution and is then converted into a binary trait using a threshold value. As outlined above, a longer shared ancestry means that under this model closely related taxa are more likely to have similar traits, leading to a pattern of clustering of the same trait values on the phylogeny (Fig. 1B).
The scaling of the test statistic D means that, unlike alternative metrics, it is robust to tree shape, tree size, and trait prevalence for trees containing more than 50 tips (Fritz & Purvis 2010). D can therefore be used to reliably compare values through time, and between clades, providing an advantage over other methods (Hardy et al. 2012). We also repeated all analyses on the real data using Moran's I (a test for spatial autocorrelation (Moran 1950) generalized for use to measure phylogenetic signal by Gittleman & Kot (1990)) to establish whether the same variation in extinction clustering through time was found with both measures.
D is calculated by scaling the observed sum of sisterclade differences (SSD) to sister-clade differences from 1000 iterations of Brownian and random models, using equation 1: where ∑d obs is the observed SSD and ∑d b and ∑d r are the Brownian and random SSD for each iteration. Once the value has been scaled, D = 1 corresponds to a random trait distribution, and D = 0 corresponds to a Brownian, or clustered, trait distribution. A p-value for D is calculated by comparing the estimated value to the distributions of values generated for ∑d b and ∑d r (see also Fritz & Purvis 2010, table 1). Moran's I is a metric for spatial autocorrelation. It can be adapted for purpose here to measure the degree to which a binary trait (extinction) clusters in phylogenetic space (phylogenetic distance between taxa) (Gittleman & Kot 1990;Lockwood et al. 2002). It is calculated with equation 2: where n is the number of observations, w ij is a weighting that is calculated as 1 divided by the cophenetic distance between two species i and j, and z i is the normalized value of the trait for the species I (Lockwood et al. 2002).
In some previous studies, Moran's I correlograms have been used, which is possible when both extinction and taxonomic distance are binary traits. The generalized method for Moran's I used here has the advantages of providing one value for the entire tree, and including the additional information provided by phylogenetic branch duration (Hardy et al. 2012).

Timescaling
Phylogenetic comparative methods require a cladogram with branch durations scaled to time. The timescaling method may have an important influence on the outcome of measurements of extinction clustering because it controls which taxa are included in each timeslice, as well as the phylogenetic distance between taxa. There are several post hoc methods for timescaling cladograms of fossil taxa, and here we applied four. First we used the Hedman algorithm (Hedman 2010; Lloyd et al. 2016a), which provides a distribution of estimates for the position of each internal node in the tree, based on the ages of the earliest representatives of consecutive sister groups. We performed this in R using code written by Graeme Lloyd and available in Lloyd et al. (2016b). We also tested the older and widely used mbl (minimum branch length; Laurin (2004)) and equal (Brusatte et al. 2008;Lloyd et al. 2012) methods. For the simulation study we additionally used the cal3 timescaling method (Bapst 2013) which calibrates internal node positions according to three rates (origination, extinction and sampling) that can be estimated from occurrence data (Foote 2001). We could not use cal3 on the real data because a majority of the taxa in our datasets are point occurrences, so we could not obtain reliable rate estimates (Bapst 2014).

Simulations
We used wrappers of functions in the paleotree package in R (Bapst 2012) to generate phylogenies that included episodic mass extinction events (scripts provided in Soul & Friedman 2017). These phylogenies were sampled to simulate fossil occurrence ranges, which were subsequently used to reconstruct and scale cladograms of the sampled fossil taxa according to time. We measured D for an identical timeslice, which included a mass extinction, through the 'true' phylogenetic histories and the sampled fossil cladograms, and compared the results. In order to assess the way in which particular factors might bias measurements of clustering, we varied: (1) the method used to timescale the cladograms; (2) the degree to which extinction was phylogenetically clustered; and (3) the way in which sampled ancestral taxa were included within the timescaled cladograms.
Generating evolutionary histories. Phylogenies were generated using origination and sampling rates based on one simulation time unit representing 1 myr. Mass extinctions were generated by selecting 75% of taxa to go extinct. For clustered extinction we first simulated traits under Brownian motion. A low proportion of lineages with a trait value below a threshold were terminated, and a high proportion of those taxa with a trait value above the threshold were terminated. As discussed above (see Phylogenetic clustering of extinction) this leads to clusters of closely related tips on the phylogeny becoming extinct at the same time. For phylogenetically random extinction, the same overall proportion of lineages was terminated but terminations were selected randomly across the tree. The tree simulation continued from surviving lineages after each mass extinction event. We used three sets of five 'true' phylogenies, one set with clustered extinction, one with random extinction, and the final with bifurcating rather than budding origination (see Foote 1996, fig. 1). We sampled each of these 15 true phylogenies 50 times at three different per-capita rates: 0.01, 0.1 and 0.5 per-lineage time units. This sampling represents the combined processes of incomplete preservation and collection of fossil occurrence data.
Each of the sets of sampled ranges of taxa was used as the basis for timescaled cladograms (see Timescaling above). We tested three timescaling methods and implemented three different strategies for including sampled ancestral taxa. The options used in each set of simulations are detailed in Table 1. Overall this process yielded 15 simulated true phylogenies, 2250 sets of simulated taxon ranges and 5250 timescaled cladograms of sampled fossil taxa. Following generation of timescaled cladograms we measured Fritz and Purvis' D (Fritz & Purvis 2010) for the same, single, timeslice in each true phylogeny and each reconstructed fossil cladogram. This allowed assessment of which parameters were the most important controls on whether this measurement could recover the true signal for palaeobiological data.
Treatment of ancestors. Sampling taxa from ancestral lineages has been shown to be probable when dealing with data measured on long timescales (Foote 1996). In the majority of work estimating phylogenetic relationships, it has not been possible to identify which taxa might be ancestral to other sampled taxa (but see recent approaches e.g. Gavryushkina et al. 2014;Heath et al. 2014;Bapst et al. 2016). In commonly used methods of phylogenetic inference, sampled ancestral taxa are reconstructed as sister to their descendants. This may have an influence on the outcome of phylogenetic measures of extinction; the treatment of ancestors as they are incorporated into the phylogeny is therefore an important consideration. To simplify the test of how much of an influence sampled ancestral taxa might have on the outcome of the analysis, we used only a bifurcating model of origination (rather than budding or anagenetic origination, which can be simulated using paleotree). The first treatment of sampled ancestral taxa was to place them as sister taxa to their descendants and leave them in the cladogram (emulating the most likely result of a cladistic analysis where ancestors are sampled in real data (Wagner & Erwin 1995;Alroy et al. 2001)). This has two principal effects. First is the introduction of 'pseudoextinctions' where a taxon disappears from the fossil record and therefore appears to have become extinct, but actually the lineage has undergone morphological change. Second is the introduction of 'pseudosurvivals', which occur when an ancestor is sampled in an earlier time bin than its descendant. When they are reconstructed as sister taxa, the origin of the descendant must match the origin time of the ancestor and so a ghost range is inserted, crossing the boundary between time bins.
The second treatment of ancestors did not include sampled ancestral taxa, which where pruned from the cladograms before they were timescaled. This removes both pseudoextinctions and pseudosurvivals. The final treatment of ancestors was to remove sampled ancestral taxa only after the tree had been timescaled. As outlined above, this introduces ghost ranges into the phylogeny, so psuedosurvivals appear where these ghost ranges extend across the boundary into the previous timeslice. However, because the ancestors themselves are then pruned from the tree, pseudoextinctions are no longer present. The only treatment of ancestors available in reality is the first, because in the majority of cases we are unable to identify and remove ancestors from a phylogeny. Consequently, the second two treatments are performed only in order to understand the cause of any bias observed in the results, and do not represent real or reconstructed evolutionary trees. These scenarios, and their effects, are explored more fully in the discussion.
Caveats. The method used here can be viewed as optimistic, as only two factors (missing taxa and sampled ancestors) are investigated. We assume that cladograms recover true evolutionary relationships, which is unlikely to be the case. We also assume that there is no uncertainty in the ages of the fossil specimens, when in reality these are often only known as precisely as a geological stage, particularly for groups like terrestrial vertebrates where studies of phylogenetic clustering would most easily be conducted. Finally, we simulate the traits linked to extinction under a Brownian motion model of evolution, which leads to phylogenetically conserved trait patterns and phylogenetically clustered extinction. In reality, traits that are under selection may be best modelled by a different evolutionary regime (e.g. adaptive peak or early burst). We are therefore specifically investigating whether this approach can be used to detect selection with respect to traits that are adequately modelled by Brownian motion. The results of this simulation study do not fully represent our ability, or lack thereof, to correctly estimate this metric from fossil data. However, they do provide evidence of the way in which each cause of bias is likely to affect results and an indication of where problems are likely to arise. The code for all simulations and analyses can be found in Soul & Friedman (2017).
An empirical example: tetrapods at the PTME As an illustration of this approach we quantified the phylogenetic clustering of extinctions in the fossil record of three major tetrapod clades (sauropsids, temnospondyls and synapsids) using two different metrics outlined in the phylogenetic clustering of extinction section above: Fritz and Purvis' D (Fritz & Purvis 2010) and Moran's I (Moran 1950;Gittleman & Kot 1990). The length of time over which we measured these metrics extended from the Pennsylvanian to the Late Triassic, divided into ten timeslices of similar length, each comprising one or two geological stages. We performed sensitivity analyses by varying the length of timeslices and the method used to scale cladogram branches to time.
Data. Phylogenies were composites constructed using published supertrees and cladistically inferred topologies for subgroups (cf. Soul & Friedman 2015). The topology for temnospondyls was a supertree taken directly from Ruta et al. (2007). The topologies for sauropsids and synapsids were composite trees constructed by combining  S1). Occurrence data for each taxon were taken primarily from the Paleobiology Database (https://www.paleobiodb.org) except for parareptiles where these data were poorly covered in the database but available from the author of the published topology (Ruta et al. 2011).
To translate extinction to a binary trait, each timescaled cladogram was divided into successive timeslices of approximately the same length. If a taxon's last appearance fell within any one timeslice this was classified as an extinction; if the taxon's range included the end of the timeslice this was a survival because the taxon was present within the slice but survived into at least the next one. For the main analysis we used timeslices that began and ended at the start and end of geological stages, but combined some consecutive stages into single bins in order to generate intervals of more consistent length. It has been demonstrated previously that the intensity of the signal can be sensitive to temporal resolution of the timeslices (Hardy et al. 2012). Therefore, to test the effect of the length and timing of the timeslices we also conducted analyses using timeslices of exactly equal durations of 10 and of 15 myr.
The dates of occurrences of many fossil taxa, particularly vertebrates during the Palaeozoic and Mesozoic, are often only known to stage-level precision. To account for uncertainly in the actual times of first and last appearances of taxa in the record, a set of 50 stochastically generated fossil ranges was made for each taxon. First and last appearances were selected from a uniform distribution between the beginning and end of the most precise time period from which each taxon is known. The cladogram for each of the three groups was then timescaled using these sets of ranges. This can affect lineage divergence time estimates, and consequently the outcome of downstream analyses (Bapst 2014;Soul & Friedman 2015).
Sampling rate proxies. Variation between time bins in the rate of fossil preservation and discovery could have an important effect on the resulting signal (we test for this bias in the simulation section). In order to verify that preservation and sampling heterogeneity between bins was not the main driver of variation in extinction clustering results for our empirical data, we compared values of D to values for several proxies for fossil record quality. Due to the large proportion of point occurrences in the datasets (51%), and generally low number of occurrences per taxon, a sampling rate could not be directly estimated for the empirical data via any of several sophisticated and commonly used maximum likelihood or Bayesian estimators (e.g. Foote & Raup 1996; Alroy 2008;Liow & Finarelli 2014). Instead, we provide three proxies for the relative quality or heterogeneity of the fossil record through time: (1) the number of tetrapod bearing formations per bin; (2) the per-bin average number of formations in which each taxon occurring in that bin is represented; (3) a comparison of standard diversity (SD; a basic taxon count) with average duration of ghost lineage per taxon in each bin (average ghost lineage duration (AGLD); Cavin & Forey 2007). These proxies are only basic assessments of variation in fossil record quality through time, but are unfortunately the best methods currently available, given the nature of the data. They are adequate for their application here, which is to check whether sampled fossil record heterogeneity can be discounted as the main driver of the measured phylogenetic pattern in extinction.
For proxies 1 and 2 we performed a Pearson productmoment correlation test of first differences of D against the value for the proxy, a significant correlation would indicate that variation in D is an artefact of variation in fossil preservation and discovery potential through time. The method we used here for proxy 3 was developed by Cavin & Forey (2007) to distinguish between genuine and artefactual diversity peaks, by identifying time periods when the record comprises low numbers of highly productive horizons (Lagerst€ atten). A peak in SD that is not accompanied by a change in AGLD indicates that the record for that time bin is dominated by Lagerst€ atten. We use this method to identify time bins with particularly heterogeneous records, and compare this to times that extinction is particularly clustered or overdispersed.

Simulations
With the exception of Fig. 2, the figures in this section depict the median difference between D calculated on a simulated true phylogeny, and D calculated on the corresponding sampled cladograms. A positive value indicates that estimates of extinction were more strongly clustered on the sampled cladograms than on the true phylogeny.
Sampling rate. The baseline simulation demonstrates that accurate recovery of the strength of phylogenetic clustering of extinction is not guaranteed, whether or not extinction is clustered in (simulated) reality (Fig. 2). Correct recovery of the strength of phylogenetic clustering of extinction depends heavily on sampling rate (Figs 2, 3).
At low sampling rates of 0.01 per lineage time unit (ltu) the value of D is on average higher (less clustered) than, or close to, the originally simulated value. A medium sampling rate of 0.1 ltu À1 usually resulted in overestimates of clustering (i.e. lower values of D), and a high sampling rate of 0.5 ltu À1 always leads to overestimates of the strength of clustering of extinction. In the simulations where extinction was not significantly clustered in the true phylogenies ( Fig. 2B; Table 1: true phylogeny set 2), the analysis falsely rejected the possibility of phylogenetically random extinction at high sampling rates.
Timescaling method. The method used to timescale the trees of fossil taxa also had an important influence on recovery of accurate estimates of D (Fig. 3). At 0.01 ltu À1 , when the trees were timescaled using mbl and cal3, clustering was underestimated, but when the trees were timescaled using Hedman, estimates at 0.01 ltu À1 were closer to estimates of D from the real tree. However, these showed a large variance across measurements from different topologies. At higher sampling rates Hedman timescaled trees gave D values which implied a far greater strength of clustering than the original simulated phylogeny. When the trees were timescaled using cal3, estimates were more accurate overall, although low and high sampling rates did lead to a slight underestimate and overestimate of clustering respectively. Trees scaled using mbl did not give the most accurate estimates at any sampling rate, but were slightly better than Hedman at the two higher sampling rates.
Strength of clustering. Whether or not extinction in the simulation was phylogenetically clustered made a small difference in the mean accuracy of estimates of D (Fig. 4). When extinctions were phylogenetically clustered there was a larger variance in estimates from fossil trees than when extinction in the simulation was phylogenetically random. Medians of estimates for clustered and non-clustered extinctions showed approximately the same difference from the true value of D.
Ancestors. In the baseline simulation (Fig. 2), sampled ancestral taxa were placed in a polytomy with their descendants. When these were removed after timescaling (which removed pseudoextinctions but not pseudosurvivals) the measured signal shifted to lower values of D (more clustered); at high and medium sampling rates this lead to an overestimation of clustering, at low sampling rates clustering was still underestimated and showed large variation across topologies. When ancestors were removed before timescaling (removing both pseudoextinctions and pseudosurvivals) the measured signal at high sampling rates shifted from an overestimate of the strength of clustering to a more accurate estimate (Fig. 5).  Tetrapods at the PTME Strength of clustering through time. Extinction was phylogenetically clustered in all three clades during the majority of the time bins investigated (Fig. 6), and fell within the distribution of the Brownian expectation. There is a greater spread in D values in time bins where the phylogenetic patterning is weak or random, showing that in these cases variation in both the topology and branch lengths of the tree has more of an effect on the result. All three clades show relatively random extinction in their early history; it is not clear whether this is a genuine signal or bias caused by proximity to the root of the tree or a small sample size. Extinctions are then consistently clustered in the last three timeslices of the Permian in all clades. There does not seem to be an overall trend in changes in extinction clustering. It is not more likely for a decrease in signal strength between timeslices to follow an increase, or vice versa. Extinction intensity does not correlate significantly with strength of phylogenetic clustering for any of the clades (Pearson product-moment correlation: sauropsids r = À0.6936, p = 0.0800; synapsids r = À0.5596, p = 0.1915; temnospondyls r = 0.2281, p = 0.6228). Changing the algorithm used to timescale the cladograms lead to very similar estimates of D and did not affect the overall conclusions (Soul & Friedman 2017, fig. S2).
Measurements of Moran's I for sauropsids and synapsids showed similar patterns to D, with one exception in the Middle Triassic, during which a large proportion of taxa go extinct (72%). Moran's I for temnospondyls showed a slightly different pattern to D (Soul & Friedman 2017, fig. S3). Again this can most likely be attributed to the relative proportions of extinction; extinction intensity in temnospondyls correlates with the test statistic for I (r = 0.8295, p = 0.02).
Sampling rate proxies. Neither of the two formation-based proxies shows a significant correlation with D in any clade (Table 2). Average ghost lineage duration (AGLD) Estimated values depend on timescaling method. Results of cladogram sets 1, 2 and 3. A, median and interquartile ranges of the difference in estimated value of D from the true value of D for three different sampling rates from left to right, using three different methods to timescale the cladogram; plotted to highlight the influence of sampling rate. B, the same data but arranged to highlight the influence of timescaling method. The methods increase in complexity and amount of input data required from left to right. Values close to the dashed line at 0 on the plots indicate that good estimates were made on the timescaled cladograms, with reference to the simulated true phylogeny. The narrower a box is, the more consistent results were across the iterations of cladograms.
shows a different pattern for each clade (Fig. 7). Sauropsids show an increase in heterogeneity of the record in the Middle Triassic, which does not correspond to an unusually high or low value of D. Synapsids have the same small increase in record heterogeneity in the Middle Triassic, preceded by a more dramatic increase in the Guadalupian that then declines in the end-Permian. These changes are not tracked by changes in D, which remains consistent and low throughout the Permian and Early to Middle Triassic. Temnospondyls show a very strong Lagerst€ atten effect in the Early Triassic but this time period is not distinguishable from others in the phylogenetic clustering analysis.

Simulations
The results of the simulation analyses indicate that there are several important factors that need to be considered when interpreting phylogenetic clustering of extinction measured with fossil data. The effectiveness of different methods depends on the type of data being used for the analysis (Figs 2-6). The way in which taxa in the clade under investigation evolved and became extinct also has an effect on the accuracy and precision of results (Fig. 4), so caution must be taken when drawing conclusions from any one test. Although many factors have an influence on the bias in simulation outcomes, the sampling rate has the largest effect (Fig. 2). If the sampling rate can be estimated, at least approximately, the biases introduced by other factors can be anticipated.
Causes of bias. The two problems introduced in the simulation analyses were: (1) sampling rate variation (i.e. proportion of missing taxa); and (2) reconstruction of ancestors as sister taxa to their descendants. The second is linked to the first, as increased sampling rate increases the probability of sampling ancestors. Results suggest that the main bias at high sampling rates (towards overestimation of the strength of phylogenetic clustering) is a result of the second problem where pseudosurvivals result in an increased number of survivals at the end of each timeslice. This is demonstrated by the overestimation of clustering when only pseudosurvivals are included in the timescaled cladogram (Fig. 5). Situations where pseudosurvivals are likely to occur lead to clumps of closely related taxa surviving the end of timeslices (Fig. 8), which in turn lead to a lower phylogenetic distance between survivals on average. Extinctions and survivals are  symmetrical in the calculation of D, so an increase in survivals, where those survivals are in closely related taxa, has the same effect as an increase of extinctions in closely related taxa. When pseudoextinctions are also included they create an opposite bias, leading to an estimate closer to the originally simulated value of D (Figs 5, 8). At low sampling rates the median estimate is rarely significantly clustered, even when the phylogeny that was originally simulated displayed highly clustered extinction. With fewer sampled taxa across the phylogeny overall, there is a lower probability of sampling closely related taxa, and a higher probability of sampling a taxon but not any of its descendants. For a poorly sampled tree, the most closely related taxa that have actually been sampled will not necessarily have been closely related in absolute terms, so the signal of very closely related taxa surviving or becoming extinct at the same time is lost. In addition, with smaller sample sizes the statistical power of the test to detect clustering is reduced.
Different timescaling methods changed the magnitude of bias in each case. The mbl method can be considered conservative because it does not assume large amounts of unsampled lineage history for which there is no direct evidence, but is unlikely to represent the true timings of lineage divergences accurately. The cal3 method assigns branch durations in a less ad hoc manner and so tends to extend internal branches proportionally more than mbl, and the Hedman method extends internal branches even more so. This has the effect of drawing a greater number of divergences back into earlier timeslices, leading to more survivals and causing a more clustered signal to occur when compared to the signal measured on differently timescaled trees (Fig. 8).  Methodological recommendations. The correct way to implement and interpret measurements of the phylogenetic clustering of extinction is evidently a complex question. The nature of the data used for the analysis is important, as well as the way these data are subsequently treated. What does seem possible is that sampling rate can often be estimated (to the correct order of magnitude) and that an appropriate timescaling method can therefore be selected. However, all other biasing factors are either not possible to control, or difficult to estimate. With this in mind the most reasonable procedure is to begin by estimating sampling rate (in so far as that is possible), then to choose an appropriate timescaling method. At very low sampling rates, cladograms should be timescaled using the Hedman method to reduce bias, whereas at higher rates the cal3 method should be used. Conveniently cal3 is a method more suited to clades with higher sampling rates, as it requires additional information (origination, extinction and sampling rates) that can be more accurately measured for groups with a high sampling rate. Conversely the Hedman method can be used when sampling is low and the additional information on rates is not available (Lloyd et al. 2016a). Following this, results should be interpreted in the context of the other biases that are probable given the dataset. For example if data are found to have a low sampling rate (c. 0.01 ltu À1 ) and significantly clustered extinction, then this result can be expected to have a large error. If the data shows random extinction at a low sampling rate it can be considered more reliable. At high sampling rates (c. 0.5 ltu À1 ) the analysis is consistently prone to overestimation of the strength of clustering, which means that a significantly clustered signal could be found that is in fact an artefact of the analysis and should be interpreted cautiously.
Ideally, to obtain an unbiased estimate of D, the phylogeny would be reconstructed using a method by which ancestors can be reliably inferred. New methods (e.g. Gavryushkina et al. 2014) which allow for sampled taxa to be directly ancestral to others in the estimated phylogeny hold possibilities for ancestral inference. These could be implemented for phylogenetic comparative methods in the future, but the data required for this kind of inference are unavailable for the clades studied here, and for many other clades of fossil taxa. Although the response of downstream analyses has not been quantified for phylogenies inferred in this way, there is great potential for improvement of phylogenetic comparative methods that are particularly vulnerable to bias caused by sampled ancestors, such as the method used here.

Tetrapods at the PTME
We provided an analysis of the phylogenetic clustering of extinction in three tetrapod clades during the PTME as an illustrative example of the application of this method to the fossil record. Tetrapod extinctions were phylogenetically clustered during the Pennsylvanian to Upper Triassic interval. This corroborates previous research that indicates that some degree of phylogenetic signal is a common feature of extinctions regardless of timescale, and can be considered a general rule (McKinney 1997;Janevski & Baumiller 2009;Roy et al. 2009). There is a large body of work to demonstrate that the nature and degree of extinctions during the PTME was different in each clade in measures such as diversity (Fr€ obisch 2008;Ruta & Benton 2008;Lucas 2009;Ruta et al. 2011). In combination with our results this indicates that variation in phylogenetic selectivity and variation in extinction intensity are not directly related, but may share a common driver of extreme values.
It has been suggested that the PTME represented a period of complete ecosystem restructuring for terrestrial tetrapods (Benton et al. 2004;Fr€ obisch 2013). Highly phylogenetically clustered extinction has a disproportionately large effect on biodiversity compared to random extinction (Davies & Yessoufou 2013), perhaps allowing for or requiring major ecosystem change (Krug & Patzkowsky 2015). The three focal clades show clustered extinction during the final timeslice of the Permian (Lopingian), but this is not unique; other intervals show clustering comparable to that of the PTME, indicating that phylogenetic selectivity may at times be decoupled from both extinction intensity and ecological impact (Droser et al. 2000;Hardy et al. 2012). (2010) suggest that phylogenetically random extinction can be attributed to geographical variation in the intensity of threat to survival in different regions that affects all the taxa living there (e.g. one region becomes very dry or hot). On small spatial scales, extinction (or extinction risk) is often phylogenetically clustered and this clustering can indeed be attributed to selection against particular phylogenetically conserved phenotypes (Roy et al. 2009;Hardy et al. 2012). However, across the large spatial and temporal scales of this study, geographical distribution may also be phylogenetically conserved, because many taxa are restricted to particular habitats or climatic zones, to which close relatives with whom they share a recent evolutionary history are also more likely to be restricted (Lieberman 2003;Krug & Patzkowsky 2015). This pattern may not occur in all taxa, particularly not in generalists with good dispersal ability, but overall the two factors which have control over a species' vulnerability to extinction on the timescales in this study, its phenotype and the extinction threat it experiences, are both expected to be phylogenetically conserved to some degree, particularly in the early history of a taxon. Future work could compare the phylogenetic clustering of extinction through time with the correlation between geographical and phylogenetic distance of sampled taxa to begin to tease apart these two factors.

Geographically linked extinction. Fritz & Purvis
Sensitivity tests. In agreement with the simulation study the sensitivity tests indicated that in some cases the method employed to perform the various steps required to obtain a result had an influence on the observed signal, but these effects were small. Changing the algorithm used to timescale the trees had an effect because different timescaling methods add duration to internal branch lengths to varying degrees, which led to taxa being present in earlier time bins in Hedman timescaled trees. (see Fig. 8 and simulation results for the possible effect of this).
Implementing an alternative method (Moran's I) to measure clustering also gave a slightly different result, particularly for temnospondyls. However, the strong link between extreme values of trait prevalence and extreme values of I indicates that the method is not particularly robust to variation in trait prevalence, unlike Fritz and Purvis' D, which can give results that are comparable across timeslices even when they have a very high or low proportion of extinctions.
Changing the length of the timeslices changes which taxa survive in each timeslice, and therefore the strength of clustering. When the timeslices correspond to combinations of stages (as they do in the main analysis), a taxon will always go extinct in the same timeslice, even if its divergence date is in different stages for different iterations of the timescaling algorithm. This is not the case when 10 and 15 myr timeslices are used (except for the Lopingian because this boundary is used in all the alternative sets of timeslices) leading to the larger variance across different tree topologies in these results (Soul & Friedman 2017, fig. S4).
Sampling proxies. The simulation study demonstrated that the sampling rate has an important effect on whether estimates of D are biased towards phylogenetic clustering or overdispersion. With this in mind it was important to assess  whether variation in sampling probability between bins was driving increases and decreases in estimates of D. A record comprised mostly of singletons prevented direct estimation of sampling rate. However, two proxies for relative sampling through time based on formation counts showed no obvious correlation with extinction clustering metrics, demonstrating that preservation and discovery potential was not the main driver of differences in clustering between time bins. Likewise the average ghost lineage duration analysis showed that there are sections of the record of all three clades that are heterogeneous (taxa are sampled from one or a few horizons of exceptional preservation), but these do not correspond to unusually high or low values of D.
Implications of the simulation study for the tetrapod case study The literature indicates that sampling rates that translate to the region of 0.1 per lineage million years (lmy) can be expected for marine invertebrate taxa. For Neogene mammals and well preserved marine invertebrate records a rate of 0.5 lmy À1 might be possible (Foote & Sepkoski 1999;Alba et al. 2001). For the majority of Palaeozoic and Mesozoic terrestrial vertebrate clades, particularly those which include many point occurrences, the sampling rate is likely to be on the order of 0.01 lmy À1 (Foote et al. 1999;Friedman & Brazeau 2011). Thus the simulation results at 0.01 ltu À1 are the most representative of our tetrapod dataset, and are therefore the best indicator of the bias that can be expected in our empirical results. Results for 0.01 ltu À1 are on average biased towards underestimating the strength of clustering (Figs 2, 3). However, there is a large variation from results that were measured on different randomly selected samples of the record, and they include estimates of strong clustering.
Given that the large majority of timeslices in the empirical analysis show clustered extinction, it is unlikely, but still possible, that each of these estimates is a biased result based on the sample of the record represented by the cladogram, and that extinction was not in fact phylogenetically clustered.
Strongly clustered median values of D were only produced in the simulation when fossil trees derived from records with high sampling rate were tested. At low sampling rates significant clustering was rarely observed within the simulations. This calls into question how, at the low sampling rates seen for terrestrial vertebrate clades, significant clustering was so commonly found in our real data. One possibility might be that there is a taphonomic bias caused by regional-scale ecological stress in combination with local scale preservation heterogeneity and taxon distributions. A further possibility is that the bifurcating constant rate birth-death model and univariate Brownian motion trait evolution used to simulate the true phylogenies was not an adequate model for the evolutionary process (Hagen et al. 2015). For example, there has been some previous support for the hypothesis that simultaneous phylogenetic selectivity with respect to multiple aspects of phenotype or ecology (i.e. a phylogenetically conserved ecological niche) should lead to strong clustering of extinction (Green et al. 2011).

CONCLUSIONS
The phylogenetic clustering of extinction is a useful measurement that can be made for clades where a robust phylogeny is available, but detailed trait information is lacking. In the absence of adequate data to identify the extinction risk associated with specific phenotypic or life history traits in the geological record, Shows inferred survivals and extinctions, along with whether they could be expected to lead to an increase or decrease in estimates of D. The focal timeslice is below the dashed line. In each section the 'real' phylogeny is on the left and the reconstructed tree is on the right. Samples of the real phylogeny are shown by grey filled circles (these samples represent fossil specimens when applied to real data). Extinctions are shown by filled circles and survivals by empty circles. Any survival or extinction present in the right hand tree but not the left can be considered a pseudosurvival or pseudoextinction respectively. These reconstructed cladograms have only short backwards range extensions, consistent with what would be seen when using the mbl method, if a different timescaling algorithm was used, divergences would move further back in time. These do not represent all possible scenarios but are illustrative of situations that could lead to localized increases or decreases in estimates of clustering.
phylogeny can act as a proxy for the effect of selection (or lack thereof) against the combination of these traits in a species, if those traits are phylogenetically conserved. In combination with previous studies, these results demonstrate that phylogenetic clustering of extinctions is common on all scales and that patterns in the short-term scale up over time to result in similar patterns in the long-term. There are several characteristics prevalent in phylogenies of fossil taxa that can introduce bias into the results of these measurements and they must be carefully considered before the analyses are performed and before conclusions are drawn.
The following key points should be held in mind when measuring clustering of extinction in fossil groups: 1. The sampling rate for the clade of interest should be estimated as accurately as possible to provide a context for interpretation of results. 2. The cladogram should be timescaled with an algorithm appropriate for the sampling rate and incorporate estimation of ancestral relationships if possible. 3. Low preservation and discovery rate leads to a loss of information that causes a bias towards low estimates of the strength of clustering of extinction. 4. High preservation and discovery rate without consideration of ancestor-descendent relationships leads to topologies that cause a bias towards high estimates of the strength of clustering of extinction. Despite the importance of these considerations, phylogenetic clustering of extinction can offer additional insight into macroevolutionary patterns associated with extinction events and how those patterns vary across time and clades.