paleotree: an R package for paleontological and phylogenetic analyses of evolution




1. paleotree is a library of functions for the R statistical computing environment dedicated to analyses that combine paleontological and phylogenetic data sets, particularly the time-scaling of phylogenetic trees that include extinct fossil lineages.

2. The functions included in this library focus on simulating paleontological data sets, measuring sampling rates, time-scaling cladograms of fossil taxa and plotting historical diversity curves.

3. I describe the capabilities and analytical basis of the functions in paleotree by presenting two examples. The first example showcases the simulation capabilities and plotting the output as diversity curves. The second example demonstrates time-scaling a cladogram of fossil taxa and estimating sampling rates and completeness from temporal ranges.


Paleobiology and evolutionary biology are increasingly finding common ground in the shared realm of phylogenetics. Paleobiologists are applying phylogeny-based approaches, developed by evolutionary biologists, to study evolution in the fossil record (Fusco et al. 2012; Lloyd, Wang & Brusatte 2012). The last few years have seen a rise in the use of the free and platform-independent R coding environment (R Development Core Team, 2011) among members of the applied phylogenetics community. This movement has produced a wealth of R libraries dedicated to the analysis of phylogenies and evolutionary patterns (e.g. Paradis, Claude & Strimmer 2004; Harmon et al. 2008; Schliep 2011; Revell 2012). However, only a few libraries deal explicitly with issues encountered in analyses of paleontological phylogenies, such as paleoPhylo, which focuses on depicting relationships among fossil taxa (Ezard & Purvis 2009). In this article, I present the R library paleotree, which implements methods for the analysis of paleontological data and the evaluation of models, with a particular focus on phylogenetic approaches.

paleotree is designed to concentrate on four types of analyses: (a) simulating diversification and sampling in the fossil record, (b) calculating diversity changes across time from different data types, such as taxonomic ranges and phylogenies, (c) estimating sampling parameters from observed temporal ranges and (d) time-scaling cladograms of extinct taxa. paleotree also offers a wide range of utilities related to these areas, such as converting simulated fossil records into standard phylogenetic data structures. paleotree can be installed from any of the mirrors for the Comprehensive R Archive Network, the standard repository for publicly released R libraries.

Many of the functions in paleotree depend on the functions and standards set by the free library ape (Paradis, Claude & Strimmer 2004). The S3 object class phylo established by ape gives programmers a structure that can be used as a common currency among other phylogenetics packages in R, including the option to add modules to the phylo data structure (e.g. Revell 2012). The various time-scaling functions for paleontological phylogenies in paleotree require objects of class phylo as input and also output phylogenies in this format, as do the simulation-conversion functions taxa2phylo and taxa2cladogram. Functions in paleotree that produce time-scaled phylo objects alter the standard format of these objects at output by adding a new element, $root.time. This value informs other functions in paleotree at what time before present the root divergence occurred. It is needed because most functions in ape and other packages assume that the tip furthest from the root on a tree with edge lengths is at the modern time (zero time-units before present), whereas many paleontological trees have their tips located far from the present, such as a phylogeny of non-avian dinosaurs or trilobites. In addition to depending on ape, paleotree also imports functions from the publicly available libraries geiger (Harmon et al. 2008) and phangorn (Schliep 2011).

I describe here a multi-part example that demonstrates several capabilities of the paleotree library. First, I obtain a simulated data set of taxonomic ranges from a birth–death model, model sampling in these lineages and convert the simulation into phylogenetic data sets. I also plot the diversity curves calculated from these different versions of the same simulation. In the second part of the example, using the simulated data set from the first part, I demonstrate the functions in paleotree for time-scaling cladograms of fossil taxa using their ranges. Finally, I discuss using the sampling rate conditioned (‘SRC’) time-scaling methods, including an example of estimating the prerequisite instantaneous sampling rate from taxon ranges.

Example Part 1: simulating fossil data and plotting diversity curves

The main simulation function in paleotree is simFossilTaxa, which simulates diversification under a birth–death model (Kendall 1948; Nee 2006). There are a number of birth–death simulators available via R, such as the function rlineage in ape. simFossilTaxa differs by enmeshing the birth–death process with models of morphological differentiation across lineages. This is important for simulating paleontological data sets, as the interpretation of evolutionary patterns in the fossil record is dependent on the recognition and identification of morphologically-defined taxa.

By manipulating the function’s arguments, several different patterns of morphological differentiation can be simulated (Wagner 1995; Wagner & Erwin 1995; Foote 1996). Differentiation can occur in both daughter lineages of branching events (‘bifurcating cladogenesis’), in only a single daughter lineage (‘budding cladogenesis’) or in neither descendant lineage (‘cryptic cladogenesis’). Shifts in morphology can also be modeled along branches, independent of branching events (‘anagenesis’). simFossilTaxa can simulate any mixture of these processes by treating them as discrete events leading to the recognition of new taxa, instead of directly simulating morphological change. By default, clades are simulated with no anagenesis and pure budding cladogenesis at each branching event. The selected model of morphological differentiation can greatly impact some analyses while having little effect on others, such as estimates of diversity.

The output of simFossilTaxa is a matrix describing the ancestor–descendant relationships and temporal ranges of morphologically distinct taxonomic units. In addition to the options available for simulating morphological divergence, the acceptance of simulation runs in simFossilTaxa can be conditioned on a number of minimum and maximum criteria. A simulation run is accepted for output only when it meets all of these criteria.

The code below simulates a clade using typical birth–death parameters (speciation rate = extinction rate = 0·1 per lineage time-units), conditioned so that the resulting output has between 120 and 150 morphologically distinguishable taxa over its evolutionary history and no living taxa at the end of the simulation. The diversity curve of this data is plotted, depicting the true history of diversity changes for our simulated clade.

library(paleotree)   # load the library

set.seed(1)         # set random seed

taxa <- simFossilTaxa(p=0.1,q=0.1,nruns=1,

 mintaxa=120, maxtaxa=150,

 maxExtant=0)   # no living taxa at the end

# plot true diversity curve as Fig. 1A

Figure 1.

 Three plots of the changes in taxonomic diversity for a simulated clade. The first is the ‘true’ record of the evolutionary history of these taxa, and the second is the result of a simulation of sampling on the true ranges, in continuous time. The third plot is a conversion of the sampled continuous-time ranges onto a time-scale of discrete intervals, each five time-units long.


taxadiv <- taxicDivCont(taxa)

# saves div curve data as ‘taxadiv’

title("True Diversity Curve")

The function sampleRanges can be used to simulate incomplete sampling across a set of lineages output by simFossilTaxa. This function stochastically places sampling events along the original ranges and outputs the temporal ranges that would be observed given those samples. By default, sampleRanges assumes a model where sampling events are independently distributed across branches and throughout time as a Poisson process (Foote & Raup 1996; Foote 1997), with some instantaneous rate. By altering the function’s parameters, sampling can be simulated under more complex, time-varying models (Liow, Quental & Marshall, 2010).
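To make the default sampling model concrete, here is a minimal base-R sketch of Poisson sampling along taxon ranges. It is not the code used inside sampleRanges; the helper name sampleRangePoisson, its arguments and the three example ranges are hypothetical, chosen only for illustration. The sketch assumes a constant rate r, so the number of sampling events on a range is Poisson-distributed with mean r times the duration, and the events fall uniformly along the range.

```r
# Hypothetical helper, not part of paleotree: simulate Poisson sampling
# along one taxon's true range. Time is in units before present, so the
# first appearance (fad) is numerically larger than the last (lad).
sampleRangePoisson <- function(fad, lad, r) {
  duration <- fad - lad
  # the number of sampling events is Poisson with mean r * duration
  n <- rpois(1, lambda = r * duration)
  if (n == 0) return(c(FAD = NA_real_, LAD = NA_real_))  # never sampled
  # given n, the event times are uniform along the range
  events <- runif(n, min = lad, max = fad)
  c(FAD = max(events), LAD = min(events))
}

set.seed(1)
# assumed true ranges for three taxa (first and last appearance)
trueRanges <- cbind(FAD = c(100, 80, 60), LAD = c(70, 75, 10))
observed <- t(apply(trueRanges, 1, function(x)
  sampleRangePoisson(x["FAD"], x["LAD"], r = 0.2)))
```

Each observed range is nested within its true range, and a taxon with no sampling events is returned with missing dates, which is how incomplete sampling both shortens ranges and removes taxa from the record.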

In the example below, the instantaneous sampling rate is set to 0·2 per lineage time-units. The function binTimeData takes sampled continuous-time ranges as input and returns these same ranges on a discrete interval time-scale, here with intervals five time-units long. So that all three diversity curves are plotted against the same horizontal time-scale, we will directly input the interval times saved from calculating the true diversity curve above.

# sample ranges

rangesCont <- sampleRanges(taxa,r=0.2)

# plot figure 1B

taxicDivCont(rangesCont,int.times=taxadiv[,1:2])

title(c("Sampled Diversity Curve",

 "(Continuous Time)"))

# bin ranges into discrete intervals

rangesDisc <- binTimeData(rangesCont,

 int.length=5)   # intervals five time-units long

# plot figure 1C

taxicDivDisc(rangesDisc,int.times=taxadiv[,1:2])

title(c("Sampled Diversity Curve",

 "(Discrete Time)"))

The resulting figure reveals that although the diversity histories estimated from the sampled ranges have a similar pattern of increases and decreases as the original data set, the under-sampled simulation has less diversity observed per time-unit (Fig. 1).
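The counts behind such diversity curves can be sketched in a few lines of base R. This is only an illustration of the idea, not the code of taxicDivCont or taxicDivDisc; the helper name countDiversity, the three example ranges and the interval bounds are all hypothetical. A taxon is counted in an interval if its range overlaps that interval.

```r
# Hypothetical helper, not paleotree's taxicDivCont: count the taxa whose
# range overlaps each interval. Dates are in time-units before present.
countDiversity <- function(ranges, intStart, intEnd) {
  vapply(seq_along(intStart), function(i) {
    sum(ranges[, "FAD"] > intEnd[i] & ranges[, "LAD"] < intStart[i])
  }, integer(1))
}

# assumed ranges for three taxa
ranges <- cbind(FAD = c(20, 14, 9), LAD = c(12, 3, 1))
intStart <- c(20, 15, 10, 5)   # older bound of each interval
intEnd   <- c(15, 10, 5, 0)    # younger bound of each interval
div <- countDiversity(ranges, intStart, intEnd)
```

Shortening any of the ranges, as incomplete sampling does, can only lower these counts or leave them unchanged.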

The simulated fossil record can also be converted into a phylogeny using two different functions, depending on what type of data is desired. taxa2phylo constructs the time-scaled tree that perfectly describes the set of relationships for particular points in time within the simulated taxon ranges. Taxonomic identity of branches is lost; only the historical patterns of branching matter. Ancestral taxa with multiple descendants are chopped into multiple segments to become multiple branches within the output phylogeny. Tip taxa represent the position of the per-taxon instantaneous observations; by default, these are the last occurrences of each taxon, but this can be changed with the argument obs_time. In general, users should use taxa2phylo in simulation studies, such as when simulating trait data with functions that require phylo objects.

tree <- taxa2phylo(taxa)

# get diversity curve

phyloDiv(tree,int.times=taxadiv[,1:2])

Alternatively, taxa2cladogram produces an unscaled cladogram that contains the set of nesting relationships that are resolvable with morphological data among the input taxa, given the pattern of morphological shifts and ancestor-descendant relationships in the input. The result emulates an ideal cladistic analysis with a very large number of informative characters, only capturing the true resolvable relationships among a simulated set of taxa.

Output from taxa2cladogram is generally poorly resolved, as the sampling of ancestral taxa and static taxa with multiple descendants will produce non-nesting relationships that are not resolvable using typical morphological phylogenetics and thus cannot be portrayed on a cladogram (Smith 1994; Wagner & Erwin 1995).

cladogram <- taxa2cladogram(taxa,plot=TRUE)

The diversification histories of fossil taxa in paleotree can be further investigated by examining the phylogenetic structure among lineages extant at specific dates in the history of a clade. This is possible with the function timeSliceTree, which removes the portions of a phylogeny that postdate a specified point in time. For this example, the tree is sliced at 800 time-units before present, during a diversity peak in the simulated clade. Terminal branches that have gone extinct by that date can be dropped using the argument drop.extinct, allowing us to simulate the type of data generally available to biologists. With extinct branches dropped, timeSliceTree outputs an ultrametric phylogeny, which can be visualised as a lineage-through-time plot with phyloDiv.

tree800 <- timeSliceTree(tree,

 sliceTime=800,drop.extinct=TRUE)

# LTT plot (Figure 2)

phyloDiv(tree800)

Figure 2.

 The phylogeny at top depicts the phylogenetic relationships among taxa present at 800 time-units before present in the simulated clade, with extinct lineages removed, obtained using the function timeSliceTree. The figure below it is a plot of the increase in lineage diversity reconstructed from the phylogeny above, with richness plotted on a log-scale (numbers on the axis are actual richness, not log-richness).


Example Part 2: estimating sampling rates and time-scaling paleontological phylogenies

Often, workers who wish to use phylogenetic approaches for studying evolution in the fossil record begin with an unscaled cladogram and a set of taxon ranges. To further complicate matters, fossil taxa are generally known only as first and last appearances in discrete temporal intervals, adding temporal uncertainty to when they occurred, and the cladograms may be partially unresolved. In paleotree, the function timePaleoPhy time-scales trees, using any of several typically applied time-scaling methods (e.g. Hunt & Carrano 2010; Hopkins 2011; Lloyd, Wang & Brusatte 2012). The default time-scaling method, referred to as the ‘basic’ method in the documentation, sets node ages equal to the first appearance time of their earliest descendant taxon (Smith 1994).
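The logic of the ‘basic’ method can be illustrated without any phylogenetics package. The sketch below is not the timePaleoPhy implementation; the four-taxon cladogram, the first appearance dates and the helper name nodeAge are hypothetical. It stores the tree as an ape-style edge matrix and assigns each node the first appearance date of its earliest descendant.

```r
# Hypothetical illustration of the 'basic' method on a four-taxon cladogram
# ((A,B),(C,D)), stored as an ape-style edge matrix (parent, child).
# Tips are numbered 1-4 and internal nodes 5-7, with node 5 as the root.
edge <- rbind(c(5, 6), c(6, 1), c(6, 2), c(5, 7), c(7, 3), c(7, 4))
# assumed first appearance dates, in time-units before present
fad <- c(A = 30, B = 25, C = 30, D = 12)

nodeAge <- function(node) {
  if (node <= length(fad)) return(fad[[node]])   # a tip sits at its own FAD
  children <- edge[edge[, 1] == node, 2]
  max(vapply(children, nodeAge, numeric(1)))     # oldest descendant FAD
}

ages <- vapply(1:7, nodeAge, numeric(1))
# branch length = age of the parent node minus age of the child
edgeLength <- ages[edge[, 1]] - ages[edge[, 2]]
```

With these assumed dates, four of the six branches have zero length, because each node is placed exactly at the first appearance of its earliest descendant.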

timePaleoPhy is specialised for data sets where taxon occurrences are known in continuous time, but paleotree also offers bin_timePaleoPhy, a wrapper for timePaleoPhy that accepts taxon occurrences in discrete time intervals. When bin_timePaleoPhy is used, per-taxon first and last occurrence dates are pulled from a uniform distribution, bounded by the upper and lower bounds of the intervals from which those occurrences are listed. This continuous-time data set, stochastically built from the discrete-time input, is then used as input for timePaleoPhy. If a pair of dates would imply a negative duration for a taxon, they are discarded and drawn again. This is similar to methods employed by Lloyd, Wang & Brusatte (2012).
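The date-drawing step described above can be sketched as a simple rejection sampler. This is a minimal illustration rather than the code used by bin_timePaleoPhy; the helper name drawDates and the interval bounds are hypothetical.

```r
# Hypothetical helper, not the paleotree implementation: draw continuous
# first and last appearance dates for one taxon from the intervals in
# which they were observed. Interval bounds are c(start, end) in
# time-units before present, with start > end.
drawDates <- function(firstInt, lastInt) {
  repeat {
    fad <- runif(1, min = firstInt[2], max = firstInt[1])
    lad <- runif(1, min = lastInt[2], max = lastInt[1])
    if (fad >= lad) break   # discard pairs implying a negative duration
  }
  c(FAD = fad, LAD = lad)
}

set.seed(1)
# a taxon first and last observed in the same interval (20 to 15)
sameBin <- drawDates(c(20, 15), c(20, 15))
# a taxon ranging from interval 30-25 to interval 20-15
twoBins <- drawDates(c(30, 25), c(20, 15))
```

Redrawing only matters for taxa whose first and last occurrences fall in the same or overlapping intervals; for the second taxon every draw is accepted.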

To demonstrate the time-scaling capabilities of paleotree, I will give an example of time-scaling the ideal cladogram simulated above with the discrete-time ranges, as the majority of paleontologists using this package will only have occurrences in discrete intervals.

timetree <- bin_timePaleoPhy(cladogram,

 rangesDisc)

# plot the tree

plot(timetree)


In the example above, bin_timePaleoPhy output a time-scaled phylogeny where the tips are placed at the first appearance dates of the included taxa. In other words, the observed ‘terminal’ portions of those taxonomic ranges were not added onto the output tree. This default setting can be changed using the argument add.term. Typically, terminal branches are not added in analyses of characters considered static over taxon ranges, but should be included in analyses of diversification.

timetree <- bin_timePaleoPhy(cladogram,

 rangesDisc,add.term=TRUE)

# plot the tree

plot(timetree)


By default, bin_timePaleoPhy outputs only a single time-scaled tree. As the occurrence dates used are stochastically drawn from uniform distributions, this function automatically returns a warning that strong interpretations should not be made on a single output tree because of this stochastic element. Analyses based on a single tree may produce misleading results. Users should instead consider a sample of trees with different dates stochastically chosen, by increasing the ntrees argument to a suitably large sample size. When ntrees is increased, the output is of the ape class ‘multiPhylo’ (Paradis, Claude & Strimmer 2004).

Cladograms of fossil taxa are often not fully resolved, presumably owing to uncertainty in the tree topology. To deal with this, a user could individually time-scale a sample of most parsimonious trees, which are fully resolved (Bell & Braddy, in press). Less ideally, a user could input a partially unresolved consensus tree and randomly resolve the soft polytomies, generating a sample of time-scaled trees. This can be implemented within bin_timePaleoPhy via the argument randres and increasing ntrees. Examining multiple trees will also help us account for both the uncertainty produced by drawing taxon appearance dates from uniform distributions and the uncertainty in the phylogenetic relationships. Below is a short example of generating and plotting nine time-scaled phylogenies.

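timetrees <- bin_timePaleoPhy(

 cladogram,rangesDisc,

 ntrees=9,randres=TRUE)

layout(matrix(1:9,3,3))   # 3 x 3 grid of panels

for(i in 1:9){

 plot(timetrees[[i]])

}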


These time-scaled trees are very similar to each other but do differ slightly in their branch lengths and in the relationships among taxa. This variation could make a considerable difference in the results of some comparative analyses. For analyses of real data, we would want to generate samples of time-scaled trees much larger than nine to fully test our hypotheses with.

The function multiDiv calculates the median diversity and 95% quantile diversity across an input sample of phylogenies. Although not shown here, multiDiv can also calculate diversity curves for lists that include multiple data types, such as taxonomic range data.

multiDiv(timetrees)
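The summary that multiDiv reports can be illustrated with base R alone. The sketch below is not the multiDiv code; the matrix divCounts holds made-up lineage counts, and I assume here that the 95% quantiles are the two-sided 2.5% and 97.5% quantiles taken across trees within each interval.

```r
# Hypothetical illustration: rows are time intervals and columns are the
# lineage counts obtained from each tree in a sample (made-up values).
set.seed(1)
divCounts <- matrix(rpois(5 * 100, lambda = 10), nrow = 5)
# per-interval median and two-sided 95% quantile envelope across trees
divSummary <- t(apply(divCounts, 1, quantile,
                      probs = c(0.025, 0.5, 0.975)))
colnames(divSummary) <- c("lower", "median", "upper")
```

Each row of divSummary then gives the envelope for one interval, which is the information needed to draw a median curve with a shaded quantile band.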

The default basic time-scaling function commonly produces zero-length branches, which are often problematic for comparative analyses (Hunt & Carrano 2010). The five time-scaling methods applicable with timePaleoPhy are explained in detail within the function’s documentation. In addition to those methods, the paleotree library also offers the SRC time-scaling method for paleontological phylogenies (Bapst, in preparation), as utilised by the functions srcTimePaleoPhy and bin_srcTimePaleoPhy.

To use the SRC methods, an estimate of the instantaneous sampling rate is needed. For discrete temporal data, sampling probabilities can be estimated based on the frequency distribution of taxon durations (Foote & Raup 1996). The function getSampProbDisc finds the per-interval sampling probability using models fit with maximum likelihood, using the approach developed by Foote (1997). A similar function getSampRateCont exists for fitting sampling models to continuous-time occurrences. getSampProbDisc also outputs an estimate of the expected completeness of the fossil record, that is, the proportion of taxa in a group that we would expect to have sampled at least once. It is very important for the likelihood optimiser of getSampProbDisc to converge, as the likelihood surface for estimating these parameters can occasionally be flat and uninformative, leading to inaccurate parameter estimates. If convergence does not occur, getSampProbDisc will print a warning message and users should refer to the paleotree documentation.

#obtain sampling probability

MLfit <- getSampProbDisc(rangesDisc)

The sampling probability is not equivalent to the sampling rate, but can be converted using the function sProb2sRate. To convert the estimate, the mean interval length must be supplied.
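The relationship between the two quantities follows from the Poisson sampling model: if sampling events occur at instantaneous rate r, the probability of at least one event in an interval of length t is R = 1 - exp(-r t), so that r = -log(1 - R) / t. The sketch below applies this identity in base R. The function names probToRate and rateToProb are hypothetical, and I assume, rather than assert, that sProb2sRate performs the same conversion; its documentation should be consulted for the details.

```r
# Hypothetical re-implementation for illustration only; in practice use
# paleotree's sProb2sRate. Assumes Poisson sampling within an interval.
probToRate <- function(R, intLength) -log(1 - R) / intLength
rateToProb <- function(r, intLength) 1 - exp(-r * intLength)

# a rate of 0.2 per lineage time-unit over intervals five time-units long
R <- rateToProb(0.2, intLength = 5)
# converting the probability back recovers the rate
r <- probToRate(R, intLength = 5)
```

The conversion depends on the interval length, which is why a mean interval length has to be supplied when intervals are not all the same duration.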

#obtain sampling rate


This estimate of the instantaneous rate can be specified as an argument in bin_srcTimePaleoPhy. There are several important differences between the SRC algorithm and the other time-scaling methods. For example, in the SRC algorithm, polytomies in the input cladogram will always be resolved during time-scaling, terminal ranges are always added onto the phylogeny, and some taxa may be stochastically identified as ancestors (details in the paleotree documentation and Bapst in preparation).


The library paleotree offers a number of functions useful for analysing or simulating phylogenetic patterns of evolution in the fossil record. A number of future improvements are planned to further extend these capabilities, such as altering getSampProbDisc to accommodate ranges of taxa that are still extant in the modern day.


I would like to particularly thank Michael Foote for comments and helping me troubleshoot several functions. I would also like to thank Matthew Pennell, Emily King, Annat Haber, Jonathan Mitchell, Graeme Lloyd and an anonymous reviewer for discussions and suggestions that have greatly improved paleotree and this manuscript. The function timePaleoPhy is inspired by code originally provided to me by Graeme Lloyd and Gene Hunt.