Causality from palaeontological time series

As custodians of deep time, palaeontologists have an obligation to seek the causes and consequences of long‐term evolutionary trajectories and the processes of ecosystem assembly and collapse. Building explicit process models on the relevant scales can be fraught with difficulties, and causal inference is typically limited to patterns of association. In this review, we discuss some of the ways in which causal connections can be extracted from palaeontological time series and provide an overview of three recently developed analytical frameworks that have been applied to palaeontological questions, namely linear stochastic differential equations, convergent cross mapping and transfer entropy. We outline how these methods differ conceptually, and in practice, and point to available software and worked examples. We end by discussing why a paradigm of dynamical causality is needed to decipher the messages encrypted in palaeontological patterns.

WE INHABIT a world fundamentally shaped by the deep history of geobiological interactions. Long-term ecological and evolutionary dynamics coupled with geological processes have made our planet a vast reservoir of living and fossil biomass that is literally fuelling our modern civilization, and the rapid depletion of this biomass holds dire consequences for humankind (Schramski et al. 2015). As modern ecosystems head towards uncharted territories, geohistorical data are our only record of ecosystems undisturbed by human activities and of biotic responses to global change in the past (Dietl et al. 2015). Palaeontologists thus have an obligation not only to document the rich complexity of life's history, but also to extract from this complexity the causal structures that govern its dynamics (Fig. 1).
The recent growth of large-scale global and regional palaeontological data compilations, as well as high-resolution examination of local stratigraphic sections, has provided momentum for palaeontologists to rise to this challenge. Increasingly comprehensive records of taxonomic richness, abundance and phenotypic variability, combined with improved chronological and phylogenetic constraints, yield time series that can capture the dynamics of underlying biological processes, and be meaningfully compared with other deep-time records of environmental change. We will argue that these time series beg for causal explanations that are not restricted to unique events, but accommodate dynamical, temporally extensive modes of causality.
Although a strong interest in using quantitative approaches to understand temporal trends in palaeontology can be traced back to Simpson (1944) and earlier (Brinkmann 1929), we believe that the first person to introduce our field to formal time series analyses was David Raup. While his collaboration with Gould, Schopf and Simberloff on stochastic evolution was ongoing (Raup et al. 1973), he proposed a more formalized use of stochastic time-series models (random walks) to study evolutionary time series (Raup 1977a). These models became the canonical null hypotheses against which presumed evolutionary patterns were judged (Raup & Crick 1981; Bookstein 1987; Roopnarine et al. 1999; Hannisdal 2006). Today, the use of stochastic time series models of phenotypic evolution has shifted from null hypothesis testing to parameter estimation and model selection (Hunt 2006; Hannisdal 2007; Hunt & Rabosky 2014). If a time series of phenotypic variability in fossil organisms is compared to a group of different models, then a better fit to a specific model is usually interpreted as statistical support for the (implicitly causal) mechanism implied by the preferred model. Raup & Sepkoski (1984) also pioneered the use of Fourier analysis to study the frequency power spectrum of extinction intensities, concluding that there is a c. 26 million-year periodicity in extinction at the family level. They speculated on astronomical and astrophysical causes of the periodicity (and therefore of the extinctions), because 'purely biological or earthbound physical cycles seem incredible'. Their paper was promptly followed by Kitchell & Pena (1984), who used autoregressive integrated moving average (ARIMA) models to infer a dominant periodicity of about 31 million years with substantial uncertainty.
Kitchell & Pena stressed that 'periodicity in a time series is insufficient evidence that a periodic external force is causally responsible', not least because in natural systems, (quasi-)periodicity can arise through internal dynamics without periodicity in the external forcing, as in the case of predator-prey cycles. The existence of a significant periodicity in the Phanerozoic extinction record, and the extent to which matching frequencies in other processes (astronomical or Earth-bound) imply a causal connection, remain topics of continued debate in the current palaeontological literature (Melott & Bambach 2017; Erlykin et al. 2018). Similar controversies surrounding the causal implications of matching periodicities are simmering in related fields, such as the existence of Milankovitch cycles in Palaeozoic sedimentary sequences (Smith et al. 2016) or in seafloor bathymetry (Huybers et al. 2016; Olive et al. 2016). One way to break the impasse is to confront our time series with methods that seek to distinguish causative from correlative relationships.

From prediction to causality
Causality is a slippery concept that is widely discussed across all scientific disciplines (e.g. Hill 1965; Winship & Morgan 1999; Pearl 2009; Illari et al. 2011). For instance, the search term 'causality and science' returned more than 300 articles in the period 2008-2017 from the journal Philosophy of Science alone (averaging six articles per issue). We will not attempt to summarize the history of the causality concept here, but we note that the criteria for identifying causal relationships have been continuously subject to revision in the sciences. For instance, Karl Pearson simply equated causation with correlation in the form of contingency tables (Pearson 1911). Hill (1965) famously discussed nine minimal criteria for empirically identifying causal agents of disease that were the benchmark for epidemiologists for decades. Today, only one of these criteria, temporal precedence (i.e. the disease-causing agent needs to occur before the disease), is generally accepted in its original form (Fedak et al. 2015).
Controlled experiments, in which the effect of a single factor is isolated and tested by direct manipulation, are considered to be the gold standard of scientific causal assessment. For a system of the kind outlined in Figure 1, however, experiments are confined to numerical representation of processes in silico. For example, Earth system models used in near-term climate projections are arguably the most sophisticated tools available for numerical experiments (Taylor et al. 2012). Yet, despite the vast complexity of these models, the inclusion of key biological processes is still in its infancy (e.g. Wieder et al. 2015) and experiments are currently limited to millennial time scales or shorter. Models of intermediate complexity are better suited for exploring mechanistic hypotheses in deep time (e.g. Meyer et al. 2016). Explicit process models, including highly idealized mathematical representations, will undoubtedly play an increasingly important role in causal assessment in palaeontology.
Ultimately, however, process models need to specify a model structure assumed to plausibly represent causal mechanisms, which creates a vulnerability to model misspecification that can be difficult to assess (Wood & Thomas 1999; Babtie et al. 2014).

Fig. 1. The multiple causes of the fossil record. Schematic representation of the multi-directional relationships between living, fluid and solid earth components that combine to create the fossil record. Fluid earth includes atmosphere and hydrosphere. Note that while the rock record is viewed as a physical object with mass and chemical composition that interacts with the other active components (denoted by circles), the fossil record is here viewed as a subset of the rock record. Adapted from Peters & Heim (2011).

On the spectrum between explicit process models and statistical analysis of patterns of correlation, there is room for concepts that use either stochastic parameterizations or an entirely equation-free technique to detect causal connections directly from observed time series. In the following, we present three such methods that are amenable to palaeontological time series analyses, namely: (1) linear stochastic differential equations; (2) convergent cross mapping; and (3) transfer entropy. Before we present the nuts and bolts of these techniques, it is necessary to provide a brief introduction to two underlying concepts from the field of time series analysis: Granger causality and dynamical system reconstruction.
Granger causality rests on the idea that a cause both precedes and helps predict its effect. Inspired by earlier theories of prediction, Granger (1963, 1969) implemented this idea in the context of econometric time series analysis. His original formulation was in terms of linear vector autoregressive models, in which the value of a variable at a given time is modelled as a linear weighted sum of a set of values from its own past, and from the past of a set of other variables. Each variable is a time series representing a stochastic process, and the task is to find the weights that minimize the prediction error. In the classical sense, a variable X is said to 'Granger cause' Y if our ability to predict Y declines when X is removed from the set of predictive variables. Granger causality is a special case of transfer entropy (see Transfer Entropy, below) and the two are equivalent for Gaussian variables (Barnett et al. 2009).
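As a minimal illustration (a Python sketch on simulated data; all variable names and parameter values are ours, not from any published analysis), the classical linear formulation amounts to comparing the one-step prediction error of a restricted autoregressive model with that of a model that also includes the past of the putative cause:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a bivariate system in which X drives Y with a one-step lag.
n = 500
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.3 * rng.normal()

def one_step_error(target, predictors):
    """Variance of the least-squares one-step prediction residuals."""
    T = np.column_stack(predictors)
    coef, *_ = np.linalg.lstsq(T, target, rcond=None)
    return np.var(target - T @ coef)

ones = np.ones(n - 1)
# Restricted model: predict y(t) from y(t-1) alone.
err_restricted = one_step_error(y[1:], [y[:-1], ones])
# Full model: also include x(t-1) as a predictor.
err_full = one_step_error(y[1:], [y[:-1], x[:-1], ones])

# X 'Granger causes' Y: removing x from the predictor set worsens prediction.
print(err_restricted > err_full)  # prints True
```

In practice the comparison is made with a formal statistic (e.g. an F-test on the nested models) rather than a raw difference in residual variance.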
It is not uncommon to explore the 'cause precedes effect' criterion by performing time-lagged cross-correlation analyses on a pair of time series, where the lagged series is interpreted as the effect if a significant cross correlation is found between the two offset time series. Such an approach may yield unacceptable levels of false positives and false negatives (Liow et al. 2015), owing to mismatches between the underlying processes and the implicit assumptions inherent in the lagged analyses performed.
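The pitfall is easy to reproduce (a Python sketch; the two series are simulated and deliberately independent): two unrelated random walks often display strong lagged cross-correlations, whereas their stationary first differences do not:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two INDEPENDENT random walks: by construction there is no causal link.
n = 500
a = np.cumsum(rng.normal(size=n))
b = np.cumsum(rng.normal(size=n))

def lagged_corr(x, y, lag):
    """Pearson correlation between x(t) and y(t + lag)."""
    return np.corrcoef(x[: n - lag], y[lag:])[0, 1]

# Best absolute correlation over lags 0-10 for the raw (non-stationary)
# walks, and for their first differences (stationary white noise).
best_raw = max(abs(lagged_corr(a, b, k)) for k in range(11))
da, db = np.diff(a), np.diff(b)
best_diff = max(abs(np.corrcoef(da[: len(da) - k], db[k:])[0, 1])
                for k in range(11))

# The raw walks often appear strongly 'correlated' purely through
# serial correlation; the differenced series reveal the truth.
print(best_raw, best_diff)
```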
In the context of time series analysis, (lagged) correlations and associated regression analysis need to correct for serial (or auto-)correlation, which is one of the problems Granger helped elucidate with his work on cointegration, unit root tests and causality. We do not intend to review the use of classical methods like regression analysis here (see Brown et al. 2011 for a review of their use in the climate change ecology literature). We note, however, that the development of many standard statistical methods, including regression, was motivated by a classical experimental setting in which the cause-effect mechanism is known and regression coefficients are intended to measure (typically linear) effect size. The methods reviewed here are meant to expand our toolkit to handle more general systems where mechanisms may be unknown and experiments are not feasible.
The criterion of improved prediction also has its limitations. For example, if watching the weather report helps us to predict the weather, then it Granger-causes the weather, even if it is obviously not a true cause. In addition, Granger's definition makes the unrealistic assumption that all relevant variables are included in the analysis. A variety of extensions and modifications of the original definition have since been proposed (e.g. Chen et al. 2004; Eichler 2005; Barrett et al. 2010; Barnett & Seth 2015), including the much more flexible linear stochastic differential equation framework described below, which permits the identification of unobserved (latent) confounding factors and feedback.
Linear systems have the distinct advantage that they can be broken down into parts that can be analysed separately and then recombined. Matters become more difficult when there is strong nonlinearity, which is ubiquitous in nature and in our everyday experience. In a system whose components are interdependent, such as ecosystems or Earth systems (Fig. 1), variables can interact in such a way that any linear correlations between them can change as the state of the system changes. Small perturbations at tipping points can be associated with sudden regime shifts in the response of ecosystems to environmental change (Burkett et al. 2005) and in the dynamics of speciation (Nosil et al. 2017). May (1976) alerted biologists to the fact that even the simplest mathematical relations between two species can yield highly complex, nonlinear and unpredictable dynamics. Most nonlinear dynamical systems cannot be solved analytically, but a geometric approach can be used to characterize their properties. From a dynamical systems perspective, predictive causality is derived from the geometry of the dynamics, which can be reconstructed without accounting for all the component variables constituting the whole system. To provide some context for the two equation-free methods described herein, we need to highlight a few key concepts.
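May's observation is easy to verify (a Python sketch): the logistic map, a one-line model of density-dependent growth, is chaotic at r = 4, and trajectories started a billionth apart quickly diverge:

```python
import numpy as np

def logistic(r, x0, n):
    """Iterate the logistic map x -> r * x * (1 - x), May's (1976) example."""
    x = np.empty(n)
    x[0] = x0
    for i in range(1, n):
        x[i] = r * x[i - 1] * (1 - x[i - 1])
    return x

# In the chaotic regime (r = 4), two trajectories starting a tiny
# distance apart diverge rapidly: sensitive dependence on initial conditions.
a = logistic(4.0, 0.2, 50)
b = logistic(4.0, 0.2 + 1e-9, 50)
print(abs(a[-1] - b[-1]) > abs(a[0] - b[0]))
```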
In dynamical systems theory, the state of the system is described by coordinates in a state space (or phase space for continuous-time systems). The number of interacting components in the system determines the dimensionality of its state space, and the time evolution of the system forms trajectories in this space. In an open, dissipative system that exchanges energy and matter with its environment, any dynamics will cease unless there is some driving force. Over a sufficient period of time, the driving force and the dissipation will tend to balance, and the system will settle on a typical behaviour. This typical behaviour, towards which the system 'gravitates', is confined to a subset of the state space known as an attractor. Examples of attractors can be stable fixed points (e.g. a pendulum with friction will converge on a position of rest), limit cycles (e.g. predator-prey oscillations) or more complex subsets such as the strange attractors of chaotic systems (e.g. the famous 'butterfly'-shaped attractor of the Lorenz system). The attractor is itself an invariant subset of the dynamics: if we pick any point on the attractor and follow its evolution under the dynamics of the system, the resulting trajectory lies entirely on the attractor. This implies that to study the long-term behaviour of the system we can seek to reconstruct the attractor, rather than the entire set of unobserved (latent) variables constituting the whole system. State-space (attractor) reconstruction is a starting point for the two equation-free time series analysis approaches outlined below.
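For a concrete example of a fixed-point attractor (a Python sketch with arbitrary parameter values), consider a damped, small-angle pendulum integrated with the Euler method: trajectories from different initial conditions all settle at the rest state:

```python
def settle(theta0, omega0, dt=0.01, steps=5000):
    """Euler-integrate a damped, small-angle pendulum:
    theta'' = -c * theta' - (g / L) * theta, with c = 0.5 and g / L = 9.81."""
    theta, omega = theta0, omega0
    for _ in range(steps):
        theta, omega = (theta + dt * omega,
                        omega + dt * (-0.5 * omega - 9.81 * theta))
    return theta, omega

# Trajectories from very different initial conditions converge on the
# same fixed-point attractor at (theta, omega) = (0, 0).
print(settle(1.0, 0.0), settle(-0.5, 2.0))
```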
Despite their conceptual and practical differences, all three methods discussed in this review are ultimately rooted in the notion of prediction improvement as a statistical way of making inroads to causality. They all begin with measurements (assumed to represent variables that capture unobserved processes of interest), which are arranged as time series data that are then subjected to statistical analyses (Fig. 2).

The causal graph formalism is widely used for modelling causality, especially in the social sciences, and puts special emphasis on interventions. For example, watching the weather report may Granger-cause the weather, but through intervention we could actively manipulate the weather forecast to discover that it does not cause the weather. Concerns have been raised, however, that the causal graph formalism is less well suited to explaining spatially and structurally complex biological phenomena (Kaiser 2016), and that the interventionist approach is not applicable to dynamical systems in general (Weber 2016). Further discussion of causal graph methods is outside the scope of this review.
Some of the key features of the methods described below are summarized in Table 1. These features are subject to change because all three methods are topics of active research. We invite the reader to explore the methods on dedicated websites that provide software, detailed user manuals and worked examples cited below.

TIME-SERIES APPROACHES FOR CAUSAL INFERENCES
Linear stochastic differential equations

Linear stochastic differential equations (SDEs) are applied to (palaeo)biological questions most commonly in the form of an Ornstein-Uhlenbeck (OU) process for describing time series of phenotypes (e.g. Lande 1976; Hunt 2006). In the last few years, SDEs have also been used in palaeontology to model systems within an extended Granger causal framework.

Fig. 2. Extracting causal structures from palaeontological time series. Ecological and evolutionary processes interact with environmental dynamics to produce a fossil record (represented in the diagram by iconic fossils), which we then sample. Selected measurements (shown as output from a database), together with chosen environmental data, are then arranged in temporal sequence, as time series. To avoid any subliminal ranking, we denote variables by generic symbols. If these time series are sufficiently informative, we can use this inherent dynamical information to characterize the strength and directionality of causal connections between the variables, represented as a causal diagram. Straight arrows between the symbols indicate causal directionality and the thickness of the arrows indicates the strength of the relationships.
To explain how SDEs can be used as a causal inference tool, we first briefly describe the structure of a basic linear SDE. An SDE is simply a differential equation in which at least one term is a stochastic process. We can write a basic linear SDE in this form:

dX(t) = −α(X(t) − μ)dt + σdB(t)    (1)

where X(t) is the process of interest (dependent variable) that changes with the independent variable time t. The first part of the right-hand side of the equation, −α(X(t) − μ)dt, describes the deterministic part of how X changes with time. The second part, σdB(t), is the stochastic component, where B(t) denotes Brownian motion. α and μ are parameters that characterize the deterministic part of the process while σ describes the stochastic part. α, μ and σ are often assumed to be constant in biological applications, but they need not be. If the deterministic component is absent, such that dX(t) = σdB(t), and σ is constant, the expression reduces to a random walk (commonly referred to as Brownian motion, or a Wiener process). Where α, μ and σ are all constant, Equation 1 represents an OU model which, roughly speaking, describes a random walk with a tendency to move towards a long-term average (μ). For instance, if X represents body size, and the left-hand side of Equation 1 reflects how fast body size is changing, then α indicates how strongly body size is 'pulled' towards the average body size μ, while σ is the standard deviation of the stochastic changes in body size.
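Equation 1 can be simulated with the Euler-Maruyama scheme (a Python sketch; parameter values are arbitrary), which makes the 'pull' towards the long-term average tangible:

```python
import numpy as np

rng = np.random.default_rng(42)

# Euler-Maruyama simulation of Equation 1:
#   dX(t) = -alpha * (X(t) - mu) dt + sigma * dB(t)
alpha, mu, sigma = 1.0, 5.0, 0.5
dt, steps = 0.01, 20000
x = np.empty(steps)
x[0] = 0.0                                  # start far from the long-term mean
for i in range(1, steps):
    x[i] = x[i - 1] - alpha * (x[i - 1] - mu) * dt \
           + sigma * np.sqrt(dt) * rng.normal()

# The trajectory is 'pulled' towards mu = 5; its stationary standard
# deviation is sigma / sqrt(2 * alpha).
print(np.mean(x[steps // 2:]))
```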
To illustrate how we can use SDEs to examine whether there is a causal (Granger) relationship between two time series, let us consider a hypothetical example. Say that we would like to test whether sea surface temperature (SST) has influenced body size in a taxon of interest. We need time series of both an SST proxy (SST) and size measurements (Size) for our taxon that overlap substantially in time (although they need not be measured at the same time points).
We could describe SST as

dSST(t) = −α_SST(SST(t) − μ_SST)dt + σ_SST dB_SST(t)    (2)

and Size as

dSize(t) = −α_Size(Size(t) − μ_Size)dt + σ_Size dB_Size(t)    (3)

by simply substituting X in Equation 1 with SST or Size, and adding SST and Size subscripts to signify that the parameters are specific to SST and Size. As such, there is no modelled relationship between SST and Size. However, we can imagine that SST and Size are correlated, and write

dSize(t) = −α_Size(Size(t) − μ_Size)dt + σ_Size(ρ dB_SST(t) + √(1 − ρ²) dB_Size(t))    (4)

In Equation 4, there is an extra term ρ, which describes the temporal correlation between SST and Size in the stochastic part of the equation. Note also that we could have written the same for Equation 2, and the described relationship between SST and Size would be the same.
To explicitly model a Granger causal relationship between SST and Size, in which SST drives Size, we retain Equation 2 and write for Size

dSize(t) = −α_Size(Size(t) − μ_Size − β SST(t))dt + σ_Size dB_Size(t)    (5)

where β scales the influence of SST. The deterministic part of the SST equation has entered the deterministic part of the Size equation, such that Size now depends on SST. The stochastic part of the equation is unchanged. If we wanted to include the reversed direction of causality, we would need to write an equivalent equation for SST (which, unlike the correlative formulation, is not symmetrical). Note that no causal link from another time series can be modelled if the putative 'response' time series is purely stochastic. We can then compare the different models for SST and Size using a chosen model comparison approach.
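The whole exercise can be sketched numerically (in Python, on a crude Euler discretization with invented parameter values; this is an illustration, not the estimation machinery of the SDE framework itself): simulate an OU-like SST driving Size, then compare a model without and with SST in the drift of Size using AIC:

```python
import numpy as np

rng = np.random.default_rng(7)

# Euler-discretized simulation (invented parameters): SST is an OU process
# and SST enters the drift of Size, as in the driver model described above.
dt, n = 0.1, 2000
sst = np.zeros(n)
size = np.zeros(n)
for i in range(1, n):
    sst[i] = sst[i - 1] - 1.0 * sst[i - 1] * dt + 0.5 * np.sqrt(dt) * rng.normal()
    size[i] = size[i - 1] - 0.8 * (size[i - 1] - 1.0 - 1.5 * sst[i - 1]) * dt \
              + 0.3 * np.sqrt(dt) * rng.normal()

def fit_rss(target, predictors):
    """Least-squares fit of the increments; returns residual sum of squares."""
    T = np.column_stack(predictors)
    coef, *_ = np.linalg.lstsq(T, target, rcond=None)
    return np.sum((target - T @ coef) ** 2)

def aic(rss, k, m):
    """Gaussian AIC up to an additive constant."""
    return m * np.log(rss / m) + 2 * k

dsize = np.diff(size)
ones = np.ones(n - 1)
# Model A: no causal link; the drift of Size depends on Size alone.
aic_a = aic(fit_rss(dsize, [size[:-1], ones]), 2, n - 1)
# Model B: SST drives Size; SST enters the drift of Size.
aic_b = aic(fit_rss(dsize, [size[:-1], sst[:-1], ones]), 3, n - 1)

print(aic_b < aic_a)  # the driven model should win for this simulation
```

In real applications the continuous-time parameters are estimated by maximum likelihood over irregularly spaced observations, rather than by this naive discretized regression.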
Some advantages of SDEs for palaeontological data are the following. First, because continuous processes are modelled, there is no need for the observed time series to be measured at equidistant or identical time points. Second, there is no need to detrend the data, given that stationarity is not assumed (e.g. random walks are not strictly stationary). Third, SDEs allow for the modelling of latent variables that are not observed. Fourth, feedback loops, that is, causal arrows in both directions, are permissible. These last two features explicitly relax two assumptions of the original formulation of Granger causality. Fifth, uncertainty in measurements can be included in parameter estimation. In addition, because the SDE approach is parametric, it allows us to characterize the properties of the stochastic processes and the timescale on which one time series responds to another.
However, linear SDEs are unsuited to systems that are highly non-linear. They are also in general not amenable to studying counts (e.g. species richness), unless changes in species richness can be justified as proxies of rates of diversification, extinction or other such continuous-valued quantities that reflect underlying processes. Observed time series need to have enough data points to be informative, but we refrain from specifying the minimum data requirements for any of the methods, because these requirements depend on the ratio of signal to noise, and on the type of dynamics of the underlying processes. Even though it is possible to go beyond pairwise studies of time series in SDE, analysing systems with more than three time series can be quite expensive computationally. In addition, like any other approach that utilizes model comparison, but especially so because causal hypotheses are involved, the relative best fit of models needs to be interpreted with caution, as even the best model in a limited pool of models may have poor absolute fit to the data (Pennell et al. 2015; Voje et al. 2018). SDEs, as currently implemented, cannot accommodate substantial changes in the causal relationship between variables through time. If there is evidence to suggest that relationships do not remain the same over time, then this hypothesis could be tested with SDEs by dividing the time series into two or more segments.
The linear stochastic differential equation approach to causal inference has already seen a number of palaeontological applications. SDE has been used to show that coccolithophore size evolution in the Cenozoic was not driven by global temperature changes but was characterized by local spatial dynamics (Reitan et al. 2012). The SDE method has also been applied to the classic 'ships passing in the night' question of bivalve versus brachiopod diversification dynamics in the Phanerozoic, where it was shown that bivalve extinction rates drove brachiopod origination rates (Liow et al. 2015; Reitan & Liow 2017). Recently, it has also been used in parallel with the equation-free methods described in the following sections to show that the observed global occupancy of planktonic foraminifera has been dynamically coupled to past oceanographic changes captured in deep-ocean temperature reconstructions for the Cenozoic (Hannisdal et al. 2017). Code written in R and C to run linear SDEs in causal inference is currently available at http://folk.uio.no/trondr/layered/ but a fully documented R package is forthcoming.

Convergent cross mapping
As noted above, to study the long-term behaviour of a deterministic dynamical system we may characterize the state space (attractor), rather than the entire set of unobserved variables constituting the whole system. Attractor reconstruction is often the first step in non-linear time series analysis, estimation of invariants (see Transfer Entropy, below) and forecasting/prediction. It was shown in 1980 that for a dynamical system of multiple coupled components, a time series of any single component can be used to obtain a faithful representation of the dynamics of the whole system, by combining the time series with lagged instances of itself (Packard et al. 1980; Takens 1981).
If a time series X = {x(t_1), ..., x(t_n)} is a component of a dynamical system, we can reconstruct the state space of the system by constructing m-dimensional time-delay embedding vectors

E_X = {(x(t_i), x(t_i − τ), x(t_i − 2τ), ..., x(t_i − (m − 1)τ))}

where x(t_i) is the scalar value of the time series X at time t_i, and τ is the embedding lag. The embedding lag is only relevant for computational purposes and is unrelated to any true causal delay in the dynamics. For a dynamical system of unknown dimensionality, the optimal embedding dimension m needs to be estimated (Sauer et al. 1991). Loosely speaking, the Takens embedding theorem says that there is a one-to-one mapping between the reconstructed states in E_X and the states of the whole system (Takens 1981; Sauer et al. 1991). The reconstructed attractor thus preserves the dynamics of the system. If two variables X and Y are coupled components of the same dynamical system (i.e. they are causally connected) then the one-to-one mapping between a reconstruction E_Y and the system attractor also holds for a time series of Y. Consequently, E_X and E_Y will be in one-to-one correspondence with each other, which can serve as a criterion for causal detection. Attractor reconstruction using delay embeddings has found many applications in nonlinear time series analysis (Kantz & Schreiber 2003). Sugihara et al. (2012) introduced convergent cross mapping (CCM) as a nonparametric statistical approach to detecting causality between two observed time series. The CCM method estimates dynamical coupling strength by quantifying the extent to which the dynamics of the driver time series are encoded in a time-delay embedding of the putative response variable.
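Constructing the embedding vectors is straightforward (a Python sketch; the function name and test series are ours):

```python
import numpy as np

def delay_embedding(x, m, tau):
    """m-dimensional time-delay embedding vectors
    (x(t_i), x(t_i - tau), ..., x(t_i - (m - 1) * tau))."""
    n = len(x) - (m - 1) * tau
    return np.column_stack([x[(m - 1 - j) * tau:(m - 1 - j) * tau + n]
                            for j in range(m)])

# Example: embed a sine wave in 3 dimensions with embedding lag 2.
t = np.linspace(0, 20 * np.pi, 1000)
x = np.sin(t)
E = delay_embedding(x, m=3, tau=2)
print(E.shape)  # prints (996, 3)
```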
The CCM algorithm iteratively draws a subset of a time series Y (the prediction set) and finds, for each point P_i in the prediction set, the corresponding subset L_i in the reconstructed state space E_X (the library set). Next, it locates the nearest neighbours of L_i in E_X and uses a geometric projection technique (Sugihara & May 1990) to compute a predicted value P_i*. The closeness of prediction, or cross-mapping skill, is measured by the Pearson correlation between the predicted value P_i* and the observed value P_i. The convergence part of CCM comes from the expectation that, if the variables belong to the same dynamical system, then the cross-mapping skill should increase and converge with increasing size of the library set used in the reconstruction.
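A stripped-down cross-mapping sketch (in Python) conveys the idea; this is our simplified re-implementation with exponentially weighted nearest neighbours, not the published rEDM algorithm, and the coupled logistic maps are an invented test system in which x drives y:

```python
import numpy as np

def embed(x, m, tau):
    """Time-delay embedding vectors (x(t_i), ..., x(t_i - (m - 1) * tau))."""
    n = len(x) - (m - 1) * tau
    return np.column_stack([x[(m - 1 - j) * tau:(m - 1 - j) * tau + n]
                            for j in range(m)])

def ccm_skill(x, y, m=3, tau=1):
    """Cross-map skill 'X xmap Y': predict points of y from the embedding
    of x. High skill is interpreted as 'Y causally influences X'."""
    Ex = embed(x, m, tau)
    ys = y[(m - 1) * tau:]                 # align y with the embedding of x
    preds = np.empty(len(Ex))
    for i in range(len(Ex)):
        d = np.linalg.norm(Ex - Ex[i], axis=1)
        d[i] = np.inf                      # exclude the target point itself
        nn = np.argsort(d)[:m + 1]         # m + 1 nearest neighbours
        w = np.exp(-d[nn] / max(d[nn][0], 1e-12))
        preds[i] = np.sum(w * ys[nn]) / np.sum(w)
    return np.corrcoef(preds, ys)[0, 1]

# Test system: x is an autonomous chaotic logistic map that drives y.
n = 400
x = np.empty(n); y = np.empty(n)
x[0], y[0] = 0.4, 0.2
for t in range(1, n):
    x[t] = x[t - 1] * (3.8 - 3.8 * x[t - 1])
    y[t] = y[t - 1] * (3.5 - 3.5 * y[t - 1] - 0.32 * x[t - 1])

# The driver x is encoded in the dynamics of the response y, so cross
# mapping from the embedding of y recovers x much better than the reverse.
s_yx = ccm_skill(y, x)
s_xy = ccm_skill(x, y)
print(s_yx, s_xy)
```

The full algorithm additionally varies the library size to test for convergence, which this sketch omits for brevity.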
In contrast to Granger causality approaches, which use past values of the forcing variable to predict future values of the response variable, CCM asks whether the reconstructed states of the response variable can be used to predict the states of the forcing variable. Hence, the notation 'X xmap Y' refers to estimating y(t_i) from the corresponding delay-embedding reconstruction of x(t_i), which can be interpreted as 'Y is causally influencing X'. In CCM, causal directionality is detected by assessing whether prediction skill from X to Y is greater than vice versa.
State-space reconstruction requires a sufficient embedding dimension m (the number of lagged instances included in the embedding vector), and the CCM algorithm selects the embedding parameters by optimizing self-prediction using a geometric projection technique (Sugihara & May 1990). See Haaga et al. (2018) for more details.
In systems with strong unidirectional forcing, the response variable can be fully encoded in the forcing variable ('synchronization'), and cross mapping will be observed in both directions, even if true causality is unidirectional. To address this problem, Ye et al. (2015a) introduced a lagged CCM to distinguish synchronization from true two-way causality. In practice, beyond the sign of the peak lag, current implementations of CCM cannot reliably detect the absolute magnitude of a causal time delay, and the peak lag may not be unique if the forcing variable is strongly periodic. Instead, Haaga et al. (2018) proposed to integrate CCM skill over a window of positive and negative lags, thus sacrificing absolute time delay information for the sake of obtaining a more robust detection of causal directionality. Indeed, several aspects of the CCM approach to causality are topics of lively discussion, including different ways of subsampling the data to construct libraries.

CCM has been applied to records in which samples are not evenly spaced in time (e.g. van Nes et al. 2015), but the state-space cross-mapping concept demands that the variables being compared are sampled at contemporaneous points, which might necessitate bin-averaging, interpolation or imputation of data. CCM does not assume stationarity, hence there is no need to detrend the observed time series. Measurement errors, sampling noise or chronological uncertainty have to be dealt with by Monte Carlo iterative analyses, which can be computationally costly. CCM is only defined for the bivariate case, but, in principle, transitive multivariate causal chains can also be resolved through pairwise analyses (Ye et al. 2015a). For dynamical systems in general, the relationship between the coupled components will vary depending on the state of the system (think of the two 'wings' of the Lorenz 'butterfly' attractor).
In the CCM approach, two variables are causally coupled if they can be shown to belong to the same dynamical system, but that does not imply that the relationship between them is fixed through time. With sufficient data, a moving window analysis can be used to test for temporal changes in the strength and directionality of coupling.
The convergent cross-mapping approach to causal inference has already found a few palaeontological applications, including time series of Cenozoic planktonic foraminifera (Hannisdal et al. 2017).

Software for running CCM is available as an R package (rEDM) at https://github.com/ha0ye, and wrapper functions that simplify lagged CCM analyses with surrogate testing are available at https://www.earthsystemevolution.com/page/tools/.

Transfer entropy
Granger causality was originally inspired by ideas from the theory of prediction, and Granger himself noted early on its linkages to information theory (Granger 1963). A predictive notion of causality is closely related to the notion of a directional flow of information, and information-theoretic techniques are now widely used to measure causal dependence in dynamical systems (Hlaváčková-Schindler et al. 2007; Amblard & Michel 2013).
One of the most popular information-theoretic measures of directional information flow is Schreiber's (2000) transfer entropy, which starts conventionally with the Markov property of conditional independence: If two processes X and Y are independent, then knowing the past l states of Y has no influence on the state transition probability of X, from x(t i ) to x(t i+1 ), beyond knowing the past k states of X alone. To simplify notation, we denote x(t i+1 ) as x i+1 , and a vector x(t i , . . ., t iÀk ) of k past states is written as x To address these shortcomings, a new approach to estimating transfer entropy is being developed by D. Diego, K. Haaga and BH (see https://www.earthsystemevolution. com/page/tools/). This approach is also based on the concept of an attractor as a subset of a system's state space, where each state consists of a set of current and past values from the different time series being analysed. Not all points on an attractor will be visited with the same frequency, and the frequency with which a typical trajectory will visit some portion of the state space is related to the density of the attractor, which is invariant under the dynamics. If we assume that two time-series X and Y represent variables that belong to the same dynamical system, and that the system has settled on its attractor, then it is possible to interpret the invariant density of a given region of the attractor as the probability for the system to be in any of the states in that region. The usual method for estimating the probability distributions required for computing transfer entropy is the method of nearest neighbours (e.g. Kraskov et al. 2004), which is very efficient but may lose accuracy unless the number of points in an embedding reconstruction of the attractor is very large. Instead, Diego et al. 
take a different approach inspired by the work of Froyland (1997), by discretizing the state space into simple volume partitions (simplices) and tracking how these volumes change between time steps. The movement and deformation of volumes through time approximate the action of the underlying governing processes. The dynamics are approximated by the so-called transfer operator, in the form of a Markov matrix of transition probabilities, which is used to estimate the invariant probability distribution needed to compute TE. The aim of this approach is to improve accuracy for short time series, and because the transfer operator is stationary by construction, there is no need to pre-process non-stationary data.
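To make the transfer-operator idea concrete, here is a minimal Python sketch. It is not the Diego et al. implementation: it uses a simple box partition of a one-dimensional state space rather than simplex partitions of a multivariate embedding, and the function names are ours. It builds a Markov matrix of one-step transition probabilities between partition elements and recovers the invariant distribution by power iteration:

```python
import numpy as np

def transfer_operator(series, n_bins=20):
    """Estimate a Markov (transfer) matrix from one-step transitions
    between bins of a discretized state space."""
    edges = np.linspace(series.min(), series.max(), n_bins + 1)
    states = np.clip(np.digitize(series, edges) - 1, 0, n_bins - 1)
    P = np.zeros((n_bins, n_bins))
    for a, b in zip(states[:-1], states[1:]):
        P[a, b] += 1.0
    row_sums = P.sum(axis=1, keepdims=True)
    # Normalize rows to transition probabilities; unvisited bins get a
    # uniform row so the matrix stays stochastic
    P = np.divide(P, row_sums, out=np.full_like(P, 1.0 / n_bins),
                  where=row_sums > 0)
    return P

def invariant_distribution(P, n_iter=1000):
    """Approximate the invariant density by repeatedly applying the
    transfer operator to an initial distribution (power iteration)."""
    mu = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(n_iter):
        mu = mu @ P
    return mu / mu.sum()

# Example: a chaotic logistic map that has settled on its attractor
x = np.empty(5000)
x[0] = 0.4
for t in range(4999):
    x[t + 1] = 3.9 * x[t] * (1 - x[t])

P = transfer_operator(x)
mu = invariant_distribution(P)
```

The invariant distribution estimated this way plays the role of the attractor density described above: it supplies the probabilities over states that are needed to compute TE, without assuming stationarity of the raw observations.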
Both CCM and TE are referred to as 'model-free' or 'equation-free' methods, because they take a geometric approach to detecting dynamical coupling without using model equations to specify causal mechanisms. TE allows time series to be unevenly sampled in time, but different variables need to be compared at contemporaneous points for the calculation of TE to be meaningful. Unlike the original TE (Schreiber 2000) and information transfer (IT; Verdes 2005), the transfer operator TE does not assume stationarity and obviates the need to detrend the observed time series. Like CCM, TE currently requires a Monte Carlo approach to account for noise or chronological uncertainty. Unlike CCM, however, TE can be directly generalized to three or more variables to test for confounding, or common-cause, interactions. That said, with more variables added in a conditional TE, longer time series are needed to ensure sufficient information content. TE allows the relationship between coupled variables to vary depending on the state of the system and, with enough data, temporal changes in the strength and directionality of coupling can be assessed with a moving-window analysis.
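To illustrate both the directionality of TE and the role of Monte Carlo testing, the following sketch implements a crude binned estimator of Schreiber's TE with k = l = 1 and compares it against shuffle surrogates. Coarse binning is far less data-efficient than the nearest-neighbour or transfer-operator estimators discussed here, but it keeps the definition transparent; the function and variable names are ours:

```python
import numpy as np

def transfer_entropy(x, y, n_bins=8):
    """Crude binned estimate of TE(Y -> X) with one lag (k = l = 1):
    the information y_i adds about x_{i+1} beyond x_i itself."""
    def discretize(s):
        edges = np.linspace(s.min(), s.max(), n_bins + 1)
        return np.clip(np.digitize(s, edges) - 1, 0, n_bins - 1)
    xd, yd = discretize(x), discretize(y)
    xf, xp, yp = xd[1:], xd[:-1], yd[:-1]
    joint = np.zeros((n_bins, n_bins, n_bins))
    for i, j, k in zip(xf, xp, yp):
        joint[i, j, k] += 1
    joint /= joint.sum()
    p_xp_yp = joint.sum(axis=0)    # p(x_i, y_i)
    p_xf_xp = joint.sum(axis=2)    # p(x_{i+1}, x_i)
    p_xp = joint.sum(axis=(0, 2))  # p(x_i)
    te = 0.0
    for i, j, k in zip(*np.nonzero(joint > 0)):
        te += joint[i, j, k] * np.log2(
            joint[i, j, k] * p_xp[j] / (p_xf_xp[i, j] * p_xp_yp[j, k]))
    return te

# Toy system in which Y drives X with a one-step lag, but not vice versa
rng = np.random.default_rng(1)
n = 2000
y = rng.standard_normal(n)
x = np.zeros(n)
for t in range(n - 1):
    x[t + 1] = 0.8 * y[t] + 0.2 * rng.standard_normal()

te_yx = transfer_entropy(x, y)  # should exceed te_xy
te_xy = transfer_entropy(y, x)
# Shuffle surrogates: permuting the putative driver destroys its temporal
# ordering, giving a null distribution for 'no directed coupling'
surrogates = [transfer_entropy(x, rng.permutation(y)) for _ in range(50)]
```

Comparing te_yx with the surrogate distribution, rather than with zero, is essential because plug-in estimators of TE are positively biased for finite samples.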
Software for computing the transfer operator TE using the high-performance computing language Julia is available at https://www.earthsystemevolution.com/page/tools/. An R package of wrapper functions is forthcoming. This website also features worked examples and interactive applications with examples of dynamical systems.

FROM EVENT-BASED REASONING TO DYNAMICAL CAUSALITY
It is a quirk of human cognition that as we receive a continuous stream of sensory input, we tend to segment this stream into discrete events. We do argue, however, that causal attribution in the context of palaeontological time series needs a dynamical, process-like notion of causality to complement our research on events. Raup himself lamented the tendency in our field of 'finding specific causes for specific evolutionary events' rather than seeking generalizations across groups of events (Raup 1977b). The same concern has been raised for the interpretation of carbon isotope excursions across the Phanerozoic, where underlying dynamics may explain commonalities among a large number of isotope excursions that are typically treated as unique events (Bachan et al. 2017).
Palaeontological and palaeoenvironmental records provide access to only a very small subset of the variables that may have interacted to generate the patterns observed in those records. Moreover, most deep-time records are themselves phenomenological proxies, which typically compound different underlying processes (e.g. the oxygen isotope record from marine carbonates captures past changes in temperature, ice volume, salinity and pH). But even if we cannot know all the interacting components (or 'detail complexity') of the relevant system, observed time series can capture important aspects of its dynamical complexity. If so, then the dynamical information inherent in the observed time series can be used to detect causal connections. As noted above in the method descriptions, this data-driven approach places certain demands on the information content of the data that are not always met in relatively short, noisy and irregular palaeontological time series. In addition, we note that there are other important empirical considerations that we have not discussed in this review, including the appropriateness of measurements (Houle et al. 2011) and the perennial problems associated with disentangling temporal process from stratigraphic architecture (Holland 1995; Hannisdal 2006, 2007; Patzkowsky & Holland 2012). On the other hand, the quality and resolution of deep-time records are steadily improving, including innovative approaches such as continuous-time age models (e.g. Nelsen et al. 2016; Husson & Peters 2017).
Thinking about the innumerable interactions that conspire to generate our observed palaeontological patterns can be quite dizzying, ranging from the gene regulatory networks involved in development (Peter & Davidson 2015), and the vast complexity of ecosystems (e.g. Lima-Mendez et al. 2015), to biotic feedbacks on global climate on million-year time scales (e.g. Beerling & Berner 2005). To illustrate how a dynamical notion of causality might help us see the proverbial forest for the trees, we make an analogy with another system of immense complexity that allows us to think about such things in the first place (and get a headache): the human brain. While the basic causal mechanisms behind brain activity at the micro-level of a single neuron have essentially been understood since the 1950s, macro-level activities such as movement, cognition and perception involve interactions of massive numbers of neurons across large systems of the brain. Recent technologies such as functional magnetic resonance imaging and electroencephalography provide time series of this system-level behaviour, which can be fruitfully analysed using concepts from dynamical systems theory and surrogate data testing (Breakspear 2017) as well as Granger causality (Bressler & Seth 2011). Causal hypotheses can thus be tested on observed time series without recourse to modelling the underlying complexity. A time-series analysis approach to causality in neuroscience has been criticized on the grounds that it can fail to resolve the full structure of the underlying mechanism (e.g. Stokes & Purdon 2017). However, this particular critique fails to make the distinction between quantification of directional information flow and explicit modelling of physical causal mechanisms, both of which are legitimate tasks, but the latter is beyond the purview of the time-series analysis approach.
Instead, a data-driven causal analysis can uncover dynamical relationships between components of complex and poorly constrained systems (e.g. Fig. 3), and thus help guide efforts to build models of the underlying mechanisms.
Another tentative insight from the study of large-scale brain activity that could resonate with palaeontologists is that macro-scale dynamics are not simply the sum of micro-level components, and that causal emergence allows for top-down causality (e.g. Hoel et al. 2013). Although fossils are not neurons, and the question remains to what extent macroevolutionary processes are emergent or generated bottom-up from microevolutionary processes (Erwin 2000; Simons 2002; Jablonski 2017a), we note that new theoretical results have raised the tantalizing prospect of a fundamental link between the processes of evolution and the principles of learning (Watson & Szathmáry 2016).
The types of linear thinking that we typically use in standard data analysis (e.g. variance partitioning, regression coefficients) assume that components can be analysed separately and then added back together. However, this approximation often fails in both living and Earth systems, where components are intertwined and may interact in such a way that any linear correlations are transient and may even switch sign (referred to as 'mirage correlations' by Sugihara et al. 2012). Although there is no substitute for a well-posed question that can be answered with limited data and simple methods, we must acknowledge that the systems from which we obtain time series (and the causal questions we ask of them) may be far from simple, and may require analyses that are conceptually and computationally involved. We want to know what has 'driven', 'tracked', 'interacted with' or 'controlled' the intriguing patterns of temporal change observed in deep-time records, but the way we pose causal questions belies the epistemic challenges involved in answering them. The methods discussed in this review invite us to take a step back and think carefully about what it means to formulate and test causal hypotheses on time series.
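The mirage-correlation phenomenon is easy to reproduce. The following sketch simulates the two-species coupled logistic map used by Sugihara et al. (2012), with parameter values following their example, and computes Pearson correlations in successive windows; although the coupling itself never changes, the windowed linear correlation drifts and can even switch sign:

```python
import numpy as np

# Two-species coupled logistic map with fixed, deterministic coupling
# (parameter values follow the example in Sugihara et al. 2012)
n = 1000
x = np.empty(n)
y = np.empty(n)
x[0], y[0] = 0.4, 0.2
for t in range(n - 1):
    x[t + 1] = x[t] * (3.8 - 3.8 * x[t] - 0.02 * y[t])
    y[t + 1] = y[t] * (3.5 - 3.5 * y[t] - 0.1 * x[t])

# Pearson correlation in successive non-overlapping windows: the
# 'mirage correlations' vary from window to window even though the
# underlying interaction is constant
window = 100
corrs = [np.corrcoef(x[i:i + window], y[i:i + window])[0, 1]
         for i in range(0, n - window, window)]
```

Fitting a single linear model to such a system, or partitioning variance among its components, would attribute to noise what is in fact deterministic state-dependent interaction.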
A defining topic of palaeontology is the interplay of ecological, evolutionary and environmental processes responsible for the long-term maintenance and turnover of Earth's biota. Palaeontological data provide pre-anthropogenic baselines and deep-time lessons critical for understanding current systems as well as their recent, near-term and long-term future responses to change (National Research Council (US) 2011; Hannisdal et al. 2012, 2017; Harnik et al. 2012; Dietl et al. 2015; Finnegan et al. 2015; Kidwell 2015; Schmidt 2017). With the looming threat of large-scale disruption of global ecosystems, there is a need for palaeontologists to be bifocal: we need to zoom in to uncover the historical particulars of individual events, but also zoom out to search for underlying causal structures that may hold general lessons across multiple events. A dynamical, process-like concept of causality may be a Rosetta stone that can assist in deciphering the messages encrypted in palaeontological time series.