Editorial: Special issue on time series analysis in the biological sciences


David S. Stoffer, Department of Statistics, University of Pittsburgh, Pittsburgh, PA 15260, USA

The types of problems encountered in analyzing time series or spatial processes (or both) from the biological sciences are about as broad as the field itself. This includes applications in molecular biology, environmental biology, epidemiology, neurology and bioinformatics, marine biology, oceanography, biotechnology, physiology, botany, ecology, medicine, and evolution. Many of the problems involve departures from linearity, normality, and stationarity or homogeneity, and may involve missing data, irregular sampling, or multiple series collected at different time scales. Moreover, many current biological time series are collected under designed experiments and thus require modelling the between-subject and between-trial variations.

Most scientists who have studied time series analysis have encountered the sunspot data set or the lynx trappings data set, which are used as classic examples to demonstrate nonlinearity and non-normality. The lynx series is typical of predator-prey processes (the prey being the snowshoe hare) that are often modeled by the so-called Lotka-Volterra equations, a pair of simple nonlinear differential equations describing the interaction between the sizes of predator and prey populations. These series play a prominent role, for example, in the text by Tong (1990). They are often used to demonstrate a process that is not time reversible. Recall that if a process {Xt} is linear and Gaussian, then the process is time reversible because (X1, …, Xn) has the same distribution as the values in reverse order, (Xn, …, X1).
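Reversibility can be probed empirically. The following numpy sketch (not from any of the sources cited here; the statistic and coefficients are illustrative choices) uses the third-order quantity E[Xt²Xt+1] − E[XtXt+1²], which is zero for a time-reversible process, to contrast a Gaussian AR(1) with an AR(1) driven by skewed innovations:

```python
import numpy as np

def reversibility_stat(x, lag=1):
    """Normalized sample estimate of E[X_t^2 X_{t+lag}] - E[X_t X_{t+lag}^2].
    Zero, up to sampling error, for a time-reversible process."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    a, b = x[:-lag], x[lag:]
    return (np.mean(a**2 * b) - np.mean(a * b**2)) / np.std(x) ** 3

rng = np.random.default_rng(0)
n = 50000
e_gauss = rng.standard_normal(n)
e_skew = rng.exponential(size=n) - 1.0  # centered but skewed innovations

# Two AR(1) series with the same coefficient (0.5 is an arbitrary choice):
ar_gauss = np.zeros(n)
ar_skew = np.zeros(n)
for t in range(1, n):
    ar_gauss[t] = 0.5 * ar_gauss[t - 1] + e_gauss[t]
    ar_skew[t] = 0.5 * ar_skew[t - 1] + e_skew[t]

s_gauss = reversibility_stat(ar_gauss)  # near zero: linear + Gaussian is reversible
s_skew = reversibility_stat(ar_skew)    # bounded away from zero: not reversible
print(s_gauss, s_skew)
```

Linearity alone is not enough: the skewed-innovation series is linear yet fails reversibility, consistent with the Gaussian requirement stated above.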

A process that is apparently not time reversible is shown in Figure 1. These data, taken from Shumway and Stoffer (2011), are the monthly growth rates of pneumonia and influenza deaths in the United States for 11 years, 1968–1978. The data tend to increase slowly to a peak and then decline quickly to a trough. Moreover, although pneumonia and influenza are worse in the winter, the month with the peak number of occurrences varies annually. In addition, nonlinearity is seen in the lag plots of Figure 1. In particular, notice that in the lag-two plot, the dynamics of the present value change according to whether the growth rate two months prior is above or below about 12–15%. For example, in Figure 1, the correlation between Xt and Xt−2 appears to be positive if Xt−2 < 0.15 and negative if Xt−2 > 0.15.

Figure 1.

 Top: Monthly growth rates of pneumonia and influenza deaths in the United States for 1968–1978. Bottom: Lag-plots of the current value, Xt, with one month prior, Xt−1, and two months prior, Xt−2
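This threshold effect can be illustrated on simulated data. The sketch below (with invented coefficients, not fitted to the flu series) generates a simple threshold autoregression whose dynamics flip when the value two steps back crosses 0.15, and then computes the lag-two correlation separately in each regime:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the flu growth rates: a threshold AR model in
# which the lag-two dynamics change sign at 0.15 (all coefficients invented).
n = 500
x = np.zeros(n)
for t in range(2, n):
    if x[t - 2] < 0.15:
        x[t] = 0.05 + 0.6 * x[t - 2] + 0.05 * rng.standard_normal()
    else:
        x[t] = 0.25 - 0.6 * x[t - 2] + 0.05 * rng.standard_normal()

# Split the lag-two pairs by the threshold and correlate within each regime.
cur, lag2 = x[2:], x[:-2]
low, high = lag2 < 0.15, lag2 >= 0.15
r_low = np.corrcoef(lag2[low], cur[low])[0, 1]
r_high = np.corrcoef(lag2[high], cur[high])[0, 1]
print(r_low, r_high)  # positive in the lower regime, negative in the upper
```

A single pooled correlation would blur these two regimes together, which is exactly why the lag plot, rather than the ACF alone, reveals the nonlinearity.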

The data shown in the top of Figure 2 are a single-channel EEG signal taken from the epileptogenic zone of a subject with epilepsy, but during a seizure-free interval of 23.6 seconds; it is the series shown in Figure 3(d) of Andrzejak et al. (2001). The bottom of Figure 2 shows the innovations (residuals) after the signal has been removed by fitting an AR(p) model with the order chosen by AIC. Due to the large spikes in the EEG trace, it is apparent that the data are not normal. In fact, the innovations in Figure 2 are from a heavy-tailed distribution and possibly from an infinite variance process. Infinite variance processes are described in detail in Brockwell and Davis (1991, Chapter 13). In this case, it is possible to pose a linear process, but with stable innovations.

Figure 2.

 Top: A single channel EEG signal taken from the epileptogenic zone of a subject with epilepsy during a seizure free interval of 23.6 seconds; see Andrzejak et al. (2001). Bottom: The innovations after removal of the signal using an autoregression based on AIC
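The signal-extraction step can be sketched in plain numpy: fit AR(p) by least squares over a range of orders, select p by AIC, and keep the residuals as the innovations. This is a generic illustration checked on a simulated AR(2), not the exact procedure used for Figure 2:

```python
import numpy as np

def fit_ar_aic(x, max_p=10):
    """Least-squares AR(p) fits for p = 1..max_p; select p by AIC and
    return (p, coefficients, innovations). A plain-numpy sketch."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    best = None
    for p in range(1, max_p + 1):
        # Design matrix of lagged values: row for time t is (x[t-1], ..., x[t-p]).
        X = np.column_stack([x[p - j - 1 : len(x) - j - 1] for j in range(p)])
        y = x[p:]
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        sigma2 = np.mean(resid**2)
        aic = len(y) * np.log(sigma2) + 2 * (p + 1)
        if best is None or aic < best[0]:
            best = (aic, p, coef, resid)
    return best[1], best[2], best[3]

# Sanity check on a simulated stationary AR(2) with unit-variance noise:
rng = np.random.default_rng(2)
e = rng.standard_normal(2000)
x = np.zeros(2000)
for t in range(2, 2000):
    x[t] = 1.5 * x[t - 1] - 0.75 * x[t - 2] + e[t]
p, coef, innov = fit_ar_aic(x)
print(p, coef[:2])
```

For the EEG series the interest is then in the distributional shape of `innov`, where the heavy tails show up.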

Figure 3.

 The sample ACF of the EEG innovations (left) and the squared innovations (right); the EEG innovations series is shown in Figure 2

Most time series analysts have seen data such as the returns of the S&P 500. The fact that these types of processes tend to be uncorrelated but dependent has led to the development of models such as GARCH-type models, stochastic volatility models, and bilinear models. One typical demonstration is to plot a stock return series, noting obvious departures from independence, for example clusters of volatility, even though the sample autocorrelation function (ACF) is essentially that of white noise. Then, we exhibit the sample ACF of the squares of the data and voilà, the dependence in the process is revealed. Such an occurrence is not limited to financial time series, but can also be seen in processes encountered in the sciences related to the study of life (i.e. biology). For example, the left side of Figure 3 shows the sample ACF of the EEG innovations displayed in Figure 2. The fact that the values of the ACF are small indicates that the innovations form a white noise process. However, the right side of the figure shows the sample ACF of the squared EEG innovations, where we clearly see significant autocorrelation. Thus, while the innovations appear to be white, they are clearly not independent.
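The uncorrelated-but-dependent phenomenon is easy to reproduce. Below is a minimal illustration using a simulated ARCH(1) series as a stand-in (the EEG innovations themselves are not reproduced here, and the ARCH coefficients are illustrative): the ACF of the series looks like white noise while the ACF of its squares does not.

```python
import numpy as np

def sample_acf(x, max_lag=20):
    """Sample autocorrelation function at lags 1..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    c0 = np.dot(x, x) / len(x)
    return np.array([np.dot(x[:-h], x[h:]) / (len(x) * c0)
                     for h in range(1, max_lag + 1)])

# ARCH(1): x_t = sigma_t z_t with sigma_t^2 = 0.2 + 0.3 x_{t-1}^2.
# Uncorrelated but dependent; the squares follow an AR(1)-type ACF.
rng = np.random.default_rng(3)
n = 5000
x = np.zeros(n)
for t in range(1, n):
    sigma = np.sqrt(0.2 + 0.3 * x[t - 1] ** 2)
    x[t] = sigma * rng.standard_normal()

acf_x = sample_acf(x)
acf_x2 = sample_acf(x**2)
band = 1.96 / np.sqrt(n)  # rough white-noise band for reference
print(np.max(np.abs(acf_x)), acf_x2[0])
```

The first ACF stays near the white-noise band; the lag-one autocorrelation of the squares sits near the ARCH coefficient, revealing the dependence.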

Another situation in which linearity and normality are unreasonable assumptions is when the data are discrete-valued and small. One such process is the number of poliomyelitis cases reported to the U.S. Centers for Disease Control for the years 1970–1983, displayed in Figure 4. The marginal distribution appears to be overdispersed Poisson, or generalized Poisson, or perhaps negative binomial, which is a mixture of Poissons; for example, see Joe and Zhu (2005). Moreover, the ACF of the process seems to imply a simple autocorrelation structure, which might be modelled as a simple non-Gaussian AR(1)-type model. The polio data set is taken from Zeger (1988), who fits a generalized linear ARMA-type model. Generalized linear ARMA models are an extension of generalized linear models to dependent data situations where ARMA-type autocorrelations are evident. In another approach, models have been developed that have ARMA-type autocorrelation structures but are constrained so that the process stays in a state-space of integers, for example. Fortunately, for those who are interested in these problems, Jung and Tremayne (2011) provide a recent and extensive summary of the state of the art.

Figure 4.

 Poliomyelitis cases reported to the U.S. Centers for Disease Control for the years 1970–1983
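A standard example of the integer-constrained approach is the Poisson INAR(1) model based on binomial thinning, which has an AR(1)-type geometric ACF while remaining on the non-negative integers. The sketch below simulates one (the parameters are illustrative, not fitted to the polio data):

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_inar1(n, alpha, lam):
    """INAR(1): X_t = alpha ∘ X_{t-1} + eps_t, where ∘ is binomial
    thinning and eps_t ~ Poisson(lam). Integer-valued, with
    rho(h) = alpha**h, mimicking a Gaussian AR(1) ACF."""
    x = np.zeros(n, dtype=int)
    x[0] = rng.poisson(lam / (1 - alpha))  # stationary Poisson marginal
    for t in range(1, n):
        survivors = rng.binomial(x[t - 1], alpha)  # each count survives w.p. alpha
        x[t] = survivors + rng.poisson(lam)        # plus new Poisson arrivals
    return x

x = simulate_inar1(5000, alpha=0.6, lam=0.8)

# The lag-one sample autocorrelation should be near alpha = 0.6.
xc = x - x.mean()
r1 = np.dot(xc[:-1], xc[1:]) / np.dot(xc, xc)
print(x[:10], r1)
```

The thinning operation is what keeps the state space in the integers, in contrast to a Gaussian AR(1) with the same ACF.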

Many biological time series exhibit symptoms of non-stationarity, including non-constant means and variances (structural breaks, change-points) and power spectra that evolve with time. As a specific example, brain activity is often altered following a shock to the system, such as seizure onset or the presentation of some external stimulus. These changes are often reflected in the spectrum and coherence. One challenge here is to quantify the impact of a shock to the biological system. A number of non-parametric approaches have been proposed: the locally stationary Fourier-based model of Priestley (1965) and Dahlhaus (1997), the locally stationary wavelets model in Nason et al. (2000), the SLEX (smooth localized complex exponentials) model in Ombao et al. (2005), and a local spline approach taken in Rosen et al. (2009). Recently, Gorrostieta et al. (2012) and Kang et al. (2013) studied stimulus-induced changes in brain dynamics using time series models with subject-specific random effects to account for between-subject variation in brain responses.
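A crude way to see a time-evolving spectrum, far short of the estimators cited above, is a sliding-window periodogram. The sketch below applies one to a series whose dominant frequency shifts abruptly halfway through, mimicking a shock:

```python
import numpy as np

def sliding_spectra(x, win=256, step=128):
    """Local periodograms over sliding windows: a minimal sketch of a
    time-varying spectral estimate (not any specific published method)."""
    spectra = []
    for start in range(0, len(x) - win + 1, step):
        seg = x[start : start + win]
        seg = (seg - seg.mean()) * np.hanning(win)  # demean and taper each segment
        spectra.append(np.abs(np.fft.rfft(seg)) ** 2 / win)
    freqs = np.fft.rfftfreq(win)  # cycles per sample
    return freqs, np.array(spectra)

# A series whose dominant frequency jumps from 0.05 to 0.20 at midpoint.
rng = np.random.default_rng(5)
n = 2048
t = np.arange(n)
x = np.where(t < n // 2,
             np.sin(2 * np.pi * 0.05 * t),
             np.sin(2 * np.pi * 0.20 * t)) + 0.3 * rng.standard_normal(n)

freqs, S = sliding_spectra(x)
peak_early = freqs[np.argmax(S[0])]   # spectral peak in the first window
peak_late = freqs[np.argmax(S[-1])]   # spectral peak in the last window
print(peak_early, peak_late)
```

A stationary estimate averaged over the whole record would show both peaks at once and miss the change-point entirely, which is the basic motivation for the locally stationary models above.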

The previous problems are just a few simple examples of the types of data seen in the biological sciences, and many of these problems may seem familiar. These problems are discussed in more detail in the forthcoming text on nonlinear and non-Gaussian series, Douc et al. (2013). The reality is that many problems, such as brain connectivity, are highly complex and are currently only understood at a basic level.

The collection of 12 invited articles included in this volume is meant to demonstrate the variety of problems, and of approaches taken to solve them, in the biological sciences. The analyses run the gamut of time series techniques, covering the time, spatial, and frequency domains. The hope is to motivate more experts in time series and spatial analysis to consider working on problems in the biological sciences. As will be seen from the collection, the problems are many and rich. In fact, comprehensive solutions may require statistical techniques from a variety of areas, including functional data analysis, mixed effects models, high-dimensional data analysis, statistical learning, and computing. The articles are arranged in alphabetical order according to the first author.