# Predictability: Recent insights from information theory

## Abstract

[1] This paper summarizes a framework for investigating predictability based on information theory. This framework connects and unifies a wide variety of statistical methods traditionally used in predictability analysis, including linear regression, canonical correlation analysis, singular value decomposition, discriminant analysis, and data assimilation. Central to this framework is a procedure called predictable component analysis (PrCA). PrCA optimally decomposes variables by predictability, just as principal component analysis optimally decomposes variables by variance. For normal distributions the same predictable components are obtained whether one optimizes predictive information, the dispersion part of relative entropy, mutual information, Mahalanobis error, average signal to noise ratio, normalized mean square error, or anomaly correlation. For joint normal distributions, PrCA is equivalent to canonical correlation analysis between forecast and observations. The regression operator that maps observations to forecasts plays an important role in this framework, with the left singular vectors of this operator being the predictable components and the singular values being the canonical correlations. This correspondence between predictable components and singular vectors occurs only if the singular vectors are computed using Mahalanobis norms, a result that sheds light on the role of norms in predictability. In linear stochastic models the forcing that minimizes predictability is the one that renders the “whitened” dynamical operator normal. This condition for minimum predictability is invariant to linear transformation and is equivalent to detailed balance. The framework also inspires some new approaches to accounting for deficiencies of forecast models and estimating distributions from finite samples.

## 1. INTRODUCTION

[2] Lorenz [1963] shattered the paradigm of a clockwork universe when he proved that if the solution to a dynamical system is not periodic, then small uncertainties in the state will grow so large as to render the forecast no better than a randomly drawn state from the system. No longer was there reason to expect weather to be as predictable as the motion of the planets or the tides of the ocean. Lorenz's result implies that the time during which the state is predictable, i.e., when prediction errors lie below those based on random selection of realistic states, is finite in nonperiodic systems. Therefore, when the state is predictable, the prediction error variance is less than the error variance of random selections. The difference between these two variances can be called the predictable variance. Current numerical weather prediction models indicate that the predictable variance is relatively small after about 3 weeks [Simmons and Hollingsworth, 2002]. Despite being small, predictable variance beyond 3 weeks may still be of interest. Specifically, certain spatial or temporal structures may be highly predictable beyond 3 weeks but difficult to detect because the unpredictable structures superposed on them dominate. For instance, climate in a limited region might be highly predictable beyond 3 weeks, but this predictability might be difficult to detect in an analysis that pools all regions together. A component that is highly predictable only in a certain season may be difficult to detect in an analysis that pools all seasons together. Also, components that are predictable beyond 3 weeks may be persistent and hence explain much of the variability of monthly means, even if they explain little of the daily variability [Shukla, 1981a]. In addition, large-scale structures tend to be more persistent, and hence more predictable, than small-scale structures [Lorenz, 1969; Shukla, 1981b].
These and numerous other examples illustrate that predictability beyond 3 weeks can be identified by appropriate filtering in space or time. The question arises as to whether there exists an optimal method for finding such predictability. Consider the following techniques that have been used to identify predictable structures in weather and climate data sets: (1) Barnett and Preisendorfer [1987] used canonical correlation analysis to identify relations between sea surface temperatures and land surface temperature. (2) Lorenz [1965] used singular value decomposition to identify the initial conditions that maximized error growth. (3) Deque [1988] and Renwick and Wallace [1995] used a version of principal component analysis to identify the most predictable patterns in operational forecast models. (4) Hasselmann [1979, 1997] developed “fingerprint” methods for detecting climate change. (5) Venzke et al. [1999] used an extension of signal-to-noise ratio to multivariate analysis to identify predictable variables in climate change scenarios. (6) Schneider and Griffies [1999] used discriminant analysis to find components that maximize predictive power.

[3] Each of these methods has some legitimate claim for identifying maximally predictable components. A natural question is how the methods compare when applied to the same problem. Despite appearances to the contrary the above methods are consistent: They all give the same result, on average, when applied to variables that are joint normally distributed. Demonstrating this consistency on a case-by-case basis would be unsatisfying because it would not give insight into the underlying reasons for this consistency. The purpose of this paper is to summarize and clarify a theoretical framework based on information theory that reveals the underlying unity of various multivariate statistical methods. Indeed, all of the specific examples listed above are equivalent to maximizing certain terms of a measure of predictability called relative entropy. This paper also shows that the framework provides sensible answers to a variety of questions that otherwise have no clear answer. This topic constitutes only a fraction of the vast literature on predictability. For a review of other topics in the predictability of weather and climate we recommend the book edited by Palmer and Hagedorn [2006]. The remainder of this section outlines, with a minimum of mathematics, the main results reviewed in this paper.

[4] In section 2 we introduce the basic concepts in predictability theory. Specifically, we define the forecast and climatological distributions and two guiding principles of predictability. The first principle is that a variable is unpredictable if its forecast distribution is identical to its climatological distribution. Hence a necessary condition for predictability is that the forecast and climatological distributions must differ. Intuitively, a measure of predictability should measure the “difference” between two distributions. More precise statements about measures of predictability are difficult to formulate without knowing the motives of the user. In practice, predictability might be defined more restrictively, e.g., the forecast should have more information (less uncertainty) than the climatology, to account for small ensemble size or imperfect model.

[5] The second principle of predictability, which has been emphasized by Schneider and Griffies [1999] and Majda et al. [2002], is that a measure of predictability should be at least invariant to linear, invertible transformations of the variables. Measures that satisfy this invariance do not depend on the arbitrary basis set used to represent the variables. Three measures of predictability have been proposed that satisfy these principles: mutual information [Leung and North, 1990], predictive information [Schneider and Griffies, 1999], and relative entropy [Kleeman, 2002]. If only average predictability is considered, then all three measures are equal, and no distinction exists between the measures. A fourth measure, called the Mahalanobis error, is introduced here that satisfies the two principles and can be interpreted as a multivariate generalization of the familiar normalized mean square error.

[6] In section 3 and later, we confine our attention to normal distributions, for which numerous analytical results exist. Section 4 gives explicit expressions for the predictability of normally distributed variables in terms of their mean and covariances.

[7] Section 5 reviews an important concept called the whitening transformation. A whitening transformation produces a set of uncorrelated variables with equal variances. The importance of this transformation lies in the fact that when it is applied to forecast variables, many familiar techniques, such as analysis of variance, principal component analysis, and singular value decomposition, give immediate results about predictability. Indeed, predictability of a forecast can be deduced solely from the whitened forecast variables, as discussed in section 6.

[8] In section 7, we discuss a technique called predictable component analysis, which finds components with maximum predictability, analogous to the way principal component analysis finds components with maximum variance. The state of a system can be represented by a sum of predictable components ordered such that the first has maximum predictability, the second has maximum predictability subject to being uncorrelated with the first, and so on. Remarkably, the same predictable components are obtained whether one optimizes predictive information, relative entropy (ignoring the signal term), mutual information, the Mahalanobis error, or classical measures such as normalized mean square error, average signal to noise ratio, or the anomaly correlation. Section 7 also shows that optimizing the signal term in relative entropy yields the same results as fingerprint methods in climate change analysis.

[9] In section 8 we discuss how singular vector analysis and canonical correlation analysis emerge naturally in this predictability framework. A vexing question in the use of singular vectors to measure predictability is which norm should be chosen. Information theory provides a sensible answer to this question. Specifically, singular vectors correspond to predictable components when the initial vector norm is based on the observation covariance matrix, and the final vector norm is based on the climatological covariance matrix. The initial vector norm constrains the initial vectors to have equal probability density, which ensures that they are equally likely to arise in the observations. The final vector norm reduces to relative variance in one dimension, as appropriate for predictability measures, and is invariant with respect to linear transformations, which ensures that the singular vectors do not depend on the coordinate system in which the state is represented. These results shed light on the role of norms in predictability. Section 9 shows that the above framework includes data assimilation, clarifying the fact that the predictability framework accounts for both dynamics and initial condition uncertainty.

[10] The role of norms in predictability theory is further clarified in section 10. It is shown in certain idealized examples that if the above norms are chosen, components that maximize signal are identical to components that minimize error, whereas this consistency is lost if other norms are chosen. A surprising result is that the choice of initial norm can determine whether one maximizes or minimizes predictability. Section 10 also shows how to generalize singular vectors to models with stochastic forcing. This generalization is important because if the model is stochastic, the growth of initial condition error captures only part of the total forecast spread since the stochastic component also contributes to spread.

[11] Section 11 shows that the above framework includes linear stochastic models: One need only make the proper identification between stochastic model parameters and the forecast and climatological distributions. This connection allows a dynamical interpretation of predictability.

[12] In section 12 we discuss the fascinating and not fully understood role of nonnormality in predictability. Generalized stability analysis tells us that singular vectors of nonnormal systems grow more strongly than those of normal systems with the same dynamical eigenvalues. One might surmise from this that nonnormality diminishes predictability since it enhances error growth. However, one could equally well surmise that nonnormality enhances predictability because it enhances signal growth. Confounding these arguments is the fact that nonnormality increases the climatological variances, so the difference between forecast and climatological variances becomes difficult to guess. The solution to this dilemma, which becomes clear only after rigorous analysis, is that nonnormality enhances predictability. The minimum predictability, for all measures of predictability (ignoring the signal term in relative entropy), occurs when the whitened dynamical operator is normal. This condition occurs when the dynamical operator and noise covariance matrix can be diagonalized simultaneously, which, in turn, is precisely the condition for detailed balance. Remarkably, the minimum value of predictability depends only on the real part of the eigenvalues of the dynamical operator. This result further implies that the predictability of normal systems is independent of the location of spectral peaks in the power spectrum. A conjecture for the upper bound of predictability is also discussed.

[13] A fundamental limitation of the above framework is that the forecast distribution of the climate system is unknown and must be estimated from finite samples. This review concludes with a discussion of some promising techniques for dealing with finite samples.

## 2. WHAT IS PREDICTABILITY?

[14] The foundations of predictability have been discussed by Lorenz [1963, 1965, 1969] and Epstein [1969]. In this framework the state of the system is specified by a finite set of numbers called the state vector, x_t, which evolves according to a set of known equations. Though the model is known, the state of the system may be uncertain because of observation error or because the model contains terms that vary randomly. Relative to an observer the most complete description of the state and its uncertainty is a probability distribution. This distribution can be interpreted as a density of possible states in phase space. As each possible state evolves in accordance with the governing equations, so too does the distribution describing the density of states. If the governing equations are deterministic and conservative, then the distribution satisfies Liouville's equation. If the governing equations contain random processes of a certain class, then the distribution satisfies the Fokker-Planck equation.

[15] An implicit assumption in the above description is that the system satisfies the Markov property. By Markov property we mean that given the state x_t at time t, no additional data concerning states at previous times can alter the (conditional) probability of the state at a future time t + τ. The distribution of the future state x_{t+τ} given x_t is denoted p(x_{t+τ}∣x_t) and is often called the transition probability. The transition probability is computed from a deterministic or stochastic model and completely describes the evolution of a Markov process.

[16] The probability distribution for the state of the system changes discontinuously after the system is observed. For instance, a 10% chance of temperature exceeding a certain threshold at a certain location at a certain time instantaneously changes to 100% certainty when temperature is observed (with certainty) to exceed that threshold at that location at that time. The new distribution of x_t after a set of observations becomes available is called the analysis distribution and is denoted by p(x_t∣Θ_t), where Θ_t represents the set of all observations up to time t. (We use the notation that the density function p(·) differs according to its argument.) Construction of the analysis distribution is a central goal of data assimilation [Jazwinski, 1970; Kalnay, 2003].

[17] We call t the initial condition time, t + τ the verification time, and τ the lead time. To reduce the notational burden, variables at the initial and verification times will be represented by different symbols as follows:

$$
i = x_t, \qquad v = x_{t+\tau}, \qquad \Theta = \Theta_t. \tag{1}
$$

In this notation the analysis distribution is denoted p(i∣Θ), and the transition probability is p(v∣i).

[18] The assumption that the transition probability is known exactly is called the perfect model assumption. The term perfect model often implies that the model is deterministic, in which case the transition probability p(v∣i) is a delta function. In this paper, this term will be used more generally to mean that the transition probability p(v∣i) is known, be it deterministic or stochastic. Essentially, a perfect model in our sense means that a deterministic model exists for the distribution. The term perfect model scenario is taken to mean that both the transition probability p(v∣i) and the analysis distribution p(i∣Θ) are known exactly.

[19] The distribution at times other than those for which observations are available is given by

$$
p(v \mid \Theta) = \int p(v \mid i) \, p(i \mid \Theta) \, di, \tag{2}
$$

where the integral is interpreted as a multivariate integral over the support of i. Equation (2) follows from the theorem of compound probabilities, provided p(v∣i, Θ) = p(v∣i), which holds for dynamical systems (provided the observations do not perturb the system significantly). The distribution p(v∣Θ) will be called the forecast distribution. The mean of the forecast distribution is called the signal, while the dispersion is called the noise or spread. A forecast ensemble is a sample drawn from the forecast distribution (2). In practice, the forecast ensemble is constructed by repeatedly drawing a sample from the analysis distribution p(i∣Θ) and then drawing from the transition probability p(v∣i) conditioned on the drawn initial state i. The resulting forecast ensemble can be interpreted as a Monte Carlo solution of the integral (2).
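The Monte Carlo construction of a forecast ensemble can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any operational system: the scalar model v = a·i + noise, the normal analysis distribution, the function name `forecast_ensemble`, and all parameter values are hypothetical choices made so that the ensemble statistics can be checked against the exact signal and spread.

```python
import random
import statistics

def forecast_ensemble(analysis_mean, analysis_std, a, noise_std, size, seed=0):
    """Monte Carlo sample from the forecast distribution of a toy scalar model.

    Assumed (hypothetical) setup: the analysis distribution p(i|Theta) is
    normal, and the transition probability p(v|i) is normal with mean a*i.
    """
    rng = random.Random(seed)
    ensemble = []
    for _ in range(size):
        i = rng.gauss(analysis_mean, analysis_std)  # draw an initial state from p(i|Theta)
        v = rng.gauss(a * i, noise_std)             # draw the verification state from p(v|i)
        ensemble.append(v)
    return ensemble

ensemble = forecast_ensemble(analysis_mean=1.0, analysis_std=0.5,
                             a=0.8, noise_std=0.3, size=20000)
signal = statistics.fmean(ensemble)      # ensemble estimate of the forecast mean
spread = statistics.pvariance(ensemble)  # ensemble estimate of the forecast variance

# For this linear Gaussian toy model the forecast integral is available in closed form.
exact_signal = 0.8 * 1.0
exact_spread = 0.8**2 * 0.5**2 + 0.3**2
```

In this toy case the ensemble mean and variance should approach `exact_signal` and `exact_spread` as the ensemble size grows; for nonlinear models no closed form is generally available, and the ensemble itself is the estimate.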

[20] As discussed above, the forecast distribution can be obtained from a perfect model. However, in the end the forecast distribution is the fundamental entity: If the forecast distribution can be computed directly without invoking a perfect model or analysis distribution, then so much the better. For example, a forecast distribution for a stationary process can be approximated by the observed relative frequency of events in a long historical record. Also, it should be recognized that the forecast distribution is not unique: Different observations Θ and different verification variables v give rise to different forecast distributions. For instance, the variable v could be a subset of state variables, in which case the relevant forecast is the marginal distribution of (2) with the missing variables “integrated out.” Unfortunately, it is difficult to distinguish a given forecast distribution from a perfect forecast distribution only on the basis of a sample from the perfect forecast distribution (see Jolliffe and Stephenson [2003] and Gneiting et al. [2007] for discussion). We will call the above forecast distributions reliable when we need to distinguish them from an imperfect forecast.

[21] The forecast distribution in the absence of a specific measurement of the system Θ is called the climatological distribution and is given by the theorem of compound probability

$$
p(v) = \int p(v \mid \Theta) \, p(\Theta) \, d\Theta, \tag{3}
$$

which is the marginal distribution for v. Equation (3) shows that the climatological distribution can be interpreted as the average forecast distribution. The climatological distribution may depend on time, such as the phase of the seasonal cycle or diurnal cycle. An illustration of the forecast and climatological distributions is shown in Figure 1.

[22] Some predictability studies define the climatological distribution to be the asymptotic (i.e., “saturated”) forecast distribution. The distinction between the asymptotic forecast and the average forecast (3) is often immaterial because the two distributions are identical. However, in cases in which the two distributions differ, (3) is often more appropriate. A simple example is a damped linear model with imperfect initial conditions. In this case the asymptotic forecast distribution is a delta function, since the model is deterministic, but this is a poor choice for the climatological distribution since it has smaller spread than the forecast at all lead times. Indeed, the difference between a delta function and a distribution with finite spread is infinite by some measures, implying “infinite predictability,” an absurdity in light of the initial errors. In contrast, defining climatology as the average forecast distribution leads to sensible conclusions about predictability in this case, including the general decrease of predictability with lead time.

[23] The following principle is common to the various definitions of predictability proposed in the literature: If the forecast distribution p(v∣Θ) is identical to the climatological distribution p(v), then the variable v is said to be unpredictable. This principle is understood to be with respect to a perfect model scenario and for a specific set of observations Θ. It follows from this definition that a necessary condition for predictability is for the forecast and climatological distributions to differ. Thus a plausible measure of predictability is some measure of the difference between the forecast and climatological distributions.

[24] Early studies measured predictability based on the mean square error of a perfect model forecast. In this case the forecast and verification are considered to be independent realizations from the forecast distribution p(v∣Θ), and the mean square difference between these two states measures the dispersion of the distribution. Typically, mean square error increases with lead time (provided the initial error is not large) and asymptotically approaches a finite value, called the saturation value. The saturation value often is comparable to the mean square difference between two randomly chosen states from the system, indicating that the forecast dispersion equals the climatological dispersion. These considerations suggest that a reasonable measure of predictability in one dimension is the ratio of mean square error to its climatological value.

[25] A second principle of predictability, which has been emphasized by Schneider and Griffies [1999] and Majda et al. [2002], is that a measure of predictability should be at least invariant to linear, invertible transformation of the variables. In the univariate case the ratio of the mean square error to its climatological value is invariant to linear transformations of the variable, allowing comparisons of predictability between physically different variables. Measures that are not invariant to variable transformations depend on the basis set used to represent the state and require weights to account for variables with different units, leading to such questions as how many degrees Celsius of temperature are equivalent to 1 m/s of wind velocity? In contrast, using measures that are invariant to variable transformation circumvents such questions since the values would be independent of the weights. Such measures identify “fundamental” properties of predictability that are independent of the coordinate system.
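The univariate invariance can be checked numerically. The sketch below uses made-up forecast errors and climatological anomalies (hypothetical values, in degrees Celsius) and converts them to Fahrenheit; the additive offset of the conversion cancels in both errors and anomalies, so only the scale factor 1.8 remains. The raw mean square error changes with the units, while the normalized ratio does not.

```python
import statistics

def normalized_mse(errors, anomalies):
    """Ratio of mean square forecast error to climatological mean square anomaly."""
    mse = statistics.fmean([e * e for e in errors])
    clim = statistics.fmean([x * x for x in anomalies])
    return mse / clim

# Hypothetical forecast errors and climatological anomalies (deg C).
errors_c = [0.5, -1.2, 0.3, 0.9, -0.4]
anomalies_c = [2.0, -3.1, 1.4, -0.7, 2.6]

# Celsius -> Fahrenheit is linear; the offset cancels in differences.
errors_f = [1.8 * e for e in errors_c]
anomalies_f = [1.8 * x for x in anomalies_c]

raw_mse_c = statistics.fmean([e * e for e in errors_c])
raw_mse_f = statistics.fmean([e * e for e in errors_f])  # depends on the units
ratio_c = normalized_mse(errors_c, anomalies_c)
ratio_f = normalized_mse(errors_f, anomalies_f)          # independent of the units
```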

[26] Information theory provides measures of predictability that satisfy these two principles of predictability and that have several additional attractive properties. The starting point for these measures is the metric entropy. The entropy of a continuous distribution p(x) is defined as

$$
H(x) = -\int p(x) \log p(x) \, dx, \tag{4}
$$

where the integral is understood to be a multiple integral over the range of x. There exist many excellent reviews of entropy in the literature demonstrating that it arises as a natural and fundamental measure of uncertainty in a number of fields, including communication theory, data compression, gambling, computational complexity, statistics, and statistical mechanics [Shannon, 1948; Goldman, 1953; Reza, 1961; Cover and Thomas, 1991; Jaynes, 2003]. Although we will not reproduce those arguments here, as we could hardly do them justice, it is difficult to conceive of a measure better suited for a general evaluation of uncertainty than entropy.

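[27] Given entropy as a measure of dispersion, a natural measure of predictability is the difference between the entropy of the forecast and climatological distributions:

$$
P_\Theta = H(v) - H_\Theta(v), \tag{5}
$$

where

$$
H_\Theta(v) = -\int p(v \mid \Theta) \log p(v \mid \Theta) \, dv \tag{6}
$$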

is the entropy of the forecast distribution. The quantity PΘ is called the predictive information, and it varies in time through its dependence on Θ. (Schneider and Griffies [1999] introduced a measure of predictability equal to the difference between the entropy of the climatological distribution and the entropy of the forecast error. If the forecast error is identified with the difference between a randomly chosen state and the mean of the forecast distribution and the forecast system is perfect, then PΘ defined in this paper is equivalent to the predictive information defined by Schneider and Griffies. Hence we do not distinguish between these two definitions in this paper. We note that the definition of Schneider and Griffies allows predictability to be defined for imperfect forecasts in contrast to PΘ defined here, which is defined only with respect to a perfect model scenario.) The predictive information PΘ can be interpreted as the average amount of information provided by knowledge of Θ and by the implied perfect forecast derived from this knowledge. If we have perfect knowledge of the system at time t, then Θ = i. In such cases, Cover and Thomas [1991] show that the predictive information averaged over all initial conditions i decays monotonically with lead time in Markov systems, consistent with the intuitive notion that predictability degrades with lead time. If the system is deterministic, then predictive information is infinite, reflecting the fact that a deterministic forecast with perfect initial conditions has no uncertainty at any subsequent time.
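As a concrete illustration, predictive information can be evaluated in closed form for a univariate normal variable. The sketch below assumes the standard result that a normal distribution with variance σ² has differential entropy ½ log(2πe σ²); the function names and the variance values are hypothetical. Predictive information then depends only on the ratio of climatological to forecast variance and vanishes when the two variances are equal.

```python
import math

def gaussian_entropy(variance):
    """Differential entropy (in nats) of a univariate normal distribution."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)

def predictive_information(clim_variance, forecast_variance):
    """Entropy of the climatology minus entropy of the forecast."""
    return gaussian_entropy(clim_variance) - gaussian_entropy(forecast_variance)

clim_variance = 4.0
# As the forecast variance approaches the climatological variance,
# the predictive information decreases toward zero.
for forecast_variance in (0.5, 2.0, 4.0):
    print(forecast_variance, predictive_information(clim_variance, forecast_variance))
```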

[28] An alternative measure of the difference between two distributions is relative entropy, also known as the Kullback-Leibler distance [Cover and Thomas, 1991; Kleeman, 2002]. In the context of predictability, relative entropy is

$$
R_\Theta = \int p(v \mid \Theta) \log \frac{p(v \mid \Theta)}{p(v)} \, dv. \tag{7}
$$

Just as for predictive information, relative entropy varies from one forecast to another through its dependence on Θ. This metric arises in statistical problems in which two distributions must be discriminated from each other. Relative entropy decreases monotonically in Markov systems in a wide variety of circumstances [Cover and Thomas, 1991].

[29] Remarkably, predictive information and relative entropy are invariant with respect to (nonsingular) linear transformations of the state [Schneider and Griffies, 1999; Kleeman, 2002]. This property allows one to combine variables with different units into a single measure of predictability without introducing weighting factors since such factors cannot affect final values. Relative entropy has the additional property of being invariant to nonlinear, invertible transformations of the state [Majda et al., 2002].
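The invariance to linear transformations can be verified in the univariate normal case. The sketch below assumes the standard closed form for the relative entropy between two normal distributions; the function names and numerical values are hypothetical. Rescaling and shifting the variable (here with the Celsius-to-Fahrenheit factors, chosen only as an example) leaves the value unchanged.

```python
import math

def relative_entropy(forecast_mean, forecast_var, clim_mean, clim_var):
    """Relative entropy (nats) of a normal forecast with respect to a normal climatology."""
    dispersion = math.log(clim_var / forecast_var) + forecast_var / clim_var - 1.0
    signal = (forecast_mean - clim_mean) ** 2 / clim_var
    return 0.5 * (dispersion + signal)

def transform(mean, var, a, b):
    """Mean and variance of the linearly transformed variable a*x + b."""
    return a * mean + b, a * a * var

r_original = relative_entropy(1.5, 0.8, 0.0, 2.0)

# Apply the same nonsingular linear map to forecast and climatology.
f_mean, f_var = transform(1.5, 0.8, a=1.8, b=32.0)
c_mean, c_var = transform(0.0, 2.0, a=1.8, b=32.0)
r_transformed = relative_entropy(f_mean, f_var, c_mean, c_var)
```

Both the dispersion term and the signal term are separately unchanged, because each depends only on ratios in which the scale factor cancels.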

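[30] Another remarkable fact is that relative entropy and predictive information have precisely the same value when averaged over all observations Θ [DelSole, 2004a]. Hence relative entropy and predictive information measure different aspects of the average predictability of the variable v. The average of these quantities can be shown to be

$$
M = \int p(\Theta) \, P_\Theta \, d\Theta = \int p(\Theta) \, R_\Theta \, d\Theta = \iint p(v, \Theta) \log \frac{p(v, \Theta)}{p(v) \, p(\Theta)} \, dv \, d\Theta. \tag{8}
$$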

This quantity is known as mutual information and arises frequently in information theory as a measure of the dependence between two random variables [Cover and Thomas, 1991]. Thus the average predictability of variable v (with respect to the observations Θ) equals the mutual information between v and Θ. This quantity vanishes if and only if the two variables are independent. Furthermore, this metric is invariant with respect to nonlinear, invertible transformations of the state, a property that follows from the fact that it is essentially the relative entropy between the joint distribution and the product of the marginals. Leung and North [1990] proposed mutual information (which they called “transinformation”) between the state at two different times as a measure of predictability. This quantity is also called the time-delayed mutual information in the physics literature [Vastano and Swinney, 1988].

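[31] Substituting the predictive information (5) into the first equality in (8) gives

$$
M = H(v) - H(v \mid \Theta), \tag{9}
$$

where

$$
H(v \mid \Theta) = -\iint p(v, \Theta) \log p(v \mid \Theta) \, dv \, d\Theta \tag{10}
$$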

is the conditional entropy. This relation, which is well known in information theory, states that the statistical dependence between v and Θ, as measured by the mutual information M, equals the average reduction in uncertainty due to knowledge of the observations Θ. This relation rigorously expresses the notion that predictability can be measured in two equivalent ways: by the difference between the forecast and climatological distributions and by the degree of statistical dependence between the observations Θ and verification v. If the variable v is unpredictable, then, according to the above definition, the forecast and climatological distributions are identical, i.e., p(v∣Θ) = p(v). The latter condition is precisely the statement that the variables v and Θ are independent. Thus unpredictability implies that v and Θ are independent, and predictability implies that v and Θ are dependent random variables. Fundamentally, predictability is a measure of the gain from knowing something, say, an initial condition or climate forcing, that is statistically related to the variables of interest v.
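The equality of average relative entropy, average predictive information, and mutual information can be illustrated for a standardized bivariate normal pair. The sketch below is an assumption-laden toy example: v and Θ are taken to be scalars with zero mean, unit variance, and correlation ρ, so that the forecast is normal with mean ρΘ and variance 1 − ρ², and the mutual information is −½ log(1 − ρ²). These are standard Gaussian results; the function names and the value of ρ are hypothetical.

```python
import math
import random

def average_relative_entropy(rho, n_samples=50000, seed=0):
    """Monte Carlo average of relative entropy over observations Theta.

    Assumes v and Theta are standardized and jointly normal with correlation
    rho, so p(v|Theta) is normal with mean rho*Theta and variance 1 - rho**2.
    """
    rng = random.Random(seed)
    noise_var = 1.0 - rho * rho
    total = 0.0
    for _ in range(n_samples):
        theta = rng.gauss(0.0, 1.0)
        signal = rho * theta
        # Relative entropy of N(signal, noise_var) with respect to N(0, 1).
        total += 0.5 * (-math.log(noise_var) + noise_var - 1.0 + signal * signal)
    return total / n_samples

def predictive_information(rho):
    """Entropy of climatology minus entropy of forecast (identical for every Theta here)."""
    return -0.5 * math.log(1.0 - rho * rho)

rho = 0.6
mutual_information = -0.5 * math.log(1.0 - rho**2)
avg_relative_entropy = average_relative_entropy(rho)
```

In this example predictive information is the same for every Θ, whereas relative entropy varies with Θ through the signal term; only its average matches the mutual information.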

[32] A key difference between relative entropy and predictive information is that relative entropy vanishes if and only if the distributions are identical. For instance, entropy is invariant with respect to a translation in space. Thus forecast and climatological distributions that differ only in their means will have vanishing predictive information but positive relative entropy.

[33] Other measures of predictability have been proposed in the literature, including potential predictive utility as measured by Kuiper's statistical test [Anderson and Stern, 1996] and the difference in distributions as measured by the Kolmogorov-Smirnov measure [Sardeshmukh et al., 2000]. One reason for preferring measures based on information theory is that they have an immense number of applications and interpretations. As an example, suppose a bookmaker offers fair odds based on the historical frequency (climatology) for the future occurrence of some climate event, such as El Niño. The maximum average rate at which a gambler with a reliable forecast can double his money is the relative entropy between the climatological and forecast probabilities [Kelly, 1956]. Numerous other examples could be given to demonstrate the practical usefulness of measures from information theory [see Cover and Thomas, 1991].
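The gambling interpretation can be checked with a short simulation. The sketch below assumes a two-outcome event, a bookmaker paying fair odds 1/c on each outcome with climatological probability c, and a gambler who stakes a fraction of wealth on each outcome equal to its (reliable) forecast probability, which is the proportional betting strategy of Kelly [1956]. The probabilities and function names are hypothetical. The simulated average growth of log2 wealth per bet should approach the relative entropy, measured in bits, of the forecast with respect to the climatology.

```python
import math
import random

def relative_entropy_bits(forecast, climatology):
    """Relative entropy (bits) of forecast probabilities with respect to climatology."""
    return sum(q * math.log2(q / c) for q, c in zip(forecast, climatology) if q > 0.0)

def simulated_doubling_rate(forecast, climatology, n_bets=100000, seed=0):
    """Average log2 wealth growth per bet under proportional betting at fair odds."""
    rng = random.Random(seed)
    outcomes = range(len(forecast))
    log2_wealth = 0.0
    for _ in range(n_bets):
        # The forecast is assumed reliable, so outcomes occur with forecast probabilities.
        k = rng.choices(outcomes, weights=forecast)[0]
        # A stake of fraction forecast[k] at odds 1/climatology[k] multiplies wealth.
        log2_wealth += math.log2(forecast[k] / climatology[k])
    return log2_wealth / n_bets

forecast = [0.6, 0.4]        # hypothetical forecast: event, no event
climatology = [0.25, 0.75]   # hypothetical historical frequencies
exact_rate = relative_entropy_bits(forecast, climatology)
simulated_rate = simulated_doubling_rate(forecast, climatology)
```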

[34] A number of predictability studies demonstrate that specifying “boundary conditions,” such as SST, soil moisture, vegetation, and snow cover, in an atmospheric model can produce detectable influences in remote locations under favorable conditions (see Shukla and Kinter [2006] for a review). Technically, these models are not forecast models since the boundary conditions are specified rather than predicted. Nevertheless, these studies are important for establishing a physical basis for seasonal predictability by showing that components that vary more slowly than the atmosphere can influence the general circulation beyond the range of predictability associated with weather. Moreover, the framework discussed in this paper can be applied to these studies by interpreting the boundary conditions to be the observation Θ. However, the predictability of models with imposed boundary conditions may differ considerably from the predictability of models with dynamically interactive boundary conditions [Wang et al., 2005; Wu et al., 2006].

## 3. NORMALLY DISTRIBUTED VARIABLES

[35] The remainder of this paper considers variables that are joint normally distributed or at least well described by the first two moments. Measures derived for normal distributions are invariant only to linear, nonsingular transformations. The predictability of nonlinear or non-Gaussian systems can differ dramatically from that of linear or Gaussian systems [Smith, 2003]. Methods for non-Gaussian distributions are discussed in section 13. We denote a K-dimensional normal distribution with mean μ and covariance matrix Σ as Nx(μ, Σ), defined by

$$N_x(\mu, \Sigma) = \frac{1}{(2\pi)^{K/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu)^H \Sigma^{-1} (x-\mu)\right], \tag{11}$$

where ∣Σ∣ denotes the determinant of Σ and superscript H denotes the conjugate transpose (we consider only real variables but invoke conjugate transpose for formal reasons needed in section 12). The mean and covariance matrices of the relevant variables will be denoted as follows:

where angle brackets denote an average over the joint distribution of the random variables and the dependence of these quantities on initial time t and lead time τ is understood. We use the superscript infinity symbol to denote quantities that become stationary in the limit of large time and therefore become independent of initial time t in this limit:

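[36] In the notation defined in equations (11) and (12), the forecast distribution p(v∣Θ) will be represented as

$$p(v|\Theta) = N_v(\mu_{v|\Theta}, \Sigma_{v|\Theta}), \tag{14}$$

the climatological distribution p(v) will be represented as

$$p(v) = N_v(\mu_v, \Sigma_v), \tag{15}$$

and the distribution of the observations p(Θ) will be represented as

$$p(\Theta) = N_\Theta(\mu_\Theta, \Sigma_\Theta). \tag{16}$$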

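[37] The product rule p(v∣Θ) p(Θ) = p(v, Θ) implies the identity

$$\Sigma_v = \langle \Sigma_{v|\Theta} \rangle + \langle (\mu_{v|\Theta} - \mu_v)(\mu_{v|\Theta} - \mu_v)^H \rangle. \tag{17}$$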

Identifying the mean of the forecast μv∣Θ with the signal and the dispersion of the forecast with the noise, equation (17) can be interpreted as a decomposition of the climatological covariance into a sum of the noise variance and signal variance of the forecast. We call (17) the signal-noise decomposition. Note that (17) holds independently of the distribution.

[38] If the variables v and Θ are joint normally distributed, then a standard result in multivariate statistics [Johnson and Wichern, 1982, p. 170] states that the conditional distribution p(v∣Θ) also is normally distributed with mean and covariance

$$\mu_{v|\Theta} = \mu_v + \Sigma_{v\Theta}\Sigma_\Theta^{-1}(\Theta - \mu_\Theta), \qquad \Sigma_{v|\Theta} = \Sigma_v - \Sigma_{v\Theta}\Sigma_\Theta^{-1}\Sigma_{v\Theta}^H, \tag{18}$$

where ΣvΘ = 〈(v − μv)(Θ − μΘ)H〉 is the cross-covariance matrix between v and Θ, which is assumed to be constant. The above result is closely related to classical linear regression. Specifically, consider a system in which the verification v is related linearly to observations Θ in the form

$$v = L\Theta + b + \eta, \tag{19}$$

where L is a constant matrix, b is a constant vector, and η represents random error. This relation can be interpreted as a linear regression model in which the variable to be predicted, v, called the predictand, is related to a variable that is known, Θ, called the predictor. As such, the regression coefficients L and b can be determined by minimizing the mean square prediction error 〈ηHη〉. This is a standard optimization problem with solution

$$L = \Sigma_{v\Theta}\Sigma_\Theta^{-1}, \qquad b = \mu_v - L\mu_\Theta. \tag{20}$$

Substituting this solution into the model (19) and averaging while holding Θ constant gives

$$\mu_{v|\Theta} = L\Theta + b = \mu_v + \Sigma_{v\Theta}\Sigma_\Theta^{-1}(\Theta - \mu_\Theta). \tag{21}$$

The forecast covariance matrix for (19) is

$$\Sigma_{v|\Theta} = \langle \eta\eta^H \rangle = \Sigma_v - \Sigma_{v\Theta}\Sigma_\Theta^{-1}\Sigma_{v\Theta}^H. \tag{22}$$

Equations (21) and (22) are precisely the mean and covariance matrix of the conditional distribution p(v∣Θ) given in (18). This correspondence, which is well known in multivariate statistics, reflects the intimate connection between Gaussian distributions and linear relations.
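The correspondence between the conditional Gaussian distribution and least squares regression can be illustrated with a short simulation. The sketch below is only an illustration under stated assumptions: the joint covariance, the means, the sample size, and all variable names are hypothetical choices. It draws joint normal samples of (v, Θ), fits a regression of v on Θ by ordinary least squares, and compares the fitted coefficients and residual covariance with the conditional mean operator and conditional covariance in (18).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint covariance of (v, theta): v has 2 components, theta has 3.
A = rng.standard_normal((5, 5))
joint_cov = A @ A.T + np.eye(5)          # positive definite by construction
cov_v = joint_cov[:2, :2]
cov_vt = joint_cov[:2, 2:]               # cross covariance between v and theta
cov_t = joint_cov[2:, 2:]
mu_v = np.array([1.0, -1.0])
mu_t = np.array([0.5, 0.0, 2.0])

# Theoretical regression operator, intercept, and conditional covariance.
L = cov_vt @ np.linalg.inv(cov_t)
b = mu_v - L @ mu_t
cov_cond = cov_v - L @ cov_vt.T

# Empirical least squares fit of v on theta from joint normal samples.
n = 200_000
sample = rng.multivariate_normal(np.concatenate([mu_v, mu_t]), joint_cov, size=n)
v, theta = sample[:, :2], sample[:, 2:]
X = np.column_stack([theta, np.ones(n)])          # predictors plus intercept
coef = np.linalg.lstsq(X, v, rcond=None)[0]
L_hat, b_hat = coef[:3].T, coef[3]
cov_resid = np.cov(v - X @ coef, rowvar=False)    # covariance of the error eta

print(np.abs(L_hat - L).max(), np.abs(cov_resid - cov_cond).max())
```

Up to sampling error, the fitted operator matches ΣvΘΣΘ−1 and the residual covariance matches the conditional covariance, independent of the particular value of Θ.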

## 4. MEASURES OF PREDICTABILITY FOR NORMAL DISTRIBUTIONS

[39] The classic measure of predictability is based on the mean square difference between forecast and truth. In a perfect model scenario, “truth” is identified with a randomly chosen member of the forecast ensemble. Let ɛ be the difference between two independently chosen members of the forecast ensemble f1 and f2. Then, the mean square “error” (MSE) of the forecast is

where ∥x∥2 = xHx = Tr[xxH] and the overbar denotes an average over the forecast ensemble or, more precisely, all random variables except the observations Θ. As discussed in section 2, measures of predictability quantify the degree to which the forecast and climatological distributions differ. The usual measure of this difference based on the mean square error is the ratio

[40] The above metric has two basic limitations: (1) The measure depends on the coordinate system or basis set chosen to represent the state, and (2) the measure fails to capture differences between the forecast and climatological distributions that are not reflected in the total variance of the variables. Measures based on information theory do not suffer from these limitations and have further attractive properties, as will become clear. The derivation of predictive information, relative entropy, and mutual information for normal distributions is a standard result in information theory (see DelSole [2004a] for a summary of the derivations). The results are

where ∣A∣ and Tr[A] denote the determinant and trace of A. Following Schneider and Griffies [1999], we normalize predictability measures by state dimension K; otherwise, the measures scale with K. This normalization is tantamount to changing the base of the logarithm. Expression (25) shows that predictive information PΘ and mutual information M measure differences in covariances, while relative entropy RΘ measures differences in both mean and covariances. Following Kleeman [2002], we call the last term in the RΘ equation the signal and remaining terms dispersion.

## 5. WHITENING TRANSFORMATION

[41] The key difference between error analysis and predictability analysis is that error analysis characterizes absolute error, while predictability analysis characterizes relative error. Schneider and Griffies [1999] make the profound point that analysis of predictability in multidimensions is equivalent to analysis of variance in a suitably transformed coordinate system. This coordinate system will be called whitened space, and variables in this space are called whitened variables. Whitened variables appear frequently in information theory [e.g., Majda et al., 2002; Tippett and Chang, 2003] and in pattern recognition theory [Fukunaga, 1990]. The distinguishing feature of whitened variables is that their climatological covariance matrix equals the identity matrix. We further define a whitened variable to have zero mean. Any set of variables can be whitened by a linear transformation, provided the climatological covariance matrix is positive definite. Such transformations are unique up to a unitary transformation. In practice, variables can be transformed to whitened space by projecting them onto their principal components and then normalizing the principal components to unit variance. A geometric interpretation of the whitening transformation is illustrated in Figure 2. We denote whitened variables by tildes and the associated transformation matrix by the square root of the covariance matrix, which satisfies

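It should be recognized that the whitening transformation generally depends on the variable. Thus

$$\tilde{v} = \Sigma_v^{-1/2}(v - \mu_v), \qquad \tilde{\Theta} = \Sigma_\Theta^{-1/2}(\Theta - \mu_\Theta), \tag{27}$$

each of which can be readily verified to have homogeneous covariances, e.g.,

$$\langle \tilde{v}\tilde{v}^H \rangle = \Sigma_v^{-1/2}\,\Sigma_v\left(\Sigma_v^{-1/2}\right)^H = I. \tag{28}$$

[42] A whitened operator is an operator that acts on whitened variables in such a way as to preserve the original relations. Thus the whitened version of the regression operator L in (20) is

$$\tilde{L} = \Sigma_v^{-1/2}\, L\, \Sigma_\Theta^{1/2}. \tag{29}$$

[43] The norm of a whitened variable will be called the Mahalanobis norm. For example, the Mahalanobis norm of v is

$$\|\tilde{v}\| = \left[(v - \mu_v)^H \Sigma_v^{-1} (v - \mu_v)\right]^{1/2}. \tag{30}$$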

In statistics the quantity (30) is the Mahalanobis distance between v and μv. Stephenson [1997] has discussed several interesting properties of this measure, including the fact that it is invariant with respect to nonsingular linear transformations [see also Mardia et al., 1979, p. 31].

[44] The predictability measure (24) based on the whitened variables is

since 〈∥ṽ∥2〉 = Tr[I] = K. Measure (31) is proportional to the mean square error in whitened space and invariant to linear transformation. Thus the whitening transformation essentially converts error analysis into predictability analysis. We call the predictability measure (31) the Mahalanobis error. In one dimension the Mahalanobis error reduces to

$$\frac{\sigma_{v|\Theta}^2}{\sigma_v^2}, \tag{32}$$

which is the ratio of the forecast variance σv∣Θ2 to the climatological variance σv2, the classical measure of predictability in one dimension. The Mahalanobis error can be regarded as a generalization of normalized forecast error to multidimensions, a point that does not seem to be generally recognized in the predictability community. However, the Mahalanobis error is useful primarily for distributions that are well characterized by their first two moments.
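The whitening procedure described above (projection onto principal components followed by normalization to unit variance) is short enough to write out. The following sketch is illustrative only: the covariance matrices are hypothetical, and the Mahalanobis error is assumed to be the trace of the whitened forecast covariance divided by K, which reduces to the one-dimensional variance ratio (32). It also compares the principal component whitening with an alternative based on a Cholesky factor, to illustrate that whitening transformations differ only by a unitary transformation.

```python
import numpy as np

def whitening_matrix(cov):
    """Return W with W @ cov @ W.T = I, built from principal components
    normalized to unit variance."""
    evals, evecs = np.linalg.eigh(cov)
    return (evecs / np.sqrt(evals)).T

def mahalanobis_error(cov_f, cov_c):
    """Mean forecast variance in whitened space (assumed form: trace / K)."""
    W = whitening_matrix(cov_c)
    whitened_cov_f = W @ cov_f @ W.T
    return np.trace(whitened_cov_f) / cov_c.shape[0]

# Hypothetical climatological and forecast covariances.
cov_c = np.array([[4.0, 1.0], [1.0, 2.0]])
cov_f = np.array([[1.0, 0.2], [0.2, 1.5]])

W = whitening_matrix(cov_c)
identity_check = W @ cov_c @ W.T

# A different whitening matrix, from the inverse Cholesky factor.
W_chol = np.linalg.inv(np.linalg.cholesky(cov_c))
eig_pc = np.linalg.eigvalsh(W @ cov_f @ W.T)
eig_chol = np.linalg.eigvalsh(W_chol @ cov_f @ W_chol.T)

# Invariance of the Mahalanobis error to a nonsingular linear transformation.
T = np.array([[2.0, 1.0], [0.0, 0.5]])
err = mahalanobis_error(cov_f, cov_c)
err_T = mahalanobis_error(T @ cov_f @ T.T, T @ cov_c @ T.T)
print(err, err_T, eig_pc, eig_chol)
```

The two whitening matrices give different whitened forecast covariances, but the eigenvalues, and hence the trace and determinant, agree.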

[45] Another property of whitened variables is that their probability density is isotropic (i.e., rotationally invariant). For instance, variables with constant probability density (11) satisfy

$$\tilde{v}^H\tilde{v} = \text{constant}. \tag{33}$$

Expression (33) defines the surface of a sphere in whitened space, which, of course, is isotropic. States with the same Mahalanobis norm therefore are equally likely. Gaussian random variables that satisfy equation (33) are said to have equal likelihood.

[46] Interestingly, in the case of normal distributions the relative entropy, predictive information, and Mahalanobis error can be combined into a single expression:

The fact that all three measures of predictability appear in a single equality suggests a deeper connection among these measures than has been recognized so far.

## 6. WHITENED FORECAST COVARIANCE MATRIX

[47] Using standard properties of trace and determinant, measures (25) and (31) can be written as

where

$$\Phi_\Theta = \Sigma_v^{-1/2}\,\Sigma_{v|\Theta}\left(\Sigma_v^{-1/2}\right)^H. \tag{36}$$

The matrix ΦΘ will be called the whitened forecast covariance matrix. This matrix plays virtually the same role in predictability as the “predictive information matrix” discussed by Schneider and Griffies [1999] and Tippett and Chang [2003], to which it is related by a similarity transformation. In one dimension the whitened forecast covariance matrix reduces to (32), namely, the classical measure of predictability in one dimension. The above expressions show that the whitened forecast covariance ΦΘ and whitened mean μ̃v∣Θ contain everything there is to know about the predictability of Gaussian random variables.

[48] The whitened forecast covariance matrix is not unique because the whitening transformation is not unique. However, the above measures of predictability are unique because the trace and determinant are invariant to unitary transformations. Furthermore, eigenvectors of the whitened forecast covariance are unique when transformed back into physical space.

## 7. PREDICTABLE COMPONENTS

[49] The above measures quantify predictability by a single number. It is often enlightening to decompose total predictability into components that optimize predictability, similar to the way principal component analysis decomposes total variance into components that optimize variance. It turns out that decomposing variables by predictability is nothing more difficult than applying principal component analysis to whitened forecast variables. Specifically, principal component analysis of the whitened forecast variables yields an uncorrelated set of components that optimize predictability in the sense that the first component optimizes predictability, the second optimizes predictability subject to being uncorrelated with the first, and so on. This decomposition reveals how predictability is distributed among spatial structures and is especially useful if predictability is dominated by a small number of components. This technique, called predictable component analysis, was introduced independently by Deque [1988] and Schneider and Griffies [1999] and provides the basis of a procedure by Majda et al. [2002] for estimating relative entropy for non-Gaussian distributions. Following Renwick and Wallace [1995], the predictable component analysis technique will be denoted PrCA, not to be confused with PCA, which denotes principal component analysis.

[50] To demonstrate the above claims, consider a projection vector q. We seek the projection vector q such that the inner product of the projection vector and the state optimizes a measure of predictability. If the state v is normally distributed, then any linear combination of state variables v = qHv is also normally distributed with scalar mean and variance

The predictive information, relative entropy, mutual information, and the Mahalanobis error for the projected variable can be derived from (25) and (37) as

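Note that each of these expressions is a function of the ratio of variances

$$\phi = \frac{q^H \Sigma_{v|\Theta}\, q}{q^H \Sigma_v\, q} = \frac{u^H \Phi_\Theta\, u}{u^H u}, \tag{39}$$

where u is related to q by

$$u = \left(\Sigma_v^{1/2}\right)^H q. \tag{40}$$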

When the forecast variance is less than the climatological variance, then ϕ < 1 and the measures in (38) are monotonic functions of ϕ. It follows that, aside from relative entropy, which involves an extra term involving the difference in means, optimization of ϕ is equivalent to optimizing each of these measures of predictability.

[51] The above optimization problem can be solved by noting that ϕ in (39) is a Rayleigh quotient. A standard theorem in linear algebra [Noble and Daniel, 1988] states that the vectors that optimize the Rayleigh quotient (39) are given by the eigenvectors of the eigenvalue problem

$$\Phi_\Theta\, u = \phi\, u. \tag{41}$$

The symmetry of the whitened covariance matrix implies that the eigenvectors u are orthogonal. This orthogonality implies that the components are uncorrelated. Furthermore, the eigenvalue ϕ gives the variance ratio associated with the projection vector q as given by (40). Thus the eigenvector with the smallest eigenvalue maximizes predictability, the eigenvector with the next smallest eigenvalue maximizes predictability subject to being uncorrelated with the first, and so on.

[52] Using the fact that the trace and determinant of a matrix equals the sum and products of the eigenvalues, respectively, the measures of predictability (35) may be written in terms of the eigenvalues ϕ1, ϕ2,…,ϕK of the whitened forecast covariance matrix as

Thus the whitened forecast covariance matrix plays a central role in the predictability of normally distributed variables: Its eigenvectors determine the predictable components, and its eigenvalues determine the magnitude of predictability.

[53] The eigenvalue problem (41) in whitened space is equivalent to the generalized eigenvalue problem in untransformed space:

$$\Sigma_{v|\Theta}\, q = \phi\, \Sigma_v\, q. \tag{43}$$

The eigenvectors q are not generally orthogonal in space, so a corresponding set of vectors needs to be used to describe the spatial structure of the predictable component. Instead, the eigenvectors q are orthogonal with respect to the metric Σv in the sense that if qj and qk are two eigenvectors with differing eigenvalues, then qjHΣvqk = 0. If the projection vectors are normalized such that

$$q^H \Sigma_v\, q = 1, \tag{44}$$

then the biorthogonal vector to q is clearly

$$p = \Sigma_v\, q. \tag{45}$$

This relation further implies that

$$p = \Sigma_v^{1/2}\, u, \tag{46}$$

which preserves the biorthogonality relations pHq = uHu = 1. Thus the time series associated with the predictable component is qHv, and the spatial structure associated with this time series is p. The original time series v can be recovered by summing over all predictable components p, multiplied by their time series qHv. This procedure is discussed in further detail by Schneider and Griffies [1999] and DelSole and Chang [2003]. As noted by Schneider and Griffies [1999], predictable component analysis is also formally equivalent to discriminant analysis in the sense that the projection vectors that optimize ϕ in (39) can be interpreted as the weighting coefficients for a linear combination of variables that discriminate as much as possible between the forecast and climatological distributions or, equivalently, between the signal and noise. Coincidentally, Deque [1988] introduced a procedure by the same name, independently of Schneider and Griffies [1999], which turned out to be equivalent to the technique of Schneider and Griffies because of the equivalency between (41) and (43).
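The steps of predictable component analysis can be sketched in a few lines. The code below is a minimal illustration, not the implementation of any of the cited studies: the function name, the covariance matrices, and the rank 1 signal vector `b` are hypothetical. It whitens with respect to the climatological covariance, diagonalizes the whitened forecast covariance, and maps the eigenvectors back to projection vectors q and spatial patterns p.

```python
import numpy as np

def predictable_components(cov_f, cov_c):
    """Return variance ratios phi (ascending, most predictable first),
    projection vectors Q (columns q), and patterns P (columns p)."""
    evals, evecs = np.linalg.eigh(cov_c)
    W = (evecs / np.sqrt(evals)).T        # whitening matrix: W cov_c W^T = I
    whitened_cov_f = W @ cov_f @ W.T      # whitened forecast covariance
    phi, U = np.linalg.eigh(whitened_cov_f)
    Q = W.T @ U                           # projection vectors
    P = cov_c @ Q                         # biorthogonal spatial patterns
    return phi, Q, P

# Hypothetical climatology, and a forecast whose variance is reduced along a
# single direction b (a rank 1 signal).
cov_c = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
b = np.array([1.5, 0.5, 0.2])
cov_f = cov_c - np.outer(b, b)

phi, Q, P = predictable_components(cov_f, cov_c)
print(phi)          # one ratio below 1, the others equal to 1
print(P[:, 0])      # leading pattern, parallel to b up to sign and scale
```

Because the signal in this example has rank 1, only one component has a variance ratio below one, and its pattern is parallel to `b`. Summing the patterns multiplied by their time series, `P @ (Q.T @ v)`, recovers any state `v`.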

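[54] If the forecast covariance is constant (i.e., 〈Σv∣Θ〉 = Σv∣Θ), the signal-noise decomposition (17) may be substituted in the numerator of (39) to give

$$\phi = 1 - \frac{q^H \Sigma_S\, q}{q^H \Sigma_v\, q}, \tag{47}$$

where ΣS = Σv − Σv∣Θ denotes the signal covariance, that is, the covariance of the forecast mean μv∣Θ appearing in (17). The last term on the right-hand side is the ratio of the signal variance to the total variance. The relation between ϕ and the more familiar signal-to-noise ratio can be found by substituting the signal-noise decomposition (17) into the denominator of (39), which gives

$$\phi = \frac{1}{1 + s^2}, \tag{48}$$

where

$$s^2 = \frac{q^H \Sigma_S\, q}{q^H \Sigma_{v|\Theta}\, q} \tag{49}$$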

is the signal-to-noise ratio. Maximization of the signal-to-noise ratio directly has been discussed by Hasselmann [1979] and Venzke et al. [1999] and yields “signal-to-noise empirical orthogonal functions (EOFs).” The monotonic relation between ϕ and signal-to-noise ratio s2 indicated in (48) implies that minimizing ϕ, and thus maximizing predictability, is equivalent to maximizing the signal-to-noise ratio. Thus the signal-to-noise EOFs are identical to the predictable components. In addition, Kleeman and Moore [1999] and Sardeshmukh et al. [2000] show that ϕ = 1 − ρ2, where ρ, often called the anomaly correlation, is the correlation between the mean forecast and one realization of the forecast. Finally, equation (48) implies that ϕ is positive and does not exceed unity; that is, 0 < ϕ ≤ 1 when the forecast covariance is constant.

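[55] Another class of predictable components is obtained by optimizing the signal term of RΘ in (38),

$$\frac{\left[q^H(\mu_{v|\Theta} - \mu_v)\right]^2}{q^H \Sigma_v\, q}. \tag{50}$$

This function also is a Rayleigh quotient and hence can be solved by similar methods. In fact, since the quadratic form in the numerator is rank 1, the solution can be found analytically as

$$q \propto \Sigma_v^{-1}(\mu_{v|\Theta} - \mu_v). \tag{51}$$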

This function is known as Fisher's linear discriminant function and occurs frequently in discriminant analysis and in “fingerprint” methods in climate studies [Hasselmann, 1993].

[56] One may also determine the component that optimizes the average signal term in relative entropy. This optimization problem is equivalent to maximizing

which is the signal-to-noise ratio. The equivalence between components that maximize average signal and dispersion is a simple consequence of the signal-noise decomposition (17).

[57] In summary, predictive information, the dispersion term in relative entropy, the average signal term in relative entropy, mutual information, Mahalanobis error, signal-to-noise ratio, and anomaly correlation are optimized by the same components. This unifying result is a consequence of the fact that any measure of predictability that is invariant to affine transformation and monotonically related to uncertainty is maximized by the same components, independent of the detailed form of the measure [see DelSole and Tippett, 2007].

## 8. PREDICTABILITY OF LINEAR DYNAMICS

[58] In section 7 we assumed merely that the climatological and forecast distributions were Gaussian. In this section we invoke the stronger assumption that the variables v and Θ are joint normally distributed. For joint normal distributions the forecast distribution p(v∣Θ) has mean and covariance (18). This assumption immediately implies that the forecast covariance is constant. We assume that the forecast covariance Σv∣Θ is constant in the remainder of this paper.

[59] Substituting this covariance into the whitened forecast covariance matrix ΦΘ (36) gives

$$\Phi_\Theta = I - \tilde{L}\tilde{L}^H, \tag{53}$$

where L̃ is the whitened regression operator defined in (29). The emergence of L̃ here shows the intimate relation between the whitened forecast covariance and the whitened regression operator. Furthermore, the whitened signal is found by invoking the conditional mean (18):

$$\tilde{\mu}_{v|\Theta} = \Sigma_v^{-1/2}(\mu_{v|\Theta} - \mu_v) = \tilde{L}\tilde{\Theta}. \tag{54}$$

Expression (54) shows that variations in the signal are induced by variations in observations Θ. These two results allow us to write the predictability measures as

Note that predictive information PΘ equals mutual information M because the covariances are constant under the joint normal assumption.

[60] It follows immediately from (53) that the eigenvectors of ΦΘ are the left singular vectors of L̃. To see this, consider the singular value decomposition (SVD) of L̃:

$$\tilde{L} = U S W^H, \tag{56}$$

where the columns of U and W are the left and right singular vectors, respectively, which satisfy UHU = I and WHW = I, and S is a diagonal matrix with nonnegative diagonal elements, called singular values. Substitution of the SVD (56) into the whitened covariance matrix (53) gives

$$\Phi_\Theta = U\left(I - S S^H\right)U^H. \tag{57}$$

Equation (57) is of the form of an eigenvector decomposition. It follows that the left singular vectors of L̃ are also the eigenvectors of ΦΘ, which, in turn, are the predictable components.

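[61] For any projection vector the forecast covariance (53) reduces to the scalar

$$\phi = 1 - \rho^2, \tag{58}$$

where

$$\rho^2 = \frac{q^H \Sigma_{v\Theta}\Sigma_\Theta^{-1}\Sigma_{v\Theta}^H\, q}{q^H \Sigma_v\, q} \tag{59}$$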

is the squared correlation coefficient between the state v and observations Θ. Relation (59) implies two things. First, it implies that the singular values given by the diagonal elements of S are the correlation coefficients ρ. Second, it implies that optimization of ϕ is equivalent to optimization of ρ2: Predictable components optimize not only the whitened forecast variance, and hence optimize measures of predictability suggested from information theory, but also optimize the squared correlation between the state and observations. As is well known, canonical correlation analysis is a procedure that determines the linear combination of one set of variables and a second linear combination in a second set of variables, such that the correlation between the resulting combinations is maximized [Barnett and Preisendorfer, 1987; von Storch and Zwiers, 1999; DelSole and Chang, 2003]. The above analysis demonstrates that canonical correlation analysis is equivalent to predictable component analysis when the variables are joint normally distributed. This equivalence can be seen more directly by substituting the covariance matrix (18) into the generalized eigenvalue problem (43), which gives

This eigenvalue problem is essentially equivalent to the eigenvalue problem that arises in canonical correlation analysis [see, e.g., von Storch and Zwiers, 1999, equation (14.10)].

[62] The above analysis shows that if the variables are joint normally distributed, canonical correlation analysis is equivalent to predictable component analysis, which, in turn, is equivalent to SVD analysis of the whitened regression operator. It follows that canonical correlation analysis between two variables is equivalent to an SVD analysis of the corresponding whitened regression operator. This connection is not surprising since SVD determines linear combinations of two sets of variables that maximize the covariance between the variables, and in whitened space all variables are normalized to unit variance, and hence their covariances are identical to correlations. Canonical correlation analysis (CCA) then decomposes a linear relation into a set of uncorrelated components ordered such that the first maximizes predictability, the second maximizes the predictability subject to being uncorrelated with the first, and so on. Further interesting connections between CCA and linear regression are discussed by DelSole and Chang [2003].
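The chain of equivalences (canonical correlation analysis, predictable component analysis, and SVD of the whitened regression operator) can be verified numerically for any joint covariance. The sketch below is an illustration under assumptions: the joint covariance is randomly generated, the dimensions are arbitrary, and the helper `whitening_matrix` is a hypothetical name. It checks that the singular values of the whitened regression operator equal the canonical correlations and that the eigenvalues of the whitened forecast covariance equal one minus their squares.

```python
import numpy as np

def whitening_matrix(cov):
    """Return W with W @ cov @ W.T = I."""
    evals, evecs = np.linalg.eigh(cov)
    return (evecs / np.sqrt(evals)).T

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
joint_cov = A @ A.T + np.eye(5)     # hypothetical joint covariance of (v, theta)
cov_v, cov_vt, cov_t = joint_cov[:2, :2], joint_cov[:2, 2:], joint_cov[2:, 2:]

# Whitened regression operator and its SVD.
W_v, W_t = whitening_matrix(cov_v), whitening_matrix(cov_t)
L_white = W_v @ cov_vt @ W_t.T
U, s, Wh = np.linalg.svd(L_white, full_matrices=False)

# Eigenvalues of the whitened forecast covariance (ascending order).
cov_cond = cov_v - cov_vt @ np.linalg.solve(cov_t, cov_vt.T)
phi = np.linalg.eigvalsh(W_v @ cov_cond @ W_v.T)

# Squared canonical correlations from the classical CCA eigenvalue problem.
M = np.linalg.solve(cov_v, cov_vt) @ np.linalg.solve(cov_t, cov_vt.T)
rho2 = np.sort(np.linalg.eigvals(M).real)[::-1]

# Projection vectors in physical space and the covariance of the paired
# components, which have unit variance by construction.
Q, R = W_v.T @ U, W_t.T @ Wh.T
pair_corr = Q.T @ cov_vt @ R
print(s, np.sqrt(rho2), phi)
```

The matrix `pair_corr` is diagonal with the singular values on its diagonal, so each pair of components is correlated only with its partner, as expected of canonical variates.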

[63] The equivalence between predictable component analysis and canonical correlation has a more fundamental explanation. Predictable components optimally discriminate between forecast spread and climatological spread. A fundamental relation, embodied in (9), is that changes in spread due to observations becoming available are related to the degree of statistical dependence between observations and state. Hence maximizing predictability, as measured by the difference in spread, is equivalent to maximizing the statistical dependence between variables. It is noteworthy that CCA emerges as a natural procedure for optimizing predictability in contrast to classical predictability theory in which CCA often is invoked as an ad hoc procedure.

[64] Since the trace and determinant of a matrix equal the sum and product of the eigenvalues, respectively, the above results imply that predictability measures can be written as

where ρ1, ρ2,…, ρK are the canonical correlations. Thus the total predictability can be written solely in terms of the canonical correlations plus a signal term that depends on observations Θ.

[65] Nothing yet has been said about the right singular vectors. By analogy with the usual interpretation of singular vectors in initial value problems one might think that the right singular vectors W give the observations Θ that produce the predictable components U. However, the spread in this model does not vary from forecast to forecast, so the initial condition is irrelevant as far as the spread is concerned. It turns out that the right singular vectors are relevant to the forecast signal, as can be seen by the following argument. According to (18) the whitened signal is related to the observations through

$$\tilde{\mu}_{v|\Theta} = \tilde{L}\tilde{\Theta}. \tag{62}$$

A standard fact about singular vectors is that if w is the leading right singular vector of L̃, then the choice Θ̃ = w maximizes the norm of the “response” from (62)

$$\|\tilde{\mu}_{v|\Theta}\|^2 = \tilde{\Theta}^H \tilde{L}^H \tilde{L}\, \tilde{\Theta}, \tag{63}$$

subject to the constraint that the observations have unit Mahalanobis norm

$$\tilde{\Theta}^H\tilde{\Theta} = (\Theta - \mu_\Theta)^H \Sigma_\Theta^{-1}(\Theta - \mu_\Theta) = 1. \tag{64}$$

The latter constraint can be interpreted as choosing observations that are “equally likely” in the sense described in section 5. Moreover, the left singular vector gives the “response” due to the right singular vectors. Note that the term being maximized in (63) is the signal term in relative entropy. Thus the singular vectors of the whitened regression operator optimize the signal term in relative entropy, subject to an equal likelihood constraint on the observations Θ. Perhaps the surprising part of these considerations is that the same singular vectors optimize predictability in an average sense, being the eigenvectors of the forecast spread, and in an instantaneous sense, being vectors that optimize the signal term in relative entropy.

## 9. STATE SPACE MODELS

[66] In many situations we do not have direct access to the regression operator L that maps observations Θ into future states v. Instead, we possess a dynamical model that propagates initial states i into future states v and an analysis system that converts observations Θ into initial states i. The question arises as to whether the predictability of this system can be described by the above formalism. To establish this connection, consider a state equation that maps i into v,

$$v = G\,i + \xi, \tag{65}$$

where G is a constant square matrix called the propagator and ξ is a Gaussian random variable with zero mean and covariance matrix Σξ. In practice, the initial state i is derived from a data assimilation system, which is a system for computing the analysis distribution p(i∣Θ). For Gaussian distributions the initial state, conditioned on the observations, has the standard form

$$i = \mu_{i|\Theta} + e, \tag{66}$$

where μi∣Θ is the conditional mean of the state, called the analysis, and e is a Gaussian random variable called the analysis error, with zero mean and covariance Σe. The analysis is related to the observations by

$$\mu_{i|\Theta} = \mu_i + B(\Theta - \mu_\Theta), \tag{67}$$

where

$$B = \Sigma_{i\Theta}\Sigma_\Theta^{-1}. \tag{68}$$

Note that the analysis operator B is not necessarily a square matrix. The covariance of the analysis signal (67) is found by “squaring” both sides of (67) and substituting (68):

$$\Sigma_{\mu_{i|\Theta}} = B\,\Sigma_\Theta\,B^H = \Sigma_{i\Theta}\Sigma_\Theta^{-1}\Sigma_{i\Theta}^H. \tag{69}$$

Using (66) and (67) to eliminate the initial condition i from (65) gives

$$v = G\mu_{i|\Theta} + Ge + \xi = GB\,\Theta + G(\mu_i - B\mu_\Theta) + Ge + \xi. \tag{70}$$

Equation (70) is precisely of the form of a linear regression model (19) with the identifications

$$L = GB, \qquad b = G(\mu_i - B\mu_\Theta), \qquad \eta = Ge + \xi. \tag{71}$$

In other words, the linear state space model with analysis can be represented as a single linear regression model whose predictability has been discussed in section 8. Interestingly, the effect of model dynamics and analysis can be factored into distinct terms from the regression operator L, a result that holds generally for linear models with joint normal distributions.
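The reduction of the state space model to a single regression model can be illustrated by simulation. The sketch below is a toy example under assumptions: the propagator `G`, analysis operator `B`, and all covariances are hypothetical, and all means are set to zero for brevity. It generates observations, analyses, and forecasts, regresses the forecast state directly on the observations, and compares the result with the product of propagator and analysis operator and with the covariance of the combined error Ge + ξ.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical operators and covariances (zero means assumed throughout).
G = np.array([[0.9, 0.5, 0.0],
              [0.0, 0.8, 0.3],
              [0.0, 0.0, 0.7]])           # propagator
B = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])                # analysis operator (3 x 2, not square)
cov_theta = np.array([[1.0, 0.3], [0.3, 2.0]])
cov_e = 0.2 * np.eye(3)                   # analysis error covariance
cov_xi = 0.1 * np.eye(3)                  # model noise covariance

theta = rng.multivariate_normal(np.zeros(2), cov_theta, size=n)
e = rng.multivariate_normal(np.zeros(3), cov_e, size=n)
xi = rng.multivariate_normal(np.zeros(3), cov_xi, size=n)

initial_state = theta @ B.T + e           # analysis plus analysis error
v = initial_state @ G.T + xi              # propagate and add model noise

# Regress v directly on theta and compare with the factored operator G B.
L_hat = np.linalg.lstsq(theta, v, rcond=None)[0].T
L = G @ B
cov_eta = G @ cov_e @ G.T + cov_xi        # covariance of eta = G e + xi
cov_resid = np.cov(v - theta @ L_hat.T, rowvar=False)
print(np.abs(L_hat - L).max(), np.abs(cov_resid - cov_eta).max())
```

Up to sampling error, the fitted regression operator equals `G @ B`, so the model dynamics and the analysis enter as separate factors.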

## 10. ERROR DYNAMICS AND PREDICTABILITY

[67] The results discussed in section 9 allow us to address an apparent paradox regarding the use of singular vectors in the study of predictability. On the one hand, several authors have argued that the leading singular vectors of a propagator maximize error growth and hence identify components that are poorly predicted [Lorenz, 1965; Farrell, 1990; Palmer, 1995]. On the other hand, other authors have argued that the leading singular vectors of a propagator maximize signal growth and hence identify components that are well predicted [Penland and Sardeshmukh, 1995]. How can singular vectors identify well-predicted components in one case and poorly predicted components in the other, when both are derived from the same propagator?

[68] To unravel this seeming paradox, consider the covariance matrix obtained by “squaring” both sides of the first equality in (70):

$$\Sigma_v = G\,\Sigma_{\mu_{i|\Theta}}\,G^H + G\,\Sigma_e\,G^H + \Sigma_\xi, \tag{72}$$

where Σμi∣Θ denotes the covariance of the analysis signal given in (69). The first term on the right-hand side of (72) measures forecast signal, the second term measures the forecast spread due to initial condition uncertainty, and the last term measures the forecast spread due to model randomness. The whitened version of (72) (after moving the signal term to the left-hand side) is

$$\Phi_\Theta = I - \tilde{G}_\mu\tilde{G}_\mu^H = \tilde{G}_e\tilde{G}_e^H + \tilde{\Sigma}_\xi, \tag{73}$$

where

$$\tilde{G}_\mu = \Sigma_v^{-1/2}\,G\,\Sigma_{\mu_{i|\Theta}}^{1/2}, \qquad \tilde{G}_e = \Sigma_v^{-1/2}\,G\,\Sigma_e^{1/2}, \qquad \tilde{\Sigma}_\xi = \Sigma_v^{-1/2}\,\Sigma_\xi\left(\Sigma_v^{-1/2}\right)^H. \tag{74}$$

By association with the terms from which they are derived, G̃μG̃μH is identified with climatological spread due to the forecast signal, G̃eG̃eH is identified with spread due to initial condition error, and Σ̃ξ measures the spread due to model noise. The appearance of two transformed propagators, when only one propagator is relevant in state space, may seem obfuscating, but it will be seen that these transformed propagators hold the key to resolving the paradox.

[69] The equivalence of the whitened forecast covariance matrix (73) to that given in (53) follows from the identity

$$\tilde{L}\tilde{L}^H = \Sigma_v^{-1/2}\,G B\,\Sigma_\Theta\,B^H G^H \left(\Sigma_v^{-1/2}\right)^H = \tilde{G}_\mu\tilde{G}_\mu^H, \tag{75}$$

where (68), (69), and (71) have been used. Equation (75) further implies that left singular vectors of the whitened regression operator L̃ are identical to those of G̃μ, demonstrating that the predictable components can be derived from either operator. However, the right singular vectors generally are not the same, as appropriate since the regression operator L acts on observations Θ, while the propagator G acts on initial states i. Nevertheless, just as the right singular vectors of L̃ give the observations that maximize the signal term in relative entropy subject to an equal likelihood constraint, so too the right singular vectors of G̃μ give the initial analyses that maximize the signal term, subject to an equal likelihood constraint on the initial analyses.

[70] To make further progress, it is necessary to examine the original arguments of Lorenz [1965]. Lorenz considered separate solutions of a dynamical system with slightly different initial conditions. For sufficiently small differences the solution is governed by a linear equation, called the tangent linear model, whose propagator generally is a function of time. Moreover, the tangent linear model has no inherent randomness. In order to make contact with this scenario we assume that the propagator for the tangent linear model is constant and the stochastic component of the model vanishes, i.e., Σξ = 0. In this scenario the whitened forecast covariance matrix (73) reduces to

$$\Phi_\Theta = I - \tilde{G}_\mu\tilde{G}_\mu^H = \tilde{G}_e\tilde{G}_e^H. \tag{76}$$

Equation (76) implies that the left singular vectors of G̃e also are the left singular vectors of G̃μ but with reversed ordering. Thus equation (76) establishes the remarkable fact that under the scenario outlined above, the singular vectors that maximize relative error are in precise one-to-one correspondence with the singular vectors that maximize predictability but with reversed ordering. The two matrices G̃e and G̃μ differ only by the normalization applied to the right side of the propagator G. This seemingly innocuous change in normalization fundamentally changes the interpretation of the propagators. In particular, G̃e measures the growth of forecast spread due to initial error, while G̃μ measures the growth of the forecast signal. We are led to the conclusion that the normalization applied to the right of the propagator determines whether the singular vectors maximize noise or maximize signal.

[71] The difference in normalization between G̃e and G̃μ can be interpreted as applying different constraints on the initial vector norm. According to the reasoning outlined in section 8, computing the SVD of G̃e is equivalent to optimizing the norm of the response

$$\|\Sigma_v^{-1/2}\,G\,e\|^2 = e^H G^H \Sigma_v^{-1} G\, e \tag{77}$$

subject to the constraint that

$$e^H \Sigma_e^{-1}\, e = 1. \tag{78}$$

In contrast to optimizing absolute error we optimize the relative error (77) and hence optimize predictability (a distinction clearly understood by Lorenz [1969]). The constraint (78) on the initial error may be interpreted as an equal likelihood constraint, as discussed at the end of section 5. The equal likelihood constraint can be interpreted as constraining the whitened errors to lie on the surface of a sphere. Thus this constraint appears to be consistent with Lorenz's [1969, p. 327] approach to considering an ensemble of initial errors that are random in the sense that “no direction in…[state] space is preferred over any other direction.” Also, the equal likelihood constraint is consistent with the constraint used in studies that compute ensemble-based estimates of the forecast error covariance with singular vectors, using an initial norm based on the analysis error covariance [Houtekamer, 1995; Molteni et al., 1996; Ehrendorfer and Tribbia, 1997].

[72] The results of this section actually resolve more than just the paradox mentioned earlier. Recall that the scenario considered by Lorenz [1965, 1969] assumed that the only source of forecast spread was initial condition error; that is, the model itself contained no stochastic component. If the model contains a stochastic component, then forecast uncertainty grows not only from initial condition error but also from stochastic forcing. Consistent with this, the forecast covariance matrix (73) implies that the singular vectors of G̃e coincide with the singular vectors of G̃μ (but with reversed order) only if the stochastic component in the model vanishes (i.e., Σξ = 0). Thus maximizing growth due to initial condition error captures only part of the total error growth. The question arises as to whether the singular vector methodology can be extended to models containing stochastic forcing. For constant dynamics the answer is immediate from (73): The total predictability is described by G̃μ, and therefore the singular vectors of this propagator optimize total predictability even if the model contains stochastic forcing.

[73] As discussed earlier (see discussion around (48)), the eigenvalues of ΦΘ lie between 0 and 1, which by (73) implies that the singular values of both G̃μ and G̃e do not exceed 1. Thus there is no “growth of perturbations” due to either operator in the sense that the response norm exceeds the initial norm. Although vectors do not grow in the norms defined in (77) and (78), they may grow in other norms. Importantly, the singular values of the normalized propagator G̃μ have a simple, direct interpretation in terms of predictability. In particular, if the forecast covariance is constant, minimum and maximum predictability correspond to singular values zero and one, respectively. The predictable components in the case of nonconstant forecast covariance are discussed by DelSole and Tippett [2007].

## 11. LINEAR STOCHASTIC MODELS

[74] Stochastic models provide a class of models in which predictability can be understood as the evolving response of a dynamical system to random forcing. As shown by Wang and Uhlenbeck [1945], any stationary, Gaussian, Markov process can be represented by a suitable linear stochastic model and vice versa. Therefore predictability of linear stochastic models can be solved immediately by recognizing that the stochastic model solution must be precisely of the form (65), whose predictability has already been discussed extensively. It merely remains to identify all predictability quantities in terms of stochastic model parameters. This section derives these identifications; for further details about stochastic models we recommend the work of Gardiner [1990] and DelSole [2004b].

[75] Consider a finite dynamical system of dimension K. The state of the system is specified by a K-dimensional vector $x_t$. If the dynamical system is linear and driven by noise, then the state is governed by a stochastic differential equation of the form

$$\frac{dx}{dt} = A x + w, \tag{79}$$

where A is a K × K matrix, called the dynamical operator, and w is a K-dimensional vector of stochastic processes. The dynamical operator A is assumed to be independent of time, in which case the differential system (79) is autonomous and solutions can be found explicitly. To ensure bounded solutions, the matrix A is assumed to be stable; that is, it possesses K distinct eigenvalues with negative real parts. The forcing w is assumed to be Gaussian white noise with zero mean and time-lagged covariance

$$\langle w(t)\, w(t')^{H} \rangle = Q\, \delta(t - t'), \tag{80}$$

where Q is a covariance matrix and δ is the Dirac delta function.

[76] For a particular realization of the forcing w, (79) constitutes a set of ordinary differential equations that can be solved by standard methods [Noble and Daniel, 1988]. The solution is

$$x_{t+\tau} = e^{A\tau} x_t + \xi, \tag{81}$$

where

$$\xi = \int_{t}^{t+\tau} e^{A(t+\tau-s)}\, w(s)\, ds, \tag{82}$$

where s is a variable of integration. The first term in (81) depends on the initial condition and hence represents a predictable part of the solution; this term can be identified with the signal, since ξ has zero mean. The matrix

$$e^{A\tau} = Z e^{\Lambda\tau} Z^{-1} \tag{83}$$

is called the propagator. In (83) the columns of Z give the eigenmodes of the dynamical operator A, and Λ is a diagonal matrix whose diagonal elements give the corresponding eigenvalues. The second term in (81), namely, ξ, depends on random forcing and hence is unpredictable; this term can be identified with the noise.

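[77] Comparison of (81) and (65) shows that the propagator is

$$G = e^{A\tau}. \tag{84}$$

Thus the whitened propagator is

$$W = \Sigma_v^{-1/2}\, e^{A\tau}\, \Sigma_i^{1/2}, \tag{85}$$

where Σi is the covariance matrix for the initial conditions, which depends on the assimilation. The climatological covariance can be computed by “squaring” both sides of (81), which gives

$$\Sigma_v = e^{A\tau}\, \Sigma_i\, e^{A^{H}\tau} + \Sigma_\xi, \tag{86}$$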

where $\langle x_t\, \xi^{H} \rangle = 0$ owing to causality. This expression is merely the signal-noise decomposition (17) for the stochastic model. It remains to compute the noise covariance matrix Σξ. If the forcing w satisfies (80), then a standard calculation in stochastic calculus shows that the noise term ξ has zero mean and covariance

$$\Sigma_\xi = \int_{0}^{\tau} e^{As}\, Q\, e^{A^{H}s}\, ds. \tag{87}$$

This expression can be evaluated explicitly by substituting the propagator (83). The result is

$$\Sigma_\xi = Z \left[ E \circ \left( Z^{-1} Q Z^{-H} \right) \right] Z^{H}, \tag{88}$$

where

$$E_{jk} = \frac{\exp\left[ (\lambda_j + \lambda_k^{*})\tau \right] - 1}{\lambda_j + \lambda_k^{*}}. \tag{89}$$

In expression (89), λk is the kth diagonal element of Λ, and the asterisk denotes the complex conjugate. In (88) the small circle denotes a Hadamard product (given two matrices, X and Y, the Hadamard product X ∘ Y is a matrix whose (i, j) element is XijYij). The above expression for the noise covariance matrix involves only known model parameters, namely, the eigenvectors and eigenvalues of the dynamical operator and the covariance matrix Q. This completes the identification of predictability quantities with their corresponding stochastic model parameters.
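The evaluation of the noise covariance by eigenmode expansion can be checked numerically. The following is a minimal sketch in Python with NumPy, assuming the noise covariance is the integral of exp(As) Q exp(A^H s) over 0 ≤ s ≤ τ and that A has distinct eigenvalues. The function names, the example matrices, and the midpoint-rule check are illustrative assumptions rather than part of the original analysis.

```python
import numpy as np

def propagator(A, tau):
    """exp(A tau) built from the eigenmodes of A (distinct eigenvalues assumed)."""
    lam, Z = np.linalg.eig(A)
    return (Z @ np.diag(np.exp(lam * tau)) @ np.linalg.inv(Z)).real

def noise_covariance(A, Q, tau):
    """Closed-form noise covariance using a Hadamard product in eigenmode space."""
    lam, Z = np.linalg.eig(A)                      # A = Z diag(lam) Z^{-1}
    Zinv = np.linalg.inv(Z)
    Q_modes = Zinv @ Q @ Zinv.conj().T             # forcing covariance seen by the eigenmodes
    S = lam[:, None] + lam[None, :].conj()         # S[j, k] = lam_j + conj(lam_k)
    E = (np.exp(S * tau) - 1.0) / S
    return (Z @ (E * Q_modes) @ Z.conj().T).real   # '*' is the elementwise (Hadamard) product

def noise_covariance_quadrature(A, Q, tau, n=2000):
    """Independent check: midpoint rule applied to the defining integral (real A assumed)."""
    ds = tau / n
    total = np.zeros_like(Q, dtype=float)
    for i in range(n):
        G = propagator(A, (i + 0.5) * ds)
        total += G @ Q @ G.T * ds
    return total

# Hypothetical stable, nonnormal dynamical operator and forcing covariance.
A = np.array([[-1.0, 4.0], [0.0, -2.0]])
Q = np.array([[1.0, 0.3], [0.3, 2.0]])
tau = 0.5

sigma_xi = noise_covariance(A, Q, tau)
sigma_check = noise_covariance_quadrature(A, Q, tau)
print(np.max(np.abs(sigma_xi - sigma_check)))
```

The closed form needs only one eigen-decomposition, whereas the quadrature needs a propagator evaluation at every step; the two should agree to within the quadrature error.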

## 12. BOUNDS ON PREDICTABILITY

[78] Generalized stability theory [Farrell and Ioannou, 1996a, 1996b] tells us that perturbations grow more strongly in nonnormal systems than in normal systems for all dynamical operators with the same eigenvalues. One might surmise from this that nonnormality diminishes predictability, since it enhances error growth. However, one could equally well surmise that nonnormality enhances predictability because it enhances signal growth. Confounding these arguments is the fact that Ioannou [1995] showed that the response of a linear system to white noise has larger variance in nonnormal systems than in normal systems. Thus nonnormality changes both the forecast and climatological variances, rendering the net predictability ambiguous.

[79] To gain insight into the role of nonnormality in predictability, we consider the problem of finding the forcing covariance matrix Q that optimizes predictability, subject to fixed dynamical operator A. In general, predictability can be made arbitrarily small by increasing the analysis error variance. A useful reference is the perfect initial condition scenario in which the analysis errors vanish. The lower bound for this scenario defines the greatest lower bound on predictability, since imperfect initial conditions can only reduce predictability further. Under this assumption the whitened regression operator becomes the whitened propagator for a stationary climatology

$$W = \Sigma_v^{-1/2}\, e^{A\tau}\, \Sigma_v^{1/2}, \tag{90}$$

where we have used the fact that Σv = Σi in a perfect initial condition scenario for a stationary system. Since the whitened propagator is related to the original propagator by a similarity transformation, the two propagators have identical eigenvalues. The whitened forecast covariance matrix is $I - WW^{H}$.

[80] We have established that all eigenvalues of the whitened forecast covariance matrix $I - WW^{H}$ lie between 0 and 1. By following essentially the same logic as in section 8 it follows that all singular values of the whitened propagator W are necessarily less than 1. This fact implies that the whitened propagator does not give rise to “nonnormal growth”; that is, predictability decays monotonically from all initial conditions in Gaussian, Markov systems. This is true even if the propagator supports nonnormal growth in the energy norm. Although the prewhitening transformation removes nonnormal growth from the predictability dynamics, it generally does not make the dynamics normal. The nonnormality of W, as indicated by differences between the singular values and the modulus of eigenvalues, has an important influence on predictability, as will be seen in the remainder of this section.
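The contrast between growth in the energy norm and the absence of growth after whitening can be illustrated with a small example. This is a hedged sketch, assuming a perfect initial condition scenario in which the whitening uses the stationary (climatological) covariance, obtained here as the solution S of A S + S A^H + Q = 0. The matrices A and Q, the lead time, and the helper names are hypothetical choices made only for illustration.

```python
import numpy as np

def propagator(A, tau):
    """exp(A tau) built from the eigenmodes of A (distinct eigenvalues assumed)."""
    lam, Z = np.linalg.eig(A)
    return (Z @ np.diag(np.exp(lam * tau)) @ np.linalg.inv(Z)).real

def stationary_covariance(A, Q):
    """Climatological covariance S satisfying A S + S A^H + Q = 0 for stable A."""
    lam, Z = np.linalg.eig(A)
    Zinv = np.linalg.inv(Z)
    Q_modes = Zinv @ Q @ Zinv.conj().T
    E = -1.0 / (lam[:, None] + lam[None, :].conj())   # infinite-lead limit of the mode integrals
    S = (Z @ (E * Q_modes) @ Z.conj().T).real
    return 0.5 * (S + S.T)                            # remove round-off asymmetry

def sym_power(S, p):
    """Power of a symmetric positive definite matrix via its eigen-decomposition."""
    w, V = np.linalg.eigh(S)
    return (V * w**p) @ V.T

# Hypothetical stable but strongly nonnormal operator, forced by white noise with Q = I.
A = np.array([[-1.0, 10.0], [0.0, -2.0]])
Q = np.eye(2)
tau = 0.5

G = propagator(A, tau)
Sv = stationary_covariance(A, Q)
W = sym_power(Sv, -0.5) @ G @ sym_power(Sv, 0.5)      # whitened propagator

growth_energy = np.linalg.svd(G, compute_uv=False)    # can exceed 1 (nonnormal growth)
growth_whitened = np.linalg.svd(W, compute_uv=False)  # all below 1
print(growth_energy, growth_whitened)
```

In this example the leading singular value of G exceeds 1 while every singular value of W stays below 1, and W keeps the eigenvalues of G because the two are related by a similarity transformation.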

[81] We first consider bounds on the Mahalanobis error. According to relation (35) the Mahalanobis error is related to the singular values of the whitened propagator as

where ρk denotes the kth singular value of W, ordered from largest to smallest. A standard result from linear algebra [Marcus and Minc, 1992, p. 117] states that the sum of squared singular values is bounded below by the sum of squared modulus eigenvalues:

where λk refers to the kth eigenvalue of A, ordered from largest to smallest. Bound (92) implies that the Mahalanobis error is bounded as

Remarkably, this bound depends only on the real part of the eigenvalues of the dynamical operator. Since large Mahalanobis error corresponds to low predictability, inequality (93) gives a lower bound on predictability. Before interpreting this result, consider other measures. Predictive information depends on the determinant of the whitened forecast covariance matrix. Using the fact that the determinant of a matrix equals the product of its eigenvalues and using the convexity properties of −log(1 − x²), Tippett and Chang [2003] show that

Similarly, Tippett and Chang [2003] show that relative entropy is bounded as

provided the effect of initial conditions is neglected. In all cases examined above, the lower bound of predictability depends only on the eigenvalues of the dynamical operator.

[82] The above results establish the existence of a lower bound, but they do not show that it is obtainable. It turns out that the lower bound is achieved when the whitened dynamical propagator is normal in the sense that

$$W W^{H} = W^{H} W. \tag{96}$$

In particular, when W is normal, its eigenvectors are orthogonal and the eigenvalues of $WW^{H}$ are exp(2 Re λk). When W is nonnormal, the eigenvalues of $WW^{H}$ lie outside the range of exp(2 Re λk). This latter fact is the reason that perturbations can grow in finite time in linear systems with decaying modes [Farrell and Ioannou, 1996a, 1996b]. This result, which forms the core of generalized stability analysis, implies that the response to white noise always has larger variance in nonnormal systems than in normal systems [Ioannou, 1995]. The corresponding conclusion in predictability theory is that nonnormality of the whitened dynamics increases predictability. The above analysis illustrates how familiar methods and concepts from generalized stability analysis can be brought to bear on predictability questions.

[83] We now establish conditions for a whitened dynamical operator to be normal; that is, for W to satisfy (96). Farrell and Ioannou [1993b] prove that the most general transformation that renders a dynamical operator normal is of the form $P = U D Z^{-1}$, where U is unitary, D is a diagonal matrix with positive diagonal elements, and Z is the eigenvector matrix discussed in (83). In order for this transformation to be a whitening transformation, it must be true that

where we have substituted (88). Equality (97) can be manipulated into the form

Since the right-hand side is a diagonal matrix, the matrix in parentheses on the left-hand side must also be diagonal, which implies that the forcing covariance matrix is of the form

where Y is a positive definite, diagonal matrix. The conclusion that the forcing covariance matrix is of the form (99) is equivalent to the statement that the forcing covariance matrix and dynamical operator are simultaneously diagonalizable. If we now substitute the forcing covariance matrix (99) into the covariance solution (88) and use this result to compute the predictive information from (25), we find that

where diag(E) denotes a diagonal matrix with the same diagonal elements as E. Note that the diagonal matrix Y cancels in expression (100), indicating that predictability in this case is independent of the amplitudes of the random forcing (so long as the amplitudes are nonzero). Furthermore, the predictability given in (100) for uncorrelated random forcing is identical to the general lower bound (94). A similar analysis shows that each lower bound of predictability given in (93), (94), and (95) is achieved when the forcing covariance matrix and dynamical operator are simultaneously diagonalizable (i.e., Q is of the form (99)).
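The attainment of the lower bound can be illustrated numerically. The sketch below assumes that a forcing covariance that is simultaneously diagonalizable with the dynamical operator can be written as Q = Z Y Z^H with Y diagonal and positive, uses the stationary covariance for whitening, and takes a real operator with real, distinct eigenvalues so that Z is real. It compares the sum of squared singular values of the whitened propagator with the sum of squared eigenvalue moduli. All matrices and names are hypothetical.

```python
import numpy as np

def whitened_propagator(A, Q, tau):
    """Whitened propagator for a stationary climatology (distinct eigenvalues assumed)."""
    lam, Z = np.linalg.eig(A)
    Zinv = np.linalg.inv(Z)
    G = (Z @ np.diag(np.exp(lam * tau)) @ Zinv).real           # propagator exp(A tau)
    E = -1.0 / (lam[:, None] + lam[None, :].conj())
    Sv = (Z @ (E * (Zinv @ Q @ Zinv.conj().T)) @ Z.conj().T).real  # stationary covariance
    w, V = np.linalg.eigh(0.5 * (Sv + Sv.T))
    # Sv^{-1/2} G Sv^{1/2}, with the matrix square roots taken from the eigen-decomposition.
    return (V / np.sqrt(w)) @ V.T @ G @ (V * np.sqrt(w)) @ V.T

A = np.array([[-1.0, 10.0], [0.0, -2.0]])    # stable, nonnormal, real eigenvalues
tau = 0.5
lam, Z = np.linalg.eig(A)
lower_bound = np.sum(np.exp(2.0 * lam.real * tau))   # sum of squared eigenvalue moduli of W

# Forcing that shares the eigenmodes of A; amplitudes in Y are arbitrary positive numbers.
Y = np.diag([3.0, 0.2])
W_diag = whitened_propagator(A, Z @ Y @ Z.T, tau)

# Generic forcing (identity covariance), which is not of the form Z Y Z^H for this A.
W_gen = whitened_propagator(A, np.eye(2), tau)

sum_sq_diag = np.sum(np.linalg.svd(W_diag, compute_uv=False) ** 2)
sum_sq_gen = np.sum(np.linalg.svd(W_gen, compute_uv=False) ** 2)
print(lower_bound, sum_sq_diag, sum_sq_gen)
```

For the simultaneously diagonalizable forcing the whitened propagator is normal and its squared singular values sum to the bound regardless of the amplitudes in Y, whereas the generic forcing gives a strictly larger sum, that is, higher predictability.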

[84] The above results can be summarized very simply: The minimum predictability of a stochastic model, under a perfect initial condition scenario, occurs when the whitened dynamical operator is normal, which is equivalent to the condition that the forcing covariance matrix and dynamical operator are simultaneously diagonalizable. Remarkably, this minimum predictability depends only on the real part of the eigenvalues of the dynamical operator; in particular, the minimum value is independent of the structure of the eigenmodes, independent of the amplitude of the random forcing applied to each eigenmode, and independent of the imaginary part of the eigenvalues.

[85] If the dynamical operator and noise covariance matrix are simultaneously diagonalizable, then the original K-dimensional system can be transformed into K uncoupled one-dimensional systems. Thus minimum predictability corresponds to systems that are fundamentally a collection of independent, one-dimensional systems. In essence, coupling enhances predictability.

[86] The condition that the dynamical operator A and noise covariance matrix Q are simultaneously diagonalizable is equivalent to the condition that the two operators commute:

Condition (101) is identified with the condition of detailed balance [Gardiner, 1990; Weiss, 2003]. Detailed balance occurs when the joint probability for transition from Y to Z equals the joint probability for transition from Z to Y. Detailed balance implies that a system is reversible in a probabilistic sense, although individual realizations do not satisfy time reversal symmetry.

[87] The above results establish that minimum predictability occurs when detailed balance is satisfied or, equivalently, when the whitened dynamical operator is normal. Both conditions are invariant with respect to nonsingular linear transformations. However, the condition of detailed balance can be established even for nonlinear systems with multiplicative noise in contrast to the concept of a whitened dynamical operator. Whether detailed balance corresponds to minimum predictability in these more complex models remains to be established.

[88] If the whitened dynamical operator is normal, then peaks in the power spectrum occur only at the imaginary part of the eigenvalues of the dynamical operator. Since the predictability of such systems depends only on the real part of the eigenvalues, it follows that predictability of normal systems is independent of the location of spectral peaks [see also Chang et al., 2004].

[89] In contrast to the lower bound discussed above, the upper bound of predictability is not well understood. In unpublished work the authors have shown that the solutions that attain the lower bound given in (94) are the only solutions that render the predictability stationary. Thus, if upper bounds exist, these bounds must lie on the “boundary” of allowed solutions; for instance, the upper bounds must be in the set of semidefinite forcing covariance matrices. Tippett and Chang [2003] prove that the upper bound of predictability in two dimensions is given by a rank 1 forcing, consistent with being a “boundary.” Remarkably, the predictability of stochastic models driven by rank 1 forcing is independent of the forcing, provided all eigenmodes are excited. Tippett and Chang conjecture that rank 1 forcing gives the upper bound to predictability for all stochastic models with the same dynamical eigenvalues. If this conjecture is true, then

To our knowledge, no empirical results contradict these bounds.

[90] Rank 1 forcings also arise when maximizing the ratio $\mathrm{Tr}[\Sigma_{v|\Theta}]/\mathrm{Tr}[Q]$, the ratio of the response variance to the noise variance. Specifically, the noise covariance that maximizes this ratio is the rank 1 covariance $Q = ff^{H}$, where the vector f is a stochastic optimal [Farrell and Ioannou, 1993a; Kleeman and Moore, 1999; DelSole, 2004b]. Chang et al. [2004] show that, in the limit of short lead time, these same stochastic optimals minimize the ratio $\mathrm{Tr}[\Sigma_{v|\Theta}]/\mathrm{Tr}[\Sigma_{v}]$ and hence maximize this measure of predictability (see Chang et al. [2004] for an explanation of this counterintuitive result). As far as measures from information theory are concerned, stochastic optimals have no special status among rank 1 forcings because measures from information theory are independent of the rank 1 forcing [Tippett and Chang, 2003]. The relevance of stochastic optimals to predictability in stochastic models with full rank forcing has not been fully explored.

## 13. PRACTICAL ISSUES

[91] We now come to a fundamental limitation that haunts all predictability studies: The true forecast distribution is not known. In this section we discuss practical issues with estimating this distribution and suggest new approaches inspired by the framework discussed in this paper.

[92] A first step to estimating a probability distribution is to parameterize it. This means that the distribution is assumed to have a specific functional form that depends on a finite set of parameters. For example, a normal distribution requires specifying all the parameters contained in the mean vector and the covariance matrix. Having parameterized the distribution, the next step is to estimate the parameters based on a finite sample, usually by optimizing some measure of the “goodness of fit” between the assumed distribution and the observed sample. The outstanding problem with estimating parameters in this way is overfitting, i.e., fitting specific random features in the observed sample that do not recur in independent samples. In the context of regression analysis, overfitting leads to artificially small residuals, which implies artificially enhanced predictability. Thus estimation requires striking a balance between using enough parameters to capture important features in the distribution but not too many to produce artificial predictability. In practice, the sample size is fixed, so the total number of parameters must be constrained in some way in order to obtain accurate estimates from the sample.
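The effect of overfitting on apparent predictability can be seen in a toy regression. The sketch below is an illustration under stated assumptions, not a method from this paper: the predictand and all predictors are independent Gaussian noise, so the true predictability is zero, yet the in-sample explained fraction of variance rises as predictors are added. Sample sizes, the random seed, and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, max_predictors = 30, 1000, 25

# Predictors and predictand are mutually independent noise: nothing is truly predictable.
X_train = rng.standard_normal((n_train, max_predictors))
X_test = rng.standard_normal((n_test, max_predictors))
y_train = rng.standard_normal(n_train)
y_test = rng.standard_normal(n_test)

def explained_fraction(y, y_hat):
    """One minus the ratio of residual sum of squares to total sum of squares."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum(y ** 2)

in_sample, out_of_sample = [], []
for m in range(1, max_predictors + 1):
    # Least squares fit using only the first m predictors.
    beta, *_ = np.linalg.lstsq(X_train[:, :m], y_train, rcond=None)
    in_sample.append(explained_fraction(y_train, X_train[:, :m] @ beta))
    out_of_sample.append(explained_fraction(y_test, X_test[:, :m] @ beta))

for m in (1, 5, 15, 25):
    print(m, round(in_sample[m - 1], 3), round(out_of_sample[m - 1], 3))
```

Because the fitted models are nested, the in-sample explained fraction cannot decrease as predictors are added, which is exactly the artificial predictability described above; the independent test sample provides the honest assessment.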

[93] One approach to limiting the number of parameters is to project the variables onto a reduced subspace. The implications of this projection can be understood by partitioning the state into two parts, v = [vA, vB]. Using standard “chain rules” in information theory [Cover and Thomas, 1991, section 2.5], measures such as predictive information can be written as

Analogous chain rules exist for relative entropy and mutual information but are not shown. The result shows that the total predictability of a two-component system [vA, vB] can be decomposed into the predictability of vB alone plus the predictability of vA conditioned on vB. The last term in braces is related to information transfer [Schreiber, 2000]. Importantly, the two terms within braces are nonnegative. Therefore predictability measured in the truncated system constitutes a lower bound on the total predictability. Different partitions produce different lower bound estimates, reflecting the fact that different partitions correspond to different pieces of information available to the observer. Since predictability in a truncated system constitutes a lower bound on the total predictability, the choice of variables is guided by the goal of identifying variables that lead to the greatest predictability, without overfitting.
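The lower bound property of truncated systems can be illustrated for the Gaussian case. The sketch below assumes normal distributions with a forecast covariance that does not depend on the initial condition, so that predictive information takes the form of one half the log ratio of the climatological determinant to the forecast determinant, and it assumes the signal-noise decomposition of the climatological covariance. The covariance matrices are randomly generated and purely hypothetical.

```python
import numpy as np

def predictive_information(clim_cov, forecast_cov):
    """Gaussian predictive information: 0.5 * log(det(clim_cov) / det(forecast_cov))."""
    _, logdet_clim = np.linalg.slogdet(clim_cov)
    _, logdet_fcst = np.linalg.slogdet(forecast_cov)
    return 0.5 * (logdet_clim - logdet_fcst)

rng = np.random.default_rng(1)
K = 6

# Hypothetical signal and noise covariances; climatology = signal + noise.
S = rng.standard_normal((K, K))
N = rng.standard_normal((K, K))
signal_cov = S @ S.T
noise_cov = N @ N.T + 0.1 * np.eye(K)
clim_cov = signal_cov + noise_cov

total = predictive_information(clim_cov, noise_cov)

# Truncate the state to its first m variables by taking leading sub-blocks.
truncated = [
    predictive_information(clim_cov[:m, :m], noise_cov[:m, :m])
    for m in range(1, K + 1)
]
print(total, truncated)
```

Each truncated value is bounded above by the total, and enlarging the retained set of variables never lowers the estimate, consistent with the nonnegativity of the conditional term in the chain rule.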

[94] To estimate a distribution, we must have a sample drawn from it. Unfortunately, the climate system provides only a single realization, and the available forecast models cannot be used for this purpose because they are not perfect. Nevertheless, imperfect forecast models still can be useful. Recall that the forecast f is the result of assimilating observations into an initial condition and subsequently solving a set of nonlinear differential equations. Thus f can be interpreted as a nonlinear function of Θ, and p(v|f(Θ)) is a reliable forecast distribution. The possible advantage of estimating p(v|f) instead of p(v|Θ) is that the relation between v and f may be easier to parameterize than the relation between v and Θ, because the forecast model may capture a significant fraction of the nonlinear and nonlocal relations, especially those due to advection. In the univariate, normal case the mean of the distribution p(v|f) is called a model output statistics correction [Wilks, 1995]. DelSole [2005] proved that the predictability measured from p(v|f) can never exceed the predictability of the perfect model forecast p(v|Θ). Thus the perfect model forecast has the distinction of being the reliable forecast with maximum predictability.

[95] The most common basis set for reducing the dimension of predictability analysis is the leading principal components of the system. However, principal components maximize variance and hence may not be the best choice for predictability analysis. An alternative basis set worth investigating is the potential predictable components, which are the predictable components of a forecast model with respect to its climatology. DelSole [2005] showed that these components can be used as predictors of the statistically corrected forecast without loss of generality. DelSole and Shukla [2006] showed that these predictors can often produce superior statistical corrections, although overfitting problems due to small sample sizes still need further attention.

[96] In the statistics literature, methods for dealing with overfitting are associated with the terms model selection, regularization, cross-validation, and minimum description length (see Hastie et al. [2001] for an overview). Different optimization methods often have different regularization techniques associated with them. For instance, predictable components arise not only from a regression analysis between forecast and observations (section 8) but also between forecast member and forecast mean [Tippett, 2006]. The latter regression problem is more convenient from a practical point of view, since it involves only forecast variables, and is important in Monte Carlo forecasting [Leith, 1974]. To show this connection explicitly, consider the linear regression problem

which has the least squares solution

It can be shown that the eigenvectors of R are the predictable components p defined in section 7, and the eigenvalues of R are the squared anomaly correlations. A common regularization method in regression theory is to express the predictor and predictands in a truncated EOF expansion [Barnett and Preisendorfer, 1987]. The question of what truncation gives the “best” regression model then is the problem of model selection, for which there are many methods, including in-sample criteria like the Akaike information criterion, the Bayesian information criterion, and generalized cross-validation and out-of-sample methods like cross-validation [Zucchini, 2000]. In this way, techniques developed in the context of linear regression theory can be applied to predictable components analysis. Further examples of this approach in a different context are given by Schneider and Held [2001] and DelSole [2006].

[97] The above problems also arise in data assimilation. Indeed, since Kalman filter methods can be formulated as regression between observation and model state, methods developed in the context of data assimilation may be relevant to predictability estimation. For instance, the method of covariance filtering renders the sample covariance full rank by forcing the correlation between spatially distant points to be zero [Houtekamer and Mitchell, 2001; Hamill et al., 2001]. Hybrid covariance methods use the sample covariance on the space spanned by the ensemble data and an analytic covariance function on the remainder [Hamill and Snyder, 2000].

[98] Recently, Majda et al. [2002], Abramov and Majda [2004], and Abramov et al. [2005] proposed an elegant approach to dealing not only with finite samples but also with non-Gaussian distributions. The basic idea is to determine the forecast distribution that minimizes predictability, as measured by relative entropy, subject to the constraint that the leading moments of the forecast distribution equal the sample moments derived from the ensemble. The resulting optimization problem can be solved using standard maximum entropy methods. This approach is formally identical to the minimum discriminant information approach of Kullback [1959], who proposed this method as a general approach to statistical hypothesis testing, a connection that probably could be exploited further. Abramov and Majda [2004] show that the moment constraints can be introduced in a hierarchy such that the predictability increases as more moment constraints are imposed. Furthermore, the hierarchy can be decomposed into contributions due to signal, noise, and “cross-terms,” which facilitates interpretation. In this framework, relative entropy serves not only as a measure of predictability but also as a measure of how much discriminatory information is gained by including particular moments.

[99] For highly non-Gaussian distributions that are not well described by their leading moments other methods may be needed. The statistics literature contains many techniques for estimating probability distributions, including histograms, maximum likelihood, Bayesian inference, kernel estimation, and K-nearest neighbors. In the dynamical systems literature, mutual information is used to select the time delay required in constructing delay coordinates for plotting attractors. In this context, Fraser and Swinney [1986] proposed a partitioning procedure that adapts to the available sample [see also Darbellay and Vajda, 1999]. Mutual information is also used in physiology to study synchronization, and in this context Kraskov et al. [2004] proposed an approach to estimating mutual information based on K-nearest neighbor distances. Other applications of information theory appear frequently in physics, statistical science, and physiological journals.

[100] Although the techniques discussed in this paper were derived for Gaussian distributions, they can be useful even for non-Gaussian distributions. In particular, PrCA maximizes a ratio of variances, linear regression minimizes a sum of squares, and CCA maximizes a correlation, independent of the distribution. The Gaussian assumption also gives a lower bound to predictive information and mutual information, provided the climatological distribution is normal. This follows because these quantities can be written as a difference in entropies and the normal distribution maximizes entropy among all distributions with the same first two moments.

[101] We hope that this paper clarifies the connections between information theory, predictability analysis, and multivariate statistics and that it stimulates further interaction between these fields.

## Acknowledgments

[102] Many of the ideas presented in this review have been distinctively influenced by our teachers and colleagues, including Ping Chang, Ben Kirtman, Tapio Schneider, J. Shukla, and David Straus. The anonymous reviewers also provided detailed and insightful comments that led to many clarifications. The first author's research was supported by the National Science Foundation (ATM0332910), National Aeronautics and Space Administration (NNG04GG46G), and the National Oceanic and Atmospheric Administration (NA04OAR4310034). The second author's research was supported by the International Research Institute for Climate and Society and National Oceanic and Atmospheric Administration Office of Global Programs grant NA07GP0213. The views expressed herein are those of the authors and do not necessarily reflect the views of NOAA or any of its subagencies.

[103] The Editor responsible for this paper was Gerald North. He thanks two anonymous technical reviewers and one anonymous cross-disciplinary reviewer.