Locally Stationary Wavelet Packet Processes: Basis Selection and Model Fitting

For non‐stationary time series, the fixed Fourier basis is no longer canonical. Rather than limit our basis choice to wavelet or Fourier functions, we propose the use of a library of non‐decimated wavelet packets from which we select a suitable basis (frame). Non‐decimated packets are preferred to decimated basis libraries so as to prevent information ‘loss’ at scales coarser than the finest. This article introduces a new class of locally stationary wavelet packet processes and a method to fit these to time series. We also provide new material on the boundedness of the inverse of the inner product operator of autocorrelation wavelet packet functions. We demonstrate the effectiveness of our modelling and basis selection on simulated series and Standard and Poor's 500 index series.


M.B. PRIESTLEY: GIANT OF TIME SERIES
Today, it is clear that Maurice Priestley's fascinating, lucid and encyclopaedic body of work was way ahead of its time. Certainly, his path-breaking work on non-stationary time series provides the basis for a great deal of the academic work carried out today: that in the scientific literature, that contained in modern software packages and through into applications. Many were privileged to benefit from scientific interlocution with Professor Priestley. In our case, this communication was not only about the then new field of wavelets but also about the 'oscillatory process' idea, which was, and is, a key inspiration to all working in non-stationary time series.
This article is focused on a, maybe, little-explored part of Priestley's panoply that can be summarized by the following quote from Priestley (1983, p. 822), which refers to representations for non-stationary processes: " Parzen (1959) has pointed out that if there exists a representation X.t/ D R t .!/dZ.!/, then there is a multitude of different representations of the process, each representation based on a different family of functions." and "The situation is in some ways similar to the selection of a basis for a vector space." and "However, if the process is non-stationary this choice [complex exponential family] of functions is no longer valid."

INTRODUCTION
If a time series is stationary, then classical (Fourier) theory provides optimal and well-tested means for its analysis. Indeed, if the series, X t ; t 2 Z, is stationary, then it is required by theory to possess the following wellknown decomposition: where d .!/ is a zero-mean orthonormal increments process and A.!/ is the amplitude function (for a process with absolutely continuous spectral distribution function, see Priestley (1983) Section 4.11, for example). There are several beautiful proofs that establish that the Fourier representation is the canonical one in the stationary situation. See, for example, the nice expositions in Hannan (1960) and Priestley (1983) Section 4.11. We are interested in the case where X t might be locally stationary, that is, over short periods of time the series appears to be stationary but it can change its statistical properties slowly over (longer periods of) time. The concept of non-stationary time series has been appreciated for many years. The theory of non-stationary processes was significantly advanced by a series of papers by M.B. Priestley and co-authors from the mid-1960s notably the RSS Read Paper: Priestley (1965). A rigorous asymptotic framework for local stationarity modelling was introduced in Dahlhaus (1996aDahlhaus ( , 1997 within a framework that we call locally stationary Fourier processes. Remark 1 (Rescaled time asymptotics and locally stationary Fourier processes). The locally stationary Fourier model from Dahlhaus (1997) is a (triangular array of) stochastic process(es) represented by 153 where d .!/ is a zero-mean orthonormal increment process. The transfer function A 0 t;T satisfied sup t;! jA 0 t;T .!/ A.t =T; !/j Ä K T 1 , for all T , for some constant K and 2 -periodic function A W OE0; 1 R ! C satisfying A.u; !/ D A.u; !/ and A.u; !/ is a continuous function in u 2 OE0; 1. The quantity u D t=T was called rescaled time and ! 2 . ; / the frequency. This definition permits the uniform convergence A 0 t;T .!/ ! A.´; !/ to be well defined and therefore allows meaningful asymptotics for the locally stationary spectra. When the function A t;T .!/ is constant with respect to t , then the locally stationary Fourier process becomes stationary. (Dahlhaus's definition is more detailed with more technical conditions that we omit here.) Most locally stationary representations, up to and including Dahlhaus (1997), rely on the Fourier basis to furnish 'oscillation'. One of the key messages that we wish to emphasize is that for non-stationary processes the Fourier basis is no longer canonical. Silverman (1957) remarked on this predominance of 'harmonizable processes'. However, Priestley (1988) (and others) already explicitly considered the possibility of using oscillatory functions other than Fourier for the purpose of basis representation, and this observation constitutes one of the main inspirations for the current article. For example, Nason et al. (2000) address this by introducing locally stationary time series models based on wavelets that they call locally stationary wavelet processes. The remainder of this section focuses on process definitions; more explicit definitions of the underlying bases of oscillatory functions is provided in the next section. Remark 2 (Wavelets and locally stationary wavelet processes). The locally stationary wavelet model from Nason et al. (2000) represents the process (array) by where ¹ j;k º is a collection of zero mean uncorrelated random variables, the vectors ¹ j;k .t /º is a set of nondecimated discrete Daubechies wavelets (defined later in (6)) and ¹w j;kIT º is a set of amplitudes. The amplitudes have further technical conditions imposed on them, but they are analogues of the quantities in the Fourier representation in (2): ¹ j;k .t /º is the analogue of ¹exp.i !t /º, ¹ j;k º the analogue of d .!/ and ¹w j;kIT º the analogue of ¹A t;T .!/º. As with the locally stationary Fourier model, the amplitudes w j;kIT are closely related to an amplitude function W j .t =T / and an underlying rescaled time asymptotic model. The evolutionary wavelet spectrum S j .´/ D W j .´/ 2 . When w j;kIT is a constant function of k or, equivalently, S j .´/; W j .´/ are constant functions of´, then the associated locally stationary wavelet processes are second-order stationary.
The locally stationary wavelet framework has also been successfully used to model multi-variate time series as in Sanderson et al. (2001) and Park et al. (2014) and references therein. However, rather than limit the choice to wavelet or Fourier bases, a further alternative would be to select a basis from an overcomplete set of alternatives that is commonly referred to as basis library. The benefits of basis libraries in statistical time series modelling were first realized by Ombao et al. (2001;2002;2005) who used the SLEX functions from Wickerhauser (1994) as follows.
where ‰ C .t /; ‰ .t / are two specially constructed smooth real-valued window functions. The SLEX time (timeblock)-varying transfer function can be computed as the inner product of X tIT with the respective SLEX basis function. Ombao et al. (2001) use this system for adaptive segmentation of a time series into piecewise stationary processes and for spectral smoothing. Ombao et al. (2002) introduce the process, estimation theory and show asymptotic equivalence to the Dahlhaus locally stationary Fourier model. Ombao et al. (2005) extend the idea to multi-variate time series. Another example is Donoho et al. (2003) who are concerned with locally stationary covariance estimation using penalized basis methods. A general review of locally stationary time series models can be found in Nason and von Sachs (1999) and Dahlhaus (2012); see also Cardinali and Nason (2008) for an additional recent set of references. There are many possible models ('the multitude'), and not much is known about how the respective process classes of locally stationary Fourier, locally stationary wavelet, SLEX and our model overlap. From a more theoretical standpoint, these different models correspond to different tilings of the timefrequency plane and, hence, have different characteristics in analysis mode extracting often very different aspects of information from a time series. This article proposes the use of the overcomplete dictionary of NDWPs from which we select a suitable basis. Non-decimated packets are preferred to decimated basis libraries so as to prevent information 'loss' at scales coarser than the finest. Therefore, this article introduces the new class of locally stationary wavelet packet (LSWP) processes and a method to successfully fit these to time series data. We propose a complete framework for process representation and inference for the associated time-frequency spectra, and we provide theoretical results concerning the existence of an asymptotically unbiased spectral estimator in this setting.
A key conceptual difference between the SLEX model earlier and our wavelet packet models later is that we use NDWP basis functions. For process representation and spectral estimation of many processes, we surmise that probably both work similarly but SLEX, in not being non-decimated will possibly be more computationally efficient for some processes. However, for other processes, especially for finite T , the non-decimation can pick up structure that SLEX might miss. Although widely referred to in the signal processing literature, wavelet packets have not, until now, been extensively used within statistical time series. Exceptions using the non-decimated version are Walden and Contreras Cristan (1998), Section 6 of Percival and Walden (2000), Nason et al. (2001), Nason and Sapatinas (2002), Gabbanini et al. (2004), Cardinali (2009), Milne et al. (2009), Yang et al. (2009 and Garcia et al. (2013).
There appears to be a misconception about locally stationary processes that use non-decimated transforms, for example, Ombao et al. (2002, p. 173), who claim that it is not straightforward to simulate realizations. On the contrary, Cardinali and Nason (2008) mention LSWsim, a fast O.T log T / function, that simulates any locally stationary wavelet process. Similar fast functions have been constructed for our current work involving packets with the same order of computation as the fast Fourier transform that SLEX makes use of.
Section 3 provides a quick review of wavelets, wavelet packets and basis libraries and introduces the relevant notation. Section 4 presents our modelling framework and the relevant estimators for a fixed basis eventually selected from an overcomplete basis library. Section 5 illustrates our methodology to select an appropriate basis from a wavelet packet dictionary. Section 6 presents simulations of several LSWP processes for which we assess the finite sample performances of our basis selection method. Section 7 presents an application of our inferential methods on S&P's 500 returns, and Section 8 concludes, outlining directions for future work. Proofs of the main theoretical results are deferred to the appendix appearing on the Issue Information. wavelet transform; see Daubechies (1992), Percival and Walden (2000) or Nason (2008) for alternative accounts. Wavelets can be used as building blocks for a wide variety of non-smooth signals, in situations where the Fourier functions would not be suited. There are many wavelets one might use. Daubechies (1992) provides an introduction to the mathematical foundation of wavelets, including the Least Asymmetric (LA) bases, which were the first compactly supported wavelets designed to be quasi-symmetric. As often emphasized, wavelets have a gender, that is, the father wavelet is built from a low-pass linear filter designed to provide a local signal approximation, whereas the mother wavelet is built from a high-pass filter identifying the local signal variation. The mother and father wavelets can be dilated and translated to form a location-scale family that is used to produce a multiresolution approximation for functions. From the mother wavelet .t /, we can form daughters j;k .t / D 2 j=2 ¹2 j .t 2 j k/º for translates k 2 Z and scale parameter j 2 Z. For suitable choices of mother wavelet, the system ¹ j;k .t /º j 2Z;k2Z can become an orthonormal basis for functions f 2 L 2 .R/ for example. For non-decimated wavelets, the 2 j k is replaced by k, and then, we obtain a system of functions able to provide useful representations, but no longer orthogonal. Possibly, the simplest example of a mother wavelet is the Haar wavelet defined by .t / D 2 1=2 for t 2 .0; 1=2/, 2 1=2 for t 2 .1=2; 1/ and zero elsewhere. However, to build discrete time series, we use discretized versions of wavelets as described next. Nason et al. (2000) introduced discrete non-decimated wavelets designed to represent discrete time series X t ; t 2 Z as in (3). These are derived using the same ¹h k º k and ¹g k º k low-pass and high-pass quadrature mirror (finite impulse response) filters that Daubechies (1992) used to build her compactly supported continuous time wavelets. For example, for Haar wavelets, h 0 D h 1 D g 0 D 2 1=2 and g 1 D 2 1=2 . At each scale j 1, the associated discrete non-decimated wavelets j D . j;0 ; : : : ; j;N j 1 / are vectors with up to N j coefficients defined by 1;k D X n g k 2n ı 0;n D g k ; k D 0; : : : ; N 1 1; j C1;k D X n h k 2n j;n ; k D 0; : : : ; N j 1 1; where ı 0;n is the Kronecker delta, N j D .2 j 1/.N 1/ C 1 and N is the length of the filters ¹h k º. At heart, the discrete wavelet vectors, j , are oscillatory replacements of the Fourier vectors exp.i !t /, both of which satisfy various internal orthogonality conditions. In the locally stationary wavelet process representation (3), the notation j;k .t / actually refers to the basis (vector) element j;k t .

Wavelet Packets
Wavelet packets are an extension of wavelets whose basis functions, j;i;k .t /, depend on an additional parameter i that measures the number of oscillations of the function. The oscillation parameter i can take values ranging from 0 to 2 j 1 for each scale j D 1; 2; : : : ; J . See Wickerhauser (1994) or Percival and Walden (2000) for more details. For discrete time series representation, we make use of discrete NDWPs, defined next.
Definition 1 (Discrete NDWPs). Discrete NDWP are constructed as in (6) except that the ¹g k º and ¹h k º can both be replaced by of either ¹g k º or ¹h k º at each scale j . At each scale j where N j and N are as in (6). A wavelet packet J;i is also written in short form as .J; i /. The value of i can be obtained by constructing a binary number with 0/1 appearing at position j D 1; : : : ; J depending on whether filtering h k 2n or g k 2n is applied at stage j using either the third or the fourth equation in (7). See Example 1 for an illustration of the construction. As with wavelet vectors (earlier) the notation j;i .t k/ actually refers to the element j;i;t k . Figure 1 shows some examples of wavelet packet basis functions derived from two different mother wavelets. The second column in each row corresponds to the wavelet; the other columns correspond to other packets that offer greater oscillatory flexibility compared with just using wavelets alone.
Remark 4 (Frequency coverage). Hess-Nielsen and Wickerhauser (1996, p. 525) consider an abstract two-dimensional signal representation in which time and frequency are indicated along the horizontal and vertical axes respectively. A waveform is represented by a rectangle in this plane with its sides parallel to the time and frequency axes. . . . Let is call such a rectangle an information cell. The time and frequency of a cell can be read, for example, from the coordinates of its lower left corner. The uncertainty in time and the uncertainty in frequency are given by the respective dimensions of the cell, it does not matter whether the nominal frequency and time position is taken from the center or from a corner of a rectangle.
Both wavelets and wavelet packets can be seen to 'cover' certain portions of the time-frequency plane. At each scale j D 1; 2; : : : ; J , wavelets can be associated with the frequency interval (and the vertical axis of the timefrequency plane) of I j D .2 .j C1/ ; 2 j . Wavelet packets are associated with the interval I j;i D .2 .j C1/ i < ! Ä 2 .j C1/ .i C 1/ for j D 1; 2; : : : ; J and i D 0; 1; : : : ; 2 j 1. Note that I j D I j;1 : that is a wavelet packet with index i D 1 is equivalent to the wavelet at that scale. Of course, a packet has a time extent as well and so a wavelet packet basis, b 2 B, is a disjoint cover of the entire time-frequency plane (see Theorem 3 of Hess-Nielsen and Wickerhauser (1996)).

Basis Libraries
A basis library is a redundant set of bases from which one can be chosen to represent a data generating process. One example for locally stationary time series representation is provided by the SLEX processes from Section 1. In that example, the SLEX library redundancy concerned multiple possible segmentations of the time dimension in the time-frequency plane of those processes. The local cosine bases used in Mallat et al. (1998)  Remark 5 (Wavelet packet libraries and underlying wavelets). Any given wavelet packet basis library depends on an underlying Daubechies' mother wavelet. Hence, there are different libraries corresponding to different mother wavelets, and each of those will have different pros and cons. Different mothers could be incorporated into our scheme, but, for simplicity of presentation, we restrict ourselves to a wavelet packet basis computed with respect to a single given compactly supported Daubechies' mother wavelet. Our framework will, however, be valid for all wavelet packet libraries built from any of Daubechies' wavelets. Our computational examples consider smooth wavelet packets built from least asymmetric filters of length N D 8, or LA.8/, which are particularly well suited for time series analysis. See Figure 1 for an illustration. To establish notation, let B denote a particular basis library and jBj denote the number of bases it contains. Let jbj define the number of packets in each basis b. In the following, B will be the (non-decimated) wavelet packet library built from a given Daubechies' wavelet (Coifman and Wickerhauser, 1992;Wickerhauser, 1994;Hess-Nielsen and Wickerhauser, 1996), which includes the wavelet basis as a particular entry in the library. Our methodological aim is to devise an approach to identify such a basis that best fits the data with respect to some statistical criterion. This corresponds to a process representation where the time-frequency plane is segmented by a sequence of intervals ¹I jp ;ip º p2b , where I j;i was defined in Remark 4.
Example 1 (Wavelet packet libraries notation). For J D 2, consider the following library of wavelet packet ; 0/; .2; 1/; .2; 2/; .2; 3/º: For example, basis b c contains three packets so jb c j D 3 and we can alternatively use the compact notation b c D ¹.j p ; i p /º pD1;2;3 so that for p D 1 we have .j 1 ; i 1 / D .1; 1/, for p D 2 we have .j 2 ; i 2 / D .2; 0/ and for p D 3 we have .j 3 ; i 3 / D .2; 1/. Hence, we can equivalently refer to a wavelet packet by either the doublets .j p ; i p / or their briefer basis location index p D 1; : : : ; jbj. The basis b c is also the discrete wavelet basis (up to J D 2) since this is given in general by packets ¹.j; 1/; .J; 0/º j D1;:::;J . Figure 1 shows the wavelet packets forming the basis b d with jb d j D 4. Remark 6. In general, the size of any basis b is finite for finite T . This is because the set of all possible packets is of size T log 2 T . We typically refer to bases b of finite size. However, whenever appropriate, we specify which results refer to infinite dimensional bases b, which correspond to the limit case T ! 1.

LOCAL STATIONARITY AND WAVELET PACKET PROCESSES
This section introduces locally stationary processes constructed using a wavelet packet basis using a data-driven basis selection strategy from a library of packet bases. First, for a given fixed basis, b, we introduce the LSWP processes. Elements of the basis b are called packets and denoted by p.
Definition 2 (LSWP process). Given wavelet packet basis b 2 B, the LSWP processes are a sequence of doubly indexed stochastic processes ¹X tIT º tD0;:::;T 1 , T D 2 J 1 having the following representation in the meansquare sense where jp ;ip ;k and w jp ;ip ;kIT are respectively a collection of orthonormal random variables and amplitude coefficients with location index k D 0; : : : ; T 1 and packets .j p ; i p / for p 2 b. The set ¹ jp ;ip ;k .t /º jp ;ip contains discrete NDWPs that have support length N jp and are based on a mother wavelet .t / of compact support with length N , as earlier. Moreover, for´2 .0; 1/, there exist functions W jp ;ip .´/ that satisfies the following conditions: i There exists a sequence of constants C p such that for each p 2 b and T sup kˇw jp ;ip ;kIT where the sup is over all partitions ¹a i º of .0; 1/.
For a non-trivial theory, we require some further tools and notation. First, we define two operators that generalize the autocorrelation wavelet and associated inner product from Section 2.3 and Section 2.4 of Nason et al. (2000).
Definition 3 (Cross-correlation wavelet packets). For p; p 0 2 b, define the cross-correlation wavelet packet by the convolution: where p;k are NDWPs from Definition 1. When the convolution is taken over the same wavelet packet, that is, when p 0 D p, then ‰ p;p . / D ‰ p . / is also called autocorrelation wavelet packet. We also define A D .A p;p 0 / p;p 0 D1;:::;jbj as the inner product operator having entries The two derivations are equivalent, but the latter can be implemented in a more computationally efficient way. For finite samples (T < 1), the operator becomes a square matrix of finite dimensions jbj jbj. For both finite and infinite dimensional cases, we also define the inverse operator A 1 D .A 1 p;p 0 / p;p 0 . Conditions for the existence of this operator in both finite and infinite dimensional cases will be discussed in the following of this section.
Remark 7 (Spectra and autocovariances for LSWP processes). Analogously to the locally stationary wavelet model, we define the evolutionary wavelet packet spectra (EWPS) as S p .´/ D jW jp ;ip .´/j 2 and the marginal EWPS as N S p D R S p .´/d´. In what follows, we refer to their whole sets of values respectively as S.b/ D ¹S p .´/º p2b and N S.b/ D ¹ N S p º p2b . The time localized covariance is given by where ‰ p . / is the wavelet packet autocorrelation function from Definition 3. In the following, we also refer to the whole set of T observations from model (8) as X T . As suggested in Fryzlewicz et al. (2003), for t; s D 0; : : : ; T 1, we can approximate the entries of † D E.X T X 0 T /, as †.t; s/ D C.t=T; t s/ C O.T 1 /. Because the only unknown quantities in † are the spectral entries, we will also refer to it as † S.b/ .
Example 2 (Haar moving average (MA) packet processes). To familiarize the reader with locally stationary wavelet processes, Nason et al. (2000) introduced the Haar MA processes. Recall that the first-order Haar MA process was X 1 t D 2 1=2 . t t 1 / and the second order was X 2 t D 2 1 . t C t 1 t 2 t 3 / for t 2 Z, where ¹ t º is an i.i.d. zero mean unit variance process. These can be written in the locally stationary wavelet form in (3) by setting (for X 1 t ) S 1 .´/ D 1, 1;k D k and j;k .t / being non-decimated Haar wavelets, similarly for X 2 t and more generally X r t .

160
A. CARDINALI AND G. NASON For any given packet, p 2 b, a similar kind of MA process can be defined. For example, at scale j D 1, the process X 1 t earlier is one process and, in wavelet packet notation, its scale j p D 1 and index number i p D 2, that is, X .1;2/ t . The other packet process at scale j p D 1 is X .1;1/ t D 2 1=2 . t C t 1 /. At the second scale, there are four packets denoted X .2;i/ t for i D 0; : : : ; 3 of the same form as X 2 t earlier but with the signs of each of the coefficients (in the same order) are .C; C; C; C/; .C; C; ; /; .C; ; ; C/ and .C; ; C; /. The second in this list corresponds to the X 2 t process earlier.) From this, we can define the MA wavelet packet process by selecting a particular packet (and underlying wavelet) but using wavelet packets instead of wavelets. For an illustration of these wavelet packets derived from the Haar and least-asymmetric LA(8) wavelets at scale J D 2, see Figure 1.

Inference for a Fixed Basis
Given a fixed wavelet packet basis, b 2 B, we can use results analogous to those in Nason et al. (2000) to derive an estimator for the EWPS.
Definition 4 (Unbiased wavelet packet periodogram). For a given packet p 2 b 2 B with packet vector p , define the wavelet packet process as the empirical wavelet packet coefficients of X tIT : The quantity d p;k is a process rather than just a set of coefficients because local stationarity of X tIT is conferred onto the process d p;k through a time-invariant linear filtering. Also define the (raw) wavelet packet periodogram by I p;k D jd p;k j 2 . As in Nason et al. (2000), the raw wavelet packet periodogram is a biased estimator of the spectra since it can be proved that EI p;k D P p 0 2b A p;p 0 S p 0 .´/CO.T 1 / for p 2 b, where A p;p 0 was introduced in Definition 3. However, the estimator can be 'corrected' to make it asymptotically unbiased. Therefore, the (asymptotically) unbiased wavelet packet periodogram is defined as for p 2 b, where A 1 p;p 0 was introduced in Definition 3. From these definitions, it follows that EL p;k D S p .´/ C O.T 1 /.
Obtaining a consistent estimator of S p .´/ can be achieved using similar methods to those described in Nason et al. (2000). The R package LSWPPlib will be available on CRAN to compute the quantities defined in this section.

A Note on the Theory of Locally Stationary Wavelet Packet Processes
A number of theoretical properties of LSWP processes are based on the existence of the (bounded) inverse operator A 1 introduced in Definition 3. For example, the existence of an unbiased spectral estimator for LSWP processes directly depends on the existence of such operator as can be appreciated by looking at equation (13). Moreover, an invertible representation between the evolutionary spectra S p .´/ defined in Remark 7 and local autocovariances defined in equation (11) is only possible if this (bounded) operator exists since for all p 2 b the inverse formula of (11) is The existence of this positive definite operator and its bounded inverse when b is the wavelet basis and when . / is either the Haar or Shannon wavelet was proved in Theorem 2 of Nason et al. (2000), who also conjectured the existence of general results for all Daubechies' compactly supported wavelets. We now show that this result extends not only to all Daubechies' compactly supported wavelets but also to operators A constructed from wavelet packet bases. We use several results from Goodman et al. (1995), and we will refer to specific parts of that paper as GMRS-page number. The proof of the following theorem can be found in the appendix appearing on the Issue Information.
Theorem 1 (Boundedness of A inverse). Let b be a basis of packets. Let A D .A p;p 0 / p;p 0 2b for jbj ! 1, where A p;p 0 is defined in equation (10). Furthermore, let b such that not all packets belong to the same scale, that is, j p ¤ j p 0 , for some p; p 0 2 b. Then, the inverse of the semi-infinite A operator for wavelet packets is bounded.

LOCALLY STATIONARY WAVELET PACKET PROCESSES BASIS SELECTION
In the previous section, the 'true' wavelet packet basis is assumed known. However, in practice, the basis is not known and the goal is to find, at least, a good basis. Previous work with adaptive representations in signal processing, for example, Coifman and Wickerhauser (1992) and time series, for example, Ombao et al. (2001) or Mallat et al. (1998), for function and process representation has concentrated on using basis libraries, that is, libraries of orthonormal`2 bases. These studies concentrated on selecting the best basis, O b 2 B where 'best' can have several different meanings; see Percival and Walden (2000, Section 6.3 p. 221) for a nice example.
Strictly speaking, the basis concept is identified with decimated wavelet packets: for non-decimated wavelets, the equivalent collection of packets is termed a frame -which, mathematically, has the same representative power but within an overdetermined system and so not technically a basis. To simplify our exposition, we will keep using the notion of basis even if we will be referring to NDWP frames derived from the associated decimated wavelet packet basis. More details on frames can be found in Mallat (2009).
Given an appropriate objective function to be optimized, our goal is to reconstruct the, possibly sparse, true representation from a dictionary of`2 frames, that is, a collection of linearly independent vectors that are almost (but not exactly) orthogonal; see Daubechies (1992) for more details.
This task turns out to be significantly harder than selecting from a dictionary of orthogonal bases. In fact, representations based on`2 frames account for a significant number of redundant and correlated coefficients; therefore, it is crucial to understand how to make good use of these. In our setup, the main challenge is therefore the derivation of an appropriate objective function that can ensure good model fitting and the derivation of appropriate cost functionals that can be associated with each packet to ensure successful optimization/basis selection.

Suggestion: Cost Functionals Based on Profile Likelihood
Inference for locally stationary time series for a fixed (Fourier) basis has been the object of a number of papers such as Dahlhaus (1996bDahlhaus ( , 1997. However, from the point of view of theoretical inference, the problem of finding an adaptive frequency tiling can be seen as the problem of estimating a number of unknown packet indices p 2 b, which are the parameters of interest, given the presence of nuisance parameters (the level of the spectra for p 2 b). Profile likelihood provides a common approach to inference in the presence of nuisance parameters. The use of profile likelihoods for semi-parametric models was discussed in Kauermann (2002), where it was established that, as in classical parametric models, profiling leads to systematic bias. For locally stationary processes built on Gaussian innovations, the negative log-likelihood based on representation (8) can be written as follows.
Proposition 1. Let X tIT be defined as in (8) and having Gaussian innovations. Then, the negative log-likelihood for a basis b 2 B is proportional to where L p;t is the (asymptotically) unbiased wavelet packet periodogram as defined in (13).
For LSWP processes, the parameters of interest for selecting a basis are the packet indices p 2 b, where b is a basis (or, more precisely, an NDWP frame). Here, the nuisance parameters are the vectors S.b/ of the spectra associated with each frame. A profile log-likelihood for b can therefore be derived based on (15) so that we have the following result.
Proposition 2. Let X tIT be defined as in (8) having Gaussian negative log-likelihood proportional to (15). Define log L p D T 1 P t log L p;t , where L p;t is defined by (13). Furthermore, define the negative profile loglikelihood for b 2 B as e L T .b/. Then, we have 1.
These results show that the profile (negative) log-likelihood is a negatively biased estimator for the negative loglikelihood. Moreover, Proposition 2 shows that the profile likelihood is characterized by a nonlinear relationship with respect to the nuisance parameters estimates, which are also strongly correlated at fine scales. Simulation experiments confirm that the basis selection based on the optimization of the profile likelihood leads to large systematic errors in reconstructing known bases. Since the biased profile likelihood and its nonlinear dependence with respect to nuisance parameter estimates leads to poor basis selection, we consider an alternative approach aimed to improve the basis selection by removing the aforementioned nonlinearity. This is achieved by considering an alternative objective function and alternative cost functionals.

Cost Functionals for Penalized Least Squares
The alternative approach that we adopt to overcome the difficulties of working in this highly irregular and nonlinear setting is based on the use of an objective function that is still biased with respect to the log-likelihood but is now linearly related to the nuisance parameter estimates. Among several possible alternatives, we will consider the objective function where, for p D .
for some a 2 .0; 1/ and b 2 B. The functional form of (17) is based on two main arguments. First, we note that P J jp˛p a jp D 1, so one component of the weights is self-normalizing. This component is then multiplied by a jp , which compensates for the increasing dependence among N L p for different packets at finer scales. It is worth noting that for˛p D 1 we have that E e L 2;T .b/ D 0:5.2 J C N 2 T / where N 2 T D R C.´; 0/d´is the marginal variance for the LSWP process. Since 2 J is constant for all b 2 B, the maximization of (16) coincides with the maximization of the fitted marginal variance, and therefore, this approach to basis selection can be interpreted as using penalized marginal least squares. This is because by maximizing the variance of the fitted model we minimize the residual variance not captured by the selected basis and the corresponding spectral estimates. Conditionally on the weights p , e L 2;T is a consistent estimator for L 2;T ¹b; S.b/º D 0:5.2 J C P p˛p N S p /. This is a consequence of the consistency for the marginal periodogram ¹L p º p as estimator of the marginal spectra ¹S p º p . Under the assumptions of Definition 2, this latter proof is a direct consequence of Theorem 3 from Cardinali and Nason (2010) that proved the same result for locally stationary wavelet processes.

Basis Selection
The optimization of (16) necessary to implement basis selection of LSWP needs to be carried out over nonoverlapping tiling of the frequency interval .0; 1=2. This can still be achieved by using the best basis algorithm where, for each packet, we will consider the cost functionals ˛p N L p . Here, negative signs are used because the best basis algorithm will instead minimize e L 2;T .b/ over b 2 B. Our algorithm for basis selection is very fast and is based on the following steps: 1. Calculate N L .jp ;ip / for j p D 1 and i p D 1; 2; : : : ; 2 jp . This is carried out by calculating a bias matrix A for the scale j p D 1 and then calculating the unbiased periodogram; 2. Repeat step 1 for scales j p D 2; 3; : : : ; J , where J is the maximum level of the NDWP transform that we consider in the analysis; 3. Calculate the whole set of weights˛. jp ;ip / from (17); 4. Apply the best basis algorithm to the set of cost functionals as ˛. jp ;ip / N L .jp ;ip / ; min ® e L 2;T .b/¯, where e L 2;T .b/ is defined in (16).

Practical Advice for Parameter Settings
Our basis selection method based on penalized least squares requires three parameters to be set. The first choice to be made is which (mother) wavelet filter to use to build the wavelet packets library. We mainly refer to Daubechies filters here and have used both LA.8/ and Haar wavelets in our simulations. Generally, LA.8/ filters performed better than Haar filters in our experiments; therefore, we recommend their use. Further work would be required to investigate the performance of other wavelets and provide recommendations for their use in specific situations. The second parameter that needs to be set is the depth of the wavelet packet library, represented by the maximum scale J . This parameter should be selected by taking into account the wavelet filter used to build the library. Filters of greater length allow smaller values of J to be set. If N represents the filter length, then a recommended choice for the library depth is to set J D OElog 2 T OElog 2 N g, where OEx is the integer part of x and g D 3 is an integer that reduces the computational boundary of J further. This is to avoid overfitting due to large positive correlation of wavelet packet coefficients at large scales. In our examples, we used this setting but even for g D 2 we obtained good results.
The third parameter to be set is the (penalty) rate of a j from equation (17). Penalties are parameterized by a geometric progression that allows us to interpret the penalty in terms of compensation for increasing positive correlation of the wavelet packet periodogram coefficients at large scales j . We mainly used a penalty of rate a D 0:98, but our experiments suggest that values larger than 0.95 perform similarly. It should be also noted that the penalty rate should be set in regard to the wavelet filter used and, for Haar filters, setting a D 0:95 provides better results than larger values. Our experiments show that this rate should be proportional to the filter length N : the larger is the latter, the larger should be the rate. This implies that cost functionals at large scales should be penalized more for shorter filters. The need of larger penalization for Haar filters seems also because of the frequency leakage that characterizes filters with shorter length.

SIMULATION EXAMPLES
This section simulates several LSWP processes and empirically evaluates our basis selection methodology. We simulate processes using representation (8) with w jp ;ip ;kIT D S 1=2 p;k for k D 0; : : : ; T 1 and independent draws jp ;ip ;k from the standard Gaussian distribution. We consider both stationary and locally stationary processes with fixed bases and energy distributions and compare the estimated bases with the truth.
To evaluate our fits, we derive a measure of the chance of correctly selecting increasing proportions of true packets in the estimated basis. We aim to construct a measure that accounts for the different portions of the spectra that are represented by each packet within a given basis, so we define jI p j as the length of the frequency intervals associated with a generic packet p and defined by I j;i from Remark 4. Therefore, the 'portion' of the true spectrum b that is correctly fitted by O b can be expressed as where I.A/ is the usual indicator function. Hence, I.¹ O p 2 bº/ is one if O p is contained within the true basis b and zero otherwise. Hence, the quantity jI O b;b j is larger when more of the estimated packets are contained within the true basis. Indeed, if the estimated basis is equal to the true basis, then the complete frequency interval .0; 1=2 is covered and the portion is one. (Since all the separate jI O p j add up to 1=2 and then multiplying by two gives a portion of 1.) When we run M separate simulations, indexed by m D 1; 2; : : : ; M , the portion for simulation m will be written as jI O bm;b j 2 OE0; 1. Define to be the empirical proportion of bases that correctly fit at least 100q% of the true spectra for q 2 .0; 1/, b 2 B and M some positive integer. For clear axes labels in figures, in the succeeding sections, w we suppress the dependence of b; M on the R M;b .q/, but it should be remembered that R.q/ depends on M and b. We next exhibit our simulation results on six different process types labelled LSWP 1 to LSWP 6 . Here LSWP 1 ; LSWP 2 ; LSWP 3 and LSWP 5 are stationary and LSWP 4 and LSWP 6 are locally stationary. In all cases, M D 1000.

Simulating Stationary LSWP 1 Processes
Example LSWP 1 uses the basis b a D ¹. 1; 0/; .4; 8/; .4; 9/; .4; 10/; .4; 11/; .4; 12/; .4; 13/; .4; 14/; .4; 15/º: (20) The frequency design implied by this basis gives high resolution to the highest frequencies and minimum resolution to the first half of the spectra. The EWPS for our simulation is This spectrum does not depend on´, the rescaled time, and hence, the process specified by (8) is stationary and the amplitudes w jp ;ip ;k S 1=2 p . Figure 2 shows a single realization and the marginal spectra for this process.
The frequency tiling implied by b b is the opposite to that of the process LSWP 1 , as it gives the highest resolution to the lower frequencies and minimum resolution to the second half of the spectra. A single realization and the marginal spectra for this process are illustrated in Figure 4. Figure 5 displays the survival probabilities for the LSWP 2 model and the conclusions are broadly the same as for the LSWP 1 process discussed earlier. transform spectral resolution, which increases dyadically moving towards lower frequencies. This property is commonly known as adaptivity of the wavelet transform and implies that noisier signal components are averaged over a wider frequency band, and those bands decrease for less noisy components. Figures 7 and 8 show the goodness of fit for different sample sizes. The figures provide empirical evidence that our procedure is consistent and is also invariant to the specification of the time-varying energy distribution since the shape of the curve R.q/ is very similar for the two processes.

Simulating LSWP 5 and LSWP 6 Processes
Examples LSWP 5 and LSWP 6 both use the basis b d D ¹.1; 0/; .2; 2/; .3; 6/; .4; 14/; .4; 15/º: The frequency tiling implied by this basis is still an example of smooth change in frequency resolution, but corresponds to the opposite design in comparison with that of the discrete wavelets earlier. In fact, now the frequency resolution is low at low frequencies and then increases for higher frequencies. Analogously to the previous section, we will now consider both examples of stationary and time-varying energy distributions with spectra given by LSWP 6 ! S p D 2 2jp 7 cos 2 .4 ´/ I.p 2 b d /: Single realizations from these processes and their (identical) marginal spectra are shown in Figure 9. The performance of our estimator is again summarized by the plots in Figures 10 and 11. The results indicate that our procedure is consistent also for these examples and is also invariant to the specification of the time-varying energy distribution since the shape of the curve R.q/ is basically identical for the two processes. Overall, we can estimate about 70% of the true basis with high probability and we have just below 50% chance to achieve about 80% fit.
7. TIME-FREQUENCY ANALYSIS OF STANDARD AND POOR'S 500 LOG-RETURNS Figure 12 shows a series of 1024 S&P 500 log-returns from the period November 1999 to July 2002. These data have been widely analysed in many different ways and GARCH models have been proposed by several studies. Our simulations earlier indicated that LSWP models can reproduce heteroscedasticity even from a stationary specification. In particular, LSWP 5 realizations show that this can occur when the estimated marginal spectra account for high-frequency resolution at highest frequencies. In these situations, the intensity of heteroscedasticity is positively correlated with the energy level of the marginal spectra. In a similar setup, simulations of the LSWP 6 process have also shown that time-varying energy can only reinforce or mitigate the intensity of heteroscedasticity.
It is exciting that this basis is not a wavelet nor close to a Fourier basis. So using wavelet packets has really made a difference here. In particular, the basis selection indicates that it is probably wasteful to use a traditional Fourier spectral analysis that has too fine frequency resolution at the lower frequencies.
Our fitted basis accounts for higher resolution and (relatively) low energy at higher frequencies, and this seems the characteristic time-frequency scenario for financial returns in an efficient market. The estimated marginal spectra is illustrated in Figure 13. We have evidence that the lower frequencies account for (relatively) higher energy than higher frequencies. We are therefore in an intermediate situation with respect to the simulated scenarios.
Remark 8 (Robustness of basis selection). An important question is 'how robust is our basis selection to variations in the parameter settings discussed in Section 5.4?' We conducted our analysis using wavelet packets built from Daubechies LA.8/ filters since these performed better than Haar filters in our simulations.
However, the results presented for the S&P data have also been obtained using the other recommended settings listed in Section 5.4. We have repeated our analysis using different values for both the library depth J and the penalty rate a. When considering J D 5, only one packet from O b SP was omitted and was replaced by two child packets from scale j D 5. The missing packet corresponds to the high-frequency end of the spectra, therefore to the more noisy data component. For J D 6, we have the same basis than that selected for J D 5. For all those values of J , we were able to select the packet corresponding to j p D 1, which in our basis corresponds to (low) frequencies in the interval .0; 1=4. As for the penalty rate a from equation (17), we have repeated our analysis for rates 0:95 < a < 0:98 and have observed the same results in each of the cases J D 4; 5; 6.
Remark 9 (Interpreting evidence from time-frequency analysis). A wavelet packets basis represents the oscillatory components accounting for the most energy in a time series. The associated frequency intervals provide twofold information both on a range of frequencies/periods characterizing each component and on their length (which is inversely proportional to the precision of measuring the evolution of the (time-varying) amplitude and variance). Therefore, wide frequency intervals correspond to more precise measurement of time-varying features and should be of particular interest in applications focusing on precise measurements in the time domain such as change-point analysis.  Figure 14 shows the asymptotically unbiased wavelet packet periodogram from (13). Each line in Figure 14 (top to bottom) corresponds to the elements (left to right) in the selected wavelet packet basis given in (27). The estimated basis O b SP is characterized by a wide packet at low frequencies I 1;0 D .0; 1=4/ corresponding to oscillatory dynamics with period larger than 4 days (since we analysed daily data). The relatively high energy associated with this wavelet packet suggests that this asset could be of interest to investors looking for weekly (or longer) investment horizons. The time-frequency periodogram does not show evidence of strong time-varying effects apart from two packets that correspond to lower and medium frequencies. Those packets do not belong to the classical discrete wavelet transform tiling; therefore, the identification of the time-varying effects could be compromised if a non-adaptive tiling (such as that from the discrete wavelet transform) was imposed. The overall limited presence of time-varying effects (apart the aforementioned episodes around February 2001) seems to reflect the prevalent non-transitory nature of the heteroscedasticity that characterizes this particular dataset. The two changes identified at the beginning of 2001 seems to mark the US recession (and the US stock market drop) that started in that trimester.

CONCLUSIONS AND FURTHER WORK
This article introduces LSWP and shows how they can be used for modelling and analysis of locally stationary time series. Unlike other locally stationary models based on orthonormal`2 bases, the LSWP model includes finite sample stationary processes as particular cases. Furthermore, the LSWP family provides a very flexible framework for analysing locally stationary time series by allowing the most important periodic components to be selected by a data-driven criterion. However, basis selection for LSWP is a difficult problem: the`2 frame design of NDWP potentially introduces over-parametrization and correlation in a standard best-basis selection. We are able to derive a modified best-basis selection that uses functionals of the unbiased periodogram.
We also use a penalty to ensure that a sparse basis is selected and to compensate for the leakage that affects NDWP designs, especially at fine scales. We show with a number of prototype simulation examples that LSWP can represent a large variety of empirical features and provide a novel framework for analysing heteroscedastic time series. Our simulations show that, by using different designs for energy distribution and frequency resolution, we can represent many heteroscedastic features by using little or no time-varying spectral coefficients. Our empirical analysis based on the S&P 500 returns confirms this finding and, in particular, that heteroscedastic time series are better represented by a frequency tiling that is different from the one implied by the classical discrete wavelet transform. These findings can lead to significant improvements in the accuracy of in-sample and out-of-sample analysis of locally stationary time series.