Estimating nonlinear additive models with nonstationarities and correlated errors

In this paper, we study a nonparametric additive regression model suitable for a wide range of time series applications. Our model includes a periodic component, a deterministic time trend, various component functions of stochastic explanatory variables, and an AR(p) error process that accounts for serial correlation in the regression error. We propose an estimation procedure for the nonparametric component functions and the parameters of the error process based on smooth backfitting and quasi-maximum likelihood methods. Our theory establishes convergence rates and the asymptotic normality of our estimators. Moreover, we are able to derive an oracle-type result for the estimators of the AR parameters: Under fairly mild conditions, the limiting distribution of our parameter estimators is the same as when the nonparametric component functions are known. Finally, we illustrate our estimation procedure by applying it to a sample of climate and ozone data collected on the Antarctic Peninsula.


INTRODUCTION
In many time series applications, the data at hand exhibit seasonal fluctuations and a trending behavior. A common way to incorporate these features is to assume that the data generating process has the form

Y_{t,T} = m_θ(t) + m_0(t∕T) + m(X_t) + ε_t    (1)

for t = 1, …, T with E[ε_t | X_t] = 0. Here, m_θ is a periodic function with a known integer period θ and m_0 is a deterministic time trend. The stochastic component consists of the residual ε_t and of the term m(X_t) that captures the influence of the d-dimensional stationary covariate vector X_t = (X_t^1, …, X_t^d). We do not impose any parametric restrictions on the component functions m_θ, m_0, and m. Moreover, we allow for correlation in the error terms ε_t, which are modeled as a stationary AR(p) process. Note that, as usual in nonparametric regression, the time argument of the trend function m_0 is rescaled to the unit interval.

(The copyright line for this article was changed on 16 August 2018 after original online publication.)
Two special cases of model (1) have been considered in the literature. The fixed design setting Y_{t,T} = m_0(t∕T) + ε_t was analyzed, for example, in the works of Truong (1991), Altman (1993), Hall and van Keilegom (2003), and Shao and Yang (2011), who provided a variety of methods to estimate the nonparametric trend function m_0 and the AR parameters of the error term. Interestingly, they established an oracle-type result for the estimation of the AR parameters. In particular, they showed that the limiting distribution of the parameter estimators is unaffected by the need to estimate the nonparametric function m_0. The second special case of model (1) is the setting Y_t = m(X_t) + ε_t. The problem of estimating the AR parameters in this setup has been studied under the restriction that {X_t} is independent of the error process {ε_t}. Truong and Stone (1994), Schick (1994), and Lin, Pourahmadi, and Schick (1999) showed that, under this restriction, an oracle-type result for the parameter estimators holds analogous to that in the fixed design setting.
In this paper, we study the estimation of the parametric and nonparametric components in the general model (1). We allow {X_t} and {ε_t} to be dependent, thus dispensing with the very restrictive assumption that the covariate process is independent of the errors. In order to circumvent the well-known curse of dimensionality, we assume the function m to be additive with component functions m_j for j = 1, …, d, thus yielding

Y_{t,T} = m_θ(t) + m_0(t∕T) + m_1(X_t^1) + · · · + m_d(X_t^d) + ε_t.    (2)

A full description of model (2) together with a discussion of its components is given in Section 2. Our estimation procedure is introduced in Section 3. The nonparametric components m_θ and m_0, …, m_d are estimated by extending the smooth backfitting approach of Mammen, Linton, and Nielsen (1999), who derived its asymptotic properties in a strictly stationary setup. Due to the inclusion of the periodic and the deterministic trend components, our model dynamics are no longer stationary. In Sections 3.1 and 3.2, we describe how to incorporate this type of nonstationarity into the smooth backfitting procedure. Given our estimators m̂_θ and m̂_0, …, m̂_d of the functions m_θ and m_0, …, m_d, we can construct approximate expressions ε̃_t of ε_t. Using these, the parameters of the AR(p) error process are estimated via a quasi-maximum likelihood-based method, the details of which are given in Section 3.3.
Section 4 contains our results on the asymptotic properties of our estimators. In Sections 4.2 and 4.3, we provide the convergence rates of the nonparametric estimators m̂_θ and m̂_0, …, m̂_d, as well as their Gaussian limit distribution. The asymptotic behavior of the parameter estimators of the AR(p) error process is studied in Section 4.4. There, we show that the parameter estimators are asymptotically normal. Deriving the limit distribution of the parameter estimators is by far the most difficult part of the theory developed in this paper. To do so, we need to establish a higher-order stochastic expansion of the first derivative of the likelihood function. This requires substantially different and much more intricate techniques than those in the analysis of the special cases previously discussed in the literature.
As we will see, the asymptotic distribution of our parameter estimators in general differs from that of the oracle estimators constructed under the assumption that the additive component functions are known. Thus, the additional uncertainty that stems from estimating the component functions does have an impact on the asymptotic distribution of our parameter estimators in general. However, an oracle-type result can be established under additional conditions on the dependence structure between the covariates X_t and the errors ε_t. In particular, the limit distribution of our parameter estimators coincides with that of the oracle estimators if we additionally assume that E[ε_t | X_{t+k}] = 0 for all k = −p, …, p. This assumption is evidently much weaker than imposing independence between {X_t} and {ε_t} as in the simpler settings discussed above. Our theory thus generalizes the previously found oracle-type results.
We illustrate our estimation procedure by applying it to monthly minimum temperature and ozone data from the Faraday/Vernadsky research station on the Antarctic Peninsula in Section 5. A useful feature of this application is that Hughes, Subba Rao, and Subba Rao (2007) used a parametric regression setup with AR errors to analyze the same data. Hence, our analysis can be regarded as a semiparametric extension of their study, and we can get an impression of what is gained by using our more flexible specification in this setting.

MODEL
Before we introduce our estimation procedure, we take a closer look at model (2) and comment on some of its features. We observe a sample of data {(Y_{t,T}, X_t) : t = 1, …, T}, where Y_{t,T} are real-valued random variables and X_t = (X_t^1, …, X_t^d) are R^d-valued random vectors that form a strictly stationary process. As already noted in the introduction, the data are assumed to satisfy the model equation

Y_{t,T} = m_θ(t) + m_0(t∕T) + m_1(X_t^1) + · · · + m_d(X_t^d) + ε_t    (3)

with E[ε_t | X_t] = 0, where m_θ is a periodic component with some integer-valued period θ, m_0 is a deterministic trend, and m_j are nonparametric functions of the regressors X_t^j for j = 1, …, d.
For simplicity, we assume that the period θ is known. Methods to estimate θ can, for example, be found in the work of Vogt and Linton (2014). By including the periodic component m_θ and the deterministic trend m_0, the dynamics of Y_{t,T} depend on time and are thus nonstationary. The errors {ε_t} follow a stationary AR(p) process of the form

ε_t = φ*_1 ε_{t−1} + · · · + φ*_p ε_{t−p} + η_t,

where φ* = (φ*_1, …, φ*_p) is the vector of parameters, the AR order p is assumed to be known, and the residuals η_t form a martingale difference sequence with respect to the filtration F_t = σ(X_t, X_{t−1}, …, η_{t−1}, η_{t−2}, …).
The additive functions in model (3) are only identified up to an additive constant. To identify them, we assume that the constant is absorbed into the periodic component and the remaining components have zero mean, that is,

∫_0^1 m_j(x_j) p_j(x_j) dx_j = 0   for j = 0, …, d,    (4)

where p_j is the marginal density of X_t^j for j = 1, …, d and p_0(x_0) = I(x_0 ∈ [0, 1]) is the uniform density of rescaled time. The covariates X_t^j are assumed to take values in a bounded interval that, without loss of generality, is taken to be [0, 1] for each j = 1, …, d. Throughout this paper, the symbol x_0 is used to denote a point in rescaled time. Moreover, we write x = (x_0, x_1, …, x_d).

To be able to do reasonable asymptotics, we let the trend function m_0 in model (3) depend on rescaled time t∕T rather than on real time t. If we defined m_0 in terms of real time, we would not get additional information on the structure of m_0 locally around a fixed time point t as the sample size increases. Within the framework of rescaled time, in contrast, the function m_0 is observed on a finer and finer grid of rescaled time points on the unit interval as T grows. Thus, we obtain more and more information on the local structure of m_0 around each point in rescaled time. This is the reason why we can make reasonable asymptotic considerations within this framework.
In contrast to m_0, we let the periodic component m_θ in model (3) be a function of real time t. This allows us to exploit its periodic character when doing asymptotics: Assume that we want to estimate m_θ at a time point t* ∈ {1, …, θ}. As m_θ is periodic, it has the same value at t*, t* + θ, t* + 2θ, t* + 3θ, and so on. Hence, if m_θ depends on real time t, the number of time points in our sample at which m_θ takes the value m_θ(t*) increases as the sample size grows. This gives us more and more information about the value m_θ(t*) and thus allows us to do asymptotics.
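To make the structure of the model concrete, the following sketch simulates data from a specification of the form (3) with a single covariate. The functional forms, the period θ = 12, and the AR(1) coefficient are illustrative choices of ours, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, theta, phi = 1200, 12, 0.5  # sample size, period, AR(1) coefficient (illustrative)

t = np.arange(1, T + 1)
m_per = np.sin(2 * np.pi * t / theta)   # periodic component with period theta
m0 = 2 * (t / T - 0.5) ** 2 - 1 / 6     # trend on rescaled time, integrates to 0 on [0, 1]
X = rng.uniform(0, 1, T)                # strictly stationary covariate
m1 = np.cos(2 * np.pi * X)              # zero-mean component function of the covariate

# AR(1) errors eps_t = phi * eps_{t-1} + eta_t with i.i.d. innovations eta_t
eta = rng.standard_normal(T)
eps = np.zeros(T)
for s in range(1, T):
    eps[s] = phi * eps[s - 1] + eta[s]

Y = m_per + m0 + m1 + eps               # observed series Y_{t,T}
```

Note that the trend and covariate components are chosen so that the normalization (4) holds: the trend integrates to zero over the unit interval, and the covariate function has zero mean under the uniform design density.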

ESTIMATION PROCEDURE
We now describe how the various components of model (3) are estimated. Our procedure consists of three steps. In the first step, the periodic model component m_θ is estimated. The estimation of the nonparametric functions m_0, …, m_d is addressed in the second step. Finally, we use the estimators of the additive component functions to construct estimators of the AR parameters.

Estimation of m_θ
For any time point t = 1, …, T, let t′ = t − ⌊(t − 1)∕θ⌋ θ, with ⌊x⌋ denoting the largest integer that is smaller than or equal to x. Our estimator of the periodic component m_θ is defined as

m̂_θ(t) = (1∕K_{t′,T}) ∑_{k=0}^{K_{t′,T}−1} Y_{t′+kθ,T},    (5)

where K_{t′,T} = 1 + ⌊(T − t′)∕θ⌋ is the number of observations whose time index s satisfies s = t′ + kθ for some k ∈ ℕ_0. The estimator has a very simple structure. It is the empirical mean of observations that are separated by a multiple of θ time points. Later on, we will show that m̂_θ is asymptotically normal. Note that this result is robust to the presence of the deterministic trend function m_0. In particular, we will see that the effect of the unknown time trend m_0 on the estimator m̂_θ can be asymptotically neglected.
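The averaging in (5) is simple to implement. The sketch below is ours (the function name and the test data are illustrative): for each t′ = 1, …, θ, it averages every θ-th observation of the series.

```python
import numpy as np

def estimate_periodic(Y, theta):
    """Empirical mean of Y_{t'}, Y_{t'+theta}, Y_{t'+2*theta}, ... for each
    t' = 1, ..., theta, as in estimator (5); the additive constant of the
    model is absorbed into the periodic component."""
    T = len(Y)
    m_hat = np.empty(theta)
    for tp in range(1, theta + 1):           # t' = 1, ..., theta
        idx = np.arange(tp - 1, T, theta)    # 0-based indices t'-1, t'-1+theta, ...
        m_hat[tp - 1] = Y[idx].mean()        # K_{t',T} = len(idx) observations
    return m_hat

# illustrative check: a periodic signal plus noise
rng = np.random.default_rng(0)
theta, T = 4, 4000
m_true = np.array([1.0, -1.0, 2.0, -2.0])
Y = np.tile(m_true, T // theta) + rng.standard_normal(T)
m_hat = estimate_periodic(Y, theta)
```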
Estimation of m_0, …, m_d

In order to estimate the functions m_0, …, m_d in (6), we extend the smooth backfitting approach of Mammen et al. (1999), who derived the asymptotic properties of this approach in a strictly stationary setup. Model (6) involves a deterministic time-trend component, which makes the model dynamics nonstationary. In what follows, we describe how to modify the smooth backfitting procedure to allow for this type of nonstationarity.
To do so, we first introduce the auxiliary estimators q̂ and m̃. They are based on modified kernel weights K_h(v, w) = K_h(v − w) ∕ ∫_0^1 K_h(s − w) ds, where K_h(v) = h^{−1} K(v∕h), h is a bandwidth, and the kernel function K(·) integrates to one. These weights have the property that ∫_0^1 K_h(v, w) dv = 1 for all w ∈ [0, 1], which is needed to derive the asymptotic results for the backfitting estimators.
Given the smoothers q̂ and m̃, we define the smooth backfitting estimators m̂_0, …, m̂_d as the minimizers of the criterion

∫ ( m̃(x) − g(x) )² q̂(x) dx,    (7)

where the minimization runs over all additive functions g(x) = g_0(x_0) + · · · + g_d(x_d) whose components satisfy ∫_0^1 g_j(x_j) p̂_j(x_j) dx_j = 0 for j = 0, …, d. Here, p̂_j is a kernel estimator of p_j for j = 0, …, d, where we define p_0(x_0) = I(x_0 ∈ [0, 1]). Explicit expressions for these estimators are given below in (9) and (12).
According to the definition in (7), the backfitting estimator m̂ = m̂_0 + · · · + m̂_d is an L_2-projection of the (d + 1)-dimensional Nadaraya-Watson smoother m̃ onto the space of additive functions with respect to the density q̂. Rescaled time is treated as an additional component in this projection. In particular, note that q̂ estimates the product of a uniform density over [0, 1] and the density p of the regressors X_t. This shows that rescaled time is treated in a similar way to an additional stochastic regressor that is uniformly distributed over [0, 1] and independent of the variables X_t. The heuristic idea behind this is the following. Firstly, as the variables X_t are strictly stationary, their distribution is time-invariant. In this sense, their stochastic behavior is independent of rescaled time t∕T. Thus, rescaled time behaves similarly to an additional stochastic variable that is independent of X_t. Secondly, as the points t∕T are evenly spaced over the unit interval, a variable with a uniform distribution closely replicates the pattern of rescaled time.
By differentiation, we can show that the solution to the projection problem (7) is characterized by the system of integral equations

m̂_j(x_j) = m̃_j(x_j) − ∑_{k≠j} ∫_0^1 m̂_k(x_k) ( p̂_{j,k}(x_j, x_k) ∕ p̂_j(x_j) ) dx_k − m̂_c   for j = 0, …, d.    (8)

As we do not observe the variables Z_{t,T} = Y_{t,T} − m_θ(t), we define the kernel estimators in (8) in terms of the approximations Ẑ_{t,T} = Y_{t,T} − m̂_θ(t). In particular, we let

p̂_j(x_j) = (1∕T) ∑_{t=1}^T K_h(x_j, X_t^j),   p̂_{j,k}(x_j, x_k) = (1∕T) ∑_{t=1}^T K_h(x_j, X_t^j) K_h(x_k, X_t^k),   m̃_j(x_j) = (1∕T) ∑_{t=1}^T K_h(x_j, X_t^j) Ẑ_{t,T} ∕ p̂_j(x_j)    (9)

for j, k = 1, …, d with j ≠ k, where p̂_j is the one-dimensional kernel density estimator of the marginal density p_j of X_t^j, p̂_{j,k} is the two-dimensional kernel density estimator of the joint density p_{j,k} of (X_t^j, X_t^k), and m̃_j is a one-dimensional Nadaraya-Watson smoother. Moreover, for k = 1, …, d,

p̂_0(x_0) = (1∕T) ∑_{t=1}^T K_h(x_0, t∕T),   p̂_{0,k}(x_0, x_k) = (1∕T) ∑_{t=1}^T K_h(x_0, t∕T) K_h(x_k, X_t^k),   m̃_0(x_0) = (1∕T) ∑_{t=1}^T K_h(x_0, t∕T) Ẑ_{t,T} ∕ p̂_0(x_0),    (12)

and m̂_c = (1∕T) ∑_{t=1}^T Ẑ_{t,T}. Note that it would be more natural to define p̂_0(x_0) = I(x_0 ∈ [0, 1]), as we already know the "true density" of rescaled time. However, for technical reasons, we work with the kernel-based estimator p̂_0 given in (12). This creates a behavior of the estimator p̂_0 in the boundary region of the support [0, 1] analogous to that of p̂_j at the boundary. Alternatively, we could define p̂_0 by replacing the average over the time points t∕T in (12) by the corresponding integral over the support of the kernel function K. Moreover, we could set p̂_{0,k}(x_0, x_k) = p̂_0(x_0) p̂_k(x_k), thereby exploiting the "independence" of rescaled time and the other regressors.
In our theoretical analysis, we work with the smooth backfitting estimators characterized as the solution to the system of integral equations (8). In general, however, the system of equations (8) cannot be solved analytically. Nevertheless, the solution can be approximated by a backfitting algorithm that converges for arbitrary starting values. The algorithm can be summarized as follows. Let m̂_j^[0] be starting values for j = 0, …, d. In the rth iteration step, one cycles through all the components j = 0, …, d and computes

m̂_j^[r](x_j) = m̃_j(x_j) − ∑_{k<j} ∫_0^1 m̂_k^[r](x_k) ( p̂_{j,k}(x_j, x_k) ∕ p̂_j(x_j) ) dx_k − ∑_{k>j} ∫_0^1 m̂_k^[r−1](x_k) ( p̂_{j,k}(x_j, x_k) ∕ p̂_j(x_j) ) dx_k − m̂_c

for each j. In the work of Mammen et al. (1999), the asymptotic properties of this algorithm are established under very general conditions that are implied by our assumptions in Section 4.1. Under these general conditions, it can be shown that, with probability tending to 1, the L_2-distance between m̂^[r] and m̂ is bounded by c γ^{2r} with some constants 0 < γ < 1 and c > 0. Hence, m̂^[r] converges to m̂ at the order O(γ^{2r}) in an L_2-sense.

In practice, the backfitting algorithm is iterated until some convergence criterion is satisfied. In the empirical part of this paper, we work with the following stopping rule. The algorithm terminates either after a maximum of 50 iterations or if, for all directions j = 0, …, d, the relative change ∫_0^1 ( m̂_j^[r](x_j) − m̂_j^[r−1](x_j) )² dx_j ∕ ∫_0^1 ( m̂_j^[r−1](x_j) )² dx_j falls below a tolerance, which is a small number that is set to 10^{−5} in our code.

Our smooth backfitting estimators are based on Nadaraya-Watson pilot smoothers. Alternatively, local linear smoothers could be used. Throughout this paper, we restrict attention to Nadaraya-Watson-based smooth backfitting because the derivations and the notation would become even more involved in the local linear case. From a theoretical point of view, local linear smoothers have the advantage that they are design adaptive and, thus, do not suffer from boundary problems. As shown in the work of Mammen et al. (1999) in the strictly stationary case, this advantage carries over to the local linear-based smooth backfitting estimators; see theorem 4′ in the work of Mammen et al. (1999) and the discussion thereafter.
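The iteration above can be sketched numerically. The following is a simplified illustration of ours, not the paper's implementation: it uses rescaled time as one direction and a single covariate, discretizes all integrals on a grid, and applies the boundary-normalized kernel weights and the stopping rule with tolerance 10^{−5}. Grid size, bandwidth, and data-generating choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
T, G, h = 500, 40, 0.1                       # sample size, grid size, bandwidth
grid = (np.arange(G) + 0.5) / G              # evaluation grid on [0, 1]

def weights(data):
    """Epanechnikov weights K_h(v, w), normalized so ∫ K_h(v, w) dv = 1."""
    u = (grid[:, None] - data[None, :]) / h
    raw = np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0) / h
    return raw / (raw.sum(axis=0, keepdims=True) / G)

# simulated data: Z_t = m_0(t/T) + m_1(X_t) + noise
t_resc = np.arange(1, T + 1) / T
X = rng.uniform(0, 1, T)
Z = (2 * (t_resc - 0.5) ** 2 - 1 / 6) + np.cos(2 * np.pi * X) \
    + 0.3 * rng.standard_normal(T)

W = [weights(t_resc), weights(X)]            # direction 0: rescaled time, 1: covariate
p = [Wj.mean(axis=1) for Wj in W]            # marginal density estimates on the grid
mt = [(Wj * Z).mean(axis=1) / pj for Wj, pj in zip(W, p)]  # NW pilot smoothers
p01 = (W[0][:, None, :] * W[1][None, :, :]).mean(axis=2)   # joint density on the grid
mc = Z.mean()

m = [np.zeros(G), np.zeros(G)]               # starting values m_j^[0] = 0
for r in range(1, 51):                       # at most 50 iterations
    m_old = [mj.copy() for mj in m]
    # cycle through the directions, integrating out the other component
    m[0] = mt[0] - (p01 @ m[1]) / (G * p[0]) - mc
    m[0] -= (m[0] * p[0]).sum() / p[0].sum() # enforce the norming from (7)
    m[1] = mt[1] - (p01.T @ m[0]) / (G * p[1]) - mc
    m[1] -= (m[1] * p[1]).sum() / p[1].sum()
    rel = max(((mj - mo) ** 2).sum() / max((mo ** 2).sum(), 1e-12)
              for mj, mo in zip(m, m_old))
    if rel < 1e-5:                           # stopping rule
        break
```

On this near-independent design, the iteration contracts quickly and the covariate component m[1] tracks the shape of cos(2πx) on the grid.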
In applications where forecasts are performed, a particular interest lies in the value of the trend function m 0 at the rescaled time point u = 1. Hence, a good boundary behavior is particularly important for the trend component m 0 . To reduce the boundary bias in time direction, one could use a local linear pilot smoother in time direction while working with Nadaraya-Watson or local linear smoothers in the other directions.

Estimation of the AR parameters
To motivate the third step in our estimation procedure, we shall initially consider an infeasible estimator of the AR parameters. Suppose that the functions m_θ, m_0, …, m_d were known. In this situation, the error terms ε_t = Y_{t,T} − m_θ(t) − m_0(t∕T) − m_1(X_t^1) − · · · − m_d(X_t^d) of the AR(p) error process would be observed. The parameters φ* = (φ*_1, …, φ*_p) of the error process could thus be estimated by standard maximum likelihood methods. In particular, we could use a conditional maximum likelihood estimator of the form

φ̂ = argmax_{φ ∈ Φ} l_T(φ),    (15)

where Φ is a compact parameter space and l_T is the conditional log-likelihood given, up to terms not depending on φ, by

l_T(φ) = −(1∕2) ∑_{t=p+1}^T ( ε_t − φ_1 ε_{t−1} − · · · − φ_p ε_{t−p} )².

Note that φ̂ has a closed-form solution that is identical to the usual least squares estimator. We will, however, not work with this closed-form solution in what follows. Instead, we will formulate our proofs in terms of the likelihood function. This makes it easier to apply our arguments to other error structures such as ARCH processes. We give some comments on how to extend our approach in this direction in Section 6.
As the functions m_θ, m_0, …, m_d are not observed, we cannot use the standard approach from above directly. However, given the estimators m̂_θ, m̂_0, …, m̂_d from the previous estimation steps, we can replace the error terms ε_t by the approximations

ε̃_t = Y_{t,T} − m̂_θ(t) − m̂_0(t∕T) − m̂_1(X_t^1) − · · · − m̂_d(X_t^d)

in the maximum likelihood estimation. The log-likelihood then becomes l̃_T(φ), which is defined as l_T(φ) with ε_t replaced by ε̃_t, and our estimator of φ* is φ̃ = argmax_{φ ∈ Φ} l̃_T(φ).
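Because the conditional quasi-likelihood is maximized by least squares on lagged residuals, the third step can be sketched as follows. This is our own illustration: a simulated AR(2) series stands in for the approximations ε̃_t, and the function name is ours.

```python
import numpy as np

def fit_ar(resid, p):
    """Least squares fit of an AR(p) model to a residual series; this
    coincides with the conditional (quasi-)maximum likelihood estimator."""
    T = len(resid)
    y = resid[p:]
    # column k holds the k-th lag of the residual series
    lags = np.column_stack([resid[p - k: T - k] for k in range(1, p + 1)])
    phi, *_ = np.linalg.lstsq(lags, y, rcond=None)
    return phi

# illustrative check on a simulated AR(2) series
rng = np.random.default_rng(2)
phi_true = np.array([0.5, -0.3])
T = 3000
eps = np.zeros(T)
eta = rng.standard_normal(T)
for s in range(2, T):
    eps[s] = phi_true[0] * eps[s - 1] + phi_true[1] * eps[s - 2] + eta[s]
phi_hat = fit_ar(eps, 2)
```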

ASYMPTOTICS
In this section, we analyze the asymptotic properties of our estimators. The first subsection lists the assumptions required for our analysis. The following subsections describe the main asymptotic results, with each subsection dealing with a separate step of our estimation procedure.

Assumptions
To derive the asymptotic properties of the estimators m̂_θ, m̂_0, …, m̂_d, the following assumptions are needed:

(H1) The process {(X_t, ε_t) : t = 1, …, T} is strictly stationary and strongly mixing with mixing coefficients satisfying α(k) ≤ a^k for some 0 < a < 1.

(H2) The variables X_t have compact support, which w.l.o.g. equals [0, 1]^d. The density p of X_t and the densities p^{(0,l)} of (X_t, X_{t+l}), l = 1, 2, …, are uniformly bounded. Furthermore, p is bounded away from zero on [0, 1]^d.
(H6) The bandwidth h satisfies either of the following:

(H6a) h is of the order T^{−1∕5}.

(H6b) h is of the order T^{−λ} for some λ > 1∕5 in a suitable range.
The above assumptions are very similar to the standard conditions for smooth backfitting estimators to be found, for example, in the works of Mammen et al. (1999), Mammen and Park (2006), or Yu, Mammen, and Park (2011). Note that we do not necessarily require exponentially decaying mixing rates, as assumed in (H1). These could alternatively be replaced by sufficiently high polynomial rates. We nevertheless make the stronger assumption (H1) to keep the notation and structure of the proofs as clear as possible. In (H6), we impose two alternative conditions on the bandwidth h. In (H6a), h is assumed to be of the order T^{−1∕5}, which is optimal for estimating the additive component functions m_0, …, m_d. In (H6b), we assume h to be of the order T^{−λ} with some λ > 1∕5 and, thus, undersmooth the backfitting estimators of m_0, …, m_d. Undersmoothing is needed to estimate the AR coefficients in the error term. It makes sure that the bias parts of the backfitting estimators can be asymptotically neglected and do not influence the limit distribution of the AR parameter estimators. Assumption (H6b) parallels standard undersmoothing conditions imposed in the context of semiparametric estimation problems.
In order to show that the estimators of the AR parameters are consistent and asymptotically normal, we additionally require assumptions (H7)-(H9). In particular, there exist a real constant C and a natural number l* such that E[|ε_t| | X_{t+k}] ≤ C and E[|ε_t ε_{t+l}| | X_{t+k}, X_{t+l}] ≤ C for all l with |l| ≥ l* and k = −p, …, p.
The compactness assumption in (H7) is required for the proof of consistency. (H8) and (H9) are technical assumptions needed to show asymptotic normality.

Asymptotics for m̂_θ
We start by considering the asymptotic behavior of the estimator m̂_θ. The next theorem shows that m̂_θ(t) is asymptotically normal for each fixed time point t.

Theorem 1. Let (H1) and (H3) be fulfilled. Then, for each fixed time point t, √(T∕θ) ( m̂_θ(t) − m_θ(t) ) is asymptotically normal with mean zero and a finite long-run variance.
As m̂_θ and m_θ are periodic and the period θ is a fixed constant that does not depend on the sample size T, Theorem 1 immediately implies the corresponding result jointly for the finitely many time points t ∈ {1, …, θ}. The proof of Theorem 1 is straightforward: The estimation error m̂_θ(t) − m_θ(t) splits into two terms (A) and (B), where (A) collects the contribution of the deterministic trend and (B) is an average of mean-zero variables W_t. The term (A) approximates the integral ∫_0^1 m_0(u) du, which equals zero by our normalization. It is easily seen that the approximation error is of the order O(θ∕T), so that (A) can be asymptotically neglected. Noting that {W_t} is mixing by (H1) and has mean zero by our normalization, we can now apply a central limit theorem for mixing variables to the term (B) to get the normality result of Theorem 1.

The statement of Theorem 1 is derived under the assumption that the period θ is a fixed constant that does not depend on the sample size T. Heuristically speaking, we thus assume that θ is small compared with T. To better take into account the case where θ is not so small in comparison with T, we could allow θ to grow with T, that is, θ = θ_T. Because m̂_θ(t) is nothing else than an empirical average of the observations Y_{t′,T}, Y_{t′+θ,T}, Y_{t′+2θ,T}, …, the effective sample size for estimating m_θ(t) is T∕θ. This implies that, for each fixed t, m̂_θ(t) converges to m_θ(t) at the rate O_p(√(θ∕T)). The faster θ = θ_T grows with T, the slower this rate becomes. As regards our theory, we can allow θ = θ_T to grow with the sample size T as long as it does not diverge too quickly. In particular, our main theorems (Theorems 2, 3, and 4) and Corollary 1 remain true as long as θ_T does not grow too fast. In the sequel, we stick to the assumption that θ is a fixed constant in order to keep the technical arguments as clear as possible.

Asymptotics for m̂_0, …, m̂_d
The main result of this subsection characterizes the limiting behavior of the smooth backfitting estimators m̂_0, …, m̂_d. It shows that the estimators converge uniformly to the true component functions at the one-dimensional nonparametric rate, no matter how large the dimension d of the regressors is. Moreover, it characterizes the asymptotic distribution of the estimators.

Theorem 2. Suppose that conditions (H1)-(H5) hold. (a) Assume that the bandwidth h satisfies (H6a) or (H6b). Then, the estimators m̂_j converge uniformly to m_j at the one-dimensional nonparametric rate for all j = 0, …, d. (b) Assume that the bandwidth h satisfies (H6a). Then, for any x_0, …, x_d ∈ (0, 1), the estimators m̂_j(x_j) are jointly asymptotically normal, with asymptotic bias and variance involving the constants c_b = lim_{T→∞} T^{1∕5} h and c_K = ∫ K²(u) du. Furthermore, the bias functions appearing in the limit are the components of the L_2(q)-projection of the function defined in Lemma 3 in Appendix A onto the space of additive functions g satisfying ∫_0^1 g_j(x_j) p_j(x_j) dx_j = 0, and the bias constants can be characterized by an equation also given in Lemma 3 in Appendix A.
As described in Section 3.2, rescaled time t∕T behaves similarly to an additional uniformly distributed regressor that is independent of the other regressors. This observation allows us to derive the above result by extending the proof strategy of Mammen et al. (1999). The details are given in Appendix B.

Asymptotics for the AR parameter estimators
We finally establish the asymptotic properties of our estimator φ̃ of the AR parameters φ*. The technical details can be found in Appendix C. The first theorem shows that φ̃ is consistent.

Theorem 3. Suppose that the bandwidth h satisfies (H6a) or (H6b). In addition, let assumptions (H1)-(H5) and (H7) be fulfilled. Then, φ̃ is a consistent estimator of φ*.
The central result of our theory specifies the limiting distribution of φ̃.

Theorem 4. Suppose that the bandwidth h satisfies (H6b) and let assumptions (H1)-(H5) together with (H7)-(H9) be fulfilled. Then, √T ( φ̃ − φ* ) is asymptotically normal with mean zero and covariance matrix V*.
Consider for a moment the case in which the functions m_θ and m_0, …, m_d are known. In this case, we can use the "oracle" estimator φ̂ defined in (15) to estimate the AR parameters φ*. Standard theory tells us that φ̂ is asymptotically normal with asymptotic variance Γ_p^{−1} W Γ_p^{−1}. Theorem 4 thus shows that, in general, the limiting distribution of our estimator φ̃ differs from that of the oracle estimator. Note that this difference does not merely result from the fact that the functions m_0, …, m_d are nonparametric. Even if they are parametric, φ̃ will in general have a different distribution than the oracle estimator φ̂.
Even though different in general, the asymptotic distributions of φ̃ and φ̂ are the same in a wide range of cases. This oracle-type result is stated in the following corollary.

Corollary 1. Suppose that all the assumptions of Theorem 4 are fulfilled and that, in addition, E[ε_t | X_{t+k}] = 0 for all k = −p, …, p. Then, the limit distribution of φ̃ coincides with that of the oracle estimator φ̂.
Corollary 1 follows directly from the proof of Theorem 4: Inspecting the functions defined in Lemma 4 in Appendix C and realizing that they are constantly zero under the assumptions of the corollary, the matrix Ω is immediately seen to be equal to zero as well. The corollary shows that the oracle result holds under fairly mild conditions on the dependence structure between X_t and ε_t, in particular, under much weaker conditions than independence of the processes {X_t} and {ε_t}. To give an example where the conditions of the corollary are satisfied but where the processes {X_t} and {ε_t} are not independent, consider the following. Let the AR residuals be given by η_t = σ(X_t) ξ_t, where σ is a continuous volatility function and {ξ_t} is a process of zero-mean i.i.d. variables that is independent of {X_t}. A simple argument shows that E[ε_t | {X_t}] = 0 in this case, that is, the assumptions of the corollary are satisfied. Moreover, it is easily seen that the processes {X_t} and {ε_t} are not independent given that the volatility function σ is nonconstant.
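A quick numerical illustration of this example, with our own choices for the volatility function and all constants: the errors are conditionally mean zero given the covariates, yet their squares co-move with X_t, so the two processes are dependent.

```python
import numpy as np

rng = np.random.default_rng(3)
T, phi = 20000, 0.5
X = rng.uniform(0, 1, T)                      # covariate process (i.i.d. for simplicity)
sigma = 0.5 + X                               # nonconstant volatility function sigma(x)
xi = rng.standard_normal(T)                   # zero-mean i.i.d., independent of {X_t}
eta = sigma * xi                              # AR residuals eta_t = sigma(X_t) * xi_t

eps = np.zeros(T)                             # AR(1) errors driven by eta_t
for s in range(1, T):
    eps[s] = phi * eps[s - 1] + eta[s]

corr_level = np.corrcoef(eps, X)[0, 1]        # ~0: errors are conditionally mean zero
corr_square = np.corrcoef(eps ** 2, X)[0, 1]  # > 0: dependence through the variance
```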
Note that our theory re-establishes the oracle result derived in the simpler setup without stochastic covariates, that is, in the model Y_{t,T} = m_θ(t) + m_0(t∕T) + ε_t with E[ε_t] = 0. In this case, the periodic component can be estimated as described in Section 3.1. Moreover, we can use a Nadaraya-Watson smoother of the form (14) to approximate the trend component m_0. A vastly simplified version of the proof for Theorem 4 shows that the limiting distribution of the AR parameter estimators is identical to that of the oracle estimators in this setting. In particular, the stochastic higher-order expansion derived in Lemma 4 is not required any more. The arguments of the much simpler Lemma 5 in Appendix C are sufficient to derive the result. To understand the main technical reasons why the argument simplifies so substantially, we refer the reader to the remarks given after the proof of Lemma 5.

The normality results of Theorem 4 and Corollary 1 enable us to calculate confidence bands for the AR parameter estimators and to conduct inference based on these. To do so, we need a consistent estimator of the asymptotic variance of φ̃. Whereas such an estimator is easily obtained under the conditions of Corollary 1, it is not at all trivial to derive a consistent estimator of V* in Theorem 4. This is due to the very complicated structure of the matrix Ω, which involves functions obtained from a higher-order expansion of the stochastic part of the backfitting estimators (see Theorem 5 in Appendix B). To circumvent these difficulties, one may try to set up a bootstrap approach to estimate confidence bands and to do testing. The normality result of Theorem 4 could be used as a starting point to derive consistency results for such a bootstrap procedure. However, this is beyond the scope of this paper and a substantial project in itself.

APPLICATION
In this section, we apply our estimation procedure to a set of monthly temperature and ozone data from the Faraday/Vernadsky research station on the Antarctic Peninsula. The data can be found online as supporting information. Alternatively, they are available on request from the British Antarctic Survey, Cambridge. A strong warming trend has been identified on the peninsula over the past 50 years. In particular, the monthly mean temperatures at Faraday station have considerably increased during this time, as pointed out, for example, by Turner, King, Lachlan-Cope, and Jones (2002) and Turner et al. (2005).

Modeling approach
We closely follow the analysis of Hughes et al. (2007), as our model can be seen as a semiparametric extension of their approach. According to Hughes et al. (2007), the rise of the mean monthly temperature is mostly due to an increase in the minimum monthly temperature. They argued that, to understand and quantify the warming on the peninsula, an appropriate statistical model of the minimum temperature is called for. Following their lead, we focus on modeling the minimum temperature and consider stratospheric ozone as a potential explanatory variable.
The data used in our analysis are plotted in Figure 1. The upper panel contains the monthly minimum near-surface temperatures at Faraday station from September 1957 to December 2004, whereas the lower panel shows the monthly level of stratospheric ozone concentration measured in Dobson units over the same period. A detailed description of the data can be found in the work of Hughes et al. (2007). Hughes et al. (2007) proposed a parametric model with a linear trend and a parametrically specified periodic component with a period of 12 months to fit the temperature and ozone data. Their baseline model (23) regresses the minimum monthly temperature Y_t on a linear time trend and a sinusoidal seasonal component, with a = (a_0, …, a_3) denoting the vector of parameters. In addition, they considered the extended model (24), in which the covariate X_{t−1}, denoting the lagged detrended and deseasonalized ozone concentration, enters linearly. In their analysis, they found a strong linear upward trend in the minimum monthly temperature. Moreover, they observed considerable autocorrelation in the residuals ε_t and proposed an AR process to model them. Using an order selection criterion, they found an AR(1) model to be most suitable, which fits nicely with the preference for AR(1) errors when using discrete time series to model climate data, as mentioned in Mudelsee (2010).
We now introduce a framework that can be regarded as a semiparametric extension of the parametric models (23) and (24). Our baseline model is given by

Y_{t,T} = m_θ(t) + m_0(t∕T) + ε_t,    (25)

where Y_{t,T} are minimum monthly temperatures, m_θ is a seasonal component, and m_0 is a nonparametric time trend. We additionally consider an extended version of (25) having the form

Y_{t,T} = m_θ(t) + m_0(t∕T) + m_1(X_{t−1}) + ε_t,    (26)

where, as before, the variables X_{t−1} denote lagged monthly stratospheric ozone concentration levels that have been detrended and deseasonalized. The additive functions m_θ, m_0, and m_1 in the above two models are normalized as described in (4). Following the work of Hughes et al. (2007), we assume the variables ε_t to have an AR(1) structure and allow for the minimum monthly temperature to have a 12-month cycle by setting θ = 12. Before giving our estimates, we recall the preferred fits of the models (23) and (24), which combine the regression equation with ε_t = 0.562 ε_{t−1} + η_t and η_t a GEV(−0.0969, −5.67, 3.59) random variable.

Implementation
We estimate the additive component functions together with the AR parameter in our models (25) and (26) by the three-step procedure outlined in Section 3. Note that, in the simpler model (25), the trend function m_0 is estimated by a Nadaraya-Watson smoother in the second step of the procedure. To maintain comparability with the work of Hughes et al. (2007), we also use the observed data up until December 2003 for the estimation. To compute the estimators of the additive functions m_0 and m_1 and of the AR parameter, we employ an Epanechnikov kernel and choose the bandwidths by a simple plug-in rule of the form

h*_j = ( c_V σ̌²_j ∕ ( 4 c_B² B̌_j T ) )^{1∕5}    (29)

for j ∈ {0, 1}, where B̌_j is an estimate of the integrated squared bias component in direction j. Here, σ̌²_j, p̌_j, p̌′_j, m̌′_j, and m̌″_j are initial estimators that are computed with a pilot bandwidth h. Specifically, p̌_j is a kernel estimator of the marginal density p_j and p̌′_j(x_j) is its derivative. Moreover, m̌′_j(x_j) and m̌″_j(x_j) are estimators of the first and second derivatives m′_j(x_j) and m″_j(x_j), respectively. They are obtained from a local quadratic fit of m̌_j, as described on p. 1271 of the work of Mammen and Park (2005), where m̌_j is an initial backfitting estimator of the component function m_j. Furthermore, σ̌²_0 is an estimator of the long-run error variance σ²_0, which has the form σ²_0 = E[η²_t] ∕ (1 − φ*)² given the AR(1) error structure ε_t = φ* ε_{t−1} + η_t. To obtain σ̌²_0, we fit an AR(1) model to the initial residuals ε̌_t = Y_t − m̌_θ(t) − m̌_0(t∕T) − m̌_1(X_{t−1}) and compute estimates of φ* and E[η²_t]. The term σ̌²_1 is an estimator of the conditional error variance E[ε²_t | X_{t−1} = · ]. For simplicity, we suppose that the errors ε_t are homoskedastic, implying that E[ε²_t | X_{t−1} = · ] is equal to the unconditional short-run error variance E[ε²_t]. This allows us to work with the simple estimator σ̌²_1 = T^{−1} ∑_{t=1}^T ε̌²_t. All initial estimators are computed using an Epanechnikov kernel and the starting bandwidth h = 0.1. The constants c_V and c_B used in (29) are given by c_V = ∫ K²(u) du and c_B = ∫ u² K(u) du.
For the Epanechnikov kernel, they amount to c_V = 3/5 and c_B = 1/5.
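The two kernel constants can be verified numerically. The following sketch (the grid-based integration is our own illustration, not part of the paper's procedure) computes c_V and c_B for the Epanechnikov kernel:

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = (3/4)(1 - u^2) on [-1, 1], zero outside."""
    return np.where(np.abs(u) <= 1, 0.75 * (1.0 - u**2), 0.0)

# Kernel constants entering the plug-in bandwidth rule:
#   c_V = int K(u)^2 du,   c_B = int u^2 K(u) du
u = np.linspace(-1.0, 1.0, 400001)
du = u[1] - u[0]
c_V = np.sum(epanechnikov(u) ** 2) * du      # analytic value: 3/5
c_B = np.sum(u**2 * epanechnikov(u)) * du    # analytic value: 1/5
```

These constants enter the plug-in rule (29) together with the estimated variance and bias terms described above.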
The plug-in bandwidth h*_j can be regarded as an estimator of the optimal bandwidth which minimizes the asymptotic integrated mean squared error (MSE) of the backfitting estimator m̃_j. To compute it, we essentially have to estimate the asymptotic variance and bias expressions appearing in the normality result of Theorem 2. To do so, we make use of natural estimators of the terms appearing in the asymptotic bias, in particular of the bias components β_j(x_j). In general, the terms β_j(x_j) have a very complicated form. Only if d = 1 (as in our application) or if the d > 1 random regressors X^1_t, …, X^d_t are independent of each other do they simplify to a tractable expression. Hence, under the assumption of independent regressors, our plug-in rule can easily be extended to the case d > 1. In practice, this extended plug-in rule can be expected to work as long as the regressors are only moderately dependent, that is, as long as the independence assumption is not too far from the truth. Otherwise, more sophisticated methods are needed to approximate the terms β_j(x_j).
The proposed plug-in procedure is of course only a heuristic rule. We do not provide any theory for automated bandwidth selection as this would be an entire project in itself. Indeed, to the best of our knowledge, bandwidth selection for smooth backfitting estimators in the dependent data case is still an open problem. Some theory for the i.i.d. case is provided in the work of Mammen and Park (2005). However, their results have not been extended to the dependent data case so far. Note that we have not used a cross-validation procedure to select the bandwidths for the following reason. As shown, for example, in the works of Altman (1990), Hart (1991), Herrmann, Gasser, and Kneip (1992), and Hart (1994), standard cross-validation tends to perform poorly when used to estimate the optimal bandwidth in the simple fixed design setting Y_{t,T} = m(t/T) + ε_t with correlated errors. In our setting, analogous difficulties are to be expected when selecting the bandwidth in the time direction.

Estimation results
The estimate of the periodic component m is given by the circles in Figure 2. The vertical dashed lines illustrate the estimated 95% confidence intervals. The dotted line superimposes the estimated periodic function from the parametric model (28). Two differences between our periodic component estimate and the parametric estimate given in (28) become apparent immediately. First, our estimate attains its minimum in the southern hemisphere winter month of August, whereas the minimum lies in July/August when the parametric model is used. Second, in contrast to its parametric counterpart, our estimate is not symmetric: The fall in the minimum temperature from January to August is more gradual than the increase from August until January. Interestingly, the median monthly minimum temperature also follows this pattern, as can be seen in the boxplot of the monthly minimum temperatures provided in figure 1(b) of the work of Hughes et al. (2007). Figure 3 shows the smooth backfitting estimates of the additive functions m_0 and m_1 in model (26), along with their corresponding estimated 95% pointwise confidence bands. To compute them, we have used the bandwidths (h*_0, h*_1) = (0.15, 0.14), which were chosen as described in Section 5.2. The dashed lines in Figure 3 are fits from the parametric model (28). As the Nadaraya-Watson estimate of m_0 in the simpler model (25) is very similar to the estimate in (26), we do not plot it separately. From the shape of m̂_0 together with the rather tight 95% confidence band in the left-hand panel of Figure 3, there appears to be a strongly nonlinear upward trend in the minimum monthly temperature. The linear parametric trend in (28) is not only incapable of capturing this nonlinear pattern; it also underestimates the overall trend increase in the monthly minimum temperature over the entire estimation period. Toward the end of the sample, however, the fits are very similar.
The estimate m̂_1 in the right-hand panel of Figure 3 suggests that the lagged ozone concentration level has a negative effect on the minimum monthly temperature. Although the effect again appears to be nonlinear, the deviation from linearity does not seem to be as severe as for m̂_0.
From the third step of our estimation procedure, we obtain estimated AR parameters of 0.56 and 0.57 for the models (25) and (26), respectively.

FIGURE 2
The circles represent the demeaned estimate of the seasonal component m of models (25) and (26), along with the estimated 95% pointwise confidence intervals (dashed vertical lines). The dotted line represents the function 6.61 sin(·) in (28) obtained by Hughes et al. (2007).

These values are comparable to the estimates obtained by Hughes et al. (2007) in the parametric models (27) and (28). As discussed in Section 4.4, it is straightforward to calculate confidence intervals for the parameter estimate in the simple model (25), whereas this is extremely involved in the extended model (26) if we are not willing to make the assumptions of Corollary 1. Here, we content ourselves with giving the 95% confidence band in the simple model (25). Comparing it with the corresponding band for the parametric model (27), we see that the parameter uncertainty is fairly similar, although the estimated 95% confidence band for the parametric model (27) is slightly narrower and asymmetric due to the assumed GEV innovations. To summarize, the residual process displays significant positive persistence, which, as pointed out by Mudelsee (2010), is a common phenomenon for climate data.
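The AR(1) fit to the residuals in the third step can be sketched in a few lines. The following is a minimal illustration on simulated residuals (the series, the true value ρ* = 0.56, and the least-squares estimator are our own assumptions for the sketch, not the paper's data or exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate residuals from an AR(1) process e_t = rho* e_{t-1} + eta_t;
# rho* = 0.56 mimics the estimate reported for model (25), purely for illustration.
rho_true, T = 0.56, 5000
e = np.zeros(T)
for t in range(1, T):
    e[t] = rho_true * e[t - 1] + rng.standard_normal()

# Least-squares estimate of the AR(1) parameter from the residuals
rho_hat = np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2)

# Long-run error variance sigma_0^2 = E[eta_t^2] / (1 - rho*)^2, as in Section 5.2
eta_hat = e[1:] - rho_hat * e[:-1]
sigma0_sq = np.mean(eta_hat ** 2) / (1.0 - rho_hat) ** 2
```

The same two quantities (the AR parameter and the innovation variance) are the ingredients of the long-run variance estimator σ̌²_0 used in the plug-in bandwidth rule.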
Finally, we compare our semiparametric models with those of Hughes et al. (2007) in terms of forecasting ability. To do so, we repeat the forecasting exercise from the work of Hughes et al. (2007), that is, we compute the one-step-ahead forecasts of the minimum monthly temperatures for the twelve months from January to December 2004. The one-step-ahead forecast for time point t_0 + 1 is obtained by estimating the model using observations at t = 1, …, t_0 and constantly extrapolating the estimated trend function m̂_0 into the future. As it is unclear what the theoretically optimal smoothing parameter in the forecasting context would be, we have computed the forecasts for bandwidths h on a grid spanning a wide range of values from 0.05 to 1 in each direction.
We first compare the forecasting performance of our simple model (25), the corresponding parametric model (23), and the intermediate model (30), in which the seasonality is modeled as in our approach but the trend is linear as in the models of Hughes et al. (2007). Note that, in (30), the model constant is absorbed into the seasonal component m. For all three models, we take the errors to follow an AR(1) process. For comparison purposes, all three models are estimated using minimum temperature data from September 1957 onwards. The parameters a = (a_0, …, a_3) in model (23) are estimated by least squares. For the intermediate model (30), the seasonal component is estimated as in our procedure and the parameter b is obtained from a least squares fit. Finally, our simple model (25) is estimated using a local constant smoother with a bandwidth varying over the grid mentioned above. In all of the above models, an estimator of the AR(1) parameter in the error term is obtained from a least squares fit to the estimated residuals. Table 1 reports the estimated MSE of the forecasts for the models considered, and a graphical illustration is provided in Figure 4. As one can see, our model (25) performs better in terms of MSE than its two competitors (23) and (30) for all bandwidths h considered. The largest reduction in MSE occurs when moving from model (23) to (30), that is, when using our estimate of the periodic component instead of the one considered in (23). The MSE decreases further when moving from the intermediate model (30) to our model (25), that is, when replacing the linear trend by a nonparametric one. The additional gain, however, is much smaller. These results are quite intuitive: The estimated periodic component exhibits strong variation within a one-year cycle, whereas the estimated trend varies comparatively little over short time periods.
Hence, for short-term forecasts, the variation in the trend is much less important than the variation in the seasonal component. Moreover, the parametric and nonparametric fits of the trend function are very close toward the end of the sample. This suggests that, for the data sample at hand, much more can be gained in terms of forecasting by improving the estimate of the seasonal component than the trend estimate.
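The forecasting exercise can be sketched as follows. We use a toy series with a slowly varying trend and AR(1) errors; the local-constant trend estimate at the boundary, its constant extrapolation, and the AR(1) correction of the last residual mirror the description above only schematically (all data, bandwidths, and helper functions here are our own illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy series: slowly varying trend plus AR(1) errors (no seasonal part, for brevity)
T = 240
e = np.zeros(T)
for s in range(1, T):
    e[s] = 0.5 * e[s - 1] + 0.3 * rng.standard_normal()
y = 0.01 * np.arange(T) + e

def one_step_forecast(y, t0, h):
    """Forecast y[t0] from y[:t0]: a local-constant (Nadaraya-Watson) trend
    estimate at the right boundary of rescaled time, extrapolated constantly,
    plus an AR(1) forecast of the last residual."""
    u = np.arange(t0) / t0                         # rescaled time in [0, 1)
    w = np.maximum(1 - ((u - 1.0) / h) ** 2, 0.0)  # Epanechnikov weights at u = 1
    trend = np.sum(w * y[:t0]) / np.sum(w)
    resid = y[:t0] - trend
    rho = np.sum(resid[1:] * resid[:-1]) / np.sum(resid[:-1] ** 2)
    return trend + rho * resid[-1]

# Forecast MSE over the last 12 observations for a grid of bandwidths
grid = [0.05, 0.1, 0.25, 0.5, 1.0]
mse = {h: np.mean([(one_step_forecast(y, t0, h) - y[t0]) ** 2
                   for t0 in range(T - 12, T)]) for h in grid}
```

Running the forecast over a bandwidth grid, as in the text, makes the sensitivity of the MSE to the smoothing parameter directly visible.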
In addition to the analysis above, we have performed the same forecasting exercise for our model (26), which includes ozone as an additional predictor. We denote the bandwidth for estimating m_j by h_j and consider bandwidths h_j on a grid ranging from 0.05 to 1 for j = 0, 1. For all combinations of bandwidths (h_0, h_1) considered, the MSE lies between 9.32 and 11.55.

FIGURE 4
Forecasts from model (23), model (30), and model (25). "Forecast for best model (25)" refers to the forecast of our model (25) using the "best" possible bandwidth, that is, the bandwidth that produces the smallest MSE. "Forecast for worst model (25)" analogously refers to the forecast using the "worst" bandwidth.

The MSE value reported in the work of Hughes et al. (2007) for the corresponding parametric model (28) with an ozone component amounts to 10.14. The MSE produced by our semiparametric model (26) lies below this reference value for all bandwidth combinations with h_1 > 0.1. Hence, we can improve on the forecasting performance of the parametric model (28) as long as we do not strongly undersmooth in the direction of ozone. The best MSE for model (26) is obtained for the bandwidth h_1 = 1, that is, when ozone is essentially smoothed out. Setting h_1 = 1 and varying the bandwidth h_0 in the time direction over the grid from 0.05 to 1, the MSE varies from 9.78 to 9.32. Comparing these MSE values to those for the simpler model (25) with a trend component only, there appears to be no further gain from including ozone as a predictor. Thus, it seems that lagged ozone does not help in terms of forecasting, in contrast to the results in the work of Hughes et al. (2007). Finally, one may try to improve the forecast performance of our model (26) by using a local linear pilot smoother in the time direction, as discussed at the end of Section 3.2. However, for the data sample at hand, there is no gain from doing so, the best prediction MSE over the grid of bandwidths considered being 9.63.

EXTENSIONS
Our theory and methods can be extended in various directions. In what follows, we discuss some of them.

Alternative error structures
As already mentioned in Section 3.3, our theoretical arguments can be generalized to other error structures. An important example is the case in which we suspect the residuals to be heteroskedastic and model them via an ARCH(p) process. Going along the lines of the proofs of Theorems 3 and 4, the ARCH parameter estimators can be shown to be consistent and asymptotically normal. The only difference to the AR case is that the conditional likelihood has a more complicated form, making it more tedious to derive the expansion of the first derivative of the likelihood function in the normality proof. Our proving strategy may also be applied to ARMA(p, q) and GARCH(p, q) residuals. This is most easily seen for a causal and invertible ARMA(1, 1) process {ε_t}, which satisfies the equation ε_t = a ε_{t−1} + η_t + b η_{t−1} for some white-noise residuals η_t. In this case, the conditional likelihood has a very similar structure to the likelihood function of the AR(p) case. The only notable difference is that the sum over k in the definition of η_t(a, b) now has t − 1 elements rather than only a fixed number p. As the elements of the sum are weighted by the coefficients (−b)^{k−1}(a + b), which decay exponentially fast to zero, this does not cause any major problems in the proofs. In particular, we can truncate the sum at min{t − 1, C log T} for a sufficiently large C, the remainder being asymptotically negligible. After this truncation, the arguments of the AR(p) case apply more or less unchanged. In the general ARMA(p, q) setup, the structure of the likelihood function becomes much more complicated. It is thus convenient to base the estimation of the parameters on a criterion function that is a bit simpler to handle. In particular, consider a causal and invertible ARMA(p, q) process with AR coefficients a = (a_1, …, a_p) and MA coefficients b = (b_1, …, b_q). As the process is invertible, the MA polynomial 1 + b_1 z + ⋯ + b_q z^q is nonzero for all |z| ≤ 1.
Using this, the innovations can be expanded as an infinite weighted sum of lagged values of ε_t, with coefficients c_k(b) that decay exponentially fast to zero. Truncating this infinite sum, we define approximate innovations and estimate the ARMA coefficients a and b by minimizing the resulting least squares criterion. This criterion function again has a very similar structure to that of the AR(p) setup. In particular, setting c_0(b) = 1 and c_k(b) = 0 for k > 0 yields the conditional likelihood of the AR(p) case. As the coefficients c_k(b) (as well as their derivatives with respect to b) decay exponentially fast to zero, a truncation argument as in the ARMA(1, 1) case allows us to adapt the proving strategy of Theorems 3 and 4 to the setup at hand.
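The truncation idea for the ARMA(1,1) case can be made concrete in code. The coefficient formula (−b)^{k−1}(a + b) and the truncation point C log T are from the text; the simulation, the parameter values, and the least-squares criterion are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate a causal, invertible ARMA(1,1): e_t = a* e_{t-1} + eta_t + b* eta_{t-1}
a_true, b_true, T = 0.5, 0.3, 3000
eta = rng.standard_normal(T)
e = np.zeros(T)
for t in range(1, T):
    e[t] = a_true * e[t - 1] + eta[t] + b_true * eta[t - 1]

def innovations(e, a, b, C=5):
    """Truncated reconstruction of the innovations: inverting the MA part gives
    eta_t = e_t - sum_{k>=1} (-b)^{k-1} (a + b) e_{t-k}; the sum is truncated at
    min(t, C log T), its tail being exponentially small."""
    T = len(e)
    K = int(C * np.log(T))
    eta_hat = np.array(e, dtype=float)
    for t in range(1, T):
        k = np.arange(1, min(t, K) + 1)
        eta_hat[t] -= np.sum((-b) ** (k - 1) * (a + b) * e[t - k])
    return eta_hat

def criterion(params, e):
    """Least-squares criterion: average of the squared truncated innovations."""
    a, b = params
    return np.mean(innovations(e, a, b) ** 2)
```

Minimizing `criterion` over (a, b) recovers the ARMA coefficients; at the true parameters the criterion is close to the innovation variance, while misspecified parameters inflate it.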

A locally stationary version of our model
Our model decomposes the time series observations Y_{t,T} into the seasonal component m(t), the time trend component m_0(t/T), and the stationary stochastic component m(X_t) + ε_t. Whether the stationarity of the stochastic component is a reasonable assumption of course depends on the application context. An interesting extension of our framework is to replace the stationary stochastic component by a locally stationary one. This results in the model equation

Y_{t,T} = m(t) + m_0(t/T) + Σ_{j=1}^d m_j(t/T, X^j_{t,T}) + ε_{t,T},   (31)

where m and m_0 are defined as before and the m_j are time-varying nonparametric component functions. In this very general model framework, the covariate process {X_{t,T} : t = 1, …, T} may be allowed to be locally stationary (rather than strictly stationary) and {ε_{t,T} : t = 1, …, T} may be a time-varying AR process of the form

ε_{t,T} = Σ_{i=1}^p a_i(t/T) ε_{t−i,T} + σ(t/T) η_t,   (32)

where σ is a time-varying volatility function and the η_t are i.i.d. innovations. The locally stationary model (31) (without a periodic component m) was studied by Vogt (2012), where theory for smooth backfitting estimators of the time-varying functions m_j was developed. The results of Vogt (2012) could be used as a starting point to generalize our theoretical results to models (31) and (32). This is, however, far from trivial. To do so, we would have to derive a uniform stochastic expansion of the smooth backfitting estimators in the time-varying regression model (31) that parallels the expansion from Theorem 5 in Appendix B. (Such an expansion has not been provided in the work of Vogt (2012).) With such an expansion at hand, one could attempt to extend the theoretical results of this paper and to derive the asymptotic distribution of the estimated AR parameters ã_i(u) at a given rescaled time point u.
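A time-varying AR process of the form (32) is easy to simulate, and a kernel-weighted Yule-Walker fit gives a simple local estimator of the coefficient function at a fixed rescaled time point. The coefficient and volatility functions below, and the local estimator itself, are our own illustrative choices, not results from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

# Time-varying AR(1): eps_t = a(t/T) eps_{t-1} + sigma(t/T) eta_t,
# with smooth (illustrative) coefficient and volatility functions.
T = 20000
a = lambda u: 0.3 + 0.4 * u        # AR coefficient drifts from 0.3 to 0.7
sigma = lambda u: 1.0 + 0.5 * u
eps = np.zeros(T)
for t in range(1, T):
    u = t / T
    eps[t] = a(u) * eps[t - 1] + sigma(u) * rng.standard_normal()

def local_ar1(eps, u0, h):
    """Kernel-weighted least squares estimate of a(u0): local Yule-Walker
    with Epanechnikov weights in rescaled time."""
    u = np.arange(1, len(eps)) / len(eps)
    w = np.maximum(1 - ((u - u0) / h) ** 2, 0.0)
    return np.sum(w * eps[1:] * eps[:-1]) / np.sum(w * eps[:-1] ** 2)

a_hat = local_ar1(eps, 0.5, 0.1)   # estimate of a(u) at rescaled time u = 0.5
```

Locally around u0, the process behaves like a stationary AR(1) with coefficient a(u0), which is why the windowed Yule-Walker estimate recovers the coefficient function pointwise.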

More efficient estimation by prewhitening techniques
We finally discuss a prewhitening strategy to improve the efficiency of our estimators. For ease of notation, consider the simplified model (33), where X_t is real valued and {ε_t} is an AR(1) process of the form ε_t = ρ* ε_{t−1} + η_t with white-noise residuals η_t. Taking first-order differences in (33) and using the AR(1) equation of the error terms yields a model whose error terms are the uncorrelated innovations η_t. Hence, if we knew the AR parameter ρ*, we could prewhiten the model by forming the differences Y_{t,T} − ρ* Y_{t−1,T}. As a result, we would obtain a model with uncorrelated error terms η_t, thus getting rid of the AR structure in the errors. Such a prewhitening strategy has been studied, for example, in the work of Xiao, Linton, Carroll, and Mammen (2003) in the context of a nonparametric regression model with AR errors.
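The effect of the quasi-differencing step is easy to verify numerically. The following sketch uses a simplified simulated model Y_t = m(X_t) + ε_t with AR(1) errors (the regression function, parameter values, and known ρ* are our own illustrative assumptions) and checks that the prewhitened errors are serially uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simplified model Y_t = m(X_t) + e_t with AR(1) errors e_t = rho* e_{t-1} + eta_t
T, rho = 10000, 0.6
x = rng.uniform(0, 1, T)
m = lambda x: np.sin(2 * np.pi * x)
e = np.zeros(T)
for t in range(1, T):
    e[t] = rho * e[t - 1] + rng.standard_normal()
y = m(x) + e

# Quasi-differencing with the (here known) AR parameter removes the serial
# correlation: Y_t - rho Y_{t-1} = m(X_t) - rho m(X_{t-1}) + eta_t
z = y[1:] - rho * y[:-1]
resid = z - (m(x[1:]) - rho * m(x[:-1]))   # equals eta_t exactly in this toy setup

# Lag-1 autocorrelation before and after prewhitening
acf1_raw = np.corrcoef(e[1:], e[:-1])[0, 1]
acf1_pre = np.corrcoef(resid[1:], resid[:-1])[0, 1]
```

The raw errors show strong lag-1 correlation (close to ρ*), while the prewhitened residuals are essentially white noise; in practice ρ* must of course be replaced by an estimate.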

SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of the article.

AUXILIARY RESULTS
Before deriving our main results, we state and sketch the proofs of some auxiliary lemmas. The first lemma concerns the uniform convergence of the kernel density estimators p̂_j and p̂_{j,k}.

Lemma 1. Suppose that (H1)-(H5) hold and that the bandwidth h satisfies (H6a) or (H6b).
Then, the density estimators p̂_j and p̂_{j,k} converge to p_j and p_{j,k} uniformly at the standard kernel rates. We next consider the convergence behavior of the one-dimensional Nadaraya-Watson smoothers m̂_j defined in (11) and (14). To do so, we decompose m̂_j = m̂_j^A + m̂_j^B into a stochastic part m̂_j^A and a bias part m̂_j^B, defined in (A5) and (A6) for j = 0, …, d, where we set X^0_t = t/T to shorten the notation. For the stochastic part m̂_j^A, we have the following result under (H1)-(H5) and the bandwidth condition (H6a) or (H6b).
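A one-dimensional Nadaraya-Watson smoother of the kind analyzed here fits in a few lines. The following sketch (simulated data, our own choices of regression function, bandwidth, and grid; a numerical illustration rather than a statement of the lemma) checks that the smoother is uniformly close to the target on an interior grid:

```python
import numpy as np

rng = np.random.default_rng(5)

def nw(x0, x, y, h):
    """One-dimensional Nadaraya-Watson smoother with an Epanechnikov kernel."""
    w = np.maximum(1 - ((x - x0) / h) ** 2, 0.0)
    return np.sum(w * y) / np.sum(w)

# Uniform closeness of the smoother to the target over an interior grid
T = 50000
x = rng.uniform(0, 1, T)
m = lambda x: x ** 2
y = m(x) + 0.2 * rng.standard_normal(T)
grid = np.linspace(0.1, 0.9, 17)      # interior points, away from the boundary
err = max(abs(nw(g, x, y, h=0.05) - m(g)) for g in grid)
```

The maximal error over the grid shrinks as T grows and h is tuned accordingly, which is the finite-sample counterpart of the uniform convergence statements in Lemmas 1 and 2.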
Lemmas 1 and 2 can be proven by small modifications of standard uniform convergence results for kernel estimators, as given in the works of Bosq (1998), Masry (1996), or Hansen (2008). For the bias part m̂_j^B, we have the following expansion.

VOGT AND WALSH
Lemma 3 can be proven by going along the lines of the arguments for theorem 4 in the work of Mammen et al. (1999), exploiting that the ratio m̂_j/p̂_j converges uniformly in x_j. Combining these calculations with the arguments from the proof of theorem 4 in the work of Mammen et al. (1999) yields formula (A10).

PROOF OF THEOREM 2
In this appendix, we prove Theorem 2, which describes the asymptotic behavior of our smooth backfitting estimators. For the proof, we split up the estimators into a stochastic and a bias part. In Theorem 5, we provide a uniform expansion of the stochastic part. This result is an extension of a related expansion given in the work of Mammen and Park (2005) in the context of bandwidth selection in additive models. The bias part is treated in Theorem 6. The proofs of both theorems require the uniform convergence results summarized in Appendix A for the kernel smoothers that enter the backfitting procedure as pilot estimators. Note that Theorems 5 and 6 are needed not only for the second estimation step but also for the derivation of the asymptotics of the AR estimators in the third step. Throughout the appendices, we use the symbol C to denote a finite real constant, which may take a different value on each occurrence. We now turn to the proof of Theorem 2. To start with, we decompose the backfitting estimators m̃_j into a stochastic part m̃_j^A and a bias part m̃_j^B according to m̃_j = m̃_j^A + m̃_j^B. The two components m̃_j^S for S = A, B are defined in terms of the stochastic and bias parts m̂_k^A and m̂_k^B of the Nadaraya-Watson pilot estimators given in (A5) and (A6). We now analyze the convergence behavior of m̃_j^A and m̃_j^B separately.
We first provide a higher-order expansion of the stochastic part m̃_j^A. The following result extends theorem 6.1 in the work of Mammen and Park (2005) (in particular, their equation (6.3)) to our setting.
uniformly in x. A similar claim holds for the companion term; together, this implies the statement of Theorem 5. The idea behind the proof of (B2) is as follows. From the definition of the operators Ŝ_j, it can be seen that they can be expressed through the terms S_{j,k}(x_j). In what follows, we show that the terms S_{j,k}(x_j) have the representation (B4) uniformly in x_j. Thus, they essentially have the desired form T^{−1} Σ_t w_{t,k}(x_j) ε_t with some weights w_{t,k}. This allows us to infer (B5) uniformly in x, with some absolutely bounded functions b_t satisfying |b_t(x) − b_t(y)| ≤ C‖x − y‖ for some C > 0. Moreover, using the uniform convergence results from Appendix A, it can be shown that (B6) holds uniformly in x, where S_j is defined analogously to Ŝ_j with the density estimators replaced by the true densities. Combining (B5) and (B6) completes the proof of (B2).
To show (B4), we exploit the mixing behavior of the variables X_t. Plugging the definition of m̂_k^A into the term S_{j,k} and applying the uniform convergence results from Appendix A, we can replace the density estimators in the resulting expression by the true densities. This yields an approximation that holds uniformly for x_j ∈ [0, 1]. In the final step, we show that the remaining error term is negligible, again uniformly in x_j. This is done by applying a covering argument together with an exponential inequality for mixing variables. The employed techniques are similar to those used to establish the results of Appendix A.
We now turn to the bias part m̃_j^B. This, together with standard arguments, yields consistency of ã. In order to prove (C4), we decompose the expression of interest into four terms.
Using (C1)-(C3), it is straightforward to show that the four terms on the right-hand side of the above equation are all o_p(1) uniformly in the parameter. This shows (C4).

Proof of Theorem 4
By the usual Taylor expansion argument, we obtain an expansion around the true parameter a*, where T_T(ã, a*) is the p × p matrix whose ith row is the ith row of the Hessian of the criterion function evaluated at intermediate points between ã and a*. In what follows, we show that T^{−1} T_T(ã, a*) converges in probability to H and that the normalized score is asymptotically normal with covariance matrix Ψ = 4W + 4Ω, where H = −2Γ_p, Γ_p is the p × p autocovariance matrix of the AR process, and Ω is given in (C15). This completes the proof.
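Schematically, this is the familiar sandwich argument. The display below is a sketch under the notational assumptions introduced above (ã the estimator, a* the true AR parameter vector, ℓ̂_T the criterion function), not the paper's exact equations:

```latex
\sqrt{T}\,(\tilde a - a^{*})
  \;=\; -\Big(\tfrac{1}{T}\,T_T(\tilde a, a^{*})\Big)^{-1}
        \frac{1}{\sqrt{T}}\,\partial_a \hat\ell_T(a^{*})
  \;\xrightarrow{\;d\;}\; N\big(0,\; H^{-1}\,\Psi\,H^{-1}\big),
\qquad H = -2\,\Gamma_p, \quad \Psi = 4W + 4\Omega .
```

The two steps proved below correspond exactly to the two factors: convergence of the normalized Hessian to H, and asymptotic normality of the normalized score with covariance Ψ.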
Proof of (C5). By straightforward calculations, it can be seen that the difference between the feasible and infeasible Hessians is asymptotically negligible. Defining the p × p matrix T̄_T(ã, a*) analogously to T_T(ã, a*) with l̂_T replaced by l_T, it is further easy to show that T^{−1} T̄_T(ã, a*) →_P H, yielding (C5).
Introducing the notation a*_0 = −1, we obtain a representation whose last equality follows from (C1), uniformly over the indices k and i. In what follows, we derive a stochastic expansion of the terms Q_T^{[k,i]}. By symmetry, this also gives us an expansion for Q_T^{[i,k]} and thus, by (C7), also for the difference of the two.
From Appendix B, we know that the backfitting estimators m̃_j(x_j) can be decomposed into a stochastic part m̃_j^A(x_j) and a bias part m̃_j^B(x_j). This allows us to rewrite the term Q_T as a sum of components Q_{T,ε}, Q_{T,V,j}, and Q_{T,B,j} for j = 0, …, d. In Lemmas 6 and 7, we will show that the bias components are asymptotically negligible. Moreover, Lemmas 4 and 5 establish expansions of the variance components, where g = g^{[k,i]} are deterministic functions whose exact forms are given in the statement of Lemma 4. These functions are easily seen to be absolutely bounded by a constant independent of T. Inserting the above results in (C8) and using this together with (C7) yields a representation with an absolutely bounded function h_i, where we suppress the dependence of h_i on the parameter vector a* in the notation. As a result, the term of interest can be written as a normalized sum of random variables U_{t,T} plus a term that is asymptotically negligible. Using the mixing assumptions in (H1), it is straightforward to see that the variables {U_{t,T} : t = 1, …, T} form an α-mixing array with mixing coefficients that decay exponentially fast to zero. We can thus apply a central limit theorem for mixing arrays, more precisely, theorem 2.1 from the work of Liebscher (1996), to obtain asymptotic normality. In order to complete the proof, we still need to show that Equations (C9)-(C12) are fulfilled for the terms Q_{T,ε}, Q_{T,V,j}, and Q_{T,B,j}. We begin with the expansion of the variance components Q_{T,V,j} for j = 1, …, d, as this is technically the most interesting part.

Lemma 4. For j = 1, …, d, the terms Q_{T,V,j} admit an asymptotic expansion in which the functions g = g^{[k,i]} are given in terms of conditional expectations, where E_{−s}[·] is the expectation with respect to all variables except for those depending on the index s and the functions r_{j,s}(·) = r_j(s/T, X_s, ·) are defined in Theorem 5 in Appendix B.
Proof. By Theorem 5, the stochastic part m̃_j^A of the smooth backfitting estimator m̃_j has a uniform expansion in x_j in terms of the stochastic part m̂_j^A of the Nadaraya-Watson pilot estimator and the functions r_{j,s}(·) = r_j(s/T, X_s, ·), which are Lipschitz continuous and absolutely bounded. With this result, we can decompose Q_{T,V,j} into two terms, Q_{T,V,j}^{NW} and Q_{T,V,j}^{SBF}. In the following, we give the arguments needed to treat Q_{T,V,j}^{NW}. The line of argument for Q_{T,V,j}^{SBF} is essentially identical, although some of the steps are easier due to the properties of the r_{j,s} functions.
Plugging the definition (A5) of the estimator m̂_j^A(x_j) into the term Q_{T,V,j}^{NW}, we obtain an explicit double sum. In the first step, we replace the random denominator by the true density. Notice that sup_{x_j ∈ [0,1]} |B_j(x_j)| = O_p(h) and sup_{x_j ∈ [0,1]} |V_j(x_j)| = O_p(√(log T/(Th))). Using a second-order Taylor expansion of f(z) = (1 + z)^{−1}, we arrive at a decomposition that holds uniformly in x_j. Plugging this decomposition into (C16), we see that all that is required to establish (C17) is to show that both Q_{T,V,j}^{NW,B} and Q_{T,V,j}^{NW,V} are o_p(1). In the next step, we replace the inner sum over t in (C17) by a deterministic function that only depends on X_s and show that the resulting error can be asymptotically neglected. To this end, let E_{−s}[·] denote the expectation with respect to all variables except for those depending on the index s. With this notation at hand, we can rewrite (C17) accordingly. Once we show that R_{T,V,j}^{NW} = o_p(1), we are left with a normalized sum involving the terms q_j(X_{s−k}) ε_{s−i}, for a fixed δ > 0. Then, by Chebyshev's inequality, the probability of interest can be split as P_S + P_{S^c}, where S is the set of tuples (s, s′, t, t′) with 1 ≤ s, s′, t, t′ ≤ T such that (at least) one index is separated from the others and S^c is its complement. We say that an index, for instance t, is separated from the others if min{|t − t′|, |t − s|, |t − s′|} > C_2 log T, that is, if it is further away from the other indices than C_2 log T for a constant C_2 to be chosen later on. We now analyze P_S and P_{S^c} separately.
(a) First, consider P_{S^c}. If a tuple (s, s′, t, t′) is an element of S^c, then no index is separated from the others. Because the index t is not separated, there exists an index, say t′, such that |t − t′| ≤ C_2 log T. Now, take an index different from t and t′, for instance s. Then, by the same argument, there exists an index, say s′, such that |s − s′| ≤ C_2 log T. As a consequence, the number of tuples (s, s′, t, t′) ∈ S^c is smaller than CT²(log T)² for some constant C. Using (H8), this suffices to infer that P_{S^c} → 0.
Thus, P_S = P_{S_1} + P_{S_2} + P_{S_3} + P_{S_4}, where P_{S_r}, r = 1, …, 4, collects the tuples whose rth index is separated from the others. In what follows, we show that P_{S_r} → 0 for r = 1, …, 4. As the four terms can be treated in exactly the same way, we restrict attention to the analysis of P_{S_1}. We start by taking a suitable cover {I_m}. Using (C19), we can further bound the term of interest. By Davydov's inequality, the covariances involved can be bounded with some constant C_3 > 0, where q and r are chosen slightly larger than 4/3 and 4, respectively. Note that we can make C_3 arbitrarily large by choosing C_2 large enough. From this, it is easily seen that P_{S_1}^A → 0.