A Stationary Spatio‐Temporal GARCH Model

We introduce a lagged nearest-neighbour, stationary spatio-temporal generalized autoregressive conditional heteroskedasticity (GARCH) model on an infinite spatial grid that allows for GARCH innovations in a space-time ARMA model. This is illustrated by a real data application to a classical dataset of sea surface temperature anomalies in the Pacific Ocean. The model and its translation-invariant neighbourhood system are wrapped around a torus, forming a model with finite spatial domain, which we call circular spatio-temporal GARCH. Such a model can be seen as an approximation of the infinite one, and simulation experiments show that the circular estimator with a straightforward bias correction performs well on such non-circular data. Since the spatial boundaries are tied together, the well-known boundary issue in spatial statistical modelling is effectively avoided. We derive stationarity conditions for these circular processes and study the spatio-temporal correlation structure through an ARMA representation. We also show that the matrices defined by a vectorized version of the model are block circulants. The maximum quasi-likelihood estimator is presented, and we prove its strong consistency and asymptotic normality by generalizing results from univariate GARCH theory.


INTRODUCTION
A spatio-temporal generalized autoregressive conditional heteroskedasticity (STGARCH) model is a time-space extension of the univariate GARCH models (Engle, 1982; Bollerslev, 1986), but why do we need GARCH in space-time models? For the same reasons we would in time series: whenever we suspect that a white noise series does not have a constant conditional volatility and we want our model to capture this. Characteristics of GARCH are little or no autocorrelation, yet profound correlation of the squared or absolute series. Varying conditional volatility, a heavy-tailed marginal distribution and clustering of extremes are other traits. The fact that we have a spatial component means that these features also appear in space, that is, the conditional volatility depends on where and when, and extremes cluster in space-time. Utilizing GARCH errors for another model, for example, as innovations in an ARMA model, is a common practice in time series, and this can surely also be done in space-time. The models can be used for volatility forecasting and thus improve the quality of prediction intervals by accounting for conditional heteroskedasticity.
The STGARCH model we present has translation-invariant neighbourhoods at different time lags that determine the set of variables influencing the future volatility in a traditional GARCH manner. The model is defined on the infinite lattice Z^d, which leads to a boundary problem for a statistical analysis with data confined to a fixed finite spatial region. When a neighbourhood is translation invariant, observations are missing for some of the neighbours at the boundary. The circular modification of the model suggested here both solves the boundary problem and retains a projected version of the translation-invariant neighbourhood system. We call this new model circular spatio-temporal GARCH (CSTGARCH). It is clear that it is closely related to the original model.

Spatio-temporal GARCH
Let α : Z × Z^d → R_{≥0} be a function with finite support; for fixed i, α_i : Z^d → R_{≥0}. The function β is defined in the same way. We refer to α, β as the parameter functions. The STGARCH model is given by

X_t(u) = σ_t(u) Z_t(u),
σ²_t(u) = ω + Σ_{i≥1} Σ_{v∈Z^d} α_i(v) X²_{t−i}(u − v) + Σ_{i≥1} Σ_{v∈Z^d} β_i(v) σ²_{t−i}(u − v),    (2.1)

for t ∈ Z and u ∈ Z^d. The modelled process is {X_t(u)}, while {Z_t(u)} is a residual process and {σ_t(u)} is the volatility process. Let Δ_{1i} = {v ∈ Z^d : α_i(v) > 0} and Δ_{2i} = {v ∈ Z^d : β_i(v) > 0} for i ≥ 1. For i < p, the model allows for some zero-valued α_i(v), and likewise for the β's. The order of the model is defined as the largest (p, q) such that Δ_{1p} and Δ_{2q} are non-empty. With the order defined, the second part of (2.1) is expressed more specifically as

σ²_t(u) = ω + Σ_{i=1}^p Σ_{v∈Δ_{1i}} α_i(v) X²_{t−i}(u − v) + Σ_{i=1}^q Σ_{v∈Δ_{2i}} β_i(v) σ²_{t−i}(u − v).    (2.2)

We collect the non-zero parameter values, α = {α_i(v), v ∈ Δ_{1i}, i = 1, …, p} and β = {β_i(v), v ∈ Δ_{2i}, i = 1, …, q}, and write θ := (ω, α, β) for the parameter vector contained in the parameter space Θ, with the restriction that ω > 0.

Circular Spatio-Temporal GARCH
Let m = (m_1, …, m_d) ∈ Z^d_+ and 𝕊 = 𝕊(m) = Z^d/(mZ^d) be the quotient group of order m. We get the circular version of (2.1) by replacing the infinite spatial index part Z^d with 𝕊. In doing this we do not change the parameter functions, but restrict the model so that for u, v ∈ 𝕊, the difference (u − v) and the sum (u + v) are to be understood modulo m and are therefore also points in 𝕊. This means that the stochastic processes involved are indexed on Z × 𝕊. We will identify [0, m − 1] with 𝕊. For v ∈ Z^d, we use the notation (v|m) := v − ⌊v/m⌋ • m ∈ 𝕊, where • and / are elementwise (Hadamard) multiplication and division respectively, and ⌊⋅⌋ means rounding down to the nearest integer elementwise. We write v′ ∼ v if (v′|m) = (v|m). In the circular model, we replace X²_{t−i}(u − v) with X²_{t−i}(u − v|m), and likewise for the last part of (2.2). Thus, the circular model is given by

X_t(u) = σ_t(u) Z_t(u),
σ²_t(u) = ω + Σ_{i=1}^p Σ_{v∈Δ_{1i}} α_i(v) X²_{t−i}(u − v|m) + Σ_{i=1}^q Σ_{v∈Δ_{2i}} β_i(v) σ²_{t−i}(u − v|m),    (2.3)

for t ∈ Z and u ∈ 𝕊. Note that in general, Δ_{ki} ⊄ 𝕊. When the indexes are irrelevant, we refer to the process dummies X = {X_t(u), (t, u) ∈ Z × 𝕊}, and likewise for σ and Z. Some characteristics of the circular model are: (i) The circular model may be seen as a torus approximation of the infinite model.
(ii) The advantage of the circular model comes from the group structure of 𝕊, which retains stationarity in a beneficial way. (iii) The infinite model and the finite model are different, but share the same parameter vector and residual process. (iv) If m is not very small, a window of the infinite process confined to 𝕊 would be quite similar to the circularly defined process. The differences that may be seen are mainly at the boundaries of the rectangle. (v) Both models are designed to be strictly stationary. (vi) The circular model could be conceived as a mathematical construction to handle the boundaries that are generated from a finite window of the infinite model in a smooth way. (vii) An estimate of a circular model from data generated by an infinite model can be bias corrected to a meaningful estimate of the true model (cf. Section 5). (viii) Estimation obstacles created by the boundaries of finite samples from the infinite model are a main motivation for the circular model. In that sense, it could be viewed more as a method than an alternative model. On the other hand, a stationary solution of the circular model does not necessarily rely on a corresponding stationary solution of the infinite model.
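The modular projection (v|m) = v − ⌊v/m⌋ • m used above is simply elementwise reduction modulo m. A minimal sketch in Python with NumPy (the function name is ours):

```python
import numpy as np

def circ_index(v, m):
    """Project v in Z^d onto the torus Z^d/(m Z^d):
    (v | m) = v - floor(v / m) * m, elementwise."""
    v, m = np.asarray(v), np.asarray(m)
    return v - np.floor(v / m).astype(int) * m

# On a 4 x 4 torus, (5, -1) and (1, 3) are equivalent:
print(circ_index((5, -1), (4, 4)))  # -> [1 3]
```

For positive m this coincides with the elementwise remainder, so v′ ∼ v exactly when both project to the same torus point.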
In vector notation, let X_t = (X_t(u), u ∈ 𝕊)′, whose dimension is the number of spatial locations, and likewise for Z_t and σ²_t. We also need X²_t = X_t • X_t. Throughout the article, we use the convention that all vectors are column vectors.
In spatial statistics, a neighbourhood set is a collection of sites that influence a given point. There are two requirements of a neighbourhood set: a site cannot be its own neighbour, and neighbourhood relations are mutual. In spatio-temporal statistics with time-lagged nearest-neighbour dependence, the first condition is not necessary, because we do not have instantaneous spatial dependency. In fact, the opposite is encouraged. Hence, for our neighbourhood systems, which are collections of neighbourhood sets for different temporal lags, we only require that neighbour relations are mutual. In the following theorem, we need a condition on each Δ_ki to ensure that our neighbourhood systems fulfil this property.

A1:
Each parameter domain Δ_ki is symmetric in the sense that −Δ_ki = Δ_ki.

The neighbourhood systems for STGARCH are defined by N_ki(u) := u ⊖ Δ_ki. The spatial part of X that explicitly influences σ_t(u) is located at ∪_i N_1i(u) and, in parallel, the direct effect from previous spatial values of the volatility has sites ∪_i N_2i(u). We see that these systems are translation invariant.

Theorem 2.1. The circular version of the STGARCH model is a CCC-GARCH model of dimension m, with the a_i matrices given by (2.5), and likewise for the b_i's.

For the spatial neighbourhood systems, the neighbourhood system at Λ_1i is, for u ∈ 𝕊, u ⊖ Λ_1i. The translation invariance holds because the addition in the last equality is the group addition, since we consider a neighbourhood system on 𝕊.
It remains to show that neighbourhood relations are mutual.

Example 1. Consider 𝕊 with m = (4, 4) for d = 2. In Figure 1(b,c) we have visualized 𝕊 as an index set and a quotient group respectively, where the 16 points on the equidistant grid correspond to the points on the torus surface. The circular neighbourhood set is visualized on a larger area of interest, that is, [0, 6]², in Figure 1(a) for three different locations. We present two different ways of parametrizing: one for α and one for β. Considering a CSTGARCH(1,1) model, we need to specify α(v) and β(v). In the notation of (2.3) and Remark 2.1, let Δ = Δ_11 = Δ_21 = {(v_1, v_2) ∈ Z² : |v_i| ≤ 1, i = 1, 2}. It makes sense to have more or less symmetry in the spatial part of the model. We therefore let α(0) = a_0 and α(v) = a_1 for the eight neighbours, while β(0) = b_0, with β(v) = b_1 for the horizontal and vertical neighbours and β(v) = b_2 for the diagonal ones. The parameter vector θ = (ω, a_0, a_1, b_0, b_1, b_2)′ thus consists of in total 6 parameters. The parametrization of α gives equal weight to each neighbour (a_1) and another weight to the site of the observation (a_0). For β we have one parameter for the vertical and horizontal directions (b_1), one for all diagonal neighbours (b_2), and the point itself has its own weight (b_0). The specification of the model is illustrated in Figure 2. This is sometimes called a queen contiguity neighbourhood, as opposed to rook contiguity, named after the possible movements of the respective chess pieces.
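The six-parameter queen-contiguity specification can be written out directly. A sketch in Python (the function names are ours; the numeric weights, taken from the Figure 3 caption for a_0, a_1 and b_0, are illustrative):

```python
# Queen contiguity: Delta = {(v1, v2) in Z^2 : |v1| <= 1, |v2| <= 1}
delta = [(v1, v2) for v1 in (-1, 0, 1) for v2 in (-1, 0, 1)]

def alpha(v, a0, a1):
    # a0 at the site itself, equal weight a1 for all eight neighbours
    return a0 if v == (0, 0) else a1

def beta(v, b0, b1, b2):
    # b0 at the site, b1 for horizontal/vertical, b2 for diagonal neighbours
    if v == (0, 0):
        return b0
    return b1 if 0 in v else b2

# Total weights, which reappear as row sums in Remark 2.3:
row_sum_a = sum(alpha(v, 0.1, 0.4 / 8) for v in delta)       # a0 + 8*a1
row_sum_b = sum(beta(v, 0.06, 0.02, 0.01) for v in delta)    # b0 + 4*b1 + 4*b2
```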

We can also specify the model using (2.5). Let Λ = Λ_11 = Λ_21, since Δ_11 = Δ_21. In the vector notation of (2.4), σ²_t = ω1_m + AX²_{t−1} + Bσ²_{t−1}, where B is given by (2.6). The matrix A is obtained by setting b_0 = a_0 and b_1 = b_2 = a_1 in B. Notice that the matrix also illustrates the neighbourhood structure. We have ordered the rows and columns of B lexicographically according to the coordinates of 𝕊. The block matrix S represents relations between sites in the same row, while T represents relations between sites of adjacent rows.

Remark 2.3.
If m becomes larger, the order of A and B, which is m × m, increases. However, the row and column sums will remain the same, a 0 + 8a 1 and b 0 + 4b 1 + 4b 2 respectively, and the number of non-zero terms in each row is constant. Therefore, the sparsity of the matrix will increase with m. The example illustrates that the number of parameters does not depend on the actual size of the spatial region. We consider m fixed, but in contrast to other multivariate GARCH models, a larger m is beneficial.
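To illustrate the structure of B and Remark 2.3, a sketch (Python; our own helper, with illustrative b-values) that builds the full matrix for an m × m torus with the queen weights of Example 1 and checks the constant row sums and sparsity:

```python
import numpy as np

def build_B(m, b0, b1, b2):
    """Queen-contiguity parameter matrix on an m x m torus, rows and
    columns ordered lexicographically by site coordinates."""
    n = m * m
    B = np.zeros((n, n))
    for u1 in range(m):
        for u2 in range(m):
            for d1 in (-1, 0, 1):
                for d2 in (-1, 0, 1):
                    if (d1, d2) == (0, 0):
                        w = b0                      # the site itself
                    elif d1 == 0 or d2 == 0:
                        w = b1                      # horizontal/vertical
                    else:
                        w = b2                      # diagonal
                    B[u1 * m + u2, ((u1 + d1) % m) * m + (u2 + d2) % m] = w
    return B

B = build_B(4, 0.06, 0.02, 0.01)
# every row sums to b0 + 4*b1 + 4*b2 = 0.18, exactly 9 of 16 entries per
# row are non-zero, and the off-diagonal blocks repeat: B = circ(S, T, 0, T)
```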

Generalized Circulant and Stationary Structure
It turns out that the model has an interesting circulant algebraic structure which is neither obvious nor intended. Circulant matrices of order n are matrices that can be written in the form circ(c_0, c_1, …, c_{n−1}), where each row is a cyclic shift of the row above. A block circulant is a circulant block matrix whose blocks are again circulants. This is also called a circulant of level 2. In general, a circulant of level d is a block circulant whose blocks are circulants of level d − 1 (Davis, 1994, pp. 184-91). We will show that the parameter matrices A_i and B_i are circulant matrices for d = 1. Since we index the matrices on 𝕊², we use a circulant concept that does not depend on dimension.

S. HØLLELAND AND H. KARLSEN
Proof. For addition, it is obvious. For multiplication, let C = AB; the result then follows by direct computation. □

Proposition 2.2. Generalized circulants are stationary matrices.

For d = 2, A is a block circulant with circulant blocks, also called a circulant of level 2. This can be seen in (2.6) of Example 1, where S and T are circulant blocks of B, that is, B = circ(S, T, 0, T). For d ≥ 2, A is a block circulant whose blocks are circulants of level d − 1, when the rows and columns are ordered lexicographically, by an induction argument. □

Stationarity
For p, q ≥ 1, let Q_t be a square block matrix of order (p + q − 1), where I_k is a k × k identity matrix and 0 is a null matrix. Apart from the first block row, which carries the stochastic GARCH coefficients, the matrix Q_t consists of identity and null blocks, and c is ω1_m padded with null blocks. The same is true for p = q = 1, except that Q_t = [A_1 Z²_t + B_1]. Combining (2.7) and (2.8), we have that the CSTGARCH model can be expressed as a stochastic recurrence equation (SRE),

V_t = Q_t V_{t−1} + c.    (2.9)

Under an i.i.d. assumption on the residuals, this is the state space representation of the CSTGARCH model, since the sequence {(Q_t, c)} is i.i.d. and any reasonable solution {V_t} is Markov. From the SRE in (2.9), we formally get a solution by iterating the recurrence. When E log⁺‖Q_0‖ < ∞, the Lyapunov exponent γ is defined and satisfies

γ = lim_{t→∞} t⁻¹ log ‖Q_t Q_{t−1} ⋯ Q_1‖ a.s.,

for any matrix operator norm. The maximum absolute row sum, defined below, is a convenient choice in this context.
We will refer to the following assumptions, where 0 denotes the true parameter.

Remark 2.4. The condition E log⁺‖Q_0‖ < ∞ ensures, in particular, that the Lyapunov exponent is well defined.

Define the block matrix B by (2.10), where each block is m × m. When the model has an ergodic solution, this matrix will be the driving force for the vectorized form of the forthcoming likelihood process in Section 3.
Theorem 2.3. Let β̄_j = Σ_v β_j(v) and let λ_B be the spectral radius of B. Under A3: (iii) there is a unique adapted ergodic solution of (2.3); (iv) if also A2 is satisfied, then X ∈ L² and σ ∈ L².

Proof. By Bougerol and Picard (1992), (iii) is true, and by Berkes et al. (2003) we get (iv). For (ii), consider the block matrix of dimension q × 1 formed by stacking the volatility vectors. From the definition of B we see, by tracking what happens with the rows going from B^{k−1} to B^k when multiplying by B, that the required bound follows. From (2.11), a last matrix multiplication gives the result. □

Remark 2.5. Theorem 2.3 will also hold locally for all θ in a sufficiently small compact neighbourhood of the non-zero part of θ_0. We also see that λ_B < 1 is a necessary, but not sufficient, condition for the existence of a stationary solution of {X_t}.
As long as A3 is fulfilled, {X_t(u)} is a strictly stationary process, in the sense that for all n ≥ 1 the joint distributions are invariant to space-time translations, by Theorem 2.3(iii). As for univariate GARCH(p, q) models, a CSTGARCH(p, q) is weakly stationary if and only if E Z² is finite and (2.12) holds. This condition is much more easily verified, but it is somewhat more restrictive than A2-A3 with moment exponent equal to one. We will prove this fact next.
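The display (2.12) is elided in this extraction; assuming it is the standard GARCH-type condition that the total ARCH and GARCH weight is strictly below one (consistent with the check 9(a + b) = 0.846 < 1 used in the simulation section), a quick sketch:

```python
def weakly_stationary(alpha_weights, beta_weights):
    """Weak stationarity check, assuming (2.12) is the usual condition:
    the sum of all ARCH and GARCH weights is below one (with E Z^2 = 1)."""
    return sum(alpha_weights) + sum(beta_weights) < 1

# Example 1-style weights: a0 + 8*a1 together with b0 + 4*b1 + 4*b2
ok = weakly_stationary([0.1] + 8 * [0.4 / 8],
                       [0.06] + 4 * [0.02] + 4 * [0.01])  # 0.5 + 0.18 < 1
```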
By the row-sum norm, the claimed bound holds for k large enough. □

Remark 2.6. This is a replacement inequality: we can replace all stochastic terms in Q•_{t−j} by their respective expected values.
Proof. Denote the first row of G as the leading row. The interesting rows consist of the two sets [2, q] and [q + 2, r], which have only one non-zero block, equal to the unit matrix I_m. We consider G^k for k ≥ 1. Row q + 1 changes to the first row after one iteration. For each iteration, the two sets of interesting rows are reduced by one row, as their respective top row is removed, and the non-zero block in each of the remaining rows is shifted one step to the left. So for k = 2 these sets are [3, q] and [q + 3, r]. This process continues until both sets are empty, and that happens after exactly k = a steps. In this situation either row q or row r in G^k equals the leading row. During this sequence of iterations, all rows have been a leading row. Since ‖Σ_j G_ij‖ ≤ ‖I‖ for all i, any block row sum decreases with each iteration. Now, let k = ha,

Remark 2.7. The sufficient condition in Corollary 2.2 does not depend on m.

Spatio-temporal Dependency
Since CSTGARCH is a spatio-temporal model, the spatial dependence structure is particularly interesting. It is well known that a GARCH(p, q) process has an ARMA(p ∨ q, q) representation. Here we use such an ARMA representation for a CSTGARCH process to study the spatio-temporal dependence structure implied by the model. We derive the ARMA representation, find an expression for the autocovariance function for the squared process and illustrate the dependence structure for the CSTGARCH(1,1).
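The scalar special case conveys the idea behind the ARMA representation: with U_t = X²_t − σ²_t, a univariate GARCH(1,1) satisfies the ARMA(1,1) identity X²_t = ω + (α + β)X²_{t−1} + U_t − βU_{t−1} exactly. A quick numerical sanity check in Python (a toy scalar version, not the paper's multivariate derivation):

```python
import numpy as np

# Simulate a scalar GARCH(1,1) and verify the ARMA identity exactly.
rng = np.random.default_rng(3)
omega, alpha, beta = 0.2, 0.1, 0.8
s2, x2 = 1.0, 1.0
xs, us = [], []
for _ in range(500):
    s2 = omega + alpha * x2 + beta * s2        # volatility recursion
    x2 = s2 * rng.standard_normal() ** 2       # X_t^2 = sigma_t^2 * Z_t^2
    xs.append(x2)
    us.append(x2 - s2)                         # U_t = X_t^2 - sigma_t^2

lhs = np.array(xs[1:])
rhs = (omega + (alpha + beta) * np.array(xs[:-1])
       + np.array(us[1:]) - beta * np.array(us[:-1]))
# lhs and rhs agree up to floating-point rounding
```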
Theorem 2.5. Suppose that (2.12) holds and let r = p ∨ q. Then an extended spatio-temporal autocorrelation function exists and is given by (2.15).

Proof. It is well known that we can rewrite a CCC-GARCH model as a VARMA(r, q) model. This is a second-order stationary causal VARMA(r, q) model if the roots of the corresponding determinant of the characteristic matrix polynomial are strictly outside the unit circle and the residual process {U_t} has finite second-order moment (Brockwell and Davis, 1991, Theorem 11.3.1, p. 418). Moreover, by the same theorem, the filter is given by (2.15), with μ the expectation of X²_t. The root condition is implied by (2.12), and for the moment condition we proceed by assuming it holds; later on we will relax this assumption. Now, the covariance matrix of U_t is proportional to the unit matrix, and therefore the multivariate autocovariance function Γ for {X²_t} is given by (2.16). By definition, this gives the first part of (2.15). We see that (2.15) itself does not rely on the assumption of finite variance of X²_t. It just requires (2.12) to make sense, and for that purpose we use I_m for the covariance matrix of U_t. It remains to show that R(h) is a stationary matrix for any h in the sense of Definition 2.2. All matrices in (2.4) are stationary (Theorem 2.2), which in turn implies that all the coefficients in the VARMA(r, q) formulation are stationary (Proposition 2.1). Hence, {Ψ_j} is stationary. By the same argument, now used on (2.16), we see that Γ(h) is also stationary and therefore (2.14) holds. □

Remark 2.10. Alternatively to (2.15), we can use the multivariate Yule-Walker equations (Brockwell and Davis, 1991, 11.3.12, 11.3.15, pp. 419-20) to describe and compute R.

Figure 3. The ACF is given by (2.14) and the model specifications are the same as Example 1, with a_0 = 0.1, a_1 = 0.4/8, b_0 = 0.06. The scale parameter ω does not influence the correlation, thus it is not specified.
For h = 1, the nine dark blue squares are the nine neighbours. As the temporal lag increases, the correlation spreads out and fades. An interesting element is the low magnitude of the correlations, peaking around 0.13. This is most likely due to the high number of neighbours and the chosen parameters.

CONDITIONAL MAXIMUM QUASI-LIKELIHOOD ESTIMATION
In the first part of this section, we introduce a version of the volatility process that is defined by a free parameter. With this framework established, we derive the conditional quasi-likelihood.
Let θ_0 be the true parameter vector which has generated the {X_t} process and the not directly observable squared volatility process {σ²_t(u)}. Related to the latter is {h_t(u, θ), θ ∈ Θ}, which we call the likelihood process. It extends the squared volatility process {σ²_t(u)} to a function-valued process, defined equivalently to (2.4) in (3.1). By the forthcoming assumptions A4 and A5, it will become clear that {h_t(u, ⋅)} is a spatio-temporal process with values in C[Θ, R], the space of real continuous functions defined on Θ, uniquely defined by (3.1).
We vectorize (3.1) with B defined in (2.10) and the mq × 1 vector processes H_t and C_t from (3.1)-(3.3). If q ∈ {0, 1}, let B = B_11 = B_1, C_t = D_t and H_t = h_t, with B_1 = 0 for q = 0. Note that we have suppressed the dependency on θ, but B, H_t and C_t depend on θ. From (2.10) and (3.1)-(3.3), we get a first order SRE for the likelihood process, given by (3.4). Note that H_t(θ_0) = S_{t−1} in the context of (2.8). The special case q = 0 gives h_t = D_t in (3.4), and h_t is then fully observable from the observations. Most of what we discuss here is only relevant when q > 0, namely when the model is not a pure ARCH. An important concept for SREs is convergence with exponential rate: we write X_n → 0 e.a.s., or X_n = o(1) e.a.s., if X_n = O(aⁿ) a.s. for some fixed a ∈ (0, 1).
By carrying out the recurrence in (3.4), we get (3.5). The infinite sum in (3.5) converges e.a.s. uniformly with respect to θ contained in any compact subset of the neighbourhood described in Remark 2.5. To make sure that this holds globally on Θ, we need the forthcoming assumptions A4 and A5. We also extend Definition 2.3.

Definition 3.2. Define ĥ_t by the recursion (3.6), with the unavailable terms replaced by initial values, where {x̃_t} and {h̃_t} are the initial values. The theoretical counterpart of this definition is (3.3). Note that D̂_t ≡ D_t for t ≥ p + 1.

Remark 3.2.
The initial values represent fixed values that are different from the observations. However, our subsequently derived results allow h̃_t to be stochastic as long as E‖h̃_t‖_Θ^s < ∞ for some s > 0. The same applies to x̃²_t.
In Proposition 3.1 we prove that the difference between the empirical ĥ_t and the stationary process h_t converges e.a.s. to zero uniformly on Θ. This implies that the effect of the initial values is asymptotically negligible.
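The mechanism can be seen in a scalar caricature of the SRE (3.4): two copies of the recursion h_t = b h_{t−1} + c_t started from different initial values differ by exactly b^t times the initial gap, which vanishes e.a.s. when |b| < 1. (A toy sketch, not the actual likelihood process.)

```python
import numpy as np

rng = np.random.default_rng(0)
b, h1, h2 = 0.7, 0.0, 50.0    # same recursion, different initial values
gaps = []
for t in range(60):
    c = rng.random()           # shared innovation term
    h1, h2 = b * h1 + c, b * h2 + c
    gaps.append(abs(h1 - h2))
# gaps[t] equals 50 * 0.7**(t + 1): geometric, i.e. e.a.s., decay
```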
Then for t ∈ [p + 1, n], we have that Ĉ_t = C_t, and (3.7) holds. Using the triangle inequality on (3.7), we obtain the bound in (3.8). The other terms are finite with probability one, so (3.8) must go to zero at an exponential rate as t → ∞. □

Conditional Quasi-likelihood
Assuming that Z is a spatio-temporal sequence of i.i.d. standard normally distributed variables opens the door to Gaussian quasi-likelihood estimation. If the distribution of Z is truly standard normal, what we derive here will be the true conditional likelihood. If not, we call it quasi-likelihood.
The density of X_n := [X_1 … X_n] conditional on W_0 can be written as a product of successive conditional densities. By using the first part of (2.4), we see that each factor can be expressed through the density of Z, where the vector division is a Hadamard one. Given W_{k−1}, σ²_k is successively given by the second part of (2.4).
We get the quasi-likelihood from the conditional simultaneous density with f_Z as the standard normal density, and where the empirical likelihood process ĥ replaces σ². Taking the logarithm of this likelihood gives L̂_n in (3.9), with the processes X̃ and ĥ given by (3.6).
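The explicit form of (3.9) is lost in this extraction; up to an additive constant, the Gaussian quasi log-likelihood is the familiar sum of −(log h + x²/h)/2 over all space-time points. A sketch (Python; the function name is ours):

```python
import numpy as np

def quasi_loglik(x, h):
    """Gaussian quasi log-likelihood, up to the additive constant
    -0.5*log(2*pi) per observation: x holds observations X_t(u), h the
    matching values of the likelihood process h_t(u, theta)."""
    x, h = np.asarray(x, float), np.asarray(h, float)
    return -0.5 * np.sum(np.log(h) + x ** 2 / h)
```

Maximizing this over θ (with h computed from the recursion above) gives the quasi-maximum-likelihood estimator.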
Remark 3.3. In (3.9) we have denoted ℓ̂_t(u, θ), ĥ_t(u, θ) and X̃_t(u) without their respective space-time locations (t, u) and parameter inputs. This works since the involved processes are simultaneously strictly stationary and the actual expression does not depend on a particular point (t, u). We will use this convention when considering an arbitrary variable, but also when we discuss the processes as one unit, for example, ℓ̂, ĥ and X̃.
We use the previously specified ĥ as the empirical counterpart of h. Substituting h for ĥ and X for X̃ defines the theoretical likelihood L_n and its terms ℓ_t. From Proposition 3.1 we have that ‖ĥ_t(u) − h_t(u)‖_Θ converges e.a.s. to zero. Therefore, L̂_n is usable as an approximation of the theoretical log-likelihood function, as the notation indicates. The two likelihoods have their respective maximum likelihood estimators θ̂_n and θ̃_n.

LARGE SAMPLE PROPERTIES
Here we present asymptotic results for the maximum quasi-likelihood estimator, both consistency and asymptotic normality, under certain regularity conditions. Proofs are found in Section 8.
The following list of assumptions will be used.

A6:
The projections Δ_ki ⇝ Λ_ki into 𝕊 preserve cardinality.

A7:
At θ_0, 𝒜 and ℬ are left coprime and either [A_p | B_q] or [A′_p | B′_q] has full rank.

A8:
The squared residual process Z² is non-degenerate with expectation 1.

A9:
The interior of Θ, as a subset of the Euclidean space, contains θ_0.

A10:
The variance of the squared residual process, ζ := Var Z², is finite.
For consistency we need A2-A8, while in addition A9 and A10 are needed for the central limit theorem.
These assumptions are similar to those of the univariate GARCH literature, especially Berkes et al. (2003), Francq and Zakoïan (2004) and Straumann and Mikosch (2006). Our assumptions are also related to Francq and Zakoïan (2011, Ch. 11) for the CCC-GARCH, but the comparison to univariate theory is more appropriate (see Remark 3.4) perhaps with the exception of identifiability.
The conditions A2 and A3 ensure stationarity and ergodicity of the process, while A4 and A5 ensure the global existence of the likelihood process on Θ. The compactness of Θ in A5 is also needed for the maximization of the likelihood. The identifiability of the model follows from A6 and A7. As discussed in Remark 2.2, the dimension of the parameter space will necessarily be reduced unless A6 is satisfied. The coprime part of assumption A7 is not enough to guarantee identifiability. Identifiability means here that 𝒜⁻¹(θ, z)ℬ(θ, z) = 𝒜⁻¹(θ_0, z)ℬ(θ_0, z) does not have any solution in Θ except θ ≡ θ_0. With 𝒜 and ℬ coprime, additional conditions must ensure that the only possible unimodular matrix, as a common factor, is the identity matrix. The full rank assumption is easily confirmed after estimation, but can in some cases also be confirmed a priori. A sufficient condition is that for any pair of corresponding eigenvalues of the two circulant matrices A_p and B_q, at least one of them is nonzero. The condition for identifiability of CCC-GARCH models corresponds to A7 (Francq and Zakoïan, 2011, p. 295), but it is quite possible that A7 can be simplified since our model is not really multivariate. The non-degeneracy in A8 is clearly necessary; otherwise, X² will be constant and the model degenerates.
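The eigenvalue-based sufficient condition for A7 is cheap to check, because the eigenvalues of a circulant are the discrete Fourier transform of its first row (up to ordering). A sketch with hypothetical first rows (Python):

```python
import numpy as np

def circulant_eigs(first_row):
    """Eigenvalues of circ(first_row): the DFT of the first row."""
    return np.fft.fft(first_row)

# hypothetical first rows of A_p and B_q on a cycle of length 4
eig_a = circulant_eigs([0.10, 0.05, 0.00, 0.05])
eig_b = circulant_eigs([0.06, 0.02, 0.00, 0.02])

# sufficient condition: for every index, at least one of the two
# corresponding eigenvalues is non-zero
a7_ok = bool(np.all((np.abs(eig_a) > 1e-12) | (np.abs(eig_b) > 1e-12)))
```

Here eig_a has a zero eigenvalue (at index 2), but the corresponding eigenvalue of eig_b is non-zero, so the pair passes.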
Condition A9 is also necessary, because if for instance α_1(v) = 0 for some v ∈ Δ_11, then N^{1/2}(α̂_1(v) − α_1(v)), with N := N(n) = nm, cannot be normal with zero expectation, because α̂_1(v) can only take non-negative values. Berkes et al. (2003) assumed A9 for consistency as well, but like Francq and Zakoïan (2004, 2011) and Straumann and Mikosch (2006), we avoid that. A10 is necessary for the finiteness of the asymptotic covariance matrix.
Remark 4.1. The condition A6 means that each Δ ki can be contained in [0, m − 1] by a translation.
With the assumptions above, we can present the asymptotic results. The first being consistency and the second asymptotic normality of the QMLE. Convergence in distribution of X n to X is written as X n ⇒ X.

Remark 4.2.
When the residual process is Gaussian, the constant in (4.1) equals one and I_0 is the Fisher information matrix. Note that I_0 depends only on the marginal distribution of (h_0, ∇h_0) and thus it is not influenced by the squared residual distribution beyond its first two moments.
Although it is not quite clear from the expression, the matrix I_0 depends on m, but the number of model parameters does not, though the model does.

Remark 4.3.
Under A2-A10, θ̂_n will eventually satisfy the likelihood equations ∇L̂_n = 0, since θ̂_n is consistent for the interior point θ_0 and L̂_n is smooth and takes its maximum in an open ball around θ_0. The same relation is of course true for θ̃_n and ∇L_n.

SIMULATION EXPERIMENTS
We conduct a simulation experiment to see how well the estimation procedures described in previous sections perform on finite samples. We will consider both circular and non-circular data.
Based on circular data, we should get consistent and asymptotically normally distributed estimates θ̂_n, according to Theorems 4.1 and 4.2. On non-circular data, however, the circular estimator will be biased due to the projection of the neighbourhoods, but this bias may be compensated by a parametric bootstrap bias correction (PBBC). For this procedure to be successful, we need that E_{θ_0}(θ̂_n − θ_0) ≈ Ê_n(θ* − θ̂_n), where θ* denotes the mean estimate of the parametric bootstrap samples.
The parameters are estimated by maximizing the likelihood L̂_n. The case d = 1 is not so interesting since we know that the circular approximation will perform well. We therefore focus on d = 2. The model we consider is the same as in Example 1, with a_0 = a_1 = a and b_0 = b_1 = b_2 = b. Thus, we only have three parameters, ω, a and b, giving equal weight to all nine members of the neighbourhood. The true parameter vector that generates all the datasets is θ_0 = (0.31, 0.024, 0.070), and we can write the models in the vector notation of (2.4) with A_1 = aW and B_1 = bW, where W is a neighbourhood matrix. The independent innovations are sampled from a standard normal distribution. We use 200 bootstrap replicates in the bias correction, 500 Monte Carlo repetitions and three different sample sizes. The performance is primarily evaluated by the mean square error (MSE) of the estimates, but we also report the bias, standard deviation, the squared bias divided by the variance, and marginal and simultaneous coverage. MSE can be decomposed into the sum of squared bias and variance. The squared bias to variance ratio is therefore useful to better understand what drives the development in MSE. Coverage means the proportion of the 500 Monte Carlo simulations where the 95% confidence interval based on the parameter estimate contains the true parameter. The simultaneous coverage is found by checking an ellipsoid condition, where Σ̂ is the Monte Carlo approximated covariance matrix of θ̂_n, while the marginal coverage checks whether |θ̂_{n,i} − θ_{0,i}|/ŜD(θ̂_{n,i}) ≤ z_{0.975} ≈ 1.96 for every parameter estimator θ̂_{n,i}, where ŜD(θ̂_{n,i}) is the square root of the corresponding diagonal element of Σ̂. For comparison purposes, we use Monte Carlo estimates for the covariance matrices. We could use an approximation of the information matrix in (4.1), but this will only be correct in the circular case, where the Monte Carlo estimate and the information matrix approximation give similar results.
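A sketch of how such a circular dataset can be simulated (Python with NumPy; our own minimal implementation, using equal weights a and b over the nine torus neighbours as in the experiment):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_cstgarch(m, T, omega, a, b, burn=200):
    """Simulate a circular STGARCH(1,1) on an m x m torus with equal
    queen-neighbourhood weights a (ARCH) and b (GARCH)."""
    x2 = np.full((m, m), omega / (1 - 9 * (a + b)))  # start at E[X^2]
    s2 = x2.copy()
    out = np.empty((T, m, m))
    for t in range(T + burn):
        # torus sums of X^2 and sigma^2 over each site's nine neighbours
        nx = sum(np.roll(np.roll(x2, i, 0), j, 1)
                 for i in (-1, 0, 1) for j in (-1, 0, 1))
        ns = sum(np.roll(np.roll(s2, i, 0), j, 1)
                 for i in (-1, 0, 1) for j in (-1, 0, 1))
        s2 = omega + a * nx + b * ns              # volatility recursion
        x = np.sqrt(s2) * rng.standard_normal((m, m))
        x2 = x * x
        if t >= burn:
            out[t - burn] = x
    return out

x = simulate_cstgarch(10, 400, 0.31, 0.024, 0.070)
# marginal sd should be near (omega / (1 - 9*(a + b)))**0.5
```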
We will refer to circular estimates of circular data by CC, circular estimates of non-circular data by CNC and the parametric bootstrap bias corrected CNC by PBBC. standard deviation in parenthesis.
In Table I, the blue and red numbers are the most and least optimal values respectively, for each parameter (ω, a, b) in each row. This colour coding makes it quite clear that the circular model is best on circular data, which should be no surprise to anyone. As expected, the absolute bias, the standard deviation and the MSE go down as the spatial sample size increases in almost all cases. The exception is in the PBBC, where the bias increases slightly from spatial dimension 5×5 to 10×10. The standard deviations of the circular estimates are about the same based on circular and non-circular data, and higher for the PBBC estimates. For non-circular data it is interesting to compare the pre- and post-PBBC estimates. In all cases, the PBBC substantially reduces the bias of the original estimate. With one exception in the smallest sample size, this leads to a smaller MSE. For CC, the squared bias goes down more slowly than the variance for two of the parameters, while for the third the bias is practically zero in all cases. For the CNC estimates, the squared bias goes down faster than the variance for one parameter; for another, the small standard deviation and the large bias evidently inflate the ratio, while for the third the relationship between squared bias and variance is quite stable. In the PBBC case, we find all the blue numbers for two of the parameters, with ratios close to zero; for the third we see a reduction in the squared bias to variance ratio as the sample size increases. For the coverage, we are quite pleased with getting results around 95% for the CC and PBBC estimates.
The kernel density estimates in Figure 4 are green for the CC estimator, blue for CNC and orange for the bias corrected estimator. The red dashed vertical lines indicate the correct value of each parameter. By visual inspection, the circular estimator on circular data is centred around θ_0 with decreasing spread as the spatial sample size increases. For ω, it seems that the CNC is moving towards the correct value. One may think that the constant term should not be influenced by the circular assumption, but due to strong correlation with the other estimators it is. In fact, the unconditional standard deviation exists, since (2.12) is fulfilled under θ_0, that is, 9(a + b) = 0.846 < 1, and E(σ²_t)^{1/2} = (ω/(1 − 9a − 9b))^{1/2} = 1.419 under θ_0. The table shows that this quantity is, to some surprise, preserved in all the different simulations with high accuracy. For the most biased parameter's column in Figure 4, we see why the coverage in Table I is so low for the CNC: the bell curve does not even touch the red line, although it seems to be moving towards it with increasing sample sizes. The effect of the bias correction is largest in this column, where the improvement in location is substantial and the cost in terms of higher variance is definitely worth its price. The remaining column is not as influenced by the circular assumption; especially for the largest sample, there is very little difference pre- and post-PBBC. This experiment shows that the estimation of circular models is meaningful for circular and non-circular data, but in the latter case it is important to include a parametric bootstrap bias correction step if one wants to reproduce the non-circular parameters.
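The two numerical claims above follow directly from θ_0 = (0.31, 0.024, 0.070):

```python
omega, a, b = 0.31, 0.024, 0.070      # theta_0 = (omega, a, b)
persistence = 9 * (a + b)             # nine equally weighted neighbours
sd = (omega / (1 - persistence)) ** 0.5
print(round(persistence, 3), round(sd, 3))  # -> 0.846 1.419
```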

REAL DATA EXAMPLE: SEA SURFACE TEMPERATURE ANOMALIES
The data we consider have been studied in great detail by Berliner et al. (2000) and used as an example by Wikle and Hooten (2010), Cressie and Wikle (2011) and Wikle and Holan (2011), amongst others. These are monthly averaged sea surface temperature (SST) anomalies from January 1970 to March 2003, measured on a 2° × 2° resolution grid in the tropical Pacific Ocean. This is not intended to be a comprehensive example, but merely an illustration of how one can implement CSTGARCH on real non-circular data. In this regard, the model is mainly descriptive here. However, the GARCH part could be used to improve prediction intervals by volatility forecasting. The data are available for download as supplementary material to the book by Cressie and Wikle (2011) (link at the end).
We choose a rectangular area without observations over land (see Figure 5) and reduce the data to a 4° × 4° grid by mean aggregation. This reduction in sample size is to make the computations less demanding. Finally, we spatially difference the data to obtain a stationary series. That is, letting {Y_t(u_1, u_2)} denote the observations, we analyse X_t(u_1, u_2) = ∇_1∇_2Y_t(u_1, u_2), where ∇_1 and ∇_2 denote the spatial difference operators in the two spatial dimensions, respectively. The data we are left with are of dimension 20 × 14 × 399 (longitude × latitude × time), or 111720 data points, with longitudes from 165°E to 241°E (119°W) and latitudes from 24°S to 28°N. The differenced data are correlated and show signs of a 12-month season (see Figure 7b), so we fit a circular spatio-temporal ARMA (CSTARMA) model using the starma R-package (Cheysson, 2016). Adapting the definition of Pfeifer and Deutsch (1980), the CSTARMA model is

X_t = ∑_{i=1}^{p} ∑_{k=0}^{λ_i} φ_{ik} W^{(k)} X_{t−i} + ε_t − ∑_{j=1}^{q} ∑_{k=0}^{m_j} θ_{jk} W^{(k)} ε_{t−j}, (6.1)

where W^{(k)} is a circular queen-contiguity neighbourhood matrix, characterized by the movement pattern of a chess queen, with k as the spatial lag. The construction of such a circular neighbourhood matrix for a regular grid is implemented in the spdep package (Bivand and Wong, 2018). Here, the spatial lag refers to how many tiles the chess queen moves over and is illustrated in Figure 6a. We use backward-stepwise model selection; that is, we start with temporal order 12 and spatial order 2 and fit the model (6.1) with (12 × 2 × 3 =) 72 parameters. We then gradually remove the least significant parameter and re-estimate the model, until only one parameter remains.
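For intuition, a circular queen-contiguity matrix of spatial lag k can be built directly by wrapping the Chebyshev distance around the torus. The following sketch uses plain numpy rather than the spdep implementation referenced in the text:

```python
import numpy as np

def circular_queen_W(n_rows, n_cols, k=1):
    """Binary lag-k queen contiguity matrix on an n_rows x n_cols torus.

    Site (i, j) is a lag-k queen neighbour of (ii, jj) if the toroidal
    (wrap-around) Chebyshev distance between them is exactly k.
    """
    N = n_rows * n_cols
    W = np.zeros((N, N))
    for i in range(n_rows):
        for j in range(n_cols):
            for ii in range(n_rows):
                for jj in range(n_cols):
                    di = min(abs(i - ii), n_rows - abs(i - ii))
                    dj = min(abs(j - jj), n_cols - abs(j - jj))
                    if max(di, dj) == k:
                        W[i * n_cols + j, ii * n_cols + jj] = 1.0
    return W

W1 = circular_queen_W(5, 5, k=1)
# On a torus every site has exactly 8 lag-1 queen neighbours,
# so every row sum equals 8 and there is no boundary effect.
```

The constant row sums are exactly the translation invariance that the circular construction buys: no site is special, so no boundary adjustments are needed.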
We then have 72 model candidates and choose the one that minimizes Akaike's information criterion (AIC). This model, which also minimizes the BIC, has only nine parameters, and the largest p-value is of order 10⁻¹⁹. The nine parameter estimates with their corresponding standard deviations are given in the left column of Table II.
The residuals of the CSTARMA model are modelled using a CSTGARCH(1, 1) model with one spatial lag, corresponding to the one used in Example 1. We allow for different parameters in the different directions, and it turns out that for the GARCH part a rook-contiguity neighbourhood in the longitudinal direction only is sufficient, while for the ARCH part a queen neighbourhood fits best. The model specification we end up choosing is visualized in Figure 6(b,c), corresponding to the illustration used in Figure 2. The CSTGARCH parameter estimates are presented in the right column of Table II. The empirical variance of the data is ≈ 0.025, which fits well with ω̂∕{1 − ∑_j (α̂_j + β̂_j)} ≈ 0.027. Plotting four-dimensional processes is difficult, and animations of all the fitted spatio-temporal series Y, X = Y − Ŷ, ĥ and Z = X∕ĥ (cf. Remark 3.3) are therefore given as supplementary material (details at the end). In these animations, notice how the fitted conditional volatility clusters in space and time. The spatio-temporal autocorrelations of Y, X, X² and Z are plotted in Figure 7b. Here we see that the original data and the squared fitted GARCH process are correlated with points close in space and time, while the GARCH process and the standardized residuals are not. This relates to one of the stylized facts about GARCH processes: little autocorrelation in the process itself, yet profound correlation in the squared process. Comparing the autocorrelation of X² to that of Z, the model seems to fit well. The empirical distribution of the standardized residuals is close to the standard normal (see Figure 7a). We have not performed any formal correlation tests here, since with such a large sample size even negligible residual correlation would likely be statistically significant, so we would expect a test of uncorrelated residuals to reject.
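As a sanity check of the variance identity used above, E(X²) = ω∕{1 − ∑_j(α_j + β_j)}, one can simulate a small circular STGARCH on a torus and compare the empirical second moment with the formula. The sketch below uses our own toy configuration (a 6 × 6 torus with a rook neighbourhood that includes the site itself, so five sites enter each sum), not the fitted model from Table II:

```python
import numpy as np

def nbr_sum(A):
    # Sum over the site itself and its four rook neighbours on the torus
    return (A + np.roll(A, 1, 0) + np.roll(A, -1, 0)
              + np.roll(A, 1, 1) + np.roll(A, -1, 1))

def simulate_cstgarch(m=6, T=20000, omega=0.1, alpha=0.05, beta=0.1, seed=0):
    """Toy circular STGARCH(1,1): h_t = omega + alpha*S(x_{t-1}^2) + beta*S(h_{t-1}),
    where S sums over 5 sites, so stationarity requires 5*(alpha + beta) < 1."""
    rng = np.random.default_rng(seed)
    ex2 = omega / (1 - 5 * (alpha + beta))   # stationary E(X_t(u)^2)
    h = np.full((m, m), ex2)                 # start at the stationary mean
    x = np.sqrt(h) * rng.standard_normal((m, m))
    x2_sum = 0.0
    for _ in range(T):
        h = omega + alpha * nbr_sum(x ** 2) + beta * nbr_sum(h)
        x = np.sqrt(h) * rng.standard_normal((m, m))
        x2_sum += (x ** 2).mean()
    return x2_sum / T, ex2

emp, theo = simulate_cstgarch()
# emp should settle close to theo = 0.1 / (1 - 0.75) = 0.4
```

Because the boundaries are tied together, no edge sites need special treatment in the recursion.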

CONCLUDING REMARKS
Finite-area models are common in spatial statistics, and wrapping a spatial model onto a toroidal surface is a known strategy for forming spatially stationary models in this field. The construction of a circular model is not limited to GARCH models and can be applied to other spatial and spatio-temporal models. The circular assumption is fruitful for calculations, simulation and asymptotic theory. It is also crucial for having an explicit, efficient likelihood with a fixed temporal boundary that does not depend on the sample size.
We consider the CSTGARCH as a model in its own right, but it can certainly also be used as an approximation of a non-circular situation. Estimating an STGARCH model leads to a boundary issue due to unobserved sites outside the area of interest, and our simulation experiment indicates that a bias-corrected circular approximation is a viable alternative. The upside of the circular method is that it utilizes all data points while the spatial boundaries disappear completely. The circular processes are Markov, which is not the case for other ways of approximating an STGARCH. The downside is its misspecification, most accentuated near the boundaries, if the true model is not circular.
The boundary problem of spatial and spatio-temporal processes is well known, and different approaches have been proposed. Guyon (1982) showed that the edge effect goes to zero like (nm)^{−1∕(d+1)} in our notation, which is slower than the parametric rate (nm)^{−1∕2} whenever d + 1 > 2; it is therefore necessary to handle the edge effect in a proper way. For dealing with the boundary issue in spatial processes, Cressie (2015, p. 422) mentioned integrating out the unobserved data from the conditioning event, but warned that this might lead to a complicated likelihood. We clearly find his warning justified for the STGARCH model. Another of his suggestions is to form a guard area inside the perimeter of the area of interest, where observations contribute to the likelihood only through their neighbourhood relations with internal sites. The volatility process must be estimated within the guard area, which means the guard area has to be quite wide and the practitioner must set boundary values for the volatility at every time point. For a pure STARCH model, the guard-area approach is an alternative, and a possible next step is to approximate an STGARCH with an STARCH. However, this procedure inevitably leads to biased estimates and sacrifices a significant proportion of the observations, depending on the sample size. Estimation of GARCH models is infamous for requiring large samples, and we cannot afford to lose too much data. The circular model leaves none behind.
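The sacrifice implied by a guard area is easy to quantify: with guard width w on an n × m grid, only the (n − 2w)(m − 2w) interior sites remain as response observations. A small sketch (the grid size matches the SST example's spatial dimension; the guard width is our own illustrative choice):

```python
def guard_area_loss(n_rows, n_cols, width):
    """Fraction of grid sites sacrificed by a guard area of the given width.

    Sites within `width` of any edge contribute only through their
    neighbourhood relations, not as response observations.
    """
    interior = max(n_rows - 2 * width, 0) * max(n_cols - 2 * width, 0)
    return 1 - interior / (n_rows * n_cols)

# On a 20 x 14 grid, even a guard of width 2 discards
# 1 - (16 * 10)/(20 * 14), i.e. about 43% of the sites,
# whereas the circular model uses every site.
loss = guard_area_loss(20, 14, 2)
```

For a GARCH-type model the guard would typically need to be wider still, since the volatility itself must be pinned down inside it, making the loss even more severe.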

PROOFS
The main structure of the proofs for Theorems 4.1 and 4.2 is an adaptation of established theory. In particular we are influenced by Straumann and Mikosch (2006) and Francq and Zakoïan (2004). There are also some new elements.
The circular projection, as we have seen, turns an STGARCH model into a CCC-GARCH with a specific circular neighbourhood structure (Theorem 2.1). Asymptotic theory exists for CCC-GARCH models (Ling and McAleer, 2003; Francq and Zakoïan, 2011, pp. 289–307), but the stationary spatio-temporal context here is quite far from that framework. The following proofs of consistency and asymptotic normality are adapted to the current context and therefore considerably simplified in comparison with the multivariate case. The proofs are relevant for further work on GARCH models in the space-time domain, with and without the circular assumption. In particular, in a working paper by Karlsen and Hølleland (2019), these proofs are used as a framework for the likelihood estimation.

Proof of Consistency
Throughout this subsection, we assume that A2-A8 hold.
The subscript 0 refers to θ₀ for quantities depending on θ, for example, h₀ = h(θ₀). An arbitrary but fixed point (t, u) is suppressed whenever that is convenient (cf. Remark 3.3). We use ω̲ = min_Θ ω(θ), so that h ≥ ω̲. The total number of observations is N = nm, with n and m the number of time and spatial points, respectively. Proof. Due to A6, the VARMA form of (3.1) is fully parametrized and therefore potentially identifiable. The conditions are therefore in terms of the associated matrix polynomials. By A4 the autoregressive polynomial is causal, and A7 states the coprime property. Since A_p and B_q are circulants of level d (Theorem 2.2) and simultaneously diagonalizable (Davis, 1994, Thm. 5.8.4), the two alternative conditions in A7 are equivalent, and the full-rank requirement on [A_p, B_q] is satisfied. These three properties guarantee identifiability (Reinsel, 2003, p. 37), so that L(θ) has a unique global maximum at θ₀, that is, L(θ) < L(θ₀) for θ ∈ Θ∖{θ₀}.
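The simultaneous diagonalizability invoked here rests on the classical fact that every circulant is diagonalized by the discrete Fourier transform matrix, and hence any two circulants of the same size commute. A quick numerical illustration for ordinary (level-1) circulants:

```python
import numpy as np

def circulant(c):
    # Circulant matrix with first column c
    n = len(c)
    return np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])

n = 6
rng = np.random.default_rng(0)
A = circulant(rng.normal(size=n))
B = circulant(rng.normal(size=n))

# The unitary DFT matrix diagonalizes every n x n circulant
F = np.exp(-2j * np.pi * np.outer(np.arange(n), np.arange(n)) / n) / np.sqrt(n)
DA = F @ A @ F.conj().T     # diagonal up to round-off
DB = F @ B @ F.conj().T

off_A = np.abs(DA - np.diag(np.diag(DA))).max()   # ~ 0
off_B = np.abs(DB - np.diag(np.diag(DB))).max()   # ~ 0
commute = np.abs(A @ B - B @ A).max()             # circulants commute
```

Block circulants of level d are diagonalized analogously by Kronecker products of DFT matrices, which is what makes the two conditions in A7 interchangeable.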

Proof.
The proof is an adapted version of an argument by Straumann and Mikosch (2006). By the mean value theorem, |log y − log x| ≤ min(y, x)^{−1}|y − x|.
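The mean value theorem bound |log y − log x| ≤ min(y, x)^{−1}|y − x| holds because (log t)′ = 1∕t ≤ 1∕min(x, y) on the segment between x and y. A brute-force numerical check:

```python
import math
import random

# Verify |log y - log x| <= |y - x| / min(x, y) on random positive pairs
random.seed(0)
for _ in range(1000):
    x = random.uniform(0.01, 10.0)
    y = random.uniform(0.01, 10.0)
    lhs = abs(math.log(y) - math.log(x))
    rhs = abs(y - x) / min(x, y)
    assert lhs <= rhs + 1e-12
```

In the proof this is applied with x and y bounded below by ω̲-type constants, which is exactly why the lower bound on the volatility matters.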

Consistency by the Upper Semicontinuous Framework
The following important result is a device for handling the possibility of non-uniform convergence.

Proof of the CLT
Throughout this subsection, we assume A2–A10. Let B_η(θ) be an open ball with centre θ ∈ Θ and radius η > 0. When A9 holds, let B₀ = B₀(η) = B_η(θ₀) be contained in Θ for some η > 0, and let B̄₀ be the closure of this open ball. We use the same symbol both for a component of the parameter vector and for its reference index; as an index, we write ν ∈ θ, and for multiple indexes ν = (ν₁, …, ν_k) ∈ θ^k. The following properties of the model, inherent in the assumptions, are important input for the CLT proof.
(i) There exists a closed ball with centre θ₀ in the parameter space Θ; without loss of generality we can assume that the radius is one, B̄₀(1) ⊆ Θ.
(ii) When we consider a fixed outcome outside a fixed null set, both θ̂_n and θ̃_n converge to θ₀. Hence, for any η > 0, both estimators are inside B₀(η) and satisfy their respective log-likelihood equations for all n large enough.
(iii) The components of θ are uniformly bounded in both directions: with c = ‖ν‖_{B̄₀} ∨ ‖1∕ν‖_{B̄₀}, we have 0 < c^{−1} ≤ ν ≤ c for any component ν ∈ θ on B̄₀.
(iv) The elements b^{(k)} of B^k decrease at an exponential rate to zero as k goes to infinity: b^{(k)} ≤ ρ^{k−q} with ρ = ρ_B^{1∕q}, where ρ_B is the spectral radius of B.
The proof goes along relatively standard lines. The starting point is a Taylor expansion of the theoretical log-likelihood.
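Property (iv) reflects the standard fact (Gelfand's formula) that the elements of B^k decay geometrically at a rate governed by the spectral radius of B. A numerical illustration with a random nonnegative matrix scaled to spectral radius 0.8 (our own toy matrix, not the model's B):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.uniform(0, 1, size=(5, 5))
B = 0.8 * M / np.abs(np.linalg.eigvals(M)).max()   # scale to spectral radius 0.8

rho = np.abs(np.linalg.eigvals(B)).max()           # = 0.8 by construction
ks = (10, 20, 40)
norms = [np.abs(np.linalg.matrix_power(B, k)).max() for k in ks]
rates = [n ** (1 / k) for n, k in zip(norms, ks)]
# Gelfand: the max-element norm of B^k decays like rho^k, so rates -> rho
```

With ρ_B < 1, the exponentially decaying tail is what lets the infinite-past volatility recursion be truncated with negligible error.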
Proof. Let Δ_n = θ̃_n − θ₀. We use an integrated mean value theorem for the vector-valued multivariate ∇L_n:

∇L_n(θ̃_n) = ∇L_n(θ₀) + [∫₀¹ ∇²L_n(θ₀ + sΔ_n) ds] Δ_n = ∇L_n(θ₀) + (J_n + R_n)Δ_n.
Since θ̃_n satisfies the log-likelihood equations, that is, ∇L_n(θ̃_n) = 0, the left-hand side is zero and (8.2) follows by a simple rearrangement and division by N^{1∕2}. □ The main points to show are:
(i) the remainder term R_n can be neglected;
(ii) the observed information matrix J_n converges to a positive definite matrix;
(iii) the observable estimator θ̂_n and its theoretical companion θ̃_n are square-root-N equivalent.
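The integrated mean value expansion used above, ∇L(θ̃) = ∇L(θ₀) + [∫₀¹ ∇²L(θ₀ + sΔ) ds]Δ, holds for any twice continuously differentiable function. It can be verified numerically on a smooth toy function (our own example, not the model log-likelihood):

```python
import numpy as np

# Toy function f(x, y) = x^3 + x*y^2 + y with analytic gradient and Hessian
def grad(v):
    x, y = v
    return np.array([3 * x**2 + y**2, 2 * x * y + 1.0])

def hess(v):
    x, y = v
    return np.array([[6 * x, 2 * y], [2 * y, 2 * x]])

theta0 = np.array([0.5, -0.3])
delta = np.array([0.2, 0.4])
theta1 = theta0 + delta

# Integral of the Hessian along the segment, by the midpoint rule
S = 201
J = sum(hess(theta0 + s * delta) for s in (np.arange(S) + 0.5) / S) / S

lhs = grad(theta1)
rhs = grad(theta0) + J @ delta
err = np.abs(lhs - rhs).max()   # quadrature/round-off error only
```

In the proof, J plays the role of J_n + R_n; consistency of θ̃_n then pins the integrated Hessian down near ∇²L_n(θ₀).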

The Remainder Term
Recall that h = h_t(u) and that B^{(k)}_{11} is the first block of B^k. For the second part, note that the main influence of the parameter on h goes through B. It is therefore advantageous to neutralize the impact of the D_t's. This is the content of the second point in the lemma above.