Factorized estimation of high-dimensional nonparametric covariance models

Estimation of a covariate-dependent conditional covariance matrix in a high-dimensional space poses a challenge to contemporary statistical research. Existing kernel estimators may not be locally adaptive because they use a single bandwidth to explore the smoothness of all entries of the target matrix function. In this paper, we propose a novel framework to address this issue: we factorize the target matrix into factors and estimate these factors in turn by the kernel approach. The resulting estimator is further regularized by thresholding and optimal shrinkage. Under certain mixing and sparsity conditions, we show that the proposed estimator is well-conditioned and uniformly consistent for the underlying matrix function even when the sample is dependent. Simulation studies suggest that the proposed estimator significantly outperforms its competitors in terms of integrated root-squared estimation error. We present an application to financial return data.


INTRODUCTION
Nonparametric estimation of a covariate-dependent conditional covariance matrix Σ(u) in covariance models is fundamental to contemporary scientific research, including neuroimaging in neuroscience, disease mapping in health science, daily ozone concentration analysis in environmental science, and asset portfolio risk analysis in finance, among others (Chen & Leng, 2016; Fan et al., 2013; Fox & Dunson, 2015; Lamusa et al., 2012; Ledoit & Wolf, 2004; Reich et al., 2011; Yin et al., 2010; Zhang & Liu, 2015; Zhang & Su, 2015; Donoho et al., 2018). However, most efforts in nonparametric covariance estimation suffer from the curse of dimensionality (Fan & Yao, 2003). For example, in asset portfolio risk analysis, modeling the market-dependent co-volatility of p assets from historical return data over n consecutive months involves estimating p(p + 1)/2 nonparametric curves (Fama & French, 2004). The dataset studied in this paper contains historical returns of 75 assets over three time periods, namely before-financial-crisis, in-financial-crisis and after-financial-crisis, with n equal to 84, 36, and 95 months, respectively. Note that many more assets can be collected for investigation, whereas the number of months in a period is sometimes quite limited (Engle et al., 2017). When p is close to or larger than n, the kernel covariance estimator proposed by Yin et al. (2010) can be degenerate or ill-conditioned with a high condition number. Hence, it cannot be reliably inverted to compute the precision matrix required in the above risk analysis. In the literature, regularization of estimated covariances has often been done by banding, thresholding, or truncating the number of leading eigenvalues (Bickel & Levina, 2008; Cai & Liu, 2012; Fan et al., 2013). Chen and Leng (2016) proposed a method (called DCM) to regularize the kernel covariance model by thresholding covariance entries.
These authors pointed out that the resulting covariance estimator can still be ill-conditioned for finite samples, in which case an ad hoc small constant must be added to its eigenvalues. They also established a consistency theory for their estimator when the sample is i.i.d. Three main issues arise when we use these existing methods. First, their performance can be compromised by employing the same smoothing bandwidth for all entries, which have varying degrees of smoothness. In particular, under the sparsity assumption, the covariance matrix function contains many zero entries, which favor an infinitely large bandwidth and thus affect the estimation of the other, nonzero entries if a single bandwidth is used for all entries. On the other hand, giving every entry its own bandwidth would generate p(p + 1)/2 tuning constants to choose, and the resulting estimator may not be an appropriate covariance matrix estimator as it can fail to be positive semidefinite for finite samples. Secondly, as the ad hoc eigenvalue adjustment of Chen and Leng (2016) is not principled, it is desirable to explore an optimal shrinkage procedure. Finally, the existing asymptotic theory holds only for i.i.d. samples although, in most applications, the samples are dependent. For instance, in the above asset portfolio risk analysis, both market returns and asset returns are serially correlated time series.
In this paper, we propose a novel framework to address these issues. It is based on a variance-correlation factorization of Σ(u) of the form Σ(u) = Q_0(u) C_0(u) Q_0(u)^T, where Q_0(u) is a diagonal matrix function composed of the square roots of the diagonal entries of Σ(u) and C_0(u) is the correlation matrix function. We further factorize C_0(u) into a product of invertible band matrix factors. In general, we choose band matrices that are less complex than Σ(u). In the proposal, we first estimate these band matrices in turn with separate kernel bandwidths, followed by entry-wise thresholding on the resulting estimator of C_0(u). Estimating these band matrices with different bandwidths is expected to improve the flexibility of the proposal and thus to provide a more accurate estimator of Σ(u). Intuitively, thresholding estimated correlations is better than thresholding covariances, since the variation of the estimated correlations is likely to be smaller and more homogeneous than that of the estimated covariances. In fact, thresholding correlations has been proved adaptive to the variability of individual entries of the covariance matrix (Cai & Liu, 2012). Finally, a well-conditioned, optimally shrunk estimator of Σ(u) is derived by minimizing the Frobenius loss. In summary, the proposed framework differs from the DCM in using multiple factorization-based bandwidths, thresholding correlations, and taking into account a shrinkage effect. The proposal can be viewed as a nonparametric extension of the so-called DCC-GARCH approach (Engle et al., 2017), a popular technique for estimating multivariate time series models.
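To fix ideas, the variance-correlation factorization itself is a one-line computation. The following sketch is our own illustrative code (the function name is not from the paper); it splits a covariance matrix into its diagonal and correlation factors and verifies the reconstruction on a toy matrix:

```python
import numpy as np

def var_corr_factorize(Sigma):
    """Split a covariance matrix into Sigma = Q0 @ C0 @ Q0 with Q0 diagonal."""
    q = np.sqrt(np.diag(Sigma))        # marginal standard deviations
    Q0 = np.diag(q)
    C0 = Sigma / np.outer(q, q)        # correlation matrix
    return Q0, C0

# Verify the factorization on a small covariance matrix
Sigma = np.array([[4.0, 1.0], [1.0, 9.0]])
Q0, C0 = var_corr_factorize(Sigma)
assert np.allclose(Q0 @ C0 @ Q0, Sigma)
```

Because Q0 is diagonal, Q0^T = Q0, so the reconstruction Q0 @ C0 @ Q0 recovers Σ exactly.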
To evaluate the performance of the new proposal, a set of simulation studies is conducted. The results demonstrate that the new proposal substantially outperforms its counterparts in terms of the Frobenius loss and other criteria. The proposed method is illustrated through an application to the analysis of the monthly return data for the group of risky assets mentioned above. The analysis reports the following findings: (1) Some asset returns present a striking nonlinear departure from the Capital Asset Pricing Model (CAPM) (Fama & French, 2004). (2) Both the volatility and co-volatility of these asset returns are market-dependent; see Figure 1 for more details. These two findings provide empirical support for building a nonparametric CAPM for risk assessment and portfolio selection. We also establish an asymptotic theory for the new proposal: under certain mixing and regularity conditions, the proposed estimator is asymptotically consistent for the underlying covariance matrix function even when the samples are dependent. In the procedure, the thresholding step ensures that the resulting estimator converges to the true covariance matrix at a good rate, while the shrinkage step keeps the resulting estimator from being ill-conditioned even in finite samples. To prove the above theory, a dedicated concentration inequality for dependent samples (Merlevede et al., 2009), different from that used by Chen and Leng (2016), is employed. In particular, the proof of the convergence rate of the proposed shrinkage is nontrivial. Note that without the extra thresholding step, a standard shrinkage estimator is expected to have convergence rate √(p/(nh)), where h is the bandwidth in the kernel estimation (Ledoit & Wolf, 2004). After adding the extra thresholding step to the shrinkage procedure, we show that the resulting estimator has a faster convergence rate, √(log(p/h)/(nh)), than the standard shrinkage if the underlying covariance matrix is sparse.
The rest of the article is organized as follows. The proposed factorized estimators are introduced in Section 2, where the corresponding algorithms are developed to determine the bandwidths in the related kernel smoothing as well as the levels of thresholding and shrinkage. The uniform consistency and the convergence rate of the proposed estimator are established for dependent samples in Section 3. In Section 4, simulation studies are conducted to evaluate the performance of the proposed method and compare it to the existing method, and the proposed procedure is employed to analyze financial returns for a group of assets. We conclude with a discussion in Section 5. The proofs of the asymptotic theory and further numerical results are deferred to Data S1. Throughout this paper, we let λ_min(·) and λ_max(·) denote the minimum and maximum eigenvalues of a square matrix. For a vector x, let ||x|| denote its Euclidean norm. For a square matrix A = (a_jk), let ||A||_F = {tr(AA^T)/p}^{1/2}, ||A||, ||A||_max = max_{j,k} |a_jk| and ||A||_∞ = max_{1≤j≤p} Σ_{k=1}^p |a_jk| denote its (size-normalized) Frobenius, spectral, max and ∞-norms, respectively. Let ⟨A, B⟩ = tr(AB^T)/p be the inner product of square matrices A and B. Let I(·) denote an indicator function. Note that these norms satisfy ||A||_F ≤ ||A|| ≤ ||A||_∞ ≤ max_{1≤j≤p} Σ_{k=1}^p I(|a_jk| > 0) ||A||_max. Let diag(x) denote the diagonal matrix with diagonal entries made from the elements of x. Let c ∧ d and c ∨ d denote the minimum and maximum of numbers c and d. Let I_p be the p-dimensional identity matrix.

METHODOLOGY
Let Y = (Y_1, …, Y_p)^T ∈ R^p be a p-dimensional random vector and U ∈ R the associated index random variable. We model the conditional mean and covariance matrix of Y given U = u as μ(u) = E[Y | U = u] and Σ(u) = cov(Y | U = u), respectively, whose entries are assumed to be unknown but smooth functions of u. Suppose that (y_i, u_i), 1 ≤ i ≤ n, with y_i = (y_{i1}, …, y_{ip})^T, are random observations from the population (Y, U), satisfying the equations y_i = μ(u_i) + Σ(u_i)^{1/2} ε_i, where (u_i)_{i=1}^n is a dependent random sample of U. Also, given (u_i)_{i=1}^n, the errors ε_i may be dependent on each other, with zero means and identity covariance matrices (i.e., E[ε_i | u_i] = 0 and cov(ε_i | u_i) = I_p). Let K(u) be a kernel density function, K_h(u) = h^{-1} K(u/h) the scaled kernel function with a bandwidth h > 0, and w_{ih}(u) = K_h(u_i − u) / Σ_{k=1}^n K_h(u_k − u) the weighting function. Yin et al. (2010) considered the following kernel estimators for μ(·) and Σ(·): μ̂(u) = Σ_{i=1}^n w_{ih_a}(u) y_i and Σ̂(u) = Σ_{i=1}^n w_{ih}(u) (y_i − μ̂(u))(y_i − μ̂(u))^T, where h_a and h are the bandwidths for the mean and covariance matrix functions, respectively.
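The kernel estimators of Yin et al. (2010) can be sketched in a few lines of numpy. This is our own illustrative code with a Gaussian kernel (the function names are not from the paper):

```python
import numpy as np

def kernel_weights(u_obs, u, h):
    # Gaussian kernel weights w_ih(u) = K_h(u_i - u) / sum_k K_h(u_k - u)
    k = np.exp(-0.5 * ((u_obs - u) / h) ** 2)
    return k / k.sum()

def kernel_mean_cov(y, u_obs, u, h_a, h):
    # Kernel estimates of mu(u) and Sigma(u); y is an n x p data matrix
    mu_hat = kernel_weights(u_obs, u, h_a) @ y      # weighted mean at u
    w = kernel_weights(u_obs, u, h)
    r = y - mu_hat                                  # residuals around mu_hat(u)
    sigma_hat = (r * w[:, None]).T @ r              # weighted average of r_i r_i^T
    return mu_hat, sigma_hat
```

By construction, the estimate is a nonnegatively weighted sum of rank-one matrices, hence symmetric and positive semidefinite; when p exceeds the local sample size nh, however, it is degenerate, which motivates the regularization steps below.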

The variance-correlation-based approach
To improve the above covariance estimator, we consider a variance-correlation factorization of the form Σ(u) = Q_0(u) C_0(u) Q_0(u)^T, where Q_0(u) = diag(Σ(u))^{1/2} and C_0(u) = Q_0(u)^{-1} Σ(u) Q_0(u)^{-1}. The proposed variance-correlation based Q_0-procedure can be implemented in the following three steps.
Step 1: Estimate Q_0(u) and C_0(u) by kernel smoothing with separate bandwidths, obtaining Q̂_0(u) and Ĉ_0(u).
Step 2: Threshold Ĉ_0(u). Note that the dimension p is frequently larger than the local sample size nh, which results in a degenerate estimator Ĉ_0(u). Following Bickel and Levina (2008), we regularize the above correlation matrix estimator by thresholding its entries, replacing each off-diagonal entry by ĉ_jk(u) I(|ĉ_jk(u)| ≥ t_0(u) √(log(p/h)/(nh))), where ĉ_jk(u) is the (j, k)th entry of Ĉ_0(u), I(·) is an indicator function, and t_0(u) is a positive function of u. The above rate of thresholding is suggested by Theorem 2 in Section 3 below.
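The entry-wise thresholding step is straightforward to implement; the sketch below is our own illustrative code (not the authors' implementation), keeping the unit diagonal and zeroing small off-diagonal correlations:

```python
import numpy as np

def threshold_corr(C_hat, t):
    # Zero out off-diagonal correlations with |c_jk| < t; keep the unit diagonal
    C_t = np.where(np.abs(C_hat) >= t, C_hat, 0.0)
    np.fill_diagonal(C_t, 1.0)
    return C_t
```

In the procedure, the level t would be t_0(u) times the theoretical rate, so it varies with the covariate value u.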
Here, p/h is related to the dimension of an approximate parametric model to the original model: over its support, the matrix function can be approximated by a p(b − a)/h-dimensional step model. Note that, unlike the covariance matrix, the correlation matrix is scale-invariant and has homogeneous diagonals. Therefore, thresholding the correlation matrix is expected to make fewer errors than thresholding the covariance matrix, in particular when the individual variances σ_kk(u), 1 ≤ k ≤ p, differ greatly from each other (Cai & Liu, 2012). Using the above estimators, we construct a plug-in estimator of Σ(u) of the form Σ̂^(t)(u) = Q̂_0(u) Ĉ^(t)_0(u) Q̂_0(u).

Step 3: Shrink Σ̂^(t)(u). In Section 3 below, we show that under sparsity and regularity conditions the above thresholded covariance estimator is consistent for the underlying covariance matrix function as n and p tend to infinity. However, for a finite sample, the proposed estimator may still be ill-conditioned. To ameliorate this, we propose to shrink Σ̂^(t)(u) toward the identity matrix I_p, where the amount of shrinkage is optimized in terms of a data-driven Frobenius loss. There are other covariance shrinkage methods in the literature, but most of them were developed for estimating covariance models without covariates; see Bai and Silverstein (2010), Jolliffe (2002), Donoho et al. (2018) and references therein. To find the optimal amount of shrinkage, we first consider a population version, namely a linear combination of I_p and Σ̂^(t)(u), Σ*(u) = α a I_p + (1 − α) Σ̂^(t)(u), whose expected Frobenius loss E||Σ*(u) − Σ(u)||²_F attains its minimum with respect to 0 ≤ α ≤ 1 and a ∈ R. The resulting solutions depend on Σ(u) as well as on the variability of Σ̂^(t)(u). Replacing these unknown quantities by their estimators, we obtain a plug-in estimator of Σ(u) with a data-driven optimal amount of shrinkage, whose weights involve the quantities α̂²_p(u) and β̂²_p(u). Note that α̂²_p(u) is a plug-in bias when we use p^{-1} tr(Σ̂^(t)(u)) I_p to estimate Σ(u), while β̂²_p(u) gauges the variability of Σ̂^(t)(u) as an estimator of Σ(u).
So estimator (4) is intended to strike a balance between the variability and bias of covariance estimators. Our idea is general and can be directly used to improve other nonparametric covariance matrix estimators, including the DCM. See Appendix A of Data S1 for the detailed derivation.
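The shrinkage step is in the spirit of Ledoit and Wolf (2004). As a rough, covariate-free stand-in for estimator (4), the sketch below implements their identity-target shrinkage for centered residuals, using the size-normalized Frobenius norm defined earlier; all function and variable names are our own illustrative choices:

```python
import numpy as np

def lw_shrink_identity(x):
    # Ledoit-Wolf-style shrinkage of the sample covariance toward m * I,
    # with the size-normalized norm ||A||_F^2 = tr(A A^T) / p.
    n, p = x.shape
    S = x.T @ x / n
    m = np.trace(S) / p                               # <S, I>, the target scale
    d2 = np.sum((S - m * np.eye(p)) ** 2) / p         # distance of S to m * I
    b2_bar = sum(np.sum((np.outer(xi, xi) - S) ** 2) / p for xi in x) / n ** 2
    b2 = min(b2_bar, d2)                              # estimated variability of S
    a2 = d2 - b2                                      # estimated signal strength
    return (b2 / d2) * m * np.eye(p) + (a2 / d2) * S  # optimal convex combination
```

Even when p > n and S itself is singular, the shrunk estimate stays positive definite, which is the practical point of the shrinkage step.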

Effects of unknown zero-entries on bandwidth selection
In Step 1 above, we explore the smoothness of C_0(u) with a single bandwidth. This may introduce a large bias into estimating the nonzero entries when C_0(u) contains many unknown zero entries, as illustrated by the following toy example. Let (n, p) = (250, 100). For simplicity, we assumed that both μ(u) and Q_0 were known and estimated C_0(u) by using a cross-validated (CV) kernel. To generate an i.i.d. sample (y_i, u_i), 1 ≤ i ≤ n, we first randomly drew (u_i), 1 ≤ i ≤ n, from the uniform distribution over [−0.95, 0.95]. Then, given u_i, we defined Σ(u_i) through its square root matrix R(u_i) = (r_kj)_{p×p}. For a preselected γ ∈ [0, 1], we randomly selected p* entries (with p* increasing in γ) from the strictly lower triangular part of R(u_i) and assigned zeros to them. To keep R(u_i) symmetric, we reflected these zero entries to the upper triangular part. The remaining entries, say the (k, j)th, were set to smooth nonzero functions r_kj(u_i). We calculated the sparsity index S_{C_0}, defined as the proportion of zero entries in C_0(u_i). Given u_i and C_0(u_i), we drew y_i from the multivariate normal N(0, C_0(u_i)). For γ = 0.855, 0.91, 0.95, 0.97, 0.98, 0.985, 0.99, 0.995, we used the above sampling procedure to obtain a sample for each case, with the sparsity index S_{C_0} taking values around 0.10, 0.40, 0.74, 0.88, 0.93, 0.95, 0.97, 0.98, respectively. For each sample, we calculated the CV kernel estimates of C_0(u_i), 1 ≤ i ≤ n, and obtained cross-validation values over a grid of bandwidths, which resulted in a CV curve. Note that these curves may have different minimum values; to make them comparable, we divided each curve by its minimum value. The plots, displayed in Figure 2, show that as the sparsity index increased from 0.10 to 0.98, the curves became flat and the CV-selected bandwidth tended to infinity. This implies that the CV-based bandwidth selection was gradually dominated by the zero entries when there were many unknown zero entries.
This can be explained by the fact that the arithmetic average of y_ik y_ij, i = 1, …, n (equivalent to choosing an infinitely large bandwidth) is an optimal unbiased estimate of the (k, j)th zero entry, but not of the nonzero entries.
To assess the zero-entry effect across multiple samples, we repeated the above sampling procedure 90 times, obtaining 90 independent datasets. Applying the single-bandwidth kernel procedure to each sample, for each u_i we calculated the Frobenius loss between the estimated and true values over all zero off-diagonal entries and over all nonzero off-diagonal entries, respectively, and then averaged these Frobenius losses over the u_i's. Box-plots of the average Frobenius losses were made for each γ ∈ {0.855, 0.91, 0.95, 0.97, 0.98, 0.985, 0.99, 0.995}, as shown in Figure 3. The results indicate that, as the sparsity index increased, the average Frobenius loss increased for the nonzero entries and decreased for the zero entries.
To tackle this problem, we factorize C_0(u) further. For example, we can construct a new estimator Ĉ_0(u) = Q̂_1(u) Ĉ_1(u) Q̂_1(u)^T, where Q̂_1 and Ĉ_1 are defined in the next subsection. We applied this new procedure to each of the 90 datasets with sparsity index S_{C_0} = 0.97. The average Frobenius losses are presented as box-plots in Figure 4. The results indicate that adding another factorization reduced the Frobenius loss for estimating the nonzero entries by 35%, while the Frobenius losses for estimating the zero entries under the two methods remained almost the same.

A multiple factorization approach
To reduce the above zero-entry effect, we further factorize C_0(u) as C_0(u) = Q_1(u) ⋯ Q_m(u) C_m(u) Q_m(u)^T ⋯ Q_1(u)^T, where Q_1(u), …, Q_m(u) are invertible band matrices and C_m(u) is a residual correlation matrix. We hope that C_m(u) contains fewer zero entries than C_0(u) does, which can reduce the bias effect of these zero entries; see Appendix A of Data S1 for some details. For a prefixed m, we estimate Q_v(u), 0 ≤ v ≤ m, and C_m(u) in turn with separate kernel bandwidths, followed by entry-wise thresholding on the resulting plug-in estimator of C_0(u). We then replace Step 1 in the variance-correlation based procedure by this multiple factorization approach with multiple bandwidths, which aims to improve the flexibility of the proposed procedure in addressing the varying smoothness across the entries of C_0(u).
There are a few matrix factorization algorithms for estimating a covariance matrix in the literature, for example, the Cholesky algorithm (Rothman et al., 2010). In this paper, we first estimate the diagonal entries, Q̂_0(u) = diag(σ̂_kk(u)^{1/2} : 1 ≤ k ≤ p), with a Q_0(u)-specific bandwidth h = h_0. Then, we standardize y_i, 1 ≤ i ≤ n, by using μ̂(u_i) and Q̂_0(u_i): ỹ_i = Q̂_0(u_i)^{-1}(y_i − μ̂(u_i)). Let ỹ^{(m)} denote the mth row of the standardized data matrix ỹ = (ỹ_1, …, ỹ_n). To identify band matrix factors for C_0(u), we recouple the coordinates of Y by maximizing the marginal correlations of consecutive coordinates as follows: let m_1 = 1 and, for k ≥ 1, m_{k+1} = argmax_{j ∉ {m_1, …, m_k}} |Corr(ỹ^{(m_k)}, ỹ^{(j)})|, where Corr(·, ·) denotes the operator for calculating the sample correlation between two random vectors. Let Y* = (Y_{m_1}, …, Y_{m_p}) be the recoupled Y. Then the large entries in C_0(u) are likely rearranged to be close to the diagonal band. To simplify the notation, in the following we assume that the underlying coordinates have already been recoupled, that is, Y = Y* with m_k = k, 1 ≤ k ≤ p. In many applications, this assumption may hold when the coordinates of Y have a natural ordering. We opt for the following band matrices as factors whose inverses can be explicitly calculated:

band matrices Q_v(u), 1 ≤ v ≤ m, whose entries are zero outside the band and are kernel-estimated within the band with bandwidths h_1, h_2, …, h_m. Using these band matrices, we make the transformation y̆_i = Q̂_m(u_i)^{-1} ⋯ Q̂_1(u_i)^{-1} ỹ_i. After the transformation, the new random vector y̆_i may have far fewer zero entries in its covariance matrix cov(y̆_i | u_i) than the original ỹ_i does. For m ≥ 1, we estimate C_m(u) by applying the kernel method to (y̆_i, u_i), 1 ≤ i ≤ n, with bandwidth h_{m+1}. With these estimated factors, we reconstruct the following estimator of C_0(u): Ĉ_0(u) = Q̂_1(u) ⋯ Q̂_m(u) Ĉ_m(u) Q̂_m(u)^T ⋯ Q̂_1(u)^T. With Ĉ_0 and Q̂_0, we construct a new plug-in estimator Σ̂(u) = Q̂_0(u) Ĉ_0(u) Q̂_0(u). Finally, replacing Step 1 in the variance-correlation-based procedure by the multiple factorization step above, we end up with a general procedure for estimating the covariance matrix.

THEORY
In this section, we develop an asymptotic theory for the proposed estimators that covers both i.i.d. and non-i.i.d. cases and is thus more general than that of Chen and Leng (2016). Under certain regularity conditions, the proposed estimators are shown to be consistent for the underlying matrix function if the related bandwidths, though allowed to differ from each other, converge to zero at the same rate.
(C2) The density function of U, g(u), has a second-order continuous derivative g′′(·) over a compact support [a, b], and inf_{u∈[a,b]} g(u) > 0. For any i ≠ i_1, the joint density of u_i and u_{i_1} is uniformly bounded. There exist positive constants γ_1 and γ_2 < 1 such that, for k ≥ 1, the mixing coefficients of the sample decay at an exponential rate. The conditional variances σ_jj(u) are bounded below from zero uniformly for 1 ≤ j ≤ p and u ∈ [a, b], and their first-order derivatives are also uniformly bounded. The related conditional expectations are uniformly bounded as well. The above conditions are routinely used in the literature on nonlinear time series analysis (e.g., Fan & Yao, 2003; Lam & Yao, 2012; Zhang & Liu, 2015). It follows from (C5) that b_2 = max_{1≤j≤p} sup_{u∈[a,b]} |μ_j(u)| < ∞. (C3) and (C4) assume that the response observations have an exponentially fast mixing rate and subexponential tails. Note that these conditions are imposed to facilitate the proofs and thus may not be the weakest possible for establishing the theory below.
Let ĝ_{h_a}(u) = n^{-1} Σ_{i=1}^n K_{h_a}(u_i − u) be a kernel density estimator of g(u). It follows from Proposition 1 in Appendix B of Data S1 that ĝ_{h_a}(u) is uniformly consistent for g(u).
Note that 0 < τ_1 < 1/2, as γ_1 ≤ 1 and γ_2 < 1, so the above bandwidth condition imposed on h_a holds. Let 1/τ_2 = 2/γ_1 + 1/γ_2. In the next theorem, we show that the entries of the proposed covariance matrix estimator are consistent with the underlying ones uniformly in u and in the entry indices.
There exists a positive constant s_1 such that sup_{u∈[a,b]} ||Σ(u)|| ≤ s_1. (C9) There exists a positive constant s_{0p} bounding Σ(u) away from singularity as p → ∞. (C10) There exist positive constants t_a < t_b such that t_a < inf_{u∈[a,b]} t̂_0(u) ≤ sup_{u∈[a,b]} t̂_0(u) < t_b. Note that Condition (C7) implies that Σ(u) is not closer to cI_p than the product of the sparsity index and the rate √(nh/log(p/h)), where c is an arbitrary constant. Conditions (C8) and (C9) concern the uniform boundedness of ||Σ(u)|| from above and away from zero at an order given by √(nh) multiplied by the sparsity index. Finally, we can see from Theorem 3 below that although (C10) requires the tuning constant t̂_0(u) to have a finite limit as n tends to infinity, the order of the convergence rate of the corresponding estimator Σ̂^(st)(u) does not depend on this limit. Under these conditions, we state a uniform consistency result for Σ̂^(st)(u) as follows.
In addition to the above conditions, if Condition (C9) holds, then the stated bounds hold uniformly in u ∈ [a, b]. Finally, if Condition (C10) also holds, then the above results continue to hold after replacing t_0(u) by t̂_0(u) in Σ̂^(t)(u) and Σ̂^(st)(u).
Note that if h = c_0 n^{−1/5} (with c_0 a constant) and, for d = max{1/(2γ_1), 2/γ_1 − 1, 1/(2γ_2), 2/γ_2 − 1}, (log p)^d / n = o(1), then the above condition imposed on h holds. Also, the assumption that the bandwidths converge to zero at the same rate does not rule out their being different. However, the cross-validation (or so-called subset) selected bandwidths may not tend to zero; in particular, some of them may tend to infinity when the underlying covariance matrix has many zero and few nonzero entries. In this situation, the simulation studies in the next section show that the proposed estimators can reduce the bias and outperform the DCM in terms of integrated mean squared error. The theoretical development along this line will be spelled out in a future paper.

NUMERICAL STUDIES
In this section, to demonstrate the merits of the proposed estimators in finite-sample settings, we apply the proposed procedures to both synthetic and real data. We present numerical results for the proposed procedure using m (m = 0, 1) band matrix factors.
To facilitate the presentation, let tNCM_0 and stNCM_0 denote the proposed estimators Σ̂^(t)(u) and Σ̂^(st)(u), respectively, with m = 0, and let tNCM_1 and stNCM_1 denote the corresponding estimators with m = 1. Let DCM_1 and DCM_2 denote two DCM estimators, defined by DCM_1(u) = (σ̂¹_jk(u) I(|σ̂¹_jk(u)| ≥ d(u))), where d(u) is the level of thresholding, and by the analogously thresholded DCM_2(u). Note that DCM_1 differs from DCM_2 in the way the residuals are estimated: the former uses residuals based on μ̂(u_i), while the latter's residuals carry an extra bias μ̂(u_i) − μ̂(u). So DCM_2 is expected to perform worse than DCM_1. Following the same procedure as in stNCM_0, we improve DCM_1 by incorporating the effect of shrinkage on it: let sDCM_1 denote the optimal shrinkage estimator obtained by replacing tNCM_0 with DCM_1 in the definition of stNCM_0.

Choice of tuning parameters
As is common in most smoothing methods, the choice of appropriate tuning parameters plays an important role in the performance of a regularized estimator, and it is prudent to restrict attention to data-driven choices. Here we apply cross-validation to choose the tuning parameters sequentially as follows.
Bandwidth for estimating μ(u). We let h_a = argmin_h CV_a(h) be the optimal bandwidth for the mean kernel estimator in Equation (1), where CV_a(h) = n^{-1} Σ_{i=1}^n ||y_i − μ̂_{h,−i}(u_i)||² π(u_i). Here, μ̂_{h,−i}(u_i) is the kernel mean function estimator after dropping the ith observation from the data. The trimming function π(u) = I(u_(1) < u < u_(n−1)) is used to reduce boundary effects on CV_a(h), where u_(k) is the kth order statistic of (u_i)_{i=1}^n.

Bandwidth for estimating Q_0(u). To select the bandwidth for Q̂(u), for each h we calculate σ̂_kk(−i), 1 ≤ k ≤ p, after dropping the ith observation. We choose the optimal bandwidth h_0 = argmin_h CV_0(h) for Q̂(u), where CV_0(h) is a Stein-loss-based cross-validation function.

Bandwidth for estimating Q_k(u), 1 ≤ k ≤ m. We choose h_k = argmin_h CV_k(h), at which the corresponding criterion attains its minimum; the criterion involves the kernel estimator of the (j, j + k)th correlation ρ_{j(j+k)}(u_i) based on the leave-one-out dataset (ỹ_t, u_i)_{t≠i}.
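The leave-one-out cross-validation used for the mean bandwidth h_a can be sketched as follows, with a Gaussian kernel and our own illustrative naming (trimming omitted for brevity):

```python
import numpy as np

def cv_mean_bandwidth(y, u, bandwidths):
    # Leave-one-out cross-validation for the kernel mean estimator:
    # pick the bandwidth minimizing the summed squared prediction error.
    n = len(u)
    scores = []
    for h in bandwidths:
        err = 0.0
        for i in range(n):
            u_rest = np.delete(u, i)
            w = np.exp(-0.5 * ((u_rest - u[i]) / h) ** 2)
            w = w / w.sum()
            mu_i = w @ np.delete(y, i, axis=0)   # leave-one-out estimate at u_i
            err += np.sum((y[i] - mu_i) ** 2)
        scores.append(err)
    return bandwidths[int(np.argmin(scores))]
```

The same template applies to the other bandwidths, with the squared prediction error replaced by the corresponding Stein-loss or correlation-based criterion.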
Bandwidth for estimating C_m(u). There are two existing cross-validation methods for selecting the bandwidth h for C_m(u): one is a Stein-loss-based approach (Yin et al., 2010), which is however applicable only to low-dimensional data; the other is a subset-based approach (Chen & Leng, 2016) for high-dimensional data. In this paper, we choose h_{m+1} = argmin_h CV_C(h), at which the corresponding criterion attains its minimum.

Thresholding level for Ĉ^(t)(u). Following Bickel and Levina (2008), we split the sample into two sub-samples, called the trial and testing samples, and select the threshold by minimizing the Frobenius norm of the difference between the trial-sample-based thresholded estimator and the testing-sample-based covariance matrix. Specifically, we divide the original sample at random into two samples of sizes n_1 and n_2, where n_1 = n(1 − 1/log(n)) and n_2 = n/log(n), and repeat this N_1 times. Here, we set N_1 = 100 as the default value according to our numerical experience. Let Ĉ_{1,s}(u) and Ĉ_{2,s}(u) be the plug-in estimators based on the n_1 and n_2 observations, respectively, with the bandwidth selected by leave-one-out cross-validation. Let Ĉ^(t)_{1,s} be the thresholded estimator derived from Ĉ_{1,s}(u) with thresholding level t_0(u). Given u, we select t_0(u) by minimizing N_1^{-1} Σ_{s=1}^{N_1} ||Ĉ^(t)_{1,s}(u) − Ĉ_{2,s}(u)||²_F.

Tuning parameters for estimating the DCM. The bandwidth h and the level of thresholding of the DCM estimators in (7) are determined by the so-called subset and sample-splitting approaches, respectively (Chen & Leng, 2016).
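The sample-splitting threshold selection can be sketched as follows, assuming the trial- and testing-sample estimates over the N_1 splits have already been computed (our own illustrative code):

```python
import numpy as np

def select_threshold(trial_estimates, test_estimates, t_grid):
    # Pick the level t minimizing the average squared Frobenius distance
    # between thresholded trial-sample estimates and test-sample estimates.
    def thresh(C, t):
        Ct = np.where(np.abs(C) >= t, C, 0.0)
        np.fill_diagonal(Ct, np.diag(C))
        return Ct
    losses = [np.mean([np.sum((thresh(C1, t) - C2) ** 2)
                       for C1, C2 in zip(trial_estimates, test_estimates)])
              for t in t_grid]
    return t_grid[int(np.argmin(losses))]
```

A level that is too small keeps spurious correlations, while one that is too large destroys genuine ones; the cross-validated loss is minimized in between.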

Criteria for performance assessment
We need a criterion to measure the performance of a nonparametric covariance matrix estimator. There are multiple possible criteria; one particularly convenient choice is the integrated root-squared error (IRSE). For any estimator Ψ̂(u) of Σ(u), u ∈ [a, b], the IRSE is defined as IRSE(Ψ̂) = K_0^{-1} Σ_{k=1}^{K_0} ||Ψ̂(v_k) − Σ(v_k)||_F, where v_k, 1 ≤ k ≤ K_0, are grid points evenly distributed over the interval (a, b). In the following, we set K_0 = 20 for (a, b) = (−1, 1). In our study, we also considered a spectral-norm-based IRSE; the results are omitted as they are similar to the Frobenius version. We also evaluate the performance of the proposed procedure in discovering zero entries in the covariance matrix. Let p_1 (p_2) be the number of nonzero (zero) entries in Σ(u). For any estimator Ψ̂(u) of Σ(u), let n_11 be the number of true discoveries of nonzero entries in Σ(u) by Ψ̂(u); similarly, let n_22 denote the number of true discoveries of zero entries in Σ(u) by Ψ̂(u). Let SEN, SPE and ACC denote the sensitivity, specificity, and accuracy of the above testing, namely SEN = n_11/p_1, SPE = n_22/p_2, and ACC = (n_11 + n_22)/(p_1 + p_2).
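The support-recovery metrics are simple to compute; a minimal sketch under our own naming:

```python
import numpy as np

def support_metrics(Sigma_true, Sigma_hat, tol=1e-12):
    # SEN = n11/p1, SPE = n22/p2, ACC = (n11 + n22)/(p1 + p2)
    nz = np.abs(Sigma_true) > tol          # true nonzero support
    nz_hat = np.abs(Sigma_hat) > tol       # estimated nonzero support
    p1, p2 = nz.sum(), (~nz).sum()
    n11 = np.sum(nz & nz_hat)              # true discoveries of nonzero entries
    n22 = np.sum(~nz & ~nz_hat)            # true discoveries of zero entries
    return n11 / p1, n22 / p2, (n11 + n22) / (p1 + p2)
```

In practice these counts would be accumulated over the grid points v_k and averaged, matching the integrated form of the IRSE.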

Synthetic data
In this subsection, we carried out a set of simulation studies, considering three settings for μ(u) and Σ(u).
The results can be summarized as follows: • On average, the IRSE loss of each procedure increased with the dimension p and with the degree of serial correlation, while it decreased with the sample size n.
• The degrees of sparsity and diagonal homogeneity in Σ(u) had an effect on the performance of the four procedures. For example, when (n, p, ρ) = (100, 300, 0), the IRSE loss of stNCM_0 in Setting 2 increased by 84% compared to Setting 1. This is not surprising, as the degrees of sparsity and diagonal homogeneity in Setting 2 lead to a higher dimensionality (i.e., number of effective parameters in the model) than in Setting 1.

TABLE 9 The average (SE in %) of integrated root-squared error for Setting 3 (continued), by (n, p), for DCM_2, DCM_1, sDCM_1, tNCM_0, stNCM_0, tNCM_1 and stNCM_1.
• tNCM_1 and stNCM_1 performed slightly better than tNCM_0 and stNCM_0, in particular in Setting 3. Compared to DCM_2, on average DCM_1 reduced the IRSE loss by 99%. Compared to tNCM, on average stNCM reduced the IRSE loss by 5%. In Setting 3, compared to DCM_1, on average tNCM_0 and stNCM_0 reduced the IRSE by 14% and 15%, respectively. Compared to DCM_2, on average DCM_1 reduced the loss by 94%. Compared to tNCM_0, on average stNCM_0 reduced the IRSE loss by 2%. tNCM_1 and stNCM_1 performed substantially better than their counterparts tNCM_0 and stNCM_0. Similar conclusions can be drawn for dependent samples with ρ = 0.3 and 0.8. In particular, the optimal shrinkage can reduce the effect of serial correlation on the proposed procedures stNCM_0 and stNCM_1.
• Similar results were obtained in terms of ACC. See Tables 1-7 in the Appendix D of Data S1.
• The CPU-time costs of tNCM_m and stNCM_m, m = 0, 1, are less than those of DCM_1 and DCM_2; see Figures 2 and 3 in the Appendix D of Data S1.

Asset return data
The capital asset pricing model (CAPM) describes the relationship between systematic risk and expected return for assets and is widely used throughout finance for pricing risky assets. However, the model imposes the assumption that asset returns are linearly related to the market return. The primary goal of this study is to extend the CAPM to a nonlinear setting. In particular, we are interested in how the volatility and co-volatility of a group of asset returns depend on the market return.
For this purpose, from the Yahoo Finance database we collected monthly return data on 75 assets across eight sectors over the three time periods, namely the before-financial-crisis, in-financial-crisis and after-financial-crisis periods. We also collected the index return of the S&P 500, which was treated as the market return.
We applied the proposed stNCM_0 and stNCM_1 to the data for each time period, obtaining almost the same results. Here, we report the corresponding estimates of the mean μ(u) and covariance matrix Σ(u). Note that the diagonal of the estimated Σ(u) shows the volatility of the individual returns, while the estimated correlation coefficient matrix C_0(u) captures the cross-sectional relationships among these returns.
We plotted the estimated individual mean functions and the estimated volatility functions in Figure 1, revealing a number of assets that had nonlinear relationships to the market return. The degree of this nonlinearity decreased significantly after the financial crisis, indicating that the CAPM fitted the market better than before the crisis; see Appendices E and F of Data S1 for more details. Figure 1 also shows that the individual volatility of the assets increased substantially during the financial crisis period but returned to normal afterwards. The pattern of the dependence of the volatility on the market also changed considerably after the crisis: the volatility functions changed from nonconstant before the financial crisis to almost constant afterwards. We also investigated the effects of the financial crisis on the co-volatility of the selected assets through the estimated nonzero correlation coefficient functions. Using the estimated covariance matrix functions, in each time period we identified the pairs of assets with nonzero market-dependent conditional correlation coefficients (and hence nonzero conditional co-volatility). We further conducted asymptotic tests of the significance of the co-volatility for these pairs as follows. For any pair of assets (a, b), let Corr_(a,b)(u) denote its correlation coefficient as a function of u (the market return), with estimator Ĉorr_(a,b)(u). Let F̂_(a,b)(u) = 0.5 log{(1 + Ĉorr_(a,b)(u))/(1 − Ĉorr_(a,b)(u))} be Fisher's Z transformation. To test H_0: Corr_(a,b)(u) ≡ 0, we considered a test statistic based on |F̂_(a,b)(u_i)|, 1 ≤ i ≤ n, standardized by the sample variance v̂ar(|F̂_(a,b)(U)|), with P-values computed from the cumulative distribution function of the standard normal N(0, 1). Then, even after Bonferroni correction for multiple testing, these P-values were all significant (< 10^{-2}) for the above selected pairs of assets.
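The Fisher Z construction above can be sketched as follows. The aggregation over the grid is only a loosely related normal-approximation illustration of our own (the paper's exact test statistic differs), and all names are illustrative:

```python
import numpy as np
from math import erf, sqrt

def fisher_z(r):
    # Fisher's Z transformation of a correlation coefficient r in (-1, 1)
    return 0.5 * np.log((1.0 + r) / (1.0 - r))

def bonferroni_pvalue(r_values, n_eff):
    # Under Corr = 0, sqrt(n_eff - 3) * Z(r) is approximately N(0, 1);
    # take the largest |Z| over the grid and Bonferroni-correct.
    z = np.sqrt(max(n_eff - 3, 1)) * np.abs(fisher_z(np.asarray(r_values)))
    p_single = 2.0 * (1.0 - 0.5 * (1.0 + erf(z.max() / sqrt(2.0))))
    return min(1.0, len(r_values) * p_single)
```

The Z transformation stabilizes the variance of estimated correlations, which is what makes a common normal reference distribution usable across pairs.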
The final list of significant pairs is as follows: • Before-financial-crisis. There were 1, 14, and 1 pairs within Technology, Energy, and Consumer Defensive, respectively.
• In-financial-crisis. There were 4, 1, 8, 1, 4, 1, 1, 1, and 1 pairs of correlated assets within Technology, Industrial, Energy, Consumer Defensive, Health Care, and Financial Services, respectively. There were also three pairs of correlated assets belonging to two different sectors: Industrial and Consumer Cyclical, Consumer Cyclical and Consumer Defensive, and Financial Services and Industrial.
• After-financial-crisis. There were 3, 2, 10, 1, 11, and 12 pairs of assets within Technology, Industrial, Energy, Consumer Defensive, Health Care, and Financial Services, respectively. There were 1, 1, 1, 1, and 2 pairs of assets between Financial Services and Industrial, between Consumer Defensive and Financial Services, between Consumer Cyclical and Consumer Defensive, between Technology and Industrial, and between Health Care and Consumer Defensive, respectively.
The results indicate that before the financial crisis, there were only 16 significant within-sector co-volatility connections among these assets; in particular, there were no significant between-sector co-volatility connections. The numbers of co-volatility connections within and across sectors increased significantly during and after the financial crisis: the number of within-sector connections rose from 16 to 22 during the crisis and to 37 after it, while the number of between-sector connections rose from 0 to 3 during the crisis and to 7 after it. This implies that, in response to the financial crisis, the financial market became more closely integrated than before.

DISCUSSION AND CONCLUSION
Estimating the covariate-dependent covariance matrix Σ(u) of a high-dimensional response vector poses a major challenge to contemporary statistical research. The existing kernel methods of Chen and Leng (2016) and Yin et al. (2010) may not be flexible enough to capture varying smoothness across different parts of the matrix, as they use a single bandwidth for all entries of Σ(u). Here, we have proposed a novel estimation procedure to overcome this obstacle, based on a variance-correlation factorization of Σ(u), namely Σ(u) = Q_0(u) C_0(u) Q_0(u)^T, where Q_0(u) = diag(Σ(u))^{1/2} and the correlation matrix function C_0(u) is further factorized into a product of band matrices. The proposal is implemented in two steps. In Step 1, we estimate Q_0(u) and C_0(u) robustly by using separate bandwidths for the band matrices, followed by thresholding the entries of the estimated C_0(u). In Step 2, we substitute these estimators into the above factorization to obtain a plug-in estimator, followed by an optimal shrinkage from a decision-theoretic point of view.
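The two-step construction at a fixed u can be sketched in simplified form. The sketch assumes a raw kernel estimate of Σ(u) is already available, and it replaces the paper's band-matrix factorization of C_0(u) with plain hard thresholding, and its optimal shrinkage with linear shrinkage toward a scaled identity; the tuning parameters `tau` and `rho` are illustrative, not the paper's data-driven choices.

```python
import numpy as np

def plug_in_covariance(Sigma_hat, tau=0.1, rho=0.2):
    """Simplified two-step regularization of a raw kernel estimate
    Sigma_hat of Sigma(u) at a fixed u.
    Step 1: form Q0 = diag(Sigma_hat)^{1/2} and the correlation
    matrix C0, then hard-threshold the off-diagonal entries of C0
    at level tau.
    Step 2: plug back via Sigma = Q0 C0 Q0^T and shrink linearly
    toward mu * I to keep the estimate well-conditioned."""
    q = np.sqrt(np.diag(Sigma_hat))
    C0 = Sigma_hat / np.outer(q, q)
    # hard-threshold off-diagonal correlations, keep the diagonal
    mask = np.abs(C0) >= tau
    np.fill_diagonal(mask, True)
    C0 = C0 * mask
    # Q0 C0 Q0^T for diagonal Q0 is an entrywise product with q q^T
    Sigma = np.outer(q, q) * C0
    # linear shrinkage toward mu * I, mu = average variance
    p = Sigma.shape[0]
    mu = np.trace(Sigma) / p
    return (1.0 - rho) * Sigma + rho * mu * np.eye(p)
```

Small off-diagonal correlations (below `tau`) are set to zero before plugging back, and the final convex combination with mu * I bounds the condition number of the output away from infinity even when p is close to n.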
We have conducted a set of simulations demonstrating that the new proposal outperforms the existing DCM approach in terms of estimation loss and CPU time. To illustrate the proposal, we have applied it to a dataset of asset returns, developing a nonparametric capital asset pricing model to capture volatility and co-volatility among these risky assets. We have shown that, under certain sparsity conditions, the proposed estimator is consistent with the underlying covariance matrix as both the sample size and the dimension tend to infinity. A few important topics remain to be addressed but are beyond the scope of this paper, such as nonparametric nonlinear shrinkage.