Large covariance estimation by thresholding principal orthogonal complements

Authors

Jianqing Fan, Yuan Liao and Martina Mincheva
Summary

The paper deals with the estimation of a high dimensional covariance matrix with a conditional sparsity structure and fast diverging eigenvalues. By assuming a sparse error covariance matrix in an approximate factor model, we allow for the presence of some cross-sectional correlation even after taking out common but unobservable factors. We introduce the principal orthogonal complement thresholding method ‘POET’ to explore such an approximate factor structure with sparsity. The POET-estimator includes the sample covariance matrix, the factor-based covariance matrix, the thresholding estimator and the adaptive thresholding estimator as specific examples. We provide mathematical insights into when factor analysis is approximately the same as principal component analysis for high dimensional data. The rates of convergence of the estimators of the sparse residual covariance matrix and of the conditional sparse covariance matrix are studied under various norms. It is shown that the effect of estimating the unknown factors vanishes as the dimensionality increases. The uniform rates of convergence for the unobserved factors and their factor loadings are derived. The asymptotic results are also verified by extensive simulation studies. Finally, a real data application on portfolio allocation is presented.

1. Introduction

Information and technology make large data sets widely available for scientific discovery. Much statistical analysis of such high dimensional data involves the estimation of a covariance matrix or its inverse (the precision matrix). Examples include portfolio management and risk assessment (Fan et al., 2008), high dimensional classification such as the Fisher discriminant (Hastie et al., 2009), graphic models (Meinshausen and Bühlmann, 2006), statistical inference such as controlling false discoveries in multiple testing (Leek and Storey, 2008; Efron, 2010), finding quantitative trait loci based on longitudinal data (Yap et al., 2009; Xiong et al., 2011) and testing the capital asset pricing model (Sentana, 2009), among others. See Section 'Applications of POET' for some of those applications. Yet, the dimensionality is often either comparable with the sample size or even larger. In such cases, the sample covariance is known to have poor performance (Johnstone, 2001), and some regularization is needed.

Realizing the importance of estimating large covariance matrices and the challenges that are brought by the high dimensionality, in recent years researchers have proposed various regularization techniques to estimate Σ consistently. One of the key assumptions is that the covariance matrix is sparse, namely many entries are 0 or nearly so (Bickel and Levina, 2008; Rothman et al., 2009; Lam and Fan, 2009; Cai and Zhou, 2012; Cai and Liu, 2011). In many applications, however, the sparsity assumption directly on Σ is not appropriate. For example, financial returns depend on the equity market risks, housing prices depend on the economic health and gene expressions can be stimulated by cytokines, among others. Because of the presence of common factors, it is unrealistic to assume that many outcomes are uncorrelated. An alternative method is to assume a factor model structure, as in Fan et al. (2008). However, they restrict themselves to the strict factor models with known factors.

A natural extension is conditional sparsity: given the common factors, the outcomes are only weakly correlated. To model this, we consider an approximate factor model, which has been frequently used in economic and financial studies (Chamberlain and Rothschild (1983), Fama and French (1992) and Bai and Ng (2002), among others):

y_it = b_i′f_t + u_it.   (1.1)

Here y_it is the observed response for the ith (i=1,…,p) individual at time t=1,…,T, b_i is a vector of factor loadings, f_t is a K×1 vector of common factors and u_it is the error term, which is usually called the idiosyncratic component, uncorrelated with f_t. Both p and T diverge to ∞, whereas K is assumed fixed throughout the paper, and p is possibly much larger than T.

We emphasize that, in model (1.1), only y_it is observable. It is intuitively clear that the unknown common factors can only be inferred reliably when there are sufficiently many cases, i.e. p→∞. In a data rich environment, p can diverge at a rate that is faster than T. The factor model (1.1) can be put in a matrix form as

y_t = Bf_t + u_t,   t = 1,…,T,   (1.2)

where y_t = (y_1t,…,y_pt)′, B = (b_1,…,b_p)′ and u_t = (u_1t,…,u_pt)′. We are interested in Σ, the p×p covariance matrix of y_t, and its inverse, which are assumed to be time invariant. Under model (1.1), Σ is given by

Σ = B cov(f_t)B′ + Σ_u,   (1.3)

where Σ_u = (σ_u,ij)_{p×p} is the covariance matrix of u_t. The literature on approximate factor models typically assumes that the first K eigenvalues of B cov(f_t)B′ diverge at rate O(p), whereas all the eigenvalues of Σ_u are bounded as p→∞. This assumption holds easily when the factors are pervasive in the sense that a non-negligible fraction of factor loadings should be non-vanishing. The decomposition (1.3) is then asymptotically identified as p→∞. In addition, in this paper we assume that Σ_u is approximately sparse as in Bickel and Levina (2008) and Rothman et al. (2009): for some q ∈ [0,1),

m_p = max_{i≤p} ∑_{j≤p} |σ_u,ij|^q

does not grow too fast as p→∞. In particular, this includes the exact sparsity assumption (q=0) under which m_p = max_{i≤p} ∑_{j≤p} 1{σ_u,ij ≠ 0}, the maximum number of non-zero elements in each row.

The conditional sparsity structure of form (1.2) was explored by Fan et al. (2011a) in estimating the covariance matrix, when the factors f_t are observable. This allows them to use regression analysis to estimate Σ_u. This paper deals with the situation in which the factors are unobservable and must be inferred. Our approach is simple and optimization free, and it uses the data only through the sample covariance matrix. Run the singular value decomposition on the sample covariance matrix Σ̂_sam of y_t, keep the covariance matrix that is formed by the first K principal components and apply the thresholding procedure to the remaining covariance matrix. This results in the principal orthogonal complement thresholding estimator POET. When the number of common factors K is unknown, it can be estimated from the data. See Section 'Regularized covariance matrix via principal components analysis' for additional details. We shall investigate various properties of POET under the assumption that the data are serially dependent, which includes independent observations as a specific example. The rates of convergence under various norms for the estimators of both Σ and Σ_u and of their precision (inverse) matrices will be derived. We show that the effect of estimating the unknown factors on the rate of convergence vanishes when p log (p)≫T and, in particular, the rate of convergence for Σ̂_u achieves the optimal rate in Cai and Zhou (2012).

This paper focuses on the high dimensional static factor model (1.2), which is innately related to the principal component analysis (PCA), as clarified in Section 'Regularized covariance matrix via principal components analysis'. This feature makes it different from the classical factor model with fixed dimensionality (e.g. Lawley and Maxwell (1971)). In the last decade, much theory on the estimation and inference of the static factor model has been developed, e.g. Stock and Watson (1998, 2002), Bai and Ng (2002), Bai (2003) and Doz et al. (2011), among others. Our contribution is on the estimation of covariance matrices and their inverse in large factor models.

The static model that is considered in this paper is to be distinguished from the dynamic factor model as in Forni et al. (2000); the latter allows inline image to depend also on inline image with lags in time. Their approach is based on the eigenvalues and principal components of spectral density matrices, and on the frequency domain analysis. Moreover, as shown in Forni and Lippi (2001), the dynamic factor model does not really impose a restriction on the data-generating process, and the assumption of idiosyncrasy (in their terminology, a p-dimensional process is idiosyncratic if all the eigenvalues of its spectral density matrix remain bounded as p→∞) asymptotically identifies the decomposition of inline image into the common component and idiosyncratic error. The literature includes, for example, Forni et al. (2000, 2004), Forni and Lippi (2001), Hallin and Liška (2007, 2011) and many other references therein. Above all, both the static and the dynamic factor models are receiving increasing attention in applications of many fields where information usually is scattered through a (very) large number of interrelated time series.

There has been extensive literature in recent years that deals with sparse principal components, which has been widely used to enhance the convergence of the principal components in high dimensional space. d'Aspremont et al. (2008), Shen and Huang (2008), Witten et al. (2009) and Ma (2013) proposed and studied various algorithms for computations. More literature on sparse PCA is found in Johnstone and Lu (2009), Amini and Wainwright (2009), Zhang and El Ghaoui (2011) and Birnbaum et al. (2012), among others. In addition, there has also been a growing literature that theoretically studies the recovery from a low rank plus sparse matrix estimation problem; see, for example, Wright et al. (2009), Lin et al. (2008), Candès et al. (2011), Luo (2011), Agarwal et al. (2012) and Pati et al. (2012). It corresponds to the identifiability issue of our problem.

There is a big difference between our model and those considered in the aforementioned literature. In the current paper, the first K eigenvalues of Σ are spiked and grow at a rate O(p), whereas the eigenvalues of the matrices that have been studied in the existing literature on covariance estimation are usually assumed to be either bounded or slowly growing. Because of this distinctive feature, the common components and the idiosyncratic components can be identified and, in addition, PCA on the sample covariance matrix can consistently estimate the space that is spanned by the eigenvectors of Σ. The existing methods of either thresholding directly or solving a constrained optimization method can fail in the presence of very spiked principal eigenvalues. However, there is a price to pay here: as the first K eigenvalues are ‘too spiked’, one can hardly obtain a satisfactory rate of convergence for estimating Σ in absolute terms, but it can be estimated accurately in relative terms (see Section 'Convergence of POET' for details). In addition, inline image can be estimated accurately.

We would like to note further that the low rank plus sparse representation of our model is on the population covariance matrix, whereas Candès et al. (2011), Wright et al. (2009) and Lin et al. (2009) considered such a representation on the data matrix. (We thank a referee for reminding us about these related works.) As there is no Σ to estimate, their goal is limited to producing a low rank plus sparse matrix decomposition of the data matrix, which corresponds to the identifiability issue of our study, and does not involve estimation and inference. In contrast, our ultimate goal is to estimate the population covariance matrices as well as the precision matrices. For this, we require the idiosyncratic components and common factors to be uncorrelated and the data-generating process to be strictly stationary. The covariances that are considered in this paper are constant over time, though slowly time varying covariance matrices are applicable through localization in time (time domain smoothing). Our consistency result on Σ̂_u demonstrates that decomposition (1.3) is identifiable, and hence our results also shed light on the ‘surprising phenomenon’ of Candès et al. (2011) that one can separate fully a sparse matrix from a low rank matrix when only the sum of these two components is available.

The rest of the paper is organized as follows. Section 'Regularized covariance matrix via principal components analysis' gives our estimation procedures and builds the relationship between the PCA and the factor analysis in high dimensional space. Section 'Asymptotic properties' provides the asymptotic theory for various estimated quantities. Section 'Choice of threshold' illustrates how to choose the thresholds by using cross-validation and guarantees the positive definiteness in any finite sample. Specific applications of regularized covariance matrices are given in Section 'Applications of POET'. Numerical results are reported in Section 'Monte Carlo experiments'. Finally, Section 'Real data example' presents a real data application on portfolio allocation. All proofs are given in Appendix A. Throughout the paper, we use λ_min(A) and λ_max(A) to denote the minimum and maximum eigenvalues of a matrix A. We also denote by ‖A‖_F, ‖A‖, ‖A‖_1 and ‖A‖_max the Frobenius norm, spectral norm (also called the operator norm), l_1-norm and elementwise norm of a matrix A, defined respectively by ‖A‖_F = tr^{1/2}(A′A), ‖A‖ = λ_max^{1/2}(A′A), ‖A‖_1 = max_j ∑_i |a_ij| and ‖A‖_max = max_{i,j} |a_ij|. When A is a vector, both ‖A‖_F and ‖A‖ are equal to the Euclidean norm. Finally, for two sequences, we write a_T ≪ b_T if a_T = o(b_T), and a_T ≍ b_T if a_T = O(b_T) and b_T = O(a_T).

The programs that were used to analyse the data can be obtained from

http://www.blackwellpublishing.com/rss

2. Regularized covariance matrix via principal components analysis

There are three main objectives of this paper:

  1. to understand the relationship between PCA and high dimensional factor analysis;
  2. to estimate both covariance matrices Σ and the idiosyncratic covariance Σ_u and their precision matrices in the presence of common factors;
  3. to investigate the effect of estimating the unknown factors on the covariance estimation.

The propositions in Section 'High dimensional principal components analysis and factor model' show that the space that is spanned by the principal components in the population level Σ is close to the space that is spanned by the columns of the factor loading matrix B.

2.1. High dimensional principal components analysis and factor model

Consider a factor model

y_it = b_i′f_t + u_it,   i ≤ p, t ≤ T,

where the number of common factors, K, is small compared with p and T, and thus is assumed to be fixed throughout the paper. In the model, the only observable variable is the data y_it. One of the distinguished features of the factor model is that the principal eigenvalues of Σ are no longer bounded, but grow fast with the dimensionality. We illustrate this in the following example.

2.1.1. Example 1

Consider a single-factor model y_it = b_i f_t + u_it, where b_i is a scalar loading. Suppose that the factor is pervasive in the sense that it has a non-negligible effect on a non-vanishing proportion of outcomes. It is then reasonable to assume that p^{-1} ∑_{i≤p} b_i² > c > 0 for some c>0. Therefore, assuming that var(f_t) = 1, an application of decomposition (1.3) yields

display math

for all large p, assuming that inline image.

We now elucidate why PCA can be used for the factor analysis in the presence of spiked eigenvalues. Write B = (b_1,…,b_p)′ as the p×K loading matrix. Note that the linear space that is spanned by the first K principal components of B cov(f_t)B′ is the same as that spanned by the columns of B when cov(f_t) is non-degenerate. Thus, we can assume without loss of generality that the columns of B are orthogonal and cov(f_t) = I_K, the identity matrix. This canonical form corresponds to the identifiability condition in decomposition (1.3). Let b̃_1,…,b̃_K be the columns of B, ordered such that {‖b̃_j‖} is in a non-increasing order. Then, {b̃_j/‖b̃_j‖} are eigenvectors of the matrix BB′ with eigenvalues {‖b̃_j‖²} and the rest 0. We shall impose the pervasiveness assumption that all eigenvalues of the K×K matrix p^{-1}B′B are bounded away from 0, which holds if the factor loadings {b_i} are independent realizations from a non-degenerate population. Since the non-vanishing eigenvalues of the matrix BB′ are the same as those of B′B, from the pervasiveness assumption it follows that {‖b̃_j‖²} are all growing at rate O(p).

Let {λ_j}_{j=1}^p be the eigenvalues of Σ in descending order and {ξ_j}_{j=1}^p be their corresponding eigenvectors. Then, an application of Weyl's eigenvalue theorem (see Appendix A) yields the following proposition.

Proposition 1. Assume that the eigenvalues of p^{-1}B′B are bounded away from 0 for all large p. For the factor model (1.3) with the canonical condition

cov(f_t) = I_K and B′B is diagonal,   (2.1)

we have

|λ_j − ‖b̃_j‖²| ≤ ‖Σ_u‖   for j ≤ K.

In addition, for j > K, λ_j ≤ ‖Σ_u‖.

Using proposition 1 and the sin(θ) theorem of Davis and Kahan (1970) (see their appendix), we have the following proposition.

Proposition 2. Under the assumptions of proposition 1, if inline image are distinct, then

display math

Propositions 1 and 2 state that PCA and factor analysis are approximately the same if ‖Σ_u‖ = o(p). This is assured through a sparsity condition on Σ_u, which is frequently measured through

m_p = max_{i≤p} ∑_{j≤p} |σ_u,ij|^q.   (2.2)

The intuition is that, after taking out the common factors, many pairs of the cross-sectional units become weakly correlated. This generalized notion of sparsity was used in Bickel and Levina (2008) and Cai and Liu (2011). Under this generalized measure of sparsity, we have

‖Σ_u‖ ≤ ‖Σ_u‖_1 ≤ m_p max_{i≤p} (σ_u,ii)^{1−q} = O(m_p)

if the noise variances {σ_u,ii} are bounded. Therefore, when m_p = o(p), proposition 1 implies that the eigenvalues associated with the principal components, {λ_j}_{j≤K}, are well separated from the rest of the eigenvalues {λ_j}_{j>K}, and proposition 2 ensures that the first K principal components are approximately the same as the columns of the factor loadings.
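To make propositions 1 and 2 concrete, the following minimal simulation sketch (ours, not part of the paper; all variable names are illustrative) generates Σ = BB′ + Σ_u with pervasive loadings and a diagonal Σ_u, and checks numerically that the first K eigenvalues grow at rate p while the leading eigenvectors align with the normalized columns of B.

```python
import numpy as np

rng = np.random.default_rng(0)
p, K = 400, 3

# pervasive loadings: rows of B are draws from a non-degenerate population,
# so the eigenvalues of p^{-1} B'B stay bounded away from 0
B = rng.normal(size=(p, K))
Sigma_u = np.diag(rng.uniform(0.5, 1.5, size=p))   # a simple (diagonal, hence sparse) Sigma_u
Sigma = B @ B.T + Sigma_u                          # decomposition (1.3) with cov(f_t) = I_K

eigval, eigvec = np.linalg.eigh(Sigma)
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]     # descending order

# proposition 1: the first K eigenvalues are of order p, the rest are O(||Sigma_u||)
print(eigval[:K] / p)                              # roughly the eigenvalues of p^{-1} B'B
print(eigval[K], np.linalg.norm(Sigma_u, 2))

# proposition 2: leading eigenvectors are close to the normalized columns of B,
# after rotating B to the canonical form in which B'B is diagonal
_, _, Vt = np.linalg.svd(B, full_matrices=False)
B_tilde = B @ Vt.T                                 # columns orthogonal, norms non-increasing
for j in range(K):
    b_j = B_tilde[:, j] / np.linalg.norm(B_tilde[:, j])
    print(min(np.linalg.norm(eigvec[:, j] - b_j), np.linalg.norm(eigvec[:, j] + b_j)))
```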

The aforementioned sparsity assumption appears reasonable in empirical applications. Boivin and Ng (2006) conducted an empirical study and showed that imposing zero correlation between weakly correlated idiosyncratic components improves the forecast. (We thank a referee for this interesting reference.) More recently, Phan (2012) empirically estimated the level of sparsity of the idiosyncratic covariance by using UK market data.

Recent developments on random-matrix theory, e.g. Johnstone and Lu (2009) and Paul (2007), have shown that, when p/T is not negligible, the eigenvalues and eigenvectors of Σ might not be consistently estimated from the sample covariance matrix. A distinguished feature of the covariance that is considered in this paper is that there are some very spiked eigenvalues. By propositions 1 and 2, in the factor model, the pervasiveness condition

display math(2.3)

implies that the first K eigenvalues are growing at a rate p. Moreover, when p is large, the principal components inline image are close to the normalized vectors inline image when inline image. This provides the mathematics for using the first K principal components as a proxy for the space that is spanned by the columns of the factor loading matrix B. In addition, because of condition (2.3), the signals of the first K eigenvalues are stronger than those of the spiked covariance model that was considered by Jung and Marron (2009) and Birnbaum et al. (2012). Therefore, our other conditions for the consistency of principal components at the population level are much weaker than those in the spiked covariance literature. However, this also shows that, under our setting, PCA is a valid approximation to factor analysis only if p→∞. The fact that PCA on the sample covariance is inconsistent when p is bounded has also previously been demonstrated in the literature (see, for example, Bai (2003)).

With assumption (2.3), the standard literature on approximate factor models has shown that PCA on the sample covariance matrix inline image can consistently estimate the space that is spanned by the factor loadings (e.g. Stock and Watson (1998) and Bai (2003)). Our contribution in propositions 1 and 2 is that we connect the high dimensional factor model to the principal components and obtain the consistency of the spectrum in the population level Σ instead of the sample level inline image. The spectral consistency also enhances the results in Chamberlain and Rothschild (1983). This provides the rationale behind the consistency results in the factor model literature.

2.2. Principal orthogonal complement thresholding

A sparsity assumption directly on Σ is inappropriate in many applications owing to the presence of common factors. Instead, we propose a non-parametric estimator of Σ based on PCA. Let λ̂_1 ≥ λ̂_2 ≥ … ≥ λ̂_p be the ordered eigenvalues of the sample covariance matrix Σ̂_sam and {ξ̂_i}_{i=1}^p be their corresponding eigenvectors. Then the sample covariance has the following spectral decomposition:

Σ̂_sam = ∑_{i=1}^K λ̂_i ξ̂_i ξ̂_i′ + R̂,   (2.4)

where R̂ = ∑_{i=K+1}^p λ̂_i ξ̂_i ξ̂_i′ is the principal orthogonal complement, and K is the number of diverging eigenvalues of Σ. Let us first assume that K is known.

Now we apply thresholding on R̂. Define

R̂^𝒯 = (r̂^𝒯_ij)_{p×p},   r̂^𝒯_ij = r̂_ii if i = j, and s_ij(r̂_ij) if i ≠ j,   (2.5)

where s_ij(·) is a generalized shrinkage function of Antoniadis and Fan (2001), employed by Rothman et al. (2009) and Cai and Liu (2011), and τ_ij > 0 is an entry-dependent threshold. In particular, the hard thresholding rule s_ij(x) = x 1(|x| ≥ τ_ij) (Bickel and Levina, 2008) and the constant thresholding parameter τ_ij = δ are allowed. In practice, it is more desirable to have τ_ij entry adaptive. An example of the adaptive thresholding is

τ_ij = τ (r̂_ii r̂_jj)^{1/2},   (2.6)

where r̂_ii is the ith diagonal element of R̂. This corresponds to applying the thresholding with parameter τ to the correlation matrix of R̂.

The estimator of Σ is then defined as

Σ̂^𝒯_K = ∑_{i=1}^K λ̂_i ξ̂_i ξ̂_i′ + R̂^𝒯.   (2.7)

We shall call this estimator the principal orthogonal complement thresholding estimator POET. It is obtained by thresholding the remaining components of the sample covariance matrix, after taking out the first K principal components. One of the attractive features of POET is that it is optimization free and hence is computationally appealing. (We have written an R package for POET, which outputs the estimated Σ, Σ_u, K, the factors and the loadings.)
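The following minimal Python sketch (ours, not the authors' R package; the function name and the hard thresholding choice are illustrative) implements the construction in expressions (2.4)–(2.7) with the correlation-based threshold (2.6).

```python
import numpy as np

def poet(Y, K, tau=0.5):
    """Minimal POET sketch: Y is p x T (rows = series, columns = time),
    K is the number of factors and tau the thresholding constant in (2.6)."""
    p, T = Y.shape
    Y = Y - Y.mean(axis=1, keepdims=True)
    S = Y @ Y.T / T                                  # sample covariance matrix
    eigval, eigvec = np.linalg.eigh(S)
    eigval, eigvec = eigval[::-1], eigvec[:, ::-1]   # descending order

    # low rank part from the first K principal components, decomposition (2.4)
    low_rank = (eigvec[:, :K] * eigval[:K]) @ eigvec[:, :K].T
    R = S - low_rank                                 # principal orthogonal complement

    # entry-adaptive hard thresholding of R as in (2.5)-(2.6)
    d = np.sqrt(np.outer(np.diag(R), np.diag(R)))
    R_thr = np.where(np.abs(R) >= tau * d, R, 0.0)
    np.fill_diagonal(R_thr, np.diag(R))              # keep the diagonal untouched

    return low_rank + R_thr                          # POET estimator, expression (2.7)
```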

With the choice of τ_ij in expression (2.6) and the hard thresholding rule, our estimator encompasses many popular estimators as its specific cases. When τ=0, the estimator is the sample covariance matrix and, when τ=1, the estimator becomes that based on the strict factor model (Fan et al., 2008). When K=0, our estimator is the same as the thresholding estimator of Bickel and Levina (2008) and (with a more general thresholding function) Rothman et al. (2009) or the adaptive thresholding estimator of Cai and Liu (2011) with a proper choice of τ_ij.

In practice, the number of diverging eigenvalues (or common factors) can be estimated on the basis of the sample covariance matrix. Determining K in a data-driven way is an important topic and is well understood in the literature. We shall describe the estimator POET with a data-driven K in Section 'Principal orthogonal complement thresholding with unknown K'.

2.3. Least squares point of view

The estimator POET (2.7) has an equivalent representation using a constrained least squares method. The least squares method seeks B̂ = (b̂_1,…,b̂_p)′ and F̂′ = (f̂_1,…,f̂_T) such that

(B̂, F̂) = arg min_{b_i, f_t} ∑_{i=1}^p ∑_{t=1}^T (y_it − b_i′f_t)²,   (2.8)

subject to the normalization

T^{-1} ∑_{t=1}^T f_t f_t′ = I_K and B′B is diagonal.   (2.9)

The constraints (2.9) correspond to the normalization (2.1). Here we assume that the mean of each variable has been removed, i.e. E(y_it) = E(f_jt) = 0 for all i ≤ p, j ≤ K and t ≤ T. Putting it in a matrix form, the optimization problem can be written as

min_{B,F} ‖Y − BF′‖_F²,   (2.10)

where Y = (y_1,…,y_T) is p×T and F′ = (f_1,…,f_T) is K×T. For each given F, the least squares estimator of B is B̂ = T^{-1}YF, using the constraint (2.9) on the factors. Substituting this into problem (2.10), the objective function now becomes tr{(I_T − T^{-1}FF′)Y′Y}. The minimizer is now clear: the columns of F̂/√T are the eigenvectors corresponding to the K largest eigenvalues of the T×T matrix Y′Y and B̂ = T^{-1}YF̂ (see, for example, Stock and Watson (2002)).
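A minimal sketch of this constrained least squares solution, assuming the rows of Y have already been demeaned; the function name is ours.

```python
import numpy as np

def estimate_factors(Y, K):
    """Constrained least squares solution of (2.10): the columns of F_hat are
    sqrt(T) times the top-K eigenvectors of the T x T matrix Y'Y, and
    B_hat = T^{-1} Y F_hat. Y is p x T with each row demeaned."""
    p, T = Y.shape
    YtY = Y.T @ Y                                   # T x T, cheaper than p x p when p >> T
    eigval, eigvec = np.linalg.eigh(YtY)
    F_hat = np.sqrt(T) * eigvec[:, ::-1][:, :K]     # T x K, satisfies F'F/T = I_K
    B_hat = Y @ F_hat / T                           # p x K loadings
    U_hat = Y - B_hat @ F_hat.T                     # residual matrix of u_hat_{it}
    return F_hat, B_hat, U_hat
```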

We shall show that, under some mild regularity conditions, as p and T→∞ the estimates b̂_i and f̂_t consistently estimate the true loadings and factors (up to a rotation) uniformly over i ≤ p and t ≤ T. Since Σ_u is assumed to be sparse, we can construct an estimator of Σ_u by using the adaptive thresholding method of Cai and Liu (2011) as follows. Let û_it = y_it − b̂_i′f̂_t, σ̂_ij = T^{-1} ∑_{t=1}^T û_it û_jt and θ̂_ij = T^{-1} ∑_{t=1}^T (û_it û_jt − σ̂_ij)². For some predetermined decreasing sequence ω_T > 0, and sufficiently large C>0, define the adaptive threshold parameter as τ_ij = Cω_T √(θ̂_ij). The estimated idiosyncratic covariance estimator is then given by

Σ̂^𝒯_u = (σ̂^𝒯_ij)_{p×p},   σ̂^𝒯_ij = σ̂_ii if i = j, and s_ij(σ̂_ij) if i ≠ j,   (2.11)

where, for all z ∈ ℝ (see Antoniadis and Fan (2001)),

s_ij(z) = 0 when |z| ≤ τ_ij,   and   |s_ij(z) − z| ≤ τ_ij.

It is easy to verify that s_ij(·) includes many interesting thresholding functions such as hard thresholding (s_ij(z) = z 1(|z| ≥ τ_ij)), soft thresholding (s_ij(z) = sgn(z)(|z| − τ_ij)_+), smoothly clipped absolute deviation and the adaptive lasso (see Rothman et al. (2009)).
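For concreteness, here are sketches of three such shrinkage rules (hard, soft and smoothly clipped absolute deviation); the SCAD parameter a=3.7 is the conventional default rather than a choice made in the paper.

```python
import numpy as np

def hard(z, tau):
    """Hard thresholding: keep z if |z| exceeds the threshold, else zero."""
    return z * (np.abs(z) >= tau)

def soft(z, tau):
    """Soft thresholding: shrink towards zero by tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def scad(z, tau, a=3.7):
    """Smoothly clipped absolute deviation shrinkage."""
    z = np.asarray(z, dtype=float)
    out = soft(z, tau)                                       # region |z| <= 2*tau
    mid = (np.abs(z) > 2 * tau) & (np.abs(z) <= a * tau)
    out = np.where(mid, ((a - 1) * z - np.sign(z) * a * tau) / (a - 2), out)
    return np.where(np.abs(z) > a * tau, z, out)             # no shrinkage for large |z|
```

Each rule satisfies s_ij(z) = 0 for |z| ≤ τ_ij and |s_ij(z) − z| ≤ τ_ij, as required above.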

Analogous to the decomposition (1.3), we obtain the following substitution estimators:

Σ̂ = B̂B̂′ + Σ̂^𝒯_u   (2.12)

and, by the Sherman–Morrison–Woodbury formula, noting that T^{-1} ∑_{t=1}^T f̂_t f̂_t′ = I_K,

Σ̂^{-1} = (Σ̂^𝒯_u)^{-1} − (Σ̂^𝒯_u)^{-1} B̂ {I_K + B̂′(Σ̂^𝒯_u)^{-1} B̂}^{-1} B̂′ (Σ̂^𝒯_u)^{-1}.   (2.13)
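A small sketch of how expression (2.13) is used in practice: the p×p inverse is assembled from (Σ̂^𝒯_u)^{-1} and a single K×K inversion. The function name is ours, and the inverse of the idiosyncratic covariance is assumed to be available (e.g. from a sparse or diagonal estimate).

```python
import numpy as np

def precision_from_factors(B_hat, Sigma_u_inv):
    """Sherman-Morrison-Woodbury form of (2.13):
    (B B' + Sigma_u)^{-1} = Sigma_u^{-1}
        - Sigma_u^{-1} B (I_K + B' Sigma_u^{-1} B)^{-1} B' Sigma_u^{-1},
    which only requires a K x K inversion once Sigma_u^{-1} is available."""
    K = B_hat.shape[1]
    SuB = Sigma_u_inv @ B_hat                          # p x K
    core = np.linalg.inv(np.eye(K) + B_hat.T @ SuB)    # K x K
    return Sigma_u_inv - SuB @ core @ SuB.T
```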

In practice, the true number of factors K might be unknown to us. However, for any determined inline image, we can always construct either inline image as in estimator (2.7) or inline image as in estimator (2.12) to estimate inline image. The following theorem shows that, for each given inline image, the two estimators based on either regularized PCA or least squares substitution are equivalent. Similar results were obtained by Bai (2003) when inline image and no thresholding was imposed.

Theorem 1. Suppose that the entry-dependent threshold in definition (2.5) is the same as the thresholding parameter that is used in expression (2.11). Then, for any inline image, estimator (2.7) is equivalent to the substitution estimator (2.12), i.e.

display math

In this paper, we shall use a data-driven inline image to construct POET (see Section 'Principal orthogonal complement thresholding with unknown K'), which has two equivalent representations according to theorem 1.

2.4. Principal orthogonal complement thresholding with unknown K

Determining the number of factors in a data-driven way has been an important research topic in the econometrics literature. Bai and Ng (2002) proposed a consistent estimator as both p and T diverge. Other recent criteria have been proposed by Kapetanios (2010), Onatski (2010) and Alessi et al. (2010), among others.

Our method also allows a data-driven K̂ to estimate the covariance matrices. In principle, any procedure that gives a consistent estimate of K can be adopted. In this paper we apply the well-known method in Bai and Ng (2002). It estimates K by

K̂ = arg min_{0≤k≤M} [ log{(pT)^{-1}‖Y − T^{-1}F̂^k(F̂^k)′Y‖_F²} + k g(T,p) ],   (2.14)

where M is a prescribed upper bound, F̂^k is a T×k matrix whose columns are √T times the eigenvectors corresponding to the k largest eigenvalues of the T×T matrix Y′Y and g(T,p) is a penalty function of (p,T) such that g(T,p)=o(1) and min{p,T} g(T,p)→∞. Two examples suggested by Bai and Ng (2002), IC1 and IC2, are respectively

IC1:  g(T,p) = {(p+T)/(pT)} log{pT/(p+T)},
IC2:  g(T,p) = {(p+T)/(pT)} log{min(p,T)}.

Throughout the paper, we let K̂ be the solution to problem (2.14) by using either IC1 or IC2. The asymptotic results are not affected regardless of the specific choice of g(T,p). We define the POET-estimator with unknown K as

Σ̂^𝒯 = Σ̂^𝒯_{K̂} = ∑_{i=1}^{K̂} λ̂_i ξ̂_i ξ̂_i′ + R̂^𝒯.   (2.15)

The procedure is as stated in Section 'Principal orthogonal complement thresholding' except that K is now the data-driven K̂.
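A minimal sketch of the data-driven choice of K̂ via criterion (2.14) with the IC1 penalty; the function name and the default upper bound M are illustrative.

```python
import numpy as np

def estimate_K(Y, M=10):
    """Data-driven number of factors via criterion (2.14) with the IC1 penalty.
    Y is the p x T data matrix with demeaned rows; M is a prescribed upper bound."""
    p, T = Y.shape
    _, eigvec = np.linalg.eigh(Y.T @ Y)
    eigvec = eigvec[:, ::-1]                              # descending eigenvalue order
    g = (p + T) / (p * T) * np.log(p * T / (p + T))       # IC1 penalty g(T, p)
    best_k, best_val = 0, np.inf
    for k in range(M + 1):
        F_k = np.sqrt(T) * eigvec[:, :k]                  # T x k, columns sqrt(T) * eigenvectors
        resid = Y - (Y @ F_k) @ F_k.T / T                 # residual after removing k factors
        val = np.log((resid ** 2).sum() / (p * T)) + k * g
        if val < best_val:
            best_k, best_val = k, val
    return best_k
```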

3. Asymptotic properties

3.1. Assumptions

This section presents the assumptions on model (1.2), in which only {y_t}_{t=1}^T are observable. Recall the identifiability condition (2.1).

The first assumption has been one of the most essential in the literature of approximate factor models. Under this assumption and other regularity conditions, the number of factors, loadings and common factors can be consistently estimated (e.g. Stock and Watson (1998, 2002), Bai and Ng (2002) and Bai (2003)).

Assumption 1. All the eigenvalues of the K×K matrix p^{-1}B′B are bounded away from both 0 and ∞ as p→∞.

Remark 1.

  1. It is implied from proposition 1 in Section 'Regularized covariance matrix via principal components analysis' that the first K eigenvalues of Σ grow at rate O(p). This unique feature distinguishes our work from most of the other work on low rank plus sparse covariances that has been considered in the literature, e.g. Luo (2011), Pati et al. (2012), Agarwal et al. (2012) and Birnbaum et al. (2012). (To our best knowledge, the only other references that estimate large covariances with diverging eigenvalues (growing at the rate of dimensionality O(p)) are Fan et al. (2008, 2011a) and Bai and Shi (2011). Whereas Fan et al. (2008, 2011a) assumed that the factors are observable, Bai and Shi (2011) considered the strict factor model in which Σ_u is diagonal.)
  2. Assumption 1 requires the factors to be pervasive, i.e. to impact a non-vanishing proportion of individual time series. See example 1 in Section 2.1.1 for its meaning. (It is important to distinguish the model that we consider in this paper from the ‘sparse factor model’ in the literature, e.g. Carvalho et al. (2008) and Pati et al. (2012), which assumes that the loading matrix B is sparse. The intuition of a sparse loading matrix is that each factor is related to only a relatively small number of stocks, assets, genes, etc. With B being sparse, all the eigenvalues of inline image and hence those of Σ are bounded.)
  3. As to be illustrated in Section 'Convergence of POET' below, owing to the fast diverging eigenvalues, we can hardly achieve a good rate of convergence for estimating Σ under either the spectral norm or Frobenius norm when p>T. This phenomenon arises naturally from the characteristics of the high dimensional factor model, which is another distinguished feature compared with those convergence results in the existing literature.

Assumption 2.

  1. {u_t, f_t} is strictly stationary. In addition, E(u_it) = E(u_it f_jt) = 0 for all i ≤ p, j ≤ K and t ≤ T.
  2. There are constants inline image such that inline image, inline image and
    display math
  3. There are inline image and inline image such that, for any s>0, i ≤ p and j ≤ K,
    display math

Condition (a) requires strict stationarity as well as the non-correlation between inline image and inline image. These conditions are slightly stronger than those in the literature, e.g. Bai (2003), but are still standard and simplify our technicalities. Condition (b) requires that inline image be well conditioned. The condition inline image instead of a weaker condition inline image is imposed here to estimate K consistently. But it is still standard in the approximate factor model literature as in Bai and Ng (2002), Bai (2003), etc. When K is known, such a condition can be removed. Fan et al. (2011b) shows that the results continue to hold for a growing (known) K under the weaker condition inline image. Condition (c) requires exponential-type tails, which allow us to apply the large deviation theory to inline image and inline image.

We impose the strong mixing condition. Let ℱ_{−∞}^0 and ℱ_T^∞ denote the σ-algebras that are generated by {(f_t, u_t): t ≤ 0} and {(f_t, u_t): t ≥ T} respectively. In addition, define the mixing coefficient

α(T) = sup_{A∈ℱ_{−∞}^0, B∈ℱ_T^∞} |P(A) P(B) − P(A∩B)|.   (3.1)

Assumption 3. (strong mixing.) There exists inline image such that inline image, and C>0 satisfying, for all inline image,

display math

In addition, we impose the following regularity conditions.

Assumption 4. There exists M>0 such that, for all i ≤ p, t ≤ T and s ≤ T,

  1. inline image,
  2. inline image and
  3. inline image.

These conditions are needed to estimate consistently the transformed common factors as well as the factor loadings. Similar conditions were also assumed in Bai (2003) and Bai and Ng (2006). The number of factors is assumed to be fixed. Our conditions in assumption 4 are weaker than those in Bai (2003) as we focus on different aspects of the study.

3.2. Convergence of the idiosyncratic covariance

Estimating the covariance matrix inline image of the idiosyncratic components inline image is important for many statistical inferences. For example, it is needed for large sample inference of the unknown factors and their loadings, for testing the capital asset pricing model (Sentana, 2009) and large-scale hypothesis testing (Fan et al., 2012). See Section 'Applications of POET'.

We estimate Σ_u by thresholding the principal orthogonal complement after the first K̂ principal components of the sample covariance have been taken out: Σ̂^𝒯_u = R̂^𝒯, with K̂ in place of K in decomposition (2.4). By theorem 1, it also has an equivalent expression given by estimator (2.11). Throughout the paper, we apply the adaptive threshold

τ_ij = Cω_T √(θ̂_ij),   where ω_T = √{(log p)/T} + 1/√p,   (3.2)

where C>0 is a sufficiently large constant, though the results hold for other types of thresholding. As in Bickel and Levina (2008) and Cai and Liu (2011), the threshold that is chosen in the current paper is in fact obtained from the optimal uniform rate of convergence of max_{i,j≤p} |σ̂_ij − σ_u,ij|. When direct observation of u_it is not available, the effect of estimating the unknown factors also contributes to this uniform estimation error, which is why 1/√p appears in the threshold.

The following theorem gives the rate of convergence of the estimated idiosyncratic covariance. Let inline image. In the convergence rate below, recall that inline image and q are defined in the measure of sparsity (2.2).

Theorem 2. Suppose that inline image, inline image and assumptions 1–4 hold. Then, for a sufficiently large constant C>0 in the threshold (3.2), the POET-estimator inline image satisfies

‖Σ̂^𝒯_u − Σ_u‖ = O_p(ω_T^{1−q} m_p).

If further ω_T^{1−q} m_p = o(1), then the eigenvalues of Σ̂^𝒯_u are all bounded away from 0 with probability approaching 1, and

‖(Σ̂^𝒯_u)^{-1} − Σ_u^{-1}‖ = O_p(ω_T^{1−q} m_p).

When estimating Σ_u, p is allowed to grow exponentially fast in T, and Σ̂^𝒯_u can be made consistent under the spectral norm. In addition, Σ̂^𝒯_u is asymptotically invertible whereas the classical sample covariance matrix based on the residuals is not when p>T.

Remark 2.

  1. Consistent estimation of inline image indicates that inline image is identifiable in model (1.3), namely the sparse inline image can be separated perfectly from the low rank matrix there. The result here gives another proof (when assuming that inline image) of the ‘surprising phenomenon’ in Candès et al. (2011) under different technical conditions.
  2. Fan et al. (2011a) recently showed that, when inline image are observable and q=0, the rate of convergence of the adaptive thresholding estimator is given by
    display math
  3. Hence, when the common factors are unobservable, the rate of convergence has an additional term inline image, coming from the effect of estimating the unknown factors. This effect vanishes when p  log (p)≫T, in which case the minimax rate as in Cai and Zhou (2012) is achieved. As p increases, more information about the common factors is collected, which results in more accurate estimation of the common factors inline image.
  4. When K is known and grows with p and T, with slightly weaker assumptions, Fan et al. (2011b) shows that, under the exactly sparse case (i.e. q=0), the result continues to hold with convergence rate
    display math

3.3. Convergence of POET

Since the first K eigenvalues of Σ grow with p, we can hardly estimate Σ with satisfactory accuracy in absolute terms. This problem does not arise from the limitation of any estimation method but is due to the nature of the high dimensional factor model. We illustrate this by using a simple example.

3.3.1. Example 2

Consider an ideal case where we know the spectrum except for the first eigenvector of Σ. Let inline image be the eigenvalues and vectors, and assume that the largest eigenvalue inline image for some c>0. Let inline image be the estimated first eigenvector and define the covariance estimator inline image Assume that inline image is a good estimator in the sense that inline image. However,

display math

which can diverge when inline image.

In the presence of very spiked eigenvalues, although the covariance Σ cannot be consistently estimated in absolute terms, it can be well estimated in terms of the relative error matrix

Σ^{−1/2} Σ̂ Σ^{−1/2} − I_p,

which is more relevant for many applications (see example 4 in Section 'Applications of POET'). The relative error matrix can be measured by either its spectral norm or the normalized Frobenius norm defined by

‖Σ̂ − Σ‖_Σ = p^{−1/2} ‖Σ^{−1/2} Σ̂ Σ^{−1/2} − I_p‖_F.   (3.3)

In equality (3.3), there are p terms being added in the trace operation and the factor p^{−1/2} plays the role of normalization. The loss (3.3) is closely related to the entropy loss, which was introduced by James and Stein (1961). Also note that

display math

where inline image is the weighted quadratic norm in Fan et al. (2008).

Fan et al. (2008) showed that, in a large factor model, the sample covariance is such that inline image which does not converge if p>T. In contrast, theorem 3 below shows that inline image can still be convergent as long as inline image. Technically, the effect of high dimensionality on the convergence rate of inline image is via the number of rows in B. We show in Appendix A that B appears in inline image through inline image whose eigenvalues are bounded. Therefore it successfully cancels out the curse of high dimensionality that is introduced by B.

Compared with estimating Σ, in a large approximate factor model, we can estimate the precision matrix with a satisfactory rate under the spectral norm. The intuition follows from the fact that inline image has bounded eigenvalues.

The following theorem summarizes the rate of convergence under various norms.

Theorem 3. Under the assumptions of theorem 2 the POET-estimator that is defined in equation (2.15) satisfies

display math

In addition, if inline image, then inline image is non-singular with probability approaching 1, with

display math

Remark 3.

  1. When estimating inline image, p is allowed to grow exponentially fast in T, and the estimator has the same rate of convergence as that of the estimator inline image in theorem 2. When p becomes much larger than T, the precision matrix can be estimated at the same rate as if the factors were observable.
  2. As in remark 2, when K>0 is known and grows with p and T, Fan et al. (2011a) prove the following results (when q=0):
    display math
  3. The results state explicitly the dependence of the rate of convergence on the number of factors. (The assumptions in Fan et al. (2011a) are slightly weaker than those presented here, in that they required that inline image instead of inline image be bounded.)
  4. The relative error inline image in operator norm can be shown to have the same order as the maximum relative error of estimated eigenvalues. It does not converge to 0 nor diverge. It is much smaller than inline image, which is of order p/√T (see example 2).

3.4. Convergence of unknown factors and factor loadings

Many applications of the factor model require estimating the unknown factors. In general, factor loadings in B and the common factors inline image are not separably identifiable, as, for any matrix H such that inline image, inline image. Hence inline image cannot be identified from inline image. Note that the linear space that is spanned by the rows of B is the same as that by those of inline image. In practice, it often does not matter which is used.

Let V denote the K̂×K̂ diagonal matrix of the first K̂ largest eigenvalues of the sample covariance matrix in decreasing order. Recall that inline image and define a inline image matrix inline image Then, for t ≤ T, inline image Note that inline image depends only on the data inline image and an identifiable part of parameters inline image. Therefore, there is no identifiability issue in inline image regardless of the identifiability condition imposed.

Bai (2003) obtained the rate of convergence for both inline image and inline image for any fixed (i,t). However, the uniform rate of convergence is more relevant for many applications (see example 3 in Section 'Applications of POET'). The following theorem extends those results in Bai (2003) in a uniformity sense. In particular, with a more refined technique, we have improved the uniform convergence rate for inline image.

Theorem 4. Under the assumptions of theorem 2,

display math

As a consequence of theorem 4, we obtain the following corollary (recall that the constant inline image is defined in assumption 2).

Corollary 1. Under the assumptions of theorem 2,

display math

The rates of convergence that were obtained above also explain the condition inline image in theorems 2 and 3. It is needed to estimate the common factors {f_t} uniformly in t ≤ T. When we do not observe {f_t}, in addition to the factor loadings, there are KT factors to estimate. Intuitively, the condition inline image requires the number of parameters that are introduced by the unknown factors to be ‘not too many’, so that we can consistently estimate them uniformly. Technically, as demonstrated by Bickel and Levina (2008), Cai and Liu (2011) and many others, achieving uniform accuracy is essential for large covariance estimations.

4. Choice of threshold

4.1. Finite sample positive definiteness

Recall that the threshold value τ_ij = Cω_T √(θ̂_ij), where C is determined by the users. To make POET operational in practice, we must choose C to maintain the positive definiteness of the estimated covariances for any given finite sample. We write Σ̂^𝒯 = Σ̂^𝒯(C), where the covariance estimator depends on C via the threshold. We choose C in the range where λ_min{Σ̂^𝒯(C)} > 0. Define

C_min = inf{C > 0: λ_min{Σ̂^𝒯(M)} > 0 for all M > C}.   (4.1)

When C is sufficiently large, the estimator becomes diagonal, and its minimum eigenvalue is then strictly positive. Thus, C_min is well defined and, for all C > C_min, Σ̂^𝒯(C) is positive definite under finite samples. We can obtain C_min by solving λ_min{Σ̂^𝒯(C)} = 0. We can also approximate C_min by plotting λ_min{Σ̂^𝒯(C)} as a function of C, as illustrated in Fig. 1. In practice, we can choose C in the range (C_min + ɛ, M] for a small ɛ and sufficiently large M. Choosing the threshold in a range to guarantee the finite sample positive definiteness has also been previously suggested by Fryzlewicz (2012).

Figure 1.

Minimum eigenvalue of Σ̂^𝒯(C) as a function of C for three choices of thresholding rule: hard thresholding, soft thresholding and smoothly clipped absolute deviation (the plot is based on the simulated data set in Section 6.2)
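The scan that underlies Fig. 1 can be sketched as follows; `estimator` stands for any map from the constant C to Σ̂^𝒯(C) (for example, the POET sketch of Section 2.2 with C as its thresholding constant), and the grid is illustrative.

```python
import numpy as np

def approximate_c_min(estimator, grid=np.linspace(0.0, 3.0, 61)):
    """Approximate C_min in (4.1): scan a grid of constants C, record the minimum
    eigenvalue of the resulting estimator, and return the smallest grid value beyond
    which the estimator stays positive definite. `estimator(C)` returns a p x p matrix."""
    min_eig = np.array([np.linalg.eigvalsh(estimator(C))[0] for C in grid])
    bad = min_eig <= 0
    suffix_has_bad = bad[::-1].cumsum()[::-1] > 0      # any non-PD estimate at this C or beyond?
    ok = np.where(~suffix_has_bad)[0]
    return (grid[ok[0]] if ok.size else None), grid, min_eig
```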

4.2. Multifold cross-validation

In practice, C can be data driven, and chosen through multifold cross-validation. After obtaining the estimated residuals inline image by PCA, we divide them randomly into two subsets, which are, for simplicity, denoted by inline image and inline image. The sizes of inline image and inline image, which are denoted by inline image and inline image, are inline image and inline image For example, in sparse matrix estimation, Bickel and Levina (2008) suggested the choice inline image.

We repeat this procedure H times. At the jth split, we denote by Σ̂^{𝒯,j}_u(C) the POET-estimator with the threshold Cω_T√(θ̂_ij) on the training data set J_1^j. We also denote by Σ̂^j_u the sample covariance based on the validation set, defined by Σ̂^j_u = |J_2^j|^{-1} ∑_{t∈J_2^j} û_t û_t′. Then we choose the constant Ĉ by minimizing a cross-validation objective function over a compact interval:

Ĉ = arg min_{C_min+ɛ ≤ C ≤ M} H^{-1} ∑_{j=1}^H ‖Σ̂^{𝒯,j}_u(C) − Σ̂^j_u‖_F².   (4.2)

Here C_min + ɛ is the minimum constant that guarantees the positive definiteness of Σ̂^{𝒯,j}_u(C) for C > C_min + ɛ as described in the previous subsection, and M is a large constant such that Σ̂^{𝒯,j}_u(M) is diagonal. The resulting Ĉ is data driven, so it depends on Y as well as p and T via the data. In contrast, for each given p×T data matrix Y, Ĉ is a universal constant in the threshold Ĉω_T√(θ̂_ij) in the sense that it does not change with respect to the position (i,j). We also note that the cross-validation is based on the estimate of Σ_u rather than Σ because POET thresholds the error covariance matrix. Thus cross-validation improves the performance of thresholding.
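A minimal sketch of the cross-validation objective (4.2); the even split of the time points and the use of √(σ̂_ii σ̂_jj) in place of √θ̂_ij in the threshold are simplifications, and all names are ours.

```python
import numpy as np

def choose_C(U_hat, C_grid, H=20, seed=0):
    """Cross-validate the threshold constant C as in (4.2). U_hat is the p x T matrix
    of PCA residuals; for each random split of the time points, the error covariance
    is thresholded on the training half and compared, in squared Frobenius norm, with
    the sample covariance of the validation half."""
    rng = np.random.default_rng(seed)
    p, T = U_hat.shape
    omega = np.sqrt(np.log(p) / T) + 1.0 / np.sqrt(p)     # omega_T from (3.2)
    losses = np.zeros(len(C_grid))
    for _ in range(H):
        perm = rng.permutation(T)
        train, valid = perm[: T // 2], perm[T // 2:]
        S_train = U_hat[:, train] @ U_hat[:, train].T / len(train)
        S_valid = U_hat[:, valid] @ U_hat[:, valid].T / len(valid)
        d = np.sqrt(np.outer(np.diag(S_train), np.diag(S_train)))
        for i, C in enumerate(C_grid):
            S_thr = np.where(np.abs(S_train) >= C * omega * d, S_train, 0.0)
            np.fill_diagonal(S_thr, np.diag(S_train))
            losses[i] += np.linalg.norm(S_thr - S_valid, "fro") ** 2
    return C_grid[int(np.argmin(losses))]
```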

It is possible to derive the rate of convergence for inline image under the current model setting, but it ought to be much more technically involved than the regular sparse matrix estimation that was considered by Bickel and Levina (2008) and Cai and Liu (2011). To keep our presentation simple we do not pursue it in the current paper.

5. Applications of POET

We give four examples to which the results in theorems 2–4 can be applied. Detailed pursuits of these are beyond the scope of the paper.

5.1. Example 3 (large-scale hypothesis testing)

Controlling the false discovery rate in large-scale hypothesis testing based on correlated test statistics is an important and challenging problem in statistics (Leek and Storey, 2008; Efron, 2010; Fan et al., 2012). Suppose that the test statistic for each of the hypotheses

H_0i: μ_i = 0   versus   H_1i: μ_i ≠ 0,   i = 1,…,p,

is Z_i and these test statistics Z = (Z_1,…,Z_p)′ are jointly normal N(μ,Σ) where Σ is unknown. For a given critical value x, the false discovery proportion is then defined as FDP(x)=V(x)/R(x) where V(x) and R(x) are the total number of false discoveries and the total number of discoveries respectively. Our interest is to estimate FDP(x) for each given x. Note that R(x) is an observable quantity. Only V(x) needs to be estimated.

If the covariance Σ admits the approximate factor structure (1.3), then the test statistics can be stochastically decomposed as

display math(5.1)

where inline image is sparse. By the principal factor approximation (theorem 1, Fan et al. (2012))

display math(5.2)

when inline image and the number of true significant hypotheses inline image is o(p), where inline image is the upper x-quantile of the standard normal distribution, inline image and inline image.

Now suppose that we have n repeated measurements from model (5.1). Then, by corollary 1, inline image can be uniformly consistently estimated, and hence inline image and FDP(x) can be consistently estimated. Efron (2010) obtained these repeated test statistics on the basis of the bootstrap sample from the original raw data. Our theory (theorem 4) gives a formal justification to the framework of Efron (2007, 2010).

5.2. Example 4 (risk management)

The maximum elementwise estimation error ‖Σ̂ − Σ‖_max appears in risk assessment as in Fan et al. (2012). For a fixed portfolio allocation vector w, the true portfolio variance and the estimated variance are given by w′Σw and w′Σ̂w respectively. The estimation error is bounded by

|w′Σ̂w − w′Σw| ≤ ‖Σ̂ − Σ‖_max ‖w‖_1²,

where ‖w‖_1, the l_1-norm of w, is the gross exposure of the portfolio. Usually a constraint is placed on the total percentage of the short positions, in which case we have a restriction ‖w‖_1 ≤ c for some c>0. In particular, c=1 corresponds to a portfolio with no short positions (all weights are non-negative). Theorem 3 quantifies the maximum approximation error.
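The displayed bound follows from |w′(Σ̂−Σ)w| ≤ ∑_{i,j}|w_i||w_j| ‖Σ̂−Σ‖_max = ‖Σ̂−Σ‖_max‖w‖_1², and it can be checked numerically with a small sketch (names are ours):

```python
import numpy as np

def risk_error_bound(Sigma_hat, Sigma, w):
    """Compare |w' Sigma_hat w - w' Sigma w| with its bound ||Sigma_hat - Sigma||_max * ||w||_1^2."""
    diff = Sigma_hat - Sigma
    lhs = abs(w @ diff @ w)
    rhs = np.max(np.abs(diff)) * np.sum(np.abs(w)) ** 2
    return lhs, rhs
```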

The above discussion compares the absolute error of perceived risk and true risk. The relative error is bounded by

|w′Σ̂w / (w′Σw) − 1| ≤ ‖Σ^{−1/2} Σ̂ Σ^{−1/2} − I_p‖

for any allocation vector w. Theorem 3 quantifies this relative error.

5.3. Example 5 (panel regression with a factor structure in the errors)

Consider the panel regression model

y_it = x_it′β + ε_it,   ε_it = b_i′f_t + u_it,

where x_it is a vector of observable regressors with fixed dimension. The regression error ε_it has a factor structure and is assumed to be independent of x_it, but b_i, f_t and u_it are all unobservable. We are interested in the common regression coefficients β. This panel regression model has been considered by many researchers, such as Ahn et al. (2001) and Pesaran (2006), and has broad applications in social sciences.

Although ordinary least squares produces a consistent estimator of β, a more efficient estimate can be obtained by generalized least squares. The generalized least squares method depends, however, on an estimator of the inverse of the covariance matrix of ε_t = (ε_1t,…,ε_pt)′. By assuming that the covariance matrix of u_t is sparse, we can successfully solve this problem by applying theorem 3. Although ε_it is unobservable, it can be replaced by the regression residuals ε̂_it, obtained via first regressing y_it on x_it. We then apply POET to the residuals ε̂_it. By theorem 3, the inverse of the resulting estimator is a consistent estimator of the precision matrix of ε_t under the spectral norm. A slight difference lies in the fact that, when we apply POET, ε_it is replaced with ε̂_it, which introduces an additional term in the estimation error.
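A sketch of the feasible generalized least squares step just described; `poet_precision` is a hypothetical helper that applies POET to the residual matrix and returns an estimate of the precision matrix of ε_t (e.g. built from the sketches in Section 2 and expression (2.13)).

```python
import numpy as np

def feasible_gls(Y, X, K, poet_precision):
    """Two-stage estimate of beta in the panel regression with a factor error structure.
    Y is p x T, X is p x T x d (regressors for each unit and time). After pooled OLS,
    POET is applied to the residuals and the resulting precision estimate weights the
    second-stage fit."""
    p, T, d = X.shape
    beta_ols = np.linalg.lstsq(X.reshape(p * T, d), Y.reshape(p * T), rcond=None)[0]
    E = Y - X @ beta_ols                                   # p x T residual matrix
    Omega = poet_precision(E, K)                           # estimate of the precision of eps_t
    # GLS normal equations: sum_t X_t' Omega X_t beta = sum_t X_t' Omega y_t
    A, b = np.zeros((d, d)), np.zeros(d)
    for t in range(T):
        Xt = X[:, t, :]                                    # p x d design at time t
        A += Xt.T @ Omega @ Xt
        b += Xt.T @ Omega @ Y[:, t]
    return np.linalg.solve(A, b)
```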

5.4. Example 6 (validating an asset pricing theory)

A celebrated financial economic theory is the capital asset pricing model (Sharpe, 1964) that helped William Sharpe to win the Nobel prize in economics in 1990, whose extension is the multifactor model (Ross, 1976; Chamberlain and Rothschild, 1983). It states that, in a frictionless market, the excessive return of any financial asset equals the excessive returns of the risk factors times its factor loadings plus noise. In the multiperiod model, the excess return y_it of firm i at time t follows model (1.1), in which f_t are the excess returns of the risk factors at time t. To test the implied null hypothesis of no mispricing, we embed the model into the multivariate linear model

y_t = α + Bf_t + u_t,   t = 1,…,T,   (5.3)

and wish to test H_0: α = 0. The F-test statistic involves the estimation of the covariance matrix Σ_u, whose estimates are degenerate without regularization when p ≥ T. Therefore, in the literature (Sentana (2009), and references therein), the focus is on the case in which p is relatively small. The typical choices of parameters are T=60 monthly data and the number of assets p=5, 10, 25. However, the capital asset pricing model should hold for all tradeable assets, not just a small fraction of assets. With our regularization technique, a non-degenerate estimate Σ̂^𝒯_u can be obtained and the F-test or likelihood ratio test statistics can be employed even when p ≥ T.

To provide some insights, let inline image be the least squares estimator of model (5.3). Then, when inline image, inline image for a constant inline image which depends on the observed factors. When inline image is known, the Wald test statistic is inline image. When it is unknown and p is large, it is natural to use the F-type of test statistic inline image. The difference between these two statistics is bounded by

display math

Since under the null hypothesis inline image, we have inline image. Thus, it follows from boundedness of inline image that inline image Theorem 2 provides the rate of convergence for this difference. Detailed development is out of the scope of the current paper, and we shall leave it as a separate research project (see Pesaran and Yamagata (2012)).

6. Monte Carlo experiments

In this section, we shall examine the performance of POET in a finite sample. We shall also demonstrate the effect of this estimator on asset allocation and risk assessment. Similarly to Fan et al. (2008, 2011a), we simulated from a standard Fama–French three-factor model, assuming a sparse error covariance matrix and three factors. Throughout this section, the timespan is fixed at T=300, and the dimensionality p increases from 1 to 600. We assume that the excess returns of each of p stocks over the risk-free interest rate follow the model

y_it = b_i1 f_1t + b_i2 f_2t + b_i3 f_3t + u_it,   i = 1,…,p, t = 1,…,T.

The factor loadings b_i = (b_i1, b_i2, b_i3)′ are drawn from a trivariate normal distribution N_3(μ_B, Σ_B), the idiosyncratic errors from N_p(0, Σ_u), and the factor returns f_t follow a vector auto-regressive VAR(1) model. To make the simulation more realistic, model parameters are calibrated from the financial returns, as detailed in the following section.

6.1. Calibration

To calibrate the model, we use the data on annualized returns of 100 industrial portfolios from the Web site of Kenneth French, and the data on 3-month Treasury bill rates from the Center for Research in Security Prices database. These industrial portfolios are formed as the intersection of 10 portfolios based on size (market equity) and 10 portfolios based on the book equity to market equity ratio. Their excess returns inline image are computed for the period from January 1st, 2009, to December 31st, 2010. Here, we present a short outline of the calibration procedure.

  1. Given {y_t}_{t=1}^T as the input data, we fit a Fama–French three-factor model and calculate a 100×3 loading matrix B̂ and a 500×3 factor matrix F̂, using the principal components method that was described in Section 'Least squares point of view'.
  2. We summarize the 100 factor loadings (the rows of B̂) by their sample mean vector μ_B and sample covariance matrix Σ_B, which are reported in Table 1. The factor loadings b_i, i=1,…,p, are then drawn from N_3(μ_B, Σ_B).
  3. We fit the stationary vector auto-regressive model f_t = μ + Φf_{t−1} + ε_t, a VAR(1) model, to the data F̂ to obtain the multivariate least squares estimators of μ and Φ, and we estimate the innovation covariance Σ_ε. Note that all eigenvalues of Φ in Table 2 fall within the unit circle, so our model is stationary. The covariance matrix cov(f_t) can be obtained by solving the linear equation cov(f_t) = Φ cov(f_t)Φ′ + Σ_ε. The estimated parameters are depicted in Table 2 and are used to generate f_t.
  4. For each value of p, we generate a sparse covariance matrix Σ_u of the form
    Σ_u = DΣ_0D.
    Here, Σ_0 is the error correlation matrix, and D is the diagonal matrix of the standard deviations of the errors. We set D = diag(σ_1,…,σ_p), where each σ_i is generated independently from a gamma distribution G(α,β), and α and β are chosen to match the sample mean and sample standard deviation of the standard deviations of the errors. A similar approach to that of Fan et al. (2011a) has been used in this calibration step. The off-diagonal entries of Σ_0 are generated independently from a normal distribution, with mean and standard deviation equal to the sample mean and sample standard deviation of the sample correlations between the estimated residuals, conditional on their absolute values being no larger than 0.95. We then employ hard thresholding to make Σ_0 sparse, where the threshold is found as the smallest constant that provides the positive definiteness of Σ_u. More precisely, we start with threshold value 1, which gives a diagonal Σ_u, and then decrease the threshold values on a grid until positive definiteness is violated (a minimal code sketch of this step is given after Table 2).
Table 1. Mean and covariance matrix used to generate b
μ_B         Σ_B
0.0047      0.0767     −0.00004    0.0087
0.0007     −0.00004     0.0841     0.0013
−1.8078     0.0087      0.0013     0.1649
Table 2. Parameters of the f_t-generating process
μ           cov(f_t)                            Φ
−0.0050     1.0037     0.0011    −0.0009       −0.0712     0.0468     0.1413
0.0335      0.0011     0.9999     0.0042       −0.0764    −0.0008     0.0646
−0.0756    −0.0009     0.0042     0.9973        0.0195    −0.0071    −0.0544
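A minimal sketch of calibration step 4 under the assumptions stated there; the function and argument names are ours and the gamma fit uses simple method-of-moments matching.

```python
import numpy as np

def calibrate_sigma_u(p, resid_sd, resid_corr, seed=0):
    """Generate a sparse Sigma_u = D Sigma_0 D as in calibration step 4: D from a gamma
    fit to the residual standard deviations, off-diagonal correlations from a normal fit
    to the sample correlations (truncated at 0.95 in absolute value), then hard
    thresholding with the largest grid threshold that keeps the matrix positive definite.
    `resid_sd` and `resid_corr` hold the calibration residuals' standard deviations and
    off-diagonal correlations."""
    rng = np.random.default_rng(seed)
    m, s = resid_sd.mean(), resid_sd.std()                 # moment-matched gamma parameters
    shape, scale = (m / s) ** 2, s ** 2 / m
    D = np.diag(rng.gamma(shape, scale, size=p))
    R = rng.normal(resid_corr.mean(), resid_corr.std(), size=(p, p))
    R = np.clip((R + R.T) / 2, -0.95, 0.95)                # symmetrize and truncate
    np.fill_diagonal(R, 1.0)
    R0 = np.eye(p)
    for thr in np.arange(1.0, -0.01, -0.05):               # decrease until PD is violated
        R_thr = np.where(np.abs(R) >= thr, R, 0.0)
        np.fill_diagonal(R_thr, 1.0)
        if np.linalg.eigvalsh(R_thr)[0] <= 0:
            break
        R0 = R_thr
    return D @ R0 @ D
```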

6.2. Simulation

For the simulation, we fix T=300, and let p increase from 1 to 600. For each fixed p, we repeat the following steps N=200 times, and record the means and the standard deviations of each respective norm.

  • Step 1: generate independently b_i ~ N_3(μ_B, Σ_B), i=1,…,p, and set B = (b_1,…,b_p)′.
  • Step 2: generate independently u_t ~ N_p(0, Σ_u), t=1,…,T.
  • Step 3: generate f_t as a vector auto-regressive sequence of the form f_t = μ + Φf_{t−1} + ε_t.
  • Step 4: calculate y_t = Bf_t + u_t for t=1,…,T.
  • Step 5: apply hard thresholding with the adaptive threshold of expression (3.2). Estimate K by using IC1 of Bai and Ng (2002). Calculate covariance estimators by using POET. Calculate the sample covariance matrix Σ̂_sam.

In the graphs below, we plot the averages and standard deviations of the distances from Σ̂^𝒯 and Σ̂_sam to the true covariance matrix Σ, under the norms ‖·‖_Σ, ‖·‖ and ‖·‖_max. We also plot the means and standard deviations of the distances from (Σ̂^𝒯)^{-1} and Σ̂_sam^{-1} to Σ^{-1} under the spectral norm. The dimensionality p ranges from 20 to 600 in increments of 20. Because of invertibility issues, the spectral norm error of Σ̂_sam^{-1} is plotted only up to p=280. Also, we zoom into these graphs by plotting the values of p from 1 to 100, this time in increments of 1. Note that we also plot the distance from Σ̂_obs to Σ for comparison, where Σ̂_obs is the estimated covariance matrix that was proposed by Fan et al. (2011a), assuming that the factors are observable.

6.3. Results

In a factor model, we expect POET to perform as well as inline image when p is relatively large, since the effect of estimating the unknown factors should vanish as p increases. This is illustrated in the plots.

From the simulation results, reported in Figs 2-5, we observe that POET under the unobservable factor model performs just as well as the estimator of Fan et al. (2011a), which assumes known factors, once p is sufficiently large. The cost of not knowing the factors is approximately of order inline image. It can be seen in Figs 2 and 3 that this cost vanishes for p≥200. To give better insight into the effect of estimating the unknown factors for small p, a separate set of simulations is conducted for p≤100. As we can see from Figs 2(b), 2(d), 3(b), 3(c), 3(e) and 3(f), the effect decreases quickly. In addition, when estimating inline image, it is difficult to distinguish the estimators with known and unknown factors, whose performances are quite stable compared with the sample covariance matrix. Also, the maximum absolute elementwise error (Fig. 4) of our estimator behaves very similarly to that of the sample covariance matrix, which coincides with our asymptotic result. Fig. 5 shows that the performances of the three methods are indistinguishable in the spectral norm, as expected.

Figure 2.

(a), (b) Averages and (c), (d) standard deviations of the relative error inline image for the estimator with known factors, POET and the sample covariance over 200 simulations, as a function of the dimensionality p: (a), (c) p ranges over 20–600 with increment 20; (b), (d) p ranges over 1–100 with increment 1

Figure 3.

(a)–(c) Averages and (d)–(f) standard deviations of inline image for the estimator with known factors, POET and the sample covariance over 200 simulations, as a function of the dimensionality p: (a), (d) p ranges over 20–600 with increment 20; (b), (e) p ranges over 1–100 with increment 1; (c), (f) the same as (a) and (d) with the sample covariance curve omitted

Figure 4.

(a) Averages and (b) standard deviations of inline image for the estimator with known factors, POET and the sample covariance over 200 simulations, as a function of the dimensionality p: the three curves are nearly indistinguishable

Figure 5.

(a) Averages of inline image and (b) inline image for the estimator with known factors, POET and the sample covariance over 200 simulations, as a function of the dimensionality p: the three curves are barely distinguishable in (a)

6.4. Robustness to the estimation of K

POET depends on the estimated number of factors. Our theory uses a consistent estimator inline image. To assess the robustness of our procedure to inline image in finite samples, we calculate inline image for K=1,2,…,10. Again, the threshold is fixed to be inline image.

6.4.1. Design 1

The simulation set-up is the same as before, where the true inline image. We calculate inline image, inline image, inline image and inline image for K=1,2,…,10. Fig. 6 plots these norms as p increases, with T=300 fixed. The results are quite robust when K≥3; in particular, the estimation accuracies under the spectral norm for large p are close to each other. When K=1 or K=2, the estimators perform badly because of modelling bias. Therefore, POET is robust to overestimation of K, but not to underestimation. A sketch of this robustness check is given below.
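
A compact sketch of this robustness check, in R, reusing the illustrative poet_sketch function from Section 'Simulation'; Sigma is the true covariance matrix, which is available in simulation, and norm(., type = "2") denotes the spectral norm.

  # Sketch: recompute the POET-type estimate for K = 1,...,10 and record the
  # spectral-norm error against the true covariance Sigma (known in simulation).
  robustness_in_K <- function(Y, Sigma, K_grid = 1:10, C = 0.5) {
    sapply(K_grid, function(K) norm(poet_sketch(Y, K, C) - Sigma, type = "2"))
  }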

Figure 6.

Robustness of K as p increases for various choices of K (design 1, T=300): (a) inline image; (b) inline image; (c) inline image; (d) inline image

6.4.2. Design 2

We also simulated from a new data-generating process for the robustness assessment. Consider a banded idiosyncratic matrix

display math

We still consider a K=3 factor model, where the factors are independently simulated as

display math

Table 3 summarizes the average estimation errors of the covariance matrices across K in the spectral norm. Each configuration is replicated 50 times, with T=200.

Table 3. Robustness of K: design 2, estimation errors in spectral norm†
                Errors for the following values of K:
                1        2        3        4        5        6        8
  †True K=3.

p=100
inline image   10.70     5.23     1.63     1.80     1.91     2.04     2.22
inline image    2.71     2.51     1.51     1.50     1.44     1.84     2.82
inline image    2.69     2.48     1.47     1.49     1.41     1.56     2.35
inline image   94.66    91.36    29.41    31.45    30.91    33.59    33.48
inline image   17.37    10.04     2.05     2.83     2.94     2.95     2.93
p=200
inline image   11.34    11.45     1.64     1.71     1.79     1.87     2.01
inline image    2.69     3.91     1.57     1.56     1.81     2.26     3.42
inline image    2.67     3.72     1.57     1.55     1.70     2.13     3.19
inline image  200.82   195.64    57.44    63.09    64.53    60.24    56.20
inline image   20.86    14.22     3.29     4.52     4.72     4.69     4.76
p=300
inline image   12.74    15.20     1.66     1.71     1.78     1.84     1.95
inline image    7.58     7.80     1.74     2.18     2.58     3.54     5.45
inline image    7.59     7.49     1.70     2.13     2.49     3.37     5.13
inline image  302.16   274.12    87.92    92.47    91.90    83.21    92.50
inline image   23.43    16.89     4.38     6.04     6.16     6.14     6.20

Table 3 illustrates some interesting patterns. First, the best accuracy of estimation is achieved when inline image. Second, the estimation is robust for inline image: as K increases from inline image, the estimation error becomes larger but increases only slowly in general, which indicates robustness to using a slightly larger K. Third, when the number of factors is underestimated, corresponding to K=1,2, all the estimators perform badly, which demonstrates the danger of missing any common factors. Therefore, overestimating the number of factors, while still maintaining a satisfactory accuracy of estimation of the covariance matrices, is much better than underestimating; the bias caused by underestimation is more severe than the additional variance introduced by overestimation. Finally, estimating Σ, the covariance of inline image, does not achieve good accuracy even when inline image in the absolute term inline image, but the relative error inline image is much smaller. This is consistent with our discussion in Section 'Convergence of POET'.

6.5. Comparisons with other methods

6.5.1. Comparison with related methods

We compare POET with related methods that address low rank plus sparse covariance estimation, specifically, the low rank and sparse covariance estimator LOREC proposed by Luo (2011), the strict factor model SFM by Fan et al. (2008), the dual method (Dual) by Lin et al. (2009) and, finally, the singular value thresholding method of Cai et al. (2008), SVT. In particular, SFM is a special case of POET which employs a large threshold that forces inline image to be diagonal even when the true inline image might not be. Note that Dual, SVT and many others dealing with low rank plus sparseness, such as Candès et al. (2011) and Wright et al. (2009), assume a known Σ and focus on recovering the decomposition. Hence they do not estimate Σ or its inverse, but decompose the sample covariance into two components. The resulting sparse component may not be positive definite, which can lead to large estimation errors for inline image and inline image.

Data are generated from the same set-up as design 2 in Section 'Robustness to the estimation of K'. Table 4 reports the averaged estimation errors of the five methods being compared, calculated on the basis of 50 replications for each simulation. Dual and SVT assume that the data matrix has a low rank plus sparse representation, which is not so for the sample covariance matrix (though the population Σ has such a representation). The tuning parameters for POET, LOREC, Dual and SVT are chosen to achieve the best performance for each method. (We used the R package for LOREC that was developed by Luo (2011) and the MATLAB codes for Dual and SVT provided on Yi Ma's Web site 'Low-rank matrix recovery and completion via convex optimization' at the University of Illinois. The tuning parameters for each method have been chosen to minimize the sum of relative errors inline image. We have also written an R package for POET.)

Table 4. Method comparison under spectral norm for T=100†
Method     inline image    inline image    RelE      inline image    inline image
  †RelE represents the relative error inline image.

p=100
POET        1.624     1.336     2.080       1.309    29.107
LOREC       2.274     1.880     2.564       1.511    32.365
SFM         2.084     2.039     2.707       2.022    34.949
Dual        2.306     5.654     2.707       4.674    29.000
SVT         2.591     3.64      2.806     103.1      29.670
p=200
POET        1.641     1.358     3.295       1.346    58.769
LOREC       2.179     1.767     3.874       1.543    62.731
SFM         2.098     2.071     3.758       2.065    60.905
Dual        2.41      6.554     4.541       5.813    56.264
SVT         2.930   362.5       4.680      47.21     63.670
p=300
POET        1.662     1.394     4.337       1.395    65.392
LOREC       2.364     1.635     4.909       1.742    91.618
SFM         2.091     2.064     4.874       2.061    88.852
Dual        2.475     2.602     6.190       2.234    74.059
SVT         2.681     inline image     6.247       inline image    80.954

6.5.2. Comparison with direct thresholding

This section compares POET with direct thresholding of the sample covariance matrix without taking out the common factors (Rothman et al., 2009; Cai and Liu, 2011); we denote this method by THR. We also run simulations to demonstrate the finite sample performance when Σ itself is sparse and has bounded eigenvalues, corresponding to the case K=0. Three models are considered, and both POET and THR use soft thresholding. We fix T=200. Reported results are averages over 100 replications.

  1. Model 1, one factor: the factors and loadings are independently generated from N(0,1). The error covariance is the same banded matrix as design 2 in Section 'Robustness to the estimation of K'. Here Σ has one diverging eigenvalue.
  2. Model 2, sparse covariance: set K=0; hence inline image itself is a banded matrix with bounded eigenvalues.
  3. Model 3, cross-sectional AR(1): set K=0, but inline image. Now Σ is no longer sparse (or banded), but it is not too dense either, since inline image decreases to 0 exponentially fast as |i−j|→∞. This is the correlation matrix if inline image follows a cross-sectional AR(1) process: inline image. (A sketch of this covariance and of the direct thresholding comparator follows this list.)
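
A minimal sketch in R of the model 3 covariance and of the THR comparator. The AR(1) parameter rho below is purely illustrative (it is not the value used in our simulations), the function names are ours, and the universal soft threshold C√{log(p)/Tn} is the usual rate for directly thresholding a sparse covariance matrix.

  # Model 3 covariance: entries decaying as rho^{|i-j|} (rho illustrative)
  ar1_cov <- function(p, rho) rho^abs(outer(1:p, 1:p, "-"))

  # THR: soft thresholding of the sample covariance, with no factor step
  thr_sketch <- function(Y, C = 0.5) {
    p <- nrow(Y); Tn <- ncol(Y)
    S <- cov(t(Y))
    tau <- C * sqrt(log(p) / Tn)                 # usual rate when Sigma is sparse
    S_soft <- sign(S) * pmax(abs(S) - tau, 0)    # entrywise soft thresholding
    diag(S_soft) <- diag(S)                      # leave the diagonal untouched
    S_soft
  }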

For each model, POET uses an estimated inline image based on IC1 of Bai and Ng (2002), whereas THR thresholds the sample covariance directly. We find that, in model 1, POET performs significantly better than THR as the latter misses the common factor. For model 2, IC1 estimates inline image precisely in each replication, and hence POET is identical to THR. For model 3, POET still outperforms THR. The results are summarized in Table 5.
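
For completeness, a short sketch in R of the IC1 criterion of Bai and Ng (2002), as it is used here to choose inline image: X is the (demeaned) p×Tn data matrix, kmax an upper bound, and V(k) the mean squared residual after fitting k principal components.

  # Sketch of IC1 of Bai and Ng (2002): K_hat minimizes log V(k) + k * penalty.
  ic1_bai_ng <- function(X, kmax = 10) {
    p <- nrow(X); Tn <- ncol(X)
    ev  <- eigen(tcrossprod(X), symmetric = TRUE, only.values = TRUE)$values
    V   <- sapply(0:kmax, function(k) sum(ev[(k + 1):p]) / (p * Tn))
    pen <- (0:kmax) * ((p + Tn) / (p * Tn)) * log(p * Tn / (p + Tn))
    which.min(log(V) + pen) - 1                  # estimated number of factors
  }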

Table 5. Method comparison for T=200†
           inline image              inline image            inline image
Model     POET       THR        POET       THR
  †The reported numbers are the averages based on 100 replications.

p=200
  1       26.20     240.18      1.31       2.67        1
  2        2.04       2.04      2.07       2.07        0
  3        7.73      11.24      8.48      11.40        6.2
p=300
  1       32.60     314.43      2.18       2.58        1
  2        2.03       2.03      2.08       2.08        0
  3        9.41      11.29      8.81      11.41        5.45

6.6. Simulated portfolio allocation

We demonstrate the improvement of our method over the sample covariance matrix and the covariance estimator based on the strict factor model in a problem of portfolio allocation for risk minimization.

Let inline image be a generic estimator of the covariance matrix of the return vector inline image, and w be the allocation vector of a portfolio consisting of the corresponding p financial securities. Then the theoretical and the empirical risk of the given portfolio are inline image and inline image respectively. Now, define

display math

the estimated (minimum variance) portfolio. Then the actual risk of the estimated portfolio is defined as inline image, and the estimated risk (which is also called the empirical risk) is equal to inline image. In practice, the actual risk is unknown, and only the empirical risk can be calculated.
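
These quantities have a simple closed form, sketched here in R: the solution of the constrained minimization above is w = Σ̂^{-1}1/(1'Σ̂^{-1}1), and the actual and empirical risks are then quadratic forms in w. The function names are illustrative, and the true Σ is available only in simulation.

  # Minimum variance portfolio and its actual/empirical risks.
  min_var_portfolio <- function(Sig_hat) {
    ones <- rep(1, nrow(Sig_hat))
    w <- solve(Sig_hat, ones)                    # Sig_hat^{-1} 1
    w / sum(w)                                   # normalize so that w'1 = 1
  }

  portfolio_risks <- function(Sig_hat, Sigma) {
    w <- min_var_portfolio(Sig_hat)
    c(actual    = drop(t(w) %*% Sigma   %*% w),  # w' Sigma  w (needs the true Sigma)
      empirical = drop(t(w) %*% Sig_hat %*% w))  # w' Sig_hat w
  }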

For each fixed p, the population Σ was generated in the same way as described in Section 'Calibration', with a sparse but not diagonal error covariance. We use three different methods to estimate Σ and to obtain inline image: strict factor model inline image (estimate inline image by using a diagonal matrix), our POET-estimator inline image (both are with unknown factors) and sample covariance inline image. We then calculate the corresponding actual and empirical risks.

It is interesting to examine the accuracy and the performance of the actual risk of our portfolio inline image in comparison with the oracle risk inline image, which is the theoretical risk of the portfolio that we would have created if we knew the true covariance matrix Σ. We thus compare the regret inline image, which is always non-negative, for the three estimators of inline image. They are summarized by using boxplots over the 200 simulations. The results are reported in Fig. 7. In practice, we are also concerned about the difference between the actual and the empirical risk of the chosen portfolio inline image. Hence, in Fig. 8, we also compare the average estimation error inline image and the average relative estimation error inline image over 200 simulations. When inline image is obtained on the basis of the strict factor model, both the difference between the actual and the oracle risk and the difference between the actual and the empirical risk are persistently greater than the corresponding differences for the approximate factor estimator. Also, in terms of the relative estimation error, the factor-based methods are negligible, whereas the sample covariance does not have such a property.

Figure 7.

Boxplots of regrets inline image for (a) p=80 and (b) p=140: in each panel, the boxplots from left to right correspond to inline image obtained by using inline image based on the approximate factor model, the SFM and the sample covariance

Figure 8.

Estimation errors for risk assessments as a function of the portfolio size p, for POET, the SFM and the sample covariance: (a) average absolute error inline image; (b) average relative error inline image (here, inline image and inline image are obtained on the basis of the three estimators of inline image)

7. Real data example

We demonstrate the sparsity of the approximate factor model on real data and present the improvement of POET over the SFM in a real world application of portfolio allocation.

7.1. Sparsity of idiosyncratic errors

The data were obtained from the Center for Research in Security Prices database and consist of p=50 stocks and their annualized daily returns for the period January 1st, 2010–December 31st, 2010 (T=252). The stocks are chosen from five different industry sectors (more specifically, ‘consumer goods—textiles and apparel clothing’, ‘financial—credit services’, ‘healthcare—hospitals’, ‘services—restaurants’ and ‘utilities—water utilities’), with 10 stocks from each sector. We made this selection to demonstrate a block diagonal trend in the sparsity. More specifically, we show that the non-zero elements are clustered mainly within companies in the same industry. We also note that these are the same groups that show predominantly positive correlation.

The largest eigenvalues of the sample covariance equal 0.0102, 0.0045 and 0.0039, whereas the rest are bounded by 0.0020. Hence K=0,1,2,3 are the possible values of the number of factors. Fig. 9 shows the heat map of the thresholded error correlation matrix (for simplicity, we applied hard thresholding). The threshold has been chosen by cross-validation as described in Section 'Choice of threshold'. We compare the level of sparsity (the percentage of non-zero off-diagonal elements) for the five diagonal blocks of size 10×10 with the sparsity of the rest of the matrix; a sketch of this comparison is given below. For K=2, our method results in 25.8% non-zero off-diagonal elements in the five diagonal blocks, as opposed to 7.3% non-zero elements in the rest of the covariance matrix. Note that, out of the non-zero elements in the central five blocks, 100% are positive, as opposed to 60.3% positive and 39.7% negative among the non-zero elements in the off-diagonal blocks. There is a strong positive correlation between the returns of companies in the same industry after the common factors have been taken out, and the thresholding has preserved these correlations. The results for K=1, 2, 3 show the same characteristics. These provide stark evidence that the strict factor model is not appropriate.
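
A short sketch in R of this sparsity summary for the thresholded 50×50 error correlation matrix, with 10 stocks per sector and stocks ordered by sector as described above; the function name is illustrative.

  # Proportion of non-zero off-diagonal entries within the five 10 x 10
  # same-sector blocks versus between sectors. R_thr: thresholded correlations.
  block_sparsity <- function(R_thr, block_size = 10) {
    p <- nrow(R_thr)
    sector <- rep(seq_len(p / block_size), each = block_size)
    same_sector <- outer(sector, sector, "==") & !diag(p)   # off-diagonal, same block
    other       <- !outer(sector, sector, "==")             # different blocks
    c(within  = mean(R_thr[same_sector] != 0),
      between = mean(R_thr[other] != 0))
  }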

Figure 9.

Heat map of the thresholded error correlation matrix for number of factors (a) K=0, (b) K=1, (c) K=2 and (d) K=3

7.2. Portfolio allocation

We extend the data set to larger industrial portfolios (p=100) and a longer period (10 years) of annualized daily excess returns, from January 1st, 2000, to December 31st, 2010. Two portfolios are created at the beginning of each month, based on two different covariance estimates through approximate and strict factor models with unknown factors. At the end of each month, we compare the risks of both portfolios.

The number of factors is determined by using the penalty function that was proposed by Bai and Ng (2002), as defined in expression (2.14). For calibration, we use the last 100 consecutive business days of the above data, and both IC1 and IC2 give inline image. On the first trading day of each month, we estimate inline image (method SFM) and inline image (POET with soft thresholding) by using the historical excess daily returns for the preceding 12 months (T=252). The value of the threshold is determined by the cross-validation procedure. We minimize the empirical risk of both portfolios to obtain the two respective optimal portfolio allocations inline image and inline image (based on inline image and inline image): inline image. At the end of the month (21 trading days), their actual risks are compared, calculated by

display math

We can see from Fig. 10 that the minimum risk portfolio created by POET performs significantly better, achieving a lower variance 76% of the time. In those months, the risk is decreased by 48.63%. In contrast, during the months in which POET produces a higher risk portfolio, the risk is increased by only 17.66%.
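
A schematic sketch in R of the monthly exercise just described: ret is a (days × p) matrix of daily excess returns, estimate_cov is a placeholder for either covariance estimator (SFM or POET with soft thresholding), and the realized risk is summarized here by the average squared daily portfolio return over the 21-day holding period, a simple proxy for the risk formula displayed above.

  # Schematic monthly rebalancing: estimate the covariance from the preceding 252
  # days, form the minimum variance portfolio, then record the realized risk over
  # the next 21 trading days. min_var_portfolio is defined in the sketch in
  # Section 'Simulated portfolio allocation'.
  rolling_risks <- function(ret, estimate_cov, window = 252, hold = 21) {
    starts <- seq(window + 1, nrow(ret) - hold + 1, by = hold)
    sapply(starts, function(s) {
      past <- ret[(s - window):(s - 1), ]        # preceding 12 months of returns
      Sig  <- estimate_cov(t(past))              # estimators here take p x Tn input
      w    <- min_var_portfolio(Sig)
      real <- ret[s:(s + hold - 1), ] %*% w      # realized daily portfolio returns
      mean(real^2)                               # proxy for the realized risk
    })
  }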

Figure 10.

Risk of portfolios created with POET and SFM

Next, we demonstrate the effect of the choice of the number of factors and of the threshold on the performance of POET. When cross-validation is computationally expensive, a common soft threshold can be chosen throughout the whole investment process. The average constant in the cross-validation was 0.53, which is close to our suggested constant 0.5 used for the simulations. We also present the results based on various choices of the constant C=0.25, 0.5, 0.75, 1, with soft threshold inline image. The results are summarized in Table 6. The performance of POET is consistent across these choices.

Table 6. Comparisons of the risks of portfolios by using POET and SFM†
        Results for the following values of inline image:
  C       inline image           inline image           inline image
  †The first number is the proportion of the time that POET outperforms and the second number is the percentage of average risk improvement. C represents the constant in the threshold.

 0.25    0.58/29.6%     0.68/38%       0.71/33%
 0.5     0.66/31.7%     0.70/38.2%     0.75/33.5%
 0.75    0.68/29.3%     0.70/29.6%     0.71/25.1%
 1       0.66/20.7%     0.62/19.4%     0.69/18%

8. Conclusion and discussion

We study the problem of estimating a high dimensional covariance matrix with conditional sparsity. Realizing that an unconditional sparsity assumption is inappropriate in many applications, we introduce a latent factor model that has a conditional sparsity feature and propose POET to take advantage of this structure. This considerably expands the scope of the strict factor model, which assumes independent idiosyncratic noise and is too restrictive in practice. By assuming a sparse error covariance matrix, we allow for the presence of cross-sectional correlation even after the common factors are taken out. The sparse covariance is estimated by the adaptive thresholding technique.

It is found that the rates of convergence of the estimators have an extra term approximately inline image in addition to the results based on observable factors by Fan et al. (2008, 2011a), which arises from the effect of estimating the unobservable factors. As we can see, this effect vanishes as the dimensionality increases, as more information about the common factors becomes available. When p grows sufficiently large, the effect of estimating the unknown factors is negligible, and we estimate the covariance matrices as if we knew the factors.

The proposed POET also has wide applicability in statistical genomics. For example, Carvalho et al. (2008) applied a Bayesian sparse factor model to study breast cancer hormonal pathways. Their real data results identified about two common factors with highly loaded genes (about half of the 250 genes). As a result, these factors should be treated as ‘pervasive’ (see the explanation in example 1 in Section 2.1.1), which will result in one or two very spiked eigenvalues of the covariance matrix of the gene expressions. POET can be applied to estimate such a covariance matrix and its network model.

Acknowledgements

The research was partially supported by National Institutes of Health grants R01GM100474-01 and R01-GM072611, grant DMS-0704337 and the Bendheim Center for Finance at Princeton University. The bulk of the research was carried out while Yuan Liao was a postdoctoral fellow at Princeton University.

Appendix A: Estimating a sparse covariance with contaminated data

We estimate inline image by applying the adaptive thresholding given by expression (2.11). However, the task here is slightly different from the standard problem of estimating a sparse covariance matrix in the literature, as no direct observations of inline image are available. In many cases the original data are contaminated; this includes any situation in which only estimates of the data are available because direct observations cannot be made. It typically happens when inline image represent the error terms in regression models or when the data are subject to measurement errors. Instead, we may observe inline image. For instance, in the approximate factor model, inline image.

We can estimate inline image by using the adaptive thresholding proposed by Cai and Liu (2011): for the threshold inline image, define

display math
display math(A.1)

where inline image satisfies, for all inline image, inline image and inline image.

When inline image is sufficiently close to inline image, we can show that inline image is also consistent. The following theorem extends the standard thresholding results in Bickel and Levina (2008) and Cai and Liu (2011) to the case when no direct observations are available, or the original data are contaminated. For the tail and mixing parameters inline image and inline image that are defined in assumptions 2 and 3, let inline image.
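
As an illustration of estimator (A.1), the following R sketch applies entry-adaptive soft thresholding to a p×Tn matrix of estimated residuals U. The adaptive scale theta below is a Cai and Liu (2011)-type variance proxy for each entry, soft thresholding plays the role of the generalized shrinkage function, and omega is the rate-based threshold level; the exact constants and the particular shrinkage function used in the paper may differ.

  # Sketch of adaptive thresholding (A.1) applied to residuals U (assumed centred).
  adaptive_threshold <- function(U, omega) {
    p <- nrow(U); Tn <- ncol(U)
    S <- tcrossprod(U) / Tn                      # sigma_hat_ij = mean_t u_it * u_jt
    theta <- matrix(0, p, p)                     # variance proxy of u_it * u_jt
    for (i in 1:p) for (j in 1:p)
      theta[i, j] <- mean((U[i, ] * U[j, ] - S[i, j])^2)
    tau <- omega * sqrt(theta)                   # entry-dependent thresholds
    S_thr <- sign(S) * pmax(abs(S) - tau, 0)     # soft thresholding as the shrinkage
    diag(S_thr) <- diag(S)                       # diagonal entries are kept
    S_thr
  }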

Theorem 5. Suppose that inline image, and that assumptions 2 and 3 hold. In addition, suppose that there is a sequence inline image such that inline image and inline image. Then there is a constant C>0 in the adaptive thresholding estimator (A.1), with

display math

such that

display math

If further inline image, then inline image is invertible with probability approaching 1, and

display math

Proof. By assumptions 2 and 3, the conditions of lemmas A.3 and A.4 of Fan et al. (2011a) are satisfied. Hence, for any ɛ>0, there are positive constants inline image and inline image such that each of the events

display math

occurs with probability at least 1−ɛ. By the condition on the threshold function, inline image. Now, for inline image, under the event inline image,

display math

Let inline image. Then, with probability at least 1−2ɛ, inline image. Since ɛ is arbitrary, we have inline image. If, in addition, inline image, then the minimum eigenvalue of inline image is bounded away from 0 with probability approaching 1 since inline image. This then implies that inline image.

Appendix B: Proofs for Section 2

We first cite two useful theorems, which are needed to prove propositions 1 and 2. In lemma 1 below, let inline image be the eigenvalues of Σ in descending order and inline image be their associated eigenvectors. Correspondingly, let inline image be the eigenvalues of inline image in descending order and inline image be their associated eigenvectors.

Lemma 1.

  1. (Weyl's theorem) inline image.
  2. (sin(θ) theorem; Davis and Kahan (1970)):
    display math
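
Since the displayed statements above are reproduced here only as images, we recall the standard forms of the two results for reference, writing λ_j, ξ_j for the jth eigenvalue and eigenvector of Σ and λ̂_j, ξ̂_j for those of Σ̂; the normalization may differ slightly from the exact forms used in the original.

  % Weyl's theorem: for symmetric Sigma and Sigma-hat and each j,
  \[
    |\hat\lambda_j - \lambda_j| \le \|\hat\Sigma - \Sigma\| .
  \]
  % Davis--Kahan sin(theta) theorem, in one common eigenvector form:
  \[
    \|\hat\xi_j - \xi_j\| \le
    \frac{\sqrt{2}\,\|\hat\Sigma - \Sigma\|}
         {\min\bigl(|\hat\lambda_{j-1} - \lambda_j|,\; |\lambda_j - \hat\lambda_{j+1}|\bigr)} .
  \]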

B.1. Proof of proposition 1

Since inline image are the eigenvalues of Σ and inline image are the first K eigenvalues of inline image (the remaining p−K eigenvalues are 0), then by Weyl's theorem, for each j≤K,

display math

For j>K, inline image. However, the first K eigenvalues of BB′ are also the eigenvalues of inline image. By the assumption, the eigenvalues of inline image are bounded away from 0. Thus, when j≤K, inline image are bounded away from 0 for all large p.

B.2. Proof of proposition 2

Applying the sin(θ) theorem yields

display math

For a generic constant c>0, inline image for all large p, since inline image but inline image is bounded by proposition 1. However, if j<K, the same argument implies that inline image. If j=K, inline image, where inline image is bounded away from 0, but inline image. Hence, again, inline image.

B.3. Proof of theorem 1

The sample covariance matrix of the residuals by using the least squares method is given by

display math

where we used the normalization condition inline image and inline image. If we show that inline image, then from the decompositions of the sample covariance,

display math

we have inline image. Consequently, applying thresholding on inline image is equivalent to applying thresholding on inline image, which gives the desired result.

We now show that inline image indeed holds. Consider again the least squares problem (2.8), but with the following alternative normalization constraints: inline image, and inline image is diagonal. Let inline image be the solution to the new optimization problem. Switching the roles of B and F, we see that the solution of problem (2.10) is inline image and inline image. In addition, inline image. From inline image, it follows that inline image.

Appendix C: Proofs for Section 3

We shall prove theorems 4, 2 and 3, in that order.

C.1. Preliminary lemmas

The following results are to be used subsequently. The proofs of lemmas 2, 3 and 4 are found in Fan et al. (2011a).

Lemma 2. Suppose that A and B are symmetric positive semidefinite matrices, and inline image for a sequence inline image. If inline image, then inline image, and

display math

Lemma 3. Suppose that the random variables inline image and inline image both satisfy the exponential-type tail condition: there exist inline image, inline image and inline image, such that, ∀s>0,

display math

Then, for some inline image and inline image, and any s>0,

display math(C.1)

Lemma 4. Under the assumptions of theorem 2,

  1. inline image,
  2. inline image and
  3. inline image.

Lemma 5. Let inline image denote the Kth largest eigenvalue of inline image; then inline image with probability approaching 1 for some inline image.

Proof. First, by proposition 1, under assumption 1, the Kth largest eigenvalue inline image of Σ satisfies, for some c>0,

display math

for sufficiently large p. Using Weyl's theorem, we need only to prove that inline image. Without loss of generality, we prove the result under the identifiability condition (2.1). Using model (1.2), inline image. Using this and model (1.3), inline image can be decomposed as the sum of the four terms

display math

We now deal with them term by term. We shall repeatedly use the fact that, for a p×p matrix A,

display math

First, by lemma 4, inline image, which is inline image if K log(p)=o(T). Consequently, by assumption 1, we have

display math

We now deal with inline image. It follows from lemma 4 that

display math

Since inline image, it remains to deal with inline image, which is bounded by

display math

which is inline image since log(p)=o(T).

Lemma 6. Under assumption 3, inline image.

Proof. Since inline image is weakly stationary, inline image. In addition, inline image for some constant M and any i and t, since inline image has an exponential tail. Hence, by Davydov's inequality (corollary 16.2.4 in Athreya and Lahiri (2006)), there is a constant C>0 such that, for all i≤p and t≤T, inline image, where α(t) is the α-mixing coefficient. By assumption 3, inline image. Thus, uniformly in T,

display math

C.2. Proof of theorem 4

Our derivation below relies on a result that was obtained by Bai and Ng (2002), which showed that the estimated number of factors is consistent, in the sense that inline image equals the true K with probability approaching 1. Note that, under our assumptions 1–4, all the assumptions in Bai and Ng (2002) are satisfied. Thus immediately we have the following lemma.

Lemma 7. (Theorem 2 of Bai and Ng (2002).) For inline image defined in expression (2.14),

display math

Proof. See Bai and Ng (2002).

Using expression (A.1) in Bai (2003), we have the identity

display math(C.2)

where inline image, inline image and inline image.

We first prove some preliminary results in the following lemmas. Denote inline image.

Lemma 8. For all inline image,

  1. inline image,
  2. inline image,
  3. inline image and
  4. inline image.

Proof.

  1. We have, ∀i, inline image. By the Cauchy–Schwarz inequality,
    display math
  2. By lemma 6, inline image, which then yields the result.
  3. By the Cauchy–Schwarz inequality,
    display math
  4. Note that inline image. By assumption 4, inline image, which implies that inline image and yields the result.
  5. By definition, inline image. We first bound inline image. Assumption 4 implies that inline image. Therefore, by the Cauchy–Schwarz inequality,
    display math
  6. Similarly to part (c), noting that inline image is a scalar, we have
    display math
    where the last line follows from the Cauchy–Schwarz inequality.

Lemma 9.

  1. inline image,
  2. inline image,
  3. inline image and
  4. inline image.

Proof.

  1. By the Cauchy–Schwarz inequality and the fact that inline image,
    display math
    The result then follows from assumption 3.
  2. By the Cauchy–Schwarz inequality,
    display math
    It follows from assumption 4 that inline image. It then follows from Chebyshev's inequality and Bonferroni's method that inline image.
  3. By assumption 4, inline image. Chebyshev's inequality and Bonferroni's method yield inline image with probability 1, which then implies
    display math
  4. By the Cauchy–Schwarz inequality and assumption 4, we have demonstrated that inline image. In addition, since inline image, inline image. It follows that
    display math

Lemma 10.

  1. inline image.
  2. inline image.
  3. inline image.

Proof. We prove this lemma conditioning on the event inline image. Once this has been done, because inline image, the unconditional results follow.

  1. When inline image, by lemma 5, all the eigenvalues of V/p are bounded away from 0. Using the inequality inline image and identity (C.2), we have, for some constant C>0,
    display math
    Each of the four terms on the right-hand side is bounded in lemma 8, which then yields the desired result.
  2. Part (b) follows from part (a) and
    display math
  3. Part (c) is implied by identity (C.2) and lemma 9.

Lemma 11.

  1. inline image.
  2. inline image.

Proof. We first condition on inline image.

  1. Lemma 5 implies that inline image. Also inline image. In addition, inline image. It then follows from the definition of H that inline image. Define inline image. Applying the triangle inequality gives
    display math(C.3)
  2. By lemma 4, the first term in inequality (C.3) is inline image. The second term of inequality (C.3) can be bounded, by the Cauchy–Schwarz inequality and lemma 10, as follows:
    display math
  3. Still conditioning on inline image, since inline image and inline image, right-multiplying by H gives inline image. Part (a) also gives, conditioning on inline image, inline image. Hence, further left-multiplying by inline image yields inline image. Because inline image, we reach the desired result.
C.2.1. Completion of proof of theorem 4

The second part of theorem 4 was proved in lemma 10. We now derive the convergence rate of inline image