## 1. Introduction

Advances in information technology have made large data sets widely available for scientific discovery. Much statistical analysis of such high dimensional data involves the estimation of a covariance matrix or its inverse (the precision matrix). Examples include portfolio management and risk assessment (Fan *et al*., 2008), high dimensional classification such as the Fisher discriminant (Hastie *et al*., 2009), graphical models (Meinshausen and Bühlmann, 2006), statistical inference such as controlling false discoveries in multiple testing (Leek and Storey, 2008; Efron, 2010), finding quantitative trait loci based on longitudinal data (Yap *et al*., 2009; Xiong *et al*., 2011) and testing the capital asset pricing model (Sentana, 2009), among others. See Section 'Applications of POET' for some of those applications. Yet the dimensionality is often comparable with the sample size, or even larger. In such cases, the sample covariance matrix is known to perform poorly (Johnstone, 2001), and some regularization is needed.

Realizing the importance of estimating large covariance matrices and the challenges brought by high dimensionality, researchers have in recent years proposed various regularization techniques to estimate **Σ** consistently. One of the key assumptions is that the covariance matrix is sparse, namely that many entries are 0 or nearly so (Bickel and Levina, 2008; Rothman *et al*., 2009; Lam and Fan, 2009; Cai and Zhou, 2012; Cai and Liu, 2011). In many applications, however, the sparsity assumption directly on **Σ** is not appropriate. For example, financial returns depend on equity market risks, housing prices depend on economic health and gene expressions can be stimulated by cytokines, among others. Because of the presence of common factors, it is unrealistic to assume that many outcomes are uncorrelated. An alternative method is to assume a factor model structure, as in Fan *et al*. (2008). However, they restrict themselves to strict factor models with known factors.

A natural extension is conditional sparsity: given the common factors, the outcomes are weakly correlated. To model this, we consider an approximate factor model, which has been frequently used in economic and financial studies (Chamberlain and Rothschild (1983), Fama and French (1992) and Bai and Ng (2002), among others):

$$y_{it} = \mathbf{b}_i'\mathbf{f}_t + u_{it}. \tag{1.1}$$

Here $y_{it}$ is the observed response for the *i*th (*i*=1,…,*p*) individual at time *t*=1,…,*T*, $\mathbf{b}_i$ is a vector of factor loadings, $\mathbf{f}_t$ is a *K*×1 vector of common factors and $u_{it}$ is the error term, usually called the *idiosyncratic component*, which is uncorrelated with $\mathbf{f}_t$. Both *p* and *T* diverge to ∞, whereas *K* is assumed fixed throughout the paper, and *p* is possibly much larger than *T*.
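As an illustration, data from a model of this form can be simulated directly; the sizes *p*, *T*, *K* and the Gaussian distributions below are arbitrary choices for the sketch, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
p, T, K = 100, 50, 3                    # illustrative sizes; p may exceed T

B = rng.normal(size=(p, K))             # loadings: row i holds b_i'
F = rng.normal(size=(K, T))             # factors: column t holds f_t
U = rng.normal(scale=0.5, size=(p, T))  # idiosyncratic errors u_it

Y = B @ F + U                           # y_it = b_i' f_t + u_it, entrywise
print(Y.shape)                          # (100, 50)
```

Note that only `Y` would be observed in practice; `B`, `F` and `U` are the latent quantities the methodology must recover.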

We emphasize that, in model (1.1), only $y_{it}$ is observable. It is intuitively clear that the unknown common factors can only be inferred reliably when there are sufficiently many cases, i.e. *p*→∞. In a data rich environment, *p* can diverge at a rate that is faster than *T*. The factor model (1.1) can be put in matrix form as

$$\mathbf{y}_t = \mathbf{B}\mathbf{f}_t + \mathbf{u}_t, \tag{1.2}$$

where $\mathbf{y}_t = (y_{1t},\ldots,y_{pt})'$, $\mathbf{B} = (\mathbf{b}_1,\ldots,\mathbf{b}_p)'$ and $\mathbf{u}_t = (u_{1t},\ldots,u_{pt})'$. We are interested in **Σ**, the *p*×*p* covariance matrix of $\mathbf{y}_t$, and its inverse, which are assumed to be time invariant. Under model (1.1), **Σ** is given by

$$\boldsymbol{\Sigma} = \mathbf{B}\,\mathrm{cov}(\mathbf{f}_t)\,\mathbf{B}' + \boldsymbol{\Sigma}_u, \tag{1.3}$$

where $\boldsymbol{\Sigma}_u$ is the covariance matrix of $\mathbf{u}_t$. The literature on approximate factor models typically assumes that the first *K* eigenvalues of $\mathbf{B}\,\mathrm{cov}(\mathbf{f}_t)\,\mathbf{B}'$ diverge at rate *O*(*p*), whereas all the eigenvalues of $\boldsymbol{\Sigma}_u$ are bounded as *p*→∞. This assumption holds easily when the factors are pervasive in the sense that a non-negligible fraction of the factor loadings are non-vanishing. The decomposition (1.3) is then asymptotically identified as *p*→∞. In addition, in this paper we assume that $\boldsymbol{\Sigma}_u$ is *approximately sparse*, as in Bickel and Levina (2008) and Rothman *et al*. (2009): for some *q* ∈ [0,1),

$$m_p = \max_{i \le p} \sum_{j \le p} |\sigma_{u,ij}|^q \tag{1.4}$$

does not grow too fast as *p*→∞, where $\sigma_{u,ij}$ denotes the (*i*,*j*)th entry of $\boldsymbol{\Sigma}_u$. In particular, this includes the exact sparsity assumption (*q*=0), under which $m_p = \max_{i \le p} \sum_{j \le p} 1\{\sigma_{u,ij} \ne 0\}$, the maximum number of non-zero elements in each row.
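As a concrete check of this definition, the sparsity measure can be computed directly from a candidate error covariance matrix; the function and the small banded matrix below are our own illustration:

```python
import numpy as np

def m_p(sigma_u, q):
    """Approximate-sparsity measure: max over rows i of sum_j |sigma_ij|^q.
    The q = 0 case is handled separately (in Python 0.0**0 == 1.0), so that
    it counts the non-zero entries per row, as in the exact sparsity case."""
    A = np.abs(sigma_u)
    if q == 0:
        return int((A > 0).sum(axis=1).max())
    return float((A ** q).sum(axis=1).max())

# a 3 x 3 banded example: each row has at most 2 non-zero entries
S_u = np.array([[1.0, 0.3, 0.0],
                [0.3, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
print(m_p(S_u, q=0))    # 2: the maximum number of non-zero elements in a row
```

Smaller values of `q` penalize small but non-zero entries more heavily, which is why the measure interpolates between counting non-zeros (*q*=0) and summing absolute values (*q*=1).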

The conditional sparsity structure of form (1.2) was explored by Fan *et al*. (2011a) in estimating the covariance matrix when the factors are observable, which allows them to use regression analysis to estimate the idiosyncratic components. This paper deals with the situation in which the factors are unobservable and must be inferred. Our approach is simple and optimization free, and it uses the data only through the sample covariance matrix: run the singular value decomposition on the sample covariance matrix of $\mathbf{y}_t$, keep the covariance matrix that is formed by the first *K* principal components and apply a thresholding procedure to the remaining covariance matrix. This results in the principal orthogonal complement thresholding estimator, POET. When the number of common factors *K* is unknown, it can be estimated from the data. See Section 'Regularized covariance matrix via principal components analysis' for additional details. We shall investigate various properties of POET under the assumption that the data are serially dependent, which includes independent observations as a special case. The rates of convergence under various norms for both the estimated **Σ** and $\boldsymbol{\Sigma}_u$ and their precision (inverse) matrices will be derived. We show that the effect of estimating the unknown factors on the rate of convergence vanishes when *p* log (*p*)≫*T* and, in particular, the rate of convergence for $\boldsymbol{\Sigma}_u$ achieves the optimal rate in Cai and Zhou (2012).
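The three-step procedure just described can be sketched in a few lines. The following is a minimal illustration with a constant hard threshold `tau` in place of the adaptive, entry-dependent threshold developed later in the paper; the function and variable names are our own:

```python
import numpy as np

def poet(Y, K, tau):
    """Minimal POET sketch for a p x T data matrix Y (rows are series).

    Steps: (i) form the sample covariance matrix, (ii) keep the part
    spanned by the K leading principal components, (iii) hard-threshold
    the principal orthogonal complement (off-diagonal entries only).
    """
    p, T = Y.shape
    Yc = Y - Y.mean(axis=1, keepdims=True)      # de-mean each series
    S = Yc @ Yc.T / T                           # p x p sample covariance
    vals, vecs = np.linalg.eigh(S)              # eigenvalues in ascending order
    lam, V = vals[-K:], vecs[:, -K:]            # K leading eigenpairs
    low_rank = (V * lam) @ V.T                  # sum_k lam_k v_k v_k'
    R = S - low_rank                            # principal orthogonal complement
    R_thr = np.where(np.abs(R) >= tau, R, 0.0)  # hard thresholding
    np.fill_diagonal(R_thr, np.diag(R))         # never threshold the diagonal
    return low_rank + R_thr
```

With `tau = 0` the estimator reduces to the sample covariance matrix itself, which makes explicit that all of the regularization acts on the orthogonal complement of the leading principal components.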

This paper focuses on the high dimensional *static factor model* (1.2), which is innately related to the principal component analysis (PCA), as clarified in Section 'Regularized covariance matrix via principal components analysis'. This feature makes it different from the classical factor model with fixed dimensionality (e.g. Lawley and Maxwell (1971)). In the last decade, much theory on the estimation and inference of the static factor model has been developed, e.g. Stock and Watson (1998, 2002), Bai and Ng (2002), Bai (2003) and Doz *et al*. (2011), among others. Our contribution is on the estimation of covariance matrices and their inverse in large factor models.

The *static* model that is considered in this paper is to be distinguished from the *dynamic factor model* as in Forni *et al*. (2000); the latter allows $\mathbf{y}_t$ to depend also on $\mathbf{f}_t$ with lags in time. Their approach is based on the eigenvalues and principal components of spectral density matrices and on frequency domain analysis. Moreover, as shown in Forni and Lippi (2001), the dynamic factor model does not really impose a restriction on the data-generating process, and the assumption of idiosyncrasy (in their terminology, a *p*-dimensional process is idiosyncratic if all the eigenvalues of its spectral density matrix remain bounded as *p*→∞) asymptotically identifies the decomposition of $\mathbf{y}_t$ into the common component and the idiosyncratic error. The literature includes, for example, Forni *et al*. (2000, 2004), Forni and Lippi (2001), Hallin and Liška (2007, 2011) and many other references therein. Both the static and the dynamic factor models are receiving increasing attention in applications across many fields where information is usually scattered through a (very) large number of interrelated time series.

In recent years there has been an extensive literature on sparse principal components, which have been widely used to improve the convergence of principal components in high dimensional spaces. d'Aspremont *et al*. (2008), Shen and Huang (2008), Witten *et al*. (2009) and Ma (2013) proposed and studied various algorithms for computation. More literature on sparse PCA can be found in Johnstone and Lu (2009), Amini and Wainwright (2009), Zhang and El Ghaoui (2011) and Birnbaum *et al*. (2012), among others. In addition, there has been a growing literature that theoretically studies recovery in the low rank plus sparse matrix estimation problem; see, for example, Wright *et al*. (2009), Lin *et al*. (2008), Candès *et al*. (2011), Luo (2011), Agarwal *et al*. (2012) and Pati *et al*. (2012). The latter problem corresponds to the identifiability issue in our setting.

There is a big difference between our model and those considered in the literature cited above. In the current paper, the first *K* eigenvalues of **Σ** are spiked and grow at rate *O*(*p*), whereas the eigenvalues of the matrices studied in the existing literature on covariance estimation are usually assumed to be either bounded or slowly growing. Because of this distinctive feature, the common components and the idiosyncratic components can be identified and, in addition, PCA on the sample covariance matrix can consistently estimate the space spanned by the eigenvectors of **Σ**. Existing methods, which either apply thresholding directly or solve a constrained optimization problem, can fail in the presence of very spiked principal eigenvalues. However, there is a price to pay: as the first *K* eigenvalues are 'too spiked', one can hardly obtain a satisfactory rate of convergence for estimating **Σ** in absolute terms, but it can be estimated accurately in relative terms (see Section 'Convergence of POET' for details). In addition, $\boldsymbol{\Sigma}^{-1}$ can be estimated accurately, since its eigenvalues are bounded.

We would like to note further that the low rank plus sparse representation of our model is on the population covariance matrix, whereas Candès *et al*. (2011), Wright *et al*. (2009) and Lin *et al*. (2009) considered such a representation on the data matrix. (We thank a referee for reminding us about these related works.) As there is no **Σ** to estimate, their goal is limited to producing a low rank plus sparse matrix decomposition of the data matrix, which corresponds to the identifiability issue of our study, and does not involve estimation and inference. In contrast, our ultimate goal is to estimate the population covariance matrix as well as the precision matrix. For this, we require the idiosyncratic components and the common factors to be uncorrelated and the data-generating process to be strictly stationary. The covariances considered in this paper are constant over time, though slowly time varying covariance matrices are accommodated through localization in time (time domain smoothing). Our consistency result on $\boldsymbol{\Sigma}_u$ demonstrates that decomposition (1.3) is identifiable, and hence our results also shed light on the 'surprising phenomenon' of Candès *et al*. (2011) that one can fully separate a sparse matrix from a low rank matrix when only the sum of the two is available.

The rest of the paper is organized as follows. Section 'Regularized covariance matrix via principal components analysis' gives our estimation procedures and builds the relationship between the PCA and the factor analysis in high dimensional space. Section 'Asymptotic properties' provides the asymptotic theory for various estimated quantities. Section 'Choice of threshold' illustrates how to choose the thresholds by using cross-validation and guarantees the positive definiteness in any finite sample. Specific applications of regularized covariance matrices are given in Section 'Applications of POET'. Numerical results are reported in Section 'Monte Carlo experiments'. Finally, Section 'Real data example' presents a real data application on portfolio allocation. All proofs are given in Appendix A. Throughout the paper, we use $\lambda_{\min}(\mathbf{A})$ and $\lambda_{\max}(\mathbf{A})$ to denote the minimum and maximum eigenvalues of a matrix **A**. We also denote by $\|\mathbf{A}\|_F$, $\|\mathbf{A}\|$, $\|\mathbf{A}\|_1$ and $\|\mathbf{A}\|_{\max}$ the Frobenius norm, spectral norm (also called the operator norm), $L_1$-norm and elementwise norm of a matrix **A**, defined respectively by $\|\mathbf{A}\|_F = \mathrm{tr}^{1/2}(\mathbf{A}'\mathbf{A})$, $\|\mathbf{A}\| = \lambda_{\max}^{1/2}(\mathbf{A}'\mathbf{A})$, $\|\mathbf{A}\|_1 = \max_j \sum_i |a_{ij}|$ and $\|\mathbf{A}\|_{\max} = \max_{i,j} |a_{ij}|$. When **A** is a vector, both $\|\mathbf{A}\|_F$ and $\|\mathbf{A}\|$ are equal to the Euclidean norm. Finally, for two sequences, we write $a_T \gg b_T$ if $b_T = o(a_T)$, and $a_T \asymp b_T$ if $a_T = O(b_T)$ and $b_T = O(a_T)$.
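These norm definitions translate directly into code; a quick numerical sanity check, using an example matrix of our own, is:

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 0.0]])

fro  = np.sqrt(np.trace(A.T @ A))                  # Frobenius: tr^{1/2}(A'A)
spec = np.sqrt(np.linalg.eigvalsh(A.T @ A).max())  # spectral: lam_max^{1/2}(A'A)
l1   = np.abs(A).sum(axis=0).max()                 # L1-norm: max_j sum_i |a_ij|
emax = np.abs(A).max()                             # elementwise: max_ij |a_ij|
print(fro, spec, l1, emax)
```

For this rank 1 matrix the Frobenius and spectral norms coincide (both equal 5), while the $L_1$-norm (7) and elementwise norm (4) differ, illustrating that the four norms measure genuinely different aspects of a matrix.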

The programs that were used to analyse the data can be obtained from