Geometric consistency of principal component scores for high‐dimensional mixture models and its application
Abstract
In this article, we consider clustering based on principal component analysis (PCA) for high‐dimensional mixture models. We present theoretical reasons why PCA is effective for clustering high‐dimensional data. First, we derive a geometric representation of high‐dimension, low‐sample‐size (HDLSS) data taken from a two‐class mixture model. With the help of the geometric representation, we give geometric consistency properties of sample principal component scores in the HDLSS context. We then extend the ideas behind the geometric representation and provide geometric consistency properties for multiclass mixture models. We show that PCA can cluster HDLSS data under certain conditions in a surprisingly explicit way. Finally, we demonstrate the performance of the clustering using gene expression datasets.
1 INTRODUCTION
High‐dimension, low‐sample‐size (HDLSS) data situations occur in many areas of modern science such as genetic microarrays, medical imaging, text recognition, finance, chemometrics, and so on. In recent years, substantial work has been done on HDLSS asymptotic theory, where the sample size n is fixed or n/d→0 as the data dimension d→∞. Hall, Marron, and Neeman (2005), Ahn, Marron, Muller, and Chi (2007), Yata and Aoshima (2012), and Lv (2013) explored several types of geometric representations of HDLSS data. Jung and Marron (2009) showed inconsistency properties of the sample eigenvalues and eigenvectors in the HDLSS context. Yata and Aoshima (2012, 2013) developed the noise‐reduction methodology to provide consistent estimators of both the eigenvalues and eigenvectors together with principal component (PC) scores in the HDLSS context. Shen, Shen, Zhu, and Marron (2016) and Hellton and Thoresen (2017) also provided several asymptotic properties of the sample PC scores in the HDLSS context.
The HDLSS asymptotic theory was created under the assumption that either the population distribution is Gaussian or the random variables in a sphered data matrix have a ρ‐mixing dependency. However, Yata and Aoshima (2010) developed an HDLSS asymptotic theory without such assumptions. Moreover, they created a new principal component analysis (PCA) called the cross‐data‐matrix methodology, which can be applied to construct unbiased estimators in HDLSS nonparametric settings. Based on the cross‐data‐matrix methodology, Aoshima and Yata (2011) developed a variety of inference methods for HDLSS data, such as given‐bandwidth confidence regions, two‐sample tests, classification, variable selection, regression, and pathway analysis (see Aoshima et al., 2018 for a review).
PCA is an important visualization and dimension‐reduction technique for high‐dimensional data. Furthermore, PCA is quite popular for clustering high‐dimensional data (see section 9.2 in Jolliffe, 2002 for details). For clustering HDLSS gene expression data, see Armstrong et al. (2002) and Pomeroy et al. (2002). Liu, Hayes, Nobel, and Marron (2008) and Ahn, Lee, and Yoon (2012) gave binary split‐type clustering methods for HDLSS data. Borysov, Hannig, and Marron (2014) considered hierarchical clustering for high‐dimensional data. Li and Yao (2018) considered model‐based clustering for a high‐dimensional mixture. Given this background, we focus on high‐dimensional structures of multiclass mixture models via PCA. In this article, we consider asymptotic properties of PC scores for high‐dimensional mixture models with a view to applying them to cluster analysis in HDLSS settings. The main contribution of this article is to give theoretical reasons why PCA is effective for clustering HDLSS data.
Remark 1. When μk≠0, let for i=1,…,k. Then, it holds that . Hence, for any inference of ∑ by the sample covariance matrix, one can assume μk=0 without loss of generality.
As the sign of an eigenvector is arbitrary, we assume that for i=1,…,k−1, without loss of generality. In addition, we assume the cluster means are more spread than the within class variation in the sense that:
Condition 1. as d→∞.
In this article, we consider asymptotic properties of sample PC scores for Equation 1 in the HDLSS context that d→∞ while n is fixed. In Section 2, we first derive a geometric representation of HDLSS data taken from the two‐class mixture model. With the help of the geometric representation, we give geometric consistency properties of sample PC scores in the HDLSS context. We show that PCA can cluster HDLSS data under certain conditions in a surprisingly explicit way. In Section 3, we investigate asymptotic behaviors of true PC scores for the k(≥3)‐class mixture model and provide geometric consistency properties of sample PC scores when k≥3. In Section 4, we demonstrate the performance of clustering based on sample PC scores using gene expression datasets. We show that the real‐HDLSS datasets hold the geometric consistency properties.
2 PC SCORES FOR TWO‐CLASS MIXTURE MODEL
In this section, we consider PC scores for the two‐class (k=2) mixture model.
2.1 Preliminary
Remark 2. Yata and Aoshima (2012) showed that Equation 4 holds under the sphericity condition and Var(||xj−μ||2)/tr(∑)2→0 as d→∞.
From Equation 4, we observe that the eigenvalue becomes deterministic as the dimension increases, whereas the eigenvector of SD does not uniquely determine the direction. In addition, Hellton and Thoresen (2017) presented asymptotic properties of the sample PC scores when Z is ρ‐mixing. We note that Equation 1 does not presuppose that X is Gaussian or that Z is ρ‐mixing. See section 4.1.1 in Qiao, Zhang, Liu, Todd, and Marron (2010) for details. In the present article, we give new asymptotic properties of the sample PC scores for Equation 1.
2.2 Geometric representation and consistency property of PC scores when k = 2
We now derive a geometric representation for Equation 1; the finding is completely different from Equation 4. We assume the following conditions:
Condition 2. as d→∞.
Condition 3. as d→∞.
Condition 4. as d→∞ for all i,j=1,…,k(i<j).
Remark 3. Condition 2 is stronger than Condition 1 as it holds that for i=1,…,k. Let β(>0) be a constant such that (Δk−1/dβ)>0. Let λi1≥···≥λid≥0 be the eigenvalues of ∑i for i=1,…,k. For a spiked model such as
Remark 4. If Πis are Gaussian, it holds that for i=1,…,k, so that Condition 3 naturally holds under Condition 2.
Remark 5. When k=2, Condition 4 holds if tr(∑1)/tr(∑2)→1 as d→∞ and .
Theorem 1. Assume Δ1/tr(∑)→c(>0) as d→∞. Under Conditions 2, it holds
When SD≠O, we note that , so that . Thus, from Equation 5, the first eigenvector of SD uniquely determines the direction. In fact, by noting and ||r||2=nη1η2, we have the following results for the first eigenvector and PC scores when k=2. Using Corollary 1, one can cluster the xjs into two groups by the signs of the first PC scores:
Corollary 1. Under Conditions 2, it holds that for ni>0, i=1,2
We considered a simple example: Πi:Nd(μi,∑i), i=1,2, with μ1=1d, μ2=0, , and , where B=diag[−{0.5+1/(d+1)}1/2,{0.5+2/(d+1)}1/2,…,(−1)d{0.5+d/(d+1)}1/2]. We note that Δ1=d and ∑1≠ ∑2 but tr(∑1)=tr(∑2)=d. Then, Conditions 1 hold. We set n1=1 and n2=2, and took n=3 samples as x1∈ Π1 and x2,x3∈ Π2. In Figure 1, we displayed scatter plots of 20 independent pairs of when (a) d=5, (b) d=50, (c) d=500, and (d) d=5,000. We denoted r=(2/3,−1/3,−1/3)T by the solid line and 1n=(1,1,1)T by the dotted line. We note that Angle when SD≠O. We observed that all the plots gather on the surface of the orthogonal complement of 1n. Also, the plots appeared closer to r as d increased. Thus, one can cluster the xjs into two groups by the signs of the first PC scores.
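This two‐class example can be imitated numerically. The sketch below is only an illustration under simplified assumptions: identity covariances replace the ∑is above (so that tr(∑1)=tr(∑2)=d still holds), and the dual n×n matrix plays the role of SD; the sign anchoring of the first eigenvector is our illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n1, n2 = 5000, 1, 2
n = n1 + n2

# Samples from Pi_1: N_d(1_d, I_d); from Pi_2: N_d(0, I_d).
# (Identity covariances are a simplification of the Sigma_i in the text.)
X = np.vstack([
    rng.normal(1.0, 1.0, size=(n1, d)),   # Pi_1
    rng.normal(0.0, 1.0, size=(n2, d)),   # Pi_2
])

# HDLSS-friendly dual PCA: eigendecompose the n x n matrix
# X_c X_c^T / n instead of the d x d sample covariance.
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(Xc @ Xc.T / n)
u1 = evecs[:, -1]                          # eigenvector of the largest eigenvalue

# First sample PC scores (the sign of u1 is arbitrary, so we anchor it).
s1 = np.sqrt(n * max(evals[-1], 0.0)) * u1
labels = (np.sign(s1[0]) * s1 < 0).astype(int)
print(labels)                              # expect [0 1 1]
```

Here the single Π1 sample separates from the two Π2 samples by the sign pattern of the first PC scores, mirroring the direction r=(2/3,−1/3,−1/3)T in Figure 1.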

Next, we investigated the robustness of Corollary 1 against violations of Condition 4 by simulation. Let Δ∑=|tr(∑1)−tr(∑2)|. We considered a simple example: Πi:Nd(μi,∑i), i=1,2, with μ1=(1,…,1,0,…,0)T whose first ⌈d3/4⌉ elements are 1, μ2=0, , and , where γ≥1. Here, ⌈·⌉ denotes the ceiling function. Note that Δ∑=(γ−1)d. We set d=5,000, n=10, n1=4, and n2=6, and took n=10 samples as x1,…,x4∈ Π1 and x5,…,x10∈ Π2. In Figure 2, we displayed scatter plots of , j=1,…,n, when (a) γ=1+2d−1/2, (b) γ=1+2d−1/4, and (c) γ=3. From Corollary 1, we denoted (3/2)1/2 and −(2/3)1/2 by dotted lines. Note that Δ∑/Δ1≈2d−1/4 for (a), Δ∑/Δ1≈2 for (b), and Δ∑/Δ1≈2d1/4 for (c). Thus, Condition 4 holds for (a), while it does not hold for (b) and (c). For (a) and (b), we observed that the estimated PC scores gave good performance. On the other hand, for (c), the first PC scores did not gather around (3/2)1/2 or −(2/3)1/2. However, the PC scores were concentrated at the origin (0,0) for xj∈ Π2.

When Δ1/Δ∑→0 as d→∞, we give the following result to explain the phenomenon in Figure 2c. Under the assumptions of Proposition 1, one can cluster the xjs into two groups by the magnitudes of the PC scores even when Condition 4 is not met:
Proposition 1. Assume k=2, and , where l∗(≠l′∗) is an integer such that . Assume also that , and Δ1/Δ∑→0 as d→∞. Then, it holds that
Remark 6. For k≥3, we do not give any consistency property when Condition 4 is not met because the sufficient conditions of Proposition 1 become quite complicated for k≥3. A detailed study of the case k≥3 is left for future work.
The assumptions of Proposition 1 hold for (c) of Figure 2. Thus, the PC scores were concentrated at the origin (0,0) for xj∈ Π2 in (c).
3 PC SCORES FOR MULTICLASS MIXTURE MODEL
In this section, we consider PC scores for the k(≥3)‐class mixture model.
3.1 Asymptotic behaviors of true PC scores when k ≥ 3
Condition 5. and as d→∞ for i,j=1,…,k−1(i<j).
Remark 7. We consider the case when all elements of the μis are constants (not depending on d), such as μi=(μi1,…,μip,0,…,0) with μis≠0 (not depending on d) for s=1,…,p. If all elements of the μis are constants, the condition “Angle(μi, μj)→π/2 as d→∞” holds for i<j under Δj/Δi→0 as d→∞, so that Condition 5 holds under Δi+1/Δi→0 as d→∞ for i=1,…,k−2. See the settings of Figures 3 and 4. Note that Δ1≫···≫Δk−1 under Condition 5. We emphasize that Conditions 1 become strict as k increases under Condition 5.


We have the following results.
One can check whether xj∈ Π1 or not by the first PC score. If xj∉ Π1, one can check whether xj∈ Π2 or xj∈ Π3 by the second PC score. In general, one can cluster xjs using at most the first k−1 PC scores.
We considered a toy example: Πi:Nd(μi,∑i), i=1,…,4, where μ1=1d, μ2=(1,…,1,0,…,0)T whose first ⌈d3/4⌉ elements are 1, μ3=(1,…,1,0,…,0)T whose first ⌈d1/2⌉ elements are 1, and μ4=0. We set , , ∑3=0.8∑1, and ∑4=1.2∑2, where B is defined in Section 2.2. Then, Conditions 1 and 5 hold. We first considered the case when k=3: Πi, i=1,2,3, having (ε1,ε2,ε3)=(1/2,1/4,1/4). We set n=20 and (n1,n2,n3)=(10,5,5). From Theorem 2, one can expect that (z1j,z2j) becomes close to (1,0) when xj∈ Π1, (−1,21/2) when xj∈ Π2, and (−1,−21/2) when xj∈ Π3. In Figure 3, we displayed scatter plots of (z1j,z2j), j=1,…,n, when (a) d=100, (b) d=1,000, and (c) d=10,000. We observed that the scatter plots appear closer to those three vertices as d increases.
Next, we considered the case when k=4: Πi, i=1,…,4, having ε1=···=ε4=1/4. We set n=20 and n1=···=n4=5. In Figure 4, we displayed scatter plots of (z1j,z2j,z3j), j=1,…,n, when (a) d=100, (b) d=1,000, and (c) d=10,000. From Theorem 2, we displayed the triangular pyramid given by Equation 6 with k=4. As expected theoretically, we observed that the scatter plots appear closer to the four vertices of the triangular pyramid as d increases. They seemed to converge more slowly in Figure 4 than in Figure 3. This is because the conditions of Theorem 2 become strict as k increases. See Remark 7.
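The k=3 behavior above can be imitated numerically. The sketch below is only an illustration under simplified assumptions: identity covariances replace the ∑is of the toy example, the dual n×n matrix provides the first k−1 PC directions, and the nearest‐vertex assignment uses the known labels to locate the three vertices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 10000, 3
sizes = [10, 5, 5]                        # n1, n2, n3
n = sum(sizes)

# Means mimic the text: mu1 = 1_d, mu2 has its first ceil(d^{3/4})
# entries equal to 1, mu3 = 0 (covariances simplified to I_d).
mus = np.zeros((k, d))
mus[0] = 1.0
mus[1, : int(np.ceil(d ** 0.75))] = 1.0

X = np.vstack([rng.normal(mus[i], 1.0, size=(sizes[i], d))
               for i in range(k)])

# First k-1 = 2 PC directions via the n x n dual matrix.
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(Xc @ Xc.T / n)
scores = evecs[:, [-1, -2]]               # first and second PC directions

# Nearest-vertex clustering: per-class score means act as the three
# vertices (true labels are known in this simulation).
labels = np.repeat(np.arange(k), sizes)
vertices = np.vstack([scores[labels == i].mean(axis=0) for i in range(k)])
pred = np.argmin(((scores[:, None, :] - vertices[None]) ** 2).sum(-1), axis=1)
print((pred == labels).mean())            # close to 1 when the clusters separate
```

As in Figure 3, the three classes occupy three separated regions of the first two PC coordinates once d is large.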
3.2 Consistency property of PC scores when k ≥ 3
Condition 6. as d→∞ for j=1,…,k.
Remark 9. From the fact that , Condition 6 holds under as d→∞ for j=1,…,k.
As for the estimated PC scores, we have the following result. From Theorem 3, one can cluster the xjs into k groups by the elements of the estimated PC scores:
Theorem 3. Under Conditions 2, it holds that for nl>0, l=1,…,k
Condition 5 is essential for the consistency properties given in Theorems 2 and 3. We investigated the robustness of Theorem 3 against violations of Condition 5 by simulation. We considered a toy example: Πi:Nd(μi,∑i), i=1,2,3, where μ1=1d, μ2=(1,…,1,0,…,0)T whose first ⌈dζ⌉ elements are 1, μ3=0, , , and . We set d=5,000, n=20, and (n1,n2,n3)=(10,5,5). In Figure 5, we displayed scatter plots of the first two PC scores, j=1,…,n, when (a) ζ=1/5, (b) ζ=2/5, and (c) ζ=4/5. Also, we displayed the triangle given by Equation 7 with k=3. Note that Angle(μ1,μ2)=0.352π and Δ2/Δ1=1/5 for (a), Angle(μ1,μ2)=0.282π and Δ2/Δ1=2/5 for (b), and Angle(μ1,μ2)=0.148π and Δ2/Δ1=4/5 for (c). For (a) and (b), we observed that the estimated PC scores gave good performance. On the other hand, the estimated PC scores seemed not to converge to the theoretical points for (c). This is because Condition 5 is not met. However, we could still find three separate clusters for Πi, i=1,2,3. See Appendix B for the reason.

4 REAL‐DATA EXAMPLES
We demonstrate the performance of clustering, based on sample PC scores, using gene expression datasets.
4.1 Clustering when k = 2
We analyzed microarray data by Chiaretti et al. (2004) in which the dataset consists of 12,625 (=d) genes and 128 samples. The dataset has two tumor cellular subtypes, Π1: B cell (95 samples) and Π2: T cell (33 samples). Refer to Jeffery, Higgins, and Culhane (2006) as well. We checked the behavior of the PC scores using several samples from the two tumor cellular subtypes. We considered three cases: (a) n=10 samples consisting of the first five samples from each of Π1 and Π2 (i.e., n1=5 and n2=5), (b) n=40 samples consisting of the first 20 samples from each of Π1 and Π2 (i.e., n1=20 and n2=20), and (c) n=128 samples consisting of n1=95 samples from Π1 and n2=33 samples from Π2. In the top panels of Figure 6, we displayed scatter plots of the first two PC scores for (a), (b), and (c). From Corollary 1, we denoted (η2/η1)1/2 and −(η1/η2)1/2 by dotted lines. For (a), we observed that the estimated PC scores gave good performance. The first PC scores gathered around (η2/η1)1/2 or −(η1/η2)1/2. For (b), the estimated PC scores gave adequate performance except for two points from Π2. Those two samples, which are the ninth and twentieth samples of Π2, are probably outliers. In fact, the two points are far from the cluster of Π2. The other 38 samples were perfectly classified into the two groups by the signs of the first PC scores. As for (c), although there seemed to be two clusters apart from the two samples, we could not classify the dataset by the signs of the first PC scores. This is probably because η1 and η2 are unbalanced. From Equation 2, when the mixing proportions are unbalanced, λ1 becomes small. The first eigenspace was possibly affected by the other eigenspaces, so that the first PC scores appeared in the wrong direction. We therefore tested the clustering excluding the two outlying samples, using the remaining 31 samples for Π2.
We considered the following three cases for samples from Π1: (d) the first 16 samples from Π1, so that n1=16, n2=31, n=47, and η1/η2≈0.5; (e) the first 31 samples from Π1, so that n1=31, n2=31, n=62, and η1/η2=1; and (f) the first 62 samples from Π1, so that n1=62, n2=31, n=93, and η1/η2=2. In the bottom panels of Figure 6, we displayed scatter plots of the first two PC scores for (d), (e), and (f). For (d) and (e), we observed that the estimated PC scores gave good performance. As for (f), although there seemed to be two clusters, we could not classify the dataset by the signs of the first PC scores. Note that η1 and η2 are unbalanced in (d) and (f). Even though (d) is an unbalanced case, the estimated PC scores worked well there. We obtained 1.598 as an estimate of the ratio of the largest eigenvalues, , by the noise‐reduction methodology given by Yata and Aoshima (2012). The first eigenspace of ∑ in (d) is less affected by the first eigenspaces of the ∑is than in (f). This is probably the reason why the estimated PC scores gave good performance even in (d).
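The noise‐reduction estimates quoted above can be sketched as follows. This is our reading of the Yata and Aoshima (2012) eigenvalue correction, demonstrated on hypothetical simulated data rather than the microarray dataset; the spiked covariance and all parameter values are illustrative assumptions.

```python
import numpy as np

def nr_eigenvalues(X):
    """Eigenvalue estimates via the dual matrix, with a noise-reduction
    correction (our reading of Yata & Aoshima, 2012): subtract from each
    naive eigenvalue the average of the remaining, noise-dominated ones."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    lam = np.linalg.eigvalsh(Xc @ Xc.T / (n - 1))[::-1]   # descending
    tilde = lam.copy()
    for j in range(n - 2):
        tilde[j] = lam[j] - lam[j + 1:].sum() / (n - 2 - j)
    return lam, np.maximum(tilde, 0.0)

# Hypothetical spiked example: Sigma = diag(500, 1, ..., 1).
rng = np.random.default_rng(4)
d, n, spike = 5000, 20, 500.0
X = rng.normal(size=(n, d))
X[:, 0] *= np.sqrt(spike)      # the first coordinate carries the spike
lam, tilde = nr_eigenvalues(X)
# The naive lam[0] is inflated by roughly d/(n-1) in this setting;
# the corrected tilde[0] removes most of that upward bias.
```

Ratios of corrected eigenvalues, as used for the 1.598 estimate above, are then computed from the tilde values rather than the naive ones.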

4.2 Clustering when k ≥ 3
We analyzed microarray data by Bhattacharjee et al. (2001) in which the dataset consisted of five lung carcinoma types with d=3,312. We used only four classes: Π1: pulmonary carcinoids (20 samples), Π2: normal lung (17 samples), Π3: squamous cell lung carcinomas (21 samples), and Π4: adenocarcinomas (20 samples), so that n1=20, n2=17, n3=21, and n4=20. Note that Π4 originally had 139 samples; we used only the first 20 samples from Π4 in order to keep the sample sizes balanced with the other classes. We first considered clustering when k=3 under the following setups: (a) the dataset consists of Π1, Π2, and Π3 (n=58); (b) the dataset consists of Π1, Π2, and Π4 (n=57); and (c) the dataset consists of Π1, Π3, and Π4 (n=61). In Figure 7, we displayed scatter plots of the first two PC scores for each of (a), (b), and (c). Also, we displayed the triangle given by Equation 7 with k=3 using Theorem 3. We observed that the estimated PC scores gave good performance: the three clusters gathered around each vertex for (a), (b), and (c).

Next, we considered clustering when k=4: Πi, i=1,…,4, so that n=78. In Figure 8, we displayed scatter plots of the first three PC scores. The scores did not seem to converge to the theoretical points given by Equation 7 in Theorem 3. This is probably because the conditions of Theorem 3 become strict as k increases. See Remark 7. Thus, the convergence is slower than in the case when k=3 in Figure 7. However, there seemed to be four separate clusters, one for each Πi.
Finally, we introduce an example using next‐generation sequencing datasets. Shen, Shen, Zhu, and Marron (2012, 2016) gave a scatter plot of the first two PC scores for the next‐generation sequencing cancer data by Wilhelm and Landry (2009), in which the dataset consists of three curves with d=1,709 and n=180. See Figure 9, which was given in figure 1 of Shen et al. (2012). The three clusters appear to form a triangle similar to that in Figure 7.


4.3 Clustering: Special case
We analyzed microarray data by Armstrong et al. (2002) in which the dataset consists of three leukemia subtypes having 12,582 (=d) genes. We used two classes such as Π1: acute lymphoblastic leukemia (24 samples) and Π2: mixed‐lineage leukemia (20 samples), so that n1=24,n2=20, and n=44. In Figure 10, we displayed scatter plots of the first three PC scores.

We observed that the dataset is perfectly separated by the signs of the second PC scores. This figure looks completely different from Figure 6. This is probably because the largest eigenvalue, or , is too large. When k=2, we give the following result to explain the phenomenon in Figure 10. Under the assumptions of Proposition 2, one can cluster the xjs into two groups by the ith PC score for some i even when Condition 1 is not met:
Proposition 2. Assume that as d→∞. Then, there exists some positive integer i⋆ such that
We estimated the largest eigenvalue using the noise‐reduction methodology given by Yata and Aoshima (2012). By noting Remark 1, we considered Δ1 as . We estimated Δ1 using an unbiased estimator given by Aoshima and Yata (2014). Then, we obtained the estimates of as (0.465,0.787), so that Condition 1 is clearly not met. In addition, by estimating the εis by the ηis, we had . Thus, the first eigenspace of ∑ is probably close to the first eigenspace of ∑2. Hence, i⋆ in Proposition 2 must be 2. This is the reason why the dataset can be separated by the signs of the second PC scores in Figure 10.
5 CONCLUDING REMARKS
In this article, we considered the mixture model given by Equation 1 in high‐dimensional settings. We studied asymptotic properties of both the true PC scores and the sample PC scores for the high‐dimensional mixture model. We gave conditions under which PCA is very effective for clustering HDLSS data. We showed that, in theory, HDLSS data can be classified by the signs of the first several PC scores. However, in actual HDLSS data analyses, one may encounter cases such as those in Figures 6c and 10, where the dataset is not always classified by the signs of the first several PC scores. Several reasons should be considered: (i) actual HDLSS datasets often include several outliers, (ii) the regularity conditions are not met, and (iii) the mixing proportions εis are quite unbalanced. Thus, we recommend the following three steps: (i) apply PCA to the HDLSS data; (ii) using the PC scores, map the dataset onto a feature space such as the first three eigenspaces; and (iii) apply general clustering methods such as the k‐means method to the feature space. However, the number of clusters k is unknown in general. We emphasize that the first k−1 eigenvalues are quite spiked for the model in Equation 1. Recently, Jung, Lee, and Ahn (2018) proposed a test of the number of spiked components for high‐dimensional data. Thus, one may apply the test to the choice of k for clustering.
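The three recommended steps can be sketched in NumPy as follows. This is a minimal sketch under illustrative assumptions: the simulated data, the dual‐matrix PC scores, and the farthest‐point k‐means initialization are our choices, not the paper's procedure.

```python
import numpy as np

def pc_scores(X, m):
    """Steps (i)-(ii): first m sample PC scores via the n x n dual matrix."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(Xc @ Xc.T / n)
    idx = np.argsort(evals)[::-1][:m]
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0.0) * n)

def kmeans(S, k, iters=100):
    """Step (iii): Lloyd's k-means with farthest-point initialization,
    which is reliable when the score clusters are well separated."""
    centers = [S[0]]
    for _ in range(k - 1):
        dist = np.min([((S - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(S[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iters):
        lab = np.argmin(((S[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.vstack([S[lab == j].mean(axis=0) if np.any(lab == j)
                         else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return lab

# Hypothetical three-class demonstration (not one of the paper's datasets).
rng = np.random.default_rng(2)
d, k, per = 2000, 3, 15
X = np.vstack([rng.normal(mu, 1.0, size=(per, d)) for mu in (0.0, 1.0, -1.0)])
labels = kmeans(pc_scores(X, k - 1), k)
```

In practice one may prefer a library k‐means with multiple restarts; the farthest‐point initialization is used here only so that the small, self‐contained sketch behaves predictably on well‐separated score clusters.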
ACKNOWLEDGEMENTS
We would like to thank the two anonymous referees for their constructive comments.
Appendix A: Lemmas and their proofs
Lemma 1. When k=2, it holds under Conditions 2 that
Proof. As μ2=0, we can write that for i=1,2; j=1,…,n. From the fact that , we have that as d→∞ for j=1,…,n; i=1,2 under Condition 2. Also, we have that for all j≠j′ and i,i′=1,2 under Condition 2. Then, using Chebyshev's inequality, for any τ>0, under Condition 2, it holds that for all j≠j′ and i,i′=1,2
Lemma 2. Let for i=1,…,k−1 and let Δ(i,j)=Δj,j+1/Δi,i+1 for i,j=1,…,k−1 (i<j). Under Conditions 1 and 5, it holds that as d→∞
Proof. From the fact that , under Condition 5 it holds that as d→∞
Let be an arbitrary unit vector. From , it holds that
From the facts that and Δk−1= Δk−1,k, by combining Equation A3 with Equations A2 and A4, we have that
Next, we consider λ2 and h2. From Equation A2, we note that and Δ(i,j)=o(1) for i,j=1,…,k−1 (i<j) under Condition 5. Then, under Conditions 1 and 5, it holds that for j≥2
By combining Equation A3 with Equations A4 and A5, we have that
Next, we consider λ3 and h3. Note that for j≥3 from . Then, under Conditions 1 and 5, we have that for j≥3
Similar to Equation A6, by combining Equation A3 with Equations A4 and A9, under Conditions 1 and 5, we have that
In a way similar to λ3 and h3, for λi and hi (4≤i≤k−1), we have that λi/Δi,i+1=εi(1−ε(i))/(1−ε(i−1))+o(1), and together with for i,j=1,…,k−1 (i+1<j) under Conditions 1 and 5. This concludes the proof.
Proof. We write that
Proof. From the fact that , we have that as d→∞ for i=1,…,k; j=1,…,n under Condition 2. Then, we have that for i=1,…,k−1; i′=1,…,k; j=1,…,n
Let be an arbitrary random unit vector such that . We note that . Then, by noting , under Equation A11, Conditions 1, and 6, we have that
Lemma 5. Assume Condition 5. For ni>0, i=1,…,k, it holds that
Proof. By noting Equation A10 with εi=ηi and ε(i)=η(i), i=1,…,k, we can write that
We have the eigendecomposition of VVT/n as , where is a unit eigenvector corresponding to for each i. We note that ηi>0, i=1,…,k for ni>0, i=1,…,k. Then, by noting Lemmas 2 and 3 and the fact that Equation A14 is the same as Equation A4 with ε(i)=η(i), i=1,…,k−1, under Condition 5, we have that for i=1,…,k−1
Appendix B: Additional Proposition
When Condition 5 is not met, Theorem 3 does not hold. However, in Figure 5c, we could find three separate clusters of Πi, i=1,2,3, even though Condition 5 is not met. To explain this phenomenon, we give the following result.
Proposition 3. Assume Conditions 2 and 6. Then, under the condition:
Proof. By noting that for i=1,…,k−1 when rank(SD)≥k−1, from Equation A13, we can conclude the result.
By noting that , from Proposition 3, for sufficiently large d, the estimated PC scores depend only on the structure of VTV even when Condition 5 is not met. Then, as rank(VTV)=k−1, there must be k separate clusters for Πi,i=1,…,k, in the first k−1 PC spaces as seen in Figure 5c.
Appendix C: Proofs of Theorems, Corollaries, and Propositions
Proofs of Theorem 1 and Corollary 1. We note that tr(∑1)/tr(∑)→(1−ε1ε2c) as d→∞ under Condition 4 and Δ1/tr(∑)→c(>0) as d→∞. Then, using Lemma 1, we can conclude the result of Theorem 1.
Next, we consider the proof of Corollary 1. From the fact that , it holds that when SD≠O, so that . Also, note that ||r||2=nη1η2 and . Then, using Lemma 1, under Conditions 1, it holds that . Hence, from Equation 3 and the assumption that , we have that for ni>0, i=1,2. In view of the elements of r, we can conclude the result of Corollary 1.
Proof of Proposition 1. We assume xj∈ Π1 for j=1,…,n1, xj∈ Π2 for j=n1+1,…,n, and tr(∑1)≥tr(∑2) without loss of generality. Similar to the proof of Lemma 1, under the assumptions of Proposition 1, we have that
Proofs of Theorem 2 and Corollary 2. We write that for j=1,…,n; i=1,…,k. We note that as d→∞ under Condition 1 for j=1,…,n; i=1,…,k, where is an arbitrary unit vector. Then, under Condition 1, when xj∈ Πi, it holds that
For the proof of Corollary 2, by noting that Δi,i+1/Δi=1+o(1) and for i=1,…,k−1, under Condition 5, from Lemma 2, the results are obtained straightforwardly.
Proof of Theorem 3. By combining Lemmas 4 and 5, from Theorem 2 and the assumption that for all i, the result is obtained straightforwardly.
Proof of Proposition 2. Let ∑(∗)=ε1∑1+ε2∑2. Then, we define the eigendecomposition of ∑(∗) by , where λ1(∗)≥···≥λd(∗)≥0 are the eigenvalues of ∑(∗) and hi(∗) is a unit eigenvector corresponding to λi(∗) for each i. Let λ=ε1ε2Δ1 and . Then, from , under as d→∞, it holds that , so that
Let κ(i)=λi(⋆⋆)−λ for i=1,…,d. For a sufficiently large d, when κ(1)>0, there exists some positive integer i⋆⋆ such that




