Volume 47, Issue 3 p. 899-921
ORIGINAL ARTICLE
Open Access

Geometric consistency of principal component scores for high‐dimensional mixture models and its application

Kazuyoshi Yata

Institute of Mathematics, University of Tsukuba

Makoto Aoshima

Corresponding Author

Institute of Mathematics, University of Tsukuba

Correspondence Makoto Aoshima, Institute of Mathematics, University of Tsukuba, Ibaraki 305‐8571, Japan.

Email: aoshima@math.tsukuba.ac.jp

First published: 11 November 2019
Funding information Grant‐in‐Aid for Young Scientists (B), Japan Society for the Promotion of Science (JSPS), 26800078; Grants‐in‐Aid for Scientific Research (A) and Challenging Research (Exploratory), JSPS, 15H01678; 17K19956

Abstract

In this article, we consider clustering based on principal component analysis (PCA) for high‐dimensional mixture models. We present theoretical reasons why PCA is effective for clustering high‐dimensional data. First, we derive a geometric representation of high‐dimension, low‐sample‐size (HDLSS) data taken from a two‐class mixture model. With the help of the geometric representation, we give geometric consistency properties of sample principal component scores in the HDLSS context. We then extend the ideas behind the geometric representation to multiclass mixture models and provide the corresponding geometric consistency properties. We show that PCA can cluster HDLSS data under certain conditions in a surprisingly explicit way. Finally, we demonstrate the performance of the clustering using gene expression datasets.

1 INTRODUCTION

High‐dimension, low‐sample‐size (HDLSS) data situations occur in many areas of modern science such as genetic microarrays, medical imaging, text recognition, finance, chemometrics, and so on. In recent years, substantial work has been done on HDLSS asymptotic theory, where the sample size n is fixed or n/d→0 as the data dimension d→∞. Hall, Marron, and Neeman (2005), Ahn, Marron, Muller, and Chi (2007), Yata and Aoshima (2012), and Lv (2013) explored several types of geometric representations of HDLSS data. Jung and Marron (2009) showed inconsistency properties of the sample eigenvalues and eigenvectors in the HDLSS context. Yata and Aoshima (2012, 2013) developed the noise‐reduction methodology to provide consistent estimators of both the eigenvalues and eigenvectors together with principal component (PC) scores in the HDLSS context. Shen, Shen, Zhu, and Marron (2016) and Hellton and Thoresen (2017) also provided several asymptotic properties of the sample PC scores in the HDLSS context.

The HDLSS asymptotic theory was created under the assumption that either the population distribution is Gaussian or the random variables in a sphered data matrix have a ρ‐mixing dependency. However, Yata and Aoshima (2010) developed an HDLSS asymptotic theory without such assumptions. Moreover, they created a new principal component analysis (PCA) method, called the cross‐data‐matrix methodology, that can be used to construct unbiased estimators in HDLSS nonparametric settings. Based on the cross‐data‐matrix methodology, Aoshima and Yata (2011) developed a variety of inference methods for HDLSS data, such as given‐bandwidth confidence regions, two‐sample tests, classification, variable selection, regression, and pathway analysis (see Aoshima et al., 2018 for a review).

PCA is an important visualization and dimension‐reduction technique for high‐dimensional data. Furthermore, PCA is quite popular for clustering high‐dimensional data (see section 9.2 in Jolliffe, 2002 for details). For clustering HDLSS gene expression data, see Armstrong et al. (2002) and Pomeroy et al. (2002). Liu, Hayes, Nobel, and Marron (2008) and Ahn, Lee, and Yoon (2012) gave binary split‐type clustering methods for HDLSS data. Borysov, Hannig, and Marron (2014) considered hierarchical clustering for high‐dimensional data. Li and Yao (2018) considered model‐based clustering for a high‐dimensional mixture. Given this background, we decided to focus on high‐dimensional structures of multiclass mixture models via PCA. In this article, we consider asymptotic properties of PC scores for high‐dimensional mixture models with a view to cluster analysis in HDLSS settings. The main contribution of this article is that we give theoretical reasons why PCA is effective for clustering HDLSS data.

Suppose there are k independent d‐variate populations, Π_i, i=1,…,k, having an unknown mean vector μ_i and an unknown (positive‐semidefinite) covariance matrix Σ_i for each i. We consider a mixture model to classify a dataset into k (≥2) groups. We assume that any sample is taken with mixing proportion ε_i from Π_i, where ε_i ∈ (0,1) and ∑_{i=1}^{k} ε_i = 1, but the label of the population is missing. We assume that k and the ε_i s do not depend on d. We consider a mixture model whose probability density function (or probability function) is given by
f(x) = ∑_{i=1}^{k} ε_i π_i(x; μ_i, Σ_i),  (1)
where x ∈ ℝ^d and π_i(x; μ_i, Σ_i) is a d‐dimensional probability density function (or probability function) of Π_i having mean vector μ_i and covariance matrix Σ_i. Suppose we have a d×n data matrix X = [x_1,…,x_n], where x_j, j=1,…,n, are independently taken from Equation 1. We assume n ≥ k. Let
n_i = #{j | x_j ∈ Π_i, j = 1,…,n} and η_i = n_i/n for i = 1,…,k,
where #A denotes the number of elements in a set A. We assume that n and the n_i s do not depend on d. Let μ and Σ be the mean vector and the covariance matrix of Equation 1, respectively. Then, we have that
μ = ∑_{i=1}^{k} ε_i μ_i and Σ = ∑_{i=1}^{k−1} ∑_{j=i+1}^{k} ε_i ε_j (μ_i − μ_j)(μ_i − μ_j)^T + ∑_{i=1}^{k} ε_i Σ_i.
We note that E(x | x ∈ Π_i) = μ_i and Var(x | x ∈ Π_i) = Σ_i for i=1,…,k. We denote the eigendecomposition of Σ by Σ = HΛH^T, where Λ = diag(λ_1,…,λ_d) with eigenvalues λ_1 ≥ … ≥ λ_d ≥ 0 and H = [h_1,…,h_d] is an orthogonal matrix of the corresponding eigenvectors. Write x_j − μ = HΛ^{1/2}(z_{1j},…,z_{dj})^T for j=1,…,n. Then, (z_{1j},…,z_{dj})^T is a sphered data vector from a distribution with the identity covariance matrix; E{(z_{1j},…,z_{dj})^T} = 0 and Var{(z_{1j},…,z_{dj})^T} = I_d, where I_d denotes the d‐square identity matrix. The ith true PC score of x_j is given by
h_i^T(x_j − μ) = λ_i^{1/2} z_{ij} (hereafter called s_{ij}).
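The moment identities above can be checked numerically. The following Python sketch is our own illustration, not from the paper; the component means, covariances, and mixing proportions are arbitrary assumptions. It verifies, for k = 2, that the mixture covariance Σ equals the between‐class term ε_1ε_2(μ_1 − μ_2)(μ_1 − μ_2)^T plus the mixed within‐class covariances, by comparison with a Monte Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
mu1, mu2 = np.ones(d), np.zeros(d)          # hypothetical class means
Sigma1, Sigma2 = np.eye(d), 0.5 * np.eye(d) # hypothetical class covariances
eps1, eps2 = 0.6, 0.4                       # mixing proportions

# Population mean and covariance of the mixture, per the identities above.
mu = eps1 * mu1 + eps2 * mu2
Sigma = (eps1 * eps2 * np.outer(mu1 - mu2, mu1 - mu2)
         + eps1 * Sigma1 + eps2 * Sigma2)

# Monte Carlo check: draw a label for each sample, then draw from that class.
n = 200_000
in1 = rng.random(n) < eps1
x = np.where(in1[:, None],
             rng.multivariate_normal(mu1, Sigma1, n),
             rng.multivariate_normal(mu2, Sigma2, n))

assert np.allclose(x.mean(axis=0), mu, atol=0.02)
assert np.allclose(np.cov(x, rowvar=False), Sigma, atol=0.05)
```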
We note that Var(s_{ij}) = λ_i for all i,j. Let Δ_i = ||μ_i||² for i=1,…,k, where ||·|| denotes the Euclidean norm. Here, we assume that

Δ_1 ≥ Δ_2 ≥ ··· ≥ Δ_k,

without loss of generality. We also assume that

Δ_k = 0 (i.e., μ_k = 0),

for the sake of simplicity.

Remark 1. When μ_k ≠ 0, let μ_i′ = μ_i − μ_k for i=1,…,k. Then, it holds that ∑_{i=1}^{k−1} ∑_{j=i+1}^{k} ε_i ε_j (μ_i′ − μ_j′)(μ_i′ − μ_j′)^T = ∑_{i=1}^{k−1} ∑_{j=i+1}^{k} ε_i ε_j (μ_i − μ_j)(μ_i − μ_j)^T. Hence, for any inference on Σ by the sample covariance matrix, one can assume μ_k = 0 without loss of generality.

As the sign of an eigenvector is arbitrary, we assume that h_i^T μ_i ≥ 0 for i=1,…,k−1, without loss of generality. In addition, we assume the cluster means are more spread out than the within‐class variation in the following sense:

Condition 1. max_{i=1,…,k} λ_max(Σ_i)/Δ_{k−1} → 0 as d→∞.

Here, λ_max(M) denotes the largest eigenvalue of a positive‐semidefinite matrix M. We consider clustering x_1,…,x_n into one of the Π_i s in HDLSS situations. When k=2, Yata and Aoshima (2010) gave the following result. We denote the angle between two nonzero vectors x and y by Angle(x, y) = cos^{−1}{x^T y/(||x||·||y||)}. By noting that μ_2 = 0, under Condition 1, it holds that as d→∞
λ_1/(ε_1 ε_2 Δ_1) → 1 and Angle(h_1, μ_1) → 0,  (2)
from the fact that λ_1/Δ_1 = h_1^T Σ h_1/Δ_1 = ε_1 ε_2 (h_1^T μ_1)²/Δ_1 + o(1) as d→∞ under Condition 1. Furthermore, for the normalized first PC score s_{1j}/λ_1^{1/2} (= z_{1j}), it follows that
plim_{d→∞} s_{1j}/λ_1^{1/2} = (ε_2/ε_1)^{1/2} when x_j ∈ Π_1, and = −(ε_1/ε_2)^{1/2} when x_j ∈ Π_2,  (3)
for j=1,…,n. Here, "plim" denotes convergence in probability. This result is a special case of Theorem 2 in Section 3; see Remark 8. One would be able to cluster the x_j s into two groups if s_{1j} is accurately estimated in HDLSS situations.

In this article, we consider asymptotic properties of sample PC scores for Equation 1 in the HDLSS context that d→∞ while n is fixed. In Section 2, we first derive a geometric representation of HDLSS data taken from the two‐class mixture model. With the help of the geometric representation, we give geometric consistency properties of sample PC scores in the HDLSS context. We show that PCA can cluster HDLSS data under certain conditions in a surprisingly explicit way. In Section 3, we investigate asymptotic behaviors of true PC scores for the k(≥3)‐class mixture model and provide geometric consistency properties of sample PC scores when k≥3. In Section 4, we demonstrate the performance of clustering based on sample PC scores using gene expression datasets. We show that the real‐HDLSS datasets hold the geometric consistency properties.

2 PC SCORES FOR TWO‐CLASS MIXTURE MODEL

In this section, we consider PC scores for the two‐class (k=2) mixture model.

2.1 Preliminary

The sample covariance matrix is given by S = (n−1)^{−1}(X − X̄)(X − X̄)^T = (n−1)^{−1} ∑_{j=1}^{n} (x_j − x̄_n)(x_j − x̄_n)^T, where x̄_n = n^{−1} ∑_{j=1}^{n} x_j and X̄ = x̄_n 1_n^T with 1_n = (1,…,1)^T ∈ ℝ^n. Then, we define the n×n dual sample covariance matrix by S_D = (n−1)^{−1}(X − X̄)^T(X − X̄). We note that rank(S_D) ≤ n−1. Let λ̂_1 ≥ ··· ≥ λ̂_{n−1} ≥ 0 be the eigenvalues of S_D. Then, we define the eigendecomposition of S_D by
S_D = ∑_{i=1}^{n−1} λ̂_i û_i û_i^T,
where û_i = (û_{i1},…,û_{in})^T denotes a unit eigenvector corresponding to λ̂_i. As the sign of the û_i s is arbitrary, we assume û_i^T z_i ≥ 0 for all i without loss of generality, where z_i = (z_{i1},…,z_{in})^T. Note that S and S_D share the nonzero eigenvalues. Let
ẑ_{ij} = n^{1/2} û_{ij} for i = 1,…,n−1; j = 1,…,n.
We note that ẑ_{ij} is an estimate of s_{ij}/λ_i^{1/2} (= z_{ij}) for i=1,…,n−1; j=1,…,n from the facts that

ẑ_{ij} = {n/(n−1)}^{1/2} ĥ_i^T(x_j − x̄_n)/λ̂_i^{1/2} and ∑_{j=1}^{n} ẑ_{ij}²/n = 1 if λ̂_i > 0,
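The dual‐PCA computation above is straightforward to implement. The following sketch (our own illustration under the assumption of a generic data matrix, not the authors' code) forms S_D from a d×n data matrix, eigendecomposes it, and rescales each unit eigenvector by n^{1/2} to obtain the normalized sample PC scores ẑ_{ij}.

```python
import numpy as np

def dual_pc_scores(X):
    """Eigenvalues of S_D and normalized sample PC scores from a d x n matrix X."""
    d, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)   # column-center: X - xbar_n 1_n^T
    SD = Xc.T @ Xc / (n - 1)                 # n x n dual sample covariance
    evals, evecs = np.linalg.eigh(SD)        # eigh returns ascending order
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    # zhat[i, j] = n^{1/2} u-hat_{ij}; the sign of each row is arbitrary.
    zhat = np.sqrt(n) * evecs.T
    return evals[: n - 1], zhat[: n - 1]

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 8))             # toy d = 50, n = 8
lam, zhat = dual_pc_scores(X)
# Since each u-hat_i is a unit vector, sum_j zhat_{ij}^2 / n = 1 for every i.
assert np.allclose((zhat ** 2).sum(axis=1) / 8, 1.0)
```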
where ĥ_i denotes a unit eigenvector of S corresponding to λ̂_i. Let X_0 = X − μ1_n^T and P_n = I_n − n^{−1} 1_n 1_n^T. We note that S_D = P_n X_0^T X_0 P_n/(n−1). We consider the sphericity condition: tr(Σ²)/tr(Σ)² → 0 as d→∞. Note that the sphericity condition is equivalent to "λ_1/tr(Σ) → 0 as d→∞." When one can assume that X is Gaussian or Z = (z_{ij}) is ρ‐mixing and the fourth moments of each variable in Z are uniformly bounded, under the sphericity condition, Jung and Marron (2009) suggested a geometric representation as follows:
plim_{d→∞} X_0^T X_0/tr(Σ) = I_n, so that plim_{d→∞} (n−1)S_D/tr(Σ) = P_n.  (4)

Remark 2. Yata and Aoshima (2012) showed that Equation 4 holds under the sphericity condition and Var(||x_j − μ||²)/tr(Σ)² → 0 as d→∞.

From Equation 4, we observe that the eigenvalues become deterministic as the dimension increases, whereas the eigenvectors of S_D do not uniquely determine directions. In addition, Hellton and Thoresen (2017) presented asymptotic properties of the sample PC scores when Z is ρ‐mixing. We note that Equation 1 does not presuppose the assumption that X is Gaussian or Z is ρ‐mixing. See section 4.1.1 in Qiao, Zhang, Liu, Todd, and Marron (2010) for details. In the present article, we present new asymptotic properties of the sample PC scores for Equation 1.

2.2 Geometric representation and consistency property of PC scores when k = 2

We now derive a geometric representation for Equation 1; the finding is completely different from Equation 4. We assume the following conditions:

Condition 2. max_{i=1,…,k} tr(Σ_i²)/Δ_{k−1}² → 0 as d→∞.

Condition 3. max_{i=1,…,k} Var(||x − μ_i||² | x ∈ Π_i)/Δ_{k−1}² → 0 as d→∞.

Condition 4. |tr(Σ_i) − tr(Σ_j)|/Δ_{k−1} → 0 as d→∞ for all i,j = 1,…,k (i < j).

Remark 3. Condition 2 is stronger than Condition 1 as it holds that {λ_max(Σ_i)}² ≤ tr(Σ_i²) for i=1,…,k. Let β (>0) be a constant such that lim inf_{d→∞} Δ_{k−1}/d^β > 0. Let λ_{i1} ≥ ··· ≥ λ_{id} ≥ 0 be the eigenvalues of Σ_i for i=1,…,k. For a spiked model such as

λ_{ij} = a_{ij} d^{α_{ij}} (j = 1,…,t_i) and λ_{ij} = c_{ij} (j = t_i + 1,…,d),

with positive constants a_{ij}, c_{ij}, and α_{ij} (not depending on d) and a positive integer t_i (not depending on d), Condition 1 holds when α_{i1} < β for i=1,…,k. Also, Condition 2 holds when β > 1/2 and α_{i1} < β for i=1,…,k. See Yata and Aoshima (2012) for the details of the spiked model.

Remark 4. If the Π_i s are Gaussian, it holds that Var(||x − μ_i||² | x ∈ Π_i) = O{tr(Σ_i²)} for i=1,…,k, so that Condition 3 naturally holds under Condition 2.

Remark 5. When k=2, Condition 4 holds if tr(Σ_1)/tr(Σ_2) → 1 as d→∞ and lim inf_{d→∞} {Δ_1/tr(Σ)} > 0.

We define

r_j = (−1)^{i+1}(1 − η_i) according to x_j ∈ Π_i for j = 1,…,n.
The following result gives a geometric representation for Equation 1 when k=2.

Theorem 1. Assume Δ_1/tr(Σ) → c (>0) as d→∞. Under Conditions 2–4, it holds that

plim_{d→∞} (n−1)S_D/tr(Σ) = c r r^T + (1 − ε_1 ε_2 c)P_n,  (5)

where r = (r_1,…,r_n)^T.

When S_D ≠ O, we note that û_1^T 1_n = 0, so that û_1^T P_n = û_1^T. Thus, from Equation 5, the first eigenvector of S_D uniquely determines its direction. In fact, by noting that r^T 1_n = 0 and ||r||² = nη_1η_2, we have the following results for the first eigenvector and PC scores when k=2. Using Corollary 1, one can cluster the x_j s into two groups by the sign of the ẑ_{1j} s:

Corollary 1. Under Conditions 2–4, it holds that for n_i > 0, i = 1,2,

plim_{d→∞} û_1 = r/(nη_1η_2)^{1/2}, and plim_{d→∞} ẑ_{1j} = (η_2/η_1)^{1/2} when x_j ∈ Π_1, and = −(η_1/η_2)^{1/2} when x_j ∈ Π_2, for j = 1,…,n.

We considered an easy example such as Π_i: N_d(μ_i, Σ_i), i=1,2, with μ_1 = 1_d, μ_2 = 0, Σ_1 = (0.3^{|i−j|^{1/3}}), and Σ_2 = B(0.3^{|i−j|^{1/3}})B, where B = diag[−{0.5 + 1/(d+1)}^{1/2}, {0.5 + 2/(d+1)}^{1/2},…,(−1)^d{0.5 + d/(d+1)}^{1/2}]. We note that Δ_1 = d and Σ_1 ≠ Σ_2 but tr(Σ_1) = tr(Σ_2) = d. Then, Conditions 1–4 hold. We set n_1 = 1 and n_2 = 2. We took n = 3 samples as x_1 ∈ Π_1 and x_2, x_3 ∈ Π_2. In Figure 1, we displayed scatter plots of 20 independent pairs of ±û_1 when (a) d=5, (b) d=50, (c) d=500, and (d) d=5,000. We denoted r = (2/3, −1/3, −1/3)^T by the solid line and 1_n = (1,1,1)^T by the dotted line. We note that Angle(û_1, 1_n) = π/2 when S_D ≠ O. We observed that all the plots of ±û_1 gather on the surface of the orthogonal complement of 1_n. Also, the plots appeared closer to r as d increases. Thus, one can cluster the x_j s into two groups by the sign of the ẑ_{1j} s.

Figure 1. Toy example to illustrate the geometric representation of ±û_1 on the unit sphere when k=2 and n=3. We plotted 20 independent pairs of ±û_1 when x_1 ∈ Π_1 and x_2, x_3 ∈ Π_2. The solid line denotes r = (2/3, −1/3, −1/3)^T and the dotted line denotes 1_n = (1,1,1)^T.
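Corollary 1 can be reproduced in simulation. The sketch below is our own simplified variant of the toy setting (identity class covariances are an assumption, not the covariances used in the paper); it checks that, for large d, the sign of the first normalized sample PC score ẑ_{1j} recovers the two classes, with scores near ±(η_2/η_1)^{1/2} and −(η_1/η_2)^{1/2}.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n1, n2 = 2000, 4, 6                 # HDLSS: d >> n
n = n1 + n2
mu1 = np.ones(d)                       # Delta_1 = d; mu_2 = 0

# Columns 0..3 from Pi_1, columns 4..9 from Pi_2 (identity covariances assumed).
X = np.column_stack([mu1 + rng.standard_normal(d) for _ in range(n1)]
                    + [rng.standard_normal(d) for _ in range(n2)])

Xc = X - X.mean(axis=1, keepdims=True)
SD = Xc.T @ Xc / (n - 1)               # dual sample covariance
evals, evecs = np.linalg.eigh(SD)
z1 = np.sqrt(n) * evecs[:, -1]         # first normalized sample PC scores
if z1[:n1].mean() < 0:                 # eigenvector sign is arbitrary
    z1 = -z1

# Scores cluster near (eta2/eta1)^{1/2} for Pi_1 and -(eta1/eta2)^{1/2} for Pi_2,
# so the sign of z1 classifies the samples.
assert np.all(z1[:n1] > 0) and np.all(z1[n1:] < 0)
```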

Next, we investigated the robustness of Corollary 1 against Condition 4 by some simulation studies. Let Δ = |tr(Σ_1) − tr(Σ_2)|. We considered an easy example such as Π_i: N_d(μ_i, Σ_i), i=1,2, with μ_1 = (1,…,1,0,…,0)^T whose first ⌈d^{3/4}⌉ elements are 1, μ_2 = 0, Σ_1 = γ(0.3^{|i−j|^{1/3}}), and Σ_2 = B(0.3^{|i−j|^{1/3}})B, where γ ≥ 1. Here, ⌈·⌉ denotes the ceiling function. Note that Δ = (γ−1)d. We set d=5,000, n=10, n_1=4, and n_2=6. We took n=10 samples as x_1,…,x_4 ∈ Π_1 and x_5,…,x_{10} ∈ Π_2. In Figure 2, we displayed scatter plots of (ẑ_{1j}, ẑ_{2j}), j=1,…,n, when (a) γ = 1 + 2d^{−1/2}, (b) γ = 1 + 2d^{−1/4}, and (c) γ = 3. From Corollary 1, we denoted (3/2)^{1/2} and −(2/3)^{1/2} by dotted lines. Note that Δ/Δ_1 ≈ 2d^{−1/4} for (a), Δ/Δ_1 ≈ 2 for (b), and Δ/Δ_1 ≈ 2d^{1/4} for (c). Thus, Condition 4 holds for (a), while it does not hold for (b) and (c). For (a) and (b), we observed that the estimated PC scores give good performances. On the other hand, the first PC scores did not gather around (3/2)^{1/2} or −(2/3)^{1/2} for (c). However, the (ẑ_{1j}, ẑ_{2j}) s were concentrated at the origin (0,0) for x_j ∈ Π_2.

Figure 2. Toy example to illustrate asymptotic behaviors of the estimated principal component scores when k=2. We plotted (ẑ_{1j}, ẑ_{2j}), denoted by small circles when x_j ∈ Π_1 and by small triangles when x_j ∈ Π_2. The theoretical convergent points, (3/2)^{1/2} and −(2/3)^{1/2}, are denoted by dotted lines.

When Δ_1/Δ → 0 as d→∞, we give the following result to explain the phenomenon in Figure 2c. Under the assumptions of Proposition 1, one can cluster the x_j s into two groups by the size of the ẑ_{ij} s even when Condition 4 is not met:

Proposition 1. Assume k=2, n_l ≥ 2, and n_{l′} ≥ 1, where l′ (≠ l) is an integer such that tr(Σ_l) > tr(Σ_{l′}). Assume also that max_{l=1,2} tr(Σ_l²)/Δ² → 0, max_{l=1,2} Var(||x − μ_l||² | x ∈ Π_l)/Δ² → 0, and Δ_1/Δ → 0 as d→∞. Then, it holds that

plim_{d→∞} |ẑ_{ij}| > 0 when x_j ∈ Π_l for some i ∈ [1, n_l − 1], and plim_{d→∞} ẑ_{ij} = 0 when x_j ∈ Π_{l′} for i = 1,…,n_l − 1.

Remark 6. For k ≥ 3, we do not give any consistency property when Condition 4 is not met because the sufficient conditions of Proposition 1 become quite complicated for k ≥ 3. A detailed study of the case k ≥ 3 is left for future work.

The assumptions of Proposition 1 hold for (c) of Figure 2. Thus, the (ẑ_{1j}, ẑ_{2j}) s were concentrated at the origin (0,0) for x_j ∈ Π_2 in (c).

3 PC SCORES FOR MULTICLASS MIXTURE MODEL

In this section, we consider PC scores for the k(≥3)‐class mixture model.

3.1 Asymptotic behaviors of true PC scores when k ≥ 3

Let

ε_(0) = 0 and ε_(i) = ∑_{j=1}^{i} ε_j for i = 1,…,k.
We assume the following condition:

Condition 5. Angle(μ_i, μ_j) → π/2 and Δ_j/Δ_i → 0 as d→∞ for i,j = 1,…,k−1 (i < j).

Remark 7. We consider the case when all elements of the μ_i s are constants (not depending on d), such as μ_i = (μ_{i1},…,μ_{ip},0,…,0)^T with μ_{is} ≠ 0 (not depending on d) for s = 1,…,p. If all elements of the μ_i s are constants, the condition "Angle(μ_i, μ_j) → π/2 as d→∞" holds for i < j under Δ_j/Δ_i → 0 as d→∞, so that Condition 5 holds under Δ_{i+1}/Δ_i → 0 as d→∞ for i = 1,…,k−2. See the settings of Figures 3 and 4. Note that Δ_1 ≫ ··· ≫ Δ_{k−1} under Condition 5. We emphasize that Conditions 1 and 5 become strict as k increases.

Figure 3. Toy example to illustrate the asymptotic behaviors of true principal component scores when k=3. We plotted (z_{1j}, z_{2j}), denoted by small circles when x_j ∈ Π_1, by small triangles when x_j ∈ Π_2, and by small squares when x_j ∈ Π_3. The dashed triangle consists of three vertices, namely, (1,0), (−1,2^{1/2}), and (−1,−2^{1/2}), which are theoretical convergent points.

Figure 4. Toy example to illustrate the asymptotic behaviors of true principal component scores when k=4. We plotted (z_{1j}, z_{2j}, z_{3j}). The dashed triangular pyramid was given by Equation 6 with k=4.

We have the following results.

Theorem 2. Under Conditions 1 and 5, it holds that for i = 1,…,k−1; j = 1,…,n,

plim_{d→∞} s_{ij}/λ_i^{1/2} =
0 when i ≥ 2 and x_j ∈ ∪_{m=1}^{i−1} Π_m,
[(1 − ε_(i))/{ε_i(1 − ε_(i−1))}]^{1/2} when x_j ∈ Π_i,
−[ε_i/{(1 − ε_(i))(1 − ε_(i−1))}]^{1/2} when x_j ∈ ∪_{m=i+1}^{k} Π_m.  (6)

Remark 8.The consistency in Equation 3 is equivalent to Equation 6 with k=2 and i=1.

Corollary 2. Under Conditions 1 and 5, it holds that for i = 1,…,k−1,

λ_i/{ε_i(1 − ε_(i))Δ_i/(1 − ε_(i−1))} → 1 and Angle(h_i, μ_i) → 0 as d→∞.

For example, when k=3, from Equation 6, we have that for j = 1,…,n,

plim_{d→∞} s_{1j}/λ_1^{1/2} = {(1 − ε_1)/ε_1}^{1/2} when x_j ∈ Π_1, and = −{ε_1/(1 − ε_1)}^{1/2} when x_j ∉ Π_1;

plim_{d→∞} s_{2j}/λ_2^{1/2} = 0 when x_j ∈ Π_1, = [ε_3/{ε_2(1 − ε_1)}]^{1/2} when x_j ∈ Π_2, and = −[ε_2/{ε_3(1 − ε_1)}]^{1/2} when x_j ∈ Π_3.

One can check whether xj∈ Π1 or not by the first PC score. If xj∉ Π1, one can check whether xj∈ Π2 or xj∈ Π3 by the second PC score. In general, one can cluster xjs using at most the first k−1 PC scores.
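The sequential rule above can be made concrete by tabulating the limit values in Equation 6. The sketch below (our own illustration; the mixing proportions are assumed known and chosen arbitrarily) computes those limits for k = 3 and recovers the triangle vertices of Figure 3.

```python
import numpy as np

def limit_score(i, cls, eps):
    """Limit of s_{ij}/lambda_i^{1/2} from Equation 6 when x_j is in class
    `cls` (1-based); eps is the vector of mixing proportions."""
    e = np.concatenate([[0.0], np.cumsum(eps)])   # e[i] = eps_(i)
    if cls < i:                                   # earlier classes: limit 0
        return 0.0
    if cls == i:                                  # own class: positive vertex
        return np.sqrt((1 - e[i]) / (eps[i - 1] * (1 - e[i - 1])))
    # later classes: common negative value
    return -np.sqrt(eps[i - 1] / ((1 - e[i]) * (1 - e[i - 1])))

eps = np.array([0.5, 0.25, 0.25])                 # k = 3 mixing proportions
# First score separates Pi_1 (positive) from the rest (negative); the second
# score then separates Pi_2 (positive) from Pi_3 (negative).
pts = {c: (limit_score(1, c, eps), limit_score(2, c, eps)) for c in (1, 2, 3)}
```

With these proportions, `pts` reproduces the vertices (1, 0), (−1, 2^{1/2}), and (−1, −2^{1/2}) plotted in Figure 3.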

We considered a toy example such as Π_i: N_d(μ_i, Σ_i), i = 1,…,4, where μ_1 = 1_d, μ_2 = (1,…,1,0,…,0)^T whose first ⌈d^{3/4}⌉ elements are 1, μ_3 = (1,…,1,0,…,0)^T whose first ⌈d^{1/2}⌉ elements are 1, and μ_4 = 0. We set Σ_1 = (0.3^{|i−j|^{1/3}}), Σ_2 = B(0.3^{|i−j|^{1/3}})B, Σ_3 = 0.8Σ_1, and Σ_4 = 1.2Σ_2, where B is defined in Section 2.2. Then, Conditions 1 and 5 hold. We first considered the case when k=3: Π_i, i=1,2,3, having (ε_1, ε_2, ε_3) = (1/2, 1/4, 1/4). We set n=20 and (n_1, n_2, n_3) = (10, 5, 5). From Theorem 2, one can expect that (z_{1j}, z_{2j}) (= (s_{1j}/λ_1^{1/2}, s_{2j}/λ_2^{1/2})) becomes close to (1,0) when x_j ∈ Π_1, (−1, 2^{1/2}) when x_j ∈ Π_2, and (−1, −2^{1/2}) when x_j ∈ Π_3. In Figure 3, we displayed scatter plots of (z_{1j}, z_{2j}), j=1,…,n, when (a) d=100, (b) d=1,000, and (c) d=10,000. We observed that the scatter plots appear closer to those three vertices as d increases.

Next, we considered the case when k=4: Π_i, i=1,…,4, having ε_1 = ··· = ε_4 = 1/4. We set n=20 and n_1 = ··· = n_4 = 5. In Figure 4, we displayed scatter plots of (z_{1j}, z_{2j}, z_{3j}), j=1,…,n, when (a) d=100, (b) d=1,000, and (c) d=10,000. From Theorem 2, we displayed the triangular pyramid given by Equation 6 with k=4. As expected theoretically, we observed that the scatter plots appear close to the four vertices of the triangular pyramid as d increases. They seemed to converge more slowly in Figure 4 than in Figure 3. This is because the conditions of Theorem 2 become strict as k increases. See Remark 7.

3.2 Consistency property of PC scores when k ≥ 3

Let

η_(0) = 0 and η_(i) = ∑_{j=1}^{i} η_j for i = 1,…,k.
We assume the following condition:

Condition 6. max_{i=1,…,k−1} (μ_i^T Σ_j μ_i)/Δ_{k−1}² → 0 as d→∞ for j = 1,…,k.

Remark 9. From the fact that μ_i^T Σ_j μ_i ≤ Δ_i λ_max(Σ_j), Condition 6 holds under Δ_1 λ_max(Σ_j)/Δ_{k−1}² → 0 as d→∞ for j = 1,…,k.

As for the estimated PC scores, we have the following result. From Theorem 3, one can cluster the x_j s into k groups by the elements of û_i, i = 1,…,k−1:

Theorem 3. Under Conditions 2–6, it holds that for n_l > 0, l = 1,…,k,

plim_{d→∞} ẑ_{ij} =
0 when i ≥ 2 and x_j ∈ ∪_{m=1}^{i−1} Π_m,
[(1 − η_(i))/{η_i(1 − η_(i−1))}]^{1/2} when x_j ∈ Π_i,
−[η_i/{(1 − η_(i))(1 − η_(i−1))}]^{1/2} when x_j ∈ ∪_{m=i+1}^{k} Π_m,  (7)

for i = 1,…,k−1; j = 1,…,n.

Condition 5 is essential for the consistency properties given in Theorems 2 and 3. We investigated the robustness of Theorem 3 against Condition 5 by some simulation studies. We considered a toy example such as Π_i: N_d(μ_i, Σ_i), i=1,2,3, where μ_1 = 1_d, μ_2 = (1,…,1,0,…,0)^T whose first ⌈ζd⌉ elements are 1, μ_3 = 0, Σ_1 = (0.3^{|i−j|^{1/3}}), Σ_2 = B(0.3^{|i−j|^{1/3}})B, and Σ_3 = (0.4^{|i−j|^{1/3}}). We set d=5,000, n=20, and (n_1, n_2, n_3) = (10, 5, 5). In Figure 5, we displayed scatter plots of (ẑ_{1j}, ẑ_{2j}), j=1,…,n, when (a) ζ=1/5, (b) ζ=2/5, and (c) ζ=4/5. Also, we displayed the triangle given by Equation 7 with k=3. Note that Angle(μ_1, μ_2) = 0.352π and Δ_2/Δ_1 = 1/5 for (a), Angle(μ_1, μ_2) = 0.282π and Δ_2/Δ_1 = 2/5 for (b), and Angle(μ_1, μ_2) = 0.148π and Δ_2/Δ_1 = 4/5 for (c). For (a) and (b), we observed that the estimated PC scores give good performances. On the other hand, the estimated PC scores seemed not to converge to the theoretical points for (c). This is because Condition 5 is not met. However, we could find three separate clusters for Π_i, i=1,2,3. See Appendix B for the reason.

Figure 5. Toy example to illustrate asymptotic behaviors of the estimated principal component scores when k=3. We plotted (ẑ_{1j}, ẑ_{2j}), denoted by small circles when x_j ∈ Π_1, by small triangles when x_j ∈ Π_2, and by small squares when x_j ∈ Π_3. The dashed triangle consists of three vertices, namely, (1,0), (−1,2^{1/2}), and (−1,−2^{1/2}), which are the theoretical convergent points.

4 REAL‐DATA EXAMPLES

We demonstrate the performance of clustering, based on sample PC scores, using gene expression datasets.

4.1 Clustering when k = 2

We analyzed microarray data by Chiaretti et al. (2004) in which the dataset consists of 12,625 (=d) genes and 128 samples. The dataset has two tumor cellular subtypes, Π_1: B cell (95 samples) and Π_2: T cell (33 samples). Refer to Jeffery, Higgins, and Culhane (2006) as well. We checked behaviors of the PC scores using several samples from the two tumor cellular subtypes. We considered three cases: (a) n=10 samples consisting of the first five samples from each of Π_1 and Π_2 (i.e., n_1=5 and n_2=5), (b) n=40 samples consisting of the first 20 samples from each of Π_1 and Π_2 (i.e., n_1=20 and n_2=20), and (c) n=128 samples consisting of n_1=95 samples from Π_1 and n_2=33 samples from Π_2. In the top panels of Figure 6, we displayed scatter plots of the first two PC scores, the (ẑ_{1j}, ẑ_{2j}) s, for (a), (b), and (c). From Corollary 1, we denoted (η_2/η_1)^{1/2} and −(η_1/η_2)^{1/2} by dotted lines. For (a), we observed that the estimated PC scores give good performances. The first PC scores gathered around (η_2/η_1)^{1/2} or −(η_1/η_2)^{1/2}. For (b), the estimated PC scores gave adequate performances except for two points from Π_2. Those two samples, which are the ninth and twentieth samples of Π_2, are probably outliers. In fact, the two points are far from the cluster of Π_2. The other 38 samples were perfectly classified into the two groups by the sign of the first PC scores. As for (c), although there seemed to be two clusters apart from the two samples, we could not classify the dataset by the sign of the first PC scores. This is probably because η_1 and η_2 are unbalanced. From Equation 2, when the mixing proportions are unbalanced, λ_1 becomes small. The first eigenspace was possibly affected by the other eigenspaces, so that the first PC scores appear in the wrong direction. We tested the clustering excluding the two outlying samples; we used the remaining 31 samples for Π_2.
We considered the following three cases for samples from Π_1: (d) the first 16 samples from Π_1, so that n_1=16, n_2=31, n=47, and η_1/η_2 ≈ 0.5; (e) the first 31 samples from Π_1, so that n_1=31, n_2=31, n=62, and η_1/η_2 = 1; and (f) the first 62 samples from Π_1, so that n_1=62, n_2=31, n=93, and η_1/η_2 = 2. In the bottom panels of Figure 6, we displayed scatter plots of the (ẑ_{1j}, ẑ_{2j}) s for (d), (e), and (f). For (d) and (e), we observed that the estimated PC scores give good performances. As for (f), although there seemed to be two clusters, we could not classify the dataset by the sign of the first PC scores. Note that η_1 and η_2 are unbalanced in (d) and (f). Even though (d) is an unbalanced case, the estimated PC scores worked well. We had an estimate of the ratio of the largest eigenvalues, λ_max(Σ_1)/λ_max(Σ_2), as 1.598, by the noise‐reduction methodology given by Yata and Aoshima (2012). The first eigenspace of Σ in (d) is less affected by the first eigenspaces of the Σ_i s than in (f), as Σ = ε_1ε_2(μ_1 − μ_2)(μ_1 − μ_2)^T + ε_1Σ_1 + ε_2Σ_2. This is probably the reason why the estimated PC scores gave good performances even in (d).

Figure 6. Scatter plots of the first two principal component scores, supposing k=2 in the dataset of Chiaretti et al. (2004). We denoted them by small circles when x_j ∈ Π_1 and by small triangles when x_j ∈ Π_2. The theoretical convergent points, namely, (η_2/η_1)^{1/2} and −(η_1/η_2)^{1/2}, are denoted by dotted lines. The two samples, encircled by dots in (b) and (c), are probably outliers.

4.2 Clustering when k ≥ 3

We analyzed microarray data by Bhattacharjee et al. (2001) in which the dataset consisted of five lung carcinoma types with d=3,312. We used only four classes: Π_1: pulmonary carcinoids (20 samples), Π_2: normal lung (17 samples), Π_3: squamous cell lung carcinomas (21 samples), and Π_4: adenocarcinomas (20 samples), so that n_1=20, n_2=17, n_3=21, and n_4=20. Note that Π_4 originally had 139 samples. We used only the first 20 samples from Π_4 in order to keep the sample sizes balanced with the other classes. We first considered clustering when k=3 under the following setups: (a) the dataset consists of Π_1, Π_2, and Π_3 (n=58); (b) the dataset consists of Π_1, Π_2, and Π_4 (n=57); and (c) the dataset consists of Π_1, Π_3, and Π_4 (n=61). In Figure 7, we displayed scatter plots of the first two PC scores, the (ẑ_{1j}, ẑ_{2j}) s, for each of (a), (b), and (c). Also, we displayed the triangle given by Equation 7 with k=3 using Theorem 3. We observed that the estimated PC scores give good performances. The three clusters gathered around each vertex for (a), (b), and (c).

Figure 7. Scatter plots of the first two principal component scores, supposing k=3 in the dataset of Bhattacharjee et al. (2001). We denoted them by small circles when x_j ∈ Π_1, by small triangles when x_j ∈ Π_2, by small squares when x_j ∈ Π_3, and by small inverted triangles when x_j ∈ Π_4. The theoretical convergent points are denoted by the vertices of the triangle.

Next, we considered clustering when k=4: Π_i, i=1,…,4, so that n=78. In Figure 8, we displayed scatter plots of the first three PC scores. The scores seemed not to converge to the theoretical convergent points given by Equation 7 in Theorem 3. This is probably because the conditions of Theorem 3 become strict as k increases. See Remark 7. Thus, the convergence is slower than in the case when k=3 in Figure 7. However, there seemed to be four separate clusters, one for each Π_i.

Finally, we introduce an example using next‐generation sequencing datasets. Shen, Shen, Zhu, and Marron (2012, 2016) gave a scatter plot of the first two PC scores for the next‐generation sequencing cancer data by Wilhelm and Landry (2009) in which the dataset consists of three classes with d=1,709 and n=180. See Figure 9, which was given in figure 1 of Shen et al. (2012). The three clusters seem to form a triangle as in Figure 7.

Figure 8. Scatter plots of the first three principal component scores, supposing k=4 in the dataset of Bhattacharjee et al. (2001). The dashed triangles and triangular pyramid were given by Equation 7 with k=4.

Figure 9. Scatter plot of the first two principal component scores for the next‐generation sequencing cancer data, as given by Shen et al. (2012).

4.3 Clustering: Special case

We analyzed microarray data by Armstrong et al. (2002) in which the dataset consists of three leukemia subtypes having 12,582 (=d) genes. We used two classes, Π_1: acute lymphoblastic leukemia (24 samples) and Π_2: mixed‐lineage leukemia (20 samples), so that n_1=24, n_2=20, and n=44. In Figure 10, we displayed scatter plots of the first three PC scores.

Figure 10. Scatter plots of the first three principal component scores, supposing k=2 in the dataset of Armstrong et al. (2002).

We observed that the dataset is perfectly separated by the sign of the second PC scores. This figure looks completely different from Figure 6. This is probably because the largest eigenvalue, λ_max(Σ_1) or λ_max(Σ_2), is too large. When k=2, we give the following result to explain the phenomenon in Figure 10. Under the assumptions of Proposition 2, one can cluster the x_j s into two groups by some i_⋆th PC score even when Condition 1 is not met:

Proposition 2. Assume that max_{i=1,2} (μ_1^T Σ_i μ_1)/Δ_1² → 0 as d→∞. Then, there exists some positive integer i_⋆ such that

λ_{i_⋆}/(ε_1 ε_2 Δ_1) → 1 as d→∞.

Furthermore, assume that λ_{i_⋆} is distinct in the sense that

lim inf_{d→∞} |λ_{i_⋆}/λ_i − 1| > 0 for i = 1,…,d (i ≠ i_⋆).

Then, if h_{i_⋆}^T μ_1 ≥ 0, it holds that Angle(h_{i_⋆}, μ_1) → 0 as d→∞ and for j = 1,…,n,

plim_{d→∞} s_{i_⋆ j}/λ_{i_⋆}^{1/2} = (ε_2/ε_1)^{1/2} when x_j ∈ Π_1, and = −(ε_1/ε_2)^{1/2} when x_j ∈ Π_2.  (8)

We estimated the largest eigenvalues using the noise‐reduction methodology given by Yata and Aoshima (2012). By noting Remark 1, we considered Δ_1 as Δ_1 = ||μ_1′||² = ||μ_1 − μ_2||². We estimated Δ_1 using an unbiased estimator given by Aoshima and Yata (2014). Then, we obtained the estimates of (λ_max(Σ_1)/Δ_1, λ_max(Σ_2)/Δ_1) as (0.465, 0.787), so that Condition 1 is obviously not met. In addition, by estimating the ε_i s by the η_i s, we had ε_2 λ_max(Σ_2) > ε_1 ε_2 Δ_1. Thus, the first eigenspace of Σ is probably the first eigenspace of Σ_2, as Σ = ε_1ε_2(μ_1 − μ_2)(μ_1 − μ_2)^T + ε_1Σ_1 + ε_2Σ_2. Thus, i_⋆ in Proposition 2 must be 2. This is the reason why the dataset can be separated by the sign of the second PC scores in Figure 10.

5 CONCLUDING REMARKS

In this article, we considered the mixture model given by Equation (1) in high-dimensional settings. We studied asymptotic properties of both the true PC scores and the sample PC scores for the high-dimensional mixture model, and gave conditions under which PCA is very effective for clustering HDLSS data. We showed theoretically that HDLSS data can be clustered by the signs of the first several PC scores. However, in actual HDLSS data analyses, one may encounter cases such as those in Figures 6c and 10, where the dataset is not separated by the signs of the first several PC scores. Several reasons should be considered: (i) actual HDLSS datasets often include outliers, (ii) the regularity conditions are not met, and (iii) the mixing proportions $\varepsilon_i$ are quite unbalanced. Thus, we recommend the following three steps: (i) apply PCA to the HDLSS data; (ii) using the PC scores, map the dataset onto a feature space such as the first three eigenspaces; and (iii) apply a general clustering method such as the k-means method to the feature space. However, the number of clusters k is unknown in general. We emphasize that the first k−1 eigenvalues are quite spiked for the model (1). Recently, Jung, Lee, and Ahn (2018) proposed a test of the number of spiked components for high-dimensional data. Thus, one may apply the test to the choice of k for clustering.
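The three recommended steps can be sketched on simulated two-class HDLSS data as follows. This is our own minimal illustration (function names and simulation settings are ours, not the authors' code): dual PCA computes the sample PC scores from the $n\times n$ matrix for steps (i) and (ii), and a basic Lloyd iteration stands in for the k-means step (iii).

```python
import numpy as np

def pc_scores(X, m=3):
    """First m sample PC scores via the n x n dual covariance (HDLSS-friendly)."""
    d, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    lam, U = np.linalg.eigh(Xc.T @ Xc / (n - 1))
    lam, U = lam[::-1][:m], U[:, ::-1][:, :m]
    # row j holds the first m PC scores of observation j
    return U * np.sqrt(np.maximum(lam, 0) * n)

def kmeans(Y, k, iters=50, seed=0):
    """A few Lloyd iterations; enough for well-separated HDLSS clusters."""
    rng = np.random.default_rng(seed)
    C = Y[rng.choice(len(Y), k, replace=False)]
    for _ in range(iters):
        lab = np.argmin(((Y[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([Y[lab == c].mean(0) if np.any(lab == c) else C[c]
                      for c in range(k)])
    return lab

# simulated two-class HDLSS mixture: d = 2000, n = 40
rng = np.random.default_rng(1)
d, n1, n2 = 2000, 20, 20
mu = np.zeros(d); mu[:50] = 3.0            # mean shift on the first 50 variables
X = np.hstack([mu[:, None] + rng.standard_normal((d, n1)),
               rng.standard_normal((d, n2))])
labels = kmeans(pc_scores(X, m=3), k=2)    # steps (i)-(iii)
```

In this well-separated setting, the labels recover the two classes up to relabeling, since the first PC score carries essentially all of the class separation.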

ACKNOWLEDGEMENTS

We would like to thank the two anonymous referees for their constructive comments.

    Appendix A: Lemmas and their proofs

    Throughout, let $\mu_{i,j}=\mu_i-\mu_j$ and $\Delta_{i,j}=\|\mu_{i,j}\|^2$ for $i,j=1,\dots,k\ (i<j)$. Let $u_i=(u_{i1},\dots,u_{in})^T$, where
    $$u_{ij}=\begin{cases}0 & \text{when } i\ge 2 \text{ and } x_j\in\bigcup_{m=1}^{i-1}\Pi_m,\\ [(1-\eta_{(i)})/\{n\eta_i(1-\eta_{(i-1)})\}]^{1/2} & \text{when } x_j\in\Pi_i,\\ -[\eta_i/\{n(1-\eta_{(i)})(1-\eta_{(i-1)})\}]^{1/2} & \text{when } x_j\in\bigcup_{m=i+1}^{k}\Pi_m,\end{cases}$$
    for $i=1,\dots,k-1;\ j=1,\dots,n$. Let $\nu_i=\sum_{m=1}^{k}\eta_m(\mu_i-\mu_m)$ for $i=1,\dots,k$. Let $V=[\nu_{(1)},\dots,\nu_{(n)}]$, where $\nu_{(j)}=\nu_i$ according to $x_j\in\Pi_i$ for $j=1,\dots,n$. Note that $V1_n=\sum_{j=1}^{n}\nu_{(j)}=0$. We define the eigendecomposition of $V^TV/n$ by $V^TV/n=\sum_{i=1}^{k-1}\tilde\lambda_i\tilde u_i\tilde u_i^T$ from the fact that $\mathrm{rank}(V)\le k-1$, where $\tilde\lambda_1\ge\cdots\ge\tilde\lambda_{k-1}\ge 0$ are the eigenvalues of $V^TV/n$ and $\tilde u_i=(\tilde u_{i1},\dots,\tilde u_{in})^T$ is a unit eigenvector corresponding to $\tilde\lambda_i$ for each $i$. We assume $\tilde u_i^Tu_i\ge 0$ for $i=1,\dots,k-1$, without loss of generality.

    Lemma 1. When $k=2$, it holds that under Conditions 2–4

    $$\operatorname*{plim}_{d\to\infty}\frac{(n-1)S_D-\operatorname{tr}(\Sigma_1)P_n}{\Delta_1}=rr^T.$$

    Proof. As $\mu_2=0$, we can write that $x_j-\eta_1\mu_1=(x_j-\mu_i)+(-1)^{i+1}(1-\eta_i)\mu_1$ for $i=1,2;\ j=1,\dots,n$. From the fact that $\lambda_{\max}^{(i)}\le\operatorname{tr}(\Sigma_i^2)^{1/2}$, we have that $\operatorname{Var}\{(x_j-\mu_i)^T\mu_1\,|\,x_j\in\Pi_i\}=\mu_1^T\Sigma_i\mu_1\le\Delta_1\lambda_{\max}^{(i)}=o(\Delta_1^2)$ as $d\to\infty$ for $j=1,\dots,n;\ i=1,2$ under Condition 2. Also, we have that $\operatorname{Var}\{(x_j-\mu_i)^T(x_{j'}-\mu_{i'})\,|\,x_j\in\Pi_i,\ x_{j'}\in\Pi_{i'}\}=\operatorname{tr}(\Sigma_i\Sigma_{i'})\le\operatorname{tr}(\Sigma_i^2)^{1/2}\operatorname{tr}(\Sigma_{i'}^2)^{1/2}=o(\Delta_1^2)$ for all $j\neq j'$ and $i,i'=1,2$ under Condition 2. Then, using Chebyshev's inequality, for any $\tau>0$, under Condition 2, it holds that for all $j\neq j'$ and $i,i'=1,2$

    $$P\{|(x_j-\mu_i)^T(x_{j'}-\mu_{i'})/\Delta_1|>\tau\,|\,x_j\in\Pi_i,\ x_{j'}\in\Pi_{i'}\}=o(1)\ \text{ and }\ P\{|(x_j-\mu_i)^T\mu_1/\Delta_1|>\tau\,|\,x_j\in\Pi_i\}=o(1),\tag{A1}$$
    so that $(x_j-\mu_i)^T(x_{j'}-\mu_{i'})/\Delta_1=o_P(1)$ and $(x_j-\mu_i)^T\mu_1/\Delta_1=o_P(1)$ when $x_j\in\Pi_i$ and $x_{j'}\in\Pi_{i'}$ $(j\neq j')$. We note that $E(\|x_j-\mu_i\|^2\,|\,x_j\in\Pi_i)=\operatorname{tr}(\Sigma_i)$. Similar to (A1), under Condition 3, it holds that $\{\|x_j-\mu_i\|^2-\operatorname{tr}(\Sigma_i)\}/\Delta_1=o_P(1)$ when $x_j\in\Pi_i$ for $i=1,2;\ j=1,\dots,n$. By noting that $\{\operatorname{tr}(\Sigma_1)-\operatorname{tr}(\Sigma_2)\}/\Delta_1=o(1)$ under Condition 4, we have that
    $$\operatorname*{plim}_{d\to\infty}\frac{(X-\eta_1\mu_1 1_n^T)^T(X-\eta_1\mu_1 1_n^T)-\operatorname{tr}(\Sigma_1)I_n}{\Delta_1}=rr^T,$$
    under Conditions 2–4. By noting that $P_n(X-\eta_1\mu_1 1_n^T)^T(X-\eta_1\mu_1 1_n^T)P_n/(n-1)=S_D$ and $r^TP_n=r^T$ from $r^T1_n=0$, we conclude the result.

    Lemma 2. Let $\mu'_{i,i+1}=\mu_{i,i+1}/\Delta_{i,i+1}^{1/2}$ for $i=1,\dots,k-1$ and let $\Delta_{(i,j)}=\Delta_{j,j+1}/\Delta_{i,i+1}$ for $i,j=1,\dots,k-1\ (i<j)$. Under Conditions 1 and 5, it holds that as $d\to\infty$

    $$\frac{\lambda_i}{\Delta_{i,i+1}}=\frac{\varepsilon_i(1-\varepsilon_{(i)})}{1-\varepsilon_{(i-1)}}+o(1)\ \text{ and }\ h_i^T\mu'_{i,i+1}=1+o(1)\ \text{ for } i=1,\dots,k-1;$$
    $$h_i^T\mu'_{i-1,i}=-\frac{1-\varepsilon_{(i)}}{1-\varepsilon_{(i-1)}}\Delta_{(i-1,i)}^{1/2}\{1+o(1)\}\ \text{ for } i=2,\dots,k-1\ \text{ when } k\ge 3;\ \text{ and}$$
    $$h_j^T\mu'_{i,i+1}=o(\Delta_{(i,j)}^{1/2})\ \text{ for } i,j=1,\dots,k-1\ (i+1<j)\ \text{ when } k\ge 3.$$

    Proof. From the fact that $|\mu_i^T\mu_j|\le(\Delta_i\Delta_j)^{1/2}$, under Condition 5 it holds that as $d\to\infty$

    $$\frac{\Delta_{i,i+1}}{\Delta_i}=\frac{\Delta_i+\Delta_{i+1}+O\{(\Delta_i\Delta_{i+1})^{1/2}\}}{\Delta_i}\to 1\quad\text{for } i=1,\dots,k-2.$$
    Then, under Condition 5, it holds that
    $$\frac{\mu_{i,i+1}^T\mu_{j,j+1}}{(\Delta_{i,i+1}\Delta_{j,j+1})^{1/2}}=\frac{\mu_i^T\mu_j+O\{(\Delta_i\Delta_{j+1})^{1/2}+(\Delta_{i+1}\Delta_j)^{1/2}\}}{(\Delta_i\Delta_j)^{1/2}\{1+o(1)\}}=o(1),$$
    for $i,j=1,\dots,k-1\ (i<j)$. Hence, under Condition 5, we claim that
    $$\mu'^T_{i,i+1}\mu'_{j,j+1}\to 0\ \text{ and }\ \frac{\Delta_{j,j+1}}{\Delta_{i,i+1}}\to 0\ \text{ as } d\to\infty\ \text{ for } i,j=1,\dots,k-1\ (i<j).\tag{A2}$$

    Let $e_d\ (\in\mathbb{R}^d)$ be an arbitrary unit vector. From $e_d^T(\sum_{i=1}^{k}\varepsilon_i\Sigma_i)e_d\le\sum_{i=1}^{k}\lambda_{\max}^{(i)}$, it holds that

    $$\frac{e_d^T\Sigma e_d}{\Delta_{k-1}}=\frac{e_d^T\bigl(\sum_{i=1}^{k-1}\sum_{j=i+1}^{k}\varepsilon_i\varepsilon_j\mu_{i,j}\mu_{i,j}^T\bigr)e_d}{\Delta_{k-1}}+o(1),\tag{A3}$$
    under Condition 1. Note that $\mu_{i,j}=\sum_{m=i}^{j-1}\mu_{m,m+1}$ for $i,j=1,\dots,k\ (i<j)$. Thus, it holds that
    $$\sum_{i=1}^{k-1}\sum_{j=i+1}^{k}\varepsilon_i\varepsilon_j\mu_{i,j}\mu_{i,j}^T=\sum_{i=1}^{k-1}\varepsilon_{(i)}(1-\varepsilon_{(i)})\mu_{i,i+1}\mu_{i,i+1}^T+\sum_{i=1}^{k-2}\sum_{j=i+1}^{k-1}\varepsilon_{(i)}(1-\varepsilon_{(j)})(\mu_{i,i+1}\mu_{j,j+1}^T+\mu_{j,j+1}\mu_{i,i+1}^T).\tag{A4}$$

    From the facts that $\lambda_1=h_1^T\Sigma h_1=\max_{e_d}(e_d^T\Sigma e_d)$ and $\Delta_{k-1}=\Delta_{k-1,k}$, by combining Equation (A3) with Equations (A2) and (A4), we have that

    $$\frac{\lambda_1}{\Delta_{1,2}}=\max_{e_d}\,\varepsilon_{(1)}(1-\varepsilon_{(1)})(e_d^T\mu'_{1,2})^2+o(1)=\varepsilon_{(1)}(1-\varepsilon_{(1)})+o(1),$$
    under Conditions 1 and 5. Hence, by noting that $\Delta_{1,2}/\Delta_1=1+o(1)$ and $h_1^T\mu_1\ge 0$, it holds that $h_1^T\mu'_{1,2}=h_1^T\mu_1/\Delta_1^{1/2}+o(1)=1+o(1)$.

    Next, we consider $\lambda_2$ and $h_2$. From (A2), we note that $\mu'^T_{i,i+1}\mu'_{j,j+1}=o(1)$ and $\Delta_{(i,j)}=o(1)$ for $i,j=1,\dots,k-1\ (i<j)$ under Condition 5. Then, under Conditions 1 and 5, it holds that for $j\ge 2$

    $$0=\frac{h_1^T\Sigma h_j}{\Delta_{1,2}}=\varepsilon_{(1)}(1-\varepsilon_{(1)})\{1+o(1)\}\mu'^T_{1,2}h_j+\varepsilon_{(1)}(1-\varepsilon_{(2)})\mu'^T_{2,3}h_j\,\Delta_{(1,2)}^{1/2}+o(\Delta_{(1,2)}^{1/2}),$$
    from Equations (A3) and (A4) and $h_1^T\mu'_{2,3}=o(1)$, so that for $j\ge 2$
    $$h_j^T\mu'_{1,2}=-\{(1-\varepsilon_{(2)})/(1-\varepsilon_{(1)})\}\mu'^T_{2,3}h_j\,\Delta_{(1,2)}^{1/2}+o(\Delta_{(1,2)}^{1/2}).\tag{A5}$$

    By combining Equation (A3) with Equations (A4) and (A5), we have that

    $$\frac{\lambda_2}{\Delta_{2,3}}=\frac{h_2^T\Sigma h_2}{\Delta_{2,3}}=\frac{h_2^T\{\sum_{i=1}^{2}\varepsilon_{(i)}(1-\varepsilon_{(i)})\mu_{i,i+1}\mu_{i,i+1}^T+2\varepsilon_{(1)}(1-\varepsilon_{(2)})\mu_{1,2}\mu_{2,3}^T\}h_2}{\Delta_{2,3}}+o(1)$$
    $$=\varepsilon_{(2)}(1-\varepsilon_{(2)})(\mu'^T_{2,3}h_2)^2+\varepsilon_{(1)}(1-\varepsilon_{(1)})\frac{(\mu'^T_{1,2}h_2)^2}{\Delta_{(1,2)}}+2\varepsilon_{(1)}(1-\varepsilon_{(2)})\frac{(\mu'^T_{1,2}h_2)(\mu'^T_{2,3}h_2)}{\Delta_{(1,2)}^{1/2}}+o(1)$$
    $$=\varepsilon_{(2)}(1-\varepsilon_{(2)})-\frac{\varepsilon_{(1)}(1-\varepsilon_{(2)})^2}{1-\varepsilon_{(1)}}+o(1)=\frac{\varepsilon_2(1-\varepsilon_{(2)})}{1-\varepsilon_{(1)}}+o(1),\tag{A6}$$
    under Conditions 1 and 5. Hence, from the assumption that $h_2^T\mu_2\ge 0$, it holds that $h_2^T\mu'_{2,3}=h_2^T\mu_2/\Delta_2^{1/2}+o(1)=1+o(1)$.

    Next, we consider $\lambda_3$ and $h_3$. Note that $h_j^T\mu'_{2,3}=o(1)$ for $j\ge 3$ from $h_2^T\mu'_{2,3}=1+o(1)$. Then, under Conditions 1 and 5, we have that for $j\ge 3$

    $$0=\frac{h_1^T\Sigma h_j}{\Delta_{1,2}}=\varepsilon_{(1)}(1-\varepsilon_{(1)})\{1+o(1)\}\mu'^T_{1,2}h_j+\varepsilon_{(1)}(1-\varepsilon_{(2)})\{1+o(1)\}\mu'^T_{2,3}h_j\,\Delta_{(1,2)}^{1/2}+\varepsilon_{(1)}(1-\varepsilon_{(3)})\mu'^T_{3,4}h_j\,\Delta_{(1,3)}^{1/2}+o(\Delta_{(1,3)}^{1/2})\ \text{ and}\tag{A7}$$
    $$0=\frac{h_2^T\Sigma h_j}{\Delta_{2,3}}=\varepsilon_{(1)}(1-\varepsilon_{(1)})\frac{h_2^T\mu'_{1,2}\,\mu'^T_{1,2}h_j}{\Delta_{(1,2)}}+\varepsilon_{(1)}(1-\varepsilon_{(2)})\frac{h_2^T(\mu'_{1,2}\mu'^T_{2,3}+\mu'_{2,3}\mu'^T_{1,2})h_j}{\Delta_{(1,2)}^{1/2}}+\varepsilon_{(1)}(1-\varepsilon_{(3)})\,h_2^T\mu'_{1,2}\,\mu'^T_{3,4}h_j\,\frac{\Delta_{(2,3)}^{1/2}}{\Delta_{(1,2)}^{1/2}}$$
    $$\qquad+\varepsilon_{(2)}(1-\varepsilon_{(2)})\{1+o(1)\}\mu'^T_{2,3}h_j+\varepsilon_{(2)}(1-\varepsilon_{(3)})\mu'^T_{3,4}h_j\,\Delta_{(2,3)}^{1/2}+o(\Delta_{(2,3)}^{1/2})$$
    $$=\frac{\varepsilon_2(1-\varepsilon_{(2)})}{1-\varepsilon_{(1)}}\{1+o(1)\}\mu'^T_{2,3}h_j+\frac{\varepsilon_2(1-\varepsilon_{(3)})}{1-\varepsilon_{(1)}}\mu'^T_{3,4}h_j\,\Delta_{(2,3)}^{1/2}+o(\Delta_{(2,3)}^{1/2})+\mu'^T_{1,2}h_j\times o(\Delta_{(1,2)}^{-1/2}),\tag{A8}$$
    from Equations (A2) to (A5), $h_1^T\mu'_{2,3}=o(1)$, $h_1^T\mu'_{3,4}=o(1)$, and $h_2^T\mu'_{3,4}=o(1)$. Then, by combining Equations (A7) and (A8), under Conditions 1 and 5, it holds that for $j\ge 3$
    $$h_j^T\mu'_{1,2}=o(\Delta_{(1,3)}^{1/2})\ \text{ and }\ h_j^T\mu'_{2,3}=-\frac{1-\varepsilon_{(3)}}{1-\varepsilon_{(2)}}\mu'^T_{3,4}h_j\,\Delta_{(2,3)}^{1/2}+o(\Delta_{(2,3)}^{1/2}).\tag{A9}$$

    Similar to Equation (A6), by combining Equation (A3) with Equations (A4) and (A9), under Conditions 1 and 5, we have that

    $$\frac{\lambda_3}{\Delta_{3,4}}=\varepsilon_{(3)}(1-\varepsilon_{(3)})(\mu'^T_{3,4}h_3)^2+\varepsilon_{(2)}(1-\varepsilon_{(2)})\frac{(\mu'^T_{2,3}h_3)^2}{\Delta_{(2,3)}}+2\varepsilon_{(2)}(1-\varepsilon_{(3)})\frac{(\mu'^T_{2,3}h_3)(\mu'^T_{3,4}h_3)}{\Delta_{(2,3)}^{1/2}}+o(1)$$
    $$=\varepsilon_{(3)}(1-\varepsilon_{(3)})-\frac{\varepsilon_{(2)}(1-\varepsilon_{(3)})^2}{1-\varepsilon_{(2)}}+o(1)=\frac{\varepsilon_3(1-\varepsilon_{(3)})}{1-\varepsilon_{(2)}}+o(1),$$
    so that $h_3^T\mu'_{3,4}=1+o(1)$ from the assumption that $h_3^T\mu_3\ge 0$.

    In a way similar to $\lambda_3$ and $h_3$, as for $\lambda_i$ and $h_i$ $(4\le i\le k-1)$, we have that $\lambda_i/\Delta_{i,i+1}=\varepsilon_i(1-\varepsilon_{(i)})/(1-\varepsilon_{(i-1)})+o(1)$, $h_i^T\mu'_{i,i+1}=1+o(1)$, and $h_i^T\mu'_{i-1,i}=-\{(1-\varepsilon_{(i)})/(1-\varepsilon_{(i-1)})\}\Delta_{(i-1,i)}^{1/2}\{1+o(1)\}$, together with $h_j^T\mu'_{i,i+1}=o(\Delta_{(i,j)}^{1/2})$ for $i,j=1,\dots,k-1\ (i+1<j)$, under Conditions 1 and 5. This concludes the results.

    Lemma 3. Under Conditions 1 and 5, it holds that for $i=1,\dots,k-1$

    $$\lim_{d\to\infty}\frac{h_i^T\sum_{m=1}^{k}\varepsilon_m(\mu_{i'}-\mu_m)}{\lambda_i^{1/2}}=\begin{cases}0 & \text{when } i\ge 2 \text{ and } i'<i,\\ [(1-\varepsilon_{(i)})/\{\varepsilon_i(1-\varepsilon_{(i-1)})\}]^{1/2} & \text{when } i'=i,\\ -[\varepsilon_i/\{(1-\varepsilon_{(i)})(1-\varepsilon_{(i-1)})\}]^{1/2} & \text{when } i'>i.\end{cases}$$

    Proof. We write that

    $$\sum_{m=1}^{k}\varepsilon_m(\mu_1-\mu_m)=\sum_{m=1}^{k-1}(1-\varepsilon_{(m)})\mu_{m,m+1},\quad \sum_{m=1}^{k}\varepsilon_m(\mu_k-\mu_m)=-\sum_{m=1}^{k-1}\varepsilon_{(m)}\mu_{m,m+1},\ \text{ and}$$
    $$\sum_{m=1}^{k}\varepsilon_m(\mu_i-\mu_m)=\sum_{m=i}^{k-1}(1-\varepsilon_{(m)})\mu_{m,m+1}-\sum_{m=1}^{i-1}\varepsilon_{(m)}\mu_{m,m+1},\tag{A10}$$
    for $i=2,\dots,k-1$. Using Lemma 2, under Conditions 1 and 5, we have that as $d\to\infty$
    $$\frac{h_1^T\sum_{m=1}^{k}\varepsilon_m(\mu_1-\mu_m)}{\Delta_{1,2}^{1/2}}=\frac{(1-\varepsilon_{(1)})h_1^T\mu_{1,2}}{\Delta_{1,2}^{1/2}}+o(1)=1-\varepsilon_{(1)}+o(1)\ \text{ and}$$
    $$\frac{h_1^T\sum_{m=1}^{k}\varepsilon_m(\mu_i-\mu_m)}{\Delta_{1,2}^{1/2}}=-\frac{\varepsilon_{(1)}h_1^T\mu_{1,2}}{\Delta_{1,2}^{1/2}}+o(1)=-\varepsilon_{(1)}+o(1)\ \text{ for } i=2,\dots,k,$$
    from Equation (A10). Also, using Lemma 2, under Conditions 1 and 5, we have that for $i=2,\dots,k-1;\ i'=i+1,\dots,k;\ i''=1,\dots,i-1$
    $$\frac{h_i^T\sum_{m=1}^{k}\varepsilon_m(\mu_i-\mu_m)}{\Delta_{i,i+1}^{1/2}}=\frac{h_i^T\{(1-\varepsilon_{(i)})\mu_{i,i+1}-\varepsilon_{(i-1)}\mu_{i-1,i}\}}{\Delta_{i,i+1}^{1/2}}+o(1)=(1-\varepsilon_{(i)})+\frac{\varepsilon_{(i-1)}(1-\varepsilon_{(i)})}{1-\varepsilon_{(i-1)}}+o(1)=\frac{1-\varepsilon_{(i)}}{1-\varepsilon_{(i-1)}}+o(1),$$
    $$\frac{h_i^T\sum_{m=1}^{k}\varepsilon_m(\mu_{i'}-\mu_m)}{\Delta_{i,i+1}^{1/2}}=-\varepsilon_{(i)}+\frac{\varepsilon_{(i-1)}(1-\varepsilon_{(i)})}{1-\varepsilon_{(i-1)}}+o(1)=-\frac{\varepsilon_i}{1-\varepsilon_{(i-1)}}+o(1),\ \text{ and }\ \frac{h_i^T\sum_{m=1}^{k}\varepsilon_m(\mu_{i''}-\mu_m)}{\Delta_{i,i+1}^{1/2}}=o(1).$$
    Thus, from Lemma 2, we can conclude the results.

    Lemma 4. Assume Conditions 2–4 and 6. Then, under the condition

    $$0<\operatorname*{plim}_{d\to\infty}\frac{\tilde\lambda_i}{\Delta_{i,i+1}}<\infty\quad\text{for } i=1,\dots,k-1,\tag{A11}$$
    it holds that
    $$\operatorname*{plim}_{d\to\infty}\hat u_i^T\tilde u_i=1\quad\text{for } \hat u_i^T\tilde u_i\ge 0,\ i=1,\dots,k-1.$$

    Proof. From the fact that $\lambda_{\max}^{(i)}\le\operatorname{tr}(\Sigma_i^2)^{1/2}$, we have that $\operatorname{Var}\{\mu_{k-1}^T(x_j-\mu_i)\,|\,x_j\in\Pi_i\}=\mu_{k-1}^T\Sigma_i\mu_{k-1}\le\lambda_{\max}^{(i)}\Delta_{k-1}=o(\Delta_{k-1}^2)$ as $d\to\infty$ for $i=1,\dots,k;\ j=1,\dots,n$ under Condition 2. Then, we have that for $i=1,\dots,k-1;\ i'=1,\dots,k;\ j=1,\dots,n$

    $$\operatorname{Var}\{\mu_{i,i+1}^T(x_j-\mu_{i'})\,|\,x_j\in\Pi_{i'}\}=\mu_{i,i+1}^T\Sigma_{i'}\mu_{i,i+1}=O(\mu_i^T\Sigma_{i'}\mu_i+\mu_{i+1}^T\Sigma_{i'}\mu_{i+1})=o(\Delta_{k-1}^2),$$
    under Conditions 2 and 6. Then, similar to Equation (A1), under Conditions 2 and 6, it holds that $\mu_{i,i+1}^T(x_j-\mu_{i'})/\Delta_{k-1}=o_P(1)$ when $x_j\in\Pi_{i'}$ for $i=1,\dots,k-1;\ i'=1,\dots,k;\ j=1,\dots,n$. In addition, under Conditions 2 and 3, we can claim that $(x_j-\mu_i)^T(x_{j'}-\mu_{i'})/\Delta_{k-1}=o_P(1)$ and $\|x_j-\mu_i\|^2/\Delta_{k-1}=\operatorname{tr}(\Sigma_i)/\Delta_{k-1}+o_P(1)$ when $x_j\in\Pi_i$ and $x_{j'}\in\Pi_{i'}$ for all $j\neq j'$ and $i,i'=1,\dots,k$. Here, we write that $x_j-\mu_\eta=(x_j-\mu_i)+\nu_i$ for $i=1,\dots,k;\ j=1,\dots,n$, where $\mu_\eta=\sum_{i=1}^{k}\eta_i\mu_i$. Then, by noting Equation (A10) with $\varepsilon_i=\eta_i$ and $\varepsilon_{(i)}=\eta_{(i)}$, $i=1,\dots,k$, under Conditions 2–4 and 6, we have that
    $$\frac{\|x_j-\mu_\eta\|^2}{\Delta_{k-1}}=\frac{\|\nu_i\|^2+\operatorname{tr}(\Sigma_i)}{\Delta_{k-1}}+o_P(1)\ \text{ and }\ \frac{(x_j-\mu_\eta)^T(x_{j'}-\mu_\eta)}{\Delta_{k-1}}=\frac{\nu_i^T\nu_{i'}}{\Delta_{k-1}}+o_P(1),$$
    when $x_j\in\Pi_i$ and $x_{j'}\in\Pi_{i'}$ for all $j\neq j'$ and $i,i'=1,\dots,k$. Thus, under Conditions 2–4 and 6, it holds that
    $$\operatorname*{plim}_{d\to\infty}\frac{(X-\mu_\eta 1_n^T)^T(X-\mu_\eta 1_n^T)-\operatorname{tr}(\Sigma_1)I_n-V^TV}{\Delta_{k-1}}=O.\tag{A12}$$

    Let $e_n\ (\in\mathbb{R}^n)$ be an arbitrary random unit vector such that $e_n^T1_n=0$. We note that $P_n(X-\mu_\eta 1_n^T)^T(X-\mu_\eta 1_n^T)P_n/(n-1)=S_D$. Then, by noting $e_n^TP_n=e_n^T$, under Equation (A11) and Conditions 2–4 and 6, we have that

    $$e_n^T\frac{(n-1)S_D-\operatorname{tr}(\Sigma_1)P_n}{\Delta_{k-1}}e_n=\frac{\sum_{i=1}^{n-1}(n-1)\hat\lambda_i\,e_n^T\hat u_i\hat u_i^Te_n-\operatorname{tr}(\Sigma_1)}{\Delta_{k-1}}=e_n^T\frac{(X-\mu_\eta 1_n^T)^T(X-\mu_\eta 1_n^T)-\operatorname{tr}(\Sigma_1)I_n}{\Delta_{k-1}}e_n=e_n^T\frac{V^TV}{\Delta_{k-1}}e_n+o_P(1)=\sum_{i=1}^{k-1}\frac{n\tilde\lambda_i\,e_n^T\tilde u_i\tilde u_i^Te_n}{\Delta_{k-1}}+o_P(1),\tag{A13}$$
    from Equation (A12). We note that $\tilde u_i^T1_n=0$ for $i=1,\dots,k-1$ in the case of $\mathrm{rank}(V)=k-1$. Also, from Equation (A2), we note that the $\tilde\lambda_i$, $i=1,\dots,k-1$, are distinct under Condition 5 and Equation (A11) for sufficiently large $d$. Thus, from Equation (A13), if $\hat u_i^T\tilde u_i\ge 0$ for $i=1,\dots,k-1$, we have that $\hat u_i^T\tilde u_i=1+o_P(1)$ for $i=1,\dots,k-1$. This concludes the result.

    Lemma 5. Assume Condition 5. For $n_i>0$, $i=1,\dots,k$, it holds that

    $$\operatorname*{plim}_{d\to\infty}\frac{\tilde\lambda_i}{\Delta_{i,i+1}}=\frac{\eta_i(1-\eta_{(i)})}{1-\eta_{(i-1)}}\ \text{ and }\ \operatorname*{plim}_{d\to\infty}\tilde u_i^Tu_i=1\quad\text{for } i=1,\dots,k-1.$$

    Proof. By noting Equation (A10) with $\varepsilon_i=\eta_i$ and $\varepsilon_{(i)}=\eta_{(i)}$, $i=1,\dots,k$, we can write that

    $$\frac{VV^T}{n}=\sum_{i=1}^{k-1}\eta_{(i)}(1-\eta_{(i)})\mu_{i,i+1}\mu_{i,i+1}^T+\sum_{i=1}^{k-2}\sum_{j=i+1}^{k-1}\eta_{(i)}(1-\eta_{(j)})(\mu_{i,i+1}\mu_{j,j+1}^T+\mu_{j,j+1}\mu_{i,i+1}^T).\tag{A14}$$

    We have the eigendecomposition of $VV^T/n$ by $VV^T/n=\sum_{i=1}^{k-1}\tilde\lambda_i\tilde h_i\tilde h_i^T$, where $\tilde h_i$ is a unit eigenvector corresponding to $\tilde\lambda_i$ for each $i$. We note that $\eta_i>0$, $i=1,\dots,k$, when $n_i>0$, $i=1,\dots,k$. Then, by noting Lemmas 2 and 3 and the fact that Equation (A14) is the same as Equation (A4) with $\varepsilon_{(i)}=\eta_{(i)}$, $i=1,\dots,k-1$, under Condition 5, we have that for $i=1,\dots,k-1$

    $$\operatorname*{plim}_{d\to\infty}\frac{\tilde\lambda_i}{\Delta_{i,i+1}}=\frac{\eta_i(1-\eta_{(i)})}{1-\eta_{(i-1)}}\ \text{ and }\ \operatorname*{plim}_{d\to\infty}\frac{\tilde h_i^T\nu_{(j)}}{\tilde\lambda_i^{1/2}}=u_{ij}\,n^{1/2},$$
    if $\tilde h_i^T\mu_i\ge 0$. We note that $\tilde u_{ij}=\tilde h_i^T\nu_{(j)}/(n\tilde\lambda_i)^{1/2}$ from the fact that $\tilde u_i=V^T\tilde h_i/(n\tilde\lambda_i)^{1/2}$ for $i=1,\dots,k-1$. Hence, we can conclude the result.

    Appendix B: Additional Proposition

    When Condition 5 is not met, Theorem 3 does not hold. However, in Figure 5c, we could find three separate clusters of $\Pi_i$, $i=1,2,3$, even though Condition 5 is not met. To explain this phenomenon, we give the following result.

    Proposition 3. Assume Conditions 2–4 and 6. Then, under the condition

    $$0<\operatorname*{plim}_{d\to\infty}\frac{\tilde\lambda_{k-1}}{\Delta_{k-1}}<\infty,$$
    it holds that for $i=1,\dots,k-1$, as $d\to\infty$,
    $$\hat u_i^T\frac{(n-1)S_D}{\Delta_{k-1}}\hat u_i=\frac{\operatorname{tr}(\Sigma_1)}{\Delta_{k-1}}+\hat u_i^T\frac{V^TV}{\Delta_{k-1}}\hat u_i+o_P(1).$$

    Proof. By noting that $\hat u_i^T1_n=0$ for $i=1,\dots,k-1$ when $\mathrm{rank}(S_D)\ge k-1$, from Equation (A13), we can conclude the result.

    By noting that $\hat u_i=(\hat z_{i1},\dots,\hat z_{in})^T/n^{1/2}$, from Proposition 3, for sufficiently large $d$, the estimated PC scores depend only on the structure of $V^TV$ even when Condition 5 is not met. Then, as $\mathrm{rank}(V^TV)=k-1$, there must be $k$ separate clusters of $\Pi_i$, $i=1,\dots,k$, in the first $k-1$ PC spaces, as seen in Figure 5c.
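As a toy numerical check of this mechanism (our own example, not from the paper), one can build the matrix $V$ for a small $k=3$ case and confirm that $V^TV$ has rank $k-1$ and that the coordinates of the observations in its first $k-1$ eigenvectors take exactly $k$ distinct values, one per class.

```python
import numpy as np

# toy k = 3 mixture: class means and sample sizes are our own example values
k, d = 3, 100
n_i = np.array([5, 4, 3]); n = n_i.sum()
eta = n_i / n
mu = np.zeros((k, d))
mu[0, 0], mu[1, 1], mu[2, 2] = 6.0, 3.0, 1.0    # three distinct mean vectors

# nu_i = sum_m eta_m (mu_i - mu_m); V has column nu_i for each x_j in Pi_i
mu_bar = eta @ mu
nu = mu - mu_bar
labels = np.repeat(np.arange(k), n_i)
V = nu[labels].T                                 # d x n

G = V.T @ V / n                                  # n x n dual Gram matrix
lam = np.sort(np.linalg.eigvalsh(G))[::-1]
rank = int(np.sum(lam > 1e-8))                   # should equal k - 1

# coordinates of each observation in the first k-1 eigenvectors of G
U = np.linalg.eigh(G)[1][:, ::-1][:, :k - 1]
patterns = {tuple(np.round(row, 6)) for row in U}
```

The eigenvector coordinates are constant within each class, so the $n$ observations collapse onto $k$ distinct points in the first $k-1$ PC spaces, which is the clustering structure visible in Figure 5c.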

    Appendix C: Proofs of Theorems, Corollaries, and Propositions

    Proofs of Theorem 1 and Corollary 1. We note that $\operatorname{tr}(\Sigma_1)/\operatorname{tr}(\Sigma)\to 1-\varepsilon_1\varepsilon_2 c$ as $d\to\infty$ under Condition 4 and $\Delta_1/\operatorname{tr}(\Sigma)\to c\ (>0)$ as $d\to\infty$. Then, using Lemma 1, we can conclude the result of Theorem 1.

    Next, we consider the proof of Corollary 1. From the fact that $1_n^TS_D1_n=0$, it holds that $\hat u_1^T1_n=0$ when $S_D\neq O$, so that $P_n\hat u_1=\hat u_1$. Also, note that $\|r\|^2=n\eta_1\eta_2$ and $r^T1_n=0$. Then, using Lemma 1, under Conditions 2–4, it holds that $\hat u_1^T\{(n-1)S_D-\operatorname{tr}(\Sigma_1)P_n\}\hat u_1/\Delta_1=n\eta_1\eta_2+o_P(1)$. Hence, from Equation (3) and the assumption that $\hat u_1^Tz_1\ge 0$, we have that $\hat u_1^T\{r/(n\eta_1\eta_2)^{1/2}\}=1+o_P(1)$ for $n_i>0$, $i=1,2$. In view of the elements of $r$, we can conclude the result of Corollary 1.

    Proof of Proposition 1. We assume $x_j\in\Pi_1$ for $j=1,\dots,n_1$, $x_j\in\Pi_2$ for $j=n_1+1,\dots,n$, and $\operatorname{tr}(\Sigma_1)\ge\operatorname{tr}(\Sigma_2)$ without loss of generality. Similar to the proof of Lemma 1, under the assumptions of Proposition 1, we have that

    $$\operatorname*{plim}_{d\to\infty}\frac{(n-1)S_D-\operatorname{tr}(\Sigma_2)P_n}{\Delta}=P_nD_nP_n,$$
    where $D_n=\operatorname{diag}(1,\dots,1,0,\dots,0)$ whose first $n_1$ diagonal elements are 1. Note that the first $n_1-1$ eigenvalues of $P_nD_nP_n$ are multiple. Also, note that the eigenspace for the multiple eigenvalue consists of the $n_1-1$ vectors
    $$(1,-1,0,\dots,0)^T,\ (1,0,-1,0,\dots,0)^T,\ \dots,\ (1,0,\dots,0,-1,0,\dots,0)^T.$$
    Thus, by noting that $\hat u_i^T1_n=0$ for $i=1,\dots,n_1-1$, we can conclude the result.

    Proofs of Theorem 2 and Corollary 2. We write that $x_j-\mu=(x_j-\mu_i)+\sum_{m=1}^{k}\varepsilon_m(\mu_i-\mu_m)$ for $j=1,\dots,n;\ i=1,\dots,k$. We note that $\operatorname{Var}\{e_d^T(x_j-\mu_i)/\Delta_{k-1}^{1/2}\,|\,x_j\in\Pi_i\}=e_d^T\Sigma_ie_d/\Delta_{k-1}\le\lambda_{\max}^{(i)}/\Delta_{k-1}=o(1)$ as $d\to\infty$ under Condition 1 for $j=1,\dots,n;\ i=1,\dots,k$, where $e_d\ (\in\mathbb{R}^d)$ is an arbitrary unit vector. Then, under Condition 1, when $x_j\in\Pi_i$, it holds that

    $$\frac{e_d^T(x_j-\mu)}{\Delta_{k-1}^{1/2}}=\frac{e_d^T\{\sum_{m=1}^{k}\varepsilon_m(\mu_i-\mu_m)\}}{\Delta_{k-1}^{1/2}}+o_P(1).$$
    Then, using Lemmas 2 and 3, we can conclude the result of Theorem 2.

    For the proof of Corollary 2, by noting that $\Delta_{i,i+1}/\Delta_i=1+o(1)$ and $h_i^T\mu_{i,i+1}/\Delta_{i,i+1}^{1/2}=h_i^T\mu_i/\Delta_i^{1/2}+o(1)$ for $i=1,\dots,k-1$ under Condition 5, from Lemma 2, the results are obtained straightforwardly.

    Proof of Theorem 3. By combining Lemmas 4 and 5, from Theorem 2 and the assumption that $\hat u_i^Tz_i\ge 0$ for all $i$, the result is obtained straightforwardly.

    Proof of Proposition 2. Let $\Sigma_{(\star)}=\varepsilon_1\Sigma_1+\varepsilon_2\Sigma_2$. Then, we define the eigendecomposition of $\Sigma_{(\star)}$ by $\Sigma_{(\star)}=\sum_{i=1}^{d}\lambda_{i(\star)}h_{i(\star)}h_{i(\star)}^T$, where $\lambda_{1(\star)}\ge\cdots\ge\lambda_{d(\star)}\ge 0$ are the eigenvalues of $\Sigma_{(\star)}$ and $h_{i(\star)}$ is a unit eigenvector corresponding to $\lambda_{i(\star)}$ for each $i$. Let $\lambda_\star=\varepsilon_1\varepsilon_2\Delta_1$ and $\mu'_1=\mu_1/\Delta_1^{1/2}$. Then, from $\Sigma=\lambda_\star\mu'_1\mu'^T_1+\Sigma_{(\star)}$, under $\max_{i=1,2}(\mu'^T_1\Sigma_i\mu'_1)/\Delta_1\to 0$ as $d\to\infty$, it holds that $\mu'^T_1\Sigma\mu'_1/\lambda_\star\to 1$, so that

    $$\frac{\sum_{i=1}^{d}\lambda_{i(\star)}(h_{i(\star)}^T\mu'_1)^2}{\lambda_\star}=o(1).\tag{C1}$$

    Let $\kappa(i)=\lambda_{i(\star)}-\lambda_\star$ for $i=1,\dots,d$. For sufficiently large $d$, when $\kappa(1)>0$, there exists some positive integer $i_{\star\star}$ such that

    $$i_{\star\star}=\max\{i\,|\,\kappa(i)>0\ \text{for } i=1,\dots,d\}.$$
    Then, from Equation (C1), we have that $\sum_{i=1}^{i_{\star\star}}(h_{i(\star)}^T\mu'_1)^2=o(1)$, so that $\lambda_{i_\star}/\lambda_\star=1+o(1)$ with $i_\star=i_{\star\star}+1$. When $\kappa(1)\le 0$ for sufficiently large $d$, it holds that $\lambda_{i_\star}/\lambda_\star=1+o(1)$ with $i_\star=1$. In addition, under $\liminf_{d\to\infty}|\lambda_{i_\star}/\lambda_i-1|>0$ for $i=1,\dots,d\ (i\neq i_\star)$, it holds that $h_{i_\star}^T\mu'_1=1+o(1)$ from $h_{i_\star}^T\mu_1\ge 0$. Then, from the fact that $h_{i_\star}^T\Sigma_ih_{i_\star}/\lambda_{i_\star}\to 0$ as $d\to\infty$ for $i=1,2$, in a way similar to Equation (A1), we have that $s_{i_\star j}/\lambda_{i_\star}^{1/2}=h_{i_\star}^T(x_j-\mu)/\lambda_{i_\star}^{1/2}=h_{i_\star}^T(\mu_i-\mu)/\lambda_{i_\star}^{1/2}+o_P(1)$ when $x_j\in\Pi_i$ for $j=1,\dots,n;\ i=1,2$. We can conclude the results.
