Estimating the number of signals using principal component analysis

In this work, we develop inferential tools for determining the correct number of principal components under a general noisy latent variable model, which includes, for example, the noisy independent component model as a special case. The problem is approached via hypothesis testing, and we provide both a large-sample test and several resampling-based alternatives. Simulations and an application to sound data show that both types of approaches hold their nominal levels and have good power.

also with a noisy model using simulations). Similar studies include Luo and Li (2016) and Nordhausen, Oja, Tyler, and Virta (2017), which provide tools for testing the number of non-Gaussian components in an ICA model with internal noise. These techniques were extended in Matilainen, Nordhausen, and Virta (2018), Nordhausen and Virta (2018), and Virta and Nordhausen (2019) to the second-order source separation model to test for white noise.

NOISY ICA MODEL
We assume that the observed random vector x ∈ R^p obeys the noisy model

x = Az + ε, (1)

where A is an unknown full rank p × d matrix, z ∈ R^d is a latent signal vector with finite fourth moments, d < p, and ε ∈ R^p is a noise vector that has finite fourth moments and is independent of z. The covariance matrix of z is confounded with the matrix A, and without loss of generality, we fix Cov(z) = I_d. To enable the derivation of our null distributions, we make the standard assumption that the noise vector is spherical (Fang, Kotz, & Ng, 1990) around the origin, so that Cov(ε) = σ²I_p for some σ² > 0. In the context of ICA, the vector z could be assumed to have independent components (or subvectors), but as no such assumptions are required in the PCA step, we refrain from making them here. Similarly, to enable the subsequent ICA step, the noise most likely has to be Gaussian to be annihilated by the standard cumulant-based ICA methods; but again, this assumption is not needed in the PCA step, and the following results are valid for any spherical noise. The noisy ICA model is discussed, for example, in Cichocki and Amari (2002) and Comon and Jutten (2010), where also many references are given.
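As a concrete illustration, a sample from model (1) can be generated as follows. This is a minimal sketch in Python/NumPy; the dimensions, noise level, and signal distributions are our own choices, mirroring the simulation design used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

p, d, n = 6, 3, 1000
A = rng.normal(size=(p, d))   # unknown full-rank p x d mixing matrix
sigma = 0.5                   # noise standard deviation, Cov(eps) = sigma^2 I_p

# Latent signals with Cov(z) = I_d: standardized logistic, t5, and uniform
z = np.column_stack([
    rng.logistic(size=n) / (np.pi / np.sqrt(3)),    # logistic, variance pi^2/3
    rng.standard_t(df=5, size=n) / np.sqrt(5 / 3),  # t5, variance 5/3
    rng.uniform(-np.sqrt(3), np.sqrt(3), size=n),   # uniform, variance 1
])
eps = sigma * rng.normal(size=(n, p))  # spherical (Gaussian) noise
x = z @ A.T + eps                      # observations from model (1)
```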
Our model can also be seen as a generalization of a factor analysis model or the probabilistic PCA model (Tipping & Bishop, 1999). In both of these models, the random vectors z and ε are assumed to be Gaussian, and the signal dimension can be estimated, for example, using information criteria-based techniques; see Comon and Golub (1990), Liavas, Regalia, and Delmas (1999), Minka (2001), Cichocki and Amari (2002), and Comon and Jutten (2010) and the references therein. A quite different model is considered in Choi, Taylor, and Tibshirani (2017), where Az is considered deterministic and only the Gaussian noise is random. Karhunen, Cichocki, Kasprzak, and Pajunen (1997) argued, in the context of noisy blind source separation, that due to a lack of more general methods, approaches that assume Gaussian signals and noise could prove useful, as the observed mixture x is heuristically more Gaussian than z itself. In this paper, however, even though we have the noisy ICA model in mind, the presented methods hold for any general random signal vector z and spherical noise ε. The only further assumptions we need are the moment conditions given after Equation 1.
Using the singular value decomposition A = UΛV⊤, with U ∈ R^{p×d}, Λ = diag(λ_1, …, λ_d) ∈ R^{d×d}, and V ∈ R^{d×d}, it is easy to see that the singular values of the unknown parameter A are identifiable, as are its left singular vectors, possibly up to an orthogonal transformation. Namely, assume that a random vector has two representations of the form (1), that is, Az + ε = A_0 z_0 + ε_0. Taking covariance matrices, we obtain

Cov(x) = UΛ²U⊤ + σ²I_p = U_0 Λ_0² U_0⊤ + σ_0² I_p. (2)

As eigenvalues of a matrix are unique, we must have σ_0² = σ² and Λ_0 = Λ. Moreover, as eigenspaces related to distinct eigenvalues are unique, we must have U_0 = UW for some block-diagonal orthogonal matrix W ∈ R^{d×d}.
Our main objective is to estimate the dimension d, given an independent and identically distributed sample from the model. After obtaining the estimate, we can compute the eigenvectors UW and project the observations to obtain a reduced d-dimensional vector in which no information on the signal z_i has been lost (Fang et al., 1990). In the following, we propose two different methods, asymptotic and bootstrap based, for estimating the dimension. Both are based on testing the set of null hypotheses H_0k : d = k, for k = 0, …, p − 1. Denoting in the following the covariance matrix by Σ, the form (2) reveals that the eigenvalues of Σ(x) are (λ_1² + σ², …, λ_d² + σ², σ², …, σ²), and estimating d is thus equivalent to estimating the index after which the eigenvalues stay constant. Thus, a logical choice for the test statistic t_k for H_0k is the variance t_k = s²_k(Σ̂(x_i)) of the smallest r = p − k eigenvalues of the sample covariance matrix Σ̂(x_i). Under the model, x_i can alternatively be written as x_i = (U | R)((I_d | 0)⊤ΛV⊤z_i + ε*_i), where R ∈ R^{p×r} is an arbitrary matrix with orthonormal columns such that (U | R) is a full p × p orthogonal matrix and ε*_i = (U | R)⊤ε_i ∼ ε_i. Consequently, the eigenvalues of Σ̂(x_i) and Σ̂((I_d | 0)⊤ΛV⊤z_i + ε*_i) are equal, and we may in Section 3 assume without loss of generality that U = (I_d | 0)⊤. The next two sections propose different strategies for approximating the null distribution of t_k.
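Computing the statistic t_k is straightforward; the following sketch (the helper name t_k and the toy data are ours) shows how it separates k below the true dimension from k at the true dimension:

```python
import numpy as np

rng = np.random.default_rng(1)
p, d, n = 6, 3, 5000
A = rng.normal(size=(p, d))
x = rng.normal(size=(n, d)) @ A.T + 0.5 * rng.normal(size=(n, p))

def t_k(x, k):
    # Variance of the r = p - k smallest eigenvalues of the sample covariance
    evals = np.linalg.eigvalsh(np.cov(x, rowvar=False))  # ascending order
    return evals[: x.shape[1] - k].var()

# t_k drops sharply once k reaches the true dimension d = 3
print([round(t_k(x, k), 5) for k in range(p)])
```

For k ≥ d the tail eigenvalues all estimate the common noise variance, so t_k is close to zero; for k < d the tail still contains at least one signal eigenvalue and t_k stays bounded away from zero.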
Remark 1. A standard way of "robustifying" methods to make them more tolerant to outliers is to replace Σ with a robust scatter matrix (Nordhausen & Tyler, 2015). However, our proposed method relies on the variance decomposition (2), which is only satisfied by scatter matrices that are additive. Virta (2018) showed that Σ is, under natural regularity conditions, the only additive scatter matrix, meaning that the use of other scatter matrices cannot be justified using our current approach.

ASYMPTOTIC NULL DISTRIBUTION
The development very closely mimics that of Nordhausen et al. (2016). By Eaton and Tyler (1991), we have under H_0k that n · s²_k(Σ̂(x_i)) = n · s²(Σ̂_k(x_i)) + o_p(n^{−1/2}), where s²(Σ̂_k(x_i)) is the variance of the eigenvalues of Σ̂_k(x_i), the lower right r × r block of Σ̂(x_i). This block equals Σ̂(ε_ki), where ε_ki contains the last r components of the noise vectors ε*_i. By the central limit theorem, √n · vec(Σ̂(ε_ki) − σ²I_r) has a limiting multivariate normal distribution, and, following the proof of Theorem 1 in Nordhausen et al. (2016), we have under H_0k that

n · r · s²_k(Σ̂(x_i)) / (2σ⁴τ) → χ² with (r − 1)(r + 2)/2 degrees of freedom,

for a specific constant τ depending on the noise distribution. The limiting distribution stays the same if σ² and τ are replaced by their consistent estimates. For σ², such an estimate is given by the mean of the r smallest eigenvalues of Σ̂(x_i). As in Nordhausen et al. (2016), an estimate τ̂ for τ is computed from the fourth moments of the vectors ε̂_ki = R̂⊤x_i, where R̂ contains the last r eigenvectors of Σ̂(x_i). If the assumption of normally distributed noise is made, then the population value τ = 1 can be substituted instead.
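Under Gaussian noise (τ = 1), the test reduces to a classical sphericity-type test applied to the r smallest eigenvalues. The sketch below implements this special case only; the scaling by 2σ̂⁴ and the degrees of freedom r(r + 1)/2 − 1 follow John's sphericity test and are our assumption here, to be checked against the limiting result above.

```python
import numpy as np
from scipy.stats import chi2

def asymp_pvalue(x, k):
    # Test H_0k assuming Gaussian noise, i.e. tau = 1 (our simplification)
    n, p = x.shape
    r = p - k
    evals = np.linalg.eigvalsh(np.cov(x, rowvar=False))[:r]  # r smallest
    s2 = evals.var()             # the statistic t_k
    sigma2 = evals.mean()        # estimate of the noise variance
    stat = n * r * s2 / (2 * sigma2 ** 2)
    return chi2.sf(stat, r * (r + 1) // 2 - 1)

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 3))
x = rng.normal(size=(2000, 3)) @ A.T + 0.5 * rng.normal(size=(2000, 6))
# H_02 (k below the true dimension d = 3) should be clearly rejected
print(asymp_pvalue(x, 2), asymp_pvalue(x, 3))
```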
As in Nordhausen et al. (2016) and Virta and Nordhausen (2019), asymptotic tests for the different null hypotheses can be combined to yield an estimator for the signal dimension d. That is, for all k = 0, …, p − 1, let c_{k,n} be a deterministic sequence such that c_{k,n} → ∞ and c_{k,n} = o(n).
Then the least value of k for which the test statistic is asymptotically bounded is a consistent estimate of d,

d̂ = min{k ∈ {0, …, p − 1} : n · t_k ≤ c_{k,n}}.

Despite the guarantee of consistency, this construction is difficult to use in practice, as the sequences c_{k,n} would need to be chosen very carefully to obtain good results also for finite samples. As such, we advocate carrying out the hypothesis testing in practice (whether asymptotic or the bootstrap-based testing introduced in the next section) by chaining together tests for different values of k and taking as the estimate, for example, the smallest value of k for which H_0k is not rejected at some particular level of significance.
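The chaining strategy can be sketched as follows, using the Gaussian-noise (τ = 1) version of the asymptotic test as the building block; the stopping rule is the "first non-rejected k" described above, and all names are ours.

```python
import numpy as np
from scipy.stats import chi2

def estimate_d(x, alpha=0.05):
    # Return the smallest k for which H_0k is not rejected at level alpha
    n, p = x.shape
    evals = np.linalg.eigvalsh(np.cov(x, rowvar=False))  # ascending
    for k in range(p - 1):
        r = p - k
        tail = evals[:r]
        stat = n * r * tail.var() / (2 * tail.mean() ** 2)
        if chi2.sf(stat, r * (r + 1) // 2 - 1) >= alpha:
            return k
    return p - 1  # every test rejected: no equal-eigenvalue tail found

rng = np.random.default_rng(4)
A = rng.normal(size=(6, 3))
x = rng.normal(size=(2000, 3)) @ A.T + 0.5 * rng.normal(size=(2000, 6))
d_hat = estimate_d(x)
```

Note that the chained tests are not independent, so the procedure controls the level only marginally at each k; this is the same trade-off accepted in the divide-and-conquer strategies cited above.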

BOOTSTRAP NULL DISTRIBUTION
The bootstrap approach requires drawing samples that mimic the distribution of the data while simultaneously satisfying the null hypothesis, even if the latter does not hold for the original data.
If H_0k holds, the eigenvectors of Σ(x) are given as the columns of H = (UW | R), where W ∈ R^{k×k} is an orthogonal matrix and R ∈ R^{p×r} has orthonormal columns and satisfies R⊤U = 0. Consequently, R⊤x gives us access to the last r (rotated) noise components. Thus, regardless of whether H_0k holds, replacing the values R̂⊤x_i with a sample from a spherical distribution lets us force the data distribution to satisfy H_0k.
In Algorithm 1, we propose two different ways of sampling from a spherical distribution: (a) replacing the vectors R̂⊤x_i with independent and identically distributed Gaussian vectors, which can be seen as parametric bootstrapping; and (b) spherifying the original R̂⊤x_i by random orthogonal transformations (which does not change the distribution of the noise process under the null). As randomly rotating all observations is computationally expensive and as permutation matrices form a subset of the orthogonal matrices, we also consider, as a third strategy, whether it is enough to permute the estimated noise components within each observation. The different bootstrapping approaches are detailed in Algorithm 1.
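The three resampling strategies for the estimated noise part R̂⊤x_i can be sketched as follows. The function and variable names are ours, and this is an illustration of the three sampling ideas rather than Algorithm 1 verbatim.

```python
import numpy as np

rng = np.random.default_rng(5)

def bootstrap_noise(noise, method):
    # Resample the rows of the estimated noise matrix (n x r) so that the
    # bootstrap sample is spherical under H_0k.
    n, r = noise.shape
    if method == "parametric":
        # (a) i.i.d. Gaussian vectors with matched overall scale
        return rng.normal(scale=noise.std(), size=(n, r))
    if method == "orthogonal":
        # (b) apply an independent random rotation to each observation
        out = np.empty_like(noise)
        for i in range(n):
            Q, _ = np.linalg.qr(rng.normal(size=(r, r)))
            out[i] = Q @ noise[i]
        return out
    if method == "permutation":
        # (c) permute the components within each observation
        return rng.permuted(noise, axis=1)
    raise ValueError(method)

noise = rng.standard_t(df=5, size=(100, 4))
boot = {m: bootstrap_noise(noise, m)
        for m in ("parametric", "orthogonal", "permutation")}
```

The orthogonal variant preserves each observation's norm exactly, which is why it remains valid for any spherical noise, whereas the parametric variant commits to Gaussianity.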

NUMERICAL EXAMPLES
To evaluate our tests, we use the following noisy ICA design. We have three independent signals having a logistic, t_5, and uniform distribution, all standardized to have mean zero and variance one. The noise is taken either from a N_6(0, σ²I_6) distribution (normal setting) or from a t_5 distribution standardized to have covariance matrix σ²I_6 (t_5 setting). The sample sizes under consideration are n = 1,000 and n = 2,000. As the performance of the methods depends on the mixing matrix A, we randomly choose its 18 elements at each iteration independently from N(0, 1).
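A reduced version of this rejection-proportion experiment can be sketched as below, using only the Gaussian-noise asymptotic test (a sphericity-type statistic we assume here) and fewer repetitions than in the text; all names are ours.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
p, d, n, reps = 6, 3, 1000, 200

def pvalue(x, k):
    # Sphericity-type test of H_0k assuming Gaussian noise (tau = 1)
    r = p - k
    evals = np.linalg.eigvalsh(np.cov(x, rowvar=False))[:r]
    stat = n * r * evals.var() / (2 * evals.mean() ** 2)
    return chi2.sf(stat, r * (r + 1) // 2 - 1)

rej = {k: 0 for k in (2, 3, 4)}
for _ in range(reps):
    A = rng.normal(size=(p, d))                        # new mixing each time
    z = np.column_stack([
        rng.logistic(size=n) / (np.pi / np.sqrt(3)),   # logistic, unit variance
        rng.standard_t(df=5, size=n) / np.sqrt(5 / 3), # t5, unit variance
        rng.uniform(-np.sqrt(3), np.sqrt(3), size=n),  # uniform, unit variance
    ])
    x = z @ A.T + 0.5 * rng.normal(size=(n, p))        # normal setting
    for k in rej:
        rej[k] += pvalue(x, k) < 0.05
props = {k: v / reps for k, v in rej.items()}
```

Rejection at k = 2 (one signal still in the tail) should be near one, while the proportion at the correct k = 3 should stay close to the nominal level.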
We then record, on the basis of 1,000 repetitions, the rejection proportions at the level α = .05 for the null hypotheses H_0k, k ∈ {2, 3, 4}, where k = 3 corresponds to the correct dimension. We include in our comparisons the asymptotic test both with the constant τ fixed to 1, that is, assuming Gaussian noise (Asymp Fix), and with τ̂ estimated from the data (Asymp Est). The three bootstrap variants are all applied using 200 bootstrap samples and are denoted Boot Para, Boot Orth, and Boot Perm, respectively.
The results are given in Figures 1 and 2 for different values of σ. In the normal case, we see that fixing the constant τ = 1 for the asymptotic test brings no real benefit, whereas it has a strongly negative effect when misspecified in the t_5 noise setting. Similarly, the parametric bootstrap works well if the noise distribution is correctly specified but fails otherwise. In general, the asymptotic test that estimates the constant and the bootstrap test based on orthogonal transformations seem preferable: they hold the α level and have good power, which, however, deteriorates with an increasing noise level. Permutation-based bootstrapping, on the other hand, has some trouble keeping the α level.
Restricting ourselves to the asymptotic test with estimated τ, we consider three sound samples (n = 50,000), freely available in the R package JADE and used, for example, in Miettinen, Nordhausen, and Taskinen (2017). The three sound signals are mixed with a 20 × 3 matrix A whose elements come from a uniform distribution on [0, 1]. The three nonzero eigenvalues of Σ(Az) are 14.051, 2.391, and 0.639. We then add either 20-variate Gaussian or t_5 noise to the mixed signals and study at which noise level the number of signals becomes undetectable by a divide-and-conquer strategy (all individual tests are performed at level α = .05). As competitors, we use the Bartlett test (Bentler & Yuan, 1996) and the Lawley (1956) test for equality of eigenvalues in a pure Gaussian model, as implemented in the R package nFactors (Raiche, 2010).
The estimated signal dimensions are plotted against the noise standard deviations in Figure 3, with the grey horizontal line indicating the correct value. With Gaussian noise, our test can still detect all three signals when the noise standard deviation is at most 3.5 (meaning that its variance is roughly equal to the largest signal eigenvalue); the corresponding value for t_5 noise is 2.5. It is also clear that the two tests assuming a pure Gaussian model are not appropriate here: they hardly ever estimate the correct signal dimension and instead overestimate it quite dramatically already at modest noise levels.

CONCLUSION
In this work, we developed inference tools to detect the true latent signal dimensionality using PCA. The problem has been previously considered in the context of noiseless elliptical data (Nordhausen et al., 2016), known as the subsphericity problem, and in a fully Gaussian case using different information criteria (see Chapter 3.2.3 of Cichocki & Amari, 2002, and the references therein). Our model, however, allows for any signal distribution and any spherical noise distribution, making it therefore of special interest in a noisy ICA context.
One drawback of the method is that it requires both the signal and the noise to have finite fourth moments. Consequently, the next step is to develop robust alternatives to the proposed method. Interestingly, additional simulation studies (not shown here) revealed that bootstrap testing actually succeeded in estimating the dimension also when the covariance matrix was replaced with certain robust alternatives, even though the alternatives lack the additivity property. The theoretical justification of this will provide an interesting future prospect, as will investigating whether the theoretical properties of various ICA estimators are retained in the noisy ICA model when the dimension is first estimated.

DATA AVAILABILITY STATEMENT
The data used in the sound example were obtained from the R package JADE, available for download from CRAN (http://CRAN.R-project.org/package=JADE).

ACKNOWLEDGEMENT
The work of K. N. was supported by CRoNoS European Cooperation in Science and Technology Action IC1408.