Fast tensorial JADE

Abstract We propose a novel method for tensorial independent component analysis. Our approach is based on TJADE and k-JADE, two recently proposed generalizations of the classical JADE algorithm. Our novel method achieves the consistency and the limiting distribution of TJADE under mild assumptions and at the same time offers a notable improvement in computational speed. Detailed mathematical proofs of the statistical properties of our method are given and, as a special case, a conjecture on the properties of k-JADE is resolved. Simulations and timing comparisons demonstrate a remarkable gain in speed. Moreover, the desired limiting efficiency is approximately attained already for finite samples. The method is applied successfully to large-scale video data, for which neither TJADE nor k-JADE is feasible. Finally, an experimental procedure is proposed for selecting the values of a set of tuning parameters. Supplementary material, including the R code for running the examples and the proofs of the theoretical results, is available online.


Tensorial independent component analysis
In modern data analysis, an increasingly common and natural assumption is that the covariance matrix is Kronecker-structured: the random vector x ∈ R^{pq} has a covariance matrix Σ = Σ_2 ⊗ Σ_1, where Σ_1 ∈ R^{p×p} and Σ_2 ∈ R^{q×q} are positive-definite and ⊗ is the Kronecker product. In order for an estimator of Σ to be adequate, it should utilize this special structure, see e.g. Leng and Pan (2018); Roś et al. (2016); Srivastava et al. (2008); Werner et al. (2008); Wiesel (2012). In particular, if the Kronecker structure is ignored, the number of parameters is inflated from p(p + 1)/2 + q(q + 1)/2 to pq(pq + 1)/2.
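The parameter counts and the Kronecker structure can be illustrated with a minimal NumPy sketch (our own illustration, not part of the paper's supplementary R code; the dimensions p = 10, q = 20 and the factor names L1, L2 are hypothetical):

```python
import numpy as np

# Parameter counts for the covariance of a pq-variate random vector:
# unstructured vs. Kronecker-structured (illustrative values of p and q).
p, q = 10, 20
unstructured = p * q * (p * q + 1) // 2
kronecker = p * (p + 1) // 2 + q * (q + 1) // 2
print(unstructured, kronecker)  # 20100 265

# The structure itself: if X = L1 Z L2^T with Cov[vec(Z)] = I, then
# Cov[vec(X)] = (L2 (x) L1)(L2 (x) L1)^T = (L2 L2^T) (x) (L1 L1^T).
rng = np.random.default_rng(1)
L1, L2 = rng.normal(size=(p, p)), rng.normal(size=(q, q))
K = np.kron(L2, L1)
print(np.allclose(K @ K.T, np.kron(L2 @ L2.T, L1 @ L1.T)))  # True
```

The second check is just the mixed-product property of the Kronecker product, which is what makes the matrix representation below possible.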
Basic properties of the Kronecker product ensure that any zero-mean random vector x with the Kronecker covariance structure admits a natural representation as a random matrix by means of the matrix location-scatter model, X = Σ_1^{1/2} Z Σ_2^{1/2}, where the original random vector is obtained through vectorization, vec[X] = x, and the latent random matrix Z is standardized such that Cov[vec[Z]] = I_{pq}. In order to find interesting structures, the uncorrelatedness assumption on the elements of Z is occasionally not enough. In Virta et al. (2017) the uncorrelatedness assumption was replaced by the assumption of full statistical independence and the location-scatter model was extended to the matrix-valued independent component model,

X = Ω_1 Z Ω_2^T,   (1)

where the invertible Ω_1 ∈ R^{p×p}, Ω_2 ∈ R^{q×q} are unknown parameters and the latent matrix Z is assumed to have mutually independent, marginally standardized components. The estimation procedure of a latent vector with independent components, using only the information provided by observed linear combinations of the components, is referred to as independent component analysis (ICA). See Comon and Jutten (2010); Hyvärinen et al. (2004); Ilmonen and Paindaveine (2011); Miettinen et al.
(2015); Nordhausen and Oja (2018); Samarov and Tsybakov (2004) for different approaches under classical multivariate settings. Since the location has no effect on the estimation of the so-called mixing matrices Ω_1 and Ω_2, we can without loss of generality assume that Z is centered. Under the additional assumption that at most one entire row of Z has a multivariate normal distribution and at most one entire column of Z has a multivariate normal distribution, the latent Z can be shown to be identifiable up to the order, joint scaling and signs of its rows and columns (Virta et al., 2017). This pair of assumptions is less restrictive than its counterpart in standard vector-valued ICA, which allows a maximum of one normal component in the latent vector (Comon and Jutten, 2010). In contrast, the assumptions of the matrix model allow Z to contain up to pq − max(p, q) normal components, if the non-normal elements are suitably located. Vectorizing Eq. (1) yields a Kronecker-structured standard independent component model x = (Ω_2 ⊗ Ω_1)z, where x = vec[X] ∈ R^{pq} and z = vec[Z] ∈ R^{pq}. Again, any reasonable ICA approach should here take the special form of the mixing matrix into account. To our knowledge, no structured estimation methods have yet been developed for standard vector-valued ICA in this setting. However, the problem has been considered under the matrix representation of Eq. (1) in Virta et al. (2017, 2018), where two classical ICA procedures, fourth order blind identification (FOBI) (Cardoso, 1989) and joint diagonalization of eigen-matrices (JADE) (Cardoso and Souloumiac, 1993), are extended to TFOBI and TJADE in order to solve Eq.
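The vectorized form x = (Ω_2 ⊗ Ω_1)z can be verified numerically; the sketch below (ours, with illustrative dimensions) assumes the model is written as X = Ω_1 Z Ω_2^T, which is the form compatible with that Kronecker mixing matrix under column-major vectorization:

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 4
Omega1 = rng.normal(size=(p, p))
Omega2 = rng.normal(size=(q, q))
Z = rng.normal(size=(p, q))

X = Omega1 @ Z @ Omega2.T    # matrix IC model
x = X.flatten(order="F")     # column-major vectorization vec[X]
z = Z.flatten(order="F")
# vec(Omega1 Z Omega2^T) = (Omega2 (x) Omega1) vec(Z)
print(np.allclose(x, np.kron(Omega2, Omega1) @ z))  # True
```

Note the `order="F"` flag: NumPy flattens row-major by default, whereas the vec operator stacks columns.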
(1). Even though TJADE provides a definite improvement in efficiency over TFOBI and the classical multivariate versions of the procedures, it is relatively expensive to compute. Consequently, the main goal of this paper is to address this issue. We provide a computationally more efficient, alternative TJADE-based estimator, which retains the desirable statistical properties of TJADE; in particular, it retains the consistency and the limiting distribution of TJADE. We wish to point out that the statistical models for matrix data we are about to discuss are markedly different from the ones usually encountered when discussing array-valued data in engineering contexts. The relevant engineering literature is populated mostly with different tensor decompositions, the most popular ones being the Tucker decomposition and the CP decomposition, which extend the singular value decomposition to higher-order structures in different ways; see Cichocki et al. (2015); Kolda and Bader (2009) for comprehensive reviews. A fundamental difference between the classical tensor decompositions and our current model is that the former rarely incorporate the concept of a sample in the formulation. As such, tensor decompositions generally also reduce the dimension of the observation space, which is very unusual for classical statistical methods. These fundamentally different objectives also prevent any simple, meaningful comparisons, and we have therefore refrained from including comparisons to classical tensor decompositions in this work.
The rest of the paper is structured as follows. In Section 2, we review the existing methodology on tensorial independent component analysis and, in particular, TJADE, on which our proposed extension will be based. In Section 3, we formulate k-TJADE in the case of matrix-valued observations and thoroughly establish its limiting behavior along with the necessary assumptions. The theoretical results are extended to a general tensor-valued model in Section 4 using the technique of matricization. Section 5 explores the finite-sample performance of the proposed method in both simulations and data examples on videos and hand-written digits. We also investigate the estimation of a set of tuning parameters of k-TJADE in the context of the latter example. In Section 6 we provide a short summary and list ideas for future work. Technical proofs are presented in Appendix A.

Tensorial JADE
We begin by briefly describing the theory behind TFOBI and TJADE. Everything in the following is formulated on the population level for a random matrix X ∈ R^{p×q}. In practice, one would obtain a sample of matrices, X_1, ..., X_n, from the distribution of X, and the expected values below should be replaced by sample means.
Assuming that the random matrix X ∈ R^{p×q} follows the model in Eq. (1), the first step is to simultaneously standardize both the rows and the columns of the matrix, using the left and right covariance matrices of X, Σ_1[X] = E[XX^T] and Σ_2[X] = E[X^T X] (the scaling convention is absorbed into the constant τ below). We denote the unique symmetric inverse square root of a positive-definite matrix S by S^{−1/2}. The standardized variable X^{st} = Σ_1[X]^{−1/2} X Σ_2[X]^{−1/2} satisfies X^{st} = τ U_1 Z U_2 for some τ = τ(Ω_1, Ω_2) > 0 and some orthogonal matrices U_1 ∈ R^{p×p}, U_2 ∈ R^{q×q}, see Virta et al. (2017). The unknown constant of proportionality τ results from the joint scaling of the model in Eq. (1) being left unfixed. After the standardization, solving the IC problem reduces to the estimation of the orthogonal matrices U_1, U_2, a task commonly addressed in ICA using higher-order cumulants. The TFOBI and TJADE procedures also utilize higher-order cumulants. In TFOBI, a Fisher consistent (under mild assumptions, see Section 3) estimator Γ_F[X] of the inverse of the matrix Ω_1 is constructed as Γ_F[X] = V_F[X]^T Σ_1[X]^{−1/2}, where the columns of V_F[X] are the eigenvectors of a fourth-moment matrix of X^{st}. Thus, Γ_F[X] provides a solution for the left-hand side of the IC model in Eq. (1) (Virta et al., 2017).
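The double standardization step can be sketched as follows (a minimal NumPy illustration of our own; the helper names and the particular scaling of Σ_1, Σ_2 by q and p are assumptions, as any scaling choice is absorbed into τ):

```python
import numpy as np

def sym_inv_sqrt(S):
    """Unique symmetric inverse square root of a positive-definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def standardize(X_sample):
    """Double standardization of a sample of p x q matrices.

    Estimates the left and right covariance matrices
    Sigma1 = E[X X^T] / q and Sigma2 = E[X^T X] / p
    and multiplies the centered matrices by their inverse square roots.
    """
    n, p, q = X_sample.shape
    Xc = X_sample - X_sample.mean(axis=0)
    Sigma1 = np.einsum('nij,nkj->ik', Xc, Xc) / (n * q)
    Sigma2 = np.einsum('nji,njk->ik', Xc, Xc) / (n * p)
    S1, S2 = sym_inv_sqrt(Sigma1), sym_inv_sqrt(Sigma2)
    return np.einsum('ij,njk,kl->nil', S1, Xc, S2)

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 3, 4))
Xst = standardize(X)
print(Xst.shape)  # (2000, 3, 4)
```

After this step the left covariance of the standardized sample is approximately proportional to the identity matrix, as the text states.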
The TJADE procedure utilizes a set of matrices, C = {C^{ij}[X^{st}] : i, j ∈ {1, ..., p}}, referred to as the set of cumulant matrices, given in Eq. (2), where δ_{ij} is the Kronecker delta, e_k ∈ R^p, k ∈ {1, ..., p}, are the standard basis vectors of R^p and E^{kl} = e_k e_l^T. The left covariance matrix of the standardized matrix satisfies Σ_1[X^{st}] = τ² I_p and is included in Eq. (2) solely for the estimation of the constant of proportionality τ. The authors in Virta et al. (2018) proved that, under mild assumptions (see Section 3), the joint diagonalizer of C equals the orthogonal matrix U_1 up to the order and signs of its columns. The joint diagonalizer of the set C is defined as any orthogonal matrix

V_J[X] ∈ argmin_V Σ_{C ∈ C} ‖off(V^T C V)‖_F²,  over orthogonal V ∈ R^{p×p},   (3)

where off(S) ∈ R^{p×p} equals S ∈ R^{p×p} with its diagonal elements set to zero.
The joint diagonalizer defines a coordinate system in which the linear transformations C^{ij}[X^{st}], i, j ∈ {1, ..., p}, have the minimal sum of squared off-diagonal elements. There exist several algorithms for optimizing Eq. (3), the most popular being the Jacobi-rotation technique; for details, see Belouchrani et al. (1997); Illner et al. (2015). After the estimation of the joint diagonalizer V_J[X], an estimated inverse for the matrix Ω_1 is obtained as the TJADE functional Γ_J[X] = V_J[X]^T Σ_1[X]^{−1/2}. The results of this paper are derived only for the left-hand side of the matrix-valued model. The right-hand side can be solved exactly as the left-hand side: one simply takes the transpose of X and proceeds as with the left-hand side. Only the sizes of the cumulant and transformation matrices change from p × p to q × q. Moreover, matricization allows us to extend the estimators beyond the matrix-valued model to arbitrary-dimensional tensor-valued IC models, see Virta et al. (2017, 2018). Matricization also allows us to hide a considerable amount of the unpleasant notation related to tensor algebra, see Section 4. In total, it is sufficient to present the results in matrix form and only for the left-hand side of the model in Eq. (1).
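A bare-bones Jacobi-rotation joint diagonalizer looks as follows. This is a textbook sketch of ours (not the tensorBSS implementation), using the closed-form rotation angle for sets of real symmetric matrices:

```python
import numpy as np

def off_crit(Cs, V):
    """Sum of squared off-diagonal elements of V^T C V over the set Cs."""
    total = 0.0
    for C in Cs:
        D = V.T @ C @ V
        total += np.sum(D ** 2) - np.sum(np.diag(D) ** 2)
    return total

def joint_diagonalize(Cs, n_sweeps=50, tol=1e-12):
    """Approximate joint diagonalizer of symmetric matrices via Jacobi sweeps."""
    Cs = [C.copy() for C in Cs]
    p = Cs[0].shape[0]
    V = np.eye(p)
    for _ in range(n_sweeps):
        changed = False
        for i in range(p - 1):
            for j in range(i + 1, p):
                # Closed-form angle minimizing the off-criterion for (i, j).
                a = np.array([C[i, i] - C[j, j] for C in Cs])
                b = np.array([2.0 * C[i, j] for C in Cs])
                theta = 0.25 * np.arctan2(2.0 * a @ b, a @ a - b @ b)
                if abs(theta) < tol:
                    continue
                changed = True
                c, s = np.cos(theta), np.sin(theta)
                G = np.eye(p)
                G[i, i] = G[j, j] = c
                G[i, j], G[j, i] = -s, s
                for k in range(len(Cs)):
                    Cs[k] = G.T @ Cs[k] @ G
                V = V @ G
        if not changed:
            break
    return V

# Example: matrices sharing the eigenvectors of a random orthogonal Q.
rng = np.random.default_rng(5)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
Cs = [Q @ np.diag(rng.normal(size=4)) @ Q.T for _ in range(6)]
V = joint_diagonalize(Cs)
print(off_crit(Cs, V) < 1e-10)  # True
```

For an exactly jointly diagonalizable set, as here, the sweeps drive the off-criterion to numerical zero; for sample cumulant matrices only an approximate joint diagonalizer exists.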
When the dimension q equals one, the TJADE procedure for the left-hand side of the model is equivalent to standard JADE for vector-valued data. Extensive comparisons between JADE and TJADE are conducted in Virta et al. (2016, 2017, 2018), with the conclusion that the latter is uniformly superior to the former under the Kronecker-structured IC model. Moreover, the tensorial version is computationally significantly faster. Consider a tensor of order r with all dimensions of size p. Standard JADE requires a single joint diagonalization of p^{2r} matrices of size p^r × p^r, whereas TJADE requires r joint diagonalizations of p² matrices of size p × p. In essence, adding dimensions to a tensor has a multiplicative effect on the number of operations the classical vectorial methods require and merely an additive effect on the tensorial methods. However, even with its considerable advantages over JADE, running the TJADE procedure is slow for large tensors.
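The operation counts above can be tabulated directly (a small illustration of ours; the helper names are hypothetical):

```python
# Number and side length of the cumulant matrices to jointly diagonalize
# for an r-th order tensor with all dimensions equal to p.
def jade_cost(p, r):
    return p ** (2 * r), p ** r   # (count, side length), single diagonalization

def tjade_cost(p, r):
    return r * p ** 2, p          # (count, side length), r diagonalizations

print(jade_cost(3, 3))   # (729, 27)
print(tjade_cost(3, 3))  # (27, 3)
```

Already for a modest 3 × 3 × 3 tensor, vectorial JADE must jointly diagonalize 729 matrices of size 27 × 27, against 27 matrices of size 3 × 3 for TJADE.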
To obtain a faster method, we approach the problem in the spirit of Miettinen et al. (2013), where a faster version of JADE, k-JADE, is derived. The modification can be described very succinctly: instead of diagonalizing the entire set of cumulant matrices C^{ij}, we diagonalize only a specific subset of them, chosen such that the desirable statistical properties of TJADE still carry over to the extension. Since the subset of cumulant matrices can be chosen separately in each direction of the tensor, the k-JADE approach provides even more significant improvements in tensorial settings than in its original use of speeding up JADE. Note that ideas similar to those in Miettinen et al. (2013) were already used in Cardoso (1999) to formulate shifted blocks for blind separation (SHIBBS), where only those cumulant matrices of regular JADE with matching indices are diagonalized.

Tensorial k-JADE
In this section we propose a novel extension of the TJADE procedure. We formulate the extension, k-TJADE, such that it retains the following three key properties of TJADE. The first property is the ability to solve the tensor independent component model, manifesting either as Fisher consistency or consistency, depending on whether we are at the population or the sample level, respectively. The second property is orthogonal equivariance under arbitrary data. ICA estimators are customarily expected to have some form of equivariance, which makes the generalization of limiting properties more straightforward (Miettinen et al., 2015; Virta et al., 2017). The third desired property is the limiting distribution of TJADE, which has the lowest limiting variance among the known tensorial ICA methods. Next, we establish these properties one by one. As mentioned in the previous section, all results derived for the left-hand side of the model also hold for the right-hand side, prompting us to consider only the former in the following. In the same spirit, even though the proposed k-TJADE method has, in the case of matrix data, two tuning parameters, k_1 for the rows and k_2 for the columns, the method still acts on only a single mode at a time, and as such we omit the subscripts and speak in the following only of the tuning parameter k. The same idea is reflected in the name of the method, which (for matrix data) should technically be (k_1, k_2)-TJADE. In order to keep the presentation more readable, we prefer to call the method simply k-TJADE.
We define matrix independent component functionals, the extension of independent component functionals (Miettinen et al., 2015) to matricial ICA.

Definition 1. A functional Γ is called a matrix independent component functional if (i) Γ[X] ≡ Ω_1^{−1} for all X ∈ R^{p×q} that follow the matrix IC model of Eq. (1) and (ii) Γ[UXV^T] ≡ Γ[X]U^T for all X ∈ R^{p×q} and all orthogonal U ∈ R^{p×p}, V ∈ R^{q×q}, where two matrices A, B ∈ R^{p×p} satisfy A ≡ B if A = cPJB for some c > 0, some diagonal matrix J ∈ R^{p×p} with diagonal elements equal to ±1 and some permutation matrix P ∈ R^{p×p}.
The first condition in Definition 1 requires that a matrix IC functional must be able to solve the left-hand side of the model in Eq. (1) (Fisher consistency). The second condition essentially states that the functional cancels out any orthogonal transformations of the observed matrices (orthogonal equivariance). As a particularly useful consequence of the latter, the limiting distribution of a matrix IC functional under trivial mixing, Ω_1 = I_p, Ω_2 = I_q, instantly generalizes to any orthogonal mixing as well.
Let κ ∈ R^p be the vector of the row means of the element-wise kurtoses, E[z_{kl}^4] − 3, of the elements of Z.

Assumption 1. At most one element of κ equals zero.

Assumption 2(v). The multiplicities of the elements of κ are at most v.
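The vector κ is straightforward to compute from a sample; the following sketch (our own, with a hypothetical 2 × 3 latent matrix whose first row is uniform and second row is standard normal) illustrates the row means of the element-wise kurtoses:

```python
import numpy as np

# kappa: row means of the element-wise kurtoses E[z_kl^4] - 3 of Z.
rng = np.random.default_rng(3)
n, p, q = 100_000, 2, 3
Z = np.empty((n, p, q))
Z[:, 0, :] = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, q))  # unit variance,
                                                                # kurtosis -1.2
Z[:, 1, :] = rng.normal(size=(n, q))                            # kurtosis 0

kurt = (Z ** 4).mean(axis=0) - 3.0  # p x q matrix of sample kurtoses
kappa = kurt.mean(axis=1)           # the vector kappa in R^p
print(kappa)                        # close to [-1.2, 0.0]
```

Here the second element of κ is (approximately) zero, so Assumption 1 still holds; a second zero-mean-kurtosis row would violate it.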
The TFOBI functional Γ_F is a matrix IC functional in the sense of Definition 1 if Assumption 2(1) is satisfied, and the TJADE functional Γ_J is a matrix IC functional if Assumption 1 is satisfied. Naturally, the column-mean analogues of the assumptions are required to separate the right-hand side of the model in Eq. (1).
The same comparison as was made between the normality assumptions in Section 1 holds analogously between Assumption 1, Assumption 2(v) and their vectorial counterparts, which allow maximally one zero-kurtosis component or only distinct kurtoses, respectively. The main implication is that in matrix ICA numerous latent components may have identical kurtoses as long as their row means (and column means, when separating the right-hand side of the model) satisfy the necessary requirements. The assumptions also satisfy the following set of relations, where ⊂ is read as "implies": Assumption 2(v) ⊂ Assumption 2(v + 1) for all v. Moreover, we also have Assumption 2(1) ⊂ Assumption 1.
In order to speed up TJADE such that the properties in Definition 1 are retained, we proceed as in Miettinen et al. (2013): instead of diagonalizing the set C, we diagonalize only those members of it that satisfy |i − j| < k, for some pre-defined value of the tuning parameter k ∈ {1, ..., p}. This discarding can be motivated in two ways. Firstly, all except the repeated-index matrices, C^{11}, ..., C^{pp}, vanish asymptotically: every matrix C^{ij}[Z] = 0 for i ≠ j, implying that with increasing sample size all the separation information is eventually contained in the p repeated-index matrices. Secondly, assuming that the values in κ are in decreasing order (this is guaranteed by using TFOBI as a preprocessing step, see the next paragraph), the ith row of Z is the most difficult to separate from its immediate neighboring rows, and the separation information between them is contained precisely in the matrices C^{ij} and C^{ji} where j is close to i.
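The size of the retained subset is easy to enumerate (our own illustration; p = 10 is an arbitrary choice):

```python
# Pairs (i, j) whose cumulant matrices are kept when only |i - j| < k
# is jointly diagonalized, versus the full TJADE set of p^2 matrices.
def kept_pairs(p, k):
    return [(i, j) for i in range(p) for j in range(p) if abs(i - j) < k]

p = 10
for k in (1, 2, p):
    print(k, len(kept_pairs(p, k)))
# k = 1 keeps only the p repeated-index matrices; k = p recovers full TJADE.
```

In general, |i − j| < k keeps p(2k − 1) − k(k − 1) of the p² matrices, so for small k the joint diagonalization works on O(pk) rather than O(p²) matrices.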
Analogously to k-JADE, we use the TFOBI algorithm to obtain an initial value for the functional. This ensures that, even after the previous modification, the functional remains orthogonally equivariant. The following definition and theorem formalize our resulting novel method, called k-TJADE.
Definition 2. The k-TJADE functional is Γ_k[X] = V_k[X]^T Γ_F[X], where Γ_F is the TFOBI functional, X_F is the TFOBI solution for X and the orthogonal matrix V_k[X] is the joint diagonalizer of the set {C^{ij}[X_F] : |i − j| < k}.

Theorem 1. Let Assumptions 1 and 2(v) hold for some fixed v. Then the k-TJADE functional Γ_k is a matrix IC functional for all k ≥ v.
Theorem 1 provides the Fisher consistency and the orthogonal equivariance of the k-TJADE functional. The assumptions that Theorem 1 requires are interesting since they provide an interpretation for the tuning parameter k: the parameter is the maximal number of allowed kurtosis-mean multiplicities. The values k = 1 and k = p correspond to the extreme cases where all the kurtosis means have to be distinct (as in TFOBI) and where no assumptions are made on the multiplicities of the non-zero kurtosis means (as in TJADE). Thus, k-TJADE can be seen as a middle ground between TFOBI and TJADE. Like the assumptions, the methods themselves can be ordered according to the strictness of the assumptions they require, TFOBI = 1-TJADE ⪰ 2-TJADE ⪰ ··· ⪰ p-TJADE = TJADE, where "⪰" is read as "makes at least as many assumptions as" and "=" as "makes the same assumptions as".
We next give some intuition behind the proof of Theorem 1 (given in Appendix A). The proof relies on a specific interplay between the preliminary TFOBI step and the set of cumulant matrices C. Being an eigendecomposition-based method, TFOBI is able to estimate the matrix Ω_1^{−1} only up to row blocks determined by the elements of κ, such that the rows of the estimate corresponding to equal kurtosis values remain mixed by orthogonal matrices (cf. eigenvectors corresponding to multiple eigenvalues are not uniquely defined, but their span is). However, the proof also shows that the joint diagonalization of all cumulant matrices with |i − j| < k, for some given k, can separate mixed row blocks of size at most k (intuitively, increasing k means using more cumulant matrices, which increases the amount of available information). Thus, sequencing the two steps together and having k ≥ v ensures that any blocks left mixed after the TFOBI step are unmixed in the joint diagonalization step.
Assumptions 1 and 2(v) not only yield Theorem 1, but are also sufficient to guarantee the two remaining desired properties of k-TJADE: consistency and the same limiting distribution as that of TJADE. These asymptotic properties are formalized in the following two theorems. Remarkably, as a special case (q = 1), the latter also proves a previously unsolved conjecture posed about the limiting behavior of vectorial k-JADE in Miettinen et al. (2015).
Theorem 2. Let Assumptions 1 and 2(v) hold for some fixed v and let X_1, ..., X_n be an i.i.d. sequence from the matrix IC model in Eq. (1) with identity mixing matrices, Ω_1 = I_p, Ω_2 = I_q. Assume that the population quantity X has finite eighth moments. Then, for all k ≥ v, there exists a consistent sequence of k-TJADE estimators Γ̂_k (indexed by n), that is, Γ̂_k →_P I_p.

Theorem 3. Under the assumptions of Theorem 2, we have for all k ≥ v that √n(Γ̂_k − I_p) = √n(Γ̂_J − I_p) + o_p(1), where Γ̂_J is the TJADE estimator. The notation o_p(1) refers to a sequence of random matrices that converges in probability to the zero matrix.
The proofs of Theorems 2 and 3 are based on the fact that the kurtosis matrices of the latent Z with non-matching indices vanish asymptotically, C^{ij}[Z] = 0 for i ≠ j. The key point of the proof is then to show that this property transfers to the corresponding kurtosis matrices C^{ij}[X_F] of the TFOBI-standardized observation, to such an extent that the asymptotic contribution of the sample estimates of these matrices to the final k-TJADE estimate is negligible. Thus, regardless of the choice of k, the asymptotic behavior of the method is determined by the p repeated-index matrices C^{11}, ..., C^{pp}. Finally, the proof reveals that the rotation specified by the joint diagonalization dominates over the preliminary TFOBI rotation, and the effect of the latter also turns out to be asymptotically negligible.
Note that Γ̂_J = Γ̂_p does not hold in general, as the latter estimator utilizes the preliminary TFOBI step while the former does not. The limiting distribution of √n(Γ̂_J − I_p) is shown in Virta et al. (2018) to be multivariate normal, and closed-form expressions for its limiting variances are also given therein. By the orthogonal equivariance of matrix independent component functionals (property (ii) of Definition 1), the limiting results of Theorem 3 generalize to any orthogonal mixing matrices. Note that the original k-JADE in Miettinen et al. (2013) is affine equivariant (equivariant under all coordinate system changes, not just orthogonal ones). The problem of achieving affine equivariance in the context of tensorial ICA is discussed in Virta et al. (2018), where it was conjectured that tensorial ICA cannot be affine equivariant.
The two limiting theorems above show that, under suitable assumptions, k-TJADE indeed has all the desirable properties listed at the beginning of this section. This can be summarized by saying that k-TJADE makes a trade-off between assumptions and computational load: at the price of added assumptions, we obtain a method with the same limiting efficiency as TJADE but with a significantly lighter computational burden. As the claim about efficiency holds only asymptotically, we conduct a simulation study (in Section 5) to compare the finite-sample efficiencies of the estimators.
Note that the maximal kurtosis multiplicity v is unknown in practice, which makes choosing k such that k ≥ v a non-trivial task. However, the simulations of Section 5 reveal that not much separation efficiency is necessarily lost when using a slightly too small value of k, making the problem less dire. To further alleviate the issue, in the final part of Section 5 we propose a procedure for estimating v and apply it to hand-written digit data.

A note on tensorial ICA
In this section, we formulate the general tensorial IC model and discuss how it can be reduced to the matricial IC model. We begin with a short review of the basic concepts of multilinear algebra.
An rth order tensor X ∈ R^{p_1×···×p_r} has a total of r modes or ways, that is, directions from which we can view it. For example, a matrix (r = 2) can be viewed either through its columns or through its rows. Two complementary ways of dividing a tensor into a disjoint collection of smaller tensors are the m-mode vectors and the m-mode faces. For the former, we choose an index m = 1, ..., r, keep the values of the indices {1, ..., r} \ {m} fixed and let the mth index vary over its range. Each fixed combination of the r − 1 indices then yields a single p_m-dimensional vector, and the collection of all ρ/p_m such vectors, where ρ = p_1 ··· p_r, is called the set of m-mode vectors of X. On the other hand, if we fix the value of the mth index and let the others vary over their ranges, we get a total of p_m tensors of order r − 1 and size p_1 × ··· × p_{m−1} × p_{m+1} × ··· × p_r, called the m-mode faces of X. Illustrations of both m-mode vectors and m-mode faces, m = 1, 2, 3, in the case of a 3-dimensional tensor are shown in Figures 1 and 2.
Two algebraic operations can now be defined in terms of the m-mode vectors.
The m-mode multiplication is easily understood: X ×_m A_m is obtained by multiplying each m-mode vector of X from the left by A_m and collecting the resulting vectors back into an rth order tensor in the same order. The m-mode multiplications ×_1, ..., ×_r from distinct modes are commutative and we use the shorthand X ×_{m=1}^{r} A_m = X ×_1 A_1 ··· ×_r A_r. In the case of matrices, X = X ∈ R^{p_1×p_2} (r = 2), the simultaneous multiplication from both modes simply gives X ×_{m=1}^{2} A_m = A_1 X A_2^T. To show the connection between the matricial and tensorial IC methods, we still need the concept of m-mode matricization. For a fixed mode m = 1, ..., r, the m-mode matricization X_(m) ∈ R^{p_m×(ρ/p_m)} of X is obtained by taking the m-mode vectors of X and collecting them horizontally into a wide matrix. The arrangement order of the m-mode vectors has no effect on our further developments, and we choose to use the cyclical ordering as in De Lathauwer et al. (2000). In this case the relationship

(X ×_{m=1}^{r} A_m)_(m) = A_m X_(m) (A_{m+1} ⊗ ··· ⊗ A_r ⊗ A_1 ⊗ ··· ⊗ A_{m−1})^T   (4)

holds. For a more comprehensive introduction to multilinear algebra, see Cichocki et al. (2009); De Lathauwer et al. (2000).
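Both operations have short NumPy realizations (a sketch of ours; note that NumPy's reshape fixes one particular column ordering for the matricization, which may differ from the cyclical ordering by a column permutation, without affecting the estimators):

```python
import numpy as np

def mode_mult(X, A, m):
    """m-mode multiplication X x_m A: multiply every m-mode vector by A."""
    return np.moveaxis(np.tensordot(A, X, axes=([1], [m])), 0, m)

def matricize(X, m):
    """m-mode matricization: the m-mode vectors collected as columns."""
    return np.moveaxis(X, m, 0).reshape(X.shape[m], -1)

rng = np.random.default_rng(4)
X = rng.normal(size=(3, 4))    # a matrix is a 2nd order tensor
A1 = rng.normal(size=(3, 3))
A2 = rng.normal(size=(4, 4))

# For matrices, simultaneous multiplication from both modes is A1 X A2^T.
Y = mode_mult(mode_mult(X, A1, 0), A2, 1)
print(np.allclose(Y, A1 @ X @ A2.T))  # True

# Matricization turns an m-mode product into an ordinary matrix product
# acting on the left, (X x_m B)_(m) = B X_(m) (columns in matching order).
T = rng.normal(size=(2, 3, 4))
B = rng.normal(size=(3, 3))
print(np.allclose(matricize(mode_mult(T, B, 1), 1),
                  B @ matricize(T, 1)))  # True
```

The second check is the mode-m part of Eq. (4); the Kronecker factor acting on the other modes shows up when all r multiplications are matricized at once.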
We are now sufficiently equipped to examine the connection between matricial ICA and tensorial ICA. The zero-mean rth order random tensor X ∈ R^{p_1×···×p_r} is said to follow the tensorial independent component model if

X = Z ×_{m=1}^{r} Ω_m,   (5)

where the random tensor Z ∈ R^{p_1×···×p_r} has mutually independent, marginally standardized components and Ω_m ∈ R^{p_m×p_m}, m = 1, ..., r, are invertible (Virta et al., 2017). We further assume that, for each mode m = 1, ..., r, at most one m-mode face of Z consists solely of normally distributed components.
The objective of tensorial independent component analysis is to estimate Z given a random sample of observed tensors from the distribution of X.
We fix the mode m = 1, ..., r and consider the m-mode matricization of the model in Eq. (5). It now follows from Eq. (4) that

X_(m) = Ω_m Z_(m) (Ω_{m+1} ⊗ ··· ⊗ Ω_r ⊗ Ω_1 ⊗ ··· ⊗ Ω_{m−1})^T.   (6)

As Kronecker products of invertible matrices are themselves invertible, a comparison to Eq. (1) now reveals that Eq. (6) is in the form of a matrix IC model with Ω_m replacing Ω_1 and Ω_{m+1} ⊗ ··· ⊗ Ω_r ⊗ Ω_1 ⊗ ··· ⊗ Ω_{m−1} replacing Ω_2. In addition, Z_(m) satisfies all the assumptions of the matrix IC model.
To be precise, it is possible that multiple columns of Z_(m) have multivariate normal distributions as a consequence of the matricization. However, as the m-mode matricization leaves the structure of the mth mode intact, the normality assumption on Z guarantees that at most one row of Z_(m) has a multivariate normal distribution, which is then sufficient for the identifiability of Ω_m, the current parameter of interest. Thus, using Eq. (6), we can estimate Ω_m exactly as Ω_1 in the matricial case. In the tensorial case, the kurtosis means in the vector κ in Assumption 1 and in Assumption 2(v) are computed over the rows of Z_(m), or equivalently, over the m-mode faces of Z. All our theoretical results, such as orthogonal equivariance (a Kronecker product of orthogonal matrices is itself orthogonal) and the asymptotic variances, hold fully under the tensorial IC model, as long as the relevant kurtosis assumptions are satisfied. Finally, we note that for an rth order input tensor X, k-TJADE has a total of r tuning parameters, k_1, ..., k_r, with the underlying idea that k_m should be chosen to be at least as large as the maximal kurtosis multiplicity in the corresponding mode, see Theorem 3.

Simulations and examples
In this section, we illustrate the finite-sample properties of the tensor-valued procedures TFOBI, TJADE and the novel k-TJADE, which are implemented in the R package tensorBSS (Virta et al., 2016). For comparison, we also consider the classical vector-valued versions of these estimators, denoted by VFOBI, VJADE and k-VJADE, as implemented in the JADE package (Miettinen et al., 2017).
In the classical procedures, the tensor-valued observations are first vectorized and the algorithms are then applied to the resulting data matrices.
Next, we consider a collection of observed i.i.d. tensors {X_j}_{j∈{1,...,n}}, generated from the tensorial independent component model, such that X_j ∈ R^{p_1×···×p_r} for every j. Let Ω_1, ..., Ω_r be the theoretical mixing matrices and let Γ̂_1, ..., Γ̂_r be the corresponding unmixing estimates produced by one of the tensor-valued procedures. We denote the Kronecker products of the matrices by Γ̂ = Γ̂_1 ⊗ ··· ⊗ Γ̂_r and Ω = Ω_1 ⊗ ··· ⊗ Ω_r. The vector-valued procedures produce a single unmixing estimate, also denoted by Γ̂. Note that the estimates produced by the vectorial and the tensorial methods are comparable, since in both cases the matrix Γ̂ estimates the inverse of the compound matrix Ω.
The so-called gain matrix is defined as Ĝ = Γ̂Ω. The unmixing is considered successful if the gain matrix is close to the identity matrix, up to the order and signs of its rows. We quantify this closeness using the performance measure called the minimum distance (MD) index (Ilmonen et al., 2010),

D(Ĝ) = (ρ − 1)^{−1/2} inf_{C ∈ C_0} ‖CĜ − I_ρ‖,

where ρ = Π_{j=1}^{r} p_j, ‖·‖ is the Frobenius norm and C_0 is the set of all matrices with exactly one non-zero element in each row and column. The range of the MD index is [0, 1], where the value 0 corresponds to the gain matrix being exactly a permuted and scaled identity matrix, i.e., to a perfectly accurate estimate. Additionally, the limiting distribution of the MD index can be obtained from the limiting distribution of the corresponding IC functional Γ, see (Ilmonen et al., 2010, Theorem 1) and (Virta et al., 2017, Theorem 6).
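For small dimensions the MD index can be computed by brute force over the permutations; the sketch below (our own, not the JADE package implementation) uses the standard reduction in which the scaling part of C is optimized in closed form, leaving a minimization over row-to-column matchings:

```python
import itertools
import numpy as np

def md_index(G):
    """Brute-force minimum distance index (small rho only).

    D(G) = (rho - 1)^{-1/2} inf_{C in C_0} ||C G - I||_F. For a fixed
    matching of row i to coordinate j, the optimal row scaling yields
    the cost 1 - G_ij^2 / sum_l G_il^2.
    """
    rho = G.shape[0]
    cost = 1.0 - G ** 2 / (G ** 2).sum(axis=1, keepdims=True)
    best = min(sum(cost[i, perm[i]] for i in range(rho))
               for perm in itertools.permutations(range(rho)))
    return np.sqrt(best / (rho - 1))

# A permuted and scaled identity gain matrix gives a perfect score of 0.
G_perfect = np.array([[0.0, 2.0, 0.0],
                      [0.0, 0.0, -1.0],
                      [3.0, 0.0, 0.0]])
print(md_index(G_perfect))        # 0.0
print(md_index(np.eye(3) + 0.3))  # strictly between 0 and 1
```

For larger ρ the permutation minimum is in practice solved with an assignment algorithm instead of full enumeration.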
In simulation studies, where multiple iterations are performed under identical conditions, the asymptotic value of the mean of the transformed MD index n(ρ − 1)D(Ĝ)² can be obtained from the limiting variances of the applied IC functionals, see Virta et al. (2017) for further details. The convergence towards the theoretical limiting values given by Theorem 3 can then be demonstrated by visualizing the theoretical limits alongside the empirical means ℓ^{−1} Σ_{j=1}^{ℓ} n_i(ρ − 1)D(Ĝ_j)², where ℓ is the number of iterations for the sample size n_i and Ĝ_j is the estimated gain matrix in the corresponding jth iteration.
Before presenting the simulations, we shortly discuss the importance of the tuning parameter k, which in k-TJADE can be chosen separately for each mode. In the following, (k_1, ..., k_r)-TJADE refers to k-TJADE with the value k_m for the tuning parameter in the mth mode, m ∈ {1, ..., r}. Based on Theorem 3, the computationally cheapest but still asymptotically optimal choice is the largest multiplicity of the kurtosis means in the current mode. This means that we generally have three choices for each mode: we may use the full TJADE if the mode is short; we can use k-TJADE for some small k if the mode is long; or we can choose not to separate the mode at all if it is not expected to contain any relevant information, as might be the case, e.g., for the color dimension of a sample of images.

Finite-sample efficiency, setting 1
We begin by demonstrating that the finite-sample performance of k-TJADE is in line with the asymptotic results given in Theorem 3. In the first setting, we consider simulated collections of i.i.d. matrices of size 3 × 3, Z := {Z_j}_{j∈{1,...,n}}. The components of every Z_j are simulated independently from the following distributions, where C, E, U and N denote independent replicates from the χ²_1, standard exponential, uniform and normal distributions, respectively, all scaled to have zero mean and unit variance.
In this simulation setting, none of the theoretical row or column kurtosis means is zero. However, the theoretical mean kurtoses are the same for the first two rows and for the first two columns. Thus, the preferable (k_1, k_2)-TJADE procedure here is (2, 2)-TJADE, k_1 = k_2 = 2. Note that in this setting the requirements of TFOBI are not fulfilled.
When the observations are vectorized, the length-nine vectors contain a single normally distributed component, two χ²-distributed elements, three elements from the exponential distribution and three elements from the uniform distribution. The vector thus contains one element (the normally distributed component) with theoretical kurtosis 0, making vectorial ICA viable. The most natural k-VJADE procedure here is 3-VJADE. The assumptions of VFOBI are violated.
The simulation was performed using 13 different sample sizes, n_i = 2^{i−1} · 1000, i ∈ {1, ..., 13}, with ℓ = 2000 repetitions per sample size. To evaluate the equivariance properties and the effect of non-orthogonal mixing, we mixed the observations at each repetition using (i) identity mixing; (ii) orthogonal mixing, where U_1 and U_2 are random orthogonal matrices uniformly sampled at each repetition with respect to the Haar measure; and (iii) normal mixing, where Ω_1 and Ω_2 were filled at each repetition with random elements from N(0, 1). To evaluate the effect of the mixing matrix and the limiting distributions, we present, in Figure 3, the transformed MD values for 3-VJADE, VJADE, 22-TJADE and TJADE and the corresponding theoretical limit values for the cases where they are available. As our theory covers only orthogonal mixing, the theoretical limit value of TJADE is missing from the third panel. The sample sizes n_i, i ≥ 8, are omitted from Figure 3, since the sample size 64 · 10³ is already large enough for the convergence of both the vectorial and the tensorial methods.
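The Haar-distributed orthogonal mixing matrices used above can be generated via the QR decomposition of a standard Gaussian matrix, with a sign correction that makes the factorization unique; a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def haar_orthogonal(p, rng):
    """Sample a p x p orthogonal matrix uniformly with respect to the
    Haar measure: QR-decompose a standard Gaussian matrix and flip the
    signs of Q's columns so that R has a positive diagonal, which makes
    the resulting Q Haar-distributed."""
    Z = rng.normal(size=(p, p))
    Q, R = np.linalg.qr(Z)
    return Q * np.sign(np.diag(R))
```

Without the sign correction the distribution of Q depends on the QR implementation's sign convention and is not exactly uniform.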
Figure 3 clearly shows that the vectorial methods are affine invariant: all three of their curves are identical. The tensorial methods are only orthogonally invariant, and hence the curve corresponding to the normal mixing differs from the other two. Moreover, although the limiting value of TJADE is unknown under normal mixing, the finite-sample curves suggest that normal mixing does not change the relative separation accuracy of the tensorial methods compared with orthogonal mixing; only the overall difficulty of the separation is affected (the curves are higher in the plots). Furthermore, the benefit of applying the tensorial methods over the vectorized methods is striking. Finally, Figure 3 also illustrates that all the methods converge to the theoretical values and that the k-TJADE versions share the limiting values of TJADE.
Next, under the same setting as above, we present the results for seven additional procedures that violate some of the required assumptions: VFOBI, TFOBI, 1-VJADE, 2-VJADE, 11-TJADE, 12-TJADE and 21-TJADE. Note that TFOBI is the initial step of the k-TJADE procedure, and hence the comparison between k-TJADE and TFOBI illustrates the added benefit of the additional rotation after the TFOBI solution; the same holds between VFOBI and k-VJADE. The resulting mean values of the transformed MD indices are presented in Figure 4, where methods with almost identical performance are drawn in the same colors.
In Figure 4, TFOBI and VFOBI diverge at an exponential rate. Furthermore, the (k_1, k_2)-TJADE procedures for which either k_1 or k_2 is less than the number of distinct kurtosis values do not converge to anything reasonable, even at sample sizes greater than 4 · 10⁶. However, in this simulation setting, the k-VJADE procedure seems to tolerate slight deviations from the required assumptions: 2-VJADE appears to converge to an asymptotic value that is at least close to that of VJADE; see Section 6 for further discussion.
To summarize this part of the simulation study: k-TJADE works as expected when the theoretical conditions are met, and the convergence to the asymptotic value is relatively fast. Furthermore, even though k-TJADE is not affine invariant, it outperforms the affine invariant vectorial counterparts under all mixing scenarios.

Finite-sample efficiency, setting 2
In our second simulation, we illustrate unmixing in a tensorial setting. We consider simulated collections of i.i.d. tensors of size 3 × 3 × 4, Z := {Z_j}_{j∈{1,...,n}}. Let Z^{(3)}_{jk} denote the kth 3-mode face of Z_j. The components of every Z_j are then simulated independently from the following distributions, where the different distributions are denoted as in Section 5.1.
All of the mean kurtoses over the different tensor faces are nonzero. None of the theoretical mean kurtoses coincide for the 1-mode faces, whereas two of them coincide for the 2-mode faces and three of them for the 3-mode faces. Hence, the preferable (k_1, k_2, k_3)-TJADE here is 123-TJADE, k_1 = 1, k_2 = 2, k_3 = 3. The vectorized versions of the observations contain several normal components and thus the assumptions of the vectorial methods are not satisfied here.
The simulation was performed using 11 different sample sizes, n_i = 2^{i−1} · 1000, i ∈ {1, ..., 11}, with 2000 repetitions per sample size. We considered the same three mixing scenarios as in Section 5.1 and generated the mixing matrices in the same way, with the distinction that here we have three mixing matrices instead of two. We performed the unmixing using VFOBI, TFOBI, 1-VJADE, 2-VJADE, VJADE, TJADE and 11 different versions of k-TJADE. The resulting mean values of the transformed MD index, ℓ⁻¹ Σ_{j=1}^{ℓ} n_i(ρ − 1)D(Ĝ_j)², where ℓ = 2000 and ρ = 36, are presented in Figure 5. The orthogonal mixing is omitted from Figure 5, since the tensorial methods are orthogonally invariant and the vectorial methods are affine invariant; the performance curves under orthogonal mixing would be identical to those under identity mixing, as in Section 5.1.
k-TJADE performs as expected for the values of k_1, k_2, k_3 that satisfy the assumptions, and the convergence towards the theoretical limit is considerably fast. The vectorial methods fail completely in this example. Interestingly, k-TJADE still performs relatively well when the elementwise deviation between (k_1, k_2, k_3) and (1, 2, 3) is not too large; see Section 6 for further discussion.

Finite-sample efficiency, setting 3
In the third simulation setting, we have a collection of i.i.d. matrices Z := {Z_j}_{j∈{1,...,n}}, with components simulated independently from scaled gamma distributions; the 3 × 3 array of component distributions has entries of the form G(α) with shape parameters α ∈ {0.9, 0.9998, 1}, where G(α) denotes the gamma distribution with shape parameter α and rate parameter 1, scaled to have zero mean and unit variance. Hence, in this simulation setting, the row kurtosis means are very close to each other; they start to differ only in the fourth decimal. The column kurtosis means are also quite close to each other.
The simulation study was conducted exactly as in Section 5.1, except that the largest sample size of Section 5.1 was omitted here and more versions of k-TJADE were included. The results are displayed in Figure 6.
Theoretically, the preferable (k_1, k_2)-TJADE would be 11-TJADE, k_1 = k_2 = 1. However, with the current sample sizes, 11-TJADE, 12-TJADE and 13-TJADE do not seem to converge; see Figure 6. Similarly, 21-TJADE and 31-TJADE exhibit erratic behavior at small sample sizes, yet with sample sizes over one million these procedures come very close to the theoretical limiting value, given as the solid black line in Figure 6. It seems that with sample sizes of magnitude 10⁶, differences in the column kurtosis means of magnitude 10⁻¹ pose no problem for the procedures. Conversely, the current sample sizes are not large enough to compensate for the small differences in the row kurtosis means.
The curves corresponding to 22-TJADE, 23-TJADE, 32-TJADE and TJADE behave nicely in Figure 6, even at relatively small sample sizes. The good behavior of 22-TJADE is somewhat surprising, as the corresponding procedure assumes that at most two of the column kurtosis means and at most two of the row kurtosis means are equal. We wish to emphasize that the assumptions required for the k-TJADE procedures are only sufficient conditions; in some special cases, e.g., under particular distributional structures, some of the procedures may behave well even though the sufficient assumptions are not satisfied.
In Figure 6, the curves corresponding to VFOBI and TFOBI are again visible only in the top left corner. This is unsurprising, since the required assumptions are not satisfied for VFOBI, and TFOBI cannot handle the nearly equal kurtosis means, even though they are distinct on the population level. With the current sample sizes, 1-VJADE does not seem to converge to the theoretical limiting value, given as the dashed black line. The vectorial 2-VJADE, 3-VJADE and VJADE seem to function quite well in this setup. This could be explained by the fact that the vectorial methods have assumptions directly on the theoretical kurtoses of the components, whereas the tensorial methods have assumptions on the means of the kurtoses; it is easier to detect small differences between individual kurtoses than between their means.

Timing comparison
The results of Sections 5.1 and 5.2 illustrate that the performances of TJADE and suitably chosen versions of k-TJADE are very similar. Next, we quantify the significant improvement in computational speed that k-TJADE provides over TJADE. In the timing comparison, we consider a simulated collection of i.i.d. matrices of size 3 × q, Z := {Z_j}_{j∈{1,...,n}}, such that the components of every Z_j are simulated independently from the following distributions, where χ²_ν denotes the chi-squared distribution with ν degrees of freedom, and the width q of the matrix is the varying parameter of this simulation setting. We used the parameter values q = 5, 10, 15, 20, ..., 50 and the sample size n = 1000. We considered the same procedures as in Section 5.1: VFOBI, TFOBI, 1-VJADE, 2-VJADE, 3-VJADE, VJADE, 11-TJADE, 12-TJADE, 21-TJADE, 22-TJADE and TJADE, and recorded the mean running times over a total of 5 iterations. The time it took R to vectorize the tensors was counted as part of the vectorized procedures but turned out to be negligible. We used two alternative stopping criteria for the methods that involve joint diagonalization, that is, for all the methods except TFOBI and VFOBI: a single iteration was stopped either when the convergence tolerance of the Jacobi-rotation-based joint diagonalization fell below 10⁻⁶ or when the required tolerance was not reached within 100 iterations. The average running times in minutes, as a function of the dimension q, are presented in Figure 7; methods with almost identical computing times are drawn in the same colors. Figure 7 clearly illustrates the superior computation speed of k-TJADE compared with both TJADE and the vectorized counterparts. The timing comparison was conducted on Ubuntu 16.04.4 LTS with an Intel® Xeon® CPU E3-1230 v5 at 3.40 GHz and 64 GB of memory.
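The joint diagonalization step and the two stopping criteria described above can be sketched as follows. This is our own numpy implementation of the classical Jacobi-rotation scheme of Cardoso and Souloumiac, with the sine of the rotation angle serving as the convergence tolerance and the stated sweep cap; it is not the paper's R code.

```python
import numpy as np

def rjd(matrices, tol=1e-6, max_sweeps=100):
    """Jacobi-rotation joint diagonalization of real symmetric matrices.

    Returns an orthogonal V such that V.T @ A @ V is approximately
    diagonal for every A in `matrices`. Iteration stops when every
    rotation sine in a full sweep is below `tol`, or after `max_sweeps`
    sweeps, mirroring the two stopping criteria described in the text."""
    A = [np.array(M, dtype=float) for M in matrices]
    p = A[0].shape[0]
    V = np.eye(p)
    for _ in range(max_sweeps):
        rotated = False
        for i in range(p - 1):
            for j in range(i + 1, p):
                # Closed-form optimal Givens angle for the (i, j) pair,
                # pooled over all matrices (Cardoso & Souloumiac).
                g1 = np.array([M[i, i] - M[j, j] for M in A])
                g2 = np.array([M[i, j] + M[j, i] for M in A])
                ton = g1 @ g1 - g2 @ g2
                toff = 2.0 * (g1 @ g2)
                theta = 0.5 * np.arctan2(toff, ton + np.hypot(ton, toff))
                c, s = np.cos(theta), np.sin(theta)
                if abs(s) > tol:
                    rotated = True
                    J = np.eye(p)
                    J[i, i] = J[j, j] = c
                    J[j, i] = s
                    J[i, j] = -s
                    A = [J.T @ M @ J for M in A]
                    V = V @ J
        if not rotated:
            break
    return V
```

For exactly jointly diagonalizable inputs the sweeps converge quickly; in the noisy sample case the sweep cap guards against slow convergence.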

Video example
We applied k-TJADE to the WaterSurface surveillance video (Li et al., 2004), which is viewable at http://pages.cs.wisc.edu/~jiaxu/projects/gosus/supplement/ and available for download as an .Rdata file at https://wis.kuleuven.be/stat/robust/Programs/DO/do-video-data-rdata. The video has already been used as an example in blind source separation (Virta and Nordhausen, 2017a,b). Each frame of the video is of size h × w × 3 with height h = 128, width w = 160 and a three-variate color channel (RGB). The total video consists of 633 such frames, making our data a sample of size n = 633 of random third-order tensors in R^{h×w×3}. As the data constitute a single continuous surveillance video, the observations are naturally not independent and the assumptions of tensorial independent component analysis are not fully satisfied. However, ICA is known to be robust against deviations from the independence assumption, and applying it successfully to sets of dependent data is common practice. We thus expect k-TJADE to extract components of interest from the current data.
The video shows a beach scene with little to no motion until frame 480, when a man enters the scene from the left, staying in the picture for the remainder of the video. Figure 8 shows frames 50 and 550 of the video, illustrating moments before and after the man enters the scene. Our objective with the surveillance video is to find low-dimensional components that allow us to pinpoint the most obvious change point in the video, namely, the man's entrance. As change points are most likely captured in the data as outliers of some form, it is natural to seek the component of interest among those with high kurtosis values.
We proceed as follows: we run k-TJADE on the data with different choices of the tuning parameters, find the component with the highest absolute kurtosis, and plot its time course to visually assess whether it captures the change point. The component found with (1, 1, 0)-TJADE is shown in Figure 9, where k_3 = 0 means that we do not unmix the supposedly uninformative color dimension at all. The time series is instantly seen to capture the desired time point, as the spike coincides with the first frames the man spends in the scene. The running time of the method was 39 minutes on a Windows server with an Intel® Xeon® CPU E5-2440 at 2.40 GHz and 64 GB of memory. Applying (2, 2, 0)-TJADE gave almost identical results with an increased running time of one hour and 54 minutes. The original TJADE, however, proved to be very slow: its running time was over five days. In conclusion, the example shows that k-TJADE can reliably extract information from data to which TJADE cannot be applied due to its extremely high computational cost. In several real-world applications, e.g., in crime scene investigation, waiting for days is not an option.
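The component-selection step, picking the estimated component whose time course has the largest absolute kurtosis, can be sketched as follows (a hedged numpy sketch; the function names are ours):

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis m4 / m2^2 - 3 of a one-dimensional series."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3.0

def most_kurtotic(components):
    """Index of the component (row) with the largest absolute excess
    kurtosis; spiky change-point components tend to score high."""
    return int(np.argmax([abs(excess_kurtosis(c)) for c in components]))
```

A component that is flat except for a short spike, like the one in Figure 9, has very large excess kurtosis and is therefore picked out by this criterion.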

Estimation of the maximal kurtosis multiplicity ν
In the previous video example, applying (1, 1, 0)-TJADE already gave satisfactory results in the sense that it allowed us to find a latent component identifying the point of interest in time. However, in some cases, using the preferable value k = ν for the tuning parameter might be justified: this guarantees the successful estimation of all components, while keeping the computation time and the amount of noise in the estimation minimal. The results of Theorem 3 could possibly be used to formulate an asymptotic hypothesis test for the null hypothesis that ν ≤ ν_0 for some given ν_0 in a given mode. Namely, if the null hypothesis holds, then the limiting distributions of the unmixing estimates Γ̂_{ν_0}, Γ̂_{ν_0+1}, ..., Γ̂_p are identical, allowing us to pinpoint the true value of ν. In the following, we pursue this idea from an empirical perspective and develop a heuristic procedure for the estimation of ν.
Consider a fixed mode of size p of a sample X_1, ..., X_n from the tensorial IC model in Eq. (5), and let Γ̂_1, Γ̂_2, ..., Γ̂_p be the unmixing matrices estimated from the data by k-TJADE with the values k ∈ {1, ..., p}, respectively. Letting ν be the true maximal kurtosis multiplicity in the data, we thus expect Γ̂_ν to separate the data well but Γ̂_{ν−1} to leave some parts still unmixed (within the multiplicities). Accordingly, we expect m_{ν−1} := D(Γ̂_{ν−1}(Γ̂_ν)⁻¹), where D(·) is the MD index in Eq. (7), to be large, as the two matrices should have structures differing beyond row permutations, scaling and sign changes (in the extreme case where Γ̂_{ν−1} = Γ̂_ν we have m_{ν−1} = 0). On the other hand, we expect all the following sequential MD indices, m_{ν+ℓ}, ℓ ∈ {0, ..., p − ν − 1}, to be small, as all values of k ∈ {ν, ..., p} are sufficient for the separation and yield asymptotically equal solutions.
The true value of ν can now be located by plotting m_k versus k ∈ {1, ..., p − 1} and finding the value from which onwards the curve stays roughly constant. However, since using a single sequential MD index per value of k might leave us with a curve that is too rough to accurately distinguish the true ν, we compute, for each k ∈ {1, ..., p − 1}, the average of the forward sequential MD indices, m*_k := (p − k)⁻¹ Σ_{ℓ=k}^{p−1} m_ℓ. The plot of m*_k versus k can be used to find the true value of ν similarly to the plot of m_k versus k, and has the interpretation of quantifying the efficiency loss one incurs when using a particular value of k relative to the choice k = p. Note also that the choice of k can be made individually for each mode, in the sense that the value of m*_k in (k, k_0)-TJADE is invariant to the choice of k_0, and similarly for higher-order data.
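The averaging step can be sketched as follows, assuming (as the description of "the average of the forward sequential MD indices" suggests) that m*_k is the mean of the forward sequence m_k, ..., m_{p−1}:

```python
def forward_averages(m):
    """Forward averages of the sequential MD indices m = [m_1, ..., m_{p-1}]:
    returns [m*_1, ..., m*_{p-1}] with m*_k the mean of m_k, ..., m_{p-1}
    (the assumed form of the forward averaging)."""
    out, total = [], 0.0
    # accumulate from the tail so each forward mean is computed in O(1)
    for i, v in enumerate(reversed(m), start=1):
        total += v
        out.append(total / i)
    return out[::-1]
```

The resulting curve is smoother than the raw m_k sequence, which makes the elbow at the true ν easier to spot.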
To illustrate the proposed tool, we consider the hand-written digit data set from the R package ElemStatLearn (Halvorsen, 2015). The data set consists of 7291 images of hand-written numerals 0-9 digitized to 16 × 16 matrices, with each element in [−1, 1] describing the grayscale intensity of the corresponding pixel. The underlying objective is to build a classifier that correctly identifies the digit that each matrix represents. For simplicity, we work on a subset of 400 randomly chosen images of the digits 1 and 7; these two digits were chosen because their visual similarity makes the classification task more difficult. In this kind of context, ICA is commonly used as a preprocessing step to reduce the data into a low-dimensional subspace, which hopefully contains all the classification information, simplifying the task of the subsequent classifier. In this spirit, we apply the proposed estimation strategy for ν to the sample and obtain the two plots shown in Figure 10. The plots should be interpreted like the scree plot in PCA, where the aim is to find an "elbow" at which the slope changes quickly. The two curves suggest that a suitable parameter choice for k-TJADE is roughly (4, 3). In Figure 11 we plot the components with indices (16, 16) and (16, 15) of the (4, 3)-TJADE solution. These components were chosen because k-TJADE orders the rows/columns in descending order according to their mean kurtoses, and low kurtosis is often a sign of bimodality, making it natural to search for components with classifying ability in the lower-right corner of Z.
Indeed, the chosen components reveal a clear separation of the digits. All ones are concentrated roughly in one spherical cluster, with a single outlying seven inside. The bulk of the sevens lies around the cluster of ones in a curved manner, and the location of an individual seven is determined by the angle of its stem. The analysis could be continued from here by fitting a classifier, for example a support vector machine, to the obtained components.
We conclude with two remarks. First, experimenting (not shown here) reveals that also, e.g., (2, 2)-TJADE finds a similar separation of the two groups as shown in Figure 11. There are two explanations for this. On the one hand, locating the elbows in the scree plots of Figure 10 is visually difficult for a sample size this low, so (2, 2) could be the actual optimal choice. On the other hand, whatever the true value of ν is, using a k larger than it guarantees that all independent components are estimated (asymptotically) correctly. It could also be that the group separation is such a strong feature of the data that it can be found even with a sub-optimal choice of k, and that using the optimal k merely guarantees that the other (possibly less interesting) components are recovered successfully as well.
Second, the suggested procedure is somewhat costly, as it requires computing the k-TJADE solution for a wide range of values of k. After having computed the solutions, one option would be to simply continue the analysis with the solution for the largest value of k and ignore the estimation of ν altogether. However, at very low sample sizes it can happen that using a too large k, although asymptotically sufficient for the estimation of all components, induces noise in the estimation. Still, the suggested procedure could prove useful in studies where a pilot or training data set is used to determine the optimal value of k, which is then used for the later data sets, saving computation time for all data sets beyond the first. Finally, one option to save computation time would be to compute the values of m*_k sequentially, increasing k one by one and stopping when a significant drop in the plot is visible.

Discussion
We proposed a sped-up version of TJADE, the method with the lowest limiting variance among the currently studied methods of tensorial independent component analysis. Under easily interpretable additional assumptions, the extension, k-TJADE, achieves the same limiting variance as TJADE while exhibiting significantly lower computational cost. A large part of this efficiency is preserved also for samples of finite size.
An interesting future research question is to derive the theoretical behavior of k-TJADE when Assumption 2(v) is violated. Based on our simulations, k-TJADE and k-JADE can still work even when the value of k is chosen too small or too large; this can be seen as a safety net for the users of k-TJADE. The simulations suggest that the performance deteriorates the further down one goes from the optimal k.

A Proofs
We denote the sequence of i.i.d. observations by X_N = {X_1, ..., X_n}, so that, e.g., Σ̂_1[X_N] denotes the left sample covariance matrix estimated from the sample X_N.
Before the proofs of the main results, we establish three auxiliary lemmas.
Proof of Lemma 2. Both results follow by the same arguments and we prove only the former. Since e_i'XX'e_j is a scalar and Σ_1[HX] = HΣ_1[X]H', the left-hand side of Eq. (8) can be written as Eq. (9), whose first term can be written as Eq. (10), where the 1/q-scaled expected value is the first term in the definition of the matrix C_ab[X] and has the representation in Eq. (11). Plugging Eq. (11) into Eq. (10) we obtain Eq. (12). Furthermore, plugging Eq. (12) into Eq. (9), element-wise examination then reveals the claim. This concludes the proof.
Lemma 3. Let Ŝ ∈ R^{p×p} be a sequence of estimators (indexed by n) such that √n(Ŝ − Λ) ⇝ D, where ⇝ denotes convergence in distribution, D is some matrix-valued distribution, Λ is block-diagonal with diagonal values λ_k, and the values λ_k, k ∈ {1, ..., R}, are distinct and in a strictly decreasing order (λ_1 > ... > λ_R). Let Û be the sequence of eigenvector matrices whose columns are the eigenvectors of the matrices Ŝ.
Partition Û in a similar way to Λ. Then the stated limiting result holds. Proof of Lemma 3. The sequence of eigenvectors satisfies the eigenequation Ŝ Û = Û Λ̂, where Λ̂ is a sequence of diagonal matrices containing the estimated eigenvalues. The second equality in the resulting expansion holds since Û = O_p(1), as a result of the compactness of the space of orthogonal matrices, and since, by Prohorov's theorem, our assumptions imply √n(Ŝ − Λ) = O_p(1). It follows from (Eaton and Tyler, 1991, Theorem 3.2) that √n(Λ̂ − Λ) = O_p(1), and consequently we obtain Eq. (16). By combining Eqs. (15) and (16) with Eq. (14), we arrive at the desired expansion; written block-wise, it holds for all (i, j) ∈ {1, ..., R} × {1, ..., R}.
The result now follows by considering only the off-diagonal blocks, i ≠ j, and dividing by the non-zero (λ_i − λ_j).
Remark 1. In the sequel, the assumption of Lemma 3 that √n(Ŝ − Λ) ⇝ D is fulfilled by applying the central limit theorem.
Proof of Theorem 1. Let Assumptions 1 and 2(v) hold for some fixed v. Consider now the k-TJADE functional Γ_k, where k ≥ v.
By (Virta et al., 2017, Theorem 1) the standardized matrix satisfies X_st = τ U_1 Z U_2' for some τ > 0 and some orthogonal matrices U_1 ∈ R^{p×p}, U_2 ∈ R^{q×q}. By (Virta et al., 2017, Eq. (5)) the TFOBI matrix functional B[X_st] has the stated form; we denote the corresponding multiplicities of its distinct eigenvalues by k_1, ..., k_R, respectively. Note that each λ_r can be given as τ⁴(κ_k + p + q + 1) for some k. As k ≥ v, where v is the largest multiplicity among the row mean kurtoses κ_1, ..., κ_p, we have k_r ≤ k for all r.
The set of eigenvectors of B[X_st] is identifiable up to orthogonal transformations within each eigenspace. That is, ignoring the order and signs, the eigenvector matrix W_1 of B[X_st] has the block-diagonal form of Eq. (17), where each H_1r ∈ R^{k_r×k_r}, r ∈ {1, ..., R}, is orthogonal. The FOBI-rotated data are then obtained accordingly. The proof of Theorem 1 is divided into two parts: we first prove that condition (i) in Definition 1 holds, and then that condition (ii) holds.

Condition (i):
Condition (i) claims that k-TJADE can, under Assumptions 1 and 2(v), estimate the block-diagonal orthogonal matrix H_1 up to the signs and the order of its columns. For convenience, we drop all subscripts referring to the side (left or right) of the model; e.g., in the following, H_1r is H_r and H_1 is H.
Adapting the proof of (Virta et al., 2018, Theorem 1), we have that g(H, X_F) = 0, where D_ij = Σ_{k=1}^p τ⁴ h_ik h_jk κ_k E_kk is a diagonal matrix for every i, j ∈ {1, ..., p}. Since g(H, X_F) = 0, the matrix H is a solution to the k-TJADE minimization problem. Next, we show that H is a unique minimizer, up to the signs and the order of its columns. Let K_r = Σ_{m=1}^r k_m, with K_0 = 0, and let I_r = {K_{r−1} + 1, K_{r−1} + 2, ..., K_r} be the subset of the index set {1, ..., p} for which the corresponding columns of H belong to the rth eigenvalue block. By (Bonhomme and Robin, 2009, Lemma 2), the minimizer is unique up to the signs and order of its columns if, for each pair of distinct columns h_s, h_t of H, there exists a pair (i, j) ∈ {(i, j) : |i − j| < k} such that the eigenvalues of C_ij[X_F] corresponding to h_s and h_t are distinct. By the decomposition in Eq. (18), this is equivalent to requiring that Eq. (19) holds. We next show that Eq. (19) holds for all s ≠ t by considering separately the two cases where s and t belong to two different sets, or to the same set, of the partition I_1, ..., I_R. First, assume that s and t belong to different sets of the partition. We proceed by contraposition and assume that Eq. (19) does not hold, that is, for all (i, j) with |i − j| < k, the eigenvalue pairs are equal. In particular, τ⁴ h²_is κ_s = τ⁴ h²_it κ_t for all i ∈ {1, ..., p}. Summing over the p equations, we obtain, from the orthogonality of H, that τ⁴ κ_s = τ⁴ κ_t, where τ⁴ > 0. This implies κ_s = κ_t, which is a contradiction, since s and t belong to different sets of the partition and thus have different eigenvalues. Consequently, Eq. (19) holds for any pair of columns belonging to distinct sets of the index partition.
Assume then that s and t belong to the same set I_q of the partition. We again proceed by contraposition and assume that Eq. (19) does not hold. Now κ_s = κ_t, and we have τ⁴ h_is h_js κ_s = τ⁴ h_it h_jt κ_s for all (i, j) with |i − j| < k; in particular, this holds for all (i, j) in the subset given in Eq. (20). By Assumption 2(v), k_q ≤ k, and consequently the distance |i − j| is always less than k in the set {(i, j) : i ∈ I_q ∧ j ∈ I_q}. Hence, the intersection in Eq. (20) equals {(i, j) : i ∈ I_q ∧ j ∈ I_q}. We now multiply each equality τ⁴ h_is h_js κ_s = τ⁴ h_it h_jt κ_s, with indices (i, j) in the set {(i, j) : i ∈ I_q ∧ j ∈ I_q}, by h_is h_it. Summing twice over I_q yields Eq. (21). The constant τ⁴ > 0, and κ_s ≠ 0, as the contrary would imply that two of the kurtosis values equal zero, κ_s = κ_t = 0, contradicting Assumption 1. Thus we can divide both sides of Eq. (21) by τ⁴ κ_s and obtain Eq. (22). As H has the block-diagonal structure given in Eq. (17), the sums in Eq. (22) are dot products between the columns of the qth orthogonal block of H; the left-hand side of Eq. (22) equals one and the right-hand side equals zero, which is a contradiction. Consequently, Eq. (19) holds also for any pair of columns belonging to the same set of the partition. This concludes the proof of condition (i).

Condition (ii):
To see that condition (ii) holds, recall from Virta et al. (2017) that the TFOBI functional Γ_F is orthogonally equivariant and that the TFOBI transformation is orthogonally invariant. Thus, for all X ∈ R^{p×q} and all orthogonal U_1 ∈ R^{p×p}, U_2 ∈ R^{q×q}, the required equivariance identity holds. This concludes the proof of condition (ii).
Proof of Theorem 2. Recall that X_N = {X_1, ..., X_n} is an i.i.d. sequence from the tensor IC model and let X̄ be the sample mean of X_N. The TFOBI-transformed observations are then obtained using the orthogonal matrices Ĥ and R̂, the left and right TFOBI rotations, and (Σ̂_j[X_N])^{−1/2}, j ∈ {1, 2}, the symmetric inverse square roots of the left and right sample covariance matrices. Note that the cumulant matrices Ĉ_ij depend on the observations X_F_i only through a product that does not involve the right rotation, allowing us to omit R̂ in the following.
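The symmetric inverse square roots used in the standardization can be computed via the eigendecomposition of the covariance matrix; a minimal numpy sketch (assuming a positive-definite input; the function name is ours):

```python
import numpy as np

def sym_inv_sqrt(S):
    """Symmetric inverse square root S^{-1/2} of a symmetric
    positive-definite matrix via its eigendecomposition; such matrices
    whiten the data before the rotation steps."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T
```

By construction the result R satisfies R S R = I and is itself symmetric, which is the defining property of the symmetric (inverse) square root.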
What makes proving the limiting results challenging is that the sequence of matrices Ĥ has no general limiting properties. If any kurtosis value has multiplicity larger than one, the corresponding eigenvectors are not uniquely defined and, consequently, the TFOBI solution for them does not converge. However, we can still say two things. First, by the compactness of the set of orthogonal matrices, Ĥ = O_p(1). Second, by Lemma 3 and the central limit theorem, the elements of any column of Ĥ with indices not belonging to the corresponding diagonal multiplicity block converge to 0 at the root-n rate; that is, √n ĥ_kl = O_p(1) for k, l satisfying the aforementioned conditions. These two facts, in conjunction with Assumption 2, are sufficient for proving the limiting results.
As our first task, we show that the sample and the population objective functions of k-TJADE are asymptotically equivalent to those of TJADE. The left k-TJADE estimate is V̂Ĥ(Σ̂_1[X_N])^{−1/2}, where the orthogonal V̂ = (v̂_1, ..., v̂_p) is, by Lemma 1, the sequence of maximizers of the sequence of objective functions ĝ. Consistency of the estimator V̂Ĥ(Σ̂_1[X_N])^{−1/2} is equivalent to the claim that there exists a sequence of estimators V̂Ĥ(Σ̂_1[X_N])^{−1/2} converging in probability to I_p. It is sufficient to show that V̂Ĥ →_P I_p, since the weak law of large numbers and the continuous mapping theorem directly imply that (Σ̂_1[X_N])^{−1/2} →_P I_p. Since V̂ maximizes the objective function ĝ and since Ĥ is orthogonal, we have ĝ(V̂, ĤY_N) ≥ ĝ(V, ĤY_N) for all orthogonal V. The range of V ↦ Ĥ'V is the set of all p × p orthogonal matrices, and consequently Ĥ'V̂ = Ŵ is the sequence of maximizers of the argument-modified sequence of objective functions ĝ(ĤW, ĤY_N). The original sequence of maximizers can be written as V̂ = ĤŴ, and the claim takes the form Ŵ →_P I_p. Applying Lemma 2, this new sequence of objective functions ĝ(ĤW, ĤY_N) can be reformulated via a chain of equalities in which the third equality follows from the orthogonality of Ĥ. Next, we establish two properties of the remainder term m: (1°) if the argument W is replaced with any sequence of orthogonal matrices bounded in probability (the boundedness is an immediate consequence of the orthogonality), then n · m is bounded in probability; (2°) m converges uniformly in probability to zero.
We first prove property (1°), which will also be used later in the proof of Theorem 3. We bound the terms in the sum m̃ individually. By the Cauchy–Schwarz inequality, the operator norm inequality, the equivalence of all finite-dimensional norms and the monotonicity of the supremum, we obtain an upper bound in which the dependency of the expression on the variable W is removed by the unit lengths of its columns and where c_0 is a constant resulting from switching from the operator norm to the Frobenius norm. Using the same arguments that were used to prove property (1°) above, Eq. (24) can be seen to hold, where f(W, Y_N) and f(W, Y) are the sample and population objective functions of TJADE (Virta et al., 2018).
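The norm manipulations referred to above rest on the following standard chain of inequalities, stated here for a generic matrix (this is a sketch of the tools, not the paper's exact bound): for a unit vector w and any C ∈ R^{p×p},

```latex
|w^{\top} C w| \;\le\; \|C\|_{\mathrm{op}} \;\le\; \|C\|_{F} \;\le\; \sqrt{p}\,\|C\|_{\mathrm{op}},
```

where the first inequality follows from the Cauchy–Schwarz inequality, the second and third from the equivalence of the operator and Frobenius norms in finite dimensions; the constant √p plays the role of c_0.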
Having established this, we next show that the difference between the population and the sample k-TJADE objective functions converges uniformly in probability to zero, allowing us to use the M-estimator convergence argument, see e.g. (Van der Vaart, 1998, Theorem 5.7), to prove the consistency of the estimator. By Eq. (25), the uniform convergence of the k-TJADE objective functions is a direct consequence of the uniform convergence of the TJADE objective functions. The latter result is given implicitly in Virta et al. (2018), but for completeness, we present the proof here. Since the Frobenius norm of the diagonal of a matrix is always bounded by the Frobenius norm of the entire matrix and since W is orthogonal, the required inequalities hold. By the reverse triangle inequality, we have for any A, B ∈ R^{p×p} that |‖A‖ − ‖B‖| ≤ ‖A − B‖. Now, by the monotonicity of the supremum, it follows from Eqs. (26) and (27) that, under Assumption 1 and Assumption 2(v), the same uniform convergence holds for the sequence of the sample k-TJADE objective functions.
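The M-estimator convergence mechanism used next (uniform convergence of the criterion together with a well-separated maximum forces the sample maximizer to be consistent; Van der Vaart, 1998, Theorem 5.7) can be illustrated on a one-dimensional toy problem. The example is unrelated to TJADE and assumes only numpy; the criterion M_n(θ) = −mean((x_i − θ)²) is a hypothetical stand-in for the objective functions above.

```python
import numpy as np

rng = np.random.default_rng(42)
theta0 = 1.5
grid = np.linspace(-2.0, 5.0, 701)

# Population criterion M(theta) = -E[(X - theta)^2] = -(theta - theta0)^2 - 1,
# uniquely (and well-separatedly) maximized at theta0.
M_pop = -(grid - theta0) ** 2 - 1.0

for n in (100, 10_000):
    x = rng.normal(theta0, 1.0, size=n)
    # Sample criterion M_n(theta) = -mean((x_i - theta)^2), expanded for speed.
    M_n = -(np.mean(x ** 2) - 2.0 * grid * x.mean() + grid ** 2)
    sup_gap = np.abs(M_n - M_pop).max()   # uniform (sup-norm) distance
    theta_hat = grid[M_n.argmax()]        # the M-estimator on the grid
    print(n, sup_gap, abs(theta_hat - theta0))
```

As n grows, the sup-norm gap shrinks and the maximizer of the sample criterion is dragged toward the population maximizer, which is exactly the structure of the argument for Ŵ →_P I_p.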
We close the proof with the M-estimator convergence argument. In the following we use the notation h(W, Y) := g(HW, HY) and ĥ(W, Y_N) := ĝ(ĤW, ĤY_N). As all our maximizers are unique only up to the signs and the order of their columns, to obtain a sequence such that Ŵ →_P I_p, we restrict ourselves to a subset U_0 of U in which the signs and the order are fixed and W = I_p is the unique maximizer of h(W, Y). A corresponding set U_0 can be constructed as follows,

U_0 = {W ∈ R^{p×p} : (W ∈ U) ∧ (w_i' 1_p ≥ 0, ∀i) ∧ (w_1' D_11 w_1 ≥ · · · ≥ w_p' D_pp w_p)},

where D_ii = (p − 1)E_ii. Moreover, as the conditions defining U_0 are continuous in W, there exists ε > 0 such that an ε-neighborhood around I_p fits completely within U_0, and we work with the complement of such a ball in U_0. Since there is only a finite number of different combinations of the order and the signs of the columns of Ŵ, we can, for every n, consider the equivalent maximizer lying in U_0.

Expanding Ŵ'(Ĉ_aa − κ_a E_aa)Ŵ + κ_a Ŵ'E_aa Ŵ, we obtain the expression for the matrix Â_2 in Eq. (34), where we have used the identities (I_{p²} − K) Σ_a κ_a² [E_aa ⊗ E_aa] = 0 and K[E_aa ⊗ I_p]K = [I_p ⊗ E_aa], and the notation D = Σ_a κ_a² E_aa. Taking the sum over the two expanded estimating equations in Eq. (34) yields Eq. (35). To see that A is invertible, we compute its determinant and verify that it is non-zero under Assumption 1. Recall that Assumption 1 says that there is at most one zero diagonal element in D.
Observe first that in the case p = 2 the matrix A takes a block-diagonal form in which the 2 × 2 diagonal block has determinant equal to −2(κ_1² + κ_2²). As the determinant of a block-diagonal matrix is the product of the determinants of its blocks, we have det(A) = −8(κ_1² + κ_2²), verifying our claim in the case p = 2.
For general p, the matrix A can, after a suitable permutation of its rows and columns that does not affect the value of the determinant, be seen to consist of similar blocks as in the case p = 2. For each diagonal element of √n(Ŵ − I_p) (the elements a + (a − 1)p, a ∈ {1, ..., p}, of √n vec(Ŵ − I_p)), we get a 1 × 1 diagonal block equal to 2. For each (a, b)th off-diagonal element in the lower triangle, we get a 2 × 2 diagonal block with determinant equal to −2(κ_a² + κ_b²). As the number of diagonal elements is p and the number of off-diagonal elements in the lower triangle is p(p − 1)/2, the total determinant is

det(A) = 2^p (−2)^{p(p−1)/2} ∏_{a>b} (κ_a² + κ_b²),

which is non-zero if and only if Assumption 1 holds. Multiplying now both sides of Eq. (35) by (Â_1 + Â_2)^{−1} (which is asymptotically well-defined as det(A) ≠ 0) and invoking Slutsky's theorem, we obtain

√n vec(Ŵ − I_p) = (Â_1 + Â_2)^{−1} √n vec(F̂_1 − F_1) + o_p(1),
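The closed-form determinant can be checked numerically. The sketch below assumes numpy; the entries of the 2 × 2 blocks are placeholders chosen only so that each block has the stated determinant −2(κ_a² + κ_b²) (they are not the actual entries of A), which is all that the determinant computation uses.

```python
import numpy as np
from itertools import combinations

def block_diag(blocks):
    """Assemble a block-diagonal matrix from a list of square blocks."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    i = 0
    for b in blocks:
        k = b.shape[0]
        out[i:i + k, i:i + k] = b
        i += k
    return out

p = 4
kappa = np.array([1.3, 0.7, 0.2, 0.0])  # at most one zero value (Assumption 1)

# One 1x1 block equal to 2 per diagonal element, and one 2x2 block with
# determinant -2*(kappa_a^2 + kappa_b^2) per off-diagonal pair {a, b}.
blocks = [np.array([[2.0]]) for _ in range(p)]
for a, b in combinations(range(p), 2):
    c = kappa[a] ** 2 + kappa[b] ** 2
    blocks.append(np.array([[0.0, c], [2.0, 0.0]]))  # placeholder block, det = -2c

A = block_diag(blocks)
det_numeric = np.linalg.det(A)
det_formula = 2.0 ** p * (-2.0) ** (p * (p - 1) // 2) * np.prod(
    [kappa[a] ** 2 + kappa[b] ** 2 for a, b in combinations(range(p), 2)]
)
print(det_numeric, det_formula)  # agree; non-zero when at most one kappa_a is 0
```

With two or more zero kurtosis values, some pair would give κ_a² + κ_b² = 0 and the determinant would vanish, which is exactly the boundary of Assumption 1.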

Figure 3: Means of the transformed MD indices. The solid black line is the limiting value of TJADE (under orthogonal mixing) and the dashed black line is the limiting value of VJADE. The solid line is missing from the third panel as the limiting value of TJADE is not known under normal mixing.

Figure 4: Means of the transformed MD indices. The solid black line is the limiting value of TJADE (under orthogonal mixing) and the dashed black line is the limiting value of VJADE. The solid line is missing from the third panel as the limiting value of TJADE is not known under normal mixing.

Figure 5: Means of the transformed MD indices. The solid black line is the limiting value of TJADE (under orthogonal mixing).

Figure 6: Means of the transformed MD indices. The solid black line is the limiting value of TJADE (under orthogonal mixing) and the dashed black line is the limiting value of VJADE.

Figure 7: The logarithms of the mean computation times (in minutes) as a function of the dimension q.

Figure 8: Frames 50 and 550 of the surveillance video data.

Figure 9: The source with the highest absolute kurtosis found from the video data using (1, 1, 0)-TJADE. The plot caption refers to the indices of the source in the extracted matrix Z.

Figure 10: The plots of the sequential MD index means m*_k versus the tuning parameter value k. The left plot is for the rows and the right plot for the columns of the digit data.

Figure 11: The scatter plot of two particular low-kurtosis independent components found by (4, 3)-TJADE from the digit data.
make the first sum run over the whole range of i and j. Let

m̃ = Σ_{|i−j|≥k} Σ_{a,b} Σ_{a′,b′} Σ_l ĥ_ia ĥ_jb ĥ_ia′ ĥ_jb′ (w̃_l' Ĉ_ab[Y_N] w̃_l)(w̃_l' Ĉ_a′b′[Y_N] w̃_l),

where W̃ = (w̃_1, ..., w̃_p) is some sequence of orthogonal matrices bounded in probability and all other terms are as in the definition of m in Eq. (23). We divide the terms of m̃ into different cases based on the indices.

2. a = b, a′ ≠ b′: Virta et al. (2018) provides us the result Ĉ_a′b′[Y_N] →_P C_a′b′ for some constant matrix C_a′b′. Both ĥ_ia and ĥ_ja (with |i − j| ≥ k) cannot belong to a diagonal block of Ĥ, as that would compromise Assumption 2(v). Thus, by Lemma 3, at least one of ĥ_ia and ĥ_ja must be O_p(1/√n). Consequently, any summand in m̃ with the indices a = b and a′ ≠ b′ is O_p(1/n).

3. a ≠ b, a′ = b′: Since m̃ is symmetric with respect to (a, b) and (a′, b′), it follows from the same arguments as in case 2 that any summand in m̃ with the indices a ≠ b and a′ = b′ is O_p(1/n).

4. a ≠ b, a′ ≠ b′: By the same arguments as above, Ĉ_ab[Y_N] and Ĉ_a′b′[Y_N] both converge in probability to some constant matrices and the pairs (ĥ_ia, ĥ_jb) and (ĥ_ia′, ĥ_jb′) each provide one term that is O_p(1/√n). Thus any summand in m̃ with the indices a ≠ b and a′ ≠ b′ is O_p(1/n).

Since the sum defining m̃ is a finite sum of O_p(1/n) terms, it holds that m̃ = O_p(1/n). Note that by choosing W̃ = Ŵ we have m̃ = m, and the objective function satisfies ĝ(ĤŴ, ĤY_N) = f(Ŵ, Y_N) − m.

We next prove property (2°). By the triangle inequality and the monotonicity of the supremum, it is sufficient to bound

sup_{W∈U} |ĥ_ia ĥ_jb ĥ_ia′ ĥ_jb′ (w_l' Ĉ_ab[Y_N] w_l)(w_l' Ĉ_a′b′[Y_N] w_l)|,

where U = {V ∈ R^{p×p} : V'V = I_p}, and to show that these suprema always converge in probability to zero, regardless of the values of the indices. Thus ĝ(ĤW, ĤY_N) = f(W, Y_N) − m with sup_W |m| = o_p(1). Now, Lemma 2, along with exactly analogous calculations as above (with exact zeros taking the roles of the o_p(1) terms), can be used to show that under Assumption 2(v) the population version of the argument-modified k-TJADE objective function,

g(HW, HY) = Σ_{|i−j|<k} Σ_{l=1}^p (w_l' H' C_ij[HY] H w_l)²,

where H is the population TFOBI rotation, equals the augmented objective function not depending on H, that is,

f(W, Y) = Σ_{i,j} Σ_{l=1}^p (w_l' C_ij[Y] w_l)² = g(HW, HY).

Consequently, ĝ(ĤW, ĤY_N) − g(HW, HY) = f(W, Y_N) − f(W, Y) − m, and we have

sup_{W∈U} |ĝ(ĤW, ĤY_N) − g(HW, HY)| = sup_{W∈U} |f(W, Y_N) − f(W, Y) − m| ≤ sup_{W∈U} |f(W, Y_N) − f(W, Y)| + o_p(1),  (26)

which converges in probability to zero, since Ĉ_ij[Y_N] →_P C_ij[Y]. Thus, the sequence of the sample TJADE objective functions f(W, Y_N) converges uniformly in probability, with respect to the set of orthogonal matrices, to the theoretical TJADE objective function f(W, Y). It now follows from Eqs. (26) and (27) that the corresponding uniform convergence holds also for the sequence of the sample k-TJADE objective functions.