Hierarchical nuclear norm penalization for multi-view data

The prevalence of data collected on the same set of samples from multiple sources (i.e., multi-view data) has prompted significant development of data integration methods based on low-rank matrix factorizations. These methods decompose signal matrices from each view into the sum of shared and individual structures, which are further used for dimension reduction, exploratory analyses, and quantifying associations across views. However, existing methods have limitations in modeling partially-shared structures due to either too restrictive models, or restrictive identifiability conditions. To address these challenges, we formulate a new model for partially-shared signals based on grouping the views into so-called hierarchical levels. The proposed hierarchy leads us to introduce a new penalty, hierarchical nuclear norm (HNN), for signal estimation. In contrast to existing methods, HNN penalization avoids scores and loadings factorization of the signals and leads to a convex optimization problem, which we solve using a dual forward-backward algorithm. We propose a simple refitting procedure to adjust the penalization bias and develop an adapted version of bi-cross-validation for selecting tuning parameters. Extensive simulation studies and analysis of the genotype-tissue expression data demonstrate the advantages of our method over existing alternatives.


Introduction
Technological advances in biomedical fields have led to the prevalence of multi-source data collected from the same set of samples, often called multi-view data. Our motivating example is the Genotype-Tissue Expression (GTEx) project (The GTEx Consortium, 2020), which collects gene expression data on the same individuals across multiple tissues (views). Gene expression is often tissue-specific. For instance, the p53 gene is critical for cell regulation and plays a prominent role in cancer development (Tanikawa et al., 2017); however, the tissue-specific expression of p53 makes it difficult to develop targeted therapies (Vlatkovic et al., 2011). It is thus crucial to identify patterns in gene expression that are shared and unique across multiple tissues. Traditional methods for signal extraction tend to be applied either separately to each view or to all views combined, thus failing to separate the shared and unique parts of the signal.
Specifically, let X_d ∈ R^{n×p_d} be the observed data matrix for the d-th view with n matched samples and p_d measurements for d = 1, . . ., D. We consider an additive error model

X_d = M_d + E_d, d = 1, . . ., D,    (1)

where M_d is the signal matrix and E_d is the noise matrix. Successful extraction of the signals in model (1) requires additional structural assumptions. In the single-view case, M_d is commonly assumed to be low-rank (Srebro and Jaakkola, 2003; Candès and Recht, 2009; Candès et al., 2013).
A direct application of this approach to multi-view data leads to a low-rank assumption on either (i) each M_d or (ii) the concatenated signal [M_1 . . . M_D]. The main limitation of (i) is that it not only ignores any structural relationships between the signals due to the matched samples, but also estimates the signals separately without borrowing strength across views. The main limitation of (ii) is that it assumes a joint structure across all M_d, resulting in all estimated M_d having the same rank; therefore, it does not take into account possible heterogeneity across the views. In the multi-view context, it is of interest to explore structural assumptions on M_d that allow both joint (thus accounting for matched samples) and individual (thus permitting differences across views) parts of the signal matrices.
The simultaneous exploration of the joint and individual signals has been a prominent line of research for multi-view data (Lock et al., 2013; Zhou et al., 2016; Yang and Michailidis, 2016; Feng et al., 2018; Gaynanova and Li, 2019). The joint structure is a pattern shared across all views and is defined as the intersection of the column spaces of M_d by JIVE (Lock et al., 2013), COBE (Zhou et al., 2016) and AJIVE (Feng et al., 2018), while the individual structure is unique to a particular view after adjusting for the joint structure; see Example A1 in Appendix A.1 for an illustration. Despite these advances, many existing approaches (Lock et al., 2013; Zhou et al., 2016; Yang and Michailidis, 2016; Feng et al., 2018) lack an explicit definition of partially-shared structures when D > 2, e.g., a signal structure shared by the muscle and blood tissues, but not the skin tissue, in the motivating GTEx example. This implies that the corresponding methods cannot take advantage of partially-shared structures for signal estimation, leading to rank mis-identification and worse signal estimation performance compared with methods that account for such structures (Gaynanova and Li, 2019).
While several methods for identification of partially-shared structures have been proposed, they have limitations from a modeling perspective. The penalized matrix factorization approaches by Jia et al. (2010) and Van Deun et al. (2011) lack explicit formulations of the underlying models, making it difficult to assess identifiability and provide interpretation. SLIDE (Gaynanova and Li, 2019) requires orthogonality across individual and partially-shared signals for model identifiability. This implies that the SLIDE model is not always unique, causing issues in estimation as well as interpretation.
In addition to modeling issues (lack of partially-shared structures or lack of identifiability), existing methods (Jia et al., 2010; Van Deun et al., 2011; Lock et al., 2013; Zhou et al., 2016; Yang and Michailidis, 2016; Gaynanova and Li, 2019) estimate the signals M_d based on explicit score-and-loading factorizations, leading to non-convex optimization problems. Convergence to the global optimum is not guaranteed, and the obtained solution depends on the starting point; thus, in practice, the obtained solution may be sub-optimal. The recently proposed BIDIFAC (Park and Lock, 2020) and BIDIFAC+ (Lock et al., 2022) are exceptions, as these methods provide a convex formulation via the use of nuclear norm penalization. Both are designed for bidimensionally-matched data, which includes multi-view data as a special case. However, BIDIFAC does not consider partially-shared structures. Furthermore, both BIDIFAC and BIDIFAC+ suffer from the bias associated with nuclear norm penalization (Chen et al., 2013; Josse and Sardy, 2016), affecting their signal estimation performance.
In this work, we address the limitations of existing models by both formulating an explicit model for partially-shared structures and avoiding identifiability issues. We also address the limitations of the associated estimation approaches by avoiding scores and loadings factorization and correcting for possible estimation bias. We achieve these goals as follows. First, we propose a recursive definition of partially-shared structures based on the new concept of a hierarchical level. For D = 3 views, the hierarchical levels are: (i) all three views (Level 1); (ii) any two views (Level 2); (iii) individual views (Level 3). The key idea is to work with the column spaces of M_d, and to recursively consider column spaces at each level after accounting for previous levels. Our formulation includes the JIVE, COBE and AJIVE models as special cases and avoids the identifiability issues of SLIDE. Second, guided by our hierarchical levels, we propose a new penalty, the hierarchical nuclear norm (HNN), for direct estimation of M_d without scores and loadings factorization. To our knowledge, HNN is novel in the literature, and its hierarchical construction allows restricting the ranks of the signals at each level. By varying the corresponding tuning parameters, the solution path encompasses a wide range of models with different patterns of shared signals. HNN is convex, but does not have a closed-form proximal operator. We address this challenge by adapting the dual block-coordinate forward-backward algorithm (Abboud et al., 2017). Third, we address the penalization bias by a new type of refitting. As the proposed definition of partially-shared structures relies on column spaces, we use HNN to estimate the column spaces, and then refit M_d under the column space constraint, thus preserving the estimated shared and individual patterns. Finally, we take advantage of the SURE criterion (Candès et al., 2013) to reduce the number of tuning parameters, and use an adapted version of bi-cross-validation (Owen and Perry, 2009) for model selection. In numerical studies, the resulting HNN method outperforms the competitors in both signal estimation and structure identification. Notation.
For a matrix A = (a_ij) ∈ R^{n×p}, let ‖A‖_F = (Σ_{i=1}^{n} Σ_{j=1}^{p} a_ij^2)^{1/2} be the Frobenius norm, and ‖A‖_* = Σ_{i=1}^{min(n,p)} σ_i(A) be the nuclear norm, where σ_i(A) is the i-th largest singular value of A. For the column space C(A), let P_{C(A)} and P^⊥_{C(A)} be the projection matrices onto C(A) and its orthogonal complement C(A)^⊥, respectively. Let 0 and I be the zero and identity matrix, respectively, with dimensions inferred from the context.
Let span{C_1, . . ., C_k} be the subspace spanned by the union of sets of basis vectors of the subspaces C_1, . . ., C_k ⊂ R^n.
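For concreteness, the following minimal R sketch (helper names are ours, not from the paper's code) evaluates the Frobenius norm, the nuclear norm, and the projection onto a column space.

```r
# Minimal illustrations of the notation (helper names are ours)
frobenius <- function(A) sqrt(sum(A^2))   # ||A||_F
nuclear   <- function(A) sum(svd(A)$d)    # ||A||_*: sum of singular values

proj <- function(A, tol = 1e-8) {         # P_{C(A)}: projection onto C(A)
  s <- svd(A)
  U <- s$u[, s$d > tol, drop = FALSE]     # orthonormal basis of C(A)
  tcrossprod(U)                           # U %*% t(U)
}

A <- matrix(rnorm(20), 5, 4)
P <- proj(A)
max(abs(P %*% A - A))                     # ~0: P fixes C(A)
frobenius((diag(5) - P) %*% A)            # ~0: P^perp annihilates C(A)
```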

Motivating example
Here we provide a toy example that illustrates the definition of joint and individual structures in JIVE (Lock et al., 2013), COBE (Zhou et al., 2016) and AJIVE (Feng et al., 2018), and how this definition does not account for partial sharing. We also illustrate how it is possible to have non-orthogonality between different structures (individual, joint, partially-shared), thus violating the identifiability conditions of SLIDE (Gaynanova and Li, 2019).
Example 1 Consider D = 3 with the signal matrices M_1, M_2 and M_3. JIVE, COBE and AJIVE define the joint structure as the intersection of the column spaces, with the individual structure of each view being the remaining signal. However, these individual structures are not truly unique to each view: while their overall intersection is zero, they have non-trivial pairwise intersections. Concretely, (i) span{[1 1 0 0 0]^T} is shared by the 1st and 2nd views (but not the 3rd view); (ii) span{[1 0 −1 0 0]^T} is shared by the 1st and 3rd views (but not the 2nd view). Likewise, in many applications, it is reasonable to expect that certain signals are shared across a few views rather than all views. We refer to such signals as partially-shared.
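Pairwise intersections of column spaces like those above can be checked numerically. The following R sketch uses the rank identity dim{C(A) ∩ C(B)} = rank(A) + rank(B) − rank([A B]); the matrices are hypothetical stand-ins (Example 1's exact matrices are displayed in the paper), constructed so that one direction u is shared by views 1 and 2 only.

```r
# dim(C(A) ∩ C(B)) = rank(A) + rank(B) - rank([A B])
intersect_dim <- function(A, B, tol = 1e-8) {
  r <- function(M) sum(svd(M)$d > tol)
  r(A) + r(B) - r(cbind(A, B))
}

set.seed(1)
# Hypothetical signals: u is shared by views 1 and 2 but absent from view 3
u  <- c(1, 1, 0, 0, 0)
v1 <- c(0, 0, 1, 0, 0); v2 <- c(0, 0, 0, 1, 0); v3 <- c(0, 0, 0, 0, 1)
M1 <- cbind(u, v1) %*% matrix(rnorm(2 * 10), 2, 10)  # n = 5, p1 = 10
M2 <- cbind(u, v2) %*% matrix(rnorm(2 * 10), 2, 10)
M3 <- cbind(v3)    %*% matrix(rnorm(1 * 10), 1, 10)

intersect_dim(M1, M2)  # 1: the partially-shared direction u
intersect_dim(M1, M3)  # 0: no shared direction
```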
Example 1 illustrates how many existing methods (Lock et al., 2013; Zhou et al., 2016; Yang and Michailidis, 2016; Feng et al., 2018) treat partially-shared structures as individual. To consider partially-shared structures, one could apply these methods separately to each of the D(D − 1)/2 view pairs: if the estimated joint rank from any pair is larger than that from all D views, it indicates the existence of a partially-shared structure between the corresponding pair. However, such an approach significantly increases the computational burden associated with the corresponding methods, and the resulting estimates across pairs are not guaranteed to be consistent with each other due to estimation errors.
To address these challenges, we propose to define partially-shared and individual structures in terms of the column spaces of the signal matrices. As in existing models, we define the joint structure as the intersection of all column spaces. In contrast to these methods, we further decompose each remaining column space to incorporate partially-shared structures. Our key idea is to consider the remaining signal structures after accounting for the joint structure. More precisely, the partially-shared signal structures (rigorously defined in Definition 1) in Example 1 are specified as J(1, 2) = span{[1 1 0 0 0]^T} and J(1, 3) = span{[1 0 −1 0 0]^T}. In comparison to COBE, JIVE and AJIVE, our approach models individual structures that are indeed view-specific after taking into account both partially-shared structures in Example 1. Our signal structure modeling is discussed in more detail in Section 2.2.
At the same time, the SLIDE model (Gaynanova and Li, 2019) also allows partial sharing, but requires orthogonality between the different types of structures to guarantee uniqueness. However, in Example 1 the individual and partially-shared structures are not orthogonal to each other; that is, J(1, 2) is not orthogonal to J(1, 3), J(1, 2) is not orthogonal to I(3), J(1, 3) is not orthogonal to I(2), and I(2) is not orthogonal to I(3). This implies that SLIDE's identifiability conditions are violated, and the resulting parametrization of the SLIDE model is not unique, leading to potential difficulties in estimation and interpretation. This is illustrated in Example A2 in Appendix A.2.

Proposed definition of hierarchical signal structure
Motivated by Example 1, we propose to define joint, partially-shared, and individual structures via hierarchical levels. When D = 3, we consider three hierarchical levels: Level 1: {(1, 2, 3)}, Level 2: {(1, 2), (1, 3), (2, 3)}, Level 3: {(1), (2), (3)}, where the integers indicate the views. Levels 1, 2 and 3 correspond to the joint structure, the partially-shared structures between the three distinct view pairs, and the three individual structures, respectively. The structure at each level is defined recursively based on the previous level, starting with the intersection of all column spaces C(M_d) at Level 1. Figure 1 provides an illustration of the proposed definition as applied to Example 1, which we formalize below. At Level 1, the joint structure is defined as J(1, 2, 3) = C(M_1) ∩ C(M_2) ∩ C(M_3), and we let R^(1)_d = C(P^⊥_{J(1,2,3)} M_d) represent the remaining signal in each view after accounting for the joint structure. At Level 2, we use the signal remaining from Level 1 to define the joint structures across each view pair: formally, J(l, m) = R^(1)_l ∩ R^(1)_m for (l, m) = (1, 2), (1, 3), (2, 3). For example, J(1, 2) is shared across views 1 and 2, is not present in view 3, and by construction is orthogonal to the globally-shared J(1, 2, 3). We use R^(2)_d to denote the remaining signal in each view after accounting for the structures in Levels 1-2; that is, R^(2)_d = C(P^⊥_{Q^(2)_d} P^⊥_{J(1,2,3)} M_d), where Q^(2)_1 = span{J(1, 2), J(1, 3)}, Q^(2)_2 = span{J(1, 2), J(2, 3)} and Q^(2)_3 = span{J(1, 3), J(2, 3)}. At Level 3, only singletons remain, and the individual structure for view d is the remainder I(d) = R^(2)_d, since it is already adjusted for both joint and partially-shared structures at the previous two levels.
For general D, there are in total D hierarchical levels. For k ∈ {1, . . ., D}, the (D − k + 1)-th level corresponds to the structures shared across k views, and there are (D choose k) distinct view combinations (i_1, . . ., i_k). The key idea is to recursively decompose the remaining signal from the previous level into the shared structures specified at the current level and the remaining signal. Below, we provide the full definition of the joint, partially-shared and individual structures.
Definition 1 At Level 1, the joint structure is J(1, . . ., D) = C(M_1) ∩ · · · ∩ C(M_D), and the remaining signal in view d is R^(1)_d = C(P^⊥_{J(1,...,D)} M_d). At Level D − k + 1 for 1 < k < D, the partially-shared signal for each distinct view combination (i_1, . . ., i_k) is the intersection of the remaining signals from the previous level:

J(i_1, . . ., i_k) = R^(D−k)_{i_1} ∩ · · · ∩ R^(D−k)_{i_k}.

We call J(i_1, . . ., i_k) the k-way partially-shared structure common to views i_1, . . ., i_k. Let Q^(D−k+1)_d be the space spanned by all (D−1 choose k−1) distinct J(i_1, . . ., i_k) with d ∈ {i_1, . . ., i_k} (the k-way partially-shared spaces associated with view d). The remaining signal in view d is R^(D−k+1)_d = C(P^⊥_{Q^(D−k+1)_d} P_{R^(D−k)_d} M_d). At Level D, the individual structure is defined as I(d) = R^(D−1)_d. The main advantage of Definition 1 is that each type of structure is defined in terms of column spaces, instead of relying on the matrix representations (via decomposition and factorization) that existing methods (Lock et al., 2013; Zhou et al., 2016; Yang and Michailidis, 2016; Feng et al., 2018; Gaynanova and Li, 2019) use. As illustrated below, this facilitates a clean and non-restrictive definition of partially-shared structures. Take D = 3 views as an example. Definition 1 decomposes C(M_d) as the sum of the joint structure, the two pairwise partially-shared structures containing view d, and the individual structure; for instance, C(M_1) = J(1, 2, 3) + J(1, 2) + J(1, 3) + I(1). In the column-space representation, JIVE, COBE and AJIVE correspond to the decomposition C(M_d) = J(1, 2, 3) + I_JIVE(d), where I_JIVE(d) = C(P^⊥_{J(1,2,3)} M_d) denotes their individual structure for d = 1, 2, 3. In contrast, our formulation further decomposes I_JIVE(d), making partially-shared structures explicit. Furthermore, by construction, Definition 1 allows the spaces within the same level to be non-orthogonal to each other: (i) J(1, 2), J(1, 3) and J(2, 3) at Level 2; (ii) I(1), I(2) and I(3) at Level 3. Additionally, structures that do not involve the same view can be non-orthogonal, such as (i) J(1, 2) and I(3); (ii) J(1, 3) and I(2); (iii) J(2, 3) and I(1). Since our approach does not impose extra orthogonality conditions and is based directly on the column spaces of the signal matrices, it overcomes the identifiability issues of the SLIDE model.
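A hedged R sketch of Definition 1 for D = 3 is given below. The helper names are ours, and intersections of column spaces are computed via principal angles (directions whose principal angle is zero), which is one standard numerical realization of the set operations above; numerical tolerances would need care on noisy data.

```r
# A hedged sketch of Definition 1 for D = 3 (helper names are ours)
colbasis <- function(M, tol = 1e-8) {          # orthonormal basis of C(M)
  if (ncol(M) == 0) return(M)
  s <- svd(M)
  s$u[, s$d > tol, drop = FALSE]
}
intersect_basis <- function(U, V, tol = 1e-6) {
  # principal angles: singular values of U'V equal to 1 give shared directions
  if (ncol(U) == 0 || ncol(V) == 0) return(U[, 0, drop = FALSE])
  s <- svd(crossprod(U, V))
  U %*% s$u[, s$d > 1 - tol, drop = FALSE]
}
proj_out <- function(M, U) {                   # P^perp_{C(U)} M
  if (ncol(U) == 0) M else M - U %*% crossprod(U, M)
}

hier_decompose <- function(M1, M2, M3) {
  B <- lapply(list(M1, M2, M3), colbasis)
  J123 <- intersect_basis(intersect_basis(B[[1]], B[[2]]), B[[3]])  # Level 1
  R1 <- lapply(list(M1, M2, M3), function(M) colbasis(proj_out(M, J123)))
  J12 <- intersect_basis(R1[[1]], R1[[2]])                          # Level 2
  J13 <- intersect_basis(R1[[1]], R1[[3]])
  J23 <- intersect_basis(R1[[2]], R1[[3]])
  Q <- list(colbasis(cbind(J12, J13)), colbasis(cbind(J12, J23)),
            colbasis(cbind(J13, J23)))
  I <- mapply(function(M, Qd) colbasis(proj_out(proj_out(M, J123), Qd)),
              list(M1, M2, M3), Q, SIMPLIFY = FALSE)                # Level 3
  list(J123 = J123, J12 = J12, J13 = J13, J23 = J23, I = I)
}
```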

Hierarchical rank constraints
Based on Definition 1, one possibility is to use matrix factorization for signal estimation, that is, to decompose each M_d into a sum of several matrices that represent each signal structure. The scores and loadings factorization of each structure can then be used for estimation, which is pursued by many existing methods for data integration (Lock et al., 2013; Feng et al., 2018; Gaynanova and Li, 2019). In our framework, however, it is unclear how to adopt the scores and loadings factorization while preserving the potential non-orthogonality of the signal structures. Another possibility is to cast signal estimation as an optimization problem with explicit subspace constraints (Yuan and Gaynanova, 2022). However, in the presence of multiple types of structures, this leads to a highly non-trivial, non-convex manifold optimization problem. Here, we propose a new strategy for direct signal estimation that avoids both matrix factorization and subspace constraints, leading to a convex optimization problem. Our key idea is to introduce a hierarchical nuclear norm penalty, which is motivated by the rank constraints imposed on the matrices at each hierarchical level discussed in this section.
Specifically, we propose to take advantage of constraints on the ranks of the matrices at each hierarchical level. Let I_{D−k+1} be the (D − k + 1)-th hierarchical level as in Section 2.2. Given the total signal M = [M_1 . . . M_D], let H_{D−k+1}(M) be the collection of concatenated submatrices M^(i_1,...,i_k) over the view combinations (i_1, . . ., i_k) ∈ I_{D−k+1}; Figure 2 shows the matrices of each H_{D−k+1}(M) when D = 3. As shown in Section 2.1, the definition of the joint structure in existing models (Lock et al., 2013; Zhou et al., 2016; Feng et al., 2018) relies on a non-trivial intersection of the column spaces of the signal matrices, which implies the rank constraint

rank([M_1 . . . M_D]) < Σ_{d=1}^{D} rank(M_d).    (3)

When estimating M_d from the noisy X_d in model (1), these methods effectively impose low-rank constraints of the form rank([M_1 M_2 M_3]) ≤ r_123 and rank(M_d) ≤ r_d with some r_123 and r_d; choosing the ranks such that r_123 < Σ_{d=1}^{3} r_d leads to non-trivial intersections of the column spaces. Using the intuition from (3), we introduce hierarchical rank constraints, which are imposed on the ranks of all matrices in H_{D−k+1}(M) for all hierarchical levels. Using the D = 3 example in Figure 2, the constraints become

rank([M_1 M_2 M_3]) ≤ r_123, rank([M_l M_m]) ≤ r_lm for (l, m) = (1, 2), (1, 3), (2, 3), rank(M_d) ≤ r_d for d = 1, 2, 3.    (4)

The constraints for D > 3 are similar. Various combinations of the ranks in (4) lead to intersections of column spaces with varying dimensions, allowing a wide range of signal structures. However, direct use of the constraints (4) in estimation leads to a non-convex, NP-hard optimization problem (Candès and Recht, 2009; Mazumder et al., 2010). Instead, in Section 2.4 we propose to replace each constraint in (4) with a corresponding nuclear norm penalty.

Hierarchical nuclear norm penalization
Nuclear norm penalization (Bach, 2008; Candès and Recht, 2009; Negahban and Wainwright, 2011) is commonly used for estimation of the low-rank signal M_d in single-view settings, since the nuclear norm is a convex relaxation of the rank, leading to the convex optimization problem

minimize_{M_d} { (1/2) ‖X_d − M_d‖_F^2 + λ_d ‖M_d‖_* },    (5)

where λ_d ≥ 0 is a tuning parameter. Parallel to ℓ_1-norm penalization for sparse vector estimation, the nuclear norm penalty in (5) encourages sparse singular values for low-rank signal estimation. The solution to (5) can be written in closed form via the proximal operator of the nuclear norm (Polson et al., 2015):

S(X_d, λ_d) = Σ_{i=1}^{min(n, p_d)} {σ_i(X_d) − λ_d}_+ u_i v_i,    (6)

where x_+ = max(x, 0), and u_i ∈ R^{n×1} and v_i ∈ R^{1×p_d} are the left and right singular vectors corresponding to σ_i(X_d), respectively. That is, S(X_d, λ_d) is the matrix resulting from soft-thresholding the singular values of X_d with cutoff λ_d. The larger the value of λ_d, the stronger the penalty, and the lower the rank of the resulting S(X_d, λ_d).
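As a concrete illustration of (6), the following R sketch implements singular value soft-thresholding; the function name is ours.

```r
# Singular value soft-thresholding operator S(X, lambda) from (6)
svt <- function(X, lambda) {
  s <- svd(X)
  d <- pmax(s$d - lambda, 0)               # soft-threshold the singular values
  s$u %*% (d * t(s$v))                     # reassemble; zeroed d's reduce rank
}

X <- matrix(rnorm(50 * 20), 50, 20)
qr(svt(X, lambda = 5))$rank < qr(X)$rank   # stronger penalty -> lower rank
```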
Motivated by the above hierarchical rank constraints (4), in the multi-view case we propose the hierarchical nuclear norm (HNN) penalty

pen(M) = Σ_{k=1}^{D} Σ_{(i_1,...,i_k) ∈ I_{D−k+1}} λ_{i_1...i_k} ‖M^(i_1,...,i_k)‖_*,    (7)

where M^(i_1,...,i_k) is the submatrix of M formed by concatenating M_d for d ∈ {i_1, . . ., i_k}, and λ_{i_1...i_k} ≥ 0 is the corresponding tuning parameter that controls the amount of shrinkage applied to the singular values of each matrix in H_{D−k+1}(M). By encouraging sparse singular values of the matrices at each hierarchical level, the penalty (7) naturally induces different signal structures due to the correspondence between the rank and the number of non-zero singular values.
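For D = 3, the penalty (7) can be evaluated directly from the seven concatenated matrices, as in the following R sketch (function and argument names are ours); nuclear() is the helper defined in the notation sketch above.

```r
# The HNN penalty (7) for D = 3; nuclear() sums the singular values
hnn_penalty <- function(Ms, lam) {    # Ms: list(M1, M2, M3); lam: named list
  lam[["123"]] * nuclear(cbind(Ms[[1]], Ms[[2]], Ms[[3]])) +
    lam[["12"]] * nuclear(cbind(Ms[[1]], Ms[[2]])) +
    lam[["13"]] * nuclear(cbind(Ms[[1]], Ms[[3]])) +
    lam[["23"]] * nuclear(cbind(Ms[[2]], Ms[[3]])) +
    lam[["1"]] * nuclear(Ms[[1]]) +
    lam[["2"]] * nuclear(Ms[[2]]) +
    lam[["3"]] * nuclear(Ms[[3]])
}
```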
The corresponding joint hierarchical nuclear norm penalization problem is

argmin_M { (1/2) ‖X − M‖_F^2 + pen(M) },    (8)

where X = [X_1 . . . X_D], with the solution path covering a wide range of models with different configurations of hierarchical ranks. The choice of the tuning parameters will be discussed in Section 2.6. Note that our approach is not based on matrix factorizations, which would lead to a non-convex optimization. Instead, (8) is a convex optimization problem, which can be efficiently solved by the dual block-coordinate forward-backward algorithm (Abboud et al., 2017). We specialize this general algorithm to solving (8); the pseudo-code is presented in Algorithm 1, and the detailed derivation is deferred to Appendix A.1. One important finding is that the proximal operator of the hierarchical nuclear norm can be evaluated efficiently because the norm is a sum of convex functions composed with linear operators.
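To make the convex structure of (8) concrete, the following R sketch implements a plain dual forward-backward iteration under our reading of the problem: with M = X − Σ_j A_j^T Y_j, each dual block Y_j is updated by a gradient step followed by the proximal operator of the conjugate nuclear norm (via the Moreau identity); svt() is the soft-thresholding helper defined after (6). The step size, iteration count, and the non-block-coordinate update order are our choices, so this is an illustration rather than the paper's exact Algorithm 1.

```r
# A hedged dual forward-backward sketch for (8) with D = 3 (names are ours).
# Each column of view d enters 4 penalty blocks, so the Lipschitz constant of
# the dual gradient is 4 and gamma < 0.5 is required.
hnn_fit <- function(X, p, lam, n_iter = 200, gamma = 0.4) {
  blocks <- list("1" = 1, "2" = 2, "3" = 3,
                 "12" = c(1, 2), "13" = c(1, 3), "23" = c(2, 3),
                 "123" = c(1, 2, 3))
  starts <- cumsum(c(0, p))
  col_idx <- function(b) unlist(lapply(b, function(d) (starts[d] + 1):starts[d + 1]))
  Y <- lapply(blocks, function(b) matrix(0, nrow(X), length(col_idx(b))))
  primal <- function() {                  # M = X - sum_j A_j' Y_j
    M <- X
    for (j in seq_along(blocks)) {
      ix <- col_idx(blocks[[j]])
      M[, ix] <- M[, ix] - Y[[j]]
    }
    M
  }
  for (t in seq_len(n_iter)) {
    M <- primal()
    for (j in seq_along(blocks)) {        # forward (gradient) + backward (prox)
      Z <- Y[[j]] + gamma * M[, col_idx(blocks[[j]])]
      # prox of the conjugate of lam*||.||_* via the Moreau identity
      Y[[j]] <- Z - gamma * svt(Z / gamma, lam[[names(blocks)[j]]] / gamma)
    }
  }
  primal()
}

# Hypothetical usage on three concatenated views:
# X <- cbind(X1, X2, X3); p <- c(ncol(X1), ncol(X2), ncol(X3))
# lam <- list("1" = .1, "2" = .1, "3" = .1, "12" = .1, "13" = .1,
#             "23" = .1, "123" = .1)
# M_hat <- hnn_fit(X, p, lam)
```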

Refitting procedure
Like many penalized estimation methods, nuclear norm penalization is well known to cause shrinkage bias (Chen et al., 2013; Josse and Sardy, 2016). In particular, since it penalizes both small and large non-zero singular values by an equal amount, the so-called over-shrinkage phenomenon can occur, where non-zero singular values are shrunk too much. This causes inaccurate estimation of the signal, as the non-zero singular values are likely to be estimated smaller than their original values.
Since our hierarchical nuclear norm penalty consists of the nuclear norms of multiple matrices, our approach is also affected by this over-shrinkage phenomenon. Thus, we adopt a simple refitting step to "un-shrink" our estimate. The key idea is to extract the estimated low-rank column space from the preliminary M̂_d from Algorithm 1, and then construct M̂^refit_d with the given column space without penalization of the singular values. This idea is formalized in Algorithm 2. Since the refitting step corresponds to solving the least-squares problem (10), which has a closed-form solution, it does not require a significant amount of computation. As illustrated in Figure 5 of Appendix A.4, while the singular values of each M̂_d are smaller than the true values, our refitting procedure is effective in correcting the bias.
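A minimal R sketch of the refitting idea for one view: keep the estimated column space of the preliminary estimate and refit by least squares, whose solution is the projection of X_d onto that space. The function name is ours, and the centering/scaling details of Algorithm 2 are omitted.

```r
# Refit X_d onto the column space of the preliminary estimate M_d_hat:
# argmin_{C(M) ⊆ C(U)} ||X_d - M||_F^2  =  U U' X_d
refit_view <- function(Xd, Md_hat, tol = 1e-8) {
  s <- svd(Md_hat)
  U <- s$u[, s$d > tol, drop = FALSE]    # estimated column-space basis
  U %*% crossprod(U, Xd)                 # unpenalized least-squares refit
}
```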

Selection of tuning parameters
The tuning parameters in the HNN penalty (7) need to be properly chosen for good performance of our approach. If the nuclear norms are over-penalized, signal estimation performance degrades due to underestimated ranks (which the above refitting procedure cannot correct). On the other hand, if the tuning parameters are too small, the ranks are overestimated, and the estimated signals contain noise. One clear difficulty in selecting tuning parameters for (8) is the large number of parameters. For instance, even the modest D = 3 case (9) requires seven parameters. Tuning all these parameters freely via cross-validation or a model selection criterion is computationally prohibitive.
We propose to reduce the total number of tuning parameters from 2^D − 1 to only D by taking advantage of Stein's Unbiased Risk Estimate (SURE) (Candès et al., 2013; Josse and Sardy, 2016). The SURE formulation for single-view nuclear norm minimization allows one to derive a closed form of the optimal λ^SURE_d to use in S(X_d, λ_d) in (6) as the minimizer of the SURE criterion; see Candès et al. (2013) for the closed form of λ^SURE_d and derivation details. While the SURE formulation for the HNN minimization is intractable, we take advantage of the closed-form λ^SURE_d in the single-matrix case to determine the relative weights assigned to each matrix within each hierarchical level H_{D−k+1}(M). Specifically, let λ^SURE_d and λ^SURE_lm denote the optimal SURE cutoffs for S(X_d, λ_d) and S(X^(lm), λ_lm), respectively. For D = 3, we propose to reparametrize the penalty in (9) as

pen(M) = τ Σ_{d=1}^{3} w_d ‖M_d‖_* + κ Σ_{(l,m)} w_lm ‖M^(l,m)‖_* + λ ‖M‖_*,    (11)

where τ, κ > 0 control the overall shrinkage amount for the individual and pairwise matrices, respectively, and the relative weights are chosen as

w_d = λ^SURE_d / Σ_{d'} λ^SURE_{d'},  w_lm = λ^SURE_lm / Σ_{(l',m')} λ^SURE_{l'm'}.    (12)

That is, we form a convex combination of the nuclear norms within each hierarchical level in the penalty (11). The approach for D > 3 is similar. By design, τ, κ and λ control the amount of penalization across different hierarchical levels, while the weights (12) induce (potentially) differential shrinkage within the same level based on SURE. Since the weights are available in closed form, only three parameters (τ, κ, λ) remain to be tuned.
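Under our reading of (11)-(12), the per-matrix tuning parameters are recovered from (τ, κ, λ) and the SURE cutoffs as in the following R sketch; the cutoffs lam_sure_ind and lam_sure_pair are assumed to be computed elsewhere (e.g., with the denoiseR package, as in the appendix), and the output format matches the hnn_fit sketch above.

```r
# A hedged sketch of the reparametrization (11)-(12) for D = 3. That the
# within-level weights are proportional to the single-matrix SURE cutoffs is
# our reading of (12); the cutoffs are assumed precomputed.
hnn_lambdas <- function(lam_sure_ind, lam_sure_pair, tau, kappa, lambda) {
  w_ind  <- lam_sure_ind  / sum(lam_sure_ind)    # weights for M_1, M_2, M_3
  w_pair <- lam_sure_pair / sum(lam_sure_pair)   # weights for the three pairs
  list("1" = tau * w_ind[1], "2" = tau * w_ind[2], "3" = tau * w_ind[3],
       "12" = kappa * w_pair[1], "13" = kappa * w_pair[2],
       "23" = kappa * w_pair[3], "123" = lambda)
}
```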
To choose the remaining D tuning parameters, we adapt the bi-cross-validation (BCV) approach (Owen and Perry, 2009); a similar approach is used in SLIDE (Gaynanova and Li, 2019). The original BCV is designed for rank selection in the single-view case, and can be viewed as a matrix extension of k-fold cross-validation. The (j × k)-fold BCV splits the rows and the columns of X into j and k blocks, respectively. The submatrix corresponding to each block is held out as test data, with the remaining submatrices used to fit the model of specified rank and make predictions on the test block.
To illustrate our application of BCV, we focus on 2 × 2 folds of X such that

X_d = [X^{1,1}_d X^{1,2}_d ; X^{2,1}_d X^{2,2}_d], d = 1, 2, 3,    (13)

where X^{j,k}_d denotes the submatrix of X_d corresponding to the j-th row fold and the k-th column fold. Algorithm 3 summarizes the corresponding BCV procedure; the main difference with the original BCV is that the splits are distributed equally across the views.

[Algorithm 3 (Proposed 2 × 2 BCV procedure for evaluation of prediction errors): given X with a random split as in (13) and the tuning grid {(τ_l, κ_l, λ_l)}_{l=1}^{L}, for each held-out block (i) obtain the refitted estimate by applying Algorithms 1-2 to the centered and scaled training block Z^{−j,−k}_d with (τ_l, κ_l, λ_l), back-scaling and back-centering the resulting matrix; (ii) evaluate the BCV error (14); return {BCVErr_{j,k,l} : j, k = 1, 2}_{l=1}^{L}.]

Suppose we hold out X^{j,k}_d for all d = 1, 2, 3. Given the tuning parameters (τ_l, κ_l, λ_l), we obtain the estimates M̂^refit_{−j,−k,l} based on the submatrix of X that shares no rows or columns with the held-out X^{j,k}_d (this submatrix is denoted by X^{−j,−k}_d in Algorithm 3). We then use both X^{j,−k}_d (sharing only the rows with X^{j,k}_d) and X^{−j,k}_d (sharing only the columns with X^{j,k}_d) to evaluate the prediction error for X^{j,k}_d as in Owen and Perry (2009):

BCVErr_{j,k,l} = (1/3) Σ_{d=1}^{3} ‖X^{j,k}_d − X^{j,−k}_d (M̂^refit_{d,−j,−k,l})^+ X^{−j,k}_d‖_F^2,    (14)

where the superscript + denotes the pseudo-inverse of the corresponding matrix. Given the grid of tuning parameters {(τ_l, κ_l, λ_l)}_{l=1}^{L}, we calculate the average BCV error Avg Err_l = 4^{−1} Σ_{j,k=1}^{2} BCVErr_{j,k,l} for combination l, and choose the optimal (τ*, κ*, λ*) based on the one standard error rule (Hastie et al., 2009); see Appendix A.6 for more details.
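The prediction error for a single held-out block can be sketched in R as follows, following the Owen and Perry (2009) form used in (14); Md_train stands for the refitted estimate from Algorithms 1-2 on the (−j, −k) training block, which we treat as given.

```r
# A hedged sketch of one summand of the BCV error (14) for view d
library(MASS)  # for ginv(), the Moore-Penrose pseudo-inverse

bcv_error_block <- function(Xd, rows_j, cols_k, Md_train) {
  # Md_train: refitted estimate of M_d on the (-j, -k) training block,
  # of dimension (n - |rows_j|) x (p_d - |cols_k|)
  pred <- Xd[rows_j, -cols_k, drop = FALSE] %*% ginv(Md_train) %*%
          Xd[-rows_j, cols_k, drop = FALSE]
  sum((Xd[rows_j, cols_k] - pred)^2)     # squared error on the held-out block
}
```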

Simulation studies
We conduct simulations to evaluate the performance of our approach. We label our approach HNN and select its tuning parameters as described in Section 2.6. To investigate the effect of our tuning parameter selection, we also consider our estimator with the smallest error (15) and label it HNN best. For comparison, we further implement the following competing methods: (a) JIVE (Lock et al., 2013) with the default permutation method for rank selection, as well as with the true ranks (labeled JIVE true); (b) AJIVE (Feng et al., 2018); (c) SLIDE (Gaynanova and Li, 2019); (d) BIDIFAC (Park and Lock, 2020); (e) BIDIFAC+ (Lock et al., 2022). Additional implementation details are discussed in Appendix B. The R code for all analyses can be found at http://github.com/sangyoonstat/HNN paper.
We simulate each data matrix X_d via the model X_d = M_d + E_d, where each element of the noise matrix E_d is generated independently from N(0, σ_d^2). The noise level σ_d is set to satisfy the chosen signal-to-noise ratio (SNR), where SNR = ‖M_d‖_F^2 / E(‖E_d‖_F^2) = ‖M_d‖_F^2 / (n p_d σ_d^2). We vary the number of views (D = 2 or D = 3), and consider two cases for each D: (i) orthogonal (all joint, partially-shared and individual structures are orthogonal); (ii) non-orthogonal (some structures are non-orthogonal). Appendix B provides additional data generation details. Given an estimate M̂ = [M̂_1 . . . M̂_D], we compare all methods in terms of the estimated ranks and the scaled squared Frobenius norm error

‖M̂ − M‖_F^2 / ‖M‖_F^2.    (15)

Figure 3 shows the results for the D = 2 setting over 100 replications. In the orthogonal case, AJIVE and SLIDE are the best performing methods, followed closely by HNN and BIDIFAC+ refit. The good performance of SLIDE is expected in the orthogonal case, as the identifiability conditions are satisfied. The good performance of AJIVE can be attributed to its accurate rank estimation, as no partially-shared signals exist when D = 2. JIVE and BIDIFAC+ perform the worst. Comparing the ranks reveals that JIVE overestimates the rank of M while correctly estimating the ranks of the individual M_d; hence it underestimates the rank of the joint structure. Since JIVE with true ranks (JIVE true) performs as well as AJIVE and SLIDE, the results indicate that JIVE's poor performance in this setting is due to rank mis-identification. As expected, BIDIFAC+ performs significantly worse than BIDIFAC+ refit due to the bias associated with the penalization in the original BIDIFAC+ solution. In the non-orthogonal case, the performance of both AJIVE and SLIDE deteriorates, with HNN becoming the best performing method. The deterioration of SLIDE is expected, as non-orthogonal individual structures violate SLIDE's identifiability conditions (Gaynanova and Li, 2019). The deterioration of AJIVE could be attributed to rank mis-identification, as it significantly underestimates the total rank of M. The median estimated ranks of HNN always match the true ranks (in both orthogonal and non-orthogonal cases).
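The error metric (15), under the natural reading that it is relative to the squared Frobenius norm of the true signal, is a one-liner:

```r
# Scaled squared Frobenius norm error (15)
scaled_error <- function(M_hat, M_true) {
  sum((M_hat - M_true)^2) / sum(M_true^2)
}
```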
Figure 4 shows the results for the D = 3 setting over 100 replications. Unlike the D = 2 setting, HNN performs better than AJIVE in both orthogonal and non-orthogonal cases, and is only second to SLIDE in the orthogonal case (albeit close in both errors and estimated ranks). AJIVE's worse performance is expected, since it does not account for partially-shared structures. JIVE performs better than or similarly to AJIVE, but worse than HNN, SLIDE and BIDIFAC+ refit. As expected, the refitted versions of BIDIFAC are better than the original ones, with BIDIFAC+ performing better than BIDIFAC. As in the D = 2 case, SLIDE's performance is significantly worse in the non-orthogonal case due to identifiability issues. In the non-orthogonal case, the median estimated ranks for HNN are the closest to the true ranks.
Surprisingly, BIDIFAC and BIDIFAC+ have significantly worse performance than the other methods in our experiments. This is due to the bias caused by the penalization, as the singular values resulting from both BIDIFAC and BIDIFAC+ are shrunk too much relative to the true values. In Figures 3-4, both BIDIFAC refit and BIDIFAC+ refit show much smaller Frobenius norm errors compared to the corresponding results without refitting.
Overall, the proposed HNN performs the best in terms of median estimation errors and median estimated ranks. As expected, the errors of the oracle HNN best are always better. However, the difference between the two medians is negligible compared to the difference with the other methods, which supports the good performance of the proposed tuning parameter selection procedure. At the same time, HNN shows high variability across replications, especially in the non-orthogonal two-view case. As shown in Appendix B, this extra variability is due to the randomness in the BCV splits. An occasional "poor" split may lead to rank under-estimation, subsequently giving higher signal estimation errors, whereas a different split on the same data has no such issue. In practice, we recommend running the BCV procedure with several split choices to assess the ranks that are selected most consistently across splits. In the orthogonal case, SLIDE has an advantage over HNN, as it has similar or slightly higher accuracy and lower variance. However, SLIDE performs significantly worse in non-orthogonal cases, which is further supported by the additional results in Appendix C. Since in practice it is difficult to assess a priori whether the orthogonality condition holds, HNN has an advantage over SLIDE.

Analysis of multi-tissue gene expression data
The p53 gene plays an important role in cell regulation, with impact on aging processes and cancer (Seim et al., 2016; Tanikawa et al., 2017). However, the tissue-specific nature of p53 gene expression makes it difficult to develop targeted therapies (Vlatkovic et al., 2011). Thus, it is crucial to identify patterns in p53 gene expression that are shared and unique across multiple tissues. To identify such patterns, we analyze gene expression data corresponding to the p53 signaling pathway from three tissues (muscle, blood and skin). The original data are available from the Genotype-Tissue Expression project (The GTEx Consortium, 2020); we use the preprocessed data by Li and Jung (2017) available at https://github.com/reagan0323/SIFA. The resulting data contain measurements on 191 genes from 204 matched samples corresponding to three tissues: X_1 ∈ R^{204×191} for muscle, X_2 ∈ R^{204×191} for blood and X_3 ∈ R^{204×191} for skin. Our goal is to identify shared, partially-shared and individual signal structures across the three tissues, and to compare the results across different methods.
In the first three rows of Table 1, we compare HNN with JIVE, AJIVE, SLIDE, BIDIFAC and BIDIFAC+ on each view in terms of (i) the total estimated rank and (ii) the percentage of explained variation (the ratio of the squared Frobenius norm of the estimated signal to the squared Frobenius norm of the data). JIVE, AJIVE and BIDIFAC estimate smaller ranks compared to SLIDE, BIDIFAC+ and HNN. Furthermore, HNN has the highest total rank for each view, with the highest percentage of variance explained. While the larger explained variance can be partially attributed to the larger ranks, this is not the case for the muscle and skin tissues. For muscle, HNN has a higher percentage of variance explained than SLIDE and BIDIFAC+ refit with the same rank of 16. For skin, HNN has a higher percentage than SLIDE with the same rank of 19. To provide a fairer comparison, we adjust the HNN estimates to match the ranks of the other methods by applying the singular value decomposition to each M̂_d and keeping only the top singular vectors (matching the rank). The adjusted HNN estimates still explain more variation in each view than the other methods (Figure 9 in Appendix D). Overall, these comparisons suggest that HNN provides a better fit to the data. The last seven rows in Table 1 show the ranks of the corresponding structures identified by each method. For all methods, the individual structures have higher ranks than the shared structures, which is consistent with the known tissue-specificity of p53 gene expression (Seim et al., 2016; Tanikawa et al., 2017). HNN identifies more partially-shared structures than SLIDE but fewer than BIDIFAC+. Since JIVE, AJIVE and BIDIFAC are unable to identify partially-shared structures directly, we also apply these methods separately to each pair of views. If the joint rank estimated on a pair of views is larger than the joint rank estimated on all three views, there is evidence for partially-shared structures. Table 2 shows the estimated ranks for JIVE, AJIVE and BIDIFAC when applied to each pair of views. For Muscle & Skin, the joint ranks (3 for JIVE, 3 for AJIVE and 6 for BIDIFAC) are larger than the corresponding joint ranks for all three views (2 for JIVE, 1 for AJIVE and 5 for BIDIFAC), suggesting a missing partially-shared structure between the muscle and skin tissues. From Table 1, BIDIFAC+ and HNN are the only methods that identify partial-sharing between Muscle & Skin. To compare how the partially-shared structures from HNN relate to the other methods, in Figure 10 in Appendix D we illustrate the cosines of the principal angles between the HNN partially-shared structures and the corresponding individual structures from the other approaches. For both Muscle & Blood and Muscle & Skin, we find that a rank-one subspace of the corresponding partially-shared structure from HNN is identified as individual for Muscle by JIVE and AJIVE (cosine value close to one). For Blood & Skin, the partially-shared structures of HNN are distinct, and are not identified as individual by any of the other methods. The numerical results in Section 3, coupled with the comparisons of HNN with the other methods in terms of variance explained and identified ranks, suggest that HNN provides new insights into the specificity of p53 pathway gene expression across tissues. The presence of partially-shared structures across each pair of tissues supports the existing research that each pair has a unique similarity depending on the underlying characteristics used for comparison.
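The rank-matching adjustment used in the comparison above can be sketched as follows (function names are ours): truncate the estimated M̂_d to a target rank via the SVD and recompute the percentage of explained variation.

```r
# Truncate an estimate to rank r via SVD, then recompute explained variation
truncate_rank <- function(M, r) {
  s <- svd(M)
  s$u[, 1:r, drop = FALSE] %*% (s$d[1:r] * t(s$v[, 1:r, drop = FALSE]))
}
explained_var <- function(M_hat, X) 100 * sum(M_hat^2) / sum(X^2)
```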
Sonawane et al. (2017) clustered tissues into communities based on tissue-specific targeting patterns: both the Skin and Blood tissues belong to the cell proliferation community, whereas the Muscle tissue belongs to a distinct extracellular structure community. The GTEx Consortium (2020) considered pairwise associations between tissues based on cis-eQTLs, expression quantitative trait loci located near the gene of origin. They found that Skin and Muscle have a high pairwise association, whereas the cis-eQTL-based association of each tissue with Blood was considerably lower. Oliva et al. (2020) considered the impact of gender on tissue expression, with a considerably higher number of sex-biased genes in the Skin tissue compared to the Muscle and Blood tissues. Consequently, they found that the Muscle and Blood tissues co-cluster based on the effect size of sex-biased genes, with the Skin tissue being significantly more distinct. While these prior analyses were not restricted to the p53 signaling pathway, it is known that the p53 pathway plays a crucial role in cell proliferation (Seim et al., 2016) and is enriched for sex-biased genes (Lopes-Ramos et al., 2020). The shared, partially-shared and individual signals identified by HNN provide a principled data-driven foundation for further investigation of the distinct biological characteristics affecting p53 expression.

Conclusion
In this work, we formally introduce a model for joint, partially-shared, and individual signal structures in multi-view data based on hierarchical levels. We also propose a new hierarchical nuclear norm penalization approach for signal estimation and a simple refitting step for bias adjustment. Our simulation studies indicate that our method outperforms competing methods in terms of signal recovery and rank estimation, particularly when not all signal structures are orthogonal.
While the proposed BCV-based tuning parameter selection scheme performs well in our numerical studies, a large number of tuning parameter combinations still needs to be evaluated, leading to a high computational burden. It would be of interest to explore a way of choosing tuning parameters that is analogous to the SURE approach in single-view nuclear norm penalization. This, however, is quite challenging because Algorithm 1 does not admit a closed-form solution. Theoretical analysis of our method is also an interesting direction for future study, which may be possible thanks to the considerable development of theory for single-view nuclear norm penalization (Bach, 2008; Negahban and Wainwright, 2011).

A.1.1 Example with two views
Example A1 Consider D = 2 with the signal matrices M_1 and M_2. JIVE (Lock et al., 2013), COBE (Zhou et al., 2016) and AJIVE (Feng et al., 2018) define the joint structure as the intersection of the column spaces, J(1, 2) = C(M_1) ∩ C(M_2), which represents a signal structure shared by both views. The individual structure of each view is defined as the column space of the remaining non-intersecting signal, that is, the projection of the signal onto the orthogonal complement of the joint structure C(M_1) ∩ C(M_2): I(d) = C(P^⊥_{J(1,2)} M_d) for d = 1, 2. The individual structures represent view-specific patterns. By construction, they are orthogonal to the joint structure and have zero intersection with each other (albeit they are not necessarily orthogonal to each other).

A.2 SLIDE identifiability example
Web Appendix A.3 in Gaynanova and Li (2019) provides the SLIDE decomposition of given M_d when D = 3. Starting with R^SLIDE_d = M_d, we identify each structure as follows:

1. (Individual structure) For view d = 1, 2, 3: (a) set U^(d), the individual score for view d, to be the orthonormal basis vectors of the corresponding individual subspace; (b) set V^(d) to be the corresponding loading matrix.
2. (Partially-shared structure) For (k, l) = (1, 2), (1, 3), (2, 3): (a) set U^(kl), the partially-shared score for views k and l, to be the orthonormal basis vectors of the corresponding partially-shared subspace; (b) set V^(kl) to be the corresponding loading matrix.
3. (Joint structure) (a) Set the joint score U^(123) to be the orthonormal basis vectors for C(R^SLIDE), where R^SLIDE concatenates the remaining R^SLIDE_d; (b) set V^(123) to be the corresponding loading matrix.
Example A2 Recall the signal matrices M_1, M_2, M_3 in (16) from Example 1. The signal matrices in (16) do not admit a unique SLIDE decomposition because the identifiability condition (the orthogonality between the signal structures) is violated. To see this, we apply the above construction of the SLIDE decomposition to the M_d in (16) under two different orders of score and loading calculation:

• Order 1: We calculate the individual scores and loadings of the 1st, 2nd and 3rd views, followed by the partially-shared scores and loadings of (i) the 1st and 2nd views; (ii) the 1st and 3rd views; (iii) the 2nd and 3rd views. This order gives the decomposition (17) of the signal matrices in (16), with the score and loading matrices rounded to the second decimal.

• Order 2: We calculate the individual scores and loadings of the 3rd, 1st and 2nd views, followed by the partially-shared scores and loadings of (i) the 2nd and 3rd views; (ii) the 1st and 3rd views; (iii) the 1st and 2nd views. This gives the decomposition (18) of the signal matrices in (16), with the score and loading matrices rounded to the second decimal.

While the decomposition (17) identifies (a) the joint structure with rank 3; (b) the partially-shared structure of the 1st and 2nd views with rank 1; and (c) the individual structure of the 2nd view with rank 1, the decomposition (18) identifies (a) the joint structure with rank 3; (b) the partially-shared structure of the 2nd and 3rd views with rank 1; and (c) the individual structure of the 3rd view with rank 1.
By its definition in Algorithm 4, each proximity operator in (21) can be further simplified. Taking D_1 as an example, the corresponding proximity operator admits a closed form. Similarly, the other proximity operators in (21) can also be simplified, which leads to the same update of the dual variables as in Algorithm 1 of the main paper.
2. For the other part in Algorithm 1 of the main paper, we rewrite x_{t+1} = x_t − Σ_{j=1}^{J} A_j (y^{t+1}_j − y^t_j) according to Table 3. This gives the update (22). Thus, we also verify that (22) is the same as the update of the primal variables in Algorithm 1 of the main paper.

A.4 Empirical result of refitting procedure

A.5 Selection of the tuning parameter grid

Recall the reparametrized penalty (23), where τ, κ > 0 control the overall shrinkage amount for the individual and pairwise matrices, respectively, and the relative weights are chosen as in (24). In (24), λ^SURE_d denotes the optimal cutoff of S(X_d, λ_d) chosen by Stein's Unbiased Risk Estimate (SURE) (Candès et al., 2013), i.e., the minimizer of SURE(S(X_d, λ_d)), where SURE(S(X_d, λ_d)) is the unbiased risk estimate with div(S(X_d, λ_d)) as the degrees-of-freedom estimate, and the noise level s_d comes from the underlying model. Similarly, λ^SURE_kl is the optimal cutoff of S(X^(kl), λ_kl) chosen by the SURE criterion. In practice, calculating the SURE criterion (26) requires the noise level as an input, for which we use the median absolute deviation (MAD) estimator (Gavish and Donoho, 2017) via the R package denoiseR (Josse et al., 2016).
Step 2: Consider all combinations of the grid points of τ, κ and λ, that is, the Cartesian product G = S_τ × S_κ × S_λ, where S_τ, S_κ and S_λ denote the sets of the above three sequences of τ, κ and λ.
Step 3: Consider the hyperplane H(τ, κ, λ): aτ + bκ + cλ = d passing through (τ_max, 0, 0), (0, κ_max, 0) and (0, 0, λ_max). Given the first two coordinates (τ, κ), let h(τ, κ) be the value of the third coordinate such that (τ, κ, h(τ, κ)) lies on this hyperplane. Then our tuning grid is chosen as {(τ, κ, λ) ∈ G : λ ≤ h(τ, κ)}. That is, we only keep the elements of the set G whose third coordinate is at most h(τ, κ); the resulting {(τ_l, κ_l, λ_l)}_{l=1}^{L} forms a tetrahedron. This construction is motivated by avoiding the case where all singular values are truncated to zero.

A.6 One standard error rule for choosing tuning parameters
Given the grid of tuning parameters {(τ_l, κ_l, λ_l)}_{l=1}^{L}, let Avg Err_l = 4^{−1} Σ_{j,k=1}^{2} BCVErr_{j,k,l} be the average BCV error for combination l, and let Avg Err_min = min_{1≤l≤L} Avg Err_l be the minimal error. Let SE(Avg Err_min) be the standard error estimate of Avg Err_min,

SE(Avg Err_min) = [ (1/4) Σ_{j,k=1}^{2} {BCVErr_{j,k,min} − Avg Err_min}^2 / (4 − 1) ]^{1/2},

where BCVErr_{j,k,min} denotes the BCV error of the held-out X^{j,k}_d at the value of the parameters that minimizes the average BCV error. To choose the most parsimonious model in terms of the total rank of the estimated concatenated M̂^refit, we propose to pick the optimal (τ*, κ*, λ*) based on the one standard error rule (Hastie et al., 2009). Specifically, let rank{M̂^refit(τ_l, κ_l, λ_l)} denote the total rank of the concatenated M̂^refit obtained using the full data X with parameters (τ_l, κ_l, λ_l). We determine the optimal tuning parameters as

(τ*, κ*, λ*) = argmin rank{M̂^refit(τ_l, κ_l, λ_l)} subject to Avg Err_l ≤ Avg Err_min + SE(Avg Err_min).    (27)

In (27), we consider all (τ_l, κ_l, λ_l) whose corresponding average BCV error is at most one standard error larger than the minimal error, and select among them the (τ*, κ*, λ*) that minimizes the total rank on the full dataset.
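A minimal R sketch of the selection rule (27), assuming the BCV errors and the full-data total ranks have been computed for all L parameter combinations:

```r
# One-standard-error rule (27): among combinations within one SE of the
# minimal average BCV error, pick the one with the smallest total rank.
one_se_select <- function(bcv_err, total_rank) {
  # bcv_err: L x 4 matrix of BCVErr_{j,k,l}; total_rank: length-L vector
  avg <- rowMeans(bcv_err)
  l_min <- which.min(avg)
  se <- sqrt(sum((bcv_err[l_min, ] - avg[l_min])^2) / (4 - 1)) / 2
  eligible <- which(avg <= avg[l_min] + se)
  eligible[which.min(total_rank[eligible])]   # index of the selected l
}
```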
B Supplementary material for Section 3

B.1 Implementation details
We compare the performance of our approach with the following methods:
1. JIVE (Lock et al., 2013), using the R package r.jive (O'Connell and Lock, 2016) with (i) the default permutation method for rank selection and (ii) the true ranks.
2. AJIVE (Feng et al., 2018), using the MATLAB code available at https://github.com/MeileiJiang/AJIVE Project, with the initial ranks chosen using the profile likelihood approach (Zhu and Ghodsi, 2006) via the R package igraph.
3. SLIDE (Gaynanova and Li, 2019), using the R package SLIDE (https://github.com/irinagain/SLIDE) with their adapted 2 × 2 BCV scheme for rank selection.
4. BIDIFAC (Park and Lock, 2020) and BIDIFAC+ (Lock et al., 2022). We use the R code available at https://github.com/lockEF/bidifac to implement BIDIFAC+, and modify this code for the multi-view case to implement BIDIFAC. Since neither approach makes a bias adjustment to account for the penalization, we also apply our refitting step (Algorithm 2) with both methods for a fair comparison. The corresponding results are labeled BIDIFAC refit and BIDIFAC+ refit.
For HNN, JIVE, AJIVE and SLIDE, each X_d is first column-centered and scaled so that ‖X_d‖_F = 1, in agreement with the standard preprocessing of data matrices from multiple views (Smilde et al., 2003; Lock et al., 2013). For BIDIFAC and BIDIFAC+, the scaling is done as recommended in Park and Lock (2020) and Lock et al. (2022) by using the MAD estimator (Gavish and Donoho, 2017) of the noise level.
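A minimal sketch of this preprocessing for a single view (the BIDIFAC/BIDIFAC+ MAD-based scaling is omitted):

```r
# Column-center each view, then scale to unit Frobenius norm
preprocess_view <- function(Xd) {
  Xc <- scale(Xd, center = TRUE, scale = FALSE)
  Xc / sqrt(sum(Xc^2))
}
```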

B.2 Data generation
(a) Two views (D = 2). We set n = 150 and p_1 = p_2 = 50. The signal matrices M_1 and M_2 are generated with the joint and individual ranks s_0 = 2 and s_1 = s_2 = 4. For the individual scores, we consider two cases: orthogonal and non-orthogonal. In the non-orthogonal case, the four principal angles between C(U_1) and C(U_2) are set to be approximately 30 degrees. Given U_1 and U_2, the joint structure C(U_0) is set to be orthogonal to both individual spaces. The elements of the diagonal matrices D_0 ∈ R^{s_0×s_0}, D_1 ∈ R^{s_1×s_1}, D_2 ∈ R^{s_2×s_2} are drawn independently from Unif(1.0, 1.5). The noise matrix is generated such that SNR = 1.

B.3 Results with different BCV splits
Figure 6 shows the results of our HNN estimator based on the one-SE rule with 10 different 2 × 2 BCV splits of the same simulated data for the non-orthogonal setting when D = 2. While two splits (the 3rd and 5th) show a large Frobenius norm error and smaller rank estimates than the true rank, the other BCV splits show smaller Frobenius norm errors with more precise rank estimates.

Figure 1: Illustration of Definition 1 applied to the signals in Example 1. The solid arrows (in yellow) indicate the corresponding intersections of column spaces (joint structures). The dash-dotted lines (in red) together with the dotted arrows (in blue) indicate the remainder terms after accounting for the joint structures. Observe the missing J(2, 3) (due to the zero intersection of R^(1)_2 and R^(1)_3), and the missing R^(2)_1 (due to R^(1)_1 being fully captured by J(1, 2) and J(1, 3)).

Figure 3: Boxplots of scaled squared Frobenius norm errors (15) and rank estimates of the concatenated signals and each M_d over 100 independent replications for the Orthogonal scheme (left column) and the Non-orthogonal scheme (right column) when D = 2. The true ranks are indicated by the red dotted line.

Figure 4: Corresponding results for the D = 3 setting over 100 independent replications.

Figure 5 illustrates the performance of our refitting method in adjusting the bias of the singular values.

Figure 5: Scatter plots of the singular values of M̂_d (green triangles), M̂^refit_d (blue squares) and the true signal M_d (red circles) for d = 1, 2, 3. The figure is based on one replication from the additional simulation studies in Appendix C, with the tuning parameters chosen by BCV as in Section 2.6.

Figure 6: Each panel shows the scaled Frobenius norm error (top left) and the rank estimates of M (top right), M_1 (bottom left) and M_2 (bottom right) from HNN by the one-SE rule over 10 different 2 × 2 BCV splits of the same simulated data with the non-orthogonal scheme when D = 2. The true ranks are indicated by the red dotted line.

Figure 7: Boxplots of scaled squared Frobenius norm errors (15) and rank estimates of the concatenated signals and each M_d over 100 independent replications for the Orthogonal scheme (left column) and the Non-orthogonal scheme (right column) when D = 3, with the ranks s_0 = s_1 = s_2 = s_3 = s_12 = s_13 = s_23 = 2. The true ranks are indicated by the red dotted line.

Table 1: Rank estimates of each view with the percentage of explained variation (rows 1-3), and the estimated ranks of each signal structure (rows 4-10). For both BIDIFAC and BIDIFAC+, we present the corresponding results with and without refitting.

Table 3: Correspondence between the notation in Abboud et al. (2017) and our work; notation in the same row corresponds to each other. vec(·) denotes vectorization of a matrix (i.e., stacking the columns of the input matrix into a single vector). The subscripts of I and 0 indicate the dimensions of the identity and zero matrices, respectively.