A technical review of canonical correlation analysis for neuroscience applications

Abstract Collecting comprehensive data sets from the same subject has become standard in neuroscience research, and uncovering multivariate relationships among the collected data sets has gained significant attention in recent years. Canonical correlation analysis (CCA) is a powerful multivariate tool for jointly investigating relationships among multiple data sets: it can uncover disease or environmental effects in various modalities simultaneously and comprehensively characterize changes during development, aging, and disease progression. In the past 10 years, although an increasing number of studies have utilized CCA in multivariate analysis, simple conventional CCA dominates these applications. Multiple CCA-variant techniques have been proposed to improve model performance; however, their complicated multivariate formulations and poorly known capabilities have delayed their wide application. Therefore, in this study, a comprehensive review of CCA and its variant techniques is provided. Detailed technical formulations with analytical and numerical solutions, current applications in neuroscience research, and the advantages and limitations of each CCA-related technique are discussed. Finally, a general guideline is provided on how to select the most appropriate CCA-related technique based on the properties of available data sets and the particular targeted neuroscience question.

Collecting comprehensive data sets from the same subject has become standard in neuroscience research. Each of these data types, termed a modality here, contains multiple measurements and provides a unique view of the subject. These measurements can be raw data (e.g., neuropsychological tests) or derived information (e.g., brain regional volume and thickness measures derived from T1-weighted MRI).
Neuroscience research has focused on uncovering associations between measurements from multiple modalities. Conventionally, a single measurement is selected from each modality, and their one-to-one univariate association is analyzed. Multiple-comparison correction is then performed to guarantee statistically meaningful results. These univariate associations have illuminated numerous findings in various neurological diseases, such as the association between gray-matter density and Mini Mental State Examination score in Alzheimer's disease (Baxter et al., 2006), the correlation between brain network temporal dynamics and Unified Parkinson's Disease Rating Scale part III motor scores in Parkinson's disease subjects, and the relationship between imaging biomarkers and cognitive performance in fighters with repetitive head trauma (Mishra et al., 2017).

However, one-to-one univariate associations overlook the multivariate joint relationships among multiple measurements between modalities. Furthermore, when dealing with brain imaging data, highly correlated noise further decreases the effectiveness and sensitivity of mass-univariate voxel-wise analysis (Cremers, Wager, & Yarkoni, 2017; Zhuang et al., 2017), and different multiple-comparison correction methods might lead to different sets of statistically significant results. Multivariate analysis, alternatively, uncovers the joint covariate patterns among different modalities and avoids multiple-comparison correction, making it more appropriate for disentangling the joint relationships between modalities and guaranteeing full utilization of all common information.

Canonical correlation analysis (CCA) is one candidate for uncovering these joint multivariate relationships among different modalities. CCA is a statistical method that finds linear combinations of two sets of random variables so that the correlation between the combined variables is maximized (Hotelling, 1936). CCA can identify the source of common statistical variation among multiple modalities without assuming any particular form of directionality, which suits neuroscience applications.

In practice, CCA has mainly been implemented as a substitute for the univariate general linear model (GLM) to link different modalities and is therefore a major and powerful tool in multimodal data fusion. Multiple CCA variants, including kernel CCA, constrained CCA, deep CCA, and multiset CCA, have also been applied in neuroscience research.

However, their complicated multivariate formulations and poorly known capabilities remain obstacles to the wide application of CCA and its variants.

In this study, we review CCA applications in neuroscience research from a technical perspective to improve the understanding of the CCA technique itself and to provide neuroscience researchers with guidelines for proper CCA applications. We briefly discuss studies through December 2019 that have utilized CCA and its variants to uncover associations between multiple modalities. We explain CCA and its variants in terms of their formulations, properties, relationships to other multivariate techniques, and advantages and limitations in neuroscience applications. We finally provide a flowchart and an experimental example to assist researchers in selecting the most appropriate CCA technique for their specific applications.

| INCLUSION/EXCLUSION OF STUDIES
Using the PubMed search engine in December 2019, we searched for neuroimaging or neuroscience articles using CCA with the following string: ("canonical correlation" analysis) AND (neuroscience OR neuroimaging). This search yielded 192 articles; 11 additional articles were included based on the authors' preidentification. We excluded non-English articles, conference abstracts, and duplicated studies, yielding 188 articles assessed for eligibility. We further identified 160 studies that met the following criteria: (a) primarily focusing on a CCA or CCA-variant technique and (b) including an application to neuroimaging or neuroscience modalities. Reasons for exclusion and the number of articles meeting the exclusion criteria at each stage are shown in Figure 1. In the following sections, we present technical details (Section 3) and neuroscience applications for each category (Section 4). In Section 5, we discuss technical differences and summarize the advantages and limitations of each CCA-related technique. We finally provide an experimental example and guidance in Section 6 for researchers who are interested in applying multivariate CCA-related techniques in their work. Figure 3 shows the detailed CCA equations (red box) and the linkages between CCA and its variants. Constrained CCA (yellow boxes), nonlinear CCA (gray boxes), and multiset CCA (orange boxes) are the focus of this review; linkages between CCA and other univariate (light green boxes) and multivariate (dark green boxes) techniques are also included.

| TECHNICAL DETAILS
Here, we provide the basic formulation and solution of CCA and each of its variants. We also discuss how CCA is mathematically linked to its variants and to other multivariate and univariate techniques. Researchers interested in further details can refer to the corresponding references.

| Conventional CCA
Formulations. CCA is designed to maximize the correlation between two latent variables $y_1 \in \mathbb{R}^{p_1 \times 1}$ and $y_2 \in \mathbb{R}^{p_2 \times 1}$, which are also referred to as modalities. Here, we denote $Y_k \in \mathbb{R}^{N \times p_k}$, $k = 1, 2$ as collected samples of these two variables, where $N$ represents the number of observations (samples) and $p_k$, $k = 1, 2$ represent the number of features in each variable. CCA determines the canonical coefficients $u_1 \in \mathbb{R}^{p_1 \times 1}$ and $u_2 \in \mathbb{R}^{p_2 \times 1}$ for $Y_1$ and $Y_2$, respectively, by maximizing the correlation between $Y_1 u_1$ and $Y_2 u_2$:

$$\rho = \max_{u_1, u_2} \frac{u_1^T \Sigma_{12} u_2}{\sqrt{u_1^T \Sigma_{11} u_1} \sqrt{u_2^T \Sigma_{22} u_2}} \quad (1)$$

In Equation (1), $\Sigma_{11}$ and $\Sigma_{22}$ are the within-set covariance matrices and $\Sigma_{12}$ is the between-set covariance matrix. The denominator in Equation (1) normalizes the within-set covariance, which guarantees that CCA is invariant to the scaling of the coefficients.
Solutions. The canonical coefficients $u_1$ and $u_2$ can be found by setting the partial derivatives of the objective function (Equation (1)) with respect to $u_1$ and $u_2$ to zero, respectively, leading to:

$$\Sigma_{12} u_2 = \rho \Sigma_{11} u_1 \quad \text{and} \quad \Sigma_{21} u_1 = \rho \Sigma_{22} u_2 \quad (2)$$

Equation (2) can be further reduced to a classical eigenvalue problem, if $\Sigma_{kk}$ is invertible:

$$\Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} u_1 = \rho^2 u_1, \qquad \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} u_2 = \rho^2 u_2 \quad (3)$$

Each pair of canonical coefficients $\{u_1, u_2\}$ are the eigenvectors of $\Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$ and $\Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}$, respectively, with the same eigenvalue $\rho^2$. Following Equation (3), up to $M = \min(p_1, p_2)$ pairs of canonical coefficients can be obtained through singular value decomposition (SVD), and every pair of canonical variables is uncorrelated with the other pairs. As stated above, one requirement for solving the CCA problem (Equation (1)) through this eigenvalue problem (Equation (3)) is that the within-set covariance matrices $\Sigma_{11}$ and $\Sigma_{22}$ must be invertible.

F I G U R E 1 Inclusion and exclusion criteria for this review

F I G U R E 2 Number of articles summarized by category (a) and year (b)
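To make Equations (1)-(3) concrete, the following minimal sketch (in Python with NumPy; all code examples in this review are illustrative sketches rather than reference implementations) solves conventional CCA via Cholesky whitening of the within-set covariances followed by an SVD. It assumes $N > p_1, p_2$ so that $\Sigma_{11}$ and $\Sigma_{22}$ are invertible; the function and variable names are ours.

```python
import numpy as np

def cca(Y1, Y2):
    """Minimal conventional CCA sketch (Equations (1)-(3)).

    Y1: (N, p1), Y2: (N, p2). Assumes N > p1, p2 so that the
    within-set covariance matrices are invertible. Returns the
    canonical correlations rho (descending) and the canonical
    coefficients U1 (p1, M), U2 (p2, M), with M = min(p1, p2).
    """
    Y1 = Y1 - Y1.mean(axis=0)        # column-center each modality
    Y2 = Y2 - Y2.mean(axis=0)
    N = Y1.shape[0]
    S11 = Y1.T @ Y1 / (N - 1)        # within-set covariances
    S22 = Y2.T @ Y2 / (N - 1)
    S12 = Y1.T @ Y2 / (N - 1)        # between-set covariance
    L1 = np.linalg.cholesky(S11)     # S11 = L1 @ L1.T
    L2 = np.linalg.cholesky(S22)
    # Whitened between-set covariance; its singular values are the rho's
    K = np.linalg.solve(L1, S12) @ np.linalg.inv(L2).T
    A, rho, Bt = np.linalg.svd(K, full_matrices=False)
    U1 = np.linalg.solve(L1.T, A)    # back-transform to coefficient space
    U2 = np.linalg.solve(L2.T, Bt.T)
    return rho, U1, U2

# Toy usage: two noisy views of one shared latent source
rng = np.random.default_rng(0)
z = rng.standard_normal((200, 1))
Y1 = z @ rng.standard_normal((1, 5)) + 0.5 * rng.standard_normal((200, 5))
Y2 = z @ rng.standard_normal((1, 4)) + 0.5 * rng.standard_normal((200, 4))
rho, U1, U2 = cca(Y1, Y2)
print(rho)  # the first canonical correlation should be large
```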
Statistical inference. The null hypothesis of the parametric significance test for CCA is that all canonical correlation values are zero; the alternative hypothesis is that at least one canonical correlation value is nonzero. A test statistic based on Wilks' $\Lambda$ is (Bartlett, 1939):

$$\chi^2 = -\left(N - 1 - \frac{p_1 + p_2 + 1}{2}\right) \ln \prod_{m=1}^{M} \left(1 - \rho_{(m)}^2\right) \quad (4)$$

which follows a chi-square distribution with $p_1 \times p_2$ degrees of freedom. It is also of interest to test whether a specific canonical correlation value ($\rho_{(m)}$, $1 \le m \le M$) is different from zero. In this case, the test statistic in Equation (4) becomes:

$$\chi^2 = -\left(N - 1 - \frac{p_1 + p_2 + 1}{2}\right) \ln \prod_{j=m}^{M} \left(1 - \rho_{(j)}^2\right) \quad (5)$$

which follows a chi-square distribution with $(p_1 - m + 1)(p_2 - m + 1)$ degrees of freedom. In practice, this parametric inference is not commonly used since it requires the variables to strictly follow the Gaussian distribution and is sensitive to outliers (Bartlett, 1939).
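A sketch of Bartlett's chi-square test in Equation (4), under the same Gaussian assumption noted above; the function name is ours. For the toy example above, `bartlett_cca_test(rho, N=200, p1=5, p2=4)` would strongly reject the null hypothesis.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_cca_test(rho, N, p1, p2):
    """Bartlett's (1939) test of H0: all canonical correlations are
    zero (Equation (4)). rho: estimated canonical correlations from
    N samples of p1- and p2-dimensional variables."""
    wilks_lambda = np.prod(1.0 - np.asarray(rho) ** 2)
    stat = -(N - 1 - (p1 + p2 + 1) / 2.0) * np.log(wilks_lambda)
    df = p1 * p2
    return stat, chi2.sf(stat, df)   # test statistic and p value
```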

| CCA variants
Constrained CCA

Formulation. The conventional CCA problem (Equation (1)) can be extended by adding penalty terms (constraints) on the canonical coefficients to the objective function in Equation (1). Penalties can be either equality constraints or inequality constraints, and, based on researchers' own considerations, penalties can be added to either $u_1$ or $u_2$, or to both. Therefore, in general, the constrained CCA problem can be formulated as the constrained optimization problem:

$$\max_{u_1, u_2} \frac{u_1^T \Sigma_{12} u_2}{\sqrt{u_1^T \Sigma_{11} u_1} \sqrt{u_2^T \Sigma_{22} u_2}}, \quad \text{s.t. } con_i(u_1, u_2) = 0, \; \forall i \in E; \quad con_i(u_1, u_2) \le 0, \; \forall i \in InE \quad (6)$$

where $E$ represents the set of equality constraints and $InE$ represents the set of inequality constraints.
Special case: L1-norm penalty and sparse CCA

Formulation. The most commonly implemented penalty in constrained CCA is the L1-norm penalty added to either $u_1$ or $u_2$, termed sparse CCA:

$$\max_{u_1, u_2} \frac{u_1^T \Sigma_{12} u_2}{\sqrt{u_1^T \Sigma_{11} u_1} \sqrt{u_2^T \Sigma_{22} u_2}}, \quad \text{s.t. } |u_i|_1 < c_i \quad (7)$$

where $|u_i|_1 < c_i$ are inequality constraints.
The L1-norm penalty induces sparsity on the canonical coefficients, and therefore sparse CCA can be applied to high-dimensional variables. When dealing with high-dimensional variables, the within-set covariance matrices $\Sigma_{11}$ and $\Sigma_{22}$ in Equation (7) are also high-dimensional matrices, which are memory intensive. In addition, when the number of observations is less than the number of features, the covariance matrices cannot be estimated reliably from the sample. In these cases, the within-set covariance matrices are usually replaced by identity matrices, and sparse CCA is then equivalent to sparse PLS.
Please note that researchers may still refer to this technique as sparse CCA even after this replacement (Witten, Tibshirani, & Hastie, 2009).
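A minimal single-component sketch in the spirit of the penalized matrix decomposition of Witten et al. (2009): the within-set covariances are replaced by identity matrices (so it is equivalently sparse PLS, as noted above), and the coefficients are updated by alternating soft-thresholded power iterations. The fixed thresholds delta1/delta2 are an illustrative simplification; Witten et al. instead choose them by binary search to satisfy the $|u_i|_1 \le c_i$ constraints.

```python
import numpy as np

def soft_threshold(a, delta):
    """Element-wise soft-thresholding, the proximal map of the L1 penalty."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def sparse_cca(Y1, Y2, delta1=0.1, delta2=0.1, n_iter=100):
    """Single-component sparse CCA sketch with identity within-set
    covariances, solved by alternating soft-thresholded updates."""
    Y1 = Y1 - Y1.mean(axis=0)
    Y2 = Y2 - Y2.mean(axis=0)
    S12 = Y1.T @ Y2                    # between-set cross-product
    u2 = np.linalg.svd(S12)[2][0]      # initialize from the leading SVD
    for _ in range(n_iter):
        u1 = soft_threshold(S12 @ u2, delta1)
        u1 /= np.linalg.norm(u1) + 1e-12
        u2 = soft_threshold(S12.T @ u1, delta2)
        u2 /= np.linalg.norm(u2) + 1e-12
    return u1, u2                      # sparse canonical coefficients
```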
With known prior information about features or observations, sparse CCA can be further modified into structure sparse CCA or discriminant sparse CCA, respectively. If the known prior information concerns features, such as a categorization of features into different groups (Lin et al., 2014) or characterized connections between features (Kim et al., 2019), the prior information is implemented as an additional penalty on features, leading to structure sparse CCA. Alternatively, if the known prior information concerns observations, such as the diagnostic group of each subject, the prior information is implemented as an additional constraint on observations, leading to discriminant sparse CCA.
Solutions. Sparse CCA, structure sparse CCA, and discriminant sparse CCA can all be considered special cases of the generalized constrained CCA problem (Equation (6)) with different equality and inequality constraint sets. The iterative optimization techniques used to solve the generalized constrained CCA problem are also applicable to these special cases.

| Nonlinear CCA
Formulation: Kernel CCA. Kernel CCA projects $Y_1$ and $Y_2$ onto a new feature space through a predefined kernel function. This new feature space is not explicitly defined. Instead, the original feature space for each observation in $Y_k$ is implicitly projected to a higher-dimensional feature space, $Y_k \rightarrow \phi(Y_k)$, embedded in a prespecified kernel matrix $H_k \in \mathbb{R}^{N \times N}$, whose size is independent of the number of features in the projected space. After transforming $u_k$ to $\phi(Y_k)^T v_k$, the CCA form of Equation (1) in the higher-dimensional feature space, namely kernel CCA, can be written as:

$$\rho = \max_{v_1, v_2} \frac{v_1^T H_1 H_2 v_2}{\sqrt{v_1^T H_1^2 v_1} \sqrt{v_2^T H_2^2 v_2}} \quad (8)$$

where $v_1$ and $v_2$ are the unknowns to estimate, instead of $u_1$ and $u_2$.
Temporal kernel CCA is a kernel CCA variant specifically designed for two time series with temporal delays. In temporal kernel CCA, one variable, for example, $Y_1$, is shifted by multiple different time lags, and a new variable $\tilde{Y}_1$ is formed by concatenating the original $Y_1$ and the temporally shifted copies of $Y_1$. The new variable $\tilde{Y}_1$ and the original $Y_2$ are then input to kernel CCA as in Equation (8).
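The shift-and-concatenate construction of $\tilde{Y}_1$ can be sketched as follows; zero-padding the boundary rows is our illustrative choice, not prescribed by the method.

```python
import numpy as np

def time_shift_embed(Y, max_lag):
    """Concatenate Y (T, p) with lagged copies of itself along the
    feature axis, forming the augmented variable used by temporal
    kernel CCA. Returns an array of shape (T, p * (max_lag + 1))."""
    T = Y.shape[0]
    parts = []
    for lag in range(max_lag + 1):
        shifted = np.zeros_like(Y)
        shifted[lag:] = Y[:T - lag]    # rows before the lag are zero-padded
        parts.append(shifted)
    return np.hstack(parts)
```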
Solution. A closed-form analytical solution exists for kernel CCA (Equation (8)). By setting the partial derivatives of the objective function in Equation (8) with respect to $v_1$ and $v_2$ to zero separately, kernel CCA can be converted to the following problem:

$$H_1 H_2 v_2 = \rho H_1^2 v_1 \quad \text{and} \quad H_2 H_1 v_1 = \rho H_2^2 v_2 \quad (9)$$

Note that the kernel CCA problem defined in Equation (9) can always be satisfied with $\rho = 1$ when the kernel matrices are invertible. To avoid this trivial solution, a penalty term with regularization parameter $\lambda$ needs to be introduced on the norm of the original canonical coefficients:

$$\rho = \max_{v_1, v_2} \frac{v_1^T H_1 H_2 v_2}{\sqrt{v_1^T H_1^2 v_1 + \lambda v_1^T H_1 v_1} \sqrt{v_2^T H_2^2 v_2 + \lambda v_2^T H_2 v_2}} \quad (10)$$

This regularized kernel CCA problem can be further represented as an eigenvalue problem (Hardoon, Szedmak, & Shawe-Taylor, 2004):

$$(H_1 + \lambda I)^{-1} H_2 (H_2 + \lambda I)^{-1} H_1 v_1 = \rho^2 v_1 \quad (11)$$

for which a closed-form solution exists in the new feature space.
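A sketch of regularized kernel CCA in the simplified eigenvalue form of Equation (11); the kernel-centering helper and the numerical guards are our additions.

```python
import numpy as np

def center_kernel(K):
    """Double-center a kernel matrix (implicitly mean-centers features)."""
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ K @ H

def kernel_cca(K1, K2, reg=0.1, n_components=1):
    """Regularized kernel CCA sketch (cf. Hardoon et al., 2004).
    K1, K2: (N, N) centered kernel matrices; reg: the regularization
    parameter lambda that avoids the trivial rho = 1 solution.
    Returns the leading correlations and dual coefficients v1, v2."""
    N = K1.shape[0]
    I = np.eye(N)
    R1 = np.linalg.solve(K1 + reg * I, K2)    # (H1 + lam I)^-1 H2
    R2 = np.linalg.solve(K2 + reg * I, K1)    # (H2 + lam I)^-1 H1
    evals, vecs = np.linalg.eig(R1 @ R2)
    order = np.argsort(-evals.real)[:n_components]
    rho = np.sqrt(np.clip(evals.real[order], 0.0, 1.0))
    v1 = vecs.real[:, order]
    # recover v2 from the stationarity conditions
    v2 = np.linalg.solve(K2 + reg * I, K1 @ v1) / np.maximum(rho, 1e-12)
    return rho, v1, v2
```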
Deep CCA

Formulation. Deep CCA (Andrew, Arora, Bickson, & Livescu, 2013) replaces the linear transforms in Equation (1) with two deep neural networks $f_1(Y_1; \theta_1)$ and $f_2(Y_2; \theta_2)$, with network parameters $\theta_1$ and $\theta_2$, and maximizes the canonical correlation between the network outputs.

Solution. Deep CCA is solved through a deep learning scheme by dividing the original data into training and testing sets. $\theta_1$ and $\theta_2$ are optimized by following the gradient of the correlation objective as estimated on the training data (Andrew et al., 2013).

| Multiset CCA

Formulation. Multiset CCA extends conventional CCA to more than two variables. The most commonly used objective function, SUMCOR, maximizes the sum of all pairwise canonical correlations:

$$\max_{u_1, \ldots, u_K} \sum_{k < l} u_k^T \Sigma_{kl} u_l \quad (12)$$

where $K > 2$ is the number of variables, and $u_k^T \Sigma_{kk} u_k$, $k = 1, \ldots, K$ is set to 1 for normalization. Besides maximizing SUMCOR, Kettenring (1971) described four additional objective functions for multiset CCA: SSQCOR (sum of squared correlations), MAXVAR (maximum variance), MINVAR (minimum variance), and GENVAR (generalized variance). In practice, SUMCOR multiset CCA is most commonly used, followed by MAXVAR and SSQCOR multiset CCA.
Solution. Analytical solutions of multiset CCA are obtained by calculating the partial derivatives of the objective function with respect to each $u_k$. Since SUMCOR and SSQCOR are linear and quadratic functions of each $u_k$, respectively, closed-form analytical solutions can be obtained for these two cost functions by setting the partial derivatives equal to zero, which leads to generalized eigenvalue problems. Multiset CCA with any of the five objective functions can also be solved by means of the general algebraic modeling system (GAMS; Brooke, Kendrick, Meeraus, & Rama, 1998) and the nonlinear programming solver CONOPT (Drud, 1985).
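A sketch of SUMCOR multiset CCA solved as a single generalized eigenvalue problem, a common relaxation in which the K Lagrange multipliers are tied together; it assumes $N$ exceeds every $p_k$ so that the block-diagonal within-set matrix is positive definite.

```python
import numpy as np
from scipy.linalg import eigh

def multiset_cca_sumcor(Ys):
    """SUMCOR multiset CCA sketch: solve C u = lambda D u, where C is
    the covariance of the column-wise concatenation of all K data
    sets and D is its block-diagonal (within-set) part. Ys: list of
    (N, p_k) arrays. Returns one coefficient vector per modality."""
    Ys = [Y - Y.mean(axis=0) for Y in Ys]
    C = np.cov(np.hstack(Ys), rowvar=False)   # full covariance
    sizes = [Y.shape[1] for Y in Ys]
    D = np.zeros_like(C)
    start = 0
    for p in sizes:                           # copy within-set blocks
        D[start:start + p, start:start + p] = C[start:start + p,
                                                start:start + p]
        start += p
    evals, evecs = eigh(C, D)                 # generalized eigenproblem
    w = evecs[:, -1]                          # leading eigenvector
    us, start = [], 0
    for p in sizes:                           # split back per modality
        us.append(w[start:start + p])
        start += p
    return us
```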

Multiset CCA with constraints
In constrained multiset CCA, penalty terms can be added to each u i individually. Here we give examples of two commonly incorporated constraints in multiset CCA: sparse multiset CCA and multiset CCA with reference.
Formulation: Sparse multiset CCA. Similar to sparse CCA, sparse multiset CCA applies the L1-norm penalty to one or more $u_k$ in Equation (12), and therefore induces sparsity on the canonical coefficient(s) and can be applied to high-dimensional variables. Here, we give the equation of SUMCOR sparse multiset CCA as an example:

$$\max_{u_1, \ldots, u_K} \sum_{k < l} u_k^T \Sigma_{kl} u_l, \quad \text{s.t. } u_k^T \Sigma_{kk} u_k = 1, \; |u_k|_1 \le c_k \quad (14)$$

Formulation: Multiset CCA with reference. Multiset CCA with reference enables the discovery of multimodal associations with a specific reference variable across subjects, such as a neuropsychological measurement. In multiset CCA with reference, additional constraints on the correlations between each canonical variable and the reference variable ($v_{ref}$) are added:

$$\max_{u_1, \ldots, u_K} \sum_{k < l} u_k^T \Sigma_{kl} u_l + \lambda \sum_{k=1}^{K} \left\| \mathrm{corr}(Y_k u_k, v_{ref}) \right\|_2^2 \quad (15)$$

where $\lambda > 0$ is the tuning parameter and $\|\cdot\|_2^2$ is the squared L2-norm. Therefore, multiset CCA with reference is a supervised multivariate technique that can extract common components across multiple variables that are associated with a specific prior reference.

Solution. Equations (14) and (15) can be viewed as constrained optimization problems with an objective function and multiple equality and inequality constraints. In this case, iterative optimization techniques are required to solve constrained multiset CCA problems.

| Other CCA-related techniques
Many other CCA-related techniques have been developed; here we include only three that have been applied in the neuroscience field: supervised local CCA, Bayesian CCA, and tensor CCA.

Supervised local CCA
CCA by formulation is an unsupervised technique that uncovers joint relationships between two variables. Meanwhile, CCA can become a supervised technique by (a) adding additional constraints such as CCA (multiset CCA) with reference discussed in the section "Multiset CCA with constraints," or (b) directly incorporating group information into the objective function as in the supervised local CCA technique (Zhao et al., 2017).
Supervised local CCA is based on locally discriminant CCA (Peng, Zhang, & Zhang, 2010), which uses local group information to construct a between-set covariance matrix $\tilde{\Sigma}_{12}$ as a replacement of $\Sigma_{12}$ in Equation (1). More specifically, $\tilde{\Sigma}_{12}$ is defined as the covariance matrix from the $d$ nearest neighboring within-class samples ($\Sigma_w$) penalized by the covariance from the $d$ nearest neighboring between-class samples ($\Sigma_b$) with a tuning parameter $\lambda$:

$$\tilde{\Sigma}_{12} = \Sigma_w - \lambda \Sigma_b \quad (16)$$

However, this technique considers only the local group information and ignores the global discriminating information. To address this issue, supervised local CCA additionally incorporates Fisher discrimination information together with the local group information (Zhao et al., 2017), through the between-group scatter matrices $S_k$ of data set $k$ and a class-membership weighting matrix $U$: if samples $i$ and $j$ belong to the $c$th class, $U_{ij}$ is set to $1/n_c$, where $n_c$ denotes the number of samples in the $c$th class; otherwise, $U_{ij}$ is set to 0. Supervised local CCA is usually applied sequentially with gradually decreased $d$ (named hierarchical supervised local CCA) to reduce the influence of the neighborhood size and improve classification performance.

Bayesian CCA
Bayesian CCA is another technique that overcomes the overfitting problem when applying CCA to variables with small sample sizes. Bayesian CCA has also been proposed to complement CCA by providing a principal component analysis (PCA)-like description of the variations that are not captured by the correlated components (Klami, Virtanen, & Kaski, 2013). The input to CCA in Equation (1), $Y_1$ and $Y_2$, can be considered as $N$ observations of the random variables $y_1 \in \mathbb{R}^{p_1 \times 1}$ and $y_2 \in \mathbb{R}^{p_2 \times 1}$. Using the same notation, Bayesian CCA can be formulated as a latent variable model (with latent variable $z$) between $y_1$ and $y_2$ (Klami & Kaski, 2007; Wang, 2007):

$$z \sim \mathcal{N}(0, I), \quad z_k \sim \mathcal{N}(0, I), \quad y_k \sim \mathcal{N}(A_k z + B_k z_k, D_k), \quad k = 1, 2 \quad (18)$$

where $\mathcal{N}(0, I)$ denotes the multivariate Gaussian distribution with mean vector 0 and identity covariance matrix $I$. The $D_k$ are diagonal covariance matrices, indicating that the features in $y_k$ have independent noise. The latent variable $z \in \mathbb{R}^{q \times 1}$, where $q$ represents the number of shared components, captures the shared variation between $y_1$ and $y_2$, and can be linearly transformed back to the original space of $y_k$ through $A_k z$, $k = 1, 2$. Similarly, the latent variable $z_k \in \mathbb{R}^{q_k \times 1}$, where $q_k$ represents the number of variable-specific components, captures the variable-$k$-specific variation not shared between $y_1$ and $y_2$, and can be linearly transformed back to the original space of $y_k$ through $B_k z_k$. Browne (1979) demonstrated that the model in Equation (18) is equivalent to conventional CCA (Equation (1)) in the sense that their solutions share the same subspace. However, unlike conventional CCA (Equation (1)), which uses two variables $u_1$ and $u_2$ to project $y_1$ and $y_2$ to this subspace, Bayesian CCA maintains the shared variation between $y_1$ and $y_2$ in a single variable $z$.
The formulation of $y_k$ in Equation (18) can be rewritten as $y_k \sim \mathcal{N}(A_k z, \Psi_k)$ with $\Psi_k = B_k B_k^T + D_k$, $k = 1, 2$, after marginalizing out the variable-specific latent variables $z_k$. With this substitution, Equation (18) can be transformed to:

$$z \sim \mathcal{N}(0, I), \quad y_k \sim \mathcal{N}(A_k z, \Psi_k), \quad k = 1, 2 \quad (19)$$

In Equation (19), prior knowledge of the parameters (e.g., $A_k$ and $\Psi_k$) is required to construct the latent variable model for Bayesian CCA. For instance, the inverse Wishart distribution as a prior for the covariance $\Psi_k$ and the automatic relevance determination (ARD; Neal, 2012) prior for the linear mappings $A_k$ were used when Bayesian CCA was first proposed (Klami & Kaski, 2007; Wang, 2007). Since then, multiple Bayesian inference techniques have been developed; however, the early work on Bayesian CCA was limited to low-dimensional data (not more than eight dimensions in Klami & Kaski, 2007 and Wang, 2007) due to the computational complexity of estimating the posterior distribution over the $p_k \times p_k$ covariance matrices $\Psi_k$ (Klami et al., 2013). A group-wise ARD prior was later introduced to extend Bayesian CCA to higher-dimensional data; further details can be found in Klami et al. (2013).
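To make the generative model in Equation (18) concrete, the following sketch samples synthetic data from it; all dimensions and noise levels are illustrative assumptions.

```python
import numpy as np

def sample_bayesian_cca(N, p1, p2, q=2, q1=2, q2=2, seed=0):
    """Draw N samples from the Bayesian CCA latent-variable model
    (Equation (18)): a shared latent z plus modality-specific
    latents z_k, mapped through A_k and B_k, with independent
    Gaussian noise of diagonal covariance D_k."""
    rng = np.random.default_rng(seed)
    A1, A2 = rng.standard_normal((p1, q)), rng.standard_normal((p2, q))
    B1, B2 = rng.standard_normal((p1, q1)), rng.standard_normal((p2, q2))
    d1, d2 = rng.uniform(0.1, 0.5, p1), rng.uniform(0.1, 0.5, p2)
    z = rng.standard_normal((N, q))       # shared variation
    z1 = rng.standard_normal((N, q1))     # modality-1-specific variation
    z2 = rng.standard_normal((N, q2))     # modality-2-specific variation
    Y1 = z @ A1.T + z1 @ B1.T + rng.standard_normal((N, p1)) * np.sqrt(d1)
    Y2 = z @ A2.T + z2 @ B2.T + rng.standard_normal((N, p2)) * np.sqrt(d2)
    return Y1, Y2
```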
Tensor CCA

Two-dimensional CCA and tensor CCA for high-dimensional variables.
Variables input to CCA ($Y_k \in \mathbb{R}^{N \times p_k}$) are usually required to be 2D matrices with dimensions of the number of observations ($N$) times the number of features ($p_k$) in each variable; $Y_k$ can be considered as $N$ observations of the 1D variable $y_k \in \mathbb{R}^{p_k \times 1}$. In practice, tensor data, such as 3D images or 4D time series, are commonly involved in neuroscience applications, and these variables must be vectorized before being input to CCA algorithms. This vectorization can break the feature structure. To analyze 3D data, such as $N$ samples of 2D variables ($N \times p_1 \times p_2$), without breaking the 2D feature structure, two-dimensional CCA (2DCCA) has been proposed by Lee and Choi (2007).
Mathematically, 2DCCA maximizes the canonical correlation between two variables with $N$ observations of 2D features, $Y_1 \in \mathbb{R}^{N \times p_{11} \times p_{12}}$ and $Y_2 \in \mathbb{R}^{N \times p_{21} \times p_{22}}$. For each variable, 2DCCA searches for left transforms $l_1 \in \mathbb{R}^{p_{11} \times 1}$ and $l_2 \in \mathbb{R}^{p_{21} \times 1}$ and right transforms $r_1 \in \mathbb{R}^{p_{12} \times 1}$ and $r_2 \in \mathbb{R}^{p_{22} \times 1}$ that maximize the correlation between $l_1^T Y_1 r_1$ and $l_2^T Y_2 r_2$:

$$\max_{l_1, r_1, l_2, r_2} \mathrm{corr}\left(l_1^T Y_1 r_1, \; l_2^T Y_2 r_2\right) \quad (20)$$

In Equation (20), for fixed $l_1$ and $l_2$, $r_1$ and $r_2$ can be obtained with an SVD algorithm similar to the one used in conventional CCA, and $l_1$ and $l_2$ can be obtained for fixed $r_1$ and $r_2$, alternatingly. Therefore, an iterative alternating SVD algorithm (Lee & Choi, 2007) has been developed to solve Equation (20).
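A compact sketch of this alternating scheme, reusing the `cca()` sketch from the conventional CCA section; it assumes $N$ exceeds every matrix dimension so that each inner CCA is well posed.

```python
import numpy as np

def two_d_cca(Y1, Y2, n_iter=20):
    """2DCCA sketch in the spirit of Lee and Choi (2007).
    Y1: (N, p11, p12), Y2: (N, p21, p22) stacks of 2D samples.
    Alternates: fix the right transforms and solve a conventional
    CCA for the left transforms, then vice versa."""
    r1 = np.ones(Y1.shape[2])
    r2 = np.ones(Y2.shape[2])
    for _ in range(n_iter):
        # fix r1, r2: collapse the column mode, update left transforms
        _, U1, U2 = cca(Y1 @ r1, Y2 @ r2)
        l1, l2 = U1[:, 0], U2[:, 0]
        # fix l1, l2: collapse the row mode, update right transforms
        _, V1, V2 = cca(np.einsum('i,nij->nj', l1, Y1),
                        np.einsum('i,nij->nj', l2, Y2))
        r1, r2 = V1[:, 0], V2[:, 0]
    return l1, r1, l2, r2
```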
The 2DCCA described above can be treated as a constrained optimization problem with low-rank restrictions on the canonical coefficients. Similar restrictions are used in Chen, Kolar, and Tsay (2019), where 2DCCA has been extended to higher-dimensional tensor data, termed tensor CCA.
Tensor CCA (Chen et al., 2019) searches for two rank-one canonical coefficient tensors of the form $u_k = u_{k1} \circ u_{k2} \circ \cdots \circ u_{km}$, where "$\circ$" denotes the outer product and $u_{k1}, \ldots, u_{km}$ are vectors. Chen et al. (2019) also introduced an efficient optimization algorithm to solve tensor CCA for high-dimensional data sets.
Tensor CCA for multiset data. Another way to handle input variables with high-dimensional feature spaces is to generalize conventional CCA by analyzing constructed covariance tensors (Luo, Tao, Ramamohanarao, Xu, & Wen, 2015). This method requires the random variables to be vectorized and is similar to multiset CCA, since both handle more than two input modalities. The differences in this case lie in that tensor CCA constructs a high-order covariance tensor over all input variables (Luo et al., 2015), whereas multiset CCA finds pair-wise covariance matrices. In addition, tensor CCA (Luo et al., 2015) does not maximize the pairwise correlations as in multiset CCA; instead, it directly maximizes the high-order correlation over all canonical variables:

$$\max_{u_1, \ldots, u_K} \mathbf{1}^T \left(Y_1 u_1 \odot Y_2 u_2 \odot \cdots \odot Y_K u_K\right), \quad \text{s.t. } u_k^T \Sigma_{kk} u_k = 1, \; k = 1, \ldots, K \quad (21)$$

where $\odot$ denotes the element-wise product and $\mathbf{1} \in \mathbb{R}^{N \times 1}$ is an all-ones vector. The problem formulated in Equation (21) can be solved using the alternating least squares algorithm (Kroonenberg & de Leeuw, 1980).

| Statistical inferences of CCA variants
Statistical significance of CCA variants is usually determined nonparametrically through permutation tests. Besides permutation tests, a null distribution can also be built by creating null data as input to CCA-variant techniques; the null data are usually generated based on the physical properties of the input variables. For instance, when applying a CCA-variant technique to link task fMRI data and the task stimuli, null data for the task fMRI can be obtained by applying wavelet resampling to resting-state fMRI data (Breakspear, Brammer, Bullmore, Das, & Williams, 2004; Zhuang et al., 2017). The null hypothesis here is that the task fMRI data are not multivariately correlated with the task stimuli, and the wavelet-resampled resting-state fMRI data fit the requirements of null data in this case.
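A generic permutation-test sketch for the first canonical correlation: the subject pairing between the two modalities is broken by permuting the rows of one data set. The wrapper works with any `cca_fn` that returns the canonical correlations (e.g., the `cca()` sketch above via `lambda a, b: cca(a, b)[0]`).

```python
import numpy as np

def cca_permutation_test(Y1, Y2, cca_fn, n_perm=1000, seed=0):
    """Nonparametric test of the first canonical correlation.
    cca_fn(Y1, Y2) must return the canonical correlations; rows of
    Y2 are permuted to build the null distribution under the null
    hypothesis of no multivariate correlation between modalities."""
    rng = np.random.default_rng(seed)
    rho_obs = cca_fn(Y1, Y2)[0]
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(Y2.shape[0])   # break the row pairing
        null[i] = cca_fn(Y1, Y2[perm])[0]
    p_value = (1 + np.sum(null >= rho_obs)) / (1 + n_perm)
    return rho_obs, p_value
```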

| Relationship with other multivariate techniques
Conventional CCA is closely related to multivariate multiple regression (MVMR), which links two variables through

$$Y_1 u_1 = Y_2 u_2 + \varepsilon \quad (22)$$

where $u_1$ and $u_2$ are obtained by minimizing the residual term $\varepsilon \in \mathbb{R}^{N \times 1}$. Since CCA is scale-invariant, a solution to Equation (22) is also a solution of Equation (1). Furthermore, with the normalization terms $u_1^T \Sigma_{11} u_1 = 1$ and $u_2^T \Sigma_{22} u_2 = 1$, the MVMR model is exactly equivalent to CCA; that is, maximizing the canonical correlation between $Y_1$ and $Y_2$ is equivalent to minimizing the residual term $\varepsilon$. In addition, by replacing the covariance matrices $\Sigma_{11}$ and $\Sigma_{22}$ in the denominator of Equation (1) with the identity matrix $I$, conventional CCA is converted to partial least squares (PLS), which maximizes the covariance between latent variables. If $Y_1$ is the same as $Y_2$, PLS maximizes the variance within a single variable, which is equivalent to PCA.

Relationship with univariate techniques
If one variable in CCA, for example, $Y_1$, only has a single feature, that is, $y \in \mathbb{R}^{N \times 1}$, then $u_1$ can be defined as 1 and CCA becomes a linear regression problem:

$$y = X\beta + \varepsilon \quad (23)$$

where $Y_1$ is renamed as $y$ and $Y_2$ is renamed as $X$ to follow conventional notation, $\beta$ plays the role of $u_2$, and $\varepsilon \in \mathbb{R}^{N \times 1}$ denotes the residual term. If both variables $Y_1$ and $Y_2$ contain only one feature, the canonical correlation between $Y_1$ and $Y_2$ becomes the Pearson's correlation between $Y_1$ and $Y_2$, as in univariate analysis.

| NEUROSCIENCE APPLICATIONS

Multimodal data fusion
In normal healthy subjects, multiple studies using CCA have delineated joint multivariate relationships between imaging and nonimaging modalities. Similar multivariate associations have also been investigated in various neurological diseases, such as Parkinson's disease (Lin, Baumeister, Garg, & McKeown, 2018), multiple sclerosis (Leibach et al., 2016; Lin et al., 2017), epilepsy (Kucukboyaci et al., 2012), and drug addiction (Dell'Osso et al., 2014).
Brain activation in response to task stimuli
CCA has also been applied to detect brain activation in response to stimuli during task-based fMRI experiments. Compared with the more commonly used general linear model, CCA considers local neighboring voxels simultaneously to determine the activation status of the central voxel (Friman, Cedefamn, Lundberg, Borga, & Knutsson, 2001; Nandy & Cordes, 2003; Nandy & Cordes, 2004; Rydell et al., 2006; Shams et al., 2006). In addition, in task-based electrophysiological experiments, Dmochowski et al. (2018) and de Cheveigne et al. (2018) maximized the canonical correlation between an optimally transformed stimulus and properly filtered neural responses to delineate the stimulus-response relationship in electroencephalogram (EEG) data.

Denoising neuroscience data
Another application of CCA in neuroscience research is to remove noise from signals in the raw data through a blind source separation scheme.

| Conventional CCA with data reduction: Handling more features than observations

Sufficient observation support (more observations than features) is required so that conventional CCA can produce statistically stable and meaningful results. However, in neuroscience applications, this requirement is not always fulfilled, especially when $Y_1$ or $Y_2$ represents brain activity where each brain voxel is individually considered a feature. In this case, any feature can be picked up and learned by the CCA process, and directly applying Equation (1) to the two sets will produce overfitted and unstable results. Therefore, additional data-reduction steps applied before CCA, or constraints incorporated into the CCA algorithm, are necessary to avoid overfitting in CCA applications. In this section, we focus on data reduction steps applied before conventional CCA.
The most commonly used data reduction technique is PCA, applied to $Y_1$ and $Y_2$ separately. Through orthogonal transformations, PCA converts $Y_1$ and $Y_2$ into sets of linearly uncorrelated principal components. The principal components that do not pass certain criteria are discarded, leading to dimension-reduced variables that are then input to the CCA model in Equation (1). In addition, the least absolute shrinkage and selection operator (LASSO) algorithm (Tibshirani, 1996) has also been applied prior to CCA as a feature selection step to eliminate less informative features. There is no single "correct" way or "gold standard" for the feature reduction step before applying CCA. Decisions should be made based on the data itself and the specific question that researchers are interested in.

| Constrained CCA: Removing noninformative features and stabilizing results
The other common solution in practice when $N \ll p_k$, $k = 1, 2$ is to incorporate constraints directly into the CCA algorithm, so that noninformative features can be removed and overfitting can be avoided (Table 2). For instance, sparse CCA variants have been applied to link genetic variations with brain imaging measures in subjects on the Alzheimer's disease spectrum (Yan et al., 2017).
Longitudinal data can also be collected in neuroscience research and are useful for monitoring disease progression. Temporal constrained sparse CCA has been proposed to uncover how single nucleotide polymorphisms affect brain gray matter density across multiple time points in subjects on the Alzheimer's disease spectrum (Du, Liu, Zhu, et al., 2019). Furthermore, as we stated in Section 4.1.1, CCA has been applied to detect brain activations in response to task stimuli during fMRI experiments. In this type of application, $Y_1$ represents time series from a local neighborhood that is considered simultaneously in determining the activation status of the central voxel, and $Y_2$ represents the task design matrix. CCA is applied to find optimized coefficients $u_1$ and $u_2$ such that the correlation between the combined local voxels and the task design is maximized. In this case, even though the central voxel may be inactive during the task, activated neighboring voxels could lead to a high canonical correlation and thus produce a falsely activated status of the central voxel, which is termed a smoothing artifact (Cordes et al., 2012a; Zhuang et al., 2017; Zhuang et al., 2019). Yang, Zhuang, et al. (2018) further extended the constraints from a two-dimensional local neighborhood to three-dimensional neighboring voxels.

| Kernel CCA: Focusing on a nonlinear relationship between two modalities
The above CCA applications assume joint linear relationships between two modalities; however, this assumption might not always hold in neuroscience research. Kernel CCA has been proposed to uncover the nonlinear relationship between modalities without explicitly specifying the nonlinear feature space (Equation (8)). In human research, kernel CCA has been applied to investigate the joint nonlinear relationship between simultaneously collected fMRI and EEG data (Yang, Cao, et al., 2018), to uncover gene-gene co-association in schizophrenia subjects (Ashad Alam et al., 2019), and to detect brain activations in response to fMRI tasks (Hardoon et al., 2007; Yang, Zhuang, et al., 2018). In preclinical research, temporal kernel CCA has been proposed to investigate the temporally delayed nonlinear relationship between simultaneously recorded neural (electrophysiological recording in frequency-time space) and hemodynamic (fMRI in voxel space) signals in monkeys (Murayama et al., 2010), and to investigate a nonlinear predictive relationship between EEG signals from two different brain regions in macaques (Rodu et al., 2018) (Table 3).
Sparse multiset CCA has been applied to combine more than two variables and remove noninformative features simultaneously. Specifically, it has been applied to combine multiple brain imaging modalities with genetic information (Hu et al., 2016; Hu et al., 2018).
Multiset CCA with reference has been specifically proposed as a supervised multimodal fusion technique in neuroscience research. Using neuropsychological measurements such as working memory or cognitive measurements as the reference, studies have uncovered stable covariated patterns, among fractional amplitude of low frequency fluctuation maps derived from resting-state fMRI, gray matter volumes derived from structural MRI, and fractional anisotropy maps derived from diffusion-weighted MRI, that are linked with and can predict core cognitive deficits in schizophrenia (Sui et al., 2018).
Using genetic information as a prior reference, multiset CCA with reference has also uncovered multimodal covariated MRI biomarkers that are associated with microRNA132 in medication-naïve major depressive disorder patients (Qi, Yang, et al., 2018). Furthermore, with a clinical depression rating score as guidance, Qi et al. (2020) demonstrated that electroconvulsive therapy in major depressive disorder patients produces covariated remodeling in brain structural and functional images that is unique to the antidepressant symptom response. As a supervised technique, multiset CCA with reference can be applied to uncover covariated patterns across multiple variables that relate to a reference of special interest.

| Other applications
CCA has also been applied in a supervised and hierarchical fashion; for instance, hierarchical supervised local CCA has been applied with gradually varying neighborhood sizes in early autism diagnosis, and in each iteration, CCA is used to combine canonical variates from the previous step (Table 5).
Bayesian CCA has been used to realign fMRI activation data between actors and observers during simple motor tasks to investigate whether seeing and performing an action activates similar brain areas (Smirnov et al., 2017). The Bayesian CCA assigns brain activations to one of three types (actor-specific, observer-specific, and shared) via a group-wise sparse ARD prior. Furthermore, using Bayesian CCA, Fujiwara et al. (2013) established mappings between the stimulus and the brain by automatically extracting modules from measured fMRI data, which can be used to generate effective prediction models for encoding and decoding.

| Limitations
CCA assumes and uncovers only a linear intermodality relationship, which might not hold for neuroscience data. Furthermore, directly applying CCA requires sufficient observation support of the variables (detailed in Section 3.1). For neuroscience data, especially voxel-wise brain imaging data, it is usually difficult to have more observations (e.g., subjects) than features (e.g., voxels). In this case, any feature in Y 1 and Y 2 can be picked up and learned by the CCA process, and directly applying CCA will produce overfitted and unstable results.
ROI-based analysis, data reduction (e.g., PCA), and feature selection (e.g., LASSO) steps are commonly applied to reduce the number of features in neuroscience data prior to CCA.
Another limitation of CCA in general is that the signs of the canonical correlations and canonical coefficients are indeterminate. Solving the eigenvalue problem in Equation (3) will always give a positive canonical correlation value, and reversing the signs of $u_1$ and $u_2$ simultaneously leads to the same canonical correlation value. Therefore, with CCA, we can only conclude that two modalities are linearly and multivariately correlated; the direction of the association cannot be determined.

| Advantages

Incorporating prior information about feature structures enables structure sparse CCA to stabilize the estimated canonical coefficients (Du, Liu, Zhang, et al., 2017) and guide the algorithm to produce more biologically meaningful results (Du, Huang, et al., 2016a; Liu et al., 2017). Alternatively, with group assignments implanted in each observation, discriminant sparse CCA is able to discover group-discriminant features, which can later improve the performance of supervised classification.
Other constraints are also beneficial in neuroscience research. For instance, the L 2 -norm penalty on canonical coefficients retains all features in the model with regularized weights, and therefore most of the variance can be maintained in a stable model (Dashtestani et al., 2019).
In addition, when applied to task fMRI activation detection, locally constrained CCA penalizes weights on the neighboring voxels to guarantee the dominance of the central voxel and therefore, is able to reduce false positives (Cordes et al., 2012b;Zhuang et al., 2017).

| Limitations
One major limitation of constrained CCA is the requirement of expertise in optimization techniques. By having additional penalty terms on canonical coefficients or covariance matrices, analytical solutions of constrained CCA no longer exist, and, instead, iterative optimization methods are required to solve the constrained CCA problems efficiently.
The predefined constraint itself also requires prior knowledge about the data. For structure and discriminant sparse CCA, prior information about the observation domain or the feature domain is required. Furthermore, in neuroscience applications, the constraint itself is usually data specific. For instance, when applying locally constrained CCA to task fMRI activation detection, the predefined constraint should be strong enough to penalize neighboring voxels, but loose enough to guarantee a multivariate contribution of neighboring voxels to the central voxel. This constraint can only be selected by simulating a series of synthetic data sets that mimic real fMRI signals, which requires prior knowledge of the data and is time-consuming.

| Advantages
By definition, nonlinear CCA is able to uncover multivariate nonlinear relationships between two modalities, which commonly exist among neuroscience variables. For instance, during an fMRI task, the collected fMRI signals are nonlinearly correlated with the task design due to the unknown hemodynamic response function; kernel CCA can extract this multivariate nonlinear relationship and produce a localized brain activation map (Hardoon et al., 2007).
In general, kernel CCA first implicitly transforms the original feature space into a kernel space with a predefined kernel function. With this transform, nonlinear relationship between two modalities can be discovered. Furthermore, in the new kernel space, kernel CCA can be solved efficiently with a closed-form analytical solution.
Temporal kernel CCA shares similar advantages with kernel CCA, with the additional benefit of considering temporal delays between modalities when applied to simultaneously collected data. In neuroscience research, simultaneously collected EEG/fMRI data are a typical candidate for temporal kernel CCA, as the blood oxygenation level-dependent signals collected with fMRI (Ogawa, Lee, Kay, & Tank, 1990) contain temporal delays caused by the hemodynamic response function, as compared with the simultaneously collected EEG signals.
Deep CCA, a purely data-driven technique, can reveal unknown nonlinear relationships between variables without assuming any predefined nonlinear intermodality relationship. It has the potential to be applied to neuroscience data sets that contain enough samples for training a deep learning scheme.

| Limitations
For kernel CCA, a predefined kernel function needs to be selected and this selection will affect final results. This choice of kernel functions requires additional knowledge about data and the kernel function.
Another major limitation of both kernel CCA and temporal kernel CCA is that it is difficult to project the kernel space ($H_1$ and $H_2$) back to the original feature space ($Y_1$ and $Y_2$), leading to additional difficulties in interpreting results (Hardoon et al., 2007). For instance, when applying kernel CCA to link fMRI task stimuli and collected BOLD signals for activation detection, the obtained high-dimensional features cannot be mapped back to individual voxels to assign activation values, because the features embedded in commonly used nonlinear kernels (e.g., the Gaussian kernel and the power kernel) carry information from multiple voxels. Therefore, kernel CCA with a general nonlinear kernel remains unsolved for fMRI activation analysis, and only linear kernels have been used for constructing activation maps in fMRI.
Unlike kernel CCA, deep CCA does not require a predefined function and learns the nonlinear feature mapping from the data itself.
However, in deep CCA, the number of unknown parameters increases significantly with the number of layers, which requires many more samples in the training data. In neuroscience, it is usually difficult to recruit a sufficient number of subjects as training samples for deep CCA.
Furthermore, deep learning expertise is also required for selecting the appropriate deep learning structures for nonlinear feature mapping.

| Advantages
In neuroscience research, more than two variables are commonly collected for the same set of subjects. Multiset CCA uncovers multivariate joint relationships among multiple variables and is therefore well suited to link all collected data in this case. Furthermore, if the data from each subject are treated as one modality (or variable), multiset CCA will also discover the common patterns across subjects, making it a powerful data-driven group analysis method.
Sparse multiset CCA combines more than two modalities and suppresses noninformative features simultaneously, and therefore shares the advantages and limitations with both multiset CCA and sparse CCA.
Multiset CCA with reference is a supervised multiset CCA technique proposed specifically for neuroscience applications. It discovers joint multivariate relationships among variables in response to a specific reference variable. For instance, using this method, common brain changes in structural MRI, fMRI, and diffusion MRI with respect to a specific neuropsychological measurement can be discovered.

| Limitations
There are five possible objective functions for multiset CCA optimization, and the choice among them is application dependent.

| GUIDELINE FOR SELECTING THE MOST APPROPRIATE CCA-RELATED TECHNIQUE
To summarize, conventional CCA uncovers joint multivariate linear relationships between two modalities and can be applied quickly and easily. Table 7 summarizes current and potential techniques that can be applied for each application.
After determining the application of interest, the flowchart in Figure 4 can be followed to select the most appropriate CCA-related technique. Here, we further give an experimental example of CCA applications in neuroscience research. Among many neuroscience applications, CCA is commonly used as a data fusion technique to uncover the association between two data sets. In the following, we demonstrate how to follow the guidance in Figure 4 to link disease-related pathology using fMRI and structural MRI data from cognitively normal subjects and subjects with mild cognitive impairment (MCI). As MCI is a prodromal stage of Alzheimer's disease, both functional and structural pathology are expected in MCI subjects. Yang, Zhuang, Bird, et al. (2019) used CCA to examine the disease-related links between voxel-wise functional information (e.g., eigenvector centrality maps from fMRI data, $X_1 \in \mathbb{R}^{N \times p_1}$) and voxel-wise structural information (e.g., voxel-based morphometry from T1 structural MRI data, $X_2 \in \mathbb{R}^{N \times p_2}$), where $N$ is the number of subjects, and $p_1$ and $p_2$ are the numbers of voxel-wise features for the fMRI and structural MRI data, respectively. Since there are only two imaging modalities in the analysis, multiset CCA is not an option for this case. Considering that deep CCA requires a large number of samples but $N \ll p_1, p_2$, and that kernel CCA has difficulty projecting coefficients back to the original voxel-wise feature space as mentioned in Section 5.3, a linear relationship between these two imaging modalities is considered. There are two approaches for the scenario in which the number of samples is much smaller than the number of features. The first approach is to perform dimension reduction before feeding data into conventional CCA, as shown in Figure 5a. Yang, Zhuang, Bird, et al. (2019) used PCA or sparse PCA (sPCA) for dimension reduction and then fed CCA with the dimension-reduced data $Y_1$ and $Y_2$.
CCA then finds a set of canonical coefficients $U_k$, $k = 1, 2$ and the corresponding canonical variables $A_k$; the voxel-wise weight coefficients can be obtained with a pseudo-inverse operation. The other approach is to implement constrained CCA, as shown in Figure 5b. With the assumption that a proportion of voxels in the brain is not informative for finding the association between fMRI and structural MRI data, sparse CCA was applied to $X_1$ and $X_2$ directly, without a dimension reduction step (Yang, Zhuang, Bird, et al., 2019). The canonical coefficients $U_k$, $k = 1, 2$ are then already in the voxel-wise feature space, so no additional operation is required to calculate the voxel-wise weight coefficients.
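A minimal sketch of the first approach (Figure 5a) with scikit-learn; the data shapes, the 90% retained-variance threshold, and the back-projection through the PCA loadings are illustrative assumptions, not the exact settings of Yang, Zhuang, Bird, et al. (2019).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import CCA

# Illustrative stand-ins for the voxel-wise modalities X1 and X2
rng = np.random.default_rng(0)
N, p1, p2 = 50, 5000, 8000            # subjects, fMRI voxels, sMRI voxels
X1 = rng.standard_normal((N, p1))
X2 = rng.standard_normal((N, p2))

# Approach 1 (Figure 5a): PCA for dimension reduction, then CCA
pca1 = PCA(n_components=0.9)          # keep 90% variance (illustrative)
pca2 = PCA(n_components=0.9)
Y1 = pca1.fit_transform(X1)
Y2 = pca2.fit_transform(X2)
cca = CCA(n_components=2).fit(Y1, Y2)
A1, A2 = cca.transform(Y1, Y2)        # canonical variables

# Map the coefficients back to voxel space through the PCA loadings
w1 = pca1.components_.T @ cca.x_rotations_   # (p1, 2) voxel-wise weights
w2 = pca2.components_.T @ cca.y_rotations_   # (p2, 2) voxel-wise weights
```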
The voxel-wise weight coefficients play a role in uncovering which brain regions are most relevant for the association between data sets. The voxel-wise weight maps for the most significant disease-related component in $A_k$ for (s)PCA + CCA and sparse CCA are shown in Figure 5c. A nonparametric permutation test is applied to test the significance of the association between fMRI and structural MRI data, with p values shown at the bottom of Figure 5c.
In this study, the canonical variables $A_k$ computed from sPCA + CCA had the highest classification accuracy for both fMRI and structural MRI data.
F I G U R E 4 Selecting a canonical correlation analysis (CCA)-technique that suits your application. Three scenarios are most commonly encountered in neuroscience applications: CCA with and without constraints (dashed yellow box); nonlinear CCA (dashed gray box) and multiset CCA (dashed orange box)

| FUTURE DIRECTION OF CCA IN NEUROSCIENCE APPLICATIONS
Currently, when applying CCA to data with a smaller number of observations than features, either a data reduction or feature selection step is performed during preprocessing, or an L1-norm penalty is added as a constraint to remove noninformative features. Future efforts should be made toward incorporating prior information on the feature structures of input variables in a more reasonable or more biologically meaningful way, and canonical correlation values should be computed in a one-step process that includes this prior information. Furthermore, applying CCA and its variant techniques to uncover joint multivariate relationships between two modalities has dominated current CCA applications in the neuroscience field. In these applications, various techniques have been proposed to incorporate prior information within variables to boost model performance, such as considering group-discriminant features to strengthen group separation. However, much less effort has been put into incorporating such prior information within the variables in multiset CCA. In neuroscience research, collecting multiple modalities from a single subject has become commonplace, and with more than two variables, multiset CCA should be considered more often for multimodal data fusion. Future efforts toward incorporating prior information within each variable to further improve the performance of multiset CCA could shed new light on neuroscience research. For instance, we suggest incorporating group information in multiset CCA to extract common group-discriminant patterns among multiple measurements derived from fMRI, or to uncover correlated group-discriminant features among brain imaging data and behavioral or clinical measurements. Furthermore, nonlinear relationships among multiple modalities have not been explored within multiset CCA in neuroscience research. It might be of interest to incorporate kernels in multiset CCA to uncover covariated nonlinear patterns among multiple brain imaging data sets, or to pass each variable through multiple layers to generate "deep" features before applying multiset CCA.
In addition, future efforts are also required to statistically interpret CCA results. Currently, parametric statistical significance is only well defined for the conventional CCA model. Statistical significance of CCA variants is usually determined nonparametrically through permutation tests, which are time-consuming and method dependent. Furthermore, even with permutation tests, statistical significance can only be determined for each canonical correlation value, not for the canonical coefficients. Therefore, we cannot determine the statistical significance of a specific feature in the model. Identifying important features as potential biomarkers is usually an end goal in neuroscience. Therefore, developing test statistics that interpret CCA results by determining statistically important features would also benefit neuroscience research.

| CONCLUSION
Uncovering multivariate relationships between modalities of the same subjects has gained significant attention in neuroscience research. CCA is a powerful tool to investigate these joint associations and has been widely applied. Multiple CCA-variant techniques have been proposed to fulfill specific analysis requirements. In this study, we reviewed CCA and its variant techniques from a technical perspective, with summarized applications in neuroscience research. For each CCA-related technique, the detailed formulation and solution, relationships with other techniques, current applications, advantages, and limitations are provided. Selecting the most appropriate CCA-related technique to take full advantage of the information embedded in every variable in joint multimodal research may shed new light on our understanding of normal development, aging, and disease processes.

F I G U R E 5 Example of choosing canonical correlation analysis (CCA) variants by following the guideline. Voxel-wise functional and structural MRI information from cognitively normal subjects and subjects with mild cognitive impairment were used for data fusion analysis. (a) Schematic diagram of (sparse) principal component analysis (PCA) + CCA. The abbreviation sPCA stands for sparse PCA. (b) Schematic diagram of sparse CCA (sCCA). (c) The top panel shows the most disease-discriminant functional and structural component; the bottom panel shows the correlation between data sets (ρ), the significance of the correlation derived from a nonparametric permutation test (p corr), and the classification accuracy for each method

ACKNOWLEDGMENTS
The study is supported by the National Institutes of Health (grants

DATA AVAILABILITY STATEMENT
There is no data or code involved in this review article.