# Canonical Correlation Analysis Through Linear Modeling

## Summary

In this paper, we introduce linear modeling of canonical correlation analysis, which estimates canonical direction matrices by minimising a quadratic objective function. The linear modeling results in a class of estimators of canonical direction matrices, and an optimal class is derived in the sense described herein. The optimal class guarantees the following desirable properties: first, its estimates of canonical direction matrices are asymptotically efficient; second, its test statistic for determining the number of canonical covariates always has a chi-squared distribution asymptotically; third, it is straightforward to construct tests for variable selection. The standard canonical correlation analysis and other existing methods turn out to be suboptimal members of the class. Finally, we study the role of canonical variates as a means of dimension reduction for predictors and responses in multivariate regression. Numerical studies and data analysis are presented.

## 1 Introduction

Principal component analysis is one of the most popular dimension reduction tools in high-dimensional data analysis. Since it does not require the inversion of the covariance matrix of the variables, it has convenient applications to data where the sample size is less than the number of variables, namely $n < p$.

Canonical correlation analysis (CCA) seeks pairs of linear combinations from two sets of variables based on maximisation of the Pearson correlation between each pair. We call the pairs of linear combinations and their correlations canonical variates and canonical correlations, respectively. It is believed that a few pairs of canonical variates can represent the original sets of variables to explain their relationships and variabilities (Johnson & Wichern 2007, pp. 539–574).

Since principal component analysis is based on the marginal covariance structure of each set of variables, it ignores any association between the two sets when effecting dimension reduction. In contrast CCA reduces the dimensions of the two sets while maximizing Pearson correlation. Therefore when a high-dimensional relationship between two sets of variables is of interest with , the latter should be more appropriate than the former and can produce a more reasonable dimension reduction.

It should be noted here that CCA is closely related to classical multivariate regression where

(1)

where is the random vector of responses, is a vector of predictors, is an intercept vector, is an unknown coefficient matrix, and the error vector is independent of . The notation indicates the multivariate normal distribution. Also, for a symmetric matrix , the notations and indicate that is positive definite and positive semi-definite, respectively.

It was shown by Tso (1981) that the maximum likelihood analysis of reduced-rank regression under (1) has an intimate connection with CCA. Tso's work showed that the canonical covariates corresponding to predictors can be used as dimension-reduced predictors in multivariate regression. Also, Yoo, Kim & Um (2012) studied the relation of CCA and regression analysis, focusing on a reduced-rank regression framework. A method of estimating the canonical variate more accurately by considering weights was proposed by Ter Braak (1990). The weights were constructed using the residuals, , where and represent the ordinary least squares estimates under (1).

Even though the papers discussed above improved estimation in CCA and clarified the role of CCA in data analysis, they have several practical limitations. The equivalence of maximum likelihood estimation in reduced-rank regression to CCA shown by Tso (1981) holds only under (1). That is, if the random error is not multivariate normal, then this equivalence is no longer valid. Additionally, reduced-rank regression typically reduces the dimensions of the predictors alone, although the responses are multi-dimensional. Given that CCA was introduced to reduce the dimensions of two sets of variables simultaneously, the relation between CCA and reduced-rank regression described in Yoo et al. (2012) is not quite satisfactory. Further, optimality through weights as developed by Ter Braak (1990) holds only under (1), just as for Tso (1981). That is, the normality of the random error is a crucial condition. Finally, variable selection in CCA has been largely neglected to date, although it can substantially help the interpretation of canonical variates.

The main purpose of this paper is to overcome these deficiencies in CCA. To achieve this, several steps are required. First, we connect CCA to ordinary least squares coefficients. Then we establish a linear modeling form of CCA based on this connection. Under this setup, the unknown quantities in CCA are estimated optimally in a sense which will be discussed in later sections. Second, we show that standard CCA and the method of Ter Braak (1990) are sub-optimal. In addition, we propose a method of variable selection in CCA under the linear modeling of CCA. Finally, we investigate CCA as a dimension reduction tool in multivariate regression by adopting existing theories of sufficient dimension reduction.

The paper is organised as follows. In Section 'Classical canonical correlation analysis (CCA)', we give a brief explanation of classical canonical correlation analysis. In Section 'Linear modeling of canonical correlation analysis' we develop a linear modeling approach for canonical correlation analysis. The roles of canonical covariates as a means of dimension reduction in multivariate regression are studied in Section 'Canonical correlation in regression'. Sections 'Simulation study' and 'Minneapolis school data analysis' provide numerical studies and a real data application, respectively. Finally, a summary of our work is provided in Section 'Discussion'. To avoid interrupting the discussion, proofs for most results are given in the Appendix.

## 2 Classical canonical correlation analysis (CCA)

We now give a short explanation of CCA. Suppose that we have two sets of variables and , and define , , and . Let two linear combinations of and be and , where and . Then we have , , and . Now we determine and to maximise

(2)

Canonical correlation analysis seeks such and based on the following criteria:

1. The first canonical variate pair is obtained from the maximisation of (2) with the restriction that .
2. The second canonical variate pair is obtained from the maximisation of (2) with the restriction that and and are uncorrelated.
3. At step the th canonical variate pair is obtained from the maximisation of (2) with the restriction that and are uncorrelated with the previous canonical variate pairs.
4. Repeat step 3 until .
5. Select pairs of to represent the relationship between and .

Finally, it can be shown that pairs based on the criteria above are constructed as follows: and for , where and are respectively the eigenvectors of and with corresponding common ordered-eigenvalues . The matrices and are called canonical direction matrices.

The selection of the pairs of canonical variates is equivalent to testing how many non-zero eigenvalues the matrix has. This is also the same as the estimation of the rank of . Throughout the rest of the paper, the symbol represents the true rank of . A selection criterion for determining the value of will be discussed in later sections. For more details regarding CCA, readers are referred to Johnson & Wichern (2007).
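As a minimal numerical sketch of the construction above, the following Python code computes sample canonical correlations and canonical direction matrices. The names `classical_cca`, `inv_sqrt`, `Sxx`, `Syy` and `Sxy` are illustrative assumptions rather than the paper's notation. The computation uses the singular value decomposition of the whitened sample cross-covariance, which is algebraically equivalent to the eigen-decomposition described above.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def classical_cca(X, Y):
    """Sample canonical correlations and direction matrices for data X, Y."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / n          # sample covariance of X
    Syy = Yc.T @ Yc / n          # sample covariance of Y
    Sxy = Xc.T @ Yc / n          # sample cross-covariance
    Sxx_ih, Syy_ih = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations, returned in decreasing order.
    U, rho, Vt = np.linalg.svd(Sxx_ih @ Sxy @ Syy_ih, full_matrices=False)
    A = Sxx_ih @ U               # columns: canonical directions for X
    B = Syy_ih @ Vt.T            # columns: canonical directions for Y
    return rho, A, B
```

With this scaling each canonical variate has unit sample variance, so the sample correlation between the first pair of variates equals the first returned canonical correlation.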

## 3 Linear modeling of canonical correlation analysis

### 3.1 Relation of canonical direction matrices and least squares

In this section, we introduce a new approach to CCA for two sets of variables and . Throughout the rest of the paper we will directly follow the notation of the previous two sections. For , the expressions and represent the column rank and the subspace spanned by the columns of , respectively. In addition, we define , , and . By symmetry we have .

Using and , is simplified as follows:

(3)

The relation (3) directly implies that .

Let . It can be shown that the columns of form an orthonormal basis for because . It follows that the columns of the matrix form a basis for , equivalently . Since post-multiplication by any non-singular matrix does not change the rank and column space of a matrix, we establish the following key equivalences:

(4)

The quantity in the last equivalence of (4) consists precisely of the ordinary least squares (OLS) coefficients of the regression of given . This directly indicates that any orthonormal basis matrix for can also be used to construct a canonical direction matrix for . By applying the same arguments to , it is easily shown that . Again, the quantity consists of the OLS coefficients of . Based on these results, we will present linear modeling of CCA by means of and .
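The link between the canonical directions and the OLS coefficients can be verified in a few lines. The symbols used here are assumptions introduced for illustration and may differ from the paper's notation: $\Sigma_x$ and $\Sigma_y$ are the covariance matrices of the two sets, $\Sigma_{xy}$ is their cross-covariance, and the whitened cross-covariance has exactly $d$ non-zero singular values $\rho_1 \ge \dots \ge \rho_d > 0$ with left and right singular vectors collected in $U_d$ and $V_d$.

```latex
\begin{aligned}
\Sigma_x^{-1/2}\,\Sigma_{xy}\,\Sigma_y^{-1/2} &= U_d D_d V_d^{\top},
  \qquad D_d = \operatorname{diag}(\rho_1,\dots,\rho_d),\\
\Sigma_x^{-1}\Sigma_{xy} &= \bigl(\Sigma_x^{-1/2}U_d\bigr)\,D_d V_d^{\top}\Sigma_y^{1/2},\\
\operatorname{span}\bigl(\Sigma_x^{-1}\Sigma_{xy}\bigr) &= \operatorname{span}\bigl(\Sigma_x^{-1/2}U_d\bigr).
\end{aligned}
```

The last line holds because $D_d V_d^{\top}\Sigma_y^{1/2}$ has full row rank $d$, so multiplying by it on the right does not change the column space. The columns of $\Sigma_x^{-1/2}U_d$ are the canonical directions for the first set, while $\Sigma_x^{-1}\Sigma_{xy}$ is the population OLS coefficient matrix.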

### 3.2 Linear modeling of canonical correlation analysis

Recalling the definitions of and , we first consider inference regarding . The estimation of requires two parts, which consist of determining its dimension, , and its orthonormal basis, .

Since is also an orthonormal basis matrix of according to (4), we have the relation where . Unknown population quantities and are replaced by their usual moment estimators and . We can then consider the estimation of and with arguments and that minimise the following objective function over and :

(5)

where is a inner-product matrix and indicates a vector constructed by stacking the columns of a matrix ; .

As discussed in Shapiro (1986), any solution provides a consistent estimator for any choice of in (5) in the sense that converges to , where stands for the orthogonal projection operator with respect to the usual inner-product space structure, onto . Therefore we can construct a class of estimators for canonical direction matrices depending on choices of . In addition it is clear that the minimisation of (5) and its asymptotic behavior depend on the choice of . According to Shapiro (1986) a best choice for is any consistent estimate of the inverse of the covariance matrix of the asymptotic distribution of , which will be denoted . The quantity is exactly equal to in Yoo & Cook (2007) and its explicit expression is

where .

Here the following quantity is used as a consistent estimate of :

where and .

We then define the best quadratic objective function as:

(6)

We define and to be the solutions of the minimisation of (6) with respect to and .
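One simple way to minimise a quadratic discrepancy of this kind is alternating least squares, sketched below in Python. This is an illustration under stated assumptions, not necessarily the authors' algorithm: `B_hat` stands for the estimated OLS coefficient matrix (taken here to be p by r), `V` for the chosen inner-product matrix, and `eta` and `gamma` for the two factors of the rank-d fit; all function names are hypothetical, and the sketch requires d of at least 1.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one long vector."""
    return M.flatten(order="F")

def discrepancy(B_hat, eta, gamma, V):
    """Quadratic discrepancy between B_hat and the rank-d fit eta @ gamma."""
    resid = vec(B_hat) - vec(eta @ gamma)
    return float(resid @ V @ resid)

def fit_rank_d(B_hat, V, d, n_iter=50):
    """Alternating least squares for the rank-d minimum discrepancy fit."""
    p, r = B_hat.shape
    b = vec(B_hat)
    # Start from the truncated SVD, which is optimal when V is the identity.
    U, s, Vt = np.linalg.svd(B_hat, full_matrices=False)
    eta = U[:, :d]
    gamma = np.diag(s[:d]) @ Vt[:d, :]
    for _ in range(n_iter):
        # gamma-step: vec(eta @ gamma) = (I_r kron eta) vec(gamma)
        A = np.kron(np.eye(r), eta)
        g = np.linalg.solve(A.T @ V @ A, A.T @ V @ b)
        gamma = g.reshape((d, r), order="F")
        # eta-step: vec(eta @ gamma) = (gamma' kron I_p) vec(eta)
        C = np.kron(gamma.T, np.eye(p))
        e = np.linalg.solve(C.T @ V @ C, C.T @ V @ b)
        eta = e.reshape((p, d), order="F")
    # Orthonormalise eta without changing the product eta @ gamma.
    Q, R = np.linalg.qr(eta)
    gamma = R @ gamma
    return Q, gamma, discrepancy(B_hat, Q, gamma, V)
```

Each step solves an exact weighted least squares problem in one factor, so the objective never increases across iterations. With `V` equal to the identity the fit reduces to the truncated singular value decomposition of `B_hat`.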

Now the estimation of is performed via a sequence of hypothesis tests (Rao 1965): beginning with , test vs. . If is rejected, increment by 1 and repeat the test, stopping the first time is not rejected and setting . This estimation procedure relies on obtaining a test statistic for , and, as the statistic we propose Under , the statistic is distributed asymptotically as . Then is estimated by , where is constructed from (6) with . Also, the estimator is asymptotically efficient. The asymptotic efficiency of means that it has minimum asymptotic variance within family (5). The results regarding the asymptotic chi-squared distribution of and the asymptotic efficiency of are directly derived from theorem 2 in Yoo & Cook (2007), and the results are guaranteed by the existence of finite fourth moments of simple random samples of , , according to theorem 1 of Cook & Ni (2005).

Applying the same rationale to , we can estimate and its orthonormal basis of by minimising the following quadratic objective function over and :

(7)

where is a consistent estimator of the inverse of the asymptotic variance of .

Let and be the solutions of the minimisation of (7). As a test statistic for vs. , we propose which is asymptotically under .

The estimates of determined from (6) and (7) are not always equal, but the same number of pairs of canonical variates for and should be selected in the CCA. Therefore, we will consider the following Bonferroni determination of throughout the rest of the paper:

1. Starting with , compute the $p$-values from and with . Let them be and , respectively.
2. If either or is less than , reject and increment by 1.
3. Repeat Step 2 until both and are bigger than for the first time. Then set .

Effecting CCA by means of the Bonferroni determination of and the estimation of canonical direction matrices through minimising (6) and (7) will be called optimal linear modeling (OLM) of CCA.
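The Bonferroni determination described in the steps above can be sketched as a short Python function. The function name `bonferroni_rank` and its arguments are hypothetical, and the rejection threshold of `alpha / 2` is an assumption (the usual Bonferroni split of the overall level across the two tests). The inputs are the sequences of $p$-values from the two statistics, indexed by the hypothesised rank starting at zero.

```python
def bonferroni_rank(p_x, p_y, alpha=0.05):
    """Sequential Bonferroni determination of the number of canonical pairs.

    p_x[m] and p_y[m] are the p-values of the two tests of the hypothesis
    that the rank equals m, for m = 0, 1, ...
    The threshold alpha / 2 is an assumed Bonferroni split over the two tests.
    """
    threshold = alpha / 2
    for m, (px, py) in enumerate(zip(p_x, p_y)):
        # Stop the first time neither test rejects the hypothesised rank m.
        if px >= threshold and py >= threshold:
            return m
    # Every supplied hypothesis was rejected.
    return len(p_x)
```

If all supplied hypotheses are rejected, the function returns the number of hypotheses tested, which corresponds to the largest possible rank when the $p$-value sequences cover every smaller rank.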

### 3.3 Sub-optimality of existing approaches

From Section 'Classical canonical correlation analysis (CCA)', the sample canonical direction matrix for , from the standard application of CCA, is constructed from the spectral decomposition of , by taking the eigenvectors corresponding to the first largest eigenvalues . For the determination of through sequential hypothesis tests of versus , , the following statistic proposed by Bartlett (1938) is widely used:

Under the joint normality of and , is asymptotically distributed as .
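For comparison, a Bartlett-type statistic can be computed directly from the sample canonical correlations. The sketch below uses a common textbook form of Bartlett's approximation; the scaling constant and the degrees of freedom are assumptions as far as this paper's exact definition is concerned, and the function name `bartlett_statistic` is hypothetical. Here `rho` holds the sample canonical correlations in decreasing order, `n` is the sample size, `p` and `q` are the dimensions of the two sets, and `m` is the hypothesised number of non-zero canonical correlations.

```python
import numpy as np

def bartlett_statistic(rho, n, p, q, m):
    """Textbook Bartlett-type statistic for 'only the first m canonical
    correlations are non-zero', with its reference degrees of freedom."""
    rho = np.asarray(rho, dtype=float)
    scale = n - 1 - 0.5 * (p + q + 1)       # Bartlett's scaling constant
    # Sum log(1 - rho_i^2) over the canonical correlations beyond the m-th.
    stat = -scale * np.sum(np.log1p(-rho[m:] ** 2))
    df = (p - m) * (q - m)                  # chi-squared degrees of freedom
    return float(stat), df
```

The returned statistic would be compared with a chi-squared distribution on `df` degrees of freedom, a reference distribution that relies on the joint normality assumption noted above.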

Let the pairs , , be the eigenvalues and their corresponding eigenvectors of . Let . Consider the minimisation of the following quadratic objective function over and :

(8)

Let and be the minimisers of (8) and define . Then a pair can be obtained from and by the following lemma.

Lemma 1. The following two equations hold: (i) for . (ii) with , if .

The next lemma demonstrates that the proposed approach is asymptotically more efficient than the standard application of CCA.

Lemma 2. Let be the covariance matrix of the asymptotic distribution of constructed from (6). Define to be that of constructed from (8). Then, for any , .

Another difference between (6) and (8) lies in the asymptotic distributions of and . The former is always asymptotically chi-squared regardless of joint normality of and , while the latter requires a multivariate normal distribution of and for . Therefore, if the normality is a cause of concern, will be problematic with respect to determination of . Being deficient in these two desirable properties, namely efficiency and validity of the associated distribution, the standard CCA application can be said to be sub-optimal.

Also, Ter Braak's approach can be viewed as a particular case of the linear modeling approach by setting where . The optimality of Ter Braak's approach is guaranteed under the following condition:

If and are independent, then the two quantities are equal, and hence Ter Braak's approach is optimal. However, if they are not, equality is not guaranteed, so Ter Braak's approach may be sub-optimal. It is interesting to note that Ter Braak's approach coincides with the results of Cook & Setodji (2003), who developed a model-free reduced-rank regression.

Simulation studies in Section 5 show that the potential advantages of using the OLM by minimising (6) and (7) are most noticeable in the estimation of the canonical direction matrices when there exists a complicated association between and , such as non-trivial noise and high skewness in variables.

### 3.4 Variable selection

Variable selection in canonical correlation analysis has been largely neglected, possibly because of the difficulty of deriving a proper methodology for it. Since CCA is done through and , variable selection in CCA should be based on choosing variables in and which contribute substantially to and . Direct use of (6) and (7) enables us to do this without knowing . In other words, we can perform variable selection prior to carrying out canonical correlation analysis.

In canonical correlation analysis, the importance of and is measured only through and , respectively. For example, if , which is the first coordinate of , does not contribute to canonical correlation analysis, the corresponding first row of should be zero. This condition can be written as where . It is straightforward to test a hypothesis such as through the linear modeling of canonical correlation analysis given in (6) and (7).

Since the variables significant to and must be significant to and , we test the following hypotheses for variable selection of and :

where represents the -dimensional canonical basis vector with the th entry equal to one and other entries equal to zero.

If is not rejected, then the th coordinate in does not contribute to . Thus we can remove before conducting canonical correlation analysis. Also, if is not rejected, the th coordinate in can be removed.

The hypotheses are tested by using the Wald-type statistic:

Under and , and asymptotically converge to and , respectively, which immediately follows from Slutsky's theorem.
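A generic Wald-type test of the hypothesis that one row of a coefficient matrix is zero can be sketched as follows. This is an illustration of the general Wald construction rather than the paper's exact statistic. It assumes that `B_hat` is the estimated coefficient matrix (p by r), that `Gamma_hat` is a consistent estimate of the asymptotic covariance matrix of the column-stacked, root-n scaled estimate, and that `j` is the zero-based index of the coordinate being tested; the function name `wald_row_statistic` is hypothetical.

```python
import numpy as np

def wald_row_statistic(B_hat, Gamma_hat, n, j):
    """Generic Wald statistic for the hypothesis that row j of B is zero."""
    p, r = B_hat.shape
    e_j = np.zeros((1, p))
    e_j[0, j] = 1.0
    # (I_r kron e_j') picks row j out of the column-stacked vec(B_hat).
    L = np.kron(np.eye(r), e_j)
    row = L @ B_hat.flatten(order="F")      # equals B_hat[j, :]
    cov_row = L @ Gamma_hat @ L.T           # asymptotic covariance of that row
    stat = n * row @ np.linalg.solve(cov_row, row)
    return float(stat), r                   # statistic and degrees of freedom
```

Under the null hypothesis and the stated assumptions, a statistic of this form is referred to a chi-squared distribution with `r` degrees of freedom, so large values indicate that the coordinate should be retained.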

## 4 Canonical correlation in regression

We consider a multivariate regression of with . Define and with the smallest possible ranks of and so that , where is a matrix. If , we are interested in the equation .

This says that can be thought of as influencing and all other conditional mean components are determined from via . It can be shown that is a generalised inverse of : . Without loss of generality, we take . Then forms the orthogonal projection operator for relative to the inner product , so we have

(9)

This says that varies in the subspace spanned by only through dependence on . In other words, we pursue dimension reduction of and through linear projection without loss of information about . We call this type of dimension reduction in regression sufficient dimension reduction for (Yoo & Cook 2007, 2008).

Considering sufficient dimension reduction of for , the following condition on the marginal distribution of is typically imposed:

C1. is linear in .

Condition C1 is called the linearity condition and will hold to a reasonable approximation in many problems (Hall & Li 1993). If has an elliptically contoured distribution, condition C1 is automatically satisfied. In the case that condition C1 does not hold, can often be one-to-one transformed so as to satisfy this condition. Under condition C1, . Hereafter, we will assume that for exhaustive estimation of . Then, from (4), we have

This relation implies that the canonical variates for can replace the original predictor without loss of information on under condition C1.

The use of as given in (9) is implicit in the method in Yoo & Cook (2008, section 2.1). Expressing their results in our terms we would say that has full information on in the sense that . We then have the following equivalence:

The quantity in the last equivalence above is used for the usual construction of the canonical variates for . Thus the original response can be replaced by the canonical variates for without loss of information on .

## 5 Simulation study

To confirm that the proposed OLM method has potential advantages in the estimation of the dimension and bases of the canonical correlation subspace, we consider joint distributions between and . First, the coordinate variables of were independently generated from Gamma (0.25, 1) and . Based on this, we have constructed the following four joint distributions between and : , , , and , where the variates , 4 are independent standard normal variates independent of . In the simulation, sample sizes of 100, 200, 400 and 800 were considered, and the number of simulation replicates for each sample size was 500.

In the simulation model, the variates are quite skewed and prone to outliers. If we reduce the dimensions of and and focus on , which is a very common target in regression, then it can be seen that . Under the simulated model, the directions of and for and and for are the results of the dimension reduction. In other words, two-dimensional sets of variables and should be sufficient to represent the association between and for . Therefore the true number of pairs of canonical variates, which is , is two. For methodological comparison, we have used the proposed OLM method, the standard CCA method, and the methods of Ter Braak (1990) and Yoo & Cook (2007).

In the estimation of , we sequentially tested the hypotheses $H_0$: rank $= m$ versus $H_1$: rank $> m$ for $m = 0, 1, 2, 3$ at significance level 0.05. For dimension estimation, we computed the percentage of replicates in which the estimate exceeded 2 and the percentage in which it equalled 2. The former percentages are called the observed significance levels. Both percentages are reported in Figure 1, where the horizontal lines in Figures 1(a) and (b) mark the reference 5% and 95% levels, respectively. In Figure 1(a), the Yoo-Cook method shows the best results, and the standard CCA application and the Ter Braak method are similar to each other. Since the proposed method uses the Bonferroni procedure, its observed significance levels should be at most 5%, which is what Figure 1(a) shows. Figure 1(b) shows that the OLM method, which invokes the Bonferroni procedure, produces the highest percentages of estimates equal to 2 regardless of sample size, although the Yoo-Cook method is close to it. The other two methods do not quite match the OLM method at the smaller sample sizes. This indicates that the OLM with the Bonferroni procedure has potential advantages over the three other methods in dimension estimation.

Let be the estimated canonical direction matrices of and from either the space approach or the standard CCA given , respectively. To measure how well the true basis of for was estimated, and were computed as the square roots of the $R^2$ values from the ordinary least squares regressions of on and of on . The same criteria were applied to summarise the estimation of and , and the analogous notation and was adopted. Since there were no notable differences between the two approaches in the basis estimation of for both and , we report the averages of and in Figure 2 and those of and in Figure 3.
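The accuracy measure just described, the square root of the $R^2$ from an OLS regression of a true direction variate on the estimated canonical variates, can be sketched as follows. The function name `direction_accuracy` and its argument names are hypothetical: `X` is the data matrix, `true_dir` is one true direction vector, and `est_dirs` is the estimated canonical direction matrix with one direction per column.

```python
import numpy as np

def direction_accuracy(X, true_dir, est_dirs):
    """Square root of R^2 from regressing X @ true_dir on X @ est_dirs."""
    target = X @ true_dir
    # Design matrix: intercept plus the estimated canonical variates.
    Z = np.column_stack([np.ones(X.shape[0]), X @ est_dirs])
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    resid = target - Z @ coef
    tss = np.sum((target - target.mean()) ** 2)
    r2 = 1.0 - np.sum(resid ** 2) / tss
    # Guard against tiny negative values caused by rounding.
    return float(np.sqrt(max(r2, 0.0)))
```

A value close to one means that the true direction lies almost entirely in the span of the estimated directions, while a value near zero means that the estimated directions carry little information about it.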

Figure 2 shows that the OLM method outperforms the other three approaches in the estimation of the direction of . For the estimation of the direction of , and , all four methods perform similarly. We therefore conclude that the OLM method performs as well as or better than the three other methods in the estimation of canonical direction matrices.

Next we tested each coordinate effect on both canonical direction matrices with significance level 5%. Since , , and contribute to and , and for and should be rejected 100% of the time and for and should be rejected about 5% of the time. We report the percentages of rejections of , , and in Table 1, because testing behaviors of , , and are similar to those of , , and in order. Table 1 shows that variable selection with respect to each canonical direction matrix is not a cause for concern with moderate sample sizes.

Table 1. Percentages of rejections of , , and in Section 5.

| Sample size | | | | |
|---|---|---|---|---|
| 100 | 100.0 | 11.3 | 100.0 | 18.5 |
| 200 | 100.0 | 6.40 | 100.0 | 12.2 |

## 6 Minneapolis school data analysis

To illustrate the CCA methodology explained in Section 'Canonical correlation in regression', we use data on the performance of students in Minneapolis schools introduced in Cook (1998). The dimensional variables consist of the percentages of students in a school scoring above (A) and below (B) average on standardised fourth and sixth grade reading comprehension tests, . Subtracting either pair of grade-specific percentages from 100 gives the percentage of students scoring about average on the test. From the collection of all variables in the dataset, the following five variables were chosen to be components of for the purpose of illustration: (i) the percentage of children receiving Aid to Families with Dependent Children (AFDC); (ii) the percentage of children not living with both biological parents (B); (iii) the percentage of adults in the school area who completed high school (HS); (iv) the percentage of persons in the area below the federal poverty level (PL); and (v) the pupil-teacher ratio (PT). The first four variables in were square-root transformed to satisfy Condition C1. The efficacy of the transformation was confirmed by graphical inspection (not reported).

Variable selection for and was performed; the related $p$-values are summarised in Table 2. According to the table, and in and and in are determined to be significant at level 0.05. For the purpose of illustration, both the proposed linear modeling approach and the standard CCA procedure were considered.

Table 2. $p$-values for variable selection in the Minneapolis school data.

| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| 0.000 | 0.087 | 0.001 | 0.384 | 0.449 | 0.384 | 0.649 | 0.000 | 0.000 |

To determine the number of pairs of canonical variates, the Bonferroni procedure and Bartlett statistics were applied in two cases: before and after variable selection. Table 3 presents the corresponding $p$-values. Before the selection, with significance level 5%, the Bonferroni procedure yields the estimate , while the standard CCA yields . However, after selection, both procedures yield . The different conclusions before and after variable selection may result from standard CCA failing to properly detect the quadratic relationship between the first two canonical variates of because of noise induced by extraneous variables.

Table 3. $p$-values for the rank estimations in canonical correlation analysis in the Minneapolis school data.

| Before variable selection | | | After variable selection | | |
|---|---|---|---|---|---|
| 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 0.123 | 0.017 | 0.070 | 0.002 | 0.000 | 0.002 |
| 0.583 | 0.575 | 0.613 | N/A | N/A | N/A |

To further explain the relationship between and through multivariate regression analysis one can commence modeling using and as responses and and as predictors.

## 7 Discussion

In this article, we propose linear modeling of canonical correlation analysis by considering a quadratic objective function. In the linear modeling approach, we construct an optimal class, which is discussed herein. Canonical direction matrices are then estimated through the minimisation of the objective functions in the optimal class, and a Bonferroni procedure is proposed to determine the number of pairs of canonical variates. It can be shown that standard canonical correlation analysis, as well as other existing methods, can be expressed in the same form as the linear modeling approach, and that they are sub-optimal cases of this approach. In addition the proposed approach enables us to conduct variable selection in canonical correlation analysis.

We investigate the role of canonical correlation analysis in multivariate regression, and it turns out that the canonical variates can be used as dimension-reduced responses and predictors without loss of information on the conditional mean under mild conditions.

We believe that this paper re-emphasises the importance and usefulness of canonical correlation analysis in multivariate data analysis. The code for the proposed approach is available upon request.

## Appendix: Justifications

Proof of Lemma 1. For notational convenience, let , set , where the th column of . Therefore . Defining , the maximum number of non-zero eigenvalues acquired from the spectral decomposition of is . We denote the eigensystem of as follows: with where is the eigenvector corresponding to , for . We will not consider the eigenvectors corresponding to the zero eigenvalues, because they are not informative in the reduction of .

Let . According to lemma A.1 of Cook & Ni (2005), we have

where is the usual Euclidean norm and is the value of that minimises .

We re-express as and define , , and . It then follows that:

The last equation is the same as the quadratic objective function in (8). Therefore, we have that for , and . From these results, the conclusions follow directly.

Proof of Lemma 2. Recall that is the covariance matrix of the asymptotic distribution of and that the pairs of and are the OLM solutions of the minimisation of (6). Define and let and be the solutions of the minimisation of (8).

According to theorem 2 in Yoo & Cook (2007) and lemma A.4 in Cook & Ni (2005), the explicit expressions for and are as follows:

(10)

where and is the Jacobian matrix

Define . Then for any , , because . Using (10), the explicit form of is as follows:

Replacing to by the corresponding quantities, we see that is equivalent to , and this completes the proof.

## Acknowledgements

This work was supported by Basic Science Research Program through the National Research Foundation of Korea (KRF) funded by the Ministry of Education, Science and Technology (2012-004002) for Keunbaik Lee and (2012-040077) for Jae Keun Yoo, respectively. The authors are also grateful to the associate editor and the three referees for many insightful and helpful comments.