## 1 Introduction

Principal component analysis is one of the most popular dimension reduction tools in high-dimensional data analysis. Since it does not require the inversion of the covariance matrix of the variables, it has convenient applications to data where the sample size is less than the number of variables, namely $n < p$ data.
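To make this point concrete, PCA can be obtained from the singular value decomposition of the centred data matrix, so no covariance matrix is ever inverted. The following NumPy sketch is our own illustration (dimensions and names are arbitrary, not from this paper) of PCA computed directly when $n < p$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                       # fewer observations than variables (n < p)
X = rng.standard_normal((n, p))

# PCA via the SVD of the centred data matrix: no covariance matrix
# is inverted, so n < p poses no difficulty.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

scores = Xc @ Vt.T                   # principal component scores
var_explained = s**2 / (n - 1)       # variance carried by each component
```

At most $\min(n-1, p)$ components carry nonzero variance, which is why only $n$ score columns appear here and the smallest one is numerically zero.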

Canonical correlation analysis (CCA) seeks pairs of linear combinations from two sets of variables based on maximisation of the Pearson correlation between each pair. We call the pairs of linear combinations and their correlations *canonical variates* and *canonical correlations*, respectively. It is believed that a few pairs of canonical variates can represent the original sets of variables to explain their relationships and variabilities (Johnson & Wichern 2007, pp. 539–574).
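At the sample level, the canonical directions and correlations can be computed from the singular value decomposition of the standardised cross-covariance matrix $\mathbf{S}_{xx}^{-1/2}\mathbf{S}_{xy}\mathbf{S}_{yy}^{-1/2}$. The NumPy sketch below is our own illustration on simulated data sharing one latent signal; the variable names, dimensions, and simulation design are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
z = rng.standard_normal(n)                       # latent signal shared by both sets
X = np.column_stack([z, rng.standard_normal(n)]) + 0.5 * rng.standard_normal((n, 2))
Y = np.column_stack([z, rng.standard_normal(n)]) + 0.5 * rng.standard_normal((n, 2))

def cca(X, Y):
    """Canonical directions and correlations via the SVD of
    Sxx^{-1/2} Sxy Syy^{-1/2}."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    m = len(X) - 1
    Sxx, Syy, Sxy = Xc.T @ Xc / m, Yc.T @ Yc / m, Xc.T @ Yc / m

    def inv_sqrt(S):                             # symmetric inverse square root
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    U, rho, Vt = np.linalg.svd(inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy))
    A = inv_sqrt(Sxx) @ U                        # coefficients of the X variates
    B = inv_sqrt(Syy) @ Vt.T                     # coefficients of the Y variates
    return A, B, rho                             # rho: canonical correlations

A, B, rho = cca(X, Y)
```

The first pair of canonical variates attains the correlation `rho[0]`; each later pair maximises the Pearson correlation subject to being uncorrelated with the earlier pairs.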

Since principal component analysis is based on the marginal covariance structure of each set of variables, it ignores any association between the two sets when effecting dimension reduction. In contrast, CCA reduces the dimensions of the two sets while maximising the Pearson correlation between them. Therefore, when a high-dimensional relationship between two sets of variables is of interest with $n < p$, the latter should be more appropriate than the former and can produce a more reasonable dimension reduction.

It should be noted here that CCA is closely related to classical multivariate regression:

$$\mathbf{Y} = \boldsymbol{\alpha} + \boldsymbol{\beta}^{\mathrm{T}}\mathbf{X} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim N(\mathbf{0}, \boldsymbol{\Sigma}_{\varepsilon\varepsilon}), \tag{1}$$

where $\mathbf{Y} \in \mathbb{R}^{r}$ is the random vector of responses, $\mathbf{X} \in \mathbb{R}^{p}$ is a vector of predictors, $\boldsymbol{\alpha} \in \mathbb{R}^{r}$ is an intercept vector, $\boldsymbol{\beta} \in \mathbb{R}^{p \times r}$ is an unknown coefficient matrix, and the error vector $\boldsymbol{\varepsilon}$ is independent of $\mathbf{X}$. The notation $N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ indicates the multivariate normal distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. Also, for a symmetric matrix $\mathbf{A}$, the notations $\mathbf{A} > 0$ and $\mathbf{A} \geq 0$ indicate that $\mathbf{A}$ is positive definite and positive semi-definite, respectively.

It was shown by Tso (1981) that the maximum likelihood analysis of reduced-rank regression under (1) has an intimate connection with CCA. Tso's work showed that the canonical covariates corresponding to the predictors can be used as dimension-reduced predictors in multivariate regression. Also, Yoo, Kim & Um (2012) studied the relation between CCA and regression analysis, focusing on a reduced-rank regression framework. A method of estimating the canonical variates more accurately by considering weights was proposed by Ter Braak (1990). The weights were constructed using the residuals $\hat{\boldsymbol{\varepsilon}} = \mathbf{Y} - \hat{\boldsymbol{\alpha}} - \hat{\boldsymbol{\beta}}^{\mathrm{T}}\mathbf{X}$, where $\hat{\boldsymbol{\alpha}}$ and $\hat{\boldsymbol{\beta}}$ represent the ordinary least squares estimates under (1).
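To make these quantities concrete, the sketch below (our own NumPy illustration on simulated data, not code from any of the cited papers) computes the ordinary least squares estimates under (1) and the residuals from which such weights are built; the weighting scheme itself is not reproduced:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, r = 200, 3, 2                       # sample size, predictors, responses
X = rng.standard_normal((n, p))
beta = rng.standard_normal((p, r))        # true coefficient matrix
alpha = np.array([1.0, -2.0])             # true intercept vector
Y = alpha + X @ beta + 0.1 * rng.standard_normal((n, r))

# Ordinary least squares under model (1): regress Y on X with an intercept.
X1 = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
alpha_hat, beta_hat = coef[0], coef[1:]

# Residuals of the fit, the ingredient of Ter Braak's weights.
resid = Y - alpha_hat - X @ beta_hat
```

With an intercept in the model, the residuals have exactly zero sample mean, and with modest noise the OLS coefficients recover the true $\boldsymbol{\beta}$ closely.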

Even though the papers discussed above improved estimation in CCA and clarified the role of CCA in data analysis, they have several practical limitations. The equivalence of maximum likelihood estimation in reduced-rank regression to CCA shown by Tso (1981) holds only under (1); that is, if the random error is not multivariate normal, this equivalence is no longer valid. Additionally, reduced-rank regression typically reduces the dimensions of the predictors alone, even though the responses are multi-dimensional. If one considers that CCA was introduced to reduce the dimensions of two sets of variables simultaneously, the relation between CCA and reduced-rank regression described in Yoo *et al*. (2012) is not quite satisfactory. Further, optimality through weights as developed by Ter Braak (1990) holds only under (1), just as for Tso (1981); that is, the normality of the random error is a crucial condition. Finally, variable selection in CCA has been largely neglected to date, although it can substantially help the interpretation of canonical variates.

The main purpose of this paper is to overcome these deficiencies in CCA. To achieve this, several steps are required. First, we connect CCA to ordinary least squares coefficients and then establish a linear modeling form of CCA based on this connection. Under this setup, the unknown quantities in CCA are estimated optimally, in a sense that will be discussed in later sections. Second, we show that standard CCA and the method of Ter Braak (1990) are sub-optimal. In addition, we propose a method of variable selection in CCA under the linear modeling of CCA. Finally, we investigate CCA as a dimension reduction tool in multivariate regression by adopting existing theories of sufficient dimension reduction.

The paper is organised as follows. In Section 'Classical canonical correlation analysis (CCA)', we give a brief explanation of classical canonical correlation analysis. In Section 'Linear modeling of canonical correlation analysis' we develop a linear modeling approach for canonical correlation analysis. The roles of canonical covariates as a means of dimension reduction in multivariate regression are studied in Section 'Canonical correlation in regression'. Sections 'Simulation study' and 'Minneapolis school data analysis' provide numerical studies and a real data application, respectively. Finally, a summary of our work is provided in Section 'Discussion'. To avoid interrupting the discussion, proofs for most results are given in the Appendix.