Low-rank constrained weighted discriminative regression for multi-view feature learning

In recent years, multi-view learning has attracted much attention in the fields of data mining, knowledge discovery and machine learning, and has been widely used in classification, clustering, information retrieval and so forth. A new supervised feature learning method for multi-view data, called low-rank constrained weighted discriminative regression (LWDR), is proposed. Different from previous methods that handle each view separately, LWDR learns a discriminative projection matrix by fully exploiting the complementary information among all views from a unified perspective. Based on the least squares regression model, the high-dimensional multi-view data are mapped into a common subspace, in which different views are assigned different weights in the projection. The weights are adaptively updated to estimate the roles of all views. To improve the intra-class similarity of the learned features, a low-rank constraint is designed and imposed on the multi-view features of each class, which improves feature discrimination. An iterative optimization algorithm is designed to solve the LWDR model efficiently. Experiments on four popular datasets, including Handwritten, Caltech101, PIE and AwA, demonstrate the effectiveness of the proposed method.


| INTRODUCTION
Owing to the rapid growth of multimedia data, multi-view learning has aroused much interest in recent years among researchers in data mining, knowledge discovery and machine learning [1][2][3][4][5][6][7][8]. In the real world, one object can be described by different kinds of data or from different views. For example, a news story can be expressed through text, audio and video, and a person can be identified by face, fingerprint and DNA information. Although these features may be heterogeneous and very different, they naturally reflect inherent structures or characteristics of the object. Compared with single view data, multi-view data contains more underlying information. How to effectively and efficiently exploit the correlative yet complementary information among diverse views is an important research topic in multi-view learning [9].
Many researchers have tried to develop effective information fusion techniques for multi-view learning, including unsupervised [9], semi-supervised [10] and supervised [11,12] approaches. Some proposed multi-view learning methods based on co-training [13,14] and co-regularization [15,16]. However, these methods neglect the problems caused by the high dimensionality of multi-view data. Canonical correlation analysis (CCA) is a classical unsupervised multi-view subspace learning method that seeks a low dimensional feature space by maximizing the correlation between different views [9]. However, CCA can only deal with two views, which limits its application to more complex data. Luo et al. proposed a tensor CCA method that extends the original CCA to multiple views [17]. To make use of label information for better classification performance, Kan et al. proposed a multi-view discriminant analysis (MvDA) method that maximizes the inter-class distance and minimizes the intra-class distance from both the inter-view and intra-view perspectives [11]. MvDA can be regarded as the extension of linear discriminant analysis (LDA) to multi-view data. The methods mentioned above can be categorized as subspace learning based methods, which assume that different views are generated from a common latent feature subspace.
Regression-based methods are among the most popular in machine learning [18][19][20][21][22], and they provide another effective and efficient route to multi-view feature learning [23]. Specifically, regression-based methods seek a linear mapping that transforms the data to fit the label matrix. Zheng et al. extended the low-rank regression model for single view data to multi-view fully low-rank regression (FLR), in which multiple projection matrices are learned and the final classification is performed by majority voting [23]. However, FLR does not consider the differences among the views. Yang et al. proposed adaptive-weighting discriminative regression (ADR), which learns a unified transform matrix for multi-view classification [24]. ADR introduces an adjustment matrix to enlarge the distance between different classes, which improves model robustness to some extent; however, it destroys the intra-class compactness, which is also important for pattern analysis. In [25], the authors incorporated feature selection into the linear regression model via ℓ2,1-norm regularization. Wen et al. adopted a graph-regularized matrix factorization model to handle multi-view clustering with incomplete views [7].
In this article, we propose a low-rank constrained weighted discriminative regression (LWDR) method for multi-view feature learning. LWDR learns a common feature subspace across all views. An adaptive weight learning mechanism automatically weighs the views, so that important views containing more discriminative information are enforced to contribute more to subspace learning. To improve intra-class similarity, a class-wise low-rank constraint is imposed on the multi-view features. Besides, a flexible error term based on the ℓ2,1-norm is introduced to relax the label matrix. Experiments on four datasets demonstrate that LWDR outperforms previous single view feature learning methods and related multi-view learning methods.
The rest of this article is organized as follows. Section 2 briefly reviews related work on linear regression methods for multi-view learning. Section 3 introduces the formulation of the proposed method and the optimization algorithm in detail. Section 4 reports the experimental results and analysis. Section 5 concludes the article.

| RELATED WORKS
For convenience, we first present the main notations used in this article. Matrices and vectors are written in boldface uppercase and boldface lowercase, respectively. {X_k}_{k=1}^v denotes a multi-view dataset with v views. X_k ∈ R^{m_k×n} is the data matrix of the k-th view, where m_k is the feature dimensionality of view k and n is the number of training samples. c is the number of classes. Y ∈ {0,1}^{c×n} is the label matrix, with Y_ij = 1 if the j-th training sample belongs to class i and Y_ij = 0 otherwise. W_k ∈ R^{m_k×c} is the projection matrix of the k-th view. For a matrix A = (A_ij) ∈ R^{p×q}, its ℓ2,1-norm, Frobenius norm and nuclear norm are defined as ‖A‖_{2,1} = Σ_{i=1}^p √(Σ_{j=1}^q A_ij²), ‖A‖_F = √(Σ_{i=1}^p Σ_{j=1}^q A_ij²) and ‖A‖_* = Σ_i σ_i(A), where σ_i(A) is the i-th singular value of A. I is an identity matrix and 1_n is an n-dimensional vector with all elements equal to 1. ⊙ denotes the Hadamard (element-wise) product.
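As a quick reference, the three norms defined above can be computed with numpy (an illustrative sketch; the function names are ours, not from the paper):

```python
import numpy as np

def l21_norm(A):
    # l2,1-norm: sum over rows of the row-wise Euclidean norms,
    # i.e. sum_i sqrt(sum_j A_ij^2)
    return np.sum(np.sqrt(np.sum(A**2, axis=1)))

def frobenius_norm(A):
    # Frobenius norm: sqrt of the sum of all squared entries
    return np.sqrt(np.sum(A**2))

def nuclear_norm(A):
    # nuclear norm: sum of the singular values of A
    return np.sum(np.linalg.svd(A, compute_uv=False))

# rank-1 example: the row norms are 5 and 0, and the single non-zero
# singular value is 5, so all three norms coincide at 5 for this matrix
A = np.array([[3.0, 4.0], [0.0, 0.0]])
```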
For single view data X ∈ R^{m×n} with label matrix Y, basic linear regression finds the optimal projection matrix W by solving

min_W (1/2)‖W^T X − Y‖_F² + α ϕ(W),    (1)
where α > 0 is a balance parameter and ϕ(W) is the regularizer. The ℓ2,1-norm, Frobenius norm and nuclear norm are commonly used for ϕ(W), yielding learned features with different properties. When ϕ(W) = ‖W‖_F², Equation (1) has the closed-form solution W* = (XX^T + 2αI)^{−1}XY^T, which is very efficient in real applications. However, Equation (1) cannot be applied directly to multi-view data. Some researchers extended it to multi-view data by learning multiple projection matrices as follows [23]:

min_{W_1,…,W_v} (1/2)‖Σ_{k=1}^v W_k^T X_k − Y‖_F² + α Σ_{k=1}^v ϕ(W_k),    (2)

where W_k is the projection matrix of the k-th view. It can be observed that Equation (2) is equivalent to concatenating the multi-view features into one single view feature and performing traditional linear regression on it. Such an operation is not physically meaningful and treats all views equally. However, different views usually have different characteristics and contribute differently to pattern analysis; for example, to identify a person, facial appearance is more important than voice. Thus, an adaptive weight learning strategy is introduced into the multi-view linear regression model by weighting the views, which can be generally described as follows:

min_{W_k, δ_k} (1/2)‖Σ_{k=1}^v δ_k W_k^T X_k − Y‖_F² + α Σ_{k=1}^v ϕ(W_k),  s.t.  Σ_{k=1}^v δ_k = 1,  δ_k ≥ 0,    (3)
where δ_k is the weight of the k-th view. The fitting term together with the constraints forces the model to learn an optimal weight for each view automatically. However, in Equation (3) the latent multi-view features are directly regressed to approximate the binary label matrix Y, which may be too strict and is not an appropriate regression target. To improve model robustness, the authors of [24] relaxed the label matrix and proposed the following auto-weighted discriminative regression (ADR) model for multi-view classification:

min_{W_k, δ_k, b, M} ‖Σ_{k=1}^v δ_k W_k^T X_k + b1_n^T − (Y + Y ⊙ M)‖_F² + α Σ_{k=1}^v ϕ(W_k),  s.t.  Σ_{k=1}^v δ_k = 1,  δ_k ≥ 0,  M ≥ 0,    (4)

where b ∈ R^c is a bias vector and M is a non-negative adjustment matrix. By introducing M, each one-zero label vector is relaxed entry-wise, where M_ij is the corresponding element of M. Equation (4) thereby uses the ϵ-draggings technique to enlarge the distances between the true and false classes and improve model robustness. Such a label relaxation strategy is widely used in other regression-based methods [26,27].
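The closed-form ridge solution of Equation (1), W* = (XX^T + 2αI)^{−1}XY^T, can be sketched directly in numpy (a minimal illustration on synthetic data; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, c = 5, 20, 3                         # feature dim, samples, classes
X = rng.standard_normal((m, n))            # single view data matrix
Y = np.eye(c)[:, rng.integers(0, c, n)]    # one-hot label matrix, c x n
alpha = 0.1

# closed-form minimizer as stated in the text:
# W* = (X X^T + 2*alpha*I)^{-1} X Y^T
W = np.linalg.solve(X @ X.T + 2 * alpha * np.eye(m), X @ Y.T)
scores = W.T @ X                           # c x n regression outputs
```

Using `np.linalg.solve` instead of forming the explicit inverse is the standard numerically stabler way to evaluate this expression.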

| Formulation
As discussed in Section 2, although the ϵ-dragging technique used in ADR [24] enlarges the distance between different classes (i.e. the inter-class separability) to some extent, samples of the same class may receive different labels after relaxation (i.e. Y + Y ⊙ M) because M changes dynamically. Thus, the intra-class similarity cannot be guaranteed in ADR, and two samples from the same class may be projected far apart. Both inter-class separability and intra-class similarity are important for good classification performance. In low-rank representation (LRR) [28][29][30], a set of instances is generally drawn from a union of multiple subspaces, and the instances of each subspace are regarded as coming from the same class; that is, instances from the same class should lie in the same subspace, and the data matrix of each class should therefore be low-rank. Inspired by this observation, we propose the following low-rank constrained adaptively weighted discriminative regression (LWDR) model to improve the intra-class similarity of the learned multi-view features: where α, β and γ are balance parameters, E is an error matrix, H is the feature matrix to be learned, and H_i denotes the features of the i-th class. The first term in Equation (5) learns the multi-view features, which are approximated by the matrix H. The second term utilizes the label matrix to supervise the feature learning. The third term Σ_{i=1}^c ‖H_i‖_* forces the features of the same class, which are considered to be low-rank, to be similar. The last two terms are regularizers that prevent overfitting. The error matrix E is constrained by the ℓ2,1-norm to compensate for the regression errors, which gives flexibility in learning the transform matrices {W_k}_{k=1}^v. The overall framework of our LWDR is shown in Figure 1. Equation (5) involves multiple variables and cannot be solved directly; in the next section, we give an iterative algorithm to solve it efficiently.
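The class-wise low-rank term Σ_{i=1}^c ‖H_i‖_* simply sums the nuclear norms of the per-class column blocks of H. A small numpy sketch of this penalty (H and the labels are synthetic; the function name is ours):

```python
import numpy as np

def classwise_nuclear_norm(H, labels):
    # H: d x n multi-view feature matrix; labels: length-n class labels.
    # Returns the sum over classes of the nuclear norm of each class block.
    total = 0.0
    for cls in np.unique(labels):
        H_i = H[:, labels == cls]                       # columns of class cls
        total += np.sum(np.linalg.svd(H_i, compute_uv=False))
    return total

rng = np.random.default_rng(1)
H = rng.standard_normal((4, 10))
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
penalty = classwise_nuclear_norm(H, labels)
```

Identical features within a class make each block rank-one and the penalty small, while scattered features inflate it, which is exactly the behaviour the constraint discourages.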

| Optimization
To make the variables separable in Equation (5), we introduce an auxiliary variable T and rewrite the original problem as follows: where W̃_k = √δ_k W_k. By some simple manipulation, the above problem can be equivalently expressed as Equation (7).

Figure 1: The overall framework of the LWDR model for multi-view feature learning. LWDR adaptively learns a projection matrix and a weight for each view. The multi-view features H are then obtained by integrating the single-view features using the learned weights. The label matrix Y is used as the regression target to guide the feature learning in a supervised way, and a sparse error matrix E is introduced to compensate for the regression error. To improve the intra-class compactness of the multi-view features, a class-wise low-rank constraint on the multi-view features is incorporated into LWDR.

Solving Equation (7) is equivalent to minimizing its augmented Lagrangian function L_θ, defined as follows: where θ > 0 is a penalty factor and Z is the augmented Lagrange multiplier. We adopt an iterative strategy to minimize it, in which a sequence of sub-problems with respect to each unknown variable is solved in turn [31][32][33]. Specifically, each iteration contains the following six steps.
(1) Update W̃: We first collect {δ_k}_{k=1}^v into δ and rewrite the sub-problem w.r.t. W̃ as follows: This is a typical least squares regression problem. By setting its derivative w.r.t. W̃ to zero, we can obtain its optimal solution: Since δ collects {δ_k}_{k=1}^v, the final solution is: where Δ = diag(1/δ_1, …, 1/δ_1, 1/δ_2, …, 1/δ_v), in which each 1/δ_k is repeated according to the dimensionality of the k-th view.
(2) Update H by solving the following minimization problem: With the same strategy as in optimizing W̃, we can obtain the solution to Equation (12): (3) Update E by solving the following problem: Equation (14) can be solved by the following theorem.
Theorem 1 [34] Given Q ∈ R^{p×q} and λ > 0, the optimal solution A* of min_A λ‖A‖_{2,1} + (1/2)‖A − Q‖_F² is given by G_λ(Q), where G_λ(Q) is the following row-wise shrinkage operator: [G_λ(Q)]_i = ((‖Q_i‖_2 − λ)/‖Q_i‖_2) Q_i if ‖Q_i‖_2 > λ, and [G_λ(Q)]_i = 0 otherwise, where Q_i is the i-th row of matrix Q.
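The operator G_λ in Theorem 1 is a row-wise shrinkage; a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def row_shrink(Q, lam):
    # G_lambda(Q): scale each row by max(||Q_i|| - lam, 0) / ||Q_i||.
    # This is the closed-form minimizer of lam*||A||_{2,1} + 0.5*||A - Q||_F^2.
    out = np.zeros_like(Q)
    norms = np.linalg.norm(Q, axis=1)
    keep = norms > lam                      # rows with norm <= lam shrink to zero
    out[keep] = ((norms[keep] - lam) / norms[keep])[:, None] * Q[keep]
    return out
```

For example, with λ = 1 a row of norm 5 is rescaled by 4/5, while a row of norm 0.5 is zeroed out entirely; this row-sparsity is what makes the ℓ2,1-constrained error matrix E discard unreliable rows.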
According to Theorem 1, we can obtain the solution to Equation (14): (4) Update T with the other variables fixed by solving the following optimization problem: It can be transformed into solving for each T_i separately.

ZHANG AND LI
The optimal T * i can be obtained by the following theorem.
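The theorem invoked here is presumably the standard singular value thresholding (SVT) result for nuclear-norm proximal problems of the form min_A τ‖A‖_* + (1/2)‖A − Q‖_F²; a minimal numpy sketch under that assumption:

```python
import numpy as np

def svt(Q, tau):
    # singular value thresholding: minimizer of tau*||A||_* + 0.5*||A - Q||_F^2.
    # Soft-threshold the singular values of Q by tau and reassemble the matrix.
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

Thresholding the singular values is what drives each class block T_i towards low rank in step (4).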
(5) Fix the others and update {δ_k}_{k=1}^v: According to the Cauchy inequality [24], the bound in Equation (20) holds. The "=" in Equation (20) is satisfied if and only if Equation (21) holds, and Equation (19) attains its minimum when δ_k takes the values given by Equation (21).
(6) Fix the others and update Z and θ via the standard augmented Lagrangian rules: Z is updated with the current constraint residual and θ is increased by θ ← min(ρθ, θ_max). In this article, θ_max = 10^6 and ρ = 1.1. By performing steps (1)-(6) iteratively, the original loss function is minimized until convergence or until the maximum number of iterations is reached. Algorithm 1 summarizes the iterative algorithm for solving the LWDR model in detail.

| Computational complexity analysis
From Algorithm 1, the main computational time is spent on the calculations inside the iterations. Steps (1) and (2) contain matrix inverse operations, and step (4) contains SVD operations; the operations in the other steps are matrix additions and multiplications, which have much lower complexity than matrix inversion and SVD. The time cost of step (1) is O(n³). In step (2), (2 + θ)I is a diagonal matrix with ((2 + θ)I)^{−1} = (1/(2 + θ))I, so the matrix inverse operation reduces to a scalar-matrix multiplication.
Step (4) needs c SVD operations in total, with time complexity O(c min(ca², c²a)), where a is the average number of samples per class. Thus, the total computational complexity of Algorithm 1 is O(t(n³ + c min(ca², c²a))) for t iterations.

| Connections to other methods
• Connections to FLR [23]: Similar to our proposed LWDR, FLR also adopts a regression model for multi-view feature learning. FLR learns a projection matrix for each view under a low-rank constraint, which helps to explore the low-rank structure of the data. However, FLR treats all views equally and ignores the fact that different views play different roles in pattern analysis, which may degrade classification performance. In contrast, our LWDR adaptively learns the weights of all views and enforces the informative views to contribute more to feature learning. Thus, the proposed method can achieve better performance than FLR, as demonstrated in the experiments.
• Connections to ADR [24]: ADR learns a weighted multi-view regression model for feature learning, which considers the different weights of different views. To avoid a rigid regression target, ADR regresses onto the relaxed label matrix presented in Equation (4), which helps enlarge the margins between samples from different classes. However, according to [34], the margins between samples from the same class may also be enlarged, which compromises the discriminative power of the projection matrix. To address this problem, LWDR introduces the class-wise low-rank constraint, which enforces the transformed samples of the same class to share the same structure. In this way, the margins between transformed samples of the same class are reduced and the intra-class compactness is improved. Therefore, the proposed method has the potential to perform better than ADR.

| EXPERIMENTS
In this section, we conduct experiments on the Handwritten [36], Caltech101 [37], PIE [38] and AwA [39] datasets to validate the effectiveness of the proposed LWDR, comparing it with single view and related multi-view learning methods. Figure 2 shows some sample images used in our experiments. Table 1 presents the views, feature dimensions, and numbers of classes and samples of these datasets.

| Experimental setup
For the Handwritten and Caltech101 datasets, we randomly select 10 images per subject for training and use the rest for testing. For PIE, five face images per person are randomly chosen for training. For AwA, the training set size of each class is 20.
We compare our proposed LWDR with single view and multi-view approaches. For the single view methods, each view is handled separately: LDA is performed on each view for feature learning [40], with the reduced dimension set to c − 1. For the multi-view baselines, all views are simply concatenated before LDA feature extraction, denoted LDA (all). We also concatenate the top three views and then perform LDA, denoted LDA (top three); the top three views are selected based on the performance of the single view methods.
The multi-view methods FLR [23] and ADR [24] are included for comparison; their parameter settings follow the authors' suggestions. Nearest neighbour classification with cosine distance is used. We repeat the experiments 20 times with randomly sampled data partitions and report the mean recognition rate with standard deviation. Table 2 lists the classification accuracies of the single view and multi-view methods on the Handwritten, Caltech101, PIE and AwA datasets. As can be clearly seen, our LWDR achieves the best performance on all four datasets. SVi (i = 1, 2, …, 6) denotes the single view method that performs LDA on the i-th view. The performance of the SV methods varies significantly, which indicates that the views play different roles in classification. Simply concatenating all views helps improve performance, and LDA (all) is superior to all SV methods. LDA (top three) generally produces better performance than LDA (all), except on the Caltech101 dataset. FLR learns a projection matrix for each view and adopts majority voting for classification, which obtains better performance than simple concatenation. ADR uses the adaptive weight learning strategy and the ϵ-dragging technique, and performs better than FLR. However, our proposed LWDR still outperforms ADR in multi-view feature learning. For FLR, ADR and LWDR, the top three views of the four datasets are also tested. We observe that by using the most informative top three views, these methods can generally achieve better performance.

| Adaptive weights analysis
LWDR can automatically learn the weights of different views; a larger weight means that the corresponding view makes a greater contribution to feature learning. The views receiving the largest weight on the four datasets, respectively, also perform well among all SV methods in Table 2. These experimental results demonstrate that LWDR automatically pays more attention to the views containing more discriminative information and makes full use of them to learn discriminative features.

| Ablation study
In our proposed LWDR model (Equation 5), each view is assigned an adaptively learned weight, and a low-rank constraint is imposed on the class-wise multi-view features to improve intra-class compactness. To evaluate the effect of each component separately, we conduct ablation experiments in this section. Two variants are derived from LWDR: LWDR-s and LWDR-t. LWDR-s discards the weight learning in LWDR, that is, it sets δ_k = 1; LWDR-t discards the low-rank constraint and directly uses the label matrix to guide the feature learning. The experimental results of LWDR and its two variants on the four datasets are reported in Table 4. LWDR outperforms both LWDR-s and LWDR-t, which demonstrates that the adaptive weights and the class-wise low-rank constraint are both effective in boosting performance.

| Convergence analysis
In Section 3.2, we proposed an iterative algorithm for solving the LWDR model. Here we illustrate its good convergence behaviour experimentally. Figure 4 shows the convergence curves of the proposed algorithm versus iterations on the Handwritten, Caltech101, PIE and AwA datasets. The objective function value gradually decreases to a stable value as the number of iterations increases; in particular, the algorithm generally converges within 20 iterations. These results demonstrate that our algorithm is effective and efficient for solving the LWDR model.

| Parameter analysis
There are three parameters in LWDR, namely α, β and γ, which influence the performance of the proposed method. To analyse parameter sensitivity, we first define the candidate set {10^−3, 10^−2, 10^−1, 1, 10, 10^2, 10^3} for α and γ, and {10^−3, 10^−2, 10^−1, 1, 10, 10^2} for β. LWDR is then run on the four datasets with different combinations of the three parameters. Since selecting optimal parameters for different datasets is still difficult, we use a simple grid search [34]: we first fix β at a value in [10^−3, 10^−1], such as 0.01, and find the optimal α and γ over combinations of these two parameters; after obtaining the optimal α and γ, we find the optimal β by searching over its candidate set.
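The two-stage search described above can be sketched as follows, where `evaluate` stands in for a hypothetical function that trains LWDR with one (α, β, γ) triple and returns validation accuracy:

```python
import itertools

GRID_AG = [1e-3, 1e-2, 1e-1, 1, 10, 1e2, 1e3]   # candidates for alpha and gamma
GRID_B = [1e-3, 1e-2, 1e-1, 1, 10, 1e2]         # candidates for beta

def grid_search(evaluate):
    # Stage 1: fix beta at a small value and jointly search alpha and gamma.
    beta = 1e-2
    alpha, gamma = max(itertools.product(GRID_AG, GRID_AG),
                       key=lambda ag: evaluate(ag[0], beta, ag[1]))
    # Stage 2: with alpha and gamma fixed, search beta over its candidate set.
    beta = max(GRID_B, key=lambda b: evaluate(alpha, b, gamma))
    return alpha, beta, gamma
```

This coordinate-wise search evaluates 7 × 7 + 6 parameter combinations instead of the full 7 × 6 × 7 grid, at the usual risk of missing jointly optimal triples.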

| CONCLUSION
We propose a low-rank constrained weighted discriminative regression (LWDR) method for multi-view feature learning. LWDR learns a common space across all views from a unified perspective. Each view is assigned an adaptive weight, which enables the model to focus on the important views automatically. A low-rank constraint is imposed on the features of each class, which improves intra-class similarity and enhances model robustness. The strict binary label matrix is relaxed by an ℓ2,1-norm based error term, which makes the model more flexible in handling errors during learning. Experimental results on several popular datasets demonstrate the effectiveness of the proposed method compared with single view and multi-view learning methods.