Orthogonal Tucker decomposition using factor priors for 2D + 3D facial expression recognition

In this article, an effective approach is proposed to recognise 2D + 3D facial expressions automatically based on orthogonal Tucker decomposition using factor priors (OTDFPFER). Tucker decomposition, a powerful low-rank approximation technique, is used to extract useful information from a 4D tensor constructed from 3D face scans and 2D images, so as to preserve the correlations among samples and their structural information. Finding a set of projected factor matrices is our ultimate goal. During the 4D tensor modelling process, high similarities among samples emerge because part of the information is missing. Based on the orthogonal Tucker decomposition of this tensor, the core tensor with structured sparsity and a graph regularisation term, built from the graph Laplacian matrix together with the fourth factor matrix, are employed to better characterise the generated similarities and to keep the consistency of the low-dimensional space. To recover the missing information, a tensor completion (TC) framework is embedded naturally. Finally, an alternating direction method coupled with a majorisation-minimisation scheme is designed to solve the resulting tensor completion problem. Numerical experiments are conducted on the Bosphorus and BU-3DFE databases.


| INTRODUCTION
Facial expression is an essential non-verbal channel of human emotional communication, and a main means of conveying cues about a person's emotional state. Nowadays, with the fast development of intelligent devices such as scanners, smart phones, sensors, and cameras, understanding, computing, and measuring human emotional expressions has become an appealing goal, with the potential to substantially improve many applications in human-computer interaction, entertainment, commerce, medical service, and psychology [1]. Therefore, facial expression recognition (FER) has attracted more and more attention in the domains of pattern recognition, computer vision, multimedia, and affective computing [2,3].
As pioneers of facial expression analysis in the late 1970s, Ekman and Friesen pointed out that there is a universal set of facial expressions among human beings, namely anger, disgust, fear, happiness, sadness, and surprise, plus the neutral one, which are recognised worldwide and remain consistent across different ethnicities and cultures. Since then, a great deal of progress has been made in exploring human facial expression. Earlier research on FER mainly focused on 2D face images and videos [4][5][6], and considerable advances were obtained along with algorithm evolution. However, illumination and pose changes remain very challenging problems that hinder further practical application.
With the advent of 3D acquisition equipment, 3D methods are considered a promising option for handling the problems left unsolved in the 2D field. Furthermore, facial expressions (FE) are generated by facial muscle movements, which cause transient deformations in both facial texture and geometry, and the subtle variations occurring in the depth of the 3D facial surface can be captured in detail by 3D scans. Compared with 2D data, 3D data theoretically contribute more to FER.
Existing methods for 3D FER can be roughly divided into three categories: model-based, feature-based [3,7], and deep learning-based. Model-based approaches usually transform a generic neutral template into a 3D face model by fitting; the fitting parameters or coefficients generated then serve as the feature vector for the classifier. Examples include a bilinear model [8], a model based on the statistical facial feature (SFAM) [9], and the combination of the basic facial shape component (BFSC) and the expressional shape component (ESC) [10]. In feature-based methods, local expression features around facial landmarks are extracted, chiefly by utilising 3D surface geometric attributes or differential quantities. For example, 3D landmark distances [11][12][13], surface normals [14], depth [15,16], surface curvatures [17], normal maps [18], conformal images [19], local surface patch-based distances [20], and curvature maps [21] are the most widely used features. In recent years, deep learning techniques have been applied in a wide range of domains, such as computer vision and pattern recognition. In deep learning-based methods, a deep neural network (DNN) [21][22][23][24] is usually constructed to extract hierarchical shape descriptions as features for expression prediction. For instance, Li et al. [21] construct a convolutional neural network (CNN) to obtain deep features from texture maps and geometry maps, and the features are fused through a subnet. In [24], a new Data Augmentation (DA) approach is proposed to improve 3D FER using DNNs. The performance of these methods on FER has been greatly improved.
Among the three categories, the main shortcomings of model-based approaches lie in the high computational cost of dense 3D face registration and model fitting, and in their sensitivity to topological variations. Feature-based methods usually require less computation, but their performance depends to a great extent on the accuracy of 3D facial landmarking. Deep learning-based methods often require large numbers of expression samples, many parameters, and the construction of complex neural networks.
Compared with these three approaches, a tensor decomposition-based method can overcome their shortcomings: it needs no dense correspondence among face scans, is insensitive to topological changes, requires no key landmarks, and works with smaller samples, fewer parameters, and less complexity.
In recent years, several tensor decomposition-based methods [25][26][27][28] have been proposed for FER. For instance, in [26], a method based on low-rank tensor completion is proposed and applied to 2D+3D FER. A 4D tensor is first constructed from 2D images and 3D face scans to protect the structural information. Then, through a Tucker decomposition of the 4D tensor, the structured sparsity of the generated core tensor combined with the low-rankness of the generated factor matrices is used to characterise the similarities among the 4D tensor samples, and the obtained factor matrices are used for classification prediction with a linear SVM classifier. Jiang et al. [28] proposed the 3D method of optimal directions (MOD), in which sparse dictionaries are first obtained by a dictionary learning algorithm, the sparse coefficient tensor is then calculated by N-way block orthogonal matching pursuit (OMP), and the resulting tensor feature is finally used to predict the facial expression. However, the existing tensor decomposition-based methods have not solved the problem that the low-dimensional features extracted by tensor decomposition of the 4D expression samples still show high similarities in the tensor subspace.
In this article, a tensor dimensionality reduction approach is proposed via orthogonal Tucker decomposition using factor priors for 2D+3D FER; its flowchart is shown in Figure 1. Firstly, a 4D tensor model is constructed by extracting nine effective features via 2D maps for all samples, and the generated 4D tensor model avoids the problems of the dimension disaster and small samples. Secondly, to address the problem that low-dimensional features extracted by tensor decomposition of the 4D tensor are similar in the tensor subspace, a Laplacian graph derived from the similarity matrix of the fourth mode of the 4D tensor is taken as the factor prior, and the fourth factor matrix is introduced into a graph embedding regularisation to keep the consistency of the low-dimensional space. Thirdly, Tucker decomposition [29], a powerful low-rank approximation technique, is utilised to capture the essential low-rank structure of the generated 4D tensor. After Tucker decomposing the 4D tensor, a set of factor matrices is found, which is our aim for tensor dimensionality reduction, and the found factor matrices are projected onto the generated 4D tensor for further classification prediction. Fourthly, when the 4D tensor is modelled, missing data emerge from the corresponding 2D mapping process, and high similarities among samples are naturally generated by stacking the samples along the fourth order of the tensor. By employing the orthogonal Tucker decomposition of the 4D tensor, the core tensor with structured sparsity and the graph regularisation term via the graph Laplacian matrix are utilised together with the fourth factor matrix to better characterise the involved similarities among samples.
F I G U R E 1 The flowchart of the proposed orthogonal Tucker decomposition using factor priors approach in the BU-3DFE database
As we know, the fourth factor matrix reflects the variation along the fourth dimension of the constructed 4D tensor, which is closely related to the involved samples. The corresponding tensor completion model is then formulated to recover the missing entries of the 4D tensor. Finally, the resulting tensor optimization problem is generally NP-hard due to the involved $\ell_0$ norm and the orthogonality constraint. The log-sum surrogate is therefore adopted to replace the $\ell_0$ norm, and an alternating direction method with a majorisation-minimisation scheme is designed to obtain an approximate numerical solution of the original optimization model, that is, to obtain the factor matrices for projection. A multi-class SVM is employed to predict the facial expression for further validation of the efficiency of the proposed method.
The main contributions of our work are:
• On the basis of an orthogonal Tucker decomposition of the constructed 4D tensor, a tensor completion optimization model is adopted to characterise the high similarities via the core tensor with structured sparsity and factor priors.
• The proposed tensor optimization model is solved effectively by the alternating direction method combined with the majorisation-minimisation scheme, and the approximate numerical solution obtained is applied to FER. The convergence and the computational complexity of the proposed algorithm are analysed in detail.
• The problem that low-dimensional features extracted by Tucker decomposition of the 4D tensor are similar in the tensor subspace is solved.
The remainder of the paper is organised as follows: In Section 2, preliminaries on tensors are reviewed. In Section 3, a novel tensor completion model based on the orthogonal Tucker decomposition and factor priors for 2D+3D FER is proposed and solved. Section 4 provides numerical experiments and relative discussion. Section 5 concludes the paper with perspectives.

| Preliminaries on tensors
Following [29], a lowercase letter $x$, a capital letter $X$ and a calligraphic letter $\mathcal{X}$ represent a vector, a matrix and a tensor, respectively. The outer, Kronecker and Hadamard products are denoted by the symbols $\circ$, $\otimes$, and $*$, respectively. $X_{(n)} \in \mathbb{R}^{I_n \times \prod_{j \neq n} I_j}$ represents the mode-$n$ unfolding of an $N$th-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, and $\mathcal{Y} = \mathcal{X} \times_n U_n$, which can be rewritten as $Y_{(n)} = U_n X_{(n)}$, denotes the mode-$n$ product of $\mathcal{X}$ with a matrix $U_n \in \mathbb{R}^{I_n \times R_n}$. $\|\mathcal{X}\|_F^2 = \langle \mathcal{X}, \mathcal{X} \rangle$ defines the Frobenius norm of $\mathcal{X}$. The $N$-tuple $(\mathrm{rank}(X_{(1)}), \ldots, \mathrm{rank}(X_{(N)}))$ denotes the multilinear rank of $\mathcal{X}$.
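These two tensor operations are easy to state concretely. Below is a minimal NumPy sketch (the function names are my own, not from the paper) of the mode-$n$ unfolding and the mode-$n$ product, with the identity $Y_{(n)} = U_n X_{(n)}$ checked numerically:

```python
import numpy as np

def unfold(X, n):
    """Mode-n unfolding: move axis n to the front, then flatten the rest."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def fold(M, n, shape):
    """Inverse of unfold for a target tensor shape."""
    full = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(M.reshape(full), 0, n)

def mode_n_product(X, U, n):
    """Y = X x_n U, computed through the identity Y_(n) = U @ X_(n)."""
    shape = list(X.shape)
    shape[n] = U.shape[0]
    return fold(U @ unfold(X, n), n, shape)

# Numerical check of Y_(n) = U X_(n)
X = np.random.rand(3, 4, 5)
U = np.random.rand(6, 4)
Y = mode_n_product(X, U, 1)
assert Y.shape == (3, 6, 5)
assert np.allclose(unfold(Y, 1), U @ unfold(X, 1))
```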

| Orthogonal Tucker decomposition
To obtain a better approximation of a given $N$th-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, the orthogonal Tucker decomposition of $\mathcal{X}$ is used, which takes the form
$$\mathcal{X} \approx \mathcal{G} \times_1 U_1 \times_2 U_2 \cdots \times_N U_N, \quad U_n \in \mathrm{St}(I_n, R_n), \quad (1)$$
where $\mathcal{G} \in \mathbb{R}^{R_1 \times \cdots \times R_N}$ is the core tensor. Here, $\mathrm{St}(I_n, R_n) := \{U_n \in \mathbb{R}^{I_n \times R_n} \mid U_n^T U_n = I_{R_n}\}$ is exactly the Stiefel manifold [30].

| Tensor projection and tensor reconstruction
Given a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, a Tucker decomposition of $\mathcal{X}$ yields a core tensor $\mathcal{G} \in \mathbb{R}^{R_1 \times R_2 \times \cdots \times R_N}$ and factor matrices $\{U_n\}$ with $U_n \in \mathbb{R}^{I_n \times R_n}$. Tensor projection is achieved by projecting $\{U_n\}$ onto each mode of $\mathcal{X}$, that is, $\mathcal{X} \times_1 U_1^T \times_2 U_2^T \cdots \times_N U_N^T$, and the tensor generated by this projection has the same size as the core tensor $\mathcal{G}$. Tensor reconstruction is obtained in the form $\mathcal{G} \times_1 U_1 \times_2 U_2 \cdots \times_N U_N$, and the resulting tensor has the same size as $\mathcal{X}$.
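The projection/reconstruction pair can be sketched with a truncated HOSVD, which produces orthonormal factors; this is a hypothetical minimal implementation for illustration, not the paper's full algorithm:

```python
import numpy as np

def unfold(X, n):
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def mode_n_product(X, U, n):
    shape = list(X.shape)
    shape[n] = U.shape[0]
    full = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis((U @ unfold(X, n)).reshape(full), 0, n)

def hosvd(X, ranks):
    """Truncated HOSVD: factors are leading left singular vectors per mode;
    the core is the projection X x_1 U1^T x_2 U2^T ... x_N UN^T."""
    Us = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :r]
          for n, r in enumerate(ranks)]
    G = X
    for n, U in enumerate(Us):
        G = mode_n_product(G, U.T, n)     # projection step
    return G, Us

X = np.random.rand(8, 8, 5)
G, Us = hosvd(X, (4, 4, 3))
Xhat = G
for n, U in enumerate(Us):                # reconstruction: same size as X
    Xhat = mode_n_product(Xhat, U, n)
assert G.shape == (4, 4, 3) and Xhat.shape == X.shape
```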

| Multilinear graph embedding
The multilinear graph embedding framework [31][32][33][34] for tensor decomposition incorporates sample similarities into the factorization by exploiting auxiliary information given as similarity matrices, adding the regularisation term
$$\sum_n \mathrm{Tr}(U_n^T L_n U_n) \quad \text{s.t. } U_n \in \mathrm{St}(I_n, R_n), \quad (2)$$
to the decomposition objective, where $\mathrm{Tr}(\cdot)$ represents the matrix trace (the smaller the value of $\mathrm{Tr}(\cdot)$, the better the sample similarities are preserved in the low-dimensional space), and $L_n = D_n - W_n$ denotes the graph Laplacian matrix with the weight matrix $W_n$ for the samples and the diagonal matrix $D_n$, $(D_n)_{ii} = \sum_j (W_n)_{ij}$. The entry $(W_n)_{ij}$ is defined with the Cosine metric as
$$(W_n)_{ij} = \begin{cases} \dfrac{\langle x_i, x_j \rangle}{\|x_i\| \, \|x_j\|}, & x_j \in N(x_i), \\ 0, & \text{otherwise}, \end{cases} \quad (3)$$
where $x_i$ and $x_j$ are the $i$th and $j$th samples respectively, and $N(x_i)$ is a neighbourhood function, typically the $k$-nearest neighbours of $x_i$.
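As one concrete reading of this construction (assuming non-neighbour pairs get weight 0 and the matrix is symmetrised by taking the elementwise maximum, details the text does not pin down), a graph Laplacian with $k$-nearest-neighbour Cosine weights could be built as:

```python
import numpy as np

def cosine_knn_laplacian(samples, k=5):
    """L = D - W with symmetric k-NN Cosine weights (a sketch; the
    symmetrisation rule is an assumption)."""
    X = samples / np.linalg.norm(samples, axis=1, keepdims=True)
    S = X @ X.T                          # pairwise cosine similarities
    np.fill_diagonal(S, -np.inf)         # exclude self-similarity
    W = np.zeros_like(S)
    for i in range(len(S)):
        nbrs = np.argsort(S[i])[-k:]     # k nearest neighbours of sample i
        W[i, nbrs] = S[i, nbrs]
    W = np.maximum(W, W.T)               # symmetrise
    D = np.diag(W.sum(axis=1))
    return D - W

L = cosine_knn_laplacian(np.abs(np.random.rand(20, 30)), k=4)
assert np.allclose(L, L.T)               # Laplacian is symmetric
assert np.allclose(L.sum(axis=1), 0)     # rows sum to zero
```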

| OUR PROPOSED OTDFPFER APPROACH
A tensor completion approach based on the orthogonal Tucker decomposition using factor priors is proposed for 2D+3D FER in this section.

| The tensor completion model
Now, $M$ effective features of size $I_1 \times I_2$ are first extracted from each sample of 2D images and 3D face scans. By stacking all $N$ samples, a 4D tensor $\mathcal{X}_0 \in \mathbb{R}^{I_1 \times I_2 \times M \times N}$ can be obtained. Our final aim is to obtain the set of factor matrices from a Tucker decomposition of $\mathcal{X}_0$, because they will be projected onto $\mathcal{X}_0$ for classification prediction. During the 4D tensor modelling, data are missed in the corresponding 2D mapping process, and high similarities are then generated by stacking all samples along the fourth order of the tensor. We attempt to recover $\mathcal{X}_0$ to an appropriate tensor $\mathcal{X}$ by employing the orthogonal Tucker decomposition $\mathcal{X} \approx \mathcal{G} \times_1 U_1 \times_2 U_2 \times_3 U_3 \times_4 U_4$. Here, $\mathcal{G} \in \mathbb{R}^{R_1 \times R_2 \times R_3 \times R_4}$ is the generated core tensor, which is assumed to have the low-rank property in the sense that $R_n \le I_n$ ($n = 1, \ldots, 4$), and structured sparsity in the sense that $\sum_{n=1}^4 \|[G_{(n)} * G_{(n)}]e\|_0$ is relatively small, with $e$ the all-one vector of an appropriate size. Moreover, since the sample similarities can also be reflected by $\mathrm{Tr}(U_4^T L U_4)$ with the corresponding graph Laplacian matrix $L$ as defined in (3), the value of this term is required to be as small as possible, which allows the samples to be better separated so that the low-dimensional features extracted from the 4D expression tensor no longer exhibit excessive similarity in the tensor subspace. Since the factor matrix $U_4$ is the principal component of the fourth-mode information of the samples, its rows correspond to the low-dimensional approximations of the samples, so the similarities of the samples are maintained in the low-dimensional space. The resulting tensor optimization problem is generally NP-hard due to the involved $\ell_0$ norm and the orthogonality constraint.
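The structured-sparsity measure counts, for each mode, the non-zero rows of $G_{(n)} * G_{(n)}$ summed along its columns, i.e. the non-zero mode-$n$ fibre energies of the core. A small sketch (the tolerance `eps` is an implementation assumption, since exact zeros rarely survive floating point):

```python
import numpy as np

def unfold(G, n):
    return np.moveaxis(G, n, 0).reshape(G.shape[n], -1)

def structured_sparsity(G, eps=1e-10):
    """sum_n || [G_(n) * G_(n)] e ||_0: counts non-zero row energies per mode."""
    total = 0
    for n in range(G.ndim):
        row_energy = (unfold(G, n) ** 2).sum(axis=1)  # [G_(n) * G_(n)] e
        total += int((row_energy > eps).sum())        # ell_0 of that vector
    return total

G = np.zeros((3, 4, 2))
G[0, :2, 0] = 1.0
print(structured_sparsity(G))  # → 4 (one row in mode 1, two in mode 2, one in mode 3)
```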
Thus, the tensor completion model turns out to be
$$\min_{\mathcal{G}, \{U_n\}, \mathcal{X}} \ \frac{1}{2}\|\mathcal{X} - \mathcal{G} \times_1 U_1 \times_2 U_2 \times_3 U_3 \times_4 U_4\|_F^2 + \sum_{n=1}^4 \|[G_{(n)} * G_{(n)}]e\|_0 + \theta\, \mathrm{Tr}(U_4^T L U_4)$$
$$\text{s.t. } \mathcal{X}(\Omega) = \mathcal{X}_0(\Omega), \ U_n \in \mathrm{St}(I_n, R_n), \quad (4)$$
where $\Omega$ is the index set of the non-zero entries of $\mathcal{X}_0$, meaning that there are some missing items (i.e., zero items) caused by mapping the 3D face scans into the 2D plane (see the details in Section 4.2.1), and $\theta > 0$ is the regularisation parameter. By adopting the log-sum relaxation strategy illustrated in [35], the relaxation counterpart of (4) is obtained by replacing each $\ell_0$ term with its log-sum surrogate, subject to the same constraints $\mathcal{X}(\Omega) = \mathcal{X}_0(\Omega)$ and $U_n \in \mathrm{St}(I_n, R_n)$; we refer to this relaxed problem as (5).

| An alternating direction method
By adopting a Tikhonov regularisation for the non-linear equality constraint, Problem (5) is transformed into
$$\min_{\substack{\mathcal{G}, \{U_n\}, \ \mathcal{X}(\Omega) = \mathcal{X}_0(\Omega) \\ U_n \in \mathrm{St}(I_n, R_n)}} \mathcal{L}(\mathcal{G}, \{U_n\}, \mathcal{X}), \quad (6)$$
where $\beta$ is a parameter balancing the fitting error and the sparsity of the core tensor. Problem (6) is a non-linear and non-convex tensor optimization problem, and an alternating direction method (ADM) [36,37] is applied to handle it, updating $\mathcal{G}$, the $U_n$'s, and $\mathcal{X}$ in turn so that each step minimises $\mathcal{L}(\mathcal{G}^{t+1}, \{U_n^{t+1}\}, \mathcal{X})$ over one block with the others fixed (7). Following the majorisation scheme for the log-sum function in [38], a surrogate function satisfying
$$Q(\mathcal{G}, \{U_n\}_{n=1}^4, \mathcal{X} \,|\, \mathcal{G}^t) \ge \mathcal{L}(\mathcal{G}, \{U_n\}_{n=1}^4, \mathcal{X}) \quad (8)$$
is chosen, where $\mathcal{D}^t$ is a tensor of the same size as $\mathcal{G}$ whose $(i_1, i_2, \ldots, i_4)$th entry is given by the majorisation weight in (9). Thus, for the current iterate $(\mathcal{G}^t, \{U_n^t\}, \mathcal{X}^t)$, the $\mathcal{G}$-part is updated by minimising the surrogate (10). Let $g \triangleq \mathrm{vec}(\mathcal{G})$, $D \triangleq \mathrm{diag}(\mathrm{vec}(\mathcal{D}^t))$, $x \triangleq \mathrm{vec}(\mathcal{X})$, and $P \triangleq \otimes_n U_n$. The above optimization can be represented in the form
$$\min_g \ \frac{1}{2}\|x - Pg\|_2^2 + \beta\, g^T D g. \quad (11)$$
Obviously, Problem (11) admits a unique optimal solution $g^*$ with
$$g^* = (P^T P + 2\beta D)^{-1} P^T x. \quad (12)$$
Thus, the new update is $\mathcal{G}^{t+1} = \mathrm{ten}(g^*)$ (13), where $\mathrm{ten}(\cdot)$ is the linear map that transforms a vector back into a tensor of the appropriate size.
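As an illustration only: if the majorisation weight takes the usual log-sum form $d_i = 1/((g_i^t)^2 + \gamma)$ — an assumption on my part, since the exact entry formula is not reproduced in this excerpt — and the Kronecker factor $P$ has orthonormal columns (so $P^T P = I$), the core update decouples entrywise:

```python
import numpy as np

def update_core(X, Us, G_prev, beta=0.5, gamma=1e-3):
    """Entrywise majorised core update, ASSUMING d_i = 1/((g_i^t)^2 + gamma)
    and orthonormal factors so that P^T P = I."""
    # P^T x computed tensor-wise as X x_1 U1^T ... x_N UN^T
    Pt_x = X
    for n, U in enumerate(Us):
        Pt_x = np.moveaxis(np.tensordot(U.T, np.moveaxis(Pt_x, n, 0), axes=1), 0, n)
    d = 1.0 / (G_prev ** 2 + gamma)        # assumed log-sum majorisation weights
    return Pt_x / (1.0 + 2.0 * beta * d)   # (P^T P + 2*beta*D)^{-1} P^T x, entrywise

X = np.random.rand(6, 5, 4)
Us = [np.linalg.qr(np.random.rand(s, 3))[0] for s in X.shape]
G_new = update_core(X, Us, np.random.rand(3, 3, 3))
assert G_new.shape == (3, 3, 3)
```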

| Optimization of U n 's
Let $\{n_1, n_2, n_3, n_4\}$ be any given permutation of $\{1, 2, 3, 4\}$. Then the $U_{n_i}$'s ($i = 1, 2, 3, 4$) can be updated one at a time by minimising $\mathcal{L}$ with the other blocks fixed (14). By utilising the properties of the mode-$n$ unfolding, (14) can be rewritten as the linear trace maximisation
$$\max_{U_{n_i} \in \mathrm{St}(I_{n_i}, R_{n_i})} \mathrm{Tr}(U_{n_i}^T A_{n_i}), \quad (16)$$
where $A_{n_i}$ collects the terms that are fixed in this subproblem. By employing von Neumann's trace inequality [39], (16) can be solved easily by the formula
$$U_{n_i}^* = B_{n_i} C_{n_i}^T, \quad (17)$$
where $B_{n_i}$ and $C_{n_i}$ are the left and right singular matrices of the SVD of $A_{n_i}$, respectively.
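This closed-form factor update is the classical orthogonal Procrustes solution. A sketch:

```python
import numpy as np

def procrustes_update(A):
    """argmax_{U : U^T U = I} Tr(U^T A) = B @ C^T for the SVD A = B S C^T
    (a consequence of von Neumann's trace inequality)."""
    B, _, Ct = np.linalg.svd(A, full_matrices=False)
    return B @ Ct

A = np.random.rand(10, 4)
U = procrustes_update(A)
Q, _ = np.linalg.qr(np.random.rand(10, 4))   # any other Stiefel point
assert np.allclose(U.T @ U, np.eye(4))       # U stays on the Stiefel manifold
assert np.trace(U.T @ A) >= np.trace(Q.T @ A) - 1e-9  # optimality
```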

| Optimization of X
Given $\{U_n\}_{n=1}^4$ and $\mathcal{G}$, the update of $\mathcal{X}$ is obtained by solving the remaining subproblem in $\mathcal{X}$ under the constraint $\mathcal{X}(\Omega) = \mathcal{X}_0(\Omega)$ (18). By the projection property, the unique optimal solution $\mathcal{X}^*$ is achieved easily as
$$\mathcal{X}^*(\Omega) = \mathcal{X}_0(\Omega), \qquad \mathcal{X}^*(\bar{\Omega}) = \left[\mathcal{G} \times_1 U_1 \times_2 U_2 \times_3 U_3 \times_4 U_4\right](\bar{\Omega}), \quad (19)$$
where $\bar{\Omega}$ is the complement set of $\Omega$. Thus, Algorithm 1 summarises our proposed algorithm based on the alternating direction method framework.
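The $\mathcal{X}$-update is a simple projection: observed entries are copied from $\mathcal{X}_0$ and the rest are filled from the current reconstruction. A sketch, with `mask` a boolean array encoding $\Omega$:

```python
import numpy as np

def update_X(X0, mask, G, Us):
    """X*(Omega) = X0(Omega); the complement is filled from the Tucker
    reconstruction G x_1 U1 ... x_N UN. `mask` is the boolean Omega."""
    recon = G
    for n, U in enumerate(Us):
        recon = np.moveaxis(np.tensordot(U, np.moveaxis(recon, n, 0), axes=1), 0, n)
    return np.where(mask, X0, recon)

G = np.random.rand(2, 2, 2)
Us = [np.linalg.qr(np.random.rand(4, 2))[0] for _ in range(3)]
X0 = np.random.rand(4, 4, 4)
mask = np.random.rand(4, 4, 4) > 0.5
X = update_X(X0, mask, G, Us)
assert np.allclose(X[mask], X0[mask])   # observed entries are untouched
```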

Input: the tensor $\mathcal{X}_0 \in \mathbb{R}^{I_1 \times I_2 \times I_3 \times I_4}$; parameters $\beta$, $\gamma$, $\theta$, $\zeta$, $t_{\max}$.
Output: factor matrices $\{U_n\}_{n=1}^4$.
Step 0 (Initialisation): choose $\{U_n^0\}_{n=1}^4$, $\mathcal{G}^0$, $\mathcal{X}^0$, and set $t = 0$;
Step 1: update $\mathcal{G}^{t+1}$ by (12) and (13);
Step 2: update $\{U_n^{t+1}\}_{n=1}^4$ by (17), using a rank reduction strategy to remove insignificant columns of $\{U_n\}_{n=1}^4$ and the corresponding rows of each mode unfolding of $\mathcal{G}$ with a given threshold;
Step 3: update $\mathcal{X}^{t+1}$ by (19); if the relative change falls below the prescribed tolerance $\zeta$ or $t$ reaches $t_{\max}$, stop; otherwise set $t = t + 1$ and return to Step 1.
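The rank reduction step can be sketched as dropping core slices of negligible energy together with the matching factor columns; the specific threshold rule below is an assumption, as the text only says "a given threshold":

```python
import numpy as np

def rank_reduce(G, Us, tol=1e-3):
    """Drop near-zero-energy slices of the core and the matching factor
    columns, mode by mode (the threshold rule is an assumption)."""
    for n in range(G.ndim):
        rows = np.moveaxis(G, n, 0).reshape(G.shape[n], -1)
        energy = np.sqrt((rows ** 2).sum(axis=1))
        keep = energy > tol * energy.max()
        G = np.compress(keep, G, axis=n)   # shrink the core along mode n
        Us[n] = Us[n][:, keep]             # and the matching factor columns
    return G, Us

G = np.ones((4, 3, 2))
G[3] = 0.0                                 # one negligible mode-1 slice
Us = [np.linalg.qr(np.random.rand(6, r))[0] for r in G.shape]
G2, Us2 = rank_reduce(G, Us)
assert G2.shape == (3, 3, 2) and Us2[0].shape == (6, 3)
```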

| Algorithm complexity
The main computational cost at each iteration of our proposed approach lies in the updates of $\mathcal{G}$ and $\{U_{n_i}\}$. The computational complexity for updating $\mathcal{G}$ via (12) is $O\big(\sum_{n=1}^4 \prod_{k=1}^n R_k \prod_{j=n}^4 I_j\big)$, and it is mainly reflected in the computation of $P^T x$ in (12). Meanwhile, the major complexity for updating $U_{n_i}$ in (17) comes from the two terms $A_{n_i}$ and $\Phi_{n_i}$. Therefore, the overall computational complexity at each iteration requires $O\big(\sum_{n=1}^4 \prod_{k=1}^n R_k \prod_{j=n}^4 I_j\big)$ flops, which changes naturally with the core tensor size.

| Convergence analysis
For the convergence analysis, the convergence of the surrogate function $Q(\mathcal{G}^{t+1}, \{U_n^{t+1}\}_{n=1}^4, \mathcal{X}^{t+1} \,|\, \mathcal{G}^t)$ has been proved. Thus, we further verify that the objective value $\mathcal{L}(\mathcal{G}^{t+1}, \{U_n^{t+1}\}, \mathcal{X}^{t+1})$ is non-increasing in $t$ via
$$\mathcal{L}(\mathcal{G}^{t+1}, \{U_n^{t+1}\}, \mathcal{X}^{t+1}) \overset{(a)}{\le} Q(\mathcal{G}^{t+1}, \{U_n^{t+1}\}, \mathcal{X}^{t+1} \,|\, \mathcal{G}^t) \overset{(b)}{\le} Q(\mathcal{G}^t, \{U_n^t\}, \mathcal{X}^t \,|\, \mathcal{G}^t) = \mathcal{L}(\mathcal{G}^t, \{U_n^t\}, \mathcal{X}^t),$$
where the first inequality (a) comes from (8) and the fact that $Q(\mathcal{G}, \{U_n\}_{n=1}^4, \mathcal{X} \,|\, \mathcal{G}^t) - \mathcal{L}(\mathcal{G}, \{U_n\}_{n=1}^4, \mathcal{X})$ attains its minimum at $\mathcal{G} = \mathcal{G}^t$, and the second inequality (b) follows from (10).

| EXPERIMENT RESULTS
In this section, numerical experiments are conducted on two well-known facial expression databases, BU-3DFE [40] and Bosphorus [41], to verify the effectiveness of OTDFPFER. Details are given mainly for the BU-3DFE database.

| Implementation details
This subsection mainly discusses three parts: the two commonly used databases, the setting of parameters in Algorithm 1, and the method of classification prediction.

| Databases
BU-3DFE database. It contains 3D face scans and corresponding texture images of 100 subjects, of whom 44 are male and 56 female. The subjects range in age from 18 to 70 and come from a variety of racial/ethnic ancestries, including Black, Hispanic/Latino, Indian, Middle-East Asian, East-Asian, and White. Each subject performs seven expressions: the neutral expression plus six prototypic expressions (fear, happiness, sadness, anger, disgust, and surprise) at four levels of intensity (see Figure 2). Hence, there are 25 three-dimensional face scans for each subject, resulting in a total of 2500 face scans.
Bosphorus database. It includes 105 subjects, with 4666 pairs of 3D face scans and 2D face images under various poses, action units, expressions, and occlusion conditions. The majority of the subjects, 60 males and 45 females, are Caucasian, aged between 25 and 35. It is worth mentioning that only 65 subjects perform the six basic facial expressions (without intensity information), leading to a total of 390 2D and 3D face pairs.

| Algorithm Initialisation
In this article, nine kinds of effective features of size 128 × 128 are extracted and then processed by the LBP (Local Binary Pattern) descriptor [42], which is popular in FER, so $M$ is set to 9. The extracted features are the following: the geometry map $I_g$; the normal maps of three directions $I_n^x$, $I_n^y$ and $I_n^z$; the mean curvature map $I_{mc}$ and the curvature map $I_c$; and the 2D texture maps of three channels $I_t^r$, $I_t^g$ and $I_t^b$. To alleviate the sensitivity of the algorithm parameters, $\theta$ is set adaptively according to the gradient of $U_4$ in (6), where $w_1 = 5\mathrm{e}3$. $\beta$ relies on the value of the facial expression data and the data missing ratio; to obtain a stable performance, $\beta$ is set in the range [0.1, 1]. The factor matrices $\{U_n^0\}_{n=1}^4$ are initialised by the higher-order singular value decomposition (HOSVD) of $\mathcal{X}_0$, and the core tensor $\mathcal{G}^0$ is then obtained by $\mathcal{X}_0 \prod_{n=1}^4 \times_n (U_n^0)^T$. In this article, the maximal iteration number $t_{\max}$ and the accuracy parameter $\zeta$ of Algorithm 1 are set to 100 and 1e-4, respectively. To retain strong interactions between the core tensor $\mathcal{G}$ and the factor matrices $\{U_n\}_{n=1}^4$, a rank reduction strategy (RRS) is adopted [26] to remove insignificant rows of each mode unfolding of $\mathcal{G}$ and the corresponding columns of $\{U_n\}_{n=1}^4$; the RRS parameter is set similarly to that in [26].
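For illustration, a minimal 8-neighbour, radius-1 LBP code map (a generic textbook variant, not necessarily the exact descriptor of [42]) can be computed as:

```python
import numpy as np

def lbp_8_1(img):
    """Minimal 8-neighbour, radius-1 LBP code map (no interpolation)."""
    c = img[1:-1, 1:-1]                       # centre pixels
    H, W = img.shape
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros(c.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(shifts):
        nb = img[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]  # shifted neighbours
        code |= ((nb >= c).astype(np.uint8) << bit)     # one bit per neighbour
    return code

img = np.arange(25, dtype=float).reshape(5, 5)
assert lbp_8_1(img).shape == (3, 3)   # borders are trimmed
```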

| Classification prediction
After implementing Algorithm 1 according to Setup I or Setup II (detailed in Section 4.2), the estimated factor matrices $\{U_n\}_{n=1}^4$ are obtained. Two matrices $X_{\mathrm{Training}}$ and $X_{\mathrm{Testing}}$ are then formed by the mode-4 unfolding of $\mathcal{X}_{\mathrm{Training}} \prod_{n=1}^3 \times_n (U_n)^T$ and $\mathcal{X}_{\mathrm{Testing}} \prod_{n=1}^3 \times_n (U_n)^T$, respectively. These two generated matrices are passed, together with the labels of the corresponding samples, to the classifier for expression classification prediction.
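This projection step — project modes 1–3 with the transposed factors, then unfold along mode 4 so each row is one sample's feature vector — can be sketched as follows (the sizes are hypothetical, scaled down from 128×128×9×N):

```python
import numpy as np

def project_features(X4d, Us):
    """Project modes 1-3 with U_n^T, then mode-4 unfold: one row per sample."""
    Z = X4d
    for n in range(3):                      # mode 4 indexes samples, so it is skipped
        Z = np.moveaxis(np.tensordot(Us[n].T, np.moveaxis(Z, n, 0), axes=1), 0, n)
    return np.moveaxis(Z, 3, 0).reshape(Z.shape[3], -1)

# Hypothetical sizes: 16x16 maps, 9 features, 10 samples, ranks (4, 4, 3)
X = np.random.rand(16, 16, 9, 10)
Us = [np.linalg.qr(np.random.rand(s, r))[0]
      for s, r in zip((16, 16, 9), (4, 4, 3))]
F = project_features(X, Us)
print(F.shape)  # → (10, 48): one 4*4*3 = 48-dimensional row per sample
```

The rows of these matrices, with their labels, would then be fed to a multi-class SVM classifier, which is outside this sketch.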

| Experiments on the BU-3DFE database
Protocol in Experiments. To make a fair comparison with advanced methods, two experimental protocols are mainly used to evaluate the FER methods: Setup I chooses a fixed set of 60 subjects throughout the experiments, while Setup II randomly selects 60 subjects in each cycle of the experiment. Only expressions of 3-level and 4-level intensities are considered for expression prediction in both protocols. Setups I and II adopt the 10-fold cross-validation strategy, in which the 60 subjects involved are divided into 10 subsets; nine subsets are used for training and the remaining one for testing in each round. The experiments of Setups I and II are repeated 1000 times, and the average recognition accuracies are reported in the experimental analysis below. In all experiments, we adopt a multi-class SVM to classify and predict facial expressions.
Results on the BU-3DFE database: Table 1 shows the average confusion matrices obtained with Setups I and II, respectively. From this table, it is not difficult to find that surprise and happiness gain higher recognition accuracies because they deform the face greatly, while sadness and fear achieve relatively lower recognition accuracies. Meanwhile, it is observed that fear and disgust can be confused with any other expression, and similar phenomena can be seen in [22,[24][25][26]28].
To verify the effectiveness of each feature, nine experiments are carried out with one feature excluded at a time; the comparisons of recognition accuracies on the BU-3DFE database with Setup I are reported in Table 2. From this table, it is observed that the differences lie in the range [−3.50, −1.97] when comparing the nine results in Table 2 with the accuracy of 83.75% in Table 1, which indicates a significant difference in discrimination among the nine features. Meanwhile, Table 2 also shows the large complementarity among different features, which validates that none of the nine features can be excluded.
In addition, some experiments validate the effectiveness of the weight strategy for samples using the Cosine metric. As introduced in Section 3.2.3, the weight matrix $W$ is symmetric and needs to be constructed. Thus, three strategies are considered and compared: i) the heat kernel weight metric, $w_{ij} = \exp(-\|\mathcal{X}^{(i)} - \mathcal{X}^{(j)}\|^2 / \delta)$, where $\|\mathcal{X}^{(i)} - \mathcal{X}^{(j)}\|$ is the distance between $\mathcal{X}^{(i)}$ and $\mathcal{X}^{(j)}$; it is worth noting that the parameter $\delta$ is difficult to adjust and is set to 3000; ii) the binary weight metric, in which all components of the weight matrix $W$ are either 0 or 1; iii) the Cosine weight metric, $w_{ij} = \langle \mathcal{X}^{(i)}, \mathcal{X}^{(j)} \rangle / (\|\mathcal{X}^{(i)}\| \cdot \|\mathcal{X}^{(j)}\|)$. The comparisons with the different weight metrics on the BU-3DFE database using Setup I are shown in Table 3. As can be seen from this table, the Cosine weight metric gives the best result, while the heat kernel weight metric gives the worst.
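The three candidate weight functions can be written directly from their formulas ($\delta$ = 3000 as in the text; inputs are flattened sample vectors):

```python
import numpy as np

def heat_kernel_w(xi, xj, delta=3000.0):
    """Strategy i): heat kernel weight (delta fixed to 3000 as in the text)."""
    return float(np.exp(-np.sum((xi - xj) ** 2) / delta))

def binary_w(xi, xj):
    """Strategy ii): binary weight, 1 for every selected neighbour pair."""
    return 1.0

def cosine_w(xi, xj):
    """Strategy iii): Cosine weight."""
    return float(xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj)))

x = np.ones(4)
assert binary_w(x, x) == 1.0
assert abs(cosine_w(x, x) - 1.0) < 1e-12      # identical vectors -> weight 1
assert abs(heat_kernel_w(x, x) - 1.0) < 1e-12
```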
Thus, the Cosine weight metric is more competitive compared with other strategies.

| Comparison with other tensor methods
Four tensor methods WTucker [43], APG_NTDC [44], IRTD [38], and KBR_TC [45] are compared with our proposed approach (OTDFPFER). The WTucker proposes a low-rank tensor completion on the basis of the predefined multilinear rank, and the APG_NTDC proposes an alternating proximal gradient (APG) method via sparse and non-negativity constraints on the core tensor involved and the factor matrices involved by decomposing a tensor; the IRTD decomposes the incomplete tensor into a core tensor with structured sparsity and factor matrices with a Frobenius norm, and the KBR_TC proposes a tensor sparsity measure in which a tensor is represented by the number of rank-1 Kronecker bases. In order to accelerate the convergence process, IRTD adopts the rank reduction strategy, while the KBR_TC utilises the oppositive scheme that lets the rank to increase [46]. It is worth mentioning that the multilinear rank of the APG_NTDC should be predefined, and the KBR_TC needs to estimate the smaller multilinear rank, and the WTucker requires to overestimate the multilinear rank.
To achieve the best performances, all parameters of the four compared methods are set carefully according to the recommendations in the relevant literature. For IRTD, β, δ, λ1, λ2, and γ are set to (2 − δ)/L(f), 0.1, 0.1, 1, and 1.25e-3, respectively; for APG_NTDC, (R1, R2, R3, R4) is set to (15, 11, 6, 12), and λc and {λn} (n = 1, …, 4) are all set to 5; for WTucker, the multilinear rank is predefined as (30, 22, 9, 24); and for KBR_TC, μ, λ, v, c, and ρ are set to 250, 10, 0.1, 1e−3, and 1.05, respectively, while the multilinear rank is set to (12, 12, 9, 12). The comparison covers the following four aspects: the recognition accuracy (RA), the convergence behaviour illustrated by the relative error (RE), the changes of the multilinear rank reflected by the variations of the factor matrices, and the tensor recovery performance shown by face reconstruction (FR). Figure 3(a) shows the comparison with the four methods in terms of average recognition accuracy (%) using Setup I. From this figure, it is easily seen that the RA of OTDFPFER is better than the others, while APG_NTDC obtains the worst performance. This indicates that OTDFPFER is more effective because it achieves more discriminative low-dimensional features for multimodal FER.
The convergence behaviour is verified using the relative error RE = $\|\mathcal{X}^t - \mathcal{X}^{t-1}\|_F / \|\mathcal{X}_0\|_F$. Figure 3(b) shows that RE converges within a limited number of iterations. From this figure, it is found that IRTD and the proposed OTDFPFER, which use the rank reduction strategy, converge faster than the other methods, which use a predefined rank or a rank-increasing strategy. From the view of recognition accuracy, the proposed OTDFPFER obtains the highest result among the compared algorithms; at the same time, OTDFPFER is superior to APG_NTDC, KBR_TC and WTucker in terms of convergence behaviour. Therefore, the proposed OTDFPFER is competitive. Because the rank variations of the factor matrices reflect changes of the multilinear rank of the given tensor, comparisons in terms of the rank variations of the four factor matrices are reported in Figure 4. It can easily be observed from this figure that $U_3$, which indicates the feature number, changes least. Apart from KBR_TC, APG_NTDC and WTucker, which adopt the rank-increasing or predefined-rank strategy, it can also be seen that: i) $U_4$, which represents the sample number, changes the fastest, clearly verifying that there exist high similarities among samples; ii) $U_1$ and $U_2$, which reflect the feature size, change more slowly than $U_4$; iii) after a limited number of iterations, the ranks of the four factor matrices no longer change. Compared with the four tensor methods in terms of rank changes of the factor matrices, OTDFPFER is relatively faster due to the rank reduction step in Algorithm 1, which also verifies the effectiveness of the rank reduction strategy. Among the ranks of the four factor matrices, the rank of $U_4$ of our proposed method varies faster than those of the other three factor matrices because the high similarities are characterised jointly by the core tensor with structured sparsity and the factor priors.
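The stopping measure used above is just a normalised Frobenius difference between consecutive iterates:

```python
import numpy as np

def relative_error(X_t, X_prev, X0):
    """RE = ||X^t - X^{t-1}||_F / ||X0||_F, the convergence measure of Fig. 3(b)."""
    return np.linalg.norm(X_t - X_prev) / np.linalg.norm(X0)

X0 = np.ones((2, 3))
assert relative_error(X0, X0, X0) == 0.0          # no change -> RE is 0
assert relative_error(2 * X0, X0, X0) == 1.0      # change equals ||X0||_F
```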
To illustrate the tensor recovery performance more vividly through face reconstruction, we compare OTDFPFER with IRTD and APG_NTDC on the BU-3DFE database using Setup I. The comparisons in terms of face reconstruction are presented in Figure 5, including the 2D original feature maps, the LBP features, and the LBP features at a 70% random sampling ratio (SR), illustrated with subject F0007 showing the angry expression at 4-level intensity on the BU-3DFE database. The reconstruction results show that OTDFPFER achieves the best results, while APG_NTDC is the worst among the three algorithms. Meanwhile, KBR_TC and WTucker could not successfully complete the face reconstruction task because they generate more negative low-dimensional features when projecting the factor matrices onto the given 4D tensor. The reason why the LBP features of OTDFPFER can be reconstructed depends on the 2D original feature maps and the LBP features (i.e., the 2D original feature maps processed by the LBP descriptor). The former generate indeterminate NaN values when mapping 3D face models onto 2D planes; these NaN values are set to zero and mainly appear on the outlines of the 2D face images, which contain some key features that are important for 2D+3D FER and difficult to recover. At the same time, the latter produce some new zero values in the face area. Therefore, the missing rate of the LBP features directly affects the results of face reconstruction at different SRs. Thus, our proposed approach can extract more useful information from the generated 4D tensor. To further indicate the effectiveness of OTDFPFER, comparisons in four aspects, namely data modality, features, classifiers, and accuracies under different protocols, are shown in Table 4 against some state-of-the-art methods.
[12,17,47] are three earlier studies on 3D facial expression recognition, and their experiments were repeated only 10 times, so the stability of their performance cannot be guaranteed. Under such an experimental protocol, the average recognition accuracy of our proposed method (OTDFPFER) reaches 95.49%, which is better than those in refs. [7,9,12,17,26,47]. As reproduced by Gong et al. [10], the accuracies reported in [12,17,47] degrade greatly when Setups I and II, regarded as the stable experimental protocols, are adopted. Compared with the state-of-the-art methods under Setups I and II, our proposed method (OTDFPFER) achieves better recognition accuracies of 83.75% and 81.63%, respectively, with stable performance.
As we know, some algorithms in Table 5 outperform our proposed approach under Setups I and II, such as [18,21,52]. Although these state-of-the-art algorithms adopt feature vectorisation and concatenation, they obtain higher recognition accuracies. The reasons lie in the following aspects: i) facial landmark localisation is utilised, as in [52]; ii) larger sample sets are employed and highly complex networks are constructed, as in [18,21]. Compared with the state-of-the-art algorithms in Table 5, our proposed approach needs no facial landmark localisation, and requires fewer samples and less complexity. Although there are some gaps between our proposed approach and the state-of-the-art algorithms in Table 5, improving the recognition rate is a direction of our future efforts.

| Experiments on the Bosphorus database
Protocol in Experiments: Setup III also adopts the 10-fold cross-validation scheme. This protocol is similar to Setup II on the BU-3DFE database, except that the number of people is changed from 100 to 65, because only 65 persons display all six expressions in the Bosphorus database. The details can be found in Section 4.2.
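A subject-independent 10-fold split of this kind can be sketched as follows (the function name and the fixed seed are illustrative assumptions):

```python
import numpy as np

def subject_folds(n_subjects=65, n_folds=10, seed=0):
    """Partition subject indices into folds so that no person appears in
    both the training and the test set of any split."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_subjects), n_folds)

folds = subject_folds()  # 10 disjoint groups covering all 65 subjects
```

In each round, one fold serves as the test set and the remaining nine as the training set.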
Experimental Results: The average confusion matrix is shown in Table 6, from which it can be seen that disgust is easily confused with the other expressions. It can also be observed that surprise is easily confused with fear, and vice versa. A similar confusion occurs between the sadness and anger expressions, and similar phenomena are also reported in [21,26]. In addition, nine experiments are implemented to verify the validity of each feature by excluding one feature at a time. Table 7 shows the comparisons on the Bosphorus database using Setup III. Compared with the recognition result of 75.97% in Table 6, the accuracy differences, which lie in the range [-2.55, -0.34], indicate the strong complementarity among the different features and validate that all nine kinds of features are indispensable. In short, expression recognition on the Bosphorus database considered in this article is very difficult.
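An average confusion matrix over cross-validation folds, of the kind reported in Table 6, can be computed along these lines (a generic sketch; the helper names are hypothetical):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=6):
    """Row-normalised confusion matrix: entry (i, j) is the fraction of
    class-i samples predicted as class j."""
    C = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1.0
    return C / np.maximum(C.sum(axis=1, keepdims=True), 1.0)

def average_confusion(fold_results, n_classes=6):
    """Mean of the per-fold confusion matrices, e.g. over 10 folds."""
    return np.mean([confusion_matrix(y_t, y_p, n_classes)
                    for y_t, y_p in fold_results], axis=0)
```

Off-diagonal mass in a row then reveals which expression a given class is confused with, as with surprise and fear above.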

| The behaviour on the Bosphorus database
The four comparison algorithms used on the BU-3DFE database are also adopted on the Bosphorus database, with all settings kept the same. As derived in Section 3, Algorithm 1 generates a decreasing sequence of the values of the objective function in (6). To further illustrate the behaviour of our proposed approach on the Bosphorus database, the recognition accuracy, the changes of the relative error and of the ranks of the factor matrices (i.e., rank(U_n), n = 1, …, 4), and the face reconstruction are shown in Figures 6(a), 6(b), 7, and 8, respectively. The analysis of the comparison results is similar to that on the BU-3DFE database.
From Figure 6(a), it is clearly observed that our proposed approach achieves a better result than the other algorithms, while APG_NTDC obtains the worst. At the same time, the accuracy of our proposed method is only 75.97%, which indicates that the facial expressions in the Bosphorus database are not easily recognised.
As one can see in Figure 6(b), the relative error is greatly reduced after 2 iterations and meets the tolerance of the algorithm within 16 iterations. In Figure 7, the rank of U_3 changes little, which indicates that the nine selected features exhibit little redundancy, as on the BU-3DFE database. Meanwhile, the ranks of U_1, U_2 and U_4 all drop significantly in the first two iterations, for the same reason as before: the rank-reduction step in Algorithm 1 is used here. Moreover, the fastest reduction, in the rank of U_4, again indicates that the high similarities among samples are characterised jointly by the structured sparsity of the core tensor and the factor priors. In Figure 8, the face reconstruction comparisons at a 70% SR are shown for subject bs017 with the happiness expression, which shows that our proposed approach achieves a better face reconstruction result. In brief, our proposed OTDFPFER approach is more competitive than the other four tensor methods in terms of recognition accuracy.
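The rank-reduction idea can be illustrated on a single mode: truncate the mode-n unfolding of the core tensor where its singular values are negligible, and absorb the rotation into the corresponding factor matrix. The sketch below is a simplified stand-in for the rank-reduction step of Algorithm 1, not the exact implementation:

```python
import numpy as np

def reduce_mode_rank(G, U, mode, tol=1e-2):
    """Shrink mode `mode` of core tensor G (and factor matrix U) by
    dropping directions whose singular values fall below tol * s_max."""
    # Mode-n unfolding of the core tensor.
    G_n = np.moveaxis(G, mode, 0).reshape(G.shape[mode], -1)
    Q, s, _ = np.linalg.svd(G_n, full_matrices=False)
    r = max(1, int(np.sum(s > tol * s[0])))  # reduced mode-n rank
    U_new = U @ Q[:, :r]                     # absorb the rotation into the factor
    G_new = np.moveaxis(
        (Q[:, :r].T @ G_n).reshape((r,) + G.shape[:mode] + G.shape[mode + 1:]),
        0, mode)
    return G_new, U_new, r
```

Because the dropped directions carry negligible energy, the product of the shrunken factor and core reproduces the original mode-n unfolding up to numerical precision.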

| Comparison with other methods
Three state-of-the-art methods (i.e., [14,26,48]) are compared with our proposed OTDFPFER approach using Setup III, in order to validate the efficiency of OTDFPFER on the Bosphorus database. Table 8 reports the comparison results in terms of data, features, classifiers, and accuracies, in which [14] achieves the lowest accuracy. From Table 8, it is clearly observed that the accuracies of OTDFPFER and the method in [26] are very close (75.97% vs. 75.93%). It can be seen that expressions are difficult to recognise on the Bosphorus database, which lacks expression intensity labels. In a word, our proposed approach obtains the best result among the four approaches on the Bosphorus database using Setup III.

| CONCLUSIONS
In this article, a 4D tensor model based on the fusion of nine features has been built from 3D face scans and 2D face images in order to take into account the structural information and the correlations between different modalities simultaneously. Then, a tensor dimensionality reduction approach has been proposed for 2D+3D FER via orthogonal Tucker decomposition using factor priors (OTDFPFER), and the resulting tensor completion problem is solved efficiently by an alternating direction method. The numerical experiments on the BU-3DFE and Bosphorus databases verify the effectiveness of the proposed method, which enhances the tensor recovery capability and raises the expression recognition accuracy, with promising recognition results. As the feature extraction and the corresponding tensor modelling and computation greatly affect the speed and accuracy of the final recognition, more effort will be devoted to extracting effective features with appropriate feature descriptors and to designing algorithms for large-scale tensor optimisation problems.