Multi-factor joint normalisation for face recognition in the wild

Face recognition has become very challenging in unconstrained conditions due to strong intra-personal variations, such as large pose changes. Face normalisation can help to resolve these problems and effectively improve face recognition performance in unconstrained conditions by converting non-frontal faces to frontal ones. However, there are other complex facial variations in addition to pose, such as illumination and expression, which also influence face recognition performance. The authors propose a generative adversarial network-based multi-factor joint normalisation network (MFJNN) to normalise multiple factors simultaneously. First, a multi-encoder generator and a feature fusion strategy are designed and implemented in the MFJNN to realise the joint normalisation of multiple factors in addition to pose. Second, a convolutional neural network-based (CNN-based) network is applied in the MFJNN, which allows the MFJNN to simultaneously realise image synthesis and facial representation learning. Moreover, an identity perceptive loss is introduced based on the CNN-based network to produce reliable identity-preserving features of the input face images. The experimental results demonstrate that the proposed method can synthesise multi-factor normalisation results with identity preservation.


| INTRODUCTION
Face recognition is a typical biometric identification technology with extensive application prospects in fields such as public security, military security, financial security and human-computer interaction. Currently, face recognition has achieved outstanding results due to the development of deep learning technology and easy access to large-scale labelled datasets. However, in an unconstrained environment with extreme pose, illumination and expression variations, it is still very challenging to achieve robust face recognition. Face normalisation, which aims to synthesise a photorealistic frontal-view face image from one with arbitrary pose, can effectively address the problem of face recognition in unconstrained conditions. Current face normalisation methods can be divided into three categories: 2D-based methods, 3D-based methods and deep learning-based methods. 2D-based methods realise pose normalisation using 2D image matching or test-image encoding with a basis function, and 3D-based methods normally capture 3D face data or estimate 3D models from 2D images and match them to the 2D test images. Compared with the first two categories, deep learning is a newer way to realise face normalisation and has achieved outstanding normalisation performance [1][2][3][4] due to its highly nonlinear mapping capability. However, the above face normalisation methods have certain limitations. For example, 3D data capturing requires additional computational complexity and resources. Moreover, 3D-based approaches often lead to distortion and inauthenticity in the synthesised image. Although deep learning-related methods, especially algorithms based on the generative adversarial network (GAN) [5], have achieved impressive results in face normalisation, face normalisation in unconstrained conditions still faces a great challenge because there are many other complex factors besides pose that need to be normalised.
In the real world, illumination, pose, expression and other factors are always intertwined with each other, as shown in Figure 1. Hence, it is difficult to achieve real practicality by handling only one of these factors (such as pose).
Therefore, the authors aim to achieve the joint normalisation of the various factors that affect face recognition performance. To this end, they propose a multi-factor joint normalisation network (MFJNN), designed based on the GAN, for face normalisation. A multi-encoder generator and a fusion strategy are designed and implemented in the MFJNN to achieve multi-factor normalisation (MFN). In the training stage, each encoder takes images with one abnormal factor as its input and is mainly responsible for normalising that factor. For example, the expression encoder takes face images with abnormal expressions as its input and mainly normalises expressions. Using a weighted averaging feature fusion strategy, the features extracted from the encoders are fused into one feature representation and fed to the decoder to generate a synthesised virtual canonical face image.
To produce identity-preserving synthesised face images for face recognition, the authors introduce a multi-input-based identity perceptive loss in the proposed model. Moreover, we apply the source of the identity perceptive loss (a convolutional neural network-based (CNN-based) network) as a feature learning subnetwork, which allows the proposed MFJNN to simultaneously realise face normalisation and facial representation learning.
The authors' contributions are summarised as follows.
1) A MFJNN based on the GAN is proposed to synthesise frontal face images and learn pose-invariant representations for face recognition from face images under arbitrary conditions. Different from current GAN-based models, which usually use one encoder and one decoder in the generator to synthesise images, the proposed GAN uses multiple encoders to take in face images with different variations (e.g. pose and illumination) and fuses the outputs of these encoders into one decoder to achieve multi-factor joint normalisation. 2) Further, a multi-input-based identity perceptive loss is introduced to the proposed model to preserve the identity information of synthesised face images. Moreover, the CNN-based network used for computing the identity perceptive loss is also used for feature learning, which allows the proposed model to realise face normalisation and facial feature learning in one framework.

| RELATED WORK
Face recognition methods can be divided into two categories: hand-crafted methods and deep learning-based methods. Early hand-crafted methods directly extract hand-crafted features for face recognition. Typical methods include local binary patterns [6], histograms of oriented gradients [7] etc. In recent years, deep learning-based face recognition methods have achieved promising results [8][9][10]. The most popular deep learning network for face recognition is the deep convolutional neural network (DCNN). However, on datasets such as IJB-A [11], which has larger pose variation than the traditional LFW [12] dataset, the face recognition performance is still not up to the mark. Pose variation can be considered the most important and challenging problem in unconstrained face recognition, and face normalisation is an efficient way to overcome it. Early face normalisation algorithms can be divided into two categories: 3D-based and 2D-based normalisation methods. In the first category, a morphable model fits a 3D model to an input face using prior knowledge of human faces and image-based reconstruction. Li et al. generated one virtual pose for each probe face image using a series of 3D displacement fields sampled from a 3D face database, and matched this synthetic face to the gallery face [13]. Similarly, Asthana et al. used a pose-based active appearance model (AAM) to match a 3D model and a 2D image [14]. Compared with 3D-based approaches, 2D techniques do not need 3D prior information for pose normalisation. For example, a dynamic programming stereo matching method was adopted to calculate the similarity of two faces for 2D face recognition in [15]. The test face image was represented as a linear combination of training images, and the linear regression coefficients were used as features for 2D face recognition in [16].
Recently, deep neural networks have been used in face normalisation and have achieved significant performance. In [2], the authors used a DCNN to normalise the illumination and pose variations of face images, and extracted conventional features on the canonical face image after normalisation for face recognition. Experimental results on the CMU Multi-PIE [17] dataset showed that this method can achieve a higher recognition rate than methods that directly extract features. The works in [1,3] recovered face images with noises and occlusions by utilising the restricted Boltzmann machine (RBM) and a gated Markov random field, respectively, and achieved good performance on face verification and facial expression recognition, respectively. The work in [4] applied the stacked denoising auto-encoder (SDA) to recover occluded face images and then adopted a DCNN for face recognition. Experiments on occluded face images from the AR [18] dataset demonstrated that the SDA-based face image restoration method achieved a recognition rate comparable to the sparse-representation-based method.
It is worth mentioning that the GAN has attracted substantial attention in image synthesis due to its ability to synthesise photorealistic images with plausible high-frequency details. The state-of-the-art face normalisation methods are almost all based on the GAN [19][20][21][22][23][24]. For instance, the two-pathway GAN (TP-GAN) [19] proposed a two-pathway GAN architecture to simultaneously perceive global and local information. The face frontalisation GAN (FF-GAN) [20] incorporated a 3D face model into the GAN to maintain the visual quality under occlusion during frontal view synthesis. Zhao et al. [22] further extended the TP-GAN and proposed the pose invariant model (PIM), which introduces a domain adaptation strategy for face normalisation. A couple-agent pose-guided GAN [21] was proposed to synthesise both frontal and profile face images. Bao et al. [23] proposed a GAN-based framework for synthesising face images recombining different identities and attributes in open domains. The work in [25] proposed a face normalisation model to generate frontal, neutral-expression, photorealistic face images. While most prior works learn a representation that is mainly invariant to pose, our method can learn a representation that is also invariant to other variations such as illumination and expression. From a face image under an arbitrary condition, the MFJNN can synthesise a virtual frontal face image and learn a multi-factor-invariant representation while preserving face identity.

| APPROACH
Here, the authors propose a MFJNN to synthesise a photorealistic and identity-preserving frontal face and learn robust feature representations for face recognition from face images with multi-factor variations. An overview of the MFJNN is given in Section 3.1. In the MFJNN, the multi-encoder generator, which is introduced in Section 3.2, is the main structure used to achieve MFN. Finally, the loss functions of our method, including the identity-preserving loss, are detailed in Section 3.3.

| Multi-factor joint normalisation network
The general architecture of the MFJNN is shown in Figure 2. To jointly normalise multiple factors (including pose, illumination, expression, etc.), we design a multi-encoder generator G_θ : ℝ^(H×W×C) ↦ ℝ^(H×W×C) that has one decoder but multiple encoders for different factors, where H, W and C represent the height, the width and the number of channels of the input image of the generator, respectively, and θ denotes the network parameters of the generator. In the training stage, each encoder takes face images with one kind of factor change as its input and maps each input face to the feature space. Then, the extracted features from the encoders are fused by a weighted averaging fusion strategy in the feature fusion block. The decoder takes the fused feature as its input and generates the synthesised normal face image. In the testing stage, the image that needs to be normalised is taken as the input of every encoder of the multi-encoder generator and is multi-factor normalised by it.
To synthesise identity-preserving face images and simultaneously learn facial representations for face recognition in addition to image synthesis, we apply the Light CNN [26] to the synthesised face images. The Light CNN is used both as the source of the identity perceptive loss, making the synthesised image identity-preserving, and as a feature learning subnetwork that learns the pose-invariant representation from a synthesised face image for face recognition. There are three reasons to choose the Light CNN as our CNN architecture. First, the Light CNN has achieved good face recognition performance in many recognition-via-generation works; for example, it demonstrates better recognition performance than the Visual Geometry Group Face network (VGG-Face) [27] and the residual network (ResNet) [28] in [22,25], respectively. Second, many works use the Light CNN to compute the identity perceptive loss and have demonstrated good results [19,21,29]. Finally, the Light CNN has a simpler network architecture than other CNNs such as ResNet-50 or ResNet-100, which are the most common ResNet architectures.
The structure of the discriminator is shown in Figure 3. The discriminator, denoted as D_ϕ, where ϕ represents the network parameters of the discriminator, consists of a set of convolutional layers that execute down-sampling operations iteratively. The LReLU [30] is adopted as the activation function after each convolutional layer. The final layer is a fully connected layer that transforms the input feature into a one-dimensional output.
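As a sanity check on this down-sampling path, the spatial sizes produced by a stack of stride-2 convolutions with 4×4 filters can be traced with standard convolution arithmetic. The padding value of 1 is our assumption, chosen so that each layer exactly halves the spatial size, which is consistent with the 14×14×256 encoder output reported in the Figure 3 caption; channel counts are omitted:

```python
def conv_out(size, kernel=4, stride=2, pad=1):
    # Standard convolution output-size formula:
    # out = floor((in + 2*pad - kernel) / stride) + 1
    return (size + 2 * pad - kernel) // stride + 1

size = 224
trace = [size]
for _ in range(4):          # four stride-2 down-sampling layers
    size = conv_out(size)
    trace.append(size)

print(trace)                # [224, 112, 56, 28, 14]
assert trace[-1] == 14      # matches the 14x14x256 feature in Figure 3
```

With these settings, four down-sampling layers take a 224×224 input to a 14×14 feature map; the discriminator then applies its fully connected layer on top of this.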

| Multi-encoder generator
To normalise multiple factors in a unified framework, we design a multi-encoder generator that consists of multiple encoders, denoted as G_enc : ℝ^(H×W×C) ↦ ℝ^D, where D denotes the dimension of the output feature of an encoder, and one decoder, denoted as G_dec. In our experiments, we use three encoders corresponding to pose, illumination and expression normalisation; the proposed MFJNN is therefore considered a three-encoder architecture here for simplicity. The three encoders correspond to the input I = {I_p, I_i, I_e} ∈ ℝ^(H×W×C) and aim to learn the features G_enc(I_p), G_enc(I_i) and G_enc(I_e) from the face images I_p, I_i and I_e, respectively, where I_p, I_i and I_e denote face images with pose, illumination and expression variations, respectively. With the input {I_p, I_i, I_e}, the fused representation is the weighted average of the three representations:

F = ω_p G_enc(I_p) + ω_i G_enc(I_i) + ω_e G_enc(I_e)

The coefficients are set as ω_p = 0.5 and ω_i = ω_e = 0.25. The fused representation is then fed to the decoder to generate the normal face image Î with the same identity as all input images.

FIGURE 2 Visual illustration of the MFJNN. The generator contains multiple encoders and one decoder for multi-factor normalisation. The features extracted from the encoders are fused into one feature through a weighted averaging fusion strategy and fed to the decoder. The discriminator (D) distinguishes between the synthesised frontal views and the ground-truth frontal views. Images with some factor variance are taken as the input of one encoder for training. During testing, the test image is fed to all encoders. Losses are drawn with dashed lines, where L_adv is the adversarial loss that encourages the generator to synthesise photorealistic normalised face images, L_id is the identity perceptive loss for preserving identity information, L_sym is the symmetry loss to alleviate the self-occlusion problem in large-pose cases and L_p is the pixel-wise loss to facilitate image content consistency.

FIGURE 3 Structure of the discriminator D_ϕ and the encoder of the generator G_enc in the MFJNN. Each rectangle corresponds to one network module, and the feature map dimension of each layer is marked by dotted lines when the input image has a size of 224×224. Note that the last fully connected layer is only included in the discriminator architecture; the encoder of the generator directly outputs features with a size of 14×14×256.
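The weighted averaging fusion step can be sketched in a few lines of Python. The arrays below stand in for encoder outputs and the coefficient values follow the paper's setting (ω_p = 0.5, ω_i = ω_e = 0.25); this is a minimal illustration, not the authors' implementation:

```python
import numpy as np

def fuse_features(f_pose, f_illum, f_expr, w_p=0.5, w_i=0.25, w_e=0.25):
    """Weighted-average fusion of the three encoder features."""
    assert abs(w_p + w_i + w_e - 1.0) < 1e-9   # weights sum to one
    return w_p * f_pose + w_i * f_illum + w_e * f_expr

# Training: each encoder sees a face image with a different variation.
f_p = np.random.randn(14, 14, 256)
f_i = np.random.randn(14, 14, 256)
f_e = np.random.randn(14, 14, 256)
fused = fuse_features(f_p, f_i, f_e)
assert fused.shape == (14, 14, 256)

# Because the weights sum to one, fusing identical features returns
# them unchanged (relevant at test time, when all encoders receive
# the same image, if the encoders produced the same features).
f_test = np.random.randn(14, 14, 256)
assert np.allclose(fuse_features(f_test, f_test, f_test), f_test)
```

Since the weights form a convex combination, the fused feature stays in the same range as the encoder features, which keeps the decoder's input statistics stable.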
The multi-encoder generator is a fully convolutional network. The structure of the encoder is the same as that of the discriminator except for the last fully connected layer, as shown in Figure 3. The decoder is composed of a set of modules, each of which comprises a transposed convolution layer [31], a ReLU layer and a residual block [28], as shown in Figure 4. The transposed convolution layer with a 4×4 filter and a stride of 2 executes an up-sampling operation and doubles the spatial size of the input feature. The residual block is composed of a convolutional layer (4×4 filter, stride = 1), batch normalisation (BN) and a ReLU layer, cascaded with another convolutional layer (4×4 filter, stride = 1) and BN. Finally, a 1×1 convolution is applied to generate the output image.
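The doubling behaviour of the transposed convolution follows from the standard output-size formula. With a 4×4 filter, stride 2 and an assumed padding of 1 (the padding is not stated in the text), each module exactly doubles the spatial size, taking a 14×14 encoder feature map back to the 224×224 input resolution after four modules:

```python
def tconv_out(size, kernel=4, stride=2, pad=1):
    # Transposed-convolution output size (no output padding):
    # out = (in - 1) * stride - 2 * pad + kernel
    return (size - 1) * stride - 2 * pad + kernel

size = 14
trace = [size]
for _ in range(4):          # four up-sampling decoder modules
    size = tconv_out(size)
    trace.append(size)

print(trace)                # [14, 28, 56, 112, 224]
assert trace[-1] == 224     # back to the 224x224 input resolution
```

With kernel 4, stride 2 and padding 1 the formula reduces to out = 2·in, which is why each decoder module "doubles" its input.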

| Loss functions
(1) Adversarial loss: To generate a photorealistic normalised face image, we adopt the adversarial loss terms

L_adv−D = −E[log D_ϕ(I_gt)] − E[log(1 − D_ϕ(G_θ(I)))]
L_adv−G = −E[log D_ϕ(G_θ(I))]

where L_adv−D and L_adv−G represent the adversarial losses for the discriminator and the generator, respectively, and I_gt is the ground-truth frontal face image corresponding to the synthesised image.
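A reasonable reading of L_adv−D and L_adv−G is the standard non-saturating GAN objective; the sketch below evaluates that form on scalar discriminator probabilities. This is an illustration under that assumption, not the authors' training code:

```python
import numpy as np

def adv_losses(d_real, d_fake, eps=1e-8):
    """Non-saturating GAN losses on discriminator probabilities.

    d_real: D's outputs (in (0, 1)) on ground-truth frontal images.
    d_fake: D's outputs on synthesised images G(I).
    """
    l_adv_d = -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))
    l_adv_g = -np.mean(np.log(d_fake + eps))
    return l_adv_d, l_adv_g

# A near-perfect discriminator (real -> 1, fake -> 0) drives L_adv-D
# toward 0 while L_adv-G grows large, and vice versa.
l_d, l_g = adv_losses(np.array([0.99, 0.98]), np.array([0.02, 0.01]))
assert l_d < 0.1 and l_g > 3.0
```

This opposition between the two terms is what makes the alternating optimisation of G_θ and D_ϕ adversarial: each player's loss improves at the other's expense.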
(2) Multi-input-based identity perceptive loss: Preserving the identity of the frontal face image is the most important part of developing the recognition-via-generation framework. Our identity perceptive loss for preserving identity information is based on the activations of the last two layers of the Light CNN. Since our model has multiple inputs, we introduce an identity perceptive loss based on multiple inputs. In the situation with three inputs, we define our identity perceptive loss as

L_id = Σ_l (1 / (W_l H_l)) Σ_{m,n} ‖F^l_{m,n}(G_θ(I)) − F^l_{m,n}(I_gt)‖_2

where ‖·‖_2 denotes the vector 2-norm, W_l and H_l respectively denote the width and the height of the l-th layer, and F^l_{m,n} is the value at point (m, n) in the feature map of the l-th layer.

(3) Pixel-wise loss: For a fine-grained task like face normalisation, the details of the facial features pose great challenges for the synthesis task. Without the supervision of a pixel-wise loss, the synthesised images may contain distortions and lack authenticity. Therefore, we use a pixel-wise loss to maintain the consistency of the image content. Since the l2 loss tends to generate blurry outputs, we instead adopt the l1 loss, as in [19,25], as our pixel-wise loss to better preserve high-frequency signals.

(4) Symmetric loss: Symmetry is a normal property of a face image that can be used as prior knowledge to constrain the synthesised frontal face image. We use the symmetric loss as in [19], formulated as

L_sym = (1 / (W/2 × H)) Σ_{m=1}^{W/2} Σ_{n=1}^{H} |Î^Lap_{m,n} − Î^Lap_{W−(m−1),n}|

where Î^Lap is the Laplacian-transformed synthesised image. Face images in the real world are usually not strictly symmetric at the grey level; however, the pixel variance in a local region and the gradient at each point are consistent, so it is more reasonable to define the symmetry loss in Laplacian space. Thus, we first transform the face image to Laplacian space and then apply the symmetric loss on the transformed images.
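The two less standard terms, the identity perceptive loss and the Laplacian-space symmetry loss, can be sketched as follows. The exact normalisation and layer choices are our reconstruction from the symbols defined in the text, not the authors' code:

```python
import numpy as np

def identity_perceptive_loss(feats_syn, feats_gt):
    """Perceptual distance over selected feature layers.

    feats_syn / feats_gt: lists of (H_l, W_l, C_l) feature maps of the
    synthesised and ground-truth images (e.g. the last Light-CNN layers).
    """
    loss = 0.0
    for f_s, f_g in zip(feats_syn, feats_gt):
        h, w = f_s.shape[:2]
        # 2-norm over channels at each spatial point, averaged over W_l * H_l
        loss += np.sum(np.linalg.norm(f_s - f_g, axis=-1)) / (w * h)
    return loss

def laplacian(img):
    # 4-neighbour discrete Laplacian with zero padding.
    p = np.pad(img, 1)
    return p[1:-1, :-2] + p[1:-1, 2:] + p[:-2, 1:-1] + p[2:, 1:-1] - 4 * img

def symmetry_loss(img):
    """L1 distance between the Laplacian image and its horizontal flip."""
    lap = laplacian(img)
    return np.mean(np.abs(lap - lap[:, ::-1]))

# A horizontally symmetric image incurs zero symmetry loss.
sym = np.tile(np.array([1.0, 2.0, 2.0, 1.0]), (4, 1))
assert symmetry_loss(sym) == 0.0

# Identical feature maps incur zero identity loss.
f = [np.random.randn(7, 7, 256)]
assert identity_perceptive_loss(f, f) == 0.0
```

Working in Laplacian space means the loss compares local gradient structure rather than raw grey levels, which matches the motivation given above for why strict grey-level symmetry is too strong an assumption.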
(5) Overall loss: Consequently, the complete objective is defined as a weighted sum of the four individual loss functions:

L = L_adv−G + λ_1 L_id + λ_2 L_p + λ_3 L_sym

where λ_1, λ_2 and λ_3 are parameters controlling the relative weight of the objective terms. The MFJNN is optimised by alternately optimising G_θ and D_ϕ in each training iteration.

| EXPERIMENTAL RESULTS
Here, the authors conduct experiments to evaluate the effectiveness of MFN, the synthesised image quality and the face recognition performance of the proposed method. We first describe the datasets and implementation details of our experiments. Then, we compare the performances of models with and without MFN to evaluate the effectiveness of MFN. We also present visualisations of our synthesised normal face images and compare the face recognition performance of the proposed method with that of the state-of-the-art methods. Finally, an ablation study is conducted to evaluate the effectiveness of each loss.

| Datasets and implementation details
Datasets: We train our model on the Multi-PIE [17] dataset. To compare the results of the proposed method with those of the state-of-the-art methods, we also conduct experiments on Multi-PIE Setting-1 [19,21,25,32], which contains images of 250 identities in session one. The first 150 identities are used for training and the remaining 100 identities are used for testing. Since the images in Multi-PIE Setting-1 do not have expression variation, we include only two encoders in the MFJNN (one for pose and the other for illumination), and the corresponding fusion coefficients are set as ω_p = 0.75 and ω_i = 0.25. In the unconstrained experiment, we train the MFJNN on the entire Multi-PIE and test it on the LFW [12] and IJB-A [11] datasets. LFW consists of 13,233 images of 5749 individuals collected from the web. In the verification protocol of LFW, the test set contains 10 folds, each with 300 matched pairs and 300 unmatched pairs. IJB-A is a more challenging large-pose dataset that has 5396 images and 20,412 video frames of 500 subjects in uncontrolled settings.

FIGURE 6 Face normalisation results on Multi-PIE. Each pair presents the input image (left), the face normalised by the MFJNN (middle) and the face normalised by the TP-GAN [19] (right)
The Light CNN, used as the feature extraction network and the source of the identity perceptive loss, is pre-trained on MS-Celeb-1M [33] and kept fixed in both the training and testing processes.
Implementation details: The images from all datasets are pre-processed by applying the TCDCN [34] as the face detection algorithm and are then cropped and resized to 224×224. The pixel values are normalised to the range [−1, 1]. Our network is implemented in TensorFlow [35]. We use Adam [36] with a learning rate of 0.0001 to train the generator and the discriminator by iteratively minimising the generator loss and the discriminator loss. We empirically set the hyperparameters of the loss function as λ_1 = 10, λ_2 = 0.001 and λ_3 = 0.3. For the face recognition performance evaluation, the cosine-distance metric is used on the features extracted by the Light CNN.
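The cosine-distance matching used for evaluation can be sketched as follows; the identity names and feature values below are purely illustrative, not data from the experiments:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (higher = closer)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical Light-CNN features for a probe and a small gallery.
probe = np.array([0.2, 0.9, 0.1])
gallery = {"id_A": np.array([0.21, 0.88, 0.12]),
           "id_B": np.array([0.9, 0.1, 0.4])}

# Rank-1 identification: pick the gallery identity with the highest score.
best = max(gallery, key=lambda k: cosine_similarity(probe, gallery[k]))
assert best == "id_A"
```

Because cosine similarity normalises out vector magnitude, it compares only the direction of the feature vectors, which is the conventional choice for deep face embeddings.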

| Effectiveness of multi-factor normalisation
To evaluate the effectiveness of MFN, we remove some factor-normalisation encoders from the MFJNN, train three other models on the same training set (the first 200 identities of Multi-PIE) that include only one or two encoders in the generator, and compare these models with the proposed model. We report the rank-1 recognition rates of the different models in Table 1. We observe that the MFJNN improves performance compared to w/o MFN, w/o illumination and w/o expression. In particular, the MFJNN achieves a 2.6% average performance improvement over w/o MFN. These results demonstrate that each factor normalisation is useful for improving face recognition performance. The face synthesis results of the proposed MFJNN and w/o MFN are shown in Figure 5. Although there is no visually obvious difference in the illumination normalisation effect between the synthesised images of the MFJNN and w/o MFN, the recognition results in Table 1 verify the effectiveness of the illumination normalisation in the MFJNN for boosting recognition performance. In addition, it can be seen from Figure 5 that the MFJNN achieves better expression normalisation results than w/o MFN, though w/o MFN can also slightly normalise expression.

| Face synthesis
This section evaluates the MFJNN from two aspects: image quality and the ability to preserve identity. The face synthesis results on Multi-PIE are illustrated in Figure 6. For comparison, the faces normalised by the TP-GAN [19] are also illustrated. We can see that our model can effectively normalise the pose, expression and illumination in various pose cases and synthesise photorealistic normalised face images while maintaining the identity information. Figure 7 visualises the results of five images selected from LFW. From Figure 7, we see that the MFJNN synthesises face images with both good image quality and identity maintenance in an unconstrained environment. Although the TP-GAN and the method in [23] can also generate photorealistic frontal faces on the LFW dataset, our MFJNN can normalise illumination and expression simultaneously while the other methods only normalise pose variance (see the results in the last three columns). Figure 8 shows the synthesis results of the MFJNN on IJB-A. These results further demonstrate that the proposed model can simultaneously normalise pose, illumination and expression variations and synthesise high-quality and identity-preserving face images. Meanwhile, note that our model is trained on the constrained Multi-PIE dataset but still achieves good synthesis results on unconstrained datasets, which demonstrates that the proposed model has good generalisation ability.

| Face recognition
Here, the authors use the features extracted from the MFJNN and conduct face recognition on the Multi-PIE, LFW and IJB-A datasets. To compare the results of our method with those of other face recognition methods in a constrained environment, the training and testing data from Multi-PIE Setting-1 are used to perform the experiments. Table 2 compares the recognition rates on Multi-PIE Setting-1. We observe that our method achieves comparable results with the state-of-the-art methods and better performance than the Light CNN at every pose. Especially in extreme poses, our model achieves remarkable improvements compared with the Light CNN, for example, from 2.6% to 59.2% under ±90°. Note that the recognition performance of our method is higher than that of the Light CNN, which extracts features directly from non-normalised face images, demonstrating the advantage of our face normalisation method.
For the unconstrained evaluation, we compare the proposed method with the state-of-the-art pose-invariant face recognition methods, the DCNN [41], the DR-GAN [42] and the FF-GAN [20], and others on the LFW and IJB-A datasets. The comparison results on the LFW and IJB-A datasets are respectively shown in Tables 3 and 4. Our method outperforms many state-of-the-art methods. In addition, our method, which extracts features from normalised face images, achieves consistently significant improvements compared with the Light CNN, which extracts features from non-normalised face images using the same feature extraction network, as shown in Table 4. Moreover, we also trained the model without multi-factor normalisation (i.e., w/o MFN) on the same training set (the entire Multi-PIE) and tested it on LFW and IJB-A. The results in Tables 3 and 4 indicate that our model with multi-factor normalisation has better recognition performance than w/o MFN, which demonstrates the effectiveness of MFN in an unconstrained environment.

| Ablation study
To further evaluate the effectiveness of the proposed MFJNN and the contribution of each loss, we train three partial variants of the MFJNN: without the identity perceptive loss, without the pixel-wise loss and without the symmetric loss, denoted w/o L_id, w/o L_p and w/o L_sym, respectively. Figure 9 illustrates the visual effects of the face images synthesised by the MFJNN and its variants. We observe that the synthesised images of the MFJNN have better visual quality and preserve more identity information of the input face image than the results of the other variants. In particular, the synthesised images without the constraint of L_id tend to deviate from the identity of the input image due to the lack of a constraint on the consistency of identity information, which indicates the effectiveness of the identity perceptive loss.

FIGURE 9 Face normalisation results after removing the identity perceptive loss and multi-encoder on LFW

TABLE 5 Rank-1 recognition rate (%) of the networks removing the identity perceptive loss and multi-encoder on Multi-PIE Setting-1

Table 5 compares the recognition rates of different variants of the MFJNN on Multi-PIE Setting-1. The proposed MFJNN achieves better performance than all of its variants under different poses. These results suggest that each loss in the MFJNN is essential to obtaining a photorealistic and identity-preserving face image and improving face recognition performance.

| CONCLUSION
Here, the authors have proposed a MFJNN for pose-invariant face recognition. Using multi-factor normalisation, the proposed method can realise pose normalisation even when factors other than pose also vary, which improves face recognition performance on the normalised results. From the experimental evidence, it can be concluded that the proposed method can effectively normalise pose, illumination and expression; that the features extracted from the normalised results can boost face recognition performance; and that the proposed method achieves better face recognition performance than the state-of-the-art methods.