Hybrid face recognition under adverse conditions using appearance‐based and dynamic features of smile expression

Although recent deep-learning-based face recognition methods achieve remarkable accuracies on large databases, their performance has been shown to degrade under adverse conditions (e.g. severe illumination and contrast variations, blur and noise). Under such conditions, soft-biometric features such as facial dynamics can be expected to increase the performance when used together with appearance-based features. We propose a novel hybrid face recognition system, which uses appearance-based features extracted using deep convolutional networks and statistical facial dynamics features extracted from facial landmark positions during the smile expression. We evaluated the performance of three different state-of-the-art pre-trained deep convolutional neural networks (DCNNs) under a variety of severe image distortions with different parameters. The experimental results show that, although the face recognition performance using only DCNN-based features drops significantly under adverse conditions, the utilization of facial dynamics features together with DCNN-based features can compensate for the performance loss and increase the accuracy significantly. We believe the proposed system can be useful when face recognition is performed using videos obtained from systems that may contain blurry and noisy images with a wide range of illumination variations.


1 | INTRODUCTION
Technological developments have enabled us to perform many of our daily transactions using electronic devices. The security of transactions in electronic environments has therefore become an important problem. Hence, biometric systems, which use the static physiological or dynamic behavioural characteristics of a person, have gained importance for identification and recognition.
Face has been one of the main biometric characteristics, and face recognition has many application areas including security, law enforcement, health, education, marketing, finance, entertainment and human-computer interaction. Face recognition systems have some advantages over other biometric modalities since they may operate remotely with minimal or no cooperation from the user. Facial biometric systems are mainly based on accessing identity-related information using physiological features obtained from face images or behavioural characteristics obtained from facial movements in a video. Using a facial image or video, not only the identity but also the age [1], gender [2] and race information can be determined. The emotional and mental state of the person can also be inferred [3][4][5][6][7] from changes in facial expressions over time.
Face recognition systems in the literature can be grouped under two main categories as image-based and video-based methods [8,9]. While image-based methods use facial appearance-based features, video-based methods can also exploit behavioural features, which can be considered as a soft biometric feature. Face recognition methods achieved over 90% recognition accuracy on face databases obtained in controlled environments in the late 2000s [10][11][12]. After the introduction of deep learning methods to face recognition systems starting from the early 2010s, face recognition accuracies have exceeded 99% [13][14][15] on large-scale databases collected in the wild.
Video-based face recognition methods in the literature can be grouped as set-based and sequence-based methods. In set-based methods, the frames of a video are treated as a set, and frame-aggregation methods try to combine the information from the video frames effectively [16][17][18][19]. In Ref. [20], a deep-learning-based method is proposed for face recognition from videos, which tries to consider blur and occlusion effects. However, the temporal evolution of the frames is not taken into account: the frames of the video are passed through the network and average pooling is used to obtain the compact video representation. Sequence-based methods for face recognition from video can be grouped as temporal methods and spatio-temporal methods. Temporal methods use the facial dynamics information separately from the texture information [21], whereas spatio-temporal methods model the texture and the motion information together [22]. In Ref. [22], a deep neural network is trained using a loss function defined between two video streams to perform face verification from unlabelled videos.
There are comprehensive survey papers summarizing the recent developments on deep face recognition and verification [23][24][25][26][27]. In Ref. [27], an evaluation framework is also presented in order to measure how different aspects of deep-learning-based methods, including network architecture, choice of loss function, data augmentation and training, influence their performance. The reader is referred to the survey papers for an in-depth summary of the face recognition literature.
Although deep-learning methods achieve high performance for both face identification and verification tasks under controlled environments, it was observed that their performance decreased significantly under adverse conditions (e.g. illumination, contrast and noise variations) [28,29], which are also referred to as semantic adversarial attacks [30]. Therefore, it can be expected that the use of soft-biometric features as well as appearance-based features obtained from deep learning networks can increase the performance of face recognition systems under adverse conditions. It has been shown that the soft-biometric features obtained from facial dynamics carry information about the identity of the person, which is also supported by psychological studies [31,32]. However, they are not sufficient alone for identification of the individual with high accuracy [33][34][35][36][37][38][39][40][41][42][43][44][45]. In a recent work [41], histograms of facial action units detected during spontaneous facial expressions, recorded while the subject was interacting with a quiz game, are shown to carry discriminative information. In Ref. [42], it is explained that individual differences in facial expressions can make a face recognition system more robust to spoofing attacks. The facial expression of pain has also been used as a biometric feature [43]. The facial dynamics of the smile expression were modelled using long short-term memory networks on top of appearance features in [46]. In another recent work [44], changes in facial expression were shown to carry identity-related information. We would like to note that none of the approaches above aims to compensate for the loss of accuracy under adverse conditions (i.e. severe image distortions or semantic adversarial attacks).
In this work, we propose a hybrid face recognition (HFR) system, which combines static appearance-based features and dynamic behavioural features extracted from facial landmarks of smile videos, in order to increase the recognition performance under adverse conditions. Experimental studies on two video databases have confirmed that the proposed hybrid model is beneficial for face recognition under adverse conditions.
To the best of our knowledge, this is the first work that aims to compensate for the accuracy loss of face recognition systems under challenging image distortions using emotional facial dynamics information. In an earlier version of our work [45], we showed that dynamic features extracted from facial landmarks of smile videos carry identity-related information; in this paper, we extend our previous work mainly in three aspects. The outline of the paper is as follows. In Section 2, we give the details of the proposed method, including the extraction of the appearance-based and facial dynamics features from smile videos. In Section 3, the tested adverse conditions are described and experimental results are given on two databases. In Section 4, concluding remarks are given and future directions for research are indicated.

2 | HFR SYSTEM
The block diagram of the proposed HFR system is shown in Figure 1. First, the location of the face is detected at each frame in a given video that contains the smile expression. Then, in order to capture the facial dynamics, facial landmarks are detected around the eyes, eyebrows, nose, lips and the chin. If the facial landmarks can be detected successfully, dynamics-based features extracted from the landmarks and appearance-based features obtained from a pre-trained deep convolutional neural network (DCNN) are used together for face recognition to improve the performance under adverse conditions. If the facial landmarks cannot be detected successfully, face recognition is performed using the appearance-based features only.
In the following, we provide detailed information about steps of the proposed face recognition system.

2.1 | Face detection
The first step of the proposed face recognition system is to detect the face location in all frames of the video containing the smile expression. Numerous methods have been proposed in the literature to perform face detection. A widely known one is the Viola-Jones (VJ) face detector, which uses Haar cascades [47]. Although the VJ face detector is successful in detecting frontal faces, it may not achieve the same performance for other head poses. In 2005, Dalal and Triggs proposed another detection method based on Histogram of Oriented Gradients descriptors and a Support Vector Machine classifier [48], which achieved more successful results than the VJ face detector.
Recently, the use of deep learning networks has led to more robust face detectors. The Single Shot MultiBox Detector (SSMD) method [49] uses the ResNet-10 network as its backbone architecture; trained on images collected from the web, it gives successful results for face detection at different angles and distances. Another DCNN-based method used for face detection is the Max-Margin Object Detector (MMOD) [50]. While the training of traditional DCNN architectures requires large-scale databases, the MMOD network was trained using around 7000 face images taken from different face databases.
In our work, we used the MMOD face detector since it gives accurate results for various head poses and works much faster than the other tested face detection methods.

2.2 | Facial landmark detection
Facial landmarks are detected both to perform 2-D face alignment and to extract statistical dynamic features during smile activity. Recently, many methods have been proposed in the literature for facial landmark detection [51][52][53][54][55][56]. The tree-structure-based model proposed by Zhu and Ramanan in 2012 [55] has achieved good landmark localization results on faces in the wild, demonstrating performance similar to that of a commercial software trained with a large number of images. In the method known as CHEHRA [52], a general trained model is updated into a person-specific model in order to perform facial landmark detection on face images in uncontrolled environments. In 2013, Xiong et al. [54] proposed a supervised descent method for optimization of a non-linear least squares function in order to solve the face alignment problem. In 2015, the supervised descent method was extended non-locally and successfully applied to track facial landmarks on the face [51]. Kazemi et al. [53] proposed a facial landmark detection method based on gradient boosting, which learns an ensemble of regression trees that optimizes the sum-of-squared-error loss and naturally handles missing data.
In our work, we used the method in Ref. [53] for facial landmark detection since it was experimentally found to be more robust under tough conditions, which may be due to its tolerance for missing data. In Figure 2, an example face image from the UvA-NEMO smile video database [57] is shown with the 68 detected facial landmarks.

2.3 | Extraction of appearance-based features
The performance of face recognition systems has increased rapidly by the use of deep neural networks, yielding up to 99% face recognition accuracy on large-scale databases [13][14][15]. We evaluate three different pre-trained deep neural network architectures (VGG-Face [58], VGG-Face2 [59] and ArcFace [60]) to extract appearance-based feature vectors under adverse conditions, which are briefly explained below:

2.4 | Extraction of statistical dynamic facial features
Statistical facial dynamics information extracted from landmark positions in smile videos was used for gender recognition in Ref. [2]. We adopt a similar approach for face recognition using the facial dynamics features of the smile expression. First, the location of the face at each frame of the smile video was detected together with the locations of the 68 facial landmarks. Then, 27 facial distances, which are expected to change during the smile expression, were calculated from these facial landmarks as described in Table 1. In Table 1, ρ denotes the Euclidean distance between two facial landmarks and l_i denotes the i-th facial landmark. Next, statistical dynamic features were calculated using these 27 distances after temporal segmentation.
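The per-frame distance computation can be sketched as follows (a minimal sketch assuming the landmarks of one frame are given as a (68, 2) array; the example pairs below are hypothetical, since the actual 27 pairs are defined in Table 1):

```python
import numpy as np

def facial_distances(landmarks, pairs):
    """Euclidean distances rho(l_i, l_j) for the given landmark index pairs.

    landmarks : (68, 2) array of (x, y) landmark positions for one frame.
    pairs     : list of (i, j) landmark index pairs (Table 1 defines the 27
                pairs actually used; the ones below are only illustrative).
    """
    return np.array([np.linalg.norm(landmarks[i] - landmarks[j])
                     for i, j in pairs])

# Illustrative pairs in the standard 68-landmark indexing:
# (48, 54) are the mouth corners, so their distance is the mouth length.
example_pairs = [(48, 54), (36, 48)]
landmarks = np.zeros((68, 2))
landmarks[48] = [10.0, 0.0]
landmarks[54] = [70.0, 0.0]
d = facial_distances(landmarks, example_pairs)  # d[0] == 60.0 (mouth length)
```

Applying this to every frame of a smile video yields a 27-channel time series from which the statistical dynamic features are computed.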
The smile expression mainly consists of three temporal segments: onset, apex and offset. The onset refers to the time interval in which the facial expression changes from the neutral state to the expressive state with maximum intensity. The apex refers to the temporal segment during which the intensity of the emotion stays at its maximum level. The offset refers to the time interval during which the facial expression changes from the maximum intensity of the expression back to the neutral state. These three temporal segments must be estimated before the statistical dynamic features are calculated. A low-pass filter of length 5 was used to remove unwanted temporal variations of the mouth length (D_5) before proceeding with the segmentation. The mouth length was then normalized by its maximum value and used to determine the onset, apex and offset segments of the smile expression. In Figure 3, an example plot of the normalized mouth length versus frame number is shown for a video in the UvA-NEMO smile database [57,66]. In order to detect the onset, apex and offset segments, we consider the amplitude and the rate of change of the normalized mouth length. During the onset, the mouth length is less than a specified threshold value (0.8) and increases at a constant rate (ϵ) calculated based on the image resolution. The beginning of the apex segment is marked when the normalized mouth length rises above the threshold value and the rate of change of the mouth length (ϵ) has slowed down. During the offset, the mouth length should be below the threshold and decreasing at a constant rate.
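The segmentation rule above can be sketched as follows (a simplified per-frame version of the criteria: the threshold 0.8 follows the text, while the rate tolerance ε is an assumed small constant rather than the resolution-dependent value used in the paper):

```python
import numpy as np

def segment_smile(mouth_len, thresh=0.8, eps=1e-3):
    """Label each frame of a smile as onset / apex / offset.

    mouth_len : 1-D array of mouth length, low-pass filtered and
                normalized so that its maximum is 1.0.
    A frame is 'apex' when the normalized length exceeds thresh,
    'onset' when below the threshold and increasing, and 'offset'
    when below the threshold and decreasing.
    """
    rate = np.gradient(mouth_len)                 # per-frame rate of change
    labels = np.empty(len(mouth_len), dtype="<U6")
    for t, (d, r) in enumerate(zip(mouth_len, rate)):
        if d >= thresh:
            labels[t] = "apex"
        elif r > eps:
            labels[t] = "onset"
        else:
            labels[t] = "offset"
    return labels
```

On a trapezoid-shaped signal (rise, plateau, fall) this recovers the expected onset-apex-offset partition.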
After the smile expression is segmented into onset, apex and offset segments, 24 statistical dynamic features are calculated for each of the 27 facial distances as described in Table 2. The superscripts (+), (a) and (−) represent the onset, apex and offset segments of the smile expression, respectively. The velocity is calculated as V = dD/dt and the acceleration as A = dV/dt. The parameter η represents the number of frames and ω represents the frame rate of the video sequence. After calculating the 24 statistical dynamic features for each of the 27 facial distances, a 648-dimensional feature vector was obtained for each smile video. Next, a feature selection algorithm is used to detect the features that carry identity-related information of the person, the details of which are explained next.
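A small subset of these statistics can be sketched as follows (an illustrative sketch only: Table 2 defines the full set of 24 statistics, and the per-frame segment labels are assumed to be given):

```python
import numpy as np

def segment_stats(dist, labels, fps=50.0):
    """Illustrative subset of the per-segment statistics of Table 2 for one
    facial distance D: mean amplitude and mean speed over the onset (+),
    apex (a) and offset (-) segments.

    dist   : 1-D array, one of the 27 facial distances over time.
    labels : per-frame segment labels ('onset' / 'apex' / 'offset').
    fps    : frame rate omega, turning frame differences into velocities.
    """
    vel = np.gradient(dist) * fps          # V = dD/dt
    feats = []
    for seg in ("onset", "apex", "offset"):
        m = labels == seg
        feats += [dist[m].mean(), np.abs(vel[m]).mean()]
    return np.array(feats)                 # 6 of the full 24 x 27 = 648 features
```

Stacking all 24 statistics over the 27 distances gives the 648-dimensional vector described above.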

2.5 | Feature selection and classification
We used the Extremely Randomized Trees Classifier (Extra Trees Classifier) [67] for feature selection from the 648-dimensional statistical facial dynamics vector. The Extremely Randomized Trees Classifier is an ensemble learning technique that combines the results of multiple decision trees collected in a 'forest' to achieve a classification result. Each decision tree in the Extra Trees forest uses a subset of the original training examples.
At each node in a tree, a mathematical criterion (such as the Gini index) is used to select the best feature for splitting from a candidate set of features, which results in many diverse decision trees. In order to perform feature selection using this forest structure, the normalized total reduction of the mathematical criterion is calculated for each feature. If the Gini index is used as the criterion, this quantity is called the Gini importance. The Gini importance of each feature is then sorted in descending order and the top k features are selected. In our work, we used the 128 features with the highest Gini importance (i.e., larger than 0.002) among the 648-dimensional feature vector for face recognition.
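With scikit-learn, the selection step might look like this (a sketch: the threshold 0.002 and k = 128 follow the text, while the forest size is an assumption):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def select_features(X, y, k=128, min_importance=0.002):
    """Rank features by Gini importance using an Extra Trees forest and
    keep the top-k whose importance exceeds a small threshold, mirroring
    the selection step described above."""
    forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
    forest.fit(X, y)
    imp = forest.feature_importances_            # normalized Gini importances
    order = np.argsort(imp)[::-1]                # descending importance
    keep = [i for i in order if imp[i] >= min_importance][:k]
    return np.array(keep)
```

On toy data where only the first feature is informative, that feature is reliably among the selected indices.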
After the feature selection process, we fused the facial dynamics features with the appearance-based features obtained from the DCNN and used a K-Nearest-Neighbor (KNN) classifier for face identification. We tested K values between 1 and 8 and selected the value that achieved the highest test performance; it was observed experimentally that the best result was obtained for K = 7.
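The fusion and classification step can be sketched as follows (simple feature concatenation is assumed here for the fusion step; K = 7 follows the text):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def hybrid_identify(app_train, dyn_train, y_train, app_test, dyn_test, k=7):
    """Concatenate appearance (DCNN) and selected dynamics features,
    then identify test samples with a K-NN classifier."""
    Xtr = np.hstack([app_train, dyn_train])
    Xte = np.hstack([app_test, dyn_test])
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(Xtr, y_train)
    return knn.predict(Xte)
```

More elaborate fusion schemes (e.g. score-level fusion or per-modality weighting) would slot into the same interface.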

3 | EXPERIMENTAL RESULTS
In this section, we first describe the video databases used. Then, the evaluation method and the tested distortions (adverse conditions) are described. We compare the three different DCNNs under the tested adverse conditions and present the experimental results of the proposed hybrid method on the two databases.

3.1 | Video databases
In the literature, there are only a few publicly available databases which contain multiple smile videos of the same subject. Databases containing expressive face videos are usually collected for the purpose of affect recognition, and therefore contain only a single video for each emotion of the same subject. In this work, two different databases, which contain at least two smile videos for each subject, were used to test the proposed HFR system. The first database is the UvA-NEMO smile video database [57,66], which contains 1240 smile videos collected from 400 people (185 women and 215 men), ranging in age from 8 to 76 years. The videos were recorded in 1920 × 1080 high-definition format and have an average length of 3.9 s. The smiles are either spontaneous or deliberate; we use all smile videos of a subject regardless of the smile type. All the smile videos start with a neutral expression, reach the apex and return back to the neutral state, hence they contain the onset, apex and offset segments.
The second database used in the experiments is the FEEDTUM facial expression database, which contains 378 videos from 18 subjects with an age range of 23-38 [68]. Each subject has 21 videos covering six different facial expressions and the neutral state. Since we only need the smile expression to test the proposed method, 54 smile videos obtained from the 18 subjects were utilized. A description of the video databases used is summarized in Table 3.

3.2 | Evaluation method and tested distortions
The main objective of this work is to show that the statistical dynamic features obtained from the smile expression have the potential to improve the performance of deep-learning-based face recognition systems, which deteriorates under adverse conditions. Therefore, we first evaluated the face recognition performance of pre-trained deep neural networks under adverse conditions. Then, we investigated the effect of combining appearance-based features obtained from pre-trained DCNNs with the statistical dynamic features on the face recognition performance under adverse conditions. We applied various image distortions to the test images, which are described in detail below.

3.2.1 | Gaussian blur
Surveillance cameras may capture blurry videos due to out-of-focus or motion blur. In order to simulate out-of-focus blur, Gaussian low-pass filters with different standard deviations were applied to the test images to investigate the effect of blur degradations on deep-learning-based face recognition methods. The standard deviation (σ) was varied from 2.5 to 15.
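This degradation can be reproduced with a standard Gaussian filter (a sketch; scipy's `gaussian_filter` is used here as one possible implementation):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_image(img, sigma):
    """Simulate out-of-focus blur with a Gaussian low-pass filter;
    the experiments sweep sigma from 2.5 to 15."""
    return gaussian_filter(img.astype(float), sigma=sigma)

# A single bright pixel spreads into a Gaussian blob while the
# total intensity is preserved.
img = np.zeros((32, 32))
img[16, 16] = 255.0
blurred = blur_image(img, sigma=2.5)
```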

3.2.2 | Illumination variations
Two different methods were applied to investigate the effects of illumination changes on the face recognition performance. In the first method, the pixel values in the images were modified by adding a constant value between 5 and 200. In the second method, the pixel intensities were multiplied by a constant value between 0.5 and 5. The modified pixel values were truncated to the range [0, 255] where necessary.
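Both variations amount to a clipped affine change of the pixel intensities (a minimal numpy sketch):

```python
import numpy as np

def add_illumination(img, offset):
    """Additive illumination change: offsets between 5 and 200 are tested."""
    return np.clip(img.astype(float) + offset, 0, 255).astype(np.uint8)

def scale_illumination(img, gain):
    """Multiplicative illumination change: gains between 0.5 and 5 are tested."""
    return np.clip(img.astype(float) * gain, 0, 255).astype(np.uint8)
```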

3.2.3 | Gaussian noise
Gaussian noise was added to the test images, with zero mean and standard deviation between 0.1 and 0.5.
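Given the stated sigma range, the intensities are presumably normalized to [0, 1]; under that assumption (ours, not stated in the text) the degradation is simply:

```python
import numpy as np

def add_gaussian_noise(img01, sigma, rng=None):
    """Zero-mean Gaussian noise; the sigma range 0.1-0.5 in the text
    suggests intensities normalized to [0, 1], which is assumed here."""
    rng = np.random.default_rng(rng)
    return np.clip(img01 + rng.normal(0.0, sigma, img01.shape), 0.0, 1.0)
```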

3.2.4 | Salt and pepper noise
The pixel intensities in the test images were replaced with a value of 0 or 255 with a given probability. The probability value was varied between 0.1 and 0.5.
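A sketch of this degradation (an even split between salt and pepper pixels is assumed):

```python
import numpy as np

def salt_and_pepper(img, p, rng=None):
    """Replace each pixel with 0 or 255 with total probability p
    (p from 0.1 to 0.5 in the experiments), half salt and half pepper."""
    rng = np.random.default_rng(rng)
    out = img.copy()
    u = rng.random(img.shape)
    out[u < p / 2] = 0                       # pepper
    out[(u >= p / 2) & (u < p)] = 255        # salt
    return out
```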

3.2.5 | Contrast variations
In order to change the contrast of the test images, intensities between a low and a high value were linearly mapped to the full range. The widest input range was [0.1, 0.9] and the narrowest was [0.49, 0.51].
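This linear remapping can be sketched as follows (intensities assumed normalized to [0, 1]; note that the narrow [0.49, 0.51] range effectively binarizes the image):

```python
import numpy as np

def stretch_contrast(img01, lo, hi):
    """Linearly map intensities in [lo, hi] to the full [0, 1] range,
    clipping values outside the input range."""
    return np.clip((img01 - lo) / (hi - lo), 0.0, 1.0)
```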

3.3 | Results on UvA-NEMO database
We performed both face verification and face identification experiments on the UvA-NEMO smile database, which are presented below.

3.3.1 | Face verification results using facial dynamics
In our earlier work [45], it was shown that face verification can be performed using statistical facial dynamics features obtained from smile videos. We improved our previous results by using the feature selection algorithm described above to estimate which dynamic features contribute the most to face recognition, which reduced the number of dynamic features from 648 to 128. Statistical facial dynamics features were extracted from each of the 1215 videos, and a matrix containing the Euclidean distance between the feature vectors of each pair of videos was calculated. Then, false match rates (FMR) and false non-match rates (FNMR) were calculated using this matrix. All the videos were split into almost equal parts considering the total number of videos for each subject and making sure that the train and test videos for each subject were distinct. In Figure 4, the FMR versus FNMR plot for the UvA-NEMO smile database is given. The equal error rate (EER), where FMR is equal to FNMR, is also marked with a dashed line. The EER without feature selection was 31.20%, whereas it reduces to 7.42% with feature selection, which indicates a significant increase in face verification performance using only the facial dynamics information of the smile expression. The detection error trade-off (DET) curve is also given in Figure 5.
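The verification protocol can be sketched as follows (a minimal sketch: a threshold sweep over a pairwise distance matrix; the paper's actual train/test split is omitted):

```python
import numpy as np

def equal_error_rate(dist, labels):
    """Approximate the EER from a pairwise distance matrix by sweeping a
    threshold over all observed distances and returning the error rate
    where FMR is closest to FNMR."""
    n = len(labels)
    iu = np.triu_indices(n, k=1)                 # each pair counted once
    d = dist[iu]
    genuine = (labels[:, None] == labels[None, :])[iu]
    best = (1.0, None)
    for t in np.sort(d):
        fmr = np.mean(d[~genuine] <= t)          # impostor pairs accepted
        fnmr = np.mean(d[genuine] > t)           # genuine pairs rejected
        gap = abs(fmr - fnmr)
        if gap < best[0]:
            best = (gap, (fmr + fnmr) / 2)
    return best[1]
```

On perfectly separable toy features the EER is zero, as expected.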

3.3.2 | Face identification and verification results using appearance and dynamics features
In order to investigate the face recognition performance on the UvA-NEMO database, an image database was formed by extracting 12 frames from each video at regular intervals regardless of the length of the video. Subjects with only one video were excluded, which resulted in 1215 videos of 370 subjects. Hence, a database containing 14,580 face images from 1215 videos was obtained. The training set consists of 9864 of these images and 4716 images were used for testing. This split was done considering the total number of videos for each subject and making sure that the train and test videos for each subject were distinct.
In the first part of the experiments, various distortions were applied to the test images obtained from UvA-NEMO smile video database and the face recognition performance of pre-trained deep neural networks under adverse conditions was investigated using these images.
In the second part of the experiments, the effect of combining the feature vectors obtained from the pre-trained DCNNs with the statistical dynamic features obtained from the smile videos was investigated. The facial landmarks could be detected for most of the tested distortions. In Figure 6, the most strongly distorted images for which the landmarks could still be detected are shown for each of the six distortion categories. We can see from Figure 6 that, although the textures have deteriorated, the detection of the landmarks enables us to make use of the dynamics features under severe distortions.
We also tested the face recognition performance of the statistical facial dynamics alone under adverse conditions, which is given in Table 4. The table shows the face identification accuracy for each distortion with varying parameters. We can see that the face identification accuracy is 81.02% when there is no image degradation. The accuracy is especially stable under additive illumination and contrast variations.
The face identification results with both appearance-based and facial dynamics features for the UvA-NEMO smile database using 2-fold cross validation are shown in Table 5. Since subjects with at least two videos were used in the experiments, the k-value was chosen as 2 for k-fold cross validation. In Table 5, the first column indicates the distortions applied to the test images, with the distortion parameters given in parentheses. In Figure 7, the experimental results for the UvA-NEMO database are plotted for ease of comparison. We can see that for additive illumination, contrast variations and multiplicative illumination variations, the proposed method with ArcFace (HFR-ArcFace) gives the best results. On the other hand, for Gaussian blur, Gaussian noise and salt-and-pepper noise degradations, the proposed method with VGGFace2 (HFR-VGGFace2) gives the best accuracies. Since the proposed hybrid face recognition system is a closed-set identification system, the cumulative match characteristic (CMC) curves of HFR-VGGFace2 are given in Figure 8 to evaluate the proposed method under the selected distortions. The CMC curves show rank-k identification rates versus k, where k varies between 1 and 30, and they are given for the five adverse conditions under which the proposed HFR system shows the highest improvements in face recognition accuracy as compared to using the DCNN methods only.
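The rank-k identification rates behind a CMC curve can be computed as follows (a sketch; closed-set identification is assumed, i.e. every probe identity is present in the gallery):

```python
import numpy as np

def cmc_curve(dist, gallery_labels, probe_labels, max_rank=30):
    """Rank-k identification rates for a closed-set CMC curve.

    dist : (n_probes, n_gallery) distance matrix.
    Returns rates[k-1] = fraction of probes whose true identity appears
    among the k nearest gallery entries.
    """
    order = np.argsort(dist, axis=1)               # gallery sorted per probe
    ranked = gallery_labels[order]                 # labels in rank order
    hits = ranked == probe_labels[:, None]
    first_hit = hits.argmax(axis=1)                # 0-based rank of true id
    return np.array([(first_hit < k).mean() for k in range(1, max_rank + 1)])
```

The rank-1 value of this curve is the usual identification accuracy reported in the tables.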
The DET curves obtained during the face verification experiments, following a protocol similar to the one described in Section 3.3.1, are given in Figure 9. In these curves, image distortions using contrast variations with the range [0.49, 0.51] were used, for which the hybrid face recognition system provided the most improvement. We can see that the DET curve of the proposed method (HFR-VGGFace2) is below the others, which shows that HFR-VGGFace2 always provides a lower FMR as compared to using appearance-based features alone (VGGFace2).

3.4 | Results on FEEDTUM database
FEEDTUM is a relatively small database containing videos of 18 subjects. We extracted 12 uniformly spaced frames from each smile video (regardless of its length) to form an image database containing 648 images from the 54 videos. The images extracted from one video of each subject were reserved for testing and the images obtained from the other two videos were used for training. As a result, the training database contains 432 images and the test database contains 216 images.
We would like to note that some smile videos in the FEEDTUM database do not contain the offset segment of the smile expression, hence they do not end with a neutral expression. Moreover, some subjects could not express the emotion well. Hence, it is difficult to extract the statistical dynamic features for such videos.
The face recognition results obtained for the smile videos in the FEEDTUM database using twofold cross validation are shown in Table 6, where the first column indicates the degradations applied to the test images, with the parameters given in parentheses. In columns 3, 5 and 7, the accuracies for the proposed HFR method are given. The cases for which the facial dynamics features improve the accuracies are indicated in bold. We can see that the accuracies may decrease significantly using DCNN features alone, and the dynamics-based features can compensate for the loss in some cases. For example, for contrast variations [0.49, 0.51], the accuracy for VGGFace2 is 74.07%, whereas the accuracy of the proposed HFR-VGGFace2 is 77.31%, an increase of 3.24%.
The experimental results are given as plots in Figure 10. We can see that in some of the cases, the proposed method using facial dynamics gives better results as compared to using DCNN features alone. Since the FEEDTUM database contains videos that do not include all of the onset-apex-offset phases of the smile expression, the increases in face recognition accuracies are not as clear as on the UvA-NEMO database. Hence, the dynamics features may not be as effective when the offset phase of the smile is missing.
The CMC curves are also shown in Figure 11 for five different degradations under which the proposed face recognition system gives the highest improvement in face recognition accuracies as compared to using DCNN features alone.

4 | CONCLUSIONS AND FUTURE WORK
In this work, we presented an HFR method, which uses appearance-based features extracted using DCNNs and statistical facial dynamics features extracted from the smile expression. First, we evaluated the performance of three different state-of-the-art pre-trained DCNNs under six different image distortion types (different illumination variations, blur and noise) with varying parameters, which showed that their accuracies drop significantly under adverse conditions. Although the databases used (UvA-NEMO and FEEDTUM) are relatively easy in that they contain frontal, high-resolution videos, the decrease in accuracy was significant under the tested adverse conditions. The statistical facial dynamics features were extracted using the positions of 68 facial landmarks detected on each frame of the video containing a smile expression with its onset-apex-offset temporal phases. The utilization of the facial dynamics features was shown to compensate for the performance loss and to increase the accuracy significantly under severe image distortions. Hence, the temporal dynamics of the smile expression contain identity-related information. We believe the proposed system can be useful when face recognition is performed using videos obtained from systems that may contain blurry and noisy images with a wide range of illumination variations.
As for future work, the usefulness for face identification of statistical dynamic features obtained from the facial expressions of other emotions (anger, surprise, disgust, sadness etc.) can be investigated. To this end, the collection of a new database, which contains many repetitions of the facial expression of each emotion for each subject, may be required. A more challenging problem would be to extract facial dynamics features while the person is talking and showing an emotional facial expression simultaneously.