mFERMeta++ : Robust Multiview Facial Expression Recognition Based on Metahuman and Metalearning

Facial pose variation presents a significant challenge to facial expression recognition (FER) in real-world applications. Significant bottlenecks exist in the field of multiview facial expression recognition (MFER), including a lack of high-quality MFER datasets and limited model robustness in real-world MFER scenarios. Therefore, this article first introduces a metahuman-based MFER dataset (MMED), which effectively addresses the insufficient quantity and quality of existing datasets. Second, a cascaded conditional VGG (ccVGG) model is proposed, which adaptively adjusts expression feature extraction based on the pose information of the input image. Finally, a hybrid training and few-shot learning strategy is proposed that integrates the MMED dataset with a real-world dataset and enables rapid deployment in real-world application scenarios via the proposed Meta-Dist few-shot learning method. Experiments on the Karolinska Directed Emotional Faces (KDEF) dataset demonstrate that the proposed model exhibits improved robustness in multiview application scenarios and achieves a recognition accuracy improvement of 28.68% relative to the baseline. These results show that the proposed MMED dataset can effectively improve the training efficiency of MFER models and facilitate easy deployment in real-world applications. This work provides a reliable dataset for MFER studies and paves the way toward robust FER from arbitrary views in real-world deployment.


Introduction
Facial expressions are an important means of conveying human emotions, and accurate recognition of facial expressions can promptly reflect the mental state and emotional information of a person. Facial expression recognition (FER) is an important application domain in advanced intelligent systems. Intelligent systems utilize sensors such as cameras to perceive human emotions, thereby enhancing emotion recognition and natural interaction capabilities in artificial intelligence systems, leading to better human–machine interaction. Therefore, FER is becoming an increasingly hot topic in advanced intelligent systems and has crucial applications in areas including healthcare, education, human–computer interaction (HCI), and driver fatigue monitoring. [1] Current FER studies are largely based on the classification of the six basic emotions plus neutral, that is, anger, disgust, fear, happiness, neutral, sadness, and surprise. [2] Conventional methods for FER mainly rely on handcrafted feature extraction, including Gabor texture [3] and local binary patterns (LBP). [4] With the development of deep learning, end-to-end learning methods such as convolutional neural networks (CNNs) have gained increasing popularity.
Although many FER systems can achieve high accuracy in controlled laboratory environments, application deployment in the real world is complex, and variations in race, gender, age, and pose can deteriorate recognition performance. [5,6] Regarding pose variations in particular, most research studies assume the frontal view as an underlying premise. However, in real-world practical applications, this premise does not hold in many scenarios. For example, in the natural human–computer interaction domain, the camera is typically fixed on a table or integrated into a screen while people walk freely in the space, causing frontal FER systems to fail. Therefore, multiview facial expression recognition (MFER) in the real world is crucial.
There are two grand challenges for MFER. One is the lack of natural datasets. [7] On the one hand, setting up multiview camera experiments is time-consuming and complex. On the other hand, controlling the subjects' emotions in an experiment is sometimes difficult or infeasible; for example, it is unrealistic to expect an individual to produce a perfect expression of surprise on demand. Data augmentation methods have been explored: Zhang et al. [8] proposed a generative adversarial network (GAN)-based model that can generate images with different expressions, and Yi et al. [9] used a conditional GAN to generate images for augmenting the FER2013 dataset. However, GAN-based methods in principle do not introduce new information.
The other challenge is FER feature representation. Wang et al. [10] proposed a novel region attention network (RAN) to adaptively capture the importance of facial regions for occlusion- and pose-variant FER. Rudovic et al. [11] proposed the coupled scaled Gaussian process regression (CSGPR) model for head-pose normalization. However, all of these approaches operate at the model level and rely on controlled laboratory datasets.
To solve the above problems and achieve robust multiview facial expression recognition in the real world, an MFER algorithm based on metahuman and metalearning is proposed. Metalearning is a common approach to solving few-shot learning tasks. Specifically, this article first utilizes the MetaHuman to address the dataset problem in MFER. Unlike laboratory datasets collected under controlled conditions, the metahuman is not restricted by race, gender, or age. The virtual nature of the metahuman also means that it can generate an unlimited amount of data, which greatly satisfies the needs of deep learning. Unlike GAN networks, which cannot introduce new information, combining the metahuman with natural datasets from the real world can also increase the robustness of the dataset. Second, a cascaded conditional VGG (ccVGG) model is proposed in this article. Guided by the pose information of the face, it can better classify expressions at different angles while performing multiview expression classification, making the model more focused on expression features under multiple views. Third, this article proposes a hybrid-training and few-shot learning (HTFL) strategy. The hybrid training results in a model that is more robust against pose variations, while the few-shot learning facilitates deployment in the real world, enabling the model to generalize to a wider range of applications and enhancing its robustness in real-world scenarios. The contributions of this work are as follows: 1) We propose the meta multiview expression dataset (MMED), which is based on MetaHuman and targeted at multiview facial expressions. The dataset can effectively solve the current problems of small data volume and a lack of multiple perspectives in the field of MFER, and its feasibility is demonstrated through detailed experiments. 2) We propose the ccVGG model, which learns pose-based expression features by simultaneously inputting expression images and their pose information, addressing the pose-variance problem of conventional CNNs. 3) To address the poor robustness of MFER in the real world, we propose the HTFL strategy, which increases the amount of information in the model and enables rapid deployment in the real world.

Related Work
In this section, we first review previous work on datasets, recognition models, and methods in the area of MFER. Then, an overview of research progress in the field of metahumans is presented. Finally, we review related work in the area of few-shot learning.

Multiview Facial Expression Recognition
MFER refers to identifying human emotions from face images with an emphasis on head pose variation. Earlier studies focused on images of faces with frontal head poses; with the increasing demand for real-world deployment, the more challenging MFER problem has become an important research topic. The idea of MFER was first investigated as pose-invariant FER. [12] In recent studies, MFER is often associated with FER under occlusion or FER robust to pose, because the underlying scientific problem is to recognize facial expression images when a certain amount of expression information is missing compared to the frontal face.
Several MFER datasets are currently available to support studies in the field of deep MFER. BU3DFE [13] was the first proposed MFER dataset. Each subject performed seven expressions in front of a 3D face scanner. With the exception of the neutral expression, each of the six prototypic expressions (happiness, disgust, fear, anger, surprise, and sadness) includes four levels of intensity. Therefore, there are 25 instant 3D expression models for each subject, resulting in a total of 2,500 3D facial expression models in the database. CMU MultiPIE [14] is a dataset by Gross et al. supporting variations in pose, illumination, and facial expression. It contains 337 subjects with up to four recordings at 15 viewpoints and 19 illumination conditions, with five expressions plus neutral: disgust (DI), scream (SC), smile (SM), squint (SQ), surprise (SU), and neutral (NE). The dataset features large viewpoint variations, with face poses between −90° and +90° at 15° intervals. The Radboud Faces Database (RaFD) [15] was created by Radboud University in the Netherlands and includes expressions of Caucasian adults, Caucasian children, and Moroccan males, totaling 67 individuals. RaFD is a high-quality facial expression dataset divided into eight expressions based on the facial action coding system: anger, contempt, disgust, fear, happiness, sadness, surprise, and neutrality. Each expression is displayed with three different gaze directions, and all images are taken simultaneously from five camera angles. The Karolinska Directed Emotional Faces (KDEF) [16] is a multiview facial expression dataset proposed by Lundqvist et al., containing 4,900 images from 70 gender-balanced individuals (35 males and 35 females) covering seven basic facial expressions: anger (AN), disgust (DI), fear (FE), happiness (HA), sadness (SA), surprise (SU), and neutrality (NE). The facial images are divided into five head pose categories (−90°, −45°, 0°, +45°, +90°).
A summary comparison of the above datasets (Table 1) shows that although various MFER datasets have been proposed, problems and challenges remain. First, most of the publicly available datasets are relatively old, having been proposed in 2006, 2010, 2010, and 2008, respectively. Second, deep learning methods require large amounts of data, while most MFER datasets are small. Third, most MFER datasets are collected in the laboratory, so the expressions captured under controlled conditions are not from the real world and lack expression diversity. In addition, owing to the limited pool of subjects, other factors that may influence expression recognition, such as age, ethnicity, and gender, cannot be fully balanced.
Recently, deep learning methods have also been applied in the MFER field. Deep learning methods can learn semantic features that increase robustness to pose variations. Mahdi et al. [7] classified the methods that use deep neural networks for MFER tasks into three categories: pose-robust features, pose normalization, and pose-specific classification.
Pose-robust feature methods extract facial features that are robust to head pose variations. Zhang et al. [8] proposed an end-to-end learning model based on GANs to simultaneously perform facial image synthesis and pose-invariant FER by jointly using different poses and expressions. Liu et al. [17] proposed a dynamic multichannel learning network (DML-Net) for pose-aware and identity-invariant FER; specifically, it used three parallel multichannel convolutional networks to learn fused global and local features from different facial regions, reducing the impact of pose and identity on FER performance. Pose normalization maps facial expression features from nonfrontal poses to frontal space. Sun et al. [18] used 3D facial reconstruction to transform nonfrontal faces into frontal faces. They proposed an end-to-end 3D face feature reconstruction and learning network (3DF-RLN) that used a 3D face reconstruction module to reconstruct 2D nonfrontal facial appearance data before feature learning and classification. Wang et al. [19] proposed a cascaded regression-based face frontalization (CRFF) method that used a cascaded regression model to learn, in multiple steps, the pairwise spatial relationship between nonfrontal facial shapes and their frontal counterparts. Pose-specific classification recognizes facial expressions for each head pose separately. Liu et al. [20] designed a conditional convolutional neural network augmented random forest (CoNERF) that learns to classify facial expressions from different viewpoints separately using conditional probabilities. Liu et al. [21] proposed a multichannel pose-aware convolutional neural network (MPCNN) for MFER. It used three sub-CNNs to learn convolutional features from the mouth, the eyes, and the whole face, respectively; it also estimated head pose via pose-aware recognition and finally recognized multiview facial expressions by minimizing the joint loss of pose and expression classification.

Metahuman
Metahumans, also called digital humans, are digital figures created using digital technology that closely resemble real humans. [22] Digital human technology has many applications in real life. For example, in fields such as virtual reality and games, virtual simulations of humans are produced using 3D modeling software such as Maya, 3ds Max, and ZBrush to generate virtual individuals that meet specific needs. [23–25] In the FER field, pilot studies have been performed to expand facial expression datasets for deep learning models via digital humans. There are also sophisticated tools for digital human face production. Volonte et al. [26] proposed HeadBox, a toolkit that animates faces in the Microsoft Rocketbox avatar library. Epic Games has developed the MetaHuman Creator, [27] which enables fast and custom creation of photorealistic digital humans. Digital human technology has also been utilized for FER studies. Fernández-Sotos et al. [28] generated a new set of dynamic virtual faces (DVFs) that simulated six basic emotions and neutral expressions. They used the Penn Emotion Recognition Test (ER40) to evaluate the difference between dynamic virtual faces and standardized natural faces, experimentally demonstrating that DVFs were as effective as standardized natural faces in accurately recreating human facial expressions. Siddiqui et al. [29] generated a set of digital human data using metahumans and trained a GAN on the data for metahuman-based facial expression generation. Akhyani et al. [30] generated three dynamic digital human facial expressions using the MakeHuman toolkit and the FACSHuman plugin, and experimentally demonstrated that augmenting a natural dataset with the digital human dataset could improve the recognition accuracy of the model.
Although the above studies showed the potential of using metahumans for FER, several challenges remain unsolved. First, few deep FER models have been trained on metahuman datasets, and in the field of MFER the application of digital humans is still blank. Second, the metahumans generated in current work are cartoon-style without high fidelity, and thus differ considerably from real-world humans. In addition, no current study has proposed a complete metahuman dataset covering the seven basic emotion categories.

Few-Shot Learning
Deep learning has achieved great success in many fields. However, its performance is greatly limited when the amount of data is insufficient. Efficiently extracting useful information from small-sample datasets remains a challenge in the deep learning field, and few-shot learning methods have emerged to meet this demand.
According to a review published by Wang et al. in 2020, [31] few-shot learning methods can be mainly classified into data-based, model-based, and algorithm-based approaches. Data-based methods augment the training set using prior knowledge, addressing the problem of insufficient data by expanding it with conventional augmentation or generative networks. Model-based methods constrain the hypothesis space using prior knowledge. Among them, metric learning methods use a general embedding model to learn a metric function from existing data, compute the similarity among samples through this function, and thereby solve few-shot classification tasks. For example, the classic Siamese network [32] feeds two samples into the same network and computes a loss between the two outputs to determine whether they belong to the same class. Other classic algorithms include ProtoNet, [33] which uses the Euclidean distance to measure the distance between the classification samples and the known samples, and MatchingNet, [34] which introduces an attention mechanism and cosine measurement. The Bi-Similarity Network (BSNet) [35] is also a metric learning method, employing two distinct similarity metric modules to learn more discriminative feature maps. Algorithm-based methods alter the search strategy in the hypothesis space using prior knowledge. Currently, metalearning is the mainstream approach to few-shot learning, focusing on learning good initialization parameters during the training phase to improve model generalization. Model-agnostic metalearning (MAML) uses a metalearner to guide the training of a base learner on each task, accumulating a large amount of prior knowledge and achieving fast convergence with a small amount of data.

Methodology
In this section, we first give an overview of our proposed MMED and then detail each module in the proposed ccVGG structure. Finally, we present the training strategy for hybrid training and few-shot learning.

Overview
This article proposes a metahuman-based MFER model called mFERMeta++. The mFERMeta++ automatically learns multiview facial expression features at the dataset level using an end-to-end approach, and uses few-shot learning to achieve practical applications in the real world.
The framework of the mFERMeta++ algorithm (Figure 1) is as follows. First, a metahuman-based multiview facial expression dataset named MMED is generated using Unreal Engine 5. This dataset includes seven expressions (anger, disgust, fear, happiness, neutral, sadness, and surprise) of 10 metahuman individuals at different pose angles (0°, 15°, 30°, 45°, 60°, 75°, 90°). Next, the ccVGG network module is proposed. The MMED dataset is first used to train the pre-CNN to classify three facial poses (frontal, semifrontal, and profile); then, the expression images and their pose labels are simultaneously fed into the improved VGG network for training, so that the model can better classify expressions at different angles, guided by the facial pose information. For training, we use the hybrid-dataset training strategy, combining the FERPlus dataset [36,37] collected in the real world with our MMED dataset. For real-world application, [38] we use the few-shot learning approach Meta-Dist for rapid deployment of the model in practical scenarios.

Meta Multiview Expression Dataset
MetaHuman Creator, developed by Epic Games, can be used to create high-fidelity multiview facial expression datasets of metahumans. [27] MetaHuman Creator is a browser-based application that can create bespoke photorealistic digital humans, fully rigged and complete with hair and clothing. MetaHuman Creator can also be used with Unreal Engine 5, a game creation tool also developed by Epic. In Unreal Engine 5, it is possible to manipulate the face, that is, to drive different facial units to exhibit different facial expressions. According to the Unreal Engine EULA, Epic Games permits users to utilize Unreal Engine 5 at no cost for educational purposes and nonprofit development projects; videos and images created with Unreal Engine may also be freely distributed and used.
One goal of this work is to create humans from different backgrounds to exclude the effects of ethnicity, gender, and age. Therefore, we create five male and five female metahuman individuals in Quixel Bridge within Unreal Engine 5. To balance race, five of the 10 metahuman individuals are White, three are Asian, and two are Black. MetaHuman Creator is then used to create expression animations. We create two sets of animations totaling about 17 s for each of happiness, surprise, fear, sadness, and anger. Because the neutral expression is more common, we create about 30 s of animation for it. As the disgust expression is relatively rare, we create only one set of animations of about 7 s. These expression animations can be attached to the face of a metahuman, so that the metahuman shows a specific expression when the animation is played. The expressions have been validated by an annotator from a Chinese cultural background.
To collect the expression images, we create a studio environment with identical lighting conditions and background for each metahuman and fix the camera on the face for capture. To meet the needs of MFER, we capture expressions from multiple viewpoints, namely seven angles of horizontal rotation: 0°, 15°, 30°, 45°, 60°, 75°, and 90°. The frame rate of the captured video is 60 fps.
After collection, the expression animations of the 10 metahuman individuals are obtained. Because this article addresses static expression recognition, the expression animations need to be sampled. Schyns et al. [39] showed that the human brain takes only 140–200 ms to determine an emotional state from facial expression information. Therefore, a sampling rate of 5 frames/s is used for the expression animations. About 50k expression images are obtained after sampling. The faces in the images are then detected using MTCNN, [40] and the detected faces are extracted and cropped to a size of 256 × 256.
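As a concrete illustration of this step, the following sketch samples every twelfth frame (60 fps to 5 frames/s) and crops detected faces. It assumes the facenet-pytorch implementation of MTCNN and OpenCV for video decoding; the detection margin and output file layout are illustrative choices, not details from the article.

```python
# A minimal sketch of the sampling and cropping step, assuming the
# facenet-pytorch implementation of MTCNN and OpenCV for video decoding.
import cv2
from facenet_pytorch import MTCNN

detector = MTCNN(image_size=256, margin=20, post_process=False)

def sample_and_crop(video_path, out_dir, fps_in=60, fps_out=5):
    cap = cv2.VideoCapture(video_path)
    step = fps_in // fps_out  # keep every 12th frame (60 fps -> 5 frames/s)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            # Returns a 256x256 face crop (also saved to disk), or None if no face
            face = detector(rgb, save_path=f"{out_dir}/frame_{saved:05d}.png")
            if face is not None:
                saved += 1
        idx += 1
    cap.release()
```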
In addition, further processing is performed on the dataset. Domain randomization, [41] i.e., randomizing visual information or physical parameters in the simulated environment to implement sim2real transfer, is commonly used to achieve better results in the real world. Considering that the backgrounds of facial expressions in the real world are diverse, we use threshold binarization to extract the black background in the images and replace it with images from the ImageNet dataset to simulate a diverse set of real-world backgrounds. [42] We also flip the images in the dataset to obtain images in both the left and right directions.
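A minimal OpenCV sketch of this background-replacement step is shown below, under the assumption that the studio background is near-black; the threshold value is an illustrative assumption, not a figure reported here.

```python
# A minimal sketch of the background-randomization step, assuming the studio
# background is near-black; the threshold value (20) is illustrative.
import cv2
import numpy as np

def randomize_background(face_img, bg_img, thresh=20):
    """Replace near-black background pixels of face_img with bg_img."""
    bg = cv2.resize(bg_img, (face_img.shape[1], face_img.shape[0]))
    gray = cv2.cvtColor(face_img, cv2.COLOR_BGR2GRAY)
    # Mask is 255 where the pixel is dark enough to count as background
    _, mask = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY_INV)
    mask3 = cv2.merge([mask, mask, mask])
    composited = np.where(mask3 == 255, bg, face_img)
    # Horizontal flip yields the mirrored (opposite-direction) view
    return composited, cv2.flip(composited, 1)
```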
The final MMED dataset is illustrated in Figure 2, and the number of images for each of the seven expressions of one individual is listed in Table 2.

Cascaded Conditional VGG
As shown in Figure 1, the proposed ccVGG architecture consists of two parts. The first part, called the pose-classification module, learns the pose features of the input image using a CNN to perform a coarse classification of the pose angle. The second part is the multiview feature extraction module, which learns pose-conditioned expression features via a modified conditional VGG network.
The pose-classification module utilizes a CNN for pose classification. We train this network on the proposed MMED dataset to classify three poses: frontal, semifrontal, and profile. The poses in the MMED dataset are relabeled as follows: 0° and ±15° as frontal; ±30°, ±45°, and ±60° as semifrontal; and ±75° and ±90° as profile. The input image passes through six convolutional layers to extract features, and the resulting features are classified into one-hot labels of length three by FC and softmax layers. The multiview feature extraction module uses an inception-v4 [43] module to replace the first block of the original VGG network. [44] In CNNs, the shallow layers extract edge features, while the deeper layers extract more abstract semantic features. By widening the shallow network with the inception-v4 module, we preserve more image information and make it easier for the deeper layers to learn features. Furthermore, the pose information of the input image serves as an additional input to the module; by using this information to guide the network, we enable it to focus on expression features across multiple viewpoints.
The ccVGG architecture is trained as a cascade, starting from the minimal network. When new layers are added, the weights of the already-trained networks are fixed and only the new layers are trained. [45] This approach increases learning speed and avoids propagating erroneous information through backpropagation. In this article, we train the pose-classification module first; we then fix its weights and train the multiview feature extraction module. The combination of the two modules is shown in Figures 3 and 4. Images are first input to the pose-classification module, and the pose information output by the module is then input to the multiview feature extraction module along with the images. In our configuration, the pose information is injected after block2 of the ccVGG network; that is, the image is combined with the pose information after feature extraction in block1 and block2, and the rest of the network is then trained. Algorithm 1 details the ccVGG training process.
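The pose-conditioning step might look as follows in PyTorch, following the description above and the Figure 3 caption (the three-way pose output is lifted by an MLP to a 24 × 24 map and concatenated along the channel dimension); the channel count and MLP width are illustrative assumptions, not the article's exact configuration.

```python
# A PyTorch sketch of the pose-conditioning step: the 3-way pose vector is
# lifted by an MLP to a 24x24 map and concatenated as an extra channel after
# block2. Channel count and MLP width are assumptions.
import torch
import torch.nn as nn

class PoseConditionedFusion(nn.Module):
    def __init__(self, feat_channels=128, feat_size=24):
        super().__init__()
        self.feat_size = feat_size
        # MLP lifting the pose one-hot (frontal/semifrontal/profile) to a map
        self.pose_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, feat_size * feat_size),
        )
        # 1x1 conv fusing the extra pose channel back into feat_channels
        self.fuse = nn.Conv2d(feat_channels + 1, feat_channels, kernel_size=1)

    def forward(self, feat, pose_logits):
        # feat: (B, C, 24, 24) output of block2; pose_logits: (B, 3)
        b = feat.size(0)
        pose_map = self.pose_mlp(pose_logits).view(b, 1, self.feat_size, self.feat_size)
        return self.fuse(torch.cat([feat, pose_map], dim=1))
```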

Hybrid Training and Few-Shot Learning
This section aims to address the poor robustness problem of MFER in real-world scenarios. To enable practical applications in such scenarios, a hybrid-training approach is proposed that involves mixing real-world datasets and the MMED dataset for training, followed by few-shot learning for rapid deployment in real-world application scenarios.

Hybrid Training
Although some studies have suggested that mixing different datasets for training can decrease accuracy, [46] mixing multiple datasets increases the information content of the model and enhances the robustness of expression recognition when applied in the real world. Therefore, we mix the real-world FERPlus dataset and the MMED dataset for training so that the model captures both expression diversity and pose diversity.
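In implementation terms, hybrid training reduces to drawing mini-batches from the union of the two corpora. A minimal PyTorch sketch is shown below, with random tensors standing in for the real FERPlus and MMED loaders.

```python
# A minimal sketch of the hybrid-training data pipeline: two corpora with a
# shared label space are concatenated into one dataset. The TensorDatasets are
# stand-ins for real FERPlus and MMED loaders.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

ferplus = TensorDataset(torch.randn(100, 1, 48, 48), torch.randint(0, 7, (100,)))
mmed = TensorDataset(torch.randn(100, 1, 48, 48), torch.randint(0, 7, (100,)))

hybrid = ConcatDataset([ferplus, mmed])  # one dataset drawing from both sources
loader = DataLoader(hybrid, batch_size=32, shuffle=True, num_workers=0)
```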

Few-Shot Learning
To enable rapid deployment in real-world scenarios, few-shot learning methods are utilized. We propose a few-shot learning method called Meta-Dist, which helps pretrained models quickly adapt to new domains with limited samples. The training process involves a training stage and a fine-tuning stage, in which the classifier of the model is replaced by the combination of a cosine distance classifier and a Euclidean distance classifier. Specifically, the cosine distance classifier [47] computes the cosine similarity between the feature vector extracted by the feature extraction network and the parameters of the fully connected (FC) layer of the classifier, producing scores $[s_{i,1}, s_{i,2}, \ldots, s_{i,K}]$ with

$$s_{i,j} = \cos\langle \vec{w}_j, f_\theta(x_i) \rangle = \frac{\vec{w}_j \cdot f_\theta(x_i)}{\lVert \vec{w}_j \rVert \, \lVert f_\theta(x_i) \rVert}$$

where $s_{i,j}$ is the computed distance metric, $\vec{w}_j$ is the parameter vector of the FC layer for class $j$ in the distance measure, $f_\theta$ is the feature extractor of the multiangle expression classification network, and $x_i$ is the input image.

The Euclidean distance classifier calculates the Euclidean distance between the feature vector and the parameters of the FC layer of the classifier,

$$d_{i,j} = \mathrm{euc}\langle \vec{w}_j, f_\theta(x_i) \rangle = \lVert f_\theta(x_i) - \vec{w}_j \rVert_2$$

where the classifier parameters are initialized to the average feature vector of the input support samples,

$$\vec{w}_j = \frac{1}{|S_j|} \sum_{x \in S_j} f_\theta(x)$$

with $S_j$ denoting the support samples of class $j$. After computing the two distances, we add two learnable scale factors $\alpha$ and $\beta$ that adjust the weights of the two measures in the final classification, [48] so the probability computed by the softmax function during training becomes

$$p(y = k \mid x) = \frac{\exp\!\big(\alpha \cos\langle \vec{w}_k, f_\theta(x) \rangle - \beta\, \mathrm{euc}\langle \vec{w}_k, f_\theta(x) \rangle\big)}{\sum_{j} \exp\!\big(\alpha \cos\langle \vec{w}_j, f_\theta(x) \rangle - \beta\, \mathrm{euc}\langle \vec{w}_j, f_\theta(x) \rangle\big)}$$

where $k$ is the $k$th category, $\cos\langle a, b \rangle$ is the cosine similarity between $a$ and $b$, and $\mathrm{euc}\langle a, b \rangle$ is the Euclidean distance between $a$ and $b$; the Euclidean term enters with a negative sign so that smaller distances yield higher class probabilities.
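A minimal PyTorch sketch of a classifier head following this description is given below: per-class weights initialized to mean support features, a cosine score and a Euclidean distance per class, and learnable scales alpha and beta (initialized to 5, as in the experiments below). This is our interpretation of the formulas, not the authors' released code.

```python
# A sketch of the Meta-Dist classifier head described in the text.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaDistClassifier(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(num_classes, feat_dim))
        self.alpha = nn.Parameter(torch.tensor(5.0))  # scale for the cosine term
        self.beta = nn.Parameter(torch.tensor(5.0))   # scale for the Euclidean term

    @torch.no_grad()
    def init_from_support(self, feats, labels):
        # Initialize each class weight to the mean feature of its support samples
        for k in range(self.weight.size(0)):
            self.weight[k] = feats[labels == k].mean(dim=0)

    def forward(self, x):
        # x: (B, feat_dim) features from the expression network f_theta
        cos = F.normalize(x, dim=1) @ F.normalize(self.weight, dim=1).t()
        euc = torch.cdist(x, self.weight)  # (B, num_classes) Euclidean distances
        # Smaller Euclidean distance should raise the class score, hence the minus
        return self.alpha * cos - self.beta * euc  # logits for softmax/cross-entropy
```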

Database
The FERPlus dataset is based on the FER2013 dataset, improved by removing nonhuman images from FER2013 and relabeling misclassified images. The MMED dataset is our proposed dataset; we use 80% of its images as the training set and the remaining 20% as the test set. The KDEF dataset, which is used to simulate real-world scenarios, contains seven expressions from seventy subjects at five angles, for a total of 4,900 images; we use it as the test set for the proposed algorithm. When conducting few-shot learning with the 4-shot and 8-shot settings on KDEF, we randomly sampled four and eight images, respectively, from each category (i.e., each expression at each angle) for training, using the remaining images as the test set. We also preprocessed the data before training. The FERPlus dataset is extremely unevenly distributed, with the number of neutral expression images (10,310) about 50 times the number of disgust expression images (191). We balanced the data using augmentation: input samples were cropped from the center and four corners of the image and then flipped horizontally and rotated, yielding ten times more data. All input data were scaled to 48 × 48 pixel grayscale images before being fed into the network.
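The balancing augmentation can be sketched with torchvision as follows: five crops (center plus four corners) and their horizontal flips yield ten variants per image. The intermediate resize value is an illustrative assumption.

```python
# A torchvision sketch of the balancing augmentation: five 48x48 crops (center
# and four corners) plus their horizontal flips give ten variants per image.
import torchvision.transforms as T
import torchvision.transforms.functional as TF

to_crops = T.Compose([
    T.Grayscale(num_output_channels=1),
    T.Resize(56),        # slightly larger than the 48x48 target so crops differ
    T.FiveCrop(48),      # center + four corners -> tuple of five images
])

def augment(img):
    crops = list(to_crops(img))
    flips = [TF.hflip(c) for c in crops]
    return crops + flips  # ten variants per input image
```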

Implementation Details
The main training procedure consists of two stages. To train the ccVGG network, we first trained the pose classifier on the MMED dataset, then fed the blended data through the trained pose classifier and passed its output, along with the original data, into the backbone network for 100 epochs. After a series of comparative experiments, we determined the following hyperparameters. We used the Adam optimizer with a base learning rate of 0.001, decayed by a factor of 0.1 at epochs 40 and 70, and a batch size of 32. During the few-shot learning phase, we trained the cosine classifier on the KDEF dataset with the AdamW [49] optimizer: the base learning rate was 0.001 and was adjusted with cosine annealing, the weight decay was 0.001, and the batch size was 16. Training the model took about 12 h on an NVIDIA Tesla V100 GPU.
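For reference, the two optimizer setups described above can be written as follows; the linear modules are placeholders for the ccVGG backbone and the few-shot classifier head, and the cosine-annealing horizon is an assumption.

```python
# A sketch of the two optimizer setups; the modules are placeholders.
import torch
import torch.nn as nn

model = nn.Linear(2304, 7)       # placeholder for the ccVGG backbone
classifier = nn.Linear(512, 7)   # placeholder for the Meta-Dist head

# Stage 1: ccVGG training (Adam, lr 0.001, decayed by 0.1 at epochs 40 and 70)
opt1 = torch.optim.Adam(model.parameters(), lr=1e-3)
sched1 = torch.optim.lr_scheduler.MultiStepLR(opt1, milestones=[40, 70], gamma=0.1)

# Stage 2: few-shot fine-tuning (AdamW, lr 0.001, cosine annealing, wd 0.001)
opt2 = torch.optim.AdamW(classifier.parameters(), lr=1e-3, weight_decay=1e-3)
sched2 = torch.optim.lr_scheduler.CosineAnnealingLR(opt2, T_max=100)  # T_max assumed
```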

Experiment on the KDEF Dataset
In this experiment, we simulated a real-world application scenario of MFER by using a new dataset (KDEF) as the test set for the proposed algorithm. The KDEF dataset contains rich multiview expression samples, enabling robust validation of the proposed model's performance. First, we designed an experiment to analyze test-set accuracy when training the model with varying proportions of the synthetic metahuman dataset (MMED) and the real-world dataset (FERPlus). We first trained with only FERPlus; then with the full FERPlus plus 1/4, 1/3, 1/2, and all of MMED, respectively; and finally with only MMED. All trained models were validated on KDEF. The results are shown in Table 3. Test accuracy on KDEF was lower when using either FERPlus or MMED alone than when training with a mixture of the two. This highlights the effectiveness of the proposed hybrid-training approach: by increasing the information content of the model and accounting for both the expression diversity and the pose diversity of the datasets, it improves the robustness of MFER in multiview application scenarios. In hybrid training, we mixed all of FERPlus with different proportions of MMED. The best result was achieved using all of FERPlus and all of MMED, with a recognition accuracy of 55.73% on KDEF. This finding confirms the significance of data in deep learning, with larger datasets enhancing the performance of the trained model.
Next, we experimented with few-shot learning on the KDEF dataset (Table 4), comparing the proposed Meta-Dist approach with other well-known few-shot learning algorithms. We conducted experiments with 4-shot and 8-shot settings; before the experiments, we randomly selected the corresponding number of samples for each category in the KDEF dataset and applied data augmentation to the selected samples to expand the support set 20-fold. Baseline++ [47] uses a fine-tuning approach, fixing the feature extraction layers and training only the fully connected classification layer. ProtoNet, [33] MatchingNet, [34] and BSNet [35] all use metric-based approaches to achieve few-shot learning. We observed that the Meta-Dist approach achieved the best results, with a recognition accuracy of 68.17%.
Before training, we set the initial values of alpha and beta to 5. After few-shot training on KDEF, their values were 21.04 and 10.64, respectively, indicating that the network considers cosine distance somewhat more important than Euclidean distance for recognition. Euclidean distance measures the absolute distance between points in space, while cosine similarity measures the angle between two vectors, i.e., the difference in orientation. In expression recognition, the feature space of expression images is relatively small, and the network must learn a function that maps them to an even smaller, compactly distributed feature space. Therefore, the Euclidean measure of absolute distance may be less effective.

Experiment on Multiview Expressions
The purpose of the proposed algorithm in this article is to improve the robustness of expression recognition in multiview application scenarios. Therefore, we designed several experiments to investigate the performance of our proposed algorithm in MFER.
We first investigated the accuracy of expression recognition for the five viewpoints. Four models were trained: ccVGG alone, ccVGG with few-shot learning on KDEF, ccVGG with the mixed dataset, and ccVGG with the mixed dataset plus few-shot learning on KDEF. The MFER accuracy of these models was tested on the KDEF dataset. The results (Table 5) show that the best performance across all five views is achieved by our proposed method (ccVGG + HTFL). Adding few-shot learning to the baseline model improved accuracy on all five views by allowing the model to learn the distribution of the test set. Using the hybrid-training method on the baseline model also improves accuracy, especially on the profile (90°), where the accuracy improvement is most significant. This demonstrates that, although hybrid training cannot access the test-set distribution as few-shot learning does, introducing a new dataset improves the pose robustness of the model and brings its performance up to the level achieved with few-shot learning. When using both few-shot learning and hybrid training, the recognition accuracy on the frontal view showed about a 3% improvement over using few-shot learning alone (81.39% vs. 78.14%), and there was a more significant improvement on nonfrontal face images: approximately 3.4% on the semifrontal view (45°) and approximately 13.8% on the profile view (90°). These results demonstrate the high robustness of our proposed algorithm for MFER.
The confusion matrices of the four models on KDEF (Figure 5) indicate that, without hybrid training and few-shot learning, the recognition accuracy for nonfrontal faces is notably low, and the recognized expressions are concentrated on anger and neutral. The HTFL method greatly improves recognition accuracy for nonfrontal faces, although the accuracy for fear and sadness remains low because their features are less conspicuous.
Our analysis of recognition accuracy also reveals that for nonfrontal faces, except when using the mixed-training method (e.g., 55.37% and 53.70% at 90°), overall accuracy is higher for the right half of the face. Some studies have reported that facial emotions are asymmetrical, or chiral, [50] with the left and right sides of the face appearing sadder or happier, respectively. Taking ccVGG with the FL method as an example, recognition accuracy is 4.62% higher on the right half of the face than on the left half. The confusion matrix (Figure 5) shows that the additional correctly recognized images are concentrated on the disgust expression, which is attributable to the chirality of facial expressions. Samples of disgust expressions in KDEF (Figure 6) show that, for the same expression of the same person, the left-face and right-face images differ, which makes recognition difficult.

Ablation Analysis
To investigate the impact of each module in our proposed algorithm, we conducted several ablation experiments on the KDEF dataset; the results are shown in Tables 6 and 7.
Table 6 presents our study of the improvements to the VGG network. Specifically, we evaluated the original VGG, the structurally improved VGG, and our proposed ccVGG. All three models were trained on the mixed dataset and tested directly on KDEF without few-shot learning. The results indicate that both the proposed structural improvement of the VGG network and the angle-classification module used to construct the conditional cascade network improve recognition accuracy.
Table 7 highlights the impact on recognition of employing the HT and FL approaches on top of ccVGG. Introducing the MMED dataset or utilizing the proposed few-shot learning method Meta-Dist significantly improves recognition accuracy in multiview scenes, by 15.88% and 21.15%, respectively, compared to the baseline model. Comparing the two methods, the HT approach enhances the model's robustness and generalizability in multiview application scenarios, while the FL approach improves the model's performance in specific scenarios. Consequently, to attain optimal performance in real-world application scenarios, we suggest utilizing both HT and FL, which yields the highest recognition accuracy, improved by 28.68% relative to the baseline model.

Experiment on the FERPlus and RAF-DB Dataset
In addition to KDEF, we conducted experiments on in-the-wild expression datasets (FERPlus and RAF-DB) based on the pose-variant expression subsets proposed with the RAN. [10] The authors collected faces with pitch or yaw angles larger than 30° into Pose-FERPlus and Pose-RAF-DB, further distinguishing faces with angles larger than 30° from those larger than 45°. The experimental results can be found in Table 8.
Wild expression datasets such as FERPlus and RAF-DB cover a larger feature space, and the feature-space overlap between similar samples is relatively small, so the improvement brought by few-shot learning is not significant. Therefore, we directly validated the model after HT on the Pose-FERPlus and Pose-RAF-DB datasets to verify the improvement gained by introducing the multiview dataset MMED for nonfrontal expression recognition during the training phase. Because the pretraining stage mixes the FERPlus and MMED datasets, testing on Pose-FERPlus constitutes in-domain recognition, while testing on Pose-RAF-DB constitutes cross-domain recognition. As shown in Table 8, the ccVGG network designed for multiview expression recognition effectively improves recognition accuracy on nonfrontal faces compared with the baseline, and introducing the multiview expression dataset MMED further improves recognition accuracy.

Conclusion
In this article, we propose an MFER method, mFERMeta++, based on MetaHuman and few-shot learning to address the low robustness of existing models for MFER in the real world. The proposed mFERMeta++ exhibits stronger robustness and generalization for multiview facial expressions in real-world scenarios. We first produce a metahuman-based multiview facial expression dataset, MMED, and propose a cascaded conditional VGG that adaptively adjusts the extraction of expression features based on facial pose information. We further propose an HTFL training strategy that mixes MMED with real-world datasets so that the trained model incorporates information on both poses and expressions, and we use the proposed Meta-Dist method for few-shot learning, achieving a best accuracy of 68.17% on the KDEF dataset. On the profile, semifrontal, and frontal views, the accuracies are 54.54%, 75.16%, and 81.39%, respectively, which are 34.65%, 27.77%, and 18.19% higher than the baseline. This work paves the way for intuitive and practical FER system deployment in real scenarios without the limitation of view angles.

Figure 6.
Comparison of the disgust expressions on both sides of the face (the left image shows the left side of the face). [16]

Table 6.
Evaluation of improvements to the original VGG network on KDEF. The bold data represent the best performing results. Here, VGG refers to the structurally improved VGG, relative to the original VGG.

Figure 3.
The combination of the two modules. The output of the pose-classification module is reshaped to (24, 24, 1) through an MLP and then concatenated along the channel dimension.

Algorithm 1.
The training process of the ccVGG.

Input: training set D, training epochs T, iterations per epoch K
Output: pose-classification network PCNet, expression-classification network ECNet
1: train PCNet on the MMED pose labels and freeze its weights
2: for epoch = 1 to T do
3:   for iteration = 1 to K do
4:     sample a batch (x_i, y_i) from D
5:     a_i ← PCNet(x_i)            // predicted pose information
6:     out_i ← ECNet(x_i, a_i)     // pose-conditioned expression prediction
7:     compute loss L(out_i, y_i)
8:     update the weight parameters of ECNet
9:   end for
10: end for
11: return {PCNet, ECNet}

Table 1.
An overview of MFER datasets.

Dataset | Year | Head poses | Expressions | Size | Setting
BU3DFE [13] | 2006 | −90° to +90° (3D) | AN, DI, FE, HA, SA, SU, NE | 2,500 (3D models) | Lab
MultiPIE [14] | 2010 | −90° to +90° in 15° steps | DI, SC, SM, SQ, SU, NE | 755,370 | Lab
RaFD [15] | 2010 | −90°, −45°, 0°, +45°, +90° | AN, DI, FE, HA, SA, SU, NE, CO | 8,040 | Lab
KDEF [16] | 2008 | −90°, −45°, 0°, +45°, +90° | AN, DI, FE, HA, SA, SU, NE | 4,900 | Lab

Table 2.
The number of images for each of the seven expressions of an individual in the MMED.

Table 3.
Comparison of recognition accuracy on KDEF for hybrid training of models. The ratio refers to the proportion of the data in the original dataset that participated in training. The bold data represent the best performing results.

Table 4.
Comparison of recognition accuracy on KDEF of different few-shot learning methods. The bold data represent the best performing results.

Method | 4-shot | 8-shot
Baseline++ [47] | 64.98% | 67.50%
MatchingNet [34] | 57.95% | 59.12%
ProtoNet [33] | 53.83% | 56.51%
BSNet [35] | 62.61% | 65.80%
Meta-Dist | 64.81% | 68.17%

Table 5.
Evaluation of all components of the proposed method on KDEF from various angles. The bold data represent the best performing results.

Table 8.
Experimental results of the model after HT on the Pose-FERPlus and Pose-RAF-DB datasets. The bold data represent the best performing results.

Table 7.
Evaluation of all modules of the proposed algorithm on KDEF. The bold data represent the best performing results.