Evaluating Significant Features in Context-Aware Multimodal Emotion Recognition with XAI Methods

Analysis of human emotions from multimodal data for making critical decisions is an emerging area of research. The evolution of deep learning algorithms has improved the potential for extracting value from multimodal data. However, these algorithms often do not explain how certain outputs are produced from the data. This study focuses on the risks of using black-box deep learning models for critical tasks, such as emotion recognition, and describes why human-understandable interpretations of these models are extremely important. This study utilizes one of the largest multimodal datasets available, CMU-MOSEI. Many researchers have used the pre-extracted features provided by the CMU Multimodal SDK with black-box deep learning models, making it difficult to interpret the contribution of individual features. This study describes the implications of individual features from various modalities (audio, video, text) in Context-Aware Multimodal Emotion Recognition. It describes the process of curating reduced-feature models using the GradientSHAP XAI method. These reduced models, built from the most highly contributing features, achieve performance comparable to their all-feature counterparts.


INTRODUCTION
In recent years, deep neural networks (DNNs) have emerged as an important machine learning tool, achieving performance comparable to humans on many learning tasks. However, deep learning models are inherently black-box, and outputs are often produced with no interpretation or explanation to understand the aspects of the input that influenced the decisions of the model. These systems and decisions can be found in high-risk and critical domains such as health, law and order, automotive, etc. Given the nature of these decisions, it is important for humans to understand the dominant features contributing to the DNN output in a specific context.
Human emotion recognition is an important and ongoing research area. In human emotion recognition application scenarios, various deep and shallow models interpret human emotions to provide services such as controlling appliances. In the literature, extensive research has been done on human emotion recognition 1,2. In these research works, one of the largest datasets evaluated is the CMU-MOSEI dataset 3. This massive dataset contains many real-world, un-staged videos depicting human emotions in a multimodal format comprising audio, video and text modalities. We observed that many studies 4,5,6,7,8,9 use the same pre-extracted features provided by the CMU Multimodal SDK toolkit. However, it is essential that AI be transparent about its reasoning in order to increase trust, clarity, and understanding of these applications. This study aims to determine which features influence the prediction capabilities of the model. Subsequently, we evaluate the effect of reducing the features to a subset of highly contributing features on the performance of the models.

Explainable Artificial Intelligence (XAI)
The success of AI in delivering robust solutions has led to its extensive use in applications such as Emotion Recognition (ER), smart ecosystems, smart learning, finance, security, etc. This can be attributed to the ability of AI to enable improved productivity, better decision making, reduced expenditure, and improved risk management. However, the techniques used for developing these AI solutions, such as deep learning, often do not explain how or why a certain output is obtained. These black-box/opaque models ingest large amounts of high-dimensional feature vectors and output a final result without any human-intelligible interpretation of the internal logic applied in these processes 10. Lacking such auditability in AI systems can prove to be ethically risky and hazardous in real-world applications, impacting the safety of users 11.
To develop and deploy trustworthy AI solutions, carefully balancing the trade-off between the prediction accuracy and explainability of these systems is essential. As seen in Figure 1, the explainability of machine learning models tends to be inversely proportional to their prediction accuracy 12. While deep learning models are known for their ability to achieve high prediction accuracy with minimal need for human intervention, they are accompanied by the curse of being highly opaque and uninterpretable by humans. Explainable AI (XAI) systems help tackle this by helping the user understand the behind-the-scenes processing logic of deep learning systems using simple, interpretable models 13. XAI methods help decode these inexplicable, uninterpretable black boxes into transparent, human-interpretable glass boxes. Interpretable machine learning techniques are grouped into two categories: intrinsic and post-hoc methods 14.
Intrinsic interpretability: Using simple interpretable models which are self-explanatory from their internal structure; examples include decision trees, linear models, etc. 14.
Post-hoc interpretability: Complex black-box models can be interpreted after model training (post hoc) using a model-agnostic surrogate model 14 .
Model-agnostic methods work by changing the input of the machine learning model and measuring the changes in the prediction output 14,15. The surrogate can either be global or local: a global surrogate model approximates the overall prediction behaviour of the black-box model, whereas a local surrogate model explains individual predictions of the black-box model. Figure 2 provides an overview of the methods discussed above.
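To make this idea concrete, below is a minimal sketch of such a model-agnostic, perturbation-style probe: it shuffles one feature at a time and measures how much the predictions change. The `predict_fn` and the toy data are hypothetical placeholders for illustration only, not part of the original study.

```python
# Minimal sketch of a model-agnostic perturbation probe: shuffle one input
# feature at a time and measure how much the model's predictions change.
import numpy as np

def perturbation_importance(predict_fn, X, n_repeats=5, seed=0):
    """Estimate per-feature importance for any black-box predict_fn."""
    rng = np.random.default_rng(seed)
    baseline = predict_fn(X)                      # original predictions
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        deltas = []
        for _ in range(n_repeats):
            X_perturbed = X.copy()
            rng.shuffle(X_perturbed[:, j])        # perturb only feature j
            deltas.append(np.abs(predict_fn(X_perturbed) - baseline).mean())
        importances[j] = np.mean(deltas)          # larger change => more important
    return importances

# Toy example with a linear "model": feature 0 should dominate.
X = np.random.default_rng(1).normal(size=(200, 3))
print(perturbation_importance(lambda x: x @ np.array([3.0, 0.5, 0.0]), X))
```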

Emotion Recognition Overview
Understanding and responding to emotions is an integral part of human communication. Emotions thus play a crucial role in Human-Computer Interaction (HCI). Therefore, extensive research has been conducted to develop intelligent systems capable of recognizing and understanding human emotions as organically and efficiently as possible using Emotion Recognition (ER) algorithms. ER finds its applications both in simple day-to-day systems, including smart mirrors, customer satisfaction, gaming, chat-bots and smart home solutions, as well as in more complex, critical systems such as healthcare, criminal activity detection, mental health monitoring, emotion recognition of drivers for maintaining road safety, etc. Such critical applications of ER can be extremely sensitive to the final results obtained from the ER process. Incorrectly detecting emotions in such scenarios can be extremely hazardous and cause serious repercussions. Hence, meticulously modelling the ER process and making black-box machine learning models explainable to and interpretable by humans is crucial.

Figure 2 Common Explainable AI (XAI) and interpretable Machine Learning (ML) techniques. Adopted from 14,16.

Unimodal v/s Multimodal Emotion Recognition
Recognizing emotions is not a straightforward task. Emotions are naturally perceived by humans using a fusion of various cues such as facial expressions (visual), voice modulation (acoustic) and words spoken (textual). Unimodal ER techniques can prove insubstantial 1 in scenarios like sarcasm: the sarcastic expression of a disappointed, smiling face could be classified as "happy" if the only focus is the visual cue. We hence focus exclusively on using multiple modalities (bimodal and trimodal ER) as opposed to single modalities (unimodal ER) to infer the appropriate emotion from a video, as the multimodal ER technique reflects the nuances of real emotional perception and makes the ER system more robust and reliable 18.

CMU MOSEI Dataset
The pre-extracted acoustic feature set comprises 74 features, such as 12 MFCCs, pitch, maxima dispersion quotients, etc., to portray emotions described by speech tonality, along with 35 visual features describing facial expressions. 300 textual features were extracted using GloVe embeddings.
In our research on the CMU-MOSEI dataset, we came across multiple papers 4,5,6,7,8,9 using the pre-extracted features provided by the CMU-Multimodal SDK 19 .Most of these papers used the same 74 audio and 35 visual features as the baseline paper 3 .

Figure 4
Figure 4 Feature splits used in this study include pre-extracted features acquired from the CMU Multimodal SDK 19 and features extracted from scratch, inspired by 20.

As for the textual modality, we came across a multitude of papers shifting to BERT for extracting textual features rather than using the provided GloVe embeddings 21,22,23,24. BERT, or Bidirectional Encoder Representations from Transformers, is a transformer-based model widely used for extracting high-quality textual embeddings. Conventional word embeddings like GloVe construct a single word vector for each unique word, whereas BERT uses a bidirectional attention mechanism to capture contextual information. Moreover, the pre-trained BERT model proves convenient with its pre-encoded language information, which facilitates quicker development using high-quality features even when less training data is available.
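As a hedged illustration of this step, the sketch below extracts an utterance-level textual embedding with a pre-trained BERT model via the HuggingFace transformers library. The model name and the mean-pooling choice are our assumptions; the paper only states that BERT embeddings were used in place of GloVe.

```python
# Sketch: extracting a 768-dimensional BERT embedding for one utterance.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bert_features(utterance: str) -> torch.Tensor:
    """Return a single 768-dimensional embedding for one utterance."""
    inputs = tokenizer(utterance, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings of the last hidden layer (one common choice).
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

print(bert_features("I really did not expect that ending.").shape)  # torch.Size([768])
```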

Problem with pre-extracted features
The CMU-MOSEI dataset is one of the most popular datasets used for multimodal emotion recognition. It has been heavily referenced, with 129 citations on Scopus, of which 35 papers 25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59 directly utilize CMU-MOSEI features in their application, most of them using deep-learning architectures for their analysis. This existing research on the CMU-MOSEI dataset, however, does not explore the explainability of the CMU-MOSEI features. The pre-extracted features provided by the CMU-MOSEI SDK have not been named or described. For applications directly using CMU features in their applications or devices, it is highly important to understand and interpret these anonymous features: how they were generated, what they truly symbolize, and how they contribute to the final results. There has been prior research to make the pre-extracted features interpretable by using shallow-learning methods 3,60. However, when used with deep-learning models, these features are essentially black-box features that we have no information about and hence cannot be used to comprehend the behavior of the models. We therefore felt the need to devise a way to understand the behavior of these features and the impact of their attributions in our deep-learning model. Using an XAI method to explain and interpret the attribution of features of each modality can improve the understanding of feature significance and help us decide which modality contributes the most to ER. This can ultimately lead to improved robustness and accuracy in detecting the appropriate emotion and consequently minimize the hazards involved in wrongly identifying emotions in scenarios such as criminal investigations, mental health monitoring, etc.

METHODOLOGY
Our experiments used deep learning to recognize emotions from the audio, visual and textual modalities, combining the pre-extracted audio-visual features with the pre-extracted GloVe features as well as newly extracted BERT features. We used an early fusion mechanism to fuse the three modalities, which were then passed through a Bi-LSTM for emotion classification. Inspired by 20, we used a two-layered bidirectional LSTM with a hidden layer of 256 neurons (Figure 5), and trained Bi-LSTM models on the fused feature vectors (877 features in total for the trimodal BERT configuration: 74 audio, 35 visual and 768 BERT textual features).
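A minimal PyTorch sketch of the classifier described above is shown below: early fusion of the audio, visual and textual feature vectors followed by a two-layer bidirectional LSTM with 256 hidden units. Details not stated in the text (the output head, sequence length, batch size) are illustrative assumptions, not the exact training setup.

```python
# Sketch of the early-fusion Bi-LSTM emotion classifier described above.
import torch
import torch.nn as nn

class EarlyFusionBiLSTM(nn.Module):
    def __init__(self, input_dim: int, num_emotions: int = 6, hidden_dim: int = 256):
        super().__init__()
        self.bilstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim,
                              num_layers=2, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_emotions)

    def forward(self, audio, visual, text):
        # Early fusion: concatenate modality features along the feature axis.
        fused = torch.cat([audio, visual, text], dim=-1)   # (batch, seq_len, input_dim)
        output, _ = self.bilstm(fused)
        return self.classifier(output[:, -1, :])            # classify from last time step

# Example: 74 audio + 35 visual + 768 BERT textual features = 877 fused inputs.
model = EarlyFusionBiLSTM(input_dim=74 + 35 + 768)
logits = model(torch.randn(4, 20, 74), torch.randn(4, 20, 35), torch.randn(4, 20, 768))
print(logits.shape)  # torch.Size([4, 6])
```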
Our ultimate aim was to interpret our models using XAI at the feature level to find the importance of each individual feature with respect to its contribution to the model output, also known as primary attribution. We needed to choose a suitable XAI method for our primary attribution analysis from the vast array of existing XAI methods. The authors of 17 suggest the use of suitable explainable AI methods depending on the type of data used, as illustrated in Figure 3. The data consumed by our model is tabular. While counterfactual explanations for a black-box model help understand how the smallest change to the feature values changes the prediction to a predefined output 14, we needed to understand the overall global feature contribution of the model. We hence decided to use the SHAP (SHapley Additive exPlanations) method to interpret our model.

SHAP (SHapley Additive exPlanations)
SHAP is a post-hoc, model-agnostic method that explains individual predictions by assigning each feature an importance value for a particular prediction 61,62. The Shapley value attribution method is inspired by a concept from cooperative game theory. It takes each permutation of the input features and adds them individually to the provided baseline. In this process, the output difference after adding each feature corresponds to the contribution of that feature, and the average of these differences across all permutations determines the attribution. Due to the multiple permutations involved, using a larger number of features makes this method computationally intensive 63.
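For reference, the permutation-and-average process described above corresponds to the classical Shapley value from cooperative game theory; for a feature i, feature set F and model f it can be written as:

```latex
\phi_i(f) = \sum_{S \subseteq F \setminus \{i\}}
            \frac{|S|!\,\bigl(|F| - |S| - 1\bigr)!}{|F|!}
            \left[ f\bigl(S \cup \{i\}\bigr) - f(S) \right]
```

Each term measures how much adding feature i to a coalition S of features changes the model output; the factorial weights average this change over all orderings of the features.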

Figure 6 SHAP (SHapley Additive exPlanations) pipeline for explaining single predictions 62
To interpret the importance of the attribution of these features to the final classification results, we tested our models with the GradientSHAP method provided by the Captum library.

Captum and GradientSHAP
Captum 63 is an open-source XAI library for PyTorch which helps with the explainability and interpretability of various AI models. Captum provides algorithms for evaluating the importance of attributes (features, neurons and layers) and supports multimodal inputs such as text, audio and video 63.
This study focuses on identifying the significant effects of each input modality by examining the attribution value (relevance or contribution) of the input features to the deep neural network. Captum provides two categories of attribution methods: perturbation-based and gradient-based.
1. Perturbation-based Methods: These compute the attribution value for an individual input feature, or a set of input features, by perturbing (removing, masking or altering) them before performing a forward pass 64, and finally calculating the difference between the new and original outputs 64.
2. Gradient-based Methods: These compute the attribution values for all input features in a single forward and backward pass 64. However, unlike the perturbation-based methods, the attributions cannot always be directly related to changes in the output.
GradientSHAP, available in Captum, is a gradient-based algorithm used to compute SHAP values for evaluating the primary attribution of models. In GradientSHAP, Gaussian noise is added to each input sample multiple times, a random point is selected along the path between the baseline and the input, and the gradient is computed at these chosen random points. The algorithm assumes that the input features are independent and that the explanation model is linear between the provided baselines and inputs. It results in attributions approximating SHAP values that denote the expected value of gradients * (inputs - baselines) 63.
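The sketch below shows how such attributions can be computed with Captum's GradientShap class. The tiny feed-forward model, the random data, the zero baseline and the noise/sample settings are placeholders standing in for our trained Bi-LSTM and the fused CMU-MOSEI features; they illustrate the call pattern, not our exact configuration.

```python
# Hedged sketch of computing primary attributions with Captum's GradientShap.
import torch
import torch.nn as nn
from captum.attr import GradientShap

torch.manual_seed(0)
n_features, n_emotions = 109, 6                      # e.g. 74 audio + 35 visual features
model = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_emotions))
model.eval()

inputs = torch.randn(32, n_features)                 # a batch of fused feature vectors
baselines = torch.zeros(1, n_features)               # all-zeros reference baseline

gradient_shap = GradientShap(model)
attributions = gradient_shap.attribute(
    inputs,
    baselines=baselines,
    n_samples=50,        # random points sampled along the baseline-to-input paths
    stdevs=0.09,         # std of the Gaussian noise added to each input sample
    target=3,            # index of the emotion class being explained (e.g. Disgust)
)
print(attributions.shape)  # torch.Size([32, 109]) -- one attribution per feature
```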
We employed GradientSHAP and experimented with various subsets (top 5, top 10, top 15, top 20, top 25) of the features with the highest absolute attribution values. From these subsets, we empirically established that the top 20 features form the optimal subset: it obtains performance close to the all-feature models while remaining inclusive of all modalities. We hence selected the top 20 features with the greatest absolute attribution values (see Figure 7 and Section 4.1 for reference). To gauge and validate the contribution of these top 20 features towards the classification results, we used them to train new models. We then tested these newly trained reduced-feature models and compared their performance with the models trained with all features (Section 4.2).
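Continuing from the previous sketch, the step described above can be realised as follows: rank features by mean absolute GradientSHAP attribution and keep the top 20 columns for retraining. The tensor names carry over from the preceding sketch and are illustrative.

```python
# Hedged sketch: select the top-20 features by mean absolute attribution.
import torch

mean_abs_attr = attributions.abs().mean(dim=0)          # average over the batch
top20_idx = torch.topk(mean_abs_attr, k=20).indices     # indices of the top-20 features

# Slice the feature matrix down to the selected columns before retraining.
reduced_inputs = inputs[:, top20_idx]
print(sorted(top20_idx.tolist()))
```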

RESULTS
The results obtained from these experiments are as follows. The modalities are referred to by their initials to keep the document concise: A for Audio, V for Visual and T for Textual.

Top 20 Features based on GradientSHAP Attribute Importance
Tables 1, 2, 3, 4, 5, 6 and 7 list the top twenty features with the highest absolute attribution scores, along with the percentage contribution of the respective modalities involved and the dominant modality, for 7 different bimodal and trimodal models. As observed from these findings, the dominant modality was Text for the majority of the experiments, followed by Video (Visual), while Audio was the least dominant modality when considering the top 20 features. Please see Figures 9-15 for a visual illustration of the contribution of the various modalities in terms of the top 20 features.

Discussing the Results obtained by All Feature Models v/s Top 20 Feature Models
We evaluate the models based on the metrics used by our baseline paper 3: F1 score and weighted accuracy. The weighted accuracy metric (Figure 8) is used to avoid discrepancies caused by the class imbalance in the CMU-MOSEI dataset 3. In the weighted accuracy formula, N is the total number of negative labels, P is the total number of positive labels, TP represents true positives and TN represents true negatives.
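For reference, with these quantities the weighted accuracy adopted by the CMU-MOSEI baseline is commonly stated as follows (our rendering of the formula referenced above, not reproduced from the original figure):

```latex
\text{Weighted Accuracy} = \frac{TP \times \frac{N}{P} + TN}{2N}
```

Scaling TP by N/P gives the positive class the same influence as the negative class, so a majority-class predictor cannot inflate the score.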

F1 Scores
We compare the F1 scores obtained by the reduced-feature models with the top 20 features (Table 9) against the models trained with all features (Table 8). The F1 score results for the Bimodal (A,V) analysis show that the top 20 feature model outperformed the all-feature model for 3 emotions (Disgust, Happy and Surprise) while being very close for the remaining three emotions (Sad, Angry and Fear).
Comparing all versus top 20 features for the GloVe models, we observe that for the Bimodal (T,A) analysis we achieved the same F1 scores for 4 out of 6 emotions; the exceptions were Sad and Happy, for which the results obtained using the top 20 features were still very close to those obtained using all features.
For Bimodal (T,V), the F1 score performance of the top 20 feature model was the same for 2 (Fear, Surprise) out of six emotions, slightly worse for 3 (Sad, Angry, Disgust) and slightly better for the Happy class compared to the all-feature model.

Weighted Accuracies
We compare the weighted accuracies obtained by the reduced-feature models with the top 20 features (Table 11) against the models trained with all features (Table 10). The weighted accuracies obtained in the Bimodal (A,V) analysis show that the top 20 feature model outperformed the all-feature model for 2 emotions (Disgust and Happy), tied for Fear and Surprise, and was very close for the remaining two emotions, Sad and Angry.
Comparing all versus top 20 features for the GloVe models, we observe that for the Bimodal (T,A) analysis we achieved the same weighted accuracies for 4 out of 6 emotions; for the exceptions, Sad and Happy, the results obtained using the top 20 features were slightly worse than (but still very close to) the all-feature model. For Bimodal (T,V), the weighted accuracies of the top 20 feature model were the same for 2 (Fear and Surprise) out of six emotions, slightly worse for 3 (Sad, Angry, Disgust) and slightly better for the Happy class compared to the all-feature model. For the Trimodal analysis, the all-feature model outperformed the top twenty feature model for 4 (Sad, Angry, Happy and Surprise) out of 6 emotion classes while giving exactly the same scores for Fear and Disgust. The trends in weighted accuracies are identical to those observed in the F1 score comparison for the Bimodal (T,A), Bimodal (T,V) and trimodal models.

On comparing the weighted accuracies obtained by the top 20 feature models with the baseline GraphMFN model, which uses all 409 features 3, we observed that three of our reduced-feature models, BERT Bimodal (T,A), BERT Bimodal (T,V) and BERT Trimodal, were able to outperform the baseline for the Disgust emotion class. Four of our top twenty feature models, BERT Bimodal (T,A), Bimodal (T,V), BERT Trimodal and Bimodal (A,V), were able to outperform the baseline for the Happy emotion class. Overall, our reduced-feature models obtained better weighted accuracies than the baseline for 2 out of 6 emotions.

Overall, the top twenty feature models achieved results comparable to those of the all-feature models. In some scenarios, especially for the Angry, Happy and Disgust classes, the top twenty feature model even achieved better performance than our all-feature models. They also outperformed the baseline GraphMFN model 3 for 4 out of 6 emotion classes in terms of F1 scores and 2 out of 6 emotion classes in terms of weighted accuracies.

CONCLUSION
This study throws light on the hazards posed by using black-box AI/deep-learning models for critical tasks in trustworthy systems such as Emotion Recognition and explains the importance of making these models explainable and interpretable to humans. It focuses on interpreting the importance of individual features from various modalities (audio, video, text) in Context-Aware Multimodal Emotion Recognition. In the process, we highlight the problems of using pre-extracted anonymous features and employ a relevant XAI method, GradientSHAP, for evaluating these features.
The XAI method implementation leads to finding a subset consisting of the top twenty features for the various models described in Section 4.1. We then compare the performance of these lighter, reduced-feature models, in terms of F1 scores and weighted accuracies, with their corresponding all-feature models as well as the baseline GraphMFN model 3. The results show that these smaller models, with the advantage of being lighter to train and test, achieve results comparable to their all-feature counterparts and even outperform some of them. They also outperform the baseline model for 4 out of 6 emotion classes in terms of F1 scores and 2 out of 6 emotion classes in terms of weighted accuracies.
One of the limitations of this study is the use of a post-hoc XAI method for interpretability. Even though model-agnostic (post-hoc) methods allow interpreting complex machine learning models without understanding their underlying mechanism, they face the challenge of striking a balance between flexibility and interpretability 65.
The success of these reduced-feature models suggests that employing XAI methods to interpret black-box deep-learning models can help us carefully select high-quality, highly contributing features that can help curate trustworthy AI systems. It is hoped that this research will contribute to a deeper understanding of evaluating significant features using explainable methods (rather than blindly using all pre-extracted features) and unlock their potential to improve the performance and robustness of such systems.

ACKNOWLEDGMENTS
Financial disclosure
• The authors did not receive support from any organization for the submitted work.
• No funding was received to assist with the preparation of this manuscript.
• No funding was received for conducting this study.
• No funds, grants, or other support was received.

Conflict of interest
• The authors have no relevant financial or non-financial interests to disclose.
• The authors have no competing interests to declare that are relevant to the content of this article.
• All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.
• The authors have no financial or proprietary interests in any material discussed in this article.

Figure 1
Figure 1 Explainability v/s Prediction Accuracy of ML Models 12

Figure 3
Figure 3 Flowchart to determine which explainable methods to use, adopted from 17. The highlighted path describes the reason for selecting SHAP as our method for this study.

Figure 7
Figure 7 GradientSHAP Attribute Values for the features of the emotion Disgust from the Bimodal (A,V) model, sorted from highest to lowest (refer to Table 1)

Figure 9
Figure 9 Bimodal Audio and Video Contribution(s) Based on Gradient SHAP Values for Top 20 Attributes

Figure 10
Figure 10 Bimodal Text (BERT) and Audio Contribution(s) Based on Gradient SHAP Values for Top 20 Attributes

Figure 11 Figure 12
Figure 11 Bimodal Text (BERT) and Video Contribution(s) Based on Gradient SHAP Values for Top 20 Attributes

Figure 13
Figure 13 Bimodal Text (GloVe) and Audio Contribution(s) Based on Gradient SHAP Values for Top 20 Attributes

Figure 14
Figure 14 Bimodal Text (GloVe) and Video Contribution(s) Based on Gradient SHAP Values for Top 20 Attributes

Figure 15
Figure 15 Trimodal Audio, Video and Text (GloVe) Contribution(s) Based on Gradient SHAP Values for Top 20 Attributes