Text-based emotion detection: Advances, challenges, and opportunities

Emotion detection (ED) is a branch of sentiment analysis that deals with the extraction and analysis of emotions. The evolution of Web 2.0 has put text mining and analysis at the frontiers of organizational success, helping service providers deliver tailor-made services to their customers. Numerous studies are being carried out in the area of text mining and analysis due to the ease of sourcing data and the vast benefits its deliverables offer. This article surveys the concept of ED from texts and highlights the main approaches adopted by researchers in the design of text-based ED systems. The article further discusses some recent state-of-the-art proposals in the field. The proposals are discussed in relation to their major contributions, approaches employed, datasets used, results obtained, strengths, and weaknesses. Also, emotion-labeled data sources are presented to provide neophytes with eligible text datasets for ED. Finally, the article presents some open issues and future research directions for text-based ED.

1. Discrete emotion models (DEMs): The discrete model of emotions places emotions into distinct classes or categories. Prominent among these models are: • The Paul Ekman model, 14 which distinguishes emotions based on six basic categories. The theory asserts that there exist six fundamental emotions that originate from separate neural systems [15][16][17] as a result of how an experiencer perceives a situation; thus, emotions are independent. These fundamental emotions are happiness, sadness, anger, disgust, surprise, and fear. However, combinations of these emotions can produce other complex emotions such as guilt, shame, pride, lust, greed, and so on.
• The Robert Plutchik model, 18 which, like Ekman's, postulates that there exist a few primary emotions that occur in opposite pairs and produce complex emotions through their combinations. He named eight such fundamental emotions, adding acceptance/trust and anticipation to the six primary emotions posited by Ekman. The eight emotions in opposite pairs are joy vs sadness, trust vs disgust, anger vs fear, and surprise vs anticipation. According to Plutchik, each emotion occurs at varying degrees of intensity, depending on how events are construed by an experiencer.
• The Ortony, Clore, and Collins (OCC) model 19 dissented from the notion of "basic emotions" as presented by Ekman and Plutchik. Its authors agreed, however, that emotions arise from how individuals perceive events and that emotions vary in their degree of intensity. They discretized emotions into 22 classes, adding 16 emotions to those Ekman posited as basic and thus spanning a much wider representation of emotions, with the additional classes of relief, envy, reproach, self-reproach, appreciation, shame, pity, disappointment, admiration, hope, fears-confirmed, grief, gratification, gloating, like, and dislike.
Any of the DEMs mentioned can be used to represent emotions when designing ED systems, depending on the researcher's preference. However, the OCC model generally provides a broader emotion representation scope due to its greater number of classes.
2. Dimensional emotion models (DiEMs): The dimensional model presupposes that emotions are not independent and that relations exist between them, hence the need to place them in a spatial arrangement. 13,20 Thus, dimensional models position emotions in a dimensional space (unidimensional, ie, 1-D, or multidimensional, ie, 2-D and 3-D), depicting how related emotions are and usually reflecting the two fundamental behavioral states of good and bad. 21 Both unidimensional and multidimensional DiEMs are affected by the relative degree (low to high) of an emotion's occurrence. Unidimensional models are rarely used, but their fundamental idea permeates most multidimensional models; this article therefore elaborates on multidimensional models for representing emotions. Of the emotion models, the discrete models have been widely adopted for emotion classification work 24 due to their simplicity, although they do not capture as wide a range of emotion classes and intensities/degrees of occurrence as the dimensional models. The dimensional models, however, are highly recommended in projects that represent similarities between emotions.
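To make the dimensional idea concrete, the sketch below places a few emotions at illustrative coordinates in a 2-D valence-arousal space (the coordinates are invented for this example, not taken from any published mapping) and maps an arbitrary point to its nearest labeled emotion:

```python
import math

# Emotions as points in a 2-D valence-arousal space.
# Coordinates are illustrative only, not from a published model.
VA_SPACE = {
    "joy":     (0.8, 0.5),
    "anger":   (-0.6, 0.8),
    "sadness": (-0.7, -0.4),
    "calm":    (0.4, -0.6),
}

def closest_emotion(valence, arousal):
    """Map a (valence, arousal) point to its nearest labeled emotion."""
    return min(
        VA_SPACE,
        key=lambda e: math.dist((valence, arousal), VA_SPACE[e]),
    )

print(closest_emotion(0.7, 0.4))  # -> joy
```

The distance-based lookup reflects the dimensional premise above: emotions close to each other in the space are treated as similar, rather than as independent categories.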

ED from text
The emotions that prompt individuals to pen certain words at particular times are what text-based ED is concerned with. Motivated by the works of References 25-27, which suggested that text-based ED research has received little attention in comparison with other modes of ED, a search was carried out on two different libraries (ie, IEEE Xplore and Scopus) using the following keywords: "ED" and "ED from texts." It was carried out to determine the frequency of research outputs involving ED in comparison to research outputs involving text-based ED. The search was limited to the year range 2010 to 2020. The algorithm used as well as the results obtained are presented in this section to highlight the paucity of text-based emotion research studies hypothesized in the aforementioned works. For simplicity and convenience, we use ED1, ED2, ED3 up to ED11 to represent ED search results for the years 2010, 2011, 2012 up to 2020, respectively, and TB1, TB2, TB3 up to TB11 to represent the corresponding text-based ED search results.
The results from the search showed that out of a total of 1810 results for the search phrase "ED" on IEEE Xplore over the entire year range, 202 were focused on "ED from texts." Similarly, the Scopus database returned 593 "ED from texts" results out of a total of 5481 for "ED." The graphs showing the distribution over the 2010 to 2020 period are shown in Figures 3 and 4.
The results indicate that multimodal forms of ED, including speech, body language, facial expressions, and so on, are worked on far more frequently than their text-based counterpart. The paucity has arisen mainly because, unlike multimodal methods, texts may not portray clear cues to emotions, making the detection of emotions from texts more difficult than with other methods. The difficulties associated with extracting emotions from grammatically erroneous texts, short messages, sarcasm in written documents, contextual information, and so on, compound the problem. In addition, inadequate knowledge of effective text extraction methods for the field, owing to its early stage of development, presents a major hurdle in detecting emotions efficiently from written texts.
Kao et al 28 further express the problem of text-based ED mathematically as finding the relation r(A, T), where T is the text from which emotions are to be determined, A is the author who pens the text T, and r is the relationship between the author and their written text that expresses emotion. They state that even though the problem may initially look straightforward, the task of determining the appropriate relation under which an author can be significantly associated with their written texts, in order to determine their emotions, can be arduous. All these present distinct challenges to the field of text-based ED. Notwithstanding the challenges, the field has contributed massively to improved human-computer interaction. Contributions include the detection of, and offering of timely assistance to, individuals who may be considered suicidal, 7,8 detecting insulting sentences in conversations, 10 chatbots for psychiatric counseling, 29 and so on, although these are still at an early stage.

Datasets for text-based ED research
After deciding on the model to represent emotions, the next crucial step in detecting emotions from text is the acquisition of data relevant to the task. There are a few structured annotated datasets for ED publicly available for research purposes. This section identifies major publicly available datasets and their features. The datasets, their features, the emotion models they represent, and downloadable links are provided in Table 1.

ISEAR Dataset
The International Survey on Emotion Antecedents and Reactions (ISEAR) database 21 was constructed by the Swiss National Centre of Competence in Research, led by Wallbott and Scherer. It consists of seven emotion labels (joy, sadness, fear, anger, guilt, disgust, and shame) obtained by gathering data from cross-cultural questionnaire studies in 37 countries. Three thousand participants from varying cultural backgrounds completed questionnaires about their experiences of, and reactions to, events. The final dataset contains a total of 7665 sentences labeled with emotions.

SemEval
The Semantic Evaluations (SemEval) database consists of Arabic and English news headlines extracted from news websites such as BBC, CNN, Google News, and other major newspapers. The dataset contains 1250 instances in total. The data in this database are rich in emotional content for emotion extraction and are labeled using the six emotional categories (ie, joy, sadness, fear, surprise, anger, and disgust) presented by Ekman.

EMOBANK
This dataset consists of over 10 000 sentences annotated dimensionally in accordance with the Valence-Arousal-Dominance (VAD) emotion representation model. These sentences were obtained from news headlines, essays, blogs, newspapers, fiction, letters, and travel guides of writers and readers, thus spanning a wide domain. A subset of the dataset has also been annotated categorically using Ekman's basic emotion model, making it suitable for dual representational designs.

WASSA-2017 Emotion Intensities (EmoInt) Data
The Workshop on Computational Approaches to Subjectivity, Sentiment, and Social Media Analysis (WASSA-2017) data was constructed to detect emotion intensities in tweets. It is annotated for four discrete emotions, including joy, anger, fear, and sadness. Segregated into both training and testing sets, the data were used for the WASSA-2017 shared task on ED 30 to aid in the understanding of the level of emotion intensities portrayed in human languages.

Cecilia Ovesdotter Alm's Affect data
The dataset was constructed from tales and annotated categorically for fear, anger, disgust, sadness, happiness, and surprise. It also contains other annotations, such as feeler, intensity, and lists of emotion words, intended to assist emotion annotation. 31

Daily Dialog
This dataset was built by crawling dialogues from regular human conversations. It contains a total of 13 118 sentences annotated with the discrete emotion labels neutral, anger, disgust, fear, happiness, sadness, and surprise. 32

Aman's Emotion dataset
The dataset identifies the presence of emotions in blogposts. It contains 1466 emotion-labeled sentences. 33 These sentences have been classified into happiness, sadness, disgust, anger, fear, surprise, mixed emotion, and no emotion. The mixed emotion category defines those sentences that express two or more emotions at the same time or which cannot be aligned to a particular emotion. No emotion defines sentences that portray neutral emotions.

Grounded Emotion data
Grounded emotion data explores the effects of five external factors on the emotion state of tweeters. 34 A total of 2557 tweets were annotated for happy and sad categorical emotions with 1525 happy tweets and 1032 sad tweets.

Emotion-Stimulus data
The data are built from sentences containing emotions as well as factors that cause emotions (emotion stimulus). 35 It contains 1594 emotion-labeled sentences and 820 sentences with both cause/stimulus and emotions. The emotion-labeled sentences are annotated in compliance with Ekman's basic emotion categories plus shame.

MELD dataset
The Multimodal Multi-Party Dataset for Emotion Recognition in Conversation (MELD) 36 is a multimodal dataset covering the audio, video, and text modalities. It contains over 1400 dialogues and 13 000 utterances from the Friends television show, with utterances in dialogues labeled categorically as anger, disgust, sadness, joy, surprise, fear, and neutral.

EmotionLines
The EmotionLines dataset contains data collected from both the Friends television sitcom series and Facebook Messenger chats. The data comprise a total of 2000 dialogues and 29 245 utterances labeled according to Ekman's basic categorical emotion classes plus a neutral class.

Smile dataset
This dataset contains discrete emotion annotations of anger, disgust, happiness, surprise, and sadness for a total of 3085 tweets gathered from 13 Twitter handles affiliated with the British Museum. 37 Data for emotion analysis can also be obtained from other social media Application Programming Interfaces (APIs), such as those of Facebook, Google, and Twitter, and from blog posts, as in the works by Ramalingam et al 38 and Strapparava and Mihalcea, 12 in which the Twitter GRAPH API was used to extract tweets and blog posts were mined to detect emotions, respectively.
It is worth stating that text data acquired directly through blog posts and other social media platforms such as Twitter, Facebook, and so on, are unstructured and need to be structured and annotated before use. Klinger 39 outlined various methods for structuring and annotating such data.
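As a minimal illustration of such structuring, the sketch below cleans two hypothetical posts (the posts, labels, and file name are invented for the example; real labeling would be a human annotation pass) and writes them to an annotated CSV file:

```python
import csv
import re

def clean(post):
    """Minimal structuring of a raw social media post: strip URLs,
    @-mentions, and extra whitespace before annotation."""
    post = re.sub(r"https?://\S+", "", post)
    post = re.sub(r"@\w+", "", post)
    return re.sub(r"\s+", " ", post).strip()

raw_posts = [
    "So proud of my team today!! https://t.co/xyz",
    "@friend this traffic is making me furious",
]
# Manually assigned labels stand in for a human annotation pass.
labels = ["joy", "anger"]

with open("annotated.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "label"])
    for post, label in zip(raw_posts, labels):
        writer.writerow([clean(post), label])
```

The result is a small structured, labeled table of the kind the datasets above provide at scale.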

Detection approaches
This section highlights the rule construction, machine learning (ML), and hybrid approaches as the general approaches to detecting emotions from texts. It further reveals significant drawbacks and strengths associated with each approach.

Rule construction approach
The rule-based approach outlines major grammatical and logical rules to follow in order to detect emotions from documents. Rules for a few documents may be easily created; however, with large numbers of documents, complexities arise. The rule construction approach encompasses the keyword recognition (KR) and lexical affinity methods. The KR method deals with the construction or use of emotion dictionaries or lexicons. There are numerous KR dictionaries; notable among them are WordNet-Affect, 44 EmoSenticNet, 45 DepecheMood, 46 SentiWordNet, 47 and the National Research Council of Canada (NRC) lexicon. 48 These emotion lexicons contain emotion search words or keywords such as happy, hate, angry, sad, surprise, and so on. The task is to find occurrences of these search words in a written text at the sentence level. Once a keyword is identified within a sentence, a label is assigned to the sentence. For instance, if the constructed emotion dictionary contains joy and the written text from which emotion is to be determined reads "I was filled with so much joy on seeing my Father for the first time in ten years," the sentence is labeled with the keyword joy. Though simple and straightforward, this approach faces challenges, including the need for an emotion dictionary to contain a reasonable number of emotion categories, since limited keywords can greatly degrade performance, along with keyword ambiguity and the lack of linguistic information.
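The KR method can be sketched in a few lines; the dictionary below is a toy stand-in for resources such as WordNet-Affect or the NRC lexicon:

```python
# A toy emotion keyword dictionary; real systems would use resources
# such as WordNet-Affect or the NRC lexicon instead.
EMOTION_KEYWORDS = {
    "joy": {"joy", "happy", "delighted"},
    "sadness": {"sad", "sorrow", "miserable"},
    "anger": {"angry", "furious", "hate"},
}

def keyword_label(sentence):
    """Label a sentence with the first emotion whose keyword occurs in it."""
    tokens = set(sentence.lower().split())
    for emotion, keywords in EMOTION_KEYWORDS.items():
        if tokens & keywords:
            return emotion
    return "no emotion"

print(keyword_label(
    "I was filled with so much joy on seeing my Father "
    "for the first time in ten years"
))  # -> joy
```

The limitations noted above are visible even here: any sentence whose emotion words fall outside the dictionary gets "no emotion," and ambiguous keywords are labeled without regard to context.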
The lexical affinity (LA) method augments the KR method: aside from the identification of keywords, arbitrary emotion-bearing words are assigned a probabilistic affinity. The LA method is responsible for this second stage of assigning probabilistic affinities. The word "good," for instance, may be assigned a probabilistic affinity of "positive," "angry" may be assigned a "negative" affinity, and so on. The drawback of this probabilistic affinity assignment is that it does not fully represent the various categories of emotions but rather reduces them to two extreme states (ie, "positive" or "negative"). The approach can also misclassify emotions depending on the context of the assigned words. For instance, the sentence "Meeting him should have been a good idea" may express the author's disappointment at meeting someone and should be classified as a "negative" emotion, but it will generally pass for a positive emotion because the word "good" has already been assigned a probabilistic affinity of "positive." These drawbacks often necessitate other approaches for detecting emotions in texts.
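The LA idea, and its context blindness, can be sketched as follows; the affinity values are illustrative, not taken from any published lexicon:

```python
# Probabilistic polarity affinities for individual words
# (illustrative values only).
AFFINITY = {
    "good": ("positive", 0.9),
    "great": ("positive", 0.95),
    "angry": ("negative", 0.85),
    "terrible": ("negative", 0.9),
}

def sentence_polarity(sentence):
    """Aggregate word-level affinities into a sentence polarity."""
    score = 0.0
    for word in sentence.lower().split():
        if word in AFFINITY:
            polarity, p = AFFINITY[word]
            score += p if polarity == "positive" else -p
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

# Misclassified: the author expresses disappointment, but "good"
# carries a fixed positive affinity regardless of context.
print(sentence_polarity("Meeting him should have been a good idea"))  # -> positive
```

The example sentence from the text is labeled "positive" even though it plausibly expresses disappointment, demonstrating the context problem described above.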

The ML approach
The ML approach solves the ED problem by classifying texts into various emotion categories through the implementation of ML algorithms. The detection is often carried out using a supervised or an unsupervised ML technique. The work in Reference 24 showed that supervised ML algorithms have been widely implemented in text-based ED problems and have offered comparatively better detection rates than unsupervised ML techniques. However, throughout this survey, we observed that the widely explored techniques have been not only supervised but also traditional. These traditional techniques, such as the support vector machine (SVM), naive Bayes (NB), and conditional random fields (CRF), as in the works in References 49-51, are not as robust and do not explicitly extract the semantic information relevant for effectively detecting emotions in texts. Recently, supervised deep learning models are being adopted as ML approaches to detect emotions from texts. The argument has been that deep learning techniques are more robust and that their deep layers can extract the intrinsic/hidden details texts may carry. 52 The implementation of these techniques in text-based ED problems has been seen to outperform traditional ML techniques. 52,53 Some works implementing these techniques and breaking records in text-based ED are discussed in later sections.
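To illustrate the supervised setting, the following is a from-scratch multinomial naive Bayes sketch trained on an invented four-sentence corpus (real systems would train on the datasets in Table 1):

```python
import math
from collections import Counter, defaultdict

def train_nb(corpus):
    """Train a multinomial naive Bayes model with add-one smoothing."""
    class_counts = Counter(label for _, label in corpus)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in corpus:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the most probable emotion label for a text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_logp = None, -math.inf
    for label, count in class_counts.items():
        logp = math.log(count / total_docs)  # class prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            logp += math.log((word_counts[label][token] + 1) / denom)
        if logp > best_logp:
            best, best_logp = label, logp
    return best

# A tiny invented training corpus.
corpus = [
    ("i am so happy and delighted", "joy"),
    ("what a wonderful happy day", "joy"),
    ("i feel sad and lonely", "sadness"),
    ("this loss makes me sad", "sadness"),
]
model = train_nb(corpus)
print(predict_nb(model, "a happy wonderful moment"))  # -> joy
```

This bag-of-words formulation also shows why such traditional techniques struggle with semantics: word order and context play no role in the prediction.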

Hybrid approach
The hybrid approach combines the rule-construction and ML approaches into a unified model. Drawing from the strengths of both approaches while concealing their associated limitations, this approach has a higher probability of outperforming either of the other two approaches individually. However, in conducting this survey, it was identified that most systems employing this approach implemented a rule engine together with a traditional ML technique. 24
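A hybrid pipeline can be sketched as a rule engine backed by a statistical fallback; the rules, cue words, and scorer below are illustrative stand-ins, not a published system:

```python
# Rule engine: unambiguous keyword-to-emotion mappings (illustrative).
RULES = {"furious": "anger", "delighted": "joy"}

def statistical_fallback(text):
    """Placeholder for a trained ML classifier; here, a trivial scorer
    counting hypothetical learned cue words."""
    cues = {"joy": {"happy", "great"}, "sadness": {"sad", "cry"}}
    scores = {label: sum(w in text.lower().split() for w in words)
              for label, words in cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "no emotion"

def hybrid_label(text):
    """Apply rules first; fall back to the statistical model otherwise."""
    for keyword, emotion in RULES.items():
        if keyword in text.lower().split():
            return emotion
    return statistical_fallback(text)

print(hybrid_label("I was delighted by the news"))  # -> joy
print(hybrid_label("feeling happy today"))          # -> joy
```

The rule stage handles high-precision cases cheaply, while the learned stage covers texts the rules miss, which is the division of labor most hybrid systems in the survey adopt.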

Current state-of-the-art text-based proposals
The reviews in this section are carried out by highlighting the following: the approach employed, the dataset(s) used for the study, the major contributions, and possible limitations of the existing techniques. Canales and Martínez-Barco 24 in 2014 reviewed state-of-the-art text-based ED methods up to that year. In view of that, this article continues the review of state-of-the-art works from 2015 to 2020. Badugu and Suhasini's 59 work used the rule-centered approach to detect emotions expressed by people based on their tweets. They obtained data from the Sentiment140 dataset, 60 which serves as a repository for 1 048 576 tweets, and applied preprocessing methods such as tokenization, noise elimination, and text normalization, after which the normalized texts were tagged with their various parts of speech (POS). To determine which tagged texts carried emotions, the texts were juxtaposed with emotions on Russell's Circumplex Model of Affect 22 in a knowledge extraction process. The emotion-carrying texts were then checked for negations, and features were selected according to their frequency of occurrence. The SentiWordNet lexicon 47 was used to validate the refined data for classification. When a particular text passed the validation check, it was assigned a score from which a sentiment (high positive, low positive, high negative, or low negative) was assigned. If the text did not pass the validation check, it was discarded. All texts assigned sentiments were then classified into one of four emotion classes: "happy-active," "happy-inactive," "unhappy-active," and "unhappy-inactive." Although their model attained an accuracy of ≈85% and was reported to outperform baseline methods, it considered only the "not" form of negation and disregarded other forms such as "never," even though these were contained in some tweets.
They also classified a large volume of tweets into a restricted set of four classes; this implies that most tweets would not be classified into their specific emotion categories but into the nearest available classes, which may result in inaccurate classifications. Furthermore, emojis, which have been found to convey rich emotional content, 61 were ignored, as their model focused only on text classification.
Almanie et al 62 developed a real-time web application that displayed emotions expressed by people living in Saudi Arabia through the analysis of their tweets. They manually created a dataset consisting of over 4000 Arabic words, including emojis, by consolidating data from three sources: emotional survey responses from people in Saudi Arabia with different dialects, the Almaany Arabic Lexicon, and trending hashtags on Twitter. The data were preprocessed and classified into one of happy, sad, scared, surprised, angry, or neutral, the last covering cases where words contained no emotion. To implement the real-time application, classified emotions were stored in a database together with their geographical positions, and sets of concurrently running queries were used to determine how frequently a specific emotion was expressed by people residing in a particular city. The most frequent emotion was generalized as the emotion expressed by the city. These emotions were visualized on a map by color-coding the various cities in Saudi Arabia. Although they developed a real-time web application that surveyed tweets and trending hashtags on Twitter, they provided no details of the methodology applied to classify emotions. They also mentioned that other emotion categories would be included in future work to enhance the generalization ability of their system. Rabeya et al 63 observed that research studies involving the detection of emotions from English texts were becoming far more numerous than those involving Bengali texts and implemented a system to detect the two basic emotions of happy and sad in Bengali. Their work initially identifies the sentiment, whether negative or positive, and then the specific emotion associated with it. They accomplished this by tokenizing 301 Bengali sentences collected from a survey and, using a hash algorithm, generating hash values for each tokenized sentence.
The hash values were then compared with the hashed values of sentences in a word lexicon containing 350 words. If a value matched, it was stored in a database and an input expression generated. Using a backtracking approach, sentiments, whether positive or negative, were detected. After detecting sentiments, they applied a rule engine to detect the emotions happy and sad and obtained an accuracy of 77.16%. However, they mentioned that, although their input data were already relatively small, some of the words they contained could not be found in the lexicon they used and had to be removed, further reducing the amount of data and greatly affecting the performance of their system. Also, their system detected only two basic emotions; other emotion categories were sidelined, and as such the generalization ability of the system is questionable.
Kušen et al 64 compared the performance of three word-emotion lexicons in a bid to determine the best lexicon for identifying emotions in social media texts with a rule-centered approach. Using Facebook and Twitter data, they compared the performance of the National Research Council of Canada (NRC), 48 EmoSenticNet, 45 and DepecheMood 46 word-emotion lexicons using various NLP techniques, even though the word-emotion pairs in these lexicons differed. To obtain the ground truth, they used the ISEAR dataset 21 and the results of a survey they conducted in which individuals assigned emotions to Facebook and Twitter posts. Comparing the ground-truth scores with the scores provided by each lexicon separately showed classification performance in the order NRC, DepecheMood, EmoSenticNet, with NRC performing best at detecting anger, fear, and joy and DepecheMood performing best at detecting sadness. They stated that inadequate word coverage in all the lexicons affected the general performance of their model.
Seal et al's 65 work was based on detecting emotions using semantic rules and emotion keywords, with particular attention to phrasal verbs. They collected data from the ISEAR database, preprocessed the data, and analyzed them to detect phrasal verbs. They found some phrasal verbs that could be aligned to emotion words but were not. They therefore created a list of phrasal verbs and constructed a database containing the phrasal verbs and their synonyms. Using the WordNet emotion lexicon, the Power Thesaurus, and their constructed phrasal verb database, they identified keywords and phrasal verbs associated with specific emotions and classified them accordingly. They obtained an accuracy of 65% but highlighted that their method did not resolve problems present in existing systems, such as the insufficient list of emotion keywords and the disregard for word semantics based on context.
Given the flaws in the keyword/rule approach cited in References 54, 59, and 63, and their rippling effects on systems developed for detecting emotions from texts, it became crucial that new detection methods be investigated. To this end, Mozafari and Tahayori 66 presented a similarity technique based on the vector similarity measure (VSM) 67 and the STASIS 68 method. They compared the performance of their methods with the keyword method and concluded that their proposed method outperformed it under standard conditions. Their work initially implements STASIS to extract semantic relationships from texts and then determines the similarity of texts using the VSM method. They tested these methods on the ISEAR dataset 21 and found that the VSM method outperformed the STASIS and keyword methods in detecting emotions in the joy, sadness, anger, fear, shame, disgust, and guilt categories.
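Neither the exact VSM of Reference 67 nor STASIS is reproduced here, but the core idea of a vector similarity measure can be illustrated with a simple bag-of-words cosine similarity:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity, a simple vector similarity measure."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0

print(round(cosine_similarity("i am so happy", "i am very happy"), 2))  # -> 0.75
```

A similarity-based detector would compare an input text against exemplar sentences for each emotion class and pick the class with the highest similarity, avoiding the brittle exact-match step of the keyword method.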
Lee et al 69 shared the view of Reference 63 concerning the field's research bias toward English texts. They also believed that multilingual texts contain as many emotions as monolingual texts. Their work proposed the detection of emotions from monolingual (English or Chinese) and bilingual (mixed English and Chinese) texts using multiview learning, a semi-supervised ML approach. After obtaining 4195 posts from Weibo, a popular Chinese blogging site, and annotating them (with agreement measured by Cohen's kappa coefficient), 2319 posts were observed to contain emotion words. Using code-switching text identification, they separated Chinese texts from English texts and from texts written in both languages, creating three views: Chinese, English, and bilingual. In the bilingual view, texts written in English were translated into Chinese using the word-by-word statistical machine translation method, 70 and all views were fed into a semi-supervised co-training algorithm for prediction into five categories, namely, happiness, sadness, fear, anger, and surprise. Their results indicated an F1 measure of 0.486 and improved performance in comparison with baseline methods. However, the relatively small amount of data used hindered the performance of their proposal. Also, translating texts from one language to the other using the word-by-word statistical translation strategy could have affected the general semantics of the texts and resulted in fuzzy and incorrect emotion classifications. 70 Finally, their method placed little emphasis on semantic and contextual information extraction after translation.
Wikarsa and Thahir 71 developed an application for detecting the emotions of Twitter users using the NB method. After extracting 105 tweets using the Twitter API, they preprocessed the tweets by conversion to lowercase, removal of stop words, mentions, and URLs, and conversion of emoticons into text. Then, using the NB classification method, they classified the texts into happiness, sadness, fear, anger, surprise, and disgust, and used 10-fold cross-validation to determine the accuracy of the classifications. They reported a system accuracy of 83% but recommended larger training data for improved performance in the future. They also suggested that a means of removing duplicate tweets could improve the response time of their system and that the use of other classification methods, such as SVMs and K-nearest neighbors (KNN), could improve performance.
In response to some of the recommendations made by Wikarsa and Thahir, 71 Mashal and Asnani 72 proposed a model capable of extracting fine-grained emotion intensities. They collected data from Reference 30, performed the major preprocessing steps used and recommended by Reference 71, and, after constructing feature vectors, predicted four categories of emotions, namely, happy, sad, angry, and fear, using a linear regression model. Their system was able to predict the degree of emotion conveyed in an emotion category; however, it was limited by the restricted number of emotion categories, and its disregard for the strength of adjectives resulted in loose contextual information extraction.
Jayakrishnan et al 73 classified data obtained from Malayalam, an Indian language, into one of happy, sad, fear, anger, and surprise. They preprocessed the data, extracted features, trained, and evaluated using 500 Malayalam sentences for each emotion category. Using the SVM classifier, they classified the texts into the various emotion classes. Their results showed accuracies of 0.94, 0.92, 0.90, 0.93, and 0.90 for happy, sad, anger, fear, and surprise, respectively. The limitations of their work were the disregard for semantic information in texts and the failure to consider the role of context in sentences.
Allouch et al 10 developed a system that helped children communicate well by detecting insulting words in conversations using an ML approach. They manually built a dataset consisting of 1241 non-insulting and 1255 insulting sentences, of which 90% were used as training data and 10% as test data. They then compared the performance of several ML classifiers, namely, the SVM, a multilayer neural network, NB, a decision tree, and a tree bagger, in the prediction process. They observed that the SVM gave the highest accuracy of 0.769 in predicting insulting sentences, with the multilayer neural network giving an accuracy of 0.768. The tree bagger, NB, and the decision tree gave accuracies of 0.762, 0.738, and 0.735, respectively. They then validated these accuracy values by checking the frequencies of positive and negative words in their dataset using positive and negative word lists imported from Reference 74 as the benchmark. The order of performance remained unchanged, although there was a slight improvement in the accuracy values. The list of positive and negative words used contained a fixed number of words and so presents challenges similar to those of the keyword approach. They mentioned that contextual information in sentences was disregarded; however, such information conveys significant meaning, and disregarding it may result in misclassification. They further stated that the multilayer neural network could provide a promising method for detecting and classifying emotions.
The work by Singh et al 75 contributed to solving the long-standing semantic extraction problem in text-based ED by applying a two-stage text feature extraction method comprising semantic and statistical stages. The semantic stage involved the extraction of meaning/semantics from texts using a POS tagger and was followed, in the statistical stage, by the use of the chi-square method to discard weak semantic features. They applied this procedure to the ISEAR dataset using the SVM classifier and detected seven emotions, that is, joy, anger, fear, disgust, guilt, shame, and sadness. Their results showed an improvement in performance compared with baseline methods. The drawback, however, was that their method did not take into consideration the relationships between features.
Alotaibi 52 classified emotions in the ISEAR database. After preprocessing, the data were trained on four classifiers, and logistic regression outperformed the other methods, that is, SVM, KNN, and XG-Boost, with a precision of 86%, a recall of 84%, and an F-score of 85%. The author highlighted that the application of a deep learning technique could improve system performance.
Tripto and Ali 76 proposed a deep learning model to detect sentiment labels and emotions from Bengali YouTube comments. After acquiring their data, they preprocessed the text to remove stop words. They then obtained word embedding representations using both the skip-gram and the continuous bag of words (CBOW) models in Word2Vec. The output was fed as input into their defined long short-term memory (LSTM) architecture in the first phase of the model, followed by a second stage made up of a convolutional neural network (CNN) architecture. Results from their study showed that the applied deep learning methods (LSTM and CNN) greatly outperformed traditional ML methods such as the SVM and NB, with an emotion classification accuracy of 59.2% and multiclass (3- and 5-label) sentiment accuracies of 65.97% and 54.24%, respectively.
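The difference between the two Word2Vec training modes can be illustrated by how they frame training pairs. This is a minimal sketch, not Tripto and Ali's actual pipeline, and the romanized Bengali example sentence is invented:

```python
def training_pairs(tokens, window=2, mode="skipgram"):
    """Generate (input, target) pairs for Word2Vec-style training.
    skip-gram: predict each context word from the center word;
    CBOW: predict the center word from its full context window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if mode == "skipgram":
            pairs += [(center, c) for c in context]
        else:  # CBOW: whole context predicts the center word
            pairs.append((tuple(context), center))
    return pairs

sent = ["ami", "khub", "khushi"]  # "I am very happy" (romanized, illustrative)
sg = training_pairs(sent, mode="skipgram")
cb = training_pairs(sent, mode="cbow")
```

For this three-token sentence, skip-gram yields one pair per (center, context-word) combination, whereas CBOW yields one example per center word.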
Abdullah et al 77 detected sentiments and emotions in Arabic tweets using a CNN-LSTM deep learning model. They conducted two experiments: Experiment 1 used a feed-forward phase, and Experiment 2 used a CNN-LSTM phase. In Experiment 1, they fed a 4908-dimensional input vector into a feed-forward network with fully connected layers and three hidden layers consisting of 500, 200, and 80 neurons, respectively, and applied the ReLU activation function. The output was a sigmoid function for predicting sentiment and emotion intensities. They optimized this network using stochastic gradient descent (SGD). Experiment 2 involved feeding an input vector of 300 into a CNN model with 64 filters, a kernel size of 3, and the ReLU activation function. The vectors were then passed on to the LSTM model after the application of a MaxPool with size = 2. The LSTM had two hidden layers with 200 and 80 neurons, respectively. The ReLU activation function and the sigmoid were applied as in Experiment 1, and the SGD optimizer was used. Their results indicated an accuracy of 40% for Experiment 1 and 60% for Experiment 2. The drawbacks of their system were the small amount of data used as well as the few hidden layers employed.
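The dimension flow of Experiment 2 can be checked with simple shape arithmetic. This assumes a "valid" convolution with stride 1 and non-overlapping pooling windows, details the paper does not spell out:

```python
def conv1d_len(length, kernel):
    # 'valid' 1D convolution, stride 1: no padding shrinks the sequence
    return length - kernel + 1

def maxpool_len(length, size):
    # non-overlapping pooling windows
    return length // size

seq = 300                                # input vector length
after_conv = conv1d_len(seq, 3)          # 64 filters, kernel size 3
after_pool = maxpool_len(after_conv, 2)  # MaxPool with size = 2
# The pooled sequence then feeds the 2-layer LSTM (200 and 80 neurons).
```

Under these assumptions the convolution yields 298 time steps and pooling halves that to 149 before the LSTM layers.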
Chatterjee et al 58 participated in the SemEval-2019 Task 3 competition. They detected contextual emotions from textual dialogues, classified them into one of four emotion classes (happy, sad, angry, and others) using long short-term memory (LSTM) networks, and obtained an F1 score of 0.5861. They further compared and discussed techniques that attained a greater F1 measure and realized that such systems were implemented using Bi-LSTMs. Cai and Hao 78 tried to detect emotions from Weibo, a popular Chinese microblogging site, using a multiview and attention-based Bi-LSTM. The authors approached the problem at three levels by finding emotions from emotion words, emojis, and extracted semantic or hidden emotion expressions. They initially obtained Weibo texts, preprocessed them using the Jieba word segmentation tool to remove stop words, and then tagged and encoded words using one-hot encoding. The words were then converted to vectors and trained with Word2Vec using the skip-gram model. The vector representations were integrated into a layer in the Bi-LSTM model and connected to a softmax layer to obtain the probability distribution vector. Training was done with an input dimension of 50. To enhance the training speed, batch training with a batch size of 256 was used, and a filling method kept the batch size constant. Backpropagation was used for training, and cross-entropy was implemented as the loss function. The NLP&CC 2013 evaluation dataset was used to determine whether a text contains an emotion. If it contained an emotion, it was categorized into one of their defined emotion categories; otherwise, it was discarded.
In their results, they defined Bi-LSTM-AVE as the summation of the outputs of texts represented in the Bi-LSTM, followed by a fully connected layer and a softmax function; Bi-LSTM-AVE-MV as the Bi-LSTM based on the multiview and attention mechanism; Bi-LSTM-V1 as the emotional-word angle; Bi-LSTM-V2 as the emoji angle; Bi-LSTM-V1-V2 as a blend of emotional words and emojis; and Bi-LSTM-AVE-V2 as a merger of pure semantics and emojis. Since the macro- and micro-averages of the F1 score of the Bi-LSTM-AVE-MV method were 7% higher than the Bi-LSTM-AVE averages and 17% higher than the Bi-LSTM-V1 and Bi-LSTM-V2 averages, the authors deduced that the purely semantic level served as a representation of the entire microblog text and contributed to the global understanding of the text, depicting the importance of semantics in text-based ED.
Ma et al 79 used the SemEval dataset and a Bi-LSTM to classify emotions in textual and emoji utterances into happy, sad, and angry. They found that their system outperformed baseline models for the happy and angry classes but not for the sad class, and concluded that Bi-LSTMs are capable of extracting contextual information from texts. Their system, however, covered only a restricted set of emotion categories.
Huang et al 56 used a four-part episodal memory network (EMN) with a self-attention (SA) mechanism to detect emotions from texts. Using the SemEval dataset, they detected four basic emotion categories, that is, anger, fear, joy, and sadness. Their results indicated that whereas EMN outperformed other systems that used intelligent ML techniques such as CNN, TLSTM, and RNN, EMN together with SA performed even better at extracting text semantics and detecting emotion categories, with a precision of 65.8%, a recall of 63.5%, and an F1 score of 64.6%. Their model was capable of detecting emotion categories and text semantics, but the limited number of emotion categories presents a challenge for generalization.
Polignano et al 80 designed a model that combined BiLSTM, self-attention, and convolutional neural networks (CNN). They noted that word embeddings contribute considerably to the performance of a text-based ED system and thus focused on effective word embeddings for improved text-based ED. Hence, they compared the performance of three (3) word embeddings, namely the Google word embedding, the GloVe embedding, and the FastText embedding, on their designed model using the ISEAR dataset, the SemEval-2018 Task 1 dataset, and the SemEval-2019 Task 3 dataset. The results obtained using the ISEAR dataset indicated higher precision, recall, and F1 measures mostly for the GloVe and FastText embeddings across the labeled emotion classes, that is, joy, fear, shame, disgust, guilt, and anger, except for sadness, for which baseline approaches performed better. For the SemEval-2018 Task 1 and SemEval-2019 Task 3 datasets, however, the FastText embedding gave higher precision, recall, and F1 measures for the annotated emotion labels. They consequently encouraged the use of the FastText embedding in future research.
Ragheb et al 81 presented an attention-based model for detecting and classifying emotions in textual conversations. The data used were provided by the SemEval-2019 Task 3 competition organizers; they contain conversations annotated categorically as happy, sad, angry, and others. Their approach was made up of two (2) phases, that is, the encoder and classification phases. After collection, the data were tokenized and fed into an encoder followed by Bi-LSTM units trained using average stochastic gradient descent (ASGD). Dropouts were applied between the LSTM units in order to avoid overfitting. Then, a self-attention mechanism followed by average pooling was applied in order to concentrate on the relevant emotion-carrying parts of conversations. A dense layer and a softmax activation were then applied in order to classify conversations into the annotated categories. After training and testing, the model yielded an F1 score of 0.7582.
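The self-attention-plus-average-pooling step can be illustrated with a minimal pure-Python sketch over toy two-dimensional hidden vectors. This is unparameterized dot-product attention; the actual model's learned projections and dimensions are omitted:

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend_and_pool(h):
    """Dot-product self-attention over a sequence of hidden vectors h,
    followed by average pooling into a single sentence vector."""
    out = []
    for q in h:
        # attention weights of this time step over all time steps
        scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) for k in h])
        # weighted sum of the value vectors (here, h itself)
        out.append([sum(w * k[d] for w, k in zip(scores, h))
                    for d in range(len(q))])
    # average pooling over time steps
    return [sum(v[d] for v in out) / len(out) for d in range(len(out[0]))]

pooled = attend_and_pool([[1.0, 0.0], [0.0, 1.0]])
```

For the symmetric toy input, each time step attends mostly to itself and the pooled sentence vector lands at [0.5, 0.5].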
Barbieri et al 82 presented a label-wise attention LSTM mechanism for emoji prediction. It involved two stacked word-based bidirectional LSTM networks 83 with skip connections between the first and second LSTMs and an attention layer to focus on important input features. The model was evaluated on the English data provided for the SemEval 2018 Shared Task on Emoji Prediction. 84 It classified emojis into 20 classes and showed improved results in comparison with the baseline FastText method. 85 Using a bidirectional transformer-based BERT architecture, Aditya Malte and Pratik Ratadiya 86 detected cyber abuse from Facebook multilingual texts: English, Hindi, and a mixture of both languages (Hinglish). The dataset contained 25 013 multilingual comments, made up of 12 000 training samples each for English and Hindi comments, with a separate 916 English and 970 Hindi comments used for testing, plus an additional 1257 English tweets and 1194 Hindi tweets used to reinforce the model's generalization ability. The data were preprocessed by transliterating Hindi texts, and texts containing a blend of Hindi and English, to English only. Then noise and redundancies were removed from both the English-only texts and the transliterated texts in preparation for pre-training. During the pre-training phase, the model was trained on masked language model (masked LM) and next-sentence prediction tasks. After fine-tuning, the data were classified into three categories, that is, covertly aggressive (CAG), overtly aggressive (OAG), and non-aggressive (NAG). The approach attained an F1 score of 0.4521 for Hindi texts and 0.5520 for English texts.
Huang et al 87 ensembled the hierarchical LSTMs for contextual ED (HRLCE) model and the BERT model to detect emotions in tweets. After collection, the data passed through a preprocessing stage where emojis (constituting a large proportion of the entire data) were extracted and converted into texts using the emojis package, 88 and the ekphrasis package was used to handle misspellings and normalize tokens. The HRLCE model was then used after the application of the BERT-Large pre-trained model with 24 layers. This was to ensure the effective extraction of semantics in emotion-conveying texts. The model attained a macro-F1 score of 0.779 across the happy, angry, and sad categories. An increase in misclassification was reported because a sizeable number of emotions were classified into multiple classes.
Huang et al 89 investigated the performance of BERT in emotion recognition. They used the EmotionLines dataset, which consists of dialogues from the Friends television sitcom, and EmotionPush, made up of Facebook Messenger chats. Emotions were classified into joy, sadness, anger, and neutral. The BERT pre-trained transformer encoder was used to carry out the pre-training task. After tokenization, 15% of the tokens in the input sequence were masked at random, and the model had to learn to predict the masked tokens from the unmasked tokens in the sequence. The model attained micro F1 scores of 0.815 and 0.885 on the Friends and EmotionPush datasets, respectively. The authors stated that a higher performance could be attained on a larger amount of data.
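The masked-LM objective described above can be sketched as follows. This is illustrative only: BERT's actual scheme also replaces a fraction of the selected tokens with random or unchanged tokens, which this sketch omits, and the example sentence is invented:

```python
import random

def mask_tokens(tokens, rate=0.15, seed=0):
    """Replace ~rate of the tokens with [MASK] and return the masked
    sequence plus the positions the model must learn to predict."""
    rng = random.Random(seed)
    n = max(1, round(len(tokens) * rate))
    positions = sorted(rng.sample(range(len(tokens)), n))
    masked = [("[MASK]" if i in positions else t)
              for i, t in enumerate(tokens)]
    return masked, positions

toks = "joey does not share food with anyone at the table".split()
masked, positions = mask_tokens(toks)
```

During pre-training, the loss is computed only at the returned positions, forcing the model to infer the hidden words from their context.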
Ahmad et al 90 detected emotions using cross-lingual embeddings in transfer learning. They built an emotion-labeled dataset (Emo-Dis-HI) by crawling Hindi news websites to obtain disaster documents. The data contained 2668 training samples and were annotated at the sentence level in accordance with the basic emotions presented by Plutchik. Due to the limited amount of data, a transfer-learning setup with cross-lingual word embeddings involving Hindi and English was adopted to provide a shared representation of the two languages. Using an ensemble of CNN and Bi-LSTM, sentences were classified and an F1 score of 0.53 was obtained. Their model, however, did not take into consideration the context in which words appeared in documents when classifying emotions.
Matla and Badugu 91 explored the performance of the K-NN and NB ML techniques in detecting emotions from tweets in the Sentiment140 corpus. 47 Their approach showed that under identical conditions, the NB outperformed the KNN with an accuracy of 72.06% compared with 55.50%.
Grover and Verma 92 used the hybrid approach to detect emotions in Punjabi texts. They detected the presence of emotions using a set of rules and classified them into one of Ekman's six emotion categories, that is, happiness, sadness, fear, anger, surprise, and disgust, using the SVM and NB ML classifiers. In their work, they stated that a more robust ML technique could improve performance.
Perikos and Hatzilygeroudis 93 used an ensemble of classifiers to automatically detect emotions from texts. Their ensemble consisted of a knowledge-based tool implementing the keyword approach and two statistical ML methods, the NB and the maximum entropy learner. They obtained their data from the ISEAR and affective text datasets. The authors used the Stanford Parser to analyze the texts at the sentence level. The texts were then preprocessed by removing stop words and lemmatizing before features were represented using the bag-of-words (BOW) model. The output was then passed to the ensemble of classifiers, which determined whether a sentence contained emotion and, if so, identified the polarity of the emotion. Their results showed that the ensemble of classifiers performed better than single classifiers.
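One simple way to combine the outputs of an ensemble's members is a majority vote, sketched below with hypothetical classifier outputs; the paper's exact combination scheme may weight members differently:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine member classifier outputs into one label;
    ties are broken in favor of the first-seen label."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs from the keyword-based tool, NB,
# and the maximum entropy learner for one sentence.
votes = ["anger", "anger", "no-emotion"]
label = majority_vote(votes)
```

With two of three members agreeing, the ensemble emits "anger", which is how an ensemble can outvote a single misfiring classifier.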
LeCompte and Chen 94 analyzed the effect of including emojis when detecting emotions from texts. Their work adopted the keyword identification method along with the SVM and multinomial naïve Bayes (MNB) classifiers to classify emotions into sad, angry, scared, happy, surprised, thankful, and love. Their results indicated an improvement in performance when emojis were considered compared with when they were not. They also indicated that under equal conditions, MNB performed better than SVM.
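Including emojis as features can be as simple as mapping them to word-like tokens before vectorization, so a keyword or BOW classifier can count them like any other word. The mapping below is invented for illustration:

```python
# Hypothetical emoji-to-token mapping; a real system would cover many more.
EMOJI_TOKENS = {
    "\U0001F60A": "_happy_emoji_",   # smiling face
    "\U0001F622": "_sad_emoji_",     # crying face
    "\U0001F621": "_angry_emoji_",   # pouting face
}

def normalize_emojis(text):
    """Replace each known emoji with a word-like token the classifier
    can treat as an ordinary feature."""
    for emoji, token in EMOJI_TOKENS.items():
        text = text.replace(emoji, " " + token + " ")
    return " ".join(text.split())

out = normalize_emojis("so unfair \U0001F621")
```

After normalization, "so unfair _angry_emoji_" carries the emoji's emotional signal as a countable token instead of a symbol the tokenizer might drop.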
Jian et al 95 presented a hybrid framework for the automatic detection of emotions in multilanguage text data. Their proposed framework used NLP techniques to extract emotions from texts and then classified them per the emotion model presented by Ekman, using the SVM first and then NB. They reported that their method provided better results than other methods, achieving an accuracy of 72.81%; however, a more robust ML classifier could be used for better performance.
Ramalingam et al 38 detected emotions from texts using the hybrid approach. After developing an emotion WordNet and applying a variety of heuristics to identify sentences containing emotions, they classified emotions into five categories (joy, sadness, disgust, anger, and fear) using an ML classifier. The authors hinted that with a more robust feature extraction technique, the performance of their proposal could have been greatly enhanced.
Nida et al 96 developed an automatic emotion classifier for tweets using three emotion corpora. They initially extracted features from these corpora individually and used the WordNet-Affect emotion dictionary to generate synonyms of the emotion words in their databases. The authors then found the dependencies in sentences as well as the context of emotion words. They trained their model using the SVM classifier and classified emotions into one of Ekman's six categories. Their work showed accuracies of 59.71%, 63.24%, and 67.86% on the three emotion corpora, respectively.
Hasan et al 49 presented the Emotex model for detecting emotions from texts using a supervised learning method and emotion dictionaries. Their approach consisted of two tasks: an offline and an online classification task. The offline task involved developing the Emotex system for creating emotion classification models. Emotex was built using emotion-labeled texts from Twitter together with the SVM, NB, and decision tree classifiers. The data went through preprocessing and feature vector construction to obtain the training datasets for the classification model. The online task used the model created offline to classify live streams of tweets in real time. Their model yielded an accuracy of 90%, but with loose semantic features.
Ghanbari-Adivi and Mosleh 57 used NLP tools together with an ensemble classifier based on the Tree-structured Parzen Estimator (TPE) to detect emotions in two regular datasets (ie, ISEAR and OANC) and an irregular dataset (ie, Twitter messages extracted via Crowdflower). They built a dictionary of meaning-bearing words that served as keywords for feature extraction and trained the model using an ensemble of k-NN, MLP, and decision tree classifiers, with 500 of each, resulting in a total of 1500 classifiers. They evaluated their results by analyzing the metrics of the confusion matrix they constructed. Their results showed a detection accuracy of 88.49% for the irregular data and 99.49% for the regular data, albeit with high time complexity.
Tzacheva et al 97 used the NRC emotion lexicon to label Twitter data, obtained via the Twitter API, into emotion classes and then classified the emotions using the SVM. Afterward, they discovered action patterns and used action rules to transform negative or neutral emotion states into positive states before reclassifying. They obtained a system accuracy of around 88%.
Table 2 (excerpt, Emo-Dis-HI data): showed that knowledge gathered from resource-rich languages can be applied to other language domains using transfer learning and cross-lingual embeddings; an F1 score of 0.53 was obtained.

OPEN ISSUES AND FUTURE RESEARCH DIRECTIONS
This section presents some identified open issues and proposes future research directions for text-based ED researchers.
From the state-of-the-art discussions carried out in this article, it was realized that research in the field is predominantly categorized into two main phases: language representation and classification. During language representation, the extraction of contextual information is pivotal, as it forms the bedrock for improving classification accuracy. 98 A major issue identified is the need for a robust technique for extracting this contextual information from text. The introduction of transformer-based embeddings has brought a significant increase in the quality of contextual information extraction. 10,87,89 However, the use of transformers is affected by some limitations, viz., the out-of-vocabulary (OOV) limitation, increased complexity, and, importantly, overfitting in small networks. 99,100 Given these drawbacks, an ensemble of attention and neuro-fuzzy networks 101 could help reduce the limiting effects and thus increase classification performance. The attention networks are expected to focus on the extraction of relevant features, while the neuro-fuzzy networks offer clearer comprehensibility of the extracted features prior to classification.
The studies covered in this survey reveal that text-based ED has not been adequately explored for certain critical or life-saving applications. These areas include, but are not limited to, crime detection and mitigation by analyzing victims' messages to identify threatening words, and the analysis of patients' messages to determine their depression levels so that timely support can be offered. Given the enormous amount of text data generated daily, undertaking research in these and similar life-saving application areas would enhance and sustain the relevance of text-based ED.
Finally, an individual's cultural affiliations greatly influence the emotions they express toward situations. However, few emotion-labeled resources exist for languages other than English. 90 The availability of rich resources in other languages such as French, Spanish, Hindi, and so on could greatly change the narrative and encourage research in the field, balancing the work done across different languages.

CONCLUSION
In this article, a comprehensive guide to the subfield of SA/ED, specifically text-based ED, has been presented. The article introduces the concept of text-based ED and emotion models, and highlights some important datasets available for text-based ED research. The three main approaches utilized when designing text-based ED systems have been elucidated, together with their strengths and weaknesses. The article further discusses the current state of the art with emphasis on applied approaches, datasets used, major contributions, and limitations. Finally, the article presents open issues and future research directions for researchers in the domain of text-based ED.

PEER REVIEW INFORMATION
Engineering Reports thanks Marco Polignano and other anonymous reviewers for their contribution to the peer review of this work.