An efficient supervised framework for music mood recognition using an autoencoder-based optimised support vector regression model

Music is often called the 'language of emotions', and music mood recognition has recently emerged as an active research task. This paper develops an efficient supervised framework for music mood recognition using an autoencoder-based optimised support vector regression (SVR) model. Our main intention is to increase the accuracy of music emotion classification by considering both text-dependent and non-text-dependent features. For high-level feature representation, a stacked autoencoder with two hidden layers is used. A modified K-Medoid-based brain storm optimisation support vector regression (SVR_KMBSO) model is utilised for emotion classification: the optimal parameters of the SVR are selected using the K-Medoid-based brain storm algorithm. The proposed framework utilises the ISMIR2012 and NJU_V1 datasets for English songs; for Hindi, online songs are gathered and used for music mood recognition. All three datasets include songs labelled with four emotions: happy, angry, relaxed and sad. The experimental results are evaluated and compared with existing classifiers, including SVR, the deep belief network (DBN) and the recurrent neural network (RNN). The proposed SVR_KMBSO method achieves high accuracy.


| INTRODUCTION
Music is a common ingredient that enriches our daily life. Within the music information retrieval research community, music emotion recognition (MER) has attracted increasing attention. Recently, significant research on various perspectives of MER has been carried out, covering bimodal approaches, exploitation of lyrical information, automatic playlist generation, emotion variation detection and classification of song excerpts [1,2]. Almost every day, an enormous amount of music is published to satisfy listeners' favourite styles and diverse interests. The intention of MER is to recognise the emotion of a music clip [3]. MER is essential because it enables emotion-based music retrieval, indexing and organisation [4]. To describe music emotion, most MER systems adopt the valence-arousal model [5].
In accordance with psychological theory, MER models are divided into categorical and parametric models. Parametric approaches represent the emotions in music using valence and arousal, whereas categorical approaches tag songs with emotion labels or adjectives such as peaceful, happy, calm, bored and angry [6,7]. MER is chiefly concerned with developing computational models that accurately predict the affective content of a music signal. An MER model can be trained with machine learning methods that map acoustic features to the perceived music emotion [8]. Most recommender systems do not consider human expressions and emotions, even though emotions have an important influence on our daily life.
To recognise people's internal cognitive processes and emotions, physiological signal methods are more reliable than facial expressions [9]. Classification of music emotion on the basis of lyrics is a text classification problem, for which various algorithms exist, such as decision trees, k-nearest neighbour (KNN), Bayes classifiers and SVM [10]. Music information retrieval sources are classified as audio, text, or mixed audio and text. Triggering and communicating the listener's emotions is identified as the major role of music.
Automatically determining the emotional content of music is considered an essential task [11].
MER has many application values in several fields. For automated music emotion classification and regression, various techniques have been proposed and significant results achieved [12]. Computer-based MER has increasingly entered the public perspective. Generally, MER is divided into three parts: the emotion model, the feature model and the classification model. In MER, the selection of data features is crucial, but it is challenging for researchers to form a clear picture of the MER process because of the diversity and complexity of existing research systems. Hence, it is essential to compare and classify techniques on the basis of the text-dependent and non-text-dependent features they utilise [13,14]. Some common issues arise when mapping the relationship between emotions and music features.
These issues include determining a suitable feature extraction approach and the fact that emotion is correlated with human preference, character, etc. [15]. As in other recognition problems, recognition techniques should be given along with a description of the emotion model. For MER, various methods including multi-modal, ranking-based and regression approaches have been put forward [16,17]. A considerable amount of work has been devoted to MER in the research community because the human response to music listening is closely related to emotion [18]. Music plays an important role in our daily life, and people enjoy music during relaxation time. The collection of songs in people's personal devices increases every day. There is a close relation between emotion and music, so people choose to listen to songs that suit their emotion [19,20].
The main contributions of this work are as follows:
- The songs are converted into text (lyrics) and music without lyrics (karaoke). Both text-dependent and non-text-dependent features are extracted. The results are evaluated and compared for songs, karaoke and lyrics using three datasets, considering both English and Hindi.
- An SVR_KMBSO-based method is introduced for the classification process, in which the optimal parameters of the support vector regression (SVR) are selected by the KMBSO algorithm. The selection of optimal parameters using a modified optimisation algorithm is the major contribution of the proposed approach, as it reduces the complexity of the recognition process by performing clustering as an initial phase during emotion classification. The clustering in BSO reduces the complexity and increases the speed of optimal parameter selection.
The remaining sections of this paper are organised as follows: related works are discussed in Section 2, Section 3 presents the problem definition, Section 4 describes the proposed methodology, Section 5 gives the experimental analysis and Section 6 concludes the paper.

| RELATED WORKS
Some recent research works related to music mood recognition are reviewed below.
Dong et al. [21] introduced a new technique named the bidirectional convolutional recurrent sparse network (BCRSN) for MER. The recognition process combines recurrent neural networks (RNNs) and convolutional neural networks (CNNs). The method learns sequential-information-included affect-salient features (SII-ASF) from the 2D spectrogram of the audio signal, combining three processes, ASF learning, feature extraction and emotion prediction, to continuously predict emotions from audio files. A weighted hybrid binary representation (WHBR) method was introduced to reduce the computational complexity, converting the regression-based prediction into a weighted combination of multiple binary classification problems. Sarkar et al. [22] introduced emotion-based music categorisation with low-level features and agglomerative clustering. From the music signal, they considered only low-level spectral features and time-domain features, carefully selecting a limited feature set specific to each emotion instead of using many features. K-means and agglomerative clustering-based unsupervised methods were used for classification. Chaudhary et al. [23] applied hash-tag graph generation for automatic emotion recognition; the method comprised training and testing phases and achieved a high recognition result.
Ayata et al. [24] developed emotion-based music recommendation using wearable physiological sensors, recognising the user's emotion while listening to music. Emotion recognition from multi-channel physiological signals was formulated as prediction of valence and arousal, with emotion classified by integrating physiological sensors and a wearable computing device. Performance was evaluated on thirty-two subjects with feature fusion using KNN, SVM, random forest and decision tree algorithms. The results showed that the proposed system can be integrated with any recommendation engine.
To accurately perform mood classification for multimodal music, Hu et al. [25] developed a framework that utilises multiple complementary sources: social tags related to music pieces, lyrics text and music audio. The experimental analysis showed that the system combining lyrics and audio outperformed the one using only audio features. For feature selection, automatic methods proved to yield a smaller feature space. The learning curves showed that, compared to methods using either audio or lyrics alone, the hybrid method combining audio and lyrics features required shorter audio clips and fewer training samples. Medina et al. [17] trained a multi-layer perceptron (MLP) network on the MediaEval database for emotion recognition from music. Pre-processing was first applied to balance and stratify the dataset. The approach comprises three classifiers for emotion classification: random forest (RF), MLP and SVM, among which the MLP attained the best recognition performance. It further integrates two binary classification techniques (binary classifier and one-vs-rest [OvR]) in a quadrant classification scheme. Finally, using temporal annotation information, dynamic classification with diverse time windows was executed.
Machine learning classifiers such as the support vector machine (SVM) and random forest were utilised by Sarkar et al. [26] to perform experiments on the MER_traffic, Bi-Modal and Soundtracks datasets. A neural network combining the feature sets provided optimal results, and a novel pre-processing technique was proposed that offers significant enhancement on these datasets. A VGGNet-style convolutional neural network was proposed to avoid the problems of manual feature design. In that work, a valence-arousal (V-A) model was utilised to automatically identify the emotion conveyed by the music. Because training classification models on the angle vectors of the V-A plane is difficult, Liu et al. [27] developed a variant RNN model to extract features from the melody, demonstrating that machine learning algorithms can predict emotions from music well.
Gao et al. [28] proposed a hybrid neural network architecture (CRNN) for recognising emotions from music, particularly for a dataset generated from scratch. To create the dataset, a main melody extraction approach was first used. Features were then extracted from each piece of music and provided as input to a CNN, which learns the features and passes them to an RNN for emotion classification. A survey of existing techniques is shown in Table 1.

| PROBLEM DEFINITION AND MOTIVATION
To date, music mood classification remains an open research area. For Western music, both lyrics and audio are utilised, so mood classification accuracies fall in an appropriate range, though at high complexity. In contrast, the accuracies achieved by early research on Hindi music mood classification are not in an appropriate range, owing to the insufficiency of mood-annotated Hindi music clips. A single Hindi song can carry various moods when sung in diverse contexts, which leads to misclassifications in multimodal music mood classification frameworks.
Moreover, existing classifiers can allow solutions to fall into a local optimum. Therefore, to improve the recognition accuracy, a novel optimisation-based SVR is introduced in this method. Along with English songs, the Hindi songs are also considered for analysis. Music is a digital resource that is now extensively available on various media such as pen drives, mobile phones, websites and digital audio players. The sheer volume of available music makes it difficult for listeners to choose a desired piece, so it is essential to develop an effective technique that facilitates efficient searching for desired music. Taking all of this into consideration, a novel optimisation-based SVR is introduced here to attain high accuracy in music mood recognition.

| PROPOSED METHODOLOGY
Normally, music is associated with unique emotional content, which is often referred to as the music mood. State-of-the-art methods concentrate on music piece content analysis, treating the music mood as important metadata. The main goal of the proposed work is to construct a music emotion classification framework for both Hindi and English music. For robust feature extraction, key features such as intensity, rhythm and timbre are considered for the mood classification task. Further, acoustic features associated with emotional expression are extracted through Mel-frequency cepstral coefficients (MFCC), octave-based spectral contrast (OSC), statistical spectrum descriptors (SSDs), the chromagram and Daubechies wavelet coefficient histograms (DWCH). Text-dependent features are extracted through N-grams, term frequency-inverse document frequency (TF-IDF) and total emotion count. The automatic supervised music mood classification framework is developed with a deep learning-based network model. Initially, high-level features are extracted from the input data by a stacked autoencoder (SAE) model. With these extracted features, the target emotions are accurately predicted by the SVR-based recognition technique. Furthermore, the KMBSO algorithm is introduced along with the SVR to improve its performance, effectively performing the optimal parameter selection for the SVR. The schematic structure of the proposed recognition technique is shown in Figure 1.
Compared with other existing works, the advantage of the proposed work is that, after the features are extracted from the music signal in a text-dependent and non-text-dependent manner, they are fed into the stacked autoencoder to obtain a higher-level feature representation that avoids emotion class-level confusion. This higher-level representation is then trained via the SVR model, whose hyperparameters are selected through the efficient KMBSO algorithm, enhancing the recognition performance of music mood classification.

| Text-dependent features
In songs, many types of textual features can be mined from the lyrics, which are very rich resources. Some of the most common features used for text classification are considered here: N-grams, TF-IDF and total emotion count (TEC). These three features are used for extracting emotions from words.

a. N-grams
N-grams are among the most standard low-level features in various classification tasks, including emotion and sentiment classification. They are simple and highly scalable. We use unigrams and bigrams to construct the feature vector for music emotion classification.
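The unigram/bigram feature extraction described above can be sketched as follows; whitespace tokenisation and lowercasing are simplifying assumptions, not the paper's exact pipeline:

```python
from collections import Counter

def ngram_features(lyric, n_values=(1, 2)):
    """Count unigrams and bigrams in a lyric string (whitespace tokenised)."""
    tokens = lyric.lower().split()
    counts = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

features = ngram_features("my heart is happy happy")
```

In a full system, the counts from every lyric would be mapped onto a shared vocabulary to form fixed-length feature vectors.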

b. Term frequency-inverse document frequency
TF-IDF is normally represented as a bag-of-words scheme, as the collection of words from the entire document is included in it [29]. TF-IDF is an efficient and simple technique. It basically assumes that a highly significant word is repeated many times in a given document while rarely appearing in other documents. The first assumption relates to the TF feature, the second to the IDF feature. The number of times word i appears in document j is represented by the parameter tf_ij; a high tf_ij value means that word occurs many times in that document. A word i that occurs many times in document j has a high TF and a small IDF value. The TF-IDF weight of word i in document j can be written in the standard form

tf-idf_ij = tf_ij × log(N / df_i),

where N is the total number of documents and df_i is the number of documents containing word i.

c. Total emotion count

This feature captures the number of words in the entire document that are associated with an emotion. Only the most popular emotion context of a word recommended by the lexicon is captured (i.e. the emotion having the maximum score in the lexicon). However, words do not always express a single emotion; they may express multiple emotions, for example the word 'beautiful', which is commonly used in both the joy and love emotions. For such a word, both the joy and love emotions should count as one and the other emotions as zero (based on the lexicon Lex scores). To handle this, the feature must include the relation between a word and multiple emotions. The feature obtained for the j-th emotion is computed using Equation (5), where I(⋅) represents the indicator function and e_j represents the feature value for the j-th emotion. TEC only counts words having the highest score in the lexicon [30]. The number of occurrences of word w in document d is represented as count(w, d).
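A minimal sketch of the TF-IDF weighting, assuming the standard tf_ij × log(N / df_i) form (the paper's exact normalisation is not shown in this excerpt):

```python
import math

def tf_idf(docs):
    """Compute tf-idf_ij = tf_ij * log(N / df_i) for tokenised documents."""
    N = len(docs)
    df = {}                                  # df_i: documents containing word i
    for doc in docs:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for word in doc:
            tf = doc.count(word)             # tf_ij
            w[word] = tf * math.log(N / df[word])
        weights.append(w)
    return weights

docs = [["love", "love", "joy"], ["sad", "joy"]]
w = tf_idf(docs)
```

Note that a word appearing in every document (here 'joy') receives weight 0, matching the intuition that such words carry little discriminative information.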

| Non-text dependent features
The attributes of music are normally categorised into four to eight categories, each representing a different concept. The eight categories are tone colour (associated with timbre), dynamics, harmony, rhythm, musical texture, expressive techniques, musical form and melody. These categories provide two insights: (i) where the features related to a particular emotion belong, and (ii) which category has the fewest computational models for extracting musical features for a particular emotion. In this method, eight non-text-dependent features are employed: rhythm, timbre, intensity, chromagram, MFCC, OSC, SSDs and DWCH.

a. Rhythm
Music contains a sequence of notes that changes over time, and each note has its own specific duration. Therefore, computing note duration statistics is an obvious computational metric. The three rhythm features most closely related to human emotional response are rhythm regularity, tempo and rhythm strength. In this method, three possible categories are considered to capture the dynamics of note duration and its changes: long, medium and short notes. Existing techniques determine the mean and standard deviation of all note durations to define these three ranges. The qualitative duration of note i is represented as ND_i.
b. Timbre

Timbre is also referred to as the tone or colour of a sound. It is one of the basic elements of music and determines the quality of sound. Timbre is the set of subjective impressions formed by individuals when listening to a sound, independent of its amplitude (loudness) or frequency (pitch). Because only frequency and amplitude are identified as dominant attributes of timbre, its definition is viewed as highly subjective and variable. In the related formulation, Q_i and Q_j are the qualities of two sounds and T denotes the timbre.
c. Intensity

Intensity is one of the most essential features for emotion recognition, as it is highly correlated with arousal; it is therefore widely applied in arousal-dimension classification. Amplitude measurement accurately determines the intensity level, and a few descriptive features such as the minimum energy rate and the loudness centroid are included in song intensity. Because intensity exhibits short-time dynamic behaviour, these descriptive features are extracted frame by frame. The intensity feature of each frame is composed of the summed signal spectrum and the spectrum distribution from several sub-bands, as defined in Equation (10), where I(n) is the intensity of the n-th frame and A(n, l) is the absolute value of the l-th FFT coefficient of the n-th frame.
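A common realisation of frame intensity, assuming Equation (10) sums the absolute FFT magnitudes per frame (a standard choice, not confirmed by this excerpt), can be sketched as:

```python
import numpy as np

def frame_intensity(frames):
    """I(n) = sum over l of |A(n, l)|, where A(n, l) are the FFT
    coefficients of the n-th frame."""
    spectra = np.abs(np.fft.rfft(frames, axis=1))   # |A(n, l)|
    return spectra.sum(axis=1)                      # I(n), one value per frame

# Two toy frames: the louder frame yields the larger intensity value
frames = np.array([[0.0, 1.0, 0.0, -1.0], [0.0, 0.1, 0.0, -0.1]])
I = frame_intensity(frames)
```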

d. Mel-frequency cepstral coefficients
MFCCs were originally designed for speech recognition and have been shown to be among the most informative feature domains for MER. This acoustic feature is extensively applied in audio and speech processing. MFCCs represent a low-dimensional warped spectrum based on the mel scale. The process is carried out frame by frame, after which each frame is converted into a single N-dimensional feature vector [31], where N is smaller than the number of samples in each frame. This reduces the quantity of data that the back-end system must process. In feature extraction, the audio input is converted into a vector sequence X = [x_1, x_2, …, x_k], where k indicates the frame index and x_k denotes the N-dimensional vector.
The MFCCs are determined using the following steps:
(i) Pre-emphasise the speech signal.
(ii) Separate the signal into frames of size 20 ms with a shift of 10 ms, then apply a Hamming window over each frame.
(iii) Compute the magnitude spectrum of each windowed frame by applying the discrete Fourier transform (DFT). The DFT of the k-th Hamming-windowed frame is estimated from its time-domain signal s_k(x); A represents the DFT length, the power spectrum of the k-th windowed frame is represented as P_k(a) and n represents the sample number.
(iv) Compute the mel spectrum by passing the DFT output through a mel filter bank.
(v) Apply the DCT to the log mel frequency coefficients (log mel spectrum) to obtain the required MFCCs.
In this method, the dynamic and static features of the music signal are captured with 39th-order MFCCs. The 39-dimensional MFCC feature vector of each frame comprises 13 static coefficients, 13 delta coefficients and 13 acceleration (delta-delta) coefficients.
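Steps (i)-(v) can be sketched as follows; the pre-emphasis coefficient 0.97, the FFT length, the triangular mel filter bank and the 16 kHz sample rate are conventional illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def mfcc(signal, sr=16000, n_mels=26, n_mfcc=13):
    # (i) pre-emphasis (0.97 is a conventional, assumed coefficient)
    emphasised = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # (ii) 20 ms frames with 10 ms shift, Hamming windowed
    flen, fstep = int(0.020 * sr), int(0.010 * sr)
    n_frames = 1 + (len(emphasised) - flen) // fstep
    idx = np.arange(flen) + fstep * np.arange(n_frames)[:, None]
    frames = emphasised[idx] * np.hamming(flen)

    # (iii) power spectrum P_k(a) via the DFT
    nfft = 512
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft

    # (iv) triangular mel filter bank spanning 0 .. sr/2
    mel_max = 2595 * np.log10(1 + (sr / 2) / 700)
    mel_pts = 700 * (10 ** (np.linspace(0, mel_max, n_mels + 2) / 2595) - 1)
    bins = np.floor((nfft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, nfft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(power @ fbank.T + 1e-10)

    # (v) DCT-II of the log mel spectrum, keeping the first n_mfcc coefficients
    m_idx = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), 2 * m_idx + 1) / (2 * n_mels))
    return logmel @ dct.T                       # shape: (n_frames, n_mfcc)

coeffs = mfcc(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000))
```

The 13 static coefficients produced here would then be augmented with delta and delta-delta coefficients to form the 39-dimensional vector described above.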
e. Octave-based spectral contrast

The OSC feature has 14 dimensions and applies 7 octave-based bands [32]. OSC considers the spectral valley, the spectral peak and their difference within each sub-band; the difference between spectral peak and spectral valley provides the contrast. For the spectral valley, the values in each sub-band are sorted in ascending order and the first α-fraction of the bandwidth values is summed; the spectral peak uses the same process with a descending sort [31]. To extract the OSC features, first apply the fast Fourier transform (FFT) on each audio frame to generate the spectrum, then subdivide the spectrum into sub-bands with a set of octave-scale filters. Let P_{a,1}, P_{a,2}, …, P_{a,N_a} be the power spectrum in the a-th sub-band, and N_a the number of FFT frequency bins in the a-th sub-band. The spectral peak and spectral valley of the a-th sub-band are given in Equation (13) as

Peak(a) = log( (1/(αN_a)) Σ_{i=1}^{αN_a} P'_{a,i} ),
Valley(a) = log( (1/(αN_a)) Σ_{i=1}^{αN_a} P'_{a,N_a−i+1} ),

where P'_{a,1} ≥ P'_{a,2} ≥ … ≥ P'_{a,N_a} is the power spectrum sorted in descending order and α is the neighbourhood factor, set to 0.2. The spectral contrast is SC(a) = Peak(a) − Valley(a). The feature vector contains the spectral valley and spectral contrast of all sub-bands, so the OSC feature vector is

X_osc = [Valley(0), …, Valley(B − 1), SC(0), …, SC(B − 1)],

where B represents the number of octave-scale filters.

f. Statistical spectrum descriptors

During music and audio processing, the SSD is frequently related to timbral texture. A Hanning window is applied for each spectral shape function to divide the data into short overlapping segments and to evaluate the magnitude DFT. It is a 4D feature comprising, in order, the spectral centroid, spectral flux, spectral roll-off and spectral flatness.
Here the centroid S_c identifies the weighted average of the spectrum; the flux S_fx identifies the Euclidean distance between successive spectral frames; the roll-off S_r gives the frequency below which most of the total spectral energy is found; and the flatness measure S_f quantifies how close the spectrum is to being uniformly distributed. f_k is the frequency of bin k, S_{i,k} is the strength of bin k in frame i, P_k denotes the magnitude of bin k, and n is the total number of bins in a band.
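The OSC peak/valley computation described above can be sketched as follows; the sub-band boundaries here are illustrative, not the paper's exact octave filters:

```python
import numpy as np

def osc(frame_power, band_edges, alpha=0.2):
    """Per sub-band: Peak = log mean of the top alpha-fraction of bins,
    Valley = log mean of the bottom alpha-fraction, SC = Peak - Valley."""
    valleys, contrasts = [], []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = np.sort(frame_power[lo:hi])[::-1]     # descending: P'_{a,1} >= ...
        k = max(1, int(alpha * len(band)))
        peak = np.log(band[:k].mean() + 1e-10)
        valley = np.log(band[-k:].mean() + 1e-10)
        valleys.append(valley)
        contrasts.append(peak - valley)
    return np.array(valleys + contrasts)             # X_osc layout

power = np.abs(np.fft.rfft(np.random.default_rng(0).standard_normal(1024))) ** 2
feats = osc(power, band_edges=[0, 8, 16, 32, 64, 128, 256, 513])  # 7 sub-bands
```

With 7 sub-bands this yields the 14-dimensional vector (7 valleys followed by 7 contrasts) described in the text.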

g. Chromagram
A spectrogram folded into 12 frequency bins representing the 12 semitones is called a chromagram (or chroma). The chromagram is a well-established method widely applied to determine the pitch-class content over short durations. It is essentially a circularly folded version of a logarithmically warped spectrogram: the chroma frequencies of the various octaves are gathered and summed to evaluate the energy in each of the 12 pitch classes. This feature can sometimes indicate the overall modality and musical key.
where S(t, c) indicates the time-chroma distribution, c ∈ [0, 1) and k ∈ Z, in which Z denotes the integers.
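The octave-folding step can be sketched by mapping FFT bin frequencies to pitch classes; the A4 = 440 Hz reference and the bin-to-pitch-class rounding are assumptions of this sketch:

```python
import numpy as np

def chroma(frame_power, sr, nfft):
    """Fold spectral energy into 12 pitch classes (assuming A4 = 440 Hz)."""
    freqs = np.fft.rfftfreq(nfft, 1 / sr)
    chroma_vec = np.zeros(12)
    for f, p in zip(freqs[1:], frame_power[1:]):     # skip the DC bin
        # MIDI note number of this bin's frequency, then modulo 12 -> pitch class
        pitch_class = int(round(12 * np.log2(f / 440.0) + 69)) % 12
        chroma_vec[pitch_class] += p
    return chroma_vec

sr, nfft = 22050, 2048
t = np.arange(nfft) / sr
power = np.abs(np.fft.rfft(np.sin(2 * np.pi * 440 * t))) ** 2   # pure A4 tone
c = chroma(power, sr, nfft)
```

For a pure 440 Hz tone, the energy collapses onto pitch class 9 (A), regardless of the octave in which the tone is played.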

h. Daubechies wavelet coefficient histograms
Daubechies wavelet coefficients are used because they reflect the local regularity of the signal being analysed. A Daubechies wavelet filter with seven decomposition levels is applied in this method. After decomposition, the histogram of the wavelet coefficients at each sub-band is built [33]. The coefficient histogram closely approximates the waveform variations in each sub-band, and by probability theory its moments uniquely characterise the probability distribution. The first three moments of the histogram, the average, the variance and the skewness of each sub-band, are used to describe the waveform distribution. The DWCH can be computed using the scaling function.
where φ(x) is the scaling function and a_k is the filter coefficient.
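A one-level sketch of the idea, using the standard Daubechies-4 filter taps (the seven-level cascade and the exact histogram binning of the paper are omitted for brevity):

```python
import numpy as np

# Daubechies-4 (db2) low-pass analysis filter taps
H = np.array([1 + np.sqrt(3), 3 + np.sqrt(3),
              3 - np.sqrt(3), 1 - np.sqrt(3)]) / (4 * np.sqrt(2))
G = H[::-1] * np.array([1, -1, 1, -1])      # matching high-pass (wavelet) filter

def dwch_moments(signal):
    """One decomposition level: filter, dyadically downsample, then take the
    first three moments (mean, variance, skewness) of the detail coefficients."""
    detail = np.convolve(signal, G)[1::2]   # high-pass + downsampling by 2
    mu = detail.mean()
    var = detail.var()
    skew = ((detail - mu) ** 3).mean() / (var ** 1.5 + 1e-12)
    return mu, var, skew

m = dwch_moments(np.random.default_rng(1).standard_normal(1024))
```

Repeating the filter/downsample step on the low-pass branch for seven levels, and computing these moments per sub-band, yields the full DWCH descriptor.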

The text-dependent and text-independent features are extracted and then provided to the SAE, which extracts high-level features, because the presence of only low-level features may reduce classification performance. The SAE passes the extracted high-level features to the SVR for emotion recognition. An autoencoder (AE) is an unsupervised learning approach containing three layers: input, hidden and output. The output of each layer is given as input to the successive layer [34]. The SAE learns the hypothesis function h_{w,b}(x) = x (where w is the weight and b the bias) from unlabelled data. The cost function for the SAE is defined in Equation (21), where KL(.) represents the Kullback-Leibler divergence function, β the sparse penalty term and ρ the sparsity parameter; the average activation of hidden unit j is represented as ρ̂_j. Back propagation (BP) with batch gradient descent optimisation is used in the SAE to compute the gradients of the learning weights (Table 2); for each node, the error δ_l^(i) is computed (where l represents the layer and i the node). The probabilities of the output values are statistically estimated by an SVM classifier, which uses the features learnt by the last hidden layer to learn all bias and weight parameters. The architecture of the deep neural network (DNN), with two hidden layers, one input layer and one output layer, is shown in Figure 2.

Algorithm: training of SAE

Step 1: Initialise bias = 0 and the weight matrix randomly in [−τ, τ], where τ is a constant whose value may depend on the total number of neurons in the hidden layers.
Step 2: Perform a forward pass to compute the activations a_l of each layer.
Step 3: Apply the back propagation algorithm to compute the weight and bias gradients: ΔW_l := ΔW_l + δ_{l+1}(a_l)^T, Δb_l := Δb_l + δ_{l+1}.
Step 4: Update the weight w and bias b, where ε represents the learning rate.
After the bias and weight parameters of the output layer are learnt, all w and b parameters of the entire network are fine-tuned simultaneously using the KMBSO algorithm. The SAE is considered a single model with multiple layers, and the fine-tuning across all these layers uses the back propagation algorithm to refine the weight parameters. Based on labelled training samples, the bias and weight parameters of the entire network are learnt by the standard back-propagation algorithm, which mainly aims to reduce misclassification errors. The high-level features learnt by the SAE are given as input to the SVM (output layer) for emotion classification, which separates the data of different classes using a hyperplane. The algorithm for DNN training is shown in Table 3.
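A minimal single-autoencoder training loop (one hidden layer, sigmoid activations, squared-error loss) can be sketched as follows; the layer sizes, learning rate ε = 0.5 and the omission of the sparsity penalty of Equation (21) are simplifying assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden=4, epochs=200, eps=0.5, tau=0.1, seed=0):
    """Learn h_{w,b}(x) ~= x with one hidden layer via batch gradient descent."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    # Step 1: bias = 0, weights uniform in [-tau, tau]
    W1 = rng.uniform(-tau, tau, (n_hidden, n_in)); b1 = np.zeros(n_hidden)
    W2 = rng.uniform(-tau, tau, (n_in, n_hidden)); b2 = np.zeros(n_in)
    for _ in range(epochs):
        # Step 2: forward pass
        a1 = sigmoid(X @ W1.T + b1)             # hidden activations
        a2 = sigmoid(a1 @ W2.T + b2)            # reconstruction
        # Step 3: back propagation, gradient terms delta_{l+1} (a_l)^T
        d2 = (a2 - X) * a2 * (1 - a2)
        d1 = (d2 @ W2) * a1 * (1 - a1)
        # Step 4: update with learning rate eps
        W2 -= eps * d2.T @ a1 / len(X); b2 -= eps * d2.mean(axis=0)
        W1 -= eps * d1.T @ X / len(X);  b1 -= eps * d1.mean(axis=0)
    return ((a2 - X) ** 2).mean()               # final reconstruction error

X = np.random.default_rng(2).uniform(0.2, 0.8, (32, 8))
err = train_autoencoder(X)
```

Stacking means repeating this per layer, feeding each learnt hidden representation into the next autoencoder before the joint fine-tuning stage.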

| Emotion classification using SVR_KMBSO
The SVR applies a nonlinear mapping to project the input data into a high-dimensional feature space, where the resulting linear regression problem is solved. Based on the available dataset, the regression approximation determines the function. The main aim of this modelling process is to find a regression function that accurately predicts the outputs for a new set of input-output samples. Figure 3 illustrates the fundamental idea behind the SVR.
TABLE 3 Training model for deep neural network
Step 1: The SAE determines w and b in the input layer (which contains the extracted features).
Step 2: Determine the feature representation a from the first hidden layer by applying z = w * data + b and f(z), where f is the sigmoid function.
Step 3: Rank the features a with the labels and drop the least important features.
Step 4: Repeat Steps 2 and 3 in the 2nd hidden layer, giving a′ as input, to obtain b and b′.
Step 5: Repeat Steps 2 and 3 for the 3rd hidden layer, giving b and b′ as input, to obtain c and c′.

The main objective is to identify a smooth regression function f(x) while keeping the maximum deviation from all the training targets below ε. The following problem is solved to obtain the SVR model. A nonlinear decision hyper-surface is constructed over the SVR input space by a kernel function; high prediction performance is normally attained with the Gaussian function. Based on the nonlinear kernel function, the dual problem is stated in Equations (22) and (23), where α_i and α*_i are the Lagrange multipliers and k(x_i, x_j) is the kernel function; (α_i − α*_i) is non-zero if x_i is a support vector. In this method, the Gaussian function is taken as the kernel, defined in Equation (24) as

k(x_i, x_j) = exp(−γ ||x_i − x_j||²),

where γ is a positive constant. The user-defined parameters are C, ε and γ, and the accuracy of the entire system depends on them, so the SVR parameters must be set carefully for the model to perform effectively. Parameter selection is therefore a significant step in the SVR model. The SVR model is first trained with the selected parameters; after training, it is tested with validated parameters to obtain an accurate solution. Done by exhaustive search, this process is time-consuming and highly dependent on luck.
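The Gaussian kernel of Equation (24) and the ε-insensitive loss underlying the SVR objective can be illustrated as follows; the parameter values γ = 0.5 and ε = 0.1 are arbitrary examples, not the tuned ones:

```python
import numpy as np

def gaussian_kernel(xi, xj, gamma=0.5):
    """k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), as in Equation (24)."""
    return np.exp(-gamma * np.sum((np.asarray(xi) - np.asarray(xj)) ** 2))

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """SVR loss: deviations within the eps-tube cost nothing."""
    return np.maximum(np.abs(np.asarray(y_true) - np.asarray(y_pred)) - eps, 0.0)

k_same = gaussian_kernel([1.0, 2.0], [1.0, 2.0])       # identical points -> 1.0
loss = eps_insensitive_loss([1.0, 1.0], [1.05, 1.4])   # first lies inside the tube
```

The flat region of the loss is what makes training targets within ε of f(x) contribute nothing, which is exactly the "maximum deviation below ε" requirement stated above.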
Therefore, to minimise time consumption and select the parameters efficiently, a K-Medoid-based brain storm optimisation (KMBSO) algorithm is proposed. This algorithm has two steps: clustering and new individual generation. For clustering the individuals, the k-medoid approach is used. The proposed SVR_KMBSO model dynamically optimises the parameter values of the SVR through the KMBSO algorithm.
| Hyperparameter optimisation using K-Medoid-based brain storm optimisation

Initially, the parameter values are generated randomly in the range of zero to infinity. These parameter values are clustered using K-Medoid clustering: k data items are selected as the initial medoids to represent the k clusters, each remaining item is assigned to the cluster whose medoid is nearest to it, and a new medoid is then determined to better represent each cluster. In this way, K clusters centred on medoids are formed. In this K-Medoid clustering, the distance between the medoid and each individual is calculated using Equation (25).
Here D = √(Σ_i (m_i − d_i)²), where D denotes the distance between the medoids and the data points, m_i denotes the medoids and d_i denotes the data points. In this proposed method, the number of clusters is 3 (K = 3) since three parameters have to be optimised: one cluster for the insensitive loss function ε, one for the selected Gaussian kernel parameter γ and one for the regularisation parameter C. After clustering the data points, random values are generated for each data point in each cluster in the range of 0 to 1. Based on this priority, a data point is selected from each cluster in each iteration and the SVR classification is performed. The fitness function for this optimisation is given in Equation (26), and the objective function is given in Equation (27).
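To illustrate the clustering step, here is a minimal PAM-style K-Medoid sketch in Python on one-dimensional values (a simplified illustration; the function names and the sample points are our own, not from the paper):

```python
def assign(points, medoids):
    """Assign each point to the cluster whose medoid is nearest to it."""
    clusters = {m: [] for m in medoids}
    for p in points:
        nearest = min(medoids, key=lambda m: abs(m - p))
        clusters[nearest].append(p)
    return clusters

def update_medoid(cluster):
    """New medoid = the member minimising total distance to the other members."""
    return min(cluster, key=lambda c: sum(abs(c - p) for p in cluster))

def kmedoid(points, medoids, iters=10):
    """Alternate assignment and medoid update until the medoids stop moving."""
    for _ in range(iters):
        clusters = assign(points, medoids)
        new_medoids = [update_medoid(c) for c in clusters.values() if c]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return sorted(medoids), assign(points, medoids)

points = [0.1, 0.15, 0.2, 5.0, 5.2, 9.8, 10.0]
meds, clusters = kmedoid(points, [0.1, 5.0, 9.8])  # K = 3, as in the text
```

With the three well-separated groups above, the procedure converges in two passes and each returned medoid is an actual member of its cluster, which is what distinguishes K-Medoid from K-Means.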

| Fitness definition
The k-fold cross-validation (CV) applied to the training samples is a well-established technique in machine learning. It verifies that the method accurately predicts the training dataset and, at the same time, reveals whether any of the selected parameters causes over-fitting. The value of k is practically set between 5 and 10, which maintains the trade-off between prediction accuracy and computational cost. In this method, the five-fold CV error, expressed as the mean absolute percentage error (MAPE), is introduced in the SVR to measure the training error:

MAPE = (1/l) Σ_{i=1}^{l} |a_i − p_i| / a_i × 100%

where l represents the number of training data samples, and the actual and predicted values are represented as a_i and p_i, respectively. Solutions with a lower MAPE cross-validation error have a smaller fitness value; consequently, the survival rate of such solutions increases in successive generations. After processing all the individuals in the clusters, new data points are generated for every individual data point using Equations (28) and (29):

d_i^new = d_i^old + δ(T) × R,  i = 1, 2, …, d_n   (28)
δ(T) = logsig((0.5 × I − c)/C) × R   (29)

where d_i^new indicates the new data point, d_i^old indicates the old data point, δ(T) indicates the step-size function, R indicates the random value generated for the old data point, d_n indicates the total number of data points within the cluster, I indicates the total number of iterations, c indicates the current iteration number and C indicates the coefficient that changes the slope of the step-size function.
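The fitness and step-size computations described above can be sketched in Python as follows (function names are illustrative; logsig is taken to be the usual logistic-sigmoid function):

```python
import math

def mape(actual, predicted):
    """Mean absolute percentage error used as the CV fitness (smaller is better)."""
    l = len(actual)
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / l * 100.0

def logsig(x):
    """Logistic sigmoid, as in MATLAB's logsig transfer function."""
    return 1.0 / (1.0 + math.exp(-x))

def step_size(total_iters, current_iter, slope_coeff, rand_val):
    """delta(T) = logsig((0.5*I - c)/C) * R; shrinks as iterations progress."""
    return logsig((0.5 * total_iters - current_iter) / slope_coeff) * rand_val

early = step_size(100, 1, 20, 1.0)   # large steps early on (exploration)
late = step_size(100, 99, 20, 1.0)   # small steps near the end (exploitation)
```

The step size decays from near 1 towards 0 over the run, so new individuals explore widely at first and then settle into fine-grained local search.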
Random values are generated in the range of 0 to 1; the ranking process is then performed, the classification is carried out using the SVR classifier, and the fitness values are evaluated and stored in an array. The same process is repeated until the maximum iteration is reached. Finally, the parameter set that provides the minimum fitness value is selected from the stored values, and the emotion classification is performed using these optimal parameters.
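The overall select-evaluate-keep-best loop can be sketched as follows. This is a simplified random-search skeleton only, not the full KMBSO with clustering and individual generation, and all names and the toy fitness function are hypothetical:

```python
import random

def search_parameters(evaluate, bounds, iterations=50, seed=0):
    """Simplified skeleton: draw candidate (C, epsilon, gamma) triples,
    evaluate their fitness and keep the set with the minimum fitness value."""
    rng = random.Random(seed)
    best_params, best_fitness = None, float("inf")
    for _ in range(iterations):
        candidate = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        fitness = evaluate(candidate)  # e.g. the five-fold CV MAPE of the SVR
        if fitness < best_fitness:
            best_params, best_fitness = candidate, fitness
    return best_params, best_fitness

# Toy fitness standing in for the CV MAPE; its minimum sits near C=1, eps=0.1, gamma=0.5.
toy = lambda p: (p["C"] - 1) ** 2 + (p["epsilon"] - 0.1) ** 2 + (p["gamma"] - 0.5) ** 2
best, fit = search_parameters(toy, {"C": (0.01, 10.0),
                                    "epsilon": (0.001, 1.0),
                                    "gamma": (0.01, 2.0)}, iterations=200)
```

The KMBSO described in the text replaces the naive random draw with cluster-guided generation of new candidates, but the bookkeeping of the best parameter set is the same.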

| EXPERIMENTAL ANALYSIS
This proposed mood recognition process is implemented on the MATLAB 2017a platform, and performance measures such as accuracy, precision, sensitivity, F1-score, false positive rate (FPR), specificity and error are evaluated in both text-dependent and non-text-dependent manners and compared with other classifier models such as SVR, recurrent neural network (RNN) and DBN. In this work, one-dimensional features are used. The parameter settings of the proposed algorithm are shown in Table 4.

| Dataset description
The proposed method is evaluated on three datasets: two for English and one for Hindi. The descriptions of the three datasets are given below. Of the 777 music clips, 400 clips are taken for training and 377 for the testing process. These clip samples are distributed over four different mood categories. The music clips in this dataset are provided with a plain-text (.txt) file that contains the lyrics of each clip along with time tags (i.e. the time offset of each sentence relative to the start of the music). In each training/testing sample set, four information fields are separated by colons: the music number in the package, the music title, the performer's name and the duration of the music.
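As a small illustration of reading one such colon-separated entry, the sketch below parses the four fields. The concrete values and the exact on-disk formatting (e.g. of the duration field) are assumptions, not taken from the dataset files themselves:

```python
def parse_sample_line(line):
    """Split one training/testing entry into its four colon-separated fields:
    music number in the package, music title, performer's name and duration."""
    number, title, performer, duration = line.strip().split(":")
    return {"number": int(number),
            "title": title,
            "performer": performer,
            "duration": duration}

# Hypothetical entry; real files may format the fields differently.
record = parse_sample_line("42:Some Title:Some Performer:215s")
```

A production parser would also need to handle malformed lines and any colons occurring inside titles, which the plain split above does not.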
Hindi dataset: For Hindi, no dataset is available with these four emotions, so songs covering these emotions were collected from online sources and used as the Hindi dataset.

The song count in each emotion for the three datasets is provided in Table 5.

| Performance metrics
The performance metrics used to evaluate this proposed method are accuracy, precision, sensitivity, specificity, false positive rate (FPR), F1-score and error. Accuracy estimates the overall correctness of the classification. Precision is the fraction of recognised instances that are relevant, and sensitivity is the fraction of relevant instances that are retrieved. Specificity evaluates the proportion of actual negatives that are correctly identified. The FPR is the ratio of the number of negatives incorrectly labelled as positive to the total number of actual negatives. The F1-score is the harmonic mean of precision and recall, and error measures the misclassification percentage. The formulas used for the performance evaluation are given in Equations (30)-(36):

Accuracy = (TP + TN)/(TP + TN + FP + FN)
Precision = TP/(TP + FP)
Sensitivity (Recall) = TP/(TP + FN)
Specificity = TN/(TN + FP)
FPR = FP/(FP + TN)
F1-score = 2 × (Precision × Recall)/(Precision + Recall)
Error = (FP + FN)/(TP + TN + FP + FN)

where TP indicates true positive, TN indicates true negative, FP indicates false positive and FN indicates false negative.
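These confusion-matrix definitions can be transcribed directly into Python; the count values used in the example below are invented for illustration:

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics computed from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall is the same quantity as sensitivity
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),
        "f1": 2 * precision * recall / (precision + recall),
        "error": (fp + fn) / total,
    }

# Hypothetical counts for one emotion class.
m = classification_metrics(tp=90, tn=85, fp=5, fn=10)
```

Note the built-in consistency checks: accuracy and error always sum to 1, and the F1-score reduces to 2·TP/(2·TP + FP + FN).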

| Performance evaluation
The methods taken for comparison are SVR, DBN and RNN. A brief description of these three techniques follows. DBN: The deep belief network is a well-known deep learning architecture. Its layered structure is mainly utilised to learn the hidden structure of the patterns and to extract the important parameters from the data. A number of restricted Boltzmann machines (RBMs) are stacked in this structure, so it learns high-level representations that are valuable for music emotion recognition. The nodes in each layer are 18-30, 30-30, 30-30 and 30-7; the total number of RBMs is 3, the learning rate is 0.1, the batch size is 84 and the weight decay is 0.0002. SVR: SVR applies the SVM formulation to regression problems.
The main goal of SVR is to discover a hyperplane in a feature space, as in the SVM; however, the hyperplane of SVR predicts the distribution of the original data, whereas the hyperplane of the SVM separates the data into two parts. Finally, [1560, 5] is the dimension of the output layer, in which the number of output music emotion labels is given as five. The performance metrics are evaluated for the proposed and existing methods, and the proposed method's performance values are tabulated in Table 6.
The proposed technique has the highest accuracy of 98.79% when using songs from the ISMIR2012 dataset. From the same dataset, when using karaoke of the songs, the proposed method attains an accuracy of 97.4%, and when using the lyrics, it achieves the highest accuracy of 97.23%. Comparing the three datasets, the ISMIR2012 dataset achieves the highest accuracy for songs, karaoke and lyrics; among songs, karaoke and lyrics, the accuracy is highest for songs. When using the NJU_V1 dataset, the highest accuracies for songs, karaoke and lyrics are 97.88%, 96.82% and 95.76%, respectively. When using the developed Hindi dataset, the highest accuracies for songs, karaoke and lyrics are 97.57%, 95.63% and 94.17%, respectively. Similarly, for the remaining two datasets, the proposed SVR_KMBSO attains the higher accuracy result, most particularly for songs. Considering the songs, the TP values for angry, happy, relax and sad are 70, 104, 98 and 97, respectively; for karaoke, they are 68, 103, 98 and 96; and for lyrics, they are 68, 102, 97 and 94. The correctly classified emotions come under the TP class. The presence of the modified BSO has improved the recognition rate of the SVR; due to this, the accuracy rate also improves.
The overall accuracy attained by the proposed approach for songs, karaoke and lyrics is found to be 97.57%, 95.63% and 94.17%, respectively. The accuracy attained by the proposed SVR_KMBSO for songs in the Hindi dataset is 98.21% (angry), 97.5% (happy), 97% (relax) and 97.67% (sad).
The accuracy, precision, sensitivity, F1-score, specificity, error and FPR of the proposed method and the existing SVR, DBN and RNN classifiers are compared for the various datasets in the figures. Figure 7 illustrates the accuracy comparison. Consistent with Table 6, the proposed technique reaches its highest accuracy of 98.79% for songs from the ISMIR2012 dataset, 97.4% for karaoke and 97.23% for lyrics from the same dataset; across the three datasets, ISMIR2012 achieves the highest accuracy for songs, karaoke and lyrics, and among the three input types, accuracy is highest for songs. When comparing the classifiers, the proposed SVR_KMBSO method achieves the best accuracy for songs, karaoke and lyrics alike. For the NJU_V1 dataset, the highest accuracies for songs, karaoke and lyrics are 97.88%, 96.82% and 95.76%, respectively, and for the developed Hindi dataset they are 97.57%, 95.63% and 94.17%, respectively. Figure 8 shows that the proposed method has the highest sensitivity of 98.78% when using songs from the ISMIR2012 dataset and attains 97.38% for karaoke from the same dataset; the proposed SVR_KMBSO achieves the best sensitivity for songs, karaoke and lyrics.
For precision (Figure 9), the proposed method attains 97.42% when using karaoke from the ISMIR2012 dataset and the highest value of 97.22% when using its lyrics. Across the three datasets, ISMIR2012 achieves the highest precision for songs, karaoke and lyrics, and among these, the precision is highest for songs.
The proposed method has the highest F1-score of 98.77% (Figure 10) when using songs from the ISMIR2012 dataset. From the same dataset, when using karaoke of the songs, the proposed method attains an F1-score of 97.4%; when using the lyrics, it achieves the highest F1-score of 97.22%.
The specificity graph in Figure 11 shows that the proposed method has the highest specificity of 99.6% when using songs. Across the three datasets, ISMIR2012 achieves the highest specificity for songs, karaoke and lyrics. For the NJU_V1 dataset, the highest specificity values for songs, karaoke and lyrics are 99.29%, 98.95% and 98.58%, respectively; for the developed Hindi dataset, they are 99.2%, 98.55% and 98.05%, respectively. Figure 12 shows that the proposed method has the lowest FPR of 0.40% when using songs from the ISMIR2012 dataset, 0.87% for karaoke and 0.92% for lyrics from the same dataset. Across the three datasets, ISMIR2012 achieves the lowest FPR for songs, karaoke and lyrics.
From the comparison shown in Figure 13, the proposed method achieves the lowest error of 1.21% for songs from the ISMIR2012 dataset. Across all three datasets, the error rate attained by the proposed SVR_KMBSO is lower than that of the other existing techniques. This is because the SAE in the proposed approach extracts only high-level features and discards the low-level ones; eliminating the low-level features reduces the feature dimensionality, which in turn lightens the burden on the classification process. Therefore, the proposed SVR_KMBSO attains a lower error rate.
The training and testing accuracies of the ISMIR2012, NJU_V1 and Hindi datasets with respect to songs, karaoke and lyrics are shown in Figures 14-16. Training accuracy is obtained by estimating the classification accuracy of a model on the same data on which it was trained. Testing accuracy is estimated using two separate datasets: one to train the model (the training data) and the other to calculate the classification accuracy (the testing data). As the number of files increases, the training and testing accuracies also increase. Compared with the testing accuracy, the training accuracy is greater for all sets, including songs, karaoke and lyrics. Considering the training accuracy of songs and karaoke for 1425 samples, the proposed method is 2% higher than the SVR method, 6% (songs) and 7% (karaoke) higher than the DBN approach and 8% higher than the RNN method. For the training accuracy of lyrics with 1425 samples, the proposed method is 5% higher than SVR, 6% higher than DBN and 7% higher than RNN. For the testing accuracy of songs with 1140 samples, the proposed method is nearly 4% higher than SVR, 11% higher than DBN and 11% higher than RNN. For the testing accuracy of karaoke and lyrics with 1140 samples, the proposed method is nearly 4% (karaoke) and 5% (lyrics) higher than SVR, 12% (karaoke) and 14% (lyrics) higher than DBN, and 14% (karaoke) and 13% (lyrics) higher than RNN. Normally, the training accuracy is higher than the testing accuracy, but in some cases the training accuracy is found to be less than the testing accuracy; this is because the features selected for emotion recognition cause under-fitting during the learning process.
To overcome this issue, the KMBSO algorithm is hybridised with the SVR instead of back-propagation (BP). This hybrid technique attains a higher training accuracy than testing accuracy. Table 8 shows the training and testing accuracy of the NJU_V1 dataset with respect to songs, karaoke and lyrics. The training and testing accuracies of the proposed SVR_KMBSO are compared with those of three existing techniques: SVR, DBN and RNN. Among all these techniques, the accuracy attained by the proposed SVR_KMBSO is the highest. From this result, we can conclude that the proposed SVR_KMBSO attains a better accuracy result for songs than for lyrics and karaoke, because song-based emotion recognition inherits the benefits of both text-dependent and text-independent features; due to this, the accuracy of song-based music recognition is high.
Table 9 shows the training and testing accuracy of the Hindi dataset with respect to songs, karaoke and lyrics. Compared to the testing accuracy, the training accuracy is higher for all sets, including songs, karaoke and lyrics. Considering the training accuracy of songs for 465 samples, the proposed method is 3% higher than the SVR method, 7% higher than the DBN approach and 6% higher than the RNN method.
The accuracy of the proposed approach is compared with different existing techniques, and the comparison results are shown in Table 10. From this comparison it is clear that, among all the existing techniques, the proposed SVR_KMBSO performs best because it includes a modified BSO algorithm alongside the SVR to improve the recognition result. This hybridisation achieves a better accuracy result with a lower error rate. The computational complexity of the proposed SVR_KMBSO is also lower than that of the existing techniques because it uses the KMBSO algorithm for parameter optimisation; this parameter optimisation avoids over-fitting while maximising the recognition accuracy. The existing algorithms, in contrast, allow solutions to fall into local optima; therefore, their computational complexity is high.

| CONCLUSION
Recently, machine learning has gained a prominent role in every field due to its high performance. Music is naturally integrated into the everyday life of all human beings; it harmonises with the sensations of the listener and also reveals emotion. To date, a number of techniques exist for music mood recognition, but none of them provides high accuracy with low computational complexity. An efficient supervised framework with an autoencoder-based optimised SVR model is developed for MER. In the feature extraction process, features are extracted from songs, karaoke and lyrics as text-dependent and non-text-dependent features. For high-level feature representation, a stacked autoencoder with two hidden layers is used. The KMBSO-based support vector regression model is utilised for the emotion classification, and the optimal parameters of the SVR are selected by the KMBSO algorithm, so the performance of the SVR is enhanced over the conventional support vector regression. Three datasets are used for the performance evaluation; two of them (ISMIR2012, NJU_V1) contain English songs and one (the Hindi dataset) contains Hindi songs. The training and testing