Audio classification using grasshopper-ride optimization algorithm-based support vector machine

Accurate and robust audio detection has grown widely with speech technology in areas such as audio forensics and speech recognition. In real time, however, it is challenging to deal with the massive data arriving from distributed sources. This study therefore introduces a method that effectively handles data from distributed sources using the map-reduce framework (MRF). The map and reduce functions in the MRF perform feature extraction and audio classification, respectively. The robust classification using the proposed grasshopper-ride optimization algorithm-based support vector machine (G-ROA-based SVM) uses features such as multiple kernel Mel frequency cepstral coefficients, spectral flux, spectral kurtosis, and the delta-amplitude modulation spectrogram. The proposed G-ROA integrates ROA and the grasshopper optimization algorithm to tune the optimal weights of the SVM, and the kernel function in the SVM is modified using the Gaussian radial basis function, Gaussian kernel, and polynomial kernels. The proposed method is experimented on two datasets, namely the TUT sound event 2017 dataset and the ESC dataset. The TUT sound event 2017 dataset consists of eight audio recordings from a single acoustic scene. The ESC dataset consists of three parts and 252,

frequency from the original audio data, spectrograms are employed [1,14,15]. To minimize the input space dimensions, feature extraction proceeds using the linear prediction cepstrum coefficient and Mel-frequency cepstral coefficients (MFCCs) [1]. Though they possess complex sound proportions, many details are common among audio scenes belonging to the same category; however, matching is a serious criterion in discovering and matching the patterns [16]. Features obtained from the audio, such as power spectral density, RASTA analysis, and frequency bands generated from a filter bank, have been used together with recurrent neural networks and k-nearest neighbour for classification [17]. Hidden Markov models (HMMs) [18] are employed to model the conventional audio features, although a sound event possesses various representations in noisy and clean environments [19]. Traditional classification methods, such as nearest neighbour [20], support vector machine (SVM) [21], HMM [22], deep neural network [23], and Gaussian mixture model [24], are employed. At present, deep learning attracts research in audio processing, for which the deep learning architecture termed the deep belief network (DBN) is used [25][26][27]. In parallel computing frameworks, performance is high because of the processing speed of classification; the map-reduce function is commonly used to increase the processing speed [28]. Visual features of sounds are built starting from the audio file and are taken from images constructed from different spectrograms, a gammatonegram, and a rhythm image [29].
The main aim of this study is to design and develop an audio classification system in the MRF for the reliable detection of an event, that is, to improve the accuracy. The proposed system uses mapper and reducer functions, wherein pre-processing and feature extraction are carried out in the mapper phase and classification is done in the reducer phase. The proposed G-ROA is obtained by integrating ROA [30] and the grasshopper optimization algorithm (GOA) [31]. The proposed G-ROA inherits the advantages of both algorithms and tunes the optimal weights of the SVM so that the detection of audio becomes effective.

The major contribution of the research
G-ROA-SVM for classification of the audio: The audio classification is performed using the G-ROA-SVM classifier. Here, the SVM classifier is trained by the proposed G-ROA, which is the integration of ROA and GOA. The G-ROA tunes the optimal weights of the SVM, and the standard kernel functions in the SVM are modified using the radial basis function (RBF), Gaussian kernel, and polynomial kernel, which enhances the accuracy of the audio classification.
The rest of the study is organized as: Section 2 reveals the review of the existing methods with the reasonable challenges that insist the need for the new method of audio classification. The proposed method of audio classification is detailed in Section 3, and Section 4 presents the results of the proposed method. Finally, Section 5 concludes the study with the effective proof for audio classification.

| MOTIVATION
In this section, the review of the literature is presented along with the challenges that motivated the work.

| Literature survey
In this section, eight existing works are reviewed along with the pros and cons of each developed method. Waldekar and Saha [19] developed a fused system framework in which the significance lay in the frame-level statistics of the known spectral features, which formed the input to an SVM and proved the outperforming nature of the system. The drawback of the method concerned the binaural data format. Ali and Talha [32] developed a method for VAD based on unsupervised learning, and hence there was no need for training data to distinguish the voiced segments from the unvoiced ones. The drawback of the method was that the samples at the end of the audio were not part of any frame as a result of insufficient samples. Wang et al. [27] developed the hierarchical-diving deep belief network (HDDBN), which was robust even in noisy situations. The drawback of the method was that the performance was affected when the low-level DBNs tuned the network over the iterations. Phan et al. [16] developed a scene classification method using convolutional neural networks (CNNs) and label-tree embeddings, which reduced the low-level features to likelihoods of metaclasses and made the learning and matching of templates efficient. The drawback of the method concerned the segment size, which should not be too short; otherwise, more details of the signals were required, leading to unreliable estimation by the random forest classifier using the posterior probabilities at the time of feature learning with the label tree embedding (LTE). Souli and Lachiri [33] performed audio classification with an SVM and scattering features. The ability to represent nonstationary signals enables discrimination of events, including filters for time and rhythms.
Moreover, the classification accuracy of the method was not good in the presence of environmental noise but was effective in clean environments. Wu et al. [1] developed an attention-augmented CNN that enhanced the features generated from the frequency bands. The drawback of the method concerned the higher number of local frequency segments, as this would increase the parameters in the model, affecting the performance of the system. Hong et al. [34] developed a non-negative matrix factorization-based feature learning mechanism that worked robustly even with highly noisy environmental speech. The method was effective when compared with other feature selection methods, and the drawback was that the training data samples were not guaranteed to remain independent. Arumugam and Kaliappan [35] developed a feature selection strategy using the modified bacterial foraging optimization algorithm (MBFOA) that was efficient in the case of multimedia applications, but the performance was questionable. SINGH AND JAISWAL - 435

| Challenges
The challenges of the research are revealed below:
- The classification was better, but the error rates remained high during audio classification [32,35].
- In deep learning [6], there is a need for a large amount of data and also for signal-specific feature engineering [32,34,36,37].
- The 'bag-of-frames' system is superior to the simpler one-point average approach when evaluating the other three available audio scene datasets possessing less within-class variability [19,30,31].
- In latent Dirichlet allocation and probabilistic latent semantic analysis methods, there is no assumption regarding the generation of the document-topic distribution [38].
- Sparse representations of the noise, represented as the basis vectors of the transformation matrix, appear to be different from the actual signal [4,6,26,39].

| PROPOSED AUDIO CLASSIFICATION STRATEGY USING THE OPTIMIZATION ENABLED SVM CLASSIFIER
The audio classification seems to be a complex and tedious process, as the processing time is large when dealing with big data. Thus, the study uses the MRF for the classification of the audio to reduce the processing time in such a way that parallel processing of the huge data is enabled. There are three major steps: pre-processing, feature extraction, and finally, audio classification. The proposed system uses mapper and reducer functions, wherein pre-processing and feature extraction are carried out in the mapper phase and the classification is done in the reducer phase. The pre-processing helps to remove the background non-voice region from the signal and enables feature extraction. The feature extraction helps to classify the audio. The audio classification is performed using the G-ROA-SVM classifier. The SVM classifier is trained by the proposed G-ROA, which is the integration of ROA and GOA. For audio classification, a trained SVM classifier is employed in which the kernel is replaced with a new kernel function and the training is performed using the proposed algorithm, G-ROA. Finally, the features are processed using the proposed G-ROA-based SVM. Figure 1 shows the block diagram of the proposed audio classification strategy.

| Map-reduce framework for audio classification
Map-reduce is a framework for processing and generating a huge amount of data, and its structure is organized as a single master node and a number of worker nodes. Let the input data be represented as S, which carries the input audio signals; the database S resembles big data. Hence, for effective audio classification, the input database is split into subsets of data, which is the role of the master node in the MRF. Let the data subsets be denoted as S = {S_1, S_2, ..., S_n}, where n refers to the total number of data subsets; the subsets formed from the database are the input to the mapper phase. The individual audio signal from the database is represented as y.
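The master node's splitting role can be illustrated with a minimal sketch (the `split_database` helper and file names are hypothetical, not the paper's implementation):

```python
def split_database(S, n):
    """Master node sketch: partition the audio database S into n subsets,
    one per mapper, spreading any remainder across the first subsets."""
    base, extra = divmod(len(S), n)
    subsets, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        subsets.append(S[start:start + size])
        start += size
    return subsets

signals = [f"audio_{i}.wav" for i in range(10)]  # placeholder names
subsets = split_database(signals, 3)
print([len(s) for s in subsets])  # → [4, 3, 3]
```

Each subset would then be handed to one mapper for pre-processing and feature extraction.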

| Mapper phase
The mappers are denoted as m = {m_1, m_2, ..., m_g, ..., m_n}. It is clear from Equations (1) and (2) that the total number of mappers is equal to the total number of data subsets. The features are extracted from the pre-processed audio signals, which form the intermediate data of the individual mapper. The following are the steps in the mapper phase: a) Pre-processing: The major step is pre-processing, which processes the signals so as to remove the background non-voice region from the signal and enable effective feature extraction. b) Feature extraction: The second step in the mapper phase is feature extraction, and these features are used to classify the audio.
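The paper does not specify the exact pre-processing method; a simple energy-based sketch of non-voice removal (an assumed frame-energy threshold, with illustrative parameters) could look like this:

```python
import numpy as np

def remove_non_voice(y, frame_len=256, rel_threshold=0.1):
    """Drop frames whose RMS energy falls below rel_threshold times the
    maximum frame energy, as a crude stand-in for non-voice removal."""
    n = len(y) // frame_len
    frames = y[:n * frame_len].reshape(n, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return frames[rms >= rel_threshold * rms.max()].ravel()

t = np.arange(1024)
y = np.concatenate([np.zeros(1024), np.sin(2 * np.pi * 8 * t / 256)])
print(len(remove_non_voice(y)))  # → 1024 (the silent half is removed)
```

Real systems would typically use a more robust voice-activity detector, but the principle of discarding low-energy background frames is the same.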

Spectral kurtosis of the audio signal:
The spectral kurtosis (SK) [40] plays a dominant role in filtering and in recovering randomly occurring signals corrupted with additive noise, and it is defined through the fourth-order cumulant of the Fourier transform (FT). Let the discrete random signal be represented as y(b) and Q(v) be its P-point discrete FT; the SK at the individual frequency bin v is given as:

SK(v) = f_4(v) / f_2^2(v)

where f_u specifies the u-th order cumulant, formed from Q(v) and its complex conjugate Q*(v). The dimension of the SK feature is [1 × 1].
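A moment-based estimator of SK can be sketched as follows (this uses the common normalized fourth-moment form with expectations approximated by frame averaging; the frame length and test signal are illustrative, not taken from the paper):

```python
import numpy as np

def spectral_kurtosis(y, frame_len=256):
    """Estimate SK per frequency bin: SK(v) = E[|Q(v)|^4] / E[|Q(v)|^2]^2 - 2,
    averaging over frames. The '-2' normalizes SK to roughly 0 for
    stationary complex-Gaussian bins, so impulsive content stands out."""
    n = len(y) // frame_len
    frames = y[:n * frame_len].reshape(n, frame_len)
    Q = np.fft.fft(frames, axis=1)                 # P-point discrete FT per frame
    mag2 = np.abs(Q) ** 2
    return np.mean(mag2 ** 2, axis=0) / (np.mean(mag2, axis=0) ** 2 + 1e-12) - 2

rng = np.random.default_rng(0)
sk = spectral_kurtosis(rng.standard_normal(256 * 256))
# White Gaussian noise is non-impulsive, so SK stays near 0 in every bin
print(float(np.abs(sk).mean()))
```

Bins dominated by additive stationary noise thus score near zero, which is what makes SK useful for isolating transient signal content.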

Multiple kernel Mel frequency cepstral coefficient feature:
The need for the multiple kernel Mel frequency cepstral coefficient (MKMFCC) feature [41] is that it extracts all possible phonetic components from the audio signal by considering both the low- and high-energy frames of the audio signal. Let the MKMFCC features be represented as F_2; the dimension of the features is [1 × 13].

Spectral flux of the audio signal:
Spectral flux is a measure of the change in the local spectrum. Let the spectral flux [42] of the spectrum be represented as F_3, with dimension [1 × 1].

Delta-amplitude modulation spectrogram features:
The meagre variations in the frequency and time domains of the signal are taken into account to represent the delta-amplitude modulation spectrogram (AMS) features [43]. The delta-AMS feature ΔR_p(τ, ω) is computed from the AMS feature vector R_p(τ, ω), where p refers to the total segments or frames. The dimension of the delta-AMS features is [1 × 64]. Therefore, the feature vector corresponding to the audio signal y is represented as the concatenation of the above features, where I_g corresponds to the intermediate data of the g-th mapper. The intermediate data is the input to the reducers in the reducer phase, which performs the audio classification.
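As one concrete example, spectral flux can be computed as the frame-to-frame change of the normalized magnitude spectrum (a common definition; the exact normalization in [42] may differ):

```python
import numpy as np

def spectral_flux(y, frame_len=512):
    """Mean squared change of the normalized magnitude spectrum between
    successive frames: low for steady tones, higher for changing content."""
    n = len(y) // frame_len
    frames = y[:n * frame_len].reshape(n, frame_len)
    mags = np.abs(np.fft.rfft(frames, axis=1))
    mags /= mags.sum(axis=1, keepdims=True) + 1e-12   # spectrum shape only
    return float(np.mean(np.sum(np.diff(mags, axis=0) ** 2, axis=1)))

t = np.arange(8192)
tone = np.sin(2 * np.pi * 16 * t / 512)               # identical in every frame
noise = np.random.default_rng(1).standard_normal(8192)
print(spectral_flux(tone) < spectral_flux(noise))  # → True
```

A steady tone yields near-zero flux because its spectrum does not change between frames, whereas noise-like or event-rich audio scores higher.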

| Reducer phase
The number of reducers equals the total number of classes, and the reducers are denoted as r = {r_1, r_2, ..., r_x, ..., r_q}, where r_x specifies the x-th reducer and there are a total of q reducers. The class outputs from the reducers are denoted as C = {C_1, ..., C_x, ..., C_q}, where C_x is the class value of the x-th reducer. For the detection of the class label, the trained SVM is employed, and the processing steps of the classifier are discussed below.
Trained SVM classifier for audio classification: The SVM [43,45] handles the unconstrained optimization issue with a large-margin classifier aiming at the maximization of the margin in determining the optimal hyperplane. The training algorithm, G-ROA, adaptively tunes the capacity, enabling effective classification accuracy. Let the input vector I with n intermediate data belong to either of the two classes, Class 1 or Class 2, as given in Equation (10), where C_x is the x-th class label. The parameters of the decision function X(I) are determined at the time of learning, and later, during testing, the unknown patterns are classified using the following formula, in which X(I) is computed based on the predefined functions of I.
where w_x and w_2 are the adjustable parameters of the classifier, namely the weight and bias of the standard SVM, which are determined based on the optimization algorithm. I_x is the training pattern and K(I_x, I) refers to the kernel function. In general, the kernel function K(I_x, I) exploits the exponential kernel, whereas here a new kernel function is used, formed by summing three kernels: the polynomial, Gaussian, and RBF kernels. The advantage of using the new kernel is that the classification accuracy and the effectiveness of classification are better than with the existing kernel. The kernel is given as:

K(I_x, I) = γ_1 K_poly(I_x, I) + γ_2 K_Gauss(I_x, I) + γ_3 K_RBF(I_x, I)

where γ_1, γ_2, and γ_3 are the kernel weights that are determined using the proposed G-ROA. The polynomial kernel is represented as K_poly(I_x, I) = (I_x · I + 1)^d, where d refers to the degree of the polynomial. The Gaussian kernel is formulated as K_Gauss(I_x, I) = exp(−||I_x − I||^2 / 2σ^2), where σ is a Gaussian constant that takes care of the dimensionality issues. The RBF kernel is given as K_RBF(I_x, I) = exp(−γ||I_x − I||^2). Thus, the dimensionality issue in the classifier is tackled using the Lagrangian model, for which the norm needs to be minimized, maximizing the margin. The Lagrangian model is given as N(z, w_2, w_x), where w_x specifies the Lagrange multiplier, which satisfies the condition w_x [C_x X(I_k) − 1] = 0. Thus, the problem is resolved by determining the minimal saddle point of N(z, w_2, w_x) in terms of z.
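The summed kernel can be sketched as follows (the weights g1–g3 stand in for γ_1, γ_2, γ_3, which G-ROA would tune; the degree, σ, and γ values are illustrative assumptions):

```python
import numpy as np

def combined_kernel(X1, X2, g1=0.4, g2=0.3, g3=0.3, d=2, sigma=1.0, gamma=0.5):
    """Weighted sum of polynomial, Gaussian, and RBF kernels between the
    rows of X1 and X2. With non-negative weights, the sum is itself a
    valid (positive semi-definite) kernel."""
    lin = X1 @ X2.T
    sq = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * lin
    poly = (lin + 1) ** d                       # polynomial kernel
    gauss = np.exp(-sq / (2 * sigma ** 2))      # Gaussian kernel
    rbf = np.exp(-gamma * sq)                   # RBF kernel
    return g1 * poly + g2 * gauss + g3 * rbf

X = np.random.default_rng(2).standard_normal((5, 3))
K = combined_kernel(X, X)
print(np.allclose(K, K.T))  # → True (a kernel matrix must be symmetric)
```

Because a non-negative combination of valid kernels is again a valid kernel, tuning γ_1–γ_3 lets the optimizer blend the three similarity measures without breaking the SVM's convexity guarantees.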

Proposed G-ROA for optimal tuning of the SVM classifier
The ultimate goal of the proposed G-ROA is to derive the optimal kernel weights for tuning the SVM classifier, and G-ROA follows the steps of GOA [31] with the GOA update equation modified using ROA [30]. The steps of the proposed G-ROA are demonstrated below:

i) Solution encoding: The purpose of encoding is to represent the solution vector derived using G-ROA, which is of dimension [1 × 3]. The optimal values of the kernel weights γ_1, γ_2, and γ_3 are derived using G-ROA.

ii) Fitness measure: The fitness of the solutions is evaluated based on the mean square error, given as:

MSE = (1/n) Σ_{x=1}^{n} (O_x − O_est)^2

where O_x is the output corresponding to the x-th training sample and O_est refers to the estimated output of the classifier. The fitness is determined based on Equation (19).

iii) Update the value of the coefficient a: The coefficient a is the effective parameter that balances the exploration and exploitation phases, such that a decreases with increasing iterations.

iv) Update the position of the current search agent: The position of the l-th grasshopper is updated based on three factors: social interaction, gravity force, and wind advection. In the standard GOA equation, Equation (22), a is the decreasing coefficient, U_∂ and L_∂ are the upper and lower bounds in the ∂-th dimension, the target is denoted as G_∂, D_lκ is the distance between grasshoppers l and κ, K_κ^∂ specifies the position of the κ-th grasshopper in the ∂-th dimension, and likewise K_l^∂ is the position of the l-th grasshopper in the ∂-th dimension. Assume that the dimension ∂ = (i, j) and fix m = κ = 1. The resulting equation reveals that the next position of the l-th grasshopper is based on the position of the l-th grasshopper at time τ, the target location, and the other grasshopper positions. The position update based on other grasshoppers in the search space locates the search agent around the target, which is due to a.
Additionally, the repulsion, attraction, and comfort zones of the grasshoppers are decreased using the second a in Equation (22), whereas the outer a in Equation (22) balances the phases. The ability to deal with multiple objectives is enabled through the integration of ROA into the above equation; the attractive and repulsive phases increase global convergence and local-optimum avoidance. The bypass update rule of ROA is used to modify Equation (22). Assuming η = i, Equation (24) is rewritten as Equation (26), and substituting Equation (26) into Equation (22) yields Equation (27), the update equation of G-ROA, which enhances convergence to the global optimum and avoids convergence to a local optimum.

v) Terminate: The steps are repeated up to the maximal number of iterations until the global best solution is derived, and this solution represents the kernel weights of the SVM.
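The decreasing coefficient and the GOA social-interaction term can be sketched as follows (these are the standard GOA forms with illustrative constants; the full G-ROA bypass-rider modification is not reproduced here):

```python
import numpy as np

def coefficient_a(it, max_it, a_max=1.0, a_min=1e-5):
    """Linearly decreasing coefficient: large early in the run (exploration),
    small late in the run (exploitation)."""
    return a_max - it * (a_max - a_min) / max_it

def social_force(r, f=0.5, l=1.5):
    """GOA social interaction s(r) = f*exp(-r/l) - exp(-r):
    repulsive at short range, attractive at longer range."""
    return f * np.exp(-r / l) - np.exp(-r)

print(coefficient_a(0, 100) > coefficient_a(50, 100) > coefficient_a(100, 100))  # → True
print(social_force(0.5) < 0 < social_force(3.0))  # → True (repel near, attract far)
```

The sign change of s(r) with distance is what creates the comfort zone between grasshoppers, and shrinking a over the iterations contracts that zone around the best solution found so far.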

| RESULTS AND DISCUSSION
The results of the method are presented in this section with an interpretation of the comparative results. The effectiveness is revealed through comparison, and the significance of the method is presented in a comparative table that identifies the most effective method.

| Experimental setup
The experimentation of the proposed G-ROA algorithm is performed in the MATLAB tool installed on a PC with the Windows 10 OS. Table 1 describes the simulation setup of the proposed system.

| Dataset 1: TUT sound event 2017 dataset
Dataset 1 [36] is the TUT sound event dataset, and it consists of recordings of acoustic scenes from streets with different traffic levels along with other activities. The audio in the dataset is a subset of the TUT Acoustic Scenes 2017 dataset. The scene was selected as representing an environment of interest for the detection of sound events related to human activities and hazard situations. A total of 24 audio recordings have been obtained from a single acoustic scene.

| Dataset 2: ESC dataset
The ESC dataset [46] is a collection of short environmental recordings available in a unified format: 5-s long clips, 44.1 kHz sampling, and a single channel. All clips have been extracted from public field recordings available through the Freesound project.

| Experimental analysis
This section presents the sample input audio signals used for the audio classification. Figure 2 shows the sample results of the proposed G-ROA-SVM classifier for audio classification. Figure 2(a) and (b) show the input audio signals 1 and 2, respectively, and Figure 2(c) and (d) demonstrate the feature output of the delta-AMS feature descriptor for input audio signals 1 and 2.

| Performance metrics
This section presents the metrics employed for the comparison, which are essential for demonstrating the effectiveness of the audio classification method.

| Accuracy
Accuracy is the fraction of correctly classified samples, given as:

Accuracy = (p^+ + n^+) / (p^+ + n^+ + p^- + n^-)

where p^+ and n^+ are the true positives and true negatives and p^- and n^- are the false positives and false negatives.

| False rejection rate
False rejection rate (FRR) is given as:

FRR = n^- / (n^- + p^+)

| False alarm rate
False alarm rate (FAR) is formulated as:

FAR = p^- / (p^- + n^+)
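Taken together, the three metrics can be computed from the confusion counts; since the paper's equations are not reproduced here, the FRR and FAR formulas below follow the common convention:

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, false rejection rate, and false alarm rate.
    FRR = fn / (fn + tp): true events wrongly rejected.
    FAR = fp / (fp + tn): non-events wrongly accepted."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    frr = fn / (fn + tp)
    far = fp / (fp + tn)
    return accuracy, frr, far

acc, frr, far = classification_metrics(tp=90, tn=85, fp=10, fn=15)
print(round(acc, 3), round(frr, 3), round(far, 3))  # → 0.875 0.143 0.105
```

A good classifier drives accuracy up while pushing both FRR and FAR down, which is the trade-off examined in the comparative analysis below.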

| Competing methods
The methods used for the comparison include MultiSVM [45], CNN [16], deep belief neural network (DBN) [27], and unsupervised VAD [32]. The proposed G-ROA-SVM is compared with the existing methods to prove the effectiveness of the proposed method.

| Comparative analysis
This section demonstrates the comparative analysis of the methods using two input signals taken from the two datasets, and the analysis proceeds based on the training percentage and k-fold validation. The training percentage is varied between k = 0.4 and k = 0.9, while the k-fold value is varied between kf = 5 and kf = 10.

| Comparative analysis using the signal_1
This section performs the analysis using signal_1 based on the performance metrics accuracy, FRR, and FAR. Figure 4 shows the analysis using signal_1 based on the k-fold. Figure 4(a) shows that, at k-fold kf = 5, the accuracy of the methods MultiSVM, CNN, DBN, unsupervised VAD, and G-ROA-SVM is 0.5386, 0.5928, 0.7013, 0.888, and 0.948, respectively. Figure 6 shows the analysis using signal_2 based on the k-fold.

| Comparative discussion
The comparative discussion of the methods is given in Table 2, which reveals the effectiveness through the performance metrics. The performance is best for signal_1 at a training percentage of k = 0.9 and a k-fold of kf = 10.
The processing speed of the proposed G-ROA-SVM system is compared with the other existing methods in Table 3. The time and speed-up ratio of the proposed system relative to other parallel computing methods are given in Table 4. The speed of the proposed system on the map-reduce platform is higher than that of all other parallel computing methods. Table 5

| CONCLUSIONS
The audio classification using the SVM classifier enables accurate classification and enhances robustness towards noise. The audio database exhibits a big data configuration and is complex to handle; hence, the big data is handled using the MRF, which comprises the feature extraction and audio classification steps. The features are processed using the proposed G-ROA-based SVM. The optimal tuning of the SVM enhances the effectiveness of classification, and the analysis of the effectiveness is performed using data taken from the TUT sound event 2017 dataset and the ESC dataset. The environmental sounds are easily recognized using the developed classifier, and robustness is enhanced. The analysis of the proposed method in comparison with the existing methods reveals that the proposed method outperformed them, with a maximal accuracy of 0.96 and minimal FAR and FRR of 0.022 and 0.0119, respectively. A future direction of this work lies in using deep learning networks for audio classification.