Bat optimization algorithm for wrapper-based feature selection and performance improvement of android malware detection

Android malware is a serious threat to mobile users and their data. The losses incurred range from identity theft, financial loss, and sensitive-information loss to espionage, sabotage, and cyber fraud. An Android application's permission attributes can be analysed for malware detection using machine learning. However, the high-dimensional permission attributes are the bottleneck in designing an optimized malware detection system, and identifying the useful permission attributes is an NP-hard problem. A Bat Optimization Algorithm for Wrapper-based Feature Selection (BOAWFS) is proposed in this article and evaluated on the CICInvesAndMal2019 benchmark dataset. The performance of BOAWFS is also compared with that of Cuckoo Search Optimization for Wrapper-based Feature Selection (CSOWFS) and Grey Wolf Optimization for Wrapper-based Feature Selection (GWOWFS). Five classifiers, Random Forest (RF), Support Vector Machines (SVMs), K-Nearest Neighbour (KNN), Decision Tree (DT), and Nearest Centroid (NC), are compared as wrappers for feature selection. BOAWFS consistently outperformed the other two search methods with all five classifiers. With 200 agents and 100 iterations, BOAWFS-DT achieved 93.73% accuracy after reducing the features from 4115 to 518. The principal contribution of BOAWFS is a 1.67% improvement in accuracy with 87.41% redundancy removal in features on this very high-dimensional permission-based Android malware dataset.

Feature minimization can be done by two major approaches, the filter and the wrapper. The filter approach does not consult the classifier about accuracy; the values of the dataset attributes themselves are enough. The wrapper approach records the accuracy of the classifier for every set of attributes selected [8]. The number of ways in which a subset of Android permission attributes can be chosen from the entire permission set grows combinatorially, and hence the selection is an NP-hard problem [9]. To minimize the permission attributes and approach the globally best result in finite time, meta-heuristic search optimization techniques can be chosen.
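To make the wrapper approach concrete, the sketch below scores a candidate permission subset by the cross-validated accuracy of a classifier trained only on that subset. The function name, the tiny synthetic permission matrix, and the choice of a decision tree are illustrative assumptions, not taken from the article:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def wrapper_score(X, y, mask):
    """Score a candidate feature subset (binary mask) by classifier accuracy."""
    clf = DecisionTreeClassifier(random_state=0)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

# Tiny synthetic stand-in for a permission matrix: rows = apps, cols = permissions.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 10))
y = (X[:, 0] | X[:, 3]).astype(int)  # label depends only on permissions 0 and 3

full = wrapper_score(X, y, np.ones(10))                                  # all features
subset = wrapper_score(X, y, np.array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0]))   # two features
```

A filter method, by contrast, would rank features from statistics of `X` and `y` alone, never invoking the classifier.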
The CICInvesAndMal2019 benchmark dataset [10] is considered, which consists of malicious and benign Android permission data samples. To date, no literature is found on feature optimization for this dataset using bio-inspired wrappers. For feature selection, the Bat-inspired global search optimization algorithm is used; furthermore, machine learning classification algorithms are used as wrappers to evaluate and find the globally minimized set of features with improved accuracy. The performance of BOAWFS is also compared with that of Cuckoo Search Optimization for Wrapper-based Feature Selection (CSOWFS) and Grey Wolf Optimization for Wrapper-based Feature Selection (GWOWFS).

| RELATED WORK
Aung et al. [11] worked on Crowdroid for identifying malware in Android apps and validated it with machine learning classifiers. The feature selection algorithm information gain was used to determine the most prominent features, which were then validated using the Decision Tree (DT), J48, and Random Forest (RF) machine learning (ML) classifiers; accuracies of 91.75% and 91.58% were reported using the RF algorithm. Kakavand et al. [12] focused on Android malware analysis using the Modroid dataset, which consists of 200 benign and 200 malicious applications. For classification, ML algorithms such as Support Vector Machines (SVMs), SMO, and K-Nearest Neighbour (KNN) were considered, and it was observed that the KNN algorithm performed best with an accuracy of 80.50%. Taheri et al. [10] proposed the benchmark CICInvesAndMal2019 dataset, which consists of two categories, static features of permissions and intents, and dynamic API-call features, with 5591 samples in total assembled from Android applications. The dataset contains 5065 data samples of benign apps and 426 samples of malware apps, contributing a total of 4115 permission features. For classification, the RF algorithm was used, and the two categories were analysed separately: for the malware category, the observed precision using static features is 95.3%, and the accuracy using dynamic features is 83.3%.
Martín et al. [13] used AndroPyTool for feature extraction from Android applications. For experimentation, they used the OmniDroid dataset, which consists of 22,000 samples. For analysis, ML classification algorithms such as KNN, RF, DT, and Bagging were used; 89.3% accuracy was observed using the RF algorithm for static analysis and 89.4% using the Bagging classifier for dynamic analysis. Taheri et al. [14] analysed malware detection on three datasets, Drebin (118,505 samples), Contagio (1200 samples), and Genome (28,760 samples), using three feature types: API calls, permissions, and intents. Among these, the API-call features give the highest accuracy rates. An RF regression algorithm was used for feature selection. Classification algorithms such as All Nearest Neighbours (ANN), Weighted All Nearest Neighbours (WANN), K-Medoid-based Nearest Neighbours (KMNN), and First Nearest Neighbours (FNN) were considered for accuracy analysis, and the WANN classifier performed best with an accuracy of 99.46%. Overall, the most reliable outcomes were obtained on the Genome dataset.
Rana et al. [15] use a substring technique for attribute selection. For experimentation, the Drebin dataset was used, which consists of 11,120 samples with both static and dynamic features. They employ four different tree-based learning machines: the gradient boosted tree, RF, DT, and extremely randomized tree classifiers. Compared to the other classifiers, RF produces the best accuracy, 97.24%. Pehlivan et al. [16] worked with the Comodo dataset, which consists of 3784 samples: 2338 benign and 1446 malware. Four different attribute selectors, Gain Ratio, ReliefF, CFS Subset Evaluator, and Consistency Subset Evaluator, were used. For malware analysis, five ML classifiers, Bayesian, CART, J48, RF, and SMO, were considered. The WEKA tool was used for both the selection and classification algorithms and exercised to obtain the best accuracy results on permission-based analysis; RF performed best, with an accuracy of 94.90% using the CFS Subset Evaluator with 25 features. Zhang et al. [17] used different datasets, AMGP (10,000 samples), Drebin (21,883 samples), and an in-the-wild dataset (70,000 samples), with various features including permissions, metadata, hardware and application components, and intents in the XML file. The AWK and GREP tools were used for feature extraction. An Online Passive-Aggressive (PA) classifier-based model was trained on the dataset samples, and different permission-based sub-fingerprints were used for malware detection, achieving a high accuracy of 99.02% on the benchmark dataset; a batch-learning-based classifier (retraining on all samples) achieved 98.8% accuracy in malware attribution on the Drebin dataset. Varma et al. [18] used Ant Colony Optimization (ACO) for filter-based reduction of PE-header malware features using rough sets. Cuckoo search has also proven efficient for filter-based feature selection in malware detection [19].
Large attribute sets are the major issue in Android malware datasets, particularly with permissions. Feature minimization contributes greatly to reducing the computational complexity of an Android malware detection system; however, little work has been done in this direction so far. This article proposes a Bat Optimization Algorithm for Wrapper-based Feature Selection (BOAWFS) to identify the smallest possible number of permissions that are sufficient for the successful classification of Android malware. Further, the performance of BOAWFS is compared with the CSOWFS and GWOWFS wrappers.

| BAT OPTIMIZATION ALGORITHM FOR WRAPPER-BASED FEATURE SELECTION (BOAWFS)
In Bat optimization [20], a group of Bats tries to find its own solution in every iteration and eventually arrives at a globally best solution after the prescribed number of repetitions. In the feature selection problem, a Bat's position P_i is simply an array of 0s and 1s indicating whether a particular attribute is included or not. All Bats are initialized with a solution P_i, fitness value F_i, velocity C_i, loudness L_i, pulse rate T_i, and frequency Q_i.
The fitness plays a crucial role in converging all Bats to a single, optimal solution. The ultimate target is to find the minimized attribute set from the complete set while meeting the fitness requirements, which depend on the accuracy of the wrapper classifier used, the length of the solution l, the total number of attributes u, and the fitness tuning parameter τ. The fitness of a Bat's selected set of attributes is calculated from two quantities, the accuracy and the number of features selected, as shown in Equation (1). The parameter τ ∈ [0, 1] decides how much weight is given to the accuracy of a solution over the number of features selected while calculating its fitness; the higher the τ value, the more weight is given to accuracy. Choosing the right τ value is crucial for the convergence of the Bats.
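Equation (1) is not reproduced in this excerpt, but from the description (accuracy, selected length l, total attributes u, weight τ) the standard wrapper fitness of the following shape is implied. The exact form below is an assumption, sketched for illustration:

```python
def fitness(accuracy, l, u, tau=0.9):
    """Assumed wrapper fitness: weighted trade-off between classifier
    accuracy and feature reduction.

    accuracy : wrapper classifier accuracy on the selected features
    l        : number of selected attributes
    u        : total number of attributes
    tau      : weight in [0, 1] on accuracy over feature reduction
    """
    return tau * accuracy + (1 - tau) * (u - l) / u

# Same accuracy, but the smaller subset scores a higher fitness.
a = fitness(0.94, 518, 4115, tau=0.9)   # accurate and strongly reduced
b = fitness(0.94, 4115, 4115, tau=0.9)  # accurate but no reduction
```

With τ close to 1 the search is driven almost entirely by accuracy; with smaller τ the Bats are rewarded for discarding more permissions.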
In Algorithm 1, a is the number of Bats, each searching for its own solution, and the algorithm runs for b iterations. At the very beginning, the solution vector P_i, fitness vector F_i, velocity vector C_i, loudness vector L_i, and pulse-rate vector T_i are initialized. The loudness of every Bat is initialized randomly with a value between 1 and 2.

Algorithm 1 BOAWFS
Input: number of Bats a, number of iterations b, constants τ, α, γ, ε, and a dataset with x attributes. Output: minimized attributes as the best solution Sol, its size |Sol|, and its accuracy Sol_acc.
…
35. else
36. P_i^j = 0
37. Output Sol, its size |Sol|, and its accuracy Sol_acc

The pulse-rate emission of every Bat is initially loaded with a random value between 0 and 1. The pulse-rate emission is crucial; it decides whether a Bat generates a new solution in an iteration. If the pulse-rate emission T_i is set to zero, every Bat generates a new solution, which leads to accelerated convergence, simulating a greedy approach. On the other hand, if T_i is set to one, the Bats do not generate new solutions in the initial iterations, which leads to unwanted delayed convergence. Therefore, T_i is initialized with a random number between 0 and 1. In every iteration, in lines 5-16 of Algorithm 1, the fitness of every Bat's solution (the features selected) is calculated, and the best-fitness solution is remembered. Lines 17-36 of Algorithm 1 dictate how the Bats move in the search space based on loudness, frequency, and velocity. Two types of movement can be observed in the Bats' search: lines 18-26 perform a local search driven by the loudness parameter, and lines 27-36 perform a broader search driven by the frequency and velocity of the Bats. At the beginning of the algorithm, the fitness values are assigned the smallest number possible, symbolically represented as −∞. At every iteration, every Bat computes its fitness with the help of the embedded classifier's accuracy, which makes the feature selection process a wrapper selection. When a Bat's solution is better, its loudness L_i is decreased and its pulse-rate emission T_i is increased, as per lines 10 and 11 of the algorithm.
In every iteration, each Bat generates a new solution by moving according to the equations in lines 20-26 of Algorithm 1. As per line 19 of Algorithm 1, the pulse rate T_i of the ith Bat and a random number rand_1 decide whether the ith Bat generates a new solution in that particular iteration. Hence it is crucial to initialize the pulse-rate vector T_i with random numbers between 0 and 1.
Loudness L_i simulates a natural Bat's loudness, the ultrasonic signal used during hunting. While working in a swarm, each Bat approaches the prey by adjusting its position in comparison with the average loudness of all Bats. In the artificial Bat optimization algorithm, the position of a Bat is adjusted in a similar manner: as the solution comes closer, the loudness reduces, and hence the step size of the Bat depends on the loudness.
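A single Bat's move described above can be sketched as follows. This is a minimal illustration assuming the standard Bat-algorithm update rules (random frequency, velocity toward the best solution, pulse-rate-gated local search scaled by loudness) with a sigmoid step to keep positions binary; the helper name `bat_step` and all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def bat_step(P_i, C_i, best, Q_min, Q_max, T_i, loudness_mean):
    """One movement step of a single Bat in binary feature space (sketch)."""
    Q_i = Q_min + (Q_max - Q_min) * rng.random()   # random frequency
    C_i = C_i + (P_i - best) * Q_i                 # velocity update toward best
    moved = P_i + C_i                              # continuous candidate position
    if rng.random() > T_i:                         # pulse-rate gate (line 19 analogue):
        # local search around the current best, step size scaled by mean loudness
        moved = best + 0.01 * loudness_mean * rng.standard_normal(P_i.size)
    sig = 1.0 / (1.0 + np.exp(-np.clip(moved, -60, 60)))   # squash to [0, 1]
    P_new = (rng.random(P_i.size) < sig).astype(int)       # binarize: include/exclude
    return P_new, C_i

u = 8                                              # toy attribute count
P = rng.integers(0, 2, u).astype(float)            # current solution
C = np.zeros(u)                                    # velocity
best = rng.integers(0, 2, u).astype(float)         # global best so far
P_new, C = bat_step(P, C, best, 0.0, 2.0, 0.5, 1.5)
```

After an accepted improvement, the algorithm would also decrease L_i and increase T_i (lines 10-11 of Algorithm 1), shrinking the local-search step as Bats converge.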

| CUCKOO SEARCH OPTIMIZATION FOR WRAPPER-BASED FEATURE SELECTION (CSOWFS)
Yang and Deb designed the CSO algorithm in 2009, motivated by the brood parasitism of the Cuckoo bird [21]. CSO is suitable for finding optimal solutions to combinatorial NP-hard problems. However, CSO in its raw form is not ideal for feature selection [22]. In wrapper-based CSO, a binary candidate-solution vector, that is, a nest, represents the inclusion (binary value 1) or non-inclusion (binary value 0) of each attribute in a set of m-dimensional features. Each attribute can be treated as an egg of a cuckoo. A CSO algorithm for feature selection consists of n nests and r iterations. In the initialization phase, all nests are initialized randomly; once the iterations start, the Cuckoo explores the search space by updating the nests using levy flight, as shown in Equation (2).
Here, α tunes the step size. Traditionally in CSO, the values generated in a nest are continuous; in binary CSO, the sigmoid function converts them to binary, as shown in Equations (3) and (4).
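The levy-flight update and the sigmoid binarization can be sketched as below. Since Equation (2) is not reproduced in this excerpt, the levy step uses Mantegna's approximation, a common choice that is assumed here rather than taken from the article:

```python
import numpy as np
from math import gamma, sin, pi

rng = np.random.default_rng(2)

def levy_step(m, lam=1.5):
    """Mantegna's approximation of an m-dimensional levy-distributed step."""
    sigma = (gamma(1 + lam) * sin(pi * lam / 2) /
             (gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = rng.standard_normal(m) * sigma
    v = rng.standard_normal(m)
    return u / np.abs(v) ** (1 / lam)

def update_nest(E_i, alpha=0.1):
    """Move a nest by levy flight, then binarize with the sigmoid rule."""
    cont = np.clip(E_i + alpha * levy_step(E_i.size), -60, 60)  # Eq. (2) analogue
    sig = 1.0 / (1.0 + np.exp(-cont))                           # Eq. (3): sigmoid
    return (rng.random(E_i.size) < sig).astype(int)             # Eq. (4): threshold

E = rng.integers(0, 2, 10).astype(float)   # a nest over 10 toy attributes
E_new = update_nest(E)
```

The heavy-tailed levy step lets a nest occasionally jump far across the feature space, which is what gives CSO its global-search character.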
At the end of an iteration, a few nests are abandoned, and new nests replace them using Equation (5),
where E_x1 and E_x2 are two randomly picked nests and δ = rand(0, 1). The CSO Wrapper Feature Selection (CSOWFS) is shown in Algorithm 2. The first phase is the initialization of nests: in lines 1-6, n nests, each with m features (eggs), are initialized by randomly including or excluding each feature from the master set. For each nest, the fitness is calculated using Equation (1). The iterations start from line 7 of Algorithm 2. Lines 9-15 run in each iteration, where the Cuckoo traverses the search space using levy flight and updates all the nests. At the end of each iteration, as shown in lines 16-17, the global best solution is updated with the iteration's best, based on fitness. According to the modalities of CSO, after each iteration, ρ nests are abandoned and replaced using Equation (5). After all the iterations are complete, the globally minimal solution, attested by the embedded wrapper classifier, is returned.
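The replacement of an abandoned nest can be sketched as below. The article does not reproduce Equation (5) itself, so the random-difference form between two other nests, which matches the description of E_x1, E_x2, and δ, is an assumption:

```python
import numpy as np

rng = np.random.default_rng(3)

def replace_nest(E_i, nests):
    """Replace an abandoned nest from two randomly picked nests (assumed Eq. 5 form)."""
    i1, i2 = rng.choice(len(nests), size=2, replace=False)  # pick E_x1, E_x2
    delta = rng.random()                                    # delta = rand(0, 1)
    cont = E_i + delta * (nests[i1] - nests[i2])            # difference move
    sig = 1.0 / (1.0 + np.exp(-np.clip(cont, -60, 60)))     # sigmoid binarization
    return (rng.random(E_i.size) < sig).astype(int)

nests = rng.integers(0, 2, size=(5, 8)).astype(float)  # 5 toy nests, 8 attributes
fresh = replace_nest(nests[0], nests)
```

Replacing the worst ρ nests this way re-injects diversity each iteration without discarding the information held by the surviving nests.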

Algorithm 2 CSOWFS
Input: Data sets S_T for training and S_E for evaluation; number of features m, number of nests n, number of iterations r; tuning parameters α, ρ, σ, λ, ω. The global minimal solution glob_solution is initialized with the full feature set, and Fit_glob is the fitness of the full-feature dataset.
…
11. Train the classifier on Ŝ_T and find the accuracy on Ŝ_E
12. Evaluate the fitness, Fit_i
13. Choose E_k randomly from the n nests, excluding E_i
14. If Fit_i > Fit_k
…

| GREY WOLF OPTIMIZATION FOR WRAPPER-BASED FEATURE SELECTION (GWOWFS)
Mirjalili et al. [23] introduced yet another nature-inspired optimization approach, the GWO. The algorithm mimics the cooperative teamwork of a Grey Wolf pack in hunting prey. The pack typically consists of up to 12 wolves, headed by the alpha wolf. There is a social hierarchy in the pack: the beta and delta wolves are next after the alpha, and the other wolves in the pack follow the top three wolves in approaching the prey. For feature selection, a binary GWO is used [24,25]. During the wolf traversal, a new position value is normally continuous, and it is converted to binary with the help of Equations (3) and (4), as in CSO. In GWOWFS, n wolves run for r iterations to find an optimal feature set with the best fitness value. In the beginning, each wolf contains a randomly chosen candidate solution, a set of binary values equal in length to the number of features m. In each iteration, the new position of a wolf F_i is obtained by a cross-over, CO_Fi, of the top three wolves, as given in Equations (6)-(15), where g is a constant, itr_no is the current iteration number, r is the number of iterations, and P⃗ and Z⃗ are vectors of size m, the number of features. In the GWOWFS algorithm, listed in Algorithm 3, there is only one tuning variable, τ, which belongs to the fitness function of Equation (1). In lines 1-6, n wolves, each representing a candidate solution with m features, are initialized by randomly including or excluding each feature from the master set. For each wolf, the fitness is calculated using Equation (1). The iterations start from line 7.
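The crossover-of-leaders update can be sketched as below. The article's Equations (6)-(15) are not reproduced in this excerpt, so the sketch assumes the common binary-GWO formulation: a binary step of the wolf toward each of the alpha, beta, and delta leaders, followed by a per-bit stochastic crossover of the three candidate moves. All names and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

def toward(leader, wolf, g):
    """Binary step of `wolf` toward `leader`; g is the decaying exploration factor."""
    A = g * (2 * rng.random(wolf.size) - 1)            # coefficient vector
    D = np.abs(2 * rng.random(wolf.size) * leader - wolf)
    sig = 1.0 / (1.0 + np.exp(-10 * (A * D - 0.5)))    # squash to a bit-flip probability
    step = (rng.random(wolf.size) < sig).astype(int)
    return ((leader + step) >= 1).astype(int)          # binary move toward the leader

def crossover_update(wolf, alpha, beta, delta, g):
    """New wolf position: per-bit crossover of the three leader-driven moves."""
    cands = np.stack([toward(alpha, wolf, g),
                      toward(beta, wolf, g),
                      toward(delta, wolf, g)])
    pick = rng.integers(0, 3, wolf.size)               # choose a parent per bit
    return cands[pick, np.arange(wolf.size)]

m = 10                                                 # toy feature count
wolf = rng.integers(0, 2, m)
alpha, beta, delta = (rng.integers(0, 2, m) for _ in range(3))
new_wolf = crossover_update(wolf, alpha, beta, delta, g=2 * (1 - 0.5))  # g decays over iterations
```

As in the continuous GWO, the factor g would shrink from 2 toward 0 as itr_no approaches r, shifting the pack from exploration to exploitation.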
Algorithm 3 GWOWFS
…
For each attribute/feature j (∀ j = 1, 2, 3, …, m)
…
Derive Ŝ_T and Ŝ_E from S_T and S_E to contain the features of F_i
Train the classifier on Ŝ_T and find the accuracy on Ŝ_E
…
Update the wolf position F_i using Equation (6)
Derive Ŝ_T and Ŝ_E from S_T and S_E to contain the features of the updated F_i
Train the classifier on Ŝ_T and find the accuracy on Ŝ_E
Evaluate the fitness Fit_i of the updated wolf F_i using Equation (1)
…

| EXPERIMENTAL RESULTS AND DISCUSSION
The experiments use a subset of the CICInvesAndMal2019 dataset, chosen randomly. It holds a total of 1594 samples that carry 8115 intent and permission attributes; however, we consider only the 4115 permission attributes for malware detection. The BOAWFS, CSOWFS, and GWOWFS algorithms are developed in Python, and their performance is tested with various classifier wrappers using the scikit-learn library. The trials are run on a computer with an Intel Core i7 and 8 GB of RAM. After several trials, five classifiers are used as wrappers for selecting the features towards optimizing the Android malware detection system: RF, SVMs, KNN, DT, and Nearest Centroid (NC). The performance of BOAWFS is compared with that of two more nature-inspired search algorithms, CSOWFS [21,22] and GWOWFS [23-25]. Several experiments are performed to evaluate and compare the three wrapper feature selection algorithms. Table 1 tabulates the results of the three wrapper feature-selection search methods evaluated on the RF classifier. RF produced the best accuracy before feature reduction. BOAWFS-RF performed well with 853 selected attributes, a 79.27% reduction, and 94.36% accuracy post-reduction, a +2.3% improvement. GWOWFS-RF is faster than BOAWFS-RF, but it selects 1064 features with slightly lower accuracy. For the SVM wrapper, as shown in Experiment 2 of Table 1, BOAWFS-SVM outperformed its CSO and GWO counterparts with 600 selected features; however, the accuracy is not satisfactory. In Experiment 3, BOAWFS-KNN topped CSO and GWO with a good post-reduction accuracy of 94.78%, but 1012 features are selected. In Experiment 4, the BOAWFS-DT wrapper's performance is satisfactory, with a +3.13% improvement in accuracy and an 81.24% reduction in features.
Another significant performance advantage is the run time of BOAWFS-DT, which is 96% faster than BOAWFS-KNN, 95.6% faster than BOAWFS-SVM, and 92.7% faster than BOAWFS-RF. In Experiment 5, BOAWFS-NC, a simpler and faster classifier, is also compared; however, its post-reduction accuracy is not satisfactory, even though the number of features reduced is.
To summarize the results with 50 agents and 50 iterations: with all five classifiers tested, Bat optimization outperformed CSO and GWO. Among the five classifiers, the DT classifier clearly did best. Figure 1 shows the results graphically. Consistent settings of 50 agents and 50 iterations were used in these tests to narrow the choice down to one classifier, the DT.
Furthermore, a few more experiments are conducted with increased agents and iterations, because all three search techniques are random-selection and random-walk oriented methods that typically give better results with larger numbers of agents and iterations. Due to run-time constraints, we cannot experiment with all the classifiers; therefore, the search space is further extended with more agents and iterations using the DT classifier, whose results are tabulated in Table 2 and shown graphically in Figure 2. In Experiment 6, the iterations are 50, and BOAWFS-DT is run with 100 Bats, CSOWFS-DT with 100 nests, and GWOWFS-DT with 100 wolves. Again, BOAWFS gives the best results, with an 82.16% reduction in features. In Experiment 7, the iterations are increased to 100 and the agents to 200. The output makes it evident that increased iterations and agents produce even better results: BOAWFS-DT selected 518 features, an 87.41% reduction, with a post-reduction accuracy of 93.73%, a +1.67% improvement, which is an outstanding result. The noise and redundant data are eliminated with the increased agents and iterations, and therefore the accuracy is also enhanced. Experiment 8 is also worth noting: the simpler and faster NC classifier wrapper, BOAWFS-NC, produced the fewest features, 375, a 90.88% reduction, with a +6.47% improvement in post-reduction accuracy, but the accuracy did not cross 90%.