Feature selection-based Android malware adversarial sample generation and detection method

With the popularisation of Android smartphones, the value of mobile application security research has increased. The emergence of adversarial technology makes it possible for malware to evade detection. Therefore, adversarial attacks on Android malicious application detection are studied. To clarify the process and theory of adversarial sample generation, an adversarial sample generation algorithm is proposed that filters features based on feature spatial distribution and definition. These features are modified on real malicious samples to form adversarial samples. In addition, to enhance the robustness of adversarial sample classification detection, a multiple feature set detection algorithm is designed and implemented. Using the frequency differential enhancement feature selection algorithm to screen features, the algorithm forms two different feature sets and establishes two different training sets to train different classification algorithms. The prediction results obtained by the two classification algorithms are integrated according to certain rules. Experimental results on the VirusShare dataset show that both algorithms are effective. The detection results in an actual environment also prove the effectiveness of the multiple feature set detection algorithm.


| INTRODUCTION
With the development of information technology, the popularity of smartphones has increased, making mobile security a hot research topic. Malicious apps are a serious threat to smartphones. The increasing market share of the Android operating system has led to the emergence of a large amount of Android malware, and mobile user security has become a focus of researchers [1]. As a result, the demand for Android malware detection algorithms has also risen. Machine learning has attracted the interest of many researchers and has been used to solve problems in various fields [2][3][4], including the detection of Android malware [5]. However, machine learning algorithms suffer from poor robustness, and attackers can use adversarial attack techniques to evade detection.
In response to the poor robustness of machine learning algorithms, researchers have constructed adversarial samples. Adversarial samples are input samples formed by deliberately adding perturbations to data. The construction of adversarial samples first appeared in the field of image recognition: researchers made small changes to image pixels that are difficult for the human eye to detect but can interfere with the classifier's recognition [6,7]. Subsequent research on adversarial samples attracted the attention of researchers in various fields [8,9].
Because of the advantages of static detection, research on the static detection of Android malware is extensive [10]. Accordingly, both researchers and attackers have studied methods to evade static detection, but there is a lack of research on adversarial attack defence methods in the field of Android malware detection. Generally, the most effective defence against adversarial samples is to increase the robustness of classification detection by expanding the diversity of the dataset, but this method does not work when new adversarial samples appear.
The main contributions of this article are as follows:

(1) A feature selection-based adversarial sample generation (FSASG) algorithm is proposed. This algorithm clarifies the adversarial sample generation process and principle and provides guidance for the study of adversarial sample defence methods. The algorithm first screens out features that play an important role in classification through the chi-square check, information gain, and frequency differential enhancement (FDE) feature selection algorithms. Then, according to the definitions of benign features and malicious features, the features that play an important role in identifying benign applications are obtained. Finally, these features are modified on real malware samples to obtain adversarial samples. Experimental results show that adversarial samples generated by the FSASG algorithm can effectively evade detection by many classification algorithms. In addition, the influence of the feature selection algorithm on detecting adversarial samples and the performance of different classification algorithms in detecting adversarial samples are discussed.

(2) A multiple feature set (MFS) detection algorithm is proposed: an adversarial sample classification detection algorithm with enhanced robustness. Through the analysis of feature distribution, two feature sets are obtained according to the FDE algorithm and the definitions of malicious typical features and benign typical features. Based on the two feature sets, we construct two different training sets, which are used to train different classification algorithms. After using the different feature set information to construct the input, the MFS algorithm judges the prediction results according to relevant rules to obtain the final prediction result. Experimental results on the mixed test set show that the algorithm can effectively detect adversarial samples. The effectiveness of the method is then verified in a real environment.
The rest of this article is arranged as follows. Section 2 presents a survey of related work. Section 3 describes the relevant definitions and methods in detail, explains the principles and ideas of the FSASG algorithm, analyses the complexity of the algorithm, and presents the experimental results and analysis of the adversarial samples generated by the FSASG algorithm on many classification algorithms. Section 4 describes the method to enhance the robustness of the classification detection of adversarial samples, describes the idea and pseudocode of the MFS algorithm, and analyses the related experiments. Section 5 summarises the research work of the whole article.

| RELATED WORKS
Research on the static detection of Android malware is extensive. As machine learning has attracted increasing attention, researchers have continuously introduced relevant classification algorithms into this field and obtained impressive achievements. Traditional machine learning algorithms that are mature, simple, and effective, such as support vector machines (SVMs), K-nearest neighbour (KNN), decision tree (DT), and Bayes, are widely used in many fields [11,12]. Yerima et al. [13] extracted permission information as data features and developed and analysed an active machine learning method based on Bayesian classification that demonstrated high-precision detection capabilities. Liu et al. [14] proposed a static malware detection method based on a sparse Bayesian learning algorithm and n-gram analysis, which can fully and effectively detect malware. Wang et al. [15] used three traditional machine learning algorithms, SVM, random forest, and KNN, and compared and analysed them. Zhou et al. [16] proposed a detection model to find and track the behaviour of malicious software in the device; after comparing it with SVM, KNN, and other algorithms, they found that the model achieved high detection accuracy. Salah et al. [17] used the idea of term frequency-inverse document frequency to design and implement a feature selection algorithm; the features selected by this algorithm can effectively detect unknown applications. Deep learning algorithms such as convolutional neural networks (CNNs) are also widely used to process various tasks [18]. Compared with traditional methods, deep learning algorithms are better at self-learning and self-renewal; owing to these advantages, they are widely used in malware detection [19]. McLaughlin et al. [20] used CNNs to learn the features of malware from the original opcode sequence; the recall rate reached 96.29%. Jinpei et al. [21] used a layered noise reduction network based on long short-term memory (LSTM), in which the first level operated on opcode sequences and the second used method block sequences to learn and detect malicious software. To reduce the complexity of the model, Thiyagarajan et al. [22] developed a preprocessing module containing five different data reduction techniques, which reduced 133 permissions into 10-dimensional vector information, and identified malicious applications by training DTs; the accuracy of the method reached 94.3%. Other deep learning algorithms are also favoured by many researchers [16,23,24].
In view of the poor robustness of machine learning algorithms, researchers constructed adversarial samples.
Shahpasand et al. [25] used a generative adversarial network (GAN) to generate adversarial samples and studied the performance of feature subsets and complete feature sets on the Drebin [26] dataset; adversarial samples generated using feature subsets evaded multiple classifiers with a maximum evasion rate of 99%. Grosse [27] extended an existing adversarial example production algorithm by first training a deep neural network (DNN) on the Drebin dataset and then using the Jacobian matrix to generate adversarial samples for deep learning; the evasion rate on the deep learning classifier reached 63%. Hu et al. [28] proposed a model called MalGAN, which drew on the black-box principle and did not require specific knowledge of the target; the adversarial samples obtained through the MalGAN model achieved a good evasion effect on many classification algorithms. To evade firewall detection, Li et al. [29] proposed a variant of GAN (dual-target GAN) for adversarial sample generation. The difference between the dual-target GAN and the original GAN is that the former has two discriminators, so that the generator fights against both malware detectors and firewalls; 95% of the generated adversarial samples can evade detection.
Research on adversarial attack and defence mainly focuses on images [30]. Metzen et al. [31] used an auxiliary classifier to detect adversarial samples to enhance the detection performance of the original neural network. Lu et al. [32] described a SafetyNet architecture consisting of an original classifier and a detector that resists adversarial attacks; this method quantises Rectified Linear Unit (ReLU) activations to generate discrete codes and uses a Radial Basis Function SVM to find adversarial samples on the adversarial detector. There are also many other adversarial sample defence methods in the image domain [33,34]. In the field of Android, Hu et al. [26] used MalGAN to generate adversarial samples and added them to the training set to enhance the classifier's ability to detect adversarial samples. In addition, some researchers are committed to enhancing the robustness of the detection algorithm itself. Li et al. [35] introduced a HashTran-DNN framework that uses a hash function to transform samples to enhance the robustness of DNNs. Garg et al. [36] enhanced the robustness of the detection method through ensemble learning. Turker et al. [37] compared traditional machine learning algorithms, ensemble learning algorithms, and neural network algorithms, designed a confidence value, and used it to detect malicious samples that did not appear in the dataset. Alazab et al. [38] analysed the feature information of Android applications, implemented three types of groupings for the feature design of malware detection, and used two of them to detect unknown applications. Applying adversarial samples from the image recognition domain to Android malware detection on Internet of Things devices, Liu et al. [39] analysed and restricted the types and number of permissions that can be added while keeping the app running normally; they also used a genetic algorithm to generate adversarial samples, and the implemented black-box attack almost reached a 100% success rate. Rathore et al. [40] designed a single-policy adversarial attack for the white-box scenario and a multipolicy attack for the grey-box scenario, achieving average fooling rates of 44.28% and 53.20%, respectively, with a maximum of five modifications across eight different sets of detection models.
In response to the challenge of classifying adversarial malware, Li et al. [41] proposed six guiding principles and established a defence framework to enhance the robustness of DNNs against evasion attacks by adversarial malware. An accuracy rate of 98.49% is achieved against grey-box attacks, and an accuracy rate of 89.14% is achieved against lily attacks.
Adversarial samples of malware are a serious threat to malware detection systems [42,43]. Regarding methods of generating adversarial samples for Android malicious applications, few researchers analyse which specific features an attacker can modify from the perspective of feature frequency to generate adversarial samples, and there is little research on the detection of such adversarial samples.

| ALGORITHM FOR GENERATING ADVERSARIAL SAMPLES BASED ON FEATURE SELECTION
In the static detection of Android malware, adversarial samples can be divided into two types. One is the adversarial sample obtained by modifying a benign sample; this kind of adversarial sample has almost no practical meaning. The other is the adversarial sample obtained by modifying a malicious sample so that the sample is misclassified as benign. An adversarial sample is obtained by modifying some features of a real sample, so it is necessary to determine which features can be modified to disturb detection efficiently. To clarify the principle and process of constructing adversarial samples, permissions and Application Programming Interface (API) calls are used as features, the distribution of these features is analysed, and the steps of generating adversarial samples are explained clearly, providing guidance for defending against adversarial sample attacks.

| Feature analysis
The essence of the adversarial sample is to modify the features on the original sample. First, it is necessary to determine which features can be modified to disturb the detection of the classifier. Therefore, the feature appearance frequency is analysed here.
To clarify the tendency of a feature, the feature is expressed as a two-dimensional tuple count_i = (N_b, N_m), where N_b is the number of occurrences in benign applications and N_m is the number of occurrences in malware. The two-dimensional tuple clearly reflects the frequency of a feature in benign applications and in malware. Figure 1 shows the distribution of features in the dataset: the feature points are mapped into a two-dimensional space, where the abscissa represents the number of occurrences in benign applications and the ordinate represents the number of occurrences in malware. According to the distribution of features in the two-dimensional space, all features are divided into three categories.
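As an illustration of this counting step, the following is a minimal sketch, assuming each application has already been reduced to the set of permission/API feature strings extracted from it (the function names are ours):

```python
from collections import Counter

def count_tuples(benign_sets, malware_sets):
    """Build count_i = (N_b, N_m) for every feature, as plotted in Figure 1.

    benign_sets / malware_sets: one set of feature strings per application.
    """
    n_b, n_m = Counter(), Counter()
    for s in benign_sets:
        n_b.update(s)                 # occurrences in benign applications
    for s in malware_sets:
        n_m.update(s)                 # occurrences in malware
    return {f: (n_b[f], n_m[f]) for f in set(n_b) | set(n_m)}
```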
The first category comprises features that favour malware, called malicious features, which are distributed above the diagonal in Figure 1. The characteristic of features in this category is that they appear frequently in malware and infrequently in benign applications. The specific information of these features is shown in Figure 2.
The second category of features is those that favour benign applications, called benign features, which are distributed in the lower half of the diagonal of Figure 1. The frequency of occurrence of these features is low in malware, and high in benign applications. The specific information of these features is shown in Figure 3.
The third category comprises no-tendency features, distributed near the diagonal of Figure 1. The characteristic of features in this category is that their occurrence frequency in malware is similar to that in benign applications; the range can be determined based on a specific threshold.

According to the appearance frequency and distribution of the various types of features, in common classification tasks a combination of malicious features and benign features is usually selected as the feature set. When selecting adversarial sample features for modification, malicious features and no-tendency features obviously provide no positive help for disguising malicious samples. Only benign features can help disguise adversarial samples, but they need to be carefully selected. Figures 2 and 3 show the top 10 features selected from the perspective of feature frequency difference. In the process of selecting benign features, the frequency difference does not fully represent the importance of the features. Therefore, the features that need to be modified to generate the adversarial sample should be selected from multiple angles. Traditional feature selection algorithms interpret features from the perspectives of mathematics and classification contribution, so they can be used to filter benign features.

| Related methods and definitions
To accurately select benign features that can be modified, the chi-square check, information gain, and FDE feature selection algorithms are used to measure feature importance in preliminary feature selection.
Chi-square check (Chi) [44] is a method in statistics that is suitable for testing the relation between an individual feature and the classes:

$$\chi^2(f) = \frac{N(AD - BC)^2}{(A+B)(C+D)(A+C)(B+D)} \qquad (1)$$

where A and B represent the number of malware and normal applications that include feature f, C and D represent the number of malware and normal applications that exclude feature f, and N = A + B + C + D is the total number of applications. The $\chi^2$ value of each feature is calculated and the features are sorted by this value: the larger the value, the stronger the relation between the feature and the category.

Figure 3. Examples of benign features.

Information gain (IG) [45] is a concept in information theory used to examine the IG value of a feature to a classification system. H(c) is the entropy of the classification system, and H(c|f) is the conditional entropy of the system given feature f. The IG(f) value is obtained by subtracting H(c|f) from H(c), as shown in Equation (2):

$$IG(f) = H(c) - H(c|f) \qquad (2)$$

The IG(f) value of each feature is obtained and sorted. The larger the IG value, the more information the feature provides to the classification system.
The FDE algorithm [46] is a feature selection algorithm proposed by the authors of this article from the perspective of feature frequency. It is suitable for the static detection of Android malware. Designed to make up for the shortcomings of the chi-square check and information gain in feature selection for the static detection of Android malware, the algorithm removes some high-ranking features with a low frequency. The core calculation formula of the algorithm is:

$$FDE(f) = \left|\frac{N_m}{T_m} - \frac{N_b}{T_b}\right| \qquad (3)$$

where N_m represents the number of malware containing feature f, N_b represents the number of benign applications containing feature f, T_m represents the total number of malware, and T_b represents the total number of benign applications. The algorithm uses the absolute value of the difference between the frequency with which a feature appears in benign applications and in malware to measure the importance of the feature to classification. According to Equation (3), the score of each feature can be calculated; the larger the score, the more important the feature is to the classification. After the chi-square check, information gain, and FDE algorithms are used to calculate the importance of the features, the following definitions are used to filter features accurately in further screening.
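Before turning to these definitions, Equations (1)-(3) can be implemented directly from the counts above. The following is a minimal sketch under the same per-application feature-set representation; the function names are ours:

```python
import math

def _h(p):
    """Binary entropy of a probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def score_features(benign_sets, malware_sets):
    """Return {feature: (chi, ig, fde)} per Equations (1)-(3)."""
    T_b, T_m = len(benign_sets), len(malware_sets)
    N = T_b + T_m
    H_c = _h(T_m / N)                                   # class entropy H(c)
    features = set().union(*benign_sets, *malware_sets)
    scores = {}
    for f in features:
        A = sum(f in s for s in malware_sets)           # malware with f
        B = sum(f in s for s in benign_sets)            # benign with f
        C, D = T_m - A, T_b - B                         # apps without f
        denom = (A + B) * (C + D) * (A + C) * (B + D)
        chi = N * (A * D - B * C) ** 2 / denom if denom else 0.0
        H_cf = 0.0                                      # conditional entropy H(c|f)
        if A + B:
            H_cf += (A + B) / N * _h(A / (A + B))
        if C + D:
            H_cf += (C + D) / N * _h(C / (C + D))
        ig = H_c - H_cf                                 # Equation (2)
        fde = abs(A / T_m - B / T_b)                    # Equation (3)
        scores[f] = (chi, ig, fde)
    return scores
```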

Definition 1 (Benign feature):
For a feature, the number of times it appears in benign applications and in malware is recorded as a two-dimensional tuple count_i = (N_b, N_m). If N_b > N_m, it is called a benign feature.

Definition 2 (Malicious feature):
For a feature, the number of times it appears in benign applications and in malware is recorded as a two-dimensional tuple count_i = (N_b, N_m). If N_m > N_b, it is called a malicious feature.

Definition 3 (No-tendency feature):
For a feature, the number of times it appears in benign applications and in malware is recorded as a two-dimensional tuple count_i = (N_b, N_m). If N_b ≈ N_m, that is, the absolute difference between the two occurrence frequencies falls below a specified threshold, it is called a no-tendency feature.

Definition 4 (Benign typical feature):
If a feature is a benign feature and the difference between its occurrence frequency in benign applications and that in malware exceeds a specified threshold, it is called a benign typical feature.

Definition 5 (Malicious typical feature):
If a feature is a malicious feature and the difference between its occurrence frequency in malware and that in benign applications exceeds a specified threshold, it is called a malicious typical feature.
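The following sketch classifies counted features according to Definitions 1-5. The thresholds t_nt and t_typ are illustrative placeholders, since the exact threshold values are not reproduced in this text:

```python
def categorise(counts, T_b, T_m, t_nt=0.02, t_typ=0.10):
    """Split features into the categories of Definitions 1-5.

    counts: {feature: (N_b, N_m)}; T_b / T_m: numbers of benign and
    malicious applications. Thresholds are illustrative, not the paper's.
    """
    cats = {"benign": [], "malicious": [], "no_tendency": [],
            "benign_typical": [], "malicious_typical": []}
    for f, (n_b, n_m) in counts.items():
        diff = n_b / T_b - n_m / T_m          # >0 favours benign apps
        if abs(diff) < t_nt:                  # Definition 3
            cats["no_tendency"].append(f)
        elif diff > 0:                        # Definition 1
            cats["benign"].append(f)
            if diff >= t_typ:                 # Definition 4
                cats["benign_typical"].append(f)
        else:                                 # Definition 2
            cats["malicious"].append(f)
            if -diff >= t_typ:                # Definition 5
                cats["malicious_typical"].append(f)
    return cats
```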

| Algorithm for generating adversarial samples based on feature selection
In research on adversarial sample generation methods in Android malware detection, the methods and principles of adversarial sample generation are often unclear. Therefore, an FSASG algorithm is designed and implemented that is simple, highly interpretable, and fast, and can be used for actual sample modification. It aims to accurately select the features that need to be modified, from the perspective of feature analysis, and to generate samples that can evade detection by many classification algorithms. The overall flow of the FSASG algorithm is shown in Figure 4; its design can be divided into three steps. The first step obtains relevant files through decompilation of a collected dataset, extracts permission and API information, and constructs a feature set through digitisation.
The second step chooses the chi-square check, information gain, and FDE feature selection algorithms for the preliminary screening of features. These three algorithms clarify the correlation between each feature and the category from mathematical and related perspectives, so the researcher knows the importance of all features and can derive a clear feature ranking.
The three feature selection algorithms evaluate features from different angles and on different principles, so a comprehensive evaluation of the results obtained by the three algorithms is required. The three algorithms are used to calculate all feature rankings separately. For the same feature, the rankings calculated by different algorithms may differ. Therefore, a comprehensive ranking of the features is obtained from the rankings calculated by the three algorithms, and the features are sorted in descending order by their comprehensive scores, as sketched below.

According to the definitions of benign typical features and malicious typical features, the comprehensively ranked features are further filtered, and the malicious typical features are removed to obtain the benign typical features. The benign typical features obtained by screening are sorted in descending order according to the comprehensive ranking, and this order is the feature modification order. In a real environment, the cost of adding features is smaller, and deleting features may cause the loss of related malicious functions. Therefore, during the construction of the adversarial sample, no feature is deleted.
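The exact comprehensive-score formula is not reproduced in this text. As one plausible realisation, the sketch below averages the three min-max-normalised scores and sorts in descending order; this combination rule is our assumption:

```python
def comprehensive_ranking(scores):
    """Combine the Chi/IG/FDE scores from score_features() into one ranking.

    Assumption: the comprehensive score is the mean of the three
    min-max-normalised scores (the paper's exact formula is not shown here).
    """
    cols = list(zip(*scores.values()))            # per-metric value columns
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]

    def comb(vals):
        norm = [(v - l) / (h - l) if h > l else 0.0
                for v, l, h in zip(vals, lo, hi)]
        return sum(norm) / len(norm)

    return sorted(scores, key=lambda f: comb(scores[f]), reverse=True)
```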
In the third step, a malicious sample that is easily detected is obtained, and its features are modified according to the final feature ranking from the second step. If a feature already exists, no modification is made; if it does not exist, it is added to the sample.
The FSASG algorithm differs from the traditional use of feature selection. Feature selection algorithms are generally used to remove redundant features, shorten model training time, and improve detection accuracy; the FSASG algorithm instead uses feature selection to construct adversarial samples. The algorithm first selects the features that are highly correlated with the category, then uses the frequency of the features in different applications to obtain features that are conducive to identifying benign applications or malware. Definitions 4 and 5 show that malicious typical features appear frequently in malware but rarely in benign applications; such features are widespread in malware and are key factors in achieving malicious functions, so they cannot be added or removed. The situation of benign typical features is the opposite: such features appear in a large number of benign applications but rarely in malware, so they can be added to malware to disguise malicious samples. By continuously adding features of benign applications to malicious samples, the malicious samples come to show the characteristics of benign ones, disturbing the detection of the classifier and causing misclassification. The pseudocode of the algorithm is shown in Algorithm 1.
In the pseudocode, Line 1 obtains relevant feature information from the original dataset; its complexity is related to the number of original samples. Assuming the sample size is m, its complexity is O(m). Lines 3 and 4 rank the features by the chi-square check, information gain, and FDE, respectively. Their complexity is the same and is related to the number and scale of the specific features extracted in Line 1; assuming the feature scale is n, the complexity is O(n). Line 5 computes the comprehensive ranking based on the three feature rankings calculated in Lines 3-4; its complexity is also O(n). Lines 6-8 further screen the obtained comprehensive ranking to remove malicious features. The number of executions fluctuates between 0 and n, so the complexity is at most O(n). Line 9 modifies the malicious samples according to the final features screened in Line 8 and obtains the adversarial samples. Its complexity is related to the size of the malware and the number of modified features; assuming the scale of the malware is k and the number of modified features is j, the complexity is O(k · j). Therefore, the overall complexity of the FSASG algorithm is O(max(n, m, k · j)).

Algorithm 1 An FSASG algorithm
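The body of Algorithm 1 is not reproduced in this text; the sketch below reconstructs its nine lines from the complexity analysis above, reusing the helper sketches from the previous subsections, so it should be read as our interpretation rather than the original pseudocode:

```python
def fsasg(benign_sets, malware_sets, malware_sample, max_mods=350):
    """Reconstruction of Algorithm 1 (FSASG) from its line-by-line analysis."""
    counts = count_tuples(benign_sets, malware_sets)           # line 1
    scores = score_features(benign_sets, malware_sets)         # lines 3-4
    ranking = comprehensive_ranking(scores)                    # line 5
    cats = categorise(counts, len(benign_sets), len(malware_sets))
    keep = set(cats["benign_typical"])                         # lines 6-8:
    ranking = [f for f in ranking if f in keep]                # benign typical only
    adversarial = set(malware_sample)                          # line 9: modify sample
    for f in ranking[:max_mods]:
        adversarial.add(f)          # features are only added, never deleted
    return adversarial
```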
The evasion rate is calculated with the same formula as the false positive rate: it is the proportion of adversarial samples that are classified as benign, that is, that evade detection, among all adversarial samples.
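As a one-line illustration, assuming the benign class is labelled 0:

```python
def evasion_rate(adv_preds, benign=0):
    """Share of adversarial samples classified as benign, i.e. evading detection."""
    return sum(p == benign for p in adv_preds) / len(adv_preds)
```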

| Dataset
The authors downloaded 5000 benign applications from the Google Play Store and 5000 malware samples from VirusShare, then divided the training set and the test set at a 7:3 ratio. Malicious samples in the test set are used to construct adversarial samples. The AndroidManifest.xml and .smali files were obtained through Androguard decompilation; permissions and API call information were extracted and digitised as binary feature vectors, in which each dimension records whether the corresponding permission or API call appears in the application.
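A sketch of the digitisation step under this presence/absence encoding; the vocabulary is the list of selected features, and the names are ours:

```python
import numpy as np

def vectorise(app_feature_sets, vocabulary):
    """Digitise extracted permission/API strings into binary vectors."""
    index = {f: i for i, f in enumerate(vocabulary)}
    X = np.zeros((len(app_feature_sets), len(vocabulary)), dtype=np.int8)
    for row, feats in enumerate(app_feature_sets):
        for f in feats:
            if f in index:
                X[row, index[f]] = 1    # 1 = feature present in the app
    return X
```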

| FSASG algorithm experiment result and analysis
To test the effectiveness of the adversarial samples generated by the FSASG algorithm, a variety of feature selection algorithms and classification algorithms are combined, forming a total of 42 method combinations. To better test the effectiveness of the FSASG algorithm, the malicious samples are first screened for each method: the classification model is trained on the training set, and the malicious samples in the test set that are correctly detected are selected to construct adversarial samples. The adversarial samples used for each method are therefore different. The purpose of this approach is to remove malicious samples that can evade detection without any feature modification.
When constructing an adversarial sample, the number of modified features is limited. Feature counts on the dataset show that a benign application has 756 features on average, with a maximum of 2392, whereas a malware sample has 393 features on average, with a maximum of 1056. Therefore, the upper limit of features to be modified is set to 350. In addition, after parameter tuning, the accuracy of the selected classification methods exceeds 97% on the test set.
Appendix Table A1 shows the evasion rate of adversarial samples constructed by using the FSASG algorithm to modify different numbers of features, across the different methods. First, the evasion rates of the 42 method combinations show that adversarial samples generated by the FSASG algorithm can evade detection to a certain extent, demonstrating the effectiveness of the FSASG algorithm. Second, as the number of modified features increases, the evasion rate of the adversarial samples also increases: the more features are modified, the higher the probability of evading detection. Finally, the data in Table A1 reveal the performance of the different methods in detecting adversarial samples.
Factors that affect the performance of adversarial sample detection are also analysed. The detection performance of the same classification algorithm combined with different feature selection algorithms indicates that feature selection or dimensionality reduction algorithms have an impact on the detection of adversarial samples, but the specific impact is related to the classification algorithm.
The experimental results of different classification algorithms with the same feature selection show that classification algorithms differ greatly in adversarial sample detection. For example, the Bayes classification algorithm has a poor ability to detect adversarial samples, whereas the KNN and DT classification algorithms are stronger at detecting adversarial samples with a small number of modified features and are also superior in detecting other adversarial samples. Overall, however, the evasion rate of adversarial samples with a large number of modified features is high across the various methods.
The experimental results show that (1) the FSASG algorithm is effective, and the greater the number of feature modifications, the stronger the ability to evade detection. (2) The feature selection algorithm affects the performance of the classification algorithm in detecting adversarial samples. (3) The performance of detecting adversarial samples varies greatly among different algorithms.

| ROBUSTNESS ENHANCEMENT METHOD FOR ADVERSARIAL SAMPLE CLASSIFICATION DETECTION
Attackers evade detection by constructing adversarial samples, which can disturb the determination of many classification methods. To enhance the robustness of adversarial sample classification detection and help classification algorithms effectively detect adversarial samples, a robustness enhancement study is conducted from two perspectives. The first method expands data diversity and adds adversarial samples to the training set. The second method uses the idea of ensemble learning and adopts multiple feature sets and multiple classification algorithms to enhance robustness.

| Method 1: expand data diversity
To make adversarial sample classification detection more robust, expanding the diversity of the dataset is the simplest and most effective method. Adversarial samples are added to the training set so that the classification algorithm can obtain more information. The amount of information contained in adversarial samples with different amounts of modification differs, so the robustness of the extended algorithm is enhanced to varying degrees.
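A minimal sketch of this augmentation, assuming NumPy arrays and label 1 = malicious; the fraction parameter mirrors the 0.5%-5% settings used in the experiments below:

```python
import numpy as np

def augment(X_train, y_train, X_adv, frac=0.05, seed=0):
    """Add a random fraction of adversarial samples to the training set."""
    rng = np.random.default_rng(seed)
    k = max(1, int(frac * len(X_adv)))
    idx = rng.choice(len(X_adv), size=k, replace=False)
    X_aug = np.vstack([X_train, X_adv[idx]])
    y_aug = np.concatenate([y_train, np.ones(k, dtype=y_train.dtype)])
    return X_aug, y_aug             # adversarial samples labelled malicious
```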

| Method 2: multiple feature set detection algorithm
To determine whether adversarial samples generated by the FSASG algorithm are specific in their feature distribution and whether this specificity can be used to identify them, feature statistics are computed for the benign, malicious, and adversarial samples in the dataset to analyse whether there are obvious differences. The number of features of benign applications, malware, and adversarial samples is basically the same, so adversarial samples are difficult to distinguish by feature count alone.
The distribution of features is then analysed. According to the FSASG algorithm's selection of features, the range of features modified in an adversarial sample is limited to the Triangle feature points (i.e., benign typical features) in Figure 5. The feature distribution shows that there are many features of this type and that they play a major role in classification; if a malicious sample modifies many of them, it can disturb the detection of the classification algorithm. When generating an adversarial sample, the Pentagram and Circular feature points are not modified. A Pentagram represents a malicious typical feature: adding features of this type would only make the malware more prominent, so these features retain the most original information. The Circular feature points, that is, no-tendency features, are likewise not modified, but they play only a small role in classification. Because malicious typical features retain the original feature information and play an important role in the classification task, they can be used to distinguish adversarial and malicious samples. When the number of modified features of an adversarial sample is small, benign typical features can be used to identify malicious and low-modification adversarial samples.
According to the analysis of features, to enhance the robustness of adversarial sample classification detection, an algorithm is proposed that can detect adversarial samples (the MFS detection algorithm), which does not need to add adversarial samples to the training set.
The MFS algorithm uses different kinds of features to distinguish adversarial samples. This method first extracts and classifies the features to construct two feature sets; then, it puts the different feature sets into the classification algorithm and makes a comprehensive determination about the classification results. The flowchart is shown in Figure 6.
First, feature selection is performed according to the FDE algorithm. This algorithm can remove no-tendency features and obtain a feature set containing malicious typical features and benign typical features. According to the definitions of benign typical features and malicious typical features, this feature set is further divided into a benign typical feature set (FB) and a malicious typical feature set (FM). We use the FB and FM feature sets to construct two training sets, and then use the two training sets to train different classification algorithms. In addition, we use the FB and FM feature sets to construct the input of the different classification algorithms on the test samples and obtain the corresponding classification results, R_1 and R_2.
The two recognition results are integrated and determinations are made according to a rule: when the result R_1 (obtained with the FB feature set) is malicious, the application is judged malicious; the remaining cases are judged based on the R_2 result (obtained with the FM feature set).
The pseudocode of the MFS algorithm is shown in Algorithm 2.
The algorithm has four lines of pseudocode. Line 1 uses the FDE algorithm to extract features; assuming the original feature scale is n, the complexity is O(n). Line 2 classifies the obtained feature set F using the relevant definitions to obtain the two feature sets FB and FM; its complexity is related to the number of features in the two feature sets and is smaller than O(n). Line 3 takes the two feature sets as input to the classification algorithm and obtains result sets R_1 and R_2. The classification algorithm can be replaced at will; in this experiment, the KNN algorithm works best. When the KNN algorithm is used, its complexity is O(D · N · M), where D is the dimension of the data, N is the number of samples in the training set, and M is the number of samples in the test set. Line 4 obtains the final recognition result from R_1 and R_2 according to the rule, and its complexity is constant. Therefore, the complexity of the MFS algorithm is O(max(n, D · N · M)); in most cases it is dominated by the complexity of the classification algorithm.
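The following sketch mirrors the four pseudocode lines described above, assuming binary feature matrices X_fb and X_fm built from the FB and FM feature sets, labels with 1 = malicious, and scikit-learn classifiers (the branch classifier is replaceable, as the text notes):

```python
from sklearn.neighbors import KNeighborsClassifier

def train_mfs(X_fb, X_fm, y, make_clf=KNeighborsClassifier):
    """Train the two MFS branch classifiers on the FB and FM views."""
    clf_fb, clf_fm = make_clf(), make_clf()
    clf_fb.fit(X_fb, y)             # benign typical features only
    clf_fm.fit(X_fm, y)             # malicious typical features only
    return clf_fb, clf_fm

def mfs_predict(clf_fb, clf_fm, X_fb, X_fm, malicious=1):
    """Integrate R1/R2: an R1 'malicious' verdict wins, otherwise R2 prevails."""
    r1, r2 = clf_fb.predict(X_fb), clf_fm.predict(X_fm)
    return [malicious if a == malicious else b for a, b in zip(r1, r2)]
```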

| Experimental results and analysis
Experiments in this section are divided into four groups. The first group explores which adversarial samples are more efficient when used to expand data diversity. The second group uses only the FB, only the FM, or the combined FB + FM feature set as the input of the algorithm and comparatively analyses the results to refine the judgement rules in the MFS algorithm. The third group tests the effectiveness of the MFS algorithm on the mixed test set, which contains benign, malicious, and adversarial samples. The fourth group tests the detection effect of the MFS algorithm under real conditions.

| Dataset and evaluation indicators
The dataset is the same as the one in Section 4.1; adversarial samples are added to the test set to form a mixed test set. The accuracy rate, recall rate, F1 value, and underreporting rate of the adversarial samples are used as evaluation indicators. The adversarial sample underreporting rate (PFR) is the proportion of adversarial samples that are incorrectly classified as benign, that is, that evade detection.
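A sketch of these indicators, assuming labels 0 = benign and 1 = malicious. Recall and F1 are computed here for the benign class, since the recall rate in this article tracks the detection of benign samples; adv_mask flags the adversarial samples:

```python
def evaluate(y_true, y_pred, adv_mask, benign=0):
    """Return accuracy, benign-class recall, F1, and adversarial PFR."""
    n = len(y_true)
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / n
    tp = sum(t == benign and p == benign for t, p in zip(y_true, y_pred))
    fp = sum(t != benign and p == benign for t, p in zip(y_true, y_pred))
    fn = sum(t == benign and p != benign for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    n_adv = sum(adv_mask)           # PFR: adversarial samples judged benign
    pfr = (sum(m and p == benign for m, p in zip(adv_mask, y_pred)) / n_adv
           if n_adv else 0.0)
    return acc, rec, f1, pfr
```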

| Extending data diversity experiment results and analysis
Two groups of experiments were conducted to compare and analyse the defence effect of the classification algorithm after adding adversarial samples to the training set. The first group adds different proportions of adversarial samples to the training set and compares the detection performance. The authors selected adversarial samples with an evasion rate of over 90%, added them to the training set, tested similar adversarial samples, and compared the accuracy. The second group adds adversarial samples to the training set to detect different types of adversarial samples.

Figure 7 shows the results of the first set of experiments. A variety of classification algorithms based on the chi-square check are selected as the basic detection methods. We select the adversarial sample set whose evasion rate first reaches 90%, extract different numbers of adversarial samples, and add them to the training set to detect similar adversarial samples. In this set of experiments, 0.5%, 1%, and 5% of the adversarial samples were extracted from the adversarial sample set and added to the training set; the detection effect on similar adversarial samples is shown in Figure 7. The experimental results show that when a small number of adversarial samples are added to the training set (0.5%, about seven samples), the detection accuracy of the KNN and DNN classification algorithms can reach more than 95%, but the detection accuracy of the LSTM classification algorithm is only 57.6%. When a large number of adversarial samples are added (5%, about 74 samples), most classification algorithms can detect similar adversarial samples well, achieving an accuracy rate of more than 98.5%; only the LSTM classification algorithm is less effective.

Figure 8 shows the detection of adversarial samples with a large number of modified features after adversarial samples with a small number of modified features are added to the training set. In this group of experiments, a combination of the chi-square check and SVM was used. The adversarial sample sets with feature modifications of 10 and 50 were selected, and 5% of each (72 samples) was added to the training set; the adversarial sample sets with modifications of 100, 150, and 200 were then tested. The experimental results show that the greater the difference in the number of modifications, the worse the detection effect. In addition, 5% of the adversarial sample set with a feature modification of 200 was put into the training set to detect the adversarial sample sets with feature modifications of 10 and 50; the detection accuracy rates were 96.53% and 87.53%, respectively. Therefore, highly modified adversarial samples contain more information, which can be used to expand data diversity effectively.
These experimental results show that (1) adding an appropriate number of adversarial samples to the training set allows similar adversarial samples to be detected well; (2) adding high-modification adversarial samples to the training set can detect both similar adversarial samples and low-modification adversarial samples; and (3) adding adversarial samples to the training set can make the classification algorithm more robust, but there are also great limitations: when new adversarial samples appear, they cannot be effectively detected.

| Experimental results and analysis of different types of characteristics
To verify and compare the impact of different input feature sets on the classification results, we use three feature sets, FB, FM, and FB + FM, to construct the input information for the KNN algorithm (among many classification algorithms, KNN has a relatively good detection effect on both adversarial and ordinary samples) and test it on the normal test set and the adversarial sample set. The classification results are analysed and the impact of the different feature sets is compared.

Figure 9 shows the detection effect of the three feature sets on the normal test set. Among the three evaluation indicators, the FM feature set performs worst, whereas the FB feature set performs essentially the same as the FB + FM feature set.

Figure 9. Experimental results of the three feature sets on the normal test set.

Figure 10 shows the detection effect of the different feature sets on adversarial samples with different numbers of modified features. Because malicious features are not modified during feature modification, the detection effect of the FM feature set is the same on all types of adversarial samples. However, the FB feature set shows great differences in detecting various types of adversarial samples: its detection effect is good on adversarial samples with a low amount of modification, with a highest success rate of 97.66%, but as the amount of modification increases, its detection effect drops sharply.

Figure 10. Experimental results of the three feature sets on the adversarial sample set.
Based on the analysis of the two sets of experimental results, the strength of the FB feature set on low-modification adversarial samples and of the FM feature set on high-modification adversarial samples should both be retained. Therefore, the final prediction decision rule in the multiple feature set detection algorithm is designed for this goal. To maintain the advantage of the FB feature set, when the result of the FB branch is malicious, the detected application is considered malware; to retain the advantage of the FM feature set, in all other cases the result obtained with the FM feature set as input prevails.

| Experimental results and analysis of multiple feature set algorithm
To verify the effectiveness of the MFS algorithm, the test set was extended with 600 adversarial samples on top of the original 3000 test samples, with the various types of adversarial samples evenly distributed. Earlier experimental results show that SVM is effective in detecting normal samples, and KNN and DT can also detect adversarial samples to some extent. Therefore, the SVM, DT, and KNN classification algorithms are selected as the branch classification algorithms in the MFS experiments. In addition, on this dataset, the MFS detection method is compared with three basic classification algorithms and four other methods from four perspectives: accuracy, recall, F1, and PFR. The PFR of adversarial samples reflects the detection of adversarial samples. Table 1 shows the performance of the three basic classification algorithms on the mixed test set with different feature sets as the sample information.
In terms of accuracy, Tables 1 and 2 show that the MFS (SVM) algorithm has the highest accuracy rate of 92.05%, so the MFS algorithm performs well in detecting the overall sample. When the FB, FM, and FB + FM feature information was used to construct the input of SVM, DT, and KNN, respectively, the performance of these classification algorithms in detecting adversarial samples was not satisfactory, as shown in Table 1. However, when the MFS algorithm was applied to these classification algorithms, the PFR index dropped significantly, which proves the effectiveness of MFS. This effectiveness stems from the MFS algorithm's classification of features and its ensemble-style integration of results.
Experimental results show that the MFS (KNN) algorithm has the lowest PFR of 6.53%. The detection methods of other researchers performed poorly in terms of the overall accuracy and PFR, especially in detecting adversarial samples. Therefore, the MFS algorithm has good performance in detecting adversarial samples. In addition, the recall rate reflects the detection of benign samples. The MFS algorithm detects benign samples the worst. This is because the MFS algorithm focuses on detecting adversarial samples and sacrifices the performance of detecting benign samples. The F1 value reflects the comprehensive performance of the algorithm. The F1 value of the MFS algorithm is high; the highest value reached 89.49%, reflecting that the MFS algorithm has better comprehensive detection performance.
These experimental results show that (1) the MFS algorithm can detect most adversarial samples and its comprehensive detection performance is good, so it can enhance the robustness of the adversarial sample detection classification algorithm; (2) the MFS algorithm sacrifices part of the detection performance of benign samples.

| Real-world testing
To test the detection effect of the MFS algorithm in a real environment, Android application package (APK) decompilation, feature extraction, feature selection, and the MFS algorithm were integrated to develop a website that can detect APKs. The main purpose of this website is to test the detection effect of the MFS algorithm in actual applications. The mobile user can select the APK that needs to be tested; the website will analyse the APK and display the determination result, detection time, and permission information for the user.
To verify the detection performance of the MFS algorithm under real conditions, benign applications were randomly downloaded from the Google Play Store and the Wandoujia application market, and malware was randomly selected from VirusShare. The application details are shown in Table 3. The authors continuously increased the number of applications, used the website for testing in an Android emulator, and adopted the testing time and testing accuracy of the various samples as evaluation indicators. MFS (KNN) and MFS (SVM) showed distinctive detection characteristics in the earlier experimental results, so they are selected as the detection algorithms in the actual environment; the results are shown in Table 4. As the number of applications increases, the detection rate for all types of applications generally shows a downward trend, but the decline is relatively small. MFS (SVM) is more accurate in detecting malicious samples, but it also consumes relatively more time. With the increase in the number of detection samples, the efficiency of the MFS algorithm continues to decrease. When the detection number increases to 2000, the efficiency of MFS decreases most significantly: it takes more time, but its detection accuracy does not increase. The reason is that as the number of samples increases, the variety of applications grows and the average size of applications increases significantly, leading to longer detection times.
The detection results in the actual environment show that the design and selection tendency of the MFS algorithm lead to relatively high misjudgement of benign applications in actual detection. In addition, because much existing benign software requests excessive permissions, there are large errors in the determination results when the FM feature set is used as input in the MFS algorithm, especially with the SVM algorithm. This led to the situation in Table 4, in which, as the proportion of benign samples with excessive permissions in the total sample decreased, the accuracy of MFS (SVM) in detecting benign samples increased. In detecting benign applications, the actual detection effect is the opposite of the experimental results: in actual detection, MFS (KNN) outperforms MFS (SVM). The reason is that when dealing with unknown new samples (benign applications with excessive permissions), KNN is more robust than SVM. As shown in Table A1, KNN is better at detecting new types of unknown samples, whereas SVM is more effective in detecting samples of known types.

| CONCLUSIONS
To clarify the generation principle and process of adversarial samples, a new adversarial sample generation algorithm is proposed for the static detection of Android malware from the perspective of feature interpretability. The adversarial samples generated by this algorithm have a certain universality: as the number of modified features increases, the evasion rate of adversarial samples on seven detection methods also rises. Moreover, this work verified that adding adversarial samples with a high amount of feature modification can enhance the robustness of classification algorithms, making them effective in detecting adversarial samples with a low amount of modification. In addition, the MFS detection algorithm was designed to enhance the robustness of adversarial sample classification detection. The algorithm fully uses the characteristics of different types of features and can detect adversarial samples effectively even when there are no adversarial samples in the training set. However, there are still some deficiencies in this work. The evasion rate of adversarial samples generated by the FSASG algorithm is not ideal against certain classification algorithms, and the MFS algorithm still needs improvement in detecting adversarial samples; its performance in detecting benign samples is relatively poor. Directions for future research are as follows: (1) GAN can generate efficient adversarial samples against a given detection method, and the feature modification range of GAN can be used to optimise the FSASG process. (2) The potential of no-tendency features can be exploited in detecting adversarial samples: a large number of no-tendency features lie outside the modification range of adversarial samples, so their detection potential can be tapped.