1. Introduction
Android is an open-source smartphone operating system (OS) that dominated the global smartphone OS shipment market in 2021, accounting for 83.8% of the overall market [
1]. Android applications have been increasingly utilized in many aspects of daily life. Because Android-based smartphones are used by thousands of people every year, cyber-criminals are constantly developing dangerous applications in order to steal sensitive information and perform malicious attacks. According to the Kaspersky mobile threats report in 2021 [
2], cyber-criminals have increased the effectiveness of their mobile attacks by developing more complex mobile malware that utilizes new methods to steal users’ sensitive information. According to the study, 3.5 million malware apps were discovered, resulting in 46.2 million cyberattacks worldwide in 2021. Because the amount and complexity of malware are rapidly increasing, it is critical to develop effective malware classification approaches to handle this issue [
3,
4]. However, malicious software developers typically make minor changes to the malware app’s original source code, creating new, dangerous software variants in order to evade malware detection systems. As a result, identifying new malicious software variants becomes more difficult [
5].
To identify and categorize Android malware, security researchers have employed different approaches that can be classified as static, dynamic, or mixed, and they are all done with the help of machine learning [
6]. Static analysis parses the source code of an Android Application Package (APK) file without running it. APK files include numerous features, including API calls, network addresses, permissions, hardware structures, and signatures [
7]. Each of these characteristics can be utilized to determine the maliciousness of an Android application. Static analysis is relatively inexpensive; however, it is unable to recognize malware attacks that use zero-day vulnerabilities [
8]. Dynamic analysis, on the other hand, investigates the app’s features and behavior during operation. It isolates the app from the outside world by placing it in a sandbox and observing its behavior; when an application’s behavior is found to not follow the established pattern, it is classified as malware. Various behavior-based characteristics, such traffic in network traffic [
9], API calls [
10], and log files of the system [
11], can be obtained during dynamic analysis. However, because the analysis is carried out in an isolated setting, the malware’s behavior may vary during its execution. As a result, dynamic analysis may fall short of catching the true activities of a malware attack.
Although malware detection using a single machine-learning classifier has been thoroughly investigated, the performance of each model differs because of variations in training datasets and strategies of feature selection. Moreover, each classifier has inherent limitations and uncertainty. Using a combination of classifiers has an advantage over using a single classifier in reducing the variance of expected error and enhancing classification accuracy. Ensemble learning is a machine learning strategy that employs several methods instead of relying on a single method to build a single accurate classifier; it has been proven to be a successful approach in a range of fields. Theoretically and practically, it has been demonstrated that ensemble learning approaches outperform weak single methods, especially when solving complex and high-dimensional prediction problems [
12]. The most prevalent ensemble learning techniques are bagging [
13], boosting [
14], and stacking [
15]. Stacking is a merging strategy that fuses numerous machine learning methods (base models) organized in one stage and then applies a different machine learning method (meta model) to obtain a more accurate classification model [
16].
An ensemble learning method can be developed to address the shortcomings of single-model-based methods. It combines numerous separate classifiers and typically outperforms a single model in terms of generalization performance. In this paper, an approach for Android malware classification using optimized ensemble learning based on genetic algorithms is proposed. The proposed approach consists of two stages. First, a base learner is constructed to handle different machine learning algorithms, including support vector machine (SVM), logistic regression (LR), K-nearest neighbor (KNN), decision tree (DT), gradient boosting (GB), and AdaBoost (ADA) classifiers. Second, a meta learner RF-GA, utilizing genetic algorithms (GAs) to optimize the parameters of the random forest (RF) algorithm, is employed to classify the prediction probabilities from the base learner. This optimization mechanism involves setting the parameters in the RF algorithm training procedure, which improves the accuracy of the differentiation between goodware and malware apps.
The contributions of this research are as follows:
An improved ensemble learning approach based on genetic algorithms is proposed to classify Android malware apps.
In order to increase Android malware classification accuracy, genetic algorithms are used to optimize the meta learner’s parameters.
In comparison to other techniques, the suggested method achieved the highest classification accuracy.
The rest of the paper is organized as follows:
Section 2 discusses works relevant to Android malware classification;
Section 3 introduces the proposed approach for Android malware classification using optimized ensemble learning based on genetic algorithms;
Section 4 presents the experimental results and discussion; finally,
Section 5 concludes this paper and presents possible future work.
2. Literature Review
Several approaches have been suggested for Android malware classification based on machine learning techniques. Machine learning (ML) is a technique in which a system learns a pattern, builds a model, and provides classifications based on the input data.
SEDMDroid, a stacking integration framework proposed by Zhu et al. [
17], is a tool for detecting Android malware. By obtaining all principal components and utilizing the complete dataset to train each base learner multilayer perceptron (MLP), the analysis of the principal component was performed on each feature subset. Then, as a fusion classifier, the support vector machine (SVM) algorithm was employed to learn hidden supplemental data from the results of ensemble models and generate the ultimate classification. Using multi-level static features, the suggested method achieved an accuracy of 89.07%. Idress et al. [
18] suggested PIndroid, a novel permission- and intent-based approach for detecting malicious apps. PIndroid was one of the pioneer methods for identifying malware accurately by combining permissions and intent, supplemented by an ensemble technique. The authors investigated the link between permissions and intent using statistical significance tests, and found a statistically significant, robust association between intent and permissions that could be used to classify malware apps.
Rana et al. [
19] suggested and analyzed multiple machine learning techniques for classifying Android malware, coupled with a substring-based classifier feature selection (SBFS) strategy using an ensemble-based learning methodology. They used the Drebin dataset and an ensemble learning approach to achieve better results. Li et al. [
20] proposed an Android malware classification approach based on permissions. This approach utilized 22 Android permissions to differentiate between goodware and malware apps based on ML algorithms. The dataset of permissions was generated using a database of benign and malware apps from Google Play. The support vector machine (SVM) performed better than the other studied ML algorithms (decision tree (DT) and naïve Bayes (NB)) and achieved the highest accuracy of 90%. Lou et al. [
21] suggested a machine learning method to differentiate between Android goodware and malware apps. The proposed method employed the SVM algorithm for the detection of Android malware. The Drebin and Google Play datasets were utilized to train the classifier, and the highest accuracy achieved by the proposed approach was 93.7%. Since this study used SVM only, it is preferable to examine the performance of more ML algorithms.
Firdaus et al. [
22] suggested a method for detecting Android malware apps based on feature analysis, with Android permissions and directory paths among the features. With this method, based on the AndroZoo and Drebin datasets, they utilized three machine learning methods: multilayer perceptron (MLP), radial basis function network (RBFN), and voted perceptron (VP). Their proposed MLP-based technique had a classification accuracy of 90% and an 87% true positive rate. Arp et al. [
23] developed “Drebin”, software for identifying malicious Android applications. To distinguish between good and malicious programs, they employed network addresses, Android permissions, and API calls as characteristics. Their findings were based on a dataset that included 123,453 goodware and 5560 malware apps from a variety of app stores. Drebin attained 94% detection accuracy and a 1% false positive rate.
We proposed an evolving hybrid neuro-fuzzy classifier (EHNFC) for Android malware classification, utilizing characteristics based on permissions in a prior study [
24]. We improved the evolving clustering algorithm to include a dynamic mechanism for adjusting clustered permission-based feature radii and centers. The results show that the suggested technique diagnoses Android malware with an accuracy of 90%. In a former study [
25], we presented a strategy for Android malware classification utilizing adaptive neuro-fuzzy inference systems (ANFIS), an information gain algorithm used to select the necessary Android permissions. The suggested approach achieved 75% accuracy. We introduced an adaptive neuro-fuzzy inference method with fuzzy c-means clustering (FCM-ANFIS) for the classification of Android malware in our previous research [
26]. According to the results of the experiments, the proposed technique attained classification accuracy of 91%. In [
27], Garg and Niyati proposed an approach for Android malware classifiers based on parallel classifiers. Their proposed methodology merged characteristics from several parallel classifiers using expectation maximization and achieved 98.27% accuracy.
3. Suggested Approach
The suggested approach is an ensemble method for classifying malware/goodware that uses a random forest classifier (meta learner) to make the final decision using the predictions of other ensemble members (base learner) as features. The base learner consists of six classifiers: gradient boosting (GB), AdaBoost (ADA), K-nearest neighbor (KNN), logistic regression (LR), support vector machine (SVM), and decision tree (DT). To achieve the maximum classification accuracy for Android malware apps, genetic algorithms were utilized to enhance the base learner’s parameters. The suggested approach is depicted in
Figure 1.
3.1. Dataset and Feature Selection
A dataset of 5560 Android malware apps and 9476 Android goodware apps was used in this research [
23]. The malware apps in the dataset came from 49 distinct malware families, such as Goldmaster, DroidKungFu, GingerMaster, and FakeInstaller. New Android app samples were included in the dataset. In preparation for evaluation, preprocessing was performed on the dataset, and it was divided into training and testing datasets. Android permissions were included in the dataset as a differentiating factor between good and malicious apps.
Feature selection methods can help to increase the accuracy of classification by employing key characteristics that define the distinctive properties of the dataset. These methods shrink datasets by omitting the permissions that are not essential for categorization. Machine learning algorithms might suffer from redundancy or dissimilar properties, which can cause a range of problems, including a decline in their efficacy. It is important to employ a suitable feature selection strategy in order to attain precise detection of Android malware apps.
To boost classification accuracy, the crucial Android permissions were specified in this research using the information gain ratio (IGR) method [
28]. The IGR gives the maximum score to the most potential permissions based on the class of malware or goodware the Android app belongs to. The IGR equations are described as follows:
where
gain_r(
P,
M) indicates the gain ratio of permission
P occurrence in class and
M,
Mi and |
Mi| denote the occurrence of permission
P in class
M, the
ith subclass of
M, and the total number of permissions in
Mi, respectively.
The IGR algorithm depends on determining the relations between an Android app’s permissions and then calculating and rating each permission separately. Because of its great efficiency and quick calculation, the information-gain ratio is used to choose the key Android app permissions in the proposed method. It evaluates the appropriate permissions based on class (malware or goodware).
Figure 2 shows some of the features included in the proposed model, as well as their information gain rankings. The horizontal axis displays the information gain ranks for the chosen permissions.
3.2. Random Forest Algorithm
A random forest classifier uses an ensemble approach to make predictions by generating and using many classification trees. Using a statistical approach known as bootstrapping, the classification trees are trained independently on a portion of the training data. Bootstrapping is a strategy for obtaining a large number of datasets by sampling from the original dataset. The sampling can be done with replacement, which means that a sampled subset can contain duplicated data points, allowing for a high number [
29].
3.3. Genetic Algorithms
The genetic algorithm (GA) is based on the evolutionary hypothesis that individuals with greater capacity for environmental adaptation are more likely to survive and consequently pass on their abilities to the next generation. As the children inherit their parents’ features, they may generate individuals who are better or worse replicas of their parents [
30]. Better individuals will have a higher chance of surviving and, as a result, having better offspring, whereas the worst ones will eventually disappear. The individual with the top chromosomes (properties) will be selected as the global optimum after numerous generations [
31]. Population size, representing the total number of solutions, is one of the most essential elements of GA and has a major impact on the task results, as the number of generations corresponds to the maximum number of iterations of the algorithm [
32]. GA employs the probabilistic optimization method to obtain the optimization search space automatically and adapt the search path as needed. There are three steps in the basic process of selecting good variations, mutations, and crossings:
Generate a chromosome population.
Use the fitness function to assess the fitness value of individuals in the population.
Create new populations using genetic operators (selection, crossover, and mutation).
Repeat steps 2 and 3 indefinitely until the termination requirements are met.
In the process of using GA to optimize the RF algorithm parameters in our study, it was essential to define the GA parameters, which included population size (μ), iteration number (I), mutation_rate, crossover_rate, and fitness function. In this work, we chose μ = 100, I = 100, crossover_rate = 0.1, and mutation_rate = 0.9, as these values obtained the best results. For the fitness function, the fitness of each population chromosome was assessed based on classification accuracy.
3.4. Optimization of Random Forest Using GA
In the suggested ensemble learning technique, the genetic algorithm was employed to optimize the parameters of the base leaner, which is the RF algorithm, by identifying RF parameters that enhanced the accuracy of Android malware classification. The inputs to the random forest classifier were the prediction probabilities resulting from the base learner and the hyperparameters obtained from the GA, as shown in
Figure 1. The GA-RF implementation steps are as follows:
Step 1: Initialize the population. All chromosomes in the initial population are binary-coded, and there is a certain number of chromosomes chosen at random. A gene that is encoded 0 means that a certain characteristic is not desired.
Step 2: Assess each entity’s fitness. Entities with a greater fitness value are nearer to the algorithm’s ideal outcome. This hybrid model’s fitness function is based on the average accuracy using five-fold cross-validation.
Step 3: Apply GA operators. A parent population for subsequent procedures is created by first screening out numerous pairs of chromosomes based on the fitness of various entities using the selection operator. Entities in the parent population then carry out crossover and mutation operations using the respective operator defined in GA to create candidate entities with new attributes.
Step 4: Output hybrid model. Repeat step 3 until the maximum number of iterations is attained. The model outputs the ideal RF parameters when the stop criterion is fulfilled.
GA generates a candidate hyperparameter configuration in each of its allotted iterations, then these parameters are used along with the prediction probabilities from the base learner to obtain the classification accuracy of the trained model. The configuration that produced the best classification accuracy was utilized to train the final model before it was tested on the test dataset, yielding the value of random forest classification accuracy when utilizing GA.
Table 1 shows the RF parameters optimized by GA.
3.5. Performance Measure
A confusion matrix is commonly used to assess the effectiveness of machine learning techniques.
Table 2 provides an example of a confusion matrix, which contains the following information:
TP: The number of Android apps that are actually malware and are predicted as malware.
TN: The number of Android apps that are actually goodware and are predicted as goodware.
FP: The number of Android apps that are actually goodware but are predicted as malware.
FN: The number of Android apps that are actually malware but are predicted as goodware.
Accuracy is defined as the percentage of correct classifications divided by the total number of classification options. Accuracy can be mathematically described as:
Precision relates to how often the machine learning algorithm is correct. Precision can be expressed mathematically as:
Recall is a metric for true positive rates. The mathematical expression for recall is:
The F-measure is defined as the harmonic average of recall and precision:
A receiver operating characteristic (ROC) curve is a graph illustrating the performance of the classifier at all classification thresholds. This curve plots two parameters: false positive rate and true positive rate.
True positive rate (TPR) is another term for recall and is therefore defined as follows:
False positive rate (FPR) is defined as follows:
The area under the ROC curve (AUC) is a visual illustration of the FPR and TPR using various folds. AUC has higher accuracy in terms of performance evaluation because it is not predicated on a threshold. A classifier with an AUC value close to 1 generally performs better. AUC, precision, and recall measurements were utilized to assess the performance of the suggested strategy.
Precision and recall are important parameters for assessing classifier performance. Precision is a metric that indicates how many Android applications identified as malware were indeed malware, whereas recall indicates how many of the total relevant results were correctly categorized. A high score for recall implies a low proportion of false negatives (FN), whereas accuracy with a high score suggests a low number of false positives (FP). High percentages of precision and recall indicate that the model correctly provides classification results [
33]. As a result, the precision–recall score curve provides a reliable estimate of the classification model’s accuracy [
34].
4. Results and Discussion
In this section, we evaluate the performance of the proposed approach by comparing it to six basic classifiers using information from the experiments. The basic classifiers were logistic regression (RL), support vector machine (SVM), K-nearest neighbor (KNN), decision tree (DT), gradient boosting (GB), and AdaBoost (ADA).
The suggested technique has the highest accuracy score of 94.15 %, proving its ability to identify malware from good software (
Table 3). Furthermore, the suggested method earned a recall score of 94.15%, proving its capacity to correctly identify 94.15% of Android applications while reducing false positives. The RF algorithm attained the second highest accuracy of 93.77%, while AdaBoost achieved the lowest accuracy of 90.96%. The proposed approach attained the highest precision, recall, and F1 scores of 94.153, 94.151, and 94.152%, respectively.
Figure 3 shows the AUC, which was used to analyze the suggested technique. The AUC is a valuable and relevant indicator of overall performance. A high AUC value suggests a stronger ability to categorize. As shown in
Figure 3, the suggested technique has an AUC of 98.1%, demonstrating that it can efficiently differentiate between Android malware and goodware apps, and proving that our proposed ensemble classifier performed better than the classifiers we used as base models.
In general, the precision–recall curve is used to evaluate classifiers depending on recall and precision, and is visualized on a graph in which the ratio of precision is given on the Y-axis and the recall on the X-axis. The precision–recall curve provides a comprehensive view of classifier evaluation. The precision–recall curve of the proposed classifier was compared to those of existing machine learning techniques, as shown in
Figure 4.
The performance of the proposed approach was also compared to that of existing techniques.
Table 4 compares the performance of the suggested approach with other approaches.
5. Conclusions
Because Android malware applications damage sensitive data, resulting in large financial losses, it is critical to develop efficient categorization methods for them. Using optimized ensemble learning based on genetic algorithms, this study provides a solution for Android malware classification. The suggested method consists of two steps. First, a base learner is constructed to handle various machine learning algorithms, including support vector machine (SVM), logistic regression (LR), K-nearest neighbor (KNN), decision tree (DT), gradient boosting (GB), and AdaBoost (ADA) classifiers. Second, to classify the prediction probabilities from the base learner, a meta learner RF-GA is used, which employs the genetic algorithm (GA) to optimize the parameters of the random forest (RF) algorithm. GAs were utilized to adjust the RF algorithm’s parameter settings in order to achieve the highest Android malware classification accuracy. The experimental findings show that the proposed ensemble learning technique for Android malware classification, based on optimized random forest using genetic algorithms, outperformed the other approaches and reached the highest accuracy of 94.15%.