In this section, we present and discuss our results based on the experimental framework outlined in Section 3.3. Table 6 and Table 7 compare the detection performances of FPA against the baseline classifiers (NB, BN, conjunctive rule (CR), decision table (DTable), alternating decision tree (ADT), and decision stump (DS)) on the original Android malware datasets. It should be noted that the baseline classifiers were selected for their diverse computational characteristics and their performance as reported in existing studies. Furthermore, Table 8 and Table 9 present the detection performances of FPA against the baseline classifiers on the SMOTE-balanced Android malware datasets; the purpose here is to showcase the effect of the data sampling method (SMOTE) on the performance of FPA. Table 10 and Table 11 present analyses of the detection performance of FPA and its enhanced variants (Cas_FPA and RoF_FPA) on the Android malware datasets. Lastly, the detection performance of FPA, Cas_FPA, RoF_FPA, Cas_FPA+SMOTE, and RoF_FPA+SMOTE is compared with that of existing state-of-the-art Android malware models. This section also includes figures that illustrate the significance of the research findings. The top results are highlighted in bold, whereas the proposed methods are denoted by an asterisk.
4.1. Android Malware Detection Performance Comparison of FPA and Baseline Classifiers
In this section, the Android malware detection performance of FPA is compared with that of the selected baseline classifiers on the original and SMOTE-balanced Malgenome and Drebin datasets, as described in Section 3.3 (Scenario 1).
Table 6 presents the detection performance of FPA and the selected baseline classifiers on the Malgenome dataset. FPA outperformed the baseline classifiers on all the evaluation metrics considered. Specifically, FPA achieved a 98.94% detection accuracy, which is +3.15% higher than that of the best-performing base classifier, ADT, with a 95.89% detection accuracy. Just below ADT, the Bayesian models (NB and BN) also recorded good detection accuracy values of more than 92%. The high detection accuracy achieved by FPA, even on the unbalanced Malgenome dataset, emphasizes its robustness and efficacy for Android malware detection. FPA achieved the highest AUC value, at 0.998, followed closely by ADT (0.991). In addition, FPA achieved a very good balance between precision and recall, recording an F-measure value of 0.989. This observation further affirms the detection performance consistency of FPA as compared to the other baseline classifiers evaluated. Although ADT also achieved a good F-measure value, it remains inferior to that of FPA (+3.12%). Considering the misclassification rate, FPA obtained the lowest FPR, with only a 0.016 probability of misclassifying a benign instance as malware. This very low FPR shows that FPA is less prone to the misclassification errors often observed in the other base classifiers. It can also be observed that the Bayesian models (NB and BN) outperformed the rest of the base classifiers in terms of FPR.
Figure 2 presents a graphical representation of the performance comparison.
Similar performance patterns were observed on the Drebin dataset (Table 7): FPA recorded superior detection performance based on accuracy (98.13%), AUC (0.997), F-measure (0.981), and FPR (0.025) values when compared with the baseline classifiers. The detection performances of the classifiers on the Malgenome dataset were somewhat better than those on the Drebin dataset. Among the base classifiers, ADT recorded the highest detection accuracy (93.73%), AUC (0.981), and F-measure (0.937) values and the lowest FPR (0.063). DTable closely followed ADT in accuracy (90.20%), AUC (0.922), F-measure (0.922), and FPR (0.077). That is, DTable outperformed the Bayesian algorithms (BN and NB) on the Drebin dataset, in contrast to the Malgenome dataset. Generally, all the algorithms achieved good performance on the original datasets, except CR and DS, whose performances were relatively low.
Figure 3 shows the Android malware detection performances of FPA and the baseline classifiers on the Drebin dataset.
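The metrics compared above can all be derived from a confusion matrix and ranking scores. The following sketch, on hypothetical toy labels (not data from this study), computes accuracy, precision, recall (the TPR column), F-measure, FPR, and AUC directly from their definitions:

```python
# Hypothetical toy labels and scores; computes the evaluation metrics used
# in this section (accuracy, AUC, F-measure, FPR) from their definitions.
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                    # 1 = malware, 0 = benign
y_pred  = [1, 0, 1, 0, 0, 0, 1, 0]                    # hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.3, 0.7, 0.2]    # class-1 probabilities

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

accuracy  = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)                  # equals the TPR column in the tables
f_measure = 2 * precision * recall / (precision + recall)
fpr       = fp / (fp + tn)                  # benign apps wrongly flagged as malware

# AUC: probability that a random malware sample scores above a random benign one
pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]
auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))
```

Note that the low FPR values highlighted in the text matter precisely because FPR counts only benign apps flagged as malware, independently of overall accuracy.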
We then investigated the performance of FPA and the baseline classifiers on the SMOTE-balanced Android malware datasets. The SMOTE data sampling technique was deployed to remove the inherent class imbalance in the Android malware datasets; the preference for SMOTE is based on its wide usage and reported efficacy in existing studies. Table 8 and Table 9 present the experimental results of FPA and the baseline classifiers on the SMOTE-balanced Malgenome and Drebin datasets, respectively.
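The core idea of SMOTE is to synthesize new minority-class samples by interpolating between a minority instance and one of its nearest minority-class neighbours. The sketch below is a simplified illustration of that idea on hypothetical 2-D points; the study itself used an off-the-shelf SMOTE implementation:

```python
# Minimal sketch of the SMOTE idea: synthetic minority samples are
# interpolated between a minority instance and a nearby minority neighbour.
# (Simplified illustration only, not the implementation used in the study.)
import random

def smote(minority, n_new, k=3, seed=42):
    """Generate n_new synthetic samples from the minority class."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (squared Euclidean distance)
        neighbours = sorted(
            (m for m in minority if m is not x),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Toy minority class (e.g. the under-represented rows of an unbalanced dataset)
minority = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.2), (1.1, 2.1)]
new_samples = smote(minority, n_new=4)
```

Because each synthetic point is a convex combination of two real minority points, the generated samples stay inside the region occupied by the minority class rather than duplicating existing rows.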
As presented in Table 8, FPA still outperformed the baseline classifiers on all performance metrics evaluated. Specifically, FPA had a detection accuracy value of 98.85% and an FPR as low as 0.011. However, on the AUC and F-measure metrics, the detection performance of the baseline classifiers was comparable to that of FPA. This can be attributed to the deployment of data sampling (SMOTE) to address the class imbalance problem: the detection performances of the evaluated classifiers improved relative to their respective performances on the original Malgenome dataset. In particular, CR (+7.9%) and DS (+5.7%) achieved the greatest improvements in accuracy. Notable accuracy improvements were also observed for NB (+2.9%), BN (+1.9%), and DTable (+2.0%), while ADT (+0.2%) recorded only a slight improvement. Regardless, the detection accuracy of FPA remained superior. Concerning AUC values, DTable (+8.2%) achieved the greatest improvement, closely followed by NB (+7.34%), CR (+7.29%), and BN (+7.22%); slight improvements were noticed in the AUC values of ADT (+0.20%), DS (+0.12%), and FPA (+0.10%). A similar pattern held for the F-measure, as CR (+7.29%), DS (+5.31%), NB (+2.92%), DTable (+2.74%), and BN (+1.94%) had improved F-measure values. Concerning the FPR metric (where lower values indicate better detection performance), FPA achieved the greatest improvement, with a 31.25% reduction in its FPR value; ADT (−28.57%), DTable (−4.55%), and NB (−2.08%) also exhibited relative reductions in their FPR values. In general, the detection performances of FPA and the baseline classifiers improved on the SMOTE-balanced Malgenome dataset, but FPA still achieved the best overall performance.
Figure 4 illustrates the Android malware detection performances of FPA and the baseline classifiers on the SMOTE-balanced Malgenome dataset.
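The improvement percentages quoted throughout this section are relative changes between the original and SMOTE-balanced results. As a worked example, FPA's FPR on the Malgenome dataset dropped from 0.016 to 0.011, a 31.25% reduction:

```python
# Relative change between a "before" and "after" result, in percent.
# Negative values indicate a reduction (an improvement for FPR).
def relative_change(before, after):
    """(after - before) / before, expressed as a percentage."""
    return (after - before) / before * 100

# FPA's FPR on Malgenome: 0.016 (original) -> 0.011 (SMOTE-balanced)
fpa_fpr_change = relative_change(0.016, 0.011)
```

The same computation reproduces the accuracy increments (e.g. for the base classifiers) when applied to the before/after accuracy values in the tables.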
Furthermore, on the SMOTE-balanced Drebin dataset (Table 9), FPA again achieved the highest results in all performance measures considered. As on the SMOTE-balanced Malgenome dataset, the classifiers recorded improvements in their respective detection performances. As shown in Table 9, FPA (+0.24%), CR (+6.02%), DS (+5.48%), NB (+3.87%), BN (+3.38%), and DTable (+2.45%) had improved accuracy values when compared to their respective detection performances on the original Drebin dataset. Concerning AUC values, the Bayesian models NB (+14%) and BN (+14%), as well as CR (+11.11%) and DTable (+5.86%), each recorded significant improvements. The analysis based on the F-measure metric showed similar findings: both FPA and the base classifiers had improved F-measure values, except ADT, whose result remained unchanged. FPA and DTable had 36.00% and 68.83% reductions in FPR values, respectively, whereas the other classifiers (NB, BN, DS, ADT, and CR) recorded increased FPR values.
Figure 5 graphically represents the detection performances of FPA and baseline classifiers on the SMOTE-balanced Drebin dataset.
Based on the foregoing experimental results on the Malgenome and Drebin datasets, the following findings were observed:
- FPA recorded a higher detection performance than the baseline classifiers for Android malware detection. The baseline classifiers were selected based on their usage and performances, as reported in existing studies.
- The deployment of a data sampling method (in this case, SMOTE) not only alleviated the class imbalance problem but also improved the detection performances of FPA and the baseline classifiers.
- FPA can perform well for Android malware detection with or without the application of a data sampling method.
These findings validate the selection of FPA for Android malware detection. However, to further amplify its detection performance, FPA variants based on meta-learning concepts (Cas_FPA and RoF_FPA) were developed; empirical analysis of the corresponding experimental results is presented in Section 4.2.
4.2. Android Malware Detection Performance Comparison of FPA, Cas_FPA, and RoF_FPA
In this section, the Android malware detection performance of FPA is compared with that of its enhanced variants (Cas_FPA and RoF_FPA) on the original and SMOTE-balanced Malgenome and Drebin datasets, as shown in Table 10, Table 11, Table 12, and Table 13, respectively. Thereafter, the detection performance of FPA, Cas_FPA, RoF_FPA, Cas_FPA+SMOTE, and RoF_FPA+SMOTE is compared with that of existing state-of-the-art Android malware detection models (Table 14 and Table 15), as described in Section 3.3 (Scenario 2).
Table 10 and Table 11 present the detection performances of FPA, Cas_FPA, and RoF_FPA on the Malgenome and Drebin datasets. On the Malgenome dataset (Table 10), both Cas_FPA and RoF_FPA recorded superior detection performance over FPA. Specifically, Cas_FPA achieved a detection accuracy value of 99.45%, an AUC value of 1, an F-measure value of 0.994, and an FPR value of 0.008. RoF_FPA performed similarly, with a detection accuracy value of 99%, an AUC value of 0.999, an F-measure value of 0.990, and an FPR value of 0.018. Notably, the AUC values of 1 and 0.999 recorded by Cas_FPA and RoF_FPA, respectively, indicate the effectiveness of the two models in distinguishing Android malware applications from benign applications with approximately 100% certainty. Additionally, the near-zero FPR recorded by Cas_FPA (0.008) demonstrates its reliability in avoiding false malware alarms. Furthermore, on the Drebin dataset (Table 11), Cas_FPA and RoF_FPA outperformed FPA on all evaluation metrics. Cas_FPA (98.83%) and RoF_FPA (98.38%) showed +0.71% and +0.25% increments, respectively, in detection accuracy over FPA (98.13%). Concerning FPR, significant improvements were also observed: Cas_FPA (0.016) and RoF_FPA (0.023) had 36% and 8% reductions in FPR, respectively, when compared with FPA (0.025).
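Cascading generalization, the meta-learning concept behind Cas_FPA, extends the feature space of a second-level learner with the predictions of a base learner. The sketch below illustrates the idea on synthetic data; since FPA is not available in common Python libraries, a Gaussian naive Bayes base learner and a decision tree meta-learner stand in for the study's actual components (an assumption for illustration only):

```python
# Sketch of cascading generalization: a base learner's class probabilities
# are appended to the original features before training a second-level model.
# GaussianNB and DecisionTreeClassifier are stand-ins, not the study's FPA.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Level 0: base learner produces class probabilities
base = GaussianNB().fit(X_tr, y_tr)
aug_tr = np.hstack([X_tr, base.predict_proba(X_tr)])
aug_te = np.hstack([X_te, base.predict_proba(X_te)])

# Level 1: meta-learner trains on original features + level-0 predictions
cascade = DecisionTreeClassifier(random_state=0).fit(aug_tr, y_tr)
acc = cascade.score(aug_te, y_te)
```

The design intuition is that the meta-learner can correct systematic mistakes of the base learner because it sees both the raw features and the base learner's opinion about each instance.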
Furthermore, the detection performances of FPA, Cas_FPA, and RoF_FPA on the SMOTE-balanced Malgenome and Drebin datasets were compared. With this comparison, we aimed to ascertain whether the Cas_FPA and RoF_FPA algorithms can be further enhanced by deploying the SMOTE data sampling method.
Table 12 and Table 13 present the detection performances of FPA, Cas_FPA, and RoF_FPA on the SMOTE-balanced Malgenome and Drebin datasets, respectively.
As presented in Table 12, Cas_FPA and RoF_FPA performed better than FPA on the SMOTE-balanced Malgenome dataset. In particular, Cas_FPA and RoF_FPA recorded +0.57% and +0.22% increments in detection accuracy over FPA, along with 45% and 18% reductions, respectively, in FPR. A similar pattern was observed on the SMOTE-balanced Drebin dataset: as shown in Table 13, Cas_FPA and RoF_FPA recorded +0.58% and +0.12% increments in detection accuracy and 31.25% and 6.25% reductions, respectively, in FPR over FPA.
Based on the analyses presented here, it can be deduced that the enhanced variants (Cas_FPA and RoF_FPA), especially Cas_FPA, are more effective in detecting Android malware than FPA. In other words, the meta-learners (cascading generalization and rotation forest) that were deployed amplified the detection performance of FPA.
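Rotation forest, the second meta-learner deployed (RoF_FPA), diversifies an ensemble by splitting the features into disjoint subsets, rotating each subset with PCA, and training a base learner on the rotated data. The sketch below illustrates this on synthetic data, with a decision tree again standing in for FPA (an illustrative assumption, not the study's implementation):

```python
# Sketch of the rotation-forest idea: each ensemble member trains on data
# rotated by a block-diagonal PCA transform built from random feature subsets.
# DecisionTreeClassifier is a stand-in for FPA (illustration only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=12, random_state=0)

def build_rotation(X, n_subsets=3):
    """Block-diagonal rotation matrix from PCA on random feature subsets."""
    idx = rng.permutation(X.shape[1])
    subsets = np.array_split(idx, n_subsets)
    R = np.zeros((X.shape[1], X.shape[1]))
    for s in subsets:
        comps = PCA().fit(X[:, s]).components_  # principal axes for subset s
        R[np.ix_(s, s)] = comps.T
    return R

ensemble = []
for _ in range(5):                              # five differently rotated trees
    R = build_rotation(X)
    tree = DecisionTreeClassifier(random_state=0).fit(X @ R, y)
    ensemble.append((R, tree))

# Majority vote of the rotated trees (here evaluated on the training data)
votes = np.mean([t.predict(X @ R) for R, t in ensemble], axis=0)
pred = (votes >= 0.5).astype(int)
acc = (pred == y).mean()
```

Because every member sees a differently rotated view of the same data, the trees disagree in useful ways, which is the source of the accuracy gains reported for RoF_FPA above.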
For generalizability, the detection performance of FPA, Cas_FPA, RoF_FPA, Cas_FPA+SMOTE, and RoF_FPA+SMOTE was compared with that of existing state-of-the-art Android malware detection models.
Table 14 and Table 15 display the detection performances of the proposed methods and existing models on the Malgenome and Drebin datasets, respectively.
In Table 14, the results of the proposed methods are compared with findings from Lopez and Cadavid [70]; Yerima, Sezer, McWilliams, and Muttik [71]; Su, Chuah, and Tan [72] (DT); and Sen, Aysan, and Clark [11]. Specifically, Lopez and Cadavid [70] developed a kNN-based Android malware model with a detection accuracy of 94% and an F-measure value of 0.940. SAFEDroid, developed by Sen, Aysan, and Clark [11], achieved a detection accuracy of 98.30% and an FPR value of 0.02. Additionally, the Bayesian-based Android malware model by Yerima, Sezer, McWilliams, and Muttik [71] had a detection accuracy of 92.10% and AUC and FPR values of 0.972 and 0.061, respectively. All these methods were trained and tested on the Malgenome dataset. Clearly, the proposed models (FPA, Cas_FPA, and RoF_FPA) show significant improvement over the existing Android malware detection solutions.
Table 14. Detection performances of proposed methods and existing models on the Malgenome dataset.
| Model | Accuracy (%) | AUC | F-Measure | Precision | Recall | TPR | FPR |
|---|---|---|---|---|---|---|---|
| * FPA | 98.94 | 0.998 | 0.989 | 0.989 | 0.989 | 0.989 | 0.016 |
| * Cas_FPA | 99.45 | 1 | 0.994 | 0.994 | 0.994 | 0.994 | 0.008 |
| * RoF_FPA | 99.00 | 0.999 | 0.990 | 0.990 | 0.990 | 0.990 | 0.018 |
| * Cas_FPA+SMOTE | 99.42 | 1 | 0.994 | 0.994 | 0.994 | 0.994 | 0.006 |
| * RoF_FPA+SMOTE | 99.07 | 1 | 0.991 | 0.991 | 0.991 | 0.991 | 0.009 |
| Lopez and Cadavid [70] | 94.00 | - | 0.940 | 0.950 | 0.950 | - | - |
| Yerima et al. [71] | 92.10 | 0.972 | - | 0.937 | - | 0.904 | 0.061 |
| Su et al. [72] (DT) | - | - | - | - | - | 0.916 | - |
| Su et al. [72] (RF) | - | - | - | - | - | 0.967 | - |
| Sen et al. [11] | 98.30 | - | - | - | - | - | 0.020 |
Furthermore, as presented in Table 15, the proposed methods were compared with findings from Frenklach, Cohen, Shabtai, and Puzis [73]; Rana, Rahman, and Sung [64]; Tanmoy, Pierazzi, and Subrahmanian [74]; Salah, Shalabi, and Khedr [75]; Rana and Sung [66]; and Rathore, Sahay, Chaturvedi, and Sewak [34]. These existing Android malware detection models were trained and tested on the same Drebin dataset used in the present research. The similarity-graph-based Android malware detection model proposed by Frenklach, Cohen, Shabtai, and Puzis [73] had an F-measure value of 0.869, which is lower than the F-measure values of the proposed models. Rana, Rahman, and Sung [64] developed a tree-based model with a detection accuracy of 97.92%. Additionally, Tanmoy, Pierazzi, and Subrahmanian [74] proposed an ensemble of classification and clustering (EC2) method with an F-measure value of 0.970. Salah, Shalabi, and Khedr [75] utilized a feature-selection-based framework for Android malware detection with a detection accuracy of 94%. Similarly, the solutions by Rana and Sung [66] and by Rathore, Sahay, Chaturvedi, and Sewak [34] had detection accuracy values of 97.24% and 97.92%, respectively. Although these existing methods achieved relatively good detection performances, they were still outperformed by the proposed FPA and its enhanced variants (Cas_FPA and RoF_FPA).
Table 15. Detection performances of proposed methods and existing models on the Drebin dataset.
| Model | Accuracy (%) | AUC | F-Measure | Precision | Recall | TPR | FPR |
|---|---|---|---|---|---|---|---|
| * FPA | 98.13 | 0.997 | 0.981 | 0.981 | 0.981 | 0.981 | 0.025 |
| * Cas_FPA | 98.83 | 0.999 | 0.988 | 0.988 | 0.988 | 0.988 | 0.016 |
| * RoF_FPA | 98.38 | 0.997 | 0.984 | 0.984 | 0.984 | 0.984 | 0.023 |
| * Cas_FPA+SMOTE | 98.94 | 0.999 | 0.989 | 0.990 | 0.989 | 0.989 | 0.011 |
| * RoF_FPA+SMOTE | 98.49 | 0.998 | 0.985 | 0.985 | 0.985 | 0.985 | 0.015 |
| Frenklach et al. [73] | - | - | 0.869 | - | - | 0.939 | - |
| Rana, Rahman, and Sung [64] | 97.92 | - | - | - | - | - | - |
| Tanmoy et al. [74] | - | - | 0.970 | - | - | - | - |
| Salah et al. [75] | 94.00 | - | - | - | - | - | - |
| Rana and Sung [66] | 97.24 | - | 0.972 | 0.976 | - | 0.969 | 0.239 |
| Rathore et al. [34] | 97.92 | - | - | - | 0.976 | - | - |
4.3. Findings Based on Research Questions
In response to the research questions raised in the introductory section, the following conclusions were drawn based on the experiments conducted.
- RQ1:
How effective is the FPA algorithm in comparison to baseline classifiers in Android malware detection?
It was observed from the experimental results that FPA produced significantly improved detection performance when compared with the baseline classifiers. This superior detection performance was observed on both the Malgenome and Drebin datasets.
- RQ2:
How effective are the enhanced variants of FPA (Cas_FPA and RoF_FPA) in Android malware detection?
According to the experimental results and analyses, Cas_FPA and RoF_FPA performed better than FPA alone on both the original and SMOTE-balanced Malgenome and Drebin datasets. Additionally, it was observed that the deployed data sampling method resolved the latent class imbalance problem and subsequently improved the detection performances of the models evaluated, especially FPA, Cas_FPA, and RoF_FPA.
- RQ3:
How well do the proposed FPA and its variants perform as compared to current state-of-the-art methods in Android malware detection?
It was gathered from the experimental results that the proposed FPA, Cas_FPA, and RoF_FPA algorithms, in most cases, had superior detection performance compared to existing state-of-the-art Android malware detection models.