1. Introduction
The Internet of Things (IoT) has come to permeate all aspects of our lives, and IoT devices such as smartphones and smartwatches have become indispensable in the modern mobile-centric connected world. Android is an open-source operating system (OS) devised primarily for use in smart IoT devices. With a 71.18% market share [1], Android is undoubtedly the leading operating system used in IoT devices such as smartphones worldwide. The Android platform uses a permission-based security model to protect IoT devices from dangerous apps. However, this security model has proven to be inadequate for dealing with malware threats to IoT devices running the Android OS [2]. Currently, malware targeting the Android platform far outnumbers that targeting all other platforms and has continued to rise considerably over the last few years [3]. For example, in 2018 Kaspersky detected about 5,321,142 malware samples from different families targeting the Android platform. Also, Android app distribution markets, both third-party markets and the official Google Play Store, have become a haven for malware distribution [4]. In particular, the third-party markets, which host up to 50% malware apps [5], tend to be replete with malicious apps. Although Google tries to weed out malware-infected apps from its market, the Google Play Store occasionally hosts malicious apps, estimated at up to 22% of the apps uploaded to it [6].
Android malware can infringe on users' privacy, compromise data integrity and destroy sensitive information. With the fast evolution of Android malware, it is practically infeasible to manually detect malware apps in the Android ecosystem. Similarly, traditional signature-based classification methods are not effective [2], calling for an alternative approach that can quickly and effectively detect malware apps. As a result, machine learning algorithms are taking centre stage in malware detection [2,7,8]. Machine learning based solutions rely heavily on extracting meaningful features from the Android apps for training the models [9]. Generally, static and dynamic analysis methods are utilized to extract typical malware descriptive behaviour (i.e., features) from the raw data. These feature extraction methods normally generate very large, high-dimensional, redundant and noisy feature sets [10,11]. Some of the raw features offer little or no information useful for distinguishing malware apps from benign apps and may even degrade the performance of the malware detection methods [10,12,13,14]. As a result, automatic feature subset selection has become a key aspect of machine learning [15].
Feature selection algorithms select a subset of features from the original feature set that is considered useful for training the learning models to obtain good results [2,10]. A growing number of Android malware detection models have applied different feature subset selection algorithms and have achieved good detection rates [8,16]. However, research on the usefulness of state-of-the-art feature subset selection methods in the context of Android malware detection models has not received the attention it deserves [10]. To this end, we investigate the utility of the commonly used feature subset selection approaches for malware detection on the Android platform. We analyse the feature selection methods with the goal of finding out: (i) the order in which they select the subset features, (ii) the usefulness of the selected features for the performance of the learning models, (iii) the similarities between the various feature selection methods with respect to feature ranking, and (iv) the direct influence of varied feature length on the classification accuracy of the learning models. The contributions made in this paper can be summarized as follows:
We formulate the feature selection problem as a quadratic programming problem, and analyse how different feature selection methods work and how they are used in Android malware detection models,
We compare and contrast several commonly used filter-based feature selection methods along several factors,
We thoroughly analyse the requested-permissions distribution of the samples and the composition of the relevant feature subsets selected by the feature subset selection algorithms to discover the usefulness of the feature subsets,
We empirically evaluate the predictive accuracy of the feature selection techniques using several learning algorithms that do not perform feature subset selection internally, and
We demonstrate the usefulness of feature selection in Android malware classification systems.
We organise the remainder of the paper in the following manner: Section 2 gives an overview of the problem. Section 3 reviews related work, while Section 4 describes the model used in this paper. Experimental evaluation is discussed in Section 5. Concluding remarks and future work are presented in Section 6.
2. Problem Overview
Quality features are crucial for building effective machine learning based classification models, because the raw feature sets extracted from Android apps for training the models are very large. For example, Su et al. [11] extracted more than 50,000 different features from an Android app. Typically, some of these raw features are key in differentiating malware from benign apps while others are not [17]. Also, too many features increase model complexity and lead to the "curse of dimensionality" problem [17]. Moreover, these features tend to be high-dimensional [18] and replete with a large number of redundant features that may not be relevant for exclusively differentiating Android malware from benign apps [11]. It is not useful for machine learning algorithms to directly handle high-dimensional data [19], because such data normally contains significant noise and irrelevant features that add little or no value to the performance of the learning algorithms. Therefore, these unwanted features should be removed from the feature subsets used to train the learning models.
The feature selection problem can be formulated using quadratic programming. Specifically, we are given a dataset of $n$ training samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $y_i$ is the target classification label and $\mathbf{x}_i$ is a vector of $m$ raw features such that $\mathbf{x}_i \in \mathbb{R}^m$. The feature subset selection problem is to select a subset of the original features from the observation space $\mathbb{R}^m$ for use in training the learning algorithms. This problem can be formally stated as a quadratic programming optimization problem as follows [20]:

$$ \min_{\mathbf{w}} \; \frac{1}{2}\,\mathbf{w}^{T} Q\,\mathbf{w} - F^{T}\mathbf{w} \quad (1) $$

subject to

$$ w_i \geq 0, \quad i = 1, \ldots, m \quad (2) $$

$$ \sum_{i=1}^{m} w_i = 1 \quad (3) $$

The above formulation considers the relevance of the features to the class label whereas redundancy between the features is penalized. The parameter $Q \in \mathbb{R}^{m \times m}$ in the above equation is a similarity matrix used for capturing feature redundancy; the parameter $F \in \mathbb{R}^{m}$ quantifies the level of correlation between each feature and the target class label (i.e., captures how relevant the feature is). The entries in $\mathbf{w}$ represent the weight of each feature as computed by Equation (1). The constraints in Equations (2) and (3) enforce that the weight of each feature (i.e., $w_i$) is non-negative and that the weights add up to one. Normally, features with weights above a given threshold are considered useful features and selected for subsequent training of the learning algorithms.
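To make the formulation concrete, the following is a minimal sketch of this quadratic program in Python. It assumes, purely for illustration, that absolute Pearson correlations are used for both the similarity matrix Q and the relevance vector F (one common choice, not necessarily that of [20]); the function name, synthetic data, and selection threshold are our own assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def qpfs_weights(X, y):
    """Solve min 0.5 w^T Q w - F^T w  s.t.  w_i >= 0, sum(w) = 1 (Eqs. (1)-(3))."""
    m = X.shape[1]
    # Q: feature-feature similarity (redundancy penalty), assumed here to be
    # the absolute Pearson correlation between every pair of features.
    Q = np.abs(np.corrcoef(X, rowvar=False))
    # F: relevance of each feature, assumed to be its absolute correlation
    # with the class label.
    F = np.abs(np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(m)]))
    objective = lambda w: 0.5 * w @ Q @ w - F @ w
    constraints = ({"type": "eq", "fun": lambda w: w.sum() - 1.0},)  # Eq. (3)
    bounds = [(0.0, None)] * m                                       # Eq. (2)
    res = minimize(objective, np.full(m, 1.0 / m), method="SLSQP",
                   bounds=bounds, constraints=constraints)
    return res.x

# Toy usage: binary permission-style features; keep weights above uniform.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10)).astype(float)
y = rng.integers(0, 2, size=200).astype(float)
w = qpfs_weights(X, y)
selected = np.flatnonzero(w > 1.0 / X.shape[1])
```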
Although feature subset selection is important in machine learning, finding the best feature subset from the original features is known to be an NP-hard optimization problem [20]. This is because the feature selection algorithm has to examine a total of $2^m$ candidate subsets of features. As $m$ increases, it quickly becomes evident that examining all of the feature subsets exhaustively cannot be done in practice. An exhaustive search for the best possible features from the original feature set is infeasible in most cases; solving the problem even for a modestly large $m$ is computationally impracticable. As a result, many heuristic approaches have been proposed in the literature to solve this problem.
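As a small illustration of why exhaustive search is hopeless, the brute-force routine below enumerates every non-empty candidate subset; the score function is a stand-in for whatever subset-evaluation criterion one might plug in.

```python
from itertools import combinations

def exhaustive_best_subset(features, score):
    """Brute-force search over all 2^m - 1 non-empty feature subsets."""
    best, best_score = None, float("-inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):  # C(m, k) subsets of size k
            s = score(subset)
            if s > best_score:
                best, best_score = subset, s
    return best

# Even at m = 30 this loop would visit over a billion subsets, which is why
# heuristic (e.g., filter-based) selection is used instead.
```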
3. Related Work
Feature subset selection is among the most fundamental challenges in the machine learning arena. The problem has continued to draw increasing attention from researchers and practitioners alike [7,8,10,18,21,22,23,24,25,26,27,28]. Although choosing a subset of features from the original features is a combinatorial problem, many suboptimal heuristics have been put forward and used in various domains, including chi-squared based feature subset selection [7,8,10], analysis of variance (ANOVA) [7,8,10], mutual information [7,23,29] and information gain [18,25,26,27]. Many studies have shown that feature selection approaches that select a good feature subset significantly reduce processing complexity by eliminating unimportant features and enhance the performance of the learning models [24,30]. For example, Aonzo et al. [30] demonstrated that a small number of features is enough for very good classification. They considered the most significant features extensively used in prior research, selected a small number of features from the list of most important ones, and showed that this small feature subset is enough for very good classification.
Existing feature selection approaches are organized into filter, wrapper, embedded and hybrid methods [15,16]. The filter methods select a subset of features without altering their original representation. A statistical criterion is used in filter-based feature selection techniques to assess the relevance of the features. The selected features can be used by any learning method because the selection is not tied to any particular machine learning method. In contrast, feature selection in the wrapper-based approach involves a classification model to assess the suitability of the features [15]. Several authors have empirically compared representatives of filter, wrapper, and embedded feature selection methods using simulated data. Bolón-Canedo et al. [19] analyzed seven filter-based, two wrapper and two embedded feature selection methods using synthetically created microarray data sets under four machine learning classifiers. Similarly, Wah et al. [31] investigated how the filter methods compare with the wrapper methods in terms of classifier accuracy. The authors compared two filter methods, namely correlation-based and information gain feature selection, against two wrapper methods, namely sequential forward selection and sequential backward elimination. The feature selection methods were tested on both artificial and real data sets using logistic regression as the classifier. Xue et al. [32] compared filter-based and wrapper-based feature subset selection methods with respect to classification accuracy and execution time. These works show that the wrapper methods generally achieve better classification performance than the filter methods but are much slower.
Alazab [8] discussed a supervised machine-learning approach for Android malware detection using various feature sets generated with the Chi-Square and one-way ANOVA feature subset selection methods. The detection accuracy of ten supervised machine-learning algorithms was evaluated to identify the most reliable classifier for malware detection. The model with feature subsets produced by Chi-Square was found to have higher detection accuracy than the one using feature subsets produced by ANOVA. Wang et al. [7] experimented with three filter-based feature selection methods, namely Mutual Information, Chi-Squared and one-way ANOVA, to avoid overfitting of their model. The authors used the selected features to train linear regression (LR) models with different numbers of selected top features, and compared the models' performances against each other and against the full feature set. The main tenet of the above studies is to develop Android malware detection models, which differs from the main objective of our work.
Masabo et al. [10] proposed New Feature Engineering (NFE), a feature subset selection method that utilizes domain knowledge of the data to create feature subsets. The authors assessed the power of NFE on feature selection by comparing it against one-way ANOVA, Recursive Feature Elimination (RFE), and PCA using KNN, Linear Discriminant Analysis and Gradient Boosting classifiers. Thus, the main focus of that work is to evaluate the performance of NFE compared to ANOVA, RFE and PCA on feature selection. The authors used the 30 most discriminating features to assess the feature subset selection models and showed that NFE outperforms the other approaches in terms of precision, recall, and F-score. In contrast, we compare and contrast several feature selection methods along several factors, including the composition of the relevant features selected. Moreover, PCA transforms the original features, so it is not really appropriate to compare it with approaches that do not transform the original features.
Wang et al. [33] discussed one-class classification methods for detecting zero-day Android malware attacks using the Intra-Class Distance (ICD) feature selection method. The one-class classification methods use benign samples only to construct the detection model, as opposed to the two-class models that use both benign and malware samples. In order to justify the use of ICD, the authors compared it against PCA and the Pearson Correlation Coefficient method using the Gauss Distribution and ν-SVM classifiers. It was shown that the Pearson Correlation Coefficient has significantly poorer classification and runtime performance compared to ICD and LS. Although ICD and LS have comparable classification performance, ICD has a significantly lower runtime than the other models.
Mas’ud et al. [22] investigated the use of several different feature selection methods in optimizing the n-gram system call sequence feature for classifying benign and malicious mobile applications. The n-gram system call sequence can generate a large number of features to be used in the classification, which can degrade classification performance. Several filter and wrapper feature selection methods were selected and their performance analyzed: four filter methods, namely Correlation-based Feature Selection (CFS), Chi-Square (CHI), Information Gain (IG) and ReliefF (RF), and one wrapper method with a Linear SVM classifier (WR). The feature selection methods were evaluated based on the number of features selected and the contribution made to improving the True Positive Rate (TPR), False Positive Rate (FPR) and accuracy of the Linear-SVM classifier in classifying benign and malicious mobile applications.
Mahindru and Sangal [24] discussed a Least Squares Support Vector Machine (LSSVM)-based malware detection system. The authors analyzed various feature selection methods for the purpose of selecting the relevant features, including Pearson's correlation coefficient, Chi-Squared, Rough Set Analysis (RSA), Information Gain, Consistency Subset Evaluation and PCA. Empirical results reveal that the model using RSA feature selection achieved better results than the other feature subset selection methods. Bommert et al. [34] comprehensively compared various filter-based feature selection techniques available in different toolboxes. It was concluded that there is no single feature subset selection method that is superior to the others all the time, but some of the methods consistently perform well on many of the data sets.
Wang et al. [29] used mutual information, Pearson correlation coefficient, and T-test feature subset selection with the aim of understanding the possible risks posed by Android permissions. The permissions are ranked based on their risks. In order to determine risky permission subsets, two different methods, namely sequential forward selection and principal component analysis (PCA), were deployed. In experiments with several machine learning algorithms (SVM, decision trees, and random forest), the authors showed that risky permissions as features offer satisfactory performance, with a detection rate of 94.62% and a false positive rate of 0.6%. That work focused on identifying risky permissions and their use in Android malware detection. In contrast, we focus on the feature subset selection algorithms used in Android malware detection.
Wang and Li [35] used the PCA, Correlation, Chi-square and Information Gain feature selection methods for dimension reduction. The authors examined Android kernel features to identify Android malware. A Weight-Based Detection (WBD) approach was proposed to differentiate between Android malware and benign apps using Decision Tree, Naive Bayes, and Artificial Neural Network machine learning algorithms. The authors applied the four feature selection methods to 112 multi-dimensional kernel features of tasks and processes. Vinod et al. [36] examined system calls for Android malware detection. The authors compared five feature selection methods for reducing the high-dimensional system call set: symmetric uncertainty, information gain, principal component analysis (PCA), Absolute Difference of Weighted System Calls (ADWSC) and Ranked System Calls using Large Population Test (RSLPT). The last two methods were proposed by the authors, whereas the other three were used as benchmarks to demonstrate the effectiveness of the ADWSC and RSLPT feature selection methods.
There are many survey papers on feature selection methods [37,38,39,40]. Wang et al. [37] presented a systematic survey of the feature subset selection techniques and the features used in the existing literature for Android malware detection. A taxonomy of existing features and feature selection methods is presented, and the issues of creating features for malware detection are highlighted. The authors concluded that there is a need for further work that explores and refines well-discriminating features from the numerous features extracted from Android apps. Feizollah et al. [38] analyzed about 100 publications with respect to feature selection in Android malware detection. The authors reported that feature selection algorithms were not exhaustively investigated in prior articles.
The prior works either focus on non-malware related fields [19,30,31], compare approaches that transform the original features against approaches that do not [10], use classification methods that perform embedded selection of the feature subsets, such as random forests [41] and gradient boosting [39], or use the comparison to validate a newly proposed feature subset selection method [10]. As noted in [19], it is essential that the efficacy of feature subset selection is examined and verified in different situations and on different platforms, such as Android malware classification. Also, classification methods that perform embedded selection of the feature subsets cannot explicitly quantify the influence of the feature selection methods. There is limited research comprehensively analysing feature selection methods for the Android platform that avoids the above shortcomings. In this paper, we avoid classification algorithms with in-built feature subset selection. Also, we focus on filter-based feature subset selection algorithms, since this class of feature selection methods has lower computational complexity than the wrapper-based approaches. Moreover, research on malware detection on the Android platform mostly deploys filter-based methods to select the subset of features for training the learning algorithms [7,8,18,26,42]. Further, filter-based methods assess the suitability of features solely on statistical criteria and can thus be used in conjunction with any learning model.
5. Feature Subset Selection Methods
Both the number and the quality of the features used to train models to classify Android apps as benign or malware with respectable accuracy are paramount. To this end, feature selection is used to weed out irrelevant, redundant, constant, duplicated, and correlated features from the raw features before training the models. A variety of filter methods for selecting the best features have been widely deployed in Android malware detection frameworks. In this section, we present detailed descriptions of the filter feature selection algorithms.
The basic tenet of filter-based feature subset selection algorithms is that features that have a high correlation with the target are considered useful for enhancing the training of the learning algorithms and subsequently improving the classification performance. Generally, filter methods are classified as univariate or multivariate. Univariate filter methods assess and rank a single feature at a time independently (i.e., correlation among the features is not evaluated); multivariate filter methods assess the entire feature space while considering the relationships between the features. Multivariate filter methods are therefore able to handle duplicated, redundant, and correlated features. In contrast, univariate methods do not consider the relationships between features and are thus unable to handle duplicated, redundant, and correlated features. Each filter method uses a distinct performance parameter to rank the features.
Generally, filter methods follow the typical scenario described in Figure 2. Given a set of $m$ features, $F = \{f_1, f_2, \ldots, f_m\}$, and the class label $L$ (as a target), the filter methods rank the $m$ features based on certain criteria according to their degree of relevance to the class label, and return the ranked features. Note that the features are judged solely on their intrinsic characteristics in relation to the target, either individually or taking into account the statistical relationships between the features. A variety of scoring functions are used to differentiate informative and discriminating features from less significant ones. Normally, filter methods perform statistical analysis such as correlation analysis or mutual information to assess and rank the features. The top-ranked features whose scores equal or exceed a threshold value are returned as the most suitable features while the rest are discarded.
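This scenario maps directly onto scikit-learn's SelectKBest interface, as in the sketch below. The scoring function is interchangeable (chi2, f_classif, and mutual_info_classif correspond to criteria described in the following subsections), and the synthetic data and the cut-off k = 20 are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 100))   # 0/1 permission-style features
y = rng.integers(0, 2, size=500)          # 0 = benign, 1 = malware

# Score every feature against the label, rank, and keep the top k.
selector = SelectKBest(score_func=chi2, k=20).fit(X, y)
ranking = np.argsort(selector.scores_)[::-1]  # feature indices, best first
X_selected = selector.transform(X)            # matrix reduced to 20 columns
```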
Although there are many filter-based feature selection methods, we consider only the subclass of filter methods that do not perform feature transformation. In other words, the selected features have not gone through any transformation and keep the semantics of the original features. Also, note that the features are selected independently of any learning algorithm.
Table 2 compares the filter-based feature selection methods discussed in this paper in terms of univariate (UV) or multivariate (MV) operation, the ranking criterion used, the relationship between the feature and the target, and the feature type supported. The Para column indicates whether a method is parametric (Y) or non-parametric (N). The filter methods discussed here are capable of handling data sets with either numeric or categorical features. They can also be univariate (e.g., Pearson correlation coefficient) or multivariate (e.g., mutual information). The main difference between these two classes of filter methods is that the univariate (UV) filter methods do not consider interactions between the features whereas the multivariate (MV) methods do.
5.1. Pearson Correlation Coefficient
The Pearson correlation coefficient is used in many Android malware detection methods [29,35]. It uses a statistical measure (i.e., the linear correlation coefficient $r$) to determine the strength of the correlation between a pair of variables $X$ and $Y$. The linear correlation coefficient $r$ can be computed as follows [29]:

$$ r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \quad (5) $$

where $n$ is the sample size, $x_i$ and $y_i$ are the $i$th data values, and $\bar{x}$, $\bar{y}$ are the mean values. Equation (5) yields $r \in [-1, 1]$, where a value of $|r|$ close to 1 indicates that the correlation coefficient between a feature and the target class is high enough and the feature is selected. In contrast, if $r$ is close to 0, the correlation coefficient is low, and the feature can be dropped.
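A vectorized sketch of Equation (5), applied feature-by-feature against the class label, is shown below; the selection threshold of 0.1 and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def pearson_scores(X, y):
    """Correlation coefficient r (Equation (5)) of each column of X with y."""
    Xc = X - X.mean(axis=0)                 # x_i - x_bar, per feature
    yc = y - y.mean()                       # y_i - y_bar
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return num / den

rng = np.random.default_rng(0)
X = rng.random((300, 40))
y = rng.integers(0, 2, size=300).astype(float)
r = pearson_scores(X, y)
selected = np.flatnonzero(np.abs(r) >= 0.1)  # |r| near 1 => keep the feature
```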
5.2. Chi-Square
Chi-square is among the common feature subset selection algorithms used for malware classification on the Android platform [8,36]. The Chi-square statistic ($\chi^2$) is used to determine whether the occurrence of a specific feature and the occurrence of a specific class label are independent of each other. Formally, $\chi^2$ for a given feature $f$ is computed as follows [35]:

$$ \chi^2(f) = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \quad (6) $$

where $O_{ij}$ is the observed frequency of feature value $i$ occurring with class label $j$, and $E_{ij}$ is the frequency expected if the feature and the class were independent. Normally, the features are ranked in descending order following the calculation of $\chi^2$ for each feature. The higher the score, the more the feature and the class are considered dependent; features that are highly dependent on the occurrence of the class label are considered good candidates, while those that are independent are considered non-informative for classification purposes and can thus be dropped.
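The sketch below computes Equation (6) for a single binary feature from its 2x2 observed/expected contingency table; in practice, a library scorer such as scikit-learn's chi2 (used in the pipeline sketch above) ranks all features at once. The data here is synthetic and illustrative.

```python
import numpy as np

def chi2_score(f, y):
    """Chi-squared statistic (Equation (6)) for binary feature f vs labels y."""
    # Observed counts O_ij: feature value i together with class label j.
    O = np.array([[np.sum((f == i) & (y == j)) for j in (0, 1)]
                  for i in (0, 1)])
    # Expected counts E_ij under independence: row total * column total / n.
    E = O.sum(axis=1, keepdims=True) * O.sum(axis=0) / O.sum()
    return ((O - E) ** 2 / E).sum()

rng = np.random.default_rng(0)
f = rng.integers(0, 2, size=500)   # one permission-style feature
y = rng.integers(0, 2, size=500)
print(chi2_score(f, y))            # near 0 here: f and y are independent
```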
5.3. Analysis of Variance (ANOVA)
The analysis of variance (ANOVA) is used in several studies as a feature selection method [7,8,10]. For a feature $f$, the degree of dependency with the class label is estimated and the feature is given a score based on the $F$-statistic. The $F$-score is computed as the ratio of the between-group variance to the within-group variance as follows:

$$ F = \frac{\mathrm{SSB}/(k-1)}{\mathrm{SSW}/(n-k)} \quad (7) $$

The variation between sample means (SSB) and the variation within samples (SSW) are expressed as follows:

$$ \mathrm{SSB} = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2 \quad (8) $$

$$ \mathrm{SSW} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2 \quad (9) $$

where $\bar{x}$ is the overall sample mean, $\bar{x}_j$ is the mean of class $j$, $k$ represents the total number of class labels, and $n_j$ is the number of samples with class label $j$. The features are ranked in descending order of their $F$-scores and the top features that meet the selection criterion are identified for further use.
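scikit-learn's f_classif computes this between/within-group variance ratio for every feature at once, as in the minimal sketch below; the synthetic data and the top-15 cut-off are assumptions.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
X = rng.random((400, 60))              # numeric features
y = rng.integers(0, 2, size=400)       # class labels define the groups

f_scores, p_values = f_classif(X, y)   # Equation (7), one F per feature
top = np.argsort(f_scores)[::-1][:15]  # largest F => most discriminative
```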
5.4. Information Gain
Given a feature $F$ and a class label $L$ (malware, benign), the information gain ($IG$) quantifies the amount of information feature $F$ contributes to the accurate prediction of the class $L$. Formally, $IG(L, F)$ for a given feature $F$ and class label $L$ is expressed as follows [35]:

$$ IG(L, F) = H(L) - H(L \mid F) \quad (10) $$

where $H(L)$ represents the prior uncertainty of $L$ (i.e., the entropy of the class label) and $H(L \mid F)$ denotes the expected posterior uncertainty, which are computed as follows:

$$ H(L) = -\sum_{l \in L} P(l) \log_2 P(l) \quad (11) $$

$$ H(L \mid F) = -\sum_{f \in F} P(f) \sum_{l \in L} P(l \mid f) \log_2 P(l \mid f) \quad (12) $$

where $P(l)$ and $P(l \mid f)$ refer to the prior probability of $l$ and the posterior probability of $l$ given feature value $f$, respectively. Equations (11) and (12) give the entropy of $L$ before and after observing $F$. The features are ranked in descending order of their $IG$ values and the top features that meet the selection criterion are identified for further use. Normally, features with a high $IG$ value are taken to be relevant, whereas those with a lower $IG$ value are considered not useful.
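The following is a direct sketch of Equations (10)-(12) for a discrete feature: the label entropy conditioned on each feature value, weighted by P(f), is subtracted from the unconditional label entropy. The function names and synthetic data are illustrative assumptions.

```python
import numpy as np

def entropy(labels):
    """H(L) as in Equation (11), with log base 2."""
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def info_gain(f, y):
    """IG(L, F) = H(L) - H(L|F), per Equations (10)-(12)."""
    ig = entropy(y)
    for v in np.unique(f):
        mask = f == v
        ig -= mask.mean() * entropy(y[mask])   # P(f) * H(L | F = f)
    return ig

rng = np.random.default_rng(0)
f = rng.integers(0, 2, size=500)
y = rng.integers(0, 2, size=500)
print(info_gain(f, y))   # close to 0: an independent feature is uninformative
```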
5.5. Mutual Information
Mutual information (MI) can measure the relevance of specific permissions by capturing the correlation between a given feature $F$ and a class label $L$ based on how much information they share. Specifically, the mutual information between $F$ and $L$ can be measured using the Kullback-Leibler divergence as follows:

$$ MI(F, L) = \sum_{f \in F} \sum_{l \in L} P(f, l) \log \frac{P(f, l)}{P(f)\,P(l)} \quad (13) $$

where $P(f)$ is the marginal probability of feature value $f$, $P(l)$ is the marginal probability of class label $l$, and $P(f, l)$ is the joint probability of feature value $f$ occurring in class $l$. The features are ranked in descending order of their $MI$ values and the top features that meet the selection criterion are identified for further use. Basically, the larger the value of $MI(F, L)$, the stronger the relationship between $L$ and $F$; if $MI(F, L) = 0$, then $L$ and $F$ are said to be independent (i.e., they share no information).
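scikit-learn exposes this criterion as mutual_info_classif; a minimal sketch for discrete, permission-style features follows, with the synthetic data and the k = 20 cut-off as assumptions.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 100))   # discrete 0/1 features
y = rng.integers(0, 2, size=500)

mi = mutual_info_classif(X, y, discrete_features=True, random_state=0)
top = np.argsort(mi)[::-1][:20]           # largest MI(F, L) first
```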
8. Conclusions and Future Work
The detection of Android malware is a complex process that requires selecting a subset of discriminative features from the original samples. This study examined the feature selection methods commonly used for malware detection on the Android platform with the goal of finding out the order in which they select the subset features, the predictive performance of the selected feature subsets on the classifiers, the similarities between the methods in terms of feature ranking, and the influence of varied feature length on classification accuracy. The study revealed that a few of the permissions are popular among both malware and benign apps. The results also show that, on average, all feature selection methods performed better than using the extracted features without filtering. The chi-squared and information gain approaches tend to do well compared to the others; however, these methods achieve high accuracy with some classifiers and perform poorly with others in the analysis. While filter-based feature subset selection techniques are computationally efficient, they fail to handle issues such as multicollinearity, which affects the accuracy of the filter methods: a feature determined to be irrelevant on its own by a feature selection method may be a significant predictor when combined with other features. Filter methods may miss such features since they normally do not address multicollinearity automatically.
We plan to extend this work in several directions. First, we plan to examine the scalability of the feature selection methods using very large datasets. Also, as Android malware apps are becoming increasingly sophisticated, our future work will focus on characterizing malware behaviours from different aspects. The current study shows that a few permissions are popular among both malware and benign apps; therefore, our future work will look at the implication of filtering out these common permissions for the detection accuracy of the classifiers. Also, the present study is based solely on a single component of the Android system (i.e., the permission feature). As a single feature type may not be able to detect malware with sufficient accuracy, our future work will examine the effect of combining the permissions with other features on the accuracy of the malware detection models. In addition, the correlation between the requested permissions and the actually utilized permissions, with a view to more precisely revealing the behavioural patterns of malware apps, will be another line of future work. Including other features such as API calls will also be studied in the future. Another direction of future work is to examine how the characteristics of the data and the machine learning models used favour one feature selection method over the others. The theoretical aspects of feature selection are also planned to be tackled in the future.