1. Introduction
Android is an open-source system framework that has become one of the most popular mobile ecosystems. As mobile devices become more popular in daily life every year, more researchers are paying attention to the security of the Android ecosystem [1]. Various malware detection approaches have been proposed in our community.
Malware attempts to control the user’s system without authorization, steal personal information, encrypt important files, or cause other damage. A malware family is composed of malware samples with common characteristics, which usually include the same code segments, patterns, application characteristics, and similar behaviors. The size of each malware family varies from just a few samples to tens of thousands. A low-resource malware family is a family with little data, not enough to train a malware detection model on its own. Although low-resource malware families have less data, they can still pose great software security risks. When we aim to train a classifier for a low-resource malware family, the family’s own training data are not sufficient to train a good classifier. Androzoo [2] currently contains 17,927,200 different APKs spanning hundreds of malware families. However, some malware families have fewer than 100 or 500 apks, which is not enough to train a good malware classifier. For unpopular malware, we may never find enough samples even if we label all existing apks. At the same time, it is hard to obtain sufficient labels in time as malware continuously evolves.
There are some widely used malware datasets in which many malware families have only a few samples. For example, the MalGenome [3] dataset covers 49 families, and each family contains 1 to 309 malware samples. The top three families occupy roughly 70% of the overall dataset, while over 30 families have fewer than 10 samples. This distribution suggests that, as long as a detection approach successfully detects the top families, the overall result will look good enough; the malware families with few samples are simply ignored. If we directly use all the training data from other malware families, most malware detection models may not be robust enough to transfer between different malware families. To detect low-resource malware families, Tran et al. first used prototype learning to create prototype representations for the target malware family, and used a twin network to classify malware [4]. Subsequent work further improved the generation of prototype representations [5], or improved the training of twin networks using meta-learning [6,7,8] and contrastive loss functions [9]. Kamaci et al. [10] established novel distance concepts to measure the relative difference between two objects. Alsboui et al. [11] proposed a graph-based dynamic multi-mobile-agent itinerary planning method to cover all nodes in a network. However, these methods use only a small number of samples related to the target low-resource malware family for model training, ignoring the relationship between the target malware family and the large number of existing malware families. In this paper, we seek to study the relatedness of malware families and leverage it to improve the malware detection performance of low-resource families. Our work focuses on three research questions:
First, could using different malware families as training data help the detection of a target low-resource malware family? Intuitively, training with similar malware tends to transfer better to the low-resource malware family, while dissimilar malware may even harm the performance. We propose to measure the similarity with empirical experiments. Specifically, we train a malware detection model with one family A and test it on the target family B, and define the test performance as the supportive score from A to B. Our work shows that the transfer performance varies greatly between different malware families, and we could achieve good performance by selecting the family with the biggest supportive score.
Second, we further study whether it is more helpful to use multiple malware families as the training set. We found that if we neglect the differences between distinct malware families and train the model on all families in the available training data, the malware detection performance may even be worse than selecting only the single most supportive malware family. We propose a Sequential Family Selection (SFS) algorithm to carefully select multiple families as the training set. Our algorithm can easily be adapted to any detection model. We conduct experiments to validate its performance on 16 malware families and four representative detection models: csbd, drebin, mamadroid, and droidsieve. Our results show that SFS improves the performance of all the malware detection models. We also evaluate the performance on datasets from a future time period, and SFS still achieves better performance.
Third, we try to understand why the supportive score between some malware families is higher, which indicates better transfer performance. We hypothesize that this is because of similar characteristics between different malware families. We study two popular characteristics: whether malware steals user data and whether it displays advertisements. We found that malware with the same characteristics tends to have high supportive scores. Most supportive relations are the same across different malware detection algorithms, while some vary between detection models. Our work makes three contributions:
We make the first systematic study of the relatedness between malware families. We propose to measure the malware family similarity with an empirical supportive score and find it is the key to good transfer performance.
We propose a new Sequential Family Selection algorithm to target low-resource families and validate it on 16 families and four different malware detection models. Our results show that combining our algorithm with malware detection models based on machine learning and deep learning methods can greatly improve malware detection performance.
We study the relationship between knowledge transfer performance and malware characteristics. We found that the success of knowledge transfer is essentially due to similar behavioral characteristics between different malware families.
2. Related Work
In this section, we first overview the common Android malware detection methods. Then, we discuss low-resource malware detection.
2.1. Android Malware Detection Based on Machine Learning
Many researchers have studied various malware detection methods. Such methods are emerging rapidly, including methods based on static features [12,13,14,15,16] and methods based on dynamic features [17,18,19,20,21]. Static features of applications include API calls, permissions, opcodes, etc.; these are extracted by analyzing the structure of applications. Dynamic features include system calls, behavior characteristics, network traffic, etc. [22]; these are extracted while the application is running. Mudflow [23] uses the flows between APIs as malware features to detect malware. Deep4maldroid [24] leverages the constructed graph to train malware detection models. DroidAPIMiner [25] provides a lightweight malware classifier by conducting a thorough analysis of apks at the API level. However, these Android malware detection methods are more concerned with the performance of the algorithm on the overall dataset and ignore the detection performance on low-resource malware families.
2.2. Low-Resource Malware Detection Based on Machine Learning
Several researchers have improved models to enhance the detection of low-resource malware families. In 2019, Tran et al. used prototype learning to create prototype representations for target malware families and used twin networks for malware classification [4]. Subsequent improvements fall into two main directions. First, to further improve the generation of prototype representations, Chai et al. used dynamic prototype networks [5] and Tang et al. used multilayer convolutional neural networks [6]. Second, to better train the network, Bai et al. used a contrastive loss function to better train the twin network [9]. In addition, Tran et al. used meta-learning to train memory neural networks for malware family classification [26]. However, all these methods use only samples of the target malware family for model training and prediction, ignoring the data samples of malware families related to the target family.
To address the problem of low-resource malware detection, some other researchers have increased the data samples of low-resource malware families by generating new data [27,28,29]. Zahra et al. used generative adversarial networks to generate new sample signatures of malware [30]. Chen et al. proposed a malware detection model called Adv4Mal, which generates new data based on specific signatures of malware to supplement the training data of the low-resource malware family [31]. These methods use artificially constructed data, while this paper uses real data related to the target malware family.
Table 1 shows the differences between these related works on low-resource malware detection based on machine learning. This paper proposes a sequential family selection algorithm that does not require generating any forged new data, but rather supplements the existing data of low-resource malware families. Meanwhile, the sequential family selection algorithm uses knowledge transfer among malware families to select relevant real data samples to improve detection performance on low-resource malware families. In principle, the sequential family selection algorithm can be combined with the above malware detection methods and further improve their detection performance on low-resource malware families.
3. Methodology
In this section, we first introduce how to measure the similarity between different malware families. We can achieve good transfer performance on low-resource families by selecting the most similar malware family. Then we introduce a Sequential Family Selection (SFS) algorithm to select multiple families as training data and achieve better performance.
3.1. Malware Family Similarity
We notice that malware detection performance differs significantly depending on which malware family is in the training set, even when using the same malware detection algorithm. Moreover, the impact of different families varies across malware detection methods. It is therefore interesting to explore the similarity between malware families. There are two general ways to measure this similarity: malware characteristics and empirical metrics.
Researchers obtain the characteristics of mobile applications through dynamic and static analysis, including opcodes, API calls, behavior characteristics, etc. Malware families with similar characteristics are more likely to transfer knowledge to each other. Determining the similarity between malware families based on features is interpretable. However, it is hard to define all the characteristics of Android applications, especially for rarely studied low-resource malware families. Furthermore, it is usually time-consuming to analyze malware applications. We therefore propose an empirical supportive score to measure the transfer quality. Specifically, we train a malware detection model with one family A and test it on the target family B, and define the test performance as the supportive score from A to B. To achieve good performance on a given test set, we could select the family with the biggest supportive score as training data.
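The supportive score defined above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the classifier and the synthetic feature matrices are stand-ins for the paper's detection models (csbd, drebin, etc.) and real apk features.

```python
# Hedged sketch of the empirical supportive score: train a detection model
# on one family's data and report its accuracy on the target family's data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def supportive_score(clf, X_src, y_src, X_tgt, y_tgt):
    """Accuracy on the target family after training only on the source family."""
    clf.fit(X_src, y_src)
    return accuracy_score(y_tgt, clf.predict(X_tgt))

# Toy synthetic data: both "families" are separable on the first feature,
# so knowledge learned on the source transfers to the target.
rng = np.random.default_rng(0)
X_src = rng.normal(size=(200, 8)); y_src = (X_src[:, 0] > 0).astype(int)
X_tgt = rng.normal(size=(100, 8)); y_tgt = (X_tgt[:, 0] > 0).astype(int)
score = supportive_score(LogisticRegression(), X_src, y_src, X_tgt, y_tgt)
```

Because the score is just a held-out accuracy, it can be computed for any pair of families and any detection model without extra expert analysis.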
This quantitative relationship between malware families directly corresponds to malware detection performance and can help to improve it. We further explore the relationship between the characteristics of malware families and the supportive score, and find that the supportive score is highly correlated with human-summarized characteristics.
3.2. Sequential Family Selection (SFS) Algorithm
In this section, we first formalize the problem. Then we describe two baselines: training with the single most supportive family only, and training with all the malware families. We further propose a new malware family selection algorithm to carefully select multiple families as the training set. Finally, we compare the performance of these malware detection methods and validate our research questions. Formally, we target the detection performance on a target malware family T, which has a validation dataset and a test dataset. The training data include a set of malware families F = {F_1, ..., F_n}, and each malware family F_i corresponds to a training dataset D_i, which contains malware from the corresponding family and randomly sampled benign samples. The benign samples of different groups of training data do not overlap. We have two baseline settings for this problem. The first is to train on all the malware families in the training data and test on the target malware family; this is equivalent to neglecting the differences between malware families. The second is to find the single training family that is most supportive of the target malware family. We empirically calculate the supportive score from malware family F_i to the target family T by training on D_i and testing on T.
To achieve better performance, we propose the Sequential Family Selection (SFS) algorithm. SFS’s goal is to select a subset of the malware families in the training data. SFS starts from an empty set and selects families one by one. In each step, we try to combine each candidate family with the already selected families and evaluate on the target validation set. We select the family with the best performance, i.e., with the biggest supportive score, and add it to the selected set. Formally, we initialize the selected family set S as the empty set, and all the families in the training data form the candidate set. In the first step, we train on each malware family in the candidate set separately and evaluate on the target malware family T. We add the malware family with the best performance into S and remove it from the candidate set. Secondly, we combine S with each of the remaining malware families in the candidate set separately and evaluate on T. We select the best combination and add the corresponding malware family into S. In the following steps, we iteratively repeat this process, combining the candidate families with S and adding the best family into S. The algorithm terminates when all the families have been added into S. Finally, we choose the combination with the best performance over the whole selection history and return it as our final selection. The whole procedure is given in Algorithm 1. Our algorithm is independent of the malware detection model and can improve the performance of any model.
Algorithm 1 Sequential Family Selection algorithm (SFS).
Input: A malware classifier C. A target malware family T, which includes a validation set V and a test set E. A set of training malware families F = {F_1, ..., F_n}, where each F_i corresponds to a set of training samples D_i.
Output: A subset of F.
1: selected training set S = ∅; best training set S* = ∅; best validation performance p* = 0
2: while F is not empty do
3:     f* = null; p_round = 0
4:     for F_i in F do
5:         current training set D = (⋃_{F_j ∈ S} D_j) ∪ D_i
6:         train classifier C with D, obtaining model M
7:         test M on V, obtaining performance p
8:         if p > p_round then
9:             p_round = p
10:            f* = F_i
11:    S = S ∪ {f*}
12:    F = F \ {f*}
13:    if p_round > p* then
14:        p* = p_round
15:        S* = S
16: return S*
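The greedy selection loop above can be sketched in a few lines of Python. This is a hedged illustration: `train_and_eval` stands in for "train the classifier on the union of the selected families' data and report validation accuracy on the target family", and is mocked here with a made-up score table.

```python
# Hedged sketch of the Sequential Family Selection (SFS) algorithm.
def sfs(families, train_and_eval):
    """Greedily add the family that most improves validation performance;
    return the best subset seen over the whole selection history."""
    selected, best_set, best_perf = [], [], 0.0
    candidates = list(families)
    while candidates:
        round_best, round_perf = None, -1.0
        for fam in candidates:
            perf = train_and_eval(selected + [fam])
            if perf > round_perf:
                round_best, round_perf = fam, perf
        selected.append(round_best)
        candidates.remove(round_best)
        if round_perf > best_perf:
            best_perf, best_set = round_perf, list(selected)
    return best_set, best_perf

# Mock evaluator with purely illustrative numbers: 'leadbolt' alone helps,
# adding 'gingermaster' helps more, and any other combination degrades.
scores = {("leadbolt",): 0.80, ("gingermaster", "leadbolt"): 0.90}
def mock_eval(subset):
    return scores.get(tuple(sorted(subset)), 0.5)

best_set, best_perf = sfs(["gingermaster", "leadbolt", "revmob"], mock_eval)
```

Note that the loop runs until every family has been tried, but the returned subset is the best one seen at any point, so unhelpful families added in late rounds are discarded.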
4. Experimental Setup
4.1. Malware Detection Approaches
Our method is designed to benefit malware detection by studying the relationship between malware families. As for the malware detection approaches, we apply four popular methods: csbd, drebin, mamadroid, and droidsieve.
Csbd [19] extracts the control flow graph as the feature for malware detection. It was first proposed by Kevin Allix from the University of Luxembourg. This method performs static analysis on the application bytecode to extract the control flow graph, takes the basic blocks of the control flow graph as features of the application, and uses machine learning classification algorithms to classify it. Drebin [20] uses multiple static analysis approaches to extract multiple features of the application from the disassembled code and the manifest file as features for Android malware detection. It is a lightweight detection method for Android malware, using an SVM to automate the classification of applications. Mamadroid [21] is a malware detection model based on application behavior. This method extracts and abstracts the sequence of calls between APIs in an application, constructs feature vectors based on Markov chains, and uses different machine learning classification algorithms to classify applications. Droidsieve [22] classifies malware using obfuscation-invariant features and the artifacts introduced by the obfuscation mechanisms used in malware attacks.
4.2. Malware Corpus
To prepare for our study on malware detection, we collect a set of Android applications from Androzoo, an open Android dataset collection project [2]. The Android apks from Androzoo were obtained from various app markets. As VirusTotal is broadly utilized for Android malware labeling, we use its scan results to resolve the labels of the collected Android apks. VirusTotal uses more than 70 anti-virus scanners and URL/domain block list services to check items [32]. An Android apk is labeled as benign when no engine in VirusTotal marks it as positive. To ensure the reliability of the collected dataset, we label an apk as malware when at least five engines in VirusTotal label it as malicious. We use Euphony [33] to acquire the malware family information of each apk in the collected dataset. We collect a dataset containing over 20,000 malware samples with family information and 20,000 benign applications, spanning 2015 to 2016. Note that we construct different groups of training and test sets targeting the diverse cases considered in this paper.
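The labeling rule described above can be stated compactly. This is a minimal sketch of the thresholds, not VirusTotal's API; `positives` stands for the number of engines reporting the apk as malicious in a scan report.

```python
# Hedged sketch of the labeling rule: benign when zero VirusTotal engines
# flag the apk, malware when at least five do, and excluded otherwise.
def label_apk(positives: int) -> str:
    if positives == 0:
        return "benign"
    if positives >= 5:
        return "malware"
    return "discard"  # 1-4 detections: too ambiguous to label reliably

labels = [label_apk(n) for n in (0, 3, 7)]
```

The gap between the two thresholds (1 to 4 detections) trades dataset size for label reliability, since single-engine detections are often false positives.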
4.3. Testing Dataset
Many different malware families exist in our community. We randomly pick 16 families from the malware corpus for our study. To ensure the consistency of our experiments, each malware family contains 500 samples. We then combine 500 benign applications with each malware family to construct the test set. The collected malware families belong to different categories with different malicious characteristics.
Table 2 shows the name and description of each selected malware family. Note that the malware family in the test set does not appear in the training dataset.
5. Results and Discussions
In this section, we illustrate the impact of different malware families in the training set on malware detection and obtain the supportive scores of the 16 malware families.
5.1. Dataset Construction
We construct 16 malware benchmarks to be used as the 16 test sets. Each test set consists of 1000 apks: 500 malware and 500 benign. The dataset used in this section is from 2015. The test set of each experiment contains only one malware family. To minimize the variability of the experimental results, we perform each set of experiments five times and use the average of the five results as the final outcome.
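The benchmark construction and averaging protocol can be sketched as follows. The sample ids and the `run_experiment` placeholder are illustrative; a real run would train and test one of the four detection models.

```python
# Hedged sketch of the evaluation protocol: each benchmark pairs 500
# samples of one malware family with 500 benign apks, and every experiment
# is repeated five times with the mean result reported.
import random
import statistics

def build_benchmark(family_samples, benign_samples, n=500, seed=0):
    rng = random.Random(seed)
    test_set = rng.sample(family_samples, n) + rng.sample(benign_samples, n)
    rng.shuffle(test_set)
    return test_set

def mean_of_runs(run_experiment, repeats=5):
    return statistics.mean(run_experiment() for _ in range(repeats))

# Toy ids: malware ids 0-999, benign ids 1000-1999.
benchmark = build_benchmark(list(range(1000)), list(range(1000, 2000)))
avg_accuracy = mean_of_runs(lambda: 0.8)  # placeholder experiment
```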
5.2. Supportive Score of Malware Families
We further analyze the impact on malware detection performance when using distinct malware families for training. Table 3, Table 4, Table 5 and Table 6 show the accuracy of our experiments for the different malware families in the training set. In Table 3, Table 4, Table 5 and Table 6, we bold the maximum value among the experimental groups that test the same malware family; this maximum is the supportive score of the corresponding malware family.
It is interesting to find that accurate malware detection is possible even when the malware families in the training and test sets are completely different. However, malware detection performance differs between the four algorithms. The detection accuracy is higher than 55% for 37.7% of the pairs in csbd, 79.1% of the pairs in drebin, 51.6% of the pairs in mamadroid, and 60.8% of the pairs in droidsieve. Only 25, 109, 36, and 72 groups of experiments (10.4%, 45.4%, 15%, and 30% in proportion) yield an accuracy of over 80%. The accuracy of malware detection is highly related to the malware family used in the training dataset. For example, when testing the malware family ginmaster with csbd, gingermaster works best and revmob performs worst, with a difference of 22.2% in terms of accuracy. A training set with leadbolt achieves an accuracy of 92.3% when detecting plankton with drebin, but only 48.7% when using admogo as the training set.
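Finding the bolded entries described above is a per-column maximum over the accuracy table. The sketch below uses a made-up 3x3 table for illustration; the real tables are 16x16, one per detection model.

```python
# Hedged sketch: for each target family (a column of the accuracy table),
# the most supportive training family is the row with the maximum accuracy.
import numpy as np

families = ["ginmaster", "gingermaster", "revmob"]
# accuracy[i, j]: train on families[i], test on families[j] (made-up numbers)
accuracy = np.array([[0.00, 0.85, 0.60],
                     [0.88, 0.00, 0.62],
                     [0.66, 0.63, 0.00]])
most_supportive = {families[j]: families[int(np.argmax(accuracy[:, j]))]
                   for j in range(len(families))}
```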
5.3. Low-Resource Malware Family Detection
In this section, we investigate the performance of our algorithm in the case of low-resource malware detection. We apply the SFS algorithm to four malware detection methods: csbd, drebin, mamadroid, and droidsieve. Each malware detection method is evaluated in 16 sets of experiments.
Low-resource malware families are families that have some data, but not enough to train a malware detection model alone. A target malware family with only 10 samples in the training set can be considered a low-resource malware family. The other malware families in each group of experiments have 500 samples each. We first conduct the experiment using data from 2015.
For a detailed discussion, we illustrate the SFS process for detecting plankton using drebin. The leftmost part of Figure 1 shows the first step of the SFS algorithm, where we train on each of the other 15 families separately. It can be seen that training on leadbolt performs better than any other malware family, so we take leadbolt as the base set for the next iteration of SFS. The remaining malware families are then each combined with the base set leadbolt. The second part of Figure 1 shows that adding gingermaster to leadbolt performs best. The combination of leadbolt and gingermaster is taken as the new training dataset for the next round. This is repeated until all the malware families have been added to the training set. The best solution for this example is “leadbolt + gingermaster + ginmaster + utchi + wapsx + mulad + droidkungfu”.
To ensure the validity of the experiments, whenever a number of malware samples is added in a round, we combine them with the same number of benign samples. We also perform each experiment five times to minimize the error of the experimental results, and report the average of the five runs. We compare the best malware detection performance when training with only one family, the performance when training with all families, and the performance when using our algorithm.
Table 7 shows the comparison of malware detection performance between the three cases. It can be seen that the SFS algorithm outperforms the baselines when applied to all four malware detection methods. In particular, the detection of some low-resource malware families can be improved by over 10% in terms of accuracy using our SFS algorithm, such as the detection of droidkungfu with csbd and the detection of umeng with drebin. The malware detection performance of different detection algorithms varies with the malware families in the dataset. For example, the detection accuracy of droidsieve is greater than 90% in 13 of the 16 groups of experiments. With our SFS algorithm, the detection results using droidsieve for artemis and ginmaster are improved by 6.4% and 8.12%, respectively. We also compare the average detection accuracy over the 16 malware families; the performance of every malware detection method improves with the SFS algorithm. Mamadroid with the SFS algorithm shows the highest improvement, reaching 6.73%.
Android mobile applications are constantly evolving with the rapid development of technology [52]. New malware is constantly updated to evade detection. As a result, malware detection algorithms that achieve very good results in one year may fail to classify new malware produced the next year.
To explore the sustainability of the SFS algorithm, we use the model trained on data collected in 2015 to test data from 2016. That is, we train a malware detection classifier on outdated data, use it to detect future malware samples, and observe the resulting detection performance to evaluate the sustainability of the SFS algorithm.
It can be seen from Table 8 that the SFS algorithm does not work well for some malware families, such as mulad and umeng, when trained on outdated data, but it still detects most low-resource malware families better. Table 8 also shows that the SFS algorithm performs the best on average, suggesting that it can support the sustainability of malware detection algorithms. On average, combining the SFS algorithm with csbd, drebin, mamadroid, and droidsieve improves their detection accuracy by 4.11%, 0.76%, 3.36%, and 4.26%, respectively. For a specific malware family and detection algorithm, the SFS algorithm can greatly improve performance. For example, with csbd trained on data from 2015, the SFS algorithm still improves the detection accuracy on the malware family droidkungfu in 2016 by 13.32%.
5.4. Zero-Resource Malware Family Detection
In this section, we further investigate the most extreme case of low-resource malware family detection: zero-resource malware family detection. Zero-resource detection means that the malware detection model has never seen the malware family; that is, the zero-resource malware family is not included in the training dataset.
Table 9 compares malware detection performance when the malware family to be detected does not exist in the training dataset. It shows that the SFS algorithm performs best in all of these experiments.
In 60.4% of the experiments, training on all malware families performs worse than training based on SFS. This shows that we cannot simply improve detection performance on the target family by adding more, unrelated malware data. Using all malware families for training is equivalent to ignoring the differences between malware families in the hope of detecting all malware with one model; our results show this is a bad solution for malware detection.
The performance of SFS is higher than both baselines in all of the experiments, which shows its effectiveness: carefully selecting multiple malware families is better than selecting only the single most supportive family.
Figure 2 shows the performance for plankton in terms of accuracy. The size of the training set for the four malware detection methods in each experiment is the same. The horizontal axis indicates the name of the malware family whose addition achieves the highest accuracy at each step. We can see that the detection accuracy for plankton peaks in the middle of the SFS process.
6. Relationship between Malware Families
Our results show that low-resource malware detection can be supported by training with different malware families. However, the performance varies between malware families. Therefore, we further investigate under what circumstances one family can support low-resource malware detection. We hypothesize that knowledge transfers between different malware families because they have similar characteristics. In this section, we study two popular characteristics: whether malware steals user data and whether malware displays advertisements to mobile users.
6.1. Malware Categories
Steal Data. Some malware families steal user information from the device. The report [53] points out that 60.7% of the applications collected the Android ID and other unique device identifiers, 55.4% collected the list of installed applications, and 13.7% collected clipboard information. Such information can be used for user profiling, personalized push notifications, and other business purposes, and its sensitivity is relatively high.
Display Advertisements. Some of these malware families are categorized as adware: malicious applications that put unwanted ads on users’ screens, especially when accessing web services. Adware lures users to view ads that offer lucrative products and entices them to click on the ads; once the user clicks, the developer of the unwanted application generates revenue. Common examples of adware include weight-loss programs that promise quick money and on-screen warnings about fake viruses.
6.2. Analysis
Table 10 shows the special characteristics of the 16 malware families. We label the malware families with the characteristic of stealing user data or displaying advertisements as ‘∘’ and those without as ‘×’. To better show the relationship between malware detection performance and malware characteristics, we further apply t-SNE to map the malware detection performance in Table 8 into two-dimensional vectors. The results are shown in Figure 3 for all four malware detection models.
In Figure 3, we find that (1) most of the malware families with the same characteristics are close to each other for all four malware classification models. For example, the malware families that steal user data, such as ginmaster, gingermaster, and nandrobox, are close to each other, as are plankton and leadbolt. The malware families that display ads, such as waps, wapsx, and umeng, are always close to each other. This shows that most of the similarities can be captured by all the malware detection models. (2) Malware relatedness differs slightly between detection algorithms. For example, for both csbd and mamadroid, mulad is far from the ads cluster, whereas in drebin and droidsieve, mulad is close to the ads cluster, while admogo is the opposite. This may be because drebin and droidsieve are not good at capturing the corresponding similar characteristics between admogo and other ad malware.
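The embedding behind Figure 3 can be sketched as follows. This is a hedged illustration: each family is represented by its row of cross-family detection accuracies (as in Table 8) and mapped to 2-D with t-SNE; the matrix here is random stand-in data, not the paper's actual results.

```python
# Hedged sketch: embed each malware family's supportive-score profile into
# 2-D with t-SNE, so families with similar profiles land close together.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
accuracy_table = rng.uniform(0.4, 1.0, size=(16, 16))  # 16 families
embedding = TSNE(n_components=2, perplexity=5, init="random",
                 random_state=0).fit_transform(accuracy_table)
```

With only 16 points, the perplexity must be kept small (it has to be below the number of samples), which is why it is set to 5 here.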
The results show that most of the supportive scores match our human knowledge. A higher supportive score means the characteristics of different malware are more similar. If the target malware family uses similar technology or has similar targets to the training family, our model could leverage this knowledge and achieve good results.
Although summarizing common characteristics can improve human understanding of the knowledge transfer, an empirical supportive score is a better way if we only aim to improve low-resource malware family performance. There are two reasons: (1) A specific malware detection model may not capture the common characteristics even if they exist. For example, csbd and mamadroid show that mulad is similar to the ads cluster while drebin and droidsieve do not. In this case, it is better to use a different family for the drebin or droidsieve algorithm. (2) Summarizing all characteristics requires a huge effort from experts, and experts may have little interest in studying low-resource malware families because they often have less impact. In contrast, an empirical supportive score is much easier to obtain and only costs computing resources.
7. Conclusions and Future Work
Our work studies cross-family knowledge transfer for low-resource malware family detection. We quantify the knowledge transfer ability between malware families with supportive scores. We propose the Sequential Family Selection algorithm, which, based on the supportive scores of different malware families, selects multiple malware families related to the target family to support low-resource malware family detection. The experiments show that the Sequential Family Selection algorithm can improve the performance of machine-learning-based malware detection models on low-resource malware families. The research in this paper demonstrates that cross-family knowledge transfer can effectively improve the detection performance of low-resource malware. Furthermore, by analyzing the two behavioral characteristics of stealing user data and displaying advertisements, we find that knowledge transfer between different malware families is due to their common characteristics.
In future work, we plan to set different weights for each malware family in the training dataset, based on their contribution to the target malware family’s detection. Each target malware family can then select specific families to improve its detection performance. New knowledge transfer methods can also be explored to achieve better results for low-resource malware detection.