1. Introduction
With the development of information technology, the security of cyberspace, which is threatened mostly by malware, has become more important than ever. According to the malware report released by AV-TEST, the amount of malware has grown at a tremendous rate, by an average of 50 million per year over the past decade [1]. In the face of such an abundance of malicious software, the private data of every internet user is under significant threat. Therefore, effective malware detection approaches are crucial for protecting this private information and data.
In past decades, the mainstream malware detection methods proposed by researchers have fallen into two categories: static analysis and dynamic analysis. Static analysis mostly focuses on signature-based analysis, which scans the binary byte stream for features such as printable strings and n-grams to create a signature that is unique enough to represent the whole file [2,3]. Static analysis is fast, but it is easily evaded by malware that uses techniques such as code obfuscation [4]. It also struggles to detect new malware, which may use a zero-day attack [5]. In contrast, dynamic analysis is more robust to malware that uses evasion techniques [6] because it monitors the behavior of a target sample in an isolated environment, such as IntelligentMonitor proposed in [7]. The captured behavior information often contains API calls, network activities, memory usage, etc., all of which can serve as valuable reference material for judging whether a sample is malware. For example, Herrera-Silva et al. [8] collected behavior reports of ransomware and selected features of interest for ransomware detection, including file, PID, network, and more. Because it relies on such behavior information, dynamic analysis achieves a higher detection rate and precision.
The system API call sequence is the most representative of all the captured behavior information; it covers file manipulation, network access, registry key modification, etc. [9]. An API sequence consists of API calls and their parameters, so text processing methods such as N-grams, which are commonly used by researchers, can be applied to API sequence analysis.
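As a minimal illustration of how N-grams apply to an API sequence, the sketch below counts bigrams over a list of call names. The trace and the helper name `api_ngrams` are hypothetical and are not taken from any cited method.

```python
from collections import Counter

def api_ngrams(calls, n=2):
    """Count n-grams (as tuples) over a list of API call names."""
    return Counter(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))

# Hypothetical trace; the API names are illustrative only.
trace = ["CreateFileW", "WriteFile", "CloseHandle", "CreateFileW", "WriteFile"]
bigrams = api_ngrams(trace, n=2)
```

Each distinct n-gram then acts as one token, so the number of possible tokens grows quickly once parameters are included.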
With the rapid rise of deep learning in recent years, researchers have also started to apply deep learning models such as CNN, RNN, and LSTM to malware detection. These models can effectively mine the feature information in a sequence to train a malware detector with high accuracy. Although many researchers have proposed API-sequence-based malware detection methods [10,11], most of these methods focus only on the API itself and not on its parameters. The results in [10,12] show that methods combining APIs and their parameters improve significantly on methods that consider only the APIs themselves. The methods in [13,14] deal with both the API and its parameters but do not consider the semantic information hidden in the API itself. Furthermore, [13] uses a rule-based clustering algorithm to handle the API parameters, which requires a great deal of expertise and is more complicated. Although [15] constructs a semantic chain from the information hidden in the API sequence, it does not consider the influence of the parameters on the semantics.
Furthermore, the most prevalent approach to handling text in the NLP field relies on a dictionary: the text is first segmented into tokens, a dictionary is built from them, and each token in the original text is then replaced with its index in the dictionary. With this kind of feature engineering, deep learning models struggle, in some cases, to handle text that is outside the dictionary. In malware detection, the parameters are determined by the malware developers, meaning that any text could potentially appear in them, especially in parameters such as URLs and file paths. A similar situation exists in recommender systems: for example, the ID of a streaming video on YouTube is usually a combination of numbers and letters, and such features typically cannot be processed using a standard dictionary, yet they cannot be ignored.
In this paper, a novel API sequence feature extraction method is proposed to solve the problems mentioned above. Inspired by [9,15], we use feature hashing to encode the semantic chain of APIs with added parameters. With feature hashing, text is transformed directly into vectors by a hashing algorithm without an intermediate step, so any text that can be hashed can be handled. Thus, the problem of unknown input from parameters is effectively solved, which allows the model to handle unknown samples. First, by semantic analysis of the APIs, the actions and objects representing the API operations and their targets are extracted, and the counts of the actions and objects appearing in the API sequences are used as supplementary information for the sequences. Second, in order to preserve the semantic information to the maximum extent and to reduce the overhead of feature engineering, only the simplest formatting of the parameters is performed. Third, the obtained semantic chains are encoded using feature hashing to generate feature vectors. Because the parameters in API sequences are complex, the ordinary method of constructing a word list and replacing tokens with indices would consume a great deal of time and memory, and the resulting model could not effectively handle unknown inputs. Feature hashing addresses this challenge without constructing a word list and also makes the model robust to unknown inputs. Finally, we feed the encoded feature vectors into a deep network for training, which consists of gated CNN, Bi-LSTM, and attention.
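The idea that feature hashing needs no word list can be sketched as follows. This is only an illustration under stated assumptions: MD5 is used here merely as a stable hash (Python's built-in `hash()` is randomized per process for strings), the signed update is one common variant of the hashing trick, and the token format `action|object|parameter` and the function name `hash_features` are hypothetical. The dimension of 128 follows the value reported in the hyperparameter section.

```python
import hashlib

def hash_features(tokens, dim=128):
    """Map a list of string tokens to a fixed-length vector via the hashing trick."""
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "little") % dim  # bucket for this token
        sign = 1.0 if digest[4] % 2 == 0 else -1.0          # signed update reduces collision bias
        vec[index] += sign
    return vec

# Hypothetical semantic-chain tokens; the format is illustrative only.
known = hash_features(["create|file|c:\\windows\\temp\\a.exe"])
unseen = hash_features(["connect|network|http://never-seen.example/x"])
```

A token that never appeared during training still lands in one of the 128 buckets, so no out-of-dictionary case arises; the cost is that different tokens may collide in the same bucket.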
The main contributions of this paper are as follows:
In this paper, a new API feature extraction approach is proposed from the perspective of semantic information and the characteristics of API sequences. This approach maximizes the description of the behavior of the samples by referring to both the semantic information of the API itself and its parameters. We semantically decompose the APIs and use the parameters of the APIs as an augmentation of the semantic information to extract a semantic chain containing complete semantic information from the API sequences.
This paper follows the common practice in recommender systems and applies feature hashing to behavior-based dynamic malware detection to solve the dynamic input problem caused by API parameters. It enables the malware detection system to handle unknown inputs more efficiently, thus alleviating the problem of an aging detection system to a certain extent.
A new API sequence information compression method is proposed. This paper borrows the method of dealing with extra-long sequences in NLP, i.e., using key sentences instead of whole paragraphs. The statistical information of the sequence is used as the “key sentence” of the whole sequence; it complements the API sequence and mitigates the loss of key behavior caused by truncating the API sequence. The experimental results show that this approach is very effective in dynamic detection.
We evaluate the recognition performance of this method on a competition dataset. Comparison experiments with other baseline models are also conducted. The final experimental results show that our method is significantly better than various baseline models.
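The "key sentence" idea in the third contribution can be sketched as follows: the sequence is truncated, but action and object counts are taken over the full sequence so that behavior beyond the cut is still represented. The mapping `API_SEMANTICS`, the function name, and the count key format are hypothetical; the actual decomposition rules used in this paper are not reproduced here.

```python
from collections import Counter

# Hypothetical action/object decomposition of a few API names (illustrative only).
API_SEMANTICS = {
    "CreateFileW": ("create", "file"),
    "WriteFile": ("write", "file"),
    "RegSetValueExW": ("set", "registry"),
    "connect": ("connect", "network"),
}

def sequence_summary(calls, max_len):
    """Return the truncated sequence plus counts computed over the FULL sequence."""
    counts = Counter()
    for name in calls:
        action, obj = API_SEMANTICS.get(name, ("unknown", "unknown"))
        counts[f"action={action}"] += 1
        counts[f"object={obj}"] += 1
    return calls[:max_len], counts

calls = ["CreateFileW", "WriteFile", "WriteFile", "connect"]
kept, stats = sequence_summary(calls, max_len=2)
```

In this toy trace the network connection falls outside the kept prefix, yet it still appears in the statistics.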
4. Experiment
4.1. Experimental Environment
The experimental environment consists of a hardware environment and a software environment.
Hardware Environment: The experimental machine has an i7-10700 CPU with 8 cores, 32 GB of memory, and a GeForce RTX 3060 GPU. The specific hardware parameters are shown in Table 2.
Software Environment: The operating system is Windows 10, and the programming language is Python 3.9. The third-party libraries required for the construction of the text detection model and the text recognition model are shown in Table 3, together with a brief introduction to the function of each library. For specific usage, refer to the user manual of each package.
4.2. Dataset
We use the dataset from the Datacon2019 competition, which contains 69,207 sandbox-generated behavior reports of different samples, to train our deep learning network. Of these reports, 20,000 are from benign samples, and the remaining 49,207 are produced by malware. We divide the dataset into a training set and a validation set, with 70% for training and 30% for validation. Additionally, we collected 2500 recent samples, both benign and malicious, from the real network for evaluation.
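For reference, a 70/30 split of 69,207 reports can be sketched as below. The shuffling, the seed, and the rounding down of the training size are assumptions; the paper does not state how the split was drawn or whether it was stratified by class.

```python
import random

def split_indices(n, train_percent=70, seed=0):
    """Shuffle indices 0..n-1 and split them into training and validation parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = n * train_percent // 100  # integer arithmetic avoids float rounding
    return idx[:cut], idx[cut:]

train_idx, val_idx = split_indices(69207)
```

Under these assumptions the training set holds 48,444 reports and the validation set holds 20,763.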
4.3. Hyperparameters
In order to retain the behavioral information in the original sequence as much as possible, the maximum length of the API sequence is set to 3000. In addition, the dimensionality of each feature vector is set to 128, i.e., 2 to the power of 7, to speed up the computation. The convolutional kernel size in gated CNN is set to 3, and the number of LSTM units in each direction in Bi-LSTM is 100, totaling 200 LSTM units. The detailed parameters are shown in the table.
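The hyperparameters above fix the input shape of the network at 3000 × 128. A minimal sketch of bringing a sequence of feature vectors to that shape is given below; zero padding at the end and the constant and function names are assumptions, since the padding scheme is not described.

```python
MAX_LEN = 3000        # maximum API sequence length
FEATURE_DIM = 2 ** 7  # 128-dimensional feature vector per step
KERNEL_SIZE = 3       # gated CNN kernel size
LSTM_UNITS = 100      # per direction, so the Bi-LSTM has 2 * 100 = 200 units

def pad_or_truncate(seq, max_len=MAX_LEN, dim=FEATURE_DIM):
    """Cut a list of feature vectors to max_len, or append zero vectors up to max_len."""
    seq = seq[:max_len]
    padding = [[0.0] * dim for _ in range(max_len - len(seq))]
    return seq + padding

short = pad_or_truncate([[1.0] * FEATURE_DIM for _ in range(5)])
long_ = pad_or_truncate([[1.0] * FEATURE_DIM for _ in range(3500)])
```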
4.4. Metrics
In this paper, we evaluate this method by calculating the accuracy, precision, recall, and F1-score of the final classification results and compare them with other methods. In addition, the ROC curve and the AUC score are used to compare the methods.
The metrics are calculated from TP, TN, FP, and FN. TP denotes true positive, i.e., the number of samples correctly classified as positive (malware judged as malware), and TN denotes true negative, i.e., the number of samples correctly classified as negative (benign samples judged as benign). FN denotes false negative, i.e., the number of samples incorrectly classified as negative (malicious samples judged as benign), and FP denotes false positive, i.e., the number of samples incorrectly classified as positive (benign samples judged as malicious).
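The four metrics follow from these counts by their standard definitions, as the sketch below shows. The function name and the example counts are made up for illustration and are not results from this paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts, used only to exercise the formulas.
m = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```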
4.5. Baselines
In this paper, machine learning algorithms such as SVM and MLP are used as baseline models. The feature vectors obtained from the semantic chain transformation of API sequences are fed into the baseline models, and indicators such as accuracy are examined to evaluate model performance. In addition, a widely used sequence processing method, embedding + TextCNN, is selected as a baseline for comparison. The experimental results are shown in Table 4, and the ROC curve is shown in Figure 6. The method proposed in this paper outperforms all the baseline models.