1. Introduction
The protection of a computer system, especially in a smart enterprise, represents a key factor for the company's survival: the damage caused by computer attacks can have significant economic impacts. Among the possible attacks, malware attacks are particularly relevant, as they cause direct damage to the system or intercept information relevant to the company. Protecting the smart enterprise from malware is therefore crucial, and it is necessary to implement techniques that can detect and recognize malware within enterprise networks. Most malware detection systems are based on signature verification techniques, which are effective for known malicious software but inadequate for new malware, for which no signatures are available. This limitation is a serious problem: such systems can intercept attacks built from prefabricated elements, but they cannot protect against a purpose-built attack. A way to overcome these limitations is to exploit the sandbox software CAPEv2 [1], the evolution of Cuckoo [2], to analyze hypothetical threats. The use of machine learning and classification techniques, operating on information related to the behavior of executables, allows the detection of malware, especially 0-day malware, i.e., malware not yet known.
Edge AI (edge computing + artificial intelligence) solutions allow inference to be performed where the data are generated, even before they are downloaded to the receiver's machine. Such approaches have the advantage of generally working in near real-time mode but, on the other hand, are constrained by limits on memory availability and computational power, making execution time a key factor. For these reasons, it is essential to understand which solutions are best suited to run on edge computing devices, in order to create Edge AI solutions that capture malicious events with high accuracy while respecting complexity and execution time constraints. A possible compromise in this view is to extract the calls to Application Programming Interfaces (APIs) made by individual executables, thus obtaining an abstraction of the behavior of the programs. This approximation allows a summary evaluation of the behavior, which can be analyzed with machine learning to estimate whether the software is malware or goodware.
This paper presents a benchmark whose primary purpose is to test various machine learning algorithms, both deep learning and shallow learning techniques, applied to datasets of goodware/malware-labeled software API calls.
The following tree-based shallow learning algorithms were considered:
Random Forest;
CatBoost;
XGBoost;
ExtraTrees;
and also two deep learning algorithms based on neural networks:
TabNet;
NODE (Neural Oblivious Decision Ensembles).
These algorithms were tested on two different datasets, malware-analysis-datasets-api-call-sequences [3,4] and APIMDS [5], in balanced K-fold cross-validation mode. Reports of the results were then generated and analyzed in detail. The explainable artificial intelligence technique SHAP [6] also allowed determining the extent to which each API call contributes to discriminating goodware from malware, and vice versa.
The main contributions of this work are:
Providing a benchmark showing the trade-off between the proposed algorithms in terms of accuracy and execution time;
Using the explainable AI SHAP technique to determine which API calls most heavily influence the classification process;
Showing how such approaches can support the results of dynamic analyses, focusing on the importance of individual API calls.
The paper is organized as follows: Section 2 contains the state-of-the-art review; Section 3 contains a description of the datasets used in the experiment; Section 4 describes the algorithms used, the settings of each of them for the experiment, and the techniques applied for preprocessing the datasets; Section 5 provides the results and their discussion; Section 6 presents the reasoning about explainable artificial intelligence using SHAP. Conclusions are presented in Section 7.
2. Related Work
Some of the scientific literature refers to Cuckoo [2], a precursor to CAPEv2; therefore, the principles described in those works remain directly applicable.
The Honeynet Project published an open-source project called CuckooML [7] in 2016. According to the state of the art at the time, the researchers built an innovative feature based on anomaly detection techniques, so that the modified software could cluster and identify new types of malware. All functionalities are available both from the command line and from a simplified web interface. The approach chosen by the authors was unsupervised and descriptive, specifically clustering. One of its advantages, given the nature of the problem, is that it does not require labeled data.
Darshan et al. [8] propose a new alternative to signature-based malware detection systems, in which new malware, including malware obtained by applying obfuscation techniques to previously known samples, easily evades controls and thus overcomes a system's defenses. To detect malware based on its behavior, their approach starts from the framework provided by Cuckoo, which offers an isolated environment where malicious software can be freely executed and its actions analyzed. Most sandboxes focus on system calls, i.e., the mechanisms used by a process at the application layer to request services from the operating system layer to perform user activities such as reading data, placing packets on the network, writing records to the registry, and so on.
The proposed approach for malware recognition is innovative: the hypothetical threat is executed inside Cuckoo and, once a report in JSON format containing the system calls is obtained, the report is first converted into a new format, MIST (Malware Instruction Set). The system calls contained in MIST, once isolated, are then used to create N-grams, i.e., sub-sequences of N characters, where N is the length of each sequence. Through the information gain metric, which represents the expected reduction of entropy in the data, the starting dataset (composed of N-grams) is partitioned, selecting the top N sub-sequences of characters as the Final Feature Vector, i.e., a vector of features to be processed in the training phase of the classifier. Unlike the work presented by The Honeynet Project, in [8] a supervised approach and a classification task are proposed, hence predictive, with the need for a labeled dataset. The n-grams thus extracted belong to both malicious and genuine software; moreover, duplicates are removed. Once the final features are obtained, an element called an instruction converter transforms them into ARFF (Attribute Relation File Format), used by the WEKA framework to operate with different machine learning algorithms for classification. The authors selected six different classification algorithms within WEKA.
At the end of the runs, the best classifier, in terms of accuracy, lowest false positive rate, and highest true positive rate, was SPegasos.
Ali et al. [9] used n-grams and TF-IDF to perform dynamic analysis for malware detection using API calls; the work reached an accuracy of 98.4% using this kind of preprocessing with a Logistic Regression classifier. Kumar et al. [10], building on anomaly detection, propose an innovative malware detection methodology based on clustering and the Trend Micro Locality Sensitive Hashing (TLSH) metric to cluster the dataset entries. In summary, starting from the report generated through the Cuckoo sandbox, a hash value is calculated according to the TLSH metric. After that, the pairwise difference between the generated hashes is calculated, and different clusters are formed based on a certain threshold. Next, the feature extraction and feature selection steps are applied, after which the starting dataset is partitioned and prepared for the training phase. Finally, several classifiers for malware detection were trained through the scikit-learn Python library. One of the crucial points of this work lies in the dataset: since it is difficult to find suitable, publicly available datasets for this purpose, the authors had to build one specifically, including both malware and benign software, in order to evaluate the proposed methodology. Although the task is supervised (prediction), in some preliminary phases, i.e., feature extraction first and feature selection later, an unsupervised approach, clustering, was used to describe and group data according to the TLSH metric. This innovation is intended to overcome traditional feature extraction techniques, which are particularly unsuitable for large volumes of data. In fact, by applying this metric to clustering, a decrease in the training time of the machine learning algorithms was observed without affecting the quality of predictions. Therefore, once Cuckoo produces the report in JSON format, the report is used to calculate a fingerprint according to the TLSH metric. Based on these fingerprints, clusters are generated to group malware having similar traits. Several binary (genuine/malicious) classification algorithms have been used for malware detection, such as:
Decision Tree;
Random Forest;
Logistic Regression.
At the end of the execution, during the evaluation phase of the proposed model, the Random Forest algorithm obtained the best performance in terms of accuracy, lowest false-positive rate, and highest true-positive rate.
Udayakumar et al. [11] propose a methodology similar to the one discussed in [8], exploiting the same feature extraction and feature selection techniques but differing mainly in the classification algorithms used and, ultimately, in the results obtained. Starting from the JSON reports generated at the end of the analysis performed by the Cuckoo sandbox, the first step consists of converting and extracting the system calls in MIST format. After that, the model follows the heuristic methodology of decomposition into n-grams, where n represents the length of the character sub-sequences to extract, which represent system calls. Exploiting the information gain metric and ordering the features in decreasing order, the final feature vector, fundamental for all the machine learning activities, is extracted. After the feature selection phase, the final feature vector is composed by selecting the top 200, 400, and 600 features in the form of n-grams, with n equal to 3 and 4. Some of the classifiers chosen by the authors are:
Adaboost;
Random Forest;
Naive Bayes;
Logistic Regression;
Random Tree.
The dataset used for the testing phase is composed of 3000 files belonging to the non-malware category and 3100 malware samples belonging to different categories. The tests show that the Random Forest classifier obtains the best results in terms of accuracy, a lower rate of false positives, and a higher rate of true positives. Ndibanje et al. [12] created a framework for malware de-obfuscation and analysis using machine learning algorithms, which showed good performance in detecting possible threats. In [13], a survey provides a global overview of how machine learning algorithms can be used in the context of offense-defense and, more generally, cybersecurity. The authors of [14] developed Sisyfos, a modular and extensible platform for malware analysis, complete with a web interface; its accuracy was tested using a Random Forest classifier, reaching 98%. Kim [15] presents a combination of static and dynamic analysis of various types of malware using several machine learning algorithms for classification purposes. Moreover, the author estimates a malware risk index using an analytic hierarchy process to detect malware and their probabilities. Choi et al. [16] used KNN combined with a vantage-point tree for the classification of malware; they reduced the detection time by 67% and increased the detection rate by 25%. In [17], several machine learning techniques are used, demonstrating that the decision-tree-based family outperforms all others; in particular, the decision tree reduced the false alarm rate by 2% and reached an F1-score of 99%. El-Shafai et al. [18] used a visual encoding of malware in order to apply a deep convolutional neural network and perform classification. Malware samples are converted from binary files into images; then VGG16, AlexNet, DarkNet-53, DenseNet, and ResNet CNNs were trained in transfer learning mode on the proposed dataset, reaching up to 99.97% accuracy, but without providing insights into which patterns are learned and how they correlate to malware. A survey on shallow and deep learning techniques to detect ransomware in IoT networks is provided in [19] by Fernando et al. Ref. [20], instead, provides a survey concerning malware detection in mobile devices (especially Android), categorizing the literature along three dimensions: type of analysis, features, and techniques.
A recently adopted approach for explainable AI is the SHAP library, which determines the features most involved in the classification process. Rao et al. [21] propose using this library to analyze the result of applying an Isolation Forest technique to the NSL-KDD dataset. In detail, the technique is employed to label the data based on the combinations of features judged to be of major significance. A different approach is proposed by Wang et al. [22], who, although working on the same NSL-KDD dataset, propose a method to explain locally and globally the predictions made by an Intrusion Detection System. The innovation brought by this approach is the coherence between what is produced by the authors' model and the peculiarities of the specific attacks, allowing the operator using the system to have more precise information and make more informed decisions. These two works provide some initial approaches to using the library in the cybersecurity domain, using the explanations to add information to the prediction made by a classifier. The work proposed here offers a similar approach but is oriented towards identifying the APIs that most influence the classification process, thus allowing the identification of elements that can act as warning signs. This approach is also oriented towards optimizing inference time.
As this literature review shows, several techniques are used on different datasets and under different conditions. Thus, it is difficult to draw any conclusion about the best techniques for this kind of problem. Exotic, nonstandard techniques would have to be re-implemented to be tested under exactly the same conditions, but some works lack the details needed to re-implement them. One of the final goals of this work is to test state-of-the-art tree-based techniques and their deep-neural-network counterparts on the same datasets under exactly the same conditions. This would allow the creation of a basic testing framework that other authors could use to obtain comparable results.
4. Algorithms Used
The aim is to compare deep learning and shallow learning techniques on the two different datasets previously mentioned for the benchmark under consideration. The models under consideration are subdivided into shallow learning techniques:
Random Forest;
CatBoost;
XGBoost;
ExtraTrees;
Deep learning techniques:
TabNet;
NODE.
4.1. Random Forest
Random Forest [25] is a classifier obtained from the aggregation, through bagging, of decision trees. A decision tree is an acyclic graph of decisions and their possible consequences, mainly used to create an "action plan" aimed at a purpose. The particularity of this model is the clarity with which it expresses information, represented as a tree. The decision tree is a predictive model in which each internal node represents a variable. The arcs between a parent node and its child nodes distinguish the paths based on the value of one of the features of the data to be classified; each branching step is the result of a split based on an equality, greater-than, or less-than condition applied to a variable. The leaf nodes represent the predicted class. The classification is obtained by following the expressed conditions from the root node to a leaf node. Thus, a decision tree is, in fact, a set of decision rules based on the values of the variables. A decision tree is generated starting from a dataset; in the training phase, defining some stopping (halting) criteria is necessary, since a heavily ramified tree significantly increases the computational complexity for little benefit in classification accuracy. The bagging operation merges multiple models of the same type, all derived from the same original dataset, using data obtained by sampling with replacement (bootstrap). Random Forest is then composed of a set of decision trees, all trained from the same dataset, where each decision tree is trained on a random subset of the variables. The resulting classification is the mode (in the case of a classification task) or the mean (in the case of a regression task) of the results obtained by the individual decision trees.
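As a minimal sketch of how such a model can be configured with the scikit-learn library mentioned later in this paper (the synthetic data stand in for the API-call feature matrix and are an assumption for illustration only):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the API-call feature matrix (illustrative only)
X, y = make_classification(n_samples=1000, n_features=100, random_state=0)

# 100 trees with the Gini criterion, matching the configuration in Section 4.7;
# each tree is fit on a bootstrap sample and considers a random feature subset.
clf = RandomForestClassifier(n_estimators=100, criterion="gini", random_state=0)
clf.fit(X, y)
y_pred = clf.predict(X)  # final class = mode of the individual trees' votes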
4.2. XGBoost
Gradient Boosting is a machine learning technique developed to solve classification and regression problems. It creates prediction models from an ensemble of weaker prediction models, typically decision trees. Model building is done stagewise, as in other boosting algorithms, but the advantage of this technique lies in the fact that it generalizes by allowing the optimization of an arbitrary differentiable loss function. This approach also differs from others in that it is suitable for a wide range of tasks and has excellent portability, as it supports different programming languages and operating systems.
XGBoost [26] improves the Gradient Boosting Machines (GBM) framework on which it is based through system optimizations and algorithmic improvements. XGBoost takes advantage of cache-aware algorithms by allocating internal buffers in each thread to store computed statistics. XGBoost uses parallelization, as the process of tree construction is performed in parallel, thanks to the interchange of the nested loops used in tree construction. The max_depth parameter allows adjusting the depth of each tree, instead of using the basic stopping criterion. The features that enable algorithmic improvements are:
Regularization, which avoids the phenomenon of overfitting in the training phase;
Adaptation in the presence of features with missing values, since it automatically selects a value to replace the missing ones based on the training data;
Cross-validation at each iteration.
XGBoost models currently represent an excellent solution in classification and regression tasks, both for their performance in terms of the results obtained and computation time compared to other algorithms.
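A minimal sketch with the xgboost Python package, using the authors' suggested defaults adopted in Section 4.7 (max_depth = 6, learning rate = 0.3); the synthetic data are a placeholder:
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=100, random_state=0)

# max_depth acts as the pre-pruning parameter discussed above
clf = XGBClassifier(max_depth=6, learning_rate=0.3, eval_metric="logloss")
clf.fit(X, y)
probabilities = clf.predict_proba(X)  # class probabilities for each instance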
4.3. CatBoost
CatBoost [27] is an open-source software library that defines itself as the state of the art for the gradient boosting [28] technique on decision trees. During training, a set of trees is built consecutively, where each new tree is built with a loss reduced with respect to the previous one. This influences the tree structure greedily. Another important aspect is the capability of performing feature value quantization automatically: it defines the thresholds used to create disjoint ranges (bins) for the feature values and labels. In addition to providing accuracy, robustness, practicality, and extensibility, all backed by ease of use, it also offers direct support for data in categorical format and has a GPU-computable version. CatBoost can easily be integrated with deep learning frameworks like Google's TensorFlow and Apple's Core ML. It can work with different data types, and it supports explainable artificial intelligence through feature ranking to sort the most important features. In this work, the model for plain supervised classification on a set of features is used.
4.4. Extra Trees
Extremely Randomized Trees [29], also based on decision trees, combines the results of the base models organized in the forest to determine the prediction output. The Extra Trees model creates a large number of unpruned decision trees, generated automatically from the data. For classification purposes, predictions are made by majority voting; for regression purposes, by averaging the result of each tree. It is based on the intuition that, by building trees that randomly pick the feature to split on at each step, the model will not overfit. This randomization makes the trees in the ensemble less correlated, at the cost of an increased variance of the individual trees, which can be countered by increasing the number of trees. The main difference with Random Forest is in the construction of the decision trees within the forest: ExtraTrees does not exploit the bootstrap methodology on learning samples and randomly selects a branching point on each candidate feature instead of targeting the optimal split as Random Forest does.
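The two differences from Random Forest map directly onto scikit-learn parameters, as in this minimal sketch (synthetic placeholder data):
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=100, random_state=0)

# bootstrap=False: each tree sees the full learning sample (no bootstrap),
# and split thresholds are drawn at random rather than optimized.
clf = ExtraTreesClassifier(n_estimators=100, criterion="gini", bootstrap=False)
clf.fit(X, y)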
4.5. TabNet
The idea behind TabNet [30] is to build a neural network for processing tabular data. As demonstrated in other application domains, e.g., images, the application of deep learning techniques brings a significant performance boost over classical machine learning techniques as the dataset grows. TabNet is designed to learn decision-tree-like models and retain their benefits: interpretability and feature selection. It sequentially uses an attention mechanism to choose which features to use at each decision step. Feature selection is made on an instance-by-instance basis, differentiating for each input. The model is built in multiple sequential steps, passing the input instances from one step to the next: each step is composed of a feature transformer and an attentive transformer that uses a sparse mask to provide sparse feature selection. This aspect increases interpretability by extracting, for each instance, weights (also known as importances) of the features. Initially, the dataset is processed without any feature engineering. Instances are then batch normalized and passed to the feature transformer, where they pass through different decision steps made up of fully connected layers and gated linear unit (GLU) activation functions. The output of each step is aggregated (e.g., with a sum operator) with the others and, depending on whether the problem is regression or classification, a different loss function is used to perform end-to-end training. It is important to note that normalization with √0.5 helps stabilize learning by ensuring that the variance throughout the network does not change dramatically. The model used for this problem is the classification version with automatic feature engineering and feature selection. A significant drawback of this technique is the massive amount of data required for learning, which is, among other things, one of the major limitations of attention-based models.
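A minimal sketch, assuming the community pytorch-tabnet implementation (whether the authors used this exact package is not stated); n_d and n_a correspond to the decision layer width and attention embedding dimension set to 8 in Section 4.7:
import numpy as np
from pytorch_tabnet.tab_model import TabNetClassifier

# Synthetic placeholder data; TabNet expects float32 feature matrices
X = np.random.rand(1000, 100).astype(np.float32)
y = np.random.randint(0, 2, size=1000)

clf = TabNetClassifier(n_d=8, n_a=8)   # widths as per the official publication
clf.fit(X, y, max_epochs=50)           # end-to-end training by backpropagation
y_pred = clf.predict(X)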
4.6. NODE (Neural Oblivious Decision Ensembles)
The NODE [31] algorithm consists of a deep learning architecture designed to work with tabular data. The basic unit of this architecture is the oblivious decision tree, which has the peculiarity of being constrained to use the same splitting feature and the same split threshold for all nodes at the same depth. The network is trained end-to-end through backpropagation. In the deep version, where multiple NODE layers are stacked on top of each other, the layers are connected with dense connections in the style of DenseNet: the input features and the outputs of all previous layers are concatenated before being fed to the next NODE layer. In the end, the layers' outputs are averaged in the case of regression, while majority voting is used in the case of classification, similarly to Extra Trees. The model used for NODE is the deep version for classification purposes.
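To make the oblivious constraint concrete, the sketch below (plain NumPy, with hard comparisons instead of the differentiable, entmax-based soft choices used by the actual NODE architecture) shows that a depth-d oblivious tree is fully described by only d (feature, threshold) pairs and 2^d leaf values:
import numpy as np

def oblivious_tree_predict(X, feat_idx, thresholds, leaf_values):
    # Every depth level shares ONE feature and ONE threshold for all its nodes,
    # so the d binary decisions form a d-bit code indexing one of 2**d leaves.
    bits = (X[:, feat_idx] > thresholds).astype(int)    # shape (n_samples, d)
    leaf_id = bits.dot(1 << np.arange(len(feat_idx)))   # binary code -> leaf index
    return leaf_values[leaf_id]

X = np.random.rand(5, 10)
out = oblivious_tree_predict(
    X,
    feat_idx=np.array([3, 7]),                  # depth-2 tree: two shared features
    thresholds=np.array([0.5, 0.2]),            # one shared threshold per depth
    leaf_values=np.array([0.1, 0.9, 0.4, 0.7]), # 2**2 leaf responses
)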
4.7. Experimental Setup and Preprocessing
Both datasets are unbalanced towards the "malware" class, and this has to be taken into account in all the following steps of the process. The solution is to leverage sampling methods that modify the starting data to make their distribution more balanced. Tests were carried out in 10-fold stratified cross-validation, where, for each fold, the numbers of goodware and malware instances are randomly balanced (random undersampling). The values reported are the averages over the 10 folds.
For the malware-analysis-datasets-api-call-sequences dataset, all features were already numerical. For the APIMDS dataset, instead, all categorical variables were one-hot encoded, while numerical variables were used as-is. In addition, only the first 114 features were taken into consideration for the APIMDS dataset. This is because the subsequent features were composed almost entirely of null values (>90%), and it was therefore decided to remove them in order to decrease the complexity of the problem and better face Bellman's famous "curse of dimensionality". For algorithms such as NODE and TabNet, the datasets were standardized with z-score normalization prior to the training process.
The stratified version ensures that the same class proportions are maintained in all training and test sets as in the original data, so that no value is over- or under-represented in the various sets, yielding a more accurate estimate of the results.
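One way to realize this protocol is sketched below, using scikit-learn's StratifiedKFold plus per-fold random undersampling of the majority class; balancing only the training folds is an assumption, as the paper does not detail whether test folds were balanced as well:
import numpy as np
from sklearn.model_selection import StratifiedKFold

def balanced_folds(X, y, n_splits=10, seed=0):
    rng = np.random.default_rng(seed)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        counts = np.bincount(y[train_idx])
        minority = counts.argmin()
        # keep all minority-class instances, sample the majority class down
        keep = np.concatenate([
            train_idx[y[train_idx] == minority],
            rng.choice(train_idx[y[train_idx] != minority],
                       size=counts.min(), replace=False),
        ])
        yield keep, test_idx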
The configuration of the algorithms is as follows: Random Forest and Extra Trees are configured to use 100 trees and the Gini index. XGBoost is configured as suggested by the algorithm's authors, with the max_depth pre-pruning parameter equal to 6 and the learning rate equal to 0.3. CatBoost is configured with default parameters, without any adjustment or selection of hyperparameters. TabNet's attention embedding dimension and decision layer width are both equal to 8, as per the official publication. The default settings were used for NODE.
5. Results
Table 2 and Table 3 present the results calculated for each algorithm (indicated in the first column), derived from the stratified cross-validation experiment with k equal to 10 and averaged over the folds. In particular, the Precision, Recall, and F1-score metrics are computed with macro averaging. The reference dataset for Table 2 is malware-analysis-datasets-api-call-sequences, while Table 3 presents the results for the APIMDS dataset.
Table 2 highlights the differences between the algorithms. In this specific case, the deep learning techniques NODE and TabNet, although showing an average F1-macro score in line with the state of the art, have F1-macro scores and Areas Under the ROC Curve lower than those of tree-based algorithms such as Random Forest, XGBoost, and CatBoost. Finally, CatBoost presents the best performance, obtaining the best trade-off between precision and recall.
The APIMDS dataset consists of 300 goodware examples (labeled 0) and 23,146 malware examples (labeled 1). Again, shallow learning techniques using trees, such as Random Forest, CatBoost, and XGBoost, perform slightly better than their deep learning counterparts, TabNet and NODE. Specifically, as in the case of the previous dataset, XGBoost and CatBoost are the best techniques from both the AUC ROC and F1-macro score points of view. XGBoost is slightly better than CatBoost in this experiment because it has better recall; however, it is also relevant to note how robust CatBoost is to the dataset change, i.e., its ability to perform with accuracies in line with the state of the art regardless of the dataset. It is now of interest to understand which features the best performing algorithms, such as CatBoost, deem most important for classifying instances as malware or goodware, using the explainable artificial intelligence SHAP technique.
Table 4 illustrates the training and prediction times of the treated algorithms as applied to the malware-analysis-datasets-api-call-sequences dataset. The time intervals in the table are expressed in seconds. To achieve a more accurate estimate, the reported times were averaged over the number of iterations completed for each step by each algorithm, i.e., 100.
Table 5 reports the training and prediction times of the algorithms on the APIMDS dataset. The same considerations made for the malware-analysis-datasets-api-call-sequences dataset in the previous paragraph apply.
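A simple harness of the kind presumably used to collect these figures (the averaging over 100 iterations matches the text; the exact measurement code is an assumption) could be:
import time

def timed_fit_predict(clf, X_train, y_train, X_test, n_runs=100):
    # Returns training and prediction times in seconds, averaged over n_runs
    fit_total, pred_total = 0.0, 0.0
    for _ in range(n_runs):
        t0 = time.perf_counter()
        clf.fit(X_train, y_train)
        t1 = time.perf_counter()
        clf.predict(X_test)
        t2 = time.perf_counter()
        fit_total += t1 - t0
        pred_total += t2 - t1
    return fit_total / n_runs, pred_total / n_runs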
As can be observed from Table 4 and Table 5, concerning the implementation of these algorithms in an Edge AI use case, the clear winners are the Extra Trees, CatBoost, and XGBoost algorithms. None of these algorithms uses neural networks or end-to-end backpropagation training in practice. This matters because backpropagation training requires several epochs for the model to converge to a stable solution and provide reliable results. In addition, the nonlinearity brought by activation functions makes the learning phase of a neural network computationally intensive. Finally, from a memory constraint perspective, tree-based algorithms avoid the use and concatenation of the big tensors found in deep neural networks (especially when the number of features is very high) and are thus less memory-hungry.
6. Reasoning
Once the experimental models have been built and the metrics have been estimated, an explanation of the output values is now provided. The SHAP [6] technique is based on Shapley value theory, which has its origin in game theory. Each feature of the dataset corresponds to a "player" in a game in which the prediction represents the payoff. In contrast to game theory, where the payoff is assigned to the players on the basis of the choices they make, here the payoff is the result of the combination of the variables that characterize the dataset. To each variable, then, a weight, or marginal contribution, is assigned, and it represents the Shapley value. The marginal contribution of each feature is calculated considering all its possible interactions with the other features present in the model. It is thus estimated how much information is contained in every combination, i.e., the added value that every feature brings to the prediction. To every variable, a marginal contribution is associated based on the increase in the accuracy of the prediction. Specifically, many combinations are tested, depending on the number of features to be included in the model. For every feature, all combinations with the other features are evaluated: the prediction is first calculated with the feature included; subsequently, the same prediction is calculated without the variable in question and, on the basis of the difference between the two predictions, the marginal contribution is attributed to that feature. This process is carried out for every feature of the model, over all the instances of the dataset, in order to obtain the average value of the marginal contribution of each feature. Analyzing the Shapley values of the CatBoost model trained on the malware-analysis-datasets-api-call-sequences dataset, as depicted in Table 6, it is possible to extract the importance of each feature for classifying instances as malignant or non-malignant, together with the relative magnitude of the importance.
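In practice, for a tree ensemble such as CatBoost, the SHAP Python library computes these values efficiently; a minimal sketch (model and X_test are placeholders for the fitted classifier and a feature matrix) is:
import shap

# model: a fitted CatBoostClassifier; X_test: the API-call feature matrix
explainer = shap.TreeExplainer(model)        # exact Shapley values for tree models
shap_values = explainer.shap_values(X_test)  # one contribution per (instance, feature)
shap.summary_plot(shap_values, X_test)       # global ranking of API-call features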
Therefore, by reducing the feature space, an experiment was carried out to study the API calls most present in malware, measuring the frequency of the values and relating it to the number of examples belonging to the same class. The same was done for the goodware class. Once the results for both categories were obtained, the "distances" were calculated, i.e., the differences between the presence of an API call in malware and the presence of the same API call in goodware (values expressed as percentages). The analysis shows that the APIs LdrGetProcedureAddress, LdrGetDllHandle, LoadResource, FindResourceExW, LdrLoadDll, GetSystemInfo, LoadStringA, NtProtectVirtualMemory, GetSystemMetrics, and CryptAcquireContextW are used most by malware and least by goodware; instead, the APIs NtClose, NtCreateFile, RegOpenKeyExW, NtOpenKey, RegCloseKey, GetSystemDirectoryW, RegQueryValueExW, GetSystemWindowsDirectoryW, NtOpenSection, and SetErrorMode are more frequently used by goodware and less by malware.
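A sketch of this frequency-distance computation (calls and y are hypothetical placeholders: a binary presence matrix of API calls per sample, and the class labels):
import pandas as pd

# calls: DataFrame, one row per sample, one column per API call name,
#        value 1 if the call appears in the sample's trace, else 0
# y: Series of labels, 0 = goodware, 1 = malware
freq_malware = calls[y == 1].mean() * 100    # % of malware using each API
freq_goodware = calls[y == 0].mean() * 100   # % of goodware using each API
distance = (freq_malware - freq_goodware).sort_values(ascending=False)
print(distance.head(10))   # most malware-leaning API calls
print(distance.tail(10))   # most goodware-leaning API calls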
As far as the APIMDS dataset is concerned, the same experiment was conducted, considering the 114 features of which it is composed. Here, too, the frequency of the API call values present in malware was first measured and then related to the number of examples belonging to the same class; the same was done for the goodware class. Finally, the distances between the API calls in malware and in goodware were calculated (values expressed as percentages), as shown in Table 7. The 10 API calls most involved in malware and least involved in goodware are: LoadLibraryExW, LocalAlloc, GetProcAddress, GetSystemMetrics, GetVersionExW, GetModuleHandleW, CreateFileW, RegisterClipboardFormatW, LoadLibraryW, and MapViewOfFileEx. Conversely, the 10 API calls most involved in goodware and least in malware are: CoUninitialize, TerminateProcess, MessageBoxW, FormatMessageW, CoCreateInstance, PostQuitMessage, TranslateMessage, GetWindowRect, PostMessageW, and FreeEnvironmentStringsA.
Let us now seek confirmation of what emerges from these data by analyzing examples of correctly classified malware and goodware and evaluating the related API calls. The following entries are selected from the malware-analysis-datasets-api-call-sequences dataset: entries no. 42,679, 17,006, and 6173 for the goodware class, and entries no. 41,444, 43,641, and 8921 for the malware class. Table 8 shows a chart that highlights the calls made by each entry, focusing on the calls that fall in the top 10 API calls made by malware and goodware (previously reported in Table 6).
Looking at the first record, 8 of the 10 most invoked APIs belong to the table summarizing the top 10 APIs of the goodware category, 1 API belongs to those typically invoked by malware, and 1 API does not belong to either top 10 ranking. The same happens in the second and third records, which show a prevalence of calls belonging to the top 10 of goodware. Next, evaluating the results for the malware class records, the first malware shows 6 out of 10 APIs from the top 10 ranking in malware and 2 out of 10 from the top 10 ranking in goodware; the remaining two APIs do not belong to either ranking. The last two records present a similar situation, with a prevalence of calls belonging to the top 10 in malware. Thus, the data confirm the influence of the API calls.
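The per-record check described above amounts to counting, for each entry, how many of its calls fall in each ranking; a small sketch (the top-10 lists and the entry's calls are placeholders taken from the tables) follows:
def profile(entry_calls, top10_malware, top10_goodware):
    # Count how many of an entry's API calls fall in each top-10 ranking
    calls = set(entry_calls)
    return {
        "in_malware_top10": len(calls & set(top10_malware)),
        "in_goodware_top10": len(calls & set(top10_goodware)),
        "outside_rankings": len(calls - set(top10_malware) - set(top10_goodware)),
    }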
Now, let analogous evaluations be performed on the APIMDS dataset. Table 9 shows the relevant API calls for three examples chosen for the goodware class (23,235, 23,386, 23,313) and three for the malware class (22,042, 15,316, 11,972), with a classification similar to the one already performed for the malware-analysis-datasets-api-call-sequences dataset. From Table 7, it can first be seen that, unlike in the other dataset, the API calls most frequently used by goodware show rather small distance values. It follows that they have little power to discriminate in favor of goodware. Focusing on the goodware examples, it appears that, although most of the API calls do not belong to either of the per-category top 10 lists in the table, the presence of at least one of those indicated for the correct category is still relevant. Observing Table 9, it is arguable that these examples have been classified as goodware because the calls identified in the top 10 of malware are missing and there are few recognized calls pertaining mainly to goodware. The classification, in this case, was based not on the presence of benign calls, as in the previous dataset, but rather on the absence of malicious calls. Conversely, in the malware examples, it is possible to see the discriminating power of the top 10 malware API calls, since the reported examples contain such calls in many cases.
As discussed above, the results shown in the tables for the two datasets, including those obtained from these additional experiments, are different. The cause is that the datasets examined are different, with different features, numerosity, and balance. In particular, the cardinality of the malware-analysis-datasets-api-call-sequences dataset is higher than that of the APIMDS dataset. On the other hand, the former collects a sequence of the first 100 features, while the latter has 114. Another difference lies within the features themselves: the first dataset expresses 307 distinct API values, while the second contains 2264. However, it is essential to note that the results are consistent with each other; all algorithms applied to the same dataset have accuracies that vary by ±7% on malware-analysis-datasets-api-call-sequences and ±3% on APIMDS.