*Article* **Detection of Malicious Software by Analyzing Distinct Artifacts Using Machine Learning and Deep Learning Algorithms**

**Mathew Ashik <sup>1</sup> , A. Jyothish <sup>1</sup> , S. Anandaram <sup>1</sup> , P. Vinod <sup>2</sup> , Francesco Mercaldo 3,4,\*, Fabio Martinelli <sup>3</sup> and Antonella Santone <sup>4</sup>**


**Abstract:** Malware is one of the most significant threats in today's computing world, since the number of websites distributing malware is increasing rapidly. Malware analysis and prevention methods are increasingly necessary for computer systems connected to the Internet. Such software exploits a system's vulnerabilities to steal valuable information without the user's knowledge and stealthily sends it to remote servers controlled by attackers. Traditionally, anti-malware products use signatures for detecting known malware. However, the signature-based method does not scale to detecting obfuscated and packed malware. Since the cause of a problem is often best understood by studying the structural aspects of a program, such as mnemonics, instruction opcodes, and API calls, in this paper we investigate the relevance of these features of unpacked malicious and benign executables for classifying an executable. Prominent features are extracted using Minimum Redundancy and Maximum Relevance (mRMR) and Analysis of Variance (ANOVA). Experiments were conducted on four datasets using machine learning and deep learning approaches such as Support Vector Machine (SVM), Naïve Bayes, J48, Random Forest (RF), and XGBoost. In addition, we evaluate the performance of a collection of deep neural networks, namely a deep dense network, a One-Dimensional Convolutional Neural Network (1D-CNN), and a CNN-LSTM, in classifying unknown samples, and we observed promising results using APIs and system calls. On combining APIs/system calls with static features, a marginal performance improvement was attained compared with models trained only on dynamic features. Moreover, to improve accuracy, we implemented our solution using distinct deep learning methods and demonstrated a fine-tuned deep neural network that achieved F1-scores of 99.1% and 98.48% on Dataset-2 and Dataset-3, respectively.

**Keywords:** malware; machine learning; deep learning; static analysis; dynamic analysis; hybrid analysis; security

#### **1. Introduction**

Malware, or malicious code, is harmful code injected into legitimate programs to perpetrate illicit intentions. With the rapid growth of the Internet and of heterogeneous devices connected over the network, the attack surface has expanded and has become a concern, affecting the privacy of users [1]. Malicious programs frequently enter systems without users' knowledge. Freely downloadable software is a primary source of malware, including freeware such as games,

**Citation:** Ashik, M.; Jyothish, A.; Anandaram, S.; Vinod, P.; Mercaldo, F.; Martinelli, F.; Santone, A. Detection of Malicious Software by Analyzing Distinct Artifacts Using Machine Learning and Deep Learning Algorithms. *Electronics* **2021**, *10*, 1694. https://doi.org/10.3390/ electronics10141694

Academic Editor: Suleiman Yerima

Received: 19 April 2021 Accepted: 9 July 2021 Published: 15 July 2021


**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

web browsers, free antivirus tools, etc. Since most financial transactions are now performed over the Internet, malware has caused huge financial losses for organizations and individuals. Malware writing has transformed into a profit-making industry, attracting a large number of hackers. Current malware is broadly classified as polymorphic or metamorphic, and both remain undetected by signature-based detectors [2].

Malware writers employ diverse techniques to generate new variants, which commonly include (a) instruction permutation, (b) register re-assignment, (c) code permutation using conditional instructions, and (d) no-operation insertion. Malware analysis is the process of inspecting and understanding malicious behavior [3]. Normally, malware is analyzed by extracting strings, opcodes, byte sequences, API/system calls, and network traces.
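To see why such variant-generation techniques defeat exact signature matching, consider the no-operation insertion case. The following minimal Python sketch (our own illustration, not the authors' tooling; the byte sequence is arbitrary) shows that inserting behavior-preserving NOP bytes changes the byte-level signature of the code:

```python
import hashlib
import random

def md5_signature(code: bytes) -> str:
    """A naive byte-level signature, as used by simple signature-based detectors."""
    return hashlib.md5(code).hexdigest()

def insert_nops(code: bytes, count: int, seed: int = 0) -> bytes:
    """Insert single-byte x86 NOPs (0x90) at random offsets; in straight-line
    code this preserves behavior while changing the byte pattern."""
    rng = random.Random(seed)
    out = bytearray(code)
    for _ in range(count):
        out.insert(rng.randrange(len(out) + 1), 0x90)
    return bytes(out)

original = bytes.fromhex("b8010000bb0a000000cd80")  # illustrative byte sequence
variant = insert_nops(original, count=3)

# The variant's signature no longer matches, defeating exact-match detection.
print(md5_signature(original) != md5_signature(variant))  # True
```

Register re-assignment and code permutation have the same effect: the semantics are unchanged, but any fixed byte signature fails.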

In this paper, we conduct a comprehensive analysis using multiple datasets by exploiting machine learning and deep learning approaches. Classifiers are trained independently on static features, dynamic features, and their combinations. We employ dynamic instrumentation tools such as Ether [4], a sandbox approach for analyzing malware. In addition, we also make use of a sandbox [5]. The motivation behind using these sandboxes is to prevent side effects on the host environment and to allow malware to exhibit its capabilities, which can be used as features for developing detection models. Ether in particular is based on a hardware virtualization extension, such as Intel VT [6], and resides entirely outside the target OS environment. In addition to providing anti-debugging facilities, Ether can also be used for dynamic software de-armoring.

Starting from these considerations, we propose a malware detector exploiting machine learning and deep learning techniques. The experiments were conducted on malware and benign Portable Executables (PE), Android applications, and metamorphic samples created using virus kits. The motivation for using these file types arose from monitoring the submissions received by VirusTotal [7], a service that performs online scanning of malicious samples. In particular, we consider a set of features obtained from benign and malicious executables, such as mnemonics, instruction opcodes, and API/system calls, for automatically discriminating legitimate from malicious samples. In summary, we list below the contributions of our proposal:


The rest of the paper is organized as follows: In the next section we provide an overview about the current state of the art in the malware detection context; in Section 3 we present the proposed method for malware detection; experimental analysis is discussed in Section 4; and, finally, in Section 5 a conclusion and future research plan are presented.

#### **2. Related Work**

To highlight the novelty of our work, we examine the malware detection topics to which the proposed method relates: techniques for malware detection and classification through machine learning and deep learning algorithms, and other techniques.

#### *2.1. Machine Learning-Based Malware Detection Techniques*

Krugel et al. [8] used dynamic analysis to detect obfuscated malicious code using a mining algorithm. Authors in [9] proposed a hybrid model for the detection of malware using different features like byte *n*-gram, assembly *n*-gram, and library functions to classify an executable as malware or benign. The work [10] considers the system call subsequence as an element and regards the co-occurrence of system calls as features to describe the dependent relationship between system calls.

Furthermore, the work in [11] extracted 11 types of static features and employed multiple classifiers in a majority-vote fusion approach, using classifiers such as SVM, k-NN, naive Bayes, Classification and Regression Trees (CART), and Random Forest. Nataraj et al. [12] considered the Gabor filter and evaluated it on 25 x86 malware families. They built a model using the *k*-nearest neighbors approach with Euclidean distance.

#### *2.2. Deep Learning-Based Malware Detection Techniques*

Recently in [13], applications were represented in the form of an image to discriminate between malicious and benign applications. The solution considered static features extracted by reverse-engineering the malicious code and encoding it with SimHash. The DroidDetector tool [14] discriminates between legitimate and malicious samples in an Android environment by exploiting a deep learning network, relying on required permissions, sensitive APIs, and dynamic behavior features. A deep convolutional neural network for malware detection is proposed by McLaughlin et al. [15], starting from the analysis of raw opcode sequences obtained by reverse engineering Android applications. MalDozer [16] is a tool aimed at Android malware detection and family identification by analyzing API method calls. Furthermore, the study in [17] proposes a malware detector focused on the Android environment, aimed at discriminating between malicious and legitimate samples and at identifying the family to which a malware sample belongs.

#### *2.3. Malware Detection Using Other Techniques*

API calls have been used in the past for modeling program behavior [18,19] and for detecting malware [20,21]. This paper relies on the fact that the behavior of malicious programs in a specific malware class differs considerably from programs in other malware classes and benign programs. Sathyanarayan et al. [22] used static extraction to extract API calls from known malware to construct a signature for an entire class. In [23], authors use static analysis to detect system call locations and run-time monitoring to check all system calls made from a location identified during static analysis.

Damodaran et al. [24] compared malware detection techniques based on static, dynamic, and hybrid analysis. The authors in [25] used Hidden Markov Models (HMMs) to represent the statistical properties of a set of metamorphic virus variants. The metamorphic virus dataset was generated from metamorphic engines: the Second Generation virus generator (G2), the Next Generation Virus Construction Kit (NGVCK), the Virus Creation Lab for Win32 (VCL32), and the Mass Code Generator (MPCGEN). Vinod et al. [26] proposed a method to find the metamorphism in malware constructors such as NGVCK, G2, IL\_SMG, and MPCGEN by executing each malware sample in a controlled environment like QEMU and monitoring API calls using STraceNTX. Suarez-Tangil et al. [27] focus their efforts on discerning malicious components from legitimate ones in repackaged Android malware. They consider control flow graphs generated from code fragments of the application under analysis. They highlight that most research papers on Android malware detection are focused on outdated repositories, such as the MalGenome project [28] and the Drebin [29] datasets.

DroidScope [30] uses a customized Android kernel to reconstruct semantic views to collect detailed application execution traces. An approach aimed at detecting Android malware families was presented in [10,31]. The method is based on the analysis of system call sequences and was tested obtaining 97% accuracy in mobile malware identification, using 3-gram system calls as features. Android malware detection exploiting a set of static features was addressed in [32]. Unsupervised machine learning techniques were used to build models with the considered feature set, statically obtained from permission invocations, strings, and code patterns. Furthermore, the Alde [33] framework employs static analysis and dynamic analysis to detect the user actions collected by analytics libraries. Moreover, Alde's analysis gives insight into the private information that can be leaked by apps that use the same analytics library. Casolare et al. [34] also focused on the Android environment by proposing a model checking-based approach for detecting collusion between Android applications. A comparison of existing techniques is given in Table 1.





#### **3. Proposed Methodology**

In the following subsections, we discuss our proposed methods for detecting malicious files. We prepared four datasets: (a) the first dataset (Dataset-1) comprises malicious executables collected from VX Heavens [35] along with legitimate files; (b) the second dataset (Dataset-2) is a collection of malicious files, including ransomware downloaded from VirusShare [36], along with goodware gathered from diverse sources; (c) the third dataset (Dataset-3) contains malicious Android applications acquired from the Drebin project [37] and benign APKs; and (d) the fourth dataset (Dataset-4) consists of synthetic malware samples created using virus generation kits. To improve readability, we present the expansion of abbreviations and the meaning of symbols in the abbreviations and mathematical symbols sections.

To predict unknown samples, we used malware from a collection of sources: the VX Heavens repository [35], ransomware downloaded from VirusShare [36], synthetic malware samples created using a virus kit, and malicious Android apps. Additionally, we gathered legitimate samples from diverse sources. Features such as mnemonics, instruction opcodes, API calls [38], and 4-gram mnemonics are extracted after unpacking the files. The basic idea of dynamic analysis is to monitor the program while it executes. Dynamic analysis of malware needs a virtual environment to avoid infecting the host system. We thus used a different type of sandbox for each dataset. For the VX Heavens samples, the executable files were run on a hardware-virtualized machine such as Xen [39]. The advantage of using a virtualized environment is that the actual host machine is not infected by the viruses during the dynamic API tracing step. The ransomware dataset was analyzed in the Parsa sandbox [5], which hooks the API calls to provide the requested resources to the executable, matching an environmental condition. Finally, malicious Android apps were executed in an emulator, system call traces were logged using the strace utility, and each application was subjected to random events such as clicks, swipes, changes in battery level, updates of geo-location, etc.

API call tracing requires that the samples are unpacked or unarmored, as explained earlier, since packers generally try to destroy the import table [40] of the malware or benign program. To unpack samples, we used Ether-patched Xen, which is transparent to malware. Hence, anti-debugging techniques such as virtual machine detection [4], debugger detection (the IsDebuggerPresent() API call, the EFLAGS bitmask), and timing attacks (analyzing the values of RDTSC before and after) can be avoided thanks to the hardware-virtualized environment. We used Xen as a virtual environment running on top of Debian Lenny (Debian 5.0.8). Xen is a generic, open-source virtualizer that achieves near-native performance by executing the guest code directly on the host CPU. In our process, we followed these steps:


Ether dumps the sample by finding the Original Entry Point using the memory writes a program performs. The dumped sample can be found in the images directory of Ether. Once the malware samples are unpacked, they are executed in an emulated environment and API tracing is performed using Veratrace.

#### *3.1. Software Armoring*

Software armoring, or executable packing, as shown in Figure 1, is the process of compressing/encrypting an executable file and prepending a stub that is responsible for decompressing/decrypting the executable for execution [42,43]. When execution starts, the stub unpacks the original executable code and transfers control to it. Today, most malware authors use packed executables to hide from detection. Through software armoring, malware writers can prevent malicious applications from being detected.

**Figure 1.** Software de-armoring.
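The pack/unpack cycle described above can be sketched in miniature with Python's `zlib`. This is our own simplification for illustration, not any real packer format: the 4-byte length header stands in for the executable stub a real packer prepends.

```python
import zlib

def pack(executable_code: bytes) -> bytes:
    """Compress the code and prepend a 4-byte length header acting as a
    minimal 'stub' descriptor (a real packer prepends executable stub code)."""
    compressed = zlib.compress(executable_code, level=9)
    return len(executable_code).to_bytes(4, "little") + compressed

def unpack(packed: bytes) -> bytes:
    """What a stub does at load time: recover the original image in memory."""
    original_size = int.from_bytes(packed[:4], "little")
    code = zlib.decompress(packed[4:])
    assert len(code) == original_size
    return code

payload = b"\x55\x89\xe5" * 100   # repetitive, highly compressible 'code'
packed = pack(payload)

print(len(packed) < len(payload))  # True: the packed image is smaller
print(unpack(packed) == payload)   # True: the stub restores the original
print(payload not in packed)       # True: the original bytes are hidden
```

The last line is the point for detection: a signature written against `payload` never matches `packed`, which is why de-armoring must precede static analysis.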

Before beginning the analysis of the malware, we should check whether the malware is armored or not. We use Ether as a tool for de-armoring, since it is not signature-based and is transparent to malware. Ether detects all the writes to memory a program performs and dumps the program back in binary executable form. It creates a hash table for all the memory maps, and when execution reaches a slot that was previously written to, it reports that address as the Original Entry Point, i.e., the starting point of execution of the packed executable's original code.
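The write-then-execute heuristic behind Ether's Original Entry Point detection can be illustrated with a short sketch. The trace format and addresses below are invented for illustration; Ether operates on real hardware-virtualized memory events, not a Python event list.

```python
def find_oep(trace):
    """Given an instruction trace of ('write', addr) and ('exec', addr) events,
    return the first executed address that was previously written to:
    the presumed Original Entry Point of a packed executable."""
    written = set()  # stands in for Ether's hash table of dirtied memory
    for event, addr in trace:
        if event == "write":
            written.add(addr)
        elif event == "exec" and addr in written:
            return addr
    return None

# Simulated packed-program behavior: the stub (0x1000-0x1002) unpacks the
# payload into 0x4000+, then jumps there.
trace = [
    ("exec", 0x1000), ("exec", 0x1001),    # stub runs
    ("write", 0x4000), ("write", 0x4001),  # stub writes the unpacked code
    ("exec", 0x1002),
    ("exec", 0x4000),                      # control transfers: this is the OEP
]
print(hex(find_oep(trace)))  # 0x4000
```

An unpacked program never executes from freshly written memory, so `find_oep` returning `None` is itself a (rough) signal that a sample is not packed.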

#### *3.2. Feature Extraction*

In our approach, we used API calls (dynamic malware analysis), and mnemonics, instruction opcodes, and 4-gram mnemonics (static malware analysis). The process of feature extraction is briefly illustrated in Figure 2. The various open-source tools used to extract these features are listed below:


In the following paragraphs, we briefly introduce the features extracted from malware and legitimate executables.

#### *3.3. API Calls Tracing*

The Windows API, informally WinAPI, is Microsoft's core set of application programming interfaces (APIs) available in the Microsoft Windows operating systems. In Windows, an executable program needs to make a set of API calls to perform its assigned work. For example, some of the API calls for file management are:

• OpenFile: Creates, opens, reopens, or deletes a file;
• DeleteFile: Deletes an existing file;
• FindClose: Closes a file search handle opened by the FindFirstFile, FindFirstFileEx, or FindFirstStreamW functions;
• FindFirstFile: Searches a directory for a file or subdirectory name that matches a specified name;
• GetFileSize: Retrieves the size of a specified file, in bytes.

Thus, no executable program can run without API calls. Hence, the API calls made by an executable are a good measure for recording its behavior.

#### **Figure 2.** Feature extraction.

To extract APIs, we use Veratrace, an API call tracer for Windows. It can trace all calls made by a process to the functions imported from a DLL. For extracting APIs, Veratrace requires unpacked samples; if a sample is packed, the import table is populated only with API calls such as GetProcAddress() and LoadLibrary(), which are also common in legitimate executables. We designed a parser to process all the traces and filter out API names without arguments; each such name is considered a feature in our work.
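A parser of this kind can be sketched as follows. The trace format shown is hypothetical (real Veratrace output may differ); the point is stripping arguments and keeping only API-name frequencies as the per-sample feature vector.

```python
import re
from collections import Counter

# Hypothetical trace excerpt; actual Veratrace output differs.
SAMPLE_TRACE = """\
OpenFile("C:\\\\boot.ini", GENERIC_READ)
GetFileSize(0x0040, NULL)
OpenFile("C:\\\\temp\\\\a.exe", GENERIC_WRITE)
DeleteFile("C:\\\\temp\\\\a.exe")
"""

API_NAME = re.compile(r"^([A-Za-z_]\w*)\(")

def api_frequencies(trace_text: str) -> Counter:
    """Strip the arguments from each traced call, keeping only the API name,
    and count occurrences to build the per-sample feature vector."""
    names = (API_NAME.match(line) for line in trace_text.splitlines())
    return Counter(m.group(1) for m in names if m)

features = api_frequencies(SAMPLE_TRACE)
print(features["OpenFile"])    # 2
print(features["DeleteFile"])  # 1
```

Stacking these counters over a fixed API vocabulary, one row per sample, yields the frequency matrix the later normalization and feature-selection steps operate on.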

#### *3.4. Mnemonic, Instruction Opcode, and 4-Gram Mnemonic Trace*

We performed static analysis using the open-source objdump tool to obtain assembly language code. From these files, mnemonics, instruction opcodes, and 4-gram mnemonics are extracted. An independent parser was developed to filter out the mnemonics and instruction opcodes.

Each file is represented as a vector whose elements are the occurrence counts of attributes. Since attribute values have different ranges, we normalize the data to a common scale. In our approach, we utilized a standard scaler, the *Z-score*. The normalized feature space is then discretized into three bins and used as input to the Minimum Redundancy Maximum Relevance (mRMR) feature selection algorithm.
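The normalize-then-discretize step can be sketched with NumPy. This is a minimal illustration, assuming equal-width binning of the standardized values; the paper does not specify the exact binning scheme.

```python
import numpy as np

def zscore(X: np.ndarray) -> np.ndarray:
    """Standardize each attribute (column) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # constant attributes simply stay at 0
    return (X - mu) / sigma

def discretize(Z: np.ndarray, bins: int = 3) -> np.ndarray:
    """Map standardized values into equal-width bins (states 0..bins-1),
    as required by mRMR."""
    edges = np.linspace(Z.min(axis=0), Z.max(axis=0), bins + 1)  # per column
    states = np.zeros_like(Z, dtype=int)
    for j in range(Z.shape[1]):
        # digitize against the inner edges of column j
        states[:, j] = np.digitize(Z[:, j], edges[1:-1, j])
    return states

# Rows: samples; columns: occurrence counts of two attributes (e.g., opcodes).
X = np.array([[10.0, 1.0], [12.0, 3.0], [50.0, 2.0], [11.0, 90.0]])
Z = zscore(X)
S = discretize(Z)
print(S.min(), S.max())  # all states lie in {0, 1, 2}
```

The resulting state matrix `S` is what the mRMR mutual-information computations below consume.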

#### *3.5. Feature Selection*

Earlier studies have reported that feature selection is an integral component [45] of a machine learning pipeline. Many feature selection algorithms have been designed specifically for an application domain; furthermore, every algorithm uses different criteria (such as information gain, the Gini index, etc.) for extracting prominent attributes. In the presence of irrelevant features, the detection model learns complex hypothesis functions, and the learned models cannot generalize to identify new samples. Fundamentally, the role of a feature selection approach is to extract a prominent subset of attributes to improve classifier performance. The advantages of feature selection are listed below:


We observed that the initial feature space contained irrelevant attributes. By irrelevance, we mean a set of features that cannot identify a class and can never influence detection. In particular, these attributes appear equally in all samples of the target class. As a result, we selected discriminant features using the maximal statistical dependency criterion based on mutual information, known as Minimum Redundancy Maximum Relevance (mRMR), and by comparing the means of two or more features using ANOVA [46], as shown in Figure 3.

**Figure 3.** Feature selection process.

The training process can be supervised, unsupervised, or semi-supervised. Supervised feature selection determines feature relevance by evaluating the correlation of attributes with the class. As our training data is labeled, we used supervised feature selection algorithms and filter methods to determine the correlation of the features with the class label. With filter methods, features are selected based on their intrinsic characteristics: their relevance to, or controlling power over, the target class. Such methods are based on mutual information or statistical tests (*t*-test, F-test). A feature can become redundant due to the existence of a large volume of other relevant attributes in the feature space.

#### 3.5.1. Minimum Redundancy Maximum Relevance

The maximum relevance criterion selects features that are highly correlated with the target class. mRMR is a filter method that requires the feature space to be discretized into states. However, this feature set is not a comprehensive representation of the characteristics of the target variable due to two essential aspects, as cited in [47]:


To expand the representative power of the attribute set while maintaining minimum pair-wise correlation, the minimum redundancy criterion supplements the maximum relevance criterion (mutual information with the target class). The mutual information of two features *x* and *y* is defined in terms of their joint probability distribution *P*(*x*, *y*) and the respective marginal probabilities *P*(*x*) and *P*(*y*) (refer to Equation (1)).

$$I(x, y) = \sum_{i,j \in S} P(x_i, y_j) \log \frac{P(x_i, y_j)}{P(x_i) P(y_j)},\tag{1}$$

where *x* and *y* are features (namely mnemonic, instruction opcode, API call, or 4-gram mnemonic), *P*(*x<sub>i</sub>*, *y<sub>j</sub>*) is the joint probability distribution of features *x* and *y*, *P*(*x<sub>i</sub>*) and *P*(*y<sub>j</sub>*) are the marginal probabilities, *I*(*x*, *y*) is the mutual information between features *x* and *y*, *i* indicates the level or state of feature *x*, *j* indicates the state of feature *y*, and *S* is the set obtained from the cross product of the sets of states of *x* and *y*. Subsequently, we compute the relevance and redundancy values of attributes as discussed below.

• Relevance value of an attribute *x*, *V*(*x*) is computed using Equation (2):

$$V(x) = I(h, x),\tag{2}$$

where *h* is the target variable or class, *I*(*h*, *x*) is the mutual information between class and feature *x*.

• Redundancy value, *W*(*x*) of feature *x* is obtained using Equation (3):

$$W(x) = \sum_{j=1}^{N} I(y_j, x),\tag{3}$$

where *N* is the total number of attributes and *I*(*y<sub>j</sub>*, *x*) is the mutual information between features *y<sub>j</sub>* and *x*.

Using Equations (2) and (3), minimum redundancy and maximum relevance of an attribute is computed, which is discussed below:

• Mutual Information Difference (MID): Is defined as the difference between the relevance value (*V*(*x*)) and the redundancy value (*W*(*x*)). To optimize the minimum redundancy and maximum relevance criteria, the difference between the relevance and redundancy value (see Equation (4)) was computed.

$$MID(x) = V(x) - W(x),\tag{4}$$

Hence, the feature with the maximum *MID* value indicates the mRMR feature;

• Mutual Information Quotient (MIQ): Is obtained by dividing the relevance value with the redundancy value, thus optimizing the mRMR criteria (refer to Equation (5)):

$$MIQ(x) = \frac{V(x)}{W(x) + 0.001}.\tag{5}$$

Hence, the feature with the maximum *MIQ* value indicates the mRMR feature. Our approach uses both criteria, i.e., *MID* and *MIQ*, for selecting features, and compares the performance of classifiers trained on the sets of *MID* and *MIQ* attributes.
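Equations (1)-(5) translate directly into code. The sketch below is a minimal NumPy implementation on a toy discretized feature matrix (the data is invented for illustration; a real run scores the full discretized feature space):

```python
import numpy as np

def mutual_info(a: np.ndarray, b: np.ndarray) -> float:
    """I(a, b) for discrete vectors, per Equation (1) (natural log)."""
    info = 0.0
    for va in np.unique(a):
        for vb in np.unique(b):
            p_ab = np.mean((a == va) & (b == vb))  # joint probability
            if p_ab > 0:
                info += p_ab * np.log(p_ab / (np.mean(a == va) * np.mean(b == vb)))
    return info

def mid_miq_scores(S: np.ndarray, h: np.ndarray):
    """Score each discretized feature (a column of S) against labels h:
    relevance V = I(h, x) (Eq. (2)), redundancy W = sum of I(y_j, x) over the
    other features (Eq. (3)), then MID = V - W (Eq. (4)) and
    MIQ = V / (W + 0.001) (Eq. (5))."""
    n = S.shape[1]
    V = np.array([mutual_info(h, S[:, i]) for i in range(n)])
    W = np.array([sum(mutual_info(S[:, j], S[:, i])
                      for j in range(n) if j != i) for i in range(n)])
    return V - W, V / (W + 0.001)

# Toy discretized feature matrix: column 0 tracks the class exactly,
# columns 1 and 2 are weaker or noisy attributes.
h = np.array([0, 0, 1, 1, 0, 1])
S = np.array([[0, 1, 1], [0, 0, 1], [1, 1, 0],
              [1, 0, 0], [0, 1, 1], [1, 0, 1]])
mid, miq = mid_miq_scores(S, h)
print(int(np.argmax(mid)), int(np.argmax(miq)))  # 0 0
```

Both criteria rank the class-tracking feature first here; in practice they can disagree, which is why the classifiers are trained on both attribute sets and compared.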

#### 3.5.2. Analysis of Variance

Analysis of Variance (ANOVA) is a statistical method for comparing the means of two or more groups. Depending upon the number of features and their levels, ANOVA can be classified as follows:

• One-way ANOVA: Requires one feature with at least two levels, such that the levels are independent;


Our proposed approach uses the factorial ANOVA criterion for feature selection. In doing so, attributes highly correlated with the target class are determined. In particular, using ANOVA we estimate the impact of one or more independent variables on the dependent variable (i.e., the class label). Feature influence is computed using variance; furthermore, it indicates separability between the classes. Specifically, if the variance of an attribute is low, then it has less impact on the target class. Using ANOVA, we choose a subset of independent variables having a stronger affinity towards the classes. Generally, a test such as the F-statistic is computed to analyze the results of experiments. The F-statistic follows a right-skewed distribution and is always positive. Variation in data can be due to two critical aspects: (a) variation within groups and (b) variation between groups. Prominent features are derived using the procedure discussed below:

$$SS_T^p = SS_B^p + SS_W^p,\tag{6}$$

where *SS<sub>T</sub><sup>p</sup>* is the total sum of squares of feature *p*, *SS<sub>B</sub><sup>p</sup>* is the between-group sum of squares, and *SS<sub>W</sub><sup>p</sup>* is the within-group sum of squares.

$$SS_T^p = \sum_{i=1}^{k} \sum_{j=1}^{l} (X_{ij} - \mu_p)^2,\tag{7}$$

Here, *k* is the number of classes (malware/benign), *l* is the number of states of feature *p*,

$$\mu_p = \frac{1}{k \ast l} \sum_{i=1}^{k} \sum_{j=1}^{l} X_{ij},\tag{8}$$

where *µ<sub>p</sub>* is the mean of the frequencies of feature *p*.

$$SS_W^p = \sum_{i=1}^{l} \sum_{j=1}^{k} (X_{ji} - \mu_i^p)^2,\tag{9}$$

where *SS<sub>W</sub><sup>p</sup>* is the within-group sum of squares of feature *p*, and *µ<sub>i</sub><sup>p</sup>* is the mean of the frequencies of feature *p* in the *i*-th discretization state.

$$DF_W^p = (k \ast l)^p - l^p,\tag{10}$$

where *DF<sub>W</sub><sup>p</sup>* is the within-group degrees of freedom of feature *p*, (*k* ∗ *l*)*<sup>p</sup>* is the number of observations of feature *p*, and *l<sup>p</sup>* is the number of samples of feature *p*:

$$DF_B^p = l^p - 1,\tag{11}$$

where *DF<sub>B</sub><sup>p</sup>* is the between-group degrees of freedom of feature *p*. Finally, the *F-score* is defined as:

$$F(DF_B^p, DF_W^p) = \frac{SS_B^p / DF_B^p}{SS_W^p / DF_W^p}.\tag{12}$$

Eventually, the feature *p* with the highest *F-score* is selected as a candidate member of the feature set.
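The intuition behind the F-score (between-group variability over within-group variability) can be illustrated with the standard one-way ANOVA formulation, which differs slightly in bookkeeping from Equations (6)-(12) but ranks features the same way. The toy data below is synthetic, for illustration only:

```python
import numpy as np

def anova_f_scores(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """One-way ANOVA F-score per feature: between-class variance over
    within-class variance; a higher F means better class separability."""
    classes = np.unique(y)
    k, n = len(classes), len(y)
    overall_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        ss_between += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        ss_within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    df_between, df_within = k - 1, n - k
    return (ss_between / df_between) / (ss_within / df_within)

# Toy data: feature 0 separates the two classes well, feature 1 is noise.
rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)
X = np.column_stack([
    np.concatenate([rng.normal(0, 1, 50), rng.normal(5, 1, 50)]),  # separable
    rng.normal(0, 1, 100),                                         # noise
])
scores = anova_f_scores(X, y)
print(scores[0] > scores[1])  # True: the separable feature gets a larger F
```

Keeping the top-scoring features and dropping the rest yields the ANOVA-selected attribute sets evaluated in Section 4.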

#### *3.6. Classification*

Classification is a form of data analysis that can be used to extract models describing classes; it predicts categorical (discrete, unordered) labels. In our work, we utilized various machine learning and deep learning algorithms, such as Support Vector Machine (SVM) [45,48], Naïve Bayes [49], J48 [50], Random Forest (RF) [51], and XGBoost [52]. In addition, we evaluated the performance of a collection of deep neural networks, namely a deep dense network, a One-Dimensional Convolutional Neural Network (1D-CNN), and a CNN-LSTM, in classifying unknown samples. The hyperparameters of all deep neural networks were tuned using the random search cross-validation approach. The above-mentioned classification algorithms were chosen as they have been extensively used in prior research, and a subset of these classifiers has been demonstrated to improve the detection of unknown malware files [53–55].

In real-world applications, the size of the dataset is massive and data appears in diverse forms. A shallow network has limited generalization capability: to obtain good results, shallow networks must be presented with features that are hand-picked or suitably chosen after several iterations of feature selection algorithms. The entire process is thus computationally expensive, and error-prone if attributes are extracted by humans. In contrast, deep neural networks employ a myriad of hidden layers, with each layer consisting of many neurons. Each neuron acts as a processing unit that outputs complex features of the input data. The lower layers extract features that are gradually amplified in the subsequent (higher) layers. A deeper layer derives the important aspects of the input data, omitting irrelevant details not needed for classification. Thus, deep networks do not require feature engineering from scratch. In general, classification is a two-step process, as discussed below:
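The layered re-representation described above can be sketched as a minimal forward pass of a dense network in NumPy. The layer sizes and random weights here are illustrative only, not the authors' tuned architecture (which was found via random search cross-validation):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Forward pass of a small dense network: each hidden layer re-represents
    the feature vector, so no hand-engineered features are required."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)                       # hidden layers amplify structure
    return sigmoid(a @ weights[-1] + biases[-1])  # output: malware probability

rng = np.random.default_rng(42)
n_features = 120                 # e.g., 120 selected opcodes/APIs
sizes = [n_features, 64, 32, 1]  # two hidden layers, one output unit
weights = [rng.normal(0, 0.1, (a, b)) for a, b in zip(sizes, sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

x = rng.random(n_features)       # one normalized feature vector
p_malware = forward(x, weights, biases)
print(0.0 < p_malware[0] < 1.0)  # True: a probability-like score
```

Training (the first of the two steps below) adjusts `weights` and `biases` by gradient descent on labeled samples; the forward pass alone is then the prediction step.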


#### **4. Experimental Evaluation and Results**

*4.1. Evaluation Metrics*

We used the following evaluation metrics:


True Positive (*TP*) is the number of samples correctly identified as malware. True Negative (*TN*) is the number of files correctly identified as legitimate. False Negative (*FN*) is the number of malicious files misclassified as benign. False Positive (*FP*) is the number of benign files wrongly labeled as malware by the classifier.
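From these four counts, the metrics reported in the rest of this section follow directly. A small sketch (the counts are hypothetical, and the formulas assume nonzero denominators):

```python
def evaluation_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Derive the standard metrics from confusion-matrix counts, including
    the positive likelihood ratio (LR+) reported in the results below."""
    tpr = tp / (tp + fn)       # detection rate / recall
    fpr = fp / (fp + tn)       # fallout
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * tpr / (precision + tpr)
    lr_plus = tpr / fpr        # how much a positive result raises the odds
    return {"TPR": tpr, "FPR": fpr, "precision": precision,
            "accuracy": accuracy, "F1": f1, "LR+": lr_plus}

# Hypothetical counts for 100 malware and 100 benign samples.
m = evaluation_metrics(tp=90, tn=95, fp=5, fn=10)
print(m["TPR"], m["FPR"])  # 0.9 0.05
```

A large LR+ means a positive prediction is strong evidence of malware, which is why it is used alongside accuracy and F1 when comparing the feature types below.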

#### *4.2. Experiments Results*

In this section, we discuss the experimental setup, the results obtained, and the analysis of the results. The primary objective of this work is to perform analysis on different types of datasets using various machine learning algorithms. For this purpose, we created the four datasets discussed below:


For the experiments on Dataset-1 and Dataset-4, we used a machine with Debian Lenny (Debian 5.0.8) as the host operating system, Windows XP Service Pack 2 as the guest operating system, an i7 processor, 8 GB RAM, and a 1 TB HDD. Experiments on Dataset-2 and Dataset-3 were performed on a 10th-generation Intel Core i7 with 16 GB RAM and a 1 TB HDD. Before executing samples, we freshly installed the operating system and took a snapshot of the virtual environment. After executing each sample, we restored the sandbox to its clean state; otherwise, residue from previous runs would have had a negative impact on the feature extraction phase.

#### *4.3. Investigation of Relevant Feature Type-Dataset-1*

We extracted mnemonics from 2000 samples. The experimental results obtained from feature reduction using mRMR (MID and MIQ) and ANOVA are shown in Figure 4. We obtained these outcomes after classifying the samples using SVM, AdaBoost, Random Forest, and J48. Five mnemonic-based models were constructed at variable lengths, from 40 to 120 at intervals of 20. Among these five models, ANOVA provides the best result, with a strong positive likelihood ratio of 16.38 for a feature length of 120 mnemonics using AdaBoostM1 (with J48 as the base classifier). The main advantages of this model are its low error rate and speed. However, mnemonic-based features can be easily modified using code obfuscation techniques.

**Figure 4.** Performance of classifiers on mnemonic features.

The dynamic API features initially had a feature length of 4480. We used reduction algorithms, namely mRMR (MID), mRMR (MIQ), and ANOVA, to obtain reduced feature lengths of 40, 60, 80, and 120, as illustrated in Figure 5. When comparing the different feature lengths, we observe that the positive likelihood ratio is highest for mRMR (MIQ) at a feature length of 120 prominent APIs using Random Forest.

Next, we derived 4-grams from a total feature space consisting of 1249 features. The features were effectively reduced with the mRMR (MID), mRMR (MIQ), and ANOVA methods. The classification models' performance was estimated over variable feature lengths, from 40 to 120 in steps of 20, as shown in Figure 6. For the above-mentioned five feature lengths, mRMR (MIQ) produces the best result, with over 96% accuracy for feature length 120 with Random Forest. However, the limitation stated in [56] is applicable in the current scenario: generation of 4-grams is computationally expensive, exhibits diminishing returns with more data, is prone to over-fitting, and does not seem to carry vital information for discriminating samples. At the same time, 4-grams do exhibit some merit, as they partially depict the behavioral snapshot of a program and sometimes produce results comparable to other approaches.

**Figure 6.** Performance of 4-gram features.

Finally, we derived the opcode-based feature set and reduced these features with mRMR (MID), mRMR (MIQ), and ANOVA, where the performance of the model is evaluated over feature lengths between 40 and 120 in increments of 20, as shown in Figure 7. Among these five feature lengths, we observed that ANOVA attains the highest performance, with a positive likelihood ratio of 19 for feature length 100 using Random Forest. However, the results obtained with mRMR are very close to those of the ANOVA features.

**Figure 7.** Performance of classifier trained on opcode features.

Hence, from the results we obtained, we observe that API features had a higher detection rate of 97.4%, with a fallout of only 1.7%, as against the 4-grams' 93.15% accuracy. Comparing likelihood ratios across feature types, opcode features achieved the highest positive likelihood ratio (*LR*+) of 19, while mnemonic and API features attained *LR*+ ratios of 16.38 and 15.69, respectively.

To summarize the experiments on Dataset-1, considering each feature independently on four classification algorithms, namely J48, Support Vector Machine (SVM), AdaBoostM1 (with J48 as a base classifier), and Random Forest (RF), we observe that Random Forest and AdaBoost produced the best results. We can attribute the accuracy of Random Forest to its being an ensemble-based technique that derives its output by accumulating votes from multiple decision trees. We can credit the boosting technique that AdaBoost employs for improving its accuracy: AdaBoost cascades multiple weak classifiers to give a strong learner, ensuring a high degree of precision. The J48 classifier comes next in terms of the results produced in comparison to the other classifiers. The output produced by J48 is close to the best classifiers in some cases but is consistently inferior to the other two. SVM produces the poorest results among the four classifiers, which can be explained by SVM's tendency to over-fit when the number of features is higher than the number of samples.
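A hedged sketch of this setup using scikit-learn, with a depth-limited decision tree (the closest analogue of J48 available there) as the AdaBoost base estimator; the dataset and hyperparameters are illustrative, not those of the paper.

```python
# Illustrative comparison of boosted trees vs. a random forest on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=40, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)

# AdaBoost cascades many weak tree learners into a strong classifier.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                         n_estimators=100, random_state=7)
# Random Forest aggregates votes from many independently trained trees.
rf = RandomForestClassifier(n_estimators=100, random_state=7)
for name, clf in [("AdaBoost (J48-like base)", ada), ("RandomForest", rf)]:
    clf.fit(X_tr, y_tr)
    print(name, round(clf.score(X_te, y_te), 3))
```

Both ensembles typically outperform a single tree on such data, mirroring the trend reported above.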

We further evaluated the performance of machine learning models generated by combining different feature categories. We consider such a feature space a multimodal attribute set. The term modality denotes the particular mode in which something is expressed; in this context, it refers to the various features obtained through feature extraction, as shown in Figure 2. In the unimodal architecture, we perform classification based on a single modality, and thus this framework is limited to operating on a single attribute type. To investigate whether blending features from diverse feature categories could improve classification accuracy, we extended our experiment using a multimodal architecture.

Multimodal architecture involves learning based on multiple modalities. This solution exploits the relationships existing between the various features of the available data. Such a network can be used to convert data from one modality to another, or to use one attribute set to assist the learning of another. We achieved multimodal fusion in our experiment by carrying out feature selection (as shown in Figure 3) on the relevant attributes from diverse categories (4-gram, mnemonics, API, and opcodes) and then fusing them, as shown in Figure 8.

As each feature has a different representation and correlation structure, fusing all these relevant features helps to extract maximum performance. Furthermore, after fusing these features, we obtained a new feature space comprising promising attributes, which we then used to create diverse classification models.
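The fusion step can be sketched as selecting prominent columns per modality and concatenating the reduced blocks. The API, 4-gram, and opcode widths mirror sizes mentioned in the text, while the mnemonic width, the random data, and the use of the ANOVA F-test selector are our own illustrative assumptions.

```python
# Minimal sketch of multimodal feature fusion: per-modality selection,
# then concatenation into one feature space. Data is synthetic.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif  # ANOVA F-test

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 300)
modalities = {
    "mnemonics": rng.random((300, 500)),   # width assumed for illustration
    "4gram":     rng.random((300, 1249)),
    "api":       rng.random((300, 4480)),
    "opcode":    rng.random((300, 733)),
}
# Keep the 60 most prominent columns of each modality, then fuse.
blocks = [SelectKBest(f_classif, k=60).fit_transform(X, y)
          for X in modalities.values()]
fused = np.hstack(blocks)  # 4 modalities x 60 features = 240 columns
print(fused.shape)
```

Four blocks of 60 selected features yield the 240-length fused space on which the ensemble classifier above was trained.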

The presence of irrelevant features or redundancy in the dataset might degrade the performance of multimodal classification. Since we pass the feature sets through various feature selection methods before performing feature fusion, our classifier is less susceptible to problems induced by redundancy and extraneous features.

**Figure 8.** Multimodal architecture for feature fusion classification.

The ensemble classifier demonstrated a maximum accuracy of 97.98% with a feature length of 240 using Random Forest, as shown in Figure 9. Among the unimodal classifiers, the API features demonstrated the highest detection rate of 97.4% with an FPR of 1.7%, while the opcode features displayed a detection rate of 91.6% and an FPR of 0.48%. Comparing the unimodal and multimodal architectures, the results obtained using the multimodal architecture show significant improvement over those of the unimodal classifiers (as shown in Figures 4–7). Since the ensemble classifier was developed by concatenating prominent features from various feature sets, it is evident from the results that each modality considered for fusion contributed to the overall performance of the classifier. Furthermore, this demonstrates that multimodal learning can be promising for improving detection in the malware detection task.

**Figure 9.** Performance of feature fusion.

Summary: Experiments on the VX-Dataset demonstrate that combining prominent mRMR features yields improved results compared with individual features. The highest detection rate is obtained with the Random Forest and AdaBoost models, owing to their ensemble, bagging, and boosting strategies. APIs play a significant role in predicting samples, with poor outcomes obtained using opcodes. Another important trend is that the results of the multimodal feature space and the API unimodal classifier differ only marginally. This is because the opcode attributes in the combined attribute space do not contribute towards classification, as they introduce more sparsity in the feature vectors. Hence, we conclude that the dynamic feature, i.e., the API, plays a critical role in discriminating malware from benign files.

#### *4.4. Evaluation on Virusshare Samples Dataset-2*

In this experiment, we perform a comprehensive evaluation on samples downloaded from Virusshare. As we observed inferior performance using static features, we performed analysis on APIs by running the samples in the Parsa sandbox. The transition from Ether to an alternative sandbox (the Parsa sandbox) was made because many executables crashed while running in Ether. The Parsa sandbox provides the requested resources to the executing samples while logging their API calls. It delivers resources by matching the APIs used by the executable against an API list corresponding to distinct sets of operations (mouse events, browser activities, file operations, etc.). In this way, the program is given the illusion of running in a real environment rather than a virtual one. While a program is executing, we log all APIs, extract call names, and select prominent calls using mRMR to create a machine learning model. In addition, we perform experiments using different deep learning models without feature engineering and compare the outcomes of the ML models and the deep neural networks. Table 2 compares the average results of the different models. We observe that the best results are obtained with a deep neural network, followed by the one-dimensional convolutional neural network and XGBoost. Table 3 exhibits the network topology and hyperparameters of the deep neural network models. In all intermediate layers, we use the ReLU activation function and randomly drop some neurons (i.e., dropout) to attain the best outcome for a particular neural network configuration.


**Table 2.** Results of models on Virusshare dataset.

**Table 3.** Network architecture and hyperparameters of deep neural network models.


#### *4.5. Evaluation on Android Applications Dataset-3*

In this experiment, we identify malicious Android applications (apps) using machine learning and deep learning techniques. Here, we use system calls as features for representing each application. First, we create an Android virtual device and install the applications to be inspected. A total of 2000 malware applications were randomly chosen from the Drebin dataset [37], and 2000 legitimate applications were downloaded from the Google Play store. While running the applications, system calls are recorded using the strace utility; during this process, we employ Android Monkey (a utility in the Android SDK for fuzz testing applications) to simulate a collection of events (e.g., changing the location, battery charging status, sending SMS, dialling a number, swipes, clicking on an app's widgets, etc.). In particular, in this work we execute each application with 1500 random events for one minute; however, the analysis could also be performed with varying numbers of events.

Relevant system calls are selected using the mRMR feature selection approach, and each app is then represented as a numerical vector using Term Frequency–Inverse Document Frequency (TF-IDF). The performance of the machine learning classifiers on sequences of system calls (two consecutive calls considered in a sliding-window fashion) is shown in Table 4. It was observed that distinguishing feature vectors were obtained by considering two consecutive system calls. Some examples of system call sequences are shown in Figure 10.

We considered the top 40% of system calls from the list of unique calls extracted from the entire training set.

**Table 4.** Performance matrix of machine learning classifiers on system call sequence.


From Table 4, we can see that the best outcome is achieved by the XGBoost classifier. However, this result comes at the cost of extra effort, i.e., feature engineering, which is a critical task in the machine learning pipeline. To eliminate feature engineering, we make use of a deep neural network architecture, which is a collection of layers, each consisting of several neurons. A neuron acts as a processing unit that collects multiple inputs, multiplies them by weights, and finally applies an activation function. We use a deep neural network whose input layer consists of 500 neurons and whose second layer contains 250 neurons. In all hidden layers, we use the Rectified Linear Unit (ReLU) activation function. The sigmoid activation function was used in the output layer, since malware identification is a binary classification problem. For faster convergence and to avoid overfitting, the Adam optimizer and the cross-entropy loss function are utilized. Table 5 reports the results obtained at varying dropout values; the best results are obtained with a dropout rate of 0.1.
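An approximate sketch of this network using scikit-learn's `MLPClassifier` (500 → 250 hidden units, ReLU, Adam, log-loss). Note that `MLPClassifier` has no dropout layer; a Keras or PyTorch version would insert a Dropout(0.1) after each hidden layer, as explored in Table 5. The dataset here is synthetic.

```python
# Approximate sketch of the 500->250 network described above; the data
# and training budget are illustrative, not the paper's.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=100, random_state=3)
clf = MLPClassifier(hidden_layer_sizes=(500, 250), activation="relu",
                    solver="adam", max_iter=50, random_state=3)
clf.fit(X, y)
print(round(clf.score(X, y), 3))
```

For binary classification, scikit-learn applies a logistic (sigmoid) output with log-loss internally, matching the output-layer choice described in the text.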


**Table 5.** Performance of the multilayer perceptron on varying the dropout rate.

#### *4.6. Evaluation on Synthetic Samples Dataset-4*

Malware constructors generate variants from a base virus by inserting equivalent instructions, reordering code, and permuting subroutines as code obfuscation techniques. The segments mutate from one generation to another, where the mutant code is transformed by the metamorphic engine to evade AntiVirus (AV) signature detection. This motivates the use of machine learning techniques to explore metamorphism among variants and within different families of synthetic samples, and to understand the extent of obfuscation induced by the virus kits. A malware dataset comprising 800 NGVCK viruses was used. Prior studies [57] reported that NGVCK samples could easily bypass strong statistical detectors based on HMMs using the opcode sequence. Likewise, 1200 benign executables were downloaded from different sources, including games, web browsers, media players, and the system32 executables from a fresh installation of the Windows XP operating system. As in previous experiments, we scanned all benign samples with VirusTotal to ensure that none of them was infected. The complete dataset was divided such that 80% of the samples were used for training and the remaining 20% as a test set. In this experiment, executables were analyzed based on API calls.

We extracted unique opcode bigrams from the training set and found 733 of them. Prominent opcodes are filtered out using the mRMR approach. We also studied the impact of varying the feature length, beginning with 50 opcode bigrams and extending to 250 bigrams, increasing the feature space by 50 opcodes at a time. We found that an increase in bigrams had a marginal influence on classifier performance. As we progressively extend the feature vector, informative attributes begin to appear, which eventually improves the results. However, increasing the features beyond a certain limit causes a drop in accuracy, primarily due to the addition of noise. We developed classification models using different algorithms, namely J48, AdaBoostM1 with J48 as a base classifier, and Random Forest. Table 6 compares the best outcomes of the classifiers, attained at a feature length of 150 bigrams.
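The feature-length sweep can be sketched as below, using the ANOVA F-test in place of mRMR for brevity. The data is synthetic (matching the 733-bigram width mentioned above) with a single planted informative column; accuracies will therefore differ from the paper's.

```python
# Illustrative sweep over bigram feature lengths 50..250 in steps of 50.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
y = rng.integers(0, 2, 300)
X = rng.random((300, 733))   # 733 unique opcode bigrams, as in the text
X[:, 0] += y                 # plant one informative column
acc_by_k = {}
for k in range(50, 251, 50):
    Xk = SelectKBest(f_classif, k=k).fit_transform(X, y)
    clf = RandomForestClassifier(n_estimators=50, random_state=5)
    acc_by_k[k] = cross_val_score(clf, Xk, y, cv=3).mean()
    print(k, round(acc_by_k[k], 3))
```

On real data, the trend described above would appear as a rise in accuracy up to a point, followed by a decline as noisy columns are admitted.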

**Table 6.** Performance metrics of machine learning models.


To understand the extent of metamorphism in virus generation kits, 677 viruses were created using different infection mechanisms to form malware families. In particular, we generated samples using virus kits (NGVCK, MPCGEN, G2, and PSMPC) and also downloaded real malware samples from VX Heavens. The dataset description is given in Table 7.


**Table 7.** Data set description with samples, number of families, and number of variants of each family.

Mnemonics are extracted from each malware sample and aligned using global and local sequence alignment methods. Sequence alignment places one opcode sequence over another to determine whether the sequences are identical. In the process of aligning two opcode sequences, gaps may be inserted. We adopted a simple scoring scheme where a match is assigned a value of +1, and every mismatch and gap is scored as −1. A similarity matrix is constructed using pairwise alignment of malware samples within a family. We record the minimum, average, and highest similarity distance for all malware samples. Likewise, the similarity distance of the base malware across malware families is computed.
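With the scoring scheme above (+1 match, −1 mismatch, −1 gap), a minimal Needleman-Wunsch global alignment score can be computed as follows; the opcode sequences are illustrative.

```python
# Minimal Needleman-Wunsch global alignment score with the paper's
# scoring scheme: +1 match, -1 mismatch, -1 gap.
def global_align_score(a, b, match=1, mismatch=-1, gap=-1):
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * gap          # align a's prefix against gaps
    for j in range(1, n + 1):
        dp[0][j] = j * gap          # align b's prefix against gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,  # match/mismatch
                           dp[i - 1][j] + gap,      # gap in b
                           dp[i][j - 1] + gap)      # gap in a
    return dp[m][n]

s1 = ["mov", "push", "call", "ret"]
s2 = ["mov", "push", "jmp", "call", "ret"]
print(global_align_score(s1, s2))  # 4 matches - 1 gap = 3
```

Running pairwise alignments of this kind over all samples in a family yields the similarity matrix described above.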

Two families are said to overlap if the similarity distance computed for base malware samples *Base<sup>i</sup>* and *Base<sup>j</sup>* is within the range of the minimum and average similarity distance determined for family *i* or *j*. This means that the greater the distance of a sample from the base malware, the lower the similarity; conversely, a high score depicts a higher similarity between any two samples. Table 8 depicts a segment of the pairwise alignment of two samples generated using the NGVCK constructor. Each row preceded by a hash symbol represents a gap, and an asterisk designates a mismatch of an opcode between the two malware samples.

The local alignment technique is employed to identify conserved code regions common to obfuscated samples, since the code varies across subsequent generations. We found that variants generated from MPCGEN are similar to those of G2 and PSMPC. In Figure 11, MPCGEN-F1 and MPCGEN-F3 show high similarity with the base malware of G2 and PSMPC (G2-F1, G2-F3, PSMPC-F1, and PSMPC-F3).


**Table 8.** Pairwise alignment of two samples generated using the NGVCK constructor. The sequence shows match, mismatch, and gaps inserted for aligning the samples.

**Figure 11.** Overlapping MPCGEN malware families.

To examine the obfuscation techniques used by malware constructors, we calculated sequence alignments and recorded mismatches among mnemonics. There was visible instruction replacement in NGVCK samples in comparison to the other synthetic generators. In Table 9, prominent mismatched opcodes are shown for four generators, as the rest showed a similar trend. mov, push, lea, pop, and jmp are primarily used as replacement instructions.


**Table 9.** Replacement of opcodes for malware generator (NGVCK, G2, PSMPC, and MPCGEN). For all generators, mov, push, pop, and jump instructions are replaced.

To ascertain overlap between real malware samples from VX Heavens and the obfuscated families, we studied the overlap of the opcode sequences of real malware samples with synthetic ones. Initially, we determine the base malware alignment (the sample that is closest to all samples in a family). Figure 12 shows the overlap of Win32.Agent with NGVCK, indicating that real samples also use code modifications similar to those of synthetic constructors. Win32.Bot and Win32.Downloader overlap the Win32.Autorun, Win32.Downloader, Win32.Mydoom, and Win32.Xorer families, indicating that worm families preserve a common base code and differ in syntactic structure due to obfuscation or an extension of malevolence.

**Figure 12.** Overlapping families of real malware with other families.

#### **5. Conclusions and Future Work**

In this paper, we addressed the detection of malicious files using diverse datasets comprising real and synthetic malware samples. The solution employs a collection of machine learning and deep learning approaches. Machine learning models were trained on prominent features derived using mRMR and ANOVA. Our results show that the Random Forest classifier attained better results compared with all the other machine learning algorithms used in this work. We also conclude that models trained on static features did not attain good results, due to sparse vectors. We demonstrated the efficacy of APIs and system call sequences in identifying samples. Moreover, to improve accuracy, we implemented our solution using distinct deep learning methods and demonstrated a fine-tuned deep neural network that resulted in an F1-score of 99.1% and 98.48% on Dataset-2 and Dataset-3, respectively. Finally, we performed an exhaustive analysis of code obfuscation on variants generated using NGVCK and other virus kits. We found that NGVCK samples are appropriately detected using a simple feature such as the opcode bigram. We also demonstrated that inter-constructor overlaps exist, especially amongst G2, MPCGEN, and PSMPC, indicating the use of generic code for infection. Our results also show that malware constructors employ naive obfuscation techniques; in particular, they insert junk instructions and replace equivalent instructions involving mov, push, pop, jump, and lea.

In the future, we would like to analyze malware and benign samples using an ensemble of features and classify unseen samples using ensembles of classifiers employing majority voting. We would also like to experiment with multi-class classification, i.e., labeling malware with its respective family. Moreover, we plan to investigate the efficacy of machine learning and deep learning models on evasive samples generated through feature manipulation, and to propose a countermeasure against adversarial attacks.

**Author Contributions:** Conceptualization, M.A., A.J., S.A., P.V., F.M. (Francesco Mercaldo), F.M. (Fabio Martinelli) and A.S.; methodology, M.A., A.J., S.A., P.V., F.M. (Francesco Mercaldo), F.M. (Fabio Martinelli) and A.S.; software, M.A., A.J., S.A., P.V., F.M. (Francesco Mercaldo), F.M. (Fabio Martinelli) and A.S.; validation, M.A., A.J., S.A., P.V., F.M. (Francesco Mercaldo), F.M. (Fabio Martinelli) and A.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** This work has been partially supported by MIUR-SecureOpenNets, EU SPARTA, CyberSANE and E-CORRIDOR projects.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


**Muath Alrammal \*, Munir Naveed and Georgios Tsaramirsis**

Faculty of Computer Information Science, Abu Dhabi Women College, Higher Colleges of Technology, Abu Dhabi 41012, United Arab Emirates; mnaveed@hct.ac.ae (M.N.); gtsaramirsis@hct.ac.ae (G.T.) **\*** Correspondence: malrammal@hct.ac.ae

**Abstract:** The use of innovative and sophisticated malware definitions poses a serious threat to computer-based information systems. Such malware is adaptive to the existing security solutions and often works without detection. Once malware completes its malicious activity, it self-destructs and leaves no obvious signature for detection and forensic purposes. The detection of such sophisticated malware is very challenging and a non-trivial task because of the malware's new patterns of exploiting vulnerabilities. Any security solutions require an equal level of sophistication to counter such attacks. In this paper, a novel reinforcement model based on Monte-Carlo simulation called *e*RBCM is explored to develop a security solution that can detect new and sophisticated network malware definitions. The new model is trained on several kinds of malware and can generalize the malware detection functionality. The model is evaluated using a benchmark set of malware. The results prove that *e*RBCM can identify a variety of malware with immense accuracy.

**Keywords:** malware detection; Monte-Carlo simulation; reinforcement learning

**Citation:** Alrammal, M.; Naveed, M.; Tsaramirsis, G. A Novel Monte-Carlo Simulation-Based Model for Malware Detection (*e*RBCM). *Electronics* **2021**, *10*, 2881. https://doi.org/10.3390/ electronics10222881

Academic Editor: Suleiman Yerima

Received: 30 September 2021 Accepted: 16 November 2021 Published: 22 November 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

#### **1. Introduction**

As the Internet has become essential in our lives, the number of users of internet services, such as e-commerce and e-banking, has increased rapidly. Unfortunately, this increase has been accompanied by a growing number of cyber-criminals who use malware (malicious programs) to achieve their malicious intentions [1].

Cyber-criminals launch new malware and attacks every year that are more sophisticated and harmful than those of previous years. Malware can adapt to the environment according to the security barriers set up in an IT environment. Millions of new definitions are generated daily to exploit vulnerabilities and compromise commercial information systems [2].

To counter this severe threat, security companies such as Kaspersky and Symantec have introduced several anti-malware products to protect individuals and companies [2]. These products target known malware definitions. While such solutions can detect known malware with high accuracy, they often lack the ability to detect unknown malware. Moreover, cataloguing all the different malware has become a complex task because of the enormous increase in the number of malware programs, making it difficult to find lasting solutions. These limitations have made it necessary to explore intelligent approaches that are flexible and adaptable in detecting unknown malware.

Most of the new intelligent approaches to malware detection are trained using selective features of known malware that represent the malware in its best form. These representations are then used as training instances for a suitable machine-learning algorithm that generalizes such feature-based malware detection mechanisms [3–13]. This work extends a previously explored approach called RBCM, which is also based on reinforcement learning [3]. The RBCM extension is called *e*RBCM; it merges the most beneficial features of Monte-Carlo-based real-time learning (MOCART) [4] and random forest [5–7] to make it more scalable for higher-order training datasets.

The rest of this paper is organized as follows: Section 2 presents the various approaches adopted to detect and analyze network malware; Section 3 describes our motivations and contributions; Section 4 provides a short introduction to MOCART; Section 5 illustrates the enhancements to our previous approach (RBCM) [14], made to avoid converging to local minima in the search spaces with a narrow range of values in an observation dataset; Section 6 shows the experimental set-up and compares the performance of *e*RBCM with its state-of-the-art rivals; and Section 7 presents our conclusions and future work.

#### **2. Related Work**

According to the malware detection taxonomy outlined in [8], machine-learning approaches can be classified along three major dimensions: malware targets, malware features, and the AI model used to generalize malware detection. This section focuses on the third dimension, the machine-learning algorithm, since our study evaluates algorithm performance on the malware detection task. Machine-learning algorithms scale to generalize non-linear problem spaces, which is the main motivation for exploring such approaches to optimize malware detection.

Different malware detection approaches in the literature have adopted different machine-learning techniques, such as random forest (RF) [5–7], neural network [9–11], decision tree [12,13], naïve Bayes [14,15], KNN and SVM [15], ARIMA [16], and reinforcement learning [17,18].

The RF machine-learning technique has been applied in several malware classification problems in the literature [5–7] because of its competitive performance compared to other algorithms. In an original approach proposed in [19], where the malware features were modelled as grayscale images, a comparison between three machine learning techniques revealed that RF outperformed the naïve Bayes and KNN algorithms.

RF was explored in [6] to generalize malware detection and classification. The authors presented a machine-learning technique called AMICO [6], which was trained using network-traffic-based selection parameters. The main purpose was to evaluate the payload information in network traffic. Parameters such as the IP address, source URL, target URL, file contents, etc., were analyzed to identify malware patterns. In the sandbox environment, a download reconstruction module was used to generate network traffic in real time. The traffic was based on executable files, and the malware detection technique was evaluated using real-time generated data. The training data constructed to generalize the AI model was based on both malware-based traffic data and normal traffic.

To distinguish between malware and benign files, Vadrevu et al. [7] opted for a supervised learning approach based on an RF algorithm, where the training set of labelled malicious instances was evaluated over a period of one to two months. The simulated data contained a fair distribution of both kinds of samples. The model trained on this data was tested using an academic network, and the test results showed that AMICO could detect 90% of the malicious content that travelled over the network during the testing phase.

A classifier based on a decision tree was used in [7] to detect malicious content. The "Malware Target Recognition (MaTR)" model is a hybrid of a decision tree classifier optimized by a sophisticated heuristic-based feature search that keeps rule exploration focused on the promising area of the search space. In that work, the heuristics are built using the structural information of malicious content and structural anomalies. Examples of malicious content structure include the file path, attributes, and size, while examples of structural anomalies include the entry point and section names. The classifier was trained using a benchmark dataset called VX-Heaven. The heuristics were built in the pre-processing stages and remained part of the training instances to extract quality rules. The classifier was tested on malicious content that was not used during the training phase. The test data showed an accuracy of 99% for the decision-tree-based classifier's malware detection.

Neural-network-based approaches to malware detection were introduced in [9–11], while a recurrent neural network (RNN)-based model was explored by Andrade et al. [9]. The model was trained using a benchmark dataset that is publicly available for exploring new security solutions. Their neural network model creates new connections among the neurons based on cycles to increase memory-based connectivity. The model also balances the trade-off between long-term and short-term memory approaches: short-term memory emphasizes the exploration of the solution space, while long-term memory exploits the already known best regions of the solution space. The experimental results showed that the RNN detected malicious content with 67% accuracy.

There are several approaches for app malware detection. Approaches including EspyDroid [20], AndroShield [21], Droidcat [22], and RevealDroid [23] are used as solutions for obfuscation camouflage techniques such as junk code insertion, package renaming, and altering control-flow [24–26].

Other approaches for app malware detection, such as API-Graph [27], DroidEvolver [28], and DroidSpan [29], are oriented towards solving the problem of sustainability (performance over time). However, it is unclear how these approaches address this problem in the case of a network attack or malware.

This paper focuses mainly on exploring a machine-learning model that can generalize the patterns of a variety of malware.

#### **3. Motivations and Contributions**

Our motivations are summarized below.

Because of the enormous increase in new malware samples, traditional approaches are not scalable to the sophistication of new attacks and lack the capability to detect and analyze these attacks. Intelligent, self-adaptive approaches for efficient network malware detection and analysis are required [2].

There is a need for an approach that can be easily scaled for large and high-dimensional malware datasets to avoid extensive training episodes. This is essential in order to generalize the characteristics of different kinds of malware. The security solution can be trained on different datasets without changes in the learning structures.

Our contributions are summarized as follows.

We improve RBCM [3] to avoid it being trapped in local minima. The current version combines the best features of MOCART [4] and RF [5–7]. Monte-Carlo simulations are optimized to dynamically select the region and scale of the samples used by the learning model. This dynamic sampling technique enhances the performance of the RBCM learning model, which otherwise selects a fixed-size sampling region of lower error. That drawback degrades RBCM's performance in cases where the sample space is limited, or where large areas of low-quality samples reduce the error with respect to the current surroundings without the model learning new knowledge.

We test *e*RBCM using three datasets: Microsoft Malware [30], ARP attack, and ICMP attack [31]. Furthermore, we provide a comparison of *e*RBCM with four state-of-the-art, best-performing prediction algorithms.

#### **4. Monte-Carlo-Based Real-Time Learning (MOCART)**

MOCART [4] is a Monte-Carlo (MC)-simulation-based machine-learning algorithm that applies the Monte-Carlo tree search to obtain estimates from one node of the solution space to another until the goal node is reached. The MC simulations explore the solutions using a sample space and build a learning structure. In MOCART, the MC simulations build a value function that predicts the outcome of each action in an uncertain or unknown environment. The simulations use a model of the system that can predict outcomes for a deterministic or nondeterministic problem space. As a result of these characteristics, MOCART has been used in several domains, especially nondeterministic ones. These capabilities make MOCART particularly suitable for malware detection, as the behavior of a sophisticated piece of malware can be non-deterministic, and it might behave differently at the same state in a problem space. This is particularly true for newer malware that is sensitive to sandboxing, and for Trojans. However, MOCART underperforms in domains where the number of samples that can be generated during simulations is limited, or when the simulation model is biased towards exploitation over exploration.
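The core of the approach described above, Monte-Carlo estimation of a value function, can be illustrated with a minimal sketch. The `simulate` environment and the probabilities below are purely hypothetical toy values, not part of MOCART; the point is only that averaging many random rollouts from a (state, action) pair approximates its value even when outcomes are non-deterministic:

```python
import random

def mc_estimate(simulate, state, action, n_rollouts=1000, seed=0):
    """Estimate the value of (state, action) by averaging Monte-Carlo rollouts.

    `simulate` plays out one episode from (state, action) and returns its
    reward; averaging many rollouts approximates the value function that an
    MC-based learner builds over the solution space.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        total += simulate(state, action, rng)
    return total / n_rollouts

# Toy nondeterministic environment: action "a" succeeds with probability 0.7,
# action "b" with probability 0.3 (the state is ignored for brevity).
def simulate(state, action, rng):
    p = 0.7 if action == "a" else 0.3
    return 1.0 if rng.random() < p else 0.0

q_a = mc_estimate(simulate, "s0", "a")
q_b = mc_estimate(simulate, "s0", "b")
# The estimates rank action "a" above "b", matching the true probabilities.
```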

#### **5. Reinforcement Learning Model RBCM**

To generalize the pattern recognition of various malware attacks, an updated version of reinforcement learning called *e*RBCM is explored in this work. *e*RBCM combines the best features of MOCART and RF.

The sampling technique is modified in RBCM so that it keeps finding new samples until the error rate falls below a threshold *θ*.

RBCM suffers from local minima when the sample space is small in scale or the data show little variation with respect to the class labels [4]. *e*RBCM increases the number of samples in the simulation model when the class labels are not equally distributed. The generative model of *e*RBCM is shown in Figure 1 for a sample *S*, simulation length *d*, and extension *n*.


**Figure 1.** Generative model for *e*RBCM.
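The imbalance-driven sample increase described above (more simulation samples when class labels are unevenly distributed) can be sketched as follows. The specific scaling rule and the `max_factor` cap are hypothetical illustrations, as the paper only states that the sample count grows under imbalance:

```python
from collections import Counter

def adjusted_sample_count(labels, base_samples, max_factor=4.0):
    """Scale the simulation sample budget by the class imbalance.

    Hypothetical rule: when class labels are perfectly balanced, the base
    budget is kept; the rarer the minority class, the more samples are
    drawn, capped at `max_factor` times the base budget.
    """
    counts = Counter(labels)
    minority = min(counts.values()) / len(labels)   # minority class share
    balanced = 1.0 / len(counts)                    # share under perfect balance
    factor = min(max_factor, balanced / max(minority, 1e-9))
    return int(base_samples * factor)

# Balanced labels keep the budget; a 9:1 split raises it (up to the cap).
n_bal = adjusted_sample_count([0, 1] * 50, base_samples=100)
n_imb = adjusted_sample_count([0] * 90 + [1] * 10, base_samples=100)
```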

The generative model of *e*RBCM extends the simulation length by *n*, as shown in step 6 of Figure 1. The extension decision is made using an epsilon *e* that depends on the current root mean square error (RMSE) of the learning structure (the learning structure is a *Q* function); this is the validation RMSE of the *Q* function on unseen data. The decision parameter is dynamic and shrinks as the RMSE shrinks, which makes the sample exploration self-adaptive and keeps the trade-off between exploration and exploitation in balance. Figure 2 illustrates the sampling process with respect to the depth of the search for samples in the direction of solutions.

**Figure 2.** Selection process of deeper search.

The search space at S1 was simulated with three neighboring states, and only S3 was extended, as it met the decision-making criteria. The state S3 produced the smallest RMSE among its siblings, providing a stronger and more reliable heuristic for selecting the direction in which to explore deeper. This also allowed the *Q* function to be updated with weight values that reduce the error of the network. The epsilon value was updated on each extension of the search process; for example, epsilon at S4 will be 0.002. If no neighbor of S4 produces a lower RMSE than S4, the search will stop at this state of the space. This process also intuitively drives the search out of local minima, as the RMSE can never fall below the best local minimum, so the generative model moves on to explore other regions.
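The selection step in Figure 2 can be sketched as follows. The exact epsilon-decay rule below is a hypothetical illustration (the paper only states that epsilon shrinks with the RMSE), and the sibling RMSE values are invented for the example:

```python
def extend_lowest_rmse(children_rmse, current_rmse, epsilon):
    """Pick the child state with the smallest validation RMSE.

    The search is extended only through the sibling with the lowest RMSE,
    and only while that RMSE improves on the current state's; otherwise the
    search stops at this state of the space.
    """
    best_state = min(children_rmse, key=children_rmse.get)
    best_rmse = children_rmse[best_state]
    if best_rmse >= current_rmse:
        return None, epsilon            # no neighbor improves: stop here
    # Illustrative decay: epsilon shrinks in proportion to the RMSE gain.
    new_epsilon = epsilon * (best_rmse / current_rmse)
    return best_state, new_epsilon

# S2..S4 are siblings expanded from the current state (RMSE 0.05); S3 wins.
siblings = {"S2": 0.040, "S3": 0.020, "S4": 0.035}
state, eps = extend_lowest_rmse(siblings, current_rmse=0.050, epsilon=0.01)
```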

Owing to the dynamic changes in epsilon, the generative model can learn biased strategies that explore the space rather than exploiting the best results. However, because the number of extensions is fixed, the search policy remains balanced between exploiting the best states and searching for new ones. The adaptive use of epsilon also brings the benefit of avoiding visiting the same sample more than once. This reduces the time spent searching for the best solution in the space and gives *e*RBCM the advantage of quicker convergence than CNN.

#### **6. Reinforcement Learning Model Experimentation**

#### *6.1. Experimentation*

All experiments were performed on Windows 10 Enterprise with 16 GB RAM and two Intel Core(TM) i7-4702MQ CPUs running at 2.20 GHz each. The benchmark malware files were analyzed using different programs for deep visibility into the attack data. The tools used were Wireshark and NetworkMiner, all run inside the Security Onion operating system. The attack files were further processed to generate a training dataset. The benchmark malware datasets analyzed were the Microsoft Malware dataset [30], the ARP attack dataset [31], and the ICMP attack dataset [31].

#### 6.1.1. Microsoft Malware Data

The dataset in [30] is organized by machine and has several input features (e.g., 'machineidentifier'), together with an output column ('hasdetected') indicating whether malware was detected on the machine. This column is used as the target output for training the machine-learning algorithms. The dataset contains the system details of each observation, including the default browser, current OS version, firewall, processor, primary disk type, volume capacity, total physical RAM, casing details, and gaming systems. This dataset is used to train machine-learning algorithms to detect malware on end systems running Windows OS.
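Splitting such a machine-level table into inputs and the 'hasdetected' target can be sketched with the standard library alone. The miniature table below is a stand-in with illustrative column names, not the dataset's exact schema:

```python
import csv
import io

# A miniature stand-in for the machine-level table described above;
# the column names are illustrative, not the dataset's exact schema.
raw = io.StringIO(
    "machineidentifier,osversion,firewall,totalram_gb,hasdetected\n"
    "m1,10.0,1,8,0\n"
    "m2,10.0,0,16,1\n"
    "m3,8.1,1,4,1\n"
)

rows = list(csv.DictReader(raw))
# 'hasdetected' is the label; the identifier is dropped, the rest are inputs.
X = [[r["osversion"], r["firewall"], r["totalram_gb"]] for r in rows]
y = [int(r["hasdetected"]) for r in rows]
```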

#### 6.1.2. ARP Attack Dataset

The ARP dataset [31] is taken from the Contagio malware dataset. ARP attacks exploit vulnerabilities related to the Address Resolution Protocol and can lead to attacks such as ARP spoofing. These attacks require careful analysis of the network characteristics for detection. The dataset for this malware is given as pcap files, which contain the network characteristics of the attack. Wireshark is used to extract the pattern of the malware, and the data in the pcap files are exported to CSV, which is then used as a training dataset.
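Deriving a training feature from such a CSV export can be sketched as follows. The column layout below mimics a typical Wireshark packet-list export, but the exact columns depend on the export settings, and the per-source ARP count is only one illustrative feature for spotting spoofing-like bursts:

```python
import csv
import io
from collections import Counter

# Illustrative Wireshark CSV export (columns depend on the chosen fields).
export = io.StringIO(
    "No.,Source,Destination,Protocol,Length\n"
    "1,10.0.0.5,10.0.0.1,ARP,42\n"
    "2,10.0.0.5,10.0.0.2,ARP,42\n"
    "3,10.0.0.9,10.0.0.1,TCP,60\n"
)

# One simple network characteristic: ARP packets seen per source address.
arp_per_source = Counter(
    row["Source"]
    for row in csv.DictReader(export)
    if row["Protocol"] == "ARP"
)
```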

#### 6.1.3. ICMP Attack Dataset

The ICMP malware dataset [31] also consists of pcap files, i.e., network data of ICMP-related attacks (ICMP smurf, ping of death, etc.). These attacks exploit vulnerabilities of network traffic based on ICMP or echo messages. Such messages can penetrate a network without being flagged because most security solutions are tuned to filter TCP/UDP-based messages. The pattern of such attacks is extracted using Wireshark and exported to a CSV file, which is then used to train the machine-learning algorithms.

*e*RBCM was trained using the benchmark datasets. The model testing also included malware definitions not used in the training. The ICMP and ARP malware categories include several patterns (definitions) of network malware that are part of the benchmark dataset [28]. The models were trained using 200 malware samples from both categories, and *e*RBCM's performance was tested on 150 malware samples different from the 200 used in training. The *e*RBCM performance was compared with state-of-the-art machine-learning techniques: RF, J48, CNN, and FNN.


The model performance was measured by applying the correlation coefficient (CC), RMSE, and accuracy. Higher correlation coefficient and accuracy values indicate a better performance, while a model with a lower RMSE is considered superior to those with a higher RMSE.
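These three metrics have standard definitions, sketched below in plain Python (CC is taken here to be the Pearson correlation coefficient, which is an assumption; the accuracy threshold of 0.5 is likewise illustrative):

```python
import math

def pearson_cc(y_true, y_pred):
    """Pearson correlation coefficient between targets and predictions."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in y_true))
    sp = math.sqrt(sum((p - mp) ** 2 for p in y_pred))
    return cov / (st * sp)

def rmse(y_true, y_pred):
    """Root mean square error; lower is better."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = sum((p >= threshold) == bool(t) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Toy scores: two benign (label 0) and two malicious (label 1) samples.
y_true = [0, 0, 1, 1]
y_pred = [0.1, 0.4, 0.6, 0.9]
```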

#### *6.2. Results*

The results of each model were averaged over ten runs, with the averages shown in Figure 3. The results show that RF established better correlation-based rules and outperformed the other models with respect to the CC. RF extracts the best possible rules because it is an ensemble of decision trees and identifies the best tree structure.

**Figure 3.** The correlation coefficient (CC) results of each model.

Figure 4 shows each model's accuracy. The accuracy profile indicates that *e*RBCM performed better than its competitors. Because of variations in the sample size in each run, the error rate fluctuated greatly in each episode of testing. Applying a convolutional neural network to extract the attack behaviors of the different malware was a promising strategy, but the convolutional neural network required more training episodes to converge than *e*RBCM.

**Figure 4.** The accuracy results of each model.

While random forest had a higher CC than other models, its performance lacked consistency in relation to accuracy due to the complex nature of malware patterns. When comparing RF and J48, RF performed better with respect to CC and accuracy because it is an ensemble model.

The performances of CNN and FNN were comparable in terms of accuracy, indicating the capability of neural network structures to generalize malware patterns. However, FNN identified fewer similar rules and produced low correlation-based outcomes.

The main reason for *e*RBCM's success is its self-adaptability in exploring and then balancing the trade-off between exploration and exploitation. Thanks to epsilon, *e*RBCM can guide its search towards the promising areas of the solution space: the generative model explores regions of lower RMSE more than regions of higher RMSE.

Figure 5 displays the RMSE results of each machine-learning technique. The results show that *e*RBCM produced a lower RMSE than most of its rivals. *e*RBCM performed better than its predecessor, RBCM, and had a consistently better performance than other models because of its adaptive approach in simulations to keep the sample size and space suitable for model learning. The samples were selected based on the quality of the search for a solution during Monte-Carlo simulations.

**Figure 5.** The RMSE performance of each model.

#### *6.3. Look-Ahead Search of eRBCM*

Figure 6 provides a deep insight into the sample selection mechanism of *e*RBCM. The accuracy of each solution search in a simulation depends on a specific number of samples from the search space. The sample selection mechanism is non-linear and requires adaptation to each problem set given to the simulation model.

**Figure 6.** The dependence of RMSE on the number of samples per simulation.

*e*RBCM's selection mechanism depends on a threshold based on the error rate, and it selects the threshold value that minimizes the RMSE. This is the main reason for *e*RBCM's successful generalization of attack patterns. The self-adaptivity of epsilon enables *e*RBCM to explore a larger but more focused area of the search space than RBCM and CNN, and *e*RBCM converges faster than its rivals because of the self-tuning of epsilon.

The look-ahead search of the generative model also benefits *e*RBCM by finding high-quality regions in a smaller number of iterations. Regions of lower RMSE are explored in more depth than regions of higher RMSE. This can lead to local minima, but thanks to the dynamic value of epsilon, the generative model departs such regions within a few iterations.

Figure 7 shows the RMSE results for the self-adaptability of the look-ahead search. The results show that the extended search produced quality solutions with low RMSE. The enhanced performance of the look-ahead search during simulations is explained by the guided exploration of the generative model: the higher the *n*-value of the simulation model (as given in Figure 1), the more promising the states of the solution space that *e*RBCM explores.

**Figure 7.** The dependence of RMSE on the number of states extended in a simulation.

With shallow searches in the region of quality solutions, *e*RBCM remains biased towards the exploitation of the best solutions found and converges to suboptimal solutions, as shown in Figure 7. With extensions to the look-ahead search, the deeper search provides an optimal balance between exploration and exploitation of the current best-found solutions. This also explains the phenomenon shown in Figure 2 relating to the look-ahead search of the generative model of *e*RBCM. At deeper search depths, the *e*RBCM generative model mirrors the natural-selection mechanism of evolutionary techniques: it provides a new solution as a mutation of the existing best solution, as shown in Figure 2. At state S3, for example, the generative model generates a new state S4, which is a mutation of S3.

#### **7. Conclusions**

In this paper, we presented a new approach called *e*RBCM to detect malware. The new model was designed using the reinforcement learning approach, which utilizes the strength of Monte-Carlo simulations and builds a strong machine-learning model to detect complex malware patterns. It combines the most beneficial elements of MOCART's reinforcement learning and RF's exploration capabilities. A large number of experiments were conducted using different malware benchmarks, including ARP attack, ICMP attack, and Microsoft Malware. *e*RBCM was consistently better than its competitors in terms of learning the new malware patterns and detecting unknown malware. This was mainly explained by *e*RBCM's self-adaptability to exploration and intelligent tuning of the balance for the trade-off between exploration and exploitation.

For future work, we plan to test our approach with various attacks to measure its scalability and accuracy. Furthermore, *e*RBCM will be explored for mobile malware using benchmark datasets. The mobile malware will be analyzed using sophisticated forensics tools, identifying key patterns via an innovative pre-processing stage. The malware will be scanned and categorized based on its malicious agenda. In each category, the common parameters will be explored using clustering, with these clusters used to generate a training dataset.

**Author Contributions:** Conceptualization, M.A.; methodology, M.N.; software, G.T.; validation, G.T.; formal analysis, M.A., M.N. and G.T. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Higher Colleges of Technology, grant number 1394.

**Acknowledgments:** The authors would like to thank the Higher Colleges of Technology for their support of this project.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

