*Article* **Ensemble-Based Classification Using Neural Networks and Machine Learning Models for Windows PE Malware Detection**

**Robertas Damaševičius 1,\* , Algimantas Venčkauskas <sup>2</sup> , Jevgenijus Toldinas <sup>2</sup> and Šarūnas Grigaliūnas <sup>2</sup>**

<sup>1</sup> Department of Software Engineering, Kaunas University of Technology, 44249 Kaunas, Lithuania

<sup>2</sup> Department of Computer Science, Kaunas University of Technology, 44249 Kaunas, Lithuania;


**Abstract:** The security of information is among the greatest challenges facing organizations and institutions. Cybercrime has risen in frequency and magnitude in recent years, with new ways to steal, change and destroy information or disable information systems appearing every day. Among the types of penetration into information systems where confidential information is processed is malware. An attacker injects malware into a computer system, after which they have full or partial access to critical information in the information system. This paper proposes an ensemble classification-based methodology for malware detection. The first-stage classification is performed by a stacked ensemble of dense (fully connected) and convolutional neural networks (CNN), while the final-stage classification is performed by a meta-learner. For the meta-learner, we explore and compare 14 classifiers. For a baseline comparison, 13 machine learning methods are used: K-Nearest Neighbors, Linear Support Vector Machine (SVM), Radial Basis Function (RBF) SVM, Random Forest, AdaBoost, Decision Tree, ExtraTrees, Linear Discriminant Analysis, Logistic Regression, Neural Net, Passive Classifier, Ridge Classifier and Stochastic Gradient Descent classifier. We present the results of experiments performed on the Classification of Malware with PE headers (ClaMP) dataset. The best performance is achieved by an ensemble of five dense and CNN neural networks with the ExtraTrees classifier as a meta-learner.

**Keywords:** malware analysis and detection; applied machine learning; mobile security; neural network; ensemble classification

#### **1. Introduction**

Many aspects of society have shifted online with the broad adoption of digital technology, from entertainment and social interactions to business, industry and, unfortunately, crime as well. Cybercrime has been rising in frequency and magnitude in recent years, with a projection of reaching USD 6 trillion by 2021 (up from USD 3 trillion in 2015) [1], overtaking conventional crime both in number and revenue [2]. Additionally, these new cyber-attacks have become more complex [3], generating elaborate multi-stage attacks. By the end of 2018, about 9599 malicious packages appeared per day [4]. Such attacks have also resulted in significant damage and major financial losses. Up to USD 1 billion was stolen from financial institutions around the world in two years due to malware [5]. In addition, Kingsoft estimated that between 2 and 5 million computers were attacked each day [6]. With cybercrime revenues reaching USD 1.5 trillion in 2018 [7] and cybercrime's global cost predicted to reach USD 6 trillion by 2021 [8], addressing cyber threats has become an urgent issue.

Moreover, the COVID-19 pandemic has delivered an extraordinary array of cybersecurity challenges, as most services have moved to online and remote mode, raising the danger of cyberattacks and malware [9,10]. Especially, in the healthcare sector, cyber-attacks can lead to compromised sensitive personal patient data, while data tampering can lead to incorrect treatment, with irreparable damage to patients [11].

**Citation:** Damaševičius, R.; Venčkauskas, A.; Toldinas, J.; Grigaliūnas, Š. Ensemble-Based Classification Using Neural Networks and Machine Learning Models for Windows PE Malware Detection. *Electronics* **2021**, *10*, 485. https://doi.org/10.3390/electronics10040485

Academic Editor: Suleiman Yerima Received: 11 January 2021 Accepted: 16 February 2021 Published: 18 February 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Today, computer programs and applications are developed at high speed. Malicious software (malware) has appeared in many formats, is growing in number and is becoming increasingly sophisticated. Computer criminals use it as a tool to infiltrate, steal or falsify information, causing huge damage to individuals and businesses and even threatening national security. Malware (malicious software) is a generic term used to describe all the various types of unauthorized software programs, including viruses, worms, Trojans, spyware [12], Android malicious apps [13], bots, rootkits [14] and ransomware [15]. Cybercriminals use malware as a weapon to achieve their objectives. Malware has been used to conduct a wide variety of security threats, such as stealing confidential data, stealing cryptocurrency, sending spam, crippling servers, penetrating networks and overloading critical infrastructures. While large numbers of malware samples have been identified and blocked by cybersecurity service providers and antivirus software manufacturers, a significant number of malware samples have been created or mutated (e.g., "zero-day" malware [16]) and appear to evade conventional signature-based anti-virus scanning tools. As these techniques are primarily based on modifications of signature-based models, this has caused the information security industry to reconsider its malware recognition techniques.

Malware detection methods can be classified into signature-based and behavior-based methods. Currently, signature-based malware detectors work effectively with previously known malware that has already been detected by some anti-malware vendors. However, they cannot detect polymorphic malware, which can change its signatures, or new malware, for which signatures have not yet been created. One solution to this problem is to use heuristic analysis in combination with machine learning techniques, which provide higher detection efficiency. As practice has shown, the traditional approach to malware detection, which is based on signature analysis [17], is not acceptable for detecting unknown computer viruses. To maintain the proper level of protection, users are forced to constantly and promptly update anti-virus databases. However, the delay in the response of anti-virus companies to the emergence of new malware (its detection and signature creation) can vary from several hours to several days. During this time, new malicious programs can cause irreparable damage.

To address this problem, heuristic analysis is used in addition to the signature approach. A file can then be considered "potentially dangerous" with some probability based on its behavior (dynamic approach) or the analysis of its structure (static approach). Static analysis generally consists of two main stages: the training stage and the stage of using the results (detection of virus programs). At the training stage, a sample of infected (virus) and "clean" (legitimate) files is formed. Certain features in the structure of the files characterize each of them as viral or legitimate. As a result, a list of feature characteristics is compiled for each file. Next, the most significant (informative) features are selected, and redundant and irrelevant features are discarded. At the detection stage, feature characteristics are extracted from the scanned file. Heuristic algorithms developed specifically to detect unknown malware are characterized by a high error rate. Heuristic-based detection uses rules formulated by experts to distinguish between malicious and benign files. Additionally, behavior-based, model checking-based and cloud-based methods have performed effectively in malware detection [18].

Modern research in the area of information security is aimed at creating protection methods and algorithms that are able to detect and neutralize unknown malware, and thus not only increase computer security but also free the user from constant updates of antivirus software. The size of gray lists is constantly growing with the advancement of malware writing and production techniques. Intelligent methods for automatically detecting malware are, therefore, urgently required. As a result, several studies have been published on the development of smart malware recognition systems using artificial intelligence methods [19–22].

A prerequisite for creating effective anti-virus systems is the development of artificial neural network (ANN)-based technologies. The ability of such systems to learn and generalize makes it possible to create smart information security systems. Artificial intelligence (AI) has several advantages when it comes to cybersecurity: AI can discover new, previously unknown attacks; AI can handle a high volume of data; and AI-based cybersecurity systems can learn over time to respond better to threats [23].

This study aims to implement an ensemble of neural networks for the detection of malware. The novel contributions of this paper are the following:


The other parts of this study are structured as follows. In Section 2, related works are discussed including the presentation of adequate criticism of existing methods and approaches. Section 3 describes the methodology used in this paper. Section 4 discusses the implementation and results obtained. Section 5 presents the conclusion of the study.

#### **2. Related Works**

Malware search algorithms are divided into two classes based on the method of collecting information: dynamic and static. In static analysis, suspicious objects are examined without running them, based on the assembly code and attributes of executable files [24]. Dynamic analysis algorithms work either with already running programs or run them themselves in an isolated environment, exposing the information that arises in the course of execution: they analyze the behavior of the program, sections of code and data, and monitor resource consumption [25]. According to the type of objects detected, malware search algorithms are divided into signature-based and anomaly-based ones. Signature-based methods match objects against known malware signatures. Anomaly detection algorithms seek to describe legitimate programs and learn to look for deviations from the norm.

At the same time, machine learning is also widely used as a powerful tool for security experts to identify malicious programs with high accuracy, as the number of malicious programs has grown large and their variants have become diverse. Among the main methods is the Windows Portable Executable 32-bit (PE32) file header analysis [26]. For example, Nisa et al. [27] transformed malware code into images and applied segmentation-based fractal texture analysis for feature extraction. Deep neural networks (AlexNet and Inception-v3) were used for classification. Previously, the use of ensemble methods, such as random forest and extremely randomized trees, improved the performance of machine learning models in detecting malware in Internet of Things (IoT) environments [28] and Wireless Sensor Networks (WSN) [29].

Many studies are being performed to analyze malware to curb the increase in malicious software [30]. The existing deep learning-based malware analysis methods include convolutional neural networks (CNN) [31], deep belief network (DBN) [32], graph convolutional network (GCN) [33], LSTM and Gated Recurrent Unit (GRU) [34], VGG16 [35] and generative adversarial networks (GAN) [36]. However, it is not possible to guarantee the generalization potential of artificial neural network-based models [37].

To solve the above-mentioned problems, more general and robust methods are, therefore, required. Researchers are creating numerous ensemble classifiers [38–42] that are less susceptible to malware feature collection. Ensemble methods [43] are a class of techniques that incorporate several learning algorithms to enhance the precision of the overall prediction. To minimize the risk of overfitting during training, these ensemble classifiers integrate several classification models. In this way, the training dataset can be used more effectively, and generalization efficiency can be increased. While several ensemble classification models have already been developed, there is still room for researchers to improve the accuracy of sample classification, which would be useful for improving malware detection.

Therefore, this paper proposes an ensemble learning-based approach that uses fully connected and convolutional neural networks as base learners for malware detection.

#### **3. Materials and Methods**

Malware developers are primarily focused on targeting computer networks and infrastructure to steal information, make financial demands or prove their potential. The standard approaches for detecting malware were effective in detecting known malware; these approaches, however, cannot block new malware. Modern machine learning platforms [44] have significantly enhanced the identification capability of models used for malware detection. It is possible to detect malware using machine learning methods in two steps, namely, extracting features from the input data and choosing the important ones that best represent the data, and then classifying/clustering. The proposed approach is based on machine learning models that learn to discern malicious from benign files, as well as make reliable predictions for new files that have not been seen before.

The phases involved in achieving the final solution are (1) data processing and feature selection and (2) model engineering, which includes the following steps: data selection and scaling, reduction in dimensionality, ANN model exploration and meta-learner classifier selection, ensemble model development, model testing and performance evaluation. Figure 1 indicates the flow to the model evaluation stage of the stages involved in the system methodology, beginning with data selection, which is described in more depth in the following subsections.

**Figure 1.** Outline of malware detection methodology.

#### *3.1. Data Collection and Processing*

For machine learning to succeed, the selection of a representative dataset is necessary, because a machine learning algorithm must be trained on a dataset that correctly represents the conditions of the model's real-world applications.

For this model, the dataset gathered contains malicious and benign data from the Classification of Malware with PE headers (ClaMP) dataset, obtained from GitHub. We used the ClaMP\_Integrated dataset, which has 2722 malware and 2488 benign instances. The dataset has 69 features, which include, among others, the following features:


e\_ip–initial IP value, e\_lfanew–file address of the new exe header, e\_lfarlc–file address of the relocation table, e\_magic–magic number, e\_maxalloc–maximum extra paragraphs, e\_minalloc–minimum number of extra paragraphs, e\_oemid–OEM ID, e\_oeminfo–OEM information, e\_ovno–overlay number, e\_res and e\_res2–reserved words, e\_sp–initial SP value, e\_ss–initial SS value.


However, we used only 68 features (all numerical), because the feature "packer\_type" is a string and was not used. The numerical features were scaled using the standard scaling method. These features, along with the class label (0 for benign and 1 for malicious), were used to build the ensemble classification model.
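As an illustrative sketch (not the authors' released code), the standard scaling step can be performed with scikit-learn's `StandardScaler`; the toy matrix below stands in for the 68 numerical ClaMP features:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix standing in for the 68 numerical ClaMP features
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Standard scaling: subtract the column mean, divide by the column std
X_scaled = StandardScaler().fit_transform(X)
# Each column now has zero mean and unit variance
```

In practice, the scaler would be fit on the training split only and then applied to the test split to avoid information leakage.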

#### *3.2. Dimensionality Reduction*

Machine learning methods are commonly used to solve a variety of estimation and classification problems. Poor machine learning performance can be caused by overfitting or underfitting. Removing unimportant features helps the algorithms reach optimal efficiency and improves speed. Principal Component Analysis (PCA) was applied to perform attribute dimensionality reduction. Based on previous studies, 40 features were chosen to be passed into the machine learning model (representing 95% of the total variability in the dataset), because these features are critical for the neural network to learn whether a file is malicious or benign.
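A minimal sketch of this step, assuming scikit-learn and synthetic data in place of ClaMP (the 95% variance threshold follows the text; the resulting component count depends on the data):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 68 scaled numerical ClaMP features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 68))
X = StandardScaler().fit_transform(X)

# Keep the smallest number of principal components that together
# explain at least 95% of the total variance
pca = PCA(n_components=0.95)
X_red = pca.fit_transform(X)
```

Passing a float in (0, 1) to `n_components` makes scikit-learn select the component count from the explained-variance ratio, which mirrors the 95%-variability criterion described above.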

#### *3.3. Deep Learning Models*

As deep learning models, we considered fully connected (FC) multilayer perceptron (MLP) and one-dimensional convolutional neural networks (1D-CNN), which are discussed in detail below.

#### 3.3.1. Multilayer Perceptron

As a baseline approach, we adopted a simple multilayer perceptron (MLP). Let the output of the MLP be *y*(*t*) for the input *X*(*t*), where *X*(*t*) is a vector with components (*x*1, *x*2, . . . , *xn*), *t* is the index of the sequence value and *t* = 1, . . . , *T* (*T* is predetermined).

The goal is to find model parameters *w* = (*w*0, *w*1, . . . , *wm*) and *V<sup>k</sup>* = (*V*1*<sup>k</sup>* , *V*2*<sup>k</sup>* , . . . , *Vnk*), *h<sup>k</sup>* , *k* = 1, . . . , *m*, such that the model output *F*(*X*, *V*, *w*) is as close as possible to the real output of the MLP *y*(*t*). The relationship between the input and output of a two-layer perceptron is established by the following relations:

$$Z\_k = \sigma(V\_{1k}\mathbf{x}\_1 + V\_{2k}\mathbf{x}\_2 + \dots + V\_{nk}\mathbf{x}\_n - h\_k), \quad k = \overline{1, m} \tag{1}$$

$$y = \sigma(w\_1 Z\_1 + w\_2 Z\_2 + \dots + w\_m Z\_m + w\_0) \tag{2}$$

The following expression describes a perceptron with one hidden layer, which is able to approximate any continuous function defined on a bounded set.

$$y = \sum\_{k=1}^{m} w\_k \cdot \sigma(V\_{1k}\mathbf{x}\_1 + V\_{2k}\mathbf{x}\_2 + \dots + V\_{nk}\mathbf{x}\_n - h\_k) + w\_0 \tag{3}$$

Training of MLP occurs by applying a gradient descent algorithm (such as error backpropagation) similar to a single-layer perceptron.
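The forward pass of Equations (1)–(3) can be sketched directly in NumPy; the weights below are random placeholders, not trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, V, h, w, w0):
    """Two-layer perceptron forward pass.

    x : (n,) input vector, V : (n, m) hidden weights,
    h : (m,) hidden thresholds, w : (m,) output weights, w0 : bias.
    """
    # Eq. (1): Z_k = sigma(V_1k*x_1 + ... + V_nk*x_n - h_k)
    Z = sigmoid(V.T @ x - h)
    # Eq. (2): y = sigma(w_1*Z_1 + ... + w_m*Z_m + w_0)
    return sigmoid(w @ Z + w0)

rng = np.random.default_rng(0)
n, m = 4, 3                       # arbitrary toy dimensions
x = rng.normal(size=n)
V = rng.normal(size=(n, m))
h = rng.normal(size=m)
w = rng.normal(size=m)
y = mlp_forward(x, V, h, w, 0.1)  # scalar in (0, 1)
```

During training, gradient descent (e.g., error backpropagation) adjusts *V*, *h*, *w* and *w*0 so that this output approaches the target labels.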

#### 3.3.2. One-Dimensional Convolutional Neural Network (1D-CNN)

While CNN models have been developed for image processing, where an internal representation of a two-dimensional input (2D) is learned by the model, the same mechanism can be used in a process known as feature learning on one-dimensional (1D) data sequences, such as in the case of malware detection. The model understands how to extract features from observational sequences and how to map hidden layers to different types of software (malware or benign).

$$\hat{\mathbf{y}} = \Phi([\mathbf{x}\_1, \dots, \mathbf{x}\_N]), \tag{4}$$

where *X* = (*x*1, . . . , *xN*) indicates the input of the network and *Y* = *ŷ* is the output. Therefore, the network learns a mapping from the input space *X* to the output space *Y*.

The key building block of a convolutional network is the convolutional layer. The parameters of this layer are a group of trainable filters (scan windows). Each filter operates over a small window. During the forward propagation of the signal (from the first layer to the last), the scanning window sequentially traverses the whole input according to the tiling principle and computes the dot product of two vectors: the filter values and the outputs of the chosen neurons. Thus, after passing through all the shifts along the width and height of the input field, a two-dimensional activation map is generated, which reflects the effect of applying a particular filter in each spatial area. The network learns filters that are activated by an input signal of some kind. A series of filters is used for each convolutional layer, and each generates a different activation map.

$$\mathbf{x}\_{j}^{l} = f\left(\sum\_{i=1}^{M} \mathbf{x}\_{i}^{l-1} \cdot \mathbf{k}\_{ij}^{l} + \mathbf{b}\_{j}^{l}\right),\tag{5}$$

where *k* is the convolution kernel, *j* indexes the output feature maps, *M* is the number of inputs *x*<sup>*l*−1</sup><sub>*i*</sub>, *b* is the kernel bias, *f*( ) is the neuron activation function and (·) represents the convolution operator.

The sub-sampling (pooling) layer is another feature of a convolutional neural network. It is usually positioned between successive convolution layers, so it may occur periodically. Its purpose is to gradually reduce the spatial size of the vector in order to reduce the number of network parameters and calculations, as well as to control overfitting. The pooling layer resizes the feature map, most frequently using the max operation. If the output from the previous layer is to be fed to the fully connected layer, it first needs to be flattened by a flattening layer. The Parametric Rectified Linear Unit (PReLU) layer is an activation function that complements the rectified unit with a slope for negative values.

The dropout layer is used to regularize the network; it also makes it possible to thin the network size. Neurons that are less likely to contribute to learning are randomly removed. The practical importance of the dropout unit is to prevent overfitting [45]. Since we have two classes, the dropout layer is followed by a fully connected (dense) layer that reduces the final output vector to two classes, corresponding to malicious or benign program behavior. The final activation function is SoftMax, which normalizes the two outputs into class probabilities.

The output of each convolutional layer in 1D-CNN is also the input of the subsequent layer. It also represents the weights learned by the convolution kernel from the training samples.

A unique and essential part of CNNs is the fully connected (FC) layer, which produces the final output. The output of the network's previous layers is reshaped (flattened) into a single vector, each element of which reflects the probability that a particular class label applies. The final probabilities for each label are supplied by the output of the FC layer.
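Equation (5) can be illustrated with a minimal NumPy forward pass of one 1D convolutional layer (a sketch with arbitrary shapes and `tanh` as a stand-in activation, not the trained network):

```python
import numpy as np

def conv1d_layer(x_prev, kernels, bias, f=np.tanh):
    """One convolutional layer as in Eq. (5):
    x_j^l = f( sum_i conv(x_i^{l-1}, k_ij^l) + b_j^l ).

    x_prev : (M, L) array of M input feature maps of length L
    kernels: (M, J, K) array of kernels producing J output maps
    bias   : (J,) kernel biases
    """
    M, L = x_prev.shape
    _, J, K = kernels.shape
    out = np.zeros((J, L - K + 1))
    for j in range(J):
        for i in range(M):
            # 'valid' convolution: output length L - K + 1
            out[j] += np.convolve(x_prev[i], kernels[i, j], mode="valid")
        out[j] += bias[j]
    return f(out)

rng = np.random.default_rng(1)
maps = conv1d_layer(rng.normal(size=(2, 10)),   # 2 input maps of length 10
                    rng.normal(size=(2, 3, 4)), # kernels of width 4, 3 outputs
                    np.zeros(3))
```

Each output map sums the convolutions of every input map with its own kernel, adds a bias and applies the activation, exactly as Equation (5) prescribes; deep learning frameworks implement the same computation with optimized kernels.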

#### *3.4. Network Model Optimization*

Optimization of neural network hyper-parameters, which govern how the network operates and determine its accuracy and validity, is still an unsolved problem. Optimizers adjust the parameters of neural networks, such as weights and learning rate, to minimize loss. Known examples of neural network optimization algorithms are Stochastic Gradient Descent (SGD) [46], AdaGrad [47], RMSProp [48] and Adam [49], which usually exhibit a tradeoff between optimization and generalization. This means that higher training speed and higher accuracy on the training set may result in poorer accuracy on the testing dataset. Here, we adopted the Exponential Adaptive Gradients (EAG) optimization [50], which combines Adam and AdaBound [51]. During training, it exponentially sums the past gradients and adaptively adjusts the learning rate to address the poor generalization of the Adam optimizer.

#### *3.5. Ensemble Classification*

The basic principle of ensemble methods is that training datasets are rearranged in several ways (either by resampling or reweighting), and by fitting a base classifier to each rearranged training dataset, an ensemble of base classifiers is built. A new ensemble classifier is then developed using the stacked ensemble method, which combines the predictions of all those base classifiers: a new model learns how to better integrate the predictions from multiple base models. We used the two-stage stacking technique [52]. First, several models are trained on a dataset, and the output of each model is processed to create a new dataset, in which each instance is associated with the actual value it is supposed to approximate. Second, this new dataset is used with the meta-learning algorithm to provide the final output.

In the design of a stacking model (Figure 2), the base models are often referred to as level-0 models, and a meta-learner (or generalizer) that integrates the base model predictions is referred to as the level-1 model. The base models are fit on the training data and their predictions are compiled. The meta-learner (level-1 model) is a classification model trained to combine the predictions of the base models. To fit the meta-learner, a new batch of previously unused data is passed through the trained base models to make predictions, and these predicted outputs, paired with the expected outputs, form the meta-learner's training set.

**Figure 2.** Schematics of ensemble classification approach.

The ensemble learner algorithm consists of three stages:

1. Setup:
	- (a) Select *N* base learners;
	- (b) Select a meta-learning algorithm.
2. Training:
	- (a) Train each of the *N* base learners on the training dataset {*X*1, *X*2, . . . , *XM*}, where *M* is the number of samples;
	- (b) Perform k-fold cross-validation on each of the base learners and record the cross-validated predictions {*y*1, *y*2, . . . , *yN*};
	- (c) Combine the cross-validated predictions from the base learners with the original features to form a new feature matrix {(*X*1, *X*2, . . . , *XM*, *y*1), (*X*1, *X*2, . . . , *XM*, *y*2), . . . , (*X*1, *X*2, . . . , *XM*, *yN*)} and train the meta-learner on it. The combined base learning models and meta-learner generate more accurate predictions on unknown data.
3. Prediction:
	- (a) Record output decisions from the base learners;
	- (b) Send the base-level decisions to the meta-learner to make the ensemble decision.
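The three stages above can be sketched with scikit-learn; small `MLPClassifier` models stand in for the Dense/1D-CNN base learners and `ExtraTreesClassifier` for the meta-learner, on synthetic data in place of ClaMP:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the 40 PCA features of the ClaMP dataset
X, y = make_classification(n_samples=300, n_features=40, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Stage 1: select N base learners (two small MLPs here)
base = [MLPClassifier(hidden_layer_sizes=(h,), max_iter=800, random_state=0)
        for h in (16, 32)]

# Stage 2: cross-validated base predictions become extra meta-features
meta_tr = np.column_stack([cross_val_predict(b, X_tr, y_tr, cv=5) for b in base])
for b in base:
    b.fit(X_tr, y_tr)
meta = ExtraTreesClassifier(random_state=0).fit(np.hstack([X_tr, meta_tr]), y_tr)

# Stage 3: base decisions on unseen data feed the meta-learner
meta_te = np.column_stack([b.predict(X_te) for b in base])
acc = meta.score(np.hstack([X_te, meta_te]), y_te)
```

Using cross-validated rather than in-sample base predictions in stage 2 keeps the meta-learner from learning the base models' overfitted outputs.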

On the training dataset, stacking improves over any single best learner. Usually, the greatest gains are made when the base classifiers used for stacking have high variability and uncorrelated predicted outputs. As base models, we used the following neural networks: a fully connected MLP with one hidden layer (Dense-1), a fully connected MLP with two hidden layers (Dense-2) and a one-dimensional CNN (1D-CNN). The configurations of the neural networks are summarized in Table 1.

**Table 1.** Model configuration of neural networks with their parameters. FC—fully connected. Conv1D—one-dimensional convolution. PReLU—Parametric Rectified Linear Unit.


The examples of neural network architectures are presented in Figure 3.

**Figure 3.** Example of architectures used as base learners: (**a**) Dense-1 network architecture, (**b**) Dense-2 network architecture and (**c**) 1D-CNN network architecture.

The role of the meta-learner is to find how best to aggregate the decisions of the base classifiers. As meta-learners, we explored K-Nearest Neighbors (KNN), Support Vector Machine (SVM) with linear kernel, SVM with radial basis function (RBF) kernel, Decision Tree (DT), Random Forest (RF), Multi-Layer Perceptron (MLP), AdaBoost Classifier, ExtraTrees (ET) classifier, Isolation Forest, Gaussian Naïve Bayes (GNB), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Logistic Regression (LR), Ridge Classifier (RC) and Stochastic Gradient Descent classifier (SGDC). Here, KNN is a model that classifies unknown input data based on the greatest similarity (least distance) to known input data. SVM is a supervised learning method that constructs a hyperplane in a higher-dimensional space to separate input data belonging to different classes while maximizing the distance of the input data to the hyperplane. The DT classifier creates a decision tree by splitting on the feature with the highest information gain. RF fits many DT classifiers on different sub-samples of the dataset and uses averaging to improve prediction accuracy. AdaBoost fits a classifier on the dataset and reweights incorrectly classified instances to improve accuracy. Isolation Forest performs classification based on anomalies identified in the data. GNB performs classification based on the probability distributions of features and classes. The ET classifier [53] creates a meta-estimator that fits multiple decision trees on sub-samples of the training dataset and uses averaging to improve precision and control over-fitting. The goal of LDA is to find a linear combination of input characteristics that distinguishes two or more groups of input data. QDA uses a quadratic decision surface to distinguish two or more groups of input data. LR is a linear regression-like statistical approach that predicts the outcome of a binary output variable from the input variables. RC converts the labels to (−1, 1) and treats the task as a regression problem; the class with the greatest predicted value is selected. SGDC is a stochastic gradient descent learning algorithm that finds the decision boundary using a linear hinge loss.
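Screening candidate meta-learners can be sketched as a simple cross-validation loop (a subset of the listed classifiers, run on synthetic data; not the paper's actual experiment):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the meta-level feature matrix
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

candidates = {
    "ET": ExtraTreesClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "RC": RidgeClassifier(),
}

# Mean 10-fold cross-validated accuracy per candidate meta-learner
scores = {name: cross_val_score(clf, X, y, cv=10).mean()
          for name, clf in candidates.items()}
best = max(scores, key=scores.get)
```

The same loop, with all 15 candidates and the F1-score as the scoring metric, mirrors the meta-learner selection procedure described above.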

#### *3.6. Evaluation of Malware Detection Results*

To measure the classification potential of the proposed ensemble learning model, the performance of the proposed model was evaluated using a 10-fold cross-validation method.

The true labels were compared against the predicted labels and the true positive (TP), true negative (TN), false positive (FP) and false-negative (FN) values were calculated. The recall, precision, accuracy, error rate and F-score values were calculated (we assumed the binary classification problem, where a positive class is labeled by +1 and a negative class is labeled by −1):

False positive rate (FPR) (equal to 1 − specificity):

$$FPR = \frac{\sum\_{i=1}^{m} [a(\mathbf{x}\_i) = +1] [y\_i = -1]}{\sum\_{i=1}^{m} [y\_i = -1]} \tag{6}$$

here [·] is the Iverson bracket operator.

True positive rate (TPR) (also sensitivity and recall):

$$TPR = \frac{\sum\_{i=1}^{m} [a(x\_i) = +1][y\_i = +1]}{\sum\_{i=1}^{m} [y\_i = +1]} \tag{7}$$

False negative rate (FNR):

$$FNR = \frac{\sum\_{i=1}^{m} [a(\mathbf{x}\_i) = -1][y\_i = +1]}{\sum\_{i=1}^{m} [y\_i = +1]} \tag{8}$$

Here, *a*(*x*) is the classifier, *x*1, . . . , *xm* are the inputs and *y*1, . . . , *ym* are the true outputs.

Precision is calculated as:

$$Precision = \frac{TP}{TP + FP} \tag{9}$$

To compute F-score, the following equation is used:

$$F-score = 2\frac{Precision \times Recall}{Precision + Recall} \tag{10}$$

The Matthews Correlation Coefficient (MCC) is calculated as:

$$\text{MCC} = \frac{\text{TP} \cdot \text{TN} - \text{FP} \cdot \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}} \tag{11}$$

The Cohen's Kappa statistic (shortly, kappa) is

$$k = 1 - \frac{1 - p\_0}{1 - p\_e} \tag{12}$$

where *p*<sub>0</sub> represents the ratio of correct agreement in the test dataset, and *p*<sub>*e*</sub> is the ratio of agreement that is expected by random selection.
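Equations (6)–(11) can be computed directly from the confusion-matrix counts as follows (a sketch; precision here uses the standard count-based definition TP/(TP + FP)):

```python
import math

def detection_metrics(tp, tn, fp, fn):
    """Metrics from Eqs. (6)-(11) given confusion-matrix counts."""
    tpr = tp / (tp + fn)                       # recall / sensitivity, Eq. (7)
    fpr = fp / (fp + tn)                       # Eq. (6)
    fnr = fn / (fn + tp)                       # Eq. (8)
    precision = tp / (tp + fp)
    f_score = 2 * precision * tpr / (precision + tpr)    # Eq. (10)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))   # Eq. (11)
    return {"TPR": tpr, "FPR": fpr, "FNR": fnr,
            "Precision": precision, "F-score": f_score, "MCC": mcc}

# Illustrative counts, not results from the paper
m = detection_metrics(tp=50, tn=40, fp=10, fn=0)
```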

In this study, performance was calculated using 10-fold cross-validation. We selected the best model according to the F1-score rather than accuracy alone. Accuracy can be a misleading metric for datasets with a major class imbalance: on a highly imbalanced sample, a model could simply predict the majority class for all outcomes and achieve high apparent classification performance while making erroneous predictions on the minority class. The F1-score discourages this behavior by computing the metrics for each label and taking their unweighted average. We also consider the area under the curve (AUC) as a measure of binary classification quality, as it is a balanced metric that can be used even when the classes in the dataset are of very different sizes. Furthermore, the performance of the proposed model on the binary dataset is represented using the confusion matrix.

We used the performance outcomes from each fold of the 10-fold cross-validation for statistical analysis. We adopted the non-parametric Friedman test followed by the post-hoc Nemenyi test to compare the findings and assess their statistical significance. First, the methods were ranked based on the selected performance measures (accuracy, AUC and F1-score). Then, each method's mean rank was determined. If the difference between the mean ranks of two methods was less than the critical difference obtained from the Nemenyi test, the difference between their outputs was assumed not to be significant.
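The Friedman test over per-fold scores can be sketched with SciPy's `friedmanchisquare` (the fold accuracies below are hypothetical placeholders, not results from the paper):

```python
from scipy.stats import friedmanchisquare

# Hypothetical per-fold accuracies of three methods across 10 CV folds
fold_acc_a = [0.97, 0.98, 0.96, 0.97, 0.98, 0.97, 0.96, 0.98, 0.97, 0.99]
fold_acc_b = [0.95, 0.96, 0.95, 0.94, 0.96, 0.95, 0.95, 0.96, 0.94, 0.95]
fold_acc_c = [0.96, 0.97, 0.95, 0.96, 0.97, 0.96, 0.95, 0.97, 0.96, 0.96]

# Friedman test ranks the methods within each fold and tests whether
# the mean ranks differ; a small p-value motivates the post-hoc Nemenyi test
stat, p = friedmanchisquare(fold_acc_a, fold_acc_b, fold_acc_c)
```

SciPy does not ship a Nemenyi test; the post-hoc critical-difference comparison would be done separately (e.g., with a dedicated statistics package) once the Friedman test rejects the null hypothesis.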

#### **4. Implementation and Results**

#### *4.1. Experimental Settings*

The machine learning models were trained on the features acquired from the dataset using Python's Scikit-learn library. All experiments were performed on a laptop computer with a 64-bit Windows 10 OS and an Intel® Core™ i5-8265U CPU @ 1.60 GHz (up to 1.80 GHz) with 8 GB RAM (Intel, Santa Clara, CA, USA).

#### *4.2. Results of Machine Learning Methods*

The results from using classical machine learning models are summarized in Table 2, while their confusion matrices are summarized in Figure 4. The best results were obtained by the ExtraTrees (ET) model, achieving an accuracy of 98.8%. As can be seen from Table 2 and Figure 4, the ET model generated very good results for the precision, recall, F1-score and accuracy of the two classes. This agrees with the low FPR and FNR of 0.8% and 1.4%, respectively, obtained by the ET model.
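For clarity, the FPR and FNR can be read directly off a confusion matrix; the counts below are hypothetical, chosen only so the rates match those quoted for the ET model (0.8% and 1.4%), not the paper's actual matrix.

```python
# Reading FPR and FNR off a confusion matrix. The counts are hypothetical,
# chosen only so the rates match the ET model's reported 0.8% FPR and
# 1.4% FNR; they are not the paper's actual matrix.
import numpy as np

# Rows: actual (benign, malware); columns: predicted (benign, malware).
cm = np.array([[2480,   20],    # 20 benign files flagged as malware (FP)
               [  35, 2465]])   # 35 malware files missed (FN)
TN, FP = cm[0]
FN, TP = cm[1]

fpr = FP / (FP + TN)   # fraction of benign files misclassified as malware
fnr = FN / (FN + TP)   # fraction of malware files missed
print(f"FPR = {fpr:.1%}, FNR = {fnr:.1%}")
```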

**Table 2.** Summary of results of machine learning models. Acc–Accuracy. Prec–Precision. Rec–Recall. Spec–Specificity. FPR–False Positive Rate. FNR–False Negative Rate. AUC–Area Under Curve. MCC–Matthews Correlation Coefficient. SVM–Support Vector Machine. RBF–Radial Basis Function. LDA–Linear Discriminant Analysis. SGDC–Stochastic Gradient Descent Classifier.


**Figure 4.** Confusion matrices of machine learning models.

#### *4.3. Results of Neural Network Classifiers*

To select the base classifiers, we first performed an ablation study to find the best representatives of the Dense-1, Dense-2 and 1D-CNN models in terms of their performance with respect to different values of hyperparameters. The results are presented in Tables 3–5. Note that in all cases, we used the sparse categorical cross-entropy loss function and the Adam optimizer. For the training of the Dense-1 and Dense-2 models, we used 100 epochs, while for the training of the 1D-CNN models, we used 20 epochs. In all cases, 80% of the data were used for training and 20% for testing.
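This kind of architecture sweep can be sketched with scikit-learn's `MLPClassifier` standing in for the paper's Keras Dense models; the dataset, the swept layer sizes, and the resulting scores below are illustrative only.

```python
# Sketch of the hidden-layer-size ablation (cf. Tables 3-5), with sklearn's
# MLPClassifier standing in for the paper's Keras Dense models. The dataset
# is synthetic, so the scores are illustrative only.
from sklearn.datasets import make_classification
from sklearn.metrics import cohen_kappa_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=54, random_state=1)
# 80% of the data for training, 20% for testing, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

results = {}
for hidden in [(25,), (35,), (40, 40), (40, 50)]:   # candidate architectures
    clf = MLPClassifier(hidden_layer_sizes=hidden, max_iter=300,
                        random_state=1).fit(X_tr, y_tr)
    y_hat = clf.predict(X_te)
    results[hidden] = (f1_score(y_te, y_hat), cohen_kappa_score(y_te, y_hat))

best = max(results, key=lambda h: results[h])   # select by (F1, kappa)
```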

**Table 3.** Malware detection performance with different number of neurons in hidden layer of Dense-1 model. Best models are shown in bold.



**Table 4.** Malware detection performance with different number of neurons in hidden layers of Dense-2 model. Best models are shown in bold.

**Table 5.** Malware detection performance with different number of filters in convolutional layers and neurons in the final fully connected layer of 1D-CNN model. Best models are shown in bold.




#### *4.4. Results of Ensemble Learning*

Based on the ablation study, we selected one Dense-1 model (with 35 neurons), two Dense-2 models (with (40,40) and (40,50) neurons) and two 1D-CNN models (with (25,25) and (30,35) neurons) as base learners, based on their kappa and F1-score performance. We performed classification with several different meta-learner algorithms. For KNN, the number of nearest neighbors was set to 3. For linear SVM, C was set to 0.025. For RBF SVM, the C parameter (which controls regularization by penalizing misclassifications to reduce overfitting) was set to 1, and gamma was set to 2. For DT and RF, the maximum depth was set to 5. In all cases, 10-fold cross-validation was used, where each cross-validation fold was formed by randomly selecting 80% of the samples for training, with the remaining 20% used for testing. The results are presented in Table 6.
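The two-stage stacking scheme can be sketched with scikit-learn; the paper's base learners are Keras dense and 1D-CNN networks, so the `MLPClassifier` instances below are stand-ins, with `ExtraTreesClassifier` as the meta-learner, as in the best-performing configuration.

```python
# Minimal sketch of the two-stage stacking architecture: neural-network base
# learners feeding a meta-learner. MLPClassifier stands in for the paper's
# Keras dense/CNN models; ExtraTrees is the meta-learner of the best model.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=54, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)

base_learners = [
    ("dense1",  MLPClassifier((35,),    max_iter=300, random_state=7)),
    ("dense2a", MLPClassifier((40, 40), max_iter=300, random_state=7)),
    ("dense2b", MLPClassifier((40, 50), max_iter=300, random_state=7)),
]
ensemble = StackingClassifier(
    estimators=base_learners,
    final_estimator=ExtraTreesClassifier(random_state=7),
    cv=5,  # the meta-learner trains on out-of-fold base predictions
)
acc = ensemble.fit(X_tr, y_tr).score(X_te, y_te)
```

The `cv=5` argument makes the meta-learner train on out-of-fold predictions of the base learners, which avoids leaking the base learners' training fit into the second stage.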

**Table 6.** Ensemble learning results with different meta-learners: mean values from 10-fold cross-validation. Best values are shown in bold.



The average performance results are visualized in Figures 5–7, whereas the results from the 10-fold cross-validation are shown as boxplots in Figures 8–10. The results demonstrate that the ExtraTrees meta-learner achieved the highest performance in terms of accuracy, AUC and F1-score measures.

**Figure 5.** Malware detection performance of deep learning ensemble model by final stage meta-learner classifier: accuracy.

**Figure 6.** Malware detection performance of deep learning ensemble model by final stage meta-learner classifier: F1-score.

**Figure 7.** Malware detection performance of deep learning ensemble model by final stage meta-learner classifier: AUC.

**Figure 8.** Malware detection performance of deep learning ensemble model by final stage meta-learner classifier: accuracy.

**Figure 9.** Malware detection performance of deep learning ensemble model by final stage meta-learner classifier: area under curve.

**Figure 10.** Malware detection performance of deep learning ensemble model by final stage meta-learner classifier: F1-score.

Finally, we present the confusion matrix of the best ensemble model (with the ET classifier as the meta-learner) in Figure 11.

#### *4.5. Statistical Analysis*

To perform the statistical analysis of the experimental results, we adopted the Friedman test and the Nemenyi test. The results are presented as critical difference (CD) diagrams in Figures 12–14. If the difference between the mean ranks of the meta-learners is smaller than the CD, then it is not statistically significant. The results of the Nemenyi test again show that the ExtraTrees meta-learner allows us to achieve the best performance; however, the performance of AdaBoost and Decision Tree meta-learners is not significantly different.

**Figure 13.** Comparison of mean ranks of meta-learners based on their AUC performance: results of Nemenyi test.

**Figure 14.** Comparison of mean ranks of meta-learners based on their F1-score performance: results of Nemenyi test.

#### *4.6. Ablation Study of the Ensemble*

We also conducted an ablation study to evaluate the contribution of the individual parts of the proposed ensemble classification-based framework for malware recognition. We compared and analyzed the impact of the ensemble size on the classification results. We analyzed the following ensembles, each consisting of a smaller number (four) of neural network models:


The results are summarized and compared in Table 7. In all cases, the ExtraTrees Classifier was used as a meta-learner. The Full Model here corresponds to the five-model ensemble with PCA scaling of data. The results show that the best performance was achieved by the full five-model ensemble with data scaling using PCA and ExtraTrees as the meta-learner.
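One reading of the "PCA scaling" step is a standardize-then-project preprocessing pipeline in front of the ExtraTrees meta-learner; the sketch below assumes that interpretation, with a synthetic dataset and an assumed component count.

```python
# One interpretation of the "PCA scaling" step in the full model: standardize
# the features, project with PCA, then classify with the ExtraTrees
# meta-learner. Data are synthetic and the component count is an assumption.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=54, random_state=3)

model = Pipeline([
    ("scale", StandardScaler()),
    ("pca",   PCA(n_components=20)),        # assumed dimensionality
    ("clf",   ExtraTreesClassifier(random_state=3)),
])
acc = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
```

Wrapping the scaler and PCA in a `Pipeline` ensures they are re-fitted on each training fold, so the cross-validated accuracy is not contaminated by test-fold statistics.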


**Table 7.** Comparison of ensemble models. Best values are shown in bold.

#### *4.7. Comparison with Related Work*

Finally, we compare our results with some of the related work on classifying benign and malware files in Table 8 and explain them in more detail below. Note that the compared methods work on different malware datasets. Alzaylaee et al. [54] explored 2-, 3- and 4-layer fully connected neural networks on a dataset of 31,125 Android apps with 420 static and dynamic features, while comparing the results to machine learning classifiers. The best results were achieved with a three-layer network with 200 neurons in each layer. Bakour and Ünver [55] suggested a visualization-based approach that converted software characteristics into grayscale images and then applied local and global image features as voters in an ensemble voting classifier. Cai et al. [56] used information gain for feature selection and weight mapping functions derived by machine learning methods, which were optimized by the differential evolution algorithm. Chen et al. [57] used an attention network architecture based on CNN to classify apps based on their Application Programming Interface (API) call sequences. Fang et al. [58] used the DeepDetectNet deep learning model for static PE malware detection, together with an adversarial generation network, RLAttackNet, based on reinforcement learning, which was trained to bypass DeepDetectNet. The generated adversarial samples were used to retrain DeepDetectNet, which improved its malware recognition accuracy.

Imtiaz et al. [59] proposed a deep multi-layer fully connected Artificial Neural Network (ANN) with an input layer, a few hidden layers and an output layer. The approach was validated on the CICInvesAndMal2019 dataset of Android malware. Jeon and Moon [60] proposed a convolutional recurrent neural network (CRNN), which uses the opcode sequences of software as input. The front-end CNN performs opcode compression, and the back-end dynamic recurrent neural network (DRNN) detects malware from the compressed sequence.

Jha et al. [61] proposed using RNN with feature vectors obtained by skip-grams of the Word2Vec embedding model for malware recognition. Namavar Jahromi et al. [62] proposed a modified Two-hidden-layered Extreme Learning Machine (TELM), which was tested on Ransomware, Windows, Internet of Things (IoT) and other malware datasets.

Narayanan and Davuluru [63] suggested using CNNs and Long Short-Term Memory (LSTM) networks for feature extraction and SVM or LR for the classification of malware based on their machine language opcodes. The approach was validated on Microsoft's Malware Classification Challenge (BIG 2015) dataset with nine malware classes. Song et al. [64] proposed a JavaScript malware detection based on the Bidirectional LSTM neural network. Wang et al. [65] suggested CrowdNet, a radial basis function network, as a malware predictor. Yen and Sun [66] extracted instruction code and applied hashing to extract features. Then, the features were transformed into images and used to train a CNN.

**Table 8.** Comparison with other known deep learning approaches for malware recognition. n/a—data were not provided.


#### **5. Conclusions**

There is an increasing demand for smart methods that detect new malware variants, because the existing methods are time-consuming and prone to errors. This paper analyzed various machine learning algorithms and neural network models, which are smart approaches that can be used for malware detection. We proposed an ensemble learning-based architecture with neural networks as base learners and explored 14 machine learning algorithms as meta-learners. As baseline models, we used 13 machine learning algorithms for comparison. We conducted our experiments on a dataset that included malware and benign Windows Portable Executable (PE) files.

In this paper, we analyzed and experimentally validated the use of ensemble learning to combine the malware prediction results given by different machine learning and deep learning models. The aim of this practice is to improve the recognition of Windows PE malware. With ensemble methods, it is not required to select any specific machine learning model. Instead, the prediction capability of each combination of the machine learning models is aggregated to create a learning procedure that achieves the best malware detection performance. We explored our proposed ensemble classification framework with lightweight fully connected and convolutional neural network architectures, and combined deep learning and machine learning techniques to learn effective and efficient malware detection models. We conducted extensive experiments on various lightweight deep learning architectures and machine learning models within the framework of ensemble learning under the same conditions for a fair comparison.

The results achieved show that the malware detection ability of ensemble stacking exceeds the ability of other machine-learning methods, including neural networks. We showed that the ensemble learning framework based on lightweight deep models could successfully tackle the problem of malware detection. The results obtained indicate that ensemble learning methods can be implemented and used as intelligent techniques for the identification of malware. The classification system with the Extra Trees algorithm as a meta-learner and an ensemble of dense ANN and 1-D CNN models obtained the best accuracy value for the classification procedure, outperforming other machine learning classification methods. Our proposed framework can lead to highly accurate malware detection models that are adapted for real-world Windows PE malware.

The application of explainable artificial intelligence (XAI) [67] strategies to interpret the outcomes of deep learning models for malware detection will be carried out in the future to provide useful information for malware analysis researchers. We also intend to explore further ensemble learning architectures and run additional tests with larger malware databases. In future work, we will strive to improve the classification ability and accuracy of the ensemble learning model by refining the model architecture and validating it on multiple malware datasets.

**Author Contributions:** Conceptualization and methodology, R.D. and A.V.; software, R.D.; validation, R.D., A.V. and J.T.; formal analysis, R.D. and J.T.; investigation, R.D., A.V. and Š.G.; resources, A.V. and J.T.; writing—original draft preparation, R.D., A.V., J.T. and Š.G.; writing—review and editing, R.D. and A.V.; visualization, R.D. and J.T.; supervision, A.V. All authors have read and agreed to the published version of the manuscript.

**Funding:** This paper was supported in part by European Union's Horizon 2020 research and innovation programme under Grant Agreement No. 830892, project "Strategic programs for advanced research and technology in Europe" (SPARTA).

**Data Availability Statement:** The dataset is available from https://github.com/urwithajit9/ClaMP.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**


### *Article* **Deep Learning Techniques for Android Botnet Detection**

**Suleiman Y. Yerima 1,\*, Mohammed K. Alzaylaee <sup>2</sup> , Annette Shajan <sup>3</sup> and Vinod P <sup>4</sup>**


**Abstract:** Android is increasingly being targeted by malware since it has become the most popular mobile operating system worldwide. Evasive malware families, such as Chamois, designed to turn Android devices into bots that form part of a larger botnet are becoming prevalent. This calls for more effective methods for detection of Android botnets. Recently, deep learning has gained attention as a machine learning based approach to enhance Android botnet detection. However, studies that extensively investigate the efficacy of various deep learning models for Android botnet detection are currently lacking. Hence, in this paper we present a comparative study of deep learning techniques for Android botnet detection using 6802 Android applications consisting of 1929 botnet applications from the ISCX botnet dataset. We evaluate the performance of several deep learning techniques including: CNN, DNN, LSTM, GRU, CNN-LSTM, and CNN-GRU models using 342 static features derived from the applications. In our experiments, the deep learning models achieved state-of-the-art results based on the ISCX botnet dataset and also outperformed the classical machine learning classifiers.

**Keywords:** botnet detection; deep learning; Android botnets; convolutional neural networks; dense neural networks; recurrent neural networks; long short-term memory; gated recurrent unit; CNN-LSTM; CNN-GRU

#### **1. Introduction**

The increase in Android's popularity worldwide has made it a continuous target for malware authors. The volume of malware targeting Android has continued to grow in the last few years [1,2]. Android has been attacked by numerous malware families aimed at infecting mobile devices and turning them into bots. These bots become parts of larger botnets that are usually under the control of a malicious user or group of users known as botmasters. The Android botnets may be used to launch various types of attacks such as distributed denial of service (DDoS) attacks, phishing, click fraud, theft of credit card details or other credentials, generation and distribution of spam, etc. Nowadays, malicious Android botnets have become a serious threat. Additionally, their increasing use of sophisticated evasive techniques such as self-protection or multi-staged payload execution [3], calls for more effective approaches to detect them.

The Chamois malware family [3–5], which was discovered on Google Play in August 2016, is one example of the emerging sophisticated Android botnet threats. By March 2018, Chamois had infected over 20 million devices, which were commandeered into a botnet that received instructions from a remote command and control server [5]. The botnet was used to serve malicious advertisements and to direct victims to premium Short Message Service (SMS) scams. The early version of Chamois disguised itself as benign apps that tricked users into downloading it onto their devices, and this was detected and almost completely eradicated by the Android security team. Later versions of Chamois appeared which were distributed by tricking developers and device manufacturers into incorporating the botnet code directly into their apps. Chamois was sold to developers as a legitimate software development kit, and to the device manufacturers as a mobile payment solution [5].

**Citation:** Yerima, S.Y.; Alzaylaee, M.K.; Shajan, A.; P, V. Deep Learning Techniques for Android Botnet Detection. *Electronics* **2021**, *10*, 519. https://doi.org/10.3390/electronics10040519

Academic Editors: Khaled Elleithy and Ana Rosa Cavalli

Received: 31 December 2020; Accepted: 18 February 2021; Published: 23 February 2021

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The emergence of evasive and technically complex families like Chamois has driven interest in adopting machine learning based techniques as a means to improve existing detection systems. In the past few years, several works have investigated traditional machine learning techniques such as Support Vector Machines (SVM), Random Forest, Decision Trees, etc., for Android botnet detection. Some of the more recent machine learning based Android botnet detection work, such as ref. [6] and ref. [7] have focused on deep learning. Nevertheless, empirical studies that extensively investigate various deep learning techniques to provide insight into their relative performance for Android botnet detection are currently lacking. Hence, in this paper, we present a comparative analysis of deep learning models for Android botnet detection using the publicly available ISCX botnet dataset. Our approach is based on classification of unknown applications into 'clean' or 'botnet' using 342 static features extracted from the apps. We evaluate the performance of several deep learning models on 6802 apps consisting of 1929 botnet apps from the ISCX botnet dataset. The models investigated include Convolutional Neural Networks (CNN), Dense Neural Networks (DNN), Gated Recurrent Units (GRU), Long Short-Term Memory (LSTM), as well as more complex networks like CNN-LSTM and CNN-GRU.

The rest of the paper is organized as follows: Section 2 contains related works. Section 3 gives an overview of the overall system for deep learning-based Android botnet detection, while Section 4 provides brief background discussions of the deep learning models that were built for this study. Section 5 discusses the methodology and experimental approach, while Section 6 presents the results and discussion of results. Finally, the conclusions and future work are outlined in Section 7.

#### **2. Related Work**

Kadir et al. in their paper [8], studied several families of Android botnets aiming to gain a better understanding of the botnets and their communication characteristics. They presented a deep analysis of the Command and Control channels and built-in URLs of the Android botnets. They provided insights into each malicious infrastructure underlying the families, and uncovered the relationships between the botnet families by using a combination of static and dynamic analysis with visualization. From their work, the ISCX Android botnet dataset consisting of 1929 samples from 14 Android botnet families emerged. Since then, several works on Android botnet detection have been based on the dataset which is available from ref. [9].

Anwar et al. [10] proposed a mobile botnet detection method based on static features. They combined permissions, MD5 signatures, broadcast receivers, and background services to obtain a comprehensive set of features. They then utilized these features to implement machine learning based classifiers to detect mobile botnets. Having performed experiments using 1400 botnet applications of the ISCX dataset, combined with an extra 1400 benign applications, they recorded an accuracy of 95.1%, a recall of 0.827, and a precision of 0.97 as their best results.

Android Botnet Identification System (ABIS) was proposed in [11] to detect Android botnets. The method is based on static and dynamic features consisting of API calls, permissions, and network traffic. ABIS was evaluated with several machine learning techniques. In the end, Random Forest was found to perform better than the other algorithms by achieving 0.972 precision and 0.96 recall.

In ref. [12], machine learning was used to detect Android botnets using permissions and their protection levels as features. Initially, 138 features were utilized and then increased to 145 after protection levels were added as novel features. In total, four machine learning models (i.e., Random Forest, multilayer perceptron (MLP), Decision Trees, and Naive Bayes) were evaluated on 3270 applications containing 1635 benign and 1635 botnets from the ISCX dataset. Random Forest was found to have the best results, yielding 97.3% accuracy, 0.987 recall, and 0.985 precision. The authors of [13] also utilized only the 'requested permissions' as features and applied Information Gain to reduce the features and select the most significant requested permissions. They evaluated their approach using Decision Trees, Naive Bayes, and Random Forest. In their experiments, Random Forest performed best, with an accuracy of 94.6% and a false positive rate of 0.099%.

Karim et al. [14] proposed DeDroid, a static analysis approach for investigating properties that are specific to botnets and can be used in the detection of mobile botnets. In their approach, 'critical features' were first identified by observing the coding behavior of a few known malware binaries that possess Command and Control features. These 'critical features' were then compared with the features of malicious applications from the Drebin dataset [15]. The comparison with 'critical features' suggested that 35% of the malicious applications in the Drebin dataset could be classed as botnets, and according to their study, a closer examination confirmed 90% of these apps as botnets.

Jadhav et al. [16] presented a cloud-based Android botnet detection system that leverages dynamic analysis by using a virtual environment with cluster analysis. The toolchain for the dynamic analysis process within the botnet detection system is composed of strace, netflow, logcat, sysdump, and tcpdump. However, the paper provided no experimental results to evaluate the effectiveness of the proposed cloud-based solution. Moreover, the virtual environment can easily be evaded by botnets using different fingerprinting techniques. In addition, being a dynamic-analysis-based approach, the system's effectiveness could be degraded by the lack of complete code coverage [17,18].

In ref. [19], a method was proposed by Bernardeschi et al. to identify Android botnets through model checking. Model checking is an automated technique for verifying finite state systems, achieved by checking whether a structure representing a system satisfies a temporal logic formula describing its expected behavior. In particular, static analysis is used to derive a set of finite state automata from the Java byte code that represents approximate information about the run-time behavior of an app. Afterwards, the botnet's malicious behavior is formulated using temporal logic formulae [20]; then, by adopting a model checker, it can be automatically checked whether the code is malicious and identified where the botnet code is located within the application. These properties are checked using the CAAL (Concurrency Workbench, Aalborg Edition) [21] formal verification environment. The authors evaluated their approach on 96 samples from the Rootsmart botnet family and 28 samples from the Tigerbot botnet family, in addition to 1000 clean samples. The results obtained on the 1124 app samples showed perfect (100%) accuracy, precision, and recall.

Alothman and Rattadilok [22] proposed a source code mining approach based on reverse engineering and text mining techniques to identify Android botnet applications. Dex2Jar was used to reverse engineer the Android apps to Java source code. Natural Language Processing techniques were applied to the obtained Java source code. They also evaluated a 'source code metrics (SCM)' approach to classifying the apps into 'botnet' or 'clean'. In the SCM approach, statistical measures, such as the total number of code lines, the code-to-comment ratio, etc., were extracted from the source code, and the metrics were used as features for training machine learning classifiers. The Java source code was extracted from 9 apps from 9 ISCX botnet families, as well as 12 normal apps. The TextToWordVector filter within WEKA (Waikato Environment for Knowledge Analysis), together with TF-IDF, was then applied to the code. They also applied WEKA's StringToWordVector filter with TF-IDF while varying the 'words to keep' parameter. The SubSetEval feature selection method was used to reduce the features. The features were applied to Naive Bayes, KNN, J48, SVM, and Random Forest algorithms, where KNN obtained the best performance.

In ref. [23], a real-time signature-based detection system is proposed to combat SMS botnets, by first applying pattern-matching detection approaches for incoming and outgoing SMS text messages. In the second step, rule-based techniques are used to label unknown SMS messages as suspicious or normal. Their method was evaluated with over 12,000 test messages, where all 747 malicious SMS messages were detected in the dataset. However, the system produced some false positives where 349 SMS messages were flagged as suspicious. In ref. [24], a botnet detection technique called 'Logdog' is proposed for mobile devices using log analysis. The approach relies on analyzing the logs of mobile devices to find evidence of botnet activities. Logdog writes logcat messages to a text file in the background while the Android user continues to use their device. The system targets HTTP botnets looking for events or series of events that indicate botnet activities and was tested manually on a botnet and a normal app.

In ref. [6], Android botnet detection based on CNN and using permissions as features was proposed. In the proposed method, apps are represented as images that are constructed based on the co-occurrence of permissions used within the applications. The images were then used to train a CNN-based binary classifier. The binary classifier was evaluated using 5450 apps containing 1800 botnet apps from the ISCX dataset. They obtained an accuracy of 97.2%, with a recall of 0.96, precision of 0.955, and f-measure of 0.957. Similarly, ref. [7] proposes an Android botnet detection approach based on CNN, where not only permissions were used as features but also API calls, Commands, Intents, and Extra Files. Unlike in ref. [6], 1D CNN was used and the model was evaluated with the 1929 ISCX botnet apps and 4873 benign apps resulting in 98.9% accuracy, 0.978 recall, 0.983 precision, and 0.981 F1-score.

Different from the aforementioned earlier works, this paper aims to investigate the performance of several deep learning techniques to gain insight into their effectiveness in detecting Android botnets based on the extraction of 342 static features from the applications. To this end, we implemented CNN, DNN, LSTM, GRU, CNN-LSTM, and CNN-GRU models and evaluated the models using 1929 ISCX botnet apps and 4873 benign apps. The deep learning models developed in the study are discussed in Section 4 and the results of the experiments with the models are presented in Section 6.

#### **3. Deep Learning-Based Android Botnet Detection System**

At a fundamental level, our botnet detection system is designed to distinguish between clean apps and botnet apps. As a result, it may sometimes fail to correctly classify an unknown app by mistakenly identifying a benign app as botnet or vice-versa. The various accuracy metrics used in the experiments presented in Section 6 will enable us to capture the extent to which a given deep learning model used as a classifier can be relied upon to correctly predict which category an unknown app should belong to. The classification system is implemented by extracting static features from thousands of applications consisting of both botnet and clean examples. A bespoke tool that we developed in Python for automated reverse engineering of Android Package files (APKs) was utilized in the process. Using the tool, we extracted a total of 342 features from 5 different categories shown in Table 1.


**Table 1.** The five types of features used in developing the deep learning models.

The five feature types include: (1) API calls; (2) commands; (3) permissions; (4) Intents; (5) extra (binary or executable) files. Most of the features were from the 'API calls' and 'permissions' categories, as shown in Table 1. A selection of the features is shown in Table 2. These features are represented as vectors of binary numbers, with each feature in the vector represented by a '1' or '0'. Each feature vector (corresponding to one application) is labelled with its class. The feature vectors are loaded into the deep learning model during the training phase. After training, the model can then be used to predict the class (clean or botnet) of an unknown application using its extracted feature vector. Figure 1 gives a high-level overview of the overall botnet detection system.

**Table 2.** Examples of features extracted for the deep learning models.

| Feature Name | Type |
|---|---|
| TelephonyManager.\*getDeviceId | API |
| TelephonyManager.\*getSubscriberId | API |
| abortBroadcast | API |
| SEND\_SMS | Permission |
| Ljava.net.InetSocketAddress | API |
| Android.intent.action.BOOT\_COMPLETED | Intent |
| DELETE\_PACKAGES | Permission |
| SMS\_RECIVED | Permission |
| READ\_SMS | Permission |
| Chown | Command |
| Chmod | Command |
| Mount | Command |
| .apk | Extra File |
| .zip | Extra File |
| .dex | Extra File |
| .jar | Extra File |
| CAMERA | Permission |
| io.File.\*delete( | API |
| ACCESS\_FINE\_LOCATION | Permission |
| INSTALL\_PACKAGES | Permission |
| android.intent.action.BATTERY\_LOW | Intent |

**Figure 1.** Overview of the deep learning-based detection system for Android botnets.

#### **4. Deep Learning Techniques Applied to Android Botnet Detection**

#### *4.1. Convolutional Neural Networks*

A CNN is a feedforward neural network whereby the information moves only in the forward direction from the input nodes, through the hidden nodes, to the output nodes, with no loops or cycles. Such (feedforward) networks are primarily used for pattern recognition. A CNN generally works well for identifying simple patterns in data, which will then be used to form more complex patterns in the higher or deeper layers. CNNs typically consist of convolutional layers and pooling layers. The role of the convolutional layer is to detect local conjunctions of features from the previous layer, while the role of the pooling layer is to merge semantically similar features into one [25]. CNNs combine concepts such as shared weights, local receptive fields and spatial subsampling [26]. They take advantage of many parallel and cascaded convolutional filters to solve high-dimensional non-convex problems such as regression, image classification, semantic segmentation, object detection, etc. Due to weight sharing in each layer and by processing limited dimensions, a CNN requires fewer parameters than a traditional neural network and is much easier to train.

Datasets that possess a one-dimensional structure can be processed using a onedimensional convolutional neural network (1D CNN). A 1D CNN is quite effective when you expect to derive interesting features from shorter (fixed-length) segments of the overall feature set, and where the location of the feature within the segment is not important. The use of 1D CNN can be commonly found in Natural Language Processing (NLP) applications. Similarly, 1D CNN is applicable in problems where vectorized data are used to represent the characteristics of the items whose state or category is being predicted (e.g., an Android application). The 1D CNN could be used to extract potentially more discriminative feature representations that describe any existing patterns or relationships within segments of the vectors characterizing each entity in the dataset. These new features are then fed into a classifier (e.g., LSTM, GRU or a fully connected layer) which will in turn process the derived features to produce a set of outputs that will contribute towards a final classification decision. Hence, CNNs can be employed as feature extraction layers for a given classifier which then eliminates the need to apply separate feature ranking and selection outside of the deep learning model.

Figure 2 depicts a 1D CNN model made up of two convolutional layers and two max pooling layers. The output of the last pooling layer is flattened and connected to a dense (fully connected) layer of N units. The N-unit dense layer is then connected to a final output layer containing a single neuron with a sigmoid activation function, given by $S(x) = \frac{1}{1+e^{-x}}$.

**Figure 2.** 1D CNN model with 2 convolutional and max pooling layers feeding a dense (fully connected) layer. The model is designed for botnet detection by classifying Android applications into 'normal' or 'botnet'.

The output layer performs the final classification into one of two classes, i.e., 'botnet' or 'normal'. The convolutional layers utilize Rectified Linear Units (ReLU) with activation function given by: *f*(*x*) = *max*(0, *x*). ReLU helps to mitigate vanishing and exploding gradient issues [27].
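As a concrete illustration, the two activation functions just described can be written in a few lines of plain Python (the function names are ours, chosen for illustration):

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid S(x) = 1 / (1 + e^-x), used in the single-neuron output layer."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x: float) -> float:
    """Rectified Linear Unit f(x) = max(0, x), used in the convolutional layers."""
    return max(0.0, x)
```

The sigmoid squashes any real input into (0, 1), which is why it suits a binary 'botnet'/'normal' decision, while ReLU passes positive values through unchanged and zeroes out negatives.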

#### *4.2. Long-Short Term Memory*

LSTM [28,29] is a type of recurrent neural network (RNN) which, unlike feedforward networks, utilizes feedback and is able to 'memorize' parts of the input and use them in making predictions. RNNs are designed to handle sequential data and thus have found popular application in areas such as speech recognition and machine translation. Different from traditional artificial neural networks that fully connect all nodes, or CNN that explore nodes from local to global layer by layer, RNNs use state neurons to explore the relationship in context. Traditional RNNs have a known problem of vanishing gradients which hinders their ability to have long term memory and thus can only make predictions based on the most recent information in the sequence. LSTM solves the vanishing gradient problem and is therefore able to process longer sequences (long term memory). LSTM is a recurrent neural network that can understand contextual information from a sequence of features. It has the ability to add or remove information from the hidden state vector with the aid of a gate function, thereby retaining important information in the hidden layer vectors.

As shown in Figure 3a, LSTM consists of three gate functions: the forget gate, the input gate, and the output gate. The forget gate controls how much of the information in *Ct*−<sup>1</sup> is retained when computing *C<sup>t</sup>* , and it (the forget vector) can be expressed as:

$$f\_t = \sigma\left(U^f \mathbf{x}\_t + W^f h\_{t-1} + b\_f\right) \tag{1}$$

where *U<sup>f</sup>* , *W <sup>f</sup>* , and *b<sup>f</sup>* constitute the parameters of the forget gate and *x<sup>t</sup>* is the input vector in step *t*, while *ht*−<sup>1</sup> is the hidden state vector in the previous step *t* − 1. The input gate determines how much information of *x<sup>t</sup>* is added to *C<sup>t</sup>* and can be expressed as:

$$i\_t = \sigma\left(U^i \mathbf{x}\_t + W^i h\_{t-1} + b\_i\right) \tag{2}$$

where *U<sup>i</sup>* , *W<sup>i</sup>* , and *b<sup>i</sup>* are the parameters of the input gate and hence *C<sup>t</sup>* can be calculated by relying on the forget gate vector *f<sup>t</sup>* as well as the input gate vector *i<sup>t</sup>* as follows:

$$C\_t = f\_t \* C\_{t-1} + i\_t \* \tilde{C}\_t \tag{3}$$

where $\tilde{C}\_t = \tanh\left(U^c \mathbf{x}\_t + W^c h\_{t-1} + b\_C\right)$ denotes the candidate information represented in the hidden layer vector. Note that ∗ denotes the Hadamard (element-wise) product. The output gate controls the output in *C<sup>t</sup>* , and we have:

$$o\_t = \sigma\left(U^o \mathbf{x}\_t + W^o h\_{t-1} + b\_o\right), \quad h\_t = o\_t \* \tanh(C\_t) \tag{4}$$

where *U<sup>o</sup>* , *W<sup>o</sup>* , and *b<sup>o</sup>* are the parameters of the output gate and *C<sup>t</sup>* is the internal state in step *t*.
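To make Equations (1)–(4) concrete, the following is a minimal single-unit (scalar) sketch of one LSTM step in plain Python. The function name and the parameter values are ours and purely illustrative; they are not learned weights from the paper's models.

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step for a single unit, following Equations (1)-(4).
    p holds the scalar parameters (U, W, b) of each gate."""
    f_t = sigmoid(p["Uf"] * x_t + p["Wf"] * h_prev + p["bf"])       # forget gate, Eq. (1)
    i_t = sigmoid(p["Ui"] * x_t + p["Wi"] * h_prev + p["bi"])       # input gate, Eq. (2)
    c_hat = math.tanh(p["Uc"] * x_t + p["Wc"] * h_prev + p["bc"])   # candidate information
    c_t = f_t * c_prev + i_t * c_hat                                # new cell state, Eq. (3)
    o_t = sigmoid(p["Uo"] * x_t + p["Wo"] * h_prev + p["bo"])       # output gate, Eq. (4)
    h_t = o_t * math.tanh(c_t)                                      # new hidden state
    return h_t, c_t

# Illustrative parameter values (not learned weights)
params = {name: 0.5 for name in
          ["Uf", "Wf", "bf", "Ui", "Wi", "bi", "Uc", "Wc", "bc", "Uo", "Wo", "bo"]}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.0, p=params)
```

In a real layer each of these scalars is a matrix or vector, but the gating logic is exactly the same.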

**Figure 3.** Recurrent neural networks. (**a**) LSTM; (**b**) GRU.

#### *4.3. Gated Recurrent Units*

A GRU [30] is also a kind of RNN model and a variant of LSTM. However, unlike LSTM which has three gates, GRU has only two gates, i.e., reset gate and update gate, as shown in Figure 3b. This makes GRU less complicated and therefore faster to train than LSTM. The gates are two vectors that decide which information should be passed to the output. The update gate enables the model to determine how much of the past information needs to be passed along to the future. The update gate *Z<sup>t</sup>* is calculated for step *t* using the formula given by:

$$z\_t = \sigma\left(U^z \mathbf{x}\_t + W^z h\_{t-1}\right) \tag{5}$$

where *U<sup>z</sup>* , *<sup>W</sup><sup>z</sup>* are the parameters (weights) of the update gate and *<sup>h</sup>t*−<sup>1</sup> holds information for the previous *t* − 1 units. The reset gate is used to decide how much of the past information to forget which can be calculated using:

$$r\_t = \sigma\left(U^r \mathbf{x}\_t + W^r h\_{t-1}\right) \tag{6}$$

where *U<sup>r</sup>* , *<sup>W</sup><sup>r</sup>* are the parameters (weights) of the reset gate and *<sup>h</sup>t*−<sup>1</sup> holds information for the previous *t* − 1 units. The current memory content will use the reset gate to store the relevant information from the past as follows:

$$c\_t = \tanh\left(U^c \mathbf{x}\_t + r\_t \* W^c h\_{t-1}\right) \tag{7}$$

where *U<sup>c</sup>* , *<sup>W</sup><sup>c</sup>* are the parameters (weights). Note that <sup>∗</sup> denotes the Hadamard (elementwise) product. At the last step, the vector *h<sup>t</sup>* which holds the information for the current unit and passes it down the network will be calculated by:

$$h\_t = z\_t \* h\_{t-1} + (1 - z\_t) \* c\_t \tag{8}$$

A GRU network obtains a long spatial (or temporal) sequence with lower computational complexity compared to traditional encoder-decoder architecture. With its gating mechanisms, GRU can overcome the vanishing gradient problem and is therefore capable of processing longer sequences than standard RNN. Both GRU and LSTM can be applied to sequences of spatial features to determine the extent of dependencies or establish context between features that are located several places apart.
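A minimal single-unit (scalar) sketch of one GRU step, following Equations (5)–(8); the function name and parameter values are ours and purely illustrative, not learned weights:

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step for a single unit, following Equations (5)-(8).
    p holds the scalar parameters (U, W) of each gate."""
    z_t = sigmoid(p["Uz"] * x_t + p["Wz"] * h_prev)            # update gate, Eq. (5)
    r_t = sigmoid(p["Ur"] * x_t + p["Wr"] * h_prev)            # reset gate, Eq. (6)
    c_t = math.tanh(p["Uc"] * x_t + r_t * p["Wc"] * h_prev)    # candidate memory, Eq. (7)
    return z_t * h_prev + (1.0 - z_t) * c_t                    # new hidden state, Eq. (8)

# Illustrative parameter values (not learned weights)
params = {name: 0.5 for name in ["Uz", "Wz", "Ur", "Wr", "Uc", "Wc"]}
h_t = gru_step(x_t=1.0, h_prev=0.0, p=params)
```

Note the economy relative to the LSTM: two gates instead of three, and no separate cell state, which is exactly why GRU has fewer parameters and trains faster.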

#### *4.4. Dense Neural Networks*

The DNN model is a regular deeply connected neural network with several layers. In a DNN model, each neuron in a layer receives an input from all the neurons present in the previous layer. The layers are known as the dense layers and constitute the hidden layers of the network. Such neural networks are also known as Multilayer perceptron (MLP). It is composed of an input layer, an output layer that makes a decision or prediction about the input, and an arbitrary number of hidden layers in between. The model is often trained on a set of input–output pairs and learns to model the correlation (or dependencies) between those inputs and outputs. The basic unit (a perceptron) of the model produces a single output based on several real-valued inputs by forming a linear combination using its input weights. The output is typically passed through a non-linear activation function *ϑ*:

$$y = \vartheta\left(\sum\_{i=1}^{n} w\_{i}\mathbf{x}\_{i} + b\right) = \vartheta\left(\mathbf{W}^{T}\mathbf{X} + b\right) \tag{9}$$

where *W* denotes the vector of weights, *X* is the vector of inputs, *b* is the bias and *ϑ* is the non-linear activation function.
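Equation (9) amounts to a weighted sum passed through a non-linearity. A minimal Python sketch, using the sigmoid as the activation (our choice for illustration; any non-linear function could stand in for ϑ):

```python
import math

def perceptron(x, w, b):
    """Single unit of Equation (9): apply a non-linear activation
    (sigmoid here) to the linear combination w.x + b."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-s))

# With zero weights and zero bias the sigmoid of 0 is exactly 0.5
y = perceptron(x=[1.0, 1.0], w=[0.0, 0.0], b=0.0)
```

A dense layer is simply many such units sharing the same input vector, and a DNN stacks several such layers.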

The sigmoid and hyperbolic tangent functions were the non-linear activation functions typically used in the past, due to their ability to map complex relationships within data. However, these two functions do not perform well in networks with many layers because of the vanishing gradient problem. Nowadays, the rectified linear activation function ReL (and its variants) is the preferred function for training dense neural networks. Hence, the neurons in a network employing ReL activation are known as ReLUs (Rectified Linear activated Units). ReL is a piecewise linear function given by: *f*(*x*) = *max*(0, *x*). It outputs the input directly if positive, and zero otherwise. ReL overcomes the vanishing gradient problem and enables models to learn faster and perform better. Hence, it is used as the default activation function when developing the DNN and CNN networks. In our study, we experimented with different numbers of hidden layers for the DNN and different numbers of units per layer, and recorded the performance of each configuration.

#### *4.5. Hybrid Models*

In this paper, we refer to hybrid models as those combining different deep learning techniques to leverage the unique capabilities of each technique. For example, in the CNN-LSTM or CNN-GRU models depicted in Figure 4, CNN layers are used to extract local n-gram features (where n is set by the length of the filters). The CNN's max pooling layer downsamples the output to reduce dimensionality, which also helps to reduce overfitting. The LSTM or GRU layers then capture long-range dependencies that may be present within the features encoded by the CNN layers. The vectors output by the LSTM/GRU layer, carrying the context and dependency information, are then passed to dense layers for further processing before the final classification by the sigmoid-activated output layer consisting of a single unit.

**Figure 4.** Overview of the CNN-LSTM and CNN-GRU hybrid model architecture.

#### **5. Methodology and Experiments**

In this section, we further detail our approach and outline the experiments undertaken to evaluate the deep learning models implemented in this paper. The models were developed in Python using the Keras library with a TensorFlow backend. Other libraries utilized include scikit-learn, Seaborn, Pandas, and NumPy. The experiments were performed on an Ubuntu Linux 16.04 64-bit machine with 8 GB RAM.

#### *5.1. Problem Definition*

Let *A* = {*a*1, *a*2, . . . *am*} be a set of applications, where each *a<sup>i</sup>* is represented by a vector containing the values of *n* features (where *n* = 342). Let *a* = {*f* <sup>1</sup>, *f* <sup>2</sup>, *f* <sup>3</sup>, . . . *fn*, *cl*}, where *cl* ∈ {*botnet, normal*} is the class label assigned to the app. Thus, *A* can be used to train a model to learn the behaviors of botnet and normal apps, respectively. The goal of a trained model is then to classify a given unlabeled app *aunknown* = {*f* <sup>1</sup>, *f* <sup>2</sup>, *f* <sup>3</sup>, . . . *fn*, ?} by assigning a label *cl*, where *cl* ∈ {*botnet, normal*}.
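The representation above can be sketched as a simple data structure. This is a hypothetical illustration only; the names `make_app` and `N_FEATURES` are ours, not from the paper's code:

```python
# Each app is a 342-dimensional feature vector plus a class label;
# an unlabeled app (the '?' in the problem definition) carries None.
N_FEATURES = 342

def make_app(features, label):
    """Bundle a feature vector with its class label ('botnet', 'normal',
    or None for an app awaiting classification)."""
    assert len(features) == N_FEATURES
    assert label in ("botnet", "normal", None)
    return {"features": features, "label": label}

labeled = make_app([0] * N_FEATURES, "botnet")   # training sample
unknown = make_app([1] * N_FEATURES, None)       # app to be classified
```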

#### *5.2. Dataset Used for the Investigation*

As mentioned earlier, the ISCX Android botnet dataset from [9] was utilized for the experiments in this paper. This dataset contains 1929 botnet apps and has been employed in previous works including [6–8,10–13,22]. Table 3 shows the distribution of samples within the 14 different botnet families present in the dataset. To complement the ISCX dataset, we obtained 4873 clean apps from the Google Play store. These apps were cross-checked for maliciousness using VirusTotal (https://www.virustotal.com (accessed on 20 December 2020)). Thus, a total of 6802 apps were used in our experiments.

**Table 3.** Botnet dataset composition.


#### *5.3. Experiments to Evaluate the Deep Learning Techniques on the Android Dataset*

In order to investigate the performance of the deep learning models, we performed several experiments with different configurations of the models to enable us to observe the optimum performance that is possible with each model architecture. The models are designed to exploit the capabilities of the constituent neural network types as discussed in Section 4. The following metrics are used in measuring the performance of the models: accuracy, precision, recall, and F1-score. Given TP as true positives, FP as false positives, FN as false negatives, and TN as true negatives (all with respect to the botnet class), the metrics are defined as follows (taking the botnet class as positive):

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \quad Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
All the results of the experiments are from 10-fold cross validation where the dataset is divided into 10 equal parts with 10% of the dataset held out for testing, while the models are trained from the remaining 90%. This is repeated until all of the 10 parts have been used for testing. The average of all 10 results is then taken to produce the final result. Additionally, during the training of all the deep learning models (for each fold), 10% of the training set was used for validation.
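The fold-splitting and the evaluation metrics described above can be sketched in plain Python as follows. This is a simplified, unstratified split for illustration; the function names are ours:

```python
import random

def kfold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k roughly equal folds;
    each fold is held out once for testing while the rest form the training set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1-score, with 'botnet' as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

folds = kfold_indices(6802, k=10)  # the 6802 apps of this study's dataset
```

The final reported figures are the averages of the per-fold metrics over all ten held-out folds.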

#### **6. Results and Discussions**

This section will present the results of investigating CNN-GRU, CNN-LSTM, CNN, and DNN models. Subsequently, a comparative performance evaluation of the models and how they measure against traditional machine learning models will be discussed. Finally, we will examine how these models have performed compared to results reported in previous works on Android botnet detection.

#### *6.1. CNN-GRU Model Results*

Here, we present the results obtained from the CNN-GRU model, where the configurations of both the CNN part and the GRU layer were varied. A summary of the results is presented in Table 4. In the top half of the table, the configuration had 1 convolutional layer and 1 max pooling layer in the CNN part. These models are named CNN-1-layer-GRU-X, where X stands for the number of hidden units in the GRU layer. The convolutional layer receives an input vector of dimension 342 from the input layer and consists of 32 filters, each of size 4. The max pooling layer has its parameter set to 2, which means it reduces the output of the convolutional layer by half. The outputs from the max pooling layer are concatenated into a flat vector before being sent to the GRU layer. As depicted in Figure 4, the output from the CNN-GRU layers is forwarded to 2 dense layers. The first dense layer had 128 units, while the second one had 64 units. The 64-unit layer is finally connected to a sigmoid-activated single-unit output layer where the final classification decision into 'clean' or 'botnet' is made. The model can be summarized in the following sequence:

Input [342] -> Conv [32 filters, Size=4] -> max pooling -> flatten -> GRU[X] -> Dense [128, ReLU] -> Dense[64, ReLU]-> Dense[1, Sigmoid] where X is the number of GRU hidden units taken as 5, 10, 25, and 50, respectively.
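The shapes flowing through this front end can be traced with simple arithmetic. The sketch below assumes 'valid' (no padding, stride 1) convolution, which the paper does not state explicitly, so the intermediate sizes are illustrative:

```python
def conv1d_out_len(n, kernel):
    """Output length of a 'valid' (no padding, stride 1) 1D convolution."""
    return n - kernel + 1

def maxpool_out_len(n, pool):
    """Output length after max pooling with the given pool size."""
    return n // pool

# Trace the 1-layer CNN front end:
# input [342] -> Conv [32 filters, size 4] -> max pooling (2) -> flatten
n = conv1d_out_len(342, 4)    # positions each filter slides over
n = maxpool_out_len(n, 2)     # halved by the pooling layer
flat = n * 32                 # flattened vector fed to the GRU layer
```

Under these assumptions the convolution yields 339 positions per filter, pooling halves that to 169, and flattening across the 32 filters gives a 5408-element vector.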

**Table 4.** Results from the CNN-GRU models of various configurations using the architecture depicted in Figure 4.


The results of the bottom half of Table 4 are from the same CNN-GRU architecture described above, but with the CNN part having 2 convolutional layers and 2 max pooling layers. The model can be summarized in the following sequence:

Input [342] -> Conv [32 filters, Size=4] -> max pooling -> Conv [32 filters, Size=4] -> max pooling -> flatten -> GRU[X]->Dense [128, ReLU] -> Dense[64, ReLU]->Dense[1, Sigmoid] where X is the number of GRU hidden units taken as 5, 10, 25, and 50, respectively.

Note that a dropout = 0.25 is incorporated between each of the layers in the models to reduce overfitting.

From Table 4, we can see that the model with the 1-layer CNN had the highest overall accuracy of 98.9% when the number of GRU hidden units was set to 10 or 50. The corresponding F1-scores were also the highest at 0.980. The recall of the GRU-50 model was 0.980, compared to 0.975 for the GRU-10 model. This means that the GRU-50 model was better at detecting botnet apps than the GRU-10 model in the top half of Table 4. Note that the GRU-5 model from the 1-layer CNN batch (top half), which had 5 hidden units, also performed well, obtaining an overall accuracy of 98.8%, an F1-score of 0.979, and a botnet detection rate (recall) of 97.6%. It had the fewest parameters to train, i.e., 90,422.

From the bottom half of Table 4, the 2-layer CNN model with the best overall accuracy was the one with 50 units in the GRU layer (i.e., CNN-2-layer-GRU-50). It obtained 99.1% accuracy, and the best F1-score of 0.984. The recall (botnet detection rate) was 97.9%, while the precision was 98.8%, the highest of all the CNN-GRU models. From this set of results, we can conclude the following:


#### *6.2. CNN-LSTM Model Results*

This section presents the results of the CNN-LSTM models with different configurations of both the CNN part and the LSTM layer. The results are presented in Table 5. Similar to the results of CNN-GRU in Table 4, the top half is for the models with 1 convolutional layer and 1 max pooling layer in the CNN part, while the bottom half (of Table 5) shows the results of the models having 2 convolutional layers and 2 max pooling layers in the CNN part. The models are named with the convention CNN-1-layer-LSTM-X in the top half, or CNN-2-layer-LSTM-X in the bottom half, where X stands for the number of hidden units in the LSTM layer. As depicted in Figure 4, the output from the CNN-LSTM layers is forwarded to 2 dense layers. The first dense layer had 128 units, while the second one had 64 units. The 64-unit layer is finally connected to a single-unit sigmoid-activated output layer where the final classification decision into 'clean' or 'botnet' is made. The model can be summarized in the following sequence:

Input [342] -> Conv [32 filters, Size=4] -> max pooling -> flatten -> LSTM[X] -> Dense [128, ReLU] -> Dense [64, ReLU]-> Dense[1, Sigmoid]

where X is the number of LSTM hidden units taken as 5, 10, 25, and 50, respectively.

**Table 5.** Results from the CNN-LSTM models of various configurations using the architecture depicted in Figure 4.


For the bottom half of Table 5, the models can be summarized in the following sequence:

Input [342] -> Conv [32 filters, Size=4] -> max pooling -> Conv [32 filters, Size=4] -> max pooling -> flatten -> LSTM[X]->Dense [128, ReLU] -> Dense [64, ReLU]->Dense [1, Sigmoid]

where X is the number of LSTM hidden units taken as 5, 10, 25, and 50, respectively.

Note that a dropout = 0.25 is incorporated between each of the layers in the models to reduce overfitting.

From Table 5 (top half), it can be seen that all the CNN-LSTM models with only 1 layer in the CNN part had an overall accuracy of 99%. The models with 10 and 50 hidden units in the LSTM layer obtained an identical F1-score of 0.983, compared to the ones with 5 and 25 units, which had an F1-score of 0.982. The best recall (or botnet detection rate) of 98.3% was recorded with the LSTM-50 model. However, having more than 1.1 million parameters, the LSTM-50 model takes longer to train than the LSTM-10 model, which has only 226,617 parameters.

From the bottom half of Table 5, the 2-layer CNN model with the best overall accuracy was the one with 25 units in the LSTM layer (i.e., CNN-2-layer-LSTM-25). It obtained 99% accuracy, and the best F1-score of 0.981. The recall (botnet detection rate) was 97.9%, while the precision was 98.4%. From the results of Table 5, we can conclude the following:


#### *6.3. CNN Model Results*

In this section, we discuss the results of the CNN model, which are summarized in Table 6. The CNN model consists of 2 convolutional layers and 2 max pooling layers. The resulting vectors are 'flattened' and fed into a dense layer containing 8 units. The model's sequence can be summarized as follows:

Input [342] -> Conv [32 filters, Size=4] -> max pooling -> Conv [32 filters, Size=4] -> max pooling-> flatten -> Dense [8, ReLU] -> Dense [1, Sigmoid]


**Table 6.** Results from a 2-layer CNN model obtained by varying the number of filters, with length of filters = 4 in both convolutional layers.

In our preliminary study presented in [24], this particular configuration of the model has been determined to yield the best performance on the same features extracted from the same app dataset used for the other models presented in this paper. More extensive performance evaluation of the CNN model has been presented in [24], where the effect of varying the other parameters, such as filter length, number of layers, and max pooling parameter has been investigated.

As shown in Table 6, the CNN model with 32 filters yielded the best results with 98.9% overall accuracy, precision = 0.983, recall = 0.978, and F1-score = 0.981. When compared to the results in Tables 4 and 5, it can be observed that most of the CNN-LSTM configurations and some of the CNN-GRU configurations achieved higher results than the CNN-only model. This suggests that the LSTM and GRU were able to capture some dependencies amongst the features thus improving the performance of the model.

#### *6.4. DNN Model Results*

The results obtained from the Dense Neural Network model are presented in this section. The naming convention used to describe the models is DNN-Y-layer-N, as shown in Table 7, where *Y* stands for the number of hidden layers and *N* is the number of units in each layer. For example, the sequence of the DNN-2-layer-200 model can be summarized as follows:


Input [342] -> Dense [200, ReLU] -> Dense [200, ReLU] -> Dense [1, Sigmoid]

**Table 7.** Results from the DNN model with various numbers of layers and units per layer.

Note that a dropout = 0.25 is incorporated between each of the layers in the models to reduce overfitting. Additionally, in all of the DNN models and the previous models in Sections 6.1–6.3, the optimization algorithm used was 'Adam' and 'binary cross-entropy' was used as the loss function. Furthermore, all the models were configured to automatically terminate training once the validation loss had not improved for *K* consecutive training epochs, where *K* was set to 20.
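The stopping rule just described can be sketched as a simple patience loop in plain Python. This is a sketch of the logic only, not the actual Keras `EarlyStopping` callback used in the experiments:

```python
def stopping_epoch(val_losses, patience=20):
    """Return the epoch at which training stops: after the validation loss
    has failed to improve for `patience` consecutive epochs, or at the
    last epoch if that never happens."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, stale = loss, 0   # improvement: reset the patience counter
        else:
            stale += 1              # no improvement this epoch
        if stale >= patience:
            return epoch
    return len(val_losses) - 1
```

Because each configuration's validation loss plateaus at a different point, the number of training epochs (and hence training time) varies from model to model, as discussed in Section 6.6.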

From Table 7, it can be observed that the DNN models with a single hidden layer did not result in the best outcomes. Likewise, in most cases, using 3 hidden layers as observed with the DNN-3-layer-200 and DNN-3-layer-300 also did not give the best outcomes. The best performance was obtained from the model with 2 hidden layers and 100 units in each layer, where the overall accuracy is 99.1% and F1-score = 0.984. The model with 3 hidden layers and 100 units in each layer also gave identical results. This shows that increasing the number of units in each layer is unlikely to improve the performance any further.

#### *6.5. Best Deep Learning Results vs. Classical Non-Deep Learning Classifiers*

In Table 8, we juxtapose the best results from our investigation of the deep learning classifiers with the results from the classical machine learning techniques. The DNN and the CNN-GRU models achieved the best results, as depicted in the table. The highest accuracy achieved by both models was 99.1%, which also corresponds to the highest F1-score of 0.984. These results are followed closely by the CNN-LSTM model, which achieved 99% overall accuracy and an F1-score of 0.983. Next was the CNN-only model with 98.9% accuracy and an F1-score of 0.981. All of these models outperformed the classical machine learning classifiers, of which the best two were SVM and Random Forest. SVM had 98.7% overall accuracy and an F1-score of 0.976, while Random Forest obtained 98.5% accuracy and an F1-score of 0.973. These results suggest that, with the static features extracted for detecting Android botnets, the deep learning models can perform beyond the limits of the classical machine learning classifiers.

In the table, the GRU-only model is shown as having the lowest accuracy compared to all the other models. This GRU model consisted of 200 hidden units and obtained an overall accuracy of 82.9%. Similarly, the LSTM-only models showed overall accuracies below 75% (results not shown in the table). This confirmed our initial expectation that pattern recognition (e.g., with convolutional layers or dense layers) was more important for the type of feature vectors used in this study than context or dependencies. However, the results of Sections 6.1 and 6.2 for the hybrid models suggest that a combination of methods that can capture both characteristics is promising.


**Table 8.** A summary of the best results of each technique compared to popular non-deep learning classifiers.

#### *6.6. Model Training Times*

When training the deep learning models, the number of epochs has a major influence on the overall training time. In our experiments we utilized a stopping criterion based on minimum validation loss rather than a fixed number of training epochs. For this reason, the number of training epochs varied between the different configurations of a given model. The longest CNN-GRU model to train was the CNN-2-layer-GRU-25, which took 145 s, with a testing time of 0.482 s, whereas the shortest was the CNN-1-layer-GRU-10 model, which took 84.4 s with a testing time of 0.399 s. The longest CNN-LSTM model to train was the CNN-2-layer-LSTM-5, which took 141 s with a testing time of 0.468 s; the shortest was the CNN-1-layer-LSTM-25 model, which took 83.6 s with a testing time of 0.419 s. Compared to the other models, the DNN was the fastest to train, with training times ranging from 10 to 26 s and an average testing time of 0.15 s.

#### *6.7. Comparison with Previous Works*

The results obtained in our study improve on the results reported in previous papers that also used the ISCX botnet dataset. This can be observed in Table 9. The second column shows the numbers of botnet and benign samples used in each of the referenced papers. Note that in some papers, some of the metrics were not reported. Even though the complete datasets and techniques used differed in each of the previous works, Table 9 shows that the models developed in this paper achieved state-of-the-art results on the ISCX botnet dataset compared to the others.

**Table 9.** Performance comparisons with previous works that utilize ISCX botnets samples.


#### **7. Conclusions and Future Work**

In this paper, we presented an extensive evaluation of various deep learning techniques for Android botnet detection using 342 static features consisting of 5 different types. The deep learning models investigated include: CNN, DNN, GRU, LSTM, as well as CNN-LSTM and CNN-GRU. The experiments were undertaken using 6802 apps consisting of 1929 botnet apps from the ISCX botnet dataset which has been utilized in several previous works. The outcomes of our experiments showed that with optimum configuration, the deep learning models performed quite well yielding high accuracies that were beyond the limits of the classical machine learning classifiers. DNN showed the best overall performance, while CNN-GRU and CNN-LSTM showed promising results that were much better than GRU-only or LSTM-only models. In future work, we plan to further investigate the performance of the deep learning models for botnet detection using alternative static and dynamic features. Another possible direction is to explore alternative network architectures such as those consisting of parallel rather than purely sequential integrations of the deep learning model components.

**Author Contributions:** Conceptualization, S.Y.Y.; Data curation, M.K.A.; Investigation, S.Y.Y., M.K.A. and A.S.; Methodology, S.Y.Y.; Resources, V.P.; Supervision, V.P.; Validation, M.K.A., A.S. and V.P.; Writing—original draft, S.Y.Y.; Writing—review & editing, M.K.A. and V.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** The data presented in this study are publicly available in FigShare at https://doi.org/10.6084/m9.figshare.14079581 (accessed on 10 December 2020).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

