*Article* **Digital Forensics Classification Based on a Hybrid Neural Network and the Salp Swarm Algorithm**

**Moutaz Alazab \* , Ruba Abu Khurma , Albara Awajan and Mohammad Wedyan**

Faculty of Artificial Intelligence, Al-Balqa Applied University, Amman 1705, Jordan; rubaabukhurma82@gmail.com (R.A.K.); a.awajan@bau.edu.jo (A.A.); mwedyan@bau.edu.jo (M.W.) **\*** Correspondence: m.alazab@bau.edu.jo

**Abstract:** In recent times, cybercrime has increased significantly, making Digital Forensics (DF) urgently needed. The main objective of DF is to keep evidence in its original state while identifying, collecting, analyzing, and evaluating digital data to reconstruct past acts. Evidence of cybercrime can be found inside a computer's system files. This paper investigates the viability of the Multilayer Perceptron (MLP) in DF applications. The proposed method relies on analyzing a computer's file system to determine whether it has been tampered with by a specific computer program. A dataset describes a set of features of file system activities in a given period. These data are used to train the MLP and build a training model for classification purposes. Identifying the optimal set of MLP parameters (weights and biases) is a challenging matter when training MLPs: traditional training algorithms suffer from stagnation in local minima and slow convergence. This paper proposes the Salp Swarm Algorithm (SSA) as a trainer for the MLP that searches for an optimized set of MLP parameters. SSA has proved its applicability in different applications and obtained promising optimization results, which motivated us to apply it in the context of DF to train the MLP, as it has never been used for this purpose before. The results are validated by comparisons with other meta-heuristic algorithms. SSAMLP-DF is the best algorithm because it achieves the highest accuracy, the minimum error rate, and the best convergence behavior.

**Keywords:** digital forensic; optimization; multilayer perceptron; salp swarm algorithm; connection weights

#### **1. Introduction**

Great technological development has led to the use of the Internet in many areas of life [1]. Many companies have taken advantage of the Internet to provide services through e-commerce without having to submit to market restrictions. This has reflected positively on national economies by increasing competitiveness and achieving great returns, and it has caused a huge shift in the number of customers who use the Internet to buy, sell, and transfer large amounts of money [2]. These large sums tempt many hackers and scammers to engage in activities that violate privacy. Such activities put many people and companies at risk on the web and cause huge financial losses [3]. Other violations that may occur online include impersonation, loss of privacy, brand theft, and loss of customers' trust in institutions. Hence, the suitability of the Internet for carrying out business and banking operations is called into question.

DF is a direct result of cybercrime and is typically applied to computer-related crimes [4–6]. These include intellectual property infringement, use of unauthorized privileges to deal with computer systems, privacy infringement, security breaches of confidential data repositories, carrying out terrorist operations via the Internet, misuse of electronic data, etc. DF is defined as an organized process that uses scientific techniques to collect, document, and analyze electronically stored data. This helps investigators use computer equipment and storage media to provide evidence of abnormal events [7].

**Citation:** Alazab, M.; Abu Khurma, R.; Awajan, A.; Wedyan, M. Digital Forensics Classification Based on a Hybrid Neural Network and the Salp Swarm Algorithm. *Electronics* **2022**, *11*, 1903. https://doi.org/10.3390/ electronics11121903

Academic Editor: Suleiman Yerima

Received: 30 April 2022 Accepted: 5 June 2022 Published: 17 June 2022


**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Suspicious events and unsafe activities may cause organizations to lose a lot of money, in addition to losing their prestige, so such events may have serious repercussions for both individuals and institutions. According to statistics in published reports, the number of computer violations has reached more than five million so far [8]. Many cybercrimes go unrecorded because their victims do not report them to officials: victims of cybercrime feel confused and humiliated, or believe that the authorities are not taking the necessary measures to punish the attackers. In addition, governments lack the workforce competencies and expertise needed to mitigate computer crimes [9].

Cybercrime scene investigations include forensic analysis of all types of storage and digital media, which increases the volume of data obtained. The value of data depends on the extent to which it is used in decision-making. The process of analyzing digital data during forensic practice is traditionally manual in most cases: investigators produce reports using statistical tools to give a picture of the collected data [10]. However, the cost of digital investigations increases with the data dimensionality. Furthermore, forensic cases increase the complexity of data analysis, and the quality of the manual process deteriorates quickly [11]. It is necessary to use more advanced methods, with potential beyond the capacity of conventional manual analysis, to deal with big data. Machine learning (ML) is an efficient and more sophisticated method that facilitates the production of useful knowledge for decision-makers. It does so by analyzing datasets from different perspectives and transforming them into meaningful forms [12].

The primary goal of DF analysis is to identify who is responsible for cyber security crimes. This is performed by selecting, classifying, and reordering the sequence of events of the digital crime. Event reconstruction analyzes the digital evidence and prepares a timeline of these cyber events. It determines the digital evidence, which is information with true value gained from trusted sources that confirms or denies that a cybercrime was committed. There are many sources from which reliable information can be collected to rebuild cybercrime events, such as web browsers, history files, cookies, messages, temporary files, log files, and system files [13].

Some of the valuable sources of information that can aid DF are the system files and their metadata. These files are normally modified by users during the ordinary use of a computer. The reconstruction of digital events can benefit from the system files affected by a cybercrime; however, these files may also be modified by normal programs [14]. For this reason, reconstructing digital events and determining the timeline of events and activities are essential. This helps to recognize whether the application that manipulates the file system is reliable or malicious.

The main problem to be solved in this paper is the classification of files that have been accessed, manipulated, changed, or deleted by application programs. This relies on a set of features represented by footprints. Identifying the files affected by suspicious acts facilitates the process of event reconstruction in the file system.

This paper presents one of the neural network techniques for rebuilding the acts of a digital crime. The Multi-Layer Perceptron (MLP) is an artificial neural network (ANN) that imitates the human neural system [15]. It has been used in many applications as a supervised classifier. The main advantage of the MLP is that it can learn and tackle many complex problems, with promising results in science and engineering. It can be adopted efficiently for either supervised or unsupervised learning. In addition, it offers parallelism, fault tolerance, and robustness to noise, and it has a great capacity to generalize.

The most common problems of gradient-based MLP training are stagnation in local minima, sensitivity to the initial values of its parameters, and slow convergence. Because of the local-minima problem, several studies in the literature have proposed different approaches to train MLPs. Swarm-based algorithms, which simulate the natural survival behavior of animals, have been widely used to train MLPs; examples include the Gradient-Based Optimizer (GBO) [16], the Slime Mould Algorithm (SMA) [17], the Heap-Based Optimizer (HBO) [18], and Harris Hawks Optimization (HHO) [19]. However, the local-minima problem still exists. Furthermore, according to the No-Free-Lunch (NFL) theorem, there is no universally preferable swarm algorithm: the superiority of one swarm algorithm over another in one application does not imply superiority in all applications, and there is no guarantee that it will perform similarly when applied to other optimization problems. These reasons motivated us to propose a new method that trains MLPs using the SSA algorithm in the context of DF.

SSA [20] is an effective meta-heuristic algorithm that belongs to the swarm-based family. It has a set of characteristics that motivated us to select it. First, it has a single control parameter that decreases adaptively as iterations increase. Second, it performs extensive exploration in the initial iterations and then adaptively switches to exploiting the most promising areas of the search space. Third, SSA preserves the best-found solution, so it never loses the best solution found so far. Fourth, follower salps change their locations adaptively by following other members of the population, which helps alleviate the local-minima problem.

SSA has been implemented to solve different optimization problems. In [21], Yang proposed a memetic SSA (MSSA) with multiple independent salp chains, aiming for more exploration and exploitation. Another improvement was a population-based regroup operation for coordination between the different salp chains. The MSSA was applied for efficient maximum power point tracking (MPPT) of PV systems; the reported output energy generated by MSSA in spring in Hong Kong was 118.57%, greater than that of the other algorithms. In [22], the author integrated SSA with the Ant Lion Optimization algorithm (ALO) for intrusion detection in IoT networks; the true positive rate reached 99.9% with minimal delay. In [20], the authors proposed a new phishing detection system using SSA, aiming to increase the classification performance and reduce the number of features of the phishing system. Different transfer function (TF) families (S-TFs, V-TFs, X-TFs, U-TFs, and Z-TFs) were used to generate a binary version of the SSA, and the results showed that BSSA with X-TFs obtained the best results. In [23], the target of the proposed approach was to reduce the number of symmetrical features and obtain high accuracy using three algorithms: particle swarm optimization (PSO), the genetic algorithm (GA), and SSA. The dataset used is the Koirala dataset, and the proposed COVID-19 fake news detection model obtained a high accuracy (75.4%) while reducing the number of features to 303. In [24], the authors introduced four optimization algorithms integrated with an MLP neural network, namely artificial bee colony (ABC-MLP), the grasshopper optimization algorithm (GOA-MLP), the shuffled frog leaping algorithm (SFLA-MLP), and SSA-MLP, to approximate the concrete compressive strength (CSC) in civil engineering. The results show that SSA-MLP and GOA-MLP can be promising alternatives to laboratory and traditional CSC evaluation methods.

The main contributions can be summarized as follows:


This paper has the following sections: Section 2 reviews some of the previous studies that investigate DF. Section 3 provides a description of the MLP. Section 4 discusses the details of the new SSAMLP-DF method. Section 5 discusses the dataset, experiments, and results. Finally, Section 6 summarizes the findings of the whole paper.

#### **2. Forensics Background**

The initiation of the DF investigations depends on the presence of some indications on the computer system. There are several indications that a computer system has experienced suspicious incidents and is a victim of cybercrime [25]. Table 1 shows these indications.


**Table 1.** DF indications.

The goal of conducting DF investigations is to answer a range of questions as quickly as possible. These include:


DF investigations rely heavily on system files because they provide accurate and important information about the sequence of events involved in cybercrimes [26]. At the end of the DF investigations comes the question about how the cybercrime occurred, and this question depends on the answer to all the previous questions.

In the past, a general rule was devised regarding potential digital evidence, based on the documentary evidence considered admissible in court [27]. Its steps include acquisition, identification, evaluation, and admission. Years later, a specific methodology was developed for cybercrime research [28]. This methodology underlies all the models proposed so far and includes the following steps:


In [29], the authors expanded the model in [28] by including two additional steps: setup and approach strategy. The new model was called the abstract DF model. The two additional steps are carried out after the define step and before the preserve step. The setup step involves the tools that are used to deal with the suspicious acts, while the approach step involves a formal strategy for investigating the effect on the technology and viewers. The problem with this model is that it is quite generic, and there is no clear method to test it.

From the models that have been proposed in [27–29], we can say that the steps of DF are compatible with the steps of machine learning. In general, a DF investigation relies heavily on reconstructing events and preparing accurate evidence.

In [30], the authors devised a new method that initially requires the collection of evidence to reconstruct the acts, followed by the development of initial hypotheses. These hypotheses are studied, analyzed, and examined; from here, the actual process of DF investigation begins, and it ends with the outcomes of the electronic case. A new technique was developed in [31] to reconstruct the latest acts. It is based primarily on identifying the relevant data objects and the relationships between them.

In [32], the authors used an ant system that operates on Grid hosts. The Ant-based Replication and MApping Protocol (ARMAP) is used to spread resource information with a decentralized technique, and its performance is assessed using an entropy index. The authors in [33] developed a new method to reconstruct acts using a finite state machine; it models the transition from one state to another based on conditions imposed by the evidence. In [34], the authors proposed an ontology for reconstructing acts by representing the acts as entities. Events are represented by developing a time model that captures the instant of a state change instead of using a time period. A shellbag-based technique was proposed in [35], which preserves information in the Windows Registry. This information concerns the files, folders, and windows that appear, such as deleted, modified, and relocated files and folders.

Nowadays, many tools are used to collect acts and save them in a temporal repository. However, it can be difficult to analyze the raw data.

In [36], Hall investigated the potential of EXplainable Artificial Intelligence (XAI) to enhance the analysis of DF evidence, using examples of the current state of the art as a starting point. Bhavsar [37] pointed out the challenges in forensics standards for designing an investigation-process framework for criminal activities using the best digital forensic instruments. Dushyant [38] discussed the advantages and disadvantages of incorporating machine learning and deep learning into cybersecurity to protect systems from unwanted threats. In [39], Casino conducted a review of all the relevant reviews in the field of digital forensics; the main challenges are related to evidence acquisition and pre-processing, collecting data from devices, the cloud, etc.

Unlike previous studies in the literature, where the proposed algorithm was trained on general datasets or limited applications, the newly developed MLP-SSA is proposed for the first time in the context of DF. Different evaluations are performed to test the viability of the MLPSSA to be used as a robust classifier for DF.

#### **3. DF Using MLP**

The MLP neural network can be used for the classification of files that have been accessed, manipulated, changed, or deleted by application programs. This depends on a set of features represented by footprints. Identifying the files affected by suspicious acts facilitates the process of event reconstruction in the file system.

#### *3.1. Dataset*

The dataset used in the experiments is collected from three resources: the audit log, the file system, and the registry. This ensures that if some features are missed in one resource, they may be found in another. The dataset represents a database of the collected features of the system files and their metadata. Each record contains the values of the features of files that have been affected by a specific program; these are the footprints that are specific to each application program, describing the system events, acts, or metadata. As in supervised classification, the dataset has one class column; in this dataset, the last column is the application program.

The features related to the system files are collected while running application programs, using tools such as a .NET program. In this paper, the experiments run inside VMware, which is used for collecting the features and building a training dataset. The main advantage of VMware is that it can get rid of useless programs and reduce their effects. The operating system used in this experiment is Windows 7, as it is a commonly used platform. The application programs used in the experiments are Internet Explorer, Acrobat Reader, MS-Word, MS-Excel (Microsoft Office version 2007), and VLC media player.

These applications are executed in different ways. First, the applications are loaded separately: one application is completely loaded and then closed before the next program is loaded. The main issue with this execution is that the programs do nothing in the file system except load and close. Second, the applications are loaded, executed, and then closed one by one. In this case, the first three applications perform one act, which is saving a file, while the last application visits a website (www.msn.com, accessed on 20 April 2022). Third, as in the first and second executions, the applications are loaded separately; however, different acts are executed by each application.

The acts of the first three applications include saving files, opening files, and creating new files, while the last application performs a set of acts that include visiting secured/unsecured websites and sending/receiving emails with/without attachments. Fourth, in this execution, the applications are executed at the same time, as opposed to the previous executions; as in the third execution, different acts take place. The number of records in the dataset is 23,887. Table 2 shows the features of the dataset that is used to investigate the digital forensics.

**Table 2.** The features in a dataset.


#### *3.2. Preparing a Dataset*

The performance of a machine learning model is greatly affected by the dataset. Therefore, preparing a dataset is an important stage in developing efficient and reliable models, as it enhances the generalization process. Several preprocessing steps are applied to the dataset before it is used in the learning algorithm. First, the dataset is restructured by distributing the values into groups so that the number of feature states is reduced.

Another important concern is that most machine learning algorithms deal with numerical values instead of text values. Therefore, in the used dataset, there is a need to assign numerical values to some feature values using the word2vect tool; using this tool, an index is assigned to each word. Second, the dataset is cleaned by getting rid of missing and outlier values, which has a positive impact on generalization and yields less biased models. Third, the dataset is normalized by scaling the values of the features into a predefined range using the min-max method. This helps to treat all features equally instead of letting one feature overwhelm the others.
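As an illustration, the encoding and normalization steps above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the word-to-index mapping stands in for the word2vect-based indexing, and the example column values are hypothetical.

```python
def encode_words(values):
    """Assign an integer index to each distinct word, in order of first appearance."""
    index = {}
    for v in values:
        if v not in index:
            index[v] = len(index)
    return [index[v] for v in values]

def min_max_scale(values, lo=0.0, hi=1.0):
    """Min-max scaling of numeric values into [lo, hi]; constant columns map to lo."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        return [lo for _ in values]
    return [lo + (hi - lo) * (v - v_min) / (v_max - v_min) for v in values]

# Hypothetical feature columns from a file-activity record.
actions = ["open", "save", "delete", "open"]
sizes = [120.0, 4096.0, 512.0, 120.0]
print(encode_words(actions))   # [0, 1, 2, 0]
print(min_max_scale(sizes))
```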

#### *3.3. Salp Swarm Algorithm (SSA)*

The SSA algorithm is inspired by salps, aquatic animals with a specialized technique for obtaining food. The first salp in the swarm leads the other members in a chain, meaning that the other swarm members change their positions dynamically with respect to the leader. Figure 1 shows a single salp on the right side and a swarm of salps on the left side.

**Figure 1.** Single salp (**A**) and the salps chain (**B**).

SSA (Algorithm 1) is a swarm-based algorithm developed by Mirjalili [20]. The swarm *S* of *n* salps can be represented as in Equation (1). The population of salps is represented by a matrix in which each row is a salp (a solution). The length of a salp is the number of features in the dataset (*d*), and the number of salps (*n*) is the swarm's size. The first row of the matrix is the leader salp, and the other rows are the follower salps.

$$S = \begin{bmatrix} s\_1^1 & s\_2^1 & \cdots & s\_d^1 \\ s\_1^2 & s\_2^2 & \cdots & s\_d^2 \\ \vdots & \vdots & \ddots & \vdots \\ s\_1^n & s\_2^n & \cdots & s\_d^n \end{bmatrix} \tag{1}$$

Equation (2) illustrates the location of the first salp

$$s\_j^1 = \begin{cases} Foo\_j + cp\_1((up\\_bound\_j - low\\_bound\_j)cp\_2 + low\\_bound\_j), & cp\_3 \geq 0.5 \\ Foo\_j - cp\_1((up\\_bound\_j - low\\_bound\_j)cp\_2 + low\\_bound\_j), & cp\_3 < 0.5 \end{cases} \tag{2}$$

where *s*<sup>1</sup><sub>*j*</sub> and *Foo<sub>j</sub>* are the locations of the leader salp and the food source in the *j*th dimension, respectively. In Equation (3), *cp*<sub>1</sub> gradually decreases across cycles, where *curr* and *last* are the current and the last cycles, respectively. The other parameters *cp*<sub>2</sub> and *cp*<sub>3</sub> in Equation (2) are chosen randomly from [0, 1]; they direct the next location in the *j*th dimension toward +∞ or −∞ and determine the step size. The *up*\_*bound<sub>j</sub>* and *low*\_*bound<sub>j</sub>* are the limits of the *j*th dimension.

$$cp\_1 = 2e^{-\left(\frac{4\,curr}{last}\right)^2} \tag{3}$$

$$s\_j^i = \frac{1}{2}\left(s\_j^i + s\_j^{i-1}\right) \tag{4}$$

In Equation (4), *i* ≥ 2, and *s*<sup>*i*</sup><sub>*j*</sub> is the location of the *i*th salp in the *j*th dimension.
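One SSA iteration, as described by Equations (2)–(4), can be sketched in code as follows. This is an illustrative sketch, not the authors' MATLAB implementation; the final bound-clamping step is an assumption taken from common SSA implementations.

```python
import math
import random

def ssa_step(swarm, food, low_bound, up_bound, curr, last):
    """One SSA iteration: move the leader (Eq. (2)) and the followers (Eq. (4))."""
    d = len(food)
    cp1 = 2.0 * math.exp(-((4.0 * curr / last) ** 2))         # Eq. (3)
    for j in range(d):                                        # leader salp, Eq. (2)
        cp2, cp3 = random.random(), random.random()
        step = cp1 * ((up_bound[j] - low_bound[j]) * cp2 + low_bound[j])
        swarm[0][j] = food[j] + step if cp3 >= 0.5 else food[j] - step
    for i in range(1, len(swarm)):                            # followers, Eq. (4)
        for j in range(d):
            swarm[i][j] = 0.5 * (swarm[i][j] + swarm[i - 1][j])
    for sol in swarm:                                         # clamp to the search bounds
        for j in range(d):
            sol[j] = min(max(sol[j], low_bound[j]), up_bound[j])
    return swarm
```

In a full run, this step repeats until the maximum cycle `last` is reached, and `food` is replaced by the best solution found so far after each fitness evaluation.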

#### **Algorithm 1** SSA


#### *3.4. Multi-Layer Perceptron Neural Networks (MLP)*

The MLP is a type of feed-forward NN that is trained on data to discover the hidden patterns in the training data. The learned pattern is then applied to unseen instances of the dataset to obtain results. There are three layers in the architecture of the MLP: the first, the middle, and the last layers. Each layer consists of a set of computational nodes that simulate human neurons. The MLP's complexity increases as more middle layers are added. The standard MLP contains a single hidden layer. Figure 2 shows a standard MLP with a first layer of *n* nodes, a single middle layer of *m* nodes, and a last layer of *k* nodes.

**Figure 2.** MLP components.

The MLP can be visualized as a directed graph that connects the first layer with the middle layer and the middle layer with the last layer. Each middle node is connected to the first layer by *n* weights and to the last layer by *r* weights. In addition, there are *m* biases. The main operations that take place in the hidden nodes are the summation in Equation (5) and the activation in Equation (6). The weighted sum for node *m* is computed using Equation (5); after computing the sum, a transfer function is applied to it as in Equation (6).

$$\text{SumFun}\_m = \sum\_{i=1}^{n} d\_{im} \cdot P\_i + c\_m \tag{5}$$

where *d<sub>im</sub>* is the weight between the first-layer node *P<sub>i</sub>* and the middle node *h<sub>m</sub>*, and *c<sub>m</sub>* is the bias that enters the middle node *m*.

$$o\_m = \text{Sig}(\text{SumFun}\_m) \tag{6}$$

where *o<sub>m</sub>* is the output of node *m*; *m* = 1, 2, . . . , *s*; and *Sig* is the sigmoid function given in Equation (7)

$$\text{Sig}(\text{SumFun}\_m) = \frac{1}{1 + e^{-\text{SumFun}\_m}} \tag{7}$$

After collecting the results from all the *m* nodes, the final result *O<sup>m</sup>* can be generated as shown in Equations (8) and (9).

$$\text{SumFun}\_m = \sum\_{i=1}^{s} d\_{im} \cdot o\_i + c\_m \tag{8}$$

where *d<sub>im</sub>* is the weight between node *i* in the middle layer and node *m* in the last layer, and *c<sub>m</sub>* is the bias that enters the output node *m*.

$$O\_m = \text{Sig}(\text{SumFun}\_m) \tag{9}$$

where *O<sub>m</sub>* is the final result of output node *m*; *m* = 1, 2, . . . , *r*; and *Sig* is the sigmoid function applied in Equation (7).
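Equations (5)–(9) amount to a standard forward pass through a single-hidden-layer MLP, which can be sketched as follows. This is an illustrative sketch; the weight layout (one row of weights per receiving node) is an assumption, not taken from the paper.

```python
import math

def sigmoid(x):
    """Sigmoid transfer function, Eq. (7)."""
    return 1.0 / (1.0 + math.exp(-x))

def mlp_forward(p, w_ih, b_h, w_ho, b_o):
    """Forward pass of a single-hidden-layer MLP.
    p: input vector; w_ih[m]: weights into hidden node m (Eq. (5));
    w_ho[k]: weights into output node k (Eq. (8)); b_h, b_o: biases."""
    hidden = [sigmoid(sum(w * x for w, x in zip(row, p)) + b)     # Eqs. (5)-(6)
              for row, b in zip(w_ih, b_h)]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden)) + b)  # Eqs. (8)-(9)
            for row, b in zip(w_ho, b_o)]
```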

#### **4. SSAMLP-DF Model**

The MLP is a feed-forward NN that propagates data from the input nodes through the hidden nodes to the output nodes and is conventionally trained with the backpropagation method. The following list shows the stages of the SSAMLP-DF model:


There are many areas in which the MLP has been implemented. The main features that have made the MLP so widely used include its nonlinear structure, the adaptive updating of its parameters, and its ability to generalize well compared with other algorithms. Initially, the dataset is divided into three parts: 70% to train the model, 15% to validate the model, and 15% to test the model and mitigate the overfitting problem. This is called the "Hold-Out" validation method. Overfitting is a common machine learning problem in which the error rate in the training phase is very small but increases in the testing phase, which is undesirable.
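The 70/15/15 hold-out split can be sketched as follows (an illustrative sketch; the shuffling seed is an arbitrary assumption, not a value from the paper).

```python
import random

def hold_out_split(records, seed=42):
    """Shuffle the records and split them 70/15/15 into train/validation/test parts."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.70 * n)
    n_val = int(0.15 * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```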

The main target of MLP is to build an accurate model with a minimum error achieved in the testing phase. By applying the MLP to the training part of a dataset, the initial MLP structure is constructed. These include the MLP's layers and the number of nodes in each middle layer. The weights of the MLP are set to nonzero values. MLP trains a model until a specific condition is satisfied, such as reaching the maximum number of cycles or achieving a threshold error rate. If the trainer achieves the stopping condition, then the weight parameters of the generated model are kept. If the stopping condition is not met, then the MLP structure is updated by changing the number of nodes in the middle layer.

The MLP structure starts with a large number of nodes in the middle layer, which is then progressively reduced across cycles until an acceptable performance is achieved. This is called the pruning method, and it is used to determine the number of middle-layer nodes that is suitable for generating the desired performance. After the model is generated, the testing dataset is applied to it, and the error rate is estimated. The basic measurement is the accuracy, which is calculated by dividing the number of correctly classified instances by the total number of instances in the testing part of the dataset. It is computed as follows:

$$\text{ACC-DF} = \frac{TP \text{-DF} + TN \text{-DF}}{TP \text{-DF} + TN \text{-DF} + FP \text{-DF} + FN \text{-DF}} \tag{10}$$

where (*TP*-*DF*) is the number of files that are tampered with by a specific computer program and correctly predicted as tampered with by that program. (*FN*-*DF*) is the number of files that are tampered with by a specific computer program but incorrectly predicted as not tampered with. (*FP*-*DF*) is the number of files that are not tampered with by a specific program but incorrectly predicted as tampered with. (*TN*-*DF*) is the number of files that are not tampered with by a specific program and correctly predicted as not tampered with. The generated DF model is trained using MATLAB.
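Equation (10) can be computed from the four confusion counts. The sketch below illustrates this for a hypothetical binary labeling (1 = tampered with by the program of interest, 0 = not); the function names are illustrative.

```python
def confusion_counts(actual, predicted):
    """Count TP/TN/FP/FN for the binary 'tampered with by a specific program' task."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn

def acc_df(tp, tn, fp, fn):
    """Classification accuracy, Eq. (10)."""
    return (tp + tn) / (tp + tn + fp + fn)
```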

This section presents the proposed DF model, which integrates the SSA and MLP algorithms. In this model, the SSA is used to train an MLP with one hidden layer. Two important issues must be taken into account: the representation of the solution in the SSA and the fitness function. Each solution is represented as a one-dimensional array whose values encode a candidate MLP structure. The solution in the SSA-MLP model is partitioned into three parts: the weight parameters that connect the input nodes with the hidden nodes, the weight parameters that connect the hidden nodes with the output nodes, and the biases. The solution's length equals the number of connection weights plus the number of biases, as given in Equation (11):

$$\text{Solution's length} = (P \times M) + (2 \times M) + 1 \tag{11}$$

where *P* is the number of input nodes and *M* is the number of hidden nodes.

The fitness value in the SSA-MLP model is the mean square error (MSE). This is calculated from the differences between the actual values and the values predicted by the generated solutions (MLPs) over the training instances of the dataset. The MSE is shown in Equation (12), where *R<sub>i</sub>* is the actual value, *R*ˆ*<sub>i</sub>* is the predicted value, and *N* is the number of examples in the training dataset.

$$MSE = \frac{1}{N}\sum\_{i=1}^{N} \left(R\_i - \hat{R}\_i\right)^2 \tag{12}$$
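The solution encoding of Equation (11) and the MSE fitness of Equation (12) can be sketched as follows (illustrative function names, not from the paper's implementation; the MSE includes the 1/N mean implied by its name).

```python
def solution_length(p, m):
    """Length of a salp vector for a P-input, M-hidden, 1-output MLP, Eq. (11):
    P*M input->hidden weights + M hidden->output weights + M hidden biases + 1 output bias."""
    return p * m + 2 * m + 1

def mse_fitness(actual, predicted):
    """Mean squared error used as the SSA fitness, Eq. (12)."""
    n = len(actual)
    return sum((r - rh) ** 2 for r, rh in zip(actual, predicted)) / n
```

For example, an MLP with 13 inputs and 4 hidden nodes gives a salp of length 13 × 4 + 2 × 4 + 1 = 61.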

The steps of the proposed SSA-MLP can be summarized as follows:


Figure 3 shows the assignment of a salp to MLP. Figure 4 shows the general steps of the proposed DF-SSA-MLP model.

**Figure 3.** Assigning a salp to MLP.

**Figure 4.** DF-SSA-MLP flowchart.

#### **5. Experiments and Results**

This section presents the classification results of the proposed SSAMLP versus other metaheuristic algorithms in terms of accuracy. Twenty-four experiments were established to evaluate the different algorithms in terms of accuracy using different MLP structures, and an analysis of their convergence curves is illustrated. The proposed approach is compared against five algorithms integrated with the MLP using the same experimental specifications: Particle Swarm Optimization (PSO) [40], Ant Colony Optimization (ACO) [41], the Genetic Algorithm (GA) [42], the Differential Evolution algorithm (DE) [43], and Backpropagation [44].

The proposed methods were also evaluated on 14 benchmark mathematical functions used for minimization problems. Table 3 shows the values of the parameters of the applied algorithms that are used to validate the proposed SSAMLP model.


**Table 3.** The parameters' values for the applied algorithms.

Table 4 shows the specification of the performed experiments. These include the structure of the MLP related to the number of layers and the number of hidden nodes. The target is to study the effects of different MLP structures on the performance of the algorithms and determine the best MLP structure for all the studied algorithms.


**Table 4.** Experiments specifications.

Table 5 and Figure 5 show the accuracy results of the proposed SSAMLP model versus other algorithms that are integrated into the MLP algorithm.

**Table 5.** Accuracy results of the SSA-MLP against other algorithms based on different experiment specifications.


**Figure 5.** Boxplot representation for the proposed SSAMLP and other algorithms in terms of classification accuracy.

As can be seen in Table 5, the best MLP structure is achieved with one hidden layer and four hidden nodes. In the first three experiments, the accuracy of all hybrid models increases markedly as the number of hidden nodes decreases, because reducing the computational complexity enhances the performance of DF prediction. Accuracy keeps rising until the structure reaches one layer with four hidden nodes; this structure yields the best DF prediction accuracy for all algorithms. Beyond that point, accuracy decreases as the number of nodes in the middle layer is reduced further, because an overly simple model does not provide sufficient computation to produce an efficient classifier. Hence, it is not recommended to use fewer or more than four hidden nodes.

Choosing a medium number of computational nodes in the middle layer produces the best classification model and the best DF prediction performance. In the remaining experiments, increasing the number of layers in the MLP structure does not benefit prediction performance; on the contrary, it degrades classification performance. Additional layers increase the complexity of the model and cause overfitting, a major machine learning problem: the complex model fits a large number of training instances without learning a general prediction pattern, which reflects badly on the testing phase and degrades prediction performance. Figure 6 shows the convergence curves of the proposed SSAMLP versus the other algorithms. The Y-axis is the error rate in terms of MSE, and the X-axis is the number of iterations. SSAMLP achieves the best convergence curves, obtaining the lowest error rates in the final iterations.

**Figure 6.** Convergence curves of the proposed SSAMLP and other algorithms.

Overall, the experiments confirm that MLP can be applied as a reliable prediction model for DF. They also show that the extracted features are sufficient to investigate DF and identify the application programs that affected the file system. The best accuracy achieved is 95.84%, which corresponds to an error rate of about 4.16%. Although no error-rate threshold is commonly reported for DF models, this small error can be attributed to application programs that access the same files in the file system, which makes their extracted features overlap for several files.

#### **6. Conclusions and Future Works**

Recently, cybercrime has increased significantly, making the need for DF urgent. The target of this paper is to propose the SSA for training MLPs. Its few parameters, fast convergence, and strong ability to avoid local minima motivated our use of SSA to train MLPs. This is a minimization problem whose main objective is to select the optimal structure of the MLP (the best connection-weight and bias parameters) that achieves the minimum MSE. The purpose is to apply the optimal MLP to gather evidence by checking historical actions on the file system and identifying how the application programs affected these files.

The dataset used in the experiments was collected by applying five different application programs and recording their footprints on the system files, i.e., the resulting system actions and log entries. Four scenarios of applying the application programs to the file system were considered, ranging from simple through medium to complex access patterns. The dataset is used to train the hybrid MLP-SSA model and determine the MLP structure that produces the minimum MSE. The experimental results show that the proposed SSAMLP outperformed the compared hybrid meta-heuristic algorithms in terms of accuracy, error rate, and convergence. Furthermore, SSAMLP proved suitable as a reliable model for investigating DF, achieving an accuracy of 95.84% with one hidden layer and four hidden nodes.

To verify the proposed method, a set of meta-heuristic algorithms was applied to the same dataset and their results were compared with those of the SSAMLP. The comparison shows that SSAMLP outperforms the other algorithms in the majority of cases.

For future work, it is worthwhile to train other types of MLPs using the SSA. Applying the multiobjective SSA to train MLPs in the context of DF is recommended as well.

**Author Contributions:** Conceptualization, R.A.K. and M.A.; methodology, R.A.K. and M.A.; software, A.A.; validation, M.A., A.A. and M.W.; formal analysis, R.A.K.; investigation, M.A.; data curation, M.W.; writing—original draft preparation, R.A.K.; writing—review and editing, R.A.K.; visualization, A.A.; supervision, M.A.; project administration, M.A.; funding acquisition, M.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** The research reported in this publication was funded by the Deanship of Scientific Research and Innovation at Al-balqa Applied University, Al-Salt, Jordan. (Grant Number: DSR-2021#388).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Janaka Senanayake 1,\* , Harsha Kalutarage <sup>1</sup> and Mhd Omar Al-Kadri <sup>2</sup>**

<sup>1</sup> School of Computing, Robert Gordon University, Aberdeen AB10 7QB, UK; h.kalutarage@rgu.ac.uk

<sup>2</sup> School of Computing and Digital Technology, Birmingham City University, Birmingham B4 7XG, UK; omar.alkadri@bcu.ac.uk

**\*** Correspondence: j.senanayake@rgu.ac.uk

**Abstract:** With the increasing use of mobile devices, malware attacks are rising, especially on Android phones, which account for 72.2% of the total market share. Hackers try to attack smartphones with various methods such as credential theft, surveillance, and malicious advertising. Among numerous countermeasures, machine learning (ML)-based methods have proven to be an effective means of detecting these attacks, as they are able to derive a classifier from a set of training examples, thus eliminating the need for an explicit definition of the signatures when developing malware detectors. This paper provides a systematic review of ML-based Android malware detection techniques. It critically evaluates 106 carefully selected articles and highlights their strengths and weaknesses as well as potential improvements. Finally, the ML-based methods for detecting source code vulnerabilities are discussed, because it might be more difficult to add security after the app is deployed. Therefore, this paper aims to enable researchers to acquire in-depth knowledge in the field and to identify potential future research and development directions.

**Keywords:** Android security; malware detection; code vulnerability; machine learning

#### **1. Introduction**

In this technological era, smartphone usage and its associated applications are increasing rapidly [1], owing to their convenience and efficiency and to continuing improvements in the hardware and software of smart devices. It is predicted that there will be 4.3 billion smartphone users by 2023 [1]. Android is the most widely used mobile operating system (OS); as of May 2021, its market share was 72.2% [2]. The second-highest share, 26.99%, is held by Apple iOS, while the remaining 0.81% is shared among Samsung, KaiOS, and other small vendors [2]. Google Play is the official app store for Android-based devices; the number of apps published on it exceeded 2.9 million as of May 2021, of which more than 2.5 million are classified as regular apps and 0.4 million as low-quality apps by AppBrain [3]. Android's worldwide popularity makes it an attractive target for cybercriminals and puts it more at risk from malware and viruses. Studies have proposed various methods of detecting these attacks, and ML is one of the most prominent techniques among them [4]. This is because ML techniques can derive a classifier from a (limited) set of training examples, avoiding the need to explicitly define signatures when developing malware detectors. Defining signatures requires expertise and tedious human involvement, and for some attack scenarios explicit rules (signatures) do not exist, although examples can be obtained easily. Numerous industrial and academic research efforts have been carried out on ML-based malware detection for Android, which is the focus of this review paper.

The taxonomic classification of the review is presented in Figure 1. Android users and developers are known to make mistakes that expose them to unnecessary dangers and the risk of infecting their devices with malware. Therefore, in addition to malware detection techniques, methods to identify these mistakes are important and are covered in this paper.

**Citation:** Senanayake, J.; Kalutarage, H.; Al-Kadri, M.O. Android Mobile Malware Detection Using Machine Learning: A Systematic Review. *Electronics* **2021**, *10*, 1606. https:// doi.org/10.3390/electronics10131606

Academic Editor: Rui Pedro Lopes

Received: 29 May 2021 Accepted: 29 June 2021 Published: 5 July 2021


As shown in Figure 1, detecting malware with ML involves two main phases: analyzing Android Application Packages (APKs) to derive a suitable set of features, and then training machine learning and deep learning (DL) methods on the derived features to recognize malicious APKs. Hence, a review of the methods available for APK analysis is included, covering static, dynamic, and hybrid analysis. Similar to malware detection, vulnerability detection in software code involves two main phases, namely feature generation through code analysis and training ML on the derived features to detect vulnerable code segments. Hence, these two aspects are included in the review's taxonomy.

**Figure 1.** Taxonomy of the review.

The rest of this paper is organised as follows: Section 2 lays out the background to this study. Section 3 provides a detailed description of the review methodology, while Section 4 discusses related previous reviews on the topic. Section 5 discusses static, dynamic, and hybrid analysis techniques for Android malware detection and the application of ML and DL methods as well as a comparison of the methods used in the individual studies. Section 6 discusses ML methods to identify code vulnerabilities, with Section 7 exploring the results and discussions thereof. Finally, Section 8 concludes the paper.

#### **2. Background**

This section provides a high-level overview of the Android architecture, its built-in security, and potential threat vectors for Android. It also provides an introduction to the ML process, which may be useful for readers without an ML background in understanding the contents of this paper.

#### *2.1. Android Architecture*

Android is built on top of the Linux Kernel. Linux is chosen because it is open source, has a proven track record, provides drivers and mechanisms for networking, and manages virtual memory, device power, and security [5]. Android has a layered architecture [6], arranged from bottom to top: on top of the Linux Kernel layer sit the Hardware Abstraction Layer, the Native C/C++ Libraries and Android Runtime, the Java Application Programming Interface (API) Framework, and the System Apps. Each layer is responsible for a particular task. For example, the Java API Framework provides Java libraries for location-aware application activities such as identifying the latitude and longitude.

Android-based applications and some system services use the Android Runtime (ART). Dalvik was the runtime environment used before ART; both were created specifically for Android. The ART executes the Dalvik Executable (DEX) format and its bytecode specification [7]. Memory management and power management are also important aspects, since Android-based applications run on battery-powered devices with limited memory. Therefore, the Android operating system is designed so that every resource is well managed [5]. For instance, the Android OS automatically suspends an application in memory if it is not currently in use; this suspended state is part of the application life cycle. By doing this, it preserves power that can be utilised when the application reopens. Otherwise, applications are kept idle until they are closed [8].

#### Built-In Security

Android comes with security already built in: it is a privilege-separated operating system [9]. The sandboxing technique and the permission system reduce some risks and bugs in applications. Sandboxing in Android isolates running applications using unique identifiers based on the underlying Linux environment [10]. Without permissions granted by the user at install time or upon reconfiguration, apps cannot access system resources; if some permissions are not granted, the application itself may not be usable. Each system update or upgrade brings improvements in security and privacy. For example, Android 11, the latest stable version at the time of writing, contains security- and privacy-related changes such as scoped storage enforcement, one-time permissions, permission auto-reset, background location access, package visibility, and foreground services [11].

However, malware can still exploit vulnerabilities in applications developed by various parties, because the Google Play Store, unlike the Apple App Store, does not detect some vulnerabilities when applications are published [12].

#### *2.2. Threats to Android*

While Android has good built-in security measures, several design weaknesses and security flaws have become threats to its users. Awareness of these threats is also important for proper malware detection and vulnerability analysis. Many research and technical reports have been published on Android threats [13], classifying them by attack methodology. Social engineering attacks, physical-access attacks, and network attacks are described as ways of gaining access to the device. Under vulnerabilities and exploitation methods, man-in-the-middle attacks, return-to-libc attacks, JIT-spraying attacks, third-party library vulnerabilities, Dalvik vulnerabilities, network architecture vulnerabilities, virtualization vulnerabilities, and Android Debug Bridge and kernel vulnerabilities are considered.

The survey in [14] identified four types of attacks on Android: hardware-based attacks, kernel-based attacks, Hardware Abstraction Layer (HAL)-based attacks, and application-based attacks. Hardware-based attacks such as Rowhammer, Glitch, and Drammer target sensors, touch screens, communication media, and DRAM. Kernel-based attacks such as Gooligan, DroidKungFu, and return-oriented programming target root privileges, memory, the boot loader, and device drivers. HAL-based attacks such as Return to User and TOCTOU target interfaces for cameras, Bluetooth, Wi-Fi, the Global Positioning System (GPS), and radio. Application-based attacks such as AdDetect, WuKong, and LibSift relate to third-party libraries, intra-library collusion, and privilege escalation.

Android applications are easily penetrable by anyone with proper knowledge of Android programming if suitable security mechanisms are not in place. In addition, Android marketplaces such as Google Play do not follow extensive security protocols when new apps are published. For example, the Android game Angry Birds was hacked: the attacker got into its APK file and embedded malicious code that sent text messages without the user's knowledge, at a cost of 15 GBP to the user per message. More than a thousand users were affected [15].

#### 2.2.1. Malware Attacks on Android

Malware attacks are the most common threat to Android. Researchers define malware in various ways depending on the harm it causes. In essence, malware is any malicious application containing a piece of malicious code [16] with an evil intent [17] to obtain unauthorised access and to perform illegal or unethical activities, violating the three main principles of security: confidentiality, integrity, and availability.

Malware on smart devices can be classified from three perspectives: attack goals and behaviour, distribution and infection routes, and privilege-acquisition modes [18]. Fraud, spam emails, data theft, and misuse of resources fall under attack goals and behaviour. Software markets, browsers, networks, and devices are distribution and infection routes. Technical exploitation and user manipulation, such as social engineering, fall under privilege-acquisition modes. Malware specifically targeting the Android operating system is identified as Android malware [19], which harms or steals data from an Android-based mobile device. It is categorised as Trojans, spyware, adware, ransomware, worms, botnets, and backdoors [20]. Google describes malware as potentially harmful applications, classified as commercial and noncommercial spyware, backdoors, privilege escalation, phishing, types of fraud such as click fraud, toll fraud, and Short Message Service (SMS) fraud, and Trojans [21].

App collusion should also be considered when studying malware. App collusion occurs when two or more apps work together to achieve a malicious goal [22], even though neither app performs any malicious activity individually. Detecting malicious inter-app communication and analysing app permissions are essential for app collusion detection [23,24].

#### 2.2.2. Users and App Developers' Mistakes

Mistakes can be made, knowingly or unknowingly, by developers as well as users. These mistakes may give rise to threats to the Android OS and its applications.

It has been identified that users are responsible for most security issues [25]. Some common user mistakes can lead to serious threats in an Android application. When installing Android applications, users are asked to grant some permissions; however, not all users understand the purpose of each permission. They grant permissions to run the application without considering their severity, and fraudulent applications might steal data and perform unintended tasks after obtaining the required permissions. Threats to Android systems can also arise from mistakes made by app developers during development. In the publishing stage of Android apps, Google Play has only limited control over code vulnerabilities in the applications. Developers sometimes mistakenly specify unneeded permissions in the Android manifest file, which prompts the user to grant permissions that are not categorised as simple permissions [26]. Although app development companies and some app stores advise following security guidelines during development, many developers still fail to write secure code for their mobile applications [27].

#### *2.3. Machine Learning Process*

ML is a branch of artificial intelligence that focuses on developing applications that learn from data without being explicitly programmed for the learned tasks. Traditional ML methods make predictions based on past data. The ML process lifecycle consists of multiple sequential steps: data extraction, data preprocessing, feature selection, model training, model evaluation, and model deployment [9]. Supervised learning, unsupervised learning, semisupervised learning, reinforcement learning, and deep learning are the main subcategories of ML [28]. Supervised learning uses a labelled dataset to train a model for classification or regression problems, depending on the output variable type (continuous or discrete). Unsupervised learning identifies internal structures (clusters) and characteristics of a dataset and does not require labelled training data. Semisupervised learning mixes supervised and unsupervised techniques and is used when only limited labelled data are available [29]. In reinforcement learning, no training data are involved; instead, the model parameters are updated with feedback received from the environment, proceeding in prediction-and-evaluation cycles [30]. DL learns and improves by analysing data on its own, using models such as artificial neural networks (ANN) with a large, or deep, number of processing layers [31].
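The supervised-learning setting described above can be illustrated with a minimal nearest-neighbour classifier. This is an illustrative sketch, not drawn from any of the reviewed studies; the function name and data are ours.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=1):
    """k-nearest-neighbour classification: predictions come from a labelled
    training set, the defining ingredient of supervised learning."""
    preds = []
    for x in np.asarray(X_test, dtype=float):
        dists = np.linalg.norm(np.asarray(X_train, dtype=float) - x, axis=1)
        nearest = np.asarray(y_train)[np.argsort(dists)[:k]]   # labels of the k closest points
        preds.append(np.bincount(nearest).argmax())            # majority vote
    return np.array(preds)
```

In the malware-detection setting, each row of `X_train` would be a feature vector extracted from an APK and the labels might be 0 for benign and 1 for malicious.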

#### **3. Methodology**

Android was first released in 2008, and security concerns were raised a few years later as Android applications grew in popularity [2]. Applying ML to software security has received increasing attention in the last five years, with many researchers continuously identifying and proposing novel ML-based methods [9]. This review was conducted according to the Preferred Reporting Items for Systematic reviews and Meta-Analysis (PRISMA) model [32]. Based on the objective of this study, we first formulated several research questions (see Section 3.1). Next, a search strategy was defined to identify studies that could answer these questions; the databases used and the inclusion and exclusion criteria were also defined at this stage. Third, study selection criteria were defined to identify the studies answering the formulated research questions. The fourth stage, data extraction and synthesis, describes how the collected studies were analysed to answer the research questions. As the last step of the review process, we reviewed threats to the validity of the review and the mechanisms used to reduce bias and other factors that could have influenced the outcomes of this study.

#### *3.1. Research Questions*

This systematic review aims to answer the following research questions.

RQ1: What are the existing reviews conducted on ML/DL-based models to detect Android malware and source code vulnerabilities?

RQ2: What code/APK analysis methods can be used in malware analysis?

RQ3: What ML/DL-based methods can be used to detect malware in Android?

RQ4: What are the accuracy, strengths, and limitations of the proposed models for Android malware detection?

RQ5: Which techniques can be used to analyse Android source code to detect vulnerabilities?

#### *3.2. Search Strategy*

The search strategy outlines the most relevant bibliographic sources and search terms. In this review, we used several top research repositories as the main sources for identifying studies: ACM Digital Library, IEEE Xplore Digital Library, ScienceDirect, Web of Science, and Springer Link. Google Scholar and ResearchGate were also used to identify research published in other quality venues. The search string used to browse these repositories contained the following terms: ("android malware") OR ("malware detection") OR ("machine learning") OR ("deep learning") OR ("static analysis") OR ("dynamic analysis") OR ("hybrid analysis") OR ("malware analysis") OR ("android vulnerability analysis") OR ("ML based malware detection") OR ("DL based malware detection").

#### *3.3. Study Selection Criteria*

Since the trend of mobile malware detection using ML techniques has grown since 2016, we limit our review to related work from 2016 to May 2021. The database search in the top research repositories initially identified 109 research papers, and 11 more were identified from other sources. Of these 120 papers, 5 were excluded as duplicates and another 5 because they were not publicly available, leaving 110 articles. A further 4 articles were excluded due to data analysis or experimental issues in the given context, although their full text was available. The remaining 106 articles (120 − 5 − 5 − 4 = 106) were reviewed in this study. We performed the snowballing process [33], considering all references in the retrieved papers and evaluating all papers referencing the retrieved ones, which resulted in two additional relevant papers; we applied the same selection process to these as to the retrieved papers. The snowballing search was conducted in March 2021. Figure 2 shows a summary of the paper selection method for this systematic review.

**Figure 2.** PRISMA method: collection of papers for the review.

#### *3.4. Data Extraction and Synthesis*

We extracted data from 9 studies to answer RQ1, which concerns existing literature reviews on Android malware detection using ML/DL models and Android vulnerability analysis. For RQ2, 22 studies related to Android code/APK analysis techniques usable for malware analysis were identified. To answer RQ3, on ML/DL-based techniques for detecting malware, we extracted data from 18 different studies. Data from 36 research studies were extracted to answer RQ4, which concerns detection model accuracy, strengths, and weaknesses. The remaining 21 papers, on Android source code vulnerability analysis and detection methods, were used to answer RQ5.
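As a quick consistency check, the per-question counts above partition the 106 reviewed articles exactly:

```python
# Number of reviewed papers mapped to each research question (Section 3.4).
papers_per_rq = {"RQ1": 9, "RQ2": 22, "RQ3": 18, "RQ4": 36, "RQ5": 21}

# Every reviewed article answers exactly one research question.
assert sum(papers_per_rq.values()) == 106
```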

#### *3.5. Threats to Validity of the Review*

This review was conducted using the systematic approach explained above, and we tried to minimise bias and other factors affecting the study. Although the review was comprehensive, there may still be good papers not covered here because they are not available in the research repositories we used. The selection period runs from 2016 to May 2021, since the use of ML techniques for malware detection has increased significantly during this period due to recent advances in artificial intelligence; comprehensive studies conducted before then were therefore not captured in our work. When searching for papers, we considered only research written in English. Because of this limitation, our work may have overlooked some important works written in other languages such as Chinese, German, and Spanish.

#### **4. Related Work**

Previous reviews in [9,13,17,34–37] discussed various ML-based Android malware detection techniques and ways to improve Android security.

The review in [34] systematically examined studies of static analysis techniques for Android applications from 2011 to 2015 and summarised the tools that can be used to perform static Android code analysis. Abstract representation, taint analysis, symbolic execution, program slicing, code instrumentation, and type/model checking were identified as fundamental analysis methods. Although this review correctly identified the most widely used approaches for detecting privacy- and security-related issues, the applicability of static analysis techniques to malware detection was not discussed, and it did not take into account recent research proposing novel analysis and malware detection methods. The study in [35] provided a good systematic review, mainly of static analysis techniques for Android malware detection, identifying four methods: characteristic-based, opcode-based, program-graph-based, and symbolic-execution-based. It then evaluated the capabilities of static-analysis-based Android malware detection along these four methods using the existing literature. The paper identified ML and statistical models as possible means of identifying Android malware; however, the ML-based methods were not thoroughly reviewed, as the main focus was on static analysis techniques alone.

In [13], a survey was carried out using literature up to 2017 to identify malware detection techniques together with their advantages and disadvantages, grouping several approaches for identifying Android malware under static and dynamic analysis. However, the analysis of this survey was not comprehensive, as it focused on a limited number of studies. Building on previous studies, a systematic review was conducted in [17]. According to it, there are five types of Android malware detection techniques: static detection, dynamic detection, hybrid detection, permission-based detection, and emulation-based detection. It also summarised the reviewed work with the model accuracy of malware detection, but did not discuss the approach of those studies. The review in [9] analysed several studies conducted until 2019 related to ML models for detecting Android malware. The malware and APK analysis methods were not discussed in detail, since identifying different ML models was the priority of that review; analysing the accuracies of the identified models would have strengthened it, and novel ML/DL and other models were also outside its focus. The review in [36] provides a good analysis of the static, dynamic, and hybrid detection techniques used in existing Android malware detection research, and also discusses several machine learning and deep learning models. However, it did not comprehensively analyse the model accuracy of the ML methods, as it focused more on the detection approaches themselves than on their accuracy. Hence, these works differ from our study.

In [37], a systematic review of DL-based methods for Android malware defence was presented. Malware detection, malware family detection, repackaged/fake app detection, adversarial learning attacks and protections, and malicious behaviour analysis were identified as the malware defence objectives, together with the DL models used. Although the possible DL models were identified, analysing their accuracy and comparing them with traditional ML methods and other hybrid approaches would still be beneficial.

Apart from Android malware detection techniques, source code vulnerability analysis is also important to address security concerns in Android. The survey in [38] analysed several studies, up to 2017, on ML-based and data mining approaches that can be used to identify software vulnerabilities. Though this survey provides a good analysis, most of the research work considered concerned general software security; therefore, vulnerability analysis in Android code was not discussed. However, findings such as the usage of ML models for vulnerability analysis are still beneficial for analyses related to specific programming languages.

However, several limitations have been identified in the above works, such as not covering recent proposals on ML methods to detect malware, narrow scopes, and a lack of critical appraisal of the suggested detection methods. The lack of a thorough analysis of ML/DL-based methods was also identified as a limitation of existing works. Android malware detection and Android code vulnerability analysis have a lot in common, and ML methods used in one task can be customised for use in the other. However, to the best of our knowledge, there are no reviews that cover these two areas together. These shortcomings have been addressed in this work, and therefore our work is unique.

#### **5. Machine Learning to Detect Android Malware**

Malware detection in Android can be performed in two ways: signature-based detection methods and behaviour-based detection methods [39]. The signature-based detection method is simple, efficient, and produces few false positives. The binary code of the application is compared with the signatures in a known malware database. However, there is no possibility of detecting unknown malware using this method. Therefore, the behaviour-based/anomaly-based detection method is the most commonly used approach. This method usually borrows techniques from machine learning and data science. Many research studies have been conducted to detect Android malware using traditional ML-based methods such as Decision Trees (DT) and Support Vector Machines (SVM), and novel DL-based models such as Deep Convolutional Neural Networks (Deep-CNN) [40] and Generative Adversarial Networks [41]. These studies have shown that ML can be effectively utilised for malware detection in Android [9]. Most of these studies used datasets such as Drebin [42], Google Play [43], AndroZoo [44], AppChina [45], Tencent [46], YingYongBao [47], Contagio [48], Genome/MalGenome [49], VirusShare [50], IntelSecurity/McAfee [51], MassVet [52], Android Malware Dataset (AMD) [53], APKPure [54], Android Permission Dataset [55], AndroTotal [56], Wandoujia [57], Kaggle [58], CICMaldroid [59], AZ [60], and Github [61] to perform experiments and model training.
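
The signature-based mechanism described above amounts to a digest lookup against a known-malware database, which is also why unseen samples slip through. The sketch below is illustrative only; the signature set and its single placeholder digest (the SHA-256 of empty input) are assumptions, not taken from the cited studies:

```python
import hashlib

# Hypothetical signature database of known-malware SHA-256 digests.
# The single entry is the digest of empty input, used purely as a
# placeholder so the example is self-contained.
KNOWN_MALWARE_SIGNATURES = {
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
}

def sha256_of(apk_bytes: bytes) -> str:
    """Digest of the raw APK bytes -- the 'signature' in this toy model."""
    return hashlib.sha256(apk_bytes).hexdigest()

def is_known_malware(apk_bytes: bytes) -> bool:
    """Signature-based check: flags an APK only if its digest is already
    in the database, so previously unseen malware is never detected."""
    return sha256_of(apk_bytes) in KNOWN_MALWARE_SIGNATURES
```

Real engines match richer signatures (byte patterns, fuzzy hashes) rather than whole-file digests, but the lookup structure, and its blindness to unknown samples, is the same.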

#### *5.1. Static, Dynamic, and Hybrid Analysis*

As mentioned earlier, analysing APKs to extract features is required in order to use some of the ML techniques proposed in the literature. To this end, three analysis techniques are identified: static, dynamic, and hybrid analysis [62–64]. Static analysis is performed by analysing the bytecode and source code (or re-engineered APK) instead of running the app on a mobile device. Dynamic analysis detects malware by analysing the application while it is running in a simulated or real environment. However, dynamic analysis carries a risk of exposing the runtime environment to harm, since the malicious code is actually executed. Hybrid analysis combines methods from both static and dynamic analysis.

Under static analysis, four aspects were proposed in [28]: analysis techniques, sensitivity analysis, data structure, and code representation. Under analysis techniques, symbolic execution, taint analysis, program slicing, abstract interpretation, type checking, and code instrumentation were identified. For sensitivity analysis, object, context, field, path, and flow sensitivity were identified. For the data structure aspect, the Call Graph (CG), Control Flow Graph (CFG), and Inter-Procedural Control Flow Graph (ICFG) can be listed. Smali, Jimple, Wala-IR, Dex-Assembler, and Java bytecode or class files were listed under the code representation aspect. Kernel, application, and emulator can be taken under the inspection level aspect, and taint analysis and anomaly-based approaches can be taken under the dynamic analysis approaches.

The feature extraction methods available in static analysis are of two types: Manifest Analysis and Code Analysis [65]. Features such as package name, permissions, intents, activities, services, and providers can be identified in Manifest Analysis. In Code Analysis, features such as API calls, information flow, taint tracking, opcodes, native code, and cleartext analysis can be identified as possible features to extract. For dynamic analysis, five feature extraction methods were identified: (1) network traffic analysis, for features such as Uniform Resource Locators (URL), Internet Protocol (IP) addresses, network protocols, certificates, and non-encrypted data; (2) code instrumentation, for features such as Java classes, intents, and network traffic; (3) system calls analysis; (4) system resources analysis, for features such as processor, memory and battery usage, process reports, and network usage; and (5) user interaction analysis, for features such as buttons, icons, and actions/events. The study in [66] explored the security of ML for Android malware detection using a learning-based classifier with API calls extracted from converted Smali files. A sophisticated secure learning method was then proposed, showing that it is possible to enhance the security of the system against a wide range of evasion attacks. This model is also applicable to anti-spam and fraud detection areas. This study can be further improved by exploring the possibilities of identifying attacks that can alter the training process.
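
The manifest-based feature extraction step can be illustrated with a short sketch. It assumes the binary AndroidManifest.xml has already been decoded to plain XML (e.g., by a tool such as Apktool); the sample manifest and permission vocabulary below are hypothetical:

```python
import xml.etree.ElementTree as ET

# Attribute names in the manifest live in the Android XML namespace.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def extract_permissions(manifest_xml):
    """Permissions declared in an already-decoded AndroidManifest.xml."""
    root = ET.fromstring(manifest_xml)
    return [e.attrib.get(ANDROID_NS + "name", "")
            for e in root.findall("uses-permission")]

def to_feature_vector(perms, vocabulary):
    """Binary vector over a fixed permission vocabulary."""
    requested = set(perms)
    return [1 if p in requested else 0 for p in vocabulary]

# Hypothetical decoded manifest for a toy app.
sample = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
                      package="com.example.app">
  <uses-permission android:name="android.permission.INTERNET"/>
  <uses-permission android:name="android.permission.SEND_SMS"/>
</manifest>"""

vocab = ["android.permission.INTERNET",
         "android.permission.READ_CONTACTS",
         "android.permission.SEND_SMS"]
vector = to_feature_vector(extract_permissions(sample), vocab)  # [1, 0, 1]
```

Vectors of this shape are what the permission-based classifiers reviewed below are trained on.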

#### *5.2. Static Analysis with Machine Learning*

Static analysis is the most widely used mechanism for detecting Android malware. This is because malicious apps do not need to be installed on a device, as this approach does not use the runtime environment [67].

#### 5.2.1. Manifest Based Static Analysis with ML

Manifest based static analysis is a widely used static analysis technique. The model proposed in SigPID [68] discussed an Android permission-based malware detection mechanism. By developing a three-level data pruning method (permission ranking with negative rate, support-based permission ranking, and permission mining with association rules), this model identified only 22 permissions, out of all the permissions listed in the sample APKs, as significant. After that, ML algorithms were employed to detect the malware. For this process, a binary-format dataset of permissions was used, created from a database of malware and benign apps from Google Play. SVM outperformed the other studied ML algorithms (Naïve Bayes (NB) and DT) with over 90% accuracy. For permission-based static analysis, this work was conducted comprehensively. However, it would be better to also check the other variables, apart from permissions, which affect malware.
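
One of SigPID's three pruning levels, support-based permission ranking, can be approximated by ranking permissions by how often malware samples request them. This is a simplified sketch on toy data, not the authors' implementation:

```python
from collections import Counter

def support_based_ranking(samples, labels):
    """Rank permissions by support, i.e., the fraction of malware samples
    that request them. samples: permission sets; labels: 1 = malware."""
    malware = [s for s, y in zip(samples, labels) if y == 1]
    counts = Counter(p for s in malware for p in s)
    n = max(len(malware), 1)
    return sorted(((p, c / n) for p, c in counts.items()),
                  key=lambda t: t[1], reverse=True)

# Toy data: two malware apps and two benign apps.
apps = [{"SEND_SMS", "INTERNET"}, {"SEND_SMS"}, {"INTERNET"}, {"CAMERA"}]
labels = [1, 1, 0, 0]
ranking = support_based_ranking(apps, labels)
# Highest-support permission among the malware samples comes first.
```

Low-support permissions can then be pruned before classifier training, which is what shrinks the feature set to the small significant subset the paper reports.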

A malware detection method based on analysing Android manifest permissions was proposed in [69], using a static analyser and the decompilation support of APKTool for APK-to-code-level extraction. The AndroZoo repository was used as the dataset to train four different ML algorithms. Random Forest (RF), SVM, NB, and K-Means were used in the model validation process, and RF produced the highest accuracy for this model, with 82.5% precision and 81.5% recall. However, the accuracy of this model is comparatively low with respect to other studies conducted in the same area; the likely reason is that this approach compares permissions only.

The work proposed in [70] checked the possibility of using reduced-dimension vector generation for malware detection. Based on that, malware detection using ML models with permission-based static analysis was performed. In the feature selection stage of this approach, the model removed unnecessary features using a linear regression-based feature selection approach. As a result, the classification model can run in real time, since the training time was decreased, with an accuracy of over 96%. The Multi-Layer Perceptron (MLP) algorithm outperformed NB, Linear Regression, k-Nearest Neighbors (KNN), C4.5, RF, and Sequential Minimal Optimization (SMO). It would be better to also focus on hyperparameter selection to further increase the performance of the classification.

The model proposed in [71] performed a static analysis on Android apps. Android permissions and intents were used as the basic static features for malware classification, while URLs, emails, and IPs were used as the basic dynamic features. Initially, the APK files were decompiled using APKTool. The extractor module then extracted different types of malware-related information. After extracting the data by disassembling the dex files, the data were kept in text files and used to create the feature vector. Then the ML algorithms RF, NB, Gradient Boosting (GB), and Ada Boosting (AB) were used to train and test the malware detection model using the Drebin dataset and the Google Play Store. After performing the ML training and testing for each of the permission, intent, and network features individually, it was identified that the above ML algorithms performed with different accuracies. For permissions, RF performed well with 0.98 precision and recall; for intents, NB performed well with 0.92 precision and 0.93 recall; and for network features, both RF and AB performed similarly well with 0.97 precision and recall. Though this research concluded with such accuracies for malware detection, it still lacks the study of some other features, such as API calls.

An Android malware detection technique using feature weighting with joint optimisation of weight mapping and classifier parameters is proposed in the JOWMDroid framework in [72]. This model is a static analysis-based technique that selected a certain number of malware-related features out of the features extracted from the app. This was done by decompiling the APK into the manifest and classes.dex files and preparing a binary feature matrix. Initial weights were calculated using the RF, SVM, Logistic Regression (LR), and KNN ML models. Weight mapping functions were designed to map the initial weights to final weights. As the last step, the classifier and weight mapping function parameters were jointly optimised by the Differential Evolution algorithm. The Drebin, AMD, Google Play, and APKPure datasets were used to train the model. Finally, it was identified that among weight-unaware classifiers, RF performed better with 95.25% accuracy, while for weight-aware classifiers, KNN and MLP performed better. However, with the integration of the JOWM-IO method, SVM and LR beat RF with over 96% accuracy. If the correlation between features were also considered, the model accuracy for detecting malware would increase.
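
The joint optimisation step can be illustrated with a minimal Differential Evolution (DE/rand/1/bin) loop. JOWMDroid optimises the weight mapping and classifier parameters against classification performance; in this sketch a simple stand-in objective is minimised instead, and all parameter names and values are illustrative:

```python
import random

def differential_evolution(objective, dim, bounds=(0.0, 1.0),
                           pop_size=20, F=0.6, CR=0.9,
                           generations=60, seed=42):
    """Minimal DE/rand/1/bin loop minimising `objective` over [lo, hi]^dim."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    scores = [objective(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct individuals other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees one mutated gene
            trial = []
            for j in range(dim):
                if rng.random() < CR or j == j_rand:
                    v = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    trial.append(min(max(v, lo), hi))
                else:
                    trial.append(pop[i][j])
            # Greedy selection: keep the trial only if it is no worse.
            s = objective(trial)
            if s <= scores[i]:
                pop[i], scores[i] = trial, s
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

# Stand-in objective: squared distance of the weights from a hypothetical
# optimum (a real setup would minimise classification error instead).
target = [0.9, 0.1, 0.5]
best_w, best_loss = differential_evolution(
    lambda w: sum((wi - ti) ** 2 for wi, ti in zip(w, target)), dim=3)
```

The greedy replacement rule is what makes DE attractive for this kind of black-box tuning: no gradients of the classifier are needed, only repeated evaluations of the objective.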

Table 1 comparatively summarises the above research studies related to manifest analysis based methods.


#### **Table 1.** Manifest Based Static Analysis with ML.

#### 5.2.2. Code Based Static Analysis with ML

Code based analysis is the other way of performing static analysis to detect Android malware with ML. The model proposed in TinyDroid [39] analysed the latest malware listed in the Drebin dataset. Instruction simplification and ML are used in the model. Using the DEX files decompiled by converting the APK to Smali code, the opcode sequence was abstracted. Features were then extracted from it through N-grams and integrated with an exemplar selection method, in which a good representative subset of the data was generated through a clustering algorithm, Affinity Propagation (AP). AP was chosen because the number of clusters does not need to be determined or estimated before running it. The generated 2-, 3-, and 4-gram sequences were then fed into the SVM, KNN, RF, and NB ML classifiers. The RF algorithm was identified as the optimal algorithm for this scenario, with a 0.915 True Positive Rate, 0.106 False Positive Rate, 0.876 precision, and 0.915 recall for the 2-gram sequences. High accuracy rates were also achieved for the 3- and 4-grams across the studied ML algorithms. However, the proposed method still has issues, such as using malware samples taken from only a few research studies and organisations, and a lack of metamorphic malware samples. Therefore, some malware could remain undetected.
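
The N-gram step above is mechanical: slide a window of length n over the abstracted opcode sequence and count the resulting tuples against a fixed vocabulary. A minimal sketch, with hypothetical opcode names:

```python
from collections import Counter

def ngrams(seq, n):
    """All contiguous n-grams of an opcode sequence."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def ngram_features(seq, n, vocabulary):
    """Count vector over a fixed n-gram vocabulary."""
    counts = Counter(ngrams(seq, n))
    return [counts[g] for g in vocabulary]

# Toy abstracted opcode sequence from a decompiled method.
opcodes = ["invoke", "move", "invoke", "move", "return"]
vocab = [("invoke", "move"), ("move", "invoke"), ("move", "return")]
vec = ngram_features(opcodes, 2, vocab)  # [2, 1, 1]
```

The same function produces the 3- and 4-gram variants by changing `n`, which is how the paper's three feature sets are obtained from one opcode stream.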

The approach proposed in [73] used the Drebin dataset with 5560 malware samples, along with 361 malware samples from the Contagio dataset and 5900 benign apps from Google Play, to propose another approach to detect malware by analysing the API calls used in operand sequences. For the malware prediction model, package-level details were extracted from the API calls. Package n-grams were extracted from the package sequence, which represents application behaviour. They were then combined with the DT, RF, KNN, and NB ML algorithms to build a predictive model, and the study concluded that the RF algorithm performed with an accuracy of 86.89% after training the model on 2415 package n-grams. It would be better to also consider the other information contained in operands, since it might affect the overall model. The relationships between system functions, sensitive permissions, and sensitive APIs were analysed initially in Anrdoidect [74]. A combination of system functions was used to describe the application behaviours and construct eigenvectors using the dynamic analysis technique. Based on the eigenvectors, effective malware detection methodologies were compared along with NB, the J48 DT, and the application functions decision algorithm, and it was identified that the application functions decision algorithm outperformed the others. There are still some improvements to be made to this approach.

In the MaMaDroid [75] model, the API calls performed by apps were abstracted to classes, packages, or families using static analysis techniques. The call graph of each app was then determined, and the sequences of API calls obtained from it were modelled as Markov chains. Classification was then performed using the RF, KNN, and SVM ML algorithms, and it was identified that RF had the highest accuracy among the three. However, dynamic analysis was not considered in this method. Dynamic analysis is useful for analysing API calls in a runtime environment to detect malicious applications.
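
The Markov chain features here are just row-normalised transition probabilities between abstracted calls, flattened into one vector per app. A greatly simplified sketch of that idea (the two-state abstraction and toy call sequence are hypothetical, far coarser than MaMaDroid's family/package modes):

```python
from collections import defaultdict

def transition_features(call_sequence, states):
    """Row-normalised Markov transition probabilities over abstracted
    API calls, flattened into a feature vector."""
    counts = defaultdict(int)
    totals = defaultdict(int)
    for src, dst in zip(call_sequence, call_sequence[1:]):
        counts[(src, dst)] += 1
        totals[src] += 1
    return [counts[(s, d)] / totals[s] if totals[s] else 0.0
            for s in states for d in states]

# Toy sequence of calls abstracted to their top-level package family.
calls = ["java", "android", "java", "java", "android"]
features = transition_features(calls, ["java", "android"])
# Order: (java->java, java->android, android->java, android->android)
```

Each app yields one such fixed-length vector regardless of its call-sequence length, which is what makes the representation usable by RF, KNN, and SVM.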

An Android malware detection approach using the method-level correlation relationships of an application's abstracted API calls was discussed in [76]. Initially, the source code of Android applications was split into methods, and the abstracted API calls were kept. After that, the confidence of the association rules between those calls was calculated. This approach captured the behavioural semantics of the application. Then the SVM, KNN, and RF algorithms were used to identify the behavioural patterns of the apps and classify them as benign or malicious. The Drebin and AMD datasets were used, and 96% accuracy was achieved with the RF algorithm. Despite this high accuracy, the method does not address problems such as dynamic loading, native code, and encryption. If dynamic analysis methods were also used, the accuracy of this model would increase further.
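
The confidence of an association rule A → B is the fraction of "transactions" (here, methods) containing A that also contain B. A minimal sketch with hypothetical abstracted call names, not the cited study's rule miner:

```python
def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent over method-level
    transactions of abstracted API calls (each transaction is a set)."""
    has_a = [t for t in transactions if antecedent <= t]
    if not has_a:
        return 0.0
    return sum(1 for t in has_a if consequent <= t) / len(has_a)

# Toy transactions: abstracted API calls seen in four methods.
methods = [{"Net.open", "Crypto.encrypt"},
           {"Net.open", "Sms.send"},
           {"Net.open", "Crypto.encrypt", "Sms.send"},
           {"Ui.draw"}]
c = confidence(methods, {"Net.open"}, {"Crypto.encrypt"})  # 2/3
```

Rules with high confidence across malware samples become the behavioural features fed to the classifiers.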

The model named SMART in [77] proposed a semantic model of Android malware based on Deterministic Symbolic Automata (DSA) to comprehend, detect, and classify malware. This approach identified 4583 malware samples that were not identified by leading anti-malware tools. Two main stages were included in this approach: malicious behaviour learning, and malware detection and classification. In Stage 1, the model identified semantic clones among malware, and semantic models were constructed based on them. Then malicious features were extracted from the DSA, and ML techniques were used to detect malware in Stage 2, after performing static analysis activities with bytecode analysis. Random Forest achieved the best classification results with 97% accuracy, while AB, C4.5, NB, and Linear SVM provided lower accuracy. Therefore, this work identified that DSA can be used for malware detection. DroidChain [78] proposed a static analysis model with a behaviour chain model. The malware detection problem was transformed into a matrix model using the Warshall algorithm to further analyse this approach. Privacy leakage, SMS financial charges, malware installation, and privilege escalation were proposed as the malware models in this study using the behaviour chain model. In the static analysis part, Smali codes were extracted using APKTool and DroidChain. Then the API call graph was generated using the Androguard [79] tool. After that, the incidence matrix was built, and the reachability of the matrix for detecting malware was calculated. The average accuracy of this model was 83%. This method could be improved to detect malware more accurately and efficiently by considering other static analysis features such as code analysis, permission analysis, etc.
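
The reachability computation in DroidChain's matrix model corresponds to a transitive closure, which Warshall's algorithm computes over the adjacency matrix of the behaviour graph. A minimal sketch over a toy graph (the matrix below is hypothetical):

```python
def warshall(adj):
    """Transitive closure of a boolean adjacency matrix (Warshall's
    algorithm): reach[i][j] is True when node j is reachable from i."""
    n = len(adj)
    reach = [row[:] for row in adj]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])
    return reach

# Toy behaviour graph: 0 -> 1 -> 2, node 3 isolated.
adj = [[False, True,  False, False],
       [False, False, True,  False],
       [False, False, False, False],
       [False, False, False, False]]
reach = warshall(adj)
```

A behaviour chain (e.g., "collect data, then send SMS") is then detectable as a True entry linking the chain's first and last behaviours, even when no single edge connects them.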

The study conducted in [80] discussed testing malware detection techniques based on opcode sequences and API call sequences. A Hidden Markov Model (HMM) was trained, detection rates for models based on static, dynamic, and hybrid approaches were identified, and it was concluded that the hybrid approaches are more effective than performing static or dynamic analysis alone.

Tables 2 and 3 comparatively summarise the above research studies related to code analysis based methods: Table 2 lists studies with model accuracy below 90%, and Table 3 lists studies with model accuracy above 90%.


#### **Table 2.** Code Based Static Analysis with ML (Model Accuracy below 90%).

**Table 3.** Code Based Static Analysis with ML (Model Accuracy above 90%).


#### 5.2.3. Both Manifest and Code Based Static Analysis with ML

Some studies used both manifest and code based static analysis approaches to detect Android malware with ML. WaffleDetector [81], a static analysis approach to detect malware, was implemented using a set of Android program features, sensitive permissions, and API calls with the utilisation of an Extreme Learning Machine (ELM). The Tencent, YingYongBao, and Contagio datasets were used to train the algorithms. This method outperformed traditional binary classifiers (DT, Neural Network, SVM, and NB) with 97.06% accuracy. This approach still needs a few improvements, such as refining the combination of permissions and API calls.

The study conducted in [82] examined repackaged apps, identifying malware among them using code-heterogeneity features. The code of each app was partitioned into subsets, and the subsets were classified based on their behavioural features in the Smali code. Compared to nonpartitioning methods, this approach provides high accuracy, with a False Negative Rate (FNR) of 0.35% and a False Positive Rate (FPR) of 2.97%. This method also used some ensemble learning mechanisms. It would be better if the method improved the code heterogeneity mechanisms by using context and flow sensitivity.

Using the Drebin dataset, a method to detect Android malware using static analysis is discussed in [83]. With this method it was possible to detect malware in a sample of 10,865 applications with a high accuracy of 98.7%. In this method, the APK file was initially downloaded using a download link extracted from the APKPure website by web mining techniques. Then the APK content was extracted using Apktool, generating the AndroidManifest.xml and classes.dex files. The application features were extracted from AndroidManifest.xml using the AAPT utility, while classes.dex was decompiled into a jar file using the dex2jar tool. The number-of-lines-of-code feature was then extracted after extracting the Java source files from the jar file using the jd-cmd tool. This static analysis approach was evaluated using ten different ML algorithms: KNN, SVM, Bayes Net, NB, LR, J48, RT, RF, AB, and BA. Of these, RF with 1000 decision trees outperformed the others, with 0.987 precision, recall, and F-measure [83]. Though the model has high accuracy, it would be better to also study app behaviour by performing dynamic analysis.

In the RanDroid [84] model, already classified malicious and benign apps were used to train the SVM, DT, RF, and NB ML algorithms. Initially, the APK files were decompiled using Androguard (a Python-based tool) [79]. Then the required features (permissions, API calls, is\_crypto\_code, is\_dynamic\_code, is\_native\_code, is\_reflection\_code, and is\_database) were extracted and transformed into binary vectors. The ML algorithms were then trained, and DT was identified as the most suitable algorithm for this static analysis approach, with 97.7% accuracy. However, broadcast receivers, filter intents, Control Flow Graph analysis, deep native code analysis, and dynamic analysis are not considered in this study; these are identified as drawbacks.

In [85], a model named TFDroid was proposed, an ML-based malware detection approach using topics and sensitive data flow analysis with SVM, with an accuracy of 93.7%. FlowDroid, a static analysis tool, was used in this approach to extract the data flows in benign and malicious apps. The permission granularity was transformed using the data flow features. After that, a classifier was implemented for each category and the validation process was performed. The Google Play and Drebin datasets were used to train the model in this study. It would be better to also check the performance of other possible ML algorithms. Since this study is related to data flow, it would be better to perform dynamic analysis and introduce a hybrid model to increase the accuracy of detecting Android malware.

DroidEnsemble [86] analyses the static behaviours of Android apps and builds a model to detect Android malware. In this approach, string features such as permissions, hardware features, filter intents, API calls, and code patterns, together with structural features such as the function call graphs of the application, were extracted. After creating the binary vector, the SVM, KNN, and RF ML algorithms were applied to evaluate the performance of the features and their ensemble. The proposed methodology achieved detection accuracies of 95.8% and 90.68% for the string features and structural features, respectively. For the ensemble of both types, the accuracy increased to 98.4% with SVM. String features like API calls and structural features like function call graphs can be checked with dynamic analysis; therefore, the malware detection accuracy of this model would increase if both static and dynamic analysis were integrated.

Table 4 comparatively summarises the above research studies related to both manifest and code based static analysis methods with ML.


**Table 4.** Both Manifest and Code Based Static Analysis with ML.

#### *5.3. Dynamic Analysis with Machine Learning*

The second analysis approach is dynamic analysis. Using this approach, it is possible to detect malware with ML after running the application in a runtime environment. Android malware detection using a network-based approach was introduced in [87]. In this approach, a detection application was developed, containing three modules: network traces collection, network feature extraction, and detection. In the traces collection module, the network activities of running applications were monitored and the network traces were recorded periodically. The feature extraction module extracted features of the network traffic produced by the applications: Domain Name System (DNS) based features, HyperText Transfer Protocol (HTTP) based features, origin-destination based features, and Transmission Control Protocol (TCP) based features. The DT, LR, KNN, Bayes Network, and RF algorithms were used in the detection module. The RF algorithm provided the highest accuracy (98.7%) among them. However, this approach relies on network-based analysis; if the malware apps used encrypted transfers, the malware detection accuracy would decrease. Therefore, the model should also consider such factors.
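
The feature extraction module described above can be sketched as a function from recorded flows to a small per-app feature dictionary. The feature names and flow tuple layout below are illustrative, not the exact set used in the cited study:

```python
def network_features(flows):
    """Coarse per-app features from recorded network flows.
    Each flow is a (protocol, dest_host, bytes_out) tuple."""
    n = len(flows)
    dns = sum(1 for p, _, _ in flows if p == "DNS")
    http = sum(1 for p, _, _ in flows if p == "HTTP")
    hosts = {h for _, h, _ in flows}
    out = sum(b for _, _, b in flows)
    return {"dns_ratio": dns / n if n else 0.0,
            "http_ratio": http / n if n else 0.0,
            "distinct_hosts": len(hosts),
            "avg_bytes_out": out / n if n else 0.0}

# Hypothetical traces recorded while a toy app was running.
flows = [("DNS", "ads.example.com", 60),
         ("HTTP", "ads.example.com", 900),
         ("TCP", "c2.example.net", 400),
         ("DNS", "c2.example.net", 60)]
feats = network_features(flows)
```

The resulting dictionaries, one per monitoring window, are what the detection module's classifiers consume; note that all of these features vanish once traffic is encrypted end to end, which is the weakness noted above.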

The model proposed in 6th Sense [88] uses Markov Chain, NB, and Logistic Model Tree (LMT) methods to detect malware via dynamic analysis based on the sensors available in a mobile device. A context-aware intrusion detection system is studied in this approach by collecting and observing changes in sensor data while the applications were performing activities, which enhanced security. This model distinguishes malware from benign applications. Three types of malware activities (triggering, leaking information, and stealing data) were identified using this approach via the sensors available in the device. The collected data were divided into 75% for training and 25% for testing. For the Markov Chain-based detection technique, the training dataset was used to compute the state transitions and build a transition matrix. The training dataset was used with NB to determine the frequency of sensor condition changes. For the other ML algorithms, all the data were labelled as benign or malware. In this study, LMT outperformed the others with 99.3% precision and 99.98% recall. Though this study is a comprehensive one, it would be better if tradeoffs such as frequency accuracy, battery frequency, etc. were considered.

The method proposed in [89] discussed dynamic analysis-based techniques that extract a set of dynamic permissions from APKs from different sources and run them in an emulator. The model was then evaluated using the NB, RF, Simple Logistic, DT, and K-Star ML models, and it was identified that Simple Logistic performs well, with 0.997 precision and 0.996 recall. There were some issues in the dataset used in this model: for example, some benign and malicious apps used the same permissions, and some apps crashed when run in the emulator. Therefore, if the dataset were fine-tuned more before use, this model would provide even higher accuracy.

In [90], a framework called Service Monitor was proposed, a lightweight host-based detection system that can detect malware on devices. This framework was built using dynamic analysis. Service Monitor observes the way applications request system services to create a Markov Chain model, which is used as the feature vector for the classification tasks with the RF, KNN, and SVM ML algorithms. The RF method performed well, with an accuracy of 96.7%, after training the model with the AndroZoo, Drebin, and Malware Genome datasets. Some benign apps also requested system services in a similar way to malware, which could lead to some misclassifications by this model. To avoid that and enhance the classification accuracy, signature-based verification could be added to Service Monitor.

A mechanism named DATDroid was proposed in [91], a dynamic analysis based malware detection technique with an overall accuracy of 91.7%, with 0.931 precision and 0.9 recall, using the RF ML algorithm. In the initial stage, feature extraction was performed by collecting system calls, recording CPU and memory usage, and recording network packet transfers. Then, in the feature selection stage, the Gain Ratio Attribute Evaluator was applied. After that, model training and validation were performed to identify malicious and benign applications using the APKPure and Genome Project datasets. In addition to the features studied here, features like HTTP, DNS, TCP/IP, and memory usage patterns could have an impact on identifying malware, which should be discussed.

In [92], a framework named MEGDroid was proposed, which uses dynamic analysis to improve the event generation process in Android malware detection. This method automatically extracts and represents malware-related information as a domain-specific model. Decompilation, model discovery, integration and transformation, analysis and transformation, and event production were the steps included in this model. The model was then used to analyse malware after training with the AMD dataset. This model extracted every possible event source from the malware code and was developed as an Eclipse plugin. Based on the results, MEGDroid provides better coverage in malware detection through generating UI events, whereas generating system events and monitoring system calls are lacking in this approach.

Table 5 comparatively summarises the above research studies related to dynamic analysis based methods.



#### *5.4. Hybrid Analysis with Machine Learning*

Hybrid analysis is the third approach that can be used in ML-based Android malware detection. The review in [93] identified three approaches to malware detection: signature-based, anomaly-based, and topic modelling based approaches. ML algorithms such as DT, J48, RF, KNN, K-Means, and SVM can be applied to all these approaches. Signature-based malware was detected using ML algorithms after the feature extraction process; after feature extraction, sensitive API calls were also analysed before applying the ML algorithms. In the topic modelling approach, documents such as reviews, user documents, and app descriptions were initially collected before following an approach similar to the signature-based method. It was identified that the behaviour-based approach is better than the signature-based approach, and that good results can be achieved if topic modelling is combined with it. The hybrid analysis method is created when the dynamic analysis method is integrated with the static analysis method. According to this study, the SVM classifier with the hybrid analysis method performed better than the other ML algorithms.

The model proposed in [94] discussed a methodology of using ML algorithms with both static and dynamic analysis. In the static analysis approach, the manifest data of malicious and benign applications were taken as JSON files from the MalGenome and Kaggle datasets to train the ML model, and the trending apps were taken from well-known app stores. Androguard [79] was used to extract information from the APK files. After reverse engineering, decompiling, testing, and training with SVM, LR, and KNN based ML models, a JSON file was prepared. According to this model, LR was identified as the most suitable ML algorithm, with 81.03% accuracy. Many improvements are required to the proposed static analysis model, since it has a comparatively low accuracy. However, the proposed dynamic analysis approach outperformed the static analysis approach, with a high precision and recall of 93% using RF. In this approach, Droidbox was used to run the APKs obtained from MalGenome and Android Wave Lock in a sandbox environment. A CSV file was then obtained by converting the JSON file produced by analysing the APK, after which the key features were extracted. As the last step, the DT, RF, SVM, KNN, and LR ML algorithms were used with the extracted key features. The accuracy and results were then checked, and the particular app was labelled as malware or benign. It would be better if this study also explored the possibilities of using other ML algorithms.

In [95], the authors conducted an experiment using various ML technologies to analyse the relative effectiveness of static and dynamic analysis methods for detecting malware. This study used the Drebin dataset and a custom dataset to train the ML algorithms to classify malware and benign apps; altogether, the whole dataset contained 103 malware and 97 benign apps. For the static analysis, the APK files were reverse-engineered by a tool available in VirusTotal, and the permissions were extracted using a custom XML parser. Then binary feature vectors and permission vectors were created, and the ML algorithms were applied. For the dynamic analysis, the applications were executed on separate Android Virtual Devices (AVDs). System calls and their frequencies were traced using the MonkeyRunner tool, since the frequency representation of system calls contains behavioural information about apps; malware usually exhibits higher frequencies compared to benign apps. After that, a feature vector of system calls was created, and the ML algorithms were applied. The RF, J48, Naïve Bayes, Simple Logistic, BayesNet Tree Augmented Naïve Bayes (TAN), BayesNet K2, Instance Based Learner (IBk), SMO PolyKernel, and SMO NPolyKernel algorithms were used for both static and dynamic analysis. The best results, 0.96 for static analysis and 0.88 for dynamic analysis, were achieved when RF with 100 trees was used. Permissions extracted from the AndroidManifest.xml file were considered for the static analysis, and system calls extracted at runtime were considered for the dynamic analysis.
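
The system-call frequency representation used above reduces a traced run to a fixed-length vector of relative call frequencies. A minimal sketch, with a hypothetical trace and call vocabulary:

```python
from collections import Counter

def syscall_frequency_vector(trace, vocabulary):
    """Relative frequency of each system call in a recorded trace.
    Malware traces tend to show skewed frequencies versus benign apps."""
    counts = Counter(trace)
    total = len(trace)
    return [counts[s] / total if total else 0.0 for s in vocabulary]

# Hypothetical trace captured while exercising an app in an emulator.
trace = ["open", "read", "read", "sendto", "read", "open"]
vocab = ["open", "read", "sendto", "write"]
vec = syscall_frequency_vector(trace, vocab)
```

As with the permission vectors on the static side, every app maps to the same vector length, so the same classifiers can be trained on either representation and their accuracies compared directly.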

The model proposed in [96] explained a hybrid analysis process to detect malware using ML algorithms, with an accuracy of 80% when analysing permissions in the static analysis approach and 60% when analysing system calls. Malware samples were collected using a honeypot and search repositories such as Androditotal to train the model. However, this study does not consider other features which affect malware detection and which should be included to achieve a high-accuracy model.

In [97], a hybrid analysis-based efficient mechanism for Android malware detection was proposed, which used the Malware Genome dataset and the Drebin dataset to train the ML and DL models in the static analysis approach and the CICMalDroid dataset in the dynamic analysis approach; 261 combined features were extracted for the hybrid analysis. To increase performance, this model applied dimension reduction using Principal Component Analysis (PCA). SVM, KNN, RF, DT, NB, MLP, and GB were used to train and test the model. Out of these ML/DL algorithms, GB outperformed the others in terms of accuracy (96.35%), but it took a comparatively long training time. Forty-six features from the dynamic analysis results were also analysed. After performing the combined hybrid analysis, GB again performed well, with an accuracy of 99.36% and better efficiency compared to RF and MLP. The runtime environment and configuration should be studied further, since this work does not cover them fully.
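The hybrid step amounts to concatenating static and dynamic feature vectors before reducing dimensionality. The sketch below illustrates the concatenation, with a crude zero-variance column filter standing in for PCA (the cited study used real PCA, which requires a linear-algebra library); all values are invented.

```python
def hybrid_features(static_vec, dynamic_vec):
    """Concatenate static (e.g. permission flags) and dynamic (e.g.
    system-call frequency) feature vectors into one hybrid vector."""
    return list(static_vec) + list(dynamic_vec)

def drop_constant_features(rows):
    """Crude dimensionality-reduction stand-in: drop zero-variance columns.
    (Not PCA; shown only to illustrate why reduction helps performance.)"""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if len(set(col)) > 1]
    return [[row[i] for i in keep] for row in rows], keep

app_a = hybrid_features([1, 0, 1], [4, 0, 2])   # illustrative values
app_b = hybrid_features([1, 1, 0], [7, 0, 0])
reduced, kept = drop_constant_features([app_a, app_b])
```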

The model described in [98] proposed a Tree Augmented Naïve Bayes (TAN) based hybrid malware detection mechanism considering both static and dynamic features, namely API calls, permissions, and system calls. LR classifiers were trained on these three features, the Drebin, AMD, AZ, GitHub, and GP datasets were used, and the output relationships were modelled as a TAN to decide whether a given app is malicious or benign, with an accuracy of 0.97. There is a possibility of some malware remaining undetected by the model, which could be reduced using Reinforcement Learning techniques.

Tables 6 and 7 comparatively summarise the above research studies related to hybrid analysis based methods, where Table 6 lists studies with model accuracy below 90% and Table 7 lists studies with model accuracy above 90%.


**Table 6.** Hybrid analysis based malware detection approaches (model accuracy is below 90% or overall accuracy is not available).

**Table 7.** Hybrid analysis based malware detection approaches (model accuracy is above 90%).


#### *5.5. Use of Deep Learning Based Methods*

It is also possible to use deep learning techniques to detect Android malware. MLDroid, a web-based Android malware detection framework, was proposed in [101] based on dynamic analysis. In this work, ML and DL methods were used, achieving an overall malware detection rate of 98.8%.

The model proposed in [102] discussed a method to detect malware using a semantic-based DL approach and implemented a tool called DeepRefiner. This approach applied Long Short-Term Memory (LSTM) to the semantic structure of Android bytecode, with two layers of detection and validation. LSTM was used rather than a plain Recurrent Neural Network (RNN) since RNNs suffer from the vanishing gradient problem. With this approach, it was possible to detect malware with an accuracy of 97.4% and a false positive rate of 2.54%, and it was efficient and accurate compared with traditional approaches. Since this approach relies on static analysis, some limitations can arise from the runtime environment, which could be addressed if the model adopted a hybrid analysis approach.

The MOCDroid [99] model discussed a multiobjective evolutionary classifier to detect malware in Android. It combined multiobjective optimisation with clustering to generate a classifier from third-party call group behaviours, producing an accuracy of 95.15%. Import term extraction, clustering, and applying a genetic algorithm were the three steps of this process. Initially, the DEX files were decompressed from the APK, and Java code was obtained using the JADX tool [103]. The document-term matrix was then constructed. Next, K-Means clustering was applied, since it was identified as the most accurate clustering model for this task, followed by the genetic algorithm. The results were compared against a random set of 10,000 benign and malicious apps using different antivirus engines. Other clustering methods could be considered to improve the accuracy of this method.
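The import-term-extraction step feeds a document-term matrix into the clustering stage. A minimal stdlib sketch of building such a matrix from per-app lists of API terms (the terms and apps below are hypothetical, not MOCDroid's data):

```python
def document_term_matrix(docs):
    """Build a document-term matrix from per-app lists of imported API terms.
    Rows are apps, columns are terms in sorted vocabulary order."""
    vocab = sorted({t for doc in docs for t in doc})
    matrix = [[doc.count(t) for t in vocab] for doc in docs]
    return vocab, matrix

apps = [
    ["telephony", "sms", "sms"],        # hypothetical malicious app terms
    ["ui", "graphics", "telephony"],    # hypothetical benign app terms
]
vocab, matrix = document_term_matrix(apps)
```

The resulting matrix is what a clustering algorithm such as K-Means would then operate on.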

The work proposed in [104] discussed a method to detect Android malware using a deep convolutional neural network (CNN). Raw opcode sequences from disassembled Smali programs were analysed using static analysis to classify the malware. The advantage of this method is that it automatically learns features indicative of malware; the work was inspired by n-gram based methods. To train the models, the Android Malware Genome project dataset [49] and the Intel Security/McAfee Labs dataset were used. The classification system achieves a precision and recall of 0.87. The accuracy of malware detection could be increased if dynamic analysis were also performed.
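Since the approach is inspired by n-gram methods over opcode sequences, the basic preprocessing step can be sketched as follows (the opcode sequence is illustrative, not from the cited dataset):

```python
def opcode_ngrams(opcodes, n=2):
    """Extract overlapping n-grams from a disassembled opcode sequence,
    the style of feature that inspired the CNN-based approach."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]

# A short, hypothetical Smali opcode sequence.
seq = ["invoke-virtual", "move-result", "if-eqz", "return-void"]
bigrams = opcode_ngrams(seq, n=2)
```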

A deep learning-based static analysis approach achieving an accuracy of 99.9% and an F1-score of 0.996 was experimented with in [105]. This approach used a dataset of over 1.8 million Android apps. The attributes of malware were detected through vectorised opcodes extracted from the bytecode of the APKs with one-hot encoding. After performing experiments on Recurrent Neural Network, Long Short-Term Memory, Neural Network, Deep ConvNet, and Diabolo Network models, it was identified that the Bidirectional Long Short-Term Memory (BiLSTM) is the best model for this approach. It would be better to analyse the complete bytecode using static analysis and check app behaviour with dynamic analysis to build a more comprehensive malware detection tool based on deep learning techniques.
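The one-hot encoding of vectorised opcodes mentioned above can be sketched in a few lines; the opcode vocabulary here is illustrative, and a real pipeline would feed the resulting rows into a sequence model such as a BiLSTM:

```python
def one_hot_encode(opcodes, vocab):
    """One-hot encode an opcode sequence: each opcode becomes a binary row
    with a single 1 at its vocabulary index."""
    index = {op: i for i, op in enumerate(vocab)}
    encoded = []
    for op in opcodes:
        row = [0] * len(vocab)
        row[index[op]] = 1
        encoded.append(row)
    return encoded

vocab = ["const", "invoke", "return"]   # toy vocabulary
encoded = one_hot_encode(["invoke", "const", "return"], vocab)
```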

The DL-Droid framework [106], based on deep learning techniques, proposed a new way of detecting Android malware with dynamic analysis. This approach achieved a detection rate of 97.8% using dynamic features only; when static features were also included, the detection rate increased to 99.6%. The experiments were performed on real devices, on which an application runs exactly the way a user experiences it. Comparisons of detection performance and code coverage were also included in this work, and traditional ML classifier performance was compared as well. This novel method outperformed methods such as NB, SL, SVM, J48, Pruning Rule-Based Classification Tree (PART), RF, and DL. In addition to this work, exploring the possibility of including an intrusion detection mechanism in DL-Droid would be a valuable addition.

The AdMat model proposed in [107] discussed a matrix-based CNN approach to detect Android malware. This model characterised apps by treating them as images. An adjacency matrix was constructed for each app after transferring the decompiled source code into a call graph in Graph Modelling Language (GML) format, and the matrix was reduced to a size of 219 × 219 to enhance the efficiency of data processing. These matrices served as the input images to the CNN, and the model was trained to identify and classify malware and benign apps, achieving an accuracy of 98.2%. Even though the model is highly accurate, there are limitations to this work, such as performing static analysis only, and its performance depends on the number of features used.
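The core data transformation in AdMat-style approaches is turning a call graph into a fixed-size adjacency matrix that a CNN can treat as an image. A toy sketch (3 × 3 instead of 219 × 219; node and edge names are invented):

```python
def call_graph_to_matrix(edges, nodes, size):
    """Build a fixed-size adjacency-matrix 'image' from call-graph edges.
    Nodes beyond the chosen size are ignored, mimicking the simplification
    step used to keep the input dimensions constant."""
    index = {n: i for i, n in enumerate(nodes[:size])}
    matrix = [[0] * size for _ in range(size)]
    for src, dst in edges:
        if src in index and dst in index:
            matrix[index[src]][index[dst]] = 1
    return matrix

nodes = ["onCreate", "sendSms", "encrypt"]          # hypothetical methods
edges = [("onCreate", "sendSms"), ("sendSms", "encrypt")]
matrix = call_graph_to_matrix(edges, nodes, size=3)
```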

The model proposed in [108] discussed a DL-based method that uses a CNN approach to analyse API call sequences, opcodes, and permissions to detect Android malware in a zero-day scenario. After training, the model achieved weighted average detection rates of 91% and 81% on the Drebin and AMD datasets, respectively. The model could be further improved if dynamic analysis techniques were also considered.

With an accuracy of 95%, a multimodal analysis of malware apps using information fusion was presented in [100], which used hybrid analysis techniques. The study used Case-Based Reasoning (CBR) for training and validation purposes. SVM and DT were compared against the proposed model, but these classic ML algorithms were outperformed by the CBR-based method. Some of the limitations could be addressed by improving the knowledge representation.

Tables 8 and 9 comparatively summarise the above research studies related to deep learning based malware detection methods, where Table 8 lists studies with model accuracy below 90% and Table 9 lists studies with model accuracy above 90%.

**Table 8.** Deep learning based malware detection approaches (model accuracy is below 90% or overall accuracy is not available).


**Table 9.** Deep learning based malware detection approaches (model accuracy is above 90%).


#### **6. Machine Learning Methods to Detect Code Vulnerabilities**

Hackers do not just create malware; they also try to find loopholes in existing applications and perform malicious activities. Therefore, it is necessary to find vulnerabilities in Android source code. A code vulnerability in a program can arise from a mistake made at design, development, or configuration time, which can be misused to infringe on security [38]. Detection of code vulnerabilities can be performed in two ways. The first is reverse-engineering the APK files using an approach similar to that discussed in Section 3. The second is identifying security flaws at the time of designing and developing the application [109]. The study conducted in [110] identified five main categories of security approaches: secure requirements modelling; extended Unified Modeling Language (UML) based secure modelling profiles; non-UML-based secure modelling notations; vulnerability identification, adaptation, and mitigation; and software security-focused processes. Under these categories, 52 security approaches were identified, all of which are used to identify software vulnerabilities at design and development time. Based on the findings of the surveys and interviews conducted in [111] on interventions for long-term software security, the importance of an automated code analysis tool to identify vulnerabilities in written code has been established. The empirical analysis conducted in [112] identified the correlation of static software metrics and the most informative metrics that can be used to find code vulnerabilities in Android source code.

#### *6.1. Static, Dynamic, and Hybrid Source Code Analysis*

Similar to analysing APKs for malware detection, there are three ways of analysing source code: static analysis, dynamic analysis, and hybrid analysis. In static analysis, a program is analysed without executing the source code, by converting the source into a generalised abstraction such as an Abstract Syntax Tree (AST) [113]. The number of falsely reported vulnerabilities depends on the accuracy of the generalisation mechanism. In dynamic analysis, the runtime behaviour of the application is monitored while using specific input parameters; the observed behaviour depends on the selection of those parameters, and some vulnerabilities may remain undetected [114].
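The idea of abstracting source code to an AST and inspecting it without execution can be shown concretely. Android analysers operate on Java or Kotlin; the self-contained sketch below uses Python's stdlib `ast` module purely to keep the example runnable, and the source snippet and rule are invented:

```python
import ast

# Static analysis sketch: parse source into an AST, then walk it looking for
# a simple property (module-level string constants, e.g. potential secrets)
# without ever executing the code.

SOURCE = '''
password = "hunter2"
def connect():
    return password
'''

def find_hardcoded_strings(source):
    """Return string constants bound by assignment (potential secrets)."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant):
            if isinstance(node.value.value, str):
                found.append(node.value.value)
    return found

secrets = find_hardcoded_strings(SOURCE)
```

Real static analysers apply far richer rule sets over the same kind of tree, which is why their false-positive rate depends on how faithful the abstraction is.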

Hybrid analysis provides the characteristics of both static and dynamic analysis: it can analyse the source code and run the application to identify vulnerabilities while employing detection techniques [115].

The study conducted in [116] performed an online experiment in which Android developers were the participants. Vulnerable code samples containing hard-coded credentials, encryption flaws, Structured Query Language (SQL) injections, and logging of sensitive data were given to the participants together with guidance from static analysis tools, and they were asked to indicate the appropriate fix. After analysing the experiment results, it was identified that automated code vulnerability detection support is required for developers to perform better when developing secure applications.
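One of the vulnerability classes shown to participants, SQL built by string concatenation, is simple enough to flag with a toy lint rule. The regex and code snippets below are illustrative only and not one of the cited tools:

```python
import re

# Toy lint rule: flag SQL query strings followed by '+', the concatenation
# pattern that enables SQL injection. Real linters work on an AST rather
# than regexes, but the idea is the same.
CONCAT_SQL = re.compile(r'"(?:SELECT|INSERT|UPDATE|DELETE)[^"]*"\s*\+',
                        re.IGNORECASE)

def lint_sql_concat(code):
    """Return the suspicious fragments found in a piece of source code."""
    return [m.group(0) for m in CONCAT_SQL.finditer(code)]

vulnerable = 'db.rawQuery("SELECT * FROM users WHERE name = " + userInput, null);'
safe = 'db.rawQuery("SELECT * FROM users WHERE name = ?", args);'

warnings = lint_sql_concat(vulnerable)
clean = lint_sql_concat(safe)
```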

Android linters can be applied to analyse Android source code. Linters have been proposed to detect and fix bad coding practices, and they perform static analysis based on an AST or Universal AST (UAST) generated from the written source code [117]. The study in [118] discussed several linters, such as PMD, CheckStyle, Infer, FindBugs, Detekt, Ktlint, and Android Lint, and their usage. Android Studio adopts Android Lint, which identifies 339 issues related to correctness, security, performance, usability, accessibility, and internationalisation. In the model proposed in FixDroid [27], security-oriented suggestions along with their fixes were provided to the developer once Android Lint identified security flaws. The FixDroid method could be further improved by employing ML techniques to produce highly accurate security suggestions.

However, just warning the developer about security issues in the code is not sufficient; there should also be a mechanism to inform the developer about the severity level of each security issue. Using app user reviews, OASSIS [119] proposed a method to prioritise static analysis warnings generated by Android Lint. Based on review analysis using sentiment analysis, it was possible to identify the issues in Android apps; after receiving prioritised lint warnings, developers are able to take prompt action. The study in [120] proposed a mechanism named MagpieBridge to integrate static analysis into Integrated Development Environments (IDEs) and code editors such as Eclipse, IntelliJ, Jupyter, Sublime Text, and PyCharm. However, the possibility of extending this to the Android platform should be discussed further.

In [121], a vulnerability identification approach for Secure Sockets Layer (SSL)/Transport Layer Security (TLS) certificate verification in Android applications was described using static and dynamic analysis. Using the proposed DCDroid framework, this experiment found that, out of 2213 analysed Android apps, 360 contained vulnerable code. It is therefore possible to identify some vulnerabilities through SSL/TLS certificate verification.

#### *6.2. Applying ML to Detect Source Code Vulnerabilities*

It has been proven that ML methods can be applied to a generalised architecture such as an AST to detect Android code vulnerabilities [38]. Most of the research was conducted using static analysis techniques to analyse the source code.

With the use of ML, vulnerability detection rules were extracted from static metrics, as discussed in [122]. Thirty-two supervised ML algorithms were considered for the most common vulnerabilities, and it was identified that 96% accuracy in vulnerability detection could be obtained with the J48 ML algorithm. The model proposed in [123] discussed an automated mechanism to classify benign and malicious code using the portable executable (PE) structure through static analysis and ML, with an accuracy of 98.77%. The proposed methodology used RF, GB, DT, and CNN as its models.

The study in [124] built a model to predict software vulnerabilities in code using ML before the code is released. After developing a source code representation using an AST and intelligently analysing it, the ML models were applied. Popular datasets such as NIST SAMATE, Draper VDISC, and the SATE IV Juliet Test Suite, which contain C, C++, Java, and Python source code, were used to train the model. However, this model could not locate the specific place of a vulnerability, which is identified as a drawback, and it has not been proven that the same approach can be applied to other programming languages and frameworks. Nevertheless, there is a possibility of using this approach for Android applications developed in Java.

In [125], a vulnerability detection system for C and C++ source code was proposed using ML and deep feature representation learning. Apart from using existing datasets, the Draper dataset was compiled from Debian and GitHub repositories, with millions of open-source functions labelled with carefully selected findings. The findings of the research were compared with Bag of Words (BOW), RF, RNN, and CNN models.

The study conducted in [126] developed a mechanism to classify subroutines in the C language as vulnerable or not vulnerable using ML methods. The National Vulnerability Database (NVD) was used to collect C programming code blocks and their known vulnerabilities. After preparing the AST and preprocessing the data, feature extraction, feature selection, and classification tasks were performed and ML algorithms were applied.
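The feature-selection step in pipelines like this can be illustrated with a crude stand-in: rank binary features by how differently they occur in vulnerable versus clean samples. This is not the cited study's method, and the data is invented:

```python
def rank_features(vuln_rows, clean_rows):
    """Rank binary features by the absolute difference in their occurrence
    rate between vulnerable and clean samples (a crude stand-in for the
    feature-selection step; real pipelines use stronger criteria)."""
    n_feats = len(vuln_rows[0])
    scores = []
    for j in range(n_feats):
        v = sum(r[j] for r in vuln_rows) / len(vuln_rows)
        c = sum(r[j] for r in clean_rows) / len(clean_rows)
        scores.append((abs(v - c), j))
    # Most discriminative features first.
    return [j for _, j in sorted(scores, reverse=True)]

vuln = [[1, 0, 1], [1, 0, 0]]    # illustrative feature rows
clean = [[0, 0, 1], [0, 0, 1]]
ranking = rank_features(vuln, clean)
```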

The applicability of deep learning to detecting code vulnerabilities was discussed in [127], which compared three DL algorithms: CNN, LSTM, and CNN-LSTM. The proposed model achieved an accuracy of 83.6% when applying the DL models. Using Deep Neural Networks, it was possible to predict vulnerable code components; the model in [128] evaluated this using several Java-based Android applications. In this mechanism, n-gram analysis and statistical feature selection were performed to construct features, and the resulting model can classify vulnerable classes with high precision, accuracy, and recall.

In [129], a model was proposed to detect zero-day Android malware using a distinctive parallel classifier, together with a mechanism to identify oncoming, highly elusive vulnerabilities in source code, achieving an accuracy of 98.27% with the ML algorithms PART, Ripple-Down Rule learner (RIDOR), SVM, and MLP.

#### ML-Based Vulnerability Detection Specifically for Android

Less research has been conducted on Android vulnerability detection with ML. The methodology of studies conducted on general-purpose programming languages could be applied to Android code vulnerability detection after training the models on Android-specific code datasets and adjusting the generalisation mechanism.

The work conducted in [130] prepared a manually curated dataset that can be used to fix vulnerabilities in open-source software. The possibility of automatically identifying security-related commits in a code repository has been demonstrated, since the dataset was successfully used to train classifiers.

In [131], a repository of Android security vulnerabilities named AndroVul was created, which includes dangerous permissions, security code smells, and high-risk shell command vulnerabilities. In [132], a study was conducted to predictively analyse the vulnerabilities in Internet of Things (IoT) related Android applications using static code metrics and ML. In this study, 1406 Android apps with various risk levels were taken, and six ML models (KNN, LR, RF, DT, SVM, and GB) were applied to examine security risk prediction. It was identified that RF performs well at the intermediate risk level, while GB performs well at the very high risk level compared to the other ML model-based approaches. The study conducted in [133] proposed an ML-based vulnerability detection mechanism to identify security flaws in Android Intents using hybrid analysis, with the AdaBoost algorithm used to perform the ML-based analysis.

Tables 10 and 11 summarise selected studies from the above related to Android vulnerability analysis. Table 10 lists the studies with model accuracy below 90%, and Table 11 lists those with model accuracy above 90%.


**Table 10.** Android vulnerability detection mechanisms (model accuracy is below 90%).

**Table 11.** Android vulnerability detection mechanisms (model accuracy is above 90%).


#### **7. Results and Discussion**

Based on the reviewed studies of ML/DL-based malware detection methods, it was identified that 65% of the studies used static analysis, 15% used dynamic analysis, and the remaining 20% followed a hybrid analysis technique, as illustrated in Figure 3. The popularity of static analysis may be due to its advantages over dynamic analysis, such as the ability to detect more vulnerabilities, to localise them, and to do so at lower cost.

**Figure 3.** Malware analysis techniques used in the reviewed studies.

Many ML/DL-based malware detection studies used code analysis as the feature extraction method; manifest analysis and system call analysis are the other widely used methods. Figure 4 illustrates the feature extraction methods used in the reviewed studies. It is possible to detect a substantial amount of malware by analysing decompiled source code rather than permissions or other features, which may explain the high usage of code analysis in malware detection.

Across these feature extraction methods, permissions, API calls, system calls, and opcodes are the most widely extracted features. This is illustrated in Figure 5, along with the other features extracted in the reviewed studies. Many hybrid analysis methods extract permissions as the feature for static analysis, and permissions are also comparatively easy to analyse; these could be the reasons for their high usage as an extracted feature. Services and network protocols are rarely used in feature extraction, possibly because they are comparatively difficult to analyse.

The datasets used in ML/DL-based Android malware detection studies to train the algorithms are illustrated in Figure 6. Drebin was the most widely used dataset in Android malware detection, appearing in 18 reviewed studies. Google Play, MalGenome, and AMD are the other widely used datasets. The reason for the high usage of the Drebin dataset may be that it provides a comprehensive labelled dataset, while the high usage of data from Google may follow from Google Play being the official Android app store.

It was identified that RF, SVM, and NB are the most widely studied ML models for detecting Android malware, perhaps because the resource cost of running RF-, SVM-, or NB-based models is low. Models such as CNN, LSTM, and AB are used less, since running such advanced models requires substantial computing power, and the trend towards DL-based models has only gathered pace in recent years. Table 12 summarises the widely used ML/DL algorithms with their advantages and disadvantages, and Figure 7 illustrates all of the studied ML/DL models with their usage in the reviewed studies.

The majority of the studies used hybrid analysis and static analysis as the source code analysis techniques for vulnerability detection in Android, as illustrated in Figure 8. To perform a highly accurate vulnerability analysis, the source code should be both analysed and executed, which may explain why hybrid analysis and static analysis are the most widely used source code analysis methods for detecting vulnerabilities.

**Figure 4.** Feature extraction methods used in the reviewed studies.

**Figure 5.** Extracted features in the reviewed studies.

**Figure 6.** Usage of datasets.



**Figure 7.** ML/DL models used in the reviewed studies.

**Figure 8.** Android source code vulnerability analysis methods.

#### **8. Conclusions and Future Work**

Any smartphone is potentially vulnerable to security breaches, but Android devices are more lucrative for attackers due to the platform's open-source nature and its larger market share compared to other mobile operating systems. This paper discussed the Android architecture and its security model, as well as potential threat vectors for the Android operating system. Based on the available literature, a systematic review of state-of-the-art ML-based Android malware detection techniques was carried out, covering the latest research from 2016 to 2021. It discussed the available ML and DL models and their performance in Android malware detection, code and APK analysis methods, feature analysis and extraction methods, and the strengths and limitations of the proposed methods. Malware aside, if a developer makes a mistake, it is easier for a hacker to find and exploit the resulting vulnerabilities; therefore, methods for detecting source code vulnerabilities using ML were also discussed. The work identified potential gaps in previous research and possible future research directions to enhance the security of the Android OS.

Both Android malware and its detection techniques are evolving. Therefore, we believe that similar future reviews are necessary to cover these emerging threats and their detection methods. As per our findings in this paper, since DL methods have proven to be more accurate than traditional ML models, it will be beneficial to the research community if more comprehensive systematic reviews can be performed by focusing only on DL-based malware detection on Android. The possibility of using reinforcement learning to identify source code vulnerabilities is another area of interest in which systematic reviews and studies can be carried out.

**Author Contributions:** Conceptualization, J.S., H.K. and M.O.A.-K.; methodology, J.S., H.K. and M.O.A.-K.; validation, J.S., H.K. and M.O.A.-K.; investigation, J.S.; Project administration, H.K.; writing—original draft preparation, J.S.; writing—review and editing, J.S., H.K. and M.O.A.-K.; visualization, J.S.; supervision, H.K. and M.O.A.-K.; All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** We thank the Accelerating Higher Education Expansion and Development (AHEAD) grant of Sri Lanka, University of Kelaniya—Sri Lanka and Robert Gordon University— United Kingdom for their support.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

