*3.3. Classification Algorithms*

In this section, the theoretical description of the machine learning and deep learning algorithms used in this research is presented.

### 3.3.1. K-Nearest Neighbors (KNN)

The KNN algorithm is a simple and widely used machine learning algorithm that classifies samples in many real-life applications by examining their neighbors. The mechanism of KNN is to find the distance between the normal and attack classes by selecting the object values closest to the k class values. The algorithm starts by loading the network data with the length of the input data [60]. KNN determines the k values nearest to a given set of values in the training dataset; the majority of these k values fall into a confirmed class, and the input sample is classified accordingly. In this research, the Euclidean distance function (*E<sub>i</sub>*) was used to find the distance between the object values. The Euclidean distance function is expressed as follows:

$$E_i = \sqrt{\left(a_1 - a_2\right)^2 + \left(b_1 - b_2\right)^2} \tag{2}$$

where *a*<sub>1</sub>, *a*<sub>2</sub>, *b*<sub>1</sub>, and *b*<sub>2</sub> are variables of the input data.
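As an illustration, a minimal scikit-learn sketch of such a KNN classifier follows; the feature matrix, labels, and the choice of k = 5 are hypothetical placeholders rather than values from this study.

```python
# Minimal sketch (not the authors' exact pipeline): KNN with the
# Euclidean distance of Equation (2), using scikit-learn.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.random.rand(100, 8)        # hypothetical training features
y_train = np.random.randint(0, 2, 100)  # hypothetical labels: 0 = normal, 1 = malware

# metric="euclidean" matches Equation (2); k = 5 is an assumed hyperparameter
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)

X_test = np.random.rand(10, 8)
print(knn.predict(X_test))              # majority vote of the 5 nearest neighbors
```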

### 3.3.2. Support Vector Machine (SVM)

SVM is a supervised machine learning algorithm developed to solve complex linear and nonlinear problems. It draws a hyperplane between the classes; the data points nearest the hyperplane, which determine its location and orientation, are called support vectors (SVs) [61]. Good performance is attained when the margin between the hyperplane and these nearest data points is maximized. The SVM can use a number of kernel functions, both linear and nonlinear; the RBF kernel is appropriate for non-linearly separable patterns because network data has a complex format. In this research, the Gaussian radial basis function was adopted to detect Android malware:

$$K(y, y') = \exp\left(-\frac{||y - y'||^2}{2\sigma^2}\right) \tag{3}$$

where *y* and *y*′ are feature vectors of the training data, ||*y* − *y*′||<sup>2</sup> is the squared Euclidean distance between them, and *σ* is the kernel width parameter.
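A brief scikit-learn sketch of an RBF-kernel SVM follows; note that scikit-learn parameterizes the kernel of Equation (3) as gamma = 1/(2*σ*<sup>2</sup>), and the data and *σ* = 1.0 here are illustrative assumptions, not values from this study.

```python
# Illustrative sketch: an RBF-kernel SVM as in Equation (3), via scikit-learn.
import numpy as np
from sklearn.svm import SVC

X_train = np.random.rand(100, 8)        # hypothetical feature vectors
y_train = np.random.randint(0, 2, 100)  # hypothetical labels

sigma = 1.0                             # assumed kernel width
svm = SVC(kernel="rbf", gamma=1.0 / (2 * sigma**2))  # gamma = 1 / (2 * sigma^2)
svm.fit(X_train, y_train)
print(svm.predict(np.random.rand(5, 8)))
```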

### 3.3.3. Linear Discriminant Analysis (LDA)

LDA is a linear machine learning algorithm used to solve applications with high dimensionality [62]. It is used to model and transform data from a high-dimensional space into a low-dimensional space by separating the classes of the data into two groups: normal and malicious packets. Figure 5 represents the LDA method for analyzing normal and abnormal packets, where the red line linearly separates the two classes of the data.

**Figure 5.** The linear discriminant analysis (LDA) method for analyzing datasets.
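The following scikit-learn sketch illustrates this idea, fitting LDA on hypothetical high-dimensional features and projecting them onto a single discriminant axis; the arrays are placeholders, not the study's data.

```python
# Sketch only: LDA projects high-dimensional features onto a
# low-dimensional discriminant axis and classifies the two groups.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_train = np.random.rand(100, 20)       # hypothetical high-dimensional features
y_train = np.random.randint(0, 2, 100)  # 0 = normal, 1 = malicious

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
X_low = lda.transform(X_train)          # with 2 classes, reduced to 1 dimension
print(X_low.shape, lda.predict(X_train[:5]))
```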

### 3.3.4. Deep Learning Models

CNN-LSTM is a fusion model created by combining CNN and LSTM, both deep learning algorithms. A CNN contains hidden neurons with trainable weights and bias parameters. It is broadly applied to analyze data in a grid layout, which distinguishes it from other architectures [63]. It is also called a feed-forward network because the input data stream in one direction, from the input layer to the output layer [64]. There are three main components in the CNN structure: the convolutional, pooling, and fully connected layers. The convolutional and pooling layers are employed for feature extraction and dimensionality reduction, while the fully connected layer is flattened and attached to the output of the previous layer. The main architecture of the CNN model for detecting Android malware applications is displayed in Figure 6.

**Figure 6.** Structure of the CNN model.
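For illustration, a minimal Keras sketch of this three-component structure follows; the input length and output size are assumptions, while the 32 filters and kernel size of 4 follow the parameters reported later in this section.

```python
# Minimal sketch of the convolution -> pooling -> fully connected CNN stack.
from tensorflow import keras
from tensorflow.keras import layers

cnn = keras.Sequential([
    keras.Input(shape=(100, 1)),                          # assumed input length
    layers.Conv1D(32, kernel_size=4, activation="relu"),  # feature extraction
    layers.MaxPooling1D(pool_size=4),                     # dimensionality reduction
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),                # fully connected output layer
])
cnn.summary()
```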

Hochreiter et al. [65] introduced the LSTM algorithm for learning long-term data dependencies. The LSTM is a type of recurrent neural network (RNN). The distinction between the LSTM and standard RNN techniques is the memory cells present in the LSTM structure. Every memory cell comprises four gates: the input, candidate, forget, and output gates. The forget gate determines whether input features must be discarded or kept, the input gate updates the memory cells in the LSTM structure, and the hidden state is always controlled by the output gate. Furthermore, the LSTM uses an embedded memory block and gate mechanism that enables it to address the vanishing and exploding gradient problems present in RNN learning [66].

The structure of the LSTM model is presented in Figure 7. Table 1 shows the parameters of the LSTM model; these parameter values were found to be significant for obtaining high performance in detecting Android malware. The convolution kernel size was 4, and the max-pooling size was 4 for selecting significant features from the filter layer. The dropout value was 0.50 to prevent the model from overfitting, the RMSprop optimizer function was used to optimize the model, and the error gradient was computed with a batch size of 150. The equations for the LSTM-related gates are defined as follows:

$$f_t = \sigma\left(W_f \cdot X_t + W_f \cdot h_{t-1} + b_f\right) \tag{4}$$

$$i_t = \sigma\left(W_i \cdot X_t + W_i \cdot h_{t-1} + b_i\right) \tag{5}$$

$$S_t = \tanh\left(W_c \cdot X_t + W_c \cdot h_{t-1} + b_c\right) \tag{6}$$

$$C_t = i_t * S_t + f_t * C_{t-1} \tag{7}$$

$$o_t = \sigma\left(W_o \cdot X_t + W_o \cdot h_{t-1} + V_o \cdot C_t + b_o\right) \tag{8}$$

$$h_t = o_t * \tanh\left(C_t\right) \tag{9}$$

where *X<sub>t</sub>* is the vector of the input features sent to the memory cell at time *t*; *W<sub>i</sub>*, *W<sub>f</sub>*, *W<sub>c</sub>*, *W<sub>o</sub>*, and *V<sub>o</sub>* represent the weight matrices; *b<sub>i</sub>*, *b<sub>f</sub>*, *b<sub>c</sub>*, and *b<sub>o</sub>* indicate the bias vectors; *h<sub>t</sub>* is the output value of the memory cell at time *t*; *S<sub>t</sub>* and *C<sub>t</sub>* are the candidate state and the state of the memory cell at time *t*, respectively; *σ* and *tanh* are activation functions; *i<sub>t</sub>*, *f<sub>t</sub>*, and *o<sub>t</sub>* are the values obtained for the input, forget, and output gates at time *t*, respectively; and *h<sub>t−1</sub>* represents the short-term memory vector.

**Figure 7.** The structure of the LSTM technique.
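To make the gate computations concrete, the following NumPy sketch evaluates Equations (4)–(9) for a single time step; the hidden size, the random weights, and the assumption that the input and hidden dimensions are equal (implied by the shared weight matrix per gate) are illustrative only.

```python
# Worked sketch of Equations (4)-(9) for one LSTM time step.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 4                                   # assumed hidden (and input) size
rng = np.random.default_rng(0)
W_f, W_i, W_c, W_o, V_o = (rng.standard_normal((n, n)) for _ in range(5))
b_f = b_i = b_c = b_o = np.zeros(n)
x_t = rng.standard_normal(n)            # input features X_t
h_prev = np.zeros(n)                    # h_{t-1}, short-term memory vector
C_prev = np.zeros(n)                    # C_{t-1}, previous cell state

f_t = sigmoid(W_f @ x_t + W_f @ h_prev + b_f)               # Eq. (4) forget gate
i_t = sigmoid(W_i @ x_t + W_i @ h_prev + b_i)               # Eq. (5) input gate
S_t = np.tanh(W_c @ x_t + W_c @ h_prev + b_c)               # Eq. (6) candidate state
C_t = i_t * S_t + f_t * C_prev                              # Eq. (7) cell state
o_t = sigmoid(W_o @ x_t + W_o @ h_prev + V_o @ C_t + b_o)   # Eq. (8) output gate
h_t = o_t * np.tanh(C_t)                                    # Eq. (9) hidden state
print(h_t)
```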

**Table 1.** Parameters of the LSTM model.

| Parameter | Value |
|---|---|
| Convolution kernel size | 4 |
| Max-pooling size | 4 |
| Dropout rate | 0.50 |
| Optimizer | RMSprop |
| Batch size | 150 |

The CNN-LSTM model was built as shown in Figure 8. It was trained using the training dataset, and its hyperparameters were adjusted using the Adam optimizer and the validation dataset. The CNN-LSTM model was then applied to the test dataset, mapping the features of each testing record to its real class: normal or a particular class of attack [67]. The architecture consisted of two one-dimensional convolutional layers that convolve the input vectors with 32 filters and a kernel size of 4, two fully connected dense layers composed of 256 hidden neurons each, and an output layer that applies the nonlinear SoftMax activation function used for multiclass classification tasks. To counter overfitting, global max-pooling and dropout layers were applied: the global max-pooling layer reduces overfitting of the learned features by capturing the maximum value, while the dropout layer deactivates a set of specific neurons in the CNN-LSTM network. The Adam optimizer updates the weights and minimizes the cross-entropy loss function. Table 2 shows the parameters of the CNN-LSTM model.

**Figure 8.** The structure of the CNN-LSTM model.
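One plausible Keras rendering of this stack is sketched below; the LSTM width of 64, the input length, the number of classes, and the exact layer ordering are assumptions, while the filter count, kernel size, dense-layer width, dropout, global max-pooling, SoftMax output, Adam optimizer, and cross-entropy loss follow the text above.

```python
# Hedged sketch of the described CNN-LSTM stack; layer order is assumed.
from tensorflow import keras
from tensorflow.keras import layers

n_classes = 5                                             # assumed number of classes
model = keras.Sequential([
    keras.Input(shape=(100, 1)),                          # assumed input length
    layers.Conv1D(32, kernel_size=4, activation="relu"),  # first 1D convolution
    layers.Conv1D(32, kernel_size=4, activation="relu"),  # second 1D convolution
    layers.LSTM(64, return_sequences=True),               # assumed LSTM width
    layers.GlobalMaxPooling1D(),                          # keeps max value per filter
    layers.Dropout(0.5),                                  # deactivates a set of neurons
    layers.Dense(256, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(n_classes, activation="softmax"),        # multiclass output
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```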

**Table 2.** Parameters of the CNN-LSTM model.

| Parameter | Value |
|---|---|
| 1D convolutional layers | 2 |
| Filters per convolutional layer | 32 |
| Kernel size | 4 |
| Dense layers | 2 × 256 neurons |
| Output activation | SoftMax |
| Regularization | Global max-pooling, dropout |
| Optimizer | Adam |
| Loss function | Cross-entropy |