A Novel Malware Detection Model in the Software Supply Chain Based on LSTM and SVMs

Zhou, Shuncheng; Li, Honghui; Fu, Xueliang; Jiao, Yuanyuan

doi:10.3390/app14156678

College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China

Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(15), 6678; https://doi.org/10.3390/app14156678

Submission received: 13 July 2024 / Revised: 26 July 2024 / Accepted: 30 July 2024 / Published: 31 July 2024

Abstract

With the increasingly severe challenge of Software Supply Chain (SSC) security, the rising trend in guarding against security risks has attracted widespread attention. Existing techniques still face challenges in both accuracy and efficiency when detecting malware in SSC. To meet this challenge, this paper introduces two novel models, named the Bayesian Optimization-based Support Vector Machine (BO-SVM) and the Long Short-Term Memory–BO-SVM (LSTM-BO-SVM). The BO-SVM model is constructed on an SVM foundation, with its hyperparameters optimized by Bayesian Optimization. To further enhance its accuracy and efficiency, the LSTM-BO-SVM model is proposed, building upon BO-SVM and employing LSTM networks for pre-classification. Extensive experiments were conducted on two datasets: the balanced ClaMP dataset and the unbalanced CICMalDroid-2020 dataset. The experimental results indicate that the BO-SVM model is superior to other models in terms of accuracy; the accuracy of the LSTM-BO-SVM model on the two datasets is 98.2% and 98.6%, respectively, which is 2.9% and 2.2% higher than that of the BO-SVM on these two datasets.

Keywords:

software supply chain; malware detection; long short-term memory network; Bayesian optimization algorithm; support vector machine

1. Introduction

With the rapid advancement of smart technology, software has penetrated every corner of modern society and become a basic element supporting its operation. Meanwhile, the accompanying software security issues have become a focus of public concern. The field of software development has undergone significant changes in recent years. The traditional model, characterized by a single organization or supplier developing an individual product, has evolved into a collaborative approach built on shared components, which has given rise to the formation of a linear Software Supply Chain (SSC) [1]. An SSC refers to all links and processes involved in software development and delivery [2]. As the software industry experiences robust growth, SSC architecture is diversifying and becoming more intricate. Concurrently, SSC security threats are escalating, challenging the robust security safeguards essential for software systems. Ensuring SSC security is paramount, as it is crucial for maintaining the integrity and safety of software throughout its entire life cycle. SSC security is not merely vital to the interests of developers, vendors, and enterprises; it is also essential for safeguarding the welfare of end-users and nurturing the health of the entire software ecosystem. Therefore, an in-depth examination of the SSC is of profound significance for uncovering latent security threats.

In recent years, SSC security incidents have occurred frequently, and the scope of their influence is constantly expanding [3]. According to a report released by SonicWall, there were 5.5 billion malware attacks worldwide in 2022, representing a 2% increase compared to 2021 [4]. In this context, it is critical to effectively and accurately identify malware in the SSC to prevent illegal attacks and unauthorized access. Traditional single-feature malware detection methods, especially signature-based techniques, are effective in identifying known malware. However, these methods have shortcomings: by changing its code structure and behavioral patterns, malware makes it difficult for single-feature-based detection methods to identify its true intent [5]. Therefore, researchers have been turning to the development and validation of multi-feature methods, including methods based on malware behavioral analysis [6]. The introduction of these methods has led to the widespread use of machine learning (ML) algorithms for malware identification. Recently, more advanced detection techniques have been proposed by the research community, such as SSC malware detection methods utilizing ML, deep learning (DL), and hybrids that combine DL with ML [7]. Aslan et al. [5] illustrate that ML methods yield better results when detecting both known and some unknown malware. For complex malware that is not yet widely recognized, DL-based and hybrid detection methods demonstrate higher efficacy.

In addition, another issue facing current research is how to maintain high-accuracy detection performance within the constraints of computational resources and time. To address these challenges, this paper proposes a novel detection model, the Bayesian-Optimized Support Vector Machine (BO-SVM). This model uses an SVM to identify malware in the SSC and calls the BO algorithm to optimize the SVM hyperparameters. Furthermore, this paper proposes an enhanced detection model that combines the Long Short-Term Memory network (LSTM) [8] with BO-SVM, referred to as the LSTM-BO-SVM model. Taking a set of static software features as the input, the LSTM network is first employed for software pre-classification to capture the temporal association between features. Then, the BO algorithm is introduced to optimize the hyperparameters of the SVM model. Finally, the optimized BO-SVM model is used to complete malware detection in the SSC. The following are the primary contributions of this paper:

(1): The BO-SVM model is innovatively applied to address the malware classification problem in the SSC. Through optimization of the hyperparameters of the SVM model, this model can significantly improve the efficiency and accuracy of the SVM model.
(2): A novel LSTM-BO-SVM model is also proposed for malware detection in the SSC. This model further improves the BO-SVM accuracy and reliability. By adopting the temporal analysis capability of the LSTM network a priori, the multidimensional features of malware can be captured. Based on these, the BO-SVM performance can be enhanced.
(3): Extensive experiments are conducted in order to comprehensively evaluate the performance of the BO-SVM model and the LSTM-BO-SVM model. The experimental results verify the high efficiency and robustness of both models.

The rest of this paper is structured as follows. Section 2 presents a review of the related literature. Then, Section 3 elaborates the methodology proposed. Section 4 provides an in-depth discussion of the experimental design and analysis of the results. In the Conclusion, this study is summarized, and future research directions are discussed.

2. Related Work

Some researchers have proposed detection methods based on ML algorithms. For example, Liu et al. [9] proposed an ML-based model using data visualization and adversarial training to detect different types of malware and its variants. The model is applicable to most of the generic malware file formats. The model generates adversarial samples for simulating potential malware variants, which not only extends the malware dataset but also facilitates the extraction of malware features. In addition, the commonly used adversarial sample generation methods and image transformation methods have been optimized. Jahangir et al. [10] analyzed the IoT 23 dataset with 1429,574 samples across 11 categories using Decision Tree (DT), Random Forest (RF), SVM, and Gaussian Naive Bayes (GNB). Xiong et al. [11] proposed a hybrid model which uses Logistic Regression (LR), DT, and K-Nearest Neighbor (KNN) for classification. The predictions of these three models are then aggregated to form new features. After that, the original and new features are combined to form a new feature set. Finally, the new feature set is used as input to train the RF model.

Moreover, a large number of researchers have proposed DL-based detection methods. For instance, Akhtar et al. [12] proposed a convolutional neural network (CNN) and LSTM model. This model detects advanced malware by automatically abstracting and expressing high-level n-gram API requests as sequential feature maps, without any feature engineering. Hosseini et al. [13] proposed a model combining an LSTM network with a CNN. The researchers created two separate networks to train both. One is trained on classes.dex files and the other collects .so files and binary files that have been written in native libraries. Kim et al. [14] proposed a malware detection system called MAPAS. The system first obtains the API call graph by conducting taint analysis with Flowdroid. After that, deep learning of the dataset is performed using a CNN. After the learning phase finishes, MAPAS uses the deep learning interpretation approach, Grad-CAM, to discover high-weight API call graphs used in malicious applications. Finally, the Jaccard algorithm is used to calculate the similarity between the API call graphs of an application and the high-weight API call graphs of malicious applications to classify the malware. Hemalatha et al. [15] proposed an improved DL-based DenseNet model to classify malware variants. The model improves the final classification layer of the DenseNet model using a reweighted class-balanced loss function. The binary files are then depicted as two-dimensional images and classified by the improved DenseNet model. Huang et al. [16] proposed a malware detection method based on a VGG16 network. The method first uses the Cuckoo Sandbox to dynamically analyze the samples, converts the dynamic analysis results into visual images according to a designed algorithm, and trains the VGG16 network on static and hybrid visualization images. Di et al. [17] analyzed in depth the applicability of artificial neural networks in network intrusion detection and showed that although artificial neural networks and their DL models perform well, their training process is often slow and takes a long time to converge due to backpropagation algorithms. Dong et al. [18] proposed an optimization method for anomalous network traffic detection based on a Semi-Supervised Double Deep Q-Network (SSDDQN). In an SSDDQN, the current network first adopts the autoencoder to reconstruct the traffic features and then uses a deep neural network as a classifier. The target network first uses the unsupervised learning algorithm K-Means clustering and then uses deep neural network prediction.

In addition, some other researchers have explored hybrid detection methods that combine DL with ML. For example, Shaukat et al. [19] proposed a hybrid detection model. The model first converts malware samples to color images, then uses VGG19 with transfer learning for feature extraction; after that, PCA is used to reduce the dimensionality of the features, and finally the One-Class SVM (OCSVM) is used for the final detection. The OCSVM is an SVM model that is trained using only negative class samples. Zhao et al. [20] proposed a static detection method for Android malware based on the LSTM-SVM model. The experimental results demonstrate that the LSTM-SVM detection method based on multiple features and models has a higher detection accuracy compared to the single-feature model, multi-feature single model, and multi-feature cascade model. Damaševičius et al. [21] proposed a malware detection method based on integrated classification. The method consists of a stack of five fully connected layers and one convolutional neural network layer, which performs the first stage of classification, after which the final classification is performed using extremely randomized trees. Pardhi et al. [22] proposed a hybrid detection method. The method uses AdaBoost, RF, a hybrid approach (Artificial Neural Network (ANN) with SVM), and a customized DL method together to classify malware.

However, some issues still exist in the above studies. For some complex malware behaviors, simple ML models may have difficulty in capturing their patterns, leading to weak generalization ability. DL models may overfit when the amount of data is limited, leading to subpar performance on fresh data. Hybrid models, which combine the features of ML and DL, can improve the accuracy and robustness of detection in some cases. However, how to effectively integrate these two approaches remains an open research question.

3. Technical Overview

This section outlines the key technologies and theoretical foundations employed in this paper, which include the LSTM network, the SVM model, and the BO algorithm.

3.1. Long Short-Term Memory Network

LSTM is a special kind of Recurrent Neural Network (RNN) architecture capable of learning long-term dependencies in data. It was presented by Hochreiter and Schmidhuber in 1997 [8]. When working with lengthy sequences, traditional RNNs encounter issues with vanishing and exploding gradients. LSTM was created to solve these issues.

The key innovation of LSTM lies in its use of three types of gates to control the flow of information:

Forget Gate: Decides which data to discard from the cell state.
Input Gate: Determines what fresh data to store in the cell state.
Output Gate: Decides what data to output from the cell state.

Each LSTM unit contains a cell state that remains constant across time steps until it is updated. This design allows LSTM to remember information over long periods and remain stable when processing long sequences. The inner cell state, depicted in Figure 1 [23], is central to the functioning of the LSTM network.

In Figure 1, $C_{t - 1}$ and $h_{t - 1}$ show the cell state and output of the LSTM cell layer at the previous time step, respectively; the three σ correspond to the sigmoid activation functions of the Forget, Input, and Output Gates; and the tanh layer is used to create the new candidate cell state $C_{t}$.
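To make the gate mechanics concrete, the following is a minimal NumPy sketch of a single LSTM time step under the standard formulation. The stacked weight matrix `W`, the bias `b`, the layer sizes, and the random inputs are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step (standard formulation).

    W has shape (4 * hidden, hidden + n_in) and stacks the weights of the
    forget gate, input gate, candidate state, and output gate; b has shape
    (4 * hidden,). This layout is an assumption made for this sketch.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:hidden])                      # forget gate
    i = sigmoid(z[hidden:2 * hidden])             # input gate
    c_tilde = np.tanh(z[2 * hidden:3 * hidden])   # candidate cell state
    o = sigmoid(z[3 * hidden:4 * hidden])         # output gate
    c_t = f * c_prev + i * c_tilde                # discard old, store new
    h_t = o * np.tanh(c_t)                        # exposed output
    return h_t, c_t


rng = np.random.default_rng(0)
n_in, hidden = 5, 3
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + n_in))
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(4, n_in)):  # a toy sequence of 4 time steps
    h, c = lstm_step(x_t, h, c, W, b)
```

Because the cell state is updated additively (`f * c_prev + i * c_tilde`), information can persist across many steps when the forget gate stays close to 1.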

In SSC, the temporal correlation is very significant between software entities. For software with the same name but different version numbers, the LSTM network can effectively capture and learn these temporal dependencies by virtue of its unique recurrent structure.

3.2. Support Vector Machines

The SVM is a prevalent model for binary classification problems [24,25]. It is fundamentally a linear classifier that operates on the principle of maximum margin separation within the feature space, a characteristic that distinguishes it from the perceptron.

The basic idea of SVM learning is to find the separating hyperplane that correctly divides the training dataset with the maximum geometric margin. For the samples to be classified $(x_{i}, y_{i})$, there is a hyperplane that completely separates the two classes of samples, as shown in Equation (1).

$\omega \cdot x + b = 0,$

(1)

where $\omega$ is the normal vector that defines the direction of the hyperplane and $b$ denotes the offset, which determines the distance between the origin and the hyperplane.

For data that are not linearly separable, finding the hyperplane can be formulated as the quadratic programming problem shown in Equation (2).

$\begin{aligned} \min_{\omega, b, \xi} \quad & \frac{\| \omega \|^{2}}{2} + C \sum_{i = 1}^{N} \xi_{i} \\ \text{s.t.} \quad & y_{i} (\omega \cdot x_{i} + b) \geq 1 - \xi_{i}, \quad \xi_{i} \geq 0, \quad i = 1, 2, \dots, N, \end{aligned}$

(2)

where $C$ is the penalty factor, which controls how heavily margin violations and misclassified samples are penalized, and $\xi_{i}$ is the slack variable of sample $i$, which measures the degree to which that sample violates the margin constraint.
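As a small worked illustration of Equation (2), the sketch below evaluates the soft-margin objective for a fixed hyperplane. It uses the fact that, for given $\omega$ and $b$, the smallest feasible slack is $\xi_{i} = \max(0, 1 - y_{i}(\omega \cdot x_{i} + b))$. The toy samples and the hyperplane are made-up values for illustration only.

```python
import numpy as np


def soft_margin_objective(w, b, X, y, C):
    """Return the objective of Equation (2) and the slack of every sample."""
    margins = y * (X @ w + b)
    xi = np.maximum(0.0, 1.0 - margins)  # smallest slack satisfying the constraint
    return 0.5 * np.dot(w, w) + C * xi.sum(), xi


# Toy data: labels are +1 / -1; the last sample lies on the wrong side.
X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [0.5, 0.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 0.0]), 0.0

objective, xi = soft_margin_objective(w, b, X, y, C=1.0)
# xi is [0, 0.5, 0, 1.5]: zero outside the margin, between 0 and 1 inside
# the margin, and above 1 for the misclassified sample.
```

A larger `C` makes each unit of slack more expensive, so the optimizer tolerates fewer margin violations at the cost of a narrower margin.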

Solving Equation (2) can be transformed into solving its dual problem, as shown in Equation (3).

$\begin{aligned} \max_{\alpha} \quad & W (\alpha) = \sum_{i = 1}^{N} \alpha_{i} - \frac{1}{2} \sum_{i = 1}^{N} \sum_{j = 1}^{N} \alpha_{i} \alpha_{j} y_{i} y_{j} K (x_{i}, x_{j}) \\ \text{s.t.} \quad & \sum_{i = 1}^{N} y_{i} \alpha_{i} = 0, \quad 0 \leq \alpha_{i} \leq C, \quad i = 1, 2, \dots, N, \end{aligned}$

(3)

where $\alpha_{i}, \alpha_{j}$ are the Lagrange multipliers, the sample vectors corresponding to $\alpha_{i} > 0$ are the support vectors, and $K (x_{i}, x_{j})$ is the kernel function. When the data are not linearly separable in the input space, a kernel function is introduced to map the samples into a high-dimensional space. Frequently used kernel functions include the Radial Basis Function (RBF) kernel and the polynomial kernel. The choice of kernel function can significantly influence the algorithm’s performance. In this paper, the RBF kernel is used. Its expression is given in Equation (4).

$K (x_{i}, x_{j}) = \exp \left( - \frac{\| x_{i} - x_{j} \|^{2}}{2 \gamma^{2}} \right),$

(4)

where $\gamma$ is the parameter of the kernel function.
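The kernel in Equation (4) can be computed directly, as in the following sketch. One practical caveat: Scikit-learn parameterizes the RBF kernel as $\exp(-g \| x_{i} - x_{j} \|^{2})$, so its `gamma` argument corresponds to $1 / (2 \gamma^{2})$ in the notation of Equation (4). The helper name `rbf_kernel_eq4` and the random data are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel


def rbf_kernel_eq4(A, B, gamma):
    """RBF kernel matrix following Equation (4): exp(-||a - b||^2 / (2 gamma^2))."""
    sq_dist = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dist / (2.0 * gamma ** 2))


rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
gamma = 1.5

K_eq4 = rbf_kernel_eq4(A, A, gamma)
# Same matrix through scikit-learn, after converting the parameter.
K_sklearn = rbf_kernel(A, A, gamma=1.0 / (2.0 * gamma ** 2))
```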

The optimal decision function is given in Equation (5).

$f (x) = \mathrm{sgn} \left[ \sum_{i = 1}^{N} \alpha_{i}^{*} y_{i} K (x_{i}, x) + b^{*} \right],$

(5)

where $\mathrm{sgn} (x)$ is the sign function, which returns −1 if $x < 0$, 1 if $x > 0$, and 0 if $x = 0$. Here, $\alpha_{i}^{*}$ is the optimal Lagrange multiplier, and $b^{*}$ is the optimal bias.
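Equation (5) can be checked against a fitted model. The sketch below trains a Scikit-learn `SVC` on synthetic data (a stand-in, not the paper's datasets) and rebuilds the decision values from the support vectors. In Scikit-learn, `dual_coef_` stores the products $\alpha_{i}^{*} y_{i}$ of the support vectors and `intercept_` stores $b^{*}$; the hyperparameter values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Synthetic binary data standing in for a malware feature matrix.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma=0.1).fit(X, y)

# Only support vectors have alpha_i > 0, so the sum in Equation (5)
# runs over clf.support_vectors_ alone.
K = rbf_kernel(clf.support_vectors_, X, gamma=0.1)   # shape (n_sv, n_samples)
scores = clf.dual_coef_[0] @ K + clf.intercept_[0]   # argument of sgn in Eq. (5)

# A positive score maps to clf.classes_[1], a negative one to clf.classes_[0].
manual_pred = clf.classes_[(scores > 0).astype(int)]
```

The number of terms in the sum equals the number of support vectors rather than the number of features, which is the property the text relies on when arguing that the SVM suits high-dimensional malware features.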

Malware in the SSC often employs various disguises to evade detection by various security systems. This results in the malware in the SSC exhibiting more characteristics and an increase in its dimensionality. The SVM model is suitable for SSC malware detection since it is determined by a small number of support vectors rather than the dimensionality of the sample space.

3.3. Bayesian Optimization

BO is a method used for global optimization [24,26,27], which can make full use of the information from previous parameter evaluation and effectively avoid falling into local optimal solutions. Its goal is to find the optimal solution of the objective function within a limited computational budget without explicitly knowing the mathematical expression of the objective function.

The core idea of the BO is to integrate Bayesian statistics with response surface modeling to efficiently find the optimal solution of an objective function. This is achieved through a balanced approach of exploration and exploitation. The BO algorithm typically follows these steps:

(1) Probabilistic Surrogate Model Selection: At each iteration, a surrogate model is chosen to approximate the objective function. Common surrogate models include Gaussian process (GP) regression and RF, among others. In this paper, GP regression is selected as the surrogate model, the expression of which is detailed in Equation (6).

$f (x) \sim GP [m (x), k (x, x^{'})]$

(6)

In Equation (6), the mean function $m (x) = E [f (x)]$ represents the mathematical expectation of the data sample, and $k (x, x^{'}) = E \{[f (x) - m (x)] [f (x^{'}) - m (x^{'})]\}$ is the covariance function.
(2) Sampling Strategy Selection: The subsequent sample point is selected by evaluating the uncertainty or the confidence level within the surrogate model. This selection rule is referred to as the sampling strategy. Well-established sampling strategies include the Gaussian process upper confidence bound and the Expected Improvement (EI), among others. In this paper, EI is utilized as the sampling strategy, with its functional form given in Equation (7).

$EI (x) = \begin{cases} [\mu (x) - f (x^{+})] \Phi (z) + \sigma (x) \varphi (z), & \sigma (x) > 0 \\ 0, & \sigma (x) = 0 \end{cases}$

(7)

In Equation (7), $\mu (x)$ represents the function value predicted by the surrogate model at the sampling point, $\sigma (x)$ is the corresponding predictive standard deviation, $f (x^{+})$ represents the maximum value among the points searched so far, $z = \frac{\mu (x) - f (x^{+})}{\sigma (x)}$, $\Phi (z)$ is the distribution function of the standard normal distribution, and $\varphi (z)$ is the probability density function of the standard normal distribution.
(3) Surrogate Model Update: The surrogate model is refined with the new candidate solutions, using both the actual observations of the objective function and the collection of previous candidate solutions.
(4) Iteration: The process iterates by repeating steps 2 and 3 until a pre-specified number of iterations is reached or the desired objective is achieved. The optimal candidate solution and its corresponding optimal value of the objective function are then identified as the outputs.
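The loop described in these steps can be sketched in a few lines with a GP surrogate and the EI rule of Equation (7). This is a minimal illustration, not the authors' implementation: the toy objective, the Matérn kernel, the candidate grid, the `alpha` jitter, and the iteration count are all assumptions chosen for the example.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern


def expected_improvement(mu, sigma, f_best):
    """Equation (7): EI for maximization; zero wherever sigma is zero."""
    ei = np.zeros_like(mu, dtype=float)
    pos = sigma > 0
    z = (mu[pos] - f_best) / sigma[pos]
    ei[pos] = (mu[pos] - f_best) * norm.cdf(z) + sigma[pos] * norm.pdf(z)
    return ei


def objective(x):
    """Toy black-box function (assumed); its maximum is at x = 2."""
    return -(x - 2.0) ** 2


rng = np.random.default_rng(0)
candidates = np.linspace(0.0, 5.0, 201).reshape(-1, 1)  # search space
X_obs = rng.uniform(0.0, 5.0, size=(3, 1))               # initial evaluations
y_obs = objective(X_obs).ravel()

for _ in range(10):
    # Step 1 and step 3: (re)fit the GP surrogate on all observations so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-4,
                                  normalize_y=True, random_state=0)
    gp.fit(X_obs, y_obs)
    # Step 2: score every candidate with EI and evaluate the best one.
    mu, sigma = gp.predict(candidates, return_std=True)
    ei = expected_improvement(mu, sigma, y_obs.max())
    x_next = candidates[np.argmax(ei)].reshape(1, -1)
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next).ravel())

x_best = X_obs[np.argmax(y_obs), 0]  # best point found within the budget
```

The first EI term rewards candidates whose predicted mean exceeds the current best (exploitation), while the second rewards candidates with high predictive uncertainty (exploration).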

The BO algorithm takes full account of previous parameter evaluations, making it less susceptible to becoming trapped in local optima when addressing non-convex problems. It also demonstrates strong performance in high-dimensional, complex optimization scenarios. Therefore, this paper employs the BO algorithm to ascertain the Gaussian kernel parameters and penalty factors within the SVM model, thereby optimizing the model’s performance.

4. Proposed SSC Malware Detection Model

In view of the current complex SSC security environment, this paper proposes a novel BO-SVM model to detect malware in the SSC. Then, an enhanced LSTM-BO-SVM model is also proposed to further improve the performance of the BO-SVM model.

4.1. The Construction of the BO-SVM Model

This section details the BO-SVM model to detect malware in the SSC. The BO-SVM model uses the SVM model for detection as the SVM model is determined by a small number of support vectors and not by the dimensionality of the sample space. It has better robustness to outliers and noise. Afterwards, the BO algorithm is utilized to search the hyperparameters of the SVM model to determine the best model hyperparameters, so as to significantly improve the detection performance. Figure 2 displays the BO-SVM model flowchart.

The establishment steps of the BO-SVM model are as follows:

(1) Input the malware dataset D: Select the malware dataset as the input. The dataset D is defined as an $M \times N$ matrix, where each row is denoted as $\{id, f_{1}, f_{2}, \dots, f_{i}, \dots, f_{n}, c\}$, wherein $id$ identifies the row, $f_{i}$ is the $i$-th feature, and $c$ is the label.
(2) Data preprocessing: The primary tasks of data preprocessing in this research are to handle missing values and software feature outliers. After that, the data are standardized.
(3) Split the malware dataset: The processed data are divided into a training set T1 and a test set S1.
(4) Dimensionality reduction: Apply PCA to the training set T1 and the test set S1. T1 is defined as a $J \times N$ matrix and S1 is defined as an $L \times N$ matrix, wherein $L = M - J$. The same parameters are used for the PCA algorithm to maintain the consistency of the data. Generate the training set T2 and test set S2 after dimension reduction. T2 is defined as a $J \times K$ matrix, where each row is denoted as $\{id, f_{1}, f_{2}, \dots, f_{i}, \dots, f_{k}, c\}$, wherein $k < n$. The test set S2 is defined as an $L \times K$ matrix.
(5) Initialize the SVM model: Initialize the SVM model with the default hyperparameters $\gamma$ and $C$.
(6) Generate the candidate solution space P: Generate the candidate solution space P for the hyperparameters $\gamma$ and $C$ by defining the parameter search range.
(7) Update P: Add new candidate solutions to P after each iteration, and input all candidate solutions in the set into the GP regression model.
(8) Perform GP regression calculation: The GP regression model shown in Equation (6) is used to calculate the function values and probability distributions for all candidate solutions. That is, the confidence of each candidate solution is obtained.
(9) Generate new candidate solutions: A new candidate solution is generated using the EI sampling strategy shown in Equation (7). The EI sampling strategy uses the resulting confidence to determine the next candidate solution to evaluate.
(10) Fit the SVM model: Use the new candidate solution as hyperparameters $\gamma$ and $C$ to fit the SVM model. Then, calculate the accuracy of the current SVM model.
(11) Make a judgment: Determine whether the stopping criterion is reached (the accuracy of the optimized model or the number of iterations reaches its threshold). If it is met, carry out step 12; otherwise, go back to step 7.
(12) Evaluate the BO-SVM model: First, select the candidate solution with the highest accuracy as the hyperparameters $\gamma$ and $C$. Then, test the trained BO-SVM model using the test set S2 to check whether it meets the criteria (whether the accuracy is up to the required level). If it does, proceed to the next step; otherwise, go back to step 5 to retrain the model.
(13) Obtain the optimal SVM model: After the above steps, the optimized model BO-SVM is achieved.
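The construction steps above can be sketched end to end with Scikit-learn. The following is a rough illustration under several assumptions that the paper does not specify: synthetic data replace the malware dataset, the scaler and PCA are fitted on the training split and reused on the test split, the "accuracy of the current SVM model" is taken to be 3-fold cross-validated accuracy, the search ranges are arbitrary, and the search is over $\log_{10}$ of Scikit-learn's `C` and `gamma` (the latter corresponds to $1 / (2 \gamma^{2})$ in Equation (4)).

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Steps 1-4: input, standardize, split, and reduce dimensionality.
X, y = make_classification(n_samples=400, n_features=30, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
scaler = StandardScaler().fit(X_tr)
pca = PCA(n_components=10, random_state=0).fit(scaler.transform(X_tr))
T2 = pca.transform(scaler.transform(X_tr))
S2 = pca.transform(scaler.transform(X_te))


def cv_accuracy(log_c, log_gamma):
    """Step 10: fit an SVM with the candidate hyperparameters and score it."""
    model = SVC(kernel="rbf", C=10.0 ** log_c, gamma=10.0 ** log_gamma)
    return cross_val_score(model, T2, y_tr, cv=3, scoring="accuracy").mean()


def expected_improvement(mu, sigma, best):
    """Equation (7)."""
    ei = np.zeros_like(mu, dtype=float)
    pos = sigma > 0
    z = (mu[pos] - best) / sigma[pos]
    ei[pos] = (mu[pos] - best) * norm.cdf(z) + sigma[pos] * norm.pdf(z)
    return ei


# Step 6: candidate solution space P over (log10 C, log10 gamma).
rng = np.random.default_rng(0)
P = np.column_stack([rng.uniform(-2, 3, 300), rng.uniform(-4, 1, 300)])
tried = P[:5].copy()
scores = np.array([cv_accuracy(*p) for p in tried])

# Steps 7-11: GP regression, EI sampling, SVM fit, fixed iteration budget.
for _ in range(15):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-4,
                                  normalize_y=True, random_state=0)
    gp.fit(tried, scores)
    mu, sigma = gp.predict(P, return_std=True)
    nxt = P[np.argmax(expected_improvement(mu, sigma, scores.max()))]
    tried = np.vstack([tried, nxt])
    scores = np.append(scores, cv_accuracy(*nxt))

# Steps 12-13: keep the best candidate and evaluate on the test set S2.
best_log_c, best_log_gamma = tried[np.argmax(scores)]
bo_svm = SVC(kernel="rbf", C=10.0 ** best_log_c,
             gamma=10.0 ** best_log_gamma).fit(T2, y_tr)
test_accuracy = bo_svm.score(S2, y_te)
```

Searching in log space is a common choice for `C` and `gamma` because useful values span several orders of magnitude.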

4.2. The Construction of the LSTM-BO-SVM Model

To construct the LSTM-BO-SVM model, a staged methodology was adopted in this paper. First, LSTM was introduced as a pre-classifier. In the SSC environment, malware behaviors are often time-dependent, so an LSTM model is a feasible choice for capturing these time-dependent features. Its recurrent mechanism can effectively capture the long-term dependence between different versions of software, so as to improve the accuracy of detection. Then, the BO algorithm was used to optimize the hyperparameters of the SVM model, aiming to further enhance the detection ability of the model. Additionally, the Principal Component Analysis (PCA) technique was adopted to reduce the dimension of the original dataset to enhance the computational efficiency. The LSTM-BO-SVM model flowchart is displayed in Figure 3.

The steps to build the LSTM-BO-SVM model are detailed as follows:

(1) Input: The T1, T2, and S2 datasets are input into the model.
(2) Build the LSTM model: An initialized LSTM model is first built, and then the LSTM model is trained using the training set T1. The specific establishment process of the LSTM model is shown in Section 4.3.
(3) Classify T1 and obtain the pre-classification set R: Use the trained LSTM model to pre-classify the training set T1 and obtain the pre-classification set R. R is defined as a $J \times 2$ matrix, where each row is denoted as $\{id, pc\}$, wherein $pc$ represents the pre-classification label.
(4) Build a new dataset D1: Merge the training set T2 with the pre-classification set R to form a new training set D1. The training set D1 is defined as a $J \times Q$ matrix, where each row is denoted as $\{id, f_{1}, f_{2}, \dots, f_{i}, \dots, f_{k}, pc, c\}$. Specifically, the $pc$ column in R is added to the training set T2 as a new feature column.
(5) Build the SVM model: Build an SVM model with default hyperparameters and initialize the BO algorithm.
(6) Optimize the SVM hyperparameters with BO: The BO algorithm is used to optimize the Gaussian kernel parameter $\gamma$ and penalty factor $C$ in the SVM model to obtain the optimal parameters, and the optimal parameters are transferred to the SVM model to obtain the BO-SVM model.
(7) Train the LSTM-BO-SVM model with optimized hyperparameters: The LSTM-BO-SVM model is trained using the training set D1, and the trained model is saved.
(8) Evaluate the model: The test set S2 is used to evaluate the classification performance of the LSTM-BO-SVM model. If the accuracy requirement is met, proceed to the next step; otherwise, go back to step 7 to retrain the model.
(9) Obtain the optimal LSTM-BO-SVM model: After the above steps, the optimized model LSTM-BO-SVM is achieved.
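The merge in step 4 amounts to appending the pre-classification label as one extra feature column. The sketch below shows this with NumPy; the helper name `append_preclassification` and the toy values are hypothetical, and the $id$ and $c$ columns are left out for brevity.

```python
import numpy as np


def append_preclassification(features, pc):
    """Append the LSTM pre-classification label pc as a new last column."""
    pc = np.asarray(pc, dtype=features.dtype).reshape(-1, 1)
    if pc.shape[0] != features.shape[0]:
        raise ValueError("pc must contain exactly one label per row")
    return np.hstack([features, pc])


# Toy stand-ins: two samples with K = 3 PCA features each.
T2 = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6]])
R_pc = np.array([1, 0])  # pre-classification labels produced by the LSTM

D1 = append_preclassification(T2, R_pc)  # shape (J, K + 1)
```

An SVM trained on D1 expects the same number of feature columns at test time. The steps above do not spell out how the $pc$ column is produced for S2; presumably the same trained LSTM is applied to the test samples before evaluation.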

4.3. The Construction of the LSTM Model

This section deeply explores the construction process of the LSTM model, which is used as the pre-classifier in the LSTM-BO-SVM model. The LSTM model architecture consists of four main parts: an LSTM network layer, a Dropout layer, and two subsequent fully connected layers. The LSTM network layer is designed to capture the long-term dependencies between software with the same name but different version numbers in SSC, which is essential for understanding and predicting software behavior. In order to enhance the generalization ability of the model and avoid overfitting, a Dropout layer is introduced after the LSTM network layer. Subsequently, two fully connected layers are concatenated, where the first fully connected layer aims to improve the classification accuracy of the model. The second fully connected layer is set as the output layer with a single neuron and is used to handle binary classification tasks. The LSTM model flowchart is displayed in Figure 4.

The steps to build the LSTM model are as follows:

(1) Input: The T1 and S1 datasets are input into the model.
(2) Split T1: Divide T1 into a validation set V1 and a training set $T1'$.
(3) Train the LSTM model: Train the LSTM model using the $T1'$ and V1 datasets. During the training process, the model parameters are constantly adjusted to obtain the optimal parameters.
(4) Evaluate the model with V1: Use the validation set V1 to assess the model’s performance. Determine whether the stopping criterion is reached (the number of training rounds or the model accuracy reaches a threshold), and if so, proceed to step 5; otherwise, go back to step 3.
(5) Obtain the trained LSTM model: After the above steps, obtain a trained LSTM model.
(6) Obtain the software pre-classification set R: Input the training set T1 into the trained LSTM model to obtain the pre-classification results of T1 and form the set R.
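The four-part architecture described in this section (LSTM layer, Dropout layer, two fully connected layers with a single output neuron) could look like the following PyTorch sketch. The framework, the layer widths, the dropout rate, the ReLU activation, and the choice to feed each sample as a length-1 sequence are all assumptions; the paper states only the layer types and their order.

```python
import torch
from torch import nn


class LSTMPreClassifier(nn.Module):
    """LSTM -> Dropout -> fully connected -> single-neuron output."""

    def __init__(self, n_features, hidden_size=64, dense_size=32, dropout=0.5):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size,
                            batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fc1 = nn.Linear(hidden_size, dense_size)
        self.fc2 = nn.Linear(dense_size, 1)  # one neuron for binary output

    def forward(self, x):
        # x has shape (batch, seq_len, n_features).
        _, (h_n, _) = self.lstm(x)
        h = self.dropout(h_n[-1])            # last hidden state of the sequence
        h = torch.relu(self.fc1(h))
        return self.fc2(h).squeeze(-1)       # raw logits, one per sample


torch.manual_seed(0)
model = LSTMPreClassifier(n_features=68)     # 68 = ClaMP feature count
batch = torch.randn(8, 1, 68)                # 8 samples, sequence length 1

model.eval()                                 # disable dropout for inference
with torch.no_grad():
    prob = torch.sigmoid(model(batch))       # probability of the class coded 1
pc = (prob >= 0.5).long()                    # pre-classification labels
```

For training, the logits would typically be paired with `nn.BCEWithLogitsLoss`, with the validation set V1 used to decide when to stop.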

5. Experimental Results and Analysis

Numerous tests were conducted to evaluate the effectiveness of the proposed BO-SVM and LSTM-BO-SVM models. The hardware environment consisted of an Intel (R) Core (TM) i7-6700 CPU at 3.40 GHz, with 16.0 GB of RAM. The software environment included Windows 10, Visual Studio Code, and Python 3.11.2.

5.1. Experimental Dataset

In this paper, two public datasets that are widely recognized for SSC malware detection were used to verify the performance of the proposed BO-SVM and LSTM-BO-SVM models. The first dataset is the Classification of Malware with PE headers (ClaMP) dataset [28], which contains desktop software samples. The second is the Canadian Institute for Cybersecurity Android Malware Dataset 2020 (CICMalDroid-2020) dataset [29,30], which covers Android app samples.

In the ClaMP dataset, there are 5112 samples, where the benign samples come from Windows files, whereas the malicious samples come from VirusShare. These samples contain 68 features derived from the Portable Executable (PE) header, and include 2488 benign samples and 2624 malicious samples, as shown in Table 1.

The CICMalDroid-2020 dataset gathers over 17,341 Android samples from various sources, such as AMD, MalDozer, VirusTotal, and the Contagio security blog. It covers five categories: mobile risk software, adware, banking malware, SMS malware, and benign software [30]. This dataset contains 471 features. The sample size for each category is shown in Table 2.

5.2. Data Preprocessing

In order to ensure the consistency and quality of the dataset, a series of data preprocessing steps were carried out in this paper to lay the foundation for the subsequent model training and analysis.

5.2.1. Data Preprocessing for ClaMP Dataset

We preprocessed the ClaMP dataset using the following steps:

First, all features need to be of a numerical type in order to be recognized by ML algorithms. As the only categorical feature in ClaMP, “Packing type” was converted into a numerical feature using the factorize() function from the Pandas library.
StandardScaler() from the Scikit-learn library was used to standardize the data in ClaMP.
The PCA technique was utilized to select the first 20 principal components of the ClaMP dataset, with the aim of lowering the original dataset’s dimensionality while retaining as much of the data variability as possible.
After shuffling the dataset, random sampling was executed to divide it into an 80% training set and a 20% testing set.
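The four preprocessing steps can be sketched as follows, in the order listed above. The DataFrame is synthetic and its column names (`f0` to `f29`, `packer_type`, `class`) are placeholders assumed for illustration; only `factorize()`, `StandardScaler()`, the 20 components, and the 80/20 split come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ClaMP table (placeholder column names).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 30)),
                  columns=[f"f{i}" for i in range(30)])
df["packer_type"] = rng.choice(["NoPacker", "UPX", "ASPack"], size=200)
df["class"] = rng.integers(0, 2, size=200)

# Step 1: encode the categorical column as integer codes.
codes, uniques = pd.factorize(df["packer_type"])
df["packer_type"] = codes

X = df.drop(columns="class").to_numpy(dtype=float)
y = df["class"].to_numpy()

# Step 2: standardize every feature to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)

# Step 3: keep the first 20 principal components.
X_pca = PCA(n_components=20, random_state=0).fit_transform(X_std)

# Step 4: shuffle and split 80% / 20%.
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.2, shuffle=True, random_state=0)
```

A common alternative is to fit the scaler and PCA on the training split only and then apply them to the test split, which keeps test-set statistics out of the fitted transforms.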

5.2.2. Data Preprocessing for CICMalDroid-2020 Dataset

We preprocessed the CICMalDroid-2020 dataset using the following steps:

Some features in the dataset take very large values because the corresponding calls occur with high frequency, with a maximum value of 3,697,410. With such extreme values present, the remaining values of these features may shrink to nearly zero after normalization, which may affect the classification accuracy. To alleviate this issue, values above 10,000 were capped at 10,000 before StandardScaler() was applied for normalization.
Because this paper focuses on binary classification, i.e., distinguishing benign software from malware, only the SMS malware samples and the benign software samples in the dataset were used. The resulting dataset of 5699 samples contained 1795 benign software samples and 3904 malware samples.
The first 40 principal components of the CICMalDroid-2020 dataset were retained using the PCA technique to enhance the model’s performance.
The dataset was randomly shuffled and then divided into an 80% training set and a 20% test set.
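The effect of the capping step can be seen on a toy feature. The sketch below is an illustration under assumptions: the call counts are invented (only the maximum 3,697,410 and the cap 10,000 come from the text), and the standardization is written out with NumPy using the population standard deviation, which is what StandardScaler() uses.

```python
import numpy as np

def standardize(x):
    """Zero-mean, unit-variance scaling (population standard deviation)."""
    return (x - x.mean()) / x.std()

def cap_then_standardize(x, cap=10_000):
    """Cap values above `cap` at `cap`, then standardize."""
    return standardize(np.minimum(x, cap))

# Hypothetical call counts for one feature: mostly small, one extreme value.
counts = np.array([0, 3, 12, 40, 150, 900, 3_697_410], dtype=float)

raw = standardize(counts)
capped = cap_then_standardize(counts)

# Spread of the ordinary samples (all but the extreme one) after scaling.
spread_raw = raw[:-1].max() - raw[:-1].min()
spread_capped = capped[:-1].max() - capped[:-1].min()

print(f"spread without cap: {spread_raw:.6f}")
print(f"spread with cap:    {spread_capped:.6f}")
```

Without the cap, the single extreme value dominates the standard deviation and the ordinary samples become nearly indistinguishable; with the cap, their differences survive the scaling.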

5.3. Evaluation Metrics

The following metrics were employed to assess the model’s performance.

Training time [23]: The training time is the elapsed time between the moment the model starts training and the moment the training process is completely finished.
Detection time [23]: The detection time is the elapsed time between the moment the model starts detection and the moment the detection process is completely finished.
Computation time [23]: The computation time is the sum of the model training time and the detection time.
Accuracy [14]: Accuracy represents the proportion of accurately predicted samples to all samples. It is a simple and intuitive metric for evaluating the ability of a model to classify examples correctly, and is calculated as shown in Equation (8):

$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$

(8)

In the formula, $N$ (Negatives) represents the malware label, while $P$ (Positives) stands for benign software. $TN$ represents the software with true $N$ and predicted $N$; $FN$ represents the software with true $P$ and predicted $N$; $TP$ represents the software with true $P$ and predicted $P$; and $FP$ represents the software with true $N$ and predicted $P$.
Precision [14]: Precision is the proportion of samples predicted by the model to be positive that are actually positive. This is shown in Equation (9).

$\mathrm{Precision} = \frac{TP}{TP + FP}$

(9)
Recall rate [14]: Recall rate is the proportion of actually positive samples that the model correctly predicts as positive. This is shown in Equation (10).

$\mathrm{Recall\ rate} = \frac{TP}{TP + FN}$

(10)
F1-Score [14]: F1-Score is the harmonic mean of precision and recall rate, so it assesses a classification model with both taken into account. Equation (11) illustrates how it is calculated.

$\mathrm{F1\text{-}Score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall\ rate}}{\mathrm{Precision} + \mathrm{Recall\ rate}}$

(11)
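The metrics above can be computed directly from the four confusion-matrix counts and from two timers. The following sketch is only an illustration: the counts are hypothetical, the helper names are assumptions, and the two timed calls are placeholders for an actual training and detection routine.

```python
import time

def timed(func, *args):
    """Run func(*args) and return (result, elapsed time in seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall rate and F1-Score, Equations (8)-(11)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical confusion-matrix counts (not taken from the experiments).
metrics = classification_metrics(tp=480, fp=12, tn=500, fn=8)

# Placeholder workloads standing in for model training and detection.
_, training_time = timed(sum, range(100_000))
_, detection_time = timed(sum, range(1_000))
computation_time = training_time + detection_time

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
print(f"computation time: {computation_time:.6f} s")
```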

5.4. Experimental Results and Analysis

To ensure reliable results, all experiments were repeated 20 times, and the results reported below are the averages over these 20 runs.

Taking the ClaMP dataset as an example, the BO-SVM model was evaluated as follows. First, the dataset, stored in .csv format, was preprocessed as described in Section 5.2.1. Then, 80% of the data were randomly assigned to the training set T1 and the remaining 20% to the test set S1. After that, the top 20 features were selected using the PCA algorithm to obtain the training set T2 and the test set S2. The BO-SVM model was then trained on the training set T2 until its accuracy reached a predetermined value. Finally, the trained BO-SVM model was tested on the test set S2: if it met the requirements, the model was used for detection; otherwise, the model was retrained. When evaluating the LSTM-BO-SVM model on the ClaMP dataset, the LSTM model was trained and evaluated on the training set T1 and the test set S1 rather than on T2 and S2. Because some software features may be lost due to dimensionality reduction, using the full feature set allows the LSTM to capture the long-term and short-term dependencies in the data more comprehensively. The predictions of the LSTM model for the training set T1 were then added as a new feature column to the training set T2 (20 features) to form a new training set (21 features). The BO-SVM model was then trained using the new training set. The subsequent evaluation steps were the same as for the BO-SVM model.
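The step that turns the 20-feature training set T2 into a 21-feature training set can be sketched as follows. This is an illustration under assumptions: the arrays are random placeholders, the helper name is hypothetical, and the LSTM output is taken to be a predicted class label per sample, which this section does not specify.

```python
import numpy as np

def append_preclassifier_output(features, predictions):
    """Append one pre-classifier prediction per sample as an extra column."""
    predictions = np.asarray(predictions, dtype=float).reshape(-1, 1)
    if predictions.shape[0] != features.shape[0]:
        raise ValueError("one prediction per sample is required")
    return np.hstack([features, predictions])

rng = np.random.default_rng(0)

# Placeholder for T2: 8 samples with 20 PCA components each.
T2 = rng.normal(size=(8, 20))

# Placeholder for the LSTM predictions on T1 (same samples, same order).
lstm_predictions_T1 = rng.integers(0, 2, size=8)

T_new = append_preclassifier_output(T2, lstm_predictions_T1)
print(T_new.shape)
```

For the classifier to be applied at test time, the test set S2 needs the same extra column, built from the LSTM predictions on S1, so that training and test data both have 21 features.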

The evaluation process on the CICMalDroid-2020 dataset is consistent with the ClaMP dataset evaluation process.

5.4.1. Evaluation of BO-SVM Efficiency

This paper builds four additional models to validate the performance of the BO-SVM model, which are as follows:

Basic SVM model: SVM model with default parameter settings.
ACO-SVM model: SVM model optimized based on the Ant Colony Optimization (ACO) algorithm.
PSO-SVM model: SVM model optimized based on the PSO algorithm.
BO-DT model: A Decision Tree (DT) model optimized based on the BO algorithm.

The SVM model uses the default values. The initial parameters of the other models were adjusted to the values in studies [31,32,33], respectively. Afterwards, the parameter combinations that made the model perform best were determined after several experiments. See Table 3 for the detailed parameter settings.

To investigate its generalization ability and adaptability, the proposed BO-SVM model was compared with the above four models, i.e., ACO-SVM, PSO-SVM, BO-DT, and SVM, on the ClaMP dataset and the CICMalDroid-2020 dataset. To further verify the performance of the BO-SVM model, the experimental results on the ClaMP dataset were also compared with the related work of Singh et al. [34]. Table 4 displays the outcomes of the experiment.

From Table 4, it can be observed that the BO-SVM model achieves higher accuracy than the other models on both datasets. In particular, the SVM model optimized by the BO algorithm outperforms the SVM models optimized by the ACO and PSO algorithms on both datasets. This is because the ACO and PSO algorithms may fall into local optima when optimizing the hyperparameters $\gamma$ and $C$, whereas the basic SVM model relies on empirical manual parameter settings, which may not be able to guarantee that the global optimal solution is reached. In addition, the BO-SVM model outperforms the BO-DT model and the KNN model reported in reference [34].
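To make the idea of optimizing $\gamma$ and $C$ with Bayesian Optimization concrete, a minimal sketch is given below. It is not the implementation used in this study: the Gaussian-process surrogate, the expected-improvement acquisition function, the search bounds, the synthetic data, and the iteration counts are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic binary classification data standing in for a malware dataset.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Assumed search space in log10 units: rows are log10(C) and log10(gamma).
BOUNDS = np.array([[-2.0, 3.0], [-4.0, 1.0]])
rng = np.random.default_rng(0)

def sample(n):
    """Draw n random points uniformly inside the search bounds."""
    return rng.uniform(BOUNDS[:, 0], BOUNDS[:, 1], size=(n, 2))

def objective(log_params):
    """Mean 3-fold cross-validated accuracy of an RBF-kernel SVM."""
    C, gamma = (float(v) for v in 10.0 ** np.asarray(log_params))
    return cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X, y, cv=3).mean()

def expected_improvement(candidates, gp, best_score):
    """Expected improvement over the best score seen so far (maximization)."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best_score) / sigma
    return (mu - best_score) * norm.cdf(z) + sigma * norm.pdf(z)

# Start from a few random evaluations, then let the surrogate pick new points.
params = sample(5)
scores = np.array([objective(p) for p in params])

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True,
                                  random_state=0)
    gp.fit(params, scores)
    candidates = sample(500)
    ei = expected_improvement(candidates, gp, scores.max())
    next_point = candidates[np.argmax(ei)]
    params = np.vstack([params, next_point])
    scores = np.append(scores, objective(next_point))

best_C, best_gamma = 10.0 ** params[np.argmax(scores)]
print(f"best C={best_C:.4g}, gamma={best_gamma:.4g}, accuracy={scores.max():.3f}")
```

Each new evaluation is chosen using all previous evaluations through the surrogate model, which is how this kind of search differs from purely random or population-based sampling of $\gamma$ and $C$.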

In terms of training time, the basic SVM model has the shortest training time, while the BO-SVM model and BO-DT model have similar training times, but both are longer than the basic SVM model. This is mainly due to the fact that the introduction of the BO algorithm increases the training time accordingly, but the increase is still within the acceptable range. However, the ACO-SVM and PSO-SVM models not only take longer to train, but also are less accurate than the BO-SVM model.

5.4.2. Evaluation of LSTM-BO-SVM Model Efficiency

In order to evaluate the performance of the LSTM-BO-SVM model in malware detection in the SSC, the following models were selected in this paper for comparative analysis:

RNN-BO-SVM model: First, use an RNN for pre-classification, and then use the BO algorithm to optimize the SVM model; finally, use the optimized SVM model to complete the classification.
BiLSTM-BO-SVM model: First, use Bidirectional Long Short-Term Memory (BiLSTM) for pre-classification, and then use the BO algorithm to optimize the SVM model; finally, use the optimized SVM model to complete the final classification.
BO-SVM model: First, use the BO to optimize the SVM model, and then use the optimized SVM model to complete the final classification.
LSTM-SVM model: First, use the LSTM model for pre-classification, and then use the SVM model to complete the classification.
Basic LSTM model: Use the LSTM model alone to complete the final classification.

In this paper, to ensure the consistency and comparability of the experiments, the same values were used for the same parameters in each model.

In order to present the experimental results more comprehensively, this paper used the evaluation metrics introduced in Section 5.3 to evaluate the six models mentioned above. In addition, in order to visualize the detection effect of the LSTM-BO-SVM model and to effectively reveal the correct and incorrect predictions of the model in each category, a heat map of the confusion matrix of the LSTM-BO-SVM model was drawn.

1.: Confusion Matrix

The confusion matrices are shown in Figure 5 and Figure 6 for the ClaMP dataset and the CICMalDroid-2020 dataset, respectively.

It is evident from Figure 5 and Figure 6 that the vast majority of samples are concentrated on the main diagonal of the confusion matrix. This phenomenon indicates that the LSTM-BO-SVM model can achieve a high degree of accuracy and discrimination when classifying samples. The sample points on the main diagonal represent correctly classified instances, while those on the off-diagonal represent misclassified instances. Therefore, a high proportion of samples in the main diagonal is intuitive evidence of the excellent detection performance of the model. The high accuracy of the LSTM-BO-SVM model on the two datasets further confirms its effectiveness in the SSC malware detection task.

Overall, the findings of the confusion matrix demonstrate that the LSTM-BO-SVM model achieves a good level of detection effectiveness.

2.: Accuracy Comparison

Figure 7 and Figure 8 show the accuracy comparisons of the six models mentioned above on the ClaMP and CICMalDroid-2020 datasets, respectively.

The experimental results reveal the significant advantage of the LSTM-BO-SVM model in terms of accuracy. Its accuracy on the ClaMP and CICMalDroid-2020 datasets reached 98.2% and 98.6%, respectively, surpassing the other five models. On the ClaMP dataset, the accuracy of the LSTM-BO-SVM model exceeded that of the other five models by 2.9% to 9.9%, which shows that it possesses a better detection ability on this dataset. On the CICMalDroid-2020 dataset, the LSTM-BO-SVM model also shows its superiority: its accuracy exceeded that of the other five models by 2.2% to 14.7%.

The lower accuracy of the LSTM model may be attributed to its difficulty in adequately extracting features when the amount of data is insufficient. In addition, the LSTM-SVM model lacks an optimization algorithm to determine the hyperparameters of the SVM classifier, which may lead to its unsatisfactory detection effect. Although the RNN-BO-SVM model uses the BO algorithm for optimization, its accuracy is still lower than that of the LSTM-BO-SVM model. This may be due to the vanishing gradient problem faced by RNNs when dealing with long-term dependencies, which affects their ability to capture long-distance dependencies in the data. In contrast, there is little difference in accuracy between the BiLSTM-BO-SVM model and the LSTM-BO-SVM model, which indicates that the bidirectional LSTM structure alleviates the vanishing gradient problem of RNNs to some extent, thus improving the performance of the model.

3.: Precision Comparison

Figure 9 and Figure 10 show the precision comparison of the six models on the ClaMP and CICMalDroid-2020 datasets, respectively.

As shown in Figure 9, on the ClaMP dataset, the LSTM-BO-SVM model demonstrates superior precision at 98.7% for malware, ranging from 0.2% to 13.5% higher than the other models. For benign software, this model achieves 97.7% precision, which is 0.7% lower than the BiLSTM-BO-SVM model but still higher than the other models, ranging from 0.1% to 15.8%.

As shown in Figure 10, on the CICMalDroid-2020 dataset, the LSTM-BO-SVM model demonstrates superior precision at 98.6% for malware, ranging from 0.6% to 15.2% higher than other models. For benign software, the precision is also 98.6%, which is lower than the LSTM-SVM and BiLSTM-BO-SVM models by 1.1% and 0.1%, respectively, but higher than the other models, which range from 0.1% to 16.8%.

The experimental findings demonstrate the high precision with which the LSTM-BO-SVM model can classify both malware and benign software. The BO-SVM model achieved good detection results on benign software, but its detection results on malware were worse than those on benign software. The precision of the LSTM model and the LSTM-SVM model varies greatly between benign software and malware. In particular, on the ClaMP dataset, the LSTM-SVM model classifies malware with significantly higher precision than benign software, while on the CICMalDroid-2020 dataset, the situation is reversed. The LSTM model performs similarly to the LSTM-SVM model. In the experiments, the precision of the LSTM-SVM model fluctuated greatly from run to run, which may be partially attributed to the default hyperparameter settings of the SVM model. The RNN-BO-SVM model and the BiLSTM-BO-SVM model achieved better detection results, and the BiLSTM-BO-SVM model was better than the RNN-BO-SVM model, but still not as good as the LSTM-BO-SVM model.

4.: Recall Rate Comparison

Figure 11 and Figure 12 exhibit the recall rate comparisons of the above six models on the ClaMP and the CICMalDroid-2020 datasets, respectively.

As shown in Figure 11, on the ClaMP dataset, the LSTM-BO-SVM model has a recall rate of 97.9% for malware, which is 0.2% and 0.6% lower than the BO-SVM and BiLSTM-BO-SVM models, respectively, but still higher than the other models, with a range from 1.3% to 19.3%. The LSTM-BO-SVM model’s recall rate for benign software is 98.6%, which is 1% less than the LSTM-SVM model’s recall but greater than the other models’ recall rates, which range from 1.2% to 15.8%.

As shown in Figure 12, on the CICMalDroid-2020 dataset, the LSTM-BO-SVM model’s recall rate for malware is 96.9%, which is 0.3% lower than the BiLSTM-BO-SVM model and 0.3% to 0.9% higher than the other models. For benign software, the recall rate of the LSTM-BO-SVM model is 99.4%, which is 0.3% to 8.5% higher than the other models.

According to the experimental findings, the LSTM-BO-SVM model demonstrates a high recall rate on both datasets, whether detecting benign software or malware. The BiLSTM-BO-SVM model comes second, while the recall of the LSTM-SVM model is relatively mediocre. The excellent recall performance of the LSTM-BO-SVM model is attributed to two key factors: the optimization of hyperparameters and the LSTM structure’s ability to efficiently process sequence data. The weaker performance of the LSTM-SVM model, in turn, may be due to the lack of effective hyperparameter optimization. Although the BiLSTM-BO-SVM model performs well, its performance is slightly lower than that of the LSTM-BO-SVM model, possibly due to the complexity of the model structure.

5.: F1-Score Comparison

Figure 13 and Figure 14 exhibit the F1-Score comparisons of the above six models on the ClaMP and the CICMalDroid-2020 datasets, respectively.

As shown in Figure 13, on the ClaMP dataset, the LSTM-BO-SVM model achieves an F1-score of 98.3% for malware, which is higher than the other models, ranging from 0.2% to 10.4%. For benign software, the F1-score of the LSTM-BO-SVM model is 98%, which is higher than other models, ranging from 0.1% to 10.7%.

As shown in Figure 14, on the CICMalDroid-2020 dataset, the LSTM-BO-SVM model achieves an F1-score of 97.9% for malware, which is higher than the other models, ranging from 0.3% to 7.2%. For benign software, the F1-score of the LSTM-BO-SVM model is 99%, which is higher than other models by 0.1% to 3.2%.

According to the experimental findings, the LSTM-BO-SVM model has a high F1-score when detecting both benign software and malware. Although the BiLSTM-BO-SVM model performs slightly worse, it still outperforms the other models. In contrast, the LSTM-SVM model performs poorly in terms of the F1-score. In addition, the F1-score of benign software is higher than that of malware, which may be due to the fact that malware uses techniques such as code obfuscation in order to avoid detection, which makes it more difficult to detect.

6.: Training and Detection Time Comparison

This paper not only provides an evaluation of the detection performance of each model, but also records their training and detection times, which provides an important indicator of the computational efficiency of the models. This comparison not only reveals the differences in processing speed among the models, but also provides researchers with a reference for making trade-offs between time and performance when selecting models. This allows the selection of an optimal model that meets efficiency requirements and guarantees detection accuracy. Table 5 displays the outcomes of the experiment.

According to the experimental findings, the training time of the LSTM-BO-SVM model on the ClaMP dataset is relatively long (136.94 s), which may be related to the higher computational cost of the LSTM for processing data. By comparison, the training time of the BO-SVM model is 34.68 s, which demonstrates its optimization efficiency. In terms of detection speed, the BO-SVM model is the fastest with a detection time of 0.22 s, while the BiLSTM-BO-SVM model has the longest detection time (2.89 s), which may be related to the complexity of its structure.

On the CICMalDroid-2020 dataset, the training time of all models was extended due to the higher dimensionality of this dataset, but the BO-SVM model still maintained the shortest training time (45.56 s). In terms of detection time, the BO-SVM model remained the fastest at 0.32 s, while the BiLSTM-BO-SVM model had the longest detection time (3.21 s).

In conclusion, the BO-SVM model shows higher efficiency in both training and detection speeds and may be more suitable for environments with higher real-time requirements. On the other hand, the LSTM-BO-SVM model may be more suitable for complex data analysis that requires high accuracy, although it takes longer to train.

7.: Comparison with Related Work

Many studies on SSC malware detection have been conducted using the ClaMP and CICMalDroid-2020 datasets. Mohamed et al. [35] and Sawadogo et al. [36] analyzed the CICMalDroid-2020 dataset using various machine learning models. Musikawan et al. [37] proposed an effective improved deep neural network. Bhagwat et al. [38] used the XGBoost model for detection. Kattamuri et al. [39] proposed feature selection using the ACO algorithm followed by detection using DTs. Raju et al. [40] used RFs for detection. Masum et al. [2] analyzed SSC attacks using Quantum SVM (QSVM) and Quantum Neural Network (QNN) on a quantum platform.

This paper presents the accuracy and computation time of each study according to the dataset used, with the results ranked by accuracy as shown in Table 6. Additionally, computation times are compared; “\” indicates that the computation time is not mentioned in the article.

According to Table 6, the LSTM-BO-SVM model proposed in this paper achieved the highest accuracy on both datasets. Although the BO-SVM model is slightly less accurate, it offers a shorter computation time than the other research methods listed in Table 6. Some of those methods, while more accurate than the BO-SVM model, require a longer computation time. For example, the method proposed by Kattamuri et al. [39] achieved an accuracy of 97.69% on the ClaMP dataset, with a computation time of 4365 s.

5.5. Discussion

Based on the experimental results, this paper finds that the BO-SVM and LSTM-BO-SVM models maintain good detection performance, even on the CICMalDroid-2020 dataset, where benign and malicious samples are unevenly distributed. This indicates that these two models have high robustness. Moreover, the consistency of the results obtained on different datasets further confirms their generalization ability.

In the experiments in this paper, the BO-SVM model achieves high accuracy with a short detection time. This feature makes it potentially applicable to resource-constrained environments. For example, the trained model can be deployed to environments such as mobile handsets to effectively defend against potential SSC malware attacks through real-time detection of software when users download or update it.

Although the training time of the LSTM-BO-SVM model is slightly longer than that of some simple ML and DL models, it outperforms others in terms of detection capability. It may be ideal for detection tasks that require high accuracy and have sufficient computational resources. For example, in areas such as risk management in the financial industry and healthcare information system protection, where devices are usually high-performance, the LSTM-BO-SVM model can provide desirable detection results to effectively defend against persistent threats and malware behaviors from SSC attacks.

6. Conclusions

In order to effectively prevent the security risk of SSC malware, two novel models, BO-SVM and LSTM-BO-SVM, are proposed in this paper for the effective detection of malware in the SSC. The BO-SVM model is built on the basis of SVM, and its hyperparameters are optimized by the BO algorithm. The model is suitable for resource-limited environments. Then, the LSTM-BO-SVM model is constructed on the basis of the BO-SVM, and the sequence data are processed by the LSTM. The model is suitable for use in scenarios where the real-time requirements are low and the accuracy requirements are high.

The experimental outcomes demonstrate that the BO-SVM model achieves an accuracy of 95.3% on the ClaMP dataset and 96.4% on the CICMalDroid-2020 dataset, outperforming the traditional SVM model. The LSTM-BO-SVM model, with the introduction of the LSTM network, further improves the accuracy to 98.2% and 98.6% on the ClaMP and CICMalDroid-2020 datasets, respectively, which demonstrates its potential for application in SSC malware detection tasks.

Although the BO-SVM and LSTM-BO-SVM models are effective, other models, such as DL, have been reported to achieve better results in similar studies [21]. Therefore, future work can explore other more efficient and concise detection models. The current study is in the preliminary stage, which was conducted on a public dataset and not validated in a real environment. Future work can deeply analyze the applications of the model in real SSC attack scenarios. In addition, future work can explore more data sources or incorporate more diverse features to increase the robustness of the model and assess the impact of dataset bias on model performance. This paper investigates binary classification problems, which can be extended to multiclassification problems in future work. Moreover, as the size of the dataset grows and the feature dimensions increase, it will become crucial to explore more efficient hyperparameter optimization algorithms. This will not only improve the performance of the model, but also cope with more complex SSC environments in the future.

Author Contributions

Conceptualization, S.Z., H.L., and X.F.; methodology, S.Z., H.L., and X.F.; software, S.Z. and Y.J.; validation, S.Z., H.L., and Y.J.; formal analysis, S.Z., H.L., and X.F.; investigation, S.Z. and Y.J.; writing—original draft preparation, S.Z. and Y.J.; writing—review and editing, S.Z. and Y.J.; visualization, S.Z.; supervision, H.L. and X.F.; project administration, H.L.; funding acquisition, X.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant No. 62041211; the Basic Scientific Research Foundation Project of Colleges and Universities directly under the Inner Mongolia Autonomous Region under Grant No. BR22-14-05; the Inner Mongolia Autonomous Region Science and Technology Program under Grant No. 2022YFHH0070; the China Ministry of Education industry–university cooperative education project Grant No. 230806272235841; the Natural Science Foundation project of the Inner Mongolia Autonomous Region Grant No. 2024MS06002; and the Inner Mongolia Autonomous Region Graduate Research Innovation Project under Grant.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The ClaMP dataset is available at https://www.kaggle.com/datasets/saurabhshahane/classification-of-malwares (accessed on 7 May 2024). The CICMalDroid-2020 dataset is available at https://www.unb.ca/cic/datasets/maldroid-2020.html (accessed on 7 May 2024). The methodology and code of this study, if required, should be obtained by contacting the corresponding author.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive comments and suggestions to improve the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Ji, S.; Wang, Q.; Chen, A.; Zhao, B.; Ye, T.; Zhang, X.; Wu, J.; Li, J.; Yi, J.; Wu, Y. Open-source software supply chain security research review. J. Softw. 2023, 3, 1330–1364.
Masum, M.; Nazim, M.; Faruk, M.J.H.; Shahriar, H.; Valero, M.; Khan, M.A.H.; Uddin, G.; Barzanjeh, S.; Saglamyurek, E.; Rahman, A. Quantum machine learning for software supply chain attacks: How far can we go? In Proceedings of the 2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC), Los Alamitos, CA, USA, 27 June–1 July 2022; pp. 530–538.
Sonatype. 2020 State of the Software Supply Chain. Available online: https://www.globenewswire.com/ (accessed on 8 July 2024).
Sonicwall. Sonicwall Cyber Threat Report. Available online: https://www.sonicwall.com/medialibrary/en/white-paper/2023-cyber-threat-report.pdf (accessed on 6 July 2023).
Aslan, Ö.A.; Samet, R. A comprehensive review on malware detection approaches. IEEE Access 2020, 8, 6249–6271.
Liu, Z.W.C. Malware code classification based on multi-feature fusion BiLSTM. Electronics 2022, 18, 67–72.
Taheri, R.; Javidan, R.; Shojafar, M.; Pooranian, Z.; Miri, A.; Conti, M. On defending against label flipping attacks on malware detection systems. Neural Comput. Appl. 2020, 32, 14781–14800.
Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
Liu, X.; Lin, Y.; Li, H.; Zhang, J. A novel method for malware detection on ML-based visualization technique. Comput. Secur. 2020, 89, 101682.
Jahangir, M.T.; Wakeel, M.; Asif, H.; Ateeq, A. Systematic Approach to Analyze the Avast IOT-23 Challenge Dataset for Malware Detection Using Machine Learning. In Proceedings of the 2023 18th International Conference on Emerging Technologies (ICET), Peshawar, Pakistan, 6–7 November 2023; pp. 234–239.
Xiong, S.; Zhang, H. A Multi-model Fusion Strategy for Android Malware Detection Based on Machine Learning Algorithms. J. Comput. Sci. Res. 2024, 6, 1–11.
Akhtar, M.S.; Feng, T. Detection of malware by deep learning as CNN-LSTM machine learning techniques in real time. Symmetry 2022, 14, 2308.
Hosseini, S.; Nezhad, A.E.; Seilani, H. Android malware classification using convolutional neural network and LSTM. J. Comput. Virol. Hacki. 2021, 17, 307–318.
Kim, J.; Ban, Y.; Ko, E.; Cho, H.; Yi, J.H. MAPAS: A practical deep learning-based android malware detection system. Int. J. Inf. Secur. 2022, 21, 725–738.
Hemalatha, J.; Roseline, S.A.; Geetha, S.; Kadry, S.; Damaševičius, R. An efficient densenet-based deep learning model for malware detection. Entropy 2021, 23, 344.
Huang, X.; Ma, L.; Yang, W.; Zhong, Y. A method for windows malware detection based on deep learning. J. Signal. Process Sys. 2021, 93, 265–273.
Di Mauro, M.; Galatro, G.; Liotta, A. Experimental review of neural-based approaches for network intrusion management. IEEE Trans. Netw. Serv. 2020, 17, 2480–2495.
Dong, S.; Xia, Y.; Peng, T. Network abnormal traffic detection model based on semi-supervised deep reinforcement learning. IEEE Trans. Netw. Serv. 2021, 18, 4197–4212.
Shaukat, K.; Luo, S.; Varadharajan, V. A novel machine learning approach for detecting first-time-appeared malware. Eng. Appl. Artif. Intel. 2024, 131, 107801.
Zhao, M.; Zhang, X.; Zhu, W.; Zhu, S. Malware detection method based on LSTM-SVM model. J. East China Univ. Sci. Technol. 2022, 48, 677–864.
Damaševičius, R.; Venčkauskas, A.; Toldinas, J.; Grigaliūnas, Š. Ensemble-based classification using neural networks and machine learning models for windows pe malware detection. Electronics 2021, 10, 485.
Pardhi, P.R.; Rout, J.K.; Ray, N.K.; Sahu, S.K. Classification of malware from the network traffic using hybrid and deep learning based approach. SN Comput. Sci. 2024, 5, 162.
Laghrissi, F.; Douzi, S.; Douzi, K.; Hssina, B. Intrusion detection systems using long short-term memory (LSTM). J. Big Data 2021, 8, 65.
Feng, R.; Chen, Z.; Yi, S. Research on maize variety identification based on Bayesian optimization SVM. Spectrosc. Spectr. Anal. 2022, 42, 1698–1703.
Kurani, A.; Doshi, P.; Vakharia, A.; Shah, M. A comprehensive comparative study of artificial neural network (ANN) and support vector machines (SVM) on stock forecasting. Ann. Data Sci. 2023, 10, 183–208.
Yang, F.; Zhao, W. Bearing fault diagnosis based on Bayesian optimized SVM. Coal Mine Mach. 2022, 43, 178–180.
Berkenkamp, F.; Krause, A.; Schoellig, A.P. Bayesian optimization with safety constraints: Safe and automatic parameter tuning in robotics. Mach. Learn. 2023, 112, 3713–3747.
Kumar, A.; Kuppusamy, K.; Aghila, G. A learning model to detect maliciousness of portable executable using integrated feature set. J. King Saud Univ.-Comput. Inf. Sci. 2019, 31, 252–265.
Mahdavifar, S.; Abdul Kadir, A.F.; Fatemi, R.; Alhadidi, D.; Ghorbani, A.A. Dynamic Android Malware Category Classification using Semi-Supervised Deep Learning. In Proceedings of the 2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), Calgary, AB, Canada, 17–22 August 2020; pp. 515–522.
Samaneh, M.; Dima, A.; Ghorbani, A.A. Effective and Efficient Hybrid Android Malware Classification Using Pseudo-Label Stacked Auto-Encoder. Int. J. Pure Appl. Sci. 2022, 30, 22.
Beştaş, M.Ş.; Dinler, Ö.B. Detection of Android Based Applications with Traditional Metaheuristic Algorithms. Int. J. Interact. Des. Manuf. 2023, 9, 381–392.
Rani, S.; Tripathi, K.; Kumar, A. Machine learning aided malware detection for secure and smart manufacturing: A comprehensive analysis of the state of the art. Int. J. Interact. Des. Manuf. 2023, 1–28.
Anggraini, N.; Pamungkas, M.S.T.; Rozy, N.F. Performance Optimization of Naïve Bayes Algorithm for Malware Detection on Android Operating Systems with Particle Swarm Optimization. In Proceedings of the 2023 11th International Conference on Cyber and IT Service Management (CITSM), Makassar, Indonesia, 10–11 November 2023; pp. 1–5.
Singh, P.; Borgohain, S.K.; Kumar, J. Investigation and pre-processing of CLaMP malware dataset for machine learning models. In Proceedings of the 2022 6th International Conference on Electronics, Communication and Aerospace Technology, Coimbatore, India, 1–3 December 2022; pp. 891–895.
Mohamed, S.E.; Ashaf, M.; Ehab, A.; Shereef, O.; Metwaie, H.; Amer, E. Detecting malicious android applications based on API calls and permissions using machine learning algorithms. In Proceedings of the 2021 International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC), Cairo, Egypt, 26–27 May 2021; pp. 1–6.
Sawadogo, Z.; Mendy, G.; Dembele, J.M.; Ouya, S. Android malware detection: Investigating the impact of imbalanced data-sets on the performance of machine learning models. In Proceedings of the 2022 24th International Conference on Advanced Communication Technology (ICACT), PyeongChang, Republic of Korea, 13–16 February 2022; pp. 435–441.
Musikawan, P.; Kongsorot, Y.; You, I.; So-In, C. An Enhanced Deep Learning Neural Network for the Detection and Identification of Android Malware. IEEE Internet Things J. 2023, 10, 8560–8577.
Bhagwat, S.; Gupta, G.P. Android malware detection using hybrid meta-heuristic feature selection and ensemble learning techniques. In Proceedings of the International Conference on Advances in Computing and Data Sciences, Kurnool, India, 22–23 April 2022; pp. 145–156.
Kattamuri, S.J.; Penmatsa, R.K.V.; Chakravarty, S.; Madabathula, V.S.P. Swarm optimization and machine learning applied to PE malware detection towards cyber threat intelligence. Electronics 2023, 12, 342.
Raju, P.; Raju, K.S.; Kalidindi, A. Feature selection and performance improvement of malware detection system using cuckoo search optimization and rough sets. Int. J. Adv. Comput. Sci. App. 2020, 11, 2020.

Figure 1. Internal unit structure of the LSTM.

Figure 2. Flowchart of the BO-SVM model’s establishment.

Figure 3. Flowchart of the LSTM-BO-SVM model’s establishment.

Figure 4. Flowchart of the LSTM model’s establishment.

Figure 5. Confusion matrix for ClaMP.

Figure 6. Confusion matrix for CICMalDroid-2020.

Figure 7. Accuracy comparisons of six models on ClaMP.

Figure 8. Accuracy comparisons of six models on CICMalDroid-2020.

Figure 9. Precision comparisons of six models on ClaMP.

Figure 10. Precision comparisons of six models on CICMalDroid-2020.

Figure 11. Recall rate comparisons of six models on ClaMP.

Figure 12. Recall rate comparisons of six models on CICMalDroid-2020.

Figure 13. F1-Score comparisons of six models on ClaMP.

Figure 14. F1-Score comparisons of six models on CICMalDroid-2020.

Table 1. ClaMP dataset.

Name	Quantity (samples)
Benign software	2488
Malware	2624
Total	5112

Table 2. CICMalDroid-2020 dataset.

Name	Quantity (samples)
Adware	1253
Banking malware	2100
SMS malware	3904
Mobile risk software	2546
Benign software	1795
Total	11,598
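
The class counts in Tables 1 and 2 can be turned into a quick balance check. The sketch below is illustrative only: the helper names `class_shares` and `imbalance_ratio`, and the use of the largest-to-smallest class ratio as a summary of imbalance, are assumptions made here and are not part of the paper's method.

```python
# Class counts copied from Table 1 (ClaMP) and Table 2 (CICMalDroid-2020).
CLAMP = {"Benign software": 2488, "Malware": 2624}
CICMALDROID = {
    "Adware": 1253,
    "Banking malware": 2100,
    "SMS malware": 3904,
    "Mobile risk software": 2546,
    "Benign software": 1795,
}


def class_shares(counts):
    """Return each class's share of the dataset as a percentage."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


def imbalance_ratio(counts):
    """Largest class size divided by smallest class size (1.0 means perfectly balanced).

    This ratio is a hypothetical summary chosen for illustration.
    """
    return max(counts.values()) / min(counts.values())


for label, counts in (("ClaMP", CLAMP), ("CICMalDroid-2020", CICMALDROID)):
    print(label, "total:", sum(counts.values()))
    for name, share in class_shares(counts).items():
        print(f"  {name}: {share:.1f}%")
    print(f"  imbalance ratio: {imbalance_ratio(counts):.2f}")
```

With these counts the ratio is about 1.05 for ClaMP and about 3.12 for CICMalDroid-2020, which is consistent with describing the former as balanced and the latter as unbalanced.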

Table 3. Parameter settings of each model.

Model	Parameter	Value
BO-SVM	Kernel	RBF
ACO-SVM	Kernel	RBF
	Ant Count	30
	MaxIter	100
	Pheromone factor	3
	Pheromone constant	500
	Heuristic function factor	4
	Pheromone volatilization factor	0.3
PSO-SVM	Kernel	RBF
	Particle Size	30
	MaxIter	100
	Inertia Weight	0.9
	Acceleration Coefficients 1	2
	Acceleration Coefficients 2	2
	Velocity Limits_min	−5
	Velocity Limits_max	5
SVM	Kernel	RBF
	$γ$	5
	$C$	1.0

Table 4. The accuracy and training time of each model on the two datasets.

Model	Accuracy on ClaMP (%)	Accuracy on CICMalDroid-2020 (%)	Training Time on ClaMP (in Seconds)	Training Time on CICMalDroid-2020 (in Seconds)
BO-SVM	95.3	96.4	34.6	40.5
ACO-SVM	95.0	96.0	5185.4	5434.5
PSO-SVM	94.8	95.8	4223.4	4734.2
BO-DT	95.1	95.7	32.4	35.4
SVM	87.5	94.3	1.7	3.1
Anggraini et al. [34] KNN	93.5	\	\	\
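
To make the accuracy and training-time trade-off in Table 4 easier to read, the following sketch expresses every model relative to BO-SVM. It is a reading aid built only on the table's values; the function name `compare_to` and the choice of BO-SVM as the reference are assumptions for illustration. The KNN row is omitted because it reports no training time.

```python
# Accuracy (%) and training time (s) copied from Table 4.
# Layout: model -> dataset -> (accuracy, training_time_seconds)
TABLE4 = {
    "BO-SVM": {"ClaMP": (95.3, 34.6), "CICMalDroid-2020": (96.4, 40.5)},
    "ACO-SVM": {"ClaMP": (95.0, 5185.4), "CICMalDroid-2020": (96.0, 5434.5)},
    "PSO-SVM": {"ClaMP": (94.8, 4223.4), "CICMalDroid-2020": (95.8, 4734.2)},
    "BO-DT": {"ClaMP": (95.1, 32.4), "CICMalDroid-2020": (95.7, 35.4)},
    "SVM": {"ClaMP": (87.5, 1.7), "CICMalDroid-2020": (94.3, 3.1)},
}


def compare_to(reference, table=TABLE4):
    """Compare every other model with `reference`, per dataset.

    Returns model -> dataset -> (accuracy lead of the reference in
    percentage points, other model's training time / reference's time).
    """
    result = {}
    for model, per_dataset in table.items():
        if model == reference:
            continue
        result[model] = {}
        for dataset, (acc, secs) in per_dataset.items():
            ref_acc, ref_secs = table[reference][dataset]
            result[model][dataset] = (ref_acc - acc, secs / ref_secs)
    return result


for model, per_dataset in compare_to("BO-SVM").items():
    for dataset, (lead, ratio) in per_dataset.items():
        print(f"{model} on {dataset}: BO-SVM leads by {lead:.1f} points, "
              f"training takes {ratio:.2f}x the BO-SVM time")
```

Read this way, ACO-SVM and PSO-SVM need more than a hundred times the training time of BO-SVM on both datasets while trailing it slightly in accuracy, and the plain SVM trains fastest but is the least accurate.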

Table 5. Training and detection time of six models.

Datasets	Model	Training Time (in Seconds)	Detection Time (in Seconds)
ClaMP	LSTM-BO-SVM	136.94	1.23
	BO-SVM	34.68	0.22
	LSTM	110.49	0.96
	LSTM-SVM	113.09	1.27
	RNN-BO-SVM	124.63	1.11
	BiLSTM-BO-SVM	160.58	2.89
CICMalDroid-2020	LSTM-BO-SVM	496.63	1.58
	BO-SVM	45.56	0.32
	LSTM	450.4	1.12
	LSTM-SVM	456.58	1.35
	RNN-BO-SVM	412.12	1.04
	BiLSTM-BO-SVM	560.47	3.21

Table 6. Comparison with other methods.

Dataset	Reference	Year	Method	Accuracy (%)	Computation Time (in Seconds)
ClaMP	Masum et al. [2]	2022	QNN	52.1	2698
	Masum et al. [2]	2022	QSVM	73.5	10000
	Raju et al. [40]	2020	RF	92	\
	Proposed	2024	BO-SVM	95.3	34.9
	Kattamuri et al. [39]	2023	ACO, DT	97.69	4365
	Proposed	2024	LSTM-BO-SVM	98.2	138.17
CICMalDroid-2020	Mohamed et al. [35]	2021	KNN	85	\
	Mohamed et al. [35]	2021	SVM	86	\
	Mohamed et al. [35]	2021	DT	88	\
	Musikawan et al. [37]	2022	DNN	93.5	305.15
	Sawadogo et al. [36]	2022	Hist GB, SMOTE	94.09	\
	Sawadogo et al. [36]	2022	Hist GB	95.25	\
	Bhagwat et al. [38]	2022	XGBoost	95.3	\
	Proposed	2024	BO-SVM	96.4	45.88
	Proposed	2024	LSTM-BO-SVM	98.6	498.21

Bold represents the optimal result.
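
The rows for the two proposed models in Table 6 allow a short worked calculation of what the LSTM pre-classification stage gains and costs. The sketch below is an illustration under the table's reported values; the function name `gain_and_cost` and the dictionary keys are hypothetical, and the distinction between percentage points and relative percent is added here for clarity.

```python
# Accuracy (%) and computation time (s) of the two proposed models, from Table 6.
PROPOSED = {
    "ClaMP": {"BO-SVM": (95.3, 34.9), "LSTM-BO-SVM": (98.2, 138.17)},
    "CICMalDroid-2020": {"BO-SVM": (96.4, 45.88), "LSTM-BO-SVM": (98.6, 498.21)},
}


def gain_and_cost(rows):
    """Accuracy gain and time cost of LSTM-BO-SVM relative to BO-SVM."""
    base_acc, base_secs = rows["BO-SVM"]
    new_acc, new_secs = rows["LSTM-BO-SVM"]
    points = new_acc - base_acc  # absolute difference in percentage points
    return {
        "points": points,
        "relative_percent": 100.0 * points / base_acc,  # gain relative to baseline
        "time_ratio": new_secs / base_secs,  # how many times slower
    }


for dataset, rows in PROPOSED.items():
    g = gain_and_cost(rows)
    print(f"{dataset}: +{g['points']:.1f} points "
          f"({g['relative_percent']:.2f}% relative), "
          f"{g['time_ratio']:.1f}x computation time")
```

Under these numbers the gains are 2.9 and 2.2 percentage points, matching the figures quoted in the abstract, at roughly 4 and 11 times the computation time of BO-SVM on ClaMP and CICMalDroid-2020, respectively.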

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
