Article

SDSIOT: An SQL Injection Attack Detection and Stage Identification Method Based on Outbound Traffic

1 State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
2 School of Information Engineering, Xuchang University, Xuchang 461000, China
* Authors to whom correspondence should be addressed.
Electronics 2023, 12(11), 2472; https://doi.org/10.3390/electronics12112472
Submission received: 3 May 2023 / Revised: 27 May 2023 / Accepted: 27 May 2023 / Published: 30 May 2023
(This article belongs to the Section Computer Science & Engineering)

Abstract

An SQL Injection Attack (SQLIA) is a major cyber security threat to Web services, and its different stages can cause different levels of damage to an information system. Attackers can construct complex and diverse SQLIA statements, which often cause most existing inbound-based detection methods to have a high false-negative rate when facing deformed or unknown SQLIA statements. Although some existing works have analyzed different features for the stages of SQLIA from the attacker's perspective, they primarily focus on stage analysis rather than on identifying the different stages. To detect SQLIA and identify its stages, we analyze the outbound traffic from the Web server and find that it can differentiate between SQLIA traffic and normal traffic, and that the outbound traffic generated during the two stages of SQLIA exhibits distinct characteristics. By employing 13 features extracted from outbound traffic, we propose an SQLIA detection and stage identification method based on outbound traffic (SDSIOT), a two-phase method that detects SQLIAs in Phase I and identifies their stages in Phase II. Importantly, it does not need to analyze the complex and diverse malicious statements crafted by attackers. The experimental results show that SDSIOT achieves an accuracy of 98.57% for SQLIA detection and 94.01% for SQLIA stage identification. Notably, the accuracy of SDSIOT's SQLIA detection is 8.22 percentage points higher than that of ModSecurity.

1. Introduction

Web application services are used in ever more areas of our lives year by year, making them prime targets for attackers. In this context, a large number of Web security vulnerabilities have posed significant challenges to network security [1,2]. The SQL Injection Attack (SQLIA) is a type of Web application attack [3,4]. Tools such as SQLMAP [5], which automate SQLIAs, have made such attacks easier to carry out. The Open Web Application Security Project has published its Top 10 report three times since 2013, and each report lists SQLIA as one of the most threatening attacks [3]; see Table 1 for an overview.
To counter the threats posed by SQLIA, researchers have proposed various SQLIA detection methods in recent years. Many of these methods rely on keywords or features extracted from SQLIA statements [6,7,8,9,10]. Such methods are highly accurate at detecting SQLIA statements whose keywords or features have been used to build the detection rules or models (referred to as known SQLIA statements in this article). However, attackers can create complex and diverse SQLIA statements, i.e., they can utilize obfuscation techniques to conceal the attack. Consequently, these methods often suffer a high false-negative rate (FNR) when faced with deformed or unknown SQLIA statements. A complete SQLIA process can be roughly divided into two stages according to the level of threat: finding injection points (FIP) and leaking data (LD). The different stages of SQLIA pose different threat levels to an information system: the FIP stage mainly probes the information system for SQLI vulnerabilities, while the LD stage obtains or manipulates data in the information system's database and may even elevate system privileges. In practical applications, accurately identifying the stage of an SQLIA can help defenders adopt targeted defense mechanisms to prevent the SQLIA from causing more serious damage [11]. Although some researchers [6,11] have analyzed the different characteristics of the stages of SQLIA from the attacker's perspective, these studies focus on analyzing the stages rather than identifying them. Therefore, accurately detecting SQLIAs and identifying their stages has become an important issue that needs urgent attention.
Most SQLIA detection studies have focused on building detection models based on features extracted from SQLIA statements or inbound traffic. From the perspective of the content carried by the traffic, the inbound traffic (from external attackers to a Web server) carries SQLIA statements, while the outbound traffic (from the Web server back to the attackers) carries the Web server's responses. Therefore, outbound traffic reflects richer information about the actual state of the Web server than inbound traffic and is likely to help accurately detect SQLIA and identify its stage. We explore the feasibility of accurately detecting SQLIAs and identifying their stages based solely on features extracted from outbound traffic. To our knowledge, there are few studies on SQLIA detection using outbound traffic, let alone SQLIA stage identification using outbound traffic. To improve the accuracy of SQLIA detection and SQLIA stage identification, we propose an SQLIA detection and stage identification method based on outbound traffic (SDSIOT). SDSIOT detects SQLIAs and identifies their stages based on features extracted from outbound traffic, without the need to analyze the complex and diverse malicious statements crafted by attackers. The contributions of this article are summarized as follows.
  • We explore how to identify the stages of SQLIA from the perspective of outbound traffic. We analyze the differences in outbound traffic between SQLIAs and normal applications and conclude that outbound traffic can be used to accurately identify the stage of an SQLIA. This lays a foundation for building SQLIA stage identification methods that exclusively use outbound traffic.
  • We propose an outbound traffic-based method called SDSIOT, which not only detects SQLIA traffic but also distinguishes between traffic at the FIP stage and the LD stage.
  • We implement a prototype of SDSIOT in Python and evaluate it on two real datasets, one collected by us and the other publicly available. The experimental results show that SDSIOT can detect SQLIA traffic with high accuracy and identify its stage.
The remainder of this article is organized as follows: Section 2 covers related work, and Section 3 provides the motivations for our proposed method. Section 4 proposes the details of the SDSIOT. Section 5 details the experiment and evaluates the results. Finally, Section 6 concludes the work.

2. Related Work

To detect SQLIA, numerous researchers have proposed various methods. In Table 2, we provide a summary of existing SQLIA detection methods, which can generally be categorized into two classes based on whether they use machine learning technologies: traditional detection methods and machine learning-based detection methods.

2.1. Traditional Detection Methods

Traditional methods for detecting SQLIA typically rely on static, dynamic, or hybrid analysis and manually crafted rules. Consequently, they can be classified into three categories: static detection, dynamic detection, and hybrid detection.
Static detection involves examining the source code of a Web application to determine its vulnerabilities [24]. Livshits et al. [12] introduced a static analysis method that uses scalable and accurate points-to analysis to identify vulnerabilities that match a user-supplied vulnerability pattern. Xie et al. [13] used static taint analysis techniques to discover SQLIA vulnerabilities in PHP scripts. Fu et al. [14] introduced SAFELI, a static analysis framework that identifies SQLIA vulnerabilities at the compilation stage. SAFELI can use symbolic execution to statically inspect the MSIL bytecode of ASP.NET Web applications.
Dynamic detection methods do not require analysis of the source code or the database structure. They detect SQLIA vulnerabilities through dynamic penetration testing or by generating models at runtime. Masri and Sleiman [15] proposed SQLPIL, a lightweight and fully automated tool that uses prepared statements to prevent SQLIAs at runtime. Huang et al. [16] introduced a new network vulnerability scanner that uses a combination of penetration testing and evasion techniques to detect injections. Anagandula and Zavarsky [17] analyzed the performance and detection capabilities of the latest black-box Web application security scanners for stored SQLIA and stored XSS and developed custom testbeds to challenge the scanners' detection capabilities. Gu et al. [18] proposed a bidirectional network traffic analysis method to detect successful SQLIAs that result in the leakage of confidential data. Their method introduces a multilevel regular expression model comprising three sets of regular expressions.
Hybrid methods combine the advantages of dynamic and static detection methods. Halfond et al. [19] developed the AMNESIA tool based on traditional blacklisting techniques. It implements a combined black-box and white-box approach that automatically builds a model of legitimate queries that can be generated by the application in the static part. The dynamically created queries are then checked in the dynamic stage and kept consistent against the statically constructed model, and inconsistencies are considered as SQLIAs.

2.2. Machine Learning-Based Detection Methods

In recent years, machine learning has rapidly advanced and has shown promising results in various fields [25]. Many researchers [26,27,28,29] have applied machine learning technologies to construct SQLIA detection methods. Alghawazi et al. [30] conducted a systematic literature review of research on SQLIA detection using machine learning techniques. Generally, machine learning-based SQLIA detection methods can be divided into two groups: traditional machine learning-based SQLIA detection and deep learning-based SQLIA detection methods.
Many studies [26,31] have investigated traditional machine learning-based detection methods. Kamtuo et al. [20] proposed a framework for SQLIA prevention using a compiler platform and machine learning. The results indicated that the decision tree (DT) is the best model in terms of processing time and has the highest prediction efficiency. Choi et al. [21] extracted features from the source code using n-grams and then used a support vector machine (SVM) classification algorithm to train the detection model. Li et al. [22] proposed an adaptive method based on a deep forest to detect complex SQLIAs. This method improves the deep forest structure for traffic detection and introduces the AdaBoost algorithm, with its adaptive capability, into the deep forest model. Guo et al. [10] implemented a detection method based on the truncated key payload, which retains only the parts of attack statements that differ significantly from normal HTTP requests; classification algorithms such as SVM and K-Nearest Neighbor (KNN) are then used for training and testing.
Deep learning is a type of machine learning algorithm that has received significant attention in recent years. Luo et al. [8] used UNSW-NB15, KDD99, and HTTP CSIC 2010 datasets as training data, and collected some SQLIA samples in real environments as validation data. They obtained the payload of SQLIA issued by attackers by manual data cleaning, transformed them into vectors by word embedding, and conducted experiments using CNNs. Li et al. [7] proposed an LSTM-based communication attack behavior analysis to generate SQLIA samples and combined the methods to address the problem of lack of attack samples. The SQLIAs are classified into three types: in-band attacks, inference attacks, and out-of-band attacks. Two feature vector transformation methods, Word2vec and Bag-of-words, were compared. Tang et al. [9] proposed a neural network-based SQLIA detection framework. They used multiple neural networks to detect SQLIA separately. The multilayer perceptron (MLP)-based SQLIA detection model first extracts the corresponding URL features as the input of the neural network, and then performs MLP network training and saves the model with the best training effect as the optimal model. Liu et al. [23] proposed a deep natural language processing-based tool, dubbed DeepSQLi, to generate test cases for detecting SQLI vulnerabilities.
Overall, existing SQLIA detection methods can achieve high accuracy for detecting known SQLIAs. However, since most SQLIA detection methods rely on the keywords extracted from known SQLIA statements, they may have a high FNR when dealing with deformed or unknown SQLIA statements. Additionally, many studies have analyzed or extracted different features of the stages of SQLIA from the perspective of attackers, but these studies often focus on analyzing the stage rather than identifying different stages. We not only focus on SQLIA detection but also on SQLIA stage identification. Furthermore, the analysis object used in this article is only the outbound traffic, which is different from most SQLIA detection studies that use the keywords or features extracted from SQLIA statements or inbound traffic to develop detection models.

3. Motivation

Throughout this article, we define a flow as the combination of a 5-tuple of network information, namely the source IP address, source port, destination IP address, destination port, and protocol, similar to reference [32]. A flow can be either inbound or outbound. SQLIAs typically target Web services that operate over the HTTP or HTTPS protocol, which are request-response based. Therefore, if we identify an outbound flow from a Web server as abnormal, the preceding inbound flow that issued the request to the Web server can be judged to be an inbound flow of the SQLIA. Consequently, accurately identifying outbound flows is effectively equivalent to accurately identifying SQLIAs in SQLIA scenarios.
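For concreteness, the 5-tuple flow key can be represented directly in code; the sketch below (the field names and the direction helper are our own illustration, not part of the SDSIOT implementation) shows how a flow might be tagged as inbound or outbound relative to the Web server's address.

```python
from collections import namedtuple

# A flow is identified by the 5-tuple defined above; field names are illustrative.
FlowKey = namedtuple("FlowKey", ["src_ip", "src_port", "dst_ip", "dst_port", "protocol"])

def direction(flow: FlowKey, web_server_ip: str) -> str:
    """Label a flow as inbound (towards the Web server) or outbound (from it)."""
    if flow.dst_ip == web_server_ip:
        return "inbound"
    if flow.src_ip == web_server_ip:
        return "outbound"
    return "unrelated"

# Example: the outbound response flow that pairs with an inbound HTTP request.
resp = FlowKey("192.168.1.10", 80, "192.168.1.55", 51234, "TCP")
print(direction(resp, "192.168.1.10"))  # -> "outbound"
```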
To analyze the difference between using inbound and outbound traffic to detect SQLIAs, we use an SQLIA statement “http://www.xx.com/xx.php?id=1 union select 1, database(),3” as an example of a known SQLIA statement, which is commonly used to obtain the database name of the target system. The other three SQLIA statements in Figure 1 are the deformed statements of the known SQLIA statement. These deformed statements, as shown in Figure 1, can be regarded as the statements modified by attackers through the use of obfuscation techniques.
Inbound traffic. To detect the SQLIA statement “http://www.xx.com/xx.php?id=1 union select 1, database(),3” based on inbound traffic, we can extract the “union”, “select”, “,” “space”, and other keywords or symbols from the statement to accurately detect it. However, attackers can construct deformed statements by altering the original statement in various ways, such as changing “select” to “sEleCT” (since the backend of the Web system is case sensitive while the database is case insensitive), coding “,” as “%2C” (since the Web system backend can parse URL codes), or transforming “space” into “%0a” (since “%0a” can be used instead of a space) to evade detection and achieve the same attack target (in Figure 1, the database name of the Web server is obtained). As a result, the detection method built on known SQLIA statements may be ineffective in identifying these deformed statements.
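To make the evasion problem concrete, the short sketch below (illustrative only; the exact deformed strings in Figure 1 may differ) shows how variants built with the deformations described above decode back to the same query once URL decoding and whitespace/case normalization are applied, which is exactly the normalization burden that inbound keyword-based detection has to keep up with.

```python
from urllib.parse import unquote

original = "1 union select 1, database(),3"
deformed = [
    "1 union sEleCT 1, database(),3",          # case mixing
    "1 union select 1%2C database()%2C3",      # "," encoded as %2C
    "1%0aunion%0aselect%0a1,%0adatabase(),3",  # spaces replaced with %0a
]

def normalize(stmt: str) -> str:
    # URL-decode, fold case, and collapse %0a/newlines back to single spaces.
    return " ".join(unquote(stmt).lower().split())

for d in deformed:
    print(normalize(d) == normalize(original))  # True for all three variants
```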
Outbound Traffic. To detect the SQLIA statement “http://www.xx.com/xx.php?id=1 union select 1, database(),3” based on outbound traffic, we can extract the features from the Web server’s response traffic to this statement. The outbound traffic we focus on includes the Web server’s response to both the original SQLIA statement and its deformed statements. We confirm that the server’s response to these deformed statements is generally consistent or similar to that of the original SQLIA statement since the purpose of the attack remains the same. For example, in Figure 1, the Web server responds with a message containing its database name for both the original statement and its deformed statements. This enables the detection method built on outbound traffic to detect the deformed statements of known SQLIA statements, regardless of how attackers have modified them. A greater analysis of outbound traffic characteristics is provided in Section 4.3.
Accurately identifying the stages of SQLIA is crucial to determine the threat posed by this attack, as different stages of SQLIA present varying levels of threat to an information system [33]. As described in Section 1, a complete SQLIA process can be divided into two primary stages based on the threat levels posed, namely FIP and LD. These two stages are closely related, as illustrated in Figure 2. The effectiveness of inbound and outbound traffic in identifying the stages of SQLIA is analyzed below. To conduct this analysis, we set up a Web server on a local area network (LAN) and deployed the SQLi-Labs [34] application, which contains multiple SQLIA vulnerabilities. Additionally, we utilized the SQLMAP tool [5] on an attacking host within the same LAN to perform SQLIAs on various pages of the Web application. Throughout the SQLIAs, we employed the Wireshark tool to capture both inbound and outbound traffic. The process of implementing the SQLMAP involved several steps, including identifying the injection point and vulnerability type, determining the database type and version, retrieving database table information, inspecting field information within tables, and examining table content. We classified the SQLIA inbound and outbound traffic related to the steps of identifying the injection point and vulnerability type, as well as determining the database type and version, as the FIP stage. Similarly, SQLIA inbound and outbound traffic associated with the steps of retrieving database table information, inspecting field information within tables, and examining table content were categorized as the LD stage.
Inbound traffic. Inbound traffic for SQLIA primarily carries SQLIA statements that are carefully crafted by attackers. Therefore, it can directly reflect the attacker’s behavior and is used by most existing SQLIA detection methods. These methods can be divided into two main categories. The first one is based on syntax trees or word vectors extracted by keywords [7,33,35]. The second category consists of methods that use features extracted from SQLIA statements [6,9,10].
Attackers generally try to construct deformed statements to achieve their goals and evade detection. Therefore, identifying the stage of SQLIA can be challenging because its statements have a high degree of variability. To investigate this issue, we extracted only the inbound traffic from the collected data and further extracted 1000 SQLIA statements from each of the two stages. We analyzed these statements in terms of their two characteristics: the length of the malicious SQL code in an SQLIA statement (referred to as malicious code, in the example shown in Figure 1, where it is “1 union select 1, database(),3”) and the keywords in SQLIA statements.
(1) The length of the malicious code in an SQLIA statement
Figure 3 shows the distribution of the length of malicious code for statements belonging to each of the SQLIA stages. As illustrated, statements belonging to both the FIP and LD stages have lengths of malicious code falling in the 3–50, 51–70, 71–100, 101–130, 131–150, 151–180, 181–200, 201–230, and 231–300 intervals. This implies that accurately distinguishing between different stages of SQLIA using only the length of malicious code for SQLIA statements within the aforementioned intervals can be challenging.
(2) Keywords in SQLIA statements
Figure 4 displays the frequency of keywords in SQLIA statements belonging to each of the SQLIA stages. It is observed that many keywords, such as “union”, “select”, “and”, “or” and “from” appear in both stages of SQLIA. Hence, it can be challenging to use keywords from statements to accurately distinguish between different stages.
Outbound traffic. Outbound traffic provides more comprehensive information about the Web server’s actual state compared with inbound traffic. In this study, we investigate the feasibility of using outbound traffic to differentiate between the different stages of SQLIA. We collected 300 consecutive outbound flows from the Web server during the FIP and LD stages, respectively. The payload size of a flow refers to the sum of the packet sizes of the packets containing payloads (referred to as payload packets in this article) in that flow. Figure 5 illustrates the payload sizes of outbound flows generated in the two stages of SQLIA. Most outbound flows generated in the LD stage have larger payload sizes than those generated in the FIP stage. This is because the FIP stage only probes for SQLI vulnerabilities and mainly generates outbound flows containing basic information about the Web server, such as system or application types. In contrast, during the LD stage, data from the database are leaked, resulting in larger payload sizes for outbound flows. We describe more detailed differences in the outbound traffic generated in the two SQLIA stages in Section 4.3.2.
From the above analysis, we can conclude that the outbound traffic generated in the two SQLIA stages exhibits distinct characteristics. Based on this idea, we construct SDSIOT and provide further details in Section 4.

4. Proposed SDSIOT

4.1. Framework of SDSIOT

Based on the analysis presented in Section 3, this section proposes an outbound traffic-based method called SDSIOT. SDSIOT adopts a two-phase structure. The detection model of Phase I detects SQLIA traffic, and the identification model of Phase II identifies SQLIA stages. The reason for adopting a two-phase structure is that the effective features for SQLIA detection and SQLIA stage identification are not identical, as will be presented in Section 4.3. Figure 6 presents the framework of SDSIOT, which includes four steps.
Step 1: Data preprocessing. This step filters out the irrelevant network traffic.
Step 2: Feature extraction. In this step, a total of 13 features are extracted from outbound traffic. Eight of these features are used for SQLIA detection, and the remaining five features are used for SQLIA stage identification.
Step 3 (Phase I): SQLIA detection. A binary classification model is built based on the eight features and a classification algorithm. This model is used to detect SQLIA traffic, and the flows identified as SQLIA traffic in this step are sent to the next step for further stage identification.
Step 4 (Phase II): SQLIA stage identification. Another classification model is constructed based on the five remaining features and a classification algorithm. It is used to identify which SQLIA stage the flow belongs to.
The details of these steps will be provided in the following subsections.

4.2. Data Preprocessing

In this step, the aim is to filter out irrelevant network traffic so as to reduce the amount of traffic that SDSIOT needs to analyze. First, traffic that does not use port 80 or 443 or the TCP protocol is filtered out, as all SQLIA traffic is carried over the HTTP or HTTPS protocol. Second, traffic whose connection is not completely established is filtered out. Finally, inbound traffic is filtered out, as SDSIOT only analyzes outbound traffic.
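A minimal sketch of this filtering step, assuming traffic has already been reassembled into flow records with port, protocol, handshake, and address fields (the field names are our own, not those of the SDSIOT implementation):

```python
def preprocess(flows, web_server_ip):
    """Keep only fully established HTTP/HTTPS outbound flows, as described above."""
    kept = []
    for f in flows:
        # 1) Drop non-TCP traffic and traffic not on ports 80/443.
        if f["protocol"] != "TCP" or f["server_port"] not in (80, 443):
            continue
        # 2) Drop flows whose connection was never completely established.
        if not f["handshake_completed"]:
            continue
        # 3) Drop inbound flows; SDSIOT analyzes outbound traffic only.
        if f["src_ip"] != web_server_ip:
            continue
        kept.append(f)
    return kept
```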

4.3. Feature Extraction

This step involves extracting thirteen traffic features from outbound traffic, which are used for both SQLIA detection and SQLIA stage identification. The analysis and extraction of each feature are described below.

4.3.1. Feature Extraction for SQLIA Detection

To analyze the characteristics of the outbound traffic generated by a Web server under SQLIA, we used the SQLIA traffic described in Section 3 and, additionally, 4.32 GB of normal traffic generated by users browsing different websites. We collected 2000 outbound flows generated while users browsed the websites and 2000 outbound flows (1000 from the FIP stage and 1000 from the LD stage) generated by the Web server during SQLIA for analysis. During an SQLIA, the Web server generates a large number of abnormal outbound flows, which are its responses to the SQLIA. We speculated that the consecutive outbound flows sent by the Web server to the same host during an SQLIA are more strongly correlated with one another than the consecutive outbound flows a website sends to the same host while a user browses it.
To verify this speculation, we analyzed several statistical properties of these consecutive outbound flows. Specifically, we observed four differences between the outbound traffic generated by the Web server during SQLIA and the outbound traffic generated by websites while users browsed them.
(1) As shown in Figure 7a, the average duration of outbound flows from the Web server during SQLIA is smaller than that of outbound flows during users browsing websites. This is mainly due to the fact that the purpose of an SQLIA is to obtain specific information, rather than to browse the website normally.
(2) Figure 7b shows that the average time interval between outbound flows from the Web server during the SQLIA process is shorter than that of the outbound flows during users browsing websites. This is mainly because the request rate for an attacker launching an SQLIA through tools is much higher than the request rate for a person browsing a website normally.
(3) We extracted the number of packets in each outbound flow and calculated the standard deviation of the packet numbers for outbound flows generated by the Web server during SQLIA and the outbound traffic during users browsing websites. Figure 7c shows that the standard deviation of the number of packets in outbound flows from the Web server during SQLIA is much smaller than that of the outbound flows during users browsing websites. This indicates that, compared with the outbound flows during users’ browsing websites, the number of packets for each outbound flow from the Web server does not fluctuate much during the SQLIA process.
(4) We extracted the number of payload packets in each outbound flow and calculated the standard deviation of the number of payload packets in outbound flows from the Web server during SQLIA and outbound flows during users browsing websites. Figure 7d shows that the standard deviation of the number of payload packets in the outbound flows from the Web server during SQLIA is smaller than that of outbound flows during users browsing websites. This is because attackers often perform SQLIAs on a single page or a single vulnerable URL, and the number of packets and payload packets from the Web server varies relatively little during this period. On the other hand, during users browsing websites, different pages are usually browsed or different URLs are clicked, resulting in relatively significant variation in the number of packets and payload packets in different outbound flows.
In summary, Figure 7 reflects that the average duration, the average interval, the standard deviation of the number of packets, and the standard deviation of the number of payload packets can effectively distinguish between outbound flows generated during SQLIA and users browsing websites. However, the majority of these statistical properties cannot be used to distinguish the flows belonging to each SQLIA stage.
Based on the above four statistical properties, we extract eight features from outbound traffic for SQLIA detection. All the features are shown in Table 3.
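The eight detection features themselves are specified in Table 3; as a rough illustration of how the four statistical properties above could be computed over a group of consecutive outbound flows, consider the following sketch (the flow-record fields and the grouping into windows are our assumptions, not the exact Table 3 definitions):

```python
import statistics

def detection_statistics(flows):
    """flows: consecutive outbound flows to the same host, each a dict with
    'start' and 'end' timestamps (s), 'n_packets', and 'n_payload_packets'."""
    durations = [f["end"] - f["start"] for f in flows]
    gaps = [flows[i + 1]["start"] - flows[i]["end"] for i in range(len(flows) - 1)]
    return {
        "avg_duration": statistics.mean(durations),
        "avg_interval": statistics.mean(gaps) if gaps else 0.0,
        "std_n_packets": statistics.pstdev([f["n_packets"] for f in flows]),
        "std_n_payload_packets": statistics.pstdev([f["n_payload_packets"] for f in flows]),
    }
```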
To verify the effectiveness of the eight features for SQLIA detection, we utilized the t-SNE algorithm [36] to visualize the feature vector. The t-SNE algorithm is capable of reducing high-dimensional data, such as the feature vector, to lower dimensions, typically 2D. Figure 8 illustrates the visualization of these features. This figure demonstrates that the aforementioned features effectively separate the normal flows from the SQLIA flows, as they are distributed in distinct areas. This clear distinction indicates that the features can effectively distinguish between normal flows and SQLIA flows.
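The t-SNE projection of Figure 8 can be reproduced in principle with scikit-learn; a minimal sketch in which random data stand in for the real 8-dimensional feature vectors:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder feature matrix: 8-dimensional vectors for normal and SQLIA flows.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 8)), rng.normal(3, 1, (200, 8))])
y = np.array([0] * 200 + [1] * 200)  # 0 = normal, 1 = SQLIA

emb = TSNE(n_components=2, random_state=0).fit_transform(X)
plt.scatter(emb[y == 0, 0], emb[y == 0, 1], label="normal", s=8)
plt.scatter(emb[y == 1, 0], emb[y == 1, 1], label="SQLIA", s=8)
plt.legend()
plt.show()
```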

4.3.2. Feature Extraction for SQLIA Stage Identification

As mentioned in Section 4.3.1 and illustrated in Figure 7, some statistical properties (such as the standard deviation of the number of packets) cannot be used to distinguish the flows generated in FIP and LD stages. Therefore, using the features listed in Section 4.3.1 to distinguish the flows’ stages is not a viable option. However, in Section 3, we discovered that there is a difference in the payload sizes of the outbound flows corresponding to the different stages of SQLIA. In this subsection, we analyze the outbound flows that are detected as SQLIA to explore the feasibility of distinguishing stages of SQLIA. Specifically, we extracted the payload sizes of these outbound flows to compute their average and standard deviation. We found two differences between the outbound traffic of the two stages of SQLIA.
(1) As illustrated in Figure 9a, the average payload size of the outbound flows generated in the LD stage is larger than that of the outbound flows generated in the FIP stage. This is because, during the FIP stage, attackers only probe for SQL injection vulnerabilities, and the outbound traffic from the Web server carries little application system information. In contrast, during the LD stage, the outbound traffic from the Web server carries a large amount of database data.
(2) Figure 9b shows that the standard deviation of the payload size of the outbound flows generated in the LD stage is larger than that of the outbound flows generated in the FIP stage. This is because, during the FIP stage, attackers usually use brute force or enumeration to probe and find injection points for several specific pages, and the payload size of the outbound flows does not fluctuate much. In contrast, during the LD stage, the Web server begins to leak a lot of database data, and the payload size of the outbound flows changes with the response content, resulting in large fluctuations in the payload sizes of outbound flows generated in the LD stage.
Based on the above analysis, we extracted five features from outbound traffic for SQLIA stage identification. The details are presented in Table 4.
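Analogously, the payload-size statistics discussed above can be computed per group of outbound flows; a sketch under the same assumptions about flow records (the complete five-feature set is given in Table 4):

```python
import statistics

def payload_size(flow):
    """Sum of the sizes of the packets that carry a payload, per the definition in Section 3."""
    return sum(p["size"] for p in flow["packets"] if p["payload_len"] > 0)

def stage_statistics(flows):
    sizes = [payload_size(f) for f in flows]
    return {
        "avg_payload_size": statistics.mean(sizes),
        "std_payload_size": statistics.pstdev(sizes),
    }
```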
To verify the effectiveness of the five features for SQLIA stage identification, we employed the t-SNE algorithm to visualize the feature vector. The visualization of these features is presented in Figure 10. This figure shows that the aforementioned features effectively separate the majority of flows generated in the FIP stage from those in the LD stage, as they are distributed in distinct areas. This distinction indicates that the features can distinguish between flows associated with the FIP stage and the LD stage.

4.4. SQLIA Detection (Phase I)

This phase of SDSIOT is primarily responsible for differentiating outbound traffic generated during SQLIA from outbound traffic generated while users browse websites. From a classification perspective, SQLIA detection is a binary classification problem. Therefore, the quality of the employed features and the classification algorithm are crucial to an effective SQLIA detection model. Using a classification algorithm and the eight features enumerated in Section 4.3.1, an SQLIA detection model is constructed; Algorithm 1 provides the pseudocode for building it. In our experiment, as described in Section 5, we utilized four classic classification algorithms, namely SVM, KNN, DT, and RF, to individually train the SQLIA detection model and then compared their performances.
Once the SQLIA detection model has been trained, it can determine whether a given flow x_i in a test set belongs to the SQLIA or the normal category. If the model predicts that x_i is SQLIA, the subsequent phase (Phase II) identifies its stage.
Algorithm 1 Building the SQLIA detection and stage identification models of SDSIOT
Input: Training flows for the SQLIA detection model (Training_flows_detection, labeled "Normal" or "SQLIA"); training flows for the stage identification model (Training_flows_identification, labeled "FIP" or "LD")
Output: A binary classifier for detecting SQLIA flows (SQLIA_detection_model); a binary classifier for identifying a flow's stage (SQLIA_stage_identification_model)
===== Build the SQLIA detection model =====
1: D_Training_set ← ∅;
2: for each flow x_i in Training_flows_detection do
3:     D_vec(x_i) ← ExfD(x_i); // ExfD extracts the eight detection features from x_i
4:     D_Training_set ← D_Training_set ∪ D_vec(x_i);
5: end for
6: SQLIA_detection_model ← train_model(D_Training_set, Classification_Algorithm_1);
===== Build the SQLIA stage identification model =====
7: I_Training_set ← ∅;
8: for each flow x_i in Training_flows_identification do
9:     I_vec(x_i) ← ExfI(x_i); // ExfI extracts the five features from x_i for identifying the stage
10:    I_Training_set ← I_Training_set ∪ I_vec(x_i);
11: end for
12: SQLIA_stage_identification_model ← train_model(I_Training_set, Classification_Algorithm_2);
13: return SQLIA_detection_model, SQLIA_stage_identification_model
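A scikit-learn rendering of Algorithm 1 might look as follows; the feature extractors ExfD and ExfI are placeholders, and KNN and DT are used here because they performed best in Phases I and II, respectively, in the experiments of Section 5:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def build_models(det_flows, det_labels, ident_flows, ident_labels, exf_d, exf_i):
    """det_labels: 'Normal'/'SQLIA'; ident_labels: 'FIP'/'LD' (SQLIA flows only)."""
    X_det = np.array([exf_d(f) for f in det_flows])      # eight detection features
    X_ident = np.array([exf_i(f) for f in ident_flows])  # five stage features
    detection_model = KNeighborsClassifier().fit(X_det, det_labels)
    stage_model = DecisionTreeClassifier().fit(X_ident, ident_labels)
    return detection_model, stage_model
```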

4.5. SQLIA Stage Identification (Phase II)

A test flow x_i is further examined by the SQLIA stage identification model to determine its stage if the SQLIA detection model predicts it to be SQLIA. Determining which SQLIA stage the flow belongs to is the task of this phase. Similarly, from a classification perspective, SQLIA stage identification is a binary classification problem. Thus, the quality of the employed features and the classification algorithm are both crucial components of this identification model. Specifically, the five features listed in Section 4.3.2 are used to build an SQLIA stage identification model with a classification algorithm. In our experiment, as described in Section 5, we employed SVM, KNN, DT, and RF to individually train the SQLIA stage identification model and then compared their performances. Note that there are two distinctions between the SQLIA detection model and the SQLIA stage identification model: (a) the two models are built using different features; and (b) the training set for Phase II only contains SQLIA flows.
After training, the built SQLIA stage identification model is used to determine which stage x_i belongs to. Algorithms 1 and 2 provide the pseudocode for building the SQLIA stage identification model and for the detection and identification process of SDSIOT, respectively.
Algorithm 2 SQLIA detection and stage identification process of SDSIOT
Input: A flow x_i to be detected and stage-identified; the built classifier for detecting SQLIA flows (SQLIA_detection_model); the built classifier for identifying the stages of SQLIA flows (SQLIA_stage_identification_model)
Output: The predicted label (predicted_label_i) of x_i (normal flow, SQLIA FIP, or SQLIA LD)
1: predicted_label_i ← SQLIA_detection_model(x_i);
2: if predicted_label_i == "normal" then
3:     predicted_label_i ← "normal";
4: else
       // If x_i is declared an SQLIA flow by SQLIA_detection_model, SQLIA_stage_identification_model identifies its stage.
5:     predicted_label_i ← SQLIA_stage_identification_model(x_i);
6: end if
7: return predicted_label_i
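Correspondingly, Algorithm 2 reduces to a short two-phase prediction function (again, the feature extractors are placeholders, and the label strings follow the sketch after Algorithm 1):

```python
def predict(flow, detection_model, stage_model, exf_d, exf_i):
    """Return 'normal', 'FIP', or 'LD' for one outbound flow."""
    label = detection_model.predict([exf_d(flow)])[0]
    if label == "Normal":
        return "normal"
    # Flow was declared SQLIA; Phase II identifies its stage.
    return stage_model.predict([exf_i(flow)])[0]
```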

5. Experiment and Results

5.1. Experimental Environment and Datasets

To evaluate the effectiveness of SDSIOT, we conducted a series of experiments on traffic generated by several common SQLIA tools and on a public dataset. SDSIOT was developed in Python 3.8 with the scikit-learn library (version 0.24.2). All experiments were carried out on a Windows 10 x64 system with 16 GB of memory and an Intel Core i7-10700F CPU. To collect SQLIA traffic, we used SQLi-Labs [34] and DVWA-Web [37] as Web applications with SQLI vulnerabilities. We installed three popular SQLIA tools (SQLMAP [5], SuperSQLI [38], and JSQL [39]) on another host to execute SQLIAs against the built Web applications; all three tools can successfully inject them. We executed four common types of SQLIA, namely union injection, error injection, time-based blind injection, and Boolean-based blind injection. The normal traffic used in our experiment was traffic generated by browsing websites such as business, government, social, and school websites. A total of 4.38 GB of normal and SQLIA traffic was collected from the aforementioned Web applications using Wireshark. From the collected traffic, we extracted 21,000 normal HTTP flows and 19,500 SQLIA flows to train and test SDSIOT. Specifically, we used 10,500 SQLIA flows collected with the SQLMAP and SuperSQLI tools and 12,000 normal flows to build the SQLIA detection model of SDSIOT. The same 10,500 SQLIA flows (each labeled FIP or LD) were used to train the stage identification model of SDSIOT. We used 4500 flows collected with the SQLMAP and SuperSQLI tools and 4500 normal flows as one test set (denoted "test set 1") to evaluate SDSIOT's ability to detect SQLIAs launched by known SQLIA tools. We used another 4500 flows collected with JSQL and 4500 normal flows as a second test set (denoted "test set 2") to evaluate SDSIOT's ability to detect SQLIAs launched by an unknown SQLIA tool. The specific distribution of the datasets is shown in Table 5. Z-score normalization was used to preprocess the features of each flow.
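Z-score normalization of the flow features can be carried out with scikit-learn's StandardScaler; a minimal sketch with toy matrices standing in for the real feature sets (the scaler is fitted on the training flows only and then applied to each test set):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the real feature matrices described in Table 5.
X_train = np.random.rand(100, 8)
X_test = np.random.rand(20, 8)

scaler = StandardScaler().fit(X_train)   # fit on training flows only
X_train_z = scaler.transform(X_train)
X_test_z = scaler.transform(X_test)      # apply the same transform to each test set
```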

5.2. Evaluation Metrics

Five widely used evaluation metrics, namely precision, recall, FNR, F1-score, and accuracy, are employed to evaluate the performance of SDSIOT. These metrics are calculated from the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) according to Formulas (1)–(5). Moreover, the efficiency of SDSIOT is assessed through the time taken for feature extraction, training, detection, and identification.
\(\text{Precision} = \frac{TP}{TP + FP}\)  (1)

\(\text{Recall} = \frac{TP}{TP + FN}\)  (2)

\(\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\)  (3)

\(\text{FNR} = \frac{FN}{FN + TP}\)  (4)

\(\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\)  (5)
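Given the TP, TN, FP, and FN counts, the five metrics follow directly from Formulas (1)–(5); a small helper of our own for reference:

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Compute precision, recall, F1-score, FNR, and accuracy from the confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fnr": fn / (fn + tp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }
```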

5.3. Experiment

5.3.1. Experimental Results of SDSIOT

We selected four classic classification algorithms, namely SVM, KNN, DT, and RF, to train the SQLIA detection and stage identification models of SDSIOT. Table 6 lists the parameters used in the four classification algorithms in the experiment. The SQLIA detection results obtained by SDSIOT with different classification algorithms in phase I on test sets 1 and 2 are shown in Table 7.
As shown in Table 7, SDSIOT with KNN in Phase I obtained the best accuracy and F1-score among the four classification algorithms for detecting SQLIA. Specifically, when KNN was used to train the SQLIA detection model of SDSIOT, the accuracy, precision, recall, and F1-score values obtained on test set 1 were all higher than 98% and the FNR was only 0.36%, indicating that SDSIOT is highly accurate at detecting SQLIAs launched by known SQLIA tools. Similarly, on test set 2, the accuracy and F1-score obtained by SDSIOT with KNN were 98.57% and 98.56%, respectively, indicating that SDSIOT also detects SQLIAs launched by an unknown SQLIA tool with high accuracy. Therefore, we selected KNN as the classification algorithm for training the SQLIA detection model of SDSIOT in the following experiments.
The stage identification results obtained by SDSIOT with different classification algorithms in phase II on test sets 1 and 2 are presented in Table 8. From Table 8, it can be seen that when DT was used to train the stage identification model in phase II, SDSIOT achieved the best performance on both test sets, with accuracy, precision, recall, and F1-score values higher than 91%. Thus, we use DT as the classification algorithm to build the identification model of SDSIOT in the following experiments.

5.3.2. Study of the Effect of Using the Two-Phase Structure

To evaluate the benefit of SDSIOT using the two-phase structure, this section presents the detection performance of the one-phase-based method (undivided phase and using the 13 features of SDSIOT, hereinafter referred to as “undivided method”). The undivided method classifies the flows to be detected into three classes (Normal, SQLIA FIP, and SQLIA LD) in one phase. Table 9 and Table 10 provide the SQLIA detection and SQLIA stage identification results obtained by the undivided method with different classification algorithms, respectively.
As shown in Table 9, the undivided method using KNN obtained the best accuracy and F1-score on both test sets in detecting SQLIA. Compared with the SQLIA detection result of SDSIOT (given in Table 7), the detection results of the undivided method are similar to those obtained by SDSIOT. Regarding the SQLIA stage identification performance, Table 10 shows that the results obtained by the undivided method using different classification algorithms vary greatly. Compared with the SQLIA stage identification result of SDSIOT (given in Table 8), the identification results of the undivided method on the two test sets are significantly weaker than those obtained by SDSIOT. This means that using the two-phase structure in SDSIOT is beneficial to distinguish different stages of SQLIAs.

5.3.3. Comparative Study

(1) Comparison with other SQLIA detection methods
To further evaluate the performance of SDSIOT in detecting SQLIAs, we conducted comparative experiments with several detection methods. Specifically, we selected a classical rule-based method (ModSecurity [40]) and three machine learning-based SQLIA detection methods (methods [8,9,10]) for comparison. Since these methods are based on inbound traffic, we used the inbound flows carrying the requests that correspond to the outbound flows in the training and test sets listed in Table 5. The SQLIA detection results obtained by the different methods are presented in Table 11.
From Table 11, we can observe that SDSIOT outperforms ModSecurity, method [8], method [9], and method [10] in terms of accuracy, precision, and F1-score on test set 1. In addition, among the five methods, SDSIOT exhibits the best performance on test set 2, as reflected by all five evaluation metrics. In particular, ModSecurity shows a high FNR of 15.70%, indicating poor performance in detecting SQLIAs launched by unknown SQLIA tools. This is because ModSecurity relies on rules built on known SQLIA statements for detection. SDSIOT performs better than the other four methods in detecting SQLIAs launched by unknown SQLIA tools.
In contrast, SDSIOT does not require complex and diverse SQLIA statements to detect SQLIAs. Instead, it focuses on the outbound traffic features generated by the Web server during SQLIAs, making it easier to detect deformed or new SQLIAs, regardless of how the attacker constructs the deformed SQLIA statements. Furthermore, among the five methods, only SDSIOT can identify the stages of SQLIA, making it more effective at preventing SQLIAs from causing severe damage.
(2) Comparison with an SQLIA stage identification method based on inbound traffic
In this section, we compare SDSIOT with a newly designed SQLIA stage identification method based on inbound traffic. Since there is no existing method for SQLIA stage identification, we designed this method specifically for comparison. For this method, we extracted the length of the malicious code, the keyword frequency, and the proportion of special characters in the malicious code as features. The numbers of training and test flows used in this comparison were the same as in the previous sections; the difference is that the training and test flows used for this comparison method were inbound flows. The stage identification results obtained by the inbound traffic-based SQLIA stage identification method with different classification algorithms on test sets 1 and 2 are presented in Table 12.
Based on Table 12, the inbound traffic-based SQLIA stage identification method with KNN performed the best on both test sets. However, its accuracy, precision, recall, and F1-score values were only around 75%, which is weaker than the results obtained by SDSIOT (given in Table 8). This is mainly because SDSIOT is not easily affected by SQLIA statements, which have a high degree of variability, whereas the inbound traffic-based SQLIA stage identification method is.

5.3.4. Efficiency Evaluation

In this section, we present the time required for SDSIOT to extract features, train models, and perform SQLIA detection and stage identification in our experiment. Table 13 provides the feature extraction, model training, SQLIA detection, and stage identification times of SDSIOT and the comparison methods. Note that the detection and identification times for each method given in Table 13 are averages over the two test sets. The training time for SDSIOT is the time required to train both the SQLIA detection and stage identification models.
From Table 13, it can be observed that SDSIOT requires less feature extraction, training, and detection time than methods [8,9,10]. Note that the SQLIA detection and stage identification times of SDSIOT are 1.39 s and 1.37 s, respectively, indicating that SDSIOT is highly efficient at SQLIA detection and stage identification.

5.3.5. Additional Test

To further validate the effectiveness of SDSIOT, we conducted additional tests using a publicly available dataset, the CSE-CIC-IDS dataset [41]. This dataset was released in 2017 and 2018 through the cooperation of the Communications Security Establishment (CSE) and the Canadian Institute for Cybersecurity (CIC) and includes PCAP traffic and CSV documents. In this experiment, we used a test set consisting of application-layer attack flows and normal flows. The test set included 95 SQLIA flows, 395 Brute Force attack flows, and 234 XSS attack flows, as well as 1090 normal flows, which we used to evaluate the performance of SDSIOT. Each attack flow (inbound direction) has a corresponding outbound flow. Among the 95 SQLIA flows, 50 belonged to the FIP stage and 45 to the LD stage. Given that the types of normal applications in the public dataset may differ significantly from those in our experiments, we randomly selected 500 normal flows from the CSE-CIC-IDS dataset and added them to the training set used in Section 5.3.3 to serve as the new training set. It is worth noting that the test set includes two attack types (Brute Force and XSS) that are not present in the training set, which makes the SQLIA detection task more complicated and realistic.
As shown in Section 5.3.1, SDSIOT using KNN in Phase I achieved the best performance on the SQLIA detection task. Therefore, in this section, KNN was selected to build the SQLIA detection model of SDSIOT. To cope with attack types that are not present in the training set, SDSIOT uses two thresholds (τ_1 and τ_2) to assign a test flow x_i the class label SQLIA, normal application, or the special label "Other Web Attack (OWA)". We denote the maximum similarities of x_i to the SQLIA flows and to the normal flows in the training set as S_1 and S_2, respectively. The decision rule of SDSIOT is as follows: (a) if S_1 is above τ_1 and S_2 is below τ_2, x_i is labeled SQLIA; (b) if S_1 is below τ_1 and S_2 is above τ_2, x_i is labeled normal application; (c) if S_1 is below τ_1 and S_2 is below τ_2, x_i is labeled OWA; (d) if S_1 is above τ_1 and S_2 is above τ_2, let d_1 be the difference between S_1 and τ_1 and d_2 the difference between S_2 and τ_2; then, if d_1 is greater than d_2, x_i is labeled SQLIA; otherwise, x_i is labeled normal application.
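The decision rule can be written out directly; the sketch below assumes S_1 and S_2 have already been computed as described, and its handling of exact threshold equality is our own choice since the rule does not specify it:

```python
def decide(s1, s2, tau1=0.79, tau2=0.22):
    """Map the two similarity scores to SQLIA / normal / Other Web Attack (OWA)."""
    if s1 >= tau1 and s2 < tau2:
        return "SQLIA"                       # rule (a)
    if s1 < tau1 and s2 >= tau2:
        return "normal"                      # rule (b)
    if s1 < tau1 and s2 < tau2:
        return "OWA"                         # rule (c)
    # rule (d): both above their thresholds; keep the label with the larger margin.
    return "SQLIA" if (s1 - tau1) > (s2 - tau2) else "normal"
```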
The attack detection performance of SDSIOT on the CSE-CIC-IDS dataset is presented in Table 14, where the values are obtained by setting τ_1 = 0.79 and τ_2 = 0.22. This section only provides the performance of ModSecurity as a reference because the other three methods are designed to classify SQLIA and normal applications. As shown in Table 14, SDSIOT correctly detects 79 SQLIA flows and labels 567 flows as OWA. Additionally, SDSIOT correctly identifies 1049 normal flows, resulting in an overall accuracy of 93.44% on the test set, which is better than the accuracy achieved by ModSecurity (86.05%). Moreover, SDSIOT can identify the stages of SQLIA, which is an advantage over ModSecurity. When RF is used to train the stage identification model in Phase II, SDSIOT correctly identifies the stages of 65 of the 79 SQLIA flows that were correctly detected as SQLIA. This result further confirms that SDSIOT can be extended to detection scenarios with multiple Web attacks.

6. Discussion

The experimental results show that outbound traffic can effectively differentiate between SQLIA traffic and normal traffic, as well as identify the stage of SQLIA, whether it belongs to the FIP or LD stage. In terms of SQLIA detection, compared to the SQLIA detection methods based on inbound traffic, SDSIOT stands out due to its ability to detect both known and deformed SQLIA statements. Attackers frequently create deformed SQLIA statements to evade detection and achieve their intended objectives. However, SQLIA detection methods based on known SQLIA statements frequently exhibit a high FNR in identifying these deformed statements. In contrast, SDSIOT focuses on analyzing outbound traffic, which carries the server’s response to these deformed statements. We observe that this response traffic from the victim server generally remains consistent or similar to that of the original SQLIA statement, as the attack’s purpose remains the same. By leveraging outbound traffic, SDSIOT is able to detect deformed statements derived from known SQLIA statements, regardless of how attackers have deformed them.
Regarding SQLIA stage identification, using SQLIA statements to identify the stage of SQLIA poses a challenge due to the high variability introduced by attackers constructing deformed statements. SDSIOT employs outbound traffic to identify the stage of SQLIA, as it carries the information responses from the Web server, providing more comprehensive insights into the actual state of the server compared to inbound traffic. We observe that the outbound traffic generated during the two stages of SQLIA exhibits distinct characteristics.
While we have made progress in SQLIA detection and stage identification based on outbound traffic, much work remains in terms of real-world cases. In this article, we divide the complete SQLIA process into two main stages: FIP and LD. SDSIOT can currently identify the stage of SQLIA belonging to either FIP or LD. However, both FIP and LD stages can be further divided into several stages based on the specific operations of SQLIA. Identifying these stages according to the concrete operations of the attack enables a finer-grained determination of the threat posed by the attack. In future research, we will explore methods to identify the sub-stages of SQLIA within the FIP and LD stages based on the specific operations involved. Additionally, outbound traffic depends on the execution of SQLIA statements from inbound traffic before it is generated. From an SQLIA blocking perspective, the point in time to implement attack blocking when using outbound traffic for detection slightly lags behind the corresponding point in time for inbound traffic-based detection. To address this, improving the accuracy of identifying the early stages of SQLIA is crucial to detecting and blocking SQLIA in a timely manner during the early stages of execution.

7. Conclusions

SQLIA detection and stage identification are critical for defending against SQLIAs. We find that the outbound traffic from the Web server can be utilized for both SQLIA detection and stage identification. Based on this finding, we propose SDSIOT to detect SQLIAs and identify their stages. Compared with other SQLIA detection methods based on inbound traffic, SDSIOT is distinguished by its ability to detect deformed SQLIAs and to identify the stages of SQLIA. Experimental results on the collected dataset and the CSE-CIC-IDS dataset demonstrate that SDSIOT is highly accurate at detecting SQLIAs and identifying their stages.
Future works will include two areas: (1) exploring the use of convolutional neural networks to improve the accuracy of detecting SQLIAs and identifying their stages; and (2) exploring the scalability and performance issues of our method in large-scale real network scenarios.

Author Contributions

Conceptualization, C.G. and Y.P.; methodology, H.F. and C.G.; software, H.F.; validation, H.F. and C.G.; formal analysis, C.G. and C.J.; investigation, C.J.; resources, C.J.; data curation, H.F.; writing—original draft preparation, H.F.; writing—review and editing, C.G., C.J., Y.P. and X.L.; project administration, X.L.; funding acquisition, C.G. and X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Support Program of Guizhou Province under Grant No. [2022]071, the Science and Technology Foundation of Guizhou Province under Grant No. [2017]1051, the Key Technologies R&D Program of He’nan Province under Grant No. 212102210084, and the Foundation of He’nan Educational Committee under Grant No. 18A520047.

Data Availability Statement

The data presented in this study are available upon request from the corresponding authors.

Acknowledgments

The authors thank the anonymous referees for their valuable comments and suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Jemal, I.; Haddar, M.A.; Cheikhrouhou, O.; Mahfoudhi, A. Performance evaluation of Convolutional Neural Network for web security. Comput. Commun. 2021, 175, 58–67. [Google Scholar] [CrossRef]
  2. Amouei, M.; Rezvani, M.; Fateh, M. RAT: Reinforcement-Learning-Driven and Adaptive Testing for Vulnerability Discovery in Web Application Firewalls. IEEE Trans. Dependable Secur. Comput. 2021, 19, 3371–3386. [Google Scholar] [CrossRef]
  3. van der Stock, A.; Glas, B.; Smithline, N.; Gigler, T. OWASP Top 10:2021. Available online: https://owasp.org/www-project-top-ten/ (accessed on 4 August 2022).
  4. Stiawan, D.; Bardadi, A.; Afifah, N.; Melinda, L.; Heryanto, A.; Septian, T.W.; Idris, M.Y.; Subroto, I.M.I.; Lukman; Budiarto, R. An Improved LSTM-PCA Ensemble Classifier for SQL Injection and XSS Attack Detection. Comput. Syst. Sci. Eng. 2023, 46, 1759–1774. [Google Scholar] [CrossRef]
  5. SQLMAP: Automatic SQL Injection and Database Takeover Tool. Available online: https://sqlmap.org/ (accessed on 1 August 2021).
  6. Zhao, Y.F.; Xiong, G.; He, L.T. Approach to detecting SQL injection behaviors in network environment. J. Commun. 2016, 37, 89–98. [Google Scholar]
  7. Li, Q.; Wang, F.; Wang, J.; Li, W. LSTM-Based SQL Injection Detection Method for Intelligent Transportation System. IEEE Trans. Veh. Technol. 2019, 68, 4182–4191. [Google Scholar] [CrossRef]
  8. Luo, A.; Huang, W.; Fan, W. A CNN-based Approach to the Detection of SQL Injection Attacks. In Proceedings of the 2019 IEEE/ACIS 18th International Conference on Computer and Information Science (ICIS), Beijing, China, 17–19 June 2019; pp. 320–324. [Google Scholar] [CrossRef]
  9. Tang, P.; Qiu, W.; Huang, Z.; Lian, H.; Liu, G. Detection of SQL injection based on artificial neural network. Knowl.-Based Syst. 2020, 190, 105528. [Google Scholar] [CrossRef]
  10. Guo, C.; Cai, W.; Shen, G. Research on SQL Injection Attacks Detection Method Based on the Truncated Key Payload. Netinfo Secur. 2021, 21, 43–53. [Google Scholar]
  11. Li, M.; Liu, B.; Xing, G.; Wang, X.; Wang, Z. Research on Integrated Detection of SQL Injection Behavior Based on Text Features and Traffic Features. In Proceedings of the International Conference on Computer Engineering and Networks, Xi’an, China, 16–18 October 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 755–771. [Google Scholar]
  12. Livshits, V.B.; Lam, M.S. Finding Security Vulnerabilities in Java Applications with Static Analysis. In Proceedings of the USENIX Security Symposium, Baltimore, MD, USA, 31 July–5 August 2005; Volume 14, pp. 271–286. [Google Scholar]
  13. Xie, Y.; Aiken, A. Static Detection of Security Vulnerabilities in Scripting Languages. In Proceedings of the USENIX Security Symposium, Vancouver, BC, Canada, 31 July–4 August 2006; Volume 15, pp. 179–192. [Google Scholar]
  14. Fu, X.; Lu, X.; Peltsverger, B.; Chen, S.; Qian, K.; Tao, L. A static analysis framework for detecting SQL injection vulnerabilities. In Proceedings of the 31st Annual International Computer Software and Applications Conference (COMPSAC 2007), Beijing, China, 24–27 July 2007; Volume 1, pp. 87–96. [Google Scholar]
  15. Masri, W.; Sleiman, S. SQLPIL: SQL injection prevention by input labeling. Secur. Commun. Netw. 2015, 8, 2545–2560. [Google Scholar] [CrossRef]
  16. Huang, H.C.; Zhang, Z.K.; Cheng, H.W.; Shieh, S.W. Web application security: Threats, countermeasures, and pitfalls. Computer 2017, 50, 81–85. [Google Scholar] [CrossRef]
  17. Anagandula, K.; Zavarsky, P. An analysis of effectiveness of black-box web application scanners in detection of stored SQL injection and stored XSS vulnerabilities. In Proceedings of the 2020 3rd International Conference on Data Intelligence and Security (ICDIS), South Padre Island, TX, USA, 24–26 June 2020; pp. 40–48. [Google Scholar]
  18. Gu, H.; Zhang, J.; Liu, T.; Hu, M.; Zhou, J.; Wei, T.; Chen, M. DIAVA: A Traffic-Based Framework for Detection of SQL Injection Attacks and Vulnerability Analysis of Leaked Data. IEEE Trans. Reliab. 2020, 69, 188–202. [Google Scholar] [CrossRef]
  19. Halfond, W.G.; Orso, A. AMNESIA: Analysis and monitoring for neutralizing SQL-injection attacks. In Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, Long Beach, CA, USA, 7–11 November 2005; pp. 174–183. [Google Scholar]
  20. Kamtuo, K.; Soomlek, C. Machine Learning for SQL injection prevention on server-side scripting. In Proceedings of the 2016 International Computer Science and Engineering Conference (ICSEC), Chiang Mai, Thailand, 14–17 December 2016; pp. 1–6. [Google Scholar]
  21. Choi, J.; Kim, H.; Choi, C.; Kim, P. Efficient malicious code detection using n-gram analysis and SVM. In Proceedings of the 2011 14th International Conference on Network-Based Information Systems, Tirana, Albania, 7–9 September 2011; pp. 618–621. [Google Scholar]
  22. Li, Q.; Li, W.; Wang, J.; Cheng, M. A SQL injection detection method based on adaptive deep forest. IEEE Access 2019, 7, 145385–145394. [Google Scholar] [CrossRef]
  23. Liu, M.; Li, K.; Chen, T. DeepSQLi: Deep semantic learning for testing SQL injection. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual Event, 18–22 July 2020; pp. 286–297. [Google Scholar]
  24. Li, J. Vulnerabilities Mapping based on OWASP-SANS: A Survey for Static Application Security Testing (SAST). Ann. Emerg. Technol. Comput. 2020, 4, 1–8. [Google Scholar] [CrossRef]
  25. Sahu, A.K.; Sharma, S.; Tanveer, M.; Raja, R. Internet of Things attack detection using hybrid Deep Learning Model. Comput. Commun. 2021, 176, 146–154. [Google Scholar] [CrossRef]
  26. Chen, D.; Yan, Q.; Wu, C.; Zhao, J. SQL injection attack detection and prevention techniques using deep learning. In Proceedings of the Journal of Physics: Conference Series; IOP Publishing: Bristol, UK, 2021; Volume 1757, p. 012055. [Google Scholar]
  27. Preethi, V.; Velmayil, G. Automated Phishing Website Detection Using URL Features and Machine Learning Technique. Int. J. Eng. Tech. 2016, 2, 107–115. [Google Scholar]
  28. Kumar, S.; Mahajan, R.; Kumar, N.; Khatri, S.K. A study on web application security and detecting security vulnerabilities. In Proceedings of the 2017 6th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, 20–22 September 2017; pp. 451–455. [Google Scholar]
  29. Fredj, O.B.; Cheikhrouhou, O.; Krichen, M.; Hamam, H.; Derhab, A. An OWASP top ten driven survey on web application protection methods. In Proceedings of the International Conference on Risks and Security of Internet and Systems, Paris, France, 4–6 November 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 235–252. [Google Scholar]
  30. Alghawazi, M.; Alghazzawi, D.; Alarifi, S. Detection of SQL injection attack using machine learning techniques: A systematic literature review. J. Cybersecur. Priv. 2022, 2, 764–777. [Google Scholar] [CrossRef]
  31. Marashdeh, Z.; Suwais, K.; Alia, M. A survey on SQL injection attack: Detection and challenges. In Proceedings of the 2021 International Conference on Information Technology (ICIT), Amman, Jordan, 14–15 July 2021; pp. 957–962. [Google Scholar]
  32. Wang, W.; Shang, Y.; He, Y.; Li, Y.; Liu, J. BotMark: Automated botnet detection with hybrid analysis of flow-based and graph-based traffic behaviors. Inf. Sci. 2020, 511, 284–296. [Google Scholar] [CrossRef]
  33. Kuroki, K.; Kanemoto, Y.; Aoki, K.; Noguchi, Y.; Nishigaki, M. Attack intention estimation based on syntax analysis and dynamic analysis for SQL injection. In Proceedings of the 2020 IEEE 44th Annual Computers, Software, and Applications Conference (COMPSAC), Madrid, Spain, 13–17 July 2020; pp. 1510–1515. [Google Scholar]
  34. Ping, C.; Jinshuang, W.; Lanjuan, Y.; Lin, P. SQL Injection Teaching Based on SQLi-labs. In Proceedings of the 2020 IEEE 3rd International Conference on Information Systems and Computer Aided Education (ICISCAE), Dalian, China, 27–29 September 2020; pp. 191–195. [Google Scholar]
  35. Zhu, Z.; Jia, S.; Li, J.; Qin, S.; Guo, H. SQL Injection Attack Detection Framework Based on HTTP Traffic. In Proceedings of the ACM Turing Award Celebration Conference-China (ACM TURC 2021), Hefei, China, 30 July–1 August 2021; pp. 179–185. [Google Scholar]
  36. Arora, S.; Hu, W.; Kothari, P.K. An analysis of the t-sne algorithm for data visualization. In Proceedings of the Conference on Learning Theory, Stockholm, Sweden, 6–9 July 2018; pp. 1455–1462. [Google Scholar]
  37. Lebeau, F.; Legeard, B.; Peureux, F.; Vernotte, A. Model-based vulnerability testing for web applications. In Proceedings of the 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation Workshops, Luxembourg, 18–22 March 2013; pp. 445–452. [Google Scholar]
  38. SuperSQLInjectionV1:2021. Available online: https://github.com/shack2/SuperSQLInjectionV1 (accessed on 10 August 2021).
  39. JSQL Injection. Available online: https://github.com/ron190/jsql-injection/ (accessed on 15 August 2021).
  40. ModSecurity: Open Source Web Application Firewall. Available online: http://www.modsecurity.org/ (accessed on 1 August 2022).
  41. Kilincer, I.F.; Ertam, F.; Sengur, A. Machine learning methods for cyber security intrusion detection: Datasets and comparative study. Comput. Netw. 2021, 188, 107840. [Google Scholar] [CrossRef]
Figure 1. An example of an SQLIA statement and its deformed statements to obtain the database name of a Web server.
Figure 2. The main process of a complete SQLIA.
Figure 3. Distribution of the length of malicious code for the SQL statements belonging to each of the SQLIA stages.
Figure 4. Frequency of keywords in the SQLIA statements belonging to each of the SQLIA stages.
Figure 5. Payload size of the outbound flows generated in the two stages of SQLIA.
Figure 6. Framework of SDSIOT.
Figure 7. Some statistical properties of the outbound flows during SQLIA and users' browsing websites.
Figure 8. Feature visualization results for SQLIA detection.
Figure 9. Two statistical properties of outbound flows from the Web server generated in FIP and LD stages.
Figure 10. Feature visualization results for SQLIA stage identification.
Table 1. OWASP Top 10—Injection Attack Rankings from 2013 to 2021.

| Top 10 Application Security Risks—2013 | Top 10 Application Security Risks—2017 | Top 10 Web Application Security Risks—2021 |
| --- | --- | --- |
| A1—Injection | A1—Injection | A1—Broken Access Control |
| A2—Broken Authentication and Session Management | A2—Broken Authentication | A2—Cryptographic Failures |
| A3—Cross-Site Scripting (XSS) | A3—Sensitive Data Exposure | A3—Injection |
| A4—Insecure Direct Object References | A4—XML External Entities (XXE) | A4—Insecure Design |
| A5—Security Misconfiguration | A5—Broken Access Control | A5—Security Misconfiguration |
| A6—Sensitive Data Exposure | A6—Security Misconfiguration | A6—Vulnerable and Outdated Components |
| A7—Missing Function Level Access Control | A7—Cross-Site Scripting (XSS) | A7—Identification and Authentication Failures |
| A8—Cross-Site Request Forgery (CSRF) | A8—Insecure Deserialization | A8—Software and Data Integrity Failures |
| A9—Using Components with Known Vulnerabilities | A9—Using Components with Known Vulnerabilities | A9—Security Logging and Monitoring Failures |
| A10—Unvalidated Redirects and Forwards | A10—Insufficient Logging & Monitoring | A10—Server-Side Request Forgery |
Table 2. Previous SQLIA Detection Methods.

| Class | Subclass | Advantages | Limitations | Work |
| --- | --- | --- | --- | --- |
| Traditional detection methods | Static | Detects vulnerabilities by analyzing the application's source code or bytecode without executing it. | Limited ability to detect vulnerabilities that arise from runtime behaviors or input validation issues. | Livshits et al. [12], Xie et al. [13], Xiang et al. [14] |
| Traditional detection methods | Dynamic | Does not require the analysis of the source code or the database structure. | Requires a comprehensive set of test cases to cover various attack scenarios. | Masri and Sleiman [15], Huang et al. [16], Parvez et al. [17], Gu et al. [18] |
| Traditional detection methods | Hybrid | Combines the strengths of static and dynamic analysis techniques. | High complexity in combining and synchronizing static and dynamic analysis techniques. | Halfond et al. [19] |
| Machine learning-based detection methods | Traditional machine learning | Allows for extensive feature engineering, where domain-specific features can be manually extracted from the input data. | Relies on the keywords extracted from known SQLIA statements. | Kamtuo et al. [20], Choi et al. [21], Li et al. [22], Guo et al. [10] |
| Machine learning-based detection methods | Deep learning | Deep learning can automatically learn relevant features from raw input data. | High complexity of model training. | Luo et al. [8], Li et al. [7], Tang et al. [9], Liu et al. [23] |
Table 3. Features for SQLIA detection.

| Feature | Description |
| --- | --- |
| DOF | Duration of the outbound flow |
| ILF | Interval time from the previous outbound flow of the same host |
| MIP | Minimum time interval for adjacent packets in the outbound flow |
| MAP | Maximum time interval for adjacent packets in the outbound flow |
| AIP | Average time interval of adjacent packets in the outbound flow |
| RPP | Ratio of payload packets to the total number of packets in an outbound flow |
| CNP | Number of outbound flows with the same number of packets as the current outbound flow in the previous 5 outbound flows of the same host |
| CNO | Number of outbound flows with the same number of payload packets as the current outbound flow in the previous 5 outbound flows of the same host |
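The timing-related features above can be computed directly from per-packet timestamps and payload lengths. The sketch below is our own illustration (not the authors' implementation) of DOF, MIP, MAP, AIP, and RPP for a single outbound flow; ILF, CNP, and CNO additionally require the preceding outbound flows of the same host and are omitted for brevity.

```python
# Illustrative sketch (not the authors' code): timing and packet-ratio
# features of one outbound flow, represented as a list of
# (timestamp_seconds, payload_bytes) tuples, one tuple per packet.

def flow_timing_features(packets):
    """Return DOF, MIP, MAP, AIP and RPP for a single outbound flow."""
    timestamps = [t for t, _ in packets]
    payloads = [p for _, p in packets]

    dof = max(timestamps) - min(timestamps)             # DOF: flow duration

    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mip = min(gaps) if gaps else 0.0                    # MIP: minimum inter-packet gap
    map_ = max(gaps) if gaps else 0.0                   # MAP: maximum inter-packet gap
    aip = sum(gaps) / len(gaps) if gaps else 0.0        # AIP: mean inter-packet gap

    rpp = sum(1 for p in payloads if p > 0) / len(payloads)  # RPP: payload-packet ratio

    return {"DOF": dof, "MIP": mip, "MAP": map_, "AIP": aip, "RPP": rpp}


# Example: a hypothetical three-packet outbound flow.
print(flow_timing_features([(0.000, 0), (0.012, 512), (0.030, 256)]))
```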
Table 4. Features for SQLIA stage identification.

| Feature | Description |
| --- | --- |
| APS | Ratio of the payload size to the total size in the outbound flow |
| MIS | Minimum payload size among the payload packets in the outbound flow |
| MAS | Maximum payload size among the payload packets in the outbound flow |
| MPS | Average size of packets in the outbound flow |
| CPA | Number of outbound flows with the same payload size as the current outbound flow in the previous 5 outbound flows of the same host |
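Similarly, the payload-size features in Table 4 are simple statistics over a flow's packet and payload lengths. The following sketch is again our illustration, not the paper's code; the CPA computation assumes a caller-maintained history of the payload totals of the previous outbound flows of the same host.

```python
# Illustrative sketch: payload-size features of one outbound flow.
# `packet_sizes` are total packet lengths; `payload_sizes` are the
# application-layer payload lengths of the same packets (0 for pure ACKs).

def flow_payload_features(packet_sizes, payload_sizes, prev_payload_totals):
    payload_total = sum(payload_sizes)
    nonzero = [p for p in payload_sizes if p > 0]

    aps = payload_total / sum(packet_sizes)             # APS: payload share of flow bytes
    mis = min(nonzero) if nonzero else 0                 # MIS: smallest payload
    mas = max(nonzero) if nonzero else 0                 # MAS: largest payload
    mps = sum(packet_sizes) / len(packet_sizes)          # MPS: mean packet size

    # CPA: previous 5 flows of the same host with the same payload size.
    cpa = sum(1 for total in prev_payload_totals[-5:] if total == payload_total)

    return {"APS": aps, "MIS": mis, "MAS": mas, "MPS": mps, "CPA": cpa}


# Example with a hypothetical five-flow history for the same host.
print(flow_payload_features(
    packet_sizes=[60, 1514, 1514, 820],
    payload_sizes=[0, 1460, 1460, 766],
    prev_payload_totals=[3686, 512, 3686, 128, 3686],
))
```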
Table 5. Distribution of training and test sets.

| Type | Training Set | Test Set 1 | Test Set 2 |
| --- | --- | --- | --- |
| Normal | 12,000 (Phase I), 0 (Phase II) | 4500 | 4500 |
| SQLIA FIP | 5300 | 2250 | 2250 |
| SQLIA LD | 5200 | 2250 | 2250 |
Table 6. Parameters of different classification algorithms in the experiment.

| Classifier | Parameters |
| --- | --- |
| SVM | kernel = rbf, C = 1.0 |
| KNN | K = 1, algorithm = ball_tree |
| DT | criterion = entropy, max_depth = 1 |
| RF | n_estimators = 10, max_depth = 2 |
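The paper does not name the machine learning library used to implement these classifiers; assuming scikit-learn, the settings in Table 6 would correspond roughly to the following constructors (a sketch under that assumption, not the authors' code).

```python
# Minimal sketch of the Table 6 classifier settings, assuming scikit-learn
# (the library choice is our assumption, not stated in the paper).
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    "SVM": SVC(kernel="rbf", C=1.0),
    "KNN": KNeighborsClassifier(n_neighbors=1, algorithm="ball_tree"),
    "DT": DecisionTreeClassifier(criterion="entropy", max_depth=1),
    "RF": RandomForestClassifier(n_estimators=10, max_depth=2),
}

# Each model would then be trained on the Phase I or Phase II feature vectors,
# e.g. classifiers["KNN"].fit(X_train, y_train) for hypothetical arrays X_train, y_train.
```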
Table 7. SQLIA detection results obtained by SDSIOT with different classification algorithms in Phase I.

| Test Set | Algorithm | Accuracy (%) | Precision (%) | Recall (%) | FNR (%) | F1-Score (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Test set 1 | SVM | 98.14 | 96.73 | 99.54 | 0.46 | 98.12 |
| Test set 1 | KNN | 99.24 | 98.84 | 99.64 | 0.36 | 99.24 |
| Test set 1 | DT | 98.88 | 98.33 | 99.40 | 0.60 | 98.84 |
| Test set 1 | RF | 98.42 | 97.78 | 99.30 | 0.70 | 98.53 |
| Test set 2 | SVM | 98.00 | 97.51 | 98.89 | 1.11 | 98.20 |
| Test set 2 | KNN | 98.57 | 98.04 | 99.08 | 0.92 | 98.56 |
| Test set 2 | DT | 97.94 | 96.51 | 98.39 | 1.61 | 97.44 |
| Test set 2 | RF | 97.57 | 96.48 | 98.58 | 1.42 | 97.52 |
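For reference, the metrics reported in Tables 7–12 follow their standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN); the formulas below are the conventional ones, restated here rather than quoted from the paper.

```latex
\[
\begin{aligned}
\mathrm{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN}, &
\mathrm{Precision} &= \frac{TP}{TP + FP}, &
\mathrm{Recall}    &= \frac{TP}{TP + FN}, \\
\mathrm{FNR}       &= \frac{FN}{TP + FN} = 1 - \mathrm{Recall}, &
\mathrm{F1}        &= \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.
\end{aligned}
\]
```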
Table 8. Stage identification results obtained by SDSIOT with different classification algorithms in Phase II.

| Test Set | Algorithm | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
| --- | --- | --- | --- | --- | --- |
| Test set 1 | SVM | 93.15 | 95.41 | 92.73 | 94.05 |
| Test set 1 | KNN | 93.26 | 95.35 | 92.63 | 93.97 |
| Test set 1 | DT | 94.26 | 95.62 | 92.87 | 94.22 |
| Test set 1 | RF | 93.95 | 95.06 | 92.48 | 93.75 |
| Test set 2 | SVM | 93.01 | 95.05 | 91.19 | 93.08 |
| Test set 2 | KNN | 93.94 | 95.33 | 91.13 | 93.18 |
| Test set 2 | DT | 94.01 | 95.82 | 91.91 | 93.82 |
| Test set 2 | RF | 93.02 | 94.84 | 91.88 | 93.34 |
Table 9. SQLIA detection results obtained by the Undivided Method with different classification algorithms.

| Test Set | Algorithm | Accuracy (%) | Precision (%) | Recall (%) | FNR (%) | F1-Score (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Test set 1 | SVM | 97.24 | 97.82 | 96.69 | 3.31 | 97.20 |
| Test set 1 | KNN | 98.12 | 97.23 | 98.47 | 1.53 | 97.85 |
| Test set 1 | DT | 97.02 | 97.72 | 96.13 | 3.87 | 96.98 |
| Test set 1 | RF | 96.61 | 96.89 | 96.26 | 3.74 | 96.57 |
| Test set 2 | SVM | 96.35 | 96.84 | 95.78 | 4.22 | 96.31 |
| Test set 2 | KNN | 97.75 | 96.98 | 98.41 | 1.59 | 97.69 |
| Test set 2 | DT | 96.97 | 96.23 | 97.87 | 2.13 | 97.04 |
| Test set 2 | RF | 96.95 | 96.26 | 97.57 | 2.43 | 96.91 |
Table 10. SQLIA stage identification results obtained by the Undivided Method with different classification algorithms.

| Test Set | Algorithm | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
| --- | --- | --- | --- | --- | --- |
| Test set 1 | SVM | 60.99 | 61.68 | 59.77 | 60.71 |
| Test set 1 | KNN | 91.87 | 91.13 | 92.44 | 91.78 |
| Test set 1 | DT | 69.76 | 69.89 | 69.14 | 69.52 |
| Test set 1 | RF | 71.11 | 72.52 | 69.29 | 70.87 |
| Test set 2 | SVM | 58.26 | 59.77 | 56.29 | 57.98 |
| Test set 2 | KNN | 90.34 | 88.54 | 92.01 | 90.23 |
| Test set 2 | DT | 66.77 | 64.74 | 68.38 | 66.51 |
| Test set 2 | RF | 69.84 | 70.92 | 68.31 | 69.59 |
Table 11. SQLIA detection results obtained by different detection methods on test sets 1 and 2.

| Test Set | Methods | Accuracy (%) | Precision (%) | Recall (%) | FNR (%) | F1-Score (%) | Stage Identification |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Test set 1 | ModSecurity [40] | 91.19 | 93.27 | 88.78 | 11.22 | 90.97 | |
| Test set 1 | Luo [8] | 98.17 | 96.44 | 99.26 | 0.74 | 97.83 | |
| Test set 1 | Tang [9] | 98.50 | 97.80 | 99.48 | 0.52 | 98.63 | |
| Test set 1 | Guo [10] | 98.09 | 96.86 | 98.22 | 1.78 | 97.54 | |
| Test set 1 | SDSIOT | 99.24 | 98.84 | 99.64 | 0.36 | 99.24 | |
| Test set 2 | ModSecurity [40] | 90.35 | 95.90 | 84.30 | 15.70 | 89.73 | |
| Test set 2 | Luo [8] | 95.10 | 95.02 | 94.15 | 5.85 | 94.58 | |
| Test set 2 | Tang [9] | 95.83 | 93.51 | 97.84 | 2.16 | 95.63 | |
| Test set 2 | Guo [10] | 94.10 | 96.06 | 92.11 | 7.89 | 94.04 | |
| Test set 2 | SDSIOT | 98.57 | 98.04 | 99.08 | 0.92 | 98.56 | |
Table 12. Stage identification results obtained by the SQLIA stage identification method based on inbound traffic with different classification algorithms.

| Test Set | Algorithm | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
| --- | --- | --- | --- | --- | --- |
| Test set 1 | SVM | 71.89 | 72.88 | 69.26 | 71.02 |
| Test set 1 | KNN | 76.36 | 77.11 | 75.57 | 76.33 |
| Test set 1 | DT | 72.25 | 78.33 | 68.40 | 72.29 |
| Test set 1 | RF | 70.38 | 73.45 | 68.30 | 70.78 |
| Test set 2 | SVM | 70.02 | 69.76 | 69.38 | 69.57 |
| Test set 2 | KNN | 73.18 | 74.59 | 72.68 | 73.62 |
| Test set 2 | DT | 68.17 | 68.97 | 65.45 | 67.16 |
| Test set 2 | RF | 68.72 | 70.19 | 67.76 | 68.95 |
Table 13. The times required for different methods.

| Methods | Feature Extraction Time (s) | Training Time (s) | Average Detection Time (s) | Average Stage Identification Time (s) |
| --- | --- | --- | --- | --- |
| Luo [8] | 204.69 | 3.26 | 2.32 | n/r |
| Tang [9] | 36.45 | 189.26 | 24.32 | n/r |
| Guo [10] | 181.03 | 140.23 | 14.36 | n/r |
| SDSIOT | 9.23 | 2.86 | 1.39 | 1.37 |
Table 14. Detection results obtained by different methods on the CSE-CIC-IDS dataset.

| Type | Number of Flows in the Test Data | SDSIOT Correct | SDSIOT Incorrect | ModSecurity Correct | ModSecurity Incorrect |
| --- | --- | --- | --- | --- | --- |
| Normal | 1090 | 1049 | 41 | 1003 | 87 |
| SQLIA | 95 | 79 | 16 | 55 | 40 |
| OWA | 629 | 567 | 62 | 503 | 126 |
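As a quick cross-check of Table 14 (our own arithmetic over the table's counts, not a figure reported by the paper), the overall fraction of correctly handled flows on this dataset can be recovered as follows.

```python
# Our arithmetic from the counts in Table 14: overall fraction of correctly
# classified flows on the CSE-CIC-IDS test data for each method.
counts = {
    "Normal": {"total": 1090, "sdsiot": 1049, "modsecurity": 1003},
    "SQLIA":  {"total": 95,   "sdsiot": 79,   "modsecurity": 55},
    "OWA":    {"total": 629,  "sdsiot": 567,  "modsecurity": 503},
}

total = sum(c["total"] for c in counts.values())                      # 1814 flows
sdsiot_acc = sum(c["sdsiot"] for c in counts.values()) / total        # ~0.934
modsec_acc = sum(c["modsecurity"] for c in counts.values()) / total   # ~0.861

print(f"SDSIOT: {sdsiot_acc:.1%}, ModSecurity: {modsec_acc:.1%}")
```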
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
