1. Introduction
SQL Injection (SQLI) is a critical cybersecurity threat that exploits vulnerabilities in a system’s database management by inserting malicious SQL code into queries. This type of attack can lead to severe consequences, including data breaches, system hijacking, and even complete system failure, resulting in substantial financial and reputational losses for both corporations and users. Despite extensive research on SQLI [1,2,3,4,5,6,7,8,9], recent studies, such as [10] and the OWASP Top Ten [11], indicate that SQLI continues to be one of the most prevalent attacks on web applications.
SQLI scanners, typically used to identify potential SQLI vulnerabilities in web applications, are classified into white-box, gray-box, and black-box testing methods based on the level of visibility and access to the application being tested. White-box testing requires access to the source code, gray-box testing involves partial knowledge of the system’s internal structure, and black-box testing is conducted without any knowledge of the application’s internals. Consequently, black-box testing is more adept at simulating attacks from an external attacker’s perspective. It is suitable for assessing web application vulnerabilities in real-world scenarios, especially when source code access is restricted for third-party testing.
Common SQLI black-box detection methods include heuristic detection techniques based on predefined rules and intelligent detection techniques leveraging AI. Heuristic detection techniques rely heavily on the quality of predefined rules and often suffer from low detection efficiency and accuracy. Intelligent detection techniques, on the other hand, require complex modeling for SQLI black-box detection and typically face challenges such as the lack of high-quality datasets and poor portability.
Large Language Models (LLMs) [12,13,14,15], such as those in the GPT series, have demonstrated substantial capabilities in contextual understanding and reasoning, quick adaptation to downstream tasks, and extensive knowledge storage. These attributes make LLMs a promising avenue for addressing SQLI black-box detection. Previous research, such as PentestGPT [16], shows that LLMs are already being used for penetration testing, indicating their potential in SQLI black-box detection. Yet, the full capabilities of LLMs in this specific area have not been exhaustively investigated, signaling a need for further exploration.
To further explore the potential of LLMs in SQLI black-box detection, we designed a specialized micro benchmark called SqliMicroBenchmark, based on Sqli-labs, and conducted exploratory experiments. The findings reveal that LLMs hold considerable potential to guide SQLI black-box scanners. Building on this, we developed SqliGPT, an LLM-powered SQLI black-box scanner, which leverages the advanced contextual understanding and reasoning abilities of LLMs to increase the precision and efficiency of detecting SQLI in a black-box manner.
By conducting validation tests of SqliGPT against six leading SQLI scanners from academia and industry, we demonstrated our method’s viability and effectiveness in enhancing detection efficiency and accuracy. The results indicate that SqliGPT successfully identified all 45 targets in the SqliMicroBenchmark, surpassing the performance of the other six advanced scanners, particularly against targets with insufficient defense mechanisms. Additionally, SqliGPT showed superior efficiency in its detection tasks. Although it was slightly less efficient than Arachni and SQIRL on the 27 targets that most scanners managed to detect, it exhibited the best performance on the remaining 18 targets. Moreover, our ablation studies verified the significant contributions of each module to the overall efficacy.
In this paper, we begin by reviewing non-LLM- and LLM-based black-box detection methods for SQLI. Subsequently, we state the research motivation and report an exploratory study, which included the experimental setup and evaluation processes. Next, we describe SqliGPT’s design in detail, with a particular focus on the implementation of the Strategy Selection Module and the Defense Bypass Module. The evaluation section demonstrates SqliGPT’s effectiveness and efficiency, as well as the results of the ablation study. Finally, the paper concludes with a discussion and a conclusion section.
In summary, we make the following contributions:
By evaluating LLM4Sqli-Manual, an LLM-only approach, and LLM4Sqli-Sqlmap, an approach that utilizes LLMs to guide SQLI black-box scanners, our exploratory experiment demonstrated the potential of LLMs for generating instrumental parameters to perform SQLI black-box detection, thereby enhancing the academic understanding of the applicability of LLMs for SQLI black-box detection. At the same time, it also pointed out the challenges faced by these approaches in terms of detection efficiency and bypassing insufficient defense mechanisms.
We developed and introduced SqliGPT, an innovative LLM-based SQLI black-box detection scanner. SqliGPT enhances detection efficiency by introducing the Strategy Selection Module, and it introduces the Defense Bypass Module to effectively address the challenge of bypassing insufficient defense mechanisms. By comparing SqliGPT with six other state-of-the-art SQLI scanners on SqliMicroBenchmark, we demonstrate the advantages of SqliGPT in terms of SQLI black-box detection accuracy and efficiency.
To support ongoing research in the community on the application of LLMs in SQLI black-box detection, we have made SqliMicroBenchmark and SqliGPT available as open-source resources at https://github.com/guizhiwen/SqliGPT (accessed on 11 July 2024).
2. Background and Related Work
2.1. Large Language Models
LLMs are deep learning-based natural language-processing techniques that typically comprise billions of parameters and can understand and generate human language. These models learn a wide range of properties and structures of language by pre-training on large-scale textual data, thereby demonstrating strong language understanding and generation capabilities without domain-specific training. The best-known examples include OpenAI’s GPT family, Google’s BERT [17], and other variants such as RoBERTa [18] and T5 [19].
The strength of LLMs lies in their deep semantic and contextual understanding capabilities, which enable them to handle complex tasks such as automated penetration testing and vulnerability detection. These models’ multi-task and multi-domain adaptability allow for practical reasoning and application in unseen tasks and domains. In cybersecurity, LLMs help identify security threats by understanding code structure and logic. As a result, researchers and practitioners have begun to explore the application of LLMs to SQLI black-box detection with the expectation of improving the accuracy and efficiency of detection.
2.2. Non-LLM SQLI Black-Box Detection Methods
In the initial stages of academic research, SQLI black-box scanners primarily relied on predefined rules and patterns to detect vulnerabilities through heuristic methods [20,21]. These scanners would first identify potential injection points within web applications and generate numerous inputs for testing purposes. Subsequently, they would refine their approach by monitoring the responses from the application and adjusting the inputs according to predefined rules to enhance the detection of SQLI vulnerabilities. Notable heuristic SQLI black-box detection scanners in the industry include OWASP ZAP, Burp Suite, Arachni [22], and Sqlmap [23]. However, these scanners typically suffer from inefficiency, low accuracy, and a heavy dependence on the quality of the predefined rules [24,25,26].
With advancements in AI, researchers have begun incorporating AI technologies into SQLI black-box detection to achieve better detection results. For instance, Luo et al. [27] and Sablotny et al. [28] have integrated machine learning with fuzzing, employing machine learning algorithms to guide the mutation of fuzzing seeds, thereby enhancing SQLI detection efficiency. Liu et al. [29] introduced DeepSQLi, incorporating deep learning into SQLI black-box detection, which leverages the semantic knowledge of SQLI attack payloads to boost detection performance. Moreover, several studies have explored the application of reinforcement learning in SQLI detection [1,30,31]. Wahaibi et al. [2] developed a gray-box SQLI scanner that uses reinforcement learning by conceptualizing SQLI as a game within this paradigm. While these intelligent approaches offer certain benefits, they often require intricate modeling of SQLI and are dependent on high-quality training datasets. Furthermore, the limited transferability of these methods restricts their broader application.
2.3. LLM-Based SQLI Black-Box Detection Methods
In recent years, the rapid advancement in LLMs has sparked significant interest in their intersection with cybersecurity. LLMs boast extensive general knowledge and basic reasoning skills, allowing them to adapt rapidly to a wide range of downstream tasks. Given these capabilities, researchers have begun to explore the application of LLMs in SQLI black-box detection.
Deng et al. [16] introduced PentestGPT, an LLM-based automated tool for penetration testing. Their research underscores the considerable potential of LLMs in addressing challenges related to vulnerability mining and penetration testing within web security. PentestGPT employs round-robin scheduling among three modules, relying on LLMs’ robust task planning and tool utilization capabilities to conduct automated penetration tests. Therefore, PentestGPT can perform SQLI black-box detection in targeted web applications with the help of the automated SQLI scanner Sqlmap. Happe et al. [32] evaluated the ability of LLMs to analyze vulnerabilities and suggest attack vectors during penetration testing, thereby illustrating the potential of LLMs in this domain. They concentrated on the performance of LLM Agents in penetration testing, citing automated LLM agents such as AutoGPT [33] and BabyAGI [34,35], and envisaged a system with two layers: a high-level task planning layer and a low-level attack execution layer.
However, both Deng et al. and Happe et al. focused primarily on the process of penetration testing without delving deeply into the effectiveness of LLMs in SQLI black-box detection. Although the code of PentestGPT is publicly available, it is not effective in performing SQLI black-box detection. In addition, Happe et al. did not provide an open-source tool or include experimental data in their paper, limiting the wider application of this methodology in the field of SQLI black-box detection.
3. Motivation
Common SQLI black-box detection methods, which rely on predefined rules and patterns, are highly dependent on the quality of predefined rules. Unfortunately, they often fail to encompass all potential attack scenarios, exhibiting a lack of flexibility and poor adaptability to novel or mutated SQLI. Additionally, the rigidity of these rules can lead to false positives, where legitimate queries are mistakenly identified as attacks, or false negatives, where actual attacks are not detected. This results in low detection accuracy and efficiency.
In contrast, AI-based SQLI black-box detection methods leverage a large corpus of training data to learn potential attack patterns, thus eliminating the dependence on comprehensive and high-quality predefined rules. However, these methods face significant challenges due to the lack of high-quality training sets and the complexities involved in accurate problem modeling. Insufficient training sets can result in under-trained models that fail to effectively recognize real-world attack patterns, while inadequate SQLI modeling might prevent the models from comprehensively and effectively capturing all types of attack behaviors. This can severely impact the detection system’s accuracy and increase the likelihood of false positives and false negatives. Consequently, AI-based SQLI black-box detection methods often exhibit poor portability.
LLMs have developed an extensive knowledge base by training on vast amounts of text data, enabling them to provide insightful and accurate information across a wide array of topics and domains. The emergence of LLMs reduces the dependence of AI-based approaches on high-quality datasets. By pre-training on large and diverse datasets, these models can understand complex texts and lessen the need for finely labeled data. In SQLI black-box detection, LLMs can utilize their rich linguistic understanding to better identify and understand various patterns of SQLI. Additionally, their a priori knowledge can improve detection accuracy even if the training data are insufficient or of low quality. The capability of LLMs to process and understand long-form text, along with their strong contextual comprehension and reasoning skills, allows them to excel in complex modeling and inference, delivering continuous, relevant, and meaningful judgments. Moreover, LLMs can be fine-tuned with a small set of examples to rapidly adapt to various downstream tasks, and their robust transfer learning ability ensures adaptability, providing a robust and reliable solution for SQLI black-box detection.
Despite the promising potential of LLMs in SQLI black-box detection, current research has not deeply explored their application in this domain. To address this gap, we designed an exploratory experiment to further investigate the capabilities of LLMs in SQLI black-box detection. Based on the conclusions drawn from the exploratory experiment, we developed SqliGPT, an LLM-powered SQLI black-box scanner.
4. Exploratory Study
4.1. SqliMicroBenchmark
A robust and broadly representative benchmark is required to fairly assess the capabilities of LLMs in SQLI black-box detection. However, existing benchmarks in this field [2,3,9,30,36] have several limitations. Firstly, they often lack sufficient visibility and wide recognition, affecting the acknowledgment and fairness of comparative studies. Secondly, most existing benchmarks focus on simple SQLI vulnerability detection and typically do not include more complex SQLI vulnerabilities with defense mechanisms. Consequently, these benchmarks often fall short in terms of realism.
We propose the SqliMicroBenchmark, based on Sqli-labs, a widely recognized SQLI testing environment designed with real-world scenarios in mind, providing both representativeness and practicality. As an open-source project, Sqli-labs offers standardized test environments and datasets, ensuring the comparability and reproducibility of experimental results. Furthermore, Sqli-labs encompasses a variety of SQLI vulnerability types, allowing for a comprehensive evaluation of LLMs’ SQLI detection effectiveness across different injection scenarios. It is worth noting that we excluded some Sqli-labs targets from testing. Some targets were excluded because their detection requires an extremely limited number of requests (e.g., only five), a requirement beyond the capability of current SQLI black-box detection scanners. Additionally, we did not consider targets related to stacked injections in Sqli-labs, as stacked query injections are typically used for more in-depth attacks or exploits after an injection vulnerability has been identified, such as performing database management operations. In addition, to make our benchmark more representative of real-world scenarios, we included five actual CVEs in SqliMicroBenchmark (CVE-2020-8637, CVE-2020-8638, CVE-2020-8841, CVE-2023-30605, CVE-2023-24812). We extracted relevant SQL statements from these CVEs and integrated them into our SqliMicroBenchmark. Our micro benchmark aims to evaluate SQLI black-box detection capabilities without encouraging any form of offensive behavior.
The SqliMicroBenchmark encompasses 45 targets that exhibit SQLI vulnerabilities. These targets are categorized into two levels of difficulty—basic and advanced—depending on the presence of protective mechanisms at the injection points. Advanced difficulty targets are characterized by insufficient defense mechanisms at these points. With the exception of CVE-2023-24812, the rest of the CVEs are basic difficulty targets. With the experimental evaluation on SqliMicroBenchmark, we can effectively assess the capability of LLMs in SQLI black-box detection and provide a solid foundation for future research and practical applications.
4.2. Experimental Settings
Our experiment aimed to explore the capability of LLMs in performing SQLI black-box detection. Specifically, we designed two methods: LLM4Sqli-Manual and LLM4Sqli-Sqlmap. The former investigates the ability of LLMs to perform SQLI black-box detection independently, while the latter examines LLMs’ ability to guide automated SQLI detection scanners. The workflow of LLM4Sqli-Manual is illustrated on the left side of Figure 1. Initially, a crawler retrieves the content of a web application based on a user-provided URL, passing the HTTP request–response pairs to LLMs as initial information. LLMs then generate a payload from this information and pass it to a replayer to test the web application. The results are analyzed by the user, and if no injection point is found, the results are fed back to LLMs for iterative testing.
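To make this loop concrete, the following is a minimal sketch of the LLM4Sqli-Manual iteration under assumed interfaces; the function llm_generate_payload(), the probed "id" parameter, and the error-string check are illustrative placeholders rather than the implementation used in this study.

```python
# Minimal sketch of the LLM4Sqli-Manual loop under assumed interfaces.
import requests

def llm4sqli_manual(url, initial_pairs, llm_generate_payload, max_rounds=10):
    """Iteratively ask the LLM for a payload, replay it against the target, and feed back the response."""
    history = [f"Initial HTTP request-response pairs:\n{initial_pairs}"]
    for _ in range(max_rounds):
        payload = llm_generate_payload(history)            # LLM proposes the next payload
        resp = requests.get(url, params={"id": payload})   # Replayer sends it to the web application
        if "sql syntax" in resp.text.lower():              # naive indicator of a potential injection point
            return payload                                 # candidate injection point found
        history.append(f"Payload {payload!r} returned:\n{resp.text[:500]}")
    return None
```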
The implementation of LLM4Sqli-Sqlmap is depicted on the right-hand side of Figure 1. This approach incorporates Sqlmap as a scanner, chosen for its support of command-line operations, ease of automation, comprehensive features, and wide recognition. Unlike LLM4Sqli-Manual, in LLM4Sqli-Sqlmap, LLMs generate parameters of the Sqlmap command rather than payloads. Sqlmap executes the command, and based on the results returned by Sqlmap, the program determines the success of the injection and decides the subsequent steps.
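A corresponding sketch of the LLM4Sqli-Sqlmap loop is shown below; llm_generate_sqlmap_args() is a placeholder, the success check on Sqlmap’s console output is a simplification, and only standard Sqlmap flags are assumed.

```python
# Hedged sketch of the LLM4Sqli-Sqlmap loop: the LLM emits Sqlmap arguments rather than raw payloads.
import subprocess

def llm4sqli_sqlmap(url, initial_pairs, llm_generate_sqlmap_args, max_rounds=5):
    feedback = f"Target: {url}\nInitial HTTP pairs:\n{initial_pairs}"
    for _ in range(max_rounds):
        args = llm_generate_sqlmap_args(feedback)    # e.g. ["-p", "uname", "--technique=BT"]
        proc = subprocess.run(["sqlmap", "-u", url, "--batch", *args],
                              capture_output=True, text=True)
        if "is vulnerable" in proc.stdout:           # Sqlmap reports an injectable parameter
            return proc.stdout                       # success: stop and report
        feedback = proc.stdout[-2000:]               # otherwise, feed the tail of the log back to the LLM
    return None
```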
This study selected three advanced LLMs—OpenAI’s GPT-4 and GPT-3.5, and Anthropic’s Claude-3—to evaluate their performance in SQLI black-box detection. These models were chosen due to their prominence in the research community and consistent usability. To facilitate automated testing, we interacted with these models via API calls. The criterion for determining the success of LLMs on a SqliMicroBenchmark target was their ability to accurately identify and confirm the presence of SQLI vulnerabilities in the target and successfully retrieve the database name. This criterion aims to evaluate the models’ ability to recognize SQLI vulnerabilities rather than to encourage or conduct any form of illegal attack. Additionally, we implemented the measures outlined in Appendix A to ensure the validity of the experiments.
4.3. Capability Evaluation
To examine the proficiency of LLMs in SQLI black-box detection, we assessed the performance of LLM4Sqli-Manual and LLM4Sqli-Sqlmap using three advanced LLMs (GPT-3.5, GPT-4, and Claude-3) as base models on the SqliMicroBenchmark. The results are presented in Table 1. The findings reveal that LLM4Sqli-Sqlmap detected an average of 30 out of 45 targets, whereas LLM4Sqli-Manual identified only 2 on average. This indicates that using LLMs to guide a scanner for SQLI black-box detection is significantly more effective than having LLMs manually construct payloads.
In the context of LLM4Sqli-Manual’s results, GPT-4 demonstrated the best performance by detecting 4 out of 28 targets of basic difficulty. Claude-3 followed, identifying 2 targets of basic difficulty, while GPT-3.5 failed to detect any targets of basic difficulty. None of the models succeeded in detecting any targets of advanced difficulty.
The results suggest that LLM4Sqli-Sqlmap has a pronounced advantage in detecting basic difficulty targets. Specifically, GPT-4 excelled in this mode, detecting 27 basic difficulty targets and 4 advanced difficulty targets. Both Claude-3 and GPT-3.5 showed comparable performance, each detecting 26 basic difficulty targets and 4 advanced difficulty targets. Both LLM4Sqli-Manual and LLM4Sqli-Sqlmap struggled with advanced difficulty targets. These targets incorporate defense mechanisms such as encoding, keyword filtering, or Web Application Firewalls (WAFs), necessitating deep analysis and validation to identify and bypass these defenses. Thus, the poor performance of LLMs on these complex problems is predictable. In summary, LLM4Sqli-Sqlmap exhibits significant potential in SQLI black-box detection, whereas LLM4Sqli-Manual demonstrates weak performance.
Compared to LLM4Sqli-Manual, LLM4Sqli-Sqlmap performs better in SQLI black-box detection. This superior performance is attributed to the fact that directly generating payloads to detect SQLI imposes higher demands on LLMs’ depth of knowledge and reasoning ability within the SQLI domain. In contrast, LLM4Sqli-Sqlmap employs automated SQLI scanners to execute the task, which significantly reduces these requirements. Experimental results indicate that current LLMs, such as GPT-3.5, GPT-4, and Claude-3, have not yet achieved the capability to efficiently resolve SQLI issues independently. Consequently, our research will utilize LLM4Sqli-Sqlmap as the foundational framework for further exploration.
4.4. Challenges
To explore the challenges faced when using LLMs for SQLI black-box detection, we conducted an exhaustive review of LLM4Sqli-Sqlmap’s test logs. We aimed to dissect the underlying factors preventing LLMs from effectively solving these problems. Our analysis revealed that the main challenges faced by LLMs in guiding Sqlmap to perform SQLI black-box detection can be categorized into the following two points:
Firstly, LLMs struggle to detect SQLI vulnerabilities in web applications that have insufficient defense mechanisms. Specifically, LLM4Sqli-Sqlmap exhibits notable constraints when detecting advanced difficulty targets. The resolution of such targets entails two essential steps: firstly, precise identification of the existing defense mechanisms, and secondly, the application of effective bypass methods. Regarding the identification of defense mechanisms, we observed that LLMs often suggest potential bypass tactics based on conjecture rather than conducting empirical validation. For instance, LLMs might presume the presence of some kind of defense mechanism and proceed to generate a corresponding Sqlmap command to attempt a bypass without prior confirmation. This subjective and uncertain approach significantly complicates the problem-solving process. Regarding the bypass methods for defense mechanisms, we performed a simple test in which LLMs were clearly informed about the presence of specific defense mechanisms and given detailed information about the issue. They were then asked to propose several context-based methods that could potentially succeed in bypassing the defense mechanisms. The results revealed that the bypass strategies suggested by LLMs were quite limited, failing to encompass some uncommon or specific bypass methods. This limitation likely arises from LLMs’ tendency to produce solutions that are more frequently encountered in their training data, thereby overlooking less common but potentially viable strategies.
Secondly, the detection efficiency of the Sqlmap commands generated by LLMs is low. Specifically, LLMs tend to adopt a comprehensive testing strategy, for example, using the “--level=5” parameter, which causes Sqlmap to test each parameter using various injection techniques in the scanner’s preset testing order. While this predefined test order allows for relatively efficient testing in most cases, relying solely on it may be inefficient for some specific web applications. For instance, the Referer header parameter is only vulnerable to the time-based blind technique, but it is placed late in the preset order of test parameters, and the time-based blind technique is not prioritized in the preset order of test techniques. As a result, a significant number of requests and a large amount of time are spent testing other parameters, as well as on ineffective testing of the Referer parameter with unsuitable injection techniques. Additionally, another serious problem is that “--level=5” only covers common parameters and fails to cover some uncommon but potentially injection-prone header parameters, such as X-Forwarded-For, resulting in incomplete testing.
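As an illustration of this efficiency gap (not taken from our implementation), the sketch below contrasts a blanket “--level=5” invocation with a targeted command of the kind a strategy-aware scanner could generate; the target URL is hypothetical, flag usage follows common Sqlmap conventions, and should be verified against the installed version.

```python
# Illustrative comparison: a blanket scan versus a targeted Sqlmap invocation.
import subprocess

url = "http://target.example/vuln.php?id=1"   # hypothetical target

# Exhaustive strategy typically emitted by the LLMs: all parameters, all techniques, preset order.
exhaustive = ["sqlmap", "-u", url, "--batch", "--level=5", "--risk=3"]

# Targeted strategy: prioritize the Referer header with the time-based blind technique and
# explicitly mark X-Forwarded-For, which "--level=5" alone does not cover.
targeted = ["sqlmap", "-u", url, "--batch",
            "--technique=T",                           # time-based blind only
            "--level=3", "-p", "Referer",              # level 3 enables Referer testing
            "--headers=X-Forwarded-For: 127.0.0.1*"]   # '*' marks a custom injection point

subprocess.run(targeted, capture_output=True, text=True)
```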
In summary, while LLM4Sqli-Sqlmap shows clear potential in testing benchmarks, it still faces challenges such as inefficient detection and difficulty in bypassing insufficient defense mechanisms. These issues are precisely the core objectives that our research aims to address.
5. Overview and Design
5.1. Overview
We base our work on LLM4Sqli-Sqlmap and design two modules, the Defense Bypass Module and the Strategy Selection Module, to address the two main challenges that LLM4Sqli-Sqlmap encounters. We name this enhanced framework SqliGPT, with its workflow depicted in Figure 2. SqliGPT begins with an initial input of a user-specified URL. It then employs a crawler to scrape the content of the targeted web application, and the extracted HTTP request–response pairs are passed as initial information to the Strategy Selection Module. This module identifies all possible injection points and performs preliminary validation, prioritizing the potential injection points and injection techniques deemed more likely to succeed and generating the corresponding parameters of the Sqlmap command. Subsequently, Sqlmap conducts tests on the targeted web application. If Sqlmap fails to find an injection point, the Strategy Selection Module relays the initial information and its generated list of potential injection points and techniques, ranked by likelihood, to the Defense Bypass Module. This module attempts to identify the defense mechanisms of the targeted web application and strives to compile a payload transformation script and corresponding Sqlmap execution parameters capable of successfully bypassing these defenses. Sqlmap then utilizes the payload transformation script to process the payload in subsequent tests. Once Sqlmap successfully identifies an injection point, a report is dispatched to the user.
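The overall control flow can be summarized by the following sketch; the callables (crawl, strategy_module, defense_bypass_module, run_sqlmap) are assumed abstractions of the components described above, not the actual code.

```python
# High-level sketch of the SqliGPT workflow in Figure 2, under assumed module interfaces.
def sqligpt(url, crawl, strategy_module, defense_bypass_module, run_sqlmap):
    pairs = crawl(url)                                    # HTTP request-response pairs
    ranked_points, sqlmap_args = strategy_module(pairs)   # ranked injection points + Sqlmap parameters
    report = run_sqlmap(url, sqlmap_args)
    if report.get("found_injection"):
        return report                                     # injection confirmed on the first pass
    # Otherwise, hand the initial information and ranked list to the Defense Bypass Module.
    tamper_script, bypass_args = defense_bypass_module(pairs, ranked_points)
    return run_sqlmap(url, bypass_args, tamper=tamper_script)
```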
In each module, we divide large tasks into multiple small tasks, each of which is assigned to different LLM Agents [37,38,39,40] for processing. An LLM Agent is a program centered around LLMs, designed to achieve a specific goal or complete a task set by the user. The LLMs receive feedback and opt to use predefined tools or functions to complete the task through iterative runs. By coordinating task assignments and communicating with each other, multiple LLM Agents can collaborate to accomplish more complex functions.
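The agent abstraction we rely on can be sketched as follows; the chat() placeholder stands in for an LLM API call, and the JSON tool-calling protocol is an assumption made for illustration rather than the exact agent framework used.

```python
# Minimal sketch of the LLM Agent loop described above.
import json

def run_agent(goal, tools, chat, max_steps=8):
    messages = [{"role": "system",
                 "content": (f"Goal: {goal}. Available tools: {list(tools)}. "
                             'Reply with JSON: {"tool": <name>, "args": {...}} or {"answer": <text>}.')}]
    for _ in range(max_steps):
        reply = json.loads(chat(messages))               # LLM decides the next step
        if "answer" in reply:
            return reply["answer"]                       # task finished
        result = tools[reply["tool"]](**reply["args"])   # execute the chosen predefined tool
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    return None
```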
5.2. Strategy Selection Module
The Strategy Selection Module aims to alleviate the problems of inefficient detection and the failure to cover some uncommon header parameters that may be injectable. In this module, LLM Agents extract all potential injection points from the target and perform preliminary validation of them using various injection techniques. This comprehensive approach ensures that even uncommon header parameters are not overlooked, supporting thorough security testing.
Figure 3 depicts the structure of this module, primarily comprising three LLM Agents and a Replayer: ① The Extraction Agent is designed to identify all potential SQLI points from HTTP request–response pairs, compiling them into a list. Although every parameter in the request headers could potentially be vulnerable to SQLI, testing each one exhaustively is often impractical and costly. Therefore, the Extraction Agent selectively identifies only those header parameters implicated in the responses and parameters the user can control as potential injection points. ② The Pre-validation Agent’s responsibility is to generate simple payloads to assess these potential injection points individually. It evaluates each parameter’s injection likelihood based on the received responses and ranks them according to their successful injection probability. Furthermore, this agent gauges the sensitivity of parameters to various injection techniques based on feedback and prioritizes techniques that are likely to be more successful. This strategic prioritization enhances the test effectiveness of the Sqlmap commands generated by LLMs. ③ The Parameters Generation Agent takes the sorted list from the Pre-validation Agent and generates the appropriate parameters of the Sqlmap command for the listed potential injection points and techniques.
The Strategy Selection Module enhances detection efficiency and broadens the coverage of uncommon header parameters by optimizing the generation of Sqlmap command parameters through a feedback mechanism. This module arranges injection parameters and techniques based on feedback from the targeted web application, enabling more effective and thorough detection.
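A minimal sketch of the pre-validation and ranking step follows; the probe payloads, scoring signals, and function names are illustrative assumptions, not the agents’ actual prompts or logic.

```python
# Sketch of the Pre-validation Agent's probing and ranking step under simple assumptions.
import requests

PROBES = {"error-based": "1'", "boolean-blind": "1 AND 1=1", "time-blind": "1 AND SLEEP(2)"}

def prevalidate(url, candidate_params):
    ranked = []
    for name in candidate_params:
        signals = {}
        for technique, probe in PROBES.items():
            resp = requests.get(url, params={name: probe}, timeout=10)
            # crude signals: database error strings or an unusually slow response
            signals[technique] = ("sql syntax" in resp.text.lower()
                                  or resp.elapsed.total_seconds() > 2)
        ranked.append((name, signals))
    # parameters with more positive signals are tested first by the Parameters Generation Agent
    ranked.sort(key=lambda item: sum(item[1].values()), reverse=True)
    return ranked
```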
5.3. Defense Bypass Module
The Defense Bypass Module is designed to address the challenges associated with bypassing insufficient defense mechanisms during SQLI black-box detection by LLMs. This module determines the defense mechanisms in place based on feedback from the targeted web application and uses bypass methods generated through Retrieval-Augmented Generation (RAG) [41,42] to overcome these inadequate defenses.
As illustrated in Figure 4, this module primarily comprises three LLM Agents, a Retriever, and a Replayer: ④ The primary function of the Defense Mechanism Detection Agent is to detect the presence of defense mechanisms at injection points and determine the specific mechanisms in place. This agent takes as input the initial information and the ranked list of potential injection points and techniques from the Strategy Selection Module and outputs the detected defense mechanisms. To address the issue of potentially insufficient information, this agent can generate payloads, which are then passed to the Replayer to test the web application, relying on the responses to determine the defense mechanisms. The Retriever is designed to tackle the limitation of bypass solutions proposed by LLMs. The Defense Mechanism Detection Agent forwards the detected defense mechanisms to the Retriever, which then searches an external knowledge database for corresponding bypass methods. Specifically, we store various bypass methods in this external knowledge database, tagging each with metadata indicating the specific defense mechanisms it can bypass, to enhance retrieval accuracy. ⑤ The Methods Validation Agent receives the list of bypass methods from the Retriever, along with the parameters of the detected defense mechanisms from the Defense Mechanism Detection Agent, and outputs the validated bypass methods. The Methods Validation Agent generates payloads for each bypass method and sends them to the Replayer to test the web application, validating the effectiveness of the bypass methods based on the responses. ⑥ The Parameters and Scripts Generation Agent generates payload transformation scripts according to predefined templates based on the validated bypass methods and also generates the corresponding parameters of the Sqlmap command.
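To give a sense of the output of this last agent, below is a hedged example of a payload transformation script for a simple space/keyword filter, written in the style of a Sqlmap tamper script (a Python module exposing tamper(payload, **kwargs)); the exact templates used in SqliGPT may differ.

```python
# Hedged example of a payload transformation (tamper) script for a naive space/keyword filter.

__priority__ = 1  # relative ordering when several tamper scripts are chained

def dependencies():
    pass  # no external requirements for this transformation

def tamper(payload, **kwargs):
    """Replace spaces with inline comments and mix keyword case to evade naive filters."""
    if not payload:
        return payload
    transformed = payload.replace(" ", "/**/")
    transformed = transformed.replace("SELECT", "SeLeCt").replace("UNION", "UnIoN")
    return transformed
```

Such a script would then be supplied to Sqlmap together with the generated command parameters, for example via its --tamper option.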
Considering that the LLMs used for testing are proprietary commercial models with significant costs associated with fine-tuning, we opted to employ the RAG technique to address the issue of limited defensive bypass options. RAG, a robust approach, involves LLMs initially retrieving pertinent information from an extensive document database to answer questions or generate text. This process enhances response quality. Thus, RAG facilitates the straightforward expansion of LLMs’ knowledge base by merely incorporating additional relevant information into an external repository.
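A minimal sketch of such metadata-tagged retrieval is given below; the in-memory knowledge base, the embed() placeholder, and the similarity scoring are simplifying assumptions for illustration.

```python
# Minimal sketch of the Retriever's metadata-filtered lookup over a bypass-method knowledge base.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def retrieve_bypass_methods(defense, knowledge_base, embed, top_k=3):
    """Return the bypass methods whose metadata tags match the detected defense mechanism."""
    candidates = [entry for entry in knowledge_base
                  if defense in entry["bypasses"]]      # metadata filter by defense type
    query_vec = embed(defense)
    candidates.sort(key=lambda e: cosine(query_vec, embed(e["description"])), reverse=True)
    return candidates[:top_k]
```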
To enhance the reproducibility of our experiments, particularly in segments where LLM performance is inconsistent, we employed prompt engineering techniques such as Few-Shot Learning [43,44,45] and Chain-of-Thought (CoT) [46,47,48] to further refine LLMs’ effectiveness. The Defense Bypass Module leverages feedback mechanisms to aid LLMs in recognizing the target web application’s defense mechanisms. It employs RAG technology to help devise effective strategies for bypassing these defenses, thereby augmenting LLMs’ capability to detect advanced SQLI vulnerabilities.
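For illustration, a few-shot, chain-of-thought prompt skeleton of the kind described above might look as follows; the wording and the two examples are assumptions, not the exact prompts used in SqliGPT.

```python
# Illustrative few-shot, chain-of-thought prompt skeleton for the Defense Mechanism Detection Agent.
DEFENSE_DETECTION_PROMPT = """You analyze HTTP responses for SQL injection defenses.
Think step by step before giving the final answer.

Example 1:
Request parameter: id=1' OR '1'='1
Response: page unchanged; the quote character is missing from the echoed input.
Reasoning: the single quote was stripped, so a character filter is likely present.
Defense: quote filtering

Example 2:
Request parameter: id=1 UNION SELECT 1,2
Response: HTTP 403 with a firewall block page.
Reasoning: the request was rejected before reaching the application.
Defense: WAF

Now analyze:
Request parameter: {request}
Response: {response}
Reasoning:"""
```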
6. Evaluation
In the experimental evaluation section, we provide a detailed comparison of SqliGPT with six other state-of-the-art academic and industrial SQLI scanners, including PentestGPT, SQIRL, and four industrial scanners. We focus on the following three research questions:
RQ1 (Effectiveness): How does SqliGPT compare to the other six scanners in terms of the effectiveness of SQLI black-box detection?
RQ2 (Efficiency): How does SqliGPT compare to the other six scanners in terms of efficiency in performing the detection task?
RQ3 (Ablation): How do the individual modules of SqliGPT contribute to the overall performance?
To assess effectiveness, we used the number of detected SQLI vulnerabilities as the performance metric, i.e., the number of targets that were successfully and correctly detected by each scanner on the SqliMicroBenchmark. Additionally, considering the large amount of network traffic that automated SQLI black-box scanners may generate while performing the tests and their potential burden on the system under test, we used the total number of HTTP requests sent by the scanners during the test (excluding requests generated by crawlers) as a measure of efficiency in performing the SQLI detection. By combining these metrics, we provide a quantitative and comprehensive assessment of each scanner’s performance in terms of SQLI detection. It is worth noting that although average runtime is often considered a metric of efficiency, we did not use it as a performance metric in this study because it can be affected by network conditions and LLMs API access latency.
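One way to collect the request-count metric is sketched below, assuming that all scanner traffic is routed through a local mitmproxy instance and that crawler requests carry a custom tagging header; both are assumptions of this sketch rather than our exact instrumentation.

```python
# Request-counting sketch as a mitmproxy addon. Run with: mitmproxy -s counter.py
class RequestCounter:
    def __init__(self):
        self.count = 0

    def request(self, flow):
        # Count only scanner traffic; crawler requests carry the (hypothetical) X-Phase header.
        if flow.request.headers.get("X-Phase") != "crawler":
            self.count += 1

addons = [RequestCounter()]
```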
6.1. Evaluation Settings
We comprehensively evaluated SqliGPT on SqliMicroBenchmark and compared it to other research subjects, including PentestGPT, SQIRL, and four industry-leading automated SQLI black-box scanners. PentestGPT is a representative penetration testing tool powered by LLMs that supports SQLI black-box detection of web applications. SQIRL is an advanced gray-box SQLI detection tool based on reinforcement learning and federated learning, but its operation relies on access to log information from the database system. Therefore, we provided SQIRL with the required database log information separately during our testing. Although we would have liked to compare our approach with more existing academic scanners for SQLI black-box detection, their implementations are not available [27,28,49,50]. DeepSQLi [29] references an available implementation, but the repository appears incomplete and unusable. In addition, we selected four advanced industry automated SQLI black-box scanners for comparison: Sqlmap v1.8.3.15, ZAP v2.9.0, BurpSuite Professional v2024.3.1.2, and Arachni v1.6.1.3. The scanner configuration details are described in Table A1.
We tested SqliGPT and PentestGPT using GPT-3.5, GPT-4, and Claude-3. During the testing process of PentestGPT, testers interacted with PentestGPT to clarify its tasks, but they did not contribute their insights to the interaction. Additionally, we used the measures outlined in Appendix A to ensure the validity of the tests.
6.2. Effectiveness Evaluation (RQ1)
We evaluated SqliGPT, LLM4Sqli-Sqlmap, and six other state-of-the-art scanners on SqliMicroBenchmark. The experimental results are shown in Table 2.
SqliGPT successfully detected all 45 targets in the SqliMicroBenchmark, and its performance across the three LLMs was consistent and significantly better than that of the other scanners. PentestGPT detected only 18 of the basic difficulty targets and 4 of the advanced difficulty targets. Upon analyzing the test logs of PentestGPT, we found that it focuses more on the overall planning of the penetration testing process and does not sufficiently analyze the target’s information. For example, it tends to generate uniform Sqlmap commands that do not take into account the target’s parameter delivery method (POST or GET): sqlmap -u <url> --dbs. Therefore, PentestGPT’s performance is consistent across the three LLMs despite their differences in capabilities. This suggests that within PentestGPT’s framework, the main focus of LLMs is on the penetration testing process, with insufficient attention to capturing detailed information. SQIRL detected 20 basic difficulty targets and 2 advanced difficulty targets. We believe its weaker performance is mainly due to the poor transferability of its SQLI detection capabilities, which were learned on the original training set. ZAP detected 20 basic difficulty targets and 4 advanced difficulty targets.
The results reported for BurpSuite and Arachni indicate that both tools detected 7 targets of advanced difficulty despite only identifying 21 and 23 targets of basic difficulty, respectively. This discrepancy arises because, even though some input parameters were filtered in the SqliMicroBenchmark’s targets, database error messages continued to be echoed in the HTTP responses. BurpSuite and Arachni employed relatively lenient criteria for determining injection points. These tools inferred vulnerability to injection when they detected database error information in the content of a target’s HTTP response. However, neither BurpSuite nor Arachni actually succeeded in bypassing defenses and executing the injection statement effectively. Consequently, we disregarded these targets and acknowledged that only 4 of the advanced difficulty targets were successfully detected by each tool.
Sqlmap successfully detected 27 targets of basic difficulty and 4 of advanced difficulty. These results demonstrate that SqliGPT leverages the capabilities of Sqlmap and enhances its performance. In contrast, the subpar results of PentestGPT in SQLI black-box detection with Sqlmap can be ascribed to its broader focus on the overall penetration testing process, which results in a relative deficiency in pinpointing specific vulnerabilities.
6.3. Efficiency Evaluation (RQ2)
Table 2 also shows the average number of HTTP requests sent by each scanner to successfully detect different classes of targets. Since there may be significant quantitative differences in the number of HTTP requests required to detect different targets, we categorized the 45 targets into three classes for independent comparison based on the scanners’ target completion, ensuring fairness in the evaluation. The specific target categorization is shown in Table A2, where Class-1 totals 27 targets, including 23 basic and 4 advanced difficulty targets; Class-2 consists of 5 basic difficulty targets; and Class-3 covers 13 advanced difficulty targets. On Class-1 targets, Arachni performs best in terms of the number of HTTP requests, followed by SQIRL. SqliGPT performs slightly worse than Arachni and SQIRL on Class-1 but still outperforms the other scanners. SqliGPT performs best on both Class-2 and Class-3 targets.
Overall, SqliGPT performs well in terms of efficiency in performing detection tasks. SqliGPT failed to outperform Arachni in efficiency on Class-1 targets, a situation that can be attributed to the differences in detection strategy and evaluation criteria between SqliGPT and Arachni. SqliGPT guided Sqlmap to perform exhaustive SQLI detection of the targets, ensuring comprehensive coverage of potential injection points. In contrast, Arachni employs a narrower detection strategy, which allows it to complete tests more efficiently in some cases. However, the limitation of this strategy is that it may fail to adequately cover all potential injection points, resulting in a less comprehensive test. Additionally, Arachni may use looser criteria when determining injection points, which, while helpful in quickly identifying potential vulnerabilities, may also increase false positives. Compared to SqliGPT, the advantage demonstrated by SQIRL on Class-1 targets stems from its use of a reinforcement learning-based approach to payload mutation, which enables more rapid and efficient SQLI black-box detection. However, its performance on Class-2 and Class-3 targets reveals limitations in the transferability of its detection capabilities, reflecting the difficulty of broadly applying the SQLI detection knowledge gained on the initial training set to new scenarios.
By comparing SqliGPT with Sqlmap, we observe the significant improvement in detection efficiency that LLMs provide for the SQLI black-box scanner. For Class-1 targets, the average number of HTTP requests required by SqliGPT is only 17% of that required by Sqlmap. Notably, for Class-2 targets, this ratio is further reduced to 8%. These results highlight the significant enhancements in detection efficiency that LLMs contribute to the design of SqliGPT.
6.4. Ablation Study (RQ3)
We conducted a series of ablation experiments to measure the impact of the various design choices behind SqliGPT. The comparisons included LLM4Sqli-Sqlmap; SqliGPT-Strategy, which adds only the Strategy Selection Module to LLM4Sqli-Sqlmap; and SqliGPT itself. After rigorous testing on the SqliMicroBenchmark, the performance of the three configurations was evaluated in terms of the number of SQLI vulnerability detections and HTTP requests, as shown in Table 3. Our key findings are as follows:
- (1)
In terms of efficiency, SqliGPT and SqliGPT-Strategy demonstrate superior results compared to LLM4Sqli-Sqlmap, which clearly confirms the effectiveness of the Strategy Selection Module. Additionally, we observe a slight improvement in the number of SQLI vulnerabilities detected by SqliGPT-Strategy compared to LLM4Sqli-Sqlmap for the basic difficulty targets. This improvement can be attributed to the comprehensive enumeration of potentially injectable parameters by the Strategy Selection Module, which tests them individually. This approach ensures thorough testing of suspected injection points by LLMs and reduces the risk of missing potential injection points due to the stochastic nature of LLM generation.
- (2)
In terms of effectiveness, SqliGPT stands out as a clear winner, significantly outperforming LLM4Sqli-Sqlmap and SqliGPT-Strategy, particularly in the number of detected SQLI vulnerabilities for advanced difficulty targets. Without the Defense Bypass Module, LLM4Sqli-Sqlmap and SqliGPT-Strategy are unable to accurately identify and effectively bypass defenses in advanced difficulty targets. The addition of the Defense Bypass Module equips LLMs with the ability to identify and bypass defense mechanisms, thereby increasing the number of detected SQLI vulnerabilities for advanced difficulty targets.
Table 3. Ablations of SqliGPT on the SqliMicroBenchmark.

| Scanner | LLM | Avg Requests (Class-1) | Avg Requests (Class-2) | Avg Requests (Class-3) | Basic Targets (28) | Advanced Targets (17) | Total (45) |
|---|---|---|---|---|---|---|---|
| LLM4Sqli-Sqlmap | GPT-3.5 | 609 | 6712 | – | 26 (93%) | 4 (24%) | 30 (67%) |
| LLM4Sqli-Sqlmap | GPT-4 | 517 | 8451 | – | 27 (96%) | 4 (24%) | 31 (69%) |
| LLM4Sqli-Sqlmap | Claude-3 | 504 | 11436 | – | 26 (93%) | 4 (24%) | 30 (67%) |
| SqliGPT-Strategy | GPT-3.5 | 142 | 788 | – | 28 (100%) | 4 (24%) | 32 (71%) |
| SqliGPT-Strategy | GPT-4 | 119 | 773 | – | 28 (100%) | 4 (24%) | 32 (71%) |
| SqliGPT-Strategy | Claude-3 | 112 | 731 | – | 28 (100%) | 4 (24%) | 32 (71%) |
| SqliGPT | GPT-3.5 | 127 | 761 | 1684 | 28 (100%) | 17 (100%) | 45 (100%) |
| SqliGPT | GPT-4 | 119 | 822 | 1681 | 28 (100%) | 17 (100%) | 45 (100%) |
| SqliGPT | Claude-3 | 113 | 749 | 1669 | 28 (100%) | 17 (100%) | 45 (100%) |
7. Discussion
Hallucination in LLMs: Hallucination [51] in LLMs refers to errors or inaccuracies in the text generated by LLMs, stemming from flaws in the training data, the training process, and the inference process. This phenomenon can affect the reliability of our automated SQLI black-box scanner. Our experiments used closed-source commercial models such as GPT-3.5, GPT-4, and Claude-3. Since we could not directly access or modify the training data and model parameters, we employed CoT and Self-Correcting [52] techniques in our prompts to minimize hallucinations as much as possible. At the same time, we are continuing to explore other methods to reduce hallucination and further optimize the tool’s performance.
High Computational and Financial Costs of Advanced LLMs: The high computational and financial costs associated with advanced LLMs such as GPT and Claude constitute one of the limitations in practical applications. These models require considerable computational resources for reasoning, and the financial expenditures can be substantial, particularly in large-scale concurrent processing scenarios; for large-scale applications, these costs can become prohibitive. To reduce costs and increase feasibility, we plan to actively explore the use of open-source LLMs in our subsequent work. These open-source models can be tuned and optimized to meet specific needs and significantly reduce the costs associated with their use, thus enhancing the utility of our SQLI black-box detection tool.
Challenges for LLMs in Bypassing Defense Mechanisms: LLMs face several significant challenges when attempting to bypass defense mechanisms like WAFs or keyword filters. These mechanisms might process detected malicious payloads in unique ways, such as removing or replacing keywords or redirecting requests to an error page. Consequently, LLMs struggle to accurately determine whether a keyword has been manipulated, leading to a high incidence of false positives. This issue primarily stems from two factors: firstly, SqliGPT’s inadequacy in providing enough information about targeted web applications to LLMs; secondly, the LLMs in use may lack the deep logical reasoning required for precise judgment. Addressing these challenges is a pivotal focus of our future research.
Integrating LLMs with other Automated Scanners: SqliGPT merges LLMs and Sqlmap to create an efficient, intelligent, and robust automated SQLI black-box scanner. Sqlmap was selected due to its extensive feature set, ease of integration, and broad community acceptance. However, Sqlmap does have some limitations, including inadequate support for certain database types and potential issues with missed detections or false positives in specific scenarios. Thus, our framework plans to incorporate additional automated SQLI scanners to address diverse scenarios better and meet particular requirements.
8. Conclusions
In this research, we explore the effectiveness and limitations of LLMs for SQLI black-box detection. Through assessments conducted using our developed SqliMicroBenchmark, we observed that the LLM-guided SQLI black-box scanner, LLM4Sqli-Sqlmap, demonstrated promising capabilities in SQLI detection. However, it still faced challenges such as inefficient detection and difficulty in bypassing insufficient defense mechanisms. Consequently, we developed SqliGPT, an innovative LLM-based SQLI black-box detection scanner. SqliGPT enhances LLM4Sqli-Sqlmap by integrating the Strategy Selection Module and the Defense Bypass Module to boost detection efficiency and overcome defense-related challenges.
We conducted a comparative analysis of SqliGPT against six other state-of-the-art SQLI scanners using the SqliMicroBenchmark. The results indicated that SqliGPT successfully identified all 45 targets, surpassing the performance of the other scanners. Additionally, SqliGPT demonstrated superior efficiency in detection tasks. Notably, 5 of the 45 SQL injection targets successfully identified by SqliGPT were real-world CVEs, further underscoring the practical effectiveness of SqliGPT. While it was slightly less efficient than Arachni and SQIRL on the 27 Class-1 targets, it achieved the best results on the 18 targets across Class-2 and Class-3. Overall, this study reveals the potential of LLMs in SQLI black-box detection and provides a wide scope for future research and improvement.
Future work will focus on reducing hallucinations to improve SqliGPT’s stability and strengthening the Defense Bypass Module to handle more complex insufficient defenses. We will explore integrating other scanners to enhance robustness and consider using open-source LLMs to reduce costs. Additionally, we aim to improve the efficiency of LLMs in SQL injection black-box detection for better performance.