SqliGPT: Evaluating and Utilizing Large Language Models for Automated SQL Injection Black-Box Detection

Gui, Zhiwen; Wang, Enze; Deng, Binbin; Zhang, Mingyuan; Chen, Yitao; Wei, Shengfei; Xie, Wei; Wang, Baosheng

doi:10.3390/app14166929

Open AccessArticle

SqliGPT: Evaluating and Utilizing Large Language Models for Automated SQL Injection Black-Box Detection

by

Zhiwen Gui

^*

,

Enze Wang

,

Binbin Deng

,

Mingyuan Zhang

,

Yitao Chen

,

Shengfei Wei

,

Wei Xie

and

Baosheng Wang

College of Computer, National University of Defense Technology, Changsha 410073, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(16), 6929; https://doi.org/10.3390/app14166929

Submission received: 8 July 2024 / Revised: 29 July 2024 / Accepted: 31 July 2024 / Published: 7 August 2024

(This article belongs to the Special Issue Artificial Intelligence for Cybersecurity: Latest Advances and Prospects)

Download

Browse Figures

Versions Notes

Abstract

:

SQL injection (SQLI) black-box detection, which simulates external attack scenarios, is crucial for assessing vulnerabilities in real-world web applications. However, existing black-box detection methods rely on predefined rules to cover the most common SQLI cases, lacking diversity in vulnerability detection scheduling and payload, suffering from limited efficiency and accuracy. Large Language Models (LLMs) have shown significant advancements in several domains, so we developed SqliGPT, an LLM-powered SQLI black-box scanner that leverages the advanced contextual understanding and reasoning abilities of LLMs. Our approach introduces the Strategy Selection Module to improve detection efficiency and the Defense Bypass Module to address insufficient defense mechanisms. We evaluated SqliGPT against six state-of-the-art scanners using our SqliMicroBenchmark. Our evaluation results indicate that SqliGPT successfully detected all 45 targets, outperforming other scanners, particularly on targets with insufficient defenses. Additionally, SqliGPT demonstrated excellent efficiency in executing detection tasks, slightly underperforming Arachni and SQIRL on 27 targets but besting them on the other 18 targets. This study highlights the potential of LLMs in SQLI black-box detection and demonstrates the feasibility and effectiveness of LLMs in enhancing detection efficiency and accuracy.

Keywords:

SQL injection; large language models; black-box detection; SqliGPT; detection efficiency and accuracy

1. Introduction

SQL Injection (SQLI) is a critical cybersecurity threat that exploits vulnerabilities in a system’s database management by inserting malicious SQL code into queries. This type of attack can lead to severe consequences, including data breaches, system hijacking, and even complete system failure, resulting in substantial financial and reputational losses for both corporations and users. Despite extensive research on SQLI [1,2,3,4,5,6,7,8,9], recent studies, such as [10] and the OWASP Top Ten [11], indicate that SQLI continues to be one of the most prevalent attacks on web applications.

SQLI scanners, typically used to identify potential SQLI vulnerabilities in web applications, are classified into white-box, gray-box, and black-box testing methods based on the level of visibility and access to the application being tested. White-box testing requires access to the source code, gray-box testing involves partial knowledge of the system’s internal structure, and black-box testing is conducted without any knowledge of the application’s internals. Consequently, black-box testing is more adept at simulating attacks from an external attacker’s perspective. It is suitable for assessing web application vulnerabilities in real-world scenarios, especially when source code access is restricted for third-party testing.

Common SQLI black-box detection methods include heuristic detection techniques based on predefined rules and intelligent detection techniques leveraging AI. Heuristic detection techniques rely heavily on the quality of predefined rules and often suffer from a low detection efficiency and accuracy. On the other hand, intelligent detection techniques require complex modeling for SQLI black-box detection and typically face challenges such as the lack of high-quality datasets and poor portability.

Large Language Models (LLMs) [12,13,14,15], such as those in the GPT series, have demonstrated substantial capabilities in contextual understanding and reasoning, quick adaptation to downstream tasks, and extensive knowledge storage. These attributes make LLMs a promising avenue for addressing SQLI black-box detection. Previous research, like PentestGPT [16], show that LLMs are already being used for penetration testing, indicating their potential in SQLI black-box detection. Yet, the full capabilities of LLMs in this specific area have not been exhaustively investigated, signaling a need for further exploration.

To further explore the potential of LLMs in SQLI black-box detection, we designed a specialized micro benchmark called SqliMicroBenchmark, based on Sqli-labs, and conducted exploratory experiments. The findings reveal that LLMs hold considerable potential to guide SQLI black-box scanners. Building on this, we developed SqliGPT, an LLM-powered SQLI black-box scanner, which leverages the advanced contextual understanding and reasoning abilities of LLMs to increase the precision and efficiency of detecting SQLI in a black-box manner.

By conducting validation tests of SqliGPT against six leading SQLI scanners from academia and industry, we proved our method’s viability and effectiveness in enhancing detection efficiency and accuracy. The results indicate that SqliGPT successfully identified all 40 targets in the SqliMicroBenchmark, surpassing the performance of the other six advanced scanners, particularly against targets with insufficient defense mechanisms. Additionally, SqliGPT showed superior efficiency in its detection tasks. Although it was slightly less effective than Arachni and SQIRL on 22 targets that most scanners managed to detect, it exhibited the best performance on the remaining 18 targets. Moreover, our ablation studies verified the significant contributions of each module to the overall efficacy.

In this paper, we begin by reviewing non-LLM- and LLM-based black-box detection methods for SQLI. Subsequently, we state the research motivation and report an exploratory study, which included the experimental setup and evaluation processes. Next, we describe SqliGPT’s design in detail, with a particular focus on the implementation of the Strategy Selection Module and the Defense Bypass Module. The evaluation section demonstrates SqliGPT’s effectiveness and efficiency, as well as the results of the ablation study. Finally, the paper concludes with a discussion and a conclusion section.

In summary, we make the following contributions:

By evaluating LLM4Sqli-Manual, an LLM-only approach, and LLM4Sqli-Sqlmap, an approach that utilizes LLMs to guide SQLI black-box scanners, our exploratory experiment demonstrated the potential of LLMs for generating instrumental parameters to perform SQLI black-box detection, thereby enhancing the academic understanding of the applicability of LLMs for SQLI black-box detection. At the same time, it also pointed out the challenges faced by these approaches in terms of detection efficiency and bypassing insufficient defense mechanisms.
We developed and introduced SqliGPT, an innovative LLM-based SQLI black-box detection scanner. SqliGPT enhances detection efficiency by introducing the Strategy Selection Module, and it introduces the Defense Bypass Module to effectively address the challenge of bypassing insufficient defense mechanisms. By comparing SqliGPT with six other state-of-the-art SQLI scanners on SqliMicroBenchmark, we demonstrate the advantages of SqliGPT in terms of SQLI black-box detection accuracy and efficiency.
To support ongoing research in the community on the application of LLMs in SQLI black-box detection, we have made SqliMicroBenchmark and SqliGPT available as open-source resources at https://github.com/guizhiwen/SqliGPT accessed on 11 July 2024.

2. Background and Related Work

2.1. Large Language Models

LLMs are deep learning-based natural language-processing techniques that typically consist of billion-level parameters and can understand and generate human language. These models learn a wide range of properties and structures of language by pre-training on large-scale textual data, thereby demonstrating strong language understanding and generation capabilities without domain-specific training. The best-known examples include OPENAI’s GPT family, Google’s BERT [17], and other variants such as RoBERTa [18] and T5 [19].

The strength of LLMs lies in their deep semantic and contextual understanding capabilities, which enable them to handle complex tasks such as automated penetration testing and vulnerability detection. These models’ multi-task and multi-domain adaptability allow for practical reasoning and application in unseen tasks and domains. In cybersecurity, LLMs help identify security threats by understanding code structure and logic. As a result, researchers and practitioners have begun to explore the application of LLMs to SQLI black-box detection with the expectation of improving the accuracy and efficiency of detection.

2.2. Non-LLM SQLI Black-Box Detection Methods

In the initial stages of academic research, SQLI black-box scanners primarily relied on predefined rules and patterns to detect vulnerabilities through heuristic methods [20,21]. These scanners would first identify potential injection points within web applications and generate numerous inputs for testing purposes. Subsequently, they would refine their approach by monitoring the responses from the application and adjusting the inputs according to predefined rules to enhance the detection of SQLI vulnerabilities. Notable heuristic SQLI black-box detection scanners in the industry include OWASP ZAP, Burp Suite, Arachni [22], and Sqlmap [23]. However, these scanners typically suffer from inefficiency, low accuracy, and a heavy dependence on the quality of the predefined rules [24,25,26].

With advancements in AI, researchers have begun incorporating AI technologies into SQLI black-box detection to achieve better detection results. For instance, Luo et al. [27] and Sablotny et al. [28] have integrated machine learning with fuzzing, employing machine learning algorithms to guide the mutation of fuzzing seeds, thereby enhancing SQLI detection efficiency. Liu et al. [29] introduced DeepSQLi, incorporating deep learning into SQLI black-box detection, which leverages the semantic knowledge of SQLI attack payloads to boost detection performance. Moreover, several studies have explored the application of reinforcement learning in SQLI detection [1,30,31]. Wahaibi et al. [2] developed a gray-box SQLI scanner that uses reinforcement learning by conceptualizing SQLI as a game within this paradigm. While these intelligent approaches offer certain benefits, they often require intricate modeling of SQLI and are dependent on high-quality training datasets. Furthermore, the limited transferability of these methods restricts their broader application.

2.3. LLM-Based SQLI Black-Box Detection Methods

In recent years, the rapid advancement in LLMs has sparked significant interest in their intersection with cybersecurity. LLMs boast extensive general knowledge and basic reasoning skills, allowing them to adapt rapidly to a wide range of downstream tasks. Given these capabilities, researchers have begun to explore the application of LLMs in SQLI black-box detection.

Deng et al. [16] introduced PentestGPT, an LLM-based automated tool for penetration testing. Their research underscores the considerable potential of LLMs in addressing challenges related to vulnerability mining and penetration testing within web security. PentestGPT employs round-robin scheduling among three modules, depending on LLMs’ robust task planning and tool utilization capabilities to conduct automated penetration tests. Therefore, PentestGPT can perform SQLI black-box detection in targeted web applications with the help of the automated SQLI scanner Sqlmap. Happe et al. [32] evaluated the ability of LLMs to analyze vulnerabilities and suggest attack vectors during penetration testing, thereby illustrating the potential of LLMs in this domain. They concentrated on the performance of LLM Agents in penetration testing, citing automated LLM agents such as AutoGPT [33] and BabyAGI [34,35], and envisaged a system with two layers: a high-level task planning layer and a low-level attack execution layer.

However, both Deng et al. and Happe et al. focused primarily on the process of penetration testing without delving deeply into the effectiveness of LLMs in SQLI black-box detection. Although the code of PentestGPT is publicly available, it is not effective in performing SQLI black-box detection. In addition, Happe et al. did not provide an open-source tool or include experimental data in their paper, limiting the wider application of this methodology in the field of SQLI black-box detection.

3. Motivation

Common SQLI black-box detection methods, which rely on predefined rules and patterns, are highly dependent on the quality of predefined rules. Unfortunately, they often fail to encompass all potential attack scenarios, exhibiting a lack of flexibility and poor adaptability to novel or mutated SQLI. Additionally, the rigidity of these rules can lead to false positives, where legitimate queries are mistakenly identified as attacks, or false negatives, where actual attacks are not detected. This results in low detection accuracy and efficiency.

In contrast, AI-based SQLI black-box detection methods leverage a large corpus of training data to learn potential attack patterns, thus eliminating the dependence on comprehensive and high-quality predefined rules. However, these methods face significant challenges due to the need for more high-quality training sets and the complexities involved in accurate problem modeling. Insufficient training sets can result in under-trained models that fail to effectively recognize real-world attack patterns, while inadequate SQLI modeling might prevent the models from comprehensively and effectively capturing all types of attack behaviors. This can severely impact the detection system’s accuracy and increase the likelihood of false positives and false negatives. Consequently, AI-based SQLI black-box detection methods often need better portability.

LLMs have developed an extensive knowledge base by training on vast amounts of text data, enabling them to provide insightful and accurate information across a wide array of topics and domains. The emergence of LLMs reduces the dependence of AI-based approaches on high-quality datasets. By pre-training on large and diverse datasets, these models can understand complex texts and lessen the need for finely labeled data. In SQLI black-box detection, LLMs can utilize their rich linguistic understanding to identify better and understand various patterns of SQLI. Additionally, their a priori knowledge can improve detection accuracy even if the training data are insufficient or of low quality. The capability of LLMs to process and understand long-form text, along with their strong contextual comprehension and reasoning skills, allows them to excel in complex modeling and inference, delivering continuous, relevant, and meaningful judgments. Moreover, LLMs can be fine-tuned with a small set of examples to rapidly adapt to various downstream tasks, and their robust transfer learning ability ensures adaptability, providing a robust and reliable solution for SQLI black-box detection.

Despite the promising potential of LLMs in SQLI black-box detection, current research has not deeply explored their application in this domain. To address this gap, we designed an exploratory experiment to further investigate the capabilities of LLMs in SQLI black-box detection. Based on the conclusions drawn from the exploratory experiment, we developed SqliGPT, an LLM-powered SQLI black-box scanner.

4. Exploratory Study

4.1. SqliMicroBenchmark

A robust and broadly representative benchmark is required to fairly assess the capabilities of LLMs in SQLI black-box detection. However, existing benchmarks in this field [2,3,9,30,36] have several limitations. Firstly, they often lack sufficient visibility and wide recognition, affecting the acknowledgment and fairness of comparative studies. Secondly, most existing benchmarks focus on simple SQLI vulnerability detection and typically do not include more complex SQLI vulnerabilities with defense mechanisms. Consequently, these benchmarks often fall short in terms of realism.

We propose the SqliMicroBenchmark, based on Sqli-labs, a widely recognized SQLI testing range designed with real-world scenarios in mind, providing both representativity and practicality. As an open-source project, Sqli-labs offers standardized test environments and datasets, ensuring the comparability and reproducibility of experimental results. Furthermore, Sqli-labs encompasses a variety of SQLI vulnerability types, allowing for a comprehensive evaluation of LLMs’ SQLI detection effectiveness across different injection scenarios. It is worth noting that we excluded some targets in Sqli-labs for testing purposes. Some targets were excluded because their detection requires an extremely limited number of requests (e.g., only five), a requirement beyond the capability of current SQLI black-box detection scanners. Additionally, we did not consider targets related to stacked injections in Sqli-labs, as stacked query injections are typically used for more in-depth attacks or exploits after an injection vulnerability has been identified, such as performing database management operations. In addition, to make our benchmarks more representative of real-world scenarios, we included five actual CVEs in SqliMicroBenchmark (CVE-2020-8637, CVE-2020-8638, CVE-2020-8841, CVE-2023-30605, CVE-2023-24812). We extracted relevant SQL statements from these CVEs and integrated them into our SqliMicroBenchmark. Our micro benchmark aims to evaluate SQLI black-box detection capabilities without encouraging any form of offensive behavior.

The SqliMicroBenchmark encompasses 45 targets that exhibit SQLI vulnerabilities. These targets are categorized into two levels of difficulty—basic and advanced—depending on the presence of protective mechanisms at the injection points. Insufficient defense mechanisms at these points characterize the advanced difficulty targets. With the exception of CVE-2023-24812, the rest of the CVEs are basic difficulty targets. With the experimental evaluation on SqliMicroBenchmark, we can effectively assess the capability of LLMs in SQLI black-box detection and we provide a solid foundation for future research and practical applications.

4.2. Experimental Settings

Our experiment aimed to explore the capability of LLMs in performing SQLI black-box detection. Specifically, we designed two methods: LLM4Sqli-Manual and LLM4Sqli-Sqlmap. The former investigates the ability of LLMs to perform SQLI black-box detection independently, while the latter examines LLMs’ ability to guide automated SQLI detection scanners. The workflow of LLM4Sqli-Manual is illustrated on the left side of Figure 1. Initially, a crawler retrieves the content of a web application based on a user-provided URL, passing the HTTP request–response pairs to LLMs as initial information. LLMs then generate a payload from this information and pass it to a replayer to test the web application. The results are analyzed by the user, and if no injection point is found, the results are fed back to LLMs for iterative testing.

The implementation of LLM4Sqli-Sqlmap is depicted on the right-hand side of Figure 1. This approach incorporates Sqlmap as a scanner, chosen for its support of command-line operations, ease of automation, comprehensive features, and wide recognition. Unlike LLM4Sqli-Manual, in LLM4Sqli-Sqlmap, LLMs generate parameters of the Sqlmap command rather than payloads. Sqlmap executes the command, and based on the results returned by Sqlmap, the program determines the success of the injection and decides the subsequent steps.

This study selected three advanced LLMs—OPENAI’s GPT-4, GPT-3.5, and AnthropicAI’s Claude-3—to evaluate their performance in SQLI black-box detection. These models were chosen due to their prominence in the research community and consistent usability. To facilitate automated testing, we interacted with these models via API calls. The criterion for determining the success of LLMs on the SqliMicroBenchmark target was its ability to accurately identify and confirm the presence of SQLI vulnerabilities in the target and successfully retrieve the database name. This criterion aims to evaluate the models’ ability to recognize SQLI vulnerabilities rather than to encourage or conduct any form of illegal attack. Additionally, we implemented the measures outlined in Appendix A to ensure the validity of the experiments.

4.3. Capability Evaluation

To examine the proficiency of LLMs in SQLI black-box detection, we assessed the performance of LLM4Sqli-Manual and LLM4Sqli-Sqlmap using three advanced LLMs (GPT-3.5, GPT-4, and Claude-3) as base models on the SqliMicroBenchmark. The results are presented in Table 1. The findings reveal that LLM4Sqli-Sqlmap detected an average of 30 out of 45 targets, whereas LLM4Sqli-Manual identified only 2 on average. This indicates that LLMs guiding scanners for SQLI black-box detection are significantly more effective than those manually constructing payloads for SQLI black-box detection.

In the context of LLM4Sqli-Manual’s results, GPT-4 demonstrated the best performance by detecting 4 out of 28 targets of basic difficulty. Claude-3 followed, identifying two targets of basic difficulty, while GPT-3.5 failed to detect any targets of basic difficulty. None of the models succeeded in detecting any targets of advanced difficulty.

The results suggest that LLM4Sqli-Sqlmap has a pronounced advantage in detecting basic difficulty targets. Specifically, GPT-4 excelled in this mode, detecting 27 basic difficulty targets and 4 advanced difficulty targets. Both Claude-3 and GPT-3.5 showed comparable performance, each detecting 26 basic difficulty targets and 4 advanced difficulty targets. Both LLM4Sqli-Manual and LLM4Sqli-Sqlmap struggled with advanced difficulty targets. These targets incorporate defense mechanisms such as encoding, keyword filtering, or Web Application Firewalls (WAFs), necessitating deep analysis and validation to identify and bypass these defenses. Thus, the poor performance of LLMs on these complex problems is predictable. In summary, LLM4Sqli-Sqlmap exhibits significant potential in SQLI black-box detection, whereas LLM4Sqli-Manual demonstrates weak performance.

Compared to LLM4Sqli-Manual, LLM4Sqli-Sqlmap performs better in SQLI black-box detection. This superior performance is attributed to the fact that directly generating payloads to detect SQLI imposes higher demands on LLMs’ depth of knowledge and reasoning ability within the SQLI domain. In contrast, LLM4Sqli-Sqlmap employs automated SQLI scanners to execute the task, which significantly reduces these requirements. Experimental results indicate that current LLMs, such as GPT-3.5, GPT-4, and Claude-3, have not yet achieved the capability to efficiently resolve SQLI issues independently. Consequently, our research will utilize LLM4Sqli-Sqlmap as the foundational framework for further exploration.

4.4. Challenges

To explore the challenges faced when using LLMs for SQLI black-box detection, we conducted an exhaustive review of LLM4Sqli-Sqlmap’s test logs. We aimed to dissect the underlying factors preventing LLMs from effectively solving these problems. Our analysis revealed that the main challenges faced by LLMs in guiding Sqlmap to perform SQLI black-box detection can be categorized into the following two points:

Firstly, LLMs struggle to detect SQLI vulnerabilities in web applications that have insufficient defense mechanisms. Specifically, LLM4Sqli-Sqlmap exhibits notable constraints when detecting advanced difficulty targets. The resolution of such targets entails two essential steps: firstly, precise identification of existing defense mechanisms, and secondly, the effective bypass methods. Regarding identifying defense mechanisms, it is noted that LLMs often suggest potential bypass tactics based on conjecture rather than conducting empirical validations. For instance, LLMs might presume the presence of some kind of defense mechanisms and proceed to generate a corresponding Sqlmap command to attempt a bypass without prior confirmation. This subjective and uncertain approach significantly complicates the problem-solving process. Regarding the bypass methods of defense mechanisms, a simple test was performed where LLMs were clearly informed about the presence of specific defense mechanisms and given detailed information about the issue. They were then asked to propose several context-based methods that could potentially succeed in bypassing the defense mechanisms. The results revealed that the bypass strategies suggested by LLMs were quite limited, failing to encompass some uncommon or specific bypass methods. This limitation likely arises from LLMs’ tendency to produce solutions that are more frequently encountered in their training data, thereby overlooking less common but potentially viable strategies.

Secondly, the detection efficiency of the Sqlmap commands generated by LLMs is low. Specifically, LLMs tend to adopt a comprehensive testing strategy, for example, using the “–level = 5” parameter, which causes Sqlmap to test each parameter using various injection techniques in the scanner’s preset testing order. While this predefined test order allows for relatively efficient testing in most cases, it may not be efficient for some specific web applications to rely solely on this predefined order. For instance, the header parameter Referer is only vulnerable to the time-blind technique, but it is placed later in the preset order of test parameters, while the time-blind technique is not prioritized in the preset order of test techniques. This results in the need to spend a significant number of requests and time testing other parameters, as well as ineffective testing of the Referer parameter using unsuitable injection techniques when performing tests in LLMs. Additionally, another serious problem is that the “–level = 5” can only cover common parameters and fails to cover some uncommon but potentially injection-risky header parameters, such as X-Forwarded-For, resulting in incomplete testing.

In summary, while LLM4Sqli-Sqlmap shows clear potential in testing benchmarks, it still faces challenges such as inefficient detection and difficulty in bypassing insufficient defense mechanisms. These issues are precisely the core objectives that our research aims to address.

5. Overview and Design

5.1. Overview

We base our work on LLM4Sqli-Sqlmap and design two modules, the Defense Bypass Module and the Strategy Selection Module, to address the two main challenges that LLM4Sqli-Sqlmap encounter. We name this enhanced framework SqliGPT, with its workflow depicted in Figure 2. SqliGPT begins with an initial input of a user-specified URL. It then employs a crawler to scrape the content of the targeted web application, and the extracted HTTP request–response pairs are passed as initial information to the Strategy Selection Module. This module identifies all possible injection points and performs preliminary validation, prioritizing testing potential injection points and injection techniques deemed more likely and generating corresponding parameters of the Sqlmap command. Subsequently, Sqlmap conducts tests on the targeted web application. If Sqlmap fails to find an injection point, the Strategy Selection Module relays the initial information and its generated list of potential injection points and techniques, ranked by likelihood, to the Defense Bypass Module. This module attempts to identify the defense mechanisms of the targeted web application and strives to compile a payload transformation script and corresponding Sqlmap execution parameters capable of successfully bypassing these defenses. Sqlmap then utilizes the payload transformation script to process the payload in subsequent tests. Upon successfully identifying an injection point by Sqlmap, a report is dispatched to the user.

In each module, we divide large tasks into multiple small tasks, each of which is assigned to different LLM Agents [37,38,39,40] for processing. An LLM Agent is a program centered around LLMs, designed to achieve a specific goal or complete a task set by the user. The LLMs receive feedback and opt to use predefined tools or functions to complete the task through iterative runs. By coordinating task assignments and communicating with each other, multiple LLM Agents can collaborate to accomplish more complex functions.

5.2. Strategy Selection Module

The Strategy Selection Module aims to alleviate the problems of inefficient detection and the inability to cover some uncommon header parameters that may be injected. In this module, LLM Agents extract all potential injection points from the target and perform preliminary validation of them using various injection techniques. This comprehensive approach ensures that even the most uncommon header parameters are not overlooked, instilling confidence in the thoroughness of your security testing.

Figure 3 depicts the structure of this module, primarily comprising three LLM Agents and a Replayer: ① The Extraction Agent is designed to identify all potential SQLI points from HTTP request–response pairs, compiling them into a list. Although every parameter in the request headers could potentially be vulnerable to SQLI, testing each one exhaustively is often impractical and costly. Therefore, the Extraction Agent selectively identifies only those header parameters implicated in the responses and parameters the user can control as potential injection points. ② The Pre-validation Agent’s responsibility is to generate simple payloads to assess these potential injection points individually. It evaluates each parameter’s injection likelihood based on the received responses and ranks them according to their successful injection probability.

Furthermore, this agent gauges the sensitivity of parameters to various injection techniques based on feedback and prioritizes techniques that are likely to be more successful. This strategic prioritization enhances the test effectiveness of the Sqlmap commands generated by LLMs. ③ The Parameters Generation Agent takes the sorted list from the Pre-validation Agent and generates the appropriate parameters of the Sqlmap command for the listed potential injection points and techniques.

The Strategy Selection Module enhances detection efficiency and broadens the coverage of uncommon header parameters by optimizing the generation of Sqlmap command parameters through a feedback mechanism. This module arranges injection parameters and techniques based on feedback from the targeted web application, enabling more effective and thorough detection.

5.3. Defense Bypass Module

The Defense Bypass Module is designed to address the challenges associated with bypassing insufficient defense mechanisms during SQLI black-box detection by LLMs. This module employs a defense mechanism determination based on feedback from the targeted web application. It uses a bypass method generated through Retrieval-Augmented Generation (RAG) [41,42] to overcome these inadequate defenses.

As illustrated in Figure 4, this module primarily comprises three LLM Agents, a Retriever, and a Replayer: ④ The primary function of the Defense Mechanism Detection Agent is to identify the presence of defense mechanisms at injection points and specify the specific mechanisms in place. This agent takes as input the initial information and the ranked list of potential injection points and techniques from the Strategy Selection Module and outputs the detected defense mechanisms. To address the issue of potentially insufficient information, this agent can generate payloads, which are then passed to the Replayer to test the web application, relying on the responses to determine the defense mechanisms. The Retriever is designed to tackle the limitation of bypass solutions proposed by LLMs. The Defense Mechanism Detection Agent forwards the detected defense mechanisms to the Retriever, which then searches an external knowledge database for corresponding bypass methods. Specifically, we store various bypass methods in this external knowledge database, tagging each with metadata to enhance retrieval accuracy, indicating the specific defense mechanisms that each method can bypass. ⑤ The Methods Validation Agent receives the list of bypass methods from the Retriever, along with the parameters of the detected defense mechanisms from the Defense Mechanism Detection Agent, and outputs the validated bypass methods. The Methods Validation Agent generates the payloads for each bypass method and sends them to the Replayer to test the web application, validating the effectiveness of the bypass methods based on the responses. ⑥ The Parameters and Scripts Generation Agent generates payload transformation scripts according to predefined templates based on validated bypass methods and also generates the corresponding parameters of the Sqlmap command.

Considering that LLMs used for testing are proprietary commercial models with significant costs associated with fine-tuning, we opted to employ the RAG technique to address the issue of limited defensive bypass options. RAG, a robust approach, involves LLMs initially retrieving pertinent information from an extensive document database to answer questions or generate text. This process enhances response quality. Thus, RAG facilitates the straightforward expansion of LLMs’ knowledge base by merely incorporating additional relevant information into an external repository.

To enhance the reproducibility of our experiments, particularly in segments where LLMs performance is inconsistent, we employed cue engineering techniques like Few-Shot Learning [43,44,45] and Chain-of-Thought (CoT) [46,47,48] to refine LLMs’ effectiveness further. The Defense Bypass Module leverages feedback mechanisms to aid LLMs in recognizing the target web application’s defense mechanisms. It employs RAG technology to help devise effective strategies for bypassing these defenses, thereby augmenting LLMs’ capability to detect advanced SQLI vulnerability.

6. Evaluation

In the experimental evaluation section, we provide a detailed comparison of SqliGPT with six other state-of-the-art academic and industrial SQLI scanners, including PentestGPT, SQIRL, and four industrial scanners. We focus on the following three research questions:

RQ1 (Effectiveness):

How does SqliGPT compare to the other six scanners in terms of the effectiveness of SQLI black-box detection?

RQ2 (Efficiency):

How does SqliGPT compare to the other six scanners in terms of efficiency in performing the detection task?

RQ3 (Ablation):

How do the individual modules of SqliGPT contribute to the overall performance?

To assess effectiveness, we used the number of detected SQLI vulnerabilities as the performance metric, i.e., the number of targets that were successfully and correctly detected by each scanner on the SqliMicroBenchmark. Additionally, considering the large amount of network traffic that automated SQLI black-box scanners may generate while performing the tests and their potential burden on the system under test, we used the total number of HTTP requests sent by the scanners during the test (excluding requests generated by crawlers) as a measure of efficiency in performing the SQLI detection. By combining these metrics, we provide a quantitative and comprehensive assessment of each scanner’s performance in terms of SQLI detection. It is worth noting that although average runtime is often considered a metric of efficiency, we did not use it as a performance metric in this study because it can be affected by network conditions and LLMs API access latency.

6.1. Evaluation Settings

We comprehensively evaluated SqliGPT on SqliMicroBenchmark and compared it to other research subjects, including PentestGPT, SQIRL, and four industry-leading automated SQLI black-box scanners. PentestGPT is a representative penetration testing tool powered by LLMs that supports SQLI black-box detection of web applications. SQIRL is an advanced gray-box SQLI detection tool based on reinforcement learning and federated learning, but the tool’s operation relies on access to log information from the database system. Therefore, we provided SQIRL with the required database log information separately during our testing. Although we would like to compare ours with more existing academic scanners for SQLI black-box detection, these implementations are not available [27,28,49,50]. DeepSQLi [29] references an available implementation, but the repository seems incomplete and unusable. In addition, we selected four advanced industry automated SQLI black-box scanners for comparison, including Sqlmap v1.8.3.15, ZAP V2.9.0, BurpSuite Professional v2024.3.1.2, and Arachni v1.6.1.3. The scanner configuration details are described in Table A1.

We tested SqliGPT and PentestGPT using GPT-3.5, GPT-4, and Claude-3. During the testing process of PentestGPT, testers interacted with PentestGPT to clarify its tasks, but they did not contribute their insights to the interaction. Additionally, we used the measures outlined in Appendix A to ensure the validity of the tests.

6.2. Effectiveness Evaluation (RQ1)

We evaluated SqliGPT, LLM4Sqli-Sqlmap, and six other state-of-the-art scanners on SqliMicroBenchmark. The experimental results are shown in Table 2. SqliGPT successfully detected all 45 targets in the SqliMicroBenchmark, and its performance across the three LLMs was consistent and significantly better than that of the other scanners.

PentestGPT detected only 18 of the basic difficulty targets and 4 of the advanced difficulty targets. Upon analyzing the test logs of PentestGPT, we found that it focuses more on the overall planning of the penetration testing process and does not sufficiently analyze the target’s information. For example, it tends to generate uniform Sqlmap commands that do not take into account the target parameter delivery method (POST or GET): sqlmap -u <url> –dbs. Therefore, PentestGPT’s performance is consistent across the three LLMs despite their differences in capabilities. This suggests that within PentestGPT’s framework, the main focus of LLMs is on the penetration testing process, with insufficient attention to capturing detailed information. SQIRL detected 20 basic difficulty targets and 2 advanced difficulty target. We believe its weaker performance is mainly due to the poor transferability of its SQLI detection capabilities, which were learned in the original training set. ZAP detected 20 basic difficulty targets and 4 advanced difficulty targets.

The results reported for BurpSuite and Arachni indicate that both tools detected 7 targets of advanced difficulty despite only identifying 21 and 23 targets of basic difficulty, respectively. This discrepancy arises because, even though some input parameters were filtered in the SqliMicroBenchmark’s targets, database error messages continued to be echoed in the HTTP responses. BurpSuite and Arachni employed relatively lenient criteria for determining injection points. These tools inferred vulnerability to injection when they detected database error information in the content of a target’s HTTP response. However, neither BurpSuite nor Arachni actually succeeded in bypassing defenses and executing the injection statement effectively. Consequently, we disregarded these targets and acknowledged that only 4 of the advanced difficulty targets were successfully detected by each tool.

Sqlmap successfully detected 27 targets of basic difficulty and 4 of advanced difficulty. These results demonstrate that SqliGPT leverages the capabilities of Sqlmap and enhances its performance. In contrast, the subpar results of PentestGPT in SQLI black-box detection with Sqlmap can be ascribed to its broader focus on the overall penetration testing process, which results in a relative deficiency in pinpointing specific vulnerabilities.

6.3. Efficiency Evaluation (RQ2)

Table 2 also shows the average number of HTTP requests sent by each scanner to successfully detect different classes of targets. Since there may be significant quantitative differences in the number of HTTP requests required to detect different targets, we categorized the 45 targets into three classes for independent comparison based on the scanners’ target completion, ensuring fairness in the evaluation. The specific target categorization is shown in Table A2, where Class-1 totals 22 targets, including 23 basic and 4 advanced difficulty targets; Class-2 consists of 5 basic difficulty targets; and Class-3 covers 13 advanced difficulty targets. On Class-1 targets, Arachni performs best in terms of the number of HTTP requests, followed by SQIRL. SqliGPT performs slightly worse than Arachni and SQIRL on Class-1 but still outperforms the other scanners. SqliGPT performs best on both Class-2 and Class-3 targets. Overall, SqliGPT performs well in terms of efficiency in performing detection tasks.

SqliGPT failed to outperform Arachni in the detection performance of Class-1 targets, a situation that can be attributed to the differences in detection strategy and evaluation criteria between SqliGPT and Arachni. SqliGPT guided Sqlmap to perform exhaustive SQLI detection of the targets, ensuring comprehensive coverage of potential injection points. In contrast, Arachni employs a narrower detection strategy, which allows it to complete tests more efficiently in some cases. However, the limitation of this strategy is that it may need to adequately cover all potential injection points, resulting in a less comprehensive test. Additionally, Arachni may use looser criteria when determining injection points, which, while helpful in quickly identifying potential vulnerabilities, may also increase false positives. Compared to SqliGPT, the advantages demonstrated by SQIRL on Class-1 targets stem from its use of a reinforcement learning-based approach to payload mutation, which enables more rapid and efficient SQLI black-box detection. However, its performance on Class-2 and Class-3 targets reveals limitations in the transferability of its detection capabilities, reflecting the difficulty of broadly applying its SQLI detection knowledge gained on the initial training set to new scenarios.

By comparing SqliGPT with Sqlmap, we observe the significant improvement in detection efficiency that LLMs provide for the SQLI black-box scanner. For Class-1 targets, the average number of HTTP requests required by SqliGPT is only 17% of that required by Sqlmap. Notably, for Class-2 targets, this ratio is further reduced to 8%. These results highlight the significant enhancements in detection efficiency that LLMs contribute to the design of SqliGPT.

6.4. Ablation Study (RQ3)

We conducted a series of ablation experiments to measure the impact of the various design choices behind SqliGPT. The comparisons included LLM4Sqli-Sqlmap, SqliGPT-Strategy, which adds only the Strategy Selection Module to LLM4Sqli-Sqlmap, and SqliGPT itself. After rigorous testing on the SqliMicroBenchmark, the performance of the three configurations was evaluated in terms of the number of SQLI vulnerability detections and HTTP requests, as shown in Table 3. Our key findings are as follows:

(1): In terms of efficiency, SqliGPT and SqliGPT-Strategy demonstrate superior results compared to LLM4Sqli-Sqlmap, which clearly confirms the effectiveness of the Strategy Selection Module. Additionally, we observe a slight improvement in the number of detected SQLI vulnerabilities of SqliGPT-Strategy compared to LLM4Sqli-Sqlmap for the base difficulty targets. This improvement can be attributed to the comprehensive enumeration of potentially injectable parameters by the Strategy Selection Module, which tests them individually. This approach ensures thorough testing of suspected injection points by LLMs and reduces the risk of missing potential injection points due to the random nature of LLMs generation.
(2): In terms of effectiveness, SqliGPT stands out as a clear winner, significantly outperforming LLM4Sqli-Sqlmap and SqliGPT-Strategy, particularly in the number of detected SQLI vulnerabilities for advanced difficulty targets. Without the Defense Bypass Module, LLM4Sqli-Sqlmap and SqliGPT-Strategy are unable to accurately identify and effectively bypass defenses in advanced difficulty targets. The addition of the Defense Bypass Module equips LLMs with the ability to identify and bypass defense mechanisms, thereby increasing the number of detected SQLI vulnerabilities for advanced difficulty targets.

Table 3. Ablations of SqliGPT on the SqliMicroBenchmark.

Scanner		Avg Requests per Successful Target			Basic Targets (28)	Advanced Targets (17)	Total (45)
Scanner		Class-1	Class-2	Class-3	Basic Targets (28)	Advanced Targets (17)	Total (45)
	GPT-3.5	609	6712	–	26 (93%)	4 (24%)	30 (67%)
LLM4Sqli-Sqlmap	GPT-4	517	8451	–	27 (96%)	4 (24%)	31 (69%)
	Claude-3	504	11436	–	26 (93%)	4 (24%)	30 (67%)
	GPT-3.5	142	788	–	28 (100%)	4 (24%)	32 (71%)
SqliGPT-Strategy	GPT-4	119	773	–	28 (100%)	4 (24%)	32 (71%)
	Claude-3	112	731	–	28 (100%)	4 (24%)	32 (71%)
	GPT-3.5	127	761	1684	28 (100%)	17 (100%)	45 (100%)
SqliGPT	GPT-4	119	822	1681	28 (100%)	17 (100%)	45 (100%)
	Claude-3	113	749	1669	28 (100%)	17 (100%)	45 (100%)

The best performance for each metric is highlighted in bold font.

7. Discussion

Hallucination in LLMs: Hallucination [51] in LLMs refers to errors or inaccuracies in the text generated by LLMs, stemming from flaws in the training data, the training process, and the inference process. This phenomenon can affect the reliability of our automated SQLI black-box scanner. Our experiments used closed-source commercial models such as GPT-3.5, GPT-4, and Claude-3. Since we could not directly access or modify the training data and model parameters, we employed CoT and Self-Correcting [52] techniques at the prompts to minimize hallucinations as much as possible. At the same time, we are constantly exploring other methods to reduce hallucination and further optimize the tool’s performance.

High Computational and Financial Costs of advanced LLMs: The high computational and financial costs associated with advanced LLMs such as GPT and Claude constitute one of the limitations in practical applications. These models require considerable computational resources for reasoning, and the financial expenditures can be noticeable, particularly in large-scale concurrent processing scenarios. For large-scale applications, these costs may need to be revised. To reduce costs and increase feasibility, we plan to actively explore the use of open-source LLMs in our subsequent work. These open-source models can be tuned and optimized to meet specific needs and significantly reduce the costs associated with their use, thus enhancing the utility of our SQLI black-box detection tool.

Challenges for LLMs in Bypassing Defense Mechanisms: LLMs face several significant challenges when attempting to bypass defense mechanisms like WAFs or keyword filters. These mechanisms might process detected malicious payloads in unique ways, such as removing, replacing keywords, or redirecting requests to an error page. Consequently, LLMs struggle to accurately determine if a keyword has been manipulated, leading to a high incidence of false positives. This issue primarily stems from two factors: firstly, SqliGPT’s inadequacy in providing enough information related to targeted web applications to LLMs; secondly, LLMs in use may lack the deep logical reasoning required for precise judgment. Addressing these challenges is a pivotal focus of our future research.

Integrating LLMs with other Automated Scanners: SqliGPT merges LLMs and Sqlmap to create an efficient, intelligent, and robust automated SQLI black-box scanner. Sqlmap was selected due to its extensive feature set, ease of integration, and broad community acceptance. However, Sqlmap does have some limitations, including inadequate support for certain database types and potential issues with missed detections or false positives in specific scenarios. Thus, our framework plans to incorporate additional automated SQLI scanners to address diverse scenarios better and meet particular requirements.

8. Conclusions

In this research, we explore the effectiveness and limitations of LLM for SQLI black-box detection. Through assessments conducted using our developed SqliMicroBenchmark, it was observed that the LLMs-guided SQLI black-box scanner, LLM4Sqli-Sqlmap, demonstrated promising capabilities in SQLI detection. However, it still faced challenges such as inefficient detection and difficulty in bypassing insufficient defense mechanisms. Consequently, we developed SqliGPT, an innovative LLM-based SQLI black-box detection scanner. SqliGPT enhances LLM4Sqli-Sqlmap by integrating the Strategy Selection Module and the Defense Bypass Module to boost detection efficiency and overcome defense-related challenges.

We conducted a comparative analysis of SqliGPT against six other state-of-the-art SQLI scanners using the SqliMicroBenchmark. The results indicated that SqliGPT successfully identified all 45 targets, surpassing the performance of the other scanners. Additionally, SqliGPT demonstrated superior efficiency in detection tasks. Notably, among the 45 SQL injection targets successfully identified by SqliGPT, 5 included real-world CVEs, further underscoring the practical effectiveness of SqliGPT. While it was slightly less effective than Arachni and SQIRL on 27 Class-1 targets, it achieved the best results on 18 targets across Class-2 and Class-3. Overall, this study reveals the potential of LLMs in SQLI black-box detection and provides a wide scope for future research and improvement.

Future work will focus on reducing hallucinations to improve SqliGPT’s stability and strengthening the Defense Bypass Module to handle more complex insufficient defenses. We will explore integrating other scanners to enhance robustness and consider using open-source LLMs to reduce costs. Additionally, we aim to improve the efficiency of LLMs in SQL injection black-box detection for better performance.

Author Contributions

Conceptualization, Z.G. and W.X.; methodology, Z.G.; software, Z.G., B.D., M.Z., Y.C. and S.W.; validation, Z.G., B.D., M.Z., Y.C. and S.W.; investigation, Z.G., E.W., W.X. and B.W.; resources, Z.G.; data curation, B.D., M.Z., Y.C. and S.W.; writing—original draft preparation, Z.G.; writing—review and editing, E.W., W.X. and B.W.; supervision, Z.G.; project administration, Z.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

We have made SqliMicroBenchmark and SqliGPT available as open-source resources at https://github.com/guizhiwen/SqliGPT, accessed on 11 July 2024.

Acknowledgments

We sincerely thank the reviewers for their perceptive comments, which helped us to improve this work.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

SQLI	Structured Query Language Injection
LLMs	Large Language Models
AI	Artificial Intelligence
CoT	Chain-of-Thought
CVE	Common Vulnerabilities and Exposures
WAF	Web Application Firewalls
API	Application Programming Interface

Appendix A

This appendix describes in detail the experimental testing strategies used to evaluate the effectiveness of different LLMs in detecting SQLI vulnerability.

Repeated Testing Criteria: Given the inherent variability of LLMs’ outputs, we use a repeated testing approach to verify the effectiveness of the program. Specifically, we test each target five times with each LLM. In this process, as long as one test succeeds, LLMs are considered to have successfully solved that SQLI problem.

Failure Determination Criteria: LLM4Sqli-Sqlmap and PentestGPT: If LLMs fail to detect a usable injection point or consider the target immune to SQLI attacks in five rounds of interactions, it is judged as a test failure. LLM4Sqli-Manual: If LLMs fail to provide a payload that helps to advance the direction of the SQLI problem solution in three consecutive interaction cycles, it is judged as a test failure.

Test Maintenance and Integrity Safeguards:

Ensure test independence by restarting the test machine after each test to reset the state.
Restrict testers from performing actions and reporting results without adding expert knowledge or guidance to maintain objectivity and consistency in the testing process.

Table A1. Detailed configuration of scanners.

ZAP:

1. Traditional Spider

2. AJAX Spider with Firefox Headless

3. Active scans using only SQL injection option, choosing the highest intensity and checking all the injection targets

BurpSuite:

1. Default crawling settings

2. Set the authentication using active cookie

3. Audit using all SQL injection options

Arachni:

./bin/arachni [url] –check=sqli –http-cookie-string=’Path=[cookie_file]’

Sqlmap:

sqlmap -r request_file –batch –level=5 –flush-session

Table A2. Detailed performance of different scanners on SqliMicroBenchmark.

Scanner		Class-1 (1–15,21–29,32–34)										Class-2 (16–20)			Class-3 (30,31,35–45)
Scanner		1–9	10	11–14	15	21–23	24	25–26	27–29	32–33	34	16–18	19	20	30–31	35–45
ZAP		✓	✓	✓	✓	✓			✓	✓	✓
BurpSuite		✓		✓	✓	✓		✓	✓	✓	✓
Arachni		✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
Sqlmap		✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
SQIRL		✓	✓		✓	✓	✓	✓	✓		✓			✓
	GPT-3.5	✓	✓			✓	✓	✓	✓	✓	✓
PentestGPT	GPT-4	✓	✓			✓	✓	✓	✓	✓	✓
	Claude-3	✓	✓			✓	✓	✓	✓	✓	✓
	GPT-3.5	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
LLM4Sqli-Sqlmap	GPT-4	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
	Claude-3	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
	GPT-3.5	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
SqliGPT	GPT-4	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
	Claude-3	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓

Basic Difficulty Targets: 1–28 Advanced Difficulty Targets: 29–45. 25–29: CVE-2020-8637, CVE-2020-8638, CVE-2020-8841, CVE-2023-30605, CVE-2023-24812.

References

Guan, Y.; He, J.; Li, T.; Zhao, H.; Ma, B. SSQLi: A Black-Box Adversarial Attack Method for SQL Injection Based on Reinforcement Learning. Future Internet 2023, 15, 133. [Google Scholar] [CrossRef]
Wahaibi, S.A.A.; Foley, M.; Maffeis, S. SQIRL: Grey-Box Detection of SQL Injection Vulnerabilities Using Reinforcement Learning. In Proceedings of the USENIX Security Symposium, Anaheim, CA, USA, 9–11 August 2023. [Google Scholar]
Djuric, Z. A black-box testing tool for detecting SQL injection vulnerabilities. In Proceedings of the 2013 Second International Conference on Informatics & Applications (ICIA), Lodz, Poland, 23–25 September 2013; pp. 216–221. [Google Scholar]
Saifan, A.; Alsmadi, I.; Aleroud, A. Fault-based Testing for Discovering SQL Injection Vulnerabilities in Web Applications. Int. J. Inf. Comput. Secur. 2018, 16, 51–62. [Google Scholar] [CrossRef]
Appelt, D.; Nguyen, D.C.; Briand, L.C.; Alshahwan, N. Automated testing for SQL injection vulnerabilities: An input mutation approach. In Proceedings of the International Symposium on Software Testing and Analysis, San Jose, CA, USA, 21–25 July 2014. [Google Scholar]
Kolias, C.; Kambourakis, G.; Meng, W.; Althunayyan, M.; Saxena, N.; Li, S.; Gope, P. Evaluation of Black-Box Web Application Security Scanners in Detecting Injection Vulnerabilities. Electronics 2022, 11, 2049. [Google Scholar] [CrossRef]
Anagandula, K.; Zavarsky, P. An Analysis of Effectiveness of Black-Box Web Application Scanners in Detection of Stored SQL Injection and Stored XSS Vulnerabilities. In Proceedings of the 2020 3rd International Conference on Data Intelligence and Security (ICDIS), South Padre Island, TX, USA, 24–26 June 2020; pp. 40–48. [Google Scholar]
Qu, Z.; Ling, X.; Wang, T.; Chen, X.; Ji, S.; Wu, C. AdvSQLi: Generating Adversarial SQL Injections Against Real-World WAF-as-a-Service. IEEE Trans. Inf. Forensics Secur. 2024, 19, 2623–2638. [Google Scholar] [CrossRef]
Yuan, Y.; Lu, Y.; Zhu, K.; Huang, H.; Yu, L.; Zhao, J. A Static Detection Method for SQL Injection Vulnerability Based on Program Transformation. Appl. Sci. 2023, 13, 11763. [Google Scholar] [CrossRef]
Touseef, P.; Alam, K.A.; Jamil, A.; Tauseef, H.; Ajmal, S.; Asif, R.; Rehman, B.; Mustafa, S. Analysis of Automated Web Application Security Vulnerabilities Testing. In Proceedings of the 3rd International Conference on Future Networks and Distributed Systems, Paris, France, 1–2 July 2019. [Google Scholar]
OWASP. OWASP Top 10 Web Application Security Risks. Available online: https://owasp.org/www-project-top-ten/ (accessed on 1 July 2024).
Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
Achiam, O.J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
Anthropic. The Claude 3 Model Family: Opus, Sonnet, Haiku. Available online: https://www.anthropic.com/news/claude-3-family (accessed on 1 July 2024).
Deng, G.; Liu, Y.; Mayoral-Vilches, V.; Liu, P.; Li, Y.; Xu, Y.; Zhang, T.; Liu, Y.; Pinzger, M.; Rass, S. PentestGPT: An LLM-empowered Automatic Penetration Testing Tool. arXiv 2023, arXiv:2308.06782. [Google Scholar]
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, 2–7 June 2019. [Google Scholar]
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
Raffel, C.; Shazeer, N.M.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2019, 21, 1–67. [Google Scholar]
Kals, S.; Kirda, E.; Krügel, C.; Jovanović, N. SecuBat: A web vulnerability scanner. In Proceedings of the the Web Conference, Edinburgh, Scotland, 23–26 May 2006. [Google Scholar]
Huang, Y.W.; Huang, S.K.; Lin, T.P.; Tsai, C.H. Web application security assessment by fault injection and behavior monitoring. In Proceedings of the the Web Conference, Budapest, Hungary, 20–24 May 2003. [Google Scholar]
Arachni. Arachni—Web Application Security Scanner Framework. Available online: https://github.com/Arachni/arachni (accessed on 1 July 2024).
sqlmap: Automatic SQL Injection and Database Takeover Tool. Available online: https://sqlmap.org/?ref=byreference.net (accessed on 1 July 2024).
Marashdeh, Z.; Suwais, K.; Alia, M.A. A Survey on SQL Injection Attack: Detection and Challenges. In Proceedings of the 2021 International Conference on Information Technology (ICIT), Amman, Jordan, 14–15 July 2021; pp. 957–962. [Google Scholar]
Nagy, C.; Cleve, A. A Static Code Smell Detector for SQL Queries Embedded in Java Code. In Proceedings of the 2017 IEEE 17th International Working Conference on Source Code Analysis and Manipulation (SCAM), Shanghai, China, 17–18 September 2017; pp. 147–152. [Google Scholar]
Zhang, L.; Zhang, D.; Wang, C.H.; Zhao, J.; Zhang, Z. ART4SQLi: The ART of SQL Injection Vulnerability Discovery. IEEE Trans. Reliab. 2019, 68, 1470–1489. [Google Scholar] [CrossRef]
Luo, Y. SQLi-Fuzzer: A SQL Injection Vulnerability Discovery Framework Based on Machine Learning. In Proceedings of the 2021 IEEE 21st International Conference on Communication Technology (ICCT), Tianjin, China, 13–16 October 2021; pp. 846–851. [Google Scholar] [CrossRef]
Sablotny, M.; Jensen, B.S.; Johnson, C.W. Recurrent Neural Networks for Fuzz Testing Web Browsers. In Proceedings of the International Conference on Information Security and Cryptology, Seoul, Republic of Korea, 28–30 November 2018. [Google Scholar]
Liu, M.; Li, K.; Chen, T.A. DeepSQLi: Deep semantic learning for testing SQL injection. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual, 18–22 July 2020. [Google Scholar]
Verme, M.D.; Sommervoll, Å.Å.; Erdödi, L.; Totaro, S.; Zennaro, F.M. SQL Injections and Reinforcement Learning: An Empirical Evaluation of the Role of Action Structure. In Proceedings of the Nordic Conference on Secure IT Systems, Virtual, 29–30 November 2021. [Google Scholar]
Erdődi, L.; Sommervoll, Å.Å.; Zennaro, F.M. Simulating SQL Injection Vulnerability Exploitation Using Q-Learning Reinforcement Learning Agents. J. Inf. Secur. Appl. 2021, 61, 102903. [Google Scholar] [CrossRef]
Happe, A.; Cito, J. Getting pwn’d by AI: Penetration Testing with Large Language Models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, San Francisco, CA, USA, 3–9 December 2023. [Google Scholar]
Gravitas, S. Auto-GPT: An Autonomous GPT-4 Experiment. 2023. Available online: https://github.com/Significant-Gravitas/Auto-GPT (accessed on 1 July 2024).
Nakajima, Y. Introducing Task-Driven Autonomous Agent. 2023. Available online: https://twitter.com/yoheinakajima/status/1640934493489070080 (accessed on 1 July 2024).
Nakajima, Y. BabyAGI. 2023. Available online: https://github.com/yoheinakajima/babyagi (accessed on 1 July 2024).
Trickel, E.; Pagani, F.; Zhu, C.; Dresel, L.; Vigna, G.; Kruegel, C.; Wang, R.; Bao, T.; Shoshitaishvili, Y.; Doupé, A. Toss a Fault to Your Witcher: Applying Grey-box Coverage-Guided Mutational Fuzzing to Detect SQL and Command Injection Vulnerabilities. In Proceedings of the 2023 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 21–25 May 2023; pp. 2658–2675. [Google Scholar]
Zhou, W.; Jiang, Y.; Li, L.; Wu, J.; Wang, T.; Qiu, S.; Zhang, J.; Chen, J.; Wu, R.; Wang, S.; et al. Agents: An Open-source Framework for Autonomous Language Agents. arXiv 2023, arXiv:2309.07870. [Google Scholar]
Guo, T.; Chen, X.; Wang, Y.; Chang, R.; Pei, S.; Chawla, N.; Wiest, O.; Zhang, X. Large Language Model based Multi-Agents: A Survey of Progress and Challenges. arXiv 2024, arXiv:2402.01680. [Google Scholar]
Xi, Z.; Chen, W.; Guo, X.; He, W.; Ding, Y.; Hong, B.; Zhang, M.; Wang, J.; Jin, S.; Zhou, E.; et al. The Rise and Potential of Large Language Model Based Agents: A Survey. arXiv 2023, arXiv:2309.07864. [Google Scholar]
Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.Y.; Tang, J.; Chen, X.; Lin, Y.; et al. A Survey on Large Language Model based Autonomous Agents. arXiv 2023, arXiv:2308.11432. [Google Scholar] [CrossRef]
Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Guo, Q.; Wang, M.; et al. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2023, arXiv:2312.10997. [Google Scholar]
Zhu, Y.; Yuan, H.; Wang, S.; Liu, J.; Liu, W.; Deng, C.; Chen, H.; Dou, Z.; Wen, J. Large Language Models for Information Retrieval: A Survey. arXiv 2023, arXiv:2308.07107. [Google Scholar]
Song, Y.; Wang, T.Y.; Cai, P.; Mondal, S.K.; Sahoo, J.P. A Comprehensive Survey of Few-shot Learning: Evolution, Applications, Challenges, and Opportunities. ACM Comput. Surv. 2022, 55, 1–40. [Google Scholar] [CrossRef]
Wang, Y.; Yao, Q.; Kwok, J.T.Y.; Ni, L.M. Generalizing from a Few Examples. ACM Comput. Surv. (CSUR) 2019, 53, 1–34. [Google Scholar] [CrossRef]
Wang, W.; Zheng, V.W.; Yu, H.; Miao, C. A Survey of Zero-Shot Learning. ACM Trans. Intell. Syst. Technol. (TIST) 2019, 10, 1–37. [Google Scholar] [CrossRef]
Chu, Z.; Chen, J.; Chen, Q.; Yu, W.; He, T.; Wang, H.; Peng, W.; Liu, M.; Qin, B.; Liu, T. Navigate through Enigmatic Labyrinth A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future. arXiv 2023, arXiv:2309.15402. [Google Scholar]
Zhang, Z.; Yao, Y.; Zhang, A.; Tang, X.; Ma, X.; He, Z.; Wang, Y.; Gerstein, M.B.; Wang, R.; Liu, G.; et al. Igniting Language Intelligence: The Hitchhiker’s Guide from Chain-of-Thought Reasoning to Language Agents. arXiv 2023, arXiv:2311.11797. [Google Scholar]
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain of Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2022, arXiv:2201.11903. [Google Scholar]
Aliero, M.S.; Ghani, I.; Qureshi, K.N.; Rohani, M.F. An algorithm for detecting SQL injection vulnerability using black-box testing. J. Ambient. Intell. Humaniz. Comput. 2019, 11, 249–266. [Google Scholar] [CrossRef]
Wang, X.; Hu, H. Evading Web Application Firewalls with Reinforcement Learning. 2020. Available online: https://openreview.net/forum?id=m5AntlhJ7Z5 (accessed on 1 July 2024).
Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. arXiv 2023, arXiv:2311.05232. [Google Scholar]
Pan, L.; Saxon, M.S.; Xu, W.; Nathani, D.; Wang, X.; Wang, W.Y. Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies. arXiv 2023, arXiv:2308.03188. [Google Scholar] [CrossRef]

Figure 1. LLM4Sqli-Manual (left) and LLM4Sqli-Sqlmap (right) workflow.

Figure 2. SqliGPT workflow.

Figure 3. Strategy Selection Module Workflow.

Figure 4. Defense Bypass Module workflow.

Table 1. Performance of LLM4Sqli-Manual and LLM4Sqli-Sqlmap on SqliMicroBenchmark.

Scanner		Basic Targets (28)	Advanced Targets (17)	Total (45)
	GPT-3.5	0 (0%)	0 (0%)	0 (0%)
LLM4Sqli-Manual	GPT-4	4 (14%)	0 (0%)	4 (9%)
	Claude-3	2 (7%)	0 (0%)	2 (4%)
	GPT-3.5	26 (93%)	4 (24%)	30 (67%)
LLM4Sqli-Sqlmap	GPT-4	27 (96%)	4 (24%)	31 (69%)
	Claude-3	26 (93%)	4 (24%)	30 (67%)

The best performance for each metric is highlighted in bold font.

Table 2. Performance comparison of each scanner on SqliMicroBenchmark. Since different numbers of HTTP requests need to be sent to detect different targets, to ensure fairness of the evaluation, we categorized the 45 targets into three classes based on the target completion of the scanners and compared the average number of HTTP requests that each scanner needs to send to detect different classes of targets respectively.

Scanner		Avg Requests per Successful Target			Basic Targets (28)	Advanced Targets (17)	Total (45)
Scanner		Class-1	Class-2	Class-3	Basic Targets (28)	Advanced Targets (17)	Total (45)
ZAP		388	–	–	20 (71%)	4 (24%)	24 (53%)
BurpSuite		283	–	–	21 (75%)	4 (24%)	25 (56%)
Arachni		77	–	–	23 (82%)	4 (24%)	27 (60%)
Sqlmap		717	9488	–	27 (96%)	4 (24%)	31 (69%)
SQIRL		83	–	–	20 (71%)	2 (12%)	22 (49%)
	GPT-3.5	357	–	–	18 (64%)	4 (24%)	22 (49%)
PentestGPT	GPT-4	351	–	–	18 (64%)	4 (24%)	22 (49%)
	Claude-3	338	–	–	18 (64%)	4 (24%)	22 (49%)
	GPT-3.5	609	6712	–	26 (93%)	4 (24%)	30 (67%)
LLM4Sqli-Sqlmap	GPT-4	517	8451	–	27 (96%)	4 (24%)	31 (69%)
	Claude-3	504	11,436	–	26 (93%)	4 (24%)	30 (67%)
	GPT-3.5	127	761	1684	28 (100%)	17 (100%)	45 (100%)
SqliGPT	GPT-4	119	822	1681	28 (100%)	17 (100%)	45 (100%)
	Claude-3	113	749	1669	28 (100%)	17 (100%)	45 (100%)

The best performance for each metric is highlighted in bold font.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Gui, Z.; Wang, E.; Deng, B.; Zhang, M.; Chen, Y.; Wei, S.; Xie, W.; Wang, B. SqliGPT: Evaluating and Utilizing Large Language Models for Automated SQL Injection Black-Box Detection. Appl. Sci. 2024, 14, 6929. https://doi.org/10.3390/app14166929

AMA Style

Gui Z, Wang E, Deng B, Zhang M, Chen Y, Wei S, Xie W, Wang B. SqliGPT: Evaluating and Utilizing Large Language Models for Automated SQL Injection Black-Box Detection. Applied Sciences. 2024; 14(16):6929. https://doi.org/10.3390/app14166929

Chicago/Turabian Style

Gui, Zhiwen, Enze Wang, Binbin Deng, Mingyuan Zhang, Yitao Chen, Shengfei Wei, Wei Xie, and Baosheng Wang. 2024. "SqliGPT: Evaluating and Utilizing Large Language Models for Automated SQL Injection Black-Box Detection" Applied Sciences 14, no. 16: 6929. https://doi.org/10.3390/app14166929

APA Style

Gui, Z., Wang, E., Deng, B., Zhang, M., Chen, Y., Wei, S., Xie, W., & Wang, B. (2024). SqliGPT: Evaluating and Utilizing Large Language Models for Automated SQL Injection Black-Box Detection. Applied Sciences, 14(16), 6929. https://doi.org/10.3390/app14166929

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

SqliGPT: Evaluating and Utilizing Large Language Models for Automated SQL Injection Black-Box Detection

Abstract

1. Introduction

2. Background and Related Work

2.1. Large Language Models

2.2. Non-LLM SQLI Black-Box Detection Methods

2.3. LLM-Based SQLI Black-Box Detection Methods

3. Motivation

4. Exploratory Study

4.1. SqliMicroBenchmark

4.2. Experimental Settings

4.3. Capability Evaluation

4.4. Challenges

5. Overview and Design

5.1. Overview

5.2. Strategy Selection Module

5.3. Defense Bypass Module

6. Evaluation

6.1. Evaluation Settings

6.2. Effectiveness Evaluation (RQ1)

6.3. Efficiency Evaluation (RQ2)

6.4. Ablation Study (RQ3)

7. Discussion

8. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

Appendix A

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI