Grey-Box Fuzzing Based on Reinforcement Learning for XSS Vulnerabilities

Song, Xuyan; Zhang, Ruxian; Dong, Qingqing; Cui, Baojiang

doi:10.3390/app13042482


by Xuyan Song ¹,*,†, Ruxian Zhang ²,†, Qingqing Dong ¹ and Baojiang Cui ¹

¹ School of Cyber Security, Beijing University of Posts and Telecommunications, Beijing 100876, China
² School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.

Appl. Sci. 2023, 13(4), 2482; https://doi.org/10.3390/app13042482

Submission received: 20 January 2023 / Revised: 14 February 2023 / Accepted: 14 February 2023 / Published: 15 February 2023


Abstract:

Cross-site scripting (XSS) vulnerabilities are significant threats to web applications. The number of XSS vulnerabilities reported has increased annually for the past three years, posing a considerable challenge to web application maintainers. Black-box scanners are mainstream tools for security engineers to perform penetration testing and detect XSS vulnerabilities. Unfortunately, black-box scanners rely on crawlers to find input points of web applications and cannot guarantee all input points are tested. To this end, we propose a grey-box fuzzing method based on reinforcement learning, which can detect reflected and stored XSS vulnerabilities for Java web applications. We first use static analysis to identify potential input points from components (i.e., Java code, configuration files, and HTML files) of the Java web application. Then, an XSS vulnerability payload generation method is proposed, which is used together with the reinforcement learning model. We define the state, action, and reward functions of three reinforcement learning models for XSS vulnerability detection scenarios so that the fuzz loop can be performed automatically. To demonstrate the effectiveness of the proposed method, we compare it against four state-of-the-art web scanners. Experimental results show that our method finds all XSS vulnerabilities and has no false positives.

Keywords:

XSS vulnerability; web security; fuzzing testing; reinforcement learning

1. Introduction

Web applications provide a friendly user interface for cloud services. With the large-scale implementation of cloud services in recent years, web applications have been widely used in the financial, government, communications, and industrial sectors. However, web applications face a large number of network attacks, such as injection attacks [1], distributed denial of service (DDoS) [2], and cross-site request forgery (CSRF) [3]. They seriously threaten the availability of cloud service providers and users. Cross-site scripting (XSS) vulnerability is one of the most well-known web vulnerabilities [4], which has been difficult to eradicate for a long time and has become a significant threat to web applications.

An XSS vulnerability is an injection flaw in a web application caused by untrusted input flowing to a sensitive web application location. Validating all input information during development is the most effective way to eliminate XSS vulnerabilities. The XSS cheatsheet series project [5] proposes rules to help developers correctly sanitize or avoid untrusted input in different contexts. In this way, the problem of XSS vulnerability is theoretically solved. However, in practice, it is difficult to insert the data verification process correctly into the program execution flow, and many XSS vulnerabilities are still not handled appropriately. According to the CVE Program [6], 3403 newly discovered XSS vulnerabilities were assigned CVE numbers in 2022, an increase of 25.4% compared with 2021 and of 56% compared with 2020. This means that the security threats faced by web applications are becoming more severe. The Open Web Application Security Project (OWASP) [7] is a community dedicated to improving software security, which publishes lists that tell developers which vulnerabilities are most dangerous. In 2017, the XSS vulnerability was ranked seventh in the OWASP Top 10. In the latest version of the OWASP Top 10, published in 2021, the risk of the XSS vulnerability rose to third place. This demonstrates that XSS vulnerabilities are a severe challenge to web applications.

Many existing studies have proposed detection methods for XSS vulnerabilities. Static analysis is the earliest method used to detect XSS vulnerabilities. Through the analysis of source code, all possible execution paths of the program can be found, which provide information for detecting XSS vulnerabilities. Algaith et al. [8] used five static analysis tools, RIPS, PIPS, phpSafe, WAP, and WeVerca, to detect XSS vulnerabilities in 132 plugins of the WordPress content management system. Experimental results show that integrated static analysis tools perform better than individual static analysis tools. However, detection based on static analysis often has many false positives because even the most advanced static analysis techniques cannot accurately predict program execution flow. Fuzz testing is more widely used in XSS vulnerability detection than static analysis. The advantage of this approach is that it can be independent of the web application. If a vulnerability exists, it can also give a proof of concept (POC) for specific vulnerability exploitation. Ran et al. [9] proposed a detection method for DOM-based XSS vulnerabilities based on a dynamic detection framework, analyzing pages with XSS vulnerability risks to obtain taint information and automatically generating attack vectors based on the information for testing. Automatic verification of vulnerabilities can be achieved by exporting attack vectors. However, this method is not perfect. Fuzzing suffers from low API coverage, so it is used with other program analysis techniques, such as dynamic taint analysis [10,11]. In recent years, researchers have begun to apply deep learning to XSS vulnerability detection and achieved good performance. Maurel et al. [12] proposed an ensemble-based approach to deep learning by combining convolutional neural network (CNN) and long short-term memory (LSTM) models. On the dataset provided by the authors, the accuracy rate of XSS vulnerability detection reaches 99.47%. 
However, detection based on deep learning is generally tricky for real-world web applications. This is because most web pages are dynamically generated, and it takes a lot of human effort to convert them into a format that deep learning models can recognize.

In general, fuzzing-based XSS vulnerability detection is practical and can find vulnerabilities in real-world web applications. However, this method still has problems, such as low payload generation efficiency. We consider that there are three challenges for fuzzing-based methods: (1) Diverse web frameworks. A web framework brings convenience to developers but makes web applications more complex, which makes the analysis of web application code and configuration inaccurate. (2) Incomplete input point identification. Most fuzzing-based methods use crawlers to discover potential input points, which cannot guarantee that all input points are covered. (3) Inefficient payload generation. Due to the increasing scale of web applications, traditional payload generation strategies (such as random fuzzing) cannot generate test cases in good time. In this paper, we address these challenges and propose a reinforcement learning-based grey-box fuzzing method for Java web applications. Our approach is inspired by [13], which proposes a Q-learning-based method for detecting SQL injection vulnerabilities. We extend [13] so that it can detect XSS vulnerabilities. On the one hand, we use static analysis to scan every component of the Java web application for potential input points. On the other hand, we evaluate the performance of three reinforcement learning models on XSS vulnerability detection. Compared with traditional payload generation, our method detects vulnerabilities in a shorter time.

Overall, the main contributions of this paper are as follows:

(1): We use static analysis to identify input points of Java web applications, including Java code (supporting Java Servlet and Spring framework), configuration files, and HTML code. Almost all potential input points of the web application could be covered.
(2): We propose an XSS payload generation method based on reinforcement learning. The state, action, and reward function of the reinforcement learning model are defined for the XSS vulnerability detection scenario. We validate payload generation with the DQN, DDQN, and Policy Gradient models.
(3): To evaluate the effectiveness of the proposed method, we implement it and design systematic experiments. On the one hand, we compare the efficacy of different reinforcement learning models in XSS vulnerability detection. On the other hand, we compare the performance of the proposed method with four state-of-the-art web scanners. Experimental results show that our method is more effective.

The remainder of this paper is organized as follows. Section 2 introduces the related work about XSS vulnerability detection and reinforcement learning for cyber security. Research progress is discussed in detail. Section 3 presents the background of XSS vulnerability and reinforcement learning to help readers quickly understand the two fields. In Section 4, the composition and principles of the proposed method are introduced in detail. To evaluate our method, we conduct comprehensive experiments in Section 5. We propose three standard research questions and demonstrate the effectiveness of our tool by answering them. Section 6 summarizes the work in this paper and introduces future work.

2. Related Work

2.1. XSS Vulnerability Detection

According to analysis methods, XSS vulnerability detection is divided into static, dynamic, and hybrid analysis [14]. Static analysis detects vulnerabilities by analyzing the source code of web applications. However, this method could lead to false positives, and the source code is sometimes unavailable. The dynamic analysis method detects vulnerabilities by injecting artificially generated payloads into websites and observing whether abnormalities are triggered. Due to the inability to cover all input points, this method has a high false negative rate. Hybrid analysis methods combine static and dynamic features to detect vulnerabilities.

Gupta et al. [15] proposed the CSSXC framework, which can be deployed as a cloud service to detect XSS vulnerabilities. When a user requests, the web server extracts information such as parameters and links to detect injection points. This information is sent to a malicious JavaScript detection server. If there is a malicious script at the input point, CSSXC will delete the malicious script in the request and return the result to the web application server. Experimental results show that the method detects XSS payloads with fewer false positives and less resource consumption.

Liu et al. [16] proposed a method based on second-order crawl scans to detect XSS vulnerabilities. This method first crawls all uniform resource locators (URLs) of the website and tries to inject strings with malicious scripts. Then, the scanner crawls URLs containing special strings again to detect whether there are vulnerabilities. Experimental results show that this method reduces the time overhead of XSS vulnerability detection while ensuring high accuracy.

Melicher et al. [17] proposed a machine learning-based DOM-based XSS vulnerability detection method. They collected more than 18 billion JavaScript functions and used the data to train a deep neural network to predict whether it was vulnerable to DOM-based XSS attacks. By testing a series of hyperparameters, the authors trained a well-performing classifier, which can be used as a pre-filter for taint tracking, while reducing the resource cost of independent taint tracking.

Choi et al. [18] proposed a black-box hybrid XSS detection (HXD) tool. HXD extracts URLs from weblogs and optimizes them to correct input URLs, then uses static analysis and dynamic browsers to detect XSS vulnerabilities. Since HXD uses PhantomJS, a headless browser, to execute JavaScript code, it can detect vulnerabilities in JavaScript frameworks. Experiments show that HXD has a low false positive rate and can detect XSS vulnerabilities that other black-box scanners miss.

2.2. Reinforcement Learning for Cyber Security

Sophisticated cyberattacks require protection mechanisms that are responsive, adaptable, and scalable. Reinforcement learning has been extensively proposed to address these problems [19]. In recent years, reinforcement learning has been gradually applied in payload generation and attack detection. However, most of the research is demonstrated on datasets and not applied to web applications, and implementing these methods needs to meet many constraints.

Wang et al. [20] explored the possibility of using reinforcement learning to generate SQL injection payloads that evade web application firewalls (WAFs). They created a reinforcement learning environment for WAF evasion tasks and evaluated various mainstream WAF products using a proximal policy optimization (PPO) algorithm. In their experiments, the framework discovered evasive payloads for each WAF and significantly outperformed baseline strategies. Finally, they extracted common patterns from the discovered evasion payloads and discussed weaknesses and flaws of existing WAF products as well as suggested improvements.

Caturano et al. [21] proposed Suggester, a reflected XSS payload detection tool based on Q-learning. Suggester sends a series of strings to a web application and observes the response. Then, the agent is trained to generate attack strings using a multi-objective reinforcement learning (MORL) environment framework with a parameterized action space. However, this process requires human participation, including analyzing observations and calculating rewards, which prevents Suggester from performing fully automated penetration testing.

Fang et al. [22] proposed a reinforcement learning-based XSS attack detection model by training adversarial examples and detection models together. They first mined the adversarial examples of the detection model based on the adversarial attack model of reinforcement learning. Then, they alternately trained the detection model and the adversarial model. After each round, the newly generated adversarial samples were marked as malicious samples for retraining the detection model.

Lee et al. [23] proposed Link, a black-box web scanner based on reinforcement learning. Link uses a web crawler to obtain the input points of the target web application and then uses two strategies, generation and mutation, to produce payloads. They used a reinforcement learning algorithm to discover reflected XSS vulnerabilities. However, a black-box scanner cannot judge how many of the web application's input points it has covered.

3. Background

This section presents the background of XSS vulnerability and reinforcement learning. We first introduce the principle and classification of XSS vulnerabilities. Then, the principles of reinforcement learning are introduced. They are the basis for understanding the proposed method in this paper.

3.1. XSS Vulnerability

XSS is an injection attack that injects malicious scripts into benign and trusted websites [24]. Attackers can take advantage of this vulnerability to implant malicious code into every corner of the website. When users visit this website, malicious code will be automatically executed, and users will not sense any abnormality. This way, attackers can obtain users’ private information, steal users’ cookies, and spread webworms. Most of the XSS vulnerabilities appear in applications and websites with human–computer interaction, such as WordPad, sending email, message areas, and search bars. XSS vulnerabilities are divided into three types [25]: reflected XSS, stored XSS, and DOM-based XSS.

Reflected XSS directly reflects the user’s input to the browser, causing the browser to execute some scripts. In this kind of attack, the attacker usually constructs the attack payload, combines the attack payload with the link to form a malicious link, and tricks the user into visiting the link. When the user visits the link, it causes the browser to execute malicious scripts.
Stored XSS stores the input data on the server side, and it will trigger when the data are displayed on the page visited by the user. Usually, the attacker uploads the attack payload on the page where text can be submitted (such as message boards, blogs, etc.), and when other users access the data, the attack vector will be activated.
DOM-based XSS is a vulnerability that exploits the Document Object Model (DOM). The DOM allows scripts to dynamically access and update document content, structure, and styles, then display them on the page. This type of XSS vulnerability does not need to save the data to the server but executes the DOM data obtained by the client locally, which is also a reflected XSS in the strict sense.

3.2. Reinforcement Learning

Reinforcement learning is a type of feedback-based learning [26]. As shown in Figure 1, an agent observes the state and reward of the environment, then generates and executes an action based on the state. After the environment is affected by the action of the agent, it switches to the next state and gives the agent a new reward. The agent's goal is to maximize the expected long-term reward through constant interaction with the environment.

A basic reinforcement learning problem is modeled as a Markov decision process (MDP) [27]. An MDP is represented by a five-tuple $\langle S, A, R, P, \gamma \rangle$, where $S$ is the state set of the agent, $A$ is the action set of the agent, $P$ is the state transition matrix, $P_{ss'}^{a} = P[S_{t+1} = s' \mid S_{t} = s, A_{t} = a]$, $R$ is the reward function, $R(s, a) = E[R_{t+1} \mid S_{t} = s, A_{t} = a]$, and $\gamma \in (0, 1)$ is the reward discount factor. In an MDP, state changes and rewards depend only on the current state and action, not on earlier states and actions. The agent needs to learn a policy $\pi$, a mapping from states to actions, that determines the action chosen in the current state. The goal of reinforcement learning is to learn a policy that describes every action chosen by the agent. In addition to the policy, the MDP has two value functions: the state value function $V$ and the state-action value function $Q$.

The state value function is the expected reward obtained by starting from state $s$ and following policy $\pi$:

$$v_{\pi}(s) = E_{\pi}\left[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_{t} = s\right] \tag{1}$$

The state-action value function is the expected reward obtained by starting from state $s$, executing action $a$, and then following policy $\pi$:

$$q_{\pi}(s, a) = E_{\pi}\left[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) \mid S_{t} = s, A_{t} = a\right] \tag{2}$$

Reinforcement learning seeks the largest value function over all policies, that is, the optimal value function, which comprises the optimal $V$ function and the optimal $Q$ function:

$$v_{*}(s) = \max_{\pi} v_{\pi}(s) \tag{3}$$

$$q_{*}(s, a) = \max_{\pi} q_{\pi}(s, a) \tag{4}$$

The optimal value function represents the best performance that can be achieved in an MDP; once the optimal value function is found, the MDP is considered solved. To find the optimal value function, we use the recurrence relationship between the optimal $Q$ function and the optimal $V$ function:

$$v_{*}(s) = v_{\pi_{*}}(s) = \sum_{a \in A} \pi_{*}(a \mid s)\, q_{\pi_{*}}(s, a) = \max_{a} q_{\pi_{*}}(s, a) = \max_{a} q_{*}(s, a) \tag{5}$$

Through continuous iteration, the optimal $Q$ function and the optimal $V$ function are finally obtained.
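The iteration mentioned above can be illustrated with value iteration, a standard way to compute the optimal value functions. The text does not say which iterative scheme is intended, so this is only a minimal sketch under that assumption. The two-state MDP, its rewards, and the names `P`, `GAMMA`, `q_from_v`, and `value_iteration` are invented for illustration.

```python
# Toy MDP invented for illustration: 2 states, 2 actions.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.5  # discount factor, assumed value in (0, 1)


def q_from_v(v, s, a):
    """One-step lookahead: expected reward of taking a in s, then following v."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[s][a])


def value_iteration(tol=1e-10):
    """Repeat v(s) <- max_a q(s, a) until the values stop changing."""
    v = {s: 0.0 for s in P}
    while True:
        new_v = {s: max(q_from_v(v, s, a) for a in P[s]) for s in P}
        if max(abs(new_v[s] - v[s]) for s in P) < tol:
            return new_v
        v = new_v


v_star = value_iteration()
q_star = {(s, a): q_from_v(v_star, s, a) for s in P for a in P[s]}
```

In this toy example the iteration converges to $v_{*}(0) = 3$ and $v_{*}(1) = 4$, and each $v_{*}(s)$ equals $\max_{a} q_{*}(s, a)$, as Equation (5) states.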

4. Methodology

In this section, we give a general introduction and details of each part of reinforcement learning-based XSS vulnerability detection.

As shown in Figure 2, the proposed method has four steps. First, we use static analysis to handle the Web Application Archive (WAR) file of the target web application. WAR is a file format for Java web application distribution, consisting of Java Servlet, configuration, static web pages, etc. We analyze each part of the WAR to identify potential input points. These input points are thoroughly tested. Compared with scanner-based input point recognition, our method identifies more input points. Then, we propose a payload generation mechanism. The XSS vulnerability payload is divided into four parts, and a new payload is generated by replacing different parts. To bypass the data validation of the target web application, we process each payload with 19 mutations. Then, we choose three reinforcement learning models to optimize the generation of payload. Finally, we observe the state of the target web application with a headless browser [28], which is a browser without a graphical interface. It is also used to send mutated payloads. In the above steps, the static analysis only needs to be performed once, and the second to fourth steps are repeated continuously.

4.1. Static Analysis

We use static analysis to identify input points, which are URLs that users can request. In Java web applications, developers can set web page access points in Java code, configuration files, and HTML code. Due to differences in programming languages and file structures, we need to handle them separately.

For Java code, we identify the Java method that handles the user’s web request and its URL. Developers usually use web frameworks to develop Java web applications. Our method supports Java Servlet [29] and Spring framework [30] to detect vulnerabilities in the real world. These frameworks use annotations to simplify the developer’s work, which is a form of syntactic metadata [31]. Figure 3 shows an example using the Spring framework. The RequestMapping annotation on line 1 maps the user’s request to the getUser() method. The parameter of the annotation RequestMapping is the URL that the user visits. PathVariable modifies the variable userName to indicate that the value of the name in the URL is passed to the userName. To test this URL, we need to extract the annotations of the modified method (called method annotation) and the annotations of the modified parameters (called parameter annotation), as well as the data types of modified parameters. We support parsing content as follows:

Method Annotations: GetMapping, PostMapping, PutMapping, DeleteMapping, PatchMapping, RequestMapping, WebServlet.
Parameter Annotations: RequestParam, PathVariable.
Data Type: String, Byte, Double, Float, Long, Character, Short, Boolean.
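As a rough illustration of this extraction step, the following sketch pulls method and parameter annotations out of a Java source string with regular expressions. It is not the authors' implementation: a real tool would more likely use a proper Java parser, the regular expressions only cover the simple `@Annotation("value")` form, and the sample controller method is a hypothetical stand-in for Figure 3.

```python
import re

# Hypothetical Java snippet standing in for Figure 3.
JAVA_SOURCE = '''
@RequestMapping("/user/{name}")
public String getUser(@PathVariable("name") String userName) {
    return userName;
}
'''

# Annotation names taken from the lists above.
METHOD_ANNOTATIONS = ("GetMapping", "PostMapping", "PutMapping", "DeleteMapping",
                      "PatchMapping", "RequestMapping", "WebServlet")
PARAM_ANNOTATIONS = ("RequestParam", "PathVariable")

# Matches: @Mapping("url") public <type> <method>(<params>) {
METHOD_RE = re.compile(
    r'@(?P<ann>' + "|".join(METHOD_ANNOTATIONS) + r')\(\s*"(?P<url>[^"]*)"\s*\)'
    r'\s*public\s+\w+\s+(?P<method>\w+)\s*\((?P<params>[^{]*)\)\s*\{'
)
# Matches: @ParamAnnotation("name") <type> <variable>
PARAM_RE = re.compile(
    r'@(?P<ann>' + "|".join(PARAM_ANNOTATIONS) + r')(?:\(\s*"(?P<name>[^"]*)"\s*\))?'
    r'\s+(?P<type>\w+)\s+(?P<var>\w+)'
)


def extract_input_points(source):
    """Return one record per annotated handler method found in the source."""
    points = []
    for m in METHOD_RE.finditer(source):
        params = [p.groupdict() for p in PARAM_RE.finditer(m.group("params"))]
        points.append({"annotation": m.group("ann"), "url": m.group("url"),
                       "method": m.group("method"), "params": params})
    return points


points = extract_input_points(JAVA_SOURCE)
```

Each record carries the URL, the handler method, and the annotated parameters with their data types, which is the information the fuzzer needs to build a request for that input point.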

For configuration files, we parse web.xml, which is a Java web application deployment descriptor file. When a web server receives a request for an application, it uses the deployment descriptor to map the requested URL to the code that should handle the request [32]. So, we need to identify the URL and its parameters in web.xml. Extensible Markup Language (XML) is a structured document format, and XML parsers are usually used to parse it. According to the Java Servlet specification, we extract the values of <servlet-name> and <servlet-class> under the <servlet> node and the values of <servlet-name> and <url-pattern> under the <servlet-mapping> node. Then, we establish the relationship of the URL to the Java method by servlet name. The URL is the input point, and the Java method’s parameters are the URL’s parameters.
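The mapping from URL patterns to servlet classes can be sketched with Python's standard XML parser. The `web.xml` content, the class name `com.example.CommentServlet`, and the function name are hypothetical, and the sample omits the XML namespace that real deployment descriptors usually declare (with a namespace, the tag names would need the namespace prefix).

```python
import xml.etree.ElementTree as ET

# Hypothetical deployment descriptor, without an XML namespace for brevity.
WEB_XML = """<web-app>
  <servlet>
    <servlet-name>comment</servlet-name>
    <servlet-class>com.example.CommentServlet</servlet-class>
  </servlet>
  <servlet-mapping>
    <servlet-name>comment</servlet-name>
    <url-pattern>/comment</url-pattern>
  </servlet-mapping>
</web-app>"""


def map_urls_to_classes(xml_text):
    """Join <servlet> and <servlet-mapping> entries on the servlet name."""
    root = ET.fromstring(xml_text)
    classes = {}
    for servlet in root.iter("servlet"):
        name = servlet.findtext("servlet-name").strip()
        classes[name] = servlet.findtext("servlet-class").strip()
    urls = {}
    for mapping in root.iter("servlet-mapping"):
        name = mapping.findtext("servlet-name").strip()
        for pattern in mapping.findall("url-pattern"):
            urls[pattern.text.strip()] = classes.get(name)
    return urls


url_map = map_urls_to_classes(WEB_XML)
```

The resulting dictionary gives, for each URL pattern, the class whose request-handling method then has to be inspected for parameters.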

For HTML code, we extract input points from <form> and <a> elements. The <form> tag is used to create a form for user input, and we take the value of the action attribute as the input point. The <input> tag is the most commonly used form element. The name, type, and default value of the <input> tag are extracted as parameters of the web request. The <a> tag defines a hyperlink, which is used to link from one page to another. We extract the href attribute of <a> acting as a kind of input point.
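A minimal sketch of this HTML step, using the standard library `html.parser` module, is shown below. The class name, the record layout, and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser


class InputPointParser(HTMLParser):
    """Collect form actions with their <input> fields, and <a href> links."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self.links = []
        self._current = None  # the form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"action": attrs.get("action"), "inputs": []}
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            self._current["inputs"].append({"name": attrs.get("name"),
                                            "type": attrs.get("type"),
                                            "value": attrs.get("value")})
        elif tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


# Hypothetical page used only to exercise the parser.
SAMPLE_HTML = ('<form action="/search"><input name="q" type="text" value="hello"></form>'
               '<a href="/about">About</a>')

parser = InputPointParser()
parser.feed(SAMPLE_HTML)
```

After `feed`, `parser.forms` holds one entry per form with its action and input parameters, and `parser.links` holds the hyperlink targets.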

4.2. Payload Generation

This section introduces the payload generation process, divided into two steps. The first step is to divide the XSS vulnerability payload into four parts according to the function and collect elements for each part. The new raw payload could be generated by replacing different parts. The second step is to mutate the raw payload so it can bypass the validation mechanism of the target web application.

4.2.1. Generation

The payload of an XSS vulnerability is a structured string, so we cannot treat it like binary fuzzing [33]. Simply operating in binary encoding would break the syntax of the payload during crossover and mutation, making fuzzing pointless. Therefore, we divide the XSS vulnerability payload into four parts according to its functions: HTML tags, HTML attributes, HTML events, and JavaScript (JS) snippets. Through analyzing the public payloads, we collect representative elements for each part. These elements have all been presented in proofs of concept for real-world vulnerability, which makes the generated payloads effective. A collection of these elements is called a dictionary. Table 1 shows the collected dictionaries. There are 45 elements in the HTML tag dictionary, including most of the HTML tags. The HTML attribute dictionary has five elements, including some common attributes, such as src=x and href=x. These attributes combine with onError or onClick events. The HTML event dictionary has 23 elements, such as the onClick event, which can be added to each tag. There are eight elements in the JS snippet dictionary, including a popup window, console, and reminder box. The reminder message is set to webfuzzer-token, which is a random string with a length of 15. We judge whether the injection is successful by observing that the injected web page contains the webfuzzer-token.

A payload consists of the above four parts. The process of generating a payload is the process of selecting elements from these four dictionaries. To clearly describe the process of element selection in the fuzz loop, we number all elements in the four dictionaries in sequence. The numbers 1–45 represent elements in the HTML tag dictionary, 46–50 represent elements in the HTML attribute dictionary, 51–73 represent elements in the HTML event dictionary, and 74–81 represent elements in the JS snippet dictionary. For example, if the current payload is <b href="javascript:" onClick=confirm('webfuzzer-token')></b>, using rule 10 (that is, the 10th element of the HTML tag dictionary) for it, the newly generated payload is <frame href="javascript:" onClick=confirm('webfuzzer-token')></frame>. We continue to use rule 74 (that is, the first element of the JS snippet dictionary), and update the payload to <frame href="javascript:" onClick=alert('webfuzzer-token')></frame>. Since the generated payload is not sent to the web application immediately, it is called a raw payload.
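The rule numbering can be sketched as follows. The dictionary sizes and the two rules used in the example (rule 10 and rule 74) come from the text; everything else, including the `RawPayload` class, the function names, and the fact that only two dictionary entries are filled in, is an assumption made to keep the sketch short.

```python
from dataclasses import dataclass, replace

# Dictionary sizes in rule order, as stated in the text.
PART_SIZES = [("tag", 45), ("attribute", 5), ("event", 23), ("js", 8)]

# Only the entries named in the text are filled in; the rest of Table 1 is omitted.
DICTIONARIES = {
    "tag": {10: "frame"},
    "attribute": {},
    "event": {},
    "js": {1: "alert('webfuzzer-token')"},
}


@dataclass(frozen=True)
class RawPayload:
    tag: str
    attribute: str
    event: str
    js: str

    def render(self):
        return f"<{self.tag} {self.attribute} {self.event}={self.js}></{self.tag}>"


def locate_rule(rule):
    """Map a global rule number (1-81) to (part, 1-based index in its dictionary)."""
    if not 1 <= rule <= sum(size for _, size in PART_SIZES):
        raise ValueError("rule out of range")
    for part, size in PART_SIZES:
        if rule <= size:
            return part, rule
        rule -= size


def apply_rule(payload, rule):
    """Replace one part of the payload with the dictionary element the rule selects."""
    part, index = locate_rule(rule)
    return replace(payload, **{part: DICTIONARIES[part][index]})


start = RawPayload("b", 'href="javascript:"', "onClick", "confirm('webfuzzer-token')")
after_rule_10 = apply_rule(start, 10)
after_rule_74 = apply_rule(after_rule_10, 74)
```

Rendering `after_rule_74` reproduces the final payload of the example above. Keeping the four parts separate until rendering is what lets a replacement preserve the payload syntax.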

4.2.2. Mutation

Most web applications validate user input data to avoid injection attacks. To bypass the validation mechanism, raw payloads require further mutations. In this paper, mutation refers to the behavior of modifying the raw payload as a string. The semantics of the payload before and after the mutation are the same. As shown in Table 2, we propose 19 mutations. Mutations 1–4 modify the tags in the raw payload, which is to bypass the blacklist used by web applications to filter certain tags. Many web applications use this mechanism. Mutations 5–18 modify the JS snippet in the payload, which can hide the malicious intent of the payload. Generally speaking, the detection of complex JS scripts is provided by a web application firewall (WAF). Therefore, the proposed method also detects XSS vulnerabilities in real-world web applications protected by WAFs. Mutation 19 does not take any action. The raw payload is directly used for fuzzing, but we still call it a mutation for the consistency of the description.

To show the effects of these mutations intuitively, we take the raw payload <font onmousemove=alert('webfuzzer-token')></font> as an example. The last row of Table 2 shows the payload after different mutations. It should be noted that when the next fuzzing cycle starts, the new payload is generated based on the raw payload of the previous round, not the mutated payload.
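The relationship between raw and mutated payloads can be sketched as below. Table 2 is not reproduced here, so `mutate_tag_case` is a hypothetical stand-in for a tag mutation (it relies only on the fact that HTML tag names are case-insensitive) and is not claimed to be one of the 19 mutations. `mutate_identity` corresponds to mutation 19, which leaves the payload unchanged.

```python
import re


def mutate_tag_case(raw):
    """Hypothetical tag mutation: upper-case every tag name in the payload."""
    return re.sub(r"(</?)([a-zA-Z][a-zA-Z0-9]*)",
                  lambda m: m.group(1) + m.group(2).upper(), raw)


def mutate_identity(raw):
    """Mutation 19 in the text: send the raw payload as it is."""
    return raw


MUTATIONS = [mutate_tag_case, mutate_identity]


def mutated_candidates(raw):
    """Derive the strings that are actually sent; the raw payload itself is untouched."""
    return [mutate(raw) for mutate in MUTATIONS]


raw_payload = "<font onmousemove=alert('webfuzzer-token')></font>"
candidates = mutated_candidates(raw_payload)
```

Because the mutations return new strings, `raw_payload` is still available unchanged as the starting point of the next fuzzing cycle.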

4.3. Reinforcement Learning Model

This section introduces how to apply reinforcement learning to XSS vulnerability detection. We first present the idea of choosing a reinforcement learning model. Then, the setting of the state, action, and reward of the reinforcement learning model is introduced.

4.3.1. Model

There are two important objects in reinforcement learning: the agent and the environment. The agent continuously interacts with the environment by "perceiving the current state → executing the corresponding action → obtaining the corresponding reward → adjusting the current policy". The goal of reinforcement learning is to find the best policy $\pi$, so that the agent obtains the maximum reward by acting according to $\pi$. Reinforcement learning algorithms are divided into value-based models and probability-based models, represented by Q-learning [34] and Policy Gradient [35], respectively.

The key idea of Q-learning is to build a Q-table that records the expected reward $Q(s, a)$ the agent will obtain when choosing action $a$ in state $s$. In each interaction step, the agent chooses the action with the largest expected reward according to the Q-table. In Q-learning, the update formula of the $Q$ value is

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right) \tag{6}$$

where $s$ is the current state of the agent, $a'$ is an action available to the agent in the next state $s'$, $r$ is the reward the agent obtains for moving from state $s$ to state $s'$, $\alpha$ is the learning rate, and $\gamma$ is the discount factor of the reward value. $Q(s, a)$ is the current estimate of the $Q$ value, and the target $Q$ value is $r + \gamma \max_{a'} Q(s', a')$; the table is updated by learning from the error between these two values.
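A tabular version of this update can be sketched in a few lines. It uses the target $r + \gamma \max Q(s', a')$ described in the text. The state and action labels, the hyperparameter values, and the function name are assumptions for illustration.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """Move Q(state, action) toward reward + gamma * max over a' of Q(next_state, a')."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Q-table with default value 0 for unseen (state, action) pairs.
q = defaultdict(float)
q[("s1", "a0")] = 1.0
q[("s1", "a1")] = 2.0

new_value = q_learning_update(q, "s0", "a0", reward=1.0, next_state="s1",
                              actions=["a0", "a1"], alpha=0.5, gamma=0.9)
```

Here the target is $1 + 0.9 \times 2 = 2.8$, so with $\alpha = 0.5$ the entry for `("s0", "a0")` moves from 0 to 1.4.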

When the Q-table is too large, searching for and storing $Q$ values costs considerable time and space. To solve this problem, Mnih et al. [36] used a neural network to approximate the Q-table and proposed the deep Q-network (DQN), which avoids the dimension explosion caused by a complex environment. The DQN contains two neural networks with the same structure but different parameters: the prediction network and the target network. The output $Q(s, a; \theta^{pre})$ of the prediction network predicts the $Q$ value of each action in the current state. The output $Q(s, a; \theta^{tar})$ of the target network is used to calculate the target value $Q_{target}$:

$$Q_{target} = r + \gamma \max_{a'} Q(s', a'; \theta^{tar}) \tag{7}$$

where $\theta$ denotes the parameters of the neural network.

In the beginning, the prediction network and the target network have the same parameters. During training, the parameters of the prediction network are updated in real time through gradient descent, and they are copied to the target network every $c$ steps.

The training samples of the prediction network are obtained from the experience pool, which records the traces $\{s, a, r, s'\}$ generated by the interaction between the agent and the environment. During training, a small batch of data is randomly sampled from the experience pool to update the network parameters. This process is called experience replay. Experience replay eliminates the correlation of adjacent samples, improves the utilization of samples, and makes the neural network easier to converge.
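An experience pool of this kind is commonly implemented as a bounded buffer with uniform random sampling. The sketch below follows that common pattern; the class name, capacity, and batch size are assumptions and are not taken from the paper.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size experience pool holding (s, a, r, s') transitions."""

    def __init__(self, capacity):
        # Once the pool is full, the oldest transitions are discarded.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up adjacent transitions.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


pool = ReplayBuffer(capacity=3)
for step in range(5):
    pool.push(step, 0, 0.0, step + 1)
batch = pool.sample(2)
```

With a capacity of 3, only the three most recent transitions remain after five pushes, and each sampled batch is drawn from those.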

In the DQN, the maximization operation used to calculate $Q_{target}$ tends to overestimate the value of an action. The double DQN (DDQN) [37], which is based on double Q-learning, reduces this overestimation by changing how $Q_{target}$ is calculated. DDQN also has two Q-networks. It first uses the prediction network to select the action $a^{max}$ with the largest $Q$ value in the next state $s'$, and then obtains the value $Q(s', a^{max}; θ^{tar})$ of that action from the target network. This value is not necessarily equal to $\max_{a'} Q(s', a'; θ^{tar})$, so DDQN can avoid selecting overestimated actions, which brings the calculated $Q_{target}$ closer to the real value. The calculation process is as follows:

$$a^{max} = \underset{a'}{\arg\max}\, Q(s', a'; θ^{pre}) \tag{8}$$

$$Q_{target} = r + γ Q(s', a^{max}; θ^{tar}) \tag{9}$$
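The difference between Formula (7) and Formulas (8) and (9) is easiest to see on a small numeric example. In the sketch below, plain lists stand in for the outputs of the prediction and target networks in the next state $s'$; the function names and the sample values are hypothetical, and $γ = 0.95$ follows Table 3.

```python
GAMMA = 0.95  # discount factor (Table 3)


def dqn_target(r, q_next_tar, gamma=GAMMA):
    """Formula (7): take the maximum directly over the target network's outputs."""
    return r + gamma * max(q_next_tar)


def ddqn_target(r, q_next_pre, q_next_tar, gamma=GAMMA):
    """Formulas (8) and (9): select with the prediction network, evaluate with the target network."""
    a_max = max(range(len(q_next_pre)), key=lambda a: q_next_pre[a])
    return r + gamma * q_next_tar[a_max]


# Hypothetical Q values for three actions in the next state s'
q_pre = [1.0, 3.0, 2.0]  # prediction network prefers action 1
q_tar = [5.0, 1.0, 2.0]  # target network overestimates action 0

dqn = dqn_target(-1.0, q_tar)           # -1 + 0.95 * 5.0
ddqn = ddqn_target(-1.0, q_pre, q_tar)  # -1 + 0.95 * 1.0
```

Here the DQN target follows the largest target-network value, while the DDQN target uses the target network's value for the action chosen by the prediction network, so a single overestimated entry does not dominate the target.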

The above models are all value-based, and this type of model risks falling into a locally optimal solution. We therefore also choose the Policy Gradient, which is a classic probability-based reinforcement learning model. Its input is the state $s$, and its output is the probability distribution $P(s; θ)$ over the actions $A$, where $θ$ denotes the parameters of the policy neural network. For a trace $τ = \{s_1, a_1, r_1, s_2, \dots, s_t, a_t, r_t\}$, the probability $p(τ; θ)$ that $τ$ occurs is

$$p(τ; θ) = p(s_1) \prod_{t} p(a_t \mid s_t; θ)\, p(r_t, s_{t+1} \mid s_t, a_t) \tag{10}$$

After sampling the trace $τ$, the expected reward of the current policy $π(θ)$ is calculated. This is also the objective function $J(θ)$ that needs to be maximized, as shown in Formula (11).

$$J(θ) = E(R_{π(θ)}) = \sum_{τ} R(τ)\, p(τ; θ) = \sum_{τ} \sum_{t} r_t γ^{t-1} p(τ; θ) \tag{11}$$
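The inner sum of Formula (11), $R(τ) = \sum_t r_t γ^{t-1}$, is the discounted return of one trace. A minimal sketch follows; the function name and the sample rewards are hypothetical (the values −1 and +10 mirror the reward strategy in Section 4.3.4), and $γ = 0.95$ is taken from Table 3.

```python
GAMMA = 0.95  # discount factor (Table 3)


def discounted_return(rewards, gamma=GAMMA):
    """Return sum of r_t * gamma^(t-1) for t = 1..T."""
    # enumerate() is 0-based, so gamma ** i equals gamma^(t-1) for 1-based t
    return sum(r * gamma ** i for i, r in enumerate(rewards))


# Hypothetical trace: two failed generations, then a vulnerability is found
trace_rewards = [-1.0, -1.0, 10.0]
ret = discounted_return(trace_rewards)  # -1 - 0.95 + 10 * 0.9025
```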

We calculate the gradient of $J(θ)$ and update the network parameters $θ$ using gradient ascent to determine the optimal policy $π(θ)$. Since the neural network parameters are updated only at the end of each round, the Policy Gradient network learns more slowly than the DQN and DDQN.
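The gradient itself is not written out in the text. A standard way to obtain it, sketched here under the assumption that only the policy term $p(a_t \mid s_t; θ)$ in Formula (10) depends on $θ$, uses the identity $\nabla_θ p = p \nabla_θ \log p$. This is the usual REINFORCE-style estimator and is not necessarily the exact form used by the authors.

```latex
\begin{aligned}
\nabla_θ J(θ)
  &= \sum_{τ} R(τ)\, \nabla_θ p(τ; θ) \\
  &= \sum_{τ} R(τ)\, p(τ; θ)\, \nabla_θ \log p(τ; θ) \\
  &= E_{τ \sim p(τ; θ)}\!\left[ R(τ) \sum_{t} \nabla_θ \log p(a_t \mid s_t; θ) \right],
  \qquad R(τ) = \sum_{t} r_t γ^{t-1}, \\
θ &\leftarrow θ + α\, \nabla_θ J(θ).
\end{aligned}
```

The terms $p(s_1)$ and $p(r_t, s_{t+1} \mid s_t, a_t)$ drop out of $\nabla_θ \log p(τ; θ)$ because they do not depend on $θ$, which is why the expectation can be estimated from sampled traces without a model of the environment.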

This paper implements three models: DQN, DDQN, and Policy Gradient. To prevent the DQN and DDQN from falling into a local optimum, the ε-greedy strategy [38] is used for action selection. In the training phase, the agent selects a random action with probability ε (0 < ε < 1) and selects the optimal action with probability 1 − ε. The three models share the same neural network structure, a multi-layer perceptron. The input and output layers each have 81 nodes, and the two hidden layers have 32 and 16 nodes, respectively. The remaining parameter settings are shown in Table 3.
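The ε-greedy selection described above can be sketched as follows. The function name `epsilon_greedy` and the list-of-Q-values interface are assumptions; ε = 0.1 is the value listed in Table 3.

```python
import random

EPSILON = 0.1  # probability of choosing an action at random (Table 3)


def epsilon_greedy(q_values, epsilon=EPSILON, rng=random):
    """Return an action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


# Hypothetical usage with three actions and a seeded generator
rng = random.Random(0)
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)
random_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=1.0, rng=rng)
```

Setting `epsilon=0.0` always returns the action with the largest $Q$ value, and `epsilon=1.0` always explores; values in between trade off the two.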

4.3.2. State

In XSS vulnerability detection, the state is the XSS payload, represented as a sequence of four numbers, one for each part of the payload. The initial state is [4,3,3,3], representing <b href=“javascript:” onLoad=confirm(‘webfuzzer-token’)>webfuzzer</b>. In the input layer of the neural network, the state is converted into a sequence of 81 bits, each indicating whether the corresponding element is selected.

Figure 4 shows the input sequence converted from the state [4,3,3,3]. The first number, 4, selects the 4th of the 45 HTML tags, so the 4th bit is set to 1. The second number, 3, selects the 3rd of the 5 HTML attributes, so the 48th bit is set to 1 (the first 45 bits are the offset). The other two numbers follow the same rule.
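The conversion can be sketched as an offset-based one-hot encoding. The part sizes below are an assumption that combines the text (45 tags, 5 attributes) with the entry counts of the event and JS snippet dictionaries in Table 1 (23 and 8), which together give the 81 bits; the function name is hypothetical.

```python
# Assumed part sizes: HTML tags, HTML attributes, HTML events, JS snippets
PART_SIZES = (45, 5, 23, 8)  # 45 + 5 + 23 + 8 = 81


def encode_state(state, part_sizes=PART_SIZES):
    """Convert a four-number state into an 81-element 0/1 list."""
    bits = [0] * sum(part_sizes)
    offset = 0
    for choice, size in zip(state, part_sizes):
        if not 1 <= choice <= size:
            raise ValueError("choice %d out of range 1..%d" % (choice, size))
        bits[offset + choice - 1] = 1  # choices are 1-based in the text
        offset += size
    return bits


encoded = encode_state([4, 3, 3, 3])
# 1-based positions of the bits that are set
set_positions = [i + 1 for i, bit in enumerate(encoded) if bit]
```

Under these assumed sizes, the four set bits for [4,3,3,3] fall at positions 4, 48, 53, and 76, the first two of which match the positions stated above.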

4.3.3. Action

In the detection method proposed in this paper, an action is a generation rule for the XSS payload, and there are 81 rules in total. Details are in Section 4.2.

4.3.4. Reward

After each action, the environment sends the agent a reward value, which determines the goal in reinforcement learning problems. The reward strategy adopted in this paper is as follows:

(1): If an XSS vulnerability is found, the reward is +10 and the episode ends;
(2): If no XSS vulnerability is found, the reward is −1 and the episode continues;
(3): If a repeatedly generated XSS payload is found, the reward is −5 and the episode continues;
(4): When the 10th generation is reached, the episode ends and the agent returns to the initial state.

A repeatedly generated payload in the third rule is a state that has already appeared in the last ten rounds of generation.
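The four rules can be combined into a single step function, sketched below. The function name, its arguments, and the choice to apply the −5 penalty instead of (rather than in addition to) the −1 penalty are assumptions; the reward values and the ten-round limits come from the rules above.

```python
from collections import deque

MAX_GENERATIONS = 10  # rule (4): the episode ends at the 10th generation
HISTORY_ROUNDS = 10   # rule (3): repeats are checked against the last ten rounds


def step_reward(state, vulnerability_found, recent_states, generation):
    """Return (reward, episode_done) for one generated payload."""
    if vulnerability_found:
        return 10, True  # rule (1)
    # Assumption: a repeated payload gets -5 in place of the usual -1
    reward = -5 if tuple(state) in recent_states else -1  # rules (3) and (2)
    done = generation >= MAX_GENERATIONS  # rule (4)
    return reward, done


# Hypothetical usage: the initial state is already in the history
recent_states = deque([(4, 3, 3, 3)], maxlen=HISTORY_ROUNDS)
found = step_reward((1, 1, 1, 1), True, recent_states, generation=1)
repeated = step_reward((4, 3, 3, 3), False, recent_states, generation=2)
last_round = step_reward((5, 3, 3, 3), False, recent_states, generation=10)
```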

4.4. Observation

We use a headless browser to send requests with payloads and observe the status of the target web application. The headless browser saves the context state after login and automatically attaches the identity information to the test requests. We randomly select one parameter from the parameter list of the input point to carry the payload; this parameter is called the injection point, and the unselected parameters are called non-injection points. The generation rules of the parameters are shown in Table 4.

The headless browser monitors console and popup messages in real time. If a message contains the webfuzzer-token, an XSS vulnerability is found. After the request is sent, the headless browser takes the current page as the entry point and crawls all reachable pages to detect stored XSS vulnerabilities. If the webfuzzer-token is present on these pages, an XSS vulnerability is found.
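A simplified version of this observation step is sketched below with Playwright's synchronous Python API. It only loads a URL and listens for console and dialog messages; login handling, request construction, and the crawl for stored XSS are omitted. The function names, the timeout value, and the use of Chromium are assumptions, and `observe` needs the third-party `playwright` package and a browser installation to run.

```python
TOKEN = "webfuzzer-token"


def contains_token(messages, token=TOKEN):
    """Return True if any observed message contains the token."""
    return any(token in message for message in messages)


def observe(url, timeout_ms=10_000):
    """Load url in a headless browser and report whether the token was observed."""
    # Imported here so that contains_token() is usable without Playwright installed
    from playwright.sync_api import sync_playwright

    messages = []

    def on_dialog(dialog):
        messages.append(dialog.message)  # text of alert/confirm/prompt popups
        dialog.dismiss()                 # close the popup so the page can continue

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("console", lambda msg: messages.append(msg.text))
        page.on("dialog", on_dialog)
        page.goto(url, timeout=timeout_ms)
        browser.close()
    return contains_token(messages)
```

A page-load timeout raises an exception in `page.goto`; how such timeouts are treated matters, because a timed-out load cannot by itself show that a payload failed.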

5. Evaluation

In this section, we evaluate the effectiveness of the proposed method and compare it with four state-of-the-art web scanners. We pose the following research questions, which help us comprehensively evaluate the proposed method’s performance in detecting XSS vulnerabilities.

RQ1. Effectiveness: Is the proposed method able to detect XSS vulnerabilities?
RQ2. Comparison with other tools: Does the proposed method perform better than other tools? What is the reason?
RQ3. Resource consumption: Does the proposed method consume excessive system resources?

5.1. Experimental Setup

We implement static analysis using Oracle JDK 8 (build 1.8_301). In this process, we analyze the Java bytecode with ASM [39] to extract the annotations and parameters of the Java framework. ASM is a general-purpose Java bytecode manipulation and analysis framework that can be used to build custom complex transformations and code analysis tools. We implement payload generation and mutation in Python 3.7. The reinforcement learning models are all implemented in TensorFlow 2.8. We use Playwright [40], a web testing framework released by Microsoft that supports mainstream browsers such as Chrome, Firefox, and Safari, to observe the status of the web page after sending the payload and thus determine whether a vulnerability has been found. The experiments are performed on a machine with an Intel i7 CPU and 64 GB of RAM running Ubuntu 20.04 LTS. To reduce experimental bias, we use five-fold cross-validation to evaluate the performance of the reinforcement learning models in XSS vulnerability detection.

5.2. Result and Analysis

In this section, we answer the standard research questions proposed in Section 5. We present experimental results and discuss them in detail.

5.2.1. Answering RQ1: Effectiveness

We test the performance of the proposed method on two target web applications. The first is WebGoat [41], a cyber range developed by OWASP for web vulnerability detection. WebGoat provides 12 vulnerabilities, including reflected and stored XSS vulnerabilities. To make the evaluation more rigorous, we add five XSS vulnerabilities that use stricter filtering rules, including specific tag filtering, key function filtering, and JS function filtering, for a total of 17. To help other researchers reproduce the experiment, this customized WebGoat is publicly available on GitHub. The second target web application is Jeesns-1.4.2 [42], a social network services framework that provides functions such as forums, microblogs, and communities. We collect eight XSS vulnerabilities in Jeesns from the CVE database.

We apply the DQN, DDQN, and Policy Gradient reinforcement learning (RL) models to the target web applications. As a control, we also include random payload generation, which randomly selects the generation rules in each round of fuzzing. Since the performance of this method is not stable, we conduct five test rounds on each of the two target web applications and take the average number of vulnerabilities found as the experimental result. Table 5 shows the performance of these methods on the two target web applications. The time in the table is the time taken from finding the first XSS vulnerability to finding the last one. Fuzz testing terminates when all input points have been tested. In the WebGoat experiment, DDQN performed the best and found all the vulnerabilities. Random payload generation performed the worst: even the weakest reinforcement learning model in Table 5 (Policy Gradient, with 14 vulnerabilities) found nine more vulnerabilities than it. In the Jeesns experiment, reinforcement learning-based fuzzing still performed well, and both DDQN and Policy Gradient found all vulnerabilities. In general, the experimental results demonstrate that payload generation based on reinforcement learning is better than random payload generation and that the DDQN model performs better than the other two models in the XSS vulnerability detection scenario.

To further evaluate the performance of each model on XSS vulnerability detection, we directly fuzz known vulnerable input points and record the time spent. The time spent in each round of fuzzing consists of two parts: the execution time of the reinforcement learning model and the time the headless browser takes to scan the web page and determine whether the payload executed successfully. Table 6 and Table 7 show the results for the two target applications, respectively. The payloads generated by the DQN can trigger all vulnerabilities, which seems inconsistent with the results in Table 5. Through trial and error, we found that the DQN model is the fastest of the three, taking an average of only 15.2 s per input point test. However, when the fuzzing is too fast, the response time of the target application sometimes becomes longer. The headless browser then treats the page load as timed out and wrongly concludes that the payload did not trigger the vulnerability, which leads to false negatives. The Policy Gradient model behaves similarly at some input points. Therefore, we believe all three reinforcement learning models perform well in payload generation. This demonstrates that the accuracy of XSS vulnerability detection depends both on generating effective payloads and on correctly judging whether the vulnerability is triggered.
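The 15.2 s average can be checked against the per-input-point times in Table 6, assuming the figure refers to the 17 WebGoat input points. The sketch below copies the three time columns of Table 6; the variable names are hypothetical.

```python
from statistics import mean

# Times in seconds per vulnerable WebGoat input point, copied from Table 6
WEBGOAT_TIMES = {
    "DQN": [2.8, 3.2, 6.1, 1.5, 1.4, 2.7, 2.9, 2.9, 2.8, 3.0,
            112.7, 3.5, 24.9, 34.8, 32, 6.6, 14.8],
    "DDQN": [23.4, 5.2, 7.4, 3.8, 6.3, 3.0, 4.0, 5.1, 4.5, 6.0,
             117.0, 3.5, 17.4, 17.8, 37, 9.0, 12.6],
    "Policy Gradient": [41.9, 14.4, 29.6, 12.3, 9.5, 14.5, 20, 14.9, 39.9, 14.5,
                        31.5, 18.1, 7.5, 20.4, 16, 24.3, 11.3],
}

# Mean time per input point for each model, rounded to one decimal place
averages = {model: round(mean(times), 1) for model, times in WEBGOAT_TIMES.items()}
fastest = min(averages, key=averages.get)
```

The DQN column averages to 15.2 s, the smallest of the three means, which agrees with the statement above.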

5.2.2. Answering RQ2: Comparison with Other Tools

We select four state-of-the-art web scanners to compare with the proposed method, namely Burp Suite [43], Acunetix Web Vulnerability Scanner (AWVS) [44], XSStrike [45], and Yakit [46]. Burp Suite is an integrated platform for attacking web applications. It contains many tools and designs many interfaces to enhance the penetration testing process. We use the XSS Validate plugin in our experiments. AWVS can scan any website that follows HTTP/HTTPS rules and supports scanning of various web vulnerabilities. XSStrike is a black-box XSS vulnerability scanner that includes a powerful fuzzing engine and an extremely fast crawler. Yakit is an interactive application security testing platform that detects XSS vulnerabilities through fuzzing.

These four scanners follow different vulnerability detection approaches. Only AWVS detects XSS vulnerabilities fully automatically; the other tools require manual assistance, so we did not record the detection time of these tools. Table 8 shows the experimental results. In the WebGoat inspection, Burp Suite found the most vulnerabilities among the four scanners, 13, for a detection rate of 76.5%, far exceeding the other web scanners. However, the method proposed in this paper found all 17 vulnerabilities, a detection rate 23.5 percentage points higher than that of Burp Suite. In the Jeesns inspection, AWVS found all the vulnerabilities. Jeesns has more stored XSS vulnerabilities, and AWVS supports such vulnerabilities better than the other scanners. However, as a black-box scanner, AWVS can only rely on crawlers to identify input points, and its identification of input point parameters is inaccurate, which leads to false positives. In contrast, our tool uses static analysis to identify input points and their parameters and produces no false positives. Overall, the proposed method outperforms these four web scanners.
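The WebGoat detection rates follow directly from the vulnerability counts in Table 5 and Table 8 and the total of 17 vulnerabilities. A short check, with hypothetical variable names:

```python
TOTAL_WEBGOAT = 17  # 12 original vulnerabilities plus 5 added ones

# Vulnerability counts on WebGoat from Table 8, plus the DDQN result from Table 5
found = {
    "Burp Suite": 13,
    "AWVS": 7,
    "XSStrike": 5,
    "Yakit": 6,
    "Proposed method (DDQN)": 17,
}

# Detection rate in percent, rounded to one decimal place
rates = {name: round(100 * count / TOTAL_WEBGOAT, 1) for name, count in found.items()}
# Difference between the proposed method and Burp Suite, in percentage points
gap = round(rates["Proposed method (DDQN)"] - rates["Burp Suite"], 1)
```

The difference between 100.0% and 76.5% is 23.5 percentage points, which corresponds to four additional vulnerabilities out of 17.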

To further explain the experimental results, we analyze the vulnerability detection principles of these four web scanners.

Burp Suite uses a proxy to intercept the packets in the test, observes whether they contain controllable parameters, and then performs XSS vulnerability detection. Burp Suite cannot crawl web pages to find input points, which must be provided manually. When Burp Suite uses the XSS Validate plugin for XSS vulnerability detection, Phantom (a headless browser) detects the token in the payload to determine whether the event is triggered successfully. This plugin combines four JavaScript functions and three JavaScript events into statements and then embeds them in 32 templates to generate multiple payloads. Unlike the method proposed in this paper, Burp Suite does not mutate the payload and cannot bypass the filtering rules of the target web application.

AWVS is a black-box scanner. It first tries to traverse all the links on the web page to collect entry points, but the scan does not collect all the links in the subdirectories. In the experiment, we scanned the target application with AWVS many times and found that the number of input points scanned each time was different and incomplete. This means that AWVS performs worse at entry point recognition than the method proposed in this paper. AWVS can preset login information, so it can automate fuzz testing. The payload dictionary of AWVS is not public. By capturing and analyzing its packets, we found that its dictionary is larger than that of Burp Suite. The XSS payloads generated by AWVS use multiple methods to bypass the web application’s filtering mechanism. AWVS scans quickly, but on WebGoat it finds far fewer vulnerabilities than our method because its input points are incomplete.

XSStrike also uses a fuzz-based vulnerability detection method, but it cannot simulate login authentication, so it cannot automatically discover the input points of the target web application. XSStrike tests all parameters of an input point one by one. However, its built-in dictionary is small and suits only a few scenarios, and it cannot bypass common XSS filters. Therefore, XSStrike performed the worst in the experiment.

Yakit requires constant human interaction during testing. Like Burp Suite, Yakit uses a proxy to intercept packets and requires detection paths and parameters to be screened and specified manually, so it cannot collect input points. Yakit has few XSS payload templates, but they cover a comprehensive range of XSS scenarios, including in-tag text and attributes, JavaScript tags, and comments. Yakit detects vulnerabilities quickly and is a promising detection method. However, it requires human assistance, which limits its usage scenarios.

5.2.3. Answering RQ3: Resource Consumption

The system resource consumption of fuzz testing consists of two parts: the resources consumed by the target web application and the resources consumed by our proposed method. The resources consumed by the target web application depend on its developer and are generally considered constant. However, as shown in Section 5.2.1, providing more system resources to the target web application can reduce page loading time to a certain extent, thereby reducing the false negatives caused by page-load timeouts. The method proposed in this paper consumes few CPU resources but requires at least 64 GB of memory to run. We consider this an acceptable resource cost.

6. Conclusions

Web applications face many risks, and XSS vulnerabilities are among the most serious. Statistics from the CVE Program show that the number of XSS vulnerabilities disclosed has increased every year for the past three years, which poses enormous challenges for maintainers of web applications. Many studies have proposed methods for detecting XSS vulnerabilities, but they still have various shortcomings. In response, we propose a web application fuzz testing method based on reinforcement learning to detect reflected and stored XSS vulnerabilities. Compared with other tools, our method has two advantages. First, it analyzes the web application’s Java code, configuration files, and HTML files before fuzzing, so it can find more input points. Second, it uses reinforcement learning models to optimize payload generation and improve the detection speed. Experimental results demonstrate the effectiveness of the proposed method.

Our method also has a limitation. It only supports XSS vulnerability detection of Java web applications and cannot support applications developed in other languages. This is because each language has different syntax rules and framework design patterns. We cannot support the analysis of multiple languages using the same analysis technology. In our future work, we will try to transfer reinforcement learning-based fuzzing to web applications developed in other languages.

Author Contributions

Supervision, B.C.; Writing—review and editing, X.S., R.Z. and Q.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The customized WebGoat is available at https://github.com/Lavender93/WebGoat-7.1 (accessed on 13 January 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Gu, H.; Zhang, J.; Liu, T.; Hu, M.; Zhou, J.; Wei, T.; Chen, M. DIAVA: A traffic-based framework for detection of SQL injection attacks and vulnerability analysis of leaked data. IEEE Trans. Reliab. 2019, 69, 188–202.
2. Jaafar, G.A.; Abdullah, S.M.; Ismail, S. Review of recent detection methods for HTTP DDoS attack. J. Comput. Netw. Commun. 2019, 2019, 1693–1696.
3. Khodayari, S.; Pellegrino, G. JAW: Studying Client-Side CSRF with Hybrid Property Graphs and Declarative Traversals. In Proceedings of the 30th USENIX Security Symposium (USENIX Security 21), Online, 11–13 August 2021; pp. 2525–2542.
4. Steffens, M.; Rossow, C.; Johns, M.; Stock, B. Don’t Trust the Locals: Investigating the Prevalence of Persistent Client-Side Cross-Site Scripting in the Wild. In Proceedings of the 26th Annual Network and Distributed System Security Symposium, San Diego, CA, USA, 24–27 February 2019; pp. 1–15.
5. Cross-Site Scripting Prevention Cheat Sheet Series. Available online: https://cheatsheetseries.owasp.org/cheatsheets/Cross_Site_Scripting_Prevention_Cheat_Sheet.html (accessed on 30 November 2022).
6. CVE-CVE. Available online: https://cve.mitre.org/ (accessed on 30 November 2022).
7. OWASP Foundation, the Open Source Foundation for Application Security on the Main Website for The OWASP Foundation. OWASP Is a Nonprofit Foundation That Works to Improve the Security of Software. Available online: https://owasp.org/ (accessed on 21 November 2022).
8. Algaith, A.; Nunes, P.; Jose, F.; Gashi, I.; Vieira, M. Finding SQL injection and cross site scripting vulnerabilities with diverse static analysis tools. In Proceedings of the 14th European Dependable Computing Conference (EDCC), Iași, Romania, 10–14 September 2018; pp. 57–64.
9. Wang, R.; Xu, G.; Zeng, X.; Li, X.; Feng, Z. TT-XSS: A novel taint tracking based dynamic detection framework for DOM Cross-Site Scripting. J. Parallel Distrib. Comput. 2018, 118, 100–106.
10. Melicher, W.; Das, A.; Sharif, M.; Bauer, L.; Jia, L. Riding out domsday: Towards detecting and preventing dom cross-site scripting. In Proceedings of the 25th Annual Network and Distributed System Security Symposium (NDSS), San Diego, CA, USA, 18–21 February 2018; pp. 1–15.
11. Santos, J.F.; Rezk, T. An information flow monitor-inlining compiler for securing a core of javascript. In Proceedings of the IFIP International Information Security Conference, Marrakech, Morocco, 2–4 June 2014; pp. 278–292.
12. Maurel, H.; Vidal, S.; Rezk, T. Statically identifying XSS using deep learning. Sci. Comput. Program. 2022, 219, 1–20.
13. Erdődi, L.; Sommervoll, Å.Å.; Zennaro, F.M. Simulating SQL injection vulnerability exploitation using Q-learning reinforcement learning agents. J. Inf. Secur. Appl. 2021, 61, 1–10.
14. Liu, M.; Zhang, B.; Chen, W.; Zhang, X. A survey of exploitation and detection methods of XSS vulnerabilities. IEEE Access 2019, 7, 182004–182016.
15. Gupta, S.; Gupta, B. CSSXC: Context-sensitive sanitization framework for Web applications against XSS vulnerabilities in cloud environments. Procedia Comput. Sci. 2016, 85, 198–205.
16. Liu, M.; Wang, B. A web second-order vulnerabilities detection method. IEEE Access 2018, 6, 70983–70988.
17. Melicher, W.; Fung, C.; Bauer, L.; Jia, L. Towards a lightweight, hybrid approach for detecting dom XSS vulnerabilities with machine learning. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 2684–2695.
18. Choi, H.; Hong, S.; Cho, S.; Kim, Y.G. HXD: Hybrid XSS detection by using a headless browser. In Proceedings of the 4th International Conference on Computer Applications and Information Processing Technology (CAIPT), Kuta Bali, Indonesia, 8–10 August 2017; pp. 1–4.
19. Nguyen, T.T.; Reddi, V.J. Deep reinforcement learning for cyber security. IEEE Trans. Neural Netw. Learn. Syst. 2021, 1–17.
20. Evading Web Application Firewalls with Reinforcement Learning. Available online: https://openreview.net/forum?id=m5AntlhJ7Z5 (accessed on 1 January 2022).
21. Caturano, F.; Perrone, G.; Romano, S.P. Discovering reflected cross-site scripting vulnerabilities using a multiobjective reinforcement learning environment. Comput. Secur. 2021, 103, 1–16.
22. Fang, Y.; Huang, C.; Xu, Y.; Li, Y. RLXSS: Optimizing XSS detection model to defend against adversarial attacks based on reinforcement learning. Future Internet 2019, 11, 177.
23. Lee, S.; Wi, S.; Son, S. Link: Black-Box Detection of Cross-Site Scripting Vulnerabilities Using Reinforcement Learning. In Proceedings of the ACM Web Conference 2022, Lyon, France, 25–29 April 2022; pp. 743–754.
24. Gupta, S.; Gupta, B.B. Cross-Site Scripting (XSS) attacks and defense mechanisms: Classification and state-of-the-art. Int. J. Syst. Assur. Eng. Manag. 2017, 8, 512–530.
25. Rodríguez, G.E.; Torres, J.G.; Flores, P.; Benavides, D.E. Cross-site scripting (XSS) attacks and mitigation: A survey. Comput. Netw. 2020, 166, 106960–1069830.
26. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep Reinforcement Learning: A Brief Survey. IEEE Signal Process. Mag. 2017, 34, 26–38.
27. White, D.J. A survey of applications of Markov decision processes. J. Oper. Res. Soc. 1993, 44, 1073–1096.
28. Headless Browser-Wikipedia. Available online: https://en.wikipedia.org/wiki/Headless_browser (accessed on 30 November 2022).
29. Jakarta Servlet 5.0|The Eclipse Foundation. Available online: https://jakarta.ee/specifications/servlet/5.0/ (accessed on 30 November 2022).
30. Spring Framework. Available online: https://spring.io/projects/spring-framework (accessed on 30 November 2022).
31. Java Annotation-Wikipedia. Available online: https://en.wikipedia.org/wiki/Java_annotation (accessed on 30 November 2022).
32. The Deployment Descriptor: Web.xml|App Engine Standard Environment for Java 8|Google Cloud. Available online: https://cloud.google.com/appengine/docs/legacy/standard/java/config/webxml (accessed on 30 November 2022).
33. Zheng, Y.; Davanian, A.; Yin, H.; Song, C.; Zhu, H.; Sun, L. FIRM-AFL: High-Throughput Greybox Fuzzing of IoT Firmware via Augmented Process Emulation. In Proceedings of the 28th USENIX Security Symposium (USENIX Security 19), Santa Clara, CA, USA, 14–16 August 2019; pp. 1099–1114.
34. Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292.
35. Kakade, S.M. A natural policy gradient. Adv. Neural Inf. Process. Syst. 2001, 14, 1–8.
36. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602.
37. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 2094–2100.
38. Dann, C.; Mansour, Y.; Mohri, M.; Sekhari, A.; Sridharan, K. Guarantees for epsilon-greedy reinforcement learning with function approximation. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 4666–4689.
39. ASM. Available online: https://asm.ow2.io/index.html (accessed on 30 November 2022).
40. Fast and Reliable End-to-End Testing for Modern Web Apps | Playwright. Available online: https://playwright.dev/ (accessed on 30 November 2022).
41. OWASP WebGoat | OWASP Foundation. Available online: https://owasp.org/www-project-webgoat/ (accessed on 25 November 2022).
42. GitHub-Zchuanzhao/Jeesns. Available online: https://github.com/zchuanzhao/jeesns/ (accessed on 30 November 2022).
43. Burp Suite-Application Security Testing Software. Available online: https://portswigger.net/burp (accessed on 25 November 2022).
44. Acunetix | Web Application Security Scanner. Available online: https://www.acunetix.com/ (accessed on 25 November 2022).
45. GitHub-s0md3v/XSStrike: Most Advanced XSS Scanner. Available online: https://github.com/s0md3v/XSStrike (accessed on 30 November 2022).
46. GitHub-Yaklang/Yakit: Cyber Security ALL-IN-ONE Platform. Available online: https://github.com/yaklang/yakit (accessed on 30 November 2022).

Figure 1. Graphical representation of reinforcement learning.

Figure 2. Overview of grey-box fuzzing based on reinforcement learning.

Figure 3. An example of Spring framework annotations.

Figure 4. The state [4,3,3,3] is transformed into an acceptable input for reinforcement learning.

Table 1. The dictionaries of HTML tags, HTML attributes, HTML events, and JS snippets.

No	Part	Dictionary
1	HTML Tag	“a”, “area”, “audio”, “b”, “bgsound”, “body”, “br”, “button”, “form”, “frame”, “canvas”, “div”, “embed”, “frameset”, “h1”, “h2”, “h3”, “h4”, “h5”, “h6”, “iframe”, “img”, “input”, “link”, “menu”, “meta”, “object”, “ol”, “p”, “script”, “select”, “span”, “strong”, “style”, “table”, “tbody”, “td”, “textarea”, “tfoot”, “th”, “thead”, “title”, “tr”, “ul”, “video”
2	HTML Attribute	“src=x”, “href=x”, “href=“javascript:”, “src=“javascript:”
3	HTML Event	“onClick”, “onError”, “onLoad”, “onKeyDown”, “onKeyPress”, “onKeyUp”, “onContextMenu”, “onDoubleClick”, “onDrag”, “onDragEnd”, “onDragEnter”, “onDragExit”, “onDragLeave”, “onDragOver”, “onDragStart”, “onDrop”, “onMouseDown”, “onMouseEnter”, “onMouseLeave”, “onMouseMove”, “onMouseOut”, “onMouseOver”, “onMouseUp”
4	JS Snippet	“alert(‘webfuzzer-token’)”, “prompt(‘webfuzzer-token’)”, “confirm(‘webfuzzer-token’)”, “console.log(‘webfuzzer-token’)”, “alert(\”webfuzzer-token\”)”, “prompt(\”webfuzzer-token\”)”, “confirm(\”webfuzzer-token\”)”, “console.log(\”webfuzzer-token\”)”

Table 2. Mutation types in fuzzing.

No	Type	Description	Mutated Payload
1	Angle Bracket Recoding	Use %3c and %3e instead of < and >.	%3Cfont onmousemove = alert(‘webfuzzer-token’)%3E%3C/font%3E
2	Random Case Conversion	Invert the case of the characters in the payload (no more than half the number of characters).	<fONt OnmOuSEmoVE = aLERt(‘WEbfuZzEr-tokEn’)></fONt>
3	Space Insertion	Randomly insert spaces in the payload.	<font onmousemove = alert(‘webfuzzer-token’)></font >
4	Keyword Redundancy	Duplicate part of keywords.	<f<font>ont onmousemove = alert(‘webfuzzer-token’)></font>
5	Coding Conversion (URL)	Encode the JS snippet in the payload in URL encoding.	<font onmousemove = eval(‘%61%6c%65%72%74%28%27%77%65%62%66%75%7a%7a%65%72%2d%74%6f%6b%65%6e%27%29’)></font>
6	Coding Conversion (Base64)	Encode the JS snippet in the payload in Base64.	<font onmousemove = eval(atob(‘YWxlcnQoJ3dlYmZ1enplci10b2tlbicp’))></font>
7	Coding Conversion (Hex)	Encode the JS snippet in the payload in hexadecimal.	<font onmousemove = Set.constructor‘\x61\x6c\x65\x72\x74\x28\x27\x77\x65\x62\x66\x75\x7a\x7a\x65\x72\x2d\x74\x6f\x6b\x65\x6e\x27\x29‘‘‘></font>
8	Coding Conversion (Unicode)	Encode the JS snippet in the payload in Unicode.	<font onmousemove = setTimeout‘\u{61}\u{6c}\u{65}\u{72}\u{74}\u{28}\u{27}\u{77}\u{65}\u{62}\u{66}\u{75}\u{7a}\u{7a}\u{65}\u{72}\u{2d}\u{74}\u{6f}\u{6b}\u{65}\u{6e}\u{27}\u{29}‘></font>
9	Coding Conversion (ASCII)	Encode the JS snippet in the payload in ASCII.	<font onmousemove = eval(String.fromCharCode(97,108,101,114,116,40,39,119,101,98,102,117,122,122,101,114,45,116,111,107,101,110,39,41))></font>
10	JS Function Replacement (top)	Use the top() function to rewrite the payload.	<font onmousemove = top[‘ale’+’rt’](‘webfuzzer-token’)></font>
11	JS Function Replacement (eval)	Use the eval() function to rewrite the payload.	<font onmousemove = top.eval(‘a’+’lert’)(‘webfuzzer-token’)></font>
12	JS Function Replacement (self)	Use the self() function to rewrite the payload.	<font onmousemove = self[‘al’+’ert’](‘webfuzzer-token’)></font>
13	JS Function Replacement (this)	Use the this() function to rewrite the payload.	<font onmousemove = this[‘a’+’lert’](‘webfuzzer-token’)></font>
14	JS Function Replacement (toString)	Use the toString() function to rewrite the payload.	<font onmousemove = top [8680439..toString(30)](‘webfuzzer-token’)></font>
15	JS Function Replacement (custom function)	Use the custom function to rewrite the payload. In this example, we define the a() function.	<font onmousemove = a(this);function a(){}(alert(‘webfuzzer-token’))></font>
16	Backticks Insertion	Use backticks to diversify function calls.	<font onmousemove = javascript:‘${alert(‘webfuzzer-token’)}‘></font>
17	Regex	Insert special symbols in the payload and remove them using regular expressions.	<font onmousemove = eval(“~a~l~e~r~t~(~’~w~e~b~f~u~z~z~e~r~-~t~o~k~e~n~’~)~”.replace(/~/g, ‘‘))></font>
18	Quotes Change	Change single quotes to double quotes.	<font onmousemove = alert(“webfuzzer-token”)></font>
19	No Mutation	-	<font onmousemove = alert(‘webfuzzer-token’)></font>
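A few of the mutation types in Table 2 can be sketched as small string transformations. The function names are hypothetical, only types 1, 6, 9, and 18 are shown, and the snippet uses plain ASCII quotes; this is an illustration of the idea, not the authors' mutation code.

```python
import base64

JS_SNIPPET = "alert('webfuzzer-token')"


def build_payload(js):
    """Wrap a JS snippet in the example tag and event used in Table 2."""
    return "<font onmousemove = {}></font>".format(js)


def recode_angle_brackets(payload):
    """Type 1: use %3C and %3E instead of < and >."""
    return payload.replace("<", "%3C").replace(">", "%3E")


def encode_base64(js):
    """Type 6: Base64-encode the snippet and decode it at run time with atob()."""
    encoded = base64.b64encode(js.encode("ascii")).decode("ascii")
    return "eval(atob('{}'))".format(encoded)


def encode_char_codes(js):
    """Type 9: rebuild the snippet from its ASCII codes."""
    codes = ",".join(str(ord(ch)) for ch in js)
    return "eval(String.fromCharCode({}))".format(codes)


def change_quotes(js):
    """Type 18: change single quotes to double quotes."""
    return js.replace("'", '"')


recoded = recode_angle_brackets(build_payload(JS_SNIPPET))
base64_payload = build_payload(encode_base64(JS_SNIPPET))
char_code_payload = build_payload(encode_char_codes(JS_SNIPPET))
```

The Base64 string produced for the example snippet is the one shown in row 6 of Table 2.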

Table 3. Reinforcement learning model parameters.

Symbol	Description	Value
$α$	Learning rate	0.01
$γ$	Discount factor	0.95
$ε$	The probability of choosing an action at random	0.1
$c$	Update interval rounds for the target network	10
BATCH_SIZE	Number of samples	32
REPLAY_SIZE	Experience pool size	1000

Table 4. Parameter generation rules.

Type	Injection Point	Non-Injection Point
Integer, Double, Float, Long, Short	random number + payload	random number
Boolean	50% probability true + payload, 50% probability false + payload	half probability of true and false
Byte	byte sequence of length 7 + payload	byte sequence of length 7
Character	alphabet or number + payload	alphabet or number
Other	string from static analysis or random string of length 7 + payload	string from static analysis or random string of length 7
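
The rules in Table 4 amount to "generate a type-appropriate value, then append the payload only at the injection point". A minimal sketch of that logic follows. Everything beyond the table is an assumption and is marked as such: the function names, the numeric range, treating a byte sequence as seven printable ASCII characters, and preferring a statically extracted string over a random one when both are possible.

```python
import random
import string

NUMERIC_TYPES = {"Integer", "Double", "Float", "Long", "Short"}
ALPHANUMERIC = string.ascii_letters + string.digits


def random_string(rng, length=7):
    """Random alphanumeric string; length 7 follows Table 4."""
    return "".join(rng.choice(ALPHANUMERIC) for _ in range(length))


def base_value(java_type, rng, static_strings=()):
    """Value for a non-injection point (right-hand column of Table 4)."""
    if java_type in NUMERIC_TYPES:
        return str(rng.randint(0, 9999))      # range is an assumption
    if java_type == "Boolean":
        return rng.choice(["true", "false"])  # 50% probability each
    if java_type == "Byte":
        return random_string(rng, 7)          # assumed printable ASCII bytes
    if java_type == "Character":
        return rng.choice(ALPHANUMERIC)       # a single letter or digit
    # Other: use a string found by static analysis if any, else a random one
    # (the preference order is an assumption).
    if static_strings:
        return rng.choice(list(static_strings))
    return random_string(rng, 7)


def generate_parameter(java_type, payload, is_injection_point, rng, static_strings=()):
    """Append the payload only when the parameter is the injection point."""
    value = base_value(java_type, rng, static_strings)
    return value + payload if is_injection_point else value
```

For example, `generate_parameter("Boolean", payload, True, random.Random())` returns either `true` or `false` followed by the payload, while the same call with `is_injection_point=False` returns the bare Boolean string.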

Table 5. Test results of the proposed method on the target web applications.

Web Application	RL Model	Vulnerability Count	Time	Detection Rate
WebGoat	random	5	-	29.4%
	DQN	15	2 h 46 m 20 s	88.2%
	DDQN	17	3 h 4 m 3 s	100.0%
	Policy Gradient	14	4 h 21 m 8 s	82.4%
Jeesns	random	3	-	37.5%
	DQN	5	4 h 31 m 14 s	62.5%
	DDQN	8	5 h 59 m 35 s	100.0%
	Policy Gradient	8	7 h 21 m 3 s	100.0%
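
The detection rates in Table 5 are the number of vulnerabilities found divided by the number of known vulnerable input points. The short check below reproduces them. The totals of 17 (WebGoat) and 8 (Jeesns) are inferred from the rows that reach 100.0%, and the names `TOTAL`, `FOUND`, and `detection_rate` are illustrative.

```python
# Totals inferred from the 100.0% rows of Table 5 (an assumption, not stated here).
TOTAL = {"WebGoat": 17, "Jeesns": 8}

# Vulnerability counts copied from Table 5.
FOUND = {
    ("WebGoat", "random"): 5,
    ("WebGoat", "DQN"): 15,
    ("WebGoat", "DDQN"): 17,
    ("WebGoat", "Policy Gradient"): 14,
    ("Jeesns", "random"): 3,
    ("Jeesns", "DQN"): 5,
    ("Jeesns", "DDQN"): 8,
    ("Jeesns", "Policy Gradient"): 8,
}


def detection_rate(found: int, total: int) -> str:
    """Detection rate as a percentage with one decimal place."""
    return f"{found / total:.1%}"


if __name__ == "__main__":
    for (app, model), found in FOUND.items():
        print(f"{app:8s} {model:16s} {detection_rate(found, TOTAL[app])}")
```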

Table 6. Test results for each vulnerable input point of WebGoat.

No	Input Point	DQN Time (s)	DDQN Time (s)	Policy Gradient Time (s)
1	/2022121558/400	2.8	23.4	41.9
2	/1382523204/900	3.2	5.2	14.4
3	/1406352188/900	6.1	7.4	29.6
4	/1750680855/400	1.5	3.8	12.3
5	/1572295549/1100	1.4	6.3	9.5
6	/538385464/1100	2.7	3.0	14.5
7	/980912706/1100	2.9	4.0	20
8	/1036971378/1200	2.9	5.1	14.9
9	/1786050421/1500	2.8	4.5	39.9
10	/1319172155/1900	3.0	6.0	14.5
11	/611366032/900/5	112.7	117.0	31.5
12	/1584137874/1700	3.5	3.5	18.1
13	/1001855130/900	24.9	17.4	7.5
14	/693820813/900	34.8	17.8	20.4
15	/478442269/900	32	37	16
16	/1642011371/900	6.6	9.0	24.3
17	/751836712/900	14.8	12.6	11.3

Table 7. Test results for each vulnerable input point of Jeesns.

No	Input Point	DQN Time (s)	DDQN Time (s)	Policy Gradient Time (s)
1	/article/add	252.6	165.2	108.5
2	/weibo/list	16.1	145.7	34.7
3	/question/ask	8.0	57.3	2.3
4	/article/detail/	22.2	146.4	32.2
5	/group/post/	8.5	63.774	97.7
6	/question/detail/	28.1	238.3	33.2
7	/group/topic/	22.6	145.6	33.28
8	/weibo/detail/	18.8	164.8	24.1

Table 8. The performance of scanners on WebGoat and Jeesns.

Web Application	Scanner	Vulnerability Count	Detection Rate
WebGoat	Burp Suite	13	76.5%
	AWVS	7	41.2%
	XSStrike	5	29.4%
	Yakit	6	35.3%
Jeesns	Burp Suite	3	37.5%
	AWVS	8	100.0%
	XSStrike	3	37.5%
	Yakit	4	50.0%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

