Article

XSS Attack Detection Based on Multisource Semantic Feature Fusion

by Ze Hu 1,*, Jianwei Zhang 2 and Hongyu Yang 1,2,*
1 School of Safety Science and Engineering, Civil Aviation University of China, Tianjin 300300, China
2 School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(6), 1174; https://doi.org/10.3390/electronics14061174
Submission received: 14 February 2025 / Revised: 12 March 2025 / Accepted: 15 March 2025 / Published: 17 March 2025
(This article belongs to the Section Networks)

Abstract

Cross-site scripting (XSS) attacks can be implemented through various attack vectors, and the diversity of these vectors significantly increases the overhead required for detection systems. The existing XSS detection methods face issues such as insufficient feature extraction capabilities for XSS attacks, inadequate multisource feature fusion processes, and high resource consumption levels for their detection models. To address these problems, we propose a novel XSS detection approach based on multisource semantic feature fusion. First, we design a normalized tokenization rule based on the structural features of XSS code and use a word embedding model to generate the original feature vectors of XSS. Second, we propose a local semantic feature extraction network based on depthwise separable convolution (DSC) that extracts XSS text and syntactic features using convolution kernels with different sizes. Then, we use a bidirectional long short-term memory (Bi-LSTM) network to extract the global semantic features of XSS. Finally, we introduce a multihead attention fusion network that employs a saliency score and a dynamic weight adjustment mechanism to identify the key parts of the input sequence and dynamically adjust the weight of each head. This enables the deep fusion of local and global XSS semantic features. Experimental results demonstrate that the proposed approach achieves an F1 score of 99.92%, outperforming the existing detection methods.

1. Introduction

The widespread adoption of 5G communication networks has significantly enhanced data transmission capabilities while also promoting the extensive use of web technologies. Web applications have become essential tools for enterprises and government organizations to streamline their operations, improve their quality of service, and enhance user experiences. However, as web applications have spread, security concerns have become increasingly prominent. Cross-site scripting (XSS) attacks are a common type of script injection attack in which attackers exploit vulnerabilities in web applications to trick users into visiting websites or pages containing malicious scripts. These scripts are then loaded and executed in the user's browser, where they can steal user information. Relevant research [1] indicates that XSS attacks are the predominant type of web attack, accounting for approximately 70% of web application attacks worldwide. According to the survey provided by the Open Web Application Security Project (OWASP) [2], XSS attacks have consistently ranked among the top web security threats, demonstrating greater deception and complexity compared to other web application attacks. In the 2021 edition of the OWASP Top 10, XSS was merged into the Injection category, which ranked third, and it continues to pose significant risks to user privacy and financial security. For example, XSS attacks can be used to forcibly download malware, such as trojans and viruses, or exploit browser vulnerabilities to gain control over user devices, leading to botnet formation or worm propagation. Additionally, XSS attacks often hijack user traffic through malicious pop-up advertisements or forced redirections to harmful websites. They can also be leveraged to launch distributed denial-of-service (DDoS) attacks against enterprises, causing server overload and service disruption [3]. Consequently, in-depth research on XSS attacks and effective mitigation strategies have become primary focuses in the field of web security.
Early researchers typically relied on known XSS attack patterns and features to establish blacklists, whitelists, and signature databases [4,5,6], using pattern matching techniques such as regular expressions [7,8,9] to check whether the input content contained potential XSS attack code. However, these methods are only applicable to specific scenarios and cannot comprehensively cover all possible XSS attacks. XSS attacks exploit the injection of malicious scripts to steal user data, hijack sessions, and perform other harmful actions. These attacks continuously evolve, with variants such as reflected, stored, and DOM-based XSS. Traditional detection methods relying on signatures and regular expressions struggle to effectively counter emerging attack patterns [10]. Modern web applications generate an increasing volume of traffic data, and XSS attack payloads are often obfuscated or encoded to evade detection. Traditional approaches depend on manually extracting features, which is inefficient and fails to capture deep semantic correlations. Machine learning-based methods, which rely on handcrafted XSS attack features, offer better generalization than traditional detection techniques. However, feature engineering in machine learning is complex, making it difficult to extract high-dimensional and context-aware information, and leading to a strong dependence on feature selection [11].
With advancements in deep learning, its advantages in XSS attack detection have become increasingly evident. Deep learning models can directly map raw input data to detection outcomes, reducing human intervention and improving detection efficiency. Additionally, deep learning can automatically learn the semantic and structural characteristics of XSS attack payloads without manual feature engineering, significantly reducing false positives and false negatives [12]. Traditional and machine learning-based methods often struggle to handle new XSS variants. In contrast, deep learning models, trained on large-scale datasets, enhance adaptability and are well-suited for complex XSS detection scenarios. The application of deep learning in XSS detection addresses key limitations of traditional methods, particularly in feature engineering, generalization, and analyzing complex attack patterns [13]. Deep learning has gradually replaced traditional methods because of its superior learning ability, diverse detection techniques, and high flexibility and scalability levels, making it the mainstream XSS attack detection technology.
Deep learning-based XSS detection methods include single-model optimization-based detection methods, which rely on a single deep learning model for performing XSS attack detection, and ensemble model-based detection methods, which involve stacking multiple models to detect XSS attacks. Initially, researchers employed traditional neural networks such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) for design and extension processes, as these networks can effectively extract local features and long-distance dependencies from XSS texts. Alqarni et al. [14] proposed a method based on a modular deep neural network (DNN) to reduce the false-positive rates induced during XSS attack detection tasks using Pearson correlation to determine the associations between continuous features and classes; this was followed by the use of modular neural networks (MNNs) for XSS script detection and classification. However, the modular design of their method reduces the fault tolerance of the constructed system, and synchronization between modules can lead to issues such as time delays and decreased efficiency, limiting the detection performance of the model. Yan et al. [15] introduced an improved ResNet module based on a CNN to extract XSS features from three different dimensions; however, owing to the limited receptive field of the CNN, the model struggled to analyze global dependencies in XSS code. Nilavarasan et al. [16] proposed a CNN-based model with multiple convolutional and pooling layers for XSS feature extraction and detection through hierarchical propagation, achieving high detection accuracy in experiments. However, traditional CNN layers still suffer from excessive parameters and slow computational efficiency in real-time XSS attack detection. This issue becomes more prominent when integrating CNNs with other deep learning models, necessitating a lightweight design. 
Depthwise separable convolutions, due to their unique computational approach, significantly reduce the number of parameters and improve efficiency, making them a suitable alternative to traditional CNNs. Wan et al. [17] employed a BERT-based word embedding representation, which was then fed into a BiLSTM model to capture long-range dependencies in XSS attack payloads. However, this approach lacks the ability to extract local features effectively. Additionally, a standalone BiLSTM model is vulnerable to adversarial attacks. Integrating multiple deep learning models can effectively mitigate these limitations, enhancing robustness in XSS detection.
Single-model optimization methods for XSS attack detection offer certain advantages in terms of parameter tuning and resource savings, but their generalizability, robustness, and detection accuracy still present significant limitations when these methods are applied to different types of XSS attacks. As a result, researchers have attempted to address these issues using ensemble model-based XSS attack detection methods that integrate the strengths of multiple models to handle the complex and dynamic nature of XSS attack scenarios. Liu et al. [18] approached the problem regarding the small sizes of XSS samples by preprocessing the input detection samples, converting them into graph structures, and then utilizing graph convolutional networks and residual networks to train a detection model. This method can effectively extract high-order features from graph structures and maintain high detection accuracy even in cases with limited training samples, but its graph structure construction process is ambiguous and overlooks the sequential relationships contained in statements. Guan et al. [19] combined CNN and Transformer models to extract both local and global XSS features, which were then concatenated and fused. A Bi-LSTM model with an embedded self-attention mechanism was used to refine and enhance the fused features. While this approach demonstrated strong detection performance, both the Transformer and Bi-LSTM fusion models incurred significant computational overhead. To achieve a balance between detection performance and computational efficiency, the feature extraction network and fusion model design require further optimization. Nige et al. [20] proposed a dual-channel feature extraction framework, where a DistilBERT model designed based on BERT was used alongside fully connected layers to extract and fuse raw attack features. 
However, due to the complexity and variability of XSS attacks, simply concatenating multiple features is insufficient for capturing the full spectrum of attack characteristics. A lightweight and efficient multifeature fusion network is needed to address this limitation. Sheneamer et al. [21] combined the advantages of multiple deep learning models by parallelizing five models, namely, Visual Geometry Group (VGG)16, VGG19, AlexNet, ResNet, and LSTM, to extract features and then integrate and predict these features. Although this method offers some advantages in terms of detection accuracy, the complexity of the constructed models and their high resource consumption levels make them difficult to apply in real-world scenarios.
Furthermore, deep learning models are vulnerable to adversarial and backdoor attacks. Attackers can repeatedly query APIs to analyze input–output relationships, reverse-engineer model parameters or functionality, and ultimately compromise data privacy [22]. To defend against backdoor attacks, researchers have employed anomaly detection techniques to identify and remove contaminated samples [23]. In the context of XSS attack detection, effective data sanitization can help mitigate such threats. Adversarial attacks can significantly disrupt learning models, particularly single deep learning models. To enhance robustness and prevent single points of failure, an ensemble learning approach that integrates multiple models can be adopted [24].
To address the limitations and challenges of the aforementioned methods, this paper proposes a novel XSS attack detection approach based on multisource semantic feature fusion, named the parallel convolutional recurrent attention network (PCRAN). The approach includes three main components: XSS attack code tokenization rules, a multisource semantic feature extraction model combining a dual-branch depthwise separable CNN (DSCNN) and bidirectional LSTM (Bi-LSTM), and a dynamic significance-aware multihead attention fusion network (DSA-MHAFN). The specific contributions of this study are as follows.
  • A novel tokenization rule tailored to the structural characteristics of XSS code is proposed. Standard tokenization methods such as those provided by the Natural Language Toolkit (NLTK) or SpaCy are generally used for natural language processing tasks, and they focus on vocabulary and sentence structures. In contrast, the tokenization rule developed in this work is specifically designed to address the structural features of XSS attacks, such as HyperText Markup Language (HTML) tags, JavaScript function calls, and URLs, enabling the accurate identification of complex URLs and script structures and effectively parsing special characters and XSS attacks in multilanguage environments.
  • A multisource semantic feature extraction model combining a DSCNN and Bi-LSTM is proposed. The DSCNN adopts a dual-branch structure to extract the textual and syntactic features of XSS code, which are then fused to generate local semantic features. The use of depthwise separable convolutions (DSCs) and residual connections reduces the computational complexity of the constructed model and improves its detection accuracy. The Bi-LSTM model uses bidirectional LSTM layers to better understand the context of input sequences, effectively capturing global semantic information. The local semantic features derived from the DSCNN and the global semantic features acquired from Bi-LSTM are then merged.
  • A DSA-MHAFN is proposed. First, a saliency scorer is introduced to adjust the merged input features, increasing the sensitivity of the model to important features. The adjusted input is then passed through linear transformation and scaled dot product attention mechanisms to calculate multihead attention weights. Finally, a dynamic weight adjuster is constructed to dynamically adjust the importance of each attention head, enhancing the flexibility and expressiveness of the model.

2. Materials and Methods

2.1. Overall Framework

The framework of the proposed XSS attack detection approach, which is based on multisource semantic feature fusion, is shown in Figure 1. This approach consists of three main components: tokenization rule design, sample semantic feature extraction, and XSS feature fusion and detection. The key operations of each component are described as follows, with innovative designs and original modules highlighted in red lines and red blocks in Figure 1.
  • Tokenization Rule Design: First, a dataset covering different types of XSS attacks is collected from the web to ensure a comprehensive representation of various XSS categories. Then, the obfuscated XSS attack code is decoded and normalized to restore the malicious code. Next, tokenization rules are designed based on the structural characteristics of the code to normalize the code and construct a word list using word statistics methods. Finally, the Word2Vec model is used to train word vectors and generate spatial word embeddings.
  • Sample Semantic Feature Extraction: This phase consists of two parts: the DSCNN and the Bi-LSTM network. The DSCNN is a dual-branch network structure composed of multiple DSC layers [25]. Each branch extracts textual and syntactic features from the XSS code via convolution kernels of various sizes. These features are then fused and globally pooled to generate the final local semantic features. In the Bi-LSTM model, the input word vector sequence passes through Bi-LSTM layers, where the bidirectional encoder applies the input to both the forward and backward LSTM layers at each time step, ultimately producing the complete global semantic features.
  • XSS Feature Fusion and Detection: First, the DSA-MHAFN fusion network adjusts the weights of the input data using a saliency scorer, integrating the local and global semantic features extracted earlier. This process generates query (Q), key (K), and value (V) vectors through linear transformations. Then, the Q, K, and V vectors are split into multiple heads, and scaled dot product attention and head weights are calculated for each head. Next, a dynamic weight adjuster is applied to adjust the attention output, generating fused features. Finally, the fused features are fed into the XSS detection classifier for XSS classification.
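The decoding and normalization step of the first component can be illustrated with a short sketch. This is not the authors' pipeline: the function name `normalize_payload`, the choice of decoders (URL percent-decoding and HTML entity unescaping from the Python standard library), the lowercasing step, and the round limit are all assumptions made for illustration.

```python
import html
from urllib.parse import unquote


def normalize_payload(payload: str, max_rounds: int = 5) -> str:
    """Repeatedly undo URL and HTML-entity encoding, then lowercase.

    Hypothetical helper: the paper does not list its exact decoders.
    """
    text = payload
    for _ in range(max_rounds):
        decoded = html.unescape(unquote(text))
        if decoded == text:  # nothing left to decode
            break
        text = decoded
    # Collapse whitespace runs and lowercase the result (assumed normalization)
    return " ".join(text.split()).lower()


sample = "%3Cscript%3Ealert(&#39;XSS&#39;)%3C/script%3E"
print(normalize_payload(sample))
```

Decoding in a loop is one way to handle payloads that are encoded more than once; a production system would likely need further decoders (for example, Unicode escapes or Base64), which are omitted here.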

2.2. Tokenization Rule Design

The XSS attack samples used in this study were obtained from publicly available online platforms, including the XSSed website [26], the PortSwigger vulnerability platform [27], and the XSS-Payload-List open directory [28]. The XSS attack code in these samples contains specific patterns and keywords, such as HTML tags, event handlers, and JavaScript code snippets. Tokenization, a critical step in natural language processing, breaks text down into words or subwords, yielding finer-grained feature representations. In the context of XSS attack detection, tokenization helps extract the key features of XSS attacks from the input dataset, which improves how well the model captures the semantics and structure of XSS attack texts. Moreover, tokenization rules can be customized according to specific requirements to improve the ability to detect particular attack patterns.
To facilitate the extraction of XSS attack features, a set of normalized tokenization rules tailored to the structural characteristics of XSS attacks is designed, as shown in Table 1. Tokenization rules in natural language processing scenarios typically perform lexical and syntactic analyses of text, which makes it difficult to identify the specific patterns of XSS attacks. The normalized tokenization rules in Table 1 focus on recognizing potential content such as script tags, HTML tags, URLs, event handlers, and attributes in XSS attack code. These rules encompass techniques such as character escape, conditional statement manipulation, and inline script injection and implement fine-grained matching methods such as nongreedy matching and multiline matching. These methods enable dynamic content recognition and inline event handling, thus effectively identifying various forms of XSS attacks.
An example of the XSS code produced after the original payload is normalized through tokenization is shown in Table 2. In the XSS attack detection task, the tokenized data must be vectorized, so the quality of the tokenized data significantly affects the quality of the resulting word vectors. As shown in Table 2, the normalized tokenization rules break an XSS payload down into multiple independent elements, including tags, attributes, event handlers, function calls, and strings, which makes the contextual relationships among the elements clearer. Built on these independent tokens, the word vectors can better capture the local and contextual semantic information of the given XSS payload, providing rich feature representations for the detection model and thereby improving its accuracy and generalizability.
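A structure-aware tokenizer of the kind described above might look like the following sketch. The regular expressions are illustrative assumptions and are not the rules listed in Table 1; `TOKEN_PATTERN` and `tokenize` are hypothetical names.

```python
import re

# Illustrative patterns only; these are assumptions, not the rules in Table 1.
TOKEN_PATTERN = re.compile(
    r"""
    https?://[^\s"'<>]+        # URLs
    | </?[a-zA-Z][\w-]*        # opening or closing tag names
    | \bon[a-z]+(?==)          # event handlers directly followed by an equals sign
    | [a-zA-Z_][\w.]*(?=\()    # function calls such as alert or document.write
    | "[^"]*" | '[^']*'        # quoted strings
    | [\w-]+                   # attribute names and bare words
    | [^\s\w]                  # any remaining single symbol
    """,
    re.VERBOSE,
)


def tokenize(payload: str) -> list[str]:
    """Split a payload into tags, handlers, calls, strings, and symbols."""
    return TOKEN_PATTERN.findall(payload)


tokens = tokenize("<img src=x onerror=alert('XSS')>")
print(tokens)
```

The alternatives are ordered from most to least specific so that, for instance, `onerror` is matched as an event handler before the generic word rule can claim it. The resulting token lists could then be fed to a word embedding model such as Word2Vec.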

2.3. Sample Semantic Feature Extraction

2.3.1. Local Semantic Feature Extraction Using the DSCNN

  1. DSC Networks
As the volumes of user-requested data and XSS attack traffic continue to grow, researchers [16,29,30] often employ deep learning models to automatically extract high-level features. CNNs have demonstrated significant advantages in capturing local text features in natural language processing tasks. However, traditional CNNs incur high computational costs and resource consumption when handling high-dimensional data. DSC improves computational efficiency by reducing the complexity of the convolution operations; because it requires fewer parameters, it also lowers model storage requirements and the risk of overfitting. DSC handles large-scale data effectively, enhancing both the speed and accuracy of XSS detection models.
DSC is an efficient convolution operation method with excellent adaptability and lightweight characteristics [31]. DSC decomposes the standard convolution process into two steps: depthwise convolution and pointwise convolution.
First, DSC performs depthwise convolution, which reduces the number of parameters and computational complexity by applying convolution operations to each input channel separately. The calculation process of depthwise convolution is as follows:
$$Y_d(i, j, c) = \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} X(i+m, j+n, c) \cdot W_d(m, n, c)$$
where Yd represents the output tensor of the depthwise convolution, i and j are the spatial indices (along the height and width, respectively) of the output tensor, c denotes the index of the input and output channels, X is the input tensor, m and n are the spatial indices (along the height and width, respectively) of the convolution kernel, K is the size of the convolution kernel, and Wd denotes the depthwise convolution kernel.
Next, DSC performs pointwise convolution, which uses a 1 × 1 convolution kernel to form a linear combination of the depthwise convolution output across channels. The calculation process for pointwise convolution is as follows:
$$Y(i, j, k) = \sum_{c=0}^{C_{in}-1} Y_d(i, j, c) \cdot W_p(1, 1, c, k)$$
where Y is the output tensor of the pointwise convolution process; i and j represent the spatial dimensions (height and width, respectively) of the output tensor; k and c are the indices of the output and input channels, respectively; Yd is the tensor output from the depthwise convolution; Wp is the pointwise convolution kernel; and Cin is the number of input channels.
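The two steps can be made concrete with a small pure-Python sketch that follows the depthwise and pointwise formulas directly. It assumes no padding and a stride of one, stores the 1 × 1 pointwise kernel as a plain `C_in × C_out` table, and ignores bias terms; the function names are hypothetical. The `param_counts` helper compares the weight count of a standard convolution with that of the separable version.

```python
def depthwise_conv(X, Wd):
    """Y_d(i,j,c) = sum_m sum_n X(i+m, j+n, c) * W_d(m, n, c), no padding."""
    H, W, C = len(X), len(X[0]), len(X[0][0])
    K = len(Wd)
    return [
        [
            [
                sum(X[i + m][j + n][c] * Wd[m][n][c]
                    for m in range(K) for n in range(K))
                for c in range(C)
            ]
            for j in range(W - K + 1)
        ]
        for i in range(H - K + 1)
    ]


def pointwise_conv(Yd, Wp):
    """Y(i,j,k) = sum_c Y_d(i,j,c) * W_p(c,k); the 1x1 kernel is stored as Wp[c][k]."""
    C_in, C_out = len(Wp), len(Wp[0])
    return [
        [
            [sum(px[c] * Wp[c][k] for c in range(C_in)) for k in range(C_out)]
            for px in row
        ]
        for row in Yd
    ]


def param_counts(K, C_in, C_out):
    """Weights (biases ignored) of a standard vs. a depthwise separable conv."""
    standard = K * K * C_in * C_out
    separable = K * K * C_in + C_in * C_out
    return standard, separable


# Toy input: 3x3 spatial grid with 2 channels, all ones
X = [[[1.0, 1.0] for _ in range(3)] for _ in range(3)]
Wd = [[[1.0, 2.0] for _ in range(2)] for _ in range(2)]  # K = 2, one kernel per channel
Wp = [[1.0], [0.5]]                                      # 2 input channels -> 1 output channel
Y = pointwise_conv(depthwise_conv(X, Wd), Wp)
print(Y)
print(param_counts(3, 128, 128))
```

For a 3 × 3 kernel with 128 input and 128 output channels, the standard convolution needs 147,456 weights while the separable version needs 17,536, which illustrates where the parameter savings come from.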
  2. Structural Design of the Local Semantic Feature Extraction Model
A DSCNN is designed in this paper to extract two types of local semantic features, i.e., textual and syntactic features, from XSS code. The structure of the local semantic feature extraction model based on the DSCNN is shown in Figure 2.
Word embedding converts each word in the input text into a dense vector with fixed dimensions, which effectively captures the similarity and semantic relationships between words. In text classification and feature extraction tasks, convolution kernels with different sizes capture sample features at various scales. Smaller convolution kernels have smaller receptive fields and are capable of capturing fine-grained local features, whereas larger convolution kernels have larger receptive fields and can capture more complex spatial relationships with fewer layers. On the basis of an experimental analysis, the DSCNN is designed with a dual-branch feature extraction process using convolution kernel sizes of two and eight to extract textual and syntactic features from XSS code, respectively. The algorithm for constructing the DSCNN is shown in Algorithm 1.
Algorithm 1 DSCNN construction algorithm
Input: Original feature matrix of a spatial word vector I
Output: Local XSS semantic feature representation vector L
I_1, I_2 = Embedding(I) // Apply the embedding layer to the input to generate the two branch inputs I_1 and I_2
x = [2, 8] // Define the array of kernel sizes
for i in range(len(x)) do
     A_i = DSConv1D(x[i], 128)(I_{i+1}) // Apply a DSC layer
     B_i = MaxPool(A_i) // Apply a maximum pooling layer
     C_i = DSConv1D(x[i], 128)(B_i)
     D_i = DSConv1D(x[i], 128)(C_i)
     E_{i+1} = GlobalAveragePool(B_i + D_i) // Add the residual connection, then apply a global average pooling layer
end for
E = Concat(E_1, E_2) // Concatenate the two-branch features
L = Dense(E) // Apply a fully connected layer

2.3.2. Global Semantic Feature Extraction Based on Bi-LSTM

A CNN uses convolution kernels that slide over the input data, performing operations on local regions to extract data features. However, the receptive fields of convolution operations are limited, as they can focus only on features within a local range at each step. Although a receptive field can be expanded by stacking multiple convolutional and pooling layers, the ability to capture long-range semantic information remains limited. Unlike traditional neural networks, RNNs are composed of input, hidden, and output layers. They utilize recurrent structures and hidden states to retain information from previous time steps, allowing them to process input sequences of arbitrary length while maintaining the temporal order of the input. However, when RNNs use the backpropagation algorithm to process long sequences, they may face issues such as long-term dependencies and gradient explosion, making it difficult to preserve effective information flows. LSTM mitigates long-term dependencies and gradient explosion problems to some extent by introducing specialized gating mechanisms and memory units to control the flow of information. However, LSTM can only process sequence data in one direction and relies solely on historical information to produce predictions at the current time step. These limitations make it challenging to capture all the dependencies within the input data for certain tasks.
Bi-LSTM [32] addresses the issue of missing subsequent information in LSTM by simultaneously processing both forward and backward information. Bi-LSTM achieves higher computational efficiency while maintaining its performance when processing longer input sequences, offering a significant advantage in terms of extracting the forward and backward dependencies of XSS payload sequences. Therefore, a Bi-LSTM network is adopted in this paper to extract the global semantic features of XSS code, and the detailed structure of this network is shown in Figure 3.
LSTM units are the basic building blocks of Bi-LSTM. Each LSTM unit contains a memory cell and three gating mechanisms (an input gate, a forgetting gate, and an output gate) that control the flow of information and the updates of states. The structure of the LSTM network is shown in Figure 3, and its operational process is as follows.
First, the current input Xt and the hidden state at the previous time step ht−1 are concatenated. These states are then passed through three activation functions (sigmoid) to obtain a forgetting gate function ft, which represents the degree of information to be forgotten; an input gate function it, which represents the importance of the input information; and an output gate function ot, which represents the amount of information that is output from the memory cell. The detailed computational process is as follows:
$$f_t = \sigma(W_f \cdot [h_{t-1}, X_t] + b_f)$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, X_t] + b_i)$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, X_t] + b_o)$$
where Wf, Wi, and Wo are weight matrices, and bf, bi, and bo are bias vectors.
Next, Xt and ht−1 are concatenated and passed through a tanh function to obtain a candidate memory cell state $\tilde{C}_t$, which represents the potential new information. Then, by combining the forgetting gate and the input gate, the previous memory cell state Ct−1 is updated to a new state Ct. The computational process for doing so is as follows:
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, X_t] + b_C)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
Finally, the updated memory cell state Ct is processed through a tanh function and combined with the output gate ot to obtain the current hidden state ht. The associated computational process is as follows:
$$h_t = o_t \odot \tanh(C_t)$$
Bi-LSTM processes the input sequence through two parallel LSTM layers. Let the input sequence be X = {X1, X2, X3, …, XT}, which is fed into both the forward and backward LSTM layers. The forward LSTM layer processes the input sequence in temporal order, with each time step t receiving the current input Xt and the hidden state from the previous time step $\overrightarrow{h}_{t-1}$, and computes the current hidden state $\overrightarrow{h}_t$ and memory cell state $\overrightarrow{C}_t$. The backward LSTM layer processes the input sequence in reverse temporal order, with each time step t receiving the current input Xt and the hidden state from the next time step $\overleftarrow{h}_{t+1}$, and computes the current hidden state $\overleftarrow{h}_t$ and memory cell state $\overleftarrow{C}_t$. Finally, the outputs obtained from the forward and backward LSTM layers are concatenated to form the final output $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$. This concatenation step ensures that the output produced at each time step contains both forward and backward contextual semantic information.
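The gate equations and the bidirectional pass can be traced in a compact pure-Python sketch. It is a toy illustration under stated assumptions: zero initial states, elementwise products in the cell update, and hand-set constant weights from the hypothetical helper `toy_params`. It is not the trained network used in the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W · v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following the gate equations above."""
    z = h_prev + x_t  # list concatenation: [h_{t-1}, X_t]
    f = [sigmoid(a) for a in affine(p["Wf"], z, p["bf"])]          # forgetting gate
    i = [sigmoid(a) for a in affine(p["Wi"], z, p["bi"])]          # input gate
    o = [sigmoid(a) for a in affine(p["Wo"], z, p["bo"])]          # output gate
    c_tilde = [math.tanh(a) for a in affine(p["WC"], z, p["bC"])]  # candidate state
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, c_tilde)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


def run_lstm(seq, p, hidden):
    """Run one direction over the sequence, starting from zero states."""
    h, c = [0.0] * hidden, [0.0] * hidden
    outputs = []
    for x_t in seq:
        h, c = lstm_step(x_t, h, c, p)
        outputs.append(h)
    return outputs


def bilstm(seq, p_fwd, p_bwd, hidden):
    fwd = run_lstm(seq, p_fwd, hidden)
    bwd = run_lstm(seq[::-1], p_bwd, hidden)[::-1]  # re-align to the original order
    return [h_f + h_b for h_f, h_b in zip(fwd, bwd)]  # h_t = [forward h_t ; backward h_t]


def toy_params(hidden, inputs, value):
    """Hypothetical constant weights, used only to make the sketch runnable."""
    W = [[value] * (hidden + inputs) for _ in range(hidden)]
    b = [0.0] * hidden
    return {"Wf": W, "Wi": W, "Wo": W, "WC": W, "bf": b, "bi": b, "bo": b, "bC": b}


seq = [[1.0], [0.0], [-1.0]]  # T = 3 time steps, 1 feature each
out = bilstm(seq, toy_params(2, 1, 0.1), toy_params(2, 1, -0.1), hidden=2)
print(len(out), len(out[0]))
```

Each output vector has twice the hidden size because the forward and backward hidden states are concatenated at every time step.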

2.4. XSS Feature Fusion and Detection

2.4.1. XSS Feature Fusion

XSS attack detection requires multiple dimensions of information, including input character sequence patterns, contextual relationships, and historical data, to be comprehensively considered. A single feature may not fully capture the complexity and diversity of XSS attacks. Therefore, by using the DSCNN and Bi-LSTM to extract the local and global semantic features of XSS attacks, respectively, and then performing deep fusion on the obtained multisource semantic features, the negative impacts of data sparsity and noise can be alleviated, resulting in a more comprehensive feature representation.
In XSS attack detection tasks, various methods can be employed to perform feature fusion. Early feature fusion methods achieved this goal by concatenating raw features acquired from different sources during the input phase or combining them in a weighted manner. Although this method is simple and direct, it is inadequate for modeling the complex relationships between features, which may lead to information confusion and feature redundancy. In recent years, researchers have proposed methods for separately extracting different features and performing high-level feature fusion to retain the independence of each feature. However, this method overlooks the interactions between features, and its fusion effect relies heavily on the performance of individual features. Additionally, some studies have proposed performing feature fusion in intermediate feature extraction layers, such as through multilayer perceptron or convolution layers, but this leads to higher model complexity and may introduce overfitting problems.
A multihead attention mechanism [34] has unique advantages in feature fusion tasks. It can simultaneously focus on different parts of the input features, capture dependencies across the entire data range, and learn different feature representations through multiple attention heads, which enhances the overall feature representation ability of the constructed network. Furthermore, multihead attention networks support parallel computation, which allows for efficient performance to be attained when processing large-scale data and high-dimensional features. This characteristic is especially crucial in real-time detection systems for protection against XSS attacks.
In traditional multihead self-attention mechanisms, the weight of each head is typically fixed. Therefore, in this study, a “saliency score” mechanism is introduced to assess the importance of each part of the input sequence for making the final classification decision, and the attention weights are adjusted accordingly. Additionally, a dynamic weight adjustment mechanism is proposed to adjust the weight of each head based on the characteristics of different input samples, allowing the network to more flexibly adapt to different XSS attack patterns. In this paper, we propose a DSA-MHAFN that adjusts and weights the input features through a saliency scorer, scaled dot product attention, and a dynamic weight adjuster. The attention distribution is dynamically adjusted based on the importance of the input data. The structure of the DSA-MHAFN is shown in Figure 4, and its specific process is as follows.
Step 1: The local semantic features extracted by the DSCNN and the global semantic features extracted by the Bi-LSTM network are concatenated to form the input features X. Saliency scores sf_scores are then calculated using the saliency scorer. The saliency scores are added to the last dimension of the input features and multiplied with the input features to obtain the adjusted input features Xa. The detailed computational process is as follows:
$$\mathit{sf\_scores} = \sigma(W_s X)$$
$$X_a = X \times \mathit{sf\_scores}$$
where σ is the sigmoid activation function, and Ws represents the weight matrix of the saliency scorer.
Step 2: The adjusted input features Xa are then passed through the query, key, and value weight matrices WQ, WK, and WV to compute Q, K, and V vectors, respectively. These vectors are split into h heads. An experimental analysis shows that setting h = 8 provides the best results. The detailed computational process is as follows:
$$Q_i = \mathrm{split}(X_a W_Q, h)$$
$$K_i = \mathrm{split}(X_a W_K, h)$$
$$V_i = \mathrm{split}(X_a W_V, h)$$
Step 3: Scaled dot product attention is computed. First, the matrix multiplication between the Q and K vectors is calculated and divided by the scaling factor. The scaled dot product result is subsequently passed through a softmax activation function to generate attention weights. Finally, the attention weights are multiplied by V to obtain the attention output matrix. The detailed computational process is as follows:
$$A(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}\, s}\right) V$$
$$hd_i = A(Q_i, K_i, V_i)$$
where dk represents the dimensionality of the key, and the scaling factor s is a trainable factor.
Step 4: The importance weight H of each head is calculated using the dynamic weight adjuster. These weights are applied to the attention outputs hdi of each head. The associated calculation process is as follows:
$$H = \mathrm{softmax}(W_h X_a)$$
$$Z_i = H_i \times hd_i$$
where Wh is the weight matrix.
Step 5: The attention outputs derived from all heads are concatenated and passed through a fully connected layer for linear transformation purposes. Finally, the transformed output O is normalized with the previously adjusted input features to obtain a complete fused feature Y. The detailed computational process is as follows:
$$O = \mathrm{concat}(Z_1, Z_2, \ldots, Z_h)\, W_o$$
$$Y = \mathrm{LayerNorm}(O + X_a)$$
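Steps 1 to 5 can be traced end to end with a small pure-Python sketch. It is illustrative only and is not the authors' TensorFlow implementation. Several details are assumptions made here because the text does not fix them: features are stored as rows, so products are written X W rather than W X; the saliency scorer W_s has shape d × 1 (one score per position); the head-weight matrix W_h has shape d × h with a softmax over heads at each position; the trainable factor s multiplies the usual √d_k in the denominator; and all weights are random placeholders rather than trained values. The feature width d must be divisible by h.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(v):
    """Numerically stable softmax over a list of numbers."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(v, eps=1e-6):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def init_params(d, h, seed=0):
    """Random placeholder weights (hypothetical shapes, not trained values)."""
    rng = random.Random(seed)
    shapes = {"Ws": (d, 1), "WQ": (d, d), "WK": (d, d), "WV": (d, d),
              "Wh": (d, h), "Wo": (d, d)}
    return {name: [[rng.uniform(-0.5, 0.5) for _ in range(c)] for _ in range(r)]
            for name, (r, c) in shapes.items()}


def dsa_mhafn(X, p, h, s=1.0):
    """Fuse a (T x d) feature sequence X following Steps 1-5."""
    d_k = len(X[0]) // h
    # Step 1: saliency gate, one sigmoid score per sequence position.
    gates = [1.0 / (1.0 + math.exp(-row[0])) for row in matmul(X, p["Ws"])]
    Xa = [[g * x for x in row] for row, g in zip(X, gates)]
    # Step 2: project to Q, K, V; heads are taken below as column slices.
    Q, K, V = (matmul(Xa, p[name]) for name in ("WQ", "WK", "WV"))
    # Step 4 (weights): softmax over the h heads at each position.
    H = [softmax(row) for row in matmul(Xa, p["Wh"])]
    Z = [[] for _ in X]
    for i in range(h):
        cols = slice(i * d_k, (i + 1) * d_k)
        Qi, Ki, Vi = ([row[cols] for row in M] for M in (Q, K, V))
        # Step 3: scaled dot-product attention with the scale factor s.
        logits = matmul(Qi, [list(c) for c in zip(*Ki)])
        A = [softmax([x / (math.sqrt(d_k) * s) for x in row]) for row in logits]
        hd = matmul(A, Vi)
        # Step 4 (apply): weight head i, then concatenate along the features.
        for t, row in enumerate(hd):
            Z[t].extend(H[t][i] * x for x in row)
    # Step 5: output projection, residual connection, layer normalization.
    O = matmul(Z, p["Wo"])
    return [layer_norm([o + x for o, x in zip(o_row, x_row)])
            for o_row, x_row in zip(O, Xa)]
```

Calling `dsa_mhafn(X, init_params(d=8, h=2), h=2)` on a 5 × 8 input returns a 5 × 8 fused representation, so the output keeps the shape of the adjusted input, which is what the residual term O + Xa requires.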

2.4.2. XSS Attack Detection

The DSA-MHAFN simultaneously captures features from different parts of the input sequence through mechanisms such as saliency scoring and dynamic head weight adjustment. It effectively filters out irrelevant noise and flexibly adapts to various XSS attacks. After integrating the local and global semantic features of XSS code, the DSA-MHAFN generates fused features, which are then classified as XSS or non-XSS by an XSS attack classification network. The detailed process is shown in Figure 5.
Step 1: The fused features are first further processed by a fully connected layer, which maps the high-dimensional features to a lower-dimensional space while applying the rectified linear unit (ReLU) activation function to perform a nonlinear transformation. The calculation process is as follows:
$$\mathrm{Dense}(x) = \mathrm{ReLU}(W x + b)$$
where W is the weight matrix, b is the bias vector, and x represents the input feature matrix.
Step 2: Then, a dropout layer is introduced. During training, a portion of the input is randomly set to zero, reducing the reliance of the model on specific neurons. This helps to enhance the generalization ability of the model and prevent overfitting.
Step 3: The final layer is a fully connected layer with a sigmoid activation function. It maps the output to the range [0, 1], generating the final prediction value. The output dimensionality is two, representing the two categories in the classification task: XSS and non-XSS. The calculation process is as follows:
$$\mathrm{Output}(x) = \mathrm{Sigmoid}(W x + b)$$
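The three classification steps can be sketched for a single fused feature vector as follows. This is a minimal illustration rather than the paper's Keras model: the weights, the layer sizes, and the dropout rate of 0.5 are hypothetical, and the dropout shown is the common "inverted" variant that rescales the kept units during training and is inactive at inference.

```python
import math
import random


def dense(x, W, b, activation):
    """Fully connected layer: activation(W x + b) for one feature vector x."""
    return [activation(sum(w * v for w, v in zip(row, x)) + bias)
            for row, bias in zip(W, b)]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dropout(x, rate, rng, training):
    """Inverted dropout: zero units with probability `rate` during training only."""
    if not training:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def classify(fused, params, rate=0.5, rng=None, training=False):
    """Steps 1-3: Dense + ReLU, dropout, then a two-unit sigmoid output layer."""
    hidden = dense(fused, params["W1"], params["b1"], relu)
    hidden = dropout(hidden, rate, rng or random.Random(0), training)
    return dense(hidden, params["W2"], params["b2"], sigmoid)
```

With `training=False` the dropout layer passes its input through unchanged, so `classify` returns two values in (0, 1), one per output unit.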

3. Experiments and Discussion

3.1. Configuration of the Experimental Environment and Scheme

The experimental computer configuration includes an Intel® Xeon® Silver 4110 processor, 32 GB of DDR4 memory, and an NVIDIA Quadro P2000 graphics card (with 5 GB of VRAM). The training, validation, and testing environments all run on the Windows 10 operating system, and the employed experimental IDE is PyCharm 2022.1.4. The detection approach proposed in this paper is implemented using Python 3.6, TensorFlow 1.15, and Keras 2.3.
XSS attackers inject malicious scripts into websites, exploiting security vulnerabilities to execute scripts in other users’ browsers, thereby stealing sensitive information or hijacking user sessions. These attacks can spread through URL parameters, input fields, or stored entries in databases. To mitigate such threats, XSS detection models are typically deployed on client-side Web Application Firewalls (WAFs) and web application servers to identify malicious scripts and generate alerts. The XSS detection approach proposed in this study collects sample data from both web clients and servers and deploys the model within a local client-side WAF for training and attack detection. The detailed process is illustrated in Figure 6.

3.2. Sample Preprocessing

To study and defend against XSS attacks, researchers often require relevant datasets for experimentation and analysis. However, these datasets may contain various types of noise and inconsistent data in their raw states. Therefore, preprocessing these samples is crucial, as outlined below.

3.2.1. Dataset Collection

XSS attacks can take various forms, including stored, reflected, and DOM-based attacks. The existing publicly available XSS datasets are relatively limited, with many failing to cover all types of XSS attacks and containing large amounts of invalid data, making it difficult to accurately reflect the latest XSS attack trends. Researchers typically collect such data manually from websites and vulnerability platforms. To ensure comprehensive coverage of different XSS attack types and maintain the relevance of XSS code, an extensive literature review and a data analysis are conducted in this study, with multiple websites, authoritative vulnerability platforms, and open-source communities selected to collect XSS data. A total of 14,753 malicious XSS scripts are collected from the xssed.com website, 8748 XSS attack codes are obtained from the PortSwigger vulnerability platform, and 10,156 XSS attack codes are acquired from the open-source Payloadbox community (2022). After removing sensitive and duplicate information, 33,049 malicious samples are retained. The normal data, totaling 33,713 records, are scraped from the open DMOZ directory. The dataset covers a wide range of attack samples with various types and encoding variations and includes real-world XSS samples of different attack intensities. This ensures a balance between the positive and negative samples and preserves the diversity of the attack data and payloads. The detailed statistical information concerning the utilized datasets is shown in Table 3.

3.2.2. Data Cleaning

XSS code often uses techniques such as mixed case and multiple encoding schemes to hide malicious code and bypass system detection. Early XSS detection systems relied on blacklist rules and regular expression matching techniques [4,7,10]. However, case-mixing techniques can evade strict string matching, allowing attack payloads to take varied forms during regular expression matching. As a result, detection systems must incorporate multilayer parsing and context-aware capabilities. Modern XSS detection has increasingly adopted machine learning and deep learning models [36,37,38]. However, mixed encoding and case mixing complicate the distribution of attack code, making it difficult for common word embedding models to process multi-encoded text and extract meaningful features. If a specific encoding format is not represented in the training set, the model may fail to effectively detect similar variant attacks. Thus, it is necessary to perform operations such as case conversion, multiple URL decoding, and HTML decoding on the raw data to eliminate tag and attribute confusion and restore the underlying malicious code. Converting the case of XSS payloads standardizes different variants, allowing them to be treated as identical, which helps reduce data redundancy and lower input feature dimensionality. Recursive URL decoding restores XSS payloads to their original form, facilitating the model’s ability to learn attack patterns. Additionally, HTML decoding prevents different HTML-encoded representations of the same XSS payload from being treated as distinct samples, thereby enhancing the consistency of feature representation.
Additionally, in XSS attack codes, attackers use specific numbers (e.g., counters and unique identifiers) to execute scripts and embed various forms of user redirection URLs, malicious resources, or other harmful operations, significantly increasing the complexity of the system detection process. Deep learning models typically rely on feature extraction techniques to analyze XSS code [15,18,21]. However, the dynamic components of attack code, such as timestamps and counters, can cause similar attack payloads to exhibit significant differences in feature space, reducing feature consistency and weakening the effectiveness of statistical methods. Therefore, during data preprocessing, numerical normalization is applied to ensure consistency across different samples [39]. All numbers in the XSS codes are replaced with 0, and the URLs are standardized to “http://u”, removing their specific numeric and URL content interference. Normalizing different numerical values in XSS payloads reduces the risk of model overfitting to specific numeric patterns. Replacing URLs eliminates URL variations, enhancing the model’s understanding of XSS structures. This approach reduces input data variability and improves training efficiency. Data cleaning examples are shown in Table 4.
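The cleaning operations described in this subsection can be sketched with the Python standard library. The function names, the regular expressions, and the order of operations are assumptions chosen for illustration. Decoding is applied before lowercasing here so that encoded uppercase characters are also normalized, and the loop is capped at a hypothetical ten rounds.

```python
import html
import re
import urllib.parse

# Hypothetical patterns: a URL runs until whitespace, a quote, or a bracket.
URL_RE = re.compile(r"https?://[^\s\"'<>)]+")
NUM_RE = re.compile(r"\d+")


def decode_layers(text, max_rounds=10):
    """Repeat URL decoding and HTML decoding until the text stops changing."""
    for _ in range(max_rounds):
        decoded = html.unescape(urllib.parse.unquote(text))
        if decoded == text:
            break
        text = decoded
    return text


def clean_sample(raw):
    """Decode, lowercase, then replace URLs with 'http://u' and numbers with 0."""
    text = decode_layers(raw).lower()
    text = URL_RE.sub("http://u", text)  # standardize every URL
    return NUM_RE.sub("0", text)         # replace every number with 0
```

For example, the double URL-encoded input `%253Cscript%253Ealert(123)%253C/script%253E` is restored and normalized to `<script>alert(0)</script>`.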

3.2.3. Tokenization and Table Construction

Data cleaning eliminates noise and redundant data, restoring the true intent of the original XSS code to the greatest extent possible. However, an XSS code still contains various attributes, such as HTML tags, JavaScript functions, and URLs, which hinder the feature extraction process. Thus, a set of standardized tokenization rules, as shown in Table 1, are designed in this study to break the input XSS code into individual elements, helping the model capture the contextual relationships of each token. After completing tokenization, the term frequency–inverse document frequency (TF–IDF) [40] algorithm is used to construct a word list. First, the frequencies of all words in the positive samples are counted, and common stop words (e.g., “the” and “is”) that do not contribute to the feature extraction procedure are removed. A TF–IDF value is subsequently calculated for each word, and the results are sorted in descending order. The higher a TF–IDF value is, the more relevant the corresponding word is to the XSS code. The top 3000 words are selected to construct a vocabulary. Finally, each tokenized sample is converted into a list of words from the vocabulary, with words not in the vocabulary marked as “UNK.”
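The tokenization and vocabulary steps can be sketched as follows. The regular expression is a hypothetical stand-in for the rules of Table 1, and the score is one common smoothed TF–IDF variant aggregated over the corpus; the exact formula used in the paper is not stated, so both are assumptions. Stop-word removal is omitted for brevity.

```python
import math
import re
from collections import Counter

# Hypothetical rule set: normalized URL, tag openers, words, single symbols.
TOKEN_RE = re.compile(r"http://u|</?\w+|\w+|[^\w\s]")


def tokenize(sample):
    """Split a cleaned sample into tokens."""
    return TOKEN_RE.findall(sample)


def build_vocab(token_lists, top_k=3000):
    """Rank tokens by a corpus-level TF-IDF score and keep the top_k."""
    n_docs = len(token_lists)
    tf = Counter(tok for toks in token_lists for tok in toks)
    df = Counter(tok for toks in token_lists for tok in set(toks))
    score = {tok: tf[tok] * (math.log((1 + n_docs) / (1 + df[tok])) + 1.0)
             for tok in tf}
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(score, key=lambda tok: (-score[tok], tok))
    return set(ranked[:top_k])


def encode(tokens, vocab):
    """Replace out-of-vocabulary tokens with the UNK marker."""
    return [tok if tok in vocab else "UNK" for tok in tokens]
```

A typical use is `vocab = build_vocab([tokenize(s) for s in cleaned_samples])` followed by `encode(tokenize(s), vocab)` for each sample, where `cleaned_samples` is a placeholder name for the output of the data cleaning stage.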

3.2.4. Word Vectorization

Neural networks cannot directly process string-based XSS code snippets. Thus, after performing data cleaning, tokenization, and vocabulary construction, the labeled word lists are converted into word vector representations. Word2Vec maps high-dimensional sparse features in XSS attack scripts to low-dimensional dense vectors, significantly reducing computational complexity. By training on neighboring word prediction, Word2Vec effectively captures the contextual semantic relationships within XSS attack payloads [41]. Compared to complex pre-trained models such as BERT [42,43], Word2Vec’s shallow neural network structure avoids excessive parameterization, reducing training costs. Additionally, its local context window modeling mechanism effectively extracts character-level dependencies, outperforming global co-occurrence-based methods like GloVe [44,45]. Word2Vec also achieves significantly higher training and inference efficiency than deeper models, such as Transformers, by using techniques like negative sampling [46,47]. Therefore, the Word2Vec [48] model is used to conduct word vector training in this study. The model consists primarily of continuous bag-of-words (CBOW) and skip-gram structures. The CBOW model predicts the target word using context information, making it effective for recognizing common malicious attack patterns. However, since the CBOW method implements prediction using the average of the observed context, it performs poorly in terms of recognizing rare words or specific patterns. The skip-gram model, which predicts context from the target word, excels at understanding complex contexts and recognizing special patterns. Although the skip-gram model is less computationally efficient than the CBOW model, its advantages in terms of recognizing rare words and understanding complex contexts make it more suitable for XSS detection tasks.
Therefore, the Word2Vec model uses the skip-gram structure for XSS word vector training, with the specific model parameters detailed in Table 5.
The skip-gram structure consists of an input layer, a hidden layer, and an output layer. In the input layer, the word list is input into the structure in a one-hot encoded form. The input vectors are mapped to the hidden layer through a V × N embedding matrix (where V is the vocabulary size and N is the number of neurons contained in the hidden layer, i.e., the dimensionality of the embedding vectors). The output of the hidden layer is the embedding vector for the target word. Finally, the output of the hidden layer is passed through an N × V weight matrix that transforms the embedding vector back into a vocabulary-sized vector, i.e., the output layer vector. The specific model structure is shown in Figure 7.
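The skip-gram data flow described above can be illustrated with a short sketch: it generates (target, context) training pairs and runs one forward pass through the V × N and N × V matrices. The window size and the full softmax output are assumptions made for clarity; practical Word2Vec training with negative sampling avoids computing the full softmax.

```python
import math


def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def skipgram_forward(target_id, W_in, W_out):
    """One-hot input -> hidden embedding (row of V x N) -> softmax over V words."""
    hidden = W_in[target_id]  # multiplying a one-hot vector selects one row
    logits = [sum(h * w for h, w in zip(hidden, col)) for col in zip(*W_out)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return hidden, [e / total for e in exps]
```

After training, the rows of the V × N matrix (`W_in` here) are the word vectors that are passed on to the feature extraction networks.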
The Word2Vec model maps words to high-dimensional vector spaces to capture the semantic and syntactic relationships between words. Since high-dimensional vector spaces are difficult to understand intuitively, the t-distributed stochastic neighbor embedding (t-SNE) [50] dimensionality reduction algorithm is used to project the high-dimensional word vectors into three-dimensional space. The distribution of the reduced word vectors in three-dimensional space is shown in Figure 8, where the distance between a pair of data points reflects the similarity between the corresponding words in high-dimensional space. Points that are closer together exhibit higher similarity, whereas those that are farther apart possess lower similarity.

3.3. Model Construction

The experimental model for the proposed PCRAN method consists of three main components: the DSCNN model, the Bi-LSTM model, and the DSA-MHAFN multifeature fusion network. Word embeddings are fed into both the DSCNN and Bi-LSTM models. The DSCNN model, composed of a dual-branch DSC structure, primarily extracts local semantic features of XSS attacks, while the Bi-LSTM model, consisting of a two-layer LSTM structure, captures global semantic features. The extracted features are then processed through the DSA-MHAFN multifeature fusion network to generate a fused representation. The fused features are reduced in dimensionality by a fully connected layer, regularized by a dropout layer, and subsequently classified using a sigmoid classifier. The detailed hyperparameter settings for the PCRAN method are provided in Table 6.

3.4. Evaluation Criterion

To evaluate the detection performance of the proposed approach, accuracy, precision, recall, and the F1 score are used as evaluation metrics. The calculation methods for these metrics are as follows.
Accuracy [51] is the ratio of the number of correctly predicted samples (both XSS and benign) to the total number of samples. Accuracy is calculated as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision [52] is the ratio of the number of correctly predicted XSS attack samples to the total number of samples that are predicted as XSS attacks. Precision is calculated as shown below:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall [52] is the ratio of the number of correctly predicted XSS attack samples to the total number of actual XSS attack samples. Recall is calculated as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
The F1 score [53] is the harmonic mean of precision and recall, providing a balanced measure that considers both precision and recall. The F1 score is calculated as follows:
$$F_1\ \mathrm{score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
In the above formulas, true positives (TP) refer to the number of XSS samples that are correctly detected as XSS attacks. True negatives (TN) refer to the number of benign samples that are correctly detected as benign. False positives (FP) refer to the number of benign samples that are incorrectly detected as XSS attacks. False negatives (FN) refer to the number of XSS samples that are incorrectly detected as benign.
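The four metrics follow directly from the confusion-matrix counts, as the short sketch below shows. The function name and the zero-division guards are additions for illustration, and the counts in the usage note are made-up numbers, not results from this study.

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

For instance, with the hypothetical counts TP = 90, TN = 95, FP = 5, and FN = 10, the accuracy is 185/200 = 0.925, the precision is 90/95, the recall is 0.9, and the F1 score is 180/195.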

3.5. Comparative Experimental Analysis

After the input dataset is preprocessed to generate spatial word vectors, these vectors are separately fed into the DSCNN and Bi-LSTM models to extract local and global semantic features, respectively. The DSA-MHAFN is then employed for feature fusion, with the fused features subsequently input into the XSS attack classification network. The dataset is randomly split into training, validation, and test sets at a ratio of 7:1:2. The model undergoes 100 training epochs. To assess the reliability and stability of the proposed approach, all the experiments are conducted over five independent runs, and the average values of the evaluation metrics are taken as the final experimental results. In the feature fusion phase, the DSA-MHAFN utilizes eight attention heads. The experiments in this study consist of binary classification comparison experiments, including XSS attack detection method comparisons and ablation experiments.
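The 7:1:2 split and the averaging over independent runs can be sketched as follows. The seeding scheme and the integer rounding of the split sizes are assumptions; the paper does not state how remainders are assigned, so the counts produced here may differ slightly from those in Table 3.

```python
import random
from statistics import mean


def split_7_1_2(samples, seed):
    """Shuffle with a fixed seed and split into train/validation/test at 7:1:2."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 7 // 10
    n_val = len(items) // 10
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])  # the test set takes the remainder


def average_runs(run_metrics):
    """Average each metric over a list of per-run metric dictionaries."""
    return {name: mean(run[name] for run in run_metrics)
            for name in run_metrics[0]}
```

One way to use this is to call `split_7_1_2` with five different seeds, train and evaluate once per split, and pass the five metric dictionaries to `average_runs`.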

3.5.1. Comparison Experiment Concerning XSS Attack Detection Methods

In the comparative experiment concerning XSS attack detection methods, the binary classification performance of the PCRAN approach proposed in this paper is evaluated against the CMABLSTM [54], CNNL [55], VGGRL [21], CCNN [56], and CNNABL [57] methods on the self-constructed dataset. The detection results yielded by these various XSS attack detection methods are presented in Table 7.
As shown in Table 7, although some methods exhibit relatively high accuracy rates, a comprehensive evaluation conducted across all the metrics reveals that the PCRAN approach outperforms the others in terms of all the evaluation criteria. Compared with the alternative methods, the PCRAN achieves accuracy improvements of 0.49% to 2.35% and F1 score increases ranging from 0.55% to 2.35%, indicating the best overall performance.
The CMABL method employs a Bi-LSTM model integrated with a multihead self-attention mechanism. While it has notable advantages in terms of overall feature extraction and contextual understanding, its ability to extract local features is limited, and it is more susceptible to noise interference. As a result, it handles feature sparsity poorly. Although CMABL maintains relatively balanced performance overall, its detection efficacy is still limited.
The CNNL method combines CNN and LSTM models to perform hierarchical feature extraction. However, owing to its serial structure, it suffers from high computational complexity, model tuning challenges, and increased latency. This results in a disparity between its precision and recall values, with limited improvements in its comprehensive performance.
The VGGRL method achieves multifeature extraction for XSS attacks by stacking various models, including VGG16, VGG19, LSTM, AlexNet, and residual networks. While its complex architecture enables multifeature extraction, it introduces significant difficulties during the model tuning and resource consumption steps. Although VGGRL demonstrates relatively good overall performance, it is not well suited for multiscenario XSS attack detection tasks.
The CCNN method enhances the original CNN model through a lightweight design, offering excellent local feature extraction capabilities for XSS attacks and resulting in better detection outcomes. However, it struggles to identify long-term dependencies in the code sequences of XSS attacks, which negatively impacts its generalization ability.
The performance of the CNNABL method, a state-of-the-art (SOTA) method, is similar to that of the approach presented in this paper. This method uses both CNN and Bi-LSTM models to incorporate local semantic features and long-range semantic information. However, its feature fusion process is limited to direct concatenation operations, which results in the insufficient integration of local and global semantic features. Consequently, the final detection results produced for XSS attacks leave room for improvement.
In terms of computational resource consumption, the experiment uses the total number of parameters of the detection network (Par) and the inference time per sample (Itime) as key performance metrics. Benefiting from the lightweight design of DSC, the PCRAN method has a lower Par compared to other approaches. Additionally, the parallel design of the feature extraction network and the multihead attention mechanism yield a shorter Itime than most other detection methods. While the PCRAN method incurs slightly higher resource consumption than CMABL and CCNN, it achieves significantly better detection performance. The CNNABL method demonstrates detection performance closest to PCRAN but at a higher computational cost. Overall, PCRAN achieves a favorable balance between computational resource consumption and XSS detection effectiveness.
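The parameter saving from depthwise separable convolution can be checked with simple arithmetic. A standard one-dimensional convolution with kernel size k, C_in input channels, and C_out output channels has k · C_in · C_out weights, whereas the depthwise step has k · C_in and the pointwise step has C_in · C_out. The sketch below assumes a depth multiplier of one, and the channel counts in the usage note are hypothetical, not the settings of Table 6.

```python
def conv1d_params(kernel_size, c_in, c_out, bias=True):
    """Parameter count of a standard 1D convolution layer."""
    return kernel_size * c_in * c_out + (c_out if bias else 0)


def separable_conv1d_params(kernel_size, c_in, c_out, bias=True):
    """Parameter count of a depthwise separable 1D convolution layer."""
    depthwise = kernel_size * c_in  # one filter per input channel
    pointwise = c_in * c_out        # 1x1 convolution that mixes channels
    return depthwise + pointwise + (c_out if bias else 0)
```

With k = 8, C_in = 128, and C_out = 64, the standard layer has 65,600 parameters and the separable layer has 9,280, roughly a sevenfold reduction for this illustrative configuration.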

3.5.2. Hyperparameter Comparison Experiment

The size of the convolution kernels contained in the DSCNN plays a crucial role in determining the local XSS semantic feature extraction ability of the PCRAN approach. To analyze the impacts of different convolution kernel sizes on the DSCNN, multiple detection networks using both single-kernel and dual-kernel configurations are set up. Various combinations of convolution kernel sizes are selected, and the performance of each detection network is evaluated on the basis of its precision, recall, and F1 score values. The detection performance of the tested variants is presented in Figure 9.
In Figure 9, DSCNN-X represents a single-kernel DSCNN, DSCNN-X&X denotes a dual-kernel DSCNN, and X indicates the size of the convolution kernel. As illustrated in Figure 9, the performance differences among the single-kernel DSCNN networks are relatively small, with an average accuracy of 97.46%. The size of the convolution kernel determines the feature extraction range that is achievable for XSS attack characteristics. Smaller convolution kernels produce more features, which may contain redundant information. Larger convolution kernels, on the other hand, are better suited for identifying complex XSS patterns but may overlook key details. Consequently, a dual-kernel DSCNN can combine the advantages of both kernel sizes, resulting in better detection performance than that of a single-kernel DSCNN. In Figure 9, the average detection accuracy of the dual-kernel DSCNNs is 98.45%, which is 0.99 percentage points higher than that of the single-kernel DSCNNs. By leveraging the complementary strengths of the different kernel sizes, the dual-kernel networks can comprehensively and effectively extract the local semantic features of XSS. The features extracted by each single-kernel network are distinct. When the detection networks are capable of extracting complete XSS attack features, the performance gap between the dual-kernel networks becomes more pronounced, and the extracted XSS features are more diverse. As a result, the combination of kernel sizes two and eight in the dual-kernel DSCNN yields the best detection results, with accuracy improvements ranging from 0.53% to 1.62% over those of the other DSCNN configurations and an F1 score of 98.92%, demonstrating the effectiveness of this configuration in the detection task.
In the DSA-MHAFN, the number of attention heads defines the number of parallel attention paths, which significantly influences the ability to extract multidimensional, complex XSS features and the overall performance of the approach. To assess the impact of the number of attention heads on the DSA-MHAFN, experiments are conducted with varying numbers of attention heads, and the results are evaluated in terms of the accuracy, precision, recall, and F1 score metrics, as shown in Figure 10.
As shown in Figure 10, the use of too few attention heads fails to capture sufficient feature patterns, whereas the use of an excessive number of attention heads can result in overly dispersed feature distributions, making it difficult to learn effective features. Setting the number of attention heads to eight enables the DSA-MHAFN to effectively capture XSS features at multiple granularities without introducing excessive noise from irrelevant features. This setting results in a better balance between performance and efficiency. Therefore, the optimal configuration includes eight attention heads, and it consistently yields superior performance across multiple datasets, with accuracy improvements ranging from 0.33% to 1.1% over the other configurations and an F1 score of 99.92%, demonstrating its effectiveness in the detection task.

3.5.3. Ablation Experiments

Ablation experiments are conducted on the self-constructed dataset to compare the detection performance of the components of the PCRAN approach: the dual-branch DSCNN, the Bi-LSTM network, and the DSA-MHAFN. The detection effectiveness of each component is evaluated in terms of the accuracy, precision, recall, and F1 score metrics. The detailed experimental results are presented in Table 8.
In Table 8, DSCNN-2 and DSCNN-8 represent single-branch DSCNN networks featuring convolution kernel sizes of two and eight, respectively. In contrast, the DSCNN refers to the parallel dual-branch DSCNN network in which both DSCNN-2 and DSCNN-8 operate simultaneously. The primary purpose of the DSCNN is to extract the local features of XSS code. The DSCNN-2 branch focuses on capturing microcharacter text features that are related to XSS, such as common attack sequence labels (“>”) and short function call characters (such as “eval”). Moreover, the DSCNN-8 branch emphasizes broader XSS grammar features, including multilevel nested tag structures (e.g., “<scr<script>”) and dynamic JavaScript code characteristics (such as “eval(decodeURIComponent(x))”). By combining the strengths of both branches, the DSCNN achieves enhanced detection performance, exceeding that of the single-branch detection networks across all evaluation metrics. This demonstrates the effectiveness and advantages of the DSCNN architecture.
Bi-LSTM is employed to extract the global semantic features of XSS code, yielding excellent detection results by efficiently capturing long-distance semantic information.
In Table 8, DSA-0 represents the method that performs feature fusion and detection using only the local semantic features extracted directly by the DSCNN and the global semantic features obtained from the Bi-LSTM without the use of the DSA-MHAFN for feature fusion. Conversely, the DSA-MHAFN refers to the method that achieves feature fusion and detection through the DSA-MHAFN, which integrates both local and global semantic features. According to the data presented in Table 8, the multifeature fusion method using the DSA-MHAFN outperforms the method that does not leverage the DSA-MHAFN by 1.41% in terms of accuracy and by 1.51% in terms of the F1 score, further validating the rationality and superiority of the DSA-MHAFN design.

4. Conclusions

To address the limitations of existing XSS attack detection methods in feature extraction and fusion, which often lead to high false positive rates, this paper proposes an XSS detection method based on multisource semantic feature fusion. The proposed approach redesigns tokenization rules for XSS datasets and integrates DSC, Bi-LSTM, and a multihead self-attention mechanism to optimize detection performance. Experimental results demonstrate its effectiveness in both identification and classification.
This paper first reviews the current research status and motivations for XSS attack detection. Then, a standardized tokenization rule for XSS datasets is introduced. Next, a feature extraction network for both local and global XSS semantics, along with a multifeature fusion network, is proposed. Finally, multiple comparative experiments validate the proposed method’s rationality and superiority, showing significant improvements in accuracy and the F1 score over existing detection methods.
The proposed approach introduces a novel deep learning-based framework for XSS detection. By leveraging depthwise separable convolution and bidirectional LSTM, it jointly captures the local syntactic structure and global contextual dependencies of XSS code, overcoming the limitations of traditional unimodal detection methods. Additionally, a multihead attention fusion network based on saliency scoring is introduced to mitigate noise interference and feature redundancy in static feature fusion. In summary, this method provides a scalable solution for detecting emerging DOM-based XSS attacks and script variants, offering a new perspective and technical support for web security protection.
In future research, we will aim to analyze additional features of XSS attacks from multiple perspectives. We also plan to incorporate multiclass detection into the model for various types of XSS attacks while extracting multisource features, further enhancing the detection performance of the XSS detection model in cases involving different types of XSS attacks.

Author Contributions

Conceptualization, Z.H.; Methodology, Z.H.; Data curation, J.Z.; Writing—original draft, Z.H. and J.Z.; Writing—review & editing, H.Y.; Visualization, J.Z.; Supervision, H.Y.; Funding acquisition, Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Civil Aviation Joint Research Fund Project of the National Natural Science Foundation of China (grant number U2433205), the National Natural Science Foundation of China (grant number 62201576), and the Supporting Fund of the National Natural Science Foundation of China (grant number 3122023PT10).

Data Availability Statement

The data presented in this study are available on request from the corresponding author, with consideration given to privacy, legal, or ethical reasons.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Alotaibi, B. Cybersecurity Attacks and Detection Methods in Web 3.0 Technology: A Review. Sensors 2025, 25, 342.
  2. Choiriyah, A.; Qomariasih, N. Security Analysis on Websites Belonging to the Health Service Districts in Indonesia Based on the Open Web Application Security Project (OWASP) Top 10 2021. In Proceedings of the 2023 International Conference on Information Technology and Computing (ICITCOM), Yogyakarta, Indonesia, 1–2 December 2023; pp. 267–272.
  3. Sable, N.P.; Patil, R.; Gaikwad, V.; Sakhare, N.N.; Buchade, A.; Kokare, R. Structured Approach to Web Security: Exploring Evolving Threats and Unresolved Research Challenges. In Proceedings of the 2023 International Conference on Integration of Computational Intelligent System (ICICIS), Pune, India, 1–4 November 2023; pp. 1–7.
  4. Ponnavaikko, M.; Shanmugam, J. Risk mitigation for cross site scripting attacks using signature based model on the server side. In Proceedings of the Second International Multi-Symposiums on Computer and Computational Sciences (IMSCCS 2007), Iowa City, IA, USA, 13–15 August 2007; pp. 398–405.
  5. Ali, K.; Abdel-Hamid, A.; Kholief, M. Prevention of DOM Based XSS Attacks Using a White List Framework. In Proceedings of the 2014 24th International Conference on Computer Theory and Applications (ICCTA), Alexandria, Egypt, 25–27 October 2014; pp. 68–75.
  6. Tang, Z.; Zhu, H.; Cao, Z.; Zhao, S. L-WMxD: Lexical based webmail XSS discoverer. In Proceedings of the 2011 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Shanghai, China, 10–15 April 2011; pp. 976–981.
  7. Gebre, M.T.; Lhee, K.-S.; Hong, M. A robust defense against content-sniffing xss attacks. In Proceedings of the 6th International Conference on Digital Content, Multimedia Technology and its Applications, Seoul, Republic of Korea, 16–18 August 2010; pp. 315–320.
  8. Kozik, R.; Choraś, M.; Renk, R.; Hołubowicz, W. Modelling HTTP requests with regular expressions for detection of cyber attacks targeted at web applications. In Proceedings of the International Joint Conference SOCO’14-CISIS’14-ICEUTE’14, Bilbao, Spain, 25–27 June 2014; pp. 527–535.
  9. Javed, A.; Schwenk, J. Towards elimination of cross-site scripting on mobile versions of web applications. In Proceedings of the Information Security Applications: 14th International Workshop, WISA 2013, Jeju Island, Republic of Korea, 19–21 August 2013; pp. 103–123.
  10. Khan, N.; Abdullah, J.; Khan, A.S. Defending malicious script attacks using machine learning classifiers. Wirel. Commun. Mob. Com. 2017, 2017, 5360472.
  11. Zhao, C.; Si, S.; Tu, T.; Shi, Y.; Qin, S. Deep-learning based injection attacks detection method for http. Mathematics 2022, 10, 2914.
  12. Et-Tolba, M.; Hanin, C.; Belmekki, A. DL-Based XSS Attack Detection Approach Using LSTM Neural Network with Word Embeddings. In Proceedings of the 2024 11th International Conference on Wireless Networks and Mobile Communications (WINCOM), Leeds, UK, 23–25 July 2024; pp. 1–6.
  13. Liu, H.; Patras, P. NetSentry: A deep learning approach to detecting incipient large-scale network attacks. Comput. Commun. 2022, 191, 119–132.
  14. Alqarni, A.A.; Alsharif, N.; Khan, N.A.; Georgieva, L.; Pardede, E.; Alzahrani, M.Y. MNN-XSS: Modular Neural Network Based Approach for XSS Attack Detection. CMC-Comput. Mater. Con. 2022, 70, 4075–4085.
  15. Yan, H.; Feng, L.; Yu, Y.; Liao, W.; Feng, L.; Zhang, J.; Liu, D.; Zou, Y.; Liu, C.; Qu, L.; et al. Cross-site scripting attack detection based on a modified convolution neural network. Front. Comput. Neurosc. 2022, 16, 981739.
  16. Nilavarasan, G.; Balachander, T. XSS attack detection using convolution neural network. In Proceedings of the 2023 International Conference on Artificial Intelligence and Knowledge Discovery in Concurrent Engineering (ICECONF), Chennai, India, 5–7 January 2023; pp. 1–6.
  17. Wan, S.; Xian, B.; Wang, Y.; Lu, J. Methods for Detecting XSS Attacks Based on BERT and BiLSTM. In Proceedings of the 2024 8th International Conference on Management Engineering, Software Engineering and Service Sciences (ICMSS), Wuhan, China, 12–14 January 2024; pp. 1–7.
  18. Liu, Z.; Fang, Y.; Huang, C.; Han, J. GraphXSS: An efficient XSS payload detection approach based on graph convolutional network. Comput. Secur. 2022, 114, 102597.
  19. Guan, Y.; Zhou, W.; Wang, H.; Lin, L. Feature Fusion-Based Detection of SQL Injection and XSS Attacks. In Proceedings of the 2024 5th International Conference on Information Science, Parallel and Distributed Systems (ISPDS), Guangzhou, China, 31 May–2 June 2024; pp. 351–355.
  20. Nige, L.; Lu, C.; Lei, Z.; Zhenning, T.; Zhiqiang, W.; Yiyang, S.; Xiaolin, G. A Web Attack Detection Method Based on DistilBERT and Feature Fusion for Power Micro-Application Server. In Proceedings of the 2023 2nd International Conference on Advanced Electronics, Electrical and Green Energy (AEEGE), Singapore, 26–28 May 2023; pp. 6–12.
  21. Sheneamer, A. Vulnerable JavaScript functions detection using stacking of convolutional neural networks. PeerJ Comput. Sci. 2024, 10, e1838.
  22. Zhang, S.; Pan, Y.; Liu, Q.; Yan, Z.; Choo, K.-K.R.; Wang, G. Backdoor attacks and defenses targeting multi-domain ai models: A comprehensive review. ACM Comput. Surv. 2024, 57, 1–35.
  23. Meena, K.; Raj, L. Evaluation of the descriptive type answers using hyperspace analog to language and self-organizing map. In Proceedings of the 2014 IEEE International Conference on Computational Intelligence and Computing Research, Coimbatore, India, 18–20 December 2014; pp. 1–5.
  24. Zhou, S.; Liu, C.; Ye, D.; Zhu, T.; Zhou, W.; Yu, P.S. Adversarial attacks and defenses in deep learning: From a perspective of cybersecurity. ACM Comput. Surv. 2022, 55, 1–39.
  25. Srivastava, H.; Sarawadekar, K. A depthwise separable convolution architecture for CNN accelerator. In Proceedings of the 2020 IEEE Applied Signal Processing Conference (ASPCON), Kolkata, India, 7–9 October 2020; pp. 1–5.
  26. KF; DP. Cross Site Scripting (XSS) Attacks Information and Archive. Available online: http://xssed.com/ (accessed on 12 October 2024).
  27. Cross-Site Scripting (XSS) Cheat Sheet. Available online: https://portswigger.net/web-security/cross-site-scripting/cheat-sheet (accessed on 20 October 2024).
  28. Ismail, T. Cross Site Scripting (XSS) Vulnerability Payload List. Available online: https://github.com/payloadbox/xss-payload-list (accessed on 25 October 2024).
  29. Abaimov, S.; Bianchi, G. CODDLE: Code-injection detection with deep learning. IEEE Access 2019, 7, 128617–128627.
  30. Hsiao, W.-C.; Wang, C.-H. Detection of SQL Injection and Cross-Site Scripting Based on Multi-Model CNN Combined with Bidirectional GRU and Multi-Head Self-Attention. In Proceedings of the 2023 5th International Conference on Computer Communication and the Internet (ICCCI), Fujisawa, Japan, 23–25 June 2023; pp. 142–150.
  31. Jang, J.-G.; Quan, C.; Lee, H.D.; Kang, U. Falcon: Lightweight and accurate convolution based on depthwise separable convolution. Knowl. Inf. Syst. 2023, 65, 2225–2249.
  32. Mahara, G.S.; Gangele, S. Fake news detection: A RNN-LSTM, Bi-LSTM based deep learning approach. In Proceedings of the 2022 IEEE 1st International Conference on Data, Decision and Systems (ICDDS), Bangalore, India, 2–3 December 2022; pp. 1–6.
  33. He, Y.-L.; Wang, P.-F.; Zhu, Q.-X. Improved Bi-LSTM with distributed nonlinear extensions and parallel inputs for soft sensing. IEEE T. Ind. Inform. 2024, 20, 3748–3755.
  34. Linqin, C.; Zhongxu, L.; Sitong, Z.; Kejia, C. Visual question answering combining multi-modal feature fusion and multi-attention mechanism. In Proceedings of the 2021 IEEE 2nd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE), Nanchang, China, 26–28 March 2021; pp. 1035–1039.
  35. Web Directory of High-Quality Resources. Available online: http://odp.org/homepage.php (accessed on 15 October 2024).
  36. Kaur, J.; Garg, U. A detailed survey on recent xss web-attacks machine learning detection techniques. In Proceedings of the 2021 2nd Global Conference for Advancement in Technology (GCAT), Bangalore, India, 1–3 October 2021; pp. 1–6.
  37. Lei, L.; Chen, M.; He, C.; Li, D. XSS detection technology based on LSTM-attention. In Proceedings of the 2020 5th International Conference on Control, Robotics and Cybernetics (CRC), Wuhan, China, 16–18 October 2020; pp. 175–180.
  38. Peng, B.; Xiao, X.; Wang, J. Cross-site scripting attack detection method based on transformer. In Proceedings of the 2022 IEEE 8th International Conference on Computer and Communications (ICCC), Chengdu, China, 9–12 December 2022; pp. 1651–1655.
  39. Tamamura, K.; Sakai, S.; Watarai, K.; Okada, S.; Mitsunaga, T. Detection of XSS Attacks with One Class SVM Using TF-IDF and Devising a Vectorized Vocabulary. In Proceedings of the 2023 IEEE International Conference on Computing (ICOCO), Langkawi, Malaysia, 9–12 October 2023; pp. 35–40.
  40. Mishra, A.; Vishwakarma, S. Analysis of TF-IDF Model and its Variant for Document Retrieval. In Proceedings of the 2015 International Conference on Computational Intelligence and Communication Networks (CICN), Jabalpur, India, 12–14 December 2015; pp. 772–776.
  41. Manalu, L.N.T.; Bijaksana, M.A.; Suryani, A.A. Analysis of the Word2Vec model for semantic similarities in Indonesian words. In Proceedings of the 2019 7th International Conference on Information and Communication Technology (ICoICT), Kuala Lumpur, Malaysia, 24–26 July 2019; pp. 1–5.
  42. Liu, X.; Liu, W. Research on Multi Round Dialogue Algorithm for Intelligent Robots Based on Text Pre training. In Proceedings of the 2024 5th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT), Nanjing, China, 29–31 March 2024; pp. 111–115.
  43. Zheng, X.; Zhang, C.; Woodland, P.C. Adapting GPT, GPT-2 and BERT language models for speech recognition. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; pp. 162–168.
  44. Giri, S.; Banerjee, S.; Bag, K.; Maiti, D. Comparative study of content-based phishing email detection using global vector (GloVe) and bidirectional encoder representation from transformer (BERT) word embedding models. In Proceedings of the 2022 First International Conference on Electrical, Electronics, Information and Communication Technologies (ICEEICT), Trichy, India, 16–18 February 2022; pp. 01–06.
  45. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
  46. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919.
  47. Menghani, G. Efficient deep learning: A survey on making deep learning models smaller, faster, and better. ACM Comput. Surv. 2023, 55, 1–37.
  48. Wang, R.; Shi, Y. Research on application of article recommendation algorithm based on Word2Vec and Tfidf. In Proceedings of the 2022 IEEE International Conference on Electrical Engineering, Big Data and Algorithms (EEBDA), Changchun, China, 25–27 February 2022; pp. 454–457.
  49. Onishi, T.; Shiina, H. Distributed Representation Computation Using CBOW Model and Skip-gram Model. In Proceedings of the 2020 9th International Congress on Advanced Applied Informatics (IIAI-AAI), Kitakyushu, Japan, 1–15 September 2020; pp. 845–846.
  50. Sakib, S.; Siddique, M.A.B.; Rahman, M.A. Performance Evaluation of t-SNE and MDS Dimensionality Reduction Techniques with KNN, ENN and SVM Classifiers. In Proceedings of the 2020 IEEE Region 10 Symposium (TENSYMP), Dhaka, Bangladesh, 5–7 June 2020; pp. 5–8.
  51. Kitchenham, B.A.; Pickard, L.M.; MacDonell, S.G.; Shepperd, M.J. What accuracy statistics really measure. IEE Proc. Softw. 2001, 148, 81–85.
  52. Buckland, M.; Gey, F. The relationship between recall and precision. J. Am. Soc. Inf. Sci. 1994, 45, 12–19.
  53. Yacouby, R.; Axman, D. Probabilistic extension of precision, recall, and f1 score for more thorough evaluation of classification models. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, Online, 18 August 2020; pp. 79–91.
  54. Li, X.; Wang, T.; Zhang, W.; Niu, X.; Zhang, T.; Zhao, T.; Wang, Y.; Wang, Y. An LSTM based cross-site scripting attack detection scheme for Cloud Computing environments. J. Cloud Comput. 2023, 12, 118.
  55. Tadhani, J.R.; Vekariya, V.; Sorathiya, V.; Alshathri, S.; El-Shafai, W. Securing web applications against XSS and SQLi attacks using a novel deep learning approach. Sci. Rep. 2024, 14, 1803.
  56. Kumar, J.; Santhanavijayan, A.; Rajendran, B. Cross Site Scripting Attacks Classification using Convolutional Neural Network. In Proceedings of the 2022 International Conference on Computer Communication and Informatics (ICCCI), Coimbatore, India, 25–27 January 2022; pp. 1–6.
  57. Hu, T.; Xu, C.; Zhang, S.; Tao, S.; Li, L. Cross-site scripting detection with two-channel feature fusion embedded in self-attention mechanism. Comput. Secur. 2023, 124, 102990.

Figure 1. The XSS attack detection framework based on multisource semantic feature fusion.

Figure 2. Structure of the local semantic feature extraction model based on the DSCNN.

Figure 3. Network structure diagram of Bi-LSTM. Reproduced with permission from [33]. Copyright 2024, IEEE.

Figure 4. The DSA-MHAFN architecture.

Figure 5. XSS attack classification network.

Figure 6. Experimental flow chart.

Figure 7. The skip-gram structure. Reproduced with permission from [49]. Copyright 2020, IEEE.

Figure 8. 3D spatial distribution map of word vectors.

Figure 9. Results of hyperparameter validation experiments concerning the DSCNN.

Figure 10. DSA-MHAFN hyperparameter validation experiment.

Table 1. Normalized tokenization rules.

| Regular Expression | Meaning |
| --- | --- |
| `(?x)[\w\.]+?\(` | Matching function calls |
| `<script\b[^>]*>([\s\S]*?)<\/script>` | Matching JavaScript-nested structures |
| `"[^"]*"` | Matching strings inside double quotes |
| `'[^']*'` | Matching strings inside single quotes |
| `http[s]?://(?:[a-zA-Z]\|[0-9]\|[$-_@.&+]\|[!*\\(\\),]\|(?:%[0-9a-fA-F][0-9a-fA-F]))+` | Matching complete URLs starting with http:// or https:// |
| `</?\w+[^>]*>` | Matching HTML tags |
| `<!--.*?-->` | Matching HTML comments |
| `\b(?:on\w+\|src\|href\|data)\s*=\s*['"][^'"]*['"]` | Matching attribute assignment statements |
| `[=!<>]=?` | Matching comparison operators |
| `&&\|\\|\\|` | Matching logical operators |
| `[\w\.]+` | Matching general markup |

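
The rules in Table 1 can be tried out with a small tokenizer. The following is a minimal sketch rather than the authors' implementation: the priority order of the patterns is an assumption (the source does not state one), the `(?x)` flag and the script-block, URL, and attribute rules are left out for brevity, and the final `\S` fallback is a hypothetical addition so that leftover characters such as `)` become tokens. The names `TOKEN_RULES` and `tokenize` are illustrative.

```python
import re

# Patterns taken from Table 1, listed in an ASSUMED priority order.
TOKEN_RULES = [
    r"<!--.*?-->",    # HTML comments
    r"</?\w+[^>]*>",  # HTML tags
    r'"[^"]*"',       # strings inside double quotes
    r"'[^']*'",       # strings inside single quotes
    r"[\w\.]+?\(",    # function calls
    r"&&|\|\|",       # logical operators
    r"[=!<>]=?",      # comparison operators
    r"[\w\.]+",       # general markup
    r"\S",            # assumption: fallback for any other visible character
]

# Non-capturing groups keep findall() returning whole matches.
TOKENIZER = re.compile("|".join(f"(?:{rule})" for rule in TOKEN_RULES))


def tokenize(payload: str) -> list[str]:
    """Split a payload into tokens, trying the rules left to right at each position."""
    return TOKENIZER.findall(payload)


if __name__ == "__main__":
    print(tokenize("<script>alert('XSS')</script>"))
    # ['<script>', 'alert(', "'XSS'", ')', '</script>']
```

With this ordering the first payload of Table 2 yields the same five tokens as the table. Payloads with attributes or comments are split differently here, because the tag and comment rules match the whole construct before the finer rules get a chance.
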
Table 2. Examples of normalized tokenization.

| XSS Payload | After Performing Normalized Tokenization |
| --- | --- |
| `<script>alert('XSS')</script>` | `[‘<script>’,‘alert(’,‘'XSS'’,‘)’,‘</script>’]` |
| `<img src="http://e.com/xss.jpg" onerror="alert('XSS')">` | `[‘<img’,‘src='http://e.com/xss.jpg'>’,‘onerror="alert('XSS')"’,‘>’]` |
| `<a href="#" onclick="alert('XSS')">Click me</a>` | `[‘<a’,‘href="’,‘"#’,‘"’,‘ onclick="’,‘alert(’,‘'XSS'’,‘)"’,‘">’,‘Click me’,‘</a>’]` |
| `<!-- <script>alert('XSS'); </script> -->` | `[‘<!--’,‘<script>’,‘alert(’,‘'XSS'’,‘);’,‘</script>’,‘-->’]` |
| `<script>if (x < 10 && y > 5) {alert('XSS') }</script>` | `[‘<script>’,‘if’,‘(’,‘x’,‘<’,‘10’,‘&&,‘y’,>‘5’,‘)’,‘'XSS'’,‘) ‘{alert’,‘'XSS'’,‘}’,‘</script>’]` |

Table 3. Statistical information concerning the utilized datasets.

| Dataset Source | Data Attribute | Number of Codes | Label |
| --- | --- | --- | --- |
| DMOZ Open Directory [35] | Normal | 33,713 | 0 |
| xssed.com website [26] | Malicious | 14,713 | 1 |
| PortSwigger [27] | Malicious | 8745 | 1 |
| Payloadbox (2022) [28] | Malicious | 9591 | 1 |

Table 4. Examples of data cleaning.

| Cleaning Type | Source Code | After Processing |
| --- | --- | --- |
| Case conversion | `<ScRiPt>alert('XSS');</sCrIpT>` | `<script>alert('XSS');</script>` |
| URL decoding | `%3csvg+onload%3d%22alert(%27XSS%27)%22%3e%3c%2fsvg%3e%0a%0a` | `<svg onload="alert('XSS')"></svg>` |
| HTML decoding | `&lt;img src="x.jpg" onerror="alert('XSS!');"&gt;` | `<img src="x.jpg" onerror="alert('XSS!');">` |
| Base64 decoding | `PGEgaHJlZj0iamF2YXNjcmlwdDphbGVydCgnWFNTJyk7Ij5oZXJlPC9hPgoK` | `<a href="javascript:alert('XSS');">here</a>` |
| Number substitution | `<img src=javascript:alert(String.fromCharCode(88,83,83))>` | `<img src=javascript:alert(String.fromCharCode(0,0,0))>` |
| URL replacement | `<SCRIPT SRC=http://127.0.0.1/xss.js< B >` | `<script src=http://u< b >` |

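
The cleaning steps in Table 4 map onto standard-library calls. The sketch below is illustrative only: the function names are hypothetical, the digit and URL patterns are assumptions inferred from the table's examples, and the source does not say how Base64 content is detected, so `decode_base64` simply tries a strict decode and falls back to the input.

```python
import base64
import html
import re
from urllib.parse import unquote_plus


def to_lower(text: str) -> str:
    """Case conversion: fold mixed-case tags such as <ScRiPt> to lower case."""
    return text.lower()


def decode_url(text: str) -> str:
    """URL decoding: resolve %xx escapes and turn '+' into spaces."""
    return unquote_plus(text)


def decode_html(text: str) -> str:
    """HTML decoding: resolve entities such as &lt; and &gt;."""
    return html.unescape(text)


def decode_base64(text: str) -> str:
    """Base64 decoding: return decoded UTF-8 text, or the input if it is not valid Base64."""
    try:
        return base64.b64decode(text, validate=True).decode("utf-8")
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        return text


def substitute_numbers(text: str) -> str:
    """Number substitution: replace each run of digits with 0 (assumed rule)."""
    return re.sub(r"\d+", "0", text)


def replace_urls(text: str) -> str:
    """URL replacement: collapse each http(s) URL to http://u (assumed pattern)."""
    return re.sub(r"https?://[^\s<>\"']+", "http://u", text)


if __name__ == "__main__":
    print(decode_url("%3csvg+onload%3d%22alert(%27XSS%27)%22%3e%3c%2fsvg%3e%0a%0a").strip())
    print(to_lower(replace_urls("<SCRIPT SRC=http://127.0.0.1/xss.js< B >")))
```

Each function handles one row of the table in isolation; the order in which the steps are chained in the actual pipeline is not given in this section.
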
Table 5. Word2Vec model parameter configuration.

| Parameter | Meaning | Setting |
| --- | --- | --- |
| Vocabulary_size | Vocabulary size | 3000 |
| Batch_size | Batch size | 128 |
| Embedding_size | Dimensionality of the embedding vector | 128 |
| Skip_window | Window size | 5 |
| Num_iter | Number of training iterations | 5 |

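
To make the role of Skip_window concrete, the sketch below generates the (center, context) training pairs that a skip-gram model (Figure 7) learns from. It is a generic illustration of the skip-gram windowing idea, not the authors' training code; the function name `skipgram_pairs` is hypothetical, and the default window of 5 is taken from Table 5.

```python
def skipgram_pairs(tokens: list[str], skip_window: int = 5) -> list[tuple[str, str]]:
    """Pair each token with every neighbour at most `skip_window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - skip_window)
        stop = min(len(tokens), i + skip_window + 1)
        for j in range(start, stop):
            if j != i:  # a token is never its own context
                pairs.append((center, tokens[j]))
    return pairs


if __name__ == "__main__":
    tokens = ["<script>", "alert(", "'XSS'", ")", "</script>"]
    for center, context in skipgram_pairs(tokens, skip_window=1):
        print(center, "->", context)
```

A larger window lets tokens that sit further apart in a payload, such as an opening tag and its closing tag, appear in each other's context.
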
Table 6. Parameter settings.

| Parameter | Setting |
| --- | --- |
| Number of DSC channels | 128 |
| DSC convolution kernel size | 2–8 |
| Number of BiLSTM hidden units | 128 |
| Number of DSA-MHAFN attention heads | 8 |
| Dropout | 0.5 |
| Learning rate | 1 × 10⁻⁴ |
| Optimizer | Adam |
| Batch size | 256 |
| Epochs | 100 |
| Loss function | Binary_crossentropy |
| Classifier | Sigmoid |

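
The weight savings of depthwise separable convolution can be worked out for the settings in Table 6. This is a back-of-the-envelope sketch under stated assumptions: one-dimensional convolutions, 128 input channels (taken from the embedding size in Table 5), 128 output channels, and bias terms ignored. The function names are illustrative.

```python
def standard_conv1d_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights of a standard 1D convolution: one k x C_in filter per output channel."""
    return kernel_size * in_channels * out_channels


def separable_conv1d_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights of a depthwise separable 1D convolution."""
    depthwise = kernel_size * in_channels  # one k-wide filter per input channel
    pointwise = in_channels * out_channels  # 1x1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    channels = 128  # assumed: embedding size and number of DSC channels
    for k in range(2, 9):  # kernel sizes 2-8 from Table 6
        std = standard_conv1d_params(k, channels, channels)
        sep = separable_conv1d_params(k, channels, channels)
        print(f"k={k}: standard={std}, separable={sep}, ratio={std / sep:.2f}")
```

Under these assumptions a kernel of size 2 needs 32,768 weights as a standard convolution and 16,640 as a separable one, and the gap widens as the kernel grows, since only the small depthwise term depends on the kernel size.
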
Table 7. Results of an experiment comparing different XSS attack detection methods.

| Detection Method | Accuracy (%) | Recall (%) | F1 Score (%) | Parameters (M) | Inference Time (ms) |
| --- | --- | --- | --- | --- | --- |
| CMABL | 97.57 | 97.55 | 97.57 | 16.4 | 2.78 |
| CNNL | 97.52 | 97.26 | 98.46 | 16.7 | 3.1 |
| VGGRL | 98.79 | 98.58 | 98.84 | 368.1 | 36.88 |
| CCNN | 98.55 | 98.04 | 98.53 | 16.1 | 2.7 |
| CNNABL (SOTA) | 99.38 | 99.06 | 99.37 | 17.5 | 3.5 |
| PCRAN (ours) | 99.87 | 99.89 | 99.92 | 16.5 | 2.8 |

Table 8. Results of ablation experiments.

| Detection Method | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) |
| --- | --- | --- | --- | --- |
| DSCNN-2 | 97.51 | 97.82 | 97.25 | 97.53 |
| DSCNN-8 | 97.94 | 97.91 | 97.89 | 97.89 |
| DSCNN | 98.96 | 99.04 | 98.81 | 98.92 |
| Bi-LSTM | 98.57 | 98.78 | 98.42 | 98.59 |
| DSA-0 | 98.46 | 98.68 | 98.14 | 98.41 |
| DSA-MHAFN | 99.87 | 99.96 | 99.89 | 99.92 |

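
The F1 scores in Table 8 can be cross-checked against the precision and recall columns, since F1 is their harmonic mean. The sketch below applies the standard formula F1 = 2PR / (P + R) to the tabulated percentages; the function name is illustrative. Small differences in the last digit are expected because the table's inputs are already rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same unit, here percent)."""
    return 2 * precision * recall / (precision + recall)


# (method, precision %, recall %, reported F1 %) copied from Table 8.
ROWS = [
    ("DSCNN-2", 97.82, 97.25, 97.53),
    ("DSCNN-8", 97.91, 97.89, 97.89),
    ("DSCNN", 99.04, 98.81, 98.92),
    ("Bi-LSTM", 98.78, 98.42, 98.59),
    ("DSA-0", 98.68, 98.14, 98.41),
    ("DSA-MHAFN", 99.96, 99.89, 99.92),
]

if __name__ == "__main__":
    for method, precision, recall, reported in ROWS:
        computed = f1_score(precision, recall)
        print(f"{method}: computed={computed:.2f}, reported={reported:.2f}")
```
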
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Hu, Z.; Zhang, J.; Yang, H. XSS Attack Detection Based on Multisource Semantic Feature Fusion. Electronics 2025, 14, 1174. https://doi.org/10.3390/electronics14061174
