1. Introduction
The widespread adoption of 5G communication networks has significantly enhanced data transmission capabilities while also promoting the extensive use of web technologies. Web applications have become essential tools for enterprises and government organizations to streamline their operations, improve their quality of service, and enhance user experiences. However, with the widespread use of web applications, security concerns have become increasingly prominent. Cross-site scripting (XSS) attacks are a common type of script injection attack in which attackers exploit vulnerabilities in web applications to trick users into visiting websites or pages containing malicious scripts. These scripts are then loaded and executed, stealing user information. Relevant research [1] indicates that XSS attacks are the predominant type of web attack, accounting for approximately 70% of web application attacks worldwide. According to the survey provided by the Open Web Application Security Project (OWASP) [2], XSS attacks have consistently ranked among the top web security threats, demonstrating greater deception and complexity compared to other web application attacks. In 2021, they were classified as the third most prevalent type of injection attack, posing significant risks to user privacy and financial security. For example, XSS attacks can be used to forcibly download malware, such as trojans and viruses, or exploit browser vulnerabilities to gain control over user devices, leading to botnet formation or worm propagation. Additionally, XSS attacks often hijack user traffic through malicious pop-up advertisements or forced redirections to harmful websites. They can also be leveraged to launch distributed denial-of-service (DDoS) attacks against enterprises, causing server overload and service disruption [3]. Consequently, in-depth research on XSS attacks and effective mitigation strategies have become primary focuses in the field of web security.
Early researchers typically relied on known XSS attack patterns and features to establish blacklists, whitelists, and signature databases [
4,
5,
6], using pattern matching techniques such as regular expressions [
7,
8,
9] to check whether the input content contained potential XSS attack code. However, these methods are only applicable to specific scenarios and cannot comprehensively cover all possible XSS attacks. XSS attacks exploit the injection of malicious scripts to steal user data, hijack sessions, and perform other harmful actions. These attacks continuously evolve, with variants such as reflected, stored, and DOM-based XSS. Traditional detection methods relying on signatures and regular expressions struggle to effectively counter emerging attack patterns [
10]. Modern web applications generate an increasing volume of traffic data, and XSS attack payloads are often obfuscated or encoded to evade detection. Traditional approaches depend on manually extracting features, which is inefficient and fails to capture deep semantic correlations. Machine learning-based methods, which rely on handcrafted XSS attack features, offer better generalization than traditional detection techniques. However, feature engineering in machine learning is complex, making it difficult to extract high-dimensional and context-aware information, and leading to a strong dependence on feature selection [
11].
With advancements in deep learning, its advantages in XSS attack detection have become increasingly evident. Deep learning models can directly map raw input data to detection outcomes, reducing human intervention and improving detection efficiency. Additionally, deep learning can automatically learn the semantic and structural characteristics of XSS attack payloads without manual feature engineering, significantly reducing false positives and false negatives [
12]. Traditional and machine learning-based methods often struggle to handle new XSS variants. In contrast, deep learning models, trained on large-scale datasets, enhance adaptability and are well-suited for complex XSS detection scenarios. The application of deep learning in XSS detection addresses key limitations of traditional methods, particularly in feature engineering, generalization, and analyzing complex attack patterns [
13]. Deep learning has gradually replaced traditional methods because of its superior learning ability, diverse detection techniques, and high flexibility and scalability levels, making it the mainstream XSS attack detection technology.
Deep learning-based XSS detection methods include single-model optimization-based detection methods, which rely on a single deep learning model for performing XSS attack detection, and ensemble model-based detection methods, which involve stacking multiple models to detect XSS attacks. Initially, researchers employed traditional neural networks such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) for design and extension processes, as these networks can effectively extract local features and long-distance dependencies from XSS texts. Alqarni et al. [
14] proposed a method based on a modular deep neural network (DNN) to reduce the false-positive rates induced during XSS attack detection tasks using Pearson correlation to determine the associations between continuous features and classes; this was followed by the use of modular neural networks (MNNs) for XSS script detection and classification. However, the modular design of their method reduces the fault tolerance of the constructed system, and synchronization between modules can lead to issues such as time delays and decreased efficiency, limiting the detection performance of the model. Yan et al. [
15] introduced an improved ResNet module based on a CNN to extract XSS features from three different dimensions; however, owing to the limited receptive field of the CNN, the model struggled to analyze global dependencies in XSS code. Nilavarasan et al. [
16] proposed a CNN-based model with multiple convolutional and pooling layers for XSS feature extraction and detection through hierarchical propagation, achieving high detection accuracy in experiments. However, traditional CNN layers still suffer from excessive parameters and slow computational efficiency in real-time XSS attack detection. This issue becomes more prominent when integrating CNNs with other deep learning models, necessitating a lightweight design. Depthwise separable convolutions, due to their unique computational approach, significantly reduce the number of parameters and improve efficiency, making them a suitable alternative to traditional CNNs. Wan et al. [
17] employed a BERT-based word embedding representation, which was then fed into a BiLSTM model to capture long-range dependencies in XSS attack payloads. However, this approach lacks the ability to extract local features effectively. Additionally, a standalone BiLSTM model is vulnerable to adversarial attacks. Integrating multiple deep learning models can effectively mitigate these limitations, enhancing robustness in XSS detection.
Single-model optimization methods for XSS attack detection offer certain advantages in terms of parameter tuning and resource savings, but their generalizability, robustness, and detection accuracy still present significant limitations when these methods are applied to different types of XSS attacks. As a result, researchers have attempted to address these issues using ensemble model-based XSS attack detection methods that integrate the strengths of multiple models to handle the complex and dynamic nature of XSS attack scenarios. Liu et al. [
18] approached the problem regarding the small sizes of XSS samples by preprocessing the input detection samples, converting them into graph structures, and then utilizing graph convolutional networks and residual networks to train a detection model. This method can effectively extract high-order features from graph structures and maintain high detection accuracy even in cases with limited training samples, but its graph structure construction process is ambiguous and overlooks the sequential relationships contained in statements. Guan et al. [
19] combined CNN and Transformer models to extract both local and global XSS features, which were then concatenated and fused. A Bi-LSTM model with an embedded self-attention mechanism was used to refine and enhance the fused features. While this approach demonstrated strong detection performance, both the Transformer and Bi-LSTM fusion models incurred significant computational overhead. To achieve a balance between detection performance and computational efficiency, the feature extraction network and fusion model design require further optimization. Nige et al. [
20] proposed a dual-channel feature extraction framework, where a DistilBERT model designed based on BERT was used alongside fully connected layers to extract and fuse raw attack features. However, due to the complexity and variability of XSS attacks, simply concatenating multiple features is insufficient for capturing the full spectrum of attack characteristics. A lightweight and efficient multifeature fusion network is needed to address this limitation. Sheneamer et al. [
21] combined the advantages of multiple deep learning models by parallelizing five models, namely, Visual Geometry Group (VGG)16, VGG19, AlexNet, ResNet, and LSTM, to extract features and then integrate and predict these features. Although this method offers some advantages in terms of detection accuracy, the complexity of the constructed models and their high resource consumption levels make them difficult to apply in real-world scenarios.
Furthermore, deep learning models are vulnerable to adversarial and backdoor attacks. Attackers can repeatedly query APIs to analyze input–output relationships, reverse-engineer model parameters or functionality, and ultimately compromise data privacy [22]. To defend against backdoor attacks, researchers have employed anomaly detection techniques to identify and remove contaminated samples [23]. In the context of XSS attack detection, effective data sanitization can help mitigate such threats. Adversarial attacks can significantly disrupt learning models, particularly single deep learning models. To enhance robustness and prevent single points of failure, an ensemble learning approach that integrates multiple models can be adopted [24].
By analyzing the limitations and challenges of the aforementioned methods, a novel XSS attack detection approach based on multisource semantic feature fusion, named the parallel convolutional recurrent attention network (PCRAN), is proposed in this paper. This approach includes three main components: XSS attack code tokenization rules, a multisource semantic feature extraction model combining dual depthwise separable CNNs (DSCNNs) and bidirectional LSTM (Bi-LSTM), and a dynamic significance-aware multihead attention fusion network (DSA-MHAFN). The specific contributions of this study are as follows.
A novel tokenization rule tailored to the structural characteristics of XSS code is proposed. Standard tokenization methods such as those provided by the Natural Language Toolkit (NLTK) or SpaCy are generally used for natural language processing tasks, and they focus on vocabulary and sentence structures. In contrast, the tokenization rule developed in this work is specifically designed to address the structural features of XSS attacks, such as HyperText Markup Language (HTML) tags, JavaScript function calls, and URLs, enabling the accurate identification of complex URLs and script structures and effectively parsing special characters and XSS attacks in multilanguage environments.
A multisource semantic feature extraction model combining a DSCNN and Bi-LSTM is proposed. The DSCNN adopts a dual-branch structure to extract the textual and syntactic features of XSS code, which are then fused to generate local semantic features. The use of depthwise separable convolutions (DSCs) and residual connections reduces the computational complexity of the constructed model and improves its detection accuracy. The Bi-LSTM model uses bidirectional LSTM layers to better understand the context of input sequences, effectively capturing global semantic information. The local semantic features derived from the DSCNN and the global semantic features acquired from Bi-LSTM are then merged.
A DSA-MHAFN is proposed. First, a saliency scorer is introduced to adjust the merged input features, increasing the sensitivity of the model to important features. The adjusted input is then passed through linear transformation and scaled dot product attention mechanisms to calculate multihead attention weights. Finally, a dynamic weight adjuster is constructed to dynamically adjust the importance of each attention head, enhancing the flexibility and expressiveness of the model.
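The parameter savings attributed to depthwise separable convolutions can be made concrete with a short calculation. The following is a minimal sketch, assuming 1-D convolutions with a single output bias; the function names and the layer sizes are illustrative and are not taken from the hyperparameter settings of this paper.

```python
def conv1d_params(kernel_size, in_channels, out_channels, bias=True):
    """Parameter count of a standard 1-D convolution layer."""
    weights = kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def separable_conv1d_params(kernel_size, in_channels, out_channels, bias=True):
    """Parameter count of a depthwise separable 1-D convolution.

    Depthwise step: one kernel per input channel (kernel_size * in_channels).
    Pointwise step: a 1x1 convolution that mixes channels
    (in_channels * out_channels).
    """
    depthwise = kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise + (out_channels if bias else 0)


# Illustrative sizes only (assumed, not the paper's configuration).
k, c_in, c_out = 3, 128, 128
standard = conv1d_params(k, c_in, c_out)
separable = separable_conv1d_params(k, c_in, c_out)
print(standard, separable, round(standard / separable, 2))
```

With these assumed sizes the standard layer needs roughly three times as many parameters as the separable one, and the gap widens as the kernel size grows.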
3. Experiments and Discussion
3.1. Configuration of the Experimental Environment and Scheme
The experimental computer configuration includes an Intel® Xeon® Silver 4110 processor, 32 GB of DDR4 memory, and an NVIDIA Quadro P2000 graphics card (with 5 GB of VRAM). The training, validation, and testing environments all run on the Windows 10 operating system, and the employed experimental IDE is PyCharm 2022.1.4. The detection approach proposed in this paper is implemented using Python 3.6, TensorFlow 1.15, and Keras 2.3.
XSS attackers inject malicious scripts into websites, exploiting security vulnerabilities to execute scripts in other users’ browsers, thereby stealing sensitive information or hijacking user sessions. These attacks can spread through URL parameters, input fields, or stored entries in databases. To mitigate such threats, XSS detection models are typically deployed on client-side Web Application Firewalls (WAFs) and web application servers to identify malicious scripts and generate alerts. The XSS detection approach proposed in this study collects sample data from both web clients and servers and deploys the model within a local client-side WAF for training and attack detection. The detailed process is illustrated in
Figure 6.
3.2. Sample Preprocessing
To study and defend against XSS attacks, researchers often require relevant datasets for experimentation and analysis. However, these datasets may contain various types of noise and inconsistent data in their raw states. Therefore, preprocessing these samples is crucial, as outlined below.
3.2.1. Dataset Collection
XSS attacks can take various forms, including stored, reflected, and DOM-based attacks. The existing publicly available XSS datasets are relatively limited, with many failing to cover all types of XSS attacks and containing large amounts of invalid data, making it difficult to accurately reflect the latest XSS attack trends. Researchers typically collect such data manually from websites and vulnerability platforms. To ensure comprehensive coverage of different XSS attack types and maintain the relevance of XSS code, an extensive literature review and a data analysis are conducted in this study, with multiple websites, authoritative vulnerability platforms, and open-source communities selected to collect XSS data. A total of 14,753 malicious XSS scripts are collected from the xssed.com website, 8748 XSS attack codes are obtained from the PortSwigger vulnerability platform, and 10,156 XSS attack codes are acquired from the open-source Payloadbox community (2022). After removing sensitive and duplicate information, 33,049 malicious samples are retained. The normal data, totaling 33,713 records, are scraped from the open DMOZ directory. The dataset covers a wide range of attack samples with various types and encoding variations and includes real-world XSS samples of different attack intensities. This ensures a balance between the positive and negative samples and preserves the diversity of the attack data and payloads. The detailed statistical information concerning the utilized datasets is shown in
Table 3.
3.2.2. Data Cleaning
XSS code often uses techniques such as mixed case and multiple encoding schemes to hide malicious code and bypass system detection. Early XSS detection systems relied on blacklist rules and regular expression matching techniques [
4,
7,
10]. However, case-mixing techniques can evade strict string matching, allowing attack payloads to take varied forms during regular expression matching. As a result, detection systems must incorporate multilayer parsing and context-aware capabilities. Modern XSS detection has increasingly adopted machine learning and deep learning models [
36,
37,
38]. However, mixed encoding and case mixing complicate the distribution of attack code, making it difficult for common word embedding models to process multi-encoded text and extract meaningful features. If a specific encoding format is not represented in the training set, the model may fail to effectively detect similar variant attacks. Thus, it is necessary to perform operations such as case conversion, multiple URL decoding, and HTML decoding on the raw data to eliminate tag and attribute confusion and restore the underlying malicious code. Converting the case of XSS payloads standardizes different variants, allowing them to be treated as identical, which helps reduce data redundancy and lower input feature dimensionality. Recursive URL decoding restores XSS payloads to their original form, facilitating the model’s ability to learn attack patterns. Additionally, HTML decoding prevents different HTML-encoded representations of the same XSS payload from being treated as distinct samples, thereby enhancing the consistency of feature representation.
Additionally, in XSS attack codes, attackers use specific numbers (e.g., counters and unique identifiers) to execute scripts and embed various forms of user redirection URLs, malicious resources, or other harmful operations, significantly increasing the complexity of the system detection process. Deep learning models typically rely on feature extraction techniques to analyze XSS code [
15,
18,
21]. However, the dynamic components of attack code, such as timestamps and counters, can cause similar attack payloads to exhibit significant differences in feature space, reducing feature consistency and weakening the effectiveness of statistical methods. Therefore, during data preprocessing, numerical normalization is applied to ensure consistency across different samples [
39]. All numbers in the XSS codes are replaced with 0, and the URLs are standardized to “http://u”, removing their specific numeric and URL content interference. Normalizing different numerical values in XSS payloads reduces the risk of model overfitting to specific numeric patterns. Replacing URLs eliminates URL variations, enhancing the model’s understanding of XSS structures. This approach reduces input data variability and improves training efficiency. Data cleaning examples are shown in
Table 4.
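The cleaning steps described in this subsection can be sketched as follows. This is a minimal illustration using only the Python standard library, not the exact pipeline used in the experiments. The function name `clean_payload`, the two regular expressions, the decoding round limit, and the choice to collapse each run of digits into a single "0" are assumptions made for the example.

```python
import html
import re
from urllib.parse import unquote

# Assumed patterns: the paper does not specify how URLs and numbers are matched.
URL_RE = re.compile(r"(?:https?|ftp)://[^\s\"'<>()]+")
NUM_RE = re.compile(r"\d+")


def clean_payload(raw, max_rounds=10):
    """Normalize one sample: decode, lowercase, then mask URLs and numbers."""
    text = raw
    # Decode repeatedly until the text stops changing, so that
    # double-encoded input such as %253C is restored to "<".
    for _ in range(max_rounds):
        decoded = html.unescape(unquote(text))
        if decoded == text:
            break
        text = decoded
    text = text.lower()
    # Mask URLs before numbers so that digits inside a URL do not split it.
    text = URL_RE.sub("http://u", text)
    text = NUM_RE.sub("0", text)
    return text


print(clean_payload("%253CScRiPt%253Ealert(123)%253C/sCrIpT%253E"))
```

Decoding in a loop rather than once is what handles multiply encoded payloads; the round limit simply guards against pathological inputs.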
3.2.3. Tokenization and Table Construction
Data cleaning eliminates noise and redundant data, restoring the true intent of the original XSS code to the greatest extent possible. However, an XSS code still contains various attributes, such as HTML tags, JavaScript functions, and URLs, which hinder the feature extraction process. Thus, a set of standardized tokenization rules, as shown in
Table 1, are designed in this study to break the input XSS code into individual elements, helping the model capture the contextual relationships of each token. After completing tokenization, the term frequency–inverse document frequency (TF–IDF) [
40] algorithm is used to construct a word list. First, the frequencies of all words in the positive samples are counted, and common stop words (e.g., “the” and “is”) that do not contribute to the feature extraction procedure are removed. A TF–IDF value is subsequently calculated for each word, and the results are sorted in descending order. The higher a TF–IDF value is, the more relevant the corresponding word is to the XSS code. The top 3000 words are selected to construct a vocabulary. Finally, each tokenized sample is converted into a list of words from the vocabulary, with words not in the vocabulary marked as “UNK.”
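The tokenization and vocabulary steps can be sketched as below. The token patterns are hypothetical stand-ins, since the rule set in Table 1 is the authoritative one and is not reproduced here. The corpus-level TF-IDF score (term frequency in the malicious samples multiplied by a smoothed inverse document frequency over all samples) is also an assumption, because the text does not state how per-word scores are aggregated. Input is expected to be cleaned and lowercased already.

```python
import math
import re
from collections import Counter

# Hypothetical token patterns, tried left to right (not the paper's Table 1).
TOKEN_RE = re.compile(
    r"""
    https?://[^\s"'<>]+        # URLs (normalized to http://u by cleaning)
  | </?[a-z][a-z0-9]*          # opening or closing tag names
  | [a-z_][\w.]*\(             # function calls
  | [a-z_][\w-]*=              # attribute or parameter names
  | "[^"]*" | '[^']*'          # quoted strings
  | [a-z_]\w*                  # bare words
  | \d+                        # numbers (normalized to 0 by cleaning)
  | [>)]                       # closing delimiters
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split a cleaned sample into tokens; unmatched characters are skipped."""
    return TOKEN_RE.findall(text)


def build_vocab(malicious, benign, top_k=3000, stop_words=("the", "is")):
    """Rank tokens by an assumed corpus-level TF-IDF score; keep the top_k."""
    mal_docs = [tokenize(s) for s in malicious]
    all_docs = mal_docs + [tokenize(s) for s in benign]
    doc_freq = Counter(tok for doc in all_docs for tok in set(doc))
    term_freq = Counter(
        tok for doc in mal_docs for tok in doc if tok not in stop_words
    )
    total = sum(term_freq.values()) or 1
    n_docs = len(all_docs)
    scores = {
        tok: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
        for tok, count in term_freq.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def encode(sample, vocab):
    """Replace out-of-vocabulary tokens with the UNK marker."""
    known = set(vocab)
    return [tok if tok in known else "UNK" for tok in tokenize(sample)]


print(tokenize("<script>alert(0)</script>"))
```

Keeping the opening delimiter attached to the tag or function name (for example `<script` and `alert(`) is one way to preserve structural context in each token; other splits are equally plausible.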
3.2.4. Word Vectorization
Neural networks cannot directly process string-based XSS code snippets. Thus, after performing data cleaning, tokenization, and vocabulary construction, the labeled word lists are converted into word vector representations. Word2Vec maps high-dimensional sparse features in XSS attack scripts to low-dimensional dense vectors, significantly reducing computational complexity. By training on neighboring word prediction, Word2Vec effectively captures the contextual semantic relationships within XSS attack payloads [
41]. Compared to complex pre-trained models such as BERT [
42,
43], Word2Vec’s shallow neural network structure avoids excessive parameterization, reducing training costs. Additionally, its local context window modeling mechanism effectively extracts character-level dependencies, outperforming global co-occurrence-based methods like GloVe [
44,
45]. Word2Vec also achieves significantly higher training and inference efficiency than deep learning models, such as Transformer, using techniques like negative sampling [
46,
47]. Therefore, the Word2Vec [48] model is used to conduct word vector training in this study. The model offers two architectures: continuous bag-of-words (CBOW) and skip-gram. The CBOW model predicts the target word using context information, making it effective for recognizing common malicious attack patterns. However, since the CBOW method implements prediction using the average of the observed context, it performs poorly in terms of recognizing rare words or specific patterns. The skip-gram model, which predicts context from the target word, excels at understanding complex contexts and recognizing special patterns. Although skip-gram is less computationally efficient than CBOW, its advantages in recognizing rare words and understanding complex contexts make it more suitable for XSS detection tasks. Therefore, the Word2Vec model uses the skip-gram structure for XSS word vector training, with the specific model parameters detailed in Table 5.
The skip-gram structure consists of an input layer, a hidden layer, and an output layer. In the input layer, the word list is input into the structure in a one-hot encoded form. The input vectors are mapped to the hidden layer through a V × N embedding matrix (where V is the vocabulary size and N is the number of neurons contained in the hidden layer, i.e., the dimensionality of the embedding vectors). The output of the hidden layer is the embedding vector for the target word. Finally, the output of the hidden layer is passed through an N × V weight matrix that transforms the embedding vector back into a vocabulary-sized vector, i.e., the output layer vector. The specific model structure is shown in Figure 7.
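The forward pass through the two matrices can be illustrated with a small pure-Python sketch. The toy sizes, the random initialization, and the full softmax over the output vector are assumptions made for clarity; practical Word2Vec training typically replaces the full softmax with an approximation such as negative sampling.

```python
import math
import random


def init_matrix(rows, cols, rng):
    """Small random matrix stored as a list of rows."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def skipgram_forward(word_index, w_in, w_out):
    """Return (embedding, context probabilities) for one target word.

    w_in  has shape V x N (input-to-hidden embedding matrix).
    w_out has shape N x V (hidden-to-output weight matrix).
    """
    # Multiplying a one-hot vector by w_in simply selects one row.
    hidden = w_in[word_index]
    vocab_size = len(w_out[0])
    scores = [
        sum(hidden[n] * w_out[n][v] for n in range(len(hidden)))
        for v in range(vocab_size)
    ]
    # Softmax turns the scores into a distribution over the vocabulary.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return hidden, [e / total for e in exps]


rng = random.Random(0)
V, N = 6, 4  # toy sizes for illustration only
w_in, w_out = init_matrix(V, N, rng), init_matrix(N, V, rng)
embedding, probs = skipgram_forward(2, w_in, w_out)
print(len(embedding), len(probs))
```

After training, only the rows of the V × N matrix are kept as word vectors; the N × V matrix is used solely to produce the prediction during training.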
The Word2Vec model maps words to high-dimensional vector spaces to capture the semantic and syntactic relationships between words. Since high-dimensional vector spaces are difficult to understand intuitively, the t-distributed stochastic neighbor embedding (t-SNE) [
50] dimensionality reduction algorithm is used to transfer high-dimensional word vectors to three-dimensional space. The distribution of the reduced word vectors in three-dimensional space is shown in
Figure 8, where the distance between a pair of data points reflects the similarity between the corresponding words in high-dimensional space. Points that are closer together exhibit higher similarity, whereas those that are farther apart possess lower similarity.
3.3. Model Construction
The experimental model for the proposed PCRAN method consists of three main components: the DSCNN model, the Bi-LSTM model, and the DSA-MHAFN multifeature fusion network. Word embeddings are fed into both the DSCNN and Bi-LSTM models. The DSCNN model, composed of a dual-branch DSC structure, primarily extracts local semantic features of XSS attacks, while the Bi-LSTM model, consisting of a two-layer LSTM structure, captures global semantic features. The extracted features are then processed through the DSA-MHAFN multifeature fusion network to generate a fused representation. The fused features pass through a dropout layer for regularization and are subsequently classified using a sigmoid classifier. The detailed hyperparameter settings for the PCRAN method are provided in
Table 6.
3.4. Evaluation Criterion
To evaluate the detection performance of the proposed approach, accuracy, precision, recall, and the F1 score are used as evaluation metrics. The calculation methods for these metrics are as follows.
Accuracy [51] is the ratio of the number of correctly classified samples (both XSS and benign) to the total number of samples. Accuracy is calculated as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision [52] is the ratio of the number of correctly predicted XSS attack samples to the total number of samples that are predicted as XSS attacks. Precision is calculated as shown below:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall [52] is the ratio of the number of correctly predicted XSS attack samples to the total number of actual XSS attack samples. Recall is calculated as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
The F1 score [53] is the harmonic mean of precision and recall, providing a balanced measure that considers both. The F1 score is calculated as follows:
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
In the above formulas, true positives (TP) refer to the number of XSS samples that are correctly detected as XSS attacks. True negatives (TN) refer to the number of benign samples that are correctly detected as benign. False positives (FP) refer to the number of benign samples that are incorrectly detected as XSS attacks. False negatives (FN) refer to the number of XSS samples that are incorrectly detected as benign.
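The four metrics can be computed directly from the confusion-matrix counts defined above. The sketch below uses the standard definitions; the function name and the example counts are illustrative and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up counts for illustration: 100 XSS samples and 100 benign samples.
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
print({name: round(value, 4) for name, value in metrics.items()})
```

The zero-denominator guards return 0.0 when a metric is undefined, for example when the model predicts no sample as an attack.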
3.5. Comparative Experimental Analysis
After the input dataset is preprocessed to generate spatial word vectors, these vectors are separately fed into the DSCNN and Bi-LSTM models to extract local and global semantic features, respectively. The DSA-MHAFN is then employed for feature fusion, with the fused features subsequently input into the XSS attack classification network. The dataset is randomly split into training, validation, and test sets at a ratio of 7:1:2. The model undergoes 100 training epochs. To assess the reliability and stability of the proposed approach, all the experiments are conducted over five independent runs, and the average values of the evaluation metrics are taken as the final experimental results. In the feature fusion phase, the DSA-MHAFN utilizes eight attention heads. The experiments in this study consist of binary classification comparison experiments, including XSS attack detection method comparisons and ablation experiments.
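A random 7:1:2 split of this kind can be sketched as follows. The function name, the fixed seed, and the use of a plain (non-stratified) shuffle are assumptions; the text states only that the split is random.

```python
import random


def split_dataset(samples, ratios=(7, 1, 2), seed=42):
    """Shuffle the samples and split them into train, validation, and test."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# 33,049 malicious + 33,713 normal samples = 66,762 records in total.
train, val, test = split_dataset(range(66762))
print(len(train), len(val), len(test))
```

Changing the seed for each of the five independent runs would give a different partition per run, which is one way to realize the averaging described above.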
3.5.1. Comparison Experiment Concerning XSS Attack Detection Methods
In the comparative experiment concerning XSS attack detection methods, the binary classification performance of the PCRAN approach proposed in this paper is evaluated against the CMABL [54], CNNL [55], VGGRL [21], CCNN [56], and CNNABL [57] methods on the self-constructed dataset. The detection results yielded by these XSS attack detection methods are presented in Table 7.
As shown in
Table 7, although some methods exhibit relatively high accuracy rates, a comprehensive evaluation conducted across all the metrics reveals that the PCRAN approach outperforms the others in terms of all the evaluation criteria. Compared with the alternative methods, the PCRAN achieves accuracy improvements of 0.49% to 2.35% and F1 score increases ranging from 0.55% to 2.35%, indicating the best overall performance.
The CMABL method employs a Bi-LSTM model integrated with a multihead self-attention mechanism. While it has notable advantages in terms of overall feature extraction and contextual understanding, its ability to extract local features is constrained, and it is more susceptible to noise interference. As a result, it handles feature sparsity suboptimally. Although CMABL maintains relatively balanced performance overall, its detection efficacy is still limited.
The CNNL method combines CNN and LSTM models to perform hierarchical feature extraction. However, owing to its serial structure, it suffers from high computational complexity, model tuning challenges, and increased latency. This results in a disparity between its precision and recall values, with limited improvements in its comprehensive performance.
The VGGRL method achieves multifeature extraction for XSS attacks by stacking various models, including VGG16, VGG19, LSTM, AlexNet, and residual networks. While its complex architecture enables multifeature extraction, it introduces significant difficulties during the model tuning and resource consumption steps. Although VGGRL demonstrates relatively good overall performance, it is not well suited for multiscenario XSS attack detection tasks.
The CCNN method enhances the original CNN model through a lightweight design, offering excellent local feature extraction capabilities for XSS attacks and resulting in better detection outcomes. However, it struggles to identify long-term dependencies in the code sequences of XSS attacks, which negatively impacts its generalization ability.
The performance of the CNNABL method, a state-of-the-art (SOTA) method, is similar to that of the approach presented in this paper. This method uses both CNN and Bi-LSTM models to incorporate local semantic features and long-range semantic information. However, its feature fusion process is limited to direct concatenation operations, which results in the insufficient integration of local and global semantic features. Consequently, the final detection results produced for XSS attacks leave room for improvement.
In terms of computational resource consumption, the experiment uses the total number of parameters of the detection network (Par) and the inference time per sample (Itime) as key performance metrics. Benefiting from the lightweight design of DSC, the PCRAN method has a lower Par than other approaches. Additionally, the parallel design of the feature extraction network and the multihead attention mechanism yields a shorter Itime than most other detection methods. While the PCRAN method incurs slightly higher resource consumption than CMABL and CCNN, it achieves significantly better detection performance. The CNNABL method demonstrates detection performance closest to PCRAN but at a higher computational cost. Overall, PCRAN achieves a favorable balance between computational resource consumption and XSS detection effectiveness.
3.5.2. Hyperparameter Comparison Experiment
The size of the convolution kernels contained in the DSCNN plays a crucial role in determining the local XSS semantic feature extraction ability of the PCRAN approach. To analyze the impacts of different convolution kernel sizes on the DSCNN, multiple detection networks using both single-kernel and dual-kernel configurations are set up. Various combinations of convolution kernel sizes are selected, and the performance of each detection network is evaluated on the basis of its precision, recall, and F1 score values. The detection performance of the tested variants is presented in
Figure 9.
In Figure 9, DSCNN-X represents a single-kernel DSCNN, DSCNN-X&X denotes a dual-kernel DSCNN, and X indicates the size of the convolution kernel. As illustrated in Figure 9, the performance differences among the single-kernel DSCNN networks are relatively small, with an average accuracy of 97.46%. The size of the convolution kernel determines the range over which XSS attack characteristics can be extracted. Smaller convolution kernels produce more features, which may contain redundant information. Larger convolution kernels, on the other hand, are better suited for identifying complex XSS patterns but may overlook key details. Consequently, a dual-kernel DSCNN can combine the advantages of both kernel sizes, resulting in better detection performance than that of a single-kernel DSCNN. In Figure 9, the average detection accuracy of the dual-kernel DSCNNs is 98.45%, which is 0.99 percentage points higher than that of the single-kernel DSCNNs. By leveraging the complementary strengths of the different kernel sizes, the dual-kernel networks can comprehensively and effectively extract the local semantic features of XSS. The features extracted by each single-kernel network are distinct. When the detection networks are capable of extracting complete XSS attack features, the performance gap among the dual-kernel networks becomes more pronounced, and the extracted XSS features are more diverse. As a result, the combination of kernel sizes two and eight in the dual-kernel DSCNN yields the best detection results, with accuracy improvements ranging from 0.53% to 1.62% over those of the other DSCNN configurations and an F1 score of 98.92%, demonstrating the effectiveness of this configuration in the detection task.
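The effect of kernel size on the extraction range can be illustrated with a deliberately simplified dual-kernel sketch. It is not the DSCNN itself: it assumes a single input channel, fixed uniform kernel weights, no padding, and global max pooling per branch, and the input sequence is hypothetical.

```python
def conv1d_valid(seq, kernel):
    """Slide `kernel` over `seq` without padding; one output per window."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    ]


def dual_kernel_features(seq, small_kernel, large_kernel):
    """Global-max-pool each branch and concatenate the two pooled values."""
    small = conv1d_valid(seq, small_kernel)
    large = conv1d_valid(seq, large_kernel)
    return [max(small), max(large)]


# Hypothetical single-channel input of 12 token scores (assumption).
seq = [0, 1, 0, 0, 3, 3, 0, 0, 1, 1, 1, 1]
small_kernel = [1.0] * 2  # short-range branch, width 2
large_kernel = [1.0] * 8  # long-range branch, width 8

print(len(conv1d_valid(seq, small_kernel)))  # 11 windows of width 2
print(len(conv1d_valid(seq, large_kernel)))  # 5 windows of width 8
print(dual_kernel_features(seq, small_kernel, large_kernel))
```

The narrow kernel yields many overlapping local responses, while the wide kernel yields fewer responses that each summarize a longer span; concatenating the pooled outputs gives the classifier access to both views.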
In the DSA-MHAFN, the number of attention heads defines the number of parallel attention paths, which significantly influences the ability to extract multidimensional, complex XSS features and the overall performance of the approach. To assess the impact of the number of attention heads on the DSA-MHAFN, experiments are conducted with varying numbers of attention heads, and the results are evaluated in terms of the accuracy, precision, recall, and F1 score metrics, as shown in Figure 10.
As shown in Figure 10, using too few attention heads fails to capture sufficient feature patterns, whereas using too many can result in overly dispersed feature distributions, making it difficult to learn effective features. Setting the number of attention heads to eight enables the DSA-MHAFN to effectively capture XSS features at multiple granularities without introducing excessive noise from irrelevant features. This setting results in a better balance between performance and efficiency. Therefore, the optimal configuration includes eight attention heads, and it consistently yields superior performance across multiple datasets, with accuracy improvements ranging from 0.33% to 1.1% over the other configurations and an F1 score of 99.92%, demonstrating its effectiveness in the detection task.
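The trade-off in the number of heads can be made concrete with a minimal sketch of conventional multihead self-attention, in which a fixed model width is divided evenly among the heads. This is an assumption about the general mechanism only: the saliency scoring and learned projections of the DSA-MHAFN are not reproduced, identity query/key/value projections are used, and the model width of 16 and the input values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(x, num_heads):
    """Split each d_model-sized vector into num_heads equal slices."""
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    return [
        [vec[h * d_head:(h + 1) * d_head] for vec in x]
        for h in range(num_heads)
    ]


def self_attention(head):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d_head = len(head[0])
    out = []
    for q in head:
        scores = [
            sum(a * b for a, b in zip(q, key)) / math.sqrt(d_head)
            for key in head
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, head))
            for j in range(d_head)
        ])
    return out


def multi_head_self_attention(x, num_heads):
    """Run attention per head and concatenate the head outputs per position."""
    heads = [self_attention(h) for h in split_heads(x, num_heads)]
    return [
        [value for head in heads for value in head[pos]]
        for pos in range(len(x))
    ]


# Hypothetical input: 4 positions, model width 16 (assumption).
x = [[((i + 1) * (j + 1)) % 5 / 5.0 for j in range(16)] for i in range(4)]
for num_heads in (1, 2, 4, 8, 16):
    out = multi_head_self_attention(x, num_heads)
    # Columns: heads, per-head width, positions, output width.
    print(num_heads, 16 // num_heads, len(out), len(out[0]))
```

In this scheme the output width is unchanged, but each head attends over a narrower slice as the head count grows, which is one plausible reading of why very large head counts can disperse the learned features.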
3.5.3. Ablation Experiments
Ablation experiments are conducted on the self-constructed dataset to compare the detection performance of the components of the PCRAN approach: the dual-branch DSCNN, the Bi-LSTM network, and the DSA-MHAFN. The detection effectiveness of each component is evaluated in terms of the accuracy, precision, recall, and F1 score metrics. The detailed experimental results are presented in Table 8.
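For reference, the four metrics used in the ablation study can be computed from a binary confusion matrix as sketched below. The standard definitions are assumed, and the counts in the example are hypothetical rather than results from Table 8.

```python
def detection_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts (assumption): XSS samples are the positive class.
metrics = detection_metrics(tp=90, fp=10, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```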
In Table 8, DSCNN-2 and DSCNN-8 represent single-branch DSCNN networks with convolution kernel sizes of two and eight, respectively. In contrast, the DSCNN refers to the parallel dual-branch DSCNN network in which both DSCNN-2 and DSCNN-8 operate simultaneously. The primary purpose of the DSCNN is to extract the local features of XSS code. The DSCNN-2 branch focuses on capturing fine-grained, short-range text features that are related to XSS, such as common attack sequence labels (e.g., “>”) and short function call names (e.g., “eval”). Moreover, the DSCNN-8 branch emphasizes broader XSS grammar features, including multilevel nested tag structures (e.g., “<scr<script>”) and dynamic JavaScript code characteristics (e.g., “eval(decodeURIComponent(x))”). By combining the strengths of both branches, the DSCNN achieves enhanced detection performance, exceeding that of the single-branch detection networks across all evaluation metrics. This demonstrates the effectiveness and advantages of the DSCNN architecture.
Bi-LSTM is employed to extract the global semantic features of XSS code, yielding excellent detection results by efficiently capturing long-distance semantic information.
In Table 8, DSA-0 represents the method that directly fuses the local semantic features extracted by the DSCNN with the global semantic features obtained from the Bi-LSTM and then performs detection, without using the DSA-MHAFN. Conversely, DSA-MHAFN refers to the method that performs feature fusion and detection through the DSA-MHAFN, which integrates both local and global semantic features. According to the data presented in Table 8, the multifeature fusion method using the DSA-MHAFN outperforms the method without it by 1.41% in terms of accuracy and by 1.51% in terms of the F1 score, further validating the rationality and superiority of the DSA-MHAFN design.
4. Conclusions
To address the limitations of existing XSS attack detection methods in feature extraction and fusion, which often lead to high false positive rates, this paper proposes an XSS detection method based on multisource semantic feature fusion. The proposed approach redesigns tokenization rules for XSS datasets and integrates DSC, Bi-LSTM, and a multihead self-attention mechanism to optimize detection performance. Experimental results demonstrate its effectiveness in both identification and classification.
This paper first reviews the current research status and motivations for XSS attack detection. Then, a standardized tokenization rule for XSS datasets is introduced. Next, a feature extraction network for both local and global XSS semantics, along with a multifeature fusion network, is proposed. Finally, multiple comparative experiments validate the proposed method’s rationality and superiority, showing significant improvements in accuracy and the F1 score over existing detection methods.
The proposed approach introduces a novel deep learning-based framework for XSS detection. By leveraging depthwise separable convolution and bidirectional LSTM, it jointly captures the local syntactic structure and global contextual dependencies of XSS code, overcoming the limitations of traditional unimodal detection methods. Additionally, a multihead attention fusion network based on saliency scoring is introduced to mitigate noise interference and feature redundancy in static feature fusion. In summary, this method provides a scalable solution for detecting emerging DOM-based XSS attacks and script variants, offering a new perspective and technical support for web security protection.
In future research, we aim to analyze additional features of XSS attacks from multiple perspectives. We also plan to extend the model to multiclass detection of different XSS attack types while continuing to extract multisource features, further enhancing its detection performance across those attack types.