Explainable Artificial Intelligence with Integrated Gradients for the Detection of Adversarial Attacks on Text Classifiers
Abstract
1. Introduction
2. Related Work
2.1. Textual Adversarial Attacks
2.2. Explainable AI (XAI)
2.3. Adversarial Attack Defenses
3. Proposed Approach
3.1. Integrated Gradients (IGs)
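Integrated Gradients attribute a model's prediction to its input features by accumulating gradients along the straight-line path from a baseline input $x'$ (e.g., an all-zero embedding) to the actual input $x$. For the $i$-th feature of a model $F$, the attribution is defined as [15]:

```latex
\mathrm{IG}_i(x) = (x_i - x'_i) \int_{0}^{1} \frac{\partial F\!\left(x' + \alpha\,(x - x')\right)}{\partial x_i}\, d\alpha
```

In practice, the integral is approximated with a Riemann sum over a fixed number of interpolation steps (for example, Captum's implementation defaults to 50 steps [46]).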
3.2. Explainability-Guided Vote (EGV) Approach
3.2.1. Extracting IG Attribution Scores
Algorithm 1: Explainability-Guided Vote
3.2.2. Replacing High-Influence Tokens with Synonyms
3.2.3. Voting to Detect Adversarial Examples
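To make the three steps above concrete, the following is a minimal sketch of an EGV-style detector: it computes IG attribution scores with the transformers-interpret library [47], substitutes the highest-attribution tokens with WordNet synonyms [40], and takes a majority vote over the perturbed copies, in the spirit of randomized substitution and vote [12]. The helper names, victim checkpoint, substitution rate, vote count, and decision rule are illustrative assumptions, not the paper's exact algorithm.

```python
# A schematic EGV-style detector (illustrative; parameters and helpers are
# assumptions, not the exact configuration used in the paper).
import random
from collections import Counter

import torch
from nltk.corpus import wordnet  # requires: nltk.download("wordnet")
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import SequenceClassificationExplainer

CHECKPOINT = "textattack/bert-base-uncased-imdb"  # assumed victim model
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
explainer = SequenceClassificationExplainer(model, tokenizer)

def wordnet_synonyms(word: str) -> list[str]:
    """Distinct single-word WordNet synonyms of `word` (excluding itself)."""
    lemmas = {l.name() for s in wordnet.synsets(word) for l in s.lemmas()}
    return [w for w in lemmas if w != word and "_" not in w]

def egv_is_adversarial(text: str, num_votes: int = 9, sub_rate: float = 0.25) -> bool:
    # Step 1 (Sec. 3.2.1): IG attribution score for every input token.
    pairs = [(t, s) for t, s in explainer(text)
             if t not in tokenizer.all_special_tokens]
    y_orig = explainer.predicted_class_index

    tokens = [t for t, _ in pairs]
    scores = [s for _, s in pairs]
    # High-influence positions, ranked by |IG score| (wordpiece-to-word
    # alignment is simplified here for brevity).
    k = max(1, int(sub_rate * len(tokens)))
    top_idx = sorted(range(len(tokens)), key=lambda i: -abs(scores[i]))[:k]

    # Steps 2-3 (Secs. 3.2.2 and 3.2.3): synonym-substituted copies, then vote.
    votes = []
    for _ in range(num_votes):
        variant = list(tokens)
        for i in top_idx:
            candidates = wordnet_synonyms(variant[i])
            if candidates:
                variant[i] = random.choice(candidates)
        enc = tokenizer(" ".join(variant), return_tensors="pt", truncation=True)
        with torch.no_grad():
            votes.append(model(**enc).logits.argmax(dim=-1).item())

    # If the majority label over the perturbed copies disagrees with the
    # original prediction, flag the input as a likely adversarial example.
    majority_label, _ = Counter(votes).most_common(1)[0]
    return majority_label != y_orig
```

The intuition behind this design is that adversarial perturbations concentrate on high-attribution tokens, so replacing exactly those tokens tends to flip an adversarial input's label back, while a clean input's prediction remains stable across the perturbed copies.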
3.2.4. Complexity Analysis
4. Experiments
4.1. Attack Models
4.2. Datasets
4.3. Adversarial Attack Methods
- PWWS—This is a word-level adversarial attack that determines the order of word substitutions with a word saliency weighting method, leveraging classification probabilities while maintaining the semantic integrity of the input text [23].
- TextFooler—This is a word-level adversarial attack that utilizes a word importance ranking strategy to identify key tokens in the input text. These tokens are subsequently replaced with similar words based on word embeddings, ensuring that both the semantics and syntax of the original input are preserved [22].
- BAE—This is a black-box word-level attack that uses contextual perturbations from a BERT masked language model to generate adversarial examples. It masks a portion of the input text and lets the masked language model propose candidate tokens, which are then used to replace words in, or insert words into, the original text [24].
- DeepWordBug—This is a character-level adversarial attack that initially identifies the most important words through a scoring strategy. It then perturbs the characters of these identified words while maintaining a minimal edit distance from the original words, effectively altering the classification output [21].
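All four attacks are available as ready-made recipes in the TextAttack framework [9]. As a minimal sketch (the victim checkpoint and attack budget below are illustrative assumptions, not the paper's exact setup), adversarial examples could be generated as follows:

```python
# Generating adversarial examples with TextAttack recipes (illustrative).
import transformers
from textattack import AttackArgs, Attacker
from textattack.attack_recipes import (BAEGarg2019, DeepWordBugGao2018,
                                       PWWSRen2019, TextFoolerJin2019)
from textattack.datasets import HuggingFaceDataset
from textattack.models.wrappers import HuggingFaceModelWrapper

checkpoint = "textattack/bert-base-uncased-imdb"  # assumed victim model
model = transformers.AutoModelForSequenceClassification.from_pretrained(checkpoint)
tokenizer = transformers.AutoTokenizer.from_pretrained(checkpoint)
victim = HuggingFaceModelWrapper(model, tokenizer)

dataset = HuggingFaceDataset("imdb", split="test")
for recipe in (PWWSRen2019, TextFoolerJin2019, BAEGarg2019, DeepWordBugGao2018):
    attack = recipe.build(victim)  # each recipe reproduces the published attack
    args = AttackArgs(num_examples=100, disable_stdout=True)
    Attacker(attack, dataset, args).attack_dataset()  # logs perturbed texts
```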
4.4. Comparison Baselines
4.5. Evaluation Setup
Adversarial Example Generation
4.6. Performance Evaluation
- TP = True Positives
- TN = True Negatives
- FP = False Positives
- FN = False Negatives
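Based on these counts, the accuracy and F1 scores reported below follow the standard definitions:

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
F_1 = \frac{2\,TP}{2\,TP + FP + FN}
    = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```

where $\text{Precision} = TP/(TP+FP)$ and $\text{Recall} = TP/(TP+FN)$.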
4.6.1. Comparison Across Baselines
4.6.2. Number of Votes & Substitution Rate
4.6.3. Detecting Character-Level Adversarial Attacks
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25.
2. Nallaperuma, D.; De Silva, D.; Alahakoon, D.; Yu, X. Intelligent detection of driver behavior changes for effective coordination between autonomous and human driven vehicles. In Proceedings of the IECON 2018—44th Annual Conference of the IEEE Industrial Electronics Society, Washington, DC, USA, 21–23 October 2018; pp. 3120–3125.
3. De Silva, D.; Yu, X.; Alahakoon, D.; Holmes, G. Semi-supervised classification of characterized patterns for demand forecasting using smart electricity meters. In Proceedings of the 2011 International Conference on Electrical Machines and Systems, Beijing, China, 20–23 August 2011; pp. 1–6.
4. Mirończuk, M.M.; Protasiewicz, J. A recent overview of the state-of-the-art elements of text classification. Expert Syst. Appl. 2018, 106, 36–54.
5. Adikari, A.; De Silva, D.; Ranasinghe, W.K.; Bandaragoda, T.; Alahakoon, O.; Persad, R.; Lawrentschuk, N.; Alahakoon, D.; Bolton, D. Can online support groups address psychological morbidity of cancer patients? An artificial intelligence based investigation of prostate cancer trajectories. PLoS ONE 2020, 15, e0229361.
6. Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.J.; Fergus, R. Intriguing properties of neural networks. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 14–16 April 2014.
7. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and Harnessing Adversarial Examples. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015.
8. Osipov, E.; Kahawala, S.; Haputhanthri, D.; Kempitiya, T.; De Silva, D.; Alahakoon, D.; Kleyko, D. Hyperseed: Unsupervised learning with vector symbolic architectures. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 6583–6597.
9. Morris, J.; Lifland, E.; Yoo, J.Y.; Grigsby, J.; Jin, D.; Qi, Y. TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 119–126.
10. Ibitoye, O.; Abou-Khamis, R.; Matrawy, A.; Shafiq, M.O. The Threat of Adversarial Attacks on Machine Learning in Network Security - A Survey. arXiv 2019, arXiv:1911.02621.
11. Fidel, G.; Bitton, R.; Shabtai, A. When Explainability Meets Adversarial Learning: Detecting Adversarial Examples using SHAP Signatures. In Proceedings of the 2020 International Joint Conference on Neural Networks, IJCNN 2020, Glasgow, UK, 19–24 July 2020.
12. Wang, X.; Xiong, Y.; He, K. Detecting textual adversarial examples through randomized substitution and vote. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, Online, 27–30 July 2021.
13. Moraliyage, H.; Kahawala, S.; De Silva, D.; Alahakoon, D. Evaluating the Adversarial Robustness of Text Classifiers in Hyperdimensional Computing. In Proceedings of the 2022 15th International Conference on Human System Interaction (HSI), Melbourne, Australia, 28–31 July 2022; pp. 1–8.
14. Chai, Y.; Liang, R.; Zhu, H.; Samtani, S.; Wang, M.; Liu, Y.; Jiang, Y. Local Post-hoc Explainable Methods for Adversarial Text Attacks. TechRxiv 2021.
15. Sundararajan, M.; Taly, A.; Yan, Q. Axiomatic Attribution for Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML'17, Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 3319–3328.
16. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018.
17. Zhang, W.E.; Sheng, Q.Z.; Alhazmi, A.; Li, C. Adversarial Attacks on Deep-Learning Models in Natural Language Processing: A Survey. ACM Trans. Intell. Syst. Technol. 2020, 11, 24.
18. Kleyko, D.; Osipov, E.; De Silva, D.; Wiklund, U. Integer self-organizing maps for digital hardware. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8.
19. Huber, L.; Kühn, M.A.; Mosca, E.; Groh, G. Detecting Word-Level Adversarial Text Attacks via SHapley Additive exPlanations. In Proceedings of the 7th Workshop on Representation Learning for NLP, Dublin, Ireland, 26 May 2022; pp. 156–166.
20. Ebrahimi, J.; Rao, A.; Lowd, D.; Dou, D. HotFlip: White-Box Adversarial Examples for Text Classification. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017.
21. Gao, J.; Lanchantin, J.; Soffa, M.L.; Qi, Y. Black-Box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers. In Proceedings of the 2018 IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 24 May 2018; pp. 50–56.
22. Jin, D.; Jin, Z.; Zhou, J.T.; Szolovits, P. Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8018–8025.
23. Ren, S.; Deng, Y.; He, K.; Che, W. Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 1085–1097.
24. Garg, S.; Ramakrishnan, G. BAE: BERT-based Adversarial Examples for Text Classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6174–6181.
25. Tjoa, E.; Guan, C. A Survey on Explainable Artificial Intelligence (XAI): Toward Medical XAI. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4793–4813.
26. Zini, J.E.; Awad, M. On the Explainability of Natural Language Processing Deep Models. ACM Comput. Surv. 2022, 55, 103.
27. Carrillo, A.; Cantú, L.F.; Noriega, A. Individual Explanations in Machine Learning Models: A Survey for Practitioners. arXiv 2021, arXiv:2104.04144.
28. Holzinger, A.; Saranti, A.; Molnar, C.; Biecek, P.; Samek, W. Explainable AI Methods—A Brief Overview. In xxAI—Beyond Explainable AI: International Workshop, Held in Conjunction with ICML 2020, Vienna, Austria, 18 July 2020, Revised and Extended Papers; Holzinger, A., Goebel, R., Fong, R., Moon, T., Müller, K.R., Samek, W., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 13–38.
29. Saranya, A.; Subhashini, R. A systematic review of Explainable Artificial Intelligence models and applications: Recent developments and future trends. Decis. Anal. J. 2023, 7, 100230.
30. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774.
31. Ribeiro, M.T.; Singh, S.; Guestrin, C. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144.
32. Sauka, K.; Shin, G.Y.; Kim, D.W.; Han, M.M. Adversarial Robust and Explainable Network Intrusion Detection Systems Based on Deep Learning. Appl. Sci. 2022, 12, 6451.
33. Yoo, K.; Kim, J.; Jang, J.; Kwak, N. Detection of Adversarial Examples in Text Classification: Benchmark and Baseline via Robust Density Estimation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 3656–3672.
34. Mozes, M.; Stenetorp, P.; Kleinberg, B.; Griffin, L. Frequency-Guided Word Substitutions for Detecting Textual Adversarial Examples. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 19–23 April 2021; pp. 171–186.
35. Zhou, Y.; Jiang, J.Y.; Chang, K.W.; Wang, W. Learning to Discriminate Perturbations for Blocking Adversarial Attacks in Text Classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), Hong Kong, China, 3–7 November 2019.
36. Mosca, E.; Agarwal, S.; Rando Ramírez, J.; Groh, G. "That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 7806–7816.
37. Shen, L.; Zhang, X.; Ji, S.; Pu, Y.; Ge, C.; Yang, X.; Feng, Y. TextDefense: Adversarial Text Detection based on Word Importance Entropy. arXiv 2023, arXiv:2302.05892.
38. Santoso, N.; Mendonça, I.; Aritsugi, M. Text Augmentation Based on Integrated Gradients Attribute Score for Aspect-based Sentiment Analysis. In Proceedings of the 2023 IEEE International Conference on Big Data and Smart Computing (BigComp), Jeju, Republic of Korea, 13–16 February 2023; pp. 227–234.
39. Liu, F.; Avci, B. Incorporating Priors with Feature Attribution on Text Classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Korhonen, A., Traum, D., Màrquez, L., Eds.; pp. 6274–6283.
40. Miller, G.A. WordNet: A Lexical Database for English. Commun. ACM 1995, 38, 39–41.
41. Mrkšić, N.; Ó Séaghdha, D.; Thomson, B.; Gašić, M.; Rojas-Barahona, L.M.; Su, P.H.; Vandyke, D.; Wen, T.H.; Young, S. Counter-fitting Word Vectors to Linguistic Constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 142–148.
42. Wang, X.; Jin, H.; Yang, Y.; He, K. Natural language adversarial defense through synonym encoding. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, Tel Aviv, Israel, 22–25 July 2019.
43. Miyato, T.; Dai, A.M.; Goodfellow, I.J. Adversarial Training Methods for Semi-Supervised Text Classification. arXiv 2016, arXiv:1605.07725.
44. Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1746–1751.
45. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
46. Kokhlikyan, N.; Miglani, V.; Martin, M.; Wang, E.; Alsallakh, B.; Reynolds, J.; Melnikov, A.; Kliushkina, N.; Araya, C.; Yan, S.; et al. Captum: A unified and generic model interpretability library for PyTorch. arXiv 2020, arXiv:2009.07896.
47. Pierse, C. Transformers Interpret. 2021. Available online: https://github.com/cdpierse/transformers-interpret (accessed on 15 January 2022).
Table 1. Summary of the evaluation datasets.

| Dataset | Train/Test | Classes | Avg. Words |
|---|---|---|---|
| IMDB | 25,000/25,000 | 2 | 227 |
| SST-2 | 67,349/1,821 | 2 | 19 |
| AG News | 120,000/7,600 | 4 | 38 |
Table 2. Accuracy (%) and F1 score under four adversarial attacks for our approach (Ours) and the RSV and FGWS baselines across three classifiers and three datasets. "Clean Acc." is accuracy on unperturbed test data; "N/A" rows report the undefended model's accuracy under each attack.

| Dataset | Model | Clean Acc. (%) | Method | PWWS Acc. | PWWS F1 | TextFooler Acc. | TextFooler F1 | BAE Acc. | BAE F1 | DeepWordBug Acc. | DeepWordBug F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| IMDB | Word-CNN | 87.3 | N/A | 0.17 | - | 0 | - | 20.9 | - | 7.64 | - |
|  |  |  | Ours | 78.7 | 0.79 | 81 | 0.83 | 63.5 | 0.56 | 86.3 | 0.91 |
|  |  |  | RSV | 81.7 | 0.78 | 83.9 | 0.82 | 65.6 | 0.50 | - | - |
|  |  |  | FGWS | 82.9 | 0.81 | 75.5 | 0.70 | 57.1 | 0.35 | - | - |
|  | Bi-LSTM | 87.2 | N/A | 0 | - | 0 | - | 8.27 | - | 7.25 | - |
|  |  |  | Ours | 80.4 | 0.81 | 81.2 | 0.83 | 68.2 | 0.61 | 86.9 | 0.91 |
|  |  |  | RSV | 83.7 | 0.82 | 84.1 | 0.83 | 70.5 | 0.60 | - | - |
|  |  |  | FGWS | 80.8 | 0.78 | 78.7 | 0.76 | 60 | 0.43 | - | - |
|  | BERT | 93.5 | N/A | 0 | - | 0 | - | 17.95 | - | 16.43 | - |
|  |  |  | Ours | 85.2 | 0.86 | 88.8 | 0.90 | 73.2 | 0.68 | 91.5 | 0.94 |
|  |  |  | RSV | 86.2 | 0.84 | 90.6 | 0.90 | 73.6 | 0.65 | - | - |
|  |  |  | FGWS | 90 | 0.86 | 87.5 | 0.82 | 73.7 | 0.57 | - | - |
| SST-2 | Word-CNN | 83 | N/A | 9.47 | - | 2.03 | - | 25.63 | - | 18.67 | - |
|  |  |  | Ours | 73.7 | 0.79 | 75.1 | 0.78 | 63.1 | 0.62 | 73.7 | 0.82 |
|  |  |  | RSV | 73.2 | 0.67 | 74 | 0.68 | 59.1 | 0.38 | - | - |
|  |  |  | FGWS | 73.1 | 0.68 | 67.3 | 0.57 | 54.9 | 0.31 | - | - |
|  | Bi-LSTM | 81.3 | N/A | 11.1 | - | 2.89 | - | 23.06 | - | 21.57 | - |
|  |  |  | Ours | 69.3 | 0.75 | 71.4 | 0.76 | 58.2 | 0.60 | 69.2 | 0.79 |
|  |  |  | RSV | 69.8 | 0.62 | 72.2 | 0.67 | 58.8 | 0.40 | - | - |
|  |  |  | FGWS | 69.7 | 0.61 | 63.5 | 0.49 | 53.3 | 0.24 | - | - |
|  | BERT | 91.1 | N/A | 12.68 | - | 5.25 | - | 30.3 | - | 15.25 | - |
|  |  |  | Ours | 64.5 | 0.63 | 67.5 | 0.67 | 61.9 | 0.59 | 77.9 | 0.85 |
|  |  |  | RSV | 65.5 | 0.50 | 69.4 | 0.58 | 58.4 | 0.33 | - | - |
|  |  |  | FGWS | 84.5 | 0.83 | 75.9 | 0.70 | 59.9 | 0.41 | - | - |
| AG News | Word-CNN | 93.2 | N/A | 20.68 | - | 2.1 | - | 78.62 | - | 3.85 | - |
|  |  |  | Ours | 91.1 | 0.94 | 89.6 | 0.92 | 69.5 | 0.73 | 85.6 | 0.92 |
|  |  |  | RSV | 92.2 | 0.92 | 82.8 | 0.79 | 70 | 0.60 | - | - |
|  |  |  | FGWS | 81.3 | 0.79 | 78.1 | 0.74 | 50.6 | 0.27 | - | - |
|  | Bi-LSTM | 92 | N/A | 18.15 | - | 3.02 | - | 71.34 | - | 7.67 | - |
|  |  |  | Ours | 87.8 | 0.91 | 87.3 | 0.91 | 72.2 | 0.74 | 78.2 | 0.87 |
|  |  |  | RSV | 87.4 | 0.86 | 77.1 | 0.71 | 71.2 | 0.63 | - | - |
|  |  |  | FGWS | 77.1 | 0.73 | 73.9 | 0.68 | 53.57 | 0.31 | - | - |
|  | BERT | 94.5 | N/A | 29.54 | - | 8.77 | - | 80.08 | - | 14.08 | - |
|  |  |  | Ours | 84.4 | 0.87 | 87.6 | 0.89 | 65 | 0.67 | 84.8 | 0.92 |
|  |  |  | RSV | 83 | 0.80 | 83 | 0.80 | 68.5 | 0.58 | - | - |
|  |  |  | FGWS | 88.4 | 0.87 | 82.5 | 0.78 | 51.2 | 0.24 | - | - |
© 2025 by the authors. Published by MDPI on behalf of the International Institute of Knowledge Innovation and Invention. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Citation: Moraliyage, H.; Kulawardana, G.; De Silva, D.; Issadeen, Z.; Manic, M.; Katsura, S. Explainable Artificial Intelligence with Integrated Gradients for the Detection of Adversarial Attacks on Text Classifiers. Appl. Syst. Innov. 2025, 8, 17. https://doi.org/10.3390/asi8010017