Explainable Artificial Intelligence with Integrated Gradients for the Detection of Adversarial Attacks on Text Classifiers
Abstract
1. Introduction
2. Related Work
2.1. Textual Adversarial Attacks
2.2. Explainable AI (XAI)
2.3. Adversarial Attack Defenses
3. Proposed Approach
3.1. Integrated Gradients (IGs)
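Integrated Gradients attribute a model's prediction to its input features by accumulating gradients along the straight-line path from a baseline input $x'$ (e.g., an all-zero embedding) to the actual input $x$. For the $i$-th feature of a model $F$, the attribution is defined as [15]:

```latex
\mathrm{IG}_i(x) = (x_i - x'_i) \int_{0}^{1} \frac{\partial F\!\left(x' + \alpha\,(x - x')\right)}{\partial x_i}\, d\alpha
```

In practice, the integral is approximated with a Riemann sum over a fixed number of interpolation steps (for example, Captum's implementation defaults to 50 steps [46]).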
3.2. Explainability-Guided Vote (EGV) Approach
3.2.1. Extracting IG Attribution Scores
Algorithm 1: Explainability-Guided Vote
3.2.2. Replacing High-Influence Tokens with Synonyms
3.2.3. Voting to Detect Adversarial Examples
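To make the three steps above concrete, the following is a minimal sketch of an EGV-style detector: it computes IG attribution scores with the transformers-interpret library [47], substitutes the highest-attribution tokens with WordNet synonyms [40], and takes a majority vote over the perturbed copies, in the spirit of randomized substitution and vote [12]. The helper names, victim checkpoint, substitution rate, vote count, and decision rule are illustrative assumptions, not the paper's exact algorithm.

```python
# A schematic EGV-style detector (illustrative; parameters and helpers are
# assumptions, not the exact configuration used in the paper).
import random
from collections import Counter

import torch
from nltk.corpus import wordnet  # requires: nltk.download("wordnet")
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import SequenceClassificationExplainer

CHECKPOINT = "textattack/bert-base-uncased-imdb"  # assumed victim model
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
explainer = SequenceClassificationExplainer(model, tokenizer)

def wordnet_synonyms(word: str) -> list[str]:
    """Distinct single-word WordNet synonyms of `word` (excluding itself)."""
    lemmas = {l.name() for s in wordnet.synsets(word) for l in s.lemmas()}
    return [w for w in lemmas if w != word and "_" not in w]

def egv_is_adversarial(text: str, num_votes: int = 9, sub_rate: float = 0.25) -> bool:
    # Step 1 (Sec. 3.2.1): IG attribution score for every input token.
    pairs = [(t, s) for t, s in explainer(text)
             if t not in tokenizer.all_special_tokens]
    y_orig = explainer.predicted_class_index

    tokens = [t for t, _ in pairs]
    scores = [s for _, s in pairs]
    # High-influence positions, ranked by |IG score| (wordpiece-to-word
    # alignment is simplified here for brevity).
    k = max(1, int(sub_rate * len(tokens)))
    top_idx = sorted(range(len(tokens)), key=lambda i: -abs(scores[i]))[:k]

    # Steps 2-3 (Secs. 3.2.2 and 3.2.3): synonym-substituted copies, then vote.
    votes = []
    for _ in range(num_votes):
        variant = list(tokens)
        for i in top_idx:
            candidates = wordnet_synonyms(variant[i])
            if candidates:
                variant[i] = random.choice(candidates)
        enc = tokenizer(" ".join(variant), return_tensors="pt", truncation=True)
        with torch.no_grad():
            votes.append(model(**enc).logits.argmax(dim=-1).item())

    # If the majority label over the perturbed copies disagrees with the
    # original prediction, flag the input as a likely adversarial example.
    majority_label, _ = Counter(votes).most_common(1)[0]
    return majority_label != y_orig
```

The intuition behind this design is that adversarial perturbations concentrate on high-attribution tokens, so replacing exactly those tokens tends to flip an adversarial input's label back, while a clean input's prediction remains stable across the perturbed copies.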
3.2.4. Complexity Analysis
4. Experiments
4.1. Attack Models
4.2. Datasets
4.3. Adversarial Attack Methods
- PWWS—This is a word-level adversarial attack that determines the order of word substitutions with a word saliency weighting method, leveraging classification probabilities while maintaining the semantic integrity of the input text [23].
- TextFooler—This is a word-level adversarial attack that utilizes a word importance ranking strategy to identify key tokens in the input text. These tokens are subsequently replaced with similar words based on word embeddings, ensuring that both the semantics and syntax of the original input are preserved [22].
- BAE—This is a black-box word-level attack that uses contextual perturbations from a BERT masked language model to generate adversarial examples. It masks a portion of the input text and lets the masked language model propose candidate tokens, which are then used to replace words in, or insert words into, the original text [24].
- DeepWordBug—This is a character-level adversarial attack that initially identifies the most important words through a scoring strategy. It then perturbs the characters of these identified words while maintaining a minimal edit distance from the original words, effectively altering the classification output [21].
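All four attacks are available as ready-made recipes in the TextAttack framework [9]. As a minimal sketch (the victim checkpoint and attack budget below are illustrative assumptions, not the paper's exact setup), adversarial examples could be generated as follows:

```python
# Generating adversarial examples with TextAttack recipes (illustrative).
import transformers
from textattack import AttackArgs, Attacker
from textattack.attack_recipes import (BAEGarg2019, DeepWordBugGao2018,
                                       PWWSRen2019, TextFoolerJin2019)
from textattack.datasets import HuggingFaceDataset
from textattack.models.wrappers import HuggingFaceModelWrapper

checkpoint = "textattack/bert-base-uncased-imdb"  # assumed victim model
model = transformers.AutoModelForSequenceClassification.from_pretrained(checkpoint)
tokenizer = transformers.AutoTokenizer.from_pretrained(checkpoint)
victim = HuggingFaceModelWrapper(model, tokenizer)

dataset = HuggingFaceDataset("imdb", split="test")
for recipe in (PWWSRen2019, TextFoolerJin2019, BAEGarg2019, DeepWordBugGao2018):
    attack = recipe.build(victim)  # each recipe reproduces the published attack
    args = AttackArgs(num_examples=100, disable_stdout=True)
    Attacker(attack, dataset, args).attack_dataset()  # logs perturbed texts
```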
4.4. Comparison Baselines
4.5. Evaluation Setup
Adversarial Example Generation
4.6. Performance Evaluation
- TP = True Positives
- TN = True Negatives
- FP = False Positives
- FN = False Negatives
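Based on these counts, the accuracy and F1 scores reported below follow the standard definitions:

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
F_1 = \frac{2\,TP}{2\,TP + FP + FN}
    = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```

where $\text{Precision} = TP/(TP+FP)$ and $\text{Recall} = TP/(TP+FN)$.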
4.6.1. Comparison Across Baselines
4.6.2. Number of Votes & Substitution Rate
4.6.3. Detecting Character-Level Adversarial Attacks
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25.
2. Nallaperuma, D.; De Silva, D.; Alahakoon, D.; Yu, X. Intelligent detection of driver behavior changes for effective coordination between autonomous and human driven vehicles. In Proceedings of the IECON 2018—44th Annual Conference of the IEEE Industrial Electronics Society, Washington, DC, USA, 21–23 October 2018; pp. 3120–3125.
3. De Silva, D.; Yu, X.; Alahakoon, D.; Holmes, G. Semi-supervised classification of characterized patterns for demand forecasting using smart electricity meters. In Proceedings of the 2011 International Conference on Electrical Machines and Systems, Beijing, China, 20–23 August 2011; pp. 1–6.
4. Mirończuk, M.M.; Protasiewicz, J. A recent overview of the state-of-the-art elements of text classification. Expert Syst. Appl. 2018, 106, 36–54.
5. Adikari, A.; De Silva, D.; Ranasinghe, W.K.; Bandaragoda, T.; Alahakoon, O.; Persad, R.; Lawrentschuk, N.; Alahakoon, D.; Bolton, D. Can online support groups address psychological morbidity of cancer patients? An artificial intelligence based investigation of prostate cancer trajectories. PLoS ONE 2020, 15, e0229361.
6. Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.J.; Fergus, R. Intriguing properties of neural networks. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, 14–16 April 2014.
7. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and Harnessing Adversarial Examples. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015.
8. Osipov, E.; Kahawala, S.; Haputhanthri, D.; Kempitiya, T.; De Silva, D.; Alahakoon, D.; Kleyko, D. Hyperseed: Unsupervised learning with vector symbolic architectures. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 6583–6597.
9. Morris, J.; Lifland, E.; Yoo, J.Y.; Grigsby, J.; Jin, D.; Qi, Y. TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 119–126.
10. Ibitoye, O.; Abou-Khamis, R.; Matrawy, A.; Shafiq, M.O. The Threat of Adversarial Attacks on Machine Learning in Network Security - A Survey. arXiv 2019, arXiv:1911.02621.
11. Fidel, G.; Bitton, R.; Shabtai, A. When Explainability Meets Adversarial Learning: Detecting Adversarial Examples using SHAP Signatures. In Proceedings of the 2020 International Joint Conference on Neural Networks, IJCNN 2020, Glasgow, UK, 19–24 July 2020.
12. Wang, X.; Xiong, Y.; He, K. Detecting textual adversarial examples through randomized substitution and vote. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, Online, 27–30 July 2021.
13. Moraliyage, H.; Kahawala, S.; De Silva, D.; Alahakoon, D. Evaluating the Adversarial Robustness of Text Classifiers in Hyperdimensional Computing. In Proceedings of the 2022 15th International Conference on Human System Interaction (HSI), Melbourne, Australia, 28–31 July 2022; pp. 1–8.
14. Chai, Y.; Liang, R.; Zhu, H.; Samtani, S.; Wang, M.; Liu, Y.; Jiang, Y. Local Post-hoc Explainable Methods for Adversarial Text Attacks. TechRxiv 2021.
15. Sundararajan, M.; Taly, A.; Yan, Q. Axiomatic Attribution for Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML'17, Sydney, NSW, Australia, 6–11 August 2017; Volume 70, pp. 3319–3328.
16. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018.
17. Zhang, W.E.; Sheng, Q.Z.; Alhazmi, A.; Li, C. Adversarial Attacks on Deep-Learning Models in Natural Language Processing: A Survey. ACM Trans. Intell. Syst. Technol. 2020, 11, 24.
18. Kleyko, D.; Osipov, E.; De Silva, D.; Wiklund, U. Integer self-organizing maps for digital hardware. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8.
19. Huber, L.; Kühn, M.A.; Mosca, E.; Groh, G. Detecting Word-Level Adversarial Text Attacks via SHapley Additive exPlanations. In Proceedings of the 7th Workshop on Representation Learning for NLP, Dublin, Ireland, 26 May 2022; pp. 156–166.
20. Ebrahimi, J.; Rao, A.; Lowd, D.; Dou, D. HotFlip: White-Box Adversarial Examples for Text Classification. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017.
21. Gao, J.; Lanchantin, J.; Soffa, M.L.; Qi, Y. Black-Box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers. In Proceedings of the 2018 IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 24 May 2018; pp. 50–56.
22. Jin, D.; Jin, Z.; Zhou, J.T.; Szolovits, P. Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8018–8025.
23. Ren, S.; Deng, Y.; He, K.; Che, W. Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 1085–1097.
24. Garg, S.; Ramakrishnan, G. BAE: BERT-based Adversarial Examples for Text Classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6174–6181.
25. Tjoa, E.; Guan, C. A Survey on Explainable Artificial Intelligence (XAI): Toward Medical XAI. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4793–4813.
26. Zini, J.E.; Awad, M. On the Explainability of Natural Language Processing Deep Models. ACM Comput. Surv. 2022, 55, 103.
27. Carrillo, A.; Cantú, L.F.; Noriega, A. Individual Explanations in Machine Learning Models: A Survey for Practitioners. arXiv 2021, arXiv:2104.04144.
28. Holzinger, A.; Saranti, A.; Molnar, C.; Biecek, P.; Samek, W. Explainable AI Methods—A Brief Overview. In xxAI—Beyond Explainable AI: International Workshop, Held in Conjunction with ICML 2020, Vienna, Austria, 18 July 2020, Revised and Extended Papers; Holzinger, A., Goebel, R., Fong, R., Moon, T., Müller, K.R., Samek, W., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 13–38.
29. Saranya, A.; Subhashini, R. A systematic review of Explainable Artificial Intelligence models and applications: Recent developments and future trends. Decis. Anal. J. 2023, 7, 100230.
30. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774.
31. Ribeiro, M.T.; Singh, S.; Guestrin, C. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144.
32. Sauka, K.; Shin, G.Y.; Kim, D.W.; Han, M.M. Adversarial Robust and Explainable Network Intrusion Detection Systems Based on Deep Learning. Appl. Sci. 2022, 12, 6451.
33. Yoo, K.; Kim, J.; Jang, J.; Kwak, N. Detection of Adversarial Examples in Text Classification: Benchmark and Baseline via Robust Density Estimation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 3656–3672.
34. Mozes, M.; Stenetorp, P.; Kleinberg, B.; Griffin, L. Frequency-Guided Word Substitutions for Detecting Textual Adversarial Examples. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 19–23 April 2021; pp. 171–186.
35. Zhou, Y.; Jiang, J.Y.; Chang, K.W.; Wang, W. Learning to Discriminate Perturbations for Blocking Adversarial Attacks in Text Classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), Hong Kong, China, 3–7 November 2019.
36. Mosca, E.; Agarwal, S.; Rando Ramírez, J.; Groh, G. "That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 7806–7816.
37. Shen, L.; Zhang, X.; Ji, S.; Pu, Y.; Ge, C.; Yang, X.; Feng, Y. TextDefense: Adversarial Text Detection based on Word Importance Entropy. arXiv 2023, arXiv:2302.05892.
38. Santoso, N.; Mendonça, I.; Aritsugi, M. Text Augmentation Based on Integrated Gradients Attribute Score for Aspect-based Sentiment Analysis. In Proceedings of the 2023 IEEE International Conference on Big Data and Smart Computing (BigComp), Jeju, Republic of Korea, 13–16 February 2023; pp. 227–234.
39. Liu, F.; Avci, B. Incorporating Priors with Feature Attribution on Text Classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Korhonen, A., Traum, D., Màrquez, L., Eds.; pp. 6274–6283.
40. Miller, G.A. WordNet: A Lexical Database for English. Commun. ACM 1995, 38, 39–41.
41. Mrkšić, N.; Ó Séaghdha, D.; Thomson, B.; Gašić, M.; Rojas-Barahona, L.M.; Su, P.H.; Vandyke, D.; Wen, T.H.; Young, S. Counter-fitting Word Vectors to Linguistic Constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 142–148.
42. Wang, X.; Jin, H.; Yang, Y.; He, K. Natural language adversarial defense through synonym encoding. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, Tel Aviv, Israel, 22–25 July 2019.
43. Miyato, T.; Dai, A.M.; Goodfellow, I.J. Adversarial Training Methods for Semi-Supervised Text Classification. arXiv 2016, arXiv:1605.07725.
44. Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1746–1751.
45. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
46. Kokhlikyan, N.; Miglani, V.; Martin, M.; Wang, E.; Alsallakh, B.; Reynolds, J.; Melnikov, A.; Kliushkina, N.; Araya, C.; Yan, S.; et al. Captum: A unified and generic model interpretability library for PyTorch. arXiv 2020, arXiv:2009.07896.
47. Pierse, C. Transformers Interpret. 2021. Available online: https://github.com/cdpierse/transformers-interpret (accessed on 15 January 2022).
Table 1. Summary of the evaluation datasets.

| Dataset | Train/Test | Classes | Avg. Words |
|---|---|---|---|
| IMDB | 25,000/25,000 | 2 | 227 |
| SST-2 | 67,349/1,821 | 2 | 19 |
| AG News | 120,000/7,600 | 4 | 38 |
Table 2. Accuracy (%) and F1 score under four adversarial attacks for our approach (Ours) and the RSV and FGWS baselines across three classifiers and three datasets. "Clean Acc." is accuracy on unperturbed test data; "N/A" rows report the undefended model's accuracy under each attack.

| Dataset | Model | Clean Acc. (%) | Method | PWWS Acc. | PWWS F1 | TextFooler Acc. | TextFooler F1 | BAE Acc. | BAE F1 | DeepWordBug Acc. | DeepWordBug F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| IMDB | Word-CNN | 87.3 | N/A | 0.17 | - | 0 | - | 20.9 | - | 7.64 | - |
|  |  |  | Ours | 78.7 | 0.79 | 81 | 0.83 | 63.5 | 0.56 | 86.3 | 0.91 |
|  |  |  | RSV | 81.7 | 0.78 | 83.9 | 0.82 | 65.6 | 0.50 | - | - |
|  |  |  | FGWS | 82.9 | 0.81 | 75.5 | 0.70 | 57.1 | 0.35 | - | - |
|  | Bi-LSTM | 87.2 | N/A | 0 | - | 0 | - | 8.27 | - | 7.25 | - |
|  |  |  | Ours | 80.4 | 0.81 | 81.2 | 0.83 | 68.2 | 0.61 | 86.9 | 0.91 |
|  |  |  | RSV | 83.7 | 0.82 | 84.1 | 0.83 | 70.5 | 0.60 | - | - |
|  |  |  | FGWS | 80.8 | 0.78 | 78.7 | 0.76 | 60 | 0.43 | - | - |
|  | BERT | 93.5 | N/A | 0 | - | 0 | - | 17.95 | - | 16.43 | - |
|  |  |  | Ours | 85.2 | 0.86 | 88.8 | 0.90 | 73.2 | 0.68 | 91.5 | 0.94 |
|  |  |  | RSV | 86.2 | 0.84 | 90.6 | 0.90 | 73.6 | 0.65 | - | - |
|  |  |  | FGWS | 90 | 0.86 | 87.5 | 0.82 | 73.7 | 0.57 | - | - |
| SST-2 | Word-CNN | 83 | N/A | 9.47 | - | 2.03 | - | 25.63 | - | 18.67 | - |
|  |  |  | Ours | 73.7 | 0.79 | 75.1 | 0.78 | 63.1 | 0.62 | 73.7 | 0.82 |
|  |  |  | RSV | 73.2 | 0.67 | 74 | 0.68 | 59.1 | 0.38 | - | - |
|  |  |  | FGWS | 73.1 | 0.68 | 67.3 | 0.57 | 54.9 | 0.31 | - | - |
|  | Bi-LSTM | 81.3 | N/A | 11.1 | - | 2.89 | - | 23.06 | - | 21.57 | - |
|  |  |  | Ours | 69.3 | 0.75 | 71.4 | 0.76 | 58.2 | 0.60 | 69.2 | 0.79 |
|  |  |  | RSV | 69.8 | 0.62 | 72.2 | 0.67 | 58.8 | 0.40 | - | - |
|  |  |  | FGWS | 69.7 | 0.61 | 63.5 | 0.49 | 53.3 | 0.24 | - | - |
|  | BERT | 91.1 | N/A | 12.68 | - | 5.25 | - | 30.3 | - | 15.25 | - |
|  |  |  | Ours | 64.5 | 0.63 | 67.5 | 0.67 | 61.9 | 0.59 | 77.9 | 0.85 |
|  |  |  | RSV | 65.5 | 0.50 | 69.4 | 0.58 | 58.4 | 0.33 | - | - |
|  |  |  | FGWS | 84.5 | 0.83 | 75.9 | 0.70 | 59.9 | 0.41 | - | - |
| AG News | Word-CNN | 93.2 | N/A | 20.68 | - | 2.1 | - | 78.62 | - | 3.85 | - |
|  |  |  | Ours | 91.1 | 0.94 | 89.6 | 0.92 | 69.5 | 0.73 | 85.6 | 0.92 |
|  |  |  | RSV | 92.2 | 0.92 | 82.8 | 0.79 | 70 | 0.60 | - | - |
|  |  |  | FGWS | 81.3 | 0.79 | 78.1 | 0.74 | 50.6 | 0.27 | - | - |
|  | Bi-LSTM | 92 | N/A | 18.15 | - | 3.02 | - | 71.34 | - | 7.67 | - |
|  |  |  | Ours | 87.8 | 0.91 | 87.3 | 0.91 | 72.2 | 0.74 | 78.2 | 0.87 |
|  |  |  | RSV | 87.4 | 0.86 | 77.1 | 0.71 | 71.2 | 0.63 | - | - |
|  |  |  | FGWS | 77.1 | 0.73 | 73.9 | 0.68 | 53.57 | 0.31 | - | - |
|  | BERT | 94.5 | N/A | 29.54 | - | 8.77 | - | 80.08 | - | 14.08 | - |
|  |  |  | Ours | 84.4 | 0.87 | 87.6 | 0.89 | 65 | 0.67 | 84.8 | 0.92 |
|  |  |  | RSV | 83 | 0.80 | 83 | 0.80 | 68.5 | 0.58 | - | - |
|  |  |  | FGWS | 88.4 | 0.87 | 82.5 | 0.78 | 51.2 | 0.24 | - | - |
© 2025 by the authors. Published by MDPI on behalf of the International Institute of Knowledge Innovation and Invention. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Citation: Moraliyage, H.; Kulawardana, G.; De Silva, D.; Issadeen, Z.; Manic, M.; Katsura, S. Explainable Artificial Intelligence with Integrated Gradients for the Detection of Adversarial Attacks on Text Classifiers. Appl. Syst. Innov. 2025, 8, 17. https://doi.org/10.3390/asi8010017