Article

WordBlitz: An Efficient Hard-Label Textual Adversarial Attack Method Jointly Leveraging Adversarial Transferability and Word Importance

Xiangge Li, Hong Luo * and Yan Sun

School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(9), 3831; https://doi.org/10.3390/app14093831
Submission received: 18 March 2024 / Revised: 26 April 2024 / Accepted: 28 April 2024 / Published: 30 April 2024

Abstract

Existing textual attacks mostly perturb keywords in sentences to generate adversarial examples by relying on the prediction confidence of victim models. In practice, attackers can only access the prediction label, and because such hard-label attacks require frequent queries, the victim model can easily defend against them by denying access based on the query frequency. In this paper, we propose an efficient hard-label attack approach called WordBlitz. First, based on adversarial transferability, we train a substitute model to initialize the attack parameter set, including a candidate pool and two weight tables of keywords and candidate words. Then, adversarial examples are generated and optimized under the guidance of the two weight tables. During optimization, we design a hybrid local search algorithm with word importance to find the globally optimal solution while updating the two weight tables according to the attack results. Finally, the non-adversarial text generated during perturbation optimization is added to the training of the substitute model as data augmentation to improve the adversarial transferability. Experimental results show that WordBlitz surpasses the baselines in terms of effectiveness, efficiency, and cost. Its efficiency is especially pronounced in scenarios with broader search spaces, and its attack success rate on Chinese datasets is higher than that of the baselines.

1. Introduction

Deep Neural Networks (DNNs) have achieved unprecedented successes in the fields of text, image, and speech processing. However, their security concerns have attracted widespread attention. Studies have indicated that DNNs are vulnerable to threats from adversarial examples [1]. Attackers can mislead the classifier with subtle perturbations which are imperceptible to humans. Thus, the study of how to generate adversarial examples is essential in exploring the robustness of DNNs.
In NLP tasks, attackers primarily generate adversarial examples by perturbing keywords in the input text, which is known as a word-level attack. Such attacks are broadly categorized into white-box attacks and black-box attacks based on the information available to the attacker [2]. In black-box attacks, the attacker can only obtain the output of the target model, which makes such attacks more challenging. Previous research has mainly focused on score-based attacks [3], which estimate the importance of words by relying on the class probabilities or confidence scores of the target model. For instance, the importance of each word in a sentence can be evaluated through the change in the model's output after deleting the word. However, the scores of target models are difficult to obtain in real-world applications. Thus, a few studies have leveraged only the label for their attacks; these are termed decision-based or hard-label attacks [4].
In recent years, deep learning research has been evolving towards large language models (LLMs). LLMs are complex and have a huge number of parameters, meaning that attackers often cannot access their output scores. Consequently, we believe that hard-label attacks will become more significant in the future. Existing research in this area can be categorized into traditional attacks and transfer-based attacks.
  • Traditional attacks utilize genetic algorithms, word importance, etc., to generate adversarial examples. These methods share a common drawback: each round requires random initialization and cannot reuse the historical attack experience gained on other texts. Hence, these methods heavily depend on the quality of the random initialization [5], making them inefficient, as they require frequent access to the target model; the victim model can easily detect such an attack based on its access frequency.
  • Transfer-based attacks assume that adversarial examples designed for one victim model are likely to fool other victim models as well [6], a characteristic known as adversarial transferability. In this approach, a substitute model is trained using the input and output of the victim model, transforming a hard-label attack into a score-based or white-box attack. The advantages of this approach are its high efficiency and reduced need for access to the victim model; however, its success rate is sometimes low.
Transfer-based attacks are particularly suitable in the hard-label scenario, where access to the victim model is restricted. Numerous recent studies have focused on applying adversarial transferability to attack DNNs [7]. This research has indicated several challenges. (1) Structural Disparities between Models: Transfer-based attacks do not work well when there is a significant structural difference between the victim model and the substitute model. (2) Weaker Classification Performance of the Substitute Model: In general, stronger classifiers lead to better adversarial transferability; however, in the hard-label scenario the attacker cannot train the substitute model with enough data due to the limitations on access to the victim model, meaning that its performance suffers. (3) Poor Generalization Performance of the Substitute Model: A number of studies have suggested that data augmentation can enhance generalization performance and improve adversarial transferability [8]. How to enhance generalization performance with data augmentation remains a challenge. (4) Adversarial Overfitting Due to Multi-step Attacks: Research has indicated that multi-step attacks are more likely than single-step attacks to overfit to the parameters of the victim model. Due to the discrete nature of textual data, attack algorithms based on keyword replacement typically require multiple iterations, which reduces the effectiveness of transfer-based attacks.
To address the above issues, we propose a hard-label attack method named WordBlitz that jointly leverages word importance and adversarial transferability. First, we train a substitute model with limited training data from the victim model and extract the word importance. Then, the Attack Parameter Set (APS), including the candidate pool and word importance, is constructed from the substitute model based on adversarial transferability. Subsequently, adversarial examples are generated and optimized from the APS by a hybrid local search algorithm, while the APS is updated based on the attack results. Furthermore, we utilize the text generated in previous attacks for data augmentation to enhance the generalization of the substitute model. WordBlitz effectively addresses the issues of poor transferability caused by structural differences between models, poor classification performance, and overfitting due to multi-step attacks, as well as the need for improved universality of attack algorithms with respect to victim models. Our contributions are as follows:
  • We propose a transfer-based attack method for the hard-label scenario and change the target of transferability from label to word importance. This approach provides a heuristic method to initialize the attack parameter set from the substitute model.
  • We introduce an efficient perturbation strategy relying on the attack parameter set. The attack parameter set enables quick searching for optimal results, and is updated with the attack results to leverage historical experience.
  • We propose a data augmentation method for the substitute model that utilizes the text generated during perturbation, improving the adversarial transferability from history. As a result, our approach performs especially well in broad search spaces, such as on Chinese datasets.

2. Related Work

Score-based Attacks. Earlier studies employed score-based attack methods, in which attackers can obtain the class probabilities or confidence scores of the victim model. Most of these methods first find important words that strongly impact the confidence score of the victim model; these keywords are then replaced with words from a candidate pool. For example, Li et al. [9] selected and masked the vulnerable words with a masked language model; BERT (Bidirectional Encoder Representations from Transformers) was then used to predict the masked words and generate adversarial examples. Similarly, Garg and Ramakrishnan [10] presented BAE, which replaces and inserts tokens in the original text by masking a portion of the text and leveraging BERT-MLM to generate alternatives for the masked tokens. Li et al. [11] proposed a contextualized adversarial sample generation model that produces fluent outputs through a mask-then-infill procedure. Maheshwary et al. [12] introduced a query-efficient attack strategy that jointly leverages an attention mechanism and locality-sensitive hashing to reduce the query count. Lee et al. [13] focused on using Bayesian optimization to improve attack efficiency. Unlike these strategies, other works have used optimization procedures to craft adversarial inputs; for example, Alzantot et al. [14] proposed a population-based optimization algorithm to fool well-trained sentiment analysis and textual entailment models.
Decision-based Attacks. Only a few existing attack methods are decision-based. Zhao et al. [15] first proposed a perturbation method in the continuous latent semantic space. Ribeiro et al. [16] rewrote sentences following semantically equivalent adversarial rules. Maheshwary et al. [5,12] leveraged a population-based optimization algorithm called HLBB to craft plausible and semantically similar adversarial examples, relying only on the top label predicted by the target model. Ye et al. [4] proposed a gradient-based hard-label method named TextHoaxer that estimates the gradient directly from virtual randomly sampled directions in the embedding space rather than from concrete candidate words. Yu et al. [17] proposed TextHacker, which adopts a hybrid local search algorithm and estimates word importance from history in order to minimize the perturbation. Because of the strict query budget constraints in hard-label scenarios, decision-based attacks must be query-efficient; however, existing methods typically suffer from low attack efficiency, requiring frequent access to the output of the victim model. For example, HLBB, which uses a genetic algorithm, cannot utilize historical attack experience, while TextHacker employs a random initialization strategy. Hence, when the query budget is low, their attack success rates decrease significantly. Improving attack efficiency thus remains a key open research issue.
Transfer-based Attacks. Transfer-based attacks set the victim model as the target model and assume that adversarial examples can transfer between different models. These methods rely on training data from the target model. Vijayaraghavan and Roy [18] trained a substitute model to mimic the decision boundary of the target classifier, then generated adversarial examples against the substitute model and transferred them to the target model. However, the adversarial transferability is sometimes poor, resulting in a low attack success rate. Therefore, a few studies have explored the factors affecting this issue [7]. Most related works explain transferability from a model perspective, claiming that the decision boundary [19], model architecture [20], or generalization performance of the substitute model [21] significantly influences the adversarial transferability. Another consideration is that the attack algorithm itself may impact the adversarial transferability. Xie et al. [8] found that multi-step attacks tend to generate more overfitted adversarial perturbations with lower transferability than single-step attacks. Wang et al. [22] explained this based on the interactions inside adversarial perturbations. Motivated by these studies, a few researchers have aimed to improve the transferability; for example, Wang et al. [23] introduced data augmentation to improve the generalization performance of the substitute model and enhance the transferability.

3. Methodology

3.1. Formula Descriptions

To make the definitions more comprehensible, we first formalize textual classifiers and adversarial samples.
Textual classifier. A textual classifier can be defined as a function $F: \mathcal{X} \rightarrow \mathcal{Y}$ that maps an input text set $\mathcal{X}$ to a label set $\mathcal{Y}$, where $\mathcal{Y}$ is a set of labels such as {positive, negative} or a more extensive label set.
Adversarial samples. An attacker aims to add a small perturbation $\varepsilon$ to the original text $x$ in order to generate an adversarial example $x'$. Here, $x'$ misleads the classifier $F$ such that $F(x') \neq F(x)$; meanwhile, the generated $x'$ should be imperceptibly different from $x$. A series of metrics can be adopted to enforce this, such as $\|\varepsilon\| < \delta$, where $\delta$ is a threshold that limits the size of the perturbation.
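To make the definition concrete, the following is a minimal Python sketch of the adversarial-example check; the names `F`, `perturb`, and the word-edit-ratio metric are illustrative assumptions, not part of the paper.

```python
def is_adversarial(F, x, x_adv, perturb, delta):
    """x_adv is a valid adversarial example iff it flips the classifier's
    label while the perturbation size stays under the threshold delta."""
    return F(x_adv) != F(x) and perturb(x, x_adv) < delta

# One illustrative perturbation metric: the fraction of replaced words.
def word_edit_ratio(x, x_adv):
    changed = sum(a != b for a, b in zip(x, x_adv))
    return changed / max(len(x), 1)
```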

3.2. Framework

The structure of WordBlitz is shown in Figure 1; the framework revolves around the Attack Parameter Set (APS). WordBlitz consists of two modules, namely, Adversarial Initialization and Perturbation Optimization, along with a data augmentation process. The APS is constructed from the pretrained substitute model and updated based on the attack results against the target (victim) model; it stores the importance weights of each word in the input text along with the candidate pools. Adversarial examples are generated and optimized relying on the APS. In detail, the input text is first sent to the adversary initialization module, where vulnerable words are replaced according to the weight table in the APS to quickly generate an initial adversarial example. Next, the initial adversarial example enters the perturbation optimization module, where hybrid local search operations are used to find the optimal result. Finally, the generated non-adversarial samples are used for data augmentation of the substitute model to improve its generalization and classification performance while enhancing the adversarial transferability.

3.3. Construction of the APS

The APS is the fundamental component of WordBlitz, comprising a candidate pool and an Attack Weight Table (AWT). The candidate pool provides similar words as replacements for generating adversarial examples. Within the AWT, the position weight table documents the weights of vulnerable keywords, while the candidate weight table documents the weights of replacement words from the candidate pool. To enhance the perturbation efficiency, we initialize the position weight table by training a substitute model. The entire construction process comprises three stages: constructing the candidate pool, training the substitute model, and initializing the AWT.

3.3.1. Candidate Pool

The candidate pool is constructed based on language features and common attack methods. In the case of English, we utilize near-synonyms to form a candidate pool, which is then employed to generate perturbations through near-synonym replacement. For Chinese, we construct the candidate pool using shape-close characters, near-phonetic words, and split-character dictionaries, which are respectively utilized to generate perturbations based on glyphs, pinyin, and character splitting. For a given word $x_i$, its candidate pool is denoted by $C_{x_i} = \{\hat{x}_i^0, \hat{x}_i^1, \ldots, \hat{x}_i^m\}$.
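As an illustration, the following Python sketch builds per-word candidate pools under these rules; the tiny dictionaries are hypothetical stand-ins for the synonym, pinyin, glyph, and character-splitting resources assumed above.

```python
# Hypothetical stand-ins for the lexical resources described above.
SYNONYMS = {"brilliant": ["showy", "bright"], "moving": ["emotional", "touching"]}
NEAR_PHONETIC = {"香": ["响", "箱"]}    # near-phonetic (pinyin) characters
SHAPE_CLOSE = {"香": ["杳"]}            # visually similar glyphs
SPLIT_CHAR = {"明": ["日月"]}           # character split into components

def candidate_pool(word, lang):
    """Return the candidate pool C(x_i) for one word."""
    if lang == "en":
        return SYNONYMS.get(word.lower(), [])
    pool = []                             # Chinese: union of the three variants
    for table in (NEAR_PHONETIC, SHAPE_CLOSE, SPLIT_CHAR):
        for ch in word:
            pool += [word.replace(ch, sub) for sub in table.get(ch, [])]
    return pool

print(candidate_pool("brilliant", "en"))  # ['showy', 'bright']
print(candidate_pool("香", "zh"))          # ['响', '箱', '杳']
```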

3.3.2. Substitute Model

We set the victim model as the target model f. Due to the unstable efficiency of random or constant initialization, it is essential to devise a heuristic method that initializes the position weight table reasonably and achieves efficient perturbation of keywords. The word weights of an attention mechanism can accurately capture the extent of each word's influence. Hence, we train a BiGRU model with an attention mechanism as the substitute model $f'$ in order to initialize the position weight table in the AWT; the words can then be perturbed according to their importance. The training data are obtained from the input and output of the target model f. The input text $X = (x_1, \ldots, x_i, \ldots, x_n)$ is first encoded by the BiGRU to obtain the hidden states $H = (h_1, \ldots, h_i, \ldots, h_n)$, with the i-th hidden state $h_i$ computed as follows:

$$h_i = \mathrm{BiGRU}(x_i, h_{i-1}).$$

Then, the attention weights $\alpha = (\alpha_1, \ldots, \alpha_i, \ldots, \alpha_n)$ are calculated from H, with $\alpha_i$ formulated as

$$u_i = \tanh(W_{\mathrm{word}} h_i + b_{\mathrm{word}}),$$
$$\alpha_i = \frac{\exp(u_i^{\top} u_i)}{\sum_{j=1}^{n} \exp(u_j^{\top} u_j)},$$

where $W_{\mathrm{word}}$ is the weight of a one-layer MLP, $b_{\mathrm{word}}$ is the bias, and $u_i$ is a hidden representation of $h_i$. The final vector of the input text X is

$$s = \sum_{i=1}^{n} \alpha_i u_i.$$

After the fully connected layer and softmax layer, the probability distribution p for classification is obtained. We use the negative log-likelihood of the correct labels as the training loss L:

$$p = \mathrm{softmax}(W_{\mathrm{fc}} \cdot s + b_{\mathrm{fc}}),$$
$$L = -\sum_{k} \log p_{k,j},$$

where $W_{\mathrm{fc}}$ is the weight of the fully connected layer, $b_{\mathrm{fc}}$ is the bias, k indexes the input texts, and j is the label of X. The whole structure of $f'$ is shown in Figure 2.
The substitute model $f'$ is pretrained with only a small amount of training data obtained from the target model f. After each round of the attack, the generated non-adversarial text is utilized for data augmentation to update the parameters of $f'$, with the aim of improving the adversarial transferability.
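To make the architecture concrete, here is a minimal PyTorch sketch of such a BiGRU-plus-attention classifier; the class name, layer sizes, and the use of $u_i^{\top} u_i$ as the attention score follow the equations above, while all other details are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SubstituteModel(nn.Module):
    """Sketch of the substitute model f' (Figure 2): a BiGRU encoder with
    word attention. Hyperparameters are illustrative, not the paper's."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=128, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.bigru = nn.GRU(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.word_mlp = nn.Linear(2 * hid_dim, 2 * hid_dim)   # W_word, b_word
        self.fc = nn.Linear(2 * hid_dim, n_classes)            # W_fc, b_fc

    def forward(self, x):                      # x: (batch, seq_len) token ids
        h, _ = self.bigru(self.emb(x))         # h_i = BiGRU(x_i, h_{i-1})
        u = torch.tanh(self.word_mlp(h))       # u_i = tanh(W_word h_i + b_word)
        score = (u * u).sum(-1)                # attention score u_i^T u_i
        alpha = torch.softmax(score, dim=-1)   # attention weights alpha_i
        s = (alpha.unsqueeze(-1) * u).sum(1)   # s = sum_i alpha_i u_i
        return self.fc(s), alpha               # logits and alpha (seeds W^p)

# Training would use nn.CrossEntropyLoss (negative log-likelihood of the
# correct labels) on (X_pre, f(X_pre)) pairs queried from the target model.
```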

3.3.3. AWT Matrix

To perturb the keywords efficiently, we attempt to identify the replacement priority of words both in the sentence and in the candidate pool. These two priorities are stored in a position weight table and a candidate weight table, respectively.
  • Position Weight Table $W^p = (w_1^p, \ldots, w_i^p, \ldots, w_n^p)$. Each word and its position in the sentence influence the classification result; thus, perturbing words with higher importance has a greater impact on the overall semantics. Considering that the word weights of an attention mechanism reflect the contribution of each word in a sentence, we initialize the position weight table $W^p$ with $\alpha = (\alpha_1, \ldots, \alpha_i, \ldots, \alpha_n)$ from the substitute model $f'$; then, $W^p$ is updated based on the results of the hybrid local search in order to better adapt to the target model f.
  • Candidate Weight Table $W^c = (W_1^c, \ldots, W_i^c, \ldots, W_n^c)$. We construct $W^c$ to store the substitution score of each word in the candidate pool. The matrix has shape $(n, m+1)$, where n is the number of words in the sentence and $m+1$ is the number of words in each candidate pool (the original word plus m candidates). $W^c$ is initialized with all ones and updated based on the hybrid local search results, as sketched below.
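A minimal NumPy sketch of this initialization, assuming the attention weights have already been extracted from the substitute model:

```python
import numpy as np

def init_awt(alpha, m):
    """Sketch of AWT initialization: W^p takes the substitute model's
    attention weights alpha (one per word position), and W^c of shape
    (n, m+1) -- the original word plus m candidates -- starts at all ones."""
    w_p = np.asarray(alpha, dtype=float)
    w_c = np.ones((len(w_p), m + 1))
    return w_p, w_c

# Toy attention weights for a 4-word sentence with pools of m = 3 candidates.
w_p, w_c = init_awt([0.05, 0.40, 0.10, 0.45], m=3)
```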

3.4. Adversary Initialization

WordBlitz generates the initial adversarial examples during adversary initialization based on the APS. We design a WordSubstitution operator to substitute keywords in a sentence. To narrow down the search space, we replace only the words whose $w_i^p$ ranks in the top 50%, substituting each with a random word from its candidate pool. For $X_t$ at the t-th iteration, we perturb it based on $W^p$ and the candidate pool C to craft a new text $X_{t+1}$:

$$X_{t+1} = \mathrm{WordSubstitution}(X_t, C, W^p).$$

This iteration is repeated until $X_{t+1}$ is adversarial or the maximum number of iterations T is reached.
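The loop below is a Python sketch of this initialization stage; `predict` stands in for one hard-label query to the target model, and the top-50% selection follows the rule above.

```python
import random

def word_substitution(x_t, pool, w_p):
    """Perturb the words whose position weight ranks in the top 50%,
    each replaced by a random member of its candidate pool."""
    n = len(x_t)
    top = sorted(range(n), key=lambda i: w_p[i], reverse=True)[: max(1, n // 2)]
    out = list(x_t)
    for i in top:
        if pool.get(i):                       # pool: position -> candidate list
            out[i] = random.choice(pool[i])
    return out

def adversary_init(x, pool, w_p, predict, budget):
    """Iterate X_{t+1} = WordSubstitution(X_t, C, W^p) until the hard label
    flips or the query budget T is exhausted."""
    y_orig, x_t = predict(x), list(x)
    for _ in range(budget):
        x_t = word_substitution(x_t, pool, w_p)
        if predict(x_t) != y_orig:
            return x_t                        # initialization succeeds
    return None                               # initialization fails
```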

3.5. Perturbation Optimization

We apply a hybrid local search algorithm [24] for perturbation optimization. Hybrid local search is a kind of population-based algorithm that is effective for combinatorial optimization problems. It usually contains two key components, namely, LocalSearch and Recombination. The LocalSearch operator searches the neighborhood of each solution for a better one in order to approach a local optimum, while the Recombination operator crosses over existing solutions and accepts non-improved solutions, helping the search jump out of local optima. Inspired by TextHacker [17], we utilize an additional component called WeightUpdate in WordBlitz to allow it to learn from history.

3.5.1. LocalSearch

For an adversarial example $X_t^{\mathrm{adv}}$ at the t-th iteration, we randomly sample several (at most k) less important words $\hat{x}_i^{j_t} \in X_t^{\mathrm{adv}}$ according to $W^p$, selecting position i with probability

$$p_i = \frac{1 - \sigma(w_i^p)}{\sum_{i=1}^{n} \left(1 - \sigma(w_i^p)\right)},$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the sigmoid function, which reduces excessive gaps and makes the probabilities more reasonable. Then, each chosen word $\hat{x}_i^{j_t}$ is substituted with either the original word $\hat{x}_i^0$ or a candidate word $\hat{x}_i^{j_{t+1}} \in C_{x_i}$ according to the probability $p_{i,j}^{t+1}$ to generate a new text $X_{t+1}^{\mathrm{adv}}$. Here, $p_{i,j}^{t+1}$ is calculated as follows:

$$p_{i,j}^{t+1} = \frac{\sigma(w_{i,j}^{c})}{\sum_{j'=1}^{m} \sigma(w_{i,j'}^{c})}.$$
We accept $X_{t+1}^{\mathrm{adv}}$ if it is still adversarial; otherwise, we add it to the set D for data augmentation and return $X_t^{\mathrm{adv}}$. In short, LocalSearch uses $W^p$ and $W^c$ to substitute less important words with the original word or a candidate word, optimizing the perturbation within the k-neighborhood of $X_t^{\mathrm{adv}}$.
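The two sampling distributions can be sketched in a few lines of NumPy; index 0 of each candidate row is reserved for the original word, as assumed above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def position_probs(w_p):
    """p_i proportional to 1 - sigmoid(w_i^p): less important positions
    are more likely to be chosen for restoration or substitution."""
    q = 1.0 - sigmoid(w_p)
    return q / q.sum()

def candidate_probs(w_c_row):
    """p_{i,j} proportional to sigmoid(w_{i,j}^c), normalized over one
    candidate row (index 0 holds the original word)."""
    q = sigmoid(w_c_row)
    return q / q.sum()
```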

3.5.2. WeightUpdate

In order to better highlight the important words in the sentence and the candidate pool, we update the weight tables $W^p$ and $W^c$ using $X_t^{\mathrm{adv}}$ and $X_{t+1}^{\mathrm{adv}}$ according to the following rules:
Rule 1: For each replaced word $\hat{x}_i^{j_{t+1}}$ in $C_{x_i}$, if $X_{t+1}^{\mathrm{adv}}$ is still adversarial, then $\hat{x}_i^{j_{t+1}}$ has a positive influence on the adversary and its weight is increased with reward $r^c$ as

$$w_{i,j}^{c} \leftarrow w_{i,j}^{c} + r^c;$$

otherwise, it is reduced as

$$w_{i,j}^{c} \leftarrow w_{i,j}^{c} - r^c.$$

Rule 2: For each operated word at position i, if $X_t^{\mathrm{adv}}$ is still adversarial, then position i has less influence on the adversary and its weight is decreased with negative reward $r^p$ as

$$w_i^p \leftarrow \frac{n - r^p}{n} w_i^p;$$

otherwise, it is increased as

$$w_i^p \leftarrow \frac{n + r^p}{n} w_i^p,$$

where n is the number of words in the sentence X.
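A simplified Python sketch of the two rules follows; for brevity it couples both updates to a single adversarial check, whereas the paper applies Rule 1 to the replaced candidate and Rule 2 to the operated position separately, and the reward values are illustrative.

```python
def weight_update(stays_adversarial, i, j, w_p, w_c, r_c=1.0, r_p=1.0):
    """Apply Rule 1 to the candidate weight w_c[i][j] and Rule 2 to the
    position weight w_p[i]. r_c and r_p are illustrative reward values."""
    n = len(w_p)
    if stays_adversarial:
        w_c[i][j] += r_c                  # Rule 1: reward the candidate
        w_p[i] *= (n - r_p) / n           # Rule 2: position matters less
    else:
        w_c[i][j] -= r_c
        w_p[i] *= (n + r_p) / n
```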

3.5.3. Recombination

LocalSearch can find a better adversarial example; however, it is likely to become stuck in local optima, which Recombination effectively addresses. We combine two randomly sampled texts $X^a = (x_1^a, x_2^a, \ldots, x_n^a) \in P_t$ and $X^b = (x_1^b, x_2^b, \ldots, x_n^b) \in P_t$ to generate a recombined text $X^c = (x_1^c, x_2^c, \ldots, x_n^c)$. Each word $x_i^c$ in $X^c$ is randomly sampled from $\{x_i^a, x_i^b\}$ according to their weights in $W^c$. We repeat this operation O/2 times and return all samples, where O is the number of input adversarial examples. We then select survivors by fitness, based on the number of modified words and the semantic similarity measured with USE [25]. In summary, Recombination accepts non-improved solutions in order to escape local optima.
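A schematic Python sketch of Recombination; representing $W^c$ as a dictionary keyed by (position, word) is our simplification.

```python
import random

def recombination(population, w_c):
    """Cross over pairs of adversarial texts O/2 times; each output word is
    drawn from one of the two parents, biased by its candidate weight."""
    children = []
    for _ in range(len(population) // 2):
        a, b = random.sample(population, 2)
        child = []
        for i, (wa, wb) in enumerate(zip(a, b)):
            sa = w_c.get((i, wa), 1.0)    # schematic weight lookup
            sb = w_c.get((i, wb), 1.0)
            child.append(wa if random.random() < sa / (sa + sb) else wb)
        children.append(child)
    return children
```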

3.6. DataAugmentation

Due to limited access to the target model, we train the substitute model $f'$ with sparse data. This results in poor classification and generalization performance of $f'$, in turn leading to poor adversarial transferability. Considering that a large number of texts are generated over multiple iterations, we design a DataAugmentation operator that adds the generated non-adversarial texts to the training data of $f'$ as data augmentation to make it stronger. This adjustment enables the APS to remember the features of the target model, thereby enhancing the adversarial transferability.
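A minimal sketch of how the augmentation set $D_{da}$ from Algorithm 1 might be collected; `predict` again stands for one hard-label query to the target model.

```python
def collect_augmentation_set(generated_texts, predict, y_orig):
    """Keep the texts whose label did NOT flip (non-adversarial) and pair
    each with the target model's hard label for fine-tuning f'."""
    d_da = []
    for x in generated_texts:
        y = predict(x)                    # one hard-label query to f
        if y == y_orig:                   # non-adversarial sample
            d_da.append((x, y))
    return d_da
```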

3.7. The Whole Algorithm

The overall algorithm of WordBlitz is summarized in Algorithm 1. It comprises four fundamental steps: Construct APS (lines 4–8 in Algorithm 1, detailed in Section 3.3), Adversary Initialization (lines 10–16 in Algorithm 1, detailed in Section 3.4), Perturbation Optimization (lines 18–31 in Algorithm 1, detailed in Section 3.5), and Data Augmentation (lines 33–37 in Algorithm 1, detailed in Section 3.6).
Algorithm 1 The WordBlitz Algorithm
 1: Input: input text X, target model f, query budget T, population size S, maximum number of local search steps N, attention weights α, similarity threshold ε, perturbation rate threshold ϵ, data augmentation set D_da
 2: Output: attack result and adversarial example
 3: ▹ Construct APS (lines 4–8)
 4: for each word x_i in X do
 5:     Construct the candidate pool C(x_i)
 6: Y_pre = f(X_pre)
 7: Train the substitute model f′ with (X_pre, Y_pre)
 8: Initialize W^p with the attention weights α of f′; initialize W^c with all 1s       ▹ Initialize APS
 9: ▹ Adversary Initialization (lines 10–16)
10: X_1 = X, X_1^adv = None
11: for t = 1 to T do
12:     X_{t+1} = WordSubstitution(X_t, C, W^p)       ▹ Detailed in Section 3.4
13:     if f(X_{t+1}) ≠ f(X) then
14:         X_1^adv = X_{t+1}; break       ▹ Initialization succeeds
15: if X_1^adv is None then
16:     return False, None       ▹ Initialization fails
17: ▹ Perturbation Optimization (lines 18–31)
18: P_1 = {X_1^adv}
19: for i = 1 to S − 1 do
20:     X_{i+1}^adv = LocalSearch(X_i^adv, C(x_i), W^p, W^c); P_1 = P_1 ∪ {X_{i+1}^adv}
21: t = t + S − 1; g = 1
22: while t ≤ T do
23:     P_g = P_g ∪ Recombination(P_g, W^c)       ▹ Detailed in Section 3.5.3
24:     for each text X_g^adv in P_g do
25:         X^adv = X_g^adv
26:         for i = 1 to N do
27:             X_{i+1}^adv = LocalSearch(X_i^adv, C(x_i), W^p, W^c)       ▹ Detailed in Section 3.5.1
28:             WeightUpdate(X_i^adv, X_{i+1}^adv, W^p, W^c, f)       ▹ Update the APS based on the attack result; detailed in Section 3.5.2
29:         P_g = P_g ∪ {X_{N+1}^adv}; t = t + N
30:     Construct P_{g+1} with the top-S fitness based on the modified number and USE
31:     Record the global optimum X_best^adv based on the modified number and USE
32: ▹ Data Augmentation (lines 33–37)
33: Construct D_text with all text X_ge generated during Perturbation Optimization
34: for each X_ge in D_text do
35:     if X_ge is non-adversarial then
36:         D_da = D_da ∪ {(X_ge, f(X_ge))}       ▹ Construct the set for data augmentation
37: DataAugmentation(f′, D_da)
38: if Similarity(X_best^adv, X) ≥ ε and PerturbRate(X_best^adv, X) < ϵ then
39:     return success, X_best^adv       ▹ Attack succeeds
40: return False, None       ▹ Attack fails

4. Experiments

We conducted extensive experiments on three English datasets to validate the effectiveness of WordBlitz. In order to validate its applicability to different languages, we also conducted experiments on three Chinese datasets and additionally incorporated a language model optimized for Chinese.

4.1. Experimental Setup

Victim Models. We adopted TextCNN [26], LSTM [27], and BERT [28] as the English victim models, and TextCNN, BERT, and ERNIE [29] as the Chinese victim models. In addition, to verify the attack effect of WordBlitz on LLMs, we added ChatGLM3-6B [30] as a common victim model.
Baselines. We used two hard-label attack methods, HLBB [5] and TextHacker [17], as our baselines. HLBB uses a population-based algorithm to generate adversarial examples, while TextHacker perturbs the text based on word importance and hybrid local search.
Evaluation Metrics. The evaluation metrics included two aspects, namely, attack effectiveness and perturbation cost. Attack effectiveness encompasses the attack success rate, attack stability, and attack efficiency, while perturbation cost includes the semantic similarity and perturbation ratio. We provide detailed descriptions of these metrics in the corresponding sections.
Datasets. We adopted three English datasets (IMDB [31], Yelp [32], and MR [33]) and three Chinese datasets (Waimai, Online_Shopping, and Hotel, the last drawn from ChnSentiCorp) for text classification. These datasets are commonly used for sentiment classification tasks; each data point consists of a user review labeled as either “positive” or “negative”. Because perturbing keywords in the reviews can effectively change their sentiment polarity, these datasets are widely used in adversarial example experiments. Table 1 provides detailed information about the datasets.

4.2. Evaluation of Attack Effectiveness

We first conducted evaluations using the above datasets and models under the same query budget of 2000. The evaluation metrics were the attack success rate (Succ., %), perturbation rate (Pert., %), and semantic similarity (Sim., %), the last evaluated with the Universal Sentence Encoder [25]. Table 2 and Table 3 show the results on the English and Chinese datasets, respectively.
As shown in Table 2, WordBlitz performs better than HLBB and TextHacker. HLBB exhibits the weakest attack efficiency due to its reliance on a population optimization algorithm, which cannot learn from history; meanwhile, its inability to score candidate word importance leads to a higher perturbation rate. TextHacker introduces word importance and updates it from history to generate and optimize adversarial examples; its attack efficiency is therefore higher than that of HLBB. WordBlitz constructs the APS and initializes it based on adversarial transferability rather than random initialization, enabling it to generate adversarial examples more quickly. Thus, it achieves a higher attack success rate and semantic similarity with lower perturbation across almost all of the datasets and victim models. Taking the attacks on BERT as an example, WordBlitz outperforms the others in terms of attack success rate by clear margins of 0.3–7.9%, reduces the perturbation rate by 0.1–1.7%, and improves semantic similarity by 2.6–3%.
Unlike the English scenario, the Chinese candidate pool contains more attack types, which leads to a broader search space and challenges the attack efficiency. As shown in Table 3, the gap between the three methods is more significant on the Chinese datasets. WordBlitz and TextHacker, which perturb vulnerable words based on word importance, are more efficient than HLBB. In particular, WordBlitz initializes the APS based on transferability rather than random initialization, allowing it to achieve a 1.6–9.1% higher attack success rate than TextHacker and a 1.8–20.7% higher success rate than HLBB. These results show that WordBlitz performs better on complex languages.

4.3. Stability Evaluation

To account for the influence of random seeds on the three tested methods, we validated their stability through repeated experiments on the Yelp and Hotel datasets with the BERT model. For the stability evaluation, we used the coefficient of variation $\gamma$, formulated as

$$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (z_i - \bar{z})^2},$$
$$\gamma = \frac{s}{\bar{z}},$$

where s is the sample standard deviation, $z_i$ is each value in the sample, N is the sample size, $\bar{z}$ is the sample mean, and $\gamma$ describes the relative stability; a lower $\gamma$ indicates greater stability. Table 4 shows the results: WordBlitz has the best attack stability on both the English and Chinese datasets.
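For reference, the coefficient of variation can be computed in a few lines of NumPy (the sample values below are made up for illustration):

```python
import numpy as np

def coeff_of_variation(z):
    """gamma = s / mean(z), with s the sample standard deviation (ddof=1)."""
    z = np.asarray(z, dtype=float)
    return z.std(ddof=1) / z.mean()

print(coeff_of_variation([81.7, 82.1, 81.9, 82.3, 81.8]))  # hypothetical runs
```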

4.4. Efficiency Evaluation

In practice, the victim model can easily defend against attacks based on their anomalous access frequency, which challenges the efficiency of attack methods. The query budget for the victim model is clearly closely related to efficiency. Hence, we validated the efficiency of the three methods under different query budgets, taking attacks on the IMDB and Waimai datasets with the BERT model as examples. Figure 3 shows the results.
It can be observed that the success rates of all three attack methods decrease as the query budget decreases, with the drop from 1500 to 1000 being the most obvious. Taking the IMDB dataset as an example, HLBB exhibits the largest deterioration in attack success rate, up to 7.7%. This can be attributed to its reliance on extensive random substitutions when searching for adversarial examples instead of memorizing word importance; consequently, it is only suitable for scenarios with loose query restrictions. TextHacker experiences a smaller decline of 3.1% due to its ability to remember word importance in a weight table; however, because the weight table is randomly initialized, its accuracy decreases significantly when the query budget is strictly limited. WordBlitz demonstrates the most consistent performance, with only a marginal decrease of 1.9% in attack success rate. Even under a strictly limited query budget (≤1000), it achieves a remarkable success rate of 79.2%, which is due to the improvement in adversarial transferability provided by data augmentation. The experimental results on the Chinese Waimai dataset are similar, with the gap between the three methods being even more significant due to the broader search space of the Chinese candidate pool, which demands high efficiency.

4.5. Ablation Study

To verify the influence of the DataAugmentation (DA) component in WordBlitz, we conducted an ablation study on BERT using the Waimai dataset with a query budget of 2000, removing the DA component to validate its contribution. The results are shown in Table 5. The ablation experiment demonstrates that WordBlitz with DA performs better than the version without DA. This shows the benefit of the DataAugmentation component, which learns additional knowledge of the target model from history and enhances the adversarial transferability.

4.6. Case Study

Table 6 shows adversarial examples generated by WordBlitz when attacking BERT. When attacking English text, we used synonyms to replace the keywords “brilliant” and “moving”. When attacking Chinese text, we used a homophone to replace the keyword “香”. By perturbing a small number of keywords, WordBlitz successfully changed the output labels.

5. Conclusions

This paper proposes an efficient hard-label attack method called WordBlitz for generating high-quality adversarial samples under a strictly limited query budget. WordBlitz uses an Attack Parameter Set (APS) to remember word importance, which is initialized from a substitute model based on adversarial transferability. Adversarial examples are then generated and optimized with the APS, and the APS is updated based on the attack results. This method overcomes the low efficiency caused by random initialization. Experimental results show that WordBlitz achieves high efficiency and effectiveness, particularly on more complex languages. Compared to the baselines, WordBlitz has a higher attack success rate and lower perturbation cost, especially in scenarios where query budgets are strictly limited, making it applicable to real-world settings. In addition, the results of the ablation study show that the DataAugmentation module improves the adversarial transferability.
In the future, we intend to further explore methods leveraging adversarial transferability to make our methods more general against other large-scale language models which are more robust. Ultimately, our goal is to reveal the flaws of textual classifiers in terms of adversarial robustness in order to make future models more robust. We think that there are two potential approaches. (1) In observing the generated adversarial samples, we found that their word frequency distribution and word weight distribution changed greatly. These features could be used to construct a detector to identify adversarial samples. (2) Because Chinese adversarial examples use features such as phonemes and glyphs, adding these features to train a robust model may be a potential defense. In future research, we will further explore such defense methods to improve the robustness of neural networks.

Author Contributions

Conceptualization, X.L. and H.L.; methodology, X.L.; resources, H.L. and Y.S.; writing—original draft preparation, X.L.; writing—review and editing, X.L. and H.L.; funding acquisition, H.L. and Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Nos. 62172051, and 62272052).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Nguyen, A.; Yosinski, J.; Clune, J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 427–436. [Google Scholar]
  2. Zhang, W.E.; Sheng, Q.Z.; Alhazmi, A.; Li, C. Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Trans. Intell. Syst. Technol. (TIST) 2020, 11, 24. [Google Scholar] [CrossRef]
  3. Wang, W.; Wang, R.; Wang, L.; Wang, Z.; Ye, A. Towards a robust deep neural network in texts: A survey. arXiv 2019, arXiv:1902.07285. [Google Scholar]
  4. Ye, M.; Miao, C.; Wang, T.; Ma, F. TextHoaxer: Budgeted hard-label adversarial attacks on text. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 12–17 June 2022; Volume 36, pp. 3877–3884. [Google Scholar]
  5. Maheshwary, R.; Maheshwary, S.; Pudi, V. Generating natural language attacks in a hard label black box setting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021; Volume 35, pp. 13525–13533. [Google Scholar]
  6. Emmery, C.; Kádár, Á.; Chrupała, G. Adversarial Stylometry in the Wild: Transferable Lexical Substitution Attacks on Author Profiling. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, 19–23 April 2021; pp. 2388–2402. [Google Scholar]
  7. Zhu, Y.; Chen, Y.; Li, X.; Chen, K.; He, Y.; Tian, X.; Zheng, B.; Chen, Y.; Huang, Q. Toward understanding and boosting adversarial transferability from a distribution perspective. IEEE Trans. Image Process. 2022, 31, 6487–6501. [Google Scholar] [CrossRef] [PubMed]
  8. Xie, C.; Zhang, Z.; Zhou, Y.; Bai, S.; Wang, J.; Ren, Z.; Yuille, A.L. Improving transferability of adversarial examples with input diversity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2730–2739. [Google Scholar]
  9. Li, L.; Ma, R.; Guo, Q.; Xue, X.; Qiu, X. BERT-ATTACK: Adversarial Attack Against BERT Using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6193–6202. [Google Scholar]
  10. Garg, S.; Ramakrishnan, G. BAE: BERT-based Adversarial Examples for Text Classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 6174–6181. [Google Scholar]
  11. Li, D.; Zhang, Y.; Peng, H.; Chen, L.; Brockett, C.; Sun, M.T.; Dolan, W.B. Contextualized Perturbation for Textual Adversarial Attack. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 June 2021; pp. 5053–5069. [Google Scholar]
  12. Maheshwary, R.; Maheshwary, S.; Pudi, V. A strong baseline for query efficient attacks in a black box setting. arXiv 2021, arXiv:2109.04775. [Google Scholar]
  13. Lee, D.; Moon, S.; Lee, J.; Song, H.O. Query-efficient and scalable black-box adversarial attacks on discrete sequential data via bayesian optimization. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD USA, 17–23 July 2022; pp. 12478–12497. [Google Scholar]
  14. Alzantot, M.; Sharma, Y.; Elgohary, A.; Ho, B.J.; Srivastava, M.; Chang, K.W. Generating Natural Language Adversarial Examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 2890–2896. [Google Scholar]
  15. Zhao, Z.; Dua, D.; Singh, S. Generating Natural Adversarial Examples. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  16. Ribeiro, M.T.; Singh, S.; Guestrin, C. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 856–865. [Google Scholar]
  17. Yu, Z.; Wang, X.; Che, W.; He, K. TextHacker: Learning based Hybrid Local Search Algorithm for Text Hard-label Adversarial Attack. arXiv 2022, arXiv:2201.08193. [Google Scholar]
  18. Vijayaraghavan, P.; Roy, D. Generating black-box adversarial examples for text classifiers using a deep reinforced model. In Proceedings of the Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2019, Würzburg, Germany, 16–20 September 2019; Proceedings, Part II. Springer: Berlin/Heidelberg, Germany, 2020; pp. 711–726. [Google Scholar]
  19. Liu, Y.; Chen, X.; Liu, C.; Song, D. Delving into Transferable Adversarial Examples and Black-box Attacks. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  20. Wu, D.; Wang, Y.; Xia, S.T.; Bailey, J.; Ma, X. Skip Connections Matter: On the Transferability of Adversarial Examples Generated with ResNets. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  21. Lin, J.; Song, C.; He, K.; Wang, L.; Hopcroft, J.E. Nesterov Accelerated Gradient and Scale Invariance for Adversarial Attacks. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  22. Wang, X.; Ren, J.; Lin, S.; Zhu, X.; Wang, Y.; Zhang, Q. A Unified Approach to Interpreting and Boosting Adversarial Transferability. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021. [Google Scholar]
  23. Wang, X.; He, K. Enhancing the transferability of adversarial attacks through variance tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1924–1933. [Google Scholar]
  24. Eberhart, R.; Kennedy, J. Particle swarm optimization. In Proceedings of the IEEE International Conference on Neural Networks, Perth, Australia, 27 November–1 December 1995; Volume 4, pp. 1942–1948. [Google Scholar]
  25. Cer, D.; Yang, Y.; Kong, S.y.; Hua, N.; Limtiaco, N.; John, R.S.; Constant, N.; Guajardo-Cespedes, M.; Yuan, S.; Tar, C.; et al. Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, 31 October–4 November 2018; pp. 169–174. [Google Scholar]
  26. Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Association for Computational Linguistics: Toronto, ON, Canada, 2014. [Google Scholar]
  27. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  28. Kenton, J.D.M.W.C.; Toutanova, L.K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 3–5 June 2019; pp. 4171–4186. [Google Scholar]
  29. Sun, Y.; Wang, S.; Li, Y.; Feng, S.; Tian, H.; Wu, H.; Wang, H. Ernie 2.0: A continual pre-training framework for language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8968–8975. [Google Scholar]
  30. Zeng, A.; Liu, X.; Du, Z.; Wang, Z.; Lai, H.; Ding, M.; Yang, Z.; Xu, Y.; Zheng, W.; Xia, X.; et al. GLM-130B: An Open Bilingual Pre-trained Model. In Proceedings of the Eleventh International Conference on Learning Representations, Vienna, Austria, 25 April 2022. [Google Scholar]
  31. Maas, A.; Daly, R.E.; Pham, P.T.; Huang, D.; Ng, A.Y.; Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; pp. 142–150. [Google Scholar]
  32. Zhang, X.; Zhao, J.; LeCun, Y. Character-level convolutional networks for text classification. arXiv 2015, arXiv:1509.01626. [Google Scholar]
  33. Pang, B.; Lee, L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, Ann Arbor, MI, USA, 25 June 2005; pp. 115–124. [Google Scholar]
Figure 1. The framework of WordBlitz.
Figure 2. Structure of the substitute model f′.
Figure 3. Efficiency evaluation with the BERT model.
Table 1. The details of the datasets.

| Dataset | Language | Total Number of Samples | Description |
|---|---|---|---|
| IMDB | English | 50,000 | Movie reviews |
| Yelp | English | 94,000 | Reviews from Yelp |
| MR | English | 10,662 | Movie reviews |
| Hotel | Chinese | 10,000 | Hotel reviews |
| Waimai | Chinese | 18,000 | Reviews of take-out food |
| Online_Shopping | Chinese | 60,000 | Reviews of online shopping |
Table 2. Evaluation of attack effectiveness on English datasets.

| Model | Attack | IMDB Succ. | IMDB Pert. | IMDB Sim. | Yelp Succ. | Yelp Pert. | Yelp Sim. | MR Succ. | MR Pert. | MR Sim. |
|---|---|---|---|---|---|---|---|---|---|---|
| TextCNN | HLBB | 74.0 | 4.2 | 85.3 | 67.1 | 7.6 | 86.2 | 71.1 | 13.2 | **84.3** |
| TextCNN | TextHacker | 77.8 | **3.0** | 85.6 | 75.4 | 6.4 | 82.5 | 78.3 | **11.1** | 82.1 |
| TextCNN | WordBlitz | **78.3** | 3.1 | **87.1** | **75.9** | **6.3** | **86.4** | **78.9** | 11.8 | 84.0 |
| LSTM | HLBB | 72.1 | 4.1 | 86.4 | 61.0 | 6.6 | 85.3 | 68.3 | 11.2 | 83.5 |
| LSTM | TextHacker | 76.2 | **3.0** | 84.9 | 65.4 | 5.5 | 84.8 | 75.2 | 11.2 | 82.9 |
| LSTM | WordBlitz | **76.9** | **3.0** | **86.8** | **66.0** | **5.3** | **86.6** | **75.5** | **10.7** | **84.8** |
| BERT | HLBB | 77.0 | 4.8 | 84.1 | 57.1 | 8.2 | 84.6 | 65.8 | 11.6 | 85.2 |
| BERT | TextHacker | 81.5 | 3.4 | 83.1 | 63.2 | 6.7 | 82.2 | 73.1 | 11.4 | 83.6 |
| BERT | WordBlitz | **81.9** | **3.3** | **85.7** | **63.5** | **6.5** | **85.2** | **73.7** | **10.9** | **85.4** |
| ChatGLM3 | HLBB | 36.4 | 7.1 | 83.6 | 35.2 | 8.4 | 86.0 | 35.6 | 12.2 | 81.7 |
| ChatGLM3 | TextHacker | 37.3 | 6.2 | 84.2 | 35.9 | 7.3 | 83.4 | 36.1 | **11.3** | 82.3 |
| ChatGLM3 | WordBlitz | **40.4** | **5.9** | **85.1** | **39.8** | **7.0** | **86.1** | **37.9** | 11.5 | **83.6** |

The bold number denotes the best performance value for each dataset.
Table 3. Evaluation of attack effectiveness on Chinese datasets.

| Model | Attack | OnlineShopping Succ. | OnlineShopping Pert. | OnlineShopping Sim. | Waimai Succ. | Waimai Pert. | Waimai Sim. | Hotel Succ. | Hotel Pert. | Hotel Sim. |
|---|---|---|---|---|---|---|---|---|---|---|
| TextCNN | HLBB | 43.5 | 20.7 | 79.2 | 46.1 | 19.5 | 81.3 | 47.2 | 20.4 | 81.0 |
| TextCNN | TextHacker | 52.3 | 19.1 | 80.4 | 54.2 | **18.1** | 79.9 | 55.8 | 19.4 | 79.1 |
| TextCNN | WordBlitz | **59.8** | **18.5** | **81.5** | **60.4** | 18.3 | **83.2** | **62.1** | **18.5** | **82.4** |
| BERT | HLBB | 39.4 | 18.0 | 81.9 | 43.9 | 19.2 | 81.7 | 41.5 | 19.1 | 81.6 |
| BERT | TextHacker | 50.2 | 17.9 | 81.1 | 52.6 | 19.9 | 81.3 | 53.2 | 18.3 | 80.3 |
| BERT | WordBlitz | **60.1** | **17.4** | **83.0** | **60.3** | **18.9** | **83.8** | **59.6** | **17.6** | **81.9** |
| ERNIE | HLBB | 41.6 | 19.2 | 80.8 | 42.5 | 20.3 | 82.6 | 44.0 | 18.9 | 81.7 |
| ERNIE | TextHacker | 50.4 | 18.6 | 80.2 | 51.8 | 19.6 | 81.0 | 54.1 | 18.6 | 81.2 |
| ERNIE | WordBlitz | **58.9** | **18.3** | **82.5** | **59.4** | **19.4** | **82.9** | **60.4** | **17.8** | **83.3** |
| ChatGLM3 | HLBB | 32.5 | 18.9 | 79.3 | 33.7 | 18.4 | 81.2 | 31.8 | 17.6 | 81.4 |
| ChatGLM3 | TextHacker | 32.7 | 18.3 | 81.6 | 35.1 | 19.7 | 79.4 | 34.3 | 18.0 | 80.2 |
| ChatGLM3 | WordBlitz | **34.3** | **17.0** | **82.4** | **35.7** | **18.1** | **81.5** | **36.8** | **17.4** | **83.0** |

The bold number denotes the best performance value for each dataset.
Table 4. Stability evaluation with the BERT model.

| Attack | English: Yelp γ (Succ.) | English: Yelp γ (Pert.) | English: Yelp γ (Sim.) | Chinese: Hotel γ (Succ.) | Chinese: Hotel γ (Pert.) | Chinese: Hotel γ (Sim.) |
|---|---|---|---|---|---|---|
| HLBB | 0.0195 | 0.0259 | 0.0075 | 0.0307 | 0.0298 | 0.0093 |
| TextHacker | 0.0092 | 0.0188 | 0.0057 | 0.0175 | 0.0274 | 0.0071 |
| WordBlitz | **0.0050** | **0.0152** | **0.0042** | **0.0117** | **0.0226** | **0.0056** |

The bold number denotes the best performance value for each dataset.
Table 5. Ablation study of the DataAugmentation (DA) component of WordBlitz, using the Waimai dataset on BERT.

| Victim Model | Attack Method | Waimai Succ. | Waimai Pert. | Waimai Sim. |
|---|---|---|---|---|
| BERT | WordBlitz without DA | 58.1 | 19.2 | 82.5 |
| BERT | WordBlitz with DA | **60.3** | **18.9** | **83.8** |

The bold number denotes the best performance value for each dataset.
Table 6. Case study of WordBlitz.

| Language | Text | Example | Label |
|---|---|---|---|
| English | Original | Brilliant and moving acts by Tom Courtenay and Peter Finch. | Pos |
| English | Adversarial | **Showy** and **emotional** acts by Tom Courtenay and Peter Finch. | Neg |
| Chinese | Original | 太香 (xiāng, delicious) 太大了,全是肉,嘴唇都撑裂了。 | Pos |
| Chinese | Adversarial | 太**响** (xiǎng, noisy) 太大了,全是肉,嘴唇都撑裂了。 | Neg |

The bold words denote the words used for replacement (‘brilliant’ → ‘Showy’, ‘moving’ → ‘emotional’, ‘香’ → ‘响’).

