Article

Bilateral Adversarial Patch Generating Network for the Object Tracking Algorithm

Unmanned System Research Institute, Northwestern Polytechnical University, Xi’an 710072, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(14), 3670; https://doi.org/10.3390/rs15143670
Submission received: 19 May 2023 / Revised: 17 July 2023 / Accepted: 18 July 2023 / Published: 23 July 2023

Abstract

Deep learning-based algorithms for single object tracking (SOT) have shown impressive performance but remain susceptible to adversarial patch attacks. However, existing adversarial patch generation methods focus primarily on generating patches within the search region and neglect template information, which limits their effectiveness in carrying out successful attacks. There is also a lack of evaluation metrics to assess a patch’s adversarial ability. In this study, we propose a bilateral adversarial patch-generating network to address these limitations and advance the field of adversarial patch generation for SOT networks. Our network leverages a Focus structure that effectively integrates both template and search region information, generating separate adversarial patches for each branch. We also introduce the DeFocus structure to resolve the size discrepancy between the template and search region of the tracking network. To effectively mislead the tracking network, we design adversarial object loss and adversarial regression loss functions tailored to the network’s output. Moreover, we propose a comprehensive evaluation metric that measures a patch’s adversarial ability by relating the patch’s relative size to its attack performance. Because objects in UAV views are typically small and therefore require smaller patches, we evaluate our approach on the UAV123 and UAVDT datasets. Our evaluation covers not only the overall attack performance but also the effectiveness of our strategy and the transferability of the attacks. Experimental results demonstrate that our algorithm generates patches with higher attack efficiency than existing methods.

1. Introduction

Object tracking in video streams is a fundamental computer vision task that involves locating a specific object across consecutive frames based on its initial location and size in a frame. This technique has found widespread applications in various domains, including security monitoring [1,2], autopilot systems [3], and intelligent robotics [4]. With the advancements in deep learning [5], significant progress has been made in object tracking algorithms, particularly in Siamese network [6] based tracking algorithms, which exhibit high precision and speed. However, from a security standpoint, it has been observed that deep learning algorithms are susceptible to adversarial samples [7].
Single object tracking (SOT) algorithms based on Siamese networks heavily rely on deep learning techniques, rendering them susceptible to adversarial attacks. While these attacks can be used to enhance the robustness of the tracking algorithm [8,9], they can also potentially cripple its performance. Currently, two primary types of adversarial attacks are employed for SOT: generating subtle adversarial perturbations and attaching specific-colored adversarial patches. Although the former method has better attack performance, it is difficult to apply to the physical world because it adds subtle noise at the pixel level. The latter method mainly achieves deception of the target tracking network by adding adversarial patches to the search area. Since adversarial patches can usually be applied in the physical world, they can be used to protect important targets from surveillance or intelligent drone operations [10]. Therefore, research on generating adversarial patches is of great significance.
Current adversarial patch generation algorithms for object tracking networks do not incorporate template information and generate patches only for the search region [11,12,13,14,15]. As a result, the disparity they can induce between the template and the search region is limited. Consequently, smaller patches generated by these methods are insufficient to attack object-tracking networks, limiting their applicability against UAV onboard tracking algorithms.
Therefore, we propose a novel adversarial patch-generating algorithm in this paper. Our algorithm employs an adversarial image-generating network to generate adversarial images for both the template and search regions in an end-to-end form. Subsequently, we extract a patch from the adversarial images corresponding to the center point of the tracking object and apply these patches to the objects in the template and search region. The overall structure of our approach is illustrated in Figure 1.
Furthermore, we address the current absence of a comprehensive method to evaluate the adversarial ability of the generated patches. To overcome this limitation, we establish a correlation between the size of the patches and their attacking performance. This correlation serves as a means to assess the adversarial ability of the patches. The contributions of our work in this paper can be summarized as follows:
  • We propose a novel approach called the Bilateral Adversarial Patch Generating Network (BAPGNet), which effectively incorporates template information into the process of generating adversarial patches. Our network generates adversarial images for both the template and search areas, thereby amplifying the discrepancy between them. Additionally, we address the issue of size disparity between the template and search regions by introducing the Focus and DeFocus structures.
  • To attack the Region Proposal Network (RPN) within the tracker, we develop two loss functions: the Adversarial Object Loss and the Adversarial Regression Loss. These loss functions manipulate the bounding box, causing it to deviate from the actual tracking target and shrink. This deceptive manipulation misleads the tracking algorithm.
  • We introduce a novel metric to evaluate the adversarial ability of the patches. By associating the patch’s relative size with the attacking performance, our metric assesses the patch’s adversarial ability and attacking performance.
Our paper is organized as follows: Section 2 introduces related work, including SOT and attacking algorithms. Our method is described in detail in Section 3. Section 4 describes the metrics we introduce to evaluate the adversarial nature of the adversarial patch. The experimental evaluations are presented in Section 5. We give a discussion in Section 6 and conclude our work in Section 7.

2. Related Work

Our work aims to attack Siamese network-based SOT algorithms using an adversarial patch. Such an attack can be realized by adversarial perturbation (adding imperceptible pixel-level noise) or by an adversarial patch (using a perceptible texture). Therefore, in this section, we provide a summary of the existing work related to Siamese network-based SOT, adversarial perturbation-based attacks, and adversarial patch-based attacks.

2.1. Siamese Network-Based SOT

Siamese network-based algorithms are widely used in single object tracking (SOT) to achieve robust visual tracking. These networks consist of identical subnetworks that analyze pairs or groups of inputs. By learning a discriminative feature space through end-to-end training, Siamese networks enable efficient template-search region matching, ensuring accurate target tracking across frames.
Several pioneering works have contributed to the advancement of Siamese tracking algorithms. Tao et al. [16] introduced a Siamese network-based algorithm that performs target matching without any adaptation, even for unseen targets. Bertinetto et al. [17] proposed a fully convolutional architecture that achieved state-of-the-art performance by combining a basic tracking algorithm with the Siamese network. Wang et al. [18] introduced the Residual Attentional Siamese Network (RASNet) and reformulated the correlation filter within the Siamese framework, incorporating attention mechanisms. Li et al. [19] improved the robustness of Siamese trackers against occlusions and clutter by introducing a Region Proposal Network (RPN). Valmadre et al. [20] interpreted the Correlation Filter as a differentiable layer in a deep neural network, enabling the learning of deep features closely related to the Correlation Filter. Wang et al. [9] addressed limitations caused by occlusion and limited training data diversity by introducing positive sample generation and hard positive transformation networks. Li et al. [21] introduced a spatial awareness sampling strategy and depth-wise and layer-wise aggregations to address accuracy and translation invariance issues in Siamese trackers. Zhang et al. [22] proposed new residual modules and network architectures, improving performance on multiple datasets. Wang et al. [23] integrated object tracking and segmentation within the Siamese network framework, augmenting training with a binary segmentation task. Zhu et al. [24] proposed a distractor-aware Siamese network that exploits semantic negative pairs during training and suppresses distractors during inference, improving tracking robustness. Cao et al. [25] introduced an attentional aggregation module for UAV tracking, modeling semantic interdependencies. Wang et al. [26] introduced a dynamic appearance model with multiple target templates and utilized diversified tracking results to build a multi-trajectory history, enhancing performance in tracking scenarios. The research and development of Siamese network-based single-object tracking algorithms have significantly advanced the field of visual object tracking. These algorithms have proven to be powerful and effective tools for robustly tracking objects in various challenging scenarios.

2.2. Adversarial Perturbation Based Attack

Adversarial perturbation attacks can significantly degrade the performance of tracking algorithms by adding imperceptible micro-noise to a normal sample, causing the tracker to fail. Various methods have been proposed along this line. Yan et al. [27] proposed the Cooling-Shrinking Attack, which cools down the heat of the correct target area in the heat maps generated by the SiamRPN network and forces the bounding box to shrink, making it difficult for the tracker to detect the target accurately. To explore the tracker’s robustness, Liang et al. [28] proposed an end-to-end Fast Attack Network (FAN) that combines a drift loss with an embedded feature loss in the network’s training process; experimental results show that FAN can achieve efficient non-targeted and targeted attacks in both white-box and black-box scenarios. To improve attack performance, Chen et al. [29] proposed an efficient Dual Attention attack that uses a dual-attention mechanism to generate adversarial perturbations in the first frame of a video, making it impossible for the visual object tracker to track the target in subsequent frames. To exploit an object’s moving trajectory, Guo et al. [30] proposed the Spatial-Aware Online Incremental Attack (SPARK), which generates spatiotemporally sparse perturbations: it applies the perturbation from the previous frame to the new video frame and then optimizes the loss function to generate the minimal effective perturbation increment. In addition to rendering the tracker ineffective, SPARK can also mislead the tracker into generating a wrong trajectory. To achieve adversarial attacks in black-box scenarios, Jia et al. [31] proposed the IoU attack, which uses the IoU score of the current and historical bounding boxes to guide the direction of adversarial perturbations; this method iteratively adds perturbations to reduce the accuracy of the object bounding box without computing the model gradient. To further explore the model’s vulnerability, Yan et al. [32] proposed the Hijack Attack algorithm, which attacks the tracking algorithm by hijacking the target box’s shape and position, together with an adaptive optimization method that combines the two attack modes for a more efficient attack. To improve the attack algorithm’s applicability, Liu et al. [33] proposed an offline, universal noise adversarial attack algorithm that uses one type of noise to attack entire video frames; to improve computational efficiency and attack performance, they also proposed a greedy gradient strategy and a triple loss to obtain the model’s features. To improve attack performance on unknown models, Suttapak et al. [34] proposed the Diminishing-Feature Attack algorithm, which interferes with the model’s feature heat map by adding tiny amounts of noise to the input image, weakening the target score to deceive the tracking algorithm.

2.3. Adversarial Patch Based Attack

Adversarial patch attacks, which aim to fool object tracking systems by adding a perceptible, specified-texture patch to the target object, have recently gained attention in the computer vision community. However, compared with adversarial perturbations, studies on adversarial patch attacks against visual object tracking are still relatively limited. Wiyatno et al. [11] made the first attempt to attack SOT with a specified texture, presenting a method for creating subtle textures that confuse visual object tracking systems. These textures can be displayed as posters in the physical world, causing the tracking system to lock onto the texture instead of the actual target and allowing the target to evade tracking; their work evaluates different optimization strategies for fooling tracking models and compares the impacts of different scene variables. Li et al. [12] propose a video-agnostic and computationally efficient targeted attack on Siamese visual tracking: by adding a specified texture to the template image and a fake target adhering to a predefined trajectory, the tracker outputs the location and size of the fake target instead of the real target. Chen et al. [13] propose a Unified and Effective Network (UEN) that can generate invisible and visible adversarial perturbations to attack visual object-tracking models; UEN uses three ingenious loss functions to produce various adversarial perturbations for different attack settings. Ding et al. [14] designed a universal patch to camouflage trackers, introducing the maximum textural discrepancy (MTD) loss, a feature de-matching loss that distills global textural information of template and search images; they also evaluated two shape attacks, regression dilation and shrinking, to generate stronger attacks. Threet et al. [15] propose a pipeline for evaluating physical adversarial attacks in a simulated environment using the Car Learning to Act (CARLA) autonomous driving simulator and the DAPRICOT method; by using a simulated environment, the pipeline corrects for real-world variations such as lighting and viewing angle that can affect the effectiveness of adversarial attacks.
While the mentioned works have significantly contributed to generating adversarial patches for attacking SOT algorithms, some limitations still need to be addressed. Firstly, these approaches typically generate patches only for the search region without incorporating information from the template, which restricts the attack’s strength and effectiveness. Secondly, the network architectures employed in these works may not be sufficiently powerful, limiting their performance in generating effective adversarial patches. Lastly, the evaluation metrics used in these studies often do not consider the patch size, which hinders a fair comparison of different methods.
To overcome these limitations, we have designed the BAPGNet algorithm and introduced novel metrics in our research. The BAPGNet algorithm addresses the first issue by incorporating both the template and search region information into the patch generation process, thereby enhancing the attack strength. Moreover, we have employed powerful network architectures to improve the performance of our algorithm in generating highly effective adversarial patches. Additionally, we have developed novel metrics that consider the patch size, allowing for a fair and comprehensive evaluation of different methods.
In the subsequent sections, we will provide a detailed description of the BAPGNet algorithm and the novel metrics we have introduced, elaborating on how they address the aforementioned issues and contribute to the advancement of adversarial patch generation for SOT algorithms.

3. Methodology

To effectively execute an adversarial attack on object tracking algorithms using adversarial patches, we have developed a patch-generating algorithm capable of generating distinct patches for both the template and search region. By introducing these patches, we aim to induce a mismatch in the tracked target, leading to a reduction in the bounding box size and an increase in the disparity of deep features between the search region and template.
Our proposed algorithm specifically targets Single Object Tracking (SOT) algorithms that are based on the combination of Siamese networks and Region Proposal Networks (RPN). As an illustration of a victim SOT algorithm, we provide a brief description of SiamRPN++ in Section 3.1 of our paper.

3.1. Brief Description of SiamRPN++

The SiamRPN++ network improves upon the SiamRPN by introducing three key enhancements: a sampling strategy for increased spatial invariance, a feature aggregation strategy for improved representation, and a depth-wise separable correlation structure for enhanced cross-correlation operations. These improvements collectively contribute to the enhanced performance of the SiamRPN++ network in object-tracking tasks.
The SiamRPN++ object tracking network utilizes a modified ResNet-50 [35] architecture as its Siamese network, extracting similar features from three levels for both the template and search regions. These features are then used in a cross-correlation operation, employing a depth-wise separable correlation structure and weighted sum operation to aggregate the outputs from the three levels.
Following the cross-correlation operation, the network generates two feature maps for prediction: the class score feature map and the regression feature map. The class score feature map is responsible for determining whether the regions correspond to the positive object, while the regression feature map is utilized to regress the exact location and size of the object based on anchor boxes.
During the inference phase, the proposal’s location and size are decoded based on the anchor boxes. The network then applies a cosine window and a scale change penalty to re-rank the proposals, selecting the highest-ranked proposals for further consideration. Finally, the Non-maximum Suppression (NMS) algorithm is employed to obtain the final tracking bounding box.
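For concreteness, the following is a minimal sketch of this re-ranking step in the style of common SiamRPN-family implementations (e.g., pysot); the function name, the constants `penalty_k` and `window_influence`, and the anchor-major score ordering are illustrative assumptions rather than the exact values used by SiamRPN++.

```python
import numpy as np

def rerank_proposals(scores, boxes, anchors_per_pos, map_size, prev_size,
                     window_influence=0.44, penalty_k=0.04):
    """Re-rank decoded RPN proposals with a scale-change penalty and a
    cosine window (sketch; constants and layout are assumptions).

    scores:    (N,) classification scores after softmax, anchor-major
    boxes:     (N, 4) decoded proposals as (x, y, w, h)
    prev_size: (w, h) of the bounding box in the previous frame
    """
    def change(r):                      # symmetric ratio, always >= 1
        return np.maximum(r, 1.0 / r)

    # Scale-change penalty: suppress proposals whose size or aspect
    # ratio deviates strongly from the previous frame's box.
    s_prev = np.sqrt(prev_size[0] * prev_size[1])
    s_cur = np.sqrt(boxes[:, 2] * boxes[:, 3])
    r_prev = prev_size[0] / prev_size[1]
    r_cur = boxes[:, 2] / boxes[:, 3]
    penalty = np.exp(-(change(s_cur / s_prev) *
                       change(r_cur / r_prev) - 1.0) * penalty_k)
    pscore = penalty * scores

    # Cosine (Hanning) window: favour proposals near the centre of the
    # search region, where the target was in the previous frame.
    hanning = np.hanning(map_size)
    window = np.tile(np.outer(hanning, hanning).flatten(), anchors_per_pos)
    pscore = pscore * (1.0 - window_influence) + window * window_influence

    return int(np.argmax(pscore))       # index of the top-ranked proposal
```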

3.2. Patch Generating Network Structure

In Siamese network-based object tracking, the template branch plays a critical role in the tracking process for several reasons. Firstly, it provides an initial reference point for the tracker, enabling accurate localization of the target object. Secondly, the template branch is responsible for feature extraction and encoding, capturing appearance information and essential characteristics of the target, which forms a compact representation. Lastly, the template branch provides a template depth feature that is cross-correlated with the depth feature of the search region, facilitating the tracking process.
Based on the template branch’s importance and adversarial perturbations’ significant impact on deep features, we have devised BAPGNet, as depicted in Figure 2. Our approach focuses on generating adversarial patches for both the template and search regions to enhance the discrepancy between them. Consequently, our patch-generating network takes both the search region and the template as inputs and produces two separate patches for each of them.
The overall structure of our network consists of three main components. Firstly, we employ a backbone architecture to extract features from both the template and search regions. This allows the network to capture relevant information from both inputs simultaneously. Secondly, we incorporate a branch specifically designed for generating an adversarial patch for the template region. This branch focuses on perturbing the template region while taking into account its unique characteristics. Thirdly, we include another branch dedicated to generating an adversarial patch for the search region. This branch focuses on influencing the search region while considering its distinctive attributes.
In the SiamRPN++ network, the search region is twice as large as the template, making it unsuitable for direct input to the network. To address this issue, we utilize the Focus operation from YOLOv5 [36] to reduce the size of the search region by half. We then concatenate the half-sized search region with the template, creating the input for the network.
After extracting the features, we employ two branch networks to generate adversarial images separately for the template and search regions. The output size of each adversarial image matches the size of the template and search region in the tracking network, respectively. To handle the size discrepancy between the template and search regions, we introduce the DeFocus structure [10] in the branch generating the adversarial image for the search region. The DeFocus operation doubles the spatial size of the feature map, ensuring compatibility. Furthermore, because upsampling is deferred to the DeFocus operation, the preceding feature maps can be kept smaller, reducing computational complexity and improving operating speed. The DeFocus operation is depicted in Figure 3.
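As an illustration, below is a minimal PyTorch sketch of the Focus (space-to-depth) and DeFocus (depth-to-space) rearrangements described above; the channel widths, the 1 × 1 fusion convolutions, and the use of `pixel_shuffle` for DeFocus are our assumptions for exposition, not the exact layers of BAPGNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Focus(nn.Module):
    """YOLOv5-style Focus: rearrange (B, C, H, W) into (B, 4C, H/2, W/2)
    by interleaved slicing, then fuse the channels with a convolution.
    Inputs are assumed zero-padded to even H and W (Section 3.2)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(4 * c_in, c_out, kernel_size=1)

    def forward(self, x):
        # Stack the four pixel phases along the channel axis.
        x = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                       x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(x)

class DeFocus(nn.Module):
    """Inverse rearrangement: (B, C, H, W) -> (B, C/4, 2H, 2W), doubling
    the spatial size of the feature map before a channel fusion."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in // 4, c_out, kernel_size=1)

    def forward(self, x):
        return self.conv(F.pixel_shuffle(x, upscale_factor=2))

# Usage: halve the (padded) search region so it can be concatenated
# with the template along the channel axis before the backbone, e.g.
# search_half = Focus(3, 3)(search)   # 256x256 -> 128x128
```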
The adversarial images directly output by the network in Figure 2 cannot be added to the tracker input as-is. We use the ground truth to generate a mask M for the objects in the template and search region, respectively. We then use the mask M to select the patch in the adversarial image and apply it to the corresponding location. The patching process is described as follows:
$$I_p = I_a \odot M + I \odot \bar{M}$$
where $I_p$, $I_a$, and $I$ are the patched image (adversarial sample), the adversarial image, and the clean input image, respectively; $\odot$ is the element-wise product operation; and $\bar{M}$ is the inverse of the mask $M$.
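A minimal sketch of this patching step is given below; the square patch centered on the object, with area equal to a fixed fraction of the object’s bounding box, is an illustrative assumption about the mask policy.

```python
import torch

def apply_patch(clean, adv, box, ratio=0.2):
    """Paste a patch cut from the adversarial image onto the clean
    image: I_p = I_a * M + I * (1 - M).

    clean, adv: (B, 3, H, W) tensors of the same size
    box:        (cx, cy, w, h) of the target object in pixels
    ratio:      assumed relative patch area R_A = A_p / A_o
    """
    cx, cy, w, h = box
    # Square patch whose area is `ratio` of the object's box area,
    # centred on the object's centre point (assumed policy).
    side = int((ratio * w * h) ** 0.5)
    x0 = max(int(cx - side / 2), 0)
    y0 = max(int(cy - side / 2), 0)

    mask = torch.zeros_like(clean)               # M: 1 inside the patch
    mask[..., y0:y0 + side, x0:x0 + side] = 1.0

    return adv * mask + clean * (1.0 - mask)     # element-wise products
```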
The backbone used in this network is based on YOLOv5’s backbone, which has been proven to have strong feature extraction capabilities. To obtain adversarial images of the same size as the inputs, we mirror the backbone to serve as the decoder.
It is important to note that the search region size is not always exactly twice that of the template. We apply zero-padding to the template and search region to address this discrepancy. This ensures that the size of the search region is precisely twice that of the template. Subsequently, to restore the output to its original size, which matches the size of the search region and template in the tracking algorithm, we remove the pixels corresponding to the padded areas in the inputs. By doing so, we retain the consistency of the outputs with respect to the original input dimensions.

3.3. Loss Function

The network shown in Figure 2 requires a specific training goal that aims to confuse the network instead of enabling it to locate a target within a search region accurately. To misguide the tracker, three approaches can be taken:
  • Introducing contrasts in the extracted features of the template and the search region before cross-correlation. These contrasts make it harder for the network to match the features of the template and the search region.
  • Leading the classification branch of the RPN to give the background a higher score than the tracking object. This approach makes the network more likely to identify background regions as potential targets.
  • Introducing deviation and shrinkage in the bounding box of the tracking object within the search region. This approach makes it more difficult for the network to precisely locate the object.
By implementing these three approaches during the training process, the network becomes more confused and less accurate in identifying and locating targets within a search region.
To specifically apply the first loss function, we utilize the maximum textural discrepancy loss [14]. This loss function aims to maximize the discrepancy between the deep features of the search region and template, making their features unmatched. While this loss function helps to dampen the classification heat map of the RPN network to some extent, it does not directly deceive the tracking network alone. Therefore, to effectively fool the network, combining it with two other types of loss functions is necessary.
The second loss function leads the network to regard the background as the tracking object and vice versa. This can be achieved by attacking the classification branch of the RPN. Originally, the RPN assigns positive or negative labels to anchor boxes based on overlap with ground truth.
We reverse the label assignment method to lead the network to ignore the tracking object. In SiamRPN++, anchor boxes with an Intersection over Union (IoU) greater than 0.6 with the ground truth are assigned positive labels. To reverse the label assignment, we set P+ to 0 and P- to 1 when the anchor has an IoU greater than 0.2 with the ground truth and P+ to 1 and P- to 0 when the anchor has an IoU less than 0.2. This reversal in label assignment effectively leads the network to ignore the actual tracking object and treat the background as the target. The assignment process is illustrated in Figure 4. Therefore, the second type of loss in this paper is designed as follows:
$$L_{aobj} = -\frac{1}{N}\sum_{i=0}^{N}\left(p_+\log\hat{p}_+ + p_-\log\hat{p}_-\right) = -\frac{1}{N}\sum_{i=0}^{N}\left(I_i\log\hat{p}_+ + \left(1 - I_i\right)\log\hat{p}_-\right)$$
$$I_i = \begin{cases}0, & \mathrm{IoU} > 0.2\\ 1, & \mathrm{IoU} < 0.2\end{cases}$$
where $p_+$ and $p_-$ are the reversed positive and negative labels derived from the ground truth, $\hat{p}_+$ and $\hat{p}_-$ are the predicted probabilities for the positive and negative classes, and $N$ is the total number of anchors in the network’s output. By reversing the label assignment, the loss function maximizes the probability of incorrect RPN classification predictions: it misleads the Region Proposal Network (RPN) into predicting high confidence scores for the background, subtly misleading the SiamRPN++ tracker and causing it to lose the target.
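A minimal PyTorch sketch of this reversed-label loss is shown below; the two-column logit layout of the RPN classification output is an assumption.

```python
import torch

def adversarial_object_loss(cls_logits, iou_with_gt, thr=0.2):
    """Cross-entropy with reversed labels: anchors overlapping the
    target (IoU > thr) are labelled background, all others foreground.

    cls_logits:  (N, 2) raw RPN classification outputs, assumed as
                 column 0 = negative class, column 1 = positive class
    iou_with_gt: (N,) IoU of each anchor with the ground-truth box
    """
    log_p = torch.log_softmax(cls_logits, dim=1)
    # Reversed indicator I_i: 1 when IoU < thr, 0 when IoU > thr.
    i_rev = (iou_with_gt < thr).float()
    # -(I_i * log p_+ + (1 - I_i) * log p_-), averaged over all anchors
    loss = -(i_rev * log_p[:, 1] + (1.0 - i_rev) * log_p[:, 0])
    return loss.mean()
```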
For the third type of loss, we design an adversarial regression loss to attack the regression branch of the RPN. The RPN utilizes a regression branch to predict the offset of objects relative to corresponding anchor boxes. Specifically, it predicts four parameters (dx, dy, dw, and dh) that are used to calculate the bounding box coordinates of the object.
Mathematically, this can be expressed as:
$$\begin{aligned} x &= x_{an} + d_x \times w_{an}\\ y &= y_{an} + d_y \times h_{an}\\ w &= w_{an} \times e^{d_w}\\ h &= h_{an} \times e^{d_h} \end{aligned}$$
where (xan, yan, wan, han) define the coordinates of the anchor box, and (x, y, w, h) represents the predicted bounding box of the tracking object. The dx and dy parameters denote the center offset, while dw and dh indicate the log-space width and height offsets. By regressing to offsets rather than directly predicting bounding box coordinates, the RPN is able to scale predictions to boxes of any size linearly. The exponential function used for width and height offsets further ensures that predicted boxes do not have excessively low or high aspect ratios. This formulation enables the RPN to propose anchor boxes and corresponding object-bounding boxes over a wide range of scales.
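This decoding step can be transcribed directly into code, as in the following sketch:

```python
import torch

def decode_proposals(deltas, anchors):
    """Decode RPN regression offsets into bounding boxes.

    deltas:  (N, 4) predicted offsets (dx, dy, dw, dh)
    anchors: (N, 4) anchor boxes (x_an, y_an, w_an, h_an)
    returns: (N, 4) decoded boxes (x, y, w, h)
    """
    x = anchors[:, 0] + deltas[:, 0] * anchors[:, 2]
    y = anchors[:, 1] + deltas[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * torch.exp(deltas[:, 2])   # log-space width offset
    h = anchors[:, 3] * torch.exp(deltas[:, 3])   # log-space height offset
    return torch.stack([x, y, w, h], dim=1)
```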
To effectively manipulate the RPN into generating incorrect predictions, we propose an adversarial attack strategy utilizing the Generalized Intersection over Union (GIoU) [37] loss. This loss function is employed to subtly misguide and shrink the bounding box of the tracking object. In this attack, our objective is to modify the predicted bounding box offsets dx, dy, dw, and dh produced by the regression branch of the RPN. To carry out this attack, we define the coordinates of the adversarial ground truth bounding box as (dx*, dy*, w*, h*) and denote the original prediction from the RPN regression branch as (dx, dy, dw, dh). With these definitions in place, the adversarial regression loss Lareg used to misguide the RPN is formulated as follows:
$$L_{areg} = -\frac{1}{N}\sum_{i=0}^{N}\mathrm{GIoU}\left(B_g, B_p\right) = -\frac{1}{N}\sum_{i=0}^{N}\mathrm{GIoU}\left(\left(d_{x_i}^*, d_{y_i}^*, w_i^*, h_i^*\right), \left(d_{x_i}, d_{y_i}, w_i, h_i\right)\right)$$
where $\mathrm{GIoU}(B_g, B_p)$ represents the Generalized Intersection over Union operation applied to the ground-truth bounding box $B_g$ and the predicted bounding box $B_p$. We calculate the GIoU over all anchor boxes because, under the loss function $L_{aobj}$, anchor boxes that do not intersect the ground-truth box will be selected as proposals, and anchor boxes that do intersect it may still be selected. Therefore, we set the ground truth $(d_x^*, d_y^*, w^*, h^*)$ for all anchor boxes in order to mislead the network into generating incorrect predictions.
To manipulate the RPN to predict bounding boxes far from the true object, we categorize the anchor boxes into four groups based on the position of their center points relative to the object’s center point. These groups are (1) the upper left group (UL), (2) the upper right group (UR), (3) the bottom right group (BR), and (4) the bottom left group (BL). An anchor’s center point is denoted as (xan, yan), and the center point of the tracking object is denoted as (xo, yo). The anchor is assigned to one of the four groups based on the following conditions:
$$anchor \in \begin{cases} UL, & x_{an} < x_o \ \mathrm{and} \ y_{an} < y_o\\ UR, & x_{an} > x_o \ \mathrm{and} \ y_{an} < y_o\\ BR, & x_{an} > x_o \ \mathrm{and} \ y_{an} > y_o\\ BL, & x_{an} < x_o \ \mathrm{and} \ y_{an} > y_o \end{cases}$$
For each group of anchor boxes, we optimize the loss function Lareg to push the predictions as far from the target object as possible. To achieve this, we set the ground-truth values (dx*, dy*, w*, h*) for each group as follows:
$$\left(d_x^*, d_y^*, w^*, h^*\right) = \begin{cases} (0, 0, 0, 0), & anchor \in UL\\ (1, 0, 0, 0), & anchor \in UR\\ (1, 1, 0, 0), & anchor \in BR\\ (0, 1, 0, 0), & anchor \in BL \end{cases}$$
In Equation (7), we set $(w^*, h^*)$ to $(0, 0)$ to shrink the bounding boxes.
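The group assignment and adversarial target construction can be sketched as follows; the tensor layout is an assumption, and the final GIoU computation is left to an off-the-shelf routine such as `torchvision.ops.generalized_box_iou_loss` (which expects corner-format boxes).

```python
import torch

def adversarial_regression_targets(anchors, obj_center):
    """Build the adversarial ground truth (d_x*, d_y*, w*, h*) per
    anchor group (UL/UR/BR/BL), pushing predictions away from the
    target and shrinking them toward zero size.

    anchors:    (N, 4) anchor boxes (x_an, y_an, w_an, h_an)
    obj_center: (x_o, y_o) centre of the tracked object
    """
    x_o, y_o = obj_center
    left = anchors[:, 0] < x_o           # anchor centre left of object
    up = anchors[:, 1] < y_o             # anchor centre above object

    targets = torch.zeros_like(anchors)  # UL: (0, 0, 0, 0)
    targets[~left & up, 0] = 1.0         # UR: push right
    targets[~left & ~up, 0] = 1.0        # BR: push right ...
    targets[~left & ~up, 1] = 1.0        #     ... and down
    targets[left & ~up, 1] = 1.0         # BL: push down
    return targets                       # w*, h* stay 0 to shrink boxes

# L_areg then averages the GIoU between these targets and the decoded
# predictions over all anchors, as in the equation above.
```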
Finally, the adversarial loss is designed as follows:
$$L = \alpha L_{MTD} + \beta L_{aobj} + \gamma L_{areg}$$
where $\alpha$, $\beta$, and $\gamma$ are the weights that balance the three types of loss. These hyperparameters are set manually according to experimental results; in this paper, $\alpha$, $\beta$, and $\gamma$ are set to 50, 1, and 1, respectively.

4. Metrics for Evaluating Adversarial Patch

The larger the adversarial patch, the easier it is to attack an algorithm. However, an overly large patch leads to two issues. Firstly, there can be a significant discrepancy between the adversarial and original samples, with the patch even completely covering the object. Secondly, the patch may exceed the target boundary in the physical world, which is not desirable. Thus, it is essential to evaluate the adversarial ability of the adversarial patch.
To this end, we propose a metric called “patch’s adversarial nature” (PA). This metric takes into account both the patch size and the attacking performance of the victim algorithm (i.e., how much the algorithm’s performance drops after applying the patch to the tracking object). The PA metric satisfies the following conditions:
  • The larger the patch size, the weaker the PA, and vice versa. This condition implies that as the size of the patch increases, the adversarial nature of the patch decreases. Larger patches may lead to a significant discrepancy between the adversarial and original samples, even completely covering the objects, which reduces the effectiveness of the attack;
  • The higher the attacking performance, the stronger the PA, and vice versa. This condition indicates that as the attacking performance improves (i.e., the victim algorithm’s performance drops significantly), the adversarial nature of the patch increases. A stronger attack results in a higher PA value.
To account for the size discrepancy between objects, the PA metric considers the patch’s relative size, which refers to the patch’s size relative to the applied target. This ensures that even if the absolute size of the patch applied to larger objects is larger, the metric focuses on the patch’s size relative to the target. The PA metric is defined as follows:
$$PA = \frac{\mathbb{M}}{R_A}$$
$$R_A = \frac{A_p}{A_o}$$
where $A_o$ represents the area of the object’s bounding box and $A_p$ represents the area of the corresponding adversarial patch. The metric $\mathbb{M}$ evaluates the patch’s attacking performance. In this paper, we design the following two metrics to evaluate the attacking performance: (1) the overall dropping rate of the tracking success rate (DRTSR) and (2) the overall dropping rate of the tracking precision rate (DRTPR). Using these two metrics as $\mathbb{M}$, the PA is written as follows:
$$PA_{sr} = \frac{DRTSR}{R_A}$$
$$PA_{pr} = \frac{DRTPR}{R_A}$$
The DRTSR (DRTPR) metrics are calculated as follows:
  • Firstly, we obtain the Success plot (or Precision plot) for the clean, non-adversarial data. This plot represents the performance of the tracker in terms of success rate (or precision) at different overlap thresholds (or distance thresholds);
  • We calculate the area under the Success plot (or Precision plot) curve, which is denoted as ASRc (or APRc). This average success rate (or precision rate) represents the tracker’s performance on the clean data;
  • Next, we apply an adversarial attack to the input data and obtain a new Success plot (or Precision plot) to measure the tracker’s performance on the perturbed data. We calculate the area under this curve, denoted as ASRa (or APRa);
  • Finally, the DRTSR (DRTPR) value is calculated as follows:
$$DRTSR = \frac{\mathrm{norm}(ASR_c) - \mathrm{norm}(ASR_a)}{\mathrm{norm}(ASR_c)}$$
$$DRTPR = \frac{\mathrm{norm}(APR_c) - \mathrm{norm}(APR_a)}{\mathrm{norm}(APR_c)}$$
where the norm(·) function represents a normalization operation. The vertical axis of the Success plot (Precision plot) has a maximum of 1, so normalization is achieved by dividing the area under the curve by the maximum value of the horizontal axis, which is 1 for the Success plot (overlap threshold) and 50 for the Precision plot (distance threshold in pixels):
$$\mathrm{norm}(ASR) = ASR$$
$$\mathrm{norm}(APR) = \frac{APR}{50}$$
The decrease in ASR or APR, i.e., $(ASR_c - ASR_a)$ or $(APR_c - APR_a)$, indicates how much the object tracking success rate or precision rate has dropped due to the attack. A larger difference signifies that the attack has more broadly disrupted the model’s ability to track the object. The DRTSR (DRTPR) metric quantifies the total dropping rate in the object tracking success (precision) rate caused by the attack.
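A minimal sketch of computing DRTSR, DRTPR, and PA from sampled Success/Precision curves is given below; uniform sampling of the threshold axes is an assumption about how the curves are stored.

```python
import numpy as np

def auc(curve_y, x_max):
    """Area under a success/precision curve whose values are sampled
    uniformly over [0, x_max] on the threshold axis."""
    x = np.linspace(0.0, x_max, len(curve_y))
    return np.trapz(curve_y, x)

def drtsr(success_clean, success_adv):
    """Dropping rate of the tracking success rate. norm(ASR) = ASR
    because the overlap-threshold axis already spans [0, 1]."""
    asr_c, asr_a = auc(success_clean, 1.0), auc(success_adv, 1.0)
    return (asr_c - asr_a) / asr_c

def drtpr(precision_clean, precision_adv):
    """Dropping rate of the tracking precision rate. The distance axis
    spans [0, 50] pixels, hence norm(APR) = APR / 50 (the factor
    cancels in the ratio but is kept to mirror the equations)."""
    apr_c = auc(precision_clean, 50.0) / 50.0
    apr_a = auc(precision_adv, 50.0) / 50.0
    return (apr_c - apr_a) / apr_c

def patch_adversarial_nature(drop_rate, patch_area, object_area):
    """PA = M / R_A with R_A = A_p / A_o."""
    return drop_rate / (patch_area / object_area)
```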
In summary, by quantifying the total decrease in tracking success or precision rate (the DRTSR or DRTPR) achieved by the attack and normalizing it by the patch’s relative size, PA indicates the patch’s ability to efficiently impair the model’s tracking capability without becoming excessively conspicuous. A higher PA signifies greater influence on performance with less obtrusiveness, i.e., a stealthier attack.

5. Experiment

5.1. Datasets

This paper proposes an algorithm for generating adversarial patches that can effectively attack single object tracking (SOT) algorithms, particularly in the context of unmanned aerial vehicle (UAV) views where targets tend to be smaller in size. To assess the efficacy of our approach, we evaluate it using two benchmark datasets commonly used for object tracking in UAV videos: UAV123 and UAVDT.
The UAV123 dataset is a benchmark dataset specifically designed for object tracking in UAV videos. It comprises 123 video sequences captured by six UAVs in diverse scenarios, including urban, countryside, and forest environments. The dataset provides detailed ground-truth annotations for the target objects in each video, including their location, size, orientation, and occlusion status. Due to its challenging characteristics, such as complex motion patterns, scale changes, and occlusions, the UAV123 dataset serves as an excellent benchmark for evaluating and comparing the performance of object-tracking algorithms in UAV videos. It enables comprehensive testing of the robustness and accuracy of these algorithms.
The UAV Detection and Tracking dataset (UAVDT) is another valuable resource for evaluating object-tracking algorithms in UAV videos. It consists of over ten hours of video footage and includes 8000 annotated frames captured by unmanned aerial vehicles in real-world environments with various challenges. These challenges include object scale changes, occlusions, and fast motion, among others. The UAVDT dataset focuses on testing the robustness and accuracy of tracking algorithms in challenging UAV video scenarios. In our experimental evaluation, we specifically utilize the single object tracking portion of the dataset.

5.2. Experimental Setup

Our adversarial attack algorithm is designed to attack tracking algorithms based on the Siamese network and the RPN. Thus, we evaluated the effectiveness of our algorithm against the SiamRPN++ [21], DaSiamRPN [24], and SiamAPN++ [25] trackers. For these trackers, we set the template size to 127 × 127. The search region size is set to 255 × 255 for SiamRPN++, 255 × 255 for DaSiamRPN, and 287 × 287 for SiamAPN++. We also incorporate the total variation loss [38] in the training process to achieve a smooth color transform.
To train the adversarial patch generation network effectively, we ensure that the search region input size is double the template size. As explained in Section 3.2, we use padding to handle the size discrepancy, which allows variable input sizes during training. When training the attacking models for SiamRPN++ and DaSiamRPN, the template size is randomly selected between 127 and 200, and the search region size is always set to twice the template size. For the SiamAPN++ network, the template size is randomly chosen between 144 and 160.

5.3. Evaluation of the Attacking Performance

The performance of our attacking algorithm is tested against three tracking algorithms: SiamRPN++, DaSiamRPN, and SiamAPN++. We generate separate attacking models for each algorithm and evaluate their performance on two datasets, resulting in the recorded results in Table 1 and Table 2. Table 1 shows the success rate performance drop, while Table 2 displays the precision rate performance drop caused by our attacks.
From the results in Table 1 and Table 2, it can be observed that our attacking algorithm achieved success in attacking the tracking algorithms. However, its impact was least significant on SiamAPN++. This can be attributed to the fact that the search area of SiamAPN++ is slightly larger than double the template size, requiring additional black edge padding, which negatively affected the algorithm’s performance.
To validate the transferability of our attacking algorithm, we conducted cross-validation experiments. We developed attacking algorithms for each tracking algorithm and tested them against the other two tracking models to generate adversarial patches. The results are presented in Table 3 and Table 4, where Table 3 shows the success rate performance drop, and Table 4 shows the precision rate performance drop. The success and precision plots are also visualized in Figure 5 and Figure 6.
Based on the results presented in Table 3 and Table 4 and Figure 5 and Figure 6, it is evident that our proposed attacking algorithm demonstrates a degree of transferability. This is because our algorithm is trained on the entire training dataset rather than iterating on a single image, which makes it possible to transfer to other tracking algorithms. Furthermore, it can be observed that the attack algorithm transfers better between SiamRPN++ and DaSiamRPN than to SiamAPN++, which is attributed to the different input size ratio of SiamAPN++.

5.4. Influence of the Pre-Training Process

To assess the impact of the pre-training process on the attacking performance, we conducted experiments by pre-training the backbone of the tracking network on the ImageNet1k dataset. The specific steps are as follows:
  • Construct a feature extraction network identical to a single branch of the Siamese network in the tracking network;
  • Flatten the last convolutional layer of the feature extraction network;
  • Add a fully connected layer consisting of 1000 neurons;
  • Apply a SoftMax operation to normalize the network output;
  • Train the network on the ImageNet1k dataset with the cross-entropy loss function;
  • Use the pre-trained network to initialize the parameters of the Siamese network in the tracking network (both branches share the same parameters);
  • Fix the parameters of the Siamese network and train the network for 200 epochs on the tracking dataset;
  • Unlock the parameters of the Siamese network and fine-tune them for 50 epochs to generate the final network.
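A minimal PyTorch sketch of the classification head used in the first five steps might look as follows; `backbone` and `feat_dim` are placeholders for the tracker’s feature extractor and its flattened output dimension.

```python
import torch.nn as nn

class PretrainClassifier(nn.Module):
    """Wrap one Siamese branch with a 1000-way head for ImageNet1k
    pre-training (sketch of the steps listed above)."""
    def __init__(self, backbone, feat_dim):
        super().__init__()
        self.backbone = backbone             # a single Siamese branch
        self.flatten = nn.Flatten()          # flatten the last conv map
        self.fc = nn.Linear(feat_dim, 1000)  # 1000-class ImageNet head

    def forward(self, x):
        return self.fc(self.flatten(self.backbone(x)))

# Training with nn.CrossEntropyLoss() applies the SoftMax normalization
# internally; the resulting backbone weights then initialize both
# (parameter-shared) branches of the tracker's Siamese network.
```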
The results are recorded in Table 5 and Table 6.
From Table 5 and Table 6, it can be observed that the pre-trained models do not significantly affect the attacking performance; the differences appear to be within the run-to-run variance of training. Our attack algorithm’s performance varies across the three algorithms on the two datasets. For example, with the SiamRPN++ pre-trained model, the attack performance decreases on the UAV123 dataset compared with the non-pre-trained model but improves on the UAVDT dataset. In contrast, the attack performance against DaSiamRPN shows the opposite trend. We speculate that this is because the performance gain from pre-training is limited when the training dataset is already rich, and pre-training may even lead to a decrease in performance.

5.5. Ablation Experiment

Three patching networks based on the proposed architecture are developed to evaluate the effectiveness of using separate template and search region patches: BAPGNet, BAPGNet1, and BAPGNet2. BAPGNet represents the complete proposed network, while BAPGNet1 lacks a branch for the template, and BAPGNet2 lacks both the template branch and the template input. For BAPGNet1 and BAPGNet2, only the patch for the search region is generated. During the testing phase, these networks generate patches for the test dataset, and the results are recorded in Table 7 and Table 8.
Table 7 and Table 8 show that BAPGNet achieves the highest attack capability and generates adversarial patches with the highest efficiency. This can be attributed to BAPGNet generating patches for both branches, leading to more significant differences in depth features within the Siamese network. Additionally, the template input provides crucial data for the patch generation network, resulting in BAPGNet1 performing better than BAPGNet2.

5.6. Comparing with Other Attacking Algorithms

The proposed attacking algorithm’s performance is evaluated against the algorithm in paper [14] (named SA) and the algorithm in paper [13] (named UEN-P), two established adversarial patch generation algorithms for object tracking networks, using the UAV123 dataset. The hyperparameters specified by the authors of SA and UEN-P are applied. Specifically, for SA, $D = 3$, $\alpha = 1000$, $\beta = 1$, $\gamma = 0.1$, $K = 20$, $h = 1$, $w = 1$, $m_\tau = 0.7$; for UEN-P, $\alpha = 10$, $M = 5$. A 20% patch size ratio is fixed for all algorithms to enable a fair comparison. The results are presented in Table 9 and Table 10, and sample results are shown in Figure 7.
Figure 7 shows that the proposed algorithm prompts tracking network failure faster and deceives the networks more successfully. Table 9 and Table 10 demonstrate that the proposed algorithm achieves higher attack performance than SA and UEN-P with the same patch size. This superior effectiveness results from the following:
  • The proposed network utilizes the proven YOLOv5 backbone, which provides strong feature extraction ability and high efficiency.
  • The proposed algorithm applies adversarial patches to both the search region and the template, maximizing the discrepancy between them and thereby amplifying the attack’s effectiveness.

6. Discussion

Our proposed BAPGNet algorithm for attacking SOT algorithms introduces several advancements in the field of adversarial patch generation for object-tracking networks. By generating separate adversarial patches for the template and search region, we maximize the discrepancy between them, leading to higher attack efficiency compared with current methods. Together, the proposed loss functions and patch generation network advance adversarial patch generation for object tracking algorithms.
The proposed evaluation metrics also advance the adversarial patch attacking field and the object tracking field. The metrics ASR and APR can serve as average success rate evaluations for object tracking tasks. Current object-tracking evaluations usually use a single threshold to measure the success or precision rate of a tracking algorithm, which hardly reflects the algorithm’s average performance. Our metrics comprehensively evaluate the success or precision rate by considering all thresholds of the success or precision plot. The metrics DRTSR, DRTPR, and PA provide a more comprehensive evaluation of adversarial attacks on object tracking algorithms. The PA metric can also assess adversarial patch-generating algorithms for object detection tasks by substituting the performance evaluation metric $\mathbb{M}$ with one that indicates the performance drop of object detection algorithms.
Experiments on UAV123 and UAVDT datasets demonstrate the effectiveness of our proposed BAPGNet in attacking object-tracking algorithms. The ablation experiments validated that by generating separate adversarial patches for the template and search region, our BAPGNet algorithm achieves higher attack efficiency than generating a single patch for the search region alone. This demonstrates that fusing template information improves adversarial patch generation for object-tracking networks. The transferability experiments also confirm that our attacking algorithm can transfer to other SOT algorithms. Comparisons with existing adversarial patch generation algorithms also confirm our algorithm’s performance.
While our algorithm demonstrates the ability to generate separate adversarial patches for the template and search regions and enlarge the discrepancy of deep features in the Siamese network, its practical applicability is somewhat constrained compared to networks that generate single patches. This limitation arises due to the typical usage of the algorithm in a black-box environment, where no knowledge about the tracking algorithm is available, making it challenging to determine the template image for the algorithm. Consequently, the current utilization of this algorithm is primarily geared towards adversarial patch attack tasks in white-box settings.

7. Conclusions

This article focuses on constructing an adversarial patch generation network for single-object tracking networks. The network is built on a powerful backbone with excellent feature extraction capabilities, thereby enhancing the attack performance. The network generates adversarial patches for both the template and search regions, enabling it to maximize the discrepancy between the template and search region features in the Siamese network. Multiple evaluation metrics are also introduced to better assess the patch’s adversarial efficiency. Experimental results show that our algorithm can successfully attack various SOT networks and exhibits attack transferability. Ablation experiments validate the effectiveness of the network structure, which is shown to effectively enhance the adversarial ability of the patches. Finally, horizontal comparative experiments demonstrate that the proposed attack algorithm outperforms existing patch-generating algorithms.
Although our algorithm can currently achieve adversarial attacks by generating different adversarial patches for the template and search region, its main disadvantage is that the template image cannot be determined in the case of black-box attacks. Therefore, further research is needed on how to implement adversarial attacks under black-box conditions.

Author Contributions

Conceptualization, J.R. and Y.X.; methodology, J.R.; software, Z.Z.; validation, T.H. and Z.Z.; formal analysis, C.T.; investigation, Z.Z.; resources, Y.X.; data curation, C.T.; writing—original draft preparation, J.R.; writing—review and editing, T.H.; visualization, T.H.; supervision, Y.X.; project administration, Y.X.; funding acquisition, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Basic Research Program of Shaanxi [grant number D5110220135] and the China University Industry-University-Research Innovation Fund [grant number 2021ITA10022].

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://cemse.kaust.edu.sa/ivul/uav123 (accessed on 18 May 2023) and https://sites.google.com/view/grli-uavdt/%E9%A6%96%E9%A1%B5 (accessed on 18 May 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bai, C.; Gong, Y.; Cao, X. Pedestrian Tracking and Trajectory Analysis for Security Monitoring. In Proceedings of the 2020 IEEE 5th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 12–14 June 2020; IEEE: New York, NY, USA, 2020. [Google Scholar]
  2. Emami, A.; Dadgostar, F.; Bigdeli, A.; Lovell, B.C. Role of spatiotemporal oriented energy features for robust visual tracking in video surveillance. In Proceedings of the 2012 IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance, Beijing, China, 18–21 September 2012; IEEE: New York, NY, USA, 2012. [Google Scholar]
  3. Gao, M.; Jin, L.; Jiang, Y.; Guo, B. Manifold Siamese network: A novel visual tracking convnet for autonomous vehicles. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1612–1623. [Google Scholar] [CrossRef]
  4. Robin, C.; Lacroix, S. Multi-robot target detection and tracking: Taxonomy and survey. Auton. Robot. 2016, 40, 729–760. [Google Scholar] [CrossRef] [Green Version]
  5. Zhang, Z.; Doi, K.; Iwasaki, A.; Xu, G. Unsupervised domain adaptation of high-resolution aerial images via correlation alignment and self training. IEEE Geosci. Remote Sens. Lett. 2020, 18, 746–750. [Google Scholar] [CrossRef]
  6. Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; Shah, R. Signature verification using a “siamese” time delay neural network. Adv. Neural Inf. Process. Syst. 1993, 6, 737–744. [Google Scholar] [CrossRef] [Green Version]
  7. Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. arXiv 2013, arXiv:1312.6199. [Google Scholar]
  8. Song, Y.; Ma, C.; Wu, X.; Gong, L.; Bao, L.; Zuo, W.; Shen, C.; Lau, R.W.; Yang, M.-H. Vital: Visual tracking via adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  9. Wang, X.; Li, C.; Luo, B.; Tang, J. Sint++: Robust visual tracking via adversarial positive instance generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  10. Rasol, J.; Xu, Y.; Zhang, Z.; Zhang, F.; Feng, W.; Dong, L.; Hui, T.; Tao, C. An Adaptive Adversarial Patch-Generating Algorithm for Defending against the Intelligent Low, Slow, and Small Target. Remote Sens. 2023, 15, 1439. [Google Scholar] [CrossRef]
  11. Wiyatno, R.R.; Xu, A. Physical adversarial textures that fool visual object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  12. Li, Z.; Shi, Y.; Gao, J.; Wang, S.; Li, B.; Liang, P.; Hu, W. A simple and strong baseline for universal targeted attacks on siamese visual tracking. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 3880–3894. [Google Scholar] [CrossRef]
  13. Chen, X.; Fu, C.; Zheng, F.; Zhao, Y.; Li, H.; Luo, P.; Qi, G.-J. A Unified Multi-Scenario Attacking Network for Visual Object Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021. [Google Scholar]
  14. Ding, L.; Wang, Y.; Yuan, K.; Jiang, M.; Wang, P.; Huang, H.; Wang, Z.J. Towards universal physical attacks on single object tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021. [Google Scholar]
  15. Threet, M.; Busho, C.; Harguess, J.; Jutras, M.; Lape, N.; Leary, S.; Manville, K.; Tan, M.; Ward, C. Physical adversarial attacks in simulated environments. In Proceedings of the 2021 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), Washington, DC, USA, 12–14 October 2021; IEEE: New York, NY, USA, 2021. [Google Scholar]
  16. Tao, R.; Gavves, E.; Smeulders, A.W. Siamese instance search for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016. [Google Scholar]
  17. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the Computer Vision–ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; Proceedings, Part II 14. Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  18. Wang, Q.; Teng, Z.; Xing, J.; Gao, J.; Hu, W.; Maybank, S. Learning attentions: Residual attentional siamese network for high performance online visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  19. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  20. Valmadre, J.; Bertinetto, L.; Henriques, J.; Vedaldi, A.; Torr, P.H. End-to-end representation learning for correlation filter based tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  21. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  22. Zhang, Z.; Peng, H. Deeper and wider siamese networks for real-time visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  23. Wang, Q.; Zhang, L.; Bertinetto, L.; Hu, W.; Torr, P.H. Fast online object tracking and segmentation: A unifying approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  24. Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-aware siamese networks for visual object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  25. Cao, Z.; Fu, C.; Ye, J.; Li, B.; Li, Y. SiamAPN++: Siamese attentional aggregation network for real-time UAV tracking. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021. [Google Scholar]
  26. Wang, X.; Chen, Z.; Tang, J.; Luo, B.; Wang, Y.; Tian, Y.; Wu, F. Dynamic attention guided multi-trajectory analysis for single object tracking. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 4895–4908. [Google Scholar] [CrossRef]
  27. Yan, B.; Wang, D.; Lu, H.; Yang, X. Cooling-shrinking attack: Blinding the tracker with imperceptible noises. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
28. Liang, S.; Wei, X.; Yao, S.; Cao, X. Efficient adversarial attacks for visual object tracking. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXVI 16. Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  29. Chen, X.; Yan, X.; Zheng, F.; Jiang, Y.; Xia, S.-T.; Zhao, Y.; Ji, R. One-shot adversarial attacks on visual tracking with dual attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  30. Guo, Q.; Xie, X.; Juefei-Xu, F.; Ma, L.; Li, Z.; Xue, W.; Feng, W.; Liu, Y. Spark: Spatial-aware online incremental attack against visual tracking. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXV 16. Springer: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
31. Jia, S.; Song, Y.; Ma, C.; Yang, X. IoU attack: Towards temporally coherent black-box adversarial attack for visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  32. Yan, X.; Chen, X.; Jiang, Y.; Xia, S.-T.; Zhao, Y.; Zheng, F. Hijacking tracker: A powerful adversarial attack on visual tracking. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: New York, NY, USA, 2020. [Google Scholar]
  33. Liu, S.; Chen, Z.; Li, W.; Zhu, J.; Wang, J.; Zhang, W.; Gan, Z. Efficient universal shuffle attack for visual object tracking. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; IEEE: New York, NY, USA, 2022. [Google Scholar]
  34. Suttapak, W.; Zhang, J.; Zhang, L. Diminishing-feature attack: The adversarial infiltration on visual tracking. Neurocomputing 2022, 509, 21–33. [Google Scholar] [CrossRef]
35. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  36. Jocher, G.; Stoken, A.; Borovec, J.; Chaurasia, A.; Changyu, L.; Hogan, A.; Hajek, J.; Diaconu, L.; Kwon, Y.; Defretin, Y. ultralytics/yolov5: v5.0—YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations. Zenodo. 2021. Available online: https://github.com/ultralytics/yolov5 (accessed on 12 June 2022).
  37. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
38. Sharif, M.; Bhagavatula, S.; Bauer, L.; Reiter, M.K. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Vienna, Austria, 24–28 October 2016. [Google Scholar]
Figure 1. The overall structure of our work. The bilateral adversarial patch-generating network is our designed network that generates adversarial patches for the template and the search region; the three loss functions are designed to misguide the tracking network into wrong predictions.
Figure 2. Bilateral adversarial patch-generating network. The channel number c is set to 64 in this paper. The network consists of three parts: a backbone for feature extraction, a search-region branch that generates the patch for the search region, and a template branch that generates the patch for the template. The right part of the figure shows the inner structure of the corresponding modules on the left.
Figure 3. DeFocus operation. The colors on the left represent different channels of the convolutional layer's output; these channels are mapped sequentially to the correspondingly colored locations of the feature map on the right. The rearrangement transforms a feature map of size w × h × 4c into one of size 2w × 2h × c.
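For readers who want to reproduce the rearrangement in Figure 3, the PyTorch sketch below implements one plausible channel-to-pixel mapping (the same interleaving as torch.nn.PixelShuffle with an upscale factor of 2); the exact channel ordering used in the paper may differ, so treat the mapping as an assumption.

```python
import torch
import torch.nn as nn

class DeFocus(nn.Module):
    """Rearrange an (N, 4c, h, w) feature map into (N, c, 2h, 2w).

    Inverse of the YOLOv5-style Focus slicing: each group of four
    channels is scattered onto a 2 x 2 spatial neighborhood.
    """
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c4, h, w = x.shape
        assert c4 % 4 == 0, "channel count must be divisible by 4"
        c = c4 // 4
        # (N, c, 2, 2, h, w) -> (N, c, h, 2, w, 2) -> (N, c, 2h, 2w)
        x = x.view(n, c, 2, 2, h, w)
        x = x.permute(0, 1, 4, 2, 5, 3).contiguous()
        return x.view(n, c, 2 * h, 2 * w)

# Quick shape check: 4 x 64 channels at 16 x 16 -> 64 channels at 32 x 32.
feat = torch.randn(1, 256, 16, 16)
assert DeFocus()(feat).shape == (1, 64, 32, 32)
```

Under this channel ordering, the module is functionally identical to nn.PixelShuffle(2); the explicit reshape is spelled out only to make the channel-to-location mapping of Figure 3 visible.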
Figure 4. Label assignment. Black squares denote anchor boxes labeled zero, white squares denote anchor boxes labeled one, and the red rectangle represents the ground truth.
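Figure 4 shows only the binary outcome of the assignment. As an illustration of how such labels are typically computed, here is a minimal NumPy sketch using an IoU threshold in the style of SiamRPN-family trackers; the threshold value is an assumption for illustration, not a number taken from this paper.

```python
import numpy as np

def assign_anchor_labels(anchors: np.ndarray, gt: np.ndarray,
                         pos_thr: float = 0.6) -> np.ndarray:
    """Label anchors 1 (white in Figure 4) or 0 (black) by IoU with the
    ground-truth box (red). anchors: (N, 4) [x1, y1, x2, y2]; gt: (4,).
    The 0.6 threshold is an assumed, SiamRPN-style default."""
    ix1 = np.maximum(anchors[:, 0], gt[0])
    iy1 = np.maximum(anchors[:, 1], gt[1])
    ix2 = np.minimum(anchors[:, 2], gt[2])
    iy2 = np.minimum(anchors[:, 3], gt[3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_a + area_g - inter)
    return (iou > pos_thr).astype(np.int64)
```

Many implementations additionally ignore anchors whose IoU falls in a middle band (e.g., 0.3–0.6) during training rather than forcing them to zero.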
Figure 5. Results on the UAV123 dataset. In each subfigure, the left plot shows the success rate of the One-Pass Evaluation and the right plot shows its precision. "Clean-xxx" denotes results on the clean test set, and "Attack-xxx" denotes results on the patched test set.
Figure 6. Results on the UAVDT dataset. In each subfigure, the left plot shows the success rate of the One-Pass Evaluation and the right plot shows its precision. "Clean-xxx" denotes results on the clean test set, and "Attack-xxx" denotes results on the patched test set.
Figure 7. Selected visualized results. In all images, the green box is the ground-truth box of the tracked object. The red, purple, and orange-yellow boxes are the boxes predicted after the BAPGNet, SA, and UEN-P attacks, respectively. The light-blue rectangle in the upper-right corner of each image shows the frame number of the video.
Table 1. Tracking success rate drop test.
Dataset   Victim Algorithm   ASR_c   ASR_a   DR_TSR
UAV123    SiamRPN++          58.8%   18.0%   69.4%
UAV123    DaSiamRPN          53.5%   17.2%   67.8%
UAV123    SiamAPN++          56.4%   21.8%   61.3%
UAVDT     SiamRPN++          52.9%   18.3%   65.3%
UAVDT     DaSiamRPN          50.0%   17.7%   64.6%
UAVDT     SiamAPN++          52.2%   22.8%   56.3%
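As a reading aid for Tables 1–6: the drop-rate columns are consistent with the relative drop DR = (R_clean − R_attack) / R_clean, where R is the success (ASR) or precision (APR) rate. For example, SiamRPN++ on UAV123 gives (58.8% − 18.0%) / 58.8% ≈ 69.4%, matching the table.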
Table 2. Tracking precision rate drop test.
Dataset   Victim Algorithm   APR_c   APR_a   DR_TPR
UAV123    SiamRPN++          73.9%   30.4%   58.9%
UAV123    DaSiamRPN          68.4%   34.0%   50.3%
UAV123    SiamAPN++          69.3%   38.2%   45.0%
UAVDT     SiamRPN++          68.7%   34.7%   49.5%
UAVDT     DaSiamRPN          65.5%   33.9%   48.2%
UAVDT     SiamAPN++          66.7%   39.5%   40.8%
Table 3. Attacking transferability test on tracking success rate drop.
Dataset   Attacking Model        Victim Algorithm   ASR_c   ASR_a   DR_TSR
UAV123    Train for SiamRPN++    DaSiamRPN          53.5%   28.4%   47.0%
UAV123    Train for SiamRPN++    SiamAPN++          56.4%   30.0%   46.7%
UAV123    Train for DaSiamRPN    SiamRPN++          58.8%   25.2%   57.2%
UAV123    Train for DaSiamRPN    SiamAPN++          56.4%   28.8%   49.3%
UAV123    Train for SiamAPN++    SiamRPN++          58.8%   32.3%   45.0%
UAV123    Train for SiamAPN++    DaSiamRPN          53.5%   27.5%   48.6%
UAVDT     Train for SiamRPN++    DaSiamRPN          50.0%   29.5%   41.0%
UAVDT     Train for SiamRPN++    SiamAPN++          52.2%   30.3%   42.0%
UAVDT     Train for DaSiamRPN    SiamRPN++          52.9%   27.1%   48.7%
UAVDT     Train for DaSiamRPN    SiamAPN++          52.2%   30.9%   69.8%
UAVDT     Train for SiamAPN++    SiamRPN++          52.9%   28.5%   46.1%
UAVDT     Train for SiamAPN++    DaSiamRPN          50.0%   28.8%   42.4%
Table 4. Attacking transferability test on tracking precision rate drop.
Dataset   Attacking Model        Victim Algorithm   APR_c   APR_a   DR_TPR
UAV123    Train for SiamRPN++    DaSiamRPN          68.4%   44.1%   35.6%
UAV123    Train for SiamRPN++    SiamAPN++          69.3%   47.6%   31.4%
UAV123    Train for DaSiamRPN    SiamRPN++          73.9%   44.3%   40.1%
UAV123    Train for DaSiamRPN    SiamAPN++          69.3%   51.9%   25.2%
UAV123    Train for SiamAPN++    SiamRPN++          73.9%   47.8%   35.3%
UAV123    Train for SiamAPN++    DaSiamRPN          68.4%   46.2%   32.5%
UAVDT     Train for SiamRPN++    DaSiamRPN          65.4%   46.1%   29.4%
UAVDT     Train for SiamRPN++    SiamAPN++          66.7%   51.2%   23.3%
UAVDT     Train for DaSiamRPN    SiamRPN++          68.7%   46.5%   32.3%
UAVDT     Train for DaSiamRPN    SiamAPN++          66.7%   53.6%   19.7%
UAVDT     Train for SiamAPN++    SiamRPN++          68.7%   48.5%   29.3%
UAVDT     Train for SiamAPN++    DaSiamRPN          65.4%   49.2%   24.7%
Table 5. Tracking success rate drop test.
Dataset   Victim Algorithm   ASR_c   ASR_a   DR_TSR
UAV123    Pre-SiamRPN++      59.2%   18.6%   68.6%
UAV123    Pre-DaSiamRPN      53.1%   16.9%   68.2%
UAV123    Pre-SiamAPN++      55.8%   21.9%   60.8%
UAVDT     Pre-SiamRPN++      53.3%   18.1%   66.0%
UAVDT     Pre-DaSiamRPN      49.6%   17.8%   64.1%
UAVDT     Pre-SiamAPN++      51.7%   23.0%   55.5%
Table 6. Tracking precision rate drop test.
Dataset   Victim Algorithm   APR_c   APR_a   DR_TPR
UAV123    Pre-SiamRPN++      74.5%   31.1%   58.3%
UAV123    Pre-DaSiamRPN      67.8%   33.2%   51.0%
UAV123    Pre-SiamAPN++      68.9%   38.4%   44.3%
UAVDT     Pre-SiamRPN++      69.7%   34.5%   50.5%
UAVDT     Pre-DaSiamRPN      64.9%   33.8%   47.9%
UAVDT     Pre-SiamAPN++      66.1%   40.1%   49.3%
Table 7. Success rate drop in ablation experiment.
Dataset   Attacking Network   ASR_c   ASR_a   DR_TSR   PA_sr
UAV123    BAPGNet             58.8%   18.0%   69.4%    3.470
UAV123    BAPGNet1            —       22.5%   61.7%    3.085
UAV123    BAPGNet2            —       28.1%   52.2%    2.610
UAVDT     BAPGNet             52.9%   18.3%   65.3%    3.265
UAVDT     BAPGNet1            —       23.2%   56.1%    2.805
UAVDT     BAPGNet2            —       29.3%   44.6%    2.230
Here and in Tables 8–10, a dash indicates the clean rate shared with the BAPGNet row for the same dataset.
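A note on the PA columns in Tables 7–10: every tabulated value equals the corresponding drop rate divided by 20. This is consistent with the patch-adversarial-ability metric taking the form PA = DR / S, where S is the relative patch size used in these experiments (here S = 20); for example, 69.4 / 20 = 3.47. The precise definition of S should be taken from the main text.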
Table 8. Precision rate drop in ablation experiment.
Dataset   Attacking Network   APR_c   APR_a   DR_TPR   PA_pr
UAV123    BAPGNet             73.9%   30.4%   58.9%    2.945
UAV123    BAPGNet1            —       33.5%   50.3%    2.515
UAV123    BAPGNet2            —       37.8%   45.0%    2.250
UAVDT     BAPGNet             68.7%   34.7%   49.5%    2.475
UAVDT     BAPGNet1            —       36.6%   48.2%    2.410
UAVDT     BAPGNet2            —       40.7%   40.8%    2.040
Table 9. Comparing results on success rate drop.
Dataset   Attacking Network   ASR_c   ASR_a   DR_TSR   PA_sr
UAV123    BAPGNet             58.8%   18.0%   69.4%    3.470
UAV123    SA                  —       30.5%   48.2%    2.410
UAV123    UEN-P               —       26.3%   55.3%    2.765
UAVDT     BAPGNet             52.9%   18.3%   65.3%    3.265
UAVDT     SA                  —       28.4%   46.3%    2.315
UAVDT     UEN-P               —       26.0%   50.7%    2.535
Table 10. Comparing results on precision rate drop.
Dataset   Attacking Network   APR_c   APR_a   DR_TPR   PA_pr
UAV123    BAPGNet             73.9%   30.4%   58.9%    2.945
UAV123    SA                  —       41.0%   44.5%    2.225
UAV123    UEN-P               —       37.4%   49.4%    2.470
UAVDT     BAPGNet             68.7%   34.7%   49.5%    2.475
UAVDT     SA                  —       42.4%   38.2%    1.910
UAVDT     UEN-P               —       39.7%   42.2%    2.110