5.1. Experimental Setup
Datasets: Experiments were conducted on three popular LiDAR datasets: KITTI [3], NuScenes [4], and Waymo [5]. The training splits were used to train the detectors, and the proposed contextual attribution maps-guided adversarial attack (CAMGA) was launched on the validation splits. The KITTI dataset includes 3712 training samples and 3769 validation samples; NuScenes contains 700 training sequences and 150 validation sequences, while Waymo contains 798 training sequences and 202 validation sequences.
Victim detectors: Several representative 3D object detectors were adopted as victim models, including PointPillars [27], PointRCNN [28], PV-RCNN [30], and CenterPoint [31]. These detectors cover a variety of input representations, ranging from raw points to voxels to a combination of both, and they include both advanced one-stage and two-stage 3D object detection methods.
Metrics: The quantitative metric used to measure the performance of the adversarial attack is the attack success rate (ASR), which is defined as
$\mathrm{ASR} = \dfrac{FN_{\mathrm{adv}} - FN}{TP},$
where $FN$ and $TP$ denote the numbers of false-negative and true-positive predictions without the adversarial attack, respectively, and $FN_{\mathrm{adv}}$ denotes the number of false-negative predictions under the attack. The attack success rate therefore measures how effectively the adversarial attack transforms true-positive predictions into false-negative predictions.
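As a concrete illustration, the following minimal sketch computes the ASR from per-scene detection outcomes following the definition above; the function and variable names are ours and are not taken from any released code.

```python
def attack_success_rate(fn_clean: int, tp_clean: int, fn_adv: int) -> float:
    """ASR: fraction of originally true-positive predictions that become
    false negatives under the attack."""
    if tp_clean == 0:
        return 0.0
    return (fn_adv - fn_clean) / tp_clean

# Example: 40 false negatives and 160 true positives before the attack,
# 140 false negatives after the attack -> ASR = (140 - 40) / 160 = 0.625
print(attack_success_rate(fn_clean=40, tp_clean=160, fn_adv=140))
```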
Implementation details: The proposed CAMGA method uses the Adam optimizer to generate the context perturbations for individual samples. For the KITTI dataset, the context perturbations were optimized with a learning rate of 0.001 for a maximum of N = 120 iterations; for the NuScenes and Waymo datasets, they were optimized with a learning rate of 0.001 for a maximum of N = 80 iterations. All experiments were performed on a PC with an Intel Core i7-12700 CPU, 64 GB of RAM, and an RTX 3090 GPU.
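A minimal PyTorch-style sketch of the per-sample optimization loop described above is given below. The detector and the two loss terms are placeholders supplied by the caller, and the balance weights of the losses are omitted for brevity, so this is an outline rather than the exact CAMGA implementation.

```python
import torch

def optimize_context_perturbation(detector, adversarial_loss, perception_loss,
                                  points, context_mask, lr=0.001, n_iters=120):
    """Optimize a perturbation restricted to the context points of one sample.

    detector, adversarial_loss, and perception_loss stand in for the victim
    model and the CAMGA loss terms; n_iters is 120 for KITTI and 80 for
    NuScenes/Waymo, as stated in the implementation details.
    """
    mask = context_mask.float().unsqueeze(-1)            # (N, 1), 1 for context points
    # Gaussian initialization with mean 0 and variance 0.1 on the context points only
    delta = (torch.randn_like(points) * 0.1 ** 0.5 * mask).requires_grad_(True)
    optimizer = torch.optim.Adam([delta], lr=lr)

    for _ in range(n_iters):
        perturbed = points + delta * mask                 # perturb context points only
        loss = adversarial_loss(detector(perturbed)) + perception_loss(delta)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (points + delta * mask).detach()
```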
The parameter k, which is used to generate the context area, was set to 1.5 by default. The number of subregions n and the number of influential regions of the object context area were set to 10 and 5, respectively. Ablation studies on the parameter k and the number of influential regions are conducted in Section 5.5. The IoU threshold in Equation (4) was set to 0.5, which is the threshold criterion of the dataset evaluation. Perturbations added to the context area were initialized with Gaussian noise whose mean and variance were set to 0 and 0.1, respectively.
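As one way to realize the context area controlled by k, the sketch below marks the points that fall inside the ground-truth box enlarged by the factor k but outside the original box. Axis-aligned boxes are assumed purely for brevity; the actual annotation boxes are oriented and would require rotating the points into the box frame first.

```python
import numpy as np

def context_area_mask(points, box_center, box_size, k=1.5):
    """Boolean mask of points inside the box enlarged by factor k but outside
    the original box, i.e., the points belonging to the context area.

    Assumes an axis-aligned box given by its center and size (both shape (3,));
    points has shape (N, 3) or (N, >=3) with xyz in the first three columns.
    """
    rel = np.abs(points[:, :3] - box_center)              # per-axis distance to the box center
    half, half_k = box_size / 2.0, (box_size * k) / 2.0
    inside_enlarged = np.all(rel <= half_k, axis=1)        # inside the k-enlarged box
    inside_original = np.all(rel <= half, axis=1)          # inside the original box
    return inside_enlarged & ~inside_original
```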
5.2. Contextual Information Significance Analysis
Before launching adversarial attacks, we evaluated the role of contextual information in 3D object detection using the method proposed in Section 3. In one setting, we kept the points of the object unchanged and only removed the points within its context area, whose size is still controlled by k; the performance of the detectors in this setting is shown in Figure 3b. The plots in Figure 3 report the average precision (AP) of four different detectors on the KITTI dataset. In Figure 3a, k = 1 indicates that only the target object was removed. As can be seen from the figure, most detectors maintain a high average precision in scenarios where only the objects are removed and the contextual information is left unchanged. In other words, they still predict that targets exist in those empty regions from which the objects have been removed. Only when the value of k gradually increases and more contextual regions are removed do the performances of the detectors decrease significantly. This indicates that the effect of point cloud attacks targeting only objects is limited: even after the objects have been removed, the detectors still assume that objects are present based on the surrounding environment.
In Figure 3b, all points of the objects are preserved, while only the contextual regions are gradually removed. When k = 1, the result corresponds to the raw performance of the detector, because there are no points in the context area that could be removed. With the removal of contextual information, the average precision of the detectors decreases to below 30%. Comparing Figure 3a,b, it can be seen that removing the object's context is more damaging to the average precision than removing the target object itself. This indicates that the object context is even more important to the prediction result than the object points: based on contextual information alone, the detector assumes that an object exists in empty space, which we believe is a significant shortcoming of existing detectors.
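A rough sketch of how the two removal protocols above could be implemented is shown below, reusing the hypothetical context_area_mask helper sketched in Section 5.1; the standard AP evaluation of the benchmark would then be run on the modified scenes.

```python
import numpy as np

def remove_points_for_analysis(points, boxes, k, keep_object=True):
    """Build the modified scene for the context-significance study.

    keep_object=True reproduces the Figure 3b setting (object points kept,
    context points removed); keep_object=False reproduces Figure 3a (object
    removed and, as k grows, its context as well). boxes is a list of
    (center, size) pairs; axis-aligned boxes are assumed for brevity.
    """
    keep = np.ones(len(points), dtype=bool)
    for center, size in boxes:
        keep &= ~context_area_mask(points, center, size, k)   # drop context points
        if not keep_object:
            # also drop the object points themselves (points inside the raw box)
            obj = np.all(np.abs(points[:, :3] - center) <= size / 2.0, axis=1)
            keep &= ~obj
    return points[keep]
```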
5.3. Quantitative Results
In this section, we compare the proposed contextual attribution maps-guided attack method with several advanced attack methods. Specifically, the methods of Tu et al. [15], Sun et al. [41], 3D-VField [16], and Wang et al. [45] are all based on point addition, while Wang et al. [42] generates adversarial scenes using point perturbations. The attack of Tu et al. [15] simulates the scenario of placing objects on top of vehicles in the physical world, but it requires a long time to iteratively optimize a single scene. The attack methods [16,41,45] are easy to launch, but that of Sun et al. [41] violates the sensor rules by adding point clouds in occluded areas, which is easily defended by anomaly detection methods, while 3D-VField [16] and the method of Wang et al. [45] require additional object point clouds to be prepared in advance of the attack. In addition, the point perturbation attack method [42] achieves a higher attack success rate, but it causes significant changes to the point clouds of objects and is easily perceived by humans.
Quantitative results on the KITTI dataset are shown in Table 1. Here, we set the maximum point budget of the method in [45] to 30, because the proposed attacks launched in the influential regions usually affect no more than 40 points. It should be noted that the last four rows of Table 1 report the performance of the proposed CAMGA method when the adversarial samples are generated on the basis of a particular detector (in parentheses) and the attack is then launched on other victim detectors. For example, CAMGA(PP.) denotes adversarial samples generated on the PointPillars detector, which are subsequently fed to the different victim detectors. This setup measures the transferability of the proposed method. The time cost in the table represents the time required to generate one adversarial sample under the current configuration; the time costs of the open-source methods are listed for comparison.
Table 1 shows that the proposed CAMGA method outperforms existing state-of-the-art methods by 1.7%, 0.4%, and 0.5% in terms of attack success rate on the three victim detectors, respectively. Although existing adversarial attack methods have not been evaluated on CenterPoint, the proposed CAMGA method achieved a success rate of 64.0% on it, which is still a considerable performance. These results demonstrate that the proposed context perturbations achieve strong adversarial attack effectiveness. Moreover, the adversarial samples achieved an attack success rate of more than 50% on the other victim detectors, which demonstrates the high transferability of the CAMGA method and confirms that existing 3D detectors rely heavily on the object's contextual information. Because the proposed adversarial attack requires backpropagation through the detector, its computational cost largely depends on the complexity of the detector. As the table shows, in most cases the time cost of CAMGA is lower than that of the perturbation-based attack method [42], which demonstrates its efficiency.
Furthermore, the CAMGA method was launched on three object categories of the NuScenes and Waymo datasets. Since other adversarial attack methods have not reported attacks on NuScenes and Waymo, we only report the attack success rates of the CAMGA method on these two large-scale datasets. The adversarial samples were generated directly on the victim detector, and the results are shown in Table 2. As can be seen from Table 2, the proposed CAMGA method achieved an attack success rate of more than 50% on both large-scale datasets and all advanced victim detectors. The attack success rates for the pedestrian and cyclist categories exceed that for cars, indicating that the detectors' predictions for these categories rely even more on contextual information, since such objects contain fewer points. This also demonstrates that the contextual information of objects plays an important role across different datasets and that existing detectors are very sensitive to the influential regions within it.
5.5. Ablation Study
In this section, we conduct ablation studies on the number of subregions n, the number of influential regions, the two balance parameters of the loss terms, and the context-size factor k.
The number of subregions n: Before extracting the influential regions, the context area of the object needs to be divided into n subregions to generate the attribution map, and n directly affects the size of each subregion and the number of points in it. Because the number of points within an object's context area is mostly between 20 and 40, we choose n in the interval from 5 to 25 so that each subregion contains more than two points wherever possible. The number of influential regions is kept at its default value of 5, which is less than or equal to n. The attack success rates on the KITTI dataset for different choices of n are shown in Table 3.
As can be seen from the table, the attack success rates are negatively correlated with n. This is because, when the number of influential regions is constant, a larger n results in fewer points being attacked. When n is equal to the number of influential regions, the entire context area is under attack. Furthermore, a smaller n increases the computational complexity and the probability of being perceived. Therefore, we choose n = 10 as a trade-off between attack success rate and imperceptibility.
The number of influential regions: After determining the number of subregions n, the number of influential regions decides how many points in the context area are attacked, i.e., how many points are disturbed by the adversarial perturbations. Its maximum value needs to be less than n, so with n equal to 10, we vary the number of influential regions from 1 to 9. The attack success rates under different numbers of influential regions are shown in Table 4.
From the table, we can see that the attack success rates increase with the number of influential regions and grow slowly once it exceeds 5. In order to balance attack success rate and imperceptibility, we set the number of influential regions to 5.
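For illustration, the simplified sketch below divides the context points into n subregions and keeps the most influential ones. The equal-angle partition around the object center and the per-point attribution scores are assumptions made only for this sketch; the paper's contextual attribution map defines the actual regions and their scores.

```python
import numpy as np

def select_influential_regions(ctx_points, obj_center, attribution_scores, n=10, top=5):
    """Split the context points into n subregions and keep the `top` regions
    with the highest summed attribution score.

    ctx_points: (N, 3) context points; attribution_scores: (N,) per-point scores.
    Returns a boolean mask over ctx_points marking the attackable points.
    """
    # assign each context point to one of n equal-angle sectors around the object
    angles = np.arctan2(ctx_points[:, 1] - obj_center[1],
                        ctx_points[:, 0] - obj_center[0])
    region_id = ((angles + np.pi) / (2 * np.pi) * n).astype(int) % n
    # score each region by the summed attribution of its points
    region_scores = np.array([attribution_scores[region_id == r].sum() for r in range(n)])
    influential = np.argsort(region_scores)[-top:]          # indices of the top regions
    return np.isin(region_id, influential)
```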
Balance of two loss functions: The two balance parameters are designed to find a balance between the adversarial loss and the perception loss. There is a trade-off between attack performance and perturbation magnitude, where a strong attack is desired together with a small perturbation. The performance of the CAMGA attack method with respect to different ratios of the two parameters was examined on the KITTI dataset, because the effect of the two parameters depends only on their ratio. Therefore, based on experience, we selected ratios spanning three orders of magnitude, as well as the cases where one of the parameters is equal to 0, for the parameter analysis. In the table, the attack success rate is used to measure the effectiveness of the adversarial attacks, while the Chamfer Distance [48] is used to measure the difference between the benign scenes and their generated adversarial samples. The experimental results are shown in Table 5.
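Since the Chamfer Distance [48] serves as the imperceptibility metric here, a minimal brute-force version of the symmetric Chamfer Distance between a benign scene and its adversarial counterpart might look as follows; practical implementations use KD-trees or GPU batching, and some variants use squared distances.

```python
import numpy as np

def chamfer_distance(pc_a: np.ndarray, pc_b: np.ndarray) -> float:
    """Symmetric Chamfer Distance between point clouds of shape (N, 3) and (M, 3).

    Brute-force O(N*M) version for illustration only.
    """
    d = np.linalg.norm(pc_a[:, None, :] - pc_b[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```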
It can be learned from Table 5 that increasing the weight of the adversarial loss relative to the perception loss improves the attack success rate of the proposed method. A ratio of 0 or ∞ means that the context perturbations are optimized only through the perception loss or only through the adversarial loss, respectively. When the ratio is 0, the attack success rate and the Chamfer Distance both equal 0, which shows that the initial Gaussian noise is eliminated under the guidance of the perception loss alone. When the ratio is ∞, the attack success rate rises to 95.6% without the limitation of the perception loss. Specifically, when the ratio rises from 0.1 to 10, the attack success rate increases by about 7%. These results show that the adversarial loss, regulated by its balance parameter, effectively suppresses the performance of the detectors, while the perception loss, weighted by the other balance parameter, constrains the perturbation magnitude to ensure that the adversarial attack is difficult to perceive.
To better examine the influence of the balance parameters on the visual perception of the generated adversarial examples, we set the value of one parameter to 0.01, 0.1, and 1 while keeping the other equal to 1. A demonstration of the generated samples is given in Figure 5. From the figure, it can be seen that decreasing this value significantly limits the intensity of the perturbations, which results in a more regular structure of the points in the context area of the object while maintaining the performance of the attack. For the other two datasets, the same ablation study was conducted to balance the loss terms. We obtained values of 2 and 2.5 for NuScenes and Waymo, respectively, while keeping the other parameter constant at 0.1, which may be due to differences in scene range and sensor accuracy.
Influence of the context-size factor k: The context area is determined by enlarging the object's bounding box by the factor k. The contextual region of an object is considered to lie within a certain range around the object, so we choose twice the size of the object as the upper limit of k. Table 6 shows the attack success rates of CAMGA under different k values on the KITTI dataset.
Table 6 indicates that the attack success rate increases gradually as the value of k increases. k = 1 indicates that the enlarged box coincides with the ground-truth box and there are no points in the context area; as a result, no perturbation can be added to the scene, and the attack success rate is 0%. When k is equal to 1.1 or 1.3, the context area is small and the perturbation can only be added to a few points, whereas k = 2 means that the enlarged box is eight times as large as the object, since each dimension is doubled (2³ = 8 in volume).
It can be learned from Table 6 that the attack success rate increases sharply as the value of k rises from 1 to 1.5 and then grows only slightly. Since expanding the region makes the attack more effective but also increases the probability of being perceived, we chose k = 1.5 as a balance between attack performance and imperceptibility.