5.1. Experimental Setup
Datasets: Experiments were conducted on three popular LiDAR datasets: KITTI [3], NuScenes [4], and Waymo [5]. The training splits were used to train the detectors, and the proposed contextual attribution maps-guided adversarial attack (CAMGA) was launched on the validation splits. The KITTI dataset includes 3712 training samples and 3769 validation samples; NuScenes contains 700 training sequences and 150 validation sequences, while Waymo contains 798 training sequences and 202 validation sequences.
Victim detectors: Several representative 3D object detectors were adopted as victim models, including PointPillars [27], PointRCNN [28], PV-RCNN [30], and CenterPoint [31]. These detectors cover a variety of input representations, ranging from raw points to voxels to a combination of both, and they include both advanced one-stage and two-stage 3D object detection methods.
Metrics: The quantitative metric used to measure the performance of the adversarial attack is the attack success rate (ASR), which is defined as
$\mathrm{ASR} = \dfrac{FN_{\mathrm{adv}} - FN}{TP},$
where $FN$ and $TP$ denote the numbers of false-negative and true-positive predictions without the adversarial attack, respectively, and $FN_{\mathrm{adv}}$ denotes the number of false-negative predictions under the attack. The attack success rate therefore measures how effectively the adversarial attack transforms true-positive predictions into false-negative predictions.
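As a concrete illustration, the following minimal sketch computes the ASR from per-scene detection outcomes following the definition above; the function and variable names are ours and are not taken from any released code.

```python
def attack_success_rate(fn_clean: int, tp_clean: int, fn_adv: int) -> float:
    """ASR: fraction of originally true-positive predictions that become
    false negatives under the attack."""
    if tp_clean == 0:
        return 0.0
    return (fn_adv - fn_clean) / tp_clean

# Example: 40 false negatives and 160 true positives before the attack,
# 140 false negatives after the attack -> ASR = (140 - 40) / 160 = 0.625
print(attack_success_rate(fn_clean=40, tp_clean=160, fn_adv=140))
```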
Implementation details: The proposed CAMGA method uses the Adam optimizer to generate the context perturbations for individual samples. For the KITTI dataset, the context perturbations were optimized with a learning rate of 0.001 for a maximum of N = 120 iterations; for the NuScenes and Waymo datasets, they were optimized with a learning rate of 0.001 for a maximum of N = 80 iterations. All experiments were performed on a PC with an Intel Core i7-12700 CPU, 64 GB of RAM, and an RTX 3090 GPU.
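A minimal PyTorch-style sketch of the per-sample optimization loop described above is given below. The detector and the two loss terms are placeholders supplied by the caller, and the balance weights of the losses are omitted for brevity, so this is an outline rather than the exact CAMGA implementation.

```python
import torch

def optimize_context_perturbation(detector, adversarial_loss, perception_loss,
                                  points, context_mask, lr=0.001, n_iters=120):
    """Optimize a perturbation restricted to the context points of one sample.

    detector, adversarial_loss, and perception_loss stand in for the victim
    model and the CAMGA loss terms; n_iters is 120 for KITTI and 80 for
    NuScenes/Waymo, as stated in the implementation details.
    """
    mask = context_mask.float().unsqueeze(-1)            # (N, 1), 1 for context points
    # Gaussian initialization with mean 0 and variance 0.1 on the context points only
    delta = (torch.randn_like(points) * 0.1 ** 0.5 * mask).requires_grad_(True)
    optimizer = torch.optim.Adam([delta], lr=lr)

    for _ in range(n_iters):
        perturbed = points + delta * mask                 # perturb context points only
        loss = adversarial_loss(detector(perturbed)) + perception_loss(delta)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (points + delta * mask).detach()
```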
The parameter k, which is used to generate the context area, was set to 1.5 by default. The number of subregions n and the number of influential regions of the object context area were set to 10 and 5, respectively. Ablation studies on the parameter k and the number of influential regions are conducted in Section 5.5. The IoU threshold in Equation (4) was set to 0.5, which is the threshold criterion of the dataset evaluation. Perturbations added to the context area were initialized with Gaussian noise whose mean and variance were set to 0 and 0.1, respectively.
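As one way to realize the context area controlled by k, the sketch below marks the points that fall inside the ground-truth box enlarged by the factor k but outside the original box. Axis-aligned boxes are assumed purely for brevity; the actual annotation boxes are oriented and would require rotating the points into the box frame first.

```python
import numpy as np

def context_area_mask(points, box_center, box_size, k=1.5):
    """Boolean mask of points inside the box enlarged by factor k but outside
    the original box, i.e., the points belonging to the context area.

    Assumes an axis-aligned box given by its center and size (both shape (3,));
    points has shape (N, 3) or (N, >=3) with xyz in the first three columns.
    """
    rel = np.abs(points[:, :3] - box_center)              # per-axis distance to the box center
    half, half_k = box_size / 2.0, (box_size * k) / 2.0
    inside_enlarged = np.all(rel <= half_k, axis=1)        # inside the k-enlarged box
    inside_original = np.all(rel <= half, axis=1)          # inside the original box
    return inside_enlarged & ~inside_original
```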
5.2. Contextual Information Significance Analysis
Before launching adversarial attacks, we evaluated the role of contextual information in 3D object detection using the method proposed in Section 3. In one setting, we kept the points of the object unchanged and only removed the points within its context area, whose size is still controlled by k; the performance of the detectors in this setting is shown in Figure 3b. The plots in Figure 3 report the average precision (AP) of four different detectors on the KITTI dataset. In Figure 3a, k = 1 indicates that only the target object was removed. As can be seen from the figure, most detectors maintain a high average precision in scenarios where only the objects are removed and the contextual information is left unchanged. In other words, they still predict that targets exist in those empty regions from which the objects have been removed. Only when the value of k gradually increases and more contextual regions are removed do the performances of the detectors decrease significantly. This indicates that the effect of point cloud attacks targeting only objects is limited: even after the objects have been removed, the detectors still assume that objects are present based on the surrounding environment.
In Figure 3b, all points of the objects are preserved, while only the contextual regions are gradually removed. When k = 1, the result corresponds to the raw performance of the detector, because there are no points in the context area that could be removed. With the removal of contextual information, the average precision of the detectors decreases to below 30%. Comparing Figure 3a,b, it can be seen that removing the object's context is more damaging to the average precision than removing the target object itself. This indicates that the object context is even more important to the prediction result than the object points: based on contextual information alone, the detector assumes that an object exists in empty space, which we believe is a significant shortcoming of existing detectors.
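A rough sketch of how the two removal protocols above could be implemented is shown below, reusing the hypothetical context_area_mask helper sketched in Section 5.1; the standard AP evaluation of the benchmark would then be run on the modified scenes.

```python
import numpy as np

def remove_points_for_analysis(points, boxes, k, keep_object=True):
    """Build the modified scene for the context-significance study.

    keep_object=True reproduces the Figure 3b setting (object points kept,
    context points removed); keep_object=False reproduces Figure 3a (object
    removed and, as k grows, its context as well). boxes is a list of
    (center, size) pairs; axis-aligned boxes are assumed for brevity.
    """
    keep = np.ones(len(points), dtype=bool)
    for center, size in boxes:
        keep &= ~context_area_mask(points, center, size, k)   # drop context points
        if not keep_object:
            # also drop the object points themselves (points inside the raw box)
            obj = np.all(np.abs(points[:, :3] - center) <= size / 2.0, axis=1)
            keep &= ~obj
    return points[keep]
```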
5.3. Quantitative Results
In this section, we compare the proposed contextual attribution maps-guided attack method with several advanced attack methods. Specifically, the methods of Tu et al. [15], Sun et al. [41], 3D-VField [16], and Wang et al. [45] are all based on point addition, while Wang et al. [42] generates adversarial scenes using point perturbations. The attack of Tu et al. [15] simulates the scenario of placing objects on top of vehicles in the physical world, but it requires a long time to iteratively optimize a single scene. The attack methods [16,41,45] are easy to launch, but that of Sun et al. [41] violates the sensor rules by adding point clouds in occluded areas, which is easily defended by anomaly detection methods, while 3D-VField [16] and the method of Wang et al. [45] require additional object point clouds to be prepared in advance of the attack. In addition, the point perturbation attack method [42] achieves a higher attack success rate, but it causes significant changes to the point clouds of objects and is easily perceived by humans.
Quantitative results on the KITTI dataset are shown in Table 1. Here, we set the maximum point budget of the method in [45] to 30, because the proposed attacks launched in the influential regions usually affect no more than 40 points. It should be noted that the last four rows of Table 1 report the performance of the proposed CAMGA method when the adversarial samples are generated on the basis of a particular detector (in parentheses) and the attack is then launched on other victim detectors. For example, CAMGA(PP.) denotes adversarial samples generated on the PointPillars detector, which are subsequently fed to the different victim detectors. This setup measures the transferability of the proposed method. The time cost in the table represents the time required to generate one adversarial sample under the current configuration; the time costs of the open-source methods are listed for comparison.
Table 1 shows that the proposed CAMGA method outperforms existing state-of-the-art methods by 1.7%, 0.4%, and 0.5% in terms of attack success rate on the three victim detectors, respectively. Although existing adversarial attack methods have not been evaluated on CenterPoint, the proposed CAMGA method achieved a success rate of 64.0% on it, which is still a considerable performance. These results demonstrate that the proposed context perturbations achieve strong adversarial attack effectiveness. Moreover, the adversarial samples achieved an attack success rate of more than 50% on the other victim detectors, which demonstrates the high transferability of the CAMGA method and confirms that existing 3D detectors rely heavily on the object's contextual information. Because the proposed adversarial attack requires backpropagation through the detector, its computational cost largely depends on the complexity of the detector. As the table shows, in most cases the time cost of CAMGA is lower than that of the perturbation-based attack method [42], which demonstrates its efficiency.
Furthermore, the CAMGA method was launched on three object categories of the NuScenes and Waymo datasets. Since other adversarial attack methods have not reported attacks on NuScenes and Waymo, we only report the attack success rates of the CAMGA method on these two large-scale datasets. The adversarial samples were generated directly on the victim detector, and the results are shown in Table 2. As can be seen from Table 2, the proposed CAMGA method achieved an attack success rate of more than 50% on both large-scale datasets and all advanced victim detectors. The attack success rates for the pedestrian and cyclist categories exceed that for cars, indicating that the detectors' predictions for these categories rely even more on contextual information, since such objects contain fewer points. This also demonstrates that the contextual information of objects plays an important role across different datasets and that existing detectors are very sensitive to the influential regions within it.
5.5. Ablation Study
In this section, we conduct ablation studies on the number of subregions n, the number of influential regions, the two balance parameters of the loss terms, and the context-size factor k.
The number of subregions n: Before extracting the influential regions, the context area of the object needs to be divided into n subregions to generate the attribution map, and n directly affects the size of each subregion and the number of points in it. Because the number of points within an object's context area is mostly between 20 and 40, we choose n in the interval from 5 to 25 so that each subregion contains more than two points wherever possible. The number of influential regions is kept at its default value of 5, which is less than or equal to n. The attack success rates on the KITTI dataset for different choices of n are shown in Table 3.
As can be seen from the table, the attack success rates are negatively correlated with n. This is because, when the number of influential regions is constant, a larger n results in fewer points being attacked. When n is equal to the number of influential regions, the entire context area is under attack. Furthermore, a smaller n increases the computational complexity and the probability of being perceived. Therefore, we choose n = 10 as a trade-off between attack success rate and imperceptibility.
The number of influential regions: After determining the number of subregions n, the number of influential regions decides how many points in the context area are attacked, i.e., how many points are disturbed by the adversarial perturbations. Its maximum value needs to be less than n, so with n equal to 10, we vary the number of influential regions from 1 to 9. The attack success rates under different numbers of influential regions are shown in Table 4.
From the table, we can see that the attack success rates increase with the number of influential regions and grow slowly once it exceeds 5. In order to balance attack success rate and imperceptibility, we set the number of influential regions to 5.
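For illustration, the simplified sketch below divides the context points into n subregions and keeps the most influential ones. The equal-angle partition around the object center and the per-point attribution scores are assumptions made only for this sketch; the paper's contextual attribution map defines the actual regions and their scores.

```python
import numpy as np

def select_influential_regions(ctx_points, obj_center, attribution_scores, n=10, top=5):
    """Split the context points into n subregions and keep the `top` regions
    with the highest summed attribution score.

    ctx_points: (N, 3) context points; attribution_scores: (N,) per-point scores.
    Returns a boolean mask over ctx_points marking the attackable points.
    """
    # assign each context point to one of n equal-angle sectors around the object
    angles = np.arctan2(ctx_points[:, 1] - obj_center[1],
                        ctx_points[:, 0] - obj_center[0])
    region_id = ((angles + np.pi) / (2 * np.pi) * n).astype(int) % n
    # score each region by the summed attribution of its points
    region_scores = np.array([attribution_scores[region_id == r].sum() for r in range(n)])
    influential = np.argsort(region_scores)[-top:]          # indices of the top regions
    return np.isin(region_id, influential)
```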
Balance of two loss functions: The two balance parameters are designed to find a balance between the adversarial loss and the perception loss. There is a trade-off between attack performance and perturbation magnitude, where a strong attack is desired together with a small perturbation. The performance of the CAMGA attack method with respect to different ratios of the two parameters was examined on the KITTI dataset, because the effect of the two parameters depends only on their ratio. Therefore, based on experience, we selected ratios spanning three orders of magnitude, as well as the cases where one of the parameters is equal to 0, for the parameter analysis. In the table, the attack success rate is used to measure the effectiveness of the adversarial attacks, while the Chamfer Distance [48] is used to measure the difference between the benign scenes and their generated adversarial samples. The experimental results are shown in Table 5.
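Since the Chamfer Distance [48] serves as the imperceptibility metric here, a minimal brute-force version of the symmetric Chamfer Distance between a benign scene and its adversarial counterpart might look as follows; practical implementations use KD-trees or GPU batching, and some variants use squared distances.

```python
import numpy as np

def chamfer_distance(pc_a: np.ndarray, pc_b: np.ndarray) -> float:
    """Symmetric Chamfer Distance between point clouds of shape (N, 3) and (M, 3).

    Brute-force O(N*M) version for illustration only.
    """
    d = np.linalg.norm(pc_a[:, None, :] - pc_b[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```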
It can be learned from Table 5 that increasing the weight of the adversarial loss relative to the perception loss improves the attack success rate of the proposed method. A ratio of 0 or ∞ means that the context perturbations are optimized only through the perception loss or only through the adversarial loss, respectively. When the ratio is 0, the attack success rate and the Chamfer Distance both equal 0, which shows that the initial Gaussian noise is eliminated under the guidance of the perception loss alone. When the ratio is ∞, the attack success rate rises to 95.6% without the limitation of the perception loss. Specifically, when the ratio rises from 0.1 to 10, the attack success rate increases by about 7%. These results show that the adversarial loss, regulated by its balance parameter, effectively suppresses the performance of the detectors, while the perception loss, weighted by the other balance parameter, constrains the perturbation magnitude to ensure that the adversarial attack is difficult to perceive.
To better examine the influence of the balance parameters on the visual perception of the generated adversarial examples, we set the value of one parameter to 0.01, 0.1, and 1 while keeping the other equal to 1. A demonstration of the generated samples is given in Figure 5. From the figure, it can be seen that decreasing this value significantly limits the intensity of the perturbations, which results in a more regular structure of the points in the context area of the object while maintaining the performance of the attack. For the other two datasets, the same ablation study was conducted to balance the loss terms. We obtained values of 2 and 2.5 for NuScenes and Waymo, respectively, while keeping the other parameter constant at 0.1, which may be due to differences in scene range and sensor accuracy.
Influence of the context-size factor k: The context area is determined by enlarging the object's bounding box by the factor k. The contextual region of an object is considered to lie within a certain range around the object, so we choose twice the size of the object as the upper limit of k. Table 6 shows the attack success rates of CAMGA under different k values on the KITTI dataset.
Table 6 indicates that the attack success rate increases gradually as the value of k increases. k = 1 indicates that the enlarged box coincides with the ground-truth box and there are no points in the context area; as a result, no perturbation can be added to the scene, and the attack success rate is 0%. When k is equal to 1.1 or 1.3, the context area is small and the perturbation can only be added to a few points, whereas k = 2 means that the enlarged box is eight times as large as the object, since each dimension is doubled (2³ = 8 in volume).
It can be learned from Table 6 that the attack success rate increases sharply as the value of k rises from 1 to 1.5 and then grows only slightly. Since expanding the region makes the attack more effective but also increases the probability of being perceived, we chose k = 1.5 as a balance between attack performance and imperceptibility.