Article

TACO: Adversarial Camouflage Optimization on Trucks to Fool Object Detectors

by Adonisz Dimitriu *,†, Tamás Vilmos Michaletzky and Viktor Remeli
Techtra Technology Transfer Institute, Széchenyi István University, 9026 Győr, Hungary
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Big Data Cogn. Comput. 2025, 9(3), 72; https://doi.org/10.3390/bdcc9030072
Submission received: 12 November 2024 / Revised: 18 January 2025 / Accepted: 9 March 2025 / Published: 19 March 2025

Abstract: Adversarial attacks threaten the reliability of machine learning models in critical applications like autonomous vehicles and defense systems. As object detectors become more robust with models like YOLOv8, developing effective adversarial methodologies is increasingly challenging. We present Truck Adversarial Camouflage Optimization (TACO), a novel framework that generates adversarial camouflage patterns on 3D vehicle models to deceive state-of-the-art object detectors. Built on Unreal Engine 5, TACO integrates differentiable rendering with a Photorealistic Rendering Network to optimize adversarial textures targeted at YOLOv8. To ensure the generated textures are both effective in deceiving detectors and visually plausible, we introduce the Convolutional Smooth Loss function, a generalized smooth loss function. Experimental evaluations demonstrate that TACO significantly degrades YOLOv8’s detection performance, achieving an AP@0.5 of 0.0099 on unseen test data. Furthermore, these adversarial patterns exhibit strong transferability to other object detection models such as Faster R-CNN and earlier YOLO versions.

1. Introduction

In recent years, object detection has made significant advances, with the YOLO family of models leading the way in real-time applications. These models have become essential in areas like autonomous driving, surveillance, and robotics, offering high accuracy and efficiency in many applications [1]. However, as these technologies become more integrated into critical systems, their vulnerabilities have also come to light. Adversarial attacks, which involve subtle, often imperceptible perturbations, have been shown to cause machine learning models to make incorrect predictions, exposing significant security risks [2]. While these attacks initially focused on classifiers, the scope has since expanded, with adversarial methods now being applied to domains such as object detection, image segmentation [3], reinforcement learning [4], and even large language models [5].
Adversarial attacks can be broadly categorized into digital and physical attacks. Digital attacks involve manipulating pixel values in an image to fool a model. These attacks are effective when the input to the model is a digital image, but they fail when applied to real-world objects. For instance, a pixel change that deceives an object detector digitally may become ineffective when printed or viewed under different lighting and perspectives, as changes in illumination or camera angles can alter the pixel values of a printed pattern.
This limitation has driven the development of physical adversarial attacks, where the patterns are physically applied to objects. These attacks must remain effective across varying lighting conditions, angles, and environmental factors. Recent research has demonstrated that such adversarial attacks are feasible. For example, [6] showed that even natural phenomena, such as shadows, can serve as effective adversarial attacks, revealing how easily object detection models can be deceived. Similarly, [7] introduced a novel approach that focuses on background adversarial perturbations in both digital and physical domains, showing that Deep Neural Networks (DNNs) can be deceived by perturbations applied to the background rather than the objects themselves.
To effectively craft physical adversarial patterns, a differentiable image generation pipeline is essential. Such a pipeline enables the optimization of adversarial patterns by allowing gradients of a loss function—typically tied to the object detection model’s confidence score—to propagate through the entire rendering process. This capability is critical because it allows the gradients to flow seamlessly from the output of the object detector back to the 2D texture applied to the 3D model. By doing so, the optimization process can directly adjust the texture to minimize the detection confidence score or alter the detector’s behavior. Ultimately, a successful pipeline ensures that the generated patterns are specifically tuned to exploit the vulnerabilities of the object detector, effectively camouflaging the object against detection systems.
In this study, we introduce Truck Adversarial Camouflage Optimization (TACO), a novel framework designed to render a specific truck model undetectable to state-of-the-art object detection models by generating adversarial camouflage patterns. Leveraging Unreal Engine 5 (UE5) for photorealistic and differentiable rendering, TACO optimizes textures applied to a 3D truck model to deceive detectors, specifically targeting YOLOv8 [8]. Our fully differentiable pipeline integrates advanced rendering techniques with neural networks to optimize adversarial patterns that prevent the detection of the truck. The key contributions of TACO are as follows:
  • We are the first to utilize UE5 for generating adversarial patterns within a fully differentiable rendering pipeline. This advancement builds upon prior methods that employed Unreal Engine 4 (UE4), offering improved graphical fidelity and rendering capabilities. Using UE5, we reduce the domain gap between the simulated environment and real-world deployment, ensuring adversarial patterns remain effective in the real world.
  • We introduce an additional neural rendering component, a gray textured truck image, to accurately capture and reproduce lighting and shadow conditions.
  • We are the first to design adversarial patterns specifically for YOLOv8 in the context of vehicle detection, moving beyond previous work that focused on older models like YOLOv3 [9].
  • We introduce Intersection over Prediction-based (IoP-based) filtering as part of the class loss formulation, enhancing the stealthiness of adversarial optimization by considering bounding boxes that significantly overlap with the target object. This method reduces false detections and improves the overall effectiveness of adversarial patterns.
  • We propose the Convolutional Smooth Loss function, a novel smooth loss function for ensuring that the adversarial textures are not only effective but also visually plausible.
The rest of the paper is organized as follows. Section 2 explores related works on adversarial attacks using visual patterns. Section 3 formulates the problem statement and the TACO framework. Section 4 shows the implementation details, followed by a presentation of our results in Section 5. Section 6 concludes the paper.

2. Related Works

Adversarial attacks on object detection systems have garnered significant attention in recent years. While early research primarily focused on digital adversarial examples, the shift towards physical-world attacks has introduced new challenges and methodologies. This section reviews the evolution of physical adversarial attacks, particularly those targeting vehicles, and highlights how our work advances the state of the art.
One of the initial approaches to physical adversarial attacks involved the use of adversarial patches. For instance, it was shown that attaching patches to specific regions in an image could deceive object detectors into misclassifying or failing to detect objects [10]. Building on this concept, it was also demonstrated that holding a printed adversarial patch in front of a person could successfully evade person detection systems [11]. While these studies provided valuable insights, they primarily focused on human subjects and simple scenarios.
Shifting the focus to vehicle-based applications, which are particularly relevant for autonomous driving and the military sector, researchers explored new methods to deceive object detectors. An innovative approach was to attach a screen to a car that displays adversarial patterns dynamically adjusted based on the camera’s viewpoint [12]. Although this method proved effective, it relies on electronic displays and knowledge of the detector camera location, which may not be practical or covert in real-world scenarios.
To overcome the limitations of screen-based methods, a black-box method was proposed to approximate both rendering and gradient estimation for generating adversarial patterns [13]. Notably, they observed that increasing the resolution of the camouflage does not necessarily enhance the fooling rate of object detectors, suggesting a trade-off between pattern complexity and effectiveness.
Further advancements were made by exploring white-box attacks that leverage knowledge of the target model’s parameters. Two primary strategies emerged for constructing differentiable pipelines for pattern optimization. The first strategy involved projecting a pattern onto the surface of the object using camera parameters. For example, a differentiable transformation network approach (DTA) projected repeated patterns onto vehicles [14]. However, this approach suffered from projection errors, especially on non-planar surfaces, leading to inaccuracies in the application of adversarial patterns.
Addressing these limitations, triplanar mapping was introduced (ACTIVE [15]), which projects the pattern from three different planes to reduce distortion on complex geometries. However, it still struggles to maintain texture proportions on complex geometries, such as bent or highly curved surfaces, because the mapping does not account for the underlying surface’s spatial distortion.
An alternative and more accurate approach involves the use of neural mesh renderers, also known as differentiable renderers [16]. By mapping textures directly onto each triangle face of the 3D mesh using UV coordinates, differentiable renderers ensure that every pixel in the final rendered view aligns with the underlying geometry, allowing reliable gradient backpropagation into the texture space and minimizing artifacts. The Dual Attention Suppression (DAS) approach [17] utilized this technique, aiming to suppress the attention maps of the target detection model while generating visually natural adversarial textures relying on partial coverage using patches.
However, partial coverage is less effective than full-body textures; Wang et al. [18] demonstrated the superior performance of the Full-coverage Camouflage Attack (FCA), achieving increased robustness and performance in deceiving object detectors. Building upon this, Zhou et al. proposed RAUCA, which further enhances the effectiveness of adversarial patterns by simulating various weather conditions, such as different times of day, rain, and fog [19]. They incorporated environmental information from background images, passed through their Environment Feature Extractor network. Their results indicated that patterns optimized under diverse scenarios exhibit greater resilience in real-world applications.
In parallel, Duan et al. proposed a method to generate adversarial patterns for a 3D truck object, specifically targeting the Faster R-CNN object detector [20]. Their approach combined 3D rendering with dense proposal attacks to train adversarial camouflage across varying viewpoints and lighting conditions, further advancing the effectiveness of 3D adversarial attacks on vehicle-based models.
Recent advancements have explored alternative methodologies for generating adversarial patterns. For instance, Li et al. introduced diffusion models to create adversarial textures [21], moving away from traditional optimization-based approaches. Similarly, Lyu et al. proposed a novel framework that leverages diffusion models to generate customizable and natural-looking adversarial camouflage patterns for vehicle detectors [22]. By allowing users to specify text prompts, their method produces diverse and more natural textures while maintaining competitive attack performance.
However, despite these advancements, several challenges remain in crafting effective physical adversarial attacks against state-of-the-art object detectors. Many prior works, such as FCA [18], DAS [17], DTA [14], and ACTIVE [15], have relied on UE4. While effective, UE4 does not achieve the level of photorealism offered by its successor, UE5. The advanced features of UE5, such as Lumen for real-time global illumination, Nanite for handling detailed geometry, and an enhanced physically-based rendering pipeline, enable the generation of textures that more accurately replicate real-world lighting, shadows, and material properties. This increased photorealism ensures that adversarial patterns remain effective when transferred from a simulated environment to physical deployment. Additionally, these studies often targeted older versions of object detection models like YOLOv3 or Faster R-CNN, which may not reflect the robustness of current state-of-the-art detectors [1].
RAUCA achieved progress by extracting features from background images to enhance the realism of adversarial patterns. However, their method may not accurately capture the complex interplay of lighting and shadows directly on the vehicle.
Our work addresses these gaps by utilizing UE5 within a fully differentiable rendering pipeline, enabling the generation of more photorealistic images and detailed adversarial patterns. Furthermore, we target YOLOv8, a state-of-the-art object detection model known for its robustness and improved detection capabilities. By introducing an additional neural rendering component—a gray textured truck image—we accurately capture environmental lighting and shadows cast on the vehicle, further improving the accuracy of our neural renderer.

3. Materials and Methods

3.1. Problem Statement

Let $X = \{X_1, X_2, \ldots, X_n\}$ be the dataset generated in UE5, where each sample $X_i$ includes the following elements:
  • Reference Image ($X_{ref}$): A photorealistic image of the truck in the scene, rendered with UE5’s advanced lighting and shading techniques. The texture for this image is randomly sampled from a High-Resolution Texture Dataset defined in Section 4.2.
  • Gray Textured Truck Image ($X_{gray}$): A version of the truck rendered with a neutral gray texture (RGB: 127, 127, 127).
  • Depth Map ($X_d$): A depth map that provides the distance from the camera to each pixel on the truck surface.
  • Binary Mask ($M$): A mask identifying the pixels corresponding to the truck in each image. It is generated using a custom material in UE5 that renders the truck in pure black for accurate segmentation.
  • Camera Parameters ($\theta_c$): Parameters that define the camera’s position and orientation in the scene, used as input for differentiable rendering.
In addition to these components, we define the 3D truck model with its associated mesh and texture map $T$. The adversarial texture $T_{adv}$ is used in place of $T$ to deceive the detection model. The neural renderer $R$ takes as inputs the 3D mesh, the adversarial texture $T_{adv}$, the camera parameters $\theta_c$, the depth map $X_d$, and the gray textured truck image $X_{gray}$, and renders the photorealistic adversarial truck image $X_{enh}$:

$$X_{enh} = R(\text{Mesh}, T_{adv}, \theta_c, X_d, X_{gray}).$$
Next, the rendered truck image $X_{enh}$ is combined with the background $X_{ref}$ using the binary mask $M$:

$$X_{adv} = X_{enh} \cdot M + X_{ref} \cdot (1 - M),$$

where $X_{enh} \cdot M$ places the truck into the scene and $X_{ref} \cdot (1 - M)$ fills in the rest of the background.
It is important to note that our primary goal is to make the truck undetectable. Object detection models, including YOLOv8, output bounding boxes with coordinates $B_{pred} = (b_x, b_y, b_w, b_h)$ and a confidence score $b_{cls}$ for the object class. For many real-world applications, such as surveillance and autonomous systems, the presence of an object (a vehicle or a person) is more critical than its exact location. Therefore, our main objective is to minimize the detection confidence score $b_{cls}$ for all classes. The optimization problem can be formulated as

$$T_{adv}^{*} = \arg\min_{T_{adv}} \mathcal{L}\left(F(X_{enh}; \theta_F)\right),$$

where $F$ is the detector with parameters $\theta_F$ and $\mathcal{L}(\cdot)$ is the loss function designed to reduce the class confidence score $b_{cls}$. This loss function, along with a smoothness regularization term for the adversarial texture $T_{adv}$, is detailed in later sections (see Section 3.2.3 and Section 3.2.4).
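To make the data flow concrete, the following is a minimal PyTorch sketch of the problem statement. The `neural_renderer` and `detector` callables, the sample dictionary keys, and the texture resolution are illustrative placeholders, not the released implementation; only the compositing step and the shape of the objective follow the formulation above, with the detector applied to the composited image as in the overall pipeline.

```python
# A minimal sketch of the problem statement, assuming PyTorch. The
# `neural_renderer` and `detector` callables stand in for R and F.
import torch

def compose(x_enh, x_ref, mask):
    """Blend the rendered truck into the reference scene (Equation (2))."""
    return x_enh * mask + x_ref * (1.0 - mask)

# The adversarial texture is the only variable being optimized.
t_adv = torch.rand(3, 2048, 2048, requires_grad=True)

def attack_objective(sample, neural_renderer, detector, loss_fn):
    # Photorealistic rendering of the truck with the current texture.
    x_enh = neural_renderer(sample["mesh"], t_adv, sample["theta_c"],
                            sample["x_d"], sample["x_gray"])
    # Composite the truck with the background via the binary mask M.
    x_adv = compose(x_enh, sample["x_ref"], sample["mask"])
    # The loss on the detector output backpropagates through the entire
    # pipeline down to t_adv, which is what makes the attack possible.
    return loss_fn(detector(x_adv))
```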

3.2. Truck Adversarial Camouflage Optimization

3.2.1. Neural Renderer: Overview of the Rendering Process

A central component of the TACO framework is the neural renderer, a system designed to generate photorealistic images of a truck adorned with adversarial camouflage patterns. This component ensures that the textures are rendered on the 3D object in a fully differentiable and photorealistic manner.
The rendering process consists of two main stages. First, adversarial textures are applied to the truck’s 3D mesh using a differentiable renderer, referred to as $R_{diff}$. Given the truck mesh, the adversarial texture $T_{adv}$, and the camera parameters $\theta_c$, this renderer outputs a raw rendered image $X_{raw}$:

$$X_{raw} = R_{diff}(\text{Mesh}, T_{adv}, \theta_c).$$
In the second stage, the raw image $X_{raw}$ is passed to the Photorealistic Rendering Network (PRN), an enhancement module that improves the quality of the rendered image by incorporating additional information such as the gray textured truck image $X_{gray}$ and the depth map $X_d$. The enhanced image $X_{enh}$ is computed as

$$X_{enh} = R_{PRN}(X_{raw}, X_{gray}, X_d),$$

where $R_{PRN}$ denotes the PRN.
Finally, the enhanced truck image X e n h is blended with the background using Equation (2). The entire data flow and training process of the neural renderer, including both the differentiable renderer and the PRN, is illustrated in Figure 1.

3.2.2. Photorealistic Rendering Network (PRN)

The PRN is central to transforming raw rendered images into photorealistic outputs. Built on a U-Net architecture [23], it features a contracting path for feature extraction and an expansive path for image reconstruction. Each downsampling step in the contracting path incorporates a Convolutional Block Attention Module (CBAM) [24], leveraging spatial and channel attention mechanisms to enhance feature representation. The architecture of the PRN is illustrated in Figure 2, showing how the contracting and expansive paths interact to produce high-fidelity outputs.
During training, the PRN learns the photorealistic rendering characteristics of Unreal Engine 5 (UE5) from a dataset of reference images $X_{ref}$ with diverse textures. The training is guided by an L1 rendering loss:

$$\mathcal{L}_{render} = \left\| X_{adv} - X_{ref} \right\|_1,$$

which ensures that the PRN’s output maintains fidelity to $X_{raw}$ while incorporating the photorealistic characteristics of the UE5 reference dataset. For more on the dataset and the training procedure, refer to Section 4.2 and Section 4.3.1.
A key innovation of our approach is incorporating a gray textured truck image $X_{gray}$ as an additional input to the PRN. This gray textured image captures vital environmental details, particularly lighting and shadows on the vehicle, that would otherwise be difficult to reproduce. As shown in Table 1, models trained with $X_{gray}$ achieve lower L1 loss and higher Structural Similarity Index Measure (SSIM) [25] than those trained without it. Likewise, Figure 3 illustrates that omitting $X_{gray}$ leads to inaccurate and incomplete shadow rendering, whereas including it yields visually flawless results.

3.2.3. Attack Loss

Following the pre-training of the neural renderer, we now focus on the optimization of the adversarial texture (Figure 4). In our setup, we target the Ultralytics YOLOv8 model, which features an updated detection head. Unlike previous YOLO versions, this updated head is anchor-free and removes the objectness score, which was found to be redundant. Instead, the model directly outputs class confidence scores $b_{cls}$ for each object in the scene. All YOLO versions in our experiments (e.g., YOLOv3u, YOLOv5Xu; see Section 5.1) use this updated detection head from Ultralytics, which significantly increases their performance [26]. Our primary goal is to minimize the $b_{cls}$ values of bounding boxes that overlap with the truck. Additionally, we reduce the Intersection over Union (IoU) between any predicted bounding box and the ground truth bounding box of the vehicle, which we found helps to further reduce $b_{cls}$.
Let $B_{gt}$ denote the ground truth bounding box of the truck and $B_{pred}^{i}$ the $i$-th predicted bounding box. To identify predicted bounding boxes that overlap significantly with the truck, we define the Intersection over Prediction (IoP):

$$\text{IoP}(B_{pred}^{i}, B_{gt}) = \frac{\text{Area}(B_{pred}^{i} \cap B_{gt})}{\text{Area}(B_{pred}^{i})}.$$
We select only those predicted bounding boxes whose IoP exceeds a predefined threshold $\tau_{IoP}$. This filtering ensures that our attack focuses on reducing the detection confidence of bounding boxes representing the truck while avoiding patterns that might be misclassified as another class. The class loss is then defined as

$$\mathcal{L}_{cls} = -\sum_{c=1}^{C} \sum_{i \in \Omega_{IoP}} \log\left(1 - b_{cls}^{i,c}\right),$$
where:
  • $\Omega_{IoP} = \{\, i \mid \text{IoP}(B_{pred}^{i}, B_{gt}) > \tau_{IoP} \,\}$ denotes the set of indices of bounding boxes with an IoP greater than the threshold $\tau_{IoP}$;
  • $b_{cls}^{i,c}$ is the confidence score for class $c$ in bounding box $i$;
  • $C$ is the number of classes (80 in the case of YOLOv8).
We use IoP-based filtering instead of the traditional IoU-based techniques used in previous works [15,19]. Our experiments revealed that IoU filtering ($\text{IoU} = \frac{\text{Area}(B_{pred} \cap B_{gt})}{\text{Area}(B_{pred} \cup B_{gt})}$) often excluded bounding boxes covering smaller central regions of the truck when their IoU with the ground truth box $B_{gt}$ fell below the threshold. This exclusion led to adversarial patterns that caused the detector to misclassify parts of the truck as unrelated objects, such as apples or suitcases. Although the truck itself became undetected, these false identifications of surface patterns compromised the attack’s overall stealth and effectiveness.
To address this, we developed IoP-based filtering, which prioritizes bounding boxes based on the proportion of their area overlapping with the truck, ensuring that all relevant regions are included in the optimization. By capturing these inner regions, IoP-based filtering prevents the generation of patterns that trigger false detections. Figure 5 illustrates this difference; IoU-based filtering results in adversarial patterns that produce false positives across the truck’s surface, while IoP-based filtering eliminates such detections, rendering the truck undetectable.
In addition to the class loss with IoP-based filtering, we incorporate an IoU loss term to further suppress the detector’s ability to localize the truck accurately. The IoU loss is defined as

$$\mathcal{L}_{IoU} = \sum_{i \in \Omega_{IoU}} \text{IoU}(B_{pred}^{i}, B_{gt}),$$

where:
  • $\Omega_{IoU} = \{\, i \mid \text{IoU}(B_{pred}^{i}, B_{gt}) > \tau_{IoU} \,\}$ denotes the set of predicted bounding boxes with relatively large IoU values that we want to minimize.
Finally, the total attack loss is the weighted sum of the class confidence loss and the IoU loss, with a weight $\beta$ balancing the contribution of the IoU loss term:

$$\mathcal{L}_{atk} = \mathcal{L}_{cls} + \beta \mathcal{L}_{IoU}.$$
By minimizing this loss, we effectively reduce the confidence scores for any objects detected over the area of our truck.
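A compact sketch of this attack loss is given below, assuming PyTorch. The box representation (axis-aligned $(x_1, y_1, x_2, y_2)$ corners), the helper names, and the numerical epsilons are our own choices; the default thresholds and the weight follow the configuration reported in Section 4.3.2.

```python
# A sketch of the attack loss with IoP filtering, assuming PyTorch.
# boxes: (N, 4) predicted boxes (x1, y1, x2, y2); scores: (N, C) class
# confidences b_cls in [0, 1]; gt: (4,) ground-truth truck box.
import torch

def box_area(b):
    return (b[..., 2] - b[..., 0]).clamp(min=0) * (b[..., 3] - b[..., 1]).clamp(min=0)

def intersection(boxes, gt):
    lt = torch.maximum(boxes[:, :2], gt[:2])   # top-left corners of overlaps
    rb = torch.minimum(boxes[:, 2:], gt[2:])   # bottom-right corners
    wh = (rb - lt).clamp(min=0)
    return wh[:, 0] * wh[:, 1]

def attack_loss(boxes, scores, gt, tau_iop=0.6, tau_iou=0.45, beta=0.01):
    inter = intersection(boxes, gt)
    iop = inter / box_area(boxes).clamp(min=1e-9)
    iou = inter / (box_area(boxes) + box_area(gt) - inter).clamp(min=1e-9)
    # Class loss over boxes that lie mostly on the truck (IoP filtering).
    sel = scores[iop > tau_iop].clamp(max=1.0 - 1e-6)
    l_cls = -torch.log(1.0 - sel).sum()
    # IoU loss over boxes that still localize the truck well.
    l_iou = iou[iou > tau_iou].sum()
    return l_cls + beta * l_iou
```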

3.2.4. Convolutional Smooth Loss

An additional constraint is to generate physically producible adversarial patterns by ensuring a smooth texture. The traditional smoothness loss function calculates the Total Variation (TV) between adjacent pixels, specifically the immediate right and bottom neighbors [27]:

$$\mathcal{L}_{TV} = \sum_{i,j} \left( (\delta_{i,j} - \delta_{i+1,j})^2 + (\delta_{i,j} - \delta_{i,j+1})^2 \right).$$
To improve upon this, we introduce the Convolutional Smooth Loss, a generalization of the traditional TV loss. Rather than only considering the immediate right and bottom neighbors, this method evaluates the differences between the central pixel and all pixels within a $k \times k$ kernel, capturing the local variation over a larger neighborhood. Mathematically, let $T_{i,j}$ represent the pixel value at position $(i, j)$. We calculate the local variation $D_{i,j}$ as the sum of squared differences between $T_{i,j}$ and all pixels within the $k \times k$ neighborhood:

$$D_{i,j} = \sum_{n=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} \; \sum_{m=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} \left( T_{i,j} - T_{i+n,j+m} \right)^2.$$
The overall smoothness loss is then computed as

$$\mathcal{L}_{smooth}(T) = \frac{1}{W \cdot H} \sum_{i,j} D_{i,j},$$
where $W$ and $H$ are the width and height of the texture image, respectively. Note that $D_{i,j}$ can be calculated for every pixel of the image with two convolutions. Let $*$ denote the convolution operation and $K$ a $k \times k$ kernel with weights $K_{n,m} = \frac{1}{k^2}$ (assuming uniform weighting for simplicity). Using this kernel, we can compute $D$ for the entire image in a compact and efficient manner as

$$D = k^2 \left( T^2 - 2\, T \cdot (T * K) + T^2 * K \right),$$

where $T^2$ refers to the element-wise square of the pixel values, $T * K$ is the convolution of the texture $T$ with the kernel $K$, and $T^2 * K$ is $T^2$ convolved with $K$. This formulation not only captures smoother transitions over a larger neighborhood but also allows for fast computation with convolutions. The ability to choose $k$ arbitrarily gives us further control over the degree of smoothness.
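Under these definitions, a minimal PyTorch implementation reduces to two depthwise convolutions. This is a sketch under our own simplifications: zero padding at the borders (so edge pixels deviate slightly from the exact double sum) and summation over color channels.

```python
# A sketch of the Convolutional Smooth Loss, assuming PyTorch; k must be odd.
import torch
import torch.nn.functional as F

def conv_smooth_loss(t, k=3):
    """t: texture of shape (C, H, W). Returns an estimate of L_smooth(T)."""
    c, h, w = t.shape
    # Uniform kernel K with weights 1/k^2, applied per channel (depthwise).
    kernel = torch.full((c, 1, k, k), 1.0 / k**2, dtype=t.dtype, device=t.device)
    pad = k // 2  # zero padding: border terms differ slightly from the exact sum
    t_k = F.conv2d(t.unsqueeze(0), kernel, padding=pad, groups=c)        # T * K
    t2_k = F.conv2d((t * t).unsqueeze(0), kernel, padding=pad, groups=c)  # T^2 * K
    d = k**2 * (t * t - 2.0 * t * t_k.squeeze(0) + t2_k.squeeze(0))
    return d.sum() / (w * h)  # channels summed, spatial dimensions averaged
```

A brute-force evaluation of $D_{i,j}$ over a $k \times k$ window gives the same interior values as this convolutional form and makes a useful unit test.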
Finally, the total loss function used in our optimization process combines the adversarial attack loss $\mathcal{L}_{atk}$ with the smoothness loss $\mathcal{L}_{smooth}$, weighted by a factor $\gamma$:

$$\mathcal{L}_{total} = \mathcal{L}_{atk} + \gamma \mathcal{L}_{smooth}.$$

3.2.5. Projected Gradient Descent with Adam

When optimizing the adversarial texture $T_{adv}$, each pixel must remain within the valid range $[0, 1]$. Formally, this can be viewed as the following constrained optimization problem:

$$\min_{T_{adv} \in [0,1]^{3 \times H \times W}} \mathcal{L}(T_{adv}),$$

where $\mathcal{L}$ is our total loss function (combining both adversarial and smoothness terms). A standard approach to such a box-constrained problem is Projected Gradient Descent (PGD), which clips parameter values back into the feasible range after each update. PGD-based methods have become standard in adversarial machine learning because they naturally ensure the generated perturbations satisfy domain-specific constraints (e.g., $l_p$-norm bounds or pixel-value ranges [28]).

Standard Clipping and Its Limitations

A common strategy enforces the $[0, 1]$ constraint by simply clipping the texture values after each update step. While this guarantees feasibility, it can “zero out” critical gradient information if many pixels are pushed outside the valid range, slowing convergence and yielding suboptimal solutions.

PGD with Adam

To overcome these issues, we combine PGD with the Adam [29] optimizer. Adam adaptively scales the learning rate per parameter and typically converges faster in high-dimensional tasks such as texture optimization. In this context, we write $\text{Adam}_{\mathcal{L}}$ for the gradient update calculated by the Adam optimizer. A naive combination of Adam with PGD would follow this sequence:
  • Compute raw update:
    $$\Delta T = \text{Adam}_{\mathcal{L}}(T_{adv}^{t}),$$
    where $\Delta T$ is the parameter update vector derived from Adam.
  • Update step:
    $$T_{adv}^{t+1} \leftarrow T_{adv}^{t} - \eta \, \Delta T.$$
  • Texture projection step:
    $$T_{adv}^{t+1} \leftarrow \min\left(\max\left(T_{adv}^{t+1},\, 0\right),\, 1\right).$$
While this approach guarantees feasibility, it disrupts gradient information whenever pixel values are pushed outside the valid range, resulting in less efficient optimization.

Gradient Projection

Instead of clipping the texture values after the update, we propose modifying the gradients before applying them. This avoids the loss of gradient information while maintaining feasibility:
  • Compute raw update:
    $$\Delta T = \text{Adam}_{\mathcal{L}}(T_{adv}^{t}).$$
  • Gradient projection step:
    $$\operatorname{proj}_{\nabla \mathcal{L}}(T_{adv}^{t}) = \min\left(\max\left(\Delta T,\, T_{adv}^{t} - 1\right),\, T_{adv}^{t}\right).$$
  • Update step:
    $$T_{adv}^{t+1} = T_{adv}^{t} - \eta \, \operatorname{proj}_{\nabla \mathcal{L}}(T_{adv}^{t}).$$
This gradient projection approach preserves gradient flow, leading to more stable and efficient optimization compared to naive clipping. The overall adversarial texture optimization process is illustrated in Figure 4, showing how the Neural Renderer, Object Detector, and Gradient Projection blocks work together to iteratively update the texture.
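Since torch.optim.Adam does not expose its raw update $\Delta T$, the sketch below implements the Adam moment updates directly; the clamp bounds above guarantee feasibility whenever $\eta \le 1$, which comfortably holds for the learning rate of 0.006 used in Section 4.3.2. The function and state names are our own.

```python
# A sketch of one projected Adam step on the texture, assuming PyTorch.
import torch

def projected_adam_step(t_adv, grad, state, lr=0.006,
                        betas=(0.9, 0.999), eps=1e-8):
    """Returns the updated texture, guaranteed to stay in [0, 1] for lr <= 1."""
    state["step"] += 1
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * grad
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * grad * grad
    m_hat = state["m"] / (1 - betas[0] ** state["step"])
    v_hat = state["v"] / (1 - betas[1] ** state["step"])
    delta = m_hat / (v_hat.sqrt() + eps)          # raw Adam update (Delta T)
    # Gradient projection: clamp the update to [T - 1, T] so that
    # T - lr * delta stays inside [0, 1] without clipping the texture itself.
    delta = torch.clamp(delta, min=t_adv - 1.0, max=t_adv)
    return t_adv - lr * delta

# Usage: state = {"step": 0, "m": torch.zeros_like(t), "v": torch.zeros_like(t)}
```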

4. Experimental Setup

4.1. Truck Model

In our framework, we utilized an M923 truck model consisting of 24,784 faces (Figure 6). To improve computational efficiency during adversarial texture generation, we divided the truck into two distinct parts based on which components are typically painted in real-life scenarios:
  • Body Parts: This segment includes the main body of the truck, such as the carrosserie and tarp, which are usually painted for camouflage purposes. While these body parts comprise only 1282 triangular faces (5% of the total number of faces), they cover approximately 80% of the truck’s visible surface area. This is because these regions consist of larger, less intricate surfaces than auxiliary parts such as the wheels.
  • Auxiliary Parts: The remaining components of the truck—such as the wheels, bumper, exhaust stacks, and other parts that are not typically painted or are impractical to paint—fall under this category. These parts contain the remaining 23,502 triangular faces (95% of the faces), but they account for only 20% of the truck’s visible surface area due to their smaller individual sizes and more intricate geometry.
By separating the truck into these two parts, we ensure that only the body part (approximately 5% of the full truck model’s faces) is processed through the neural renderer. This not only reduces the computational load but also significantly lowers the memory requirements of the differentiable renderer, making the overall pipeline more efficient.
It is important to note that the full truck model is rendered within the UE5 environment. Because of this, the auxiliary parts of the vehicle are still visible in the final image, as they are blended with the background using the binary mask M applied to the reference image X r e f . This allows us to focus the adversarial attack exclusively on the parts of the truck that could realistically be modified.

4.2. Dataset

To develop and evaluate our adversarial camouflage approach, we constructed a custom dataset in Unreal Engine 5 (UE5) featuring an M923 3D truck model. This dataset plays a dual role: it is used both for training the PRN and for guiding the adversarial texture optimization. We organized the data generation process into two main parts:
  • Core Truck Dataset: A large collection of rendered truck images under diverse positions, camera parameters, and textures.
  • High-Resolution Texture Dataset: A complementary set of 4000 high-resolution texture images.

4.2.1. Core Truck Dataset

We first identified 25 distinct truck locations within the UE5 scene. At each location, we placed the M923 truck and generated 2000 rendered images by randomly sampling the following:
  • Camera viewpoints and distances: For each image, the camera was placed between 5 m and 35 m from the truck, randomly oriented with elevation angles between 5° and 90° and azimuth angles from 0° to 360°.
  • Truck textures: A random selection from our High-Resolution Texture Dataset (see below).
By combining 25 locations with 2000 images each, we obtained a total of 50,000 images. We refer to this entire set as the Core Truck Dataset. Along with each rendered image $X_{ref}$, we also stored the gray textured image $X_{gray}$, the depth map $X_d$, the binary mask $M$, the camera parameters $\theta_c$, and the chosen texture $T$. We arrived at this dataset size incrementally. Smaller datasets tested during preliminary experiments (e.g., 200 images per position) often led to poor generalization of the neural renderer, with artifacts appearing on unseen textures or views. Incrementally increasing the dataset size, from a few thousand images to tens of thousands, significantly improved the PRN’s ability to render previously unseen views and texture patterns accurately.

4.2.2. High-Resolution Texture Dataset

To produce visually diverse truck appearances, we built a texture library of 4000 images at 2048 × 2048 resolution. The library includes the following:
  • Describable Textures Dataset [30]: A total of 1500 images were carefully selected from this dataset to provide a variety of texture patterns.
  • Van Gogh Paintings [31]: A total of 300 texture images were sourced from Van Gogh’s paintings, chosen for their distinct color patterns.
  • Random Uniform Color Images: A total of 200 images consisting of uniform color values were generated.
  • Random Noise Images: A total of 2000 noise textures were randomly generated to simulate non-structured patterns visually similar to adversarial patterns.
This broad collection helps our neural renderer generalize effectively to different surface appearances. To provide a visual overview of the High-Resolution Texture Dataset, Figure 7 presents representative samples from each of the four texture categories. The first row displays the original full-resolution textures. The second row demonstrates the application of a masking process to focus the texture exclusively on the truck’s body parts, blending it with a base texture for auxiliary components like wheels. Finally, the third row illustrates how these textures appear when rendered on the truck in a specific scene, as part of the Core Truck Dataset.

4.3. Implementation Details

4.3.1. PRN Training

For training the PRN, we used the entire 50,000-image Core Truck Dataset. To ensure that the PRN can handle unseen textures, we split the 4000-texture library into 3250 textures for training and 750 textures for testing. Consequently, images in the Core Truck Dataset whose textures came from the 3250-texture subset formed the PRN training set, and images whose textures came from the 750-texture subset served as the PRN test set, yielding 40,625 training images and 9375 testing images. We used the Adam optimizer with a learning rate of 0.0001 and trained the PRN for 100 epochs. The final model achieved an L1 rendering loss of 0.024 and an SSIM of 0.9901 on the validation set.

4.3.2. Adversarial Texture Generation

While the PRN training used the entire Core Truck Dataset, our adversarial texture optimization only requires a subset. Specifically, in this case, we performed the following:
  • We used 1000 images per truck location (instead of the full 2000) to reduce training time. This subset thus contains 25,000 images in total.
  • We performed a location-based split into 18 locations for training (18,000 images) and 7 locations for testing (7000 images).
This ensures that during the adversarial optimization phase, the model sees a wide range of camera perspectives and scenes in the training split, while the evaluation is conducted on entirely different truck locations, enforcing a strict generalization test. For every training iteration, the truck’s original rendered texture is dynamically replaced by the adversarial texture under optimization, $T_{adv}$. The neural renderer thus re-synthesizes the scene with the latest version of the adversarial texture, and we compute gradients (via our loss functions) to update the texture’s pixel values. We used the Adam optimizer with gradient projection, as described in Section 3.2.5, with a learning rate of 0.006 over 6 epochs on the 18,000 training samples.
We set the IoU loss term weight to $\beta = 0.01$ and the smoothness regularization term weight to $\gamma = 0.1$ with a convolution kernel size of $k = 3$ for this configuration. The IoP threshold was set to $\tau_{IoP} = 0.6$ and the IoU threshold to $\tau_{IoU} = 0.45$. These settings were used as the default configuration in all subsequent experiments.
In evaluations where we tested different setups or other hyperparameters, the remaining parameters (such as the learning rate, the initialization, and the loss weights) were kept consistent with the default configuration unless otherwise noted.
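Putting the pieces together, one pass of the adversarial optimization under this default configuration might look like the following sketch. The `training_samples` iterable, `render_and_compose`, and `detector_forward` are hypothetical stand-ins for the components sketched in earlier sections (as are `attack_loss`, `conv_smooth_loss`, and `projected_adam_step`), not the authors’ released code; the zeros initialization reflects the finding reported later in Section 5.4.

```python
# A sketch of the adversarial optimization loop with the default
# configuration; all component functions are illustrative stand-ins.
import torch

cfg = {"lr": 0.006, "epochs": 6, "beta": 0.01, "gamma": 0.1,
       "k": 3, "tau_iop": 0.6, "tau_iou": 0.45}

t_adv = torch.zeros(3, 2048, 2048, requires_grad=True)  # zeros init (cf. Section 5.4)
state = {"step": 0, "m": torch.zeros_like(t_adv), "v": torch.zeros_like(t_adv)}

for epoch in range(cfg["epochs"]):
    for sample in training_samples:            # 18,000 images from 18 locations
        x_adv = render_and_compose(sample, t_adv)      # neural renderer + blending
        boxes, scores = detector_forward(x_adv)        # YOLOv8 predictions
        loss = attack_loss(boxes, scores, sample["gt"], cfg["tau_iop"],
                           cfg["tau_iou"], cfg["beta"]) \
               + cfg["gamma"] * conv_smooth_loss(t_adv, cfg["k"])
        grad, = torch.autograd.grad(loss, t_adv)
        with torch.no_grad():
            t_adv.copy_(projected_adam_step(t_adv, grad, state, lr=cfg["lr"]))
```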

5. Results

In this section, we evaluate the effectiveness of the proposed TACO framework. We conduct a series of experiments to assess the attack performance against various object detection models, analyze the impact of different loss components, examine the influence of texture initialization strategies, and visualize the attention shifts in the target model using Class Activation Maps (CAMs).

5.1. Evaluation Metrics and Models

All evaluations are conducted on the test set of seven unseen truck positions (see Section 4.3.2). Since trucks can sometimes be mistaken for cars by object detection models due to their similar appearance, we treat cars and trucks as a single combined class when calculating our metrics. We adopt two primary metrics to assess the performance of our adversarial textures:
  • Average Precision at IoU threshold 0.5 (AP@0.5): This metric evaluates the precision of the object detector when the Intersection over Union (IoU) between the predicted bounding box and the ground truth exceeds 50%.
  • Attack Detection Rate (ADR): The proportion of images in which the object detector successfully identifies the truck.
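As a concrete reading of these metrics, the following sketch tallies the detection rate per image. The COCO class indices (2 for car, 7 for truck) and the 0.5 IoU match criterion are our assumptions, and `iou_fn` is a standard box-IoU helper such as the one sketched in Section 3.2.3.

```python
# A sketch of the detection-rate bookkeeping; cars and trucks are merged.
VEHICLE_CLASSES = {2, 7}  # assumed COCO indices: 2 = car, 7 = truck

def detection_rate(results, iou_fn, iou_thresh=0.5):
    """results: list of (detections, gt_box); each detection: (box, cls, score)."""
    hits = sum(
        1 for dets, gt in results
        if any(cls in VEHICLE_CLASSES and iou_fn(box, gt) >= iou_thresh
               for box, cls, score in dets)
    )
    return hits / len(results)
```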
We evaluate our adversarial textures against six object detection models:
  • YOLOv8X [8]: Our target model for the adversarial attack.
  • YOLOv3u [9]: A previous generation of the YOLO family with an upgraded detection head.
  • YOLOv5Xu: An intermediate version of the YOLO models with upgraded detection heads.
  • Faster R-CNN v2 (FRCNN) [32]: An improved version of the two-stage object detection model.
  • Fully Convolutional One-Stage Object Detection (FCOS) [33]: An anchor-free object detection framework.
  • Detection Transformer (DETR) [34]: A transformer-based object detection model.
All models are pre-trained on the COCO dataset [35] and are treated as black-box models except for YOLOv8, our white-box target.

5.2. Performance Comparison of Different Textures

We first compare the effectiveness of our TACO-generated adversarial texture against four baseline textures:
  • Base: The original single-color texture of the truck without any adversarial modifications.
  • Naive: A simple camouflage texture common on military trucks.
  • Random: A texture initialized with random pixel values.
  • DTA: The differentiable transformation network approach, an existing adversarial camouflage method reimplemented for comparison [14].
Figure 8 shows the applied textures for the Base, Naive, Random, DTA, and TACO methods used in this evaluation. In the case of DTA, we include two variants of the texture: one directly from the original paper (DTA (original)) and another optimized specifically for our truck model and environment (DTA (optimized)). For fairness, only the re-optimized DTA texture is used in our experiments. Additionally, both DTA textures are rendered photorealistically using our own PRN. Table 2 and Table 3 present the AP@0.5 and ADR results, respectively, across all evaluated models.
The texture generated by TACO significantly reduces both the AP@0.5 and ADR across all evaluated models compared to the baseline textures. Specifically, for YOLOv8, the AP@0.5 drops from 0.7295 (Base) to 0.0099 (TACO), and the ADR decreases from 0.7453 to 0.0097. This indicates that the adversarial texture effectively deceives the target model, rendering the truck nearly undetectable in unseen scenes. While TACO is optimized specifically against YOLOv8, it also exhibits transferability to other models, although with varying degrees of effectiveness. This suggests that our adversarial texture exploits common vulnerabilities across different object detection architectures.

5.3. Impact of Different Loss Functions

We investigate the contribution of each component in our total loss function by evaluating the adversarial textures generated using different combinations of the proposed loss terms. Table 4 and Table 5 present the AP@0.5 and ADR results, respectively, for each loss configuration, including $\mathcal{L}_{cls}$, $\mathcal{L}_{cls} + \mathcal{L}_{IoU}$, $\mathcal{L}_{cls} + \mathcal{L}_{sm}$, and the full $\mathcal{L}_{total}$, across all models.
Optimizing with only the classification loss ($\mathcal{L}_{cls}$) reduces the detection performance, but not as significantly as when additional loss terms are included. Incorporating the IoU loss ($\mathcal{L}_{IoU}$) further decreases both AP@0.5 and ADR, particularly for models like FCOS and DETR, indicating that minimizing bounding box overlap enhances the attack’s effectiveness.
Adding the smoothness loss ($\mathcal{L}_{sm}$) slightly reduces the attack performance compared to using $\mathcal{L}_{cls}$ alone. However, it contributes to generating more even textures, which is important for real-world applicability.
The full loss function ($\mathcal{L}_{total}$) achieves the best overall performance on YOLOv8 and maintains competitive results on other models. This demonstrates that combining all loss components effectively contributes to attack strength and texture plausibility.

5.4. Texture Initialization Study

We examine the impact of different texture initialization strategies on the optimization outcome. The initialization methods tested are as follows:
  • Zeros: Initializing the texture with all zeros (black texture).
  • Ones: Initializing the texture with all ones (white texture).
  • Random: Initializing with random values.
  • Base: Starting from the truck’s original texture.
Figure 9 presents a bar chart comparing the AP@0.5 and ADR for each initialization method across all models.
Initializing the texture from zeros yields the best adversarial performance in terms of both AP@0.5 and ADR across most models, except for FCOS, where initializing from ones performs slightly better in terms of ADR. Initializing from the base texture or random values results in less effective attacks.

5.5. Smoothness Loss Coefficient Analysis

We explore how varying the smoothness loss coefficient ($\gamma$) affects the adversarial texture’s effectiveness and visual quality. Figure 10 illustrates the relationship between different $\gamma$ values and the corresponding AP@0.5 and ADR on the object detectors.
When $\gamma > 1$, the smoothness loss dominates the optimization, resulting in overly smooth textures that lack the perturbations necessary to deceive the detector, thereby reducing the attack’s effectiveness. Conversely, when $\gamma$ is too small (e.g., $\gamma < 0.01$), the texture may become overly noisy, potentially making it visually unrealistic.
In addition to the performance metrics, Figure 11 provides a visual representation of how the generated textures evolve with different $\gamma$ values and how these textures appear when applied to the truck. The second row of the figure shows the truck with the textures applied, along with the YOLOv8 detection results. As $\gamma$ increases, the textures become smoother. When $\gamma$ reaches 100, the texture becomes overly smooth and resembles a plain black color, which diminishes its adversarial effect.
We observe a sweet spot in the range $0.01 < \gamma < 1$, where the attack achieves optimal performance while maintaining a balance between texture smoothness and adversarial perturbations. A $\gamma$ value around 0.1 provides a good trade-off, leading to effective attacks across all models.
Interestingly, the adversarial perturbations not only cause the detector to miss the truck but also lead to hallucinations of other objects outside the truck. As shown in Figure 11, YOLOv8 detects phantom objects in the surrounding areas where there are none. This phenomenon occurs because object detection is not a task that is separable from its surrounding context. As demonstrated in [7], even perturbing the background of an object can significantly affect detection performance. By altering the texture of the truck, the attack can indirectly affect how the model interprets other parts of the scene.

5.6. Class Activation Mapping Analysis

To understand how the adversarial texture affects the target model’s attention mechanisms, we adapt Ablation-CAM [36] to visualize the regions of interest in YOLOv8. Figure 12 displays the CAM overlays on the truck images with and without the adversarial texture.
In the original image without the adversarial texture, the CAM highlights the truck, indicating that YOLOv8 correctly focuses on the vehicle for detection. After applying the TACO adversarial texture, the CAM shifts away from the truck, and the model’s attention is dispersed to surrounding areas. This suggests that the adversarial texture successfully misleads the model’s attention mechanisms, contributing to the failure in detecting the truck.

6. Conclusions

In this paper, we introduced TACO (Truck Adversarial Camouflage Optimization), a novel framework for generating adversarial camouflage patterns on 3D vehicle models to deceive state-of-the-art object detection systems. By using UE5 for photorealistic rendering and a novel neural renderer component, TACO optimizes textures that are both visually smooth and highly effective in deceiving detectors. We introduced the Convolutional Smooth Loss, which ensures that the generated patterns maintain a realistic appearance. Experimental results showed that our adversarial textures significantly reduced the detection performance of YOLOv8, achieving a near-zero AP@0.5 and ADR on unseen test data. The adversarial patterns also exhibited transferability to other object detection models, including other YOLO versions, Faster R-CNN, and DETR. We also showed that balancing texture smoothness with adversarial perturbations is crucial for optimal performance. By targeting YOLOv8, we advanced the field beyond previous works that focused on older detection models, demonstrating the viability of adversarial attacks against more robust and modern architectures.
Despite these promising results, our approach has areas for further exploration. First, we tested it on a single truck model, leaving room to evaluate its effectiveness across diverse vehicle models. Such an extension would provide a more comprehensive assessment. Second, while TACO demonstrates strong transferability, the training pipeline and textures are primarily optimized for YOLO-based architectures. Investigating domain adaptation or multi-objective optimization methods could expand its applicability to a broader range of detectors. Additionally, our dataset, though leveraging UE5 for photorealistic rendering, does not yet account for diverse weather conditions or nighttime scenarios. Including these factors could enhance the robustness of the adversarial patterns. Lastly, as detection models are retrained or updated, a pattern optimized for one version may lose effectiveness over time. Exploring continual or online adversarial training strategies could sustain attack performance against evolving systems. In conclusion, TACO establishes a strong foundation for adversarial camouflage on 3D vehicle models and offers a promising strategy for bypassing state-of-the-art object detectors. Addressing these areas in future work could further enhance its utility and ensure broader effectiveness in real-world applications.

Author Contributions

Conceptualization, A.D., T.V.M. and V.R.; methodology, A.D. and T.V.M.; software, A.D. and T.V.M.; resources, V.R.; data curation, A.D. and T.V.M.; writing—original draft preparation, A.D.; writing—review and editing, T.V.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministry of Culture and Innovation of Hungary from the National Research, Development and Innovation Fund, financed under the “Nemzeti Laboratóriumok pályázati program” funding scheme, grant number 2022-2.1.1-NL-2022-00012.

Data Availability Statement

The datasets presented in this study are part of an ongoing research project and are therefore not readily available. For access requests, please contact Adonisz Dimitriu at dimitriu.adonisz@techtra.hu.

Acknowledgments

We would like to acknowledge Eszter Fülöp for her support in developing the figures and offering valuable design advice. Her expertise in visual representation was key to the overall presentation of this work.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A comprehensive review of yolo architectures in computer vision: From yolov1 to yolov8 and yolo-nas. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  2. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and Harnessing Adversarial Examples. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015; Conference Track Proceedings; Bengio, Y., LeCun, Y., Eds.; ICLR: Appleton, WI, USA, 2015. [Google Scholar]
  3. Hendrik Metzen, J.; Chaithanya Kumar, M.; Brox, T.; Fischer, V. Universal adversarial perturbations against semantic image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2755–2764. [Google Scholar]
  4. Lin, Y.C.; Hong, Z.W.; Liao, Y.H.; Shih, M.L.; Liu, M.Y.; Sun, M. Tactics of adversarial attack on deep reinforcement learning agents. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; pp. 3756–3762. [Google Scholar]
  5. Shayegani, E.; Mamun, M.A.A.; Fu, Y.; Zaree, P.; Dong, Y.; Abu-Ghazaleh, N. Survey of vulnerabilities in large language models revealed by adversarial attacks. arXiv 2023, arXiv:2310.10844. [Google Scholar]
  6. Zhong, Y.; Liu, X.; Zhai, D.; Jiang, J.; Ji, X. Shadows can be dangerous: Stealthy and effective physical-world adversarial attack by natural phenomenon. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 15345–15354. [Google Scholar]
  7. Lian, J.; Mei, S.; Wang, X.; Wang, Y.; Wang, L.; Lu, Y.; Ma, M.; Chau, L.P. Attack Anything: Blind DNNs via Universal Background Adversarial Attack. arXiv 2024, arXiv:2409.00029. [Google Scholar]
  8. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. Available online: https://github.com/ultralytics/ultralytics (accessed on 12 October 2024).
  9. Farhadi, A.; Redmon, J. Yolov3: An incremental improvement. In Computer Vision and Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2018; Volume 1804, pp. 1–6. [Google Scholar]
  10. Liu, X.; Yang, H.; Liu, Z.; Song, L.; Li, H.; Chen, Y. Dpatch: An adversarial patch attack on object detectors. arXiv 2018, arXiv:1806.02299. [Google Scholar]
  11. Thys, S.; Van Ranst, W.; Goedemé, T. Fooling automated surveillance cameras: Adversarial patches to attack person detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
  12. Hoory, S.; Shapira, T.; Shabtai, A.; Elovici, Y. Dynamic adversarial patch for evading object detection models. arXiv 2020, arXiv:2010.13070. [Google Scholar]
  13. Zhang, Y.; Foroosh, P.H.; Gong, B. Camou: Learning a vehicle camouflage for physical adversarial attack on object detections in the wild. In Proceedings of the ICLR, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  14. Suryanto, N.; Kim, Y.; Kang, H.; Larasati, H.T.; Yun, Y.; Le, T.T.H.; Yang, H.; Oh, S.Y.; Kim, H. Dta: Physical camouflage attacks using differentiable transformation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 15305–15314. [Google Scholar]
  15. Suryanto, N.; Kim, Y.; Larasati, H.T.; Kang, H.; Le, T.T.H.; Hong, Y.; Yang, H.; Oh, S.Y.; Kim, H. Active: Towards highly transferable 3d physical camouflage for universal and robust vehicle evasion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4305–4314. [Google Scholar]
  16. Kato, H.; Ushiku, Y.; Harada, T. Neural 3d mesh renderer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3907–3916. [Google Scholar]
  17. Wang, J.; Liu, A.; Yin, Z.; Liu, S.; Tang, S.; Liu, X. Dual attention suppression attack: Generate adversarial camouflage in physical world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8565–8574. [Google Scholar]
18. Wang, D.; Jiang, T.; Sun, J.; Zhou, W.; Gong, Z.; Zhang, X.; Yao, W.; Chen, X. FCA: Learning a 3D Full-Coverage Vehicle Camouflage for Multi-View Physical Adversarial Attack. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 2414–2422.
19. Zhou, J.; Lyu, L.; He, D.; Li, Y. RAUCA: A Novel Physical Adversarial Attack on Vehicle Detectors via Robust and Accurate Camouflage Generation. In Proceedings of the Forty-First International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024.
20. Duan, Y.; Chen, J.; Zhou, X.; Zou, J.; He, Z.; Zhang, J.; Zhang, W.; Pan, Z. Learning Coated Adversarial Camouflages for Object Detectors. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22), Vienna, Austria, 23–29 July 2022; pp. 891–897.
21. Li, Y.; Tan, W.; Zhao, C.; Zhou, S.; Liang, X.; Pan, Q. Flexible Physical Camouflage Generation Based on a Differential Approach. arXiv 2024, arXiv:2402.13575.
22. Lyu, L.; Zhou, J.; He, D.; Li, Y. CNCA: Toward Customizable and Natural Generation of Adversarial Camouflage for Vehicle Detectors. arXiv 2024, arXiv:2409.17963.
23. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241.
24. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
25. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
26. Ultralytics. YOLOv3. Available online: https://docs.ultralytics.com/models/yolov3/ (accessed on 25 October 2024).
27. Rudin, L.I.; Osher, S.; Fatemi, E. Nonlinear Total Variation Based Noise Removal Algorithms. Phys. D Nonlinear Phenom. 1992, 60, 259–268.
28. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv 2019, arXiv:1706.06083.
29. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980.
30. Cimpoi, M.; Maji, S.; Kokkinos, I.; Mohamed, S.; Vedaldi, A. Describing Textures in the Wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014.
31. Kim, A. Van Gogh Paintings Dataset. Mendeley Data, 2022. Available online: https://data.mendeley.com/datasets/3sjjtjfhx7/2 (accessed on 12 October 2024).
32. Li, Y.; Xie, S.; Chen, X.; Dollar, P.; He, K.; Girshick, R. Benchmarking Detection Transfer Learning with Vision Transformers. arXiv 2021, arXiv:2111.11429.
33. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
34. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 213–229.
35. Lin, T.; Maire, M.; Belongie, S.J.; Bourdev, L.D.; Girshick, R.B.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. arXiv 2014, arXiv:1405.0312.
36. Desai, S.; Ramaswamy, H.G. Ablation-CAM: Visual Explanations for Deep Convolutional Network via Gradient-Free Localization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Seattle, WA, USA, 13–19 June 2020; pp. 983–991.
Figure 1. Data flow and training process of our neural renderer, which consists of two components: a differentiable renderer and the Photorealistic Rendering Network (PRN). The error images (absolute difference and mean squared error, magnified) are marked with an asterisk (*) to indicate that they are shown for illustration purposes only.
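To make the data flow in Figure 1 concrete, the following PyTorch sketch outlines one training step of the neural renderer: the differentiable renderer produces a coarse image, the PRN refines it, and an L1 loss against the UE5 ground truth drives the update. The objects `diff_renderer`, `prn`, and `optimizer` are placeholders, not the paper's exact implementation.

```python
# Minimal sketch of one PRN training step (object names and shapes are
# assumptions, not the authors' exact pipeline).
import torch.nn.functional as F

def prn_training_step(diff_renderer, prn, optimizer, texture, scene_params, ue5_frame):
    coarse = diff_renderer(texture, scene_params)   # (B, 3, H, W) coarse render
    refined = prn(coarse)                           # PRN adds photorealistic lighting
    loss = F.l1_loss(refined, ue5_frame)            # match the UE5 ground-truth frame
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```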
Figure 2. Architecture of the Photorealistic Rendering Network (PRN) based on a U-Net. The contracting path extracts features, while the expansive path reconstructs the photorealistic image. CBAM modules are used in the contracting path for attention [24].
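As a reference for the attention blocks named in Figure 2, a compact PyTorch implementation of the standard CBAM module [24] (channel attention followed by spatial attention) is sketched below. The reduction ratio and kernel size are the defaults from the original paper, not necessarily those used in the PRN.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module [24]: channel then spatial attention."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(               # shared MLP for channel attention
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention: avg- and max-pooled descriptors share one MLP.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: conv over channel-wise mean/max maps.
        attn = torch.cat([x.mean(dim=1, keepdim=True),
                          x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(attn))
```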
Figure 3. Comparison of shadow rendering quality with and without the gray-textured truck input. The figure shows, from left to right: (1) the applied texture, (2) the ground-truth rendering from UE5, (3) the output of the neural renderer without $X_{gray}$, and (4) the output of the neural renderer with $X_{gray}$. Shadows cast on the truck are poorly rendered without $X_{gray}$ but are captured accurately when it is included.
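The role of the gray-textured input can be illustrated with a one-line sketch: concatenating $X_{gray}$ with the coarse render along the channel axis gives the PRN a texture-independent view of the scene's shading. Whether the inputs are fused exactly this way is an assumption.

```python
import torch

def prn_input(coarse_render: torch.Tensor, x_gray: torch.Tensor) -> torch.Tensor:
    # Stack the coarse render and the gray-textured render channel-wise,
    # e.g. (B, 3, H, W) + (B, 3, H, W) -> (B, 6, H, W).
    return torch.cat([coarse_render, x_gray], dim=1)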
Figure 4. Texture optimization framework.
Figure 5. Comparison of IoU-based and IoP-based bounding box filtering for the class loss. Top row: IoU-based filtering results in false-positive detections on the truck surface. Bottom row: IoP-based filtering suppresses these false positives. Each column shows a different viewpoint of the same optimized texture.
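A minimal sketch of the IoP criterion behind Figure 5, assuming boxes in (x1, y1, x2, y2) format: dividing the intersection by the predicted box's own area means that a small detection lying entirely on the truck scores IoP ≈ 1 even though its IoU with the full truck box is low, so it is still selected and penalized by the class loss. The 0.5 threshold is illustrative.

```python
import torch

def iop(pred_boxes: torch.Tensor, target_box: torch.Tensor) -> torch.Tensor:
    """Intersection over Prediction: intersection area divided by each
    predicted box's own area. pred_boxes: (N, 4), target_box: (4,)."""
    x1 = torch.maximum(pred_boxes[:, 0], target_box[0])
    y1 = torch.maximum(pred_boxes[:, 1], target_box[1])
    x2 = torch.minimum(pred_boxes[:, 2], target_box[2])
    y2 = torch.minimum(pred_boxes[:, 3], target_box[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    pred_area = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    return inter / pred_area.clamp(min=1e-9)

def filter_boxes(pred_boxes, target_box, thresh=0.5):
    # Select detections to penalize: boxes mostly inside the truck region
    # pass the filter even when their IoU with the full truck box is small.
    return pred_boxes[iop(pred_boxes, target_box) > thresh]
```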
Figure 6. The 3D truck model with the corresponding 2D textures. Top left: a naive, simple camouflage. Top right: blue paint indicates body parts that are used for the adversarial attack. The bottom row displays various views of the truck, showing the parts involved in the adversarial attack.
Figure 7. Examples from the High-Resolution Texture Dataset. Each column showcases textures from one of the four categories in the dataset: (1–2) Describable Textures Dataset, (3–4) Van Gogh Paintings, (5–6) Random Uniform Color Images, and (7–8) Random Noise Images. The first row presents the full texture images, the second row shows masked versions where only the truck’s body parts are textured (as described earlier in Section 4.1), and the third row displays the textures rendered on the truck in a specific scene from the Core Truck Dataset.
Figure 8. Visual comparison of the Base, Naive, Random, DTA, and TACO textures (first row) and their application to the truck model (second row).
Figure 9. Impact of different texture initialization methods on attack performance.
Figure 10. Effect of varying the smoothness loss coefficient (γ) on attack performance.
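As a stand-in for the paper's Convolutional Smooth Loss, the sketch below shows the classic total-variation smoothness term [27] that the γ coefficient in Figure 10 would scale; the actual loss generalizes this idea, and its exact kernel is not reproduced here.

```python
import torch

def tv_smoothness(texture: torch.Tensor) -> torch.Tensor:
    """Total-variation smoothness [27] on a (3, H, W) texture: mean absolute
    difference between neighboring texels, vertically and horizontally."""
    dh = (texture[:, 1:, :] - texture[:, :-1, :]).abs().mean()
    dw = (texture[:, :, 1:] - texture[:, :, :-1]).abs().mean()
    return dh + dw

# gamma trades attack strength against texture smoothness, as in Figure 10:
# loss = adversarial_loss + gamma * tv_smoothness(texture)
```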
Figure 11. Textures generated with varying smoothness loss coefficients (γ) and their application to the truck model.
Figure 12. Ablation-CAM for YOLOv8; the heatmaps indicate the regions where the model focuses its attention.
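For reference, the gradient-free Ablation-CAM procedure [36] underlying Figure 12 can be sketched as follows: each activation map is zeroed in turn, the resulting drop in the class score becomes its weight, and the weighted sum of maps gives the heatmap. `score_fn` is a placeholder for the detector head applied to the (possibly ablated) activations; this is a simplified sketch, not the exact evaluation pipeline.

```python
import torch

def ablation_cam(feature_maps: torch.Tensor, score_fn) -> torch.Tensor:
    """Gradient-free Ablation-CAM [36] over a (K, H, W) activation tensor."""
    base = score_fn(feature_maps)
    weights = torch.zeros(feature_maps.shape[0])
    for k in range(feature_maps.shape[0]):
        ablated = feature_maps.clone()
        ablated[k].zero_()                        # ablate one activation map
        weights[k] = (base - score_fn(ablated)) / (base + 1e-9)
    cam = (weights.view(-1, 1, 1) * feature_maps).sum(dim=0)
    return torch.relu(cam)                        # keep positive contributions
```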
Table 1. Comparison of PRN performance with and without gray textured truck input.

Configuration          L1 Loss   SSIM
Without $X_{gray}$     0.031     0.9862
With $X_{gray}$        0.025     0.9901
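Both metrics in Table 1 are standard image-quality measures. Assuming scikit-image is available, they can be computed as below for a PRN output and its UE5 ground truth, both given as float arrays in [0, 1].

```python
import numpy as np
from skimage.metrics import structural_similarity

def render_quality(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """L1 loss and SSIM [25] between two (H, W, 3) float images in [0, 1]."""
    l1 = float(np.abs(pred - gt).mean())
    ssim = structural_similarity(pred, gt, channel_axis=-1, data_range=1.0)
    return l1, ssim
```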
Table 2. AP@0.5 performance comparison of different textures across various object detection models.

Method            YOLOv8    YOLOv3    YOLOv5    FRCNN     FCOS      DETR
Base              0.7295    0.7216    0.6132    0.8377    0.6361    0.6441
Naive             0.8057    0.7305    0.6518    0.7770    0.6317    0.6619
Random            0.6705    0.7214    0.5537    0.8202    0.5948    0.6470
DTA (optimized)   0.2865    0.3663    0.3068    0.4532    0.3633    0.4121
TACO              0.0099    0.0491    0.1381    0.3157    0.2410    0.2600
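AP@0.5 summarizes the precision-recall curve for detections matched to ground truth at an IoU threshold of 0.5. A compact single-class sketch is given below; the matching step that decides whether each detection is a true positive is omitted, and the 101-point interpolation follows COCO-style evaluation [35] as an illustrative choice.

```python
import numpy as np

def ap_at_05(detections, num_gt):
    """AP@0.5 sketch. `detections` is a list of (confidence, is_true_positive)
    pairs, where a detection is a true positive if it matched an unmatched
    ground-truth truck with IoU >= 0.5 (matching omitted here)."""
    dets = sorted(detections, key=lambda d: -d[0])
    tp = np.cumsum([d[1] for d in dets])
    fp = np.cumsum([not d[1] for d in dets])
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # 101-point interpolated average precision, COCO-style [35].
    return float(np.mean([precision[recall >= r].max(initial=0.0)
                          for r in np.linspace(0, 1, 101)]))
```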
Table 3. ADR performance comparison of different textures across various object detection models.

Method            YOLOv8    YOLOv3    YOLOv5    FRCNN     FCOS      DETR
Base              0.7453    0.7258    0.6195    0.8689    0.7475    0.6678
Naive             0.8241    0.7376    0.6561    0.8234    0.7719    0.7045
Random            0.6814    0.7313    0.5614    0.8440    0.6959    0.6586
DTA (optimized)   0.2906    0.3646    0.3011    0.4560    0.3851    0.4130
TACO              0.0097    0.0448    0.1354    0.3558    0.5511    0.3247
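The sketch below illustrates a simple per-view detection-rate metric: the fraction of rendered views in which at least one truck detection survives above a confidence threshold. This is only a hypothetical reading of an ADR-style rate, shown to contrast with AP@0.5; the paper's exact ADR definition is given in the main text.

```python
def detection_rate(per_view_confidences, conf_thresh=0.5):
    """Hypothetical per-view rate: `per_view_confidences` is a list of lists,
    one list of truck-detection confidences per rendered view."""
    hits = sum(any(c >= conf_thresh for c in view) for view in per_view_confidences)
    return hits / max(len(per_view_confidences), 1)
```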
Table 4. AP@0.5 performance comparison of different loss schemes.

Method                  YOLOv8    YOLOv3    YOLOv5    FRCNN     FCOS      DETR
$L_{cls}$               0.0283    0.1283    0.1976    0.4884    0.4400    0.4280
$L_{cls} + L_{iou}$     0.0189    0.0690    0.1677    0.3786    0.2586    0.3105
$L_{cls} + L_{sm}$      0.0373    0.1283    0.2371    0.4979    0.4892    0.4638
$L_{total}$             0.0099    0.0491    0.1381    0.3157    0.2410    0.2600
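The loss schemes ablated in Tables 4 and 5 compose a total objective from the class, IoU, and smoothness terms. A one-line sketch of such a combination follows; the weighting coefficients are placeholders, not the paper's tuned values.

```python
def total_loss(l_cls, l_iou, l_sm, alpha=1.0, gamma=1.0):
    # L_total = L_cls + alpha * L_iou + gamma * L_sm (coefficients assumed).
    return l_cls + alpha * l_iou + gamma * l_sm
```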
Table 5. ADR performance comparison of different loss schemes.

Method                  YOLOv8    YOLOv3    YOLOv5    FRCNN     FCOS      DETR
$L_{cls}$               0.0317    0.1262    0.1953    0.5067    0.5542    0.4451
$L_{cls} + L_{iou}$     0.0189    0.0690    0.1677    0.3786    0.2586    0.3105
$L_{cls} + L_{sm}$      0.0361    0.1298    0.2317    0.5097    0.6220    0.4867
$L_{total}$             0.0097    0.0448    0.1354    0.3558    0.5511    0.3247