Article

Adversarial Robustness for Deep Learning-Based Wildfire Prediction Models

by Ryo Ide 1 and Lei Yang 2,*
1 Harrison High School, Harrison, NY 10528, USA
2 Department of Computer Science and Engineering, University of Nevada, Reno, NV 89557, USA
* Author to whom correspondence should be addressed.
Submission received: 25 December 2024 / Revised: 24 January 2025 / Accepted: 24 January 2025 / Published: 26 January 2025

Abstract

Rapidly growing wildfires have recently devastated societal assets, exposing a critical need for early warning systems to expedite relief efforts. Smoke detection using camera-based Deep Neural Networks (DNNs) offers a promising solution for wildfire prediction. However, the rarity of smoke across time and space limits training data, raising model overfitting and bias concerns. Current DNNs, primarily Convolutional Neural Networks (CNNs) and transformers, complicate robustness evaluation due to architectural differences. To address these challenges, we introduce WARP (Wildfire Adversarial Robustness Procedure), the first model-agnostic framework for evaluating wildfire detection models’ adversarial robustness. WARP addresses inherent limitations in data diversity by generating adversarial examples through image-global and image-local perturbations. Global and local attacks superimpose Gaussian noise and PNG patches onto image inputs, respectively; this suits both CNNs and transformers while generating realistic adversarial scenarios. Using WARP, we assessed real-time CNNs and transformers, uncovering key vulnerabilities. At times, transformers exhibited over 70% precision degradation under global attacks, while both architectures generally struggled to differentiate cloud-like PNG patches from real smoke during local attacks. To enhance model robustness, we propose four wildfire-oriented data augmentation techniques based on WARP’s methodology and results, which diversify smoke image data and improve model precision and robustness. These advancements represent a substantial step toward developing a reliable early wildfire warning system, which may be our first safeguard against wildfire destruction.

1. Introduction

Wildfires were among the deadliest US natural disasters in 2023, surpassing storms and floods [1]. They are also some of the costliest, with the US five-year average cost of firefighting estimated at USD 2.3 billion. Indeed, recent wildfires such as the 2025 Palisades Fires have demonstrated that uncontained fires may have a substantial cost on human life and infrastructure even in a short time frame [2]. As wildfires become more frequent, early wildfire detection systems are imperative for evacuation and prevention.
Several works have proposed automated solutions with a particular focus on detecting early indications of wildfires. With rapid advancements in computer vision, image-based solutions have attracted substantial attention recently. Yazdi et al. [3] conducted a comparative study of existing wildfire monitoring approaches, including high- [4] and low-altitude remote sensing [5], local sensing [6], and terrestrial surveillance [3]. They concluded that using deep learning with terrestrial surveillance to detect wildfire smoke was the most effective method of predicting wildfires. Terrestrial surveillance cameras positioned at vantage points have continuous temporal coverage with a wide field of view, producing high-resolution images. This allows wildfires to be detected in their incipient stage, the earliest stage of wildfires which is characterized by small smoke plumes. This makes terrestrial surveillance the most competitive monitoring approach.
Wildfire smoke detection using deep learning typically involves object detection, a highly established task in computer vision. This consists of identifying and locating an object with a bounding box within an image. The target object is wildfire smoke in this context. Figure 1 illustrates wildfire smoke object detection from surveillance video images using deep learning. To develop object detection models, large-scale deep neural network (DNN) models are trained on image data, usually numbering well into the thousands of images for each target object. Using the trained DNN models, smoke objects can be detected in real time, creating a powerful synergy with continuous surveillance. Automatic prediction systems built this way can (1) replace human resources that are expensive and time-consuming, and (2) potentially detect smoke with higher precision than their human counterparts. In tasks like wildfire detection, which is both costly [7] and demands accuracy (see Section 1.2), these advantages of DNNs are particularly desirable.

1.1. Existing DNN Solutions to Wildfire Smoke Detection

Existing camera-based wildfire smoke detection frameworks in the literature are characterized by two major DNN architectures: CNNs and transformers. There are numerous CNN-based approaches [9,10,11,12,13]; in particular, Jeong et al. [11] proposed a real-time wildfire detection model in which a YOLOv3 [14]-based CNN framework (for producing candidate proposals for smoke-positive regions) is combined with Long Short-Term Memory (LSTM) [15] (for screening the candidates). While enforcing temporal consistency was demonstrated to be effective in distinguishing smoke from cloud, the model was trained on samples taken from only 12 scenes. Despite the large number of image frames, the effective sample size may be as small as 12, potentially limiting the model’s ability to generalize. Additionally, for wildfire flame detection, which is considerably easier than smoke detection, Al-Smadi et al. [13] developed another CNN-based detection model using newer versions of YOLO and reported an extremely high prediction performance. However, the model was trained on positive-only samples with a relatively small image count (1723). It is not entirely clear how the limited sample size and synthetically inflated wildfire occurrences affect model robustness and generalizability.
For the transformer approach, there have only been two camera-based transformer approaches to wildfire detection in the literature, to the best of our knowledge [3,16]. In particular, Yazdi et al. [3] proposed NEMO (NEvada sMOke detection benchmark), using the DEtection TRansformer (DETR) [17] framework for the first time in wildfire smoke detection. They set the practical benchmark for state-of-the-art precision. To address the issue of false positives, they use several data preprocessing strategies to artificially diversify data. Specifically, they inject 260 smoke-negative background images randomly selected from the internet. They also create 116 collages from reused smoke-positive and smoke-negative images to further address this problem. Generally, the collages were effective not only for reducing false positives but for introducing variety to the object’s position and size. It was shown that this preprocessing step can reduce the false alarm rate. However, detailed discussions on the model’s generalizability and robustness on real-life data are not present beyond an ablation study of the model’s detection encoder mechanism and a typical time-series detection analysis.

1.2. Challenges in Adapting DNN Solutions to Wildfire Smoke Detection

While the DNN-based object detection approach has shown promising results [3,11,13], two main challenges remain: (1) the limited sample size coupled with low diversity in wildfire smoke training images, and (2) the lack of universal solutions for improving model robustness.
The first challenge arises because of the properties of wildfire smoke. Smoke is spatially anomalous because it occupies a small portion of the entire image. As a result, it is absent from the vast majority of pixels in the frames of high-resolution continuous surveillance videos. This makes manual annotation of images for object detection highly labor-intensive, necessitating ad hoc preprocessing steps as discussed in Section 1.1. Furthermore, smoke is temporally anomalous because it originates from a rare event. This results in surveillance image sequences having relatively few smoke-positive frames compared to the entire dataset. Such limited data generally create imbalance problems [18] and, when used for training, can cause detrimental performance degradation [19,20]. Moreover, when trained on limited datasets, DNNs are known to produce unexpected and/or detrimental outcomes even from slight modifications to the input [21]. In wildfire smoke detection, where training data are almost always limited, this problem can be exacerbated by the current trend of shifting from CNNs to transformer-based models [22], as the latter explicitly handle second-order statistics through the key-value transformation [23]. Despite the importance of this issue, little work has been done to assess the robustness of wildfire detection models across various inputs.
The second challenge arises because DNNs are inherently black boxes [24]. Since the internal workings of the model, including feature extraction processes and the role of parameters, are not immediately discernible, addressing potential issues requires substantial case-by-case effort, even if the source code is available. The architecture of CNNs and transformers differs greatly, making analysis based on model-specific quantities (e.g., gradients) less practical. Since the spatiotemporal anomaly issue could potentially introduce vulnerabilities to the model, a model-agnostic framework for evaluating the model’s robustness and fine-tuning must be developed.

1.3. Contributions

We recognize the following advantages of DNNs for wildfire detection: previous studies have found that DNN–computer vision models are the most effective in wildfire detection due to their compatibility with surveillance cameras, which have the most coverage compared to other wildfire observation methods. They also synergize well with cameras because they can keep up with their footage in real time, allowing for an automated wildfire detection system that saves costs and human errors.
However, we also recognize the following disadvantages of DNNs for wildfire detection: because of smoke’s fundamental properties, smoke image data is highly difficult to collect, which not only creates a severe data shortage but also a class imbalance problem. This limits the model’s ability to be generalized to all wildfire scenarios. More importantly, DNNs are black boxes, meaning their internal decision-making mechanisms are too abstract to interpret. This makes it challenging to identify specific sources of robustness vulnerabilities. Not only that, these mechanisms differ greatly across architectures, making comparing robustness across different models difficult.
To address these challenges, we propose WARP (Wildfire Adversarial Robustness Procedure), the first model-agnostic framework to comprehensively evaluate the adversarial robustness of wildfire detection models. Unlike common adversarial attack methods, which often rely on solving an optimization problem for perturbations using internal model details, WARP is model-agnostic and uses relatively simple noise injection methods. Additionally, WARP incorporates wildfire-specific contexts in designing noise tests, especially in the distinction between smoke and cloud, rather than using generic random noise. To the best of our knowledge, WARP is the first framework to offer a model-agnostic, contextual adversarial robustness evaluation method specifically tailored for wildfire detection. Insights obtained through WARP’s analysis can be used to further improve the model through data augmentation, making it a crucial first step toward a truly practical/reliable wildfire detection system. The main contributions of our study are summarized as follows:
  • We propose WARP, consisting of global and local model-agnostic evaluation methods for model robustness, tailored to wildfire smoke detection.
  • We compare the robustness of DNN-based wildfire detection models across two major neural network architectures, namely CNNs and transformers, and provide detailed insights into specific vulnerabilities of those models.
  • We propose data augmentation approaches for potential model improvement based on the above findings.

2. Preliminaries

We seek to quantitatively evaluate the adversarial robustness of wildfire smoke detection models, comparing the two main DNN architectures: CNN and transformer. This section formally summarizes the problem setting and provides a concise overview of adversarial robustness.

2.1. Problem Statement

We assume that an image-based wildfire detection model $y = f(x)$ is given, where $x$ is an input image, typically an image frame captured by a surveillance camera, and $y \in \{0, 1\}$ is the binary label indicating whether $x$ contains wildfire smoke ($y = 1$) or not ($y = 0$). Since wildfire smoke is spatially localized, the function $f$ incorporates two steps: bounding box generation (locating subimages within $x$) and scoring (estimating the probability of containing smoke), as illustrated in Figure 1.
Additionally, we assume that a test dataset $D = \{x^{(1)}, \ldots, x^{(N)}\}$ is available, where $x^{(n)}$ is the $n$-th image. Since the input images are often noisy due to varying weather and lighting conditions, a practical wildfire detection system must be robust to small variations in test images. The central question we address in this study is how robust such a wildfire detection model is.
We address this question by proposing metrics indicating changes in the classification outcomes upon introducing small perturbations to the input image. Specifically, as illustrated in Figure 2, we compute the precision degradation metric $L$, the classification flip probabilities $\alpha_i$ and $\beta_i$, and the localization deception rate $\gamma_i$. $L$ (see Section 3.1) seeks to quantify the model’s robustness against global noise, whereas $\alpha_i$, $\beta_i$, and $\gamma_i$ (see Section 3.2) seek to quantify the model’s robustness against local noise.
As discussed, wildfire smoke detection models need an adaptable and contextualized adversarial robustness evaluation method. For a disaster prevention task where human lives are involved, wildfire detection solutions must be robust. That is, a DNN wildfire prediction model must be able to adapt to a variety of wildfire scenarios, not exclusive to the wildfires it has seen in its training data. We leverage adversarial attacks to identify any critical model vulnerabilities through a variety of noise perturbations. The insights obtained will be vital, especially in a task like image-based wildfire smoke detection, as they can be used to further diversify the training data through data augmentation which will be discussed in Section 5.

2.2. Adversarial Robustness

Adversarial robustness refers to a model’s resilience to external attacks on its input. “Attacks” usually involve the injection of noise perturbations (slight modifications in the form of noise) into the original input to artificially create counterfactual (unforeseen) test scenarios. These are called adversarial examples.
In a classification task, adversarial robustness is typically evaluated by quantifying the resilience of the original detection when facing input perturbations. Let $f(x)$ be the classification model, $f: \mathbb{R}^m \to \{1, \ldots, k\}$, where $m$ is the dimension of the input image tensor (width × height × color channels) and $k$ is the number of classes. In wildfire smoke detection, $k$ is 2, as the only classes are null and smoke. The fundamental question that is considered when evaluating adversarial robustness is as follows:
Given a test input $x \in \mathbb{R}^m$ and a perturbation $r \in \mathbb{R}^m$, compare how $f(x)$ differs from $f(x + r)$.
If $f(x)$ matches the ground truth, but $f(x + r)$ does not, the model cannot be considered robust. However, it is conditionally required that $r$ be reasonably small, as it is almost guaranteed that the classification output will change if $r$ is arbitrarily large. Therefore, the most desirable perturbed image tensor $\tilde{x} = x + r$ looks almost identical to the original image tensor $x$, but has the maximum impact on the classification outcome. The optimization problem to find such a perturbation is called an adversarial attack [24]:
$$\max_{r}\ \mathrm{Loss}(x + r, l) \quad \text{subject to} \quad \mathrm{Distance}(x, x + r) \le \epsilon, \tag{1}$$
where $l$ is the predicted class label for $x$ (i.e., $f(x) = l$), $\mathrm{Loss}(x, y)$ is a loss function for an input and output pair $(x, y)$, and $\epsilon$ is a small positive error constant to keep the perturbed input close to the original input. Typically, the loss function is the same as that of the trained model.
Depending on their nature, perturbations can generally be categorized into two categories: global and local attacks [25]. (1) Global Attacks: The authors of [21,26] proposed global attack methods that seek the optimal perturbation that covers the entire input image. The authors of [26] incorporated the distance constraint as a penalty term $c\|r\|_1$, where $c$ is a constant and $\|\cdot\|_1$ is the $\ell_1$ norm, that is added to the objective function. Moreover, [21] proposed a more computationally efficient approach by constraining $r$ not by a constant but only to the sign of the gradient of the loss function, $\nabla_x \mathrm{Loss}(x, y)$. (2) Local Attacks: The authors of [27,28] proposed local attack methods. Ref. [28] proposed a single-pixel attack method based on a differential evolution framework. However, single-pixel attacks are highly inefficient for the typical object detection input, which is a 640 × 640 pixel image. Additionally, ref. [27] proposed a patch-based method, where a constant patch would be directly injected into the image. These patches adapt to different backgrounds and transformations, including position, size, and rotation. Patches were generally successful at deceiving otherwise rigorously trained models, thus making them a prime tool for adversarial robustness evaluation.
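For readers unfamiliar with the sign-of-gradient construction mentioned for global attacks, it is commonly written in the following form. This is a sketch of the widely used formulation, not a restatement of any equation from the cited works; the step size $\epsilon$ and the true label $y$ are notational assumptions here.

```latex
% Sign-of-gradient perturbation (commonly cited form; notation assumed):
% every pixel moves by the same magnitude \epsilon, in the direction
% that increases the loss.
r = \epsilon \cdot \operatorname{sign}\!\bigl(\nabla_{x}\,\mathrm{Loss}(x, y)\bigr),
\qquad
\|r\|_{\infty} = \epsilon .
```

Because the gradient $\nabla_x \mathrm{Loss}$ must be computed through the network, this construction needs access to model internals, which is the dependence a model-agnostic procedure avoids.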
There are several trends for both categories of existing adversarial attack approaches that limit their utility when applied to wildfire smoke detection.
Global attack methods often require model-specific elements for efficient computation, such as the gradient of the loss function. This necessitates the full knowledge of the architecture of the DNN model, thus making each model’s adversarial attack case-by-case. Given the variety of DNN architectures in wildfire smoke detection and the rapid development of new computer vision frameworks, this reliance on model elements may be computationally expensive. To comprehensively evaluate adversarial robustness for all wildfire smoke detection frameworks, a model-agnostic approach is necessary.
Moreover, local attack methods typically disregard the context of their task. Both single-pixel attacks and patch noise attacks are generally abstract formulations when viewed by humans, and generally do not fit in the context of wildfire smoke detection (i.e., abstract glob-like patches will not appear in camera surveillance images). Thus, their utility may be limited to deceiving hyper-tuned models in highly specific conditions. Wildfire smoke detection demands a more nuanced usage for noise patches. Contextual cues, such as the subtle differences in objects’ shape, color, and position, are paramount to distinguish smoke from highly similar objects (e.g., clouds, man-made structures with similar coloration) at a considerable distance. With limited wildfire data, it is not an option to make models learn these contextual cues by retraining them on additional images. Therefore, patch noise attack methods must take context into account to (1) evaluate a model’s ability to differentiate objects using the task’s context and (2) create adversarial examples that train this ability.

3. Materials and Methods

In this section, we introduce the proposed framework, WARP (Wildfire Adversarial Robustness Procedure), for comprehensively assessing the robustness of smoke detection models. WARP generates two types of adversarial examples through data augmentation: global noise (i.e., Gaussian noise) and local noise (i.e., cloud PNG patches). We then compare the original detection (red bounding box) with the perturbed detection (blue bounding box). For images perturbed with global noise, we conduct a sanity check, which analyzes the effect of global noise on model precision (see Section 3.1). For images perturbed with local noise, we conduct the deception test, which analyzes the robustness of the model’s localization and classification abilities on specific objects (see Section 3.2). We later demonstrate in Section 5 that these tests can identify robustness vulnerabilities in wildfire detection models, which in turn suggest data augmentation strategies tailored specifically to improving the robustness of DNN models.

3.1. Global Sanity Check: Noise Overlay

For practical wildfire smoke detection, models must distinguish smoke from similar objects such as clouds, fog, and camera artifacts. Since this distinction can be subtle, model training typically requires a significant sample size, which is currently unavailable in wildfire smoke detection, as discussed in Section 1.1. This raises questions about whether existing DNN models are robust enough against adversarial attacks.
As a preliminary test, the global sanity check uses image-wide random perturbations. Specifically, a random noise overlay following the Gaussian distribution ($r \sim \mathcal{N}(0, 1)$, i.e., Gaussian noise with zero mean and unit variance) of the same size as the input tensor $x$ is added to the entire image. The perturbed image $\tilde{x}$ is given by
$$\tilde{x} = (1 - a)\,x + a\,\sigma\,r, \tag{2}$$
where $a$ denotes the noise level that takes a value in the range $[0, 1]$, and $\sigma$ is the standard deviation of the input image tensor. The perturbation is contextualized to the input image since $\sigma$ is uniquely computed from that particular image.
For a given noise level, the mAP (mAP50-95) percentage loss $L$ is calculated to quantify the precision loss from the addition of random noise. We choose mAP (see Appendix A for detailed definitions) as the primary challenge metric since fixed-threshold precision metrics (e.g., mAP33, mAP50, mAP75) may only offer a limited evaluation of precision. mAP measures the mean average precision across Intersection over Union (IoU) thresholds ranging from 0.50 to 0.95, making it a robust metric [3].
$$L = \frac{\left|\mathrm{mAP}_{\text{after}} - \mathrm{mAP}_{\text{original}}\right|}{\mathrm{mAP}_{\text{original}}} \times 100, \tag{3}$$
where $\mathrm{mAP}_{\text{after}}$ is the mAP score after adding random noise to the dataset, and $\mathrm{mAP}_{\text{original}}$ is the mAP score for the original dataset.
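The noise overlay described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `add_global_noise`, the 8-bit image assumption, and the final rounding and clipping to [0, 255] are assumptions, since the text does not say how out-of-range values are handled.

```python
import numpy as np

def add_global_noise(image, a, rng=None):
    """Blend an image with Gaussian noise scaled by the image's own std.

    image: uint8 array (H, W, C); a: noise level in [0, 1].
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("noise level a must lie in [0, 1]")
    rng = np.random.default_rng() if rng is None else rng
    x = image.astype(np.float64)
    sigma = x.std()                    # per-image standard deviation
    r = rng.standard_normal(x.shape)   # r ~ N(0, 1), same shape as x
    perturbed = (1.0 - a) * x + a * sigma * r
    # Assumption: round and clip back to the valid 8-bit range.
    return np.rint(np.clip(perturbed, 0, 255)).astype(np.uint8)
```

Note that the factor (1 − a) also scales the original pixel values down, so larger noise levels darken the image in addition to adding noise.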
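As a small worked sketch of the percentage loss, the helper below computes it from two mAP scores. The function name and the example values are hypothetical, and the loss is taken as a non-negative magnitude, which is an assumption based on the positive values of L reported in Section 4.2.

```python
def map_percentage_loss(map_original, map_after):
    """Percentage change in mAP relative to the unperturbed score.

    Assumption: reported as a non-negative magnitude.
    """
    if map_original <= 0:
        raise ValueError("map_original must be positive")
    return abs(map_after - map_original) / map_original * 100.0

# Hypothetical scores: mAP falls from 0.40 to 0.10 under noise.
loss = map_percentage_loss(0.40, 0.10)
print(f"L = {loss:.2f}")
```

With these hypothetical scores, three quarters of the original mAP is lost, so L is approximately 75.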

3.2. Local Deception Test: Noise Patch

The previous test considers only the context of noise level variability via $\sigma$. We also propose the local deception test, which introduces spatial context specific to the smoke detection task. Specifically, local noise patches (small images in the PNG (portable network graphics) format) are used to check for robustness against localized perturbations specific to wildfires. For each image $i$, a PNG patch of constant size is injected at a specific spot in the image (see Figure 3a for an example). For computational efficiency, we divide each image into a 25 × 25 grid, and the noise is injected at the center of each grid slot. “Noise” can be any wildfire-related object, including smoke-like objects such as clouds, or other objects in context, such as trees, buildings, and glare. We specifically used clouds as they are the most common subject of false positives in previous works [3,11,13]. Figure 3b shows the cloud PNG used in the local deception test. Although the patch is horizontally wider than typical wildfire smoke, we selected it because it represents everyday cumulus-type clouds. After observing existing wildfire smoke data and testing different configurations, we configured the patch to be 25 × 25 pixels at 100% brightness.
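The grid-wise injection described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `grid_centers` and `paste_patch`, the RGBA patch layout, the alpha blending, and the clipping of patches at the image border are all assumptions.

```python
import numpy as np

def grid_centers(height, width, n=25):
    """Return the (row, col) pixel centers of an n x n grid over the image."""
    rows = ((np.arange(n) + 0.5) * height / n).astype(int)
    cols = ((np.arange(n) + 0.5) * width / n).astype(int)
    return [(int(r), int(c)) for r in rows for c in cols]

def paste_patch(image, patch_rgba, center):
    """Alpha-blend an RGBA patch onto an RGB image, centered at `center`."""
    out = image.astype(np.float64)
    ph, pw = patch_rgba.shape[:2]
    top, left = center[0] - ph // 2, center[1] - pw // 2
    # Clip the patch where it would extend past the image border.
    r0, c0 = max(top, 0), max(left, 0)
    r1 = min(top + ph, image.shape[0])
    c1 = min(left + pw, image.shape[1])
    sub = patch_rgba[r0 - top:r1 - top, c0 - left:c1 - left].astype(np.float64)
    alpha = sub[..., 3:4] / 255.0
    out[r0:r1, c0:c1] = alpha * sub[..., :3] + (1.0 - alpha) * out[r0:r1, c0:c1]
    return np.rint(out).astype(np.uint8)

# One perturbed copy of the image per grid slot (625 in total).
image = np.zeros((640, 640, 3), dtype=np.uint8)
patch = np.full((25, 25, 4), 255, dtype=np.uint8)  # stand-in for the cloud PNG
perturbed = [paste_patch(image, patch, c) for c in grid_centers(640, 640)]
```

Each of the resulting images would then be passed to the detector, and its outcome compared with the detection on the unperturbed image.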
To quantify the robustness against patch noise, three metrics are proposed: $\alpha_i$, $\beta_i$, and $\gamma_i$. They are defined in Equations (4)–(6) in the following sections.

3.2.1. Classification Flip Probabilities

A wildfire smoke detection model’s ability can be subdivided into its classification and localization abilities. The classification ability refers to how precisely the model can identify an image as smoke-positive or smoke-negative. The localization ability refers to how precisely the model can locate the smoke object using bounding boxes given a smoke-positive image.
The Classification Flip Probabilities focus on the vulnerability of the model’s classification abilities. When smoke-positive input images are given, a model may fail to generate bounding boxes for some images. These misclassified images can be called false negatives (FNs). On the other hand, if the model successfully generates at least one bounding box in any location of an image, the image can be called a true positive (TP).
The Classification Flip Probabilities quantify how many TPs or FNs are “flipped” upon injecting local noise. Unlike the global sanity check, noise injection is performed grid-wise. Image $i$ is divided into $A_i = 25 \times 25 = 625$ equal grid slots, and local noise is injected into each grid slot. The classification outcome is then observed $A_i$ times in total for the image. For each image $i$, depending on whether it is a TP or FN, the classification vulnerability of a model can be quantified by calculating two conditional probabilities:
$$\alpha_i = P_i(l_{\text{null}} \to l_{\text{smoke}}) = \frac{1}{A_i} \sum_{j=1}^{A_i} I_{i,j}(\text{FN} \to \text{TP}), \tag{4}$$
$$\beta_i = P_i(l_{\text{smoke}} \to l_{\text{null}}) = \frac{1}{A_i} \sum_{j=1}^{A_i} I_{i,j}(\text{TP} \to \text{FN}), \tag{5}$$
where $l_{\text{null}}$ and $l_{\text{smoke}}$ denote the null and smoke classes. $I_{i,j}(\cdot)$ is the indicator function, which equals 1 if the specified flip occurs for image $i$ when the noise is injected into the $j$-th grid slot, and 0 otherwise. Again, $A_i$ is fixed at $25 \times 25 = 625$ possible slots for every image.
These image-wise probabilities are averaged over the entire test set to evaluate the model’s classification vulnerabilities. In summary, $\alpha_i$ is the fraction of patch positions at which an initially missed image (FN) flips to a detection (TP), and $\beta_i$ is the fraction of patch positions at which an initially detected image (TP) flips to a miss (FN).
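The flip probabilities reduce to simple counting once the detector's image-level outcome is recorded for each grid slot. The sketch below assumes those outcomes are already available as booleans; the function name and the returned label are hypothetical conveniences, not part of the original procedure.

```python
def flip_probability(baseline_detected, perturbed_detected):
    """Fraction of grid slots at which the image-level outcome flips.

    baseline_detected: True if the unperturbed image yields at least one
        bounding box (TP), False otherwise (FN).
    perturbed_detected: one boolean per grid slot, same meaning.
    Returns ("alpha", p) for an FN baseline and ("beta", p) for a TP baseline.
    """
    outcomes = [bool(d) for d in perturbed_detected]
    if not outcomes:
        raise ValueError("at least one perturbed outcome is required")
    flips = sum(1 for d in outcomes if d != bool(baseline_detected))
    name = "beta" if baseline_detected else "alpha"
    return name, flips / len(outcomes)

# Hypothetical image: detected at baseline, lost at 25 of 625 patch positions.
name, value = flip_probability(True, [True] * 600 + [False] * 25)
print(name, value)
```

In this hypothetical case the image is a TP at baseline, so the result is a beta value of 25/625 = 0.04.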

3.2.2. Localization Deception Rate

We propose the Localization Deception Rate $\gamma_i$ (for each image $i$ in the test set) to evaluate the localization ability. Unlike the Classification Flip Probabilities, $\gamma_i$ quantifies bounding box detection robustness, and is defined below (see Figure 4):
$$\gamma_i = \frac{D_i}{A_i}, \tag{6}$$
where $D_i$ is the number of detections whose bounding box has $\mathrm{IoU} \ge 0.50$ (see Equation (A4) for further details) with the injected noise’s bounding box. $A_i$ is the number of attempts, or the number of possible positions for the noise to be injected in a 25 × 25 grid ($25^2 = 625$). Since $D_i$ is discrete, $\gamma_i$ is also discrete. That is, $\{1, 2, 3, \ldots, 625\}$ is the set of all possible values of $D_i$, and the set of all possible deception rates $\gamma_i$ is $\{1/625, 2/625, 3/625, \ldots, 625/625\} = \{0.0016, 0.0032, 0.0048, \ldots, 1.0000\}$.
A higher $\gamma_i$ indicates that the model is more vulnerable to noise injection for image $i$. The overall localization robustness of the model can be evaluated by averaging $\gamma_i$ across all images in the test set. In a theoretical model with perfect adversarial robustness, all values of $\gamma_i$ should have a frequency of 0.
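The deception rate can be sketched as an IoU check per grid slot. This is an illustrative reading rather than the authors' code: the (x1, y1, x2, y2) box format, the function names, and the choice to count a slot at most once (when any detection overlaps the patch with IoU of at least 0.50) are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def deception_rate(trials, threshold=0.50):
    """Fraction of grid slots where a detection lands on the injected patch.

    trials: list of (patch_box, detected_boxes) pairs, one per grid slot.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    deceived = sum(
        1 for patch_box, detections in trials
        if any(iou(d, patch_box) >= threshold for d in detections)
    )
    return deceived / len(trials)

# Hypothetical example with four slots instead of 625.
patch_box = (0, 0, 25, 25)
trials = [
    (patch_box, [(0, 0, 25, 25)]),          # detection exactly on the patch
    (patch_box, [(100, 100, 200, 200)]),    # detection elsewhere
    (patch_box, []),                        # no detection
    (patch_box, [(0, 0, 25, 50)]),          # IoU = 0.5, counts as deceived
]
print(deception_rate(trials))
```

Two of the four hypothetical slots are deceived, giving a rate of 0.5.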

3.3. Data Collection and Preparation

We used two datasets to train, validate, and test the model: the NEMO dataset [3], curated from various sources including [8], and a dataset from [5], collected from the High-Performance Wireless Research and Education Network (HPWREN) database [29].
The NEMO dataset was originally created to fine-tune DETR. It contains 2500+ images, with roughly 90% being smoke-positive. The smoke-positive regions in the original video frames were cropped and zoomed for efficient training. Following [3], we used this dataset for model training. In contrast, the dataset by [5] contains more realistic 1661 unedited smoke-positive images. We used this dataset for model testing and robustness evaluation. Table 1 shows the data split.

3.4. Object Detection Framework

Since the ultimate goal is to establish an automated wildfire smoke detection system allowing real-time detection and continuous model improvement, we focus on object detection models allowing real-time object detection. Real-time models allow faster feedback when detecting real-life adversarial scenarios. Additionally, we contributed to the nascent model zoo in camera-based real-time wildfire smoke detection [3,10,11,13,30,31,32] by introducing two previously unused open-source models to the field.
  • For a CNN-based real-time object detection model, we choose YOLOv8, the 8th generation model of the YOLO (You Only Look Once) framework. It is a popular real-time object detection framework and is publicly available on the Ultralytics API [33].
  • For a transformer-based real-time object detector, we choose RT-DETR (Real-Time Detection-Transformer), which can be viewed as a real-time variant of DETR used by [3]. RT-DETR overcomes the computation-costly limitations of transformers by sacrificing minimal accuracy for speed by prioritizing and selectively extracting object queries that overlap the ground truth bounding boxes by a certain IoU [34].
Pre-trained weights were transferred from COCO-dataset-trained versions of YOLOv8 and RT-DETR (both v8.3.67). Their lightweight versions, YOLOv8-nano (YOLOv8n) and RT-DETR-large (RT-DETR-l), were chosen to reduce computation time during robustness evaluation. Default training and inference hyperparameters were used from [33]. Training YOLOv8n for 250 epochs and RT-DETR-l for 285 produced our best results. However, extensive hyperparameter tuning is encouraged in the future.
We use metrics mAP and mAP50 to measure model precision (See Appendix A for detailed definitions), which are common challenge metrics in object detection. All training, validation, and testing were run on a Python 3.8.12 virtual environment with CUDA version 12.2, equipped with quadruple NVIDIA GeForce GTX 1080 GPUs from the University of Nevada, Reno.

4. Results

This section presents the results of the adversarial robustness of CNN- and transformer-based wildfire detection model architectures, trained on two publicly available datasets.

4.1. Post-Training

Table 2 compares precision metrics for real-time YOLOv8n and RT-DETR-l. Metrics for non-real-time models trained by [3] are also shown for comparison: the transformer-based model (NEMO-DETR) and last-generation CNN-based models (NEMO-FRCNN, NEMO-RNet). YOLOv8n outperforms RT-DETR-l in precision and parameter efficiency. Specifically, YOLOv8n achieved a very efficient mAP-to-parameter ratio with little hyperparameter tuning. YOLOv8n came within ≈7.80% of NEMO-DETR’s mAP, whereas RT-DETR-l came within ≈23.6%. YOLOv8n and RT-DETR-l outperformed NEMO-FRCNN and NEMO-RNet, but fell short of the state-of-the-art NEMO-DETR. We expected this, as YOLOv8n and RT-DETR-l had a reduced parameter size for computational efficiency. Nevertheless, both are competitive wildfire smoke detection models.
Despite its state-of-the-art transformer architecture, RT-DETR-l’s precision was lower than the CNN-based YOLOv8n. This may be attributed to RT-DETR-l’s IoU-Aware Query Selection mechanism [34]. This mechanism prioritizes object queries with good initial IoU with ground truth. Small objects like wildfire smoke have small bounding boxes, making it likely that their IoU with initial predictions is low. Object queries for small smoke are therefore less prioritized during training, making it harder for RT-DETR to pick up smoke’s subtle features. Without extensive data fine-tuning and hyperparameter-tuning, speed-optimized transformers may be unable to take advantage of the strengths of the self-attention mechanism. This requirement can be harmful for nuanced tasks that require continual updating to accommodate unforeseen circumstances. Wildfire smoke detection may be only one such case where this may be a problem.
In addition, general model evaluation approaches using metrics that do not account for model robustness may not be appropriate. Modern metrics reflect the model’s performance under one validation dataset. Even if extensive hyperparameter tuning can significantly improve precision, it may not noticeably change results in practical testing.

4.2. Results of Global Sanity Check

Figure 5 shows the result of the global sanity check as the noise level $a$ is increased in increments of 0.001 from 0.0 to 0.4 over 400 iterations. Interestingly, we observed a substantial difference between the two architectures. Specifically, the mAP percentage loss $L$ for the transformer-based RT-DETR-l generally grew at a higher rate than that of the CNN-based YOLOv8n. $L$ for RT-DETR-l converged to ≈100 at noise level $a \approx 0.300$, whereas YOLOv8n did not show signs of convergence even at the terminal noise level 0.400 ($L_{\text{RT-DETR-l}}(0.400) = 99.23$ and $L_{\text{YOLOv8n}}(0.400) = 89.53$).
Overall, YOLOv8n was more resilient to image-wide perturbation attacks than RT-DETR-l. RT-DETR-l and YOLOv8n had the largest difference in $L$ at noise level $a = 0.210$, where RT-DETR-l’s $L$ was more than 70% higher than that of YOLOv8n ($L_{\text{RT-DETR-l}}(0.210) = 86.53$ and $L_{\text{YOLOv8n}}(0.210) = 50.82$). To illustrate how the global noise affects object detection performance, Figure 6 and Figure 7 show representative examples that compare the original detection results with those at noise level 0.210 for identical images.
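The comparison at noise level 0.210 can be checked with a short calculation. The sketch below assumes that the mAP percentage loss is the relative drop in mAP from the noise-free baseline, expressed in percent; this definition is an assumption here, since the formal definition is given earlier in the paper. Both function names are hypothetical.

```python
def map_percentage_loss(clean_map, noisy_map):
    """Assumed definition: relative mAP drop from the clean baseline, in percent."""
    if clean_map <= 0:
        raise ValueError("clean_map must be positive")
    return 100.0 * (clean_map - noisy_map) / clean_map


def relative_excess(loss_a, loss_b):
    """How much larger loss_a is than loss_b, as a percentage of loss_b."""
    if loss_b <= 0:
        raise ValueError("loss_b must be positive")
    return 100.0 * (loss_a - loss_b) / loss_b


# Losses reported in the text at noise level a = 0.210.
loss_rtdetr = 86.53
loss_yolo = 50.82

excess = relative_excess(loss_rtdetr, loss_yolo)
print(f"RT-DETR-l loss exceeds YOLOv8n loss by {excess:.1f}%")
```

With the reported values, the excess comes out just above 70%, consistent with the statement in the text.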
To the naked eye, the noise is barely visible, only appearing as a dimming in image luminosity. There was minimal difference between the performance of the two models without noise (compare Figure 6a with Figure 7a). Under noise, however, both models suffered in precision. YOLOv8n tended to be more conservative with its detections, only detecting smoke with a definitive white color, or in other words, at a later period of the incipient stage (see Figure 6b). On the other hand, RT-DETR-l did not have this bias but instead began to confuse the sky for smoke with relatively high confidence (see Figure 7b).
Further observation shows that both models may confuse clouds with smoke under noise stress (compare Figure 8a with Figure 8b), but RT-DETR-l produces false positives even when no clouds exist (see Figure 9b). Again, YOLOv8n became conservative with its detections, losing detection confidence and sometimes not detecting at all (see Figure 9a). Figure 8 and Figure 9 are two representative examples of false detections made by both models for identical images, again at noise level $a = 0.210$.
This raises model robustness concerns when encountering real image-wide noise. A notable possibility is quality of service degradation events, where network latency or data transfer loss causes footage to become heavily distorted, which may be common given that the cameras are located in remote areas. Other possibilities may include various camera debris (for example, a rain mark near the left in Figure 9).

4.3. Results of Local Deception Test

In this section, we summarize the results of the local deception test, which included an analysis of the classification flip probabilities and the localization deception rate. They offer a nuanced review of model robustness and how it translates to detection results. In addition, we conduct an auxiliary test which contextualizes the role of image annotations in our findings.

4.3.1. Result of Classification Flip Probabilities

To see the robustness of the classification outcome under local noise, we calculate the expected value of both $\alpha_i$ and $\beta_i$ over the entire test set for both YOLOv8n and RT-DETR-l below.
$$E[\alpha_i] = \frac{1}{k}\sum_{i=1}^{k}\alpha_i, \qquad E[\beta_i] = \frac{1}{k}\sum_{i=1}^{k}\beta_i,$$
where $k$ is the total number of images in the test dataset, which was 1,661 in the HPWREN data (see Table 1).
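The expectations above are plain sample means over the test images. The following minimal sketch assumes that each per-image flip probability is the fraction of the 625 patch positions at which the prediction flips; that reading, the helper names, and the toy counts are illustrative assumptions rather than the authors' implementation.

```python
GRID_CELLS = 625  # number of patch positions per image, as stated in the text


def flip_probability(flipped_positions):
    """Per-image flip probability: flipped positions / total positions (assumed)."""
    if not 0 <= flipped_positions <= GRID_CELLS:
        raise ValueError("flipped_positions must lie in [0, GRID_CELLS]")
    return flipped_positions / GRID_CELLS


def expected_value(per_image_values):
    """Sample mean over the k test images, i.e. (1/k) times the sum of the values."""
    values = list(per_image_values)
    if not values:
        raise ValueError("need at least one image")
    return sum(values) / len(values)


# Hypothetical toy data: number of flipped positions for four images.
toy_counts = [0, 625, 125, 0]
toy_alphas = [flip_probability(count) for count in toy_counts]
toy_expectation = expected_value(toy_alphas)
print(toy_expectation)
```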
The results are summarized in Table 3. As discussed, $\alpha_i$ shows the robustness of the prediction that image $i$ is smoke-positive, and $\beta_i$ shows the robustness of the prediction that image $i$ is smoke-negative. The results indicate a trend in wildfire smoke detection. Since smoke’s features are subtle, the smoke-positive prediction is easily flipped by injecting a cloud-like noise into even one of the 625 grids. This strongly suggests that there is substantial room for improvement in the model, possibly through data augmentation with additional images containing smoke-like objects, as discussed in Section. On the other hand, both models show strong robustness for smoke-free predictions. In particular, no smoke-to-null flips were observed for RT-DETR-l, suggesting a greater potential of the transformer-based model for general object detection tasks.
Here, RT-DETR-l’s self-attention mechanism may work in its favor for local perturbation classification. The architecture may help defend against classification changes under local perturbations, as demonstrated in Table 3. This may be because the self-attention mechanism in transformers captures and leverages global relationships across the image. Unlike in Section 4.2, where the entire image was perturbed, local perturbations may not substantially affect the global relationships within the original image.

4.3.2. Results of Localized Deception Rate

Table 4 shows the count of each possible $\gamma_i$ value for YOLOv8n and RT-DETR-l. The occurrence of local deception is relatively rare in both models. We also calculated the expected deception rate:
$$E_{\text{YOLOv8n}}[\gamma_i] = \frac{(0 \times 1629) + (0.0016 \times 32)}{1661} = 3.08 \times 10^{-5}$$
$$E_{\text{RT-DETR-l}}[\gamma_i] = \frac{(0 \times 1570) + (0.0016 \times 85) + (0.0032 \times 5) + (0.0048 \times 1)}{1661} = 9.44 \times 10^{-5}$$
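The two expectations can be recomputed directly from the counts in Table 4. In the sketch below, the observed deception-rate values are written as multiples of 0.0016, which equals 1/625 (one patch position out of 625); treating the rate as a per-image fraction of deceived positions is an interpretation, and the data layout and function name are hypothetical.

```python
GAMMA_STEP = 0.0016  # equals 1/625, i.e. one deceived position out of 625

# Table 4: {number of deceived positions: number of test images}.
TABLE4 = {
    "YOLOv8n": {0: 1629, 1: 32},
    "RT-DETR-l": {0: 1570, 1: 85, 2: 5, 3: 1},
}


def expected_deception_rate(histogram, gamma_step=GAMMA_STEP):
    """Weighted mean of the deception rate over all images in the histogram."""
    total_images = sum(histogram.values())
    weighted_sum = sum(
        deceived * gamma_step * images for deceived, images in histogram.items()
    )
    return weighted_sum / total_images


rate_yolo = expected_deception_rate(TABLE4["YOLOv8n"])
rate_rtdetr = expected_deception_rate(TABLE4["RT-DETR-l"])
print(f"YOLOv8n:   {rate_yolo:.2e}")
print(f"RT-DETR-l: {rate_rtdetr:.2e}")
print(f"ratio:     {rate_rtdetr / rate_yolo:.2f}")
```

Both histograms sum to the 1661 test images, and the ratio of the two rates is slightly above three.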
In contrast to Section 4.3.1, RT-DETR-l’s self-attention may work to its detriment in local perturbation localization. Section 3.4 demonstrated that RT-DETR’s transformer architecture may cause challenges in small-object detection. In particular, object queries representing smoke’s subtle features may be de-prioritized, leading to worse precision for small objects. This causes bounding box confusion with similar small smoke-like objects like the cloud PNG patch. This is consistent with Table 4, as RT-DETR-l frequently recorded higher $\gamma_i$ values, and its localization deception rate was more than three times that of YOLOv8n. However, more testing is required to determine the significance of these results in real conditions.
For further analysis, we extracted RT-DETR-l detections with $\gamma_i > 0.0000$ and compared them to their YOLOv8n counterparts. We highlight the region where the noise patch deceived the model. There were five images where both models recorded $\gamma_i > 0.0000$. Interestingly, four of these five detections were taken from the same surveillance camera, namely the one at the SMER TCS9 site. Figure 10a,b show two such cases. There were several cases where RT-DETR-l recorded multiple deceptions whereas YOLOv8n remained completely unaffected. In particular, there were cases in which RT-DETR-l recorded $\gamma_i$ values of 0.0032 and 0.0048 (see Figure 11b and Figure 12b, respectively), but YOLOv8n remained unchanged with a $\gamma_i$ value of 0.0000 (see Figure 11a and Figure 12a). Below are identical images detected by the two models, where each grid represents the position at which the cloud noise was injected. Red areas indicate noise-affected regions, whereas blue regions indicate otherwise.
Factoring out rare cases where the cloud noise completely overlapped the smoke (this could only happen in one or two grids, putting the probability at around 0.16–0.32%), both models were deceived when the cloud PNG patch was near the smoke. This is alarming, since the cameras are positioned at vantage points, and clouds are more clearly visible at higher altitudes. Thus, the occlusion of smoke by clouds would be a relatively common occurrence. Furthermore, a likely explanation is that the models confuse smoke with clouds because the two share similar features, and parts of the cloud PNG patch merge with the smoke at times. As seen with YOLOv8n, the models tend to detect reliably once the smoke has developed a clear white coloration. Thus, the models may have interpreted the cloud as the origin point of the smoke, even when the direction of the smoke plumes suggested otherwise.
To test if there was a spatial relationship between the deceptions, we enumerated the cumulative number of noise-affected detections for each slot that the patch noise was injected for both models. Both models exhibited a slight bias to be deceived at the center regions (see Figure 13a,b). A potential explanation is that smoke is most frequently depicted in the middle horizontal of the image. It becomes easier for human observers to spot and annotate as smoke rises above the horizon, which is frequently in this region. This creates an annotation bias in the test set (see Figure 14), which may have caused the models to over-focus on this region. Thus, cloud PNG patches may appear as part of the smoke most in this zone.
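The cumulative count per patch slot described above can be sketched as follows. The sketch assumes that the 625 patch positions form a 25 × 25 grid and that each image yields a list of (row, column) slots at which the model was deceived; both the grid shape and the data layout are assumptions made for illustration, and the toy input is hypothetical.

```python
GRID_SIDE = 25  # assumption: the 625 patch positions form a 25 x 25 grid


def cumulative_deception_map(deceived_slots_per_image):
    """Count, for every grid slot, how many images were deceived at that slot."""
    heat = [[0] * GRID_SIDE for _ in range(GRID_SIDE)]
    for slots in deceived_slots_per_image:
        for row, col in slots:
            heat[row][col] += 1
    return heat


def row_totals(heat):
    """Sum each grid row, which exposes any bias toward a horizontal band."""
    return [sum(row) for row in heat]


# Hypothetical toy input: three images, one of them never deceived.
toy_slots = [[(12, 3)], [(12, 3), (11, 20)], []]
toy_heat = cumulative_deception_map(toy_slots)
toy_rows = row_totals(toy_heat)
print(toy_rows[11], toy_rows[12])
```

Summing along rows is one simple way to check whether deceptions concentrate in the middle-horizontal band discussed in the text.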

5. Discussion

In a comprehensive comparative study, we analyze the adversarial robustness of CNN- and transformer-based real-time smoke detection models in the context of other studies. The main findings are summarized below:
  • The global sanity check revealed that the transformer-based RT-DETR-l model was substantially more vulnerable to global noise injection than the CNN-based YOLOv8n, even at noise levels barely visible to the human eye. This is a notable contrast to other studies [35,36], which suggest that transformers trained on Big Data (e.g., CIFAR-10, MNIST, etc.) are more robust than CNNs against global noise. The current results may be an artifact of the severe data shortage in wildfire detection. Thus, as more data augmentation techniques (such as those from WARP) create more diverse adversarial examples, transformers will most likely overtake CNNs in terms of global noise robustness.
  • An analysis of the classification flip probabilities revealed that both CNN- and transformer-based models are sensitive to local noise injection. A single noise-injected grid (out of 625 total) resulted in flip probabilities of ≈50% and ≈34% in the smoke-positive prediction for the CNN and transformer models, respectively. These results underscore the need for further model training using data augmentation techniques. While transformers performed better during this test in our study, refs. [37,38] suggest that transformers are not necessarily more robust to patch perturbations than CNNs. Because this uncertainty appears to persist even for models trained on Big Data, more research on this adversarial attack is recommended.
  • An analysis of the localization deception rate revealed areas within the images that were particularly vulnerable to local noise injection. Detailed analysis suggests that human annotation bias may cause models to over-focus on the middle-horizontal region, offering insights for future data augmentation strategies to enhance model robustness. This is consistent with DNN behavior in other fields [39], highlighting the need to consider not only the visual characteristics of the target objects but also their spatiotemporal context for truly unbiased training.
Based on our analysis, we propose the following data augmentation strategies for improving robustness. These solutions should diversify future training data for YOLOv8- and RT-DETR-based wildfire detection models. Moreover, because this framework is model-agnostic, these solutions may still be applied even if new DNN architectures are introduced.
  • Gaussian-distributed Noise
    Gaussian noise with $0.1 \le a \le 0.4$ should be introduced into training data, which should reduce precision degradation when encountering global noise, especially among speed-optimized transformers. It will also improve data variety and quantity.
  • Cloud PNG-Patch
    Clouds and smoke overlap most in the middle-horizontal strip of the images. Given that most false positives occur in this zone, cloud PNG patches should be placed in this zone to help models distinguish the two.
    Furthermore, cloud PNG patches should also be placed into the upper areas (areas depicting the sky) to add spatial variety to local noise patches.
  • Collages/Mosaic
    An effective solution to combat false positives extensively used by [3] is to use collages. This is because collages allow models to easily compare between smoke-positive and smoke-negative images. Since CNNs suffered from local perturbation classification, collages of images between classes should be implemented into training data.
    Furthermore, certain collage techniques such as the YOLOv4-style mosaic [40] have the added benefit of introducing object size variety in the data. This is particularly useful for small object detection.
    However, collages inevitably make the already-small smoke object even smaller. Especially for speed-optimized transformers, which have a known weakness with small objects, crucial smoke features must be extracted first. The augmentation strategy below seeks to offer a potential solution.
  • 2 × 2 Crops
    Data collected by [5] from [29] exclusively depict smoke that appears at or near the horizon (i.e., the middle-horizontal of the image). Cropping these images into equal quadrants creates four standalone images, which shift the position of objects. This adds spatial diversity to smoke annotations.
    Furthermore, since the crops will result in a smaller-sized image, when resized to the target image size (640 × 640 pixels), the image, and by extension, smoke, will appear larger. This may help models better extract the subtle features of smoke, offering a solution to the problem discussed in the previous augmentation strategy.
    Finally, crops that do not include smoke introduce negative samples. Generally speaking, negative samples will help reduce false positives.
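Two of the strategies listed above, Gaussian-distributed noise and 2 × 2 crops, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the WARP implementation: the noise level is treated as the standard deviation of zero-mean noise on an image scaled to [0, 1] (the noise model used in the global sanity check may differ), bounding-box labels are not remapped to the crops, and both function names are hypothetical.

```python
import numpy as np


def add_gaussian_noise(image, a, rng=None):
    """Add zero-mean Gaussian noise with standard deviation `a` to an image in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = image + rng.normal(loc=0.0, scale=a, size=image.shape)
    # Clip so the result stays a valid image in [0, 1].
    return np.clip(noisy, 0.0, 1.0)


def crop_quadrants(image):
    """Split an H x W (x C) image into four equal quadrants.

    If H or W is odd, the last row or column is dropped so all crops match.
    """
    half_h, half_w = image.shape[0] // 2, image.shape[1] // 2
    return [
        image[:half_h, :half_w],                        # top-left
        image[:half_h, half_w:2 * half_w],              # top-right
        image[half_h:2 * half_h, :half_w],              # bottom-left
        image[half_h:2 * half_h, half_w:2 * half_w],    # bottom-right
    ]


# Usage on a synthetic gray image; a real pipeline would load camera frames.
rng = np.random.default_rng(0)
frame = np.full((480, 640, 3), 0.5)
noise_level = rng.uniform(0.1, 0.4)  # sample a from the suggested range
noisy_frame = add_gaussian_noise(frame, noise_level, rng)
quadrants = crop_quadrants(frame)
print(noisy_frame.shape, [q.shape for q in quadrants])
```

In a full pipeline, each quadrant would then be resized to the target input size and its annotations clipped to the crop; quadrants with no remaining smoke box would serve as negative samples.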

6. Conclusions

In this study, we introduced WARP (Wildfire Adversarial Robustness Procedure), the first-ever model-agnostic framework for evaluating the adversarial robustness of wildfire smoke detection models, designed to address limitations arising from insufficient variety in smoke images.
WARP supports both global and local adversarial attack methods. While the global attack method employs image-contextualized random noise overlays, the local attack method is tailored to address two key aspects of smoke detection: (1) classification between smoke-positive and smoke-negative instances and (2) smoke object bounding box localization.
Leveraging WARP’s model-agnostic capabilities, we trained and compared CNN- and transformer-based wildfire detection models. Using the global attack method, we found that transformers suffer from higher precision degradation under global noise. Using the local attack method, we found that while CNNs’ classification results are less robust than those of transformers, CNNs performed better in terms of localization ability. Future studies regarding this perturbation method are recommended to obtain a conclusive result. Finally, an auxiliary test suggests that annotation bias may be partially responsible for deceiving both models with local noise patches. Based on these findings, we proposed wildfire-specific data augmentation approaches. We leave a detailed analysis of the proposed data augmentation approaches for future studies.
WARP is a large step towards making DNN-based wildfire detection models more practical. Full-scale implementation of these systems will inevitably be costly. With the already high firefighting costs in the US, vulnerabilities to even simple adversarial attacks add great uncertainty to the reliability of these models. By identifying these biases and creating simple countermeasures in the form of data augmentation, WARP supports both model accountability and dependability. We hope that this work will advance wildfire detection models enough to enable a fully automatic prediction system against wildfires, which can save the valuable lives and infrastructure that are lost to them.

Author Contributions

Conceptualization, R.I. and L.Y.; methodology, R.I.; software, R.I. and L.Y.; validation, R.I. and L.Y.; formal analysis, R.I.; investigation, R.I.; resources, R.I.; data curation, L.Y.; writing—original draft preparation, R.I.; writing—review and editing, R.I. and L.Y.; visualization, R.I.; supervision, L.Y; project administration, L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the National Science Foundation OIA-2148788.

Institutional Review Board Statement

Not applicable

Informed Consent Statement

Not applicable

Data Availability Statement

The datasets presented in this study are openly available: NEMO [3], HPWREN [5,29].

Acknowledgments

We would like to thank Amirhesam Yazdi from the University of Nevada, Reno for their thoughtful discussions, alongside Patrick Watters’s generous help for debugging. We also acknowledge Austin Parkerson for his continued support in providing the GPU computing service which was vital for the experimentation.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
| Abbreviation | Meaning |
|---|---|
| WARP | Wildfire Adversarial Robustness Procedure |
| DNN | Deep Neural Network |
| CNN | Convolutional Neural Network |
| YOLO | You Only Look Once |
| LSTM | Long Short-Term Memory |
| mAP | mean Average Precision |
| NEMO | NEvada sMOke detection benchmark |
| DETR | DEtection TRansformer |
| IoU | Intersection over Union |
| HPWREN | High-Performance Wireless Research and Education Network |
| RT-DETR | Real-Time DETR |
| COCO | Common Objects in COntext |
| FRCNN | Faster Region-based Convolutional Neural Network |
| RNet | RetinaNet |

Appendix A

Mean Average Precision (mAP) is a common metric for precision in machine learning. It is obtained by calculating numerous other sub-metrics, which are shown here.
Precision and Recall are well-established performance metrics in machine learning. Average Precision can be calculated by taking the area beneath the Precision–Recall curve (i.e., $p(r)$), typically using the Trapezoid Rule.
$$\mathrm{Precision} = \frac{TP}{TP + FP},$$
$$\mathrm{Recall} = \frac{TP}{TP + FN},$$
$$AP = \int_{0}^{1} p(r)\, dr,$$
where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives. A detection is considered a true positive when the detection bounding box A overlaps the ground truth bounding box B by a certain overlap threshold.
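The three definitions above translate into a few lines of code. The following is a minimal sketch: returning 0.0 when a denominator is zero is a convention chosen here rather than something stated in the paper, the curve points are assumed to be sorted by increasing recall, and the example curve is hypothetical.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 when there are no detections (a convention)."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 when there are no ground truths (a convention)."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def average_precision(recalls, precisions):
    """Trapezoid-rule area under p(r); points must be sorted by increasing recall."""
    if len(recalls) != len(precisions):
        raise ValueError("recalls and precisions must have the same length")
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        if width < 0:
            raise ValueError("recalls must be non-decreasing")
        area += width * (precisions[i] + precisions[i - 1]) / 2.0
    return area


# Hypothetical three-point precision-recall curve.
toy_ap = average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.6])
print(round(toy_ap, 3))
```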
The degree of overlap is quantified by the Intersection over Union (IoU), which is obtained by
$$IoU(A, B) = \frac{|A \cap B|}{|A \cup B|}.$$
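For axis-aligned bounding boxes, the IoU reduces to a ratio of rectangle areas. The sketch below assumes boxes are given as (x1, y1, x2, y2) corner coordinates with x1 < x2 and y1 < y2; that box format and the function name are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle; zero if the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Two 2 x 2 boxes overlapping in a 1 x 1 square: IoU = 1 / 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```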
The mean Average Precision (mAP) can be obtained by taking the average of AP across all classes.
$$mAP = \frac{\sum_{l=1}^{k} AP(l)}{k},$$
where the class index $l$ runs over all the class labels $\{1, \ldots, k\}$ with $k$ being the total number of classes.
There are certain variations of mAP based on what IoU threshold is used. The most common is mAP50:95, or simply mAP, which is the average of mAP scores over IoU thresholds from 0.50 to 0.95 at increments of 0.05. There is also mAP50, where mAP is calculated with a fixed overlap threshold of $IoU = 0.50$.
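The averaging over IoU thresholds can be sketched as follows. This is an illustration only: `map_at_threshold` stands in for whatever evaluation routine returns the mAP at a single IoU threshold, and it, like the other names here, is a hypothetical placeholder rather than a call from any particular library.

```python
def iou_thresholds():
    """The ten IoU thresholds 0.50, 0.55, ..., 0.95 used for mAP50:95."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]


def map_50_95(map_at_threshold):
    """Average a per-threshold mAP function over the ten IoU thresholds."""
    thresholds = iou_thresholds()
    return sum(map_at_threshold(t) for t in thresholds) / len(thresholds)


def map_50(map_at_threshold):
    """mAP at the single fixed IoU threshold of 0.50."""
    return map_at_threshold(0.50)


def toy_model(threshold):
    """Hypothetical stand-in: precision falls linearly as the threshold tightens."""
    return 1.0 - threshold


print(map_50(toy_model), round(map_50_95(toy_model), 3))
```

Because stricter thresholds can only reject more detections, mAP50:95 is never larger than mAP50 for the same model, which matches the ordering of the two columns in Table 2.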

References

  1. Salas, E.B. Number of Fatalities Due to Natural Disasters U.S. 2023. 2024. Available online: https://www.statista.com/statistics/216831/fatalities-due-to-natural-disasters-in-the-united-states/ (accessed on 29 September 2024).
  2. The Department of Forestry and Protection. Palisades Fire. 2025. Available online: https://www.fire.ca.gov/ (accessed on 18 January 2025).
  3. Yazdi, A.; Qin, H.; Jordan, C.B.; Yang, L.; Yan, F. Nemo: An open-source transformer-supercharged benchmark for fine-grained wildfire smoke detection. Remote Sens. 2022, 14, 3979.
  4. Fernandes, A.M.; Utkin, A.B.; Lavrov, A.V.; Vilar, R.M. Development of neural network committee machines for automatic forest fire detection using lidar. Pattern Recognit. 2004, 37, 2039–2047.
  5. Govil, K.; Welch, M.L.; Ball, J.T.; Pennypacker, C.R. Preliminary results from a wildfire detection system using deep learning on remote camera images. Remote Sens. 2020, 12, 166.
  6. Barmpoutis, P.; Papaioannou, P.; Dimitropoulos, K.; Grammalidis, N. A review on early forest fire detection systems using optical remote sensing. Sensors 2020, 20, 6442.
  7. National Interagency Fire Center. Wildfires and Acres. 2023. Available online: https://www.nifc.gov/fire-information/statistics/wildfires (accessed on 18 September 2024).
  8. ALERTWildfire. Available online: https://www.alertwildfire.org/ (accessed on 18 September 2024).
  9. Gonçalves, A.M.; Brandão, T.; Ferreira, J.C. Wildfire Detection with Deep Learning—A Case Study for the CICLOPE Project. IEEE Access 2024, 12, 82095–82110.
  10. Jindal, P.; Gupta, H.; Pachauri, N.; Sharma, V.; Verma, O.P. Real-time wildfire detection via image-based deep learning algorithm. In Soft Computing: Theories and Applications: Proceedings of SoCTA 2020; Springer: Singapore, 2021; Volume 2, pp. 539–550.
  11. Jeong, M.; Park, M.; Nam, J.; Ko, B.C. Light-weight student LSTM for real-time wildfire smoke detection. Sensors 2020, 20, 5508.
  12. Fernandes, A.M.; Utkin, A.B.; Chaves, P. Automatic Early Detection of Wildfire Smoke with Visible Light Cameras Using Deep Learning and Visual Explanation. IEEE Access 2022, 10, 12814–12828.
  13. Al-Smadi, Y.; Alauthman, M.; Al-Qerem, A.; Aldweesh, A.; Quaddoura, R.; Aburub, F.; Mansour, K.; Alhmiedat, T. Early wildfire smoke detection using different yolo models. Machines 2023, 11, 246.
  14. Redmon, J. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
  15. Hochreiter, S. Long Short-term Memory. Neural Comput. 1997, 9, 1735–1780.
  16. Wang, X.; Pan, Z.; Gao, H.; He, N.; Gao, T. An efficient model for real-time wildfire detection in complex scenarios based on multi-head attention mechanism. J. Real-Time Image Process. 2023, 20, 66.
  17. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 213–229.
  18. Cho, S.H.; Kim, S.; Choi, J.H. Transfer learning-based fault diagnosis under data deficiency. Appl. Sci. 2020, 10, 7768.
  19. Buda, M.; Maki, A.; Mazurowski, M.A. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 2018, 106, 249–259.
  20. Johnson, J.M.; Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J. Big Data 2019, 6, 27.
  21. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. arXiv 2014, arXiv:1412.6572.
  22. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110.
  23. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
  24. Molnar, C. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. Available online: https://christophm.github.io/interpretable-ml-book/ (accessed on 18 September 2024).
  25. Chen, P.Y.; Hsieh, C.J. Adversarial Robustness for Machine Learning; Academic Press: Cambridge, MA, USA, 2022.
  26. Szegedy, C. Intriguing properties of neural networks. arXiv 2013, arXiv:1312.6199.
  27. Brown, T.B.; Mané, D.; Roy, A.; Abadi, M.; Gilmer, J. Adversarial patch. arXiv 2017, arXiv:1712.09665.
  28. Su, J.; Vargas, D.V.; Sakurai, K. One pixel attack for fooling deep neural networks. IEEE Trans. Evol. Comput. 2019, 23, 828–841.
  29. HPWREN. The HPWREN Fire Ignition Images Library for Neural Network Training. 2023. Available online: https://www.hpwren.ucsd.edu/FIgLib/ (accessed on 18 September 2024).
  30. Wang, L.; Zhang, H.; Zhang, Y.; Hu, K.; An, K. A Deep Learning-Based Experiment on Forest Wildfire Detection in Machine Vision Course. IEEE Access 2023, 11, 32671–32681.
  31. Oh, S.H.; Ghyme, S.W.; Jung, S.K.; Kim, G.W. Early wildfire detection using convolutional neural network. In International Workshop on Frontiers of Computer Vision; Springer: Singapore, 2020; pp. 18–30.
  32. Wei, C.; Xu, J.; Li, Q.; Jiang, S. An intelligent wildfire detection approach through cameras based on deep learning. Sustainability 2022, 14, 15690.
  33. Ultralytics. Ultralytics YOLO Docs. 2024. Available online: https://docs.ultralytics.com/ (accessed on 18 September 2024).
  34. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974.
  35. Mahmood, K.; Mahmood, R.; Van Dijk, M. On the robustness of vision transformers to adversarial examples. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 7838–7847.
  36. Shao, R.; Shi, Z.; Yi, J.; Chen, P.Y.; Hsieh, C.J. On the adversarial robustness of vision transformers. arXiv 2021, arXiv:2103.15670.
  37. Fu, Y.; Zhang, S.; Wu, S.; Wan, C.; Lin, Y. Patch-fool: Are vision transformers always robust against adversarial perturbations? arXiv 2022, arXiv:2203.08392.
  38. Gu, J.; Tresp, V.; Qin, Y. Are vision transformers robust to patch perturbations? In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 404–421.
  39. Dhar, S.; Shamir, L. Systematic biases when using deep neural networks for annotating large catalogs of astronomical images. Astron. Comput. 2022, 38, 100545.
  40. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
Figure 1. Smoke object detection from surveillance video images. A DNN object detection model creates bounding boxes (see green box) to locate smoke as the target object. Image adapted from [8].
Figure 2. WARP workflow.
Figure 3. Local noise injection. (a) An example image with injected cloud-like noise at a grid location, highlighted by the red circle. The green bounding box indicates the ground-truth location of the smoke. (b) The cloud-like PNG patch used as local noise, with a background added for visibility. Adapted from the internet.
Figure 4. Illustration of the proposed localization deception test.
Figure 5. mAP percentage loss plotted across all iterations.
Figure 6. (a) YOLOv8n without noise; detection confidence 0.79 and (b) with noise; detection confidence 0.62.
Figure 7. (a) RT-DETR-l without noise; detection confidence 0.82 and (b) with noise; detection confidence 0.38.
Figure 8. (a) YOLOv8n; detection confidence 0.37 and (b) RT-DETR-l; detection confidence 0.41.
Figure 9. (a) YOLOv8n; no smoke detections made and (b) RT-DETR-l; detection confidence 0.59.
Figure 10. SMER TCS9 site (10/3/2019).
Figure 11. Otay Mountain site (8/14/2019).
Figure 12. Otay Mountain site (8/14/2019).
Figure 13. A map of cumulative deceptions for YOLOv8n (a) and RT-DETR-l (b).
Figure 14. Distribution of annotations in the test dataset.
Table 1. Data split for the image dataset. Training and validation data were sourced from NEMO [3], whereas the testing data were sourced from HPWREN [5].
| Training | Validation | Testing |
|---|---|---|
| 2704 | 337 | 1661 |
Table 2. Comparison of smoke detection accuracies. Note that NEMO models are not real-time and are for comparison purposes only. Their results were adapted from [3].
| Model | mAP | mAP50 | Parameter Size | mAP-to-Param Ratio |
|---|---|---|---|---|
| YOLOv8n | 39.0 | 72.0 | 3.2 M | $2.25 \times 10^{-5}$ |
| RT-DETR-l | 32.2 | 69.7 | 33 M | $2.11 \times 10^{-6}$ |
| NEMO-DETR | 42.3 | 79.0 | 41 M | $1.93 \times 10^{-6}$ |
| NEMO-FRCNN | 29.5 | 69.3 | 43 M | $1.61 \times 10^{-6}$ |
| NEMO-RNet | 28.9 | 68.8 | 32 M | $2.15 \times 10^{-6}$ |
Table 3. The classification flip probabilities alongside the true positive/false negative counts without noise.
| Model | $E[\alpha_i]$ | $E[\beta_i]$ | TP | FN |
|---|---|---|---|---|
| YOLOv8n | $8.67 \times 10^{-6}$ | 0.495 | 1496 | 165 |
| RT-DETR-l | 0.00 | 0.337 | 1102 | 559 |
Table 4. The frequency of detections for each observed value of $\gamma_i$. Other $\gamma_i$ values were not observed.
| $\gamma_i$ | 0.0000 | 0.0016 | 0.0032 | 0.0048 |
|---|---|---|---|---|
| YOLOv8n | 1629 | 32 | 0 | 0 |
| RT-DETR-l | 1570 | 85 | 5 | 1 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ide, R.; Yang, L. Adversarial Robustness for Deep Learning-Based Wildfire Prediction Models. Fire 2025, 8, 50. https://doi.org/10.3390/fire8020050
