1. Introduction
As the impact of climate change continues to intensify worldwide, the frequency and severity of wildfires have noticeably increased, posing substantial risks to ecosystems and human communities alike [
1]. Efficient and effective early detection systems are vital for mitigating the destructive consequences of such events. Aligning with the United Nations’ Sustainable Development Goals (SDGs)—specifically SDG 13, targeting urgent action to combat climate change and its impacts, and SDG 15, focusing on the sustainable use of terrestrial ecosystems and halting deforestation—these systems represent an essential move towards a resilient future.
Forest fires are multifaceted phenomena, influenced by various environmental and contextual factors. Designing accurate detection systems thus presents substantial challenges. The integration of deep learning (DL) techniques, especially within the field of computer vision, has yielded impressive state-of-the-art results in addressing these complexities [
2]. These models, primarily convolutional neural networks (CNNs), are capable of distinguishing between normal forest conditions and different stages of forest fires, and even recognizing the early indicators of wildfires [
3,
4]. However, their efficacy is heavily reliant on the quality, diversity, and relevance of the training data. Substandard or insufficiently diverse data can notably hinder these advanced algorithms’ performance, possibly leading to increased false alarms or missed detections [
5].
Once trained and evaluated, these models can be incorporated into various surveillance systems, such as satellites, drones, or ground-based platforms, for real-time image analysis [4]. When they detect visual features indicative of potential forest fires, they can promptly alert authorities, significantly shortening response times and reducing the destructive effects of wildfires. Continual retraining of these algorithms on new data helps them adapt to shifting conditions and evolving wildfire patterns in the context of global climate change.
From a practical implementation standpoint, adopting a multi-modal data approach stands out as a highly robust solution for forest fire detection using DL [
4]. This strategy leverages the inherent strengths of various data types, such as red-green-blue (RGB) data, thermal data, hyperspectral data, laser infrared technology, and meteorological data. Each of these offers unique insights and complementary information about the environment under observation. By combining these data types, the system can counterbalance the individual limitations of each, yielding a more accurate and comprehensive detection solution [
6].
In the specific context of RGB data, relying exclusively on them can lead to a higher incidence of both false positives and false negatives [4]. This stems from the limited ability of RGB data to consistently capture features indicative of forest fires. Characteristics such as color, shape, and texture can vary significantly across different fire scenes, leading to a lack of standardized identification markers and thereby complicating detection [7]. Additionally, confounding environmental factors such as clouds, fog, sunlight reflection, and low-altitude cloud cover can imitate visual features of fire or smoke, further degrading machine learning (ML) model performance [4] (see Figure 1).
Despite these challenges, RGB data remain vital to the development of DL-based forest fire early warning systems. Their broad availability, high resolution, and ease of human interpretation render them an indispensable element [5]. The widespread use of RGB imaging devices and their potential for frequent data collection ensure extensive spatial and temporal coverage. Consequently, RGB data stand as a robust primary source that, when thoughtfully combined with other data types, can markedly improve the fire detection system’s overall performance. The strategic utilization of RGB data within a multi-modal data fusion framework presents a particularly promising approach to forest fire detection. While RGB data have their limitations, the unique attributes they possess make them a key component in the data fusion puzzle.
Maximizing the utilization of RGB data is appealing due to their near-ubiquitous availability and relative affordability. The high-frequency capture of RGB data allows for timely response to potential fire events, and the immediate mobilization of firefighting resources, a crucial factor in minimizing the devastating impact of forest fires. Moreover, RGB images offer a human-readable format that facilitates communication and coordination among various stakeholders, including decision-makers, firefighting teams, and the public [
3]. Additionally, most state-of-the-art CNNs are pre-trained on RGB data, enabling transfer learning to reduce processing cost and training time, both of which are important aspects for the use of edge devices with limited capacity [
8]. While the robustness of a multi-modal system is enhanced through the inclusion of other data types, such as thermal, hyperspectral, laser infrared, and weather data, to name a few, RGB data remain central to this fusion approach.
Building on the integral role of RGB data in the forest fire detection context, numerous strategies have been proposed to enhance the performance of DL models specifically trained on this data type. Addressing the unique challenges associated with forest fire detection in RGB images, these approaches span a broad spectrum of techniques. Data augmentation strategies have been deployed to enrich the diversity and representation of training data, bolstering the model’s capacity to generalize across varying conditions [
9]. Further improvements have been attained through transfer learning. This approach repurposes models pre-trained on large, diverse datasets for the specific task of forest fire detection, thus harnessing their robustness in handling confounding elements [
10]. To refine the differentiation between fire and non-fire phenomena, multi-class classification schemes have been introduced. Preprocessing techniques play a vital role as well. Techniques such as image enhancement, background subtraction, and color space transformations have been employed to accentuate smoke and fire features in the images, thereby boosting the model’s capacity to distinguish these phenomena. Researchers have also ventured into more advanced DL architectures, including recurrent neural networks (RNNs), transformer models, and attention mechanisms [
11]. These methodologies have proven valuable in prioritizing the more relevant and distinctive regions within images, thus enhancing overall model performance. Similarly, saliency detection has shown promise for DL-based wildfire identification [
7]. By directing the model’s focus towards the most relevant and distinctive regions within images, this approach mitigates the impact of confounding elements on classification accuracy, and further improves model performance.
The development of algorithms for forest fire detection requires the efficient extraction of intricate visual features from varied data sources. However, an equally vital yet often overlooked aspect is the quality and representativeness of the data used [12]. While the drive to improve algorithmic performance continues, the necessity of capturing the complex and dynamic environmental conditions of real-world forests in training datasets cannot be overstated. Unfortunately, this essential component has not received substantial attention in existing research, leading to a critical gap in the field.
Detecting forest fires, whether from aerial sources or ground-based platforms, is a complicated process. It is fraught with variability due to environmental factors such as weather conditions, terrain, and vegetation, as well as capture-related aspects such as image resolution and angle of capture [
4]. Datasets that fail to encompass this wide array of variables may inadvertently lead to models that cannot generalize to new, unseen scenarios. The consequences of this shortfall can be severe, culminating in suboptimal performance when applied to real-world situations, and possibly failing to detect or falsely detecting fire events.
Furthermore, a thorough review of existing literature on forest fire detection using RGB images uncovers a noticeable deficiency in understanding the confounding elements and challenges that contribute to high false alarm rates [
13]. Although many studies acknowledge the connection between these factors and false alarms, they tend to explore the issue through a restricted set of examples. An in-depth examination that fully comprehends the nuances of these contributing elements is conspicuously absent. This lack of comprehensive exploration leaves a significant opportunity for future research to bridge this knowledge gap, enhancing our ability to develop more precise and reliable fire detection systems.
Despite the significant strides made in the realm of ML-based forest fire detection, certain practical challenges remain unaddressed. The availability of publicly accessible datasets for developing and testing models is notably limited [
12]. This assertion is further corroborated by recent review articles in the field [
3,
5,
13]. This deficit impedes the establishment of a well-grounded benchmark, critical for enabling consistent evaluation and comparison of different forest fire detection models [
5]. The absence of such standardized benchmarks hinders progress and validation of innovative techniques in this field [
12]. A major contributing factor to this issue is the scarcity and limited accessibility of RGB wildfire data [
5]. Predominantly sourced from fire surveillance cameras and drones, these images are often subject to permissions from local authorities. Moreover, capturing images of forest fires is not only hazardous for the personnel involved but also difficult due to the unpredictable nature of wildfires. Coupled with the fact that forest fires are less common than other types of fires, acquiring adequate samples for research purposes becomes a considerable challenge [
3].
While there is a growing interest in leveraging RGB images for wildfire detection, the reality is that most studies either do not share their datasets or rely on private datasets, often fraught with various data-related issues [
5]. Some research efforts, as documented in [
14], have tried to tackle the problem by collecting images with visual elements similar to wildfires, such as fog, clouds, sunlight reflections, and sunsets, from non-specific wildfire datasets to reduce false positive rates. In another instance, researchers [
9] made use of publicly accessible images from the Portuguese Firefighters Portal Database, a dedicated media outlet supporting Portuguese firefighters [
15]. While this source provides a wealth of images from various fire incidents throughout Portugal, accessing these images for research is a laborious process. Each image must be downloaded individually, and media outlet logos need to be cropped before use. These hurdles underline the lack of standardized, diverse datasets in the field, complicating the task of comparing the efficacy of different methodologies and building on the results reported in the literature. Similarly, some studies have used video datasets for flame detection, though only a handful of these datasets have been specifically curated for forest fires. Early wildfire warning systems necessitate datasets that encapsulate not just the flames, but also the smoke characteristics [
14]. Predominantly, these training datasets consist of video frames, which are often plagued by an abundance of duplicate images. Such redundancy potentially undermines the generalizability of the models trained on these datasets [
9]. Seeking alternatives to these constraints, researchers have ventured into the realm of synthetic data generation using generative adversarial networks (GANs) [
14]. These networks can create additional training samples, infuse diversity, and facilitate training under controlled conditions. On the other hand, reference [
16] exemplifies an alternative approach that targets some of the noted constraints by employing a multi-task learning (MTL)-based forest fire detection model (MTL-FFDet). The model was developed with three distinct tasks: the detection task, the segmentation task, and the classification task. This innovative approach shares the feature extraction module across all tasks, thereby enhancing feature extraction ability and reducing the number of false and missed detections. Furthermore, the introduction of a novel joint multi-task non-maximum suppression (NMS) processing algorithm seeks to leverage the benefits of each task to maximize detection accuracy.
Finally, when considering the practical applications of fire detection models in real-world forest environments, a comprehensive approach is warranted. This approach would ideally integrate video and image datasets with synthetic data generated by GANs. Video datasets bring to the table valuable temporal information, facilitating the monitoring of fires over time and encapsulating the dynamic nature of wildfires. On the other hand, image datasets collected from the Internet contribute a diverse set of samples, originating from varied sources and depicting fires under a wide spectrum of conditions and contexts. The incorporation of synthetic data enhances the breadth of the training set, potentially improving the robustness and generalizability of the models.
In this study, we present a meticulously curated and diverse image dataset containing 2700 RGB instances, designed to serve as a benchmark for future forest fire detection research. The dataset is structured into two main categories (fire vs. nofire) and further divided into five subclasses, introducing a novel and comprehensive scope. It encompasses a wide array of environmental conditions, forest types, geographical regions, and confounding elements, all aimed at addressing the pervasive issue of high false alarm rates in DL-based fire detection systems.
Considering the notable scarcity of inclusive RGB datasets in this area, our contribution represents a valuable resource for the research community. To ensure the dataset’s integrity, we adhered strictly to legal compliance, including only images that belong to the public domain, and providing a detailed description of the dataset’s characteristics. This approach offers researchers a complete understanding of its diversity and depth.
Our goal is to spur innovation and facilitate progress in forest fire detection; thus, the dataset will be made publicly available. Accompanying the dataset, a CSV document will also be released to the public. In this document, each image will be linked to its respective download URL for reference and will include details such as its resolution. Through these efforts, the study aims to fill current knowledge gaps and foster the development of more precise and reliable solutions in this vital field.
In addition to providing the dataset, our work includes a thorough examination of potential confounding elements that could challenge the performance of DL models. By exploring these factors (see Figure 1), we aim to deepen the understanding of the complexities involved in forest fire detection, further enhancing the applicability and efficacy of our research.
Through a meticulous examination of the dataset, and by compiling a list of challenging factors identified through both a comprehensive literature review and a visual inspection of the images, this study emphasizes the depth and relevance of the proposed dataset. The effectiveness of the new dataset, referred to as “wildfire”, will be assessed by leveraging a combined dataset. This combined collection comprises several relevant, previously published datasets, amounting to a total of 36,775 images. More detailed information on the datasets included can be found in Table 1.
Subsequently, a DL model trained on the combined dataset will be evaluated using the wildfire dataset. This approach not only helps to confirm whether the initial list of confounding elements covers most of the challenges faced by a DL model in current literature but also assesses the model’s performance on specific types of images not covered by the list. If the model performs poorly on certain image categories, it may be necessary to update the list to include additional confounding elements. Moreover, this method serves as a means to evaluate the quality and relevance of the collected dataset and demonstrates its potential to enhance the diverse set of confounding elements in forest fire detection. This analysis will help justify the need for such a dataset in the research community and establish the significance of the study’s contribution to the field.
Furthermore, the study proposes a novel approach centered on a multi-task learning [
22] framework. In this method, a single base model is simultaneously trained to carry out two related tasks—binary classification (fire/smoke vs. no fire/smoke) and multi-class classification (different types of fire and confounding elements). The uniqueness of this approach lies in the concurrent consideration of auxiliary task classes of confounding elements during both processes. To the best of the authors’ knowledge, this unique approach of integrating multi-class confounding elements in a multi-task learning framework is a first in the field of forest fire detection. This innovative dual-task training could potentially enhance the model’s ability to distinguish between subtle differences among classes, thereby reducing the false positive rate.
The study also addresses the data imbalance problem evident in the proposed wildfire dataset, where 1047 instances form the fire class against 1653 instances in the nofire class. Care was taken to retain the inherent characteristics of the fire class, and the natural occurrence bias was preserved in the validation and test sets. Utilizing both the original and augmented training sets, the study implemented classical one-step and two-step MTL multi-class classification methods. These explored the subtle yet discernible influence of data balancing on key performance metrics such as accuracy, precision, recall, F1-score, and ROC-AUC score.
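As a minimal sketch of how the threshold-based metrics named above are obtained, the following computes accuracy, precision, recall, and F1-score from confusion-matrix counts, treating fire as the positive class. The function name and the example counts are hypothetical and are not results from this study; ROC-AUC is omitted because it requires ranked prediction scores rather than counts.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion counts (fire = positive)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Class counts stated in the text: 1047 fire vs. 1653 nofire images.
FIRE, NOFIRE = 1047, 1653
fire_share = FIRE / (FIRE + NOFIRE)  # fraction of the dataset that is fire

# Hypothetical counts, chosen only to illustrate the calculation.
example = binary_metrics(tp=80, fp=10, fn=20, tn=90)
```

Because the fire class is the minority, precision and recall on that class are more informative than accuracy alone, which is presumably why several metrics are reported together.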
Section 2 will detail the key stages of the methodology and describe the materials utilized in this study.
2. Materials and Methods
2.1. Dataset Collection and Curation
There is an evident need in the literature for enhanced datasets that address the existing limitations and gaps in the field of forest fire detection. Many RGB datasets often lack the required variety and representation of real-world conditions, hindering the development and evaluation of robust detection models [
13]. The novel wildfire dataset introduced in this study aims to tackle this issue by purposefully increasing the variability between samples and integrating confounding elements. This increased variability facilitates a more comprehensive evaluation of detection approaches across various real scenarios, contributing to the enhancement of DL-based forest fire detection methods and their practical implementation.
2.1.1. Image Collection and Dataset Structure Formation
A dataset of 2700 RGB aerial and ground-based images of forested areas was gathered from multiple online sources, including government agencies, Flickr, and Unsplash. This diverse dataset encompasses a broad spectrum of environmental conditions, forest types, geographical regions, and the highly dynamic characteristics of forest ecosystems and fire events. The image resolutions within the dataset are varied, as indicated by the following key statistics:
Average resolution: 4057 × 3155 pixels
Minimum resolution: 153 × 206 pixels
Maximum resolution: 19,699 × 8974 pixels
Standard deviation of resolution (width): 1867.47 pixels
Standard deviation of resolution (height): 1388.60 pixels
These metrics highlight high-resolution imagery that captures detailed information favorable for precise analysis in deep learning applications for forest fire detection.
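Statistics of this kind could be reproduced from a list of image sizes along the following lines. This is only a sketch: the function name is hypothetical, ranking the minimum and maximum by pixel count is an assumption, and the text does not state whether the reported standard deviations are population or sample values (the population form is used here).

```python
from statistics import mean, pstdev


def resolution_summary(sizes):
    """Summarize a list of (width, height) tuples given in pixels."""
    widths = [w for w, _ in sizes]
    heights = [h for _, h in sizes]

    def pixel_count(size):
        return size[0] * size[1]

    return {
        "mean": (mean(widths), mean(heights)),
        "min": min(sizes, key=pixel_count),   # assumption: ranked by area
        "max": max(sizes, key=pixel_count),
        "std_width": pstdev(widths),          # assumption: population std
        "std_height": pstdev(heights),
    }


# The first and last sizes are the extremes quoted in the text;
# the middle one is a hypothetical filler value.
sizes = [(153, 206), (4000, 3000), (19699, 8974)]
summary = resolution_summary(sizes)
```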
The dataset’s images represent different real-world scales, mirroring the varied sources and contexts from which they were collected. This diversity in scale was carefully considered in the design of the deep learning experiments, with images resized to a consistent scale as detailed in Section 2.3. No preprocessing was applied to the images in the released dataset, which keeps them versatile and usable in different contexts; preprocessing tailored to this study’s objectives was applied only within the experiments, as explained in Section 2.3. This approach maintains the native resolution and natural variability of the images and leaves users free to apply adjustments suited to their own analyses.
Though the collection process may not be exhaustive, it supports the robustness and generalizability of the findings derived from the dataset’s analysis to some degree. The authors acknowledge the need for continuous development and expansion of this image collection and have chosen to maintain the dataset as a dynamic entity. This evolving approach signifies that additional images, videos, and other relevant data types will be incrementally included, based on feedback and requirements from the dataset’s users. Such adaptiveness not only broadens the dataset’s scope and richness but ensures that it remains a relevant and comprehensive tool for current and future research.
The dataset was divided into training (70%), validation (15%), and testing (15%) subsets, with further categorization within the primary classes of fire and nofire. The training set, containing 1888 images, forms the foundation for model learning. It consists of 1157 nofire images and 731 fire images. These images are further divided into subclasses representing different aspects of wildfires and potential confounders. The validation directory, holding 402 images, is used to tune model hyperparameters and monitor overfitting, while the test directory contains 410 images for the final evaluation. The structure of these directories is consistent, supporting a faithful assessment of predictive capabilities.
Finally, the images within each directory were randomized to promote a diverse representation across the dataset (refer to Table 2 for the distribution of instances across the dataset’s classes and subclasses).
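A class-stratified 70/15/15 split followed by shuffling can be sketched as below. This is an illustrative procedure under stated assumptions, not the authors’ exact one: the function name, the seed, the file names, and the rounding rule are all hypothetical, so the resulting subset sizes are close to, but not identical with, the reported 1888/402/410.

```python
import random
from collections import defaultdict


def stratified_split(items, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split (path, label) pairs into train/val/test, class by class."""
    by_label = defaultdict(list)
    for path, label in items:
        by_label[label].append((path, label))

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(fractions[0] * len(group))
        n_val = round(fractions[1] * len(group))
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]      # remainder goes to the test set

    # Randomize the order within each subset, as described in the text.
    for subset in (train, val, test):
        rng.shuffle(subset)
    return train, val, test


# Placeholder file names using the class counts stated in the paper.
items = [(f"fire_{i}.jpg", "fire") for i in range(1047)]
items += [(f"nofire_{i}.jpg", "nofire") for i in range(1653)]
train, val, test = stratified_split(items)
```

Splitting per class keeps the fire-to-nofire ratio roughly equal across the three subsets, which matches the stated intent of preserving the natural class bias in the validation and test sets.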
2.1.2. Capturing General Variability in the Dataset
To strengthen the representativeness and generalizability of the dataset, the image collection process aimed to capture a comprehensive range of variability from environmental and capture-related sources. The following parameters were meticulously considered:
Environmental Variability:
1. Topography: Varied terrain features, including hills, valleys, slopes, and plateaus.
2. Canopy Density and Structure: Distinct differences in tree density, height, branching patterns, and forest stratification.
3. Forest Types and Species Composition: A variety of forest ecosystems encompassing diverse species, plant communities, and successional stages.
4. Ground Cover: A wide range of ground cover types, such as grass, bare soil, water, rocks, and leaf litter.
5. Natural Components: The presence of rivers, lakes, wetlands, and other natural landscape elements.
6. Human-made Objects: Infrastructure, including roads, bridges, buildings, vehicles, power lines, and other anthropogenic features.
7. Weather Conditions: Various atmospheric phenomena, such as fog, rain, snow, dust, and wind.
8. Foliage: Seasonal and phenological changes in foliage, including leaves, flowers, fruits, and seed dispersal.
9. Sunlight: Diverse sunlight exposure, shading patterns, and solar angles.
10. Fire Characteristics: Variability in fire size, shape, color, intensity, progression, and smoke plume dynamics.
11. Smoke Dispersion: The variability in smoke plume patterns due to wind speed, wind direction, and atmospheric stability.
Capture-related Variability:
1. Lighting Conditions: Fluctuations in light and shadows resulting from clouds, time of day, and sun angle [9].
2. Image Resolution: Varied levels of image detail, sharpness, and pixel density.
3. Altitude and Distance: Diverse flying heights and distances from the forest or fire event, affecting image scale and detail [14].
4. Camera Angle and Orientation: Variations in the camera angle relative to the subject, its orientation, and field of view.
5. Perspective: A mix of top-down, oblique, and side-view angles in the images.
6. Platform Type: Heterogeneous image sources, such as drones, planes, and helicopters.
7. Obstructions and Reflections: The presence of objects or atmospheric conditions that may cast shadows, cause reflections, or influence image quality.
8. Image Compression: The type and degree of image compression applied, which can potentially introduce artifacts or degrade image quality.
9. Camera Motion Blur: The effect of camera motion or platform vibrations on image sharpness, which can occur during flights in turbulent conditions or at high speeds.
2.1.3. Rigorous Data Curation and Deduplication Process
Throughout the collection process, a meticulous approach was used to optimize the quality of the dataset, ensuring it provides valuable insights for the practical implementation of the models.
The dataset features fewer images of forested areas covered with snow, as these environments are less prone to forest fires.
In assembling the dataset, the authors placed particular emphasis on images that captured human interaction with forests. This approach included incorporating images of forested areas containing buildings, human settlements, roads, bridges, and other structures associated with human activities. This consideration is crucial, as the majority of forest fires are attributed to human actions.
To address potential duplication issues in the dataset, a deduplication method was employed. This process involved comparing the perceptual hashes of the images to eliminate instances of double counting caused by duplicates gathered from multiple data sources. To detect similar images with only slight, non-significant differences, several image comparison algorithms were implemented.
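The text does not specify which perceptual hash was used, so the following sketch assumes a simple average hash: each image is reduced to an 8 × 8 grid of block means, each cell is thresholded against the overall mean, and two images are treated as near-duplicates when their hashes differ in only a few bits. The function names and the distance threshold are hypothetical, and the input is assumed to be an already decoded grayscale pixel matrix.

```python
def average_hash(gray, hash_size=8):
    """Average hash of a 2D list of grayscale values (at least 8 x 8)."""
    rows, cols = len(gray), len(gray[0])
    cells = []
    for i in range(hash_size):
        r0, r1 = i * rows // hash_size, (i + 1) * rows // hash_size
        for j in range(hash_size):
            c0, c1 = j * cols // hash_size, (j + 1) * cols // hash_size
            block = [gray[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            cells.append(sum(block) / len(block))
    threshold = sum(cells) / len(cells)
    return [1 if value > threshold else 0 for value in cells]


def hamming(hash_a, hash_b):
    """Number of differing bits between two equal-length hashes."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def is_near_duplicate(gray_a, gray_b, max_distance=5):
    """True if the two images hash to within max_distance bits."""
    return hamming(average_hash(gray_a), average_hash(gray_b)) <= max_distance


# Synthetic 16 x 16 test images: a gradient, a uniformly brightness-shifted
# copy of it, and its inverse.
original = [[r * 16 + c for c in range(16)] for r in range(16)]
brighter = [[v + 3 for v in row] for row in original]
inverted = [[255 - v for v in row] for row in original]
```

Because the hash depends on each cell’s value relative to the image mean, a uniform brightness shift leaves it unchanged, whereas a structurally different image produces a large Hamming distance.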
2.1.4. Data Sources, Licensing, and Permissions
The images for the wildfire dataset were collected from multiple online sources, including government agencies, Flickr, and Unsplash. To ensure compliance with intellectual property rights and usage permissions, the licensing and permissions for each source were meticulously verified before incorporating the images into the dataset. The sole licensing associated with the images within the dataset is the public domain dedication. This selection ensures legal compliance, precludes issues with incompatible licenses, and allows for the dataset to be freely shared within the research community. In addition, the dataset will be complemented by a
Supplementary File. This file will serve as a reference guide, linking each image to its corresponding download URL. This approach ensures transparency and allows users to trace each image back to its original source, if needed.
2.2. Model Selection: MobileNetV3
Striking a balance between model performance and efficiency is a fundamental consideration in this study, given the significant computational demand and complexity involved in processing large-scale image datasets for forest fire detection. For this reason, MobileNetV3 [23] was chosen as the representative model for the experiments. Thanks to its compact and efficient design as a CNN, the model is well suited to image classification tasks. MobileNetV3 is a variant in the MobileNet series, combining the strengths of its predecessors, MobileNetV1 and V2, while integrating additional enhancements for improved performance. MobileNetV3’s architecture is notably characterized by its use of depthwise separable convolutions, designed to minimize computational costs without significantly compromising model performance. This technique factorizes the standard convolution operation into two separate layers: a depthwise convolution, which filters each input channel independently, and a pointwise (1 × 1) convolution, which combines the filtered channels. This separation substantially reduces the computational load while preserving most of the network’s representational power. Moreover, MobileNetV3 incorporates inverted residual blocks with linear bottlenecks, a technique that bolsters model capacity. These blocks, drawing inspiration from ResNet’s architecture, use skip connections between layers whose input and output share the same dimensions, fostering easy information flow.
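The saving from depthwise separable convolutions can be made concrete by counting multiply-accumulate operations. The sketch below uses the standard cost formulas for a k × k convolution over an h × w feature map, assuming stride 1 and equal input and output spatial size; the layer dimensions in the example are hypothetical and are not taken from MobileNetV3.

```python
def standard_conv_cost(k, c_in, c_out, h, w):
    """Multiply-accumulates of a standard k x k convolution."""
    return k * k * c_in * c_out * h * w


def separable_conv_cost(k, c_in, c_out, h, w):
    """Multiply-accumulates of a depthwise plus pointwise convolution."""
    depthwise = k * k * c_in * h * w   # one k x k filter per input channel
    pointwise = c_in * c_out * h * w   # 1 x 1 convolution mixing channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 64 -> 128 channels, 56 x 56 feature map.
k, c_in, c_out, h, w = 3, 64, 128, 56, 56
ratio = separable_conv_cost(k, c_in, c_out, h, w) / standard_conv_cost(k, c_in, c_out, h, w)
```

The ratio simplifies to 1/c_out + 1/k², so for 3 × 3 kernels the separable form needs roughly one eighth to one ninth of the operations of a standard convolution.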
MobileNetV3, with its focus on efficient computation and enhanced accuracy compared to previous versions, emerges as an ideal model choice [
23]. Its smaller parameter count facilitates more efficient training and deployment of the model, which proves advantageous in the practical implementation of forest fire detection systems. By using MobileNetV3 for their experiments, the researchers aim to show that promising performance can be attained without sacrificing efficiency. This balance is a critical factor in the development of practical and scalable forest fire detection solutions that rely on DL.
2.3. Training the MobileNetV3
The experiments were run using the Keras framework with a TensorFlow backend and GPU support. Training images are normalized to values between 0 and 1 by dividing each pixel by 255 and resized to the default input size of MobileNetV3, which is 224 × 224 pixels. This resizing step ensures that the images are treated at a consistent scale, a key factor in the model’s ability to detect forest fire conditions across the diverse dataset. To optimize the performance of the input pipeline for the training, validation, and test splits, the prefetch method was employed. Prefetching prepares upcoming batches while the current one is being processed, so the processing unit does not sit idle waiting for data during training or evaluation, which shortens training times. The buffer size for prefetching is determined automatically, further improving the efficiency of the data pipeline during both training and evaluation.
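An input pipeline of this kind might look as follows with the `tf.data` API. This is a minimal sketch rather than the authors’ code: the batch size, the helper names, and the in-memory dummy arrays are assumptions, and a real pipeline would read image files of differing sizes from disk instead of slicing a fixed-size array.

```python
import numpy as np
import tensorflow as tf

IMG_SIZE = (224, 224)  # input size stated in the text
BATCH_SIZE = 32        # assumption: the batch size is not given in the text


def preprocess(image, label):
    """Resize to 224 x 224 and scale pixel values to [0, 1]."""
    image = tf.image.resize(image, IMG_SIZE)
    image = tf.cast(image, tf.float32) / 255.0
    return image, label


def make_pipeline(images, labels, training=False):
    """Build a batched, prefetched dataset from in-memory arrays."""
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    if training:
        ds = ds.shuffle(buffer_size=len(images))
    ds = ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    # AUTOTUNE lets tf.data choose the prefetch buffer size at runtime.
    return ds.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)


# Hypothetical stand-in data: four random "images" of equal size.
dummy_images = np.random.randint(0, 256, size=(4, 300, 400, 3), dtype=np.uint8)
dummy_labels = np.array([0, 1, 0, 1])
batch_images, batch_labels = next(iter(make_pipeline(dummy_images, dummy_labels)))
```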
Stochastic optimization methods, including Stochastic Gradient Descent (SGD), Root Mean Square Propagation (RMSProp), and Adaptive Moment Estimation (Adam), are applied with a maximum of 100 epochs. Early stopping is configured with a patience of 5 epochs and a minimum change in the loss of 1e-3. The optimal learning rate is determined by testing four different rates (1e-2, 1e-3, 1e-4, and 1e-5). A global average pooling layer is integrated to reduce the dimensionality of the feature maps output by the convolutional layers, yielding a vector that serves as the input to the fully connected prediction layer. A dropout regularization technique with a dropout rate of 0.2 is employed to enhance the model’s generalization capability.
The model’s performance is evaluated using a held-out validation dataset, and the combination of learning rate and optimizer that yields the highest accuracy is selected for testing on the test dataset.
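The search and stopping rules described above can be sketched in plain Python. The grid values, patience, and minimum change come from the text; monitoring the validation loss, the function names, and the example loss curve are assumptions made for illustration.

```python
from itertools import product

OPTIMIZERS = ("sgd", "rmsprop", "adam")
LEARNING_RATES = (1e-2, 1e-3, 1e-4, 1e-5)
MAX_EPOCHS, PATIENCE, MIN_DELTA = 100, 5, 1e-3

# Every optimizer is paired with every learning rate.
grid = list(product(OPTIMIZERS, LEARNING_RATES))


def stopping_epoch(val_losses, patience=PATIENCE, min_delta=MIN_DELTA):
    """Return the 1-based epoch at which training would stop."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses[:MAX_EPOCHS], start=1):
        if best - loss > min_delta:   # counts as an improvement
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:      # no improvement for `patience` epochs
                return epoch
    return min(len(val_losses), MAX_EPOCHS)


def select_best(results):
    """Pick the (optimizer, learning rate) pair with the highest accuracy."""
    return max(results, key=results.get)


# Hypothetical loss curve: it improves for three epochs and then plateaus.
losses = [0.9, 0.7, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.5]
stop = stopping_epoch(losses)
```

In this example the loss stops improving after epoch 3, so five further epochs without improvement end training at epoch 8, before the later drop to 0.5 is ever seen.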
2.4. Enhanced Detection with Multi-Task Learning Approach
This study introduces a novel approach, which includes the formulation of five distinct subclasses of the wildfire dataset based on the curated list of confounding elements. These classes are intended to ensure a balanced distribution of images depicting fire/smoke events and those not. The classes are as follows:
1. Smoke from fires (subclass 1): This class encompasses images that illustrate smoke emissions from fires, without the apparent presence of flames.
2. Both smoke and fire (subclass 2): This class includes images that exhibit both flames and smoke emissions from fires.
3. Forested areas without confounding elements (subclass 3): Images devoid of any confounding elements, as per the defined list, are categorized under this class. They mainly represent typical forested areas.
4. Fire confounding elements (subclass 4): This class comprises images that contain elements easily misconstrued as fire.
5. Smoke confounding elements (subclass 5): Images that feature elements that may be misinterpreted as smoke fall under this class.
In the proposed approach, an auxiliary task of five-class classification is established alongside the primary task of binary classification (fire/smoke vs. nofire/nosmoke) within a multi-task learning (MTL) framework. A single base model is trained to handle both tasks concurrently. This strategy takes advantage of the features shared between the tasks, enhancing the model’s ability to generalize and improving overall performance.
The efficacy of the hierarchical multi-class classification strategy will be assessed. This assessment involves comparison with a traditional one-step binary classification approach. In the one-step approach, a single model is directly trained to classify images into two categories: those showing a fire event (subclasses 1 and 2) and those not (subclasses 3, 4, and 5). This comparative analysis helps evaluate the potential benefits of implementing multi-class MTL-based classification and the importance of addressing confounding elements within the method.
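The relationship between the two tasks can be made concrete with a small sketch. The mapping from subclasses to binary labels follows the text above. The joint loss is an assumption: the text does not state how the two task losses are combined, so the sketch simply adds a binary cross-entropy term and a weighted five-class cross-entropy term, with `aux_weight` as a hypothetical parameter.

```python
import math

FIRE_SUBCLASSES = {1, 2}       # smoke from fires; both smoke and fire
NOFIRE_SUBCLASSES = {3, 4, 5}  # forest; fire confounders; smoke confounders


def binary_label(subclass):
    """Primary-task label: 1 for a fire event, 0 otherwise."""
    if subclass in FIRE_SUBCLASSES:
        return 1
    if subclass in NOFIRE_SUBCLASSES:
        return 0
    raise ValueError(f"unknown subclass: {subclass}")


def joint_loss(p_fire, p_subclasses, subclass, aux_weight=1.0):
    """Sum of the primary (binary) and auxiliary (five-class) losses.

    p_fire: predicted probability of the fire class (binary head).
    p_subclasses: five predicted probabilities (auxiliary head).
    aux_weight: hypothetical weight of the auxiliary task.
    """
    eps = 1e-12  # guards against log(0)
    y = binary_label(subclass)
    binary_ce = -(y * math.log(p_fire + eps)
                  + (1 - y) * math.log(1 - p_fire + eps))
    multiclass_ce = -math.log(p_subclasses[subclass - 1] + eps)
    return binary_ce + aux_weight * multiclass_ce
```

In the one-step baseline only the binary term would be used, which is what makes the comparison between the two methods a test of the auxiliary task's contribution.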
Another significant aspect of the analysis is identifying common visual elements in the images that could resemble fire or smoke, such as sun glare, clouds, fog, or specific vegetation types. Training the model to recognize these elements could help reduce false positives, as the model learns to differentiate between actual fire or smoke and visually similar elements [22].
Finally, feature visualization techniques, specifically Gradient-weighted Class Activation Mapping (Grad-CAM) [24], are proposed. This approach helps understand which parts of the input images contribute the most to the model’s predictions. Such understanding can assist in identifying the visual features shared by confounding elements and actual fires or smoke.
2.5. Addressing Confounding Elements
In assessing any alarm system, it is crucial to consider the presence of potential confounding elements that could lead to false alarms. The visual characteristics of certain elements may resemble those of smoke and forest fires, posing challenges in accurately distinguishing between images containing fire and those without. It is essential to understand and address these confounding factors in order to develop more accurate and reliable forest fire detection systems, minimizing the occurrence of false alarms and enhancing overall performance.
The process of addressing confounding elements starts with an initial list of challenging factors compiled from a comprehensive literature review. This review focuses on studies that have emphasized the connection between confounding elements and high false alarm rates in DL models trained for forest fire detection on RGB images. In addition to the literature review, the initial wildfire dataset is analyzed to identify potentially relevant factors that may not have been addressed in existing research. The resulting list is then used to form the confounding-element subclasses of nofire images included in the five-class classification problem for the remaining steps of the experimental setup.
The considered confounding elements are divided into specific subcategories, each presenting its unique set of challenges to DL models. The compiled list with descriptions is detailed below:
1. Atmospheric Phenomena: (a) Fog or mist: These can produce illusions of smoke due to their translucent and diffused appearances, leading to potential misclassifications [25]. (b) Low-altitude clouds: Their visual similarities to smoke plumes, particularly the gray or white clouds, pose challenges for models in distinguishing between them and smoke [7,19]. (c) Sunset: The angle and intensity of sunlight during sunset can produce shadows and bright spots, complicating the differentiation between fire and nofire elements.
2. Vegetation and Seasonal Changes: (a) Reddish/orange foliage: Some tree species display red and orange hues during specific seasons, which can be misconstrued as fire or embers in aerial images.
3. Lighting and Reflections: (a) Sunlight reflection on trees and water: Bright spots that mimic fire or smoke features can be produced when sunlight reflects off wet surfaces or water bodies [7]. (b) Shadow and lighting variations: Shadows that can be mistaken for smoke or fire may be created by changes in lighting conditions, such as those induced by clouds, time of day, or topography [9].
4. Camera-Related Artifacts: (a) Camera motion blur: Motion blur resulting from camera movement or platform vibrations can introduce visual artifacts resembling smoke or fire.
5. Visually Similar Objects [26] and Phenomena [14]: This category encompasses any other objects or phenomena that visually resemble fire or smoke, presenting additional challenges for accurate classification.
As highlighted in Section 2.4, the study includes a multi-class classification problem comprising five distinct classes. Two of these classes are specifically dedicated to confounding elements: one class embodies elements that mimic fire, while the other encompasses elements that imitate smoke. Following an analysis of the model’s misclassifications, a deliberate emphasis will be placed on the elements from the initial list that are most frequently misclassified. The goal of incorporating these classes dedicated to confounding elements is to augment the model’s ability to distinguish between genuine fire and smoke characteristics and those that merely bear similarities. This, in turn, should enhance the precision and dependability of the forest fire detection system.
2.6. The Data Balancing Problem
In machine learning and data-driven modeling, dealing with an imbalanced dataset is a challenging problem. Balancing a dataset can mitigate biases and enhance model performance, particularly where class distributions are inherently unequal, as with the fire and nofire classes in the present wildfire dataset [27]. With 1047 instances in the fire class against 1653 in the nofire class, this disparity is not merely a statistical artifact of the collection process; it reflects the actual occurrence bias existing in nature. The decision to employ data balancing techniques must therefore be handled with care. This includes maintaining the essential characteristics of each class without over-representation or artificial inflation that could distort the model’s real-world applicability [28]. Whether and how to balance depends on the data’s structure, the model’s purpose, and the underlying real-world dynamics of each forest. Since the main goal of the rest of the study is to assess the impact of novel strategies, such as the consideration of confounding element classes, the authors judged that failing to balance the classes might bias the model towards the majority class and limit the robustness of the experimental results. The remainder of this section details the approach adopted, which balances the training classes while preserving, to a certain extent, an authentic reflection of real-world scenarios.
To reduce the bias in the training process, the fire class in the training set was augmented to match the nofire class in terms of representation. Augmentation was proportionally distributed among the subclasses of the fire class, ensuring that each source image was utilized only once. This method minimized the risk of the model internalizing noise or peculiarities from augmented samples, allowing it to focus on the underlying patterns. As a result, 268 images were incorporated into the Smoke from fires subclass (subclass 1), and 158 into the Both smoke and fire subclass (subclass 2). Specific augmentation techniques were applied, including random rotations within a range of 40 degrees, width and height shifts of 20%, a shear range of 20%, zooming within a range of 20%, horizontal flipping, and using the ‘nearest’ method for filling in newly created pixels [29].
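The allocation logic described above, proportional distribution across subclasses with each source image used at most once, can be sketched as follows. The function names are hypothetical, the largest-remainder rounding rule is an assumption (the text does not say how fractional shares were rounded), and the subclass sizes in the usage lines are illustrative values, not the counts from the dataset.

```python
import random


def allocate_augmentations(subclass_counts, n_new):
    """Split `n_new` augmented images across subclasses in proportion to
    their sizes, using largest-remainder rounding (an assumption)."""
    total = sum(subclass_counts.values())
    quotas = {k: n_new * c / total for k, c in subclass_counts.items()}
    allocation = {k: int(q) for k, q in quotas.items()}
    leftover = n_new - sum(allocation.values())
    # Hand the remaining images to the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - allocation[k],
                          reverse=True)
    for k in by_remainder[:leftover]:
        allocation[k] += 1
    return allocation


def pick_sources(images, k, seed=0):
    """Sample k distinct source images so that none is augmented twice."""
    return random.Random(seed).sample(images, k)


# Illustrative subclass sizes only (not taken from the paper).
example = allocate_augmentations({"smoke": 600, "both": 400}, n_new=101)
sources = pick_sources([f"img_{i}.jpg" for i in range(600)], example["smoke"])
```

Sampling without replacement is what guarantees the one-use-per-source-image property; it also means a subclass can never receive more augmented images than it has originals.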
The process of balancing the classes within the training set was carefully designed to reduce the model’s tendency to favor the majority class, potentially improving the ability to identify the minority class. By maintaining an even distribution, the training set aided the model in avoiding an overfit to specific categories, thus enhancing its ability to generalize.
In contrast, the authors decided to retain the natural distribution within the validation and test sets. The considerations guiding this decision were avoiding overfitting, keeping the evaluation data representative of the natural class distribution, and preventing data leakage (refer to Table 3 for more details on the datasets included).
The empirical comparison of the original and augmented training sets offered valuable insights into the influence of data balancing on model performance. These experiments underscored the importance of a nuanced approach to class balancing, reaffirming the methodology’s alignment with best practices and its potential to support nuanced predictions in fire classification. As the primary objective of this empirical comparison is to evaluate the effect of data balancing on model performance, the same hyperparameters were maintained during the training of the models. This approach was taken with the intention of creating a more controlled comparison, where the only differing variable was the data itself.
Beyond this assessment, all other experiments within the study utilized the augmented dataset, aligning with the broader methodology.
The augmentation process was designed to equalize the number of instances between the fire and nofire training set classes. By adding 268 images to the Smoke from fires subclass and 158 to the Both smoke and fire subclass through techniques such as random rotations, zooming, and flipping, both classes in the training set were brought to an equal count of 1157 instances each.
In evaluating the impact of data balancing on model performance, it was observed that the differences in key performance metrics between models trained on the original and balanced datasets were not obviously substantial. This raised questions about the stability of the observed differences, as minor fluctuations might result from random variation or noise inherent in the training process. Given these relatively narrow margins, a more robust analysis was recognized as necessary. To this end, the model for each method was trained five times on each of the two datasets. The objective was to assess the stability and reliability of the results rather than to conduct formal statistical significance testing. Confidence intervals for the differences in performance metrics between the two methods were calculated using bootstrapping, a resampling technique suited to small sample sizes (here, 5 runs). These intervals give a range within which the true differences in the models’ performances are likely to lie, indicating how stable the differences are and contributing to a more comprehensive understanding of how balancing through the augmentation of positive instances (fire images) affects the models’ behavior.
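A bootstrap interval for the difference in a metric between two sets of runs can be sketched as below. This assumes a percentile bootstrap on the difference of means, with each group resampled independently; the text does not specify the bootstrap variant or the number of resamples, so `n_boot`, `alpha`, and `seed` are illustrative choices.

```python
import random


def bootstrap_ci_diff(runs_a, runs_b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(runs_a) - mean(runs_b).

    runs_a, runs_b: metric values from repeated training runs
    (five per configuration in the setup described above).
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its size.
        sample_a = rng.choices(runs_a, k=len(runs_a))
        sample_b = rng.choices(runs_b, k=len(runs_b))
        diffs.append(sum(sample_a) / len(sample_a)
                     - sum(sample_b) / len(sample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

An interval that excludes zero suggests the difference is stable across runs. With only five runs per group the interval should be read as indicative rather than as a formal test.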
Further details of these experiments are provided in the results sections.
2.7. Weighting of Confounding Elements Subclasses in Model Training
In the multifaceted task of fire detection, the influence of confounding elements is a critical consideration. These factors, which differ in complexity and ambiguity, may have varied impacts on the model’s performance. A specific class might be more prone to confusion with an actual fire or smoke event, thus requiring particular attention during the training phase. Additionally, the disparate real-world effects of these elements further justify differential weighting in their consideration. Imbalances within the dataset could also be tackled by varying weights to cultivate a more balanced learning environment.
To investigate these aspects, a systematic approach is deployed in the training process, where the weights of two subclasses, namely Fire confounding elements (subclass 4) and Smoke confounding elements (subclass 5), are manipulated. Starting with equal weights for both subclasses, the weights are methodically adjusted from 1 to 3, and specific combinations are tested to gauge the model’s performance under different confounding circumstances. This strategy also sheds light on how the model’s sensitivity to these elements can shape its overall efficacy.
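One way to picture this weighting scheme is as a per-class multiplier on the auxiliary classification loss. The sketch below enumerates integer weight pairs from 1 to 3 for subclasses 4 and 5 and applies them to a cross-entropy term. Both the integer grid and the use of loss weighting are assumptions: the text states only that weights between 1 and 3 were adjusted and that specific combinations were tested.

```python
import math
from itertools import product


def weight_grid(low=1, high=3):
    """All weight pairs for subclass 4 (fire confounders) and
    subclass 5 (smoke confounders); other subclasses keep weight 1."""
    return [{4: w_fire, 5: w_smoke}
            for w_fire, w_smoke in product(range(low, high + 1), repeat=2)]


def weighted_cross_entropy(probs, subclass, class_weights):
    """Cross-entropy for one sample, scaled by the weight of its true
    subclass (1.0 if the subclass has no explicit weight)."""
    weight = class_weights.get(subclass, 1.0)
    return -weight * math.log(probs[subclass - 1])
```

Raising the weight of a subclass makes errors on its images cost more during training, which pushes the model to separate those confounders from genuine fire or smoke.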
After determining the optimal weights, a detailed examination is conducted to comprehend why this specific weighting is effective. This includes probing how these optimal weights sway the entire model performance and identifying the underlying mechanisms that render them effective. By doing so, the study seeks to offer nuanced insights into the complex interplay of confounding factors in the realm of fire detection.
2.8. Transfer Learning
Different transfer learning [8] scenarios are evaluated, including training from scratch, fine-tuning, and feature extraction, to identify the most effective strategy for enhancing the model’s classification.
2.9. Performance Metrics
The performance of the model is gauged using the four key elements of the confusion matrix: True Positives (TP), True Negatives (TN), False Negatives (FN), and False Positives (FP). TP and TN represent the accurate predictions of fire and nofire images, respectively, while FN and FP denote the instances where fire images and nofire images, respectively, are incorrectly identified.
Accuracy is a measure of how often the model predicts correctly and is given by the ratio of correct predictions to total predictions: Accuracy = (TP + TN) / (TP + TN + FP + FN).
Precision quantifies the model’s reliability when making positive predictions, defined as the ratio of correctly identified fire instances to all instances that the model labels as fire: Precision = TP / (TP + FP).
Recall (or sensitivity) expresses the proportion of actual fire images that are correctly identified by the model out of all actual fire images: Recall = TP / (TP + FN).
The F1-score is the harmonic mean of precision and recall, F1 = 2 × Precision × Recall / (Precision + Recall), and thus reflects both in a single metric, allowing for an overall evaluation of a model’s predictive performance. This is in contrast to accuracy, which measures the overall rate of correct predictions, encompassing both fire and nofire predictions.
The ROC-AUC (Receiver Operating Characteristic—Area Under Curve) score is a comprehensive metric that evaluates a model’s ability to distinguish between classes. Unlike individual metrics such as accuracy, precision, or recall, ROC-AUC considers both the true positive rate (sensitivity) and the false positive rate (1-specificity) across different thresholds. It plots a curve (ROC curve) that represents these rates across all thresholds, and the AUC value calculates the area under this curve. A perfect classifier would have an ROC-AUC score of 1, while a completely random classifier would score 0.5. The ROC-AUC score provides insights into the model’s discriminatory power, regardless of the specific threshold, making it a valuable metric for assessing a model’s overall classification effectiveness.
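The metrics above can be computed directly from the confusion-matrix counts and from predicted scores. The sketch below is a plain-Python illustration with hypothetical function names. The ROC-AUC is computed through its rank interpretation, that is, the probability that a randomly chosen fire image receives a higher score than a randomly chosen nofire image (ties count as one half), which equals the area under the ROC curve.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (fire, nofire) pairs ranked correctly.

    labels: 1 for fire, 0 for nofire; scores: predicted fire probabilities.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

The pairwise formulation is quadratic in the number of images, which is acceptable for a test set of this size; a sort-based implementation would be preferable for much larger sets.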
4. Discussion
Upon an in-depth analysis of deep learning forest fire models using RGB images, the existing body of literature reveals a noticeable gap—a comprehensive scrutiny of data representativeness and challenging factors that can induce high false alarm rates. While several studies do acknowledge the correlation between these confounding elements and false alarms, their treatment of the subject often lacks the breadth and depth necessary for a complete understanding.
This research endeavor addresses this specific gap in several ways. First, it crafts a more exhaustive enumeration of these confounding elements. Second, and more importantly, it introduces the wildfire dataset, a significant contribution that effectively considers these confounding elements. This dataset offers researchers a valuable tool for data collection, annotation, and model performance evaluation, thereby aiding in a more nuanced understanding and effective management of the factors that lead to false alarms in forest fire detection.
The evaluation of the newly introduced wildfire dataset, conducted with a model trained on a combined set of previously published datasets totaling 36,775 images, yields critical insights. Despite the size and range of the combined dataset, the model’s accuracy on the wildfire dataset reached only 0.7936, which is moderate given the considerable size of the training set. A subjective error analysis further indicates that the wildfire dataset introduces a unique set of challenges and confounding elements that may not have been sufficiently represented in the previously used datasets. Moreover, forest fire detection is fundamentally a binary classification problem, distinguishing between fire and nofire events, and its ostensibly straightforward nature might suggest that high accuracy should be easy to achieve. The moderate accuracy obtained nonetheless underscores the complexity of the task and the importance of carefully considering the confounding elements specific to forest fire detection and, more generally, data representativeness.
This performance, while respectable in some contexts, bears significant implications when translated into real-world, practical applications of forest fire detection. Given the high stakes associated with accurate and timely forest fire detection, where errors could result in substantial environmental damage and potential loss of life, an error rate of roughly 20% is of critical concern. This underscores the complexity of real-world forest fire detection tasks, which must contend with an array of confounding elements not adequately represented in existing datasets. The results also emphasize the importance of our endeavor in creating the wildfire dataset, which introduces new challenging elements, making it a valuable tool in the development of more robust and reliable DL models for forest fire detection. The results, therefore, support the significance of our study’s contribution in this field.
Further, the present work delves into the intricate issue of data balancing, especially pertinent in the context of wildfire detection, where imbalances between fire and nofire classes are intrinsic. An augmented dataset was meticulously crafted by employing a nuanced method to equalize the fire training class with the nofire class, leaving the validation and test datasets untouched. The empirical comparison between the original and augmented training sets unveiled subtle yet essential insights into the influence of data balancing on model performance. While the original dataset resulted in higher accuracy, precision, recall, F1-score, and ROC-AUC, this study’s comprehensive exploration reveals that these metrics, in isolation, may not sufficiently capture the model’s effectiveness, particularly in the complex context of imbalanced data. The greater accuracy in the original dataset could be misleading, as a model may overly favor the majority class. A judicious examination of this pattern has led to a broader understanding of the delicate interplay between statistical representation, real-world occurrence, and the model’s intended role. By adopting a thoughtful and strategic approach to data balancing, this study accentuates the necessity of a measured approach that transcends mere numerical evaluation, contemplating the underlying data dynamics, the practical complexity of wildfire detection, and the overarching goal of crafting models attuned to sophisticated, real-world applications.
Another significant facet of this study is the introduction of an innovative multi-task learning approach for forest fire detection. This approach introduces an auxiliary task of five-class classification trained simultaneously with the primary binary classification task (fire vs. nofire). A single base model is employed to manage both tasks concurrently, exploiting the shared features between tasks to enhance the model’s ability to generalize and improve its overall performance. The study undertakes a comparative assessment between this hierarchical multi-class classification strategy (Method 2) and a traditional one-step binary classification approach (Method 1). In the one-step approach, the same model is trained directly to categorize images into two general classes: images showcasing a fire event (classes 1 and 2) and those that do not (classes 3, 4, and 5). The intention behind this analysis is to shed light on the potential benefits of the multi-class, MTL-based classification strategy and underline the potential of addressing confounding elements within the method as independent subclasses (classes 4 and 5).
When evaluated on the augmented wildfire dataset, the two-step multi-class classification approach (Method 2) clearly outperformed the one-step classification approach (Method 1). The results consistently favor Method 2 across all key metrics, including accuracy, precision, recall, F1-score, and ROC-AUC. Method 2 achieves a mean accuracy of 0.8766 and an F1-score of 0.8526, reflecting a refined ability to distinguish different visual aspects of fire events, such as smoke from flames, and it excels in ranking predictions. The increases in these metrics and the corresponding 95% confidence intervals for these increases ([0.05270, 0.06926] for accuracy, [0.05814, 0.09942] for precision, [0.02390, 0.07301] for recall, [0.05810, 0.07556] for F1-score, and [0.05094, 0.06582] for ROC-AUC) underline the robustness and nuanced modeling of wildfire events through Method 2. This comparison emphasizes the efficacy of a novel methodological design, with the integration of confounding elements within the multi-task learning framework, in attaining significant improvements in wildfire detection performance.
A subjective error analysis using feature visualization techniques, such as Grad-CAM, further supported the superiority of Method 2, pointing to its advantage not only in terms of standard performance metrics but also in its ability to more accurately discern the subtleties of the images within the dataset.
Moving forward, the uniquely structured wildfire dataset introduced in this study presents promising opportunities for further research and practical applications in wildfire detection. Its five distinct subclasses and the careful consideration of environmental and caption-related variability provide a multifaceted platform that may contribute to the refinement of detection algorithms and the advancement of image processing techniques. The inclusion of diverse categories such as smoke from fires, both smoke and fire, forested areas, and confounding elements adds layers of complexity that can facilitate more targeted and nuanced analyses.
The specific subclass categorization may provide researchers with a platform to investigate specialized aspects of wildfire detection, such as distinguishing smoke from fire, recognizing elements that can be misidentified as fire or smoke, or understanding the interplay between flames and smoke. This can lead to more refined algorithm development, though further validation and experimentation would be necessary to confirm these possibilities.
Furthermore, the dataset’s attention to environmental variability, including aspects such as topography, forest types, weather conditions, and fire characteristics, opens up opportunities for modeling different wildfire scenarios. This could aid in creating more adaptable detection models, enhancing their relevance to diverse real-world contexts, without asserting that it can completely overcome all challenges.
In terms of image processing, the caption-related variability captured in the dataset presents a valuable resource for researchers. Factors such as lighting conditions, image resolution, altitude, and angle offer a range of challenges that may contribute to the development of techniques for improving image quality and robustness under various conditions.
Additionally, the dataset may serve as a potential benchmark for evaluating and comparing different wildfire detection models. Its rich structure and variety could provide a basis for assessing algorithm performance, though it would need to be utilized within a controlled experimental setup to ensure fair comparisons.
The wildfire dataset’s distinct structure, especially the inclusion of subclasses representing various confounding elements, also lends itself to advanced feature visualization techniques, such as Grad-CAM. With these techniques, researchers may gain a more detailed understanding of how specific regions of the images influence the model’s predictions. This has the potential to uncover hidden patterns and dynamics that models trained on conventional datasets might overlook. The subclasses can also enable a more granular examination of feature maps within visually similar images that have contrasting classification outcomes, making it feasible to identify connections between particular visual elements and instances of confusion in the predictions. Though promising, the realization of this potential may require careful handling, as the complexity of the confounding elements may present unforeseen challenges.
An ablation study may be another avenue of exploration, wherein the removal of specific features or sections of the model can help assess their impact on performance. The systematic analysis might unveil areas causing confusion and provide data-driven insights to guide further data collection or architecture adjustments. This can be a vital step towards more accurate forest fire detection, though it must be undertaken with due consideration to the multifaceted nature of the data and models.
The structure of the wildfire dataset provides a scaffold for innovation, provided that implementation is thoughtful and methodical and recognizes that the intricate dynamics of the data offer both opportunities and challenges. One promising avenue is novel methodological design, such as the integration of confounding elements within the multi-task learning framework. This approach (Method 2) showed significant improvements by leveraging the diverse characteristics found within the dataset, utilizing the nuances in image categorization to create more responsive models. Likewise, varying the weights of these subclasses revealed potential avenues of improvement, as reported in Section 3.5.
In summary, the dataset’s careful attention to detail and classification offers fertile grounds for ongoing innovation in wildfire detection. Whether through advanced visualization, novel methodological designs, or nuanced weighting strategies, the possibilities are vast and compelling. However, each step forward also demands a measured understanding of the underlying dynamics and a willingness to adapt and refine methods as new insights emerge. The ultimate goal remains clear: to foster models capable of sophisticated, real-world applications, making strides towards a more effective and comprehensive approach to wildfire detection and management.
Study Limitations
One notable limitation of the present study is the absence of external validation using a completely independent dataset, not involved in any phase of the model development process. While the design and execution of the experiments were meticulous, they were conducted exclusively on the curated dataset. This focus could potentially constrain the generalizability of the findings, limiting their applicability to other regions or different wildfire scenarios. External validation with diverse and unrelated data would have provided a more rigorous test of the model’s robustness and its ability to adapt to variations beyond the characteristics captured in the training and validation sets. Future research efforts that include such external validation can further substantiate the approach’s efficacy and contribute to a more comprehensive understanding of its performance in real-world wildfire detection and management.
Another limitation of the study is the lack of consideration for the essential factors of time and computational resources in the methodology. While these aspects were considered in the choice of the model, they were not explicitly integrated into the evaluation of the dataset’s structure or the effects of various methodological approaches. The time required for training, validating, and testing the models, as well as the computational resources necessary for these processes, plays a vital role in the practical applicability of the findings. Ignoring these aspects may lead to an incomplete understanding of the model’s efficiency and feasibility in real-world scenarios.
These limitations do not diminish the value of the study but rather highlight areas for further refinement and exploration. Future research should aim to address these issues by developing more objective and standardized classification criteria for the subclasses, as well as by incorporating a more comprehensive assessment of time and computational demands in the methodology. By acknowledging and addressing these limitations, subsequent studies can build upon the current findings, contributing to a more robust and nuanced understanding of wildfire detection and the effective utilization of the novel dataset.