Article

LD-YOLO: A Lightweight Dynamic Forest Fire and Smoke Detection Model with Dysample and Spatial Context Awareness Module

School of Science, Zhejiang University of Science and Technology, Hangzhou 310023, China
* Author to whom correspondence should be addressed.
Forests 2024, 15(9), 1630; https://doi.org/10.3390/f15091630
Submission received: 30 June 2024 / Revised: 25 August 2024 / Accepted: 11 September 2024 / Published: 15 September 2024
(This article belongs to the Special Issue Artificial Intelligence and Machine Learning Applications in Forestry)

Abstract

Forest fires pose a serious threat to human life and property and cause significant damage to human society. Early signs, such as small fires and smoke, are often difficult to detect. As a consequence, early detection of smoke and fires is crucial. Traditional forest fire detection models have shortcomings, including low detection accuracy and efficiency. The YOLOv8 model exhibits robust capabilities in detecting forest fires and smoke; however, it struggles to balance accuracy, model complexity, and detection speed. This paper proposes LD-YOLO, a lightweight dynamic model based on YOLOv8, to detect forest fires and smoke. Firstly, GhostConv is introduced to generate more smoke feature maps in forest fires through low-cost linear transformations, while maintaining high accuracy and reducing model parameters. Secondly, we propose C2f-Ghost-DynamicConv as an effective tool for improving the extraction and representation of smoke features in forest fires. This method aims to optimize the use of computing resources. Thirdly, we introduce DySample to address the loss of fine-grained detail in initial forest fire images. A point-based sampling method is utilized to enhance the resolution of small-target fire images without imposing an additional computational burden. Fourthly, the Spatial Context Awareness Module (SCAM) is introduced to address insufficient feature representation and background interference. Also, a lightweight self-attention detection head (SADH) is designed to capture global forest fire and smoke features. Lastly, Shape-IoU, which emphasizes the importance of the boundaries' shape and scale, is used to improve smoke detection in forest fires. The experimental results show that LD-YOLO achieves an mAP0.5 of 86.3% on a custom forest fire dataset, which is 4.2% better than the original model, with 36.79% fewer parameters, 48.24% lower FLOPs, and 15.99% higher FPS. Therefore, LD-YOLO detects forest fires and smoke with high accuracy, fast detection speed, and low model complexity, which is crucial to the timely detection of forest fires.

1. Introduction

Forests, which encompass diverse ecosystems [1,2], are essential to soil and water conservation, climate regulation, and carbon cycling facilitation, among other ecosystem functions [3]. As a significant natural phenomenon, forest fires contribute to the selection of specific species and environmental features while driving ecological succession and shaping ecosystems [4]. However, the frequent occurrence of forest fires surpasses the carrying capacity of natural ecosystems, leading to irreparable damage to forest ecology, landscapes, and biodiversity [5]. The frequency of forest fires has risen sharply in recent decades, resulting in the destruction of large tracts of forest each year [6]. These occurrences have global ecological and socioeconomic impacts, disrupting ecosystems, infrastructure, and human communities and posing a substantial risk to human life and property [7]. For example, the Amazon, the largest tropical forest on the planet, plays a vital role in climate regulation, carbon sequestration, and biodiversity conservation. However, during the dry season in 2019, an estimated 1480 forest fires raged across approximately 32,000 km² of the Amazon and Chiquitania forests [8]. Bowman et al. [9] investigate the global distribution of extreme wildfire events, which mainly occur in the suburbs of flammable forest biomes such as the western United States and southeastern Australia. Predictions indicate that by the mid-21st century, the number of days favorable for such events to occur will increase by 20%–50%, most notably in the southern subtropics and the Mediterranean basin of Europe. Based on data from the 14 states most severely affected by wildfires, Baijnath-Rodino et al. [10] develop a framework for calculating the Livelihood Vulnerability Index (LVI) to assess wildfires' impact on humans and their natural and social environments. Furthermore, fire hazards are significant in wildland–urban interfaces (WUIs) where flammable vegetation remains intact. Haight et al. [11] examine regions within the WUI prone to severe wildfires to implement preventive measures. They design a probabilistic method using graph theory to quantify community vulnerability to wildfires. Similarly, Mahmoud et al. [12] validate a probabilistic approach by creating a directed graph, based on graph theory, that simulates wildfire spread in WUIs.
Given the significant threat posed by forest fires, various monitoring tools have been developed. Traditional monitoring methods include manual patrols and watchtower observations. However, manual patrols are greatly affected by weather, communication issues, and other human factors, resulting in low monitoring efficiency, an inability to monitor in real-time, and a considerable workload, making timely fire detection challenging. Watchtower monitoring [13] also has limitations, including range, view angle, dead angles, and cost. These factors contribute to the inherent uncertainty in forest fire monitoring. Although satellite remote sensing holds promise as a monitoring tool, its effectiveness is limited by the availability of satellites and the spatial resolution of data. Consequently, it cannot provide comprehensive coverage of all forest fire events [14]. Nevertheless, it remains an invaluable real-time monitoring tool. Additionally, efforts are made to leverage sensor technology and infrared technology for enhanced monitoring capabilities [15]. Traditional smoke sensors for forest fires primarily rely on monitoring smoke and temperature sensitivity or a combination of both [16]. However, this approach has limitations, as smoke rapidly propagates and causes significant damage before reaching the predefined temperature threshold that triggers an alarm. Despite the challenges associated with employing sensors and infrared technologies for forest fire monitoring, these technologies still present viable options for surveillance. Moreover, several forest fire prediction models exist. Mölders [17] demonstrates that the WRF model is highly suitable for fire-weather prediction in boreal forest environments across all forecast periods and in ensemble averages. In a study by Kumar et al. [18], the WRF model is employed for high-resolution (1 km) simulations. They utilize suitable PBL parameterization schemes, including MYNN, to enhance the accuracy of predictions related to fire meteorology.
The swift advancement of computing technology creates new pathways for detecting forest fires. Some researchers employ manual feature extraction for fire detection. For instance, Celik et al. [19] develop a color model that effectively addresses the fire detection problem under varying brightness conditions by separating luminance and chroma in the YCbCr color space, and formulate a set of rules for detecting fire pixels in color images. Abidha et al. [20] introduce a forest fire detection algorithm utilizing a Bayesian classifier, which gains widespread recognition for its accuracy in detecting forest fires. Gubbi et al. [21] combine wavelet transform and support vector machine (SVM) methods to detect smoke from forest fires. Although these detection techniques can meet the demands of real-time surveillance, they depend on manually extracted features and therefore require significant human, material, and financial resources. Moreover, the limitations of manual feature extraction make accurate detection difficult in complex or small-fire scenarios and render the model susceptible to environmental interference, thereby limiting its applicability to various scenarios.
The advancement of deep learning technology results in its widespread adoption for forest fire detection. Deep learning, as an algorithm, autonomously learns data structures through multi-level neural networks, enabling automatic identification and feature learning. Compared to traditional smoke and fire detection methods, deep learning-based algorithms excel at extracting more abstract, higher-order features, demonstrating enhanced robustness in complex environments, faster speeds, and higher accuracy. Frizzi et al. [22] create a nine-layer convolutional neural network that classifies video images into categories such as smoke, fire, and normal, achieving an accuracy of 97.9%. However, this model has limitations in detecting fires and smoke of other colors. Yuan et al. [23] propose a deep multi-scale neural network that uses convolutional kernels of various scales to extract features. This approach effectively addresses challenges posed by variations in illumination and scale, thereby enhancing detection accuracy. Nevertheless, the use of multiple convolutional blocks increases model complexity, making deployment challenging. Sathishkumar et al. [24] utilize the learning without forgetting (LwF) technique to tackle the problem of pretrained CNNs underperforming due to insufficient forest fire data. Furthermore, Ryu et al. [25] employ the HSV channel along with a Harris corner detector to preprocess the fire images, while extracting features using InceptionV3, resulting in reduced false positive and false negative rates despite time-consuming preprocessing.
The potential utility of convolutional neural networks in detecting forest fires and smoke highlights their efficacy in this domain, making deep learning techniques a common approach for forest fire detection. In deep learning, smoke and fire detection algorithms are mainly divided into two types: one-stage algorithms that rely on target regression and two-stage algorithms that rely on region extraction. The two-stage algorithms, like R-CNN [26], Fast R-CNN [27], Faster R-CNN [28], and Mask R-CNN [29], generate candidate regions by employing a region proposal network. These candidate regions are then classified by a convolutional neural network. These methods typically exhibit high accuracy, but require substantial computational resources and processing time. Zhang et al. [30] improve the feature extraction process using Faster R-CNN, which enhances the network’s adaptability to the multi-scale characteristics of fires, thereby generating more accurate fire feature maps. Although feature fusion using a Feature Pyramid Network (FPN) can improve efficiency, it may lead to information attenuation and aliasing effects in cross-scale fusion. In contrast, one-stage algorithms like YOLO [31] and SSD [32] provide the benefit of speed, as they accomplish object detection in a single step. However, the simplified nature of these algorithms usually makes them less accurate than two-stage methods. Jindal [33] employs YOLOv3 and YOLOv4 for detecting forest fire smoke, noting that YOLOv3 demonstrates better performance in precision, recall, F1 scores, and mAP compared to YOLOv4. Qian [34] proposes the OBDS model, which employs CNN and Transformer methods to derive global attributes from images depicting forest fire smoke. An enhanced YOLOv8 model for drones improves its ability to detect forest fire smoke objects by introducing the BiFormer attention mechanism [35]. Yunusov et al. [36] combine the YOLOv8 pretrained model and TranSDet model to quickly and accurately detect the occurrence of forest fires. Yun et al. [37] propose a Channel Prior Expansion Attention module (CPDA) based on YOLOv8 to address the limitations of traditional manual feature extraction methods in complex scenarios. However, these strategies involve more memory swapping and slightly reduce the detection speed.
The precise and swift detection of forest fires and smoke has emerged as a key research focus, especially during the initial phases of their development. The targets, being quite small, often result in common problems of false positives and missed detections, particularly in dense and intricate forest settings. However, forest fire and smoke detection equipment often have limited computing power, which is insufficient to support extensive calculations. To study forest fire and smoke detection models effectively, it is important to balance performance with efficiency, so that the model is sufficiently lightweight.
To address these challenges, a lightweight dynamic forest fire detection model (LD-YOLO) is proposed, utilizing the YOLOv8. Building on previous research, this paper introduces several improvements as follows:
  • GhostConv improves the efficiency of smoke detection in forest fires by creating extra feature maps for fire and smoke through economical linear transformations, leading to a reduction in the number of parameters within the detection model. Furthermore, C2f-Ghost-DynamicConv is introduced, which adaptively fuses multiple convolutional kernels based on the input. This approach improves feature extraction and representation for forest fires and smoke of varying sizes and shapes. It also optimizes the use of computational resources for forest fire smoke detection and enhances performance in low-FLOP regimes.
  • To address the time-consuming and labor-intensive processes involved in dynamic convolution for forest fire detection, as well as the creation of additional sub-networks for dynamic kernels, DySample is introduced. This method is a super-lightweight and highly efficient dynamic upsampling technique. By using point sampling for upsampling, DySample circumvents the necessity of dynamic convolution, which greatly lowers the parameter count in detection models. This approach minimizes GPU memory consumption and lowers latency, leading to quicker detection of smoke and forest fires.
  • Introducing the Spatial Context Aware Module (SCAM) aims to better detect small fire and smoke targets early and minimize background interference in fire and smoke detection. This module improves the model’s capability to globally associate information across both channels and spatial dimensions, effectively addressing the challenge of detecting small-target forest fires and smoke.
  • To capture comprehensive information about forest fire and smoke features, a detection head utilizing self-attention (SADH) is introduced. Introducing Shape-IoU enhances regression accuracy, improving both the detection accuracy and convergence speed of the detection model.
The remainder of this paper is structured as follows: Section 2 offers a comprehensive description of the dataset and the enhanced methodology. Section 3 details the experimental results, covering the setup, ablation studies, and comparative analysis. The limitations of the paper and future research directions are discussed in Section 4. Finally, the conclusion of the paper is presented in Section 5.

2. Materials and Methods

2.1. Dataset

In the experiments, the dataset comprised manually labeled images from D-Fires [38] and forest fire images sourced from the Internet, annotated using an annotation tool. The D-Fires dataset has been meticulously compiled to support machine learning and target detection algorithms, with a particular emphasis on identifying fire and smoke. To maintain the dataset’s quality, images with a resolution below 384 × 384 and those depicting a single scene were excluded, resulting in a total of 3603 organized images.
This dataset includes various forest fire scenarios. Included in the dataset are 3468 images showing forest fires and smoke, and 135 background images of clouds are used as negative samples to assist the model in distinguishing smoke from non-forest-fire scenes accurately. In a ratio of 8:1:1, the dataset is split into training, testing, and validation subsets. Figure 1 shows a series of image samples from the dataset. For more detailed information, please refer to Table 1.
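As a concrete illustration of the 8:1:1 split described above, the following Python sketch divides an image folder into training, testing, and validation lists. The folder path, file extension, and output file names are assumptions for illustration only, not the authors' actual preprocessing script.

```python
import random
from pathlib import Path

# Illustrative sketch: split an image folder into train/test/val subsets
# at the 8:1:1 ratio used in this study.
random.seed(0)
images = sorted(Path("dataset/images").glob("*.jpg"))
random.shuffle(images)

n = len(images)                                  # e.g., 3603 images here
n_train, n_test = int(0.8 * n), int(0.1 * n)
splits = {
    "train": images[:n_train],
    "test": images[n_train:n_train + n_test],
    "val": images[n_train + n_test:],
}
for name, files in splits.items():
    Path(f"{name}.txt").write_text("\n".join(str(f) for f in files))
```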

2.2. Methods

2.2.1. YOLOv8

In January 2023, Ultralytics unveiled the latest iteration of the YOLOv8 model. This version not only maintains the PAN architecture from YOLOv5 but also streamlines the CBS 1 × 1 convolutional structure within the PAN-FPN upsampling phase and substitutes the C3 module with a C2f module. Additionally, YOLOv8 incorporates an Anchor-Free design concept and implements DFL Loss with CIoU Loss for classification and localization loss computation. Unlike traditional IoU matching or unilateral proportional allocation approaches, YOLOv8 introduces a Task-Aligned Assigner for matching purposes. The network structure of the YOLOv8 can be seen in Figure 2.
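To ground this description, the snippet below is a minimal, hedged sketch of how a YOLOv8s baseline is typically trained and evaluated with the Ultralytics package, as a reference point for the modifications that follow. The dataset configuration file name, epoch count, image size, and batch size are illustrative assumptions, not the authors' exact settings.

```python
from ultralytics import YOLO

model = YOLO("yolov8s.pt")                        # pretrained YOLOv8s weights
model.train(data="fire_smoke.yaml", epochs=200, imgsz=640, batch=16)
metrics = model.val()                             # precision, recall, mAP@0.5, ...
results = model("forest_fire_example.jpg")        # inference on a single image
```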

2.2.2. GhostDynamicModule

To enable models to be deployed on micro-monitoring embedded devices for high-precision detection in resource-poor and poorly equipped forest areas, we propose the GhostDynamicModule. This module integrates GhostConv and C2f-Ghost-DynamicConv modules to make the forest fire detection model more lightweight. Firstly, GhostConv generates more forest fire smoke feature maps through low-cost linear transformations. Secondly, to address the problem of network performance degradation caused by low FLOPs traps in the forest fire model, we integrate GhostBottleneck and DynamicConv into C2f, thereby proposing C2f-Ghost-DynamicConv. This module adaptively integrates various convolutional kernels based on the input data. It dynamically modifies the weights to improve the extraction and representation of features associated with forest fires and smoke, regardless of their varying sizes and shapes. Additionally, it optimizes computing resource utilization and improves the performance of low-precision floating-point operations, thereby advancing the effectiveness of forest fire smoke models in scenarios characterized by low FLOPs.
  • GhostModule
Traditional convolutional methods are computationally expensive and cannot effectively eliminate redundant features in forest fire smoke recognition. To construct more efficient convolutional neural networks (CNNs), MobileNet [39] and ShuffleNet [40] employ deep convolution or mixing operations, utilizing smaller convolution kernels to reduce the number of FLOPs. Nevertheless, the 1 × 1 convolutional layers consume a large amount of memory and FLOPs. Han et al. [41] propose the Ghost module, which generates more feature maps of fire and smoke by employing less costly operations to further address this issue. Moreover, GhostConv employs efficient methods like depth-separable convolution to significantly minimize both the parameter count and computational burden. The detailed architecture is illustrated in Figure 3.
The GhostConv initially aggregates informative features across channels by performing a 1 × 1 convolution. This is followed by generating a new feature map through grouped convolution. This method reduces computation by dividing the traditional convolution into two steps. In the first step, feature maps with fewer channels are generated using traditional convolution, which requires less computation. Subsequently, the computational effort is further reduced by applying inexpensive operations, such as depthwise convolution, to these feature maps. The new feature maps are thus generated and then merged to form the final output.
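The two-step scheme just described can be sketched in a few lines of PyTorch. The block below is an illustrative GhostConv in the spirit of [41] and common YOLO implementations, not the authors' exact code; the kernel sizes and activation are assumed defaults.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Sketch of a Ghost convolution: a primary convolution produces half the
    output channels, and a cheap depthwise convolution generates the remaining
    "ghost" feature maps, which are then concatenated."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_half = c_out // 2
        self.primary = nn.Sequential(            # ordinary convolution (expensive part)
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.cheap = nn.Sequential(              # depthwise conv = cheap linear transform
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)
        return torch.cat((y, self.cheap(y)), dim=1)

x = torch.randn(1, 64, 80, 80)       # e.g., a smoke feature map
print(GhostConv(64, 128)(x).shape)   # torch.Size([1, 128, 80, 80])
```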
In the LD-YOLO model, the initial convolutional layer remains unchanged, whereas every other convolutional layer is substituted with GhostConv. Building upon GhostConv, the authors of [41] subsequently devise the GhostBottleneck and C3Ghost modules. The detailed structures of GhostBottleneck and C3Ghost are depicted in Figure 4.
GhostBottleneck replaces the bottleneck module within the C3 module, as illustrated in Figure 4c, forming a new C3Ghost structure. This redesigned structure effectively reduces the computational burden and diminishes the model’s size by substituting the 3 × 3 standard convolution within the original bottleneck module.
  • C2f-Ghost-DynamicConv
However, the aforementioned improvement may lead to a low FLOPs trap. Large-scale visual pretraining on forest fire smoke substantially improves the ability of large visual models to detect forest fire smoke, but this pretraining does not offer the same advantages to models with low FLOPs [42]. Inspired by these findings, DynamicConv is combined with GhostBottleneck and integrated into the original C2f module, resulting in the C2f-Ghost-DynamicConv module. This improvement significantly enhances accuracy while only marginally increasing the number of FLOPs. DynamicConv, as described in reference [43], demonstrates enhanced feature representation compared to traditional static convolution. Based on the dynamic perceptron concept, DynamicConv’s underlying principle is illustrated by the following equation:
$$y = g\big(\tilde{W}^{T}(x)\,x + \tilde{b}(x)\big)$$
$$\tilde{W}(x) = \sum_{k=1}^{K}\pi_{k}(x)\,\tilde{W}_{k}, \qquad \tilde{b}(x) = \sum_{k=1}^{K}\pi_{k}(x)\,\tilde{b}_{k}$$
$$\mathrm{s.t.}\quad 0 \le \pi_{k}(x) \le 1, \qquad \sum_{k=1}^{K}\pi_{k}(x) = 1$$
In this context, x represents the input and y the output. The input x is used in two distinct operations: the first computes the parameters of the attention mechanism that generates the dynamic convolution kernel, and the second is the convolution itself. The symbols W, b, and g denote the weights, bias, and activation function, respectively. $\pi_k(x)$ denotes the attention weight of the $k$-th linear function $\tilde{W}_k^{T}x + \tilde{b}_k$. It is important to note that this weight varies with the input x.
DynamicConv includes k convolutional filters and is based on the traditional design of convolutional neural networks (CNNs). It incorporates batch normalization (BatchNorm) and ReLU activation functions. In Figure 5, the specific structure of DynamicConv is shown.
In DynamicConv, a set of K convolutional kernels with identical scale and channel number is created for a given layer. These kernels are combined using their respective attention weights, $\pi_k$, to form the convolutional kernel parameters for that layer, where * denotes multiplication. The attention box on the left side of the figure, highlighted with dashed lines, details the computation of $\pi_k(x)$. Initially, global average pooling is used to capture global spatial features. These features are mapped to K dimensions through two fully connected (FC) layers and then normalized using softmax. This process generates K attention weights, which are assigned to the K convolutional kernels of the layer.
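The following PyTorch sketch illustrates this aggregation: attention weights $\pi_k(x)$ are produced by global average pooling followed by two FC layers and softmax, and the K kernels are combined per input before a single convolution is applied. The kernel count, reduction ratio, and the batched grouped-convolution trick are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv(nn.Module):
    """Illustrative dynamic convolution (cf. the equations above): K kernels of
    identical shape are aggregated per input using attention weights pi_k(x)
    obtained from GAP -> two FC layers -> softmax."""
    def __init__(self, c_in, c_out, k=3, K=4, reduction=4):
        super().__init__()
        self.K, self.k, self.c_out = K, k, c_out
        self.weight = nn.Parameter(torch.randn(K, c_out, c_in, k, k) * 0.02)
        self.bias = nn.Parameter(torch.zeros(K, c_out))
        self.attn = nn.Sequential(                       # computes pi_k(x)
            nn.Linear(c_in, c_in // reduction), nn.ReLU(),
            nn.Linear(c_in // reduction, K))

    def forward(self, x):
        b, c, h, w = x.shape
        pi = F.softmax(self.attn(x.mean(dim=(2, 3))), dim=1)   # (b, K), sums to 1
        # Aggregate the K kernels per sample, then run one grouped convolution.
        W = torch.einsum("bk,koihw->boihw", pi, self.weight).reshape(-1, c, self.k, self.k)
        bias = torch.einsum("bk,ko->bo", pi, self.bias).reshape(-1)
        y = F.conv2d(x.reshape(1, b * c, h, w), W, bias, padding=self.k // 2, groups=b)
        return y.reshape(b, self.c_out, h, w)
```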
SENet, or Squeeze-and-Excitation Network [44], differs from DynamicConv primarily because it concentrates on attention at the channel level, whereas DynamicConv considers the entire convolutional kernel as an object of attention. Accordingly, C2f-Ghost-DynamicConv is proposed, based on the principle of DynamicConv and combined with the design concept of the GhostBottleneck module. Compared with the C2f module in the YOLOv8 model, the new design is illustrated in Figure 6.

2.2.3. DySample and Spatial Context Aware Module

In order to deploy monitoring equipment in resource-poor forest areas, it is necessary to ensure that the model is small and minimal enough to monitor forest fires and smoke more quickly and efficiently. To reduce the substantial workload and low detection efficiency caused by dynamic convolution in forest fire detection, as well as the generation of additional sub-networks for dynamic kernels, a super-lightweight and efficient dynamic upsampling method for forest fire, DySample, is introduced [45]. DySample bypasses the need for dynamic convolution by performing upsampling through a point-sampling approach, thereby substantially reducing the number of parameters. This reduction leads to decreased GPU memory usage and latency, facilitating faster detection of forest fires and smoke. To further address the challenge of insufficient representation of small fire and smoke targets and background interference in forest fire early detection scenarios, the Spatial Context Aware Module (SCAM) is introduced [46]. SCAM enhances the model’s capability to globally associate information across channels and spatial dimensions, effectively mitigating the issue of missing detections of small forest fire smoke targets.
  • DySample
DySample enhances resource efficiency by bypassing dynamic convolution and utilizing a point sampling method for upsampling. In the initial YOLOv8 model, the UpSample method demands substantial computational resources and numerous parameters, hindering the model’s capability to remain lightweight for detecting forest fires. In real-world applications of monitoring forest fires and smoke, the initial fire images are usually small and prone to pixel distortion, leading to the loss of fine-grained details and posing challenges for feature learning. To address this, DySample, a lightweight and efficient dynamic upsampler, is introduced as a replacement for UpSample. DySample enhances the identification of minor fire sources during the initial phases of a blaze, taking into account the possible low quality of surveillance pictures. It employs a point-based sampling method and incorporates a learned sampling approach for upsampling. This approach not only reduces computational resource consumption but also improves image resolution without additional burden. Consequently, DySample enhances the model’s efficiency and performance while minimizing computational cost. Figure 7 illustrates the sampling-based dynamic upsampling and module design in DySample.
The feasibility of a sampling-based dynamic upsampling method is demonstrated in Figure 7a. A feature map $\mathcal{X}$ of size $C \times H_1 \times W_1$ and a sampling set $\delta$ of size $2 \times H_2 \times W_2$ are considered, where the first dimension of the sampling set holds the x and y coordinates. The grid_sample function is employed to resample the input feature map $\mathcal{X}$ at the coordinates given by the sampling set $\delta$, using bilinear interpolation, which generates a new feature map $\mathcal{X}'$ of size $C \times H_2 \times W_2$. The process is defined as follows:
$$\mathcal{X}' = \mathrm{grid\_sample}(\mathcal{X}, \delta)$$
Let the upsampling scale factor be $s$ and the size of the feature map $\mathcal{X}$ be $C \times H \times W$. A linear layer with $C$ input channels and $2s^2$ output channels produces an offset $O$ of size $2s^2 \times H \times W$. Subsequently, the offset $O$ is rearranged to size $2 \times sH \times sW$ by the pixel-shuffling operation referenced in [47]. Finally, the sampling set $\delta$ is obtained by adding the offset $O$ to the original sampling grid $\mathcal{G}$. This process is defined as follows:
$$O = \mathrm{linear}(\mathcal{X})$$
$$\delta = \mathcal{G} + O$$
The description of the reshaping operation is omitted here. Finally, the upsampled feature map $\mathcal{X}'$, with size $C \times sH \times sW$, is generated by applying the grid_sample function to $\mathcal{X}$ and the sampling set $\delta$, as shown in Equation (4). Figure 7b shows the corresponding module design of DySample.
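A minimal sketch of this point-sampling pipeline is given below, assuming a 1 × 1 convolution as the linear layer, PyTorch's pixel_shuffle for the rearrangement, and grid_sample for the resampling; the offset scaling factor is an assumed detail, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySample(nn.Module):
    """Sketch of point-based dynamic upsampling: a 1x1 conv predicts per-pixel
    offsets O, pixel_shuffle rearranges them to the upsampled resolution, and
    grid_sample resamples X at positions G + O."""
    def __init__(self, channels, scale=2):
        super().__init__()
        self.scale = scale
        self.offset = nn.Conv2d(channels, 2 * scale * scale, 1)   # the "linear" layer

    def forward(self, x):
        b, c, h, w = x.shape
        s = self.scale
        # O: (b, 2, sH, sW); kept small so sampling stays near the original grid
        o = F.pixel_shuffle(self.offset(x), s) * 0.25
        # G: base sampling grid in normalized [-1, 1] coordinates
        ys = torch.linspace(-1, 1, s * h, device=x.device)
        xs = torch.linspace(-1, 1, s * w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).expand(b, -1, -1, -1)   # (b, sH, sW, 2)
        # delta = G + O, then resample: X' = grid_sample(X, delta)
        sample = grid + o.permute(0, 2, 3, 1) * 2 / torch.tensor([w, h], device=x.device)
        return F.grid_sample(x, sample, mode="bilinear", align_corners=True)
```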
  • Spatial Context Aware Module
Based on ideas from GCNet [48] and SCP [49], SCAM consists of three distinct branches. The first branch merges GAP and GMP. In the second branch, the feature map is linearly transformed using a 1 × 1 convolution. The third branch simplifies the product of query and key by utilizing the 1 × 1 convolution, referred to as QK. Matrix multiplication is used to merge the first and third branches, thereby capturing contextual data across channels and spatial dimensions. The end product of SCAM is generated by applying the broadcast Hadamard product to these two branches. Figure 8 depicts the comprehensive structure of SCAM. The representation of the pixel-level spatial context within each layer is as follows:
$$Q_{ij} = P_{ij} + a_{ij}\sum_{j=1}^{N_i}\frac{\exp\left(\omega_{qk} P_{ij}\right)}{\sum_{n=1}^{N_i}\exp\left(\omega_{qk} P_{in}\right)}\cdot \omega_{v} P_{ij}$$
$$a_{ij} = \frac{\exp\left(\left[\mathrm{avg}(P_i);\,\mathrm{max}(P_i)\right] P_{ij}\right)}{\sum_{n=1}^{N_i}\exp\left(\left[\mathrm{avg}(P_i);\,\mathrm{max}(P_i)\right] P_{in}\right)}\cdot \omega_{v}$$
In this context, $P_{ij}$ and $Q_{ij}$ denote the input and output of the $j$-th pixel in the $i$-th layer of the feature map, respectively, and $N_i$ represents the total number of pixels in that layer. The matrices $\omega_{qk}$ and $\omega_{v}$ are linear transformation matrices obtained through 1 × 1 convolutions, which are used to simplify the feature map. The functions avg() and max() implement the GAP and GMP operations, respectively. Through GAP and GMP, the channels holding crucial information can be pinpointed and selected, allowing SCAM to effectively absorb contextual information from the channel dimension.
Figure 8. Structures of GCBlock, SCP, and SCAM.
Consequently, SCAM facilitates the interaction of contextual features across various channels and spaces. It is found that adding SCAM to the neck is more effective than in the backbone network in capturing the global relationship between the small targets of fire and smoke and the background. Global context information helps characterize forest fire inter-pixel relationships, suppress extraneous background interference, and improve forest fire target–background discrimination.
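The sketch below illustrates, under stated assumptions, how such a three-branch spatial-context block can be assembled: a softmax-normalized query-key branch, a 1 × 1 value branch, and a GAP/GMP channel branch fused by a broadcast Hadamard product. It conveys the idea of SCAM rather than reproducing the exact module from [46].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCAM(nn.Module):
    """Hedged sketch of a spatial-context-aware block in the spirit of SCAM/GCNet."""
    def __init__(self, c):
        super().__init__()
        self.qk = nn.Conv2d(c, 1, 1)            # simplified query-key map
        self.v = nn.Conv2d(c, c, 1)             # value (linear transform)
        self.channel = nn.Conv2d(2 * c, c, 1)   # fuses GAP and GMP statistics

    def forward(self, x):
        b, c, h, w = x.shape
        # Spatial context: softmax over all pixel positions (cf. the first equation)
        attn = F.softmax(self.qk(x).view(b, 1, h * w), dim=-1)                   # (b, 1, HW)
        context = torch.bmm(self.v(x).view(b, c, h * w), attn.transpose(1, 2))   # (b, c, 1)
        # Channel context from GAP and GMP (cf. the second equation)
        gap = F.adaptive_avg_pool2d(x, 1)
        gmp = F.adaptive_max_pool2d(x, 1)
        gate = torch.sigmoid(self.channel(torch.cat((gap, gmp), dim=1)))          # (b, c, 1, 1)
        # Broadcast Hadamard product fuses both branches with the input
        return x + x * gate * context.view(b, c, 1, 1)
```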

2.2.4. Self-Attention Detection Head

In YOLOv8, a decoupled head structure is used. Two parallel branches separately extract category and location features, and each branch employs a 1 × 1 convolution layer to handle the classification and localization of forest fires and smoke. The self-attention detection head (SADH) introduced in this paper draws inspiration from the HyCTAS model [50]. By upgrading the Conv module of the second layer to self-attention, as shown in Figure 9, the accuracy of detecting forest fires and smoke is improved, and the memory efficiency of the detection model is also enhanced.
Figure 9a shows the original YOLOv8 decoupled head, while Figure 9b presents the improved version with a memory-efficient self-attention module. The FFN-free self-attention layer [51] is employed to capture global context efficiently. A 1 × 1 convolution is utilized to reduce the dimensionality, then increase it again, creating a bottleneck structure with reduced input and output channels. This approach significantly decreases the memory usage of the self-attention layer. Additionally, a bypass 1 × 1 convolution module retains uncompressed residues. This memory-efficient self-attention module enables the use of a large receptive field while maintaining a low floating-point number and memory footprint. The self-attention module structure is shown in Figure 10.
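A hedged sketch of such a bottlenecked, FFN-free self-attention block is shown below; the channel-reduction ratio and number of heads are illustrative assumptions, and the block stands in for the self-attention layer of the SADH rather than reproducing it exactly.

```python
import torch
import torch.nn as nn

class MemoryEfficientSelfAttention(nn.Module):
    """Sketch of a bottlenecked, FFN-free self-attention block: 1x1 convs shrink
    the channel width before multi-head attention and expand it afterwards, while
    a bypass 1x1 conv carries the uncompressed residual."""
    def __init__(self, c, reduction=4, heads=4):
        super().__init__()
        c_red = c // reduction
        self.squeeze = nn.Conv2d(c, c_red, 1)       # bottleneck in
        self.attn = nn.MultiheadAttention(c_red, heads, batch_first=True)
        self.expand = nn.Conv2d(c_red, c, 1)        # bottleneck out
        self.bypass = nn.Conv2d(c, c, 1)            # uncompressed residue path

    def forward(self, x):
        b, c, h, w = x.shape
        t = self.squeeze(x).flatten(2).transpose(1, 2)   # (b, HW, c_red)
        t, _ = self.attn(t, t, t)                        # global context, no FFN
        t = t.transpose(1, 2).reshape(b, -1, h, w)
        return self.expand(t) + self.bypass(x)
```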

2.2.5. Shape-IoU Loss

Existing loss functions primarily emphasize geometric constraints between the actual and predicted bounding boxes, neglecting the impact of geometric factors such as shape and scale on regression outcomes. To overcome this limitation, Shape-IoU examines the bounding box's shape and scale as part of the loss calculation. Shape-IoU [52] introduces a bounding box regression method that emphasizes the shape and scale of the boxes themselves, thereby addressing the shortcomings of previous studies, as illustrated in Figure 11.
Shape-IoU improves localization precision in target detection by facilitating more accurate bounding box regression. A ratio of intersection and union is calculated between the predicted and actual target shapes. The calculations are as follows:
$$ww = \frac{2\times\left(w^{gt}\right)^{scale}}{\left(w^{gt}\right)^{scale} + \left(h^{gt}\right)^{scale}}$$
$$hh = \frac{2\times\left(h^{gt}\right)^{scale}}{\left(w^{gt}\right)^{scale} + \left(h^{gt}\right)^{scale}}$$
$$distance^{shape} = hh\times\frac{\left(x_c - x_c^{gt}\right)^{2}}{c^{2}} + ww\times\frac{\left(y_c - y_c^{gt}\right)^{2}}{c^{2}}$$
$$\Omega^{shape} = \sum_{t=w,h}\left(1 - e^{-\omega_t}\right)^{\theta}, \qquad \theta = 4$$
$$\omega_w = hh\times\frac{\left|w - w^{gt}\right|}{\max\left(w,\, w^{gt}\right)}, \qquad \omega_h = ww\times\frac{\left|h - h^{gt}\right|}{\max\left(h,\, h^{gt}\right)}$$
$$L_{Shape\text{-}IoU} = 1 - IoU + distance^{shape} + 0.5\times\Omega^{shape}$$
In Shape-IoU, $w^{gt}$ and $h^{gt}$ represent the width and height of the ground-truth bounding box, while $w$ and $h$ denote those of the predicted (candidate) bounding box. The scale factor, typically ranging from 0 to 1.5, adjusts the weights of width and height according to the scale of the targets in the dataset. $ww$ and $hh$ are the weighting factors for width and height, determined by the shape of the ground-truth (GT) bounding box. The diagonal length of the smallest enclosing bounding box is represented by $c$, with $(x_c, y_c)$ and $(x_c^{gt}, y_c^{gt})$ indicating the center coordinates of the predicted and ground-truth bounding boxes, respectively. The term $distance^{shape}$ quantifies the shape-weighted distance between these centers. $\Omega^{shape}$ is a shape penalty based on the difference between the predicted and ground-truth boxes, enhancing shape sensitivity; its components $\omega_w$ and $\omega_h$, computed from $hh$ and $ww$, measure the width and height differences normalized by the larger of the predicted and ground-truth sizes. The Shape-IoU loss, $L_{Shape\text{-}IoU}$, combines the traditional IoU with additional distance and shape terms, enhancing positional and geometric accuracy.
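The equations above translate directly into code. The following sketch computes the Shape-IoU loss for center-format boxes; the box format and the numerical-stability epsilon are assumptions beyond what the equations specify.

```python
import torch

def shape_iou_loss(pred, gt, scale=1.0, theta=4, eps=1e-7):
    """Sketch of the Shape-IoU loss from the equations above.
    pred, gt: (N, 4) tensors of (x_c, y_c, w, h); `scale` is the dataset-dependent factor."""
    px, py, pw, ph = pred.unbind(-1)
    gx, gy, gw, gh = gt.unbind(-1)

    # Plain IoU
    inter_w = (torch.min(px + pw / 2, gx + gw / 2) - torch.max(px - pw / 2, gx - gw / 2)).clamp(0)
    inter_h = (torch.min(py + ph / 2, gy + gh / 2) - torch.max(py - ph / 2, gy - gh / 2)).clamp(0)
    inter = inter_w * inter_h
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # Squared diagonal of the smallest enclosing box (c^2)
    cw = torch.max(px + pw / 2, gx + gw / 2) - torch.min(px - pw / 2, gx - gw / 2)
    ch = torch.max(py + ph / 2, gy + gh / 2) - torch.min(py - ph / 2, gy - gh / 2)
    c2 = cw ** 2 + ch ** 2 + eps

    # Shape weights ww, hh from the GT box; shape-aware distance and shape penalty
    ww = 2 * gw ** scale / (gw ** scale + gh ** scale)
    hh = 2 * gh ** scale / (gw ** scale + gh ** scale)
    dist_shape = hh * (px - gx) ** 2 / c2 + ww * (py - gy) ** 2 / c2
    omega_w = hh * (pw - gw).abs() / torch.max(pw, gw)
    omega_h = ww * (ph - gh).abs() / torch.max(ph, gh)
    omega_shape = (1 - torch.exp(-omega_w)) ** theta + (1 - torch.exp(-omega_h)) ** theta

    return 1 - iou + dist_shape + 0.5 * omega_shape
```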

2.2.6. LD-YOLO

This paper proposes LD-YOLO, a model designed for lightweight and dynamic detection of forest fires and smoke. Firstly, GhostConv is used instead of traditional static convolution, reducing computational overhead while maintaining the expressiveness of the model by capturing more subtle representations of wildfire and smoke features. Secondly, DynamicConv is integrated with the GhostBottleneck module in the GhostModule, presenting C2f-Ghost-DynamicConv. This integration optimizes computational resources by dynamically adjusting the size of the forest fire smoke feature weights, allowing the network to achieve higher performance while minimizing the FLOPs.
UpSample is replaced with DySample, an efficient dynamic upsampling method, to optimize the upsampling process and reduce computational overhead. A spatial context-aware module (SCAM) is added to the Neck layer, which focuses on relevant features while suppressing irrelevant ones, enhancing the contextual feature representation of small forest fire targets against chaotic backgrounds.
Furthermore, a lightweight self-attention detection head (SADH) is designed to capture a wider range of forest fire background information using the self-attention mechanism, which is particularly beneficial for recognizing complex forest fire and smoke scenes. By introducing Shape-IoU, the delineation accuracy of fire and smoke contours is improved, which consequently enhances the model’s overall detection accuracy and convergence speed.
In this study, the LD-YOLO model is proposed to improve the extraction and representation of features related to wildfires and smoke. Additionally, it optimizes the balance between detection performance and computational efficiency. The structure of LD-YOLO is shown in Figure 12.

3. Results

3.1. Experimental Environment

The experimental setting is shown in Table 2, detailing the software and hardware configurations.

3.2. Model Evaluation

To evaluate the efficacy of the model in target detection, a range of commonly used metrics is employed. The evaluation metrics encompass precision, recall, mean average precision at an IoU threshold of 0.5 (mAP@0.5), mean average precision across IoU thresholds from 0.5 to 0.95 (mAP@0.5~0.95), average precision (AP), Floating-Point Operations (FLOPs), the number of parameters, model size, and Frames Per Second (FPS), among other relevant indicators. True Positives (TPs) refer to the positive samples accurately identified by the model as either fire or smoke. True Negatives (TNs) are the negative samples that the model correctly classifies as non-fire or non-smoke. False Positives (FPs) are negative samples that are incorrectly identified by the model as fire or smoke. False Negatives (FNs) denote positive samples that the model erroneously classifies as non-fire or non-smoke.
Precision is characterized as the proportion of instances that the model identifies as fire or smoke, which are indeed confirmed to be fire or smoke, relative to the total number of instances that the model predicts as fire or smoke. The calculations are as follows:
$$Precision = \frac{TP}{TP + FP}$$
Recall is defined as the proportion of actual fire or smoke instances that the model correctly identifies, relative to the total number of actual fire or smoke instances. The calculation is performed as follows:
$$Recall = \frac{TP}{TP + FN}$$
Average precision (AP) is calculated by averaging the precision values obtained from the precision–recall (PR) curve. The metric mAP@0.5 is employed to compute the AP for all images within each category, which is subsequently averaged across all categories. Its calculation formula is as follows:
$$AP = \int_{0}^{1} Precision \; d(Recall)$$
$$mAP = \frac{1}{n}\sum_{t=1}^{n} AP_t$$
Parameters denote the number of learnable variables in the model. For a layer with I input channels and O output channels (plus biases), the count is calculated as follows:
$$Parameters = (I + 1) \times O = I \times O + O$$
Frames Per Second (FPS) serves as a metric for the frequency at which the network model analyzes and transmits images of identified fire or smoke. Latency refers to the time generally required by the network to predict an image, and is calculated as follows:
$$FPS = \frac{1}{Latency}$$
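As a quick sanity check of these definitions, the short snippet below evaluates them with made-up counts; the numbers are purely illustrative and are not results from this study.

```python
# Worked example of the metric definitions above (illustrative numbers only).
tp, fp, fn = 90, 10, 20
precision = tp / (tp + fp)    # 0.90
recall = tp / (tp + fn)       # ~0.82

latency = 0.008               # seconds per image (example value)
fps = 1 / latency             # 125.0
print(precision, recall, fps)
```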

3.3. Ablation Experiment

In order to assess the enhancements made to LD-YOLO, a total of nine sets of ablation experiments were conducted to verify the effectiveness and feasibility of each improvement module (Imp). The results of these experiments are shown in Table 3.
In this experiment, the impact of various improvements on model performance was analyzed by performing ablation experiments on the YOLOv8s model. The initial YOLOv8s model exhibited high detection accuracy (mAP@0.5 = 0.821) and real-time processing capability (FPS = 138.82). However, by introducing new modules, the performance and efficiency of the model were further optimized.
After adding GhostConv and C3fGhost, the model's mAP@0.5 improved from 0.821 to 0.833. This result indicated that these improvements effectively enhanced the model's forest fire smoke feature extraction capabilities. Moreover, these two enhancements substantially simplified the model, decreasing the number of FLOPs from 28.4 G to 16.4 G, the model size from 22.5 MB to 13.6 MB, and the number of parameters from 10.6 M to 5.6 M.
After replacing C3fGhost with the C2f-Ghost-DynamicConv component, the model's mAP@0.5 further improved to 0.836. This result demonstrated that dynamic convolution played a key role in enhancing the model's ability to adapt to complex forest fire smoke environments.
The model’s APfire increased from 0.781 to 0.819, and APsmoke rose from 0.86 to 0.897 due to the implementation of DySample and SCAM, demonstrating improvements in the model’s capacity to detect specific elements like fire and smoke. Similarly, SADH was designed to further improve the [email protected]∼0.95 to 0.618, confirming its effectiveness in enhancing the model’s capability to distinguish the target area of forest fire smoke in detail. The replacement of CIoU with Shape-IoU improved the model’s ability to accurately distinguish the shape and boundary of fire and smoke, enhancing the APsmoke to 0.897, thereby confirming its significance in improving the accuracy of target detection.
The final LD-YOLO model achieved an mAP@0.5 of 0.863 and an mAP@0.5~0.95 of 0.603. It maintained a relatively low computational cost (14.7 G FLOPs and 6.7 M parameters) and a small model size (14.4 MB), while achieving 161.02 FPS. This demonstrates that the model can meet the requirements for high-speed, real-time detection of forest fires and smoke while maintaining high accuracy.
To demonstrate the efficacy of each improvement module for fire and smoke small-target detection, heat maps were plotted to represent the improvements brought by each module, as shown in Figure 13. The feature heat map reveals that the original model was unable to extract fuzzy fire features and only recognized sparse smoke around it. After the introduction of GhostConv and the enhanced C2f-Ghost-DynamicConv module, the model’s sensory field expanded, yet it still failed to accurately recognize fuzzy fire features. However, it discerned the surrounding smoke based on the original model’s capabilities. The incorporation of the fused DySample and SCAM modules, which enhance the representation of weak features of fire and smoke small targets while suppressing the influence of confusing backgrounds, enabled the model to correctly recognize fire features. Despite this, the model’s identification ability remained general, lacking the capacity to recognize specific fire features. Incorporating SADH enhanced the model’s capability to identify distinct fire characteristics and improved the accuracy within the sensory range, despite the limited ability to learn intricate fire features. The incorporation of Shape-IoU Loss, which emphasizes the shape and dimensions of the bounding box, facilitated the model’s ability to effectively learn both intricate and general characteristics of fire. This enhancement led to a precise and thorough identification of fire and the associated smoke.

3.4. Performance and Comparative Analysis

To further confirm the benefits of LD-YOLO in forest fire detection modeling, comparative experiments were performed with established models including YOLOv3, YOLOv5, YOLOv6, YOLOv7, YOLOv9, and Faster R-CNN. The results were evaluated using the following metrics: precision, recall, mAP@0.5, parameters, FLOPs, and model size. In a series of comparisons, LD-YOLO achieved an mAP@0.5 value of 0.863, which was 4.2% higher than the baseline model. It also had 6.7 M parameters, 14.7 G FLOPs, and a model size of 14.4 MB, which were 36.79%, 48.2%, and 36% lower than the baseline model, respectively, with an increase in FPS of 15.99%. The comparative experiment results can be found in Table 4. To facilitate comparison with classical deep learning algorithms, line graphs were constructed comparing LD-YOLO with various deep learning algorithms, including YOLOv3, YOLOv5s, YOLOv6s, YOLOv7-Tiny, and YOLOv8s, as shown in Figure 14.
From the above comparative experiments, it is evident that LD-YOLO demonstrates superior performance in terms of mAP when compared to other conventional models. At the same time, a slight decline in performance after approximately 180 epochs was observed, which may be attributed to the limited size of the training dataset and the complexity of the model's structural design. Nonetheless, this limitation does not adversely affect the overall dynamic performance of LD-YOLO, which effectively sustains a balanced methodology regarding detection accuracy, detection speed, model complexity, and other pertinent factors. Moreover, it still performs well compared to the original YOLOv8 in different scenarios.
To verify the performance of LD-YOLO in different scenarios with small targets, long distances, occlusions, and complex and changing environments, several sets of comparison experiments were conducted between the original YOLOv8 model and LD-YOLO.
Firstly, LD-YOLO was able to recognize fuzzy small-target fires and long-distance fires, as well as perceive smoke over a wider range than the original model. In contrast, the original model was unable to recognize the small-target fire in Figure 15a and also failed to recognize the small-target fire in Figure 15c. Furthermore, the original model exhibited a narrower range of smoke perception and was unable to fully encompass the extent of the smoke.
It was observed that the YOLOv8 model exhibited difficulty in accurately recognizing forest fires in scenarios with occlusion, as evidenced by the lower accuracy of smoke recognition in Figure 16a compared to LD-YOLO’s performance in Figure 16b. In the comparison between Figure 16c,d, the original YOLOv8 model failed to identify two small fires obscured by trees, as well as one instance of smoke.
Our training set contained 110 unlabeled cloud photos for the model to train on, with details provided in Table 1. Figure 17 below shows the experimental results of the scene with cloud interference. In Figure 17a,b, the original model misidentified the clouds in the sky as smoke, whereas LD-YOLO did not show any misdetection. In Figure 17c,d, smoke actually appeared, but the original model failed to recognize it correctly and identified a fire with lower accuracy than LD-YOLO. This demonstrated the advantage of LD-YOLO in confusing forest fire and smoke scenarios.
Given that realistic forest fire and smoke scenarios often involve large amounts of dispersed fires and smoke, it was critical to accurately identify and distinguish between numerous fires and smoke in complex and diverse forest fire and smoke scenes. To this end, the original YOLOv8 and LD-YOLO models were employed in such scenes for simulation detection, and the resulting detection outcomes are presented in Figure 18. Figure 18a,b illustrated that LD-YOLO was more effective at recognizing fires than the original model, although there were still instances where very small fires were missed by both models. Figure 18c,d demonstrated that LD-YOLO was better at accurately recognizing fires and smoke, with the original model failing to identify a particularly critical fire. In a real-world forest fire scenario, the ability to recognize more fires can be of significant benefit to rescue operations. Thus, LD-YOLO was capable of fulfilling the requirements for detecting forest fires and smoke by delivering more precise detection results with reduced operations and parameters, along with a smaller model size.

4. Discussion

As one of the essential biomes on Earth, forest resources play a pivotal role in sustaining the natural environment’s wholeness and fostering ecological construction within human society. Every year, forests covering millions of hectares worldwide are destroyed by fire. Detecting fire and smoke promptly and accurately in the initial stages of forest fires is crucial for preventing ecological damage and safeguarding human lives and property.
Traditional forest fire and smoke detection techniques utilize algorithms for the manual extraction of features. However, these methods often require fine-tuning for specific application scenarios. Additionally, manually designed feature extraction methods may not fully capture all changes in fire and smoke, which limits their generality and practicality. The significant progress in deep learning technology has highlighted the importance of research on forest fire and smoke detection algorithms based on deep learning. These algorithms primarily utilize deep learning models, especially convolutional neural networks (CNNs), to autonomously learn intricate feature representations from vast image datasets, thereby facilitating the effective identification of fires and smoke. The YOLO (You Only Look Once) series, particularly YOLOv8, finds extensive application in the detection of smoke and forest fires.
The rapid identification and localization capabilities of YOLOv8 in detecting smoke and fires within complex environments are essential for the timely detection and intervention of forest fires. However, balancing detection accuracy, detection speed, and model complexity is challenging. This paper strives to enhance the speed and accuracy of detecting smoke and forest fires, especially in the initial stages of such fires. GhostConv is introduced to generate additional smoke feature maps through low-cost linear transformations, thereby reducing the number of parameters. Compared to high-performance deep networks, lightweight networks are constrained in terms of depth and channels and have limited ability to express features. Therefore, C2f-Ghost-DynamicConv is proposed, which adaptively fuses multiple convolution kernels based on input, enhancing the extraction and representation of features for forest fire smoke of varying sizes and shapes. To address the issue of small targets for fires, where false alarms and omissions often occur in obstructed and complex forest fire scenarios, DySample is introduced. A point-based sampling technique is employed by DySample, and it also incorporates a learning-based method for upsampling. This approach addresses issues related to small initial fire images, pixel distortion that can result in the loss of detailed features of forest fires and smoke, and challenges in feature learning. By introducing SCAM, the model’s capability to globally associate information across channels and spatial dimensions is enhanced, thereby effectively mitigating the problem of missed smoke target detection in small forest fires. To improve the model, the contextual information of global forest fire and smoke characteristics is captured by restructuring the detection head. Additionally, Shape-IoU is introduced to improve the detection accuracy of smoke in forest fires by concentrating on the shape and scale of the boundaries.
Finally, LD-YOLO performs well on the self-made dataset, although its precision does not exceed that of the original YOLOv8 model, being 0.006 lower. Additionally, the parameters, FLOPs, and model size of LD-YOLO are not the lowest among the compared models. Balancing detection accuracy, speed, and model complexity while accommodating various forest fire and smoke scenarios remains inherently difficult; therefore, our objective is to achieve an optimal balance among these factors. In the detection examples shown in Figure 15, LD-YOLO demonstrates good detection accuracy for forest fires in their early stages compared with the original model and is able to identify all instances of smoke. Figure 16 and Figure 18 illustrate that LD-YOLO accurately detects fires that are partially obstructed. Figure 17 verifies the behavior of the model on the unlabeled cloud images, showing that it distinguishes ordinary clouds from smoke.
Deep learning techniques, however, generally demand a significant amount of labeled data for training and come with high computational expenses, which can constrain their practical application. Moreover, the use of complex deep learning models on small-scale datasets can lead to overfitting issues. Therefore, preventing model overfitting on small-scale datasets is also worth studying. Future research should focus on creating more efficient training strategies to minimize the computational demands of models. Additionally, exploring semi-supervised or unsupervised learning methods could help decrease dependence on extensive labeled datasets. Given the importance of real-time detection, one key research area is improving the model’s inference speed.

5. Conclusions

The LD-YOLO model, which is based on YOLOv8, has been proposed to overcome the inherent drawbacks of traditional methods for detecting forest fires and smoke. Firstly, GhostConv is introduced to reduce parameters while keeping the efficiency of fire and smoke feature extraction intact, which in turn boosts the model's overall computational efficiency. Additionally, innovative improvements are made to the C2f module by proposing the C2f-Ghost-DynamicConv module, which adaptively fuses multiple convolutional kernels according to the input. This module enhances the feature extraction and representation of forest fires and smoke of varying sizes and shapes, optimizes the use of computational resources, and improves the performance of low-precision floating-point operations. To address the challenges posed by time-consuming dynamic convolution and the burden of additional sub-networks, DySample, an ultra-lightweight and efficient dynamic upsampling technique, is introduced. DySample redefines the upsampling process from a point-sampling perspective, bypassing dynamic convolution and resulting in faster detection of forest fires. To combat the issues of insufficient feature representation and background confusion, this paper also introduces the Spatial Context Awareness Module (SCAM). By addressing the problem of inadequate representation of small fire target features and background interference, SCAM enhances early forest fire detection. Additionally, this study introduces a lightweight detection head, the self-attention detection head, which employs self-attention mechanisms to effectively capture the contextual information of global forest fire and smoke characteristics. Finally, Shape-IoU is introduced to emphasize the bounding box's shape and scale, improving the detection precision of the model in identifying forest fires and smoke.
This experiment divides 3603 forest fire images into training, testing, and validation sets, following an 8:1:1 distribution ratio. The findings indicate that the LD-YOLO model attains a mean average precision (mAP0.5) of 86.3% on a specialized forest fire dataset. This performance surpasses the original YOLOv8 model by 4.2%. Additionally, LD-YOLO exhibits 36.79% fewer parameters, 48.24% lower FLOPs, and 15.99% higher FPS, significantly improving the efficiency of forest fire and smoke detection. These enhancements facilitate deployment on miniature embedded devices. Furthermore, the LD-YOLO model demonstrates enhanced performance metrics, including recall, mAP, APfire, and APsmoke, when compared to traditional detection models such as YOLOv5, YOLOv6, YOLOv7, YOLO-World, and Faster R-CNN. Although its precision is slightly lower than that of the original YOLOv8 and its FLOPs and model size do not reach the minimum achieved by YOLOv8n, the trade-off between space and time efficiency is worthwhile. Therefore, LD-YOLO attains high accuracy, rapid detection speed, and low model complexity in detecting forest fires and smoke, which are crucial for the prompt identification of forest fires.

Author Contributions

Conceptualization, Z.L. and B.Y.; methodology, Z.L. and B.Y.; writing—original draft preparation, Z.L. and B.Y.; writing—review and editing, B.Y., Y.Z. and Z.L.; supervision, B.Y.; project administration, B.Y. and Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 61972357), the National Innovation and Entrepreneurship Training Program for College Students (No. 202311057037), and the Industry-University-Research Innovation Fund of Chinese Colleges (No. 2022IT009).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Keenan, R.J.; Reams, G.A.; Achard, F.; de Freitas, J.V.; Grainger, A.; Lindquist, E. Dynamics of global forest area: Results from the FAO Global Forest Resources Assessment 2015. For. Ecol. Manag. 2015, 352, 9–20. [Google Scholar] [CrossRef]
  2. Morales-Hidalgo, D.; Oswalt, S.N.; Somanathan, E. Status and trends in global primary forest, protected areas, and areas designated for conservation of biodiversity from the Global Forest Resources Assessment 2015. For. Ecol. Manag. 2015, 352, 68–77. [Google Scholar] [CrossRef]
  3. Bergeron, Y.; Gauthier, S.; Flannigan, M.; Kafka, V. Fire regimes at the transition between mixedwood and coniferous boreal forest in northwestern Quebec. Ecology 2004, 85, 1916–1932. [Google Scholar] [CrossRef]
  4. Parashar, A.; Biswas, S. The impact of forest fire on forest biodiversity in the Indian Himalayas (Uttaranchal). In Proceedings of the XII World Forestry Congress, Quebec City, QC, Canada, 21–28 September 2003; Volume 358. [Google Scholar]
  5. Cha, S.; Kim, C.B.; Kim, J.; Lee, A.L.; Park, K.H.; Koo, N.; Kim, Y.S. Land-use changes and practical application of the land degradation neutrality (LDN) indicators: A case study in the subalpine forest ecosystems, Republic of Korea. For. Sci. Technol. 2020, 16, 8–17. [Google Scholar] [CrossRef]
  6. Bui, D.T.; Van Le, H.; Hoang, N.D. GIS-based spatial prediction of tropical forest fire danger using a new hybrid machine learning method. Ecol. Inform. 2018, 48, 104–116. [Google Scholar]
  7. Zheng, B.; Ciais, P.; Chevallier, F.; Chuvieco, E.; Chen, Y.; Yang, H. Increasing forest fire emissions despite the decline in global burned area. Sci. Adv. 2021, 7, eabh2646. [Google Scholar] [CrossRef]
  8. Andela, N.; Morton, D.C.; Schroeder, W.; Chen, Y.; Brando, P.M.; Randerson, J.T. Tracking and classifying Amazon fire events in near real time. Sci. Adv. 2022, 8, eabd2713. [Google Scholar] [CrossRef]
  9. Bowman, D.M.; Williamson, G.J.; Abatzoglou, J.T.; Kolden, C.A.; Cochrane, M.A.; Smith, A.M. Human exposure and sensitivity to globally extreme wildfire events. Nat. Ecol. Evol. 2017, 1, 0058. [Google Scholar] [CrossRef]
  10. Baijnath-Rodino, J.A.; Kumar, M.; Rivera, M.; Tran, K.D.; Banerjee, T. How vulnerable are American states to wildfires? A livelihood vulnerability assessment. Fire 2021, 4, 54. [Google Scholar] [CrossRef]
  11. Haight, R.G.; Cleland, D.T.; Hammer, R.B.; Radeloff, V.C.; Rupp, T.S. Assessing fire risk in the wildland-urban interface. J. For. 2004, 102, 41–48. [Google Scholar] [CrossRef]
  12. Mahmoud, H.; Chulahwat, A. Unraveling the complexity of wildland urban interface fires. Sci. Rep. 2018, 8, 9315. [Google Scholar] [CrossRef] [PubMed]
  13. Zhang, F.; Zhao, P.; Xu, S.; Wu, Y.; Yang, X.; Zhang, Y. Integrating multiple factors to optimize watchtower deployment for wildfire detection. Sci. Total Environ. 2020, 737, 139561. [Google Scholar] [CrossRef] [PubMed]
  14. Yao, J.; Raffuse, S.M.; Brauer, M.; Williamson, G.J.; Bowman, D.M.; Johnston, F.H.; Henderson, S.B. Predicting the minimum height of forest fire smoke within the atmosphere using machine learning and data from the CALIPSO satellite. Remote Sens. Environ. 2018, 206, 98–106. [Google Scholar] [CrossRef]
  15. Yu, L.; Wang, N.; Meng, X. Real-time forest fire detection with wireless sensor networks. In Proceedings of the 2005 International Conference on Wireless Communications, Networking and Mobile Computing, Wuhan, China, 23–26 September 2005; Volume 2, pp. 1214–1217. [Google Scholar]
  16. Hefeeda, M.; Bagheri, M. Wireless sensor networks for early detection of forest fires. In Proceedings of the 2007 IEEE International Conference on Mobile Adhoc and Sensor Systems, Pisa, Italy, 8–11 October 2007; pp. 1–6. [Google Scholar]
  17. Mölders, N. Suitability of the Weather Research and Forecasting (WRF) model to predict the June 2005 fire weather for Interior Alaska. Weather Forecast. 2008, 23, 953–973. [Google Scholar] [CrossRef]
  18. Kumar, M.; Kosović, B.; Nayak, H.P.; Porter, W.C.; Randerson, J.T.; Banerjee, T. Evaluating the performance of WRF in simulating winds and surface meteorology during a Southern California wildfire event. Front. Earth Sci. 2024, 11, 1305124. [Google Scholar] [CrossRef]
  19. Celik, T.; Ma, K.K. Computer vision based fire detection in color images. In Proceedings of the 2008 IEEE Conference on Soft Computing in Industrial Applications, Muroran, Japan, 25–27 June 2008; pp. 258–263. [Google Scholar]
  20. Abidha, T.; Mathai, P.P.; Divya, M. Vision Based Wildfire Detection Using Bayesian Decision Fusion Framework. Int. J. Adv. Res. Comput. Commun. Eng. 2013, 2, 4603–4609. [Google Scholar]
  21. Gubbi, J.; Marusic, S.; Palaniswami, M. Smoke detection in video using wavelets and support vector machines. Fire Saf. J. 2009, 44, 1110–1115. [Google Scholar] [CrossRef]
  22. Frizzi, S.; Kaabi, R.; Bouchouicha, M.; Ginoux, J.M.; Moreau, E.; Fnaiech, F. Convolutional neural network for video fire and smoke detection. In Proceedings of the IECON 2016—42nd Annual Conference of the IEEE Industrial Electronics Society, Florence, Italy, 23–26 October 2016; pp. 877–882. [Google Scholar]
  23. Yuan, F.; Zhang, L.; Wan, B.; Xia, X.; Shi, J. Convolutional neural networks based on multi-scale additive merging layers for visual smoke recognition. Mach. Vis. Appl. 2019, 30, 345–358. [Google Scholar] [CrossRef]
  24. Sathishkumar, V.E.; Cho, J.; Subramanian, M.; Naren, O.S. Forest fire and smoke detection using deep learning-based learning without forgetting. Fire Ecol. 2023, 19, 9. [Google Scholar] [CrossRef]
  25. Zhang, J.; Guo, S.; Zhang, G.; Tan, L. Fire Detection Model Based on Multi-scale Feature Fusion. J. Zhengzhou Univ. Eng. Sci. 2021, 42. [Google Scholar]
  26. Ryu, J.; Kwak, D. Flame detection using appearance-based pre-processing and convolutional neural network. Appl. Sci. 2021, 11, 5138. [Google Scholar] [CrossRef]
  27. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  28. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  29. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  30. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  31. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  32. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  33. Jindal, P.; Gupta, H.; Pachauri, N.; Sharma, V.; Verma, O.P. Real-time wildfire detection via image-based deep learning algorithm. In Proceedings of the Soft Computing: Theories and Applications: Proceedings of SoCTA 2020; Springer: New York, NY, USA, 2021; Volume 2, pp. 539–550. [Google Scholar]
  34. Qian, J.; Lin, J.; Bai, D.; Xu, R.; Lin, H. Omni-dimensional dynamic convolution meets bottleneck transformer: A novel improved high accuracy forest fire smoke detection model. Forests 2023, 14, 838. [Google Scholar] [CrossRef]
  35. Saydirasulovich, S.N.; Mukhiddinov, M.; Djuraev, O.; Abdusalomov, A.; Cho, Y.I. An improved wildfire smoke detection based on YOLOv8 and UAV images. Sensors 2023, 23, 8374. [Google Scholar] [CrossRef] [PubMed]
  36. Yunusov, N.; Islam, B.M.S.; Abdusalomov, A.; Kim, W. Robust Forest Fire Detection Method for Surveillance Systems Based on You Only Look Once Version 8 and Transfer Learning Approaches. Processes 2024, 12, 1039. [Google Scholar] [CrossRef]
  37. Yun, B.; Zheng, Y.; Lin, Z.; Li, T. FFYOLO: A Lightweight Forest Fire Detection Model Based on YOLOv8. Fire 2024, 7, 93. [Google Scholar] [CrossRef]
  38. de Venâncio, P.V.A.; Lisboa, A.C.; Barbosa, A.V. An automatic fire detection system based on deep convolutional neural networks for low-power, resource-constrained devices. Neural Comput. Appl. 2022, 34, 15349–15368. [Google Scholar] [CrossRef]
  39. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  40. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  41. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
  42. Han, K.; Wang, Y.; Guo, J.; Wu, E. ParameterNet: Parameters Are All You Need for Large-scale Visual Pretraining of Mobile Networks. arXiv 2023, arXiv:2306.14525. [Google Scholar]
  43. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11030–11039. [Google Scholar]
  44. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  45. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to Upsample by Learning to Sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6027–6037. [Google Scholar]
  46. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  47. Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for small object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611215. [Google Scholar] [CrossRef]
  48. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  49. Liu, Y.; Li, H.; Hu, C.; Luo, S.; Luo, Y.; Chen, C.W. Learning to aggregate multi-scale context for instance segmentation in remote sensing images. IEEE Trans. Neural Netw. Learn. Syst. 2024, 1–15. [Google Scholar] [CrossRef] [PubMed]
  50. Yu, H.; Wan, C.; Liu, M.; Chen, D.; Xiao, B.; Dai, X. Real-Time Image Segmentation via Hybrid Convolutional-Transformer Architecture Search. arXiv 2024, arXiv:2403.10413. [Google Scholar]
  51. Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-attention generative adversarial networks. In Proceedings of the International Conference on Machine Learning. PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 7354–7363. [Google Scholar]
  52. Zhang, H.; Zhang, S. Shape-IoU: More Accurate Metric considering Bounding Box Shape and Scale. arXiv 2023, arXiv:2312.17663. [Google Scholar]
Figure 1. Example of experimental dataset.
Figure 2. Network structure of YOLOv8.
Figure 3. Structure of the GhostConv.
Figure 4. GhostBottleneck and C3Ghost: (a) GhostBottleneck, stride = 1; (b) GhostBottleneck, stride = 2; (c) C3Ghost.
Figure 5. A dynamic convolution layer.
Figure 6. Comparison between C2f and C2f-Ghost-DynamicConv module: (a) C2f Module; (b) C2f-Ghost-DynamicConv Module.
Figure 7. Sampling point generator in DySample.
Figure 9. YOLOv8 detection head: (a) YOLOv8 decoupled detection head; (b) Self-Attention decoupled detection head.
Figure 10. Self-Attention module.
Figure 11. Structure of the Shape-IoU.
Figure 12. Network structure of LD-YOLO.
Figure 13. Comparison of heat maps of innovative points.
Figure 14. Comparison of mAP between different models.
Figure 15. Detection results of forest fire and smoke scenes for small targets or long distances: (a,c) YOLOv8 model; (b,d) LD-YOLO model.
Figure 16. Forest fire and smoke scene detection results with partial occlusion: (a,c) YOLOv8 model; (b,d) LD-YOLO model.
Figure 17. Detection results of forest fire and smoke scene with cloud interference: (a,c) YOLOv8 model; (b,d) LD-YOLO model.
Figure 18. Detection results of complex and diverse forest fire and smoke scenarios: (a,c) YOLOv8 model; (b,d) LD-YOLO model.
Table 1. Details of the dataset.

| Dataset    | Number of Images | Number of Fire and Smoke | Number of Clouds |
|------------|------------------|--------------------------|------------------|
| Training   | 2883             | 2773                     | 110              |
| Validation | 360              | 345                      | 15               |
| Testing    | 360              | 350                      | 10               |
Table 2. Experimental settings.

| Configuration        | Version                 |
|----------------------|-------------------------|
| Framework            | PyTorch (version 2.3.0) |
| Programming Language | Python (version 3.9)    |
| GPU                  | RTX A6000               |
| Operating System     | Linux Ubuntu 22.04 LTS  |

| Parameter        | Value     |
|------------------|-----------|
| Input Image Size | 640 × 640 |
| Epochs           | 250       |
| Batch Size       | 32        |
| Patience         | 50        |
| Optimizer        | Adam      |
| Learning Rate    | 0.01      |
| Adam Momentum    | 0.937     |
| Workers          | 8         |
| Weight Decay     | 0.0005    |
| Warmup Momentum  | 0.8       |
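A minimal sketch of a training call consistent with the settings in Table 2 is given below, assuming the Ultralytics YOLO training API; the dataset configuration file name (forest_fire.yaml) and the starting model definition are illustrative placeholders, not the authors' files.

```python
from ultralytics import YOLO

# Hypothetical reproduction of the Table 2 settings; file names are illustrative.
model = YOLO("yolov8s.yaml")      # base architecture before the LD-YOLO modifications
model.train(
    data="forest_fire.yaml",      # hypothetical dataset config (train/val/test paths, fire/smoke classes)
    imgsz=640,
    epochs=250,
    batch=32,
    patience=50,
    optimizer="Adam",
    lr0=0.01,
    momentum=0.937,
    weight_decay=0.0005,
    warmup_momentum=0.8,
    workers=8,
)
```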
Table 3. Ablation experiment results.

| Module                | Baseline | Imp 1  | Imp 2  | Imp 3  | Imp 4  | Imp 5  | Imp 6  | Imp 7  | Imp 8  | Ours   |
|-----------------------|----------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| GhostConv             |          |        |        |        |        |        |        |        |        |        |
| C3fGhost              |          |        |        |        |        |        |        |        |        |        |
| C2f-Ghost-DynamicConv |          |        |        |        |        |        |        |        |        |        |
| DySample-SCAM         |          |        |        |        |        |        |        |        |        |        |
| SADH                  |          |        |        |        |        |        |        |        |        |        |
| Shape-IoU             |          |        |        |        |        |        |        |        |        |        |
| Precision             | 0.788    | 0.817  | 0.795  | 0.77   | 0.789  | 0.796  | 0.785  | 0.795  | 0.796  | 0.801  |
| Recall                | 0.779    | 0.764  | 0.783  | 0.783  | 0.759  | 0.779  | 0.784  | 0.781  | 0.79   | 0.798  |
| mAP@0.5               | 0.821    | 0.833  | 0.836  | 0.844  | 0.845  | 0.83   | 0.848  | 0.853  | 0.86   | 0.863  |
| mAP@0.5~0.95          | 0.59     | 0.599  | 0.602  | 0.593  | 0.618  | 0.58   | 0.608  | 0.601  | 0.602  | 0.603  |
| APfire                | 0.781    | 0.794  | 0.784  | 0.819  | 0.818  | 0.788  | 0.811  | 0.818  | 0.826  | 0.83   |
| APsmoke               | 0.86     | 0.873  | 0.888  | 0.869  | 0.872  | 0.873  | 0.884  | 0.888  | 0.893  | 0.897  |
| Parameters (M)        | 10.6     | 5.6    | 6.4    | 11.3   | 10.3   | 10.6   | 6.3    | 7      | 6.7    | 6.7    |
| FLOPs (G)             | 28.4     | 16.4   | 16.8   | 29.4   | 25.7   | 28.4   | 17     | 17.8   | 14.7   | 14.7   |
| Model Size (MB)       | 22.5     | 12.2   | 13.6   | 24     | 21.9   | 22.5   | 13.7   | 15.1   | 14.4   | 14.4   |
| FPS                   | 138.82   | 164.61 | 152.32 | 173.22 | 157.05 | 145.95 | 148.63 | 153.54 | 156.77 | 161.02 |
Table 4. Results of model comparison experiments.

| Model          | Precision | Recall | mAP@0.5 | mAP@0.5~0.95 | APfire | APsmoke | Parameters (M) | FLOPs (G) | Model Size (MB) |
|----------------|-----------|--------|---------|--------------|--------|---------|----------------|-----------|-----------------|
| YOLOv3         | 0.78      | 0.787  | 0.847   | 0.594        | 0.822  | 0.871   | 98.9           | 282.2     | 207.8           |
| YOLOv3-tiny    | 0.753     | 0.739  | 0.807   | 0.469        | 0.775  | 0.839   | 11.6           | 18.9      | 24.4            |
| YOLOv5s        | 0.775     | 0.77   | 0.816   | 0.565        | 0.783  | 0.848   | 8.7            | 23.8      | 18.5            |
| YOLOv5m        | 0.788     | 0.774  | 0.838   | 0.568        | 0.812  | 0.864   | 23.9           | 64        | 50.5            |
| YOLOv5l        | 0.778     | 0.755  | 0.837   | 0.548        | 0.804  | 0.87    | 50.7           | 134.7     | 106.8           |
| YOLOv5x        | 0.791     | 0.759  | 0.841   | 0.563        | 0.817  | 0.865   | 92.7           | 246       | 195             |
| YOLOv6s        | 0.768     | 0.787  | 0.84    | 0.575        | 0.808  | 0.871   | 15.5           | 44        | 32.8            |
| YOLOv6m        | 0.754     | 0.759  | 0.818   | 0.553        | 0.787  | 0.849   | 49.6           | 161.1     | 104.3           |
| YOLOv6l        | 0.74      | 0.734  | 0.799   | 0.544        | 0.778  | 0.821   | 105.7          | 391.2     | 222.2           |
| YOLOv6x        | 0.747     | 0.743  | 0.81    | 0.544        | 0.786  | 0.834   | 165            | 610.2     | 346.5           |
| YOLOv7         | 0.788     | 0.719  | 0.752   | 0.528        | 0.77   | 0.733   | 35.5           | 105.1     | 74.8            |
| YOLOv7-tiny    | 0.728     | 0.75   | 0.812   | 0.489        | 0.799  | 0.825   | 5.7            | 13.2      | 12.3            |
| YOLOv8n        | 0.807     | 0.754  | 0.813   | 0.576        | 0.774  | 0.853   | 2.9            | 8.1       | 6.2             |
| YOLOv8s        | 0.788     | 0.779  | 0.821   | 0.59         | 0.781  | 0.86    | 10.6           | 28.4      | 22.5            |
| YOLOv8m        | 0.789     | 0.785  | 0.856   | 0.591        | 0.825  | 0.888   | 24.6           | 78.7      | 52              |
| YOLOv8l        | 0.803     | 0.772  | 0.85    | 0.603        | 0.819  | 0.88    | 41.6           | 164.8     | 87.7            |
| YOLOv8x        | 0.798     | 0.772  | 0.851   | 0.594        | 0.817  | 0.885   | 65             | 257.4     | 136.7           |
| YOLOv9c        | 0.8       | 0.781  | 0.846   | 0.575        | 0.819  | 0.873   | 24.1           | 102.3     | 51.6            |
| YOLO-World     | 0.804     | 0.781  | 0.852   | 0.596        | 0.816  | 0.889   | 3.9            | 13.1      | 8.4             |
| Faster R-CNN   | 0.805     | 0.749  | 0.812   | 0.561        | 0.788  | 0.836   | 41.4           | 239.3     | 321             |
| LD-YOLO (Ours) | 0.801     | 0.798  | 0.863   | 0.603        | 0.83   | 0.897   | 6.7            | 14.7      | 14.4            |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
