Article

Low Complexity Forest Fire Detection Based on Improved YOLOv8 Network

1
School of Information and Technology (School of Artificial Intelligence), Beijing Forestry University, Beijing 100083, China
2
Engineering Research Center for Forestry-Oriented Intelligent Information Processing of National Forestry and Grassland Administration, Beijing 100083, China
3
State Key Laboratory of Efficient Production of Forest Resources, Beijing 100083, China
*
Author to whom correspondence should be addressed.
Forests 2024, 15(9), 1652; https://doi.org/10.3390/f15091652
Submission received: 1 September 2024 / Revised: 15 September 2024 / Accepted: 17 September 2024 / Published: 19 September 2024
(This article belongs to the Special Issue Forest Fires Prediction and Detection—2nd Edition)

Abstract

Forest fires pose a significant threat to ecosystems and communities. This study introduces innovative enhancements to the YOLOv8n object detection algorithm, significantly improving its efficiency and accuracy for real-time forest fire monitoring. By employing Depthwise Separable Convolution and Ghost Convolution, the model’s computational complexity is substantially reduced, making it suitable for deployment on resource-constrained edge devices. Additionally, Dynamic UpSampling and Coordinate Attention mechanisms enhance the model’s ability to capture multi-scale features and focus on relevant regions, improving detection accuracy for small-scale fires. The Distance-Intersection over Union loss function further optimizes the model’s training process, leading to more accurate bounding box predictions. Experimental results on a comprehensive dataset demonstrate that the proposed model achieves a 41% reduction in parameters and a 54% reduction in GFLOPs, while maintaining a high mean Average Precision (mAP) of 99.0% at an Intersection over Union (IoU) threshold of 0.5. The proposed model offers a promising solution for real-time forest fire monitoring, enabling timely detection of, and response to, wildfires.

1. Introduction

Forest fires pose a significant environmental threat with far-reaching ecological and societal consequences [1]. They not only devastate ecosystems and endanger human life and property but also contribute to global warming through substantial carbon dioxide emissions [2,3]. Therefore, the rapid and accurate identification of forest fires is of great importance.
Traditional methods, relying on human observation from lookout towers or ground patrols, are inadequate for large-scale monitoring [4]. These methods are resource-intensive due to their manpower requirements and suffer from inherent delays due to human response times and limited coverage areas [5]. This lack of real-time response hinders their effectiveness in critical situations.
Subsequently, the proliferation of forest fire imagery captured by satellites and drones has fueled research into image-based fire detection using artificial features. Hossain et al. [6] achieved success in fire and smoke detection using multi-color space and local binary pattern (LBP) features. Yang et al. [7] proposed the novel PreVM (Preferred Vector Machine) method to improve accuracy. Moreover, methods based on RGB [8], YCbCr [9] and optimal quality transmission optical flow [10] have been extensively explored. However, these methods often rely on hand-crafted features, making it challenging to define effective ones, especially for small fires. Additionally, they are prone to false alarms and struggle to adapt to diverse fire scenarios, limiting their overall effectiveness.
The rise of deep learning, particularly Convolutional Neural Networks (CNNs) [11,12,13], has revolutionized computer vision [14]. This has led to a surge in applying deep networks to forest fire target detection. Pioneering research by Zhang et al. [15] demonstrated improved detection accuracy by training a CNN with both full images and fine-grained patch classifiers. Sharma et al. [16] further advanced the field, using VGG16 and ResNet50 networks to achieve superior fire detection performance. Frizzi et al. [17] successfully implemented a CNN for video-based fire detection, outperforming traditional methods. While these studies establish a solid foundation for deep learning in forest fire detection, there is still room for improvement in accuracy and model efficiency.
In terms of improving accuracy, Bahhar et al. [18] proposed a model that combines YOLO architecture with a two-weighted voting ensemble CNN, while Qian et al. [19] introduced the ODConv module to enhance feature extraction, particularly for fires of varying shapes and sizes. Yang et al. [20] and Xue et al. [21] both focused on improving YOLOv8 and YOLOv5 networks through SCConv and CBAM attention modules, respectively. These modules enhance the model’s ability to adapt to the complexities of forest fires, particularly the detection of small fire targets. Furthermore, Xiao et al. [22] introduced the SimAm attention mechanism into YOLOv7 to further emphasize fire target features. Muksimova et al. [23] employed a novel two-pathway encoder–decoder-based model to enhance contextual understanding, effectively improving the extraction and representation of wildfire features. Shakhnoza et al. [24] designed a method based on attention feature maps within capsule networks to address challenges such as varying lighting conditions and occlusions. Despite these advancements, these methods still exhibit high false positive rates and lack significant improvements in model efficiency.
Alongside the focus on accuracy, researchers are actively developing lightweight models for broader deployment. Fan et al. [25] achieved a significant reduction in network complexity by replacing the original YOLOv4 feature extraction network with MobileNet and Depthwise Separable Convolution. Yun et al. [26] further optimized YOLOv8 by implementing the lightweight GhostConv module in the neck part and leveraging a knowledge distillation strategy. Jin et al. [27] and Huang et al. [28] explored similar approaches, utilizing GhostConv and Depthwise Separable Convolution to create lightweight models, with Huang et al. additionally incorporating SENet. While these approaches significantly reduce model complexity, they still encounter difficulties in detecting forest fires within complex backgrounds, especially when dealing with occluded fires.
Despite advancements in deep learning, the accurate and efficient detection of forest fires remains a significant challenge due to the irregular shapes, varying scales and occlusions of forest fire targets. This study introduces a novel, lightweight network architecture specifically designed to address these limitations and enhance detection performance. Firstly, we combine Ghost Convolution and Depthwise Separable Convolution to substantially reduce model parameters and computational costs without sacrificing accuracy, making the model suitable for deployment on edge devices. Secondly, we introduce Dynamic UpSampling to effectively fuse multi-scale features, enhancing the network’s ability to detect objects of various scales, especially small-scale fires. Additionally, we employ Coordinate Attention to enable the network to focus more on target regions, improving detection accuracy. Finally, we adopt the Distance-Intersection over Union (DIoU) loss function to achieve more accurate bounding box regression, especially for small objects. Through these designs, our model achieves a superior performance on multiple public datasets, demonstrating its effectiveness in forest fire detection tasks. Experimental results show that our model achieves high detection accuracy while maintaining a fast detection speed, meeting the requirements of real-time monitoring.
The remainder of this paper is structured as follows: Section 2 introduces the improved YOLOv8 network model. Section 3 and Section 4 present the evaluation methodology and experimental results that validate the efficacy of the proposed method. Finally, Section 5 discusses the findings and limitations, and Section 6 concludes the paper by summarizing the study’s key contributions.

2. Materials and Methods

2.1. You Only Look Once

YOLOv8 [29,30] is one of the most advanced detection algorithms in the current YOLO series, capable of performing detection, classification and segmentation tasks, which is shown in Figure 1.
It adopts an anchor-free approach, reducing the number of prediction boxes and speeding up Non-Maximum Suppression (NMS) [31]. Compared with the newly launched YOLOv9, YOLOv8 prioritizes stability with a comparable number of parameters, and the GFLOPs of YOLOv8 are significantly lower than those of YOLOv9. To prioritize lightweight efficiency for this experiment, we leveraged and enhanced the YOLOv8n architecture.
The YOLOv8 model consists of three main components: Backbone, Neck and Head. The Backbone’s primary role is feature extraction through a series of convolution layers. Concurrently, the Neck integrates the PANet structure [32] for feature fusion, an upgrade from FPN [33]. The Backbone and Neck incorporate the C2f structure, instead of YOLOv5’s C3 module, to capture more gradient information across different scales. The Head section of YOLOv8 introduces two key advancements: a decoupled head structure is adopted, separating classification and detection tasks for increased efficiency, and the model transitions to an anchor-free approach, aligning with the overall architectural philosophy.

2.2. Improvements for Lightweighting

2.2.1. Ghost Convolution

The Bottleneck module within YOLOv8’s C2f structure effectively captures gradient flow information and detailed features. However, its complexity stems from the iterative process, where the final input to the detection head combines three C2f outputs for multi-scale fire detection. Optimizing this Bottleneck for large-scale fires can streamline complexity without sacrificing network performance, while still ensuring the reliable detection of small fires.
Deep neural networks, such as ResNet-50 [34], often exhibit high computational complexity due to their use of standard convolutions. Ghost Convolution offers an efficient alternative by reducing redundant feature maps. It replaces traditional convolution with a combination of fewer convolution kernels and less costly linear operations [35]. GhostConv can be divided into three steps: First, a standard convolution is used to compress the number of channels and generate feature maps through regular convolutional operations. Second, simple linear operations (Depthwise Convolution and Identity) are applied to process the feature maps obtained in the previous step. Finally, the feature maps obtained from the first two steps are concatenated. Figure 2 illustrates the Ghost Convolution architecture.
Assume an input feature $X \in \mathbb{R}^{c \times h \times w}$, a convolution kernel $f \in \mathbb{R}^{c \times k \times k \times n}$ and an output feature $Y \in \mathbb{R}^{h' \times w' \times n}$. Here, $c$ represents the number of channels, $k$ denotes the size of the convolution kernel, $n$ signifies the number of convolution kernels and $h'$ and $w'$ represent the height and width of the output, respectively. The regular convolution operation can be denoted by $Y = X * f + b$, and its computational complexity is $n \times h' \times w' \times c \times k \times k$. In the Ghost module, $m$ original feature maps ($m \le n$) are initially derived through a convolutional transformation. A cheap linear transformation is then applied to each of these $m$ feature maps to produce $s$ feature maps, resulting in a total of $n = m \times s$ feature maps.
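To make this concrete, the following is a minimal GhostConv sketch in PyTorch, assuming a ratio of 2 (half of the output channels come from the cheap depthwise branch) and a 5 × 5 cheap kernel; the layer names, activation choice and defaults are illustrative rather than the exact implementation used in this paper.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    def __init__(self, c_in, c_out, k=1, s=1, ratio=2, cheap_k=5):
        super().__init__()
        c_primary = c_out // ratio   # "m" intrinsic feature maps
        c_cheap = c_out - c_primary  # phantom feature maps from cheap linear ops
        # Step 1: standard convolution compresses channels
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_primary, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_primary),
            nn.SiLU(),
        )
        # Step 2: cheap linear operation (depthwise conv) on the primary maps
        self.cheap = nn.Sequential(
            nn.Conv2d(c_primary, c_cheap, cheap_k, 1, cheap_k // 2,
                      groups=c_primary, bias=False),
            nn.BatchNorm2d(c_cheap),
            nn.SiLU(),
        )

    def forward(self, x):
        y = self.primary(x)
        # Step 3: concatenate intrinsic and phantom feature maps
        return torch.cat([y, self.cheap(y)], dim=1)

x = torch.randn(1, 64, 80, 80)
print(GhostConv(64, 128, k=3)(x).shape)  # torch.Size([1, 128, 80, 80])
```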
By incorporating GhostConv, the Bottleneck is transformed into a GBottleneck, and the corresponding C2f module becomes a GhostC2f module. While the GhostC2f module effectively reduces model complexity, it cannot be universally applied across all stages due to the significant increase in network layer count. Excessive layers not only slow down inference speed but also introduce issues like overfitting, gradient vanishing or exploding, which can degrade the training effectiveness of the network model. Therefore, GhostC2f replaces only the C2f before SPPF and the C2f serving as input to the large fire detection head. This targeted optimization maintains accurate small fire detection while significantly reducing computational overhead.

2.2.2. Depthwise Separable Convolution

YOLOv8’s detection head comprises three sections, each aligned with the output of the network’s final three C2f layers. For a 640 × 640 input image, this architecture generates detections at three scales (80 × 80, 40 × 40 and 20 × 20), corresponding to different scales of fire detection. While the detection head’s convolutional layers can adapt to capture large object characteristics and optimize small target feature extraction, redundant information may accumulate after the preceding feature processing stages. To address this, ordinary convolution can be replaced with Depthwise Separable Convolution (DSConv), creating the new DSDetect module. This optimization enhances computational efficiency without compromising detection accuracy.
Depthwise Separable Convolution [36] breaks down the convolution process into two separate operations: Depthwise Convolution and Pointwise Convolution. Depthwise Convolution (DW) operates by convolving each input channel independently, thereby reducing the parameter count. This approach also utilizes weight sharing, where the convolution kernel applies separately to each channel, fostering parameter efficiency and simplifying the model structure. For a standard three-channel color image $X \in \mathbb{R}^{3 \times h \times w}$, Depthwise Convolution performs 2D convolutions on each channel, generating three feature maps $\alpha_1, \alpha_2, \alpha_3 \in \mathbb{R}^{1 \times h \times w}$. Figure 3 illustrates this process.
Depthwise Convolution preserves the number of input channels, operating independently on each channel without considering cross-channel information. To introduce inter-channel interactions and enable channel expansion, Pointwise Convolution is introduced, utilizing a convolution kernel $k \in \mathbb{R}^{c \times 1 \times 1}$, where $c$ represents the number of input channels (in Figure 4, $c = 3$). Following the Pointwise Convolution operation, the input feature maps are combined using weights across the depth dimension. The number of output feature maps is determined by the number of Pointwise Convolution kernels, allowing for channel expansion. Figure 4 illustrates this process.
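As an illustration of the DW + PW decomposition just described, here is a minimal Depthwise Separable Convolution block in PyTorch; the BatchNorm/SiLU pairing mirrors common YOLO-style blocks and is an assumption, not a detail taken from the paper.

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        # Depthwise: one k x k filter per input channel, no channel mixing
        self.dw = nn.Conv2d(c_in, c_in, k, s, k // 2, groups=c_in, bias=False)
        # Pointwise: 1 x 1 convolution to mix channels and set the output width
        self.pw = nn.Conv2d(c_in, c_out, 1, 1, 0, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))
```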
The computational cost ratio of DSConv to traditional convolution can be expressed as in Equation (1), where $N$ denotes the number of output channels and $D_F$ represents the size of the convolution kernel:

$\dfrac{\mathrm{DSC}}{\mathrm{Conv}} = \dfrac{1}{N} + \dfrac{1}{D_F^{2}}$  (1)
Equation (1) clearly illustrates that DSConv achieves significantly higher computational efficiency compared to traditional convolution. The DSDetect module composed of DSConv separates spatial convolution and channel convolution, significantly reducing the parameters of the Head module. Since the Head module is the final layer of the network model, prior layers have already learned abundant redundant information. At this stage, employing DSConv allows unnecessary, even harmful background information in forest fire images to be ignored, thereby enhancing the network’s focus on relevant forest fire information.
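As a quick, hedged sanity check of Equation (1): with an assumed 3 × 3 kernel ($D_F = 3$) and $N = 256$ output channels, the ratio evaluates to roughly 0.115, i.e. DSConv needs about one ninth of the multiply–add operations of a standard convolution.

```python
# Numeric check of Equation (1) under assumed values N = 256 and D_F = 3.
N, D_F = 256, 3
ratio = 1 / N + 1 / (D_F ** 2)
print(f"DSConv / Conv cost ratio: {ratio:.3f}")  # ~0.115, about a 9x reduction
```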

2.3. Improvements in Accuracy

2.3.1. Coordinate Attention

Forest fire detection presents significant challenges due to complex backgrounds, variable fire spot locations and frequent occlusion by foliage. Additionally, the dynamic nature of forest fires, influenced by wind and other environmental factors, can lead to unpredictable shapes and even detached flames, further complicating detection. Coordinate Attention addresses these challenges by enabling the model to effectively capture and utilize spatial information within the input data. By incorporating coordinate encodings or directly leveraging input coordinates, the network gains a better understanding of the spatial structure of forest fires, enhancing detection accuracy.
Coordinate Attention (CA) [37] is an innovative and efficient mechanism designed to alleviate the computational overhead associated with attention mechanisms in mobile networks. The CA module employs precise location information to encode channel relationships and long-range dependencies, similar to the SE [38] module. It involves two steps: coordinate information embedding and coordinate attention generation, as depicted in Figure 5.
Coordinate information embedding involves encoding spatial details into channel descriptors. While global pooling is commonly used for this purpose, it can lead to a loss of precise location information. To address this, we decompose global pooling into two one-dimensional feature encoding operations. Specifically, two pooling kernels with dimensions (H,1) and (1,W) are applied to each input channel, capturing vertical and horizontal information, respectively. The resulting outputs of the c-th channel at height h and at width w are defined in Equations (2) and (3):

$z_c^{h}(h) = \dfrac{1}{W} \sum_{0 \le i < W} x_c(h, i)$  (2)

$z_c^{w}(w) = \dfrac{1}{H} \sum_{0 \le j < H} x_c(j, w)$  (3)
By aggregating features horizontally and vertically, we can construct an attention map that captures comprehensive spatial information. This mechanism allows the network to identify long-range dependencies within forest fires while maintaining precise positional cues, facilitating the localization of obscured fire sources. Coordinate attention generation begins by concatenating the two feature maps produced during coordinate information embedding. Subsequently, a shared 1 × 1 convolution $F_1$ is used to produce an intermediate feature map $f \in \mathbb{R}^{C/r \times (H+W)}$ covering both the horizontal and vertical directions, where $r$ represents the downsampling ratio. The feature map is computed as in Equation (4):

$f = \delta(F_1([z^{h}, z^{w}]))$  (4)
Subsequently, $f$ is split along the spatial dimension into two separate tensors, $f^{h} \in \mathbb{R}^{C/r \times H}$ and $f^{w} \in \mathbb{R}^{C/r \times W}$. After that, two 1 × 1 convolution kernels, $F_h$ and $F_w$, are employed to restore the number of channels to that of the input $X$. The corresponding calculations are given in Equations (5) and (6), respectively:

$g^{h} = \sigma(F_h(f^{h}))$  (5)

$g^{w} = \sigma(F_w(f^{w}))$  (6)
Finally, the resulting outputs are expanded and applied as attention weights, as shown in Equation (7):

$y_c(i, j) = x_c(i, j) \times g_c^{h}(i) \times g_c^{w}(j)$  (7)
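A compact Coordinate Attention sketch following Equations (2)–(7) is given below; the reduction ratio r, the minimum bottleneck width and the h-swish non-linearity used for δ are assumptions carried over from the CA paper rather than details confirmed by this study.

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    def __init__(self, channels, r=32):
        super().__init__()
        mid = max(8, channels // r)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # Eq. (2): pool along W
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # Eq. (3): pool along H
        self.conv1 = nn.Conv2d(channels, mid, 1, bias=False)  # shared F1
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()                      # delta in Eq. (4)
        self.conv_h = nn.Conv2d(mid, channels, 1)      # F_h in Eq. (5)
        self.conv_w = nn.Conv2d(mid, channels, 1)      # F_w in Eq. (6)

    def forward(self, x):
        b, c, h, w = x.shape
        zh = self.pool_h(x)                              # (b, c, h, 1)
        zw = self.pool_w(x).permute(0, 1, 3, 2)          # (b, c, w, 1)
        f = self.act(self.bn1(self.conv1(torch.cat([zh, zw], dim=2))))  # Eq. (4)
        fh, fw = torch.split(f, [h, w], dim=2)
        gh = torch.sigmoid(self.conv_h(fh))                      # Eq. (5)
        gw = torch.sigmoid(self.conv_w(fw.permute(0, 1, 3, 2)))  # Eq. (6)
        return x * gh * gw                                       # Eq. (7)
```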
To enhance the focus on valuable information from input data, we strategically integrate CA modules before the initial four C2f blocks in the feature extraction stage. This integration enables CA to adaptively learn weights based on input variations and provide more precise forest fire information for subsequent C2f modules, thereby enhancing network performance.

2.3.2. Omni-Dimensional Dynamic Convolution

The limited information provided by small and occluded fire points poses a significant challenge for accurate detection. To address this issue, it is necessary to enhance the network’s feature extraction capabilities.
Omni-Dimensional Dynamic Convolution (ODConv) [39] incorporates attention mechanisms to dynamically determine optimal convolution kernel parameters along four dimensions: the kernel number, the spatial size, the input channels and the output channels. Assume that the four attention scalars of ODConv computed by the multi-head attention module are $\alpha_{wi} \in \mathbb{R}$, $\alpha_{fi} \in \mathbb{R}^{k \times k}$, $\alpha_{ci} \in \mathbb{R}^{c_{in}}$ and $\alpha_{si} \in \mathbb{R}^{c_{out}}$, where $\alpha_{wi}$ denotes the scalar associated with the convolution kernel $W_i$ and $\odot$ represents multiplication along the corresponding dimension of the kernel space. The ODConv operation is given in Equation (8):

$y = (\alpha_{w1} \odot \alpha_{f1} \odot \alpha_{c1} \odot \alpha_{s1} \odot W_1 + \cdots + \alpha_{wn} \odot \alpha_{fn} \odot \alpha_{cn} \odot \alpha_{sn} \odot W_n) * x$  (8)
To obtain the convolution kernels $W_i$ and the four attention values, the input $x$ is first processed through per-channel Global Average Pooling (GAP), yielding a feature vector of length $c_{in}$. This vector is then compressed using a fully connected (FC) layer. Subsequently, it is passed through four head branches, each consisting of an FC layer followed by either Sigmoid or Softmax normalization. The FC layer shapes the compressed feature vector into the required form, while Sigmoid and Softmax normalize the outputs. Figure 6 illustrates the calculation process of ODConv.
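The following simplified sketch mirrors the GAP → FC → four-head structure of Figure 6 and the kernel aggregation of Equation (8); the number of candidate kernels, the reduction ratio and the grouped-convolution trick used to apply per-sample kernels are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ODConv2d(nn.Module):
    def __init__(self, c_in, c_out, k=3, n_kernels=4, reduction=16):
        super().__init__()
        self.k, self.c_in, self.c_out, self.n = k, c_in, c_out, n_kernels
        mid = max(8, c_in // reduction)
        self.fc = nn.Sequential(nn.Linear(c_in, mid), nn.ReLU())
        # Four attention heads over the kernel-space dimensions
        self.att_spatial = nn.Linear(mid, k * k)     # alpha_f: k x k positions
        self.att_cin = nn.Linear(mid, c_in)          # alpha_c: input channels
        self.att_cout = nn.Linear(mid, c_out)        # alpha_s: output channels
        self.att_kernel = nn.Linear(mid, n_kernels)  # alpha_w: kernel number
        # n candidate convolution kernels
        self.weight = nn.Parameter(torch.randn(n_kernels, c_out, c_in, k, k) * 0.02)

    def forward(self, x):
        b = x.size(0)
        ctx = self.fc(x.mean(dim=(2, 3)))                         # GAP + FC
        a_f = torch.sigmoid(self.att_spatial(ctx)).view(b, 1, 1, 1, self.k, self.k)
        a_c = torch.sigmoid(self.att_cin(ctx)).view(b, 1, 1, self.c_in, 1, 1)
        a_s = torch.sigmoid(self.att_cout(ctx)).view(b, 1, self.c_out, 1, 1, 1)
        a_w = torch.softmax(self.att_kernel(ctx), dim=1).view(b, self.n, 1, 1, 1, 1)
        # Equation (8): weighted sum over the n kernels, modulated per dimension
        w = (a_w * a_f * a_c * a_s * self.weight.unsqueeze(0)).sum(dim=1)
        w = w.reshape(b * self.c_out, self.c_in, self.k, self.k)
        out = F.conv2d(x.reshape(1, b * self.c_in, *x.shape[2:]), w,
                       padding=self.k // 2, groups=b)
        return out.reshape(b, self.c_out, *x.shape[2:])
```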
ODConv employs a multi-dimensional attention mechanism that modulates the convolution kernel based on kernel size, filter, channel and location. These attention components interact synergistically to capture comprehensive contextual information, enabling the network to accurately distinguish forest fires from background clutter. As a result, ODConv significantly enhances feature extraction, leading to improved detection of obscured fire spots and robust performance in complex background conditions.
By replacing the convolution operation in the main part of the network with ODConv, the weight of the convolution kernel can be dynamically adjusted based on variations in the input image. This facilitates more efficient extraction of pertinent features from forest fire images characterized by intricate backgrounds, thereby augmenting the network’s capacity for forest fire recognition.

2.3.3. Dynamic UpSampling

YOLOv8 employs a multi-scale feature pyramid architecture, where feature maps at different stages capture varying levels of detail and semantic information. UpSampling layers are crucial for integrating these feature maps, enabling the network to accurately detect targets of various scales, including small fire points.
In the original YOLOv8 network, the adopted upSampling method is nearest neighbor interpolation [40]. However, it has several drawbacks: This method only considers the nearest data point and does not adequately incorporate information from surrounding points, leading to discontinuous block effects in images or data. Additionally, nearest neighbor interpolation may produce jagged edge effects, making edges appear less smooth. Moreover, it can introduce distortion issues, particularly noticeable when enlarging images. These limitations hinder the network’s ability to effectively process complex forest fire images, which often contain intricate backgrounds, diverse fire sizes and obscured small fire points. Consequently, an alternative upSampling technique is essential to enhance the network’s performance.
DySample [41] is a lightweight and efficient dynamic upSampler, which stands out from kernel-based dynamic upSamplers, such as CARAFE [42], FADE [43] and SAPA [44], due to its point sampling approach and more streamlined design. Figure 7 illustrates the process of dynamic upSampling. Given a feature map $\chi \in \mathbb{R}^{C \times H \times W}$ and a sampling set $S \in \mathbb{R}^{2 \times sH \times sW}$, where the first dimension of size 2 holds the x and y coordinates, the grid_sample function uses the positions in $S$ to resample $\chi$ (with assumed bilinear interpolation) into $\chi' \in \mathbb{R}^{C \times sH \times sW}$. This process is defined in Equation (9):

$\chi' = \mathrm{grid\_sample}(\chi, S)$  (9)
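Below is a simplified point-sampling upsampler in the spirit of DySample and Equation (9), assuming a scale factor s, a 1 × 1 convolution that predicts per-pixel offsets and a 0.25 offset scaling; these details follow the DySample paper's design rather than anything specified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleLite(nn.Module):
    def __init__(self, channels, scale=2):
        super().__init__()
        self.scale = scale
        # Predict 2*s*s offsets (x, y) per input position, then pixel-shuffle
        self.offset = nn.Conv2d(channels, 2 * scale * scale, 1)
        nn.init.zeros_(self.offset.weight)
        nn.init.zeros_(self.offset.bias)

    def forward(self, x):
        b, c, h, w = x.shape
        s = self.scale
        # Offsets at the upsampled resolution, kept small around the init grid
        off = F.pixel_shuffle(0.25 * self.offset(x), s)            # (b, 2, sh, sw)
        # Base sampling grid in normalized [-1, 1] coordinates
        ys = torch.linspace(-1, 1, s * h, device=x.device)
        xs = torch.linspace(-1, 1, s * w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).expand(b, -1, -1, -1)  # (b, sh, sw, 2)
        # Sampling set S = base grid + learned offsets (Equation (9))
        grid = grid + off.permute(0, 2, 3, 1)
        return F.grid_sample(x, grid, mode="bilinear",
                             align_corners=True, padding_mode="border")
```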

2.3.4. Distance-Intersection over Union

YOLOv8 originally employs the Complete-Intersection over Union (CIoU) [45] loss function for bounding box regression. While CIoU incorporates overlap area, center point distance and aspect ratio, it may not adequately address the challenges posed by small fire detection. DIoU [46] loss, which focuses on center point distance, offers a more effective approach for localizing small fires, particularly considering the diverse shapes and sizes of forest fires.
If we denote the center of the predicted box as $b$ and the center of the ground-truth box as $b^{gt}$, then $\rho^2(b, b^{gt})$ represents the squared distance between the two center points, and $c^2$ denotes the square of the diagonal length of the smallest box enclosing both the predicted and ground-truth boxes. Consequently, the DIoU can be expressed as in Equation (10):

$\mathrm{DIoU} = \mathrm{IoU} - \dfrac{\rho^{2}(b, b^{gt})}{c^{2}}$  (10)
By optimizing the DIoU, the distance between the centers of the predicted box and the true box is directly minimized, which speeds up convergence. A schematic diagram of its structure is shown in Figure 8.
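For reference, a minimal DIoU computation for axis-aligned boxes in (x1, y1, x2, y2) format is sketched below; it is a generic implementation of Equation (10), not the exact loss code used in the training pipeline, and the regression loss would typically be taken as 1 − DIoU.

```python
import torch

def diou(box1, box2, eps=1e-7):
    """DIoU (Equation (10)) for boxes given as (..., 4) tensors in xyxy format."""
    # Intersection and union for the IoU term
    x1 = torch.max(box1[..., 0], box2[..., 0])
    y1 = torch.max(box1[..., 1], box2[..., 1])
    x2 = torch.min(box1[..., 2], box2[..., 2])
    y2 = torch.min(box1[..., 3], box2[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area1 = (box1[..., 2] - box1[..., 0]) * (box1[..., 3] - box1[..., 1])
    area2 = (box2[..., 2] - box2[..., 0]) * (box2[..., 3] - box2[..., 1])
    iou = inter / (area1 + area2 - inter + eps)
    # rho^2(b, b_gt): squared distance between box centers
    cb1 = (box1[..., :2] + box1[..., 2:]) / 2
    cb2 = (box2[..., :2] + box2[..., 2:]) / 2
    rho2 = ((cb1 - cb2) ** 2).sum(dim=-1)
    # c^2: squared diagonal of the smallest box enclosing both boxes
    ew = torch.max(box1[..., 2], box2[..., 2]) - torch.min(box1[..., 0], box2[..., 0])
    eh = torch.max(box1[..., 3], box2[..., 3]) - torch.min(box1[..., 1], box2[..., 1])
    c2 = ew ** 2 + eh ** 2 + eps
    return iou - rho2 / c2  # the regression loss is typically 1 - DIoU
```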

2.4. A Lightweight Model for Detecting Forest Fire

In summary, after the above lightweight transformation and accuracy optimization, the final YOLOv8 network structure diagram is shown in Figure 9.

3. Methods for Evaluation

3.1. The Dataset

The FLAME dataset [47], consisting of UAV-captured forest fire images in 3480 × 2160 JPEG format, served as the foundation for model training. These images were extracted from videos recorded by UAVs equipped with various cameras and zoom levels, encompassing 12 distinct fire points. To ensure data quality, image filtering was applied to address background similarities that could potentially interfere with model training. However, the resulting dataset remained limited, necessitating data augmentation to prevent overfitting and enhance generalization.
A subset of 300 representative images was carefully selected from the filtered dataset. To expand this limited dataset, the Imgaug library was employed to apply a range of geometric and color transformations, including rotation, flipping, cropping, scaling, noise addition, blurring and color perturbation. This augmentation process yielded a dataset of 5376 images. Finally, the forest fire dataset was partitioned into training, test and validation sets in a ratio of 7:2:1. A visual representation of the dataset is presented in Figure 10.
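A hedged sketch of such an augmentation pipeline with the Imgaug library is shown below; the exact transform parameters used to build the 5376-image dataset are not reported, so the ranges here are purely illustrative, and bounding boxes would need to be transformed alongside the images.

```python
import imgaug.augmenters as iaa

augmenter = iaa.Sequential([
    iaa.Fliplr(0.5),                                 # horizontal flip
    iaa.Affine(rotate=(-25, 25), scale=(0.8, 1.2)),  # rotation and scaling
    iaa.Crop(percent=(0, 0.1)),                      # random cropping
    iaa.AdditiveGaussianNoise(scale=(0, 0.03 * 255)),
    iaa.GaussianBlur(sigma=(0.0, 1.0)),
    iaa.MultiplyHueAndSaturation((0.8, 1.2)),        # color perturbation
], random_order=True)

# images: a list/array of HxWx3 uint8 images loaded from the filtered subset
# augmented = augmenter(images=images)
```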
Although FLAME is built on a region-specific dataset, the optimizations discussed in this paper enhance the network’s feature recognition accuracy. Additionally, the ODConv convolution dynamically adjusts based on the input image features, further improving the network’s detection performance across various images. Consequently, the wildfire features learned from the FLAME dataset demonstrate strong generalization capability when transferred and trained on different datasets, as evidenced by the experiments conducted on self-built datasets presented later in this paper.

3.2. Evaluation of the Model

In order to better evaluate the performance of the model, AP0.5 and AP0.5:0.95 are selected to evaluate its accuracy, while the number of parameters and floating-point operations are used to assess its computational complexity and lightweight design.
1. Average Precision
For single-target detection, Average Precision (AP) is equivalent to Precision. Precision measures the proportion of correct positive predictions among all positive predictions. The confusion matrix comprises four components: True Positives (TP), False Negatives (FN), False Positives (FP) and True Negatives (TN). TP denotes positive samples correctly predicted as positive, FN refers to positive samples misclassified as negative, FP indicates negative samples incorrectly classified as positive and TN represents negative samples correctly predicted as negative.
The Equation for Precision is given in (11):

$\mathrm{Precision} = \dfrac{TP}{TP + FP}$  (11)
The Equation for Recall (the proportion of actual positive samples that are correctly predicted) is defined in (12):

$\mathrm{Recall} = \dfrac{TP}{TP + FN}$  (12)
In object detection, AP usually refers to the area under the Precision–Recall curve, which is used to comprehensively evaluate the performance of the model, as defined in (13):

$AP = \dfrac{1}{r} \sum_{i=1}^{r} p_i$  (13)
Intersection over Union (IoU) quantifies the overlap between predicted and ground-truth bounding boxes. AP0.5 represents the Average Precision calculated at an IoU threshold of 50%. AP0.5:0.95 is a stricter metric, averaging AP values across IoU thresholds ranging from 50% to 95%.
2. The number of parameters
Model parameters encompass the learnable components of a neural network, including weights, biases and thresholds. These parameters are optimized during training to minimize the model’s error. The number of parameters is a key indicator of model complexity, with a larger parameter count generally correlating to increased expressive power but also higher computational costs.
3. Giga Floating-point Operations
Giga Floating-point Operations (GFLOPs) measure the number of floating-point operations a model performs during inference, reflecting its computational complexity. This metric is essential for evaluating model efficiency and suitability for deployment on resource-constrained platforms, such as edge devices or mobile systems. Forest fire recognition equipment often operates in challenging environments, demanding models with minimal computational overhead.
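To make these metrics concrete, the following sketch computes IoU and precision/recall from raw counts and shows one common way to obtain parameter counts and (G)FLOPs for a PyTorch model; the thop profiler is an assumed, commonly used helper and is not named in the paper.

```python
import torch

def iou(box_a, box_b, eps=1e-7):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + eps)

def precision_recall(tp, fp, fn, eps=1e-7):
    """Equations (11) and (12) from TP/FP/FN counts at a given IoU threshold."""
    return tp / (tp + fp + eps), tp / (tp + fn + eps)

# Complexity metrics for a PyTorch model (e.g., the improved YOLOv8n):
# n_params = sum(p.numel() for p in model.parameters())
# from thop import profile   # assumed third-party profiler, not named in the paper
# macs, params = profile(model, inputs=(torch.zeros(1, 3, 640, 640),))
# GFLOPs is then commonly reported as macs / 1e9 (or 2 * macs / 1e9, by convention)
```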

3.3. Comparison with Other Models

To comprehensively evaluate the proposed network, comparative analysis against state-of-the-art object detectors is essential. In this study, SSD, Faster R-CNN, Cascade R-CNN, EfficientDet, YOLOv7 and YOLOv9 are selected for comparison:
  • SSD: The Single Shot MultiBox Detector [48] is a mainstream one-stage object detection algorithm, in contrast to two-stage methods such as Faster R-CNN. With optimization, its recognition accuracy can reach a high level, and because detection is performed in a single pass, SSD has a clear speed advantage, making it especially suitable for application scenarios that require real-time processing.
  • Faster R-CNN: Faster R-CNN [49] consists of two main components: a Region Proposal Network (RPN), which generates high-quality object region proposals, and Fast R-CNN, which detects and classifies objects. It performs well in terms of both speed and accuracy.
  • Cascade R-CNN: Cascade R-CNN [50] is a multi-stage object detection architecture that demonstrates significant advantages in detecting small objects. Firstly, the multi-stage structure of Cascade R-CNN allows each detector at different stages to incrementally refine the position and size of candidate boxes, which facilitates the localization of small objects. Secondly, by progressively increasing the IoU threshold, the network ensures that the candidate boxes fed into each detector are of higher quality, thereby enhancing the accuracy of small object detection.
  • EfficientDet: EfficientDet [51] is an efficient object detection model based on the EfficientNet architecture. This network facilitates the effective integration of features at different scales, allowing features of small objects to be enhanced at higher levels, thereby improving the accuracy of small object detection. Additionally, it incorporates an optimized anchor box generation strategy, which better accommodates the detection requirements for small objects.
  • YOLOv7 and YOLOv9: Among recently introduced detection models, YOLOv7 [52] stands out for its remarkable performance in terms of speed and accuracy. It achieves impressive frame rates across different hardware platforms without compromising detection precision. As the latest model in this series, YOLOv9 [53] integrates advanced feature extraction and fusion techniques to enhance the precise identification of targets with varied scales and shapes, marking a significant advancement in this model series.

4. Results

4.1. The Environment for Training and Hyper-Parameters

Table 1 shows the experimental environment. The experiment employs the Ubuntu operating system, Pytorch as the deep learning framework and GPU acceleration for the training process. The training equipment is from the Hyper Application Inventor of Tencent Cloud in China.
The training parameters are shown in Table 2. Considering YOLO’s automatic preservation of the best training outcomes, we set the number of epochs to a substantial value of 400. The image dimensions are 640 × 640, while the learning rate and batch size are configured as 0.01 and 32, respectively.
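For reproducibility, the configuration in Tables 1 and 2 roughly corresponds to the following Ultralytics training call; the dataset YAML path is a placeholder, and in practice the improved architecture would be loaded from a custom model configuration file rather than the stock yolov8n.yaml.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.yaml")      # baseline; replace with the improved model config
model.train(
    data="forest_fire.yaml",      # hypothetical dataset description file
    epochs=400,                   # large value; best weights are kept automatically
    imgsz=640,                    # 640 x 640 input images
    batch=32,
    lr0=0.01,                     # initial learning rate
    device=0,                     # train on GPU
)
```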

4.2. Ablation Experiments

Ablation experiments were conducted to systematically evaluate the impact of individual components on model performance. By removing or adding specific features, we gained insights into the model’s architecture and the contributions of different modules. All comparative experiments were performed on the same dataset to ensure consistent evaluation. The original YOLOv8 model served as the baseline, while LYOLOv8, with the incorporation of GhostC2f and DSDetect, represented the lightweight variant. Table 3 summarizes the experimental results.
This study aims to develop a lightweight yet high-performance forest fire detection network based on YOLOv8. To achieve this, we focus on reducing model complexity while preserving accuracy. First, GhostConv is integrated into the Bottleneck module, forming YOLOv8–GhostC2f. As shown in Table 3, the parameters of YOLOv8–GhostC2f decrease from 3.01M to 2.44M, a reduction of approximately 18.9%, and GFLOPs decrease from 8.2G to 7.7G. Moreover, the AP0.5 and AP0.5:0.95 metrics remain nearly unchanged, indicating the significant effectiveness of GhostConv in achieving lightweight goals. Second, DSConv is utilized to replace ordinary convolutions in the detect stage, resulting in YOLOv8–DSDetect. This substitution reduces the parameters from 3.01M to 2.23M, a decrease of 25.9%, and GFLOPs from 8.2G to 5.3G, a reduction of 35.4%. Although there is a slight decline in AP performance, the substantial reduction in model complexity suggests that employing DSConv for lightweight optimization is reasonable. Finally, combining these two lightweight modules forms the LYOLOv8 network. While the AP0.5:0.95 metric decreases slightly from 86.1% to 84.8%, the parameters decrease from 3.01M to 1.72M, a reduction of 42.3%, and GFLOPs decrease from 8.2 to 4.9, a reduction of approximately 40.2%. This demonstrates that the network design of LYOLOv8 achieves model lightweighting while maintaining performance.
To ensure optimal performance while reducing model complexity, LYOLOv8 incorporates DySample, CA, ODConv and DIoU modules. These components collectively enhance AP0.5:0.95, indicating improved accuracy. While DySample and CA marginally increase model complexity, they contribute substantially to overall performance. ODConv uniquely reduces complexity while boosting accuracy, and DIoU refines localization. The final LYOLOv8 model achieves a 41% parameter reduction and a 54% GFLOPs reduction compared to the baseline, demonstrating significant computational efficiency gains. Notably, AP0.5 remains at 99.0%, and AP0.5:0.95 improves to 87.1%, showcasing the model’s ability to maintain accuracy while reducing resource consumption.
In addition to the aforementioned experiments, supplementary experiments were conducted to further validate the efficacy of the modules introduced in this study.
1. Effectiveness of DySample
The role of the upSampling layer is to enlarge the low-resolution feature map so that it aligns with the high-resolution feature map, thereby enhancing the network’s capability to capture intricate details of small fire targets, which benefits detection and subsequent tasks. To further validate the effectiveness of DySample, we conducted a comparative analysis of two upSampling algorithms, Nearest and Content-Aware Reassembly of FEatures (CARAFE), based on LYOLOv8. The corresponding results are presented in Table 4, showing that DySample performs best on Recall.
In comparison, it is evident that both CARAFE and DySample outperform the original Nearest algorithm on this network. Although CARAFE demonstrates slightly superior overall performance compared to DySample, the additional complexity introduced by CARAFE surpasses that of DySample. Considering the comparable upSampling performance between the two algorithms, we chose to adopt the DySample algorithm.
2. Effectiveness of CA
The primary objective of the attention mechanism is to highlight crucial information that is relevant to the current task goal. This mechanism can be viewed as a dynamic adjustment of weights for input image features. In this experiment, we explored various attention mechanisms based on LYOLOv8–DySample; the pertinent results are illustrated in Table 5.
The incorporation of an attention mechanism enhances the network’s ability to selectively attend to forest fire information, thereby augmenting its capability in detecting small fires. To validate the efficacy of CA, it is compared against Channel Attention, the Convolutional Block Attention Module (CBAM) and the original network. Based on our experiments, we observe that the combined attention mechanisms outperform the original network overall, with CA exhibiting superior performance by achieving the highest AP0.5:0.95 and Precision while introducing minimal complexity to the network.
3. Effectiveness of DIoU
The loss function plays a crucial role in model training, facilitating better convergence and acquisition of valuable information. In the experiments, CIoU, MPDIoU, GIoU, InnerIoU, SIoU and DIoU were selected as loss functions, respectively, based on the LYOLOv8–DySample–CA–ODConv network. Through experimental data analysis, it is evident that choosing DIoU as the loss function yields superior performance with AP0.5:0.95, reaching a maximum value of 87.1%. The corresponding results are presented in Table 6.
From the conducted experiments, it is evident that different loss functions exhibit varied performance on this network. CIoU achieves maximum Precision but minimum Recall, implying a higher missed detection rate. In contrast, SIoU achieves the highest recall but lacks high precision, resulting in a higher false alarm rate and poor performance on the main metric AP0.5:0.95. Conversely, DIoU demonstrates strong performance across precision, recall and AP0.5:0.95. Additionally, the loss functions listed in Table 6, including GIoU, introduce additional penalty terms for optimization compared to traditional loss functions. Apart from InnerIoU, DIoU exhibits the smallest box_loss, indicating that the model’s predicted bounding boxes have minimal discrepancies with the ground truth boxes, leading to better localization of the actual objects. In summary, DIoU demonstrates superior performance across overall metrics, thereby enhancing the network’s performance and making it more suitable for detecting small target forest fires.

4.3. Display of Object Detection Results

Through the aforementioned network design and subsequent model training, we employed both the original YOLOv8 network and our improved model for prediction on the FLAME dataset. Figure 11 presents comparative results, where the first column displays original images, the middle column shows detections from YOLOv8 and the last column illustrates the results of our proposed model.
Comparative analysis of detection results reveals that both YOLOv8 and the proposed model effectively identify large-scale fires, achieving high confidence levels (approximately 0.9). However, for small-scale fires, the proposed model demonstrates superior performance. YOLOv8 frequently misses distant or occluded small fires, as evident in Figure 11. As illustrated in Figure 11b,c, YOLOv8 exhibits difficulties in detecting distant fires, often missing these targets. In contrast, our proposed model successfully identifies these distant fires with confidence levels above 0.5. Moreover, Figure 11a,d,e demonstrates YOLOv8’s limitations in detecting small, occluded fires due to insufficient feature extraction. Our model, specifically optimized for these challenging conditions, effectively extracts relevant features, leading to improved detection performance. Despite these achievements, false positives still persist. As illustrated in Figure 11f,g, our model’s detection results exhibit false positives. This issue arises partly from the complex backgrounds in wildfire images, which obscure the boundary features of the targets, and partly from insufficient contextual information, leading to misclassifications. Overall, while both models perform comparably on large-scale fires, our model surpasses YOLOv8 in detecting small, obscured fires, resulting in enhanced overall performance.
To further elucidate the advancements and limitations of our work, we present an analysis from the perspective of feature maps. Figure 12 is divided into two columns: the first column shows the intermediate feature maps from the original YOLOv8, while the second column displays the intermediate feature maps from Ours. The first row consists of the original images (corresponding to image d in Figure 11), with green boxes indicating the ground truth. The subsequent three rows illustrate the outputs from the first three C2f layers of the backbones for both networks.
From the first two rows of feature maps, it is evident that both models effectively extract wildfire features. However, our method, by incorporating a Coordinate Attention mechanism, achieves clearer localization of wildfire features and retains valuable information even in the presence of occluded flames. Moreover, the enhanced feature extraction capability provided by the ODConv module in our network allows for capturing more relevant wildfire information through successive convolutions, compared to the original YOLOv8. The third row of the intermediate feature maps clearly shows that our method captures a greater number of wildfire features, which offers a significant advantage in detecting small and occluded fire points. Nonetheless, despite these improvements, our method still experiences some loss of useful information after multiple convolutions, which can lead to missed detections. This issue will need to be addressed in future work.

4.4. Comparison Experiments

To demonstrate the proposed network’s superior performance and efficiency, the comparison results with SSD, Faster R-CNN, EfficientDet, Cascade R-CNN and YOLO series are shown in Table 7 and Table 8, respectively.
From Table 7, results indicate that our model surpasses classical object detectors (SSD, Faster R-CNN, EfficientDet and Cascade R-CNN) in both accuracy and efficiency, achieving a 99% AP0.5 score while significantly reducing parameters and GFLOPs. Although Cascade R-CNN performs similarly to our method in terms of AP0.5, it significantly lags behind in the AP0.5:0.95 metric, with a value of 80.5% compared to our method’s 87.1%. These findings highlight the model’s potential for further advancement in the field.
Table 8 presents the comparison between Ours and several recent YOLO series networks. Ours outperforms these networks, achieving the best AP0.5:0.95 of 87.1%. Additionally, our network achieves the best results in parameters and GFLOPs compared to the other networks. Even when compared to the latest YOLOv9, our network excels in both performance and lightweight design, meeting the expected results. This shows that Ours has significant advantages over the compared models and is more suitable for practical applications.

4.5. Testing in Another Data Set

In order to further verify the generalization of the network in this study, additional forest fire images were collected from the Internet and expanded to 2898 images. Transfer training was then performed after dividing the data into training, test and validation sets in a ratio of 7:2:1.
The results of the final training are shown in Table 9. Most of the expanded dataset consists of large fires, on which all compared networks perform exceptionally well, with AP0.5:0.95 metrics exceeding 95%. Nevertheless, Ours still achieves the best detection results, indicating good generalization capability.
Figure 13 shows the detection results based on Ours. It can be seen that it has a good detection effect for forest fires of various scales and different environmental backgrounds.

5. Discussion

The core work of this paper lies in achieving a lightweight design without compromising network performance and optimizing performance specifically for the challenges associated with small fire detection. To this end, our approach involves incorporating lightweight modules within the network and enhancing its feature extraction capability. Compared to the original network and various classical networks, our approach not only significantly reduces complexity but also improves accuracy. Our testing results show that our method exhibits superior detection performance for small and occluded fire points. The reasons for this improvement can be summarized as follows: First, we strategically employed lightweight modules. Specifically, we replaced only the last C2f module in the Backbone and Neck with GhostC2f, which preserved network performance while reducing complexity. Additionally, inspired by pooling layers [54], we used the DSConv module exclusively in the Detect stage to discard redundant information while maintaining accuracy and reducing complexity. Second, to address the challenge of insufficient feature extraction for small fire points, we incorporated ODConv, which captures four-dimensional convolutional kernel information, and DySample, which mitigates distortion during upSampling, thereby enhancing feature extraction capabilities. Finally, to accurately locate fire points in complex environments, we utilized CA and DIoU algorithms, which provide more positional information. By combining lightweight and performance optimization strategies, we achieved the aforementioned experimental results.
Compared to other studies, Ma et al. [55] and our research both employ GhostConv for lightweight design. However, our approach involves using GhostC2f to replace only the C2f module preceding the SPPF and the C2f module that serves as input to the large fire detection head. This approach reduces complexity while mitigating the significant increase in inference time associated with excessive network layers [56]. To enhance network feature extraction capabilities, Zhang et al. [57] utilized techniques such as Feature Pyramid Networks, although this resulted in a high false alarm rate. Zhao et al. [58] expanded the feature extraction network in three dimensions, but their method’s accuracy in detecting occluded flames remains suboptimal. This limitation arises from the complexity of wildfire environments and the potential loss of useful information due to convolutional compression. In contrast, our study employs ODConv, which dynamically adjusts convolutional kernel weights based on input images, and DySample’s point sampling approach, which helps avoid distortion and enables better extraction of useful information. Addressing the challenge of effectively focusing on relevant information to avoid missed detections, Sheng et al. [59] introduced an improved CPDCA (Channel Priority Dilated Convolution Attention), while Feng et al. [60] incorporated SimAM into the YOLOv5 network. These methods mainly focus on extracting more information but do not adequately account for the variability in fire points, such as differing shapes and locations in multi-fire point detection scenarios. In contrast, the CA method used in this paper, with its emphasis on positional capture and long-range dependency, enhances the network’s ability to effectively focus on and identify relevant information.
Although this study has achieved certain research outcomes, there is still potential for further improvement in lightweight design through techniques such as knowledge distillation and pruning. Distillation significantly reduces the computational resources required during inference by transferring knowledge from a complex model to a smaller network, enabling the network to make decisions quickly even with limited hardware resources. Moreover, this method is applicable to various network architectures, demonstrating strong generalizability. Additionally, there is room for further optimization in the Backbone component. While progress has been made in detecting small fire points, there remain cases of missed detections, particularly for more obscure fire points that are less visible due to the shooting distance. Future work could focus on enhancing network performance through data augmentation, feature fusion and improved feature extraction. Additionally, the model could be deployed in hardware for real-time wildfire detection using UAVs. In addition to real-time visual monitoring of fires using cameras mounted on UAVs, future applications could also involve data cleaning and the dimensional expansion of data from multiple sensors (e.g., smoke sensors, CO sensors and temperature sensors) before feeding it into neural networks for classification [61]. By integrating data from various sensors, the capability for wildfire detection in complex environments can be significantly enhanced.

6. Conclusions

This study introduces a series of innovative enhancements to the YOLOv8n object detection algorithm, significantly improving its efficiency for detecting small-scale forest fires. By strategically replacing standard convolutions with more lightweight alternatives, such as Depthwise Separable Convolution and Ghost Convolution, the model’s computational complexity is substantially reduced without compromising performance. To further enhance the model’s ability to detect occluded and small-scale fires, we have introduced Dynamic UpSampling and Coordinate Attention mechanisms. These enhancements enable the model to extract more relevant features and focus on critical regions within forest fire images. Additionally, the adoption of Distance-Intersection over Union loss ensures more accurate bounding box predictions. Experimental results demonstrate the effectiveness of these enhancements, with a 41% reduction in parameters and a 54% reduction in GFLOPs. Despite these significant computational savings, the model maintains exceptional performance, achieving an AP0.5 of 99.0%, comparable to the baseline YOLOv8n network. Notably, our method surpasses the baseline in terms of AP0.5:0.95, demonstrating its superior ability to detect small-scale forest fires.

Author Contributions

Conceptualization, data curation, resources, visualization, writing—original draft: L.L.; methodology, formal analysis: L.L. and R.D.; software, investigation: R.D.; supervision, project administration: F.Y.; writing—review and editing: F.Y. and R.D.; validation: L.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Key R&D Program of China (2022YFF1302700), The Emergency Open Competition Project of National Forestry and Grassland Administration (202303), The Outstanding Youth Team Project of Central Universities (QNTD202308) and College Students’ Innovative Entrepreneurial Training Plan Program (S202310022196).

Data Availability Statement

The FLAME dataset in the article “The FLAME dataset: Aerial Imagery Pile burn detection using drones (UAVs)” is used (https://par.nsf.gov/biblio/10497556 (accessed on 12 September 2024)).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Flannigan, M.D.; Stocks, B.J.; Wotton, B.M. Climate change and forest fires. Sci. Total Environ. 2000, 262, 221–229. [Google Scholar] [CrossRef]
  2. Saffre, F.; Hildmann, H.; Karvonen, H.; Lind, T. Monitoring and cordoning wildfires with an autonomous swarm of unmanned aerial vehicles. Drones 2022, 6, 301. [Google Scholar] [CrossRef]
  3. Kantarcioglu, O.; Kocaman, S.; Schindler, K. Artificial neural networks for assessing forest fire susceptibility in Türkiye. Ecol. Inform. 2023, 75, 102034. [Google Scholar] [CrossRef]
  4. Zhang, F.; Zhao, P.; Xu, S.; Wu, Y.; Yang, X.; Zhang, Y. Integrating multiple factors to optimize watchtower deployment for wildfire detection. Sci. Total Environ. 2020, 737, 139561. [Google Scholar] [CrossRef] [PubMed]
  5. Alkhatib, A.A.A. A review on forest fire detection techniques. Int. J. Distrib. Sens. Netw. 2014, 10, 597368. [Google Scholar] [CrossRef]
  6. Hossain, F.M.A.; Zhang, Y.M.; Tonima, M.A. Forest fire flame and smoke detection from UAV-captured images using fire-specific color features and multi-color space local binary pattern. J. Unmanned Veh. Syst. 2020, 8, 285–309. [Google Scholar] [CrossRef]
  7. Yang, X.; Hua, Z.; Zhang, L.; Fan, X.; Zhang, F.; Ye, Q.; Fu, L. Preferred vector machine for forest fire detection. Pattern Recognit. 2023, 143, 109722. [Google Scholar] [CrossRef]
  8. Chen, T.H.; Wu, P.H.; Chiou, Y.C. An early fire-detection method based on image processing. In Proceedings of the 2004 International Conference on Image Processing, ICIP’04, Singapore, 24–27 October 2004; IEEE: Piscataway, NJ, USA, 2004; Volume 3, pp. 1707–1710. [Google Scholar]
  9. Vipin, V. Image processing based forest fire detection. Int. J. Emerg. Technol. Adv. Eng. 2012, 2, 87–95. [Google Scholar]
  10. Yuan, C.; Liu, Z.; Zhang, Y. UAV-based forest fire detection and tracking using image processing techniques. In Proceedings of the 2015 International Conference on Unmanned Aircraft Systems (ICUAS), Denver, CO, USA, 9–12 June 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 639–643. [Google Scholar]
  11. Rong, D.; Xie, L.; Ying, Y. Computer vision detection of foreign objects in walnuts using deep learning. Comput. Electron. Agric. 2019, 162, 1001–1010. [Google Scholar] [CrossRef]
  12. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity mappings in deep residual networks. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part IV 14. Springer International Publishing: Cham, Switzerland, 2016; pp. 630–645. [Google Scholar]
  13. Ahmed, S.F.; Alam, M.S.B.; Hassan, M.; Rozbu, M.R.; Ishtiak, T.; Rafa, N.; Mofijur, M.; Ali, A.B.M.S.; Gandomi, A.H. Deep learning modelling techniques: Current progress, applications, advantages, and challenges. Artif. Intell. Rev. 2023, 56, 13521–13617. [Google Scholar] [CrossRef]
  14. Voulodimos, A.; Doulamis, N.; Doulamis, A.; Protopapadakis, E. Deep learning for computer vision: A brief review. Comput. Intell. Neurosci. 2018, 2018, 7068349. [Google Scholar] [CrossRef] [PubMed]
  15. Zhang, Q.; Xu, J.; Xu, L.; Guo, H. Deep convolutional neural networks for forest fire detection. In Proceedings of the 2016 International Forum on Management, Education and Information Technology Application, Guangzhou, China, 30–31 January 2016; Atlantis Press: Amsterdam, The Netherlands, 2016; pp. 568–575. [Google Scholar]
  16. Sharma, J.; Granmo, O.C.; Goodwin, M.; Fidje, J.T. Deep convolutional neural networks for fire detection in images. In Proceedings of the Engineering Applications of Neural Networks: 18th International Conference, EANN 2017, Athens, Greece, 25–27 August 2017; Proceedings. Springer International Publishing: Cham, Switzerland, 2017; pp. 183–193. [Google Scholar]
  17. Frizzi, S.; Kaabi, R.; Bouchouicha, M.; Ginoux, J.M.; Moreau, E.; Fnaiech, F. Convolutional neural network for video fire and smoke detection. In Proceedings of the IECON 2016—42nd Annual Conference of the IEEE Industrial Electronics Society, Florence, Italy, 23–26 October 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 877–882. [Google Scholar]
  18. Bahhar, C.; Ksibi, A.; Ayadi, M.; Jamjoom, M.M.; Ullah, Z.; Soufiene, B.O.; Sakli, H. Wildfire and smoke detection using staged YOLO model and ensemble CNN. Electronics 2023, 12, 228. [Google Scholar] [CrossRef]
  19. Qian, J.; Lin, J.; Bai, D.; Xu, R.; Lin, H. Omni-dimensional dynamic convolution meets bottleneck transformer: A novel improved high accuracy forest fire smoke detection model. Forests 2023, 14, 838. [Google Scholar] [CrossRef]
  20. Yang, Z.; Shao, Y.; Wei, Y.; Li, J. Precision-Boosted Forest Fire Target Detection via Enhanced YOLOv8 Model. Appl. Sci. 2024, 14, 2413. [Google Scholar] [CrossRef]
  21. Xue, Z.; Lin, H.; Wang, F. A small target forest fire detection model based on YOLOv5 improvement. Forests 2022, 13, 1332. [Google Scholar] [CrossRef]
  22. Xiao, Z.; Wan, F.; Lei, G.; Xiong, Y.; Xu, L.; Ye, Z.; Liu, W.; Zhou, W.; Xu, C. FL-YOLOv7: A Lightweight Small Object Detection Algorithm in Forest Fire Detection. Forests 2023, 14, 1812. [Google Scholar] [CrossRef]
  23. Muksimova, S.; Mardieva, S.; Cho, Y.I. Deep encoder–decoder network-based wildfire segmentation using drone images in real-time. Remote Sens. 2022, 14, 6302. [Google Scholar] [CrossRef]
  24. Shakhnoza, M.; Sabina, U.; Sevara, M.; Cho, Y.-I. Novel video surveillance-based fire and smoke classification using attentional feature map in capsule networks. Sensors 2021, 22, 98. [Google Scholar] [CrossRef]
  25. Fan, R.; Pei, M. Lightweight forest fire detection based on deep learning. In Proceedings of the 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP), Gold Coast, Australia, 25–28 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6. [Google Scholar]
  26. Yun, B.; Zheng, Y.; Lin, Z.; Li, T. FFYOLO: A Lightweight Forest Fire Detection Model Based on YOLOv8. Fire 2024, 7, 93. [Google Scholar] [CrossRef]
  27. Jin, L.; Yu, Y.; Zhou, J.; Bai, D.; Lin, H.; Zhou, H. SWVR: A Lightweight Deep Learning Algorithm for Forest Fire Detection and Recognition. Forests 2024, 15, 204. [Google Scholar] [CrossRef]
  28. Huang, J.; He, Z.; Guan, Y.; Zhang, H. Real-time forest fire detection by ensemble lightweight YOLOX-L and defogging method. Sensors 2023, 23, 1894. [Google Scholar] [CrossRef] [PubMed]
  29. Vijayakumar, A.; Vairavasundaram, S. YOLO-based object detection models: A review and its applications. Multimed. Tools Appl. 2024, 1–40. [Google Scholar] [CrossRef]
  30. Sohan, M.; Sai Ram, T.; Reddy, R.; Venkata, C. A review on YOLOv8 and its advancements. In Proceedings of the International Conference on Data Intelligence and Cognitive Informatics, Tirunelveli, India, 27–28 June 2023; Springer: Singapore, 2024; pp. 529–545. [Google Scholar]
  31. Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A comprehensive review of YOLO architectures in computer vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  32. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  33. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  34. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  35. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1580–1589. [Google Scholar]
  36. Haase, D.; Amthor, M. Rethinking depthwise separable convolutions: How intra-kernel correlations lead to improved mobilenets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 14600–14609. [Google Scholar]
  37. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual Conference, 19–25 June 2021; pp. 13713–13722. [Google Scholar]
  38. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  39. Li, C.; Zhou, A.; Yao, A. Omni-dimensional dynamic convolution. arXiv 2022, arXiv:2209.07947. [Google Scholar]
  40. Kontorovich, A.; Kpotufe, S. Nearest-Neighbor Methods: A Modern Perspective. In Machine Learning for Data Science Handbook: Data Mining and Knowledge Discovery Handbook; Springer International Publishing: Cham, Switzerland, 2023; pp. 75–92. [Google Scholar]
  41. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to Upsample by Learning to Sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6027–6037. [Google Scholar]
  42. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. Carafe: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016. [Google Scholar]
  43. Simoes, A.; Xavier, J. FADE: Fast and asymptotically efficient distributed estimator for dynamic networks. IEEE Trans. Signal Process. 2019, 67, 2080–2092. [Google Scholar] [CrossRef]
  44. Lu, H.; Liu, W.; Ye, Z.; Fu, H.; Liu, Y.; Cao, Z. SAPA: Similarity-aware point affiliation for feature upsampling. Adv. Neural Inf. Process. Syst. 2022, 35, 20889–20901. [Google Scholar]
  45. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586. [Google Scholar] [CrossRef]
  46. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  47. Shamsoshoara, A.; Afghah, F.; Razi, A.; Zheng, L.; Fulé, P.Z.; Blasch, E. Aerial imagery pile burn detection using deep learning: The FLAME dataset. Comput. Netw. 2021, 193, 108001. [Google Scholar] [CrossRef]
  48. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  49. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  50. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
  51. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790. [Google Scholar]
  52. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  53. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  54. Gholamalinezhad, H.; Khosravi, H. Pooling methods in deep neural networks, a review. arXiv 2020, arXiv:2009.07485. [Google Scholar]
  55. Ma, S.; Li, W.; Wan, L.; Zhang, G. A Lightweight Fire Detection Algorithm Based on the Improved YOLOv8 Model. Appl. Sci. 2024, 14, 6878. [Google Scholar] [CrossRef]
  56. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  57. Zhang, L.; Wang, M.; Ding, Y.; Bu, X. MS-FRCNN: A multi-scale faster RCNN model for small target forest fire detection. Forests 2023, 14, 616. [Google Scholar] [CrossRef]
  58. Zhao, L.; Zhi, L.; Zhao, C.; Zheng, W. Fire-YOLO: A small target object detection method for fire inspection. Sustainability 2022, 14, 4930. [Google Scholar] [CrossRef]
59. Sheng, S.; Liang, Z.; Xu, W.; Wang, Y.; Su, J. FireYOLO-Lite: Lightweight Forest Fire Detection Network with Wide-Field Multi-Scale Attention Mechanism. Forests 2024, 15, 1244. [Google Scholar] [CrossRef]
  60. Feng, Z. Research on YOLOv5 forest fire recognition algorithm utilizing attention mechanism. In Proceedings of the International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024), Guangzhou, China, 1–3 March 2024; SPIE: Bellingham, WA, USA, 2024; Volume 13180, pp. 1244–1249. [Google Scholar]
  61. Deng, X.; Shi, X.; Wang, H.; Wang, Q.; Bao, J.; Chen, Z. An indoor fire detection method based on multi-sensor fusion and a lightweight convolutional neural network. Sensors 2023, 23, 9689. [Google Scholar] [CrossRef]
Figure 1. The network structure of YOLOv8. The structure consists of three components: the Backbone, the Neck and the Head. These components are responsible for feature extraction, feature fusion and detection, respectively. In the lower-right corner of the figure, a detailed schematic of several modules within the network is provided.
Figure 2. The structure of Ghost Convolution. In this structure, feature maps are first compressed through a standard convolution. Subsequently, two types of cheap linear transformations, Identity and DWConv, are applied to generate additional feature maps. Finally, the results are concatenated to produce the final output.
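To make the operation in Figure 2 concrete, the following PyTorch sketch shows a minimal Ghost Convolution block in the spirit of GhostNet [35]. It is illustrative only, not the authors' implementation; the ratio of 2 (half intrinsic, half ghost channels), the 5 × 5 depthwise kernel for the cheap transform and the SiLU activation are assumptions.

```python
import torch
import torch.nn as nn


class GhostConv(nn.Module):
    # Minimal Ghost Convolution sketch: half of the output channels come from a
    # standard convolution, the other half from a cheap depthwise transform of them.
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_hidden = c_out // 2
        # primary (standard) convolution producing the intrinsic feature maps
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_hidden, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_hidden),
            nn.SiLU(),
        )
        # cheap operation: depthwise convolution generating the ghost feature maps
        self.cheap = nn.Sequential(
            nn.Conv2d(c_hidden, c_hidden, 5, 1, 2, groups=c_hidden, bias=False),
            nn.BatchNorm2d(c_hidden),
            nn.SiLU(),
        )

    def forward(self, x):
        y = self.primary(x)
        # identity branch (y passed through unchanged) + DWConv branch, concatenated
        return torch.cat((y, self.cheap(y)), dim=1)
```

Here the identity branch of Figure 2 corresponds to concatenating the intrinsic maps unchanged, while DWConv produces the ghost maps from them.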
Figure 3. The process of spatial feature extraction. In this process, the feature map is convolved with as many spatial kernels as it has channels (one kernel per channel), adjusting it to the desired height and width of the target feature map.
Figure 4. The process of channel feature extraction. In this process, the feature map is processed with a set of kernels whose count equals the number of channels in the target feature map, yielding the desired channel expansion.
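Figures 3 and 4 together describe a depthwise separable convolution: a per-channel spatial stage followed by a channel-mixing stage. The sketch below is a minimal, generic PyTorch formulation under the standard assumption that the channel stage uses 1 × 1 (pointwise) kernels; it is not taken from the authors' code.

```python
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    # Minimal depthwise separable convolution sketch matching Figures 3 and 4:
    # depthwise (spatial) filtering followed by pointwise (channel) expansion.
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        # spatial feature extraction: one k x k kernel per input channel (groups = c_in)
        self.depthwise = nn.Conv2d(c_in, c_in, k, s, k // 2, groups=c_in, bias=False)
        # channel feature extraction: 1 x 1 kernels, one per output channel
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```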
Figure 5. The structure of Coordinate Attention. In this structure, two one-dimensional global pooling operations aggregate the input features along the vertical and horizontal directions into two distinct direction-sensitive feature maps. These are then encoded into two attention maps, which are subsequently multiplied with the input feature maps to enhance their representational capacity.
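A minimal PyTorch sketch of the Coordinate Attention block described in Figure 5 is given below. It follows the published CA design [37] but simplifies details: the reduction ratio of 32 and the SiLU non-linearity (in place of the original h-swish) are assumptions, not values reported in this paper.

```python
import torch
import torch.nn as nn


class CoordinateAttention(nn.Module):
    # Minimal Coordinate Attention sketch: two 1D poolings along H and W,
    # a shared bottleneck, and two direction-aware attention maps.
    def __init__(self, channels, reduction=32):
        super().__init__()
        hidden = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # aggregate along the horizontal direction
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # aggregate along the vertical direction
        self.conv1 = nn.Conv2d(channels, hidden, 1)
        self.bn = nn.BatchNorm2d(hidden)
        self.act = nn.SiLU()
        self.conv_h = nn.Conv2d(hidden, channels, 1)
        self.conv_w = nn.Conv2d(hidden, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        xh = self.pool_h(x)                              # (b, c, h, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)          # (b, c, w, 1)
        y = self.act(self.bn(self.conv1(torch.cat((xh, xw), dim=2))))
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.conv_h(yh))                          # (b, c, h, 1)
        aw = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))      # (b, c, 1, w)
        return x * ah * aw                               # reweight the input features
```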
Figure 6. Calculation process of ODConv. The figure demonstrates how attention is calculated along four dimensions (the kernel spatial size, the input channels, the output channels and the number of kernels). These attentions are ultimately applied to the convolution kernels in a parallel manner.
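The sketch below illustrates the four-dimensional attention of Figure 6 in simplified PyTorch form. The number of candidate kernels (4), the reduction ratio (4) and the sigmoid/softmax choices are illustrative assumptions; the reference ODConv implementation [39] adds further details such as temperature annealing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleODConv(nn.Module):
    # Simplified omni-dimensional dynamic convolution: a global context vector drives
    # four attentions (spatial, input-channel, output-channel, kernel-number) that are
    # applied in parallel to a bank of candidate kernels.
    def __init__(self, c_in, c_out, k=3, num_kernels=4, reduction=4):
        super().__init__()
        hidden = max(c_in // reduction, 4)
        self.k, self.c_in, self.c_out, self.n = k, c_in, c_out, num_kernels
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(nn.Conv2d(c_in, hidden, 1), nn.ReLU(inplace=True))
        # four attention heads
        self.attn_spatial = nn.Conv2d(hidden, k * k, 1)
        self.attn_cin = nn.Conv2d(hidden, c_in, 1)
        self.attn_cout = nn.Conv2d(hidden, c_out, 1)
        self.attn_kernel = nn.Conv2d(hidden, num_kernels, 1)
        # candidate kernels: (n, c_out, c_in, k, k)
        self.weight = nn.Parameter(torch.randn(num_kernels, c_out, c_in, k, k) * 0.02)

    def forward(self, x):
        b, c, h, w = x.shape
        ctx = self.fc(self.pool(x))                                              # (b, hidden, 1, 1)
        a_sp = torch.sigmoid(self.attn_spatial(ctx)).view(b, 1, 1, 1, self.k, self.k)
        a_ci = torch.sigmoid(self.attn_cin(ctx)).view(b, 1, 1, self.c_in, 1, 1)
        a_co = torch.sigmoid(self.attn_cout(ctx)).view(b, 1, self.c_out, 1, 1, 1)
        a_kn = torch.softmax(self.attn_kernel(ctx).view(b, self.n), dim=1).view(b, self.n, 1, 1, 1, 1)
        # combine the four attentions with the candidate kernels, then sum over kernels
        w_dyn = (a_kn * a_sp * a_ci * a_co * self.weight.unsqueeze(0)).sum(dim=1)  # (b, c_out, c_in, k, k)
        # per-sample convolution via the grouped-convolution trick
        x = x.reshape(1, b * c, h, w)
        w_dyn = w_dyn.reshape(b * self.c_out, self.c_in, self.k, self.k)
        out = F.conv2d(x, w_dyn, padding=self.k // 2, groups=b)
        return out.view(b, self.c_out, out.shape[-2], out.shape[-1])
```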
Figure 7. The structure of DySample. In this structure, the input is first subjected to point sampling to obtain a sampling set S, which is then re-sampled to the target output using the grid_sample function.
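The following PyTorch sketch captures the idea in Figure 7: predict per-position offsets, add them to a regular grid to form the sampling set S, and resample with grid_sample. The 2× scale, the 0.25 offset scaling and the single 1 × 1 offset predictor are illustrative assumptions; see the DySample paper [41] for the full design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleDySample(nn.Module):
    # Minimal dynamic point-sampling upsampler: learn offsets, add them to a base grid,
    # then resample the input with grid_sample.
    def __init__(self, channels, scale=2):
        super().__init__()
        self.scale = scale
        # predicts (x, y) offsets for every upsampled position
        self.offset = nn.Conv2d(channels, 2 * scale * scale, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        offset = self.offset(x) * 0.25                    # keep offsets small (assumption)
        offset = F.pixel_shuffle(offset, self.scale)      # (b, 2, s*h, s*w)
        hs, ws = h * self.scale, w * self.scale
        ys = torch.linspace(-1, 1, hs, device=x.device)
        xs = torch.linspace(-1, 1, ws, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)  # base grid in [-1, 1]
        grid = grid + offset.permute(0, 2, 3, 1)          # sampling set S = grid + predicted offsets
        return F.grid_sample(x, grid, mode="bilinear", align_corners=True)
```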
Figure 8. Schematic diagram of calculating the DIoU. In this diagram, the green box represents the ground truth box, while the black box indicates the predicted box. d denotes the distance between the centers of the two boxes, and c denotes the diagonal length of the smallest enclosing box that covers both boxes.
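Figure 8 corresponds to the DIoU loss L_DIoU = 1 − IoU + d²/c² [46]. The snippet below is a compact sketch (not the authors' code) for boxes in (x1, y1, x2, y2) format.

```python
import torch


def diou_loss(pred, target, eps=1e-7):
    """DIoU loss = 1 - IoU + d^2 / c^2 for boxes given as (x1, y1, x2, y2)."""
    # intersection area
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)
    # squared center distance d^2
    cx_p, cy_p = (pred[..., 0] + pred[..., 2]) / 2, (pred[..., 1] + pred[..., 3]) / 2
    cx_t, cy_t = (target[..., 0] + target[..., 2]) / 2, (target[..., 1] + target[..., 3]) / 2
    d2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    # squared diagonal c^2 of the smallest enclosing box
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    return 1.0 - iou + d2 / c2
```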
Figure 9. The network structure of lightweight YOLOv8. In the Backbone, some Conv modules are replaced with ODConv modules, each followed by a CA module, and the final C2f module is replaced with a GhostC2f module. In the Neck, the samplers are replaced with DySample modules, all Conv modules are substituted with ODConv modules and the final C2f module is replaced with a GhostC2f module. Finally, in the Head, all three Detect modules are replaced with DSDetect modules.
Figure 10. Images from FLAME dataset: (a) Example of multi-point forest fire; (b) Example of single forest fire.
Figure 11. Detection results of different models on the FLAME dataset. Panels (a–g) display seven different images from the FLAME dataset. The first column shows the original FLAME images with ground truth boxes indicated. The middle column presents the detection results from the original YOLOv8 model, while the final column shows the detection results from Ours.
Figure 12. Results of intermediate feature maps for different models. The first row displays the original photographs, with green boxes indicating the ground truth. The two columns below, from left to right, show the feature maps of the original YOLOv8 and Ours, respectively.
Figure 13. Detection results of Ours on the extended dataset. The wildfire detection results across four different images demonstrate that Ours exhibits strong detection capability for various types of fire points, including large and small fires, single and multiple fire points, as well as occluded fire points.
Table 1. Experimental environment.

Experimental Environment | Details
Programming language | Python 3.10
Operating system | Ubuntu 20.04
Deep learning framework | PyTorch 2.2.0
GPU | NVIDIA Tesla V100-SXM2 (32 GB)
GPU acceleration tool | CUDA 12.0
Table 2. Training parameters.

Training Parameter | Value
Epochs | 400
Batch size | 32
Image size | 640 × 640
Initial learning rate | 0.01
Optimization algorithm | SGD
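For readers who wish to reproduce a comparable training run, the parameters in Table 2 map directly onto the Ultralytics training API. The snippet below is a minimal sketch: the dataset file flame.yaml is a hypothetical placeholder, and the modified architecture would have to be supplied through a custom model definition rather than the stock yolov8n.yaml used here.

```python
from ultralytics import YOLO

# Build a YOLOv8n-style model and train with the settings from Table 2.
# "flame.yaml" is a hypothetical dataset configuration; adjust paths to your data.
model = YOLO("yolov8n.yaml")
model.train(
    data="flame.yaml",
    epochs=400,
    batch=32,
    imgsz=640,
    lr0=0.01,
    optimizer="SGD",
)
```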
Table 3. Results of ablation experiments. In this table, Original-YOLOv8 serves as the baseline network. LYOLOv8 is a lightweight network that combines the GhostC2f and DSDetect modules based on the baseline network. Various modules are then sequentially added to LYOLOv8 to enhance overall performance, resulting in the final network, Ours.

Model | P/% | R/% | AP0.5/% | AP0.5:0.95/% | Parameters/M | GFLOPs
Original-YOLOv8 | 97.3 | 95.8 | 98.8 | 86.1 | 3.01 | 8.2
YOLOv8-GhostC2f | 97.0 | 96.1 | 99.0 | 85.7 | 2.44 | 7.7
YOLOv8-DSDetect | 96.2 | 95.6 | 98.6 | 84.8 | 2.23 | 5.3
LYOLOv8 | 96.4 | 94.9 | 98.6 | 84.8 | 1.72 | 4.9
LYOLOv8-DySample | 96.7 | 95.1 | 98.4 | 85.0 | 1.73 | 4.9
LYOLOv8-DySample-CA | 97.6 | 94.3 | 98.6 | 85.4 | 1.74 | 4.9
LYOLOv8-DySample-CA-ODConv | 98.4 | 94.3 | 98.7 | 86.5 | 1.78 | 3.8
LYOLOv8-DySample-CA-ODConv-DIoU (Ours) | 97.6 | 95.0 | 99.0 | 87.1 | 1.78 | 3.8
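As a quick sanity check, the complexity reductions of Ours relative to Original-YOLOv8 can be recomputed from the first and last rows of Table 3; the short snippet below does exactly that.

```python
# Values taken from Table 3 (baseline Original-YOLOv8 vs. Ours).
baseline_params, ours_params = 3.01, 1.78     # Parameters/M
baseline_gflops, ours_gflops = 8.2, 3.8       # GFLOPs

param_reduction = (baseline_params - ours_params) / baseline_params
gflops_reduction = (baseline_gflops - ours_gflops) / baseline_gflops
print(f"Parameter reduction: {param_reduction:.1%}")   # about 41%
print(f"GFLOPs reduction:    {gflops_reduction:.1%}")  # about 54%
```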
Table 4. Comparison of different sampling algorithms. The experiments in the table are conducted based on LYOLOv8.

Sampler | P/% | R/% | AP0.5:0.95/% | Parameters/M | GFLOPs
DySample | 96.7 | 95.1 | 85.0 | 1.73 | 4.9
CARAFE | 97.2 | 94.6 | 85.1 | 1.85 | 5.4
Nearest | 96.4 | 94.9 | 84.8 | 1.72 | 4.9
Table 5. Comparison of different attention mechanisms. The experiments in the table are conducted based on LYOLOv8–DySample.

Attention | P/% | R/% | AP0.5:0.95/% | Parameters/M | GFLOPs
None | 96.7 | 95.1 | 85.0 | 1.73 | 4.9
Channel Attention | 97.3 | 94.8 | 85.3 | 1.82 | 5.0
CBAM | 97.1 | 95.5 | 85.4 | 1.82 | 5.0
CA | 97.6 | 94.3 | 85.4 | 1.74 | 4.9
Table 6. Comparison of different loss functions. The experiments in the table are conducted based on LYOLOv8–DySample–CA–ODConv.

Loss | P/% | R/% | AP0.5:0.95/% | Parameters/M | GFLOPs | box_loss
CIoU | 98.4 | 94.3 | 86.5 | 1.78 | 3.8 | 0.52
MPDIoU | 97.4 | 94.8 | 85.4 | 1.78 | 3.8 | 0.50
GIoU | 97.1 | 94.9 | 85.9 | 1.78 | 3.8 | 0.49
InnerIoU | 98.1 | 95.3 | 78.4 | 1.78 | 3.8 | 0.11
SIoU | 97.2 | 95.7 | 85.8 | 1.78 | 3.8 | 0.63
DIoU | 97.6 | 95.0 | 87.1 | 1.78 | 3.8 | 0.44
Table 7. Comparison of detection results from different networks on the FLAME dataset.

Model | AP0.5/% | Parameters/M | GFLOPs
SSD | 90.1 | 26.30 | 140.5
Faster R-CNN | 90.8 | 41.35 | 78.4
EfficientDet | 86.4 | 6.55 | 5.6
Cascade R-CNN | 98.0 | 69.39 | 118.97
Ours | 99.0 | 1.78 | 3.8
Table 8. Comparison of detection results from the YOLO series on the FLAME dataset.

Model | AP0.5:0.95/% | Parameters/M | GFLOPs
YOLOv8 | 86.1 | 3.01 | 8.2
YOLOv9 | 83.5 | 2.66 | 11.0
YOLOv7 | 80.3 | 6.01 | 13.2
Ours | 87.1 | 1.78 | 3.8
Table 9. Results of training on the extended dataset.

Model | AP0.5/% | AP0.5:0.95/% | Parameters/M | GFLOPs
Original YOLOv8 | 98.9 | 95.9 | 3.01 | 8.2
Ours | 99.4 | 96.2 | 1.78 | 3.8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
