Article

FL-YOLOv7: A Lightweight Small Object Detection Algorithm in Forest Fire Detection

School of Computer Science, Hubei University of Technology, Wuhan 430068, China
*
Author to whom correspondence should be addressed.
Forests 2023, 14(9), 1812; https://doi.org/10.3390/f14091812
Submission received: 31 July 2023 / Revised: 29 August 2023 / Accepted: 4 September 2023 / Published: 5 September 2023

Abstract

Given the limited computing capability of UAV terminal equipment, balancing accuracy and computational cost is a challenge when deploying a target detection model for forest fire detection on a UAV. In addition, the fire targets photographed by the UAV are small and prone to misdetection and missed detection. This paper proposes a lightweight small target detection model, FL-YOLOv7, based on YOLOv7. First, we designed a lightweight module, C3GhostV2, to replace the feature extraction module in YOLOv7, and used the Ghost module to replace some of the standard convolution layers in the backbone network, accelerating inference and reducing the number of model parameters. Second, we introduced the parameter-free attention mechanism SimAM to highlight the features of smoke and fire targets and suppress background interference, improving the model's representation and generalization performance without adding network parameters. Finally, we incorporated the Adaptive Spatial Feature Fusion (ASFF) module to address the model's weak small target detection capability and adopted a loss function with dynamically adjustable sample weights (WIoU) to weaken the impact of low-quality or difficult samples and improve overall performance. Experimental results show that FL-YOLOv7 reduces the parameter count by 27% compared with YOLOv7 while improving $mAP_{50}^{small}$ by 2.9% and FPS by 24.4 frames per second, demonstrating the effectiveness and superiority of our model for small target detection as well as its real-time performance and reliability in forest fire scenarios.

1. Introduction

Forest fire is a natural disaster that seriously jeopardizes forest resources and ecological security, and early detection and suppression of fires is the key to reducing losses and protecting the environment. Traditional forest fire monitoring methods, such as manual inspection, watchtower observation, aerial patrol, and satellite remote sensing, have many limitations: they consume substantial manpower and material resources, are inefficient, cover small areas with narrow fields of view, suffer from high false alarm rates, make fires difficult to localize, are costly, and are affected by the weather [1]. These methods can hardly meet the real-time, accuracy, and coverage requirements of forest fire detection. In contrast, the camera as a sensor offers high real-time performance and rich visual information. A camera can capture various characteristic signals of a fire, such as the color, shape, motion, temperature, and radiation of the flame, and computer vision techniques can automatically identify and locate the fire in the image to achieve real-time and rapid fire alarm and warning [2]. As an emerging aircraft, the UAV has the advantages of small size, high mobility, and low cost, and can carry a variety of sensors, making it suitable for forest fire monitoring. When a camera is mounted on a UAV, the monitoring range can be expanded to cover wider and more remote forest areas, providing firefighters with more complete images of the forest fire scene [3]. The drone can fly around or conduct spot monitoring in the forest area according to a preset flight route or real-time remote control commands. Through the video stream transmitted back from the drone, the relevant personnel can observe the situation at the fire scene in real time, and the fire department can remotely monitor the scene and follow the development of the fire to make more accurate decisions. At the same time, the drone's location information is transmitted back to the terminal along with the video stream, so the terminal knows the drone's current position, which in turn helps the fire department or other decision makers determine the exact location of the fire and dispatch rescue resources in time [4]. With the development of computer vision technology, deploying computer vision models on aerial photography platforms such as UAVs [5] to automatically monitor forest fires can not only detect fires in a timely and accurate manner but also save considerable manpower and material resources.
Many scholars have shown that the YOLO series delivers excellent performance in target detection and is widely used in UAV-based detection systems. Therefore, in this work, YOLOv7 [6] is adopted as our UAV vision framework for forest fire detection. Our main contributions are as follows.
1.
Based on GhostNetV2, more lightweight C3GhostV2 and GhostMP structures are proposed to lighten the backbone network of the YOLOv7 target detection model. The optimized neural network greatly reduces the computational cost and the network size.
2.
The SimAM attention mechanism is incorporated to improve the model's representation ability and generalization performance by identifying and enhancing the important neurons in the convolutional neural network that exhibit a spatial suppression effect. At the same time, the SimAM attention module does not increase the number of network parameters and maintains the computational efficiency of the model.
3.
Adaptive Spatial Feature Fusion (ASFF) is added after the PAFPN output feature maps to further fuse the semantic information of the high-level features with the fine-grained detail of the low-level features. This enhances the model's ability to detect small smoke and fire targets.
4.
Wise-IoU (WIoU) with a dynamic non-monotonic focusing mechanism (FM) is introduced. WIoU uses the outlier degree to evaluate the quality of anchor boxes and assigns appropriate gradient gains accordingly. In this way, WIoU reduces the influence of both high-quality and low-quality anchor boxes on the loss value, allows the model to focus on average-quality anchor boxes, and improves the overall performance of the detector.
The rest of the paper is organized as follows. Section 2 introduces related work on forest fire target detection and the YOLOv7 network model. Section 3 presents the improved YOLOv7 framework and describes the structure of the FL-YOLOv7 model in detail. Section 4 describes the experimental configuration and training parameter settings, verifies the effects of the C3GhostV2 module, the SimAM attention mechanism, the ASFF module, and the WIoU loss function on forest fire recognition, and discusses the experimental results. Finally, Section 5 summarizes the paper and outlines future work.

2. Related Work

Traditionally, forest fire monitoring systems have used watchtowers, satellite imagery, and various IoT monitoring systems. Ananthi [7] developed a smart forest monitoring application for forest fire prediction that collects data from the forest environment through various smart sensors and devices and sends it to a cloud server for further processing. Kuldoshbay et al. [8] proposed a solution that combines Internet of Things (IoT) devices with YOLOv5, allowing the IoT devices to verify that fires detected by YOLOv5 have not been misdetected or missed. Rahman [9] developed a pipeline model consisting of background subtraction, color segmentation, spatial wavelet analysis, and a support vector machine that detects fires in real time, but its computational complexity is high. Forest fire monitoring techniques based on high-resolution satellite imagery have the advantages of wide coverage and high accuracy, but they are not efficient or timely in monitoring the development and spread of small-scale, short-term forest fires [10], and cloud cover, for example, can obscure images and make it difficult to characterize forest fire parameters quantitatively in a timely manner [11].
With the emergence of convolutional neural networks (CNNs) and Vision Transformers (ViTs), many scholars have exploited their strength in target detection for forest fire detection. Zhang [12] used classical models such as Faster R-CNN, YOLO, and SSD for flame detection and created a forest fire detection benchmark.
However, the proportion of pixels occupied by smoke and fire targets in aerial forest fire images is very small, and as the number of network layers increases [13], less feature and location information of small smoke and fire targets can be retained, so the network model's ability to detect small smoke and fire targets is poor and misdetections and missed detections occur easily. To solve this problem, many scholars have studied small target detection. Sunkara et al. [14] proposed a new CNN building block called SPD-Conv to replace each strided convolutional layer and each pooling layer, improving the model's ability to deal with low-resolution images and small objects. Xue et al. [15] adapted the original path aggregation network (PANet) of YOLOv5 into a bidirectional feature pyramid network (BiFPN) to address YOLOv5's insufficient performance in small-target fire recognition. The self-attention mechanism in the Transformer [16] can capture relationships between distant elements and attend to salient regions in the image, thus integrating information across the whole image. Zhu et al. [17] added a prediction head to YOLOv5 to detect objects at different scales and then replaced the original prediction heads with Transformer Prediction Heads (TPH) to enhance small target detection. Although these methods can effectively extract the features of small targets, they also increase the number of network parameters, reduce model efficiency, increase memory consumption and computational cost, and are difficult to deploy on mobile terminal devices.
Reducing the computational cost and network size is the most direct way to deploy network models on mobile terminal devices such as UAVs. Liu et al. [18] achieved model lightweighting with pointwise convolution, which considers only the information around individual pixels rather than the whole convolutional kernel. Lv et al. [19] constructed a lightweight network from inverted residual blocks, which reduce the computational effort by swapping the order of the convolutional layers and nonlinear activation functions. Javadi and Li et al. [20,21] lightened the network model by replacing the backbone of the YOLOv3 and YOLOv5 models with MobileNet-v3 [22] and GhostNet, respectively. Although lightweight models can be more easily deployed on mobile terminal devices such as UAVs, their detection accuracy can hardly reach the requirements of small target detection for forest fires, and detection accuracy is crucial. Therefore, how to ensure detection accuracy while achieving a lightweight network model is the focus and difficulty of forest fire small target detection based on remote sensing platforms such as UAVs. YOLOv7, the latest member of the YOLO series, has exceptional target detection performance, particularly in common task scenarios such as identifying pedestrians and vehicles. However, applying it directly to forest fire detection in aerial imagery still faces several challenges:
1.
The scale of forest fires in aerial imagery is very small and lacks detailed features, which makes them difficult to detect with traditional target detection models [23]. Even improved high-precision models struggle to detect smoke and fire targets that occupy only a few pixels.
2.
While ELAN's feature extraction and feature fusion structure improves network learning, its complex structure results in a large number of model parameters [24].
3.
Forest fire detection has very high requirements on detection speed; if a fire can be detected in its early stage, heavy losses can be avoided.
This paper aims to address the shortcomings of YOLOv7 in UAV-based forest fire detection in these specific scenarios.

3. Methodology

3.1. YOLOv7

The network structure of YOLOv7 is shown in Figure 1 and consists of the input, backbone, neck, and prediction parts. The input module scales all input images to a uniform size and delivers them to the backbone network. The backbone consists of a number of CBS, ELAN, and MPConv convolutional layers. Each CBS layer is composed of a convolution, Batch Normalization (BN), and the SiLU activation function [25] and is used to extract image features at different scales; the ELAN module consists of multiple CBS layers, and its input and output feature sizes remain constant. The MPConv layer adds a Maxpool branch on top of the CBS layer to improve the feature extraction capability of the network. The Path Aggregation Feature Pyramid Network (PAFPN) [26] used in the head achieves an efficient fusion of features at different levels by introducing bottom-up paths that make it easier to pass information from the bottom layers to the top layers. The prediction module uses the REP (RepVGG block) [27] structure to adjust the number of channels of the three feature maps at different scales output by the PAFPN and finally applies 1 × 1 convolutions to predict confidence, category, and anchor boxes.
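To make the building blocks concrete, a minimal PyTorch sketch of the CBS block (convolution, batch normalization, SiLU) is given below; it is an illustrative reconstruction of the description above, with the kernel size and stride as assumed defaults, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the basic block of the YOLOv7 backbone (illustrative sketch)."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```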

3.2. Lightweight Feature Extraction Module-C3GhostV2

Lightweight convolutional neural networks (CNNs) offer fast inference on mobile devices, but their recognition accuracy leaves much to be desired. The Ghost module proposed in GhostNet [28] is a new convolutional module that is lighter and more efficient than a standard convolution. It uses inexpensive linear operations to generate additional "ghost" feature maps, increasing the diversity and redundancy of the feature maps, and reduces the parameters and computation of the convolutional layers without changing the size or channel count of the output feature maps. However, the convolution operation can only capture local information within the window region, which hinders further performance improvement. Introducing self-attention into convolution can capture global information well but greatly affects the actual inference speed. GhostNetV2 [29] proposed a hardware-friendly attention mechanism (DFC attention), which enhances the long-distance dependencies of the feature maps produced by the regular convolution and cheap operations, increasing the expressiveness and diversity of the feature maps and thereby improving detection performance. Taking advantage of the GhostV2 bottleneck, we design a lightweight feature extraction structure, C3GhostV2, as shown in Figure 2.
GhostV2Bottleneck is an inverted residual bottleneck consisting of two Ghost modules, as shown in Figure 3. The first Ghost module expands the number of feature channels, while the second reduces the number of channels to produce the output features. This design effectively reduces the coupling between model expressiveness and capacity [30] and alleviates overfitting during training and insufficient generalization during testing. The DFC attention branch captures long-range correlations between pixels at different spatial locations. It runs in parallel with the first Ghost module to enhance the features, and the second Ghost module takes the enhanced features and produces the output features, improving model performance.
The Ghost module produces a larger number of feature maps at a lower computational cost. It can replace a standard convolution in the following two steps. First, the input features $X \in \mathbb{R}^{H \times W \times C}$ are convolved with a 1 × 1 pointwise convolution to generate the intrinsic features:
$Y' = X * F_{1 \times 1}$  (1)
where $*$ denotes the convolution operation, $F_{1 \times 1}$ is the pointwise convolution, and $Y' \in \mathbb{R}^{H \times W \times C'}$ are the intrinsic features, which typically have a smaller size than the original output. The generated intrinsic features are then depthwise-convolved to generate more features, and the two parts of the features are concatenated along the channel dimension, i.e.,
$Y = \mathrm{Concat}\left(Y',\; Y' * F_{dp}\right)$  (2)
In Equation (2), $F_{dp}$ is the depthwise convolution filter, and $Y \in \mathbb{R}^{H \times W \times C_{out}}$ is the output feature. Although the Ghost module significantly reduces the computational cost, it inevitably weakens the representation capability. In GhostNet, only half of the features are fed into the 3 × 3 depthwise convolution to capture their spatial information, and the other half are processed by a low-cost 1 × 1 convolution; since the 1 × 1 convolution operates only on the channels and not on the spatial dimensions of the input tensor, the relationships between spatial pixels, which are the key to accurate recognition, are only weakly captured, hindering further performance improvement.
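A minimal PyTorch sketch of the Ghost module defined by Equations (1) and (2) is given below; the split ratio and the depthwise kernel size are illustrative assumptions rather than the exact configuration used in FL-YOLOv7.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Ghost module: a pointwise convolution produces intrinsic features Y',
    a cheap depthwise convolution produces 'ghost' features, and both are concatenated."""
    def __init__(self, c_in, c_out, ratio=2, dw_kernel=3):
        super().__init__()
        c_intrinsic = c_out // ratio                       # channels of Y'
        c_ghost = c_out - c_intrinsic                      # channels generated cheaply
        self.primary = nn.Sequential(                      # Eq. (1): Y' = X * F_1x1
            nn.Conv2d(c_in, c_intrinsic, 1, bias=False),
            nn.BatchNorm2d(c_intrinsic), nn.ReLU(inplace=True))
        self.cheap = nn.Sequential(                        # depthwise filter F_dp in Eq. (2)
            nn.Conv2d(c_intrinsic, c_ghost, dw_kernel, padding=dw_kernel // 2,
                      groups=c_intrinsic, bias=False),
            nn.BatchNorm2d(c_ghost), nn.ReLU(inplace=True))

    def forward(self, x):
        y_intrinsic = self.primary(x)
        y_ghost = self.cheap(y_intrinsic)
        return torch.cat([y_intrinsic, y_ghost], dim=1)    # Eq. (2): concat along channels
```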
Using the DFC attention module to augment the output feature Y of the Ghost module enhances the model's ability to capture long-range information between pixels at different spatial locations. DFC attention aggregates pixels along the horizontal and vertical axes, respectively, and eliminates tensor reshaping and transposition operations by sharing part of the transformation weights, thereby accelerating inference, as formulated in Equations (3) and (4).
$a'_{hw} = \sum_{h'=1}^{H} F^{H}_{h,h'w} \odot z_{h'w}, \quad h = 1, 2, \ldots, H, \; w = 1, 2, \ldots, W$  (3)

$a_{hw} = \sum_{w'=1}^{W} F^{W}_{w,hw'} \odot a'_{hw'}, \quad h = 1, 2, \ldots, H, \; w = 1, 2, \ldots, W$  (4)
Figure 4 shows the information aggregation process in the GhostNetV2 bottleneck. The input feature $X \in \mathbb{R}^{H \times W \times C}$ is sent to two branches: the Ghost module and the DFC attention module extract information from different perspectives of the same input, producing the output feature $Y$ (Equations (1) and (2)) and the attention map $A$ (Equations (3) and (4)), respectively. Their outputs are then multiplied element by element to produce the final output $O \in \mathbb{R}^{H \times W \times C}$, as shown in Equation (5), where $\odot$ denotes element-wise multiplication and the sigmoid is a scaling function that normalizes the attention map $A$ to the range (0, 1).
$O = \mathrm{Sigmoid}(A) \odot V(X)$  (5)
The DFC attention module computes each attention value over a large range of patches so that the output features can contain information from those patches as well. This approach allows the model to better capture the spatial information in the input image and improve the accuracy of the model. At the same time, the design of the ghost module reduces the computational cost and improves the operational efficiency of the model, allowing the model to achieve relatively high recognition accuracy while maintaining computational efficiency.
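The sketch below shows one way to pair a DFC attention branch with a Ghost module in the spirit of Equation (5): the attention map is built from a downsampled input with horizontal and vertical depthwise convolutions, passed through a sigmoid, and multiplied element-wise with the Ghost output. The downsampling factor and kernel sizes are assumptions in the spirit of GhostNetV2, not the exact configuration used here; GhostModule refers to the previous sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFCAttention(nn.Module):
    """Decoupled fully connected (DFC) attention: aggregates context along the
    horizontal and vertical axes with cheap depthwise convolutions (sketch)."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        self.proj = nn.Sequential(nn.Conv2d(c_in, c_out, 1, bias=False),
                                  nn.BatchNorm2d(c_out))
        self.horizontal = nn.Conv2d(c_out, c_out, (1, k), padding=(0, k // 2),
                                    groups=c_out, bias=False)
        self.vertical = nn.Conv2d(c_out, c_out, (k, 1), padding=(k // 2, 0),
                                  groups=c_out, bias=False)

    def forward(self, x):
        h, w = x.shape[-2:]
        a = F.avg_pool2d(x, 2)                  # downsample to cut the attention cost
        a = self.vertical(self.horizontal(self.proj(a)))
        a = torch.sigmoid(a)                    # scale attention values to (0, 1)
        return F.interpolate(a, size=(h, w), mode="nearest")

class GhostV2Block(nn.Module):
    """O = Sigmoid(A) ⊙ V(X): Ghost output modulated by DFC attention (Eq. (5))."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.ghost = GhostModule(c_in, c_out)   # V(X), defined in the previous sketch
        self.attn = DFCAttention(c_in, c_out)   # produces Sigmoid(A)

    def forward(self, x):
        return self.ghost(x) * self.attn(x)
```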

3.3. Parameter-Free Attention Mechanism

Attention modules are widely used in deep learning to enhance feature extraction. However, most existing attention modules refine the feature maps with 1-D or 2-D weights computed along the channel or spatial dimension, which limits their ability to learn more discriminative cues and makes it difficult to learn attention weights jointly over channels and spatial positions; some attention modules also rely heavily on hyperparameters. The parameter-free attention module with full 3-D attention weights (SimAM) refines the features of smoke and fire targets with these 3-D weights, improving the model's ability to recognize them, increasing accuracy, and reducing the false detection rate, as shown in Figure 5. In Figure 5, the same color indicates that a single scalar is used for each channel, each spatial location, or each point of the feature.
To achieve better attention, the importance of each neuron needs to be evaluated. In neuroscience, active neurons that exhibit a spatial suppression effect on surrounding neurons carry richer information and should be given higher weights. Such neurons can be found by measuring the linear separability between a target neuron and the other neurons. Based on these findings, the following energy function is defined for each neuron:
$e_t(w_t, b_t, y, x_i) = (y_t - \hat{t})^2 + \frac{1}{M-1}\sum_{i=1}^{M-1}(y_o - \hat{x}_i)^2$
where $\hat{t} = w_t t + b_t$ and $\hat{x}_i = w_t x_i + b_t$ are linear transformations of $t$ and $x_i$; $t$ and $x_i$ are the target neuron and the other neurons in a single channel of the input feature $X \in \mathbb{R}^{C \times H \times W}$, $i$ indexes the spatial dimension, and the number of neurons per channel is $M = H \times W$. By using binary labels for $y_t$ and $y_o$ and adding a regularization term, minimizing the above equation measures the linear separability between the target neuron $t$ and all the other neurons in the same channel. The final energy function is defined as follows:
$e_t(w_t, b_t, y, x_i) = \frac{1}{M-1}\sum_{i=1}^{M-1}\alpha_i^2 + \beta^2 + \lambda w_t^2, \quad \alpha_i = -1 - (w_t x_i + b_t), \quad \beta = 1 - (w_t t + b_t)$
In theory, there are M energy functions per channel, and solving all of them with an iterative solver such as SGD would be computationally expensive. Fortunately, the above equation has a closed-form solution:
$w_t = -\frac{2(t - \mu_t)}{(t - \mu_t)^2 + 2\sigma_t^2 + 2\lambda}, \quad b_t = -\frac{1}{2}(t + \mu_t)\, w_t$
in which:
$\mu_t = \frac{1}{M-1}\sum_{i=1}^{M-1} x_i, \quad \sigma_t^2 = \frac{1}{M-1}\sum_{i=1}^{M-1}(x_i - \mu_t)^2$
The minimum energy can be obtained by the following equation:
$e_t^* = \frac{4(\hat{\sigma}^2 + \lambda)}{(t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}$
The above equation implies that the weight of a neuron is inversely proportional to its energy: the lower the energy $e_t^*$, the more distinct the neuron is from its surroundings and the higher its weight. Therefore, the weight of each neuron can be obtained as $1/e_t^*$. After deriving the energy function and the neuron weights, the features are enhanced according to the definition of the attention mechanism:
$\tilde{X} = \mathrm{sigmoid}\left(\frac{1}{E}\right) \odot X$
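As a concrete reference, the closed-form SimAM weighting above can be expressed in a few lines of PyTorch. The sketch below is an illustration under stated assumptions (λ = 1e-4 as a default regularization constant), not the authors' exact implementation; it computes, per channel, the inverse minimal energy of every position and rescales the features with a sigmoid.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free 3-D attention: weight each activation by the inverse of its
    closed-form minimal energy e_t* and scale the features with a sigmoid (sketch)."""
    def __init__(self, lam=1e-4):
        super().__init__()
        self.lam = lam

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w - 1                                      # M - 1 neurons per channel
        d = (x - x.mean(dim=(2, 3), keepdim=True)) ** 2    # (t - mu)^2 at every position
        v = d.sum(dim=(2, 3), keepdim=True) / n            # channel variance sigma^2
        inv_energy = d / (4 * (v + self.lam)) + 0.5        # equals 1 / e_t*
        return x * torch.sigmoid(inv_energy)
```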

3.4. Adaptive Spatial Feature Fusion

FPN addresses the multi-scale problem by extracting features from feature maps at different levels. The PAFPN in YOLOv7 does the same, but it simply resizes the output feature maps to a uniform size and combines them sequentially, which does not take full advantage of the features at different scales. ASFF effectively fuses small target information through interpolation or convolution operations so that the features extracted from different levels of the feature pyramid can be fused at the same spatial scale. After the feature layers are aligned, ASFF learns the optimal fusion weights through training to highlight the feature layers that contain more information about small targets, so that these details dominate the feature fusion process. Finally, ASFF fuses the features from each feature layer, and the higher-weighted features dominate the fused feature representation. In this paper, ASFF is introduced between the PAFPN and the YOLO head to improve the model's ability to detect small smoke and fire targets.
In Figure 6, Out1, Out2, and Out3 are the three feature layer outputs of the PAFPN, and $X^{1\rightarrow3}$, $X^{2\rightarrow3}$, and $X^{3\rightarrow3}$ denote the features of the Out1, Out2, and Out3 layers adjusted to the scale of level 3, respectively. $X^{1\rightarrow3}$, $X^{2\rightarrow3}$, and $X^{3\rightarrow3}$ are then multiplied by their respective weights and summed, and ASFF-3 is obtained by this feature fusion.
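A minimal sketch of the fusion step for one output level is shown below: the three PAFPN outputs are resized to a common resolution, per-level weight maps are predicted with 1 × 1 convolutions and normalized by a softmax, and the weighted features are summed. The channel handling and resizing strategy are illustrative assumptions rather than the exact FL-YOLOv7 configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFF(nn.Module):
    """Adaptive Spatial Feature Fusion for one output level (sketch):
    fused = w1*X(1->l) + w2*X(2->l) + w3*X(3->l), with w1 + w2 + w3 = 1 at every pixel."""
    def __init__(self, channels):
        super().__init__()
        # one 1x1 conv per input level to predict a single-channel weight map
        self.weight_convs = nn.ModuleList([nn.Conv2d(channels, 1, 1) for _ in range(3)])

    def forward(self, feats, target_size):
        # feats: list of three feature maps, already reduced to `channels` channels
        resized = [F.interpolate(f, size=target_size, mode="nearest") for f in feats]
        logits = torch.cat([conv(f) for conv, f in zip(self.weight_convs, resized)], dim=1)
        weights = torch.softmax(logits, dim=1)             # adaptive spatial weights
        return sum(weights[:, i:i + 1] * resized[i] for i in range(3))
```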

3.5. Loss Function Improvement

Low-quality samples, such as images with poor quality or blurred or very small targets, are inevitably present in the training dataset. When training a target detection model, geometric measures such as the distance between boxes and the aspect ratio are used to evaluate and optimize the detection results, and these geometric measures may exacerbate the penalty on low-quality samples, causing the model to overfit them and reducing its generalization ability. A good loss function should attenuate the penalty of the geometric measures when the anchor box and the target box already overlap well. Bounding box regression (BBR) loss functions are designed to optimize the degree of overlap between the predicted and ground-truth boxes. Most existing BBR losses are based on the intersection over union (IoU) or its variants, such as GIoU, DIoU, CIoU, and SIoU, which alleviate the gradient vanishing problem of the IoU loss by introducing a geometric penalty term. However, none of these losses account for the presence of low-quality samples in the training data, which may interfere with the learning process and degrade localization performance. To solve this problem, the WIoU [30] loss function evaluates the quality of an anchor box by its outlier degree rather than its IoU through a dynamic non-monotonic focusing mechanism and assigns gradient gains accordingly. The WIoU loss is defined as follows:
$L_{WIoU} = r \cdot \exp\!\left(\frac{\rho^2(b, b^{gt})}{(W_g^2 + H_g^2)^*}\right)(1 - IoU)$
$r = \frac{\beta}{\delta\, \alpha^{\beta - \delta}}, \quad \beta = \frac{(1 - IoU)^*}{\overline{1 - IoU}} \in [0, +\infty)$
where $\rho(b, b^{gt})$ is the distance between the centers of the predicted and ground-truth boxes, $W_g$ and $H_g$ are the width and height of the smallest enclosing box, the superscript $*$ indicates that the term is detached from the computational graph, $\beta$ is the outlier degree, and $\alpha$ and $\delta$ are hyperparameters.
The WIoU loss function uses the outlier degree to dynamically adjust the gradient gain of each anchor box during training, concentrating the gain on anchor boxes of ordinary quality and thereby improving detection accuracy.
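A hedged sketch of the WIoU computation described above follows. The hyperparameter values α = 1.9 and δ = 3 and the use of a running mean of the IoU loss to estimate the outlier degree are assumptions drawn from common WIoU settings, not necessarily the configuration used in this paper.

```python
import torch

def wiou_loss(pred, target, iou_running_mean, alpha=1.9, delta=3.0):
    """Wise-IoU with dynamic non-monotonic focusing (sketch).
    pred/target: (N, 4) boxes in (x1, y1, x2, y2) format.
    iou_running_mean: scalar tensor tracking the mean (1 - IoU) over training."""
    # intersection and union
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + 1e-7)
    l_iou = 1.0 - iou

    # distance between box centres and size of the smallest enclosing box
    c_p = (pred[:, :2] + pred[:, 2:]) / 2
    c_t = (target[:, :2] + target[:, 2:]) / 2
    enc_wh = torch.max(pred[:, 2:], target[:, 2:]) - torch.min(pred[:, :2], target[:, :2])
    dist2 = ((c_p - c_t) ** 2).sum(dim=1)
    diag2 = (enc_wh ** 2).sum(dim=1)

    # distance-based focusing term with the enclosing-box size detached (the * above)
    r_wiou = torch.exp(dist2 / diag2.detach())

    # dynamic non-monotonic focusing: outlier degree beta and gradient gain r
    beta = l_iou.detach() / (iou_running_mean + 1e-7)
    r = beta / (delta * alpha ** (beta - delta))
    return (r * r_wiou * l_iou).mean()
```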

3.6. Improved Network Structure FL-YOLOv7

Figure 7 and Figure 8 present the network architecture of FL-YOLOv7 and its main modules. The improved model replaces the original ELAN modules in the backbone with C3GhostV2 modules and the original MP modules with GhostMP modules. The GhostMP module consists of Maxpool, GBS, CBS, and concatenation (CAT). The first branch passes through a maximum pooling layer, which divides the feature map into windows of a certain size and takes the maximum value within each window as its output, achieving downsampling; it then goes through a 1 × 1 convolutional layer that changes the number of channels, which helps the network learn richer and more abstract features. The second branch consists of a 1 × 1 GhostConv layer, which transforms the channels, and 3 × 3 GhostConv layers, in which the 3 × 3 convolution with a stride of 2 performs downsampling; a sketch of this layout follows this paragraph. The SimAM attention mechanism is used in the head and neck of the model to refine the features of smoke and fire targets with full 3-D weights, improving the model's ability to recognize them, raising the accuracy, and reducing the false detection rate. On the detection side, the ASFF module dynamically weights features at different levels and combines their semantic information, which helps the model adapt to the characteristics of different targets and scene changes and improves the accuracy and robustness of target detection as well as the performance on small smoke and fire targets. At the same time, the gradient gain assignment strategy of WIoU alleviates the poor generalization caused by low-quality samples in the training set. Together, these measures effectively reduce the number of parameters of the YOLOv7 model and improve its detection speed and its performance on small smoke and fire targets, so that it can be better deployed on remote sensing platforms such as UAVs for forest fire detection.
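The following sketch illustrates one possible reading of the GhostMP layout described above (MaxPool + 1 × 1 convolution in one branch, a GhostConv channel transform followed by a stride-2 3 × 3 stage in the other, concatenated along the channel dimension); the exact layer arrangement and channel split are assumptions, and CBS and GhostModule refer to the earlier sketches.

```python
import torch
import torch.nn as nn

class GhostMP(nn.Module):
    """GhostMP downsampling (sketch): a MaxPool + 1x1 conv branch and a GhostConv
    branch with a stride-2 3x3 stage, concatenated along the channel dimension."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_half = c_out // 2
        self.branch1 = nn.Sequential(
            nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True),  # spatial downsampling
            CBS(c_in, c_half, k=1))                                 # 1x1 channel adjustment
        self.branch2 = nn.Sequential(
            GhostModule(c_in, c_half),                              # GhostConv channel transform
            nn.Conv2d(c_half, c_half, 3, stride=2, padding=1,
                      groups=c_half, bias=False),                   # stride-2 3x3 downsampling
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch2(x)], dim=1)
```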

4. Experiments and Discussion

4.1. Datasets

Forest fire detection based on YOLOv7 relies heavily on the quality of the dataset, and training on high-quality datasets allows the deep learning model to extract more effective features. We obtained images not only from public forest smoke and fire datasets such as FLAME and Alert Wildfire but also from the web through Python scripts. In total, there are 6360 images and 18,389 instances, of which 10,663 are flame instances and 7726 are smoke instances. The number of images in each scenario is shown in Table 1. Targets that resemble fire, such as lights, often appear in images with dark backgrounds, while targets that resemble smoke, such as clouds, typically appear in pictures taken during the day.
We divided the dataset according to an 8:1:1 ratio and converted it to the COCO [21] format. The specific number of images in each set is shown in Table 2.
In the COCO format, targets with a pixel area smaller than 32 × 32 are defined as small targets, and those between 32 × 32 and 96 × 96 are defined as medium targets. Because most of the images in our dataset were captured by devices such as UAVs, the dataset contains a relatively large proportion of small target instances. Some images from the training set are shown in Figure 9.
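For reference, the COCO size convention used here can be applied with a small helper like the one below; the thresholds 32² and 96² follow the definition in the text, and the function name is illustrative.

```python
def coco_size_category(box_w, box_h):
    """Classify a ground-truth box by pixel area using the COCO convention."""
    area = box_w * box_h
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

# example: a 25 x 20 px flame box in an aerial image counts as a small target
assert coco_size_category(25, 20) == "small"
```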

4.2. Experimental Platform and Model Hyperparameters

The hardware and software configurations of the experimental platform are listed in Table 3. To enhance sample balance and improve the model's generalization capability, color transformations and mosaic data augmentation were employed during the experiments, increasing the number of training samples. The initial learning rate was set to 0.01, the batch size to 12, and the image size to 640 × 640 pixels. Table 4 lists the training parameters in detail.
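The settings in Table 4 can be gathered into the kind of hyperparameter dictionary typically passed to a YOLO-style training script; the key names below are illustrative assumptions, not the interface of the authors' code.

```python
# Illustrative training configuration mirroring Table 4 (key names are assumptions).
train_cfg = {
    "epochs": 150,
    "batch_size": 12,
    "workers": 8,
    "conf_thres": 0.25,
    "img_size": 640,                          # 640 x 640 input resolution
    "lr0": 0.01,                              # initial learning rate
    "optimizer": "SGD",
    "augment": ["color_jitter", "mosaic"],    # augmentations described in the text
}
```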

4.3. Model Evaluation

The experiments used the mean average precision $mAP_{50}$, the mean average precision for small targets $mAP_{50}^{small}$, and the frames per second (FPS) as evaluation metrics. FPS reflects the processing speed of the model; $mAP_{50}^{small}$ is the average detection precision for targets with a pixel area smaller than 32² when the IoU (Intersection over Union) threshold is set to 0.5; and $mAP_{50}$ is the average detection precision obtained with an IoU threshold of 0.5. Calculating the mAP requires the precision and recall, which are defined as follows:
$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}$

$AP = \int_0^1 P(R)\,\mathrm{d}R, \quad mAP = \frac{1}{C}\sum_{i=1}^{C} AP_i$
where TP (true positive) is the number of targets that are positive samples and are correctly predicted as positive; FP (false positive) is the number of targets that are negative samples but are erroneously detected as positive; and FN (false negative) is the number of targets that are positive samples but are incorrectly predicted as negative. The average precision (AP) is the area under the curve with precision (P) on the vertical axis and recall (R) on the horizontal axis.
A detection is counted as a TP when its IoU with a ground-truth box is greater than the set threshold, and as an FP otherwise; the number of retained detection boxes is inversely proportional to the threshold. IoU is calculated as follows:
$IoU = \frac{TP}{TP + FP + FN}$
F P S is the number of images that the model can detect per second, which represents the speed of detection and is calculated as follows:
$FPS = \frac{N}{T}$
where N represents the total number of images, and T denotes the time taken to test all the images.
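The quantities above can be computed with a small helper like the following sketch; it assumes precision-recall points already sorted by confidence and uses a simple trapezoidal integration, which approximates rather than reproduces the exact COCO evaluation protocol.

```python
import time
import numpy as np

def average_precision(recall, precision):
    """Area under the precision-recall curve via trapezoidal integration (sketch)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]      # make precision monotonically decreasing
    return float(np.trapz(p, r))

def measure_fps(model, images):
    """FPS = N / T: number of images divided by the total inference time."""
    start = time.time()
    for img in images:
        model(img)
    return len(images) / (time.time() - start)
```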

4.4. Experimental Results and Analysis

To validate the effectiveness of the FL-YOLOv7 model, a comprehensive set of comparison and ablation experiments was performed, and the results were analyzed in detail using the evaluation metrics defined above.

4.4.1. Comparison of Baseline Networks

Three network architectures from the YOLOv7 family were chosen for comparison: YOLOv7-tiny, YOLOv7, and YOLOv7x. As can be seen in Table 5, although YOLOv7-tiny has fewer parameters and FLOPs and a higher FPS, its accuracy is too low to meet the detection requirements. YOLOv7x has accuracy similar to YOLOv7 but consumes more resources and has the slowest inference speed. After comparing these models, YOLOv7 was chosen as our baseline network for further improvement.

4.4.2. Comparative Experimental Analysis of Proposed and Lightweight Networks

Table 6 presents a comparative analysis of the FL-YOLOv7 backbone against lightweight baseline backbones, including MobileNet v3 [31], ShuffleNet v2 [32], and MobileOne [33]. Lightweight networks reduce the parameters and FLOPs, but their feature representation ability also decreases, which ultimately leads to a significant drop in accuracy. Although YOLOv7-C3GhostV2 has slightly more parameters and FLOPs than the other lightweight networks, overall it provides the best balance between model size, detection accuracy, and detection speed.

4.4.3. Experimental Analysis of the Proposed Method and Other Attention Blocks

Table 7 compares the performance of the SimAM module with other attention mechanisms, namely SE [26], CBAM [34], and CA (coordinate attention) [35], in the baseline neck network. The experimental results show that different attention modules affect the performance and efficiency of the model to different degrees relative to the baseline. The SE module improves performance by adaptively computing channel weights and has the fastest processing speed, but its mAP50 is lower than that of the other modules. The CA module captures spatial and channel attention well and performs well in detection accuracy, but its number of attention parameters is large. The CBAM module is similar to the CA module but has far fewer attention parameters and therefore a smaller model size. The SimAM module derives full 3-D attention weights over the feature map without adding any parameters; compared with the other modules, it is second only to the SE module in detection speed while achieving high detection accuracy and a small model size. Therefore, the SimAM module is more suitable for forest fire small target detection scenarios that require high speed and small model size while maintaining accuracy.

4.4.4. Ablation Experiments

In the previous sections, a series of cross-sectional comparisons of the individual improvements to the FL-YOLOv7 network demonstrated their superiority over other methods. To verify the effectiveness of the combined improvement strategies, a longitudinal comparison was carried out through ablation tests, analyzing the impact of the different strategies on detection performance; the results are shown in Table 8 and Figure 10. From the table, replacing the backbone with C3GhostNetV2 reduces the number of parameters by 27% and the FLOPs by 63.5% and improves the detection speed by 38.1% compared with the baseline model. Incorporating the SimAM attention mechanism into the model brings a substantial increase in detection speed and a slight increase in overall detection accuracy and small target accuracy. After replacing the backbone with C3GhostNetV2 and then incorporating SimAM, the detection accuracy improves and the detection speed increases further, while the parameters and FLOPs of the model remain essentially unchanged. The ASFF module is used to enlarge the receptive field of the network and improve the detection accuracy for small forest fire targets: after incorporating ASFF, the $mAP_{50}$ improves by 2.4% and the small target accuracy ($mAP_{50}^{small}$) improves by 3.5%. The experimental results show that the C3GhostNetV2 module effectively reduces the number of YOLOv7 parameters and improves the detection speed, the SimAM attention module improves the recognition of smoke and fire targets without increasing the number of parameters while also improving the detection speed, and the ASFF module effectively addresses the receptive field problem of small target recognition. Finally, with essentially the same detection accuracy as YOLOv7, the improved FL-YOLOv7 is lighter and faster and is better suited to UAV aerial photography and watchtower monitoring of small forest fire targets.
Figure 11 compares the loss curves before and after using WIoU and data augmentation. In the original training, the loss on the training set is significantly lower than the loss on the validation set, indicating overfitting, whereas after using the WIoU loss function, the gap between the training and validation losses narrows and both are lower than in the original training.

4.4.5. Comparison Experiments

To further analyze the performance of the improved algorithm in detecting small forest fire targets, FL-YOLOv7 was compared with the YOLOv7, Faster R-CNN [36], RetinaNet [37], SSD [38], and EfficientDet [39] networks on a test set of 600 images. The recognition results, detection frame rate, model size, and number of parameters of each network are shown in Table 9 and Figure 12. As can be seen from Table 9, FL-YOLOv7 is only slightly lower than YOLOv7 in detection accuracy, while its detection speed is much higher than that of the other network models. Although FL-YOLOv7 has slightly more parameters than lightweight models such as SSD and EfficientDet-D4, it is much stronger than both in detection accuracy and detection speed.
Examples of the detection results of the EfficientDet-D4, YOLOv7, and FL-YOLOv7 network models are shown in Figure 13. As can be seen from the figure, FL-YOLOv7 retains the characteristics of a lightweight model while producing detection results close to those of YOLOv7 and reduces the miss rate for small targets, whereas the other lightweight model, EfficientDet-D4, performs worse, with both misdetections and missed detections. Although FL-YOLOv7 has slightly more parameters than EfficientDet-D4, it performs far more reliably in forest fire small target detection.
Because the color and shape of smoke and fire targets differ between scenes, and common smoke and fire detection models depend heavily on target color, false alarms can easily occur when objects in the background have color values similar to those of the fire area. For this reason, we used a large number of forest fire images in complex scenes to test the robustness of the FL-YOLOv7 network model and compared it with the YOLOv7 and EfficientDet network models, as shown in Figure 14.
Figure 14a–c show smoke detection against a strong natural light background; Figure 14d–f show distant forest wildfire images captured by a UAV at night. As can be seen from the comparison, EfficientDet-D4 has a high false detection rate in the forest fire images captured at night, identifying city lights as flame targets, and also misdetects smoke against the strong natural light background. The experimental results show that the FL-YOLOv7 network model maintains good performance under different lighting conditions, is robust to changes in illumination, and has low miss and false detection rates in complex environments containing smoke-like and fire-like targets.

5. Conclusions and Discussion

Many of the large forest fires that occur globally evolve from small fires, so it is critical to detect fires accurately and quickly in their early stages. As an emerging aircraft, the UAV is suitable for forest fire monitoring because of its small size, high maneuverability, low cost, and ability to carry a variety of sensors. However, in aerial forest fire images, flame targets occupy a small fraction of the pixels, making it difficult to extract discriminative features. In this paper, a lightweight small target forest fire detection model based on YOLOv7, named FL-YOLOv7, is proposed. First, a more lightweight C3GhostV2 structure is proposed to replace the backbone of the YOLOv7 target detection model; after model compression and accelerated inference, the optimized network greatly reduces the dependence on the hardware environment and can be deployed more easily on devices such as UAVs. Second, the parameter-free SimAM attention mechanism is added to enhance the model's ability to recognize smoke and fire targets and improve the detection speed without increasing the number of network parameters. Meanwhile, adaptive spatial feature fusion addresses the model's insufficient ability to detect small targets. Finally, the influence of both high-quality and low-quality anchor boxes on the loss value is reduced by a dynamic non-monotonic focusing mechanism, which enables the model to focus on ordinary-quality anchor boxes and improves the overall performance of the detector.
Under the same conditions, the FL-YOLOv7 algorithm was compared with other classical target detection algorithms, and its actual detection effect was better than theirs. The experimental results show that the improved model has practical value for monitoring forest fires in aerial images. FL-YOLOv7 uses the SimAM attention mechanism to extract flame features and contextual information, which allows it to distinguish fire-like scenes to some extent. However, this method is not perfect and cannot completely prevent false alarms, for example, on foggy or cloudy days when smoke and clouds look very similar, or in scenes where light sources such as street lamps or the sun at sunset closely resemble flames. While the human eye can distinguish man-made light from fire relatively easily, computer vision systems are sometimes affected by factors such as brightness and reflections and may misclassify fire-like scenes as real fires, because in some cases these scenes have visual characteristics similar to real flames. Our model has a lower false alarm rate than other models, but false alarms can still occur. In the future, we will address these issues in several ways:
1.
Multimodal information fusion: Leveraging data from multiple sensors or data sources, such as video, infrared imagery, and smoke sensor data, for multimodal information fusion. By effectively utilizing different types of data, the reliability and accuracy of fire detection can be improved.
2.
Anomaly detection and spatiotemporal modeling: The use of anomaly detection and spatiotemporal modeling techniques to analyze and model fire or smoke at a finer granularity. By modeling normal and abnormal scenarios, the ability to detect and identify man-made fire or smoke can be improved, false alarm rates reduced, and system robustness increased.
3.
Object detection and tracking: By detecting areas of fire or smoke in the scene and using object tracking algorithms to track the movement of these areas. This approach can help determine the location, shape, and trajectory of the fire or smoke, providing additional information about the fire or smoke that can assist in further analysis and decision-making.

Author Contributions

Conceptualization, Z.X. and F.W.; Methodology, Z.X. and Z.Y.; Software, Z.X. and G.L.; Validation, Z.X.; Formal analysis, L.X. and Z.Y.; Investigation, F.W., G.L., Y.X., W.L. and W.Z.; Resources, C.X.; Data curation, Z.X., G.L. and Y.X.; Writing—original draft, Z.X.; Writing—review & editing, F.W., G.L., Y.X., W.L. and Z.Y.; Visualization, Z.X.; Supervision, L.X. and W.Z.; Funding acquisition, C.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (grant No. 62202147) and the Science and Technology Research Project of the Education Department of Hubei Province (grant No. B2021070).

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sahoo, G.; Wani, A.; Rout, S.; Sharma, A.; Prusty, A. Impact and contribution of forest in mitigating global climate change. Des. Eng. 2021, 4, 667–682. [Google Scholar]
  2. Gaur, A.; Singh, A.; Kumar, A.; Kulkarni, K.S.; Lala, S.; Kapoor, K.; Srivastava, V.; Kumar, A.; Mukhopadhyay, S.C. Fire sensing technologies: A review. IEEE Sens. J. 2019, 19, 3191–3202. [Google Scholar] [CrossRef]
  3. Akhloufi, M.A.; Couturier, A.; Castro, N.A. Unmanned aerial vehicles for wildland fires: Sensing, perception, cooperation and assistance. Drones 2021, 5, 15. [Google Scholar] [CrossRef]
  4. Liu, W.; Yang, Y.; Hao, J. Design and research of a new energy-saving UAV for forest fire detection. In Proceedings of the 2022 IEEE 2nd International Conference on Electronic Technology, Communication and Information (ICETCI), Changchun, China, 27–29 May 2022; pp. 1303–1316. [Google Scholar]
  5. Muid, A.; Kane, H.; Sarasawita, I.; Evita, M.; Aminah, N.; Budiman, M.; Djamal, M. Potential of UAV Application for Forest Fire Detection. In Proceedings of the Journal of Physics: Conference Series; IOP Publishing: Bristol, UK, 2022; Volume 2243, p. 012041. [Google Scholar]
  6. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
  7. Grari, M.; Yandouzi, M.; Idrissi, I.; Boukabous, M.; Moussaoui, O.; Azizi, M.; Moussaoui, M. Using IoT and ML for Forest Fire Detection, Monitoring, and Prediction: A Literature Review. J. Theor. Appl. Inf. Technol. 2022, 100, 19. [Google Scholar]
  8. Avazov, K.; Hyun, A.E.; Sami S, A.A.; Khaitov, A.; Abdusalomov, A.B.; Cho, Y.I. Forest Fire Detection and Notification Method Based on AI and IoT Approaches. Future Internet 2023, 15, 61. [Google Scholar] [CrossRef]
  9. Rahman, M.A.; Hasan, S.T.; Kader, M.A. Computer vision based industrial and forest fire detection using support vector machine (SVM). In Proceedings of the 2022 International Conference on Innovations in Science, Engineering and Technology (ICISET), Chittagong, Bangladesh, 26–27 February 2022; pp. 233–238. [Google Scholar]
  10. Tian, Y.; Wu, Z.; Li, M.; Wang, B.; Zhang, X. Forest fire spread monitoring and vegetation dynamics detection based on multi-source remote sensing images. Remote Sens. 2022, 14, 4431. [Google Scholar] [CrossRef]
  11. Vargas-Cuentas, N.I.; Roman-Gonzalez, A. Satellite-based analysis of forest fires in the Bolivian Chiquitania and Amazon Region: Case 2019. IEEE Aerosp. Electron. Syst. Mag. 2021, 36, 38–54. [Google Scholar] [CrossRef]
  12. Wu, S.; Zhang, L. Using popular object detection methods for real time forest fire detection. In Proceedings of the 2018 11th International Symposium on Computational Intelligence and Design (ISCID), Hangzhou, China, 8–9 December 2018; Volume 1, pp. 280–284. [Google Scholar]
  13. Zhan, J.; Hu, Y.; Cai, W.; Zhou, G.; Li, L. PDAM–STPNNet: A small target detection approach for wildland fire smoke through remote sensing images. Symmetry 2021, 13, 2260. [Google Scholar] [CrossRef]
  14. Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Cham, Switzerland, 2022; pp. 443–459. [Google Scholar]
  15. Xue, Z.; Lin, H.; Wang, F. A small target forest fire detection model based on YOLOv5 improvement. Forests 2022, 13, 1332. [Google Scholar] [CrossRef]
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 2. [Google Scholar]
  17. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2778–2788. [Google Scholar]
  18. Liu, Y.; Cao, S.; Lasang, P.; Shen, S. Modular lightweight network for road object detection using a feature fusion approach. IEEE Trans. Syst. Man Cybern. Syst. 2019, 51, 4716–4728. [Google Scholar] [CrossRef]
  19. Lv, Y.; Liu, J.; Chi, W.; Chen, G.; Sun, L. An inverted residual based lightweight network for object detection in sweeping robots. Appl. Intell. 2022, 52, 12206–12221. [Google Scholar] [CrossRef]
  20. Javadi, S.; Dahl, M.; Pettersson, M.I. Vehicle detection in aerial images based on 3D depth maps and deep neural networks. IEEE Access 2021, 9, 8381–8391. [Google Scholar] [CrossRef]
  21. Li, Y.; Yuan, H.; Wang, Y.; Xiao, C. GGT-YOLO: A novel object detection algorithm for drone-based maritime cruising. Drones 2022, 6, 335. [Google Scholar] [CrossRef]
  22. Li, D.; Sun, X.; Elkhouchlaa, H.; Jia, Y.; Yao, Z.; Lin, P.; Li, J.; Lu, H. Fast detection and location of longan fruits using UAV images. Comput. Electron. Agric. 2021, 190, 106465. [Google Scholar] [CrossRef]
  23. Jindal, P.; Gupta, H.; Pachauri, N.; Sharma, V.; Verma, O.P. Real-time wildfire detection via image-based deep learning algorithm. In Soft Computing: Theories and Applications: Proceedings of SoCTA 2020; Springer: Berlin/Heidelberg, Germany, 2021; Volume 2, pp. 539–550. [Google Scholar]
  24. Du, H.; Zhu, W.; Peng, K.; Li, W. Improved High Speed Flame Detection Method Based on YOLOv7. Open J. Appl. Sci. 2022, 12, 2004–2018. [Google Scholar] [CrossRef]
  25. Lv, Y.; Ai, Z.; Chen, M.; Gong, X.; Wang, Y.; Lu, Z. High-Resolution Drone Detection Based on Background Difference and SAG-YOLOv5s. Sensors 2022, 22, 5825. [Google Scholar] [CrossRef] [PubMed]
  26. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  27. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13733–13742. [Google Scholar]
  28. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
  29. Tang, Y.; Han, K.; Guo, J.; Xu, C.; Xu, C.; Wang, Y. GhostNetv2: Enhance cheap operation with long-range attention. Adv. Neural Inf. Process. Syst. 2022, 35, 9969–9982. [Google Scholar]
  30. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  31. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  32. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  33. Vasu, P.K.A.; Gabriel, J.; Zhu, J.; Tuzel, O.; Ranjan, A. MobileOne: An Improved One Millisecond Mobile Backbone. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7907–7917. [Google Scholar]
  34. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  35. Zhang, C.; Lin, G.; Liu, F.; Yao, R.; Shen, C. Canet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 5217–5226. [Google Scholar]
  36. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef]
  37. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  38. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  39. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
Figure 1. The network structure of YOLOv7.
Figure 2. Module structure of C3GhostV2.
Figure 3. GhostNetV2 bottleneck.
Figure 4. The information aggregation process of different patches.
Figure 5. Full 3-D weights for attention.
Figure 6. The ASFF model after being improved.
Figure 7. The network structure of FL-YOLOv7.
Figure 8. The module structure of FL-YOLOv7.
Figure 9. Images containing small-target fires: (a,b) small-target fires photographed by a drone at low altitude; (c) small-target forest fires photographed by a drone at high altitude; (d) small-target smoke in the forest captured by an elevated lookout camera.
Figure 10. Comparison chart of ablation experiments.
Figure 11. Comparison chart of loss experiments.
Figure 12. Comparison chart of comparison experiments.
Figure 13. Example of detection results.
Figure 14. Example of detection results.
Table 1. Details of the dataset.

Scenarios       Number
Daylight        4770
Darkness        1590
Similar-fires   200
Similar-smoke   300
Table 2. Number of images in each set.

Dataset      Number
Train        5088
Validation   636
Test         636
Summary      6360
Table 3. Experimental environment configuration.

Experimental Environment   Configuration Information
Operating System           Windows 10
RAM                        DDR4 32 GB (DDR4 16 GB × 2)
Storage                    SSD: 512 GB/HDD: 1 TB
CPU                        Intel(R) Core(TM) i7-11700F
GPU                        NVIDIA GeForce RTX 3060
CUDA                       CUDA 11.6
Framework                  PyTorch 1.12.0
Python                     3.7
Table 4. Experimental parameters configuration.

Training Parameters      Details
Epochs                   150
Batch size               12
Workers                  8
Conf-thres               0.25
Image size (pixels)      640 × 640
Initial learning rate    0.01
Optimization algorithm   SGD
Table 5. Performance comparison of different models of the YOLOv7 network.

Method        Params (M)   FLOPs (G)   mAP50 (%)   mAP50s (%)   FPS (frame/s)
YOLOv7-tiny   6.0          13.2        65.11       53.2         90.9
YOLOv7        36.5         103.2       74.2        60.7         58.1
YOLOv7x       70.8         188.9       76.1        62.9         47.3
Table 6. Compare the performance of different YOLOv7 network models.

Method                Params (M)   FLOPs (G)   mAP50 (%)   mAP50s (%)   FPS (frame/s)
YOLOv7-MobileNetv3    24.6         36.6        63.7        52.1         63.2
YOLOv7-ShuffleNetv2   23.3         37.9        65.3        54.5         72.1
YOLOv7-MobileOne      23.3         40.1        66.5        57.6         92.3
YOLOv7-C3GhostV2      25.5         37.6        69.3        58.2         80.6
Table 7. Comparison of the performance of the attention blocks in YOLOv7.

Method         Block Params   FLOPs (G)   mAP50 (%)   mAP50s (%)   FPS (frame/s)
YOLOv7-SE      163,840        102.9       63.1        55.4         101.3
YOLOv7-CA      247,680        103.2       73.5        62.1         79.6
YOLOv7-CBAM    772            103.2       70.5        56.2         81.2
YOLOv7-SimAm   0              39.3        74.5        60.7         89.3
Table 8. Performance comparison of different models of the YOLOv7 network.

Method                          Params (M)   FLOPs (G)   mAP50 (%)   mAP50s (%)   FPS (frame/s)
YOLOv7                          36.5         103.2       74.2        60.7         58.1
YOLOv7 + C3GhostnetV2           25.2         37.6        69.3        58.2         80.6
YOLOv7 + SimAm                  33.4         39.3        74.5        62.3         89.3
YOLOv7 + ASFF                   37.9         112.1       76.6        64.2         55.2
YOLOv7 + C3GhostnetV2 + SimAm   25.5         38.2        71.2        61.1         87.2
FL-YOLOv7                       27.5         41.7        73.3        63.6         82.5
Table 9. Performance comparison of target detection networks.

Method            Params (M)   FLOPs (G)   mAP50 (%)   mAP50s (%)   FPS (frame/s)
Faster R-CNN      136.6        369.7       58.2        42.5         30.8
RetinaNet         36.3         145.3       68.1        55.8         69.2
SSD               23.6         273.1       66.7        53.2         50.6
EfficientDet-D4   20.5         104.9       53.1        40.3         47.2
YOLOv7            36.5         103.2       74.2        60.7         58.1
FL-YOLOv7         27.5         41.7        73.3        63.6         82.5

Share and Cite

MDPI and ACS Style

Xiao, Z.; Wan, F.; Lei, G.; Xiong, Y.; Xu, L.; Ye, Z.; Liu, W.; Zhou, W.; Xu, C. FL-YOLOv7: A Lightweight Small Object Detection Algorithm in Forest Fire Detection. Forests 2023, 14, 1812. https://doi.org/10.3390/f14091812

AMA Style

Xiao Z, Wan F, Lei G, Xiong Y, Xu L, Ye Z, Liu W, Zhou W, Xu C. FL-YOLOv7: A Lightweight Small Object Detection Algorithm in Forest Fire Detection. Forests. 2023; 14(9):1812. https://doi.org/10.3390/f14091812

Chicago/Turabian Style

Xiao, Zhuo, Fang Wan, Guangbo Lei, Ying Xiong, Li Xu, Zhiwei Ye, Wei Liu, Wen Zhou, and Chengzhi Xu. 2023. "FL-YOLOv7: A Lightweight Small Object Detection Algorithm in Forest Fire Detection" Forests 14, no. 9: 1812. https://doi.org/10.3390/f14091812

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.
