Article

An Improved Lightweight YOLOv8 Network for Early Small Flame Target Detection

School of Electronic Information, North China Institute of Science and Technology, No. 467 Xueyuan Street, Yanjiao Development Zone, Sanhe, Langfang 065201, China
*
Author to whom correspondence should be addressed.
Processes 2024, 12(9), 1978; https://doi.org/10.3390/pr12091978
Submission received: 5 August 2024 / Revised: 4 September 2024 / Accepted: 12 September 2024 / Published: 13 September 2024

Abstract

The efficacy of early fire detection hinges on swift response and precision, which allows timely alerts to be issued in the nascent stages of a fire, thereby minimizing losses and injuries. To enhance the precision and speed of identifying minute early flame targets, as well as the ease of deployment at the network edge, an optimized early flame target detection algorithm based on YOLOv8 is proposed. First, the original feature fusion module of YOLOv8n, an FPN (feature pyramid network), is replaced with a BiFPN (bidirectional feature pyramid network) module. This modification enables the network to perform multi-scale fusion more efficiently and rapidly, thereby enhancing its capacity for integrating features across different scales. Secondly, the efficient multi-scale attention (EMA) mechanism is introduced to ensure the effective retention of information on each channel and to reduce the computational overhead, thereby improving the model's detection accuracy while reducing the number of model parameters. Subsequently, the NWD (normalized Wasserstein distance) loss function is employed as the bounding box loss function, which enhances the model's regression performance and robustness. The experimental results demonstrate that the size of the enhanced model is 4.8 MB, a reduction of 22.5% compared to the original YOLOv8n, while the mAP0.5 metric exhibits a 2.7% improvement over the original YOLOv8n, indicating a more robust detection capability with a more compact model. This makes it an ideal candidate for deployment on edge devices.

1. Introduction

With the continuous development of the warehousing industry, the number of warehouses and logistics centers is increasing, and the number of warehouse-related fires is also on the rise. Once a fire occurs in such places, it generally brings incalculable losses to enterprises and serious social impacts. According to information released by the National Fire Rescue Bureau, a total of 18,000 factory fires were reported in 2022, with 278 casualties and a preliminary estimate of economic losses of CNY 1.5 billion. A fire in a storage facility often develops from an inconspicuous spark; by the time it is discovered, it has already grown to a large scale and is difficult to extinguish. Large amounts of materials are often piled up in warehouses, which increases the fire load [1], so once a fire develops it causes major safety accidents and huge losses. The fire alarms used in storage environments are mostly smoke alarms and infrared detection sensors. However, these sensors are not sensitive enough to detect fires in their early stages and often only trigger after the fire has reached a certain degree, at which point it is extremely difficult to extinguish. If fire warnings can be issued in the early stages of a fire and timely intervention follows, the fire can be nipped in the bud and the property and personnel losses caused by the fire can be minimized.
In recent years, there have been significant developments in machine vision algorithms, particularly in target detection technology, and flame target detection has advanced accordingly. Yu et al. [2] demonstrated an effective improvement to a flame segmentation algorithm by replacing the original segmentation threshold with a proportional threshold; this approach enhances traditional color-space threshold segmentation while maximizing flame feature information. Xie et al. [3] proposed an enhanced flame detection algorithm based on YOLOv5: an embedded coordinate attention mechanism enables the model to identify and detect the target of interest with greater precision, while a novel loss function, α-IoU, improves the accuracy of the regression results; by combining the model with transfer learning, its accuracy is improved and its recognition of a single flame image is accelerated. Zhang Li et al. [4] proposed a flame image recognition model based on an optimized hybrid-kernel independent component analysis and an echo state network. To describe the feature information of the flame image in detail, three types of features, namely color, shape, and texture, are extracted comprehensively into 19 feature vectors, and a hybrid-kernel independent component analysis method is used to perform a nonlinear transformation and reduce the correlation between them; the resulting flame recognition model shows good recognition performance and generalization ability. While the aforementioned research has yielded promising results in flame detection, the recognition and detection of small targets for early flame identification still need further advancement. In practical applications, the ability to detect and raise alerts for early flames is crucial for prompt intervention. However, early small flame targets occupy a small pixel ratio, lack semantic information, and are susceptible to interference from complex scenes, which makes accurate detection difficult in practice and leads to missed and incorrect detections [5]. Furthermore, early small flame target detection algorithms must be capable of high-speed, accurate detection, which requires a high-performance computing network model and places significant demands on hardware. The deployment of a neural network model often requires a larger memory capacity, which can introduce computational delays, and a substantial amount of power, which presents a challenge in terms of energy efficiency [6]. The early small flame target recognition algorithm is intended for deployment on surveillance cameras and other edge devices, most of which have low power consumption and limited computational ability, conditions that are not conducive to deploying neural network models at the edge.
To address the above problems, this article proposes an early flame target recognition algorithm based on an improved YOLOv8. The algorithm effectively improves the recognition accuracy and speed for early flames, and the network model to be deployed is optimized so that its power consumption requirements are reduced. This makes deployment of the improved network model more advantageous and is of positive significance for later edge deployment.

2. Yolov8 Network Model

YOLOv8 is the latest version of the YOLO (you only look once) series of algorithms. It builds on YOLOv5, one of the most commonly used target detection models, and introduces new features and improvements to enhance performance, flexibility, and efficiency [7]. The network structure of YOLOv8 consists of four parts: the input, the backbone network, the neck network, and the predictive head, as shown in Figure 1.
In comparison to the network structure of YOLOv5, the following specific improvements have been made. In the backbone network, the CSP concept is still employed, but the initial C3 module has been replaced with the C2f module. This modification further reduces the model's complexity while enabling the acquisition of more comprehensive gradient flow information, thus maintaining the overall lightweight nature of the network. YOLOv8 continues to employ the PAN concept but differs from YOLOv5 in that the CBS 1 × 1 convolution structure in the upsampling stage has been eliminated and the C3 module has been replaced by the C2f module. YOLOv8 abandons the previous anchor-based approach and instead adopts an anchor-free design. With regard to the loss function, the VFL loss is used for classification, while the DFL loss combined with the CIoU loss is employed for regression; the previous IoU-matching and single-side proportional assignment strategies are replaced by the task-aligned assigner matching method.
YOLOv8 currently supports a full range of vision AI tasks, including detection, segmentation, pose estimation, tracking, and classification. It is available in five versions, YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x, with YOLOv8n having the smallest size and fastest operation speed, which makes it more suitable for deployment at the edge. Considering that the problem to be solved is early fire detection and warning in structured spaces such as warehouses, YOLOv8n is adopted as the baseline.
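As a concrete reference point, the sketch below (assuming the ultralytics Python package and its published "yolov8n.pt" checkpoint, with a hypothetical image path) shows how the YOLOv8n baseline used in this work can be loaded and inspected before any of the modifications described later are applied.

```python
# Minimal sketch: load and inspect the YOLOv8n baseline with the ultralytics API.
# The image path below is a hypothetical placeholder, not a file from this work.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")   # nano variant: the smallest and fastest of the five sizes
model.info()                 # prints layer count, parameter count, and GFLOPs

# results = model("warehouse_frame.jpg")  # hypothetical test image; uncomment to run inference
```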

3. Selection of Data Sets

Most of the currently available flame datasets come from sources such as the CVPR Lab and the CORSICAN FIRE DATABASE, and most of the images in these datasets show scenes after a large fire has already formed. Moreover, most of these datasets depict non-warehouse environments, whereas the focus of this article is early flame detection in warehouses and other structured spaces. Such environments may be dark and subject to interference from warehouse lighting, which affects the performance of fire detection.
Therefore, this article collected on-site warehouse flame pictures and videos as the dataset to ensure that it fully represents the real scenes where the model will be applied. The dataset includes flames of various types and sizes, as well as various possible background environments, such as warehouses and factories. These data were originally unlabeled; in order to train a better model, we manually annotated the dataset to ensure the accuracy and consistency of the annotation, since annotation quality directly affects the training of the model [8]. Considering that early flames are usually small and irregular in shape, we used rectangular bounding-box annotation. Most of the flame targets in the dataset are small targets, and the backgrounds are warehouses and other structured spaces, which matches the actual early flame detection scenario.
In order to improve the performance of the model, we augmented the original dataset using translation, flipping, scaling, rotation, and similar operations to improve the generalization ability of the model. These augmentation operations can also be regarded as introducing noise into the data, thereby enhancing the robustness of the model. In addition, we adjusted the brightness and contrast of the dataset to simulate flame detection under different lighting conditions.
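The sketch below illustrates one way such an augmentation pipeline could be written with the albumentations library, using YOLO-format bounding boxes; the specific probabilities, limits, and the single "fire" class are illustrative assumptions rather than the exact settings used in this work.

```python
# Illustrative augmentation pipeline: flip, shift/scale/rotate, brightness/contrast,
# applied jointly to the image and its YOLO-format bounding boxes.
import numpy as np
import albumentations as A

transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2, rotate_limit=15, p=0.7),
        A.RandomBrightnessContrast(brightness_limit=0.3, contrast_limit=0.3, p=0.5),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

image = np.zeros((640, 640, 3), dtype=np.uint8)   # placeholder frame
bboxes = [(0.5, 0.5, 0.05, 0.08)]                 # one small flame box: (cx, cy, w, h), normalized
labels = [0]                                      # single "fire" class (assumption)

augmented = transform(image=image, bboxes=bboxes, class_labels=labels)
aug_image, aug_bboxes = augmented["image"], augmented["bboxes"]
```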
A sample of the dataset used in this article is shown in Figure 2. These data provide our flame detection model with realistic and diversified training samples so that it can accurately and stably perform early flame detection in various warehouse environments.

4. Yolov8 Improvements

4.1. Feature Fusion

In recent years, feature fusion has been widely used in the field of computer vision to improve the performance of models. Feature fusion refers to the use of the complementarity between features when given features of different attributes, fusing the advantages between features and thereby improving the performance of the model. Among them, multi-scale feature fusion is widely used in target detection. Multi-scale feature fusion can help the model better capture target information under different scales [9]. By fusing feature maps from different convolutions, the model’s detection performance for targets of different scales is improved.
However, when a convolutional network extracts image features, the repeated downsampling operations cause the loss of spatial information and edge contour pixels, and the underlying semantic information cannot be fully utilized. To effectively solve this problem, researchers proposed the feature pyramid structure. A feature pyramid can recognize targets at different scales and, by extracting and fusing multi-scale feature information, it can improve the accuracy of the model. However, a feature pyramid requires a large amount of computation and memory, so Tsung-Yi Lin et al. [10] proposed a new method for constructing a feature pyramid, the FPN (feature pyramid network), which reduces the additional computation and memory consumption. The original YOLOv8 network also uses the FPN structure. Specifically, the FPN adds an extra branch at the top of the network to generate high-resolution feature maps; this branch increases the resolution through upsampling and convolution operations while retaining the semantic information of the high-level feature map. After generating the high-resolution feature map, the FPN realizes feature fusion through top-down and bottom-up connections. The top-down connection fuses the high-level feature map with the low-level feature map through upsampling, so that the low-level feature map obtains richer semantic information; the bottom-up connection fuses the low-level feature map with the high-level feature map through downsampling, so that the high-level feature map incorporates finer spatial detail. The network structure of the FPN is shown in Figure 3. In the end, the FPN generates a feature pyramid with multi-scale information through the fusion of multi-layer feature maps and up- and downsampling operations. This feature pyramid can be used in target detection and semantic segmentation tasks and can attend to targets of different scales simultaneously, improving the detection and segmentation capabilities of the network.
In a traditional FPN, the feature pyramid network usually consists of feature maps at multiple levels, and each level performs top-down and bottom-up feature fusion operations to achieve the fusion and transmission of multi-scale features. Although the FPN structure can significantly reduce the amount of computation, this multi-level feature fusion still increases the computation in the network, thereby affecting its speed and efficiency. To further improve the accuracy and efficiency of the model, this article adopts the BiFPN (bidirectional feature pyramid network) structure [11], an improvement on the FPN, for early small flame target detection. The new network reduces the number of feature fusion layers, thereby improving the efficiency of the model, and adopts bidirectional feature transmission and dynamic feature fusion as features are passed downward, which maintains high recognition accuracy. The multi-scale fusion scheme of the FPN is shown in Figure 4a and that of the BiFPN is shown in Figure 4b.
Compared to the original FPN structure, the BiFPN makes several design changes, as shown in the comparison diagram. First, nodes with only a single input edge are removed. If a node has a single input edge and performs no feature fusion, its contribution to the feature network that fuses disparate features is minimal; removing it does not reduce the functionality of the original network and effectively streamlines the whole feature fusion network. Second, if an original input node and an output node are on the same layer, an additional edge is introduced between them, allowing more features to be fused without a significant increase in computational load. Third, in contrast to the original FPN design, which comprises a single top-down and bottom-up path, each bidirectional path is regarded as a distinct feature network layer, and repeating a given level facilitates the integration of higher-level features. The BiFPN thus reduces the number of feature fusion layers relative to the FPN while still performing fusion at each level, which effectively improves computational efficiency. Furthermore, the bidirectional feature transmission and dynamic feature fusion used when features are passed downward maintain high recognition accuracy, yielding a better solution for edge deployment. A minimal sketch of the weighted fusion used at each BiFPN node is given below.
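The sketch below shows the core of this idea, the fast normalized (weighted) fusion applied at each BiFPN node [11]; the channel count, the 1 × 1 projection, and the SiLU activation are illustrative assumptions rather than the exact layers used in the modified network.

```python
# Fast normalized fusion: each input map gets a learnable non-negative weight,
# and the weights are normalized so the fusion stays stable without a softmax.
import torch
from torch import nn


class WeightedFusion(nn.Module):
    """Fuses several same-shaped feature maps with learned, normalized weights."""

    def __init__(self, n_inputs: int, channels: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)  # simple post-fusion projection
        self.act = nn.SiLU()

    def forward(self, features):
        w = torch.relu(self.weights)          # keep fusion weights non-negative
        w = w / (w.sum() + self.eps)          # fast normalized fusion
        fused = sum(wi * fi for wi, fi in zip(w, features))
        return self.act(self.conv(fused))


# Example: fuse an upsampled top-down map with the same-level backbone map.
p4_td = torch.randn(1, 128, 40, 40)
p4_in = torch.randn(1, 128, 40, 40)
fusion = WeightedFusion(n_inputs=2, channels=128)
out = fusion([p4_td, p4_in])                  # shape: (1, 128, 40, 40)
```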

4.2. Attention Mechanism

In the field of computer vision, the attention mechanism is a prevalent technique employed to emulate the human visual and cognitive system, enabling neural networks to concentrate on pivotal feature points within the data. The attention mechanism enables a neural network to selectively filter important information from the data, extract key information points, and discard insignificant ones, thereby enhancing the model's performance and generalization ability. The attention mechanism utilized in the original YOLOv8 neural network is the CBAM (convolutional block attention module), which integrates channel and spatial attention: the CBAM receives the input feature layer and applies the channel attention mechanism and the spatial attention mechanism in sequence. In comparison to the SENet (squeeze-and-excitation network) module used with YOLOv5, which concentrates solely on channel attention, the CBAM can yield superior outcomes and enhance the model's performance, but it requires greater computational resources and has a more intricate computational complexity [12,13]. Furthermore, the CBAM cannot effectively enrich the original feature space with spatial information from feature maps of different scales; it only captures local information and cannot establish long-range dependencies across channels.
Ouyang et al. [14] proposed a novel attention mechanism module, EMA (efficient multi-scale attention), which aims to preserve the information on each channel while reducing computational overhead. This is achieved by reshaping part of the channel dimension into the batch dimension and grouping the channel dimension into multiple sub-features, ensuring that spatial semantic features are evenly distributed within each feature group. The overall structure of the EMA is illustrated in Figure 5.
The EMA module splits the input into three branches along the channel dimension. These three branches are arranged in parallel to aggregate multi-scale spatial and structural information with a rapid response, and the large local receptive fields of the neurons enable them to gather multi-scale spatial information. The EMA uses three parallel routes to extract the attention weight descriptors of the grouped feature maps: two of the routes are located in the 1 × 1 branch and the third is located in the 3 × 3 branch. To capture the dependencies between all channels while reducing computational cost, two 1D global average pooling operations are employed in the 1 × 1 branch to encode the channels along the two spatial directions separately. Furthermore, only a single 3 × 3 convolution is stacked in the 3 × 3 branch to capture multi-scale features.
Since there is no batch coefficient in a normal convolution operation, the number of convolution kernels is independent of the batch size of the forward-pass input. The parameter dimension of a 2D convolution kernel is $[oup, inp, k, k]$, where $oup$ represents the output planes, $inp$ represents the input planes of the input feature, and $k$ represents the kernel size. The grouped sub-features are therefore reshaped and transposed into the batch dimension, and the input tensor is redefined with the shape $C//G \times H \times W$.
The two encoded features are concatenated along the image height direction, and both share the same 1 × 1 convolution without dimensionality reduction in the 1 × 1 branch. The output of the 1 × 1 convolution is then decomposed into two vectors, and two Sigmoid nonlinearities are applied to them:
$$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$$
These Sigmoid functions fit a 2D binomial distribution onto the output of the linear convolution. In order to achieve distinct cross-channel interaction characteristics between the two parallel routes within the 1 × 1 branch, a simple multiplication is employed to aggregate the two channel attention maps within each group. By capturing local cross-channel interactions through the 3 × 3 branch convolution, an expanded feature space is obtained. The EMA module is thus capable not only of encoding inter-channel information to adjust the relative importance of different channels, but also of preserving precise spatial structure information within the channels.
In terms of cross-space learning, the EMA adopts cross-space information aggregation in different spatial dimension directions to achieve richer feature aggregation. Two tensors are introduced in the module: one is the output of the 1 × 1 branch and the other is the output of the 3 × 3 branch. A 2D global average pooling operation encodes the global spatial information in the output of the 1 × 1 branch, and the output of the smaller branch is converted directly into the corresponding dimension shape before the joint activation mechanism of the channel features. The formula for 2D global average pooling is
$$z_c = \frac{1}{H \times W} \sum_{j=1}^{H} \sum_{i=1}^{W} x_c(i, j)$$
This pooling encodes global information and models long-range dependencies. A Softmax function, a natural nonlinear mapping, is then applied to the output of the 2D global average pooling to fit the subsequent linear transformation:
$$\mathrm{Softmax}(z_c) = \frac{e^{z_c}}{\sum_{c=1}^{C} e^{z_c}}$$
To perform this linear transformation, the outputs of the parallel processing described above are combined by matrix dot-product multiplication, producing the first spatial attention map. In order to collect spatial information at different scales within the same processing stage, the global spatial information in the 3 × 3 branch is also encoded by 2D global average pooling, and the channel features in the 1 × 1 branch are transformed into the corresponding dimension shape before the joint activation mechanism is applied. This yields a second spatial attention map, which retains the complete and precise spatial position information. The output feature map within each group is then calculated by fusing the two generated spatial attention weight maps through a Sigmoid function:
$$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$$
The EMA is effective in capturing pixel-level pairwise relationships and it can effectively highlight all pixels in the global context. Moreover, its size is the same as its final output, which is conducive to stacking the EMA module into modern architectures.
The factors that influence the attention mechanism are guided exclusively by the degree of similarity between the global and local feature descriptors within each group. The cross-space information aggregation method is employed for dependency modeling and for embedding precise location information into the EMA module. The integration of contextual information at varying resolutions enables the neural network to produce more refined pixel-level attention for high-level feature maps. In this paper, the EMA module is integrated into the C2f component of the neck; Figure 6 illustrates the modified YOLOv8, and a PyTorch sketch of the EMA module is given below.
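For concreteness, the PyTorch sketch below mirrors the EMA structure described above (channel grouping into the batch dimension, a 1 × 1 branch with directional pooling, a 3 × 3 branch, and cross-spatial aggregation of two attention maps), following the description in [14]; the grouping factor of 8 and the exact normalization layers are assumptions.

```python
import torch
from torch import nn


class EMA(nn.Module):
    """Efficient multi-scale attention, following the structure described above."""

    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        c = channels // groups
        self.softmax = nn.Softmax(dim=-1)
        self.agp = nn.AdaptiveAvgPool2d(1)              # 2D global average pooling
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # 1D pooling along the width direction
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # 1D pooling along the height direction
        self.gn = nn.GroupNorm(c, c)
        self.conv1x1 = nn.Conv2d(c, c, kernel_size=1)
        self.conv3x3 = nn.Conv2d(c, c, kernel_size=3, padding=1)

    def forward(self, x):
        b, c, h, w = x.size()
        g = x.reshape(b * self.groups, -1, h, w)                   # fold channel groups into batch
        # 1x1 branch: encode each spatial direction, then re-weight the group.
        x_h = self.pool_h(g)                                        # (bg, c/g, h, 1)
        x_w = self.pool_w(g).permute(0, 1, 3, 2)                    # (bg, c/g, w, 1)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))
        x_h, x_w = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(g * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())
        # 3x3 branch: capture local multi-scale context.
        x2 = self.conv3x3(g)
        # Cross-spatial learning: two attention maps built from both branches.
        x11 = self.softmax(self.agp(x1).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        x12 = x2.reshape(b * self.groups, c // self.groups, -1)
        x21 = self.softmax(self.agp(x2).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        x22 = x1.reshape(b * self.groups, c // self.groups, -1)
        weights = (torch.matmul(x11, x12) + torch.matmul(x21, x22)).reshape(
            b * self.groups, 1, h, w
        )
        return (g * weights.sigmoid()).reshape(b, c, h, w)
```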

4.3. Improvement of Loss Function

The loss function of YOLOv8 consists of two parts, the classification loss and the bounding box regression loss. The original bounding box regression loss of YOLOv8 combines CIoU (complete intersection over union) [15] and DFL (distribution focal loss) [16]. The CIoU improves on the IoU calculation and measures the overlap between target boxes more accurately, so that the loss function guides the training of the target detection model more precisely; during training, the CIoU better predicts the offset between the prediction box and the ground-truth box, which helps to improve the positioning accuracy of the model. The DFL can effectively deal with sample imbalance, increase the model's attention to difficult samples, and improve its generalization ability [17,18]. However, since our task is mainly aimed at early fire detection, where flames are mostly small targets, these losses have drawbacks: the CIoU calculation is relatively complex and requires additional computational overhead, which may increase the time cost of training and inference, and its aspect-ratio weighting can degrade the quality of regression samples, while the DFL introduces additional hyperparameters that need to be tuned, increasing the complexity of model training. In response to these problems, Jinwang Wang et al. [19] proposed the NWD (normalized Wasserstein distance) loss. The similarity between bounding boxes is calculated from the 2D Gaussian distributions that model them; even when there is little or no overlap, the similarity between the distributions can still be quantified, making this approach particularly well suited to measuring the similarity between small objects.
In the case of smaller objects, background pixels are often present within their bounding boxes, because real-world objects rarely conform to a strictly rectangular shape. Within these bounding boxes, foreground pixels are concentrated at the center and background pixels at the edges. To describe the weights of different pixels within the bounding box more accurately, the bounding box can be modeled as a two-dimensional (2D) Gaussian distribution, in which the center pixel of the bounding box is assigned the highest weight and the importance of pixels decreases gradually from the center to the edges. In particular, for a horizontal bounding box $R = (cx, cy, w, h)$, where $(cx, cy)$, $w$, and $h$ represent the center coordinates, width, and height, respectively, the equation of its inscribed ellipse can be expressed as follows:
$$\frac{(x - \mu_x)^2}{\sigma_x^2} + \frac{(y - \mu_y)^2}{\sigma_y^2} = 1$$
where $(\mu_x, \mu_y)$ is the center of the ellipse and $\sigma_x$ and $\sigma_y$ are the semi-axis lengths along the $x$ and $y$ axes, respectively. Therefore, $\mu_x = cx$, $\mu_y = cy$, $\sigma_x = \frac{w}{2}$, and $\sigma_y = \frac{h}{2}$. The probability density function of a 2D Gaussian distribution is usually written as
$$f(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{\exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{T}\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)}{2\pi|\Sigma|^{\frac{1}{2}}}$$
where $\mathbf{x}$, $\boldsymbol{\mu}$, and $\Sigma$ represent the coordinate $(x, y)$, the mean vector, and the covariance matrix of the Gaussian distribution, respectively. When
$$(\mathbf{x} - \boldsymbol{\mu})^{T}\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu}) = 1$$
the resulting contour is exactly the inscribed ellipse defined above, which therefore serves as a density contour of the two-dimensional Gaussian distribution. Accordingly, the horizontal bounding box $R = (cx, cy, w, h)$ can be modeled as a two-dimensional Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$, where $\boldsymbol{\mu}$ and $\Sigma$ represent the mean vector and the covariance matrix of the Gaussian distribution, respectively:
$$\boldsymbol{\mu} = \begin{bmatrix} cx \\ cy \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \frac{w^2}{4} & 0 \\ 0 & \frac{h^2}{4} \end{bmatrix}$$
In addition, the similarity between bounding boxes A and B can be converted into the distance between two Gaussian distributions. The NWD uses the Wasserstein distance from optimal transport theory to calculate the distance between distributions. For two 2D Gaussian distributions $\mu_1 = \mathcal{N}(m_1, \Sigma_1)$ and $\mu_2 = \mathcal{N}(m_2, \Sigma_2)$, the second-order Wasserstein distance between $\mu_1$ and $\mu_2$ is defined as
$$W_2^2(\mu_1, \mu_2) = \lVert m_1 - m_2 \rVert_2^2 + \mathrm{Tr}\!\left(\Sigma_1 + \Sigma_2 - 2\left(\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2}\right)^{1/2}\right)$$
The above formula can be simplified to
$$W_2^2(\mu_1, \mu_2) = \lVert m_1 - m_2 \rVert_2^2 + \left\lVert \Sigma_1^{1/2} - \Sigma_2^{1/2} \right\rVert_F^2$$
where $\lVert \cdot \rVert_F$ denotes the Frobenius norm. Furthermore, for bounding boxes $A = (cx_a, cy_a, w_a, h_a)$ and $B = (cx_b, cy_b, w_b, h_b)$, modeled as Gaussian distributions $\mathcal{N}_a$ and $\mathcal{N}_b$, the distance can be further simplified to
$$W_2^2(\mathcal{N}_a, \mathcal{N}_b) = \left\lVert \left[cx_a, cy_a, \frac{w_a}{2}, \frac{h_a}{2}\right]^{T} - \left[cx_b, cy_b, \frac{w_b}{2}, \frac{h_b}{2}\right]^{T} \right\rVert_2^2$$
However, $W_2^2(\mathcal{N}_a, \mathcal{N}_b)$ is a distance measure and cannot be used directly as a similarity measure. Accordingly, it is normalized in exponential form to yield a new measure, designated the normalized Wasserstein distance (NWD):
$$NWD(\mathcal{N}_a, \mathcal{N}_b) = \exp\left(-\frac{\sqrt{W_2^2(\mathcal{N}_a, \mathcal{N}_b)}}{C}\right)$$
where C is a constant that is closely related to the data set.
The NWD loss function is specifically designed to measure the similarity between small objects, making it more effective in tasks such as early fire detection. By modeling each bounding box as a 2D Gaussian distribution, it captures the spatial information and distribution characteristics of the object and provides a more accurate similarity measure, even when there is little or no overlap between boxes, which makes it more robust for scenes with small objects and limited overlap. Given these advantages, the NWD is used to replace the original bounding box loss in YOLOv8, further improving detection accuracy. A small sketch of the NWD similarity computation is given below.
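The sketch below implements the simplified Wasserstein form derived above as a similarity measure; the constant C = 12.8 is only a placeholder, since C is dataset-dependent.

```python
import torch


def nwd_similarity(boxes_a: torch.Tensor, boxes_b: torch.Tensor, c: float = 12.8) -> torch.Tensor:
    """NWD similarity between axis-aligned (cx, cy, w, h) boxes.

    Each box is modelled as a 2D Gaussian with mean (cx, cy) and standard
    deviations (w/2, h/2), so the squared 2nd-order Wasserstein distance reduces
    to a squared L2 distance between the vectors [cx, cy, w/2, h/2].
    The constant c is dataset-dependent; 12.8 here is a placeholder value.
    """
    pa = torch.cat([boxes_a[..., :2], boxes_a[..., 2:] / 2.0], dim=-1)
    pb = torch.cat([boxes_b[..., :2], boxes_b[..., 2:] / 2.0], dim=-1)
    w2 = ((pa - pb) ** 2).sum(dim=-1)          # simplified squared Wasserstein distance
    return torch.exp(-torch.sqrt(w2) / c)      # exponential normalization to (0, 1]


# Example: a predicted box against a ground-truth box (pixel coordinates).
pred = torch.tensor([[100.0, 120.0, 18.0, 22.0]])
gt = torch.tensor([[104.0, 118.0, 20.0, 20.0]])
print(nwd_similarity(pred, gt))                # close to 1 for well-matched small boxes
# A bounding box regression loss can then be defined as 1 - NWD.
```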

5. Analysis of Results

5.1. Experimental Environment and Configuration

The experimental environment used in this study was Python 3.10.2 with the Torch 2.0.0 framework and CUDA 11.2; the GPU model was the NVIDIA Tesla P100 16 GB (NVIDIA, CA, USA), with two cards used. The training parameters for the experimental model were set as follows: the initial learning rate was 0.01, the input image size was 640 × 640 pixels, the batch size was 16, the number of epochs was 300, the number of workers was eight, and the number of early stopping rounds was 100.
The dataset used for training was self-collected and manually annotated, making it closely aligned with the actual problem addressed in this paper. The majority of the training set comprises small targets in the early stages of fires; manual annotation allowed invalid images to be removed and made the annotation of fire targets more accurate. The final number of valid images was 4234, and the dataset was divided into training, validation, and test sets at a ratio of 8:1:1. A sketch of the corresponding training configuration is given below.
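For reference, the sketch below reproduces the listed hyperparameters with the ultralytics training API; the dataset configuration file name "flame.yaml" is a hypothetical placeholder describing the 8:1:1 split and the class names.

```python
# Training configuration matching Section 5.1 (sketch; "flame.yaml" is hypothetical).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(
    data="flame.yaml",   # hypothetical dataset file: train/val/test paths and class names
    imgsz=640,           # input image size 640 x 640
    epochs=300,
    batch=16,
    workers=8,
    lr0=0.01,            # initial learning rate
    patience=100,        # early stopping rounds
    device=[0, 1],       # two Tesla P100 GPUs
)
```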

5.2. Evaluation Indicators

This paper investigates methods for enhancing the precision and compactness of early fire detection models. Accordingly, the following indicators were employed as evaluation criteria [20,21,22,23]:
Precision (P): the proportion of samples predicted as positive by the model that are actually positive.
Recall (R): the proportion of all positive samples that are successfully predicted as positive by the model; the higher the recall, the more target boxes the model can detect.
Average precision (AP): an indicator used to measure the performance of the model in information retrieval and binary classification tasks.
Mean average precision (mAP): an indicator used to measure the performance of target detection or object recognition models.
Model size and frames per second (FPS).
The calculation formulas are as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$AP = \int_0^1 P \, dR$$
$$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i$$
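As a simple numerical illustration of these formulas (not the exact interpolation scheme used by the YOLOv8 evaluator), the sketch below computes precision, recall, and AP as the area under a precision-recall curve.

```python
import numpy as np


def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from true positive, false positive, and false negative counts."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r


def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """AP as the area under the precision-recall curve (trapezoidal rule)."""
    order = np.argsort(recalls)
    return float(np.trapz(precisions[order], recalls[order]))


# Example: one class with three operating points; mAP is the mean of per-class APs.
p, r = precision_recall(tp=80, fp=10, fn=20)
ap = average_precision(np.array([0.0, 0.5, 1.0]), np.array([1.0, 0.9, 0.7]))
print(p, r, ap)
```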

5.3. Comparative Experiment

In order to validate the accuracy and lightweight design improvements proposed in this paper, the improvement methods described in Section 4 were compared separately against the original model. All algorithms used the same hardware and were trained on the same dataset to ensure the fairness of the experimental results. The first experiment compared the accuracy of the model under different feature fusion schemes, contrasting the FPN module of the original YOLOv8 model with the BiFPN module. The results are presented in Table 1.
The experimental results show that using the BiFPN module instead of the original FPN module raised the mAP0.5 from 89.5% to 90.4%, an improvement of 0.9 percentage points, while the weight size decreased by 32%, from 6.2 MB to 4.2 MB. The single-frame detection time increased only slightly, by 0.2 ms, which indicates essentially no change in detection speed. Overall, the BiFPN feature fusion module achieves higher detection accuracy with a smaller model size, which is an advantage for edge deployment.
Next, the different attention mechanism modules were compared under the same hardware environment and with the same training set: the CBAM of the original YOLOv8, the SENet module, and the EMA module adopted in this paper. The experimental results are shown in Table 2.
From Table 2, it can be seen that the EMA module used in this paper provides a more noticeable improvement in detection speed than the original attention mechanism, and it is also superior to the CBAM and SENet modules in terms of accuracy. Even when multiple EMA modules are added, the weight size remains essentially consistent with the original weights; overall, its effect is the best.
In terms of improving the loss function, the original YOLOv8n was compared with the YOLOv8 model using the NWD loss. The box_loss function supervises the regression of the detection box: the smaller the value, the smaller the error between the predicted box and the annotated box and the more accurate the prediction. As can be seen from Figure 7, the box_loss curve of the improved model converges faster and to a lower value, outperforming the original YOLOv8n model.
Next, the original YOLOv8 network and the improved YOLOv8 network were trained using the same hardware, the same training set, and identical training parameters, and the mAP0.5 and mAP0.5–0.95 curves of both networks were compared, with the results shown in Figure 8. The mAP0.5 improves by 2.7% compared with YOLOv8n. As illustrated in the figure, the improved YOLOv8 network achieves higher detection accuracy than the original. Since the majority of the flame targets in the training set are small, this also substantiates that the improved network has better detection accuracy for small targets.

5.4. Ablation Experiment

In order to verify the effectiveness of the improvements to the original YOLOv8 model more intuitively, ablation experiments were conducted under the same hardware environment and with the same dataset; the results are shown in Table 3. In the table, "√" indicates that the corresponding improvement module was added to the original model; otherwise, the module was not used.
As illustrated in Table 3, replacing the FPN feature fusion module with the BiFPN module (Improvement 1) reduces the model size by 2 MB, improves the accuracy by 0.9%, and increases the single-frame detection time by only 0.2 ms. The results demonstrate that the BiFPN module is well suited to lightweight neural network design: removing redundant nodes from the fused feature network reduces computation with little impact on detection accuracy, and passing features in both directions maintains high recognition accuracy, making the model more suitable for deployment at the edge.
Improvement 2 incorporates the EMA mechanism into the original YOLOv8 network. Previous attention mechanisms capture cross-channel relationships through channel dimensionality reduction, at the cost of increased computation. The EMA, in contrast, reshapes part of the channels into the batch dimension and groups the channels into multiple sub-features, ensuring a uniform distribution of spatial semantic features across each feature group. The resulting model weight is essentially equivalent to that of the original network, while the EMA mechanism improves both the detection accuracy and the single-frame detection speed. This indicates that the EMA mechanism effectively helps the model focus on the important flame information, improving the performance of the model while reducing redundant calculations and thus improving computational efficiency.
Improvement 3 concerns the modification of the loss function. This paper employs the NWD loss function, which performs better than the CIoU loss: detection accuracy improves by 1.7% compared to the original YOLOv8 network model. The NWD loss is more effective for detecting minute target objects, better captures the object's spatial information and distribution characteristics, and is more robust in scenes with smaller objects.
Improvement 4 combines Improvement 1 and Improvement 2 in the original network. Compared with Improvement 1, it increases the model size by 0.3 MB, reduces the single-frame detection time by 1.5 ms, and raises the accuracy by 0.4%. Compared to the original YOLOv8 network, Improvement 4 raises the accuracy by 1.3%, and compared to Improvement 2 it raises the accuracy by 0.6%. These results indicate that combining the BiFPN with the EMA mechanism is more effective for the accuracy of small target objects, better captures the spatial information and distribution characteristics of objects, and is more robust for the detection of smaller objects, and that its benefit to accuracy outweighs the small changes in single-image detection time and model size. Improvement 5 combines Improvement 2 and Improvement 3 in the original YOLOv8 network, improving the accuracy by 1.8% over the original YOLOv8n and the single-image detection time by 1.1 ms, at the cost of a model size increase of 0.8 MB. Improvement 6 adds both Improvement 1 and Improvement 3 to the network, resulting in an accuracy improvement of 1.7%, a reduction in model size of 1.8 MB, and a slight decrease in detection speed.
Improvement 7 is the improved network used in this paper. It adopts the BiFPN architecture and adds the EMA mechanism and the NWD loss function; it removes part of the redundant computation, adds only a small number of parameters, and benefits from the bidirectional fusion of different features, which increases the effective information and makes the model features richer. In comparison to the initial YOLOv8 network, it improves the accuracy by 2.7% while reducing the model size by 22.5%. The model maintains and enhances target detection accuracy in a more lightweight configuration, one that is more readily deployable on embedded devices at the edge.
To further evaluate the detection performance of the proposed improved model, a comparative analysis was conducted against current mainstream target detection models, namely YOLOv3-tiny [24], YOLOv4-tiny [25], YOLOv5s, YOLOv7-tiny [26], and YOLOv8n, alongside the improved YOLOv8 detection model proposed in this paper. All models were evaluated under identical conditions to ensure a fair and objective comparison. The experimental data are shown in Table 4.
The results in the table show that the improved YOLOv8 algorithm in this paper is superior to the original YOLOv8n model in terms of detection accuracy, detection speed, and model size. Compared with the current mainstream lightweight models YOLOv3-tiny, YOLOv4-tiny, YOLOv5s, and YOLOv7-tiny, the model size in this paper is reduced by 41.4%, 54.7%, 15%, and 29.4%, respectively, and the mAP0.5 is improved by 21.7%, 19.9%, 7.9%, and 6.6%, respectively. Although the FPS of the model in this paper is slightly lower than those of the other four, the mAP0.5 still shows a significant improvement while the parameter count is greatly reduced, which shows that a large number of parameters does not guarantee high accuracy in a network model. The improved YOLOv8 algorithm in this paper not only realizes model lightweighting, making it more convenient to deploy on edge embedded devices, but also effectively improves the detection accuracy of the model, and can therefore meet the requirements of lightness, real-time operation, and accuracy for early small flame target detection.

5.5. Result Analysis

Figure 9 depicts the detection performance of the improved YOLOv8 network, with the original YOLOv8n network serving as a point of comparison. The first and second columns of the chart illustrate the shortcomings of the original YOLOv8n, particularly its inability to detect all of the flames in a given image; this deficiency in detecting small flame targets is evident. The third column demonstrates that the original YOLOv8n produces a flame target box that is larger and less accurate than that produced by the improved YOLOv8, indicating that the improved YOLOv8 has a superior regression effect. The fourth column shows that the original YOLOv8n did not detect the flame target in the picture at all, because the model is not sensitive to small flame targets. Overall, the detection comparison indicates that the improved model detects flames more accurately and with higher confidence. The improved YOLOv8 model is more accurate for small flame targets than the original YOLOv8 model, and the accuracy for larger flame targets is also improved. The model demonstrates robust performance, which meets the requirements for the early detection of small flame targets.

6. Conclusions

This paper presents a lightweight and highly accurate early small flame target detection model for deployment on edge embedded devices. The model has been successfully deployed on various edge devices, including RK3588 boards and the NVIDIA Jetson Nano, and has demonstrated excellent flame target detection capability. In terms of algorithm optimization, the original YOLOv8n network is enhanced by replacing the original FPN feature fusion structure with a BiFPN feature fusion structure; edges that contribute minimally to feature fusion are removed, simplifying the feature fusion network and effectively reducing the model size without significantly compromising detection accuracy. The introduction of the EMA mechanism module assigns varying weights to different segments of the source sequence, facilitating the capture of crucial information and enhancing the precision of the model's detection. To address small flame target detection, this paper introduces the NWD loss function, which is more robust for smaller objects and limited-overlap scenarios, improving the original YOLOv8n network's ability to detect small targets and enhancing the model's detection performance.
The experimental results demonstrate that the size of the enhanced network is 4.8 MB, a reduction of 22.5% compared to the original YOLOv8n. This effectively minimizes the model size, facilitating the deployment of the network on embedded devices at the edge. Additionally, the enhanced network exhibits improved detection accuracy for small flame targets, with an mAP0.5 2.7% higher than that of the original YOLOv8n model. Further research will be conducted on the flame target detection algorithm with the objective of further optimization and enhancement; the fusion of smoke and other concomitant fire features will be considered, with the aim of improving the accuracy of the algorithm's judgment of early small flame targets in more complex and diverse scenarios.

Author Contributions

Study conception and design: H.D. and Y.L.; data collection: Q.L. and H.Z.; analysis and interpretation of results: H.D. and Z.G.; draft manuscript preparation: H.D. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Hebei Provincial Key R&D Program Project: Research on Key Technology of Unattended Active Firefighting Robot in Warehouse Space (22375411D).

Data Availability Statement

The data that support the findings of this study are available from the corresponding author, Y. Liu, upon reasonable request.

Acknowledgments

We extend our sincere gratitude to all members for their steadfast support of our work. Furthermore, our thanks go to the editors and reviewers whose diligent efforts have greatly contributed to the improvement of this manuscript.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.

References

  1. Song, M. Analysis of fire safety management issues and countermeasures in logistics warehouses. China Storage Transp. 2023, 128–129. [Google Scholar] [CrossRef]
  2. Zhao, Y.; Wu, S.; Wang, Y.; Chen, H.; Zhang, X.; Zhao, H. Fire Detection Algorithm Based on an Improved Strategy of YOLOv5 and Flame Threshold Segmentation. Comput. Mater. Contin. 2023, 75, 5639–5657. [Google Scholar]
  3. Xie, X.; Chen, K.; Guo, Y.; Tan, B.; Chen, L.; Huang, M. A Flame-Detection Algorithm Using the Improved YOLOv5. Fire 2023, 6, 313. [Google Scholar] [CrossRef]
  4. Li, Z.; Zhu, Y.; Yan, X.; Wu, H.; Li, K. Optimized Mixture Kernels Independent Component Analysis and Echo State Network for Flame Image Recognition. J. Electr. Eng. Technol. 2022, 17, 3553–3564. [Google Scholar]
  5. Pan, X.; Jia, N.; Mu, Y.; Gao, X. Review of small target detection research. Chin. J. Image Graph. 2023, 28, 2587–2615. [Google Scholar]
  6. Hu, Y.; Xia, Y. In-memory computing deployment optimization algorithm based on deep reinforcement learning. Comput. Appl. Res. 2023, 40, 2616–2620. [Google Scholar]
  7. Liu, Z.; Xu, H.; Zhu, X.; Li, C.; Wang, Z.; Cao, Y.; Dai, K. Bi-YOLO: An improved lightweight target detection algorithm based on YOLOv8. Comput. Eng. Sci. 2024, 46, 1444–1454. [Google Scholar]
  8. Zhou, D.; Hu, J.; Zhang, L.; Duan, F. Collaborative correction technology for missing data set labels for target detection. Comput. Eng. Appl. 2024, 60, 267–273. [Google Scholar]
  9. Wang, C.; Yang, S.; Zhou, L.; Hua, B.; Wang, S.; Lyu, J. Research on metal gear end face defect detection method based on adaptive multi-scale feature fusion network. J. Electron. Meas. Instrum. 2023, 37, 153–163. [Google Scholar]
  10. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. arXiv 2016, arXiv:1612.03144. [Google Scholar]
  11. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. arXiv 2019, arXiv:1911.09070. [Google Scholar]
  12. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar]
  13. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. arXiv 2017, arXiv:1709.01507. [Google Scholar]
  14. Ouyang, D.; He, S. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. arXiv 2023, arXiv:2305.13563. [Google Scholar]
  15. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation. arXiv 2020, arXiv:2005.03572. [Google Scholar] [CrossRef]
  16. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. arXiv 2019, arXiv:1911.08287. [Google Scholar] [CrossRef]
  17. Wang, Y.; Morariu, V.L.; Davis, L.S. Learning a Discriminative Filter Bank within a CNN for Fine-grained Recognition. arXiv 2016, arXiv:1611.09932. [Google Scholar]
  18. Shen, Z.; Lin, H.; Xiang’e, S.; Meihua, L. Infrared ship detection based on attention mechanism and multi-scale fusion. Prog. Lasers Optoelectron. 2023, 60, 256–262. [Google Scholar]
  19. Wang, J.; Xu, C.; Yang, W.; Yu, L. A Normalized Gaussian Wasserstein Distance for Tiny Object Detection. arXiv 2022, arXiv:2110.13389. [Google Scholar]
  20. Xu, X.; Gao, C. Improved lightweight infrared vehicle target detection algorithm of YOLOv7-tiny. Comput. Eng. Appl. 2024, 60, 74–83. [Google Scholar]
  21. Du, C.; Wang, X.; Dong, Z.; Wang, Y.; Jiang, Z. Improved YOLOv5s underground garage flame smoke detection method. Comput. Eng. Appl. 2023, 57, 784–794. [Google Scholar]
  22. Zhao, L.; Jiao, L.; Zhai, R.; Li, B.; Xu, M. Lightweight detection algorithm for bottle cap packaging defects based on YOLOv5. Prog. Laser Optoelectron. 2023, 60, 139–148. [Google Scholar]
  23. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. UnitBox: An Advanced Object Detection Network. arXiv 2016, arXiv:1608.01471. [Google Scholar]
  24. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  25. Jiang, Z.; Zhao, L.; Li, S.; Jia, Y. Real-time object detection method based on improved YOLOv4-tiny. arXiv 2020, arXiv:2011.04244. [Google Scholar]
  26. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
Figure 1. YOLOv8 detection network.
Figure 2. Self-mined flame dataset.
Figure 3. FPN network structure.
Figure 4. FPN and BiFPN multi-scale fusion diagram. (a) FPN multi-scale fusion diagram. (b) BiFPN multi-scale fusion diagram.
Figure 5. EMA (efficient multi-scale attention) module.
Figure 6. Improved YOLOv8 network architecture diagram.
Figure 7. Comparison of box loss data.
Figure 8. Average precision comparison.
Figure 9. Detection performance comparison.
Table 1. Comparative experiment of feature fusion architecture.

Feature Fusion Module | mAP0.5/% | Single-Frame Detection Time/ms | Weight Size/MB
FPN module | 89.5 | 14.8 | 6.2
BiFPN module | 90.4 | 15.0 | 4.2
Table 2. Comparison experiment of attention mechanisms.

Attention Mechanism | mAP0.5/% | Single-Frame Detection Time/ms | Weight Size/MB
CBAM | 89.5 | 14.8 | 6.2
SENet | 89.1 | 15.3 | 6.2
EMA | 90.2 | 13.2 | 6.3
Table 3. Ablation experiment.

Method | Feature Fusion (BiFPN) | Attention Mechanism (EMA) | Loss Function (NWD) | mAP0.5/% | Single-Frame Detection Time/ms | Model Size/MB
YOLOv8n |  |  |  | 89.5 | 14.8 | 6.2
Improvement 1 | √ |  |  | 90.4 | 15.0 | 4.2
Improvement 2 |  | √ |  | 90.2 | 13.2 | 6.3
Improvement 3 |  |  | √ | 91.2 | 15.7 | 7.2
Improvement 4 | √ | √ |  | 90.8 | 13.5 | 4.5
Improvement 5 |  | √ | √ | 91.3 | 13.7 | 7.0
Improvement 6 | √ |  | √ | 91.2 | 15.1 | 4.4
Improvement 7 | √ | √ | √ | 92.2 | 14.0 | 4.8
Table 4. Comparison experiment of different models.

Neural Network | mAP0.5/% | FPS/(frame/s) | Model Size/MB
YOLOv3-tiny | 70.5 | 91 | 8.2
YOLOv4-tiny | 72.3 | 107 | 10.6
YOLOv5s | 84.3 | 89 | 5.7
YOLOv7-tiny | 85.6 | 83 | 6.8
YOLOv8n | 89.5 | 67 | 6.2
Ours | 92.2 | 71 | 4.8
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
