Article

Fire Detection and Flame-Centre Localisation Algorithm Based on Combination of Attention-Enhanced Ghost Mode and Mixed Convolution

1 School of Advanced Manufacturing, Nanchang University, Nanchang 330031, China
2 Jiangxi Tellhow Military Industry Group Co., Ltd., Nanchang 330031, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(3), 989; https://doi.org/10.3390/app14030989
Submission received: 12 November 2023 / Revised: 6 January 2024 / Accepted: 19 January 2024 / Published: 24 January 2024
(This article belongs to the Section Applied Thermal Engineering)

Abstract

This paper proposes a YOLO fire detection algorithm based on an attention-enhanced ghost mode, mixed convolutional pyramids, and flame-centre detection (AEGG-FD). Specifically, enhanced ghost bottlenecks are stacked to reduce redundant feature-mapping operations and achieve a lightweight reconfiguration of the backbone, while attention is added to compensate for the accuracy loss. Furthermore, a feature pyramid built with mixed convolution is introduced to accelerate network inference. Finally, local information is extracted by the designed flame-centre detection (FD) module to furnish auxiliary information for effective firefighting. Experimental results on both the benchmark fire dataset and the video dataset show that AEGG-FD performs better than classical YOLO-based models such as YOLOv5, YOLOv7, and YOLOv8. Specifically, the mean accuracy (mAP0.5, reaching 84.7%) and the inference speed (FPS) are improved by 6.5% and 8.4, respectively, and the number of model parameters and the model size are compressed to 72.4% and 44.6% of those of YOLOv5, respectively. Therefore, AEGG-FD achieves an effective balance between model weight, detection speed, and accuracy in firefighting.

1. Introduction

Fires in chemical plants and large-scale workshops may lead to serious property loss, equipment damage, and even casualties [1]. Since 1988, half of the 20 largest industrial accidents in the world have been related to fires, resulting in significant economic losses [2]. These figures indicate that the expansion of fire losses stems from the lack of effective fire detection technologies [3]. With the emergence and development of smart factories, it is important to design an effective fire detection system that provides timely early warnings when flames appear.
Traditional fire detection devices generally consist of physical detectors such as smoke detectors, heat detectors, and ultraviolet flame detectors [4,5]. These detectors cannot extract flame characteristics; they can only generate alarm signals from physical quantities such as temperature and brightness [6]. As a result, the real-time performance and accuracy of these traditional detectors are poor [7,8]. To overcome these limitations, it is essential to further develop fire detection technology.
In recent years, visual fire detection technology has become a research hotspot because of its advantages of low cost and high accuracy. Early fire detection relied on hand-crafted features such as colour, area, and simple texture [9,10]. Horng and Peng [11] established a colour model of flame in the HSI colour space and used differential methods to obtain suspected flame regions, thereby building feature extractors. Chen et al. [12] judged fires by area growth rate and centre-of-mass stability. A real-time fire detection method based on the hidden Markov model (HMM) analysing fire characteristics was proposed by Zhu et al. in [13]. Töreyin et al. [14] used high-frequency features of a flame to describe the flame after analysing its flickering properties by applying a wavelet transform to the boundary of the flame. These researchers improved the accuracy of fire detection by building feature extractors. However, because of the complexity and uncertainty of fire types in practical applications, algorithms that extract only simple features struggle to distinguish between fires and fire-like objects, which leads to low robustness [15].
Detection methods that incorporate deep learning have become mainstream because of their ability to efficiently and automatically learn and extract image features. Researchers have developed two types of deep learning target detection methods, i.e., region-based and regression-based methods [16], which have achieved outstanding results in areas such as disaster prediction and medical image recognition [17,18,19]. To achieve better detection performance, some scholars have proposed region-based convolutional neural network (CNN) approaches in the field of fire detection. Muhammad et al. [20] proposed an efficient CNN fire detection system for video surveillance applications. Pincott et al. [21] employed existing models based on Faster Region-CNN (R-CNN) and the Single Shot MultiBox Detector (SSD) MobileNet to develop an indoor fire and smoke detection system. Li et al. [22] proposed a region-based convolutional neural network (R-CNN) framework based on the Dirichlet Process Gaussian Mixture Model (DPGMM), which is used to detect fire autonomously in complex environments. Casallas et al. [23] proposed a fire warning system combining a 3D-CNN and Bayesian classifiers to estimate the possibility of wildfires. However, it is difficult for CNN-based methods to achieve real-time target detection due to their high computational complexity.
Redmon et al. [24] first proposed the regression-based method called You Only Look Once (YOLO), which takes the whole image as the network input so that the position and category of the prediction box can be obtained simultaneously at the output layer. Although this method realizes real-time target detection, it suffers from slow speed and large errors in the prediction box coordinates during the early stage of a fire. In this regard, Zhao et al. [25] improved YOLOv3 using the EfficientNet method, which facilitates the feature learning of the model and optimizes the detection process for small targets; however, this method ignores the problem of incomplete detection of occluded targets. Qian and Lin [26] fused two independent weakly supervised models, YOLOv5 and EfficientDet, improving accuracy and detection speed, but the huge computational cost limits its practicality. Wei et al. [27] improved the Tiny-YOLO-V3 model and combined it with dense connections to obtain a lightweight YOLO fire detection model with high accuracy, but there is an obvious gap between its real-time performance and actual industrial requirements. Therefore, conventional deep learning methods for flame detection still suffer from low accuracy, slow speed, and large numbers of parameters.
Compared to YOLOv5, the most significant improvements of YOLOv7 [28] are the use of the extended efficient layer aggregation network (E-ELAN) as the feature extraction unit and the addition of auxiliary heads for deep supervision during model training. In the latest YOLOv8 [29] model, a decoupled head is further used, while the sample allocation strategy adopts the Task-Aligned Assigner from Task-Aligned One-Stage Object Detection (TOOD). These models have been successfully applied to many practical image recognition scenarios, such as tea bud detection in complex backgrounds [30] and smart city fire protection [31]. However, neither model explicitly considers recognition speed and model complexity in its design, which reduces compatibility with diverse devices since high-end hardware configurations are required. In fact, detection accuracy, speed, and model weight have equal status in flame recognition applications, and all three metrics need to be effectively balanced to design an appropriate target detection model. More specifically, the core factor that makes fire hazards difficult to control is the rapid and unpredictable spread of flames, which means that a flame recognition model needs to be as fast as possible while guaranteeing detection accuracy so that it can give timely fire warnings. In addition, the complexity of the model directly determines whether it can be deployed on practical flame recognition equipment, but the classical YOLOv7 and YOLOv8 models improve accuracy at the expense of complexity and speed. As is known, compared to the classical YOLOv7 and YOLOv8, the detection accuracy of YOLOv5 is slightly inferior, but it has a clear advantage in detection speed and complexity. More significantly, YOLOv7 and YOLOv8 are built without anchor boxes, so they cannot be used directly for image centre-region localisation, which conflicts with flame detection, where the flame-centre region needs to be located for effective fire control. Therefore, this paper uses YOLOv5, rather than YOLOv7 or YOLOv8, as the base model to design a detection model that balances these three metrics.
A YOLO fire detection algorithm (AEGG-FD) based on an attention-enhanced ghost bottleneck stack, mixed convolutional pyramids, and flame-centre detection is proposed in this paper, which improves the inference speed and accuracy of YOLOv5 fire detection while keeping the model lightweight. The main contributions of our work are listed below: (1) A stacking strategy is used to cascade the Ghost modules, and the backbone network structure is optimized by combining it with the attention mechanism, which enhances the robustness of the network and further reduces the model volume. (2) The neck is improved using mixed convolution (GSConv), and a cross-stage partial network module is formed by combining it with mixed convolution, which increases the inference speed of the model while reducing its complexity. (3) OpenCV techniques are merged into a flame-centre detection (FD) module to achieve real-time localization of the flame-centre region using the pixel thresholds of suspected targets. Therefore, AEGG-FD is able to provide auxiliary information for rescue personnel, since the detection accuracy and efficiency are improved.
The rest of the paper is organized as follows: The YOLOv5 method and channel attention mechanism are briefly introduced in Section 2. AEGG-FD is described in detail in Section 3. Experiments and results are presented in Section 4. Future perspectives and conclusions are presented in Section 5.

2. Related Works

2.1. YOLOv5 Network

YOLOv5 is a single-stage target detection algorithm proposed by Ultralytics [32] that transforms the target bounding box positioning problem into a regression problem in which a single neural network predicts both the bounding box and the category of the target. Jiang et al. [33] utilized MobileNetv3 and a genetic algorithm (GA) to improve YOLOv5 while incorporating augmented reality (AR) technology into the model, and proposed a novel compatible detector for equipment anomaly detection in hydropower plants. Hu et al. [34] optimized the YOLOv5 model by integrating the CBAM attention mechanism and adjusting the loss function, and their model was used for defect detection in citrus epidermis. Gong et al. [35] proposed an improved forest fire detection model based on YOLOv5 from the perspective of multi-scale features, which achieved high detection accuracy and speed. YOLOv5 further improves on the YOLOv4 algorithm to achieve faster inference and higher accuracy. In addition, the small amount of weight data in the YOLOv5 network makes it easy to deploy on devices with limited memory and power budgets.
The network structure of YOLOv5 is divided into four parts: input, backbone, neck, and head. The input stage receives the input image; Mosaic data augmentation, image size processing, and adaptive anchor box calculation are all used in this phase to improve training speed and network accuracy. The backbone is a convolutional neural network for extracting multi-scale image features. The neck has a feature pyramid structure combining the Feature Pyramid Network (FPN) [36] and the Path Aggregation Network (PAN) [37]; shallow graphical features and deep semantic features are combined through this structure to improve the flow of feature information through the network. Meanwhile, the image feature maps are used in the head to generate bounding boxes and predict the class of the object.
The YOLOv5 network model comes in five sizes: YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x [38]. Backbones of different sizes are obtained by adjusting the depth_multiple and width_multiple parameters. YOLOv5s is the simplest of these versions, with depth_multiple and width_multiple of 0.33 and 0.50, respectively. The structure of the YOLOv5s network model is shown in Figure 1.
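As a brief illustration of how these two scaling factors are commonly applied in YOLOv5-style configurations, the sketch below scales a layer's channel count and repeat count. The `make_divisible` rounding rule and the example numbers are assumptions for illustration, not settings taken from this paper.

```python
def make_divisible(x: float, divisor: int = 8) -> int:
    """Round a scaled channel count to a multiple of `divisor` (at least `divisor`)."""
    return max(divisor, int(x + divisor / 2) // divisor * divisor)

def scale_layer(base_channels: int, base_repeats: int,
                width_multiple: float = 0.50, depth_multiple: float = 0.33):
    channels = make_divisible(base_channels * width_multiple)  # width scaling
    repeats = max(round(base_repeats * depth_multiple), 1)     # depth scaling
    return channels, repeats

# Example: a 256-channel block repeated 9 times in the full-size configuration
# becomes 128 channels repeated 3 times with the YOLOv5s multiples.
print(scale_layer(256, 9))  # -> (128, 3)
```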
In YOLOv5, the backbone network consists of the Focus module, convolution, the Cross Stage Partial bottleneck with three convolutions (C3), and Spatial Pyramid Pooling (SPP). The Focus module splits the feature map into four independent feature layers and performs stacking and convolution operations to effectively improve the quality of image feature extraction. The C3 module, which consists of three convolutions and several bottlenecks, is built on a cross-stage partial network (CSPNet) [39]. SPP is a spatial pyramid pooling layer that further increases the receptive field of the feature map. The Conv module used in the network is composed of standard convolution (SC), batch normalization, and Sigmoid Linear Unit (SiLU) activation layers. The parameters of the convolution function are the input channels, output channels, kernel size, and stride.

2.2. Cross-Stage Partial Network with SE Module Addition

The Squeeze-and-Excitation (SE) module is a channel attention model proposed by Hu et al. [40]. Because of its lightweight structure, it does not increase the complexity of the model when added to the network. The SE module consists of two main parts: squeeze (compression) and excitation. It automatically learns the weights of the feature channels and uses this information to suppress low-sensitivity features, thereby improving the expressive power and performance of the neural network. The realization process of the SE module is shown in Figure 2.
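For reference, the squeeze-and-excitation operation described above can be sketched in a few lines of PyTorch; the reduction ratio of 16 is an assumed, commonly used value and is not specified by this paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal Squeeze-and-Excitation sketch: global average pooling (squeeze)
    followed by a two-layer bottleneck MLP (excitation) that rescales channels."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: (B, C, H, W) -> (B, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                              # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)                    # channel descriptor
        w = self.fc(w).view(b, c, 1, 1)                # learned channel weights
        return x * w                                   # excitation: reweight feature maps
```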
The SECSP module, built from a cross-stage partial network with an added SE module, is designed to obtain more information about key features in the complex context of fire detection. The SECSP module and the original C3 module are both designed on the CSP architecture; the difference is that SECSP contains the SE channel attention module. Adding this new structure (SECSP) to the last layer of the backbone network allows the model to be tuned more finely at higher levels of feature representation. The structure of the module is shown in Figure 3.

3. Improved AEGG-FD Structure

In order to further improve the accuracy and speed of the YOLOv5s model for fire detection tasks under the premise of prioritising a lightweight model, AEGG-FD is proposed in this paper. The architecture diagram of AEGG-FD is shown in Figure 4.
In Figure 4, four main optimization aspects are included in the proposed AEGG-FD. In the attention-enhanced ghost mode explained in Section 3.1, the backbone of the model is first rebuilt by stacking Ghost bottlenecks [41]. Then, the designed SECSP attention module is added to the last layer of the backbone, so that both a simpler model structure and improved accuracy can be achieved. Subsequently, the standard convolution (SC) is replaced with the GSConv convolution [42] to reduce the computational complexity, and the mixed convolution (GSConv) is integrated into the bottleneck to design a new cross-stage partial network module (named VoV). GSConv convolution and VoV are used to optimize the Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) structures, which results in a new neck (named GSCneck) introduced in Section 3.2. The GSCneck is employed to improve the model inference speed and detection accuracy. The flame-centre detection (FD) module presented in Section 3.3 is intended for real-time, accurate localization of the flame-centre region. This module completes the orientation of the flame centre by reducing the invalid information in the model prediction box, and real-time centre localization is fulfilled by employing the FD module in the detection phase. Finally, the loss function SCYLLA-IoU (SIoU) [43] is used in place of Complete-IoU (CIoU) [44]. SIoU redefines the penalty metric by considering the angle of the vector between the real and prediction box regressions, which improves the efficiency and convergence speed of AEGG-FD.

3.1. Attention-Enhanced Ghost Mode

When a neural network is sufficiently well trained, the feature maps contain rich or even redundant information that captures the valuable features of the input data [45]. The feature maps after the first standard convolution in YOLOv5 are shown in Figure 5. Reductions in computational complexity and the number of parameters can be achieved by replacing the SC operation with an inexpensive linear operation, so that the redundant information can be handled effectively. The structure of the Ghost module is shown in Figure 6, where three similar feature-map pairs are annotated with boxes of the same colour. One feature map in each pair can be approximately obtained by transforming the other through cheap operations (denoted by wrenches).
Using the SC, an ordinary convolutional layer transforms the input feature $X$ into the output feature $Y$:

$$X \in \mathbb{R}^{c \times h \times w} \rightarrow Y \in \mathbb{R}^{h \times w \times n}$$

where the convolution kernel of this process is $f \in \mathbb{R}^{c \times k \times k \times n}$, $n$ is the number of convolution kernels, and $c$ is the number of channels. The number of FLOPs required by the convolution is $n \cdot h \cdot w \cdot c \cdot k \cdot k$, which is often up to hundreds of thousands since the number of filters $n$ and the number of channels $c$ are commonly very large. During the Ghost transformation, an ordinary convolution is first used to generate $m$ intrinsic feature maps $Y' \in \mathbb{R}^{h \times w \times m}$ ($m \le n$). The $s$ ghost feature maps are obtained subsequently by applying a series of cheap linear operations to each intrinsic feature in $Y'$. To maintain the same number of feature maps as the SC, the Ghost module outputs $n$ feature maps ($n = m \cdot s$). The computation of the Ghost module is approximately $\frac{n}{s} \cdot h \cdot w \cdot c \cdot k \cdot k$ since $s \ll c$. Therefore, the number of operations is significantly reduced in Ghost convolution, i.e., the computational cost of the SC is approximately $s$ times that of the Ghost convolution.
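A minimal PyTorch sketch of a Ghost-style convolution in the spirit of [41] is given below: a standard convolution produces the intrinsic maps and cheap depthwise convolutions generate the ghost maps. The ratio s = 2 and the kernel sizes are illustrative assumptions, not the exact settings of AEGG-FD.

```python
import math
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Ghost convolution sketch: n output maps = m intrinsic maps from a standard
    convolution plus (s - 1) * m "ghost" maps from cheap depthwise convolutions."""
    def __init__(self, in_ch: int, out_ch: int, ratio: int = 2,
                 kernel: int = 1, cheap_kernel: int = 3):
        super().__init__()
        m = math.ceil(out_ch / ratio)                  # intrinsic feature maps
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, m, kernel, 1, kernel // 2, bias=False),
            nn.BatchNorm2d(m),
            nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(                    # cheap linear ops: depthwise conv
            nn.Conv2d(m, m * (ratio - 1), cheap_kernel, 1,
                      cheap_kernel // 2, groups=m, bias=False),
            nn.BatchNorm2d(m * (ratio - 1)),
            nn.ReLU(inplace=True),
        )
        self.out_ch = out_ch

    def forward(self, x):
        y = self.primary(x)                            # m intrinsic maps
        ghosts = self.cheap(y)                         # (ratio - 1) * m ghost maps
        return torch.cat([y, ghosts], dim=1)[:, :self.out_ch]
```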
The Ghost bottleneck (G_bneck) structure is similar to the basic residual block, which is composed of several convolutional layers and a shortcut. The first Ghost module achieves upscaling by increasing the number of channels, while the second Ghost module reduces the number of channels to match the shortcut path. Two types of Ghost bottleneck residual modules are designed for the two different convolution strides, i.e., one and two, respectively. Their structures are shown in Figure 7, in which ReLU is used as the activation function, BN denotes batch normalization, and DWConv denotes depthwise separable convolution with stride = 2.
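Building on the GhostModule sketch above, the stride-1 Ghost bottleneck can be approximated as two Ghost modules wrapped by a residual shortcut; the stride-2 variant would insert a stride-2 depthwise convolution between them. Activation placement is simplified here.

```python
class GhostBottleneck(nn.Module):
    """Stride-1 Ghost bottleneck sketch: expand with the first Ghost module,
    project back with the second, then add the shortcut (assumes the
    GhostModule class and imports from the previous sketch are in scope)."""
    def __init__(self, channels: int, hidden: int):
        super().__init__()
        self.expand = GhostModule(channels, hidden)    # increase channel count
        self.project = GhostModule(hidden, channels)   # match the shortcut width

    def forward(self, x):
        return x + self.project(self.expand(x))
```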
In brief, when using the attention-enhanced ghost mode, the backbone of YOLOv5 is reconstructed based on the Ghost bottleneck and the SE module. The GhostNet arrangement is essentially followed in the main body of the backbone, while the sequence and number of Ghost bottlenecks are adjusted in a targeted manner. In contrast to GhostNet, the Ghost bottleneck layers in the eighth and ninth layers are removed, and a cross-stage partial network with channel attention is added after the final standard convolutional layer. The first layer of the new backbone is a standard convolutional layer that uses 16 filters with kernel size 3 and stride 2, followed by 14 Ghost bottlenecks with increasing channel counts stacked sequentially. Feature maps at different scales (from shallow to deep) are fed individually into the feature pyramid structure at the fifth, ninth, and sixteenth layers. Table 1 lists the overall structure of AEGG-FD.

3.2. Improved Mixed Convolutional Pyramid

In traditional network improvement, adding convolutional layers generally improves model accuracy, but a large model makes it difficult to meet real-time detection requirements. Using the Depthwise Separable Convolution (DSC) module can improve model speed because the DSC operation significantly reduces the number of model parameters and floating-point operations (FLOPs). Therefore, when the detection task has a real-time requirement, the model can use the DSC module in place of the original standard convolution (SC).
However, compared to the SC, the DSC module also has some limitations. During information transfer, the spatial compression (of height and width) and channel expansion of the feature maps cause a partial loss of semantic information. Standard convolution maximally preserves the hidden connections between every pair of channels, whereas the DSC operation severs these connections completely. Therefore, the DSC operation loses part of the information of the input image, which results in the average accuracy of the DSC module being lower than that of the SC module [46].
In order to make the output accuracy of DSC as close to that of standard convolution as possible, this paper builds the mixed convolution (named GSConv) by mixing SC, DSC, and shuffle, which achieves improvements in both accuracy and speed (vs. the SC). When processing the input feature maps for fire detection, a DSC layer is added to the GSConv module for improving the model inference speed. At the same time, GSConv preserves the hidden connections between each channel as much as possible with low time complexity. Both the interaction of channel information and the nonlinear expression of features are enhanced in the shuffling operation, which improves the generalisation of the model. Then, the trade-off between the speed and accuracy can be achieved by GSConv. The process of GSConv is shown in Figure 8.
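The following is a hedged PyTorch sketch of the GSConv idea described above: a dense convolution for half of the output channels, a depthwise convolution for the other half, then a channel shuffle. The kernel sizes, SiLU activations, and shuffle implementation are assumptions based on the general design in [42], not the exact layers used in AEGG-FD.

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """GSConv sketch: a standard convolution to half the output channels,
    a depthwise convolution on that result, concatenation, and a channel
    shuffle that mixes information between the two halves."""
    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3, stride: int = 1):
        super().__init__()
        half = out_ch // 2                              # assumes an even out_ch
        self.sc = nn.Sequential(                        # standard (dense) convolution
            nn.Conv2d(in_ch, half, kernel, stride, kernel // 2, bias=False),
            nn.BatchNorm2d(half), nn.SiLU(),
        )
        self.dsc = nn.Sequential(                       # depthwise (cheap) convolution
            nn.Conv2d(half, half, 5, 1, 2, groups=half, bias=False),
            nn.BatchNorm2d(half), nn.SiLU(),
        )

    def forward(self, x):
        a = self.sc(x)
        b = self.dsc(a)
        y = torch.cat([a, b], dim=1)                    # (B, out_ch, H, W)
        # channel shuffle: interleave the SC and DSC halves
        bsz, ch, h, w = y.shape
        return y.view(bsz, 2, ch // 2, h, w).transpose(1, 2).reshape(bsz, ch, h, w)
```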
On the basis of the GSConv module, inference speed and accuracy are further enhanced by the GS bottleneck and the VoV module. In contrast to the original bottleneck module in the YOLOv5 model, the new bottleneck module (named GS bottleneck) uses the GSConv module in the main branch instead of the SC module. Meanwhile, an SC operation is added to the shortcut connection of the GS bottleneck. To further exploit the effect of GS bottlenecks, this paper designs the cross-stage partial network module (named VoV) by combining GS bottlenecks with the one-shot aggregation method, as sketched below. The only difference between the VoV module and the C3 module is that VoV uses the GS bottleneck. After being processed by the GSConv convolution, the feature information is stable enough to be exported to the head. The computational complexity of the VoV module is reduced by borrowing the idea of CSP. Moreover, GSConv convolution, with its stronger feature extraction and fusion capabilities, is incorporated into the main branch of the VoV module, thereby improving the inference speed of the model. The structures of the GS bottleneck and VoV module are shown in Figure 9. In summary, the convolutional feature pyramid mixed by these two modules is combined with the Path Aggregation Network (PAN) to construct the GSConv neck (named GSCneck).
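As a rough sketch of how the GS bottleneck and the VoV module could be assembled from GSConv (reusing the GSConv class and imports from the sketch above), the split ratio, the 1x1 fusion convolutions, and the plain-convolution shortcut are illustrative assumptions rather than the authors' exact design:

```python
class GSBottleneck(nn.Module):
    """GS bottleneck sketch: GSConv layers on the main branch and a 1x1
    standard convolution on the shortcut, summed at the output."""
    def __init__(self, channels: int):
        super().__init__()
        self.main = nn.Sequential(GSConv(channels, channels), GSConv(channels, channels))
        self.shortcut = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        return self.main(x) + self.shortcut(x)

class VoV(nn.Module):
    """CSP-style aggregation: split the input into two branches, run GS
    bottlenecks on one, concatenate, and fuse with a 1x1 convolution."""
    def __init__(self, in_ch: int, out_ch: int, n: int = 1):
        super().__init__()
        half = out_ch // 2
        self.cv1 = nn.Conv2d(in_ch, half, 1, bias=False)
        self.cv2 = nn.Conv2d(in_ch, half, 1, bias=False)
        self.blocks = nn.Sequential(*[GSBottleneck(half) for _ in range(n)])
        self.fuse = nn.Conv2d(2 * half, out_ch, 1, bias=False)

    def forward(self, x):
        return self.fuse(torch.cat([self.blocks(self.cv1(x)), self.cv2(x)], dim=1))
```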

3.3. Flame-Centre Detection Module

The category probabilities and anchor box coordinates of the targets are obtained after the image is fed into the model for processing, and are output in tensor format. The anchor box coordinates obtained by the original YOLOv5 after numerical normalization are encoded as:

$$(x_{center},\ y_{center},\ width,\ height)$$
where the first two values are the coordinates of the central position of the anchor box and the last two values are its width and height. Even if the finally obtained prediction box is very close to the target bounding box, it only provides the position of the detected fire relative to the background image, not the position of the flame centre.
To address the above problem, the FD module is added after the Non-Maximum Suppression (NMS) operation to further delineate the flame-centre box within the prediction box. The methodology flowchart of the FD module is shown in Figure 10, where the red dashed box indicates the image morphological operations on the prediction box screenshot.
The FD module mainly uses OpenCV techniques to process the image pixels. If an input image with undetected flames is processed directly, the presence of fire-like objects or light sources in the image will inevitably lead to an unreasonable position for the detection centre. Therefore, after the image prediction box is drawn, it is cropped from the original image to reduce the influence of interference sources. The cropped image is then converted to greyscale and Gaussian-filtered to reduce noise (highlighted pixels in other regions). Bright fine areas are then removed by a series of erosion and dilation operations, during which the size of the fire area does not change significantly. The brightest area of the processed image is taken as the flame-centre position, at which the centre point and radius of the centre box are defined. Finally, the centre-box coordinates are mapped back to the original image by coordinate transformation, and the centre box is plotted in the original image. Figure 11 shows the comparison of the output image after each transformation performed by the FD module.
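A simplified OpenCV sketch of the FD pipeline described above is shown below; the Gaussian kernel, the erosion/dilation kernel and iteration counts, and the fixed circle radius are illustrative assumptions, since the exact parameter values are not reported here.

```python
import cv2
import numpy as np

def locate_flame_centre(image: np.ndarray, box: tuple, radius: int = 15):
    """Sketch of the FD pipeline: crop the prediction box, suppress noise,
    erode/dilate away small bright spots, then take the brightest point
    as the flame centre and map it back to full-image coordinates.
    `radius` (the drawn centre-box size) is an illustrative fixed value."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]

    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)            # greyscale
    blurred = cv2.GaussianBlur(gray, (11, 11), 0)            # reduce highlighted noise

    kernel = np.ones((5, 5), np.uint8)
    cleaned = cv2.erode(blurred, kernel, iterations=2)       # remove bright fine areas
    cleaned = cv2.dilate(cleaned, kernel, iterations=2)      # restore the main fire area

    _, _, _, max_loc = cv2.minMaxLoc(cleaned)                # brightest pixel in the crop
    cx, cy = x1 + max_loc[0], y1 + max_loc[1]                # back to original coordinates

    cv2.circle(image, (cx, cy), radius, (0, 255, 255), 2)    # draw the centre box
    return cx, cy
```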
It can be seen that the FD module allows the centre of the fire to be located more accurately by reducing the influence of useless information in the prediction box, even in the presence of fire interference sources or large-area fires.

3.4. SIoU Loss

In the target detection task, the intersection over union (IoU) loss function measures the degree of overlap between the prediction and real boxes. Its proportional form avoids interference from bounding box sizes, which allows the model to efficiently learn objects of different sizes when IoU is used as the boundary regression loss. The Complete-IoU (CIoU) loss function used in YOLOv5 adds consideration of the aspect ratio to Distance-IoU (DIoU); however, both ignore the mismatch in orientation between the real and prediction boxes, which leads to slower convergence and less efficient models. Based on CIoU, new penalty metrics are defined by introducing the vector angle, producing SCYLLA-IoU (SIoU). This accelerates the regression of the prediction box towards the closest axis of the real box. In brief, the total degrees of freedom of the loss are effectively reduced by adding the angle penalty cost, which makes the model converge faster and more accurately.
The SIoU loss function consists of four components: angle loss, distance loss, shape loss, and IoU loss. The calculation formulas are as follows:
$$\Lambda = 1 - 2\sin^2\!\left(\arcsin(x) - \frac{\pi}{4}\right),$$
$$\Delta = \sum_{t=x,y}\left(1 - e^{-\gamma\rho_t}\right), \quad \gamma = 2 - \Lambda,$$
$$\Omega = \sum_{t=w,h}\left(1 - e^{-\omega_t}\right)^{\theta},$$
$$IoU = \frac{\left|B \cap B^{GT}\right|}{\left|B \cup B^{GT}\right|},$$
$$Loss_{SIoU} = 1 - IoU + \frac{\Delta + \Omega}{2},$$
where $\Lambda$ is the angle loss, which drives the prediction box to reach the horizontal or vertical direction of the real box faster, $\Delta$ is the distance loss, and $\Omega$ is the shape loss. The calculation process of the angle loss is shown in Figure 12, where A is the prediction box, B is the real box, $C_w$ and $C_h$ are the horizontal and vertical distances between the prediction box and the real box, and $\sigma$ is the distance between the centre points of these two boxes.
Consequently, the training and inference capabilities of the detection model are improved using the SIoU loss function. Meanwhile, the NMS operation becomes more rational and effective due to the usage of SIoU.
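For reference, a single-box sketch of the SIoU computation following the angle, distance, shape, and IoU terms above is given below, with boxes in (cx, cy, w, h) format; the shape-cost exponent θ = 4 and the numerical safeguards are assumptions, and this is not the authors' training code.

```python
import math

def siou_loss(pred, target, theta: float = 4.0, eps: float = 1e-7):
    """Sketch of the SIoU loss for one pair of boxes given as (cx, cy, w, h)."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = target

    # IoU term
    ix = max(0.0, min(px + pw / 2, gx + gw / 2) - max(px - pw / 2, gx - gw / 2))
    iy = max(0.0, min(py + ph / 2, gy + gh / 2) - max(py - ph / 2, gy - gh / 2))
    inter = ix * iy
    union = pw * ph + gw * gh - inter + eps
    iou = inter / union

    # angle cost: Lambda = 1 - 2 * sin^2(arcsin(c_h / sigma) - pi/4)
    c_w, c_h = abs(gx - px), abs(gy - py)
    sigma = math.hypot(c_w, c_h) + eps
    lam = 1 - 2 * math.sin(math.asin(min(c_h / sigma, 1.0)) - math.pi / 4) ** 2

    # distance cost, normalised by the enclosing box and modulated by the angle
    enc_w = max(px + pw / 2, gx + gw / 2) - min(px - pw / 2, gx - gw / 2) + eps
    enc_h = max(py + ph / 2, gy + gh / 2) - min(py - ph / 2, gy - gh / 2) + eps
    gamma = 2 - lam
    delta = sum(1 - math.exp(-gamma * rho)
                for rho in (((gx - px) / enc_w) ** 2, ((gy - py) / enc_h) ** 2))

    # shape cost
    omega = sum((1 - math.exp(-w)) ** theta
                for w in (abs(pw - gw) / max(pw, gw), abs(ph - gh) / max(ph, gh)))

    return 1 - iou + (delta + omega) / 2
```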

4. Experiment

4.1. Experimental Setup

4.1.1. Experimental Condition

The network model is implemented in Python using the PyTorch deep learning framework; the configuration of the experimental environment is listed in Table 2. The number of training epochs of the benchmark model was adjusted during the experiments according to the dataset used in this paper. With the original 300 epochs, overfitting occurred during training, leading to a drop in model performance. After a series of experiments, 150 epochs were determined to be optimal. The training parameters of the improved fire detection model are shown in Table 3.

4.1.2. Dataset

The application scenes of fire detection generally include indoor and outdoor fires, forest fires, and night-time fires; classical fire pictures of these types of scenes were collected from existing public fire datasets. To further validate the robustness of the proposed AEGG-FD, images of other classical scenes containing fire-like targets (such as bright lights, red objects, the sun, etc.) were also integrated into the dataset. These images resemble fire targets in colour, texture, and other features, which makes fire detection algorithms more prone to misdetection and omission [47]. Therefore, the dataset constructed in this paper was enriched by adding these fire-like samples, which enables the detection algorithm to maintain high performance in complex and diverse scenes.
Our dataset has a total of 7149 images, of which 4248 are fire images and 2901 are fire-like images. After the images were numbered, the Labelme tool was used for manual labelling. To ensure a uniform distribution, the whole dataset was randomly divided into training and test sets at a ratio of 8:2. Some representative images from the dataset are shown in Figure 13.

4.2. Evaluation Metrics

In this paper, the Microsoft COCO standard was used to evaluate the model. Specifically, mean average precision (mAP0.5) was used as the metric to assess the predictive accuracy of the model. Giga floating-point operations (GFLOPs) and the number of parameters were chosen as metrics to assess the complexity of the model. The number of frames detected per second (FPS) was used as the speed metric for evaluating the model. The formulae for precision and recall are as follows:
$$Precision = \frac{TP}{TP + FP},$$
$$Recall = \frac{TP}{TP + FN},$$
where TP denotes the number of fire samples correctly labelled as fire, FP denotes the number of non-fire samples incorrectly labelled as fire, and FN denotes the number of fire samples incorrectly labelled as non-fire, i.e., the number of objects missed in detection. The area under the precision-recall (P-R) curve is defined as the average precision (AP), and the mean AP (mAP) was obtained by averaging the AP values over all categories. Furthermore, mAP0.5 is the average precision when the IoU threshold is 0.5, while mAP0.95 is the average precision as the IoU threshold is increased from 0.5 to 0.95 in steps of 0.05. The performance evaluation metrics AP and mAP were calculated as follows, where $P(r)$ represents the precision at recall $r$, and $n$ denotes the number of detection categories. The larger the mAP value, the better the model performance (in this article there is only one detection category).
$$AP = \int_0^1 P(r)\,dr,$$
$$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i,$$
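As a small illustration of the AP formula above, the area under the P-R curve can be approximated by numerically integrating sampled points; the end-point padding convention and the toy numbers below are assumptions, not the COCO-style interpolation actually used by the evaluation toolkit.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Approximate the area under the P-R curve by trapezoidal integration,
    assuming `recall` is sorted in ascending order."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    return float(np.trapz(p, r))

# Toy P-R samples (illustrative only, not from the paper's experiments)
print(average_precision(np.array([0.2, 0.5, 0.8]),
                        np.array([0.9, 0.8, 0.6])))  # ~0.72
```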
GFLOPs were used to measure the complexity of the model: the fewer GFLOPs the model requires, the lower the hardware performance needed to run it. Parameters indicates the total number of parameters used by the model, in millions (M). In the GFLOPs formula, $C_i$ and $C_0$ denote the numbers of input and output channels, respectively, $K$ represents the kernel size, and $H$ and $W$ are the dimensions of the feature map. The specific Equation (12) is as follows:
$$GFLOPs = \left(2\,C_i K^2 - 1\right) H\, W\, C_0$$
Time represents the time required to process each frame in the target detection process, which comprises three parts: image preprocessing, inference, and non-maximum suppression. FPS indicates the number of images that can be processed per second, which is the inverse of this time. The larger the FPS value, the faster the model detection. Formula (13) for calculating FPS is as follows:
$$FPS = \frac{1}{Time}$$
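The two formulas above can be applied directly; the sketch below evaluates Equation (12) for a single convolution layer and Equation (13) from per-frame timing components, using illustrative numbers that are not measurements from this paper.

```python
def conv_gflops(c_in: int, c_out: int, k: int, h: int, w: int) -> float:
    """GFLOPs of one standard convolution layer, following Equation (12)."""
    return (2 * c_in * k * k - 1) * h * w * c_out / 1e9

def fps(preprocess_ms: float, inference_ms: float, nms_ms: float) -> float:
    """FPS as the inverse of the total per-frame time, following Equation (13)."""
    return 1000.0 / (preprocess_ms + inference_ms + nms_ms)

# Illustrative numbers only:
print(conv_gflops(c_in=64, c_out=128, k=3, h=80, w=80))  # ~0.94 GFLOPs
print(fps(0.5, 14.0, 2.1))                                # ~60 FPS
```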

4.3. Comparison with the Latest YOLO Models

To justify the use of YOLOv5 as the baseline model for the fire detection problem in this paper, we compared YOLOv5 with the latest YOLO models (i.e., YOLOv7 and YOLOv8) to further validate its performance. In this group of experiments, the YOLOv5s version was used for YOLOv5, while the classical versions of YOLOv7 and YOLOv8 were employed. The comparison results are shown in Table 4.
In Table 4, compared to YOLOv7, the precision of the classical YOLOv5 is slightly worse, but its speed and recall have a significant advantage. Compared to YOLOv8, the speed of the classical YOLOv5 is somewhat lower, but it has significant advantages in precision, recall, and complexity. Moreover, YOLOv5 uses an anchor-based method to build the network, which can effectively improve the accuracy and stability of flame-centre localisation, whereas YOLOv7 and YOLOv8 cannot. Therefore, YOLOv5 is indeed more suitable for flame recognition scenarios. Furthermore, AEGG-FD beats all compared methods, i.e., the classical YOLOv5, YOLOv7, and YOLOv8, on all metrics, which indicates that the proposed framework provides excellent performance for flame recognition.

4.4. Module Validity Experiments

Before conducting the experiments, note that the choice of optimizer and loss function has a significant impact on training. Different attention mechanisms also affect the model differently, which requires further experimentation to ensure that the chosen attention module benefits model training. A suitable optimizer and loss function can effectively improve the performance and training speed of the model, while an attention module matched to the model can improve the feature extraction ability of the network.
Therefore, to verify the effectiveness of these key components selected in this paper, three sets of experiments were conducted to test the effectiveness of SGD, SECSP, and SIoU, respectively. The experiments uniformly use the YOLOv5s version as the baseline model, abbreviated to YOLOv5 in the subsequent tables.

4.4.1. Effects of the SGD

Common optimizers include Stochastic Gradient Descent (SGD), Adaptive Moment Estimation (Adam), and Adam with Weight Decay Fix (AdamW). Their role is to update the network parameters based on the gradient information from back-propagation so as to reduce the value of the loss function. However, unsuitable optimizers may lead to problems such as local optima or exploding gradients during training.
Therefore, it is necessary to analyse the sensitivity of the optimizer and thus assess its applicability in the network model. The performance of the same model under different optimizers is judged by the mean average precision. In this paper, the applicability of the above optimizers was compared over 150 epochs of training, and Figure 14 shows the experimental results.
In this study, SGD, Adam1, and AdamW were run with their respective default learning rates, and Adam2 was run with a learning rate of 0.01. The importance of an appropriate learning rate can be seen in Figure 14: the training process of Adam2 oscillates and diverges because of its inappropriate learning rate, and its convergence results are far worse than those of the SGD and Adam1 optimizers with suitable learning rates. In addition, the SGD optimizer shows faster convergence and higher accuracy in the experiments, making it better for optimizing model training. Consequently, the SGD algorithm was selected for the subsequent research in this paper.

4.4.2. Effects of the SECSP

In this section, to verify the effectiveness of the SECSP module designed for the refactored backbone network, the SECSP module was compared with the Coordinate Attention (CA) module and the Convolutional Block Attention Module (CBAM). CA and CBAM with the same CSP structure were inserted into the model, and the results are shown in Table 5, where the optimal values of the precision metrics are bolded.
From Table 5, it can be seen that when SECSP is added to the new backbone (the backbone with Ghost), mAP0.5 and mAP0.95 increase by 1.3% and 2.2%, respectively. This fully demonstrates that including the SECSP module in the Ghost-refactored backbone network is effective and enhances the feature extraction capability of the model.

4.4.3. Effects of the SIoU

For the effectiveness analysis of the SIoU loss function, this section compares the performance of SIoU with CIoU, Generalized-IoU (GIoU), and Distance-IoU (DIoU) on the same benchmark model. The results are shown in Table 6, where the optimal values of the precision metrics are bolded. The table shows that SIoU improves mAP0.5 by 2.1% and mAP0.95 by 0.8% compared to the baseline model using CIoU. Meanwhile, SIoU outperforms GIoU and DIoU in every performance metric listed in Table 6. Therefore, SIoU was selected as the loss function of the model in this paper.

4.5. Ablation Experiments

To verify the effectiveness of each improved module and their respective effects on the model, the first group of experiments was conducted by adding different improved modules under the same test set. The second group of ablation experiments was performed on the basis of the benchmark model incorporating the Ghost module. The baseline model version is YOLOv5s, and the experimental results are shown in Table 7 (the best value for each indicator is bolded). To help the reader interpret the ablation results more clearly, the visualized results of the two groups of experiments, plotted for the different performance metrics, are given in Figure 15, Figure 16, Figure 17, Figure 18 and Figure 19.
In the first group of experiments, four improved modules were sequentially added to the benchmark model, resulting in Models 2 to 5. Specifically, the mAP0.5 and mAP0.95 of the benchmark model (YOLOv5s) are 78.2% and 47.7% respectively, while its GFLOPs and FPS are 16.6 and 51.8 respectively. From the experimental results in Table 7, all four modules used in this paper are effective, which indicates the validity of these four modules in this model. The detailed discussion is listed as follows.
Firstly, the improvement of the loss function SIoU increases the model’s consideration of the vector angles required for the expected regression, improving the model’s inference and accuracy. Compared to the benchmark Model 1, Model 2 improves mAP0.5 by 2.1% and FPS by 1.1, respectively.
Secondly, the SECSP attention module increases the sensitivity of the model to the channel features and effectively improves the feature extraction capability of the model, which increases the mAP0.5 and the FPS of Model 3 by 1.2% and 3.4, respectively.
Thirdly, the GSCneck is integrated into Model 4 to reduce the computational complexity, which accomplishes the task of improving the model inference speed and detection accuracy. Then, the mAP0.5 is increased by 4.4% and the FPS is improved to 58.1.
Finally, although the Ghost module is not as effective as the GSCneck module in improving accuracy and speed, adding Ghost in Model 5 dramatically reduces the parameters and computational cost compared with the previously discussed models, achieving a lightweight model. With Ghost's reconstructed backbone, the model's GFLOPs are reduced by 9.4 and its parameters by 2.81 M.
In the second group of experiments, all the models based on Ghost modules (Models 6 to 10) show better results compared to Model 1, which indicates that the combination of these modules with Ghost is effective. Moreover, the computational costs and parameters of all Ghost-based models are substantially decreased when compared to the first group of experiments. The detailed comparison results between the benchmark Model 5 and other models such as Models 6–10 are listed as follows.
Firstly, the combination of the two modules SECSP and GSCneck with Ghost (Models 6 and 8) respectively, both achieve further growth in model speed and accuracy. However, the model with the addition of SECSP requires additional processing time for the extraction of the channel features, which causes a slight decrease in the speed of the model within the combination of SECSP and Ghost (Model 7). The results show that the FPS of Model 7 is reduced by 0.3, but the mAP0.5 and mAP0.95 of Model 7 are increased by 1.3% and 2.2%, respectively.
Secondly, the results of the second group show that SECSP and SIoU improve the performance of the model to a lesser extent (SECSP more than SIoU), while GSCneck yields the most significant improvement. Furthermore, the model combining Ghost with both GSCneck and SECSP (the two modules whose effects on model improvement are significantly better than those of the other combinations) was compared with Ghost plus a single module. The results show that Model 9 is simplified to 5.24 M parameters and 6.2 GFLOPs, while mAP0.5 increases to 83.9% and FPS improves to 58.1.
Finally, Model 10 is the model proposed in this paper, which gives the best results when compared with the benchmark Model 1 and with the other module combinations (Models 5 to 10).
In summary, the model in this paper achieves mAP0.5 of 84.7% and FPS of 60.2, while having only 5.24 M parameters and 6.2 GFLOPs. Compared to the baseline model, the mAP0.5 is improved by 6.5% and the FPS is improved by 8.4. Regarding the model complexity assessment metrics, AEGG-FD compresses the number of model parameters and GFLOPs to 72.4% and 37.3% of YOLOv5s, respectively, which significantly reduces the complexity of the model on both temporal and spatial scales, thus achieving a lightweight model.

4.6. Visual Comparison in Images

This section selects images of classic fire detection scenes from the dataset for experiments. These classical fire images include mountain fires, night fires, and fires with occlusion, and are used to compare the detection performance of our proposed model with that of the original model. The test results are shown below, where the left side is the original model and the right side is our model.
Comparing the detection results in Figure 20, it can be seen that the original model misses fires in different scenes because targets are faint or obstructed by interfering objects, while the model in this paper detects flame targets better in complex scenes.
To verify the detection effect of the FD module, the model with FD turned on was compared with the model with FD turned off, where the greenish-yellow round box is drawn by the FD module. Viewing the test results in Figure 21, after adding the FD module the model can exclude irrelevant information from the prediction box and locate the centre region of the fire more precisely. This proves the effectiveness of the FD module designed in this paper.

5. Conclusions

In this paper, a new fire detection model, AEGG-FD, is proposed to balance speed, accuracy, and complexity in dealing with fire detection tasks. The method proposed in this paper consists of four main parts: reconstruction of the backbone, improvement of the neck, reformulation of the loss function, and design of the flame-centre detection function.
Specifically, the reconstruction of the backbone means that the backbone network for extracting features is rebuilt using the Ghost module, which makes the model more lightweight, and the SECSP attention mechanism, which emphasizes channel information, is added to enhance the ability to extract fire feature information. The second group of experiments in Section 4.5 shows that the reconstructed backbone yields a more compact, lower-complexity model structure. The improvement of the neck means improving the feature pyramid at the neck by replacing standard convolution (SC) with the mixed convolution GSConv, while integrating a cross-stage partial network module that incorporates the GS bottleneck to form the GSCneck. Model 4 and Models 8 to 10 in Table 7 demonstrate that GSCneck improves the model detection speed while maintaining accuracy. The loss function SIoU adds consideration of the vector angle between the real and prediction boxes; the experimental results show that using SIoU improves the inference ability and accuracy of the model. Finally, the flame-centre detection (FD) module, which extracts information about the flame-centre region, is designed to achieve real-time localization of the fire region; the comparative experiments in Section 4.6 show that the FD module can further improve the detection accuracy and efficiency of the model.
In summary, our model achieves a superior balance between speed, accuracy, and complexity, which outperforms the original network in fire detection tasks. From the results of ablation experiments, it can be seen that the model proposed in this paper can achieve 84.7% mAP0.5 and 60.2 FPS, while the model parameters are only 5.24 M.
Currently, much research focuses on developing multimodal fire detection for other complicated problems, such as fusing smoke features or combining sensors. Therefore, how to effectively extend the proposed AEGG-FD to solve these problems is an important direction for future work.

Author Contributions

Conceptualization, J.L. and J.Y.; methodology, J.L. and J.Y.; software, J.Y.; validation, J.Y.; formal analysis, J.Y.; investigation, J.Y. and Z.Y.; resources, Z.Y.; data curation, J.Y.; writing—original draft preparation, J.L., J.Y. and Z.Y.; writing—review and editing, J.L., J.Y. and Z.Y.; visualization, J.Y.; supervision, J.L. and Z.Y.; project administration, J.L.; funding acquisition, J.L. and Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Development of Multi-Source Micro-grid: Intelligent Control, Efficient Thermal Management, Noise Reduction, and Infrared Stealth Technology (grant number 20223AAE02012); the Key Technology Research on High-Power Hydrogen Fuel Cell Metal Ultra-Thin Bipolar Plates for Multi-Source Energy Equipment (grant number 20232BCJ22058); the Young Talent Cultivation Innovation Fund Project of Nanchang University (grant number 9167-28740080); and the Topology optimization design of multi-scale composite porous metamaterials (grant number BSKYCXZX 2023-07).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

Author Zan Yang was employed by the company Jiangxi Tellhow Military Industry Group Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The authors declare that this study received funding from Jiangxi Tellhow Military Industry Group Co., Ltd. The funder had the following involvement with the study: the Development of Multi-Source Micro-grid: Intelligent Control, Efficient Thermal Management, Noise Reduction, and Infrared Stealth Technology (grant number 20223AAE02012); the Key Technology Research on High-Power Hydrogen Fuel Cell Metal Ultra-Thin Bipolar Plates for Multi-Source Energy Equipment (grant number 20232BCJ22058); the Topology optimization design of multi-scale composite porous metamaterials (grant number BSKYCXZX 2023-07).

References

  1. Wu, H.; Wu, D.; Zhao, J. An Intelligent Fire Detection Approach through Cameras Based on Computer Vision Methods. Process Saf. Environ. Prot. 2019, 127, 245–256. [Google Scholar] [CrossRef]
  2. Nolan, D.P. Handbook of Fire and Explosion Protection Engineering Principles: For Oil, Gas, Chemical and Related Facilities; Saudi Aramco: Dhahran, Saudi Arabia, 2014. [Google Scholar]
  3. Gaur, A.; Singh, A.; Kumar, A.; Kumar, A.; Kapoor, K. Video Flame and Smoke Based Fire Detection Algorithms: A Literature Review. Fire Technol. 2020, 56, 1943–1980. [Google Scholar] [CrossRef]
  4. Kizilkaya, B.; Ever, E.; Yatbaz, H.Y.; Yazici, A. An Effective Forest Fire Detection Framework Using Heterogeneous Wireless Multimedia Sensor Networks. ACM Trans. Multimed. Comput. Commun. Appl. 2022, 18, 1–21. [Google Scholar] [CrossRef]
  5. Wu, X.; Zhang, X.; Jiang, Y.; Huang, X.; Huang, G.G.Q.; Usmani, A. An Intelligent Tunnel Firefighting System and Small-Scale Demonstration. Tunn. Undergr. Sp. Technol. 2022, 120, 104301. [Google Scholar] [CrossRef]
  6. Nguyen, M.D.; Vu, H.N.; Pham, D.C.; Choi, B.; Ro, S. Multistage Real-Time Fire Detection Using Convolutional Neural Networks and Long Short-Term Memory Networks. IEEE Access 2021, 9, 146667–146679. [Google Scholar] [CrossRef]
  7. Saponara, S.; Elhanashi, A.; Gagliardi, A. Real-Time Video Fire/Smoke Detection Based on CNN in Antifire Surveillance Systems. J. Real-Time Image Process. 2021, 18, 889–900. [Google Scholar] [CrossRef]
  8. Xue, Q.; Lin, H.; Wang, F. FCDM: An Improved Forest Fire Classification and Detection Model Based on YOLOv5. Forests 2022, 13, 2129. [Google Scholar] [CrossRef]
  9. Mao, W.; Wang, W.; Dou, Z.; Li, Y. Fire Recognition Based on Multi-Channel Convolutional Neural Network. Fire Technol. 2018, 54, 531–554. [Google Scholar] [CrossRef]
  10. Peng, Y.; Wang, Y. Real-Time Forest Smoke Detection Using Hand-Designed Features and Deep Learning. Comput. Electron. Agric. 2019, 167, 105029. [Google Scholar] [CrossRef]
  11. Horng, W.B.; Peng, J.W. Image-Based Fire Detection Using Neural Networks. In Proceedings of the 9th Joint International Conference on Information Sciences (JCIS-06), Kaohsiung, Taiwan, 8–11 October 2006. [Google Scholar] [CrossRef]
  12. Chen, T.H.; Wu, P.H.; Chiou, Y.C. An Early Fire-Detection Method Based on Image Processing. In Proceedings of the 2004 International Conference on Image Processing, 2004, ICIP ’04, Singapore, 24–27 October 2004; Volume 3, pp. 1707–1710. [Google Scholar] [CrossRef]
  13. Teng, Z.; Kim, J.H.; Kang, D.J. Fire Detection Based on Hidden Markov Models. Int. J. Control Autom. Syst. 2010, 8, 822–830. [Google Scholar] [CrossRef]
  14. Töreyin, B.U.; Dedeoǧlu, Y.; Güdükbay, U.; Çetin, A.E. Computer Vision Based Method for Real-Time Fire and Flame Detection. Pattern Recognit. Lett. 2006, 27, 49–58. [Google Scholar] [CrossRef]
  15. Li, Y.; Zhang, W.; Liu, Y.; Jin, Y. A Visualized Fire Detection Method Based on Convolutional Neural Network beyond Anchor. Appl. Intell. 2022, 52, 13280–13295. [Google Scholar] [CrossRef]
  16. Wu, X.; Sahoo, D.; Hoi, S.C.H. Recent Advances in Deep Learning for Object Detection. Neurocomputing 2020, 396, 39–64. [Google Scholar] [CrossRef]
  17. Anbarasan, M.; Muthu, B.A.; Sivaparthipan, C.B.; Sundarasekar, R.; Kadry, S.; Krishnamoorthy, S.; Samuel, D.J.; Dasel, A.A. Detection of Flood Disaster System Based on IoT, Big Data and Convolutional Deep Neural Network. Comput. Commun. 2020, 150, 150–157. [Google Scholar] [CrossRef]
Figure 1. Structure of YOLOv5s model version 6.2, where different categories of heads are distinguished by three colours and different superscript numbers.
Figure 2. Realization of the SE module.
Figure 3. The structure of SECSP and SE_Bottleneck.
Figure 4. The structure of AEGG-FD.
Figure 5. Visualization of some feature maps generated by the first convolution in YOLOv5.
Figure 6. An illustration of the convolutional layer and the proposed Ghost module for outputting the same number of feature maps.
Figure 7. Two types of Ghost bottleneck.
Figure 8. The structure of the GSConv module.
Figure 9. GS bottleneck and VoV.
Figure 10. Model flowchart.
Figure 11. Effects of different image manipulations on the original image, where the number on the detection frame represents the confidence and the green round detection frame is generated from the FD module.
Figure 12. Schematic diagram of SIoU.
Figure 13. Representative images from the dataset.
Figure 14. Comparison of different optimizers.
Figure 15. The params of each model in the ablation experiment.
Figure 16. The GFLOPs of each model in the ablation experiment.
Figure 17. The mAP0.5 of each model in the ablation experiment.
Figure 18. The mAP0.95 of each model in the ablation experiment.
Figure 19. The FPS of each model in the ablation experiment.
Figure 20. Flame detection results in different scenes.
Figure 21. Positioning effect on the flame-centre area after using the FD module.
Table 1. The network parameters of AEGG-FD.

| Layers | Filters | Stride | Output | Layers | Filters | Stride | Output |
| Conv1 | 16 | 2 | 320 × 320 × 16 | GSConv | 480 | 1 | 20 × 20 × 480 |
| G_bneck | | 1 | | | | | 40 × 40 × 480 |
| G_bneck | 24 | 2 | 160 × 160 × 24 | Concat [−1, 9] | | | 40 × 40 × 592 |
| | | 1 | | VoV | 480 | | 40 × 40 × 480 |
| G_bneck(SE) | 40 | 2 | 80 × 80 × 40 | GSConv | 240 | | 40 × 40 × 240 |
| | | 1 | | | | | 80 × 80 × 240 |
| G_bneck | 80 | 2 | 40 × 40 × 80 | Concat [−1, 5] | | | 80 × 80 × 280 |
| | | 1 | | VoV | | | 80 × 80 × 240 |
| G_bneck(SE) | 112 | | 40 × 40 × 112 | GSConv | | | 40 × 40 × 240 |
| | | | | Concat [−1, 21] | | | 40 × 40 × 480 |
| G_bneck(SE) | 160 | 2 | 20 × 20 × 160 | VoV | | | |
| G_bneck | | 1 | | GSConv | | | 20 × 20 × 480 |
| G_bneck(SE) | | | | Concat [−1, 17] | | | 20 × 20 × 960 |
| G_bneck | | | | VoV | | | |
| G_bneck(SE) | | | | | | | |
| Conv2 | 960 | 1 | 20 × 20 × 960 | Detection | | | |
| SECSP | 960 | 1 | | | | | |
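To make the G_bneck entries in Table 1 more concrete, the following is a minimal PyTorch sketch of the Ghost module that such bottlenecks are built from: a primary convolution produces a small set of intrinsic feature maps, and a cheap depthwise operation generates the remaining "ghost" maps, which are concatenated with the intrinsic ones. This is an illustrative sketch only, not the authors' implementation; the ratio of 2, the kernel sizes, and the SiLU activation are assumptions.

import torch
import torch.nn as nn

class GhostModule(nn.Module):
    # Sketch of a Ghost module: primary convolution + cheap depthwise operation.
    # Assumes out_ch is divisible by ratio (all channel counts in Table 1 are even).
    def __init__(self, in_ch, out_ch, ratio=2, kernel=1, cheap_kernel=3, stride=1):
        super().__init__()
        init_ch = out_ch // ratio          # intrinsic feature maps
        cheap_ch = out_ch - init_ch        # "ghost" feature maps
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel, stride, kernel // 2, bias=False),
            nn.BatchNorm2d(init_ch),
            nn.SiLU(),
        )
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, cheap_ch, cheap_kernel, 1, cheap_kernel // 2,
                      groups=init_ch, bias=False),   # depthwise: one cheap filter per channel
            nn.BatchNorm2d(cheap_ch),
            nn.SiLU(),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)   # [B, out_ch, H, W]

# Example: GhostModule(16, 24)(torch.randn(1, 16, 160, 160)) has shape [1, 24, 160, 160].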
Table 2. Experimental environment configuration.

| Experimental Environment | Details |
| Language | Python 3.11 |
| Deep learning framework | Torch 2.0 |
| Acceleration environment | CUDA 12.1 + cuDNN 8.9.1 |
| Operating system | Windows 11, 64-bit |
| RAM | 16 GB |
| CPU | Intel(R) Core(TM) i9-13900HX, 2.20 GHz |
| GPU | NVIDIA GeForce RTX 4060 Laptop GPU |
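As a quick way to confirm that a local setup matches the configuration in Table 2, the short Python snippet below prints the interpreter, PyTorch, CUDA, and cuDNN versions together with the detected GPU; it assumes only that PyTorch was installed with CUDA support.

import platform
import torch

print("Python:", platform.python_version())               # expected: 3.11.x
print("PyTorch:", torch.__version__)                      # expected: 2.0.x
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)                # expected: 12.1
print("cuDNN version:", torch.backends.cudnn.version())   # expected: 8.9.x
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))          # e.g., GeForce RTX 4060 Laptop GPU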
Table 3. Model training parameters.

| Training Parameters | Details |
| Epochs | 150 |
| Batch size | 16 |
| Img-size (pixels) | 640 × 640 |
| Optimization algorithm | SGD |
| Initial learning rate | 0.01 |
| Momentum | 0.937 |
| Optimizer weight decay | 0.0005 |
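For reference, the settings in Table 3 map onto a standard YOLOv5-style training run. The sketch below assumes the stock Ultralytics YOLOv5 repository (its train.run() entry point) as the working directory and a hypothetical dataset file fire.yaml; the initial learning rate, momentum, and weight decay listed in Table 3 are carried in the repository's hyperparameter YAML rather than passed as options.

# Minimal sketch of a training run matching Table 3 (fire.yaml is a hypothetical placeholder).
import train  # yolov5/train.py

train.run(
    data="fire.yaml",       # hypothetical dataset configuration
    weights="yolov5s.pt",   # pretrained starting point
    imgsz=640,              # image size (pixels)
    batch_size=16,
    epochs=150,
    optimizer="SGD",        # lr0=0.01, momentum=0.937, weight_decay=0.0005 come from the hyp YAML
)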
Table 4. Comparison with the latest YOLO models.

| Model | P/% | R/% | mAP0.5/% | mAP0.95/% | FPS | GFLOPs |
| YOLOv5 | 78.9 | 75.2 | 78.2 | 47.7 | 51.8 | 16.6 |
| YOLOv7 | 79.7 | 73.2 | 78.9 | 46.2 | 48.3 | 13.2 |
| YOLOv8 | 76.5 | 73.6 | 77.5 | 49.5 | 54.6 | 28.6 |
| AEGG-FD | 79.3 | 76.1 | 84.7 | 53.5 | 60.2 | 6.2 |
Table 5. Comparison of different attention mechanisms.

| Model | P/% | R/% | mAP0.5/% | mAP0.95/% |
| YOLOv5 | 78.9 | 75.2 | 78.2 | 47.7 |
| YOLOv5-Ghost | 79.2 | 74.5 | 80.5 | 48.3 |
| YOLOv5-Ghost-CBAM | 76.7 | 75.2 | 78.7 | 47.7 |
| YOLOv5-Ghost-CA | 79.3 | 76.1 | 80.0 | 48.3 |
| YOLOv5-Ghost-SE | 79.5 | 78.4 | 81.8 | 50.5 |
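Since the SE variant in Table 5 gives the best recall and mAP, a minimal sketch of a generic squeeze-and-excitation block is given below for orientation: global average pooling squeezes each channel to a scalar, a two-layer bottleneck with a sigmoid produces per-channel weights, and the input is rescaled channel-wise. This is a generic SE block in PyTorch, not the paper's SECSP module; the reduction ratio of 16 is an assumption.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    # Generic squeeze-and-excitation block: squeeze -> excite -> rescale.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: B x C x H x W -> B x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                              # per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                   # channel-wise rescaling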
Table 6. Comparison of different loss functions.

| Model | P/% | R/% | mAP0.5/% | mAP0.95/% |
| YOLOv5-CIoU | 78.9 | 75.2 | 78.2 | 47.7 |
| YOLOv5-GIoU | 78.5 | 74.2 | 78.6 | 47.8 |
| YOLOv5-DIoU | 78.1 | 72.8 | 79.3 | 48.1 |
| YOLOv5-SIoU | 78.2 | 75.5 | 80.3 | 48.5 |
Table 7. Results of ablation experiments.

| Modules / Metrics | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 | Model 7 | Model 8 | Model 9 | Model 10 |
| SIoU | | | | | | | | | | |
| SECSP | | | | | | | | | | |
| GSCneck | | | | | | | | | | |
| Ghost | | | | | | | | | | |
| Params/M | 7.24 | 7.24 | 8.43 | 6.75 | 4.43 | 4.43 | 5.59 | 4.19 | 5.24 | 5.24 |
| GFLOPs | 16.6 | 16.6 | 17.6 | 13.9 | 7.2 | 7.2 | 8.2 | 5.4 | 6.2 | 6.2 |
| mAP0.5/% | 78.2 | 80.3 | 79.4 | 82.6 | 80.5 | 81.3 | 81.8 | 83.7 | 83.9 | 84.7 |
| mAP0.95/% | 47.7 | 48.5 | 47.9 | 52.3 | 48.3 | 47.8 | 50.5 | 53.1 | 53.2 | 53.5 |
| FPS | 51.8 | 52.9 | 55.2 | 58.1 | 57.8 | 57.9 | 57.5 | 59.2 | 58.1 | 60.2 |
Note: √ means that the model in the vertical column adopts the left-hand module, e.g., Model 6 uses SIoU and Ghost modules. Models 1 to 5 are the first group of experiments, while Models 6 to 10 are the second group of experiments. Model 1 is the benchmark model (YOLOv5s), Model 10 is the model proposed in this paper, and the second group of ablation experiments was performed on the basis of Model 5.
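As a worked check on Table 7, the relative cost and the gains of the proposed Model 10 against the baseline Model 1 follow directly from the tabulated values; the short snippet below computes the ratios and differences, rounded to one decimal place.

# Relative cost and gains of Model 10 (proposed) versus Model 1 (YOLOv5s baseline) in Table 7
params_base, params_ours = 7.24, 5.24     # parameters, millions
gflops_base, gflops_ours = 16.6, 6.2
print(f"Parameter ratio: {params_ours / params_base:.1%}")   # ~72.4%
print(f"GFLOPs ratio:    {gflops_ours / gflops_base:.1%}")   # ~37.3%
print(f"mAP0.5 gain:     {84.7 - 78.2:.1f} points")          # 6.5
print(f"FPS gain:        {60.2 - 51.8:.1f}")                 # 8.4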
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
