Article

Improving Fire and Smoke Detection with You Only Look Once 11 and Multi-Scale Convolutional Attention

by Yuxuan Li 1, Lisha Nie 2, Fangrong Zhou 3, Yun Liu 2, Haoyu Fu 1, Nan Chen 1, Qinling Dai 4 and Leiguang Wang 5,*
1 College of Soil and Water Conservation, Southwest Forestry University, Kunming 650224, China
2 College of Forestry, Southwest Forestry University, Kunming 650224, China
3 Joint Laboratory of Power Remote Sensing Technology (Electric Power Research Institute, Yunnan Power Grid Company Ltd., China Southern Power Grid), Kunming 650217, China
4 Art and Design College, Southwest Forestry University, Kunming 650024, China
5 College of Landscape Architecture and Horticulture, Southwest Forestry University, Kunming 650224, China
* Author to whom correspondence should be addressed.
Fire 2025, 8(5), 165; https://doi.org/10.3390/fire8050165
Submission received: 12 March 2025 / Revised: 16 April 2025 / Accepted: 18 April 2025 / Published: 22 April 2025

Abstract:
Fires pose significant threats to human safety, health, and property. Traditional methods, with their inefficient use of features, struggle to meet the demands of fire detection. You Only Look Once (YOLO), as an efficient deep learning object detection framework, can rapidly locate and identify fire and smoke objects in visual images. However, research utilizing the latest YOLO11 for fire and smoke detection remains sparse, and addressing the scale variability of fire and smoke objects as well as the practicality of detection models continues to be a research focus. This study first compares YOLO11 with classic models in the YOLO series to analyze its advantages in fire and smoke detection tasks. Then, to tackle the challenges of scale variability and model practicality, we propose a Multi-Scale Convolutional Attention (MSCA) mechanism, integrating it into YOLO11 to create YOLO11s-MSCA. Experimental results show that YOLO11 outperforms other YOLO models by balancing accuracy, speed, and practicality. The YOLO11s-MSCA model performs exceptionally well on the D-Fire dataset, improving overall detection accuracy by 2.6% and smoke recognition accuracy by 2.8%. The model demonstrates a stronger ability to identify small fire and smoke objects. Although challenges remain in handling occluded targets and complex backgrounds, the model exhibits strong robustness and generalization capabilities, maintaining efficient detection performance in complicated environments.

1. Introduction

Wildfires are a natural component of forest ecosystems and play a crucial role in maintaining ecological balance and biodiversity. Moderate wildfires can promote forest species renewal, release stored nutrients, and reintroduce them into the soil cycle, thereby nourishing the forest ecosystem [1,2]. However, improper wildfire management can lead to severe consequences, such as large-scale tree mortality, destruction of forest structures, and loss of habitats for flora and fauna. Additionally, wildfires release significant amounts of carbon dioxide and other pollutants, exacerbating global climate change [3,4]. Against the backdrop of climate change, the frequency and intensity of wildfires are increasing annually, further threatening the stability of forest ecosystems. Therefore, timely monitoring and rapid response to fires and their resulting smoke not only help prevent the spread of fires and protect forest resources but also mitigate the damage caused by climate change to ecosystems. Moreover, the occurrence of wildfires is closely linked to human activities, as humans are one of the primary causes of wildfires, whether through direct ignition or indirect triggering [5]. Thus, fire prevention not only reduces the ecological damage caused by fires but also safeguards human lives and property from threats [6], making it essential to enhance fire prevention and control capabilities. To address this challenge, efforts are being made to explore more efficient technologies and methods for preventing and extinguishing wildfires, thereby protecting the ecological environment and promoting the sustainable development of human society.
Over the years, traditional fire detection methods have been widely used for wildfire monitoring, including manual patrols, watchtower observations, and physical sensor detection. Considering personnel safety, the monitoring range of manual patrols is relatively limited, making it difficult to detect fires in a timely manner [7]. The fire response capability of watchtowers is also constrained by their field of view, and their construction is limited by terrain and budget, resulting in less than ideal monitoring effectiveness [8]. As a result, an increasing number of people are adopting various sensor technologies to enhance fire monitoring capabilities, such as smoke sensors and temperature sensors [9,10,11]. While these sensor technologies are cost-effective and easy to implement, they are more suitable for indoor environments, require manual intervention to verify the authenticity of alarms, and cannot provide specific fire information, such as the exact location, scale, and intensity of the fire [12,13]. Additionally, temperature triggers require reaching a certain threshold to respond, and the process of smoke particles reaching the sensors can be influenced by natural factors such as wind speed [14], leading to delays that ultimately reduce detection accuracy.
With the continuous advancement of surveillance technology, visual fire detection technology has become an effective alternative to traditional fire detection methods. When smoke or fire appears within the camera’s field of view, the system can immediately detect the fire, demonstrating significant potential for early fire detection. Additionally, visual fire detection can provide critical information about fire development, such as the spread speed and direction, which is essential for assessing fire evolution and minimizing losses [15,16]. Currently, vision-based fire detection technologies are primarily divided into machine learning methods and deep learning methods. Machine learning methods [17] rely on manually designed features to identify targets, with the drawback of requiring feature selection and parameter tuning, and the selected features may not be sufficiently representative. To address these limitations, deep learning technology has become the preferred choice in the field of fire detection due to its ability to automatically extract features, achieve high detection accuracy, and exhibit strong adaptability [18,19]. Object detection technology in deep learning has garnered widespread attention for its capability to automatically identify and locate fire and smoke objects in images or video streams. Object detection tasks are generally divided into two categories: two-stage detection and one-stage detection. Two-stage detection first generates candidate regions that may contain objects; in the second stage, these candidate regions are classified and subjected to bounding box regression to determine the target’s location and category. Representative algorithms include R-CNN [20], Fast R-CNN [21], and Faster R-CNN [22], among others. In contrast, one-stage detection methods directly predict the object’s location and category from the input image, completing both classification and regression tasks in a single forward pass. Representative algorithms include SSD [23] and the YOLO (You Only Look Once) series [24,25].
Although two-stage detection methods offer higher accuracy, their process of first generating candidate regions and then performing classification and regression results in greater computational overhead and longer processing times. Therefore, in real-time wildfire detection tasks that demand both speed and accuracy, the one-stage YOLO object detection framework has been widely improved and applied [26,27]. To verify the effectiveness of YOLOv3 for real-time forest fire detection on small drones, Zhentian Jiao et al. [28] proposed an improved algorithm based on YOLOv3-tiny and tested it on a drone platform. The experiments demonstrated that this method met expectations in terms of both detection speed and accuracy. For rapid detection of ship fires in complex and dynamic marine environments, Huafeng Wu et al. [29] constructed a high-quality dataset, introduced multi-scale detection techniques, and incorporated the SE (Squeeze and Excitation) attention mechanism. This effectively enhanced the accuracy and efficiency of small object detection. Additionally, they accelerated model convergence using transfer learning. The experimental results showed that the detection speed fully met real-time requirements. To address the challenge of accurately identifying different types of forest fires in complex backgrounds, Qilin Xue et al. [30] proposed an improved YOLOv5-based model called FCDM. By introducing the SIoU loss function, the CBAM attention mechanism, and an enhanced BiFPN feature fusion network, the model performed exceptionally well in fire classification and detection tasks. It demonstrated significant potential for practical applications. In the future, researchers plan to explore model lightweighting to improve deployment efficiency. To tackle the difficulty of detecting small smoke targets in complex backgrounds, Xin Chen et al. [31] proposed the RepVGG-YOLOv7 model. This model incorporated the ECA attention mechanism, the SIoU loss function, the RepVGG structure, and an improved non-maximum suppression algorithm. These innovations significantly boosted feature extraction capabilities and improved inference efficiency through lossless compression techniques. Experimental results indicated that the model achieved a precision of 95.1% in complex backgrounds and small target detection. Future work will focus on further optimization for edge devices and exploring the integration of multimodal data. Additionally, to address the issues of high false alarm and missed detection rates in existing fire detection technologies, Tianxin Zhang et al. [32] proposed an improved YOLOv8-based fire detection algorithm, YOLOv8-FEP. By adding a large object detection head, introducing the EMA attention mechanism, and designing a PAN-Bag structure, the model excelled in detecting large-scale fires and smoke object and effectively improved robustness and accuracy in complex scenarios. Experiments indicated that YOLOv8-FEP achieved higher precision, lower false alarm rates, and lower missed detection rates in fire and smoke detection tasks, making it more suitable for real-time detection. Future studies will focus on expanding the training dataset and optimizing the algorithm to enhance the model’s performance in special environments. Chenmeng Zhao et al. [33] proposed a lightweight deep learning model called SF-YOLO (Smoke and Fire-YOLO), which is based on an improved version of YOLOv11. 
It incorporates the C3k2 module with a dual-path residual attention mechanism and the SEAMHead detection head with an embedded attention mechanism, along with a dynamically adjusted W-SIoU loss function. These innovations significantly enhance the model’s ability to detect targets in situations with occlusion, blurred boundaries, and complex backgrounds. Experiments conducted on the self-built dataset S-Firedata and the public dataset M4SFWD show that SF-YOLO outperforms existing models in terms of detection accuracy (with mAP50 and mAP50-95 improving by 2.5% and 2.4%, respectively) as well as lightweight design. Hongying Liu et al. [34] proposed a Transformer-based multi-scale feature fusion network called TFNet. This network enhances feature extraction through the SRModule, utilizes the CG-MSFF Encoder to fuse multi-scale features, and employs the WIOU loss function to optimize localization accuracy. Experimental results demonstrate that TFNet performs excellently on the D-Fire and M4SFWD public datasets, outperforming existing advanced models in metrics such as precision, recall, F1 score, mAP50, and mAP50-95. At the same time, it maintains a good balance in terms of parameter size and computational complexity, making it suitable for real-time applications. From CNNs to Transformers, researchers have continually improved YOLO from different angles to better meet the requirements of fire and smoke object detection tasks. Table 1 summarizes the problems these studies have addressed and the issues that remain to be solved in future work.
It is evident that many efforts and studies have been conducted using the YOLO framework to address issues in visual fire detection, such as the diverse types, colors, and shapes of smoke and fire, the complexity of the environment, and the practicality of model deployment [15]. These challenges remain key research topics in visual fire detection [35]. The latest YOLO series, YOLO11, was released by the Ultralytics team at the end of September 2024. YOLO11 utilizes an improved backbone and neck architecture to enhance feature extraction, offering a faster processing speed while maintaining the best balance between accuracy and performance. It can be seamlessly deployed on edge devices, cloud platforms, and GPU-equipped systems, ensuring maximum flexibility [36]. Whether the latest YOLO11 can better address the challenges in fire and smoke object detection tasks or what its advantages are remains to be seen. This study conducts experiments to discuss the performance of YOLO11 in fire and smoke object detection tasks.
This study addresses the following specific problems and conducts the following work:
  • The YOLO series has consistently played an active role in fire and smoke target detection. This study first explores the advantages of the YOLO11 network in fire and smoke target detection tasks. The results show that YOLO11 achieves the best balance between detection accuracy, detection speed, and model complexity for fire and smoke object detection tasks. Moreover, all YOLO models demonstrate higher detection accuracy for smoke than for fire, and the YOLO series can detect fires at an earlier stage. YOLO11 will be an important tool for our future fire and smoke object detection research.
  • To address the variability of smoke and fire objects across different scales and environments, this study introduces the Multi-Scale Convolutional Attention (MSCA) module to enhance YOLO11 (for simplicity, the improved YOLO11 is referred to as YOLO11s-MSCA in this paper). The components of MSCA are characterized by high efficiency, low computational cost, and minimal parameters, ensuring that the model’s detection speed and algorithm complexity do not significantly increase.
  • Observations from the results of YOLO11s-MSCA on the D-Fire dataset show that YOLO11s-MSCA improves the model’s detection accuracy without noticeably compromising detection speed or increasing model complexity. Compared to YOLO11s, the improved model detects small targets in images more accurately and demonstrates a stronger understanding of environmental contexts. However, it still struggles with identifying occluded objects, and the main sources of errors remain the complex environmental variations.
  • The generalization performance of the classic YOLO models and the YOLO11s-MSCA model is verified through overall metrics and visual details on the Fire and Smoke Dataset and CBM-Fire. Even when the dataset is changed, the models maintain high accuracy and recall, confirming that the conclusions drawn from the D-Fire dataset remain applicable. We also compare the same models across different datasets and conclude that the construction strategy of a dataset (data volume, scene richness, and the classification of interference items) has a significant impact on the performance and stability of the detection model. In particular, when interference items are explicitly labeled as an independent category, the robustness and accuracy of the model in complex environments can be significantly improved.

2. Materials and Methods

2.1. Datasets

To improve the accuracy of training results and enhance the generalization ability of deep convolutional neural network models, it is crucial to select datasets that can simulate real fire scenarios. This research comprehensively evaluates the model’s performance by assessing its results on the D-Fire dataset, the Fire and Smoke Dataset, and CBM-Fire.
The D-Fire [37] fire and smoke object detection dataset contains 21,527 images, with 17,221 used for training the model and 4306 for testing. The dataset includes 1164 images featuring only fire cases, 5867 images containing only smoke cases, and 4658 images where both fire and smoke are present. These extensive annotated data provide a wealth of samples for the model. To improve the model’s robustness, the dataset also contains 9838 images without fire or smoke, including instances often misidentified as fire and smoke, such as fog and bright light. To enhance the model’s generalization ability and simulate complex scenarios, the dataset’s sources are highly diverse, covering scenes from the internet, schools, technology parks, and parks—all with intricate backgrounds. Moreover, the dataset distinguishes between day and night images, and it includes various camera interference factors, such as raindrops and spiders, which reflect real-world complexities. The number of instances in the dataset is detailed in Table 2, ensuring that the model can be trained on a substantial number of samples. Typical examples from the dataset are illustrated in Figure 1, showcasing the diversity of its sources.
The Fire and Smoke Dataset [38] is the second dataset used in this study, primarily to validate the model’s generalization ability and result applicability. This publicly available dataset consists of two parts, from which we downloaded FireSmokeDataset_part1. It contains 12,813 training images, 6068 validation images, and 2237 test images—sufficient data volume to effectively support model training and validation. The dataset covers a wide range of scenarios, including indoor and outdoor environments, urban areas, industrial sites, and forests. It also includes both low- and high-resolution images, as well as multi-instance cases. These diverse scenes simulate various real-world complexities, ensuring the model undergoes comprehensive training and testing even under challenging conditions. The dataset is divided into three categories: smoke, fire, and other classes. The “other classes” include images that are easily mistaken for fire, such as sunrise and sunset, clouds, streetlights, campfire embers, reflections, vehicle headlights, emergency lights of various vehicles, and lights from public electronic screens. The inclusion of these categories enhances the model’s ability to recognize interfering factors, thereby improving its robustness in practical applications. The dataset is sourced from a variety of origins, including other public datasets, images downloaded from the internet, and frames extracted from videos, encompassing images from daytime to nighttime and under different weather conditions. Typical examples from the dataset are illustrated in Figure 2, effectively simulating complex real-world scenarios.
To address the issue of insufficient quantity and low quality in fire target detection datasets, Xin Geng et al. proposed a new fire detection dataset called CBM-Fire [39]. This dataset contains 2000 images of flames and smoke, divided into two categories—flame and smoke—at a ratio of 2:1. The images cover a wide range of fire scenarios and are annotated using LabelMe software. The dataset is split into training, testing, and validation sets in an 8:1:1 ratio.
When users download the dataset, the images are already organized by category. By simply reading the corresponding .txt files with the provided code, the images can be sorted into separate folders accordingly, as shown in Table 3 (CBM-Fire dataset partitioning ratio). The dataset features high diversity, precise annotations, and excellent quality, making it well-suited for the training and evaluation of fire detection algorithms. The related resources have been publicly released on the open-source repository at: https://github.com/GengHan-123/yolov9-cbm (accessed on 15 April 2025).
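As an illustration of that sorting step, a minimal sketch is shown below; the directory layout, split-file names, and paths are assumptions for illustration rather than the repository's actual script.

```python
import shutil
from pathlib import Path

# Assumed layout: each split file (train.txt, val.txt, test.txt) lists one
# image path per line, relative to the dataset root. These names are hypothetical.
DATASET_ROOT = Path("CBM-Fire")

for split in ("train", "val", "test"):
    out_dir = DATASET_ROOT / "images" / split
    out_dir.mkdir(parents=True, exist_ok=True)
    for line in (DATASET_ROOT / f"{split}.txt").read_text().splitlines():
        rel_path = line.strip()
        if rel_path:  # skip blank lines
            shutil.copy(DATASET_ROOT / rel_path, out_dir / Path(rel_path).name)
```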

2.2. Model Overview

2.2.1. Proposed Method

The latest YOLO network framework consists of four main components: the input, backbone, neck, and head, as illustrated in parts a, b, and c of Figure 3. The input layer processes the image and applies data augmentation [40]. The backbone network is responsible for extracting multi-scale features from the input image. The neck uses a Feature Pyramid Network (FPN) [41] to aggregate features from different scales, enhancing the model’s ability to detect objects of various sizes and improving multi-scale semantic representation. The detection head consists of several convolutional layers designed to produce detection results.
This paper focuses on YOLO11, a model developed by the Ultralytics team, similar to YOLOv8, with several improvements based on YOLOv8 [42]. Key information can be found on the website https://docs.ultralytics.com/models/yolo11/ (accessed on 15 April 2025). The backbone modules of YOLO11 include the Conv composite module (comprising Conv2d, batch normalization [43], and the SiLU activation function [44]), the C3k2 module, the Spatial Pyramid Pooling-Fast (SPPF) module [26], and the Cross Stage Partial Spatial Attention (C2PSA) module. C3k2 and C2PSA are novel modules introduced by YOLO11. The C3k2 module enhances the model’s adaptability to complex scenes by employing parallel convolution and optimized parameter settings, effectively handling multi-scale features. The module achieves efficient feature extraction through parallel convolution branches and a parameter-configurable design. Its core adopts a two-path structure: one path directly transfers shallow features, while the other processes deep features through an optional C3k (variable convolution kernel) block or a standard bottleneck module; the two paths are finally concatenated and fused. When the C3k flag is false, the C3k2 block reduces to C2f; when it is true, the branch uses C3k blocks. The details are shown in Figure 4. Meanwhile, the C2PSA module combines the Cross Stage Partial (CSP) [45] structure and the Pyramid Squeeze Attention (PSA) [46] mechanism, further boosting feature extraction capabilities, particularly excelling in multi-scale feature extraction. CSP divides the feature map into two parts: one part is transmitted directly, while the other is processed through the PSA module; the two parts are then concatenated and fused. By emphasizing spatial correlations in feature maps, C2PSA ensures that the model focuses on crucial regions, thereby striking a balance between computational cost and detection accuracy (refer to Figure 5 for details). In the detection head, YOLO11 produces detection results at three different scales by utilizing feature maps generated from the backbone and neck.
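To make the two-path idea concrete, the following is a minimal PyTorch sketch of the split–transform–concatenate pattern described above; the channel split, kernel sizes, and bottleneck count are illustrative assumptions and not the actual Ultralytics C3k2 implementation.

```python
import torch
import torch.nn as nn


class ConvBNSiLU(nn.Module):
    """Conv2d + BatchNorm + SiLU, the composite Conv block described above."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class Bottleneck(nn.Module):
    """Standard bottleneck: two 3x3 convs with a residual connection."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, 3)
        self.cv2 = ConvBNSiLU(c, c, 3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))


class C3k2Sketch(nn.Module):
    """Illustrative two-path block: one path keeps shallow features,
    the other refines deep features through n bottlenecks, then both are fused."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        c_hidden = c_out // 2
        self.split = ConvBNSiLU(c_in, 2 * c_hidden, 1)                       # split into two paths
        self.blocks = nn.Sequential(*[Bottleneck(c_hidden) for _ in range(n)])
        self.fuse = ConvBNSiLU(2 * c_hidden, c_out, 1)                       # concatenate and fuse

    def forward(self, x):
        shallow, deep = self.split(x).chunk(2, dim=1)
        return self.fuse(torch.cat([shallow, self.blocks(deep)], dim=1))
```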
The improvements in this study focus on the neck of the YOLO11 network. The neck network adopts a structure similar to that of YOLOv8, integrating the Feature Pyramid Network (FPN) and Path Aggregation Network (PANet) [47]. By introducing top–down pathways and lateral connections, it effectively fuses high-level semantic information with low-level spatial details, enhancing the model’s ability to detect objects of varying sizes [48]. This structure efficiently processes the multi-scale features extracted from the backbone network. In YOLO11, the C3k2 module is introduced into the neck to enable the model to extract features at different scales and facilitate multi-scale information interaction. This allows the model to capture more complex spatial features, which are then passed to the detection head. Increasing the capacity of the neck network may further enhance the benefits of feature fusion. Therefore, after the C3k2 module performs multi-scale feature interaction, the MSCA module is introduced to focus on the multi-scale contextual information of important regions, highlighted in blue in Figure 3.
To maintain a balance between accuracy and model complexity, this study currently adopts a single-module approach, ensuring high precision without significantly increasing computational cost.

2.2.2. Multi-Scale Convolutional Attention (MSCA)

The MSCA (Multi-Scale Convolutional Attention) mechanism was proposed by Meng-Hao Guo et al. [49] to address the challenge of efficiently encoding contextual information in semantic segmentation tasks. It leverages multi-scale convolutional features to activate spatial attention [50], thereby encoding contextual information more effectively. To tackle the variability of smoke and fire types, as well as the complexity of the environment, this study introduces MSCA into the YOLO11 framework to focus on the multi-scale and contextual information of objects. The key components of MSCA are as follows:
Depth-wise Convolution: used to aggregate local information. Unlike standard convolutions that operate across the entire input feature map, depth-wise convolutions perform convolution independently on each input channel [51]. The kernel depth matches the number of input channels, while the kernel width and height remain relatively small. The primary advantage of this convolution method lies in its computational efficiency, as it avoids convolution operations between different channels, significantly reducing the number of parameters and calculations.
Multi-branch Depth-wise Strip Convolutions: used to capture multi-scale contextual information. Each branch employs strip convolutions, a lightweight convolution method that approximates a standard k × k convolution using a pair of k × 1 and 1 × k convolutions [52]. Compared to standard convolutions, strip convolutions significantly reduce computational cost and the number of parameters while maintaining performance. This design balances efficiency and accuracy, enabling the model to process spatial features across multiple scales effectively.
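A quick per-channel parameter count (ignoring biases) shows where the savings come from; k = 21 is used only as an example of a large kernel:

$$
\underbrace{k \times k}_{\text{full depth-wise } k \times k \text{ kernel}}
\;\longrightarrow\;
\underbrace{k \times 1 \;+\; 1 \times k \;=\; 2k}_{\text{strip-convolution pair}},
\qquad \text{e.g., } k = 21:\; 441 \text{ vs. } 42 \text{ weights per channel.}
$$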
The 1 × 1 Convolution: used to model relationships between different channels [53] and generate attention weights. It reduces the number of feature maps while maintaining a relatively low parameter count.
The high efficiency, low computational cost, and minimal parameters of each component in the MSCA module are key reasons for incorporating it into the fire and smoke detection task. It ensures that accuracy is improved without significantly increasing model complexity or reducing detection speed.
Specifically, the computational process of MSCA is as shown in Equations (1) and (2):
$$\mathrm{MSCA} = \mathrm{Conv}_{1\times 1}\left(\sum_{i=0}^{3} \mathrm{Scale}_i\big(\mathrm{DW\_Conv}(\mathrm{In})\big)\right) \tag{1}$$
$$\mathrm{Out} = \mathrm{MSCA} \otimes \mathrm{In} \tag{2}$$
Here, In represents the input features, MSCA and Out denote the attention map and the output, respectively, ⊗ signifies the element-wise matrix multiplication operation, DW_Conv stands for depth-wise convolution, Scale_i represents the i-th scale branch, and Scale_0 is the identity connection. A detailed diagram of the MSCA module is shown in Figure 6.
By adopting this approach, MSCA effectively captures multi-scale features and generates adaptive attention weights, enhancing the model’s performance. This research integrates the module into the YOLO11s model, which significantly boosts the model’s accuracy; the specific results are discussed in the next section. MSCA is a plug-and-play module that requires no additional input parameters: it is added as a single module, and no changes are made to the other modules of the source code.
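For concreteness, below is a minimal PyTorch sketch of the MSCA block defined by Equations (1) and (2). The branch kernel sizes (5, 7, 11, and 21) follow the original SegNeXt design [49] and are assumptions here rather than values confirmed in this paper.

```python
import torch
import torch.nn as nn


class MSCASketch(nn.Module):
    """Minimal sketch of Multi-Scale Convolutional Attention (Equations (1)-(2)).
    Kernel sizes are assumed from the original SegNeXt design."""
    def __init__(self, channels):
        super().__init__()
        # Depth-wise convolution aggregating local information
        self.dw = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        # Multi-branch depth-wise strip convolutions (1 x k followed by k x 1)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),
            )
            for k in (7, 11, 21)
        ])
        # 1 x 1 convolution modelling cross-channel relations -> attention weights
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        y = self.dw(x)
        # Scale_0 is the identity connection; Scale_1..3 are the strip-conv branches
        attn = y + sum(branch(y) for branch in self.branches)
        attn = self.pw(attn)
        return attn * x  # element-wise multiplication with the input features


# Quick shape check
if __name__ == "__main__":
    feats = torch.randn(1, 256, 40, 40)
    print(MSCASketch(256)(feats).shape)  # torch.Size([1, 256, 40, 40])
```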

3. Experiments and Analyses

3.1. Evaluation Metrics Section

To improve the model, evaluate its performance, and compare differences between models, selecting appropriate evaluation metrics is crucial. This study employs multiple metrics related to object detection [37,54]. Given the severe danger posed by wildfires, timely and accurate detection of fires and their accompanying smoke is of utmost importance. Therefore, the accuracy of fire and smoke object detection is the primary focus. Additionally, the number of missed detections serves as a critical basis for assessing model performance and guiding future improvements. To evaluate the model’s performance in terms of high precision, this paper employs metrics such as precision, recall, mAP50, and mAP50-95. Furthermore, to meet real-time requirements and evaluate the model’s computational efficiency, this study uses FPS (frames per second) and GFLOPS (giga floating-point operations) as metrics. The specific formulas are provided in Equations (3)–(6).
1. Precision
Precision measures the proportion of correctly identified targets among all the targets detected by the model. The higher the precision, the fewer false positives the model has.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{3}$$
2. Recall
Recall measures the proportion of correctly detected objects among all actual objects. It represents the ratio of correctly detected objects among all objects of interest (each smoke and fire object). A higher recall indicates fewer missed detections by the model.
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{4}$$
Here, TP (True Positive) refers to targets correctly identified and localized by the model, meaning the bounding box detected by the model fully matches the ground truth box and is classified as the positive class (target class). FP (False Positive) refers to cases where the model incorrectly detects background regions or non-target regions as targets, meaning the model predicts the existence of a target that does not actually exist. FN (False Negative) refers to real targets that the model fails to detect, meaning actual targets are neither recognized nor localized by the model.
3. Mean Average Precision (mAP)
The area under the precision–recall curve for a single class is called average precision (AP), and the mean of AP across all classes is referred to as mean average precision (mAP) [55]. mAP50 refers to the average precision calculated at an Intersection over Union (IoU) threshold of 0.5. mAP50-95 is an extension of mAP, which calculates the average precision across multiple IoU thresholds ranging from 0.5 to 0.95 with a step size of 0.05. This value reflects the model’s capability in bounding box regression.
$$\mathrm{mAP} = \frac{1}{m}\sum_{i=1}^{m} AP(i) \tag{5}$$
$$AP = \int_{0}^{1} P(r)\,dr \tag{6}$$
4. GFLOPS
GFLOPS is one of the metrics used to measure the computational complexity of a model. It reflects the number of floating-point operations required for the model to perform forward inference. The higher the GFLOPS value, the more floating-point operations the model needs to perform, which increases the computational demands and may slow down the inference speed, especially when hardware resources are limited.
5. FPS
FPS is a critical metric for evaluating the real-time performance of a model, particularly in applications such as video processing, real-time object detection, and augmented reality [56]. FPS represents the number of images or video frames processed by the model per unit of time, directly reflecting its ability to meet real-time requirements.
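As a concrete illustration of Equations (3)–(6), the following self-contained Python sketch combines toy per-class values into precision, recall, AP, and mAP. The numbers are illustrative only, and practical evaluators such as the Ultralytics toolkit additionally interpolate the precision–recall curve, so results will differ slightly.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts, as in Equations (3) and (4)."""
    return tp / (tp + fp), tp / (tp + fn)


def average_precision(recalls, precisions):
    """Area under a precision-recall curve (Equation (6)), via the plain trapezoid
    rule; `recalls` must be sorted in ascending order."""
    area = 0.0
    for i in range(1, len(recalls)):
        area += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return area


# Toy example with two classes (smoke, fire); values are illustrative only.
ap_smoke = average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.6])  # 0.80
ap_fire = average_precision([0.0, 0.4, 0.9], [1.0, 0.7, 0.5])   # 0.64
map50 = (ap_smoke + ap_fire) / 2                                # Equation (5), m = 2
print(precision_recall(tp=80, fp=10, fn=20), round(map50, 2))   # (0.888..., 0.8) 0.72
```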

3.2. Experimental Configuration and Parameter Setting

To ensure that performance differences stem from the model’s design and functionality rather than external factors such as hardware configuration or hyperparameter settings, it is essential to meticulously document and control these external variables.
The deep learning training platform used in this study has powerful computing capabilities and efficient parallel processing abilities, making it suitable for training and inference on large-scale datasets. The system configuration includes the Windows Server 2022 Standard 64-bit operating system, equipped with two Intel(R) Xeon(R) Gold 6330 processors, with a base frequency of 2.00 GHz and a maximum frequency of 3.10 GHz, and 256 GB of high-speed memory, providing ample memory support for handling large datasets and complex models. The system is also equipped with a Tesla T4 GPU (15206 MiB), supporting the CUDA computing platform, which accelerates model training and inference processes in deep learning tasks, significantly reducing training time. The deep learning framework used is PyTorch 2.4.1, with CUDA version 11.8, the Ultralytics 8.3.18 toolkit, and Python version 3.10.15, ensuring flexibility and efficiency in the development environment.
The training process consists of 100 epochs, with an input image size of 640 × 640 pixels. The batch size is set to 20, and the optimization algorithm used is Stochastic Gradient Descent (SGD), with all other hyperparameters set to their default values. The initial learning rate (lr0) is set to 0.01, and the SGD momentum factor is set to 0.937 by default. Additionally, the number of worker threads for data loading is set to 0 to ensure efficient memory usage. Mosaic data augmentation is disabled during the final 10 epochs of training. The evaluation metrics in the following sections are calculated based on these hyperparameter settings to ensure fair comparison and analysis. The key hyperparameter settings are shown in Table 4.
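For reference, these settings map onto the Ultralytics Python API roughly as sketched below; the dataset configuration file name (d_fire.yaml) is hypothetical, and the custom MSCA module of YOLO11s-MSCA would additionally have to be registered, which is not shown here.

```python
from ultralytics import YOLO

# Baseline YOLO11s model; the YOLO11s-MSCA variant requires registering the
# custom MSCA module before a modified model YAML can be loaded.
model = YOLO("yolo11s.pt")

model.train(
    data="d_fire.yaml",   # hypothetical dataset config pointing to the D-Fire splits
    epochs=100,
    imgsz=640,
    batch=20,
    optimizer="SGD",
    lr0=0.01,
    momentum=0.937,
    workers=0,            # data-loading worker threads
    close_mosaic=10,      # disable mosaic augmentation for the last 10 epochs
)
```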

3.3. Experimental Results and Discussion Analysis

3.3.1. Comparative Analysis of the Advantages of YOLO11

The YOLO model has been updated generation by generation to continuously improve its performance, speed, accuracy, and adaptability, thereby meeting the growing demands of computer vision tasks. To explore how YOLO11 is superior for fire and smoke target detection on the D-Fire dataset, this research compares YOLO11 with previous classic methods of the YOLO family. All experiments and metric calculations were performed on the same device with the same hyperparameter settings, as described in Section 3.2, ensuring that differences in the metrics are caused by the differences between models. This experiment selects models that are representative and can be run on our equipment.
To meet the needs of more diverse computer vision applications, YOLOv5 aims to improve the accuracy and ease of use of real-time object detection while optimizing resource usage. By improving the network architecture and training strategy, detection accuracy is improved to meet the demands of complex scenarios. In addition, computational and memory resource consumption is optimized to ensure efficient operation on resource-constrained devices and to expand the range of detection tasks [57]. Table 5 shows that in this task, the overall precision of YOLOv5s is 1.1% lower than that of YOLO11s, and all other metrics are worse than those of YOLO11s except for an FPS advantage of 32.3.
YOLOv6 is dedicated to solving key problems in object detection, including the balance between speed and accuracy, practical model deployment, quantization-induced performance degradation, and the suitability of label assignment and loss functions. It redesigns the network architecture and training strategy for the diverse speed and accuracy needs of different scenarios and strives to achieve the best balance between the two [58]. As seen in Table 5, YOLOv6s has larger GFLOPS as well as higher FPS, which differ from YOLO11s by 22.7 and 20.8, respectively. However, compared to YOLO11s, its overall accuracy is 1.7% lower, and its mAP50 for the fire target is 2.2% lower, which is the largest difference among all the metrics; all of its other metrics are also lower than those of YOLO11s.
YOLOv7 aims to address the balance between speed and accuracy in real-time object detection, while focusing on improving the detection of small objects. It further reduces model complexity and computational resource consumption and improves training efficiency by optimizing the network architecture and training strategies. In addition, YOLOv7 enhances the generalization capability of the model, enabling it to better adapt to diverse application scenarios and meet different task requirements [59]. As shown in Table 5, the overall precision of YOLOv7 is only 69%, a difference of 8.1% compared to YOLO11, mainly because its detection precision for fire objects is only 62.7%, which is significantly lower than that of the YOLO11 algorithm.
The goal of YOLOv8 is to further improve the performance and applicability of the YOLO series in real-time object detection tasks through technological innovations and architectural optimizations, while maintaining its ease of use and high efficiency; it introduces relatively few algorithmic innovations, leans towards engineering practice, and scales effectively across different hardware configurations [60]. As shown in Table 5, YOLO11s has 7.1 fewer GFLOPS compared to YOLOv8s, with other metrics slightly reduced and an overall accuracy decrease of 0.1%. Although the performance difference is minimal, YOLO11s exhibits lower model complexity.
YOLOv9 addresses the problem of data loss during transmission through deep networks by introducing innovative methods such as Programmable Gradient Information (PGI) and the Generalized Efficient Layer Aggregation Network (GELAN); such loss can lead to biased gradient flows, which in turn affect the training effectiveness and prediction accuracy of the model [61]. As shown in Table 5, compared with YOLOv9s, the overall accuracy of YOLO11s decreased by 0.6%, GFLOPS decreased by 5.4, and FPS increased by 62.5. YOLO11s shows a slight decrease in accuracy, but it has substantially lower model complexity and a faster detection speed.
The YOLOv10 research team aims to further push the performance–efficiency boundaries of YOLO in terms of both post-processing and model architecture, comprehensively optimizing each component of YOLO from the perspectives of efficiency and accuracy, greatly reducing computational overhead and improving performance [62]. As shown in Table 5, compared with YOLOv10s, the overall accuracy of YOLO11s is improved by 0.6%, the accuracy of smoke detection is improved by 1.9%, the other accuracy indicators are also improved, and GFLOPS is reduced by 3.3.
YOLO11 is provided in five versions (n, s, m, l, and x) with gradually increasing parameter counts; although the versions differ in size, each achieves a good balance between accuracy and speed. In order to balance accuracy, real-time performance, and model complexity on our equipment, this experiment compares the differences between YOLO11n and YOLO11s; the larger versions involve a surge in the number of parameters, which does not meet our requirements for low complexity and real-time performance when detecting fire and smoke objects [36]. As can be seen from Table 5, compared with YOLO11s, YOLO11n has fewer parameters: its GFLOPS is 6.3, only about 30% of that of YOLO11s, and its FPS increases by 112.3, which meets the requirements of real-time detection and lower model complexity, but its overall accuracy is 1.8% lower than that of YOLO11s. YOLO11n thus provides a new choice depending on specific task requirements. Because fire is so harmful, this study puts accuracy first in the fire and smoke detection task; the baseline model improved in this paper is therefore YOLO11s.
The variation in evaluation metrics during model training is shown in Figure 7. It can be observed that YOLOv7 performs the worst in terms of precision, while the differences among the other models are not significant. For recall, YOLO11n performs the worst, and YOLOv7 achieves the highest score. YOLOv9s and YOLO11s exhibit the most stable changes, indicating that these two models maintain a high precision rate and can detect targets stably and consistently. YOLO11n and YOLOv7 show the worst mAP results, while YOLOv9s performs the best. The performance of YOLO11s and several other models under different IoU thresholds varies little. Although YOLO11s does not achieve the best results in all metrics, each metric remains stable and ranks highly, demonstrating its advantage of stability and balance.
Through comparative analysis, it is evident that all YOLO models exhibit higher detection accuracies for smoke objects compared to fire objects, indicating that YOLO is effective for early fire detection. Although each model has its strengths and weaknesses, the differences in accuracy are not significant. YOLO11 demonstrates a clear advantage in fire and smoke object detection tasks, outperforming other methods in the YOLO family in terms of balancing accuracy, speed, and real-time performance. This makes YOLO11 another important tool for future fire and smoke detection tasks. Additionally, the superior performance of other models in detecting fire or smoke can provide valuable insights for future improvements.

3.3.2. YOLO11s-MSCA Model Performance Analysis

The previous section verified that the YOLO11 model demonstrates significant advantages in fire and smoke detection tasks. Since the performance of deep learning models is closely related to specific tasks, selecting appropriate improvement methods is crucial. In this section, our study focuses on enhancing the YOLO11 model for fire and smoke detection, making it more suitable for these tasks. Based on the characteristics of fire and smoke summarized by previous studies [19], this study introduces the MSCA module to strengthen YOLO11, with the specific framework shown in Figure 3. We start comparing the effects of our improvements from the model training stage.
In the YOLO series, the optimal model during training and validation is determined by minimizing the bounding box loss (box_loss), the distribution focal loss (dfl_loss), and the classification loss (cls_loss). As seen in Figure 8, the losses of YOLO11n, YOLO11s, and YOLO11s-MSCA all decrease gradually during training; YOLO11s-MSCA minimizes them most effectively and converges fastest, while YOLO11n performs worst. On the validation set, there is no significant difference in bounding box localization loss between YOLO11s and YOLO11s-MSCA, YOLO11n has the weakest localization ability, the three models differ little in dfl_loss, and YOLO11s-MSCA converges fastest in classification loss, suggesting that the introduced attention mechanism can focus on our regions of interest more quickly.
Because RT-DETR is a different type of model, its loss function combines classification loss, L1 loss, and IoU loss to optimize bounding box regression. A decrease in giou_loss indicates that bounding box localization is changing, and a decrease in l1_loss indicates that coordinate prediction is becoming more stable. Figure 9 shows that during training, bounding box localization gradually stabilizes, while the classification loss, although trending downward, fluctuates strongly. During validation, the localization ability of the RT-DETR model stabilizes faster than that of the YOLO series models, but its classification loss (cls_loss) stabilizes more slowly.
Figure 10 compares the accuracy curves of YOLO11s-MSCA and other classic object detection models during training. All metrics of the RT-DETR model are the worst. The precision curve in Figure 10 shows that YOLO11s-MSCA achieves the highest precision, with no significant differences from YOLO11s in the other indicators, indicating that the introduced module mainly improves classification ability while bounding box regression remains essentially unchanged. This is expected, since we did not modify bounding box regression; future improvements can be made in this direction. We then compared the models using the evaluation metrics; since YOLO11n performed the worst, it is not included in the comparison in Table 6.
To enhance the comprehensiveness of the comparison, we also compare with RT-DETR, the first Transformer architecture model capable of end-to-end detection in real-time scenarios. Experimental results of RT-DETR on the COCO dataset indicate that it significantly outperforms popular YOLO series real-time detectors in both speed and accuracy [63]. However, on our D-Fire dataset, the RT-DETR model only achieves an overall accuracy of 71.5%, with other metrics also showing significant gaps compared to YOLO. Most importantly, the model’s GFLOPS reaches 103.4, which is 4.6 times that of YOLO11s-MSCA, while its detection speed is only 68 FPS, roughly one-third that of YOLO11s-MSCA. Therefore, for fire and smoke object detection tasks, YOLO11s-MSCA outperforms RT-DETR. The specific results can be found in Table 6.
Accurate identification of fire events is exceptionally important: because fires are so hazardous, the accuracy of smoke and fire detection must come first, and the improved model meets this high-accuracy requirement. Fire is accompanied by smoke, and because smoke spreads much more widely than flames in the early stage of a fire, the model’s more accurate identification of smoke targets allows relevant personnel to monitor the target area in a timely manner, which is exceptionally important for early fire detection. Moreover, the improved model does not significantly increase model complexity. Our improvements therefore fulfill the requirements for fire and smoke object detection.
While metrics analysis can provide a quantitative assessment for the object detection model and help researchers understand the overall performance of the model, visual analysis provides more detailed information about the model and can help us gain a deeper understanding of the model’s behavior under different conditions.
By analyzing the visual results, it can be seen that YOLO11s classifies and localizes accurately in smoke and fire detection tasks but still has problems in many details. Observation of the validation set results shows that the improved YOLO11s-MSCA model proposed in this paper is effective; the advantages of the algorithm improvement are as follows:
  • As shown in Figure 11b,e, the improved method in this paper is more accurate in the detection of small objects. Small fire objects that are not annotated in the labels are still identified accurately, which proves the effectiveness of the multi-scale spatial attention introduced in this paper: the model focuses on objects of different scales (smoke and fire objects in this paper).
  • The YOLO11s-MSCA model locates the target bounding box more accurately. For example, in Figure 11a it correctly locates the extent of the smoke object and does not identify the scattering of light as smoke; in Figure 11c, the smoke cloud and the background are correctly separated. Spatial attention makes the model focus on the correct important areas.
  • The YOLO11s-MSCA model can more accurately distinguish fire objects from bright fire-like objects and classify them correctly, as in Figure 11a,c,d. In Figure 11f, the improved model correctly distinguishes smoke targets at night. This advantage comes from its ability to understand context: smoke accompanies fire, and the two are essentially present at the same time.
The remaining limitations of the improved model’s detection results are as follows:
  • The introduced contextual spatial attention module focuses on contextual information, which also causes some loss of accuracy in our results. For example, the model does not detect small white smoke objects against a white wall background; for a small fire object against a smoke background, as in Figure 12g, it identifies the highlight as fire; and against a grayish-white background, as in Figure 12e, it does not detect the smoke object.
  • The algorithm does not correctly recognize occluded objects; as in Figure 12e,f, the occluded targets are not recognized.
  • Although the improved algorithm has enhanced the detection performance for small targets, challenges remain, as illustrated in Figure 12c,d. This limitation is primarily attributed to the complexity of the environment.
  • The YOLO11s-MSCA model still misidentifies objects with similar features, e.g., distant smoke that resembles clouds (Figure 12a), scattered light that resembles smoke (Figure 12b), and red clouds at sunset (Figure 12i).
  • The complexity of the environment and the varying appearance of smoke and fire in different environments are the most important sources of error. This variability in appearance arises from differences in combustion material, changes in lighting, day–night transitions, distance from the video sensor, and so on; handling it remains an important direction for our future research. As shown in Figure 12h, the recognition accuracy of smoke and fire objects decreases at night.

3.3.3. Comparison of Model Generalization Experiments

In order to verify that the model has good generalization ability, robustness, and adaptability to practical applications, the YOLO11s-MSCA model was systematically validated on two additional representative datasets: the Fire and Smoke Dataset and CBM-Fire. Detailed training parameter settings are described in Section 3.2 of this article.
As shown by the bold values in Table 7, YOLOv9 was the most accurate model on the Fire and Smoke Dataset, but YOLO11 and its improved version remained the most balanced models across all metrics. The overall precision, recall, and mAP of YOLO11s-MSCA are higher than those of YOLO11s by 0.3%, 0.8%, and 0.6%, respectively, at the cost of an increase of 1.2 in GFLOPS and a decrease of 33 FPS, preserving the balance of improving accuracy for fire and smoke object detection without noticeably affecting other aspects of performance. Our improvements are therefore effective and generalizable, and further research on YOLO11 is an important direction for our fire and smoke object detection work. Compared with RT-DETR, which successfully introduced the Transformer architecture into real-time object detection, YOLO11s-MSCA has an overall accuracy that is 0.5% lower, but all other metrics are better than those of RT-DETR. Notably, in terms of the balance between accuracy, speed, and model complexity, YOLO11s-MSCA is the best, making it more suitable for the fire and smoke object detection domain.
The overall performance of the YOLO11s and YOLO11s-MSCA models is understood through metrics, and the details of the model’s performance on the Fire and Smoke Dataset are as follows:
  • YOLO11s and YOLO11s-MSCA still perform well in fire and smoke object detection tasks, accurately classifying and localizing most objects in an image. In Figure 13i, fire objects are correctly distinguished from lights; in Figure 13j, smoke objects are correctly localized, with the trees behind the smoke excluded.
  • On this dataset, YOLO11s-MSCA still attends to small objects more effectively than YOLO11s and classifies them correctly; for example, it detects the candles in Figure 13a,i and distinguishes the small flame objects from the lights in the background in Figure 13b,f,k. The introduced Multi-Scale Convolutional Attention mechanism still focuses on the multi-scale information of objects in the image.
  • The YOLO11s-MSCA model recognizes the sun in clouds, as shown in Figure 13d, and correctly distinguishes smoke objects in snowy landscapes, as in Figure 13c,e, as well as thin smoke in Figure 13h. These are not small objects; their correct distinction comes from the model’s understanding of the object and its surroundings. The Multi-Scale Convolutional Attention mechanism introduced in this research can still focus on the object’s contextual information.
  • The model still fails to recognize occluded objects, as in Figure 13g. Similarly, the introduced MSCA module causes some small objects that resemble the background to be misclassified, and the most important errors still originate from the variability of the environment and the diversity of fire and smoke.
YOLO11s is validated to have high precision, high recall, high detection speed, and low model complexity in fire and smoke object detection tasks; while not the best in every metric, it ranks near the top in all of them and offers the best overall balance. At the same time, our improved model YOLO11s-MSCA is effective: the introduced module improves the model’s ability to recognize targets of different sizes and to understand contextual information, so the overall accuracy is significantly improved. The model identifies and localizes small fire and smoke objects more accurately and can recognize thin smoke around fires, smoke objects in snow, and similar cases. The problems summarized and the improvements observed for the model on the D-Fire dataset also hold on the second dataset.
The model’s generalization capability on unseen data has become a critical metric for evaluating its practical value. Building upon the CBM-Fire dataset, this study continues to investigate the generalization performance of YOLO11s-MSCA through comparative experiments, aiming to explore the model’s adaptability in fire and smoke object detection scenarios.
As presented in Table 8, YOLOv9s achieves an overall accuracy of 82.3%, which is 2.7% lower than that of YOLO11s-MSCA (85.0%). While the difference in FPS between the two models is 23.9, the remaining evaluation metrics exhibit more pronounced disparities. YOLOv10s underperforms YOLO11s-MSCA across all metrics, with an overall accuracy of just 77.7%, and particularly substantial gaps in recall and mAP50, registering differences of 11.4% and 11.1%, respectively. The RT-DETR model demonstrates the weakest performance, with an accuracy of only 57.4%, and significantly inferior results across all other metrics. Notably, its GFLOPS is 4.6 times higher than that of YOLO11s-MSCA, indicating considerably greater computational cost.
Among all evaluated models, the best-performing metrics (highlighted in bold) are consistently observed in either YOLO11s or YOLO11s-MSCA. YOLO11s-MSCA, in particular, achieves the highest overall accuracy while maintaining a balanced trade-off among other key indicators such as recall, mAP50-95, GFLOPS, and FPS. Importantly, the model attains a 3% gain in accuracy at the cost of only a 1.2-point increase in GFLOPS, underscoring the efficiency and effectiveness of the improvements. These enhancements prove to be robust across all three datasets, further affirming the generalizability of YOLO11s-MSCA in fire and smoke object detection tasks.
Additionally, as illustrated in Figure 14, the model’s behavior under the CBM-Fire dataset—augmented with image rotations—is analyzed in detail. The performance of YOLO11s-MSCA is summarized as follows:
  • Enhanced Confidence and Localization Accuracy: The YOLO11s-MSCA model demonstrates superior confidence scores and more precise localization capabilities. As shown in Figure 14a–c (highlighted with yellow arrows), the model accurately detects fire and smoke targets. Notably, in Figure 14b, it correctly identifies a fire instance that is absent in the ground truth annotations, and in Figure 14c, it effectively localizes a broader region of smoke. These results confirm the efficacy of the introduced attention mechanism in improving target recognition.
  • Improved Detection of Small Objects: YOLO11s-MSCA exhibits enhanced sensitivity to small-scale targets, as demonstrated in Figure 14a,d, where it successfully detects small fire sources (again indicated by yellow arrows). In contrast to YOLO11s, which misclassifies a firefighter’s red uniform as fire in Figure 14a, YOLO11s-MSCA correctly identifies the actual fire target. This further supports the effectiveness of the attention mechanism in directing focus towards semantically relevant regions.
  • Robustness Across Diverse Environments: Comparative experiments across three distinct datasets—each differing in source and environmental conditions—demonstrate that YOLO11s-MSCA maintains robust performance in the face of varying complexity and scene dynamics.
  • Limitations in Background-Dominant Scenes: As shown in Figure 14c, although YOLO11s-MSCA captures a larger smoke area, it fails to identify smoke against a sky background. Similarly, in Figure 14e, where the entire image is engulfed in smoke, the model is unable to distinguish the smoke target. This indicates that while the MSCA module enhances environmental understanding, it may also reduce the model’s discriminability when the target blends seamlessly with the background.
  • Challenges with Occlusion and Visual Similarity: In certain scenarios, such as in Figure 14f, the model struggles with occluded targets. However, the primary cause of detection errors remains the high variability of the environment. For instance, in Figure 14f, the fire color closely resembles the yellowish reflection on nearby leaves, leading to misclassification.
Experimental results on the D-Fire, Fire and Smoke, and CBM-Fire datasets show that the improved YOLO11s-MSCA model delivers consistently strong performance in flame and smoke detection. The model maintains high detection accuracy at high inference speed, striking a good balance between the two. Compared with mainstream models such as YOLOv9, YOLOv10, and RT-DETR, YOLO11s-MSCA holds clear advantages on many indicators. By introducing the Multi-Scale Convolutional Attention (MSCA) mechanism, the model significantly improves small-target detection and shows stronger robustness in complex environments. Although limitations remain in extreme scenarios such as full-smoke backgrounds or severe occlusion, the model performs consistently well across multiple datasets, confirming its generalization ability and practical application potential.
Datasets produced by different teams differ in construction, and these differences affect model training and evaluation results. For example, the review in [64] highlights the strengths of the D-Fire dataset, such as its large scale, accurate labeling, diverse environments, and broad applicability, while also noting limitations such as uneven data distribution and a single resolution.
First, the amount of training data leads to marked differences between CNN-based models and Transformer-based models; as shown in Table 8, the results of RT-DETR diverge substantially from those of the YOLO series. When the data volumes are similar, the Fire and Smoke Dataset, which covers more resolutions and scenes, yields a recall 3.4% higher than D-Fire, at the cost of a 1.2% drop in precision. Moreover, the richer the scenes, the more accurate RT-DETR becomes, as shown in Table 6 and Table 7. Meanwhile, the recall, mAP50, and mAP50-95 obtained on the Fire and Smoke Dataset are clearly better than those on D-Fire, mainly because the Fire and Smoke Dataset groups interference items into a dedicated class rather than leaving them as background, as D-Fire does. These results show that detection stability can be significantly improved by explicitly modeling easily confused interference types (such as smoke-like clouds and false fire points) as their own class.

4. Discussion

The YOLO family has attracted wide attention for its balance between speed and accuracy and has evolved continuously from YOLOv1 to the latest YOLO11 [42]. Each generation has focused on minimizing latency while maximizing detection accuracy, and since YOLOv6, on further improving practicality so that detectors run well on resource-constrained devices [58]. Considering the accuracy, speed, and deployment requirements of fire and smoke detection, many researchers have built on the YOLO family [31,54,65]. In this paper, we compare the classic models of the YOLO family to verify that the latest YOLO11 remains superior for fire and smoke detection, that is, it offers the best balance of accuracy, speed, and practicality for this task. To meet specific requirements such as deployment environments, hardware constraints, and performance goals, YOLO11 models can be exported to ONNX or other formats; an illustrative export call is sketched below. YOLO11 is therefore a strong option for fire and smoke detection.
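For illustration, the following minimal sketch shows what such an export could look like with the Ultralytics Python API; the weight file name is a hypothetical placeholder for a trained fire and smoke detector.

```python
# Minimal ONNX export sketch, assuming the Ultralytics Python API.
from ultralytics import YOLO

model = YOLO("yolo11s-msca.pt")          # hypothetical trained weights
onnx_path = model.export(format="onnx")  # writes an .onnx file and returns its path
print("Exported to:", onnx_path)
```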
Given the characteristics of fire and smoke detection, this research prioritizes accuracy while ensuring that speed (real-time capability) is not reduced and that model complexity (deployment practicality) is not increased when enhancing the YOLO11 model. To handle the scale variability of fire and smoke, the MSCA mechanism is introduced so that the network attends to multi-scale contextual information in the image; a simplified sketch of such a module is given below. The loss curves and accuracy results show that the enhanced algorithm improves precision, recall, and confidence scores while converging faster during training. The overall indicators show that YOLO11s-MSCA improves accuracy without significantly increasing model complexity. Compared with other YOLO series networks, the improved network performs best, and the accuracy on smoke objects is higher than on fire objects; YOLO11s-MSCA is therefore accurate and sensitive in detecting early smoke.
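For readers unfamiliar with the mechanism, the sketch below illustrates the multi-scale convolutional attention idea (a depth-wise convolution, parallel strip convolutions at several kernel sizes, and a 1 × 1 convolution whose output reweights the input features), following the SegNeXt-style design [49]. The exact kernel sizes and the way the module is inserted into YOLO11 are given by Figures 3 and 6 and may differ from this simplified version.

```python
# Simplified PyTorch sketch of a multi-scale convolutional attention block:
# depth-wise convolution, multi-scale strip (banded) convolutions, and a
# 1x1 convolution used to reweight the input features. Kernel sizes follow
# the SegNeXt-style design [49]; the configuration used in YOLO11s-MSCA
# may differ from this illustrative version.
import torch
import torch.nn as nn

class MSCA(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Local context: 5x5 depth-wise convolution.
        self.dw5 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        # Multi-scale context: pairs of strip depth-wise convolutions.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),
            )
            for k in (7, 11, 21)
        ])
        # Channel mixing: 1x1 convolution producing the attention map.
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.dw5(x)
        attn = attn + sum(branch(attn) for branch in self.branches)
        attn = self.pw(attn)
        return attn * x  # reweight the input features

# Quick shape check on a dummy feature map.
feat = torch.randn(1, 256, 40, 40)
print(MSCA(256)(feat).shape)  # torch.Size([1, 256, 40, 40])
```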
As observed in the visualizations, the improved model attends more effectively to objects of different sizes in the image, especially small targets, understands context better, and can correctly separate from the background some targets that the baseline model misses. These improvements and conclusions also hold on the Fire and Smoke and CBM-Fire datasets, indicating that they generalize and remain robust to changes in complex environments. Nevertheless, recognizing occluded targets and understanding complex backgrounds are still not fully solved. In future work, we plan to increase the complexity of the datasets, introduce additional modules, or reduce model parameters to enable lightweight deployment, so that the model can provide practical fire and smoke detection services on platforms such as unmanned aerial vehicle (UAV) and camera devices and reduce the time and economic cost of data transmission.
Combining the findings across datasets, we conclude that the dataset construction strategy (data volume, scene diversity, and classification scheme) significantly affects model performance, and that a reasonable class design for interference items plays a key role in improving detection stability. Clarifying the core objectives of the detection task and identifying its intended deployment scenarios are prerequisites for formulating an effective dataset construction strategy.

Author Contributions

Conceptualization, Y.L. (Yuxuan Li); methodology, Y.L. (Yuxuan Li) and L.N.; software, Y.L. (Yuxuan Li); validation, Y.L. (Yuxuan Li), H.F., F.Z. and L.W.; formal analysis, Y.L. (Yuxuan Li); investigation, Y.L. (Yuxuan Li); resources, Y.L. (Yuxuan Li); data curation, Y.L. (Yuxuan Li); writing—original draft preparation, Y.L. (Yuxuan Li); writing—review and editing, Y.L. (Yuxuan Li), H.F., F.Z., L.N., Y.L. (Yun Liu), N.C., Q.D. and L.W.; visualization, Y.L. (Yuxuan Li); supervision, Y.L. (Yuxuan Li), F.Z. and L.W.; project administration, Q.D., F.Z. and L.W.; funding acquisition, Q.D., F.Z. and L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 32160369 and Grant 31860182, by Yunnan Provincial Key Science and Technology Program under Grant 202202AD080010, and by the Ten Thousand Talent Plans for Young Top-Notch Talents of Yunnan Province under Grant YNWR-QNBJ-2019-026.

Institutional Review Board Statement

The study in this paper did not involve humans or animals.

Informed Consent Statement

Not applicable.

Data Availability Statement

The full implementation code, training model, and dataset download are publicly available at https://github.com/AbusAlex/YOLO11s-MSCA (accessed on 15 April 2025).

Conflicts of Interest

Author Fangrong Zhou was employed by the company Joint Laboratory of Power Remote Sensing Technology (Electric Power Research Institute, Yunnan Power Grid Company Ltd., China Southern Power Grid). The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Morales-Hidalgo, D.; Oswalt, S.N.; Somanathan, E. Status and Trends in Global Primary Forest, Protected Areas, and Areas Designated for Conservation of Biodiversity from the Global Forest Resources Assessment 2015. For. Ecol. Manag. 2015, 352, 68–77. [Google Scholar] [CrossRef]
  2. Sahoo, G.; Wani, A.; Rout, S.; Sharma, A.; Prusty, A.K. Impact and Contribution of Forest in Mitigating Global Climate Change. Des. Eng. 2021, 4, 667–682. [Google Scholar]
  3. Lu, N. Dark Convolutional Neural Network for Forest Smoke Detection and Localization Based on Single Image. Soft Comput. 2022, 26, 8647–8659. [Google Scholar] [CrossRef]
  4. Tien Bui, D.; Le, H.V.; Hoang, N.-D. GIS-Based Spatial Prediction of Tropical Forest Fire Danger Using a New Hybrid Machine Learning Method. Ecol. Inform. 2018, 48, 104–116. [Google Scholar] [CrossRef]
  5. Pourmohamad, Y.; Abatzoglou, J.T.; Fleishman, E.; Short, K.C.; Shuman, J.; AghaKouchak, A.; Williamson, M.; Seydi, S.T.; Sadegh, M. Inference of Wildfire Causes From Their Physical, Biological, Social and Management Attributes. Earth’s Future 2025, 13, e2024EF005187. [Google Scholar] [CrossRef]
  6. Wah, W.; Gelaw, A.; Glass, D.C.; Sim, M.R.; Hoy, R.F.; Berecki-Gisolf, J.; Walker-Bone, K. Systematic Review of Impacts of Occupational Exposure to Wildfire Smoke on Respiratory Function, Symptoms, Measures and Diseases. Int. J. Hyg. Environ. Health 2025, 263, 114463. [Google Scholar] [CrossRef]
  7. Zeng, B.; Zhou, Z.; Zhou, Y.; He, D.; Liao, Z.; Jin, Z.; Zhou, Y.; Yi, K.; Xie, Y.; Zhang, W. An Insulator Target Detection Algorithm Based on Improved YOLOv5. Sci. Rep. 2025, 15, 496. [Google Scholar] [CrossRef]
  8. Zhang, F.; Zhao, P.; Xu, S.; Wu, Y.; Yang, X.; Zhang, Y. Integrating Multiple Factors to Optimize Watchtower Deployment for Wildfire Detection. Sci. Total Environ. 2020, 737, 139561. [Google Scholar] [CrossRef] [PubMed]
  9. Hosseini, A.; Hashemzadeh, M.; Farajzadeh, N. UFS-Net: A Unified Flame and Smoke Detection Method for Early Detection of Fire in Video Surveillance Applications Using CNNs. J. Comput. Sci. 2022, 61, 101638. [Google Scholar] [CrossRef]
  10. Shoshe, M.A.M.S.; Rahman, M.A. Improvement of Heat and Smoke Confinement Using Air Curtains in Informal Shopping Malls. J. Build. Eng. 2022, 46, 103676. [Google Scholar] [CrossRef]
  11. Kuznetsov, G.V.; Volkov, R.S.; Sviridenko, A.S.; Strizhak, P.A. Fire Detection and Suppression in Rooms with Different Geometries. J. Build. Eng. 2024, 90, 109427. [Google Scholar] [CrossRef]
  12. Yar, H.; Khan, Z.A.; Rida, I.; Ullah, W.; Kim, M.J.; Baik, S.W. An Efficient Deep Learning Architecture for Effective Fire Detection in Smart Surveillance. Image Vis. Comput. 2024, 145, 104989. [Google Scholar] [CrossRef]
  13. Khan, Z.A.; Ullah, F.U.M.; Yar, H.; Ullah, W.; Khan, N.; Kim, M.J.; Baik, S.W. Optimized Cross-Module Attention Network and Medium-Scale Dataset for Effective Fire Detection. Pattern Recognit. 2025, 161, 111273. [Google Scholar] [CrossRef]
  14. Gaur, A.; Singh, A.; Kumar, A.; Kulkarni, K.S.; Lala, S.; Kapoor, K.; Srivastava, V.; Kumar, A.; Mukhopadhyay, S.C. Fire Sensing Technologies: A Review. IEEE Sens. J. 2019, 19, 3191–3202. [Google Scholar] [CrossRef]
  15. Cheng, G.; Chen, X.; Wang, C.; Li, X.; Xian, B.; Yu, H. Visual Fire Detection Using Deep Learning: A Survey. Neurocomputing 2024, 596, 127975. [Google Scholar] [CrossRef]
  16. Huang, P.; Chen, M.; Chen, K.; Zhang, H.; Yu, L.; Liu, C. A Combined Real-Time Intelligent Fire Detection and Forecasting Approach through Cameras Based on Computer Vision Method. Process Saf. Environ. Prot. 2022, 164, 629–638. [Google Scholar] [CrossRef]
  17. Boroujeni, S.P.H.; Razi, A.; Khoshdel, S.; Afghah, F.; Coen, J.L.; O’Neill, L.; Fule, P.; Watts, A.; Kokolakis, N.-M.T.; Vamvoudakis, K.G. A Comprehensive Survey of Research towards AI-Enabled Unmanned Aerial Systems in Pre-, Active-, and Post-Wildfire Management. Inf. Fusion. 2024, 108, 102369. [Google Scholar] [CrossRef]
  18. Han, Z.; Tian, Y.; Zheng, C.; Zhao, F. Forest Fire Smoke Detection Based on Multiple Color Spaces Deep Feature Fusion. Forests 2024, 15, 689. [Google Scholar] [CrossRef]
  19. Özel, B.; Alam, M.S.; Khan, M.U. Review of Modern Forest Fire Detection Techniques: Innovations in Image Processing and Deep Learning. Information 2024, 15, 538. [Google Scholar] [CrossRef]
  20. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  21. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  22. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc., Red Hook, NY, USA; 2015; Volume 28. [Google Scholar]
  23. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  24. Ragab, M.G.; Abdulkadir, S.J.; Muneer, A.; Alqushaibi, A.; Sumiea, E.H.; Qureshi, R.; Al-Selwi, S.M.; Alhussian, H. A Comprehensive Systematic Review of YOLO for Medical Object Detection (2018 to 2023). IEEE Access 2024, 12, 57815–57836. [Google Scholar] [CrossRef]
  25. Zhao, Z.-Q.; Zheng, P.; Xu, S.-T.; Wu, X. Object Detection With Deep Learning: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232. [Google Scholar] [CrossRef] [PubMed]
  26. Mamadaliev, D.; Touko, P.L.M.; Kim, J.-H.; Kim, S.-C. ESFD-YOLOv8n: Early Smoke and Fire Detection Method Based on an Improved YOLOv8n Model. Fire 2024, 7, 303. [Google Scholar] [CrossRef]
  27. Wang, D.; Qian, Y.; Lu, J.; Wang, P.; Yang, D.; Yan, T. Ea-Yolo: Efficient Extraction and Aggregation Mechanism of YOLO for Fire Detection. Multimed. Syst. 2024, 30, 287. [Google Scholar] [CrossRef]
  28. Jiao, Z.; Zhang, Y.; Xin, J.; Mu, L.; Yi, Y.; Liu, H.; Liu, D. A Deep Learning Based Forest Fire Detection Approach Using UAV and YOLOv3. In Proceedings of the 2019 1st International Conference on Industrial Artificial Intelligence (IAI), Shenyang, China, 23–27 July 2019; pp. 1–5. [Google Scholar]
  29. Wu, H.; Hu, Y.; Wang, W.; Mei, X.; Xian, J. Ship Fire Detection Based on an Improved YOLO Algorithm with a Lightweight Convolutional Neural Network Model. Sensors 2022, 22, 7420. [Google Scholar] [CrossRef]
  30. Xue, Q.; Lin, H.; Wang, F. FCDM: An Improved Forest Fire Classification and Detection Model Based on YOLOv5. Forests 2022, 13, 2129. [Google Scholar] [CrossRef]
  31. Chen, X.; Xue, Y.; Hou, Q.; Fu, Y.; Zhu, Y. RepVGG-YOLOv7: A Modified YOLOv7 for Fire Smoke Detection. Fire 2023, 6, 383. [Google Scholar] [CrossRef]
  32. Zhang, T.; Wang, F.; Wang, W.; Zhao, Q.; Ning, W.; Wu, H. Research on Fire Smoke Detection Algorithm Based on Improved YOLOv8. IEEE Access 2024, 12, 117354–117362. [Google Scholar] [CrossRef]
  33. Zhao, C.; Zhao, L.; Zhang, K.; Ren, Y.; Chen, H.; Sheng, Y. Smoke and Fire-You Only Look Once: A Lightweight Deep Learning Model for Video Smoke and Flame Detection in Natural Scenes. Fire 2025, 8, 104. [Google Scholar] [CrossRef]
  34. Liu, H.; Zhang, F.; Xu, Y.; Wang, J.; Lu, H.; Wei, W.; Zhu, J. TFNet: Transformer-Based Multi-Scale Feature Fusion Forest Fire Image Detection Network. Fire 2025, 8, 59. [Google Scholar] [CrossRef]
  35. Safarov, F.; Muksimova, S.; Kamoliddin, M.; Cho, Y.I. Fire and Smoke Detection in Complex Environments. Fire 2024, 7, 389. [Google Scholar] [CrossRef]
  36. Alkhammash, E.H. Multi-Classification Using YOLOv11 and Hybrid YOLO11n-MobileNet Models: A Fire Classes Case Study. Fire 2025, 8, 17. [Google Scholar] [CrossRef]
  37. de Venâncio, P.V.A.B.; Lisboa, A.C.; Barbosa, A.V. An Automatic Fire Detection System Based on Deep Convolutional Neural Networks for Low-Power, Resource-Constrained Devices. Neural Comput. Appl. 2022, 34, 15349–15368. [Google Scholar] [CrossRef]
  38. Catargiu, C.; Cleju, N.; Ciocoiu, I.B. A Comparative Performance Evaluation of YOLO-Type Detectors on a New Open Fire and Smoke Dataset. Sensors 2024, 24, 5597. [Google Scholar] [CrossRef] [PubMed]
  39. Geng, X.; Han, X.; Cao, X.; Su, Y.; Shu, D. YOLOV9-CBM: An Improved Fire Detection Algorithm Based on YOLOV9. IEEE Access 2025, 13, 19612–19623. [Google Scholar] [CrossRef]
  40. Terven, J.; Córdova-Esparza, D.-M.; Romero-González, J.-A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  41. Lin, T.-Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  42. Ali, M.L.; Zhang, Z. The YOLO Framework: A Comprehensive Review of Evolution, Applications, and Benchmarks in Object Detection. Computers 2024, 13, 336. [Google Scholar] [CrossRef]
  43. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  44. Elfwing, S.; Uchibe, E.; Doya, K. Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning. Neural Netw. 2018, 107, 3–11. [Google Scholar] [CrossRef]
  45. Liu, W.; Liao, S.; Ren, W.; Hu, W.; Yu, Y. High-Level Semantic Feature Detection: A New Perspective for Pedestrian Detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5187–5196. [Google Scholar]
  46. Zhang, H.; Zu, K.; Lu, J.; Zou, Y.; Meng, D. EPSANet: An Efficient Pyramid Split Attention Block on Convolutional Neural Network. arXiv 2021, arXiv:2105.14447. [Google Scholar]
  47. Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. PANet: Few-Shot Image Semantic Segmentation With Prototype Alignment. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9197–9206. [Google Scholar]
  48. Zhang, D. A Yolo-Based Approach for Fire and Smoke Detection in IoT Surveillance Systems. Available online: https://openurl.ebsco.com/contentitem/doi:10.14569%2Fijacsa.2024.0150109?sid=ebsco:plink:crawler&id=ebsco:doi:10.14569%2Fijacsa.2024.0150109 (accessed on 6 March 2025).
  49. Guo, M.-H.; Lu, C.-Z.; Hou, Q.; Liu, Z.; Cheng, M.-M.; Hu, S. SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation. Adv. Neural Inf. Process. Syst. 2022, 35, 1140–1156. [Google Scholar]
  50. Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.-S. SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5659–5667. [Google Scholar]
  51. Han, Q.; Fan, Z.; Dai, Q.; Sun, L.; Cheng, M.-M.; Liu, J.; Wang, J. On the Connection between Local Attention and Dynamic Depth-Wise Convolution. arXiv 2022, arXiv:2106.04263. [Google Scholar]
  52. Peng, C.; Zhang, X.; Yu, G.; Luo, G.; Sun, J. Large Kernel Matters—Improve Semantic Segmentation by Global Convolutional Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4353–4361. [Google Scholar]
  53. Lin, M.; Chen, Q.; Yan, S. Network In Network. arXiv 2014, arXiv:1312.4400. [Google Scholar]
  54. Pan, W.; Wang, X.; Huan, W. EFA-YOLO: An Efficient Feature Attention Model for Fire and Flame Detection. arXiv 2024, arXiv:2409.12635. [Google Scholar]
  55. Jiang, K.; Xie, T.; Yan, R.; Wen, X.; Li, D.; Jiang, H.; Jiang, N.; Feng, L.; Duan, X.; Wang, J. An Attention Mechanism-Improved YOLOv7 Object Detection Algorithm for Hemp Duck Count Estimation. Agriculture 2022, 12, 1659. [Google Scholar] [CrossRef]
  56. Cao, L.; Shen, Z.; Xu, S. Efficient Forest Fire Detection Based on an Improved YOLO Model. Vis. Intell. 2024, 2, 20. [Google Scholar] [CrossRef]
  57. Jia, X.; Tong, Y.; Qiao, H.; Li, M.; Tong, J.; Liang, B. Fast and Accurate Object Detector for Autonomous Driving Based on Improved YOLOv5. Sci. Rep. 2023, 13, 9711. [Google Scholar] [CrossRef]
  58. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  59. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  60. Al Mudawi, N.; Qureshi, A.M.; Abdelhaq, M.; Alshahrani, A.; Alazeb, A.; Alonazi, M.; Algarni, A. Vehicle Detection and Classification via YOLOv8 and Deep Belief Network over Aerial Image Sequences. Sustainability 2023, 15, 14597. [Google Scholar] [CrossRef]
  61. Wang, C.-Y.; Yeh, I.-H.; Mark Liao, H.-Y. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Computer Vision—ECCV 2024, Proceedings of the 18th European Conference, Milan, Italy, 29 September–4 October 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2025; pp. 1–21. [Google Scholar]
  62. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  63. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-Time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  64. Boroujeni, S.P.H.; Mehrabi, N.; Afghah, F.; McGrath, C.P.; Bhatkar, D.; Biradar, M.A.; Razi, A. Fire and Smoke Datasets in 20 Years: An In-Depth Review. arXiv 2025, arXiv:2503.14552. [Google Scholar]
  65. Saydirasulovich, S.N.; Mukhiddinov, M.; Djuraev, O.; Abdusalomov, A.; Cho, Y.-I. An Improved Wildfire Smoke Detection Based on YOLOv8 and UAV Images. Sensors 2023, 23, 8374. [Google Scholar] [CrossRef]
Figure 1. Examples from the D-Fire dataset: (a) top–down view, (b) high-brightness non-fire target, (c) foggy weather, (d) nighttime fire and car headlights, (e) nighttime fire, (f) bright morning sun, (g) lens insect, and (h) lens water droplets.
Figure 2. Examples from the Fire and Smoke Dataset ((a) low-resolution highlighting target; (b) airport fire; (c) drone platform image; (d) high-resolution fire at night; (e) near fire; (f) cloudy day smoke target; (g) sea boat fire; and (h) indoor fire).
Figure 3. YOLO11s-MSCA framework ((a) YOLO11 backbone; (b) YOLO11 neck; (c) YOLO11 head; and (d) MSCA module).
Figure 4. C3k2 module.
Figure 5. C2PSA module.
Figure 6. MSCA module (where DWConv represents depth-wise convolution, and k × k indicates the convolution kernel size).
Figure 7. YOLO series baseline model training evaluation metrics variation.
Figure 8. YOLO model training loss variation.
Figure 9. RT-DETR model training loss variation.
Figure 10. Comparison of training evaluation metrics between YOLO11s-MSCA and classical models.
As can be seen from Table 6, the overall accuracy of the improved algorithm increases by 2.6%; the gain comes mainly from smoke recognition, which improves by 2.8%, together with fire recognition, which improves by 2.5%, suggesting that the MSCA module directs more attention to smoke objects, while the other indicators change only slightly. The improved overall precision reaches 79.7% with a recall of 69.8%, indicating few false and missed detections. With an mAP50 of 77.9%, the model retains good detection ability in complex scenes, and with an mAP50-95 of 45.8%, it maintains respectable accuracy over a wider IoU (Intersection over Union) threshold range [54]. The GFLOPS of the improved model increase by only 1.2, mainly because the introduced module uses depth-wise convolution, multi-scale strip (banded) convolution, and 1 × 1 convolution, which are computationally efficient with few parameters. At 222.2 FPS, the improved model does not significantly sacrifice real-time performance or increase model complexity while improving accuracy.
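Since mAP50 and mAP50-95 are defined over IoU thresholds (0.5, and 0.50 to 0.95 in steps of 0.05, respectively), the short explanatory sketch below shows the underlying IoU computation for two axis-aligned boxes; it is illustrative only and not taken from the released implementation.

```python
# Explanatory IoU helper for boxes in (x1, y1, x2, y2) format.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle.
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```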
Figure 11. Analysis of typical examples of visual advantages in YOLO11s-MSCA (D-Fire): (a) accurately locate the fire and lights at night; (b) accurately identify small targets; (c) accurately locate the smoke target and distinguish between light and fire; (d) accurate sunset recognition; (e) accurate distant target recognition; (f) accurate smoke detection near flames at night.
Figure 12. Analysis of typical examples of visual limitations in YOLO11s-MSCA (D-Fire): (a) white smoke on sky background; (b) scattered light resembling smoke in color; (c) blurred small objects at distance; (d) tiny objects; (e) extensive smoke coverage and partially obscured fires; (f) smoke objects partially obscured by electrical wires; (g) highlighted regions around flames; (h) nighttime thin smoke and small fires; (i) sunset at the horizon’s edge.
Figure 13. Analysis of typical examples of visual effects in YOLO11s-MSCA (Fire and Smoke Dataset): (a) candles and small fires; (b) small fire objects; (c) robust snow-smoke discrimination; (d) correctly identify the sun; (e) smoke recognition under snow interference; (f) more accurate positioning; (g) occluded target recognition inaccuracy; (h) near-field thin smoke; (i) light sources in fire environments; (j) precision smoke target localization; (k) small-target fire.
Figure 14. Analysis of typical examples of visual effects in YOLO11s-MSCA (CBM-Fire): (a) small target recognition; (b) precise localization of fire; (c) precise localization of smoke; (d) small-target fire; (e) large-area smoke dispersion; (f) occluded targets in complex environments.
Table 1. Summary of the relevant studies.
Approach | Key Contributions | Future Work
ESFD-YOLOv8 [26] | Develop a real-time system for early and accurate smoke and fire detection in complex environments. | Address data imbalance, misdetection in complex scenes, and real-time optimization.
EA-YOLO [27] | Solve issues of low precision, small and dense target detection, sample imbalance, and balancing real-time performance with accuracy in existing fire detection models. | Tackle challenges in small target detection, data scarcity, and complex background interference.
Using UAV and YOLOv3 [28] | Provide an efficient, low-cost UAV solution for resource-constrained field environments. | Improve detection of small fires in forests.
I-YOLOv4-tiny + S [29] | Overcome the accuracy limitations and computational complexity of traditional fire detection methods and deep learning models in complex maritime environments. | Future work will focus on optimizing model lightweighting, improving accuracy, incorporating contextual information, and expanding applications.
FCDM [30] | Address challenges in distinguishing forest fire types, improving detection accuracy, small target detection, real-time performance, and dataset limitations. | Future work will optimize model performance, integrate multiple detection models, design lightweight models, and expand datasets.
RepVGG-YOLOv7 [31] | Solve the problem of low smoke detection accuracy in complex backgrounds and small targets, while balancing model complexity with detection speed. | Future plans include expanding datasets and reducing model computation and parameters.
YOLOv8-FE [32] | Overcome the low detection accuracy of traditional methods in complex environments prone to false alarms and missed detections due to background interference. | Future work aims to enhance detection speed, meet higher real-time requirements, and reduce model complexity for resource-constrained devices.
SF-YOLO (Smoke and Fire-YOLO) [33] | Solve the insufficient detection accuracy of traditional methods in complex environments. | Future plans include incorporating environmental covariates into model training and exploring multispectral data for smoke and flame detection.
TFNet [34] | Effectively extract features of small targets and sparse smoke, maintaining high accuracy in complex backgrounds while reducing model parameters and computational complexity. | Future plans include expanding dataset scale and exploring more lightweight network architectures.
Table 2. The number of bounding boxes per category in the D-Fire dataset.
Category | Training Set | Test Set
Fire | 14,692 | 2869
Smoke | 11,865 | 2307
Table 3. CBM-Fire dataset partitioning ratio.
CBM-Fire | Training Set | Test Set | Validation Set
2000 | 1620 | 200 | 180
Table 4. Hyperparameter settings.
Parameters | Values
imgsz | 640
epochs | 100
batch | 20
close_mosaic | 10
optimizer | SGD
lr0 | 0.01
momentum | 0.937
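For reference, the settings in Table 4 map directly onto a training call such as the minimal sketch below, assuming the Ultralytics Python API; the dataset YAML name is a hypothetical placeholder for the D-Fire configuration used in this study.

```python
# Minimal training sketch mirroring the hyperparameters in Table 4.
from ultralytics import YOLO

model = YOLO("yolo11s.pt")  # start from the standard YOLO11s weights
model.train(
    data="d-fire.yaml",   # hypothetical dataset configuration file
    imgsz=640,
    epochs=100,
    batch=20,
    close_mosaic=10,      # disable mosaic augmentation for the last 10 epochs
    optimizer="SGD",
    lr0=0.01,
    momentum=0.937,
)
```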
Table 5. Comparison of YOLO11 with the classic YOLO method (D-Fire). GFLOPS and FPS are reported per model on the "All" rows.
Model | Class | Precision (%) | Recall (%) | mAP50 (%) | mAP50-95 (%) | GFLOPS | FPS
YOLOv5s | All | 76.0 | 69.6 | 76.7 | 44.8 | 23.8 | 344.8
YOLOv5s | Smoke | 80.3 | 76.3 | 82.6 | 51.8
YOLOv5s | Fire | 71.7 | 62.8 | 70.8 | 37.8
YOLOv6s | All | 75.4 | 69.9 | 76.0 | 44.3 | 44.0 | 333.3
YOLOv6s | Smoke | 80.0 | 77.3 | 82.8 | 52.0
YOLOv6s | Fire | 70.9 | 62.5 | 69.2 | 36.6
YOLOv7 | All | 69.0 | 73.1 | 75.2 | 39.9 | 103.2 | 117.6
YOLOv7 | Smoke | 75.3 | 74.7 | 78.4 | 43.7
YOLOv7 | Fire | 62.7 | 71.6 | 71.9 | 36.2
YOLOv8s | All | 77.2 | 70.4 | 77.5 | 45.6 | 28.4 | 322.6
YOLOv8s | Smoke | 81.4 | 77.6 | 83.7 | 52.9
YOLOv8s | Fire | 73.0 | 63.1 | 71.3 | 38.3
YOLOv9s | All | 77.7 | 72.1 | 78.3 | 45.9 | 26.7 | 250.0
YOLOv9s | Smoke | 82.5 | 79.3 | 84.4 | 53.3
YOLOv9s | Fire | 72.9 | 64.9 | 72.1 | 38.5
YOLOv10s | All | 76.5 | 69.3 | 76.3 | 44.8 | 24.4 | 312.5
YOLOv10s | Smoke | 79.8 | 77.6 | 82.8 | 52.2
YOLOv10s | Fire | 73.3 | 61.0 | 69.8 | 69.8
YOLO11n | All | 75.3 | 67.4 | 74.6 | 42.8 | 6.3 | 434.8
YOLO11n | Smoke | 79.9 | 74.2 | 80.5 | 49.7
YOLO11n | Fire | 70.8 | 60.5 | 68.6 | 35.9
YOLO11s | All | 77.1 | 70.1 | 77.4 | 45.5 | 21.3 | 312.5
YOLO11s | Smoke | 81.7 | 77.3 | 83.3 | 52.7
YOLO11s | Fire | 72.4 | 63.0 | 71.4 | 38.2
Table 6. Evaluation accuracy of YOLO11s-MSCA validation set (D-Fire). GFLOPS and FPS are reported per model on the "All" rows.
Model | Class | Precision (%) | Recall (%) | mAP50 (%) | mAP50-95 (%) | GFLOPS | FPS
RT-DETR | All | 71.5 | 63.2 | 70.2 | 39.5 | 103.4 | 68.0
RT-DETR | Smoke | 72.1 | 71.2 | 74.9 | 45.2
RT-DETR | Fire | 71.0 | 55.2 | 65.5 | 33.8
YOLO11s | All | 77.1 | 70.1 | 77.4 | 45.5 | 21.3 | 312.5
YOLO11s | Smoke | 81.7 | 77.3 | 83.3 | 52.7
YOLO11s | Fire | 72.4 | 63.0 | 71.4 | 38.2
YOLO11s-MSCA | All | 79.7 | 69.8 | 77.9 | 45.8 | 22.5 | 222.2
YOLO11s-MSCA | Smoke | 84.5 | 76.9 | 84.0 | 53.3
YOLO11s-MSCA | Fire | 74.9 | 62.6 | 71.8 | 38.2
Table 7. Comparison of evaluation metrics between YOLO11s-MSCA and classical YOLO methods (Fire and Smoke Dataset). GFLOPS and FPS are reported per model on the "All" rows.
Model | Class | Precision (%) | Recall (%) | mAP50 (%) | mAP50-95 (%) | GFLOPS | FPS
YOLOv9s | All | 78.7 | 72.8 | 79.2 | 50.4 | 26.7 | 51.3
YOLOv9s | Fire | 83.0 | 84.0 | 89.3 | 58.8
YOLOv9s | Other | 69.0 | 56.1 | 61.6 | 33.8
YOLOv9s | Smoke | 84.1 | 78.3 | 86.6 | 58.5
YOLOv10s | All | 77.5 | 72.2 | 77.7 | 49.0 | 24.5 | 131.6
YOLOv10s | Fire | 81.4 | 83.0 | 88.2 | 58.1
YOLOv10s | Other | 68.6 | 55.2 | 59.0 | 31.5
YOLOv10s | Smoke | 82.4 | 78.3 | 86.0 | 57.3
RT-DETR | All | 79.0 | 71.2 | 76.5 | 46.9 | 103.4 | 64.9
RT-DETR | Fire | 83.7 | 83.3 | 87.7 | 56.4
RT-DETR | Other | 69.0 | 54.0 | 58.2 | 30.4
RT-DETR | Smoke | 84.2 | 76.2 | 83.5 | 54.0
YOLO11s | All | 78.3 | 72.4 | 78.6 | 49.5 | 21.3 | 142.9
YOLO11s | Fire | 83.1 | 83.9 | 89.4 | 58.6
YOLO11s | Other | 67.8 | 54.5 | 31.9 | 31.9
YOLO11s | Smoke | 84.0 | 78.8 | 87.0 | 58.0
YOLO11s-MSCA | All | 78.5 | 73.2 | 79.2 | 50.3 | 22.5 | 109.9
YOLO11s-MSCA | Fire | 83.0 | 84.0 | 89.4 | 59.0
YOLO11s-MSCA | Other | 68.1 | 56.2 | 60.7 | 33.4
YOLO11s-MSCA | Smoke | 84.4 | 79.5 | 87.5 | 58.6
Table 8. Comparison of YOLO11 and classical object detection methods (CBM-Fire). GFLOPS and FPS are reported per model on the "All" rows.
Model | Class | Precision (%) | Recall (%) | mAP50 (%) | mAP50-95 (%) | GFLOPS | FPS
YOLOv9s | All | 82.3 | 64.4 | 76.5 | 46.9 | 26.7 | 175.4
YOLOv9s | Fire | 74.8 | 65.4 | 74.9 | 40.9
YOLOv9s | Smoke | 89.7 | 63.4 | 78.1 | 52.9
YOLOv10s | All | 77.7 | 59.0 | 70.8 | 46.2 | 24.4 | 131.6
YOLOv10s | Fire | 75.0 | 61.5 | 69.4 | 41.7
YOLOv10s | Smoke | 80.5 | 56.5 | 72.2 | 50.7
RT-DETR | All | 57.4 | 48.9 | 49.5 | 27.3 | 103.4 | 69.9
RT-DETR | Fire | 63.1 | 60.1 | 63.4 | 33.5
RT-DETR | Smoke | 51.8 | 37.7 | 35.7 | 21.0
YOLO11s | All | 82.0 | 71.3 | 80.6 | 53.6 | 21.3 | 212.8
YOLO11s | Fire | 78.8 | 75.6 | 81.2 | 81.2
YOLO11s | Smoke | 85.3 | 67.0 | 80.1 | 59.2
YOLO11s-MSCA | All | 85.0 | 70.4 | 81.9 | 52.0 | 22.5 | 151.5
YOLO11s-MSCA | Fire | 80.1 | 75.5 | 84.4 | 48.1
YOLO11s-MSCA | Smoke | 90.0 | 65.2 | 79.4 | 55.9
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
