At the data ingestion stage, all incoming images are routed to the data preparation module, where uniform standardization is applied in terms of resolution, format, and channel configuration, ensuring consistency across subsequent processing stages.
Once sanitized, the images proceed to the core image processing module, which comprises three key enhancement sub-tasks: denoising, dehazing, and deblurring. These tasks operate in an integrated fashion to correct typical degradation phenomena in fire imagery, including sensor noise, atmospheric scattering, and motion-induced blur. Notably, the denoising network employed in this framework leverages group sparse constraints and an attention-guided mechanism, enabling it to suppress background noise while preserving critical texture and structural details.
Beyond pre-processing, the system offers an image generation branch that utilizes the processed images to augment the training dataset, further boosting the model’s generalization performance in data-scarce scenarios. Finally, all processed and enhanced images are passed into the downstream fire detection model, where real-time and reliable identification of fire-related features is carried out.
2.1. Image Denoising
The literature on image denoising covers a variety of methods aimed at effectively reducing noise while preserving key image details (such as edges and textures). Li et al. [28] proposed a method that combines soft-thresholding processing with edge enhancement techniques. Their approach utilizes the Canny edge detection operator to identify image features and then pre-processes the image through a flat operator. Subsequently, the denoised edge image and the noisy image are subjected to a stationary wavelet transform, and the wavelet coefficients are combined at corresponding levels. This process facilitates soft-threshold denoising and effectively suppresses noise while maintaining the integrity of edges. Building on transform-based techniques, Chen and Zhou [29] proposed using the contourlet transform for image denoising. Their results showed that this method outperforms traditional wavelet-based methods in terms of signal-to-noise ratio (SNR), image enhancement factor (IEF), and visual quality, indicating a superior ability to capture image features and reduce noise. Besides transform-domain methods, filtering techniques such as bilateral filters have also been studied for their denoising performance [30]. That study explored how filter efficacy is affected by parameters such as window size, spatial variance, and radiometric variance, emphasizing the importance of parameter tuning in bilateral filtering to optimize noise reduction without compromising image details. The reduction of speckle noise, especially in medical imaging, has been addressed through advanced minimization techniques. Thapa et al. [31] proposed a multi-frame weighted nuclear norm minimization (MWNNM) method for spectral-domain optical coherence tomography (SD-OCT) images. Their method extends the weighted nuclear norm minimization framework to multiple frames, achieving more effective speckle-noise suppression than single-frame methods. In addition to traditional filtering and transform methods, recent developments include deep learning-based approaches. Majumdar [32] proposed a blind denoising autoencoder trained directly on noisy samples, marking a new breakthrough in autoencoder-based denoising. This method learns to denoise without prior knowledge of the noise characteristics, representing a significant advance in unsupervised denoising. Moreover, decomposition-based frameworks have been used in compressive sensing reconstruction research. Devi and Patil [33] evaluated various filtering techniques for microscope images, emphasizing the importance of adopting methods tailored to different imaging modalities. Yuan et al. [34] addressed sonar image denoising by analyzing noise characteristics and using a gamma-distribution-based Fields of Experts (FoE) model, demonstrating the effectiveness of combining statistical analysis with learned priors in complex noise scenarios.
In the image denoising task, especially when dealing with real-world noisy images, traditional degradation assumptions based on Gaussian distribution modeling often struggle to accurately capture the complex noise structure. To address this issue, this paper proposes a novel blind image denoising network that combines group sparsity constraints with an attention-guided transformer (AGT). This approach takes into account both structural priors and long-range dependency modeling capabilities, achieving excellent results.
The entire denoising model consists of two key components: a Group Sparse Denoising Network (GSDNet) and an Attention-Guided Transformer (AGT) module. The GSDNet employs a multi-scale CNN structure and incorporates group sparsity regularization to initially suppress background noise while retaining local texture information. The AGT module further exploits long-range contextual relationships through a carefully designed window attention mechanism, enabling detail restoration.
In the group sparsity constraint, the network encodes image feature blocks and constructs a group sparse structure in the feature channel dimension. This method leverages the coefficient sparsity of natural images in the frequency domain or spatial domain, effectively suppressing unstructured noise. The mathematical modeling is as follows:
\hat{x} = \arg\min_{x} \frac{1}{2}\left\| y - x \right\|_2^2 + \lambda \sum_{g=1}^{G} \left\| \Phi_g(x) \right\|_{2,1}

Here, y represents the noisy image, x̂ represents the denoised image, Φ_g(x) denotes the feature representation of the g-th channel group, λ is the sparsity constraint coefficient, and ‖·‖_{2,1} is the group-sparse (ℓ2,1) norm. This formula retains the local structure while forcing the feature blocks to be sparsely distributed within the group, effectively eliminating non-structural components.
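As an illustrative aid, the following minimal PyTorch sketch shows how an ℓ2,1 group-sparsity penalty of the form above could be computed over grouped feature channels; the function name, grouping scheme, and weight lam are assumptions for exposition, not the exact GSDNet configuration.

```python
import torch

def group_sparse_penalty(features: torch.Tensor, num_groups: int, lam: float = 0.01) -> torch.Tensor:
    """L2,1 group-sparsity penalty over channel groups of a feature map.

    features: (B, C, H, W) tensor; C must be divisible by num_groups.
    Each group's responses are collapsed with an L2 norm, and the resulting
    group magnitudes are summed (L1), encouraging whole groups toward zero.
    """
    b, c, h, w = features.shape
    grouped = features.view(b, num_groups, c // num_groups, h, w)
    group_norms = grouped.flatten(2).norm(p=2, dim=2)   # (B, num_groups)
    return lam * group_norms.sum(dim=1).mean()

# Example usage during training (names are placeholders):
# recon_loss = F.l1_loss(denoised, clean)
# loss = recon_loss + group_sparse_penalty(encoder_features, num_groups=8)
```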
Secondly, the attention-guided transformer module enhances the information flow ability through cross-window interaction. The overall network structure adopts the residual connection form and is composed of multiple AGT blocks stacked together. Each AGT block contains multi-head attention layers and a feedforward neural network (FFN). The attention mechanism can be expressed as follows:
\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left( \frac{QK^{\top}}{\sqrt{d_k}} \right) V

Here, Q, K, and V represent the input query, key, and value vectors, respectively, and d_k is the dimension of the key vector. In practical applications, the transformer module employs a sliding-window attention mechanism, thereby maintaining computational efficiency while capturing long-range dependencies.
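The scaled dot-product attention in the formula above can be sketched as follows; the window-partitioning comment describes one common way to realize sliding-window attention and is not necessarily the exact AGT implementation.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Standard scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.

    q, k, v: (..., seq_len, d_k) tensors.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return F.softmax(scores, dim=-1) @ v

# Window attention applies the same operation independently inside spatial
# windows; one common approach reshapes (B, H*W, C) feature tokens into
# (B * num_windows, win * win, C) before calling the function above.
```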
As shown in Figure 2, a network with a dynamic kernel fusion convolution structure is introduced. The core idea is to adaptively generate convolution kernel parameters at specific positions, strengthening spatial-context modeling and improving the network's representational ability in image denoising, super-resolution, and image enhancement tasks. This structure consists of three main sub-modules: a feature enhancement branch, a fusion coefficient generator, and a dynamic convolution kernel generator.
The input image X ∈ R^{H×W×C} is first passed through convolution to extract an initial feature map X_e. At each position (i, j), an independent coefficient-mapping network predicts the fusion coefficient F[i, j] for that position, which is used to weight K predefined convolution kernel bases W_1, W_2, …, W_K.
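A simplified sketch of this per-position kernel fusion is given below; the coefficient network, the number of bases K, and the softmax normalization are illustrative assumptions consistent with the description above, not the exact module design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicKernelFusionConv(nn.Module):
    """Position-adaptive convolution built from K fixed kernel bases.

    A small coefficient network predicts per-pixel fusion weights over the
    K bases; applying each basis kernel and blending the outputs per pixel
    is equivalent to convolving with a spatially varying fused kernel.
    """
    def __init__(self, channels: int, num_bases: int = 4, kernel_size: int = 3):
        super().__init__()
        self.bases = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2)
             for _ in range(num_bases)]
        )
        # Coefficient-mapping branch: one fusion weight per basis per position.
        self.coeff = nn.Conv2d(channels, num_bases, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = F.softmax(self.coeff(x), dim=1)                          # (B, K, H, W)
        responses = torch.stack([conv(x) for conv in self.bases], dim=1)   # (B, K, C, H, W)
        return (weights.unsqueeze(2) * responses).sum(dim=1)               # (B, C, H, W)
```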
In addition, this method introduces a residual enhancement module and a feature fusion strategy to avoid problems such as over-smoothing and texture loss, thereby improving the final image quality. By combining group sparsity constraints with the attention mechanism, this framework effectively overcomes the limitations of traditional methods in noise modeling and context perception, providing a feasible and efficient blind image denoising solution.
2.2. Image Blurring Modeling and Restoration
In actual fire and smoke scenarios, blurring occurs widely at all stages of image acquisition, caused by camera jitter, smoke obstruction, and image-quality degradation under low-illumination conditions, and it seriously affects the response efficiency and accuracy of subsequent fire-situation recognition and smoke-warning systems. To effectively restore image clarity and to enhance the perceptual representation of fire-region features, researchers have proposed various methods around single-image dehazing and deblurring. In dehazing, Sun et al. [35] proposed SADnet, a semi-supervised single-image dehazing network based on an attention mechanism, which effectively guides the model to learn dehazing features under insufficient supervision and enhances the representation of flame and smoke boundaries in the image. Liang et al. [36] designed a heterogeneous prior-driven dehazing method for remote sensing images, which is suitable for smoke removal in complex environments and generalizes well. Liao et al. [37] proposed an unsupervised image dehazing model based on fuzzy clustering and local structural information, suitable for scenarios with limited imaging in early-stage fires. In addition, Guo et al. and Liu et al. [38,39] systematically reviewed the application of image dehazing methods in information fusion and remote sensing image processing, providing theoretical support for fire-image enhancement. In the field of image deblurring, Yang and Evans [40] proposed lightweight deblurring methods for resource-constrained platforms, suitable for embedded fire-monitoring equipment. Xu and Wei [41] designed an unsupervised pyramid-structured deblurring method based on deep image priors, avoiding reliance on real sharp images and offering good transferability. Li et al. [42] integrated polarization information and improved underwater dehazing performance through multi-index reconstruction strategies, an idea that can also be transferred to fire images with heavy smoke. At the same time, transformer structures have demonstrated strong modeling capability in deblurring; for example, Tsai et al. [43] proposed Stripformer, which improves restoration speed through strip-shaped feature modeling, and Kong et al. [44] used a frequency-domain transformer to enhance the clarity of blurred images. Moreover, Dong et al. [45] proposed a multi-scale residual filtering network, and Ren et al. [46] introduced a structure-guided diffusion modeling method, achieving excellent results under complex blurring conditions. Ji et al. [47] and Kim et al. [48] adopted a divide-and-conquer strategy and a multi-stage structure, respectively, to perform efficient single-image deblurring, providing solutions that balance structure and performance for the dynamic blurring encountered in fire images.
Specifically, such blurring can be classified into two main categories: static blurring and dynamic blurring. The former is mainly caused by degradation of the image signal itself, such as sensor noise, focal length deviation, and compression loss; the latter stems from temporal dynamic changes, such as target movement, camera jitter, or thermal air disturbance. Because fire scenes often involve complex interactions of lighting and motion, a single image-enhancement method struggles to model and restore the various types of blur. Therefore, this paper designs a two-stage blur modeling framework that combines static noise suppression with dynamic blur correction, completing the multimodal blur restoration process uniformly within the ChaIR network architecture. This module provides a clearer and more robust image foundation for subsequent target detection. As shown in Figure 3, this network adopts an encoder–decoder architecture, integrating the spatial-channel attention module (SCA) and the frequency-channel attention module (FCA) to model the static and dynamic blur features, respectively. Each stage achieves feature fusion through multi-scale residual connections and introduces explicit frequency-domain modulation to enhance the high-frequency restoration capability.
To quantitatively evaluate the restoration effect of the image deblurring module, this paper introduces two image quality evaluation indicators: the structural similarity index (SSIM) and the peak signal-to-noise ratio (PSNR). The SSIM mainly measures the fidelity of the image in terms of brightness, contrast, and structure. Its calculation formula is as follows:
\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}

Here, μ_x and μ_y represent the mean values of images x and y, respectively, σ_x² and σ_y² are the variances, σ_xy is the covariance, and C_1 and C_2 are stability constants. The closer the SSIM value is to 1, the closer the image is to the real image.
The PSNR measures the image reconstruction error at the pixel level and is defined as follows:
\mathrm{PSNR} = 10 \log_{10}\!\left( \frac{L^2}{\mathrm{MSE}} \right)

Here, L represents the maximum possible value of an image pixel (255 for an 8-bit image), and MSE represents the image mean squared error, which is defined as follows:
\mathrm{MSE} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left[ x(i, j) - \hat{x}(i, j) \right]^2

Here, x(i, j) and x̂(i, j) represent the pixel values at position (i, j) in the original image and the restored image, respectively, with m × n being the image size. The smaller the MSE, the less the image distortion; the larger the PSNR, the better the restoration effect.
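For reference, a minimal sketch of the PSNR computation is given below, with SSIM delegated to a library implementation; the function name and defaults are illustrative.

```python
import numpy as np

def psnr(original: np.ndarray, restored: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two images of identical shape."""
    mse = np.mean((original.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# SSIM involves local means, variances, and covariances; in practice a
# library implementation is typically used, e.g.:
# from skimage.metrics import structural_similarity as ssim
# score = ssim(original, restored, data_range=255)
```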
In the blur modeling and deblurring module constructed in this paper, comparing the PSNR and SSIM values between the restored image and the clear image effectively measures the model's adaptability to static and dynamic blurry scenes. The experimental results show that, when dealing with smoke obstruction, this method outperforms traditional approaches on both indicators, providing more reliable input quality for subsequent intelligent detection and image recognition.
2.2.1. Data Preparation
In order to simulate the image degradation process affected by haze in a natural environment and to construct a reliable training dataset, this paper adopts an image synthesis method based on the atmospheric scattering model to simulate the presence of haze in clear images.
Figure 4 illustrates the physical mechanism of light propagation in a hazy environment during natural imaging. Under typical atmospheric conditions, the radiation from the target scene is subject to the dual effects of scattering and absorption by water vapor particles (haze) before reaching the imaging device, ultimately resulting in phenomena such as decreased contrast, color shift, and structural blurring in the captured image.
This process can be formally described by the classic atmospheric scattering model (ASM) as follows:
I(x) = J(x)\, t(x) + A \left( 1 - t(x) \right)

Here, I(x) represents the foggy image that is finally observed, J(x) is the original clear image, A is the global atmospheric light value, and t(x) is the transmittance function, which is defined as follows:

t(x) = e^{-\beta d(x)}

Here, β represents the scattering coefficient of fog particles in the atmosphere and d(x) is the physical depth at pixel x. The above formulas indicate that the fog's shading effect decays exponentially with the relative depth of the target from the camera.
As shown in Figure 4, the image obtained by the imaging device not only contains the direct light component emitted by the target scene (attenuated by the transmittance t(x)), but also includes additional light intensity introduced by the atmospheric light A along the scattering path. This scattered light increases significantly with fog concentration, resulting in the loss of local details in the image. By adjusting the parameters β and A and the depth map d(x), the fog density and its spatial distribution can be flexibly controlled, generating diverse degraded image data.
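A minimal sketch of this ASM-based haze synthesis is given below; the parameter names and default values (beta, atmosphere) are illustrative assumptions rather than the exact settings used for dataset generation.

```python
import numpy as np

def synthesize_haze(clear: np.ndarray, depth: np.ndarray,
                    beta: float = 1.0, atmosphere: float = 0.9) -> np.ndarray:
    """Render a hazy image from a clear image and a depth map via the ASM.

    clear: float image in [0, 1], shape (H, W, 3).
    depth: relative depth map, shape (H, W).
    I(x) = J(x) * t(x) + A * (1 - t(x)),  with  t(x) = exp(-beta * d(x)).
    """
    t = np.exp(-beta * depth)[..., None]          # transmittance, (H, W, 1)
    hazy = clear * t + atmosphere * (1.0 - t)
    return np.clip(hazy, 0.0, 1.0)

# Varying beta, atmosphere, and the depth map yields haze of different
# densities and spatial distributions for training-data synthesis.
```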
This synthesis method provides sufficient labeled data for the subsequent training of the image dehazing network, avoiding the high cost and difficult labeling problems encountered in the real acquisition of haze images. At the same time, it also supports the combination of real depth maps or simulated depth scenarios to construct more accurate synthetic fog images, enabling more robust, weakly supervised and unsupervised learning for image enhancement.
Meanwhile, during actual image acquisition, the high-speed movement of the target object, jitter of the camera platform, and excessively long exposure times can easily cause trailing and stretching of target edges in the image, forming so-called dynamic motion-blurred images. Compared with static blur, this type of blur exhibits stronger spatial non-uniformity, and its blur kernel varies with spatial position in the image, seriously affecting image clarity and the robustness of subsequent visual tasks.
The physical imaging process of dynamic blur can be formally expressed as follows:
B(x) = \frac{1}{T} \int_{0}^{T} S\!\left( x + \int_{0}^{t} v(x, \tau)\, d\tau \right) dt + n(x)

Here, B(x) represents the blurred image, S(x) represents the clear image, v(x, t) represents the velocity vector at pixel position x at time t, T is the exposure duration, and n(x) is the sensor noise. This integral form indicates that the blurred image is formed by the cumulative superposition of multiple instantaneous images of the target at different time positions.
This paper adopts an image degradation method based on the superposition of linear motion kernels: according to a preset velocity field or displacement trajectory, a set of intermediate images over consecutive time frames is generated; then, through temporal averaging or weighted accumulation, a blurred image equivalent to that produced by a real acquisition system is obtained. This method can not only simulate the linear blur caused by uniform translational motion, but can also be extended to support nonlinear displacement, rotational blur, camera zooming, and other complex situations.
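The following sketch illustrates the temporal-averaging idea for linear motion blur; the step count, displacement parameters, and the wrap-around border handling are simplifying assumptions rather than the exact synthesis pipeline.

```python
import numpy as np

def synthesize_linear_motion_blur(sharp: np.ndarray, dx: float, dy: float,
                                  num_steps: int = 15) -> np.ndarray:
    """Approximate linear motion blur by averaging shifted copies of a frame.

    sharp: float image in [0, 1], shape (H, W, C).
    (dx, dy): total displacement in pixels over the exposure.
    The average of num_steps intermediate images along the displacement
    trajectory approximates the temporal integral of the blur model.
    """
    acc = np.zeros_like(sharp, dtype=np.float64)
    for s in np.linspace(0.0, 1.0, num_steps):
        shift_x, shift_y = int(round(s * dx)), int(round(s * dy))
        # np.roll wraps at the border; a real pipeline would crop or pad instead.
        acc += np.roll(np.roll(sharp, shift_y, axis=0), shift_x, axis=1)
    return (acc / num_steps).astype(sharp.dtype)

# Nonlinear trajectories, rotation, or zoom can be simulated by replacing the
# integer roll with a per-step geometric warp before averaging.
```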
It is worth noting that, to improve the realism of the synthesized blurred images, this paper introduces random jitter vectors, kernel distortion, scene occlusion, and foreground crossing, which are perturbation factors present in real shooting, into the image generation process. This yields a high-fidelity and diverse dynamic-blur image dataset that provides effective training support for subsequent deep deblurring models. Through this simulation mechanism, the generalization ability of the training data is improved, and a unified benchmark is provided for evaluating various deblurring algorithms in real, complex scenarios.
2.2.2. Image Dehazing
In the scenarios of fire and smoke warning, image dehazing serves as a crucial pre-processing step, which is of great significance for enhancing the accuracy and robustness of subsequent detection models. Due to the extensive dispersion of smoke during the initial stage or the spread process of a fire, significant phenomena, such as low contrast, color deviation, and blurred edges, will occur in the image. These degradations will seriously interfere with the model’s ability to recognize flames, smoke contours, and background structures. Therefore, effectively removing the interference of haze and restoring the intrinsic structure of the image is one of the core links in improving the performance of the visual perception system.
This paper adopts a deep learning-driven end-to-end image dehazing method. Without relying on the real transmittance and depth map, it achieves global structure restoration and detail enhancement of the haze map. The ChaIR network is used as a unified image restoration framework. In this task, through a specialized data preparation strategy and target optimization, it guides its learning of the dehazing transformation relationship.
In the network design, the ChaIR network enhances the decoupling and processing capabilities of the model for low-frequency (background haze layer) and high-frequency (edge details) information by introducing a frequency channel attention module (FCA). Compared to traditional CNNs, the FCA module can explicitly enhance the remaining structural texture signals during reconstruction, especially in the transition areas between thin smoke and atmospheric diffusion, showing higher reconstruction quality. To further optimize the training process, this paper introduces a dual-domain loss function as follows:
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{pix}} + \mathcal{L}_{\mathrm{freq}}

Here, L_pix represents the pixel-space loss, while L_freq represents the frequency-domain loss, which is used to enhance the model's robustness under changes in illumination distribution.
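A minimal sketch of such a dual-domain loss is given below; the use of L1 distances, the FFT-amplitude comparison, and the weight freq_weight are illustrative assumptions, as the exact ChaIR formulation may differ.

```python
import torch
import torch.nn.functional as F

def dual_domain_loss(pred: torch.Tensor, target: torch.Tensor,
                     freq_weight: float = 0.1) -> torch.Tensor:
    """Pixel-space L1 loss plus an L1 loss on 2D FFT amplitudes.

    pred, target: (B, C, H, W) restored and ground-truth images.
    The frequency term penalizes discrepancies in the spectrum, which helps
    preserve global illumination structure and high-frequency detail.
    """
    pixel_loss = F.l1_loss(pred, target)
    pred_fft = torch.fft.rfft2(pred, norm="ortho")
    target_fft = torch.fft.rfft2(target, norm="ortho")
    freq_loss = F.l1_loss(torch.abs(pred_fft), torch.abs(target_fft))
    return pixel_loss + freq_weight * freq_loss
```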
2.2.3. Image Deblurring
In the actual deployment of a fire monitoring system, image blurring is another problem the system faces. It includes both static blur caused by the image acquisition device (such as incorrect focal length or compression artifacts) and dynamic blur caused by camera jitter, target movement, or thermal disturbances. Such blurring seriously affects the accuracy of identifying key features, such as flame edge contours and smoke diffusion trajectories. Therefore, to improve spatial recognition and structural clarity, a deblurring mechanism with dual-domain modeling capability is introduced to restore both static and dynamic blur in a unified manner.
Research on single-image dehazing has made significant progress, covering a variety of methods from traditional model-based approaches to advanced deep learning architectures. Early studies, such as that of Matlin and Milanfar [49], focused on methods that simultaneously remove fog and noise from a single image, emphasizing the challenge of handling multiple degradations without relying on multiple images. This work highlights the importance of developing single-image solutions, especially when multiple shots cannot be taken. On this basis, Fang et al. [50] proposed a fast variational method that uses an adaptive-window dark channel prior to estimate the transmission map, achieving efficient dehazing and denoising simultaneously and reflecting the trend of using prior-based models to improve dehazing performance. Further progress includes the work of Shin et al. [51], who used a convolutional network architecture to estimate the ambient light and transmission map, particularly for underwater images; their method improves reconstruction quality by jointly estimating these parameters, addressing the unique challenges of the underwater environment. Perez et al. [52] demonstrated that deep learning techniques can produce high-quality restorations from a single foggy image, inspired by successes in related image processing tasks (such as colorization and object detection). Following this paradigm, Yang and Sun [53] proposed Proximal Dehaze-Net, which learns dark channel and transmission priors by unrolling an iterative algorithm into a deep network, integrating prior knowledge into a learnable framework. Similarly, Zhang et al. [54] introduced a dehazing method for sky and river scenes, using external and internal cues to improve the dehazing effect in these specific scenarios.
The deblurring strategy adopted in this paper is based on the ChaIR (Channel Interaction Restoration) network structure, which fully exploits the channel-wise feature differences of blur degradation. Through the spatial-channel attention module (SCA) and the frequency-channel attention module (FCA), it reconstructs the blurred components and restores details. The ChaIR structure adopts a U-shaped main network combined with multiple residual blocks to construct feature-extraction and reconstruction paths, and introduces a channel attention mechanism at the end to effectively enhance the dynamic selection of blur-related information across feature layers.
The SCA module aims to conduct neighborhood interaction and weighted integration of each channel in the convolutional feature map, thereby enhancing the structural information of the blurred area and suppressing redundant background signals. Its specific calculation form is as follows:
\mathrm{SCA}(I) = I \otimes \tanh\!\left( \mathrm{BN}\!\left( \mathrm{Conv}\!\left( \mathrm{GAP}(I) \right) \right) \right)

Here, I represents the input feature map, GAP is global average pooling, Conv represents the convolution, BN represents batch normalization, ⊗ denotes channel-wise weighting, and the tanh activation function is used to generate positive and negative weighting factors to enhance the information filtering ability. Compared with traditional Softmax attention, the SCA module can generate negative weights and suppress useless or ambiguous channel features.
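A compact PyTorch sketch of a tanh-gated channel attention consistent with this description is shown below; the module name and layer ordering are assumptions for illustration rather than the exact SCA implementation.

```python
import torch
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    """Channel re-weighting with a tanh gate, allowing negative weights.

    GAP -> 1x1 Conv -> BN -> tanh produces one signed weight per channel,
    which scales the corresponding channel of the input feature map.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),            # GAP: (B, C, 1, 1)
            nn.Conv2d(channels, channels, 1),   # channel interaction
            nn.BatchNorm2d(channels),
            nn.Tanh(),                          # weights in (-1, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)
```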
In dynamic blurred images, key information is often reflected in the high-frequency part of the image. Therefore, the FCA module expands the expression ability of high-frequency information through multiple convolution paths and performs frequency fusion based on channel weights. The core calculation process is as follows:
F_i = \mathrm{Conv}_i(I), \quad i = 1, \dots, k; \qquad w = \mathrm{Softmax}\!\left( \sum_{i=1}^{k} F_i \right); \qquad O = \sum_{i=1}^{k} w_i \otimes F_i

During the above process, multiple frequency branches capture multi-frequency information in the blurry image through convolutions of different depths, and sequential addition followed by a Softmax calculation yields the fusion weights w_i. Subsequently, channel-weighted superposition is performed to restore the clear structure of the image.
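The following heavily simplified sketch illustrates the multi-branch fusion idea described above; the branch depths, the softmax dimension, and the module name are assumptions and do not reproduce the exact FCA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrequencyChannelAttention(nn.Module):
    """Multi-branch convolution with softmax fusion over the branches.

    Branches of increasing depth capture different frequency bands; their
    summed responses yield per-position softmax weights used to fuse the
    branch outputs back into a single feature map.
    """
    def __init__(self, channels: int, num_branches: int = 3):
        super().__init__()
        self.branches = nn.ModuleList()
        for depth in range(1, num_branches + 1):
            layers = []
            for _ in range(depth):
                layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
            self.branches.append(nn.Sequential(*layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = torch.stack([b(x) for b in self.branches], dim=1)   # (B, K, C, H, W)
        weights = F.softmax(feats.sum(dim=2, keepdim=True), dim=1)  # (B, K, 1, H, W)
        return (weights * feats).sum(dim=1)                         # (B, C, H, W)
```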
2.3. Image Data Augmentation
In visual fire-detection tasks, the scarcity of data and the imbalance of positive (fire and smoke) samples have long constrained the generalization ability of deep learning models. This paper selects Projected GAN as the main method to improve the quality and diversity of image generation. By introducing a multi-scale feature-space discriminator and combining it with the discriminative signal of the traditional image space, Projected GAN effectively avoids the discriminator overfitting and training instability that the earlier StyleGAN series of models suffers under small-sample training.
The network architecture mainly includes two innovations: Firstly, the discriminator no longer directly classifies the image as true or false, but uses a fixed pre-trained feature extraction network to map the image to the semantic feature space and to make judgments at multiple scales, thereby improving the discriminative ability for structural information and local details. Secondly, the projection mechanism ensures that the generator obtains more explicit and stable gradient feedback during training, effectively improving the training stability and the fidelity of the final generated image.
The structure of the Projected GAN network is shown in Figure 5. In the basic structure, the discriminator first performs multi-level feature encoding of the input image through the main feature extraction network. Specifically, the image is sent layer by layer to four different convolution modules, L1 to L4, corresponding to different spatial resolutions and semantic depths. These intermediate features are then sent to the D1 to D4 discriminator heads to perform true/false discrimination on each layer's feature representation. This design enhances the collaborative modeling of local and global structures, enabling the discriminator to more accurately capture the texture details and semantic consistency of the image, thereby effectively improving the learning efficiency of the generator.
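As a rough illustration of multi-level feature-space discrimination, the sketch below uses a frozen pretrained backbone with small per-level heads; the choice of backbone (ResNet-18) and the head design are assumptions for exposition and differ from the actual Projected GAN implementation.

```python
import torch
import torch.nn as nn
import torchvision

class MultiScaleFeatureDiscriminator(nn.Module):
    """Discriminate in the feature space of a frozen pretrained backbone.

    Images are mapped to four feature levels (L1-L4) of a fixed encoder;
    a small convolutional head (D1-D4) scores the features at each level,
    and the per-level logits are summed into one real/fake signal.
    """
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        for p in list(self.stem.parameters()) + list(self.stages.parameters()):
            p.requires_grad = False             # the feature extractor stays fixed
        self.heads = nn.ModuleList([nn.Conv2d(c, 1, 3, padding=1)
                                    for c in (64, 128, 256, 512)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats, logits = self.stem(x), []
        for stage, head in zip(self.stages, self.heads):
            feats = stage(feats)
            logits.append(head(feats).mean(dim=(1, 2, 3)))  # one score per image per level
        return torch.stack(logits, dim=1).sum(dim=1)        # (B,) summed real/fake logit
```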
Based on this, Projected GAN introduces an auxiliary projection structure in the perceptual space. At each scale of the feature hierarchy, the discriminator not only uses the convolutional features from the original image, but also aligns them with the perceptual features extracted by a fixed pre-trained network. The green feature branch in Figure 5 represents this projection path. Through the fusion of the perceptual space and the original feature space, the model obtains higher-level semantic consistency constraints in the discrimination stage. This alignment mechanism prompts the generator not only to imitate real images at the pixel level, but also to maintain consistency of texture structure and style in the perceptual dimension.
This design significantly enhances the discrimination power of the discriminator and the convergence quality of the generator. Multi-scale supervision enhances the model’s sensitivity to the different scale details of the image, while the perceptual space alignment effectively avoids the blurring and structural drift problems of the generated images in complex backgrounds or weak texture areas.
Projected GAN in this system not only enhances the diversity and generalization ability of the training data, but also provides controllable and high-fidelity sample support for the downstream fire detection network under data-scarce conditions, promoting the intelligent development of the visual monitoring system.
To further quantitatively evaluate the structural fidelity and semantic consistency of the images generated by the Projected GAN adopted in this paper, the Fréchet Inception Distance (FID) is introduced as a no-reference metric of image generation quality. The FID is an authoritative measure widely used in image generation and adversarial network research. Its core idea is to compare the statistical distributions of generated and real images in a feature space, thereby evaluating the difference between the two.
Specifically, the FID assumes that images follow a multivariate Gaussian distribution in the feature space. Let the features of the real images follow N(μ_r, Σ_r) and the features of the generated images follow N(μ_g, Σ_g). Then, the Fréchet distance between the two is defined as follows:

\mathrm{FID} = \left\| \mu_r - \mu_g \right\|_2^2 + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
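Given feature means and covariances estimated from the two image sets, the FID can be computed as in the following sketch; the Inception feature extraction step is assumed to have been performed separately, and the function name is illustrative.

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu_r: np.ndarray, sigma_r: np.ndarray,
                     mu_g: np.ndarray, sigma_g: np.ndarray) -> float:
    """Fréchet distance between two Gaussians fitted to image features.

    mu_*: feature means, shape (D,); sigma_*: feature covariances, (D, D).
    In practice the features are Inception activations computed over the
    real and generated image sets.
    """
    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real          # discard tiny imaginary parts from numerics
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```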
The smaller the FID value, the closer the generated image is to the real image in terms of perceptual quality and semantic structure, that is, the better the generation effect. Different from traditional pixel-level indicators, such as the PSNR and SSIM, the FID measures the statistical distribution difference of the image in the high-level feature space. Therefore, it is more sensitive to image content, style, and semantic integrity, and is suitable for the authentic evaluation of complex image structures, such as flame shapes and texture edges.
Research in the field of object detection has made significant progress due to the application of deep learning methods. Sun et al. [55] provided an overview of the evolution of object detection methods, highlighting the shift from traditional approaches to complex deep learning models. These developments have made more accurate and reliable detection possible in various scenarios, and recent studies have emphasized the importance of multimodal data fusion for improving detection performance. Open-set object detection has also attracted attention. Liu et al. [56] developed Grounding DINO, a transformer-based detector that identifies arbitrary objects by incorporating pre-training grounded in human-provided category names or referring expressions. Similarly, Wu et al. [57] introduced GRiT, a generative transformer that can understand open-set targets without predefined object categories, expanding the detection range beyond fixed categories. In response to the challenges posed by visually degraded scenes, Liu et al. [58] proposed an image-enhancement-guided detection method, which integrates an enhancement branch into the detection network in an end-to-end manner to improve performance in challenging visual environments. The application of object detection to specific domains, such as synthetic aperture radar (SAR) images, has also been explored. Li et al. [59] proposed SARDet-100K, a large-scale SAR dataset, together with a multi-stage pre-training framework with filter augmentation (MSFA) that addresses the domain gap between RGB and SAR data, facilitating better transfer learning and detection performance in SAR environments. In summary, as surveyed by Jiang et al. [60], the literature reflects the expansion of object detection research into multimodal fusion, robustness under environmental change, open-set recognition, and domain-specific challenges, driven by innovative architectures and training strategies.
This paper uses a modified YOLOv8 to detect fire scenes.
Figure 6 shows the network architecture design principle of the improved YOLOv8n structure proposed in this paper for the fire image detection task. This architecture is based on the traditional YOLO backbone network and integrates three key improvement modules: the BiFormer Attention mechanism, the Agent Attention global perception module, and the CCC lightweight feature compression module, aiming to enhance the model’s feature extraction and accuracy performance in complex fire scenarios. The entire network structure is divided into three main parts: input encoding, feature fusion, and prediction output.
In the feature extraction stage, the input image is first processed through multiple stacked CBS modules, composed of convolution (Conv), batch normalization (BN), and activation function (SiLU), to extract low-level semantic features. At the same time, C2F modules are inserted at different layers to enhance the channel-level fine-grained information flow. The underlying structure also introduces the Agent Attention module to simulate the human visual focus mechanism, improving the model’s perception ability of the fire source area. As the network passes downward, the features enter the main part, namely the blue area in the figure, and are modeled through the BiFormer module for bidirectional feature flow, taking into account both local and global context semantics, thereby enhancing the model’s ability to distinguish heterogeneous fire targets.
In the feature fusion and upsampling stage, the multi-layer features are further stacked after upsampling and fusion, and the dimensions are compressed through lightweight C2F and CBS modules. At this time, intermediate features of different scales (80 × 80, 40 × 40, 20 × 20) enter the right CCC structure for prediction. In the CCC structure, three groups of branches composed of dual CBS modules and convolution are used to handle different-scale fire target detection tasks, respectively, effectively alleviating the target recognition errors caused by scale changes. In addition, the SPPF (Spatial Pyramid Pooling-Fast) module is used to compress semantic information and to reduce computational costs. The entire design demonstrates the characteristics of flexible structure, clear feature flow, controllable parameters, and strong deployability, and is suitable for real-time fire monitoring systems.
This paper uses six core indicators commonly used in target detection tasks: Precision, Recall, AP, mAP, F1-score, and FPS. They are key references for evaluating detection model performance and are especially suited to the dual requirements of accuracy and completeness in safety-critical scenarios such as fire detection.
\mathrm{AP} = \int_{0}^{1} P(R)\, dR

Here, AP represents the area under the Precision–Recall curve for a given category, which is a unified metric for comprehensively evaluating Precision and Recall. During prediction, a series of Precision–Recall pairs is obtained by continuously adjusting the classification threshold, from which the PR curve is drawn. AP integrates the curve to evaluate the model's overall performance at different thresholds. In practical applications, the larger the AP, the better the model performs in balancing false positives and false negatives.
\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i

Here, mAP is the average of AP over all detection categories, where N represents the number of categories and AP_i represents the Average Precision of the i-th category. As the core metric of overall detection performance, mAP is widely used for model comparison. In the fire detection task, mAP measures the model's overall detection level across all fire-related targets and is a key indicator of whether the system is suitable for practical deployment.
P = \frac{TP}{TP + FP}

Precision (P) indicates the proportion of samples predicted by the model as positive (i.e., fire sources) that are actually fire sources. TP (True Positives) counts cases where a fire source is predicted and is actually present, while FP (False Positives) counts cases where a fire source is predicted but is not actually present. In a fire scenario, Precision reflects the false alarm rate: a high Precision means the system rarely raises false alarms on non-fire images, which is crucial for the efficiency of emergency response. In areas with high foot traffic or density, false alarms can cause unnecessary panic and waste of resources, so Precision must be kept at a high level.
R = \frac{TP}{TP + FN}

Recall (R) indicates the proportion of actual fire-source images that the model successfully detects, where FN (False Negatives) counts images that actually contain a fire source but that the model fails to detect. Recall reflects missed detections. In fire monitoring, missed detections are more dangerous than false alarms, because a missed detection means the system cannot promptly warn of a real fire, which may lead to major safety accidents. Therefore, Recall is an important metric for measuring the robustness of a fire detection system.
F_1 = \frac{2 \times P \times R}{P + R}

The F1 score balances the False Positive (FP) and False Negative (FN) behavior of the model and is particularly important in fire detection. Since missed detections (FN) in fire scenarios can lead to disastrous consequences, while false detections (FP) may cause unnecessary panic and waste of resources, the F1 score provides a single metric to evaluate the model's overall performance in both respects. The higher the F1 value, the better the balance between Precision and Recall.
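A minimal sketch of computing Precision, Recall, and F1 from matched detections is shown below; the IoU-matching step that produces the TP/FP/FN counts is assumed to happen upstream, and the example numbers are purely illustrative.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Precision, Recall, and F1 from true/false positive and false negative counts.

    For object detection, a prediction is typically counted as TP when its
    IoU with a ground-truth box exceeds a threshold (e.g., 0.5); AP and mAP
    are then obtained by integrating Precision over Recall per class.
    """
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: 90 correctly detected fires, 5 false alarms, 10 missed fires.
# detection_metrics(90, 5, 10) -> precision ~0.947, recall 0.9, f1 ~0.923
```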
In a fire monitoring system, real-time performance is of utmost importance. A high FPS (frames per second) indicates that the model can respond quickly to fire incidents and is suitable for scenarios with strict timeliness requirements, such as video surveillance and unmanned aerial vehicle inspection. FPS is the reciprocal of the processing time per frame.
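A simple way to estimate FPS, sketched below with placeholder model and frames objects, is to time repeated per-frame inference after a short warm-up; the exact measurement protocol used in the experiments may differ.

```python
import time

def measure_fps(model, frames, warmup: int = 5) -> float:
    """Average frames-per-second of a detector over a list of input frames.

    A few warm-up runs are executed first so that one-time initialization
    costs do not distort the per-frame timing.
    """
    for frame in frames[:warmup]:
        model(frame)
    start = time.perf_counter()
    for frame in frames:
        model(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed   # FPS = 1 / (processing time per frame)
```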