1. Introduction
Forest fires are a major environmental disaster facing forest resources. Once a forest fire occurs, it can quickly get out of control and then require enormous effort, time, and resources to extinguish. Additionally, forest fires and related combustion processes release large amounts of pollutants into the atmosphere, including CO2, CO, CH4, BC, VOCs, PM2.5, and PM10. CO2 and CH4 directly increase the concentration of greenhouse gases in the atmosphere, while CO can oxidize to CO2 in the atmosphere. BC exacerbates global warming by absorbing solar radiation and altering the albedo of ice and snow. VOCs, PM2.5, and PM10 reduce air quality and worsen air pollution. Consequently, these pollutants have significant impacts on environmental pollution and global warming. In recent years, numerous studies have focused on various pollutants and their effects on the environment and climate. Gürbüz et al. [1] proposed a computational approach for predicting the environmental impact of pollutants emitted from the transportation sector. Meanwhile, Ekici et al. [2] introduced a novel transparent computational method for analyzing the quantity of aviation pollutants. Smoke is always the first sign seen when a forest fire occurs. Therefore, early smoke recognition in forest fires is significant for early warning. Early smoke detection methods mainly rely on traditional image processing methods [3,4,5,6] and deep learning methods [7,8,9,10,11,12,13,14,15,16].
In addition to examining the spatial properties of images, traditional image analysis techniques extensively employ transform domain analysis, with the wavelet transform being a notable example. Barmpoutis et al. [3] presented a real-time smoke detection algorithm for video streams, incorporating background subtraction, color analysis, spatial energy analysis, spatiotemporal analysis, histogram analysis, and dynamic texture analysis. Dimitropoulos et al. [4] introduced an algorithm for recognizing smoke in videos using descriptors derived from linear dynamical systems. Islam et al. [5] proposed a smoke detection method that combines mixture smoke segmentation with an efficient dynamic smoke symmetry model. Recently, Wu et al. [6] developed a patchwork dictionary learning approach for early smoke detection in forest fires. However, these methods face challenges such as computational complexity, uncertainty, and insufficient precision in feature selection, resulting in weak adaptability and high computational resource requirements.
With the development and progress of deep learning, detection algorithms based on deep learning have become more efficient, reducing hardware costs and eliminating the need for manual feature extraction compared to traditional algorithms. Hu et al. [7] presented a method for video smoke recognition using a spatio-temporal CNN. Aslan et al. [8] proposed a motion-based geometric image transformation and DCGAN for wildfire smoke recognition. Yang et al. [9] introduced a smoke detection technique utilizing the DenseNet neural network architecture. Hsu et al. [10] introduced the pioneering RISE dataset, a comprehensive video dataset designed explicitly for identifying industrial smoke releases, using the Inception-v1 I3D neural network architecture for smoke recognition, which is capable of distinguishing smoke from water vapor. Shi et al. [11] employed a compact deep convolutional neural network for recognizing smoke in videos. Tao et al. [12] proposed a deformable convolutional enhancement network based on semantic correlation multidirectional interaction for forest smoke recognition. Jiang et al. [13] introduced an attention mechanism in the EfficientNet network for smoke recognition during straw burning. Cao et al. [14] developed a smoke source recognition and prediction method based on the enhanced feature foreground network for wildfire smoke recognition. Li et al. [15] proposed a 3D parallel fully convolutional network architecture for wildfire smoke recognition. Zhu et al. [16] proposed a 3D convolutional encoder–decoder network architecture for smoke recognition. The existing algorithms have only achieved satisfactory performance on synthetic, self-made, small-scale, and non-forest datasets. This is primarily because the smoke images in these datasets feature simple backgrounds, clearly visible smoke movement, large target sizes, and dense colors. However, these conditions differ significantly from those in forest environments. In forest scenarios, smoke is often small, light, slow-moving, and surrounded by complex and variable backgrounds. Moreover, current smoke detection algorithms lack modules to handle these challenging conditions. Consequently, existing algorithms cannot be directly applied to forest environments for detecting small, light, and slow-moving smoke amidst complex backgrounds.
To address the above issues, this paper first integrates a forest fire early smoke surveillance video dataset captured in real scenarios, covering all of the challenging early smoke characteristics described above and laying a good foundation for future practical applications. Then, this paper introduces 4D-MENet (a 4D attention-based motion target enhancement network) for early smoke recognition in forest fire monitoring videos. 4D-MENet contains three new modules: FS (important frame sorting module), 4D-ME (4D attention-based motion target enhancement module), and HFM (high-resolution multi-scale fusion module). Our contribution is fourfold:
To detect smoke targets in complex backgrounds and improve the feature representation of light smoke targets, a 4D-ME module is proposed which enhances the neural network’s attention to temporal, spatial, and color features of smoke, improves the recall of smoke recognition, and reduces the false alarm rate.
To detect slow-moving smoke targets, an FS module is proposed to adaptively extract significant frame sequences from the input image sequence, facilitating the subsequent network to extract motion features of slow-moving smoke.
To detect small smoke targets, an HFM module is proposed to add a small target recognition layer which fuses shallow small target feature information in the high-level feature map, thus, enhancing the high-level feature map’s small smoke target recognition capability.
In this paper, we integrate a large-scale video dataset of forest fire early smoke in real scenarios containing various challenging early smoke features. To save human, material, and financial costs, we annotate the videos with categorical labels for subsequent supervised learning of neural networks, including 2450 smoke sequences and 3800 non-smoke sequences, laying a good foundation for future practical applications.
3. Results
This section first introduces the experimental details and evaluation criteria. It then compares the current state-of-the-art smoke detection algorithms from both qualitative and quantitative perspectives. Finally, it discusses the ablation experiments and practical applications.
3.1. Experimental Details
The experiments in this paper were conducted on a personal desktop computer running Ubuntu, equipped with an AMD Ryzen 9 5900X 12-Core Processor and multiple NVIDIA GeForce RTX 3090 GPUs, using the PyTorch framework. The AMD Ryzen 9 5900X 12-Core Processor is manufactured by AMD, located in Santa Clara, CA, USA. The NVIDIA GeForce RTX 3090 GPU is manufactured by GAINWARD, located in Taiwan, China. In this study, we fine-tuned the learning rate, batch size, optimizer selection, regularization parameters, and network structure parameters. For the learning rate, we initialized the model with default parameters and conducted a broad-range test, gradually increasing from 0.0001 to 0.1, to identify a rough range. We then refined the search within this range, recorded the model performance for each learning rate, and selected the best-performing value. For the batch size, we fixed the other parameters at the initially determined optimal learning rate and tested different batch sizes, such as 4, 16, 32, and 64; we recorded the training time and model performance and selected the batch size that provided the best performance with a reasonable training time. For the optimizer, we compared the performance and training time of the model using the default parameters of SGD, Adam, and RMSprop, and chose the optimizer that performed best on the validation set. For the regularization parameters, we set an initial range and conducted cross-validation under different regularization coefficients, selecting the coefficient that minimized the validation set loss. Finally, we experimented with different convolutional kernel sizes and layer combinations for the network architecture parameters under the optimal learning rate, batch size, and optimizer settings.
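The coarse-to-fine learning-rate search described above can be sketched as follows. This is a minimal illustration, not the exact implementation used in this work; the `evaluate` callback, which trains briefly and returns a validation score, is a hypothetical stand-in for the actual training runs:

```python
def coarse_to_fine_lr_search(evaluate, coarse=(1e-4, 1e-3, 1e-2, 1e-1)):
    """Two-stage learning-rate search: scan a broad range first,
    then refine around the best coarse value."""
    best = max(coarse, key=evaluate)
    # Refine the search in a narrow band around the coarse optimum.
    fine = (best * 0.5, best, best * 2.0)
    return max(fine, key=evaluate)
```

The same fix-one-sweep-one pattern applies to the batch-size, optimizer, and regularization sweeps.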
The convolutional kernels of sizes 1 × 1 × 1, 3 × 3 × 3, and 5 × 5 × 5 were primarily tested in the inception modules of the RGB-I3D network. Additionally, different layers were experimented with by incorporating 4D-ME at various positions within the backbone network. By balancing performance and resource consumption, we selected the best network structure. Throughout the training phase, the initial learning rate is set to 0.1, the milestones are (500, 1500), the weight decay is 10⁻⁶, and the shape of the video input to the network is [40, 3, 36, 224, 224], corresponding to the batch size, number of input channels, number of frames, and image height and width.
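The milestone schedule above corresponds to a piecewise-constant decay; a minimal sketch, assuming the common decay factor of 0.1 at each milestone (the factor itself is not stated in the text):

```python
def lr_at_step(step, base_lr=0.1, milestones=(500, 1500), gamma=0.1):
    """Learning rate after `step` iterations under a milestone schedule:
    the rate is multiplied by `gamma` at each milestone passed."""
    drops = sum(1 for m in milestones if step >= m)
    return base_lr * gamma ** drops
```

In PyTorch this behavior is provided by `torch.optim.lr_scheduler.MultiStepLR`.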
3.2. Evaluation Criteria
The evaluation metrics employed in this experiment are recall (R), the harmonic mean of precision and recall (F1-score), and the false positive rate (FPR), defined as R = TP/(TP + FN), FPR = FP/(FP + TN), and F1 = 2 × P × R/(P + R), where precision P = TP/(TP + FP).
Here, TP (true positives) is the number of instances correctly classified as the positive class, FN (false negatives) is the number of positive instances incorrectly classified as the negative class, FP (false positives) is the number of negative instances incorrectly classified as the positive class, and TN (true negatives) is the number of instances correctly classified as the negative class.
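For reference, these metrics can be computed directly from the confusion-matrix counts (a minimal sketch):

```python
def smoke_metrics(tp, fn, fp, tn):
    """Recall, false positive rate, and F1-score from confusion-matrix counts."""
    recall = tp / (tp + fn)        # R = TP / (TP + FN)
    fpr = fp / (fp + tn)           # FPR = FP / (FP + TN)
    precision = tp / (tp + fp)     # P = TP / (TP + FP)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, fpr, f1
```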
3.3. Comparative Experiments
3.3.1. Quantitative Analysis
In this paper, we first compare 12 state-of-the-art smoke recognition algorithms on the integrated Forest Smoke dataset; the specific comparison results are shown in Table 1. As shown in Table 1, the proposed algorithm significantly improves detection performance, with a recall (R) 1.17% higher and a false alarm rate (FAR) 1.61% lower than the second-best algorithm, EFFNet. This improvement is primarily due to EFFNet's approach of randomly extracting different frames from the input video sequence as keyframes, which introduces image redundancy and loss of important information. In contrast, our algorithm employs an important frame sorting module to adaptively extract critical frames from the input video sequence, enriching the features extracted by the neural network and enhancing the representation of smoke features, thereby improving the network's recall rate. Additionally, EFFNet uses a 2D attention mechanism to enhance smoke targets, while our algorithm utilizes a 4D attention mechanism. This enhances the distinction between smoke features and similar objects, improves the feature extraction capability for light smoke, and reduces the false alarm rate. Then, to prove the proposed algorithm's advantages, this paper compares it with 12 advanced smoke recognition algorithms on the RISE dataset; the specific comparison results are shown in Table 2. As shown in Table 2, the proposed algorithm achieves the best F1-score on S0, S1, and S3. On S3 in particular, the F1-score is improved by 3% compared to the second-best method (I3D), mainly because the added 4D-ME module introduces a temporal attention mechanism and thus has a greater impact on the classification of time series.
Table 3 shows some of the parameters of the different models, and it can be seen that our algorithm trades only a small increase in parameters for a significant performance improvement.
3.3.2. Qualitative Analysis
To better assess the performance of the proposed algorithm, this paper provides a qualitative analysis of 11 current advanced algorithms, which intuitively reflects the specific characteristics of each algorithm's classification and clarifies the nature of this paper's improvements.
Figure 4 shows a heat map comparing the different algorithms, where the first to fifth rows show the test results of the different algorithms on the Forest Smoke dataset, and the sixth to seventh rows show the test results of the different algorithms on the RISE dataset. The first image features small smoke targets with fog and cloud interference. The second image features light smoke targets with haze and cloud interference. The third image features small smoke targets with cloud and forest interference. The fourth image features small smoke targets with lighting and road interference. The fifth image features light smoke targets with rooftop and cloud interference. The sixth image features small smoke targets with shade and chimney interference. The seventh image features small smoke targets with water vapor interference. From these experimental results, it is evident that the proposed algorithm outperforms the second-best algorithm, EFFNet, by effectively eliminating interference from similar objects and more accurately identifying smoke regions.
Based on the above analysis, the 4D-MENet network proposed in this paper performs better for early smoke recognition in natural forest fire scenarios and addresses the current problems of low recall and high false alarm rates. This is mainly due to the following factors: (1) the important frame sorting module proposed in this paper eliminates the neural network's repeated extraction of identical features, enabling it to extract smoke motion features well and thus increasing the accuracy of smoke recognition. (2) The 4D-ME module proposed in this paper makes the neural network pay more attention to the motion regions in the video, allowing it to distinguish interference from objects similar in appearance to smoke and improving the feature extraction capability for light smoke targets, which increases the accuracy of smoke recognition and reduces the false alarm rate. (3) The high-resolution multi-scale fusion module proposed in this paper adds a network image recognition layer for small targets, increasing smoke recognition accuracy.
3.4. Ablation Experiments
The advantages of the 4D-MENet network proposed in this paper are mainly the FS module, the 4D-ME module, and the HFM module. To validate the contributions of the different modules, the following ablation experiments are conducted in this section for comparative analysis.
3.4.1. Important Frame Sorting Module
To eliminate the negative impact of the network repeatedly extracting the same features, this paper proposes the FS module. Section 2.3.2 provides a detailed description of the network architecture of this module. To validate its effectiveness, ablation experiments were conducted on the FS module, with specific experimental details and fine-tuning parameter settings provided in Section 3.1. The experiments were performed on 4D-MENet with and without the FS module on the Forest Smoke dataset. The network performance was evaluated using the R, FPR, and F1-score metrics. Additionally, the FS module was fine-tuned with different frame sequence O parameters, and the network performance of 4D-MENet on the Forest Smoke dataset under different O parameters was compared, as shown in Table 4.
From the analysis of the above table, it can be seen that introducing the FS module significantly increases the recall rate, with only a slight increase in the false alarm rate. As the number of important frames decreases, the network's performance continues to improve, peaking at O = 6; as O shrinks further, the performance metrics begin to decline, likely because too much temporal information is lost.
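For intuition, the FS module's adaptive selection can be approximated by a motion-based score, keeping the O frames that differ most from their predecessors. This is a simplified stand-in for illustration only; the actual scoring rule is defined in Section 2.3.2:

```python
def select_important_frames(frames, O=6):
    """Pick the O frames with the largest change from their predecessor,
    preserving temporal order. Each frame is a flat list of pixel values."""
    def diff(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    # Score each frame by motion relative to the previous frame;
    # the first frame is always kept as the reference.
    scores = [float("inf")] + [diff(frames[i], frames[i - 1])
                               for i in range(1, len(frames))]
    keep = sorted(sorted(range(len(frames)), key=lambda i: -scores[i])[:O])
    return [frames[i] for i in keep]
```

Under this scheme, near-duplicate frames produced by slow-moving smoke are discarded first, since their inter-frame difference is close to zero.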
3.4.2. 4D Attention-Based Motion Target Enhancement Module
To distinguish the interference of objects similar in appearance to smoke and improve the representation of light smoke features, this paper proposes the 4D-ME module. Section 2.3.3 describes the network architecture of this module in detail. To validate its effectiveness, ablation experiments were conducted on the 4D-ME module, with specific experimental details and fine-tuning parameter settings provided in Section 3.1. Experiments were performed on 4D-MENet with and without the 4D-ME module on the Forest Smoke dataset. The network performance was evaluated using the R, FPR, and F1-score metrics. Additionally, different attention mechanism A parameters in the 4D-ME module were fine-tuned, and the network performance of 4D-MENet on the Forest Smoke dataset under different A parameters was compared, as shown in Table 5.
The above table shows a slight increase in the recall rate and a more substantial decrease in the false alarm rate, indicating that the module significantly improves the differentiation of similar objects and, to a certain extent, improves the feature extraction ability for light smoke targets. The comparison data of the different attention mechanisms in Table 5 reveal that the temporal attention mechanism significantly impacts smoke detection performance. This is primarily because the dynamic changes in smoke are a critical feature for smoke target detection. The network achieves optimal performance when the spatial, channel, and temporal attention mechanisms are combined.
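As a toy illustration of the temporal branch (the spatial and channel branches are analogous), frames can be reweighted by a softmax over their mean activation. This is a conceptual sketch, not the 4D-ME implementation:

```python
import math

def temporal_attention(clip):
    """Reweight frames by a softmax over their mean activation.
    `clip` is a list of T frames, each a flat list of feature values.
    Returns the reweighted clip and the per-frame weights."""
    means = [sum(f) / len(f) for f in clip]
    m = max(means)
    exps = [math.exp(v - m) for v in means]   # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    # Scale every activation in frame t by its temporal weight.
    return [[w * x for x in f] for w, f in zip(weights, clip)], weights
```

Frames with stronger (e.g., motion-enhanced) activations receive larger weights, so the classifier attends more to the time steps where smoke changes.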
3.4.3. High-Resolution Multi-Scale Fusion Module
To add a small target image recognition layer and improve the semantic and spatial information of the high-level feature maps used for classification, this paper proposes the HFM module. Section 2.3.4 provides a detailed description of the network architecture of the HFM module. To validate its effectiveness, ablation experiments were conducted on the HFM module, with specific experimental details and fine-tuning parameter settings provided in Section 3.1. Experiments were performed on 4D-MENet with and without the HFM module on the Forest Smoke dataset. The network performance was evaluated using the R, FPR, and F1-score metrics. Additionally, the parameter for the number of repetitions, N, in the HFM module was fine-tuned, and the network performance of 4D-MENet on the Forest Smoke dataset with different N parameters was compared, as shown in Table 6.
From the above table, it can be seen that the introduction of the HFM module significantly increases the recall rate. As the number of repetitions of the HFM module increases, the network's performance continues to improve, reaching its peak when N = 4. However, the performance metrics start to decrease when N grows further, which may be due to overfitting of the network.
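Conceptually, the HFM fusion step combines an upsampled deep (low-resolution, strongly semantic) feature map with a shallow high-resolution one, so small-target detail survives into the classification layer. A toy sketch with 2D maps as nested lists; the real module operates on multi-channel video features:

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # duplicate each column
        out.extend([wide, list(wide)])             # duplicate each row
    return out

def fuse(shallow, deep):
    """Element-wise fusion of a shallow high-resolution map with an
    upsampled deep map -- a toy stand-in for the HFM fusion step."""
    up = upsample2x(deep)
    return [[s + d for s, d in zip(sr, dr)] for sr, dr in zip(shallow, up)]
```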
3.5. Practical Application
To validate the practical application of the proposed algorithm in real-world scenarios, this study selected a small number of smoke samples from diverse indoor and outdoor scenes in publicly available datasets, including CVPR [34], USTC [35], XJTU-RS [36], and Kaggle wildfire smoke [37]. Due to the absence of consistent labels across these smoke samples, this paper directly compares the visual results of different smoke instances. During experimentation, the model trained on a custom dataset is utilized, and consistent testing parameters are applied to evaluate the samples from each dataset. Detailed experimental procedures are elaborated in Section 3.1. The specific experimental results are illustrated in Figure 5.
Figure 5 shows that the proposed algorithm exhibits good network performance in different scenarios, including indoor cigarette smoke, outdoor smoke, and synthesized smoke. Therefore, this algorithm plays a crucial role in indoor and outdoor fire prevention.
4. Discussion
Forests cover one-third of the Earth’s land area, and trees absorb a significant amount of carbon dioxide through photosynthesis, playing a crucial role in maintaining the health of the Earth’s environment. However, forest fires have devastating impacts on local residents’ lives and on terrestrial environments, causing irreparable damage to atmospheric conditions and ecosystems. Therefore, timely prevention of forest fires is vital for both ecological environments and human societies. Forest fires typically generate smoke initially, making smoke detection crucial for early fire detection and for reducing fire losses. However, current forest fire smoke detection systems suffer from low recall rates and high false alarm rates. A low recall rate means many actual fires go undetected, rendering fire prevention efforts almost meaningless, while a high false alarm rate increases the workload of firefighting personnel. Hence, further research and solutions are needed to improve the reliability and efficiency of detection systems.
Currently, mainstream smoke detection algorithms include SAN-SD, EFFNet, VSSNet, 3D-PFCN, I3D, CNN-LSTM, DCNN, MobileNetv2, DenseNet, DCGAN, C3D, MOG-CNN, and others. In this study, experiments were conducted on different smoke detection algorithms using both self-made datasets and public datasets. The experimental results are shown in Table 1, Table 2 and Table 3, as well as in Figure 4. From the qualitative and quantitative analysis, it can be seen that among all mainstream algorithms, the proposed algorithm exhibits the best network performance, with EFFNet being the second-best algorithm. The main reasons for the poorer performance of the other algorithms are as follows:
Reason 1: Although all of the above algorithms achieve ideal recognition results on their own datasets, their performance degrades sharply when they are applied to a different dataset, mainly because no large-scale smoke dataset is available, which leads to poor performance in scenarios not present in the training data.
Reason 2: Existing algorithms cannot recognize small smoke targets. The main reason is that during the feature extraction process of the neural network, as the depth increases, shallow features (small target features) will gradually be covered by high-level features, and high-level features have strong semantics. Small smoke target information will be lost when performing category recognition on high-level features.
Reason 3: Existing algorithms cannot recognize slow-moving smoke targets. The main reason is that the slow movement of smoke will cause several consecutive frames in the video to have the same features. When the neural network repeatedly extracts the same features, it will harm network prediction and is not conducive to model training and convergence.
Reason 4: Existing algorithms cannot recognize light smoke targets. The main reason is that the color becomes translucent as the smoke rises, making it easy for the neural network to ignore this part of the content when extracting features, resulting in the extracted features not having discriminative properties.
Reason 5: Existing algorithms cannot recognize smoke targets in complex backgrounds. The main reason is that there are many interferences between the appearance of smoke and familiar targets in the forest background (clouds, fog, haze, roads, roofs, etc.), resulting in the network being unable to extract discriminative features.
Based on the analysis above, this paper proposes the 4D-MENet network, which incorporates three additional modules, FS, 4D-ME, and HFM, into the RGB-I3D algorithm, significantly enhancing network performance. Compared to EFFNet, this paper first replaces the random keyframe extraction method with an adaptive keyframe extraction method, reducing redundancy and highlighting important information. Subsequently, the 2D attention mechanism in the intermediate feature layer of EFFNet is replaced with a 4D attention mechanism to enhance the smoke’s temporal characteristics and channel feature representation capabilities. Finally, a small target detection layer is added, resulting in the improved network performance of the proposed algorithm.
Table 1 and Table 2 show that, compared to EFFNet, the proposed algorithm exhibits significant improvements in both the recall rate and the false alarm rate. Visualization in Figure 4 demonstrates that the proposed algorithm is superior from a subjective perspective. Furthermore, in Section 3.5, this paper conducts experiments using the proposed algorithm on various publicly available datasets (including indoor, outdoor, and synthesized smoke) in the same experimental environment. The experimental results are visualized to demonstrate that the proposed algorithm can be applied to smoke detection in forest scenes and directly to indoor and outdoor smoke detection in non-forest scenes. Therefore, the proposed algorithm is significant in terms of applicability across various scenarios. However, the algorithm proposed in this paper has certain limitations regarding real-time performance, particularly for tasks such as real-time monitoring of smoke in forest fire videos, where high real-time requirements are crucial. A dependable smoke video detection algorithm must meet specific performance metrics: FLOPs between 10 G and 100 G, FPS above 30 fps, and model size ranging from 5 M to 50 M. As shown in Table 3, the proposed algorithm performs reliably on these metrics. However, compared to EFFNet, our algorithm lags by 28.3 G FLOPs, 9.69 fps in FPS, and 4.9 M in model size. Subsequent optimization of the proposed algorithm is necessary. In resource-constrained environments, the following optimization strategies are recommended:
Parallel processing: Use multi-threading or GPU acceleration to process different frames or image regions in parallel, reducing the processing time per frame.
Algorithm simplification: Employ simpler and more efficient algorithmic steps, such as replacing morphological processing with lighter-weight filtering techniques.
Frame rate adjustment: In extremely resource-constrained scenarios, reducing the video frame rate (e.g., from 30 FPS to 15 FPS) can decrease the number of frames processed per second, thus, alleviating the computational burden.
Region of interest (ROI): Process only the regions of interest (ROIs) in the video, ignoring static areas, thereby reducing the processing load.
Model compression: Reduce computational and storage requirements through model compression techniques such as pruning and quantization.
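The frame-rate and ROI strategies above can be sketched in a few lines (hypothetical helper names; a conceptual illustration rather than a production pipeline):

```python
def subsample_frames(frames, keep_every=2):
    """Frame-rate reduction: keep every k-th frame (e.g., 30 fps -> 15 fps)."""
    return frames[::keep_every]

def crop_roi(frame, top, left, h, w):
    """Restrict processing to an h x w region of interest of a 2D frame,
    ignoring static areas outside it."""
    return [row[left:left + w] for row in frame[top:top + h]]
```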