2.2. The Preprocessing of the Image
Given the wide range of data sources for forest smoke images and the inevitable influence of complex natural environmental factors, such as diverse lighting conditions, throughout the process of image collection, images often exhibit issues of uneven illumination and significant color deviations. If these smoke images are directly input into a detection network, the network may struggle to accurately capture smoke features due to overly bright areas or shadowed parts in the images, leading to an increase in false detection rates. Therefore, this study proposes an improved Bilateral Filtering-Multi-Scale Retinex (BF-MSR) method, enhancing smoke images to improve image quality.
Based on the theory of Retinex [21], the input image I(x,y) can be split into two components, the incident component L(x,y) and the reflection component R(x,y), which is expressed as

I(x,y) = L(x,y) · R(x,y)

Since the human eye's response to changes in brightness approximates a logarithmic pattern, taking the logarithm of both sides gives

log I(x,y) = log L(x,y) + log R(x,y)
The image decomposition process for Multi-Scale Retinex (MSR) is as follows:

Multi-scale Gaussian filtering is applied to the image. Specifically, N different scales σ are selected, and for each scale a Gaussian surround function G_n(x,y) is computed and convolved with the image I(x,y) to obtain the incident component L(x,y) at that scale:

L_n(x,y) = G_n(x,y) ∗ I(x,y)

For each scale n, compute the reflection component R_n(x,y):

R_n(x,y) = log I(x,y) − log[G_n(x,y) ∗ I(x,y)]

Combine the reflection components of each scale through a weighted linear fusion to obtain the final MSR output:

R_MSR(x,y) = Σ_{n=1}^{N} w_n R_n(x,y)

In this context, N signifies the number of scales, G denotes the Gaussian surround function, σ represents the standard deviation of the Gaussian function, w_n represents the weight of the nth scale, and w_1 + … + w_N = 1.
The Retinex algorithm is based on a theoretical assumption when estimating illumination for an image, namely that the incident illumination layer varies slowly. However, in complex forest environments, due to direct sunlight or shading from trees, the actual incident illumination layer can vary dramatically. This results in the loss of image edge information after Multi-Scale Retinex (MSR) processing and may produce noticeable halos, making overexposure likely. Compared to Gaussian filtering, bilateral filtering, as a nonlinear filter, can effectively preserve image edge details while smoothing image noise. Therefore, we improve MSR by using bilateral filtering instead of Gaussian filtering, with the bilateral filter formulated as

BF(x_c, y_c) = (1/W_p) Σ_{(x_i, y_i) ∈ Ω} f(x_i, y_i) · exp(−((x_i − x_c)² + (y_i − y_c)²)/(2σ_s²)) · exp(−(f(x_i, y_i) − f(x_c, y_c))²/(2σ_r²))

where (x_c, y_c) indicates the location of the center pixel, f(x_c, y_c) signifies the grayscale value of the central pixel, Ω denotes the filtering neighborhood, W_p is the normalization factor, σ_s is the standard deviation of the spatial Gaussian function, and σ_r is the standard deviation of the range Gaussian function.
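For illustration, the following is a minimal Python/OpenCV sketch of the BF-MSR idea described above: the illumination component at each scale is estimated with a bilateral filter instead of a Gaussian filter, the log-domain reflection components are fused with equal weights, and the result is stretched back to an 8-bit range. The scale values, weights, and final normalization are illustrative assumptions, not the exact settings used in this study.

```python
import cv2
import numpy as np

def bf_msr(image, sigma_spaces=(15, 80, 200), sigma_color=75):
    """Minimal BF-MSR sketch: Multi-Scale Retinex with bilateral filtering.

    sigma_spaces and sigma_color are illustrative values, not the paper's settings.
    """
    img = image.astype(np.float32) + 1.0          # avoid log(0)
    log_img = np.log(img)
    weights = [1.0 / len(sigma_spaces)] * len(sigma_spaces)  # equal weights summing to 1

    msr = np.zeros_like(img)
    for w, sigma_s in zip(weights, sigma_spaces):
        # The bilateral filter estimates the incident (illumination) component while
        # preserving edges, replacing the Gaussian surround of classic MSR.
        illumination = cv2.bilateralFilter(img, d=0, sigmaColor=sigma_color, sigmaSpace=sigma_s)
        # Reflection component at this scale: log I - log L
        msr += w * (log_img - np.log(illumination + 1.0))

    # Linear stretch of the fused reflection map back to the displayable 8-bit range.
    msr = (msr - msr.min()) / (msr.max() - msr.min() + 1e-6)
    return (msr * 255.0).astype(np.uint8)

# Example usage (hypothetical file path):
# enhanced = bf_msr(cv2.imread("smoke.jpg"))
```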
To understand this intuitively, we chose three sets of low-illumination images for testing, and the test results can be observed in Figure 2.
Compared to the original smoke images, BF-MSR significantly enhances the color contrast of low-illumination and nighttime images, making the contours and textures of the smoke much clearer. Compared to MSR, BF-MSR effectively avoids image overexposure and reduces the likelihood of edge halos.
For an unbiased assessment of the algorithm's performance, Information Entropy (IE) [22] and Peak Signal-to-Noise Ratio (PSNR) [23] were selected as evaluation metrics. A higher IE indicates that the image contains a richer amount of information, and a higher PSNR indicates stronger noise suppression. Taking Figure 2a as an example, the results of a comparison among diverse image enhancement approaches are displayed in Table 2.
As can be seen from the table above, BF-MSR had the highest IE and PSNR, indicating that the processed images are rich in detail information and have good noise reduction effects, making it more suitable for enhancing forest smoke images.
In the preprocessing step for flame images, current methods mainly rely on color and edge information to segment the flame area from the background and fill in incomplete areas within the flame center [24]. However, in this study, the flame images underwent meticulous segmentation and annotation processing, with the flame areas accurately defined and the annotated regions continuous and without holes, which ensured the integrity of the flame areas. Therefore, there was no need to repeat this process using color- and edge-based segmentation methods. As a result, this study did not perform preprocessing on the flame images.
2.3. Methods
2.3.1. YOLOv8
The basic framework of MTL-FSFDet is based on YOLOv8, and its overall structure is illustrated in Figure 3.
YOLOv8 mainly consists of three components: the backbone, neck, and head. In the backbone, the input image undergoes initial feature extraction through convolutional layers (Conv), followed by multiple stacked C2f modules to extract multi-scale features, and the SPPF module is utilized to enhance the receptive field. In the neck part, after upsampling, the deep features are concatenated with shallow features to achieve multi-scale feature fusion. In the head part, the fused multi-scale feature maps are used to generate classification and regression predictions.
2.3.2. MTL-FSFDet
In multitask learning, the network architecture and parameters of the shared layers are used jointly by all tasks. Specifically, when multiple tasks perform hard parameter sharing, they use the same network architecture in the shared layers, and the weights and biases of these layers are identical across tasks. This sharing mechanism means that, for each neuron in the shared layers, the input it receives, the activation function it applies, and the feature representation it outputs are the same for all tasks. During training, these shared parameters are updated synchronously based on the loss functions of all tasks, thereby learning a general representation that can adapt to multiple tasks simultaneously. By sharing these parameters and this architecture, the multitask learning model can capture common information or patterns among different tasks, which helps improve the performance of each task. At the same time, because the shared layers reduce the number of parameters that need to be learned, this also helps alleviate overfitting and improves the model's generalization capability.
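As a rough illustration of hard parameter sharing, the PyTorch sketch below uses a single shared trunk whose parameters receive gradients from both task losses in one backward pass. The module names, loss weights, and toy shapes are hypothetical and only outline the mechanism.

```python
import torch
import torch.nn as nn

class HardSharedMTL(nn.Module):
    """Toy hard-parameter-sharing model: one shared trunk, two task-specific heads."""
    def __init__(self, channels=16, num_det=4, num_seg=2):
        super().__init__()
        # Shared layers: identical weights and biases are used by both tasks.
        self.shared = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.det_head = nn.Conv2d(channels, num_det, 1)   # task-specific detection head
        self.seg_head = nn.Conv2d(channels, num_seg, 1)   # task-specific segmentation head

    def forward(self, x):
        feats = self.shared(x)                  # common representation for all tasks
        return self.det_head(feats), self.seg_head(feats)

model = HardSharedMTL()
x = torch.randn(2, 3, 64, 64)
det_out, seg_out = model(x)

# Both task losses contribute to the gradients of the shared parameters.
loss_det = det_out.pow(2).mean()                # placeholder detection loss
loss_seg = seg_out.pow(2).mean()                # placeholder segmentation loss
(1.0 * loss_det + 0.5 * loss_seg).backward()    # shared layers updated from both tasks
```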
This study introduces MTL-FSFDet, a model for detecting forest smoke and fires utilizing multi-task learning. It plays a vital role in addressing challenges such as low detection precision and difficulties in spotting small targets in complex forests. As shown in Figure 4, the image data preprocessed by BF-MSR are fed into MTL-FSFDet as input. In the backbone, C2f_Hybrid extracts local and global features of the image. Subsequently, these features are transmitted to the neck, where DySample captures tiny features and CGAFusion performs a weighted fusion of low-level and high-level features. Finally, the fused feature maps are sent to the segmentation head and the detection head, respectively, which output the boundaries of the fire regions and the detection results.
The entire model includes two tasks: a task for detecting smoke and fire, and another task for segmenting smoke and fire. Among them, the object detection task is the primary task, and the segmentation task serves as an auxiliary task to assist in improving the detection performance. These two tasks share the feature extraction module and the multi-scale feature fusion network, which can enhance the performance of their respective task-guiding networks.
2.3.3. Hybrid Feature Extraction Block
In the backbone of YOLOv8, the original C2f module primarily consists of simple convolutional stacks, which excel at extracting local features but lack the ability to perceive global features. When dealing with forest smoke and fire images with complex backgrounds, where smoke and flames exhibit varied shapes and are influenced by factors like illumination and occlusion, this emphasis on local feature extraction often leads to difficulties for the model in distinguishing between targets and backgrounds, causing erroneous detections and omissions. In response to this challenge, this study introduces a Hybrid Feature Extraction Block [25], a convolution-self-attention module, to serve as an alternative to the bottleneck in C2f, forming a new C2f_Hybrid. This design combines the ability of convolutions to learn local relative positional information with the ability of the self-attention mechanism to capture global context, which permits the model to better detect forest smoke and fires within complex backgrounds.
In the Hybrid module, depicted in Figure 5, the input feature map X, sized C × H × W, is split evenly into two sub-feature maps along its channel dimension: X1 and X2. X1 is processed by IDConv, which performs local feature aggregation by injecting inductive biases to generate a local feature map X1′, while X2 is processed by OSRA, which extracts global features by expanding the receptive field, yielding a global feature map X2′. After these two steps, both feature maps have dimensions C/2 × H × W; the height and width stay unchanged, while the channel count is halved. Subsequently, X1′ and X2′ are concatenated along the channel dimension, forming an output feature map X′ that regains the original dimensions. Lastly, to further enhance the efficiency of feature fusion, a lightweight STE is introduced to obtain the final features. The process of the Hybrid module can be represented as

X1, X2 = Split(X)
X1′ = IDConv(X1), X2′ = OSRA(X2)
X′ = Concat(X1′, X2′)
Y = STE(X′)

where Y denotes the final output features.
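A minimal PyTorch sketch of this split–process–merge flow is given below. The local and global branches are represented here by simple stand-ins (a depthwise convolution and plain multi-head self-attention); in the actual Hybrid module they correspond to IDConv and OSRA, which are sketched in the following subsections, and the fusion step corresponds to the STE.

```python
import torch
import torch.nn as nn

class HybridBlockSketch(nn.Module):
    """Skeleton of the Hybrid block: channel split -> local/global branches -> concat -> fusion.

    The branch implementations below are simplified stand-ins, not the real IDConv/OSRA/STE.
    """
    def __init__(self, channels=64, num_heads=4):
        super().__init__()
        half = channels // 2
        # Stand-in for IDConv: a depthwise convolution captures local structure.
        self.local_branch = nn.Conv2d(half, half, 3, padding=1, groups=half)
        # Stand-in for OSRA: plain multi-head self-attention over flattened tokens.
        self.global_branch = nn.MultiheadAttention(half, num_heads, batch_first=True)
        # Stand-in for STE: a lightweight fusion of the concatenated branches.
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        x1, x2 = x.chunk(2, dim=1)                          # split along channels: C/2 each
        x1 = self.local_branch(x1)                          # local features (IDConv role)
        tokens = x2.flatten(2).transpose(1, 2)              # (B, H*W, C/2)
        x2, _ = self.global_branch(tokens, tokens, tokens)  # global features (OSRA role)
        x2 = x2.transpose(1, 2).reshape(b, c // 2, h, w)
        out = torch.cat((x1, x2), dim=1)                    # back to C channels
        return self.fuse(out)                               # final fusion (STE role)
```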
(1) Input-dependent Depthwise Convolution (IDConv)
Within the IDConv block, as shown in Figure 6a, given a feature map X of size C × H × W, spatial context information is first aggregated through an adaptive average pooling layer, which reduces the spatial size to K × K, where K denotes the size of the dynamic convolution kernel to be generated. Following this, the pooled feature map is fed into two 1 × 1 convolutions to obtain multiple groups of spatial attention maps S′ with dimensions G × C × K × K, where G indicates the number of groups of spatial attention maps and C stands for the channel count of the initial feature map. To endow the spatial attention maps with adaptive selection properties, a softmax function is applied along the G dimension, yielding normalized attention weights S. Ultimately, the attention weights S are multiplied element-by-element with a set of learnable parameters P of the same dimensions to generate the dynamic convolution kernels W.

IDConv has the capability to adaptively modify the weights of the convolution kernels in response to various input feature maps, thereby dynamically capturing local information and endowing the network with strong inductive biases. The operation of IDConv can be represented as

S′ = Conv1×1(Conv1×1(AvgPoolK×K(X)))
S = SoftmaxG(S′)
W = S ⊙ P
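The PyTorch sketch below follows the IDConv steps described above: adaptive average pooling to K × K, two 1 × 1 convolutions producing G groups of attention maps, a softmax over the group dimension, and an element-wise combination with the learnable parameters P to form per-sample depthwise kernels. The summation over groups and the grouped-convolution trick used to apply per-sample kernels are implementation assumptions of this sketch, not the official code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IDConvSketch(nn.Module):
    """Simplified Input-dependent Depthwise Convolution (illustrative sketch)."""
    def __init__(self, channels, kernel_size=3, groups=4, reduction=4):
        super().__init__()
        self.c, self.k, self.g = channels, kernel_size, groups
        hidden = max(channels // reduction, 8)
        # Two 1x1 convolutions map pooled context to G x C attention maps per kernel cell.
        self.to_attn = nn.Sequential(
            nn.Conv2d(channels, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, groups * channels, 1),
        )
        # Learnable parameter bank P with the same G x C x K x K shape as the attention maps.
        self.P = nn.Parameter(torch.randn(groups, channels, kernel_size, kernel_size) * 0.02)

    def forward(self, x):
        b, c, h, w = x.shape
        ctx = F.adaptive_avg_pool2d(x, self.k)                     # (B, C, K, K) spatial context
        s = self.to_attn(ctx).view(b, self.g, c, self.k, self.k)   # (B, G, C, K, K)
        s = s.softmax(dim=1)                                       # adaptive selection over G
        # Element-wise product with P, summed over G -> per-sample depthwise kernels (assumption).
        kernels = (s * self.P.unsqueeze(0)).sum(dim=1)             # (B, C, K, K)
        # Apply the per-sample depthwise kernels via a grouped convolution over the folded batch.
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       kernels.reshape(b * c, 1, self.k, self.k),
                       padding=self.k // 2, groups=b * c)
        return out.reshape(b, c, h, w)
```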
(2) Overlapping Spatial Reduction Attention (OSRA)
In the OSRA module, as shown in Figure 6b, an Overlapping Spatial Reduction (OSR) technique is introduced to improve how spatial structures are represented by the Multi-Head Self-Attention (MHSA) mechanism. By leveraging enlarged, overlapping patch units, this approach can better capture spatial details in the edge regions of patches, thereby significantly enhancing the MHSA mechanism's ability to discern spatial structures, and it fuses global information via the overlapping regions between patches. The attention in OSRA follows the standard scaled dot-product form, Softmax(QKᵀ/√d + B)V, where the queries Q are projected from the input feature map, the keys K and values V are projected from the output of OSR, LR(·) represents a localized enhancement block built on a 3 × 3 depthwise convolution that refines these reduced tokens, B denotes the matrix of relative positional biases used to encode spatial relationships within the attention map, and d represents the channel count of each individual attention head.
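Below is a hedged PyTorch sketch of the OSRA idea: queries come from the full-resolution map, while keys and values come from an overlapping spatial reduction (a strided convolution whose kernel is larger than its stride), refined by a 3 × 3 depthwise convolution standing in for LR(·). The relative positional bias B is omitted for brevity, and all layer choices are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class OSRASketch(nn.Module):
    """Simplified Overlapping Spatial Reduction Attention (illustrative sketch)."""
    def __init__(self, channels, num_heads=4, sr_ratio=2):
        super().__init__()
        self.heads, self.dim_head = num_heads, channels // num_heads
        # Overlapping Spatial Reduction: kernel larger than stride -> overlapping patches.
        self.osr = nn.Conv2d(channels, channels, kernel_size=sr_ratio + 3,
                             stride=sr_ratio, padding=(sr_ratio + 3) // 2)
        # Local refinement (LR): 3x3 depthwise conv on the reduced tokens.
        self.lr = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.q = nn.Conv2d(channels, channels, 1)
        self.kv = nn.Conv2d(channels, channels * 2, 1)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).reshape(b, self.heads, self.dim_head, h * w).transpose(-2, -1)
        reduced = self.osr(x)
        reduced = reduced + self.lr(reduced)                      # LR(): local enhancement
        k, v = self.kv(reduced).chunk(2, dim=1)
        k = k.reshape(b, self.heads, self.dim_head, -1)
        v = v.reshape(b, self.heads, self.dim_head, -1).transpose(-2, -1)
        # Standard scaled dot-product attention (relative position bias B omitted here).
        attn = (q @ k / self.dim_head ** 0.5).softmax(dim=-1)     # (B, heads, HW, hw)
        out = (attn @ v).transpose(-2, -1).reshape(b, c, h, w)
        return self.proj(out)
```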
(3) Squeezed Token Enhancer (STE)
Within the STE module, shown in Figure 6c, the input feature map X is first processed by a 3 × 3 depthwise convolution to enhance local feature correlations. Subsequently, a 1 × 1 convolution serves to reduce the channel count, with the goal of lessening the model's computational burden, and a residual connection is introduced to preserve the feature representation capability. Compared to traditional methods that directly use a single 1 × 1 fully connected convolution layer for feature fusion, the STE adopted in this study exhibits better performance and more favorable computational complexity. The process of the STE can be represented as

STE(X) = Conv1×1(DWConv3×3(X)) + X
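A compact PyTorch sketch of the STE, matching the description above (3 × 3 depthwise convolution, channel squeeze with 1 × 1 convolutions, and a residual connection), is given below; the squeeze-and-restore structure and the squeeze ratio are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class STESketch(nn.Module):
    """Squeezed Token Enhancer sketch: DWConv 3x3 -> 1x1 squeeze -> 1x1 restore -> residual."""
    def __init__(self, channels, squeeze_ratio=4):
        super().__init__()
        squeezed = max(channels // squeeze_ratio, 8)
        self.dw = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)  # local correlation
        self.squeeze = nn.Conv2d(channels, squeezed, 1)    # reduce channels to cut computation
        self.restore = nn.Conv2d(squeezed, channels, 1)    # restore the channel count

    def forward(self, x):
        # Residual connection preserves the feature representation capability.
        return x + self.restore(self.squeeze(self.dw(x)))
```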
2.3.4. DySample
In the neck of YOLOv8, the original UpSample achieves upsampling by interpolating the entire feature map, a process that not only consumes substantial computational resources, limiting the model's ability to stay lightweight when detecting forest fires, but also leads to the omission of certain details in the upsampling, thus increasing the likelihood of overlooking small fire and smoke objects. In response to this problem, this study introduces a lightweight and effective upsampling method, DySample [26], as an alternative to UpSample.
DySample achieves the upsampling process through dynamic sampling, a method that does not rely on additional CUDA packages, significantly reducing computational complexity. Furthermore, utilizing the concept of point sampling, DySample divides a single point into several smaller points. This innovative design enhances edge clarity, assisting the model in retaining feature details more completely and elevating its precision when detecting small targets within forest fires.
Figure 7 illustrates the structure of DySample.
Figure 7a depicts the sampling process, which is grounded in dynamic upsampling. Given a feature map X (sized C × H × W) and a sampling set S (with dimensions 2 × sH × sW), the grid_sample function uses the coordinate information in the sampling set S to carry out bilinear interpolation resampling on the input feature map X, generating a new feature map X′ of size C × sH × sW. This step is defined as follows:

X′ = grid_sample(X, S)
Figure 7b illustrates the point sampling generator. Let the upsampling scale factor be s. Given a feature map X of size C × H × W, offsets O of size 2s² × H × W are generated by a linear layer with C input channels and 2s² output channels. Subsequently, the offsets O are reshaped to size 2 × sH × sW through pixel shuffling. The sampling set S is obtained by adding the offsets O to the initial sampling grid G. The entire process can be depicted as follows:

O = Linear(X)
S = G + PixelShuffle_s(O)
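The following PyTorch sketch follows the two DySample steps described above: a 1 × 1 convolution produces 2s² offset channels, pixel shuffle rearranges them into a 2 × sH × sW offset field, the offsets are added to an initial sampling grid, and grid_sample performs the bilinear resampling. The offset scaling and coordinate normalization are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleSketch(nn.Module):
    """Simplified dynamic upsampling via point sampling (illustrative sketch)."""
    def __init__(self, channels, scale=2):
        super().__init__()
        self.s = scale
        # Offset generator: C input channels -> 2*s^2 output channels.
        self.offset = nn.Conv2d(channels, 2 * scale * scale, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        s = self.s
        sh, sw = h * s, w * s
        # Offsets O: (B, 2s^2, H, W) -> pixel shuffle -> (B, 2, sH, sW).
        o = F.pixel_shuffle(self.offset(x), s) * 0.25          # small offset scaling (assumption)
        # Initial sampling grid G in the normalized [-1, 1] coordinates expected by grid_sample.
        ys = (torch.arange(sh, device=x.device) + 0.5) / sh * 2 - 1
        xs = (torch.arange(sw, device=x.device) + 0.5) / sw * 2 - 1
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=0).unsqueeze(0)       # (1, 2, sH, sW), (x, y) order
        # Sampling set S = G + O, converting pixel offsets to the normalized range.
        pos = grid + o * torch.tensor([2.0 / w, 2.0 / h], device=x.device).view(1, 2, 1, 1)
        # Bilinear resampling of X at the dynamic sampling points.
        return F.grid_sample(x, pos.permute(0, 2, 3, 1),
                             mode="bilinear", align_corners=False)
```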
2.3.5. Content-Guided Attention Fusion (CGAFusion)
In the demanding context of detecting smoke and fire in forests, the attention mechanism refines the input features through weighted processing to emphasize key details and diminish non-essential or redundant information, thus significantly improving the efficacy of feature fusion. Traditional attention modules generally comprise two parts, channel attention and spatial attention, which exert their functions by computing attention weights. Specifically, channel attention [27] is responsible for recalibrating features, while spatial attention [28] produces a spatial importance map (SIM) to depict the salience of various areas, which is crucial for precisely locating forest smoke or fire areas.
However, when facing complex forest environments and varied smoke and fire patterns, the inherent limitations of traditional attention become apparent. Spatial attention needs to overcome feature-level inhomogeneity to accurately capture the faint information of smoke, while channel attention, due to its lack of context analysis capability, struggles to fully understand the complex scenarios of fire sites. Furthermore, these two types of attention operate independently, lacking effective interaction, which prevents them from synergistically enhancing feature representations.
To tackle this problem, this study introduces a Content-Guided Attention (CGA) [29] module. CGA resolves the problem of insufficient information correlation between features by generating a specific SIM for each channel. This design promotes a deep interaction between spatial and channel attention, enhances the understanding of context information, and allows the model to precisely identify small smoke and fire objects in complex scenarios. The operation of CGA is illustrated in Figure 8.
Considering an input feature X of size C × H × W, the objective of CGA is to produce a channel-specific SIM, denoted as W, which has the same dimensions as X. First, a spatial attention map Ws and a channel attention map Wc are calculated separately: Ws is computed from the features obtained by channel global maximum pooling and channel global average pooling, while Wc is computed from the spatially global-average-pooled features through convolutions with a ReLU activation, max(0, x). Subsequently, Ws and Wc are fused through simple addition to obtain a coarse SIM Wcos. Wcos and each channel of X are then rearranged alternately through channel shuffling (CS) and passed through a sigmoid activation σ to obtain the final channel-specific SIM W. Therefore, CGA allocates a distinct SIM to every channel, steering the model's attention towards important regions of smoke and flame within each channel.
YOLOv8 adopts a feature fusion strategy that combines upsampling and concatenation. During the upsampling process, the receptive fields of low-level features and high-level feature maps may not align, leading to spatial misalignment in the fused features. Low-level features capture edge details and texture information, which are used to distinguish targets from objects with similar visual characteristics (such as smoke and clouds, or flames and the sun). High-level features emphasize the extraction of semantic information (such as the spread of smoke and the shape of flames), which is used to distinguish targets from non-targets in complex environments. Owing to the substantial disparities in encoded information [30] and receptive fields between low-level and high-level features, merely concatenating them may fail to achieve effective complementarity between the two, and may even lead to the loss of key information.
To resolve this challenge, this study presents CGAFusion, a feature fusion approach based on CGA. It takes low-level features from the backbone and the corresponding high-level features from the neck as inputs into CGA for weighted fusion. This retains fine-grained details from low-level features and semantic information from high-level features, enhancing the feature expression capability and boosting detection accuracy for small objects.
Figure 9 shows how low-level and high-level features are input into the CGA to compute weights for each feature location. Subsequently, a weighted summation is used to merge these features. A skip connection is added to mitigate gradient vanishing and ensure the integrity of information transmission. Lastly, a 1 × 1 convolutional layer is used to map the fused features, resulting in the final feature. The process of CGAFusion can be represented as

F_out = Conv1×1( W ⊙ F_low + (1 − W) ⊙ F_high + (F_low + F_high) )

where F_low and F_high denote the low-level and high-level input features, and W is the channel-specific SIM produced by CGA.
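To make the fusion flow concrete, here is a simplified PyTorch sketch of CGA and CGAFusion as described above. The kernel sizes, the grouped convolution applied after channel shuffling, and the complementary weighting of the low-level and high-level branches are assumptions of this sketch, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class CGASketch(nn.Module):
    """Content-Guided Attention sketch: produces a channel-specific SIM W with the shape of X."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        hidden = max(channels // reduction, 4)
        # Spatial attention Ws from channel-wise max- and average-pooled maps.
        self.sa = nn.Conv2d(2, 1, 7, padding=3)
        # Channel attention Wc from the spatially average-pooled feature (with ReLU).
        self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(channels, hidden, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(hidden, channels, 1))
        # Per-channel refinement after interleaving Wcos with X (grouped conv is an assumption).
        self.refine = nn.Conv2d(2 * channels, channels, 7, padding=3, groups=channels)

    def forward(self, x):
        ws = self.sa(torch.cat((x.max(dim=1, keepdim=True).values,
                                x.mean(dim=1, keepdim=True)), dim=1))  # (B, 1, H, W)
        wc = self.ca(x)                                                # (B, C, 1, 1)
        w_cos = ws + wc                                                # coarse SIM, broadcast to (B, C, H, W)
        # Channel shuffle: interleave each channel of Wcos with the matching channel of X.
        b, c, h, w = x.shape
        mixed = torch.stack((w_cos, x), dim=2).reshape(b, 2 * c, h, w)
        return torch.sigmoid(self.refine(mixed))                      # channel-specific SIM W

class CGAFusionSketch(nn.Module):
    """Weighted fusion of low-level and high-level features guided by CGA."""
    def __init__(self, channels):
        super().__init__()
        self.cga = CGASketch(channels)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, low, high):
        initial = low + high                       # skip-connection input
        w = self.cga(initial)                      # per-location, per-channel weights
        fused = low * w + high * (1.0 - w)         # weighted summation of the two branches
        return self.proj(fused + initial)          # 1x1 conv maps the fused feature
```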
2.3.6. Multi-Task Head
For the detection task and the segmentation task, a specific task head is designed for each. The design of the task heads is illustrated in Figure 10. For the semantic segmentation task head [31], the fused feature map downsampled by eight times from the neck network is used as input. Through a sequence of upsampling and convolutional operations, the feature map is restored to the size of the original input, and the number of feature channels corresponds to the number of semantic segmentation categories.
For the object detection task, this study adopts an anchor-free mechanism. By simplifying the object localization process, the anchor-free approach eliminates the need for calculating and matching multiple anchor boxes, effectively reducing the computational cost and memory footprint and allowing the model to operate efficiently under limited hardware resources. At the same time, it can better handle the diversity of objects in the training and test data without relying on predefined anchor box shapes and sizes, and thus has better generalization capability. The inputs to the object detection task head are features from three different scales in the feature fusion layer. The larger-scale feature map is more attentive to information about small objects, while the smaller-scale feature map focuses more on information about large objects. The final feature maps regress the objects by directly predicting their center points and bounding boxes.
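As a rough sketch, the segmentation head below restores an 8×-downsampled fused feature map to the input resolution through three upsample-and-convolve stages and ends with as many channels as there are segmentation categories. The channel widths are illustrative assumptions, and the detection head is not re-implemented here since the anchor-free YOLOv8-style head is used.

```python
import torch
import torch.nn as nn

class SegHeadSketch(nn.Module):
    """Semantic segmentation head sketch: 8x-downsampled features -> full-resolution class map."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        def up_block(cin, cout):
            # One stage: 2x bilinear upsampling followed by a 3x3 convolution.
            return nn.Sequential(nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                                 nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))
        self.decode = nn.Sequential(
            up_block(in_channels, in_channels // 2),       # 1/8 -> 1/4
            up_block(in_channels // 2, in_channels // 4),  # 1/4 -> 1/2
            up_block(in_channels // 4, in_channels // 4),  # 1/2 -> full resolution
            nn.Conv2d(in_channels // 4, num_classes, 1),   # channels = number of categories
        )

    def forward(self, x):
        return self.decode(x)   # class logits at the original input resolution
```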
2.4. Experiment
2.4.1. Experimental Setup
This study was carried out using PyCharm (PyTorch 2.2.1, Python 3.8, and CUDA 12.1). The CPU was an Intel(R) Core(TM) i7-12700H, and the GPU was an NVIDIA GeForce RTX 4060. The dataset was divided into training, validation, and testing sets in a 7:2:1 ratio. The model was trained from scratch, with the training parameters shown in Table 3.
2.4.2. Evaluation Indicators
To analyze the model performance, this study used two types of evaluation metrics: accuracy metrics and resource metrics. In terms of accuracy metrics, average precision (AP), mean average precision (mAP), precision (P), and recall (R) [32] were selected as evaluation criteria to measure the model's detection capability. In terms of resource metrics, model parameters (Param) and frames per second (FPS) were chosen as evaluation criteria to assess the feasibility of the model's practical application. The formulas are as follows:

P = TP / (TP + FP)
R = TP / (TP + FN)
AP = ∫₀¹ P(R) dR
mAP = (1/N) Σ_{i=1}^{N} AP_i

where TP, FP, and FN represent the numbers of true positives, false positives, and false negatives, respectively, and N is the number of classes.
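For reference, a minimal NumPy sketch of these accuracy metrics is given below; the AP computation uses simple all-point interpolation over a precision-recall curve, which may differ in detail from the evaluation code actually used in this study.

```python
import numpy as np

def precision(tp, fp):
    return tp / (tp + fp + 1e-9)

def recall(tp, fn):
    return tp / (tp + fn + 1e-9)

def average_precision(precisions, recalls):
    """Area under the precision-recall curve with all-point interpolation."""
    order = np.argsort(recalls)
    r = np.concatenate(([0.0], np.asarray(recalls, dtype=float)[order], [1.0]))
    p = np.concatenate(([1.0], np.asarray(precisions, dtype=float)[order], [0.0]))
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

def mean_average_precision(ap_per_class):
    # mAP is the mean of the per-class AP values.
    return float(np.mean(ap_per_class))
```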
2.4.3. Ablation Experiment
To verify the effectiveness of each improved module in MTL-FSFDet, five groups of ablation experiments were conducted, with each group using the same environment and training parameters. The results are shown in Table 4.

From Table 4, it can be seen that the introduction of BF-MSR improved the model's mAP@0.5 from 80.2% to 80.8%. This is because BF-MSR enhanced the quality of low-illumination smoke images, making the smoke features clearer and thus improving the accuracy of feature extraction.
After the introduction of C2f_Hybrid, the model’s mAP@0.5 increased from 80.8% to 82.8%. This is because C2f_Hybrid can capture both global and local features, providing richer semantic information and detailed representations, which makes the model more accurate and robust in detecting smoke and fires in complex forest scenarios.
The introduction of DySample further increased the model's mAP@0.5 from 82.8% to 83.3%. This is because DySample, based on point sampling for upsampling, improved the feature resolution and detail capturing ability, thereby enhancing the recognition accuracy for smoke and fires.
After the introduction of CGAFusion, the model’s mAP@0.5 rose from 83.3% to 84.2%. This is because CGAFusion synergizes spatial and channel attention, weightedly fusing low-level edge information with high-level semantic information, enabling it to more effectively capture subtle features of small smoke and fire targets in complex backgrounds.
Finally, the introduction of the multi-task head increased the model’s mAP@0.5 from 84.2% to 85.5%. This is because multi-task learning shares the feature extraction module and enhances the understanding of target edges, shapes, and details in the detection task by introducing supervision of the segmentation task, thereby improving the detection performance for small targets and complex scenarios.
The accuracy for smoke and fire is shown in Table 5.
2.4.4. Comparative Experiment
To further confirm the advantages of MTL-FSFDet for detecting forest smoke and fire, it was compared with several mainstream detection methods. The comparative results are presented in Table 6.
The table indicates that MTL-FSFDet outperformed the majority of the other models in the object detection task, especially in terms of mAP@0.5, precision, and recall, achieving 85.5%, 83.1%, and 77.9%, respectively. Compared to YOLOv8, these values are higher by 5.3, 4.4, and 4.5 percentage points, respectively, and MTL-FSFDet outperformed the other models to varying degrees. There was a minor rise in parameter count and a slight drop in FPS, but these did not affect the model's real-time detection performance.
2.4.5. Result Analysis
To visually showcase the effectiveness of our MTL-FSFDet model, this section selects different scenarios including tiny objects, tree obstructions, interfering images, and complex backgrounds to assess the model’s performance.
As shown in Figure 11, the original model struggled to identify the small fire targets in image (a) and the small smoke targets in image (d), and its perception range for smoke in image (a) was very limited, making it difficult to achieve comprehensive coverage.
However, the MTL-FSFDet model demonstrated excellent performance, not only accurately identifying these small fire and smoke targets but also significantly expanding the perception range for smoke. Compared to the original model, MTL-FSFDet achieved a more extensive and comprehensive smoke detection.
As shown in Figure 12, the initial model had difficulty precisely identifying the tiny fire obscured by trees in image (a), as well as the faint and sparse smoke partially hidden in image (c). These forest smokes and fires appeared blurry and indistinguishable due to the obstruction of trees, posing significant challenges in the recognition task.
However, the MTL-FSFDet model proposed in this study demonstrated remarkable recognition capabilities. It could accurately capture the subtle features of forest smokes and fires in complex environments with tree obstructions, and its recognition confidence significantly exceeded that of the initial model.
This section selects four groups of interfering images for study: one group of smoke images under cloud interference, one group of smoke images under haze interference, one group of images with sun interference, and one group of images with maple leaf interference, with the results presented in Figure 13. In the cloud interference scenario shown in Figure 13a, because of the striking resemblance in visual attributes between clouds and smoke, the original model had difficulty accurately distinguishing between them, leading to missed detections of smoke. In the haze interference scenario shown in Figure 13d, although the original model could identify smoke, it produced duplicate detection boxes with low confidence levels. In the sun interference image shown in Figure 13g, the sun's rays share some visual similarities with flames, causing the original model to mistakenly identify the sun as a flame, resulting in false positives. In the maple leaf interference image shown in Figure 13j, the original model did not produce false positives.
In contrast, the MTL-FSFDet model proposed in this study demonstrated significant advantages. This model could accurately identify smoke under cloud and haze interference, and effectively distinguished between the sun, maple leaves, and flames, avoiding missed detections and false positives. These results indicate that MTL-FSFDet had a higher accuracy and robustness in handling smoke and flame recognition tasks in confusing scenarios.
Given the high complexity and variability of forest smoke and fire scenarios, where numerous hardly noticeable fire spots are often scattered and accompanied by diffuse smoke, it is particularly important to accurately identify and distinguish every instance of smoke and fire in these scenarios. As shown in Figure 14, the original model had obvious limitations when dealing with such complex environments, missing many obscured and barely visible small fire spots, and its smoke perception range was relatively limited.
In contrast, the MTL-FSFDet model demonstrated significant advantages. Even with such complex backgrounds, it could still accurately detect the obscured and inconspicuous small fire spots, and its smoke perception range was more comprehensive.