This section discusses the proposed deep learning-based forest fire smoke detection model. Through our model, small target smoke in mountains and forests can be identified more accurately and quickly, so as to detect and prevent forest fires as early as possible.
The improved YOLOv5s architecture we propose is illustrated in
Figure 2, and the changes are framed by the solid green line. It comprises three primary components: the backbone, the neck, and the prediction heads. The backbone network consists of FasterBlock modules, designed based on partial convolution (PConv) that offers rapid memory access capabilities. Additionally, we integrated a CA module at the end of the backbone, effectively focusing the model’s attention on the foreground smoke targets and distinguishing them from the background to further enhance feature extraction. Lastly, the model utilizes four prediction heads, incorporating an additional small object detection head and a large feature map to reinforce feature extraction for small-scale targets. This integration enables the model to establish long-range dependencies and capture global contextual information within the input image, allowing for an improved understanding of the semantic and spatial relationships of objects, thus providing powerful foreground-background distinction and small-scale smoke recognition capabilities.
3.1.1. Original YOLOv5
Our proposed methodology builds upon the YOLOv5s model, a widely-utilized framework for object detection.
Figure 3 provides an overview of the YOLOv5 architecture, which can be delineated into three primary components: the backbone network for extracting features, the neck network for fusing features, and the head network for detecting the class and location of the target through regression. The YOLOv5 architecture is characterized by its straightforwardness and efficiency.
The YOLOv5 model incorporates adaptive image scaling and adaptive anchor box calculation on the input images to enhance its performance. The initial anchor box sizes are determined automatically by the model as part of image-data preprocessing; during training, K-means clustering is employed to ascertain the optimal anchor box sizes based on the annotated samples.
The backbone network of YOLOv5 comprises CBS modules (Conv, batch normalization, and activation) and CSP modules, along with an SPP structure. The CBS module extracts feature information from the images through convolutional operations. To handle non-uniform input image sizes, an SPP layer is introduced into the backbone network.
The neck of YOLOv5 consists of a top-down Feature Pyramid Network (FPN) and a bottom-up Path Aggregation Network (PAN). The fusion of multi-scale features from FPN and PAN enables the feature map to carry both semantic and fine spatial information, thereby ensuring the precise identification of targets of varying sizes.
3.1.2. K-Means++ Methodology
YOLOv5s incorporates anchor boxes into its object detection procedure. Anchor boxes are predefined bounding boxes with fixed sizes and aspect ratios. During training, the initial anchor boxes are adjusted to align with the ground-truth bounding boxes, which enables the model to train effectively and produce more precise predictions. Consequently, the anchor parameters of the original YOLOv5s model must be adapted to the specific characteristics of each training dataset. For YOLOv5s, the widths and heights of nine clustering centers must be determined and used as the anchor parameters in the network configuration file. K-means clustering, renowned for its simplicity and efficiency, is widely used for this purpose: within YOLOv5s, it yields an initial set of k anchor boxes. However, K-means is sensitive to the random choice of initial clustering centers, which makes good values difficult to obtain. To overcome this limitation, we employed the K-means++ method, which improves the selection of the initial points, to acquire the initial anchor boxes. This substantially reduces the clustering error and yields anchor boxes better suited to the detection of small-scale smoke.
The procedure for selecting an anchor box utilizing K-means++ method is as follows:
(1) Randomly select one sample from the given dataset to be the first clustering center.
(2) For each sample $x$, calculate the Euclidean distance $D(x)$ between $x$ and its nearest existing clustering center. The probability $P(x)$ of a sample being chosen as the subsequent center is determined using Equation (1):

$$P(x) = \frac{D(x)^2}{\sum_{x \in X} D(x)^2} \quad (1)$$

where $D(x)$ represents the Euclidean distance from sample $x$ to its nearest clustering center, and each sample is selected as the next center with probability $P(x)$.
(3) Determine the subsequent clustering center by roulette-wheel ("random turntable") selection according to these probabilities.
(4) Repeat steps (2) and (3) until k clustering centers have been confirmed; the value of k is specified in advance (k = 9 for YOLOv5s).
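The four steps above can be sketched as follows (a minimal NumPy illustration; the function name and the implementation of roulette-wheel selection via weighted sampling are ours, not from the paper):

```python
import numpy as np

def kmeans_pp_centers(boxes, k, rng=None):
    """Select k initial cluster centers from `boxes` (an n x d array of
    anchor widths/heights) following the K-means++ procedure."""
    rng = np.random.default_rng(rng)
    n = boxes.shape[0]
    # step (1): pick the first center uniformly at random
    centers = [boxes[rng.integers(n)]]
    while len(centers) < k:
        # step (2): squared Euclidean distance from each sample
        # to its nearest existing center
        d2 = np.min(((boxes[:, None, :] - np.array(centers)[None, :, :]) ** 2)
                    .sum(axis=2), axis=1)
        # Equation (1): P(x) = D(x)^2 / sum of D(x)^2
        p = d2 / d2.sum()
        # step (3): roulette-wheel selection of the next center
        centers.append(boxes[rng.choice(n, p=p)])
    return np.array(centers)
```

For YOLOv5s, `boxes` would hold the annotated box widths and heights and `k = 9`; the returned centers then serve as the anchor parameters.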
3.1.3. The Design of Backbone
In the original YOLOv5’s backbone network, the utilization of conventional convolutional CBS modules leads to a considerable redundancy in the intermediate feature map computation, resulting in increased computational costs. To address this, we introduced the FasterBlock module, drawing inspiration from the concept of FasterNet [
56], to serve as the backbone network for extracting features from UAV images. We employed the innovative technique of partial convolution (PConv), which enables a more efficient extraction of spatial features by reducing redundant computations and memory access simultaneously.
The PConv technique offers a computationally efficient solution by applying filters exclusively to a subset of the input channels while leaving the remaining channels untouched. Compared to standard convolution, PConv requires fewer floating-point operations (FLOPs), while achieving a higher effective throughput (FLOPS, floating-point operations per second) than depthwise/group convolution. The design of PConv is depicted in
Figure 4, which leverages redundancy within the feature maps and selectively applies regular convolution (Conv) only to a subset of the input channels for spatial feature extraction, leaving the remaining channels unaffected. For contiguous or regular memory access, PConv treats the first or last contiguous $c_p$ channels as representatives of the entire feature map for computation. Without loss of generality, we assume that the input and output feature maps possess an equal number of channels. PConv exhibits superior computational efficiency compared to regular convolution, albeit being more computationally intensive than depthwise/group convolution (DWConv/GConv); essentially, PConv makes fuller use of the computational capacity of the device it runs on. The FLOPs of a PConv are only:

$$h \times w \times k^2 \times c_p^2 \quad (2)$$

where $h$ and $w$ are the height and width of the feature map, respectively; $k$ is the size of the convolution kernel; and $c_p$ is the number of channels processed by regular convolution. Thus, the FLOPs of PConv are only $\frac{1}{16}$ of those of a regular convolution for a typical partial ratio $r = \frac{c_p}{c} = \frac{1}{4}$.
Additionally, PConv has a smaller amount of memory access:

$$h \times w \times 2c_p + k^2 \times c_p^2 \approx h \times w \times 2c_p \quad (3)$$

which is only $\frac{1}{4}$ of that of a regular convolution for a typical partial ratio $r = \frac{1}{4}$. The remaining $(c - c_p)$ channels are not involved in the computation, so there is no need to access them in memory.
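The FLOPs and memory-access expressions above can be checked numerically (a small sketch; the symbol names follow the formulas, with partial ratio r = c_p / c):

```python
def pconv_cost(h, w, k, c, r=0.25):
    """Compare the FLOPs and memory access of PConv (operating on
    c_p = r * c channels) against a regular convolution over all c channels."""
    cp = int(c * r)
    flops_pconv = h * w * k * k * cp * cp            # h*w*k^2*c_p^2
    flops_conv  = h * w * k * k * c * c              # h*w*k^2*c^2
    mem_pconv   = h * w * 2 * cp + k * k * cp * cp   # ~= h*w*2*c_p when h, w >> k
    mem_conv    = h * w * 2 * c + k * k * c * c
    return flops_pconv / flops_conv, mem_pconv / mem_conv

# with r = 1/4, the FLOPs ratio is exactly (c_p/c)^2 = 1/16,
# and the memory-access ratio is roughly 1/4
print(pconv_cost(56, 56, 3, 64))
```

The feature-map size 56 x 56 and channel count 64 are arbitrary example values; the FLOPs ratio depends only on r.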
Furthermore, we employed 3 × 3 convolution kernels. Two stacked 3 × 3 kernels have the same receptive field as one 5 × 5 kernel, and three stacked 3 × 3 kernels match the receptive field of one 7 × 7 kernel. For the same receptive field, three 3 × 3 convolution kernels require fewer parameters than a single 7 × 7 kernel, which reduces model complexity and accelerates training. In addition, the stacked 3 × 3 convolutions introduce more nonlinearity and can therefore represent more intricate functions.
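The parameter comparison can be verified with a quick calculation (bias terms omitted; the channel count c is an arbitrary example and the 27:49 ratio holds for any c):

```python
c = 64  # example channel count

# parameters of three stacked 3x3 convolutions (c -> c channels each)
params_three_3x3 = 3 * (3 * 3 * c * c)   # = 27 * c^2

# parameters of a single 7x7 convolution with the same receptive field
params_one_7x7 = 7 * 7 * c * c           # = 49 * c^2

print(params_three_3x3, params_one_7x7)
```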
Consequently, we devised the FasterBlock module by leveraging PConv.
Figure 5 illustrates the structure and functioning of FasterBlock, in which PConv is employed to reduce computational redundancy and memory access. We placed a BN layer and a ReLU layer between the convolutions; the benefit of BN is that it can be merged into the adjacent Conv layers by structural reparameterization for faster inference. By incorporating FasterBlock into the YOLOv5 backbone, we replaced certain CSP modules while preserving the overall YOLOv5 architecture.
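As an illustration of the PConv operation inside FasterBlock, the sketch below applies a regular 3 × 3 convolution (same padding) to the first c_p channels and passes the remaining channels through untouched (a minimal NumPy sketch; the function and variable names are ours, and the BN/ReLU layers are omitted):

```python
import numpy as np

def pconv_forward(x, weight, cp):
    """PConv sketch: convolve only the first `cp` channels of x and copy
    the remaining (c - cp) channels through unchanged.
    x: (c, h, w); weight: (cp, cp, 3, 3)."""
    c, h, w = x.shape
    xp = np.pad(x[:cp], ((0, 0), (1, 1), (1, 1)))  # zero-pad for same output size
    out = x.copy()                                 # untouched channels pass through
    for o in range(cp):                            # each convolved output channel
        acc = np.zeros((h, w))
        for i in range(cp):                        # each convolved input channel
            for u in range(3):
                for v in range(3):
                    acc += weight[o, i, u, v] * xp[i, u:u + h, v:v + w]
        out[o] = acc
    return out
```

A real implementation would use a framework's convolution primitive; the loops here only make the channel-slicing behavior of PConv explicit.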
3.1.5. Coordinate Attention Mechanism
It has been demonstrated that the incorporation of the channel attention mechanism yields significant performance enhancements to YOLOv5 [
33]. However, the utilization of such a mechanism can lead to the issue of neglecting spatial location information within high-level feature maps. Prominent attention mechanisms in this context include SE (Squeeze and Excitation) [
57] and CBAM (Convolutional Block Attention Module) [
58]. Among these, SE focuses solely on reweighting each channel by modeling inter-channel relationships, overlooking the location information and spatial structure that are essential for generating spatially selective attention maps. CBAM, on the other hand, encodes global spatial information through channel-wise global pooling, compressing it into a single channel descriptor; as a result, it struggles to preserve the spatial location information of smoke within those channels.
The CA (Coordinate Attention) [
59] module considers both channel relationships and location information within the feature space. Its essence lies in encoding channel relationships and long-range dependencies through precise location information. The CA module decomposes attention into the X-direction and the Y-direction, employing one-dimensional feature encoding along each to establish long-range spatial relationships between distant points and thus acquire more accurate location information. Direction-sensitive and location-sensitive feature maps are then formed via this encoding, enhancing the representation of the target of interest by incorporating features with precise location information.
Figure 6 illustrates the process, which can be divided into two steps.
(1) Coordinate information embedding
The typical method of encoding the spatial location of smoke images through channel attention involves global pooling. This involves pooling low-level features with abundant spatial location information to acquire high-level semantic features. However, this approach is often unable to retain global spatial location information. To address this limitation, we used two one-dimensional feature encodings to decompose the global pooling. This enables greater interaction between distant points and better preserves spatial location information.
The pooling operation is conducted separately in the horizontal and vertical directions, namely, average pooling along the x-axis and average pooling along the y-axis.
Denote $H$ as the height of the input feature map $X$ and $W$ as its width. Coordinate attention encoding is applied to each channel (denoted by $c$) of $X$ along both the x-axis and y-axis directions by decomposing the conventional global average pooling

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j) \quad (4)$$

where $x_c$ is the feature map of the $c$-th channel. The output of the $c$-th channel at height $h$ in the horizontal direction (x-axis direction) after pooling is then characterized by:

$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) \quad (5)$$

Similarly, the output of the $c$-th channel at width $w$ in the vertical direction (y-axis direction) can be written as:

$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \quad (6)$$
The two pooling operations above act along different directions of the same feature map, and the resulting aggregated features are sensitive to the values along both the x-axis and y-axis directions. These two transformations allow the attention module to capture long-range dependencies along one spatial dimension while retaining the precise location information of features in the other spatial dimension, helping the network to identify the relevant regions more accurately.
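The two directional pooling operations reduce to per-axis means (a NumPy illustration; the function name is ours):

```python
import numpy as np

def coordinate_pooling(x):
    """Directional average pooling used by the CA module (a sketch).
    x: feature map of shape (C, H, W).
    Returns z_h of shape (C, H): each row pooled over the width (x-axis),
    and z_w of shape (C, W): each column pooled over the height (y-axis)."""
    z_h = x.mean(axis=2)  # pool along the x-axis, keeping one value per height
    z_w = x.mean(axis=1)  # pool along the y-axis, keeping one value per width
    return z_h, z_w
```

Unlike global pooling, which collapses each channel to a single scalar, the two outputs together retain one value per row and per column of the feature map.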
(2) Coordinate attention generation
The method described in step (1) decomposes global pooling over the feature map into two directions, resulting in pooled features with a larger perceptual field that fully utilize the information near the foreground smoke target. This pooling method allows distant points along the same feature dimension to maintain their mutual relationships. To integrate these transformed features into the neural network, final features with attention weights must then be generated.
After the information embedding, attention generation consists of information fusion and convolutional transformation. Information fusion concatenates the features pooled along the two directions, followed by convolution, batch normalization, and nonlinear activation, as shown in Equation (7):

$$f = \delta\left(F_1\left(\left[z^h, z^w\right]\right)\right) \quad (7)$$

where $[\cdot, \cdot]$ denotes the concatenation of the two feature maps of different orientations along the spatial dimension; $F_1$ denotes a $1 \times 1$ convolution; $\delta$ is the nonlinear activation function; and $f \in \mathbb{R}^{C/r \times (H+W)}$ is the intermediate feature map in which spatial information is encoded in the horizontal and vertical directions. Here, $r$ is the reduction ratio regulating the dimensionality; an appropriate ratio $r$ is chosen to reduce the number of channels in the feature vector and improve the efficiency of network training. The intermediate feature map $f$ is then decomposed along the x-axis and y-axis directions into $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$, which correspond to the horizontal and vertical dimensions of the feature map, respectively. A convolutional transform and nonlinear activation are applied to the two tensors, as shown in Equations (8) and (9), respectively:

$$g^h = \sigma\left(F_h\left(f^h\right)\right) \quad (8)$$

$$g^w = \sigma\left(F_w\left(f^w\right)\right) \quad (9)$$
where $F_h$ and $F_w$ are $1 \times 1$ convolutional transforms, $\sigma$ is the Sigmoid activation function, and the outputs $g^h$ and $g^w$ are the attention weights in the horizontal and vertical directions (x-axis and y-axis directions) of the input $X$, respectively.

Ultimately, the output $y_c(i, j)$ of the coordinate attention module on the $c$-th channel of the input $X$, where $i$ and $j$ index its height and width, can be expressed as Equation (10):

$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j) \quad (10)$$
By multiplying the input feature map $X$ by the attention weights $g^h$ and $g^w$ along the x-axis and y-axis directions, respectively, we generate an output feature map weighted across both the width and height dimensions.
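The attention-generation pipeline of Equations (7)-(10) can be sketched end to end (a NumPy illustration under simplifying assumptions: the 1 × 1 convolutions F1, Fh, and Fw act per position and so reduce to matrices over the channel dimension, δ is taken to be ReLU, and batch normalization is omitted; all names are ours):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def coordinate_attention(x, F1, Fh, Fw):
    """Coordinate attention sketch. x: (C, H, W) input feature map.
    F1: (C//r, C) fusion transform; Fh, Fw: (C, C//r) directional transforms."""
    C, H, W = x.shape
    z_h = x.mean(axis=2)                       # (C, H): pooling along the width
    z_w = x.mean(axis=1)                       # (C, W): pooling along the height
    f = np.concatenate([z_h, z_w], axis=1)     # concatenate along the spatial dim
    f = np.maximum(F1 @ f, 0.0)                # Eq. (7): fusion + ReLU as delta
    f_h, f_w = f[:, :H], f[:, H:]              # split back into the two directions
    g_h = sigmoid(Fh @ f_h)                    # Eq. (8): (C, H) horizontal weights
    g_w = sigmoid(Fw @ f_w)                    # Eq. (9): (C, W) vertical weights
    # Eq. (10): reweight each position by both directional attention weights
    return x * g_h[:, :, None] * g_w[:, None, :]
```

Because the sigmoid outputs lie in (0, 1), every position of the input is attenuated according to how informative its row and column are judged to be.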
We added a CA module to the YOLOv5 model to increase its capability to capture smoke features in complex backgrounds and to improve its attention to small-scale smoke.