CLSANet considers the essential attributes of multi-level features to adaptively reweight them during hierarchical aggregation, and its cross-layer connections are deliberately biased toward the preferences of each layer. In this section, the overall architecture of the CLSANet model is presented first. Then we introduce the details of the three novel modules, SPM, TFM, and FSCHead, respectively. In particular, SPM is applied in the backbone, TFM is embedded at the end of feature encoding, and FSCHead serves directly as the head of the whole network. These novel cross-layer connection structures effectively prevent the dilution of meaningful information in pursuit of accurate smoke detection. Finally, we briefly describe the loss function used for training.
3.1. Overall Network Architecture
We propose the lightweight CLSANet for multi-scale convergence of smoke features with different preferences, and Figure 2 shows its overall architecture. Specifically, besides the usual convolutional operations, the C2F block [41], and the path aggregation network (PAN) [42], it contains three novel modules: SPM, TFM, and FSCHead. In the backbone, the three features $C_3$, $C_4$, and $C_5$ are all refined by the cross-layer fusion in the spatial perception SPM before being fed into the PAN. In SPM, each feature map is combined with texture details mined from a lower layer four times larger than itself to enhance and integrate the selective information. This indirect feature exchange between distant layers ensures that low-level context is dynamically preserved up to the final output layers. The highest feature, $C_5$,
also undergoes the texture federation TFM. This is because the deep layers experience more convolutional computations and correspondingly need to be supplemented with more underlying details. TFM aggregates the low-level spatial texture attention STA with the high-level fully connected attention FCA after spatial pyramid pooling-fast (SPPF) [43] processing to gain access to rich contexts. Subsequently, FSCHead performs a self-cooperation mechanism between neighboring layers on the three outputs of the PAN, $P_3$, $P_4$, and $P_5$. Through deep decoupling, the localization and classification tasks cooperate across adjacent layers.
3.2. Spatial Perception Module
As the network deepens, the receptive field gradually enlarges and the semantic expression capability is enhanced, but the resolution of the feature maps decreases and many spatial details become blurred [44]. Therefore, we design the SPM to perform cross-layer feature supplementation, as shown in Figure 3. It is applied to the input feature maps of the pyramidal network to enhance their spatial perception across a fourfold scale gap.
It is well known that common smoke usually presents two colors, black and white; thus, SPM simultaneously extracts both the maximum and minimum values on the shallow layer. In addition, because the pixel values of blurred boundaries and smoke regions are susceptible to the influence of ambient colors, SPM also applies an average operation to preserve the contextual details of smoke. After the minimum, mean, and maximum operations, the three texture features are refined by a convolution with a kernel size of 3. Finally, the low-level spatial guidance maps are formed by a sigmoid function and loaded onto the deep layers to bridge the loss of texture details. Notably, there is a fourfold difference in scale between the semantic and spatial feature maps in SPM. Such cross-layer connections allow texture information to be dynamically retained in each output layer of the network backbone, effectively alleviating the dissipation of contextual details during feature encoding.
As previously mentioned, when the low-level spatial feature is denoted as $F_s$, the high-level semantic input is recorded as $F_d$, and the intermediate texture feature acquired by the minimum, mean, and maximum operations is noted as $F_t$, the output of SPM $F_{SPM}$ can be formulated as:

$$F_t = \mathrm{Cat}\left[\mathrm{Min}(F_s),\ \mathrm{Mean}(F_s),\ \mathrm{Max}(F_s)\right]$$

$$F_{SPM} = \sigma\left(\mathrm{Conv}_3(F_t)\right) \otimes F_d$$

where the minimum, mean, and maximum operations are taken along the channel dimension, $\mathrm{Cat}$ denotes the concatenation, $\mathrm{Conv}_3$ means the convolution with a kernel size of 3, $\sigma$ is the sigmoid function, and $\otimes$ denotes the element-wise multiplication.
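To make the data flow concrete, a minimal PyTorch sketch of SPM follows. The 3-to-1 channel mapping in the 3 × 3 convolution and the bilinear down-scaling that bridges the fourfold scale gap are our assumptions; the text does not specify how the guidance map is resized before being loaded onto the deep layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPM(nn.Module):
    """Sketch of the Spatial Perception Module under stated assumptions."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 1, kernel_size=3, padding=1)

    def forward(self, f_s: torch.Tensor, f_d: torch.Tensor) -> torch.Tensor:
        # Channel-wise min / mean / max of the shallow feature F_s.
        f_min = f_s.min(dim=1, keepdim=True).values
        f_mean = f_s.mean(dim=1, keepdim=True)
        f_max = f_s.max(dim=1, keepdim=True).values
        f_t = torch.cat([f_min, f_mean, f_max], dim=1)   # Cat[Min, Mean, Max]
        guide = torch.sigmoid(self.conv(f_t))            # sigma(Conv3(F_t))
        # Assumed step: resize the guidance map to the deep layer's scale.
        guide = F.interpolate(guide, size=f_d.shape[-2:], mode="bilinear",
                              align_corners=False)
        return guide * f_d                               # element-wise product
```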
3.3. Texture Federation Module
Many detailed features gradually dissipate after multi-layer convolutional operations. For this reason, we design the spatial perception SPM in the backbone. However, at the deepest layer of feature encoding this issue accumulates, and relying on SPM alone is insufficient to compensate for the spatial details. Therefore, we devise the texture federation TFM after the SPM modification. It is arranged in the last layer of the backbone to reinforce the semantic features and to further supplement the meaningful spatial details that fade in the deep network maps. The elaborate structure of TFM is illustrated in Figure 4. Following the cross-layer design, STA is applied to preserve valuable low-level texture details. As for the high-level features, after adaptive dimensioning by the SPPF structure, they are input into the FCA to strengthen the deep semantic information through fully connected attention. The final low- and high-level features are integrated and exported as the deepest feature encoding.
We first describe the specific process of the low-level path. Since smoke is a salient yet fuzzy target, we similarly introduce the minimum, mean, and maximum values to perceive its texture. Specifically, when the input feature of the TFM is notated as $X$, the preliminary texture-aware feature $X_t$ is obtained by concatenating the results of the minimum, mean, and maximum computations. Then $X_t$ is subjected to a convolution with a kernel size of 7 and a sigmoid function to obtain the final smoke spatial filter, which filters out invalid noise and selectively enhances the meaningful texture details in the input $X$. The output of the STA module $X_{STA}$ can thus be expressed as:

$$X_t = \mathrm{Cat}\left[\mathrm{Min}(X),\ \mathrm{Mean}(X),\ \mathrm{Max}(X)\right]$$

$$X_{STA} = \sigma\left(\mathrm{Conv}_7(X_t)\right) \otimes X$$

where the minimum, mean, and maximum operations are taken along the channel dimension, $\mathrm{Cat}$ denotes the concatenation, $\mathrm{Conv}_7$ means the convolution with a kernel size of 7, $\sigma$ is the sigmoid function, and $\otimes$ denotes the element-wise multiplication. Note that the convolutional kernel size here is 7 because the TFM is applied to the highest layer of the backbone, where a larger receptive field promotes global contextual dependencies.
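A matching sketch of the STA branch follows; it mirrors the SPM texture cue but keeps the input resolution and uses the 7 × 7 kernel.

```python
import torch
import torch.nn as nn

class STA(nn.Module):
    """Sketch of the Spatial Texture Attention branch of TFM."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel-wise min / mean / max, concatenated into X_t.
        x_t = torch.cat([x.min(dim=1, keepdim=True).values,
                         x.mean(dim=1, keepdim=True),
                         x.max(dim=1, keepdim=True).values], dim=1)
        return torch.sigmoid(self.conv(x_t)) * x   # spatial filter times X
```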
As for the high-level branch, the feature flow passes through SPPF and FCA sequentially. The SPPF structure [43] enables adaptive dimension generation via multi-scale spatial containers with little increase in computational effort. When the input is denoted as $X$, the output $X_{SPPF}$ of SPPF can be simply obtained by:

$$\tilde{X} = \mathrm{CBS}(X)$$

$$X_{SPPF} = \mathrm{CBS}\left(\mathrm{Cat}\left[\tilde{X},\ \mathrm{M}_5(\tilde{X}),\ \mathrm{M}_5(\mathrm{M}_5(\tilde{X})),\ \mathrm{M}_5(\mathrm{M}_5(\mathrm{M}_5(\tilde{X})))\right]\right)$$

where $\mathrm{CBS}$ means the base convolution consisting of a convolution, batch normalization, and the SiLU activation function, $\mathrm{Cat}$ represents the concatenation, and $\mathrm{M}_5$ means the maximum computation with a kernel of 5 × 5.
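A sketch of the CBS block and the SPPF structure follows; the halved bottleneck width inside SPPF is the conventional choice from [43] and is an assumption here.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Base convolution: Conv2d + BatchNorm + SiLU."""
    def __init__(self, c_in: int, c_out: int, k: int = 1, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """Sketch of SPPF: three cascaded 5x5 max pools share one forward pass."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        c_mid = c_in // 2                       # assumed bottleneck width
        self.cv1 = CBS(c_in, c_mid, k=1)
        self.cv2 = CBS(c_mid * 4, c_out, k=1)
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)

    def forward(self, x):
        x = self.cv1(x)                         # X~ = CBS(X)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```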
Next, $X_{SPPF}$ serves as the input to our FCA structure and is denoted as $Y$. One distinct difference between FCA and existing image attention approaches is that it breaks through the spatial location constraints imposed by 2D convolution and instead relies on only one fully connected layer. In FCA, $Y$ first undergoes average pooling to condense its spatial information; then a simple one-layer full connection, whose neurons operate on the individual channels of the original feature maps, reweights the global channels to introduce richer feature representations. The resulting channel mask is denoted as $M_c$, which is immediately fed into the activation function to produce the final channel scores. The output $Y_{FCA}$ of the FCA is derived by multiplying the initial input $Y$ by the channel scores, and it is formulated as:

$$M_c = \mathrm{FC}\left(\mathrm{AvgPool}(Y)\right)$$

$$Y_{FCA} = \sigma(M_c) \otimes Y$$

where $\mathrm{AvgPool}$ denotes the adaptive average pooling, $\mathrm{FC}$ means the linear full connection, $\sigma$ is the sigmoid function, and $\otimes$ denotes the element-wise multiplication.
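The FCA branch reduces to a few lines in PyTorch; the channel count $c$ is the only assumed hyperparameter.

```python
import torch
import torch.nn as nn

class FCA(nn.Module):
    """Sketch of the Fully Connected Attention branch."""
    def __init__(self, c: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # B x C x 1 x 1 descriptor
        self.fc = nn.Linear(c, c)             # the single fully connected layer

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = y.shape
        m = self.fc(self.pool(y).flatten(1))           # channel mask M_c
        scores = torch.sigmoid(m).view(b, c, 1, 1)     # final channel scores
        return scores * y                              # reweight the input
```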
Finally, to recover the spatial context details in the top-level encoding, we directly concatenate $X_{STA}$ from the spatial texture attention on the low-level pathway with $Y_{FCA}$ from the fully connected attention on the high-level pathway, and then modify the channel number via a convolution so that the input and output feature maps of the whole texture federation TFM module are of consistent size. The output feature $F_{TFM}$ of TFM can therefore capture richer smoke context, and the process can be defined as:

$$F_{TFM} = \mathrm{CBS}\left(\mathrm{Cat}\left[X_{STA},\ Y_{FCA}\right]\right)$$

where $\mathrm{Cat}$ denotes the concatenation operator and $\mathrm{CBS}$ means the base convolution.
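Putting the pieces together, a sketch of the whole TFM follows, reusing the STA, SPPF, FCA, and CBS classes from the sketches above; keeping the channel width $c$ unchanged end to end is our assumption.

```python
import torch
import torch.nn as nn

class TFM(nn.Module):
    """Sketch of the Texture Federation Module: STA in parallel with
    SPPF -> FCA, fused by concatenation and one base convolution."""
    def __init__(self, c: int):
        super().__init__()
        self.sta = STA()              # low-level texture pathway
        self.sppf = SPPF(c, c)        # high-level pathway, stage 1
        self.fca = FCA(c)             # high-level pathway, stage 2
        self.fuse = CBS(2 * c, c, k=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        low = self.sta(x)
        high = self.fca(self.sppf(x))
        return self.fuse(torch.cat([low, high], dim=1))   # F_TFM
```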
3.4. Feature Self-Collaboration Head
After going through the feature pyramid, the network holds three branches at distinct scales. The conventional detection head, whether anchor-based [45] or anchor-free [46], carries out the localization and classification tasks simultaneously, resulting in little communication between the different paths. Several studies [47,48,49] indicate that high-level features in deep layers encode semantic information and provide an abstract description of smoke, while low-level features in shallow layers retain the spatial details needed to rebuild smoke boundaries. We hence propose the feature self-collaboration FSCHead, as presented in Figure 5. With cross-layer cooperation, the high-level paths are used only for the classification task, and the low-level layers are employed merely for smoke localization.
The essential idea of our FSCHead module is to adapt the feature computation to the detection task that suits the attribute preferences of each layer, instead of indiscriminately conducting localization and classification on every branch. There are four alternative strategies, as displayed in Figure 5a–d. Figure 5a,b classify smoke on the deep layers, which are rich in semantics, and perform localization on the low layers, which contain spatial details. Due to the scale disparity between the respective branches, Figure 5a adjusts the classification branch to match the scale of the localization branch via up-sampling, whereas Figure 5b tunes the scale of the localization branch to be consistent with that of the classification branch by down-sampling. In contrast, Figure 5c,d implement smoke localization on the high level and deploy the classification task on the low level. The former produces high-resolution feature maps after up-sampling the localization branch, while the latter down-samples the classification information and thus exports low-resolution detection maps.
In addition, referring to anchor-free mechanisms [50] and taking into account the real-time requirements of smoke detection, each branch is designed to be concise. The classification branch and the localization branch are each composed of two layers of base convolution and one 2D convolution. The base convolution with a kernel of 3 reinforces the comprehension of the features and recovers discriminative smoke information. The single 2D convolution, whose kernel size is 1, adaptively modifies the channel number of the features according to the assigned task.
Figure 5a achieves the best results both theoretically and practically; such a feature self-collaboration mechanism, with high layers for classification and low layers for localization, simply and directly eliminates redundancy and preserves the meaningful smoke features. Denote the three inputs of different scales of FSCHead as $P_3$, $P_4$, and $P_5$, respectively. The classification and localization outputs of the low-level layers are designated as $Cls_l$ and $Loc_l$, and $Cls_h$ and $Loc_h$ separately represent the corresponding outputs on the high-level branches. The total output is noted as $O$, and the transfer process of features in FSCHead can then be described by:

$$Loc_l = \mathrm{Conv}_1\left(\mathrm{CBS}(\mathrm{CBS}(P_3))\right), \qquad Cls_l = \mathrm{Up}\left(\mathrm{Conv}_1\left(\mathrm{CBS}(\mathrm{CBS}(P_4))\right)\right)$$

$$Loc_h = \mathrm{Conv}_1\left(\mathrm{CBS}(\mathrm{CBS}(P_4))\right), \qquad Cls_h = \mathrm{Up}\left(\mathrm{Conv}_1\left(\mathrm{CBS}(\mathrm{CBS}(P_5))\right)\right)$$

$$O = \left\{\mathrm{Cat}\left[Loc_l,\ Cls_l\right],\ \mathrm{Cat}\left[Loc_h,\ Cls_h\right]\right\}$$

where $\mathrm{CBS}$ means the base convolution, $\mathrm{Conv}_1$ means the convolution with a kernel of 1, $\mathrm{Up}$ denotes the up-sampling, and $\mathrm{Cat}$ denotes the concatenation operator.
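The sketch below renders this variant in PyTorch, reusing the CBS block from the SPPF sketch. The pairing of $(P_3, P_4)$ and $(P_4, P_5)$ into two forward paths, the channel widths, and the 4·reg_max box encoding reflect our reading of Figure 5a and common anchor-free practice rather than specifics given in the text.

```python
import torch
import torch.nn as nn

class FSCHead(nn.Module):
    """Sketch of FSCHead variant (a): each forward path pairs a low layer
    (localization) with its up-sampled deeper neighbor (classification)."""
    def __init__(self, c3: int, c4: int, c5: int,
                 reg_max: int = 16, n_cls: int = 1):
        super().__init__()
        def branch(c_in: int, c_out: int) -> nn.Sequential:
            # Two base convolutions (k=3) followed by one 1x1 2D convolution.
            return nn.Sequential(CBS(c_in, c_in, k=3), CBS(c_in, c_in, k=3),
                                 nn.Conv2d(c_in, c_out, kernel_size=1))
        self.loc_l = branch(c3, 4 * reg_max)   # localization on P3
        self.cls_l = branch(c4, n_cls)         # classification on P4
        self.loc_h = branch(c4, 4 * reg_max)   # localization on P4
        self.cls_h = branch(c5, n_cls)         # classification on P5
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, p3, p4, p5):
        out_low = torch.cat([self.loc_l(p3), self.up(self.cls_l(p4))], dim=1)
        out_high = torch.cat([self.loc_h(p4), self.up(self.cls_h(p5))], dim=1)
        return out_low, out_high   # the two forward paths used by the loss
```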
3.5. Hybrid Loss Function
In our network, a hybrid loss function is introduced to assess the gap between the predicted results and the ground truths and to direct the subsequent training. CLSANet has two computational outputs for each smoke target, the classification branch and the localization branch, and the total loss $L_{total}$ is made up of three parts. The classification loss $L_{cls}$ is computed on the classification branch, while the regression loss $L_{reg}$ and the confidence loss $L_{conf}$ are derived from the localization branch. The total loss $L_{total}$ is expressed as:

$$L_{total} = \sum_{i=1}^{2} L_i = \sum_{i=1}^{2} \left( L_{cls}^{i} + \lambda_1 L_{reg}^{i} + \lambda_2 L_{conf}^{i} \right)$$

where $\lambda_1$ and $\lambda_2$ are the weight parameters, $i$ denotes the two different forward paths in the head, and $L_i$ signifies the sum of the losses on forward path $i$.
Specifically, classification in smoke detection is a binary task, and a binary cross-entropy loss (BCE) is adopted to guide its optimization; BCE is easy to deploy and computationally efficient. Smoke localization is essentially a regression task, and a complete intersection over union loss (CIOU) is employed to penalize inconsistent results. CIOU accounts for the intersection over union, the centroid distance, and the relative proportions between the predicted and true boxes, making it suitable for smoke targets of various shapes. Furthermore, since the box distribution in smoke detection is highly steep, a distribution focal loss (DFL) is implemented in the regression branch to further refine the coordinates of the detection boxes after decoding their integrals. As a result, we can acquire the $L_{cls}$, $L_{reg}$, and $L_{conf}$ losses by:

$$L_{cls}^{i} = \mathrm{BCE}\left(p_i, \hat{p}_i\right)$$

$$L_{reg}^{i} = \mathrm{CIOU}\left(b_i, \hat{b}_i\right)$$

$$L_{conf}^{i} = \mathrm{DFL}\left(b_i, \hat{b}_i\right)$$

where $i$ means the two different forward paths in the head, $p_i$ and $b_i$ denote the predicted classification probability and bbox coordinates, respectively, and $\hat{p}_i$ and $\hat{b}_i$ represent the corresponding ground truths.
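For illustration, a hedged sketch of one forward path's loss in PyTorch follows. torchvision's complete_box_iou_loss stands in for the CIOU term, the DFL form (cross-entropy over the two integer bins around each continuous coordinate, with reg_max bins) follows common anchor-free practice rather than the paper, and lam1/lam2 are placeholder weights.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import complete_box_iou_loss

def dfl_loss(pred_dist: torch.Tensor, target: torch.Tensor,
             reg_max: int = 16) -> torch.Tensor:
    """Assumed DFL form: weighted cross-entropy on the two bins
    bracketing each continuous target coordinate in [0, reg_max - 1]."""
    tl = target.long().clamp(0, reg_max - 1)        # left bin index
    tr = (tl + 1).clamp(max=reg_max - 1)            # right bin index
    wl, wr = tr.float() - target, target - tl.float()
    return (F.cross_entropy(pred_dist, tl, reduction="none") * wl +
            F.cross_entropy(pred_dist, tr, reduction="none") * wr).mean()

def hybrid_loss(p: torch.Tensor, p_hat: torch.Tensor,
                b: torch.Tensor, b_hat: torch.Tensor,
                dist: torch.Tensor, dist_target: torch.Tensor,
                lam1: float = 1.0, lam2: float = 1.0) -> torch.Tensor:
    """One path's loss L_i = L_cls + lam1 * L_reg + lam2 * L_conf."""
    l_cls = F.binary_cross_entropy_with_logits(p, p_hat)            # BCE
    l_reg = complete_box_iou_loss(b, b_hat, reduction="mean")       # CIOU
    l_conf = dfl_loss(dist, dist_target)                            # DFL
    return l_cls + lam1 * l_reg + lam2 * l_conf
```

The total loss is then the sum of this quantity over the two forward paths of FSCHead.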