To optimize the feature extraction capability of the network and improve the robustness of the model in complex scenarios, the backbone employs a two-branch network consisting of Darknet53 and cross-stage partial (CSP) structures. BA utilizes light conditions and contrast as prior knowledge to guide the fusion of modalities. CAMF consists of two parts, the DA module and the CA module; features from the two parts are fused in a weighted manner to obtain the fused attention map. Finally, the detection results are generated by feature weighting.
3.1. BA Module
The factors affecting the reliability of multispectral detection are complex and diverse. Currently, researchers mainly use illumination information to guide multimodal fusion. Although illumination information is effective for cross-modality fusion, relying on it alone still has the following limitations: (1) illumination-guided methods perform well in stable scenarios such as daytime and nighttime, but they struggle to discriminate objects from the surrounding area in complex backgrounds such as rain, fog, and camouflage; (2) illumination information is usually used to guide the fusion of inter-modality features but is insufficient for intra-modality feature enhancement. In foggy or low-visibility scenes, where more factors degrade image quality, light conditions play a more limited role; in such cases, contrast is crucial for distinguishing objects from the background. Therefore, we propose the BA module, in which light conditions and contrast are calculated as prediction weights to guide the adaptive fusion of intra-modality and inter-modality features.
Due to the difference in spectral bands, visible images are more dependent on external light sources than infrared images, and they vary greatly from day to night. Infrared images, in contrast, are produced by passive imaging and mainly reflect the difference between the radiation of the target and that of the background. In general, visible images have higher imaging quality and contain more color and texture information in daytime scenes, whereas infrared images are sharper and outline objects more clearly in dark scenes. Therefore, during fusion, visible images contribute more to the detection results when the scene is daytime; when the scene is dark, the information in visible images degrades and infrared images contribute more. To quantify the effect of light conditions on fusion, we estimate the light conditions. Given a visible image $I_v$, the probabilities that the image belongs to day or night are defined as $P_d$ and $P_n$, respectively. Note that $P_d + P_n = 1$. In natural scenes, daytime light conditions are usually better than those of dark scenes; the better the light conditions, the larger $P_d$. We intend to use these probability values to represent the perceptual weights contributed by the different modalities. Due to the binary nature of the light source, $P_d$ and $P_n$ will be close to 0 or 1. If these values were multiplied directly by the results of the two branches, the modality with the lower probability would be significantly suppressed during fusion. To optimize the weights used in fusion, a gate function is designed to realign the weights of the two modalities so that their complementary information can be more fully integrated. The gate function maps $P_d$ and $P_n$ to the fusion weights $w_v$ and $w_{ir}$ of the visible and infrared images, with $w_v + w_{ir} = 1$. When the daytime probability $P_d$ is larger, the weight for visible images is greater than 1/2 and the weight for infrared images is no longer close to 0, and vice versa. Therefore, we design a classification network to predict the light intensity and guide the fusion of inter-modality features. The image pair is resized to 128 × 128 and fed into BA. The visible image is fed into the light prediction network, which consists of a convolutional layer and a fully connected layer. After the convolutional layer, an activation layer and 2 × 2 adaptive pooling are added to compress and extract light features. Subsequently, the features are fed into the fully connected layer and converted into the desired weights. In Figure 3, the blue box represents the prediction process of the light conditions; the blue cubes denote a convolution module, an activation layer, and a pooling layer, and the gray box represents the fully connected layer.
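As a concrete illustration, the sketch below implements a light-prediction branch and gate re-weighting of the kind described above. The channel width, the ReLU activation, and the specific gate mapping (a re-centering of the day/night probabilities around 1/2) are our assumptions for illustration, not the exact design of the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LightPrediction(nn.Module):
    # Sketch of the BA light-prediction branch: the visible image (resized to
    # 128 x 128) passes through a convolution, an activation, 2 x 2 adaptive
    # pooling, and a fully connected layer that outputs the day/night
    # probabilities (P_d, P_n), which a gate re-maps to fusion weights.
    def __init__(self, in_channels: int = 3, hidden: int = 16):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, hidden, kernel_size=3, stride=2, padding=1)
        self.act = nn.ReLU(inplace=True)
        self.pool = nn.AdaptiveAvgPool2d(2)          # 2 x 2 adaptive pooling
        self.fc = nn.Linear(hidden * 2 * 2, 2)       # -> (P_d, P_n) logits

    def forward(self, visible: torch.Tensor):
        x = self.pool(self.act(self.conv(visible)))
        p_day, p_night = F.softmax(self.fc(torch.flatten(x, 1)), dim=1).unbind(dim=1)
        # Gate (illustrative): keep both weights away from 0 so the weaker
        # modality is not suppressed; w_v + w_ir = 1.
        w_v = 0.5 + 0.25 * (p_day - p_night)         # lies in (0.25, 0.75)
        w_ir = 1.0 - w_v
        return w_v, w_ir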
As mentioned above, using only light information to guide fusion is not sufficient. For example, the similarity in color between objects and the background in visible images and the similarity in temperature in infrared images can affect the accuracy of detection: the higher the similarity between objects and the background, the more difficult they are to discriminate. Therefore, this paper introduces contrast as prior knowledge to guide the fusion of intra-modality features. The contrast is measured by the gray-level difference between objects and the surrounding background. First, the target region is divided into a 3 × 3 grid, and a variable $G_k$ is defined to represent the pixel mean of each grid region outside the object. The calculation process is

$$G_k = \frac{1}{N} \sum_{j=1}^{N} g_{k,j} \qquad (3)$$

where $k$ represents the ordinal number of the region, $N$ is the total number of pixel points in each region, $j$ represents the ordinal number of the pixel points, and $g_{k,j}$ represents the gray value of pixel $j$ in region $k$. Equation (3) calculates the average pixel intensity of each divided region, which is crucial for differentiating the regions in multispectral images. During rectangular box sliding, the target is usually located in the center region. The variable $G_{\max}$ is introduced to represent the largest gray value in the object region; its role is to quantify the pixel differences between the object and the remaining regions. The contrast difference $C_d$ can be described by

$$C_d = \frac{1}{K} \sum_{k=1}^{K} \left| G_{\max} - G_k \right|$$

The process of calculating the contrast is shown in the right part of Figure 3: area 0 is the object area, $G_{\max}$ is the maximum gray value of region 0, $k$ represents the ordinal number of the background regions, and $K$ is 8. The contrast therefore indicates the mean pixel difference between the target region and the remaining eight background regions. By calculating the contrast, the network better discriminates objects from the background, and the resulting weight is used in the subsequent fusion module.
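A minimal NumPy sketch of this contrast measure is given below; how the 3 × 3 grid is rounded to pixel boundaries and the use of the absolute difference are our assumptions.

import numpy as np

def contrast_weight(gray: np.ndarray, box: tuple) -> float:
    # Sketch of the BA contrast computation: the target box is divided into a
    # 3 x 3 grid; the center cell (region 0 in Figure 3) is the object region
    # and the other 8 cells are background. `gray` is a single-channel image,
    # `box` is (x1, y1, x2, y2).
    x1, y1, x2, y2 = box
    patch = gray[y1:y2, x1:x2].astype(np.float64)
    h, w = patch.shape
    ys = np.linspace(0, h, 4, dtype=int)   # 3 x 3 grid boundaries
    xs = np.linspace(0, w, 4, dtype=int)
    cells = [patch[ys[r]:ys[r + 1], xs[c]:xs[c + 1]]
             for r in range(3) for c in range(3)]
    g_max = cells[4].max()                 # largest gray value in the center (object) cell
    background = [c for i, c in enumerate(cells) if i != 4]
    # Mean absolute difference between the object peak and the 8 background means
    return float(np.mean([abs(g_max - c.mean()) for c in background]))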
3.2. CAMF Module
Both visible and infrared images carry intrinsic and complementary information, and how to fuse the two modalities is the key to multispectral object detection. However, most existing two-branch methods only use simple fusion schemes, which cannot fully exploit inter-modality and intra-modality features. In addition, crude combinations and connections increase the difficulty of network learning, which degrades detection performance. Inspired by differential amplification circuits, in which differential-mode signals are amplified and common-mode signals are suppressed, a CAMF module is proposed. CAMF consists of two parts, the inter-modality DA module and the intra-modality CA module, as shown in
Figure 2. Given the visible and infrared convolutional feature maps $F_v$ and $F_{ir}$, the differential features $F_D$ and the common features $F_C$ can be represented as

$$F_D = F_v - F_{ir}, \qquad F_C = F_v + F_{ir}$$

$F_D$ can be viewed as the difference between the two modalities, obtained by subtraction to enhance the inter-modality specific features. On the contrary, $F_C$ can be regarded as the sum of the two modalities, obtained by addition to enhance the intra-modality consistent features. Based on this, our CAMF module defines two new hybrid modalities for the final fusion.
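As a minimal tensor-level illustration of these two hybrid features (the shapes are arbitrary assumptions):

import torch

# Visible and infrared feature maps from the two backbone branches
# (batch, channel, height, width values are illustrative only).
f_v = torch.randn(1, 256, 40, 40)
f_ir = torch.randn(1, 256, 40, 40)

f_d = f_v - f_ir   # differential features: inter-modality specific information
f_c = f_v + f_ir   # common features: intra-modality consistent information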
3.2.1. Inter-Modality DA Module
Inspired by signals in differential circuits, the inter-modality DA module aims at extracting specific features by computing the difference between visible and infrared modalities. As shown in
Figure 4, the differential features are enhanced by the channel attention weighting mechanism.
First, the visible features $F_v$ and infrared features $F_{ir}$ are input into the module as the initial values, and the differential features $F_D$ are obtained by direct subtraction. Second, the differential features are encoded into the global vectors $V_{avg}$ and $V_{max}$ by global average pooling (GAP) and global max pooling (GMP) to integrate the global spatial information; the global vectors represent the differences in channel characteristics across modalities. Then, the vectors are sent into a shared convolution, and the outputs are summed to obtain the channel attention map $M_D$. Moreover, the attention map is multiplied, respectively, with the visible and infrared feature maps for adaptive aggregation, and the results are summed with the input modalities to obtain the feature maps after differential amplification. Finally, the feature maps are summed according to the weights generated by BA to obtain the output of the DA module. This process can be expressed by

$$M_D = f_{sc}\big(\mathrm{GAP}(F_D)\big) + f_{sc}\big(\mathrm{GMP}(F_D)\big)$$
$$\hat{F}_v = F_v + M_D \otimes F_v, \qquad \hat{F}_{ir} = F_{ir} + M_D \otimes F_{ir}$$
$$F_{DA} = w_v \hat{F}_v + w_{ir} \hat{F}_{ir}$$

where $f_{sc}(\cdot)$ denotes the shared convolution and $\otimes$ denotes channel-wise multiplication. Through differencing, compression, excitation, and weighted fusion, the DA module adaptively learns the importance of different channels across modalities, which also enhances generalization. Notably, DA draws on residual networks to improve the stability of the network: the differential feature maps are added to the input modalities through skip connections, which avoids the loss of key features.
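The following PyTorch sketch shows one way the DA module could be realized, following the steps above. The 1 × 1 shared convolution with a reduction ratio, the sigmoid on the attention map, and the treatment of the BA weights as per-image scalars are our assumptions.

import torch
import torch.nn as nn

class DAModule(nn.Module):
    # Sketch of the inter-modality differential amplification (DA) module:
    # GAP/GMP on the differential features, a shared convolution, a summed
    # channel attention map, residual enhancement of both modalities, and
    # fusion with the BA light weights w_v and w_ir.
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.gmp = nn.AdaptiveMaxPool2d(1)

    def forward(self, f_v, f_ir, w_v: float, w_ir: float):
        f_d = f_v - f_ir                            # differential features
        attn = torch.sigmoid(self.shared(self.gap(f_d)) +
                             self.shared(self.gmp(f_d)))   # channel attention map
        f_v_hat = f_v + attn * f_v                  # skip connection to the input
        f_ir_hat = f_ir + attn * f_ir
        return w_v * f_v_hat + w_ir * f_ir_hat      # BA-weighted fusion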
3.2.2. Intra-Modality CA Module
The similarity between foreground and background in multispectral object detection affects the detection performance. In addition to complementary features, intrinsic features are also crucial for discriminative feature extraction. Therefore, the CA module is designed to focus on intra-modality shared information guided by contrast weights. As shown in
Figure 5, CA sums the features of two branch networks and remixes them into a new feature to achieve an enhanced feature map.
First, the visible features $F_v$ and infrared features $F_{ir}$ are used as inputs and directly summed to obtain the common features $F_C$. Next, the visible attention maps $M_v$ and infrared attention maps $M_{ir}$ are computed through GAP and a fully connected layer (FC). Subsequently, the attention maps of the two modalities are multiplied, respectively, with the input features and the contrast weights from BA. Finally, the enhanced shared features are summed to obtain the output of the intra-modality CA module. Through summation, weight sharing, compression, and normalization, the model achieves adaptive channel selection. The innovations of this module are as follows: (1) the parameters of the FC layer are shared to reduce the feature dimension and improve computational efficiency; (2) through channel selection and the guidance of prior knowledge, the weights of the model are redistributed to the feature channels of the visible and infrared modalities to avoid introducing redundant features; (3) skip connections enable the network to reuse shallow features and improve the representation of complex features.
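Because the exact formulation of the CA computation is not fully specified here, the following PyTorch sketch should be read as one plausible realization under stated assumptions: the attention maps are derived from the common features through a shared FC layer, normalized with a softmax across the two modalities, and the contrast weight is applied as a scalar.

import torch
import torch.nn as nn

class CAModule(nn.Module):
    # Sketch of the intra-modality CA module: the two modalities are summed
    # into common features, squeezed by GAP, passed through a shared FC layer,
    # and split into per-modality channel attention normalized across the
    # modalities. The contrast weight from BA scales the enhanced features,
    # and skip connections reuse the inputs.
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc_shared = nn.Linear(channels, channels // reduction)
        self.fc_v = nn.Linear(channels // reduction, channels)
        self.fc_ir = nn.Linear(channels // reduction, channels)

    def forward(self, f_v, f_ir, contrast: float):
        b, c, _, _ = f_v.shape
        f_c = f_v + f_ir                                   # common features
        z = self.fc_shared(f_c.mean(dim=(2, 3)))           # GAP + shared FC
        # Normalize the two attention maps across the modalities
        m = torch.softmax(torch.stack([self.fc_v(z), self.fc_ir(z)], dim=1), dim=1)
        m_v = m[:, 0].view(b, c, 1, 1)
        m_ir = m[:, 1].view(b, c, 1, 1)
        out_v = f_v + contrast * m_v * f_v                 # skip connection + contrast guidance
        out_ir = f_ir + contrast * m_ir * f_ir
        return out_v + out_ir                              # summed intra-modality output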
3.2.3. Multiscale Cross-Fusion Strategies
Fusing the outputs is the final step in our CAMF module. Generally, two-branch features are fused by adding or subtracting; however, existing studies have demonstrated that such approaches may exacerbate the imbalance of the network. In addition, variations in object scale, especially for small-sized targets, can also degrade the model's performance. Therefore, a multiscale cross-fusion strategy is designed to fuse the cross-modality images, through which feature maps at different levels interact with each other. As shown in Figure 2, we extract feature maps of different modalities and scales (small, medium, and large) and feed them into the CAMF module, followed by horizontal connection and feature weighting to fuse the multiscale features. First, a convolutional layer is applied after each output to reduce its channel dimension to 1/2 of the original. Then, bilinear interpolation is performed to restore the feature maps to the input spatial size. Finally, the features at different scales are concatenated as global features for multiscale fusion. The fusion step is

$$F_{\mathrm{fuse}} = \mathrm{Concat}\big(\mathrm{Up}(\mathrm{Conv}(F_1)),\ \mathrm{Up}(\mathrm{Conv}(F_2)),\ \mathrm{Up}(\mathrm{Conv}(F_3))\big)$$

where the index of $F_s$ takes the values 1, 2, and 3, representing the attention feature maps at three different scales: small, medium, and large, respectively; $\mathrm{Conv}(\cdot)$ is the channel-reducing convolution and $\mathrm{Up}(\cdot)$ is the bilinear interpolation. It is worth noting that the features of both the visible and infrared modalities are processed and refined by the attention modules to avoid the loss of key information.
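A PyTorch sketch of this fusion step is shown below; the channel counts and the choice of the first map's resolution as the common size are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiscaleCrossFusion(nn.Module):
    # Sketch of the multiscale cross-fusion step: a 1x1 convolution halves the
    # channels of each CAMF output, bilinear interpolation brings all maps to
    # a common spatial size, and the results are concatenated.
    def __init__(self, channels=(256, 512, 1024)):
        super().__init__()
        self.reduce = nn.ModuleList(
            [nn.Conv2d(c, c // 2, kernel_size=1) for c in channels]
        )

    def forward(self, feats):
        # feats: CAMF outputs at the small, medium, and large scales
        target = feats[0].shape[-2:]          # common size (assumed: first map)
        resized = [
            F.interpolate(conv(x), size=target, mode="bilinear", align_corners=False)
            for conv, x in zip(self.reduce, feats)
        ]
        return torch.cat(resized, dim=1)      # spliced global features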
3.3. Loss Function
To capture inter-modality and intra-modality information, we propose a multitask perception loss. Note that the contrast does not need to be trained with a separate network, so the loss function does not take the contrast into account. The multitask loss therefore consists of the detection loss $L_{det}$ and the light condition loss $L_{light}$, which are used to calibrate the multispectral detection results and the light prediction weights, respectively. The loss is defined as

$$L = L_{det} + L_{light}$$
The detection loss consists of the classification loss $L_{cls}$, the bounding-box regression loss $L_{box}$, and the confidence loss $L_{conf}$. The classification loss exploits the strong correlation of the states so that the labels can better guide the learning of the categories; $L_{cls}$ is defined in Equation (11). The bounding-box regression loss is inspired by GIoU, which was proposed to alleviate the gradient problem of the IoU loss; we add a penalty term to the original loss, as defined in Equation (12). The confidence loss, formulated in Equation (13), mainly addresses the imbalanced proportions of different kinds of objects and is suitable for complex scenarios such as few samples and varied scales.

$$L_{cls} = -\sum_{i}\big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\big] \qquad (11)$$

$$L_{box} = 1 - \mathrm{IoU} + \frac{|C| - |A \cup B|}{|C|} \qquad (12)$$

$$L_{conf} = -\sum_{i}\big[\, \alpha\, y_i (1 - p_i)^{\gamma}\log p_i + (1 - \alpha)(1 - y_i)\, p_i^{\gamma}\log(1 - p_i) \,\big] \qquad (13)$$

where $i$ denotes the ordinal number of the sample, $p_i$ denotes the probability predicted by the model, and $y_i$ is a binary variable taking the value 0 or 1. $|A|$ denotes the area of the real box, $|B|$ denotes the area of the predicted box, and $|C|$ denotes the area of the smallest box that contains both the real and the predicted boxes. $\alpha$ is used to address the imbalance between positive and negative samples, and $\gamma$ is used to address the imbalance between difficult and easy samples; they are set to 0.25 and 2, respectively, in this paper.
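For reference, the snippet below computes a GIoU-style box loss and a focal-style confidence loss in the standard forms assumed for Equations (12) and (13); the box format and the mean reduction are our assumptions.

import torch

def giou_loss(pred, target):
    # Boxes are (x1, y1, x2, y2). Returns 1 - IoU + (|C| - |A u B|) / |C|,
    # where C is the smallest box enclosing the predicted and real boxes.
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = (area_p + area_t - inter).clamp(min=1e-7)
    iou = inter / union
    # Smallest enclosing box C
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c_area = (cw * ch).clamp(min=1e-7)
    return (1 - iou + (c_area - union) / c_area).mean()

def focal_confidence_loss(p, y, alpha=0.25, gamma=2.0):
    # Standard focal loss with alpha = 0.25 and gamma = 2, as in Equation (13).
    loss_pos = -alpha * y * (1 - p).pow(gamma) * torch.log(p.clamp(min=1e-7))
    loss_neg = -(1 - alpha) * (1 - y) * p.pow(gamma) * torch.log((1 - p).clamp(min=1e-7))
    return (loss_pos + loss_neg).mean()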
The fusion of inter-modality complementary information and intra-modality intrinsic information relies heavily on the guidance of the background-aware module, especially the perception of light conditions. The light condition reflects the strength of the illumination in the image, and its prediction can be viewed as a classifier that estimates the probabilities of the image belonging to daytime and nighttime. Therefore, we use the cross-entropy loss to constrain its training process:

$$L_{light} = -\big[\, \hat{y} \log S(P_d) + (1 - \hat{y}) \log\big(1 - S(P_d)\big) \,\big]$$

where $\hat{y}$ is the label of the light condition, $P_d$ denotes the probability that the image belongs to the daytime, and $S(\cdot)$ is a softmax function that normalizes the light condition probability to [0, 1]. To fully characterize the strength of the light conditions, the value space of $\hat{y}$ is set to 0, 0.5, and 1.0 to denote dark, low-light, and daytime scenes, respectively.
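A minimal sketch of this light-condition loss with soft labels, under the cross-entropy reading assumed above:

import torch

def light_condition_loss(logits: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    # logits: (B, 2) day/night scores; label: (B,) in {0.0, 0.5, 1.0} for
    # dark, low-light, and daytime scenes. Soft-label cross-entropy on the
    # softmax-normalized daytime probability.
    p_day = torch.softmax(logits, dim=1)[:, 0].clamp(1e-7, 1 - 1e-7)
    return -(label * torch.log(p_day) + (1 - label) * torch.log(1 - p_day)).mean()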