2.2.2. Semantic Edge Supervision for Contextual Feature Enhancement
In remote sensing, aerial imagery often has more complex backgrounds than natural images, and targets exhibit large scale variation, high aspect ratios, small sizes, and dense arrangements. From our observations, basic one-stage detectors tend to have the following three problems:
- (i) The network struggles to capture the target’s global features: it focuses only on critical local features while ignoring significant edge and context features. This is very unfavorable for regression, especially for remote sensing targets with high aspect ratios, as shown in the second row of the first column of Figure 3;
- (ii) The network invariably attends to the apparent boundaries in the image, but these boundaries often belong to the background, which introduces a large amount of redundant background interference, as shown in the second row of the second column of Figure 3;
- (iii) It is challenging to attend to small targets in complex backgrounds. Without targeted attention, a large amount of background noise drowns out the small targets, as shown in the second row of the third column of Figure 3.
Figure 3 illustrates the above three problems. The comparison in the first column shows that the baseline attends to only some of the harbor features, whereas our method attends to the global boundary features of the harbor and also provides rich contextual information. The second column illustrates the background boundary interference problem. The bridge to be detected lies only above the lake in the center of the picture. Because the network is sensitive to all boundaries, the baseline pays more attention to the road boundary where the bridge deck extends onto the land, which seriously interferes with the detection of the bridge deck itself. In the third column, many high-response areas of the baseline feature map fall on the white lines of the parking area, and the response on the vehicle covers only local features of the vehicle head. After introducing the proposed attention-like mechanism, our method both suppresses this noise and attends to the global features of the large vehicle.
To deal with the above challenges, various attention mechanisms in the spatial and channel dimensions have been proposed [18, 23, 24, 25, 39, 40] and achieve significant improvements in model accuracy and feature response. However, we find that some shortcomings remain. First, these attention mechanisms make the high response of the feature map expand outward from the center of the target without paying extra attention to the edge features of inter-class and intra-class targets, so their ability to capture the contextual edge information of the target is relatively weak. Some research [41, 42] has demonstrated that the discriminative features required to locate the target are often not wholly distributed inside the ground truth, which also confirms the importance of the target’s contextual edge information. Second, they are all trained through gradients back-propagated from the head of the network. As the network deepens, the propagation chain lengthens, and this long-distance supervision may weaken the effect of the attention mechanism. Therefore, we refer to these mechanisms as general “inadequately supervised” attention, with which it is challenging to achieve domain-specific effects.
Based on the above analysis of domain-specific problems and previous methods, we combine semantic edge detection with oriented object detection and design a semantic edge supervision network with a high domain fit. This module is fundamentally different from the attention mechanisms mentioned above: it addresses the problems listed above by exploiting the strong compatibility between semantic edge detection and remote sensing object detection. The implementation details and specific process of the module are introduced below.
Figure 4a shows the detailed flow of the semantic edge supervision module, in which we use pyramid levels
to
for semantic edge detection, where
has
resolution of the input image. On the one hand, because the receptive field of shallow feature maps is small and their semantic information is weak, conventional object detection is performed on pyramid levels
to
. In our work, we also perform edge detection on the
feature map, but considering the computational overhead of a feature map at this scale, we do not send it into the detection head. The advantage is that more edge details can be extracted without increasing the computational overhead of the head. Meanwhile, it indirectly guides the upper-layer features to suppress background information and provides more accurate edge localization and structural information, retaining more contextual details for the final edge activation map. On the other hand, the maximum downsampling stride of the features used in edge detection [22, 43, 44, 45] is eight, which suggests that it is difficult to restore detailed edge features from feature maps that are too small. Based on the above analysis, we decide to use pyramid levels
to
for semantic edge detection.
The specific implementation details are as follows. We first expand the receptive field of the outputs of the above stack of pyramid blocks, since edge detection generally requires a larger receptive field to cover the target’s edge information. To this end, we use multiple dilated convolutions so that the subsequent edge detection can obtain edge information at different scales. The dilation rates follow the strategy in [43], which also copes well with the large scale variation of remote sensing targets. Then, we use two convolution layers to split the feature into two branches.
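To make the effect of dilation on the receptive field concrete, the following minimal sketch computes the effective kernel size of a dilated convolution and the receptive field of a sequence of such layers. The dilation rates (1, 2, 4) and the 3 × 3 kernel are hypothetical values chosen for illustration; the actual rates follow [43] and are not restated here, and the text does not say whether the dilated convolutions are arranged in parallel or in sequence, so both cases are shown.

```python
def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Span covered by one dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field(layers) -> int:
    """Receptive field of layers applied in sequence.

    layers: iterable of (kernel_size, dilation, stride) tuples.
    """
    rf, jump = 1, 1  # jump = distance between adjacent outputs, in input pixels
    for kernel_size, dilation, stride in layers:
        rf += (effective_kernel(kernel_size, dilation) - 1) * jump
        jump *= stride
    return rf


# Hypothetical dilation rates (assumption, not taken from the paper).
rates = (1, 2, 4)

# Parallel branches: each branch sees a different context size.
branch_rf = {r: receptive_field([(3, r, 1)]) for r in rates}

# Sequential stack: the context sizes accumulate.
stacked_rf = receptive_field([(3, r, 1) for r in rates])
```

Under these assumed settings, a single 3 × 3 kernel with dilation 4 already spans 9 pixels, which is why dilation is a cheap way to enlarge the context available for locating long object edges.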
The first branch obtains the semantic edge activation map from the output of a self-adaptive upsample block (SAU), a new block that we propose, as shown in Figure 4b. This block aims to alleviate the checkerboard effect that arises when traditional semantic edge detection upsamples with only a single transposed convolution [46]. Specifically, for the input feature
, we first reduce the channel dimension using a 1 × 1 convolution and a ReLU activation function, and then upsample by a factor of 2 using a transposed convolution with a stride of 2. We repeat this operation n times until the size of the edge map is consistent with the original image. Finally, we use a 1 × 1 convolution whose number of output channels equals the number of categories to extract category-aware features. At this point, we have obtained edge activation maps at multiple scales.
where
represents the edge activation map of the
k-th pyramid level obtained by the decoder output,
represents the
n-th class channel of the current bypass activation map. For
, due to its limited receptive field, it is difficult to obtain sufficient semantic information, so this bypass only outputs a single-channel activation map to provide edge details. The number of channels in
and
is equal to the number of categories in the dataset, so in theory, different instances of each category have been decoupled to different channels. Next, we perform a shared connection operation on
(
Figure 4c); the module fuses the edge activation maps obtained above at different feature scales and category channels. Specifically, we first copy the single-channel edge map of
n times, denoted as
, and then concatenate it with each class of the other two bypasses, which is expressed as follows:
Finally, the activation map
with
channels is obtained by concatenation. Then, through a depth-wise (grouped) convolution with n groups and a kernel size of 1, we obtain the fused category-aware edge activation map. These operations further help decouple the classes and reweight the feature maps across pyramid levels, allowing the network to learn the importance of each pyramid level for each category. Finally, the poly-based coarse edge ground truth and all the above activation maps are used to calculate a pixel-wise loss in a deeply supervised manner, guiding the network to focus on the different instances of each class in the spatial dimension. The poly-based coarse edge ground truth is generated from the annotations of the original dataset. To further explain the similarities and differences between the proposed attention-like mechanism and conventional attention methods, we also give the attention mechanism in [24], whose expression is shown in Equation (3):
where
represents the feature maps before and after attention. In the attention mechanism of [24], the channel attention
and the spatial attention
act on the input feature map in an insufficiently supervised form.
represents the attention weight of each channel dimension, ⋃ represents channel concatenation, and
C is the number of channels. In CBAM [24], the importance of each channel is obtained by global max pooling and global average pooling, followed by a fully connected layer. The channel attention is then obtained by multiplying the channel weights one by one with each channel’s feature
and concatenating them. Subsequently, pooling, dimensionality reduction, and sigmoid operations are applied along the channel dimension to obtain the response weight of the spatial dimension
, where ⊙ represents the element-wise product.
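For reference, the two-step attention of CBAM described above can be written as follows. This block follows the formulation of the original CBAM paper rather than the notation of Equation (3), so the symbols F, M_c, M_s, and f^{7×7} are those of CBAM and are assumptions with respect to the present text; ⊙ is used for the element-wise product to match the notation here, and σ denotes the sigmoid function.

```latex
\begin{aligned}
M_c(F) &= \sigma\bigl(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\bigr), \\
F'     &= M_c(F) \odot F, \\
M_s(F') &= \sigma\bigl(f^{7\times 7}([\mathrm{AvgPool}(F');\ \mathrm{MaxPool}(F')])\bigr), \\
F''    &= M_s(F') \odot F'.
\end{aligned}
```

Both weights M_c and M_s are learned only through the gradients of the final detection loss, which is the sense in which this form of attention is described as insufficiently supervised.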
Our method differs significantly from the attention methods described above. Its expression is shown in Equation (4), where
represents our proposed self-adaptive upsampling with a factor of 2 per step,
represents the shared connection operation, and
stands for semantic edge supervision.
The main differences are as follows. First, in the channel dimension, the self-adaptive upsample block reduces the channel dimension through a 1 × 1 convolution and a ReLU activation function and then upsamples with a transposed convolution with a stride of 2, repeating these steps until the edge map has the same size as the original image. We then use a 1 × 1 convolution whose number of output channels equals the number of categories, so the resulting edge activation maps have one channel per category. These operations decouple each category into its corresponding channel. Finally, channel supervision is performed by calculating the multi-label pixel-wise loss between the poly-based coarse edge ground truth and the multi-scale edge activation maps of the category channels, which reweights the category channels. Specifically, assuming that the number of categories in the current image is n, these n channels are strengthened, and the remaining channels are weakened as background.
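The progressive upsampling performed by the self-adaptive upsample block can be sketched as a shape plan. The sketch below only tracks tensor sizes and does not implement the convolutions. Several choices are assumptions made for illustration and are not stated in the text: the channel count is halved at each 1 × 1 reduction, the transposed convolution uses a 4 × 4 kernel with padding 1 (which exactly doubles the spatial size and keeps the kernel size divisible by the stride), and the function names are hypothetical.

```python
def transposed_conv_out(size, kernel=4, stride=2, padding=1, output_padding=0):
    """Standard output size of a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def sau_plan(size, stride, channels, num_classes):
    """List the (operation, channels, spatial size) steps of the block.

    size:        spatial size of the input feature map
    stride:      downsampling stride of that map relative to the image
    channels:    number of input channels
    num_classes: number of categories (output channels)
    """
    steps = stride.bit_length() - 1
    if stride < 1 or (1 << steps) != stride:
        raise ValueError("stride must be a power of two")
    plan = []
    for _ in range(steps):
        # 1x1 convolution + ReLU; halving the channels is an assumption.
        channels = max(channels // 2, num_classes)
        plan.append(("conv1x1+relu", channels, size))
        # Transposed convolution with stride 2 doubles the spatial size.
        size = transposed_conv_out(size)
        plan.append(("deconv_s2", channels, size))
    # Final 1x1 convolution: one output channel per category.
    plan.append(("conv1x1_cls", num_classes, size))
    return plan


# Example: a stride-8 feature map of a 512 x 512 image, 15 categories.
plan = sau_plan(size=64, stride=8, channels=256, num_classes=15)
```

For a stride-8 feature map the plan contains three ×2 steps, so the edge map returns to the input resolution gradually instead of through one large transposed convolution.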
Next, we use the shared connection operation to achieve hierarchical attention under supervision. In the previous step, we obtained activation maps at multiple scales after reweighting the channels. For each category, we concatenate its activation maps from the multiple bypasses and then reduce the dimension of the result via depth-wise convolution to obtain the fused activation map. In this way, the activation maps of multiple scales are reweighted and exchanged, which yields the importance of the different pyramid levels.
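The shared connection can be illustrated with a small pure-Python sketch that uses nested lists in place of tensors. It replicates the single-channel map, interleaves the bypasses class by class, and applies a 1 × 1 convolution with one group per category, so each fused channel is a learned weighted sum of that category's maps across bypasses. The variable names (`fine`, `mid`, `coarse`), the toy values, the weights, and the omission of a bias term are all assumptions for illustration.

```python
def replicate(single_map, n):
    """Copy a single-channel map [H][W] n times -> [n][H][W]."""
    return [single_map for _ in range(n)]


def shared_concat(bypasses):
    """Interleave K bypass maps, each [n][H][W], class by class.

    Returns [n * K][H][W]; channels c*K .. c*K + K - 1 belong to class c.
    """
    n = len(bypasses[0])
    return [bypass[c] for c in range(n) for bypass in bypasses]


def grouped_conv1x1(x, weights):
    """1x1 convolution with n groups, K inputs and 1 output per group.

    x: [n * K][H][W]; weights: [n][K]; returns [n][H][W].
    """
    n, k = len(weights), len(weights[0])
    height, width = len(x[0]), len(x[0][0])
    fused = []
    for c in range(n):
        group = x[c * k:(c + 1) * k]
        fused.append([[sum(weights[c][j] * group[j][row][col] for j in range(k))
                       for col in range(width)] for row in range(height)])
    return fused


# Toy example: n = 2 categories, maps of size 1 x 2, K = 3 bypasses.
fine = [[1.0, 0.0]]                          # single-channel bypass
mid = [[[0.5, 0.5]], [[0.0, 1.0]]]           # one channel per category
coarse = [[[1.0, 1.0]], [[0.0, 0.0]]]
stacked = shared_concat([replicate(fine, 2), mid, coarse])
weights = [[0.2, 0.3, 0.5], [1.0, 1.0, 1.0]]  # hypothetical learned weights
fused = grouped_conv1x1(stacked, weights)
```

Because the groups never mix channels of different categories, the per-group weights can be read directly as the importance of each pyramid level for that category.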
In the spatial dimension, we propose a semantic edge label, a coarse edge label derived from the oriented bounding box annotations of the original dataset. Only the edges of the targets to be detected are supervised, not all edges. The background area, especially non-objects with object-like shapes, receives no edge guidance, which weakens the background, highlights the target edges, and achieves some noise reduction. Specifically, we calculate the multi-label pixel-wise loss between the poly-based coarse edge ground truth of the targets to be detected and the multiple bypass activation maps for supervision. The spatial responses
within
N categories are obtained. After attention in the channel and hierarchy dimensions, they are multiplied element-wise with the previously obtained features. In this way, the background is suppressed, the foreground contour is highlighted in a strongly supervised form, and the global context information of the target is captured.
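The generation of the coarse edge label and its use in a multi-label pixel-wise loss can be sketched as follows. The sketch rasterizes the outline of each annotated polygon into the channel of its category and evaluates a per-pixel sigmoid binary cross-entropy. Two points are assumptions: the text specifies only a "multi-label pixel-wise loss", so the binary cross-entropy form is one common choice rather than the confirmed one, and the one-pixel edge thickness is illustrative. All function names are hypothetical.

```python
import math


def rasterize_polygon_edges(polygon, height, width):
    """Mark the outline of a polygon [(x, y), ...] in a [H][W] mask."""
    mask = [[0] * width for _ in range(height)]
    for (x0, y0), (x1, y1) in zip(polygon, polygon[1:] + polygon[:1]):
        steps = max(1, math.ceil(max(abs(x1 - x0), abs(y1 - y0))))
        for i in range(steps + 1):
            t = i / steps
            x = int(round(x0 + t * (x1 - x0)))
            y = int(round(y0 + t * (y1 - y0)))
            if 0 <= x < width and 0 <= y < height:
                mask[y][x] = 1
    return mask


def edge_labels(annotations, num_classes, height, width):
    """annotations: list of (class_index, polygon) -> labels [C][H][W]."""
    labels = [[[0] * width for _ in range(height)] for _ in range(num_classes)]
    for cls, polygon in annotations:
        mask = rasterize_polygon_edges(polygon, height, width)
        for y in range(height):
            for x in range(width):
                labels[cls][y][x] |= mask[y][x]
    return labels


def multilabel_bce(logits, labels):
    """Mean sigmoid binary cross-entropy over all classes and pixels."""
    total, count = 0.0, 0
    for logit_map, label_map in zip(logits, labels):
        for logit_row, label_row in zip(logit_map, label_map):
            for z, t in zip(logit_row, label_row):
                # Numerically stable form of -[t log s(z) + (1 - t) log(1 - s(z))].
                total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
                count += 1
    return total / count


# Toy example: one box of class 0 on a 5 x 5 image with 2 categories.
square = [(1, 1), (3, 1), (3, 3), (1, 3)]
labels = edge_labels([(0, square)], num_classes=2, height=5, width=5)
zero_logits = [[[0.0] * 5 for _ in range(5)] for _ in range(2)]
loss = multilabel_bce(zero_logits, labels)
```

Only pixels on the outlines of annotated targets are positive, so boundaries that belong to the background contribute only as negatives, which matches the noise-suppression role described above.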
After that, to take advantage of this guiding role, we express the other branch in the form of Equation (5).
represents the sigmoid activation function,
represents the FPN feature,
represents the FPN feature after expanding the receptive field, and ⊙ represents the element-wise product. We first use the sigmoid activation function to convert the FPN feature after expanding the receptive field
into the attention-like weights, multiply it element-wise by itself
and finally perform the residual operation. Through these operations, this kind of attention directly guides the adjacent network layers to attend to the edge features of the target and to capture complete object-wise context information, making it more effective and targeted. The final effect is shown in the third row of Figure 3. We have also conducted detailed ablation experiments on this design; the results can be found in Section 3.2.2.
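The gating branch described for Equation (5) can be sketched in a few lines. The exact form of the equation is not reproduced here, so the sketch encodes one plausible reading of the description: the expanded feature is passed through a sigmoid to form attention-like weights, the weights multiply the expanded feature element-wise, and the original FPN feature is added as the residual. The function and argument names are hypothetical, and nested lists stand in for feature maps.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def edge_guided_residual(fpn, expanded):
    """Assumed reading: output = fpn + sigmoid(expanded) * expanded.

    fpn:      FPN feature, [H][W]
    expanded: FPN feature after expanding the receptive field, [H][W]
    """
    return [[f + sigmoid(e) * e for f, e in zip(fpn_row, exp_row)]
            for fpn_row, exp_row in zip(fpn, expanded)]


# Toy example on a 1 x 2 feature map.
fpn = [[1.0, 2.0]]
expanded = [[0.0, 100.0]]
out = edge_guided_residual(fpn, expanded)
```

In this reading, positions with a weak expanded response leave the FPN feature almost unchanged, while strongly activated positions, which the edge supervision pushes toward target contours, are amplified.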