Article

Skip-Encoder and Skip-Decoder for Detection Transformer in Optical Remote Sensing

School of Geography and Ocean Science, Nanjing University, Nanjing 210023, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(16), 2884; https://doi.org/10.3390/rs16162884
Submission received: 27 June 2024 / Revised: 25 July 2024 / Accepted: 5 August 2024 / Published: 7 August 2024

Abstract

The transformer architecture is gradually gaining attention in remote sensing, and many algorithms based on it have been proposed. The DEtection TRansformer (DETR) offers a new way of implementing object detection: it uses the transformer architecture for feature extraction, yet its improved derivative models remain uncommon in remote sensing object detection (RSOD). Hence, we selected the DETR with improved deNoising anchor boxes (DINO) model as a foundation and improved it according to the characteristics of remote sensing images (RSIs). Specifically, we propose the skip-encoder (SE) module, applied in the encoder stage of the model, and the skip-decoder (SD) module, applied in the decoder stage. The SE module enhances the model’s ability to extract multiscale features, whereas the SD module reduces computational complexity while maintaining model performance. The experimental results on the NWPU VHR-10 and DIOR datasets demonstrate that the SE and SD modules enable DINO to better learn small- and medium-sized targets in RSIs. We achieved a mean average precision of 94.8% on the NWPU VHR-10 dataset and 75.6% on the DIOR dataset.

1. Introduction

Object detection aims to locate and classify objects of interest in complex images, including natural scene images, medical images, and remote sensing images (RSIs). Remote sensing object detection (RSOD) is a cutting-edge research field, and detecting targets in RSIs is widely applied in various scenarios, including military reconnaissance, natural resource exploration, and post-disaster relief. However, compared with natural scene images, target detection in RSIs is challenging because of the rich background information and severe data imbalance in RSIs.
As deep learning has advanced in recent years, most RSOD methods have been built on improvements to natural image object detection methods. Mainstream object detection approaches can be roughly categorized into two-stage and one-stage types. A two-stage object detection model adopts a region proposal extraction process and uses the derived region proposals to regress the coordinates and classify the bounding boxes (Bboxes). For instance, R-CNN [1] and its improved variants, such as Fast R-CNN [2] and Faster R-CNN [3], are classic two-stage object detection models. By contrast, a one-stage object detection model processes the image in one step to determine the object’s location and category; YOLO [4] and RetinaNet [5] are typical one-stage methods. All the aforementioned methods involve a post-processing operation, non-maximum suppression (NMS), which eliminates redundant boxes and helps locate the optimal matching box. However, NMS exhibits limitations such as difficult parameter tuning, limited hardware support, and deployment challenges, which hinder the further promotion and industrialization of these object detection methods.
The DEtection TRansformer (DETR) [6] introduces a novel end-to-end object detection method that eliminates the need for NMS, thereby solving the aforementioned issue. It replaces the redundant-box removal of NMS with bipartite matching. Unlike mainstream CNN-based object detection architectures, it mainly uses the transformer [7] architecture for feature extraction. Although the DETR [6] offers a novel and high-potential implementation of object detection, the original model suffers from slow convergence and a long training time. Therefore, Zhu et al. [8] proposed the Deformable DETR, which replaces the self-attention module in the transformer [7] with a deformable attention module to achieve multiscale feature extraction and alleviate the slow convergence. Meng et al. [9] proposed an alternative method, the Conditional DETR, for enhancing the convergence speed of the model. They first showed that the difficulty of learning target spatial positions is one reason for the slow convergence of the original DETR [6]; they therefore incorporated additional spatial information into the query of the cross-attention module in the transformer decoder, thereby improving the model’s convergence speed. The Anchor DETR [10] goes a step further by transforming the learnable queries in the decoder into 2D spatial coordinates. Its authors also observed that objects in a scene may exhibit varying patterns, and that learning these different patterns can improve model performance. Liu et al. [11] analyzed why introducing spatial coordinate information accelerates the convergence of the DETR and proposed the DAB-DETR, which augments the learnable query by transforming it into 4D spatial information (i.e., the box’s x-coordinate, y-coordinate, width, and height). Building on the DAB-DETR [11], the DN-DETR [12] and the DETR with improved deNoising anchor boxes (DINO) [13] additionally incorporate denoising learning to allow for faster convergence and have achieved state-of-the-art (SOTA) performance in natural scene object detection. However, as new implementation paths for object detection, these models have not yet received considerable attention in the RSOD field.
Compared to natural images, RSIs are taken from an overhead perspective, cover a larger area, contain complex background information, and exhibit greater scale differences among different categories of targets. Existing object detection models for natural images typically face challenges in recognizing targets captured at various distances and angles, identifying obscured targets, and operating under different lighting conditions. In contrast, the primary challenge in RSOD is detecting targets of varying scales within large areas of a rich background.
In response to the RSI characteristics, improvements in the RSOD field are typically made from the aspects of information fusion and feature enhancement, background suppression, the exploration of contextual background relationships, and target information mining at different scales. For instance, Liu et al. [14] proposed the adaptive feature pyramid network and the context enhancement module to better aggregate multiscale features and enhance them. They also introduced the enhanced effective channel attention module to confine the disturbance caused by the complicated background. Hu et al. [15] designed a global semantic interaction module to improve model performance by suppressing background information and enhancing foreground objects. They also incorporated the local attention pyramid to extract small objects specifically. Similarly, Zhang et al. [16] proposed the coarse-to-fine feature adaptation and coarse-to-fine sample assignment to improve features and select stronger training samples on RSIs, respectively. To better capture important information, Dong et al. [17] proposed a gated context-aware module to adaptively use local valuable information and overall context information in the feature pyramid network (FPN). Likewise, Teng et al. [18] adopted Clip-long short-term memory (LSTM) to exploit the spatial correlation information and used multiscale perception for extracting global context clues. Ye et al. [19] proposed an adaptive attention fusion mechanism to integrate semantic information at different scales and thus better extract multiscale objects. To generate high-quality feature presentations for each scale, Wang et al. [20] developed a feature-reflowing pyramid structure by integrating fine-grained features from the adjacent lower level. Similarly, to underscore the features of small objects in shallow feature maps, SME-Net [21] proposed a feature split-and-merge module for eliminating the salient information of large objects. However, most of the aforementioned methods improve CNN-based object detection architectures, such as Faster R-CNN [3], and typically require NMS as a post-processing operation for eliminating redundant boxes in the training phase.
With the advancement of remote sensing technology, the acquisition of RSIs has significantly increased in both speed and coverage, leading to a growing demand for processing ample RSI data. NMS evidently hinders the training of object detection models involving abundant RSI data. As a potential new implementation of object detection, the DETR-based object detection method is worthy of attention and development in the RSOD field because it does not use NMS operations in the training phase. Thus, this study aims to introduce the DETR-based model and propose some improvements to better adapt it to the characteristics of RSIs.
Considering the aforementioned factors and inspired by the skip-attention module [22], we propose the skip-encoder (SE) and skip-decoder (SD) modules for DINO [13], which currently achieves SOTA performance on natural images. The SE module enhances the model’s multiscale feature extraction capability at the encoder stage, thereby improving DINO’s [13] capacity to learn small- and medium-sized targets. In RSIs, small- and medium-sized targets are often numerous, and in tasks such as military reconnaissance and disaster emergency rescue, they typically constitute the primary objectives. Therefore, the SE module is a better option for DINO [13] in RSOD. The SD module replaces the multi-head self-attention (MSA) in the decoder layer, transforming the computational complexity from growing quadratically with the length of the decoder input to growing linearly, without decreasing the performance of the model. Finally, our proposed modules were evaluated on two representative public remote sensing datasets, NWPU VHR-10 [23] and DIOR [24]. The contributions of the present study are summarized as follows:
  • To enhance the capabilities for detecting small- and medium-sized targets, we proposed the SE module, which is mainly used in the encoder stage of the model, to enhance the model’s ability to extract multiscale features.
  • We proposed the SD module, which is primarily used in the decoder stage and can reduce the computational cost of the model without affecting its performance.
Our method achieved a mean average precision (mAP50) of 94.8% on the small public dataset NWPU VHR-10 and of 75.6% on the large public dataset DIOR. The remainder of this article is organized as follows. In Section 2.1, we briefly review the related works on DETR-based models and the skip-attention [22] module. A detailed description of the proposed methodology is provided in Section 2.2, and the settings and results of our experiments are presented in Section 3. The discussion is provided in Section 4, and the conclusion is presented in Section 5.

2. Materials and Methods

2.1. Related Work

2.1.1. DETR-Based Model

Being a rare end-to-end model in the object detection field, the DETR [6] elegantly implements object detection by discarding NMS operations in the training phase. While the DETR [6] simplifies object detection, it faces two main challenges: First, its convergence speed is extremely slow, and second, the computational costs of using the original transformer increase quadratically with the sequence length.
Subsequent studies and improvements in the DETR [6] are chiefly aimed at addressing these two issues. First, the Deformable DETR [8] introduces the deformable attention module to replace the self-attention module, thereby reducing the quadratic complexity that increases with the sequence length to linear complexity. Figure 1 provides an example illustrating the implementation of the deformable attention module. The input data for this module comprise two components: a single vector query from the sequence data, and the feature map restored from the sequence.
Firstly, the query identifies a reference point on the feature map. Through a linear layer, it learns the offset points near the reference point on the feature map that need to be attended to. The values of the feature map at the positions corresponding to these offset points are the focus of this query. These values are combined with an attention weight derived from the query to produce the output. Compared with the self-attention module [7] where each query needs to be computed with all vectors of the input sequence, the deformable attention module [8] only needs to concentrate on a predetermined number of focus points. This considerably reduces the computational costs and eliminates many background interferences.
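To make this sampling process concrete, the following is a minimal, single-scale, single-head PyTorch sketch of deformable-attention-style sampling; the tensor layout, the number of sampling points, and the offset normalization are illustrative assumptions rather than the reference implementation of [8].

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-scale, single-head sketch of deformable attention sampling."""
    def __init__(self, d_model: int = 256, n_points: int = 4):
        super().__init__()
        self.n_points = n_points
        self.offset_proj = nn.Linear(d_model, n_points * 2)  # learned (dx, dy) offsets per query
        self.weight_proj = nn.Linear(d_model, n_points)      # attention weight per sampling point
        self.value_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, query, feat_map, ref_points):
        # query: (B, Lq, d); feat_map: (B, d, H, W) restored from the flattened sequence;
        # ref_points: (B, Lq, 2) normalized (x, y) reference points in [0, 1]
        B, Lq, _ = query.shape
        H, W = feat_map.shape[-2:]
        value = self.value_proj(feat_map.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # (B, d, H, W)

        offsets = self.offset_proj(query).reshape(B, Lq, self.n_points, 2)
        weights = self.weight_proj(query).softmax(dim=-1)                          # (B, Lq, P)

        scale = torch.tensor([W, H], dtype=query.dtype, device=query.device)
        loc = ref_points.unsqueeze(2) + offsets / scale       # sampling locations in [0, 1]
        grid = 2.0 * loc - 1.0                                # grid_sample expects [-1, 1]

        sampled = F.grid_sample(value, grid, align_corners=False)                  # (B, d, Lq, P)
        out = (sampled * weights.unsqueeze(1)).sum(dim=-1).transpose(1, 2)         # (B, Lq, d)
        return self.out_proj(out)

Each query touches only n_points sampled values rather than all H × W positions, which is where the linear (rather than quadratic) cost comes from.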
Along with addressing the slow convergence issue caused by self-attention in the DETR [6], some researchers have sought to optimize its decoder stage. The Conditional DETR [9] suggests that the learnable queries in the DETR [6] take too long to learn spatial information, which is one reason for its slow convergence. Consequently, the Conditional DETR [9] proposes that spatial information be explicitly added to the learnable queries in the decoder. The Anchor DETR [10] further proposes transforming the randomly initialized learnable queries in the decoder into queries learned from uniformly initialized anchor points, thus improving the learning of spatial coordinate information. Moreover, based on the design philosophy of the Anchor DETR [10], the DAB-DETR [11] improves the decoder part of the DETR [6]. In addition to spatial coordinates, it introduces scale information such as width and height, thereby significantly accelerating convergence. The DN-DETR [12] incorporates denoising learning, extending the DAB-DETR [11] to handle the instability of bipartite matching in the loss computation and improving the learning capability for boxes. Furthermore, building upon the DN-DETR [12], DINO [13] refines denoising learning and introduces mixed query selection and Contrastive DeNoising (CDN) training.
As shown in Figure 2, DINO comprises a backbone, an encoder made up of six encoder layers, a decoder consisting of six decoder layers, and a linear layer that transforms the decoder output into category and box outputs. The backbone is used to extract multiscale feature maps from the input image, typically using ResNet50 [25]. Each encoder layer in DINO [13] adopts the deformable attention module [8] in place of the MSA [7] module. After passing through the six encoder layers, the resulting output is referred to as the memory, which participates in the cross-attention computation in the decoder layers. Simultaneously, DINO [13] uses a multi-layer perceptron (MLP) to derive initial anchors from the encoder’s output memory, which then initialize the learnable queries of the decoder input. This way of incorporating spatial information improves convergence speed more effectively than the direct random initialization of learnable queries in the DETR [6]; this is the aforementioned mixed query selection. In the decoder stage, the input consists of three parts: first, a randomly initialized learnable sequence of one-dimensional vectors, referred to as the embedding; second, the learnable queries derived by mixed query selection, which play a role similar to position encoding; and third, noise data obtained by adding noise to the ground-truth data, which primarily accelerates training. The noise data do not participate in the final bounding box matching process but instead learn to denoise through the CDN module. The specific denoising training process is described in [13].
In this study, as DINO [13] exhibited the SOTA performance among the current DETR [6] series models, we used DINO [13] as the baseline in the experiments and improved it to better meet the requirements of RSOD.

2.1.2. Skip-Attention

Skip-attention [22], introduced in 2023, enhances the computational efficiency of the vision transformer (ViT) [26]. As shown in Figure 3, this module approximates the output of the MSA from the preceding layer using a simpler module Φ . This module reduces computational redundancy without compromising on model performance.
Specifically, the module $\Phi$ consists of two linear layers, a depth-wise convolution (DwC) [27], and an efficient channel attention (ECA) module [28]. Given the output $Z_{l-1}^{\mathrm{MSA}} \in \mathbb{R}^{n \times d}$ of the MSA at layer $l-1$, the approximation $\hat{Z}_{l}^{\mathrm{MSA}} \in \mathbb{R}^{n \times d}$ of the output $Z_{l}^{\mathrm{MSA}}$ of the MSA at layer $l$ can be calculated using the following formula:

$$\hat{Z}_{l}^{\mathrm{MSA}} = \mathrm{ECA}\left(\mathrm{FC}_{2}\left(\mathrm{DwC}\left(\mathrm{FC}_{1}\left(Z_{l-1}^{\mathrm{MSA}}\right)\right)\right)\right)$$

The input $Z_{l-1}^{\mathrm{MSA}}$ first passes through the first linear layer $\mathrm{FC}_{1}: \mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{n \times 2d}$, which expands the channel dimension. The sequence, with length $n$ and expanded channel dimension $2d$, is reshaped into a 3D tensor of dimensions $\sqrt{n} \times \sqrt{n} \times 2d$. A subsequent kernel of size $r \times r$ in $\mathrm{DwC}: \mathbb{R}^{\sqrt{n} \times \sqrt{n} \times 2d} \rightarrow \mathbb{R}^{\sqrt{n} \times \sqrt{n} \times 2d}$ is applied to capture relationships between different input tokens. Of note, GeLU activations are used after $\mathrm{FC}_{1}$ and DwC. Subsequently, the DwC output is flattened and enters $\mathrm{FC}_{2}: \mathbb{R}^{n \times 2d} \rightarrow \mathbb{R}^{n \times d}$, which restores the channel dimension to $d$. Finally, the ECA [28] first uses global average pooling along the channels to aggregate features; then, a 1D convolution with an adaptive kernel size proportional to the channel dimension and a sigmoid activation function are employed to learn the dependencies between different channels.
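For concreteness, a minimal PyTorch sketch of this parametric function $\Phi$ is given below. The layer sizes follow the description above, while the ECA kernel-size rule and the assumption that the token count n is a perfect square are our own simplifications, not the reference implementation of [22].

import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention: global pooling + 1D conv with an adaptive kernel size."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        k = int(abs((math.log2(channels) + b) / gamma))
        k = k if k % 2 else k + 1                        # kernel size must be odd
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                                # x: (B, n, d)
        w = x.mean(dim=1, keepdim=True)                  # average pool over tokens -> (B, 1, d)
        w = torch.sigmoid(self.conv(w))                  # learn channel dependencies
        return x * w

class SkipAttentionPhi(nn.Module):
    """Phi = FC1 -> DwC -> FC2 -> ECA, approximating the next layer's MSA output."""
    def __init__(self, d: int, kernel_size: int = 3):
        super().__init__()
        self.fc1 = nn.Linear(d, 2 * d)
        self.dwc = nn.Conv2d(2 * d, 2 * d, kernel_size, padding=kernel_size // 2, groups=2 * d)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(2 * d, d)
        self.eca = ECA(d)

    def forward(self, z):                                # z: (B, n, d); n assumed to be a perfect square
        B, n, d = z.shape
        s = int(math.isqrt(n))
        x = self.act(self.fc1(z))                        # FC1 + GeLU: (B, n, 2d)
        x = x.transpose(1, 2).reshape(B, 2 * d, s, s)    # reshape tokens onto a sqrt(n) x sqrt(n) grid
        x = self.act(self.dwc(x))                        # depth-wise conv + GeLU
        x = x.reshape(B, 2 * d, n).transpose(1, 2)       # flatten back to (B, n, 2d)
        return self.eca(self.fc2(x))                     # FC2 restores dimension d, then ECA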
Skip-attention [22] is applied when the intermediate layers of the model produce highly correlated outputs, replacing the MSA in those layers to reduce computation. The correlation is analyzed through the CKA similarity [29]. However, although skip-attention [22] can be applied to highly correlated MSA modules [7], the design of the parametric function Φ in skip-attention [22] is neither suitable for RSOD nor applicable to the deformable attention module [8]. In this study, inspired by the skip-attention module [22], we designed the SE and SD modules, which are suitable for RSIs.

2.2. Methods

We propose two modules, SE and SD. The SE module replaces the deformable attention module [8] in the encoder stage, enhancing the model’s ability to extract features at different scales. The SD replaces part of the MSA module [7] in the decoder stage. The SD module can reduce a certain amount of computational complexity without significantly affecting the model’s performance. As the number of candidate Bboxes to be detected increases, the SD brings a greater reduction in computational load. The details of the two components and computational complexity are presented in the following.

2.3. Skip-Encoder Module

In the deformable attention module [8], the input is typically a concatenation of flattened multiscale feature maps. However, the parametric function Φ in the skip-attention module [22] does not apply to multiscale feature maps: it neglects the spatial information inherent in the input sequence and transforms the 1D sequence data directly into 2D data for the DwC [27] operation, which is clearly unsuitable for multiscale inputs. Meanwhile, in the encoder stage, indiscriminately stacking deformable attention modules [8] also leads to considerable redundant computation.
Considering the aforementioned reasons, we designed the SE module. The SE module can replace the deformable attention module [8] because it can extract features from targets of different scales, reduce information interference between scales, and better handle the complex backgrounds in RSIs. As illustrated in Figure 4, the proposed SE module is predominantly employed to replace the highly correlated adjacent layers within DINO’s encoder. In the SE module, a linear layer first doubles the dimensionality of the input data, enhancing the recognition of fine image patterns. The SE module then reshapes the input data into four parts, consistent with the shapes of the different-scale feature maps input into the encoder. Each part is passed through its own DwC [27] and point-wise convolution (PwC) [27], extracting information at its specific scale without interference from other scales. Subsequently, all four parts are flattened, concatenated together, and passed through a linear layer to restore a sequence consistent with the shape of the input data. Finally, an ECA module [28] is used to exchange information between channels across different scales, thus obtaining the output. The detailed implementation of the SE module is given in Algorithm 1.
Algorithm 1 Implementation of the SE module
Input: previous layer’s output Z_{i-1}
Output: Z_i
 1: X ← Linear(Z_{i-1})
 2: X_{2d} ← GELU(X)
    // reshape X_{2d} into {x_i}, matching the shapes of the multiscale feature maps input to the encoder
 3: {x_i} ← Reshape(X_{2d})
    // each x_i passes through its own DwC and PwC
 4: for x in {x_i} do
 5:    t ← DwC(x)
 6:    x_new ← PwC(t)
 7: end for
 8: X_new ← Concat({x_new})
    // restore the shape of X_new from R^(n×2d) to R^(n×d)
 9: X_d ← Linear(X_new)
    // exchange information across different scales using ECA
10: Z_i ← ECA(X_d)
The output of the SE module is then combined with its input, that is, the output from the previous layer, through a residual operation. This is primarily done to maintain consistency with the original encoder layer process and to preserve information.
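To complement Algorithm 1, a condensed PyTorch sketch of an SE-style block is shown below. The kernel size, the simplified fixed-kernel ECA, and the interface taking per-scale spatial shapes are our own assumptions; this is a sketch of the idea, not the authors’ released code.

import torch
import torch.nn as nn

class SkipEncoder(nn.Module):
    """SE sketch: per-scale DwC + PwC over the reshaped multiscale sequence, then channel attention."""
    def __init__(self, d: int, n_scales: int = 4, kernel_size: int = 3):
        super().__init__()
        self.expand = nn.Linear(d, 2 * d)
        self.act = nn.GELU()
        # one depth-wise + point-wise pair per feature-map scale, so scales do not interfere
        self.dwc = nn.ModuleList([
            nn.Conv2d(2 * d, 2 * d, kernel_size, padding=kernel_size // 2, groups=2 * d)
            for _ in range(n_scales)])
        self.pwc = nn.ModuleList([nn.Conv2d(2 * d, 2 * d, kernel_size=1) for _ in range(n_scales)])
        self.restore = nn.Linear(2 * d, d)
        self.eca_conv = nn.Conv1d(1, 1, kernel_size=3, padding=1, bias=False)  # simplified ECA (fixed kernel)

    def forward(self, z, spatial_shapes):
        # z: (B, N, d), the concatenation of flattened multiscale feature maps
        # spatial_shapes: list of (H_i, W_i) per scale, with sum(H_i * W_i) == N
        B, N, d = z.shape
        x = self.act(self.expand(z))                              # Linear + GELU: (B, N, 2d)
        outs, start = [], 0
        for i, (h, w) in enumerate(spatial_shapes):
            part = x[:, start:start + h * w]                      # tokens belonging to this scale
            part = part.transpose(1, 2).reshape(B, -1, h, w)      # restore the 2D layout of the scale
            part = self.pwc[i](self.dwc[i](part))                 # scale-specific DwC then PwC
            outs.append(part.flatten(2).transpose(1, 2))          # back to (B, H_i*W_i, 2d)
            start += h * w
        x = self.restore(torch.cat(outs, dim=1))                  # concatenate scales, restore to (B, N, d)
        w_ch = torch.sigmoid(self.eca_conv(x.mean(dim=1, keepdim=True)))
        return x * w_ch                                           # cross-scale channel recalibration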

2.4. Skip-Decoder Module

In the decoder stage, the input data are a sequence of one-dimensional vectors. Each 1D vector, after passing through the decoder, participates in the final category and Bbox output; that is, each 1D vector corresponds to a candidate Bbox on the image. Compared to natural images, detection tasks on RSIs often involve dense objects or cover extensive areas. Intuitively, the more candidate boxes, the better. However, the computational complexity of the MSA module [7] increases quadratically with the length of the input sequence, so in the face of an increasing demand for candidate Bboxes, there is room for improvement. Therefore, we propose the SD module for use in the decoder stage. The SD module can reduce computational complexity without significantly affecting the model’s performance, and an increase in the number of candidate Bboxes brings only a linear increase in computational complexity.
As illustrated in Figure 5, the SD module separates the input sequence data and noise data during the decoder stage. First, the sequence data are passed through a linear layer to double their original dimension. Then, the sequence data are treated as 2D data with a width of 1 and a height equal to the sequence length, and DwC and PwC [27] are performed on this 2D plane. In this case, DwC [27] exchanges information among different vectors of the sequence within each channel, and PwC [27] exchanges information among the channels within a single vector. This allows information to be explored across all data in the sequence without adding any extra artificial information. We also use an additional DwC [27] to strengthen the information exchange between different vectors. Finally, we restore the original dimension through a linear layer and use ECA [28] to exchange information across channels.
For the noise part, the denoising learning of the DINO [13] model requires the attention mask that is unique to MSA [7]. To retain the denoising process, we adopt the concept of the cross-attention module to implement denoising learning in the SD module. Specifically, we use the noise as the query of MSA [7] and the sequence data as the key and value to facilitate the learning of the noise part. This ensures that the noise part can learn information from the sequence, while the sequence does not learn from the noise, preventing information leakage that could affect the training process.
From the computational complexity perspective, if the input sequence length is $n$ and the noise component is fixed at a length of 200, the computational complexity of the SD module can be summarized as $O(nd^2 + 200nd)$, where we have simplified some constant factors to emphasize the importance of each variable. In comparison to the computational complexity of MSA [7], which is $O(n^2 d)$, it is evident that the SD module provides a more lightweight computational cost as $n$, corresponding to the number of candidate Bboxes, increases.
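The following minimal PyTorch sketch illustrates the SD idea. The fixed noise length of 200 and the query/key/value roles follow the text above, while the kernel sizes, the simplified ECA, and the use of nn.MultiheadAttention for the noise branch are our own assumptions.

import torch
import torch.nn as nn

class SkipDecoder(nn.Module):
    """SD sketch: DwC/PwC mixing over the query sequence; noise queries attend to it via cross-attention."""
    def __init__(self, d: int, n_heads: int = 8, kernel_size: int = 3):
        super().__init__()
        self.expand = nn.Linear(d, 2 * d)
        self.act = nn.GELU()
        # treat the sequence as a (length x 1) 2D map: DwC mixes across vectors, PwC mixes within a vector
        self.dwc1 = nn.Conv2d(2 * d, 2 * d, (kernel_size, 1), padding=(kernel_size // 2, 0), groups=2 * d)
        self.pwc = nn.Conv2d(2 * d, 2 * d, kernel_size=1)
        self.dwc2 = nn.Conv2d(2 * d, 2 * d, (kernel_size, 1), padding=(kernel_size // 2, 0), groups=2 * d)
        self.restore = nn.Linear(2 * d, d)
        self.eca_conv = nn.Conv1d(1, 1, kernel_size=3, padding=1, bias=False)  # simplified ECA
        self.noise_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)

    def forward(self, queries, noise):
        # queries: (B, n, d) candidate-box queries; noise: (B, 200, d) denoising queries
        B, n, d = queries.shape
        x = self.act(self.expand(queries)).transpose(1, 2).reshape(B, 2 * d, n, 1)
        x = self.dwc2(self.pwc(self.dwc1(x)))                     # O(n) token mixing instead of O(n^2) MSA
        x = x.reshape(B, 2 * d, n).transpose(1, 2)
        x = self.restore(x)                                       # back to (B, n, d)
        w = torch.sigmoid(self.eca_conv(x.mean(dim=1, keepdim=True)))
        x = x * w                                                 # channel-wise recalibration
        # noise queries read from the sequence (query = noise, key/value = sequence); no leakage back
        noise_out, _ = self.noise_attn(noise, x, x)
        return x, noise_out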

2.5. Network Details

Our proposed module is mainly an enhancement of a DINO-based foundation. As illustrated in Figure 6, the SD and SE modules are applied to specific layers within the encoder and decoder of DINO [13], respectively. Further network details are summarized as follows:
  • Backbone: We used a pre-trained ResNet-50 [25] on ImageNet-1K [30] as the backbone of our model, returning the last three feature maps. The fourth feature map is convolved from the final feature map output by the backbone, thus forming four feature maps of different scales as the input for the transformer [7] stage.
  • Transformer Encoder: The encoder consists of six layers. In our experiments, the proposed SE module mainly replaced the deformable attention module [8] in the 4th, 5th, and 6th encoder layers of DINO [13], which resulted in optimal performance. The other encoder layers are consistent with those of DINO [13], adopting the deformable attention module [8] instead of the MSA module [7].
  • Transformer Decoder: The decoder consists of six layers. The proposed SD module mainly replaced the MSA module in the 3rd and 4th decoder layers of DINO [13]. In line with the approach adopted by DINO [13], the multi-head cross-attention module [7] in the decoder layer was replaced with the deformable attention module [8].
  • Loss Function: We did not add a new loss function. The most suitable results were selected from the output sequence of the decoder through bipartite matching. These results were then used for calculating the three main loss functions, including focal loss [5] for classification, L1 loss, and generalized intersection over union loss [31] for learning the bounding box.
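As a hedged illustration of how the three loss terms in the last bullet could be combined after bipartite matching, the sketch below uses standard torchvision operators; the loss weights are placeholders and not the coefficients used in the paper.

import torch
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou

def detection_loss(pred_logits, pred_boxes, tgt_labels_onehot, tgt_boxes,
                   w_cls=1.0, w_l1=5.0, w_giou=2.0):
    """Focal classification loss + L1 + GIoU loss over matched prediction/target pairs.

    pred_logits: (M, num_classes); pred_boxes / tgt_boxes: (M, 4) in (x1, y1, x2, y2);
    tgt_labels_onehot: (M, num_classes) float one-hot. M = number of matched pairs.
    """
    cls_loss = sigmoid_focal_loss(pred_logits, tgt_labels_onehot, reduction="mean")
    l1_loss = torch.nn.functional.l1_loss(pred_boxes, tgt_boxes)
    giou = torch.diag(generalized_box_iou(pred_boxes, tgt_boxes))   # GIoU of each matched pair
    giou_loss = (1.0 - giou).mean()
    return w_cls * cls_loss + w_l1 * l1_loss + w_giou * giou_loss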

3. Results

This section introduces the basic experimental setup, including the datasets, evaluation metrics, and implementation details. Subsequently, we compare our model with mainstream RSOD models. Then, the impact of our proposed modules on the model’s performance is evaluated using various metrics. Finally, ablation experiments are conducted to demonstrate the effectiveness of the proposed modules.

3.1. Datasets

We selected two datasets, NWPU VHR-10 and DIOR. NWPU VHR-10 is a dataset with a smaller volume of data, smaller image sizes, and fewer target categories. It presents lower training difficulty and reflects the model’s performance on datasets containing less data. By contrast, DIOR involves a larger volume of data, larger image sizes, and more target categories, and it presents higher training difficulty; it reflects the model’s performance on datasets containing more data. The two datasets are introduced as follows:
  • NWPU VHR-10: This dataset is an aerial image dataset for bounding box object detection and encompasses ten categories. The second version of this dataset contains 1172 images (400 × 400 pixels) cropped from 650 aerial images with sizes ranging from 533 × 597 to 1728 × 1028 pixels. We used the prevalent data split, that is, 75% of the dataset (879 images) was allocated for training and the remaining 25% (293 images) was allocated for testing.
  • DIOR: This dataset is the most representative object detection dataset for RSIs. It contains 23,463 images (800 × 800 pixels), encompassing 20 categories. Following the mainstream setup, this dataset allocated 11,725 images (50% of the dataset) for training, with the remaining 11,738 images being designated for testing.

3.2. Evaluation Metrics

We employed the average precision (AP) of each class and the mAP of all the classes for evaluating the model’s performance. The AP and mAP are metrics that are commonly used in the remote sensing field. A higher mAP indicates a better object detection capability of the model. The AP and mAP are calculated as follows:
$$\mathrm{AP} = \int_{0}^{1} P(R)\,dR$$
$$\mathrm{mAP} = \frac{1}{N_c} \sum_{i=1}^{N_c} \mathrm{AP}_i$$
where $P$ and $R$ represent the precision and the recall, respectively, and $N_c$ denotes the number of classes contained in the current dataset. The precision $P$ and the recall $R$ are defined as
$$P = \frac{N_{TP}}{N_{TP} + N_{FP}}$$
$$R = \frac{N_{TP}}{N_{TP} + N_{FN}}$$
where $N_{TP}$, $N_{FP}$, and $N_{FN}$ denote the numbers of true positives, false positives, and false negatives, respectively. If the intersection over union (IoU) of the model’s predicted bounding box and the ground truth exceeds a specific threshold, such as 0.5, and the predicted class is correct, the prediction is counted as a true positive. In this study, the mainstream remote sensing metric mAP50 was primarily used as the evaluation criterion for performance comparison. In the “Model Analysis” section, we also provide mAP75 and mAP@50:5:95 as additional metrics for measuring the model’s performance. Here, the subscripts 50 and 75 denote IoU thresholds of 0.50 and 0.75 for true positives, and mAP@50:5:95 denotes the mAP averaged over IoU thresholds from 0.50 to 0.95 with a step of 0.05. Moreover, we adopted the target-scale division provided by the COCO [32] dataset: small targets are smaller than 32 × 32 pixels, medium targets are between 32 × 32 and 96 × 96 pixels, and large targets are larger than 96 × 96 pixels. Finally, we report the mAP@50:5:95 at different scales as APs, APm, and APl, which correspond to the recognition rates of small, medium, and large targets, respectively, and allow a better comparison of performance across scales.
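As a small worked sketch of these formulas, the snippet below computes the AP integral with the common all-points precision envelope and averages per-class APs into the mAP; the interpolation scheme is an assumption, since the paper does not specify it.

import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """AP = integral of P(R) dR, using the monotone precision envelope (all-points interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # make precision non-increasing in recall
    idx = np.where(r[1:] != r[:-1])[0]              # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(per_class_ap: dict) -> float:
    """mAP = (1 / N_c) * sum of per-class APs."""
    return float(np.mean(list(per_class_ap.values())))

# toy example: precision/recall curve of one class at IoU threshold 0.5
ap_ship = average_precision(np.array([0.2, 0.5, 0.8]), np.array([1.0, 0.9, 0.7]))
print(mean_average_precision({"ship": ap_ship, "vehicle": 0.65}))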

3.3. Implementation Details

Our experimental environment was the PyTorch framework, and our models were trained using two NVIDIA Tesla T4 GPUs. We adopted the AdamW optimizer [33,34] with a weight decay of $1 \times 10^{-4}$, and the models used an initial learning rate of $1 \times 10^{-4}$.
In the experiment conducted on the NWPU VHR-10.v2 dataset, the batch size was set to 2 and the number of epochs was set to 50. The learning rate was dropped to 0.1 times the value at the 35th epoch. Data augmentation was performed using random horizontal flipping and cropping. The input size of the images was fixed at 400 × 400 pixels.
In the experiment conducted on the DIOR dataset, the batch size of the model was set to 2 and the number of epochs was set to 18. Data augmentation was performed using only random horizontal flipping. The learning rate dropped to 0.1 times its original value at the 14th epoch. The input size of the images was fixed at 800 × 800 pixels.
We also used the same training settings to train the previously mentioned DETR [6], Deformable DETR [8], and DINO [13]. Apart from the DETR [6], which used overall pre-trained weights on the COCO dataset for 300 epochs because of its slow training speed, all the other models only used the pre-trained weights from the ImageNet-1K [30] classification task for the ResNet50 [25] backbone. These models were compared with the current mainstream models in RSOD and thus served as a new standard baseline for reference.
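A minimal sketch of the NWPU VHR-10 optimization setup described above is given below; model and train_loader are placeholders for the detector and data pipeline, and the loss interface is hypothetical.

import torch

# assumes `model` is the DINO-based detector described above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

# NWPU VHR-10 schedule: 50 epochs, learning rate multiplied by 0.1 at epoch 35
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[35], gamma=0.1)

for epoch in range(50):
    for images, targets in train_loader:        # batch size 2, 400 x 400 inputs
        loss = model(images, targets)           # placeholder: returns the combined detection loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()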

3.4. Performance Comparison

In this part, our comparative study primarily focuses on key models, including Fast R-CNN [2], Faster R-CNN [3], Yolov3 [35], FPN [36], FCOS [37], GLNet [18], ABNet [14], CoF-Net [16], and GLSANet [15]. These models were selected for comparison mainly because they have been trained on the NWPU VHR-10 and DIOR datasets, thereby providing sufficient reference value.
  • Results on NWPU VHR-10: In Table 1, after DINO [13] adopted the SE module, most categories of NWPU VHR-10 exhibited significant improvement, with the overall mAP50 increasing from 92.1% to 94.8%, thereby attaining SOTA performance. Compared with the mainstream RSOD models, DINO [13] with the SE module demonstrated superior overall performance. On the other hand, the SD module also brought a minor improvement to DINO [13], which aligns with our motivation for designing the SD, that is, to reduce computational complexity without compromising model performance. Additionally, an odd phenomenon was observed: when the SE and SD modules were used together, the performance did not match that of using either module individually. This is explained in the subsequent model analysis section. Overall, the results on NWPU VHR-10 show that our SE module can comprehensively enhance DINO’s feature extraction capability, thereby improving the model’s performance, and that the SD module does not affect the model’s performance.
  • Results on DIOR: As shown in Table 2, our SE module enhanced the overall mAP50 of DINO [13] from 74.6% to 75.6%, finally achieving a performance close to the SOTA on the DIOR dataset. In the results of DINO [13] using the SE module, an improvement can be observed in the recognition of common small- and medium-sized object categories in RSIs, such as airplanes, bridges, ships, vehicles, and storage tanks. For instance, in the airplane class, the SE module increased the mAP50 from 71.1% to 76.3%, enhancing the performance by 5.2%. On the other hand, the SD module did not compromise the model’s performance on the large-scale dataset DIOR, reflecting the generalizability of the SD module. In summary, the results on DIOR reflect the performance improvements in various aspects of DINO [13] provided by the SE module, illustrating the role of the SE in mining useful information. This also reflects that the SD module can maintain the model’s performance while reducing computational complexity, regardless of the size of the dataset.

3.5. Model Analysis

To further analyze the effect of the SE and SD modules on DINO [13], more comprehensive indicators were used for comparison. First, different scales were evaluated using APs, APm, and APl. Then, stricter overall evaluation indicators such as mAP75 and mAP@50:5:95 were used. Finally, we computed the model’s parameters, evaluated the computational cost in giga floating-point operations (GFLOPs), and assessed the inference speed in frames per second (FPS).
  • Model analysis on NWPU VHR-10: As shown in Table 3, under the same training conditions, either the SE or SD module can enhance performance on small- and medium-sized targets, which are useful for RSOD. It is noteworthy that the SE module increased the APs from 30.5% to 36.5%, and the SD module improved the APs from 30.5% to 36.7%. However, when the SE and SD modules were used in combination, the APs only increased from 30.5% to 32.9%. We attributed this phenomenon to the fact that NWPU VHR-10 is a small dataset, and both the SD and SE modules enhance the APs, leading to overfitting, which in turn significantly decreases the APs. In Figure 7, we provide representative detection results for each category in the NWPU VHR-10 dataset.
  • Model analysis on DIOR: As shown in Table 4, the SE module slightly increased DINO’s mAP50 from 74.6% to 75.6%, and the SD module maintained DINO’s model performance. On DIOR, the individual use of the SE and SD modules brings a slight improvement to the detection of medium-sized targets but does not significantly enhance the APs for small-sized targets. This is primarily because the DIOR dataset, a large dataset with 20 categories, presented high training difficulty, and there is still a bottleneck in the model’s performance. When the SD and SE modules are used in combination, they can increase the APs from 15.6% to 16.6%. This not only demonstrates that both the SD and SE can enhance the model’s ability to detect small targets but also validates the previously mentioned overfitting phenomenon of the SD and SE on NWPU VHR-10. Similarly, in Figure 8, we present representative detection results for each category in the DIOR dataset.
Finally, as shown in Figure 9, we analyzed the influence of the SE and SD modules on the model’s convergence time. We observed that both on the smaller-scale NWPU VHR-10 dataset and the large-scale DIOR dataset, the SE and SD modules significantly enhance the convergence speed.

3.6. Ablation Study

In this section, we conducted ablation experiments in two directions. First, we employed CKA analysis [29] and heatmaps to elucidate why the deformable attention modules and the MSA modules in specific layers of DINO [13] were chosen to be replaced with the SE and SD, respectively. Second, we verified the impact of using the SE or SD in different numbers of layers.
First, as shown in Figure 10, we can observe that, whether on the NWPU VHR-10 or the DIOR dataset, and whether in the encoder stage or the decoder stage, there are areas of high correlation. This indicates considerable computational redundancy in the architectural design of DINO. Specifically, during the encoder stage, there is a high correlation between the second to fourth layers (corresponding to one to four) and between the fourth to sixth layers (corresponding to three to six). However, compared to the encoder stage, during the decoder stage, although the CKA similarity [29] can also indicate which specific decoder layers are highly correlated, it is not accurate enough. As shown in Figure 11, in the output of each decoder layer, most of the vectors typically correspond to redundant boxes with low category probability scores, which can affect the accuracy of the CKA similarity [29]. Therefore, for the SE module, we mainly selected the encoder layers with high CKA similarity [29]. For the SD module, we primarily considered the CKA similarity [29], the Pearson correlation coefficient of the heatmap for boxes with high probability, and the trade-off between the combined use of MSA [7] and the SD.
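For reference, a minimal implementation of the linear CKA similarity [29] used in this redundancy analysis is sketched below; the assumption is that each layer output is flattened into a matrix with one row per token.

import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    """Linear CKA between two representations X, Y of shape (num_tokens, dim)."""
    x = x - x.mean(dim=0, keepdim=True)          # center features
    y = y - y.mean(dim=0, keepdim=True)
    hsic = (y.T @ x).norm(p="fro") ** 2          # ||Y^T X||_F^2
    norm_x = (x.T @ x).norm(p="fro")
    norm_y = (y.T @ y).norm(p="fro")
    return float(hsic / (norm_x * norm_y))

# e.g., compare the outputs of encoder layers 4 and 5 for one image:
# sim = linear_cka(enc_outputs[3].flatten(0, -2), enc_outputs[4].flatten(0, -2))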
Second, as shown in Table 5, a gradual increase was observed in the model’s mAP50 as more layers of DINO’s encoder were replaced by layers of the SE module. Correspondingly, each layer of the SE module contributes an approximately 2% increase in the model’s parameters and GFLOPs. Consequently, this leads to a 1–2% decrease in inference speed, indicating that optimizing the SE module is one of our future improvement directions. Overall, even after applying a single SE module to DINO [13], the mAP50 increased from 92.1% to 93.4%. This indirectly confirms our hypothesis that the SE can enhance feature extraction capabilities, thereby improving the model’s performance.
Moreover, we can observe that with the use of the SD module, each layer of the SD module brings about a reduction of 0.2 GFLOPs and introduces approximately 0.5 M parameters. Given that the decoder part typically accounts for approximately 10% of the computational load in DINO, a single layer of the SD can bring about a 3.6% reduction in the computational load of the decoder. Hence, when the input image is of a smaller magnitude, such as the 400 × 400 images of NWPU VHR-10, and the demand for bounding boxes is not substantial, a single SD reduces the computational cost by 0.4%. Although this is not a significant improvement, the SD module not only maintains the performance of the original model but can also bring some enhancement. For future RSIs requiring the recognition of more bounding boxes, the SD module offers further optimization. In addition, we found that as the number of layers replaced by the SD module increases, the performance of the model does not always improve; rather, the increase in performance turns into a decrease. We believe this is mainly because the SD module, composed of CNNs, plays a different role in the network compared to MSA [7]. Therefore, when used in combination with MSA [7], it can bring about better performance.

4. Discussion

4.1. Applicability

Both the SE and SD modules can improve the model’s ability, but they each have different focuses. The use of these two modules should be based on actual needs.
First, for wide-range RSIs with complex backgrounds, only using MSA [7] to consider all scale information may increase training difficulty and have negative effects. Thus, the SE module can be combined with modules like MSA [7] that perform multiscale feature fusion extraction. As for when to use the SE module, mathematical indicators such as the CKA similarity [29], which evaluates the hidden representation of the neural network, can be used to assist in the selection. If there are some modules with high similarity outputs, it indicates that these modules have some computational redundancy. At this time, the SE module can shift this unnecessary computation and reduce the difficulty of feature extraction at different scales. Importantly, no perfect mathematical indicator exists to explain the hidden representations of neural networks. Existing indicators mainly provide a general direction. Optimal use of the SE module requires consideration of different experimental results.
On the other hand, detection tasks on RSIs often involve dense objects or need to cover a wide area, where a sufficient number of candidate boxes is typically needed to detect targets or to prevent missed detections. In this case, the length of the input sequence in the decoder stage increases, and MSA [7] brings a quadratic increase in computation, which hinders large-scale data processing. The SD module can provide a better option without reducing the model’s performance. However, in the decoder stage, although the SD module can reduce computational cost, it cannot fully replace the role of MSA. MSA performs sufficient information exchange for each vector in the input sequence, while the SD module mainly conducts information exchange on the overall sequence through the combination of PwC and DwC. The SD module is more of a trade-off between computational cost and performance. When there is computational redundancy in the decoder, the SD module can be used to reduce computational costs while maintaining model performance. However, the decoder also cannot simply use similarity indicators like CKA to detect computational redundancy, because the representation data in the decoder correspond to the actual output candidate boxes. This means that most boxes overlap and are redundant boxes that will be filtered out, which can mislead the similarity indicators. Therefore, the overall CKA similarity and the similarity after filtering out low-probability vectors are used in combination to determine the degree of redundancy. These analyses then guide the application of the SD module.

4.2. Limitations

Although our research has made some important findings, we must acknowledge that it has certain limitations. First, the public datasets used in the experiments consist solely of optical RSIs and do not include multispectral and hyperspectral RSIs. Compared to optical RSIs, hyperspectral and multispectral RSIs provide more channel information reflecting the physical properties of the target. The SE and SD modules can be extended to these two types of data. However, due to hardware limitations and data acquisition, we acknowledge a certain gap in demonstrating the generalizability of the SE and SD modules. We hope to extend the application of the SE and SD modules to other types of remote sensing data in our future work.
Second, the SE and SD show a slight performance decrease of about 1% in detecting large-scale targets, which is one of the directions that needs improvement. This decrease in detection performance for large-scale targets may be primarily due to the lower degree of full-scale information exchange in the SE module compared to deformable attention. The extraction of full-scale information affects the receptive field of the model, thereby further impacting the detection of large-scale targets. We will attempt to improve this limitation of the SE module in our future work.
Third, although the SE module can enhance performance, it has some drawbacks. Compared with the deformable attention module [8], due to the addition of four different scales of DwC [27] and PwC [27], the model’s parameters slightly increase in a single SE module. A single SE module adds approximately one million parameters (compared with the total parameters of DINO [13], which is 46.6 M) and increases the GFLOPs by 2%.

4.3. Other Directions

The SE module’s ability to enhance the model’s multiscale features suggests its potential for future application in fields that demand high recognition accuracy for small-to-medium-sized targets, such as object detection in medical imaging. On the other hand, the SD module can be extended to scenarios that require a reduction in model computational cost, yet need large-scale object recognition, such as object detection in autonomous driving.

5. Conclusions

In this study, we introduced the DINO [13] model and the skip-attention module [22]. Subsequently, we proposed the SE module that can be applied to the encoder stage of the model and the SD module for the decoder stage. The SE module can enhance the model’s ability to extract multiscale features. The SD module can reduce computational complexity and maintain the performance of the model. Experiments on the NWPU VHR-10 and DIOR datasets were conducted using the DETR [6], the Deformable DETR [8], DINO [13], and DINO [13] with the SE and SD modules. Our SE and SD modules can significantly enhance the accuracy of DINO [13] on small- and medium-sized targets, which is a useful improvement for RSOD tasks. Finally, we achieved SOTA performance on NWPU VHR-10 and near-SOTA performance on DIOR.
Despite the significant improvement in performance that the SE and SD modules bring to DINO [13], the modules have certain limitations. First, the SE introduces a slight increase in the model’s parameters and GFLOPs. Creating a more lightweight SE module is one of our future improvement goals. Second, we also intend to extend the application of the SE and SD modules to other remote sensing tasks in the future.

Author Contributions

Conceptualization, F.Y.; methodology, F.Y.; software, F.Y.; validation, F.Y.; formal analysis, F.Y.; investigation, F.Y.; resources, F.Y. and G.C.; data curation, F.Y.; writing—original draft preparation, F.Y. and G.C.; writing—review and editing, F.Y. and J.D.; visualization, F.Y.; supervision, F.Y.; project administration, F.Y., G.C. and J.D.; funding acquisition, G.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (No. 42071172).

Data Availability Statement

The data presented in this study are available at https://github.com/Afakash/Skip-Multiscale-Attention (accessed on 25 July 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
  2. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
  3. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
  4. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788.
  5. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
  6. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229.
  7. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
  8. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
  9. Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3651–3660.
  10. Wang, Y.; Zhang, X.; Yang, T.; Sun, J. Anchor DETR: Query design for transformer-based detector. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 2567–2575.
  11. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic anchor boxes are better queries for DETR. arXiv 2022, arXiv:2201.12329.
  12. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627.
  13. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605.
  14. Liu, Y.; Li, Q.; Yuan, Y.; Du, Q.; Wang, Q. ABNet: Adaptive balanced network for multiscale object detection in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14.
  15. Hu, X.; Zhang, P.; Zhang, Q.; Yuan, F. GLSANet: Global-local self-attention network for remote sensing image semantic segmentation. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5.
  16. Zhang, C.; Lam, K.M.; Wang, Q. CoF-Net: A progressive coarse-to-fine framework for object detection in remote-sensing imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–17.
  17. Dong, X.; Qin, Y.; Fu, R.; Gao, Y.; Liu, S.; Ye, Y. Remote sensing object detection based on gated context-aware module. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5.
  18. Teng, Z.; Duan, Y.; Liu, Y.; Zhang, B.; Fan, J. Global to local: Clip-LSTM-based object detection from remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–13.
  19. Ye, Y.; Ren, X.; Zhu, B.; Tang, T.; Tan, X.; Gui, Y.; Yao, Q. An adaptive attention fusion mechanism convolutional network for object detection in remote sensing images. Remote Sens. 2022, 14, 516.
  20. Wang, J.; Wang, Y.; Wu, Y.; Zhang, K.; Wang, Q. FRPNet: A feature-reflowing pyramid network for object detection of remote sensing images. IEEE Geosci. Remote Sens. Lett. 2020, 19, 1–5.
  21. Ma, W.; Li, N.; Zhu, H.; Jiao, L.; Tang, X.; Guo, Y.; Hou, B. Feature split–merge–enhancement network for remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17.
  22. Venkataramanan, S.; Ghodrati, A.; Asano, Y.M.; Porikli, F.; Habibian, A. Skip-Attention: Improving vision transformers by paying less attention. arXiv 2023, arXiv:2301.02240.
  23. Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132.
  24. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307.
  25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  26. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
  27. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
  28. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542.
  29. Kornblith, S.; Norouzi, M.; Lee, H.; Hinton, G. Similarity of neural network representations revisited. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 3519–3529.
  30. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
  31. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666.
  32. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
  33. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
  34. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
  35. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
  36. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
  37. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636.
Figure 1. Illustration of the different attention modules. (a) Simplified architecture of the deformable attention module. (b) Simplified architecture of the self-attention module.
Figure 1. Illustration of the different attention modules. (a) Simplified architecture of the deformable attention module. (b) Simplified architecture of the self-attention module.
Remotesensing 16 02884 g001
Figure 2. An overview of DINO. The input RSI is processed by the backbone for feature map extraction. These maps are flattened into a one-dimensional sequence feature, with the application of spatial position encoding. These data will be input into the transformer’s encoder and decoder, with the classification results and spatial coordinates obtained through two simple MLP operations. The obtained candidate categories and boxes will be selected through the matching algorithm for the most suitable results to participate in loss calculation. The noise data in the decoder input will complete the denoising learning through the CDN. The FFN denotes the Feedforward Neural Network.
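For orientation, the sketch below mirrors the data flow in this figure at a coarse level: backbone features are flattened into tokens, combined with a positional encoding, passed through a transformer encoder-decoder, and mapped to class logits and normalized boxes by two small heads. It uses a plain nn.Transformer in place of DINO's deformable attention and omits query selection, the CDN denoising branch, and Hungarian matching, so it is a structural illustration under those simplifying assumptions, not the DINO implementation; the class name DetrLikeSketch and all hyperparameters are made up for the example.

```python
import torch
import torch.nn as nn
import torchvision

class DetrLikeSketch(nn.Module):
    def __init__(self, num_classes=10, num_queries=100, d_model=256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # C5 feature map
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)
        self.transformer = nn.Transformer(d_model, nhead=8, num_encoder_layers=6,
                                          num_decoder_layers=6, batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
        self.bbox_head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                       nn.Linear(d_model, 4))  # (cx, cy, w, h) in [0, 1]

    def forward(self, images):
        feat = self.input_proj(self.backbone(images))       # (B, d, H/32, W/32)
        b = feat.shape[0]
        tokens = feat.flatten(2).transpose(1, 2)             # flatten to (B, HW, d)
        pos = torch.zeros_like(tokens)                       # placeholder positional encoding
        queries = self.query_embed.weight.unsqueeze(0).expand(b, -1, -1)
        hs = self.transformer(tokens + pos, queries)         # decoder output (B, Q, d)
        return self.class_head(hs), self.bbox_head(hs).sigmoid()

logits, boxes = DetrLikeSketch()(torch.randn(1, 3, 256, 256))
print(logits.shape, boxes.shape)   # (1, 100, 11) and (1, 100, 4)
```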
Figure 3. Illustration of the skip-attention module.
Figure 4. Illustration of the SE module. If the output of the i-th encoder layer is highly correlated with the outputs of the subsequent (i + 1)-th and (i + 2)-th layers, the i-th layer's output can be reused through the SE modules of those two layers. The SE module allows the model to shift from redundant computation to information extraction across different scales.
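The precise SE design is defined in the Methods section of the paper; purely to illustrate the reuse idea stated in this caption, the hypothetical layer below replaces a layer's attention with a lightweight transform of an attention output saved from an earlier, highly correlated layer. Every name here (SkipEncoderLayerSketch, light_mix, reused_attn_out) is an assumption for illustration, not the authors' module.

```python
import torch
import torch.nn as nn

class SkipEncoderLayerSketch(nn.Module):
    """Hypothetical stand-in for an encoder layer whose attention output is highly
    correlated with an earlier layer's: reuse that saved output through a cheap
    transform instead of recomputing deformable attention."""
    def __init__(self, d_model=256, d_ffn=1024):
        super().__init__()
        self.light_mix = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ffn), nn.ReLU(),
                                 nn.Linear(d_ffn, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, tokens, reused_attn_out):
        # tokens: this layer's input; reused_attn_out: attention output saved
        # from an earlier, highly correlated layer.
        x = self.norm1(tokens + self.light_mix(reused_attn_out))
        return self.norm2(x + self.ffn(x))

layer = SkipEncoderLayerSketch()
tokens = torch.randn(1, 64, 256)
print(layer(tokens, torch.randn(1, 64, 256)).shape)  # torch.Size([1, 64, 256])
```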
Figure 5. Illustration of the SD module.
Figure 6. Illustration of how the SE and SD modules replace components in the overall architecture of DINO. "SE encoder layer" denotes an encoder layer in which the SE module replaces the deformable attention. "SD and deformable attention decoder layer" denotes a decoder layer in which the SD module replaces the MSA, while deformable attention is kept as the cross-attention.
Figure 7. Some representative detection results of DINO-SE on the NWPU VHR-10.v2 dataset. (a) Airplane. (b) Baseball diamond. (c) Bridge. (d) Ground track field. (e) Harbor. (f) Ship. (g) Tennis court and basketball court. (h) Vehicle.
Figure 8. Some representative detection results of DINO-SE on the DIOR dataset. (a) Airplane. (b) Airport. (c) Storage tank and vehicle. (d) Bridge, ground track field, and stadium. (e) Chimney. (f) Golf field and dam. (g) Expressway toll station, overpass, and expressway service area. (h) Ship and harbor. (i) Storage tank. (j) Basketball court and tennis court. (k) Train station. (l) Windmill. Red boxes mark missed predictions.
Figure 9. The influence of the SE and SD modules on the model’s convergence speed.
Figure 10. CKA analysis of the representations of each layer in DINO. Each position’s value represents the degree of similarity between the outputs of the two layers, with a higher value indicating greater computational redundancy.
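For reference, the layer similarity behind this figure can be computed with linear centered kernel alignment (CKA) following [29], applied to layer outputs flattened to a (samples × features) matrix. The function below is a generic sketch of that formula, not the authors' analysis script.

```python
import torch

def linear_cka(x, y):
    """Linear CKA between two representation matrices of shape (n_samples, dim),
    following Kornblith et al. (2019). Higher values mean more similar layers."""
    x = x - x.mean(dim=0, keepdim=True)   # center each feature dimension
    y = y - y.mean(dim=0, keepdim=True)
    cross = torch.linalg.norm(y.T @ x) ** 2     # ||Y^T X||_F^2
    self_x = torch.linalg.norm(x.T @ x)         # ||X^T X||_F
    self_y = torch.linalg.norm(y.T @ y)         # ||Y^T Y||_F
    return (cross / (self_x * self_y)).item()

# Toy check: a layer is maximally similar to itself, and unrelated random
# features give a value close to zero.
feats = torch.randn(2048, 64)
print(linear_cka(feats, feats))                    # ~1.0
print(linear_cka(feats, torch.randn(2048, 64)))    # near zero
```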
Figure 11. A heatmap of the overlapping regions of the candidate boxes corresponding to the feature vectors at each layer. (a) Rows corresponding to the areas of interest without weighting by each box's probability score. (b) Rows corresponding to the areas of interest weighted by each box's probability score.
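A heatmap of this kind can be produced by accumulating each candidate box over the pixels it covers, either with a constant weight (panel a) or weighted by its probability score (panel b). The sketch below assumes boxes given in absolute (x1, y1, x2, y2) pixel coordinates and is an illustrative reconstruction rather than the authors' plotting code.

```python
import numpy as np

def box_heatmap(boxes, scores, height, width, use_scores=True):
    """Accumulate candidate boxes into a per-pixel heatmap.
    boxes: (N, 4) array of [x1, y1, x2, y2] in pixels; scores: (N,) probabilities."""
    heat = np.zeros((height, width), dtype=np.float32)
    for (x1, y1, x2, y2), s in zip(boxes, scores):
        x1, y1 = max(int(x1), 0), max(int(y1), 0)
        x2, y2 = min(int(np.ceil(x2)), width), min(int(np.ceil(y2)), height)
        if x2 > x1 and y2 > y1:
            heat[y1:y2, x1:x2] += s if use_scores else 1.0
    return heat / max(heat.max(), 1e-6)   # normalize for display

boxes = np.array([[10, 10, 60, 60], [30, 30, 90, 90]], dtype=np.float32)
scores = np.array([0.9, 0.4], dtype=np.float32)
print(box_heatmap(boxes, scores, 128, 128).max())  # 1.0 after normalization
```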
Table 1. A comparison with state-of-the-art methods on NWPU VHR-10. * denotes our implementation, and † indicates that the whole model uses pre-trained weights. The best results are marked in bold.
Model | Backbone | Airplane | Ship | Storage Tank | Baseball Diamond | Tennis Court | Basketball Court | Ground Track Field | Harbor | Bridge | Vehicle | mAP50
Fast R-CNN [2] | ResNet-50 | 90.9 | 90.6 | 89.3 | 47.3 | 100 | 85.9 | 84.9 | 88.2 | 80.3 | 69.8 | 82.7
Faster R-CNN [3] | ResNet-50 | 90.9 | 86.3 | 90.5 | 98.2 | 89.7 | 69.6 | 100 | 80.1 | 61.5 | 78.1 | 84.5
YOLOv3 [35] | DarkNet53 | 99.6 | 81.8 | 80.3 | 98.3 | 80.6 | 81.8 | 99.5 | 74.3 | 89.6 | 87.0 | 87.3
FPN [36] | ResNet-50 | 100 | 90.9 | 100 | 96.8 | 90.7 | 95.1 | 100 | 93.7 | 50.9 | 90.2 | 90.8
FCOS [37] | ResNet-101 | 100 | 85.2 | 96.9 | 97.8 | 95.8 | 80.3 | 99.7 | 95.0 | 81.8 | 88.9 | 92.1
GLNet [18] | ResNet-101 | 100 | 84.4 | 98.5 | 81.6 | 88.2 | 100 | 97.2 | 88.4 | 90.9 | 88.7 | 91.8
ANnet [14] | ResNet-50 | 100 | 92.6 | 97.8 | 97.8 | 99.3 | 96.0 | 99.9 | 94.3 | 69.0 | 95.6 | 94.2
CoF-Net [16] | ResNet-50 | 100 | 90.9 | 96.1 | 98.8 | 91.1 | 95.8 | 100 | 91.4 | 89.7 | 90.8 | 94.5
GLSANet [15] | ResNet-50 | 99.9 | 95.8 | 97.1 | 99.4 | 98.8 | 86.1 | 99.5 | 97.5 | 84.8 | 86.8 | 94.5
DETR *,† [6] | ResNet-50 | 100 | 88.6 | 98.6 | 96.5 | 94.3 | 93.9 | 100 | 91.8 | 70.0 | 83.7 | 91.7
Deformable DETR * [8] | ResNet-50 | 97.5 | 88.9 | 91.1 | 93.3 | 89.4 | 87.6 | 94.4 | 83.7 | 73.7 | 79.1 | 87.9
DINO * [13] | ResNet-50 | 100 | 89.0 | 96.9 | 96.1 | 95.8 | 88.9 | 100 | 91.6 | 74.5 | 88.5 | 92.1
DINO-SE * | ResNet-50 | 100 | 94.1 | 97.4 | 95.1 | 95.0 | 95.6 | 100 | 93.7 | 88.0 | 89.4 | 94.8
DINO-SD * | ResNet-50 | 100 | 93.7 | 96.4 | 95.3 | 95.2 | 91.5 | 100 | 91.5 | 83.7 | 88.6 | 93.6
DINO-SE-SD * | ResNet-50 | 99.7 | 93.0 | 96.4 | 95.1 | 96.6 | 96.5 | 100 | 93.6 | 69.9 | 87.2 | 92.8
Table 2. A comparison with state-of-the-art methods on DIOR. * denotes our implementation, and † indicates that the whole model uses pre-trained weights. The best results are marked in bold.
Model | Backbone | AL | AT | BF | BC | BG | CM | DM | EA | ES | GC | GF | HB | OP | SP | SD | ST | TC | TS | VH | WM | mAP50
Fast R-CNN [2] | ResNet-50 | 44.2 | 66.8 | 67.0 | 60.5 | 15.6 | 72.3 | 52.0 | 65.9 | 44.8 | 72.1 | 62.9 | 46.2 | 38.0 | 32.1 | 71.0 | 35.0 | 58.3 | 37.9 | 19.2 | 38.1 | 50.0
Faster R-CNN [3] | ResNet-50 | 50.3 | 62.6 | 66.0 | 80.9 | 28.8 | 68.2 | 47.3 | 58.5 | 48.1 | 60.4 | 67.0 | 43.9 | 46.9 | 58.5 | 52.4 | 42.4 | 79.5 | 48.0 | 34.8 | 65.4 | 55.5
YOLOv3 [35] | DarkNet53 | 72.2 | 29.2 | 74.0 | 78.6 | 31.2 | 69.7 | 26.9 | 48.6 | 54.4 | 31.1 | 61.1 | 44.9 | 49.7 | 87.4 | 70.6 | 68.7 | 87.3 | 29.4 | 48.3 | 78.7 | 57.1
FPN [36] | ResNet-50 | 54.0 | 74.5 | 63.3 | 80.7 | 44.8 | 72.5 | 60.0 | 75.6 | 62.3 | 76.0 | 76.8 | 46.4 | 57.2 | 71.8 | 68.3 | 53.8 | 81.1 | 59.5 | 43.1 | 81.2 | 65.1
FCOS [37] | ResNet-101 | 61.1 | 82.6 | 76.6 | 87.6 | 42.8 | 80.6 | 64.1 | 79.1 | 67.2 | 82.0 | 79.6 | 46.4 | 57.8 | 72.1 | 64.8 | 63.4 | 85.2 | 62.8 | 43.8 | 87.5 | 69.4
GLNet [18] | ResNet-101 | 62.9 | 83.2 | 72.0 | 81.1 | 50.5 | 79.3 | 67.4 | 86.2 | 70.9 | 81.8 | 83.0 | 51.8 | 62.6 | 72.0 | 75.3 | 53.7 | 81.3 | 65.5 | 43.4 | 89.2 | 70.7
ANnet [14] | ResNet-50 | 66.8 | 84.0 | 74.9 | 87.7 | 50.3 | 78.2 | 67.8 | 85.9 | 74.2 | 79.7 | 81.2 | 55.4 | 61.6 | 75.1 | 74.0 | 66.7 | 87.0 | 62.2 | 53.6 | 89.1 | 72.8
CoF-Net [16] | ResNet-50 | 84.0 | 85.3 | 82.6 | 90.0 | 47.1 | 80.7 | 73.3 | 89.3 | 74.0 | 84.5 | 83.2 | 57.4 | 62.2 | 82.9 | 77.6 | 68.2 | 89.9 | 68.7 | 49.3 | 85.2 | 75.8
GLSANet [15] | ResNet-50 | 95.8 | 78.9 | 92.9 | 87.9 | 50.7 | 81.1 | 55.5 | 79.8 | 74.1 | 71.6 | 87.6 | 66.4 | 65.5 | 95.2 | 92.4 | 86.3 | 94.8 | 50.6 | 62.1 | 89.2 | 77.9
DETR *,† [6] | ResNet-50 | 63.8 | 78.6 | 71.6 | 85.1 | 21.7 | 76.3 | 41.7 | 68.3 | 45.4 | 74.0 | 74.2 | 24.8 | 46.1 | 33.8 | 36.9 | 40.0 | 81.6 | 47.5 | 38.3 | 78.8 | 56.4
Deformable DETR * [8] | ResNet-50 | 54.2 | 81.5 | 72.1 | 84.4 | 41.0 | 75.3 | 58.8 | 72.5 | 65.6 | 73.4 | 70.3 | 25.6 | 54.5 | 56.2 | 60.5 | 43.7 | 82.0 | 60.8 | 39.6 | 81.5 | 62.7
DINO * [13] | ResNet-50 | 71.1 | 88.8 | 80.8 | 86.7 | 49.3 | 80.1 | 72.6 | 88.9 | 77.0 | 79.6 | 82.3 | 57.3 | 61.7 | 76.5 | 72.1 | 72.2 | 87.1 | 66.6 | 52.7 | 88.8 | 74.6
DINO-SE * | ResNet-50 | 76.3 | 87.8 | 81.0 | 86.7 | 50.9 | 81.2 | 71.9 | 88.0 | 78.0 | 80.7 | 82.7 | 56.1 | 64.1 | 77.1 | 75.2 | 73.0 | 86.9 | 70.9 | 53.6 | 89.4 | 75.6
DINO-SD * | ResNet-50 | 70.1 | 89.7 | 79.2 | 86.9 | 51.8 | 82.3 | 69.6 | 88.0 | 78.1 | 80.7 | 81.7 | 56.0 | 64.1 | 75.7 | 71.9 | 71.8 | 87.4 | 65.8 | 53.6 | 88.9 | 74.7
DINO-SE-SD * | ResNet-50 | 71.9 | 89.0 | 78.7 | 87.7 | 51.9 | 81.5 | 68.9 | 89.2 | 78.9 | 79.9 | 83.5 | 53.4 | 64.6 | 76.4 | 76.9 | 73.6 | 87.9 | 67.5 | 54.6 | 90.4 | 75.3
AL: Airplane. AT: Airport. BF: Baseball Field. BC: Basketball Court. BG: Bridge. CM: Chimney. DM: Dam. EA: Expressway Service Area. ES: Expressway Toll Station. GC: Golf Course. GF: Ground Track Field. HB: Harbor. OP: Overpass. SP: Ship. SD: Stadium. ST: Storage Tank. TC: Tennis Court. TS: Train Station. VH: Vehicle. WM: Windmill.
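The per-class AP50 scores and mAP50 above follow the usual greedy-matching protocol: detections are sorted by confidence, matched to unmatched ground truths at IoU ≥ 0.5, and AP is the area under the monotonic precision-recall envelope. The sketch below covers a single class on a single image and is a generic illustration, not the authors' evaluation code; in practice, matching is done per image and the precision-recall curve is pooled over the whole test set.

```python
import numpy as np

def iou_xyxy(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    ix1, iy1 = np.maximum(box[0], boxes[:, 0]), np.maximum(box[1], boxes[:, 1])
    ix2, iy2 = np.minimum(box[2], boxes[:, 2]), np.minimum(box[3], boxes[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    union = ((box[2] - box[0]) * (box[3] - box[1])
             + (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) - inter)
    return inter / np.maximum(union, 1e-9)

def ap50_one_class_one_image(pred_boxes, scores, gt_boxes, iou_thr=0.5):
    """AP at IoU 0.5 for a single class on a single image (greedy matching)."""
    order = np.argsort(-scores)
    matched = np.zeros(len(gt_boxes), dtype=bool)
    tp, fp = np.zeros(len(order)), np.zeros(len(order))
    for rank, i in enumerate(order):
        ious = iou_xyxy(pred_boxes[i], gt_boxes) if len(gt_boxes) else np.zeros(0)
        j = int(ious.argmax()) if ious.size else -1
        if j >= 0 and ious[j] >= iou_thr and not matched[j]:
            tp[rank], matched[j] = 1.0, True
        else:
            fp[rank] = 1.0
    recall = np.cumsum(tp) / max(len(gt_boxes), 1)
    precision = np.cumsum(tp) / np.maximum(np.cumsum(tp) + np.cumsum(fp), 1e-9)
    # Area under the monotonic precision envelope (VOC-style all-point AP).
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    steps = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[steps + 1] - mrec[steps]) * mpre[steps + 1]))

gt = np.array([[0, 0, 10, 10], [20, 20, 30, 30]], dtype=float)
preds = np.array([[1, 1, 10, 10], [50, 50, 60, 60]], dtype=float)
print(ap50_one_class_one_image(preds, np.array([0.9, 0.8]), gt))  # 0.5
```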
Table 3. Model analysis on NWPU VHR-10. † indicates that the whole model uses pre-trained weights. The best results are marked in bold.
Model | Backbone | Epochs | AP_s | AP_m | AP_l | mAP50 | mAP75 | mAP@50:5:95 | Params | GFLOPs | FPS
DETR [6] | ResNet-50 | 50 | 29.7 | 52.7 | 67.5 | 91.7 | 65.7 | 58.4 | 41.3M | 16.0 | 40.0
Deformable DETR [8] | ResNet-50 | 50 | 27.2 | 51.3 | 61.1 | 87.9 | 64.5 | 55.1 | 39.8M | 34.1 | 20.3
DINO [13] | ResNet-50 | 50 | 30.5 | 57.4 | 65.4 | 92.1 | 71.6 | 62.1 | 46.6M | 55.1 | 15.7
DINO-SE | ResNet-50 | 50 | 36.5 | 60.3 | 64.0 | 94.8 | 70.8 | 62.7 | 49.9M | 58.1 | 15.2
DINO-SD | ResNet-50 | 50 | 36.7 | 57.7 | 64.2 | 93.6 | 71.6 | 62.5 | 47.7M | 54.7 | 16.0
DINO-SE-SD | ResNet-50 | 50 | 32.9 | 58.4 | 65.1 | 92.8 | 72.6 | 62.9 | 51.0M | 57.7 | 15.6
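The Params and FPS columns in Tables 3 and 4 can be approximated with generic routines such as the ones below (parameter count from the model's trainable tensors, FPS from timed forward passes). The input size, warm-up settings, and the ResNet-50 stand-in are assumptions for the example, the GFLOPs counting tool used by the authors is not specified here, and this is not their benchmarking setup.

```python
import time
import torch
import torchvision

def count_params_m(model):
    """Total trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def measure_fps(model, input_size=(1, 3, 800, 800), warmup=5, iters=20, device="cpu"):
    """Average images per second over repeated single-image forward passes."""
    model = model.eval().to(device)
    x = torch.randn(*input_size, device=device)
    for _ in range(warmup):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)

# Example with a torchvision backbone as a stand-in for a full detector:
net = torchvision.models.resnet50(weights=None)
print(f"{count_params_m(net):.1f}M params, {measure_fps(net, (1, 3, 224, 224)):.1f} FPS")
```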
Table 4. Model analysis on DIOR. † indicates that the whole model uses pre-trained weights. The best results are marked in bold.
Model | Backbone | Epochs | AP_s | AP_m | AP_l | mAP50 | mAP75 | mAP@50:5:95 | Params | GFLOPs | FPS
DETR [6] | ResNet-50 | 18 | 3.8 | 27.1 | 61.7 | 56.4 | 40.4 | 38.0 | 41.3M | 60.2 | 18.8
Deformable DETR [8] | ResNet-50 | 18 | 6.7 | 30.1 | 60.0 | 62.7 | 41.6 | 39.4 | 39.8M | 111.3 | 9.2
DINO [13] | ResNet-50 | 18 | 15.6 | 40.3 | 72.3 | 74.6 | 55.7 | 52.1 | 46.6M | 179.5 | 6.6
DINO-SE | ResNet-50 | 18 | 15.6 | 40.9 | 71.7 | 75.6 | 56.3 | 52.3 | 49.9M | 191.4 | 6.1
DINO-SD | ResNet-50 | 18 | 15.7 | 41.0 | 72.3 | 74.7 | 56.1 | 52.3 | 47.7M | 179.1 | 6.6
DINO-SE-SD | ResNet-50 | 18 | 16.6 | 40.7 | 71.7 | 75.3 | 56.2 | 52.3 | 51.0M | 191.0 | 6.2
Table 5. Ablation study of the SE and SD modules on the NWPU VHR-10 dataset. The best results are marked in bold.
Model | mAP50 | mAP75 | mAP@50:5:95 | Params | GFLOPs | FPS
Baseline (DINO) | 92.1 | 71.6 | 62.1 | 46.6M | 55.1 | 15.7
+Skip Attention | 93.0 (+0.9) | 72.1 | 62.3 | 46.7M | 55.5 | 16.4
+SE-4 | 93.4 (+1.3) | 72.2 | 62.5 | 47.7M | 56.1 | 15.5
+SE-4, 5 | 93.9 (+1.8) | 71.4 | 62.4 | 48.8M | 57.0 | 15.3
+SE-4, 5, 6 | 94.8 (+2.7) | 70.8 | 62.7 | 49.9M | 58.1 | 15.2
+SE-2, 3, 4 | 93.0 (+0.9) | 71.7 | 62.4 | 49.9M | 58.1 | 15.2
+SD-2 | 92.4 (+0.3) | 71.8 | 61.4 | 47.1M | 54.9 | 15.7
+SD-2, 3 | 93.6 (+1.5) | 71.6 | 62.5 | 47.7M | 54.7 | 16.0
+SD-2, 3, 4 | 93.5 (+1.4) | 72.4 | 62.4 | 48.2M | 54.5 | 16.1
SE-i, j, k or SD-i, j, k denotes that the corresponding module is used in the i-th, j-th, and k-th layers of the encoder or decoder, respectively.
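The SE-i, j, k / SD-i, j, k notation amounts to a per-layer switch over the six encoder and six decoder layers. Judging by the matching metric values in Tables 3 and 5, DINO-SE corresponds to SE-4, 5, 6 and DINO-SD to SD-2, 3; the dictionary layout and names below are purely illustrative, not the authors' configuration format.

```python
# Illustrative configuration mirroring the Table 5 notation:
# True = the layer uses the skip module, False = the original attention is kept.
NUM_ENCODER_LAYERS = NUM_DECODER_LAYERS = 6

def layers_to_mask(layer_ids, num_layers):
    """Turn a 1-based list such as [4, 5, 6] into a per-layer boolean mask."""
    wanted = set(layer_ids)
    return [i + 1 in wanted for i in range(num_layers)]

dino_se_config = {"se_layers": layers_to_mask([4, 5, 6], NUM_ENCODER_LAYERS),  # DINO-SE
                  "sd_layers": layers_to_mask([], NUM_DECODER_LAYERS)}
dino_sd_config = {"se_layers": layers_to_mask([], NUM_ENCODER_LAYERS),         # DINO-SD
                  "sd_layers": layers_to_mask([2, 3], NUM_DECODER_LAYERS)}
print(dino_se_config["se_layers"])  # [False, False, False, True, True, True]
```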
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
