Skip-Encoder and Skip-Decoder for Detection Transformer in Optical Remote Sensing
Abstract
1. Introduction
- To improve the detection of small- and medium-sized targets, we proposed the SE module, which is mainly used in the encoder stage and strengthens the model’s ability to extract multiscale features.
- We proposed the SD module, which is primarily used in the decoder stage and can reduce the computational cost of the model without affecting its performance.
2. Materials and Methods
2.1. Related Work
2.1.1. DETR-Based Model
2.1.2. Skip-Attention
2.2. Methods
2.3. Skip-Encoder Module
Algorithm 1 Implementation of the SE module
Input: previous layer’s output X ∈ ℝ^(B×N×C), where N = Σ_l H_l·W_l; Output: X_out
1: X ← Linear(X)
2: X ← GELU(X)
3: {x_l} ← Reshape(X) // reshape X into {x_l}, matching the shapes of the multiscale feature maps input to the encoder; each x_l then passes through its own DwC and PwC
4: for x_l in {x_l} do
5:   t_l ← DwC(x_l)
6:   x_l ← PwC(t_l)
7: end for
8: X ← Concat({x_l}) // restore the shape of X from {x_l} back to B×N×C
9: X ← Linear(X)
10: X_out ← ECA(X) // exchange information across the different scales using ECA
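To make the listing concrete, a minimal PyTorch sketch of the SE pipeline is given below (Linear → GELU → per-scale depthwise + pointwise convolution → concatenation → Linear → ECA [28]). The class and argument names, the 256-d token width, and the 3 × 3 depthwise kernel are illustrative assumptions; the residual connections and normalization used in the actual model are omitted.

```python
import torch
import torch.nn as nn


class ECA(nn.Module):
    """Efficient Channel Attention [28]: a 1-D convolution over pooled channel descriptors."""
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                       # x: (B, N, C) flattened multiscale tokens
        w = x.mean(dim=1, keepdim=True)         # (B, 1, C) global context per channel
        w = torch.sigmoid(self.conv(w))         # (B, 1, C) channel weights
        return x * w


class SkipEncoderBlock(nn.Module):
    """Sketch of the SE module following Algorithm 1."""
    def __init__(self, dim: int = 256, num_levels: int = 4):
        super().__init__()
        self.fc_in = nn.Linear(dim, dim)
        self.act = nn.GELU()
        self.dwc = nn.ModuleList([nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
                                  for _ in range(num_levels)])
        self.pwc = nn.ModuleList([nn.Conv2d(dim, dim, 1) for _ in range(num_levels)])
        self.fc_out = nn.Linear(dim, dim)
        self.eca = ECA()

    def forward(self, x, spatial_shapes):       # x: (B, sum(H_l*W_l), C); spatial_shapes: [(H_l, W_l)]
        b, _, c = x.shape
        x = self.act(self.fc_in(x))
        # Reshape the token sequence back into the multiscale feature maps fed to the encoder.
        splits = x.split([h * w for h, w in spatial_shapes], dim=1)
        outs = []
        for (h, w), feat, dwc, pwc in zip(spatial_shapes, splits, self.dwc, self.pwc):
            feat = feat.transpose(1, 2).reshape(b, c, h, w)
            feat = pwc(dwc(feat))                # depthwise then pointwise convolution per scale
            outs.append(feat.flatten(2).transpose(1, 2))
        x = torch.cat(outs, dim=1)               # restore (B, sum(H_l*W_l), C)
        x = self.fc_out(x)
        return self.eca(x)                       # exchange information across scales/channels


if __name__ == "__main__":
    se = SkipEncoderBlock(dim=256, num_levels=4)
    shapes = [(100, 100), (50, 50), (25, 25), (13, 13)]
    tokens = torch.randn(2, sum(h * w for h, w in shapes), 256)
    print(se(tokens, shapes).shape)              # torch.Size([2, 13294, 256])
```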
2.4. Skip-Decoder Module
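As an illustration of the skip-attention idea [22] that the SD module builds on, the sketch below replaces a decoder layer’s multi-head self-attention with a lightweight refinement of the previous layer’s self-attention output, which removes the quadratic attention cost in that layer. The class and argument names and the MLP shape are assumptions for illustration; the paper’s exact SD design may differ.

```python
import torch
import torch.nn as nn


class SkipDecoderBlock(nn.Module):
    """Skip-attention-style stand-in for decoder self-attention (illustrative sketch).
    Rather than recomputing multi-head self-attention over the object queries, the
    previous decoder layer's self-attention output is reused and refined by a small MLP."""
    def __init__(self, dim: int = 256, hidden: int = 512):
        super().__init__()
        self.refine = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, prev_attn: torch.Tensor) -> torch.Tensor:
        # queries:   (B, Nq, C) object queries entering this decoder layer
        # prev_attn: (B, Nq, C) cached self-attention output from the previous decoder layer
        return self.norm(queries + self.refine(prev_attn))
```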
2.5. Network Details
- Backbone: We used a ResNet-50 [25] pre-trained on ImageNet-1K [30] as the backbone of our model, returning its last three feature maps. A fourth feature map is obtained by convolving the final feature map output by the backbone, forming four feature maps of different scales as the input to the transformer [7] stage (a sketch of this construction follows this list).
- Transformer Encoder: The encoder consists of six layers. In our experiments, the proposed SE module mainly replaced the deformable attention module [8] in the 4th, 5th, and 6th encoder layers of DINO [13], which resulted in optimal performance. The other encoder layers are consistent with those of DINO [13], adopting the deformable attention module [8] instead of the MSA module [7].
- Transformer Decoder: The decoder consists of six layers. The proposed SD module mainly replaced the MSA module in the 3rd and 4th decoder layers of DINO [13]. In line with the approach adopted by DINO [13], the multi-head cross-attention module [7] in the decoder layer was replaced with the deformable attention module [8].
- Loss Function: We did not introduce any new loss function. The predictions best matching the ground truth were selected from the decoder’s output sequence through bipartite matching and used to compute the three main losses: focal loss [5] for classification, and L1 loss and generalized intersection over union loss [31] for bounding-box regression.
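The following is a minimal PyTorch sketch of the four-scale input construction described above: the last three ResNet-50 stages are projected to the transformer width, and a fourth, coarser map is convolved from the final backbone output. The stride-2 3 × 3 convolution and the 256-d width follow the usual Deformable-DETR/DINO convention and are assumptions here, not quoted from the paper.

```python
import torch
import torch.nn as nn
import torchvision


class MultiScaleBackbone(nn.Module):
    """Sketch: four multiscale feature maps from a pre-trained ResNet-50."""
    def __init__(self, dim: int = 256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1)
        self.stages = nn.ModuleList([backbone.layer2, backbone.layer3, backbone.layer4])
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in (512, 1024, 2048)])
        self.extra = nn.Conv2d(2048, dim, 3, stride=2, padding=1)   # fourth, coarsest scale

    def forward(self, images: torch.Tensor):
        feats, x = [], self.stem(images)
        for stage, proj in zip(self.stages, self.proj):
            x = stage(x)
            feats.append(proj(x))        # last three backbone feature maps, projected to dim
        feats.append(self.extra(x))      # extra map convolved from the final backbone output
        return feats                     # four feature maps at strides 8, 16, 32, 64
```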
3. Results
3.1. Datasets
- NWPU VHR-10: This is an aerial image dataset for bounding-box object detection covering ten categories. The second version contains 1172 images (400 × 400 pixels) cropped from 650 aerial images whose sizes range from 533 × 597 to 1728 × 1028 pixels. We followed the prevalent data split: 75% of the dataset (879 images) was used for training and the remaining 25% (293 images) for testing.
- DIOR: This dataset is the most representative object detection dataset for RSIs. It contains 23,463 images (800 × 800 pixels) covering 20 categories. Following the mainstream setup, 11,725 images (50% of the dataset) were used for training and the remaining 11,738 images for testing.
3.2. Evaluation Metrics
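The reported metrics (mAP50, mAP75, mAP, AP_S, AP_M, AP_L) match the standard COCO protocol [32]; the summary below restates those standard definitions for reference and is not quoted from the paper. Under that convention,

$$\mathrm{AP}_k(t)=\int_0^1 p_k(r)\,\mathrm{d}r,\qquad \mathrm{mAP}=\frac{1}{K}\,\frac{1}{|T|}\sum_{k=1}^{K}\sum_{t\in T}\mathrm{AP}_k(t),\qquad T=\{0.50,0.55,\ldots,0.95\},$$

where p_k(r) is the precision–recall curve of category k at IoU threshold t. mAP50 and mAP75 fix the IoU threshold at 0.5 and 0.75, respectively, and AP_S, AP_M, and AP_L restrict the evaluation to small, medium, and large objects (areas below 32², between 32² and 96², and above 96² pixels).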
3.3. Implementation Details
3.4. Performance Comparison
- Results on NWPU VHR-10: As shown in Table 1, equipping DINO [13] with the SE module improved most categories of NWPU VHR-10 significantly, raising the overall mAP50 from 92.1% to 94.8% and attaining SOTA performance. Compared with the mainstream RSOD models, DINO [13] with the SE module demonstrated superior overall performance. The SD module also brought a minor improvement to DINO [13], which aligns with our motivation for designing the SD: to reduce computational complexity without compromising model performance. A counterintuitive observation is that combining the SE and SD modules did not match the performance of using either module individually; this is explained in the subsequent model analysis section. Overall, the results on NWPU VHR-10 show that the SE module comprehensively enhances DINO’s feature extraction capability and thereby improves detection performance, and that the SD module does not compromise the model’s performance.
- Results on DIOR: As shown in Table 2, our SE module raised the overall mAP50 of DINO [13] from 74.6% to 75.6%, achieving performance close to the SOTA on the DIOR dataset. With the SE module, DINO [13] improved on common small- and medium-sized object categories in RSIs, such as airplanes, bridges, ships, vehicles, and storage tanks; for instance, in the airplane class the SE module increased the mAP50 from 71.1% to 76.3%, an improvement of 5.2 points. The SD module, for its part, did not compromise the model’s performance on the large-scale DIOR dataset, reflecting its generalizability. In summary, the results on DIOR show the broad performance improvements the SE module provides to DINO [13] and illustrate its role in mining useful information, while also confirming that the SD module maintains the model’s performance while reducing computational complexity, regardless of the dataset size.
3.5. Model Analysis
- Model analysis on NWPU VHR-10: As shown in Table 3, under the same training conditions, either the SE or the SD module enhances performance on small- and medium-sized targets, which is valuable for RSOD. Notably, the SE module increased the AP_S from 30.5% to 36.5%, and the SD module improved it from 30.5% to 36.7%. However, when the SE and SD modules were used in combination, the AP_S only increased from 30.5% to 32.9%. We attribute this to the fact that NWPU VHR-10 is a small dataset: both the SD and SE modules strengthen small-object detection, and using them together leads to overfitting, which in turn noticeably reduces performance relative to using either module alone. In Figure 7, we provide representative detection results for each category in the NWPU VHR-10 dataset.
- Model analysis on DIOR: As shown in Table 4, the SE module slightly increased DINO’s mAP50 from 74.6% to 75.6%, and the SD module maintained DINO’s performance. On DIOR, using the SE or SD module individually brings a slight improvement for medium-sized targets but does not significantly enhance the AP_S for small targets. This is primarily because DIOR is a large dataset with 20 categories and presents high training difficulty, so the model’s performance still faces a bottleneck. When the SD and SE modules are used in combination, they increase the AP_S from 15.6% to 16.6%. This not only demonstrates that both modules can enhance the model’s ability to detect small targets but also supports the overfitting explanation given above for NWPU VHR-10. Similarly, in Figure 8, we present representative detection results for each category in the DIOR dataset.
3.6. Ablation Study
4. Discussion
4.1. Applicability
4.2. Limitations
4.3. Other Directions
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
- Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
- Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional detr for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3651–3660. [Google Scholar]
- Wang, Y.; Zhang, X.; Yang, T.; Sun, J. Anchor detr: Query design for transformer-based detector. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 2567–2575. [Google Scholar]
- Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv 2022, arXiv:2201.12329. [Google Scholar]
- Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627. [Google Scholar]
- Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
- Liu, Y.; Li, Q.; Yuan, Y.; Du, Q.; Wang, Q. ABNet: Adaptive balanced network for multiscale object detection in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14. [Google Scholar] [CrossRef]
- Hu, X.; Zhang, P.; Zhang, Q.; Yuan, F. GLSANet: Global-Local Self-Attention Network for Remote Sensing Image Semantic Segmentation. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
- Zhang, C.; Lam, K.M.; Wang, Q. Cof-net: A progressive coarse-to-fine framework for object detection in remote-sensing imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–17. [Google Scholar] [CrossRef]
- Dong, X.; Qin, Y.; Fu, R.; Gao, Y.; Liu, S.; Ye, Y. Remote sensing object detection based on gated context-aware module. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
- Teng, Z.; Duan, Y.; Liu, Y.; Zhang, B.; Fan, J. Global to local: Clip-LSTM-based object detection from remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–13. [Google Scholar] [CrossRef]
- Ye, Y.; Ren, X.; Zhu, B.; Tang, T.; Tan, X.; Gui, Y.; Yao, Q. An adaptive attention fusion mechanism convolutional network for object detection in remote sensing images. Remote Sens. 2022, 14, 516. [Google Scholar] [CrossRef]
- Wang, J.; Wang, Y.; Wu, Y.; Zhang, K.; Wang, Q. FRPNet: A feature-reflowing pyramid network for object detection of remote sensing images. IEEE Geosci. Remote Sens. Lett. 2020, 19, 1–5. [Google Scholar] [CrossRef]
- Ma, W.; Li, N.; Zhu, H.; Jiao, L.; Tang, X.; Guo, Y.; Hou, B. Feature split–merge–enhancement network for remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
- Venkataramanan, S.; Ghodrati, A.; Asano, Y.M.; Porikli, F.; Habibian, A. Skip-Attention: Improving Vision Transformers by Paying Less Attention. arXiv 2023, arXiv:2301.02240. [Google Scholar]
- Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132. [Google Scholar] [CrossRef]
- Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
- Kornblith, S.; Norouzi, M.; Lee, H.; Hinton, G. Similarity of neural network representations revisited. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 3519–3529. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
- Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636. [Google Scholar]
Table 1. Comparison with mainstream methods on the NWPU VHR-10 dataset (per-category AP and overall mAP50, %).

Model | Backbone | Airplane | Ship | Storage Tank | Baseball Diamond | Tennis Court | Basketball Court | Ground Track Field | Harbor | Bridge | Vehicle | mAP50 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Fast R-CNN [2] | ResNet-50 | 90.9 | 90.6 | 89.3 | 47.3 | 100 | 85.9 | 84.9 | 88.2 | 80.3 | 69.8 | 82.7 |
Faster R-CNN [3] | ResNet-50 | 90.9 | 86.3 | 90.5 | 98.2 | 89.7 | 69.6 | 100 | 80.1 | 61.5 | 78.1 | 84.5 |
YOLOv3 [35] | DarkNet53 | 99.6 | 81.8 | 80.3 | 98.3 | 80.6 | 81.8 | 99.5 | 74.3 | 89.6 | 87.0 | 87.3 |
FPN [36] | ResNet-50 | 100 | 90.9 | 100 | 96.8 | 90.7 | 95.1 | 100 | 93.7 | 50.9 | 90.2 | 90.8 |
FCOS [37] | ResNet-101 | 100 | 85.2 | 96.9 | 97.8 | 95.8 | 80.3 | 99.7 | 95.0 | 81.8 | 88.9 | 92.1 |
GLNet [18] | ResNet-101 | 100 | 84.4 | 98.5 | 81.6 | 88.2 | 100 | 97.2 | 88.4 | 90.9 | 88.7 | 91.8 |
ABNet [14] | ResNet-50 | 100 | 92.6 | 97.8 | 97.8 | 99.3 | 96.0 | 99.9 | 94.3 | 69.0 | 95.6 | 94.2 |
CoF-Net [16] | ResNet-50 | 100 | 90.9 | 96.1 | 98.8 | 91.1 | 95.8 | 100 | 91.4 | 89.7 | 90.8 | 94.5 |
GLSANet [15] | ResNet-50 | 99.9 | 95.8 | 97.1 | 99.4 | 98.8 | 86.1 | 99.5 | 97.5 | 84.8 | 86.8 | 94.5 |
DETR *,† [6] | ResNet-50 | 100 | 88.6 | 98.6 | 96.5 | 94.3 | 93.9 | 100 | 91.8 | 70.0 | 83.7 | 91.7 |
Deformable DETR * [8] | ResNet-50 | 97.5 | 88.9 | 91.1 | 93.3 | 89.4 | 87.6 | 94.4 | 83.7 | 73.7 | 79.1 | 87.9 |
DINO * [13] | ResNet-50 | 100 | 89.0 | 96.9 | 96.1 | 95.8 | 88.9 | 100 | 91.6 | 74.5 | 88.5 | 92.1 |
DINO-SE * | ResNet-50 | 100 | 94.1 | 97.4 | 95.1 | 95.0 | 95.6 | 100 | 93.7 | 88.0 | 89.4 | 94.8 |
DINO-SD * | ResNet-50 | 100 | 93.7 | 96.4 | 95.3 | 95.2 | 91.5 | 100 | 91.5 | 83.7 | 88.6 | 93.6 |
DINO-SE-SD * | ResNet-50 | 99.7 | 93.0 | 96.4 | 95.1 | 96.6 | 96.5 | 100 | 93.6 | 69.9 | 87.2 | 92.8 |
Table 2. Comparison with mainstream methods on the DIOR dataset (per-category AP for the 20 classes, abbreviated, and overall mAP50, %).

Model | Backbone | AL | AY | BF | BC | BG | CM | DM | EA | ES | GC | GF | HB | OP | SP | SD | ST | TC | TS | VH | WM | mAP50 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Fast R-CNN [2] | ResNet-50 | 44.2 | 66.8 | 67.0 | 60.5 | 15.6 | 72.3 | 52.0 | 65.9 | 44.8 | 72.1 | 62.9 | 46.2 | 38.0 | 32.1 | 71.0 | 35.0 | 58.3 | 37.9 | 19.2 | 38.1 | 50.0 |
Faster R-CNN [3] | ResNet-50 | 50.3 | 62.6 | 66.0 | 80.9 | 28.8 | 68.2 | 47.3 | 58.5 | 48.1 | 60.4 | 67.0 | 43.9 | 46.9 | 58.5 | 52.4 | 42.4 | 79.5 | 48.0 | 34.8 | 65.4 | 55.5 |
YOLOv3 [35] | DarkNet53 | 72.2 | 29.2 | 74.0 | 78.6 | 31.2 | 69.7 | 26.9 | 48.6 | 54.4 | 31.1 | 61.1 | 44.9 | 49.7 | 87.4 | 70.6 | 68.7 | 87.3 | 29.4 | 48.3 | 78.7 | 57.1 |
FPN [36] | ResNet-50 | 54.0 | 74.5 | 63.3 | 80.7 | 44.8 | 72.5 | 60.0 | 75.6 | 62.3 | 76.0 | 76.8 | 46.4 | 57.2 | 71.8 | 68.3 | 53.8 | 81.1 | 59.5 | 43.1 | 81.2 | 65.1 |
FCOS [37] | ResNet-101 | 61.1 | 82.6 | 76.6 | 87.6 | 42.8 | 80.6 | 64.1 | 79.1 | 67.2 | 82.0 | 79.6 | 46.4 | 57.8 | 72.1 | 64.8 | 63.4 | 85.2 | 62.8 | 43.8 | 87.5 | 69.4 |
GLNet [18] | ResNet-101 | 62.9 | 83.2 | 72.0 | 81.1 | 50.5 | 79.3 | 67.4 | 86.2 | 70.9 | 81.8 | 83.0 | 51.8 | 62.6 | 72.0 | 75.3 | 53.7 | 81.3 | 65.5 | 43.4 | 89.2 | 70.7 |
ABNet [14] | ResNet-50 | 66.8 | 84.0 | 74.9 | 87.7 | 50.3 | 78.2 | 67.8 | 85.9 | 74.2 | 79.7 | 81.2 | 55.4 | 61.6 | 75.1 | 74.0 | 66.7 | 87.0 | 62.2 | 53.6 | 89.1 | 72.8 |
CoF-Net [16] | ResNet-50 | 84.0 | 85.3 | 82.6 | 90.0 | 47.1 | 80.7 | 73.3 | 89.3 | 74.0 | 84.5 | 83.2 | 57.4 | 62.2 | 82.9 | 77.6 | 68.2 | 89.9 | 68.7 | 49.3 | 85.2 | 75.8 |
GLSANet [15] | ResNet-50 | 95.8 | 78.9 | 92.9 | 87.9 | 50.7 | 81.1 | 55.5 | 79.8 | 74.1 | 71.6 | 87.6 | 66.4 | 65.5 | 95.2 | 92.4 | 86.3 | 94.8 | 50.6 | 62.1 | 89.2 | 77.9 |
DETR *,† [6] | ResNet-50 | 63.8 | 78.6 | 71.6 | 85.1 | 21.7 | 76.3 | 41.7 | 68.3 | 45.4 | 74.0 | 74.2 | 24.8 | 46.1 | 33.8 | 36.9 | 40.0 | 81.6 | 47.5 | 38.3 | 78.8 | 56.4 |
Deformable DETR * [8] | ResNet-50 | 54.2 | 81.5 | 72.1 | 84.4 | 41.0 | 75.3 | 58.8 | 72.5 | 65.6 | 73.4 | 70.3 | 25.6 | 54.5 | 56.2 | 60.5 | 43.7 | 82.0 | 60.8 | 39.6 | 81.5 | 62.7 |
DINO * [13] | ResNet-50 | 71.1 | 88.8 | 80.8 | 86.7 | 49.3 | 80.1 | 72.6 | 88.9 | 77.0 | 79.6 | 82.3 | 57.3 | 61.7 | 76.5 | 72.1 | 72.2 | 87.1 | 66.6 | 52.7 | 88.8 | 74.6 |
DINO-SE * | ResNet-50 | 76.3 | 87.8 | 81.0 | 86.7 | 50.9 | 81.2 | 71.9 | 88.0 | 78.0 | 80.7 | 82.7 | 56.1 | 64.1 | 77.1 | 75.2 | 73.0 | 86.9 | 70.9 | 53.6 | 89.4 | 75.6 |
DINO-SD * | ResNet-50 | 70.1 | 89.7 | 79.2 | 86.9 | 51.8 | 82.3 | 69.6 | 88.0 | 78.1 | 80.7 | 81.7 | 56.0 | 64.1 | 75.7 | 71.9 | 71.8 | 87.4 | 65.8 | 53.6 | 88.9 | 74.7 |
DINO-SE-SD * | ResNet-50 | 71.9 | 89.0 | 78.7 | 87.7 | 51.9 | 81.5 | 68.9 | 89.2 | 78.9 | 79.9 | 83.5 | 53.4 | 64.6 | 76.4 | 76.9 | 73.6 | 87.9 | 67.5 | 54.6 | 90.4 | 75.3 |
Table 3. Model analysis on the NWPU VHR-10 dataset (AP values in %).

Model | Backbone | Epochs | AP_S | AP_M | AP_L | mAP50 | mAP75 | mAP | Params | GFLOPs | FPS |
---|---|---|---|---|---|---|---|---|---|---|---|
DETR † [6] | ResNet-50 | 50 | 29.7 | 52.7 | 67.5 | 91.7 | 65.7 | 58.4 | 41.3M | 16.0 | 40.0 |
Deformable DETR [8] | ResNet-50 | 50 | 27.2 | 51.3 | 61.1 | 87.9 | 64.5 | 55.1 | 39.8M | 34.1 | 20.3 |
DINO [13] | ResNet-50 | 50 | 30.5 | 57.4 | 65.4 | 92.1 | 71.6 | 62.1 | 46.6M | 55.1 | 15.7 |
DINO-SE | ResNet-50 | 50 | 36.5 | 60.3 | 64.0 | 94.8 | 70.8 | 62.7 | 49.9M | 58.1 | 15.2 |
DINO-SD | ResNet-50 | 50 | 36.7 | 57.7 | 64.2 | 93.6 | 71.6 | 62.5 | 47.7M | 54.7 | 16.0 |
DINO-SE-SD | ResNet-50 | 50 | 32.9 | 58.4 | 65.1 | 92.8 | 72.6 | 62.9 | 51.0M | 57.7 | 15.6 |
Table 4. Model analysis on the DIOR dataset (AP values in %).

Model | Backbone | Epochs | AP_S | AP_M | AP_L | mAP50 | mAP75 | mAP | Params | GFLOPs | FPS |
---|---|---|---|---|---|---|---|---|---|---|---|
DETR † [6] | ResNet-50 | 18 | 3.8 | 27.1 | 61.7 | 56.4 | 40.4 | 38.0 | 41.3M | 60.2 | 18.8 |
Deformable DETR [8] | ResNet-50 | 18 | 6.7 | 30.1 | 60.0 | 62.7 | 41.6 | 39.4 | 39.8M | 111.3 | 9.2 |
DINO [13] | ResNet-50 | 18 | 15.6 | 40.3 | 72.3 | 74.6 | 55.7 | 52.1 | 46.6M | 179.5 | 6.6 |
DINO-SE | ResNet-50 | 18 | 15.6 | 40.9 | 71.7 | 75.6 | 56.3 | 52.3 | 49.9M | 191.4 | 6.1 |
DINO-SD | ResNet-50 | 18 | 15.7 | 41.0 | 72.3 | 74.7 | 56.1 | 52.3 | 47.7M | 179.1 | 6.6 |
DINO-SE-SD | ResNet-50 | 18 | 16.6 | 40.7 | 71.7 | 75.3 | 56.2 | 52.3 | 51.0M | 191.0 | 6.2 |
Ablation study of the SE and SD insertion layers on NWPU VHR-10 (AP values in %).

Model | mAP50 | mAP75 | mAP | Params | GFLOPs | FPS |
---|---|---|---|---|---|---|
Baseline (DINO) | 92.1 | 71.6 | 62.1 | 46.6M | 55.1 | 15.7 |
+Skip Attention | 93.0 (+0.9) | 72.1 | 62.3 | 46.7M | 55.5 | 16.4 |
+SE-4 | 93.4 (+1.3) | 72.2 | 62.5 | 47.7M | 56.1 | 15.5 |
+SE-4, 5 | 93.9 (+1.8) | 71.4 | 62.4 | 48.8M | 57.0 | 15.3 |
+SE-4, 5, 6 | 94.8 (+2.7) | 70.8 | 62.7 | 49.9M | 58.1 | 15.2 |
+SE-2, 3, 4 | 93.0 (+0.9) | 71.7 | 62.4 | 49.9M | 58.1 | 15.2 |
+SD-2 | 92.4 (+0.3) | 71.8 | 61.4 | 47.1M | 54.9 | 15.7 |
+SD-2, 3 | 93.6 (+1.5) | 71.6 | 62.5 | 47.7M | 54.7 | 16.0 |
+SD-2, 3, 4 | 93.5 (+1.4) | 72.4 | 62.4 | 48.2M | 54.5 | 16.1 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).