GANsformer: A Detection Network for Aerial Images with High Performance Combining Convolutional Network and Transformer
Abstract
1. Introduction
- Foremost, with variable altitude conditions and smaller object sizes, aerial images exhibit considerable variation. Altitude affects the clarity and apparent size of objects, so images captured at different resolutions render the same objects differently. If an image is captured at an excessively high altitude, the objects appear distorted and blurred, which complicates differentiating them.
- Second, illumination varies with altitude. If images are captured at a relatively high altitude, the lighting conditions worsen and objects reach less favorable sharpness [12].
- Moreover, aerial images tend to contain many tiny objects in a single image. Some of these objects are densely distributed [13], which causes mutual occlusion. Consequently, pre-trained state-of-the-art models yield lower accuracy.
- In addition, aerial image capture is affected by the weather. Furthermore, clutter at high altitude, such as flying birds, flying insects, and leaves, interferes with the target objects to be detected.
- We modified the transformer to reduce the number of parameters, improve the training speed, and act as a branch network that improves the CNN’s ability to capture global features. Because GANsformer inherits and combines the structural advantages of CNNs and the global feature extraction of visual transformers, its performance is significantly better than CNNs and ViT at comparable parameter complexity. GANsformer has demonstrated remarkable potential in aerial image detection tasks. Eventually, on the validation set, the suggested technique achieves 96.77%, 98.86%, and 97.91% in precision, recall, and mAP, respectively. This experimental result indicates that the suggested model outperforms all other comparison models.
- In Section 6, we evaluated the performance of various combinations of generative models to verify the efficacy of Multi-GANs implementations. Experimentally, the SPA-GAN model performs best in the attention extraction module, whereas the WGAN model works best in image augmentation.
- Moreover, the detection task’s loss function is optimized by substituting the original loss with a more appropriate one.
- Additionally, this paper built a detection application for macOS. The model optimized with the proposed method, when integrated into this practical application, can also detect objects in aerial images effectively and satisfactorily.
2. Related Work
- One viable solution is to enlarge the receptive field by introducing deeper architectures or more pooling operations. Dilated convolution methods [35,36] increase the sampling step, while deformable convolution learns the sampling positions. SENet [37] and GENet propose using global average pooling to aggregate the global context and then recalibrate the feature channels. In contrast, CBAM [38,39] combines global max pooling and average pooling to refine features along the spatial and channel dimensions independently.
- Another possible solution is the global attention mechanism [40,41,42], which has significant advantages in capturing long-range dependencies in natural language processing. Inspired by the non-local means approach, non-local operations are introduced into CNNs via a self-attention mode. Thus, the response at each location is a weighted sum of global location features. The attention convolutional network [43] connects the convolutional feature map with the self-attention feature map to enhance the convolutional operation and, thus, capture remote interactions.
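As a concrete illustration of this “weighted sum of global location features”, a simplified non-local (self-attention) block over flattened spatial positions might look like the sketch below. It is a minimal example; the halved projection dimension and residual fusion are illustrative choices rather than the exact configuration used in [43].

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Simplified non-local block: the response at each spatial position is a
    weighted sum of the features at all positions (global self-attention)."""
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                                      # x: (N, C, H, W)
        n, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)           # (N, HW, C/2)
        k = self.key(x).flatten(2)                             # (N, C/2, HW)
        v = self.value(x).flatten(2).transpose(1, 2)           # (N, HW, C)
        attn = torch.softmax(q @ k / (c // 2) ** 0.5, dim=-1)  # (N, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(n, c, h, w)
        return x + out                                         # residual fusion
```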
3. Materials and Methods
3.1. Data Augmentation
3.1.1. Basic Augmentation
3.1.2. Advanced Augmentation
3.2. GANsformer Detection Network
- Two generative network models are added to the network to address the inadequate training of CNNs caused by small datasets and to improve the ability of deep CNNs to extract image features.
- We modified the transformer by reducing the number of parameters and improving the training speed, so that it serves as a branch network that improves the CNN’s ability to capture global features. Because GANsformer inherits and combines the structural advantages of CNNs and the global feature extraction of visual transformers, its performance is significantly better than a CNN or vision transformer of comparable parameter complexity, showing great potential in aerial image detection tasks.
- Mixup, Cutout, CutMix, SnapMix, and Mosaic data augmentation methods are used to reduce overfitting and help the detection network identify smaller-scale objects (a minimal Mixup sketch follows this list).
- Label smoothing and an optimized loss function are used to improve the performance of the GANsformer detection network.
- The NMS algorithm in the detection network is improved by adding weight coefficients to fuse overlapping bounding boxes.
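Among the augmentation methods listed above, Mixup is the simplest to state precisely. The sketch below uses the standard classification-style formulation (blending shuffled image/label pairs with a Beta-sampled coefficient); it illustrates the technique rather than the paper’s exact implementation, which must also handle bounding-box annotations.

```python
import numpy as np

def mixup(images: np.ndarray, labels: np.ndarray, alpha: float = 0.2):
    """Standard Mixup: convex combination of shuffled image/label pairs.
    `images` has shape (N, H, W, C); `labels` is one-hot with shape (N, K)."""
    lam = np.random.beta(alpha, alpha)          # mixing coefficient
    idx = np.random.permutation(len(images))    # random pairing
    mixed_images = lam * images + (1.0 - lam) * images[idx]
    mixed_labels = lam * labels + (1.0 - lam) * labels[idx]
    return mixed_images, mixed_labels
```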
3.2.1. Multi-GANs
Algorithm 1. Algorithm of WGAN. Relative to a standard GAN, the training procedure is modified as follows:
- The sigmoid is removed from the last layer of the discriminator.
- The losses of the generator and discriminator are not computed with a logarithm.
- After each discriminator parameter update, the weights are clipped so that their absolute values do not exceed a fixed constant c.
- RMSProp or SGD is used instead of momentum-based optimization algorithms such as Momentum and Adam.
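Putting the four modifications above together, one discriminator (critic) update can be sketched roughly as below. This is a minimal illustration: the generator and critic definitions, the latent dimension, and the clipping constant c = 0.01 are placeholders rather than the paper’s settings.

```python
import torch

def critic_step(critic, generator, real, optimizer, z_dim=100, c=0.01):
    """One WGAN critic update: no sigmoid or logarithm in the loss, followed by
    weight clipping to [-c, c]. `optimizer` is expected to be RMSprop or SGD."""
    z = torch.randn(real.size(0), z_dim, device=real.device)
    fake = generator(z).detach()
    # Wasserstein critic objective: maximize D(real) - D(fake),
    # i.e., minimize its negative.
    loss = -(critic(real).mean() - critic(fake).mean())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Clip every critic parameter to a fixed range after the update.
    for p in critic.parameters():
        p.data.clamp_(-c, c)
    return loss.item()

# e.g., optimizer = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
```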
3.2.2. Transformer
- Its training time is exceedingly long;
- It is not conducive to deployment acceleration;
- It requires a vast dataset;
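Despite these drawbacks, a slimmed-down transformer can still be attached to a CNN as a branch that supplies global context, as described in the contribution list of Section 3.2. The sketch below wires a single lightweight encoder layer over flattened CNN feature maps and fuses the result back by addition; the layer sizes, number of heads, and additive fusion are assumptions made for illustration, not the GANsformer’s exact architecture.

```python
import torch
import torch.nn as nn

class GlobalContextBranch(nn.Module):
    """Lightweight transformer branch over flattened CNN features, fused back
    into the convolutional stream by addition (illustrative configuration)."""
    def __init__(self, channels: int, heads: int = 4, layers: int = 1):
        super().__init__()
        # `channels` must be divisible by `heads` for multi-head attention.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads,
            dim_feedforward=2 * channels, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)

    def forward(self, x):                            # x: (N, C, H, W)
        n, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)        # (N, HW, C) token sequence
        tokens = self.encoder(tokens)                # global self-attention
        return x + tokens.transpose(1, 2).reshape(n, c, h, w)
```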
3.2.3. Loss Function
3.2.4. Fusion Method for Bounding Boxes
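Since the paper describes improving NMS by adding weight coefficients when fusing overlapping bounding boxes (see the contribution list in Section 3.2), the sketch below shows one plausible form of this idea: a greedy NMS variant that, instead of discarding overlapping boxes, averages each cluster’s coordinates weighted by detection confidence. The IoU threshold and the use of scores as weights are assumptions for illustration, not the paper’s exact fusion rule.

```python
import numpy as np

def iou(box: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """IoU between one box and an array of boxes, format (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def weighted_fusion_nms(boxes: np.ndarray, scores: np.ndarray, iou_thr: float = 0.5):
    """Greedy NMS variant that fuses each cluster of overlapping boxes into one
    box whose coordinates are confidence-weighted averages."""
    order = np.argsort(scores)[::-1]
    fused = []
    while order.size > 0:
        top, rest = order[0], order[1:]
        overlaps = iou(boxes[top], boxes[rest]) >= iou_thr
        cluster = np.concatenate(([top], rest[overlaps]))
        w = scores[cluster] / scores[cluster].sum()      # weight coefficients
        fused.append((boxes[cluster] * w[:, None]).sum(axis=0))
        order = rest[~overlaps]
    return np.array(fused)
```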
4. Experiment
4.1. Dataset Analysis
- The dataset contains a relatively large number of detection targets. Most of the images possess more than one detection target, and some of them, such as cars, represent a tiny proportion of the overall image.
- The samples in the dataset are distributed unevenly. To be more specific, the number of bridge samples is 4.5 times higher than that of baseball field samples.
- The overall data volume is small, which makes deep learning training difficult.
4.2. Evaluation Metrics
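The results tables report precision, recall, and mAP in percent. Assuming the standard detection definitions at a fixed IoU threshold, with true positives $TP$, false positives $FP$, false negatives $FN$, and $K$ object classes, these metrics are:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{mAP} = \frac{1}{K} \sum_{k=1}^{K} \int_{0}^{1} p_k(r)\, dr,$$

where $p_k(r)$ is the precision of class $k$ at recall $r$.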
4.3. Experiment Setting
4.4. Label Smoothing
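Assuming the standard formulation of label smoothing with smoothing factor $\varepsilon$ over $K$ classes (the specific value used in the paper is not restated here), each hard one-hot target $y_k$ is replaced by

$$y_k^{\mathrm{LS}} = (1 - \varepsilon)\, y_k + \frac{\varepsilon}{K},$$

which keeps the network from becoming over-confident on a small training set.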
4.5. Training Strategy
5. Result
5.1. Validation Results
5.2. Detection Results
5.3. Results Analysis
6. Discussion
6.1. Ablation Experiment of Multi-GANs
6.2. Ablation Experiment of Data Augmentation Methods
6.3. Validation on Wheat Head Dataset
6.4. Detection Application on macOS
6.5. Limitation
- Scarcity of datasets. The inadequate dataset and the substantial differences in sample size among categories are fundamental reasons for the unsatisfactory performance of the model. The dataset used in this paper contains 800 images with ten categories of targets, of which 150 are pure background images with no target. The bridge category accounts for 26% of the objects across the ten categories. Although this paper uses various data augmentation methods and Multi-GANs to expand the data, the imbalance among samples in the dataset is still not completely solved. One viable solution is to introduce a per-category weight into the loss function whose value is inversely proportional to that category’s share of the dataset: the smaller a category’s share, the larger its weight, which helps balance the sample-size gap between categories (a sketch of this weighting follows this list).
- Drawbacks of the one-stage structure. Considering the trend of fusing one-stage and two-stage models and the advantages of a one-stage network in inference speed and accuracy, this paper constructs a one-stage detection network. However, the one-stage model has intrinsic shortcomings, and the feature extraction ability of the backbone still needs to be improved. Although the backbone’s ability to process feature maps has been improved by the Multi-GANs attention extraction module, the distance between the shallowest and deepest layers grows as the network deepens, more information is lost with each additional layer, and the effectiveness of feature-map fusion gradually decreases.
- The definition of the loss function is still deficient. As the first point indicates, the loss function can be further improved to incorporate more information and counteract the imbalance among samples in the dataset.
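A minimal sketch of the inverse-frequency weighting suggested in the first limitation above; normalizing the weights to average 1 is one possible choice, not a prescription from the paper.

```python
import numpy as np

def inverse_frequency_weights(class_counts):
    """Per-class loss weights inversely proportional to each class's share of
    the dataset, normalized so that the weights average to 1."""
    counts = np.asarray(class_counts, dtype=float)
    frequency = counts / counts.sum()
    weights = 1.0 / frequency
    return weights / weights.mean()

# Usage (hypothetical): pass per-class object counts, then feed the result to a
# class-weighted loss, e.g. torch.nn.CrossEntropyLoss(weight=...).
# weights = inverse_frequency_weights(per_class_object_counts)
```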
7. Conclusions
- Multi-GANs model: first and foremost, a generative model is added in front of the backbone to expand the input aerial images, which aims to alleviate the general problem of small-sample datasets. Second, GAN models are added to the attention extraction module to generate attention masks. Figure 6 shows the effect of adding GAN models on feature maps, and the results of the experimental part also illustrate that this approach can effectively improve the robustness of the model. Ultimately, on the validation set, the proposed method reaches 96.77%, 98.86%, and 97.91% in precision, recall, and mAP, respectively. This experimental result demonstrates that the proposed model outperforms all the comparison models.
- We modified the transformer by reducing the number of parameters and improving the training speed, so that it serves as a branch network that improves the CNN’s ability to capture global features. Because GANsformer inherits and combines the structural advantages of CNNs and the global feature extraction of visual transformers, its performance is significantly better than a CNN or vision transformer of comparable parameter complexity, showing great potential in aerial image detection tasks.
- To verify the effectiveness of various implementations of Multi-GANs, in Section 6 we tested the performance of different combinations of generative models. Experimental results show that the SPA-GAN model performs best in the attention extraction module, while WGAN performs best for image augmentation.
- This paper encapsulated the model and developed a corresponding application for the macOS platform, making the model readily usable in practice.
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Eikelboom, J.A.; Wind, J.; Van de Ven, E.; Kenana, L.M.; Schroder, B.; de Knegt, H.J.; van Langevelde, F.; Prins, H.H. Improving the precision and accuracy of animal population estimates with aerial image object detection. Methods Ecol. Evol. 2019, 10, 1875–1887.
- Xiao, Z.; Wang, K.; Wan, Q.; Tan, X.; Xu, C.; Xia, F. A2S-Det: Efficiency Anchor Matching in Aerial Image Oriented Object Detection. Remote Sens. 2021, 13, 73.
- Chen, C.; Zhong, J.; Tan, Y. Multiple-oriented and small object detection with convolutional neural networks for aerial image. Remote Sens. 2019, 11, 2176.
- Wang, Y.; Zorzi, S.; Bittner, K. Machine-learned 3D Building Vectorization from Satellite Imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 1072–1081.
- Abbasi, S.; Rezaeian, M. Visual object tracking using similarity transformation and adaptive optical flow. Multimed. Tools Appl. 2021, 1–19.
- Liu, M.; Wang, X.; Zhou, A.; Fu, X.; Ma, Y.; Piao, C. UAV-YOLO: Small object detection on unmanned aerial vehicle perspective. Sensors 2020, 20, 2238.
- Zhang, W.; Tang, P.; Zhao, L. Remote sensing image scene classification using CNN-CapsNet. Remote Sens. 2019, 11, 494.
- Pham, M.T.; Courtrai, L.; Friguet, C.; Lefèvre, S.; Baussard, A. YOLO-Fine: One-stage detector of small objects under various backgrounds in remote sensing images. Remote Sens. 2020, 12, 2501.
- He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- He, J.; Deng, Z.; Zhou, L.; Wang, Y.; Qiao, Y. Adaptive pyramid context network for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7519–7528.
- Nguyen, N.D.; Do, T.; Ngo, T.D.; Le, D.D. An evaluation of deep learning methods for small object detection. J. Electr. Comput. Eng. 2020, 2020, 3189691.
- Hu, G.X.; Yang, Z.; Hu, L.; Huang, L.; Han, J.M. Small object detection with multiscale features. Int. J. Digit. Multimed. Broadcast. 2018, 2018, 4546896.
- Liu, C.; Wu, Y.; Liu, J.; Han, J. MTI-YOLO: A Light-Weight and Real-Time Deep Neural Network for Insulator Detection in Complex Aerial Images. Energies 2021, 14, 1426.
- Courtrai, L.; Pham, M.T.; Lefèvre, S. Small Object Detection in Remote Sensing Images Based on Super-Resolution with Auxiliary Generative Adversarial Networks. Remote Sens. 2020, 12, 3152.
- Rabbi, J.; Ray, N.; Schubert, M.; Chowdhury, S.; Chao, D. Small-Object Detection in Remote Sensing Images with End-to-End Edge-Enhanced GAN and Object Detector Network. Remote Sens. 2020, 12, 1432.
- Xu, D.; Wu, Y. Improved YOLO-V3 with DenseNet for multi-scale remote sensing target detection. Sensors 2020, 20, 4276.
- Avola, D.; Cinque, L.; Diko, A.; Fagioli, A.; Foresti, G.L.; Mecca, A.; Pannone, D.; Piciarelli, C. MS-Faster R-CNN: Multi-stream backbone for improved Faster R-CNN object detection and aerial tracking from UAV images. Remote Sens. 2021, 13, 1670.
- Jin, R.; Lv, J.; Li, B.; Ye, J.; Lin, D. Toward efficient object detection in aerial images using extreme scale metric learning. IEEE Access 2021, 9, 56214–56227.
- Fujiyoshi, H.; Hirakawa, T.; Yamashita, T. Deep learning-based image recognition for autonomous driving. IATSS Res. 2019, 43, 244–252.
- Sim, H.S.; Kim, H.I.; Ahn, J.J. Is deep learning for image recognition applicable to stock market prediction? Complexity 2019, 2019, 4324878.
- Hatt, M.; Parmar, C.; Qi, J.; El Naqa, I. Machine (deep) learning methods for image processing and radiomics. IEEE Trans. Radiat. Plasma Med. Sci. 2019, 3, 104–108.
- Ann, E.T.L.; Hao, N.S.; Wei, G.W.; Hee, K.C. Feast In: A Machine Learning Image Recognition Model of Recipe and Lifestyle Applications. MATEC Web Conf. 2021, 335, 04006.
- Gu, H.; Wen, F.; Wang, B.; Lee, A.K.; Xu, D. Machine Learning-Based Image Recognition for Visual Inspections; SNAME Maritime Convention: Tacoma, WA, USA, 2019.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
- Li, Z.; Zhou, F. FSSD: Feature fusion single shot multibox detector. arXiv 2017, arXiv:1712.00960.
- Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4203–4212.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
- Jocher, G. YOLOv5. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 17 January 2022).
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790.
- Zhang, Y.; Wa, S.; Liu, Y.; Zhou, X.; Sun, P.; Ma, Q. High-Accuracy Detection of Maize Leaf Diseases CNN Based on Multi-Pathway Activation Function Module. Remote Sens. 2021, 13, 4218.
- Zhang, Y.; He, S.; Wa, S.; Zong, Z.; Liu, Y. Using Generative Module and Pruning Inference for the Fast and Accurate Detection of Apple Flower in Natural Environments. Information 2021, 12, 495.
- Zhang, Y.; Wa, S.; Sun, P.; Wang, Y. Pear Defect Detection Method Based on ResNet and DCGAN. Information 2021, 12, 397.
- Wu, H.; Zhang, J.; Huang, K.; Liang, K.; Yu, Y. FastFCN: Rethinking dilated convolution in the backbone for semantic segmentation. arXiv 2019, arXiv:1903.11816.
- Wang, B.; Lei, Y.; Tian, S.; Wang, T.; Liu, Y.; Patel, P.; Jani, A.B.; Mao, H.; Curran, W.J.; Liu, T.; et al. Deeply supervised 3D fully convolutional networks with group dilated convolution for automatic MRI prostate segmentation. Med. Phys. 2019, 46, 1707–1718.
- Li, X.; Shen, X.; Zhou, Y.; Wang, X.; Li, T.Q. Classification of breast cancer histopathological images using interleaved DenseNet with SENet (IDSNet). PLoS ONE 2020, 15, e0232127.
- Wang, S.H.; Fernandes, S.; Zhu, Z.; Zhang, Y.D. AVNC: Attention-based VGG-style network for COVID-19 diagnosis by CBAM. IEEE Sensors J. 2021.
- Chen, L.; Tian, X.; Chai, G.; Zhang, X.; Chen, E. A New CBAM-P-Net Model for Few-Shot Forest Species Classification Using Airborne Hyperspectral Images. Remote Sens. 2021, 13, 1269.
- Cai, W.; Wang, Y.; Ma, J.; Jin, Q. CAN: Effective cross features by global attention mechanism and neural network for ad click prediction. Tsinghua Sci. Technol. 2021, 27, 186–195.
- Wu, T.; Ku, T.; Zhang, H. Research for image caption based on global attention mechanism. In Proceedings of the Second Target Recognition and Artificial Intelligence Summit Forum; International Society for Optics and Photonics: Bellingham, WA, USA, 2020; Volume 11427, p. 114272.
- Gan, X.; Wang, L.; Chen, Q.; Ge, Y.; Duan, S. GAU-Net: U-Net Based on Global Attention Mechanism for brain tumor segmentation. J. Phys. Conf. Ser. 2021, 1861, 012041.
- Bello, I.; Zoph, B.; Vaswani, A.; Shlens, J.; Le, Q.V. Attention augmented convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 3286–3295.
- Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on visual transformer. arXiv 2020, arXiv:2012.12556.
- Sajid, U.; Chen, X.; Sajid, H.; Kim, T.; Wang, G. Audio-visual transformer based crowd counting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2249–2259.
- Truong, T.D.; Duong, C.N.; Pham, H.A.; Raj, B.; Le, N.; Luu, K. The Right to Talk: An Audio-Visual Transformer Approach. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1105–1114.
- Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132.
- Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25. Available online: https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf (accessed on 17 January 2022).
- Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412.
- DeVries, T.; Taylor, G.W. Improved regularization of convolutional neural networks with cutout. arXiv 2017, arXiv:1708.04552.
- Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 6023–6032.
- Huang, S.; Wang, X.; Tao, D. SnapMix: Semantically Proportional Mixing for Augmenting Fine-grained Data. arXiv 2020, arXiv:2012.04846.
- Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
- Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
- Everingham, M. The PASCAL Visual Object Classes Challenge 2007; Springer: New York, NY, USA, 2007.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Arjovsky, M.; Bottou, L. Towards principled methods for training generative adversarial networks. arXiv 2017, arXiv:1701.04862.
- Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; pp. 214–223.
- Mariani, G.; Scheidegger, F.; Istrate, R.; Bekas, C.; Malossi, C. BAGAN: Data augmentation with balancing GAN. arXiv 2018, arXiv:1803.09655.
- Odena, A.; Olah, C.; Shlens, J. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; pp. 2642–2651.
- Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666.
- Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 26–28 June 2020; Volume 34, pp. 12993–13000.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016.
- Kaggle. Global Wheat Detection. 2020. Available online: https://www.kaggle.com/c/global-wheat-detection (accessed on 17 January 2022).
Model | Input Size | Precision (%) | Recall (%) | mAP (%) | FPS
---|---|---|---|---|---
SSD | 300 × 300 | 83.96 | 80.23 | 87.64 | 33.7
SSD | 512 × 512 | 86.43 | 86.26 | 91.27 | 32.3
FSSD | 300 × 300 | 89.76 | 94.37 | 94.85 | 32.9
FSSD | 512 × 512 | 93.75 | 96.89 | 96.31 | 32.2
RefineDet | 300 × 300 | 94.34 | 98.28 | 96.81 | 27.8
RefineDet | 512 × 512 | 94.91 | 98.49 | 96.97 | 25.3
EfficientDet L2 | 300 × 300 | 92.10 | 95.33 | 94.98 | 20.8
EfficientDet L2 | 512 × 512 | 93.24 | 95.98 | 95.14 | 20.2
Faster RCNN | 300 × 300 | 82.87 | 78.32 | 90.13 | 25.0
Faster RCNN | 512 × 512 | 85.29 | 76.91 | 92.20 | 46.7
YOLO v3 | 608 × 608 | 94.92 | 98.43 | 96.93 | 52.1
YOLO v4 | 608 × 608 | 94.38 | 98.51 | 97.42 | 57.5
YOLO v5 | 608 × 608 | 95.98 | 98.57 | 97.51 | 60.3
Ours | 300 × 300 | 96.77 | 98.83 | 97.91 | 32.2
Ours | 512 × 512 | 96.45 | 98.86 | 97.50 | 30.4
Object | Precision (%) | Recall (%) | mAP (%)
---|---|---|---
Bridge | 97.35 | 99.01 | 98.59
Baseball Field | 96.21 | 98.19 | 97.32
Basketball Court | 96.13 | 98.60 | 97.03
Airplane | 96.27 | 98.59 | 97.28
Track and Field | 97.29 | 98.04 | 98.55
Oil Tank | 96.18 | 98.83 | 97.89
Tennis Field | 97.21 | 98.89 | 98.37
Port | 96.99 | 98.89 | 97.87
Ship | 97.23 | 98.81 | 97.56
Car | 96.91 | 98.95 | 97.90
Method | Precision (%) | Recall (%) | mAP (%) | FPS
---|---|---|---|---
No GAN (baseline) | 94.17 | 95.22 | 94.39 | 47.9
WGAN + SAGAN | 96.06 | 97.69 | 97.13 | 34.3
BAGAN + SAGAN | 95.18 | 97.19 | 96.98 | 34.1
WGAN + SPA-GAN | 96.77 | 98.83 | 97.91 | 32.2
BAGAN + SPA-GAN | 96.38 | 98.55 | 97.20 | 32.2
MixUp | CutOut | CutMix | SnapMix | Mosaic | Precision | Recall | mAP |
---|---|---|---|---|---|---|---|
✓ | ✓ | ✓ | ✓ | ✓ | 96.77 | 98.86 | 97.91 |
✓ | ✓ | ✓ | ✓ | 96.21 | 98.51 | 97.82 | |
✓ | ✓ | ✓ | 96.53 | 98.82 | 97.69 | ||
✓ | ✓ | ✓ | 96.76 | 98.89 | 97.93 | ||
✓ | ✓ | ✓ | ✓ | 96.23 | 98.56 | 97.85 |