Multi-Scale Feature Fusion Enhancement for Underwater Object Detection
Abstract
1. Introduction
- We introduce an end-to-end, real-time, Transformer-based detection framework for small and ambiguous underwater objects.
- We design an align-split network (ASN) that identifies small objects by reinforcing multi-scale feature interaction and fusion, and we develop a distinction enhancement module (DEM) that combines different attention mechanisms to make the model more sensitive to ambiguous objects.
- Experimental results on four challenging underwater datasets (DUO, Brackish, TrashCan, and WPBB) show that our method outperforms most other underwater object detection methods.
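The cross-scale fusion that the ASN and CFFEN contributions build on follows the familiar top-down pattern of feature pyramid networks [34]. As background only, here is a minimal NumPy sketch of generic top-down fusion; it is not the paper's ASN or CFFEN, and the function names are illustrative:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def top_down_fuse(features):
    """FPN-style top-down fusion: starting from the coarsest map,
    upsample it and add it into the next finer level.
    `features` is a list of (C, H, W) arrays, finest level first."""
    fused = [None] * len(features)
    fused[-1] = features[-1]                    # coarsest level passes through
    for i in range(len(features) - 2, -1, -1):  # walk toward finer levels
        fused[i] = features[i] + upsample2x(fused[i + 1])
    return fused

# Toy pyramid: three levels with the same channel count, halving resolution
feats = [np.ones((8, 32, 32)), np.ones((8, 16, 16)), np.ones((8, 8, 8))]
out = top_down_fuse(feats)
print([f.shape for f in out])  # [(8, 32, 32), (8, 16, 16), (8, 8, 8)]
```

Nearest-neighbour upsampling plus element-wise addition is the simplest possible fusion rule; the paper's modules replace it with learned alignment, splitting, and attention.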
2. Related Works
2.1. CNN-Based Methods
2.2. Transformer-Based Methods
3. Methodology
3.1. Model Overview
3.2. Align-Split Network (ASN)
3.3. Cross-Scale Feature Fusion Enhancement Network (CFFEN)
3.4. Decoder
4. Experiments
4.1. Datasets
4.2. Implementation Details
4.3. Evaluation Metrics
4.4. State-of-the-Art Comparison
4.4.1. Results on DUO
4.4.2. Results on Brackish, TrashCan and WPBB
4.5. Ablation Studies
4.5.1. Analysis of the Align-Split Network
4.5.2. Analysis of the Distinction Enhancement Module
4.6. Model Complexity Analysis
4.7. Error Analysis
4.8. Qualitative Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Xu, S.; Zhang, M.; Song, W.; Mei, H.; He, Q.; Liotta, A. A systematic review and analysis of deep learning-based underwater object detection. Neurocomputing 2023, 527, 204–232.
- Chen, G.; Mao, Z.; Wang, K.; Shen, J. HTDet: A hybrid transformer-based approach for underwater small object detection. Remote Sens. 2023, 15, 1076.
- Fu, C.; Liu, R.; Fan, X.; Chen, P.; Fu, H.; Yuan, W.; Zhu, M.; Luo, Z. Rethinking general underwater object detection: Datasets, challenges, and solutions. Neurocomputing 2023, 517, 243–256.
- Mu, P.; Xu, H.; Liu, Z.; Wang, Z.; Chan, S.; Bai, C. A generalized physical-knowledge-guided dynamic model for underwater image enhancement. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 7111–7120.
- Lin, W.H.; Zhong, J.X.; Liu, S.; Li, T.; Li, G. Roimix: Proposal-fusion among multiple images for underwater object detection. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 2588–2592.
- Dai, L.; Liu, H.; Song, P.; Liu, M. A gated cross-domain collaborative network for underwater object detection. Pattern Recognit. 2024, 149, 110222.
- Dai, L.; Liu, H.; Song, P.; Tang, H.; Ding, R.; Li, S. Edge-guided representation learning for underwater object detection. CAAI Trans. Intell. Technol. 2024, 9, 1078–1091.
- Er, M.J.; Chen, J.; Zhang, Y.; Gao, W. Research challenges, recent advances, and popular datasets in deep learning-based underwater marine object detection: A review. Sensors 2023, 23, 1990.
- Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276.
- Anwar, S.; Li, C. Diving deeper into underwater image enhancement: A survey. Signal Process. Image Commun. 2020, 89, 115978.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Mandal, R.; Connolly, R.M.; Schlacher, T.A.; Stantic, B. Assessing fish abundance from underwater video using deep neural networks. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–6.
- Qi, S.; Du, J.; Wu, M.; Yi, H.; Tang, L.; Qian, T.; Wang, X. Underwater small target detection based on deformable convolutional pyramid. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 7–13 May 2022; pp. 2784–2788.
- Song, P.; Li, P.; Dai, L.; Wang, T.; Chen, Z. Boosting R-CNN: Reweighting R-CNN samples by RPN’s error for underwater object detection. Neurocomputing 2023, 530, 150–164.
- Li, X.; Yu, H.; Chen, H. Multi-scale aggregation feature pyramid with cornerness for underwater object detection. Vis. Comput. 2024, 40, 1299–1310.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A comprehensive review of yolo architectures in computer vision: From yolov1 to yolov8 and yolo-nas. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716.
- Liu, K.; Peng, L.; Tang, S. Underwater object detection using TC-YOLO with attention mechanisms. Sensors 2023, 23, 2567.
- Zhao, L.; Yun, Q.; Yuan, F.; Ren, X.; Jin, J.; Zhu, X. YOLOv7-CHS: An Emerging Model for Underwater Object Detection. J. Mar. Sci. Eng. 2023, 11, 1949.
- Shen, X.; Wang, H.; Cui, T.; Guo, Z.; Fu, X. Multiple information perception-based attention in YOLO for underwater object detection. Vis. Comput. 2024, 40, 1415–1438.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV 2020), Glasgow, UK, 23–28 August 2020; Springer: New York, NY, USA, 2020; pp. 213–229.
- Shah, S.; Tembhurne, J. Object detection using convolutional neural networks and transformer-based models: A review. J. Electr. Syst. Inf. Technol. 2023, 10, 54.
- Gao, J.; Zhang, Y.; Geng, X.; Tang, H.; Bhatti, U.A. PE-Transformer: Path enhanced transformer for improving underwater object detection. Expert Syst. Appl. 2024, 246, 123253.
- Rekavandi, A.M.; Rashidi, S.; Boussaid, F.; Hoefs, S.; Akbas, E.; Bennamoun, M. Transformers in small object detection: A benchmark and survey of state-of-the-art. arXiv 2023, arXiv:2309.04902.
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
- Zong, Z.; Song, G.; Liu, Y. Detrs with collaborative hybrid assignments training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 6748–6758.
- Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974.
- Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Wang, Y.; Han, K. Gold-YOLO: Efficient object detector via gather-and-distribute mechanism. In Proceedings of the 37th International Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Curran Associates Inc.: Red Hook, NY, USA, 2023; pp. 51094–51112.
- Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Grenoble, France, 19–23 September 2022; pp. 443–459.
- Narayanan, M. SENetV2: Aggregated dense layer for channelwise and global representations. arXiv 2023, arXiv:2311.10807.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
- Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R.W. Biformer: Vision transformer with bi-level routing attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10323–10333.
- Lian, S.; Li, H.; Cong, R.; Li, S.; Zhang, W.; Kwong, S. WaterMask: Instance Segmentation for Underwater Imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 1305–1315.
- Li, H.; Xiao, X.; Liu, X.; Wen, G.; Liu, L. Learning Cognitive Features as Complementary for Facial Expression Recognition. Int. J. Intell. Syst. 2024, 2024, 7321175.
- Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. In Proceedings of the 6th International Conference on Learning Representations (ICLR’18), Vancouver, BC, Canada, 30 April–3 May 2018.
- Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Liu, C.; Li, H.; Wang, S.; Zhu, M.; Wang, D.; Fan, X.; Wang, Z. A dataset and benchmark of underwater object detection for robot picking. In Proceedings of the 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shenzhen, China, 5–9 July 2021; pp. 1–6.
- Pedersen, M.; Bruslund Haurum, J.; Gade, R.; Moeslund, T.B. Detection of marine animals in a new underwater dataset with varying visibility. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 15–20 June 2019; pp. 18–26.
- Hong, J.; Fulton, M.; Sattar, J. Trashcan: A semantically-segmented dataset towards visual detection of marine debris. arXiv 2020, arXiv:2007.08097.
- Zocco, F.; Lin, T.C.; Huang, C.I.; Wang, H.C.; Khyam, M.O.; Van, M. Towards more efficient efficientdets and real-time marine debris detection. IEEE Robot. Autom. Lett. 2023, 8, 2134–2141.
- Wang, Z.; Liu, C.; Wang, S.; Tang, T.; Tao, Y.; Yang, C.; Li, H.; Liu, X.; Fan, X. UDD: An underwater open-sea farm object detection dataset for underwater robot picking. arXiv 2020, arXiv:2003.01446.
- Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open mmlab detection toolbox and benchmark. arXiv 2019, arXiv:1906.07155.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Part V. Springer: Cham, Switzerland, 2014; pp. 740–755.
- Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
- Wang, B.; Wang, Z.; Guo, W.; Wang, Y. A dual-branch joint learning network for underwater object detection. Knowl.-Based Syst. 2024, 293, 111672.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1106–1114.
- Chen, L.; Zhou, F.; Wang, S.; Dong, J.; Li, N.; Ma, H.; Wang, X.; Zhou, H. SWIPENET: Object detection in noisy underwater scenes. Pattern Recognit. 2022, 132, 108926.
- Liu, Z.; Wang, B.; Li, Y.; He, J.; Li, Y. UnitModule: A lightweight joint image enhancement module for underwater object detection. Pattern Recognit. 2024, 151, 110435.
- Zhu, B.; Wang, J.; Jiang, Z.; Zong, F.; Liu, S.; Li, Z.; Sun, J. Autoassign: Differentiable label assignment for dense object detection. arXiv 2020, arXiv:2007.03496.
- Liang, X.; Song, P. Excavating roi attention for underwater object detection. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 2651–2655.
| Method | Year | Epochs | AP | AP50 | AP75 | Echinus | Starfish | Holothurian | Scallop |
|---|---|---|---|---|---|---|---|---|---|
| *Generic Object Detector:* | | | | | | | | | |
| Faster R-CNN [12] | 2015 | 36 | 61.3 | 81.9 | 69.5 | 70.4 | 71.4 | 61.4 | 41.9 |
| Cascade R-CNN [49] | 2017 | 36 | 61.2 | 82.1 | 69.2 | 69.0 | 72.0 | 61.9 | 41.9 |
| Deformable DETR [27] | 2020 | 36 | 63.7 | 84.4 | 71.9 | 71.6 | 73.9 | 63.0 | 46.3 |
| AutoAssign [55] | 2020 | 36 | 66.1 | 85.7 | 72.6 | 74.1 | 75.5 | 65.8 | 48.9 |
| YOLOv7 [50] | 2022 | 36 | 66.3 | 85.8 | 73.9 | 73.7 | 74.5 | 66.3 | 50.8 |
| *Underwater Object Detector:* | | | | | | | | | |
| RoIMix [5] | 2020 | 36 | 61.9 | 81.3 | 69.9 | 70.7 | 72.4 | 63.0 | 41.7 |
| RoIAttn [56] | 2022 | 36 | 62.3 | 82.8 | 71.4 | 70.6 | 72.6 | 63.4 | 42.5 |
| SWIPENet [53] | 2022 | 36 | 63.0 | 79.7 | 72.5 | 68.5 | 73.6 | 64.0 | 45.9 |
| Boosting R-CNN [16] | 2023 | 36 | 63.5 | 78.5 | 71.1 | 69.0 | 74.5 | 63.8 | 46.8 |
| GCC-Net [6] | 2023 | 36 | 69.1 | 87.8 | 76.3 | 75.2 | 76.7 | 68.2 | 56.3 |
| ERL-Net [7] | 2024 | 36 | 64.9 | 82.4 | 73.2 | 71.0 | 74.8 | 67.2 | 46.5 |
| YOLOX-S w/UnitModule [54] | 2024 | 100 | 63.7 | 85.8 | 72.2 | - | - | - | - |
| DJL-Net [51] | 2024 | 12 | 65.6 | 84.2 | 73.0 | - | - | - | - |
| Aqua-DETR (Ours) | 2024 | 36 | 69.3 | 87.6 | 76.4 | 74.5 | 75.4 | 67.3 | 58.3 |
| Method | Year | Brackish AP | Brackish AP50 | TrashCan AP | TrashCan AP50 | WPBB AP | WPBB AP50 |
|---|---|---|---|---|---|---|---|
| Faster R-CNN [12] | 2015 | 61.2 | 62.9 | 31.2 | 55.3 | 75.7 | 98.7 |
| Deformable DETR [27] | 2020 | 77.5 | 97.1 | 36.1 | 56.9 | 73.2 | 98.3 |
| YOLOv7 [50] | 2022 | 57.2 | 88.5 | 24.1 | 43.4 | 78.7 | 99.5 |
| RoIAttn [56] | 2022 | 78.3 | 91.0 | 32.6 | 57.2 | 70.2 | 88.1 |
| Boosting R-CNN [16] | 2023 | 79.6 | 97.4 | 36.8 | 57.6 | 78.5 | 97.1 |
| GCC-Net [6] | 2023 | 80.5 | 98.3 | 41.3 | 61.2 | 81.0 | 99.5 |
| ERL-Net [7] | 2024 | 85.4 | 98.8 | 37.0 | 58.9 | 79.7 | 98.5 |
| Aqua-DETR (Ours) | 2024 | 82.8 | 98.9 | 42.9 | 63.0 | 83.8 | 100.0 |
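The AP and AP50 columns follow the COCO evaluation convention [48]: AP averages precision over IoU thresholds from 0.5 to 0.95, while AP50 counts a detection as correct once its IoU with a ground-truth box reaches 0.5. A minimal sketch of the underlying IoU test (box format and names are illustrative, not the authors' evaluation code):

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

pred = (0, 0, 10, 10)   # predicted box
gt   = (5, 0, 15, 10)   # ground-truth box, half-overlapping
print(iou(pred, gt))    # 0.333..., below 0.5, so not a true positive at AP50
```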
| Method | AP | AP50 | AP75 | APS | APM | APL | Echinus | Starfish | Holothurian | Scallop |
|---|---|---|---|---|---|---|---|---|---|---|
| baseline | 68.6 | 87.0 | 76.2 | 50.6 | 70.1 | 68.0 | 74.1 | 75.5 | 67.3 | 57.8 |
| baseline + ASN | 69.2 | 87.5 | 76.5 | 52.9 | 70.7 | 68.5 | 74.7 | 76.0 | 67.4 | 59.1 |
| baseline + DEM | 68.8 | 87.2 | 76.4 | 50.5 | 70.4 | 68.0 | 74.5 | 75.4 | 67.3 | 58.3 |
| Aqua-DETR | 69.3 | 87.6 | 76.4 | 54.2 | 70.5 | 68.9 | 74.6 | 76.3 | 68.0 | 58.4 |
| Method | Backbone | Params | FLOPs | FPS |
|---|---|---|---|---|
| *Generic Object Detector:* | | | | |
| Faster R-CNN [12] | ResNet50 | 41.17 M | 63.29 G | 41.2 |
| Cascade R-CNN [49] | ResNet50 | 68.94 M | 91.06 G | 32.5 |
| Deformable DETR [27] | ResNet50 | 39.83 M | 51.06 G | 25.7 |
| *Underwater Object Detector:* | | | | |
| RoIMix [5] | ResNet50 | 68.94 M | 91.08 G | 14.1 |
| Boosting R-CNN [16] | ResNet50 | 45.95 M | 54.71 G | 34.7 |
| ERL-Net [7] | SiEdge-ResNet50 | 218.83 M | 416.63 G | 12.7 |
| RoIAttn [56] | ResNet50 | 55.23 M | 331.39 G | 13.1 |
| GCC-Net [6] | SwinFusionTransformer | 38.31 M | 78.93 G | 21.3 |
| DJL-Net [51] | ResNet50 | 58.48 M | 69.51 G | 23.2 |
| Aqua-DETR (Ours) | ResNet50 | 50.36 M | 78.33 G | 35.0 |
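Params and FLOPs figures like those above are usually obtained by summing per-layer counts, with "FLOPs" conventionally reported as multiply-adds at a fixed input size. A toy per-layer estimate illustrating the convention (a sketch only, not the profiling tool the authors used):

```python
def conv2d_params(c_in, c_out, k):
    """Weight count of a k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def conv2d_flops(c_in, c_out, k, h_out, w_out):
    """Multiply-add count of a k x k convolution producing a
    (c_out, h_out, w_out) output map (bias ignored)."""
    return conv2d_params(c_in, c_out, k) * h_out * w_out

# A single 3x3 convolution, 256 -> 256 channels, on an 80 x 80 feature map
print(conv2d_params(256, 256, 3))         # 589824 weights (~0.59 M)
print(conv2d_flops(256, 256, 3, 80, 80))  # 3774873600 multiply-adds (~3.77 G)
```

Summing such counts over every layer of a detector yields the M/G figures in the table; FPS, by contrast, must be measured empirically on target hardware.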
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Xiao, Z.; Li, Z.; Li, H.; Li, M.; Liu, X.; Kong, Y. Multi-Scale Feature Fusion Enhancement for Underwater Object Detection. Sensors 2024, 24, 7201. https://doi.org/10.3390/s24227201