A Hybrid Framework for Referring Image Segmentation: Dual-Decoder Model with SAM Complementation
Abstract
1. Introduction
- A novel hybrid framework for referring image segmentation: the dual-decoder model with SAM complementation achieves better referring image segmentation results than other state-of-the-art models.
- A novel dual-decoder framework with KAN is proposed to improve the prediction accuracy of the coordinate points along the segmentation target's edges.
- We propose a SAM-based completion module for referring image segmentation, which further refines the segmentation masks predicted by our decoder.
- We apply our framework to referring image segmentation on three public datasets and surpass other state-of-the-art methods.
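To make the second contribution concrete, the following is a simplified sketch of a KAN-style layer in the spirit of Liu et al.'s Kolmogorov-Arnold Networks: every input-output edge carries its own learnable univariate function. The class name `SimpleKANLayer`, the Gaussian RBF parameterization (in place of the paper's B-splines), and all hyperparameters are our own illustrative assumptions, not the model's actual decoder module.

```python
import numpy as np

class SimpleKANLayer:
    """Illustrative KAN-style layer (not the paper's exact module): each
    input-output edge applies a learnable univariate function, expressed
    here as a sum of Gaussian radial basis functions on a fixed grid, a
    simplification of the B-spline parameterization in the KAN paper."""

    def __init__(self, in_dim: int, out_dim: int, num_basis: int = 8, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.centers = np.linspace(-1.0, 1.0, num_basis)  # fixed RBF grid
        self.width = 2.0 / (num_basis - 1)                # grid spacing as RBF width
        # One learnable coefficient per (output, input, basis) triple.
        self.coef = rng.normal(0.0, 0.1, (out_dim, in_dim, num_basis))

    def forward(self, x: np.ndarray) -> np.ndarray:
        # x: (batch, in_dim) -> RBF activations phi: (batch, in_dim, num_basis)
        phi = np.exp(-(((x[..., None] - self.centers) / self.width) ** 2))
        # Evaluate each edge's univariate function, then sum over inputs.
        return np.einsum("bik,oik->bo", phi, self.coef)

layer = SimpleKANLayer(in_dim=4, out_dim=2)
out = layer.forward(np.zeros((3, 4)))
print(out.shape)  # (3, 2)
```

In contrast with an MLP, the nonlinearity here lives on the edges (the RBF mixtures) rather than on the nodes, which is the core KAN idea the decoder branch builds on.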
2. Related Work
2.1. Traditional Pixel-to-Pixel Segmentation Methods
2.2. Sequence-to-Sequence Segmentation Methods
3. Method
3.1. Overall Framework
3.2. Encoder
3.3. Dual-Decoder
3.4. Dual-Decoder Output Ensemble Module
Algorithm 1 Dual-branch decoder output integration module.
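A minimal sketch of how a dual-branch output integration module might combine the two decoders' mask predictions, assuming confidence-gated selection with a pixel-wise union fallback; the rule, the threshold `tau`, and the function name are our own assumptions, and the paper's Algorithm 1 may differ:

```python
import numpy as np

def integrate_dual_decoder(mask_a: np.ndarray, mask_b: np.ndarray,
                           conf_a: float, conf_b: float,
                           tau: float = 0.5) -> np.ndarray:
    """Hypothetical integration rule: when one branch is clearly more
    confident, keep its boolean mask; when confidences are close,
    fall back to the pixel-wise union of both predictions."""
    if abs(conf_a - conf_b) > tau:
        return mask_a if conf_a > conf_b else mask_b
    return np.logical_or(mask_a, mask_b)

a = np.array([[1, 0], [0, 0]], dtype=bool)
b = np.array([[0, 1], [0, 0]], dtype=bool)
print(integrate_dual_decoder(a, b, conf_a=0.9, conf_b=0.2))  # keeps mask a
print(integrate_dual_decoder(a, b, conf_a=0.6, conf_b=0.5))  # union of a and b
```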
3.5. SAM-Based Segmentation Completion Module
4. Experiment
4.1. Experimental Settings and Evaluation Indicators
4.1.1. Experimental Settings
Algorithm 2 Two-branch decoder integration with SAM.
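As a hedged sketch of the prompt-construction step suggested by the SAM input ablation in Section 4.4.2 (whose BB and PP operations we read as bounding-box and positive-point prompts), the helper below derives SAM prompts from a coarse decoder mask; `mask_to_sam_prompts` and its uniform point-sampling scheme are our own illustrative assumptions, not the paper's exact procedure, and the remaining ablation operations (DN, CP) are not reproduced:

```python
import numpy as np

def mask_to_sam_prompts(mask: np.ndarray, num_points: int = 3):
    """Derive SAM prompts from a coarse boolean mask: a bounding box in
    (x0, y0, x1, y1) form and positive point prompts sampled uniformly
    from the mask's foreground pixels (in row-major order)."""
    ys, xs = np.nonzero(mask)
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])
    idx = np.linspace(0, len(xs) - 1, num_points).astype(int)
    points = np.stack([xs[idx], ys[idx]], axis=1)  # (num_points, 2) as (x, y)
    return box, points

mask = np.zeros((10, 10), dtype=bool)
mask[2:5, 3:8] = True
box, points = mask_to_sam_prompts(mask)
print(box.tolist())     # [3, 2, 7, 4]
print(points.tolist())  # [[3, 2], [5, 3], [7, 4]]
```

The resulting box and points could then be fed to a SAM predictor, e.g. `SamPredictor.predict(point_coords=points, point_labels=np.ones(len(points)), box=box)` in the official `segment_anything` package, and the returned mask merged with the decoder's prediction.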
4.1.2. Evaluation Indicators
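The comparison tables below report mIoU and oIoU. Assuming the standard definitions (per-sample IoU averaged over the dataset for mIoU; cumulative intersection over cumulative union for oIoU, which weights large objects more heavily), they can be computed as:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 0.0

def miou(preds, gts) -> float:
    """Mean IoU: average the per-sample IoU over the dataset."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))

def oiou(preds, gts) -> float:
    """Overall IoU: total intersection over total union across the dataset."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return float(inter / union)

p1 = np.ones((2, 2), dtype=bool); g1 = p1.copy()            # perfect match
p2 = np.array([[1, 0], [0, 0]], bool)                       # half of g2 covered
g2 = np.array([[1, 1], [0, 0]], bool)
print(miou([p1, p2], [g1, g2]))  # 0.75
print(oiou([p1, p2], [g1, g2]))  # 0.8333...
```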
4.2. Dataset
4.3. Main Results
4.4. Ablation Studies
4.4.1. Model Structure Ablation Experiment
4.4.2. SAM Input Ablation Experiment
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Chen, S.; Zhao, Y.; Jin, Q.; Wu, Q. Fine-Grained Video-Text Retrieval with Hierarchical Graph Reasoning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
- Lu, J.; Xiong, C.; Parikh, D.; Socher, R. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Nie, W.; Yu, Y.; Zhang, C.; Song, D.; Zhao, L.; Bai, Y. Temporal-Spatial Correlation Attention Network for Clinical Data Analysis in Intensive Care Unit. IEEE Trans. Bio-Med. Eng. 2024, 71, 583–595. [Google Scholar] [CrossRef] [PubMed]
- Hu, R.; Rohrbach, M.; Darrell, T. Segmentation from Natural Language Expressions. In Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part I 14; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 108–124. [Google Scholar]
- Li, R.; Li, K.; Kuo, Y.C.; Shu, M.; Qi, X.; Shen, X.; Jia, J. Referring Image Segmentation via Recurrent Refinement Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Liu, C.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Yuille, A. Recurrent Multimodal Interaction for Referring Image Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
- Huang, S.; Hui, T.; Liu, S.; Li, G.; Wei, Y.; Han, J.; Liu, L.; Li, B. Referring Image Segmentation via Cross-Modal Progressive Comprehension. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Hui, T.; Liu, S.; Huang, S.; Li, G.; Yu, S.; Zhang, F.; Han, J. Linguistic Structure Guided Context Modeling for Referring Image Segmentation. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020. [Google Scholar]
- Chen, D.J.; Jia, S.; Lo, Y.C.; Chen, H.T.; Liu, T.L. See-Through-Text Grouping for Referring Image Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
- Ding, H.; Liu, C.; Wang, S.; Jiang, X. VLT: Vision-Language Transformer and Query Generation for Referring Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 7900–7916. [Google Scholar] [CrossRef] [PubMed]
- Feng, G.; Hu, Z.; Zhang, L.; Lu, H. Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
- Hu, Z.; Feng, G.; Sun, J.; Zhang, L.; Lu, H. Bi-Directional Relationship Inferring Network for Referring Image Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Margffoy-Tuay, E.; Pérez, J.C.; Botero, E.; Arbeláez, P. Dynamic Multimodal Instance Segmentation Guided by Natural Language Queries. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018. [Google Scholar]
- Shi, H.; Li, H.; Meng, F.; Wu, Q. Key-Word-Aware Network for Referring Expression Image Segmentation. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018. [Google Scholar]
- Ye, L.; Rochan, M.; Liu, Z.; Wang, Y. Cross-Modal Self-Attention Network for Referring Image Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Chen, K.; Pang, J.; Wang, J.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Shi, J.; Ouyang, W.; et al. Hybrid Task Cascade for Instance Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Dai, J.; He, K.; Sun, J. Instance-Aware Semantic Segmentation via Multi-task Network Cascades. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397. [Google Scholar] [CrossRef] [PubMed]
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
- Liu, J.; Ding, H.; Cai, Z.; Zhang, Y.; Satzoda, R.K.; Mahadevan, V.; Manmatha, R. PolyFormer: Referring Image Segmentation as Sequential Polygon Generation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
- Lazarow, J.; Xu, W.; Tu, Z. Instance Segmentation with Mask-supervised Polygonal Boundary Transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Acuna, D.; Ling, H.; Kar, A.; Fidler, S. Efficient Interactive Annotation of Segmentation Datasets with Polygon-RNN++. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 859–868. [Google Scholar]
- Castrejon, L.; Kundu, K.; Urtasun, R.; Fidler, S. Annotating Object Instances with a Polygon-RNN. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4485–4493. [Google Scholar]
- Liang, J.; Homayounfar, N.; Ma, W.C.; Xiong, Y.; Hu, R.; Urtasun, R. PolyTransform: Deep Polygon Transformer for Instance Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Xie, E.; Sun, P.; Song, X.; Wang, W.; Liu, X.; Liang, D.; Shen, C.; Luo, P. PolarMask: Single Shot Instance Segmentation With Polar Representation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Yu, L.; Poirson, P.; Yang, S.; Berg, A.C.; Berg, T.L. Modeling Context in Referring Expressions. In Proceedings of the Computer Vision—ECCV 2016. 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
- Mao, J.; Huang, J.; Toshev, A.; Camburu, O.; Yuille, A.; Murphy, K. Generation and Comprehension of Unambiguous Object Descriptions. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. KAN: Kolmogorov-Arnold Networks. arXiv 2024, arXiv:2404.19756. [Google Scholar]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. arXiv 2023, arXiv:2304.02643. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
- Yang, Z.; Wang, J.; Tang, Y.; Chen, K.; Zhao, H.; Torr, P.H. LAVT: Language-Aware Vision Transformer for Referring Image Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Yang, Z.; Wang, J.; Tang, Y.; Chen, K.; Zhao, H.; Torr, P.H.S. Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation. arXiv 2023, arXiv:2303.06345. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar]
- Chen, T.; Saxena, S.; Li, L.; Fleet, D.J.; Hinton, G. Pix2seq: A Language Modeling Framework for Object Detection. arXiv 2022, arXiv:2109.10852. [Google Scholar]
- Chen, T.; Saxena, S.; Li, L.; Lin, T.Y.; Fleet, D.J.; Hinton, G.E. A unified sequence interface for vision tasks. Adv. Neural Inf. Process. Syst. 2022, 35, 31333–31346. [Google Scholar]
- Wang, P.; Yang, A.; Men, R.; Lin, J.; Bai, S.; Li, Z.; Ma, J.; Zhou, C.; Zhou, J.; Yang, H. OFA: Unifying Architectures, Tasks, and Modalities through a Simple Sequence-to-Sequence Learning Framework. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 23318–23340. [Google Scholar]
- Luo, G.; Zhou, Y.; Sun, X.; Cao, L.; Wu, C.; Deng, C.; Ji, R. Multi-Task Collaborative Network for Joint Referring Expression Comprehension and Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Luo, G.; Zhou, Y.; Ji, R.; Sun, X.; Su, J.; Lin, C.W.; Tian, Q. Cascade Grouped Attention Network for Referring Expression Segmentation. In Proceedings of the MM ’20: Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020. [Google Scholar]
- Jing, Y.; Kong, T.; Wang, W.; Wang, L.; Li, L.; Tan, T. Locate then Segment: A Strong Pipeline for Referring Image Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
- Li, M.; Sigal, L. Referring Transformer: A One-step Approach to Multi-task Visual Grounding. Adv. Neural Inf. Process. Syst. 2021, 34, 19652–19664. [Google Scholar]
- Liu, C.; Ding, H.; Zhang, Y.; Jiang, X. Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation. IEEE Trans. Image Process. 2023, 32, 3054–3065. [Google Scholar] [CrossRef] [PubMed]
- Yang, J.; Zhang, L.; Sun, J.; Lu, H. Spatial Semantic Recurrent Mining for Referring Image Segmentation. arXiv 2024, arXiv:2405.09006. [Google Scholar]
- Yu, L.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Bansal, M.; Berg, T.L. MAttNet: Modular Attention Network for Referring Expression Comprehension. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Lin, L.; Yan, P.; Xu, X.; Yang, S.; Zeng, K.; Li, G. Structured Attention Network for Referring Image Segmentation. IEEE Trans. Multimed. 2022, 24, 1922–1932. [Google Scholar] [CrossRef]
- Kim, N.; Kim, D.; Kwak, S.; Lan, C.; Zeng, W. ReSTR: Convolution-free Referring Image Segmentation Using Transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Liu, C.; Jiang, X.; Ding, H. Instance-Specific Feature Propagation for Referring Segmentation. IEEE Trans. Multimed. 2023, 25, 3657–3667. [Google Scholar] [CrossRef]
- Wang, Z.; Lu, Y.; Li, Q.; Tao, X.; Guo, Y.; Gong, M.; Liu, T. CRIS: CLIP-Driven Referring Image Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Yang, J.; Zhang, L.; Lu, H. Referring Image Segmentation with Fine-Grained Semantic Funneling Infusion. IEEE Trans. Neural Netw. Learn. Syst. 2023, 1–12. [Google Scholar] [CrossRef] [PubMed]
| Method | RefCOCO val | RefCOCO Test A | RefCOCO Test B | RefCOCO+ val | RefCOCO+ Test A | RefCOCO+ Test B | RefCOCOg val | RefCOCOg Test |
|---|---|---|---|---|---|---|---|---|
| DMN18 [13] | 49.78 | 54.83 | 45.13 | 38.88 | 44.22 | 32.29 | - | - |
| MCN20 [38] | 62.44 | 64.20 | 59.71 | 50.62 | 54.99 | 44.69 | 49.22 | 49.40 |
| CGAN20 [39] | 64.86 | 68.04 | 62.07 | 51.03 | 55.51 | 44.06 | 51.01 | 51.69 |
| LTS21 [40] | 65.43 | 67.76 | 63.08 | 54.21 | 58.32 | 48.02 | 54.40 | 54.25 |
| VLT21 [10] | 65.65 | 68.29 | 62.73 | 55.50 | 59.20 | 49.36 | 52.99 | 56.65 |
| RefTrans21 [41] | 74.34 | 76.77 | 70.87 | 66.75 | 70.58 | 59.40 | 66.63 | 67.39 |
| LAVT22 [32] | 74.46 | 76.89 | 70.94 | 65.81 | 70.97 | 59.23 | 63.34 | 63.62 |
| M3Att23 [42] | 73.60 | 76.23 | 70.36 | 65.34 | 70.50 | 56.98 | 64.92 | 67.37 |
| SADLR23 [33] | 76.52 | 77.98 | 73.49 | 68.94 | 72.71 | 61.10 | 67.47 | 67.73 |
| PolyFormer23 [20] | 75.96 | 77.09 | 73.22 | 70.65 | 74.51 | 64.64 | 69.36 | 69.88 |
| [43] | 76.88 | 78.43 | 74.01 | 70.01 | 73.81 | 63.21 | 69.02 | 68.49 |
| Ours | 77.14 | 78.33 | 74.33 | 71.75 | 75.70 | 65.69 | 70.72 | 71.43 |
| Method | RefCOCO val | RefCOCO Test A | RefCOCO Test B | RefCOCO+ val | RefCOCO+ Test A | RefCOCO+ Test B | RefCOCOg val | RefCOCOg Test |
|---|---|---|---|---|---|---|---|---|
| RMI+DCRF17 [6] | 45.18 | 45.69 | 45.57 | 29.86 | 30.48 | 29.50 | - | - |
| RRN18 [5] | 55.33 | 57.26 | 53.93 | 39.75 | 42.15 | 36.11 | - | - |
| MAttNet18 [44] | 56.51 | 62.37 | 51.70 | 46.67 | 52.39 | 40.08 | 47.64 | 48.61 |
| STEP19 [9] | 60.04 | 63.46 | 57.97 | 48.19 | 52.33 | 40.41 | - | - |
| CMSA+DCRF19 [15] | 58.32 | 60.61 | 55.09 | 43.76 | 47.60 | 37.89 | - | - |
| CMPC+DCRF20 [7] | 61.36 | 64.53 | 59.64 | 49.56 | 53.44 | 43.23 | - | - |
| LSCM+DCRF20 [8] | 61.47 | 64.99 | 59.55 | 49.34 | 53.12 | 43.50 | - | - |
| SANet21 [45] | 61.84 | 64.95 | 57.43 | 50.38 | 55.36 | 42.74 | - | - |
| BRINet+DCRF20 [12] | 61.35 | 63.37 | 59.57 | 48.57 | 52.87 | 42.13 | 48.04 | - |
| CEFNet21 [11] | 62.76 | 65.69 | 59.67 | 51.50 | 55.24 | 43.01 | - | - |
| ReSTR22 [46] | 67.22 | 69.30 | 64.45 | 55.78 | 60.44 | 48.27 | - | - |
| ISPNet22 [47] | 65.19 | 68.45 | 62.73 | 52.70 | 56.77 | 46.39 | 53.00 | 50.08 |
| CRIS22 [48] | 70.47 | 73.18 | 66.10 | 62.27 | 68.08 | 53.68 | 59.87 | 60.36 |
| LAVT22 [32] | 72.73 | 75.82 | 68.79 | 62.14 | 68.38 | 55.10 | 61.24 | 62.09 |
| FSFINet23 [49] | 71.23 | 74.34 | 68.31 | 60.84 | 66.49 | 53.24 | 61.51 | 61.78 |
| SADLR23 [33] | 74.24 | 76.25 | 70.06 | 64.28 | 69.09 | 55.19 | 63.60 | 63.56 |
| PolyFormer23 [20] | 74.82 | 76.64 | 71.06 | 67.64 | 72.89 | 59.33 | 67.76 | 69.05 |
| [43] | 74.35 | 76.57 | 70.44 | 65.39 | 70.63 | 57.33 | 65.37 | 65.30 |
| Ours | 75.37 | 77.20 | 71.38 | 68.07 | 73.46 | 59.47 | 67.75 | 69.50 |
| Method | RefCOCO mIoU | RefCOCO oIoU | RefCOCO+ mIoU | RefCOCO+ oIoU | RefCOCOg mIoU | RefCOCOg oIoU |
|---|---|---|---|---|---|---|
| PolyFormer23 [20] | 75.96 | 74.82 | 70.65 | 67.64 | 69.36 | 67.76 |
| + KAN decoder | 76.04 | 74.98 | 70.69 | 67.83 | 69.29 | 67.36 |
| + Two-decoder | 76.08 | 74.97 | 70.78 | 67.79 | 69.44 | 67.40 |
| + SAM complement | 77.14 | 75.37 | 71.75 | 68.07 | 70.72 | 67.75 |
| SAM Operation: BB | PP | DN | CP | val mIoU | val oIoU | Test A mIoU | Test A oIoU | Test B mIoU | Test B oIoU |
|---|---|---|---|---|---|---|---|---|---|
| ✓ | ✗ | ✗ | ✗ | 67.46 | 62.67 | 71.50 | 68.91 | 61.06 | 54.85 |
| ✓ | ✓ | ✗ | ✗ | 69.07 | 64.42 | 72.75 | 69.67 | 62.96 | 56.44 |
| ✓ | ✓ | ✓ | ✗ | 70.81 | 67.34 | 74.48 | 72.51 | 64.52 | 58.84 |
| ✓ | ✓ | ✓ | ✓ | 71.75 | 68.07 | 75.70 | 73.46 | 65.69 | 59.47 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Chen, H.; Zhou, S.; Li, K.; Yin, J.; Huang, J. A Hybrid Framework for Referring Image Segmentation: Dual-Decoder Model with SAM Complementation. Mathematics 2024, 12, 3061. https://doi.org/10.3390/math12193061