Decoupled Cross-Modal Transformer for Referring Video Object Segmentation
Abstract
1. Introduction
- An end-to-end unified network, termed DCT, is proposed for referring video object segmentation; it fully exploits multi-modal information and aggregates multi-scale visual features.
- The Language-Guided Visual Enhancement Module (LGVE) and the decoupled transformer decoder are constructed to coordinate information interaction among object queries, visual features, and linguistic features.
- A Cross-layer Feature Pyramid Network (CFPN) is introduced to reduce information loss during progressive feature fusion.
- Experiments on four benchmarks demonstrate that the proposed method achieves competitive segmentation accuracy compared with state-of-the-art methods.
2. Related Works
2.1. Referring Video Object Segmentation
2.2. Transformer
3. Method
3.1. Overall Pipeline
3.2. Language-Guided Visual Enhancement
3.3. Decoupled Transformer Decoder
3.4. Cross-Layer Feature Pyramid Network
4. Experiments
4.1. Datasets and Evaluation Metric
4.1.1. Datasets
4.1.2. Evaluation Metric
4.2. Implementation Details
4.3. Ablation Study
4.3.1. Ablation Study on the Main Components
4.3.2. Analysis of the Temporal Context Size
4.3.3. Analysis of the Query Number
4.4. Comparison with Existing Methods
4.5. Visualization and Analysis
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Caelles, S.; Maninis, K.-K.; Pont-Tuset, J.; Leal-Taixé, L.; Cremers, D.; Van Gool, L. One-shot video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 221–230.
- Maninis, K.-K.; Caelles, S.; Pont-Tuset, J.; Van Gool, L. Deep extreme cut: From extreme points to object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 616–625.
- Yang, Z.; Wei, Y.; Yang, Y. Collaborative video object segmentation by foreground-background integration. In Proceedings of the European Conference on Computer Vision, Online, 23–28 August 2020; pp. 332–348.
- Hu, R.; Rohrbach, M.; Darrell, T. Segmentation from natural language expressions. In Computer Vision–ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Part I 14; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 108–124.
- Bellver, M.; Ventura, C.; Silberer, C.; Kazakos, I.; Torres, J.; Giro-i-Nieto, X. Refvos: A closer look at referring expressions for video object segmentation. arXiv 2020, arXiv:2010.00263.
- Liu, C.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Yuille, A. Recurrent multimodal interaction for referring image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1271–1280.
- Gavrilyuk, K.; Ghodrati, A.; Li, Z.; Snoek, C.G. Actor and action video segmentation from a sentence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5958–5966.
- Wang, H.; Deng, C.; Ma, F.; Yang, Y. Context modulated dynamic networks for actor and action video segmentation with language queries. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12152–12159.
- Ye, L.; Rochan, M.; Liu, Z.; Wang, Y. Cross-modal self-attention network for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10502–10511.
- Hu, Z.; Feng, G.; Sun, J.; Zhang, L.; Lu, H. Bi-directional relationship inferring network for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4424–4433.
- Yang, Z.; Wang, J.; Tang, Y.; Chen, K.; Zhao, H.; Torr, P.H. Lavt: Language-aware vision transformer for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18155–18165.
- Seo, S.; Lee, J.-Y.; Han, B. Urvos: Unified referring video object segmentation network with a large-scale benchmark. In Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020, Part XV 16; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 208–223.
- Botach, A.; Zheltonozhskii, E.; Baskin, C. End-to-end referring video object segmentation with multimodal transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4985–4995.
- Wu, J.; Jiang, Y.; Sun, P.; Yuan, Z.; Luo, P. Language as queries for referring video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4974–4984.
- Li, X.; Wang, J.; Xu, X.; Li, X.; Lu, Y.; Raj, B. R^2VOS: Robust Referring Video Object Segmentation via Relational Multimodal Cycle Consistency. arXiv 2022, arXiv:2207.01203.
- Yuan, L.; Shi, M.; Yue, Z. LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation. arXiv 2023, arXiv:2306.08736.
- Feng, G.; Zhang, L.; Hu, Z.; Lu, H. Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation. arXiv 2022, arXiv:2203.15969.
- Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Online, 23–28 August 2020; pp. 213–229.
- Li, Z.; Lang, C.; Liew, J.H.; Li, Y.; Hou, Q.; Feng, J. Cross-layer feature pyramid network for salient object detection. IEEE Trans. Image Process. 2021, 30, 4587–4598.
- Hui, T.; Liu, S.; Huang, S.; Li, G.; Yu, S.; Zhang, F.; Han, J. Linguistic structure guided context modeling for referring image segmentation. In Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020, Part X 16; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 59–75.
- Liang, C.; Wu, Y.; Zhou, T.; Wang, W.; Yang, Z.; Wei, Y.; Yang, Y. Rethinking cross-modal interaction from a top-down perspective for referring video object segmentation. arXiv 2021, arXiv:2106.01061.
- Yang, J.; Huang, Y.; Niu, K.; Huang, L.; Ma, Z.; Wang, L. Actor and action modular network for text-based video segmentation. IEEE Trans. Image Process. 2022, 31, 4474–4489.
- Yu, L.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Bansal, M.; Berg, T.L. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1307–1315.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762.
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763.
- Ding, H.; Liu, C.; Wang, S.; Jiang, X. Vision-language transformer and query generation for referring segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 16321–16330.
- Kamath, A.; Singh, M.; LeCun, Y.; Synnaeve, G.; Misra, I.; Carion, N. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1780–1790.
- Wang, Y.; Xu, Z.; Wang, X.; Shen, C.; Cheng, B.; Shen, H.; Xia, H. End-to-end video instance segmentation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8741–8750.
- Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3202–3211.
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692.
- Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97.
- Milletari, F.; Navab, N.; Ahmadi, S.-A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571.
- Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Xu, C.; Xiong, C.; Corso, J.J. Action understanding with multiple classes of actors. arXiv 2017, arXiv:1704.08723.
- Jhuang, H.; Gall, J.; Zuffi, S.; Schmid, C.; Black, M.J. Towards understanding action recognition. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3192–3199.
- Xu, N.; Yang, L.; Fan, Y.; Yang, J.; Yue, D.; Liang, Y.; Price, B.; Cohen, S.; Huang, T. Youtube-vos: Sequence-to-sequence video object segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 585–601.
- Ye, L.; Rochan, M.; Liu, Z.; Zhang, X.; Wang, Y. Referring segmentation in images and videos with cross-modal self-attention network. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3719–3732.
- Wang, H.; Deng, C.; Yan, J.; Tao, D. Asymmetric cross-guided attention network for actor and action video segmentation from natural language query. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 3939–3948.
- Liu, S.; Hui, T.; Huang, S.; Wei, Y.; Li, B.; Li, G. Cross-modal progressive comprehension for referring segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4761–4775.
- Liang, C.; Wu, Y.; Luo, Y.; Yang, Y. Clawcranenet: Leveraging object-level relation for text-based video segmentation. arXiv 2021, arXiv:2103.10702.
- Ding, Z.; Hui, T.; Huang, S.; Liu, S.; Luo, X.; Huang, J.; Wei, X. Progressive multimodal interaction network for referring video object segmentation. 3rd Large-Scale Video Object Segm. Chall. 2021, 8.

No. | LGVE | Decoupled Transformer Decoder | CFPN | Overall IoU | Mean IoU | mAP |
---|---|---|---|---|---|---|
1 | - | - | - | 69.7 | 60.9 | 43.3 |
2 | √ | - | - | 71.8 | 63.5 | 46.3 |
3 | - | √ | - | 71.0 | 62.7 | 45.1 |
4 | - | - | √ | 72.3 | 63.3 | 46.8 |
5 | √ | - | √ | 73.0 | 64.2 | 47.2 |
6 | √ | √ | √ | 73.5 | 65.0 | 48.7 |

Window Size | Overall IoU | Mean IoU | mAP |
---|---|---|---|
1 | 69.7 | 61.0 | 42.6 |
4 | 70.8 | 62.0 | 44.5 |
6 | 71.9 | 63.3 | 46.1 |
8 | 73.5 | 65.0 | 48.7 |
10 | 72.1 | 63.8 | 46.5 |

Query Number | Overall IoU | Mean IoU | mAP |
---|---|---|---|
5 | 71.8 | 63.0 | 45.8 |
30 | 73.1 | 64.4 | 47.5 |
50 | 73.5 | 65.0 | 48.7 |
75 | 72.3 | 63.6 | 46.4 |

Method | Backbone | P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | Overall IoU | Mean IoU | mAP |
---|---|---|---|---|---|---|---|---|---|
Hu et al. [4] | VGG-16 | 34.8 | 23.6 | 13.3 | 3.3 | 0.1 | 47.4 | 35.0 | 13.2 |
Gavrilyuk et al. [7] | I3D | 47.5 | 34.7 | 21.1 | 8.0 | 0.2 | 53.6 | 42.1 | 19.8 |
CMSA + CFSA [42] | ResNet-101 | 48.7 | 43.1 | 35.8 | 23.1 | 5.2 | 61.8 | 43.2 | - |
ACAN [43] | I3D | 55.7 | 45.9 | 31.9 | 16.0 | 2.0 | 60.1 | 49.0 | 27.4 |
CMPC-V [44] | I3D | 65.5 | 59.2 | 50.6 | 34.2 | 9.8 | 65.3 | 57.3 | 40.4 |
ClawCraneNet [45] | ResNet-101 | 70.4 | 67.7 | 61.7 | 48.9 | 17.1 | 63.1 | 59.9 | - |
MTTR [13] | Video-Swin-T | 75.4 | 71.2 | 63.8 | 48.5 | 16.9 | 72.0 | 64.0 | 46.1 |
ReferFormer [14] | Video-Swin-T | 76.0 | 72.2 | 65.4 | 49.8 | 17.9 | 72.3 | 64.1 | 48.6 |
DCT (ours) | Video-Swin-T | 76.3 | 72.8 | 66.0 | 50.2 | 18.3 | 73.5 | 65.0 | 48.7 |

Method | Backbone | P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | Overall IoU | Mean IoU | mAP |
---|---|---|---|---|---|---|---|---|---|
Hu et al. [4] | VGG-16 | 63.3 | 35.0 | 8.5 | 0.2 | 0.0 | 54.6 | 52.8 | 17.8 |
Gavrilyuk et al. [7] | I3D | 69.9 | 46.0 | 17.3 | 1.4 | 0.0 | 54.1 | 54.2 | 23.3 |
CMSA + CFSA [42] | ResNet-101 | 76.4 | 62.5 | 38.9 | 9.0 | 0.1 | 62.8 | 58.1 | - |
ACAN [43] | I3D | 75.6 | 56.4 | 28.7 | 3.4 | 0.0 | 57.6 | 58.4 | 28.9 |
CMPC-V [44] | I3D | 81.3 | 65.7 | 37.1 | 7.0 | 0.0 | 61.6 | 61.7 | 34.2 |
ClawCraneNet [45] | ResNet-101 | 88.0 | 79.6 | 56.6 | 14.7 | 0.2 | 64.4 | 65.6 | - |
MTTR [13] | Video-Swin-T | 93.9 | 85.2 | 61.6 | 16.6 | 0.1 | 70.1 | 69.8 | 39.2 |
ReferFormer [14] | Video-Swin-T | 93.3 | 84.2 | 61.4 | 16.4 | 0.3 | 70.0 | 69.3 | 39.1 |
DCT (ours) | Video-Swin-T | 94.7 | 85.5 | 62.0 | 16.9 | 0.1 | 70.7 | 70.5 | 39.6 |
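
The tables above report precision at several IoU thresholds, Overall IoU, and Mean IoU. As a reading aid, the sketch below shows how these quantities are conventionally computed for referring segmentation: Overall IoU divides the summed intersection by the summed union over all samples, Mean IoU averages the per-sample IoU, and precision at threshold K is the fraction of samples whose IoU exceeds K. The function names, the list-of-rows mask representation, the strict `>` comparison, and the toy masks are assumptions made for illustration, not the authors' evaluation code; mAP is omitted because it depends on the benchmark protocol.

```python
from typing import Dict, Sequence, Tuple

# A binary mask as rows of 0/1 values (hypothetical representation for this sketch).
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count foreground pixels in the intersection and union of two masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    return inter, union


def evaluate(
    preds: Sequence[Mask],
    gts: Sequence[Mask],
    thresholds: Sequence[float] = (0.5, 0.6, 0.7, 0.8, 0.9),
) -> Dict[str, object]:
    """Return Overall IoU, Mean IoU, and precision at each IoU threshold."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and of equal length")

    total_inter = total_union = 0
    ious = []
    for pred, gt in zip(preds, gts):
        inter, union = intersection_and_union(pred, gt)
        total_inter += inter
        total_union += union
        # Assumption: a sample whose union is empty scores an IoU of 0.
        ious.append(inter / union if union else 0.0)

    n = len(ious)
    return {
        # Overall IoU: accumulate pixels over the whole set, then divide once.
        "overall_iou": total_inter / total_union if total_union else 0.0,
        # Mean IoU: average the per-sample ratios.
        "mean_iou": sum(ious) / n,
        # P@K: share of samples with IoU strictly above K (assumed convention).
        "precision": {k: sum(iou > k for iou in ious) / n for k in thresholds},
    }


# Two toy 2x2 samples: the first overlaps partially, the second matches exactly.
preds = [[[1, 1], [1, 0]], [[1, 1], [1, 1]]]
gts = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
result = evaluate(preds, gts)
print(result)
```

Overall IoU weights large objects more heavily because it pools pixels before dividing, while Mean IoU treats every sample equally, which is why the two columns can rank methods differently.
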
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Wu, A.; Wang, R.; Tan, Q.; Song, Z. Decoupled Cross-Modal Transformer for Referring Video Object Segmentation. Sensors 2024, 24, 5375. https://doi.org/10.3390/s24165375