Article
Peer-Review Record

Decoupled Cross-Modal Transformer for Referring Video Object Segmentation

Sensors 2024, 24(16), 5375; https://doi.org/10.3390/s24165375
by Ao Wu 1, Rong Wang 1,2,*, Quange Tan 1 and Zhenfeng Song 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 31 May 2024 / Revised: 13 August 2024 / Accepted: 19 August 2024 / Published: 20 August 2024
(This article belongs to the Special Issue AI-Driven Sensing for Image Processing and Recognition)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper is well structured. However, some major revisions are needed:
The Matching-based methods section should cite relevant literature.
It would be helpful to include a discussion on the method's limitations and possible future improvements.

Author Response

Comments 1: The Matching-based methods section should cite relevant literature.

Response 1: We have added citations in Section 2.1 of the revised manuscript. (Line 107)

 

Comments 2: It would be helpful to include a discussion on the method's limitations and possible future improvements.

Response 2: Thank you for your professional advice. We have added a Discussion section covering the method's limitations and possible future improvements. (Lines 383-394)

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This paper proposes a method referred to as a Decoupled Cross-Modal Transformer (DCT) for the task of referring video object segmentation.

 

The proposed DCT is composed of three modules: (a) a Language-Guided Visual Enhancement Module, (b) a Decoupled Transformer Decoder, and (c) a Cross-Layer Feature Pyramid Network.

 

- The third paragraph of Section 2.1 should contain references to existing works.

- The MTTR and ReferFormer methods should be cited at lines 186 and 189.

- Figures should be center aligned.

- The proposed work seems a little bit old-fashioned. Utilizing LLM with multi-modal interaction (e.g., Q-former) might benefit the performance and novelty of proposed method.

Author Response

Comments 1: Third paragraph of section 2.1 should contain references to existing works.

Response 1: We have added citations in Section 2.1 of the revised manuscript. (Line 107)

 

Comments 2: MTTR, ReferFormer methods should be cited in L 186, 189

Response 2: The citations have been added in the revised manuscript. (Lines 186 and 189)

 

Comments 3: Figures should be center aligned.

Response 3: Thank you for pointing this out. We have center-aligned the figures in accordance with the journal format.

 

Comments 4: The proposed work seems a little bit old-fashioned. Utilizing an LLM with multi-modal interaction (e.g., Q-Former) might benefit the performance and novelty of the proposed method.

Response 4: Thank you very much for your professional advice. We will conduct relevant research and exploration of LLMs in the future.

Author Response File: Author Response.pdf
