Peer-Review Record

Language-Led Visual Grounding and Future Possibilities

Electronics 2023, 12(14), 3142; https://doi.org/10.3390/electronics12143142
by Zezhou Sui 1,†, Mian Zhou 1,*,†, Zhikun Feng 1, Angelos Stefanidis 2 and Nan Jiang 3
Reviewer 2: Anonymous
Submission received: 9 June 2023 / Revised: 5 July 2023 / Accepted: 17 July 2023 / Published: 20 July 2023
(This article belongs to the Special Issue Future Trends and Challenges in Human-Computer Interaction)

Round 1

Reviewer 1 Report

This paper proposes an improved visual grounding (VG) algorithm and an integration of visual grounding with HCI. The authors build on existing work in VG by sharing context between the visual and language paths of traditional VG networks, so that each path informs the other. This results in better identification of objects in a scene as queried by the user. Experimental results confirm comparable or better performance relative to existing state-of-the-art approaches.
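To make the idea of context sharing concrete, below is a minimal sketch of one way such cross-modal exchange could look, using cross-attention in PyTorch. This is illustrative only, not the authors' actual architecture; all module names and dimensions are assumed.

```python
# Illustrative cross-modal context sharing between the visual and
# language feature paths via cross-attention (NOT the authors' model).
import torch
import torch.nn as nn

class CrossModalContext(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.lang_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis_to_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vis: torch.Tensor, lang: torch.Tensor):
        # vis:  (batch, num_patches, dim) visual tokens
        # lang: (batch, num_words,   dim) language tokens
        vis_ctx, _ = self.vis_to_lang(query=vis, key=lang, value=lang)
        lang_ctx, _ = self.lang_to_vis(query=lang, key=vis, value=vis)
        # Each path is enriched with context from the other modality.
        return vis + vis_ctx, lang + lang_ctx
```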

In terms of integration with HCI, the authors suggest two potential areas of benefit. The first is easing the process of image annotation, that is, identifying ground-truth objects based on a user query of a given scene. This could be done by having a trained VG model recommend objects based on a user query, then having the user confirm the recommendation or reject it and identify the correct object. The second suggestion proposes interactive object tracking, where multiple objects that meet a general user query are identified, then the user indicates the specific object to track and/or when to switch objects over time.
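The annotation-assist idea in particular lends itself to a simple human-in-the-loop sketch. Everything below (`vg_model` and the user-interaction callbacks) is a hypothetical placeholder, not an API from the paper.

```python
# Hypothetical VG-assisted annotation flow: the model recommends a box,
# and the annotator either confirms it or draws the correct one.

def annotate_image(image, query, vg_model, confirm_fn, draw_fn):
    """Return a ground-truth box for `query`, assisted by the VG model."""
    box = vg_model.predict(image, query)  # model recommends a region
    if confirm_fn(image, box, query):     # annotator accepts the recommendation
        return box
    # Otherwise the annotator rejects it and draws the correct box.
    return draw_fn(image, query)
```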

In terms of the VG model proposed, I have no specific issues with the approach the authors suggest. Experimental results suggest it improves on existing methods in some cases and struggles in others, possibly (as the authors note) due to language in a query not appearing in the data used to train the model. Although I don't think the authors' approach would replace other existing VG algorithms, the insights into how to improve upon these existing algorithms are valuable, and it seems to me that the authors' approach could be a better choice in the right circumstances.

My main comment on the paper is that it seems mainly focused on a new VG model, and much less on integration of that model with HCI or UI/UX. The authors do propose two interesting potential applications of VG in HCI, but these read much more like future work items since, as far as I can tell, the authors have not pursued the ideas at this point. Certainly, no discussion is provided of any completed work in either area.

Given this, I think the decision on whether to accept the paper or not depends on the importance of including a significant HCI component. If this is critical, the paper is not ready for publication. If it isn't, the paper's VG material is novel and could stand as a contribution on its own.

My recommendation is:

1. The authors refocus the abstract, title, and introduction to better highlight that this is primarily a new VG method and not an integration of VG with HCI.

2. The authors include the ideas for integration into HCI as future work or potential application of the VG section. This is similar to what they have now, and probably only needs some slight rewording and maybe a section title change.

Assuming there is general agreement among other reviewers and the associate editor, I would support some minor revisions to the paper to get it into shape for publication.

The English in the paper is good, but it could use a final review to catch a number of grammar errors. For example, in the second paragraph of Section 5 (VG Applications in HCI), "Firstly, a dataset organiser/manager generate describes the objects..." should be "Firstly, a dataset organiser/manager describes the objects...". Or at the bottom of p. 5, "These verification score measures the correlation..." should be (I think) "These verification scores measure the correlation...". A number of these minor issues are sprinkled throughout the paper.

Author Response

Point 1: The authors refocus the abstract, title, and introduction to better highlight that this is primarily a new VG method and not an integration of VG with HCI.

Response 1: We greatly appreciate your careful review of our paper and thank you for identifying these issues. We have now revised the abstract, title, and conclusion sections, emphasizing that this is a new VG method rather than an integration with HCI.

 

Point 2: The authors include the ideas for integration into HCI as future work or potential application of the VG section. This is similar to what they have now, and probably only needs some slight rewording and maybe a section title change.

Response 2: Thank you for your correction. The idea of integrating with HCI has now been recast as future work for VG and as a potential application for enhancing HCI capabilities.

Author Response File: Author Response.pdf

Reviewer 2 Report

General:

The uniqueness and relevance of the suggested system are not explicitly established in the study. Although it merges the existing research fields of visual grounding and human-computer interaction, it does not adequately show how this combination adds fresh insights or breakthroughs to the field. Furthermore, the paper lacks a comprehensive assessment of the proposed system's potential impact and practical consequences. I suggest revising the introduction.

The state-of-the-art review appears haphazard and disorganized. Related works span object tracking, visual grounding, and human-computer interaction. A summary and a comparative table would help readers understand the existing methodologies, highlight gaps in current research, and underline how the proposed system covers those gaps.

A more detailed explanation of the language module used, the particular VG model, and the tracking model, as well as a dedicated hyperparameter table outlining the chosen settings and their effects, would be beneficial.

Major:

According to the paper, if the first localization is inaccurate, the algorithm can re-localize or wait for a more exact target description. This principle, however, is not well explained.
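For concreteness, one plausible reading of this re-localization behaviour is sketched below; the loop, the confidence threshold, and every function name are invented for illustration and are not taken from the paper.

```python
# Hypothetical sketch of "re-localize or wait for a more exact target
# description". All names and the confidence threshold are invented.

def localize(image, query, vg_model, refine_fn, threshold=0.5):
    """Localize the target, requesting a refined description if needed."""
    box, score = vg_model.predict_with_score(image, query)
    while score < threshold:
        # Confidence too low: wait for a more exact target description,
        # then re-localize with the refined query.
        query = refine_fn(query)
        box, score = vg_model.predict_with_score(image, query)
    return box
```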

1. A robustness analysis of the proposed system is required.
2. Include an ablation study to explain the contribution of each module in the method.
3. Report a computational performance analysis in terms of FLOPs, and add and compare model size and other factors (see the sketch after this list).
4. To demonstrate training performance, include the ROC, AUC, and loss curves.
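To illustrate the kind of complexity analysis requested above, here is a minimal sketch using PyTorch and the fvcore library; the model and input shape are placeholders, not the authors' VG network.

```python
# Sketch of measuring parameter count and FLOPs for a PyTorch model.
# torchvision's ResNet-50 stands in for the actual VG network.
import torch
import torchvision
from fvcore.nn import FlopCountAnalysis, parameter_count

model = torchvision.models.resnet50().eval()  # placeholder model
dummy_input = torch.randn(1, 3, 224, 224)     # one RGB image

flops = FlopCountAnalysis(model, dummy_input)
print(f"FLOPs:  {flops.total() / 1e9:.2f} G")
print(f"Params: {parameter_count(model)[''] / 1e6:.2f} M")
```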

Minor editing of English language required

Author Response

Point 1: According to the paper, if the first localization is inaccurate, the algorithm can re-localize or wait for a more exact target description. This principle, however, is not well explained.

Response 1: We greatly appreciate your careful review of our paper and thank you for identifying these issues. We have provided a detailed description of the repositioning operation and added examples to clarify the concept.

 

Point 2: A robustness analysis of the proposed system is required. Include an ablation study to explain the contribution of each module in the method. Report a computational performance analysis in terms of FLOPs, and add and compare model size and other factors. To demonstrate training performance, include the ROC, AUC, and loss curves.

Response 2: Thank you for pointing out the shortcomings of our experimental design. We have now added an ablation study section, which includes metrics such as the model's parameter count and FLOPs. We apologize for any confusion regarding the ROC and AUC curves: these are typically used for classification tasks and require class labels. As ours is not a classification task, we have only included the loss curve to demonstrate our model's training performance.

Author Response File: Author Response.pdf
