Article
Peer-Review Record

MSGeN: Multimodal Selective Generation Network for Grounded Explanations

Electronics 2024, 13(1), 152; https://doi.org/10.3390/electronics13010152
by Dingbang Li 1, Wenzhou Chen 2 and Xin Lin 1,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Reviewer 4: Anonymous
Submission received: 8 November 2023 / Revised: 13 December 2023 / Accepted: 24 December 2023 / Published: 29 December 2023

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper presents the Multimodal Selective Generation Network (MSGeN) for enhancing the interpretability of visual reasoning. The MSGeN model consists of five components: a multimodal encoder, a reasoner, a selector, a speaker, and a pointer. These modules interact sequentially to provide plausible explanations for the responses in visual question-answering scenarios.

The problem addressed by this paper falls within the scope of explainable AI, which is a very important topic. The proposed work is therefore well motivated, and it is also well presented by the authors. My only concerns can be summarized in the following points:

1) Why does the model not make use of object detection in the process? It seems to me that the final accuracy will depend heavily on hyper-parameter values such as the chosen patch size P and the channel size d.

2) The authors did not give many details on how the different modules are trained together, or on how prediction errors in one module can affect the performance of the others. It seems that the overall architecture is very complex and could be prone to errors and instability.

3) There are very similar research works published recently that the authors did not mention in the paper:

REX: Reasoning-aware and Grounded Explanation, CVPR 2022

Multimodal Explanations: Justifying Decisions and Pointing to the Evidence, CVPR 2018

Towards Reasoning-Aware Explainable VQA, arXiv preprint arXiv:2211.05190

4) Are there examples where the model did not work well?

 

Comments on the Quality of English Language

The overall English is good.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The proposed model is very interesting for the current scientific ecosystem. However, below are some recommendations to improve the quality of the research:

Please describe in more detail the model, which consists of five collaborative components:

1) The Multimodal Encoder, which encodes and fuses the input data;

2) The Reasoner, responsible for generating inference states step by step;

3) The Selector, used to select the modality of each explanation step;

4) The Speaker, which generates descriptions in natural language;

5) The Pointer, which produces visual cues. These components work harmoniously to generate explanations enriched with natural language context and visual cues (a hypothetical sketch of how such a pipeline might operate sequentially follows this list);
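A minimal, hypothetical sketch in Python of how such a five-component pipeline might interact step by step. It is based only on the component roles named in the list above; every class name, method signature, and the stopping criterion are assumptions for illustration, not the authors' actual implementation.

# Illustrative sketch of a sequential selective-generation loop for a model
# like MSGeN, based only on the five components named in the review above.
# All method names and the stop condition are assumptions, not the paper's code.
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Explanation:
    answer: str = ""
    steps: List[Union[str, list]] = field(default_factory=list)  # text spans or visual cues

def explain(image, question, encoder, reasoner, selector, speaker, pointer, max_steps=8):
    """Assumed pipeline: encode and fuse the inputs, then generate one
    explanation step at a time, choosing a modality for each step."""
    fused = encoder.encode(image, question)       # Multimodal Encoder: encode and fuse inputs
    state = reasoner.init_state(fused)            # Reasoner: initial inference state
    result = Explanation()
    for _ in range(max_steps):
        state = reasoner.step(state)              # Reasoner: next inference state
        modality = selector.choose(state)         # Selector: "text", "visual", or "stop" (assumed labels)
        if modality == "text":
            result.steps.append(speaker.generate(state))          # Speaker: natural-language description
        elif modality == "visual":
            result.steps.append(pointer.point(state, image))      # Pointer: visual cue, e.g. an image region
        else:
            break                                 # assumed stop signal ends the explanation
    result.answer = speaker.generate(state)       # assumed: the final answer is verbalized by the Speaker
    return result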

Explain the following sentence in more detail: “We provide case studies that exemplify MSGeN’s ability to produce more comprehensive and coherent explanations”;

Furthermore:

- Please describe the summary of this research in more detail, highlighting its importance to the scientific and academic community;

- Please explain in more detail the figures presented in the article, highlighting their importance for the present work;

- Finally, I suggest inserting one more section, right after the Conclusion, addressing what is expected from this work in the future. This aspect is important, as it demonstrates the relevance of the present study going forward;

Thank you.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The paper is well-written and comprehensively presents the methodology and results of the Multimodal Selective Generation Network (MSGeN). The authors have done a commendable job in explaining the motivation, methodological details, and experimental setup. The results are clearly presented, and the discussion provides valuable insights into the model's performance.

Here are a few suggestions that might help improve the paper:

  1. While the technical details are well-explained, it could be beneficial to provide more context about the broader impact and potential applications of MSGeN. This would help readers outside the field understand its significance.
  2. The paper could benefit from a more detailed discussion of the limitations of the proposed model and potential future improvements.
  3. In the ablation studies, it would be interesting to see a more detailed analysis of the impact of each component of the model on the final results.
  4. Consider adding more visual aids, such as diagrams or flowcharts, to help readers understand the complex processes involved in the model.

Overall, the paper is of high quality and makes a significant contribution to the field of visual reasoning and explainable AI. The proposed MSGeN model is novel and demonstrates impressive results. Great job!

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

The manuscript is well written and easy to read. The following minor errors can be corrected.

Abstract words should not be bolded (or italicized) unless math notation is used.

The keywords should be sorted.

Create an acronym for artificial intelligence.

There are many repeated definitions of acronyms (e.g., VQA).

Instead of writing out Multimodal Selective Generation Network, the acronym MSGeN can be used.

In Section 4.4, include a comparison among the models based on percentages.

Similarly, include the percentage improvement of MSGeN over the other models.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
