MSGeN: Multimodal Selective Generation Network for Grounded Explanations
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper presents the Multimodal Selective Generation Network (MSGeN) for enhancing the interpretability of visual reasoning. The MSGeN model consists of five components: a multimodal encoder, a reasoner, a selector, a speaker, and a pointer. These modules interact sequentially to provide plausible explanations for responses in visual question-answering scenarios.
The problem addressed by this paper falls within the scope of explainable AI, which is a very important topic. The proposed work is therefore well motivated, and it is well presented by the authors. My only concerns can be summarized in the following points:
1) Why does the model not make use of object detection in the process? It seems to me that the final accuracy will depend heavily on hyper-parameter values such as the chosen patch size P and the channel size d.
2) The authors did not give many details on how the different modules are trained together, or on how prediction errors in one module can affect the performance of the others. The overall architecture is very complex and may be prone to errors and instability.
3) There are very similar, recently published research works that the authors did not mention in the paper:
- REX: Reasoning-aware and Grounded Explanation, CVPR 2022
- Multimodal Explanations: Justifying Decisions and Pointing to the Evidence, CVPR 2018
- Towards Reasoning-Aware Explainable VQA, arXiv preprint arXiv:2211.05190
4) Are there examples where the model did not work well?
Comments on the Quality of English Language
The overall English is good.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The proposed model is very interesting for the current scientific ecosystem. However, below are some recommendations to strengthen the research:
Please describe in more detail the model, which consists of five collaborative components:
1) The Multimodal Encoder, which encodes and fuses the input data;
2) The Reasoner, responsible for generating inference states step by step;
3) The Selector, used to select the modality of each explanation step;
4) The Speaker, which generates descriptions in natural language;
5) The Pointer, which produces visual cues. These components work harmoniously to generate explanations enriched with natural language context and visual cues;
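To make the requested description concrete, a minimal sketch of how the five components summarized above might interact per explanation step is given below. This is not the authors' implementation; all function names and the stub logic are hypothetical, and only the component roles and their sequential interaction are taken from the paper's summary.

```python
# Illustrative sketch (not the authors' code) of the five MSGeN components.
# Stub logic stands in for the real neural modules.

def multimodal_encoder(question, image_patches):
    """Encode and fuse the input question with visual patch features."""
    return {"question": question, "patches": image_patches}  # fused context (stub)

def reasoner(context, prev_state):
    """Generate the next inference state, step by step."""
    return (prev_state or 0) + 1  # stub: a step counter stands in for the state

def selector(state):
    """Select the modality of the current explanation step."""
    return "text" if state % 2 else "visual"  # stub policy: alternate modalities

def speaker(state):
    """Generate a natural-language description for a textual step."""
    return f"step {state}: textual rationale"

def pointer(state, patches):
    """Produce a visual cue, e.g. an attended image patch."""
    return patches[state % len(patches)]

def explain(question, image_patches, num_steps=4):
    """Run the components sequentially to build a multimodal explanation."""
    context, state, steps = multimodal_encoder(question, image_patches), None, []
    for _ in range(num_steps):
        state = reasoner(context, state)
        modality = selector(state)
        output = speaker(state) if modality == "text" else pointer(state, image_patches)
        steps.append((modality, output))
    return steps
```

Even a sketch at this level of detail would help readers see how the Selector routes each step to either the Speaker or the Pointer.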
Explain the following sentence in more detail: “We provide case studies that exemplify MSGeN’s ability to produce more comprehensive and coherent explanations”;
Furthermore:
- Please describe the summary of this research in more detail, highlighting its importance to the scientific and academic community;
- Please explain the figures presented in the article in more detail, highlighting their importance for the present work;
- Finally, I suggest adding one more section, right after the Conclusion, addressing what is expected from this work in the future. This aspect is very important, as it demonstrates the relevance of the present study going forward;
Grateful.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The paper is well-written and comprehensively presents the methodology and results of the Multimodal Selective Generation Network (MSGeN). The authors have done a commendable job in explaining the motivation, methodological details, and experimental setup. The results are clearly presented, and the discussion provides valuable insights into the model's performance.
Here are a few suggestions that might help improve the paper:
- While the technical details are well-explained, it could be beneficial to provide more context about the broader impact and potential applications of MSGeN. This would help readers outside the field understand its significance.
- The paper could benefit from a more detailed discussion of the limitations of the proposed model and potential future improvements.
- In the ablation studies, it would be interesting to see a more detailed analysis of the impact of each component of the model on the final results.
- Consider adding more visual aids, such as diagrams or flowcharts, to help readers understand the complex processes involved in the model.
Overall, the paper is of high quality and makes a significant contribution to the field of visual reasoning and explainable AI. The proposed MSGeN model is novel and demonstrates impressive results. Great job!
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
The manuscript is well written and easy to read. The following minor errors should be corrected.
Words in the abstract should not be bolded (or italicized) unless math notation is used.
The keywords should be sorted.
Define an acronym for artificial intelligence.
Acronyms are expanded repeatedly (e.g., VQA); each should be defined only at first use.
Instead of Multimodal Selective Generation Network, MSGeN can be used.
In Section 4.4, include a percentage-based comparison among the models.
Similarly, include the percentage improvement of MSGeN over the other models.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf