Article
Peer-Review Record

An Effective Med-VQA Method Using a Transformer with Weights Fusion of Multiple Fine-Tuned Models

Appl. Sci. 2023, 13(17), 9735; https://doi.org/10.3390/app13179735
by Suheer Al-Hadhrami 1,2, Mohamed El Bachir Menai 1, Saad Al-Ahmadi 1 and Ahmad Alnafessah 3,*,†
Submission received: 23 June 2023 / Revised: 27 July 2023 / Accepted: 9 August 2023 / Published: 28 August 2023

Round 1

Reviewer 1 Report

In the paper "An Effective Med-VQA Method Using Transformer with Weights Fusion of Multiple Fine-Tuned Models," you have made a substantial improvement over prior art in visual question answering (VQA) in the medical sector.

A few suggestions come to mind that I think would strengthen your paper even more:

It would be beneficial to include some background in the introduction on why and how VQA could be used in the medical field. This may encourage your audience to keep reading by highlighting the significance of your work.

It would help to train and test your proposed models on larger datasets. Readers may gain a better appreciation of the difficulties of VQA in the medical sector, and of the solutions your models provide, if you give more context about the data.

Additional explanation of the fusion method is needed in the section devoted to the greedy-soup-based model. Details on the fusion process and how it benefits the model's accuracy would be welcome.
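For readers unfamiliar with the term, the kind of explanation requested here can be illustrated with the greedy soup recipe as it is commonly described: rank the fine-tuned models by validation score, then add each one to a running weight average only if the averaged model does not score worse. The following is a minimal sketch under stated assumptions, not the authors' implementation; the names `Weights`, `average_weights`, `greedy_soup`, and `evaluate`, as well as the toy one-parameter "models", are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def average_weights(models: Sequence[Weights]) -> Weights:
    """Uniformly average parameters across models that share the same shapes."""
    n = len(models)
    return {
        name: [sum(vals) / n for vals in zip(*(m[name] for m in models))]
        for name in models[0]
    }


def greedy_soup(candidates: Sequence[Weights],
                evaluate: Callable[[Weights], float]) -> Weights:
    """Greedy weight fusion: keep a model only if the average does not get worse."""
    # Rank the fine-tuned models by held-out validation score, best first.
    ranked = sorted(candidates, key=evaluate, reverse=True)
    ingredients = [ranked[0]]
    best_score = evaluate(ranked[0])
    for model in ranked[1:]:
        trial = average_weights(ingredients + [model])
        score = evaluate(trial)
        if score >= best_score:
            ingredients.append(model)
            best_score = score
    return average_weights(ingredients)


# Toy example (assumed data): the "validation score" is higher the closer
# the single parameter w is to 1.0.
def toy_score(weights: Weights) -> float:
    return -abs(weights["w"][0] - 1.0)


toy_models = [{"w": [0.8]}, {"w": [1.4]}, {"w": [3.0]}]
soup = greedy_soup(toy_models, toy_score)
```

In the toy example the third model is rejected because adding it to the average lowers the score, which is the property that distinguishes a greedy soup from a plain uniform average of all fine-tuned models.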

Discussing the caveats of your proposed models and suggestions for future study would be very helpful. This may encourage others to improve upon your work by providing context for its limits.

In conclusion, I think the method of VQA presented in your study has great potential in the field of medicine and, with more work, might have real-world applications. I appreciate your significant work in this area.

 

English language is accepted

 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The manuscript “An Effective Med-VQA Method Using Transformer with Weights Fusion of Multiple Fine-Tuned Models” is an interesting read, but the impact of the work is doubtful. This work aims to present two models, the best-value model and the greedy-soup-based model, and compare them to state-of-the-art models applied to visual question answering in the field of medicine.

 

Comments:

1.     The equations are not well explained. For example, in Equation (2), X has subscripts i,j; what do these subscripts represent? What does the subscript CE represent in Equation (5)? Please explain all variables in every equation.

2.     The authors selected the “best” learning rate from the set 1.0e-2, 1.0e-3, 1.0e-4, and 1.0e-5 and found that 1.0e-4 gives the best accuracy of 86.8%. Why did the authors not explore intermediate values between 1.0e-3 and 1.0e-5 to find the best learning rate? Surely the best learning rate is not exactly 1.0e-4.

3.     There are multiple mentions throughout this article that the results are compared to SOTA, but it appears that there is no clear description of what this SOTA model is. What algorithm is the SOTA model using? What are the parameters associated with the SOTA model? For the sake of transparency in reporting results, these questions must be answered.

4.     Table 11 suggests to this reviewer that there is no significant advantage of using these models as compared with the SOTA. Please elaborate on the significance and impact of the models proposed in this work. Also, there is no mention of the aspects of the effectiveness of this model, as claimed in the title of the article.
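To make comment 2 concrete, one way to refine the search is to sample candidate learning rates on a logarithmic scale between 1.0e-5 and 1.0e-3 and evaluate each one. The following is a minimal sketch of generating such candidates, not the authors' procedure; the helper name `log_spaced` and the choice of five points are assumptions made for illustration.

```python
import math
from typing import List


def log_spaced(low: float, high: float, num: int) -> List[float]:
    """Return `num` values evenly spaced on a log10 scale from low to high, inclusive."""
    if num < 2:
        raise ValueError("num must be at least 2")
    lo, hi = math.log10(low), math.log10(high)
    step = (hi - lo) / (num - 1)
    return [10 ** (lo + i * step) for i in range(num)]


# Five candidates between the neighbours of the reported best value (1.0e-4).
candidates = log_spaced(1.0e-5, 1.0e-3, 5)
for lr in candidates:
    print(f"{lr:.2e}")
```

Each candidate would then be used for a separate fine-tuning run, and the validation accuracies compared with the 86.8% reported for 1.0e-4.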

 

I recommend the following edits to improve readability:

5.     Section two gives a nice overview of related work associated with the four components of VQA. However, it is not immediately clear to the readers how data is being transformed by these four components, and how they overlap or interact with each other. A figure or summarizing paragraph should suffice.

6.     Please provide a citation for the AdamW algorithm.

7.     Please improve the readability of Figures 6-11. Increase the font size and improve the resolution of these figures.

8.     “State-of-the-art” is only abbreviated to SOTA on line 458, although it is used multiple times throughout the manuscript. Please abbreviate it after the first use or do not abbreviate it at all.

 

Minor edits:

9.     Line 40: There is/are missing reference(s).

10.  Line 72: Mathematics environment typesetting error.

11.  Line 119: Citation typesetting error.

12.  Figure 1: There is an overlap of arrows in the flow diagram (train base_model flow).

13.  Line 178: The patch embedding “X” is not typeset the same way as in Equation (1).

14.  Equation (3): i,j should be subscripts

15.  Table 1: The table has multiple mistakes. Please standardize whether “,” is added to separate the hundreds from the thousands. The total column has a typesetting error.

16.  Line 400: An accuracy of 884.97% is unheard of. I believe this is a typographical error.

17.  Line 432: Adamw -> AdamW.

18.  Tables 3-6: Typographical error in the table captions; there is a double “==”, which is different.

19.  Table 10: Accuracy of Model 4 not typeset correctly.

 

These are but a few of the typographical errors; the authors should carefully inspect the manuscript again.

Comments for author File: Comments.docx

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
