An Image Captioning Algorithm Based on Combination Attention Mechanism
Round 1
Reviewer 1 Report
The paper describes an interesting approach to image captioning, combining a visual attention module and a keyword attention module.
The procedure is complex and I found it difficult to follow all the steps, mainly because the notation is not always clear.
The description of the meaning of the variables in the equations should definitely be improved; see below for some pointers.
The language should be checked; some suggestions are given below, but they are not comprehensive.
L12 "existing image captioning methods focus only on visual attention mechanism while not keywords attention mechanism" -> "existing image captioning methods focus only on visual attention mechanism and not on keywords attention mechanism".
L50 Figure 1 is not referenced anywhere in the text: its meaning and importance in the paper should be addressed in the text.
L147-149 "The keyword attention module enables the model to focus on important keyword text by associating keywords with visual features to generate richer and more accurate descriptive sentences." is a repetition of the previous sentence.
L182 "Wka, Wha is the weight matrix to be learned": "is the weight matrix" -> "are the weight matrices".
L183 "a∗ =": a∗ should be bold.
"is the unnormalized weight" -> "is the unnormalized weight vector"? (same at L185).
L187 Figure 4 is not mentioned in the text. The figure caption should describe the parts of the diagram, at least at the layer level.
L218 Equations 6 and 7: what does the bar over "V" mean?
L243 Eq. 9 and subsequent equations: the meaning of the variables should be described right after each equation. In some cases they are explained only a few lines later (e.g. Q, K, V of Eq. 9 are described at L248); in other cases there is no description at all (e.g. d_k of Eq. 10).
L253 "connection around each sub-layer around each sub-layer": remove repetition.
L269 What is "T"?
L271 Eq. 13: is "p" the probability?
L324 "When the number of layers continues to increase, the performance degrades.": how do you explain this?
L342 Figure 7: images should be larger.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
- What do you mean by "extensive experiments" in the abstract? How many experiments does the paper actually report?
- Section 1 needs a complete rewrite. It is short and lacks a motivation, a statement of contributions, and even a paragraph outlining the structure of the paper.
- I would like to see a table comparing the features of existing methods in the literature with those of the proposed solution.
- Some symbols used in the equations are not explained. For example, what does the upside-down triangle mean in Eq. 14?
- The results are good, but they would be easier to interpret if plots were presented to help the reader understand the performance.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
All the issues I pointed out have been addressed.
Reviewer 2 Report
The paper has been improved significantly; thanks to the authors for addressing all comments in a professional and proper way. I have no further comments and therefore recommend acceptance of this version.