Peer-Review Record

An Image Captioning Algorithm Based on Combination Attention Mechanism

Electronics 2022, 11(9), 1397; https://doi.org/10.3390/electronics11091397

by Jinlong Liu *, Kangda Cheng, Haiyan Jin and Zhilu Wu

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Submission received: 25 March 2022 / Revised: 22 April 2022 / Accepted: 24 April 2022 / Published: 27 April 2022

(This article belongs to the Section Computer Science & Engineering)

Round 1

Reviewer 1 Report

The paper describes an interesting approach to image captioning that combines a visual attention module and a keyword attention module.
The procedure is complex, and I found it difficult to follow all the steps, mainly because the notation is not always clear.
The description of the variables in the equations should definitely be improved; see below for some pointers.

The language should be checked; some suggestions are given below, but they are not comprehensive.

L12 "existing image captioning methods focus only on visual attention mechanism while not keywords attention mechanism" -> "existing image captioning methods focus only on visual attention mechanism and not on keywords attention mechanism".

L50 Figure 1 is not referenced anywhere in the text: its meaning and importance in the paper should be addressed in the text.

L147-149 "The keyword attention module enables the model to focus on important keyword text by associating keywords with visual features to generate richer and more accurate descriptive sentences." repeats the previous sentence.

L182 "Wka, Wha is the weight matrix to be learned": "is the weight matrix" -> "are the weight matrices".

L183 " a∗ =": a* should be bold.
"is the unnormalized weight" -> "is the unnormalized weight vector"? (same at L185).

L187 Figure 4 is not mentioned in the text. The figure caption should describe the parts of the diagram, at least at the layer level.

L218 Equations 6 and 7: what does the bar over "V" mean?

L243 Eq. 9 and following: the meaning of the variables in each equation should be described right after the equation. In some cases they are explained a few lines later (e.g. Q, K, V of Eq. 9 are described at L248); in other cases there is no description (e.g. d_k of Eq. 10).

L253 "connection around each sub-layer around each sub-layer": remove repetition.

L269 What is "T"?

L271 Eq. 13: is "p" the probability?

L324 "When the number of layers continues to increase, the performance degrades.": how do you explain this?

L342 Figure 7: images should be larger.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

What do you mean by "extensive experiments" in the abstract? How many experiments does the paper contain?
Section 1 needs a complete rewrite. It is short and has no motivation, no statement of contributions, and not even a paragraph on the structure of the paper.
I would like to see a table comparing the features of existing work in the literature with those of the proposed solution.
Some symbols used in the equations are not explained. For example, what does the upside-down triangle mean in Eq. 14?
The results are good, but visual plots would make the performance easier to understand.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

All the issues I pointed out have been addressed.

Reviewer 2 Report

The paper has been improved significantly; thanks to the authors for addressing all comments in a professional and proper way. I have no further comments and therefore recommend acceptance of this version.
