Peer-Review Record

Boosted Transformer for Image Captioning

Appl. Sci. 2019, 9(16), 3260; https://doi.org/10.3390/app9163260
by Jiangyun Li 1,2,‡, Peng Yao 1,2,†,‡, Longteng Guo 3 and Weicun Zhang 1,2,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 17 July 2019 / Revised: 3 August 2019 / Accepted: 5 August 2019 / Published: 9 August 2019
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

The article describes an improved image captioning algorithm based on the Transformer. The latter is boosted through the use of a Concept Guided Attention (CGA) module, embedded in the encoder, which augments the feature-extraction process by fusing semantic concepts with visual features. The decoder is also improved by a Vision Guided Attention (VGA) unit, which helps avoid inductive biases due to unbalanced datasets.
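To make the fusion idea concrete, here is a minimal sketch of a CGA-style step, assuming (as the description suggests, not as the paper's actual implementation) that region-level visual features act as queries over instance-level concept embeddings, with the attended concept context added back to the visual features via a residual connection. All names and shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def concept_guided_attention(visual, concepts):
    """Fuse semantic concept embeddings into visual features.

    visual:   (n_regions, d)  region features, used as queries
    concepts: (n_concepts, d) concept embeddings, used as keys/values
    Returns visual features augmented by concept-attended context.
    """
    d = visual.shape[-1]
    scores = visual @ concepts.T / np.sqrt(d)   # (n_regions, n_concepts)
    attn = softmax(scores, axis=-1)             # rows sum to 1
    context = attn @ concepts                   # (n_regions, d)
    return visual + context                     # residual fusion

rng = np.random.default_rng(0)
V = rng.normal(size=(4, 8))   # 4 image regions, 8-dim features
C = rng.normal(size=(3, 8))   # 3 detected semantic concepts
out = concept_guided_attention(V, C)
print(out.shape)  # (4, 8)
```

The residual form here is only one plausible fusion choice; the paper's Equations 7 and 8 define the actual mechanism, which is exactly what the question about the V(i+1) → V(i) arrow below concerns.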

The description of the state of the art and of the proposed methodology is well detailed. 

The different tests performed to identify the best-performing architecture are well described.

The proposed methodology presents some novelty and improves on the state of the art according to different evaluation metrics, as clearly stated in Section 4.4.

Only a minor review is necessary before publication:

In Section 3.3, the description of CGA should be improved; in particular, what is the meaning of the arrow in Figure 3a directed from V(i+1) to V(i)?

It seems to be a sort of feedback, which is not evident from Equations 7 and 8.

Figure 3b is not cited in the text. Does it relate to the tests described in Section 4.3.1?

The relation of Figure 3a to the overall system architecture (Figure 2) should be clarified further.

In Figure 7, what is the baseline you are referring to?

Minor English typos are present (line 22: "for machineS"). Contractions should be avoided to improve the quality of the writing (line 228: "doesn't" should be replaced by "does not").

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

In the paper, the authors present their Boosted Transformer (BT) model, which generates captions for images better than the state of the art (SOTA). Their BT model uses a semantic boost module in the encoder, i.e., CGA, and a visual boost module in the decoder, i.e., VGA. The former utilizes instance-level semantic concepts to boost the visual features, and the latter achieves double attention under the influence of visual information.

Overall, the paper is not an easy read; however, the mathematical equations are elaborate enough.

Questions to the authors:

1) What is the detective mode? (Ln 134).

2) What do you mean by backbones (Ln 196)?

3) Do you really increase by 2, or multiply by 2? (Ln 207)

4) The score improvement over the baseline is very small. I am not sure whether it can be distinguished from noise.

5) One metric that is missing is the runtime: on what system did the authors run their program, and how long did it take? If it takes 10 days longer to run than the other baselines, the improvement might not be worth it!

6) The authors should open-source their code, or at least a base version of it, so that their claimed results can be verified by others.


Author Response

Please see the attachment.

Author Response File: Author Response.pdf
