Review
Peer-Review Record

A Review of Multi-Modal Learning from the Text-Guided Visual Processing Viewpoint

by Ubaid Ullah 1, Jeong-Sik Lee 1, Chang-Hyeon An 1, Hyeonjin Lee 1, Su-Yeong Park 1, Rock-Hyun Baek 2 and Hyun-Chul Choi 1,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Sensors 2022, 22(18), 6816; https://doi.org/10.3390/s22186816
Submission received: 31 May 2022 / Revised: 3 July 2022 / Accepted: 24 July 2022 / Published: 8 September 2022
(This article belongs to the Special Issue AI Multimedia Applications)

Round 1

Reviewer 1 Report

This is always a hard comment to make, but reading the paper was extremely difficult due to the poor English. As a result, many of the paper's key contributions may not have been highlighted well, and much of the work was simply not understandable.

I recommend a thorough edit to make the paper readable.

The motivation behind the work is certainly valid, but the poor explanation makes it very difficult to judge the work's scientific quality and the insights derived.

 

Author Response

Please see the attached document.

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper presents a thorough review of text-to-image synthesis techniques from a broader view, namely text-guided visual output. Based on this taxonomy, recent SOTA models are analyzed across a range of datasets. My minor concerns are:
1. The overall presentation, including language, figures, and tables, should be improved. Countless issues can be found, e.g.: (1) the title should be capitalized consistently; (2) line 30: an empty pair of brackets follows ‘… AI techniques’; (3) the bottom line of Table 1 is missing; (4) a typo in the caption of Figure 6: ‘Deep beleif Network (DBN)’; (5) line 1394: a verb is missing. It is highly recommended that this manuscript be revised carefully by native English speakers and that all such minor issues (not limited to the examples above) be fixed.

2. In Section 2.1.1, reinforcement learning is mentioned on line 110, but it is not discussed in the following paragraph.

3. The text font in the tables, e.g., Table 2 and Table 3, is too small, and the marks (e.g., the blue and red highlights in Table 2) are too vague. It is highly recommended that these tables be reorganized so that they fit on the page.

4. Figure 3 is not mentioned anywhere in the manuscript. Moreover, its caption should be something like ‘Scheme of a GAN’ instead of ‘Vision models’, and the output of the ‘Discriminative’ block should be a decision (true or false) rather than a regression result (see the sketch below).
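(To illustrate this last point: a minimal sketch of a discriminator that ends in a binary real/fake decision, as the reviewer describes. The PyTorch-style layer sizes and names here are assumptions for demonstration only, not taken from the manuscript under review.)

    import torch
    import torch.nn as nn

    # Minimal GAN discriminator sketch: it maps an input to a single
    # probability of being real, i.e. a binary real/fake decision,
    # not a regression output. All sizes are illustrative assumptions.
    class Discriminator(nn.Module):
        def __init__(self, img_dim: int = 784):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(img_dim, 256),
                nn.LeakyReLU(0.2),
                nn.Linear(256, 1),
                nn.Sigmoid(),  # output in [0, 1]: P(input is real)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)

    d = Discriminator()
    x = torch.randn(4, 784)       # a batch of flattened "images"
    p_real = d(x)                 # probabilities, shape (4, 1)
    decision = p_real > 0.5       # the true/false decision in question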

Author Response

Please see the attached document.

Author Response File: Author Response.pdf

Reviewer 3 Report

This paper provides a survey of multi-modal learning from the text-guided visual processing viewpoint. My major concerns are:

1) Figure 2 seems unnecessary, as we all already know those common terms.

2) Some figures are not informative. For example, in Figure 3, what are the input and output? You should give more information in the figure's context rather than presenting an abstract figure to readers.

3) I think Figure 4 is incomplete: discriminative models should also include traditional methods, not just deep learning methods, so the survey is incomplete.

4) Is it possible to also summarize some open-source tools for this task?

 

Author Response

Please see the attached document.

Author Response File: Author Response.pdf
