Review
Peer-Review Record

Research Progress on Vision–Language Multimodal Pretraining Model Technology

Electronics 2022, 11(21), 3556; https://doi.org/10.3390/electronics11213556
by Huansha Wang, Ruiyang Huang * and Jianpeng Zhang
Reviewer 1:
Reviewer 2:
Submission received: 11 October 2022 / Revised: 27 October 2022 / Accepted: 27 October 2022 / Published: 31 October 2022
(This article belongs to the Section Artificial Intelligence)

Round 1

Reviewer 1 Report

Some of the English terms seem misleading or unclear. For example, "linguist model" does not seem correct. The English is overall okay, but due to some unclear terms, I would recommend that the authors have the manuscript checked by native English speakers. There are also places where two commas (,,) are used, where a typo is made (e.g., "adversarail"), or where a space is needed after a punctuation mark.

Other than these issues, the paper presents an extensive survey of current approaches to multimodal pre-trained models.

 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper provides a thorough review of the research progress on vision–language multimodal pre-training models. It also suggests potential research directions in this field. The summary is quite detailed, and the classification helps researchers get the big picture quickly.

As this is a survey paper, my comments mainly focus on potential improvements to the classification and the sources of results, together with the presentation, to make the paper more helpful to readers.

intro
- "the single-stream and dual-stream architecture can be further sorted out according to three types": From fig 2, the two classification dimensions under the image–linguist pre-training model are orthogonal, but this reads as if single/dual-stream is a parent dimension. I'm not quite certain about the relationship between them.
- Why are video–linguist pre-training models omitted entirely? It is a bit odd that they appear in fig 2 but are almost not discussed at all. Do they have any classifications?

Section 2
- In table 2, why is [19] the only one represented by image/text size?
- In section 2.2, please consider adding citations for each model you introduce. The same for 2.3 and 2.3.1.
- "Some models regard multimodal downstream tasks such as ...": what are "some models"? Please be precise, e.g., add citations.
- "It has achieved excellent results 2-4 percentage points better than..." and several other similar spots: what is "2-4 percentage points" measuring? Is it precision, recall, or something else?
- For the evaluation results on public datasets, did you perform the experiments yourself, or are the results taken from the related works? Either way, please make it explicit.

nit:
- "different models mainly refer to different forms of data": do you mean "modalities"?
- (in fig 2) "Imgae-Linguist pretraining model": should be "Image" (typo)
- In fig 6, there are non-English characters
- For tables 3, 4, and many others, can you add horizontal lines for each row? Otherwise, it is not easy to figure out the row boundaries, especially for crowded columns.
- "Although the number of high parameters ensures the model effect": what are "high parameters"? Do you mean "a high number of parameters"?
- "a)Multi-task high-quality model research.Research": many missing spaces in this section

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
