Article
Peer-Review Record

CapGAN: Text-to-Image Synthesis Using Capsule GANs

Information 2023, 14(10), 552; https://doi.org/10.3390/info14100552
by Maryam Omar 1, Hafeez Ur Rehman 1,2,*, Omar Bin Samin 3, Moutaz Alazab 2,4, Gianfranco Politano 5 and Alfredo Benso 5
Submission received: 24 August 2023 / Revised: 27 September 2023 / Accepted: 3 October 2023 / Published: 9 October 2023
(This article belongs to the Special Issue Advances in Cybersecurity and Reliability)

Round 1

Reviewer 1 Report

In this article the authors introduce CapGAN, a model that synthesizes images from a single text statement. The proposed model addresses the problem of generating globally coherent structures in complex scenes.

The problem of image synthesis from text is very interesting and important, and the authors support this importance well in the Introduction section. Capsule networks are known for their ability to capture geometric information, in contrast to typical CNNs.

Comments:

1. The article is well-written and well-structured. The text is easy to follow, whereas the content is technically sound.

2. In line 60 the authors mention that “The conceptual novelty of this work lies in integrating capsules at Discriminator level to make the model understand the orientational and relative spatial relationship among various diverse entities of an object in an image”. In reality, the authors employ capsule networks (a well-established, previously published model) to play the role of the discriminator inside a GAN. Many works proceed in exactly the same manner: take a successful model and place it in the discriminator (mainly) or the generator of a GAN.

Of course, this does not nullify the importance of this work; however, it blurs its novelty and limits its contribution.
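
For context, the pattern this comment describes can be sketched as follows. This is a minimal, hypothetical PyTorch illustration; all layer sizes, the 64x64 input resolution, and the squash-then-score readout are assumptions for exposition, not the architecture of CapGAN itself:

```python
# Minimal sketch of the "capsule network as GAN discriminator" pattern.
# Hypothetical sizes throughout; not the CapGAN architecture itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    # Capsule non-linearity: scales vector norms into [0, 1)
    # while preserving orientation (Sabour et al., 2017).
    sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq / (1.0 + sq)) * s / torch.sqrt(sq + eps)

class CapsuleDiscriminator(nn.Module):
    def __init__(self, caps_dim=8, n_caps=32):
        super().__init__()
        self.caps_dim = caps_dim
        self.conv = nn.Conv2d(3, 64, 4, stride=2, padding=1)   # 64x64 -> 32x32
        self.primary = nn.Conv2d(64, n_caps * caps_dim, 4,
                                 stride=2, padding=1)          # 32x32 -> 16x16
        self.score = nn.Linear(n_caps * 16 * 16, 1)            # real/fake logit

    def forward(self, x):
        h = F.leaky_relu(self.conv(x), 0.2)
        h = self.primary(h)                          # (B, n_caps*caps_dim, 16, 16)
        caps = h.view(h.size(0), -1, self.caps_dim)  # one vector per capsule
        lengths = squash(caps).norm(dim=-1)          # capsule "existence" scores
        return self.score(lengths)                   # plugged in as D in a GAN loss

# logit = CapsuleDiscriminator()(torch.randn(4, 3, 64, 64))  # shape: (4, 1)
```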

3. In line 114 the authors employ skip-thought vectors for representing the input text. Although this is perfectly fine, other, more successful models have been proposed for this task. For example, BERT and its variants have proved particularly effective at representing text with low-dimensional, fixed-size, dense vectors that also capture text semantics. The same applies to other popular schemes such as fastText, word2vec, and GloVe. Is there any particular reason that led the authors to use skip-thought vectors instead of these well-established models?
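
For concreteness, a minimal sketch of the kind of BERT-based sentence embedding this comment refers to; the sentence-transformers library and the model name are illustrative assumptions, not components of the paper:

```python
# Sketch only: fixed-size, dense sentence embeddings from a BERT variant,
# as an alternative to skip-thought vectors. Model choice is illustrative.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # a small BERT-family model

captions = ["this bird has a red head and a short yellow beak"]
embeddings = encoder.encode(captions)  # ndarray of shape (1, 384)
```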

4. In the experimental evaluation, the authors have not included several of the most successful GANs for image generation. In order to fully evaluate the usefulness of the proposed model, please include DCGAN (Deep Convolutional GAN) and StyleGAN in your experimental evaluation (and, in particular, in Table 7).

5. Minor issues:

5a. In Line 49: “Unlike CNN, which operates on scalar inputs,”: The input to a CNN is usually a tensor, not a scalar value. I understand what the authors are trying to say, but this statement is incorrect. Please rephrase.

5b. There are several syntactic and grammatical errors in the article that need to be fixed. For example:

5b-i. Line 39: “Although the models based on convolutional layers e.g., convolutional neural networks (CNN), has provided massive” -> “Although the models based on convolutional layers e.g., convolutional neural networks (CNN), have provided massive”.

5b-ii. Line 41: “For instance, largeamount of data…” -> “For instance, large amounts of data…”

5b-iii. Line 63: “dogs.Our” A space character is required.

5c. The authors must carefully proofread the text and correct all similar errors.

The quality of writing is good. Only minor issues and typos have been detected.

Author Response

Dear Reviewer,

I hope this message finds you well. I would like to express my sincere gratitude for taking the time to review our manuscript titled "CapGAN: Text-to-Image Synthesis Using Capsule GANs". Your insightful comments and feedback have been invaluable in improving the quality of our work. We have carefully considered each of your suggestions and have made the necessary revisions to address them.

In the attached document, you will find a point-by-point response to your comments, detailing the changes we have made and providing explanations where necessary. We believe that these revisions have significantly strengthened the manuscript and have enhanced its contribution to the field of machine learning.

Once again, thank you for your thorough review, which has undoubtedly played a vital role in refining our work. We are confident that the revised manuscript now meets the high standards of the Information journal and will make a meaningful contribution to the scientific community.

We greatly appreciate your time and expertise.

Best regards,

Dr. Hafeez

Author Response File: Author Response.pdf

Reviewer 2 Report

In this manuscript, a model called CapGAN is proposed to synthesize images from a given single text statement and to address the problem of generating globally coherent structures in complex scenes. Skip-thought vectors are used for text embedding. Experimental results on three datasets demonstrate the effectiveness of the proposed method. The details of my comments are as follows.

 

-- A schematic diagram of the skip-thought mechanism is suggested. How can the skip-thought model be used when the caption of each image is only one sentence rather than a paragraph consisting of multiple sentences?

-- Is it possible to find an example from the real experiments to illustrate the difference between CNNs and capsule networks in Figure 2?

-- Ablation studies are suggested. What is the contribution of the skip-thought vectors? One may compare against word2vec-based baseline vectors [1]; a minimal sketch of such a baseline is given after the reference below.

[1] Alternative semantic representations for zero-shot human action recognition
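
A minimal sketch of such a word2vec averaging baseline; the pretrained model name and the simple mean-pooling scheme are assumptions made for illustration, not the paper's method:

```python
# Hypothetical word2vec baseline for the suggested ablation:
# embed a caption as the mean of its in-vocabulary word vectors.
import numpy as np
import gensim.downloader as api

kv = api.load("word2vec-google-news-300")  # pretrained 300-D KeyedVectors

def caption_vector(caption):
    words = [w for w in caption.lower().split() if w in kv.key_to_index]
    if not words:                              # no known words: zero vector
        return np.zeros(kv.vector_size, dtype=np.float32)
    return np.mean([kv[w] for w in words], axis=0)

emb = caption_vector("a small bird with blue wings")  # shape: (300,)
```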

-- When comparing with other methods, is the same text embedding used across the different methods? How are the hyper-parameters of the different comparative models determined?

-- It is suggested to compare the computational complexity of the different models as well.
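
One simple way to report such a comparison is trainable-parameter counts plus wall-clock inference time. A hedged sketch follows; the model interface and the latent input size are placeholders for whichever GANs are compared:

```python
# Illustrative helper for the suggested complexity comparison.
# `model` and the input shape are placeholders for each compared GAN.
import time
import torch

def complexity_report(model, input_shape=(1, 100)):
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    x = torch.randn(*input_shape)          # e.g., a latent/text-embedding vector
    model.eval()
    with torch.no_grad():
        t0 = time.perf_counter()
        model(x)
        elapsed = time.perf_counter() - t0
    return n_params, elapsed               # report both per compared model

# params, secs = complexity_report(generator)  # run once per model in Table 7
```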

-- Typos and grammar issues: "...quantitative evaluation metrices..." ("metrices" should read "metrics").

Quality of English language is fine.

Author Response

Dear Reviewer,

I hope this message finds you well. I would like to express my sincere gratitude for taking the time to review our manuscript titled "CapGAN: Text-to-Image Synthesis Using Capsule GANs". Your insightful comments and feedback have been invaluable in improving the quality of our work. We have carefully considered each of your suggestions and have made the necessary revisions to address them.

In the attached document, you will find a point-by-point response to your comments, detailing the changes we have made and providing explanations where necessary. We believe that these revisions have significantly strengthened the manuscript and have enhanced its contribution to the field of machine learning.

Once again, thank you for your thorough review, which has undoubtedly played a vital role in refining our work. We are confident that the revised manuscript now meets the high standards of the Information journal and will make a meaningful contribution to the scientific community.

We greatly appreciate your time and expertise.

Best regards,

Dr. Hafeez

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The authors have responded appropriately to my initial comments. The article deserves to be published.

Reviewer 2 Report

All my comments have been addressed. No further comments.

Looks good.
