Peer-Review Record

Say What You Are Looking At: An Attention-Based Interactive System for Autistic Children

by Furong Deng 1,†, Yu Zhou 2,†, Sifan Song 3, Zijian Jiang 4, Lifu Chen 5, Jionglong Su 6,*, Zhenglong Sun 7,* and Jiaming Zhang 4,8,*
Reviewer 1:
Reviewer 2: Anonymous
Appl. Sci. 2021, 11(16), 7426; https://doi.org/10.3390/app11167426
Submission received: 29 June 2021 / Revised: 31 July 2021 / Accepted: 4 August 2021 / Published: 12 August 2021
(This article belongs to the Special Issue Deep Learning towards Robot Vision)

Round 1

Reviewer 1 Report

The paper is interesting and technically reasonable overall. Based on my own understanding, I have the following comments.

  1. The system overview in Figure 1 is unclear; it does not differentiate the gaze-based captioning from other captioning models.
  2. In Section 2, the authors introduced several hyper-parameters, such as $\alpha_e$, $\alpha_h$ and $r$. How do they affect the receptive field for object detection in real applications or in your experimental scenario?
  3. As for the captioning model, you only mention the encoder/decoder. Please give more details of the captioning model, since different models perform differently.
  4. The dataset description in Section 3.1 is lengthy (2nd paragraph).
  5. In this manuscript, the authors claim to solve cross-frame object detection. How is this achieved? From my point of view, the method still seems to be based on static images. Can the dataset collected by the authors support this?
  6. There are too many typos. Please carefully proofread the whole article. 

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

The paper proposes a gaze-based image captioning system for children with autism. The idea is to collect colour and depth images from a RealSense D435 camera attached to a robot, and to use the depth and image information captured by this camera to generate a caption based on what the children are looking at. The paper mentions that previous studies required both the user and the object to be visible in the same image. In comparison, the method proposed in this paper addresses this problem, which the paper refers to as cross-frame gaze-following.
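
For orientation only, the capture step described above (aligned colour and depth frames from a RealSense D435) could look roughly like the minimal sketch below, using the official pyrealsense2 Python wrapper. This is an illustration of the camera interface, not the authors' actual pipeline; the resolutions and frame rate are assumptions.

```python
# Minimal sketch (not the authors' code): grab one pair of aligned colour and
# depth frames from an Intel RealSense D435 via the pyrealsense2 wrapper.
import pyrealsense2 as rs
import numpy as np

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)   # assumed settings
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)  # assumed settings
pipeline.start(config)

# Align the depth frame to the colour frame so each colour pixel has a depth value.
align = rs.align(rs.stream.color)
try:
    frames = pipeline.wait_for_frames()
    aligned = align.process(frames)
    depth_frame = aligned.get_depth_frame()
    color_frame = aligned.get_color_frame()
    depth = np.asanyarray(depth_frame.get_data())   # uint16 depth map
    color = np.asanyarray(color_frame.get_data())   # HxWx3 BGR image
finally:
    pipeline.stop()
```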

The idea presented in this paper is interesting, and it allows for a person-perspective description of the object being gazed upon, rather than a generic description of the scene in its entirety. The contributions of the work are also clearly defined.

The following are several comments for consideration:

  1. Section 2.1, Equations 4 and 5: The parameters r and j do not appear to have been described in the text; including their descriptions would help the reader understand better.
  2. Section 2.1, Lines 169-170: It would help if R, S and θ were displayed in Figure 3.
  3. Section 2.1, Figure 3: It is not clear what the horizontal line consisting of green and blue dots, and the grey dashed cone represent. A more descriptive labelling of the diagram would be better. 
  4. Section 2.2, Line 199: It is mentioned that the model is trained on preprocessed data. How has this preprocessing been carried out?
  5. Section 3.2, Lines 265-269: It does not emerge clearly throughout the paper how the cross-frame gaze estimation is handled. Is the last gaze estimate retained as the field-of-view of the camera shifts from the user to the object of interest, assuming that the gaze direction does not change when the person moves out of the field-of-view? Does the camera follow the user's gaze until a salient object is found along the gaze path? It is recommended that the authors explain this aspect of their work more clearly, ideally in Section 2 before the results are presented in the following section (a hypothetical sketch of this reading is given after this list).
  6. Section 3.2, Figure 5: The axis labels are missing and need to be included.
  7. Section 3.2: The proposed method is compared to several other methods in the literature (such as [7], [9], [32], etc.). On what basis have these methods been selected? A brief description of how these methods function differently from the proposed work would also help to put the results into context.
  8. Section 3.2, Line 287: Can a few examples of descriptions generated by [22, 24] be provided, similar to those provided for [32]?
  9. Section 3.2, Table 2: Should the column headings be revised?
  10. Section 3.3: The purpose of this section does not emerge clearly. Is this work that has yet to be done? If so, can it be carried out before finalising the paper?
  11. Section 3: On which platform are the algorithms running? And at which frame rate?
  12. Section 4, Lines 307-309: The concluding remarks regarding the insufficiency of the image description accuracy, and the further optimization required for the gaze tracking and SLAM algorithms, have not necessarily come across in Section 3, where it was mentioned that the proposed method outperformed all other methods that it was being compared to. Can these remarks be substantiated and explained further?
  13. The paper requires some re-wording and further proof-reading to improve its readability.
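
The following is a hypothetical Python sketch of the mechanism speculated about in comment 5: retaining the last gaze ray in a fixed world frame and then testing candidate object centroids against it after the camera has moved. All function names and the distance tolerance are assumptions for illustration; nothing here is taken from the manuscript.

```python
# Hypothetical illustration of one possible cross-frame gaze-following scheme:
# keep the most recent gaze ray (origin + direction) in world coordinates and,
# once the camera has shifted towards the scene, check whether a detected
# object lies close to that retained ray.
import numpy as np

def retain_gaze_ray(eye_world, gaze_dir_world):
    """Store the latest gaze estimate (origin, unit direction) in world coordinates."""
    d = gaze_dir_world / np.linalg.norm(gaze_dir_world)
    return eye_world, d

def object_on_gaze_path(ray_origin, ray_dir, obj_center, max_offset=0.15):
    """Return True if an object centroid lies near the retained gaze ray.

    max_offset is an assumed tolerance (metres) on the perpendicular distance.
    """
    v = obj_center - ray_origin
    t = float(np.dot(v, ray_dir))
    if t <= 0:                       # object lies behind the gaze origin
        return False
    closest = ray_origin + t * ray_dir
    return np.linalg.norm(obj_center - closest) <= max_offset
```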

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

The authors have addressed my concerns, and the revised manuscript is easier to understand and more readable. The contribution is moderate, but it is acceptable for the field of applied science.

Reviewer 2 Report

The quality of the paper has been improved substantially, both in terms of the use of the English language and in the descriptions of the methods and results. The shortcomings identified in the first review of this paper have also been suitably addressed by the authors.
