Article
Peer-Review Record

Grasp Detection Combining Self-Attention with CNN in Complex Scenes

Appl. Sci. 2023, 13(17), 9655; https://doi.org/10.3390/app13179655
by Jinxing Niu 1, Shuo Liu 1, Hanbing Li 2, Tao Zhang 1 and Lijun Wang 1,*
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 18 June 2023 / Revised: 22 August 2023 / Accepted: 24 August 2023 / Published: 25 August 2023
(This article belongs to the Special Issue AI-Based Image Processing)

Round 1

Reviewer 1 Report

The authors propose a grasp detection network that seems practical in the field of robotics. However, many typos and grammatical errors make the manuscript difficult to understand. I also recommend that the authors address the following points.

- In Eqs. (1) and (2), h_R and h_i are not defined or described.

- Describe Eq. (3) in more detail. Are the frames expressed as homogeneous transformation matrices? The authors need to explain explicitly how Eq. (3) is calculated.

- In Eq. (4), the authors should clearly explain how the Q, K, and V vectors are obtained with the corresponding projection matrix W_i^T, and should define the matrix W_i^T.

- To aid the reader’s understanding, the authors should add symbols such as X’’, I’, etc. to Figure 4.

- In line 230, the sentence “The angle theta and width W denote~” should be rewritten appropriately.

- In subsection 4.5, Smooth_L1 should be written with L1 as a subscript, and the ‘h’ of ‘Smooth’ in Eq. (6) should be set in a normal font.

- In line 278, the authors state that they use a three-stage training strategy with 200 epochs in total. Does each stage run for 200 epochs? How is training in each stage terminated, or what is the condition for ending each stage? The authors should explain the training process clearly and explicitly.

- In subsection 5.3, the acceptable angle difference is less than 30 degrees. I think an error of 30 degrees is generally too large for a robot to grasp an object. What is the authors’ opinion on this?

- For Figures 8 and 9, the authors should discuss the results in more detail, including the relationship among the Quality, Width, Angle, and Grasp panels. The ground truth should be depicted together with the predicted rectangles in the figures.
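As an illustration of the kind of explanation requested for the Q, K, and V vectors, the following is a minimal sketch of standard single-head scaled dot-product self-attention. It is not the authors' implementation: the names `W_q`, `W_k`, `W_v`, the token count, and the dimensions are all assumptions chosen for the example, and the manuscript's W_i^T may be defined differently.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Standard single-head scaled dot-product self-attention.

    X is (n_tokens, d_model); each projection matrix is (d_model, d_k).
    All names and shapes here are illustrative assumptions.
    """
    Q = X @ W_q  # queries: one row per token
    K = X @ W_k  # keys
    V = X @ W_v  # values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_tokens, n_tokens)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))  # 5 tokens with 8 features each
W_q, W_k, W_v = (rng.standard_normal((8, 4)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)
```

Each row of `weights` sums to 1, so every output token is a convex combination of the value vectors. Stating the shapes of the projection matrices in this way is what makes the computation reproducible for a reader.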

Many typos and grammatical errors make the manuscript difficult to understand.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

1) Many abbreviations are used that not every reader may be familiar with. Please explain each abbreviation on its first appearance or place an abbreviation table at the beginning of the paper.

2) Some figures could be larger (Figures 1, 5, 6, 7, and 10).

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

In conducting a thorough peer review of this manuscript, I have identified areas that could benefit from further clarity and depth. I suggest the following improvements to the authors:

Introduction & Motivation: The paper would benefit from a more explicit delineation of the specific technical problem it seeks to address. Rather than vaguely referring to the low accuracy of existing methods, it would be more informative to state the problem directly. Also, the motivation for the proposed method seems unclear. This could be remedied in the "Related Work" section by highlighting the specific shortcomings of existing transformer-based detection methods, thereby building a stronger case for the methodology proposed in this paper. In addition, the reader's understanding would be enhanced by including in the "Method" section the rationale behind the design of the various modules, such as the choice of a particular transformer module configuration.

Evaluation metrics: The experimental results of this paper use Intersection Over Union (IOU) as the sole evaluation metric, thereby simplifying the complex grasping problem into a "rotated object detection" problem. I recommend the inclusion of additional metrics typically used in grasping problems to provide a more comprehensive evaluation of the experimental results.
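To make the metric under discussion concrete, the sketch below shows one way an IoU-based acceptance test with an angle limit could be written. The 30 degree limit is the value quoted from subsection 5.3 in this review record; the 0.25 IoU threshold, the function names, and the modulo-180 treatment of angles are assumptions for illustration and may not match the manuscript. The IoU itself is taken as a precomputed input.

```python
def angle_difference_deg(a, b):
    """Smallest difference between two grasp angles, in degrees.

    Assumes a parallel-jaw grasp, which is unchanged by a 180 degree
    rotation, so angles are compared modulo 180.
    """
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)

def is_acceptable(pred_angle, gt_angle, iou,
                  max_angle_deg=30.0, min_iou=0.25):
    """Return True if a predicted grasp matches a ground-truth grasp.

    max_angle_deg follows the 30 degree limit quoted in the review;
    min_iou=0.25 is an assumed threshold, not taken from the manuscript.
    """
    angle_ok = angle_difference_deg(pred_angle, gt_angle) < max_angle_deg
    return angle_ok and iou > min_iou

print(angle_difference_deg(85.0, -85.0))   # 10.0
print(is_acceptable(10.0, 35.0, iou=0.4))  # True
print(is_acceptable(0.0, 45.0, iou=0.9))   # False
```

A test of this form only checks geometric overlap with a labelled rectangle, which is why additional grasp-specific metrics would give a fuller picture.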

Results section: Consider providing examples of the model's limitations, such as instances where it fails, to intuitively demonstrate the model's performance and limitations. This practical illustration would help readers better understand the scope and limitations of your research.

The manuscript demonstrates strong academic writing skills in English. I have not noticed any major problems with the language quality of this paper. Minor revisions can be made to improve clarity and conciseness, but overall it is ready for publication by English standards.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

- Eq. (3) has been revised in more detail, but I think it is still insufficient for a clear understanding. G_i is transformed by T^R_i into G_R. If T^R_i is a homogeneous transformation matrix, T^R_i*G_i cannot be calculated because of the dimension mismatch. The authors need to define the transformation matrices, including T^R_i, explicitly.

- Fig. 5 should be modified, especially the arrow on the top side.

- In line 347, the grasp detection result is no longer visualized in the last row, because the authors added the GT boxes there. This sentence should be updated to match the authors' revision.

- In Figs. 8 and 9, the ground truth images contain many grasp boxes, but only one or three appear in the detection results. Are all grasp boxes in the GT used for training, or are some of them preselected? If all grasp boxes are used for training, do you select some of the grasp boxes in the detection results based on the quality score?
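The dimension mismatch raised in the first comment above can be illustrated with a small sketch. It assumes that the transforms are 4x4 homogeneous matrices and that only the 3D position part of a grasp is transformed; the frame names, the rotation about the z axis, and the numeric values are hypothetical and are not taken from the manuscript.

```python
import numpy as np

def make_transform(theta, t):
    """4x4 homogeneous transform: rotation by theta about z, then translation t."""
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = t
    return T

def transform_point(T, p):
    """Apply a 4x4 transform to a 3D point using homogeneous coordinates."""
    p_h = np.append(p, 1.0)  # (x, y, z) -> (x, y, z, 1)
    return (T @ p_h)[:3]

# Hypothetical chain: image frame i -> camera frame C -> robot frame R.
T_C_i = make_transform(0.0, [0.0, 0.0, 0.5])
T_R_C = make_transform(np.pi / 2, [0.3, 0.0, 0.0])
T_R_i = T_R_C @ T_C_i  # composition, applied right to left

p_i = np.array([0.1, 0.0, 0.0])  # grasp centre in frame i
p_R = transform_point(T_R_i, p_i)
print(np.round(p_R, 3))  # approximately [0.3, 0.1, 0.5]
```

A 4x4 matrix can multiply only a 4-element homogeneous vector, so a grasp tuple containing an angle and a width cannot be multiplied directly. The angle and width would need their own explicitly stated mapping.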

Most of the typos and grammatical errors have been corrected.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 3

Reviewer 1 Report

- The authors modified Eq. (4), but the symbol Z is not defined in the manuscript. The authors should also define T^R_C and T^C_i or give a reference for them.

- The authors responded that all GT boxes are used in training and that the top 3 grasp boxes are selected in the results. It would be helpful to readers if this were also reflected in the manuscript.
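The selection step described in the second comment above could be documented with something as short as the following sketch. It assumes that each candidate grasp carries a scalar quality score and that the three highest-scoring candidates are kept; the `Grasp` fields, the function name, and the sample values are hypothetical, and the authors' actual procedure may differ.

```python
import heapq
from typing import NamedTuple

class Grasp(NamedTuple):
    """Hypothetical grasp candidate; field names are assumptions."""
    x: float
    y: float
    angle: float
    width: float
    quality: float

def top_k_grasps(candidates, k=3):
    """Return the k candidates with the highest quality, best first."""
    return heapq.nlargest(k, candidates, key=lambda g: g.quality)

candidates = [
    Grasp(10, 20, 0.0, 30, 0.55),
    Grasp(40, 25, 15.0, 28, 0.91),
    Grasp(42, 60, -30.0, 35, 0.78),
    Grasp(80, 15, 45.0, 22, 0.34),
    Grasp(55, 70, 5.0, 31, 0.86),
]
best = top_k_grasps(candidates)
print([g.quality for g in best])  # [0.91, 0.86, 0.78]
```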

No comments.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
