Submit to this Journal Review for this Journal Propose a Special Issue

Article Menu

Share Help Cite Discuss in SciProfiles

Open AccessArticle

Peer-Review Record

Integrating Egocentric and Robotic Vision for Object Identification Using Siamese Networks and Superquadric Estimations in Partial Occlusion Scenarios

Biomimetics 2024, 9(2), 100; https://doi.org/10.3390/biomimetics9020100

by Elisabeth Menendez^1,*

, Santiago Martínez¹

, Fernando Díaz-de-María²

and Carlos Balaguer¹

Reviewer 1:

Farook Sattar

Reviewer 2:

Giovanni De Gasperis

Reviewer 3: Anonymous

Reviewer 4:

Pringgo Widyo Laksono

Reviewer 5:

Shuai Teng

Biomimetics 2024, 9(2), 100; https://doi.org/10.3390/biomimetics9020100

Submission received: 21 November 2023 / Revised: 31 January 2024 / Accepted: 6 February 2024 / Published: 8 February 2024

(This article belongs to the Special Issue Intelligent Human-Robot Interaction: 2nd Edition)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper proposed an improved object identification method using Siamese networks and superquadric estimations based on both egocentric and robotic vision. Interesting results are presented for the challenging partial occlusion cases. The paper needs some revisions as follows:

1. Please include a table containing the hyperparameter values used in the Siamese networks.

2. Please add the comparison results. For instance, the authors may consider the performances of similar/existing method(s) for their comparison.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Summary and understanding:

This paper tackles a combination of several Human-Robot Interaction related problems:

• class-agnostic object identification

• user-guided object identification

• intension estimation

• gaze event identification

The problem is basically identification (segmentation without classification) of objects that are the target of a user’s gaze, thus first the goal is intent estimation, then category-agnostic identification. The paper tackles an environment where the peripheral inputs are:

• RGB Camera

• Depth Camera

• Eye Tracker

When it comes to gaze estimation, previous methods often rely on prior knowledge of objects and their fixed positions or the use of markers which restricts their use in realistic environments (with unknown objects or obfuscated/obscure variations).

To solve the problem the authors trained Siamese Networks for identification of gazed objects, the dataset crops the segments of each category agnostic object based on its superquadratic, where superquadratics are a compact representation of simple objects with a set of parameters, thus achieving the category agnostic behaviour that is desired.

The workflow of the developed framework is the following:

• a depth image is acquired with the robot camera, then combined with the camera intrinsic

parameters and the robot joint configuration with forward kinematics, this depth image is

converted into a 3D point cloud, and the joints are taken out from the cloud.

• the point cloud is clustered.

• by projecting the points of each cluster onto the horizontal plane and computing the convex

hull of them, a reconstruction of occlusions is achieved.

• Each cluster is run through an inside-outside function in order to calculate its

superquadratic.

• Using a direct formulation on the surface of a superquadratic a mask is obtained for each

superquadratic.

• The RGB image is cropped based on the superquadratic mask for each object:

◦ a positive crop where the object is taken out.

◦ a negative crop where the other objects are taken out.

◦ an anchor image where the object is highlighted by its mask.

• a Triplet neural network is constructed, which extends the concept of Siamese Networks by learning embeddings based on the Three-Crops of each object, and the Triplet Loss Function is used to train the network, the network produces 3 outputs: positive, negative, and anchor.

• Through the fusion of the three outputs, object segmentation is achieved based on shape and pose rather than based on category.

In the results section the authors discuss their work by testing the average error for different types of objects, and under different circumstances (varying occlusions), they also test their work with superquadratics against variants that do not rely on superquadratics and they conclude that the superquadratics method achieves better numerical results, however no direct comparison with previous works is done.

Comments:

Great work, however it would be useful to provide an overarching diagram (with a small section explaining the workflow/dataflow) for how all the components of the discussed solution framework actually work in tandem together.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This study aims to develop an optimization model that enables a robot to identify a selected object based on the user's gaze, facilitated by eye-tracking glasses. However, the results are limited and lack in-depth explanations and validation. Also, the novelty and significance of the manuscript are not clear. The authors should clearly explain the new contributions in this work to fill the research gaps. The manuscript has many mathematical equations that are out of the reader's interest because it is difficult to follow them. It needs in-depth discussions and explanations of equations and their benefits. In addition, the technical writing has many grammatical errors, making it hard to read.

In addition to the following:

- The abstract is vague and broad. Enhance the abstract to emphasize this work's objectives by adding a paragraph about the findings and results.

- Enhance the conclusion to focus only on the objectives, methodology, and quantitative results. It should present the limitations and future directions of this study.

-Enhance the introduction to present sufficient knowledge and recent studies on the current problem.

- Add more literature review to focus on the current objectives of the proposal and identify the research Gaps. Add recent studies 2023, 2022, etc.

- Most of the Figures missed clear explanation and analysis. For example, in Fig. 7, What does dimention2, 3,4,5 mean?

-The research paper should be written in the third person's perspective; words such as "we" must be avoided.

Also, how the model computes the proposed system efficiency compared to other studies needs to be clarified.

- Enhance the figure resolutions and captions (too long).

- Too long sentences make the meaning unclear and hard to read. Consider breaking it into multiple sentences, such as the following examples: L6-L8; L26-L28; L37-L40; L44-L47;

The English language, redaction, and punctuation must be improved in general. The manuscript should undergo editing before being submitted to a journal again. The following are some examples:

L1: robot identifying a . …… should be …. robot to identify a

L3: the objects categories or their categories …… should be …. the objects' categories or categories

L3: without the use of external markers…… should be …. without external markers

L5: matches the user's gaze, with …… should be …. that matches the user's gaze with

L6: estimating objects shapes …… should be …. estimating objects' shapes

L8: to the identification of new …… should be …. to identification new

Comments on the Quality of English Language