1. Introduction
Robotic grasping of unknown objects is an open problem in the robotics community. It has been reported that human grasping behavior for known and familiar objects depends on object identification and object presentation [1,2]. For an unknown object, the appearance features provide information about the object's parts and their respective affordances, which the brain processes in a spatial stream to assist action selection; they also contribute to a trade-off between appearance and identification during grasping [2]. Inspired by the human visual pathways for perception and action, early researchers considered grasping an unknown object to be a problem of perceiving its shape, texture, and weight, which are then handled by a subsequent fine controller. In these early stages, shape similarity was widely used to predict robotic grasp positions for unknown objects, especially when the objects' materials and weights were unknown [2,3,4,5,6,7,8].
In contrast to the conventional shape feature descriptors devised by humans, a deep neural network (DNN) is capable of predicting the optimal grasp position from object image presentations. This is achieved by analyzing a vast number of images with labeled anchors, which indicate the positions and orientations of potential grasps [9,10]. A DNN can achieve high performance in grasp prediction from image presentations thanks to the knowledge gained from the labeled grasp anchors [11,12]. Furthermore, some researchers have combined deep neural networks with shape feature descriptors for two-dimensional (2D) [13] and three-dimensional (3D) [14,15,16,17] grasp position prediction of unknown objects.
However, to the best of our knowledge, a fundamental question underlying these studies, namely whether similar objects have similar grasp positions, has not been discussed before. Some researchers have combined shape categories [18], shape features [2,19,20], and shape contours [5,15] in the prediction of grasping anchors. However, these works did not discuss how the objects' grasping positions are influenced by their shapes. This work aims to confirm or disprove the hypothesis by analyzing the relationship between the objects' shape similarity and grasp similarity. The contributions of this work include the following:
We analyze the correlation between grasping positions and objects similarity for a dozen commonly seen shapes. This is a basic and important issue in the research of robotic grasps that has not been discussed before.
We construct a similarity-estimation plane (SE-Plane), whose horizontal and vertical axes indicate the objects similarity and grasp similarity, respectively. The proof of the target question is thereby converted to confirming the object pair relationships in the SE-Plane; namely, an object pair with a higher objects similarity is expected to have a higher grasp similarity as well.
We adopt several classical shape descriptors and widely recognized deep neural network (DNN) architectures as objects similarity strategies and use the intersection-over-union (IoU) of grasp anchors to measure the grasp similarity between objects. Through the correlation analysis between grasping positions and objects similarity for different objects in the SE-Plane, we found several primitive shapes that show higher correlations between grasp positions and objects similarity. We believe the discussions and discoveries in this work provide valuable information for the robotic grasping of unknown objects.
The remainder of this work is organized as follows: the related works are introduced in Section 2; the definitions of the proposed SE-Plane and the related similarity measures are presented in Section 3; the experiments are reported in Section 4; and the discussions and conclusions are given in Section 5 and Section 6, respectively.
2. Related Work
2.1. Objects Similarity
2.1.1. Categories and Shapes
Object shapes vary in image presentations with different poses and observation views. For example, the images in Figure 1 are from the Cornell grasping dataset [21]. In Figure 1, the images in the top and bottom rows present the objects' shapes and the grasp position labels, respectively. The two images in the bottom row of Figure 1a show that the cup has different grasping positions between its "standing upright" (left) and "laying down" (right) states. A similar situation occurs with the gummed paper tape shown in Figure 1b. This indicates that object grasp positions are largely influenced by the objects' image presentations, which vary with their poses and observation views. There are a few robotic grasping datasets, including Cornell [22], Jacquard [23], YCB [13], ARC [24], VMRD [25], etc. The objects' information, such as their categories, image presentations, and grasp anchors, is labeled in these datasets, but their shapes are not indicated.
2.1.2. Shape Similarity
Shape similarity has been widely discussed in object retrieval [6,26], image matching [3,27,28,29,30], and robotic grasping tasks [19,21,31]. Among these works, the Hu moment invariants method, proposed in 1962, is the earliest and most classical shape descriptor [32]. Hu moments are still widely used in the analysis of objects' shapes because of their rotation and scale invariance [29,33,34].
Shape Context (SC) [3] is also a typical shape feature descriptor that is widely recognized in object matching. Since it is invariant to rotation, translation, and scaling transformations, it is also widely adopted in robotic grasping tasks.
The histogram of oriented gradients (HOG), originally proposed for human detection, is an edge- and gradient-based descriptor [35]. The HOG method is suitable for robotic grasping tasks, especially when the image foreground has boundaries that are distinguishable from the background [15].
Besides the Hu moments, Shape Context, and HOG methods, there are also other well-known shape similarity calculation methods, such as the Iterative Closest Point (ICP) [36] and the Point Distribution Model (PDM). In this work, we use the Hu moments, Shape Context, and HOG in the object shape similarity calculation, since they are widely used in grasping tasks.
2.1.3. DNN Objects Similarity
Deep neural networks (DNNs) have proved to be excellent tools for measuring objects similarity [28]. An autoencoder is a widely recognized method for extracting effective image features for objects similarity, whereby a decoder recovers the image presentation from a low-dimensional representation. In this way, the low-dimensional feature map can be applied to shape matching and image retrieval [17,28,37,38,39].
Given the autoencoder's excellent abilities in data reduction and feature extraction, we used autoencoders for objects similarity estimation in this work.
2.2. Grasping Position Similarity
Before 2000, policies for grasping novel objects were generally constructed according to the manipulator joint torques, the friction at the touch points, and the gravity of the object [40,41,42]. In this period, grasp performance was directly evaluated by the number of trials before a successful grasp [4]. From 2000 to 2013, techniques for grasping novel objects aimed to achieve faster learning of grasp knowledge [2,7,19], more accurate detection of grasp positions [2], and more robust grasps of unknown objects [7,43,44,45].
Since 2013, when deep neural networks (DNNs) were first used in grasping tasks [9], various networks have been proposed to learn grasp knowledge from image presentations. Inspired by object detection algorithms such as YOLO v1–v5 [10,46,47] and single-shot multi-box detection (SSD) [48], anchors, which are used to indicate the extents of objects (such as the width, height, and center point of the targets), have also been adopted to represent the grasp positions of objects [10,49]. Despite the high performance of deep architectures on grasp position prediction, the hypothesis that similar objects have similar grasping positions has not been validated.
With the development of DNN grasping techniques, researchers have constructed grasping datasets for DNN training, such as Cornell [22], Jacquard [23], YCB [13], ARC [24], and VMRD [25]. In these datasets, the grasping positions are labeled with anchors. The intersection-over-union (IoU) between the predicted anchors and the ground truth is used as the network loss function during training and for grasp performance evaluation. In this work, we use the anchors provided by the datasets and the IoU to evaluate the grasping similarity between different objects.
2.3. Objects’ Similarity and Grasping Positions Similarity
As mentioned in Section 2.1, classical methods such as the Hu moments, Shape Context, and HOG are able to describe objects' shape features very well. In addition, the autoencoder is able to reduce the dimensions of objects' shape features for similarity evaluation. We have also introduced that the IoU between the anchors of two objects implies their grasp similarity. However, the relevance between objects similarity and grasp similarity has not been discussed before. This is exactly the issue this work focuses on.
3. Definitions
We aim to find the relationship between objects similarity and grasp similarity. In this section, we first introduce the related concepts and definitions.
Similarity-estimation Plane (SE-Plane): Let $x_i$ ($i = 1, 2, \ldots, m$) be the $i$-th object in a grasp evaluation dataset $X$, where $m$ is the total number of objects in $X$. $OS_{i,j}$ and $GS_{i,j}$ are the normalized values of the objects similarity and the grasp similarity between $x_i$ and $x_j$, where $OS_{i,j}, GS_{i,j} \in (0, 1)$, and larger $OS_{i,j}$ and $GS_{i,j}$ imply a higher objects similarity and grasp similarity between $x_i$ and $x_j$.
Suppose that $S_i = \{OS_{i,j} \mid j \neq i\}$ is the set of objects similarities for $x_i$, where all the items are sorted in descending order of $OS_{i,j}$; then $t_k$ denotes the index of the object whose $OS_{i,t_k}$ is the $k$-th maximal item in $S_i$.
Given a two-dimensional (2D) coordinate plane, the $x$ coordinate records the values of $OS$, and the $y$ coordinate presents the corresponding values of $GS$. In this way, $P_k^i = (OS_{i,t_k}, GS_{i,t_k})$ is a 2D point in the plane, whose $x$ value presents the $k$-th maximal objects similarity for $x_i$, and whose $y$ value presents the corresponding grasp similarity between $x_i$ and $x_{t_k}$, where $k = 1, 2, \ldots, m-1$. In this work, we call this 2D plane the similarity-estimation plane (SE-Plane).
Figure 2 shows a sample presenting the relationship between the objects similarity $OS$ and the grasp similarity $GS$ in the SE-Plane for two object pairs. In Figure 2, $x_1$, $x_2$, and $x_3$ are a remote control, a pair of sunglasses, and an apple, respectively. Intuitively, $x_1$ and $x_2$ have a higher similarity (rectangle vs. dumbbell shape) than $x_1$ and $x_3$ (rectangle vs. circle). It is then expected that the point of the pair $(x_1, x_2)$ is located at the upper right side of the point of $(x_1, x_3)$ in the SE-Plane, just as presented in Figure 2.
Figure 2 gives a simple and straightforward sample for two assumed object pairs in the proposed 2D SE-Plane. Based on the definition of the SE-Plane, we have two inferences:
Inference A: If the assumption that similar objects have similar grasp positions is true, then the 2D point of two objects with both a higher objects similarity and a higher grasp similarity should be located in the top-right area of the SE-Plane relative to the points whose similarities are smaller.
Inference B: It is also expected that, given some common objects $x_i$ randomly picked from the open, widely recognized grasping datasets, the points $P_k^i$ with smaller $k$ should be located in the top-right area relative to those with larger $k$. In the experiment section, we analyze and discuss whether Inferences A and B can be proved using the classical objects similarity evaluation methods (including the shape similarity strategies) and the grasp similarity obtained by the IoU between the grasping anchors in the open datasets. Furthermore, if Inferences A and B are correctly proved, it is reasonable to conclude that the answer to the question "do similar objects have similar grasp positions?" is yes. In addition, we hope to find some valuable patterns between grasping positions and objects similarities as well.
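To make the SE-Plane construction concrete, the following minimal Python sketch builds the top-k points from precomputed objects similarity and grasp similarity matrices. The matrix names OS and GS and the function se_plane_points are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def se_plane_points(OS, GS, k_max=3):
    """Build SE-Plane points P_k = (OS[i, t_k], GS[i, t_k]) for every object i.

    OS, GS : (m, m) symmetric matrices with values in (0, 1),
             where larger values mean higher similarity.
    Returns a dict mapping k -> list of (x, y) points, one per object.
    """
    m = OS.shape[0]
    points = {k: [] for k in range(1, k_max + 1)}
    for i in range(m):
        # Sort the other objects by descending objects similarity to x_i.
        order = np.argsort(-OS[i])
        order = order[order != i]          # exclude the object itself
        for k in range(1, k_max + 1):
            t_k = order[k - 1]             # index of the k-th most similar object
            points[k].append((OS[i, t_k], GS[i, t_k]))
    return points

# Toy usage with random symmetric similarity matrices (illustrative only).
rng = np.random.default_rng(0)
A = rng.random((12, 12)); OS = (A + A.T) / 2
B = rng.random((12, 12)); GS = (B + B.T) / 2
pts = se_plane_points(OS, GS)
print(len(pts[1]), "top-1 points")
```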
4. Experiments
Before presenting the experiments, we first outline the methods for objects similarity and grasp similarity evaluation, the grasping datasets, and the objects used in the experiments.
4.1. Objects Similarity
4.1.1. Hu Moment Invariants
For a given $x_i$, we adopted the seven values of the Hu moments as its objects similarity vector, denoted as $V_i^{Hu} = (h_i^1, h_i^2, \ldots, h_i^7)$. We used the function HuMoments() provided by the open computer vision library (OpenCV) [50] to obtain the seven Hu moments from an object image. Then, the Hu objects similarity $OS_{i,j}^{Hu}$ between $x_i$ and $x_j$ was obtained by Equation (1), where each Hu moment pair was first normalized to the range of (0.0, 1.0) for $x_i$ and $x_j$, so that $OS_{i,j}^{Hu} \in (0.0, 1.0)$. A larger value of $OS_{i,j}^{Hu}$ implies a higher objects similarity between $x_i$ and $x_j$ on the Hu moment measurement.
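As an illustration of this step, the sketch below computes the seven Hu moments with OpenCV and maps their difference into a similarity score in (0, 1). Since Equation (1) is not reproduced here, the log-scaling and the 1/(1 + d) mapping are assumptions for illustration rather than the paper's exact formula.

```python
import cv2
import numpy as np

def hu_vector(gray):
    """Seven Hu moment invariants of a grayscale (or binary) object image."""
    return cv2.HuMoments(cv2.moments(gray)).flatten()

def hu_similarity(img_i, img_j):
    """Hedged stand-in for OS^Hu in Equation (1): compares log-scaled Hu
    moments and maps their distance into (0, 1); larger means more similar."""
    hi, hj = hu_vector(img_i), hu_vector(img_j)
    # Log-scale the moments, since their magnitudes differ by orders of magnitude.
    hi = -np.sign(hi) * np.log10(np.abs(hi) + 1e-12)
    hj = -np.sign(hj) * np.log10(np.abs(hj) + 1e-12)
    d = np.mean(np.abs(hi - hj))
    return float(1.0 / (1.0 + d))
```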
4.1.2. Shape Context
Shape Context (SC) uses log-polar histogram bins to describe contour features, and $SC_i$ denotes the SC feature descriptor of object $x_i$ [3]. The minimum matching cost indicates the similarity of $x_i$ and $x_j$ according to Equation (2). The matching cost of two objects $x_i$ and $x_j$ under a possible feature matching $\pi$ is denoted as Equation (3), in which $C(p_u, q_{\pi(u)})$ reflects the cost between the contour points $p_u$ of $x_i$ and $q_{\pi(u)}$ of $x_j$.
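Since Equations (2) and (3) are not reproduced here, the sketch below only illustrates the general idea: log-polar histograms are built for sampled contour points and matched with the Hungarian algorithm, with the chi-square histogram distance playing the role of the point-to-point cost. The sampling density, bin counts, and cost-to-similarity mapping are illustrative assumptions.

```python
import cv2
import numpy as np
from scipy.optimize import linear_sum_assignment

def sample_contour(mask, n=60):
    """Sample n points from the largest external contour of a binary mask."""
    cnts, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    pts = max(cnts, key=cv2.contourArea).reshape(-1, 2).astype(float)
    idx = np.linspace(0, len(pts) - 1, n).astype(int)
    return pts[idx]

def shape_context(pts, r_bins=5, a_bins=12):
    """Log-polar histogram (r_bins x a_bins) for every sampled contour point."""
    n = len(pts)
    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
    ang = np.arctan2(pts[:, None, 1] - pts[None, :, 1],
                     pts[:, None, 0] - pts[None, :, 0])
    d /= d.mean() + 1e-12                         # scale invariance
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), r_bins + 1)
    hists = np.zeros((n, r_bins * a_bins))
    for i in range(n):
        r_idx = np.digitize(d[i], r_edges) - 1
        a_idx = ((ang[i] + np.pi) / (2 * np.pi) * a_bins).astype(int) % a_bins
        for j in range(n):
            if i != j and 0 <= r_idx[j] < r_bins:
                hists[i, r_idx[j] * a_bins + a_idx[j]] += 1
    return hists / (hists.sum(axis=1, keepdims=True) + 1e-12)

def sc_matching_cost(mask_i, mask_j):
    """Mean chi-square cost of the optimal point-to-point matching;
    a smaller cost means a higher shape similarity."""
    hi = shape_context(sample_contour(mask_i))
    hj = shape_context(sample_contour(mask_j))
    cost = 0.5 * np.sum((hi[:, None] - hj[None, :]) ** 2 /
                        (hi[:, None] + hj[None, :] + 1e-12), axis=2)
    rows, cols = linear_sum_assignment(cost)      # Hungarian matching
    return float(cost[rows, cols].mean())
```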
4.1.3. Histogram of Oriented Gradient (HOG)
We used the function HOGDescriptor() provided by OpenCV to obtain the HOG vector $V_i^{HOG} = (g_i^1, g_i^2, \ldots, g_i^S)$ for $x_i$, where $S$ is the length of the HOG vector. Compared with the Hu moments, whose feature vector contains only seven items, the HOG feature vector is far longer. In this case, the correlation coefficient is more suitable for describing whether two feature vectors have a consistent change tendency. The correlation coefficient of the HOG similarity $OS_{i,j}^{HOG}$ between $x_i$ and $x_j$ is presented in Equation (4).
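As a concrete illustration, the sketch below computes HOG vectors with OpenCV's HOGDescriptor using the window, block, stride, and cell sizes reported in Section 4.4.1 (nine orientation bins assumed) and compares them with the Pearson correlation coefficient. Treating this correlation directly as the OS^HOG of Equation (4), and cropping the central 96-pixel-wide region of the 128 × 128 image so that exactly one window fits, are assumptions for illustration.

```python
import cv2
import numpy as np

# Window/block/stride/cell sizes as reported in Section 4.4.1; 9 bins assumed.
hog = cv2.HOGDescriptor((96, 128), (16, 16), (16, 16), (8, 8), 9)

def hog_vector(gray_128):
    """HOG feature vector of a 128x128 uint8 grayscale object image."""
    crop = gray_128[:, 16:112]          # central 96x128 region -> one window
    return hog.compute(crop).flatten()

def hog_similarity(img_i, img_j):
    """Hedged stand-in for OS^HOG in Equation (4): Pearson correlation
    coefficient between the two HOG vectors."""
    vi, vj = hog_vector(img_i), hog_vector(img_j)
    return float(np.corrcoef(vi, vj)[0, 1])
```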
4.1.4. Autoencoder
We also used a deep fully connected (FC) neural network [51] and a CNN-structured autoencoder to convert object images to low-dimensional feature maps for objects similarity evaluation. Different from the Hu, Shape Context, and HOG features, the features obtained by the FC network and the CNN autoencoder tend to be implicit. We therefore used Euclidean distances to calculate the similarity between the feature vectors obtained by them.
We denote the low-dimensional feature maps obtained by the FC network and the CNN autoencoder for $x_i$ as $V_i^{FC}$ and $V_i^{CNN}$, respectively; then, we use Equations (5) and (6) to calculate the objects similarities $OS_{i,j}^{FC}$ and $OS_{i,j}^{CNN}$ between $x_i$ and $x_j$.
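Equations (5) and (6) are not reproduced here; as a hedged illustration, the sketch below converts the Euclidean distance between two encoder feature vectors into a similarity value in (0, 1). The 1/(1 + d) mapping is an assumption; any monotonically decreasing mapping of the distance would serve the same purpose.

```python
import numpy as np

def feature_similarity(v_i, v_j):
    """Similarity between two low-dimensional feature vectors (FC or CNN
    autoencoder outputs); a larger value means a higher similarity."""
    v_i = np.asarray(v_i, dtype=float)
    v_j = np.asarray(v_j, dtype=float)
    d = np.linalg.norm(v_i - v_j)       # Euclidean distance
    return float(1.0 / (1.0 + d))
```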
4.2. Grasp Similarity
The grasp similarity between two objects can be quantified by calculating the IoU between their labeled anchors. Considering that each object has multiple grasping anchors, we adopted two measures of the grasp similarity between $x_i$ and $x_j$ in this work: the average grasp similarity $GS_{i,j}^{avg}$ and the maximal grasp similarity $GS_{i,j}^{max}$.
Let $A_i = \{a_i^1, \ldots, a_i^I\}$ and $A_j = \{a_j^1, \ldots, a_j^J\}$ be the anchors of $x_i$ and $x_j$, where $I$ and $J$ are the anchor numbers of $x_i$ and $x_j$, respectively. Anchors $a_i^u$ and $a_j^v$ are regarded as interacted anchors of $x_i$ and $x_j$ when both their centers and main axes coincide, and $N$ is the number of interacted anchor pairs for $x_i$ and $x_j$. Then, $GS_{i,j}^{avg}$ and $GS_{i,j}^{max}$ can be obtained by Equations (7) and (8), respectively.
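The sketch below illustrates the spirit of Equations (7) and (8) under simplifying assumptions: after the alignment described in Section 4.3, anchors are treated as axis-aligned rectangles (cx, cy, w, h), and pairs with non-zero overlap are treated as the "interacted" anchors. These choices are illustrative and not the paper's exact procedure.

```python
import numpy as np

def rect_iou(a, b):
    """IoU of two axis-aligned anchors given as (cx, cy, w, h)."""
    ax1, ay1, ax2, ay2 = a[0]-a[2]/2, a[1]-a[3]/2, a[0]+a[2]/2, a[1]+a[3]/2
    bx1, by1, bx2, by2 = b[0]-b[2]/2, b[1]-b[3]/2, b[0]+b[2]/2, b[1]+b[3]/2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def grasp_similarity(anchors_i, anchors_j):
    """Hedged version of Eqs. (7) and (8): GS_avg averages the IoU over the
    overlapping ("interacted") pairs; GS_max is the best IoU over all pairs."""
    ious = np.array([rect_iou(a, b) for a in anchors_i for b in anchors_j])
    interacted = ious[ious > 0]
    gs_avg = float(interacted.mean()) if interacted.size else 0.0
    gs_max = float(ious.max()) if ious.size else 0.0
    return gs_avg, gs_max
```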
4.3. Data Preparation
In this work, we carried out experiments both on public grasping datasets (Cornell [22] and Jacquard [23]) and on image presentations of realistic objects.
The Cornell grasping dataset contains 885 images of 240 graspable objects, and Jacquard contains 54 k images of 11 k objects. The labels are presented as grasp anchors. Inspired by the pioneering work proposed in [18], we selected objects with primitive shapes to analyze the relationship between grasp similarity and objects similarity in the experiments. In [18], the authors divided the grasping shape primitives into 5 categories: boxes, spheres, cylinders (side grasp), cylinders (end grasp), and cones. In this work, we first selected 12 objects from the Cornell dataset, which are listed in Figure 3a. These 12 objects cover the common shapes of the rectangle, dumbbell, fusiform, cross, trapezoid, strip, circle, ring, T-shape, ellipse, arc, and irregular shapes. The number of primitive shapes discussed in this work is 12, which is far more than the 5 primitive shapes discussed in [18]. Similarly, we also selected 12 objects with the same primitive shapes from the Jacquard dataset (Figure 3b). These 24 objects were used to evaluate Inferences A and B.
We also constructed a dataset containing 2080 images of 21 realistic objects. These realistic objects have primitive shapes similar to those of the objects selected from Cornell [22] and Jacquard [23], and they were photographed at various angles and scales. Figure 3c presents a group of them aligned to the horizontal level, which are similar to those in Figure 3a,b. We named this dataset CasiaZhuasape, and a copy of it is available at [52].
Among the Hu, Shape Context, HOG, FC network, and CNN autoencoder methods, only the Hu moments and Shape Context have rotation and scale invariance, so it is necessary to align the objects in scale and rotation for the HOG, FC network, and CNN autoencoder. We extracted the objects from the original images and rotated them so that their main axes were aligned to the horizontal level. Each object was then placed in a square picture, and all the square pictures were resized to a 128 × 128 pixel resolution. The 5th and 9th images in Figure 3a present the rotation- and scale-normalized images of the cup and the gummed paper tape shown in Figure 1. Meanwhile, the object grasp anchors were rotated, aligned, and resized to the appropriate positions and angles as well.
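This alignment step can be sketched as follows: the object's main axis is estimated with cv2.minAreaRect from its mask, the image is rotated so that this axis becomes horizontal, and the object crop is padded to a square and resized to 128 × 128. The exact cropping, padding, and angle-handling choices here are illustrative assumptions.

```python
import cv2
import numpy as np

def normalize_object(image, mask, out_size=128):
    """Rotate the object's main axis to horizontal, pad to a square,
    and resize to out_size x out_size (cf. Section 4.3)."""
    cnts, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    (cx, cy), (w, h), angle = cv2.minAreaRect(max(cnts, key=cv2.contourArea))
    if w < h:                      # make the longer side the horizontal main axis
        angle += 90.0
    rot = cv2.getRotationMatrix2D((cx, cy), angle, 1.0)
    rows, cols = image.shape[:2]
    image = cv2.warpAffine(image, rot, (cols, rows))
    mask = cv2.warpAffine(mask, rot, (cols, rows))
    x, y, bw, bh = cv2.boundingRect(mask)
    crop = image[y:y + bh, x:x + bw]
    side = max(bw, bh)             # pad the crop into a square picture
    square = np.zeros((side, side) + crop.shape[2:], dtype=crop.dtype)
    oy, ox = (side - bh) // 2, (side - bw) // 2
    square[oy:oy + bh, ox:ox + bw] = crop
    return cv2.resize(square, (out_size, out_size))
```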
4.4. Experiments on Public Datasets
4.4.1. Objects Similarity Evaluation
In the HOG objects similarity calculation, we used a 96 × 128 window, a 16 × 16 block, a 16 × 16 stride, and an 8 × 8 cell to obtain the feature maps; in this way, the length of $V_i^{HOG}$ was 57,024. In the FC network, as described in [51], we used a fully connected network with a 4096-2048-1024 structure to obtain the low-dimensional features for the 128 × 128 image presentations, so the length of $V_i^{FC}$ was 1024. For the CNN autoencoder objects similarity calculation, we adopted the standard VGG16 proposed in [53] for data reduction, and the 128 × 128 images were converted to feature vectors of length 2048. The VGG16 autoencoder structure in this work consisted of 13 convolution layers and 3 fully connected layers. The standard VGG16 structure takes the Euclidean distance (mean squared error, MSE) as the loss for the fully connected layers, and the Euclidean distance is also a widely recognized measurement function for image similarity calculation. Therefore, we used Euclidean distance (MSE)-based logistic regression as the loss function for VGG16 in this work.
The rows of Figure 4a–e list the symmetric confusion matrices of the SC, Hu, HOG, FC network, and CNN autoencoder objects similarities, where the columns of Figure 4(I) and Figure 4(II) are the results for the objects selected from Cornell and Jacquard, respectively. In Figure 4, a darker grid implies a greater similarity between $x_i$ and $x_j$. We can see from Figure 4 that these symmetric confusion matrices are similar in the distribution of their values, especially in the green rectangle areas. Furthermore, the similarity values among the first three objects seem relatively higher. In these symmetric matrices, the general colors of the SC, Hu, and HOG results are relatively lighter than those of the FC network and CNN autoencoder. One possible reason is that the SC, Hu, and HOG methods pay more attention to gradient features, while the FC network and CNN autoencoder obtain the objects' shape and appearance features synchronously. In this way, the symmetric matrices can be divided into two groups: Group A (SC, Hu, and HOG) for shape, and Group B (FC network and CNN autoencoder) for appearance.
In Group A, the HOG method outperformed the SC and Hu methods in the objects similarity evaluation. For example, the first object (a remote control in column I and a rectangular box in column II) and the 5th object (two cups in the shape of a trapezoid) in Cornell and Jacquard have a high shape similarity from a human perspective, and the HOG results show exactly that objects 1 and 5 have higher similarity values, while the corresponding values of the SC and Hu methods are relatively smaller. In Group B, the FC and CNN methods obtained similar distributions, which illustrates that deep learning methods can generally extract more effective features. Moreover, the CNN method tends to perform better at distinguishing similarities. For example, among the values in the red rectangle areas, the 6th object (the two earphone wires with irregular shapes) is very different from the other objects from a human perspective, and the CNN produced lower similarity values between the 6th object and the others than the FC network did. Therefore, in the following experiments, the HOG and CNN were applied for further analysis, since they are the better methods in Groups A and B, respectively.
4.4.2. Objects Similarity and Grasping Positions Similarity
Figure 5a,b list all the top-$k$ points $P_k^i$ obtained with the HOG and CNN autoencoder in the proposed SE-Plane, respectively, for the objects in Figure 3a,b. In Figure 5, the values of the X-coordinate present the top 1, top 2, and top 3 objects similarities for $x_i$, and those of the Y-coordinate present the average grasping similarity $GS^{avg}$ between $x_i$ and the selected top-$k$ objects $x_{t_k}$. Figure 5c,d list the corresponding top-$k$ points for the HOG and CNN autoencoder when the maximal grasping similarity $GS^{max}$ is used for the Y-coordinate.
We can see from Figure 5 that the top 1 points are mostly located in the upper right area of the top 2 points, and the top 1 and top 2 points are mostly located in the upper right area of the top 3 points. This indicates that if two objects ($x_i$ and $x_j$) have a high objects similarity, they tend to have a high grasp similarity.
In addition, all the top 3 points in Figure 5c,d are more distinguishable compared with the corresponding points in Figure 5a,b. This indicates that the maximal grasp similarity $GS^{max}$ provides a better distinction than the average grasp similarity $GS^{avg}$. At the same time, the points in Figure 5b,d have better linearity than those in Figure 5a,c. This indicates that the objects similarity obtained by the CNN autoencoder has a better linear correlation with the grasp similarity than that obtained by the HOG.
In general, the object pairs with higher objects similarity and grasping similarity are located in the upper right area relative to the object pairs with lower objects and grasping similarity. Namely, the points with smaller k are located in the top-right areas relative to the ones with larger k. This shows that Inference B is reasonable and correct.
4.4.3. The Patterns for Objects Similarity and Grasping Positions
The 24 images in Figure 6(I) and (II) present, in the SE-Plane, the relationships between the maximal IoU values of the grasping positions ($GS^{max}$) and the objects similarity for each object given in Figure 3a,b, where a red connecting line means that the values of objects similarity were obtained by the CNN autoencoder, and a blue one means that they were obtained by the HOG. We can see from Figure 6 that the points with the highest similarity values are mostly located in the top-right areas of the SE-Plane, and the points with higher similarities in object and grasp positions are mostly located in the top-right area relative to the points whose object and grasp position similarities are relatively smaller, especially the points labeled by the red 'Δ' and blue '◊' in Figure 6(I)(II)(a–c,e,h–l).
In addition, we can infer from Figure 6(I)(II)(a–c,e,g,h,k) that the red curves in these small charts have clear upward trends from the bottom-left to the top-right. This indicates that the grasping positions of these objects, namely the rectangle (Figure 6(I)(II)(a)), dumbbell (Figure 6(I)(II)(b)), fusiform (Figure 6(I)(II)(c)), trapezoid (Figure 6(I)(II)(e)), strip (Figure 6(I)(II)(g)), sphere (Figure 6(I)(II)(h)), and ellipse (Figure 6(I)(II)(k)), are obviously positively correlated with the objects similarities. In this way, we can infer that, given an unknown object, if it is similar in shape or appearance to the ones listed in Figure 6(I)(II)(a–c,e,g,h,k), then its grasping positions are reasonably similar to theirs. On the other hand, for the curves in Figure 6(I)(II)(d,f,i,j,l), regardless of whether the red or blue curves are considered, the changes of the GS values have less relationship with those of the OS. This indicates that these shapes, namely the cross (Figure 6(I)(II)(d)), irregular shape (the headphone cable in Figure 6(I)(II)(f)), loop (Figure 6(I)(II)(i)), T-shape (Figure 6(I)(II)(j)), and arc (Figure 6(I)(II)(l)), are not suitable as grasp references for unknown objects. In general, the shapes listed in Figure 6(I)(II)(a–c,e,g,h,k), namely the rectangle, dumbbell, fusiform, trapezoid, strip, sphere, and ellipse, perform better in measuring unknown objects' grasping positions through evaluating their similarities to the unknown object in the proposed SE-Plane.
4.5. Experiments on Realistic Objects
4.5.1. Objects Similarity
Figure 7 lists the objects similarity confusion matrices of the SC, Hu, HOG, FC network, and CNN autoencoder for the 12 selected realistic objects listed in Figure 3c. The general colors of the SC, Hu, and HOG matrices are relatively lighter than those of the FC network and CNN autoencoder. We again divided these symmetric matrices into Group A (SC, Hu, and HOG) and Group B (FC network and CNN autoencoder).
Just like the results presented in Figure 4, in Group A, the HOG showed better performance on objects similarity, which coincided more with the human perspective (e.g., the OS relevance among the first, second, and third objects, and the OS differences between the 4th and 6th objects and the others). Similarly, the FC and CNN methods in Group B also obtained reliable OS distributions similar to those in Figure 4, while the CNN method performed better.
These experiments on the images of realistic objects reveal that these evaluation methods are still reliable in physical environments.
4.5.2. Grasp Similarity
Since the objects in the realistic dataset have no pre-labeled grasp anchors, we adopted the grasp prediction network (GP-Net) described in [54] to generate multiple possible grasp anchors for these objects. This method achieved an accuracy of 94.8% at a speed of 74.07 fps on the Cornell grasping dataset on a GeForce GTX 1050 Ti; therefore, it is reliable enough to predict grasp anchors in this work. The network received an RGB image of size 224 × 224 × 3 as input and generated an encoded grasp (x, y, w, h, θ), where (x, y), w, h, and θ are the center point, width, height, and rotation angle of the predicted grasp anchor. The 128 × 128 resolution images were resized to 224 × 224 before they were input into the GP-Net. Figure 8 presents its structure, which contains 4 convolutional layers and 2 FC layers. The network was trained on the mixed datasets of Cornell and Jacquard, and a VGG16 model pre-trained on MS COCO [55] was applied to accelerate the training.
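The structure described above (a 224 × 224 × 3 input, 4 convolutional layers, 2 FC layers, and an encoded grasp output (x, y, w, h, θ)) can be sketched roughly as follows. The channel counts, kernel sizes, and pooling are assumptions for illustration and are not the exact GP-Net of [54].

```python
import torch
import torch.nn as nn

class GraspHead(nn.Module):
    """Rough sketch of a grasp prediction head: 4 conv layers + 2 FC layers
    mapping a 224x224x3 image to an encoded grasp (x, y, w, h, theta)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(7))
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, 5))                  # (x, y, w, h, theta)

    def forward(self, x):
        return self.regressor(self.features(x))

net = GraspHead()
grasp = net(torch.rand(1, 3, 224, 224))         # -> tensor of shape (1, 5)
print(grasp.shape)
```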
Figure 9 lists the predicted grasp anchors for each object given in Figure 3c. For more prediction results, please refer to [52]. These grasp anchors were used to calculate the maximal grasp similarity $GS^{max}$ as in Equation (8).
4.5.3. Objects Similarity and Grasping Positions Similarity
Similar to Figure 5, Figure 10 lists the top-$k$ points obtained with the HOG and CNN autoencoder in the SE-Plane for the realistic objects. It is noticeable that the values of grasp similarity for the realistic objects are generally smaller than those in the public datasets. This is because only the top five grasp anchors were used for each realistic object, while at least ten or more grasp anchors were usually labeled for each object in the public datasets.
We can see from Figure 10 that the data distributions of the realistic objects are similar to those in the public datasets. The top 1, top 2, and top 3 points are generally distributed from the upper-right to the lower-left of the SE-Plane, which indicates that objects with a high objects similarity tend to have a high grasp similarity.
In addition, similar to the results for the public datasets given in Figure 5, the results for the realistic objects also show that the CNN method performs better on objects similarity evaluation, because the data distribution of the CNN has better linearity along the bottom-left to top-right direction than that of the HOG.
5. Discussions
In the experiments, we first compared the objects similarity performances obtained by the Shape Context, Hu moments, HOG, FC network, and CNN autoencoder on the objects selected from the Cornell and Jacquard grasping datasets. It was found that the HOG and CNN autoencoder have better distinguishing performance than the SC, Hu, and FC network.
Furthermore, we adopted the HOG and CNN autoencoder in the analysis of the relationship between the objects similarities and the similarities of their grasping positions by converting the related object pairs into points in the SE-Plane. It was found that the 2D points of two objects with both a higher objects similarity and a higher grasp similarity are located in the upper right area relative to the points with lower objects similarities and grasp similarities. In addition, the points with smaller k are located in the upper right area relative to those with larger k. This proves that objects with a higher objects similarity generally have a higher grasp similarity; in other words, the more similar two objects are, the more similar their grasping positions are. In this way, Inference A and Inference B were confirmed, and the answer to the question "Do similar objects have similar grasp positions?" is that this is reasonably correct. We also found that the points obtained by the maximal IoU grasping similarity were better clustered in the SE-Plane than those of the average IoU grasping similarity. This reveals that the maximal IoU performs better than the average IoU in analyzing the relationship between grasping similarity and objects similarity.
Finally, we used the maximal IoU to evaluate the relationship between the objects similarity and grasping similarity for single objects. We found that grasping similarity is generally positively correlated with objects similarity. This is also intuitive evidence that the answer to "Do similar objects have similar grasp positions?" is yes. Furthermore, it was found that the CNN autoencoder performed better than the HOG in the objects similarity evaluation for the correlation analysis between grasping similarity and objects similarity. In addition, it was also found that the objects with a better positive correlation between the CNN objects similarity and the grasp similarity are more suitable to be regarded as primitive-shape objects, such as the rectangle, dumbbell, fusiform, trapezoid, strip, sphere, loop, and ellipse.
6. Conclusions
In this work, we aimed to confirm that "similar objects have similar grasp positions" by analyzing the relationship between object shape similarity and grasp position similarity. To this end, we proposed the SE-Plane, whose horizontal and vertical axes indicate the objects similarity and grasp similarity, respectively. The proof of the issue was thereby converted to confirming the point relationships in the SE-Plane, namely that the points with a higher objects similarity accordingly have a higher grasp similarity. Several widely recognized shape calculation algorithms and deep learning methods were applied to evaluate objects similarity, and the IoU of grasp anchors was applied to evaluate grasping similarity. With experiments on the well-known public grasping datasets (Cornell, Jacquard) and a realistic dataset, the inferences raised in this work were proved to be correct from the statistical results: a positive correlation exists between grasp similarity and objects similarity.
We also obtained valuable information from the experiments: (1) the HOG and CNN autoencoder are relatively more valuable in objects similarity evaluation; (2) the maximal IoU of the grasping anchors performs better in grasp similarity evaluation than the average IoU; (3) the shapes including the rectangle, dumbbell, fusiform, trapezoid, strip, sphere, loop, and ellipse have a better positive correlation between shape similarity and grasping similarity, and they are reliable as primitive shapes and references to measure the grasp positions for a novel object, as long as the unknown object has a shape similar to the primitive ones in the SE-Plane; (4) finally, the proposed SE-Plane presents a new strategy to measure the relationship between objects similarity and grasp similarity, and thus provides a new strategy to predict the grasping anchors for a novel object from known objects. As long as the novel object has a shape or appearance similar to an object with one of the above primitive shapes in the proposed SE-Plane, the grasp knowledge (namely the grasp positions) could potentially be transferred to the novel object from the objects with primitive shapes. How the proposed SE-Plane could be used in robotic grasping tasks for novel objects is out of the scope of this work; we will discuss it in the future.