Anthropomorphic Grasping of Complex-Shaped Objects Using Imitation Learning

Yi, Jae-Bong; Kim, Joonyoung; Kang, Taewoong; Song, Dongwoon; Park, Jinwoo; Yi, Seung-Joon

doi:10.3390/app122412861

Open AccessArticle

Anthropomorphic Grasping of Complex-Shaped Objects Using Imitation Learning

by

Jae-Bong Yi

,

Joonyoung Kim

,

Taewoong Kang

,

Dongwoon Song

,

Jinwoo Park

and

Seung-Joon Yi

^*

Department of Electrical Engineering, Pusan National University, Busan 46241, Republic of Korea

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(24), 12861; https://doi.org/10.3390/app122412861

Submission received: 16 November 2022 / Revised: 11 December 2022 / Accepted: 12 December 2022 / Published: 14 December 2022

(This article belongs to the Special Issue New Insights into Collaborative Robotics)

Download

Browse Figures

Review Reports Versions Notes

Abstract

:

This paper presents an autonomous grasping approach for complex-shaped objects using an anthropomorphic robotic hand. Although human-like robotic hands have a number of distinctive advantages, most of the current autonomous robotic pickup systems still use relatively simple gripper setups such as a two-finger gripper or even a suction gripper. The main difficulty of utilizing human-like robotic hands lies in the sheer complexity of the system; it is inherently tough to plan and control the motions of the high degree of freedom (DOF) system. Although data-driven approaches have been successfully used for motion planning of various robotic systems recently, it is hard to directly apply them to high-DOF systems due to the difficulty of acquiring training data. In this paper, we propose a novel approach for grasping complex-shaped objects using a high-DOF robotic manipulation system consisting of a seven-DOF manipulator and a four-fingered robotic hand with 16 DOFs. Human demonstration data are first acquired using a virtual reality controller with 6D pose tracking and individual capacitive finger sensors. Then, the 3D shape of the manipulation target object is reconstructed from multiple depth images recorded using the wrist-mounted RGBD camera. The grasping pose for the object is estimated using a residual neural network (ResNet), K-means clustering (KNN), and a point-set registration algorithm. Then, the manipulator moves to the grasping pose following the trajectory created by dynamic movement primitives (DMPs). Finally, the robot performs one of the object-specific grasping motions learned from human demonstration. The suggested system is evaluated by an official tester using five objects with promising results.

Keywords:

imitation learning; grasp complex-shaped object; 3D reconstruction

1. Introduction

Although robotic manipulators have been widely used for decades, their roles have been mostly limited to repetitive tasks in industrial environments due to safety reasons. However, with the widespread adoption of cooperative robots that can safely work in human environments, and the growing demand for small-quantity batch productions, we now see an increasing number of smart, flexible robotic manipulation systems used for many different tasks [1,2,3]. Furthermore, with recent advances in machine learning algorithms, with the help of low-cost, high-resolution depth sensors, now many robotic manipulation systems can perceive environments and manipulate objects utilizing sensory information to intelligently handle complex nonrepetitive tasks, such as bin picking [4,5].

In addition to perception, there have been approaches to utilize machine learning to optimize the motions of complex, high-DOF robotic systems. However, as it is often not practical to prepare a large amount of training data using physical robotic systems, they require highly data-efficient machine learning algorithms. One such approach is imitation learning, which uses a small number of demonstration data from human experts to quickly learn the desirable motion policy for the given tasks [6,7,8,9,10,11,12]. Although human-like robotic hands have distinctive advantages in robotic manipulation over simple robotic grippers, such as a large number of grasp types [13] and in-hand manipulation capability [14], there have been few previous works that use anthropomorphic robotic hand for autonomous robotic manipulation due to difficulties coming from the high number of DOFs of the system.

This work presents a robotic platform that autonomously perceives and grasps complex-shaped objects using an anthropomorphic robotic hand. Unlike most of the previous approaches that utilize imitation learning for robotic manipulation, our approach learns how to utilize a high-DOF human-like robotic hand for manipulation using a number of machine learning approaches, which includes ResNet [15], KNN [16], point-set registration [17,18], and dynamical movement primitives (DMPs) [19]. As the direct teaching method cannot be used for such a high-DOF system, we use a VR controller with full finger tracking to acquire each finger motion data as well as the 6D trajectory of the hand. The suggested system can learn the grasp pose, hand trajectory, and finger motions for grasping complex-shaped objects from a limited number of demonstrations and is capable of reliably grasping such objects with arbitrary, previously unknown orientations.

The remainder of the paper proceeds as follows. Section 3 introduces the hardware configuration used for the work, which is shown in Figure 1. Section 4 describes the overall workflow of perception, pose and class estimation, motion planning, and grasping an object. Section 5 presents how the training data are collected, which consists of 3D point clouds, 6D hand trajectory data, and RGB images of the objects. Section 6 explains how the platform estimates the grasping pose for the manipulation object utilizing the learning data collected in Section 5. Section 7 shows the hand trajectory generation process given the grasping pose. Finally, Section 8 presents the experimental results to validate the system’s performance done by an official tester. Finally, we conclude the future direction of this work in Section 9.

2. Related Work

Imitation Learning approaches can use a variety of primary machine learning algorithms, which are systemically surveyed in [6].

In [7], the authors developed a robotic system that played table tennis with a human opponent utilizing a mixture of motor primitives (MoMP). The system used a Barett 7DOF WAM robotic arm attached to a table tennis racket. It learned the ball-hitting motions from human demonstrations, which was provided by the direct teaching method of physically moving the robotic arm. In [8], the authors also developed a similar table tennis playing robotic system using imitation learning, but utilizing different algorithms, a hidden Markov model (HMM) [20] and Gaussian mixture regression (GMR) [21]. Compared to this work, both systems used a manipulator without any gripper, and the demonstration data were acquired by the direct teaching method.

On the other hand, a number of recent imitation learning approaches use deep neural network (DNN) algorithms as the primary technique for imitation learning [9,10]. A 3D convolutional neural network (CNN) [22] is used in [9], where the authors used an imitation learning algorithm to teach a robot to grasp a deformable fish. In that work, the human expert used a six-DOF virtual reality (VR) controller to move the manipulator in a simulation environment using the Unity game engine. The human grasping pose and the point cloud of the manipulation object were logged as the training data. In [10], the authors used a DNN-based algorithm to teach a dual-arm robot to perform various manipulation tasks with objects on a table. The proposed system also used VR controllers to provide grasping poses from a human expert. Compared to our work, these approaches used simple one-DOF grippers and only learned the final grasping poses instead of the full grasping trajectories.

There have been imitation-learning-based robotic manipulation systems that use high-DOF, human-like hands [11,12]. In [11], the authors presented a robotic system that learned to catch flying objects using human demonstrations of grasping and pose estimation algorithm based on the Gaussian mixture model (GMM) [23]. In [12], the authors presented a method for learning hierarchical manipulation skills using a state-based transitions autoregressive hidden Markov model (STARHMM) [24]. The system successfully taught a dual-armed robot with human-like hands to perform a bimanual grasping of a box. However, neither system used human-like robotic hands for fine manipulations, such as grasping complex-shaped objects, which is the main focus of our work.

3. Hardware Configuration

Figure 1 shows the hardware setup used for this work, which consisted of a 7-DOF manipulator arm and a 16-DOF human-like hand. The system had a total number of DOFs of 23 and could replicate complex human manipulation behaviors. In addition to the manipulator setup, a lightweight RealSense RGBD camera was mounted on the back of the hand to perceive the manipulation objects from multiple viewpoints.

3.1. Manipulator Arm

We used a commercial Franka Emika Panda 7-DOF manipulator for this work, as the redundant DOF provides a much larger dexterous workspace compared to 6-DOF ones [25,26,27], which we found to be crucial for grasping objects in arbitrary orientation [28,29]. In addition, the manipulator’s electrically adjustable compliance greatly helped prevent possible damage to the delicate robotic hand.

3.2. Robot Hand

We used the commercial Allegro robotic hand, which has four fingers with 4 DOFs each. The hand consists of multiple smart servomotors that directly actuate the joints and spherical rubber fingertips shown in Figure 2b that are used for precise grasps. We chose the Allegro robotic hand as it was many times more affordable compared to other commercial human-like robotic hands such as Shadow Dexterous Hand [14] or SCHUNK SVH hand [30], and was easily serviceable due to its modular construction. As we wanted the robotic hand to perform delicate manipulation motions such as putting fingers through object holes for grasping, we made a slimmer, low-friction fingertip as shown in Figure 2c to prevent unwanted contact during the manipulation.

3.3. RGBD Camera

As shown in Figure 3b, we mounted an RGBD camera on the back of the hand to perceive the manipulation object from multiple viewpoints. Among the commercially available lightweight RGBD cameras, we chose the RealSense SR305 model as it was lightweight, had a very short minimum sensing distance of 20cm, and had much better depth accuracy compared to stereo-vision-based RGBD cameras according to our test. Figure 4 shows the comparison of the depth data from three different RGBD cameras, where the SR305 model showed distinctively better depth accuracy over others.

4. System Overview

Figure 5 shows the overall system architecture. When the system starts up, the manipulator moves through three capture positions so that the system can get the depth images of the object from three different viewpoints. These images are then converted to 3D point clouds and merged into a single 3D model using a 3D reconstruction algorithm [31,32] described in Section 5. After creating a 3D model, the manipulator captures the RGB image of the manipulation object to determine its object class using ResNet. Candidate data are obtained by selecting the 3D data having a corresponding object class value with the current object class. From these data, the optimal grasping pose is obtained by comparing the 3D model of candidate data and the current model using the KNN and point-set registration, which is described in more detail in Section 6.

After obtaining the grasping pose, the hand trajectory to reach the grasping pose is generated. DMPs, an imitation learning algorithm, are used for this task. It derives the grasping trajectory of the manipulator and robot hand from the movements performed by an operator. The moving data are collected by controlling a manipulator using a VR controller in a simulated environment, and specific DMP weights are generated for each manipulation object. The object class information of the current manipulation object, provided by ResNet, is used to generate the correct weight for the trajectory generation process. Then, the platform moves the manipulator according to the planned trajectory to reach the grasping pose.

5. Collecting Data for Imitation Learning

To estimate the grasping pose and plan the hand trajectory to grasp an object, the system uses three types of data: the 3D point cloud of the manipulation object, the RGB image of the manipulation object, and the hand trajectory data from human demonstration.

5.1. Three-Dimensional Data

As we can see in Figure 5, at the beginning of the process, the platform first captures depth images of the manipulation object from three different viewpoints, and these images are converted to point clouds. By transforming these point clouds according to the captured pose and merging the transformed data into a single point cloud, we can make a 3D model of the object. However, we cannot directly use the result of this process as it may contain points from backgrounds.

In order to remove the background points behind the object, we use the following filtering process. We first capture the depth image of the objects in the up pose, where the camera faces downwards, and first remove the points with z-coordinate lower than the preset threshold. Then, we calculate the minimum and maximum

x, y

values of the remaining points, which are used to filter out points when the system captures the depth image of the objects sideways. Although this method can remove most of the background points, it is hard to remove the point cloud of the tripod stand the object is placed upon. To solve this problem, as shown in Figure 6, we wrapped the tripod stand with black tape so that it was not detected by the IR-based RealSense sensor.

The 3D data acquisition algorithm is summarized in Algorithm 1, where

T_{u p}

,

T_{l e f t}

, and

T_{r i g h t}

are defined as a transform matrix [33] for the up pose, left pose, and right pose, respectively. The output of the above algorithm is shown in Figure 7. After creating a 3D model with this algorithm, it is labeled with the object name, trajectory data, and grasping pose. The following section explains this task.

Algorithm 1 Three-Dimensional Reconstruction Algorithm

1:: $c a p t u r i n g_p o s e$ ← $u p$
2:: while $c a p t u r i n g$ $o b j e c t$ do
3:: $p o i n t_c l o u d$ ← $c o n v T o P o i n t C l o u d (d e p t h_i m a g e)$
4:: $p o i n t_c l o u d$ ← $r e m o v e U n d e r S t a n d (p o i n t_c l o u d)$
5:: if $c a p t u r i n g_p o s e$ is $u p$ then
6:: $3 d_m o d e l . s e t E m p t y ()$
7:: $t r a n s f o r m_m a t$ ← $T_{u p}$
8:: $m a x_x, m a x_y, m i n_x, m i n_y$ ← $s e t X Y C o n d (p o i n t_c l o u d)$
9:: $x y_c o n d$ ← $[m a x_x, m a x_y, m i n_x, m i n_y]$
10:: $c a p t u r i n g_p o s e$ ← $l e f t$
11:: else if $c a p t u r i n g_p o s e$ is $l e f t$ then
12:: $t r a n s f o r m_m a t$ ← $T_{l e f t}$
13:: $c a p t u r i n g_p o s e$ ← $r i g h t$
14:: else if $c a p t u r i n g_p o s e$ is $r i g h t$ then
15:: $t r a n s f o r m_m a t$ ← $T_{r i g h t}$
16:: end if
17:: $p o i n t_c l o u d$ ← $f i l t e r i n g (p o i n t_c l o u d, x y_c o n d)$
18:: $p o i n t_c l o u d$ ← $t r a n s f o r m (t r a n s f o r m_m a t,$ $p o i n t_c l o u d)$
19:: $3 d_m o d e l . a p p e n d (p o i n t_c l o u d)$
20:: if $c a p t u r i n g_p o s e$ is $r i g h t$ then
21:: $b r e a k$
22:: end if
23:: end while

5.2. Trajectory Data

The trajectory data were collected by controlling the platform manually using a VR controller [34]. In this task, Valve’s Index Knuckles VR controller was used to control the platform. As this controller can track the precise 6D pose of the hand, as well as the positions of individual fingers using noncontact sensors, the controller can be used to naturally control both the robotic arm and a human-like robotic hand at the same time.

In controlling the platform with the VR controller, we interconnected SteamVR software with the OpenVR API to read the controller pose. As the base stations decide its pose, we connected its initial pose with the robot hand pose in the up pose and used the relative pose, based on the initial pose, in controlling the manipulator.

Starting from the up pose, we controlled the platform to grasp the captured object properly. In this process, trajectory data were collected by saving all joint angles of the manipulator and the robot hand whenever their states changed. With this method, we could get the trajectory data of the manipulator and robot hand from the up pose to the grasping pose. When the robot hand grasped the captured object, the trajectory, grasping pose, and object’s name were labeled in the 3D model.

We implemented direct teleoperation of both the manipulator and hand using the VR controller, as shown in Figure 8a. Although we could use this setup for training data acquisition, we used the simulation environment shown in Figure 8b extensively to speed up the training process while preventing possible hardware damage.

5.3. RGB Data

Since we recognized an object in a static environment, there was no need to consider changes in the object’s position or the environment. For this reason, we captured RGB data only in the up pose, changing an object’s orientation with a tripod stand. However, some objects should be grasped differently depending on whether the grasping pose is on the left side or right side. For example, when the handle of the brush was on the left, we labeled the image as brush-left, and when the handle was on the right, we labeled the image as brush-right. Figure 9 shows objects which should be labeled differently according to their direction, and Figure 10 shows different grasping strategies based on the object orientation.

6. Grasping Pose Prediction

Before grasping an object, we need to predict the grasping pose on the 3D model. Predicting the grasping pose proceeded in three steps: candidate data selection, transform matrix creation, and grasping pose prediction. As we can see in the process, this task was implemented by matching the current 3D model with candidate data using the following techniques: ResNet, KNN, and point-set registration. This section describes how these techniques were applied in each step.

6.1. Candidate Data Selection

In predicting the grasping pose, selecting candidate data is the first step to match them with the current 3D model. To choose candidate data from the 3D data, the manipulator captured the RGB image in the up pose and recognized its name with ResNet. Selecting candidate data was executed by matching the recognized name with the object information labeled in 3D data. If the information had the same object name as the recognized name, these data were selected as candidate data, and the selected data were used to get the transform matrix.

6.2. Transformation Matrix Creation

Although we can derive spatial transformation matrices between the current model and candidate models by matching them directly, this method has a high failure rate because of the partial difference, indicated in Figure 11. To solve this problem, we matched the current and candidate models’ divided clusters instead of matching the models themselves. The KNN was used for dividing the models. After dividing, as shown in Figure 12, we derived transformation matrices by matching clusters of the current model and candidate models using a point-set registration, and match rates between them by transforming the current model with the matrices and comparing it with the candidate model. Out of these matrices, the matrix with the best match rate was a transformation matrix. By matching divided clusters, they were analyzed several times with various parts. For example, when an object had one candidate data, the matching was done 25 times for different parts. As a result, matching failures were likely to be corrected in the process of matching, which could prevent failures caused by partial differences. To determine the optimal K value for the KNN, we performed several tests with varying K values and used a K-value of 5 that showed the lowest failure rate.

6.3. Grasping Pose Prediction

We needed to select the appropriate one to use for predicting the grasping pose from the candidate data. When selecting the appropriate data, we chose one that had a minimum error rate derived in Section 6.2. After finding the appropriate data, the object’s grasping pose was predicted using the labeled grasping pose and transform matrix. As we can see in Algorithm 2, the predicted grasping pose was derived by transforming the labeled grasping pose with a selected matrix. Figure 13 shows the result of Algorithm 2.

Algorithm 2 Predict_grasping_pose(rgb_image, 3d_data)

1:: $o b j e c t_c l a s s$ ← $R e s N e t (r g b_i m a g e)$ ▹ candidate data selection
2:: $c a n d i d a t e_d a t a$ ← $f i n d P o i n t C l o u d D a t a (o b j e c t_c l a s s)$
3:: $c l u s t e r s$ ← $K N N (3 d_d a t a)$ ▹ transform matrix creation
4:: $t r a n s f o r m_m a t r i c e s$ ← $[]$
5:: $s y n c_r a t e s$ ← $[]$
6:: $t r a j e c t o r i e s$ ← $[]$
7:: fori← $0;$ $i < c a n d i d a t e_d a t a . l e n g t h$ do
8:: $c l u s t e r s$ ← $K N N (c a n d i d a t e_d a t a [i] . 3 d_d a t a)$
9:: for j← $0;$ $j < c l u s t e r s . l e n g t h$ do
10:: for k← $0;$ $k < c l u s t e r s . l e n g t h$ do
11:: T← $P o i n t S e t R e g i s r a t i o n (c l u s t e r s [j], c l u s t e r s [k])$
12:: $t r a n s f o r m e d_d a t a$ ← $t r a n s f o r m (3 d_d a t a, T)$
13:: $e r r o r_r a t e$ ← $g e t S y n c R a t e (t r a n s f o r m e d_d a t a, c a n d i d a t e_d a t a [i])$
14:: $t r a n s f o r m_m a t r i c e s . a p p e n d (T)$
15:: $e r r o r_r a t e s . a p p e n d (e r r o r_r a t e)$
16:: $t r a j e c t o r i e s . a p p e n d (c a n d i d a t e_d a t a [i] . g r a s p i n g_t r a j e c t o r y)$
17:: end for
18:: end for
19:: end for
20:: $i n d e x$ ← $g e t M i n E l e m e n t I n d e x (e r r o r_r a t e s)$ ▹ grasping pose prediction
21:: $g r a s p i n g_t r a j e c t o r y$ ← $t r a j e c t o r i e s [i n d e x]$
22:: $g r a s p i n g_p o i n t$ ← $F K (g r a s p i n g_t r a j e c t o r y [l a s t] . s t a t e)$
23:: $s e l e c t e d_m a t r i x$ ← $t r a n s f o r m_m a t r i x_l i s t [i n d e x]$
24:: $p r e d i c t e d_g r a s p i n g_p o i n t$ ← $t r a n s f o r m (g r a s p i n g_p o i n t, s e l e c t e d_m a t r i x)$

7. Manipulation

Although we discovered the grasping pose in Section 6, grasping an object would fail if the manipulator was controlled by the wrong trajectory. The platform’s grasping trajectory can be derived from the simulation’s trajectory data collected with the VR controller. DMPs, the imitation learning algorithm, were used in this process. As the trajectory data from the up pose to a grasping pose were presented in joint space, we could obtain the positions, orientations, and hand motions data when grasping an object. From these data, the positions were used as learning data to get the weights of DMPs and to derive the position trajectory of the manipulator.

7.1. Extraction

As we can see in Algorithm 2, we have trajectory data included in candidate data when we predicted the grasping pose. As the manipulator’s states in the trajectory were presented in joint space, they could be transformed into a list of homogeneous matrices using FK [33,35,36], and we could get the positions and orientations of the end effector from them. The manipulator’s state data were used after transformation, but the robot hand’s state data were used without any transformation.

7.2. Position Trajectory

For planning the trajectory of the end effector’s position, we used the Pydmp, an open-source package that makes DMPs easy to use. In the source code of this package, the methodology is described as follows.

τ \dot{V} = K (X_{g} - X) - D V + (X_{g} - X_{0}) f

(1)

τ \dot{X} = \dot{V}

(2)

where vectors

V, X, X_{g},

and

X_{0}

indicate the current velocity, current position, goal position, and initial position, and constants

τ, K,

and D indicate the timescale, spring, and damping terms. Function f indicates the forcing function that determines the nonlinear behavior learned from trajectory data, and it is defined as follows:

f (s) = \frac{\sum_{i = 1}^{L} ω_{i} ψ_{i} (s) s}{\sum_{i = 1}^{L} ψ_{i} (s)}

(3)

where

ψ_{i} (s) = exp (- h_{i} {(s - c_{i})}^{2})

is a Gaussian basis function, with center

c_{i}

and width

h_{i}

, and

ω_{i}

are adjustable weights. Parameter L represents the number of Gaussian basis functions. The function f does not directly depend on time. Instead, it depends on a phase variable s, which monotonically changes from 1 towards 0 during a movement and is obtained by the equation

τ \dot{s} = - α s,

(4)

where

α

is a predefined constant. Using these equations,

f_{t a r g e t} (s)

is computed as follows.

f_{t a r g e t} (s) = \frac{- K (X_{g} - X) + D V + τ \dot{V}}{X_{g} - X_{0}}

(5)

Thus, finding the weights

ω_{i}

that minimize the error criterion

J = \sum_{s} {(f_{t a r g e t} (s) - f (s))}^{2}

is a linear regression problem, which can be solved efficiently.

7.3. Orientation and Hand Shape Trajectory

In planning the trajectory of the orientation and shape of the robot hand, similar to planning the positions, learning the trajectory using DMPs seemed to be the intuitively right way. However, when these data were learned with DMPs, the results showed poor performance. For this reason, we used another strategy in this work.

Before moving the position, the robot hand adjusted its orientation identically to the grasping pose. While adjusting the orientation, the robot hand shifted the hand shape to a grasping-ready pose, stored in the first element in the hand motion trajectory. The orientation and hand motion were not changed while moving the position, but there was no problem grasping an object. When the robot hand arrived at the grasping pose, it changed its pose to the grasping pose. After grasping an object, the robot hand bent all its joints to prevent slipping.

8. Experiment

8.1. Simulation Results

Before performing the test in the real world, we used the Webots [37] open-source robotic simulator to validate its accuracy and safety. As shown in Figure 14, the simulated model was a simplified version of the actual robot with the same specifications as the real platform. In the simulation environment, we used ten objects that we can see in Figure 15: a screwdriver, a cup, a carafe, a telephone, a valve, a hammer, a wrench, a skillet, scissors, and a lamp. For each object, ten trials were performed, where the orientation of the object was randomly determined for each trial. We automatically scored the trial result by checking the pose of the objects, and we considered the trial a success if the robot could hold the object for more than 10 s as Figure 16. Table 1 shows the result of the experiment, where the suggested algorithm autonomously picked up the 10 test objects with an average success rate of 97%.

8.2. Experimental Results

8.2.1. Official Test

The suggested system was validated through an accredited certification evaluation conducted by KTR (Korea Testing and Research Institute). The test was performed with five complex-shaped tools that we can see in Figure 17: scissors, spray bottle, brush, cup, and clamp.

8.2.2. Success Conditions

To formally clarify the success or failure of object grasping, we defined success conditions in the test as follows. Firstly, when grasping scissors, the thumb should pass through one side of the handle, and the index and middle fingers should pass through the other. In holding a brush, the robot hand’s thumb should touch one side of the handle, and the other fingers should wrap the other side simultaneously. In the case of a cup, the index and middle fingers should pass through the front of the handle. In addition, the thumb should contact these fingers at the back of the handle. Grasping a clamp is similar to grasping a brush. The thumb should touch one side of the handle, and the other fingers wrap the other side simultaneously. However, in the case of the brush, the thumb should be in the orange area of the clamp handle. Finally, a spray bottle should be gripped with the thumb on the back and the other fingers on the front. In addition, the index finger should be positioned on the trigger. Unlike other objects, the spray bottle cannot be grasped on the left side.

8.2.3. Results

Table 2 shows the result of the experiment performed in an accredited certification evaluation. Each of these objects was tested ten times by changing its orientation. The average success rate was 86%, and the success rates for all objects were more than 80%.

8.3. Discussions

Experimental results from the official test proved that the suggested robotic framework could perceive and grasp complex-shaped objects using an anthropomorphic robot hand with high accuracy. Compared to other studies that conducted similar work either using a low-DOF hand with soft fingers [38] or only in a simulated environment [39], our work had the main distinction of fully utilizing a high-DOF human-like robotic hand for delicate manipulation in reality, while having a comparable success rate as others. Although the motion generation itself took a very short time, less than 1 s, at the stage of writing this paper, we were using a very conservative and slow arm motion that made the whole process from data acquisition to grasping rather long, around 40 s. in future work, we plan to optimize the motion generation and control module and use inverse reinforcement learning [40,41,42,43,44,45] to find a better trajectory from default pose to grasping than human experts suggested.

9. Conclusions

This work presented a robotic system that grasped objects in various orientations using an imitation learning algorithm composed of a 3D reconstruction, ResNet, KNN, a point-set registration, and DMPs. We presented a novel, imitation learning based autonomous robotic grasping system that fully utilized a high-DOF human-like robotic hand for delicate manipulation using several machine learning approaches. Furthermore, its reliability was demonstrated by an official certification evaluation. As this research was limited to imitating grasping the object, future work will include a system that learns how to handle objects through inverse reinforcement learning using tactile data from a robot hand equipped with tactile sensors [46,47,48]. In addition, we plan to adopt inverse reinforcement learning based approaches [49] to optimize the motion-planning part of the system.

Author Contributions

Conceptualization, S.-J.Y.; methodology, S.-J.Y. and J.-B.Y.; software, J.-B.Y.; hardware, T.K., J.K., D.S. and J.P.; writing—original draft preparation, J.-B.Y.; writing—review and editing, S.-J.Y.; supervision, S.-J.Y.; project administration, S.-J.Y.; funding acquisition, S.-J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0008473, HRD Program for Industrial Innovation).

Acknowledgments

The authors also gratefully acknowledge Jongho Shin, Beomsoo Choi, and Yerin Moon for collecting high-quality data.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

DOF	Degree of freedom
ResNet	Residual neural network
DMP	Dynamic movement primitive
FK	Forward kinematics
IK	Inverse kinematics
KNN	K-means clustering
HMM	Hidden Markov model
DNN	Deep neural network
GMR	Gaussian mixture regression
GMM	Gaussian mixture model
STARHMM	State-based transitions autoregressive hidden Markov model
MoMP	Mixture of motor primitives

References

Grau, A.; Indri, M.; Bello, L.L.; Sauter, T. Industrial robotics in factory automation: From the early stage to the Internet of Things. In Proceedings of the IECON 2017—43rd Annual Conference of the IEEE Industrial Electronics Society, Beijing, China, 29 October–1 November 2017; pp. 6159–6164. [Google Scholar]
Grau, A.; Indri, M.; Lo Bello, L.; Sauter, T. Robots in Industry: The Past, Present, and Future of a Growing Collaboration with Humans. IEEE Ind. Electron. Mag. 2021, 15, 50–61. [Google Scholar] [CrossRef]
Land, N. Towards Implementing Collaborative Robots within the Automotive Industry. Master’s Thesis, University of Skövde, School of Engineering Science, Skövde, Sweden, 2018. Available online: http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-15925 (accessed on 16 November 2022).
Torres, P.; Arents, J.; Marques, H.; Marques, P. Bin-Picking Solution for Randomly Placed Automotive Connectors Based on Machine Learning Techniques. Electronics 2022, 11, 476. [Google Scholar] [CrossRef]
Ishige, M.; Umedachi, T.; Ijiri, Y.; Taniguchi, T.; Kawahara, Y. Blind Bin Picking of Small Screws through In-finger Manipulation with Compliant Robotic Fingers. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 9337–9344. [Google Scholar] [CrossRef]
Osa, T.; Pajarinen, J.; Neumann, G.; Bagnell, J.; Abbeel, P.; Peters, J. An Algorithmic Perspective on Imitation Learning. Found. Trends Robot. 2018, 7, 1–179. [Google Scholar] [CrossRef]
Mülling, K.; Kober, J.; Kroemer, O.; Peters, J. Learning to select and generalize striking movements in robot table tennis. Int. J. Robot. Res. 2013, 32, 263–279. [Google Scholar] [CrossRef]
Calinon, S.; D’halluin, F.; Sauser, E.L.; Caldwell, D.G.; Billard, A.G. Learning and Reproduction of Gestures by Imitation. IEEE Robot. Autom. Mag. 2010, 17, 44–54. [Google Scholar] [CrossRef] [Green Version]
Dyrstad, J.S.; Ruud Øye, E.; Stahl, A.; Reidar Mathiassen, J. Teaching a Robot to Grasp Real Fish by Imitation Learning from a Human Supervisor in Virtual Reality. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 7185–7192. [Google Scholar]
Zhang, T.; McCarthy, Z.; Jowl, O.; Lee, D.; Chen, X.; Goldberg, K.; Abbeel, P. Deep Imitation Learning for Complex Manipulation Tasks from Virtual Reality Teleoperation. arXiv 2018, arXiv:1710.04615. [Google Scholar]
Kim, S.; Shukla, A.; Billard, A. Catching Objects in Flight. IEEE Trans. Robot. 2014, 30, 1049–1065. [Google Scholar] [CrossRef]
Kroemer, O.; Daniel, C.; Neumann, G.; van Hoof, H.; Peters, J. Towards learning hierarchical skills for multi-phase manipulation tasks. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 1503–1510. [Google Scholar]
Feix, T.; Romero, J.; Schmiedmayer, H.B.; Dollar, A.M.; Kragic, D. The GRASP Taxonomy of Human Grasp Types. IEEE Trans. Hum.-Mach. Syst. 2016, 46, 66–77. [Google Scholar] [CrossRef]
OpenAI; Akkaya, I.; Andrychowicz, M.; Chociej, M.; Litwin, M.; McGrew, B.; Petron, A.; Paino, A.; Plappert, M.; Powell, G.; et al. Solving Rubik’s Cube with a Robot Hand. arXiv 2019, arXiv:1910.07113. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE Computer Society: Los Alamitos, CA, USA, 2016; pp. 770–778. [Google Scholar]
Na, S.; Xumin, L.; Yong, G. Research on k-means Clustering Algorithm: An Improved k-means Clustering Algorithm. In Proceedings of the 2010 Third International Symposium on Intelligent Information Technology and Security Informatics, Ji’an, China, 2–4 April 2010; pp. 63–67. [Google Scholar]
Zhu, H.; Guo, B.; Zou, K.; Li, Y.; Yuen, K.V.; Mihaylova, L.; Leung, H. A Review of Point Set Registration: From Pairwise Registration to Groupwise Registration. Sensors 2019, 19, 1191. [Google Scholar] [CrossRef] [Green Version]
Golyanik, V.; Ali, S.A.; Stricker, D. Gravitational Approach for Point Set Registration. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5802–5810. [Google Scholar]
Ijspeert, A.J.; Nakanishi, J.; Hoffmann, H.; Pastor, P.; Schaal, S. Dynamical Movement Primitives: Learning Attractor Models for Motor Behaviors. Neural Comput. 2013, 25, 328–373. [Google Scholar] [CrossRef] [PubMed]
Awad, M.; Khanna, R. Hidden Markov Model. In Efficient Learning Machines; Apress: Berkeley, CA, USA, 2015; pp. 81–104. [Google Scholar] [CrossRef] [Green Version]
Fabisch, A. gmr: Gaussian Mixture Regression. J. Open Source Softw. 2021, 6, 3054. [Google Scholar] [CrossRef]
Rao, C.; Liu, Y. Three-dimensional convolutional neural network (3D-CNN) for heterogeneous material homogenization. Comput. Mater. Sci. 2020, 184, 109850. [Google Scholar] [CrossRef]
Reynolds, D. Gaussian Mixture Models. In Encyclopedia Of Biometrics; Springer: New York, NY, USA, 2009. [Google Scholar]
Levine, S.; Wagener, N.; Abbeel, P. Learning Contact-Rich Manipulation Skills with Guided Policy Search. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 26–30 May 2015; pp. 156–163. [Google Scholar]
Pratheep, V.G.; Chinnathambi, M.; Priyanka, E.B.; Ponmurugan, P.; Thiagarajan, P. Design and Analysis of six DOF Robotic Manipulator. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1055, 012005. [Google Scholar] [CrossRef]
Kang, T.; Yi, J.B.; Song, D.; Yi, S.J. High-Speed Autonomous Robotic Assembly Using In-Hand Manipulation and Re-Grasping. Appl. Sci. 2021, 11, 37. [Google Scholar] [CrossRef]
Yi, J.-B.; Kang, T.; Song, D.; Yi, S.-J. Unified Software Platform for Intelligent Home Service Robots. Appl. Sci. 2020, 10, 5874. [Google Scholar] [CrossRef]
Ye, L.; Xiong, G.; Zeng, C.; Zhang, H. Trajectory tracking control of 7-DOF redundant robot based on estimation of intention in physical human-robot interaction. Sci. Prog. 2020, 103, 0036850420953642. [Google Scholar] [CrossRef]
Merckaert, K.; Convens, B.; Wu, C.-J.; Roncone, A.; Nicotra, M.M.; Vanderborght, B. Real-time motion control of robotic manipulators for safe human–robot coexistence. Robot. Comput.-Integr. Manuf. 2022, 73, 102223. [Google Scholar] [CrossRef]
Schwarz, M.; Lenz, C.; Rochow, A.; Schreiber, M.; Behnke, S. NimbRo Avatar: Interactive Immersive Telepresence with Force-Feedback Telemanipulation. arXiv 2021, arXiv:2109.13772. [Google Scholar]
Zhu, L. A Pipeline of 3D Scene Reconstruction from Point Clouds. Ph.D. Thesis, Maankäyttötieteiden laitos Department of Real Estate, Planning and Geoinformatics, Espoo, Finland, 2015. [Google Scholar]
Ko, J.; Ho, Y.S. 3D Point Cloud Generation Using Structure from Motion with Multiple View Images. In Proceedings of the Korean Institute Of Smart Media Fall Conference, Kwangju, Republic of Korea, 28–29 October 2016; pp. 91–92. [Google Scholar]
Lynch, K.M.; Park, F.C. Modern Robotics: Mechanics, Planning, and Control; Cambridge Univeristy Press: Cambridge, UK, 2017. [Google Scholar]
Schäfer, M.B.; Stewart, K.W.; Lösch, N.; Pott, P.P. Telemanipulation of an Articulated Robotic Arm Using a Commercial Virtual Reality Controller. Curr. Dir. Biomed. Eng. 2020, 6, 127–130. [Google Scholar] [CrossRef]
Constantin, D.; Lupoae, M.; Baciu, C.; Ilie, B. Forward Kinematic Analysis of an Industrial Robot. In Proceedings of the International Conference on Mechanical Engineering (ME 2015), Kuantan, Malaysia, 18–19 August 2015; pp. 90–95. [Google Scholar]
Craig, J.J. Introduction to Robotics: Mechanics and Control, 2nd ed.; Addison-Wesley Longman Publishing Co., Inc.: Boston, MA, USA, 1989. [Google Scholar]
Michel, O. Webots: Professional Mobile Robot Simulation. J. Adv. Robot. Syst. 2004, 1, 39–42. [Google Scholar]
Choi, C.; Schwarting, W.; DelPreto, J.; Rus, D. Learning Object Grasping for Soft Robot Hands. IEEE Robot. Autom. Lett. 2018, 3, 2370–2377. [Google Scholar] [CrossRef]
Zhu, T.; Wu, R.; Lin, X.; Sun, Y. Toward Human-Like Grasp: Dexterous Grasping via Semantic Representation of Object-Hand. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 15721–15731. [Google Scholar] [CrossRef]
Zhifei, S.; Er, J. A survey of inverse reinforcement learning techniques. Int. J. Intell. Comput. Cybern. 2012, 5, 293–311. [Google Scholar] [CrossRef]
Lee, M.S.; Admoni, H.; Simmons, R. Machine Teaching for Human Inverse Reinforcement Learning. Front. Robot. AI 2021, 8, 188. [Google Scholar] [CrossRef]
Vasquez, D.; Okal, B.; Arras, K.O. Inverse Reinforcement Learning algorithms and features for robot navigation in crowds: An experimental comparison. In Proceedings of the 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, Chicago, IL, USA, 14–18 September 2014; pp. 1341–1346. [Google Scholar]
Natarajan, S.; Kunapuli, G.; Judah, K.; Tadepalli, P.; Kersting, K.; Shavlik, J. Multi-Agent Inverse Reinforcement Learning. In Proceedings of the 2010 Ninth International Conference on Machine Learning and Applications, Washington, DC, USA, 12–14 December 2010; pp. 395–400. [Google Scholar]
Wang, Z.; Long, C.; Cong, G. Trajectory Simplification with Reinforcement Learning. In Proceedings of the 2021 IEEE 37th International Conference on Data Engineering (ICDE), Chania, Greece, 19–22 April 2021; pp. 684–695. [Google Scholar] [CrossRef]
Efroni, Y.; Merlis, N.; Mannor, S. Reinforcement Learning with Trajectory Feedback. Proc. AAAI Conf. Artif. Intell. 2021, 35, 7288–7295. [Google Scholar] [CrossRef]
Zhu, H.; Gupta, A.; Rajeswaran, A.; Levine, S.; Kumar, V. Dexterous Manipulation with Deep Reinforcement Learning: Efficient, General, and Low-Cost. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 3651–3657. [Google Scholar] [CrossRef] [Green Version]
Chen, T.; Xu, J.; Agrawal, P. A System for General In-Hand Object Re-Orientation. arXiv 2021, arXiv:2111.03043. [Google Scholar]
Andrychowicz, O.M.; Baker, B.; Chociej, M. Learning dexterous in-hand manipulation. Int. J. Robot. Res. 2020, 39, 3–20. [Google Scholar] [CrossRef]
Zhao, Z.; Liang, Y. Deep Inverse Reinforcement Learning for Route Choice Modeling. arXiv 2022, arXiv:2206.10598. [Google Scholar]

Figure 1. Hardware configuration.

Figure 2. Robot hand.

Figure 3. RGBD camera.

Figure 4. Point cloud data taken by three different RGBD cameras.

Figure 5. System overview.

Figure 6. Five objects used for the experiment. (a) Scissors. (b) Brush. (c) Cup. (d) Clamp. (e) Spray bottle.

Figure 7. Acquired 3D point cloud data of manipulation objects.

Figure 8. Hand trajectory recording using the VR Controller.

Figure 9. Examples of left-side and right-side objects.

Figure 10. Different grasping strategies based on the object orientation.

Figure 11. Partial difference.

Figure 12. Matching models with clusters.

Figure 13. Grasping pose prediction.

Figure 14. Simulated model.

Figure 15. Ten objects used in simulation experiment.

Figure 16. Results in simulation: grasping ten objects in various orientations.

Figure 17. Autonomous grasping of complex-shaped objects.

Table 1. Simulation results.

Object	Trial Num	Success Num	Failure Num	Success Rate	Failure Rate
Screwdriver	10	10	0	100%	0%
Cup	10	9	1	90%	10%
Carafe	10	10	0	100%	0%
Telephone	10	10	0	100%	0%
Valve	10	10	0	100%	0%
Hammer	10	10	0	100%	0%
Wrench	10	10	0	100%	0%
Skillet	10	10	0	100%	0%
Scissors	10	8	2	80%	20%
Lamp	10	10	0	100%	0%
Average	10	9.7	0.3	97%	3%

Table 2. Experiment results.

Object	Trial Num	Success Num	Failure Num	Success Rate	Failure Rate
Scissors	10	8	2	80%	20%
Spray bottle	10	9	1	90%	10%
Clamp	10	9	1	90%	10%
Cup	10	8	2	80%	20%
Brush	10	9	1	90%	10%
Average	10	8.6	1.4	86%	14%

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Yi, J.-B.; Kim, J.; Kang, T.; Song, D.; Park, J.; Yi, S.-J. Anthropomorphic Grasping of Complex-Shaped Objects Using Imitation Learning. Appl. Sci. 2022, 12, 12861. https://doi.org/10.3390/app122412861

AMA Style

Yi J-B, Kim J, Kang T, Song D, Park J, Yi S-J. Anthropomorphic Grasping of Complex-Shaped Objects Using Imitation Learning. Applied Sciences. 2022; 12(24):12861. https://doi.org/10.3390/app122412861

Chicago/Turabian Style

Yi, Jae-Bong, Joonyoung Kim, Taewoong Kang, Dongwoon Song, Jinwoo Park, and Seung-Joon Yi. 2022. "Anthropomorphic Grasping of Complex-Shaped Objects Using Imitation Learning" Applied Sciences 12, no. 24: 12861. https://doi.org/10.3390/app122412861

APA Style

Yi, J.-B., Kim, J., Kang, T., Song, D., Park, J., & Yi, S.-J. (2022). Anthropomorphic Grasping of Complex-Shaped Objects Using Imitation Learning. Applied Sciences, 12(24), 12861. https://doi.org/10.3390/app122412861

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Anthropomorphic Grasping of Complex-Shaped Objects Using Imitation Learning

Abstract

1. Introduction

2. Related Work

3. Hardware Configuration

3.1. Manipulator Arm

3.2. Robot Hand

3.3. RGBD Camera

4. System Overview

5. Collecting Data for Imitation Learning

5.1. Three-Dimensional Data

5.2. Trajectory Data

5.3. RGB Data

6. Grasping Pose Prediction

6.1. Candidate Data Selection

6.2. Transformation Matrix Creation

6.3. Grasping Pose Prediction

7. Manipulation

7.1. Extraction

7.2. Position Trajectory

7.3. Orientation and Hand Shape Trajectory

8. Experiment

8.1. Simulation Results

8.2. Experimental Results

8.2.1. Official Test

8.2.2. Success Conditions

8.2.3. Results

8.3. Discussions

9. Conclusions

Author Contributions

Funding

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI