Article

Learning by Demonstration of a Robot Using One-Shot Learning and Cross-Validation Regression with Z-Score

by Jaime Duque-Domingo 1,*,†, Miguel García-Gómez 1,†, Eduardo Zalama 1,2,† and Jaime Gómez-García-Bermejo 1,2,†
1 Institute of Advanced Production Technologies, Department of Systems Engineering and Automatics (ITAP-DISA), School of Industrial Engineers, University of Valladolid, Prado de la Magdalena 3-5, 47011 Valladolid, Spain
2 CARTIF Technological Center, Parque Tecnológico de Boecillo, 47151 Valladolid, Spain
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Electronics 2024, 13(17), 3365; https://doi.org/10.3390/electronics13173365
Submission received: 16 July 2024 / Revised: 4 August 2024 / Accepted: 6 August 2024 / Published: 24 August 2024
(This article belongs to the Section Artificial Intelligence)

Abstract
We introduce a One-Shot Learning system where a robot effectively learns how to manipulate objects by relying solely on the object’s name, a single image, and a visual example of a person picking it up. Once the robot has mastered picking up a new object, an audio command is all that is needed to prompt it to perform the action. Our approach heavily depends on synthetic data generation, which is crucial for training various detection and regression models. Additionally, we introduce a novel combined regression model called Cross-Validation Regression with Z-Score (CVR-ZS), which improves the robot’s grasp accuracy. The system also features a classifier that uses a cutting-edge text-encoding technique, allowing for flexible user prompts for object retrieval. The complete system includes a text encoder and classifier, an object detector, and the CVR-ZS regressor. This setup has been validated with a Niryo Ned robot.

1. Introduction

In collaborative robotics, replicating human-like object manipulation presents a significant challenge, especially when relying on just a single observation of human handling. The difficulty is amplified by the diversity in object shapes, sizes, and materials, requiring robots to exhibit a level of adaptability and dexterity comparable to that of humans.
Traditional robot programming methods often involve extensive manual coding and task-specific programming, which are cumbersome for applications needing quick adaptation and deployment. While manually guiding the robot’s gripper to the correct grasping position is a possible approach, it becomes impractical when objects can appear in various positions. Learning by imitation offers a promising alternative, where robots learn by observing and replicating human or expert demonstrations. This method can accelerate skill acquisition, but challenges remain due to the requirement for large datasets for each new task.
Few-Shot Learning (FSL) is a machine learning approach aimed at identifying or classifying new instances using only a few examples. In the most restrictive cases, such as One-Shot Learning (OSL), only one sample per category is available [1]. This is particularly valuable in scenarios where collecting data is challenging or costly, making it impractical to compile large datasets. OSL minimizes the need for extensive datasets by focusing on learning efficient representations, using prior knowledge, employing metric-based methods, and applying data augmentation and regularization techniques to generalize from just a few samples. Integrating OSL therefore addresses the limitations associated with traditional demonstration learning, minimizing the reliance on extensive datasets. In our application of OSL, the user is only required to provide the object’s name and a sample demonstrating how it is manipulated. The robot, leveraging this single example, learns to handle the object in a human-like manner. To initiate the pick-up action, a plain text prompt suffices, such as “robot, please pick up the screwdriver”. In our case, each sample consists of the object’s image and its name, since both visual and textual information are needed. Using only the image would require an additional object classification model to obtain the textual information.
To transform our system into an OSL framework, we introduce an integrated approach wherein synthetic data are generated for each image depicting how a person grasps an object. These synthetic data are then used to train a cascade model, incorporating a classical YOLO v8 (You Only Look Once) object detector [2] and an innovative combined regression model named CVR-ZS (Cross-Validation Regression with Z-score). Unlike a single regression model, this method achieves more accurate localization of the object pick-up position. Additionally, our approach is seamlessly integrated with a CLIP (Contrastive Language–Image Pre-training)-based text encoder [3], enabling the identification of user prompts that specify the object the user desires the robot to pick up. The efficacy of our method is substantiated through validation on a cost-effective robot platform.
The main contribution of our method is the presentation of the new CVR-ZS-based regression system and how it has been applied for object grasping under an OSL paradigm by augmenting data learned by human demonstration.
In Section 2, we review the main contributions to imitation learning. In Section 3, we present the developed method, explaining the data augmentation process, the training of the object detection model, and the classifier integrated with CLIP, as well as the operation of CVR-ZS. In Section 4, we discuss the most significant experiments and results obtained. Finally, the conclusions of our study are presented in Section 5.

2. Overview of Related Work

In the field of collaborative robotics, a significant amount of research has focused on advancing environment understanding and object manipulation, aiming to emulate human dexterity. Studies have highlighted the use of advanced perception techniques, such as computer vision, to enable robots to acquire a detailed understanding of their environment [4]. In addition, work has been conducted on developing algorithms and systems that mimic the human ability to grasp and manipulate various objects, addressing challenges such as adaptability to variable shapes [5,6]. Demonstration learning [7,8] and reinforcement learning [9,10] have also been applied to enable more natural interaction. Taken together [11], these advances bring robots closer to the ability to understand the world and manipulate objects in efficient and versatile ways, mirroring human abilities.
One approach for robotics involves reinforcement learning (RL), where the robot autonomously develops control policies through iterative experimentation. Lobbezoo et al. [12] combine traditional and reinforcement learning control in simulated and real environments to validate the RL approach for standard industrial tasks such as reach, grasp, and pick-and-place. Their aim is to bring intelligence to robotic control so that robotic operations can be completed without precisely defining the environment, constraints, and action plan. In contrast to conventional methods in which robots choose a grasping point and execute the action, Kalashnikov et al.’s method [13] uses a closed-loop vision-based control approach. In this approach, the robot continuously updates its grasping strategy based on the most recent observations, thus optimizing grasping success over the long term. The framework, called QT-Opt, is a reinforcement learning method based on self-supervised vision. It uses a large number of real-world grasping attempts to train a deep neural network with more than 1.2 million parameters. This approach has reported 96% success in closed-loop real-world grasps, even on unseen objects. In addition to achieving a high success rate, the method exhibits advanced behaviors. Using only RGB perceptual information from an over-the-shoulder camera, the system automatically learns regrasping strategies, probes objects to find the most effective grasps, learns to reposition objects, and performs other non-prehensile manipulations prior to grasping. It is also able to respond dynamically to perturbations and disturbances in its environment.
To solve the problem of where to grasp objects, Mahler et al. [14] utilize a synthetic dataset of 6.7 million point clouds, grasps, and analytic grasp metrics to train a model predicting the success probability of grasps from depth images. The model is trained using 3D models from [15] placed in randomized poses on a table. Their framework has been extended to vacuum suction grippers [16] and dual-arm robots [17]. Along this line, the authors of [18] create a dataset which focuses on real-world manipulable objects for robotics, providing category-level pose estimation and affordance prediction annotations. Annotation is streamlined, requiring only a single off-the-shelf camera and semi-automated processing, yielding high-quality 3D annotations without crowdsourcing. A practical multi-stage grasp detection method for a Kinova robot in stacked environments [19] uses a multi-stage network for multi-object grasp detection. The algorithm extracts an initial region of interest and continuously refines it to achieve accurate object detection and grasping results, outperforming existing algorithms on the VMRD dataset [20].
Within the learning by demonstration in robotics, some works have focused on the recognition of gestures using human skeletal coordinates and the use of neural networks and Markov models [21]. Other research explores the problem of learning by human demonstrations in inconsistent contexts [22], presenting an imitation learning framework for robots to reproduce behavior observed in human demonstrations with variations in viewpoints, operators, backgrounds, and positions. Another approach involves robot eye–hand coordination learning [23], where a robot learns a task function from human videos and uses it in real-time to guide its motions in the execution phase.
Hwang et al.’s research [24] focuses on a vision-based learning system for robotic arms; it detects human actions in real time and proposes an inductive object trajectory method for motion planning to replicate tasks demonstrated by a human user.
Other researchers combine learning from human demonstration with reinforcement learning. Sun et al. [25] propose a method in which the robot continuously interacts with the environment to master the skill. First, learning from demonstrations initializes the control policies, and then the robot begins to practice the demonstrated skill, receiving a reward from the environment after each practice round. Kamali et al. [26] use a virtual reality system to control the robot tool position and orientation with hand motions, while monitoring its movements in a 3D virtual reality environment, to give reference trajectories to a deep neural policy network for controlling the robot’s joint movements. They then leverage the Proximal Policy Optimization algorithm for deep reinforcement learning to train the policy network with the robot joint values and the reference trajectory observed at each iteration. Cabi et al. [27] learn control policies for a diverse set of manipulation tasks by generating observation–action pairs through teleoperation, scripted policies, or trained agents. Human preferences are then introduced to construct a reward function for a new task, which is used to annotate historical data for different tasks. This annotated dataset is employed for batch reinforcement learning to train manipulation policies from visual input.
The advances achieved in our work not only effectively address the inherent challenge of reinforcement learning, which often requires numerous simulations to achieve satisfactory results, but also present significant advantages in object manipulation. Our methodology learns to manipulate objects without relying on specific three-dimensional representations such as point clouds, which facilitates efficient generalization to a wide variety of objects.
Independence from teleoperation is another distinctive aspect of our research. Data collection is also straightforward, as our approach is based on the generation of synthetic data. This results in a One-Shot Learning system, where the ability to acquire knowledge from a minimal amount of data is a key characteristic. This innovation not only simplifies the data acquisition process but also paves the way for faster and more effective learning.
Finally, although ensemble models have been used to improve the results of regression problems [28,29,30], our approach integrates several regressors under the novel CVR-ZS paradigm.

3. Analysis of the System

Figure 1 illustrates the schematic representation of the proposed system. Using an automatic speech recognition (ASR) system based on a pre-trained OpenAI Whisper model [31], the text command (“take the screwdriver” in the figure example) is extracted and passed to a pre-trained CLIP model. CLIP [3] is a neural network trained on diverse (image, text) pairs, capable of understanding natural language instructions to predict the most relevant text snippet for a given image. This CLIP model outputs a 512-dimensional embedding vector, which is then fed into our custom classifier model. The classifier’s objective is to convert the CLIP output embedding vector into a specific class, with the number of classes corresponding to the objects we aim to recognize. To train the classifier, we generated multiple prompts for each object, such as “take the pen”, “grab the pen”, “pen”, and so on.
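As a minimal sketch of this front end, assuming the openai-whisper and OpenAI CLIP Python packages, and with the audio file name and model sizes chosen purely for illustration, the spoken command can be transcribed and encoded as follows:

```python
# Minimal sketch of the speech-to-text and text-encoding front end.
# Model choices ("base", "ViT-B/32") and the audio path are illustrative.
import torch
import clip      # OpenAI CLIP
import whisper   # OpenAI Whisper

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Speech to text with a pre-trained Whisper model.
asr = whisper.load_model("base")
order = asr.transcribe("command.wav")["text"]   # e.g., "take the screwdriver"

# 2) Text to a 512-dimensional CLIP embedding.
clip_model, _ = clip.load("ViT-B/32", device=device)
tokens = clip.tokenize([order]).to(device)
with torch.no_grad():
    text_embedding = clip_model.encode_text(tokens).float()   # shape: (1, 512)
```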
Simultaneously, we employ a YOLO model, trained with synthetic data generated by our custom generator, to recognize the desired objects. The output class from YOLO aligns with the output class of our CLIP classification model. This approach enables us to isolate the target object and pinpoint the grasping points more effectively. Omitting YOLO at this stage and applying a single regression model to the whole image yields inferior grasping-point detection, because the image is much larger and contains many small, scattered objects that confound the model. After identifying the object window, we invoke a specific regression model for grasping-point detection, with a distinct model for each added object. The training of these regression models involves learning grasping points from human demonstrations and augmenting the data using techniques discussed later.
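The detection-then-regression cascade can be sketched as follows, assuming the Ultralytics YOLO v8 Python API; the weights path, confidence threshold, regressor input size, and the dictionary of per-object regressors are illustrative assumptions:

```python
import torch
from torchvision import transforms
from ultralytics import YOLO  # YOLO v8

# Sketch of the cascade: detect the requested class, crop its window,
# and feed the crop to the object-specific regression model.
detector = YOLO("runs/detect/train/weights/best.pt")   # hypothetical weights path
to_tensor = transforms.Compose([transforms.ToPILImage(),
                                transforms.Resize((224, 224)),
                                transforms.ToTensor()])

def locate_and_regress(image, target_class, regressors, conf_threshold=0.5):
    result = detector(image)[0]
    for box, cls, conf in zip(result.boxes.xyxy, result.boxes.cls, result.boxes.conf):
        if int(cls) == target_class and float(conf) >= conf_threshold:
            x1, y1, x2, y2 = map(int, box.tolist())
            crop = to_tensor(image[y1:y2, x1:x2]).unsqueeze(0)   # isolated object window
            with torch.no_grad():
                return regressors[target_class](crop)            # -> (X, Y, M, A)
    return None   # below threshold: object considered absent from the workspace
```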
The regression model returns the grip points in image coordinates, which are converted into the robot’s grip points and orientation using a prior calibration of the robot. One of the highlights of our research is the use of the CVR-ZS method, which significantly improves the regression of the grip position.

3.1. Synthetic Data Generation

Figure 2 shows the synthetic data generation process. From a single image showing an object and how it is grasped, we extract the grasping coordinates using the MediaPipe hand model [32]. We choose the coordinates of the index finger and thumb, since these are the fingers that directly touch the object. The data generator randomly chooses from all registered objects and places them at positions in the robot’s workspace. We use a subtraction model to separate the object image from the background and apply scale, rotation, flip, and translation transformations. In our study, we focus on nine common objects, generating a dataset of 30,000 images through data augmentation.
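A possible sketch of the grasping-point extraction, assuming the MediaPipe Hands Python API and a demonstration image containing a single hand (the helper name is ours), is:

```python
import cv2
import mediapipe as mp

# Sketch: extract the two grasping points (thumb tip and index fingertip)
# from a single demonstration image using the MediaPipe Hands model.
mp_hands = mp.solutions.hands

def extract_grasp_points(image_bgr):
    with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        result = hands.process(cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB))
    if not result.multi_hand_landmarks:
        return None
    h, w = image_bgr.shape[:2]
    lm = result.multi_hand_landmarks[0].landmark
    thumb = lm[mp_hands.HandLandmark.THUMB_TIP]
    index = lm[mp_hands.HandLandmark.INDEX_FINGER_TIP]
    # Landmarks are normalized; convert to pixel coordinates.
    return (thumb.x * w, thumb.y * h), (index.x * w, index.y * h)
```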
The subtraction model incorporates the Structural Similarity Index [33]. Unlike the mean squared error (MSE), which may not accurately reflect perceived similarity, the structural similarity metric considers texture, enhancing the assessment of image similarity. From the similarity index, we apply a common Otsu threshold [34] and extract the segmentation mask (although, of course, any other segmentation technique can be used).
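A minimal sketch of this subtraction step, assuming scikit-image and grayscale inputs scaled to [0, 1] (function name ours), could look as follows:

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.metrics import structural_similarity
from skimage.filters import threshold_otsu

# Sketch: SSIM map between the empty workspace and the image with the object,
# followed by an Otsu threshold on the dissimilarity to get the object mask.
def object_mask(background_rgb, object_rgb):
    bg, im = rgb2gray(background_rgb), rgb2gray(object_rgb)
    _, ssim_map = structural_similarity(bg, im, full=True, data_range=1.0)
    dissimilarity = 1.0 - ssim_map                 # high where the object differs
    mask = dissimilarity > threshold_otsu(dissimilarity)
    return mask.astype(np.uint8)                   # binary segmentation mask
```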
The data augmentation method randomly chooses different objects to be positioned in the workspace. For each selected object, we perform various random transformations on the isolated object image, including rotations, translations, reduced scaling, and horizontal and vertical flips. The same transformations are applied to the grip points, computing their new positions. The selection of a random workspace position takes the object’s mask within the workspace into account and ensures it does not overlap with other objects.
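For illustration, one such joint transformation of the object image and its grip points could be sketched as follows; the rotation/scale ranges and the helper name are assumptions:

```python
import cv2
import numpy as np

# Sketch: a random rotation/scale applied jointly to the isolated object image
# and to its grasping points, so labels stay consistent with the image.
def random_rotate_scale(obj_img, grasp_points):
    h, w = obj_img.shape[:2]
    angle = np.random.uniform(0, 360)
    scale = np.random.uniform(0.6, 1.0)            # reduced scaling only
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    warped = cv2.warpAffine(obj_img, M, (w, h))
    pts = np.hstack([np.asarray(grasp_points), np.ones((len(grasp_points), 1))])
    new_pts = (M @ pts.T).T                        # transform the grip points too
    return warped, new_pts
```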
In addition to the grip coordinates, we create a file in YOLO format where we record the bounding box of each object. This allows us to train the detection model prior to regression. For the nine objects used, we generate 30,000 training images and 3000 evaluation images.

3.2. Object Detection with Text Prompts

Utilizing a CLIP pre-trained model, we transform a text prompt into a 512-element encoded vector. This output undergoes processing through our custom classification network, featuring two Fully Connected (FC) layers: 512 to 256 and 256 to n outputs, where n corresponds to the number of objects.
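A sketch of this classifier head is given below; the ReLU activation between the two FC layers is an assumption:

```python
import torch.nn as nn

# Sketch of the prompt classifier: 512 -> 256 -> n_objects on CLIP text embeddings.
class PromptClassifier(nn.Module):
    def __init__(self, n_objects: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(512, 256),
            nn.ReLU(),                      # activation between FC layers (assumed)
            nn.Linear(256, n_objects),
        )

    def forward(self, clip_embedding):      # (batch, 512) CLIP text embeddings
        return self.net(clip_embedding)     # (batch, n_objects) class logits
```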
When a new object is added, we generate a random set of hundreds of sentences that incorporate the name given by the user. For example, for a screwdriver, we train our model with sentences like “Take the screwdriver”, “Please take the screwdriver”, etc. All sentences related to an object are labeled with that object’s class, e.g., class 0.
From the synthetic data, which include recorded bounding box coordinates for each object, we train a YOLO v8 model [2]. The output from our classification model, integrated with CLIP, categorizes text prompts into n distinct objects, aligning with the number of objects detected by YOLO. When the confidence of the CLIP classification is over a threshold, we select the object returned by YOLO for that class. Additionally, for YOLO, we establish another threshold for confidence. If the confidence falls below this threshold, we consider the object to not be present in the workspace.
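Training the detector on the synthetic dataset can be sketched with the Ultralytics API as follows; the dataset YAML name, checkpoint, and hyperparameters are illustrative:

```python
from ultralytics import YOLO

# Sketch of training YOLO v8 on the synthetic dataset (9 classes,
# 30,000 training / 3000 validation images). Values are assumptions.
model = YOLO("yolov8n.pt")                   # pre-trained checkpoint
model.train(data="synthetic_objects.yaml",   # hypothetical dataset description file
            epochs=100, imgsz=640)
metrics = model.val()                        # evaluate on the synthetic validation split
```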

3.3. Cross-Validation Regression Integrated with Z-Score

Once the objects are isolated using YOLO, we use a regression model that takes the object image and returns four values: the center point of the object grasp (X, Y), the modulus of the distance between the two grasp points (M), and the angle of the vector joining them (A). Instead of directly using the coordinates of the two grasping points, our regression model works with the central grasping point, the distance modulus, and the angle of the connecting vector. Based on our experiments, this conversion improves the robot’s grip by avoiding problems when the regression returns a slightly displaced grip point.
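The conversion between the two grasping points and the (X, Y, M, A) parameterization can be sketched as follows (helper names are ours):

```python
import numpy as np

# Sketch of the point parameterization used by the regressor:
# two grasping points <-> (centre X, Y, modulus M, angle A).
def points_to_xyma(p1, p2):
    (x1, y1), (x2, y2) = p1, p2
    return ((x1 + x2) / 2, (y1 + y2) / 2,
            np.hypot(x2 - x1, y2 - y1),     # M: distance between the two points
            np.arctan2(y2 - y1, x2 - x1))   # A: angle of the joining vector

def xyma_to_points(x, y, m, a):
    dx, dy = (m / 2) * np.cos(a), (m / 2) * np.sin(a)
    return (x - dx, y - dy), (x + dx, y + dy)
```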
We train a regression model for each type of object, which improves the grasping results. However, in some cases the output may be inaccurate due to erroneous predictions of the model in less familiar situations, such as objects in close proximity. To improve accuracy, we propose an ensemble model integrated under a novel technique that we call CVR-ZS (Cross-Validation Regression integrated with Z-score). This technique improves the regression performance of diverse types of models and is based on training several models of an analogous nature with disjoint validation sets (see Figure 3). Based on the Cross-Validation Voting (CVV) method used for a classification problem [35], we spread the training set over k datasets and train k models. CVV was successfully used in its classification-oriented CP-CVV version with Siamese networks in the One-Shot Learning problem [36]. Unlike CVV and CP-CVV, which are focused on classification problems, in regression we could choose either to take the mean value of the regressors, which by itself improves on the individual regressor, or to discard the outputs that are prone to be outliers.
The data are previously randomized, and k different validation slots are selected. The rest of the data from each slot are used for training. Let τ be the set that includes all the samples of the complete training dataset. Let T i and V i be the training and validation sets corresponding to the slot i. These sets must verify Equations (1)–(6):
$\tau = \bigcup_{i=1}^{k} T_i$    (1)
$\bigcap_{i=1}^{k} T_i = \emptyset$    (2)
$\tau = \bigcup_{i=1}^{k} V_i$    (3)
$\bigcap_{i=1}^{k} V_i = \emptyset$    (4)
$\tau = T_i \cup V_i, \quad \forall i \in \{1, \dots, k\}$    (5)
$T_i \cap V_i = \emptyset, \quad \forall i \in \{1, \dots, k\}$    (6)
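This partitioning can be sketched, for example, with scikit-learn’s KFold, which produces exactly such pairwise disjoint validation slots (the helper name is ours):

```python
import numpy as np
from sklearn.model_selection import KFold

# Sketch of the CVR data partitioning: k disjoint validation slots, each model
# trained on the remaining data, in line with Equations (1)-(6).
def make_cvr_slots(n_samples: int, k: int = 5, seed: int = 0):
    kfold = KFold(n_splits=k, shuffle=True, random_state=seed)
    indices = np.arange(n_samples)
    # Each element is (train_idx, val_idx); the validation slots are disjoint
    # and their union covers the whole training set.
    return list(kfold.split(indices))

slots = make_cvr_slots(30000, k=5)
```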
Then, k models of the same nature are trained, one with each pair of training and validation slots. The use of an ensemble model allows us to obtain a more robust and accurate model. In addition, integrating it with the Z-score allows us to discard outliers in the output of each test. During training, the k models are trained separately. During inference, Z-scores are computed for each output column, X, Y, M, and A, as defined in Equation (7):
$X_{ZS_i} = \frac{X_i - \mu_X}{\sigma_X}, \quad Y_{ZS_i} = \frac{Y_i - \mu_Y}{\sigma_Y}, \quad M_{ZS_i} = \frac{M_i - \mu_M}{\sigma_M}, \quad A_{ZS_i} = \frac{A_i - \mu_A}{\sigma_A}$    (7)
For each model, we compute the quadratic sum of its Z-score values, as shown in Equation (8). This calculation enables us to filter out outliers and keep the values that closely align with the consensus among most models:
$z_i = X_{ZS_i}^2 + Y_{ZS_i}^2 + M_{ZS_i}^2 + A_{ZS_i}^2$    (8)
In CVR-ZS, only the integer half plus one (s) of the models involved in CVV (k) are selected to compute the regression, namely those with the lowest Z-scores (see Equation (9)). This allows most of the outliers to be rejected. Averaging the selected s models significantly improves the test results:
$s = \frac{k+1}{2}$    (9)
For k = 5, the s = 3 models with the lowest computed z_i are used. The output of the ensemble model during inference is the mean of each output value over the selected models, as shown in Equation (10):
$X = \frac{\sum_{i \in \mathrm{selected\_models}} X_i}{\mathrm{len}(\mathrm{selected\_models})}, \quad Y = \frac{\sum_{i \in \mathrm{selected\_models}} Y_i}{\mathrm{len}(\mathrm{selected\_models})}, \quad M = \frac{\sum_{i \in \mathrm{selected\_models}} M_i}{\mathrm{len}(\mathrm{selected\_models})}, \quad A = \frac{\sum_{i \in \mathrm{selected\_models}} A_i}{\mathrm{len}(\mathrm{selected\_models})}$    (10)
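A compact sketch of this inference-time combination, with an illustrative set of k = 5 predictions, is shown below (function name ours):

```python
import numpy as np

# Sketch of the CVR-ZS combination at inference time (Equations (7)-(10)).
# `preds` holds the (X, Y, M, A) output of each of the k regressors.
def cvr_zs_combine(preds: np.ndarray) -> np.ndarray:
    k = preds.shape[0]                                    # preds: (k, 4)
    z = (preds - preds.mean(axis=0)) / preds.std(axis=0)  # per-column Z-scores, Eq. (7)
    zi = (z ** 2).sum(axis=1)                             # quadratic Z-score, Eq. (8)
    s = (k + 1) // 2                                      # half plus one, Eq. (9)
    selected = np.argsort(zi)[:s]                         # keep the s most consensual models
    return preds[selected].mean(axis=0)                   # Eq. (10)

# Example with k = 5 regressors: the outlying fifth prediction is discarded.
outputs = np.array([[0.50, 0.40, 0.20, 1.1],
                    [0.52, 0.41, 0.22, 1.0],
                    [0.49, 0.39, 0.21, 1.2],
                    [0.51, 0.42, 0.19, 1.1],
                    [0.90, 0.10, 0.60, 2.5]])
print(cvr_zs_combine(outputs))
```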
There is one main reason for selecting five estimators. In [35], a comparison of the CVV method for classification using different numbers of estimators is carried out. It concludes that accuracy improves up to five estimators but can stabilize or even worsen beyond that value. In our experiments, we also observed this behavior in the preliminary results. Furthermore, operating with a larger number of estimators would make the method computationally very expensive.
We use a ResNet-34 convolutional backbone [37], pre-trained on ImageNet, for each regression model. The classification layers are replaced by a ReLU activation, an average pooling, and a normalization layer. Finally, a dense layer connects the 512 outputs of the normalization with the four outputs of the model.
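A possible sketch of one such regressor, assuming a recent torchvision, is given below; the exact normalization layer (here BatchNorm1d) and the class name are assumptions:

```python
import torch.nn as nn
from torchvision.models import resnet34, ResNet34_Weights

# Sketch of one regressor: ImageNet pre-trained ResNet-34 backbone with its
# classification head replaced by ReLU, average pooling, normalization,
# and a 512 -> 4 dense layer, as described above.
class GraspRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet34(weights=ResNet34_Weights.IMAGENET1K_V1)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.head = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.BatchNorm1d(512),        # normalization layer (assumed type)
            nn.Linear(512, 4),          # outputs (X, Y, M, A)
        )

    def forward(self, x):
        return self.head(self.features(x))
```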
An advantage of this ResNet-34 backbone is that it is lightweight, which reduces the computational cost of training. Although other models could improve the regression accuracy, this one is sufficient for the robot to grasp the objects correctly. In any case, the CVR-ZS approach can also be used with backbones of a different nature, such as ResNeXt [38], EfficientNet [39], RegNet [40], or ConvNeXt [41]. Each regressor of a different nature is trained with the same slots used by the others, taking advantage of the strengths each model offers on the same data partitioning. Finally, all regressors are combined as detailed earlier.

4. Experiments and Results

The experiments were conducted using a Niryo Ned robot [42], which is sufficiently small and manageable to be used in a human environment. It is 3D printed and has 6 DoFs. In the experimental scenario, we marked the robot workspace and deployed a tripod with an RGB camera. The robot was calibrated using Zhang’s method [43]. The robot was connected to a computer with an RTX 3090 GPU for faster operation and training.
First, data acquisition was performed, consisting of the name of each object, an image of the isolated object on the workspace, and an image of how a person picks up the object. For the nine evaluated objects (nut, pen, rubber, screw, screwdriver, spoon, glue stick, USB stick, and lip corrector), we applied our data augmentation method to generate 30,000 training images in which different objects were randomly mixed. In addition, we generated 3000 test images. For each image, we generated the grasping coordinates of each object from the person’s thumb and index finger, as well as the bounding boxes of the objects.
Next, we trained the models: a YOLO v8 model, a classifier based on the CLIP text encodings of sentences generated from the object names, and finally five regression models per object using CVR-ZS. The whole synthetic data generation process took 24 h to complete.
For experimentation, we evaluated the results of CVR-ZS on the test set, as well as the results of object picking by the real robot. It should be noted that, in the latter case, the results were assessed visually.
Regarding object detection, the results obtained with new synthetic data when evaluating the YOLO v8 metrics are satisfactory, confirming the efficiency and robustness of this architecture for object detection (see Figure 4). The loss curves show steady improvement, evidencing the model’s ability to adjust to intricate image details and learn meaningful representations of the objects in question.
Regarding the CVR-ZS method, each regressor is based on a ResNet-34 backbone connected to an FC regression layer that links the 512 features to four outputs (coordinates of the central gripping point, modulus, and direction). We used the L1 loss, i.e., the mean element-wise absolute difference between the actual and predicted values. For each object, we trained five regressors by partitioning the training dataset as explained above (five disjoint validation sets). Each model was trained with fine-tuning and an Adam optimizer, reducing the learning rate from 0.006 to 0.001 and then to 0.0001. Each fine-tuning block was trained for up to 100 epochs, with a patience of 20 epochs to stop training if the validation loss did not decrease.
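The per-model training schedule can be sketched as follows, assuming standard PyTorch data loaders; the loop structure and names are illustrative, not the authors’ exact implementation:

```python
import torch
import torch.nn as nn

# Sketch of the per-model schedule: L1 loss, Adam, three fine-tuning blocks
# with decreasing learning rates, each up to 100 epochs with patience 20.
def train_regressor(model, train_loader, val_loader, device="cuda"):
    model.to(device)
    criterion = nn.L1Loss()
    for lr in (0.006, 0.001, 0.0001):                 # staged fine-tuning blocks
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        best_val, patience = float("inf"), 0
        for epoch in range(100):
            model.train()
            for images, targets in train_loader:
                optimizer.zero_grad()
                loss = criterion(model(images.to(device)), targets.to(device))
                loss.backward()
                optimizer.step()
            model.eval()
            with torch.no_grad():
                val = sum(criterion(model(x.to(device)), y.to(device)).item()
                          for x, y in val_loader) / len(val_loader)
            if val < best_val:
                best_val, patience = val, 0
            else:
                patience += 1
                if patience >= 20:                    # early stopping within the block
                    break
    return model
```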
The integration of the models was performed using Z-scores. Table 1 compares the average result of a model trained without CVR-ZS (over five runs); a model that simply averages the output values of the five regressors trained with CVR; a model that applies clustering to the regressor outputs, distributing them into two classes and selecting the class with the largest number of samples; and the CVR-ZS model.
For the clustering variant, a grid search was carried out over different methods and parameters, including classic K-Means, Feature Agglomeration, Bisecting K-Means, MiniBatch K-Means, Spectral Clustering, and DBSCAN. The best results were obtained with classic K-Means and two output classes. The clustering algorithm partitions the outputs of the k regressors into two classes; the main class is the one with the largest number of samples, and its centroid is the regression output. With clustering, we sought to isolate the regressors whose results lie farther from the centroid of the main class.
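The clustering-based combination can be sketched with scikit-learn’s K-Means as follows (function name ours):

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of the clustering baseline: partition the k regressor outputs into two
# classes with K-Means and return the centroid of the larger class.
def kmeans_combine(preds: np.ndarray) -> np.ndarray:    # preds: (k, 4)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(preds)
    main_class = np.bincount(km.labels_).argmax()        # class with most samples
    return km.cluster_centers_[main_class]               # centroid as regression output
```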
The errors obtained for each object are shown graphically in Figure 5 and in Table 2. The results show that the model improves the regression for all objects. Even objects like the nut, characterized by symmetry and by the difficulty of determining the correct grip angle, show a reduction in error.
Finally, we evaluated the success of the method with the Niryo Ned robot, performing 20 pick-and-place movements for each object. Table 3 shows the results of this experiment, which are mostly satisfactory. However, some objects proved more difficult for the robot to pick up.
Despite the positive results, some objects posed notable difficulties due to the limitations of the gripper and the precision of the robot’s hand–eye calibration. One prominent source of error was the narrowness of the gripper: some objects, particularly the USB stick and the rubber, fit tightly within it, requiring a higher level of accuracy during pick-up operations. Another crucial factor was the hand–eye calibration of the robot, complicated by the position and angle of the external camera, which introduced a degree of inaccuracy when commanding the robot to a given image position. This issue led to occasional misplacements and reduced the overall success rate of the picking motions. The optimal location for the external camera would be perpendicular to the robot’s working plane, which would correct possible calibration errors and reduce the chance of occlusions in the photo used by the network, but this was not possible in our setup because a suitable tripod was unavailable. This makes the results obtained even more satisfactory.
The setup used for this experiment is illustrated in Figure 6 and Figure 7, which show the results of instructing the robot to take a spoon and a pen, respectively. More details about the setup and results can be found in the Supplementary Materials (Video S1).

5. Conclusions

Imitation learning via One-Shot Learning has emerged as one of the most compelling challenges in collaborative robotics in recent years. Our architecture tackles this challenge effectively by utilizing synthetic data generation and training a series of interconnected models. This system includes a model for classifying objects for pick up using speech-to-text and text prompts through the CLIP encoder, an object detector based on YOLO v8, and a novel combined regression model called CVR-ZS. The CVR-ZS model enhances the robot’s accuracy and holds promise for application in various domains, significantly reducing errors compared to individual regression models.
We conducted experiments with a Niryo Ned robot, which is compact and well suited for human environments. Notably, the methods we used are adaptable to different types of robots. Despite some technical limitations, such as the gripper size and minor precision issues, the gripping results were highly promising.
For future research, we propose investigating the system’s ability to learn new objects in real-time through continuous learning techniques. We also aim to extend the use of CVR-ZS to other regression problems. Real-time learning could improve our model’s adaptability when introducing new objects, potentially eliminating the need for retraining. Expanding CVR-ZS to other problem domains will help gauge the extent of improvement it can offer and broaden its applications beyond robotics.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/electronics13173365/s1, Video S1: Pick & place experiment with CVR-ZS.

Author Contributions

Conceptualization, J.D.-D., E.Z. and J.G.-G.-B.; methodology, J.D.-D., E.Z. and J.G.-G.-B.; software, J.D.-D. and M.G.-G.; validation, J.D.-D. and M.G.-G.; formal analysis, J.D.-D., E.Z. and J.G.-G.-B.; investigation, J.D.-D., E.Z. and J.G.-G.-B.; resources, E.Z. and J.G.-G.-B.; data curation, J.D.-D.; writing—original draft preparation, J.D.-D.; writing—review and editing, J.D.-D., M.G.-G., E.Z. and J.G.-G.-B.; visualization, J.D.-D. and M.G.-G.; supervision, E.Z. and J.G.-G.-B.; project administration, E.Z. and J.G.-G.-B.; funding acquisition, E.Z. and J.G.-G.-B. All authors have read and agreed to the published version of the manuscript.

Funding

This research has received funding from projects ROSOGAR PID2021-123020OB-I00 funded by MCIN/AEI/10.13039/501100011033/FEDER, UE, and EIAROB funded by Consejería de Familia of the Junta de Castilla y León—Next Generation EU.

Data Availability Statement

Dataset available on request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CLIP      Contrastive Language–Image Pre-training
CVR-ZS    Cross-Validation Regression with Z-score
FSL       Few-Shot Learning
OSL       One-Shot Learning
YOLO      You Only Look Once

References

  1. Wang, Y.; Yao, Q.; Kwok, J.T.; Ni, L.M. Generalizing from a few examples: A survey on few-shot learning. ACM Comput. Surv. (CSUR) 2020, 53, 1–34. [Google Scholar] [CrossRef]
  2. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO v8. 2023. Available online: https://docs.ultralytics.com/models/yolov8/ (accessed on 15 July 2024).
  3. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  4. Zhao, M.; Zuo, G.; Yu, S.; Gong, D.; Wang, Z.; Sie, O. Position-aware pushing and grasping synergy with deep reinforcement learning in clutter. CAAI Trans. Intell. Technol. 2024, 9, 738–755. [Google Scholar] [CrossRef]
  5. Kleeberger, K.; Bormann, R.; Kraus, W.; Huber, M.F. A survey on learning-based robotic grasping. Curr. Robot. Rep. 2020, 1, 239–249. [Google Scholar] [CrossRef]
  6. Newbury, R.; Gu, M.; Chumbley, L.; Mousavian, A.; Eppner, C.; Leitner, J.; Bohg, J.; Morales, A.; Asfour, T.; Kragic, D.; et al. Deep learning approaches to grasp synthesis: A review. IEEE Trans. Robot. 2023, 39, 3994–4015. [Google Scholar] [CrossRef]
  7. Ravichandar, H.; Polydoros, A.S.; Chernova, S.; Billard, A. Recent advances in robot learning from demonstration. Annu. Rev. Control Robot. Auton. Syst. 2020, 3, 297–330. [Google Scholar] [CrossRef]
  8. Fang, B.; Jia, S.; Guo, D.; Xu, M.; Wen, S.; Sun, F. Survey of imitation learning for robotic manipulation. Int. J. Intell. Robot. Appl. 2019, 3, 362–369. [Google Scholar] [CrossRef]
  9. Brunke, L.; Greeff, M.; Hall, A.W.; Yuan, Z.; Zhou, S.; Panerati, J.; Schoellig, A.P. Safe learning in robotics: From learning-based control to safe reinforcement learning. Annu. Rev. Control Robot. Auton. Syst. 2022, 5, 411–444. [Google Scholar] [CrossRef]
  10. Singh, B.; Kumar, R.; Singh, V.P. Reinforcement learning in robotic applications: A comprehensive survey. Artif. Intell. Rev. 2022, 55, 945–990. [Google Scholar] [CrossRef]
  11. Zou, Q.; Xiong, K.; Fang, Q.; Jiang, B. Deep imitation reinforcement learning for self-driving by vision. CAAI Trans. Intell. Technol. 2021, 6, 493–503. [Google Scholar] [CrossRef]
  12. Lobbezoo, A.; Kwon, H.J. Simulated and Real Robotic Reach, Grasp, and Pick-and-Place Using Combined Reinforcement Learning and Traditional Controls. Robotics 2023, 12, 12. [Google Scholar] [CrossRef]
  13. Kalashnikov, D.; Irpan, A.; Pastor, P.; Ibarz, J.; Herzog, A.; Jang, E.; Quillen, D.; Holly, E.; Kalakrishnan, M.; Vanhoucke, V.; et al. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv 2018, arXiv:1806.10293. [Google Scholar]
  14. Mahler, J.; Liang, J.; Niyaz, S.; Laskey, M.; Doan, R.; Liu, X.; Ojea, J.A.; Goldberg, K. Dex-Net 2.0: Deep Learning to Plan Robust Grasps with Synthetic Point Clouds and Analytic Grasp Metrics. arXiv 2017, arXiv:1703.09312. [Google Scholar]
  15. Mahler, J.; Pokorny, F.T.; Hou, B.; Roderick, M.; Laskey, M.; Aubry, M.; Kohlhoff, K.; Kröger, T.; Kuffner, J.; Goldberg, K. Dex-Net 1.0: A cloud-based network of 3D objects for robust grasp planning using a Multi-Armed Bandit model with correlated rewards. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 1957–1964. [Google Scholar] [CrossRef]
  16. Mahler, J.; Matl, M.; Liu, X.; Li, A.; Gealy, D.; Goldberg, K. Dex-net 3.0: Computing robust vacuum suction grasp targets in point clouds using a new analytic model and deep learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 5620–5627. [Google Scholar]
  17. Mahler, J.; Matl, M.; Satish, V.; Danielczuk, M.; DeRose, B.; McKinley, S.; Goldberg, K. Learning ambidextrous robot grasping policies. Sci. Robot. 2019, 4, eaau4984. [Google Scholar] [CrossRef]
  18. Guo, A.; Wen, B.; Yuan, J.; Tremblay, J.; Tyree, S.; Smith, J.; Birchfield, S. HANDAL: A Dataset of Real-World Manipulable Object Categories with Pose Annotations, Affordances, and Reconstructions. arXiv 2023, arXiv:2308.01477. [Google Scholar]
  19. Dong, X.; Jiang, Y.; Zhao, F.; Xia, J. A Practical Multi-Stage Grasp Detection Method for Kinova Robot in Stacked Environments. Micromachines 2023, 14, 117. [Google Scholar] [CrossRef]
  20. Zhang, H.; Lan, X.; Zhou, X.; Tian, Z.; Zhang, Y.; Zheng, N. Visual Manipulation Relationship Network for Autonomous Robotics. In Proceedings of the 2018 IEEE-RAS 18th International Conference on Humanoid Robots (Humanoids), Beijing, China, 6–9 November 2018; pp. 118–125. [Google Scholar] [CrossRef]
  21. Domingo, J.D.; Gómez-García-Bermejo, J.; Zalama, E. Visual recognition of gymnastic exercise sequences. Application to supervision and robot learning by demonstration. Robot. Auton. Syst. 2021, 143, 103830. [Google Scholar] [CrossRef]
  22. Qian, Z.; You, M.; Zhou, H.; Xu, X.; He, B. Robot learning from human demonstrations with inconsistent contexts. Robot. Auton. Syst. 2023, 166, 104466. [Google Scholar] [CrossRef]
  23. Jin, J.; Petrich, L.; Dehghan, M.; Zhang, Z.; Jagersand, M. Robot eye-hand coordination learning by watching human demonstrations: A task function approximation approach. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 6624–6630. [Google Scholar] [CrossRef]
  24. Hwang, P.J.; Hsu, C.C.; Chou, P.Y.; Wang, W.Y.; Lin, C.H. Vision-Based Learning from Demonstration System for Robot Arms. Sensors 2022, 22, 2678. [Google Scholar] [CrossRef]
  25. Sun, X.; Li, J.; Kovalenko, A.V.; Feng, W.; Ou, Y. Integrating Reinforcement Learning and Learning From Demonstrations to Learn Nonprehensile Manipulation. IEEE Trans. Autom. Sci. Eng. 2023, 20, 1735–1744. [Google Scholar] [CrossRef]
  26. Kamali, K.; Bonev, I.A.; Desrosiers, C. Real-time Motion Planning for Robotic Teleoperation Using Dynamic-goal Deep Reinforcement Learning. In Proceedings of the 2020 17th Conference on Computer and Robot Vision (CRV), Ottawa, ON, Canada, 13–15 May 2020; pp. 182–189. [Google Scholar] [CrossRef]
  27. Cabi, S.; Colmenarejo, S.G.; Novikov, A.; Konyushkova, K.; Reed, S.; Jeong, R.; Zolna, K.; Aytar, Y.; Budden, D.; Vecerik, M.; et al. Scaling data-driven robotics with reward sketching and batch reinforcement learning. arXiv 2020, arXiv:1909.12200. [Google Scholar]
  28. Ren, Y.; Zhang, L.; Suganthan, P.N. Ensemble classification and regression-recent developments, applications and future directions. IEEE Comput. Intell. Mag. 2016, 11, 41–53. [Google Scholar] [CrossRef]
  29. Sagi, O.; Rokach, L. Ensemble learning: A survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, 8, e1249. [Google Scholar] [CrossRef]
  30. Ganaie, M.A.; Hu, M.; Malik, A.; Tanveer, M.; Suganthan, P. Ensemble deep learning: A review. Eng. Appl. Artif. Intell. 2022, 115, 105151. [Google Scholar] [CrossRef]
  31. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 28492–28518. [Google Scholar]
  32. Zhang, F.; Bazarevsky, V.; Vakunov, A.; Tkachenka, A.; Sung, G.; Chang, C.L.; Grundmann, M. Mediapipe hands: On-device real-time hand tracking. arXiv 2020, arXiv:2006.10214. [Google Scholar]
  33. Wang, Z.; Bovik, A.C. Mean squared error: Love it or leave it? A new look at signal fidelity measures. IEEE Signal Process. Mag. 2009, 26, 98–117. [Google Scholar] [CrossRef]
  34. Xu, X.; Xu, S.; Jin, L.; Song, E. Characteristic analysis of Otsu threshold and its applications. Pattern Recognit. Lett. 2011, 32, 956–961. [Google Scholar] [CrossRef]
  35. Domingo, J.D.; Aparicio, R.M.; Rodrigo, L.M.G. Cross Validation Voting for Improving CNN Classification in Grocery Products. IEEE Access 2022, 10, 20913–20925. [Google Scholar] [CrossRef]
  36. Duque-Domingo, J.; Aparicio, R.M.; Rodrigo, L.M.G. One Shot Learning with class partitioning and cross validation voting (CP-CVV). Pattern Recognit. 2023, 143, 109797. [Google Scholar] [CrossRef]
  37. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  38. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar]
  39. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  40. Radosavovic, I.; Kosaraju, R.P.; Girshick, R.; He, K.; Dollár, P. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10428–10436. [Google Scholar]
  41. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. arXiv 2022, arXiv:2201.03545. [Google Scholar]
  42. Niryo. Niryo Ned. 2021. Available online: https://docs.niryo.com/robots/ned/ (accessed on 15 July 2024).
  43. Zhang, Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1330–1334. [Google Scholar] [CrossRef]
Figure 1. Scheme of our system.
Figure 2. Synthetic data generation.
Figure 3. CVR-ZS model with ResNet-34 CNN backbones.
Figure 4. Results of YOLO detection.
Figure 5. Improved loss of object regression.
Figure 6. Picking up a spoon.
Figure 7. Picking up a pen.
Table 1. Regressor comparison (L1 loss).
Model | L1 Loss
Normal regressor without CVR-ZS (average of 5 runs) | 0.0578
CVR with output as average of k = 5 models | 0.0505
CVR with K-Means clustering | 0.0490
CVR-ZS | 0.0482
Table 2. Regression per object (L1 loss).
Object | L1 Loss (without CVR-ZS) | L1 Loss (with CVR-ZS)
nut | 0.0788 | 0.0694
pen | 0.0476 | 0.0370
rubber | 0.0700 | 0.0611
screw | 0.0592 | 0.0522
screwdriver | 0.0655 | 0.0372
spoon | 0.0397 | 0.0328
glue stick | 0.0523 | 0.0473
usb stick | 0.0517 | 0.0475
lip corrector | 0.0552 | 0.0499
Table 3. Pick-and-place success rate.
Object | Success
Nut | 0.90
Pen | 0.90
Rubber | 0.70
Screw | 0.85
Screwdriver | 0.90
Spoon | 0.75
Glue stick | 0.80
USB stick | 0.65
Lip corrector | 0.80
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
