Article

Diverse Humanoid Robot Pose Estimation from Images Using Only Sparse Datasets

1 Department of Computer Science and Engineering, Konkuk University, Seoul 05029, Republic of Korea
2 Department of Metaverse Convergence, Graduate School, Konkuk University, Seoul 05029, Republic of Korea
3 Center for Artificial Intelligence, Korea Institute of Science and Technology, Seoul 02792, Republic of Korea
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2024, 14(19), 9042; https://doi.org/10.3390/app14199042
Submission received: 12 September 2024 / Revised: 3 October 2024 / Accepted: 4 October 2024 / Published: 7 October 2024
(This article belongs to the Special Issue Computer Vision, Robotics and Intelligent Systems)

Abstract: We present a novel dataset for humanoid robot pose estimation from images, addressing the critical need for accurate pose estimation to enhance human–robot interaction in extended reality (XR) applications. Despite the importance of this task, large-scale pose datasets for diverse humanoid robots remain scarce. To overcome this limitation, we collected sparse pose datasets for commercially available humanoid robots and augmented them through various synthetic data generation techniques, including AI-assisted image synthesis, foreground removal, and 3D character simulations. Our dataset is the first to provide full-body pose annotations for a wide range of humanoid robots exhibiting diverse motions, including side and back movements, in real-world scenarios. Furthermore, we introduce a new benchmark method for real-time full-body 2D keypoint estimation from a single image. Extensive experiments demonstrate that our pose estimation approach based on the extended dataset achieves an accuracy improvement of over 33.9% compared to using only sparse datasets. Additionally, our method runs in real time at 42 frames per second (FPS) and maintains consistent full-body pose estimation for side and back motions across 11 differently shaped humanoid robots, using approximately 350 training images per robot.

1. Introduction

Human–robot interaction (HRI) has become a pivotal area of research in robotics and extended reality (XR), driven by the increasing capabilities and applications of humanoid robots. Recent advancements in humanoid robots showcase remarkable physical abilities that closely mimic human movements, such as fast running, precise object manipulation, and sophisticated AI-assisted actions [1,2]. As these robots are integrated into diverse environments, accurate pose estimation has become critical to enhance their interaction with humans and their operational effectiveness in XR applications [3,4].
Pose estimation involves determining the position and orientation of a robot’s body parts from visual data, which is crucial for understanding the robot’s current status during interactions. This capability not only supports real-time operations, but also enhances the user’s immersive experience in XR environments. Despite its significance, accurate pose estimation for a wide variety of robots remains a challenge due to the diversity in robot designs and applications.
Traditional pose estimation methods for robots often rely on visual markers or internal feedback mechanisms. Visual markers can be intrusive and limit the robot’s operational flexibility, while internal feedback systems are frequently customized for specific robots or environments, reducing their adaptability to new or varied scenarios [3,4]. Moreover, these methods typically require extensive setup and calibration, which restricts their use to controlled laboratory environments.
Recently, learning-based methods for pose estimation have gained significant attention. Convolutional neural network (CNN)-based approaches have been applied to robot arm pose estimation [5,6,7], as summarized in Table 1(a). However, these methods are restricted to arm joints and have been tested on only a limited number of robot types.
For humanoid robots, several pose estimation techniques have been proposed, as outlined in Table 1(b). Tejwani et al. [8] applied their method to an upper-body humanoid robot, estimating dual-arm poses using inverse kinematics (IK), but without addressing full-body estimation. Similarly, Lu et al. [9] employed CNN-based keypoint detection for humanoid robot pose estimation, though their approach was limited to end-effector joints and lacked full-body coverage. A full-body 3D pose estimation method from head-worn cameras was introduced by Cho et al. [10], but the method was validated on only a single type of robot.
None of these approaches have been evaluated on a diverse set of robots (e.g., more than ten types) or focused on estimating full-body joints, leaving the general applicability of pose estimation methods an open problem. Collecting pose datasets for a wide variety of robots poses another challenge, as setting up capture systems and recording diverse poses is costly and labor-intensive. Moreover, the availability of humanoid robot videos from providers is extremely limited, hindering the formation of large-scale datasets.
To address these challenges, we present a new pose dataset for diverse humanoid robots, combining small real-world datasets with synthetic data generated using our approach. We collected public videos from the official websites of 11 state-of-the-art humanoid robots [1,2] and manually annotated them to form sparse real datasets. The combined dataset leverages these sparse pose data for each humanoid robot and augments them with synthetic data generation techniques, including AI-assisted image synthesis, foreground removal, and 3D character simulations. By merging real and synthetic data, we create a robust dataset that provides full-body pose annotations for a variety of humanoid robots, capturing diverse motions such as front, side, back, and partial views in real-world scenarios. To the best of our knowledge, no existing dataset includes full-body pose annotations for diverse humanoid robots. An overview of our approach for diverse humanoid robot pose estimation is presented in Figure 1.
Additionally, we propose a new benchmark method for real-time, learning-based full-body 2D keypoint estimation from single images, as presented in Table 1(c). Our extensive experiments demonstrate that leveraging our extended pose dataset significantly improves full-body pose estimation accuracy by over 33.9% compared to using only sparse datasets, across 11 types of humanoid robots, requiring approximately 350 training images per robot.
Our method also achieves real-time performance, operating at 42 frames per second (FPS), while consistently estimating full-body poses for side and back views, even with varying humanoid robot shapes.
These advancements overcome the limitations of traditional pose estimation methods and offer a scalable solution for human–robot interaction applications involving diverse humanoid robots. To further support the research community, we have made our dataset publicly available. It can be accessed upon request at https://xrlabku.webflow.io/papers/diverse-humanoid-robot-pose-estimation-using-only-sparse-datasets (accessed on 2 October 2024).
In summary, the primary contributions of this work are as follows:
  • The first pose dataset providing full-body annotations for over ten different types of humanoid robots.
  • A benchmark method for real-time, learning-based full-body pose estimation from a single image.
  • An evaluation dataset and standard metrics for assessing pose estimation performance.
The remainder of this paper is organized as follows: Section 2 discusses related work. Section 3 presents the diverse humanoid robot pose dataset. Section 4 describes the benchmark method for real-time full-body pose estimation. The results are presented and discussed in Section 5, followed by a discussion of limitations and future work in Section 6.

2. Related Work

This study is closely related to various existing studies in the fields of human pose estimation, marker-based human–robot interaction, and robot pose estimation. In this section, we briefly review the major research in these areas.

2.1. Learning-Based Human Pose Estimation

Human pose estimation is a well-established and extensively researched field with notable advancements. Numerous techniques for detecting 2D human joints and estimating poses from RGB video or image data have been widely explored [23,24,25]. Cao et al. [23] introduced a real-time multi-person 2D pose estimation method using Part Affinity Fields (PAF), showcasing its ability to estimate the poses of multiple individuals in complex environments. OmniPose [24], a multi-scale framework-based method, employs multi-scale feature modules to effectively capture information at different scales, addressing occlusion challenges. ViTPose [25], based on a vision transformer, offers a simpler network structure while maintaining efficient inference speed and high accuracy.
Various studies have extended beyond 2D to explore 3D human pose estimation [26,27,28,29,30,31,32,33,34,35,36]. Video-based methods have been used to incorporate temporal information for more accurate human pose and shape estimation over time [29]. Research using diffusion models for human pose estimation has significantly enhanced the accuracy and reliability of 3D human pose estimation without requiring extensive training data [32,34,36]. Additionally, recent transformer-based models have improved the accuracy of 3D pose estimation by integrating spatial and temporal features [31,33,35]. However, these methods are trained on data of human poses and shapes. While humanoid robots are designed to mimic human appearance and movements, there are clear differences in their structural characteristics and external form, which present significant limitations when directly applying these models to humanoid robots.
Moreover, approaches using additional sensors such as IMUs [26,27,30] have shown promise in human pose estimation, but they pose practical challenges when applied to humanoid robots due to differences in link proportions and the requirement for sensor attachment.

2.2. Marker-Based Human–Robot Interaction

Extensive research has been conducted in the field of HRI through marker-based approaches [4]. Bambusek et al. [37] propose a system that integrates interactive spatial augmented reality (ISAR) and head-mounted display (HMD) technologies to enable users to program workspaces even in the absence of a physical robot. This system allows users to perform programming tasks without the robot’s presence, and the visualization of the program facilitates a more intuitive understanding of the tasks. Qian et al. [38] focus on improving the accuracy of surgical assistants’ instrument insertion and manipulation using augmented reality. The augmented reality-assisted system provides 3D visualization to enhance task accuracy, utilizing marker-based tracking to ensure precise alignment between the user’s HMD and the instruments. Tran et al. [39] examine how mixed reality (MR)-based communication interacts with cognitive load in users, proposing a method for overlaying virtual objects onto an MR environment using fiducial markers. Frank et al. [40] describe a method for sharing spatial information between the user and the robot using visual markers, and propose an MR interface that allows the user to visually perceive the robot’s workspace and intuitively manipulate objects. However, the reliance on markers can introduce various limitations when applied to general situations.

2.3. Robot Arm Pose Estimation

Research on robot arm pose estimation has recently been conducted by using neural networks without relying on markers or additional sensors [5,6,7,41,42,43]. In one approach, the 2D pose of a robot arm is estimated from a single RGB image using a CNN, followed by camera-to-robot pose estimation using the perspective-n-point (PnP) algorithm [6]. Another study, bypassing the PnP algorithm, directly estimates the robot’s 3D joint pose based on its mask, predicted using CNN. This method estimates both the robot’s location and its 3D joint coordinates using a freely moving camera, overcoming the limitations of a fixed camera-to-robot setup [5]. Furthermore, Lu et al. [7] proposed a methodology for more accurately estimating the pose of robot arms by applying keypoint optimization and sim-to-real transfer approaches. Similarly, Ban et al. [41] introduced a method for predicting a robot pose in real time without relying on prior structural knowledge of the robot.
There have also been studies employing sequence data rather than single images [42,43]. A study proposes a real-time camera-to-robot pose estimation model, SGTAPose, by utilizing robot structure priors from single-view successive frames in an image sequence [42]. Furthermore, another study contributes to risk mitigation in human–robot collaboration scenarios by using a Recurrent Neural Network (RNN) model to predict the robot’s movements [43]. Tejwani et al. [8] proposed an Avatar Robot System that enables visual and haptic interactions by aligning a human model onto the robot arm. These methods may be effective for estimating the motion of simple robotic arms, but there are limitations when applied to complex movements involving multiple joints and body parts, as seen in humanoid robots. To accurately estimate such complex motions, approaches that can model the intricate interactions between different parts with greater precision are required.

2.4. Full-Body Robot Pose Estimation

Research on full-body robot pose estimation using fiducial markers [44,45,46,47,48] typically optimizes the pose based on marker positions. These markers are highly robust and have been employed in challenging environments, such as underwater scenarios with poor visibility [49] or in cases of significant occlusion [50,51]. Despite their robustness, marker-based methods remain unsuitable for general humanoid robot pose estimation due to their dependency on markers. To overcome the limitations of marker-based approaches, many studies have investigated learning-based methods that train models on image or video datasets to estimate poses without markers [9,52]. However, these approaches have limitations in accurately estimating the poses of humanoid robots with diverse forms. In particular, research on the full-body pose estimation of humanoid robots using learning-based methods has not yet been widely explored. This is largely due to challenges in generalizing to robots with varying appearances and joint structures, as well as limitations in the training datasets. As a result, further research is required to achieve comprehensive pose estimation for humanoid robots.

3. Diverse Humanoid Robot Pose Dataset (DHRP)

In this section, we present the Diverse Humanoid Robot Pose Dataset (DHRP), detailing the collection of the sparse target humanoid robot dataset, the annotation process, and the augmentation with synthetic data. In total, we collected 14.5k annotated images for training and 1.5k annotated images for evaluation, covering a range of motions and scenarios for diverse humanoid robots. The complete dataset configuration is further detailed in Section 3.3.

3.1. Target Humanoid Robots

We selected 11 state-of-the-art commercially available humanoid robots [1,2] as our target models. These humanoid robots have a similar joint structure to humans; however, the variability in their joint configurations is much greater than in human anatomy. For instance, EVE (1X Technologies) [16] has only a single leg and lacks feet, while Atlas (Boston Dynamics) [14] and H1 (Unitree) [18] are designed without hands. TALOS (PAL Robotics) [21] and Toro (DLR) [22] have uniquely shaped hip joints, and most of the robots do not possess facial features such as eyes, noses, or mouths.
Despite these challenges in shape and design variability, we annotated 14 key joints (nose, neck, shoulders, elbows, wrists, hips, knees, and ankles) across all the robot types. This follows a structure similar to OpenPose’s joint configuration [23], which is widely used in human pose estimation. Due to the absence of common facial features (most robots have plain, textureless faces), we chose to annotate only the nose, as it is crucial for identifying head orientation. The selected 11 humanoid robots are shown in Figure 2.
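For illustration, the 14 annotated joints can be indexed as a flat list, as in the Python sketch below. The exact ordering and the left/right naming used here are illustrative assumptions rather than the dataset's official annotation schema.

```python
# Illustrative indexing of the 14 annotated joints. The names and ordering are
# assumptions for clarity; the dataset's actual annotation order may differ.
DHRP_KEYPOINTS = [
    "nose", "neck",
    "right_shoulder", "right_elbow", "right_wrist",
    "left_shoulder", "left_elbow", "left_wrist",
    "right_hip", "right_knee", "right_ankle",
    "left_hip", "left_knee", "left_ankle",
]
assert len(DHRP_KEYPOINTS) == 14  # |K| = 14, matching Section 4.1.1
```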

3.2. Sparse Dataset and Annotation Process

We collected a total of 42 publicly available video clips of the target robots from the official websites of their respective providers and their YouTube channels. These were divided into 29 clips for training and 13 clips for evaluation, ensuring that both the training and evaluation sets included front, side, back, and partial views for each robot. Example images from the target humanoid robot dataset are shown in Figure 3.
The video clips were manually annotated for uniformly selected frames using V7 Darwin annotation software [53]. The numbers of annotated frames for each target humanoid robot are listed in Table 2. In total, we collected 3.8k training samples and 1.5k evaluation samples, forming a small-scale real-image dataset. On average, only 347 images were used per robot for training.
It is important to note that the annotation process was labor-intensive, and even with the collected video clips, capturing the entire shape of each robot was not fully possible. Nevertheless, our experiments show that, despite the limitations of this sparse dataset, when augmented with our data generation techniques described in Section 3.3, our pose estimation method achieved satisfactory keypoint detection accuracy on the evaluation dataset, maintaining robustness over time in the evaluation video clips (see Section 5).

3.3. Dataset Extension

The target humanoid robot dataset described in Section 3.2 is too sparse to achieve satisfactory accuracy across diverse motions and robot types. To address this limitation, we extend the dataset with additional sources, including data from arbitrary humanoid robots (Section 3.4), synthetic data (Section 3.5), and random backgrounds (Section 3.6). The full configuration of the combined dataset is outlined in Table 3. The results show that these augmentations significantly enhance accuracy, as detailed in Section 5.

3.4. Arbitrary Humanoid Robot Dataset

Unlike humans, humanoid robots exhibit significant diversity in body shapes and appearances. To address this variability, we collected images and videos of various humanoid robots to enhance the diversity of body shapes, appearances, motions, and viewpoints in our dataset. We sourced these resources from publicly available platforms such as Google and YouTube. After filtering out redundant body shapes and motions, we compiled an additional 2k annotations of arbitrary humanoid robots. Selected examples from this dataset are displayed in Figure 4.

3.5. Synthetic Dataset

The real datasets, including those for target and arbitrary humanoid robots, are limited in terms of motion and scenario diversity since they were captured for specific applications. To address these limitations, we further augment the dataset using synthetic image generation techniques as follows.

3.5.1. AI-Assisted Image Synthesis

To enhance the motion diversity of the humanoid robot dataset, we generated AI-synthesized images of robots performing dancing motions, as shown in Figure 5. Using AI video generation software such as Viggle [54], we synthesized arbitrary dance videos from single frontal-facing images. The input images used for this video synthesis were of the Kepler (Kepler Exploration Robot) [19] and Optimus Gen 2 (Tesla) [12]. Frames were randomly subsampled from these videos and annotated. Although the visual quality of the synthesized humanoid robots may not be entirely realistic, incorporating them with real-world backgrounds enhances the diversity of outdoor scenes, which are underrepresented in the existing real-world datasets.

3.5.2. 3D Character Simulations

High-quality synthetic images were also generated through 3D character simulations [56,57]. To further enhance the diversity of humanoid robot motions, we used Unreal Engine [55] to simulate characters based on Optimus Gen 2 and another robot model. These characters performed five-minute physical training routines across five distinct 3D environments, captured with randomized camera movements and lighting variations. By tracking all joint locations throughout the simulations, the projected joint positions on the screen were automatically recorded, enabling efficient joint annotation without manual effort.
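As a rough illustration of this automatic annotation step, the sketch below projects 3D joint centers into screen space with a standard pinhole camera model. The function and matrix names are placeholders; an actual Unreal Engine pipeline would use the engine's own camera and projection utilities.

```python
import numpy as np

def project_joints(joints_world: np.ndarray, world_to_cam: np.ndarray,
                   intrinsics: np.ndarray) -> np.ndarray:
    """Project 3D joint centers (N, 3) to 2D pixel coordinates (N, 2).

    world_to_cam: 4x4 extrinsic matrix for the randomized camera pose.
    intrinsics:   3x3 pinhole camera matrix.
    """
    n = joints_world.shape[0]
    homog = np.hstack([joints_world, np.ones((n, 1))])   # (N, 4) homogeneous points
    cam = (world_to_cam @ homog.T).T[:, :3]              # points in camera space
    pix = (intrinsics @ cam.T).T                         # perspective projection
    return pix[:, :2] / pix[:, 2:3]                      # divide by depth -> pixels
```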
In total, 6.7k frames were generated using AI-assisted image synthesis and 3D character simulations, enriching the dataset with diverse motions, color variations, and multiple camera perspectives. Selected examples from this synthetic dataset are shown in Figure 5.

3.6. Background Dataset

Unlike humans, humanoid robots often have surfaces covered with metallic parts and cables, which can create visual ambiguity between the robots and their environments, especially when metallic or shiny objects are present. To mitigate this ambiguity, we augmented the dataset with random background images using the following approaches:

3.6.1. AI-Assisted Foreground Removal

We generated 133 foreground-removed images using Adobe Photoshop Generative Fill [58], as illustrated in Figure 6. By manually selecting the foreground, the gaps were filled with AI-generated backgrounds. These background images help to better distinguish the robots’ metallic surfaces from similar objects in the environment.

3.6.2. Arbitrary Random Backgrounds

Additionally, we collected 1886 random indoor and outdoor background images to further increase background diversity. These images were sourced from cluttered environments such as forests, beaches, factories, gyms, grocery stores, and kitchens. Many of these scenes contain metallic or shiny objects, which improve the model’s ability to differentiate between background elements and the robot’s surface.
In total, 2k background images were collected using AI-assisted foreground removal and arbitrary random backgrounds, enhancing the dataset by improving the distinction between humanoid robots and their surrounding environments.

4. Diverse Humanoid Robot Pose Estimation

To accomplish full-body pose estimation for a variety of humanoid robots, we employed a CNN using the dataset described in Section 3. The baseline approach for keypoint detection is outlined in Section 4.1, while the evaluation method and metrics for the DHRP Dataset are detailed in Section 4.2.

4.1. The Benchmark Method

Recently, CNN-based human pose estimation methods [23,24,25,59] have shown success in real-time capability, multi-person detection, and full-body joint detection, even in occluded situations. Since humanoid robots have a similar joint structure to humans, we first investigate diverse humanoid robot pose estimation by adopting CNN-based pose estimation models.
The stacked hourglass architecture [59] is extensively used for human pose estimation, offering performance comparable to methods such as OpenPose [23] and ViTPose [25]. It notably achieves state-of-the-art results on the MPII dataset [60], as reported in [61]. The stacked hourglass model is a multi-stage CNN network, providing flexibility in selecting the network’s depth by adjusting the number of stages.
For our joint detector baseline, we utilize the regression-based CNN method Differentiable Spatial-to-Numerical Transform (DSNT) described in [62], leveraging the stacked hourglass architecture as the backbone. This regression module improves computational efficiency by allowing the direct estimation of 2D keypoint coordinates and eliminating the need to parse joint heatmaps for coordinate extraction during runtime.
Additionally, we adopt the network training procedure detailed in [10,56]. Below, we briefly summarize the joint detection network and its training details.

4.1.1. Joint Detection Network

The joint detection network, $f: I \rightarrow (K, H)$, processes an input image $I \in \mathbb{R}^{m \times m \times 3}$ (where $m = 320$) to produce joint coordinates $K \in \mathbb{R}^{2 \times |K|}$ and confidence maps $H \in \mathbb{R}^{(m/4) \times (m/4) \times |K|}$, with $|K| = 14$ representing the number of keypoints. The architecture of this multi-stage network is shown in Figure 7.
In each stage, the hourglass module [59] generates unnormalized heatmaps $\tilde{H} \in \mathbb{R}^{(m/4) \times (m/4) \times |K|}$. These heatmaps are then passed to both the subsequent stage and the DSNT regression module [62]. The DSNT module normalizes $\tilde{H}$ using a Softmax layer to produce confidence maps $H$. The map $H$ is subsequently converted into 2D coordinates $K$ through a dot product with the X and Y coordinate matrices [62].
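To make the regression step concrete, the following is a minimal PyTorch sketch of a DSNT-style layer operating on the unnormalized hourglass heatmaps. It follows the description above and [62], but it is a simplified stand-in rather than the exact implementation used in the paper.

```python
import torch
import torch.nn.functional as F

def dsnt(unnormalized_heatmaps: torch.Tensor):
    """Differentiable spatial-to-numerical transform (sketch).

    unnormalized_heatmaps: (B, K, H, W) raw hourglass output H~.
    Returns normalized confidence maps (B, K, H, W) and coordinates (B, K, 2)
    expressed in normalized image space [-1, 1].
    """
    b, k, h, w = unnormalized_heatmaps.shape
    # Spatial softmax -> confidence maps H that sum to 1 per joint.
    heatmaps = F.softmax(unnormalized_heatmaps.view(b, k, -1), dim=-1).view(b, k, h, w)
    # Coordinate grids in normalized image space.
    ys = torch.linspace(-1.0, 1.0, h, device=heatmaps.device)
    xs = torch.linspace(-1.0, 1.0, w, device=heatmaps.device)
    # Expected value of the coordinates under each confidence map.
    x = (heatmaps.sum(dim=2) * xs).sum(dim=-1)   # marginal over rows, dot with xs
    y = (heatmaps.sum(dim=3) * ys).sum(dim=-1)   # marginal over cols, dot with ys
    coords = torch.stack([x, y], dim=-1)         # (B, K, 2)
    return heatmaps, coords
```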

4.1.2. Keypoint Parsing

In the keypoint parsing stage, keypoint confidences $V$ are extracted from the network outputs $K$ and $H$. The confidence for the $j$-th joint, denoted as $c_j = H_j(x_j, y_j)$, is evaluated at the estimated 2D coordinate $k_j = (x_j, y_j) \in K$. If the confidence $c_j$ exceeds the probability threshold ($c_j > t_v$), the keypoint $k_j$ is considered detected, and its corresponding confidence $v_j \in V$ is set to 1. Otherwise, $v_j$ is assigned a value of 0. The probability threshold $t_v$ is empirically set to 0.1 [10].
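A minimal sketch of this parsing rule is shown below. It assumes the predicted coordinates have already been mapped to heatmap pixel indices, and the variable names are illustrative.

```python
import numpy as np

def parse_keypoints(coords: np.ndarray, conf_maps: np.ndarray, t_v: float = 0.1) -> np.ndarray:
    """coords: (K, 2) estimated (x, y) in heatmap pixel space; conf_maps: (K, h, w).

    Returns a binary visibility vector V: v_j = 1 if H_j(x_j, y_j) > t_v, else 0.
    """
    k, h, w = conf_maps.shape
    visibility = np.zeros(k, dtype=np.int32)
    for j, (x, y) in enumerate(coords):
        xi = int(np.clip(round(x), 0, w - 1))
        yi = int(np.clip(round(y), 0, h - 1))
        c_j = conf_maps[j, yi, xi]              # confidence at the estimated coordinate
        visibility[j] = 1 if c_j > t_v else 0
    return visibility
```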

4.1.3. Network Optimization

We utilize the network optimization approach described in [10,56]. The joint detector network is trained to minimize the combined loss function $L_{detector} = L_{DSNT} + L_{V}$. To estimate the 2D coordinates of visible joints (where $v_j = 1$), we apply the regression loss $L_{DSNT}$, defined as:

$$L_{DSNT} = \sum_{i=1}^{|K|} v_i^{gt} \cdot \left[ \| k_i^{gt} - k_i \|_2^2 + D\!\left( H_i \,\|\, \mathcal{N}(k_i^{gt}, \sigma I_2) \right) \right]$$

where the superscript $gt$ denotes ground truth data, $v^{gt}$ indicates the binary visibility for each joint in the ground truth, and $\mathcal{N}(\mu, \sigma)$ represents a 2D Gaussian distribution centered at $\mu$ with a standard deviation $\sigma$; for training, $\sigma$ is set to 1. The term $D(\cdot \| \cdot)$ signifies the Jensen–Shannon divergence, which measures the similarity between the confidence map $H$ and the ground truth Gaussian map [62].
To mitigate false detections, we incorporate the invisibility loss $L_V$, which operates on the hourglass layer’s output $\tilde{H}$ as follows:

$$L_V = \sum_{i=1}^{|K|} (1 - v_i^{gt}) \cdot \| \tilde{H}_i \|_2^2$$

This loss function penalizes $\tilde{H}$ to produce a zero heatmap for invisible joints (where $v_j = 0$). In the DSNT module, a zero heatmap results in a uniform distribution in $H$ after the Softmax layer, effectively reducing the confidence for joints that are not visible.
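The sketch below implements the two loss terms in PyTorch, directly following the formulas above. The tensor names and the explicit Jensen–Shannon helper are assumptions made for the sake of a self-contained example, not the paper's actual code.

```python
import torch

def js_divergence(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Jensen–Shannon divergence between two (B, K, H*W) distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (torch.log(a + eps) - torch.log(b + eps))).sum(dim=-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def detector_loss(coords, conf_maps, raw_heatmaps, gt_coords, gt_gauss, gt_vis):
    """L_detector = L_DSNT (visible joints) + L_V (invisible joints). Sketch only.

    coords:       (B, K, 2) predicted coordinates k
    conf_maps:    (B, K, H, W) normalized confidence maps H
    raw_heatmaps: (B, K, H, W) unnormalized hourglass output H~
    gt_coords:    (B, K, 2) ground-truth coordinates k^gt
    gt_gauss:     (B, K, H, W) target Gaussian maps N(k^gt, sigma=1)
    gt_vis:       (B, K) binary visibility v^gt as a float tensor
    """
    b, k, h, w = conf_maps.shape
    sq_err = ((coords - gt_coords) ** 2).sum(dim=-1)                       # ||k^gt - k||^2
    js = js_divergence(conf_maps.view(b, k, -1), gt_gauss.view(b, k, -1))  # D(H || N)
    l_dsnt = (gt_vis * (sq_err + js)).sum()
    l_v = ((1.0 - gt_vis) * (raw_heatmaps ** 2).view(b, k, -1).sum(dim=-1)).sum()
    return l_dsnt + l_v
```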

4.1.4. Training Details

The network model is trained in multiple stages for transfer learning. Initially, it is pretrained on the COCO 2017 Human Pose dataset [63] to learn low-level texture features. Subsequently, the network is further trained on our humanoid robot pose dataset using the pretrained parameters. Throughout the training process, intermediate supervision, as described in [23], is applied at each stage of the network.
To enhance detection robustness, we applied a range of standard data augmentation techniques [64,65,66]. These techniques include various image transformations, such as scaling, rotation, translation, and horizontal flipping, as well as adjustments to image color, including contrast and brightness. Observing the unique complexities of robot images—characterized by cables, electronic components, and surfaces with numerous bolts and nuts—we also introduced random blurring and noise. Detailed configurations of the data augmentation for training are presented in Table 4.
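The paper does not specify the augmentation library, so the snippet below sketches a comparable keypoint-aware pipeline with Albumentations. Every parameter value is a placeholder rather than the actual setting listed in Table 4.

```python
import albumentations as A

# Placeholder ranges only -- the paper's exact configurations are given in Table 4.
train_transform = A.Compose(
    [
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2, rotate_limit=30, p=0.7),
        A.HorizontalFlip(p=0.5),
        A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
        # Extra blur and noise to cope with cables, bolts, and reflective surfaces.
        A.GaussianBlur(blur_limit=(3, 7), p=0.3),
        A.GaussNoise(p=0.3),
    ],
    keypoint_params=A.KeypointParams(format="xy", remove_invisible=False),
)
```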

4.2. The Evaluation Method

To evaluate pose estimation methods, we adopted a standard evaluation procedure similar to that used in the COCO dataset [63], utilizing Object Keypoint Similarity (OKS). OKS measures the proximity of predicted keypoints to ground truth keypoints for a given object, with scores ranging from 0 to 1, where a higher OKS score indicates more accurate predictions. OKS is defined as:
$$OKS = \frac{\sum_{i=1}^{|K|} \exp\!\left( -d_i^2 / (2 s^2 k_i^2) \right) \cdot \delta(v_i^{gt} > 0)}{\sum_{i=1}^{|K|} \delta(v_i^{gt} > 0)}$$
where $|K|$ denotes the number of keypoints for an object, $d_i$ represents the Euclidean distance between the $i$-th predicted keypoint and the corresponding ground truth keypoint, $s$ is the square root of the object’s segment area (defined by its bounding box), and $k_i$ is the normalization constant for each keypoint. The term $v_i^{gt}$ indicates the binary visibility in the ground truth, and $\delta(\cdot)$ is a binary indicator function that evaluates to 1 if the condition inside the parentheses is true, and 0 otherwise.
While the COCO dataset employs a normalization constant of $k_i = 2\sigma_i$ for each keypoint, we use a stricter normalization constant of $k_i = \sigma_i$, indicating that our dataset demands higher accuracy in keypoint estimation. For the DHRP dataset, we base the normalization constant values $k_i$ on those from the COCO dataset (available at https://cocodataset.org/#keypoints-eval, accessed on 2 October 2024). The specific values of $k_i$ used are as follows: 0.079 for the nose, 0.079 for the neck, 0.079 for the shoulders, 0.072 for the elbows, 0.062 for the wrists, 0.107 for the hips, 0.087 for the knees, and 0.089 for the ankles. Since most humanoid robots lack distinct facial features, we assigned the same value as the shoulders to the nose and neck.
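For illustration, a NumPy sketch of the OKS computation with the per-joint constants listed above is given below. The joint ordering in the constant array is an assumption (it follows the illustrative keypoint list from Section 3.1), not the dataset's prescribed order.

```python
import numpy as np

# Per-joint normalization constants k_i for the DHRP dataset (Section 4.2);
# ordering is assumed: nose, neck, R/L shoulder-elbow-wrist, R/L hip-knee-ankle.
K_CONST = np.array([
    0.079, 0.079,                # nose, neck
    0.079, 0.072, 0.062,         # right shoulder, elbow, wrist
    0.079, 0.072, 0.062,         # left shoulder, elbow, wrist
    0.107, 0.087, 0.089,         # right hip, knee, ankle
    0.107, 0.087, 0.089,         # left hip, knee, ankle
])

def oks(pred: np.ndarray, gt: np.ndarray, vis_gt: np.ndarray,
        bbox_area: float, k: np.ndarray = K_CONST) -> float:
    """Object Keypoint Similarity for one instance.

    pred, gt: (14, 2) keypoint coordinates; vis_gt: (14,) binary visibility;
    bbox_area: ground-truth bounding-box area (s is its square root, so s^2 = area).
    """
    d2 = ((pred - gt) ** 2).sum(axis=-1)          # squared Euclidean distances d_i^2
    e = np.exp(-d2 / (2.0 * bbox_area * k ** 2))
    mask = vis_gt > 0
    return float(e[mask].sum() / max(mask.sum(), 1))
```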
We further evaluate accuracy on a per-joint basis using Joint Keypoint Similarity (JKS), calculated in a manner similar to the OKS:
$$JKS = \frac{\sum_{i=1}^{|F|} \exp\!\left( -d_i^2 / (2 s^2 k_i^2) \right) \cdot \delta(v_i^{gt} > 0)}{\sum_{i=1}^{|F|} \delta(v_i^{gt} > 0)}$$
where $|F|$ denotes the number of frames for a particular joint, and $d_i$ represents the Euclidean distance between the predicted keypoint and the ground truth in the $i$-th frame.
We evaluate the overall pose estimation performance using Average Precision (AP) and Average Recall (AR), following the methodology from the COCO dataset:
$$AP_t = \frac{1}{m} \sum_{i=1}^{m} \delta(OKS_i > t); \qquad AR_t = \frac{1}{n} \sum_{i=1}^{n} \delta(OKS_i > t)$$
where $m$ denotes the number of instances for which the network’s output is positive, $n$ represents the total number of instances in the test set, and $t \in T$ denotes a threshold value from the set $T$. We adopt the threshold range $T = (0.50, 0.55, \ldots, 0.90, 0.95)$, following the COCO dataset, which includes a total of 20 thresholds.
The primary evaluation metric for the DHRP dataset is the mean Average Recall (mAR). In contexts such as human–robot interactions, the focus is on reducing false negatives (missed detections) during video tracking. The mean Average Precision (mAP) and mAR are computed as follows:
$$mAP = \frac{1}{|T|} \sum_{i=1}^{|T|} AP_i; \qquad mAR = \frac{1}{|T|} \sum_{i=1}^{|T|} AR_i$$
where $|T| = 20$ represents the total number of thresholds in the threshold set $T$.
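A compact sketch of the mAP/mAR computation is shown below. It normalizes by $m$ and $n$ so that AP and AR behave as precision and recall; the default threshold set is the COCO-style range from 0.50 to 0.95 in steps of 0.05 and should be adjusted to match the $|T|$ actually used for the DHRP evaluation.

```python
import numpy as np

def map_mar(oks_scores, is_positive, thresholds=None):
    """Compute mAP/mAR over a threshold set T (sketch).

    oks_scores:  per-instance OKS values, one per ground-truth instance in the
                 test set (n values; 0 for instances with no detection).
    is_positive: boolean mask marking the m instances with a positive output.
    """
    if thresholds is None:
        # COCO-style default: 0.50, 0.55, ..., 0.95; adjust to match |T| in the paper.
        thresholds = np.linspace(0.50, 0.95, 10)
    oks_scores = np.asarray(oks_scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    m, n = max(int(is_positive.sum()), 1), len(oks_scores)
    ap = [(oks_scores[is_positive] > t).sum() / m for t in thresholds]   # precision at t
    ar = [(oks_scores > t).sum() / n for t in thresholds]                # recall at t
    return float(np.mean(ap)), float(np.mean(ar))
```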

5. Results and Evaluation

Our method for the full-body pose estimation of diverse humanoid robots presents a unique challenge as it cannot be directly compared to existing approaches. Previous studies on robot pose estimation have largely focused on detecting specific partial joints, such as those in robot arms [5,6,7], upper-body joints [8], or end-effector joints [9]. Moreover, none of these methods have been applied to a broad range of robots [10].
This section presents the evaluation of our baseline full-body humanoid robot pose estimation method on the DHRP dataset. First, we assess the effectiveness of the extended dataset across different training configurations in Section 5.1. Next, we examine the impact of sparse datasets on the accuracy for individual robots in Section 5.2. In Section 5.3, we explore how variations in network architecture affect performance, followed by a comparison of our approach with other methods in Section 5.4.
A detailed quantitative and qualitative evaluation of each target humanoid robot is provided in Section 5.5 and Section 5.6, respectively, along with an analysis of the general applicability of the pose estimation method. Finally, we discuss failure cases in Section 5.7.

5.1. Evaluation on Dataset Configurations

In this section, we evaluate the effectiveness of the extended dataset across different training configurations. Although we collected real datasets for 11 target robots, the dataset size is limited, as discussed in Section 3.2. To address this, we supplemented the data with additional datasets, as described in Section 3.3. The complete configuration of the DHRP dataset is outlined in Table 3.
To demonstrate the impact of the combined dataset, we evaluate the benchmark methods from Section 4.1, trained with different configurations of the training dataset. The evaluation results are summarized in Table 5 for object-level analysis and Table 6 for joint-level analysis.
In both scenarios, incorporating all the augmented datasets—namely, arbitrary humanoid robots, synthetic data, and random backgrounds—resulted in significant accuracy improvements. Specifically, when comparing the use of only the sparse target dataset to the full dataset, the $mAR_{obj}$ improved by over 23.4%, while the $mAR_{jnt}$ increased by over 14% on average. These findings highlight the substantial impact of each augmented dataset on full-body pose estimation, demonstrating notable accuracy gains across diverse humanoid robots.
Random data augmentation techniques, as detailed in Table 4, were employed during the training process. The effectiveness of these methods was assessed on the DHRP test set, with the results summarized in Table 7. Train B utilizes conventional data augmentation techniques commonly applied in pose estimation tasks [64,65,66], while Train C incorporates additional random blur and noise layers. The results reveal that including random blur and noise significantly enhances performance, with Train C improving the $mAR_{obj}$ by over 9.6% compared to Train B.
Unlike human bodies, humanoid robots often feature surfaces with metallic parts and cables, creating visual ambiguities between the robots and their surroundings, which can lead to false positive detections. Our results demonstrate that incorporating random blur and noise into the data augmentation training process effectively mitigates these false positives in humanoid robot pose estimation.

5.2. Evaluation on Sparse Datasets

In this experiment, we investigate the effect of incorporating sparse datasets for individual target robots on overall performance. As shown in Table 2, on average, only 347 images per robot were utilized for training, which is insufficient to effectively optimize millions of parameters in a conventional network model.
To assess the effect of sparse datasets for each robot, we conducted evaluations using individual datasets and compared them against our combined dataset. The results are shown in Table 8. Individual refers to results obtained from training separately on each target humanoid robot dataset, while Dataset A represents the combined dataset including all target robots. The findings reveal that using the combined dataset improves performance by over 22.9%, not only for individual robots but also for other target robots. This improvement can be attributed to the underfitting issue that arises when training a network with millions of parameters using only a sparse dataset for a single robot.
Dataset S, an extended dataset, demonstrates better performance than Individual, despite not containing specific information about the target robots. The complete dataset, Dataset D = Dataset S + Dataset A, shows substantial accuracy improvements for all target robots, with an overall performance increase of over 33.9% compared to Individual.
To further evaluate the impact of adding sparse datasets to the combined dataset, we conducted leave-one-out experiments. For each target robot, the model was trained on the full dataset, excluding the data for that specific robot. The results are presented in Table 9. In the leave-one-out evaluation, the performance for the target robot increased by over 33.4% on average, while the performance of all other robots showed an average improvement of 10.6%. These results demonstrate that incorporating even a small amount of data for a specific robot not only enhances that robot’s performance but also improves the accuracy of other robots.

5.3. Evaluation on Network Architecture

In this section, we investigate the impact of varying the number of stages in the network architecture on performance.
To determine the optimal configuration for the benchmark model, we performed evaluations on the DHRP test set using the DSNT regression-based model, with a multi-stage hourglass fully convolutional backbone, as detailed in Section 4.1. Each model was tested with varying numbers of hourglass stages and evaluated using the standard metrics $AP_{obj}$, $AR_{obj}$, and $AR_{jnt}$, as outlined in Section 4.2. All models were initially pretrained on the COCO dataset and then further trained using the DHRP training dataset.
The evaluation results are shown in Table 10 and Table 11. Performance consistently improved as the number of stages increased, with a decline observed in the eight-stage hourglass network due to overfitting on the small-scale dataset. These findings indicate that the seven-stage regression-based hourglass network yields the best performance across all metrics. Therefore, for the remaining evaluations on the DHRP dataset, we utilize the seven-stage regression-based hourglass network as the optimal model.
The network was implemented in PyTorch [67] and trained for 240 epochs using the RMSprop optimizer. Training was conducted on a system with an Intel Xeon Silver 4310 processor, 128 GB RAM, and dual NVIDIA RTX 4090 GPUs. The training times for models with different numbers of stages are presented in Table 12.
The real-time performance of the network models with varying stage numbers was also evaluated, as shown in Table 12. The trained PyTorch models were converted to a universal format using the Open Neural Network Exchange (ONNX) [68] and tested in a C++ application. The inference times were measured on a system with an AMD Ryzen 9 5900X 12-core processor, 64 GB RAM, and an NVIDIA GeForce RTX 3080 Ti GPU.
The results show that all network models achieved inference speeds exceeding 38 FPS, confirming real-time pose estimation capabilities.
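For reference, a minimal Python sketch of the export-and-inference step is shown below. The input size, tensor name, opset version, and execution provider are illustrative choices; the paper's timing measurements were taken in a C++ ONNX Runtime application.

```python
import numpy as np
import torch
import onnxruntime as ort

def export_and_run(model: torch.nn.Module, path: str = "dhrp_detector.onnx") -> None:
    """Export a trained detector to ONNX and run a single inference with ONNX Runtime."""
    model.eval()
    dummy = torch.randn(1, 3, 320, 320)          # m = 320 input resolution
    torch.onnx.export(model, dummy, path,
                      input_names=["image"],
                      opset_version=17)
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    image = np.random.rand(1, 3, 320, 320).astype(np.float32)
    outputs = sess.run(None, {"image": image})   # list of output arrays
    print([o.shape for o in outputs])
```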

5.4. Comparison with Other Methods

In this section, we compare our method with existing approaches for humanoid robot pose estimation. Notably, prior methods [5,6,7,8] are not suited for full-body pose estimation across diverse humanoid robots or other robot types. For baseline comparison, we adopt the state-of-the-art RoboCup method from [9].
RoboCup is designed for detecting six end-effector joints in multiple humanoid robots within RoboCup league environments. Its CNN architecture uses a single-stage encoder–decoder model with ResNet18 as the backbone. We trained this model on our DHRP dataset following the approach outlined in [9]. Details of the network architectures are provided in Table 13.
Comparative evaluations on the DHRP test set are presented in Table 14. As the single-stage RoboCup model was originally designed for a smaller dataset (1.5k) focused on RoboCup robots, its performance is inadequate when applied to our larger and more diverse DHRP dataset (15k images, 14 joints, multiple robot types, diverse motions). In particular, RoboCup struggles with accurate limb joint detection. In contrast, our deeper, multi-stage network shows significant accuracy improvements across the board.
We also compare our pose estimation network using an external dataset provided by the digital human-augmented robotic telepresence method [10], which represents the state-of-the-art technique in full-body 3D pose estimation for a single humanoid robot type. This approach estimates 3D poses for a specific humanoid robot using 2D keypoint detectors from head-worn stereo cameras. We evaluated our method on their HRP dataset [10], utilizing parts of the dataset specifically for 2D pose estimation from a single image, with the results presented in Table 15. The findings demonstrate that our method significantly outperforms the specialized technique designed for a single robot type, underscoring the versatility of our approach for specific target robots.

5.5. Quantitative Evaluation on Target Humanoid Robots

In this experiment, we evaluate the overall performance of each target robot using the seven-stage regression-based hourglass network, identified as the optimal model. The pose estimation accuracies for each target robot, based on training with an average of only 347 images, are detailed in Table 16 and Table 17.
The results indicate that the H1 robot achieves the highest accuracy, while the FIGURE01 robot shows the lowest. The poor performance on FIGURE01 is attributed to its metallic and shiny surfaces, which cause detection issues due to reflections and environmental interference. For joint detection, accuracy is generally higher for head- and torso-related joints (e.g., nose, neck, shoulders, and hips) than for arm-related joints (e.g., elbows and wrists). The lower accuracy for arm joints is likely due to the greater variability in their appearance, such as a lack of consistent colors or shapes.
These findings highlight that the effectiveness of full-body humanoid robot pose estimation is significantly influenced by the material composition of the robot’s body covering.

5.6. Qualitative Evaluation on Target Humanoid Robots

In this section, we present a comprehensive qualitative evaluation of each target humanoid robot, accompanied by an analysis of the overall applicability of our pose estimation method.
The full-body pose estimation results for various humanoid robots are shown in Figure 1 and Figure 8. For Optimus Gen 2, as displayed in Figure 1a, our method consistently estimates poses over approximately 30 s of walking motions, covering front, side, and back views. In the case of Toro, shown in Figure 8j, our approach accurately estimates poses from a back view, despite this perspective not being included in the training data. The presented results for 11 target humanoid robots show that even with sparse datasets, the method demonstrates reliable full-body pose estimation for more than 10 types of humanoid robots, capturing front, side, back, and partial views. For additional results, refer to the accompanying video showcasing pose estimation from videos [11].
Additionally, Figure 9 presents the full-body pose estimation results for other types of miniature humanoid robot models not included in the dataset. The results demonstrate that our method can be applied to other humanoid robots, provided they have a similar joint structure to those in the training dataset.

5.7. Failure Case Analysis

In this section, we analyze the primary failure cases of our approach using the DHRP dataset. Common failure scenarios are depicted in Figure 10. Our findings reveal that many target humanoid robots feature metallic body parts, leading to interference from surrounding environments. False detections are often caused by nearby objects with metallic, shiny, or similarly reflective surfaces.
False negatives can occur in egocentric views where most body parts are not visible. Additionally, false positives are observed with non-humanoid robots, such as the one shown in Figure 10f, which bears a resemblance to the H1 robot. These issues primarily arise from the greater variability in robot body surfaces compared to human bodies. The presence of metallic or shiny objects in the environment is more common than skin-colored objects resembling human appearances.
To address these challenges in future iterations, we plan to incorporate additional datasets featuring cluttered backgrounds to mitigate such interference.

6. Limitations and Future Work

Our current system has several limitations that present opportunities for future research. One major limitation is the increased background interference compared to human-centered datasets. To mitigate this issue, a promising future direction is to extend our dataset to include more cluttered backgrounds.
Our evaluation covered 11 different humanoid robots, demonstrating that even a sparse dataset—comprising just a few hundred images—can still provide consistent full-body pose estimations. Future work could involve expanding the dataset to include a broader variety of humanoid robots.
Currently, our dataset is designed for single-robot pose estimation from images. Future efforts will focus on adapting our approach for real-time multi-robot pose estimation. Additionally, we aim to explore alternative network architectures for pose estimation, such as OmniPose [24] and ViTPose [25].

7. Conclusions

In this paper, we addressed the challenges of accurate pose estimation for diverse humanoid robots, a key requirement for improving human–robot interaction in robotics and XR applications. By introducing a novel dataset that combines sparse real-world data with synthetic data generated through AI-assisted techniques, we provided full-body pose annotations for 11 types of humanoid robots, capturing a range of real-world motions. We also developed a benchmark method for real-time, learning-based full-body 2D keypoint estimation from images, demonstrating significant accuracy and consistency across multiple robots using only sparse datasets. Our approach addresses the shortcomings of traditional methods, providing a scalable solution for diverse HRI scenarios and contributing a publicly available dataset to further support the research community.

Author Contributions

Conceptualization, H.L. and Y.C. (Youngwoon Cha); Methodology, Y.C. (Youngwoon Cha); Software, S.H., Y.C. (Youngdae Cho), J.P., S.C., Z.T. and Y.C. (Youngwoon Cha); Validation, Y.C. (Youngwoon Cha); Formal Analysis, Y.C. (Youngwoon Cha); Investigation, S.H., Y.C. (Youngdae Cho), J.P., S.C. and Y.C. (Youngwoon Cha); Resources, H.L. and Y.C. (Youngwoon Cha); Data Curation, Y.C. (Youngwoon Cha); Writing—Original Draft, S.H., Y.C. (Youngdae Cho), J.P., S.C., Z.T. and Y.C. (Youngwoon Cha); Writing—Review and Editing, Y.C. (Youngwoon Cha); Visualization, S.H., Y.C. (Youngdae Cho), J.P., S.C. and Z.T.; Supervision, H.L.; Project Administration, Y.C. (Youngwoon Cha); Funding Acquisition, Y.C. (Youngwoon Cha). All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. RS-2023-00214511), by the Korea Institute of Science and Technology (KIST) Institutional Program (2E33003 and 2E33000), and by Konkuk University in 2023.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request. The data presented in this study will be available on request from the corresponding author (starting mid-October 2024). Please refer to the project website at https://xrlabku.webflow.io/papers/diverse-humanoid-robot-pose-estimation-using-only-sparse-datasets (accessed on 2 October 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Saeedvand, S.; Jafari, M.; Aghdasi, H.S.; Baltes, J. A comprehensive survey on humanoid robot development. Knowl. Eng. Rev. 2019, 34, e20. [Google Scholar] [CrossRef]
  2. Tong, Y.; Liu, H.; Zhang, Z. Advancements in humanoid robots: A comprehensive review and future prospects. IEEE/CAA J. Autom. Sin. 2024, 11, 301–328. [Google Scholar] [CrossRef]
  3. Darvish, K.; Penco, L.; Ramos, J.; Cisneros, R.; Pratt, J.; Yoshida, E.; Ivaldi, S.; Pucci, D. Teleoperation of humanoid robots: A survey. IEEE Trans. Robot. 2023, 39, 1706–1727. [Google Scholar] [CrossRef]
  4. Suzuki, R.; Karim, A.; Xia, T.; Hedayati, H.; Marquardt, N. Augmented reality and robotics: A survey and taxonomy for ar-enhanced human-robot interaction and robotic interfaces. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, New Orleans, LA, USA, 29 April–5 May 2022; pp. 1–33. [Google Scholar]
  5. Miseikis, J.; Knobelreiter, P.; Brijacak, I.; Yahyanejad, S.; Glette, K.; Elle, O.J.; Torresen, J. Robot localisation and 3D position estimation using a free-moving camera and cascaded convolutional neural networks. In Proceedings of the 2018 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), Auckland, New Zealand, 9–12 July 2018; pp. 181–187. [Google Scholar]
  6. Lee, T.E.; Tremblay, J.; To, T.; Cheng, J.; Mosier, T.; Kroemer, O.; Fox, D.; Birchfield, S. Camera-to-robot pose estimation from a single image. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 13 May–31 August 2020; pp. 9426–9432. [Google Scholar]
  7. Lu, J.; Richter, F.; Yip, M.C. Pose estimation for robot manipulators via keypoint optimization and sim-to-real transfer. IEEE Robot. Autom. Lett. 2022, 7, 4622–4629. [Google Scholar] [CrossRef]
  8. Tejwani, R.; Ma, C.; Bonato, P.; Asada, H.H. An Avatar Robot Overlaid with the 3D Human Model of a Remote Operator. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 7061–7068. [Google Scholar]
  9. Amini, A.; Farazi, H.; Behnke, S. Real-time pose estimation from images for multiple humanoid robots. In Robot World Cup; Springer: Cham, Switzerland, 2021; pp. 91–102. [Google Scholar]
  10. Cho, Y.; Son, W.; Bak, J.; Lee, Y.; Lim, H.; Cha, Y. Full-Body Pose Estimation of Humanoid Robots Using Head-Worn Cameras for Digital Human-Augmented Robotic Telepresence. Mathematics 2024, 12, 3039. [Google Scholar] [CrossRef]
  11. Supplementary Video. Available online: https://xrlabku.webflow.io/papers/diverse-humanoid-robot-pose-estimation-using-only-sparse-datasets (accessed on 2 October 2024).
  12. Tesla. Optimus Gen2. 2023. Available online: https://www.youtube.com/@tesla (accessed on 20 August 2024).
  13. Apptronik. Apollo. 2023. Available online: https://apptronik.com/apollo/ (accessed on 20 August 2024).
  14. Boston Dynamics. Atlas. 2016. Available online: https://bostondynamics.com/atlas/ (accessed on 20 August 2024).
  15. Robotis. DARwIn-OP. 2011. Available online: https://emanual.robotis.com/docs/en/platform/op/getting_started/ (accessed on 20 August 2024).
  16. 1X Technologies. EVE. 2023. Available online: https://www.1x.tech/androids/eve (accessed on 20 August 2024).
  17. Figure. FIGURE01. 2023. Available online: https://www.figure.ai/ (accessed on 20 August 2024).
  18. Unitree. H1. 2023. Available online: https://www.unitree.com/h1/ (accessed on 20 August 2024).
  19. Kepler Exploration Robot. Kepler. 2023. Available online: https://www.gotokepler.com/home (accessed on 20 August 2024).
  20. Sanctuary AI. Phoenix. 2023. Available online: https://sanctuary.ai/product/ (accessed on 20 August 2024).
  21. PAL Robotics. TALOS. 2017. Available online: https://pal-robotics.com/robot/talos/ (accessed on 20 August 2024).
  22. DLR. Toro. 2013. Available online: https://www.dlr.de/en/rm/research/robotic-systems/humanoids/toro (accessed on 20 August 2024).
  23. Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  24. Artacho, B.; Savakis, A. Omnipose: A multi-scale framework for multi-person pose estimation. arXiv 2021, arXiv:2103.10180. [Google Scholar]
  25. Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. Vitpose: Simple vision transformer baselines for human pose estimation. Adv. Neural Inf. Process. Syst. 2022, 35, 38571–38584. [Google Scholar]
  26. Huang, Y.; Kaufmann, M.; Aksan, E.; Black, M.J.; Hilliges, O.; Pons-Moll, G. Deep inertial poser: Learning to reconstruct human pose from sparse inertial measurements in real time. ACM Trans. Graph. 2018, 37, 1–15. [Google Scholar] [CrossRef]
  27. Von Marcard, T.; Henschel, R.; Black, M.J.; Rosenhahn, B.; Pons-Moll, G. Recovering accurate 3d human pose in the wild using imus and a moving camera. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 601–617. [Google Scholar]
  28. Pavlakos, G.; Choutas, V.; Ghorbani, N.; Bolkart, T.; Osman, A.A.A.; Tzionas, D.; Black, M.J. Expressive Body Capture: 3D Hands, Face, and Body from a Single Image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–19 June 2019. [Google Scholar]
  29. Kocabas, M.; Athanasiou, N.; Black, M.J. VIBE: Video Inference for Human Body Pose and Shape Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  30. Guzov, V.; Mir, A.; Sattler, T.; Pons-Moll, G. Human poseitioning system (hps): 3d human pose estimation and self-localization in large scenes from body-mounted sensors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4318–4329. [Google Scholar]
  31. Zheng, C.; Zhu, S.; Mendieta, M.; Yang, T.; Chen, C.; Ding, Z. 3D Human Pose Estimation With Spatial and Temporal Transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 11656–11665. [Google Scholar]
  32. Gong, J.; Foo, L.G.; Fan, Z.; Ke, Q.; Rahmani, H.; Liu, J. DiffPose: Toward More Reliable 3D Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13041–13051. [Google Scholar]
  33. Tang, Z.; Qiu, Z.; Hao, Y.; Hong, R.; Yao, T. 3D Human Pose Estimation With Spatio-Temporal Criss-Cross Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 4790–4799. [Google Scholar]
  34. Shan, W.; Liu, Z.; Zhang, X.; Wang, Z.; Han, K.; Wang, S.; Ma, S.; Gao, W. Diffusion-Based 3D Human Pose Estimation with Multi-Hypothesis Aggregation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 14761–14771. [Google Scholar]
  35. Einfalt, M.; Ludwig, K.; Lienhart, R. Uplift and Upsample: Efficient 3D Human Pose Estimation With Uplifting Transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 2903–2913. [Google Scholar]
  36. Jiang, Z.; Zhou, Z.; Li, L.; Chai, W.; Yang, C.Y.; Hwang, J.N. Back to Optimization: Diffusion-Based Zero-Shot 3D Human Pose Estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 6142–6152. [Google Scholar]
  37. Bambuŝek, D.; Materna, Z.; Kapinus, M.; Beran, V.; Smrž, P. Combining interactive spatial augmented reality with head-mounted display for end-user collaborative robot programming. In Proceedings of the 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), New Delhi, India, 14–18 October 2019; pp. 1–8. [Google Scholar]
  38. Qian, L.; Deguet, A.; Wang, Z.; Liu, Y.H.; Kazanzides, P. Augmented reality assisted instrument insertion and tool manipulation for the first assistant in robotic surgery. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 5173–5179. [Google Scholar]
  39. Tran, N. Exploring Mixed Reality Robot Communication under Different Types of Mental Workload; Colorado School of Mines: Golden, CO, USA, 2020. [Google Scholar]
  40. Frank, J.A.; Moorhead, M.; Kapila, V. Mobile mixed-reality interfaces that enhance human–robot interaction in shared spaces. Front. Robot. AI 2017, 4, 20. [Google Scholar] [CrossRef]
  41. Ban, S.; Fan, J.; Zhu, W.; Ma, X.; Qiao, Y.; Wang, Y. Real-time Holistic Robot Pose Estimation with Unknown States. arXiv 2024, arXiv:2402.05655. [Google Scholar]
  42. Tian, Y.; Zhang, J.; Yin, Z.; Dong, H. Robot structure prior guided temporal attention for camera-to-robot pose estimation from image sequence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8917–8926. [Google Scholar]
  43. Rodrigues, I.R.; Dantas, M.; de Oliveira Filho, A.T.; Barbosa, G.; Bezerra, D.; Souza, R.; Marquezini, M.V.; Endo, P.T.; Kelner, J.; Sadok, D. A framework for robotic arm pose estimation and movement prediction based on deep and extreme learning models. J. Supercomput. 2023, 79, 7176–7205. [Google Scholar] [CrossRef]
  44. Olson, E. AprilTag: A robust and flexible visual fiducial system. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 3400–3407. [Google Scholar] [CrossRef]
  45. Kalaitzakis, M.; Cain, B.; Carroll, S.; Ambrosi, A.; Whitehead, C.; Vitzilaios, N. Fiducial markers for pose estimation: Overview, applications and experimental comparison of the artag, apriltag, aruco and stag markers. J. Intell. Robot. Syst. 2021, 101, 1–26. [Google Scholar] [CrossRef]
  46. Ilonen, J.; Kyrki, V. Robust robot-camera calibration. In Proceedings of the 2011 15th International Conference on Advanced Robotics (ICAR), Tallinn, Estonia, 20–23 June 2011; pp. 67–74. [Google Scholar] [CrossRef]
  47. Davis, L.; Clarkson, E.; Rolland, J.P. Predicting accuracy in pose estimation for marker-based tracking. In Proceedings of the Second IEEE and ACM International Symposium on Mixed and Augmented Reality, Tokyo, Japan, 10 October 2003; pp. 28–35. [Google Scholar]
  48. Ebmer, G.; Loch, A.; Vu, M.N.; Mecca, R.; Haessig, G.; Hartl-Nesic, C.; Vincze, M.; Kugi, A. Real-Time 6-DoF Pose Estimation by an Event-Based Camera Using Active LED Markers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 8137–8146. [Google Scholar]
  49. Ishida, M.; Shimonomura, K. Marker based camera pose estimation for underwater robots. In Proceedings of the 2012 IEEE/SICE International Symposium on System Integration (SII), Fukuoka, Japan, 16–18 December 2012; pp. 629–634. [Google Scholar]
  50. Garrido-Jurado, S.; Muñoz-Salinas, R.; Madrid-Cuevas, F.; Marín-Jiménez, M. Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognit. 2014, 47, 2280–2292. [Google Scholar] [CrossRef]
  51. Romero-Ramirez, F.J.; Muñoz-Salinas, R.; Medina-Carnicer, R. Fractal markers: A new approach for long-range marker pose estimation under occlusion. IEEE Access 2019, 7, 169908–169919. [Google Scholar] [CrossRef]
  52. Di Giambattista, V.; Fawakherji, M.; Suriani, V.; Bloisi, D.D.; Nardi, D. On Field Gesture-Based Robot-to-Robot Communication with NAO Soccer Players. In Proceedings of the RoboCup 2019: Robot World Cup XXIII, Sydney, Australia, 2–8 July 2019; Chalup, S., Niemueller, T., Suthakorn, J., Williams, M.A., Eds.; Springer: Cham, Switzerland, 2019; pp. 367–375. [Google Scholar]
  53. V7 Labs. V7 Darwin. 2019. Available online: https://www.v7labs.com/darwin/ (accessed on 9 September 2024).
  54. Viggle. Available online: https://viggle.ai/ (accessed on 23 August 2024).
  55. Epic Games. Unreal Engine. 1995. Available online: https://www.unrealengine.com/ (accessed on 9 September 2024).
  56. Cha, Y.W.; Shaik, H.; Zhang, Q.; Feng, F.; State, A.; Ilie, A.; Fuchs, H. Mobile. Egocentric human body motion reconstruction using only eyeglasses-mounted cameras and a few body-worn inertial sensors. In Proceedings of the 2021 IEEE Virtual Reality and 3D User Interfaces (VR), Lisboa, Portugal, 27 March–1 April 2021; pp. 616–625. [Google Scholar]
  57. Akada, H.; Wang, J.; Shimada, S.; Takahashi, M.; Theobalt, C.; Golyanik, V. Unrealego: A new dataset for robust egocentric 3d human motion capture. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 1–17. [Google Scholar]
  58. Adobe Photoshop Generative Fill. Available online: https://www.adobe.com/products/photoshop/generative-fill.html (accessed on 23 August 2024).
  59. Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VIII 14. Springer: Cham, Switzerland, 2016; pp. 483–499. [Google Scholar]
  60. Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3686–3693. [Google Scholar]
  61. Lovanshi, M.; Tiwari, V. Human pose estimation: Benchmarking deep learning-based methods. In Proceedings of the 2022 IEEE Conference on Interdisciplinary Approaches in Technology and Management for Social Innovation (IATMSI), Gwalior, India, 21–23 December 2022; pp. 1–6. [Google Scholar]
  62. Nibali, A.; He, Z.; Morgan, S.; Prendergast, L. Numerical coordinate regression with convolutional neural networks. arXiv 2018, arXiv:1801.07372. [Google Scholar]
  63. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  64. Wang, J.; Perez, L. The effectiveness of data augmentation in image classification using deep learning. Convolutional Neural Netw. Vis. Recognit. 2017, 11, 1–8. [Google Scholar]
  65. Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 1–48. [Google Scholar] [CrossRef]
  66. Yang, S.; Xiao, W.; Zhang, M.; Guo, S.; Zhao, J.; Shen, F. Image data augmentation for deep learning: A survey. arXiv 2022, arXiv:2204.08610. [Google Scholar]
  67. PyTorch. Available online: https://pytorch.org (accessed on 12 September 2024).
  68. Open Neural Network Exchange. Available online: https://onnx.ai (accessed on 12 September 2024).
Figure 1. We present a learning-based full-body pose estimation method for various humanoid robots. Our keypoint detector, trained on an extended pose dataset, consistently estimates humanoid robot poses over time from videos, capturing front, side, back, and partial poses, while excluding human bodies (see supplemental video [11]). The following robots are shown: (a) Optimus Gen 2 (Tesla) [12]; (b) Apollo (Apptronik) [13]; (c) Atlas (Boston Dynamics) [14]; (d) Darwin-OP (Robotis) [15]; (e) EVE (1X Technologies) [16]; (f) FIGURE 01 (Figure) [17]; (g) H1 (Unitree) [18]; (h) Kepler (Kepler Exploration Robot) [19]; (i) Phoenix (Sanctuary AI) [20]; (j) TALOS (PAL Robotics) [21]; (k) Toro (DLR) [22].
Figure 2. Joint configurations of humanoid robots used in the Diverse Humanoid Robot Pose Dataset (DHRP). (a) Apollo. (b) Atlas. (c) Darwin-OP. (d) EVE. (e) FIGURE 01. (f) H1. (g) Kepler. (h) Optimus Gen 2. (i) Phoenix. (j) TALOS. (k) Toro. Note: Phoenix lacks lower-body data in the dataset.
Figure 3. Example training images from the DHRP Dataset. (a) Apollo. (b) Atlas. (c) Darwin-OP. (d) EVE. (e) FIGURE 01. (f) H1. (g) Kepler. (h) Optimus Gen 2. (i) Phoenix. (j) TALOS. (k) Toro. Note: Phoenix lacks lower-body data in the dataset.
Figure 4. Example images from the arbitrary random humanoid robot dataset. These 2k additional images enhance the diversity of body shapes, appearances, and motions for various humanoid robots.
Figure 5. Example images from the synthetic dataset. The first row displays examples generated through AI-assisted image synthesis using Viggle [54]. The second row showcases examples created via 3D character simulations using Unreal Engine [55]. These 6.7k additional annotations enhance the diversity of motions and scenarios for various humanoid robots.
Figure 6. Example images from the random background dataset. The first row displays examples from the target humanoid robot dataset, while the second row shows the corresponding foreground-removed images in the background dataset, generated using Adobe Photoshop Generative Fill [58]. This dataset includes 133 AI-assisted foreground-removed images and 1886 random indoor and outdoor background images. The 2k background images, which do not feature humanoid robots, improve the distinction between robots and their backgrounds, particularly in environments with metallic objects that may resemble the robots’ body surfaces.
Figure 7. Network architecture for the 2D joint detector: Starting with a single input image, the n-stage network generates keypoint coordinates K and their corresponding confidence heat maps H. At each stage, the output from the hourglass module [59] is passed forward to both the next stage and the Differentiable Spatial-to-Numerical Transform (DSNT) regression module [62]. The DSNT module then produces both H and K. In the parsing stage, each keypoint k_j = (x_j, y_j) ∈ K is identified with its associated confidence value c_j = H_j(x_j, y_j).
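To make the parsing step concrete, the following minimal PyTorch sketch shows how a DSNT-style soft-argmax can turn a stage's heatmaps into keypoint coordinates and how the confidence c_j can be read back from the heatmap at the predicted location. The tensor shapes, the sigmoid normalization of the confidence maps, and the nearest-pixel lookup are illustrative assumptions, not the exact implementation used in our experiments.

```python
# Illustrative sketch of DSNT-style keypoint parsing; shapes and the sigmoid
# normalization are assumptions, not the paper's exact implementation.
import torch
import torch.nn.functional as F

def dsnt_coordinates(heatmaps: torch.Tensor) -> torch.Tensor:
    """heatmaps: (B, J, H, W) unnormalized maps -> (B, J, 2) coords in [-1, 1]."""
    b, j, h, w = heatmaps.shape
    probs = F.softmax(heatmaps.view(b, j, -1), dim=-1).view(b, j, h, w)
    xs = torch.linspace(-1.0, 1.0, w, device=heatmaps.device)
    ys = torch.linspace(-1.0, 1.0, h, device=heatmaps.device)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)   # expectation over columns
    y = (probs.sum(dim=3) * ys).sum(dim=-1)   # expectation over rows
    return torch.stack([x, y], dim=-1)

def parse_keypoints(heatmaps: torch.Tensor):
    """Return keypoints K and confidences c_j = H_j(x_j, y_j)."""
    coords = dsnt_coordinates(heatmaps)                     # (B, J, 2)
    b, j, h, w = heatmaps.shape
    conf_maps = torch.sigmoid(heatmaps)                     # squash to [0, 1]
    # Map normalized coordinates back to pixel indices and read the heatmap value.
    px = ((coords[..., 0] + 1) * 0.5 * (w - 1)).round().long().clamp(0, w - 1)
    py = ((coords[..., 1] + 1) * 0.5 * (h - 1)).round().long().clamp(0, h - 1)
    bi = torch.arange(b).unsqueeze(1).expand(b, j)
    ji = torch.arange(j).unsqueeze(0).expand(b, j)
    conf = conf_maps[bi, ji, py, px]                        # (B, J)
    return coords, conf

if __name__ == "__main__":
    maps = torch.randn(2, 8, 80, 80)     # e.g., 8 joints on 80x80 heatmaps
    k, c = parse_keypoints(maps)
    print(k.shape, c.shape)              # (2, 8, 2) and (2, 8)
```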
Figure 8. Qualitative evaluation on selected frames. The proposed learning-based full-body pose estimation method for various humanoid robots, trained on our DHRP dataset, consistently estimates the poses of humanoid robots over time from video frames, capturing front, side, back, and partial poses. (a) Apollo. (b) Atlas. (c) Darwin-OP. (d) EVE. (e) FIGURE 01. (f) H1. (g) Kepler. (h) Phoenix. (i) TALOS. (j) Toro.
Figure 9. Qualitative evaluation on selected frames. Full-body pose estimation results for miniature humanoid robot models not included in the DHRP dataset. These results demonstrate that our method can be extended to other types of humanoid robots.
Figure 10. Common failure cases: (ac) False part detections caused by interference from nearby metallic objects. (d) False negatives due to interference from objects with similar appearances. (e) False negatives in egocentric views caused by the rarity of torso body part observations. (f) False positives on non-humanoid robots.
Table 1. Comparison of robot pose estimation methods.
Method | Keypoints | Target Joints | Target Robots
(a) Robot Arm Pose Estimation [5,6,7] | (keypoint diagram) | Arm joints | Less than 3
(b) Humanoid Robot Pose Estimation [8,9] | (keypoint diagram) | Partial-body joints | Less than 3
(c) Diverse Humanoid Robot Pose Estimation (Ours) | (keypoint diagram) | Full-body joints | 11 commercial robots
Table 2. Number of frames in real dataset used for training and evaluating target humanoid robots. On average, 347 images are used per robot for training.
Humanoid Robot | Train Size | Test Size
Apollo (Apptronik) [13] | 530 | 118
Atlas (Boston Dynamics) [14] | 337 | 141
Darwin-OP (Robotis) [15] | 348 | 134
EVE (1X Technologies) [16] | 512 | 100
FIGURE 01 (Figure) [17] | 261 | 146
H1 (Unitree) [18] | 200 | 110
Kepler (Kepler Exploration Robot) [19] | 332 | 136
Optimus Gen 2 (Tesla) [12] | 474 | 135
Phoenix (Sanctuary AI) [20] | 233 | 168
TALOS (PAL Robotics) [21] | 358 | 108
Toro (DLR) [22] | 233 | 158
Total | 3818 | 1454
Table 3. Total DHRP Dataset, including target robots, augmented with other random humanoid robots, synthetic data, and random backgrounds used for training and evaluation, presented in number of frames.
Train Set | Size | Test Set | Size
Real (11 target humanoid robots) | 3818 | Real (11 target humanoid robots) | 1454
Real (arbitrary humanoid robots) | 2027 | – | –
Synthetic dataset | 6733 | – | –
Random backgrounds | 2019 | – | –
Total | 14,597 | Total | 1454
Table 4. Random data augmentation during training.
Augment. Type | Augment. Method | Prob. | Range
Motion jitter | Image scale | 0.8 | 0.5–1.6
Motion jitter | Image rotation | 0.8 | ≤90°
Motion jitter | Image translation | 0.8 | ≤0.3 × image height
Motion jitter | Horizontal flip | 0.5 | –
Color jitter | Pixel contrast | 0.8 | ≤×0.2
Color jitter | Pixel brightness | 0.8 | ≤±30
Noise jitter | Gaussian blur | 0.4 | σ ≤ 2.0
Noise jitter | Salt-and-pepper noise | 0.3 | ≤±25
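The schedule in Table 4 can be realized by sampling each augmentation independently with its listed probability and range, applying the geometric transforms to the image and its keypoint annotations jointly. The NumPy/OpenCV sketch below is one possible realization under simplifying assumptions (a single affine warp covering scale, rotation, and translation; no left/right joint relabeling after flipping; a fixed 2% pixel fraction for the salt-and-pepper noise); it is not the exact training pipeline used in our experiments.

```python
# Illustrative sketch of the Table 4 augmentation schedule; parameters marked
# below as assumptions are not taken from the paper.
import numpy as np
import cv2

def augment(img: np.ndarray, kps: np.ndarray, rng=np.random.default_rng()):
    """img: HxWx3 uint8 image; kps: (J, 2) pixel coordinates."""
    h, w = img.shape[:2]
    kps = kps.astype(np.float32).copy()
    # Motion jitter: one affine warp for scale, rotation, and translation (p = 0.8).
    if rng.random() < 0.8:
        M = cv2.getRotationMatrix2D((w / 2, h / 2),
                                    angle=rng.uniform(-90.0, 90.0),
                                    scale=rng.uniform(0.5, 1.6))
        M[:, 2] += (rng.uniform(-0.3, 0.3) * h, rng.uniform(-0.3, 0.3) * h)
        img = cv2.warpAffine(img, M, (w, h))
        kps = kps @ M[:, :2].T + M[:, 2]        # same warp applied to labels
    # Horizontal flip (p = 0.5); a full pipeline would also swap left/right joints.
    if rng.random() < 0.5:
        img = img[:, ::-1].copy()
        kps[:, 0] = w - 1 - kps[:, 0]
    # Color jitter: contrast within ±0.2x and brightness within ±30 (p = 0.8).
    if rng.random() < 0.8:
        alpha = 1.0 + rng.uniform(-0.2, 0.2)
        beta = rng.uniform(-30.0, 30.0)
        img = np.clip(img.astype(np.float32) * alpha + beta, 0, 255).astype(np.uint8)
    # Noise jitter: Gaussian blur with sigma <= 2.0 (p = 0.4).
    if rng.random() < 0.4:
        img = cv2.GaussianBlur(img, (0, 0), sigmaX=rng.uniform(0.1, 2.0))
    # Noise jitter: salt-and-pepper-style perturbation up to ±25 on ~2% of pixels
    # (p = 0.3; the 2% fraction is an assumption).
    if rng.random() < 0.3:
        mask = rng.random((h, w)) < 0.02
        img = img.astype(np.int16)
        img[mask] += rng.integers(-25, 26, size=(int(mask.sum()), 3), dtype=np.int16)
        img = np.clip(img, 0, 255).astype(np.uint8)
    return img, kps
```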
Table 5. An ablation study of the dataset configurations. The evaluations are performed on the DHRP test set per object, utilizing 4-stage hourglass networks. The best results are shown in bold, and the worst are underlined. Dataset Configurations: Dataset A = target humanoid robots; Dataset B = real (target + arbitrary humanoid robots); Dataset C = real + synthetic data; Dataset D = real + synthetic + random background data.
Configuration | Model | mAP_obj | AP_75 | AP_85 | mAR_obj | AR_75 | AR_85
Dataset A | 4-stage HG | 73.9 | 81.9 | 61.6 | 51.7 | 50.8 | 31.8
Dataset B | 4-stage HG | 80.9 | 89.9 | 74.0 | 68.4 | 73.2 | 51.4
Dataset C | 4-stage HG | 83.5 | 91.7 | 79.3 | 73.7 | 79.6 | 60.7
Dataset D | 4-stage HG | 84.9 | 93.2 | 82.1 | 75.1 | 80.6 | 64.5
Table 6. Ablation study of the dataset configurations. Evaluations are conducted on the DHRP test set for each joint, employing 4-stage hourglass networks across all assessments, with AR_jnt as the evaluation metric. The best results are highlighted in bold, while the worst are underlined. The dataset configurations are consistent with those presented in Table 5.
Configuration | Model | Nose | Neck | Sho | Elb | Wri | Hip | Knee | Ank | mAR_jnt
Dataset A | 4-stage HG | 78.6 | 79.1 | 73.9 | 61.4 | 45.0 | 73.3 | 77.3 | 69.3 | 69.7
Dataset B | 4-stage HG | 87.0 | 88.8 | 85.4 | 70.4 | 63.7 | 84.1 | 80.6 | 77.9 | 79.7
Dataset C | 4-stage HG | 89.6 | 88.9 | 87.7 | 76.3 | 67.4 | 88.4 | 85.9 | 80.8 | 83.1
Dataset D | 4-stage HG | 89.9 | 89.9 | 88.2 | 76.8 | 70.3 | 89.0 | 85.0 | 80.5 | 83.7
Table 7. An ablation study of random data augmentations during training. The evaluations are performed on the DHRP test set for each object using 4-stage hourglass networks. The best results are shown in bold, and the worst are underlined. Training configurations: Train A = Dataset D with no random data augmentation during training; Train B = Dataset D with random image transformations and random color jitter; Train C = Dataset D with random image transformations, random color jitter, random blur, and random noise.
Configuration | Model | mAP_obj | AP_75 | AP_85 | mAR_obj | AR_75 | AR_85
Train A | 4-stage HG | 64.7 | 72.3 | 46.0 | 27.3 | 23.7 | 13.3
Train B | 4-stage HG | 82.6 | 90.9 | 77.7 | 65.4 | 69.9 | 47.6
Train C | 4-stage HG | 84.9 | 93.2 | 82.1 | 75.1 | 80.6 | 64.5
Table 8. An ablation study on individual target humanoid robot datasets using the DHRP test set and 4-stage hourglass networks, evaluated with the AP_obj metric. The best results are indicated in bold, while the worst are underlined. The dataset configurations are as follows: Individual = trained separately using only the corresponding target humanoid robot data; Dataset S = arbitrary humanoid robots + synthetic + random background data with no target robot data; Dataset A (Target robots only) and Dataset D (Full) align with the configurations presented in Table 5.
Configuration | mAP_obj | Apo. | Atl. | Dar. | Eve | Fig. | H1 | Kep. | Opt. | Pho. | Tal. | Tor.
Individual | 51.0 | 48.8 | 57.8 | 61.3 | 38.4 | 27.8 | 49.9 | 90.8 | 50.9 | 34.3 | 42.6 | 58.4
Dataset S | 58.5 | 79.2 | 56.9 | 54.5 | 65.9 | 32.2 | 72.1 | 79.0 | 73.6 | 51.3 | 29.9 | 55.0
Dataset A | 73.9 | 71.4 | 79.9 | 66.9 | 75.3 | 49.4 | 85.1 | 90.6 | 64.4 | 75.0 | 81.6 | 77.5
Dataset D | 84.9 | 85.9 | 88.1 | 79.7 | 82.4 | 69.7 | 95.4 | 94.3 | 80.0 | 83.5 | 91.8 | 87.2
Table 9. An ablation study of each humanoid robot dataset on the DHRP test set using the leave-one-out method. A 4-stage hourglass network is employed for evaluation, utilizing the AR_obj metric for each class. Each row represents the method trained using Dataset D (real + synthetic + random background data) while excluding the corresponding target robot data. The best results are shown in bold, while the worst are underlined for each column.
Leave-One-Out | mAR_obj | Apo. | Atl. | Dar. | Eve | Fig. | H1 | Kep. | Opt. | Pho. | Tal. | Tor.
Using All | 75.1 | 74.5 | 80.6 | 73.9 | 72.3 | 63.0 | 89.9 | 90.8 | 67.7 | 62.0 | 81.2 | 77.5
w/o Apo. | 65.6 | 52.5 | 74.8 | 65.9 | 62.4 | 55.3 | 83.7 | 83.2 | 61.3 | 48.3 | 66.3 | 71.9
w/o Atl. | 62.1 | 62.0 | 38.0 | 67.1 | 69.1 | 50.7 | 84.0 | 84.3 | 56.3 | 43.5 | 67.3 | 72.5
w/o Dar. | 65.0 | 63.6 | 75.6 | 47.5 | 61.4 | 54.8 | 82.5 | 82.6 | 59.5 | 46.0 | 67.9 | 78.8
w/o Eve | 64.8 | 59.2 | 76.7 | 65.8 | 29.5 | 56.4 | 85.3 | 84.9 | 59.0 | 46.8 | 74.3 | 73.4
w/o Fig. | 65.9 | 60.3 | 73.8 | 68.4 | 70.1 | 48.5 | 86.2 | 82.4 | 63.6 | 47.2 | 64.0 | 69.8
w/o H1 | 66.5 | 61.0 | 75.8 | 69.9 | 63.1 | 60.9 | 51.3 | 84.8 | 63.2 | 50.8 | 75.7 | 74.7
w/o Kep. | 62.5 | 62.0 | 74.1 | 61.3 | 65.2 | 54.7 | 80.8 | 63.7 | 55.9 | 44.5 | 56.7 | 74.1
w/o Opt. | 66.7 | 62.3 | 72.9 | 67.0 | 64.7 | 57.3 | 83.7 | 83.8 | 56.2 | 49.5 | 66.4 | 74.7
w/o Pho. | 64.3 | 60.6 | 73.0 | 63.7 | 66.7 | 62.1 | 83.3 | 85.6 | 60.0 | 22.0 | 73.1 | 71.0
w/o Tal. | 62.1 | 66.5 | 74.3 | 66.4 | 54.8 | 55.0 | 82.2 | 85.8 | 61.9 | 43.6 | 18.1 | 71.1
w/o Tor. | 63.9 | 59.4 | 76.6 | 65.9 | 64.0 | 59.4 | 84.6 | 87.4 | 58.1 | 48.3 | 73.1 | 39.1
Table 10. Comparison of models by stage number on the DHRP test set per object. mAP_obj and mAR_obj are used as the evaluation metrics. The best results are shown in bold, and the worst are underlined.
Model | mAP_obj | AP_75 | AP_85 | mAR_obj | AR_75 | AR_85
4-stage hourglass | 84.9 | 93.2 | 82.1 | 75.1 | 80.6 | 64.5
5-stage hourglass | 86.7 | 94.3 | 84.5 | 79.2 | 86.3 | 70.4
6-stage hourglass | 86.4 | 93.9 | 83.9 | 79.8 | 86.9 | 71.7
7-stage hourglass | 87.4 | 95.3 | 86.2 | 81.5 | 88.6 | 74.4
8-stage hourglass | 83.6 | 92.5 | 78.1 | 75.2 | 81.9 | 62.5
Table 11. Comparison of models by stage number on the DHRP test set per joint. mAR_jnt is used as the evaluation metric. The best results are shown in bold, and the worst are underlined.
Method | Nose | Neck | Sho | Elb | Wri | Hip | Knee | Ank | mAR_jnt
4-stage hourglass | 89.9 | 89.9 | 88.2 | 76.8 | 70.3 | 89.0 | 85.0 | 80.5 | 83.7
5-stage hourglass | 90.6 | 90.9 | 90.3 | 79.0 | 73.4 | 91.3 | 87.5 | 83.9 | 85.9
6-stage hourglass | 89.4 | 90.2 | 89.5 | 80.7 | 75.5 | 90.2 | 88.0 | 84.4 | 86.0
7-stage hourglass | 92.1 | 91.1 | 91.2 | 80.6 | 75.5 | 92.4 | 88.0 | 86.2 | 87.1
8-stage hourglass | 88.7 | 88.4 | 88.7 | 78.0 | 69.1 | 87.7 | 86.4 | 83.0 | 83.8
Table 12. Evaluation of network models by stage number. All models process normalized input images of 320 × 320 pixels, obtained by transforming arbitrary image sizes.
Model | Parameters | GFLOPs | Training (h) | Inference (FPS)
4-stage hourglass | 13.0 M | 48.66 | 10.5 | 63.64
5-stage hourglass | 16.2 M | 58.76 | 12.8 | 54.56
6-stage hourglass | 19.3 M | 68.88 | 14.6 | 48.34
7-stage hourglass | 22.4 M | 78.98 | 17.7 | 42.81
8-stage hourglass | 25.6 M | 89.10 | 19.8 | 38.34
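The parameter counts and inference speeds in Table 12 correspond to measurements of the kind sketched below. The snippet assumes a PyTorch module, a single 320 × 320 input, and GPU warm-up before timing; the exact hardware, batch size, and ONNX export path [68] used for the reported numbers are not reproduced here.

```python
# Illustrative benchmarking sketch; warm-up count and iteration count are
# assumptions, not the paper's exact measurement protocol.
import time
import torch

@torch.no_grad()
def benchmark(model: torch.nn.Module, iters: int = 200):
    """Return (parameters in millions, inference FPS) for a 320x320 input."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.eval().to(device)
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    x = torch.randn(1, 3, 320, 320, device=device)   # normalized input size
    for _ in range(10):                               # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    fps = iters / (time.perf_counter() - start)
    return params_m, fps
```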
Table 13. Details of the network architectures for the evaluated methods.
Method | Input Size | Backbone | Parameters | GFLOPs | Inference (FPS)
RoboCup (NimbRo-Net2) [9] | 384 | ResNet18 | 12.8 M | 28.0 | 48
Ours (4-stage) | 320 | Hourglass | 13.0 M | 48.66 | 63.64
Ours (7-stage) | 320 | Hourglass | 22.4 M | 78.98 | 42.81
Table 14. Comparative evaluations on the DHRP test set per joint. mAR_jnt is used as the evaluation metric. The best results are shown in bold, and the worst are underlined.
Method | Nose | Neck | Sho | Elb | Wri | Hip | Knee | Ank | mAR_jnt
RoboCup (NimbRo-Net2) [9] | 49.9 | 64.1 | 49.1 | 25.5 | 10.5 | 39.1 | 23.8 | 30.2 | 36.5
Ours (4-stage) | 89.9 | 89.9 | 88.2 | 76.8 | 70.3 | 89.0 | 85.0 | 80.5 | 83.7
Ours (7-stage) | 92.1 | 91.1 | 91.2 | 80.6 | 75.5 | 92.4 | 88.0 | 86.2 | 87.1
Table 15. Performance evaluation on the HRP validation set [10]. Our method is evaluated using the Percentage of Correct Keypoints (PCK) as the evaluation metric. The best results are highlighted in bold.
Method | mPCK | PCK@– | PCK@– | PCK@– | PCK@– | PCK@–
Human+Robot Telepresence [10] | 85.10 | 99.99 | 99.98 | 98.47 | 80.89 | 46.19
Ours (7-stage) | 85.76 | 100.0 | 100.0 | 98.73 | 82.56 | 47.52
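PCK counts a predicted joint as correct when its distance to the ground-truth joint falls within a threshold fraction of a per-instance reference length, and mPCK averages PCK over several thresholds. The sketch below is a minimal illustration under that assumption; the reference length and the specific thresholds follow the HRP protocol [10] and are only placeholders here.

```python
# Illustrative PCK/mPCK sketch; the reference length and threshold values are
# placeholders, not the exact HRP protocol [10].
import numpy as np

def pck(pred, gt, ref_len, visible, alpha):
    """pred, gt: (N, J, 2) pixel coords; ref_len: (N,); visible: (N, J) bool."""
    dist = np.linalg.norm(pred - gt, axis=-1)        # per-joint pixel error
    correct = dist <= alpha * ref_len[:, None]       # threshold scaled per instance
    return float(correct[visible].mean())

def mpck(pred, gt, ref_len, visible, alphas=(0.5, 0.4, 0.3, 0.2, 0.1)):
    """Mean PCK over a set of thresholds (illustrative values only)."""
    return float(np.mean([pck(pred, gt, ref_len, visible, a) for a in alphas]))
```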
Table 16. Evaluation of each target humanoid robot on the DHRP test set by object, using 7-stage hourglass networks and the evaluation metrics mAP_obj and mAR_obj. The best results are highlighted in bold, while the worst are underlined.
Class | Model | mAP_obj | AP_75 | AP_85 | mAR_obj | AR_75 | AR_85
Total | 7-stage HG | 87.4 | 95.3 | 86.2 | 81.5 | 88.6 | 74.4
Apollo | 7-stage HG | 90.0 | 97.5 | 92.4 | 83.6 | 89.8 | 79.7
Atlas | 7-stage HG | 87.7 | 95.0 | 88.7 | 83.1 | 91.5 | 76.6
Darwin-OP | 7-stage HG | 84.1 | 87.3 | 81.3 | 82.6 | 86.6 | 80.6
EVE | 7-stage HG | 83.6 | 100.0 | 90.0 | 82.1 | 99.0 | 85.0
FIGURE 01 | 7-stage HG | 72.3 | 85.6 | 54.1 | 69.5 | 79.5 | 50.7
H1 | 7-stage HG | 95.9 | 99.1 | 98.2 | 91.6 | 97.3 | 94.5
Kepler | 7-stage HG | 95.4 | 97.1 | 97.1 | 90.1 | 95.6 | 89.0
Optimus Gen 2 | 7-stage HG | 86.1 | 92.6 | 82.2 | 76.6 | 82.2 | 67.4
Phoenix | 7-stage HG | 86.9 | 99.4 | 84.5 | 72.1 | 76.8 | 51.2
TALOS | 7-stage HG | 91.1 | 100.0 | 97.2 | 87.8 | 98.1 | 89.8
Toro | 7-stage HG | 91.5 | 98.1 | 91.1 | 84.3 | 88.6 | 72.2
Table 17. Evaluation of each target humanoid robot on the DHRP test set by joint, using 7-stage hourglass networks with mAR_jnt as the evaluation metric. The best results are highlighted in bold, while the worst are underlined. Note: The Phoenix robot lacks lower-body data in the dataset.
Class | Model | Nose | Neck | Sho | Elb | Wri | Hip | Knee | Ank | mAR_jnt
Total | 7-stage HG | 92.1 | 91.1 | 91.2 | 80.6 | 75.5 | 92.4 | 88.0 | 86.2 | 87.1
Apollo | 7-stage HG | 92.6 | 93.2 | 91.9 | 83.1 | 71.4 | 95.8 | 84.2 | 98.5 | 88.8
Atlas | 7-stage HG | 93.8 | 91.8 | 93.0 | 75.6 | 70.9 | 92.6 | 91.4 | 91.4 | 87.6
Darwin-OP | 7-stage HG | 92.7 | 83.9 | 91.0 | 78.0 | 79.8 | 98.4 | 94.8 | 81.6 | 87.5
EVE | 7-stage HG | 95.4 | 93.4 | 93.4 | 83.8 | 68.9 | 79.5 | 78.3 | 74.1 | 83.3
FIGURE 01 | 7-stage HG | 92.7 | 89.8 | 88.6 | 61.5 | 53.2 | 82.1 | 71.5 | 69.6 | 76.1
H1 | 7-stage HG | 88.8 | 90.9 | 96.1 | 95.0 | 87.4 | 95.4 | 94.9 | 95.5 | 93.0
Kepler | 7-stage HG | 94.2 | 94.1 | 91.3 | 93.0 | 89.8 | 97.2 | 84.1 | 99.3 | 92.9
Optimus Gen 2 | 7-stage HG | 86.9 | 93.1 | 85.4 | 80.9 | 77.7 | 92.6 | 80.0 | 95.3 | 86.5
Phoenix | 7-stage HG | 98.5 | 98.5 | 91.9 | 66.3 | 61.5 | 90.5 | N/A | N/A | 84.5
TALOS | 7-stage HG | 79.4 | 85.2 | 93.2 | 89.4 | 89.2 | 94.9 | 90.6 | 89.8 | 89.0
Toro | 7-stage HG | 91.7 | 87.0 | 89.4 | 90.7 | 87.4 | 89.8 | 94.7 | 82.4 | 89.1