Article

3D Human Pose Estimation Based on Wearable IMUs and Multiple Camera Views

College of Automation, Guangxi University of Science and Technology, Liuzhou 545006, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(15), 2926; https://doi.org/10.3390/electronics13152926
Submission received: 6 July 2024 / Revised: 15 July 2024 / Accepted: 16 July 2024 / Published: 24 July 2024

Abstract

The problem of 3D human pose estimation (HPE) has been the focus of research in recent years, yet precise estimation remains an under-explored challenge. In this paper, the merits of both multiview images and wearable IMUs are combined to enhance the process of 3D HPE. We build upon a state-of-the-art baseline while introducing three novelties. First, we enhance the precision of keypoint localization by substituting Gaussian kernels with Laplacian kernels in the generation of target heatmaps. Second, we enhance the orientation regularized network (ORN) so that cross-modal heatmap fusion takes a weighted average of the top-scored values instead of relying solely on the maximum value. This not only improves robustness to outliers but also leads to higher accuracy in pose estimation. Lastly, we modify the limb length constraint in the conventional orientation regularized pictorial structure model (ORPSM) to improve the estimation of joint positions. Specifically, we devise a soft-coded binary term for the limb length constraint, hence imposing a flexible and smoothed penalization and reducing sensitivity to hyperparameters. The experimental results on the TotalCapture dataset reveal a significant improvement, with a 10.3% increase in PCKh accuracy at the one-twelfth threshold and a 3.9 mm reduction in MPJPE error compared to the baseline.

1. Introduction

Three-dimensional human pose estimation has always been one of the most important research directions in the field of computer vision, widely applied in computer vision and artificial intelligence. It is closely related to human motion analysis, action recognition, and human-computer interaction [1,2,3,4]. The task of 3D human pose estimation involves inferring the pose information of a person in 3D space through the analysis of image or video data. Pose information includes parameters such as the position, rotation, and angles of human joints, which accurately describe the posture and actions of the human body in space. The purpose of conducting research on 3D human pose estimation is to achieve the high-precision prediction and analysis of human motion.
Three-dimensional pose estimation tasks encompass both single-sensor models and multi-sensor models. Single-sensor models [5,6,7] rely on the 2D keypoint detection of the human body in monocular RGB images, followed by methods such as model computation and depth inference to estimate the 3D pose. This approach relies on the accurate detection and reconstruction of human pose information in monocular images, which suffers from occlusion and imprecise depth estimation. Consequently, mainstream models often adopt multi-sensor methods, which can be further categorized into multi-camera methods, multi-IMU methods, and a mixture of both. Methods based on multiple cameras [3,8,9,10,11] address the limitations of single-view methods related to occlusion and inaccurate depth estimation by resolving the geometric relationship between multiple input views. However, these methods still face significant constraints under severe occlusion and low-light conditions and are unable to identify regions outside the camera's field of view. IMU-based methods [12,13,14] inherently circumvent the limitations induced by occlusion and an inadequate field of view, yet they encounter new challenges related to noise and cumulative errors. In light of such challenges, researchers have endeavored to exploit the merits of the joint utilization of cameras and IMUs [15,16,17,18]. In comparison to camera-based or IMU-based approaches, employing multiple cameras and IMUs for 3D human pose estimation demonstrates advantages in both accuracy and robustness.
Among the methods based on multiple cameras and IMUs, GeoFuse [15] achieved state-of-the-art (SOTA) performance by first employing SimpleNet [41] to estimate heatmaps and then enhancing the heatmaps using an orientation regularized network (ORN), ultimately solving for the 3D human pose through an orientation regularized pictorial structure model (ORPSM). In this paper, we build upon GeoFuse and propose GeoFuse++, featuring three novelties. Firstly, we employ a Laplacian distribution function to replace the Gaussian one used by GeoFuse, leading to more precise keypoint localization. Secondly, we devise an ensemble strategy for heatmap enhancement. Specifically, the top two heatmap values corresponding to the possible projected positions of linked joints are selected and weighted to enhance the target joint's heatmap value. Relying solely on the maximum value may lead to instability in the estimation results if it is influenced by outliers or noise; taking the weighted average of the maximum and second maximum values mitigates this impact, thereby enhancing the model's robustness and accuracy. Lastly, we refine the ORPSM by introducing soft-coded binary terms to constrain limb lengths, replacing the hard-coded terms used in the baseline. This modification makes the model less sensitive to hyperparameters and imposes a smoother penalty on limb estimation errors.
In order to ascertain the efficacy of our approach, experiments were carried out on the TotalCapture dataset. Relative to the baseline, SN + Lp demonstrated a 6.8% improvement in 2D keypoint accuracy (PCKh at the one-twelfth threshold), while ORN exhibited a further 0.7% enhancement in 2D keypoint accuracy (PCKh at the one-twelfth threshold), along with a reduction of 3.9 mm in the 3D keypoint error (MPJPE). In summary, our contribution can be categorized as three-fold:
  • Utilizing the Laplacian function to predict the probability distribution of keypoints;
  • Enhancing cross-modal heatmap fusion by taking a weighted average of the top-scored values rather than relying solely on a single maximum value;
  • Constraining limb lengths using soft-coded binary terms for a more flexible, smoothed penalty.

2. Related Work

As explicated earlier in this manuscript, the domain of 3D human pose estimation (3D HPE) can be categorized into single-sensor and multi-sensor paradigms. This study predominantly focuses on multi-sensor 3D HPE, further subcategorized into multiview, IMUs, and multiview IMUs.

2.1. Multiview Methods

Multi-view methods involve capturing images from different perspectives and analyzing 3D geometric relationships to determine the keypoints of human posture. Compared to single-view methods, their advantages lie in more accurate localization and reduced occlusion effects. Previous studies [20,21,22,23] have addressed keypoint-related issues using image structural frameworks, optimizing model parameters, and adjusting 2D postures to camera projection postures. Building upon this, [24] proposed a method that integrates 3D posture and multiview information into a unified image structural framework. The 3D image structural framework [3,25,26,27,28,29,30,31] primarily serves as an optimization method for capturing spatial relationships among different postural keypoints. Initially, [28,29,30,31] introduced the multiview consistency constraint method to minimize the impact of occlusion on structural reconstruction and posture estimation. This method utilizes information observed in multiple views of the same posture or scene to ensure consistent estimation results across different views. Subsequently, [3,25] aggregated 2D keypoint heatmaps from multiple views into a single 3D image structural framework using calibrated camera parameters and multiview consistency methods. However, when the positions of multiview cameras change, 3D posture reconstruction becomes necessary. Zhang et al. [15,32] utilized spatial geometric stereo-matching methods for the paired reconstruction of 3D postures and extended this method to more complex multiview camera environments. Similar to the aforementioned methods, our approach also utilizes an image structural model for 3D human pose estimation. However, multiview methods often suffer from inaccurate depth estimation. We introduce various types of sensors and combine them with IMUs to determine limb orientations, thereby improving the accuracy of depth estimation.

2.2. Multi-IMU Methods

Compared to multiview methods, multi-IMU methods do not suffer from decreased estimation accuracy due to occlusion and are not affected by the recording environment and space. IMU-based human pose estimation (HPE) primarily involves attaching IMUs to keypoints on the human body to capture acceleration and orientation. Feature extraction is performed using recurrent neural networks (RNNs), and pose estimation is achieved through deep learning models or physical models. Commercial products such as Xsens MVN [33] utilize 17 wearable IMUs and employ Kalman filtering to fuse all IMU information for the real-time global estimation of all keypoints. However, the use of a large number of IMUs affects the subjects' movements, the placement of the IMUs is prone to errors, and the setup time is lengthy. The Sparse Inertial Poser (SIP) [34] was the first model to use six IMUs together with the skinned multi-person linear (SMPL) [12,13] model for offline full-body pose estimation, a line of work also known as sparse-IMU human pose estimation [12,13,14,35]. SIP relies on iterative optimization and requires offline processing, making real-time applications impractical. Deep Inertial Poser (DIP) [12] is a deep learning-based framework for the real-time reconstruction of human body poses from sparse IMU data; it employs RNNs to extract the orientation and acceleration information of the keypoints and maps them to the SMPL model to achieve pose estimation. Since DIP does not estimate the root node, it cannot locate the position of the human body in 3D space. TransPose [13] introduces a method based on six inertial sensors for the real-time estimation of body posture and global translation: posture is estimated with a multi-stage network that predicts the full-body posture through subnodes, while translation is obtained by combining a supporting-foot heuristic with RNNs. However, these methods [13,14] ignore the delay caused by the need for future sequences. The Physical Inertial Poser (PIP) [35] therefore combines physical optimization with sparse-IMU HPE, eliminating the need for future information. On the other hand, considering the ease of daily detection and portability, IMUPoser [14] proposes the use of the IMUs in mobile phones, watches, and earbuds; long short-term memory (LSTM) networks are employed to extract features and map them to SMPL for full-body pose estimation. One drawback of solely using IMUs is the potential accumulation of drift errors over time. Our approach also incorporates sparse IMUs worn on the body, but our model integrates spatial information from images to compensate for IMU drift errors, resulting in the more accurate identification of human poses.

2.3. Multiview IMU Methods

Compared to the multi-image and multi-IMU methods, hybrid approaches integrate the complementary strengths of cameras and IMU sensors: IMUs can address issues related to occlusion and rotation, while multiple views can reduce global position drift. Three-dimensional human pose estimation using multiple views and multiple IMUs primarily involves acquiring viewpoint and IMU information to determine the global positions of the 3D human pose keypoints. It relies mainly on image recognition, with heatmaps serving as intermediate variables, while the direction and acceleration provided by the IMUs optimize the 3D human skeletal model. The methods combining multiple views and multiple IMUs can be further categorized into kinematic solving [16,36], deep learning [16,37,38], and prior posture [15,17,39,40]. Early work on combining images and IMUs was proposed by Pons-Moll et al. [39], who introduced a hybrid tracker that integrates visual and IMU fusion by obtaining the spatial positions of keypoints from images and limb orientations from IMUs. Trumble et al. [38] achieved the real-time capture of 3D poses using LSTM in a deep learning framework, albeit requiring estimation in relatively enclosed environments; they also introduced the TotalCapture dataset, providing annotated data for the fusion of multiview images and IMUs that serves as a standard benchmark for human pose estimation. Malleson et al. [17] achieved real-time indoor and outdoor pose capture using a prior posture model capable of recovering six-degrees-of-freedom motion, and building upon this, the authors of [17,40] achieved multi-person pose detection. Gilbert et al. [16] employed deep learning networks combined with LSTM and forward kinematics for pose recognition while also supplementing the TotalCapture dataset with outdoor data. Von Marcard et al. [18] investigated methods for accurately recovering 3D human poses using IMUs and a moving camera in outdoor environments. Huang et al. [37] introduced an end-to-end 3D deep neural network that does not require a prior skeletal model, merging the two signals using multiple channels in the dataset. Additionally, [36] utilized forward and inverse kinematics to fuse IMU and image information and proposed a parameterized human pose model. Compared to [36,37], GeoFuse [15] employed a geometric model characterized by a simple structure and high accuracy.
The current state-of-the-art baseline, GeoFuse [15], involves the geometric fusion of wearable IMUs and multiview images. GeoFuse innovatively addresses the inherent challenges in 3D HPE by merging IMU measurements and image features using geometric methods. This baseline adopts a sequential approach, utilizing the SimpleNet (SN) model for initial heatmap estimation, enhancing these heatmaps with the orientation regularized network (ORN) model, and finally solving for the 3D human pose using the orientation regularized pictorial structure model (ORPSM). This paper aims to address the shortcomings of the GeoFuse method by enhancing keypoint localization precision through the use of Laplacian kernels instead of Gaussian kernels. Additionally, in the ORN model, keypoint confidence is strengthened by taking a weighted average of the top-scoring values, improving robustness to outliers. ORPSM is refined by introducing soft-coded constraints on limb lengths, resulting in smoother and more accurate model predictions.

3. Methods

The purpose of 3D HPE is to acquire the precise translational and rotational parameters of human joints. Overall, the proposed 3D HPE pipeline consists of three stages: CNN-based 2D keypoint estimation (Section 3.1), images + IMU-based keypoint localization enhancement (Section 3.2), and 3D pose resolving (Section 3.3).
As shown in Figure 1, the overall pipeline of the proposed method consists of three parts: 2D keypoint estimators, 2D keypoint localization enhancement, and 3D pose resolving. The 2D keypoint estimators process images using ResNet to identify keypoint positions and determine both the confidence and specific locations of the keypoints using the Laplacian distribution function. The 2D keypoint localization enhancement averages the keypoints from four perspectives and then combines the body orientation provided by IMUs to output the human pose skeleton through ORN. Finally, the human keypoints are computed into 3D coordinates through ORPSM.

3.1. CNN-Based 2D Keypoint Estimation

We adopted the SOTA SimpleNet [41] as our baseline. Following this methodology, we utilize ResNet [19] as the backbone and integrate it into a multi-stage convolutional pose machine (CPM) architecture. By leveraging ResNet for image feature extraction and performing iterative deconvolution and upsampling within the CPM framework, the accuracy of human pose estimation is progressively enhanced. Ultimately, this process generates probability maps describing the keypoints, supporting accurate and efficient 2D human pose estimation. In comparison to GeoFuse [15], which uses ResNet50 as the backbone for 2D HPE, we opt for the deeper ResNet152. Deeper layers can extract richer feature information, aiding in capturing complex pose and motion characteristics and thereby enhancing the model's expressiveness and accuracy.
In contrast to the SimpleNet (SN) [41] 2D HPE used in the GeoFuse model, we utilize the Laplace distribution function for keypoint confidence prediction instead of the Gaussian distribution function. Heatmaps are commonly employed to represent the spatial distribution of keypoints: for each keypoint, a kernel function is centered at its location, and these kernels are superimposed to form a 2D heatmap in which the value of each pixel indicates the presence of a keypoint and its spatial distribution. Conventionally, a Gaussian kernel is used; we replace it with a Laplacian kernel. Using Laplacian heatmaps instead of Gaussian ones helps emphasize the edges and details around keypoints, enabling better discrimination between keypoints and background, and improves the accuracy and robustness of keypoint estimation. The probability density function of the Laplace distribution is as follows:
f(x \mid \mu, b) = \frac{1}{2b} \exp\!\left( -\frac{|x - \mu|}{b} \right)
where x is the random variable, μ is the mean (location parameter) of the distribution, and b is the scale parameter that controls the width of the distribution. The term |x − μ| represents the absolute difference between x and μ. This PDF describes the probability density of the random variable x taking a specific value, given the mean μ and scale parameter b.
The shape of the probability density function (PDF) resembles a combination of two exponential distributions; hence, it is also known as the double exponential distribution. The Laplace distribution has several characteristics: symmetry, peakedness, and long tails. As shown in Figure 2, these characteristics are evident in the shape of the distribution. Symmetry means that the PDF is symmetric around the mean, μ . Peakedness refers to the existence of a peak in the PDF at the mean, μ . Long tails indicate that the PDF decreases exponentially on both sides of the mean. Compared to the Gaussian distribution, the Laplace distribution has the following advantages: (1) Robustness: The Laplace distribution is less affected by outliers compared to the Gaussian distribution. Due to its longer tails, it is more tolerant of extreme values and maintains a good fit to the overall data. (2) Sparsity: The Laplace distribution tends to generate sparse results compared to the Gaussian distribution. This is because the PDF of the Laplace distribution has a peak at the mean, making it more likely to allocate samples to a few specific points, which is advantageous for modeling highly sparse data. (3) Simplified marginal distribution: In some cases, it may be easier to compute and handle the marginal distribution of the sum or difference of multiple independent random variables. Due to the tail shape of the Laplace distribution, when considering the difference between two independent Gaussian random variables, the marginal distribution of the Laplace distribution can be more simply represented. (4) Simplified parameter estimation: Parameter estimation for the Laplace distribution is simpler compared to the Gaussian distribution. Estimating the parameters of the Gaussian distribution typically requires calculating the mean and covariance matrix, while the Laplace distribution only requires estimating the mean and scale parameters.
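To make the replacement concrete, the following is a minimal NumPy sketch of how a per-keypoint target heatmap could be generated with a Laplacian kernel in place of the usual Gaussian one. The heatmap size, the spread parameter b, and the separable (L1-distance) form of the 2D Laplacian kernel are illustrative assumptions rather than the exact settings used in the experiments.

```python
import numpy as np

def target_heatmap(size, center, b=2.0, kind="laplacian"):
    """Generate a 2D target heatmap for a single keypoint.

    size:   (H, W) of the heatmap.
    center: (cx, cy) keypoint position in heatmap coordinates.
    b:      spread parameter (sigma for the Gaussian, scale b for the Laplacian).
    """
    h, w = size
    ys, xs = np.mgrid[0:h, 0:w]
    if kind == "gaussian":
        d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
        return np.exp(-d2 / (2.0 * b ** 2))
    # Laplacian kernel: sharper peak at the keypoint and heavier tails
    d1 = np.abs(xs - center[0]) + np.abs(ys - center[1])
    return np.exp(-d1 / b)

# Example: 64x64 heatmaps with the keypoint at (20, 30)
hm_laplace = target_heatmap((64, 64), (20, 30), kind="laplacian")
hm_gauss = target_heatmap((64, 64), (20, 30), kind="gaussian")
```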

3.2. Images + IMU-Based Keypoint Localization Enhancement

First, let us describe how the 3D limb direction provided by an IMU is used, within a single camera view, to mutually enhance the two joints of the joint pair connected by that IMU. We then extend this approach to the same joint observed in multiple views to enhance the other joint of the pair.
As shown in Figure 3, the cross-joint fusion idea of the ORN+ model involves estimating 2D points and enhancing confidence based on multiple views and IMU data.

3.2.1. Single-View Orientation Regularized Network for 2D Human Pose Estimation

In same-view fusion, we use a single joint pair to illustrate the ORN intra-view fusion method. The joint pair consists of J_1 and J_2, with length L and direction O, the latter determined by the joint pair's IMU. The heatmaps of J_1 and J_2 are denoted H_1 and H_2, respectively, and H_1(Y_P) represents the confidence of J_1 at the pixel position Y_P. Since the depth of J_1 at Y_P is not determined, we sample K points, P_1, ..., P_K, along the line in the depth direction from the camera center C_1 through Y_P using logarithmic mean sampling. We set the limb length L to the average limb length predicted from the training dataset and determine the points Q_1, ..., Q_K based on the direction O of the joint pair. The candidate 3D positions of J_2 can be calculated using the following equation:
Q_k = P_k + O \cdot L, \quad k \in [1, K]
We project the positions Q_k onto the image, and the higher the confidence in H_2, the higher the corresponding confidence in H_1. We find the maximum response among all positions using the following formula:
H_1(Y_P) = H_1(Y_P) + \max_{k \in [1, K]} H_2(Y_{Q_k})
In the heatmap fusion, ideally, there should be a maximum response at the correct J_2 position and zero response at other positions, meaning that non-corresponding positions do not contribute, or contribute minimally, to the fusion. In the experiments, we sampled 200 points with depths ranging from 0 to the maximum depth value. Same-view fusion has limitations since the depth information is unknown: some of the maximum confidence responses may correspond to incorrect J_2 positions, which can then be wrongly enhanced.
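The single-view fusion step can be sketched as follows for one pixel of H_1. The sketch assumes a single camera expressed in its own co-ordinate frame (so the camera center sits at the origin and extrinsics are omitted), a unit IMU limb direction O, the average limb length L, and a caller-supplied list of sampled depths; the nearest-pixel lookup stands in for any sub-pixel interpolation, so this is illustrative rather than the exact implementation.

```python
import numpy as np

def fuse_same_view(H1, H2, K_mat, O, L, y_p, depths):
    """Enhance H1 at pixel y_p using the linked joint's heatmap H2 in the same view.

    H1, H2 : (H, W) heatmaps of joints J1 and J2.
    K_mat  : 3x3 camera intrinsic matrix (camera frame assumed, extrinsics omitted).
    O      : unit 3D limb direction from the IMU, expressed in the camera frame.
    L      : average limb length, in the same metric units as the depths.
    y_p    : (u, v) pixel position of J1 in H1.
    depths : K sampled depths along the ray through y_p.
    """
    ray = np.linalg.inv(K_mat) @ np.array([y_p[0], y_p[1], 1.0])
    best = 0.0
    for d in depths:
        P_k = d * ray                      # candidate 3D position of J1 on the ray
        Q_k = P_k + O * L                  # candidate 3D position of the linked joint J2
        q = K_mat @ Q_k                    # project Q_k back into the image
        if q[2] <= 0:
            continue                       # candidate falls behind the camera
        u, v = int(round(q[0] / q[2])), int(round(q[1] / q[2]))
        if 0 <= v < H2.shape[0] and 0 <= u < H2.shape[1]:
            best = max(best, H2[v, u])     # keep the maximum response over all samples
    return H1[y_p[1], y_p[0]] + best       # enhanced confidence of J1 at y_p
```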

3.2.2. Multi-View Orientation Regularized Network for 2D Human Pose Estimation

The formula for multiview fusion is as follows:
H_1(Y_P) = H_1(Y_P) + \frac{1}{V} \sum_{v=1}^{V} \max_{k \in [1, K]} H_2^{v}\!\left(Y_{Q_k^{v}}\right)
Here, Y_{Q_k^v} denotes the projection of Q_k in view v, and H_2^v is the heatmap of J_2 in view v. H_1(Y_P) is enhanced by taking the average of max_k H_2^v(Y_{Q_k^v}) over the V views. The effect is as follows (see Figure 4):
As shown in Figure 4, the model consists of three sample models and uses a multiview approach to accurately locate a joint in the current image. In each heatmap, a ray is formed from the camera center through the 2D position, and multiple rays intersect to determine the correct position of the joint. The ORN model not only greatly enhances the 2D localization but also alleviates the occlusion problem: a joint may be occluded in its own view yet visible in other views, allowing the model to recognize the occluded joint in the current view.

3.2.3. Multi-View Averaging Enhanced Orientation Regularized Network for 2D Human Pose Estimation

In order to make 2D human pose estimation more accurate, we employed a weighted averaging algorithm, which calculates the weighted average of the top two values in Equation (4). This helps achieve smoother and more precise node estimation. The ORN+ formula is as follows:
H_1(Y_P) = H_1(Y_P) + \frac{1}{V} \sum_{v=1}^{V} \left[ w \, H_2^{v}\!\left(Y_{Q_{k_1}^{v}}\right) + (1 - w) \, H_2^{v}\!\left(Y_{Q_{k_2}^{v}}\right) \right]
where Y_{Q_{k_1}^v} and Y_{Q_{k_2}^v} are the positions that yield the two largest responses among the K sampled candidates in view v, and w is the weighting factor.
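The per-pixel enhancement term of ORN+ can be sketched as follows. For each view, the function takes the K responses sampled as in Section 3.2.1 (the variable names are illustrative), selects the two largest, and returns their weighted average over the views.

```python
import numpy as np

def orn_plus_term(responses_per_view, w=0.7):
    """Weighted top-two enhancement term used by ORN+.

    responses_per_view: list of length V; entry v holds the K heatmap responses
    H2^v(Y_{Q_k^v}) sampled in view v for one pixel Y_P of H1.
    w: weight on the largest response (0.7 gave the best accuracy in our sweep).
    """
    per_view = []
    for r in responses_per_view:
        r = np.sort(np.asarray(r, dtype=float))   # ascending order
        top1 = r[-1]
        top2 = r[-2] if r.size > 1 else r[-1]
        per_view.append(w * top1 + (1.0 - w) * top2)
    return float(np.mean(per_view))               # averaged over the V views

# Usage: H1[v_p, u_p] += orn_plus_term(sampled_responses, w=0.7)
```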

3.3. Three-Dimensional Pose Resolving

We primarily utilize the triangulation method to predict the root node. The 2D human pose nodes outputted by the ORN+ model are used as input for predicting the 3D nodes. We apply the ORPSM+ posterior model to constrain the nodes, resulting in more accurate 3D node predictions.

3.3.1. Triangulation

In this paper, we primarily employ triangulation to estimate the 3D co-ordinates of the root keypoint and utilize the orientation regularized pictorial structure model (ORPSM+) to refine the prediction of all keypoints. A cubic grid, G, with an edge length of 2000 mm and divided into N small cubes, is placed around the 3D co-ordinates of the root keypoint (pelvis). This grid typically covers the entire body. Within the space formed by the regular grid G, each keypoint is assigned one of the small cubes as its 3D position, and all human keypoints share this discrete state space.
Triangulation is a commonly used technique in 3D human pose estimation: 3D points are predicted by obtaining the 2D keypoint co-ordinates from multiple viewpoints and exploiting their geometric relationship. Assuming known intrinsic parameters K and extrinsic parameters R_i, t_i for each camera, as well as the pixel co-ordinates p_i of the keypoint in each view, the 3D co-ordinates P_i of keypoint p_i in the camera co-ordinate system can be obtained as follows:
P_i = (X_i, Y_i, Z_i)^{T} = \lambda_i K^{-1} p_i
where λ_i is the normalized depth of keypoint p_i with respect to the camera, and K^{-1} is the inverse of the camera intrinsic matrix.
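The relation above is the per-camera back-projection. As an illustrative way of combining the views into a single root estimate, the standard direct linear transform (DLT) below triangulates one keypoint from the calibrated cameras; it is a generic least-squares solver, shown here as a sketch rather than the exact routine used in the experiments.

```python
import numpy as np

def triangulate_point(proj_mats, pixels):
    """Linear (DLT) triangulation of one keypoint from V calibrated views.

    proj_mats: list of 3x4 projection matrices P_v = K [R_v | t_v].
    pixels:    list of (u, v) pixel coordinates of the keypoint in each view.
    Returns the least-squares 3D point in world coordinates.
    """
    rows = []
    for P, (u, v) in zip(proj_mats, pixels):
        rows.append(u * P[2] - P[0])   # each view contributes two linear equations
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                         # right singular vector of the smallest singular value
    return X[:3] / X[3]                # de-homogenize
```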

3.3.2. Orientation Regularized Pictorial Structure Model with Soft Limb Constraints

We propose a posterior model, ORPSM+, with image-IMU inputs to further improve the precision of the 3D keypoints. A person's body consists of M keypoints, J = {J_1, ..., J_M}, where J_i represents the 3D co-ordinates of the i-th keypoint in the world coordinate system.
Each keypoint, J_i, takes values from a discrete space, but there are constraints between keypoints, such as limb lengths and limb orientations. The average limb length determined from the 2D human pose keypoints and the limb orientations determined by the IMUs are used to constrain the estimation of J through the orientation regularized pictorial structure model (ORPSM+). The formula for the ORPSM+ posterior model is as follows:
P(J \mid F) = \frac{1}{Z(F)} \prod_{i=1}^{M} \phi_i^{\mathrm{conf}}(J_i, F) \prod_{(m,n) \in \beta} \psi_{\mathrm{limb}}(J_m, J_n) \prod_{(m,n) \in \delta} \psi_{\mathrm{IMU}}(J_m, J_n)
where F is the multiview 2D pose heatmap, Z(F) is the normalization function, and β and δ are the sets of joint pairs subject to the limb-length and limb-direction constraints, respectively. Below, we describe these functions in more detail.
We use the camera parameters to project the cube grid G containing the body into the pixel co-ordinates of all the cameras for each candidate configuration J, obtaining the corresponding joint consistency, or response, from F.
This joint consistency across the camera views is given by the product of the unary terms φ_i^conf(J_i, F) over the M keypoints. β is the set of (J_m, J_n) limb joint pairs, and the constraint function ψ_limb(J_m, J_n) is defined as follows:
\psi_{\mathrm{limb}}(J_m, J_n) = \begin{cases} w \cos(kx) + (1 - w), & |l_{m,n} - \tilde{l}_{m,n}| \le \varepsilon \\ 0, & |l_{m,n} - \tilde{l}_{m,n}| > \varepsilon \end{cases}
x = |l_{m,n} - \tilde{l}_{m,n}|
k \varepsilon = \frac{\pi}{2}
In the above equations, l_{m,n} is the limb length of the estimated pose and \tilde{l}_{m,n} is the corresponding mean limb length. The weight w, which adjusts the base probability, represents the expected base value of the limb-length error probability with a small bias; the optimal w can be chosen by observing the average 3D MPJPE. The parameter k controls the oscillation frequency; here, we set k = 0.00785 and ε = 200 mm, since the error in the limb length is generally less than 200 mm.
The limb direction constraint function is the dot product between the limb direction and the IMU direction of the estimated pose, and this is formulated as follows:
\psi_{\mathrm{IMU}}(J_m, J_n) = \frac{J_m - J_n}{\| J_m - J_n \|_2} \cdot O_{m,n} = \cos\theta
Here, J_m − J_n is the estimated limb direction vector, ||J_m − J_n||_2 is its Euclidean norm, and O_{m,n} is the IMU orientation between the joint pair. We refer to this 3D pose estimator with limb length and limb orientation constraints as ORPSM+. To optimize the posterior probability, specifically Equation (7), we maximize it by running a dynamic programming algorithm over the discrete pose space. We adopt a variant of the recursive RPSM [11], which iteratively optimizes the 3D pose.
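For reference, the two pairwise terms can be written compactly as follows. The default values w = 0.7, k = 0.00785, and ε = 200 mm follow the settings above, while the joint ordering used for the limb direction (J_m − J_n) must match the IMU convention and is an assumption of this sketch.

```python
import numpy as np

def psi_limb(J_m, J_n, mean_length, w=0.7, k=0.00785, eps=200.0):
    """Soft limb-length term: a raised-cosine penalty within eps mm, zero outside."""
    x = abs(np.linalg.norm(np.asarray(J_m, float) - np.asarray(J_n, float)) - mean_length)
    if x <= eps:
        return w * np.cos(k * x) + (1.0 - w)
    return 0.0

def psi_imu(J_m, J_n, O_mn):
    """Limb-orientation term: cosine between the estimated limb direction
    (J_m - J_n, normalized) and the IMU-measured unit direction O_mn."""
    d = np.asarray(J_m, dtype=float) - np.asarray(J_n, dtype=float)
    return float(np.dot(d / np.linalg.norm(d), np.asarray(O_mn, dtype=float)))
```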

4. The Experiment

4.1. Implementation Details

4.1.1. Dataset

The TotalCapture dataset [42] is one of the most popular datasets, containing ground truth 3D human pose data from IMUs and images. This dataset utilizes eight cameras and 13 IMUs to capture human pose information. It includes individuals of different genders performing four different actions: range of motion (ROM), walking, acting, and freestyle. Each action is repeated three times. In our experiments, for efficiency, we utilized four cameras (cameras 1, 3, 5, and 7) and eight IMUs (placed at the center positions of the limbs), as shown in Figure 1.
The TotalCapture dataset was divided into training and testing sets [38]. The training set includes subjects 1, 2, and 3, encompassing the ROM 1, 2, and 3, Walk 1 and 3, and Freestyle 1 and 2 sequences. The testing set comprises Walk 2, Freestyle 3, and Acting 3 from all subjects. SimpleNet was utilized for training on the training set, while model performance was assessed on the testing set. The orientation regularized network (ORN) and the orientation regularized pictorial structure model (ORPSM) were employed to evaluate model performance.
This paper validates the generalizability of the model on the Human3.6M dataset [43], which includes 3.6 million 3D human poses and corresponding images from 11 professional actors (six males and five females) in seven scenes: discussion, smoking, taking photos, talking on the phone, daily activities, leisure activities, and social interaction. The dataset was captured using four calibrated high-resolution 50 Hz cameras. In 2D pose estimation, this paper uses subjects 1, 5, 6, 7, and 8 for training and subjects 9 and 11 for testing. Since this dataset does not provide IMU data, we created virtual IMUs (limb orientations) using the ground truth 3D poses for both training and testing, and we only show proof-of-concept results.

4.1.2. Training Method

In the training of 2D pose estimation, the adjustable hyperparameters include the learning rate, batch size, regularization parameter, neural network architecture parameters, optimizer, number of iterations, rotation factor, and scaling factor. The learning rate was set to 0.001 and decayed by a factor of 10 at epoch 25 and again at epoch 30. The study used an RTX 4090 GPU with 24 GB VRAM and a 2 TB SSD. During the training of ResNet152, the batch size was 16, with speeds of 210 samples/s for SN + Lp, 59 samples/s for ORN+, and 11 samples/s for ORPSM+. The regularization parameter controls the strength of the regularization penalty; we used L2 regularization, and an appropriate setting improves the model's generalization ability and prevents overfitting. We used ResNet152 with the Adam optimizer to adjust the model's weights and minimize the loss function. The rotation factor and scaling factor were set to 45° and 0.5, respectively.
In order to optimize the learning rate, we searched the range from 0.1 to 0.0001. First, the initial minimum learning rate was set to 0.0001 and the maximum learning rate to 0.1. The model was trained at the intermediate value α = (α_min + α_max)/2, and the accuracy on the test set was recorded. The search range was then adjusted according to the test-set performance: if the accuracy improved, the midpoint became the new minimum learning rate α_min; if it did not, the midpoint became the new maximum learning rate α_max. This process was repeated for 10 iterations to gradually close the gap between α_min and α_max. Finally, the learning rate with the highest test-set accuracy was selected as the final value, namely 0.0001. This paper also adopts an adaptive learning-rate adjustment strategy in which, at the 20th and 25th epochs, the learning rate is reduced by a factor of 10 when the test-set accuracy stabilizes or no longer improves, aiming to prevent overfitting.
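One plausible reading of this interval-halving search is sketched below; train_and_eval is a hypothetical callback standing in for a full training run that returns the test-set accuracy.

```python
def search_learning_rate(train_and_eval, lr_min=1e-4, lr_max=1e-1, iters=10):
    """Interval-halving search over the learning rate, as described above.

    train_and_eval(lr) -> test-set accuracy after training with learning rate lr.
    """
    best_lr, best_acc = lr_min, train_and_eval(lr_min)
    for _ in range(iters):
        lr_mid = 0.5 * (lr_min + lr_max)       # alpha = (alpha_min + alpha_max) / 2
        acc = train_and_eval(lr_mid)
        if acc > best_acc:
            best_lr, best_acc = lr_mid, acc
            lr_min = lr_mid                    # accuracy improved: raise alpha_min
        else:
            lr_max = lr_mid                    # no improvement: lower alpha_max
    return best_lr
```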
In order to enhance the diversity of the training data, random image rotations were employed, allowing the model to better adapt to various object poses at different angles. A rotation range of 45 degrees covers a wide-angle variation, thereby improving the model’s robustness in handling object rotations. This setting also prevents issues such as the misidentification of limbs (e.g., left identified as right) when exceeding a 45-degree rotation. Additionally, a scaling factor of 0.5 was applied during training, randomly resizing images. This helps the model learn and adapt to objects of different scales, ensuring better performance when dealing with objects of varying sizes.

4.1.3. Evaluation Criteria

In 2D human pose estimation, we used the head-normalized percentage of correct keypoints (PCKh) as the evaluation criterion. PCKh@t measures the percentage of estimated keypoints whose distance to the ground truth keypoints is less than t times the head length; here, the head length was set to 300 mm. In the experiments, we report results for t = one-half, one-sixth, and one-twelfth, corresponding to thresholds of 150 mm, 50 mm, and 25 mm, respectively. For 3D human pose estimation, the mean per joint position error (MPJPE) is a classic metric that calculates the average distance error, in millimeters, between the predicted keypoints and the ground truth keypoints. By reporting this average error over all keypoints, we can evaluate the accuracy of our ORPSM+ in 3D pose estimation.
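The two metrics can be computed as follows. The 300 mm head length and the t values follow the text above, while the choice of units for the 2D distances (pixels versus millimeters) is left to the caller and is illustrative here.

```python
import numpy as np

def pckh(pred, gt, t, head_length=300.0):
    """PCKh@t: fraction of predicted 2D keypoints whose distance to the ground
    truth is below t * head_length. pred, gt: (N, J, 2) arrays."""
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(dist < t * head_length))

def mpjpe(pred, gt):
    """Mean per joint position error in millimetres. pred, gt: (N, J, 3) arrays."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

# Thresholds used in the experiments:
# pckh(p2d, g2d, t=1/2), pckh(p2d, g2d, t=1/6), pckh(p2d, g2d, t=1/12)
```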

4.2. Two-Dimensional Human Pose Estimation Experiment

Table 1 mainly presents the variations of SimpleNet (SN), including SN50 with a ResNet50 backbone, SN152 with a ResNet152 backbone, and SN152 + Lp, which adds the Laplace distribution function. Compared to the SN50 model, the SN152 model has a slightly lower Mean (All) value at a threshold of one-half, with a decrease of 0.1% (99.0% for SN152 vs. 99.1% for SN50). At a threshold of one-sixth, both SN152 and SN50 have the same Mean (All) value of 90.7%. At a threshold of one-twelfth, SN152 outperforms SN50, with a 0.4% higher Mean (All) value. By comparing the results at different thresholds, we can observe that the deeper network architecture of SN152 does not significantly affect the average joint recognition accuracy at thresholds of one-half and one-sixth. However, at a threshold of one-twelfth, SN152 achieves a higher average joint recognition accuracy than SN50. Therefore, we chose the SN152 model for 2D human pose estimation on all images. When comparing SN152 + Lp with SN152, the former achieves higher average joint accuracy, with improvements of 0.2%, 1.3%, and 2.0% at thresholds of one-half, one-sixth, and one-twelfth, respectively. This indicates that incorporating the Laplace distribution helps improve overall accuracy, especially at smaller thresholds. Therefore, we conclude that the SN152 model provides favorable performance for 2D human pose estimation, and further enhancements can be achieved by leveraging the Laplace distribution, particularly at lower thresholds.
Figure 5 is a three-panel plot of the average joint accuracy at thresholds of one-half, one-sixth, and one-twelfth for the 2D human pose estimation ORN model using different values of w. According to the figure, when the threshold is one-half or one-sixth, the curve is relatively flat; when the threshold is one-twelfth, values of w between 0.4 and 0.7 give the common maximum. Across the three thresholds, both 0.6 and 0.7 give the largest accuracy. We use 0.7 as the optimal value, at which the accuracy is 99.5%, 93.8%, and 77.4% at thresholds of one-half, one-sixth, and one-twelfth, respectively.
Table 2 reports the average accuracy (PCKh@) over all joints for ORN and ORN+ at thresholds of one-half, one-sixth, and one-twelfth. Averaged across all joints, the two models achieve the same accuracy at one-half and one-sixth, while at one-twelfth ORN+ reaches 77.4%, outperforming ORN. It follows that using ORN+ instead of ORN makes the fusion smoother while improving the accuracy at small thresholds.
In Table 3, SN152 + Lp refers to the 2D human pose model that uses ResNet152 and the Laplacian distribution function, ORN-same is the 2D human pose model that uses the IMUs to enhance the keypoints within the same view, and ORN+ is the 2D human pose model that uses the IMUs to enhance the keypoints across multiple views. ORN+ has a higher accuracy (PCKh@) than SN152 + Lp and ORN-same under different thresholds. At a threshold of one-half, the Mean (All) accuracy of ORN+ is 0.3% and 0% higher than that of SN152 + Lp and ORN-same, respectively. At a threshold of one-sixth, the Mean (All) accuracy of ORN+ is 1.8% and 0.5% higher than that of SN152 + Lp and ORN-same, respectively. At a threshold of one-twelfth, the Mean (All) accuracy of ORN+ is 4.8% and 1.8% higher than that of SN152 + Lp and ORN-same, respectively. Comparing the accuracy at different thresholds, ORN+ performs the best, and the accuracy improves more at smaller thresholds. At a threshold of one-half, the Elbow keypoint has the highest accuracy under ORN-same, which is due to errors in the IMUs; the weighted averaging function mainly maintains the accuracy at a threshold of one-half and improves the sensitivity of the model at the smaller thresholds of one-sixth and one-twelfth.
As shown in Table 4, the models trained on the Human3.6M dataset were evaluated using the SimpleNet architecture with ResNet152 (SN152) and ResNet152 with the Laplace distribution (SN152 + Lp) at different thresholds (one-twelfth, one-sixth, and one-half). At thresholds of one-twelfth and one-sixth, SN152 + Lp achieved average accuracies of 72.7% and 92.2%, respectively, outperforming SN152. At the threshold of one-half, the accuracy of SN152 + Lp was slightly lower, by 0.1%, than that of SN152. This suggests that under looser thresholds SN152 + Lp performs slightly worse than SN152, but the difference is minimal. Overall, SN152 + Lp shows the best performance, indicating that introducing the Laplace distribution positively impacts the model's performance on the Human3.6M dataset, particularly under stricter thresholds (one-twelfth and one-sixth).
Table 5 presents the evaluation results of the fusion models trained on the Human3.6M dataset. The SN152 + Lp model uses ResNet152 as the backbone network and incorporates the Laplace distribution function. The ORN-same model applies the orientation regularized network under a single-view setup, while the ORN+ model is evaluated under a multiview setup. Compared to SN152 + Lp and ORN-same, the ORN+ model achieved the highest average keypoint accuracy across the different thresholds, with values of 97.2%, 94.9%, and 83.1%, respectively. Overall, the ORN+ model demonstrates superior performance in human pose estimation tasks, particularly in multiview scenarios. This has significant implications for improving the performance and robustness of pose estimation systems, especially when handling multiview data.
When combining the results from Table 3, Table 4 and Table 5, it is evident that the SN152 + Lp and ORN+ models achieve the highest average keypoint accuracy at thresholds of one-half, one-sixth, and one-twelfth, thereby validating the generalization performance of the proposed model.
Figure 6 shows the output of 2D human pose visualization using our SN152 + Lp model and the ORN+ model across four cameras.

4.3. Three-Dimensional Human Pose Estimation Experiment

We first evaluated our 3D pose estimator via extensive ablation experiments, and we compared our method to state-of-the-art methods. As shown in Figure 7, the MPJPE values for the average 3D joints in the TotalCapture dataset were sampled from 0 to 1 in intervals of 0.1, indicating that the output MPJPE is minimized when w is 0.7, 0.8, and 0.9.
Figure 7 mainly uses the ORPSM+ model, varying the w parameter of the limb-length constraint function from 0 to 1 in steps of 0.1 (11 values). By observing the output MPJPE for different w values, we can determine the optimal value of the base expectation w. For w = 0.7, 0.8, and 0.9, the MPJPE reaches its minimum value of 20.7 mm, and we finally use 0.7 as the base expectation. Relative to w = 0, which corresponds to the original ORPSM, setting w = 0.7 reduces the MPJPE by 0.7 mm.
According to Table 6, SN152 refers to the ResNet152 network used under the SimpleNet (SN) baseline; SN152 + ORN is the fused 2D pose model that uses the SN152 baseline and combines multiple views and IMUs using ORN; SN152 + Lp + ORN refers to the model that uses the SN152 network, the Laplacian probability distribution function, and the ORN model; and SN152 + Lp + ORN+ refers to the model that uses the SN152 network, the Laplacian probability distribution function, and the weighted averaging function (ORN+). ORPSM refers to the ORPSM+ model with the base expectation w = 0 for 3D human pose estimation with limb-length and limb-direction constraints. The SN152 + Lp + ORN+ model performs the best among the 2D models; combined with ORPSM+, it yields an MPJPE of 20.7 mm, reducing the Mean (All) error by 0.7 mm compared with using ORPSM and reducing the error of the Others keypoints by 0.3 mm. When using the ORPSM model for 3D estimation, the Others keypoints have the smallest 3D error under SN152, mainly because Others includes the root, belly, neck, and nose keypoints, which already have high 2D accuracy with the SN152 model. Using the multiview and IMU-based ORN model introduces errors and reduces accuracy for these keypoints, especially in 3D. However, we use ORN mainly to improve the accuracy of the limb joints, such as the Hip, Knee, Ankle, Shoulder, Elbow, and Wrist, which have lower accuracy as predicted by the SN152 model. Moreover, using ORN and ORN+ does not significantly reduce the accuracy of the Others keypoints. Overall, with the 3D model kept unchanged, the Mean (All) error and the 3D keypoint MPJPE are reduced when using ORN and ORN+.
According to Table 7, the overall average error (Mean) without alignment post-processing is 20.7 mm, ranking second on the TotalCapture dataset. Compared to the baseline [15], the model used in this paper reduces MPJPE by 3.9 mm. At the same time, our results for subjects 4 and 5 on walking, Acting 3 (A3), and Freestyle 3 (FS3) rank first on the TotalCapture dataset. Our models, including the non-fused SN152, the fused ORN+ in 2D, and the ORPSM+ model in 3D, are effective. In our pipeline, alignment is used as a post-processing step in which the predicted keypoints are aligned with the ground truth keypoints through rotation and translation.

5. Conclusions

Our proposed 3D human pose recognition model, taking multiple camera views and IMUs as input, mainly upgrades the 2D pipeline from the non-fused SN152 model to the fused ORN+ model, and then to the 3D ORPSM+ model. Compared with previous work, we mainly achieve an improvement in 2D accuracy and a reduction in 3D error, making the estimation of joint points more accurate. In future work, we aim to further improve the accuracy of the 2D joints, mainly by using a more lightweight and accurate neural network and by optimizing the ORN+ model.

Author Contributions

Conceptualization, M.C. and G.T.; Methodology, M.C.; Software, M.C.; Validation, M.C. and G.T.; Formal analysis, M.C.; Investigation, M.C.; Resources, M.C.; Data curation, M.C.; Writing—original draft preparation, M.C.; Writing—review and editing, M.C.; Visualization, M.C.; Supervision, M.C.; Project administration, M.C.; Funding acquisition, G.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of Guangxi for modeling and control (No. 61563005) and by the Graduate Education Innovation Program of Guangxi University, study No. YCSW2023481, for climbing movement recognition and evaluation based on human posture characteristics.

Data Availability Statement

The data that support the findings of this study are openly available in the TotalCapture dataset at https://cvssp.org/data/totalcapture/ [42].

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Arnab, A.; Doersch, C.; Zisserman, A. Exploiting temporal context for 3D human pose estimation in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3395–3404. [Google Scholar]
  2. Lin, B.; Zhang, S.; Yu, X.; Chu, Z.; Zhang, H. Learning effective representations from global and local features for cross-view gait recognition. arXiv 2020, arXiv:2011.01461. [Google Scholar]
  3. Zhang, Z.; Wang, C.; Qiu, W.; Qin, W.; Zeng, W. Adafuse: Adaptive multiview fusion for accurate human pose estimation in the wild. Int. J. Comput. Vis. 2021, 129, 703–718. [Google Scholar] [CrossRef]
  4. Gall, J.; Rosenhahn, B.; Brox, T.; Seidel, H.P. Optimization and filtering for human motion capture: A multi-layer framework. Int. J. Comput. Vis. 2010, 87, 75–92. [Google Scholar] [CrossRef]
  5. Martinez, J.; Hossain, R.; Romero, J.; Little, J.J. A simple yet effective baseline for 3D human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2640–2649. [Google Scholar]
  6. Pavlakos, G.; Zhou, X.; Derpanis, K.G.; Daniilidis, K. Coarse-to-fine volumetric prediction for single-image 3D human pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7025–7034. [Google Scholar]
  7. Liang, S.; Sun, X.; Wei, Y. Compositional human pose regression. Comput. Vis. Image Underst. 2018, 176, 1–8. [Google Scholar] [CrossRef]
  8. Nie, X.; Huang, Y.; Chen, C. Multi-Camera Based Human Pose Estimation Using Motion Consistency and 3D Geometric Constraints. IEEE Trans. Multimed. 2021, 23, 1501–1513. [Google Scholar]
  9. Tao, R.; Zhu, H.; Li, L. Multi-camera Multi-person 3D Human Pose Estimation via Joint Detection and Association. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 13288–13297. [Google Scholar]
  10. Wang, K.; Liu, S.; Qian, C. A 3D Human Pose Estimation Method Based on Multi-view Images through Multi-feature Fusion and Temporal Information Modeling. IEEE Trans. Image Process. 2021, 30, 1190–1201. [Google Scholar]
  11. Qiu, H.; Wang, C.; Wang, J.; Wang, N.; Zeng, W. Cross view fusion for 3D human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4342–4351. [Google Scholar]
  12. Huang, Y.; Kaufmann, M.; Aksan, E.; Black, M.J.; Hilliges, O.; Pons-Moll, G. Deep inertial poser: Learning to reconstruct human pose from sparse inertial measurements in real time. ACM Trans. Graph. 2018, 37, 185. [Google Scholar] [CrossRef]
  13. Zhang, X.; Wang, Y.; Li, J.; Liu, Z. TransPose: Real-time 3D Human Translation and Pose Estimation with Six Inertial Sensors. IEEE Trans. Neural Syst. Rehabil. Eng. 2020, 28, 1176–1185. [Google Scholar]
  14. Mollyn, V.; Arakawa, R.; Goel, M.; Harrison, C.; Ahuja, K. IMUPoser: Full-Body Pose Estimation using IMUs in Phones, Watches, and Earbuds. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI ’23, New York, NY, USA, 23–28 April 2023. [Google Scholar] [CrossRef]
  15. Zhang, Z.; Wang, C.; Qin, W.; Zeng, W. Fusing wearable IMUs with multiview images for human pose estimation: A geometric approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2200–2209. [Google Scholar]
  16. Gilbert, A.; Trumble, M.; Malleson, C.; Hilton, A.; Collomosse, J. Fusing visual and inertial sensors with semantics for 3d human pose estimation. Int. J. Comput. Vis. 2019, 127, 381–397. [Google Scholar] [CrossRef]
  17. Malleson, C.; Gilbert, A.; Trumble, M.; Collomosse, J.; Hilton, A.; Volino, M. Real-time full-body motion capture from video and IMUs. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Lyon, France, 18–21 September 2017; pp. 449–457. [Google Scholar]
  18. Von Marcard, T.; Henschel, R.; Black, M.J.; Rosenhahn, B.; Pons-Moll, G. Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 601–617. [Google Scholar]
  19. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  20. Dong, J.; Jiang, W.; Huang, Q.; Bao, H.; Zhou, X. Fast and Robust Multi-Person 3D Pose Estimation From Multiple Views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 1090–1099. [Google Scholar]
  21. Dong, Z.; Song, J.; Chen, X.; Guo, C.; Hilliges, O. Shape-aware Multi-Person Pose Estimation from Multi-view Images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 4706–4715. [Google Scholar]
  22. Liang, J.; Lin, M.C. Shape-Aware Human Pose and Shape Reconstruction Using Multi-View Images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, South Korea, 27 October – 2 November 2019; pp. 3355–3364. [Google Scholar]
  23. Tu, H.; Wang, C.; Zeng, W. VoxelPose: Towards Multi-Camera 3D Human Pose Estimation in Wild Environment. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 1–18. [Google Scholar]
  24. Burenius, M.; Sullivan, J.; Carlsson, S. 3D Pictorial Structures for Multiple View Articulated Pose Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013; pp. 3248–3255. [Google Scholar]
  25. Pavlakos, G.; Zhou, X.; Derpanis, K.G.; Daniilidis, K. Harvesting Multiple Views for Marker-Less 3D Human Pose Annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4265–4274. [Google Scholar]
  26. Rhodin, H.; Spörri, J.; Katircioglu, I.; Constantin, V.; Meyer, F.; Müller, E.; Salzmann, M.; Fua, P. Learning Monocular 3D Human Pose Estimation From Multi-View Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 5604–5613. [Google Scholar]
  27. Rhodin, H.; Salzmann, M.; Fua, P. Unsupervised Geometry-Aware Representation for 3D Human Pose Estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 804–820. [Google Scholar]
  28. Chen, H.; Guo, P.; Li, P.; Lee, G.H.; Chirikjian, G. Multi-person 3D Pose Estimation in Crowded Scenes Based on Multi-View Geometry. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 1034–1051. [Google Scholar]
  29. Zhang, Y.; An, L.; Yu, T.; Li, X.; Li, K.; Liu, Y. 4D Association Graph for Realtime Multi-person Motion Capture Using Multiple Video Cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 7522–7531. [Google Scholar]
  30. Huang, C.; Jiang, S.; Li, Y.; Zhang, Z.; Traish, J.; Deng, C.; Ferguson, S.; Xu, R.Y.D. End-to-end Dynamic Matching Network for Multi-view Multi-person 3D Pose Estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020. [Google Scholar]
  31. Mitra, R.; Gundavarapu, N.B.; Sharma, A.; Jain, A. Multiview-Consistent Semi-Supervised Learning for 3D Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  32. Kocabas, M.; Karagoz, S.; Akbas, E. Self-Supervised Learning of 3D Human Pose Using Multi-View Geometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  33. Schepers, M.; Giuberti, M.; Bellusci, G. Xsens MVN: Consistent Tracking of Human Motion Using Inertial Sensing. Xsens Technol. 2018, 1, 1–8. [Google Scholar]
  34. Von Marcard, T.; Rosenhahn, B.; Black, M.J.; Pons-Moll, G. Sparse inertial poser: Automatic 3D human pose estimation from sparse IMUs. Comput. Graph. Forum 2017, 35, 349–360. [Google Scholar] [CrossRef]
  35. Yi, X.; Zhou, Y.; Habermann, M.; Shimada, S.; Golyanik, V.; Theobalt, C.; Xu, F. Physical inertial poser (PIP): Physics-aware real-time human motion tracking from sparse inertial sensors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, June 2022; pp. 13167–13178. [Google Scholar]
  36. Bao, Y.; Zhao, X.; Qian, D. FusePose: IMU-vision sensor fusion in kinematic space for parametric human pose estimation. IEEE Trans. Multimed. 2022, 25, 7736–7746. [Google Scholar] [CrossRef]
  37. Huang, F.; Zeng, A.; Liu, M.; Lai, Q.; Xu, Q. Deepfuse: An IMU-aware network for real-time 3D human pose estimation from multi-view images. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 429–438. [Google Scholar]
  38. Trumble, M.; Gilbert, A.; Malleson, C.; Hilton, A.; Collomosse, J. Total Capture: 3D Human Pose Estimation Fusing Video and Inertial Sensors. In Proceedings of the British Machine Vision Conference, Glasgow, UK, 25–28 November 2017. [Google Scholar]
  39. Pons-Moll, G.; Baak, A.; Helten, T.; Muller, M.; Seidel, H.P.; Rosenhahn, B. Multisensor-fusion for 3D full-body human motion capture. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, 13–18 June 2010; pp. 663–670. [Google Scholar]
  40. Malleson, C.; Collomosse, J.; Hilton, A. Real-time multi-person motion capture from multi-view video and IMUs. Int. J. Comput. Vis. 2020, 128, 1594–1611. [Google Scholar] [CrossRef]
  41. Xiao, B.; Wu, H.; Wei, Y. Simple Baselines for Human Pose Estimation and Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  42. Trumble, M.; Gilbert, A.; Malleson, C.; Hilton, A.; Collomosse, J. The TotalCapture Dataset. In Proceedings of the British Machine Vision Conference (BMVC), Glasgow, UK, 25–28 November 2017. [Google Scholar]
  43. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 1325–1339. [Google Scholar] [CrossRef] [PubMed]
  44. Belagiannis, V.; Amin, S.; Andriluka, M.; Schiele, B.; Navab, N.; Ilic, S. 3D Pictorial Structures for Multiple Human Pose Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  45. Remelli, E.; Han, S.; Honari, S.; Fua, P.; Wang, R. Lightweight Multi-View 3D Pose Estimation Through Camera-Disentangled Representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
Figure 1. The overall pipeline of the proposed method.
Figure 2. The Gaussian and Laplacian functions used to generate target heatmaps.
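For readers who want to visualize the two kernels in Figure 2, the sketch below renders a single-keypoint target heatmap with either a Gaussian or a Laplacian kernel. It is a minimal illustration, not the authors' implementation; the spread parameter and the normalization are assumptions.

```python
import numpy as np

def target_heatmap(size, center, sigma=2.0, kernel="gaussian"):
    """Render a 2D target heatmap for a single keypoint.

    size   : (height, width) of the heatmap
    center : (x, y) keypoint location in heatmap coordinates
    sigma  : spread of the kernel (assumed value, not taken from the paper)
    kernel : "gaussian" or "laplacian"
    """
    h, w = size
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    dx, dy = xs - center[0], ys - center[1]
    if kernel == "gaussian":
        # Smooth, relatively wide peak around the keypoint.
        return np.exp(-(dx ** 2 + dy ** 2) / (2.0 * sigma ** 2))
    # Laplacian kernel: sharper, more concentrated peak.
    return np.exp(-np.sqrt(dx ** 2 + dy ** 2) / sigma)

# Example: compare the two kernels on a 64 x 64 heatmap.
gaussian_target = target_heatmap((64, 64), (20, 30), kernel="gaussian")
laplacian_target = target_heatmap((64, 64), (20, 30), kernel="laplacian")
```

The sharper Laplacian peak concentrates the regression target around the true keypoint, which is the behavior the paper exploits to tighten 2D localization.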
Figure 3. Illustration of the cross-joint fusion idea of the ORN+ model. (a) For a position Y_P in heatmap H_1, we estimate its 3D point P_k to lie on the line defined by the camera center C_1 and Y_P. Using the 3D limb direction from the IMUs and the limb length, we estimate a candidate 3D position Q_k for joint J_2. This candidate is projected onto heatmap H_2 as Y_{Q_k}; if its confidence is high, it reinforces the J_2 heatmaps of the other cameras. (b) The initial confidence of J_1 at Y_P is enhanced by the confidence of J_2 at Y_{Q_k} across all views. The heatmaps of J_1 and J_2 are fused, and the MaxPool step averages the top two maximum values over the Y_{Q_k} candidates.
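A minimal sketch of the softened MaxPool in Figure 3b is given below: candidate projections Y_{Q_k} are scored on one view's heatmap, and the top-k scores are averaged instead of keeping only the maximum. The function and parameter names (topk_confidence, window) are illustrative rather than taken from the paper, and the geometric construction of the candidates is omitted.

```python
import numpy as np

def topk_confidence(heatmap, points, k=2, window=1):
    """Score projected candidate locations and soften the max operation.

    heatmap : (H, W) joint heatmap from one camera view
    points  : (N, 2) projected candidate locations (x, y), one per candidate Q_k
    k       : number of top-scored candidates to average (k = 1 is a plain max)
    window  : half-size of the neighborhood sampled around each candidate
    """
    h, w = heatmap.shape
    scores = []
    for x, y in np.round(np.asarray(points)).astype(int):
        x0, x1 = max(x - window, 0), min(x + window + 1, w)
        y0, y1 = max(y - window, 0), min(y + window + 1, h)
        patch = heatmap[y0:y1, x0:x1]
        scores.append(patch.max() if patch.size else 0.0)
    scores = np.sort(np.asarray(scores))[::-1]
    # Averaging the k best candidates keeps a single spurious peak
    # from dominating the fused confidence.
    return float(scores[:k].mean()) if scores.size else 0.0
```

With k = 2 this reproduces the top-two averaging described in the caption; the resulting score is then used to reinforce the corresponding joint heatmap in the ORN fusion step.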
Figure 4. The ORN+ model for enhancing joint heatmaps.
Figure 5. Two-dimensional pose estimation with the ORN model as the fusion weight w in Equation (5) is varied from 0 to 1.0 in steps of 0.1; each line plot shows the mean accuracy over all joints (PCKh@) for the different weights. Panels (a–c) show the mean accuracy at the one-half, one-sixth, and one-twelfth thresholds (150 mm, 50 mm, and 25 mm, respectively).
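The weight sweeps in Figures 5 and 7 amount to a simple grid search over w. The sketch below assumes the fused heatmap is a convex combination of the backbone and ORN heatmaps, which is a simplification of Equation (5); evaluate_pckh is a hypothetical callable standing in for the full evaluation pipeline.

```python
import numpy as np

def sweep_fusion_weight(base_heatmaps, orn_heatmaps, evaluate_pckh, step=0.1):
    """Grid-search the fusion weight w and record mean PCKh for each value.

    base_heatmaps : heatmaps from the 2D backbone
    orn_heatmaps  : IMU-enhanced heatmaps produced by the ORN module
    evaluate_pckh : callable mapping fused heatmaps to a mean PCKh accuracy
    """
    results = {}
    for w in np.arange(0.0, 1.0 + 1e-9, step):
        # Assumed blend; the exact fusion rule is Equation (5) in the paper.
        fused = (1.0 - w) * base_heatmaps + w * orn_heatmaps
        results[round(float(w), 1)] = evaluate_pckh(fused)
    return results
```

The same loop, with MPJPE in place of PCKh, would reproduce the sweep over the ORPSM+ weight in Equation (7) shown in Figure 7.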
Figure 6. 2D pose visualization.
Figure 7. Mean 3D joint error (MPJPE) on the TotalCapture dataset for the ORPSM+ 3D human pose estimation model as the weight w in Equation (7) is varied from 0 to 1 in steps of 0.1. The MPJPE is lowest at w = 0.7, 0.8, and 0.9.
Table 1. Two-dimensional human pose estimation accuracy (PCKh@) on the TotalCapture dataset. SN50 denotes the ResNet-50 network, SN152 the ResNet-152 network, and SN152 + Lp the ResNet-152 network with the Laplacian distribution function; Mean (All) is the average over all joints.
Methods | PCKh@ | Hip | Knee | Ankle | Shoulder | Elbow | Wrist | Root | Belly | Neck | Nose | Mean (All)
SN50 | 1/2 | 99.9 | 99.1 | 99.1 | 99.1 | 99.7 | 98.1 | 97.4 | 99.9 | 99.9 | 99.9 | 99.8
SN152 | 1/2 | 99.9 | 99.1 | 99.1 | 99.3 | 98.0 | 97.0 | 100 | 100 | 99.9 | 99.9 | 99.0
SN152 + Lp | 1/2 | 99.9 | 99.2 | 99.3 | 99.6 | 98.4 | 97.5 | 99.9 | 99.9 | 99.9 | 99.8 | 99.2
SN50 | 1/6 | 99.3 | 94.5 | 93.9 | 75.4 | 83.7 | 83.8 | 99.9 | 99.4 | 95.3 | 94.8 | 90.7
SN152 | 1/6 | 99.1 | 94.6 | 94.0 | 77.1 | 82.6 | 83.8 | 100 | 99.2 | 95.0 | 93.9 | 90.7
SN152 + Lp | 1/6 | 99.0 | 95.5 | 94.9 | 80.2 | 86.0 | 85.2 | 99.9 | 99.1 | 95.4 | 95.3 | 92.0
SN50 | 1/12 | 90.8 | 73.6 | 72.8 | 50.8 | 54.4 | 55.0 | 99.9 | 88.4 | 70.6 | 68.3 | 70.2
SN152 | 1/12 | 89.0 | 75.0 | 74.7 | 52.7 | 54.6 | 56.1 | 100 | 88.4 | 70.1 | 66.3 | 70.6
SN152 + Lp | 1/12 | 91.2 | 76.5 | 76.7 | 54.8 | 57.5 | 58.5 | 99.9 | 89.2 | 70.0 | 70.5 | 72.6
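The PCKh@ values in Tables 1–5 count a 2D joint as correct when its error falls below a fraction of a reference length (one-half, one-sixth, or one-twelfth; the captions also quote the corresponding 150 mm, 50 mm, and 25 mm thresholds). A minimal sketch of the metric is shown below, with the reference length passed in as an argument rather than fixed to the paper's exact normalizer.

```python
import numpy as np

def pckh(pred, gt, ref_len, frac):
    """Percentage of correct keypoints at a fractional threshold.

    pred, gt : (N, J, 2) predicted and ground-truth 2D joint positions
    ref_len  : per-sample reference length, shape (N,), or a scalar threshold base
    frac     : threshold fraction, e.g., 1/2, 1/6, or 1/12
    """
    dist = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1)    # (N, J)
    ref = np.broadcast_to(np.reshape(ref_len, (-1, 1)), dist.shape)      # (N, J)
    correct = dist <= frac * ref
    # Per-joint accuracy and the overall Mean (All) value, both in percent.
    return 100.0 * correct.mean(axis=0), 100.0 * correct.mean()
```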
Table 2. Results on the TotalCapture dataset using the ORN and ORN+ models. PCKh@ denotes accuracy at the one-half, one-sixth, and one-twelfth thresholds (150 mm, 50 mm, and 25 mm, respectively). Mean (All) is the average accuracy over all joints, and the ORN weight is w = 0.7 for the N-MEAN model.
Methods | PCKh@ | Hip | Knee | Ankle | Shoulder | Elbow | Wrist | Root | Belly | Neck | Nose | Mean (All)
ORN | 1/2 | 99.9 | 99.7 | 99.6 | 99.8 | 99.0 | 98.6 | 99.9 | 99.9 | 99.9 | 99.8 | 99.5
ORN+ | 1/2 | 99.9 | 99.6 | 99.6 | 99.7 | 98.9 | 98.5 | 99.9 | 99.9 | 99.9 | 99.8 | 99.5
ORN | 1/6 | 99.3 | 97.2 | 96.5 | 83.8 | 88.8 | 88.8 | 99.9 | 99.4 | 96.4 | 96.2 | 93.8
ORN+ | 1/6 | 99.3 | 97.2 | 96.5 | 83.8 | 88.8 | 88.6 | 99.9 | 99.4 | 96.4 | 96.2 | 93.8
ORN | 1/12 | 90.9 | 85.2 | 83.6 | 57.6 | 62.2 | 64.2 | 99.9 | 88.9 | 75.1 | 74.4 | 76.7
ORN+ | 1/12 | 92.4 | 85.7 | 85.1 | 58.4 | 62.7 | 65.0 | 99.9 | 88.9 | 75.1 | 74.4 | 77.4
Table 3. Results on the TotalCapture dataset. SN152 + Lp is the ResNet-152 network with the Laplacian distribution function; ORN_same is the ORN model applied within the same view, i.e., Equation (3); ORN+ is the multiview ORN with weighted averaging. Mean (Six) is the average accuracy of the first six joints, Others is the average accuracy of the remaining four joints (root, belly, neck, nose), and Mean (All) is the average accuracy over all joints.
Methods | PCKh@ | Hip | Knee | Ankle | Shoulder | Elbow | Wrist | Mean (Six) | Others | Mean (All)
SN152 + Lp | 1/2 | 99.9 | 99.2 | 99.3 | 99.6 | 98.4 | 97.5 | 99.0 | 99.9 | 99.2
ORN_same | 1/2 | 99.9 | 99.6 | 99.4 | 99.6 | 99.1 | 98.3 | 99.3 | 99.9 | 99.5
ORN+ | 1/2 | 99.9 | 99.6 | 99.6 | 99.8 | 98.9 | 98.5 | 99.4 | 99.9 | 99.5
SN152 + Lp | 1/6 | 99.0 | 95.5 | 94.9 | 80.2 | 86.0 | 85.2 | 90.1 | 97.5 | 92.0
ORN_same | 1/6 | 98.9 | 96.9 | 96.1 | 82.2 | 88.7 | 87.4 | 91.9 | 98.0 | 93.3
ORN+ | 1/6 | 99.3 | 97.2 | 96.5 | 83.8 | 88.8 | 88.6 | 92.4 | 98.0 | 93.8
SN152 + Lp | 1/12 | 91.2 | 76.5 | 76.7 | 54.3 | 57.5 | 58.5 | 69.2 | 82.4 | 72.6
ORN_same | 1/12 | 88.0 | 84.0 | 81.9 | 55.0 | 62.3 | 63.9 | 72.5 | 84.6 | 75.6
ORN+ | 1/12 | 92.4 | 85.7 | 85.1 | 58.4 | 62.7 | 65.0 | 74.9 | 84.6 | 77.4
Table 4. Comparison of two-dimensional human pose estimation accuracy (PCKh@) of different methods on the Human3.6M dataset.
Methods | PCKh@ | Hip | Knee | Ankle | Shoulder | Elbow | Wrist | Root | Belly | Neck | Nose | Head | Mean (All)
SN50 | 1/2 | 93.7 | 94.8 | 93.8 | 86.9 | 90.3 | 86.6 | 95.4 | 97.2 | 97.8 | 97.6 | 97.7 | 92.8
SN152 | 1/2 | 98.1 | 96.1 | 96.3 | 97.5 | 96.6 | 94.0 | 98.3 | 97.3 | 97.7 | 97.8 | 97.7 | 96.8
SN152 + Lp | 1/2 | 97.9 | 96.2 | 96.0 | 97.5 | 96.1 | 93.7 | 98.0 | 97.5 | 97.7 | 97.8 | 97.7 | 96.7
SN50 | 1/6 | 87.0 | 89.2 | 85.4 | 72.4 | 81.4 | 76.7 | 94.8 | 93.2 | 90.0 | 95.3 | 95.8 | 85.5
SN152 | 1/6 | 95.6 | 92.0 | 90.1 | 92.9 | 89.0 | 84.1 | 96.6 | 95.0 | 95.2 | 96.0 | 96.2 | 92.1
SN152 + Lp | 1/6 | 95.3 | 92.0 | 89.9 | 94.1 | 89.1 | 83.6 | 96.6 | 95.0 | 95.5 | 96.0 | 96.1 | 92.2
SN50 | 1/12 | 58.1 | 69.0 | 54.0 | 42.8 | 60.5 | 58.5 | 86.8 | 65.3 | 40.4 | 83.5 | 95.8 | 62.1
SN152 | 1/12 | 70.8 | 68.6 | 56.8 | 67.1 | 73.2 | 66.4 | 95.0 | 70.4 | 62.0 | 85.0 | 95.0 | 71.4
SN152 + Lp | 1/12 | 71.0 | 64.1 | 58.5 | 71.5 | 72.6 | 66.5 | 94.7 | 75.4 | 75.7 | 86.3 | 95.1 | 72.7
Table 5. Comparison of two-dimensional human pose estimation accuracy (PCKh@) of different methods on the Human3.6M dataset.
Methods | PCKh@ | Hip | Knee | Ankle | Shoulder | Elbow | Wrist | Mean (Six) | Others | Mean (All)
SN152 + Lp | 1/2 | 97.9 | 96.2 | 96.0 | 97.5 | 96.1 | 93.7 | 96.2 | 97.7 | 96.7
ORN_same | 1/2 | 97.3 | 96.8 | 96.3 | 97.4 | 96.9 | 96.1 | 96.8 | 97.6 | 97.0
ORN+ | 1/2 | 97.6 | 96.9 | 96.8 | 97.5 | 96.8 | 96.8 | 97.0 | 97.6 | 97.2
SN152 + Lp | 1/6 | 95.3 | 92.0 | 89.9 | 94.1 | 89.1 | 83.6 | 90.6 | 95.8 | 92.1
ORN_same | 1/6 | 96.0 | 95.1 | 93.4 | 94.7 | 92.0 | 87.2 | 93.1 | 95.9 | 93.9
ORN+ | 1/6 | 96.2 | 95.6 | 94.4 | 95.7 | 92.7 | 92.1 | 94.4 | 95.9 | 94.9
SN152 + Lp | 1/12 | 71.0 | 64.1 | 58.5 | 71.5 | 72.6 | 66.5 | 67.3 | 85.4 | 72.7
ORN_same | 1/12 | 82.5 | 81.5 | 83.2 | 79.3 | 81.2 | 72.4 | 80.0 | 88.2 | 82.4
ORN+ | 1/12 | 86.1 | 79.8 | 77.2 | 83.9 | 80.0 | 78.6 | 80.9 | 88.2 | 83.1
Table 6. Three-dimensional human pose error (MPJPE, in millimeters) on the TotalCapture dataset for different variants of our method. Mean (Six) is the average error of the first six joints, Others is the average error of the remaining joints, and Mean (All) is the average error across all joints.
2D | 3D | Hip | Knee | Ankle | Shoulder | Elbow | Wrist | Mean (Six) | Others | Mean (All)
SN152 | ORPSM | 16.2 | 20 | 21.8 | 34.6 | 29.8 | 29.8 | 25.3 | 14.9 | 22.7
SN152 + ORN | ORPSM | 17.3 | 19.2 | 21.1 | 34.2 | 29.8 | 30.1 | 25.3 | 15.0 | 22.7
SN152 + Lp + ORN | ORPSM | 15.9 | 18.4 | 20.6 | 31.0 | 27.9 | 28.7 | 23.8 | 15.5 | 21.7
SN152 + Lp + ORN+ | ORPSM | 15.3 | 18.2 | 20.0 | 30.7 | 27.7 | 28.3 | 23.4 | 15.5 | 21.4
SN152 + Lp + ORN+ | ORPSM+ | 13.9 | 17.4 | 19.5 | 29.7 | 26.9 | 28.0 | 22.6 | 15.2 | 20.7
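The MPJPE values in Tables 6 and 7 are mean Euclidean distances between predicted and ground-truth 3D joints in millimeters; Table 7 additionally reports results with the predicted pose aligned to the ground truth. The sketch below uses simple root-joint centering as the alignment step, which is an assumption; the paper's exact alignment procedure is not restated here.

```python
import numpy as np

def mpjpe(pred, gt, align_root=False, root_idx=0):
    """Mean per-joint position error in millimeters.

    pred, gt   : (N, J, 3) predicted and ground-truth 3D joints, in mm
    align_root : if True, translate both skeletons so their root joints
                 coincide before measuring the error
    root_idx   : index of the root joint in the joint ordering
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if align_root:
        pred = pred - pred[:, root_idx:root_idx + 1, :]
        gt = gt - gt[:, root_idx:root_idx + 1, :]
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```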
Table 7. Three-dimensional human pose error (MPJPE, in millimeters) of different methods on the TotalCapture dataset. Aligned indicates that the predicted pose is aligned with the ground-truth pose before evaluation.
Approach | IMUs | Temporal | Aligned | W2 (S1,2,3) | A3 (S1,2,3) | FS3 (S1,2,3) | W2 (S4,5) | A3 (S4,5) | FS3 (S4,5) | Mean
[38] |  |  |  | 48.3 | 94.3 | 122.3 | 84.3 | 154.5 | 168.5 | 107.3
[17] |  |  |  | - | - | 65.3 | - | 64.0 | 67.0 | -
[16] |  |  |  | 19.2 | 42.3 | 48.8 | 24.7 | 58.8 | 61.8 | 42.6
[44] |  |  |  | 13.0 | 23.0 | 47.0 | 21.8 | 40.9 | 68.5 | 34.1
[11] |  |  |  | 19.0 | 21.0 | 28.0 | 32.0 | 33.0 | 54.0 | 29.0
[45] |  |  |  | 10.6 | 16.3 | 30.4 | 27.0 | 35.1 | 65.0 | 27.5
[18] |  |  |  | - | - | - | - | - | - | 26.0
[15] |  |  |  | 14.3 | 17.5 | 25.9 | 23.9 | 27.8 | 49.3 | 24.6
[3] |  |  |  | 7.2 | 10.8 | 18.5 | 22.8 | 26.6 | 42.9 | 19.2
ORN+ + ORPSM+ |  |  |  | 13.2 | 14.9 | 20.0 | 22.4 | 26.5 | 36.4 | 20.7
ORN+ + ORPSM+ |  |  |  | 11.4 | 12.4 | 16.6 | 18.9 | 21.9 | 30.1 | 17.3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
