1. Introduction
Due to its numerous real-world applications, such as human-computer interaction [1] and virtual and augmented reality [2,3], 3D multi-view multi-person human pose estimation has seen significant research in computer vision [4,5,6,7], driven by deep neural networks and large-scale human-annotated datasets [8,9]. These multi-view pose estimation methods achieve excellent performance on the benchmark datasets [8,9], but they still face challenges because of the wide variation in viewpoints, personal appearance, backgrounds, illumination, image quality, and so on. Due to unavoidable domain shifts, pose estimators developed for one particular domain (i.e., the source domain) may not generalize well to novel testing domains (i.e., the target domains). For example, a 3D pose estimator trained on the Panoptic [9] dataset suffers a severe performance drop when evaluated on the Campus [10] and Shelf [10] datasets. Figure 1, which shows several datasets used for 3D human pose estimation, illustrates this considerable domain shift.
Even if more training data from various domains could solve this problem, it may be impractical due to the complexity of real-world scenarios and the high cost of 3D annotation. As a result, methods for successfully transferring a 3D pose estimator trained on a labeled source domain to a new unlabeled target domain are in high demand.
Our research focuses on domain adaptation for multi-view, multi-person 3D pose estimation under covariate shift. We develop an end-to-end deep learning model called “Domain Adaptive VoxelPose”, based on the state-of-the-art VoxelPose model [11]. This is an unsupervised domain adaptation setting: full supervision is available in the source domain but none in the target domain. As a result, better 3D pose estimation is obtained without any additional annotation cost in the target domain.
We add three elements to the VoxelPose model to minimize the divergence between the two domains and thereby correct the domain shift. First, we train a domain classifier [12] with adversarial training to learn domain-invariant, reliable features. Second, we apply dropout to multiple discriminators by dropping each discriminator’s feedback with a specific probability at the end of each batch. This makes the feature extractor more domain-invariant by requiring its output to satisfy a dynamic ensemble of discriminators rather than a single discriminator. Third, we introduce Transferable Parameter Learning (TransPar) [13] to eliminate the side effects of domain-specific information and enhance domain-invariant learning. TransPar divides all parameters into two categories, transferable and non-transferable, and applies distinct update rules to each.
In summary, the principal contributions of our work are as follows: (1) We provide domain adaptation components to reduce the disparity between the source domain and the chosen target domain. (2) To train a more domain-invariant feature extractor, we propose a novel method that applies a dynamic ensemble of dropout domain discriminators. (3) We employ Transferable Parameter Learning (TransPar) to reduce the negative effects of domain-specific knowledge throughout the learning process and to enhance the retention of domain-independent knowledge. (4) The proposed components are incorporated into the VoxelPose model, and the resulting system is trainable end to end. Our method was evaluated on three datasets, and the results show that it improves the accuracy of cross-domain multi-view multi-person 3D pose estimation.
The paper is organized as follows.
Section 2 discusses related works of the methods, including 3D human pose estimation and unsupervised domain adaptation.
Section 3 introduces the main baseline, theory, hypothesis, and datasets used in the paper.
Section 4 details the proposed methodology for unsupervised multi-view, multi-person 3D pose estimation.
Section 5 verifies the feasibility of the proposed model.
Section 6 concludes the paper, points out the limitations of the research, and proposes some future work.
3. Preliminaries
3.1. VoxelPose
We present a brief overview of the VoxelPose model, which serves as the baseline for our research. VoxelPose projects 2D joint heatmaps from multiple viewpoints into a voxelized 3D space, enabling the direct detection and prediction of 3D human poses. The process begins with the estimation of 2D heatmaps for each view, encoding the per-pixel likelihood of every joint. The features from all camera views, each individually noisy and incomplete, are then warped and aggregated in a common 3D voxel space, producing a comprehensive feature volume suitable for 3D estimation. The Cuboid Proposal Network localizes individuals within this volume, and the Pose Regression Network estimates a full 3D pose for each proposal.
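To make the aggregation step concrete, the following minimal numpy sketch (our own illustration, not the authors' implementation; `aggregate_heatmaps` and its argument conventions are hypothetical) averages per-view 2D heatmap samples into a 3D feature volume:

```python
import numpy as np

def aggregate_heatmaps(heatmaps, projections, grid):
    """Average per-view 2D heatmap samples into a 3D feature volume.

    heatmaps:    list of (H, W) arrays, one per camera view
    projections: list of callables mapping (N, 3) world points to
                 (N, 2) pixel coordinates (stand-ins for calibrated
                 camera projection matrices)
    grid:        (N, 3) array of voxel-center positions
    """
    volume = np.zeros(len(grid))
    for hm, proj in zip(heatmaps, projections):
        uv = np.round(proj(grid)).astype(int)
        h, w = hm.shape
        # voxels projecting outside the image contribute zero
        valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        samples = np.zeros(len(grid))
        samples[valid] = hm[uv[valid, 1], uv[valid, 0]]
        volume += samples
    return volume / len(heatmaps)
```

In the real model the sampled values are multi-channel CNN features and the fusion is learned, but the warping-by-projection idea is the same.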
3.2. Fundamental Conceptual Framework
The $\mathcal{H}\Delta\mathcal{H}$-divergence is an influential construct within unsupervised domain adaptation (UDA). In UDA scenarios, we are presented with a labeled source domain $S$ and an unlabeled target domain $T$. The overarching aim is to develop a hypothesis $h$ capable of proficiently predicting within the target domain, notwithstanding the lack of its labels.
The groundwork for understanding this theory lies in the theorem titled “Bound with Disparity” [34]. According to this theorem, given a symmetric loss function $\ell$ that adheres to the triangle inequality, the disparity between any two hypotheses $h$ and $h'$ on a distribution $\mathcal{D}$ can be defined as:

$$\mathrm{disp}_{\mathcal{D}}(h', h) = \mathbb{E}_{x \sim \mathcal{D}}\,\ell\big(h'(x), h(x)\big).$$

Subsequently, the target risk $\epsilon_T(h)$ can be constrained by:

$$\epsilon_T(h) \le \epsilon_S(h) + \big|\mathrm{disp}_S(h, h^*) - \mathrm{disp}_T(h, h^*)\big| + \lambda, \qquad \lambda = \epsilon_S(h^*) + \epsilon_T(h^*).$$

Here, $h^*$ denotes the optimal joint hypothesis.
Building upon this premise, the essence of the $\mathcal{H}\Delta\mathcal{H}$-divergence is to provide an upper bound for the disparity difference. The core merit of this divergence, as highlighted in our work, is that it can be estimated using finite, unlabeled samples from both the source and target domains. However, its direct computation is notably intricate and hard to optimize, so it is approximated by training a domain discriminator $D$ that separates the source and target samples. To this end, we employ a dropout discriminator, which not only prevents mode collapse but also enhances the robustness of our algorithm.
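As a concrete illustration of reading a divergence off a trained discriminator (the standard proxy A-distance estimate from the UDA literature, not a quantity computed in this paper), the discriminator's held-out error maps directly to a divergence score:

```python
def proxy_a_distance(disc_error):
    """Proxy A-distance from a domain discriminator's test error:
    d_A = 2 * (1 - 2 * err). An error of 0.5 (chance level) means the
    domains are indistinguishable (d_A = 0); an error of 0 means they
    are perfectly separable (d_A = 2)."""
    return 2.0 * (1.0 - 2.0 * disc_error)
```

A discriminator that cannot beat chance therefore certifies that the feature distributions are well aligned.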
The lottery ticket hypothesis refers to a hypothesis in deep learning that suggests finding sparse subnetworks, known as “winning tickets”, within over-parameterized neural networks. These winning tickets can achieve comparable or even better performance than the original large network when trained in isolation under suitable conditions.
In the context of domain adaptation, the lottery ticket hypothesis can be applied to transfer learning scenarios. By identifying winning tickets or subnetworks that generalize well to both the source and target domains, we can effectively adapt the model from the source domain to the target domain. This can help in mitigating the issue of distribution shift between the two domains and improving the performance of the model on the target domain with limited labeled data. This is also the theoretical basis for transferable parameter learning.
3.3. Dataset
As shown in Table 1, our evaluation employs three datasets:
Campus [10]: A dataset of three people conversing with one another in an outdoor environment, recorded by three calibrated cameras. To assess the precision of the 3D localization of body parts, we use the percentage of correctly estimated parts (3D PCP) [10]. We adapt our pose estimator using the dataset’s unlabeled images as the target domain.
Shelf [10]: The Shelf dataset covers a scenario of ordinary social interactions. In contrast to Campus, it is a more complex dataset, consisting of four individuals disassembling a shelf at close range. Five calibrated cameras surround them, but each view is severely occluded. 3D PCP [10] is again the assessment metric, and the dataset’s unlabeled images serve as our target domain.
CMU Panoptic [9]: This dataset was recorded in a lab setting and contains multiple people engaging in social activities. With hundreds of cameras, it obtains compelling motion-capture results. It is a sizable dataset of social interactions with numerous and varied natural interactions. We use it as our source domain.
4. Method: Domain Adaptation for Multi-View Multi-Person 3D Pose Estimation
In this section, we detail our proposed methodology for unsupervised multi-view multi-person 3D pose estimation. In the fully supervised setting, we have $n$ labeled samples $\{(x_i, y_i)\}_{i=1}^{n}$ from $\mathcal{X} \times \mathcal{Y}^{K}$, where $\mathcal{X}$ represents the input space, $\mathcal{Y}$ the output space, and $K$ the number of keypoints for each input. The samples randomly selected from distribution $D$ are denoted as $\widehat{D}$. The objective is to identify a regressor $f: \mathcal{X} \to \mathcal{Y}^{K}$ that yields the lowest error rate $\epsilon_{D}(f) = \mathbb{E}_{(x, y) \sim D}\, L(f(x), y)$ on $D$, where $L$ is a loss function we shall explain later. In unsupervised domain adaptation, there exists a labeled source domain $\widehat{S}$ and an unlabeled target domain $\widehat{T}$, and the objective is to minimize the target error $\epsilon_{T}(f)$.
4.1. Domain Adaptation Component
In the VoxelPose model, the feature representation refers to the feature map outputs of the base convolutional layers (depicted by the green parallelogram in Figure 2). Specifically, we train a domain classifier on these feature maps to mitigate the domain distribution discrepancy. The domain classifier predicts the domain label of each feature map, i.e., whether the corresponding input image comes from the source or the target domain.
This decision has two advantages: (1) aligning representations at the image level often reduces the shift caused by variations in the images, such as image style, human body scale, and illumination; (2) because high-resolution inputs force a small batch size when training a pose estimation network, aligning at this level allows more data to be employed in training the domain classifier.
Let $d_i$ denote the domain label of the $i$-th training image, where $d_i = 0$ for the source domain and $d_i = 1$ for the target domain. The feature map of the $i$-th image after the base 2D convolutional layers is denoted by $f_i$. Using the cross-entropy loss and denoting the output of the domain classifier by $p_i = D(f_i)$, the domain adaptation loss can be written as:

$$L_{da} = -\sum_{i}\big[d_i \log p_i + (1 - d_i)\log(1 - p_i)\big].$$
To align the domain distributions, we must simultaneously optimize the parameters of the base 2D network to maximize the above domain classification loss while minimizing the keypoint regression loss. To optimize the base network against the domain classifier, we adopt the gradient reversal layer [12] instead of regular gradient descent: the gradient first passes through the gradient reversal layer, where its sign is inverted, before reaching the base network.
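The semantics of the gradient reversal layer can be sketched as follows (a minimal illustration of the forward/backward behavior, not the PyTorch autograd implementation used in our code; the class name is hypothetical):

```python
class GradReverse:
    """Gradient reversal layer: identity in the forward pass,
    sign-flipped (and scaled) gradient in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # reversal strength

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_output):
        # flipping the sign makes the feature extractor *maximize*
        # the domain-classification loss whose gradient flows through
        return -self.lam * grad_output
```

Because the layer is the identity in the forward pass, it changes nothing at inference time; only the training dynamics of the feature extractor are affected.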
4.2. Dropout Domain Adaptation Component
In order to force the pose estimator to learn from a dynamic ensemble of discriminators, we propose the integration of adversarial feedback dropout in adversarial networks. The feedback of each discriminator is randomly excluded from the ensemble with a specific probability d at the end of each batch. This indicates that the pose estimator considers only the loss of the remaining discriminators when updating its parameters.
Figure 2 illustrates our proposed framework, including our modification to the adversarial training loss function $L$, shown in Equation (2). In this equation, $\delta_k$ is a Bernoulli variable with $P(\delta_k = 0) = d$, and $\{D_1, \dots, D_K\}$ is the set of $K$ total discriminators. When $\delta_k = 1$, with $k \in \{1, \dots, K\}$, the gradients derived from the loss of the corresponding discriminator $D_k$ are employed to produce the final gradient updates for the pose estimator; otherwise, that discriminator's feedback is ignored:

$$L = \sum_{k=1}^{K} \delta_k\, L_{adv}^{(k)}, \qquad \delta_k \sim \mathrm{Bernoulli}(1 - d). \tag{2}$$
Each discriminator trains independently, i.e., unaware of the others, since no changes are made to their individual gradient updates. Figure 3 depicts the proposed algorithm in detail.
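A minimal sketch of the batch-wise feedback dropout (our own illustration; the safeguard that keeps at least one discriminator active when all are dropped is an assumption of this sketch, not part of Equation (2)):

```python
import random

def dropout_adversarial_loss(disc_losses, d, rng=random):
    """Mask each discriminator's loss with probability d and sum the rest.

    disc_losses: per-discriminator adversarial loss values
    d:           probability of dropping a discriminator's feedback
    Returns the masked total loss and the Bernoulli mask that was drawn.
    """
    mask = [0 if rng.random() < d else 1 for _ in disc_losses]
    if not any(mask):  # assumption: keep at least one discriminator active
        mask[rng.randrange(len(mask))] = 1
    total = sum(m * l for m, l in zip(mask, disc_losses))
    return total, mask
```

A fresh mask is drawn at the end of every batch, so the pose estimator faces a different subset of discriminators each time.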
4.3. Transferable Parameter Learning Component
The Transferable Parameter Learning Component is designed to distinguish between transferable and non-transferable parameters, enabling robust unsupervised domain adaptation. To limit the model's ability to retain domain-specific information, distinct update rules are used for the two types of parameters [13]. Consider a parameter $\theta_j$ whose gradient at the $t$-th iteration is $g_j^{t}$. The identifying criterion is defined by

$$r_j^{t} = \frac{|g_j^{t}|}{\frac{1}{n}\sum_{k=1}^{n} |g_k^{t}|},$$

where $n$ is the number of parameters of a module in a deep unsupervised domain adaptation network. If the value of $r_j^{t}$ is large, $\theta_j$ is viewed as a transferable parameter. On the contrary, if the value of $r_j^{t}$ is small, e.g., zero or very close to zero, $\theta_j$ is regarded as an untransferable parameter: it is not important for fitting domain-invariant information, and updating it will tend to fit domain-specific information. A robust positive update is executed on the transferable parameters $\theta^{+}$ using the objective-function gradients together with weight decay:

$$\theta^{+}_{t+1} = \theta^{+}_{t} - \eta\big(g_t + \lambda \theta^{+}_{t}\big).$$
Furthermore, the non-transferable parameters $\theta^{-}$, which tend to over-adapt to domain-specific details, are negatively updated, i.e., only the weight-decay term is applied:

$$\theta^{-}_{t+1} = \theta^{-}_{t} - \eta \lambda \theta^{-}_{t},$$

where $\eta$ refers to the learning rate and $\theta_t$ stands for the set of parameters at the $t$-th iteration. The term $\lambda\theta$ refers to the common weight-decay method, which avoids overfitting by keeping the parameters from growing too large; $\lambda$ is the weight-decay coefficient. This method for domain adaptation is straightforward and independent of other approaches.
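The two update rules can be sketched in numpy as follows (a simplified illustration under our reading of TransPar [13]: the transferable set is taken as the parameters with the largest gradient magnitudes, untransferable parameters receive only the weight-decay term, and `ratio` with its quantile threshold is an illustrative choice, not the published criterion):

```python
import numpy as np

def transpar_step(theta, grad, lr=0.001, decay=1e-4, ratio=0.5):
    """One TransPar-style update step (sketch, not the exact published rule).

    Parameters whose gradient magnitude falls in the top `ratio`
    fraction are treated as transferable and take the usual
    gradient-plus-weight-decay step; the remaining (untransferable)
    parameters are only shrunk by weight decay.
    """
    score = np.abs(grad)
    thresh = np.quantile(score, 1.0 - ratio)
    transferable = score >= thresh
    new_theta = theta.copy()
    # positive update for transferable parameters
    new_theta[transferable] -= lr * (grad[transferable] + decay * theta[transferable])
    # untransferable parameters: weight decay only
    new_theta[~transferable] -= lr * decay * theta[~transferable]
    return new_theta, transferable
```

Shrinking the untransferable parameters toward zero prevents them from accumulating domain-specific information between iterations.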
4.4. Network Overview
Figure 2 offers a detailed portrayal of our Domain Adaptive VoxelPose model, an enhancement of the baseline VoxelPose framework with three added components. Primarily, we have incorporated a domain classifier subsequent to the final 2D base convolution layer. Additionally, we have integrated a dropout mechanism that probabilistically neglects the feedback from each discriminator at the end of every batch. Finally, we apply transferable parameter learning to enforce distinct update rules for transferable and non-transferable parameters. The cumulative loss incurred during the training of our proposed network encompasses both the human pose estimation segment as well as the domain adaptation segment. Algorithm 1 provides a comprehensive overview of our approach, which optimizes two kinds of parameters using distinct update rules. It is important to note that our method preserves the architecture of the existing multi-view, multi-person 3D estimation networks. Consequently, both the time complexity and space complexity remain consistent with those of the original networks.
Algorithm 1: VoxelPose with dropout discriminators and transferable parameter learning.
Input: Camera views of the source domain; camera views of the target domain.
Output: 3D human poses for all cuboids.
5. Experiments
In this section, we verify the feasibility of the proposed Domain Adaptive VoxelPose model. Utilizing the Panoptic [9] dataset as the source domain, the performance of the technique is examined in two distinct domain-shift scenarios: (1) an outdoor environment, where the Campus [10] test dataset captures three individuals interacting outdoors through three calibrated cameras; (2) an indoor social-interaction setting, where the more complex Shelf [10] test dataset features four individuals closely disassembling a shelf, surrounded by five calibrated cameras with severe occlusion in each view. Due to differences in annotation formats between the source and target domains, a conversion step is employed to align the model's output with the target-domain annotations (e.g., for a model trained on Panoptic, the position of the nose can be treated as the head-center position of the Campus dataset).
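The annotation-format conversion amounts to a joint-name remapping, which can be sketched as follows (only the nose-to-head-center correspondence is stated above; the mapping table and function names are illustrative):

```python
# Illustrative mapping from Panoptic joint names to Campus names;
# only the nose -> head-center pairing is given in the text above.
PANOPTIC_TO_CAMPUS = {"nose": "head-center"}

def convert_joints(pred, mapping=PANOPTIC_TO_CAMPUS):
    """Rename predicted joints to the target dataset's convention,
    leaving unmapped joint names unchanged."""
    return {mapping.get(name, name): xyz for name, xyz in pred.items()}
```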
5.1. Experiment Setup
In our experiments, we follow the unsupervised domain adaptation setting. The training set is divided into two parts: the source training set, which includes images and their pose annotations, and the target training set, which only includes unlabeled images. To demonstrate the efficacy of the proposed approach, we present not only the final results of our model but also the results obtained with each individual component, for the two common domain-shift scenarios. We use the original VoxelPose model as a baseline; it was trained on source-domain data without taking domain adaptation into account. To assess the accuracy of the estimated 3D poses, we report the Percentage of Correct Parts (PCP3D) across all tests, and we use the Mean Per Joint Position Error (MPJPE) on the training set.
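For reference, the PCP3D metric can be sketched as follows (our own sketch of the standard definition, under which a limb counts as correct when the average endpoint error is within alpha times the ground-truth limb length; function and argument names are ours):

```python
import numpy as np

def pcp3d(pred, gt, limbs, alpha=0.5):
    """Percentage of Correctly estimated Parts in 3D (sketch).

    pred, gt: dicts mapping joint name -> (3,) position array
    limbs:    list of (joint_a, joint_b) endpoint pairs
    """
    correct = 0
    for a, b in limbs:
        limb_len = np.linalg.norm(gt[a] - gt[b])
        # mean 3D error of the limb's two endpoints
        err = 0.5 * (np.linalg.norm(pred[a] - gt[a])
                     + np.linalg.norm(pred[b] - gt[b]))
        if err <= alpha * limb_len:
            correct += 1
    return correct / len(limbs)
```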
We conducted our experiments on a computer equipped with an NVIDIA Tesla V100 GPU and implemented our algorithm in PyTorch. For multi-view image feature extraction, we employed a pose estimation model built upon ResNet-50 [11], with the backbone initialized from weights pre-trained on the COCO dataset. During training, we used the AdamW optimizer [60] with an initial learning rate of 0.001, a batch size of 4, and 20 training epochs. The 2D backbone and the remaining components of the model were trained jointly. Each training batch consisted of two images: one from the source domain and one from the target domain. For training on the Panoptic dataset, which serves as our source domain, we employed three camera views (03, 12, and 23).
5.2. Outdoor Environment Experimental Results
With the rapid advancement of 3D human pose estimation, the motion capture (mo-cap) system is emerging as an effective means to augment datasets. However, the Mocap system, originally designed for laboratory use, poses challenges for implementation in natural settings. A discernible visual disparity exists between laboratory data and outdoor scenarios, often leading to a performance gap between models trained in these different environments. Our initial experiment seeks to ascertain the applicability of the proposed method in this context.
Results: The results of the different methods are summarized in Table 2. Using the dropout domain adaptation component alone, we achieve a +5.9% performance gain over VoxelPose. Combining the dropout domain adaptation component with transferable parameter learning yields an improvement of 6.9%, validating our hypothesis regarding the necessity of reducing domain shift. This demonstrates that the components we propose effectively reduce the domain shift between a lab and an outdoor environment.
As illustrated in Figure 4, the qualitative results show that our method effectively suppresses false detections. Table 3 demonstrates a significant improvement in performance for joint locations with substantial variation, such as the lower arms and lower legs. We evaluated the generalization performance of our method against other domain adaptation techniques applied to this scenario; the results are detailed in Table 4. Figure 5 provides a qualitative comparison between our proposed method and other state-of-the-art algorithms, clearly demonstrating that our algorithm is more robust in cross-domain scenarios.
5.3. Indoor Social Interaction Environment Experimental Results
Despite considerable advancements in multi-person 3D human pose estimation, numerous challenging scenarios persist, including obscured keypoints, invisible keypoints, and crowded backgrounds that hinder accurate keypoint localization. For a generalized 3D human position estimation system, precise operation in diverse social interaction environments is vital. This subsection examines the efficacy of 3D pose estimation in the context of group interactions.
Results: Our findings, along with those of the other baselines, are reported in Table 5. Observations similar to the outdoor environment apply. With all components combined, our complete Domain Adaptive VoxelPose model outperforms the baseline VoxelPose model by 2.7%. Moreover, the improvement generalizes well across different actors, indicating that the proposed method also reduces domain discrepancies between individuals.
Figure 6 illustrates the qualitative results of our algorithm, highlighting its ability not only to avoid false detections but also to yield more realistic and natural pose estimates. In Table 6, notable improvements are evident in joint localization, particularly for the lower arms and lower legs. Table 4 compares the generalization performance of our method with other unsupervised domain adaptation techniques applied to this scenario. In Figure 7, we compare our proposed method with other advanced algorithms; our approach performs better in cross-domain scenarios, indicating that it is more robust.
In Table 7, we evaluate the generalization performance of several state-of-the-art algorithms, including MvP [6], Faster VoxelPose [31], and TesseTrack [57], as detailed in Section 2.2. Our results demonstrate that in cross-domain scenarios, our method outperforms these approaches while maintaining the model's performance on the original dataset.
5.4. Ablation Studies and Discussions
5.4.1. Domain Adaptation Component
The training process of the adversarial domain adaptation component is characterized as a zero-sum, non-cooperative contest between the base feature extractor and the domain discriminator. As the domain discriminator learns to distinguish between source and target domain features, the feature extractor simultaneously learns domain-invariant feature representation to confound the domain discriminator, thereby enhancing cross-domain adaptation capability.
Due to the single-adversarial method's limited distribution-alignment ability, the improvement is small and erratic. Mode collapse, caused by overfitting to the feedback of a single discriminator, manifests as convergence difficulties.
5.4.2. Dropout Domain Adaptation Component
The multi-adversarial domain adaptation method has been empirically validated as an effective technique for improving domain adaptation [41,64]. Our method dynamically alters the adversarial ensemble at each batch, pushing the generator to cultivate domain-invariant representations that can deceive the remaining discriminators. This dynamic alteration not only encourages the generator to master domain-invariant representations but also raises the probability of successfully misleading any residual discriminators. By aligning the feature representation across diverse feature dimensions, complementary features are learned; this alignment reduces domain discrepancies more efficiently with unlabeled target data, thereby bolstering the model's generalization. The efficacy of the dropout domain adaptation component is further illustrated in Figure 8.
Figure 8 illustrates the relationship between the dropout rate and the generalization ability of the feature representation. It emphasizes that selecting an excessively large or small value for the parameter d can complicate training. By pushing the base feature extractor to produce more generalizable feature representations, this type of dropout can be seen as a form of regularization. We found that employing any dropout rate within the range 0 < d ≤ 1 consistently outperformed a static ensemble of adversaries (d = 0); a moderate dropout rate often led to superior results, as previously noted in [42,65].
5.4.3. Transferable Parameter Learning Component
The central concept of the aforementioned components is to acquire transferable feature representations by confusing a domain discriminator in a two-player game, an approach that has led to state-of-the-art results in various visual tasks [12,41,64]. Deep unsupervised domain adaptation (UDA) research expects feature representations, and the insights derived from the source domain, to apply effectively to the target domain. However, while learning domain-invariant features and source hypotheses, unnecessary domain-specific information is inevitably incorporated, hindering generalization to the target domain. The lottery ticket hypothesis [66] reveals that only certain parameters are crucial for generalization; thus, by eliminating the adverse effects of domain-specific information prior to testing, deep UDA networks can become more resilient and adaptable.
VoxelPose utilizes a 3D CNN to estimate the 3D locations of body joints from the feature volume. This module contains a large number of domain-specific parameters. We believe that only a subset of "transferable parameters" is essential for learning domain-invariant information and generalizing well in UDA, while "untransferable parameters" tend to fit domain-specific information and rarely generalize.
In order to lessen the negative impacts of domain-specific knowledge during the learning process, we introduce Transferable Parameter Learning (TransPar) into our network, providing unique update rules for these two categories of parameters.
While Table 2 and Table 5 have shown the benefits of the introduced transferable parameter learning, Figure 9 demonstrates that medium-to-high ratios achieve relatively strong performance.
6. Conclusions and Outlook
In this research, we present the Domain Adaptive VoxelPose model, an efficient method for cross-domain multi-view multi-person 3D pose estimation. Without any extra labeled data, our method builds a robust pose estimator for a new domain. Our strategy is based on the state-of-the-art VoxelPose model. Guided by our theoretical analysis of cross-domain pose estimation, we introduced a domain adaptation component and a dropout mechanism into the network: a collection of dropout discriminators is used to learn a domain-robust model. We also introduced transferable parameter learning, which applies distinct update rules to two types of parameters. These components are meant to alleviate the performance drop caused by domain shift. Our methodology is validated on several domain-shift scenarios, where the adaptive method outperforms the baseline VoxelPose method, demonstrating the approach's efficiency for cross-domain, multi-view, multi-person 3D pose estimation. In summary, our approach offers industrial advantages by strengthening the robustness of multi-view, multi-person estimation models in real-world conditions, minimizing errors and false positives, increasing operational efficiency, enhancing safety, and alleviating the burden of manual data labeling. However, it is important to address some inherent limitations of the methodology that could affect its precision and scalability.
6.1. Limitations
Primarily, our model confronts quantization errors during the transition from 2D to 3D representations. This issue is particularly crucial, as it may introduce discernible inaccuracies that undermine the algorithm's efficacy. Secondly, the computational cost of training our algorithm escalates with an increasing number of views; this scalability issue limits its applicability in contexts requiring large-scale, high-throughput processing. Moreover, our model currently does not incorporate spatial-temporal data, despite the inherent correlation of human body postures over time. This limitation is especially pertinent when the model must discern closely intertwined joints or intricate inter-human interactions.
6.2. Future Directions
The limitations of the current model naturally guide our future research. One promising direction is the development of an end-to-end framework aiming to mitigate the cumulative effects of quantization errors. Furthermore, we recognize the urgent need for the seamless integration of spatial-temporal data to improve the model’s precision in capturing human interactions and movements over time.