A Compact and Powerful Single-Stage Network for Multi-Person Pose Estimation

Xiao, Yabo; Wang, Xiaojuan; He, Mingshu; Jin, Lei; Song, Mei; Zhao, Jian

doi:10.3390/electronics12040857

A Compact and Powerful Single-Stage Network for Multi-Person Pose Estimation †

by Yabo Xiao ¹, Xiaojuan Wang ¹,*, Mingshu He ¹, Lei Jin ¹, Mei Song ¹ and Jian Zhao ²,³

¹ School of Electronic Engineering, Beijing University of Posts and Telecommunications, No.10, Xitucheng Road, Haidian District, Beijing 100876, China
² Institute of North Electronic Equipment, Beijing 100191, China
³ Department of Mathematics and Theories, Peng Cheng Laboratory, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
† The extended version of AAAI conference paper AdaptivePose.

Electronics 2023, 12(4), 857; https://doi.org/10.3390/electronics12040857

Submission received: 4 January 2023 / Revised: 23 January 2023 / Accepted: 31 January 2023 / Published: 8 February 2023

(This article belongs to the Section Artificial Intelligence)

Abstract:

Multi-person pose estimation generally follows top-down and bottom-up paradigms. The top-down paradigm detects all human boxes and then performs single-person pose estimation on each ROI. The bottom-up paradigm locates identity-free keypoints and then groups them into individuals. Both of them use an extra stage to build the relationship between human instance and corresponding keypoints (e.g., human detection in a top-down manner or a grouping process in a bottom-up manner). The extra stage leads to a high computation cost and a redundant two-stage pipeline. To address the above issue, we introduce a fine-grained body representation method. Concretely, the human body is divided into several local parts and each part is represented by an adaptive point. The novel body representation is able to sufficiently encode the diverse pose information and effectively model the relationship between human instance and corresponding keypoints in a single-forward pass. With the proposed body representation, we further introduce a compact single-stage multi-person pose regression network, called AdaptivePose++, which is the extended version of AAAI-22 paper AdaptivePose. During inference, our proposed network only needs a single-step decode operation to estimate the multi-person pose without complex post-processes and refinements. Without any bells and whistles, we achieve the most competitive performance on representative 2D pose estimation benchmarks MS COCO and CrowdPose in terms of accuracy and speed. In particular, AdaptivePose++ outperforms the state-of-the-art SWAHR-W48 and CenterGroup-W48 by 3.2 AP and 1.4 AP on COCO mini-val with faster inference speed. Furthermore, the outstanding performance on 3D pose estimation datasets MuCo-3DHP and MuPoTS-3D further demonstrates its effectiveness and generalizability on 3D scenes.

Keywords:

fine-grained; adaptive point; single-stage regression; 2D/3D multi-person pose estimation

1. Introduction

Human pose estimation (HPE) [1,2,3,4,5,6,7] is a classical yet challenging task in the vision community [8,9,10]. It aims to locate a person’s keypoints in a natural image. HPE often serves as a necessary step for high-level vision tasks such as action recognition [11,12,13,14,15,16] and pose tracking [17]. Existing 2D/3D multi-person pose estimation methods can be categorized into top-down [18,19,20,21,22,23,24] and bottom-up [25,26,27,28,29,30] paradigms. The top-down strategy divides the problem into human detection and single-person pose estimation: each detected human region is cropped and normalized to locate the single-person keypoints. It achieves superior performance but suffers from a large computation cost and low efficiency due to the additional human detector. The bottom-up strategy formulates the task as keypoint localization followed by a grouping process. It first detects all person keypoints simultaneously on the full image, instead of on cropped single-person regions, and then assigns them to individuals. Although bottom-up methods are more efficient than top-down methods, the heuristic grouping process is still computationally complex and usually involves many hand-designed rules. The speed-accuracy curves are shown in Figure 1.

Both top-down and bottom-up methods generally use the conventional keypoint heatmap representation that models the human pose via absolute keypoint positions, as shown in Figure 2a, which separates the position of the human instance from that of its keypoints. Consequently, an extra stage is required to build up the connections. Recent research has tried to model the connections between the human body and corresponding keypoints in a single-forward process, but has run into obstacles that compromise performance. As shown in Figure 2b, CenterNet [32] represents the instance as a center point and encodes the relationship between the instance and its keypoints via center-to-joint offsets. Nevertheless, it achieves inferior performance since the limited center feature cannot encode the various poses effectively. As shown in Figure 2c, SPM [33] also represents the human instance via the limited feature of the root joint and further employs a fixed hierarchical structure along the skeleton path to build the relationship between the human instance and keypoints. Because the intermediate nodes are pre-defined and the supervision acts on the offsets between adjacent joints, errors accumulate along the fixed hierarchical path.

To address the aforementioned problems, in this work, we propose a novel body representation which is able to sufficiently encode various human poses and effectively build the relations between the instance and keypoints in a single-forward pass. Specifically, the human body is divided into several parts and each human part is represented as an adaptive part-related point. In this manner, we leverage the human center feature together with the features at several human-part related points to represent diverse human poses. The connections are built along the path from the center to the adaptive points and then to the keypoints, as shown in Figure 2d. Compared with previous representations, our representation brings two benefits: (1) the proposed point set representation introduces additional features at adaptive part-related points, which are able to encode more informative features for flexible poses compared with the limited center representation; (2) the adaptive part-related points serving as relay nodes can more effectively model the associations between human instances and corresponding keypoints in a single-forward pass.

With the adaptive point set representation, we propose an effective and efficient single-stage differentiable regression network, termed AdaptivePose++, which mainly consists of three novel components. First, we introduce the Part Perception Module to regress seven adaptive human-part related points for perceiving the corresponding seven human parts. Second, in contrast to using the limited feature with a fixed receptive field to predict the human center, we propose the Enhanced Center-aware Branch to conduct the receptive field adaptation by aggregating the features of adaptive human-part related points to perceive the center of various poses more precisely. Third, we propose the Two-hop Regression Branch together with the Skeleton-Aware Regression Loss for regressing keypoints. The adaptive human-part related points act as one-hop nodes to factorize the center-to-joint offsets dynamically. AdaptivePose++ eliminates the time-consuming post-processes and achieves the best speed–accuracy trade-offs.

A preliminary version of this work [1] was accepted in AAAI Conference on Artificial Intelligence (AAAI), 2022. We extend it in terms of five aspects: (1) We augment the content of the Abstract, Introduction, Related Work, Methodology and Experiments to cover sufficient details for clearer and more comprehensive presentation; (2) We improve the regression loss and add an additional loss term to learn the skeleton connections, which is helpful for crowd scenes; (3) We tune several hyper-parameters and improve the performance in a single forward pass, and add more ablation experiments with analyses to verify the superior positioning capacity of our framework. We further report the more comprehensive comparisons with competitive bottom-up counterparts and list more qualitative results; (4) We report the state-of-the-art results on CrowdPose [34], which contains an enormous number of crowd scenes; (5) We keep the 2D framework and add the depth estimation components, and further extend our method to a 3D multi-person pose estimation task—the promising results on MuPoTS-3D [35] verify the effectiveness and generalizability of our method for 3D scenes.

We summarize our main contributions as follows:

We propose representing human parts as points; thus, the human body can be represented via an adaptive point set including the center and several human-part related points. To the best of our knowledge, we are the first to present a fine-grained and adaptive body representation that sufficiently encodes the pose information and effectively builds up the relation between the human instance and keypoints in a single-forward pass.
Based on the novel representation, we exploit a compact single-stage differentiable network, called AdaptivePose++. Specifically, we introduce a novel Part Perception Module to perceive the human parts by regressing seven human-part related points. By manipulating human-part related points, we further propose the Enhanced Center-aware Branch to more precisely perceive the human center and the Two-hop Regression Branch together with the Skeleton-Aware Regression Loss to precisely regress the keypoints.
Our method significantly simplifies the pipeline of existing multi-person pose estimation methods. The effectiveness is demonstrated on both 2D and 3D pose estimation benchmarks. We achieve the best speed–accuracy trade-offs without complex refinements and post-processes. Furthermore, extended experiments on CrowdPose and MuPoTS-3D clearly verify the generalizability on crowd and 3D scenes.

2. Related Work

In this section, we review three parts related to our method including top-down methods, bottom-up methods and point-based methods.

Top-down Methods. Given an arbitrary RGB image, the top-down methods [4,19,20,21,22,23] first crop and resize the region of each detected person and then locate the single-person keypoints in each cropped area. Because the detected human areas are cropped and resized to a unified size, these methods achieve superior performance. For convolution-based methods, HRNet [21] maintains high-resolution features and repeatedly fuses multi-resolution features throughout the whole process to generate reliable high-resolution representations. Su et al. [23] proposed a Channel Shuffle Module and a Spatial, Channel-wise Attention Residual Bottleneck (SCARB) to drive the cross-channel information flow. For transformer-based networks, TokenPose [36] embeds each keypoint as a token to simultaneously learn constraint relationships across keypoints and visual representation from images. Other researchers [34,37] have tried to handle quantization errors and occlusion issues. However, the detection-first paradigm brings additional computational cost and forward time, so top-down methods are often not feasible for real-time systems with strict latency constraints.

Bottom-up Methods. In contrast to top-down methods, bottom-up methods [6,25,26,27,28,29,30] first localize keypoints of all human instances in the input image and then group them to the corresponding person. Bottom-up methods mainly concentrate on the effective grouping process or tackling the scale variation. For example, CMU-pose [6] proposes a non parametric representation, named Part Affinity Fields (PAFs), which encodes the location and orientation of limbs, to group the keypoints to individuals. AE [27] simultaneously outputs a keypoint heatmap and a tag map for each body joint, then assigns the keypoints with similar tags to individuals. HigherHRNet [26] generates a high-resolution feature pyramid with multi-resolution supervision and multi-resolution heatmap aggregation for learning scale-aware representations. However, one case worth noting is that the grouping process, serving as a post-process, is still computationally complex and redundant.

Point-based methods. In the deep learning era, the point-based methods [32,38,39,40,41,42] represent the instances by the grid points and have been applied to many tasks. They have drawn much attention as they are always simpler and more efficient than anchor-based representation [20,43,44,45]. CenterNet [32] leverages the bounding box center to encode the object information and regresses the other object properties, such as size, to predict a bounding box in parallel. SPM [33] represents the person via root joint and further presents a fixed hierarchical body representation to estimate human poses. Point-Set Anchors [41] propose to leverage a set of pre-defined points as a pose anchor to provide more informative features for regression. In contrast to previous methods that use center or pre-defined pose anchors to model human instances, we propose to represent human instances via an adaptive point set including the center and seven human-part related points as shown in Figure 3a. The novel representation is able to capture the diverse pose information and effectively model the connections between human instances and keypoints.

3. Methodology

First, we elaborate on the proposed body representation in Section 3.1. Then, Section 3.2 provides a detailed description of network architecture including the Part Perception Module and the Enhanced Center-aware Branch, as well as the Two-hop Regression Branch. Finally, we report the training and inference details in Section 3.3.

3.1. Body Representation

We present an adaptive point set representation that uses the center point together with several human-part related points to represent the human instance. The proposed representation introduces the adaptive human-part related points, whose features are used to encode the per-part information and can thus sufficiently capture the structural pose information. Meanwhile, they serve as the intermediate nodes to effectively model the relationship between the human instance and keypoints. In contrast to the fixed hierarchical representation in SPM [33], the adaptive part-related points are predicted dynamically from the center feature rather than placed at pre-defined locations, which avoids the accumulated error propagated along the fixed hierarchical path. Furthermore, instead of using the root feature to encode all keypoints, the features of adaptive points are also leveraged to encode keypoints of different parts respectively in our method.

Our body representation is built upon the pixel-wise keypoint regression framework, which estimates the candidate pose at each pixel. For a human instance, we manually divide the human body into seven parts (i.e., head, shoulder, left arm, right arm, hip, left leg and right leg) according to the inherent structure of the human body, as shown in Figure 3b. Each divided human part is a rigid structure; we represent it via an adaptive human-part related point, which is dynamically regressed from the human center. The process can be formulated as:

$$C_{inst} \to \{P_{head}, P_{sho}, P_{la}, P_{ra}, P_{hip}, P_{ll}, P_{rl}\}, \tag{1}$$

where $C_{inst}$ refers to the instance center and the others indicate the seven adaptive human-part related points corresponding to head, shoulder, left arm, right arm, hip, left leg and right leg. The human pose is represented in a fine-grained manner by the point set $\{C_{inst}, P_{head}, P_{sho}, P_{la}, P_{ra}, P_{hip}, P_{ll}, P_{rl}\}$. By introducing the adaptive human-part related points, the semantic and position information of different keypoints can be encoded by the specific human-part related point’s feature, instead of only using the limited center feature to encode all keypoints’ information. For convenience, $P_{part}$ is used to indicate the seven human-part related points $P_{head}, P_{sho}, P_{la}, P_{ra}, P_{hip}, P_{ll}, P_{rl}$. Then, the feature on each human-part related point is responsible for regressing the keypoints belonging to the corresponding part as follows:

$$P_{part} \to Joint. \tag{2}$$

The novel representation starts from the human center to the adaptive human-part related points, then to body keypoints, to build up the connection between the instance position and corresponding keypoint position in a single-forward pass without any non-differentiable process.

Based on the proposed representation, we deliver a single-stage differentiable solution to estimate multi-person poses. Concretely, the Part Perception Module is proposed to predict seven human-part related points. By using the adaptive human-part related points, the Enhanced Center-aware Branch is introduced to perceive the centers of humans with various deformations and scales. In parallel, the Two-hop Regression Branch is presented to regress keypoints via the adaptive part-related points.

3.2. Single-Stage Network

Overall Architecture. As shown in Figure 4, given an input image, we first extracted the semantic feature via the backbone, followed by three well-designed components to predict specific information. We leveraged the Part Perception Module to regress seven adaptive human-part related points from the assumed center for each human instance. Then, we conducted the receptive field adaptation in the Enhanced Center-aware Branch by aggregating the features of the adaptive points to predict the center heatmap. In addition, the Two-hop Regression Branch adopted the adaptive human-part related points as one-hop nodes to indirectly regress the offsets from the center to each keypoint. Our network followed the pixel-wise keypoint regression paradigm, which estimates the candidate pose at each pixel (called the center pixel) by predicting a 2K-dimensional offset vector from the center pixel to the K keypoints. We take a single pixel position as an example to describe the single-stage network.

Part Perception Module. With the proposed body representation, we artificially divided each human instance into seven local parts (i.e., head, shoulder, left arm, right arm, hip, left leg, right leg) according to the inherent structure of the human body. The Part Perception Module is proposed to perceive the human parts by predicting seven adaptive human-part related points. For each part, we automatically regressed an adaptive point from center pixel c without explicit supervision. Each adaptive part-related point was considered as encoding the informative features for the keypoints belonging to this part. As shown in Figure 5, we fed the regression-branch-specific feature $F_{r}$ into the 3 × 3 convolutional layer to regress 14-channel x-y offsets $\bar{off}_{1}$ from the center c to seven adaptive human-part related points on each pixel. These adaptive points acted as intermediate nodes, which were used for subsequent center positioning and keypoint regression.

Enhanced Center-aware Branch. In previous works [32,33,46], the center of human instances with various scales and deformation were predicted via the features with a fixed receptive field for each position. However, the pixel position which predicts the center of larger human body ought to have a larger receptive field compared with the position for predicting the center of a smaller human body. Thus, we propose a novel Enhanced Center-aware Branch which consists of a receptive field adaptation process to extract and aggregate the features of seven adaptive human-part related points for precise center localization.

As shown in Figure 5, we used the structure of 3 × 3 conv-relu to generate the branch-specific features. In the Enhanced Center-aware Branch, $F_{c}$ is a branch-specific feature with the fixed receptive field for each pixel position. We first used the 1 × 1 convolution to compress the 256-channel feature $F_{c}$ and obtain the 64-channel feature $F_{c0}$. Then, we extracted the feature vectors of the adaptive points via bilinear interpolation (named ’Warp’ in Figure 5) on $F_{c0}$. Taking the head part as an example, the bilinear interpolation can be formulated as $F_{c}^{head} = F_{c0}(c + \bar{off}_{1}^{head})$, where c indicates the center pixel and $\bar{off}_{1}^{head}$ is the offset from the center to the adaptive point of the head part. The extracted features $\{F_{c}^{head}, F_{c}^{sho}, F_{c}^{la}, F_{c}^{ra}, F_{c}^{hip}, F_{c}^{ll}, F_{c}^{rl}\}$ correspond to seven divided human parts (i.e., head, shoulder, left arm, right arm, hip, left leg, right leg). We concatenated them with $F_{c}$ along the channel dimension to generate the feature $F_{c}^{adapt}$. Since the predicted adaptive points located on the seven divided parts are relatively evenly distributed on the human body region, the process above can be regarded as the receptive field adaptation according to the human scale, and it also captures the various pose information sufficiently. Finally, we used $F_{c}^{adapt}$ with an adaptive receptive field to predict the 1-channel probability map for the center localization.
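The warp-and-concatenate step described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function names `bilinear_sample` and `adapt_center_feature` are hypothetical, feature maps are plain nested lists indexed as `feat[channel][row][col]`, and clamping out-of-range points to the border is an assumption that the text does not specify.

```python
import math

def bilinear_sample(feat, x, y):
    """Sample a multi-channel map feat[c][row][col] at a fractional (x, y).

    Coordinates are clamped to the map border (an assumption; the text
    does not say how out-of-range adaptive points are handled).
    """
    height, width = len(feat[0]), len(feat[0][0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    return [
        (1 - wy) * ((1 - wx) * ch[y0][x0] + wx * ch[y0][x1])
        + wy * ((1 - wx) * ch[y1][x0] + wx * ch[y1][x1])
        for ch in feat
    ]

def adapt_center_feature(f_c, f_c0, center, off1):
    """Build the F_c^adapt vector at one center pixel.

    f_c:    branch-specific feature (256 channels in the text)
    f_c0:   compressed feature (64 channels in the text)
    center: (cx, cy) integer pixel position
    off1:   seven (dx, dy) offsets from the center to the adaptive points
    """
    cx, cy = center
    adapted = [ch[cy][cx] for ch in f_c]        # F_c at the center pixel
    for dx, dy in off1:                         # head, sho, la, ra, hip, ll, rl
        adapted += bilinear_sample(f_c0, cx + dx, cy + dy)
    return adapted
```

If $F_{c}$ is concatenated at its full 256 channels, as the text reads, the resulting vector has 256 + 7 × 64 = 704 channels per pixel.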

We used the normalized Gaussian kernel $G_{(x, y)} = \exp\left(-\frac{(x - C_{x})^{2} + (y - C_{y})^{2}}{2\delta^{2}}\right)$ with mean $(C_{x}, C_{y})$ and adaptive variance $\delta$ calculated by human scale to generate the ground-truth center heatmap. Concretely, we calculated the Gaussian kernel radius by the size of an object by ensuring that a pair of points within the radius would generate a bounding box with at least IoU 0.7 with the ground-truth annotation. The adaptive variance is 1/3 of the radius. For the loss function of the Enhanced Center-aware Branch, we employed the pixel-wise focal loss in a penalty-reduced manner as follows:

$$loss_{ct} = -\frac{1}{N} \sum_{n=1}^{N} \begin{cases} (1 - \bar{P}_{c})^{\alpha} \ln(\bar{P}_{c}) & \text{if } P_{c} = 1 \\ (1 - P_{c})^{\beta} \, \bar{P}_{c}^{\alpha} \ln(1 - \bar{P}_{c}) & \text{if } P_{c} \neq 1, \end{cases} \tag{3}$$

where N refers to the number of positive samples, and $\bar{P}_{c}$ and $P_{c}$ indicate the predicted per-pixel confidence and the corresponding ground truth. $\alpha$ and $\beta$ are hyper-parameters set to 2 and 4, following CenterNet [32] and CornerNet [42]. In the above loss, only center pixels with peak 1.0 are positive samples and all others are negative samples.
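As a rough illustration of the center supervision, the sketch below builds a Gaussian ground-truth heatmap and evaluates the penalty-reduced focal loss on it. Several things are assumptions rather than details from the text: the function names are hypothetical, the radius is taken as a precomputed input (the IoU-0.7 rule is not reimplemented), overlapping Gaussians are merged with a maximum, and the loss carries the leading minus sign used by CenterNet and CornerNet so that its value is non-negative.

```python
import math

def center_heatmap(height, width, centers):
    """Ground-truth center heatmap as nested lists heat[y][x].

    centers: list of (cx, cy, radius) with integer cx, cy; the radius is
    assumed to be precomputed from the instance size.
    """
    heat = [[0.0] * width for _ in range(height)]
    for cx, cy, radius in centers:
        sigma = radius / 3.0                    # text: 1/3 of the radius
        for y in range(height):
            for x in range(width):
                g = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
                heat[y][x] = max(heat[y][x], g)  # merge rule is an assumption
    return heat

def center_focal_loss(pred, target, alpha=2, beta=4, eps=1e-12):
    """Penalty-reduced pixel-wise focal loss over one heatmap."""
    total, num_pos = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1 - eps)       # keep log() finite
            if t == 1.0:                        # positive: exact peak
                total += (1 - p) ** alpha * math.log(p)
                num_pos += 1
            else:                               # negative, down-weighted near peaks
                total += (1 - t) ** beta * p ** alpha * math.log(1 - p)
    return -total / max(num_pos, 1)
```

The $(1 - P_{c})^{\beta}$ factor is what makes the loss penalty-reduced: negatives close to a center, where the Gaussian target is near 1, contribute much less than negatives far from any person.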

Two-hop Regression Branch. We leveraged a two-hop regression method to predict the displacements instead of directly regressing the center-to-joint offsets. In this manner, the adaptive human-part related points predicted by Part Perception Module act as one-hop nodes to build up the connection between human instance and keypoints more effectively.

We first leveraged the structure of 3 × 3 conv-relu to generate a branch-specific feature, named $F_{r}$, in the Two-hop Regression Branch. Then, we fed the 256-channel $F_{r}$ into the deformable convolutional layer [47,48] to generate the 64-channel feature $F_{p}$. Then we extracted the features at the adaptive part-related points via the bilinear interpolation operation (called ’Warp’ in Figure 5) on $F_{p}$ for corresponding keypoint regression. We denoted the extracted features as $\{F_{p}^{head}, F_{p}^{sho}, F_{p}^{la}, F_{p}^{ra}, F_{p}^{hip}, F_{p}^{ll}, F_{p}^{rl}\}$, corresponding to seven divided parts (i.e., head, shoulder, left arm, right arm, hip, left leg and right leg). For the extracted feature $F_{p}^{head}$, the above process can be formulated as $F_{p}^{head} = F_{p}(c + \bar{off}_{1}^{head})$.

These features encode the keypoint information of different human parts respectively, instead of using center features to regress all keypoints. Afterward, the extracted features were responsible for locating the keypoints belonging to the corresponding part by regressing the offsets $\bar{off}_{2}$ from the adaptive part-related point to specific keypoints via different 1 × 1 convolutional layers. For example, $F_{p}^{head}$ was used to localize the eyes, ears and nose, $F_{p}^{la}$ was used to localize the left wrist and left elbow, and $F_{p}^{ll}$ was used to localize the left knee and left ankle.

The Two-hop Regression Branch outputs a 34-channel tensor corresponding to x-y offsets $\bar{off}$ from the center to 17 keypoints, which is predicted in the two-hop manner as follows:

$$\bar{off} = \bar{off}_{1} + \bar{off}_{2}, \tag{4}$$

where $\bar{off}_{1}$ and $\bar{off}_{2}$ respectively indicate the offset from the center to the adaptive human-part related point (one-hop offset in Figure 5) and the offset from the human-part related point to specific keypoints (second-hop offset in Figure 5). The predicted offsets $\bar{off}$ are supervised by vanilla L1 loss, and the supervision only acts at positive keypoint locations; the other background locations are ignored. Furthermore, we added an additional loss term to learn the rigid bone connection between adjacent keypoints, termed Skeleton-Aware Regression Loss. In particular, as shown in Figure 3c, we denoted a bone connection set as $\mathcal{B} = \{B^{i}\}_{i=1}^{I}$, where I is the number of bone connections in the pre-defined set $\mathcal{B}$. Each bone is formulated as $B = P_{adjacent(joint)} - P_{joint}$, in which $P$ is the joint position and the function $adjacent(*)$ returns the adjacent joints of the input joint. The total regression loss is formulated as follows:

$$loss_{off} = \frac{1}{2K \cdot N} \sum_{n=1}^{K} \left| \bar{off}^{n} - off_{gt}^{n} \right| + \frac{1}{2I \cdot N} \sum_{i=1}^{I} \left| \bar{B}^{i} - B_{gt}^{i} \right|, \tag{5}$$

where $off_{gt}^{n}$ and $B_{gt}^{i}$ are the ground truth center-to-keypoint offset and bone connection. N indicates the number of human instances. K is the number of valid keypoint locations. We find that employing the supervision on the bone connections can bring 0.3 AP improvements on CrowdPose [34].
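The two-hop composition of Equation (4) and the two terms of Equation (5) can be traced on a toy example. The sketch below is an illustration only: the function names, the dictionary layout and the joint names are hypothetical, and $|\cdot|$ is read as the sum of the absolute x and y differences. Because every joint position is the center plus its offset, the center cancels in $P_{adjacent(joint)} - P_{joint}$, so bones are computed from center-to-joint offsets directly.

```python
def two_hop_offsets(off1, off2, part_of_joint):
    """Center-to-joint offsets as off = off1 + off2 (Equation (4)).

    off1: {part: (dx, dy)}   center -> adaptive part-related point
    off2: {joint: (dx, dy)}  adaptive point -> joint
    part_of_joint: {joint: part} assignment of joints to parts
    """
    return {
        joint: (off1[part_of_joint[joint]][0] + off2[joint][0],
                off1[part_of_joint[joint]][1] + off2[joint][1])
        for joint in off2
    }

def l1(a, b):
    """Sum of absolute x and y differences between two 2D vectors."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def bone(offsets, pair):
    """Bone vector between two adjacent joints, from center-to-joint offsets."""
    a, b = pair
    return (offsets[b][0] - offsets[a][0], offsets[b][1] - offsets[a][1])

def regression_loss(pred, gt, bones, num_instances=1):
    """Offset L1 term plus skeleton-aware bone term, normalized as in Equation (5)."""
    k, i = len(gt), len(bones)
    joint_term = sum(l1(pred[j], gt[j]) for j in gt) / (2 * k * num_instances)
    bone_term = sum(l1(bone(pred, b), bone(gt, b)) for b in bones) / (2 * i * num_instances)
    return joint_term + bone_term

# Toy left-arm example: one adaptive point, two joints, one bone.
off1 = {"la": (-10.0, 5.0)}
off2 = {"left_elbow": (-2.0, 8.0), "left_wrist": (-3.0, 20.0)}
pred = two_hop_offsets(off1, off2, {"left_elbow": "la", "left_wrist": "la"})
gt = {"left_elbow": (-12.0, 13.0), "left_wrist": (-13.0, 27.0)}
loss = regression_loss(pred, gt, [("left_elbow", "left_wrist")])
```

In this toy case the wrist is off by two pixels vertically, which is penalized once by the offset term and once more by the bone term, since the predicted forearm is shorter than the ground-truth one.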

3.3. Training and Inference Details

During training, we employed an auxiliary training objective to learn keypoint heatmap representation, which enabled the feature to maintain more human structural geometric information. In particular, we added a parallel branch to output a 17-channel heatmap corresponding to 17 keypoints and applied a Gaussian kernel with adaptive variance to generate a ground truth keypoint heatmap. We denote this training objective as $loss_{hm}$, which is similar to Equation (3). The only difference is that N refers to the number of positive keypoints. The auxiliary branch was only used for the training process and was removed in the inference process.

Our total training loss for the multi-task training procedure is formulated as:

$$loss_{total} = loss_{ct} + loss_{off} + loss_{hm}. \tag{6}$$

During inference, the Enhanced Center-aware Branch outputs the center heatmap that indicates whether the pixel position is at the center or not. The Two-hop Regression Branch outputs the offsets from the center to each keypoint. We first picked the human centers by applying a 5 × 5 max-pooling kernel to the center heatmap and keeping 20 candidates, and then retrieved the corresponding offsets $(\delta_{x}^{i}, \delta_{y}^{i})$ to form a human pose without any extra tricks. Specifically, we denoted the predicted center as $(C_{x}, C_{y})$. The above decode process is formulated as follows:

$$(K_{x}^{i}, K_{y}^{i}) = (C_{x}, C_{y}) + (\delta_{x}^{i}, \delta_{y}^{i}), \tag{7}$$

where $(K_{x}^{i}, K_{y}^{i})$ is the coordinate of the i-th keypoint. In contrast to DEKR [49], which further uses the average of the extracted heat values at each regressed keypoint to modulate the center heat values, we only leveraged the center heat values as the final pose score for fast inference.
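The single-step decoding can be sketched as below, assuming the center heatmap and the per-pixel offsets are available as nested Python lists. The function name `decode_poses` and the `min_score` filter are hypothetical additions; the filter only discards flat zero-valued background in the toy input and is not described in the text.

```python
def decode_poses(center_heat, offsets, kernel=5, max_people=20, min_score=0.0):
    """Pick local maxima of the center heatmap and add the regressed offsets.

    center_heat: H x W nested list of center scores
    offsets:     offsets[y][x] is a list of K (dx, dy) center-to-joint offsets
    Returns a list of (score, keypoints), highest score first.
    """
    height, width = len(center_heat), len(center_heat[0])
    half = kernel // 2
    peaks = []
    for y in range(height):
        for x in range(width):
            score = center_heat[y][x]
            if score <= min_score:              # assumption: skip empty background
                continue
            window = [
                center_heat[yy][xx]
                for yy in range(max(0, y - half), min(height, y + half + 1))
                for xx in range(max(0, x - half), min(width, x + half + 1))
            ]
            if score == max(window):            # pixel survives kernel x kernel max-pooling
                peaks.append((score, x, y))
    peaks.sort(reverse=True)
    poses = []
    for score, cx, cy in peaks[:max_people]:    # keep at most 20 candidates
        keypoints = [(cx + dx, cy + dy) for dx, dy in offsets[cy][cx]]  # Equation (7)
        poses.append((score, keypoints))
    return poses
```

No grouping or refinement step follows: each surviving center yields one complete pose, and the center score is used directly as the pose score.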

4. Experiments and Analysis

In this section, we first briefly introduce the 2D pose estimation datasets, evaluation metric, data augmentation and implementation details. Next, we conduct comprehensive ablation studies to reveal the effectiveness of each component in Section 4.2. Then, we compare our proposed method with the previous methods on MS COCO [31] in Section 4.3 and CrowdPose [34] in Section 4.4. Finally, we extend our network to 3D multi-person pose estimation and verify the generalizability on 3D MuCo-3DHP [35] and MuPoTS-3D [35] datasets.

4.1. Experimental Setup

Dataset. We evaluated our method on two 2D multi-person pose estimation benchmarks including MS COCO [31] and CrowdPose [34]. The MS COCO dataset [31] is a large-scale pose estimation benchmark consisting of over 200,000 images for more than 250,000 human instances annotated with 17 keypoints. It is divided into train, validation, and test sets. We trained our model on the COCO train2017 dataset. The comprehensive experimental results are reported on the COCO mini-val set with 5000 images and on the test-dev2017 set with 20,000 images. The CrowdPose [34] dataset consists of 20,000 images for 80,000 labelled persons. The training, validation and test sets are partitioned in the proportions of 5:1:4. It contains more challenging images, which are used to verify the robustness for crowded scenes. We followed previous works [2,3,26,49], trained our models on the train and validation sets and report the results on the test set.

Evaluation Metric. We leveraged average precision and average recall based on different Object Keypoint Similarity (OKS) [31] thresholds to evaluate our keypoint detection performance on both MS COCO and CrowdPose datasets. OKS is formulated as follows:

$$OKS = \frac{\sum_{i} \exp\left(\frac{-d_{i}^{2}}{2 s^{2} k_{i}^{2}}\right) \delta(\upsilon_{i} > 0)}{\sum_{i} \delta(\upsilon_{i} > 0)}, \tag{8}$$

where $d_{i}$ is the Euclidean distance between the predicted keypoint and the corresponding ground-truth, $\upsilon_{i}$ represents the visibility tag of the keypoint, $\delta(\cdot)$ is an indicator function that equals 1 when $\upsilon_{i} > 0$ and 0 otherwise, s refers to the instance scale, and $k_{i}$ is a constant that controls the falloff for each specific keypoint. In addition, for the COCO dataset, we report AP$_{M}$ and AP$_{L}$, which correspond to AP over medium and large-sized instances, respectively. For CrowdPose, we report AP$_{E}$, AP$_{M}$ and AP$_{H}$, which indicate AP scores over easy, medium and hard instances, according to dataset annotations.
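Equation (8) translates almost literally into code. The sketch below follows the formula as written here, with the instance scale s and the per-keypoint constants $k_{i}$ passed in by the caller. The function name `oks` and the example values are hypothetical, and returning 0 for an instance with no labelled keypoint is an assumption.

```python
import math

def oks(pred, gt, visibility, scale, kappas):
    """Object Keypoint Similarity following Equation (8).

    pred, gt:   lists of (x, y) keypoints
    visibility: visibility tags v_i; a keypoint counts when v_i > 0
    scale:      instance scale s
    kappas:     per-keypoint falloff constants k_i
    """
    total, labelled = 0.0, 0
    for (px, py), (gx, gy), v, k in zip(pred, gt, visibility, kappas):
        if v > 0:
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            total += math.exp(-d2 / (2 * scale ** 2 * k ** 2))
            labelled += 1
    return total / labelled if labelled else 0.0

# Three keypoints; the third is unlabelled and therefore ignored.
gt = [(0.0, 0.0), (5.0, 5.0), (9.0, 9.0)]
pred = [(1.0, 1.0), (5.0, 5.0), (100.0, 100.0)]
score = oks(pred, gt, [2, 1, 0], scale=10.0, kappas=[0.1, 0.1, 0.1])
```

A prediction counts as correct at a given threshold when its OKS with the matched ground truth exceeds that threshold, which is how the AP and AR values at different OKS thresholds are obtained.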

Data Augmentation. During training, we used random flip, random rotation, random scaling and color jitter to augment training samples. The flip probability was set to 0.5, the rotation range was (−30, 30) and the scale range was (0.6, 1.3). During the training process, each input image was cropped according to the random center and random scale and then resized to 512 × 512 / 640 × 640 / 800 × 800 pixels for different backbones.
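The augmentation settings above can be collected into a small parameter sampler. This is a sketch under assumptions: the text gives only the ranges, so drawing rotation and scale uniformly is a guess, the function name `sample_augmentation` is hypothetical, and color jitter is omitted because its parameters are not stated.

```python
import random

def sample_augmentation(rng=random):
    """Draw one set of augmentation parameters for a training sample."""
    return {
        "flip": rng.random() < 0.5,             # flip probability 0.5
        "rotation": rng.uniform(-30, 30),       # rotation range (-30, 30)
        "scale": rng.uniform(0.6, 1.3),         # scale range (0.6, 1.3)
    }
```

Passing a seeded `random.Random` instance as `rng` makes the sampled parameters reproducible.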

Implementation Details. We trained our proposed model via the Adam [50] optimizer with an initial learning rate of $2.5 \times 10^{-4}$ on a workstation with eight Tesla V100 GPUs. The learning rate was dropped to $2.5 \times 10^{-5}$ and $2.5 \times 10^{-6}$ at the 230th and 260th epochs, respectively. The total training procedure was terminated at the 280th epoch (2× training scheme). All code was implemented in PyTorch. DLA-34 (19.7M) [51] and HRNet [21] were adopted to achieve the trade-offs between the accuracy and efficiency. The batch size was set to 128 for DLA-34 and HRNet-W32 and 64 for HRNet-W48 due to the limited GPU memory. During inference, we kept the aspect ratio of the raw image and resized the short side of the images to 512/640/800 pixels accordingly. The output size was 1/4 of the input resolution. We further used flip and multi-scale image pyramids to boost the performance. It is worth highlighting that the flip was only applied to the center heatmap predicted by the Enhanced Center-aware Branch. All training and inference setups were shared between MS COCO [31] and CrowdPose [34] datasets.
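The step schedule for the 2× training scheme can be written as a small function. It is a sketch, not the authors' training code: the name `learning_rate` is hypothetical, and it assumes epochs are counted from 1 with each drop taking effect at the start of the stated epoch.

```python
def learning_rate(epoch, base_lr=2.5e-4, milestones=(230, 260), gamma=0.1):
    """Learning rate for a given epoch under the step schedule in the text.

    The rate starts at 2.5e-4 and is multiplied by 0.1 at the 230th and
    260th epochs; training stops after the 280th epoch.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops
```

Under these assumptions, epochs 1 to 229 run at 2.5e-4, epochs 230 to 259 at 2.5e-5, and epochs 260 to 280 at 2.5e-6.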

4.2. Ablation Experiments

In this subsection, we conducted comprehensive ablation experiments to analyze each component as well as the whole regression model. All ablation studies adopted DLA-34 as the backbone and used the 1× training schedule (140 epochs), with single-scale testing without horizontal flip on the COCO mini-val set.

Analysis of Part Perception Module. Based on the adaptive point set representation, the Part Perception Module was proposed to regress seven adaptive human-part related points, which were used for the subsequent prediction of the Enhanced Center-aware Branch and the Two-hop Regression Branch. Figure 6 shows the predicted human center and seven adaptive human-part related points on the human instances with various scales and poses.

As reported in Table 1, we studied three designs for the Part Perception Module: (1) a 1 × 1 convolutional layer; (2) a 3 × 3 convolutional layer with 7 groups, in which each group is responsible for one human part; and (3) a vanilla 3 × 3 convolutional layer. The vanilla 3 × 3 convolution achieves a slightly better result (64.6 AP versus 64.5 AP for the other two designs), so we selected it for the follow-up experiments.

Analysis of Enhanced Center-aware Branch. In the Enhanced Center-aware Branch, we conducted the receptive field adaptation operation by aggregating the feature vectors of seven adaptive human-part related points to more precisely position the human center.

We conducted controlled experiments to explore the effect of the receptive field adaptation (RFA) process in the Enhanced Center-aware Branch. Compared with using a feature with a fixed receptive field to position the human center, receptive field adaptation obtained a 1.4% AP improvement (Expt. 3 versus Expt. 4 in Table 2). We consider that receptive field adaptation is capable of enhancing the center feature representation and dynamically adjusting its receptive field accordingly.

Analysis of Two-hop Regression Branch. In the Two-hop Regression Branch, we adopted the adaptive human-part related points as intermediate nodes to localize the keypoints along the center-to-adaptive points-to-keypoints path.

As reported in Table 2 (Expt. 2 versus Expt. 4), two-hop regression achieves a 4.5% AP improvement compared with directly regressing the displacements from the center to each joint. The results indicate that the feature embedding of an adaptive point encodes the content and position information of its corresponding keypoints better than the limited center feature embedding. Thus, the adaptive points, serving as intermediate nodes, effectively factorize the center-to-joint offsets and improve regression performance.
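The center-to-adaptive points-to-keypoints path can be illustrated with a small decoding sketch. The function name, the data layout, and the part-to-keypoint assignment used in the example are hypothetical; the only idea taken from the text is that each keypoint is reached by adding a first-hop offset (center to adaptive point) and a second-hop offset (adaptive point to keypoint).

```python
def decode_two_hop(center, first_hop, second_hop, part_of_keypoint):
    """Decode keypoints along the center -> adaptive point -> keypoint path.

    center:            (x, y) of the human center
    first_hop:         one (dx, dy) offset per human part, from the center
    second_hop:        one (dx, dy) offset per keypoint, from its part's adaptive point
    part_of_keypoint:  index of the part each keypoint belongs to (illustrative)
    """
    # First hop: locate the adaptive point of every part.
    adaptive_points = [(center[0] + dx, center[1] + dy) for dx, dy in first_hop]
    # Second hop: locate every keypoint from the adaptive point of its part.
    keypoints = []
    for (dx, dy), part in zip(second_hop, part_of_keypoint):
        px, py = adaptive_points[part]
        keypoints.append((px + dx, py + dy))
    return adaptive_points, keypoints

# Toy example with two parts and three keypoints.
points, joints = decode_two_hop(
    center=(100, 50),
    first_hop=[(0, -20), (10, 5)],
    second_hop=[(0, -10), (-5, 3), (4, 12)],
    part_of_keypoint=[0, 0, 1],
)
```

Because each second-hop offset starts from a point that already lies on the relevant body part, it only has to cover a short residual distance, in contrast to a single long center-to-joint offset.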

Furthermore, we analyzed the localization error of direct center-to-joint regression, hierarchical regression in SPM [33] and our adaptive two-hop regression (with an auxiliary keypoint heatmap loss applied to all three regression schemes) via the coco-analyze tool [52]. The localization error consists of four error types: (1) Jitter is a small localization error; (2) Miss refers to a large localization error; (3) Inversion denotes confusion between keypoints within a human instance; (4) Swap indicates confusion between keypoints across different human bodies. The results are shown in Table 3. Compared with direct center-to-joint regression and hierarchical regression, our adaptive two-hop regression reduces the Jitter error by 4.5 and 1.4, respectively, and reduces the Miss error by 1.9 and 0.5, respectively, which indicates that our regression method clearly improves localization quality over the other regression methods.

Analysis of auxiliary loss. We added a parallel branch to learn keypoint heatmap representation, which was only used for auxiliary loss computation in the training stage.

To study the effect of the auxiliary loss, we employed the auxiliary heatmap loss to help coordinate regression, which brings a 1.6% AP improvement. This experimentally indicates that learning the keypoint heatmap helps retain more structural geometric information, which improves regression performance.

Analysis of Overall Architecture. We studied the inherent relationship between the Enhanced Center-aware Branch and the Two-hop Regression Branch, which are correlated by the adaptive human-part related points. As shown in Expt. 1 and 2 of Table 2, without two-hop regression, receptive field adaptation achieves a 1.0% AP improvement. As reported in Expt. 3 and 4 of Table 2, with two-hop regression, receptive field adaptation achieves a larger 1.4% AP improvement. We consider that ${loss}_{off}$ enables the adaptive points to scatter over the divided human parts, so that receptive field adaptation is able to perceive the human center more precisely. Meanwhile, as reported in Expt. 1 and 3 of Table 2, without receptive field adaptation, two-hop regression brings a 4.1% AP improvement. With receptive field adaptation, two-hop regression brings a 4.5% AP improvement, as shown in Expt. 2 and 4 of Table 2. This experimentally indicates that ${loss}_{ct}$ drives the adaptive points to locate on semantically significant regions, so that two-hop regression is better able to locate the keypoints.
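The four gains quoted above can be checked for internal consistency with a few lines of arithmetic. The dictionary keys are illustrative labels for the experiments of Table 2, and only the AP differences stated in the text are used (no absolute AP values are assumed).

```python
# AP gains quoted in the text, keyed by (baseline experiment, improved experiment).
gains = {
    ("expt1", "expt2"): 1.0,  # add RFA, without two-hop regression
    ("expt3", "expt4"): 1.4,  # add RFA, with two-hop regression
    ("expt1", "expt3"): 4.1,  # add two-hop regression, without RFA
    ("expt2", "expt4"): 4.5,  # add two-hop regression, with RFA
}

# Total gain from Expt. 1 to Expt. 4 along the two possible orders of adding components.
via_rfa_first = round(gains[("expt1", "expt2")] + gains[("expt2", "expt4")], 1)
via_two_hop_first = round(gains[("expt1", "expt3")] + gains[("expt3", "expt4")], 1)

# Extra gain each component obtains when the other one is already present.
rfa_synergy = round(gains[("expt3", "expt4")] - gains[("expt1", "expt2")], 1)
two_hop_synergy = round(gains[("expt2", "expt4")] - gains[("expt1", "expt3")], 1)
```

Both orders add up to the same total of 5.5 AP, and each component gains an extra 0.4 AP when the other is enabled, which matches the claim that the two branches reinforce each other through the shared adaptive points.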

Analysis of Heatmap Refinement. CenterNet [32] performs a post-processing step that searches for the closest peaks (confidence > 0.1) on the keypoint heatmap to replace the initial regressed results. Since the positions of confidence peaks on the keypoint heatmap are integers, sub-pixel offsets are predicted in parallel to recover the discretization errors. In this manner, the regressed predictions serve as grouping cues for assigning the keypoints detected from the heatmap to individuals. We refer to this process as heatmap refinement.
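The refinement step described above can be sketched as follows for a single keypoint type. This is a simplified illustration and not CenterNet's actual implementation: the function names are hypothetical, peaks are assumed to be 3 × 3 local maxima, and any additional constraints used in practice (for example, restricting matches to a person's bounding box) are omitted.

```python
import math

def find_peaks(heatmap, threshold=0.1):
    """Return (x, y, score) for cells above threshold that are 3x3 local maxima."""
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            v = heatmap[y][x]
            if v <= threshold:
                continue
            neighbours = [heatmap[j][i]
                          for j in range(max(0, y - 1), min(h, y + 2))
                          for i in range(max(0, x - 1), min(w, x + 2))]
            if v >= max(neighbours):
                peaks.append((x, y, v))
    return peaks

def refine(regressed, heatmap, offsets, threshold=0.1):
    """Replace a regressed (x, y) by the closest heatmap peak plus its sub-pixel offset."""
    peaks = find_peaks(heatmap, threshold)
    if not peaks:
        return regressed  # nothing to snap to; keep the regression result
    px, py, _ = min(peaks, key=lambda p: math.hypot(p[0] - regressed[0],
                                                    p[1] - regressed[1]))
    dx, dy = offsets[py][px]  # sub-pixel offset predicted at the integer peak
    return (px + dx, py + dy)

# Toy 4x4 heatmap with two peaks and their sub-pixel offsets.
heatmap = [[0.0, 0.0, 0.0, 0.0],
           [0.0, 0.9, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.6]]
offsets = [[(0.0, 0.0)] * 4 for _ in range(4)]
offsets[1][1] = (0.25, -0.5)
offsets[3][3] = (0.5, 0.5)

refined_a = refine((2.8, 2.9), heatmap, offsets)
refined_b = refine((0.6, 1.4), heatmap, offsets)
```

In this scheme the regressed position only decides which peak belongs to which person, while the final coordinate comes from the heatmap.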

As reported in CenterNet [32], heatmap refinement brings a large improvement of 6.2% AP to the initial regression result (from 51.7% AP to 57.9% AP). To validate the regression performance of our method, we also applied heatmap refinement to our regression result. For convenience, the two-hop regression result and the heatmap refinement result are denoted as Ours-reg and Ours-heat, respectively. As shown in Table 4, Ours-reg obtained slightly better performance than Ours-heat (64.6% AP versus 64.4% AP), which indicates that our regression method already has better positioning capacity than the heatmap-based refinement.

4.3. Results on MS COCO Dataset

We report the comparisons with the previous state-of-the-art methods on the COCO mini-val and test-dev sets. All experimental results were obtained via a 2× training schedule.

Mini-val Results. Table 5 compares our method with the most competitive recent bottom-up methods without any test-time augmentation (single-scale testing without flip), which reveals the raw keypoint positioning capability. Adopting the smaller DLA-34 as a backbone with the same 512-pixel input resolution, our method achieves 65.8 AP, which outperforms the competitive bottom-up HrHRNet-W32 [26], SWAHR-W32 [2] and DEKR-W32 [49] by 2.2 AP, 1.1 AP and 2.4 AP, respectively, with a much faster inference speed. Using HRNet-W32, we outperform the state-of-the-art CenterGroup-W32 [3] by 1.1 AP. Adopting DLA-34 with a 640-pixel input resolution, our network achieves performance equal to that of HrHRNet-W48, SWAHR-W48 and DEKR-W48 with only 1/3 of the parameters. Furthermore, we obtain 70.5 AP using HRNet-W48 with a 640-pixel input resolution, a 3.4 AP gain over the state-of-the-art regression-based method DEKR-W48. It is noteworthy that DEKR leverages the keypoint heat-value to modulate the center heat-value and further employs an extra rescoring network in post-processing, while we directly adopt the center heat-value as the final pose score for fast inference. We also surpass the state-of-the-art bottom-up method CenterGroup-W48 (which adopts HigherHRNet-W48 as a backbone and introduces a transformer encoder to conduct the grouping process) by 1.4 AP using HRNet-W48 without extra deconvolution layers. The above results indicate that our proposed body representation models the relationship between human instances and keypoints more effectively than previous heuristic or learnable grouping methods in terms of accuracy and speed. Figure 7 shows the predicted skeletons on the COCO mini-val set.

Test-dev Results. We further list comprehensive comparisons with existing bottom-up and single-stage regression-based methods on the COCO test-dev set. As reported in Table 6, our method achieves a state-of-the-art 71.4 AP, which outperforms the widely used bottom-up methods CMU-pose [6] and AE [27] by a large margin with a faster inference speed. Compared with previous single-stage regression-based methods, our method surpasses SPM [33] (refined by a well-trained single-person pose estimation model) by 4.5 AP without any refinement, and outperforms DirectPose [46] by a large margin of 5.1 AP.

4.4. Results on CrowdPose Dataset

We compared our method with the previous state-of-the-art methods on the CrowdPose dataset, which consists of more challenging crowd scenes. Following existing methods [2,26,49], we trained our models on the train and val sets and evaluated the performance on the test set.

The comparisons with previous state-of-the-art methods are shown in Table 7. Generally, the top-down paradigm achieves better performance than the bottom-up and single-stage paradigms because each person is cropped before single-person pose estimation is performed. Nevertheless, our single-stage method achieves better performance than most widely used top-down methods on CrowdPose. We attribute this to the fact that, in crowded scenes where persons usually overlap heavily, a detected single-person region often contains the bodies of other persons.

Furthermore, among the bottom-up methods, we outperform CMU-pose [6] by a large margin. Compared with HigherHRNet [26], which uses HRNet-W48 and a higher output resolution, our method achieves equal performance using only the smaller HRNet-W32 without multi-scale heatmap aggregation. Our method with HRNet-W48 improves on HigherHRNet-W48 by 2.2 AP (1.6 AP) for single-scale (multi-scale) testing. Compared with the competitive DEKR [49], we achieve a 1.2 AP gain without any bells and whistles in the inference stage (e.g., only using the center score as the final pose score). Figure 8 shows the predicted skeletons on CrowdPose. These results indicate that our network has a stronger positioning capability in crowded scenes.

4.5. AdaptivePose for 3D Pose Estimation

We further extended AdaptivePose to 3D multi-person pose estimation [54] to verify its generality.

Methodology. To demonstrate the effectiveness of our proposed body representation and single-stage network in 3D scenes in a simple way, we used pixel-wise depth maps to predict the depth information of all human bodies. Based on the 2D network, we added two parallel branches. One estimates a 1-channel absolute root-depth map in a camera-centered coordinate system. For its target, the map values in the region of radius 4 centered on the root joint equal the absolute depth of that root joint. The other branch outputs a 14-channel relative depth map of the other keypoints with respect to their root joint (MuCo-3DHP and MuPoTS-3D only provide annotations for 15 keypoints). For its target, the map values in the region of radius 4 centered on the root joint equal the depths of those keypoints relative to the root joint. Because the visual perception of object scale and depth depends on the size of the field of view (FoV), we followed SMAP [55] and normalized the depth by the size of the FoV for all training samples as $Z = z \cdot w / f$, where $Z$ is the normalized depth, $z$ is the original depth, and $f$ and $w$ are the focal length and the image width, respectively.
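The FoV normalization and its inverse can be written directly from the formula above. The function names and the example values (a depth in millimetres and a focal length in pixels) are assumptions made for illustration.

```python
def normalize_depth(z, focal_length, image_width):
    """FoV-normalized depth Z = z * w / f."""
    return z * image_width / focal_length

def denormalize_depth(Z, focal_length, image_width):
    """Invert the normalization to recover the metric depth z = Z * f / w."""
    return Z * focal_length / image_width

# Illustrative values: depth of 3000 (e.g. mm), focal length 1500 px, image width 832 px.
Z_norm = normalize_depth(3000.0, focal_length=1500.0, image_width=832)
z_back = denormalize_depth(Z_norm, focal_length=1500.0, image_width=832)
```

Dividing by the focal length removes the dependence of apparent size on the camera, so that images taken with different cameras share a comparable depth target.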

During inference, first, we formed the 2D human pose as described in Section 3.3. Second, we extracted the absolute root depth and corresponding relative depth of the other keypoints via bilinear interpolation at the root position of each pose candidate. The predicted depth values can be converted back to metric values during inference. Finally, according to the 2D keypoint coordinates and corresponding depth, the 3D pose can be reconstructed through the perspective camera model:

$${[X, Y, Z]}^{T} = Z K^{-1} {[x, y, 1]}^{T}, \qquad (9)$$

where $[X, Y, Z]$ refers to the 3D coordinates in a camera-centered coordinate system (so $Z$ here is the metric depth, not the FoV-normalized value), $[x, y]$ is the 2D coordinate of a keypoint in a pixel coordinate system, and $K$ is the camera intrinsic matrix.
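Equation (9) can be illustrated with a short back-projection sketch. It assumes a zero-skew intrinsic matrix K with focal lengths fx, fy and principal point (cx, cy), for which K^{-1}[x, y, 1]^T has the closed form used below; the function name and the numeric values are hypothetical.

```python
def back_project(x, y, depth, fx, fy, cx, cy):
    """Apply [X, Y, Z]^T = Z * K^{-1} [x, y, 1]^T for a zero-skew intrinsic matrix.

    With K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]], the product K^{-1} [x, y, 1]^T
    equals [(x - cx) / fx, (y - cy) / fy, 1]^T.
    """
    X = (x - cx) * depth / fx
    Y = (y - cy) * depth / fy
    return (X, Y, depth)

# Illustrative camera: focal length 1000 px, principal point at (416, 256).
point_3d = back_project(x=516, y=156, depth=2000.0, fx=1000.0, fy=1000.0, cx=416.0, cy=256.0)
```

The depth passed in must already be in metric units, so a FoV-normalized prediction has to be converted back before this step.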

Dataset. MuCo-3DHP is the training dataset, which is generated by compositing images from the 3D single-person pose estimation dataset MPI-INF-3DHP [56]. MuPoTS-3D is the test set; it contains 8700 challenging images captured outdoors across 20 real-world scenes, annotated with 3D keypoint positions. The annotations were obtained from a multi-view marker-less motion capture system.

Evaluation Metrics. We leveraged the 3D percentage of correct keypoints with root alignment (3D ${PCK}_{rel}$) and the area under the 3D PCK curve across different thresholds (${AUC}_{rel}$) to evaluate relative root-centered prediction. Following SMAP [55], a prediction was considered correct if it lay within 15 cm of the annotated keypoint position. We further used 3D ${PCK}_{abs}$, which indicates the 3D PCK without root alignment, to evaluate the absolute camera-centered prediction.
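The difference between the root-aligned and absolute variants of 3D PCK can be seen in a minimal sketch. The function name is hypothetical, coordinates are assumed to be in millimetres (so 15 cm is 150), and details of the official evaluation protocol such as pose matching or whether the root joint itself is counted are not modeled.

```python
import math

def pck3d(pred, gt, root_index=0, threshold=150.0, root_align=True):
    """Fraction of keypoints whose 3D error is within `threshold`.

    pred, gt: lists of (X, Y, Z) tuples for one person, in the same units as threshold.
    """
    if root_align:
        # Express both poses relative to their own root joint.
        pr, gr = pred[root_index], gt[root_index]
        pred = [tuple(p[i] - pr[i] for i in range(3)) for p in pred]
        gt = [tuple(g[i] - gr[i] for i in range(3)) for g in gt]
    correct = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= threshold)
    return correct / len(gt)

# Toy pose: the prediction is offset by 500 in depth, and one joint has a large extra error.
gt_pose = [(0.0, 0.0, 3000.0), (100.0, 0.0, 3000.0), (0.0, 400.0, 3100.0)]
pred_pose = [(0.0, 0.0, 3500.0), (110.0, 0.0, 3500.0), (0.0, 400.0, 3900.0)]

pck_rel = pck3d(pred_pose, gt_pose, root_align=True)
pck_abs = pck3d(pred_pose, gt_pose, root_align=False)
```

In this toy case two of three joints are correct after root alignment, while none is correct in absolute terms because the whole pose is misplaced in depth.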

Implementation Details. We adopted the Adam optimizer to train our 3D network with a batch size of 64 on a workstation with eight 32 GB Tesla V100 GPUs. We employed a warmup training strategy and the initial learning rate was set to $1.0 \times 10^{-3}$. The total training procedure was terminated at the 20th epoch. Following previous work, we used MuCo-3DHP mixed with MS COCO to train the 3D network. All images were shuffled and each mini-batch was randomly sampled from the shuffled dataset. All images were resized to a fixed resolution of 832 × 512 as model input for both training and testing processes.

Results. Table 8 reports the results of our method and previous top-down and bottom-up methods on the MuPoTS-3D [35] dataset. Our AdaptivePose-3D achieves 83.9 3D ${PCK}_{rel}$ and 44.6 ${AUC}_{rel}$ with HRNet-W32, which outperforms the top-down 3DMPPE [57] and Hdnet [58] by 1.4 and 0.2 3D ${PCK}_{rel}$, respectively. Compared with the bottom-up methods, our method outperforms Xnect [59] by a large margin, and surpasses SMAP [55] and Shen et al. [60] by 3.5 and 0.7 3D ${PCK}_{rel}$, respectively. The results indicate that our body representation can build the relationship between a human instance and its corresponding keypoints more effectively than the heuristic grouping process in a 3D scene. Although we only used a particularly simple method to regress absolute root depth, we also achieved promising performance on ${PCK}_{abs}$. Figure 9 shows the predicted skeletons on MuPoTS-3D [35]. We believe that our AdaptivePose-3D has great potential with a more effective depth estimation approach.

5. Conclusions

In this paper, we introduced a fine-grained body representation that represents human parts as adaptive points. Based on the proposed body representation, we built a compact single-stage network, named AdaptivePose++. The proposed network eliminates the time-consuming grouping and refinement processes, thus obtaining the best speed–accuracy trade-offs. Concretely, our method exceeds the state-of-the-art bottom-up DEKR [49] and CenterGroup [3] methods by 3.4 AP and 1.4 AP, and outperforms other existing bottom-up as well as single-stage approaches on MS COCO with a faster inference speed. Comprehensive experiments prove the generality on crowd and 3D scenes.

AdaptivePose++ eliminates complex post-processing during inference but still requires NMS to remove duplicates. We will explore designing a more efficient framework without any post-processing in future work. We also believe that the proposed body representation can inspire other human-centered vision tasks such as action recognition and human reconstruction.

Author Contributions

Methodology, Y.X.; Validation, Y.X.; Investigation, Y.X. and M.H.; Writing and original draft, Y.X.; Writing, review and editing, Y.X., M.H., L.J. and J.Z.; Visualization, Y.X.; Project administration, X.W.; Funding acquisition, X.W. and M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Nature Fund No.62071056, No.62227805, No.61871046, No.62102039 and No.62006244, and the Young Elite Scientist Sponsorship Program of China Association for Science and Technology YESS20200140.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

https://cocodataset.org/#download (accessed on 1 December 2022).

Conflicts of Interest

Funding acquisition, Xiaojuan Wang and Mei Song.

References

1. Xiao, Y.; Wang, X.J.; Yu, D.; Wang, G.; Zhang, Q.; He, M. AdaptivePose: Human Parts as Adaptive Points. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 36, pp. 2813–2821.
2. Luo, Z.; Wang, Z.; Huang, Y.; Wang, L.; Tan, T.; Zhou, E. Rethinking the heatmap regression for bottom-up human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021.
3. Brasó, G.; Kister, N.; Leal-Taixé, L. The Center of Attention: Center-Keypoint Grouping via Attention for Multi-Person Pose Estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021.
4. Papandreou, G.; Zhu, T.; Kanazawa, N. Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Venice, Italy, 22–29 October 2017.
5. Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 483–499.
6. Cao, Z.; Simon, T.; Wei, S.E. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
7. Xiao, Y.; Yu, D.; Wang, X.; Lv, T.; Fan, Y.; Wu, L. SPCNet: Spatial Preserve and Content-aware Network for Human Pose Estimation. In Proceedings of the European Conference on Artificial Intelligence, Santiago de Compostela, Spain, 29 August–8 September 2020.
8. Tan, J.; Liao, X.; Liu, J.; Cao, Y.; Jiang, H. Channel Attention Image Steganography with Generative Adversarial Networks. IEEE Trans. Netw. Sci. Eng. 2022, 9, 888–903.
9. Liao, X.; Yu, Y.; Li, B.; Li, Z.; Qin, Z. A New Payload Partition Strategy in Color Image Steganography. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 685–696.
10. Liao, X.; Yin, J.; Chen, M.; Qin, Z. Adaptive Payload Distribution in Multiple Images Steganography Based on Image Texture Features. IEEE Trans. Dependable Secur. Comput. 2022, 19, 897–911.
11. Kasprzak, W.; Jankowski, B. Light-Weight Classification of Human Actions in Video with Skeleton-Based Features. Electronics 2022, 11, 2145.
12. Lv, T.; Wang, X.; Jin, L.; Xiao, Y.; Song, M. Margin-based deep learning networks for human activity recognition. Sensors 2020, 20, 1871.
13. Wang, X.; Wang, X.; Lv, T.; Jin, L.; He, M. HARNAS: Human Activity Recognition Based on Automatic Neural Architecture Search Using Evolutionary Algorithms. Sensors 2021, 21, 6927.
14. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
15. Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
16. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
17. Xiao, B.; Wu, H.; Wei, Y. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 472–487.
18. Dong, H.; Wang, G.; Chen, C.; Zhang, X. RefinePose: Towards More Refined Human Pose Estimation. Electronics 2022, 11, 4060.
19. Chen, Y.; Wang, Z.; Peng, Y. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
20. He, K.; Gkioxari, G.; Dollar, P. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
21. Sun, K.; Xiao, B.; Liu, D. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
22. Fang, H.-S.; Xie, S.; Tai, Y.-W. Rmpe: Regional multi-person pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
23. Su, K.; Yu, D.; Xu, Z.; Geng, X.; Wang, C. Multi-person pose estimation with enhanced channel-wise and spatial information. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
24. Xiao, Y.; Su, K.; Wang, X.; Yu, D.; Jin, L.; He, M.; Yuan, Z. QueryPose: Sparse Multi-Person Pose Regression via Spatial-Aware Part-Level Query. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022.
25. Papandreou, G.; Zhu, T.; Chen, L.C. PersonLab: Person Pose Estimation and Instance Segmentation with a Bottom-Up. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018.
26. Cheng, B.; Xiao, B.; Wang, J. HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
27. Newell, A.; Huang, Z.; Deng, J. Associative embedding: End-to-end learning for joint detection and grouping. In Proceedings of the Conference and Workshop on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 2274–2284.
28. Li, J.; Su, W.; Wang, Z. Simple Pose: Rethinking and Improving a Bottom-up Approach for Multi-Person Pose Estimation. In Proceedings of the National Conference on Artificial Intelligence, Hilton New York Midtown, NY, USA, 7–12 February 2020.
29. Xiao, Y.; Yu, D.; Wang, X.J.; Jin, L.; Wang, G.; Zhang, Q. Learning Quality-aware Representation for Multi-person Pose Regression. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2022.
30. Kreiss, S.; Bertoni, L.; Alahi, A. PifPaf: Composite Fields for Human Pose Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
31. Lin, T.; Maire, M.; Belongie, S.J. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
32. Zhou, X.; Wang, D.; Krahenbuhl, P. Objects as points. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
33. Nie, X.; Feng, J.; Zhang, J. Single-stage multi-person pose machines. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
34. Li, J.; Wang, C.; Zhu, H. Crowdpose: Efficient crowded scenes pose estimation and a new benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
35. Mehta, D.; Sotnychenko, O.; Mueller, F.; Xu, W.; Sridhar, S.; Pons-Moll, G.; Theobalt, C. Single-shot multi-person 3D pose estimation from monocular rgb. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018.
36. Li, Y.; Zhang, S.; Wang, Z.; Yang, S.; Yang, W.; Xia, S.T.; Zhou, E. Tokenpose: Learning keypoint tokens for human pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021.
37. Huang, J.; Zhu, Z.; Guo, F. The Devil Is in the Details: Delving Into Unbiased Data Processing for Human Pose Estimation. In Proceedings of the CVPR, Seattle, WA, USA, 13–19 June 2020.
38. Tian, Z.; Shen, C.; Chen, H. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
39. Duan, K.; Bai, S.; Xie, L. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Long Beach, CA, USA, 16–20 June 2019.
40. Zhu, C.; He, Y.; Savvides, M. Feature Selective Anchor-Free Module for Single-Shot Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
41. Wei, F.; Sun, X.; Li, H.; Wang, J.; Lin, S. Point-set anchors for object detection, instance segmentation and pose estimation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020.
42. Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
43. Cai, Z.; Nuno, V. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
44. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
45. Ren, S.; He, K.; Girshick, R. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Conference and Workshop on Neural Information Processing Systems, Montreal, QC, Canada, 11–12 December 2015.
46. Tian, Z.; Chen, H.; Shen, C. Directpose: Direct end-to-end multi-person pose estimation. arXiv 2019, arXiv:1911.07451.
47. Dai, J.; Qi, H.; Xiong, Y. Deformable Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
48. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
49. Geng, Z.; Sun, K.; Xiao, B.; Zhang, Z.; Wang, J. Bottom-up human pose estimation via disentangled keypoint regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021.
50. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
51. Yu, F.; Wang, D.; Shelhamer, E.; Darrell, T. Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
52. Ruggero Ronchi, M.; Pietro, P. Benchmarking and error diagnosis in multi-instance pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
53. Mao, W.; Tian, Z.; Wang, X.; Shen, C. FCPose: Fully Convolutional Multi-Person Pose Estimation with Dynamic Instance-Aware Convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021.
54. Jin, L.; Xu, C.; Wang, X.; Xiao, Y.; Guo, Y.; Nie, X.; Zhao, J. Single-Stage Is Enough: Multi-Person Absolute 3D Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022.
55. Zhen, J.; Fang, Q.; Sun, J.; Liu, W.; Jiang, W.; Bao, H.; Zhou, X. Smap: Single-shot multi-person absolute 3d pose estimation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020.
56. Mehta, D.; Rhodin, H.; Casas, D.; Fua, P.; Sotnychenko, O.; Xu, W.; Theobalt, C. Monocular 3d human pose estimation in the wild using improved cnn supervision. In Proceedings of the 2017 International Conference on 3D vision (3DV), Qingdao, China, 10–12 October 2017.
57. Moon, G.; Chang, J.Y.; Lee, K.M. Camera distance-aware top-down approach for 3d multi-person pose estimation from a single rgb image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
58. Lin, J.; Lee, G.H. Hdnet: Human depth estimation for multi-person camera-space localization. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020.
59. Mehta, D.; Sotnychenko, O.; Mueller, F.; Xu, W.; Elgharib, M.; Fua, P.; Seidel, H.P.; Rhodin, H.; Pons-Moll, G.; Theobalt, C. XNect: Real-time multi-person 3D motion capture with a single RGB camera. ACM Trans. Graph. (TOG) 2020, 39, 82.
60. Shen, T.; Li, D.; Wang, F.Y.; Huang, H. Depth-Aware Multi-person 3D Pose Estimation with Multi-scale Waterfall Representations. IEEE Trans. Multimed. 2022, 2022. 8, 1–14.
61. Benzine, A.; Chabot, F.; Luvison, B.; Pham, Q.C.; Achard, C. Pandanet: Anchor-based single-shot multi-person 3d pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
62. Zhang, J.; Yu, D.; Liew, J.H.; Nie, X.; Feng, J. Body meshes as points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021.

Figure 1. Inference time (s) vs. precision (COCO keypoint AP). Our method achieves the best speed–accuracy trade-offs compared with previous methods on MS COCO [31].

Figure 2. (a) Conventional keypoint heatmap representation generally used in top-down methods such as Rmpe [22], as well as bottom-up methods such as CMU-pose [6]. (b) Center-to-joint body representation proposed by CenterNet [32]. (c) Hierarchical body representation introduced by SPM [33]. (d) Our adaptive point set representation. In contrast to (b,c), which only use center or root features, the features of adaptive points are introduced to encode the keypoint information in each part.

Figure 3. (a) The visualization of adaptive point set. White points indicate the human center and others refer to part-related points visualized by different colors. We leverage an adaptive point set conditioned on each human instance to represent the human pose in a fine-grained way. (b) Divided human parts according to inherent body structure. (c) Black dotted arrows indicate the bone connections of cross parts and solid lines refer to bone connections of inner parts.

Figure 4. Overview of AdaptivePose++. (a) The structure of the Part Perception Module. (b) The structure of the Enhanced Center-aware Branch; RP Adaptation refers to receptive field adaptation. (c) The diagram of the Two-hop Regression Branch. (d) The red arrows are one-hop offsets that dynamically locate the adaptive human-part related points. (e) The blue arrows indicate second-hop offsets for localizing the keypoints.
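
The two hops in panels (d) and (e) compose geometrically: a first offset moves from the human center to a part's adaptive point, and a second offset moves from that adaptive point to a keypoint. The following is a minimal sketch of that composition for a single person, not the authors' implementation. The function name, the part and keypoint names, and all coordinates are hypothetical, and the sketch omits the feature sampling that the network performs at the adaptive points.

```python
# Illustrative two-hop decode for ONE person.
# All names and numbers here are hypothetical, chosen only for the example.
from typing import Dict, Tuple

Point = Tuple[float, float]


def shift(p: Point, d: Point) -> Point:
    """Translate point p by offset d."""
    return (p[0] + d[0], p[1] + d[1])


def two_hop_decode(
    center: Point,
    first_hop: Dict[str, Point],
    second_hop: Dict[str, Dict[str, Point]],
) -> Tuple[Dict[str, Point], Dict[str, Point]]:
    """Compose center -> adaptive point -> keypoint.

    first_hop:  part name -> offset from the center to the part's adaptive point.
    second_hop: part name -> {keypoint name: offset from adaptive point to keypoint}.
    """
    # Hop 1: locate one adaptive point per part, relative to the center.
    adaptive_points = {part: shift(center, off) for part, off in first_hop.items()}
    # Hop 2: locate each keypoint relative to its part's adaptive point.
    keypoints: Dict[str, Point] = {}
    for part, offsets in second_hop.items():
        anchor = adaptive_points[part]
        for name, off in offsets.items():
            keypoints[name] = shift(anchor, off)
    return adaptive_points, keypoints


center = (64.0, 40.0)
first_hop = {"head": (0.0, -20.0), "left_arm": (-12.0, 2.0)}
second_hop = {
    "head": {"nose": (0.0, -3.0), "left_eye": (-2.0, -5.0)},
    "left_arm": {"left_elbow": (-3.0, 4.0), "left_wrist": (-5.0, 14.0)},
}
points, pose = two_hop_decode(center, first_hop, second_hop)
print(points["head"])      # (64.0, 20.0)
print(pose["nose"])        # (64.0, 17.0)
print(pose["left_wrist"])  # (47.0, 56.0)
```

Because each keypoint is reached through its own part's adaptive point, the second-hop offsets stay short even when a limb is far from the center.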

Figure 5. The detailed structure across the Part Perception Module, Enhanced Center-aware Branch and Two-hop Regression Branch. Concat indicates feature concatenation along the channel dimension. The red arrows indicate where a loss is applied.

Figure 6. Visualization of the predicted adaptive point set including the center and seven human-part related points for each human instance on the COCO dataset. Best viewed after zooming in.

Figure 7. Examples of predicted skeletons on the COCO mini-val dataset. The challenges include various human scales, pose deformation, and occluded scenarios. Best viewed after zooming in.

Figure 8. Visualization of the estimated multi-person poses on the CrowdPose dataset. The examples consist of many overlapped and occluded multi-person scenarios. Best viewed after zooming in.

Figure 9. Qualitative results on the MuPoTS-3D dataset: test images and the corresponding 3D multi-person poses predicted by our proposed AdaptivePose-3D.

Table 1. Ablation study for exploring the design of the Part Perception Module.

Structure	$AP$	${AP}_{50}$	${AP}_{75}$	${AP}_{M}$	${AP}_{L}$
1 × 1 Conv	64.5	85.9	70.6	58.0	73.9
Group 3 × 3 Conv	64.5	85.6	70.6	57.9	74.0
vanilla 3 × 3 Conv	64.6	85.5	70.4	58.1	74.2

Table 2. Ablation studies. PPM denotes the Part Perception Module, RFA is the receptive field adaptation conducted in the Enhanced Center-aware Branch, TR indicates the two-hop regression strategy in the Two-hop Regression Branch, and AL refers to employing an auxiliary loss ${loss}_{hm}$ to learn the keypoint heatmap representation.

Expt.	PPM	RFA	TR	AL	$AP$	${AP}_{50}$	${AP}_{75}$	${AP}_{M}$	${AP}_{L}$
1	-	-	-	-	57.5	81.7	62.6	48.7	70.2
2	√	√	-	-	58.5	82.4	63.8	49.0	71.7
3	√	-	√	-	61.6	84.5	65.9	53.0	72.1
4	√	√	√	-	63.0	85.5	68.6	56.0	72.8
5	√	√	√	√	64.6	85.5	70.4	58.1	74.2

Table 3. Comparisons of direct center-to-joint regression (denoted as DR) in CenterNet, hierarchical regression (denoted as HR) in SPM and our Two-hop Regression (denoted as TR) in terms of precision and four types of localization errors.

Methods	AP ↑	Jitter ↓	Miss ↓	Inversion ↓	Swap ↓
DR	58.8	17.3	9.5	3.9	1.2
HR	62.2	14.2	8.1	3.8	1.2
TR	63.8	12.8	7.6	3.5	1.3

Table 4. Ablation study for exploring the effect of heatmap refinement for our two-hop regressed result.

Methods	$AP$	${AP}_{50}$	${AP}_{75}$	${AP}_{M}$	${AP}_{L}$
Ours-heat	64.4	85.3	70.3	58.3	73.8
Ours-reg	64.6	85.5	70.4	58.1	74.2

Table 5. Comparisons with the competitive bottom-up methods on the COCO mini-val set. Note that all results are reported for single-scale testing without horizontal flip. $^{‡}$ indicates the conference version. + refers to larger input resolution. All inference times are measured on a 2080Ti GPU with mini-batch 1.

Methods	Params.(M)	Input Size	Output Size	$AP$	${AP}_{50}$	${AP}_{75}$	${AP}_{M}$	${AP}_{L}$	Time(s)
Most Competitive Bottom-Up Methods
DEKR-W32 [49]	29.6	512	128	63.4	86.2	69.8	55.8	76.2	0.145
HrHRNet-W32 [26]	28.6	512	256	63.6	84.9	69.2	57.1	73.5	0.164
SWAHR-W32 [2]	28.6	512	256	64.7	86.1	69.8	57.8	74.8	0.235
CenterGroup-W32 [3]	30.3	512	256	66.9	-	-	-	-	-
Single-Stage Methods
AdaptivePose $^{‡}$ (DLA-34) [1]	21.0	512	128	65.5	86.1	71.9	59.1	75.0	0.034
Ours (DLA-34)	21.0	512	128	65.8	86.4	71.9	59.3	75.4	0.034
Ours (HRNet-W32)	29.6	512	128	68.0	88.1	74.2	62.2	75.6	0.056
Most Competitive Bottom-Up Methods +
DEKR-W48 [49]	65.7	640	160	67.1	87.7	73.9	61.5	77.1	0.195
HrHRNet-W48 [26]	63.8	640	320	66.6	85.3	72.8	61.7	74.4	0.242
SWAHR-W48 [2]	63.8	640	320	67.3	87.1	72.9	62.1	75.0	0.428
CenterGroup-W48 [3]	65.5	640	320	69.1	-	-	-	-	-
Single-Stage Methods +
AdaptivePose $^{‡}$ (DLA-34) [1]	21.0	640	160	66.6	86.0	72.2	60.1	75.8	0.045
Ours (DLA-34)	21.0	640	160	67.0	86.3	72.6	60.5	76.1	0.045
Ours (HRNet-W48)	64.8	640	160	70.5	88.5	76.7	64.5	79.4	0.082
Ours (HRNet-W48)	64.8	800	200	70.8	88.3	77.0	65.7	78.7	0.115
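
As a quick sanity check on the margins quoted in the abstract, the sketch below recomputes the AP gain and the inference-time ratio of Ours (HRNet-W48, 640-pixel input) against the larger-input bottom-up baselines. The numbers are copied from Table 5. The dictionary layout and the helper name `margin` are illustrative only and are not part of the original work.

```python
# Values copied from Table 5 (COCO mini-val, larger-input group).
# The dict layout and the helper name `margin` are illustrative only.
from typing import Optional, Tuple

# name -> (AP, inference time in seconds)
TABLE5 = {
    "DEKR-W48": (67.1, 0.195),
    "HrHRNet-W48": (66.6, 0.242),
    "SWAHR-W48": (67.3, 0.428),
    "CenterGroup-W48": (69.1, None),  # inference time not reported in Table 5
}
OURS_W48_640 = (70.5, 0.082)  # Ours (HRNet-W48), 640-pixel input


def margin(name: str) -> Tuple[float, Optional[float]]:
    """Return (AP gain, speed-up factor) of ours over the named baseline."""
    ap, time_s = TABLE5[name]
    ap_gain = round(OURS_W48_640[0] - ap, 1)
    speedup = None if time_s is None else round(time_s / OURS_W48_640[1], 1)
    return ap_gain, speedup


for name in TABLE5:
    print(name, margin(name))
```

With these values, the gaps to SWAHR-W48 and CenterGroup-W48 come out at 3.2 AP and 1.4 AP, which agrees with the figures stated in the abstract; the time ratio against SWAHR-W48 is roughly 5.2.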

Table 6. Comparisons with previous state-of-the-art methods on COCO test-dev set. * indicates using extra test-time refinements. † refers to multi-scale testing. DLA-34+ indicates DLA-34 with 640 pixels input resolution.

Methods	Params.(M)	$AP$	${AP}_{50}$	${AP}_{75}$	${AP}_{M}$	${AP}_{L}$
Bottom-Up Methods
CMU-Pose $^{* †}$ [6]	-	61.8	84.9	67.5	57.1	68.2
AE $^{* †}$ [27]	227.8	65.5	86.8	72.3	60.6	72.6
CenterNet-DLA [32]	-	57.9	84.7	63.1	52.5	67.4
CenterNet-HG [32]	-	63.0	86.8	69.6	58.9	70.4
PersonLab [25]	68.7	66.5	88.0	72.6	62.4	72.3
PifPaf [30]	-	66.7	-	-	62.4	72.9
HrHRNet-W48 $^{* †}$ [26]	63.8	70.5	89.3	77.2	66.6	75.8
FCPose-R101 [53]	-	65.6	87.9	72.6	62.1	72.3
SWAHR-W48 $^{*}$ [2]	63.8	70.2	89.9	76.9	65.2	77.0
DEKR-W48 $^{* †}$ [49]	65.7	71.0	89.2	78.0	67.1	76.9
Single-stage Regression-based Methods
SPM $^{* †}$ [33]	-	66.9	88.5	72.9	62.6	73.1
DirectPose $^{†}$ [46]	-	64.8	87.8	71.1	60.4	71.5
PointSetNet $^{* †}$ [41]	-	68.7	89.9	76.3	64.8	75.3
Ours (DLA-34) $^{†}$	21.0	67.5	88.3	73.7	62.7	74.5
Ours (DLA-34+) $^{†}$	21.0	68.4	88.9	75.5	63.6	75.4
Ours (HRNet-W48) $^{†}$	64.7	71.4	90.2	78.5	66.8	78.2

Table 7. Comparisons with the state-of-the-art methods on the CrowdPose test set. $^{†}$ indicates multi-scale testing.

Methods	$AP$	${AP}_{50}$	${AP}_{75}$	${AP}_{E}$	${AP}_{M}$	${AP}_{H}$
Top-Down Methods
Mask-RCNN [20]	57.2	83.5	60.3	69.4	57.9	45.8
Rmpe [22]	61.0	81.3	66.0	71.2	61.4	51.1
SimpleBaseline [17]	60.8	84.2	71.5	71.4	61.2	51.2
CrowdPose [34]	66.0	84.2	71.5	75.5	66.3	57.4
Bottom-Up and Single-Stage Methods
CMU-Pose [6]	-	-	-	62.7	48.7	32.3
HigherHRNet-W48 [26]	65.9	86.4	70.6	73.3	66.5	57.9
+ CenterGroup [3]	67.6	87.7	72.7	73.9	68.2	60.3
DEKR-W32 [49]	65.7	85.7	70.4	73.0	66.4	57.5
DEKR-W48 [49]	67.3	86.4	72.2	74.6	68.1	58.7
Ours (DLA-34)	64.2	85.4	69.3	71.7	64.8	55.9
Ours (HRNet-W32)	66.0	86.6	71.2	73.3	66.7	57.8
Ours (HRNet-W48)	68.1	86.9	73.9	74.4	68.8	60.2
Bottom-Up and Single-Stage Methods $^{†}$
HigherHRNet-W48 $^{†}$ [26]	67.6	87.4	72.6	75.8	68.1	58.9
DEKR-W32 $^{†}$ [49]	67.0	85.4	72.4	75.5	68.0	56.9
DEKR-W48 $^{†}$ [49]	68.0	85.5	73.4	76.6	68.8	58.4
Ours (DLA-34) $^{†}$	65.9	85.4	71.3	74.1	66.6	56.9
Ours (HRNet-W32) $^{†}$	67.5	85.4	71.3	74.1	66.6	56.9
Ours (HRNet-W48) $^{†}$	69.2	87.3	75.0	76.7	70.0	60.9

Table 8. Comparisons on the MuPoTS-3D [35] dataset. All reported results are averaged over 20 test sequences. ★ indicates that the evaluations are conducted on all annotated persons.

Methods	PCK $_{rel} ↑$	PCK $_{abs} ↑$	AUC $_{rel} ↑$	PCK $_{rel}^{★} ↑$
Top-Down Methods
3DMPPE [57]	82.5	31.8	40.9	81.8
Hdnet [58]	83.7	35.2	-	-
Pandanet [61]	-	-	-	72.0
Bottom-Up Methods
SMAP [55]	80.5	38.7	42.7	73.5
Xnect [59]	75.8	-	-	70.4
BMP [62]	75.3	-	-	-
Shen et al. [60]	83.2	39.7	44.1	75.2
Single-Stage Methods
Ours (HRNet-W32)	83.9	33.0	44.6	78.9

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Xiao, Y.; Wang, X.; He, M.; Jin, L.; Song, M.; Zhao, J. A Compact and Powerful Single-Stage Network for Multi-Person Pose Estimation. Electronics 2023, 12, 857. https://doi.org/10.3390/electronics12040857

