Article

Efficient Sampling of Two-Stage Multi-Person Pose Estimation and Tracking from Spatiotemporal

1 School of Automation, Beijing University of Posts and Telecommunications, Beijing 100876, China
2 Beijing Key Laboratory of Network Systems and Network Culture, Beijing University of Posts and Telecommunications, Beijing 100876, China
3 School of Digital Media and Design Arts, Beijing University of Posts and Telecommunications, Beijing 100876, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(6), 2238; https://doi.org/10.3390/app14062238
Submission received: 5 February 2024 / Revised: 24 February 2024 / Accepted: 27 February 2024 / Published: 7 March 2024

Abstract: Tracking the articulated poses of multiple individuals in complex videos is a highly challenging task due to a variety of factors that compromise the accuracy of estimation and tracking. Existing frameworks often rely on intricate propagation strategies and extensive exchange of flow data between video frames. In this context, we propose a spatiotemporal sampling framework that addresses frame degradation at the feature level, offering a simple yet effective network block. Our spatiotemporal sampling mechanism empowers the framework to extract meaningful features from neighboring video frames, thereby optimizing the accuracy of pose detection in the current frame. This approach also yields significant improvements in running latency. When evaluated on the COCO dataset and the mixed dataset, our approach outperforms other methods in terms of average precision (AP), recall rate (AR), and acceleration ratio. Specifically, we achieve a 3.7% increase in AP, a 1.77% increase in AR, and a speedup of 1.51 times compared to mainstream state-of-the-art (SOTA) methods. Furthermore, when evaluated on the PoseTrack2018 dataset, our approach demonstrates superior accuracy in multi-object tracking, as measured by the multi-object tracking accuracy (MOTA) metric, achieving an 11.7% increase in MOTA compared to the prevailing SOTA methods.

1. Introduction

Pose estimation and tracking, being a fundamental domain in computer vision, have significantly contributed to various applications dedicated to understanding human behavior. These applications encompass human–computer interaction, sensory tracking, 3D visual imaging, and human motion forecasting, among other aspects. Simultaneously, they provide invaluable datasets encapsulating a diverse array of human postures within authentic video sequences, contributing to the refinement and application of correlated algorithms and models. While current research has achieved notable breakthroughs in the accuracy of pose estimation and tracking, there has been a comparatively limited focus on efficient multi-person pose estimation and tracking amidst frame degradation [1,2,3]. This study introduces a feature sampling block design for the person detection stage, autonomously acquiring the ability to spatially sample features from supporting frames without additional supervision. This innovative approach presents a solution for striking a balance between accuracy and running speed in two-stage pose estimation and tracking schemes. The feature sampling block enhances adaptability, facilitating seamless integration with diverse object detection architectures.
Most multi-person pose estimation and tracking methods can be categorized into two pipelines: bottom-up [4,5] and top-down [6,7,8,9]. Due to the absence of a global view of individual instances in the bottom-up pipeline, state-of-the-art top-down approaches excel in accuracy on large-scale benchmarks. However, their two-stage methodology tends to compromise inference speed compared to more streamlined bottom-up and single-stage approaches. Thus, the key challenge lies in designing a module that can effectively sample and exploit temporal information. Additionally, the final system in the tracking stage should be extremely lightweight and easy to implement.
To address these concerns, we propose a straightforward yet highly efficient spatiotemporal module. By incorporating deformable convolutions in both spatial and temporal dimensions, our module effectively utilizes temporal information for person detection in videos. When integrated into the multi-person pose estimation and tracking framework, the module intelligently samples relevant feature information from neighboring video frames, significantly improving the accuracy of the current frame. Our spatiotemporal sampling framework is trained on an extensive dataset comprising annotated bounding boxes from a diverse collection of still images and video frames. In summary, our main contributions are as follows:
  • This paper introduces a sophisticated two-stage joint framework for human pose estimation and tracking. In the initial stage, a precise target detector is employed to detect human bodies in video frames using a top-down approach. Subsequently, human keypoint coordinates are obtained by combining heatmaps with offset regression. The second stage utilizes a bipartite graph matching algorithm to achieve keypoint tracking between consecutive video frames.
  • This study pioneers a unified framework for human pose estimation and tracking, achieving commendable performance on widely adopted benchmarks.
  • The framework exhibits remarkable scalability, allowing seamless integration into diverse human visual applications, such as 3D human pose estimation, human parsing, action reconstruction, and other related domains.

2. Related Work

2.1. Multi-Person Pose Estimation in Images

In recent years, there has been significant progress in the field of multi-person pose estimation in images. The current approaches can be categorized into two main categories: two-stage approaches and one-stage approaches. Figure 1 elucidates the taxonomy delineating the landscape of the related work [5,7,10,11,12,13,14,15,16,17,18].

2.1.1. Two-Stage Approaches

The two-stage approaches can be further divided into bottom-up and top-down methods.
  • Bottom-Up Approaches: These approaches start by detecting all possible keypoints in an image and then group them to form individual skeletons. Several improvements have been made in bottom-up approaches, such as optimizing keypoint regression [8] and improving heatmap prediction methods [9]. However, these approaches often ignore global information from other body parts and the overall target, relying solely on keypoint detection. They also face challenges in handling scale variation in images containing small individuals.
  • Top-Down Approaches: In contrast to the bottom-up approach, top-down approaches first detect the bounding boxes for all persons in the image and then perform single-person pose estimation within each detected bounding box. However, when the detection target is severely occluded, the deviation caused by bounding box detection can lead to failure. To address this issue, subsequent work has explored techniques based on Convolutional Neural Networks (CNNs) or transformer frameworks. For example, an Image Guided Progressive Graph Convolutional Networks (GCN) module [10] was proposed to infer the position of invisible joints based on action types and image context. However, top-down approaches typically involve two stages: initial pose estimation and pose correction, which affect their efficiency. Transformer-based methods, such as ViT-Pose [11], offer simplicity, scalability, and flexibility, but their larger model size and input resolution hinder their effective application to video-based pose estimation. A common limitation of top-down approaches is that their inference time increases with the number of people in the image.
  • A hybrid solution [6] that combines the two approaches has been explored, where bottom-up pose estimation models are used as detectors instead of classic object detectors. These models provide detection boxes and pose proposals as conditions for subsequent attention-based top-down models. However, the performance of different sampling strategies for these conditions, such as empirical sampling and generative sampling, requires further exploration.

2.1.2. One-Stage Approaches

Whether through keypoint coordinate regression methods [11,12] or coordinate classification methods [13], one-stage approaches aim to learn an end-to-end network that can simultaneously locate human bodies and detect their joints. These methods avoid the inefficiencies of two-stage approaches, such as grouping, region-of-interest (RoI) cropping, and bounding-box detection. Representative works in this category include PolarPose [8], FCPose [19], SMPR [20], ED-Pose [21], InsPose [22], and Group Pose [23]. However, the limitations of end-to-end networks in learning intricate non-linear semantic representations, such as body center maps and dense joint display maps, often result in lower precision compared to top-down counterparts. Although some single-stage approaches [8,19,20,21,22,23] have achieved breakthroughs in estimation accuracy on certain datasets, their stability across a wider range of datasets requires further exploration.

2.2. Multi-Person Pose Estimation and Tracking in Videos

As mentioned in Section 2.1, it is natural to extend the top-down or bottom-up methods for multi-person pose estimation from still images to videos. Taking the top-down approach as an example, directly applying off-the-shelf person detectors from images to videos in the first stage presents challenges. Despite the rich temporal clues provided by video contexts for pose estimation and tracking, numerous challenges arise, including motion blur, occlusions between multiple persons, defocused video frames, and unconventional poses. The presence of temporal redundancy in video frames further complicates the situation, leading to additional computational overhead. Therefore, achieving accurate and efficient video-based pose estimation and tracking requires striking a balance. This entails rational utilization of temporal cues in the video while creatively reducing the computational overhead caused by temporal redundancy.

2.2.1. Multi-Person Pose Estimation

Video-based multi-person pose estimation can be implemented through image-based approaches on a frame-by-frame basis, naturally falling into two categories:
  • Bottom-Up Approaches: These approaches estimate poses by performing keypoint detection and grouping them on a per-frame basis [14,15,24,25]. However, most approaches struggle with body part association in crowded scenes with severe occlusion or complex movements. For instance, while DeciWatch [15] employs DenoiseNet to diminish the noise attributed to motion blur, its efficacy may be compromised in cases of non-periodic and intricate human motion.
  • Top-Down Approaches: These approaches begin by detecting individuals in all frames and then conduct single-person pose estimation on each frame based on image-based techniques. Numerous CNN-based approaches [16,26,27] have been introduced to address occlusion challenges in videos. Recently, transformer-based approaches have demonstrated superior performance on well-established keypoint estimation benchmarks [17,28].

2.2.2. Multi-Person Pose Tracking

In the context of multi-person pose tracking in videos, there are two main approaches: bottom-up and top-down.
  • Bottom-Up Approaches: These approaches start by detecting human body parts and then group them to form individuals. The grouped parts are connected and associated across frames to generate complete poses. However, bottom-up approaches may struggle with body part association in occluded scenes. In [4,5], the multi-person pose tracking challenge is introduced, and a spatial graph is extended to a spatiotemporal graph based on bottom-up methods [29]. While [4] achieves plausible results in complex videos by solving a minimum-cost multicut problem, the handcrafted features in probabilistic graphical models are not necessarily optimal for long video clips. Moreover, optimizing this sophisticated integer program is NP-hard and considered computationally infeasible, even with state-of-the-art optimizers.
  • Top-Down Approaches: Similar to image-based top-down approaches, these methods face the challenge of reliably detecting individual persons, especially under significant occlusion [7,26]. To address missed occluded joints, Ref. [30] proposes a Graph Neural Network (GNN)-based network that predicts poses, which are then aggregated with the detected poses. The posture-prediction learning captures two types of human motion features through different types of edges: the relative motion and spatial structure of human joints within a frame, and the temporal human dynamics between consecutive frames. Additionally, Ref. [18] proposes a gated attention transformer for multi-person pose tracking that considers posture-based similarity and appearance-based similarity. This approach automatically adjusts the impact of the two similarities on pose tracking accuracy and shows significant improvements under motion blur, crowded scenes, and occlusions, especially when assigning highly occluded persons in unusual poses. However, these algorithms and models primarily focus on improving the accuracy of pose tracking without addressing the issue of lightweight deployment. In contrast, our approach facilitates seamless model deployment and achieves commendable performance in human keypoint detection and tracking, all while maintaining low algorithmic complexity.
Our articulated pose-tracking framework follows a multi-stage pipeline and offers advantages over object detection techniques used in videos. We propose an end-to-end trainable network in which, instead of estimating optical flow to align features from adjacent frames, a spatiotemporal module learns to spatially sample features from adjacent frames. The aggregated features are then fed into different detection networks to estimate body joint landmarks. By effectively fusing information from multiple frames, this approach naturally makes our framework robust to motion blur and the presence of multiple overlapping instances in individual frames. Moreover, our approach is much less computationally expensive than previous methods, allowing it to efficiently handle any number of person instances per frame in a video. Figure 2 depicts our spatiotemporal sampling framework, which we elaborate on in the subsequent sections.

3. Background

3.1. Upper Bounds of Analytical Framework

In this section, we discuss the upper-bound performance of the analytical system, with a specific focus on the multi-object tracking accuracy (MOTA) metric (restated after the list below). Our re-implementation achieves slightly higher performance on the validation set (MOTA: 57.8) than the result reported in the original paper (MOTA: 57.6). We explore the upper-bound performance of the analytical framework by considering three key factors: the track identifier (track ID), the detection bounding boxes, and the body joint landmarks. We analyze these factors from four different aspects:
  • Perfect detection bounding boxes: We use the ground truth bounding boxes as the predicted results of the detector. Subsequently, we estimate the body joint landmarks within these ground truth boxes. The predicted landmarks are then matched using the bounding box overlap cost criterion. As illustrated in Table 1, we observe a notable 5.1% improvement in MOTA (57.8 → 62.9). Without fusing potential information from adjacent frames, the performance of the image-level detector is insufficient under frame degeneration, and its influence cannot be overlooked.
  • Perfect body joint landmarks: In this scenario, we first employ the detector to predict a series of bounding boxes. We then calculate the Intersection-over-Union (IoU) between the predicted boxes and the ground truth boxes. We assign ground truth keypoints to the predicted boxes with an IoU greater than 0.7. As depicted in Table 1, we achieve a significant 10.8% improvement in MOTA (57.8 → 68.6). This highlights the crucial role played by the quality of pose estimation in the evaluation of tracking performance.
  • Perfect track ID: Similar to the approach described in (2), we obtain boxes (IoU > 0.7) and assign the ground truth track IDs to the final boxes. As outlined in Table 1, we observe a modest 1.8% improvement in MOTA (57.8 → 59.6). This suggests that a simple and commonly used greedy matching algorithm, based on the bounding box overlap cost criterion, is already approaching the upper bound performance.
  • Perfect bounding boxes and body joint landmarks: Finally, we combine the perfect bounding boxes and body joint landmarks. As indicated in Table 1, this results in a substantial boost in MOTA (57.8 → 74.9). This emphasizes that the most critical challenge in the articulated pose tracking task for the top-down pipeline lies in establishing a robust and reliable person detector and pose estimator.
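For reference, the MOTA scores above follow the standard CLEAR MOT definition [31]; a restatement is given below, where, for each frame t, FN, FP, IDSW, and GT denote missed targets, false positives, identity switches, and ground-truth targets, respectively (the notation here is ours, not the benchmark's exact symbols).

$$\mathrm{MOTA} = 1 - \frac{\sum_{t}\left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_{t}\mathrm{GT}_t}$$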

3.2. Motivation

As discussed in Section 3.1, achieving satisfactory results in pose estimation and tracking using off-the-shelf image-level approaches is challenging due to various factors such as motion blur, person occlusions, and unusual poses. However, videos contain richer information about the same person instance, allowing for higher confidence when utilizing nearby frames. In the person detection stage, it is natural to extract useful feature information from neighboring frames to maximize the accuracy of person instance detection in a given frame. By doing so, the performance of human pose estimation and tracking can experience a substantial boost. Additionally, the analytical framework, which achieved 2nd place in the keypoint detection task of COCO 2018, has demonstrated good performance in still images. The challenge at hand is how to effectively incorporate instance-level information from nearby frames to address the problem of frame degeneration. Although the analytical framework performs well, there is still room for improving the accuracy of person detection in the person detection stage. This motivates the design of a module that can effectively exploit temporal information for specific instances between adjacent video frames. To maintain consistency with the current top-down pipeline, we propose a framework, which includes an additional module that learns to sample and aggregate features across temporal correspondences to enhance person detections over adjacent video frames.
Therefore, addressing two important blocks is crucial for proper feature sampling and aggregation: (1) determining how to sample useful instance-level features from nearby frames to a given frame, and (2) finding an appropriate approach to aggregate the features across multiple frames. These two blocks will be further elaborated below.

4. Method

4.1. Evaluation and Datasets

The evaluation of the framework performance is conducted using the PoseTrack 2018 dataset [5], which is a widely used benchmark for human pose estimation and articulated pose tracking in videos. The dataset consists of 1337 video sequences, with 593 videos for training, 74 videos for validation, and 375 videos for testing. The dataset includes three different challenges: (1) multi-person pose estimation in single frames, (2) multi-person pose estimation in videos, and (3) multi-person pose tracking in real-world scenarios. The evaluation metrics used for challenges (1) and (2) are mean Average Precision (mAP), while challenge (3) is evaluated using the MOTA metric [31]. More detailed information can be found in the original paper [5].
The modern PoseTrack framework generally follows a similar pipeline [7,16,26]. First, a human detector is applied to a still image (I) to generate bounding boxes (B) encompassing the individuals in the image. Then, a pose estimator is employed on the image region defined by the bounding boxes (B) to predict the keypoints (K) representing the body joints. Finally, a tracker module is built upon the bounding boxes (B) and keypoints (K), incorporating correlation features between pairs of consecutive frames to assign a unique track ID to each individual.
For the purpose of simplifying the performance analysis, we simplify the framework by replacing the complex flow-based pose tracking algorithm with a more straightforward greedy matching algorithm that uses bounding box overlap as the cost criterion. This simplified framework, which consists of a Residual Network-50 (ResNet-50) backbone [32] for image feature extraction, a Region-based Fully Convolutional Network (R-FCN) [33] detector without joint propagation for detecting persons in still images, and a bounding box overlap similarity metric, is referred to as the analytical framework in the remainder of this paper.

4.2. Network Architecture

As illustrated in Figure 2, our PoseTrack framework comprises three crucial components: (1) person detection in videos, (2) human pose estimation in videos, and (3) multi-person articulated tracking in videos. As mentioned in Section 3.1, person detection in videos plays a pivotal role in the overall pipeline, particularly in mitigating the effects of frame degeneration. Frame degeneration, caused by factors such as motion blur artifacts and occlusions, significantly amplifies the inaccuracy of single-frame person detection. As a consequence, top-down approaches encounter difficulties in accurately delineating the true boundaries of overlapping bodies, while bottom-up methods face challenges in effectively associating keypoints amid occlusion. Therefore, to enhance the performance of the entire framework, it is imperative to obtain more robust detection results.
The objective of this paper is to design a module that can effectively exploit temporal information for pose estimation and tracking in unconstrained videos. Let $I_t$ denote the t-th frame in the video and consider the scenario described in Figure 2. Suppose the person in $I_t$ is affected by motion blur or occlusions from other objects. Under such circumstances, it becomes challenging for an image-level detector to accurately detect the person. However, if a nearby frame $I_{t+s}$ or $I_{t-s}$ contains the same person instance, clearly visible and not occluded by other instances, it is natural to leverage the information from $I_{t+s}$ or $I_{t-s}$ to improve the accuracy of detection in frame $I_t$. Hence, the main challenge lies in sampling useful person-level information from the nearby frames $I_{t+s}$ and $I_{t-s}$ to enhance the current frame $I_t$. For ease of explanation, we refer to the nearby frames used for sampling as the supporting frames and to the current frame that requires improvement as the reference frame. In practice, we employ 2S supporting frames to enhance person detection accuracy in the reference frame. The supporting frames are randomly selected from a large range, encompassing S frames before the reference frame and S frames after it. This technique involves four crucial steps, which are elucidated in Figure 3.
We incorporate various sub-networks within the person detection stage, and the structure of the complete spatiotemporal sampling module is depicted in Figure 3.

4.2.1. Feature Extraction Network

We employ ResNet-50 and ResNet-101 [32] with deformable convolutions [34] as the feature extraction network; these are among the strongest object detection backbones. Both ResNet variants are pretrained on the ImageNet classification task [35].
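As a rough illustration of this backbone choice, the following sketch builds an ImageNet-pretrained ResNet-101 feature extractor with torchvision; it reflects a typical setup (plain convolutions, deformable layers not shown) and is an assumption, not the authors' released code.

import torch
import torch.nn as nn
from torchvision.models import resnet101, ResNet101_Weights

# ImageNet-pretrained ResNet-101 truncated before global pooling, so it
# returns a spatial feature map (B, 2048, H/32, W/32) for later modules.
backbone = resnet101(weights=ResNet101_Weights.IMAGENET1K_V1)
feature_extractor = nn.Sequential(*list(backbone.children())[:-2])

# Example: features for one 3-channel frame resized to 600 x 800.
features = feature_extractor(torch.randn(1, 3, 600, 800))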

4.2.2. Feature Sampling

Our primary contribution in this study is the development of a feature sampling block for the person detection stage, which enables spatial sampling of features from the supporting frames without requiring additional supervision. Moreover, it facilitates seamless integration with other object detection architectures.
Firstly, we sample the reference frame $I_t$ and the supporting frames $I_{t-s_1}$ and $I_{t+s_2}$ (one preceding and one succeeding the reference frame) from the given video, and then pass them through the same feature extraction network described in Section 4.2.1. Let c denote the number of channels, h the height, and w the width of the feature maps. This yields three groups of feature maps, $F_t$, $F_{t-s_1}$, and $F_{t+s_2}$, all with the same dimensions. Subsequently, we concatenate these three groups into a new feature map $[F_t, F_{t-s_1}, F_{t+s_2}] \in \mathbb{R}^{3c \times h \times w}$. As a result, the concatenated feature map contains information not only from the reference frame but also from the supporting frames.
Secondly, we utilize the deformable convolutional layer [34] to sample relevant information from the supporting feature maps $F_{t-s_1}$ and $F_{t+s_2}$. For the sake of clarity, we describe our feature sampling mechanism using a single supporting feature map, $F_{t-s_1}$. The input to our deformable convolutional layer consists of two components: the supporting feature map $F_{t-s_1}$ and the predicted location offsets $(\Delta x, \Delta y)$. The number of predicted location offsets depends on the number of supporting frames. The output of our deformable convolutional layer is a newly sampled feature map $U_{t,t-s_1}$, which is then passed to the feature aggregation block. A detailed illustration of our feature sampling mechanism can be found in Figure 3.
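The feature sampling step can be prototyped with an off-the-shelf deformable convolution. The following is a minimal sketch assuming a PyTorch/torchvision environment; the module and variable names (FeatureSampling, f_support_1, etc.) and the single-layer design are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FeatureSampling(nn.Module):
    """Sample information from one supporting feature map, conditioned on
    the concatenation [F_t, F_{t-s1}, F_{t+s2}]."""
    def __init__(self, c: int = 1024, k: int = 3):
        super().__init__()
        # Offsets (dx, dy) for every kernel location, predicted from the
        # concatenated reference + supporting features (3c channels).
        self.offset_pred = nn.Conv2d(3 * c, 2 * k * k, kernel_size=k, padding=k // 2)
        # Deformable convolution applied to the supporting feature map.
        self.deform = DeformConv2d(c, c, kernel_size=k, padding=k // 2)

    def forward(self, f_t, f_support_1, f_support_2):
        # Concatenate reference and supporting feature maps along channels.
        f_cat = torch.cat([f_t, f_support_1, f_support_2], dim=1)   # (B, 3c, h, w)
        offsets = self.offset_pred(f_cat)                           # (B, 2*k*k, h, w)
        # Sample the supporting frame's features at the offset locations.
        return self.deform(f_support_1, offsets)                    # U_{t, t-s1}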

4.2.3. Feature Aggregation

Once all the selected supporting frames have been fed into the feature sampling module, we obtain a series of newly sampled feature maps $U_{t,t\pm i}$, where $i = 1, \ldots, S$ indexes the supporting frames and $N$ (= 2S) denotes their number. These groups of supporting feature maps are then accumulated onto the reference feature map. As a result, we obtain richer information regarding the same person instance, such as poses, different viewpoints, and illuminations. For the feature aggregation block, we employ adaptive weights at various spatial locations, with all feature channels of the newly sampled feature maps sharing the same weight. The output of the feature aggregation block is given by
$$\bar{U}_t = \sum_{k=-S}^{S} \omega_{t,t+k} \cdot U_{t,t+k} \qquad (1)$$
where S is the range of supporting frames for feature aggregation. Equation (1) bears similarity to the attention mechanism [36]. The adaptive weight signifies the importance of each supporting frame $I_{t+k}$ to the reference frame $I_t$, where $k$ indexes the supporting frames. To compute the adaptive weights $\omega$ accurately, we employ an embedding sub-network $E(\cdot)$ to learn feature representations of video frames in a high-dimensional space. Specifically, at each location $(x, y)$, if the newly sampled feature $U_{t,t+k}$ is similar to the reference feature $F_t$, it is assigned a relatively large weight; otherwise, it is assigned a smaller weight. In this case, we employ the cosine similarity between the features extracted from the reference frame $F_t$ and each corresponding point in the newly sampled feature $U_{t,t+k}$. Our embedding sub-network consists of three layers: a 1 × 1 × 512 convolution, a 3 × 3 × 512 convolution, and a 3 × 3 × 1024 convolution, and it is randomly initialized.
We calculate the adaptive weights $\omega_{t,t+k}$ as
$$\omega_{t,t+k} = \exp\left(\frac{E(F_t) \cdot E(F_{t,t+k})}{\left\|E(F_t)\right\| \left\|E(F_{t,t+k})\right\|}\right) \qquad (2)$$
where $E(\cdot)$ denotes the representation of an individual frame in the high-dimensional feature space. We employ the exponential function to measure the similarity between any two frames $(I_t, I_{t+k})$ and normalize the adaptive weights $\omega_{t,t+k}$ across all supporting frames with a softmax layer. The softmax layer converts the adaptive weights $\omega_{t,t+k}$ into a probability distribution, yielding $\sum_{k=-S}^{S} \omega_{t,t+k} = 1$, which guarantees that the weights capture the relative importance of each supporting frame with respect to the reference frame. After applying the softmax layer, each weight $\omega$ can be interpreted as the cosine similarity between the reference frame and a supporting frame in the embedding feature space.
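A minimal code sketch of Equations (1) and (2) follows. It assumes PyTorch; the embedding layer sizes mirror the text above, while the ReLU activations between them and all class/variable names are our own assumptions rather than the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAggregation(nn.Module):
    def __init__(self, c: int = 1024):
        super().__init__()
        # Embedding sub-network E(.) used only for weight computation.
        self.embed = nn.Sequential(
            nn.Conv2d(c, 512, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 1024, kernel_size=3, padding=1),
        )

    def forward(self, f_t, sampled):
        """f_t: reference features (B, c, h, w); sampled: list of U_{t,t+k}
        (the list may include f_t itself as the k = 0 term)."""
        e_ref = F.normalize(self.embed(f_t), dim=1)      # unit-norm embedding
        logits = []
        for u in sampled:
            e_sup = F.normalize(self.embed(u), dim=1)
            # Per-location cosine similarity, i.e., Equation (2) before softmax.
            logits.append((e_ref * e_sup).sum(dim=1, keepdim=True))  # (B, 1, h, w)
        # Softmax across supporting frames -> adaptive weights summing to 1.
        w = torch.softmax(torch.stack(logits, dim=0), dim=0)         # (K, B, 1, h, w)
        u_bar = (w * torch.stack(sampled, dim=0)).sum(dim=0)         # Equation (1)
        return u_bar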

4.2.4. Person Detection Network

Subsequently, the aggregated features for the person instance are forwarded to the detection network, enabling the prediction of the final location of individuals in video frames. As demonstrated in Section 3, we validate the efficacy of this approach in generating robust predictions. Furthermore, it can be seamlessly integrated into any state-of-the-art detection network architecture, ensuring compatibility and enhancing performance.

4.3. Training

We employ ResNet-101 with Deformable Convolutional Networks (DCN) as our feature extraction network. The feature sampling module consists of six 3 × 3 × 1024 deformable convolutional layers, and the localization offsets module comprises six 3 × 3 convolutional layers for predicting the $(\Delta x, \Delta y)$ offsets (Figure 2 shows only two deformable convolutional layers). Our person detection network is implemented based on the Cascade Region-Based Convolutional Neural Network (Cascade R-CNN) [37] architecture.
For training, the entire feature sampling and feature aggregation modules are fully differentiable, enabling end-to-end training without the need for additional supervision. During training, all input images are resized to have a shorter side of 600 pixels. We use a small value of N (e.g., N = 2) as the number of supporting frames during training, meaning that two frames randomly sampled from a larger range serve as the supporting frames: one preceding the reference frame and one succeeding it. During inference, we use a larger value of N. The feature sampling and feature aggregation modules process each supporting frame sequentially. The model iterates over all the frames in the video, updating the feature cache in memory. The sampled features from the supporting frames are aggregated and used for person detection. The process of associating detected bounding boxes and keypoints with previous frames is repeated from the first to the last frame of the video. By using a smaller N during training and a larger N during inference, we optimize the model's performance for pose estimation and tracking.
Our person detection network is trained in two stages. In the first stage, we perform pretraining on the COCO [38] dataset, using annotations of the person category labeled with bounding boxes. Since the COCO dataset contains only still images and lacks the temporal information present in video frames, our feature sampling module cannot effectively sample useful information in this case. To address this limitation, we employ a specific type of data augmentation, Gaussian blur, which helps train the feature sampling module to some extent. Our experimental results demonstrate that Gaussian blur is effective and leads to improvement. We use the Stochastic Gradient Descent (SGD) [39] optimizer with one image per mini-batch and train the model for 5 epochs on 4 GPUs. The learning rate is initially set to $10^{-3}$ for the first 3 epochs and $10^{-4}$ for the last 2 epochs. In the second stage, the model is trained on the PoseTrack 2018 dataset for 3 epochs, with one image per mini-batch. The weights of the feature extraction, feature sampling, and feature aggregation components are initialized with the parameters learned in the first stage. The learning rate for the second stage is set to $10^{-3}$ for the first 2 epochs and $10^{-4}$ for the last epoch.
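A rough sketch of the kind of Gaussian-blur augmentation described above is given below, assuming torchvision; the kernel size, sigma range, and application probability are illustrative values, not the settings used in the paper.

from torchvision import transforms

# Randomly blur a still image to mimic motion-blurred video frames
# (parameters are illustrative assumptions).
blur_augment = transforms.RandomApply(
    [transforms.GaussianBlur(kernel_size=9, sigma=(0.5, 3.0))], p=0.5
)

def augment_still_image(img):
    """Apply the blur augmentation to a COCO training image."""
    return blur_augment(img)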

5. Experiments and Results

5.1. Multi-Person Pose Estimation and Tracking

In this section, we provide detailed information regarding our training and inference procedures for pose estimation and tracking.

5.1.1. Training

During the training stage, we begin by pre-training our model on the COCO training dataset using annotations of the person category, which include bounding boxes and keypoints. We then fine-tune the model on the PoseTrack 2018 training dataset. To ensure consistent proportions, we use a fixed aspect ratio of 4:3 for the ground truth human bounding boxes. These boxes are subsequently cropped from the image and resized to a standardized resolution of 256 × 192 pixels. Data augmentation techniques such as scale variation (±30%), rotation (±40°), and flipping are employed. Our feature extraction networks are pre-trained on the ImageNet classification task [35]. For optimization, we utilize the ADAM [40] optimizer with a mini-batch size of 32 images. The pose estimator is trained for a total of 140 epochs on 4 GPUs. The learning rate is set to $10^{-3}$ for the first 90 epochs, $10^{-4}$ for epochs 91–120, and $10^{-5}$ for the final 20 epochs.
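The fixed-aspect-ratio cropping step can be sketched as follows; the center-format box convention and function name are illustrative assumptions rather than the authors' preprocessing code.

def expand_to_aspect(cx, cy, w, h, target_wh_ratio=192.0 / 256.0):
    """Grow the shorter side of a center-format box (cx, cy, w, h) so that
    its width-to-height ratio matches the 192 x 256 crop (3:4)."""
    if w / h > target_wh_ratio:   # box too wide for the target: grow height
        h = w / target_wh_ratio
    else:                         # box too tall: grow width
        w = h * target_wh_ratio
    return cx, cy, w, h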

5.1.2. Inference

During the inference stage, we first employ detection models integrated with our feature sampling and aggregation modules, as described in Section 3.1, to predict bounding boxes around each person instance. By default, we utilize Cascade R-CNN with ResNet-101 DCN as our baseline detector. Subsequently, the deconvolution head network [26] is used to predict the body joint landmarks within each bounding box. Similar to previous work [7], we discard low-confidence and potentially incorrect detections to prevent track drifting and reduce false positives. This may slightly decrease the mAP score for pose estimation but improves the MOTA score for pose tracking. We set the thresholds for bounding box detections at 0.5 and for keypoints at 0.4. We then employ a greedy matching algorithm to compute the similarity between bounding boxes and landmarks across nearby frames for each instance. A cost matrix is generated using an ensemble cost metric that combines the bounding box overlap cost criterion ($\omega_1 = 0.9$) with pose similarity ($\omega_2 = 0.1$), where the latter computes the body joint landmark distance between two instances using the head-normalized probability of correct keypoints (PCKh) metric [41]. The matching process starts with the highest-confidence match and iteratively selects the most suitable edge in the cost matrix, removing the two connected nodes from consideration. This process of associating detected boxes and keypoints with previous frames is repeated from the first to the last frame of the given video.
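The greedy association step can be sketched as follows (NumPy). The 0.9/0.1 weighting comes from the text; the simplified PCKh term, the minimum-similarity cutoff, and all names are illustrative assumptions rather than the authors' implementation.

import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def greedy_match(prev, curr, w_box=0.9, w_pose=0.1, min_sim=0.2):
    """prev/curr: lists of dicts with 'box', 'kpts' (K x 2), 'head_size'.
    Returns a list of (prev_idx, curr_idx) assignments."""
    sim = np.zeros((len(prev), len(curr)))
    for i, p in enumerate(prev):
        for j, c in enumerate(curr):
            # PCKh-style pose similarity: fraction of keypoints closer than
            # half the head-segment length (simplified stand-in).
            dist = np.linalg.norm(p["kpts"] - c["kpts"], axis=1)
            pckh = float(np.mean(dist < 0.5 * p["head_size"]))
            sim[i, j] = w_box * iou(p["box"], c["box"]) + w_pose * pckh
    matches = []
    while sim.size and sim.max() > min_sim:
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        matches.append((int(i), int(j)))
        sim[i, :] = -1.0   # remove both connected nodes from consideration
        sim[:, j] = -1.0
    return matches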
Additional details regarding our inference process are presented in Algorithm 1. Given a video comprising consecutive frames and a specified sampling and aggregation range S, our feature sampling and aggregation modules sequentially process each supporting frame. To address GPU memory constraints, we maintain a feature cache covering the S supporting frames preceding the reference frame and the S supporting frames following it. Initially, the feature extraction network processes the first S supporting frames, and their corresponding features are cached in memory (L1–L3 in Algorithm 1). Subsequently, our model iterates over all frames in the video to conduct person detection while concurrently updating the feature cache with new information. For each reference frame, denoted as t, the relevant feature information from the supporting frames is sampled, and their adaptive weights are computed (L4–L8). The newly sampled features are then aggregated and forwarded to the detection network for person detection, with the feature map of the (t + k)-th frame being added to the memory cache (L9–L12). Finally, a non-maximum suppression (NMS) post-processing step with a threshold of 0.4 is applied to eliminate redundant bounding boxes based on the overlap cost criterion.
Algorithm 1 Inference algorithm of the spatiotemporal sampling module
Require:
input: video frames $\{I_t\}_{t=0}^{T}$, a specific sampling and aggregation range S
1: for t = 0 to S do
2:   $F_t = N_{feat}(I_t)$
3: end for
4: for t = 0 to T do
5:   for t + k = max(0, t − S) to t + S do
6:     $F_t^e, F_{t,t+k}^e = E(F_t), E(F_{t,t+k})$
7:     $\omega_{t,t+k} = \exp\left(\frac{F_t^e \cdot F_{t,t+k}^e}{\left\|F_t^e\right\| \left\|F_{t,t+k}^e\right\|}\right)$
8:   end for
9:   $\bar{U}_t = \sum_{k=-S}^{S} \omega_{t,t+k} \cdot U_{t,t+k}$
10:  $p_t = N_{det}(\bar{U}_t)$
11:  $F_{t+k} = N_{feat}(I_{t+k})$
12: end for
Ensure:
output: person detection results $p_t$
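A Python-flavored sketch of the Algorithm 1 loop with a sliding feature cache is given below; n_feat, sample_fn, aggregate_fn, and n_det are placeholder callables standing in for the networks above, and the cache handling is a simplification of the description in the text.

from collections import deque

def run_inference(frames, n_feat, sample_fn, aggregate_fn, n_det, S=6):
    """sample_fn(f_ref, f_sup) -> U; aggregate_fn(f_ref, [U, ...]) -> U_bar."""
    cache = deque(maxlen=2 * S + 1)        # sliding window of cached features
    # L1-L3: extract and cache the features of the first S supporting frames.
    for t in range(min(S, len(frames))):
        cache.append(n_feat(frames[t]))
    detections = []
    for t in range(len(frames)):
        f_t = n_feat(frames[t])
        # L5-L8: sample features from each cached supporting frame.
        sampled = [sample_fn(f_t, f_sup) for f_sup in cache]
        # L9-L10: aggregate with adaptive weights and run person detection.
        detections.append(n_det(aggregate_fn(f_t, sampled + [f_t])))
        # L11: add the next unseen frame's features to the cache.
        if t + S < len(frames):
            cache.append(n_feat(frames[t + S]))
    return detections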

5.1.3. Results on the PoseTrack Benchmark

As outlined in Section 4.2, our approach involves training the model on the COCO training dataset and subsequently fine-tuning it on the PoseTrack 2018 training dataset.
We first performed a thorough analysis on the COCO keypoint detection dataset for human pose estimation and made a fair comparison with the most recent state-of-the-art methods. The COCO setting uses the original COCO annotations, while the mixed dataset integrates multiple datasets, including COCO and PoseTrack. These dataset settings are aligned with those of the baseline methods. The results summarized in Table 2 show that our method currently achieves the best performance. Specifically, on the COCO dataset, our average precision (AP) reaches 76.7%, and the AP values at thresholds from 0.5 to 0.75 are 92.5%, 83.2%, and 79.9%, respectively. We also found that our method performs best not only in accuracy but also in recall rate (AR). On the mixed dataset, the same conclusion can be drawn: AP is 78.5%, and the AP values at thresholds from 0.5 to 0.75 are 93.1%, 84.9%, and 80.2%, respectively. Considering that we use ResNet-50 as the backbone network, the final latency is 59.2 ms.
We present our method’s performance on the PoseTrack 2017 and PoseTrack 2018 benchmarks. Table 3 displays the results for the multi-person pose estimation task, while Table 4 showcases the results for the multi-person pose tracking task. Our larger ResNet-152 network achieves superior performance, with a 77 mAP score and a 65.8 MOTA score on the PoseTrack 2018 validation dataset. When compared to the state-of-the-art PoseTrack framework, specifically the FlowTrack in the Simple Baseline [26], our framework exhibits a slight improvement of 0.3% (73.9 → 74.2) on the PoseTrack 2017 test dataset. Although the improvement in mAP is not substantial, it is noteworthy that our framework operates under significantly more challenging conditions than FlowTrack, yet achieves comparable results (MOTA: 57.3 vs. 57.6) in the pose tracking task. Unlike FlowTrack, our spatiotemporal sampling module does not rely on large amounts of flow data or complex propagation processes based on optical flow between multiple frames during the tracking stage. While being lightweight and straightforward to implement, our spatiotemporal sampling module can seamlessly integrate into various high-performing person detection architectures. The fact that our framework successfully learns temporal correspondences and outperforms strong single-frame baselines in both mAP and MOTA is impressive, as depicted in Table 5. To the best of our knowledge, our approach is the first to explore the temporal relationship between video frames at the feature level for video person detection in the modern PoseTrack framework.
Our method excels not only in analysis accuracy but also in running latency, as demonstrated in Table 6. The table compares our method's performance on a single NVIDIA GeForce RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) across different batch sizes against existing state-of-the-art (SOTA) methods. To ensure fairness, we replicated all the methods listed in the table and carefully aligned the parameter details of each method. The “ST baseline” refers to the simplest implementation of our method, which involves spatiotemporal sampling of video frames, data fusion for tracking results, and recording the entire inference time of the process. Our method, referred to as the “ST method” in the table, efficiently samples and fuses video frames to generate pose estimation and tracking results while fully recording the inference time of this process. We also conducted an experiment in which we applied the tracking stage of our ST method to Lite Pose, a lightweight 2D pose estimation network without a complete pose tracking stage. The results were impressive, as the end-to-end process became faster. This improvement is primarily due to the lightweight pose estimation design, which naturally leads to a strong speedup ratio and ultimately accelerates the inference of the entire pose tracking pipeline on videos. Furthermore, we compared our implementation with the latest works, including those presented at CVPR 2022 and CVPR 2023, specifically in terms of inference speed. In the multi-batch case, the acceleration ratio reached 1.01–1.24×, showcasing the efficiency of our method. Overall, our method not only delivers remarkable analysis accuracy but also demonstrates notable improvements in running latency, outperforming existing approaches and achieving impressive acceleration ratios in various scenarios.

5.2. Ablation Studies

In the ablation studies conducted on the PoseTrack2018 validation dataset, we thoroughly evaluate the components of our design and provide detailed insights. We describe the findings and present the results in Table 7.

5.2.1. Feature Sampling and Aggregation Module

We study the effect of the feature sampling and feature aggregation modules within the entire PoseTrack framework. In the person detection stage, we use a Cascade R-CNN detector with a ResNet-101 backbone. In the pose estimation stage, the backbone is ResNet-50, and the input image size is 256 × 192 pixels. In the tracking stage, we use a greedy matching approach that computes the similarity between adjacent frames using the bounding box overlap cost criterion.
  • Method (a) represents the single-frame baseline, which achieves a 73.1 mAP for multi-person pose estimation and a 58.3 MOTA for multi-person pose tracking. This baseline already demonstrates competitiveness compared to the state-of-the-art results on the PoseTrack2017 validation dataset.
  • Method (b) is a degraded version that excludes the feature sampling module and sets all adaptive weights to 1/(2S) instead of computing them with Equation (2). This results in a decrease in mAP to 71.4 and MOTA to 57.3, indicating the importance of feature-level motion information for video object detection.
  • Method (c) enhances (b) by incorporating adaptive weights ω t , t + k calculated using Equation (2). This leads to an increase in mAP to 73.0 and MOTA to 58.1, surpassing the performance of (b). This highlights the criticality of designing an adaptive-weighted sub-network in the feature aggregation module.
  • Method (d) is our proposed feature sampling and feature aggregation method, which adds the feature sampling module to (c). This further improves the mAP to 73.5 and MOTA to 62.8. These results demonstrate the effectiveness of our feature sampling and aggregation modules in leveraging motion information from adjacent frames to address challenges like motion blur and occlusion.

5.2.2. The Design Choices of Embedding Network

  • We explore different structures for the embedding sub-network and assess their impact on performance. We use four different structures, and the results are presented in Table 8.
  • The findings indicate that using a fully convolutional network to project features into a new high-dimensional embedding space for similarity measurement does not significantly impact performance.
  • Based on these results, we select Design (C) as our embedding sub-network, as it offers the best performance with minimal computational requirements.

5.2.3. The Number of Supporting Frames

  • Due to GPU memory limitations, we extract the features of each image individually and then feed them into memory before inputting them into the sampling and aggregation modules.
  • We experiment with different numbers of supporting frames (5, 7, 9, 11, 13, and 15) during inference and 2 or 4 supporting frames per mini-batch during training.
  • The results in Table 9 indicate that using more supporting frames during training does not lead to higher accuracy, and the improvement saturates at 13 frames during the inference stage.
  • Consequently, we default to sampling 2 supporting frames during training and aggregating features from 13 supporting frames during inference.
Overall, the ablation studies provide a comprehensive evaluation of our design on the PoseTrack2018 validation dataset. They highlight the importance of motion information on the feature level, the effectiveness of the proposed feature sampling and aggregation modules, the minimal impact of different embedding network structures, and the optimal number of supporting frames for inference and training.

5.3. Qualitative Results

Figure 4 presents a selection of illustrative results showcasing the effectiveness of our spatiotemporal sampling framework in predicting multi-person pose tracking on the PoseTrack2018 validation dataset.

6. Conclusions and Future Work

This article addresses the crucial challenge of frame degradation in multi-person pose tracking, with the goal of improving accuracy while minimizing computational resource consumption. To address this challenge, we propose a spatiotemporal sampling framework that effectively leverages information from neighboring video frames to extract valuable features. What sets our approach apart from conventional pose estimation and tracking frameworks is the inclusion of a spatiotemporal sampling module specifically within the person detection stage, an aspect often overlooked in previous research. This module, which is fully differentiable, offers several advantages over existing articulated pose tracking frameworks by reducing the reliance on extensive flow data and intricate propagation strategies. Moreover, our techniques integrate seamlessly into high-performance detection architectures, consistently leading to significant improvements in the tracking stage. It is worth noting that we have not explored additional acceleration of the model through contemporary pruning and quantization methods. In future work, we intend to investigate more intricate designs for the feature sampling and aggregation modules. Additionally, we aim to explore integrating pose estimation with articulated tracking in an end-to-end manner, rather than treating them as separate stages.

Author Contributions

Conceptualization, S.L. and W.H.; methodology, S.L. and W.H.; software, S.L.; validation, S.L. and W.H.; formal analysis, S.L.; investigation, S.L.; data curation, S.L.; writing—original draft preparation, S.L.; writing—review and editing, S.L. and W.H.; visualization, S.L.; supervision, W.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data and related information for this study are provided in accordance with our data availability policy. We adhere to relevant laws and regulations to ensure data accessibility and transparency. The dataset utilized in this study was derived from publicly accessible resources, specifically PoseTrack obtained from https://posetrack.net/, PoseTrack21 sourced from https://github.com/andoer/PoseTrack21, and COCO available at https://cocodataset.org/, with reference numbers [2,5,38].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhou, L.; Meng, X.; Liu, Z.; Wu, M.; Gao, Z.; Wang, P. Human Pose-based Estimation, Tracking and Action Recognition with Deep Learning: A Survey. arXiv 2023, arXiv:2310.13039. [Google Scholar]
  2. Doering, A.; Chen, D.; Zhang, S.; Schiele, B.; Gall, J. PoseTrack21: A Dataset for Person Search, Multi-Object Tracking and Multi-Person Pose Tracking. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 20931–20940. [Google Scholar]
  3. Chen, H.; Feng, R.; Wu, S.; Xu, H.; Zhou, F.; Liu, Z. 2D Human pose estimation: A survey. Multimed. Syst. 2023, 29, 3115–3138. [Google Scholar] [CrossRef]
  4. Insafutdinov, E.; Andriluka, M.; Pishchulin, L.; Tang, S.; Levinkov, E.; Andres, B.; Schiele, B. ArtTrack: Articulated multi-person tracking in the wild. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1293–1301. [Google Scholar] [CrossRef]
  5. Andriluka, M.; Iqbal, U.; Insafutdinov, E.; Pishchulin, L.; Milan, A.; Gall, J.; Schiele, B. PoseTrack: A Benchmark for Human Pose Estimation and Tracking. In Proceedings of the 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5167–5176. [Google Scholar] [CrossRef]
  6. Zhou, M.; Stoffl, L.; Mathis, M.W.; Mathis, A. Rethinking pose estimation in crowds: Overcoming the detection information bottleneck and ambiguity. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 14643–14653. [Google Scholar]
  7. Girdhar, R.; Gkioxari, G.; Torresani, L.; Paluri, M.; Tran, D. Detect-and-Track: Efficient Pose Estimation in Videos. In Proceedings of the 31st Meeting of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Salt Lake City, UT, USA, 18–22 June 2018; pp. 350–359. [Google Scholar] [CrossRef]
  8. Li, J.; Wang, Y.; Zhang, S. PolarPose: Single-Stage Multi-Person Pose Estimation in Polar Coordinates. IEEE Trans. Image Process. 2023, 32, 1108–1119. [Google Scholar] [CrossRef] [PubMed]
  9. Cheng, Y.; Ai, Y.; Wang, B.; Wang, X.; Tan, R.T. Bottom-up 2D pose estimation via dual anatomical centers for small-scale persons. Pattern Recognit. 2023, 139, 109403. [Google Scholar] [CrossRef]
  10. Qiu, L.; Zhang, X.; Li, Y.; Li, G.; Wu, X.; Xiong, Z.; Han, X.; Cui, S. Peeking into Occluded Joints: A Novel Framework for Crowd Pose Estimation. In Proceedings of the 16th European Conference on Computer Vision, ECCV 2020, Glasgow, UK, 23–28 August 2020; Springer Science and Business Media Deutschland GmbH: Cham, Switzerland, 2020; Volume 12364 LNCS, pp. 488–504. [Google Scholar] [CrossRef]
  11. Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. ViTPose++: Vision Transformer for Generic Body Pose Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 1212–1230. [Google Scholar] [CrossRef] [PubMed]
  12. Maji, D.; Nagori, S.; Mathew, M.; Poddar, D. YOLO-Pose: Enhancing YOLO for Multi Person Pose Estimation Using Object Keypoint Similarity Loss. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2022, New Orleans, LA, USA, 19–20 June 2022; pp. 2636–2645. [Google Scholar]
  13. Jiang, T.; Lu, P.; Zhang, L.; Ma, N.; Han, R.; Lyu, C.; Li, Y.; Chen, K. RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose. arXiv 2023, arXiv:2303.07399. [Google Scholar]
  14. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.-E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 43, 172–186. [Google Scholar] [CrossRef] [PubMed]
  15. Zeng, A.; Ju, X.; Yang, L.; Gao, R.; Zhu, X.; Dai, B.; Xu, Q. DeciWatch: A Simple Baseline for 10× Efficient 2D and 3D Pose Estimation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 607–624. [Google Scholar]
  16. Xiu, Y.; Li, J.; Wang, H.; Fang, Y.; Lu, C. Pose Flow: Efficient Online Pose Tracking. In Proceedings of the British Machine Vision Conference, Newcastle, UK, 3–6 September 2018. [Google Scholar]
  17. Yuan, Y.; Fu, R.; Huang, L.; Lin, W.; Zhang, C.; Chen, X.; Wang, J. HRFormer: High-Resolution Transformer for Dense Prediction. arXiv 2021, arXiv:2110.09408. [Google Scholar]
  18. Doering, A.; Gall, J. A Gated Attention Transformer for Multi-Person Pose Tracking. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW) 2023, Paris, France, 2–6 October 2023; pp. 3181–3190. [Google Scholar]
  19. Mao, W.; Tian, Z.; Wang, X.; Shen, C. FCPose: Fully Convolutional Multi-Person Pose Estimation with Dynamic Instance-Aware Convolutions. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021, Nashville, TN, USA, 19–25 June 2021; pp. 9030–9039. [Google Scholar] [CrossRef]
  20. Miao, H.; Lin, J.; Cao, J.; He, X.; Su, Z.; Liu, R. SMPR: Single-stage multi-person pose regression. Pattern Recognit. 2023, 143, 109743. [Google Scholar] [CrossRef]
  21. Yang, J.; Zeng, A.; Liu, S.; Li, F.; Zhang, R.; Zhang, L. Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation. arXiv 2023, arXiv:2302.01593. [Google Scholar]
  22. Shi, D.; Wei, X.; Yu, X.; Tan, W.; Ren, Y.; Pu, S. InsPose: Instance-Aware Networks for Single-Stage Multi-Person Pose Estimation. In Proceedings of the 29th ACM International Conference on Multimedia, MM 2021, Virtual, 20–24 October 2021; Association for Computing Machinery, Inc.: New York, NY, USA; pp. 3079–3087. [Google Scholar] [CrossRef]
  23. Liu, H.; Chen, Q.; Tan, Z.; Liu, J.-J.; Wang, J.; Su, X.; Li, X.; Yao, K.; Han, J.; Ding, E.; et al. Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023, Paris, France, 1–6 October 2023; pp. 14983–14992. [Google Scholar]
  24. Jin, S.; Liu, W.; Ouyang, W.; Qian, C. Multi-Person Articulated Tracking with Spatial and Temporal Embeddings. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 5657–5666. [Google Scholar]
  25. Geng, Z.; Sun, K.; Xiao, B.; Zhang, Z.; Wang, J. Bottom-up human pose estimation via disentangled keypoint regression. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021, Nashville, TN, USA, 19–25 June 2021; pp. 14671–14681. [Google Scholar] [CrossRef]
  26. Xiao, B.; Wu, H.; Wei, Y. Simple Baselines for Human Pose Estimation and Tracking. arXiv 2018, arXiv:1804.06208. [Google Scholar]
  27. Wang, M.; Tighe, J.; Modolo, D. Combining Detection and Tracking for Human Pose Estimation in Videos. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020, Seattle, WA, USA, 13–19 June 2020; pp. 11085–11093. [Google Scholar]
  28. Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation. arXiv 2022, arXiv:2204.12484. [Google Scholar]
  29. Insafutdinov, E.; Pishchulin, L.; Andres, B.; Andriluka, M.; Schiele, B. DeeperCut: A Deeper, Stronger, and Faster Multi-person Pose Estimation Model. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016. [Google Scholar]
  30. Yang, Y.; Ren, Z.; Li, H.; Zhou, C.; Wang, X.; Hua, G. Learning Dynamics via Graph Neural Networks for Human Pose Estimation and Tracking. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021, Nashville, TN, USA, 20–25 June 2021; pp. 8070–8080. [Google Scholar]
  31. Milan, A.; Leal-Taixé, L.; Reid, I.D.; Roth, S.; Schindler, K. MOT16: A Benchmark for Multi-Object Tracking. arXiv 2016, arXiv:1603.00831. [Google Scholar]
  32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  33. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks. In Proceedings of the Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 379–387. [Google Scholar]
  34. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV) 2017, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  35. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.S.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2014, 115, 211–252. [Google Scholar] [CrossRef]
  36. Rush, A.M.; Chopra, S.; Weston, J. A Neural Attention Model for Abstractive Sentence Summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015. [Google Scholar]
  37. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  38. Lin, T.-Y.; Maire, M.; Belongie, S.J.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  39. Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747. [Google Scholar]
  40. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  41. Pishchulin, L.; Insafutdinov, E.; Tang, S.; Andres, B.; Andriluka, M.; Gehler, P.; Schiele, B. DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4929–4937. [Google Scholar]
  42. Lu, P.; Jiang, T.; Li, Y.; Li, X.; Chen, K.; Yang, W. RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation. arXiv 2023, arXiv:2312.07526. [Google Scholar]
  43. Li, Y.; Yang, S.; Liu, P.; Zhang, S.; Wang, Y.; Wang, Z.; Yang, W.; Xia, S.T. SimCC: A Simple Coordinate Classification Perspective for Human Pose Estimation. In Proceedings of the 17th European Conference on Computer Vision, ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Springer Science and Business Media Deutschland GmbH: Berlin, Germany, 2022; Volume 13666 LNCS, pp. 89–106. [Google Scholar] [CrossRef]
  44. Gu, K.; Yang, L.; Yao, A. Removing the Bias of Integral Pose Regression. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 11–17 October 2021; pp. 11047–11056. [Google Scholar] [CrossRef]
Figure 1. A taxonomy of the related work [5,7,10,11,12,13,14,15,16,17,18].
Figure 2. Our spatiotemporal sampling framework consists of three primary stages. (1) Person Detection Stage: This stage focuses on generating bounding boxes, which are subsequently utilized to predict body joint landmarks in the human pose estimation stage. (2) Human Pose Estimation Stage: Building upon the detection boxes generated in the first stage, this stage predicts keypoints associated with human poses. (3) Multi-Person Articulated Tracking Stage: In this stage, a lightweight tracking algorithm based on greedy matching is employed to address the more intricate task of tracking poses. It utilizes the detection boxes from the first stage and the keypoints predicted in the second stage to track poses over time.
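For clarity, the three stages above can be summarized in a short sketch. The code below is a minimal illustration under our own assumptions; the function names detect_persons, estimate_pose, and greedy_match are hypothetical placeholders rather than the framework's actual API.

```python
# Minimal sketch of the three-stage pipeline in Figure 2 (hypothetical API).
from dataclasses import dataclass
from typing import List

@dataclass
class PersonTrack:
    track_id: int
    box: tuple        # (x1, y1, x2, y2) from the person detection stage
    keypoints: list   # [(x, y, score), ...] from the pose estimation stage

def track_video(frames, detect_persons, estimate_pose, greedy_match):
    """Run detection -> pose estimation -> greedy-matching tracking per frame."""
    previous_tracks: List[PersonTrack] = []
    results = []
    for frame in frames:
        boxes = detect_persons(frame)                      # Stage 1: bounding boxes
        poses = [estimate_pose(frame, b) for b in boxes]   # Stage 2: keypoints per box
        # Stage 3: assign track IDs by greedily matching against the previous frame
        tracks = greedy_match(previous_tracks, boxes, poses)
        previous_tracks = tracks
        results.append(tracks)
    return results
```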
Figure 3. Our spatiotemporal sampling mechanism, as outlined in Section 4.2, can be summarized in four steps. First, we utilize a feature extraction network to compute features for each video frame. Next, spatiotemporal sampling blocks are implemented to selectively sample pertinent features from neighboring frames. Subsequently, the sampled features are temporally aggregated, integrating the temporal information across frames. Finally, the aggregated features are fed as input to the detection network for further processing.
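As a concrete (hypothetical) illustration of these four steps, the block below sketches one possible sampling-and-aggregation module in PyTorch: offsets predicted from a current/support feature pair drive bilinear sampling of the support frame, and per-location adaptive weights fuse the sampled features before they reach the detector. This is a sketch under our assumptions, not the exact block used in the paper.

```python
# Sketch of a spatiotemporal sampling block (assumed design, for illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatiotemporalSampling(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # Predict a 2-channel (dx, dy) offset field from concatenated features.
        self.offset_head = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)
        # Predict a scalar aggregation weight per support frame and location.
        self.weight_head = nn.Conv2d(2 * channels, 1, kernel_size=3, padding=1)

    def forward(self, feat_cur, feats_support):
        n, c, h, w = feat_cur.shape
        # Base sampling grid in normalized [-1, 1] coordinates (x, y order).
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=feat_cur.device),
            torch.linspace(-1, 1, w, device=feat_cur.device),
            indexing="ij",
        )
        base_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)

        sampled = [feat_cur]
        weights = [torch.zeros(n, 1, h, w, device=feat_cur.device)]
        for feat_sup in feats_support:
            pair = torch.cat((feat_cur, feat_sup), dim=1)
            offset = self.offset_head(pair).permute(0, 2, 3, 1)   # (N, H, W, 2)
            grid = base_grid + offset                              # displaced sampling grid
            sampled.append(F.grid_sample(feat_sup, grid, align_corners=True))
            weights.append(self.weight_head(pair))

        # Adaptive temporal aggregation: softmax over frames at each location.
        w_stack = torch.softmax(torch.stack(weights, dim=0), dim=0)
        f_stack = torch.stack(sampled, dim=0)
        return (w_stack * f_stack).sum(dim=0)   # aggregated features for the detector
```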
Figure 4. Sample results. Detections are color-coded by track ID; our framework successfully tracks individuals throughout the sequence.
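The greedy matching used for track-ID assignment (Figure 2, stage 3) can be illustrated with a box-IoU variant; for brevity this sketch matches on boxes only, although the actual tracker may also exploit keypoint similarity. All names here are illustrative, not the paper's implementation.

```python
# Greedy IoU-based track-ID assignment between consecutive frames (illustrative).
import itertools

_new_id = itertools.count(1)  # global track-ID generator for this sketch

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def greedy_match(prev_tracks, boxes, iou_thresh=0.3):
    """prev_tracks: list of (track_id, box); boxes: detections in the current frame.
    Returns a list of (track_id, box), one entry per detection."""
    candidates = sorted(
        ((iou(tb, b), ti, bi)
         for ti, (_, tb) in enumerate(prev_tracks)
         for bi, b in enumerate(boxes)),
        reverse=True,
    )
    used_tracks, matched = set(), {}
    for score, ti, bi in candidates:
        if score < iou_thresh or ti in used_tracks or bi in matched:
            continue
        used_tracks.add(ti)
        matched[bi] = prev_tracks[ti][0]          # inherit the previous track ID
    # Unmatched detections start new tracks.
    return [(matched[bi] if bi in matched else next(_new_id), b)
            for bi, b in enumerate(boxes)]
```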
Table 1. Upper bounds of the analytical framework.

       | Ours | ID   | Boxes | Landmarks | Boxes + Landmarks
MOTA   | 57.8 | 59.6 | 62.9  | 68.6      | 74.9
Table 2. Multi-person pose estimation performance on the COCO challenge dataset.

Method | Dataset | AP | AP (50) | AP (75) | AR | AR (50) | Latency (ms)
RTMO-s [42] | COCO | 67.7 | 87.8 | 73.7 | 71.5 | 90.8 | 18.5
RTMO-m [42] | COCO | 70.9 | 89.1 | 77.8 | 74.7 | 92.1 | 22.6
RTMO-l [42] | COCO | 72.4 | 89.5 | 78.8 | 76.2 | 92.6 | 37.5
RTMO-t [42] | mix | 56.7 | 80.1 | 61.2 | 61.1 | 83.6 | 14.2
RTMO-s [42] | mix | 68.7 | 87.8 | 74.6 | 72.5 | 90.6 | 18.5
RTMO-m [42] | mix | 72.4 | 89.5 | 78.9 | 76.4 | 92.9 | 22.6
RTMO-l [42] | mix | 74.8 | 91.7 | 81.9 | 78.5 | 92.8 | 37.5
YOLO-POSE-t [12] | COCO | 51.9 | 78.3 | 55.6 | 57.1 | 83.5 | 32.2
YOLO-POSE-s [12] | COCO | 64.4 | 87.2 | 71.2 | 68.1 | 90.5 | 45.6
YOLO-POSE-m [12] | COCO | 69.5 | 89.7 | 76.4 | 73.7 | 92.6 | 51.3
YOLO-POSE-l [12] | COCO | 71.3 | 90.8 | 78.4 | 74.9 | 92.8 | 60.9
debias-ipr-resnet-50 [43] | COCO | 67.5 | 87.2 | 71.4 | 76.5 | 91.4 | 89.5
SimCC-s-vipnas-mobilenetv3 [44] | COCO | 69.8 | 88.2 | 72.7 | 75.6 | 92.7 | 39.5
Ours: ResNet-50 | COCO | 76.7 | 92.5 | 83.2 | 79.9 | 94.1 | 59.2
Ours: ResNet-50 | mix | 78.5 | 93.1 | 84.9 | 80.2 | 95.1 | 59.2
Table 3. Multi-person pose estimation performance on the PoseTrack challenge dataset (mAP per joint).

Method | Dataset | Head | Shou | Elb | Wri | Hip | Knee | Ankl | Total
Girdhar et al. [7] | validation | 67.5 | 70.2 | 62.0 | 51.7 | 60.7 | 58.7 | 49.8 | 60.6
Xiu et al. [16] | validation | 66.7 | 73.3 | 68.3 | 61.1 | 67.5 | 67.0 | 61.3 | 66.5
Xiao et al. [26]: ResNet-50 | validation | 79.1 | 80.5 | 75.5 | 66.0 | 70.8 | 70.0 | 61.7 | 72.4
Xiao et al. [26]: ResNet-152 | validation | 81.7 | 83.4 | 80.0 | 72.4 | 75.3 | 74.8 | 67.1 | 76.7
Ours: ResNet-50 | validation | 78.7 | 80.8 | 76.8 | 68.5 | 70.6 | 70.6 | 62.8 | 73.1
Ours: ResNet-50 | validation | 77.4 | 79.5 | 76.9 | 70.4 | 72.9 | 73.4 | 65.7 | 73.9
Ours: ResNet-152 | validation | 81.1 | 83.7 | 79.9 | 72.5 | 75.8 | 75.6 | 67.6 | 77.0
Xiu et al. [16] | testing | 64.9 | 67.5 | 65.0 | 59.0 | 62.5 | 62.8 | 57.9 | 63.0
Xiao et al. [26]: ResNet-50 | testing | 76.4 | 77.2 | 72.2 | 65.1 | 68.5 | 66.9 | 60.3 | 70.0
Xiao et al. [26]: ResNet-152 | testing | 79.5 | 79.7 | 76.4 | 70.7 | 71.6 | 71.3 | 64.9 | 73.9
Ours: ResNet-152 | testing | 79.8 | 80.0 | 82.0 | 76.6 | 71.7 | 78.0 | 65.6 | 74.2
Table 4. Multi-person pose tracking performance on the PoseTrack challenge dataset (MOTA per joint).

Method | Dataset | Head | Shou | Elb | Wri | Hip | Knee | Ankl | Total
Girdhar et al. [7] | validation | 61.7 | 65.5 | 57.3 | 45.7 | 54.3 | 53.1 | 45.7 | 55.2
Xiu et al. [16] | validation | 59.8 | 67.0 | 59.8 | 51.6 | 60.0 | 58.4 | 50.5 | 58.3
Xiao et al. [26]: ResNet-50 | validation | 72.1 | 74.0 | 61.2 | 53.4 | 62.4 | 61.6 | 50.7 | 62.9
Xiao et al. [26]: ResNet-152 | validation | 73.9 | 75.9 | 63.7 | 56.1 | 65.5 | 65.1 | 53.5 | 65.4
Ours: ResNet-50 | validation | 71.1 | 73.6 | 62.4 | 54.0 | 62.5 | 60.1 | 48.9 | 62.8
Ours: ResNet-50 | validation | 72.8 | 74.4 | 65.6 | 55.2 | 63.9 | 63.6 | 52.8 | 63.9
Ours: ResNet-152 | validation | 73.3 | 74.2 | 65.2 | 57.9 | 64.8 | 68.8 | 54.7 | 65.8
Xiu et al. [16] | testing | 52.0 | 57.4 | 52.8 | 46.6 | 51.0 | 51.2 | 45.3 | 51.0
Xiao et al. [26]: ResNet-50 | testing | 65.9 | 67.0 | 51.5 | 48.0 | 56.2 | 54.6 | 46.9 | 56.4
Xiao et al. [26]: ResNet-152 | testing | 67.1 | 68.4 | 52.2 | 48.9 | 56.1 | 56.6 | 48.8 | 57.6
Ours: ResNet-152 | testing | 66.6 | 68.1 | 56.1 | 50.2 | 57.3 | 55.6 | 43.7 | 57.3
Table 5. Comparing different person detection models integrated with the spatiotemporal sampling module.

Detector | Backbone | Sampling? 1 | mAP | MOTA
R-FCN | ResNet-50 | × | 67.2 | 57.8
R-FCN | ResNet-50 | √ | 71.1 | 61.8
Cascade R-CNN | ResNet-101 | × | 73.1 | 58.3
Cascade R-CNN | ResNet-101 | √ | 73.5 | 62.8
Faster R-CNN | ResNet-50 | × | 69.8 | 58.9
Faster R-CNN | ResNet-50 | √ | 72.7 | 60.4
Faster R-CNN | ResNet-50 DCN + FPN | × | 71.8 | 60.3
Faster R-CNN | ResNet-50 DCN + FPN | √ | 73.7 | 63.1

1 The symbol "×" indicates that the detector is not integrated with the spatiotemporal sampling module, whereas "√" indicates that it is.
Table 6. Results of inference latency on a single GPU (relative ratios per batch size).

Method | batch = 2 | batch = 4 | batch = 8 | batch = 16
ST-baseline | – | – | – | –
ST (ours) | 1.46× | 1.39× | 1.28× | 1.17×
Lite-Pose | 1.01× | 0.98× | 0.96× | 0.95×
VoxelTrack | 1.24× | 1.19× | 1.15× | 1.09×
AOP | 1.19× | 1.13× | 1.11× | 1.05×
Location-free | 1.12× | 1.08× | 1.05× | 1.01×
Location-Global | 1.15× | 1.13× | 1.11× | 1.07×
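Speedup ratios of this kind are typically obtained by timing each model at a fixed batch size and dividing the baseline's mean per-batch latency by the candidate's. The harness below is a hypothetical example of such a measurement, not the benchmarking script used to produce Table 6.

```python
# Hypothetical latency/speedup measurement harness (illustrative only).
import time
import torch

@torch.no_grad()
def mean_latency_ms(model, batch_size, image_size=(3, 512, 512), runs=50, warmup=10):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(batch_size, *image_size, device=device)
    for _ in range(warmup):                 # warm-up iterations are excluded from timing
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0

def speedup(baseline_model, candidate_model, batch_sizes=(2, 4, 8, 16)):
    """Ratio > 1 means the candidate is faster than the baseline at that batch size."""
    return {bs: mean_latency_ms(baseline_model, bs) / mean_latency_ms(candidate_model, bs)
            for bs in batch_sizes}
```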
Table 7. Different designs of the feature sampling and aggregation module (Nfeat: ResNet-101; Ndet: Cascade R-CNN). The four variants differ in which design options they adopt 1: multi-frame feature aggregation, adaptive weights, and spatiotemporal sampling.

mAP  | 73.1 | 71.4 | 73.0 | 73.5
MOTA | 58.3 | 57.3 | 58.1 | 62.8

1 The symbol "√" indicates that the framework adopts the corresponding design of the feature sampling and aggregation module.
Table 8. Results of different designs of the embedding.

Setting | Design (A) | Design (B) | Design (C) | Design (D)
Layer #1 | conv1-512 | conv1-512 | conv1-512 | conv1-512
Layer #2 | conv3-512 | conv3-512 | conv3-512 | conv3-512
Layer #3 | conv1-2048 | conv1-4096 | conv3-1024 | conv3-1024
Layer #4 | – | – | conv1-2048 | conv3-1024
Layer #5 | – | – | – | conv1-2048
mAP | 72.9 | 73.0 | 73.1 | 73.1
MOTA | 58.2 | 58.2 | 58.3 | 58.3
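Reading "convK-C" as a K × K convolution with C output channels, Design (D) could be instantiated as follows. The input channel count and the ReLU activations between layers are our assumptions, since the table specifies only the convolution shapes.

```python
# Sketch of the Design (D) embedding from Table 8 (assumed input channels and activations).
import torch.nn as nn

def make_embedding_d(in_channels=2048):
    return nn.Sequential(
        nn.Conv2d(in_channels, 512, kernel_size=1), nn.ReLU(inplace=True),       # conv1-512
        nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),    # conv3-512
        nn.Conv2d(512, 1024, kernel_size=3, padding=1), nn.ReLU(inplace=True),   # conv3-1024
        nn.Conv2d(1024, 1024, kernel_size=3, padding=1), nn.ReLU(inplace=True),  # conv3-1024
        nn.Conv2d(1024, 2048, kernel_size=1),                                    # conv1-2048
    )
```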
Table 9. Results of using different numbers of frames.

# training frames  | 2    | 2    | 2    | 2    | 2    | 4    | 4    | 4    | 4    | 4
# inference frames | 7    | 9    | 11   | 13   | 15   | 7    | 9    | 11   | 13   | 15
mAP                | 72.0 | 72.5 | 72.8 | 73.1 | 73.1 | 72.1 | 72.5 | 72.9 | 73.1 | 73.1
MOTA               | 57.8 | 58.1 | 58.2 | 58.3 | 58.3 | 57.9 | 58.1 | 58.2 | 58.3 | 58.3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
