1. Introduction
Human pose estimation (HPE) is a computer vision technique for identifying and classifying key joints of the human body in images or videos. In traditional pose estimation methods, various types of equipment (such as sensors) are attached to the subject to directly acquire physical movement information, which is then used to estimate the pose. These methods can accurately capture human movements in real time but require significant time and resources; thus, they are mainly used in restricted environments, like laboratories.
Recently, with rapid advancements in deep learning technology and the emergence of large-scale pose estimation datasets, there has been continuous research based on deep learning, focusing on accurate HPE without special equipment. Thus, there have been breakthroughs in overcoming the HPE limitations. As the accuracy and robustness of pose estimation improve, its applications are expanding to various fields, such as posture correction, behavior recognition, abnormal behavior detection, and augmented reality [1].
Various pose estimation models are considerably accurate, from convolutional neural network-based models, such as the stacked hourglass network [2] and HRNet [3], to the recently introduced transformer-based models, such as ViTPose [4] and TransPose [5]. However, owing to the high complexity of these models, high-speed inference in low-power and low-performance environments, such as mobile devices, is limited. Generally, high-performance HPE models are characterized by high memory usage and computational costs during training and inference, necessitating hardware equipped with high-performance graphics processing units (GPUs). Meanwhile, with the proliferation of smartphones and various Internet of Things devices, coupled with the rising popularity of edge computing, the demand for deep learning models capable of high-speed pose estimation even on devices with limited computational capability, memory, and power has increased.
Therefore, this study proposes SMS-Net based on the encoder–decoder structure of the hourglass network, which applies various lightweight techniques to enable high-speed pose estimation while requiring minimal storage space and computation. Specifically, to improve the internal structure of the encoder and decoder constituting the individual hourglass networks, a building block is proposed for extracting key features necessary for pose estimation and generating heatmaps that indicate the likely regions where keypoints exist. Subsequently, we conducted experiments to demonstrate its effectiveness. In the encoder, lightweight convolution operations, including depthwise convolution, were applied to reduce the computational complexity of conventional convolution operations. Additionally, instead of performing convolution operations on all channels of the input feature map, the input channels were split in half, and the convolution operations were performed only on the first half; the results were concatenated with the second half without channel loss, followed by channel shuffle. This reduces the overall computation while preventing performance degradation in keypoint prediction.
For the decoder, the different types of spatial information contained in the input image were captured to restore the meaningful feature maps necessary for keypoint extraction. Particularly, a multi-dilation block composed of depthwise convolutions with various dilation values was applied to maintain computational efficiency while integrating spatial information at different scales. This approach prevented overfitting while using fewer weights compared to simply increasing the kernel size. Consequently, by applying the aforementioned methods, the proposed SMS-Net model significantly reduced floating-point operations and memory access costs. Additionally, the inference time was improved while maintaining the pose estimation accuracy by minimizing the number of weights and expanding the receptive field of the decoder. Moreover, through various experiments using the widely used MPII Human Pose [6] and COCO [7] datasets, the effectiveness of the lightweight techniques applied in this study was demonstrated, and the proposed system achieved a better balance between computational efficiency and performance than existing models.
The first contribution of this study is the SMS-Net model, which applies various lightweight techniques to minimize floating-point operations and memory access costs, using fewer weights for pose feature extraction. Additionally, the performance of the proposed model was validated using representative public datasets used for pose estimation, and its effectiveness and inference efficiency were demonstrated through comparisons with existing research results.
The remainder of the paper is structured as follows: Section 2 introduces the latest deep learning-based pose estimation algorithms, and Section 3 provides a detailed description of the structure of the proposed SMS-Net and the applied lightweight techniques. Section 4 presents the implementation and training of the model, along with the experimental results using the MPII and COCO datasets. Finally, Section 5 summarizes the conclusions of the study.
2. Related Work
With the rapid advancement of deep learning technology, relevant research in the field of pose estimation has been actively conducted, which has continuously improved its accuracy and efficiency. This section explains the main features of models that have achieved state-of-the-art (SOTA) pose estimation and reviews representative deep learning-based lightweight pose estimation models.
The stacked hourglass network [2] is a representative deep-learning network model used for HPE. This model can progressively extract high-resolution and low-resolution features from input images through repeated down- and up-sampling processes. Each hourglass module analyzes the image by compressing features to progressively smaller resolutions and restoring them to the original size to recover detailed positional information. During this process, skip connections are used between the encoder and decoder to transfer precise information extracted by the encoder to the decoder, minimizing information loss and alleviating the vanishing gradient problem that is common in deep networks. Owing to this structure, the stacked hourglass network achieves accurate and consistent estimation results, especially for complex human poses.
The soft-gated skip connection [8] is a key component of a stacked hourglass network variant that improves the residual block [9]. The authors identified that modifying the residual block used within the encoder and decoder of the stacked hourglass network causes training limitations, and they introduced the soft gating mechanism to address this challenge. They used gated skip connections with learnable parameters, trained through backpropagation, to determine which channels in the feature map from the previous stage should be weighted more, ensuring that the most useful information was conveyed. However, this network still consists of multiple stacked modules, resulting in high computational costs and increased model size, which may be impractical in real-time or resource-constrained environments.
The cascaded pyramid network (CPN) [10] consists of GlobalNet and RefineNet to effectively predict occluded or overlapping keypoints. GlobalNet captures the global context and performs approximate pose estimation by extracting heatmaps from feature maps of various resolutions in a pyramid structure. RefineNet generates more precise keypoints using higher-resolution feature maps based on the initial heatmaps extracted by GlobalNet. Additionally, as training progresses, the model tends to focus on simpler keypoints while ignoring more challenging ones. To learn both types effectively, hard keypoints are selected for training through online hard keypoint mining. However, this multi-stage refinement process increases computational overhead and complicates training, making the model less suitable for resource-limited environments.
Bin et al. [11] proposed a method that simplified the existing complex structure. After processing the input image using a ResNet [9]-based backbone network, deconvolution [12] layers were added to restore the resolution and generate high-resolution keypoint heatmaps. This approach effectively simplifies pose estimation tasks and maintains accuracy; however, it still requires significant computational resources owing to the reliance on ResNet-based backbones and deconvolution layers. OpenPose [13] performs multi-person pose estimation: it uses a VGG-19-based [14] backbone to extract feature maps and, through initial and refinement stages, generates heatmaps representing keypoint probabilities and part affinity fields representing the connectivity between keypoints. LightWeight OpenPose [15] is a lightweight model based on OpenPose that uses MobileNetV1 [16] as the backbone to reduce computational costs and memory usage. Additionally, most computational processes for deriving heatmaps and part affinity fields in the refinement stage are shared, and a single prediction branch is used. The model is thus made lightweight while maintaining the pose estimation accuracy above a certain level.
Lite-HRNet [17] is based on HRNet [3] and uses the shuffle block from ShuffleNetV2 [18]. This block divides the input channels into two partitions, sequentially performs bottleneck operations on one partition, concatenates it with the other partition, and performs channel shuffle. The authors noted the high time complexity of 1 × 1 convolutions and replaced them with element-wise weighting operations to reduce this cost. Additionally, when creating new sub-branches in each branch of the original HRNet, the feature maps of all previous branches were combined, and convolutions were applied. Subsequently, they were redistributed to the existing and newly generated branches, facilitating information exchange across different scales. This approach improves model efficiency; however, relying on multiple branches and redistributing feature maps can increase the overhead.
3. Methodology
3.1. Model Architecture
Figure 1 illustrates the general structure of the proposed SMS-Net model, which consists mainly of two modules: the feature extractor and the stacked hourglass module. The feature extractor extracts key features from the input image using a feature extraction model. Afterward, 1 × 1 convolutions and up-sampling operations are performed, and the generated feature map is fed into the stacked hourglass module. This approach reduces resource consumption and improves the inference speed by using a relatively small input containing key feature information instead of directly processing the initial input image. Various feature extraction models can be used; however, EfficientNet [19], which has a relatively low computational cost, was used for efficient feature extraction in this study.
The stacked hourglass module is composed of multiple connected hourglass modules. Each hourglass module consists of an hourglass network for effectively extracting low- and high-level features from the input image and a Keypoint Prediction Head that uses the extracted feature maps to accurately predict keypoint locations. Each hourglass network is composed of an encoder and a decoder. To reduce the computational load while preventing reduced pose estimation performance, the encoder uses the shuffle-gated block, and the decoder applies the multi-dilation block; these are the main lightweight techniques proposed in this paper. The Keypoint Prediction Head consists of a shuffle-gated block and 1 × 1 convolutions and extracts heatmaps for each keypoint to be predicted from the feature maps extracted by the hourglass network to accurately predict keypoint locations.
The proposed SMS-Net model reduces the computation required for training and inference, as well as memory costs, by applying lightweight techniques to the hourglass network. The following sections describe the main features of the SMS-Net model, focusing on the lightweight techniques.
3.2. Feature Extractor
The feature extractor extracts abstract features at various levels from the input image and inputs them into the stacked hourglass module. This process must extract key features that are essential for pose estimation with high performance while minimizing resource consumption, including the computational load. Additionally, the feature extractor should not be model-specific. This study adopts EfficientNet [19], which is a deep learning-based feature extraction model that meets these requirements. EfficientNet is a neural network architecture designed using Neural Architecture Search to balance the size and performance of deep learning models, applying an efficient model scaling method to generate models of various sizes. The model used in this study is EfficientNet-B0, pre-trained on ImageNet-1K [20], which is known for its efficient feature extraction despite its lightweight structure.
Specifically, the feature map generated at the fourth stage of EfficientNet-B0 is processed through 1 × 1 convolutions to convert the channel dimension from 40 to 144, followed by up-sampling to generate the final feature map. The feature map obtained is then passed to the next stage, i.e., the stacked hourglass module. Using the final feature map generated at the last (eighth) stage of EfficientNet-B0 instead can cause excessive information loss owing to its low resolution. Therefore, this study used the feature map generated at the fourth stage, which experiments showed to perform best.
3.3. Stacked Hourglass Module
The stacked hourglass module extracts key information from the input image based on the feature map extracted by the feature extractor and represents the probability of keypoints existing at those locations as heatmaps. This study adopts a structure similar to that of the original hourglass network, which consists of the encoder and decoder (Figure 2). We applied various lightweight techniques to improve the internal structure of the encoder and decoder, thereby maintaining the pose estimation accuracy during inference while reducing resource use.
In the encoder section, a shuffle-gated block and max pooling are used to extract spatial feature maps of the input image at each resolution. Starting from the feature map size of the first encoder, the height and width are halved through max pooling at each encoder, down to the smallest feature map at Encoder block 5. In the decoder section, we combined the features extracted by the encoder and applied the multi-dilation block and up-sampling to progressively restore the original resolution and generate keypoint heatmaps. At each level, the decoder feature maps are the same size as those produced by the corresponding encoder, and they eventually match the size of the original input feature map.
Finally, by adjusting the number of hourglass networks, we can balance model accuracy and computational efficiency. Increasing the number of modules improves accuracy; however, fewer modules reduce inference time and complexity.
3.3.1. Shuffle-Gated Block
The existing pose estimation models with encoder–decoder structures, such as the stacked hourglass network [2] and soft-gated skip connection [8], propose unique encoder structures, typically composed of various convolution layers. These conventional convolution-based encoder blocks can effectively extract feature maps but require substantial computational resources. Our study proposes the shuffle-gated block to reduce the computational resources required for feature map extraction in the encoder (Figure 3).
The shuffle-gated block divides the input feature map (144 channels) in half and processes each part differently. The first half learns the importance of each channel through soft gating comprising learnable parameters and converts it into a weight. This converted weight is multiplied by each channel to generate a feature map.
In the second half, three feature maps are generated in succession. The first feature map is created by applying the Ghost module from GhostNetV2 [21] (Figure 4) to the input feature map, followed by depthwise convolution. This process efficiently extracts key features through the Ghost module and depthwise convolution while minimizing the computational load. The first feature map generated in this manner serves as input to create the second feature map. In the second process, a 1 × 1 convolution is followed by a depthwise convolution to generate a feature map. This feature map is then used to create a third feature map through an additional depthwise convolution. Finally, spatial weighting (explained in Section 3.3.3) is applied to the three feature maps generated in each process. These three feature maps are concatenated channel-wise with the feature map created in the first half, and their channels are swapped among the feature maps via channel shuffling to promote interaction and improve expressiveness.
Consequently, the shuffle-gated block combines soft gating, various convolution techniques, channel shuffling, and spatial weighting for efficient and elaborate feature extraction. This allows for the creation of optimal feature maps that minimize the computational load while reflecting high expressiveness and spatial importance.
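The split, gate, concatenate, and shuffle flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: each channel is reduced to a single number for readability, the convolution branch is replaced by placeholder values, and the names `channel_shuffle` and `soft_gate` as well as the choice of two shuffle groups are assumptions made for illustration.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def channel_shuffle(channels: Sequence[T], groups: int = 2) -> List[T]:
    """Interleave channels so that each group contributes in turn."""
    if groups <= 0 or len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    # Equivalent to reshaping to (groups, per_group), transposing, and flattening.
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


def soft_gate(values: Sequence[float], gates: Sequence[float]) -> List[float]:
    """Scale each channel by its own gate (one learnable scalar per channel)."""
    return [v * g for v, g in zip(values, gates)]


# Toy example with 4 channels; every value here is a placeholder.
gated_half = soft_gate([1.0, 2.0], gates=[0.5, 2.0])  # soft-gating branch
conv_half = [10.0, 20.0]                              # stand-in for the convolution branch
merged = gated_half + conv_half                       # channel-wise concatenation
shuffled = channel_shuffle(merged, groups=2)
print(shuffled)  # [0.5, 10.0, 4.0, 20.0]
```

After the shuffle, channels from the gated branch and the convolution branch alternate, so a later block that again splits the channels in half sees information from both branches.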
3.3.2. Multi-Dilation Block
Restoring high-resolution information requires a wide receptive field that still preserves local details and structures, allowing the network to understand the overall structure and context of the input image. Additionally, capturing semantic patterns and features at the pixel level enables more accurate predictions. Generally, increasing the size of the convolution kernel increases the receptive field, allowing for the collection of spatial information over a broader range. However, the computational load grows quadratically with the kernel size. Therefore, our study proposes a multi-dilation block based on dilated convolution to efficiently extract spatial information of various sizes while minimizing the increase in computational load (Figure 5). In the decoder of the proposed hourglass network, before the multi-dilation block is applied, the input feature map is combined with the feature map of the same resolution from the encoder via skip connections. Afterward, the multi-dilation block is used to restore detailed information concerning the entire feature map.
The dilated convolution operates by applying the convolution filter to each spatial position in the input feature map. Unlike the regular convolution, however, the elements of the filter are spaced apart. The dilation rate determines the spacing, and the receptive field size covered by the filter can be expanded by adjusting this rate. Expanding the receptive field strengthens the global context but may weaken the local context. Therefore, we used several depthwise convolutions with multiple dilation rates to develop a module that is robust both globally and locally.
The operation of the multi-dilation block is described as follows: First, the input feature map is divided in half. One half passes through the Ghost module and undergoes depthwise convolution with multiple dilation values. During this process, three feature maps are generated using different dilation rates and combined through element-wise addition, and spatial weighting is applied. The feature map generated in this way is connected with the remaining half of the input feature map, and a channel shuffle is performed to reflect the spatial information at various dilation rates. Consequently, the dilated convolution efficiently captures a wider context of the input image while maintaining computational efficiency by retaining the number of parameters in the convolution layers.
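To make the trade-off concrete, the following sketch compares the weight count of three dilated 3 × 3 depthwise convolutions with that of a single dense depthwise convolution covering the same spatial extent. The dilation rates (1, 2, 3), the 3 × 3 kernel, and the 72-channel width (half of 144) are assumptions chosen for illustration; the text above does not state the values used in SMS-Net. Biases are ignored.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a dilated kernel along one axis."""
    return dilation * (kernel_size - 1) + 1


def depthwise_weight_count(channels: int, kernel_size: int) -> int:
    """A depthwise convolution has one k x k filter per channel."""
    return channels * kernel_size * kernel_size


CHANNELS = 72          # assumption: half of a 144-channel feature map
KERNEL = 3             # assumption: 3 x 3 depthwise kernels
DILATIONS = (1, 2, 3)  # hypothetical dilation rates

for d in DILATIONS:
    extent = effective_kernel_size(KERNEL, d)
    print(f"dilation {d}: covers {extent} x {extent}")

# Three dilated branches; dilation does not change the number of weights.
multi_dilation_weights = len(DILATIONS) * depthwise_weight_count(CHANNELS, KERNEL)

# One dense depthwise kernel as large as the widest dilated branch.
largest_extent = effective_kernel_size(KERNEL, max(DILATIONS))
dense_weights = depthwise_weight_count(CHANNELS, largest_extent)

print(multi_dilation_weights, dense_weights)  # 1944 3528
```

Under these assumptions, the three dilated branches together cover 3 × 3, 5 × 5, and 7 × 7 regions with fewer weights than one dense 7 × 7 depthwise kernel.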
3.3.3. Spatial Weighting
Generally, depthwise convolution applies filters independently to each channel, resulting in a lack of interaction between channels. Therefore, many existing studies have performed a 1 × 1 convolution on feature maps generated through depthwise convolution to combine all the channels and create new information, enhancing channel interaction. However, this operation requires significant computational resources. Spatial weighting [17] was used to minimize the computational cost in the proposed model (Figure 5). The operation of spatial weighting is described as follows: The spatial information of each channel is aggregated using global average pooling, and a 1 × 1 convolution with a ReLU activation function is then used to reduce the channels. The reduced channels are restored using a 1 × 1 convolution with a sigmoid activation function. Afterward, the resulting weights are applied to the initial input feature map through element-wise multiplication, effectively integrating the feature map while minimizing the computational cost.
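The four steps of spatial weighting can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' code: the function name `spatial_weighting`, the nested-list layout `[channel][row][column]`, and the omission of bias terms are assumptions. On a globally pooled map, a 1 × 1 convolution reduces to a matrix-vector product, which is what the sketch uses.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def spatial_weighting(
    fmap: FeatureMap,
    w_reduce: List[List[float]],   # shape [reduced_channels][channels]
    w_restore: List[List[float]],  # shape [channels][reduced_channels]
) -> FeatureMap:
    # 1) Global average pooling: one scalar per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in fmap
    ]
    # 2) 1 x 1 convolution that reduces the channels, followed by ReLU.
    reduced = [
        max(0.0, sum(w * p for w, p in zip(weights, pooled)))
        for weights in w_reduce
    ]
    # 3) 1 x 1 convolution that restores the channels, followed by sigmoid.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * r for w, r in zip(weights, reduced))))
        for weights in w_restore
    ]
    # 4) Element-wise multiplication of each channel with its gate.
    return [
        [[gate * value for value in row] for row in channel]
        for gate, channel in zip(gates, fmap)
    ]


# Toy input: 2 channels of size 1 x 2, reduced to 1 channel and restored.
# With all-zero weights, every gate is sigmoid(0) = 0.5.
toy = [[[2.0, 4.0]], [[6.0, 8.0]]]
out = spatial_weighting(toy, w_reduce=[[0.0, 0.0]], w_restore=[[0.0], [0.0]])
print(out)  # [[[1.0, 2.0]], [[3.0, 4.0]]]
```

The two small matrix products operate on one value per channel instead of on every spatial position, which is why this step is cheaper than a full 1 × 1 convolution over the feature map.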
4. Experimental Results
4.1. Datasets
In this study, we evaluated the performance of the proposed SMS-Net model using the MPII and COCO datasets—commonly used benchmarks for HPE. The MPII dataset consists of approximately 25,000 images and 40,000 annotated people, covering 410 human activities. These images are obtained from online video platforms and offer diverse poses, making it a challenging dataset because of its diversity in human activities and varied contexts. The MPII dataset provides annotations for 16 keypoints, where the joints are connected in a straight line from the crown to the chest. The COCO dataset is a large-scale dataset widely used in computer vision tasks, such as object detection, segmentation, and pose estimation. It comprises approximately 120,000 images and 17 keypoints, including facial features, such as eyes, nose, and ears, representing human poses. COCO offers a higher complexity owing to its diverse object categories, challenging backgrounds, and the variety of human poses in everyday scenes.
4.2. Training Details
The proposed SMS-Net was trained using a single NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) with 24 GB memory. The model was trained for 250 epochs with a batch size of 24 using the Adam optimizer. The initial learning rate was 2 × 10⁻³ and was multiplied by 0.1 at epochs 170 and 200. Additionally, the input image sizes were adjusted for each dataset. The input images for the COCO and MPII datasets were resized to 256 × 192 and 256 × 256, respectively. Furthermore, data augmentation techniques, such as random rotation within a range of [−30°, 30°], random scaling within [0.75, 1.25], and horizontal and vertical flips, were applied to both datasets.
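The step-wise learning rate schedule described above can be written as a small function. This is a sketch under stated assumptions: epochs are counted from zero, the reduction takes effect from the milestone epoch onward, and the function name `learning_rate` is illustrative.

```python
def learning_rate(
    epoch: int,
    base_lr: float = 2e-3,
    milestones: tuple = (170, 200),
    gamma: float = 0.1,
) -> float:
    """Return the learning rate for a given epoch under a multi-step decay."""
    # Count how many milestones have been reached (assumption: inclusive).
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


for epoch in (0, 169, 170, 199, 200, 249):
    print(epoch, learning_rate(epoch))
```

Under these assumptions, the rate is 2 × 10⁻³ for epochs 0 to 169, about 2 × 10⁻⁴ for epochs 170 to 199, and about 2 × 10⁻⁵ from epoch 200 until training ends.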
4.3. Performance Metrics
The number, type, and location of keypoints in both datasets differed; thus, the evaluation metrics were also different. The evaluation metric for the MPII dataset is the percentage of correct keypoints (PCK). PCK considers a keypoint prediction correct if the distance between the predicted keypoint and ground truth coordinates is less than a specified threshold. We used PCKh@0.5, with a threshold of 0.5 based on the person’s head size (Equations (1) and (2)).

$$d_i = \sqrt{(x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2} \tag{1}$$

$$\mathrm{PCKh@0.5} = \frac{1}{K} \sum_{i=1}^{K} \delta\left(d_i < 0.5\,h\right) \tag{2}$$

Here, $x_i$ and $y_i$ denote the coordinates of the $i$-th ground truth keypoint, and $\hat{x}_i$ and $\hat{y}_i$ denote those of the $i$-th predicted keypoint. Additionally, $d_i$ represents the distance between the predicted and ground truth coordinates of joint $i$, and $h$ represents the length between the keypoints composing the head and neck. $K$ denotes the number of keypoints, and $\delta$ is an indicator function that is 1 if the condition is true and 0 if the condition is false.
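The PCKh computation described above can be sketched as follows. The function name `pckh`, the argument layout (lists of (x, y) tuples), and the strict inequality are assumptions that follow the wording "less than a specified threshold"; some evaluation scripts use "less than or equal" instead. The coordinates in the example are made up for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def pckh(
    predicted: List[Point],
    ground_truth: List[Point],
    head_size: float,
    threshold: float = 0.5,
) -> float:
    """Fraction of keypoints whose error is below threshold * head_size."""
    if not ground_truth or len(predicted) != len(ground_truth):
        raise ValueError("need the same non-zero number of keypoints")
    correct = 0
    for (px, py), (gx, gy) in zip(predicted, ground_truth):
        distance = math.hypot(px - gx, py - gy)  # Euclidean distance d_i
        if distance < threshold * head_size:     # indicator function
            correct += 1
    return correct / len(ground_truth)


# Toy example: head size 12 gives a threshold of 6 pixels.
pred = [(0.0, 0.0), (10.0, 0.0), (3.0, 4.0)]
truth = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
score = pckh(pred, truth, head_size=12.0)
print(score)  # errors are 0, 10, and 5 pixels, so 2 of 3 keypoints count
```

Multiplying the returned fraction by 100 gives the percentage form in which PCKh scores are usually reported.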
The COCO dataset uses object keypoint similarity (OKS) (Equation (3)) to evaluate the predicted pose and assess model performance at various thresholds, using average precision and recall.

$$\mathrm{OKS} = \frac{\sum_{i} \exp\left(-\dfrac{d_i^2}{2 s^2 k_i^2}\right) \delta(v_i > 0)}{\sum_{i} \delta(v_i > 0)} \tag{3}$$

Here, $i$ is the joint index, and $d_i$ represents the distance between the predicted and ground truth coordinates of joint $i$. The variable $s$ denotes the person’s size, $k_i$ is a constant for each joint type, and $v_i$ indicates whether the joint is annotated. As in PCKh, $\delta$ is an indicator function. The calculated OKS ranges from 0 to 1 and evaluates the model performance by assessing the Euclidean distance and vector similarity between keypoints.
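A minimal sketch of the OKS computation follows, using the general form in which each annotated joint contributes exp(−d² / (2 s² k²)) and the sum is normalized by the number of annotated joints. The function name `oks` and the argument layout are assumptions; the official COCO evaluation derives the scale from the object area and uses fixed per-joint constants, which are not reproduced here. The numbers in the example are made up.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def oks(
    predicted: List[Point],
    ground_truth: List[Point],
    visibility: List[int],   # v_i > 0 means the joint is annotated
    scale: float,            # person size s
    constants: List[float],  # per-joint constants k_i
) -> float:
    """Object keypoint similarity averaged over annotated joints."""
    total = 0.0
    annotated = 0
    for (px, py), (gx, gy), v, k in zip(predicted, ground_truth, visibility, constants):
        if v > 0:
            squared_distance = (px - gx) ** 2 + (py - gy) ** 2
            total += math.exp(-squared_distance / (2.0 * scale ** 2 * k ** 2))
            annotated += 1
    return total / annotated if annotated else 0.0


# Toy example: the second joint is not annotated, so its large error is ignored.
value = oks(
    predicted=[(0.0, 0.0), (5.0, 5.0)],
    ground_truth=[(0.0, 0.0), (100.0, 100.0)],
    visibility=[1, 0],
    scale=1.0,
    constants=[0.1, 0.1],
)
print(value)  # 1.0
```

A perfect prediction on every annotated joint yields 1, and the value decays toward 0 as the errors grow relative to the person size and the per-joint constants.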
Subsequently, the average precision (AP) and average recall (AR) were measured. AP is the precision averaged over OKS thresholds from 0.5 to 0.95 in steps of 0.05 (AP50 and AP75 represent the precision at OKS thresholds of 0.5 and 0.75, respectively). AR is the recall averaged over the same OKS thresholds of 0.5–0.95.
APM and APL represent the average precisions for medium (32 × 32–96 × 96) and large (≥96 × 96) objects, respectively, at OKS thresholds of 0.5–0.95. These metrics were used to evaluate pose estimation model performance and compare the precision and recall at various OKS values.
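The averaging over OKS thresholds can be sketched as below, assuming the standard COCO convention of ten thresholds from 0.50 to 0.95 in steps of 0.05. The helper names and the per-threshold values are placeholders for illustration only, not results from this study; computing the precision at a single threshold (matching predictions to ground truth and integrating the precision-recall curve) is outside the scope of the sketch.

```python
from typing import Dict, List


def oks_thresholds(start: float = 0.50, stop: float = 0.95, step: float = 0.05) -> List[float]:
    """OKS thresholds 0.50, 0.55, ..., 0.95 (assumed COCO convention)."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


def average_over_thresholds(precision_at: Dict[float, float]) -> float:
    """Mean of the per-threshold precision values."""
    thresholds = oks_thresholds()
    return sum(precision_at[t] for t in thresholds) / len(thresholds)


# Placeholder input: the same made-up precision at every threshold.
placeholder = {t: 0.5 for t in oks_thresholds()}
print(oks_thresholds())
print(average_over_thresholds(placeholder))  # 0.5
```

In this scheme, AP50 and AP75 are simply the entries at thresholds 0.5 and 0.75, whereas AP summarizes all ten entries.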
We took measurements 10 times for each model and used the average values to compare the performances and minimize the randomness in deep learning models.
4.4. Results
The model performance was measured using the performance metrics introduced in the previous section on the MPII and COCO validation datasets for detecting and representing pose keypoints in the target images during inference. We compared performances using four models: SMS-Net-Tiny (one stack), SMS-Net-Small (two stacks), SMS-Net-Large (four stacks), and SMS-Net-Huge (eight stacks).
4.4.1. MPII
Table 1 presents the performance metrics of pose estimation models trained on the MPII dataset, where all models were trained using 256 × 256 images. SMS-Net-Large achieved a PCKh@0.5 and PCKh@0.1 of 87.7 and 33.7, respectively, outperforming other lightweight pose estimation models with fewer parameters and comparable FLOPs. Furthermore, SMS-Net-Large had similar scores compared to models such as LMFormer-L, EL-HRNet-W32, and Greit-HRNet-30 while maintaining fewer parameters and FLOPs, highlighting its efficiency.
Similarly, SMS-Net-Small achieved a PCKh@0.5 score of 85.6, surpassing LMFormer-T, which has the same FLOPs, but scoring slightly lower than Lite-HRNet-18, Dite-HRNet-18, and Greit-HRNet-18. However, SMS-Net-Small had a higher PCKh@0.1 score of 31.6, compared to 29.5 and 31.1 of Lite-HRNet-18 and Dite-HRNet-18, respectively. These results indicate that, in terms of PCKh@0.5, the lightweight hourglass architecture outperforms the lightweight HRNet architectures in large models despite its lower performance in small networks. Regarding PCKh@0.1, however, both the small and large models outperformed other comparable models. Notably, the small model achieved a PCKh@0.1 score nearly identical to Dite-HRNet-30 despite having significantly fewer FLOPs and parameters.
4.4.2. COCO
Table 2 compares the AP, AR, and FLOPs of the proposed SMS-Net models with SOTA lightweight pose estimation models in the COCO validation dataset. All models were evaluated using input images sized 256 × 192. The SMS-Net-Small and SMS-Net-Large models achieved 86.4 and 89.4 precision at AP50, respectively.
Thus, SMS-Net-Large achieved the best AP50 performance compared to all other models. However, aside from AP50, its performances in all other evaluation metrics, including AP and AR, were lower than those of other models with similar parameters and FLOPs. Therefore, the lightweight hourglass model lags in other evaluation metrics compared to other lightweight architectures, despite performing well in AP50 on the COCO dataset.
Table 3 presents the performance of the proposed SMS-Net models using the COCO test-dev dataset. SMS-Net-Huge demonstrated superior performance to Lite-HRNet-30 in AP and AP50. Similarly, the SMS-Net-Large model demonstrated results comparable to those of Lite-HRNet-18. Furthermore, the SMS-Net-Huge delivered slightly lower performance than EL-HRNet-W32; however, the differences in parameters and FLOPs were substantial.
4.5. Ablation Study
We conducted an ablation study using the MPII and COCO validation datasets to assess the impact of each module. Specifically, we compared the performance of various backbones, evaluated the use of soft-gated versus shuffle-gated blocks, and analyzed the performance and speed between the multi-dilation and 3 × 3 convolution blocks in the decoder process. These experiments systematically examined the effect of each module on the overall network performance and efficiency, as presented in Tables 4–7.
4.5.1. Backbone
We conducted an ablation study to compare the performance of various backbone networks. Along with EfficientNet-B0, the backbone used in this study, we experimented with MobileNetV2 and MobileNetV3.
Table 4 and Table 5 reveal that for the MPII and COCO datasets, the FLOPs of EfficientNet-B0 and MobileNetV2 were nearly identical. MobileNetV2 had slightly fewer parameters; however, the model using EfficientNet-B0 as the backbone had slightly better accuracy.
4.5.2. Soft-Gated vs. Shuffle-Gated Block
We compared the performance of the existing encoder block (soft-gated block) with the proposed shuffle-gated block. The shuffle-gated block performed slightly lower than the soft-gated block; nonetheless, it maintained high accuracy with considerably fewer FLOPs.
Table 6 and Table 7 reveal that the small model in the MPII dataset experienced a 2.1% decrease in accuracy at PCKh@0.5, while the large model had a 0.7% reduction. Similarly, on the COCO dataset, the precision dropped by 3.1% and 1.3% at AP50, respectively. However, the reduced parameters and FLOPs demonstrate that the encoder was effectively made more lightweight while minimizing the performance loss.
4.5.3. Decoder 3 × 3 Convolution vs. Multi-Dilation Block
We compared the results of applying the conventional 3 × 3 convolution after up-sampling in the decoder block (the stacked hourglass network) with those of replacing it with the multi-dilation block.
Table 6 and Table 7 reveal that the PCKh@0.5 for the MPII dataset decreased by 1.6% and 0.2% for the small and large models, respectively. The precision at AP50 decreased by 2.0% and 1.3%, respectively, for the COCO dataset. However, the difference in the parameters and FLOPs demonstrates a reduction in computation while minimizing the performance decline.
5. Conclusions
This study introduced a lightweight network suitable for HPE. The soft-gated skip connection network, a SOTA model in HPE, demonstrated high performance by weighting important channels, but it still had high FLOPs compared to other lightweight models. We proposed a lightweight pose estimation model based on the sequentially stacked structure of the hourglass network to address this issue. Particularly, a shuffle-gated block was introduced to reduce the computational load and number of parameters during the feature extraction of the encoder of each hourglass network. Additionally, a multi-dilation block was used in the decoder to secure receptive fields of various scales while limiting the increase in computational load. The results on two datasets (i.e., MPII and COCO) demonstrated that the proposed model achieved an improved balance between computational efficiency and performance compared to existing models in HPE.
Our future research will focus on simplifying complex structures, such as the shuffle-gated and multi-dilation blocks, to enhance model efficiency and improve performance. Furthermore, collecting and using additional pose estimation datasets will allow the model’s generalizability to be measured more accurately and its effectiveness to be verified in diverse environments and conditions.
Additionally, we deployed SMS-Net on mobile devices, which demonstrated satisfactory performance. However, comprehensive optimization has not yet been performed. Future studies will involve refining the model for better efficiency on these devices, including conducting hardware benchmarks to evaluate its performance under real-world conditions. These research directions are expected to improve the model’s usefulness across various fields by equipping it with better performance and generalization capabilities.