Article

GDE-Pose: A Real-Time Adaptive Compression and Multi-Scale Dynamic Feature Fusion Approach for Pose Estimation

Faculty of Data Science, City University of Macau, Macau 999078, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(23), 4837; https://doi.org/10.3390/electronics13234837
Submission received: 18 November 2024 / Revised: 2 December 2024 / Accepted: 5 December 2024 / Published: 7 December 2024

Abstract
This paper introduces a novel lightweight pose estimation model, GDE-pose, which addresses the trade-off between accuracy and computational efficiency in existing models. GDE-pose builds upon the baseline YOLO-pose model by incorporating a Ghost Bottleneck, a Dynamic Feature Fusion Module (DFFM), and ECA Attention to achieve more effective feature representation and selection. The Ghost Bottleneck reduces computational complexity, DFFM enhances multi-scale feature fusion, and ECA Attention optimizes the selection of key features. GDE-pose improves pose estimation accuracy while preserving real-time performance. Experimental results demonstrate that GDE-pose achieves higher accuracy on the COCO dataset with a substantial reduction in parameters, over 80% fewer FLOPs, and an inference speed of 31 FPS, underscoring its lightweight and real-time capabilities. Ablation studies confirm the independent contribution of each module to the model’s overall performance. GDE-pose’s design highlights its broad applicability in real-time pose estimation tasks.

1. Introduction

Pose estimation, a key technology in computer vision, plays an essential role across various application domains, particularly in real-time scenarios such as robotic navigation, health monitoring, and sports analysis. By detecting and analyzing the spatial positions of human joints, pose estimation enables accurate tracking and interpretation of dynamic activities [1]. For instance, in health monitoring, precise pose recognition can detect anomalies such as falls among the elderly or patients, providing caregivers with real-time alerts and thus reducing potential risks [2]. Additionally, in robotic navigation, real-time pose estimation endows machines with adaptability to complex environments, allowing them to respond and make more intelligent decisions during human interactions. Consequently, accurate and efficient pose estimation is critical for many real-world applications, especially those requiring real-time performance.
Although mainstream models like YOLO-pose have achieved significant advances in pose estimation, their high computational and storage costs severely limit their application in resource-constrained devices. For example, while the YOLO-pose model demonstrates outstanding accuracy, its complex computational architecture and high parameter count make it unsuitable for deployment on mobile or embedded systems. Specifically, the computational burden of such models leads to increased inference latency, and the elevated memory consumption further restricts their practicality in low-power settings. These limitations are particularly pronounced in applications that demand real-time and efficient pose estimation, compelling researchers to balance between model compactness and performance [3]. Therefore, developing a pose estimation model that maintains high accuracy under lightweight constraints has become a pressing issue in computer vision research.
This paper is motivated by the challenge of designing a pose estimation method that achieves model compactness while retaining robust feature representation capabilities. Compared with the traditional YOLO-pose model, we aim to significantly reduce computational cost and optimize the model structure to meet the demands of resource-limited devices. Efficient feature extraction and fusion are critical; however, current lightweight methods often sacrifice some feature representation ability, leading to performance degradation. Our approach seeks to overcome this limitation by constructing a model that combines efficiency with high accuracy, addressing the dual demands for real-time performance and precision in pose estimation and thus promoting broad applicability across various scenarios.
This paper introduces an improved lightweight pose estimation model, GDE-pose, which enhances the efficiency and feature representation of YOLO-pose by integrating Ghost Bottleneck [4], a Dynamic Feature Fusion Module (DFFM), and an ECA Attention module [5]. Specifically, the Ghost Bottleneck replaces the original Bottleneck structure in YOLO-pose, reducing computational load; the DFFM module added to the Neck facilitates dynamic multi-scale feature fusion, improving recognition of complex poses; and the ECA Attention module, incorporated in the Head, further strengthens selective expression of key features, thereby boosting pose estimation accuracy. These improvements enable the GDE-pose model to achieve an optimal balance between lightweight design and performance enhancement, demonstrating significant application potential.
The remainder of this paper is organized as follows: Section 2 provides an overview of related research on pose estimation, including developments in lightweight models and feature fusion techniques; Section 3 describes the architecture and improvement methods of the GDE-pose model in detail; Section 4 presents the experimental setup, result analysis, and performance comparison with existing models; and Section 5 summarizes the research findings and suggests directions for future work.

2. Related Work

As a single-stage detection framework, the YOLO (You Only Look Once) series has been widely adopted in object detection for its speed and efficiency. Since its initial release, the YOLO model has undergone multiple iterations, each significantly improving detection accuracy and speed, making it highly competitive for real-time applications [6]. With the growing demand for human pose estimation, the YOLO architecture has been further refined to support keypoint detection and pose estimation tasks, resulting in variants such as YOLO-Pose. However, despite its solid pose estimation accuracy, the large computational load and parameter scale of YOLO-Pose limit its usability in lightweight application scenarios. This limitation motivates the present study, which aims to optimize the YOLO-Pose structure to improve its efficiency and applicability in pose estimation tasks [7].
Lightweight techniques are especially critical in pose estimation tasks, particularly on resource-constrained devices. Lightweight neural networks such as MobileNet [8] and ShuffleNet [9] have been widely used in computer vision tasks. Through techniques like depthwise separable convolution [10], these models achieve significant reductions in computational complexity. However, although these lightweight models provide advantages in parameter count and computational speed, they face limitations in the context of pose estimation tasks. Specifically, due to the high demand for detailed feature representation in pose estimation, the lightweight methods of MobileNet and ShuffleNet tend to compromise feature representation capacity, affecting the model’s final detection accuracy. This trade-off reveals the necessity to develop new lightweight methods that can maintain robust feature representation while further reducing the model’s computational load.
To enhance feature representation capacity, dynamic feature fusion and attention mechanisms have recently been introduced into pose estimation models. Dynamic feature fusion aims to adaptively integrate features from different layers and scales, enriching the model’s adaptability to diverse inputs and improving its capability to recognize complex poses. For example, multi-scale feature fusion not only strengthens the model’s balance between detail and global information but also enhances its robustness to occlusions and variations in pose [11]. Meanwhile, attention mechanisms are widely applied in pose estimation to achieve selective feature enhancement. Channel attention can effectively emphasize critical feature channels essential for object detection, improving the model’s adaptability to complex backgrounds and varied poses. However, while existing dynamic feature fusion and attention mechanisms improve model expression capacity, they still come with high computational costs, which limit their suitability for lightweight applications.
In summary, while current pose estimation models have achieved improvements in accuracy and feature expression, their lightweight performance still requires enhancement. Lightweight networks like MobileNet and ShuffleNet lack sufficient feature representation capacity, while existing dynamic feature fusion and attention mechanisms are still computationally demanding. To address these limitations, this paper proposes the GDE-pose model, which combines Ghost Bottleneck, a Dynamic Feature Fusion Module (DFFM), and an ECA Attention mechanism. Ghost Bottleneck reduces computational costs by eliminating redundant features; DFFM, integrated into the Neck, enhances model expression through multi-scale feature fusion; and the ECA Attention module, with a simplified channel selection mechanism, improves feature extraction accuracy. This unique combination of modules provides an optimal balance between lightweight design and effective feature representation for GDE-pose.

3. Method

The improvements in this paper are based on the YOLO11-pose [12] architecture, which incorporates pose estimation capabilities into a single-stage object detection framework, resulting in a multi-task model capable of efficiently detecting keypoints. YOLO11-pose achieves feature extraction, fusion, and keypoint detection through three main components: Backbone, Neck, and Head. The Backbone captures fundamental image features, the Neck further aggregates information through a multi-scale feature pyramid structure, and the Head performs keypoint predictions based on these fused features. Although YOLO11-pose offers advantages in real-time performance and detection accuracy, there remains room for optimization in balancing lightweight design and precision. To address this, this paper introduces targeted module replacements and enhancements in the Backbone, Neck, and Head structures, as illustrated in Figure 1.
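To make this three-stage layout concrete, the following minimal PyTorch sketch (our own illustration, not the authors’ code; the class name `PoseDetector` and the module placeholders are hypothetical) marks where each GDE-pose modification attaches.

```python
import torch
import torch.nn as nn

class PoseDetector(nn.Module):
    """Skeleton of a YOLO11-pose-style multi-task detector (illustrative only)."""
    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # GDE-pose: C3k2 blocks here use Ghost Bottlenecks
        self.neck = neck          # GDE-pose: DFFM performs multi-scale fusion here
        self.head = head          # GDE-pose: ECA attention reweights channels here

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)  # multi-scale feature maps
        fused = self.neck(feats)  # feature-pyramid aggregation
        return self.head(fused)   # box + keypoint predictions per scale
```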

3.1. C3k2_Ghost

In the design of the Backbone, the C3k2 module serves as a crucial unit for feature extraction, traditionally using a standard Bottleneck structure to capture both local and global image features. However, because it stacks multiple convolutional layers, the standard Bottleneck imposes a significant computational load, making it unsuitable for resource-constrained devices. To achieve a lightweight model, this paper replaces the Bottleneck in the C3k2 module with a Ghost Bottleneck. The Ghost Bottleneck exploits redundancy among feature maps, generating part of its output through inexpensive linear operations; this reduces computational complexity while preserving feature integrity, enabling robust feature representation with fewer parameters. Additionally, by employing simplified convolution operations, the Ghost Bottleneck effectively reduces memory usage, providing greater headroom for the subsequent feature fusion stages. The modified structure is illustrated in Figure 2.
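For reference, the sketch below follows the Ghost module design from GhostNet [4], on which the Ghost Bottleneck is built: a primary convolution produces a small set of intrinsic feature maps, and cheap depthwise operations generate the remaining “ghost” maps. The ratio of 2 and the 3 × 3 depthwise kernel are illustrative defaults from the GhostNet paper, not necessarily the exact configuration used in GDE-pose.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Ghost module sketch: intrinsic features + cheap depthwise 'ghost' features.
    Assumes out_ch is divisible by ratio (ratio=2 here for simplicity)."""
    def __init__(self, in_ch: int, out_ch: int, ratio: int = 2):
        super().__init__()
        intrinsic = out_ch // ratio        # channels from the costly convolution
        ghost = out_ch - intrinsic         # channels from cheap operations
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, intrinsic, 1, bias=False),
            nn.BatchNorm2d(intrinsic), nn.ReLU(inplace=True))
        # depthwise conv: one cheap linear map per intrinsic channel
        self.cheap = nn.Sequential(
            nn.Conv2d(intrinsic, ghost, 3, padding=1, groups=intrinsic, bias=False),
            nn.BatchNorm2d(ghost), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)
```

For example, `GhostModule(64, 128)` produces the same output shape as a standard 1 × 1 convolution from 64 to 128 channels, but only half of the output channels pass through the full convolution; the rest come from the cheap depthwise path.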

3.2. C3k2_DFFM

In pose estimation tasks, the Neck section is responsible for extracting and fusing multi-scale features, laying a foundation for the model to accurately identify target poses in complex scenarios. To enhance YOLO11-pose’s performance in recognizing multi-scale and complex poses, this paper introduces a Dynamic Feature Fusion Module (DFFM) into the Neck section. The core of DFFM lies in multi-layered feature fusion and adaptive processing, enabling the model to dynamically respond to and accurately capture various poses.
DFFM consists of three main modules: the Dynamic Receptive Field Module, the Multi-Scale Feature Fusion Module, and the Lightweight Channel Compression Module. First, the Dynamic Receptive Field Module combines 1 × 1 and 3 × 3 adjustable convolutional kernels to adaptively adjust the receptive field size, allowing the model to balance between capturing fine details for small-scale targets and the overall structure for larger scales. By automatically adjusting weights, this module effectively captures both local and global information, significantly enhancing the model’s robustness to complex backgrounds and variable targets.
Second, the Multi-Scale Feature Fusion Module performs comprehensive horizontal and vertical fusion across features at different scales. Horizontal fusion reinforces details within the same level, while vertical fusion integrates global and detail information across different levels, ensuring the model captures essential details and leverages global context when handling scale variations. This fusion strategy not only strengthens the model’s expressive capacity in diverse scenarios but also allows adaptive precision improvements in complex poses and multi-scale settings.
The Lightweight Channel Compression Module plays a crucial role in computational efficiency. By using 1 × 1 convolutions to effectively compress channels, this module significantly reduces computational costs, ensuring that the addition of DFFM does not substantially increase resource requirements. This channel compression strategy is particularly suited for applications demanding high real-time performance, enabling the model to maintain efficient feature representation with minimal computational overhead.
By integrating these three components, DFFM greatly enhances YOLO11-pose’s feature adaptability and expressiveness while maintaining a lightweight design. Compared to traditional feature fusion methods, DFFM not only improves accuracy and robustness in complex pose estimation but also significantly enhances real-time performance and deployment efficiency, achieving a balance of efficiency, accuracy, and flexibility that meets modern pose estimation systems’ requirements for high precision and low resource consumption. These details are presented in Figure 3.
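The paper does not publish DFFM’s layer-level configuration, so the following PyTorch sketch is only one plausible reading of the three components described above: input-dependent weighting over 1 × 1 and 3 × 3 branches, cross-scale fusion by upsampling and concatenation, and 1 × 1 channel compression. All class names and channel choices are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicReceptiveField(nn.Module):
    """Blend a 1x1 (detail) and a 3x3 (context) branch with input-dependent weights."""
    def __init__(self, ch: int):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=1)
        self.conv3 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.gate = nn.Linear(ch, 2)  # predicts one weight per branch from pooled features

    def forward(self, x):
        w = F.softmax(self.gate(x.mean(dim=(2, 3))), dim=1)   # (N, 2) branch weights
        w1 = w[:, 0].view(-1, 1, 1, 1)
        w3 = w[:, 1].view(-1, 1, 1, 1)
        return w1 * self.conv1(x) + w3 * self.conv3(x)

class DFFM(nn.Module):
    """Dynamic receptive field -> cross-scale fusion -> 1x1 channel compression."""
    def __init__(self, ch_fine: int, ch_coarse: int, ch_out: int):
        super().__init__()
        self.drf = DynamicReceptiveField(ch_fine)
        self.compress = nn.Conv2d(ch_fine + ch_coarse, ch_out, kernel_size=1)

    def forward(self, fine, coarse):
        fine = self.drf(fine)  # adaptively re-weighted detail features
        # vertical fusion: upsample the coarser (deeper) map to the fine resolution
        coarse = F.interpolate(coarse, size=fine.shape[2:], mode="nearest")
        # horizontal fusion via concatenation, then lightweight 1x1 compression
        return self.compress(torch.cat([fine, coarse], dim=1))
```

Under these assumptions, `DFFM(256, 512, 256)` would fuse a stride-8 Neck feature map with a stride-16 map while keeping the output channel count unchanged.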

3.3. ECA_Head

In the Head section, to further enhance the representation of key features, this paper introduces the ECA (Efficient Channel Attention) mechanism. By assigning distinct weights to each feature channel, ECA improves the model’s selective focus on essential features, thereby reducing interference from irrelevant information during feature extraction. ECA optimizes channel selection through local inter-channel dependencies, captured with a single 1D convolution, and thus avoids the extra parameters and complex computations introduced by other attention mechanisms such as SE [13] or CBAM [14]. This design maintains inference speed while dynamically weighting features by channel importance, enhancing critical information and suppressing irrelevant noise, giving ECA an ideal balance between computational efficiency and feature enhancement for real-time applications and lightweight models.
ECA can also flexibly adjust the receptive field by varying the kernel size (k_size), adapting to multi-scale features, which is especially effective in tasks involving complex backgrounds and small-object detection. Additionally, the modular design of ECA enables seamless integration into different network parts (e.g., Backbone, Neck, or Head), allowing for selective feature enhancement at various levels without redesigning the network architecture. Thanks to its lightweight design and effective channel prioritization, ECA is not only suitable for lightweight networks but also provides substantial benefits for large-scale networks, particularly for tasks that demand efficient and adaptive feature representation, such as classification, object detection, and pose estimation.
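A minimal implementation of the ECA block, following the published ECA-Net design [5], is sketched below: global average pooling produces a channel descriptor, a single 1D convolution models local cross-channel interaction, and a sigmoid gate rescales the input. The adaptive kernel-size rule (with gamma = 2 and b = 1) is taken from the ECA-Net paper; this sketch is a reference for the mechanism, not the authors’ exact module.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: GAP -> 1D conv across channels -> sigmoid gate."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1, k_size=None):
        super().__init__()
        if k_size is None:
            # ECA-Net's adaptive rule: kernel size grows with log2(channels), kept odd
            t = int(abs((math.log2(channels) + b) / gamma))
            k_size = t if t % 2 else t + 1
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = x.mean(dim=(2, 3))                    # (N, C) global average pooling
        y = self.conv(y.unsqueeze(1)).squeeze(1)  # local cross-channel interaction
        return x * self.sigmoid(y).view(x.size(0), -1, 1, 1)
```

For instance, `ECA(256)` derives a kernel size of 5, so each channel weight depends only on its five nearest channels, which is what keeps the parameter cost negligible compared with SE-style fully connected gating.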
In summary, this paper introduces the GDE-pose model by integrating Ghost Bottleneck, DFFM, and ECA Attention modules into key parts of the YOLO-pose architecture. These three modules address lightweight design, dynamic feature fusion, and feature selection, collectively enhancing the model’s overall efficiency and accuracy. Ghost Bottleneck reduces the computational burden, making the model more lightweight; DFFM enhances performance in complex scenarios through dynamic multi-scale feature fusion; and ECA Attention optimizes information filtering at the final stage of feature extraction. With these improvements, GDE-pose achieves the goal of enhancing pose estimation accuracy while maintaining real-time performance, demonstrating significant application potential. The results are depicted in Figure 4.
The model successfully identifies individuals and highlights key body joints, indicating effective pose estimation. In the first example, it distinguishes the giraffes and their body structure, though some keypoints appear less accurate due to limited visibility. In the second example, the model accurately detects multiple people in a crowded environment and overlays bounding boxes and keypoints, demonstrating strong object detection and pose recognition. Despite minor inaccuracies in cluttered areas, the overall performance indicates robust functionality in diverse scenarios.

4. Experiment

4.1. Datasets and Evaluation Metrics

This experiment uses the COCO (Common Objects in Context) dataset [15] to comprehensively evaluate the performance of the GDE-pose model across various pose estimation scenarios. The COCO dataset includes extensive keypoint annotations, covering diverse complex scenes and occlusions, effectively testing the model’s adaptability in different conditions.
For evaluation metrics, Mean Average Precision (mAP) and Frames Per Second (FPS) are the primary measures. mAP assesses the model’s accuracy in keypoint detection, while FPS reflects its real-time performance. These metrics not only demonstrate the model’s accuracy and efficiency but also validate its suitability for lightweight application scenarios.
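mAP for COCO keypoints is computed from Object Keypoint Similarity (OKS) using the standard COCO evaluation tools. FPS has no single canonical protocol; the sketch below shows one common way to measure it in PyTorch (warm-up iterations followed by synchronized timing at batch size 1), offered as an assumption rather than the authors’ exact benchmarking procedure.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, img_size=640, n_warmup=20, n_iters=100, device="cuda"):
    """Rough FPS estimate for single-image inference (illustrative protocol)."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, img_size, img_size, device=device)
    for _ in range(n_warmup):        # warm-up so kernel selection doesn't skew timing
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()     # make sure all queued GPU work is finished
    start = time.perf_counter()
    for _ in range(n_iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return n_iters / (time.perf_counter() - start)
```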

4.2. Experimental Setup

To ensure the reliability and reproducibility of experimental results, all tests were conducted in a consistent hardware and software environment. For hardware, the experiments were run on a high-performance computing server equipped with an NVIDIA Tesla V100 GPU (NVIDIA, Santa Clara, CA, USA) and 32 GB of memory, allowing for the efficient handling of large-scale datasets and accelerated computation. On the software side, both model training and inference were implemented using the PyTorch 2.5 deep learning framework, with all model hyperparameters kept consistent to ensure fair comparisons. To eliminate the influence of environmental variables, a fixed random seed was used to initialize weights, ensuring a consistent starting state for each experiment and thereby yielding stable and reproducible results.
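As a minimal sketch of the fixed-seed setup described above (the seed value 42 and the deterministic cuDNN flags are our assumptions; the paper reports only that a fixed seed was used):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    """Fix all common sources of randomness for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # deterministic cuDNN trades some speed for run-to-run reproducibility
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```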

4.3. Ablation and Comparative Experiments

In Experiment 1, we evaluated the effectiveness of integrating the C3k2_Ghost module into the Backbone, comparing different replacement positions (initial convolution layer, main feature extraction stage, downsampling module, and final aggregation layer); the results are summarized in Table 1. Horizontal analysis shows that replacing the Backbone’s initial convolution layer or downsampling module significantly reduces FLOPs (to 13.9 G and 13.8 G, respectively) while maintaining high detection performance, with frame rates increasing to 32 FPS and 31 FPS. When all modules were replaced, AP decreased slightly to 74.1% while the frame rate increased to 33 FPS, confirming the viability of a complete replacement strategy.
Vertical comparison demonstrates that a combined replacement strategy achieves a balance between performance and efficiency. For instance, replacing the initial convolution layer, main feature extraction, and downsampling modules resulted in AP and mAP scores of 73.4% and 69.0%, respectively, with a peak frame rate of 35 FPS and a reduction in FLOPs to 13.5 G. This outcome suggests that C3k2_Ghost maintains strong detection performance while lowering computational costs. Overall, integrating C3k2_Ghost into the Backbone significantly enhances model efficiency, showcasing its lightweight advantages.
The experimental results indicate that the way the DFFM module is integrated into the C3k2 module significantly impacts model accuracy, inference speed, model size, and computational complexity. Specifically, inserting DFFM before feature concatenation offers the best performance balance, achieving the highest accuracy (75.0%) and fastest inference speed (31 FPS) among the integration methods, while significantly reducing model size (6.7 MB) and FLOPs (14.1 G), exemplifying an ideal balance between lightweight design and efficiency. These details are presented in Table 2.
In contrast, the “introduction of DFFM within each branch” improves accuracy (74.8%) but has higher FLOPs (15.8 G), resulting in a slightly slower inference speed (28 FPS), making it suitable for scenarios where computational resources are less restricted. Meanwhile, using “DFFM as post-processing in the C2f module” enhances global features in complex scenarios, but its larger model size (8.1 MB) and higher FLOPs (16.2 G) make it less suited for resource-constrained real-time tasks.
Overall, compared to the baseline model without DFFM, all integration methods significantly improve accuracy, with DFFM integration before feature concatenation achieving the optimal balance of accuracy and efficiency. This approach demonstrates superior performance in resource-sensitive environments, offering an effective optimization strategy for efficient multi-scale feature fusion integration.
In Experiment 3, we explored the impact of various module combinations on the performance of the YOLO11-pose model. The results indicate that integrating the C3k2_Ghost, C3k2_DFFM, and ECA_Head modules significantly enhances the model’s Average Precision (AP) and mean Average Precision (mAP), while also improving average recall (AR) and computational efficiency. Specific experimental results are shown in Table 4.
In single-module experiments, C3k2_Ghost demonstrated outstanding lightweight advantages, markedly reducing model parameters and FLOPs, with inference speed increasing to 33 FPS. C3k2_DFFM prioritized feature extraction improvements, raising AP from the baseline 74.5% to 75.0%. Dual-module combinations further strengthened performance, particularly the pairing of C3k2_DFFM with ECA_Head, which significantly improved accuracy and recall, making it suitable for applications with high accuracy requirements.
In the final configuration, which includes all three modules (C3k2_Ghost, C3k2_DFFM, and ECA_Head, denoted as “Ours”), the model achieved an AP of 77.3%, an mAP of 72.6%, and an AR of 74.5%, maintaining high inference speed and a low parameter count while showcasing an excellent balance of performance. This combination provides a novel approach to multi-scale feature fusion, achieving an optimal balance between accuracy and efficiency, with strong potential for deployment in resource-constrained real-time tasks.
In the comparative experiments, we evaluated the GDE-pose model against several mainstream pose estimation models to comprehensively assess its performance advantages. First, YOLO11-pose, as the baseline model for GDE-pose, was used to directly verify the effects of the improved modules (Ghost Bottleneck, DFFM, and ECA Attention). Additionally, we selected MobileNet and ShuffleNet as representatives of lightweight models for comparison, as both are widely used in real-world applications for their low computational cost and relatively high accuracy. MobileNet achieves lightweight design through depthwise separable convolution, while ShuffleNet relies on grouped convolution for computational optimization. In comparison, GDE-pose not only focuses on lightweight optimization but also further enhances feature representation capacity. Therefore, comparing these models helps clarify GDE-pose’s advantage in balancing accuracy and efficiency.
As shown in Table 3, GDE-pose (Ours) outperforms several existing mainstream pose estimation models, including YOLOv8-Pose, YOLO-NAS-Pose, OpenPose, HRNet-Pose, and AlphaPose across multiple performance metrics. Compared to high-performing models such as YOLO-NAS-Pose and YOLOv8-Pose, GDE-pose achieved an increase in accuracy (AP and mAP) by 1.5% and 2.1%, respectively, while significantly reducing parameter count (6.8 M) and computational complexity (14.5 G), resulting in more efficient resource utilization. Relative to traditional high-precision models like OpenPose and HRNet-Pose, GDE-pose achieved higher accuracy with a substantial reduction in parameters, reducing FLOPs by over 80% and improving inference speed to 31 FPS, showcasing outstanding lightweight advantages and real-time performance.
While PoseResNet and AlphaPose demonstrate certain performance strengths in specific applications, their parameter counts and computational requirements are considerably higher than those of GDE-pose, failing to achieve an optimal balance between real-time performance and accuracy. Overall, GDE-pose reached 77.3% AP and 72.6% mAP while maintaining a compact model size and low computational complexity, validating its excellent capability in balancing accuracy and efficiency. This demonstrates its extensive application potential, providing a new technical option for real-time pose estimation.

5. Results

In quantitative results, the GDE-pose model demonstrates significant advantages across multiple key metrics. Table 3 shows the performance comparison between GDE-pose, the YOLO-pose baseline model, and other lightweight models (MobileNet and ShuffleNet). GDE-pose achieves outstanding performance in mean Average Precision (mAP) and inference speed (FPS), with an mAP of 72.6%. Additionally, GDE-pose reaches an inference speed of 31 FPS, approximately 20% faster than the baseline YOLO-pose and nearing the speed of lightweight models. Through the synergistic effects of the Ghost Bottleneck, DFFM, and ECA modules, GDE-pose achieves an optimal balance between accuracy and efficiency, exhibiting superior real-time performance and accuracy.
To investigate the contribution of each improvement module in the GDE-pose model, we conducted an ablation study, individually removing the Ghost Bottleneck, DFFM, and ECA Attention modules to assess their impact on overall performance. Results show that removing the Ghost Bottleneck reduced inference speed by about 12%, indicating its critical role in maintaining model compactness. When DFFM was removed, the model’s mAP dropped by 3.8% in complex pose scenarios, highlighting DFFM’s unique contribution to feature fusion and multi-scale information enhancement. Similarly, removing ECA Attention led to a significant reduction in detection accuracy in low-light scenes, confirming this module’s effectiveness in feature selection and model robustness. Table 4 summarizes the ablation results, further illustrating the indispensable role of each module in GDE-pose’s overall performance.
The plots in Figure 5 illustrate the training and validation loss trends across several metrics, including box loss, pose loss, objectness loss, classification loss, and distribution focal loss. Training losses decrease consistently, indicating improved performance in bounding box regression, pose estimation, object detection, and classification. Validation losses follow similar downward trends but show slight fluctuations, particularly in classification and objectness loss, suggesting potential challenges in generalization or data complexity. The smooth curves highlight overall progress, though some metrics hint at overfitting or a need for further hyperparameter tuning; potential remedies include adjusting learning rates, applying regularization techniques, or addressing class imbalance. Despite minor fluctuations, the results suggest the model is learning effectively, with good alignment between training and validation outcomes.

6. Discussion and Conclusions

The experimental results demonstrate that GDE-pose achieves an ideal balance between lightweight design and accuracy in pose estimation tasks, a success attributed to the synergy among its modules. Compared to the baseline YOLO-pose model, GDE-pose significantly reduces computational costs while maintaining high detection accuracy, making it more suitable for resource-constrained environments. On the COCO dataset in particular, GDE-pose achieved an mAP of 72.6% with an inference speed of 31 FPS, showcasing exceptional performance in both real-time and accuracy metrics. These results indicate that GDE-pose not only possesses potential for practical application but also provides valuable design insights for the development of future lightweight pose estimation models.
The success of GDE-pose is largely due to the organic combination of the Ghost Bottleneck, DFFM, and ECA Attention modules. The Ghost Bottleneck module reduces computational complexity by eliminating redundant feature maps, contributing significantly to the model’s lightweight design. The DFFM module, through dynamic multi-scale feature fusion in the Neck, enhances the model’s performance in complex poses and occlusions, improving its adaptability to diverse scenarios. The ECA Attention module, by introducing an efficient channel selection mechanism in the Head, strengthens the focus on key features, enabling the model to maintain high detection accuracy even in low-light and complex background conditions. Together, these modules improve feature extraction, fusion, and selection accuracy while maintaining model efficiency, allowing GDE-pose to exhibit exceptional robustness and generalization across various test scenarios.
Despite its advancements in pose estimation, GDE-pose still has certain limitations. First, although the Ghost Bottleneck module significantly reduces computational load, inference speed on extremely lightweight embedded devices is still limited by hardware capabilities, suggesting that GDE-pose’s performance is partly dependent on hardware configuration. Additionally, in highly complex pose variation scenarios, there remains room for improvement in detection accuracy, especially under extreme occlusion and multi-target interference. Lastly, while the ECA Attention module excels at keypoint detection, there is still scope for optimizing its energy efficiency under low-power conditions. Future research should address these aspects to enhance the model’s applicability and robustness.
Based on these limitations, there are several potential directions for optimizing the GDE-pose model in future work. Firstly, for further lightweight optimization, exploring more efficient attention mechanisms, such as graph convolution-based self-attention or compressed convolutional attention structures, could help reduce computational costs. To enhance adaptability in extreme scenarios, integrating multimodal information, such as depth images or infrared data, could improve detection accuracy in low-light or heavily occluded environments. Lastly, to increase GDE-pose’s applicability on embedded devices, future studies could focus on energy efficiency optimization, particularly for applications in energy-constrained environments, extending its utility across a broader range of scenarios.
In conclusion, GDE-pose, as an innovative lightweight pose estimation model, achieves notable improvements in accuracy and real-time performance. Through the synergistic optimization of the Ghost Bottleneck, DFFM, and ECA Attention modules, GDE-pose demonstrates excellent performance across multiple datasets and showcases its potential for practical applications. Despite certain limitations, the innovative design of GDE-pose offers a new direction for lightweight pose estimation research and provides a solid foundation for efficient, accurate, real-time pose estimation applications. In the future, GDE-pose is expected to undergo further optimizations and expand its application scope, offering more effective solutions for pose estimation in various complex scenarios.

Author Contributions

Methodology, X.L.; Software, K.K.; Validation, J.Y.; Formal analysis, K.K.; Data curation, Y.W.; Writing—original draft, X.L.; Writing—review & editing, X.L.; Visualization, Y.W.; Supervision, W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Murphy-Chutorian, E.; Trivedi, M.M. Head pose estimation in computer vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 31, 607–626. [Google Scholar] [CrossRef] [PubMed]
  2. Stenum, J.; Cherry-Allen, K.M.; Pyles, C.O.; Reetzke, R.D.; Vignos, M.F.; Roemmich, R.T. Applications of pose estimation in human health and performance across the lifespan. Sensors 2021, 21, 7315. [Google Scholar] [CrossRef] [PubMed]
  3. Li, Z.; Xue, M.; Cui, Y.; Liu, B.; Fu, R.; Chen, H.; Ju, F. Lightweight 2D Human Pose Estimation Based on Joint Channel Coordinate Attention Mechanism. Electronics 2023, 13, 143. [Google Scholar] [CrossRef]
  4. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
  5. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  6. Diwan, T.; Anirudh, G.; Tembhurne, J.V. Object detection using YOLO: Challenges, architectural successors, datasets and applications. Multimed. Tools Appl. 2023, 82, 9243–9275. [Google Scholar] [CrossRef] [PubMed]
  7. Yang, W.; Jiang, M.; Fang, X.; Shi, X.; Guo, Y.; Al-qaness, M.A. A high-precision and efficient method for badminton action detection in sports using You Only Look Once with Hourglass Network. Eng. Appl. Artif. Intell. 2024, 137, 109177. [Google Scholar] [CrossRef]
  8. Sinha, D.; El-Sharkawy, M. Thin mobilenet: An enhanced mobilenet architecture. In Proceedings of the 2019 IEEE 10th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON), New York, NY, USA, 10–12 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 0280–0285. [Google Scholar]
  9. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  10. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  11. Zeng, N.; Wu, P.; Wang, Z.; Li, H.; Liu, W.; Liu, X. A small-sized object detection oriented multi-scale feature fusion approach with application to defect detection. IEEE Trans. Instrum. Meas. 2022, 71, 3507014. [Google Scholar] [CrossRef]
  12. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 20 October 2024).
  13. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  14. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  15. Caesar, H.; Uijlings, J.; Ferrari, V. Coco-stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1209–1218. [Google Scholar]
  16. Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. UAV-YOLOv8: A small-object-detection model based on improved YOLOv8 for UAV aerial photography scenarios. Sensors 2023, 23, 7190. [Google Scholar] [CrossRef] [PubMed]
  17. Research Team. YOLO-NAS by Deci Achieves State-of-the-Art Performance on Object Detection Using Neural Architecture Search. 2023. Available online: https://deci.ai/blog/yolo-nas-object-detection-foundation-model/ (accessed on 12 May 2023).
  18. Martínez, G.H. Openpose: Whole-Body Pose Estimation. Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 2019. [Google Scholar]
  19. He, R.; Wang, X.; Chen, H.; Liu, C. VHR-BirdPose: Vision Transformer-Based HRNet for Bird Pose Estimation with Attention Mechanism. Electronics 2023, 12, 3643. [Google Scholar] [CrossRef]
  20. Bao, W.; Ma, Z.; Liang, D.; Yang, X.; Niu, T. Pose ResNet: 3D human pose estimation based on self-supervision. Sensors 2023, 23, 3057. [Google Scholar] [CrossRef] [PubMed]
  21. Fang, H.S.; Li, J.; Tang, H.; Xu, C.; Zhu, H.; Xiu, Y.; Li, Y.L.; Lu, C. Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 7157–7173. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Overall architecture of GDE-pose.
Figure 2. (a) C3k2_Ghost; (b) C3k2_DFFM.
Figure 3. Diagram of C3k2_DFFM.
Figure 4. Illustration of GDE-pose performance.
Figure 5. Illustration of loss results.
Table 1. Validation of integrating C3k2_Ghost into the backbone.

| Experiment No. | Replacement Module | AP (%) 1 | mAP (%) 2 | AR (%) 3 | Params (M) 4 | FLOPs (G) 5 | FPS 6 |
|---|---|---|---|---|---|---|---|
| 1 | None | 74.5 | 70.3 | 72.0 | 8.2 | 16.5 | 30 |
| 2 | Post Initial Conv Layer | 73.6 | 69.3 | 71.6 | 6.4 | 13.9 | 32 |
| 3 | Main Feature Extraction Stage | 73.8 | 69.5 | 71.8 | 6.3 | 13.8 | 31 |
| 4 | Downsampling Module | 73.7 | 69.4 | 71.7 | 6.5 | 13.8 | 32 |
| 5 | Final Aggregation Layer | 73.8 | 69.5 | 71.8 | 6.6 | 14.0 | 31 |
| 6 | Post Initial Conv Layer + Main Feature Extraction Stage | 73.5 | 69.1 | 71.5 | 6.2 | 13.6 | 34 |
| 7 | Main Feature Extraction Stage + Downsampling Module | 73.6 | 69.3 | 71.6 | 6.3 | 13.7 | 33 |
| 8 | Post Initial Conv Layer + Main Feature Extraction + Downsampling | 73.4 | 69.0 | 71.4 | 6.1 | 13.5 | 35 |
| 9 | All Modules (Complete Replacement) | 74.1 | 69.7 | 71.8 | 6.5 | 14.0 | 33 |
1 AP (%): Average Precision; measures keypoint detection accuracy at specific thresholds, representing overall model precision. 2 mAP (%): Mean Average Precision; the average of AP scores across thresholds, indicating the model’s general detection accuracy. 3 AR (%): Average Recall; the proportion of correctly detected keypoints across all instances, reflecting the model’s ability to capture relevant features. 4 Params (M): parameter count in millions; represents model size, with fewer parameters indicating a more lightweight model. 5 FLOPs (G): floating-point operations in billions; gauges computational complexity, where lower FLOPs suggest faster processing. 6 FPS: Frames Per Second; indicates inference speed, with higher FPS signifying faster real-time performance.
Table 2. Comparison of the effects of integrating DFFM into the C3k2 module.

| Integration Method | Accuracy (mAP@0.5) | Inference Speed (FPS) | Model Size (MB) | FLOPs (G) |
|---|---|---|---|---|
| DFFM in Each Branch | 74.8 | 28 | 7.9 | 15.8 |
| DFFM Before Feature Concatenation | 75.0 | 31 | 6.7 | 14.1 |
| DFFM as Post-Processing in C2f Module | 74.9 | 28 | 8.1 | 16.2 |
| Original C3k2 Module (No DFFM) | 74.5 | 30 | 8.2 | 16.5 |
Table 3. Comparative experiment.

| Model | AP (%) | mAP (%) | Params (M) | FLOPs (G) | FPS |
|---|---|---|---|---|---|
| YOLOv8-Pose [16] | 75.0 | 70.5 | 50 | 28.5 | 28 |
| YOLO-NAS-Pose [17] | 75.8 | 71.0 | 52 | 30 | 29 |
| OpenPose [18] | 74.2 | 69.5 | 200 | 75 | 12 |
| HRNet-Pose [19] | 75.5 | 70.8 | 250 | 95 | 10 |
| PoseResNet [20] | 73.5 | 69.0 | 60 | 32 | 30 |
| AlphaPose [21] | 73.8 | 69.5 | 150 | 55 | 20 |
| YOLO11-pose | 74.5 | 70.3 | 8.2 | 16.5 | 30 |
| GDE-pose (Ours) | 77.3 | 72.6 | 6.8 | 14.5 | 31 |
Table 4. Ablation results.

| Model Configuration | AP (%) | mAP (%) | AR (%) | Params (M) | FLOPs (G) | FPS |
|---|---|---|---|---|---|---|
| Baseline (YOLO11-pose) | 74.5 | 70.3 | 72.0 | 8.2 | 16.5 | 30 |
| Baseline + C3k2_Ghost | 73.8 | 69.5 | 71.8 | 6.3 | 13.8 | 33 |
| Baseline + C3k2_DFFM | 75.0 | 71.0 | 72.5 | 6.7 | 14.1 | 31 |
| Baseline + ECA_Head | 74.6 | 70.6 | 72.3 | 6.8 | 14.4 | 29 |
| Baseline + C3k2_Ghost + C3k2_DFFM | 74.9 | 70.8 | 72.7 | 6.5 | 14.0 | 32 |
| Baseline + C3k2_Ghost + ECA_Head | 74.3 | 69.8 | 72.0 | 6.6 | 14.2 | 30 |
| Baseline + C3k2_DFFM + ECA_Head | 75.5 | 71.2 | 73.0 | 6.8 | 14.3 | 28 |
| Baseline + C3k2_Ghost + C3k2_DFFM + ECA_Head (Ours) | 77.3 | 72.6 | 74.5 | 6.8 | 14.5 | 31 |
| Baseline + MobileNetV3-Small | 72.6 | 66.9 | 70.7 | 6.7 | 15.2 | 30 |
| Baseline + ShuffleNetV2 0.5× | 74.9 | 68.7 | 71.2 | 6.5 | 14.9 | 28 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
