Article

Meta-Transfer-Learning-Based Multimodal Human Pose Estimation for Lower Limbs

Guoming Du, Haiqi Zhu, Zhen Ding, Hong Huang, Xiaofeng Bie and Feng Jiang

1 School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
2 School of Medicine and Health, Harbin Institute of Technology, Harbin 150001, China
3 College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
4 School of Computer Science and Engineering, Sichuan University of Science & Engineering, Zigong 643002, China
5 Unmanned Systems Technology Research Institute, Northwestern Polytechnical University, Xi’an 710072, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(5), 1613; https://doi.org/10.3390/s25051613
Submission received: 15 January 2025 / Revised: 1 March 2025 / Accepted: 4 March 2025 / Published: 6 March 2025

Abstract

Accurate and reliable human pose estimation (HPE) is essential in interactive systems, particularly for applications requiring personalized adaptation, such as controlling cooperative robots, wearable exoskeletons, and healthcare monitoring equipment. However, continuously maintaining diverse datasets and frequently updating models for individual adaptation are both resource-intensive and time-consuming. To address these challenges, we propose a meta-transfer learning framework that integrates multimodal inputs, including high-frequency surface electromyography (sEMG), visual-inertial odometry (VIO), and high-precision image data. This framework improves both accuracy and stability through a knowledge fusion strategy that resolves the data alignment issue and ensures seamless integration of the different modalities. To further enhance adaptability, we introduce a training and adaptation framework with few-shot learning, facilitating efficient updating of encoders and decoders for dynamic feature adjustment in real-time applications. Experimental results demonstrate that our framework provides accurate, high-frequency pose estimates, particularly for intra-subject adaptation. Our approach adapts efficiently to new individuals with only a few new samples, providing an effective solution for personalized motion analysis with minimal data.

1. Introduction

Accurate HPE remains a crucial task in the field of human–machine interaction (HMI), with accuracy and stability being pivotal for user-centric applications. Sensing technologies such as imaging, sEMG, and inertial measurement units (IMUs) have facilitated advancements in human motion representation and, when integrated with estimation methods and neural networks, have driven continuous progress in human motion analysis and estimation methodologies [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]. As HMI applications have become increasingly prevalent in daily life, the diversity in motion patterns introduces significant complexity and uncertainty to processing models. While intra-subject adaptation remains essential for personalized HMI, its implementation introduces compounded challenges: designing context-aware architectures, curating specialized motion datasets, and preserving estimation consistency. These interdependent barriers fundamentally undermine pose estimation reliability. Knowledge-based approaches, such as transfer learning (TL) and meta-learning (ML), have proven effective in enhancing performance, particularly for intra-subject scenarios [20,21,22,23,24,25]. Additionally, leveraging multimodal information integration and effective knowledge fusion can further enhance the performance of HPE tasks [26,27,28,29,30,31,32,33,34].
Traditional HPE approaches leverage high-frequency and sensitive sEMG signals for motion state estimation and partial body posture analysis, such as hand gesture recognition [6], classification [7,8], and upper and lower body movement estimation [9,10]. Furthermore, sEMG-based methods demonstrate significant potential in motion analysis by utilizing encoding networks such as BP [11], DBN [12], CNN [13], LSTM [8], and hybrid architectures [14]. Given the complexity of motion states in irregular daily movements, vision-based methods exhibit higher reliability and accuracy, as they do not require initial-value corrections and do not suffer from instability in slow or stationary states. Previous studies employing traditional machine learning and deep learning techniques [15,16] have offered vision-based solutions for HPE tasks. Vision-based solutions like AlphaPose [17], OpenPose [18], and MotionBert [19] demonstrate strong potential in practical applications. However, they often lack the robustness and reliability required for demanding tasks, such as precise operational interactions. The incorporation of multi-view capturing [35] and multi-level feature extraction [36] further enhances estimation accuracy. However, vision-based methods typically demand extensive network parameters and substantial computational resources, potentially causing latency in real-time interactive applications. Consequently, human pose prediction and estimation have shifted toward multimodal processing approaches. Significant latency reduction has been achieved by integrating IMU and video streams [37,38,39], while the fusion of RGB and depth information [40,41] yields more reliable spatial results. To achieve both accuracy and efficiency in HPE tasks, RGB and sEMG data are utilized as our baseline modalities. However, in the context of body movement analysis, the translation and reshaping of the human body occur simultaneously. Inspired by previous work on rigid-body pose estimation using VIO information [42,43], we use the pelvis as the central reference point for estimating the relative pose during movement. By incorporating VIO information as a pose movement descriptor, the representation of human motion is further enriched.
Despite significant progress in HPE, many previous studies fail to ensure robustness and reliability when pre-trained models are applied to subjects with varying motion patterns, habits, or physical characteristics. To address this challenge, online optimization [22] and adaptation [23,24] techniques have shown great promise in real-time motion processing. By incorporating neural networks into the adaptation process [25], these techniques further enhance the accuracy of the results. However, sensor noise and shifts in data distribution in the input stream can compromise the stability and robustness of online computations. To improve generalization and resistance to interference, larger and more diverse training datasets, along with deeper networks, have been employed in pose estimation [44,45,46] and activity recognition [47,48] tasks leveraging TL techniques based on IMUs and video streams. Few-shot learning technologies such as ML have been widely applied in computer vision [49], natural language processing [50], and recommender systems [51], providing an effective solution for limited training data and uncertain scenarios. Meta-transfer learning techniques [52,53] offer advantages in adaptability across various tasks and scenarios, leveraging a knowledge-based migration framework. Knowledge-based learning algorithms have demonstrated strong potential in human motion prediction [49,54] and classification [55], particularly in handling uncertainty in domains and tasks with limited reference movements. Given the diversity of motion types and patterns, knowledge-based mechanisms provide higher-dimensional descriptive parameters in prediction and estimation networks, ensuring the reliability and robustness of human motion modeling.
Contemporary HPE systems demonstrate marked performance advantages through multimodal fusion architectures [26,27], in contrast with the inherent limitations of unimodal approaches. This paradigm enhances system robustness via three synergistic mechanisms: (1) cross-modal feature complementarity that mitigates sensory noise through differential error profiles, (2) hierarchical representation learning enabling discriminative kinematic pattern recognition, and (3) dynamic attention weighting optimized for intra-subject adaptation. Particularly in personalized motion analysis contexts, the fused feature manifold provides physiologically consistent representations that substantially improve movement decoding fidelity under real-world variability conditions. To further enhance stability and efficiency in knowledge-based meta-learning for human motion, multimodal information fusion combined with knowledge-level techniques has shown promise in training and inference. Early-stage fusion methods like bagging [28] and embedding [29] offer a solid foundation for improving model performance, while intermediate fusion strategies, such as feature-level fusion, enable synergy between complementary modalities. Vision-based multimodal deep learning networks [30,31,32] and attention-based networks [33,34] are effective at leveraging feature extraction to enhance results, particularly in intra-subject scenarios. However, relying solely on data-level or feature-level fusion does not ensure effective knowledge transfer between tasks. Overcoming this limitation requires deeper exploration of advanced knowledge representation techniques. Techniques like shared feature spaces [56], shared representations [57], and shared model parameters [58] facilitate knowledge transfer across domains, ensuring that relevant information is preserved and applied and improving the generalization capabilities of models trained on multimodal data. Despite these advancements, misalignment between information and knowledge across domains remains a key challenge, particularly in intra-subject adaptation when source and target domains differ significantly. Addressing this misalignment demands sophisticated fusion techniques and refined adaptation strategies to reconcile discrepancies in knowledge transfer across new contexts.
Despite extensive advancements in HPE using multimodal sensing technologies, intra-subject adaptation remains a challenging task. In this paper, we propose a novel framework for multimodal information with knowledge fusion, aiming to improve both accuracy and stability in HPE tasks. We summarize our main contributions as follows: (1) a framework for multimodal input knowledge fusion in meta-transfer learning, enhancing accuracy and stability; (2) a multi-channel feature extraction and fusion network designed to improve knowledge representation capabilities and resolve the knowledge alignment problem; and (3) a training and adaptation framework incorporating few-shot learning for the efficient updating of encoders and decoders, enabling dynamic feature updating in real-time applications.

2. Materials and Methods

2.1. Equipment and Participants

As no public dataset meets our requirements (shown in Table 1) for multimodal information, we recruited volunteers to perform ground walking movements while being equipped with four types of sensors: (1) an RGB camera (OmniVision, Shanghai, China) was used to capture visual motion data for the modulation of bone movements; (2) six sEMG sensors were placed on the rectus femoris (RF), biceps femoris (BF), and gastrocnemius (GA) of both legs to capture muscle activity; (3) an Intel RealSense T265 (Intel, Santa Clara, CA, USA) sensor, mounted on the waist, was used to capture body motion; and (4) a Vicon motion capture system (Vicon, Denver, CO, USA) was employed to obtain ground-truth movement data for the lower body. The placement of all sensors and capture environment are illustrated in Figure 1, and the specifications of all sensors are provided in Table 2. The selected RGB module offers robust dynamic motion capture stability, making it well-suited for motion analysis scenarios. The Vicon capture system, equipped with T40 cameras, provides high-resolution skeleton spatial references, while Delsys sEMG signal capture sensors (Delsys, Natick, MA, USA) offer high capture frequency with minimal errors. Both systems are widely utilized in human motion analysis research. The Intel RealSense T265 camera directly provides integrated VIO information, which helps conserve computational resources.
Ten healthy male volunteers (age: 23–26 years, average: 24.2 ± 1.8 years; height: 160–185 cm) were recruited for the experiment. None of the participants had a history of joint or neurological disorders or injuries that could limit their exercise ability. This study was conducted in accordance with the Declaration of Helsinki. Written informed consent was obtained from all participants.

2.2. Data Acquisition

Each participant was required to perform 12 walking trials on flat ground, consisting of 4 trials at slow speed, 4 trials at moderate speed, and 4 trials at fast speed. Each trial lasted 1 min, followed by a 1 min pause to prevent muscle fatigue. Prior to the experiment, each participant was instructed to walk on a treadmill with a zero incline at speeds of 3.5 km/h, 4.5 km/h, and 6 km/h to familiarize themselves with slow, moderate, and fast walking speeds. The walking trials were conducted in a circular path with two possible directions: clockwise and counterclockwise, with each trial following only one direction.
The Vicon motion capture system employs reflective markers to obtain the 3D spatial coordinates of the markers, with joint locations calculated based on methods outlined in prior work [66,67]. The system, which consists of 10 infrared cameras, is calibrated using a “T”-shaped tool with infrared LEDs on its surface. The calibration process is rigorously executed using built-in software (Nexus v2.16), within a defined action zone measuring 3 m by 4 m. To address capture frequency discrepancies across multimodal equipment, we use hard-wired synchronization between the sEMG signals and the Vicon capture system. The VIO and RGB streams were timestamped and stored using the same clock as the sEMG capture system.
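Because the devices run at different rates (roughly 30 Hz RGB, 1111 Hz sEMG, 200 Hz VIO, 100 Hz Vicon) but share one clock, downstream processing needs the streams expressed on a common timeline. The sketch below illustrates one simple way to do this with linear interpolation; the function name, clip length, and choice of the RGB frame timestamps as the target grid are illustrative assumptions rather than part of the authors' released pipeline.

```python
import numpy as np

def resample_to_grid(timestamps, values, grid):
    """Linearly interpolate a timestamped stream onto a common time grid.

    timestamps : (T,) sample times in seconds under the shared capture clock
    values     : (T, D) samples (sEMG channels, VIO pose, joint angles, ...)
    grid       : (G,) target timestamps
    """
    values = np.asarray(values)
    out = np.empty((len(grid), values.shape[1]))
    for d in range(values.shape[1]):
        out[:, d] = np.interp(grid, timestamps, values[:, d])
    return out

# Illustrative usage for a 10 s clip: express sEMG (~1111 Hz) and VIO (200 Hz)
# samples on the 30 Hz RGB frame timestamps.
rgb_t = np.arange(0.0, 10.0, 1 / 30)
t_semg = np.arange(0.0, 10.0, 1 / 1111.111)
t_vio = np.arange(0.0, 10.0, 1 / 200)
semg_on_grid = resample_to_grid(t_semg, np.random.randn(len(t_semg), 6), rgb_t)
vio_on_grid = resample_to_grid(t_vio, np.random.randn(len(t_vio), 6), rgb_t)
```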

2.3. Method

In addressing the optimization challenges of pose estimation adaptation for intra-subject scenarios, we implemented a multi-stage meta-transfer learning strategy. This approach effectively reduces reliance on large-scale pre-training datasets while significantly conserving computational resources required for model training and validation. To enhance the reliability and stability of estimation outputs, we developed customized knowledge fusion mechanisms that strategically integrate multimodal input streams through modality-specific feature integration protocols. As shown in Figure 2, the proposed meta-transfer learning framework consists of three stages, taking known and new data as inputs to the learning process. The learning process includes a pre-training stage, which generates an initial pose estimation model with pre-trained weights and biases. The meta-transfer learning stage fine-tunes the pre-trained model on multiple tasks by modulating its parameters. Finally, in the meta-adaptation stage, the meta-learner adapts to new tasks using a few examples beyond the known sample set and is evaluated on test data.

2.3.1. Pre-Training

To effectively address the knowledge alignment challenge, the pre-training stage encodes multimodal inputs separately, enabling a reliable and robust fusion of features through knowledge sharing. Figure 3 illustrates the complete pre-training process, with the final output comprising the six joint angles of the lower limbs (hip, knee, and ankle of both legs) that describe the movements.
Consider an input stream $I_m \in \mathbb{R}^{T_m \times D_m}$, where $m \in \{i, e, v\}$ indicates the image, sEMG, and VIO streams, respectively; $T_m$ and $D_m$ denote the input length and channel size. As shown in Figure 4, the convolutional block attention module (CBAM) [68] leverages channel attention and spatial attention for effective feature representation. An attention block is applied to the image stream input, while a transformer captures temporal dependencies in the sEMG and VIO streams. A CBAM-ResNet12 block consists of four residual blocks, each comprising three convolutional layers, a CBAM layer, and a max-pooling layer. The feature extraction process is defined as $r_m = X^*(I_m)$, where $X^*$ represents the CBAM-ResNet12 or transformer encoder, depending on the input stream.
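The paper does not provide an implementation, but the encoder description (four residual blocks, each with three convolutions, a CBAM layer, and max pooling) maps onto a compact sketch. The PyTorch version below is an illustration only: the channel widths, reduction ratio, activation choices, and input resolution are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional block attention module: channel attention followed by spatial attention [68]."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                 nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        chan = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        x = x * chan.view(b, c, 1, 1)                                      # channel attention
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))                          # spatial attention

class CBAMResBlock(nn.Module):
    """One residual block: three 3x3 convolutions, a CBAM layer, and 2x2 max pooling."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.LeakyReLU(0.1),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.LeakyReLU(0.1),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out))
        self.shortcut = nn.Conv2d(c_in, c_out, 1)
        self.cbam = CBAM(c_out)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        out = self.cbam(self.convs(x) + self.shortcut(x))
        return self.pool(torch.relu(out))

# CBAM-ResNet12 image encoder: four residual blocks with increasing width.
cbam_resnet12 = nn.Sequential(CBAMResBlock(3, 64), CBAMResBlock(64, 128),
                              CBAMResBlock(128, 256), CBAMResBlock(256, 512))
feat = cbam_resnet12(torch.randn(1, 3, 224, 224))   # -> (1, 512, 14, 14)
```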
The inherent differences in representational capacity among multimodal information streams for human pose estimation often result in input misalignment. The primary objective of knowledge sharing is to mitigate these discrepancies by ensuring feature alignment across individual clips. To address this challenge, we perform the following transformation on the encoded features:
$$\tilde{r}_m = f_m(r_m),$$
where $\tilde{r}_m$ denotes the transformed feature representation. The training process of the transform encoder involves evaluating the relationship between the target feature dimension and the encoded feature dimension. To measure the discrepancy between these feature spaces, we utilize the Euclidean distance as the loss function, which ensures the alignment of features during the optimization process.
$$L_{KS} = \sum_{(k,t)\in M,\; k \neq t} m_m \,\mathrm{EU}(k, t),$$
where $m_m$ represents the trainable normalization function for each modality and $\mathrm{EU}(k, t)$ denotes the Euclidean distance between the transformed features $k$ and $t$. Therefore, the retained low-level features enhance the learning performance during the subsequent process.
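As a concrete illustration of this knowledge-sharing step, the sketch below projects each modality's encoded feature through its own transform encoder $f_m$, applies a trainable normalization $m_m$ (implemented here as LayerNorm, an assumption), and penalizes pairwise Euclidean distances between corresponding clips in the spirit of $L_{KS}$. The shared dimension, layer sizes, and modality names are illustrative.

```python
import torch
import torch.nn as nn

class KnowledgeSharing(nn.Module):
    """Transform encoders f_m plus trainable normalizations m_m that map each modality's
    features into a shared space, trained with a pairwise Euclidean alignment loss."""
    def __init__(self, dims, shared_dim=256):
        super().__init__()
        self.f = nn.ModuleDict({m: nn.Sequential(nn.Linear(d, shared_dim), nn.ReLU(),
                                                 nn.Linear(shared_dim, shared_dim))
                                for m, d in dims.items()})
        self.norm = nn.ModuleDict({m: nn.LayerNorm(shared_dim) for m in dims})

    def forward(self, feats):
        # feats: {"image": (B, D_i), "semg": (B, D_e), "vio": (B, D_v)} encoded clip features
        shared = {m: self.norm[m](self.f[m](r)) for m, r in feats.items()}
        mods = list(shared)
        # Euclidean distance between corresponding clips of every modality pair
        l_ks = sum(torch.norm(shared[a] - shared[b], dim=1).mean()
                   for i, a in enumerate(mods) for b in mods[i + 1:])
        return shared, l_ks

ks = KnowledgeSharing({"image": 512, "semg": 128, "vio": 64})
shared, l_ks = ks({"image": torch.randn(8, 512),
                   "semg": torch.randn(8, 128),
                   "vio": torch.randn(8, 64)})
```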
To ensure feature coherence across varying spatial and temporal scales, the fusion network is designed with three branches. Each branch consists of a different number of blocks, where each block comprises a convolutional layer, batch normalization, and a leaky ReLU activation function. The outputs from these branches are concatenated through global average pooling, producing the fused feature $k$. The decoder processes the encoded representations of joint angles through a series of operations, including convolutional layers, batch normalization, ReLU activations, and dropout. Finally, a fully connected regression layer generates the final predictions. The loss function employed for training is the mean squared error (MSE), defined as:
$$L_{PE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2,$$
where $y_i$ represents the ground-truth values, $\hat{y}_i$ denotes the predictions, and $N$ denotes the number of joint angles.
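A minimal sketch of the fusion network and decoder described above is given below. It assumes the three branches operate on the stacked shared features with one, two, and three convolutional blocks respectively, and that the decoder regresses the six joint angles; the layer sizes, dropout rate, and use of 1-D convolutions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def branch(n_blocks, channels):
    """One fusion branch: n_blocks of [Conv1d -> BatchNorm -> LeakyReLU]."""
    layers = []
    for _ in range(n_blocks):
        layers += [nn.Conv1d(channels, channels, 3, padding=1),
                   nn.BatchNorm1d(channels), nn.LeakyReLU(0.1)]
    return nn.Sequential(*layers)

class FusionDecoder(nn.Module):
    """Three-branch fusion over the shared features followed by a regression decoder."""
    def __init__(self, shared_dim=256, n_joints=6):
        super().__init__()
        self.branches = nn.ModuleList([branch(n, shared_dim) for n in (1, 2, 3)])
        self.decoder = nn.Sequential(
            nn.Conv1d(3 * shared_dim, 256, 1), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(256, n_joints))                          # fully connected regression layer

    def forward(self, shared):
        # shared: dict of (B, shared_dim) features from the knowledge-sharing step
        x = torch.stack(list(shared.values()), dim=2)          # (B, shared_dim, n_modalities)
        pooled = [b(x).mean(dim=2) for b in self.branches]     # global average pooling per branch
        fused = torch.cat(pooled, dim=1)                       # fused feature k
        return self.decoder(fused.unsqueeze(2))                # (B, n_joints) joint-angle predictions

model = FusionDecoder()
pred = model({"image": torch.randn(8, 256), "semg": torch.randn(8, 256), "vio": torch.randn(8, 256)})
l_pe = nn.functional.mse_loss(pred, torch.randn(8, 6))         # MSE loss L_PE
```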

2.3.2. Meta-Transfer Learning

With a pre-trained pose estimation model initialized with parameters $(W, b)$, the meta-learning process utilizes a parameter modulation strategy to fine-tune the model for task-specific adaptation. This strategy ensures that the model can effectively generalize to new tasks by optimizing the parameter space based on task-specific objectives. The process can be formally expressed as:
$$ML(W, b) = W\gamma_w + b + \gamma_b,$$
where $\gamma_w$ and $\gamma_b$ denote the modulation parameters applied to the pre-trained weights and biases.
In the meta-transfer stage, the pre-trained model is fine-tuned using new tasks and corresponding samples to enhance its task-specific performance. To achieve this, gradient-based update techniques are employed for parameter modulation, enabling the model to adapt efficiently to the new task distribution. For an $N$-way $K$-shot task $T$, the training loss function is defined as:
$$L_T = \frac{1}{N \times K}\sum_{y_i \in T} L_{MSE}(y_i),$$
where $L_{MSE}$ represents the MSE loss calculation and $y_i$ is a given sample. The optimized pose estimation model is evaluated using query sets of size $N \times Q$, where $N$ represents the number of classes and $Q$ denotes the number of queries per class, sampled from the same task distribution. By utilizing the full support set and query set, the meta-learner is iteratively trained and fine-tuned to achieve an adaptation-ready state. This approach ensures robust generalization across task variations, enabling the model to effectively handle new, unseen scenarios with minimal additional training.
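To make the modulation idea concrete, the sketch below wraps one frozen pre-trained linear layer with trainable parameters $\gamma_w$ and $\gamma_b$ and takes a single gradient step on a 1-way 5-shot support set. Treating the scaling as element-wise, the optimizer choice, and the toy tensor shapes are assumptions for illustration rather than the authors' exact procedure.

```python
import torch
import torch.nn as nn

class ModulatedLinear(nn.Module):
    """Frozen pre-trained linear layer wrapped with trainable modulation parameters,
    a sketch of ML(W, b) = W*gamma_w + b + gamma_b."""
    def __init__(self, pretrained: nn.Linear):
        super().__init__()
        self.register_buffer("W", pretrained.weight.detach().clone())   # frozen pre-trained W
        self.register_buffer("b", pretrained.bias.detach().clone())     # frozen pre-trained b
        self.gamma_w = nn.Parameter(torch.ones_like(self.W))
        self.gamma_b = nn.Parameter(torch.zeros_like(self.b))

    def forward(self, x):
        return nn.functional.linear(x, self.W * self.gamma_w, self.b + self.gamma_b)

# One meta-transfer update on a 1-way 5-shot support set: only gamma_w and gamma_b change,
# so the pre-trained weights remain shared across tasks.
layer = ModulatedLinear(nn.Linear(256, 6))
optimizer = torch.optim.Adam([layer.gamma_w, layer.gamma_b], lr=1e-3)
support_x, support_y = torch.randn(5, 256), torch.randn(5, 6)            # N*K = 5 samples
task_loss = nn.functional.mse_loss(layer(support_x), support_y)          # L_T over the support set
optimizer.zero_grad()
task_loss.backward()
optimizer.step()
```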

2.3.3. Meta Adaptation

In this stage, the meta-learner is tasked with processing samples that differ from those used during training, allowing it to demonstrate its ability to quickly adapt to novel tasks. Randomly selected samples from the support set are employed to fine-tune the task-specific model, where the meta-learner iteratively adjusts its parameters to align with the requirements of the new task. The adaptation performance is then quantitatively evaluated by calculating the average accuracy on the query set. This evaluation provides a comprehensive assessment of the meta-learner’s capacity to generalize effectively across tasks, ensuring robust performance in diverse application scenarios. The meta-learning framework utilizes task-specific fine-tuning with minimal data from the target individual, allowing the model to capture individual-specific features while preserving its generalized knowledge. Theoretically, this framework is designed to handle multiple new tasks during the adaptation stage. However, since the primary objective of the proposed method is to optimize personalized adaptation, the experiments in this study specifically focus on single-person tasks.
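A sketch of this adaptation step is shown below: the meta-trained model is cloned, fine-tuned on a handful of support clips from the new subject, and then evaluated on that subject's query set. The step count, learning rate, and the `evaluate_rmse` helper are hypothetical.

```python
import copy
import torch

def adapt_to_subject(meta_model, support_x, support_y, steps=10, lr=1e-3):
    """Clone the meta-trained model and fine-tune it on a few support clips from a new
    subject, returning a task-specific model for evaluation on that subject's query set."""
    model = copy.deepcopy(meta_model)                  # keep the meta-learner itself untouched
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(support_x), support_y)
        loss.backward()
        optimizer.step()
    return model

# e.g. 1-way 5-shot adaptation with five ten-second clips from the new subject:
# subject_model = adapt_to_subject(meta_model, support_x, support_y)
# query_error = evaluate_rmse(subject_model, query_x, query_y)   # hypothetical helper
```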

2.4. Evaluation

As intra-subject adaptation is the main target of the proposed method, we define the estimation for each subject as a specific task in the meta-learning process, and a ten-second clip of input is defined as one shot. Additionally, walking phase (stance phase and swing phase) and walking speed are also defined as tasks in some experiments to further evaluate the online interactive capabilities of the proposed method in diverse application scenarios.
To quantify the accuracy of the estimated human poses, we use the root mean square error (RMSE) to evaluate the testing results. RMSE measures the distance between the estimated values and the ground-truth values, where the ground-truth values are derived from labeled data captured by the Vicon motion capture system.
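For reference, the RMSE reported in the following tables can be computed as in the short snippet below (joint angles in degrees; the array shapes are illustrative).

```python
import numpy as np

def rmse_degrees(pred, truth):
    """Root mean square error between estimated and Vicon ground-truth joint angles (degrees)."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

# e.g. per-joint RMSE over one trial, with pred and truth shaped (frames, 6 joint angles):
# per_joint_rmse = [rmse_degrees(pred[:, j], truth[:, j]) for j in range(6)]
```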

3. Results

The primary objective of the proposed method is to achieve effective pose estimation adaptation for intra-subject scenarios. To evaluate its performance, we tested the method on different subjects using pre-trained models of varying scales. Pre-trained models were generated using 50%, 75%, and 100% of the remaining data, representing small-scale, middle-scale, and large-scale models, respectively. The performance of the pre-trained models was compared to that of the proposed method across different subjects. Visual results in Figure 5 demonstrate significant improvement achieved through the 1-way 5-shot meta-learning process. The reduction in estimation error achieved through this process is 37.4%, 24.7%, and 23.8% for the small-scale, middle-scale, and large-scale pre-trained models, respectively. Moreover, the improvement observed in meta-learning with the small-scale pre-trained model surpasses that of the large-scale pre-trained model, which presents a trade-off strategy between training resources and meta-learning samples. The estimation errors for each subject are not identical; however, by leveraging knowledge adaptation through the meta-transfer learning process, the differences among subjects in few-shot scenarios are reduced compared to the pre-trained model.
To evaluate the contributions of multimodal inputs to the proposed method, we conducted an ablation study focusing on input modalities. Since VIO information lacks spatial context for pose estimation, it was evaluated in combination with other modalities. Results presented in Table 3 demonstrate that incorporating additional modalities enhances performance in both pre-trained models and meta-learning scenarios. The estimation errors of RGB information are more reliable in terms of spatial accuracy compared to sEMG data; however, when sEMG signals encode continuous motion information, performance improves. Notably, in walking scenarios, when VIO information is combined, the estimation errors decrease significantly by leveraging the richer descriptive information of the human body.
The goal of this research is to optimize real-time applications by implementing intra-subject adaptation using multimodal information, which may introduce computational complexities. We conducted an ablation study on inference efficiency across different modalities, with the results presented in Table 4. RGB information incurs a higher cost in both inference time and memory due to the complexity of the CBAM-ResNet12 encoding network. However, as shown in Table 3, spatial information significantly impacts accuracy. While sEMG and VIO information are higher frequency compared to RGB data, the computational resource requirements are not significantly large relative to their contribution, benefiting from knowledge sharing and the meta-transfer learning strategy. Overall, the proposed framework achieves real-time performance for less intense movements, making it suitable for target scenarios such as healthcare monitoring and cooperative robotics. From the perspective of memory cost, the proposed method also demonstrates strong potential for end-device implementation in low-cost applications. For further optimization of inference efficiency, the encoding network for RGB information can be replaced with a less complex sequence-based network, without significant loss in accuracy, due to the knowledge transfer capabilities of the proposed method.
We conducted an ablation study to thoroughly investigate the pose estimation network utilizing multimodal knowledge fusion. In this study, the data fusion module in the proposed method was replaced with a concatenation operation to integrate features from different modalities. Table 5 presents the differences between knowledge-sharing (KS) and non-knowledge-sharing (nKS) processes. The results indicate that the KS module significantly enhances performance in both pre-trained models and meta-learning scenarios. Specifically, in nKS cases, even 1-shot learning demonstrates the ability to extract sufficient knowledge from the provided sample queries. This finding highlights that without the KS module, simple feature concatenation does not efficiently contribute to achieving high-accuracy pose estimation.
Variations in walking speeds result in distinct movement patterns, while different walking phases produce unique joint spatial distributions. We evaluated the knowledge transfer capabilities for tasks involving different walking speeds and walking phases. As shown in Table 6 and Table 7, the number of learning shots significantly impacts performance, exhibiting varying effects depending on the task type.
For different walking speed tasks, we observe that the estimation error decreases as the number of learning shots increases. Notably, the rate of error reduction is higher for fast walking compared to moderate and slow speeds, indicating that the proposed method excels at extracting knowledge from the distinct movement patterns associated with intense activities. For different walking phases, a similar trend is observed. In the stance phase, performance improves as the number of learning shots increases. However, in the swing phase, the error decreases more rapidly than in the stance phase, attributed to the smoother joint movements during this phase. In these two cases, we obtained convincing evidence that the proposed method enhances pose estimation results in both relatively slow and rapid motion scenarios. Furthermore, in slow-motion cases, estimation errors are smaller due to the spatial information encoded in the RGB data.
In HMI systems, specific tasks are often closely associated with the movements of particular joints. In our study, the seven analyzed joints exhibit distinct movement patterns, offering valuable insights for understanding other types of motion. As illustrated in Figure 6, the estimation errors for both legs are nearly symmetrical, with only slight variations. The hip estimation error is the lowest, attributed to its relatively smaller range of motion and the effective representation provided by VIO features. The knee and ankle joints exhibit higher estimation errors, although they undergo significant optimization through meta-transfer learning. The ankle joint, which is challenging to estimate due to its agile movement patterns, achieves comparable estimation error levels to other joints through multimodal pose estimation and meta-transfer learning. Notably, our quantitative analysis revealed distinct optimization patterns following meta-transfer learning: knee and ankle joint estimation errors decreased significantly compared to hip joint errors. This phenomenon stems from biomechanical characteristics where knee and ankle joints exhibit greater intra-subject variability due to their higher degrees of freedom, which existing pre-trained models inadequately capture given limited training data diversity. The meta-learning framework effectively addressed this limitation through subject-specific motion pattern adaptation, particularly benefiting from the sequential encoding of continuous periodic movements. Crucially, the differential optimization efficacy across joints correlates with their kinematic profiles: while the knee and ankle primarily demonstrate sagittal plane motion during gait cycles, the hip’s multiplanar movement complexity presents greater adaptation challenges.
Due to the lack of a public dataset that satisfies our data requirements, baseline methods were adapted and evaluated using our custom dataset. As shown in Table 8, the proposed method demonstrates significant advantages over the baseline approaches on our dataset, which include sEMG-based transfer methods [69,70,71] and an RGB-information-based pre-trained model [19]. MyoNet, EMGNet, and the other EMG-based transfer learning method estimate joint angles from sEMG signals through transfer learning, employing convolutional networks and deep learning techniques. In contrast, the proposed method integrates RGB and VIO modalities effectively within the meta-transfer learning process. Large-scale pre-trained models, such as MotionBert, have achieved excellent results in estimating human pose from RGB data. However, inference on 2D images lacks the dimensional constraints required for 3D pose estimation, leading to relatively higher estimation errors.

4. Discussion

In this study, the primary objective is to optimize pose estimation adaptation for intra-subject cases. By incorporating few-shot learning in the meta-adaptation stage, general knowledge is effectively transferred to the meta-learner, enabling it to develop a comprehensive global understanding of new tasks, represented in this study by the new target. In practical applications, interactive systems and equipment are required to respond to instructions with precise adjustments. The meta-adaptation mechanism facilitates this requirement by enabling accurate and rapid adaptation to dynamic inputs. Comparisons with baseline methods, including pre-trained and transfer-learning-based models, demonstrate that the proposed method successfully fulfills the task and outperforms alternative approaches.
Since the performance of an inference model is closely tied to the scale of the training dataset, we conducted a deeper investigation into subject-level cases. We observed that the number of learning shots has a similar impact on reducing prediction error as the data scale used during pre-training. From a statistical perspective (Figure 5), few-shot learning demonstrates an “amplifier” effect on the pre-trained estimation model. This indicates that for any subject, the estimation performance of a smaller pre-trained model can match that of a larger pre-trained model with the assistance of the meta-learner. Considering general knowledge transfer, the samples provided to the meta-learner serve as personalized contextual information, enabling the pre-trained model to adapt. This process logically aligns with the transition from a known knowledge domain to a filtered new knowledge domain. The adaptation process for each subject exhibits a consistent trend, further validating the stability and robustness of the proposed method.
Multimodal input plays a crucial role in enhancing the accuracy and robustness of HPE. In the proposed method, RGB data are utilized as a relatively precise spatial representation of the skeleton model, while sEMG signals capture continuity and muscle activity. Although VIO information is not commonly employed in HPE tasks, it offers inherent advantages in reducing noise and representing holistic movement patterns. Integrating VIO information into feature encoding significantly reduces estimation error, particularly in intense movement patterns. The proposed method seamlessly integrates these input sources without compromising the flexibility to decouple modalities when necessary. Furthermore, the method offers adaptability to various application requirements, allowing trade-offs between accuracy, stability, and frequency by selecting appropriate input modalities.
Different input modalities are often configured in distinct ways to capture diverse information distributions. During continuous motion, each modality responds uniquely to the same movement process or patterns. Direct feature fusion and knowledge extraction can introduce collateral noise, negatively affecting the performance of the pre-trained model. To ensure high-quality knowledge for training the meta-learner, a separately trained knowledge transfer module is employed, significantly enhancing the effectiveness of adaptation.
Understanding precise motion in the context of common task requirements is essential for achieving accurate HPE. In this study, movement patterns are categorized by walking speed and walking phase, with the efficiency of knowledge transfer closely linked to the intensity of these patterns. This indicates that the quantity and quality of learning samples significantly influence the estimation of movements, particularly in cases where joints exhibit a wider range of motion. From the perspective of partial movement, each joint exhibits distinct movement patterns throughout the walking procedure. In this context, whether the goal is to balance estimation errors or to improve the performance of specific joint estimations, augmenting the pre-trained dataset or enriching the learning samples can be beneficial.
The proposed method has shown promising results in intra-subject adaptation for HPE tasks. However, for broader applications, more complex and non-repetitive scenarios have not been addressed, indicating that further challenges remain. In future work, we aim to explore several key areas to further enhance the proposed framework. One direction involves investigating the impact of random and unpredictable movements on the performance of the model, particularly in scenarios where consistent patterns are absent. Additionally, we will examine the intricate relationship between knowledge transfer and feature representation, especially in the context of refined and nuanced movement patterns. Another critical area of focus will be the study of partial body movement patterns, assessing how localized movements influence overall pose estimation accuracy and the potential for improving adaptability in tasks requiring incomplete or partial input data. These efforts aim to expand the framework’s applicability and robustness across diverse and dynamic real-world scenarios.

5. Conclusions

This study proposed a comprehensive framework for multimodal input knowledge fusion within meta-transfer learning, significantly enhancing the accuracy and stability of human pose estimation. By integrating diverse modalities such as sEMG, VIO, and image data, the framework addresses uncertainties in complex tasks and compensates for the limitations of single-modal information. A multi-channel feature extraction and fusion network was developed to improve knowledge representation capabilities and resolve knowledge alignment challenges, ensuring robust and generalized performance across various input modalities. Additionally, a training and adaptation framework incorporating few-shot learning enables efficient encoder and decoder updates, achieving real-time dynamic feature adaptation. These innovations demonstrate strong potential for real-world applications requiring rapid adaptation and personalization, such as healthcare monitoring and human–machine interaction, offering enhanced precision, improved robustness, and scalable adaptability across diverse scenarios.

Author Contributions

Conceptualization, G.D. and H.Z.; methodology, G.D.; software, G.D.; validation, G.D., Z.D. and H.Z.; formal analysis, G.D.; investigation, G.D. and H.Z.; resources, G.D., Z.D. and H.H.; data curation, G.D.; writing—original draft preparation, G.D.; writing—review and editing, G.D., H.Z., Z.D. and F.J.; visualization, G.D.; supervision, F.J.; project administration, F.J.; funding acquisition, F.J., H.H., H.Z. and X.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partly funded by Central Guidance for Local Science and Technology Development Fund Projects under Grant No. 2024ZYD0266 and the National Key Laboratory of Unmanned Aerial Vehicle Technology in NPU, Grant No. 202405.

Institutional Review Board Statement

This study was conducted according to the Declaration of Helsinki, and the protocol of Registering Clinical Trials (ChiECRCT20200319) was approved by the Chinese Ethics Committee.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
HPE: Human Pose Estimation
RGB: Red, Green and Blue
sEMG: Surface Electromyography
IMU: Inertial Measurement Unit
TL: Transfer Learning
ML: Meta Learning
VIO: Visual-Inertial Odometry

References

  1. Persson, T. A marker-free method for tracking human lower limb segments based on model matching. Int. J. Bio-Med. Comput. 1996, 41, 87–97. [Google Scholar] [CrossRef]
  2. Laribi, M.A.; Zeghloul, S. Chapter 4—Human lower limb operation tracking via motion capture systems. In Design and Operation of Human Locomotion Systems; Ceccarelli, M., Carbone, G., Eds.; Academic Press: Cambridge, MA, USA, 2020; pp. 83–107. ISBN 9780128156599. [Google Scholar] [CrossRef]
  3. Li, J.; Wang, Z.; Wang, C.; Su, W. GaitFormer: Leveraging dual-stream spatial–temporal Vision Transformer via a single low-cost RGB camera for clinical gait analysis. Knowl.-Based Syst. 2024, 295, 111810. [Google Scholar] [CrossRef]
  4. Ou, Y.; Li, Z.; Li, G.; Su, C.Y. Adaptive fuzzy tracking control of a human lower limb with an exoskeleton. In Proceedings of the 2012 IEEE International Conference on Robotics and Biomimetics (ROBIO), Guangzhou, China, 11–14 December 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 1937–1942. [Google Scholar]
  5. Wang, Z.; Deligianni, F.; Voiculescu, I.; Yang, G.Z. A single RGB camera based gait analysis with a mobile tele-robot for healthcare. In Proceedings of the 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Mexico City, Mexico, 1–5 November 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 6933–6936. [Google Scholar]
  6. Sîmpetru, R.C.; Arkudas, A.; Braun, D.I.; Osswald, M.; de Oliveira, D.S.; Eskofier, B.; Kinfe, T.M.; Del Vecchio, A. Learning a hand model from dynamic movements using high-density EMG and convolutional neural networks. IEEE Trans. Biomed. Eng. 2024, 71, 3556–3568. [Google Scholar] [CrossRef]
  7. Liu, Y.; Zhang, S.; Gowda, M. Neuropose: 3d hand pose tracking using emg wearables. Proc. Web Conf. 2021, 2021, 1471–1482. [Google Scholar]
  8. Raurale, S.A.; McAllister, J.; del Rincon, J.M. Real-time embedded EMG signal analysis for wrist-hand pose identification. IEEE Trans. Signal Process. 2020, 68, 2713–2723. [Google Scholar] [CrossRef]
  9. Ma, C.; Lin, C.; Samuel, O.W.; Guo, W.; Zhang, H.; Greenwald, S.; Xu, L.; Li, G. A bidirectional lstm network for estimating continuous upper limb movement from surface electromyography. IEEE Robot. Autom. Lett. 2021, 6, 7217–7224. [Google Scholar] [CrossRef]
  10. Mundt, M.; Thomsen, W.; Witter, T.; Koeppe, A.; David, S.; Bamer, F.; Potthast, W.; Markert, B. Prediction of lower limb joint angles and moments during gait using artificial neural networks. Med. Biol. Eng. Comput. 2020, 58, 211–225. [Google Scholar] [CrossRef] [PubMed]
  11. Zhang, F.; Li, P.; Hou, Z.G.; Lu, Z.; Chen, Y.; Li, Q.; Tan, M. sEMG-based continuous estimation of joint angles of human legs by using bp neural network. Neurocomputing 2012, 78, 139–148. [Google Scholar] [CrossRef]
  12. Chen, J.; Zhang, X.; Cheng, Y.; Xi, N. Surface emg based continuous estimation of human lower limb joint angles by using deep belief networks. Biomed. Signal Process. Control 2018, 40, 335–342. [Google Scholar] [CrossRef]
  13. Zhu, M.; Guan, X.; Li, Z.; He, L.; Wang, Z.; Cai, K. sEMG-based lower limb motion prediction using cnnlstm with improved pca optimization algorithm. J. Bionic Eng. 2023, 20, 612–627. [Google Scholar] [CrossRef]
  14. Hannun, A.; Lee, A.; Xu, Q.; Collobert, R. Sequence to-sequence speech recognition with time-depth separable convolutions. arXiv 2019, arXiv:1904.02619. [Google Scholar]
  15. Kolotouros, N.; Pavlakos, G.; Black, M.J.; Daniilidis, K. Learning to reconstruct 3d human pose and shape via model-fitting in the loop. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2252–2261. [Google Scholar]
  16. Lin, K.; Wang, L.; Liu, Z. End-to-end human pose and mesh reconstruction with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1954–1963. [Google Scholar]
  17. Fang, H.S.; Li, J.; Tang, H.; Xu, C.; Zhu, H.; Xiu, Y.; Li, Y.L.; Lu, C. Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 7157–7173. [Google Scholar] [CrossRef] [PubMed]
  18. Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 172–186. [Google Scholar] [CrossRef] [PubMed]
  19. Zhu, W.; Ma, X.; Liu, Z.; Liu, L.; Wu, W.; Wang, Y. Motionbert: A unified perspective on learning human motion representations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 15085–15099. [Google Scholar]
  20. Ning, G.; Zhang, Z.; He, Z. Knowledge-guided deep fractal neural networks for human pose estimation. IEEE Trans. Multimed. 2017, 20, 1246–1259. [Google Scholar] [CrossRef]
  21. Tripathi, S.; Ranade, S.; Tyagi, A.; Agrawal, A. Posenet3d: Learning temporally consistent 3d human pose via knowledge distillation. In Proceedings of the 2020 International Conference on 3D Vision (3DV), Fukuoka, Japan, 25–28 November 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 311–321. [Google Scholar]
  22. Martínez, I.J.R.; Mannini, A.; Clemente, F.; Cipriani, C. Online grasp force estimation from the transient EMG. IEEE Trans. Neural Syst. Rehabil. Eng. 2020, 28, 2333–2341. [Google Scholar] [CrossRef] [PubMed]
  23. Ding, Z.; Yang, C.; Wang, Z.; Yin, X.; Jiang, F. Online adaptive prediction of human motion intention based on sEMG. Sensors 2021, 21, 2882. [Google Scholar] [CrossRef]
  24. Zheng, N.; Li, Y.; Zhang, W.; Du, M. User-independent emg gesture recognition method based on adaptive learning. Front. Neurosci. 2022, 16, 847180. [Google Scholar] [CrossRef]
  25. Li, H.; Guo, S.; Wang, H.; Bu, D. Subject-independent continuous estimation of sEMG-based joint angles using both multisource domain adaptation and BP neural network. IEEE Trans. Instrum. Meas. 2022, 72, 4000910. [Google Scholar] [CrossRef]
  26. Al-Quraishi, M.S.; Elamvazuthi, I.; Tang, T.B.; Al-Qurishi, M.; Parasuraman, S.; Borboni, A. Multimodal Fusion Approach Based on EEG and EMG Signals for Lower Limb Movement Recognition. IEEE Sens. J. 2021, 21, 27640–27650. [Google Scholar] [CrossRef]
  27. Zheng, J.; Shi, X.; Gorban, A.; Mao, J.; Song, Y.; Qi, C.R.; Liu, T.; Chari, V.; Cornman, A.; Zhou, Y.; et al. Multi-modal 3d human pose estimation with 2d weak supervision in autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4478–4487. [Google Scholar]
  28. Li, Y.; Liu, Y.; Zhang, M.; Zhang, G.; Wang, Z.; Luo, J. Radiomics with attribute bagging for breast tumor classification using multimodal ultrasound images. J. Ultrasound Med. 2020, 39, 361–371. [Google Scholar] [CrossRef]
  29. Niu, Z.; Zhou, M.; Wang, L.; Gao, X.; Hua, G. Hierarchical multimodal lstm for dense visual-semantic embedding. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1881–1889. [Google Scholar]
  30. Rahaman, M.M.; Li, C.; Yao, Y.; Kulwa, F.; Wu, X.; Li, X.; Wang, Q. DeepCervix: A deep learning-based framework for the classification of cervical cells using hybrid deep feature fusion techniques. Comput. Biol. Med. 2021, 136, 104649. [Google Scholar] [CrossRef]
  31. Ravi, V.; Chaganti, R.; Alazab, M. Recurrent deep learning-based feature fusion ensemble meta-classifier approach for intelligent network intrusion detection system. Comput. Electr. Eng. 2022, 102, 108156. [Google Scholar] [CrossRef]
  32. Zhang, H.; Xu, H.; Tian, X.; Jiang, J.; Ma, J. Image fusion meets deep learning: A survey and perspective. Inf. Fusion 2021, 76, 323–336. [Google Scholar] [CrossRef]
  33. Dong, Y.; Liu, Q.; Du, B.; Zhang, L. Weighted feature fusion of convolutional neural network and graph attention network for hyperspectral image classification. IEEE Trans. Image Process. 2022, 31, 1559–1572. [Google Scholar] [CrossRef] [PubMed]
  34. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3560–3569. [Google Scholar]
  35. Tu, H.; Wang, C.; Zeng, W. Voxel pose: Towards multi-camera 3d human pose estimation in wild environment. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part I 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 197–212. [Google Scholar]
  36. Islam, M.M.; Nooruddin, S.; Karray, F.; Muhammad, G. Multi-level feature fusion for multimodal human activity recognition in internet of healthcare things. Inf. Fusion 2023, 94, 17–31. [Google Scholar] [CrossRef]
  37. Von Marcard, T.; Henschel, R.; Black, M.J.; Rosenhahn, B.; Pons-Moll, G. Recovering accurate 3d human pose in the wild using imus and a moving camera. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 601–617. [Google Scholar]
  38. Malleson, C.; Gilbert, A.; Trumble, M.; Collomosse, J.; Hilton, A.; Volino, M. Real-time full-body motion capture from video and imus. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 449–457. [Google Scholar]
  39. Trumble, M.; Gilbert, A.; Malleson, C.; Hilton, A.; Collomosse, J. Total capture: 3d human pose estimation fusing video and inertial sensors. In Proceedings of the 28th British Machine Vision Conference (BMVC), London, UK, 4–7 September 2017; Volume 2, pp. 1–13. [Google Scholar]
  40. Zimmermann, C.; Welschehold, T.; Dornhege, C.; Burgard, W.; Brox, T. 3d human pose estimation in rgbd images for robotic task learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1986–1992. [Google Scholar]
  41. Ying, J.; Zhao, X. Rgb-d fusion for point-cloud-based 3dhuman pose estimation. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3108–3112. [Google Scholar]
  42. Ramezani, M.; Khoshelham, K.; Fraser, C. Pose estimation by omnidirectional visual-inertial odometry. Robot. Auton. Systems 2018, 105, 26–37. [Google Scholar] [CrossRef]
  43. Li, T.; Yu, H. Visual–inertial fusion-based human pose estimation: A review. IEEE Trans. Instrum. Meas. 2023, 72, 4007816. [Google Scholar] [CrossRef]
  44. Yi, X.; Zhou, Y.; Xu, F. Transpose: Real-time 3d human translation and pose estimation with six inertial sensors. ACM Trans. Graph. (TOG) 2021, 40, 1–13. [Google Scholar] [CrossRef]
  45. Mohammadi, S.M.; Enshaeifar, S.; Hilton, A.; Dijk, D.J.; Wells, K. Transfer learning for clinical sleep pose detection using a single 2D IR camera. IEEE Trans. Neural Syst. Rehabil. Eng. 2020, 29, 290–299. [Google Scholar] [CrossRef]
  46. Soleimani, E.; Nazerfard, E. Cross-subject transfer learning in human activity recognition systems using generative adversarial networks. Neurocomputing 2021, 426, 26–34. [Google Scholar] [CrossRef]
  47. Lin, C.; He, Z. A rotary transformer cross-subject model for continuous estimation of finger joints kinematics and a transfer learning approach for new subjects. Front. Neurosci. 2024, 18, 1306050. [Google Scholar] [CrossRef] [PubMed]
  48. Wei, C.S.; Lin, Y.P.; Wang, Y.T.; Lin, C.T.; Jung, T.P. A subject-transfer framework for obviating inter-and intra-subject variability in EEG-based drowsiness detection. NeuroImage 2018, 174, 407–419. [Google Scholar] [CrossRef] [PubMed]
  49. Gui, L.Y.; Wang, Y.X.; Ramanan, D.; Moura, J.M. Few-shot human motion prediction via meta-learning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 432–450. [Google Scholar]
  50. Lee, H.Y.; Vu, N.T.; Li, S.W. Meta learning and its applications to natural language processing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Tutorial Abstracts, Virtual, 1–6 August 2021; pp. 15–20. [Google Scholar]
  51. Wang, C.; Zhu, Y.; Liu, H.; Zang, T.; Yu, J.; Tang, F. Deep meta-learning in recommendation systems: A survey. arXiv 2022, arXiv:2206.04415. [Google Scholar]
  52. Sun, Q.; Liu, Y.; Chua, T.S.; Schiele, B. Meta-transfer learning for few-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 403–412. [Google Scholar]
  53. Coskun, H.; Zia, M.Z.; Tekin, B.; Bogo, F.; Navab, N.; Tombari, F.; Sawhney, H.S. Domain-specific priors and meta learning for few-shot first-person action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 45, 6659–6673. [Google Scholar] [CrossRef] [PubMed]
  54. Sun, X.; Sun, H.; Li, B.; Wei, D.; Li, W.; Lu, J. MoML: Online Meta Adaptation for 3D Human Motion Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–18 June 2024; pp. 1042–1051. [Google Scholar]
  55. Zhou, F.; Liu, X.; Zhong, T.; Trajcevski, G. MetaMove: On improving human mobility classification and prediction via metalearning. IEEE Trans. Cybern. 2021, 52, 8128–8141. [Google Scholar] [CrossRef]
  56. Hu, Z.; Lu, G.; Xu, D. FVC: A new framework towards deep video compression in feature space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1502–1511. [Google Scholar]
  57. Zhuang, F.; Cheng, X.; Luo, P.; Pan, S.J.; He, Q. Supervised representation learning: Transfer learning with deep autoencoders. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015. [Google Scholar]
  58. Ren, J.; Chai, M.; Tulyakov, S.; Fang, C.; Shen, X.; Yang, J. Human motion transfer from poses in the wild. In Proceedings of the Computer Vision–ECCV 2020 Workshops, Glasgow, UK, 23–28 August 2020; Proceedings, Part III 16. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 262–279. [Google Scholar]
  59. Graff, C.; Dua, D. Uc Irvine Machine Learning Repository. 2019. Available online: https://archive.ics.uci.edu/ (accessed on 20 February 2025).
  60. Jin, S.; Xu, L.; Xu, J.; Wang, C.; Liu, W.; Qian, C.; Ouyang, W.; Luo, P. Whole-body human pose estimation in the wild. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part IX 16. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 196–214. [Google Scholar]
  61. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 1325–1339. [Google Scholar] [CrossRef]
  62. Li, W.; Zhang, Z.; Liu, Z. Action recognition based on a bag of 3d points. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, San Francisco, CA, USA, 13–18 June 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 9–14. [Google Scholar]
  63. Hu, B.; Rouse, E.; Hargrove, L. Benchmark datasets for bilateral lower-limb neuromechanical signals from wearable sensors during unassisted locomotion in able-bodied individuals. Front. Robot. AI 2018, 5, 14. [Google Scholar] [CrossRef]
  64. Luan, Y.; Shi, Y.; Wu, W.; Liu, Z.; Chang, H.; Cheng, J. Har-semg: A dataset for human activity recognition on lower-limb semg. Knowl. Inf. Syst. 2021, 63, 2791–2814. [Google Scholar] [CrossRef]
  65. Krasoulis, A.; Kyranou, I.; Erden, M.S.; Nazarpour, K.; Vijayakumar, S. Improved prosthetic hand control with concurrent use of myoelectric and inertial measurements. J. Neuroeng. Rehabil. 2017, 14, 71. [Google Scholar] [CrossRef]
  66. Allard, P.; Cappozzo, A.; Lundberg, A.; Vaughan, C. Three-Dimensional Analysis of Human Locomotion; John Wiley & Sons: Chichester, UK, 1998. [Google Scholar]
  67. De Leva, P. Adjustments to Zatsiorsky-Seluyanov’s, segment inertia parameters. J. Biomech. 1996, 29, 1223–1230. [Google Scholar] [CrossRef]
  68. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  69. Gautam, A.; Panwar, M.; Biswas, D.; Acharyya, A. MyoNet: A transfer-learning-based LRCN for lower limb movement recognition and knee joint angle prediction for remote monitoring of rehabilitation progress from sEMG. IEEE J. Transl. Eng. Health Med. 2020, 8, 2100310. [Google Scholar] [CrossRef] [PubMed]
  70. Zhang, C.; Wang, X.; Yu, Z.; Wang, B.; Deng, C. Interpretable Dual-branch EMGNet: A transfer learning-based network for inter-subject lower limb motion intention recognition. Eng. Appl. Artif. Intell. 2024, 130, 107761. [Google Scholar] [CrossRef]
  71. Long, Y.; Geng, Y.; Dai, C.; Li, G. A transfer learning based cross-subject generic model for continuous estimation of finger joint angles from a new user. IEEE J. Biomed. Health Inform. 2023, 27, 1914–1925. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Sensor placement and data collection environment: (a) For the lower body, six sEMG sensors were placed on both sides of the legs, while 16 Vicon markers were used to collect ground-truth data. An Intel RealSense T265 sensor was mounted on the waist. (b) Ten Vicon cameras were positioned on the ceiling to capture reflective markers on the lower body, and an RGB camera was placed on the side wall. The subject performed walking trials on flat ground, both clockwise and counterclockwise.
Figure 2. Overall schematic of the proposed framework, consisting of three stages.
Figure 3. The pose estimation network pipeline consists of feature extraction, knowledge sharing, knowledge fusion, and pose regression.
Figure 4. The structure of CBAM-ResNet12 is composed of a combination of CBAM modules, residual blocks, convolution layers, and max-pooling layers.
Figure 5. Results on different subjects with different scales of pre-training.
Figure 6. Evaluation of different joints of the lower body; results are calculated as RMSE in degrees.
Table 1. Comparison between public datasets and ours.
Dataset | Modalities | Location
Requirements | RGB + VIO + sEMG | lower body
UCI dataset [59] | sEMG | upper arms; upper legs
COCO-WholeBody [60] | RGB | body
Human3.6M [61] | RGB | body
Action 3D [62] | Depth | body
ENABL3S [63] | sEMG + IMU | lower limbs
HAR-sEMG [64] | sEMG | lower limbs
Ninapro 7 [65] | sEMG + IMU | forearm
Ours | RGB + VIO + sEMG | lower body
Table 2. Specifications of sensors used in experiments.
Sensor | Modality | Resolution | Frame Rate (Hz) | Sensor Number | Sensitivity
RGB camera * | RGB | 640 × 480 | 30 | 1 | 8-bit
Vicon T40 | Skeleton | 2336 × 1728 | 100 | 10 | 10-bit
Delsys Trigno | sEMG | - | 1111.111 | 6 | 16-bit
Intel RealSense T265 | VIO | - | 200 | 1 | 16-bit
* The RGB camera is an industrial product with model HF877; in cases where purchase is not possible, we provide the sensing chip model OV 9750 for reproducibility of the research.
Table 3. Ablation study on input modalities; results are the estimation errors (RMSE in degrees) in few-shot cases.
Vision | sEMG | VIO | Pre-Trained | 1-Shot | 5-Shot
√ |  |  | 1.67 | 1.59 | 1.43
 | √ |  | 1.74 | 1.61 | 1.50
√ |  | √ | 1.38 | 1.33 | 1.23
 | √ | √ | 1.45 | 1.42 | 1.31
√ | √ |  | 1.54 | 1.43 | 1.38
√ | √ | √ | 1.37 | 1.30 | 1.07
The symbol √ represents the usage of input modalities.
Table 4. Ablation study on inference efficiency across input modalities; the experiments are conducted on a single NVIDIA RTX 4080 SUPER GPU (Nvidia, Santa Clara, CA, USA).
Modalities | Inference Time (ms) | Inference Memory (MB)
RGB | 13 | 847
sEMG | 6 | 416
RGB + VIO | 18 | 1278
sEMG + VIO | 9 | 827
RGB + sEMG | 21 | 1303
RGB + sEMG + VIO | 24 | 1852
Table 5. Ablation study on KS and nKS cases; results are the estimation errors (RMSE in degrees) in few-shot cases.
Cases | Pre-Trained | 1-Shot | 5-Shot
KS | 1.72 | 1.34 | 1.03
nKS | 2.41 | 1.55 | 1.20
Table 6. Evaluation on different walking speeds; estimation errors are calculated as RMSE in degrees.
Walking Speed | Pre-Trained | 1-Shot | 3-Shot | 5-Shot
Slow (3.5 km/h) | 1.30 | 1.26 | 1.17 | 1.04
Moderate (4.5 km/h) | 1.39 | 1.33 | 1.22 | 1.12
Fast (6 km/h) | 1.44 | 1.36 | 1.28 | 1.24
Table 7. Evaluation on different walking phases; estimation errors are calculated as RMSE in degrees.
Walking Phase | Pre-Trained | 1-Shot | 3-Shot | 5-Shot
Stance Phase | 1.30 | 1.25 | 1.15 | 1.03
Swing Phase | 1.41 | 1.33 | 1.22 | 1.15
Table 8. Comparisons of RMSE in degrees between different methods over different tasks.
 | MyoNet [69] | EMGNet [70] | TL-EMG [71] | MotionBert [19] | Ours | Ours
RMSE | 1.88 * | 1.97 * | 2.62 * | 1.52 * | 1.30 ** | 1.07 ***
* Model is pre-trained transfer model which is not few-shot testable. ** Results of 1-shot learning. *** Results of 5-shot learning.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
