1. Introduction
Accurate human pose estimation (HPE) remains a crucial task in human–machine interaction (HMI), where accuracy and stability are pivotal for user-centric applications. Sensing technologies such as imaging, surface electromyography (sEMG), and inertial measurement units (IMUs) have enriched the representation of human motion and, when integrated with estimation methods and neural networks, have driven continuous progress in human motion analysis and estimation methodologies [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]. As HMI applications become increasingly prevalent in daily life, the diversity of motion patterns introduces significant complexity and uncertainty into processing models. Intra-subject adaptation remains essential for personalized HMI, yet its implementation introduces compounded challenges: designing context-aware architectures, curating specialized motion datasets, and preserving estimation consistency. These interdependent barriers fundamentally undermine pose estimation reliability. Knowledge-based approaches, such as transfer learning (TL) and meta-learning (ML), have proven effective in enhancing performance, particularly in intra-subject scenarios [20,21,22,23,24,25]. Additionally, integrating multimodal information and fusing knowledge effectively can further enhance the performance of HPE tasks [26,27,28,29,30,31,32,33,34].
Traditional HPE approaches leverage high-frequency, sensitive sEMG signals for motion state estimation and partial body posture analysis, such as hand gesture recognition [6], classification [7,8], and upper- and lower-body movement estimation [9,10]. Furthermore, sEMG-based methods show significant potential in motion analysis by employing encoding networks such as back-propagation (BP) networks [11], deep belief networks (DBNs) [12], convolutional neural networks (CNNs) [13], long short-term memory (LSTM) networks [8], and hybrid architectures [14]. Given the complexity of motion states in irregular daily movements, vision-based methods offer higher reliability and accuracy because they avoid the initial-value corrections and the instability in slow or stationary states that affect signal-based approaches. Previous studies employing traditional machine learning and deep learning techniques [15,16] have offered vision-based solutions for HPE tasks. Vision-based systems such as AlphaPose [17], OpenPose [18], and MotionBERT [19] demonstrate strong potential in practical applications. However, they often lack the robustness and reliability required for demanding tasks, such as precise operational interactions. Incorporating multi-view capture [35] and multi-level feature extraction [36] further improves estimation accuracy. Vision-based methods, however, typically demand extensive network parameters and substantial computational resources, potentially causing latency in real-time interactive applications. Consequently, human pose prediction and estimation have shifted toward multimodal processing approaches. Significant latency reduction has been achieved by integrating IMU and video streams [37,38,39], while the fusion of RGB and depth information [40,41] yields more reliable spatial results. To achieve both accuracy and efficiency in HPE tasks, we use RGB and sEMG data as our baseline modalities. However, during body movement, global translation and postural change occur simultaneously. Inspired by previous work on rigid-body pose estimation using visual-inertial odometry (VIO) information [42,43], we use the pelvis as the central reference point for estimating the relative pose during movement. Incorporating VIO information as a pose movement descriptor further enriches the representation of human motion.
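To make the pelvis-centered representation concrete, the following is a minimal sketch of expressing joint positions relative to the pelvis; the joint set, ordering, and coordinate values are illustrative assumptions rather than the skeleton definition or data used in this work.

```python
# Minimal sketch: pelvis-centered relative pose.
# Assumptions: joints_world is a (J, 3) array of 3D joint positions in a world
# frame, and index 0 is the pelvis; both choices are illustrative.
import numpy as np

PELVIS_IDX = 0  # assumed position of the pelvis in the joint array

def to_pelvis_frame(joints_world: np.ndarray) -> np.ndarray:
    """Return joint positions expressed relative to the pelvis (global translation removed)."""
    return joints_world - joints_world[PELVIS_IDX]

# Synthetic 7-joint lower-body example (values in metres, made up for illustration).
joints = np.array([
    [0.00, 0.95, 0.00],   # pelvis
    [0.10, 0.90, 0.00],   # right hip
    [0.12, 0.50, 0.02],   # right knee
    [0.13, 0.08, 0.05],   # right ankle
    [-0.10, 0.90, 0.00],  # left hip
    [-0.12, 0.50, 0.02],  # left knee
    [-0.13, 0.08, 0.05],  # left ankle
])
relative_pose = to_pelvis_frame(joints)   # the pelvis row becomes the origin
```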
Despite significant progress in HPE, many previous studies fail to ensure robustness and reliability when pre-trained models are applied to subjects with varying motion patterns, habits, or physical characteristics. To address this challenge, online optimization [22] and adaptation [23,24] techniques have shown great promise in real-time motion processing. By incorporating neural networks into the adaptation process [25], these techniques further enhance the accuracy of the results. However, sensor noise and shifts in the distribution of the input stream can compromise the stability and robustness of online computations. To improve generalization and resistance to interference, larger and more diverse training datasets, along with deeper networks, have been employed in pose estimation [44,45,46] and activity recognition [47,48] tasks leveraging TL techniques based on IMUs and video streams. Few-shot learning technologies such as ML have been widely applied in computer vision [49], natural language processing [50], and recommender systems [51], providing an effective solution for limited training data and uncertain scenarios. Meta-transfer learning techniques [52,53] offer adaptability across various tasks and scenarios by leveraging a knowledge-based migration framework. Knowledge-based learning algorithms have demonstrated strong potential in human motion prediction [49,54] and classification [55], particularly in handling domain and task uncertainty with limited reference movements. Given the diversity of motion types and patterns, knowledge-based mechanisms provide higher-dimensional descriptive parameters in prediction and estimation networks, ensuring the reliability and robustness of human motion modeling.
Contemporary HPE systems demonstrate marked performance advantages through multimodal fusion architectures [26,27], in contrast to the inherent limitations of unimodal approaches. This paradigm enhances system robustness via three synergistic mechanisms: (1) cross-modal feature complementarity that mitigates sensory noise through differential error profiles, (2) hierarchical representation learning enabling discriminative kinematic pattern recognition, and (3) dynamic attention weighting optimized for intra-subject adaptation. Particularly in personalized motion analysis, the fused feature manifold provides physiologically consistent representations that substantially improve movement decoding fidelity under real-world variability. To further enhance stability and efficiency in knowledge-based meta-learning for human motion, multimodal information fusion combined with knowledge-level techniques has shown promise in both training and inference. Early-stage fusion methods such as bagging [28] and embedding [29] offer a solid foundation for improving model performance, while intermediate strategies, such as feature-level fusion, enable synergy between complementary modalities. Vision-based multimodal deep learning networks [30,31,32] and attention-based networks [33,34] leverage feature extraction effectively to enhance results, particularly in intra-subject scenarios. However, relying solely on data-level or feature-level fusion does not ensure effective knowledge transfer between tasks; overcoming this limitation requires deeper exploration of advanced knowledge representation techniques. Techniques such as shared feature spaces [56], shared representations [57], and shared model parameters [58] facilitate knowledge transfer across domains, ensuring that relevant information is preserved and applied, and thereby improve the generalization of models trained on multimodal data. Despite these advancements, misalignment of information and knowledge across domains remains a key challenge, particularly when source and target domains differ significantly, as in intra-subject adaptation. Addressing this misalignment demands sophisticated fusion techniques and refined adaptation strategies to reconcile discrepancies in knowledge transfer across new contexts.
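As a concrete illustration of feature-level fusion through a shared space, the sketch below projects per-modality features into a common latent space and blends them with learned attention weights. The module name, feature dimensions, and weighting scheme are assumptions for illustration, not the architecture used in this work.

```python
# Illustrative sketch (not this paper's network): shared-feature-space fusion.
import torch
import torch.nn as nn

class SharedSpaceFusion(nn.Module):
    """Project each modality into a shared latent space, then blend with attention."""
    def __init__(self, dims: dict, shared_dim: int = 256):
        super().__init__()
        # One linear projection per modality into the common feature space.
        self.proj = nn.ModuleDict({m: nn.Linear(d, shared_dim) for m, d in dims.items()})
        self.attn = nn.Linear(shared_dim, 1)   # scalar relevance score per modality

    def forward(self, feats: dict) -> torch.Tensor:
        # feats: {modality: (B, dim)} -> stacked shared features of shape (B, M, shared_dim)
        shared = torch.stack([self.proj[m](f) for m, f in feats.items()], dim=1)
        weights = torch.softmax(self.attn(shared), dim=1)   # (B, M, 1), sums to 1 over modalities
        return (weights * shared).sum(dim=1)                # fused representation (B, shared_dim)

# Usage with assumed feature sizes for RGB, sEMG, and VIO encoders.
fusion = SharedSpaceFusion({"rgb": 512, "semg": 128, "vio": 64})
fused = fusion({"rgb": torch.randn(4, 512),
                "semg": torch.randn(4, 128),
                "vio": torch.randn(4, 64)})
```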
Building on the extensive advancements in HPE using multimodal sensing technologies, intra-subject adaptation remains a challenging open problem. In this paper, we propose a novel framework that fuses multimodal information at the knowledge level, aiming to improve both accuracy and stability in HPE tasks. Our main contributions are summarized as follows: (1) a framework for multimodal input knowledge fusion in meta-transfer learning, enhancing accuracy and stability; (2) a multi-channel feature extraction and fusion network designed to improve knowledge representation and resolve the knowledge alignment problem; and (3) a training and adaptation framework incorporating few-shot learning for the efficient updating of encoders and decoders, enabling dynamic feature updating in real-time applications.
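To illustrate the kind of few-shot updating described in contribution (3), the sketch below fine-tunes a copy of a pre-trained estimator on a small support set from a new subject. The optimizer, learning rate, loss, and number of inner steps are placeholder assumptions rather than the training settings reported in this work.

```python
# Illustrative sketch (assumed settings): few-shot adaptation of a pre-trained estimator.
import copy
import torch
import torch.nn as nn

def adapt_to_subject(model: nn.Module,
                     support_x: torch.Tensor,
                     support_y: torch.Tensor,
                     inner_steps: int = 5,
                     lr: float = 1e-3) -> nn.Module:
    """Fine-tune a copy of the pre-trained model on a handful of support samples."""
    adapted = copy.deepcopy(model)                 # leave the pre-trained weights untouched
    optimizer = torch.optim.SGD(adapted.parameters(), lr=lr)
    loss_fn = nn.MSELoss()                         # regression loss on joint targets
    adapted.train()
    for _ in range(inner_steps):
        optimizer.zero_grad()
        loss = loss_fn(adapted(support_x), support_y)
        loss.backward()
        optimizer.step()
    return adapted                                 # subject-adapted estimator for inference
```

In a 1-way 5-shot setting, support_x and support_y would hold the five labeled samples contributed by the new subject.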
3. Results
The primary objective of the proposed method is to achieve effective pose estimation adaptation for intra-subject scenarios. To evaluate its performance, we tested the method on different subjects using pre-trained models of varying scales. Pre-trained models were generated using 50%, 75%, and 100% of the remaining data, representing small-scale, middle-scale, and large-scale models, respectively. The performance of the pre-trained models was compared with that of the proposed method across different subjects. Visual results in Figure 5 demonstrate the significant improvement achieved through the 1-way 5-shot meta-learning process. The reduction in estimation error achieved through this process is 37.4%, 24.7%, and 23.8% for the small-scale, middle-scale, and large-scale pre-trained models, respectively. Moreover, the improvement observed in meta-learning with the small-scale pre-trained model surpasses that with the large-scale pre-trained model, suggesting a trade-off between pre-training resources and meta-learning samples. The estimation errors for each subject are not identical; however, by leveraging knowledge adaptation through the meta-transfer learning process, the differences among subjects in few-shot scenarios are reduced compared with the pre-trained models.
To evaluate the contributions of multimodal inputs to the proposed method, we conducted an ablation study on the input modalities. Since VIO information lacks spatial context for pose estimation, it was evaluated only in combination with other modalities. Results presented in Table 3 demonstrate that incorporating additional modalities enhances performance in both pre-trained models and meta-learning scenarios. Estimates derived from RGB information are more reliable in terms of spatial accuracy than those from sEMG data; however, performance improves when sEMG signals encode continuous motion information. Notably, in walking scenarios, combining VIO information decreases estimation errors significantly by leveraging the richer descriptive information of the human body.
The goal of this research is to optimize real-time applications by implementing intra-subject adaptation using multimodal information, which may introduce computational overhead. We therefore conducted an ablation study on inference efficiency across the different modalities, with the results presented in Table 4. RGB information incurs a higher cost in both inference time and memory due to the complexity of the CBAM-ResNet12 encoding network. However, as shown in Table 3, spatial information significantly impacts accuracy. Although sEMG and VIO information are higher frequency than RGB data, their computational resource requirements remain modest relative to their contribution, benefiting from knowledge sharing and the meta-transfer learning strategy. Overall, the proposed framework achieves real-time performance for less intense movements, making it suitable for target scenarios such as healthcare monitoring and cooperative robotics. In terms of memory cost, the proposed method also demonstrates strong potential for end-device implementation in low-cost applications. To further improve inference efficiency, the encoding network for RGB information can be replaced with a less complex sequence-based network without significant loss in accuracy, owing to the knowledge transfer capabilities of the proposed method.
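For readers who wish to reproduce this kind of per-modality cost comparison, the snippet below shows one simple way to time an encoder's forward pass. The encoder object, input shape, and run counts are placeholder assumptions, and the memory accounting used in Table 4 is not reproduced here.

```python
# Illustrative sketch: average forward-pass latency of an encoder (simple CPU timing).
import time
import torch

@torch.no_grad()
def mean_latency_ms(encoder: torch.nn.Module, sample: torch.Tensor,
                    warmup: int = 10, runs: int = 100) -> float:
    """Return the mean forward-pass time in milliseconds over `runs` repetitions."""
    encoder.eval()
    for _ in range(warmup):        # warm-up passes so caching does not skew the timing
        encoder(sample)
    start = time.perf_counter()
    for _ in range(runs):
        encoder(sample)
    return (time.perf_counter() - start) / runs * 1e3

# Example with a stand-in RGB encoder and an assumed 224x224 single-image batch.
rgb_encoder = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1),
                                  torch.nn.ReLU(),
                                  torch.nn.AdaptiveAvgPool2d(1),
                                  torch.nn.Flatten())
print(f"RGB encoder: {mean_latency_ms(rgb_encoder, torch.randn(1, 3, 224, 224)):.2f} ms")
```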
We conducted an ablation study to investigate the pose estimation network utilizing multimodal knowledge fusion in more detail. In this study, the data fusion module in the proposed method was replaced with a concatenation operation to integrate features from different modalities. Table 5 presents the differences between the knowledge-sharing (KS) and non-knowledge-sharing (nKS) processes. The results indicate that the KS module significantly enhances performance in both pre-trained models and meta-learning scenarios. Specifically, in the KS case, even 1-shot learning is able to extract sufficient knowledge from the provided samples. This finding highlights that, without the KS module, simple feature concatenation does not contribute efficiently to high-accuracy pose estimation.
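For contrast with the shared-space fusion sketched in the Introduction, the nKS ablation variant amounts to plain concatenation of per-modality features, along the lines of the sketch below; the feature sizes are illustrative assumptions.

```python
# Illustrative nKS baseline: plain concatenation of per-modality features (no shared space).
import torch

rgb_feat = torch.randn(4, 256)    # assumed per-modality feature sizes
semg_feat = torch.randn(4, 256)
vio_feat = torch.randn(4, 256)
fused_nks = torch.cat([rgb_feat, semg_feat, vio_feat], dim=-1)   # (4, 768) concatenated features
```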
Variations in walking speed result in distinct movement patterns, while different walking phases produce unique joint spatial distributions. We evaluated the knowledge transfer capabilities for tasks involving different walking speeds and walking phases. As shown in Table 6 and Table 7, the number of learning shots significantly impacts performance, with effects that vary according to the task type.
For the walking-speed tasks, the estimation error decreases as the number of learning shots increases. Notably, the rate of error reduction is higher for fast walking than for moderate and slow speeds, indicating that the proposed method excels at extracting knowledge from the distinct movement patterns associated with intense activities. A similar trend is observed for the walking phases. In the stance phase, performance improves as the number of learning shots increases. In the swing phase, however, the error decreases more rapidly than in the stance phase, which we attribute to the smoother joint movements during this phase. These two cases provide convincing evidence that the proposed method enhances pose estimation in both relatively slow and rapid motion scenarios. Furthermore, in slow-motion cases, estimation errors are smaller owing to the spatial information encoded in the RGB data.
In HMI systems, specific tasks are often closely associated with the movements of particular joints. In our study, the seven analyzed joints exhibit distinct movement patterns, offering valuable insights for understanding other types of motion. As illustrated in Figure 6, the estimation errors for both legs are nearly symmetrical, with only slight variations. The hip estimation error is the lowest, attributed to its relatively smaller range of motion and the effective representation provided by the VIO features. The knee and ankle joints exhibit higher estimation errors, although these are substantially reduced through meta-transfer learning. The ankle joint, which is challenging to estimate due to its agile movement patterns, reaches error levels comparable to the other joints through multimodal pose estimation and meta-transfer learning. Notably, our quantitative analysis revealed distinct optimization patterns following meta-transfer learning: knee and ankle estimation errors decreased significantly more than hip errors. This stems from biomechanical characteristics: the knee and ankle exhibit greater intra-subject variability due to their higher degrees of freedom, which existing pre-trained models capture inadequately given limited training data diversity. The meta-learning framework effectively addresses this limitation through subject-specific motion pattern adaptation, particularly benefiting from the sequential encoding of continuous periodic movements. Crucially, the differential optimization efficacy across joints correlates with their kinematic profiles: while the knee and ankle primarily move in the sagittal plane during gait cycles, the hip's multiplanar movement complexity presents greater adaptation challenges.
Due to the lack of a public dataset that satisfies our data requirements, baseline methods were adapted and evaluated on our custom dataset. As shown in Table 8, the proposed method demonstrates significant advantages over baseline approaches on our dataset, which include sEMG-based transfer methods [69,70,71] and an RGB-based pre-trained model [19]. MyoNet, EMGNet, and other sEMG-based transfer learning approaches estimate joint angles from sEMG signals through transfer learning, employing convolutional networks and deep learning techniques. In contrast, the proposed method integrates RGB and VIO modalities effectively within the meta-transfer learning process. Large-scale pre-trained models, such as MotionBERT, have achieved excellent results in estimating human pose from RGB data. However, inference on 2D images lacks the dimensional constraints required for 3D pose estimation, leading to relatively higher estimation errors.
4. Discussion
In this study, the primary objective is to optimize pose estimation adaptation for intra-subject cases. By incorporating few-shot learning in the meta-adaptation stage, general knowledge is effectively transferred to the meta-learner, enabling it to develop a comprehensive global understanding of new tasks, represented in this study by new target subjects. In practical applications, interactive systems and equipment are required to respond to instructions with precise adjustments. The meta-adaptation mechanism satisfies this requirement by enabling accurate and rapid adaptation to dynamic inputs. Comparisons with baseline methods, including pre-trained and transfer-learning-based models, demonstrate that the proposed method successfully fulfills the task and outperforms alternative approaches.
Since the performance of an inference model is closely tied to the scale of the training dataset, we conducted a deeper investigation into subject-level cases. We observed that the number of learning shots has a similar impact on reducing prediction error as the scale of the data used during pre-training. From a statistical perspective (Figure 5), few-shot learning demonstrates an “amplifier” effect on the pre-trained estimation model. This indicates that, for any subject, the estimation performance of a smaller pre-trained model can match that of a larger pre-trained model with the assistance of the meta-learner. Considering general knowledge transfer, the samples provided to the meta-learner serve as personalized contextual information that enables the pre-trained model to adapt. This process logically aligns with the transition from a known knowledge domain to a filtered new knowledge domain. The adaptation process for each subject exhibits a consistent trend, further validating the stability and robustness of the proposed method.
Multimodal input plays a crucial role in enhancing the accuracy and robustness of HPE. In the proposed method, RGB data are utilized as a relatively precise spatial representation of the skeleton model, while sEMG signals capture continuity and muscle activity. Although VIO information is not commonly employed in HPE tasks, it offers inherent advantages in reducing noise and representing holistic movement patterns. Integrating VIO information into feature encoding significantly reduces estimation error, particularly in intense movement patterns. The proposed method seamlessly integrates these input sources without compromising the flexibility to decouple modalities when necessary. Furthermore, the method offers adaptability to various application requirements, allowing trade-offs between accuracy, stability, and frequency by selecting appropriate input modalities.
Different input modalities are often configured in distinct ways to capture diverse information distributions. During continuous motion, each modality responds uniquely to the same movement process or patterns. Direct feature fusion and knowledge extraction can introduce collateral noise, negatively affecting the performance of the pre-trained model. To ensure high-quality knowledge for training the meta-learner, a separately trained knowledge transfer module is employed, significantly enhancing the effectiveness of adaptation.
Understanding precise motion in the context of common task requirements is essential for achieving accurate HPE. In this study, movement patterns are categorized by walking speed and walking phase, and the efficiency of knowledge transfer is closely linked to the intensity of these patterns. This indicates that the quantity and quality of learning samples significantly influence estimation performance, particularly for movements in which joints exhibit a wider range of motion. From the perspective of partial movement, each joint exhibits distinct movement patterns throughout the walking cycle. In this context, whether the goal is to balance estimation errors across joints or to improve the performance of specific joint estimations, augmenting the pre-training dataset or enriching the learning samples can be beneficial.
The proposed method has shown promising results in intra-subject adaptation for HPE tasks. However, for broader applications, more complex and non-repetitive scenarios have not been addressed, indicating that further challenges remain. In future work, we aim to explore several key areas to further enhance the proposed framework. One direction involves investigating the impact of random and unpredictable movements on the performance of the model, particularly in scenarios where consistent patterns are absent. Additionally, we will examine the intricate relationship between knowledge transfer and feature representation, especially in the context of refined and nuanced movement patterns. Another critical area of focus will be the study of partial body movement patterns, assessing how localized movements influence overall pose estimation accuracy and the potential for improving adaptability in tasks requiring incomplete or partial input data. These efforts aim to expand the framework’s applicability and robustness across diverse and dynamic real-world scenarios.