1. Introduction
In recent years, human motion capture has garnered considerable attention owing to its diverse applications in the entertainment, healthcare, and sports industries [1,2]. Accurate motion capture is essential for realistic animation, immersive virtual reality experiences, and precise biomechanical analysis of human movements [3,4]. Traditional optical motion capture systems are widely employed, but they often exhibit certain limitations, such as high cost, restricted mobility, and dependency on controlled environments [5,6].
Many approaches have been proposed for human motion capture [7]. One of the best-known examples is the vision-based method. For instance, in [8], a vision-based system for tracking and interpreting leg motion in image sequences using a single camera was developed and implemented on a commercial computer without any special hardware. A new method for fast human motion capture based on a single red–green–blue–depth (RGB-D) sensor was proposed in [9]. Among vision-based human posture capture devices, Microsoft’s Kinect camera has gained popularity as a depth-sensing device that can capture human movements with high accuracy [10]. This camera utilizes infrared sensors to measure the distance between objects and the camera, generating a detailed three-dimensional point cloud representation of the scene [11]. This depth information, combined with RGB data, enables the precise tracking of human skeletal joints and facilitates real-time motion capture [12]. Depth cameras, integral to Kinect’s operation, rely on the emission and detection of infrared light to create depth maps of the surrounding scene [13,14]. However, environmental obstructions can introduce uncertainties into the captured data. When the human body is temporarily occluded by objects within the camera’s view, data gaps or inaccuracies arise in the motion capture process. These transient interruptions impede the seamless reconstruction of motion trajectories, potentially degrading the fidelity and reliability of the captured human movement. To deal with this problem, inertial sensors have been proposed for measuring human body movements. For instance, in [15], data on human activities are derived from a mobile device’s inertial sensors. Meanwhile, Beshara and Chen proposed the combined use of inertial sensors and Kinect cameras to capture human body movements [16]. It should be noted that, although inertial sensors enable seamless measurement, they suffer from cumulative errors.
Beyond the measurement technology itself, a data fusion filter can further improve measurement accuracy. Among data fusion filters, the Kalman filter (KF) has been widely used. For instance, in [17], a distributed Kalman filter was proposed to provide a human’s position, and a dual predictive quaternion Kalman filter was designed for tracking the human lower-limb posture [18]. Moreover, a new adaptive extended Kalman filter (EKF) for cooperative localization, based on a nonlinear system model, was proposed in [19], and a sigma-point update of the cubature Kalman filter was presented in [20]. One can easily find that the Kalman filter’s performance depends on the accuracy of the model and the noise description, which may be difficult to obtain in practice. To overcome this shortcoming, the finite impulse response (FIR) filter has been proposed [21]; for example, the extended FIR (EFIR) filter is used to fuse inertial navigation system (INS) data and ultra-wideband (UWB) data. It should be pointed out that the approaches mentioned above do not consider data outages, which may make the filter’s measurements unavailable. To overcome this problem, a least-squares support vector machine (LS-SVM)-assisted FIR filter has been proposed, and in [22], a self-learning square-root cubature Kalman filter was proposed for an integrated global positioning system (GPS)/INS in GPS-denied environments.
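To illustrate why FIR filtering sidesteps the need for detailed noise statistics, the following is a minimal sketch of the finite-horizon least-squares idea behind FIR estimation, assuming a simple constant-velocity motion model; the function name, window length, and sampling period are illustrative assumptions and not details taken from the cited works.

```python
import numpy as np

def fir_estimate(window, dt=0.02):
    """Batch FIR-style estimate over a finite horizon of N scalar position
    measurements, assuming a constant-velocity motion model. Unlike a
    Kalman filter, no noise statistics or initial state are required:
    the current state is fit by least squares over the window only."""
    N = len(window)
    # Each past measurement relates to the current state [p, v] via
    # p_k = p - (N - 1 - k) * dt * v  (constant velocity, newest sample last).
    H = np.array([[1.0, -(N - 1 - k) * dt] for k in range(N)])
    y = np.asarray(window, dtype=float)
    x_hat, *_ = np.linalg.lstsq(H, y, rcond=None)  # (H^T H)^{-1} H^T y
    return x_hat  # estimated [position, velocity] at the newest sample

# Illustrative use: noisy positions of a point moving at 1 m/s.
t = np.arange(25) * 0.02
z = 1.0 * t + np.random.normal(0.0, 0.05, size=t.size)
pos, vel = fir_estimate(z)
```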
To address the limitations of standalone INS and overcome the data gaps in Kinect measurements [23,24], a previous study proposed using the extreme learning machine (ELM) algorithm to reconstruct signals through a learned mapping when UWB signals are interrupted [25], allowing the entire system to function properly. Building upon this concept, this paper proposes an integrated human motion capture system using ELM, FIR filtering, and INS data assisted by Microsoft’s Kinect camera [26], which combines the strengths of INS and Kinect while mitigating their weaknesses. The proposed methodology is outlined below.
The INS comprises miniature inertial sensors strategically placed on the subject’s body to measure accelerations, angular velocities, and attitude angles [27,28,30]. The raw INS data provide real-time information about the subject’s orientation and motion [29,31] and serve as a foundation for the subsequent processing [32]. Meanwhile, the pivotal role of ELM lies in learning the intricate relation between the INS-derived body pose data and the corresponding pose data acquired from Kinect [33,34]. Using a shallow neural network architecture, ELM efficiently maps the INS measurements to the corresponding Kinect-based body poses.
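As a concrete illustration of this mapping step, the following is a minimal ELM regressor sketch in which the random hidden-layer weights are fixed and only the output weights are solved in closed form; the class name, feature layout, and hidden-layer size are illustrative assumptions rather than details of the system described in this paper.

```python
import numpy as np

class ELMRegressor:
    """Minimal extreme learning machine: a random (then fixed) hidden layer,
    with only the output weights solved in closed form by least squares."""

    def __init__(self, n_hidden=64, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, Y):
        # Random input weights and biases are drawn once and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)      # hidden-layer responses
        self.beta = np.linalg.pinv(H) @ Y     # closed-form output weights
        return self

    def predict(self, X):
        return np.tanh(X @ self.W + self.b) @ self.beta

# Illustrative use: X holds INS-derived features (e.g., attitude angles and
# accelerations); Y holds the corresponding Kinect-based joint positions.
X = np.random.randn(500, 9)
Y = np.random.randn(500, 3)
pose_model = ELMRegressor(n_hidden=64).fit(X, Y)
predicted_pose = pose_model.predict(X[:5])
```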
Before utilizing ELM, FIR filtering is applied to both the INS data and the pose data obtained from Kinect [17,35]. This filtering process effectively suppresses sensor noise and mitigates the effects of drift, ensuring the accuracy and reliability of the motion capture system [36]. Finally, previous studies have reported that an interactive multiple model (IMM) filtering algorithm can further enhance positioning accuracy. Building upon this idea, the IMM filter is adopted to fuse the INS data and the vision data from Kinect, together with the ELM-processed data [37]. This fusion compensates for missing or erroneous Kinect measurements and further enhances the accuracy of the motion capture system [38,39].
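For readers unfamiliar with IMM fusion, the sketch below shows one generic IMM cycle (state mixing, model-matched Kalman filtering, model-probability update, and estimate combination) for two simple constant-velocity models; the models, noise levels, and function names are illustrative assumptions and do not reproduce the specific filter configuration used in this work.

```python
import numpy as np
from numpy.linalg import inv, det

def kf_step(x, P, z, F, Q, H, R):
    """One Kalman predict/update step; also returns the measurement likelihood."""
    x_p, P_p = F @ x, F @ P @ F.T + Q
    S = H @ P_p @ H.T + R
    K = P_p @ H.T @ inv(S)
    r = z - H @ x_p
    x_u = x_p + K @ r
    P_u = (np.eye(len(x)) - K @ H) @ P_p
    lik = np.exp(-0.5 * r @ inv(S) @ r) / np.sqrt(det(2.0 * np.pi * S))
    return x_u, P_u, float(lik)

def imm_step(xs, Ps, mu, z, models, PI):
    """One IMM cycle: mix states, run each model-matched filter, update the
    model probabilities, and combine into a single fused estimate."""
    M = len(models)
    c = PI.T @ mu                                  # predicted model probabilities
    x_mix, P_mix = [], []
    for j in range(M):                             # 1) mixed initial conditions
        w = PI[:, j] * mu / c[j]
        xj = sum(w[i] * xs[i] for i in range(M))
        Pj = sum(w[i] * (Ps[i] + np.outer(xs[i] - xj, xs[i] - xj)) for i in range(M))
        x_mix.append(xj); P_mix.append(Pj)
    liks = np.zeros(M)
    for j, (F, Q, H, R) in enumerate(models):      # 2) model-matched filtering
        xs[j], Ps[j], liks[j] = kf_step(x_mix[j], P_mix[j], z, F, Q, H, R)
    mu = liks * c                                  # 3) model probability update
    mu = mu / mu.sum()
    x = sum(mu[j] * xs[j] for j in range(M))       # 4) combined (fused) estimate
    P = sum(mu[j] * (Ps[j] + np.outer(xs[j] - x, xs[j] - x)) for j in range(M))
    return xs, Ps, mu, x, P

# Illustrative use: two 1-D constant-velocity models with different process noise.
dt = 0.02
F = np.array([[1.0, dt], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
models = [(F, q * np.eye(2), H, np.array([[0.05 ** 2]])) for q in (1e-5, 1e-2)]
PI = np.array([[0.95, 0.05], [0.05, 0.95]])        # model transition matrix
xs, Ps, mu = [np.zeros(2), np.zeros(2)], [np.eye(2), np.eye(2)], np.array([0.5, 0.5])
for z in ([0.10], [0.21], [0.33]):
    xs, Ps, mu, x_fused, P_fused = imm_step(xs, Ps, mu, np.array(z), models, PI)
```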
By integrating INS, Kinect vision data, the ELM and IMM algorithms, and FIR filtering, the proposed approach offers an advanced solution for human motion capture. This integration effectively mitigates the data gaps caused by environmental obstructions and the high noise levels in Kinect measurements, contributing to improved precision and positioning accuracy [40,41]. The resulting system ensures accurate and reliable real-time motion capture, thereby opening up a wide range of possibilities for applications in animation, virtual reality, sports analysis, and healthcare.
To obtain accurate position information, an assisted method fusing ELM/FIR filters and vision data is proposed for INS-based human motion capture. In the proposed method, when vision is available, the vision-based human position is fed into an FIR filter that outputs an accurate human position, while another FIR filter outputs the human position derived from the INS data. Moreover, ELM is used to build a mapping between the FIR output and the corresponding estimation error. When vision data are unavailable, the FIR filter provides the human posture and the ELM supplies the estimation error learned in the aforementioned stage. Test results confirm the effectiveness of the proposed method.
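The following is a simplified sketch of this two-mode logic, assuming a placeholder moving-average FIR smoother and a regressor such as the ELMRegressor sketched earlier; the interfaces, feature layout, and per-step retraining are illustrative simplifications rather than the actual implementation.

```python
import numpy as np
from collections import deque

class MovingAverageFIR:
    """Trivial equal-tap FIR smoother standing in for the FIR filters of the scheme."""
    def __init__(self, taps=10):
        self.buf = deque(maxlen=taps)
    def update(self, x):
        self.buf.append(np.asarray(x, dtype=float))
        return np.mean(self.buf, axis=0)

def capture_step(ins_pos, kinect_pos, fir_ins, fir_vision, elm, X_train, Y_train):
    """One cycle of the ELM/FIR-assisted capture logic described in the text.
    While vision is available, the ELM learns the mapping from the FIR(INS)
    output to its error; during vision outages, that mapping corrects FIR(INS)."""
    ins_est = fir_ins.update(ins_pos)                   # FIR estimate from INS data
    if kinect_pos is not None:                          # vision available
        vision_est = fir_vision.update(kinect_pos)      # FIR estimate from vision data
        X_train.append(ins_est)                         # training pair:
        Y_train.append(vision_est - ins_est)            # FIR(INS) output -> error
        elm.fit(np.asarray(X_train), np.asarray(Y_train))
        return vision_est
    # Vision outage: correct the FIR(INS) output with the ELM-predicted error.
    return ins_est + elm.predict(np.atleast_2d(ins_est))[0]

# Illustrative use (elm can be the ELMRegressor sketched earlier):
# fir_ins, fir_vision, elm = MovingAverageFIR(), MovingAverageFIR(), ELMRegressor()
# pose = capture_step(ins_sample, kinect_sample, fir_ins, fir_vision, elm, [], [])
```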
The main contributions of this study are as follows:
A seamless INS/vision human motion capture scheme is designed.
A dual ELM/FIR-integrated filter is derived.
An INS/vision human motion capture system is built.
Experimental evidence shows that the proposed algorithms outperform traditional algorithms.
The rest of this paper is structured as follows. Section 2 discusses the principle of an INS-based human motion capture system. Section 3 presents the design of the ELM/FIR filter for the human motion capture system. The experimental tests are summarized in Section 4, and the conclusions are given in Section 5.