Deep Learning-Based Markerless Hand Tracking for Freely Moving Non-Human Primates in Brain–Machine Interface Applications

Liu, Yuhang; Wang, Miao; Hou, Shuaibiao; Wang, Xiao; Shi, Bing

doi:10.3390/electronics14050920

Open AccessArticle

Deep Learning-Based Markerless Hand Tracking for Freely Moving Non-Human Primates in Brain–Machine Interface Applications

by

Yuhang Liu

^1,2,3,†,

Miao Wang

^1,2,3,†,

Shuaibiao Hou

^1,2,3,

Xiao Wang

^1,2,3,*

and

Bing Shi

^1,2,3,*

¹

State Key Laboratory of Digital Medical Engineering, School of Biomedical Engineering, Sanya Research Institute of Hainan University, Haikou 570228, China

²

Key Laboratory of Biomedical Engineering of Hainan Province, One Health Institute, Hainan University, Haikou 570228, China

³

School of Biomedical Engineering, Hainan University, Haikou 570228, China

^*

Authors to whom correspondence should be addressed.

^†

These authors contributed equally to this work.

Electronics 2025, 14(5), 920; https://doi.org/10.3390/electronics14050920

Submission received: 18 December 2024 / Revised: 21 February 2025 / Accepted: 24 February 2025 / Published: 26 February 2025

Download

Browse Figures

Versions Notes

Abstract

:

The motor cortex of non-human primates plays a key role in brain–machine interface (BMI) research. In addition to recording cortical neural signals, accurately and efficiently capturing the hand movements of experimental animals under unconstrained conditions remains a key challenge. Addressing this challenge can deepen our understanding and application of BMI behavior from both theoretical and practical perspectives. To address this issue, we developed a deep learning framework that combines Yolov5 and RexNet-ECA to reliably detect the hand joint positions of freely moving primates at different distances using a single camera. The model simplifies the setup procedure while maintaining high accuracy, with an average keypoint detection error of less than three pixels. Our method eliminates the need for physical markers, ensuring non-invasive data collection while preserving the natural behavior of the experimental subjects. The proposed system exhibits high accuracy and ease of use compared to existing methods. By quickly and accurately acquiring spatiotemporal behavioral metrics, the method provides valuable insights into the dynamic interplay between neural and motor functions, further advancing BMI research.

Keywords:

brain–machine interface; visual recognition; non-human primates; hand movements; object detection; gesture recognition

1. Introduction

In the exploration of cerebral operational mechanisms, brain–machine interfaces (BMIs) emerge as pivotal conduits for transmitting electrophysiological data from the brain to the external environment, playing a crucial role in advancing neuroscientific research. Non-human primates are considered indispensable subjects in BMI experiments, particularly in classical experimental paradigms targeting the primary motor cortex (M1) [1,2].

The evolution of BMI devices has transitioned from wired to wireless implantable devices, which can record the neural signals of experimental subjects in a more natural and free state of movement. Participants equipped with wireless implantable devices enjoy the freedom to engage in a wide range of motions during experiments. A critical challenge to address is the synchronous capture of precise recognition of small hand areas within a broader visual field.

In M1 research with non-human primates, neural signals under natural movement conditions may significantly differ from those acquired under constraints. Therefore, investigating neural signals in unrestricted movement holds greater value [3]. In early BMI experiments, researchers commonly used marker-based methods to pre-process primates for obtaining hand movement data [4,5]. However, these systems, while effective for basic tasks, face significant limitations, including the need for hair trimming, marker attachment, and frequent recalibration. Such procedures often cause discomfort to primates, leading to unnatural behaviors and compromising the quality of experimental data. The inherent complexity and invasive nature of marker-based approaches hinder their application in studies requiring natural and unrestricted movement. Additionally, the innate sensitivity and curiosity of the monkeys can result in attempts to remove the foreign markers, causing self-inflicted harm and further disrupting data collection, so new techniques need to be developed to avoid disturbing the animals.

The rise of deep learning [6] alongside the constant progress in computer vision technology has yielded novel opportunities for tracking movement in non-human primates [7,8], which are essential for BMI research. Markerless motion capture systems based on computer vision techniques have become a significant aspect of the BMI research process. In 2016, Tomoya Nakamura and colleagues conducted a three-dimensional markerless motion capture of monkeys using four depth cameras and successfully detected specific actions, thereby resolving the drawbacks of marked recognition [9]. However, the recognition error rate remained high. Later, in 2019, Rollyn Labuguen et al. [10] used the open-source deep learning tool DeepLabCut to train a monkey model and obtain trajectories that can be used to analyze behaviors. This tool is characterized by its ability to achieve relatively good recognition without requiring a large amount of training data. Nevertheless, the drawbacks are also relatively obvious in that it has to be re-calibrated or re-trained each time it is used. In 2020, Bala et al. [8] developed a markerless motion capture system based on deep learning for detecting monkeys’ whole-body poses during natural motion, with small recognition errors and widely applicable models. They also publicly released the OpenMonkeyPose dataset. In the subsequent year, Rollyn Labuguen et al. [11] unveiled the ‘MacaquePose’ dataset; the availability of the dataset was confirmed through unmarked recognition of the body joint of macaques, and the keypoint errors were found to be close to human standards. In 2021, North R. et al. [12] conducted a clinical trial on the use of non-human primates for human disease, using DeepLabCut to perform label-free recognition on the hands of non-human primates. In 2023, Matsumoto J. et al. [13] conducted multi-view 3D label-free recognition of the entire body of non-human primates, which was able to reconstruct 3D data of non-human primates and more successfully study their social behavior. Li C. et al. [14] used deep learning and two-dimensional skeleton visualization to identify the fine motor activities of crab-eating macaques in cages, solving the problem that action recognition in non-human primate research is heavily dependent on manual work. Butler D.J. et al. [15] used a combination of fluorescence labeling and label-free recognition to study the behavior of non-human primates, which further demonstrates that deep-learning-based marker tracking has revolutionized studies of animal behavior.

The above studies highlight the significant progress made in recent years in analyzing the overall body posture of monkeys. However, the paradigm study of M1 mainly focuses on the overall posture of monkeys. In order to gain a deeper understanding of behavioral mechanisms, behavioral research has gradually shifted from focusing solely on large limb movements to finer and more subtle local limb movements. In recent years, from the free-moving non-human primate treadmill model proposed by Foster J.D. et al. [7] in 2014 (capturing movement and analyzing information through multi-angle cameras and hardware technology) to Li C. et al. [14] identifying the fine movements of caged crab-eating macaques through deep learning and two-dimensional skeleton visualization in 2023, it shows that researchers are paying more and more attention to capturing the details of local limb movements. Therefore, it is necessary to develop fast and simple methods to obtain detailed movements of local limbs to meet the needs of current behavioral research.

Capturing hand movements in BMI experiments is a difficult challenge, and its study requires combining neural signals with spatial and temporal tracking of hand posture and trajectory. Existing tools like DeepLabCut, while effective for general motion tracking, often require extensive retraining and recalibration, making them less efficient for targeted applications like hand tracking. Similarly, in conventional movement paradigms, such as the center-out paradigm, animals need to touch the screen with their hands, which requires accurate recording of information such as gestures and touch time. Therefore, in this study, we developed an efficient and accurate markerless hand motion tracking method designed for non-human primates that can achieve reliable recognition of fine-grained hand movements without frequent retraining, and the model proposed in this paper can automatically recognize hand gestures and add timestamps to each recognition result, which facilitates synchronization with the brain–computer interface system and provides a more natural and practical solution for the study of BMI systems.

We propose a novel marker-free hand movement recognition system tailored for non-human primates. In our research system, we completely avoided the complexity of labeling, streamlined the experimental setup, and used marker-free detection methods to ensure that there was no psychological or physical harm to the primates. This approach made the actions of the experimental subjects more natural.

In this study, we choose to combine Yolov5 and RexNet models, which are deep learning frameworks focused on object detection and feature extraction. Yolov5 is a neural network widely used in object detection tasks, known for its fast inference speed and efficient performance [16]. Since Yolo was first proposed in 2016, it has become an important algorithm family in the field of deep learning object detection. On the other hand, RexNet is a lightweight convolutional neural network designed for optimized feature representation and application on resource-limited devices [17]. By combining the object detection capabilities of Yolov5 with the efficient feature extraction of RexNet, we can achieve robust and accurate gesture recognition.

2. Materials and Methods

This paper presents a non-human primate hand joint recognition algorithm with a single monocular camera. Through the Yolov5 and RexNet networks, this system efficiently captures and tracks the movement of free-moving non-human primate hands in wide-field-of-view images with a high degree of accuracy and speed.

All animal experiments described in this article have been approved by the Animal Ethics Committee of Hainan University, with the audit number HNUAUCC-2023-00008. When designing the experiment, we prioritized the behavioral and physiological needs of the animals. We avoided using any means that might cause physiological or psychological stress to the animals. For non-human primates, no physical interference was performed during the experiment to reduce their stress response, and all operations during the experiment were performed under the supervision of researchers. The experimental animals were cared for by professionals, and the health and physiological indicators of the experimental animals were monitored regularly. After the experiment, all animals did not show any long-term adverse reactions or injuries, ensuring that each experimental animal had sufficient recovery time and medical care after the experiment.

Data collection work. As research advances, achieving high accuracy, convenience, and continuous, stable identification of hand movements in monkeys across their extensive range of motion becomes crucial. While our system is more convenient and practical compared to deep learning tools like DeepLabCut, its effectiveness still heavily depends on high-quality and comprehensive training data. Unfortunately, there is a scarcity of datasets specifically tailored for non-human primates, particularly those applicable to monkey hand movements. Given the availability of ample human hand image data and the physiological and structural similarities between primates and humans, we used transfer learning for the model’s initial training. Subsequently, we utilized screened hand images from existing datasets, coupled with additional collected data, to create a joint training dataset for further model refinement. To compile this dataset, we captured body and hand data from five monkeys at varying distances and under diverse settings using a camera. The subjects encompassed one crab-eating macaque and four rhesus macaques involved in BMI experiments, as these two monkey types are most used in such studies [18]. In the process of collecting the dataset, with a variety of experimental environments and types of equipment, we recorded 41,400 s of video and 169 screened pictures in a free-movement environment; the food guide scenario obtained 50,040 s of video data, a total of 704 available pictures, and 57 available analysis video data; a further 122 images of experimental animal behavior were captured in a controlled environment. Additionally, publicly available datasets [11] were annotated and incorporated into our study.

In terms of data collection (refer to Figure 1), we conducted collection work in various environments, obtaining over 200 h of video files and thousands of images. However, due to the much more difficult control of non-human primate behavior compared to human behavior, the number of images we actually screened and used in the end is limited. In this model training, we used 1986 public dataset images, 1295 public dataset processed images, and 995 actual collected images.

We used the above method to obtain a large number of data samples with different sampling distances. In sub-figure (a), the acquisition distances we used are shown: the black lines from I to VII represent seven different acquisition distances, each with an interval of 20 cm. At VII, the camera is 1.8 m away from the acquisition cage. Multiple acquisition distances are used to evaluate the robustness of the system at different distances.

Subsequent error analysis with these data effectively corroborated the system’s robustness. This dataset adopts the PASCAL VOC dataset annotation format and uses the Python-based LabelImg and Labelme annotation tools to manually annotate the sorted image data to mark its type or keypoint position, thereby improving the accuracy and efficiency of machine learning algorithms and artificial intelligence models.

System network model. To ensure precise identification of hand joints within a broad field of view, we have divided the recognition process into two stages: initial hand localization and posture identification. These stages encompass target detection and pose recognition modules within our markerless recognition system. The target detection module is based on the Yolov5 model, which produces a segmented hand image as its output. This segmented image, identified by the model, is then fed into the second network layer, leveraging the RexNet architecture. The RexNet network is instrumental in identifying and detecting keypoints, enabling joint identification and subsequent output. To further improve the accuracy of the system, we integrated the Efficient Channel Attention (ECA) mechanism in the RexNet network. This attention mechanism selectively emphasizes informative feature channels without increasing computational complexity, involving only a few parameters and bringing significant performance improvements. It improves model performance by focusing on key features, thereby significantly improving recognition accuracy [19].

In the training process of Yolov5 and RexNet-ECA (refer to Figure 2), we adopted an independent training approach for each network. This strategy allows for gathering information from each component, enabling individual adjustments for target detection and keypoint identification, enhancing flexibility and adaptability. Firstly, the Yolov5 network’s detection layer is designed to collaboratively identify multiple hand positions through multi-dimensional analysis, primarily processing wide-field image data. A notable feature of this model is its ability to extract information across various dimensions, thereby concurrently enhancing detection precision. Subsequently, the positional data obtained from the Yolov5 model serve as inputs for the RexNet-ECA model. Leveraging this information, the RexNet-ECA model conducts fine-grained recognition of hand joints, extracting data to detect these joints. RexNet-ECA’s primary advantage lies in its lightweight architecture, ensuring rapid recognition speeds and commendable accuracy compared to alternative algorithms. With this integrated network, the system efficiently captures and recognizes multiple hand positions in non-human primate image data. It supports video recognition, logging gesture data, and capturing joint angular information.

In Figure 2, FC represents a fully connected layer, which integrates the features extracted from the previous layer into a fixed-size output for decision-making. CSP represents a cross-stage local network. The model inputs the original image data into the CSP network and performs feature extraction operations on the image data. SPP corresponds to the spatial pyramid pooling module, which performs feature extraction operations through different pooling sizes, increasing the network’s receptive field in terms of representation effect. MBConv represents mobile inverted bottleneck convolution, which is a lightweight convolution block optimized for mobile networks. Upsample represents an upsampling layer, which is responsible for increasing spatial resolution, and convolution represents a standard convolution layer for feature extraction.

Gesture video density calculation formula. In order to accurately identify the monkey’s hand movements, we used a gesture video density calculation method based on hand joint markers. The core of this method is to extract 21 joint markers of the monkey’s hand and calculate the density of these markers relative to their center of mass, thereby quantifying the distribution of gestures during grasping and unfolding. The formula is as follows:

\begin{array}{l} X = [x 0, x 1, x 2 \dots, x 20] \\ Y = [y 0, y 1, y 2 \dots, y 20] \end{array}

(1)

\begin{array}{l} X_{mid} = [\max (X) - \min (X)] / 2 \\ Y_{mid} = [\max (Y) - \min (Y)] / 2 \end{array}

(2)

Density = \sum_{i = 0}^{20} \sum \sqrt{{(X [i] - X_{mid})}^{2} + {(Y [i] - Y_{mid})}^{2}}

(3)

(1) Equation (1): X and Y represent the coordinate information for the 21 hand markers.

(2) Equation (2): This is used to compute the marker points’ centroidal coordinates.

(3) Equation (3): The ‘Density’ is derived by taking the L2 norm between each marker point and the centroid, then summing the norm values for each point.

3. Results

Dataset construction and feature labeling rules. We have devised a markerless recognition system underpinned by a deep learning framework. The system can capture multiple hand targets and identify 21 marker points on the hand of a non-human primate in a free-motion state by using a model derived from training. These 21 marker points adhere to the standards the International Society of Biomechanics and Anatomy (ISB) set forth. Given the anatomical similarities between human and non-human primate hands, we have applied these universally recognized standards for non-human primate hand recognition.

System accuracy. We assessed the system’s accuracy, efficacy, and precision. To gauge its robustness, we conducted marker point error detection. It considers common conditions in non-human primates used for BMI experiments—distal phalanges often suffer from mutilation due to self-harm or aggression from conspecifics, and the proximal phalanges do not sufficiently represent hand behavior. Therefore, the error detection point was chosen as the midway between each finger’s middle and proximal phalanges, also referred to as the proximal interphalangeal joint. These are denoted as points 3, 6, 10, 14, and 18 (refer to Figure 3). These five landmarks aptly represent the overarching hand posture, and the precision of their detection directly impacts the discernment of the hand’s posture.

To control the variables, we captured 20 images at each of the seven distances with varying pixel dimensions and then evaluated detection errors and recognition ratio for the five landmarks mentioned above.

Figure 4 shows that the pinky has the most prominent error among the images at the same distance with the same pixel size because its target is the smallest and is often partially occluded. While thumb targets are similarly smaller and more flexible, with a greater variety of postures, the recognition difficulty is also very high. The other three fingers, especially the middle finger and ring finger, have the highest correlation and the least flexibility and are less likely to be obscured, and the average recognition accuracy is relatively consistent and high.

Regarding pixel dimensions, as pixel count decreases, the information available diminishes. The algorithm maintains a relatively stable precision and recognizable ratio within the (640 × 480) to (320 × 240) pixel range. However, when the image dimensions drop below (240 × 180), the recognition precision and recognizable ratio decline sharply. This underscores that, for images above (320 × 240) pixels, the proposed algorithm in this study retains commendable recognition precision and recognizable ratio. Notably, the algorithm’s robustness remains largely unaffected by the decline in pixels. Reducing pixels can augment recognition speed, enhancing the system’s real-time capabilities—a distinct advantage of the method presented here. The recognition effect at different pixel resolutions was also discussed, as summarized in Table 1.

Referring to our discussion related to Figure 4 and analyzing the data in Table 1, it is discernible that as the pixel resolution of the images decreases, the system’s processing speed for these images correspondingly increases. This acceleration in speed is particularly evident when transitioning from resolutions of (3840 × 2160) to (1920 × 1088). However, in practical scenarios, we opted for a resolution of (320 × 240) for data acquisition in this experimental setup, which maintains the recognition speed above 30 FPS, thereby satisfying the real-time transmission demands and the requirements for neural data mapping.

Calculation and analysis of gesture video density. One of the pivotal application scenarios of hand joint extraction in BMIs is the recognition of grasp-and-spread postures. We filmed a video of a monkey grasping a physical object and used the algorithm in this paper to identify the 21 points of articulation of its hand and plot its process curve. The monkey reaching out to grasp a physical object can be seen more clearly in Figure 5a, from 1 to 9. We can follow the position of each keypoint of the hand in real time and plot its continuous change process. As shown in Figure 5b, we employed the L2 norm (Equation (3)) to compute the density of the 21 markers, using this metric to ascertain the monkey’s grasp-and-spread actions.

Moreover, in BMIs, the precise categorization of actions is substantially influenced by the actual duration of action formation. As illustrated in Figure 5a, beginning from the fifth frame, the density distribution of the front of the hand conspicuously increases, serving as a crucial indicator for recognizing the hand’s grasp-and-spread posture. The L2 norm results in Figure 5b aptly validate this observation. Consequently, this study can adeptly synchronize with the neural data gathered by the BMI system, ensuring high temporal precision in matching motion states with neural data.

Gesture classification based on joint data analysis. The gesture classification depicted in Figure 6 exhibits commendable performance, distinguishing between spreading and grasping. We applied this classification to the nine frames from Figure 5 and compared the outcomes with the computational results from Figure 5b, demonstrating a good effect. Concurrently, further definite validation of the grasping motions was conducted while obtaining specific keypoints and angles, bolstering the system’s recognition accuracy.

In order to further verify the recognition effect of the algorithm on grasp–stretch posture, we conducted a more diverse analysis of the 21-point joint data based on the algorithm of this paper, performed a binary classification of the grasp–stretch posture, and compared it with the image classification method based on the ResNet50 network, a widely used deep convolutional neural network typically employed for image classification tasks. We found that the accuracy of the ResNet50 image classification algorithm was only 0.882, while the accuracy of inputting joint data into the decision tree reached 0.943.

4. Discussion

To verify the algorithm’s effectiveness in this paper, the running speed and error of Yolov5 + ResNet50 and Yolov5 + RexNet are compared, and Table 2 is obtained.

As fewer reported studies focus on fine-grained hand movement recognition algorithms in non-human primates, in this paper, we only compare the algorithms we implemented and discuss the advantages and disadvantages of the different algorithms. The above table compares the Yolov5 + ResNet50 and Yolov5 + RexNet algorithm models for a single image, including model loading time (*). It is evident from the table that the Yolov5 + ResNet50 model holds a distinct edge in terms of recognition speed and accuracy. Given our study’s focus on the rapid identification of acceptable targets, RexNet was chosen for its lightweight design and computational efficiency. This fits the experimental requirements of our research, where real-time performance in limited computational environments is a key consideration, and thus RexNet fits better with the central theme of our investigation.

To verify the effectiveness of this algorithm, the identification indexes of many models were compared, and the graphics card is NVIDIA Tesla P40, which results in Table 3 below. Among them, the PCK (Percentage of Correct Keypoints) indicator is often used for the evaluation of tasks such as pose estimation and measuring the correct matching degree between the keypoints predicted by the model and the true keypoint coordinates. The value is the percentage of the correctly matched keypoints in the total number of keypoints; OKS (Object Keypoint Similarity) is also an important indicator of posture assessment, representing the similarity between the predicted keypoint position error and the keypoints. The higher the OKS value is, the closer the prediction of the model is to the real attitude; Average Error represents the average difference between the predicted value and the true value of all samples; and Best Loss denotes the lowest loss value reached by the model during training. The improvement of the model in this study adds the ECA mechanism, which is mainly used to enhance the ability to perceive convolutional neural networks for images. By introducing the channel attention mechanism into the keypoint recognition algorithm, the perception of the relationship between channels is enhanced, which helps to improve the ability of the model to extract important features.

The above table compares the algorithm models of SqueezeNet, ResNet50, RexNet, and the improved algorithm RexNet-ECA. As clearly demonstrated in the table, the RexNet-ECA model has more advantages in recognition accuracy. After adding the ECA mechanism, the convergence speed of the RexNet model accelerated, reaching the Best Loss value of 0.1100 in 825 rounds of training. Without the ECA mechanism, the model reached the Best Loss value of 0.1181 in 965 rounds of training. The improved RexNet-ECA algorithm is significantly better than the RexNet algorithm in terms of convergence speed and performance. This performance advantage is very critical to our research, especially in accurately identifying monkey hand movements.

Our research focuses on the recognition of hand movements in monkeys. Compared to the recognition of overall movements in monkeys discussed in reference [13], our analysis provides a more detailed examination of hand movements. Despite the differences in the application, we believe our work can offer one option more.

This study employs a monocular camera system, which provides simplicity in deployment and operation, resulting in effective and highly robust data. Our research focuses on addressing data processing challenges inherent in BMI studies targeting the primary motor cortex. It enables a collaborative analysis with neural signals while mitigating potential physical harm to non-human primates during hand movement recognition, thereby preserving the naturalistic behavior crucial for experiments. In the BMI experimental paradigm, the model can automatically recognize hand movements. Combined with the corresponding neural signals of macaques during different gestures, we can better capture the complex relationship between brain activity and behavior. Additionally, the diverse dataset utilized ensures consistent and accurate recognition across various environments. We independently established a diversified, high-quality non-human primate hand dataset to train a multi-target capture and key detection model, a dataset including eight kinds of experimental environment image data, and three kinds of image acquisition means and two kinds of video acquisition of five non-human primates, collected a total of 4276 image data, and accumulated 91,440 s of video files.

In this paper, a multi-objective capture and keypoint detection model is proposed for BMI behavioral image acquisition and motion data analysis. The improved Yolov5 + RexNet-ECA model architecture is adopted, and transfer learning is used for optimization training. Using a monocular camera in BMI behavior research, we can real-time record 21 keypoints of joint data of large non-human primates within a 2-meter field of view, with an average identification error of only three pixels. Compared to the original YOLOv5 + RexNet architecture, our model shows a 6% improvement in performance, demonstrating superior results in a comparison of four algorithms, and the model enables rapid detection and annotation of hand action states based on movement data. In future work, we can refer to more advanced fine-grained activity recognition and prediction methods to further improve the performance of our model in hand action recognition and prediction. For example, we can use multiple visual modalities and temporal features [20] to achieve more accurate action classification and prediction. These advanced methods are expected to provide us with more efficient and accurate tools when processing non-human primate behavioral data.

5. Conclusions

In this study, we introduced the ECA mechanism into the Yolov5 + RexNet framework to optimize algorithm performance. The model demonstrated excellent results in the non-human primate hand action recognition task, and ablation experiments further validated the performance improvements achieved through individual and combined optimizations. We applied the optimized algorithm to annotate macaque hand data in BMI experiments, significantly enhancing the accuracy of the mapping between neural and behavioral data. The main challenge of this study was to address markerless recognition of hand movements in non-human primates by using a model that fits the experimental needs. Accurately capturing the hand movements of primates moving freely in an unconstrained environment is crucial for understanding the dynamic interaction between neural signals and behavior, which directly affects the accuracy and applicability of real-time neural decoding. This improvement allows complex brain–behavior associations to be captured more effectively. Our method can provide BMI with higher temporal resolution and precise behavioral data by capturing fine-grained hand movements in real time, thus supporting more accurate neural signal decoding. Combined with the closed-loop control paradigm of BMI, it can verify the causal relationship between neural signals and behavior in real time, providing new tools for the design and optimization of BMI. Furthermore, this study constructed an unlabeled macaque hand dataset, providing a valuable public resource for future research in related fields.

However, our method also has limitations. For example, while it performs well in controlled environments, its effectiveness in highly dynamic real-world environments may need further verification. In addition, the model’s reliance on a single camera for joint position detection may limit its robustness in scenarios that require multiple angles or depth information for more complex movements.

In future research, more advanced hardware equipment and training methods, such as multiple cameras and efficient image processing algorithms, can be used to further improve the accuracy and robustness of hand action recognition.

Author Contributions

Conceptualization, X.W. and B.S.; methodology, Y.L. and M.W.; software, Y.L.; validation, M.W. and S.H.; formal analysis, M.W.; investigation, Y.L., M.W. and S.H.; resources, X.W.; data curation, S.H.; writing—original draft preparation, Y.L. and M.W.; writing—review and editing, Y.L.; visualization, M.W.; supervision, X.W.; project administration, B.S. and X.W.; funding acquisition, B.S. and X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (Grant Nos. 62466015 and 32160204), the Project of Sanya Yazhou Bay Science and Technology City (Grant No. SCKJ-JYRC-2023-28), the ‘Rising Star of South China Sea’ Project of Hainan Province (Grant No. 202309001), the High-level Talent Project of the Natural Science Foundation of Hainan Province (Grant No. 322RC560).

Institutional Review Board Statement

The animal experiments involved in this article have all passed the ethical review of the Animal Ethics Committee of Hainan University. The audit number is HNUAUCC-2023-00008.

Data Availability Statement

The dataset supporting the results of this study has been stored on the public data website OSF at https://osf.io/ez428/, accessed on 24 March 2024. All json files in the dataset are annotated by researchers.

Conflicts of Interest

The authors declare no conflict of interest.

References

Almani, M.N.; Saxena, S. Recurrent neural networks controlling musculoskeletal models predict motor cortex activity during novel limb movements. In Proceedings of the 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Glasgow, UK, 11–15 July 2022; pp. 3350–3356. [Google Scholar]
Gallego, J.A.; Makin, T.R.; McDougle, S.D. Going beyond primary motor cortex to improve brain–computer interfaces. Trends Neurosci. 2022, 45, 176–183. [Google Scholar] [CrossRef] [PubMed]
Libey, T.; Fetz, E.E. Open-source, low cost, free-behavior monitoring, and reward system for neuroscience research in non-human primates. Front. Neurosci. 2017, 11, 265. [Google Scholar] [CrossRef] [PubMed]
Blickhan, R.; Andrada, E.; Hirasaki, E.; Ogihara, N. Trunk and leg kinematics of grounded and aerial running in bipedal macaques. J. Exp. Biol. 2021, 224, jeb225532. [Google Scholar] [CrossRef]
Vargas-Irwin, C.E.; Shakhnarovich, G.; Yadollahpour, P.; Mislow, J.M.; Black, M.J.; Donoghue, J.P. Decoding complete reach and grasp actions from local primary motor cortex populations. J. Neurosci. 2010, 30, 9659–9669. [Google Scholar] [CrossRef] [PubMed]
Hinton, G.E.; Osindero, S.; Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural Comput. 2006, 18, 1527–1554. [Google Scholar] [CrossRef] [PubMed]
Foster, J.D.; Nuyujukian, P.; Freifeld, O.; Gao, H.; Walker, R.; Ryu, S.I.; Meng, T.H.; Murmann, B.; Black, M.J.; Shenoy, K.V. A freely-moving monkey treadmill model. J. Neural Eng. 2014, 11, 046020. [Google Scholar] [CrossRef] [PubMed]
Bala, P.C.; Eisenreich, B.R.; Yoo, S.B.M.; Hayden, B.Y.; Park, H.S.; Zimmermann, J. Automated markerless pose estimation in freely moving macaques with OpenMonkeyStudio. Nat. Commun. 2020, 11, 4560. [Google Scholar] [CrossRef] [PubMed]
Nakamura, T.; Matsumoto, J.; Nishimaru, H.; Bretas, R.V.; Takamura, Y.; Hori, E.; Ono, T.; Nishijo, H. A markerless 3D computerized motion capture system incorporating a skeleton model for monkeys. PLoS ONE 2016, 11, e0166154. [Google Scholar] [CrossRef] [PubMed]
Labuguen, R.; Bardeloza, D.K.; Negrete, S.B.; Matsumoto, J.; Inoue, K.; Shibata, T. Primate markerless pose estimation and movement analysis using DeepLabCut. In Proceedings of the 2019 Joint 8th International Conference on Informatics, Electronics & Vision (ICIEV) and 2019 3rd International Conference on Imaging, Vision & Pattern Recognition (icIVPR), Spokane, WA, USA, 30 May–2 June 2019; pp. 297–300. [Google Scholar]
Labuguen, R.; Matsumoto, J.; Negrete, S.B.; Nishimaru, H.; Nishijo, H.; Takada, M.; Go, Y.; Inoue, K.-i.; Shibata, T. MacaquePose: A novel “in the wild” macaque monkey pose dataset for markerless motion capture. Front. Behav. Neurosci. 2021, 14, 581154. [Google Scholar] [CrossRef]
North, R.; Wurr, R.; Macon, R.; Mannion, C.; Hyde, J.; Torres-Espin, A.; Rosenzweig, E.S.; Ferguson, A.R.; Tuszynski, M.H.; Beattie, M.S. Quantifying the kinematic features of dexterous finger movements in nonhuman primates with markerless tracking. In Proceedings of the 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), Mexico, Mexico, 1–5 November 2021; pp. 6110–6115. [Google Scholar]
Matsumoto, J.; Kaneko, T.; Kimura, K.; Negrete, S.B.; Guo, J.; Suda-Hashimoto, N.; Kaneko, A.; Morimoto, M.; Nishimaru, H.; Setogawa, T. Three-dimensional markerless motion capture of multiple freely behaving monkeys for automated characterization of social behavior. bioRxiv 2023. [Google Scholar] [CrossRef]
Li, C.; Xiao, Z.; Li, Y.; Chen, Z.; Ji, X.; Liu, Y.; Feng, S.; Zhang, Z.; Zhang, K.; Feng, J. Deep learning-based activity recognition and fine motor identification using 2D skeletons of cynomolgus monkeys. Zool. Res. 2023, 44, 967. [Google Scholar] [CrossRef] [PubMed]
Butler, D.J.; Keim, A.P.; Ray, S.; Azim, E. Large-scale capture of hidden fluorescent labels for training generalizable markerless motion capture models. Nat. Commun. 2023, 14, 5866. [Google Scholar] [CrossRef] [PubMed]
Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2778–2788. [Google Scholar]
Han, D.; Yun, S.; Heo, B.; Yoo, Y. Rexnet: Diminishing representational bottleneck on convolutional neural network. arXiv 2020, arXiv:2007.00992v3. [Google Scholar]
Liang, F.; Yu, S.; Pang, S.; Wang, X.; Jie, J.; Gao, F.; Song, Z.; Li, B.; Liao, W.-H.; Yin, M. Non-human primate models and systems for gait and neurophysiological analysis. Front. Neurosci. 2023, 17, 1141567. [Google Scholar] [CrossRef] [PubMed]
Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
Chen, H.; Zendehdel, N.; Leu, M.C.; Yin, Z. Fine-grained activity classification in assembly based on multi-visual modalities. J. Intell. Manuf. 2024, 35, 2215–2233. [Google Scholar] [CrossRef]

Figure 1. Block diagram of the overall acquisition. (a) The experimental apparatus consists of a behavior collection cage for non-human primates and a Canon 700D camera for data acquisition. The black lines from I to VII represent seven different acquisition distances, each with a spacing of 20 cm. On VII the camera is 1.8 m away from the acquisition cage. Multiple acquisition distances are used to evaluate the robustness of the system at different distances. (b) It displays a single image in the recognition model, where target detection is performed. (c) It is a joint extraction work for the target detection results. The upper and lower figures show the correspondence between the original image and the joint. (d) It represents 21 random image data corresponding to different collection distances, with decreasing pixel sizes of (640, 480), (560, 420), (480, 360), (400, 300), (320, 240), (240, 180), and (160, 120); a gradual reduction in pixel size is evident in this figure. (e) It showcases detailed hand information extracted from seven randomly chosen images from ‘(d)’, further illustrating the stepwise pixel decrement.

Figure 2. System network model. The model is divided into two parts, namely the target detection module and the posture recognition module. The target detection module uses Yolov5 as the basic model. The model outputs the hand contour after identification and cutting. The data are then input into the second-layer network, RexNet-ECA network, which is applied to the recognition and detection of keypoints. This figure presents a schematic representation of the system model, detailing the data flow from input to output. Through multi-dimensional information detection, multiple hand positions within the image are identified. Subsequent processes extract the joint structure from each detected hand, yielding multiple hand joints within the image. FC, fully connected layer; CSP, stage partial network; SPP, spatial pyramid pooled mud; MBConv, moving inverted bottleneck convolution; Upsample, upsampling; Convolution, convolution layer.

Figure 3. Specific locations of 21 key marking points on the hand.

Figure 4. System testing. The recognition error and recognizable ratio on reduced pixels (640, 480), (560, 420), (480, 360), (400, 300), (320, 240), (240, 180), and (160, 120) were tested in the view. Five distinct color bars represent the five critical landmarks. The black dots on the folds indicate the average pixel error for each marker point, and the blue dots on the folds indicate the recognition efficiency of the system at different pixel scales, from which it can be seen that the system maintains good recognition efficiency at the first five scales. At resolutions above 320 × 240, it has lower recognition error (average pixel error is less than 3.0) and higher recognition rate (average above 95%).

Figure 5. Video frame analysis. (a) It depicts the system’s recognition of nine frames from a captured video sequence. Distinct colored lines represent different keypoints. The joint extraction map of the hand position at the beginning of the video is marked in (a). The numbers in this figure indicate the position of the center of the hand at different frames. In the video, in frame 5, the monkey grabs the food and retracts the hand into the cage. (b) We utilize the L2 norm to compute the density of the joint keypoints across different frames. The line graph distinctively demarcates two phases: the ‘spread’ and ‘grip’ segments, with clear transitional boundaries, effectively capturing the monkey’s grasping motion.

Figure 6. Gesture classification. The (a,c) views are gesture scatter plots normalized to the reach and grasp gesture data, respectively, with the colored markers representing each of the 21 hand keypoints, from which the reach and hold hand information can be seen more clearly. The (b,d) views are line plots of the reach and grasp gesture joint data, respectively, and the highlighted lines are the data averages for each hand keypoint, showing that the folds all have a strong regularity, indicating that classification can be carried out effectively with these data.

Table 1. System accuracy across different environments.

Picture Pixels	Processing Speed of Images (20 Pieces)	Time Improvement Ratio	Error	Detection Probability
3840 × 2160	4.5773	-	-	-
1920 × 1088	1.8539	-	-	-
640 × 480	1.1661	0.00%	2.7283	100.00%
560 × 420	1.1587	0.64%	3.1267	95.00%
480 × 360	1.1392	2.31%	3.0282	100.00%
400 × 300	1.1264	3.41%	2.9714	95.00%
320 × 240	1.1090	4.91%	2.9320	100.00%
240 × 180	1.0689	8.35%	3.9010	80.00%
160 × 120	1.0384	10.95%	3.7765	55.00%

Table 2. Performance and error comparisons of multiple models, (*) indicates that the parameter contains the model loading time.

	Temporal Parameters/			Precision Parameters/Pixel
	Target Detection *	Joint Recognition *	Detection + Joint	Error
Yolo + ResNet50	1.2112	0.6428	1.8540	0.1406
Yolo + ReXNet	1.2112	0.3503	1.5615	0.1219

Table 3. Performance comparison of keypoint detection algorithms.

Model	PCK	OKS	Average Error	Best Loss
SqueezeNet	0.444	0.757	16.580	0.2067
ResNet50	0.746	0.888	9.829	0.1406
RexNet	0.820	0.916	7.998	0.1181
RexNet-ECA	0.825	0.919	7.850	0.1100

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Liu, Y.; Wang, M.; Hou, S.; Wang, X.; Shi, B. Deep Learning-Based Markerless Hand Tracking for Freely Moving Non-Human Primates in Brain–Machine Interface Applications. Electronics 2025, 14, 920. https://doi.org/10.3390/electronics14050920

AMA Style

Liu Y, Wang M, Hou S, Wang X, Shi B. Deep Learning-Based Markerless Hand Tracking for Freely Moving Non-Human Primates in Brain–Machine Interface Applications. Electronics. 2025; 14(5):920. https://doi.org/10.3390/electronics14050920

Chicago/Turabian Style

Liu, Yuhang, Miao Wang, Shuaibiao Hou, Xiao Wang, and Bing Shi. 2025. "Deep Learning-Based Markerless Hand Tracking for Freely Moving Non-Human Primates in Brain–Machine Interface Applications" Electronics 14, no. 5: 920. https://doi.org/10.3390/electronics14050920

APA Style

Liu, Y., Wang, M., Hou, S., Wang, X., & Shi, B. (2025). Deep Learning-Based Markerless Hand Tracking for Freely Moving Non-Human Primates in Brain–Machine Interface Applications. Electronics, 14(5), 920. https://doi.org/10.3390/electronics14050920

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Deep Learning-Based Markerless Hand Tracking for Freely Moving Non-Human Primates in Brain–Machine Interface Applications

Abstract

1. Introduction

2. Materials and Methods

3. Results

4. Discussion

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI