1. Introduction
Brain–machine interfaces (BMIs) serve as pivotal conduits for transmitting electrophysiological data from the brain to the external environment and play a crucial role in advancing research on how the brain operates. Non-human primates are considered indispensable subjects in BMI experiments, particularly in classical experimental paradigms targeting the primary motor cortex (M1) [1,2].
The evolution of BMI devices has transitioned from wired to wireless implantable devices, which can record the neural signals of experimental subjects in a more natural and unconstrained state of movement. Subjects equipped with wireless implantable devices are free to engage in a wide range of motions during experiments. A critical challenge that remains is the synchronized, precise recognition of small hand regions within a broad visual field.
In M1 research with non-human primates, neural signals recorded under natural movement conditions may differ significantly from those acquired under constraints; investigating neural signals during unrestricted movement therefore holds greater value [3]. In early BMI experiments, researchers commonly used marker-based methods, pre-processing the primates in order to obtain hand movement data [4,5]. While effective for basic tasks, these systems face significant limitations, including the need for hair trimming, marker attachment, and frequent recalibration. Such procedures often cause discomfort to primates, leading to unnatural behaviors and compromising the quality of experimental data. The inherent complexity and invasive nature of marker-based approaches hinder their application in studies requiring natural and unrestricted movement. Additionally, the innate sensitivity and curiosity of the monkeys can result in attempts to remove the foreign markers, causing self-inflicted harm and further disrupting data collection, so new techniques that do not disturb the animals need to be developed.
The rise of deep learning [6], alongside constant progress in computer vision technology, has created new opportunities for tracking movement in non-human primates [7,8], which is essential for BMI research. Markerless motion capture systems based on computer vision techniques have therefore become a significant part of the BMI research workflow. In 2016, Tomoya Nakamura and colleagues performed three-dimensional markerless motion capture of monkeys using four depth cameras and successfully detected specific actions, thereby overcoming the drawbacks of marker-based recognition [9]; however, the recognition error rate remained high. In 2019, Rollyn Labuguen et al. [10] used the open-source deep learning tool DeepLabCut to train a monkey model and obtain trajectories suitable for behavioral analysis. This tool achieves relatively good recognition without requiring a large amount of training data, but it has the obvious drawback that it must be re-calibrated or re-trained each time it is used. In 2020, Bala et al. [8] developed a deep-learning-based markerless motion capture system for detecting monkeys' whole-body poses during natural motion, with small recognition errors and widely applicable models; they also publicly released the OpenMonkeyPose dataset. In the subsequent year, Rollyn Labuguen et al. [11] unveiled the 'MacaquePose' dataset, whose usability was confirmed through markerless recognition of macaque body joints, with keypoint errors close to human-level standards. In 2021, North R. et al. [12] conducted a clinical trial using non-human primates as models of human disease, applying DeepLabCut to perform markerless recognition of their hands. In 2023, Matsumoto J. et al. [13] performed multi-view 3D markerless recognition of the entire body of non-human primates, reconstructing 3D data and enabling more successful study of their social behavior. Li C. et al. [14] used deep learning and two-dimensional skeleton visualization to identify the fine motor activities of caged crab-eating macaques, reducing the heavy dependence of action recognition in non-human primate research on manual work. Butler D.J. et al. [15] combined fluorescence labeling with markerless recognition to study non-human primate behavior, further demonstrating that deep-learning-based marker tracking has revolutionized studies of animal behavior.
The above studies highlight the significant progress made in recent years in analyzing the overall body posture of monkeys, and M1 paradigm studies have so far focused mainly on that overall posture. To gain a deeper understanding of behavioral mechanisms, however, behavioral research has gradually shifted from large limb movements toward finer, more subtle local limb movements. From the freely moving non-human primate treadmill model proposed by Foster J.D. et al. [7] in 2014 (capturing movement and analyzing information through multi-angle cameras and hardware technology) to Li C. et al. [14] identifying the fine movements of caged crab-eating macaques through deep learning and two-dimensional skeleton visualization in 2023, researchers are paying more and more attention to capturing the details of local limb movements. It is therefore necessary to develop fast and simple methods for obtaining detailed local limb movements to meet the needs of current behavioral research.
Capturing hand movements in BMI experiments is a difficult challenge, and studying them requires combining neural signals with spatial and temporal tracking of hand posture and trajectory. Existing tools like DeepLabCut, while effective for general motion tracking, often require extensive retraining and recalibration, making them less efficient for targeted applications such as hand tracking. Similarly, conventional movement paradigms, such as the center-out paradigm, require animals to touch a screen with their hands, which demands accurate recording of information such as gestures and touch time. In this study, we therefore developed an efficient and accurate markerless hand motion tracking method for non-human primates that achieves reliable recognition of fine-grained hand movements without frequent retraining. The proposed model automatically recognizes hand gestures and adds a timestamp to each recognition result, which facilitates synchronization with the brain–computer interface system and provides a more natural and practical solution for the study of BMI systems.
We propose a novel marker-free hand movement recognition system tailored for non-human primates. In our research system, we completely avoided the complexity of labeling, streamlined the experimental setup, and used marker-free detection methods to ensure that there was no psychological or physical harm to the primates. This approach made the actions of the experimental subjects more natural.
In this study, we chose to combine the Yolov5 and RexNet models, deep learning frameworks focused on object detection and feature extraction, respectively. Yolov5 is a neural network widely used in object detection tasks, known for its fast inference speed and efficient performance [16]. Since Yolo was first proposed in 2016, it has become an important algorithm family in the field of deep learning object detection. RexNet, on the other hand, is a lightweight convolutional neural network designed for optimized feature representation and for deployment on resource-limited devices [17]. By combining the object detection capabilities of Yolov5 with the efficient feature extraction of RexNet, we can achieve robust and accurate gesture recognition.
2. Materials and Methods
This paper presents a non-human primate hand joint recognition algorithm with a single monocular camera. Through the Yolov5 and RexNet networks, this system efficiently captures and tracks the movement of free-moving non-human primate hands in wide-field-of-view images with a high degree of accuracy and speed.
All animal experiments described in this article have been approved by the Animal Ethics Committee of Hainan University, with the audit number HNUAUCC-2023-00008. When designing the experiment, we prioritized the behavioral and physiological needs of the animals and avoided using any means that might cause them physiological or psychological stress. For the non-human primates, no physical interference was performed during the experiment in order to reduce their stress response, and all operations were performed under the supervision of researchers. The experimental animals were cared for by professionals, and their health and physiological indicators were monitored regularly. After the experiment, none of the animals showed any long-term adverse reactions or injuries, and each experimental animal received sufficient recovery time and medical care.
Data collection work. As research advances, achieving high accuracy, convenience, and continuous, stable identification of hand movements in monkeys across their extensive range of motion becomes crucial. While our system is more convenient and practical than deep learning tools such as DeepLabCut, its effectiveness still depends heavily on high-quality and comprehensive training data. Unfortunately, datasets specifically tailored for non-human primates, particularly for monkey hand movements, are scarce. Given the availability of ample human hand image data and the physiological and structural similarities between primates and humans, we used transfer learning for the model's initial training. Subsequently, we combined screened hand images from existing datasets with additionally collected data to create a joint training dataset for further model refinement. To compile this dataset, we captured body and hand data from five monkeys at varying distances and under diverse settings using a camera. The subjects comprised one crab-eating macaque and four rhesus macaques involved in BMI experiments, as these two species are the most commonly used in such studies [18]. Across a variety of experimental environments and types of equipment, we recorded 41,400 s of video and 169 screened pictures in the free-movement environment; the food-guided scenario yielded 50,040 s of video data, 704 usable pictures, and 57 usable analysis videos; and a further 122 images of experimental animal behavior were captured in a controlled environment. Additionally, publicly available datasets [11] were annotated and incorporated into our study.
In terms of data collection (refer to Figure 1), we conducted collection work in various environments, obtaining over 200 h of video files and thousands of images. However, because non-human primate behavior is much harder to control than human behavior, the number of images we were ultimately able to screen and use is limited. For this model training, we used 1986 public dataset images, 1295 processed public dataset images, and 995 actually collected images.
We used the above method to obtain a large number of data samples with different sampling distances. In sub-figure (a), the acquisition distances we used are shown: the black lines from I to VII represent seven different acquisition distances, each with an interval of 20 cm. At VII, the camera is 1.8 m away from the acquisition cage. Multiple acquisition distances are used to evaluate the robustness of the system at different distances.
Subsequent error analysis with these data effectively corroborated the system’s robustness. This dataset adopts the PASCAL VOC dataset annotation format and uses the Python-based LabelImg and Labelme annotation tools to manually annotate the sorted image data to mark its type or keypoint position, thereby improving the accuracy and efficiency of machine learning algorithms and artificial intelligence models.
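As a concrete illustration, the sketch below parses a PASCAL VOC annotation file of the kind produced by LabelImg into bounding-box labels; the file name and the 'hand' class label are hypothetical examples rather than our exact annotation schema.

```python
# Minimal sketch (not the authors' exact pipeline): parsing a PASCAL VOC
# annotation file produced by LabelImg into bounding-box training labels.
# The file name and class name ("hand") are hypothetical examples.
import xml.etree.ElementTree as ET

def load_voc_boxes(xml_path):
    """Return a list of (class_name, xmin, ymin, xmax, ymax) tuples."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.findall("object"):
        name = obj.find("name").text
        bb = obj.find("bndbox")
        box = tuple(int(bb.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((name, *box))
    return boxes

if __name__ == "__main__":
    for name, xmin, ymin, xmax, ymax in load_voc_boxes("monkey_hand_0001.xml"):
        print(name, xmin, ymin, xmax, ymax)
```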
System network model. To ensure precise identification of hand joints within a broad field of view, we divided the recognition process into two stages: initial hand localization and posture identification. These stages correspond to the target detection and pose recognition modules of our markerless recognition system. The target detection module is based on the Yolov5 model, which outputs a segmented hand image. This segmented image is then fed into the second network layer, which leverages the RexNet architecture. The RexNet network identifies and detects keypoints, enabling joint identification and subsequent output. To further improve the accuracy of the system, we integrated the Efficient Channel Attention (ECA) mechanism into the RexNet network. This attention mechanism selectively emphasizes informative feature channels while adding only a few parameters and negligible computational complexity, improving model performance by focusing on key features and thereby significantly improving recognition accuracy [19].
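For readers unfamiliar with ECA, the following is a minimal PyTorch sketch of an ECA block as it is commonly formulated (global average pooling, a 1D convolution across channels, and sigmoid gating); the exact placement of the block inside our RexNet backbone is an implementation detail not shown here.

```python
# Minimal sketch of an Efficient Channel Attention (ECA) block, following the
# common ECA-Net formulation; the placement inside RexNet may differ from the
# details of our implementation.
import torch
import torch.nn as nn

class ECABlock(nn.Module):
    def __init__(self, channels, k_size=3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)              # squeeze to (B, C, 1, 1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size,       # 1D conv across channels
                              padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (B, C, H, W)
        y = self.avg_pool(x)                                   # (B, C, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(-1, -2))         # (B, 1, C)
        y = self.sigmoid(y.transpose(-1, -2).unsqueeze(-1))    # (B, C, 1, 1) channel weights
        return x * y.expand_as(x)                              # channel-wise reweighting
```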
In the training process of Yolov5 and RexNet-ECA (refer to Figure 2), we adopted an independent training approach for each network. This strategy allows information to be gathered from each component, enabling individual adjustments for target detection and keypoint identification and enhancing flexibility and adaptability. Firstly, the Yolov5 network's detection layer is designed to collaboratively identify multiple hand positions through multi-dimensional analysis, primarily processing wide-field image data. A notable feature of this model is its ability to extract information across various dimensions, thereby concurrently enhancing detection precision. Subsequently, the positional data obtained from the Yolov5 model serve as inputs for the RexNet-ECA model. Leveraging this information, the RexNet-ECA model conducts fine-grained recognition of hand joints, extracting data to detect these joints. RexNet-ECA's primary advantage lies in its lightweight architecture, ensuring rapid recognition speeds and commendable accuracy compared to alternative algorithms. With this integrated network, the system efficiently captures and recognizes multiple hand positions in non-human primate image data. It supports video recognition, logging gesture data, and capturing joint angular information.
In Figure 2, FC represents a fully connected layer, which integrates the features extracted from the previous layer into a fixed-size output for decision-making. CSP represents a cross-stage partial network; the model inputs the original image data into the CSP network and performs feature extraction operations on the image data. SPP corresponds to the spatial pyramid pooling module, which performs feature extraction through different pooling sizes, increasing the network's receptive field. MBConv represents mobile inverted bottleneck convolution, a lightweight convolution block optimized for mobile networks. Upsample represents an upsampling layer, which is responsible for increasing spatial resolution, and Convolution represents a standard convolution layer for feature extraction.
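To make the two-stage data flow concrete, the sketch below illustrates how Yolov5 hand detections could be cropped from a wide-field frame and passed to the keypoint network, with each result timestamped for later synchronization. The model-loading calls, the 224 × 224 keypoint input size, and the normalized 21 × 2 output layout are illustrative assumptions rather than our exact interfaces.

```python
# Illustrative sketch of the two-stage inference flow: Yolov5 proposes hand
# bounding boxes in the wide-field frame, and each crop is passed to the
# RexNet-ECA keypoint network. Model files, input size, and output layout are
# hypothetical placeholders, not the exact API used in our system.
import time
import cv2
import torch

detector = torch.hub.load("ultralytics/yolov5", "custom", path="hand_detector.pt")
keypoint_net = torch.jit.load("rexnet_eca_keypoints.pt").eval()  # hypothetical export

@torch.no_grad()
def recognize_frame(frame):
    """Return a list of (timestamp, box, 21x2 keypoints in frame coordinates)."""
    results = detector(frame)                        # stage 1: hand localization
    outputs = []
    for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist():
        crop = frame[int(y1):int(y2), int(x1):int(x2)]
        inp = cv2.resize(crop, (224, 224))           # assumed keypoint input size
        inp = torch.from_numpy(inp).permute(2, 0, 1).float().unsqueeze(0) / 255.0
        kps = keypoint_net(inp).reshape(21, 2)       # stage 2: 21 normalized keypoints
        kps[:, 0] = kps[:, 0] * (x2 - x1) + x1       # map back to full-frame pixels
        kps[:, 1] = kps[:, 1] * (y2 - y1) + y1
        outputs.append((time.time(), (x1, y1, x2, y2), kps))
    return outputs
```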
Gesture video density calculation formula. In order to accurately identify the monkey’s hand movements, we used a gesture video density calculation method based on hand joint markers. The core of this method is to extract 21 joint markers of the monkey’s hand and calculate the density of these markers relative to their center of mass, thereby quantifying the distribution of gestures during grasping and unfolding. The formula is as follows:
(1) Equation (1): X and Y represent the coordinate information for the 21 hand markers.
(2) Equation (2): This is used to compute the marker points’ centroidal coordinates.
(3) Equation (3): The ‘Density’ is derived by taking the L2 norm between each marker point and the centroid, then summing the norm values for each point.
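The computation described by Equations (1)–(3) can be summarized in a short sketch: take the 21 marker coordinates, compute their centroid, and sum the L2 distances of the markers from that centroid. Any additional normalization used in our implementation is omitted here.

```python
# Sketch of the density computation described by Equations (1)-(3): the 21 hand
# marker coordinates, their centroid, and the sum of L2 norms of each marker
# from that centroid. Any extra normalization in the actual implementation is omitted.
import numpy as np

def gesture_density(keypoints):
    """keypoints: array of shape (21, 2) holding the (x, y) marker coordinates."""
    pts = np.asarray(keypoints, dtype=float)               # Eq. (1): marker coordinates
    centroid = pts.mean(axis=0)                            # Eq. (2): centroidal coordinates
    return np.linalg.norm(pts - centroid, axis=1).sum()    # Eq. (3): summed L2 norms
```

Computed frame by frame, this density value traces the transition between grasping and spreading postures over the course of a video.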
3. Results
Dataset construction and feature labeling rules. We have devised a markerless recognition system underpinned by a deep learning framework. Using a model derived from training, the system can capture multiple hand targets and identify 21 marker points on the hand of a non-human primate in a free-motion state. These 21 marker points adhere to the standards set forth by the International Society of Biomechanics and Anatomy (ISB). Given the anatomical similarities between human and non-human primate hands, we have applied these universally recognized standards to non-human primate hand recognition.
System accuracy. We assessed the system's accuracy, efficacy, and precision. To gauge its robustness, we conducted marker point error detection. The choice of error detection points reflects common conditions in non-human primates used for BMI experiments: distal phalanges often suffer mutilation due to self-harm or aggression from conspecifics, while proximal phalanges do not sufficiently represent hand behavior. Therefore, the error detection point for each finger was chosen as the midpoint between its middle and proximal phalanges, i.e., the proximal interphalangeal joint. These are denoted as points 3, 6, 10, 14, and 18 (refer to Figure 3). These five landmarks aptly represent the overall hand posture, and the precision of their detection directly impacts the discernment of the hand's posture.
To control the variables, we captured 20 images at each of the seven distances with varying pixel dimensions and then evaluated detection errors and recognition ratio for the five landmarks mentioned above.
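As an illustration of this evaluation, the sketch below computes the mean pixel error and recognizable ratio for the five proximal interphalangeal landmarks from predicted and manually annotated coordinates; the array shapes and the detection mask are assumptions for the example.

```python
# Sketch of the per-landmark evaluation: mean pixel error and recognizable ratio
# for the proximal interphalangeal joints (points 3, 6, 10, 14, 18), given
# predicted and manually annotated coordinates. Array shapes are assumptions.
import numpy as np

EVAL_POINTS = [3, 6, 10, 14, 18]          # proximal interphalangeal joints

def landmark_errors(pred, gt, detected):
    """pred, gt: (N, 21, 2) pixel coordinates; detected: (N, 21) boolean mask."""
    errors, ratios = {}, {}
    for p in EVAL_POINTS:
        mask = detected[:, p]
        dist = np.linalg.norm(pred[:, p] - gt[:, p], axis=1)
        errors[p] = float(dist[mask].mean()) if mask.any() else float("nan")
        ratios[p] = float(mask.mean())     # fraction of images where point p was recognized
    return errors, ratios
```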
Figure 4 shows that, among images at the same distance and pixel size, the pinky exhibits the most prominent error because its target is the smallest and is often partially occluded. The thumb target is similarly small and more flexible, with a greater variety of postures, so its recognition difficulty is also high. The other three fingers, especially the middle and ring fingers, are the most correlated and the least flexible and are less likely to be occluded, so their average recognition accuracy is relatively consistent and high.
Regarding pixel dimensions, as the pixel count decreases, the available information diminishes. The algorithm maintains relatively stable precision and recognizable ratio within the (640 × 480) to (320 × 240) pixel range. However, when the image dimensions drop below (240 × 180), the recognition precision and recognizable ratio decline sharply. This indicates that, for images of (320 × 240) pixels and above, the proposed algorithm retains commendable recognition precision and recognizable ratio, and its robustness remains largely unaffected by the reduction in pixels. Reducing pixels can also increase recognition speed, enhancing the system's real-time capabilities, which is a distinct advantage of the method presented here. The recognition effect at different pixel resolutions was also evaluated, as summarized in Table 1.
Referring to our discussion of Figure 4 and analyzing the data in Table 1, it is discernible that as the pixel resolution of the images decreases, the system's processing speed correspondingly increases. This acceleration is particularly evident when transitioning from (3840 × 2160) to (1920 × 1088). In practice, we opted for a resolution of (320 × 240) for data acquisition in this experimental setup, which keeps the recognition speed above 30 FPS, thereby satisfying the real-time transmission demands and the requirements for neural data mapping.
Calculation and analysis of gesture video density. One pivotal application scenario of hand joint extraction in BMIs is the recognition of grasp-and-spread postures. We filmed a video of a monkey grasping a physical object, used the algorithm presented in this paper to identify the 21 articulation points of its hand, and plotted the resulting process curves. The monkey reaching out to grasp the object can be seen clearly in Figure 5a, frames 1 to 9. We can follow the position of each hand keypoint in real time and plot its continuous change. As shown in Figure 5b, we employed the L2 norm (Equation (3)) to compute the density of the 21 markers, using this metric to ascertain the monkey's grasp-and-spread actions.
Moreover, in BMIs, the precise categorization of actions is substantially influenced by the actual duration of action formation. As illustrated in Figure 5a, beginning from the fifth frame, the density distribution of the front of the hand conspicuously increases, serving as a crucial indicator for recognizing the hand's grasp-and-spread posture. The L2 norm results in Figure 5b aptly validate this observation. Consequently, this study can adeptly synchronize with the neural data gathered by the BMI system, ensuring high temporal precision in matching motion states with neural data.
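As an illustration of this synchronization, the sketch below aligns timestamped per-frame density values to neural sample times by nearest-frame lookup; the data layout is a hypothetical example, not the actual BMI acquisition format.

```python
# Sketch of aligning timestamped per-frame density values with neural data for
# synchronization; sampling rates and data layout are hypothetical assumptions.
import numpy as np

def align_density_to_neural(frame_times, densities, neural_times):
    """For each neural sample time, take the density of the nearest video frame."""
    frame_times = np.asarray(frame_times, dtype=float)
    densities = np.asarray(densities, dtype=float)
    neural_times = np.asarray(neural_times, dtype=float)
    idx = np.searchsorted(frame_times, neural_times)
    idx = np.clip(idx, 1, len(frame_times) - 1)
    # pick whichever neighboring frame is closer in time
    left_closer = (neural_times - frame_times[idx - 1]) < (frame_times[idx] - neural_times)
    idx = np.where(left_closer, idx - 1, idx)
    return densities[idx]
```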
Gesture classification based on joint data analysis. The gesture classification depicted in Figure 6 exhibits commendable performance in distinguishing between spreading and grasping. We applied this classification to the nine frames from Figure 5 and compared the outcomes with the computational results in Figure 5b, finding good agreement. Concurrently, the grasping motions were further validated by obtaining specific keypoints and angles, bolstering the system's recognition accuracy.
To further verify the algorithm's recognition of the grasp–spread posture, we carried out a more diverse analysis of the 21-point joint data produced by our algorithm: we performed a binary classification of the grasp–spread posture and compared it with an image classification method based on the ResNet50 network, a widely used deep convolutional neural network typically employed for image classification tasks. We found that the accuracy of the ResNet50 image classification approach was only 0.882, while inputting the joint data into a decision tree reached an accuracy of 0.943.
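A minimal sketch of this joint-data classification is shown below: the 21 keypoint coordinates are flattened into a feature vector and fed to a scikit-learn decision tree for binary grasp/spread classification. The feature layout and the train/test split are illustrative assumptions, not the exact protocol used for the reported accuracies.

```python
# Sketch of binary grasp/spread classification from joint data with a decision
# tree; the feature layout and split parameters are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def classify_gestures(keypoints, labels):
    """keypoints: (N, 21, 2) joint coordinates; labels: (N,) 0 = spread, 1 = grasp."""
    X = np.asarray(keypoints).reshape(len(labels), -1)   # flatten to (N, 42) features
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, random_state=0, stratify=labels)
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))
```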
4. Discussion
To verify the effectiveness of the algorithm presented in this paper, the running speed and error of Yolov5 + ResNet50 and Yolov5 + RexNet were compared, producing Table 2.
Because few reported studies focus on fine-grained hand movement recognition algorithms in non-human primates, we compare only the algorithms we implemented and discuss their respective advantages and disadvantages. The table compares the Yolov5 + ResNet50 and Yolov5 + RexNet models on a single image, including model loading time (*). It is evident from the table that the Yolov5 + ResNet50 model holds a distinct edge in recognition accuracy. However, given our study's focus on the rapid identification of acceptable targets, RexNet was chosen for its lightweight design and computational efficiency. This fits the experimental requirements of our research, where real-time performance in limited computational environments is a key consideration, and thus RexNet aligns better with the central theme of our investigation.
To verify the effectiveness of this algorithm, the recognition metrics of several models were compared on an NVIDIA Tesla P40 graphics card, yielding Table 3 below. Among these metrics, PCK (Percentage of Correct Keypoints) is often used to evaluate tasks such as pose estimation, measuring how well the keypoints predicted by the model match the true keypoint coordinates; its value is the percentage of correctly matched keypoints out of the total number of keypoints. OKS (Object Keypoint Similarity) is another important pose-estimation indicator, representing the similarity between predicted and ground-truth keypoint positions; the higher the OKS value, the closer the model's prediction is to the true pose. Average Error represents the average difference between the predicted and true values over all samples, and Best Loss denotes the lowest loss value reached by the model during training. The improvement to the model in this study is the addition of the ECA mechanism, which is mainly used to enhance the ability of convolutional neural networks to perceive images. Introducing the channel attention mechanism into the keypoint recognition algorithm strengthens the perception of relationships between channels, which helps improve the model's ability to extract important features.
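For reference, the following sketch computes PCK as described above, counting a predicted keypoint as correct when it falls within a threshold distance of the ground-truth keypoint; the threshold convention (a fixed pixel radius here, rather than a fraction of hand size) is an assumption for the example.

```python
# Sketch of the PCK (Percentage of Correct Keypoints) metric: a predicted
# keypoint counts as correct if it lies within a threshold distance of the
# ground-truth keypoint. The threshold convention is an assumption.
import numpy as np

def pck(pred, gt, threshold):
    """pred, gt: (N, K, 2) keypoint coordinates; threshold: distance in pixels."""
    dist = np.linalg.norm(pred - gt, axis=-1)       # (N, K) per-keypoint errors
    return float((dist <= threshold).mean())        # fraction of correct keypoints
```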
The above table compares the SqueezeNet, ResNet50, RexNet, and improved RexNet-ECA models. As the table shows, the RexNet-ECA model holds a clear advantage in recognition accuracy. After adding the ECA mechanism, the convergence of the RexNet model accelerated, reaching a Best Loss value of 0.1100 after 825 rounds of training, whereas without the ECA mechanism the model reached a Best Loss value of 0.1181 after 965 rounds of training. The improved RexNet-ECA algorithm is thus clearly better than the RexNet algorithm in terms of convergence speed and performance. This performance advantage is critical to our research, especially for accurately identifying monkey hand movements.
Our research focuses on the recognition of hand movements in monkeys. Compared to the recognition of overall monkey movements discussed in reference [13], our analysis provides a more detailed examination of hand movements. Despite the difference in application, we believe our work offers an additional option.
This study employs a monocular camera system, which is simple to deploy and operate and yields effective, highly robust data. Our research focuses on addressing data processing challenges inherent in BMI studies targeting the primary motor cortex. It enables collaborative analysis with neural signals while mitigating potential physical harm to non-human primates during hand movement recognition, thereby preserving the naturalistic behavior crucial for experiments. In the BMI experimental paradigm, the model can automatically recognize hand movements; combined with the corresponding neural signals of macaques during different gestures, we can better capture the complex relationship between brain activity and behavior. Additionally, the diverse dataset we used ensures consistent and accurate recognition across various environments. We independently established a diversified, high-quality non-human primate hand dataset to train the multi-target capture and keypoint detection model; it comprises image data from eight kinds of experimental environments, three image acquisition methods, and two video acquisition methods applied to five non-human primates, totaling 4276 images and 91,440 s of video files.
In this paper, a multi-target capture and keypoint detection model is proposed for BMI behavioral image acquisition and motion data analysis. The improved Yolov5 + RexNet-ECA model architecture is adopted, and transfer learning is used for optimization training. Using a monocular camera in BMI behavior research, we can record in real time the joint data of 21 keypoints on large non-human primates within a 2 m field of view, with an average identification error of only three pixels. Compared to the original Yolov5 + RexNet architecture, our model shows a 6% improvement in performance, demonstrating superior results in a comparison of four algorithms, and it enables rapid detection and annotation of hand action states based on movement data. In future work, we can draw on more advanced fine-grained activity recognition and prediction methods to further improve the performance of our model in hand action recognition and prediction; for example, multiple visual modalities and temporal features [20] could be used to achieve more accurate action classification and prediction. These advanced methods are expected to provide more efficient and accurate tools for processing non-human primate behavioral data.
5. Conclusions
In this study, we introduced the ECA mechanism into the Yolov5 + RexNet framework to optimize algorithm performance. The model demonstrated excellent results in the non-human primate hand action recognition task, and ablation experiments further validated the performance improvements achieved through individual and combined optimizations. We applied the optimized algorithm to annotate macaque hand data in BMI experiments, significantly enhancing the accuracy of the mapping between neural and behavioral data. The main challenge of this study was to address markerless recognition of hand movements in non-human primates by using a model that fits the experimental needs. Accurately capturing the hand movements of primates moving freely in an unconstrained environment is crucial for understanding the dynamic interaction between neural signals and behavior, which directly affects the accuracy and applicability of real-time neural decoding. This improvement allows complex brain–behavior associations to be captured more effectively. Our method can provide BMI with higher temporal resolution and precise behavioral data by capturing fine-grained hand movements in real time, thus supporting more accurate neural signal decoding. Combined with the closed-loop control paradigm of BMI, it can verify the causal relationship between neural signals and behavior in real time, providing new tools for the design and optimization of BMI. Furthermore, this study constructed an unlabeled macaque hand dataset, providing a valuable public resource for future research in related fields.
However, our method also has limitations. For example, while it performs well in controlled environments, its effectiveness in highly dynamic real-world environments may need further verification. In addition, the model’s reliance on a single camera for joint position detection may limit its robustness in scenarios that require multiple angles or depth information for more complex movements.
In future research, more advanced hardware equipment and training methods, such as multiple cameras and efficient image processing algorithms, can be used to further improve the accuracy and robustness of hand action recognition.