1. Introduction
In the field of sports biomechanics, various video analysis methods are widely used to analyze the movements of athletes and sports equipment [1]. In the digitization process, i.e., obtaining the pixel coordinates of specific key points on the body or equipment displayed in the video, reflective markers are often affixed to facilitate key point identification. Motion capture systems can then be used to label the reflective markers detected in three-dimensional (3D) space using infrared cameras [1]. However, in performance measurements conducted during competitions or under experimentally simulated conditions, it is often impractical to affix reflective markers to athletes. Consequently, several studies have relied on manual annotation of high-speed video recordings [2,3,4]. Manual digitization, however, is time-consuming and labor-intensive, which poses a great challenge when analyzing many video frames [1].
Recently, deep learning (DL) techniques have been increasingly applied to motion analysis using images. DL refers to a method in which a model, a function characterized by a very large number of parameters (possibly millions or more), is constructed and then optimized using a dataset comprising paired input and ground truth data. This process yields a trained model that estimates unseen key point locations based on patterns learned from the ground truth data. Obtaining outputs from the trained model for new inputs is referred to as “inference”. Compared with earlier image processing techniques, DL has demonstrated higher accuracy in tasks such as object recognition and object detection, and it has been applied to a wide range of problems. An early example in the sports domain is DeepPose [5], published in 2014, in which DL was applied to estimate the coordinates of key points on various body parts (i.e., pose estimation) from images. Numerous studies on pose estimation using DL have since been conducted, achieving real-time performance and simultaneous estimation of multiple individuals [6]. However, DL models typically require large datasets to achieve high generalization and to reduce the risk of overfitting. This is especially true when models must capture complex and diverse movement patterns, where small datasets may cause the model to memorize the training data rather than learning general patterns, resulting in poor performance on new, unseen data. Additionally, limited datasets may exhibit high variance, making the model sensitive to slight changes in input data. In sports biomechanics, acquiring large amounts of labeled data is often challenging due to the complexity of data collection and annotation. Therefore, there is a need to develop automated key point tracking methods for sports biomechanics that minimize manual effort and address the challenges posed by conventional approaches.
Although pose estimation via DL can be expected to facilitate the digitization process in sports biomechanics, several challenges remain. The first challenge is the accuracy of pose estimation. In a study by Nakano et al. [7], measurements of major key points during walking, jumping, and throwing motions were compared between OpenPose [6], a system for automatic key point detection, and a 3D motion capture system. Discrepancies exceeding 40 mm were observed in approximately 10% of the measurement data and, in some trials, different body parts were identified as key points. Similarly, Fukushima et al. [8] reported that DL-based human pose estimation yielded mean absolute errors of 9–10 degrees for sports movements compared with a marker-based motion capture system. Therefore, DL-based pose estimation requires improvement when precise measurements are needed.
The second challenge concerns the dataset used for model training. Conventional DL pose estimation models (e.g., OpenPose [6] and MoveNet [9]) are typically trained on large, publicly available datasets that include major body key points such as the shoulder, elbow, and knee joints. However, the definition of these key points is generally not explicitly stated, leaving it unclear whether the points are identified on the basis of anatomical criteria. As a result, although many recent studies have addressed motion pattern recognition [10] and event detection [11] using such models, few have investigated the temporal changes in specific motions or conducted comparisons based on competitive levels, as is common in sports biomechanics. Moreover, while recent marker-less motion capture systems have achieved higher accuracy, systematic errors have been reported because the built-in models differ from the validated marker-based models commonly used in sports biomechanics [12]. The set of key points may also vary among models [13]. Finally, estimating the coordinates of body parts or equipment locations specific to biomechanical analyses, for which no existing datasets are available, requires preparing a new dataset.
Therefore, sports biomechanics research requires a system that allows key points to be arbitrarily yet accurately defined. In this technical note, we propose an automatic digitization technique using DL designed to operate effectively with minimal training data (few-shot learning). Our approach leverages transfer learning from a pre-trained model but focuses training on a small, video-specific dataset of only a few manually annotated frames. This strategy reduces the problem space, enabling rapid and targeted training around specific key points and the visual environment of the analysis video.
The remainder of this paper is structured as follows:
Section 2 describes the proposed method, including the model architecture, training procedure, and inference process.
Section 3 presents some tracking examples and computational performance metrics.
Section 4 discusses the method’s context, advantages, and limitations. Finally,
Section 5 concludes the technical note.
2. Materials and Methods
2.1. Overview of the Proposed Method
The proposed method comprises three main phases: preprocessing, model training, and model inference (Figure 1). In the preprocessing phase, manual digitization of the points of interest is first performed on a limited number of frames (typically three to four) extracted from the video under analysis. Next, the manually digitized frames are used to construct training data, including ground truth probability maps, and the DL model is trained using these data. Finally, the trained model is employed to automatically digitize all frames in the video. Manual digitization provides precise ground truth for the exact points needed by the researcher and allows complete flexibility in defining key points (anatomical landmarks, equipment points, etc.).
In the proposed method, the training data are constructed by extracting regions of interest (ROIs) centered around manually digitized points. This approach limits the problem space, allowing the model to focus on the most relevant areas of the image, thereby enhancing tracking efficiency even with a small dataset. During inference, the method also leverages these targeted ROIs to reduce the computational resources required while maintaining high accuracy in key point tracking.
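As a concrete illustration of the ROI construction described above, the following Python sketch crops a square window around a digitized point. The function name, the NumPy frame representation, and the 64-pixel default (matching the examples in Section 3) are assumptions for illustration, not part of the published implementation.

```python
import numpy as np

def crop_roi(frame, x, y, size=64):
    """Crop a size x size region of interest (ROI) centered on (x, y).

    frame: image as a NumPy array of shape (H, W, 3).
    Border handling is omitted for brevity; a full version would pad or
    clamp the window when the key point lies near the frame edge.
    """
    half = size // 2
    return frame[int(y) - half:int(y) + half, int(x) - half:int(x) + half]
```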
2.2. Structure of the Deep Learning Model
The deep learning model has a two-stage structure consisting of a backbone network and a head network. A VGG16 model [14] serves as the convolutional neural network (CNN) backbone, and additional convolutional layers are appended to construct the head network for the image recognition task (see Figure 2).
The backbone extracts image features from three intermediate layers: the 2nd convolutional layer (Conv1_2, just before the first max-pooling operation, with the same spatial dimensions as the input), the 4th layer (Conv2_2, just before the second max-pooling, at half the input size), and the 7th layer (Conv3_3, just before the third max-pooling, at one-quarter the input size). These three feature maps, which differ in resolution, serve as the inputs to the head network.
The head network processes each intermediate feature map separately with a convolutional layer (Conv2D with a 5 × 5 kernel and 16 channels) followed by ReLU activation. The processed feature tensors are then up-sampled back to the original input dimensions and concatenated into a single tensor. A subsequent convolutional operation produces a single-channel probability map. During inference, Gaussian smoothing is applied to the probability map to identify the coordinate with the highest probability.
The VGG16 model used as the backbone is one of the fundamental DL object recognition models. Weights pre-trained on the ImageNet dataset [15], which contains over 14 million images for image recognition tasks, are publicly available in various DL frameworks such as TensorFlow [16] and PyTorch [17]. Due to its straightforward network structure, in which the spatial resolution of feature maps is progressively reduced by convolution and pooling layers, extracting features from intermediate layers is relatively simple.
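To make the two-stage structure concrete, a minimal PyTorch sketch is given below. Only the elements stated above are taken from the text (the Conv1_2/Conv2_2/Conv3_3 taps, frozen pre-trained VGG16 weights, 5 × 5 convolutions with 16 channels, ReLU, up-sampling, and concatenation followed by a single-channel convolution); the torchvision layer indices, the 1 × 1 kernel of the output convolution, and all names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

class KeypointNet(nn.Module):
    """Illustrative sketch of the two-stage model described in Section 2.2."""
    def __init__(self):
        super().__init__()
        features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features
        self.backbone = features[:16]          # up to Conv3_3 + ReLU
        for p in self.backbone.parameters():   # backbone stays frozen
            p.requires_grad = False
        self.taps = {3: 64, 8: 128, 15: 256}   # layer index -> channels (assumed indices)
        self.head_convs = nn.ModuleList(
            nn.Conv2d(c, 16, kernel_size=5, padding=2) for c in self.taps.values()
        )
        self.out_conv = nn.Conv2d(3 * 16, 1, kernel_size=1)  # assumed 1x1 kernel

    def forward(self, x):
        h, w = x.shape[-2:]
        feats, y = [], x
        for i, layer in enumerate(self.backbone):
            y = layer(y)
            if i in self.taps:                 # collect the three tapped maps
                feats.append(y)
        processed = [
            F.interpolate(F.relu(conv(f)), size=(h, w), mode="bilinear",
                          align_corners=False)  # conv + ReLU, then up-sample
            for conv, f in zip(self.head_convs, feats)
        ]
        return self.out_conv(torch.cat(processed, dim=1))  # (N, 1, H, W) map
```

In this sketch the backbone is frozen at construction time, which matches the training procedure described in Section 2.3.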
2.3. Model Training
The training dataset consisted of pairs comprising manually digitized frames and probability maps generated by drawing a two-dimensional Gaussian distribution centered at the manually digitized coordinates. To reduce training time, the input region of the images used as training data was limited to the vicinity of the digitized location, thereby reducing the input image size. In addition, the variance of the Gaussian distribution used to generate the probability map was set according to the analysis target and objectives. That is, when fine-scale analysis was required (e.g., tracking the tip of equipment), a Gaussian distribution with a small standard deviation was used, whereas for broader targets such as the entire body, a distribution with a larger standard deviation was applied.
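A minimal sketch of the ground truth construction is shown below, assuming NumPy arrays and ROI-local coordinates; the function name and defaults are illustrative.

```python
import numpy as np

def gaussian_map(roi_h, roi_w, cx, cy, sigma=3.0):
    """Ground truth probability map for one ROI: a 2D Gaussian centered
    at the manually digitized coordinate (cx, cy), in ROI-local pixels.

    A small sigma suits fine targets (e.g., an equipment tip); a larger
    sigma suits broader targets such as the whole body. sigma = 3 matches
    the examples in Section 3.
    """
    xs = np.arange(roi_w)[None, :]   # shape (1, W)
    ys = np.arange(roi_h)[:, None]   # shape (H, 1)
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    return g.astype(np.float32)      # peak value 1.0 at the key point
```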
Subsequently, the model was trained specifically for the video using the constructed training dataset. During training, the weights of VGG16 were fixed, and only the additional head network was trained. At each training iteration, the model output predicted probability maps, which were compared with the ground truth probability maps to calculate the loss. Backpropagation was then performed to minimize the loss, and the model weights were updated using the Adam optimizer [18].
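The training loop can be sketched as follows, using the KeypointNet sketch from Section 2.2. The tensors frames (ROI crops, N × 3 × H × W) and targets (matching Gaussian maps, N × 1 × H × W) are hypothetical, and the loss type and learning rate are assumptions: the text does not specify them, so mean squared error and a common Adam value of 1e-3 are used here.

```python
import torch

# Hypothetical inputs: 'frames' holds the three or four manually
# digitized ROI crops (N, 3, H, W); 'targets' holds the matching
# ground truth Gaussian maps (N, 1, H, W).
model = KeypointNet()
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),  # head weights only
    lr=1e-3)                                             # assumed value
loss_fn = torch.nn.MSELoss()                             # assumed loss

for epoch in range(15):            # 15 epochs, as in the Section 3 examples
    optimizer.zero_grad()
    pred = model(frames)           # predicted probability maps
    loss = loss_fn(pred, targets)  # compare with ground truth maps
    loss.backward()                # backpropagation
    optimizer.step()               # Adam update of the head network
```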
Unlike conventional DL approaches, in which training is continued until the loss function converges (as in OpenPose [6]), this training procedure has no natural convergence point: with only a few training frames, the model continues to fit them ever more tightly as the number of epochs increases. Consequently, the Gaussian standard deviation and the number of epochs must be set empirically. If the tracking performance is unsatisfactory in a given trial, improvements may be achieved by increasing the number of manually digitized frames, among other adjustments.
2.4. Model Inference
After model training is complete, all frames of the video are input into the model to perform automatic digitization. The DL model outputs a probability map indicating the location of the points of interest. During inference, Gaussian smoothing is applied to this probability map to extract the coordinate with the highest probability, which is then adopted as the digitized point. Inference is conducted sequentially for each frame in chronological order. For each frame, only a small image region centered on the digitized point from the previous frame (or the manually digitized point for the first inference frame) is used as the input. This approach reduces the input size for the inference process and shortens the required inference time.
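The sequential, moving-ROI inference can be sketched as follows. The helper name, the 5 × 5 smoothing kernel, and the omission of input normalization and frame-border handling are simplifications for illustration.

```python
import cv2
import numpy as np
import torch

def track(model, video_path, x0, y0, roi=64, ksize=5):
    """Sequential inference: crop each frame around the previous estimate
    (the manual point for the first frame), Gaussian-smooth the model's
    probability map, and take its peak as the new estimate."""
    cap = cv2.VideoCapture(video_path)
    x, y, half, points = int(x0), int(y0), roi // 2, []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        patch = frame[y - half:y + half, x - half:x + half]
        inp = torch.from_numpy(patch).permute(2, 0, 1).float()[None] / 255.0
        with torch.no_grad():
            pmap = model(inp)[0, 0].numpy()             # probability map
        pmap = cv2.GaussianBlur(pmap, (ksize, ksize), 0)
        dy, dx = np.unravel_index(np.argmax(pmap), pmap.shape)
        x, y = x - half + int(dx), y - half + int(dy)   # frame coordinates
        points.append((x, y))
    cap.release()
    return points
```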
3. Results
Some examples of key point tracking are presented for various sports movements in different environments.
Figure 3 illustrates the probability map generated during inference, visualized as a heatmap overlaid on the frame. The region of interest (ROI) is a square of 64 pixels on each side, marked by the blue rectangle. Manual digitization of the tennis racket tip was performed on four frames.
Figure 4 illustrates trajectories (five revolutions) and probability maps (last frame) for the ear, shoulder, hip, knee, ankle, and pedal key points during cycling, generated by the proposed framework using PyTorch 2.4.1 and OpenCV 4.2.0. The moving ROI (64 × 64 pixels) is marked by the blue rectangle; in the heatmaps, blue indicates the lowest probability. Manual digitization was performed on four frames. The ground truth Gaussian standard deviation for the probability map was set to 3, and the number of training epochs to 15. These figures show consistent tracking of both body parts and equipment throughout the motion.
Figure 5 shows the inferred key points (over 30 frames) obtained through the proposed method for a countermovement jump. In this experiment, manual digitization was performed on three frames for six key points on the subject’s right side: the inferior rib margin, greater trochanter, lateral condyle of the femur, lateral malleolus, heel, and toe. The ground truth Gaussian standard deviation for the probability map used during training was set to 3, and the number of epochs was set to 15. Tracking was performed at nearly identical locations; however, minor lateral fluctuations were observed for the inferior rib margin due to the lack of distinctive patterns, colors, or luminance variations in the surrounding area.
We performed five tracking trials on the dataset (6 markers and 741 video frames, 1920 × 1080 pixels) to assess computational efficiency. The tracking method was implemented on a system with an NVIDIA GeForce RTX 2070 Super GPU (8 GB memory), using TensorFlow 1.11.0 and OpenCV 4.2.0. With an ROI of 400 × 800 pixels and only three manually digitized frames, the model required an average training time of 28.63 ± 0.89 s. During inference using a moving-ROI strategy with a 64 × 64 pixel window for each key point, the system processed the video in 13.96 ± 0.24 s, achieving an average tracking speed of 53.08 frames per second across all six markers.
4. Discussion
In this study, we proposed an automatic digitization method that leverages DL and employs a minimal number of manually digitized frames as training data. Because the proposed method involves re-training for each target video, it enables flexible tracking that is adapted to a specific environment. This feature is expected to be highly useful in applications such as sports biomechanics, where extremely high precision is required.
Numerous high-accuracy, high-speed models for DL-based object recognition and pose estimation have been developed in recent years. While VGG16 remains widely used for its simple architecture and robust feature extraction, more advanced CNN models have emerged, such as ResNet [19] with skip connections, MobileNet [20] for lightweight efficiency, and EfficientNet [21] with compound scaling, offering potential improvements in accuracy and computational efficiency. More recently, transformer architectures such as the Vision Transformer (ViT) have been successfully adapted for computer vision, with models such as ViTPose [22] setting new benchmarks in human pose estimation. While this study primarily focuses on the framework itself rather than on evaluating the efficiency and accuracy of recently developed models, integrating such advanced architectures into our framework may further enhance its performance.
When applying these advanced models to practical problems, transfer learning is commonly used. In transfer learning, the outputs from intermediate layers of an object recognition model pre-trained on a large dataset (e.g., ImageNet) are leveraged, and additional training is performed for the new task. This enables high-accuracy predictions even with a small amount of training data. A related approach is DeepLabCut [23], published in 2018, a tracking technique specialized for animal and human pose estimation that typically uses approximately 200 manually digitized images as training data to achieve high-precision automatic tracking. DeepLabCut capitalizes on large pre-trained models while re-training to adapt to specific motions or individuals, thereby maintaining a consistent level of accuracy across different trials and environments. In contrast, the present method learns the specific key points of the body or equipment, which vary with the analysis objectives and video environment, by restricting the problem space and employing a minimal training dataset. This approach not only reduces training time and effort but also achieves high-precision tracking tailored to each environment.
Additionally, the ability to fine-tune the model specifically for a given motion, rather than being limited to general pre-trained models, is a distinct advantage of the proposed method. In contrast, marker-less tracking methods such as the Kanade–Lucas–Tomasi (KLT) algorithm, which do not rely on pre-trained models but instead use optical flow to track key points, have been employed to estimate object positions from video data (e.g., the barbell trajectory during an athlete’s snatch motion) [24]. While the KLT algorithm has demonstrated high accuracy, it has limitations related to object appearance, as tracking accuracy can vary with changes in the color or design of the tracked object. Similarly, our method faces challenges with visually indistinct areas (e.g., the rib in Figure 5), which showed tracking fluctuations. For such regions, enhancing visual features through modified clothing or markers may improve tracking performance.
Regarding the potential impact of varying video quality or resolution, although these factors certainly influence image recognition performance, they typically pose minimal problems in biomechanical research, where video quality and resolution remain consistent within a single study. The framework’s adaptability to different subjects and body types highlights the strength of our few-shot learning approach. The dependence on manual annotations in the initial stage serves a dual purpose: providing training data while also defining the ROI constraints, which represents an innovative approach to the tracking problem.
The current method has three primary limitations. First, because inference is performed sequentially in chronological order, if a key point is temporarily occluded in the video, the system may erroneously digitize an incorrect location. To address this, either modifying the sequential ROI approach or implementing correction mechanisms such as temporal smoothing or Kalman filtering would improve tracking reliability while still maintaining faster processing than extensive manual digitization. Second, the accuracy of the automatic digitization depends on the nature of the motion, the environment in which it is captured, and the characteristics of the key point and its surrounding features. Thus, it is necessary to test under similar conditions in advance to verify the inference results. Although the current study focuses on presenting a framework for efficient key point tracking with minimal training data, it does not include validation on additional datasets or different movement tasks. This limitation may affect the generalizability of the proposed method. Future work should evaluate its performance across a wider range of sports and motion types to further assess robustness. Third, manual digitization of a few frames is required. When the training data are insufficient, larger errors may occur, requiring the analyst to visually inspect the automatic digitization results and, if necessary, increase the number of manually digitized frames. Although complete automation is not achieved, compared with conventional methods that require manual digitization of every frame, the proposed method offers a significant reduction in effort, which should benefit sports biomechanics research. In cases where obvious errors are detected, adjustments, such as modifying the arbitrarily set standard deviation or the number of epochs, or increasing the number of manually digitized frames followed by re-training and inference, may be implemented. Further verification of the method’s accuracy under various conditions is warranted.
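As one example of the correction mechanisms mentioned under the first limitation, a minimal post hoc check could flag frames where the tracked point jumps implausibly far between consecutive frames, prompting the analyst to re-digitize and re-train; the function and threshold below are hypothetical, not part of the proposed method.

```python
import numpy as np

def flag_jumps(points, max_step=10.0):
    """Return indices of frames whose tracked point moved farther than
    max_step pixels from the previous frame (a possible occlusion)."""
    pts = np.asarray(points, dtype=float)
    steps = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    return np.where(steps > max_step)[0] + 1
```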