Technical Note

A Proposed Method for Deep Learning-Based Automatic Tracking with Minimal Training Data for Sports Biomechanics Research

Japan Institute of Sports Sciences, 3-15-1, Nishigaoka, Kita-ku, Tokyo 115-0056, Japan
* Author to whom correspondence should be addressed.
Biomechanics 2025, 5(2), 25; https://doi.org/10.3390/biomechanics5020025
Submission received: 26 February 2025 / Revised: 10 April 2025 / Accepted: 11 April 2025 / Published: 13 April 2025
(This article belongs to the Special Issue Biomechanics in Sport and Ageing: Artificial Intelligence)

Abstract

Background: This technical note proposes a deep learning-based, few-shot automatic key point tracking technique tailored to sports biomechanics research. Methods: The present method facilitates the arbitrary definition of key points on athletes’ bodies or sports equipment. Initially, a limited number of video frames are manually digitized to mark the points of interest. These annotated frames are subsequently used to train a deep learning model that leverages a pre-trained VGG16 network as its backbone and incorporates an additional convolutional head. Feature maps extracted from three intermediate layers of VGG16 are processed by the head network to generate a probability map, highlighting the most likely locations of the key points. Transfer learning is implemented by freezing the backbone weights and training only the head network. By restricting the training data generation to regions surrounding the manually annotated points and training specifically for each video, this approach minimizes training time while maintaining high precision. Conclusions: This technique substantially reduces the time and effort required compared to frame-by-frame manual digitization in various sports settings, and enables customized training tailored to specific analytical needs and video environments.

1. Introduction

In the field of sports biomechanics, various video analysis methods are widely used to analyze the movements of athletes and sports equipment [1]. In the digitization process, i.e., obtaining the pixel coordinates of specific key points on the body or equipment shown in the video, reflective markers are often affixed to facilitate key point identification. Motion capture systems can then be used to label the reflective markers detected in three-dimensional (3D) space using infrared cameras [1]. However, in performance measurements conducted during competitions or under experimentally simulated conditions, it is often impractical to affix reflective markers to athletes. Consequently, several studies have relied on manual annotation of high-speed video recordings [2,3,4]. Manual digitization, however, is time-consuming and labor-intensive, which poses a considerable challenge when large numbers of video frames must be analyzed [1].
Recently, deep learning (DL) techniques have been increasingly applied to image-based motion analysis. DL refers to a method in which a model (a function characterized by a very large number of parameters, possibly millions or more) is constructed and then optimized using a dataset comprising paired input and ground truth data. This process yields a trained model that estimates key point locations in unseen images based on patterns learned from the ground truth data. Obtaining outputs from the trained model for new inputs is referred to as "inference". Compared with earlier image processing techniques, DL has demonstrated higher accuracy in tasks such as object recognition and object detection, and it has been applied to a wide range of problems. An early example in the sports domain is DeepPose [5], published in 2014, in which DL was applied to estimate the coordinates of key points on various body parts (i.e., pose estimation) from images. Numerous studies on pose estimation using DL have since been conducted, achieving real-time performance and simultaneous estimation of multiple individuals [6]. However, DL models typically require large datasets to achieve high generalization and to reduce the risk of overfitting. This is especially true when models must capture complex and diverse movement patterns: with small datasets, the model may memorize the training data rather than learning general patterns, resulting in poor performance on new, unseen data. Additionally, limited datasets may exhibit high variance, making the model sensitive to slight changes in input data. In sports biomechanics, acquiring large amounts of labeled data is often challenging due to the complexity of data collection and annotation. Therefore, automated key point tracking methods that minimize manual effort are needed to address the challenges posed by conventional approaches.
Although pose estimation via DL can be expected to facilitate the digitization process in sports biomechanics, several challenges remain. The first challenge is the accuracy of pose estimation. In a study by Nakano et al. [7], measurements of major key points during walking, jumping, and throwing motions were compared between OpenPose [6], a system for automatic key point detection, and a 3D motion capture system. Discrepancies exceeding 40 mm were observed in approximately 10% of the measurement data and, in some trials, different body parts were identified as key points. Similarly, Fukushima et al. [8] reported that DL-based human pose estimation yielded mean absolute errors of 9–10 degrees for sports movements compared with a marker-based motion capture system. Therefore, DL-based pose estimation requires further improvement where precise measurements are needed.
The second challenge concerns the dataset used for model training. Conventional DL pose estimation models (e.g., OpenPose [6] and MoveNet [9]) are typically trained on large, publicly available datasets that include major body key points such as the shoulder, elbow, and knee joints. However, the definition of these key points is generally not explicitly stated, leaving it unclear whether the points are identified on the basis of anatomical criteria. As a result, although many recent studies have addressed motion pattern recognition [10] and event detection [11] using such models, few have investigated the temporal changes in specific motions or conducted comparisons across competitive levels, as is common in sports biomechanics. Moreover, while recent marker-less motion capture systems have achieved higher accuracy, systematic errors have been reported because the built-in models differ from the validated marker-based models commonly used in sports biomechanics [12]. The set of key points may also vary among models [13]. Finally, to estimate the coordinates of body parts or equipment-specific locations for which no existing datasets are available, a new dataset must be prepared.
Therefore, sports biomechanics research requires a system that allows key points to be arbitrarily yet accurately defined. In this technical note, we propose an automatic digitization technique using DL designed to operate effectively with minimal training data (few-shot learning). Our approach leverages transfer learning from a pre-trained model but focuses training on a small, video-specific dataset of only a few manually annotated frames. This strategy reduces the problem space, enabling rapid and targeted training around specific key points and the visual environment of the analysis video.
The remainder of this paper is structured as follows: Section 2 describes the proposed method, including the model architecture, training procedure, and inference process. Section 3 presents some tracking examples and computational performance metrics. Section 4 discusses the method’s context, advantages, and limitations. Finally, Section 5 concludes the technical note.

2. Materials and Methods

2.1. Overview of the Proposed Method

The proposed method comprises three main phases: preprocessing, model training, and model inference (Figure 1). In the preprocessing phase, manual digitization of the points of interest is first performed on a limited number of frames (typically three to four) extracted from the video under analysis. Next, the manually digitized frames are used to construct training data, including ground truth probability maps. The DL model is trained using these data. Finally, the trained model is employed to automatically digitize all frames in the video. Manual digitization provides precise ground truth for the exact points the researcher needs and allows complete flexibility in defining key points (anatomical landmarks, equipment points, etc.).
In the proposed method, the training data are constructed by extracting regions of interest (ROIs) centered around manually digitized points. This approach limits the problem space, allowing the model to focus on the most relevant areas of the image, thereby enhancing tracking efficiency even with a small dataset. During inference, the method also leverages these targeted ROIs to reduce the computational resources required while maintaining high accuracy in key point tracking.

2.2. Structure of the Deep Learning Model

The deep learning model has a two-stage structure consisting of a backbone network and a head network. A VGG16 model [14] serves as the backbone of the convolutional neural network (CNN), and additional convolutional layers are appended to construct the head network for image recognition tasks (see Figure 2).
The backbone extracts image features from three intermediate layers: the 2nd convolutional layer (Conv1_2, just before the first max-pooling operation, with the same spatial dimensions as the input), the 4th layer (Conv2_2, just before the second max-pooling, at half the input resolution), and the 7th layer (Conv3_3, just before the third max-pooling, at one-quarter the input resolution). These three feature maps are passed to the head network.
In the head network, each intermediate feature map is first processed by a convolutional layer (Conv2D, 5 × 5 kernel, 16 channels) with ReLU activation and then up-sampled back to the original input resolution. The three resulting feature tensors are concatenated into a single tensor, and a final convolutional operation produces a single-channel probability map. During inference, Gaussian smoothing is applied to the probability map to identify the coordinate with the highest probability.
The VGG16 model used as the backbone is one of the foundational DL-based image recognition models. Weights pre-trained on the ImageNet dataset [15], which contains over 14 million images for image recognition tasks, are publicly available in DL frameworks such as TensorFlow [16] and PyTorch [17]. Because of VGG16's straightforward network structure, in which the spatial resolution of the feature maps is progressively reduced by convolution and pooling layers, extracting features from intermediate layers is relatively simple.
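To make the architecture concrete, the following is a minimal PyTorch sketch of the two-stage model. The choice of intermediate layers, the 5 × 5 kernels with 16 channels, the ReLU activations, the up-sampling and concatenation, and the single-channel output follow the description above; the bilinear up-sampling mode, same-padding, and the final 1 × 1 convolution are our assumptions, as these details are not specified in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class TrackingModel(nn.Module):
    """Two-stage model: frozen VGG16 backbone + trainable convolutional head."""

    def __init__(self):
        super().__init__()
        feats = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features
        # Backbone stages ending at the intermediate layers described above:
        # Conv1_2 (full resolution), Conv2_2 (1/2), Conv3_3 (1/4).
        self.stage1 = feats[:4]    # conv1_1 .. conv1_2 + ReLU -> 64 channels
        self.stage2 = feats[4:9]   # pool1 .. conv2_2 + ReLU   -> 128 channels
        self.stage3 = feats[9:16]  # pool2 .. conv3_3 + ReLU   -> 256 channels
        for p in self.parameters():   # freeze the backbone weights;
            p.requires_grad = False   # only the head (below) is trained

        def branch(in_ch):
            # One conv + ReLU branch per intermediate feature map.
            return nn.Sequential(nn.Conv2d(in_ch, 16, kernel_size=5, padding=2),
                                 nn.ReLU())

        self.head1, self.head2, self.head3 = branch(64), branch(128), branch(256)
        self.out_conv = nn.Conv2d(48, 1, kernel_size=1)  # single-channel map

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        size = x.shape[-2:]
        up = lambda t: F.interpolate(t, size=size, mode="bilinear",
                                     align_corners=False)
        # Process each feature map, up-sample to input resolution, concatenate.
        h = torch.cat([self.head1(f1), up(self.head2(f2)), up(self.head3(f3))],
                      dim=1)
        return self.out_conv(h)  # predicted probability map (one channel)
```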

2.3. Model Training

The training dataset consisted of pairs comprising manually digitized frames and probability maps generated by drawing a two-dimensional Gaussian distribution centered at the manually digitized coordinates. To reduce training time, the input region of the images used as training data was limited to the vicinity of the digitized location, thereby reducing the input image size. In addition, the variance of the Gaussian distribution used to generate the probability map was set according to the analysis target and objectives. That is, when fine-scale analysis was required (e.g., tracking the tip of equipment), a Gaussian distribution with a small standard deviation was used, whereas for broader targets such as the entire body, a distribution with a larger standard deviation was applied.
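As an illustration, the following sketch shows how a ground truth probability map and a training ROI might be constructed from one manually digitized point. The 2D Gaussian centered at the digitized coordinates follows the description above; the function names, the ROI clamping strategy, and the default sizes are illustrative assumptions.

```python
import numpy as np

def gaussian_prob_map(h, w, cx, cy, sigma=3.0):
    """Ground truth map: a 2D Gaussian centered at the digitized point (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def crop_training_roi(image, cx, cy, size=64):
    """Crop a square region around the digitized point, clamped to the frame."""
    half = size // 2
    x0 = int(np.clip(cx - half, 0, image.shape[1] - size))
    y0 = int(np.clip(cy - half, 0, image.shape[0] - size))
    return image[y0:y0 + size, x0:x0 + size], (x0, y0)
```

The sigma argument corresponds to the standard deviation discussed above: a small value for fine-scale targets such as an equipment tip, and a larger value for broader targets such as the whole body.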
Subsequently, the model was trained specifically for the target video using the constructed training dataset. During training, the weights of VGG16 were fixed, and only the additional head network was trained. At each training iteration, the probability maps predicted by the model were compared with the ground truth probability maps to compute the loss; backpropagation was then performed to minimize the loss, and the head network weights were updated using the Adam optimizer [18].
Unlike conventional DL approaches, in which training is run until the loss function converges (as in OpenPose [6]), this training procedure is not run to convergence; the function value increases indefinitely with the number of epochs. Consequently, the Gaussian standard deviation and the number of epochs must be set by the analyst. If the tracking performance in a given trial is unsatisfactory, improvements may be achieved by increasing the number of manually digitized frames, among other adjustments.
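A minimal sketch of this per-video training loop is given below. Freezing the backbone and updating only the head with Adam follows the procedure described above; the mean-squared-error loss between predicted and ground truth probability maps and the learning rate are our assumptions, as the specific loss function is not named in this note.

```python
import torch
import torch.nn.functional as F

def train_for_video(model, rois, gt_maps, epochs=15, lr=1e-3):
    """Per-video training: only parameters with requires_grad=True (the head)."""
    head_params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(head_params, lr=lr)
    model.train()
    for _ in range(epochs):
        # rois and gt_maps are tensors of shape (1, 3, H, W) and (1, 1, H, W),
        # one pair per manually digitized frame (typically three to four).
        for roi, gt in zip(rois, gt_maps):
            pred = model(roi)              # predicted probability map
            loss = F.mse_loss(pred, gt)    # assumed loss (not specified in text)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```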

2.4. Model Inference

After model training is complete, all frames of the video are input into the model to perform automatic digitization. The DL model outputs a probability map indicating the location of the points of interest. During inference, Gaussian smoothing is applied to this probability map to extract the coordinate with the highest probability, which is then adopted as the digitized point. Inference is conducted sequentially for each frame in chronological order. For each frame, only a small image region centered on the digitized point from the previous frame (or the manually digitized point for the first inference frame) is used as the input. This approach reduces the input size for the inference process and shortens the required inference time.
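The following sketch illustrates the sequential inference loop with a moving ROI. The Gaussian smoothing of the probability map and the selection of the highest-probability coordinate follow the description above; the smoothing sigma, the input normalization, and the function names are illustrative assumptions.

```python
import numpy as np
import torch
from scipy.ndimage import gaussian_filter

def track_video(model, frames, start_point, roi_size=64, smooth_sigma=1.0):
    """Sequential inference with a moving ROI centered on the previous estimate."""
    cx, cy = start_point              # manually digitized point on the first frame
    half = roi_size // 2
    points = []
    model.eval()
    with torch.no_grad():
        for frame in frames:          # frames as (H, W, 3) uint8 arrays
            # Crop the ROI around the previous estimate, clamped to the frame.
            x0 = int(np.clip(cx - half, 0, frame.shape[1] - roi_size))
            y0 = int(np.clip(cy - half, 0, frame.shape[0] - roi_size))
            roi = np.ascontiguousarray(frame[y0:y0 + roi_size, x0:x0 + roi_size])
            inp = torch.from_numpy(roi).permute(2, 0, 1).float().unsqueeze(0) / 255.0
            prob = model(inp)[0, 0].numpy()                   # probability map
            prob = gaussian_filter(prob, sigma=smooth_sigma)  # smooth before argmax
            dy, dx = np.unravel_index(np.argmax(prob), prob.shape)
            cx, cy = x0 + dx, y0 + dy  # convert back to full-frame coordinates
            points.append((cx, cy))
    return points
```

A call such as track_video(model, frames, start_point=(cx0, cy0)) would then return the digitized trajectory for one key point in full-frame pixel coordinates.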

3. Results

Some examples of key point tracking are presented for various sports movements in different environments. Figure 3 shows the probability map generated during inference, overlaid on the video frame as a heatmap. The region of interest (ROI) is a square of 64 pixels on each side, marked by the blue rectangle. Manual digitization of the tennis racket tip was performed on four frames. Figure 4 illustrates the trajectories (five revolutions) and probability maps (last frame) for the ear, shoulder, hip, knee, ankle, and pedal key points during cycling, generated by the proposed framework using PyTorch 2.4.1 and OpenCV 4.2.0. The moving ROI (64 × 64 pixels) is marked by the blue rectangle (lowest probability). Manual digitization was performed on four frames. The ground truth Gaussian standard deviation for the probability map was set to 3, with 15 training epochs. These figures show consistent tracking of both body parts and equipment throughout the motion.
Figure 5 shows the inferred key points (over 30 frames) obtained with the proposed method for a countermovement jump. In this experiment, manual digitization was performed on three frames for six key points on the subject's right side: the inferior rib margin, greater trochanter, lateral condyle of the femur, lateral malleolus, heel, and toe. The ground truth Gaussian standard deviation for the probability map was set to 3, and the number of epochs was set to 15. Tracking remained consistent across frames; however, minor lateral fluctuations were observed for the inferior rib margin because the surrounding area lacked distinctive patterns, colors, or luminance variations.
We performed five tracking trials on the dataset (six markers, 741 video frames, 1920 × 1080 pixels) to evaluate computational efficiency. The tracking method was implemented on a system with an NVIDIA GeForce RTX 2070 Super GPU (8 GB memory), using TensorFlow 1.11.0 and OpenCV 4.2.0. With an ROI of 400 × 800 pixels and only three manually digitized frames, the model required an average training time of 28.63 ± 0.89 s. During inference with a moving 64 × 64 pixel ROI for each key point, the system processed the video in 13.96 ± 0.24 s, corresponding to an average tracking speed of 53.08 frames per second for all six markers.

4. Discussion

In this study, we proposed an automatic digitization method that leverages DL and employs a minimal number of manually digitized frames as training data. Because the proposed method involves re-training for each target video, it enables flexible tracking that is adapted to a specific environment. This feature is expected to be highly useful in applications such as sports biomechanics, where extremely high precision is required.
Numerous high-accuracy, high-speed models for DL-based object recognition and pose estimation have been developed in recent years. While VGG16 remains widely used for its simple architecture and robust feature extraction, more advanced CNN models have emerged, such as ResNet [19] with skip connections, MobileNet [20] for lightweight efficiency, and EfficientNet [21] with compound scaling, offering potential improvements in accuracy and computational efficiency. More recently, transformer architectures such as the Vision Transformer (ViT) have been successfully adapted for computer vision, with models such as ViTPose [22] setting new benchmarks in human pose estimation. While this study focuses primarily on the framework itself rather than on evaluating the efficiency and accuracy of recently developed models, integrating such advanced architectures into our framework may further enhance its performance.
When applying these advanced models to practical problems, transfer learning is commonly used. In transfer learning, the outputs from intermediate layers of an object recognition model pre-trained on a large dataset (e.g., ImageNet) are leveraged, and additional training is performed for the new task. This enables high-accuracy predictions even with a small amount of training data. A related approach is DeepLabCut [23], published in 2018: a tracking technique specialized for animal and human pose estimation that typically uses approximately 200 manually digitized images as training data to achieve high-precision automatic tracking. DeepLabCut capitalizes on large pre-trained models while re-training to adapt to specific motions or individuals, thereby maintaining a consistent level of accuracy across different trials and environments. In contrast, the present method learns the specific key points of the body or equipment, which vary with the analysis objectives and video environment, by restricting the problem space and employing a minimal training dataset. This approach not only reduces training time and effort but also achieves high-precision tracking tailored to each environment.
Additionally, the ability to fine-tune the model specifically for a given motion, rather than being limited to general pre-trained models, is a distinct advantage of the proposed method. In contrast, marker-less tracking methods such as the Kanade–Lucas–Tomasi (KLT) algorithm, which do not rely on pre-trained models but instead use optical flow to track key points, have been employed to estimate object positions from video data (e.g., the barbell trajectory during an athlete’s snatch motion) [24]. While the KLT algorithm has demonstrated high accuracy, it has limitations related to object appearance, as tracking accuracy can vary depending on changes in the color or design of the tracked object. Similarly, our method faces challenges with visually indistinct areas (e.g., the rib in Figure 5), which showed tracking fluctuations. For such regions, enhancing visual features through modified clothing or markers may improve tracking performance.
Regarding the potential impact of varying video quality or resolution, although these factors certainly influence image recognition performance, they typically pose minimal problems in biomechanical research, where video quality and resolution remain consistent within a single study. The framework's adaptability to different subjects and body types highlights the strength of our few-shot learning approach. The dependence on manual annotations in the initial stage serves a dual purpose: it provides training data while also helping to define the ROI constraints, which represents an innovative approach to the tracking problem.
The current method has three primary limitations. First, because inference is performed sequentially in chronological order, if a key point is temporarily occluded in the video, the system may erroneously digitize an incorrect location. To address this, either modifying the sequential ROI approach or implementing correction mechanisms such as temporal smoothing or Kalman filtering (a simple post hoc correction is sketched after this paragraph) would improve tracking reliability while still maintaining faster processing than extensive manual digitization. Second, the accuracy of the automatic digitization depends on the nature of the motion, the environment in which it is captured, and the characteristics of the key point and its surrounding features. It is therefore necessary to test under similar conditions in advance to verify the inference results. Although the current study focuses on presenting a framework for efficient key point tracking with minimal training data, it does not include validation on additional datasets or different movement tasks. This limitation may affect the generalizability of the proposed method, and future work should evaluate its performance across a wider range of sports and motion types to further assess robustness. Third, manual digitization of a few frames is still required. When the training data are insufficient, larger errors may occur, requiring the analyst to visually inspect the automatic digitization results and, if necessary, increase the number of manually digitized frames. Although complete automation is not achieved, the proposed method offers a significant reduction in effort compared with conventional methods that require the manual digitization of every frame, which should benefit sports biomechanics research. Where obvious errors are detected, adjustments such as modifying the arbitrarily set standard deviation or the number of epochs, or increasing the number of manually digitized frames followed by re-training and inference, may be implemented. Further verification of the method's accuracy under various conditions is warranted.
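As an example of the post hoc correction mentioned above, the following sketch flags and replaces estimates that jump far from a median-filtered trajectory. It is a simple stand-in for Kalman-style filtering; the kernel size and jump threshold are arbitrary illustrative values, not part of the proposed method.

```python
import numpy as np
from scipy.signal import medfilt

def reject_jumps(points, kernel=5, max_jump=20.0):
    """Replace estimates that jump far from a median-filtered trajectory."""
    pts = np.asarray(points, dtype=float)        # shape (n_frames, 2)
    ref = np.stack([medfilt(pts[:, 0], kernel),  # median-filtered x and y
                    medfilt(pts[:, 1], kernel)], axis=1)
    bad = np.linalg.norm(pts - ref, axis=1) > max_jump
    pts[bad] = ref[bad]                          # fall back to the filtered value
    return pts, bad                              # corrected points, flags to inspect
```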

5. Conclusions

This technical note proposed an automatic digitization method based on DL that utilizes a small number (three to four) of manually annotated frames for training. By leveraging transfer learning and tailoring the model to each analysis video, the method is anticipated to be applicable to image analysis in sports biomechanics, a field where extremely high precision is required.

Author Contributions

Conceptualization, M.M. and T.M.; methodology, M.M. and T.M.; software, M.M.; validation, D.Y., M.M. and T.M.; formal analysis, D.Y. and M.M.; investigation, D.Y. and M.M.; resources, D.Y. and M.M.; data curation, M.M.; writing—original draft preparation, D.Y., M.M. and T.M.; writing—review and editing, D.Y., M.M. and T.M.; visualization, D.Y. and M.M.; supervision, T.M.; project administration, D.Y., M.M. and T.M. All authors have read and agreed to the published version of the manuscript.

Funding

This study was conducted as part of the Sports Medicine and Science Research Programs of the Japan Institute of Sports Sciences.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author due to institutional policies restricting the public distribution of code and software used in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yeadon, M.R.; Challis, J.H. The future of performance-related sports biomechanics research. J. Sports Sci. 1994, 12, 3–32.
  2. Nagao, H.; Huang, Z.; Kubo, Y. Biomechanical comparison of successful snatch and unsuccessful frontward barbell drop in world-class male weightlifters. Sports Biomech. 2023, 22, 1120–1135.
  3. Nicholson, G.; Epro, G.; Merlino, S.; Walker, J.; Bissas, A. Differences in run-up, take-off, and flight characteristics: Successful vs. unsuccessful high jump attempts at the IAAF world championships. Front. Sports Act. Living 2024, 6, 1352725.
  4. Hanley, B.; Bissas, A.; Merlino, S.; Burns, G.T. Changes in running biomechanics during the 2017 IAAF world championships men's 1500 m final. Scand. J. Med. Sci. Sports 2023, 33, 931–942.
  5. Toshev, A.; Szegedy, C. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660.
  6. Cao, Z.; Simon, T.; Wei, S.-E.; Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299.
  7. Nakano, N.; Sakura, T.; Ueda, K.; Omura, L.; Kimura, A.; Iino, Y.; Fukashiro, S.; Yoshioka, S. Evaluation of 3D markerless motion capture accuracy using OpenPose with multiple video cameras. Front. Sports Act. Living 2020, 2, 50.
  8. Fukushima, T.; Blauberger, P.; Guedes Russomanno, T.; Lames, M. The potential of human pose estimation for motion capture in sports: A validation study. Sports Eng. 2024, 27, 19.
  9. Bajpai, R.; Joshi, D. MoveNet: A deep neural network for joint profile prediction across variable walking speeds and slopes. IEEE Trans. Instrum. Meas. 2021, 70, 1–11.
  10. Verma, A.; Suman, A.; Biradar, V.G.; Brunda, S. Human activity classification using deep convolutional neural network. In Recent Advances in Artificial Intelligence and Data Engineering: Select Proceedings of AIDE 2020; Springer Nature: Singapore, 2022; pp. 41–50.
  11. Theagarajan, R.; Bhanu, B. An automated system for generating tactical performance statistics for individual soccer players from videos. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 632–646.
  12. Kanko, R.M.; Laende, E.K.; Davis, E.M.; Selbie, W.S.; Deluzio, K.J. Concurrent assessment of gait kinematics using marker-based and markerless motion capture. J. Biomech. 2021, 127, 110665.
  13. Washabaugh, E.P.; Shanmugam, T.A.; Ranganathan, R.; Krishnan, C. Comparing the accuracy of open-source pose estimation methods for measuring gait kinematics. Gait Posture 2022, 97, 188–195.
  14. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  15. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
  16. Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), Savannah, GA, USA, 2–4 November 2016; pp. 265–283.
  17. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L. PyTorch: An imperative style, high-performance deep learning library. arXiv 2019, arXiv:1912.01703.
  18. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
  19. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  20. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
  21. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
  22. Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. ViTPose: Simple vision transformer baselines for human pose estimation. Adv. Neural Inf. Process. Syst. 2022, 35, 38571–38584.
  23. Mathis, A.; Mamidanna, P.; Cury, K.M.; Abe, T.; Murthy, V.N.; Mathis, M.W.; Bethge, M. DeepLabCut: Markerless pose estimation of user-defined body parts with deep learning. Nat. Neurosci. 2018, 21, 1281–1289.
  24. Nagao, H.; Yamashita, D. Validation of video analysis of marker-less barbell auto-tracking in weightlifting. PLoS ONE 2022, 17, e0263224.
Figure 1. Workflow of the proposed method. GT: ground truth. Prob. Maps: probability maps.
Figure 2. Model overview (tracking the pelvic center of a tennis player performing a forehand stroke). The model outputs predicted probability maps during both the training and inference phases. During inference, the predicted probability maps are further processed outside the model through Gaussian filtering and Argmax to obtain the final key point coordinates. ReLU: rectified linear unit activation function. Conv2D: two-dimensional convolutional layer. Argmax: argument of the maximum. Prob. Maps: probability maps.
Figure 3. Probability map generated by the proposed framework during inference using TensorFlow 1.11.0 and OpenCV 4.2.0. The heatmap overlay shows the probability distribution for the tennis racket tip. The moving region of interest (ROI) (64 × 64 pixels) is marked by the blue rectangle (lowest probability). The area with the highest probability is displayed as bright areas. Manual digitization was performed on four frames. The standard deviation for the probability map was set to 3, with 15 training epochs.
Figure 4. Trajectories (5 revolutions) and probability maps (last frame) generated by the proposed framework using PyTorch 2.4.1 and OpenCV 4.2.0. The heatmap overlay shows the probability distribution for the ear, shoulder, hip, knee, ankle, and pedal key points during cycling. The trajectories are indicated by dotted lines: magenta lines represent all key points in the overview (left panel), while yellow lines depict individual key point paths in the magnified views (center and right panels). The moving region of interest (ROI) (64 × 64 pixels) is marked by the blue rectangle (lowest probability). The area with the highest probability is displayed as bright areas. Manual digitization was performed on four frames. The standard deviation for the probability map was set to 3, with 15 training epochs.
Figure 5. Example of automatic digitization (30 frames) for a countermovement jump using TensorFlow (version 1.11.0) and OpenCV (version 4.1). Six key points—rib lower end (blue), greater trochanter (light green), lateral epicondyle of the femur (red), lateral malleolus (cyan), heel (purple), and toe (yellow)—were manually digitized in three frames. Automatic digitization was then applied to the entire video (120 fps), starting from the first manually digitized frame. The standard deviation for the probability map was set to 3, with 15 training epochs.
