Article

A Comparison of Deep Learning Techniques for Pose Recognition in Up-and-Go Pole Walking Exercises Using Skeleton Images and Feature Data

Wan-Chih Lin, Yu-Chen Tu, Hong-Yi Lin and Ming-Hseng Tseng
1 Master Program in Medical Informatics, Chung Shan Medical University, Taichung 40201, Taiwan
2 Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei 11221, Taiwan
3 Department of Physical Medicine and Rehabilitation, Chung Shan Medical University Hospital, Taichung 40201, Taiwan
4 Department of Physical Medicine and Rehabilitation, School of Medicine, Chung Shan Medical University, Taichung 40201, Taiwan
5 Information Technology Office, Chung Shan Medical University Hospital, Taichung 40201, Taiwan
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2025, 14(6), 1075; https://doi.org/10.3390/electronics14061075
Submission received: 30 January 2025 / Revised: 28 February 2025 / Accepted: 5 March 2025 / Published: 7 March 2025
(This article belongs to the Special Issue Advances in Information, Intelligence, Systems and Applications)

Abstract:
This study evaluates the performance of seven deep learning methods for recognizing motion patterns in Up-and-Go pole walking exercises, aiming to improve rehabilitation technologies for the elderly population. As populations age, improving the accuracy of elderly people's movement posture becomes crucial to obtaining better rehabilitation outcomes. Up-and-Go pole walking exercises offer significant health benefits, but attaining the correct pose in motion is essential for achieving them. The dataset includes skeleton images generated by OpenPose 1.7.0 and 2D and 3D skeleton images extracted through MediaPipe 0.10.21. Two sets of feature data were developed for model evaluation: one comprising 12 features representing the key coordinates of the hands and feet, and another consisting of 30 features derived from subdivided full-body skeletons. The study compares the accuracy and performance of each method, examining the impact of different feature combinations and data representations on motion pattern recognition. The experimental results indicate that the Swin model based on MediaPipe 2D skeleton images achieved the highest accuracy (99.7%), demonstrating superior performance in recognizing motion patterns of Up-and-Go pole walking exercises. The study summarizes the advantages and limitations of each approach, highlighting the contributions of different features and data representations to recognition outcomes. This research provides scientific evidence for advancing elderly rehabilitation technologies through accurate pose recognition.

1. Introduction

Since 2018, Taiwan has officially been classified as an aged society, and health problems among the elderly population have become a focal point of social concern. Some elderly people with limited mobility may develop sarcopenia due to prolonged wheelchair use, leading to an inability to stand and, in severe cases, to long-term bedridden conditions that significantly impact their quality of life [1]. Many medical experts have pointed out that exercise is the most effective treatment for sarcopenia [2,3]. However, requiring hospital visits for rehabilitation can be a significant burden for patients and their families. In this context, trekking poles provide a relatively accessible and convenient tool for rehabilitation.
With the global aging population on the rise, the importance of elderly rehabilitation training has become increasingly prominent, particularly in interventions aiming to improve mobility, prevent falls, and enhance quality of life. Pose recognition technology, as a potentially powerful tool, has been widely applied in various rehabilitation exercises, especially in the field of elderly rehabilitation [4,5,6]. However, despite some initial achievements in pose recognition, several research challenges and gaps remain. First, many existing pose recognition systems focus primarily on posture analysis for healthy adults or athletes. For elderly people, who face unique physiological changes and limitations in physical abilities, the accuracy and stability of pose recognition still pose significant challenges [7]. Furthermore, most existing studies overlook personalized needs in elderly rehabilitation exercises, particularly the tailoring of training plans to individual physical conditions, which remains an unresolved issue. In addition, pose recognition technologies have not yet adequately addressed subtle changes during exercise, such as stability and balance monitoring. This study focuses on the "Up-and-Go Pole Walking" exercise for elderly rehabilitation, aiming to use deep learning techniques for precise pose recognition and to help adjust training plans through posture analysis, ultimately improving rehabilitation outcomes. Compared with previous studies, this study comprehensively evaluates the performance and limitations of various deep learning models in pose recognition based on skeleton image sets and pose feature sets of rehabilitation exercises for elderly people; this has important academic value and practical potential.
Dr. Kuo established the "Up-and-Go Pole Walking School" to help family members regain the ability to stand independently from a wheelchair. The school offers functional exercise training programs specifically designed for elderly people. This kind of exercise program, derived from Nordic walking, has beneficial effects on many biomechanical parameters, such as increasing walking distance, walking speed, and stride length [8]. It is also an effective low-threshold cardiac rehabilitation exercise [9]. Preventing falls is important for elderly people; the literature has shown that Nordic walking can improve dynamic balance in older adults, making it a safe and accessible form of aerobic exercise [10]. Using two trekking poles as supports, participants perform various movements in 52 exercises, including seated preparatory movements, sit-to-stand techniques, weight shifting, lower body strength squats, flexibility, core stability, balance, obstacle crossing, and physical fitness assessments. Dr. Kuo established the official website for the Up-and-Go pole walking exercise [11], which offers comprehensive descriptions of the rehabilitation movements and postures involved. This source serves as a primary reference for understanding the structured progression of exercises designed to assist frail older adults in transitioning from sitting to standing and ultimately achieving stable walking, thus reducing the need for wheelchairs.
Pose recognition is a method based on computer vision technology that treats the human pose as a complex structure composed of multiple key joints (e.g., elbows, knees). The connections and positional relationships between these joints provide rich information about body position, orientation, and movement. By analyzing images or videos, pose recognition technology accurately identifies body poses, joint positions, and actions, offering precise descriptions of various human poses.
Machine learning (ML) [12], as the core technology of artificial intelligence (AI), focuses on the development of algorithms capable of automatically learning from data to make predictions or decisions. ML rests on foundational paradigms such as supervised learning, unsupervised learning, and reinforcement learning. Supervised learning trains models with labelled datasets for classification or regression tasks, while unsupervised learning discovers hidden structures or patterns in unlabeled data. Reinforcement learning optimizes strategies based on environmental feedback. With advancements in data and computational power, ML methods have evolved into deep learning (DL), which emphasizes multilayer neural network structures and provides robust solutions to high-dimensional data challenges.
Convolutional neural networks (CNNs), representative structures in deep learning, have achieved remarkable success in computer vision. CNN architectures typically consist of convolutional layers, pooling layers, and fully connected layers. Convolutional layers extract local features, pooling layers reduce dimensionality, and fully connected layers handle classification or regression tasks. The shared weight mechanism in CNNs ensures high computational efficiency when processing features at different locations, making CNNs the preferred algorithm for tasks such as image classification, object detection, and pose recognition [13,14].
Deep learning has become central to advancements in pose recognition, particularly in applications related to rehabilitation exercises. CNNs are widely used for extracting hierarchical features, making them effective in motion recognition tasks. Although early models such as AlexNet [14] improved image classification, their applicability to pose recognition remained limited due to the lack of specialized feature extraction for skeletal structures. VGGNet [15] and ResNet [16] introduced deeper architectures, making them more suitable for recognizing complex motion sequences in rehabilitation exercises, but challenges remain in adapting them to real-time elderly rehabilitation applications.
Fully connected neural networks (FCNs) [17] were among the earliest neural network models, featuring an input layer, several hidden layers, and an output layer. Despite its simplicity and the ease with which it can be understood, the capability of the FCN model to address pose recognition issues requires further investigation.
Recently, the vision transformer (ViT) [18] has come to represent a significant advancement in deep learning models. The ViT models break the traditional convolutional paradigm by utilizing a transformer-based architecture for image processing. The Swin Transformer [19] further improves upon the ViT, especially in terms of computational efficiency and performance with high-resolution images. The Swin Transformer has demonstrated exceptional results in tasks such as image classification, outperforming traditional CNN and ViT models [20], and has been successfully applied to image classification and other computer vision tasks.
In pose recognition, OpenPose [21] is a CNN-based technology that introduces confidence maps for joint locations and partial affinity fields (PAFs) for the construction of skeletal structures. OpenPose supports the detection of 25 key joints, including the head, spine, limbs, and the major joints of the hands and feet. While it generates only 2D skeletal maps suitable for in-plane pose analysis, its confidence maps and PAFs enable precise joint positioning and skeletal structure construction for multi-person pose recognition, including detailed facial and hand detection.
MediaPipe [22], developed by Google in 2019, specializes in efficient real-time data processing and pose recognition. Supporting 33 key joints, it extends beyond OpenPose by including joints in the lower back, fingers, and toes, offering more comprehensive pose descriptions. Unlike OpenPose, MediaPipe supports both 2D and 3D skeletal maps, capturing depth information for spatially detailed applications. MediaPipe modularizes sensory processing workflows, enabling real-time joint position updates as individuals move. In addition to full-body pose recognition, it provides detailed detection of facial features, hands, and irises.
OpenPose and MediaPipe are preferred over other available methods, such as PoseNet [23] and AlphaPose [24], due to their high accuracy, efficiency, and scalability for real-time applications. While PoseNet is known for its speed, it lacks the depth of feature recognition offered by OpenPose and MediaPipe. AlphaPose, although accurate, requires more computational resources and is less efficient for real-time processing. OpenPose and MediaPipe were selected as they strike the right balance between accuracy and processing speed, essential for analyzing complex motion sequences like those found in Up-and-Go pole walking exercises.
Up-and-Go pole walking exercises are low-intensity, full-body exercises with significant health benefits. However, challenges such as difficulty in mastering the proper pose and adapting to the exercises have hindered their promotion. This study investigates pose recognition methods for Up-and-Go pole walking exercises, evaluating the performance of seven deep-learning-based techniques. By comparing pose recognition accuracy, the research provides empirical evidence to address pose-related challenges and explores the potential of these technologies to improve exercise effectiveness and safety. The findings have important implications for improving health and quality of life.

2. Methods

2.1. Research Workflow Framework

Figure 1 illustrates the complete workflow framework for pose recognition in Up-and-Go pole walking rehabilitation exercises. This framework integrates four core stages: image collection and processing, skeleton diagram and joint point feature extraction, deep learning model training, and model performance evaluation. The workflow not only combines deep learning techniques but also utilizes powerful skeleton diagram generation tools such as MediaPipe and OpenPose, enabling high-precision movement recognition. The integration of these technologies enhances the efficiency and feasibility of automatic pose recognition and accuracy evaluation for elderly people's exercise routines. The design of this workflow aims to improve the accuracy and reliability of pose recognition in Up-and-Go pole walking exercises, offering more precise exercise guidance for elderly users.
In the data processing stage, 30 key motion frames are extracted from video recordings using segmentation and image selection techniques, serving as the dataset for model training. MediaPipe and OpenPose are then employed to generate skeletal images and extract 33 keypoint coordinates, from which joint distances and angle features are computed, yielding 12- and 30-dimensional pose feature sets. The extracted keypoints thus serve both to generate skeleton images and to derive the feature data used as model input. For model development, this study employs seven pre-trained convolutional neural network and vision transformer models. The CNN-based models include five pre-trained architectures: VGG16, EfficientNetB2, DenseNet169, ResNet50, and Xception. Additionally, two ViT-based models, the ViT model and the Swin Transformer, are evaluated. Furthermore, FCN-based models are applied to the two pose feature sets for classification. For performance evaluation, a five-time experimental method is utilized to compare model performance, ensuring stability and accuracy across different data subsets. The entire workflow, from image processing, skeleton image generation, and pose feature extraction to model training and evaluation, demonstrates a comprehensive pipeline for automating pose recognition in Up-and-Go pole walking exercises.

2.2. Data Source for the Up-and-Go Pole Walking Exercise

The data for this study were sourced from videos of the Up-and-Go pole walking exercise recorded by Dr. Kuo [25]. Due to the high similarity of certain movements and the inability to effectively detect whether the eyes were closed, the study refined and selected 30 representative exercise steps. These movements include stepping, sitting with legs apart, swinging the arms while in a lunge position, raising a single arm, and standing on one leg. The 30 exercise poses comprehensively encompass the fundamental movements of the Up-and-Go pole walking exercise, effectively capturing the essential rehabilitative postures associated with this activity.
For data processing, the videos were first segmented to exclude non-representative or incomplete full-body frames, ensuring that the visual data used clearly presented the various poses during the exercise routine. Subsequently, the processed videos were converted to continuous image sequences. To ensure that the distinctive features of each movement step were adequately learned and recognized, 30 images were extracted from each video, which were then used as input data for subsequent deep learning model training. This approach to data processing not only enhanced data quality but also minimized errors caused by image interference or unclear movements. By performing thorough data pre-processing and curation, the study provided high-quality training data for the model, thus improving the accuracy and robustness of exercise movement recognition.
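To make the frame-sampling step concrete, the following is a minimal sketch of how 30 evenly spaced frames could be extracted from a pre-segmented exercise clip using OpenCV. The helper name extract_frames and the evenly spaced sampling policy are illustrative assumptions; the study's exact selection criteria (e.g., excluding incomplete full-body frames) were applied during manual segmentation.

```python
import cv2
import numpy as np

def extract_frames(video_path: str, n_frames: int = 30) -> list:
    """Extract n_frames evenly spaced frames from a pre-segmented clip."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, n_frames, dtype=int)  # even spacing
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```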

2.3. Skeletal Image Sets Generation

For human pose recognition, this study used the OpenPose and MediaPipe suites to generate three skeletal image sets from the Up-and-Go pole walking exercise videos: the OpenPose 2D skeletal image set, the MediaPipe 2D skeletal image set, and the MediaPipe 3D skeletal image set. Each set consisted of 900 images, comprising 30 classes, one for each distinct exercise step, with 30 samples per class. Each dataset was divided into training and test sets with an 80:20 ratio, so 720 images were used for training and 180 images were used for testing.
First, OpenPose was utilized to extract human joint points and generate two-dimensional skeletal images for the 30 distinct movement classes in the Up-and-Go pole walking exercise, as illustrated in Figure 2. These images feature a black background and display only the skeletal structure, providing x- and y-axis planar information. This format is suitable for basic pose analysis and motion recognition. Additionally, the joints and bones in the OpenPose skeletal diagram are represented in different colors, with each color corresponding to specific joints or bone connections. For example, red is typically used to represent the central axis of the torso (such as the spine), while green and blue are used to distinguish the upper and lower limbs, respectively. Purple and yellow may be used for finer details, such as the head or hands. This color distinction helps us to intuitively understand the structural features of human posture. Second, MediaPipe was used to generate 2D skeletal images similar to those from OpenPose. However, MediaPipe offers enhanced flexibility in detecting human joint points, capturing pose features from multiple angles and further improving the application potential of 2D skeletal images. A key advantage of MediaPipe lies in its ability to generate 3D skeletal images for the 30 distinct movement steps in the Up-and-Go pole walking exercise, as depicted in Figure 3. Unlike 2D skeletal images, 3D skeletal images include not only x- and y-axis data but also z-axis depth information, capturing the spatial structure of human poses. This comprehensive spatial representation makes 3D skeletal images particularly suitable for detailed motion analysis, such as assessing pose stability and capturing spatial variations in complex movements. These 3D skeletal images are based on a standardized three-dimensional coordinate system that provides the position of each joint on the X, Y, and Z axes. The coordinate values are relative and standardized rather than representing actual physical lengths, and they are scaled according to human body proportions. This coordinate system allows the skeletal images to be used for more precise posture analysis and action recognition.
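As an illustration of how such skeleton images can be produced, the following minimal sketch uses the MediaPipe Pose solution to draw the detected 33-landmark skeleton on a black canvas; the function name and parameter values are assumptions for demonstration rather than the study's exact configuration.

```python
import cv2
import numpy as np
import mediapipe as mp

mp_pose = mp.solutions.pose
mp_drawing = mp.solutions.drawing_utils

def skeleton_image_2d(frame):
    """Return a black-background image containing only the detected skeleton."""
    with mp_pose.Pose(static_image_mode=True) as pose:
        result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    canvas = np.zeros_like(frame)  # black background, same size as the input
    if result.pose_landmarks:
        mp_drawing.draw_landmarks(canvas, result.pose_landmarks,
                                  mp_pose.POSE_CONNECTIONS)
    return canvas
```

For the 3D view, MediaPipe's drawing utilities also provide plot_landmarks(), which renders the world landmarks in the standardized 3D coordinate system described above.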

2.4. Image-Based Model Development

This study used five CNNs and two ViTs to construct deep learning models to recognize the postures of the Up-and-Go pole walking exercises. CNNs are widely used in computer vision tasks, particularly in image classification, object detection, and pose recognition. Their primary advantage lies in their robust feature extraction capabilities, which enable automatic learning of spatial hierarchies in images through convolutional layers, thus improving the accuracy and precision of image data understanding and prediction. Based on transfer learning, five commonly used pre-trained models—Xception [26], EfficientNetB2 [27], DenseNet169 [28], ResNet50 [16], and VGG16 [15]—were initially tested in this study to identify the most suitable model for recognizing Up-and-Go pole walking exercises.
Xception (Extreme Inception) [26] is a deep learning model developed by Google’s François Chollet, based on the concept of depth-wise separable convolutions. It contains 36 convolutional layers and is structurally divided into three main parts: entry flow, intermediate flow, and exit flow. Xception excels at processing large datasets, especially on the ImageNet dataset, where it outperforms traditional models such as VGG16 and ResNet50. This model is also known for its efficiency and scalability, making it suitable for a wide range of applications beyond image classification.
EfficientNet [27] is a series of efficient convolutional neural network (CNN) models that aim to balance the depth, width, and resolution of the model through a compound scaling approach. EfficientNetB2 is a variant in this family that offers improved accuracy and efficiency compared to its predecessor. These models have performed well in a variety of computer vision tasks, including image classification, object detection, and segmentation. They have also shown good performance in medical applications.
DenseNet169 [28] is part of the DenseNet family, which features a densely connected network architecture. In this structure, each layer is connected to all previous layers, promoting feature reuse and reducing the number of parameters. DenseNet169 performs well in image classification tasks, especially in medical image analysis, where it has demonstrated high accuracy and robustness. Its dense connections facilitate gradient flow and alleviate the problem of vanishing gradients, making it easier to train deeper networks.
ResNet50 [16] is a deep residual network proposed by Kaiming He et al. in 2015. It introduced the concept of residual learning, making it possible to train very deep networks. ResNet50 solves the gradient vanishing problem by using short-circuit connections, also known as skip connections, which allow gradients to flow directly through the network. This architecture has achieved excellent results in multiple image recognition challenges and has been widely adopted in various applications.
VGG16 (Visual Geometry Group 16-layer Network) [15] is a classic deep convolutional neural network architecture comprising 16 layers, including 13 convolutional layers and 3 fully connected layers. Its distinctive feature lies in its simple yet uniform design, which uses small convolutional kernels (3 × 3) with a stride of 1 to hierarchically extract image features. The convolutional and pooling layers are alternately arranged, progressively reducing the spatial dimensions of the feature maps while increasing their depth. This enables VGG16 to capture multi-level abstract features within images, resulting in exceptional performance across various visual recognition tasks. Despite its relatively large number of parameters, VGG16 remains popular for its straightforward architecture and effectiveness in various computer vision tasks.
Figure 4a illustrates the CNN-based model developed in this study, using VGG16 as the primary feature extractor and fine-tuned to meet specific application requirements. To enhance training efficiency and prevent overfitting, the weights of the initial layers were frozen during fine-tuning, and only the latter layers were retrained. This approach took advantage of the general features learned by VGG16 on large-scale datasets (such as ImageNet), while customizing the latter layers for the specific features of the Up-and-Go pole walking exercises. Furthermore, global average pooling layers, ReLU activation functions [29], batch normalization layers, and dropout layers were incorporated into the model to further improve generalization capabilities and reduce overfitting. The final output layer employed a Softmax activation function [30] to convert the outputs into classification probabilities, facilitating recognition of multiple classes. This fine-tuning strategy enabled the VGG16 model to effectively learn and adapt to specialized application scenarios, achieving high recognition accuracy and demonstrating strong feature learning capabilities in motion recognition tasks.
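The following Keras sketch captures the fine-tuning strategy just described: a pre-trained VGG16 backbone with frozen early layers, followed by global average pooling, a ReLU dense layer, batch normalization, dropout, and a Softmax output. The number of frozen layers, the dense-layer width, and the dropout rate are illustrative assumptions, not the study's exact hyperparameters.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 30  # one class per exercise step

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
for layer in base.layers[:-4]:   # freeze early layers; retrain the last block
    layer.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),  # class probabilities
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```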
ViT models, known for their ability to process image data by dividing images into patches and applying self-attention mechanisms, were also used to train on different types of skeletal image datasets. In addition to the CNN-based approach, this study also incorporated the ViT-based method for performance evaluation.
ViT [18], introduced by Dosovitskiy et al., transforms an image into a sequence of non-overlapping patches, each treated as a token in a transformer model. Unlike traditional CNNs, which rely on convolutional layers to extract hierarchical features, ViT utilizes self-attention across all patches, capturing long-range dependencies effectively. This method has demonstrated strong performance on large-scale datasets, particularly in image classification tasks, and has been extended to various computer vision applications, such as object detection and segmentation.
The Swin Transformer is specifically designed to handle high-resolution images efficiently by using a “shifted window” approach for self-attention, which allows it to capture both local and global features effectively. Unlike ViT, which applies self-attention globally, Swin Transformer partitions an image into smaller, non-overlapping windows, computing self-attention within each region. To ensure information exchange between windows, it incorporates a shifting mechanism that enables better feature aggregation across different image regions. This hierarchical structure not only reduces computational costs but also improves adaptability to various scales and image resolutions, making it suitable for tasks such as object detection, segmentation, and medical imaging analysis [19].
Figure 4b illustrates the ViT-based model developed in this study, which utilizes Swin Transformer as the primary feature extractor. The model processes input images through a series of layers starting with a patch embedding that divides the image into smaller patches. Following this, the model applies dropout for regularization and passes the output through a sequential block to process the features. After normalization, a global average pooling layer reduces the dimensionality of the output, capturing the most important features of the image. The processed features are then passed through dense layers for further processing and classification. Modifications to the original architecture include the use of a Swin Transformer for feature extraction and fine-tuning of the model to meet the specific requirements of the application.
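A minimal sketch of such a Swin-based classifier is given below using the timm model zoo; this library choice is an assumption, since the paper does not state its implementation framework. timm replaces the pre-trained classification head with a new 30-way output.

```python
import timm
import torch

# Pre-trained Swin backbone with a new 30-class head (one per exercise step).
model = timm.create_model("swin_tiny_patch4_window7_224",
                          pretrained=True, num_classes=30)

# Illustrative fine-tuning choice: freeze the patch embedding so that only
# the transformer stages and the new head are updated during training.
for name, p in model.named_parameters():
    if name.startswith("patch_embed"):
        p.requires_grad = False

logits = model(torch.randn(1, 3, 224, 224))  # shape (1, 30): class scores
```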
In this study, the proposed approach, based on a deep learning pre-trained model and a fine-tuning strategy, effectively processes Up-and-Go pole walking exercise data. The implementation procedure involves first selecting a CNN or ViT model as the base model, then loading the pre-trained model weights, and using the specific Up-and-Go pole walking exercise dataset, after data preprocessing, to fine-tune the model. As the training progresses, the model continuously optimizes its parameters to improve its performance in practical applications. Finally, the performances of each model are evaluated to identify the best model that meets the specific requirements for posture recognition. The core of the fine-tuning strategy lies in leveraging the features learned by the pre-trained models and tailoring them to better capture and recognize the pose characteristics of Up-and-Go pole walking exercises.

2.5. Posture Feature Sets Generation

This study uses the MediaPipe framework to extract the complete set of joint coordinates from images (Figure 5). Each human body is represented by 33 joint coordinates, which provide essential foundational information for capturing human movement and pose. To further analyze and recognize human pose, we designed two feature extraction methods: 12 features based on hand and foot coordinates, and 30 expert-defined features. The dataset consists of 900 instances, where each instance corresponds to a set of 12 or 30 features, depending on the extraction method. The dataset is divided into training and test sets with an 80:20 ratio, where 720 instances are used for training and 180 for testing, with each class containing 30 instances.
First, we used the coordinates of the hands and feet to distinguish the basic positions of the body, dividing the pose into eight different distance combinations and four different angle combinations, generating a total of 12 features (Figure 6). Specifically, the selected distance combinations include the distances between the shoulder and elbow, elbow and wrist, hip and knee, knee and ankle, and other joint distances, while the angle combinations involve the bending angles at various joints. These features focus on the relative distances and angles between the main limbs of the body, effectively summarizing the basic pose features during movement. Furthermore, to enhance the model’s ability to recognize human pose, expert-defined feature markers were employed to provide a more detailed description of the body’s position. By integrating the full-body joint coordinates, we generated 15 distance combinations and 15 angle combinations, totaling 30 features. These features provide higher resolution and detailed descriptions, accurately reflecting the relative position and angle changes of different body parts, thereby improving the model’s ability to recognize complex poses.
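To illustrate how these descriptors are computed from the 33 landmark coordinates, the sketch below defines generic distance and angle helpers and assembles a few example features. The specific joint pairs and triples shown are illustrative members of the 12- and 30-feature sets, using MediaPipe's landmark indexing (12: right shoulder, 14: right elbow, 16: right wrist).

```python
import numpy as np

def joint_distance(a, b):
    """Euclidean distance between two joint coordinates."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def joint_angle(a, b, c):
    """Angle in degrees at joint b, formed by the segments b->a and b->c."""
    v1 = np.asarray(a) - np.asarray(b)
    v2 = np.asarray(c) - np.asarray(b)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def example_features(lm):
    """lm: (33, 2) array of normalized (x, y) landmark coordinates."""
    return [
        joint_distance(lm[12], lm[14]),       # right shoulder-elbow distance
        joint_distance(lm[14], lm[16]),       # right elbow-wrist distance
        joint_angle(lm[12], lm[14], lm[16]),  # right elbow bending angle
        # ...remaining pairs and triples complete the 12- or 30-dim vector
    ]
```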

2.6. Feature-Based Model Development

In this study, we also used a fully connected neural network (FCN) to establish a deep learning model for recognizing the Up-and-Go pole walking exercise. The FCN is a learning model based on an artificial neural network layer structure, where each neuron in a layer is connected to all neurons in the previous layer. This enables fully connected layers to capture and learn complex patterns within the input data, making them particularly effective for processing structured data. In this research, the feature datasets serve as the model input for prediction and recognition, with the aim of improving the accuracy and efficiency of Up-and-Go pole walking exercise recognition.
The feature-based model architecture proposed in this study, as shown in Figure 7, consists of three hidden layers, each utilizing the ReLU (rectified linear unit) activation function, a commonly used nonlinear activation function that helps mitigate the vanishing gradient problem and facilitates network training. Specifically, the first hidden layer contains 200 neurons, the second hidden layer consists of 300 neurons, and the third hidden layer has 100 neurons. These hidden layers are responsible for learning high-level feature representations from the input data. The final output layer uses the Softmax activation function for multi-class classification over the 30 exercise classes. The Softmax function converts the output into a probability distribution in which the predicted class probabilities sum to 1. This design allows the model to accurately classify and predict the different stages of the Up-and-Go pole walking exercise.
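A minimal Keras sketch of this FCN, with the 200-, 300-, and 100-neuron hidden layers of Figure 7, is shown below; the optimizer and loss settings are illustrative assumptions.

```python
from tensorflow.keras import layers, models

NUM_FEATURES = 30  # 12 for the hand/foot feature set
NUM_CLASSES = 30   # one class per exercise step

fcn = models.Sequential([
    layers.Input(shape=(NUM_FEATURES,)),
    layers.Dense(200, activation="relu"),
    layers.Dense(300, activation="relu"),
    layers.Dense(100, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),  # probabilities sum to 1
])
fcn.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
```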

2.7. Evaluation

This study uses five-time experiments, confusion matrices, and accuracy to evaluate the performance of various deep learning posture recognition models and analyze their effectiveness in classifying the Up-and-Go pole walking exercise categories.
The method of five-time experiments is commonly used to improve model reliability and mitigate the impact of data variability on test results [32]. In this approach, the dataset is randomly split into training and testing sets, and the model is trained and tested five times with different random splits. The average test accuracy across the five runs is then calculated to provide a more stable performance assessment. Compared to k-fold cross-validation [33], the five-run method is particularly useful when working with limited datasets, as it reduces computational complexity while still offering a robust evaluation of model generalization. This evaluation approach ensures that model performance is not overly dependent on a specific data split, leading to more reliable and generalizable results.
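The sketch below illustrates this protocol: five independent 80:20 splits, each with its own training run, with the mean and standard deviation of test accuracy reported at the end, as in Tables 1-4. The build_model callable and the arrays X and y stand in for this study's models and data; the epoch count and the use of stratification are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def five_run_accuracy(build_model, X, y, runs=5):
    """Average test accuracy over `runs` independent random 80:20 splits."""
    scores = []
    for seed in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y)
        model = build_model()                      # fresh model per run
        model.fit(X_tr, y_tr, epochs=50, verbose=0)
        _, acc = model.evaluate(X_te, y_te, verbose=0)
        scores.append(acc)
    return float(np.mean(scores)), float(np.std(scores))
```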
The confusion matrix is a powerful tool that can be used to display the prediction results of a classification model, especially when dealing with multiclass classification problems. In this study, the confusion matrix illustrates the model’s predictions for different “Up-and-Go Pole Walking” exercise categories. Each row represents the actual motion category, while each column corresponds to the model’s predicted category. The numbers in the matrix reflect the frequency of each prediction result, providing a clear evaluation of the model’s ability to recognize different categories and its error distribution.
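Such a matrix can be produced directly from the predicted and actual labels; a minimal sketch with scikit-learn follows, where y_true and y_pred stand in for the test labels and the model's predictions.

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 1, 2, 2, 2]            # actual exercise categories
y_pred = [0, 1, 2, 2, 2, 1]            # model predictions
cm = confusion_matrix(y_true, y_pred)  # rows: actual; columns: predicted
print(cm)
```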
In this study, we used trained deep learning model weights to classify action labels in test videos. During this process, the model classifies actions based on motion features in the video and compares the identified action labels with the actual action categories of the video. This method effectively evaluates the model’s performance in real-world application scenarios, particularly in complex action classification tasks. By testing various videos, we ensure that the model can accurately differentiate between different types of action. Video testing allows us to more accurately reflect the model’s capability in dynamic image classification, assess its performance in recognizing different action categories, and explore the model’s stability across various scenes and conditions.
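The per-frame video test can be sketched as follows, assuming a Keras-format model file and reusing the skeleton_image_2d helper sketched in Section 2.3; the file names and overlay position are illustrative.

```python
import cv2
import numpy as np
from tensorflow.keras.models import load_model

model = load_model("pose_model.h5")        # hypothetical trained weights
cap = cv2.VideoCapture("test_exercise.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    skel = cv2.resize(skeleton_image_2d(frame), (224, 224))
    probs = model.predict(skel[np.newaxis] / 255.0, verbose=0)[0]
    label = int(np.argmax(probs))
    # Overlay the predicted class in green, as in Figures 17 and 18.
    cv2.putText(frame, f"class {label}", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
cap.release()
```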

3. Results and Discussion

This study focuses on the development and evaluation of a pose recognition model for the Up-and-Go pole walking exercise. During the model selection phase, we tested and compared five common pre-trained models: Xception, EfficientNetB2, DenseNet169, ResNet50, and VGG16. The accuracy data are shown in Table 1. The results show that the VGG16 model achieved an average test accuracy of 98.2%, outperforming the other CNN models. Specifically, the average test accuracies for Xception, EfficientNetB2, DenseNet169, and ResNet50 were 97.9%, 95.9%, 78.1%, and 98.1%, respectively. As shown in Table 1, the ViT and Swin models were also tested, achieving remarkable average test accuracies of 99.4% and 99.7%, respectively, outperforming all five CNN models. Based on these results, VGG16 and Swin were selected to evaluate their performance in pose recognition for the Up-and-Go pole walking exercise using skeleton images and feature data.
The performance differences among these models on the Up-and-Go pole walking exercise can be attributed to several architectural characteristics and design choices inherent to each model. The Xception model utilizes depth-wise separable convolutions for efficient computation but is slightly less accurate given the task's complexity. The EfficientNetB2 model employs compound scaling for efficiency but may miss intricate details needed for pose recognition. The DenseNet169 model uses densely connected layers to promote gradient flow and feature reuse but may overfit or struggle to generalize. The ResNet50 model uses residual connections to train deeper networks, achieving high accuracy, but is slightly outperformed by the others. The VGG16 model employs a simple architecture with small convolutional filters that is effective in feature extraction, leading to high accuracy. The ViT model represents images as sequences of patches, capturing global context and long-range dependencies, resulting in superior performance. The Swin Transformer introduces a hierarchical architecture with shifted windows, optimizing efficiency and capturing both local and global features, and achieves the best performance. In summary, the differences in pose recognition performance among these models are primarily due to their unique architectural features and the specific ways in which they process and extract information from the input data. The superior performance of the VGG16 model and the Swin Transformer highlights their efficiency in capturing the details necessary for accurate pose recognition in the Up-and-Go pole walking exercise.
Among the CNN architectures (Table 2), the VGG16 model achieved test accuracies of 98.2% for the MediaPipe 2D dataset, 98.1% for the OpenPose dataset, and 94.6% for the MediaPipe 3D dataset. The results in Table 2 show that the best CNN-based performance in pose recognition for the Up-and-Go pole walking exercise is obtained by the VGG16 model with MediaPipe 2D skeleton images, achieving a test accuracy of 98.2%.
For the ViT architectures, Table 3 shows that the Swin model achieved test accuracies of 99.7% for the MediaPipe 2D dataset, 98.0% for the OpenPose dataset, and 92.7% for the MediaPipe 3D dataset. Comparing the results of Table 2 and Table 3, it can be observed that, for recognizing Up-and-Go pole walking exercise postures, the Swin model performs best when using the MediaPipe 2D skeleton images, surpassing even the VGG16 model.
For the FCN architecture (Table 4), the FCN model achieved test accuracies of 84.0% for the 30-feature set and 80.3% for the 12-feature set. Comparing the skeleton image sets with the posture feature sets, it can be observed that, while converting image data into feature data significantly reduces memory requirements during computation, the conversion process incurs additional computational costs and the overall posture recognition performance decreases by approximately 15%.
Moreover, the processing speed of the models also affects their practical applicability. The total processing time for the MediaPipe 2D model was 5.08 s, with an average of 0.0301 s per frame. The MediaPipe 3D model had a total processing time of 5.13 s and an average of 0.0303 s per frame. The OpenPose model required a total of 5.45 s, with an average of 0.0323 s per frame. Although OpenPose provides 2D skeletal information, its relatively higher computational cost may limit its feasibility for real-time applications. In contrast, the MediaPipe 2D model maintains high recognition accuracy while achieving faster processing speeds, making it a more advantageous choice for real-world deployment.
Although the CNN-based VGG16 model has good pose recognition capabilities, the ViT-based Swin model demonstrates superior performance, particularly with the MediaPipe 2D skeleton image set, achieving test accuracy as high as 99.7%. Leveraging the Swin model and MediaPipe 2D skeleton images can provide a powerful, accurate, and efficient solution for real-time applications related to posture recognition in Up-and-Go pole walking rehabilitation exercises.
Figure 8 illustrates the training and validation process of the VGG16 model for recognizing movements in the Up-and-Go pole walking exercise using Mediapipe 2D skeleton data, achieving a test accuracy of 98.2%. Figure 9 displays the training confusion matrix of the VGG16 model with Mediapipe 2D skeleton data, while Figure 10 shows the test confusion matrix of the VGG16 model with the same data.
Figure 11 illustrates the training and validation process of the Swin model for recognizing poses in the Up-and-Go pole walking exercise using Mediapipe 2D skeleton data, achieving a test accuracy of 99.7%. Figure 12 shows the training confusion matrix obtained from the Swin model with Mediapipe 2D skeleton data for Up-and-Go pole walking exercise postures, while Figure 13 displays the test confusion matrix of the Swin model using the same data.
Figure 14 illustrates the training and validation process of the FCN model for pose recognition in the Up-and-Go pole walking exercise using the 30-feature set, achieving a test accuracy of 84.0%. Figure 15 shows the training confusion matrix obtained from the FCN model with the 30-feature set for the Up-and-Go pole walking exercise, while Figure 16 displays the test confusion matrix of the FCN model using the same feature set.
Figure 17 and Figure 18 show correct and incorrect examples when testing the video dataset using the trained model weights. These examples provide a detailed view of the model’s performance in real-world applications, particularly in cases of misclassification. For instance, in the example of category 28 (squat to standing with single arm raise), Figure 17 shows a correctly predicted classification, while Figure 18 illustrates a misclassification where the model predicted category 10 (arm swinging while in lunge position). The green text in the top left corner of the image indicates the predicted class of the model, which helps in further adjusting and optimizing the model.

4. Conclusions and Future Research

This study evaluated seven deep learning methods for pose recognition in the Up-and-Go pole walking exercise. The results indicate that the CNN-based model with a pre-trained VGG16 backbone achieved an accuracy of 98.2%. However, the ViT-based model with a pre-trained Swin backbone outperformed all other methods, achieving the highest accuracy of 99.7% and demonstrating superior ability to recognize complex movement patterns in the Up-and-Go pole walking exercise. The 30-feature dataset also achieved a higher test accuracy of 84.0% compared to the 12-feature dataset, highlighting the importance of capturing the complete body structure. In summary, CNN-based models demonstrated superior performance compared to FCN-based models, and ViT-based models outperformed CNN-based approaches, underscoring the efficacy of ViTs in processing skeleton images for posture recognition in the Up-and-Go pole walking exercise.
However, this study has some limitations. The dataset used in the experiment was relatively small, and future studies should consider using larger and more diverse datasets to improve the generalization of the model. Additionally, while the models performed well in controlled settings, their real-world application could be affected by factors such as varying lighting conditions and individual differences in posture. Previous research has shown that lighting variations can significantly affect the performance of image-based gesture recognition systems, as changes in light intensity, angle, or shadow can alter the appearance of key features [34]. Further research is needed to address these challenges and improve the robustness of pose recognition systems, particularly in real-world scenarios where lighting conditions are less predictable.

Author Contributions

Conceptualization, M.-H.T. and H.-Y.L.; methodology, W.-C.L., Y.-C.T. and M.-H.T.; formal analysis, W.-C.L., Y.-C.T., H.-Y.L. and M.-H.T.; writing—original draft preparation, W.-C.L. and M.-H.T.; writing—review and editing, M.-H.T. and H.-Y.L.; supervision, M.-H.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Council, Taiwan, R.O.C., grant number: NSTC 113-2121-M-040-002.

Data Availability Statement

Data used in this study are sourced from the publicly available website at [https://www.youtube.com/watch?v=xkSJUutBa74&list=PLUBbhVNVjjTW00dNfQBsxrpY5gdQ48VB8 (accessed on 1 January 2024)], reference number [25]. Readers who require data and code assistance are asked to contact the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Bastiaanse, L.P.; Hilgenkamp, T.I.; Echteld, M.A.; Evenhuis, H.M. Prevalence and associated factors of sarcopenia in older adults with intellectual disabilities. Res. Dev. Disabil. 2012, 33, 2004–2012. [Google Scholar] [CrossRef] [PubMed]
  2. Meier, N.F.; Lee, D.-C. Physical activity and sarcopenia in older adults. Aging Clin. Exp. Res. 2020, 32, 1675–1687. [Google Scholar] [CrossRef] [PubMed]
  3. Marzetti, E.; Calvani, R.; Tosato, M.; Cesari, M.; Di Bari, M.; Cherubini, A.; Broccatelli, M.; Savera, G.; D’Elia, M.; Pahor, M. Physical activity and exercise as countermeasures to physical frailty and sarcopenia. Aging Clin. Exp. Res. 2017, 29, 35–42. [Google Scholar] [CrossRef] [PubMed]
  4. Bravo, V.P.; Muñoz, J.A. Wearables and their applications for the rehabilitation of elderly people. Med. Biol. Eng. Comput. 2022, 60, 1239–1252. [Google Scholar] [CrossRef]
  5. Regterschot, G.R.H.; Ribbers, G.M.; Bussmann, J.B. Wearable movement sensors for rehabilitation: From technology to clinical practice. Sensors 2021, 21, 4744. [Google Scholar] [CrossRef]
  6. Hong, Z.; Hong, M.; Wang, N.; Ma, Y.; Zhou, X.; Wang, W. A wearable-based posture recognition system with AI-assisted approach for healthcare IoT. Future Gener. Comput. Syst. 2022, 127, 286–296. [Google Scholar] [CrossRef]
  7. Ramirez, H.; Velastin, S.A.; Meza, I.; Fabregas, E.; Makris, D.; Farias, G. Fall detection and activity recognition using human skeleton features. IEEE Access 2021, 9, 33532–33542. [Google Scholar] [CrossRef]
  8. Roy, M.; Grattard, V.; Dinet, C.; Soares, A.V.; Decavel, P.; Sagawa, Y.J. Nordic walking influence on biomechanical parameters: A systematic review. Eur. J. Phys. Rehabil. Med. 2020, 56, 607–615. [Google Scholar] [CrossRef]
  9. Nagyova, I.; Jendrichovsky, M.; Kucinsky, R.; Lachytova, M.; Rus, V. Effects of Nordic walking on cardiovascular performance and quality of life in coronary artery disease. Eur. J. Phys. Rehabil. Med. 2020, 56, 616–624. [Google Scholar] [CrossRef]
  10. Bullo, V.; Gobbo, S.; Vendramin, B.; Duregon, F.; Cugusi, L.; Di Blasio, A.; Bocalini, D.S.; Zaccaria, M.; Bergamin, M.; Ermolao, A. Nordic walking can be incorporated in the exercise prescription to increase aerobic capacity, strength, and quality of life for elderly: A systematic review and meta-analysis. Rejuvenat. Res. 2018, 21, 141–161. [Google Scholar] [CrossRef]
  11. Kuo, C.-C. UpandGO Pole Walking School. Available online: https://upandgo.com.tw/ (accessed on 1 January 2024).
  12. Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260. [Google Scholar] [CrossRef] [PubMed]
  13. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  14. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  15. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  16. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  17. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  18. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  19. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  20. Tseng, M.-H. GA-based weighted ensemble learning for multi-label aerial image classification using convolutional neural networks and vision transformers. Mach. Learn. Sci. Technol. 2023, 4, 045045. [Google Scholar] [CrossRef]
  21. Cao, Z.; Simon, T.; Wei, S.-E.; Sheikh, Y. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299. [Google Scholar]
  22. Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja, E.; Hays, M.; Zhang, F.; Chang, C.; Yong, M.; Lee, J. MediaPipe: A framework for building perception pipelines. arXiv 2019, arXiv:1906.08172. [Google Scholar]
  23. Kendall, A.; Grimes, M.; Cipolla, R. Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2938–2946. [Google Scholar]
  24. Fang, H.-S.; Li, J.; Tang, H.; Xu, C.; Zhu, H.; Xiu, Y.; Li, Y.-L.; Lu, C. Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 7157–7173. [Google Scholar] [CrossRef]
  25. Kuo, C.-C. Up-and-GO Pole Walking Exercise. Available online: https://www.youtube.com/watch?v=xkSJUutBa74&list=PLUBbhVNVjjTW00dNfQBsxrpY5gdQ48VB8 (accessed on 1 January 2024).
  26. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  27. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 6105–6114. [Google Scholar]
  28. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  29. Agarap, A. Deep learning using rectified linear units (relu). arXiv 2018, arXiv:1803.08375. [Google Scholar]
  30. Mehra, S.; Raut, G.; Purkayastha, R.D.; Vishvakarma, S.K.; Biasizzo, A. An empirical evaluation of enhanced performance softmax function in deep learning. IEEE Access 2023, 11, 34912–34924. [Google Scholar] [CrossRef]
  31. Google. MediaPipe Pose. Available online: https://github.com/google-ai-edge/mediapipe/blob/master/docs/solutions/pose.md (accessed on 1 February 2025).
  32. Rafało, M. Cross validation methods: Analysis based on diagnostics of thyroid cancer metastasis. ICT Express 2022, 8, 183–188. [Google Scholar] [CrossRef]
  33. Zhang, X.; Liu, C.-A. Model averaging prediction by K-fold cross-validation. J. Econom. 2023, 235, 280–301. [Google Scholar] [CrossRef]
  34. Yu, L.; Abuella, H.; Islam, M.Z.; O’Hara, J.F.; Crick, C.; Ekin, S. Gesture recognition using reflected visible and infrared lightwave signals. IEEE Trans. Hum. Mach. Syst. 2021, 51, 44–55. [Google Scholar] [CrossRef]
Figure 1. Workflow framework diagram.
Figure 2. OpenPose skeletal images.
Figure 3. MediaPipe 3D skeletal images.
Figure 4. The designed image-based models in this study.
Figure 5. MediaPipe full-body joint coordinates [31].
Figure 6. Twelve-feature dataset generated in this study.
Figure 7. The designed FCN model in this study.
Figure 8. Training and validation process of the VGG16 model using MediaPipe 2D skeleton images.
Figure 9. Training confusion matrix of the VGG16 model using MediaPipe 2D skeleton images.
Figure 10. Test confusion matrix of the VGG16 model using MediaPipe 2D skeleton images.
Figure 11. Training and validation process of the Swin model using MediaPipe 2D skeleton images.
Figure 12. Training confusion matrix of the Swin model using MediaPipe 2D skeleton images.
Figure 13. Test confusion matrix of the Swin model using MediaPipe 2D skeleton images.
Figure 14. Training and validation process of the FCN model using the 30-feature set.
Figure 15. Training confusion matrix of the FCN model using the 30-feature set.
Figure 16. Test confusion matrix of the FCN model using the 30-feature set.
Figure 17. A correct example of video prediction using the proposed model.
Figure 18. An incorrect example of video prediction using the proposed model.
Table 1. Performance comparison of seven pre-trained models using MediaPipe 2D skeleton images. (CNN-based models: Xception, EfficientNetB2, DenseNet169, ResNet50, VGG16; ViT-based models: ViT, Swin.)

| Accuracy | Xception | EfficientNetB2 | DenseNet169 | ResNet50 | VGG16 | ViT | Swin |
|----------|----------|----------------|-------------|----------|-------|-----|------|
| Training | 0.998 ± 0.002 | 0.993 ± 0.005 | 0.799 ± 0.368 | 0.992 ± 0.014 | 1.000 ± 0.000 | 1.000 ± 0.000 | 0.999 ± 0.002 |
| Test | 0.979 ± 0.009 | 0.959 ± 0.019 | 0.781 ± 0.361 | 0.981 ± 0.018 | 0.982 ± 0.007 | 0.994 ± 0.004 | 0.997 ± 0.003 |
| Overall | 0.994 ± 0.003 | 0.986 ± 0.007 | 0.795 ± 0.366 | 0.990 ± 0.014 | 0.996 ± 0.002 | 0.999 ± 0.001 | 0.998 ± 0.002 |
Table 2. Performance comparison of the VGG16 model for three image sets.

| Accuracy | OpenPose Skeleton Image | MediaPipe 2D Skeleton Image | MediaPipe 3D Skeleton Image |
|----------|-------------------------|-----------------------------|-----------------------------|
| Training | 0.999 ± 0.002 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| Test | 0.981 ± 0.008 | 0.982 ± 0.007 | 0.946 ± 0.011 |
| Overall | 0.996 ± 0.002 | 0.996 ± 0.002 | 0.989 ± 0.002 |
Table 3. Performance comparison of the Swin model for three image sets.

| Accuracy | OpenPose Skeleton Image | MediaPipe 2D Skeleton Image | MediaPipe 3D Skeleton Image |
|----------|-------------------------|-----------------------------|-----------------------------|
| Training | 0.998 ± 0.002 | 0.999 ± 0.002 | 0.998 ± 0.003 |
| Test | 0.980 ± 0.010 | 0.997 ± 0.003 | 0.927 ± 0.014 |
| Overall | 0.995 ± 0.003 | 0.998 ± 0.002 | 0.984 ± 0.003 |
Table 4. Performance comparison of the FCN model for two feature sets.

| Accuracy | 12 Features | 30 Features |
|----------|-------------|-------------|
| Training | 0.934 ± 0.023 | 0.982 ± 0.010 |
| Test | 0.803 ± 0.024 | 0.840 ± 0.022 |
| Overall | 0.908 ± 0.020 | 0.954 ± 0.011 |

Share and Cite

MDPI and ACS Style

Lin, W.-C.; Tu, Y.-C.; Lin, H.-Y.; Tseng, M.-H. A Comparison of Deep Learning Techniques for Pose Recognition in Up-and-Go Pole Walking Exercises Using Skeleton Images and Feature Data. Electronics 2025, 14, 1075. https://doi.org/10.3390/electronics14061075
