The PRISMA flow diagram is shown in
Figure 2. Our search identified 371 papers across four databases: 97 in WOS, 138 in Scopus, 130 in PubMed, and 6 in the Astrophysics Data System. After removing duplicates, papers without DOIs, and papers written in languages other than English, 238 papers remained for the first screening. At this stage, 23 articles were excluded, and a further 169 papers were excluded because they did not meet the criterion of “Applying Deep Learning-based Computer Vision for Camera Depth Sensor-Based Physiotherapy Movements Assessment”. In the second screening, 46 articles were analyzed, of which 18 were finally selected for this systematic review, as shown in
Table 3.
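The screening counts above can be cross-checked with simple arithmetic. The following is a minimal sketch using only the numbers reported in the text; the two derived quantities (records removed before screening and articles excluded at the second screening) are not stated in the text and are computed here only as a consistency check.

```python
# Record counts per database, as reported in the text.
identified = {"WOS": 97, "Scopus": 138, "PubMed": 130, "Astrophysics Data System": 6}

total_identified = sum(identified.values())       # all records found by the search
after_cleaning = 238                              # records left for the first screening
removed_before_screening = total_identified - after_cleaning  # derived, not reported

excluded_first_screening = 23 + 169               # both exclusion groups at the first screening
second_screening = after_cleaning - excluded_first_screening

included = 18                                     # studies retained for the review
excluded_second_screening = second_screening - included       # derived, not reported

print(total_identified, removed_before_screening, second_screening, excluded_second_screening)
# prints: 371 133 46 28
```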
3.1. RQ 1: Sensor
Camera sensors are crucial for acquiring visual data for movement analysis and physical therapy. The type of sensor chosen directly affects the quality of the captured data, the performance of the deep learning model, and the design of the evaluation protocol. In this review, depth cameras accounted for 65.4% (see
Figure 4). The camera sensors commonly used in the reviewed studies, illustrated in
Figure 5, are as follows:
Kinect series: RGB-D cameras introduced by Microsoft that obtain accurate depth and image data through infrared ranging and color image acquisition; the main models are Kinect V1 and Kinect V2.
RealSense series: Intel’s RGB-D camera product line, which uses visual-inertial ranging technology to obtain high-quality depth and motion data.
Other RGB-D cameras: Besides the mainstream products mentioned above, other vendors provide RGB-D devices, such as the ASUS Xtion Pro.
Ordinary RGB cameras: These capture only color image data and must be combined with depth estimation algorithms for data processing and analysis.
Across the 18 reviewed studies (see
Table 4), the Kinect series was the most widely used sensor: 12 studies [
8,
11,
12,
13,
15,
16,
28,
29,
30,
31,
33,
34] adopted the Kinect V2 as the primary camera for data acquisition. The Kinect has become the mainstream choice in this field because of its accuracy and reliability. The Intel RealSense series has also seen some use; two papers [
10,
36] used the RealSense L515, D435i, and D415 models. RealSense cameras are technologically advanced, have excellent performance, and are expected to be used more widely in the future.
In addition, three papers [
27,
32,
37] used ordinary RGB cameras and other RGB-D cameras (e.g., Xtion Pro, ASUS, Taipei, Taiwan) to collect data. This approach has low hardware requirements but requires the development of appropriate algorithms to process and analyze the data.
Choosing a sensor for physical therapy applications requires weighing the specific rehabilitation scenario against the technology’s capabilities. Kinect V2 dominates current applications thanks to its skeletal tracking capabilities and reliability for full-body motion analysis, but it faces sustainability issues because it has been discontinued. In contrast, Intel RealSense excels in close-range capture and fine-motion tracking and is a strong contender for home rehabilitation systems thanks to continued development support and cost advantages. Other RGB-D cameras, such as the Xtion Pro, offer depth perception at a lower cost; although their tracking accuracy may be slightly lower than that of a dedicated system, they remain attractive in some applications. The Xbox 360 platform played an important role in early rehabilitation applications but has been replaced by more advanced technologies. Traditional RGB cameras, despite their limitations in depth perception, can still be helpful in specific scenarios, especially in resource-constrained environments, when combined with other sensors (e.g., IMUs) and appropriate pose estimation algorithms.
3.2. RQ 2: Dataset
In computer vision-based physiotherapy movement assessment research, datasets are a key driver for its development. High-quality and diverse raw data and labeled information are the basis for developing excellent deep-learning models. At the same time, these datasets support the ability of the models to generalize across different application scenarios, facilitating the establishment of standardized physiotherapy movement assessment methods. In the reviewed literature, researchers used a variety of data types (see
Figure 6), as follows:
RGB-D image/video data: Many studies [
8,
10,
11,
16,
29,
32,
33,
35,
36] have used color image and depth information captured by RGB-D cameras, usually image sequences or videos. These raw data can directly reflect the motion process and provide the basic input for subsequent motion detection and analysis.
Joint and Skeletal Data: Several studies [
12,
13,
27,
28,
34] have utilized the joint positions and skeletal information extracted by depth cameras to construct skeletal datasets. These structured data directly represent the key features of human movement and are used to model and analyze joint motion trajectories.
Combined datasets: There are also some studies [
31,
32,
37] that have combined RGB-D data and auxiliary data captured by other sensors (e.g., IMU) to capture the motion process from different perspectives and obtain more comprehensive and accurate motion information.
Existing studies adopted various approaches to dataset construction (see
Table 5), including public datasets, self-constructed datasets, and dataset fusion. Most studies [
8,
10,
11,
12,
15,
16,
30,
32,
33,
34,
35,
37] relied on self-constructed datasets. Dataset sizes ranged from tens to hundreds of participants; most were small to medium, with larger examples such as that of Girase et al. [
34], which contained 411 participants. During data collection and construction, researchers generally included participants of different ages, genders, and health conditions. This diversity makes the physical therapy movement datasets more robust and improves the models’ ability to generalize.
In addition, some studies used existing publicly available datasets, which accelerates model development by reusing labeled data. For instance, Raza et al. [
27] used the “Multi-Class Exercise Poses for Human Skeleton” dataset
(https://www.kaggle.com/datasets/dp5995/gym-exercise-mediapipe-33-landmarks, accessed on 19 January 2025), and Khan et al. [
28] used UI-PRMD [
5]. In contrast, Bijalwan et al. [
29] used a fusion of multiple datasets by combining publicly available datasets (UTD-MHAD [
38], mHealth [
39], OU-ISIR [
40], HAPT [
41]) with self-collected data.
During our research, we identified an important challenge in depth camera-based physical therapy exercise assessment: despite the wide variety of datasets, there is a lack of standardization and interoperability. Due to ethical and privacy considerations, most current computer vision-assisted physical therapy research relies primarily on self-constructed datasets. Researchers typically construct or integrate multiple datasets based on specific needs; these datasets cover different groups and numbers of participants and contain multimodal data from multiple sensors. Although they provide a foundation for deep learning model training and physical therapy exercise analysis, standardized datasets that support direct comparisons and meta-analyses of research results are still lacking. These issues can be addressed in the future by establishing standardized guidelines and developing advanced annotation tools to enhance the utility of the datasets.
3.3. RQ 3: Data Processing
Data processing plays a key role in depth camera-based physiotherapy movement analysis studies and directly affects the quality and precision of subsequent analyses. As
Table 6 shows, the 18 reviewed studies use several major types of data processing methods and techniques. First, skeletal data extraction and processing form the basis of most studies; e.g., [
12,
34,
36] used Kinect SDK 2.0 and the OpenPose library to extract human skeletal data from raw depth images, respectively, which provides structured input for subsequent analysis. Second, to ensure the consistency and comparability of the data, some studies such as [
16,
28] performed coordinate system transformation and alignment operations, which helped to eliminate errors caused by different devices or shooting angles.
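As an illustration of the kind of coordinate alignment described above, the following minimal sketch translates a skeleton so that a root joint sits at the origin and rescales it by a reference segment length. The joint names (`hip_center`, `spine_shoulder`, `hand_right`) and the choice of reference segment are assumptions made for this example; they are not taken from the reviewed studies, which may use different alignment procedures.

```python
import math

def normalize_skeleton(joints, root="hip_center", ref="spine_shoulder"):
    """Translate joints so `root` is the origin, then scale so |root -> ref| equals 1.

    `joints` maps a (hypothetical) joint name to an (x, y, z) tuple in camera coordinates.
    """
    rx, ry, rz = joints[root]
    centered = {name: (x - rx, y - ry, z - rz) for name, (x, y, z) in joints.items()}
    scale = math.sqrt(sum(c * c for c in centered[ref]))
    if scale == 0:
        raise ValueError("root and reference joints coincide; cannot scale")
    return {name: tuple(c / scale for c in point) for name, point in centered.items()}

# One frame of illustrative (made-up) joint positions in metres.
frame = {
    "hip_center": (0.10, 0.90, 2.00),
    "spine_shoulder": (0.10, 1.40, 2.00),
    "hand_right": (0.60, 1.40, 1.80),
}
aligned = normalize_skeleton(frame)
```

After this step, skeletons recorded at different distances from the camera, or from participants of different heights, share a common origin and scale, which is one way to reduce device- and viewpoint-dependent variation.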
Data standardization and normalization is another common processing step, such as the min–max normalization method used in [
29], which helps to eliminate the effects of different scales and allows various features to be compared on the same magnitude. To improve data quality, some studies have used filtering and noise removal techniques, such as the Kalman filter and low-pass Butterworth filter used in [
30], which effectively remove noise and improve signal quality. Feature extraction and selection also play an important role in machine learning, as evidenced by several articles in the literature [
27,
33]. These techniques help to reduce data dimensionality and improve the efficiency and generalization of the model.
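Two of the steps named above, min–max normalization and Kalman filtering, can be sketched in a few lines. This is a generic textbook illustration, not the implementation used in the cited studies; the scalar random-walk model and the variance values `process_var` and `measurement_var` are assumptions chosen for the example.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant signal: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

def kalman_filter_1d(measurements, process_var=1e-3, measurement_var=1e-1):
    """Scalar Kalman filter assuming the true value follows a random walk."""
    estimate = measurements[0]
    error_var = 1.0
    filtered = [estimate]
    for z in measurements[1:]:
        # Predict: the state is assumed unchanged, but uncertainty grows.
        error_var += process_var
        # Update: blend prediction and measurement according to the Kalman gain.
        gain = error_var / (error_var + measurement_var)
        estimate += gain * (z - estimate)
        error_var *= 1.0 - gain
        filtered.append(estimate)
    return filtered

normalized = min_max_normalize([2, 4, 6])
noisy = [1.0, 1.2, 0.8, 1.1, 0.9]  # illustrative noisy joint coordinate
filtered = kalman_filter_1d(noisy)
```

A larger `measurement_var` relative to `process_var` makes the filter trust new measurements less and therefore smooths more strongly.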
For studies involving multimodal data, data synchronization and fusion become key issues. In [
31,
32], the authors explored the fusion of accelerometer and gyroscope measurements and visual synchronization via timestamps, respectively. In addition, to increase the diversity of training samples and improve the robustness of the model, ref. [
35] employed data augmentation techniques, which are particularly useful in deep learning model training because they alleviate the problem of insufficient data. Some studies such as [
28] have also mentioned the processing of feature transformations, including operations such as dimensionality reduction and feature combination, which help to extract more meaningful feature representations.
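Timestamp-based synchronization of multimodal streams, as mentioned above, can be illustrated by matching each camera frame to the nearest IMU sample in time. The function name `match_nearest`, the tolerance `max_gap`, and the sample rates are hypothetical choices for this sketch and do not describe the procedure of any reviewed study.

```python
from bisect import bisect_left

def match_nearest(frame_times, imu_times, max_gap=0.02):
    """For each frame timestamp, return the index of the nearest IMU sample.

    `imu_times` must be sorted in ascending order. If the nearest sample is
    further away than `max_gap` seconds, None is returned for that frame.
    """
    matches = []
    for t in frame_times:
        i = bisect_left(imu_times, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(imu_times)]
        best = min(candidates, key=lambda j: abs(imu_times[j] - t), default=None)
        if best is not None and abs(imu_times[best] - t) <= max_gap:
            matches.append(best)
        else:
            matches.append(None)
    return matches

imu_times = [i / 100 for i in range(10)]   # illustrative 100 Hz IMU, 0.00 to 0.09 s
frame_times = [0.000, 0.033, 0.066, 0.500]  # illustrative ~30 fps frames plus one outlier
pairs = match_nearest(frame_times, imu_times)
print(pairs)  # [0, 3, 7, None]
```

Frames without a sufficiently close IMU sample are flagged rather than silently paired, so they can be dropped or interpolated before fusion.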
These diverse data processing methods cover the whole process from raw data acquisition to feature extraction, which improves the data quality and provides more reliable and effective inputs for subsequent algorithmic models. However, from the reviewed literature, most studies focus more on implementing specific tasks without fully exploring the impact of data processing methods on the results and the potential optimization space. This suggests that the standardization of data processing techniques has not yet been fully achieved, and future in-depth exploration in this area is needed.
Researchers can further combine advanced techniques, such as automated feature engineering with multimodal data fusion methods, to reduce data noise and enable real-time data processing in a broader range of clinical scenarios. In addition, several studies in the literature (e.g., [
30,
36]) have shown that data fidelity and maximized extraction of helpful information are particularly critical in practical applications. Therefore, developing more generalized and adaptable processing flows will be key to improving the accuracy and applicability of depth camera-based motion analysis in the future.
3.4. RQ 4: Algorithm
As
Table 6 shows, each of these algorithms excels in different application scenarios, and together they have contributed to the field’s rapid development. In depth camera-based physiotherapy movement assessment, the selection and design of algorithms are crucial and directly affect the accuracy and efficiency of movement recognition, assessment, and analysis. Based on a systematic review of the existing literature, as shown in
Figure 7, algorithms can be broadly categorized into three groups: traditional machine learning algorithms, deep learning algorithms, and dedicated algorithms for specific tasks:
Traditional machine-learning algorithms have been widely used in several studies. For instance, random forest (RF) [
42], logistic regression (LR) [
43], support vector machine (SVM) [
44], and principal component analysis (PCA) [
45] are favored for their interpretability and computational efficiency. PCA, nonlinear PCA (NLPCA), and LR were combined to identify movement strategies in patients with low back pain [
30]. Raza et al. [
27] used RF, LR, and LSTM algorithms for human pose estimation on a structured lower limb motion dataset, where the RF approach achieved the best performance, with a score of 99.8%. Ref. [
34] conducted a comparative study on the performance of different machine learning algorithms for behavioral classification tasks. Using the combined feature set of pose, pose derivative, and dynamic features, the MLP achieved the best classification accuracy of 52.3%, followed by RF at 51.8% and SVM at 47.2%. The study demonstrated that as the feature set gradually expanded from single pose features to complete pose-derivative dynamic feature combinations, all algorithms showed significant performance improvements, highlighting how feature combination diversity plays a crucial role in enhancing classification results.
With the rapid development of deep learning techniques, more studies are using deep neural network models. Architectures such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory networks (LSTMs) show significant advantages when dealing with complex time-series motion data. For example, Maskeliūnas et al. [
10] showed that CNNs achieved recognition accuracies of 60.7–93.8% in human posture and motion analysis tasks, whereas traditional random forest (RF) methods achieved accuracies of 86–100%. A hybrid deep learning (HDL) approach combining CNN, RNN, and CNN-GRU models improved the detection and recognition accuracy of upper limb rehabilitation movements: the CNN model alone reached 98%, and the hybrid CNN-LSTM and CNN-GRU models reached 99% and 100%, respectively [
29]. In another study, researchers developed an innovative approach combining convolutional neural networks (CNN) and random forest (RF) classifiers to estimate the human body’s center of mass (CoM). This hybrid method demonstrated remarkable performance, achieving high sensitivity and specificity rates exceeding 80% for four-level classification and over 90% for binary classification [
35]. These deep learning methods automatically extract deep features from raw data, significantly reducing the workload of manual feature engineering and improving model generalization, which is particularly suitable for processing large-scale and high-dimensional visual data.
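Sensitivity and specificity, the metrics reported for the CoM classification above, are computed from the confusion matrix of a binary classifier. The sketch below uses made-up labels purely to show the calculation; it does not reproduce any result from the cited study.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")  # true negative rate
    return sensitivity, specificity

# Illustrative labels: 4 positive and 6 negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

For multi-level classification, the same calculation is typically applied per class in a one-versus-rest fashion; whether the cited study did so is not stated in the text.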
In addition, some researchers have developed specialized algorithms for specific physiotherapy tasks. For example, imitation learning is used to achieve adaptive learning for multifunctional upper limb rehabilitation [
11], while the effort-based parameterization method (EBPM) provides a theoretical basis for a home rehabilitation guidance system [
13]. Though narrow in application, these task-oriented algorithms often provide precise and efficient solutions, reflecting the researchers’ understanding of actual clinical needs. Some studies have also adopted an algorithm fusion strategy to exploit the advantages of different algorithms. For example, combining SVM, RF, multilayer perceptrons (MLPs), and cascaded convolutional neural networks (CCNNs) with semi-supervised learning and an unscented Kalman filter helped identify the key factors of pathological movements. Girase et al. [
34] evaluated the performance of different algorithms on a classification task through a five-fold cross-validation experiment based on data collected in clinical practice. For non-temporal data, random forest achieves an optimal accuracy of 51.8% when fusing features (P+PD+D). For temporal data, US CNN + MLP performs best with an accuracy of 73.4% using the complete feature set, outperforming DTW (63.0%) and ResNet (71.6%). With the enrichment of feature combinations (P to P+PD to P+PD+D), each algorithm’s performance generally improves, highlighting the positive effect of feature diversity on the classification effect. Wagner et al. [
16] integrated various algorithms, including KD, CH, and FV, with the Zebris FDM platform to estimate gait parameters accurately. Among these methods, the KD algorithm performed best, with the highest accuracy of 81.8%. This fusion strategy improves the performance and stability of the model, providing new ideas for solving complex physiotherapy problems.
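Dynamic time warping (DTW), one of the temporal baselines mentioned above, measures the distance between two sequences that may differ in speed or length. The following is a minimal textbook sketch for one-dimensional sequences with an absolute-difference cost; it is an assumption-laden illustration and not the implementation evaluated in the cited study.

```python
def dtw_distance(a, b):
    """Classic dynamic-programming DTW distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best alignment cost of a[:i] and b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# The same movement performed at two speeds aligns with zero cost.
slow = [1, 2, 2, 3]
fast = [1, 2, 3]
print(dtw_distance(fast, slow))  # 0.0
```

Because the alignment tolerates local stretching, a patient who performs an exercise more slowly than the reference is not penalized for speed alone.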
Figure 8 shows how the reviewed research is distributed among the three types of algorithms: traditional machine learning algorithms account for 44.4%, deep learning algorithms for 41.7%, and specialized task algorithms for 13.9%. This distribution indicates that traditional and deep learning algorithms are the mainstay of current research, while specialized task algorithms focus more on specific clinical needs. As described in
Table 7:
Traditional Machine Learning Algorithms: These algorithms perform well on small, structured datasets such as Sit-to-Stand or TUG test tasks. However, they have limited performance on high-dimensional and nonlinear data. Their main advantage is their high interpretability, making them valuable in clinical scenarios requiring a clear basis for decision-making.
Deep Learning Algorithms: With their ability to process large-scale and complex data, deep learning algorithms are excellent in analyzing time-series motion data, such as high-precision upper limb rehabilitation task detection using CNN and RNN. However, they require high computational resources and lack interpretability. In the future, clinical trust can be improved by introducing interpretable AI methods such as SHAP or LIME.
Dedicated Algorithms: Algorithms designed for specific tasks show high accuracy and relevance, e.g., the application of imitation learning in adaptive rehabilitation. However, due to their high specificity, they may need to be redesigned when scaling to a wider range of scenarios.
In general, depth camera-based movement analysis for physiotherapy shows a trend of diversification, specialization, and convergence in algorithm selection and design. Researchers flexibly use existing machine learning and deep learning algorithms and develop innovative solutions for specific application scenarios. This diversified strategy has effectively promoted technological advancements and improvements in clinical practice. Beyond the algorithms themselves, data pre-processing and post-processing play equally important roles in their application. Specifically, coordinate system transformation [
16], feature selection and hyperparameter tuning [
27], min–max normalization [
29], Kalman filtering and Butterworth low-pass filtering [
30] improve the quality of the data and performance of the algorithms. The OpenPose library, for instance, is used to process video frames from RGB sensors for joint estimation [
36]. These processing techniques complement the core algorithm, forming a complete analysis flow.
In the research of depth camera-based physiotherapy movement assessment, in addition to the optimization and innovation of the algorithms themselves, the following research directions are worth exploring in depth:
Model Interpretability and Trustworthiness: Although deep learning algorithms perform well in complex data scenarios due to their powerful feature extraction capabilities, their inherent “black box” character makes their decisions hard to interpret, which undermines trust in clinical application scenarios. To address this challenge, future research needs to find a balance between model performance and interpretability. Specifically, explainable AI (XAI) techniques can be introduced to elucidate the decision rationale through feature visualization tools or auxiliary explanatory models. For example, techniques such as SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) can enhance the interpretability of deep learning models, making them more likely to gain the trust of clinicians.
Generalization Capability and Task Customization: Although the current dedicated algorithms perform well in specific scenarios, their generalization ability remains to be verified. Future research should focus on developing generalized algorithms that can adapt to diverse tasks while maintaining high performance. Through techniques such as transfer learning and multi-task learning, the cross-scenario adaptability of algorithms can be enhanced while maintaining the advantages of task-customized design to solve real-world problems effectively in specific scenarios.
Data Scarcity and Transfer Learning: The acquisition of physical therapy data faces the dual challenges of high labeling costs and strict privacy protection requirements. In this context, transfer learning and few-shot learning provide feasible solutions to the data scarcity problem. For example, transferring deep learning models pre-trained on large-scale datasets to new rehabilitation tasks can significantly reduce the amount of data needed in the target domain while preserving model performance.
Real-Time Processing and Edge Computing: With the rapid development of the Internet of Things (IoT) and edge computing technologies, developing efficient and lightweight real-time processing algorithms has become the key to meeting clinical needs. Realizing low-latency and high-precision real-time feedback, especially in home rehabilitation scenarios, puts higher requirements on algorithms. Future research should focus on optimizing the efficiency of the algorithms so that they can run stably on edge devices and provide immediate motion assessment and feedback to patients.
Ethics and Privacy Protection: Patient privacy protection and data security are important topics that cannot be ignored when applying physical therapy algorithms. Future research must introduce privacy-preserving techniques, such as federated learning, during data collection and processing to improve model performance while ensuring data security. In addition, improving the transparency and fairness of algorithms is an important direction for enhancing their ethical acceptability.
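To make the federated learning idea above concrete, the sketch below shows the aggregation step of federated averaging: each site trains locally and shares only model parameters, which a server combines weighted by local dataset size. The function name, the flat parameter lists, and the example numbers are assumptions for illustration; none of the reviewed studies is reported to use this procedure.

```python
def federated_average(client_weights, client_sizes):
    """Average per-client parameter lists, weighted by each client's sample count."""
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one sample count per client and at least one client")
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
        for k in range(n_params)
    ]

# Two hypothetical clinics with 1 and 3 local recordings, respectively.
clinic_params = [[1.0, 2.0], [3.0, 4.0]]
clinic_sizes = [1, 3]
global_params = federated_average(clinic_params, clinic_sizes)
print(global_params)  # [2.5, 3.5]
```

Raw patient recordings never leave the local site in this scheme; only the parameter vectors are exchanged.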
3.5. RQ 5: Feature
As shown in
Table 8, the summary of current research and innovations in physiotherapy and rehabilitation highlights the integration of advanced technologies and methods. Incorporating computer vision technology, depth cameras, and other sensors has been pivotal in enhancing the effectiveness and accuracy of rehabilitation treatments. These technologies offer cost-effective training feedback, improve gait assessment, increase the accuracy of human posture estimation, enhance patient engagement and the precision of posture and movement analysis, enable personalized adaptive learning, accelerate post-stroke movement assessment, and support clinical decision-making. Additionally, these technologies are used to develop home rehabilitation protocols and remote physiotherapy guidance systems, promoting continuous patient care and recovery. This systematic review categorizes and discusses the features of using depth camera sensors in physiotherapy movement assessment, focusing on the following aspects:
Integration of Depth Sensors with Other Technologies: Research on integrating depth cameras with other sensors, such as pressure mats, is crucial in physiotherapy movement assessment. For example, ref. [
15] demonstrated efficient balance training feedback by combining depth cameras and pressure mats.
Gait and Posture Assessment: Gait and posture assessment are critical in physiotherapy. Several studies explore the application of depth sensors in these areas. For instance, ref. [
16] utilized depth sensors to improve gait assessment accuracy, enhancing diagnosis and treatment. Ref. [
33] applied Kinect SDK skeletonization to assess the straight leg raise accurately, aiding in lumbar condition diagnosis. Ref. [
34] used machine learning to automatically detect and classify pathological movements from sit-to-stand transitions.
Upper and Lower Limb Rehabilitation: Upper and lower limb rehabilitation is a key research direction in physiotherapy. Ref. [
11] explored a personalized adaptive learning system for upper limb rehabilitation, improving patient outcomes. Ref. [
36] developed a method using RGB-D sensors for precise joint angle estimation in home rehabilitation. Ref. [
13] implemented a Kinect-based remote physiotherapy guidance system, promoting continuous care.
Applications of Machine Learning and Deep Learning: Machine learning and deep learning are widely applied in physiotherapy assessments. Ref. [
27] improved human pose estimation using the LogRF and random forest algorithms. Ref. [
28] introduced a hybrid quantum neural network to enhance the speed and accuracy of post-stroke movement assessments. Ref. [
30] utilized unsupervised learning to analyze motion capture data, identifying movement strategies in low back pain patients. Ref. [
29] combined deep learning models to enhance spatiotemporal feature modeling in stroke rehabilitation.
Home Rehabilitation and Remote Monitoring: Home rehabilitation and remote monitoring are current research hotspots. Ref. [
31] proposed a home rehabilitation protocol for post-knee replacement using convenient technology. Ref. [
12] developed a system for dynamic monitoring and correction of shoulder movements using Kinect. Ref. [
8] used RGB-D cameras to analyze compensatory trunk movements, improving upper limb rehabilitation strategies.
Clinical Applications and Validation: Several studies validate the effectiveness of depth sensors in clinical settings. Ref. [
32] accurately assessed movement limitations caused by spinal arthritis using RGB-D cameras, supporting better clinical decision-making. Ref. [
37] demonstrated the reliability and speed of kinematic assessments using RGB-D cameras in clinical environments.
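Several of the applications listed above rely on joint angles derived from skeleton keypoints. As a minimal sketch, the angle at a joint can be computed from three 3D points with the dot product; the point values below are made up, and the cited studies may compute angles differently.

```python
import math

def joint_angle(a, b, c):
    """Angle in degrees at point b, between segments b->a and b->c."""
    ba = [p - q for p, q in zip(a, b)]
    bc = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(ba, bc))
    norm = math.sqrt(sum(x * x for x in ba)) * math.sqrt(sum(x * x for x in bc))
    if norm == 0:
        raise ValueError("zero-length segment; angle is undefined")
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))

# Illustrative shoulder, elbow, and wrist positions forming a right angle at the elbow.
shoulder, elbow, wrist = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)
elbow_angle = joint_angle(shoulder, elbow, wrist)
```

Tracking such an angle over time yields a range-of-motion curve that can be compared against a reference exercise.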
In summary, computer vision-based physiotherapy movement assessment using depth camera sensors has shown diverse applications and significant advancements. These technologies not only demonstrate great potential for enhancing diagnostic accuracy, personalized rehabilitation, and patient engagement but also pave the way for more effective and accessible physiotherapy solutions. Studies indicate that depth sensors are widely applied in gait and posture assessment and in upper and lower limb rehabilitation and, when combined with machine learning and deep learning technologies, have achieved breakthroughs in home rehabilitation and remote monitoring. These studies cover balance training, virtual reality integration, and home rehabilitation, providing real-time, accurate, and personalized feedback mechanisms that improve treatment outcomes and patient participation. Future research should focus on integrating these features into comprehensive systems to further enhance diagnostic accuracy, movement assessment speed, and home rehabilitation efficacy, promoting more efficient and convenient rehabilitation practices. In general, applying depth cameras and advanced algorithms brings innovative solutions to physical therapy, significantly improving the efficiency and coverage of rehabilitation training.
3.6. RQ 6: Scenario
The movement analysis based on depth cameras for physical therapy shows significant value and potential in three major scenarios (see
Table 9): remote, clinical, and local (see
Figure 9). In remote scenarios, the technology breaks through geographical limitations and enables patients to receive real-time rehabilitation guidance and assessment at home, improving the accessibility and continuity of rehabilitation services; in clinical scenarios, it provides medical professionals with accurate exercise data and objective assessment tools, which help to formulate personalized and efficient treatment plans; and in local scenarios, such as at home or in community-based rehabilitation centers, the technology supports autonomous training and daily monitoring and enhances patients’ self-management ability. The integration of these three scenarios optimizes the allocation of rehabilitation resources and realizes an all-round multilevel rehabilitation care system, significantly improving the overall effect of physical therapy and patient experience. With the advancement of technology and in-depth clinical practice, this multi-scenario application mode is reshaping the traditional rehabilitation concept and promoting the development of physical therapy in the direction of intelligence, personalization, and popularization.
In a remote scenario, the main objective is to provide patients with a convenient home rehabilitation program. With the development of telemedicine technology, remote physical therapy movement assessment based on depth cameras has become a reality. For example, ref. [
10] described BiomacVR, a virtual reality (VR)-based rehabilitation system that combines a VR environment for monitoring physical training with upper limb rehabilitation technology to support precise interaction and improve patient engagement; it is applied in a real-time physiotherapy and wellness system for telerehabilitation. In [
11], the authors proposed an adaptive learning system based on imitation learning for multi-purpose upper extremity rehabilitation that allows patients to perform rehabilitation at home. In [
12], the authors examined the development of a Kinect 2 sensor-based telerehabilitation system that observes and evaluates exercise in patients with shoulder impairments through a web application used for communication between the patient and the therapist and a console application that helps the patient perform the exercise correctly. Ref. [
13] proposed an approach based on effort parameterization for monitoring a home rehabilitation system to ensure correctness and adherence to rehabilitation exercises.
In clinical scenarios, physiotherapy movement assessment research focuses on accurately analyzing and assessing patients’ motion status to provide key data support for clinical diagnosis and treatment. For example, ref. [
16] achieved accurate analysis of patients’ gait by estimating gait parameters, which effectively assists clinical diagnosis and the formulation of treatment plans. Artificial intelligence techniques are also widely applied in this setting; for example, ref. [
10] utilized neural network algorithms to observe human skeletal motion through visible information, which can accurately analyze patient posture and movement patterns. In addition, ref. [
29] applied deep learning techniques to detect and recognize upper limb rehabilitation exercises, which helps clinicians assess patients’ rehabilitation progress. In disease-specific studies, such as [
30], machine learning methods have been applied to identify exercise strategies for patients with low back pain, providing a scientific basis for developing clinical treatment programs. Ref. [
32] validated and analyzed patients’ trunk movement limitations by synchronizing and visualizing datasets, helping clinicians gain insight into patients’ movement abilities and limitations. In [
33], the authors employed advanced detection and tracking techniques, combining calibration, skeletonization process, and feature extraction, to achieve monitoring and analysis of key movements in the rehabilitation process, providing detailed movement data support for clinical decision-making. Ref. [
34] applied semi-supervised learning algorithms to estimate the joint center position through the standard Kinect 2 body tracking library, successfully identifying and classifying critical factors of pathological movements and providing more accurate data support for rehabilitation treatment.
In local scenarios, physiotherapy exercise and assessment research focuses on using advanced algorithms and data processing techniques to automate the evaluation of patient rehabilitation training and posture recognition, thereby improving rehabilitation outcomes and patient compliance. For example, ref. [
15] provided visual feedback to patients by acquiring and processing joint displacement data in real time to help them perform effective balance training at home. Ref. [
27] applied AI algorithms combined with MediaPipe pose labeling, feature selection, and hyperparameter tuning to achieve a high-precision estimation of human posture, which provides important data support for rehabilitation training. Ref. [
28] realized automated evaluation of exercises through high-quality neural network alignment of length and center and feature transformation, which significantly improves the efficiency and effectiveness of rehabilitation training. In rehabilitation after specific surgeries, ref. [
31] incorporated accelerometer and gyroscope measurement techniques to perform home rehabilitation training evaluation after total knee replacement. This makes it possible to monitor patient rehabilitation progress in the home environment. Ref. [
35] accurately estimated the patient’s center of mass position through data augmentation techniques, providing a scientific basis for balance training and assessment. Ref. [
36] achieved an accurate estimation of joint position by processing video frame data from RGB sensors, providing strong support for motion analysis. In addition, ref. [
37] created virtual skeletal representations to assess patients’ ability to perform functional tasks, further extending the scope and depth of local rehabilitation assessment. These studies indicate that patient rehabilitation training can be effectively monitored and evaluated in local scenarios with advanced algorithms and data processing techniques, which can improve rehabilitation outcomes and patient compliance.
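To make the notion of center of mass estimation from skeleton data more concrete, the sketch below computes a weighted average of tracked joint positions. It is a generic illustration, not the method of any reviewed study; the joint names and the mass weights are hypothetical placeholders and would need to be replaced by values from an anthropometric model.

```python
def center_of_mass(joints, weights):
    """Weighted mean of 3D joint positions.

    joints:  dict mapping joint name -> (x, y, z) position
    weights: dict mapping joint name -> relative mass weight (hypothetical)
    """
    total = sum(weights[name] for name in joints)
    if total <= 0:
        raise ValueError("total weight must be positive")
    # Accumulate each coordinate separately, weighted by the joint's share.
    return tuple(
        sum(weights[name] * pos[axis] for name, pos in joints.items()) / total
        for axis in range(3)
    )


# Hypothetical single-frame skeleton (metres) and illustrative weights.
frame = {
    "head": (0.0, 1.7, 2.0),
    "pelvis": (0.0, 1.0, 2.0),
    "left_foot": (-0.1, 0.0, 2.0),
    "right_foot": (0.1, 0.0, 2.0),
}
illustrative_weights = {"head": 1.0, "pelvis": 6.0, "left_foot": 1.5, "right_foot": 1.5}
print(center_of_mass(frame, illustrative_weights))
```

Tracking this point across frames gives a sway trajectory, which is one simple quantity on which balance feedback can be based.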
3.7. RQ 7: Target
In physiotherapy movement assessment, the selection of an appropriate target for the study is critical. This selection is directly related to the type of visual data to be captured and its depth, which affects the design and application of deep learning models. By carefully selecting target body regions, researchers can ensure that the acquired movement data is pertinent and complete, providing high-quality input for movement recognition and assessment.
As depicted in
Figure 10, the existing literature shows that researchers generally focus on the main body parts, such as the entire body, the upper limb, and the lower limb, as well as specific joint parts, such as the knee, ankle, and shoulder (see
Table 4). For instance, five studies [
13,
15,
27,
28,
35] have delved into full-body movement recognition and evaluation and focused on how to capture body movement data using a depth camera and analyze the data using deep learning models.
The upper limbs are another key focus area, with four studies [
8,
10,
11,
29] concentrating on capturing movement data at the arm and elbow joints to guide upper limb rehabilitation. Similarly, the lower limbs have received extensive attention, including the foot, ankle, knee, and hip. Six publications [
16,
31,
33,
34,
36,
37] aim to provide assessment and guidance for lower limb rehabilitation.
Additionally, three studies [
30,
32,
34] emphasize the lower back and trunk, which are vital for assessing overall physical mobility. These investigations underline the significance of trunk and lumbar regions in evaluating biomechanical function and rehabilitation outcomes.
Beyond isolated target regions, exploring the biomechanical synergies and kinematic couplings between different body parts offers a more integrated perspective on movement dynamics. For instance, upper and lower limb coordination during complex physiotherapy exercises can reveal compensatory patterns or inefficiencies. Similarly, trunk and lower limb interactions during gait or balance tasks are critical for holistic movement quality assessment. This interconnected analysis can inform the development of comprehensive movement assessment models, enabling the design of physiotherapy protocols that consider the cascading effects of rehabilitation on multiple body parts. Such models could enhance our understanding of how targeted rehabilitation impacts overall biomechanics, improving intervention effectiveness.
In summary, the reviewed literature addresses movement recognition and assessment for the whole body, key limbs, and specific joints, emphasizing the importance of targeted selection in efficient data acquisition and analysis. By integrating an understanding of biomechanical synergies and kinematic couplings, researchers can develop more holistic and practical deep-learning models, fully leveraging the potential of depth cameras and other sensing technologies for physiotherapy applications.
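One simple way to quantify the kinematic coupling discussed above is the Pearson correlation between two joint-angle time series, for example trunk lean and knee flexion recorded over the same frames. The sketch below is an illustrative assumption rather than a method used in the reviewed studies, and the angle values are made up for demonstration.

```python
import math


def pearson_coupling(series_a, series_b):
    """Pearson correlation between two equally long joint-angle series."""
    if len(series_a) != len(series_b) or len(series_a) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_a = sum(series_a) / len(series_a)
    mean_b = sum(series_b) / len(series_b)
    # Covariance and spread, both left unnormalised since the factors cancel.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(series_a, series_b))
    spread_a = math.sqrt(sum((a - mean_a) ** 2 for a in series_a))
    spread_b = math.sqrt(sum((b - mean_b) ** 2 for b in series_b))
    if spread_a == 0.0 or spread_b == 0.0:
        raise ValueError("a constant series has no defined correlation")
    return cov / (spread_a * spread_b)


# Hypothetical per-frame angles in degrees.
trunk_lean = [2.0, 4.0, 6.0, 8.0, 10.0]
knee_flexion = [10.0, 20.0, 30.0, 40.0, 50.0]
print(pearson_coupling(trunk_lean, knee_flexion))  # close to 1: strongly coupled
```

A value near 1 or -1 suggests that the two segments move together, which may point to a compensatory pattern; a value near 0 suggests largely independent movement. Correlation captures only linear, zero-lag coupling, so it is a starting point rather than a full coordination analysis.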
3.8. RQ 8: Problem Statement
As shown in
Table 10, most studies indicate that using computer vision and depth sensor technology for patient movement analysis and rehabilitation training has become a significant development direction in modern physiotherapy and assessment. However, the problem statements collected in this systematic review reveal numerous challenges in applying current technologies and methods. To better understand the bottlenecks in current research and future development directions, this paper categorizes and discusses these problems as follows:
Equipment and Feedback Mechanism Issues: Several studies [
13,
15,
35] highlight that current balance training and motion analysis equipment is often expensive and bulky, limiting its use in resource-constrained clinical environments. Effective on-demand balance assessment tools are also lacking in physiotherapy, further restricting treatment flexibility and real-time feedback capabilities.
Data Utilization and Diagnostic Accuracy Issues: Traditional gait analysis and kinematic assessment methods fail to utilize depth sensor data effectively, leading to decreased diagnostic accuracy [
16,
37]. For example, conventional methods cannot accurately capture the straight leg raise motion [
33], complicating lumbar assessments. Furthermore, home rehabilitation methods inaccurately estimate joint angle ranges [
36], which affects patient treatment outcomes.
Accuracy Issues in Posture Estimation and Motion Analysis: Posture estimation in physiotherapy often lacks accuracy [
10,
27], impacting the correction of exercises and rehabilitation effectiveness. Traditional motion analysis tools lack precision and interactivity, failing to meet the demands of efficient rehabilitation.
Personalization and Dynamic Adaptability Issues: Standard upper limb rehabilitation devices and post-stroke assessment systems lack dynamic adaptation to patient progress [
11,
28], limiting their effectiveness. Existing shoulder rehabilitation systems lack precise and interactive exercise monitoring [
12], making personalized treatment difficult. Inadequate motion data analysis limits low back pain treatment strategies [
30].
Spatiotemporal Feature Modeling Issues: Insufficient spatiotemporal feature modeling in stroke rehabilitation [
29,
31] affects the effectiveness of rehabilitation exercises. Moreover, current methods also fall short in analyzing compensatory trunk movements [
8], further impacting upper limb rehabilitation outcomes.
Pathological Diagnosis and Assessment Tool Issues: Current automated diagnostic tools for spine, hip, and knee pathologies [
34], as well as tools to assess movement limitations in ankylosing spondylitis [
32], are inadequate. These issues indicate that existing automated diagnostic and assessment tools do not yet meet clinical needs and require further development and optimization.
Future research and technological innovation in computer vision-based physical therapy exercise assessment can focus on several key directions to address challenges in ethics, privacy, model interpretability, clinical trust, and generalization. First, for privacy protection, differential privacy, federated learning, and encrypted computation techniques can be employed to ensure data security, while transparent ethical guidelines and compliance frameworks can enhance patient trust in the technology. Second, for model interpretability and clinical trust, developing visual AI decision tools and integrating multimodal data such as vision, EMG signals, and pressure sensor data can improve the reliability and interpretability of results. Third, for generalization and applicability, transfer learning, data augmentation, and personalized model fine-tuning can be used to adapt the system to different patient populations and clinical environments.
Meanwhile, emerging sensor technologies such as high-resolution 3D cameras, flexible sensors, and smart wearable devices can be explored to provide more accurate motion capture. Regarding algorithm design, spatiotemporal perception models, few-shot learning, and self-supervised learning can address limited data volume, and real-time performance can be optimized through lightweight model design and edge computing. Finally, user-friendly interactive interfaces and personalized feedback systems can be developed and combined with virtual reality or augmented reality technologies to provide collaborative and interactive physical therapy experiences for patients and clinicians. These directions could advance the field and improve the technology’s effectiveness and trustworthiness in practical applications.
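As an illustration of the federated learning direction mentioned above, the sketch below shows the aggregation step of federated averaging: each site (for example, a clinic or a patient’s home device) trains locally and shares only model parameters, which a server combines weighted by local sample counts. The parameter vectors and sample counts are hypothetical, and a real system would add secure communication and, where needed, differential privacy noise.

```python
def federated_average(client_params, client_sizes):
    """Combine per-client parameter vectors, weighted by local dataset size."""
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    n_params = len(client_params[0])
    if any(len(p) != n_params for p in client_params):
        raise ValueError("all clients must share the same model shape")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    # Each global parameter is the size-weighted mean of the client values.
    return [
        sum(params[i] * size for params, size in zip(client_params, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical sites with 100 and 300 locally recorded exercise samples.
site_a = [1.0, 2.0]
site_b = [3.0, 4.0]
print(federated_average([site_a, site_b], [100, 300]))  # [2.5, 3.5]
```

Because raw video and skeleton recordings never leave the site, this design reduces the exposure of sensitive patient data, although shared parameters can still leak information without further safeguards.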