A Comprehensive Methodological Survey of Human Activity Recognition Across Diverse Data Modalities
Abstract
1. Introduction
1.1. Rationale
1.2. Objective
1.2.1. Research Gaps and New Research Challenges
1.2.2. Our Contribution
- Comprehensive Review with Diverse Modality: We conduct a thorough survey of RGB-based, skeleton-based, sensor-based, and fusion HAR-based methods, focusing on the evolution of data acquisition, environments, and human activity portrayals from 2014 to 2025.
- Dataset Description: We provide a detailed overview of benchmark public datasets for RGB, skeleton, sensor, and fusion data, highlighting the latest reported accuracy on each with references.
- Unique Process: Our study covers feature representation methods, common datasets, challenges, and future directions, emphasizing the extraction of distinguishable action features from video data despite environmental and hardware limitations. We also include the mathematical formulations of the benchmark deep learning models for each modality, ranging from 3D CNNs to Multi-View Transformers for RGB video and from GCN to EMS-TAGCN for skeleton sequences.
- Identification of Gaps and Future Directions: We identify significant gaps in the current research and propose future research directions supported by the latest performance data for each modality.
- Evaluation of System Efficacy: We assess existing HAR systems by analyzing their recognition accuracy and providing benchmark datasets for future development.
- Guidance for Practitioners: Our review offers practical guidance for developing robust and accurate HAR systems, providing insights into current techniques, highlighting challenges, and suggesting future research directions to advance HAR system development.
1.2.3. Research Questions
- What are the main difficulties faced in human activity recognition?
- What are the major open challenges faced in human activity recognition?
- What are the major algorithms involved in human activity recognition?
1.3. Organization of the Work
2. Methods
2.1. Article Search Protocol
- “Human Activity Recognition” OR “Human Action Recognition”
- “Computer Vision”, “RGB”, “Skeleton”, “Sensor”, “Multimodal”, “Deep Learning”, “Machine Learning”
2.2. Eligibility Criteria
- Publications from 2014 to 2025;
- Peer-reviewed journals, conference papers, book chapters, and lecture notes;
- Focus on HAR using RGB, skeleton, sensor, fusion, or multimodal methods;
- Emphasis on the evolution of data acquisition, environments, and human activity portrayals.
- Exclusion of studies lacking in-depth information about their experimental procedures;
- Exclusion of research articles where the complete text is not accessible, both in physical and digital formats;
- Exclusion of research articles that include opinions, keynote speeches, discussions, editorials, tutorials, remarks, introductions, viewpoints, and slide presentations.
2.3. Article Selection Process
- IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI);
- IEEE Transactions on Image Processing (TIP);
- IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR);
- IEEE International Conference on Computer Vision (ICCV);
- Springer, Elsevier, MDPI, Frontiers, etc.
2.4. Critical Appraisal of Individual Sources
- Abstract review;
- Methodology analysis;
- Result evaluations;
- Discussion and conclusions.
- Bibliographic information (author(s), publication year, and venue);
- Dataset characteristics and associated HAR modality (e.g., RGB, skeleton, sensor, fusion);
- Feature extraction techniques and classification models employed;
- Evaluation metrics and reported benchmark performance.
3. RGB-Data Modality-Based Action Recognition Methods
3.1. RGB-Based Datasets of HAR
Dataset | Modalities | Year | Classes | Subjects | Samples | Latest Accuracy |
---|---|---|---|---|---|---|
UPCV [75] | Skeleton | 2014 | 10 | 20 | 400 | 99.20% [76] |
ActivityNet [67] | RGB | 2015 | 203 | - | 27,208 | 94.7% [68] |
Kinetics-400 [69] | RGB | 2017 | 400 | - | 306,245 | 92.1% [71] |
AVA [72] | RGB | 2017 | 80 | - | 437 | 83.0% [73] |
EPIC-Kitchens 55 [77] | RGB | 2018 | 149 | 32 | 39,596 | - |
AVE [78] | RGB | 2018 | 28 | - | 4143 | - |
Moments in Time [74] | RGB | 2019 | 339 | - | 1,000,000 | 51.2% [71] |
Kinetics-700 [70] | RGB | 2019 | 700 | - | 650,317 | 85.9% [71] |
RareAct [79] | RGB | 2020 | 122 | 905 | 2024 | 60.80% [80] |
HiEve [81] | RGB, Skeleton | 2020 | - | - | - | 95.5% [82] |
MSRDailyActivity3D [83] | RGB, Skeleton | 2012 | 16 | 10 | 320 | 97.50% [84] |
N-UCLA [85] | RGB, Skeleton | 2014 | 10 | 10 | 1475 | 99.10% [86] |
Multi-View TJU [87] | RGB, Skeleton | 2014 | 20 | 22 | 7040 | - |
UTD-MHAD [88] | RGB, Skeleton | 2015 | 27 | 8 | 861 | 95.0% [89] |
UWA3D Multiview II [90] | RGB, Skeleton | 2015 | 30 | 10 | 1075 | - |
NTU RGB+D 60 [91] | RGB, Skeleton | 2016 | 60 | 40 | 56,880 | 97.40% [86] |
PKU-MMD [92] | RGB, Skeleton | 2017 | 51 | 66 | 10,076 | 94.40% [93] |
NEU-UB [94] | RGB | 2017 | 6 | 20 | 600 | - |
Kinetics-600 [95] | RGB, Skeleton | 2018 | 600 | - | 595,445 | 91.90% [71] |
RGB-D Varying-View [96] | RGB, Skeleton | 2018 | 40 | 118 | 25,600 | - |
NTU RGB+D 120 [97] | RGB, Skeleton | 2019 | 120 | 106 | 114,480 | 95.60% [86] |
Drive&Act [98] | RGB, Skeleton | 2019 | 83 | 15 | - | 77.61% [99] |
MMAct [100] | RGB, Skeleton | 2019 | 37 | 20 | 36,764 | 98.60% [101] |
Toyota-SH [102] | RGB, Skeleton | 2019 | 31 | 18 | 16,115 | - |
IKEA ASM [103] | RGB, Skeleton | 2020 | 33 | 48 | 16,764 | - |
ETRI-Activity3D [104] | RGB, Skeleton | 2020 | 55 | 100 | 112,620 | 95.09% [105] |
UAV-Human [106] | RGB, Skeleton | 2021 | 155 | 119 | 27,428 | 55.00% [107] |
3.2. Handcrafted Features with ML-Based Approach
3.2.1. Holistic Feature Extraction
3.2.2. Local and Global Representation
3.2.3. Classification Approach
3.3. End-to-End Deep Learning Approach
3.3.1. Two-Stream-Based Network
3.3.2. Multi-Stream-Based Network
3.3.3. 3D CNN and Hybrid Networks
3.3.4. Other Models
3.4. Mathematical Derivation of the Benchmark RGB-Based 3DCNN Method
3.4.1. Three-Dimensional CNN
3.4.2. C3D [23]
3.4.3. I3D (Inflated 3D ConvNet)
3.4.4. S3D
3.4.5. R3D, R(2+1)D
3.4.6. P3D ResNet
3.4.7. SlowFast Networks
3.4.8. X3D
3.4.9. Vision Transformer (ViT)
3.4.10. Video Vision Transformer (ViViT)
3.4.11. Multiview Transformers for Video Recognition (MVT)
3.4.12. UniFormer-Based 3DCNN
3.4.13. VideoMAE-Based 3DCNN
3.4.14. InternVideo 3DCNN Model
4. Skeleton Data Modality-Based Action Recognition Method
4.1. Skeleton-Based HAR Dataset
4.2. Pose Estimation
4.2.1. Two-Dimensional Human Pose Estimation-Based Methods
4.2.2. Three-Dimensional Human Pose Estimation-Based Methods
Author | Year | Dataset Name | Modality | Method | Classifier | Accuracy [%] |
---|---|---|---|---|---|---|
Veeriah et al. [195] | 2015 | MSRAction3D, KTH-1 (CV), KTH-2 (CV) | Skeleton | Differential RNN | Softmax | 92.03, 93.96, 92.12 |
Xu et al. [116] | 2016 | MSRAction3D, UTKinect, Florence3D Action | Skeleton | SVM with PSO | SVM | 93.75, 97.45, 91.20 |
Zhu et al. [196] | 2016 | SBU Kinect, HDM05, CMU | Skeleton | Stacked LSTM | Softmax | 90.41, 97.25, 81.04 |
Li et al. [197] | 2017 | UTD-MHAD, NTU-RGBD (CV), NTU-RGBD (CS) | Skeleton | CNN | Maximum Score | 88.10, 82.3, 76.2 |
Soo et al. [198] | 2017 | NTU-RGBD (CV), NTU-RGBD (CS) | Skeleton | Temporal CNN | Softmax | 83.1, 74.3 |
Liu et al. [160] | 2017 | NTU-RGBD (CS), NTU-RGBD (CV), MSRC-12 (CS), Northwestern-UCLA | Skeleton | Multi-stream CNN | Softmax | 80.03, 87.21, 96.62, 92.61 |
Das et al. [199] | 2018 | MSRDailyActivity3D, NTU-RGBD (CS), CAD-60 | Skeleton | Stacked LSTM | Softmax | 91.56, 64.49, 67.64 |
Si et al. [200] | 2019 | NTU-RGBD (CS), NTU-RGBD (CV), UCLA | Skeleton | AGCN-LSTM | Sigmoid | 89.2, 95.0, 93.3 |
Shi et al. [201] | 2019 | NTU-RGBD (CS), NTU-RGBD (CV), Kinetics | Skeleton | AGCN | Softmax | 88.5, 95.1, 58.7 |
Trelinski et al. [202] | 2019 | UTD-MHAD, MSR-Action3D | Skeleton | CNN-based | Softmax | 95.8, 77.44, 80.36 |
Li et al. [203] | 2019 | NTU-RGBD (CS), Kinetics (CV) | Skeleton | Actional graph-based CNN | Softmax | 86.8, 56.5 |
Huynh et al. [204] | 2019 | MSRAction3D, UTKinect-3D, SBU-Kinect Interaction | Skeleton | ConvNets | Softmax | 97.9, 98.5, 96.2 |
Huynh et al. [205] | 2020 | NTU-RGB+D, UTKinect-Action3D | Skeleton | PoT2I with CNN | Softmax | 83.85, 98.5 |
Naveenkumar et al. [206] | 2020 | UTKinect-Action3D, NTU-RGB+D | Skeleton | Deep ensemble | Softmax | 98.9, 84.2 |
Plizzari et al. [207] | 2021 | NTU-RGBD 60, NTU-RGBD 120, Kinetics Skeleton-400 | Skeleton | ST-GCN | Softmax | 96.3, 87.1, 60.5 |
Snoun et al. [208] | 2021 | RGBD-HuDact, KTH | Skeleton | VGG16 | Softmax | 95.7, 93.5 |
Duan et al. [209] | 2022 | NTU-RGBD, UCF101 | Skeleton | PYSKL | - | 97.4, 86.9 |
Song et al. [210] | 2022 | NTU-RGBD | Skeleton | GCN | Softmax | 96.1 |
Zhu et al. [211] | 2023 | UESTC, NTU-60 (CS) | Skeleton | RSA-Net | Softmax | 93.9, 91.8 |
Zhang et al. [212] | 2023 | NTU-RGBD, Kinetics-Skeleton | Skeleton | Multilayer LSTM | Softmax | 83.3; 27.8 (Top-1), 50.2 (Top-5) |
Liu et al. [213] | 2023 | NTU-RGBD 60 (CV), NTU-RGBD 120 (CS) | Skeleton | LKJ-GSN | Softmax | 96.1, 86.3 |
Liang et al. [214] | 2024 | NTU-RGBD (CV), NTU-RGBD 120 (CS), FineGYM | Skeleton | MTCF | Softmax | 96.9, 86.6, 94.1 |
Karthika et al. [215] | 2025 | NTU-RGBD 60, NTU-RGBD 120, Kinetics-700, Micro-Action-52 | Skeleton | Stacked Ensemble | Logistic Regression | 97.87, 98.0, 97.50, 95.20 |
Sun et al. [216] | 2025 | Self-collected, KTH, UTD-MHAD | Skeleton | Multi-channel fusion | Logistic Regression | 98.16, 92.85, 84.98 |
Mehmood et al. [217] | 2025 | NTU-RGB+D (CS/CV), Kinetics, UCF-101, HMDB-51 | Skeleton | EMS-TAGCN | Logistic Regression | 91.3/97.5, 62.3, 51.24, 72.7 |
4.3. Handcrafted Feature and ML-Based Classification Approach
4.4. End-to-End Deep Learning-Based Approach
4.4.1. CNN-Based Methods
4.4.2. RNN-LSTM-Based Methods
4.4.3. GNN or GCN-Based Methods
4.4.4. Spectral GCN-Based Methods
4.4.5. Spatial GCN-Based Methods
4.5. Mathematical Derivation of the Skeleton-Based Learning Methods
4.5.1. GCN
4.5.2. ST-GCN
4.5.3. STA-GCN
4.5.4. Shift-GCN
4.5.5. InfoGCN
4.5.6. EMS-TAGCN
5. Sensor-Based HAR
Dataset | Year | Sensor Modalities | No. of Sensors | No. of People | No. of Activities | Activity Categories | Latest Performance |
---|---|---|---|---|---|---|---|
HHAR [260] | 2015 | Accelerometer, Gyroscope | 36 | 9 | 6 | Daily living activity, Sports fitness activity | 99.99% [261] |
MHEALTH [262] | 2014 | Accelerometer, Gyroscope, Magnetometer, Electrocardiogram | 3 | 10 | 12 | Atomic activity, Daily living activity, Sports fitness activity | 97.83% [263] |
OPPT [264] | 2013 | Acceleration, Rate of turn, Magnetic field, Reed switches | 40 | 4 | 17 | Daily living activity, Composite activity | 100% [265] |
WISDM [266] | 2011 | Accelerometer, Gyroscope | 1 | 33 | 6 | Daily living activity, Sports fitness activity | 97.8% [267] |
UCIHAR [268] | 2013 | Accelerometer, Gyroscope | 1 | 30 | 6 | Daily living activity | - |
PAMAP2 [269] | 2012 | Accelerometer, Gyroscope, Magnetometer, Temperature | 4 | 9 | 18 | Daily living activity, Sports fitness activity, Composite activity | 94.72% [270], 82.12% [265], 90.27% [267] |
DSADS [271] | 2010 | Accelerometer, Gyroscope, Magnetometer | 45 | 8 | 19 | Daily living activity, Sports fitness activity | 99.48% [272] |
RealWorld [273] | 2016 | Acceleration | 7 | 15 | 8 | Daily living activity, Sports fitness activity | 95% [274] |
Exer. Activity [275] | 2013 | Accelerometer, Gyroscope | 3 | 20 | 10 | Sports fitness activity | - |
UTD-MHAD [88] | 2015 | Accelerometer, Gyroscope, RGB camera, Depth camera | 3 | 8 | 27 | Daily living activity, Sports fitness activity, Composite activity, Atomic activity | 76.35% [276] |
Shoaib [277] | 2014 | Accelerometer, Gyroscope | 5 | 10 | 7 | Daily living activity, Sports fitness activity | 99.86% [278] |
TUD [279] | 2008 | Accelerometer | 2 | 1 | 34 | Daily living activity, Sports fitness activity, Composite activity | - |
SHAR [280] | 2017 | Accelerometer | 2 | 30 | 17 | Daily living activity, Sports fitness activity, Atomic activity | 82.79% [281] |
USC-HAD [282] | 2012 | Accelerometer, Gyroscope | 1 | 14 | 12 | Daily living activity, Sports fitness activity | 97.25% [281] |
Mobi-Act [283] | 2016 | Accelerometer, Gyroscope, Orientation sensors | 1 | 50 | 13 | Daily living activity, Atomic activity | 75.87% [284] |
Motion Sense [285] | 2018 | Accelerometer, Gyroscope | 1 | 24 | 6 | Daily living activity | 95.35% [286] |
van Kasteren [287] | 2011 | Switches, Contacts, Passive infrared (PIR) | 14 | 1 | 10 | Daily living activity, Composite activity | - |
CASAS [288] | 2012 | Temperature, Infrared motion/light sensors | 52 | 1 | 7 | Daily living activity, Composite activity | 88.4% [289] |
Skoda [290] | 2008 | Accelerometer | 19 | 1 | 10 | Daily living activity, Composite activity | 97% [291] |
Widar3.0 [253] | 2019 | Wi-Fi | 7 | 1 | 6 | Atomic activity | 82.18% [292] |
UCI [268] | 2013 | Accelerometer, Gyroscope | 2 | 30 | 6 | Human activity | 95.90% [270] |
HAPT [293] | 2016 | Accelerometer, Gyroscope | 1 | 30 | 12 | Human activity | 92.14% [270], 98.73% [278] |
Author | Year | Dataset Name | Sensor Modality | Method | Classifier | Accuracy [%] |
---|---|---|---|---|---|---|
Jain et al. [294] | 2017 | UCI HAR | IMU Sensor | Fusion based | SVM, KNN | 97.12 |
Ignatov et al. [295] | 2018 | WISDM UCI HAR | IMU Sensor | CNN | Softmax | 93.32 97.63 |
Chen et al. [296] | 2019 | MHEALTH PAMAP2 UCI HAR | IMU | CNN | Softmax | 94.05, 83.42 81.32 |
Kavuncuoglu et al. [297] | 2021 | Fall and ADLs | Accelerometer, Gyroscope, Magnetometer | ML | SVM, K-NN | 99.96, 95.27 |
Lu et al. [298] | 2022 | WISDM, PAMAP2 UCI-HAR | IMUs Accelerometers | CNN-GRU | Softmax | 96.41 96.25 96.67 |
Kim et al. [299] | 2022 | WISDM USC-HAR | IMUs | CNN-BiGRU | Softmax | 99.49 88.31 |
Lin et al. [300] | 2020 | Smartwatch | Accelerometer, Gyroscope | Dilated CNN | Softmax | 95.49 |
Nadeem et al. [301] | 2021 | WISDM PAMAP2 USC-HAD | IMU | HMM | Softmax | 91.28 91.73 90.19 |
Zhang et al. [302] | 2020 | WiFi CSI | WiFi signal | Dense-LSTM | Softmax | 90.0 |
Alawneh et al. [303] | 2020 | UniMib Shar WISDM | Accelerometer IMU Sensor | Bi-LSTM | Softmax | 99.25 98.11 |
Wei et al. [304] | 2024 | WISDM PAMAP2 USC-HAD | IMU | TCN-Attention | Softmax | 99.03 98.35 96.32 |
Yao et al. [281] | 2024 | PAMAP2 USC-HAD, UniMiB-SHAR OPPORTUNITY | IMUs Accelerometers | ELK ResNet | Softmax | 95.53 97.25 82.79 87.96 |
El-Adawi et al. [263] | 2024 | MHEALTH | IMU | GAF+ DenseNet169 | Softmax | 97.83 |
Sarkar et al. [305] | 2023 | UCI-HAR WISDM, MHEALTH PAMAP2 HHAR | IMUs Accelerometers | CNN with GA | SVM | 98.74 98.34 99.72 97.55 96.87 |
Semwal et al. [306] | 2023 | WISDM PAMAP2 USC-HAD | IMUs | CNN and LSTM | Softmax | 95.76 94.64 89.83 |
Zhang et al. [278] | 2024 | DSADS HAPT | IMU | Multi-STMT | Softmax | 99.86 98.73 |
Saha et al. [286] | 2024 | UCI HAR Motion-Sense | IMU | FusionActNet | Softmax | 97.35 95.35 |
Liu et al. [307] | 2025 | UCI-HAR WISDM | Accelerometer Gyroscope | UC Fusion | Softmax | 96.84 98.85 |
Khan et al. [308] | 2025 | HAPT, Human activities | Accelerometer, Gyroscope | 1D-CNN + LSTM | Softmax | 97.84, 99.04 |
Sarakon et al. [309] | 2025 | WISDM, DaLiAc, MotionSense, PAMAP2 | Accelerometer | MLP | Softmax | 95.83, 97.00, 94.65, 98.54 |
Yao et al. [310] | 2025 | PAMAP2 WISDM USC-HAD | Accelerometer Gyroscope Magnetometer | MLKD | Softmax | 92.66 98.22 95.42 |
Thakur et al. [311] | 2025 | UCI-HAR WISDM OPPORTUNITY HAR | Accelerometer Gyroscope Magnetometer GPS | CNN + RNN | Softmax | 96 95 93 95 |
Hu et al. [312] | 2025 | UCI-HAR HAPT RHAR | Accelerometer Gyroscope Magnetometer GPS | AResGAT-LDA | Softmax | 96.62 94.56 85.08 |
Yu et al. [313] | 2025 | UCI-HAR USC-HAD WISDM DSADS | Accelerometer Gyroscope Magnetometer GPS | ASK-HAR | Softmax | 97.25 89.40 98.46 89.42 |
Muralidharan et al. [314] | 2025 | MobiAct | Accelerometer Gyroscope Orientation sensors | CNN-RNN | Softmax | 94.69 |
Yang et al. [315] | 2025 | UCI-HAR RealWorld MotionSense | Accelerometer Gyroscope | Semi-supervised | Softmax | 97.5 95.6 94.2 |
Ye et al. [265] | 2024 | OPPT, PAMAP2 | IMU | CVAE-USM | GMM | 100 82.12 |
Kaya et al. [267] | 2024 | UCI-HAPT WISDM, PAMAP2 | IMU | Deep CNN | Softmax | 98 97.8 90.27 |
Zhang et al. [272] | 2024 | Shoaib, SisFall HCIHAR, KU-HAR | IMU | 1DCNN-Att -BiLSTM | SVM | 99.48 91.85 96.67 97.99 |
Sharen et al. [316] | 2024 | WISDM UCI-HAR KU-HAR | Accelerometer Gyroscope | WISNet | Softmax | 96.41 95.66 94.01 |
Teng et al. [317] | 2025 | UCI-HAR PAMAP2 UNIMIB-SHAR USC-HAD | Accelerometer Gyroscope Magnetometer | CNN-TSFDU-LW | Softmax | 97.90 94.34 78.90 94.71 |
Dahal et al. [318] | 2025 | mHealth UCI-HAR WISDM | Accelerometer Gyroscope Magnetometer | Stack-HAR | Gradient Boosting | 99.49 96.87 90.00 |
Pitombeira-Neto et al. [319] | 2025 | PAMAP2 USC-HAD | Accelerometer Gyroscope Magnetometer | BDLM | Bayesian updating | 96.00 |
5.1. Preprocessing of the Sensor Dataset
5.2. Sensor Data Modality-Based HAR System Using Feature Extraction with Machine Learning
5.3. Sensor Data Modality-Based HAR System Using a Deep Learning Approach
5.3.1. Background of the Deep Learning-Based Temporal Modeling TCN
5.3.2. CNN-Based Various Stream for HAR
5.3.3. RNN, LSTM, Bi-LSTM for HAR
5.3.4. Integration of CNN and LSTM-Based Technique
5.4. Radio Frequency (RF)-Based HAR Techniques
5.4.1. RF Dataset and Signal Acquisition
5.4.2. Filtering, Denoising, and Segmenting the Signal
5.4.3. Multipath Effects and Mitigation Techniques
- Angle-of-Arrival (AoA) and Angle-of-Departure (AoD) Estimation: Methods such as MUSIC and ESPRIT leverage antenna array processing to spatially resolve signal paths, helping isolate the Line-of-Sight (LoS) component from multipath reflections [340].
- Time-Frequency Analysis: Doppler spectrograms and Short-Time Fourier Transform (STFT) techniques decompose CSI into time-varying frequency components, making it easier to detect motion-induced frequency shifts associated with human activities [341] (see the sketch after this list).
- Graph-Based Path Modeling: By jointly estimating parameters like AoA, Time-of-Flight (ToF), and Doppler shifts, recent approaches construct graph-based signal representations to maintain spatial–temporal consistency in dynamic scenarios.
- Phase Unwrapping and CSI Denoising: Techniques including Hilbert transforms, wavelet filtering, and statistical smoothing are applied to reduce high-frequency noise and unwrap distorted phase responses, enhancing the usability of CSI features.
- Domain Adaptation: DL models trained in one environment often fail to generalize to others due to the environmental sensitivity of RF signals. Adversarial domain adaptation methods (e.g., DANN, GAN-based transfer learning) have shown promise in aligning latent representations across domains [340].
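As a concrete illustration of the time-frequency analysis technique referenced in the list above, the following minimal sketch computes a Doppler-style spectrogram from a single CSI amplitude stream using SciPy's STFT. The sampling rate, window length, overlap, and the synthetic input signal are illustrative assumptions, not parameters prescribed by the surveyed works.

```python
import numpy as np
from scipy.signal import stft

# Assumed CSI packet (sampling) rate in Hz; real rates depend on the NIC and firmware.
FS = 1000

# Synthetic stand-in for one subcarrier's CSI amplitude stream:
# a motion-induced oscillation plus measurement noise (illustration only).
t = np.arange(0, 5, 1.0 / FS)
csi_amplitude = 1.0 + 0.2 * np.sin(2 * np.pi * 30 * t) + 0.05 * np.random.randn(t.size)

# Remove the static (DC) component so reflections from stationary objects
# do not dominate the spectrogram.
csi_detrended = csi_amplitude - csi_amplitude.mean()

# Short-Time Fourier Transform: 256-sample windows with 75% overlap.
freqs, times, Zxx = stft(csi_detrended, fs=FS, nperseg=256, noverlap=192)

# Log-magnitude spectrogram: rows are Doppler-style frequency bins, columns are time frames.
spectrogram = 20 * np.log10(np.abs(Zxx) + 1e-12)
print(spectrogram.shape)  # (129, number_of_time_frames)
```

In practice, the same transform is applied per subcarrier (and often per antenna pair) after phase sanitization, and the stacked spectrograms are passed to the classifiers discussed in Section 5.4.6.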
5.4.4. RF Feature Extraction
5.4.5. Classification
5.4.6. ML and DL Methods for RF-Based Datasets
5.5. Mathematical Derivation of the Sensor-Based Learning Method
6. Multimodal Fusion Modality-Based Action Recognition
6.1. Multimodal Fusion-Based HAR Dataset
Dataset | Modalities | Year | Classes | Subjects | Samples | Latest Accuracy |
---|---|---|---|---|---|---|
MSRDailyActivity3D [83] | RGB, Skeleton, Depth | 2012 | 16 | 10 | 320 | 97.50% [84] |
N-UCLA [85] | RGB, Skeleton, Depth | 2014 | 10 | 10 | 1475 | 99.10% [86] |
Multi-View TJU [87] | RGB, Skeleton, Depth | 2014 | 20 | 22 | 7040 | - |
UTD-MHAD [88] | RGB, Skeleton, Depth, Acceleration, Gyroscope | 2015 | 27 | 8 | 861 | 95.0% [89] |
UWA3D Multiview II [90] | RGB, Skeleton, Depth | 2015 | 30 | 10 | 1075 | - |
NTU RGB+D [91] | RGB, Skeleton, Depth, Infrared | 2016 | 60 | 40 | 56,880 | 97.40% [86] |
PKU-MMD [92] | RGB, Skeleton, Depth, Infrared | 2017 | 51 | 66 | 10,076 | 94.40% [93] |
NEU-UB [94] | RGB, Depth | 2017 | 6 | 20 | 600 | - |
Kinetics-600 [95] | RGB, Skeleton, Depth, Infrared | 2018 | 600 | - | 595,445 | 91.90% [71] |
RGB-D Varying-View [96] | RGB, Skeleton, Depth | 2018 | 40 | 118 | 25,600 | - |
Drive&Act [98] | RGB, Skeleton, Depth | 2019 | 83 | 15 | - | 77.61% [99] |
MMAct [100] | RGB, Skeleton, Acceleration, Gyroscope | 2019 | 37 | 20 | 36,764 | 98.60% [101] |
Toyota-SH [102] | RGB, Skeleton, Depth | 2019 | 31 | 18 | 16,115 | - |
IKEA ASM [103] | RGB, Skeleton, Depth | 2020 | 33 | 48 | 16,764 | - |
ETRI-Activity3D [104] | RGB, Skeleton, Depth | 2020 | 55 | 100 | 112,620 | 95.09% [105] |
UAV-Human [106] | RGB, Skeleton, Depth | 2021 | 155 | 119 | 27,428 | 55.00% [107] |
6.2. Fusion of RGB, Skeleton, and Depth Modalities
6.3. Fusion of Signal and Visual Modalities
6.4. Mathematical Derivation of the Multimodal Learning Methods
7. Current Challenges
7.1. RGB Data Modality-Based Current Challenges
7.1.1. Efficient Action Recognition Analysis
7.1.2. Complexity Within the Environment
7.1.3. Large Dataset Memory Requirements and Limitations
7.2. Skeleton Data Modality-Based Challenges
7.2.1. Pose Preparation and Analysis
7.2.2. Viewpoint Variation
7.2.3. Single-Scale Data Analysis
7.3. Sensor-Based HAR: Current Challenges and Possible Solutions
Challenges in RF-Based HAR
- Modality Fusion and Latency: Fusing RF signals with data from other modalities (e.g., vision or inertial sensors) introduces challenges such as sampling rate mismatch, latency, and temporal misalignment. Achieving real-time, low-latency fusion while preserving cross-modal synchronization remains an open problem, particularly in mobile or dynamic environments [351,398].
- Representation Learning for RF Signals: Unlike images or video, RF signals lack regular grid-based spatial structure, making direct application of standard deep learning techniques less effective. Developing transferable, domain-invariant representations from complex inputs like CSI, RSSI, or Doppler spectrograms remains a growing area of research [348,399].
- Cross-Domain Generalization: RF-based models often perform poorly when transferred across different environments, hardware setups, or users due to the high sensitivity of wireless signals to ambient changes. While domain adaptation techniques such as adversarial learning and few-shot adaptation have been proposed, robust generalization remains limited relative to vision-based HAR [348,350] (see the sketch after this list).
- Multipath Propagation: As discussed in Section 5.4.3, multipath effects arise when signals reflect off surfaces and objects, causing interference patterns that distort amplitude and phase. These distortions degrade the quality of extracted features and the reliability of classification. While mitigation techniques such as Angle-of-Arrival (AoA) estimation, Doppler analysis, and deep domain adaptation have shown promise [340,341], real-time multipath-resilient modeling across diverse spaces remains an unsolved problem.
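To make the cross-domain generalization challenge above more concrete, the sketch below shows the core of a DANN-style adversarial adaptation setup: a gradient reversal layer placed between a shared feature encoder and a domain classifier. The layer sizes, class counts, and the fixed `lam` coefficient are illustrative assumptions rather than a specific published configuration.

```python
import torch
from torch import nn


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; scales gradients by -lam in the backward pass."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


class DANNHead(nn.Module):
    """Shared encoder with an activity classifier and an adversarial domain classifier."""

    def __init__(self, in_dim=128, num_classes=6, num_domains=2, lam=1.0):
        super().__init__()
        self.lam = lam
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.activity_head = nn.Linear(64, num_classes)
        self.domain_head = nn.Linear(64, num_domains)

    def forward(self, x):
        z = self.encoder(x)
        y_activity = self.activity_head(z)
        # Reversed gradients push the encoder toward domain-invariant RF features.
        y_domain = self.domain_head(GradientReversal.apply(z, self.lam))
        return y_activity, y_domain


# Toy usage: a batch of 32 RF feature vectors of dimension 128.
model = DANNHead()
y_act, y_dom = model(torch.randn(32, 128))
print(y_act.shape, y_dom.shape)  # torch.Size([32, 6]) torch.Size([32, 2])
```

During training, the activity head is optimized with cross-entropy on labeled source data while the domain head learns to separate source from target environments; the reversed gradient drives the encoder toward representations the domain classifier cannot distinguish, which is the alignment effect the domain adaptation discussion above refers to.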
7.4. Multimodal-Based Challenges
7.4.1. Temporal Misalignment
7.4.2. Missing Modalities
7.4.3. High Computational Cost
7.4.4. Overfitting and Heterogeneous Data
8. Discussion and Future Direction
8.1. Development of New Large-Scale Datasets
8.2. Data Augmentation Techniques
8.3. Advancements in Model Performance
- Long-term Dependency Analysis: Long-term correlations refer to the sequence of actions that unfold over extended periods, akin to how memories are stored in our brains. In action recognition, it is essential to integrate both spatial and temporal modeling to capture these dependencies. To implement this, future research should consider transformer-based temporal attention models (e.g., Video Swin Transformers), TCNs, and hierarchical recurrent architectures that explicitly model variable-length sequences.
- Multimodal Modeling: This involves integrating data from multiple devices, such as RGB, skeleton, and audio sensors, to build more robust HAR systems. Implementation can leverage cross-modal attention mechanisms (e.g., co-attention modules) to dynamically weight modalities based on scene context. Techniques like cross-modal contrastive learning and domain adaptation can further enhance multimodal fusion performance, addressing the occlusion and domain shift issues identified earlier (a minimal cross-modal attention sketch follows this list).
- Enhancing Video Representations: Multimodal data (such as depth, skeleton, and RGB) is essential for improving video representations [410,411]. Future research should focus on implementing multi-stream networks, where each stream processes a different modality and shares temporal context through feature fusion layers. Additionally, self-supervised pretraining on unlabeled multimodal videos can improve generalization to unseen environments and handle missing modalities.
- Efficient Modeling Analysis: Creating an efficient network architecture is crucial due to the challenges posed by existing models, including model complexity, excessive parameters, and real-time performance limitations. To address these issues, techniques like distributed training [412], mobile networks [413], mixed-precision training, model compression, quantization, and pruning can be explored. These approaches can enhance both efficiency and effectiveness in action recognition tasks.
- Semi-supervised and Unsupervised Learning Approaches: Supervised learning approaches, especially those based on deep learning, typically require large, expensive labeled datasets for model training. In contrast, unsupervised and semi-supervised learning techniques [414] can utilize unlabeled data to train models, thereby reducing the need for extensive labeled datasets. Given that unlabeled action samples are often easier to collect, unsupervised and semi-supervised approaches to HAR represent a crucial research direction deserving further exploration.
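As a minimal sketch of the cross-modal attention idea raised in the multimodal modeling direction above, the module below lets RGB clip features attend to skeleton features (and vice versa) using PyTorch's nn.MultiheadAttention before late fusion. The feature dimensions, sequence lengths, and the simple average-pool-and-concatenate fusion are illustrative assumptions, not a specific published architecture.

```python
import torch
from torch import nn


class CrossModalFusion(nn.Module):
    """Symmetric cross-attention between two modality streams, followed by late fusion."""

    def __init__(self, dim=256, heads=4, num_classes=60):
        super().__init__()
        self.rgb_to_skel = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.skel_to_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, rgb_feats, skel_feats):
        # rgb_feats: (B, T_rgb, dim); skel_feats: (B, T_skel, dim).
        # Each stream queries the other, so the weighting adapts to scene context.
        rgb_attended, _ = self.rgb_to_skel(rgb_feats, skel_feats, skel_feats)
        skel_attended, _ = self.skel_to_rgb(skel_feats, rgb_feats, rgb_feats)
        # Temporal average pooling, then concatenation of the two fused streams.
        fused = torch.cat([rgb_attended.mean(dim=1), skel_attended.mean(dim=1)], dim=-1)
        return self.classifier(fused)


# Toy usage: a batch of 8 clips with 16 RGB tokens and 32 skeleton tokens each.
model = CrossModalFusion()
logits = model(torch.randn(8, 16, 256), torch.randn(8, 32, 256))
print(logits.shape)  # torch.Size([8, 60])
```

A missing modality can be handled by falling back to the remaining stream's pooled features, which ties this design to the missing-modality challenge noted in Section 7.4.2.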
8.4. Video Lengths in Human Action Recognition
Limitations
9. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
HAR | Human Activity Recognition |
HOF | Histogram of Optical Flow |
STIP | Spatio-Temporal Interest Point |
GCN | Graph Convolutional Network |
MEI | Motion Energy Image |
SVM | Support Vector Machine |
TSN | Temporal Segment Network |
RF | Radio Frequency |
ML | Machine Learning |
KNN | K-Nearest-Neighbor |
CNN | Convolutional Neural Network |
DL | Deep Learning |
LSTM | Long Short-Term Memory |
DNNs | Deep Neural Networks |
References
- Papadopoulos, G.T.; Axenopoulos, A.; Daras, P. Real-time skeleton-tracking-based human action recognition using kinect data. In Proceedings of the MultiMedia Modeling: 20th Anniversary International Conference (MMM 2014), Dublin, Ireland, 6–10 January 2014; Proceedings, Part I 20. Springer: Cham, Switzerland, 2014; pp. 473–483. [Google Scholar]
- Islam, M.N.; Jahangir, R.; Mohim, N.S.; Wasif-Ul-Islam, M.; Ashraf, A.; Khan, N.I.; Mahjabin, M.R.; Miah, A.S.M.; Shin, J. A multilingual handwriting learning system for visually impaired people. IEEE Access 2024, 12, 10521–10534. [Google Scholar] [CrossRef]
- Rahim, M.A.; Miah, A.S.M.; Sayeed, A.; Shin, J. Hand gesture recognition based on optimal segmentation in human-computer interaction. In Proceedings of the 3rd IEEE International Conference on Knowledge Innovation and Invention (ICKII), Kaohsiung, Taiwan, 21–23 August 2020; pp. 163–166. [Google Scholar]
- Van Gemert, J.C.; Jain, M.; Gati, E.; Snoek, C.G. APT: Action localization proposals from dense trajectories. In Proceedings of the BMVC, Swansea, UK, 7–10 September 2015; Volume 2, p. 4. [Google Scholar]
- Zhu, H.; Vial, R.; Lu, S. Tornado: A spatio-temporal convolutional regression network for video action proposal. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5813–5821. [Google Scholar]
- Ziaeefard, M.; Bergevin, R. Semantic human activity recognition: A literature review. Pattern Recognit. 2015, 48, 2329–2345. [Google Scholar] [CrossRef]
- Wu, S.; Oreifej, O.; Shah, M. Action recognition in videos acquired by a moving camera using motion decomposition of lagrangian particle trajectories. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1419–1426. [Google Scholar]
- Herath, S.; Harandi, M.; Porikli, F. Going deeper into action recognition: A survey. Image Vis. Comput. 2017, 60, 4–21. [Google Scholar] [CrossRef]
- Chao, Y.W.; Wang, Z.; He, Y.; Wang, J.; Deng, J. Hico: A benchmark for recognizing human-object interactions in images. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1017–1025. [Google Scholar]
- Peng, X.; Schmid, C. Multi-region two-stream R-CNN for action detection. In Proceedings of the ECCV 2016: 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part IV 14. Springer: Cham, Switzerland, 2016; pp. 744–759. [Google Scholar]
- Liu, J.; Li, Y.; Song, S.; Xing, J.; Lan, C.; Zeng, W. Multi-modality multi-task recurrent neural network for online action detection. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 2667–2682. [Google Scholar] [CrossRef]
- Patrona, F.; Chatzitofis, A.; Zarpalas, D.; Daras, P. Motion analysis: Action detection, recognition and evaluation based on motion capture data. Pattern Recognit. 2018, 76, 612–622. [Google Scholar] [CrossRef]
- Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828. [Google Scholar] [CrossRef]
- Das Dawn, D.; Shaikh, S.H. A comprehensive survey of human action recognition with spatio-temporal interest point (STIP) detector. Vis. Comput. 2016, 32, 289–306. [Google Scholar] [CrossRef]
- Nguyen, T.V.; Song, Z.; Yan, S. STAP: Spatial-temporal attention-aware pooling for action recognition. IEEE Trans. Circuits Syst. Video Technol. 2014, 25, 77–86. [Google Scholar] [CrossRef]
- Shao, L.; Zhen, X.; Tao, D.; Li, X. Spatio-temporal Laplacian pyramid coding for action recognition. IEEE Trans. Cybern. 2013, 44, 817–827. [Google Scholar] [CrossRef]
- Burghouts, G.; Schutte, K.; ten Hove, R.J.M.; van den Broek, S.; Baan, J.; Rajadell, O.; van Huis, J.; van Rest, J.; Hanckmann, P.; Bouma, H.; et al. Instantaneous threat detection based on a semantic representation of activities, zones and trajectories. Signal Image Video Process. 2014, 8, 191–200. [Google Scholar] [CrossRef]
- Yang, X.; Tian, Y. Super normal vector for activity recognition using depth sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 804–811. [Google Scholar]
- Ye, M.; Zhang, Q.; Wang, L.; Zhu, J.; Yang, R.; Gall, J. A survey on human motion analysis from depth data. In Time-of-Flight and Depth Imaging, Sensors, Algorithms, and Applications: Dagstuhl 2012 Seminar on Time-of-Flight Imaging and GCPR 2013 Workshop on Imaging New Modalities, Schloss Dagstuhl; Springer: Berlin/Heidelberg, Germany, 2013; pp. 149–187. [Google Scholar]
- Li, M.; Leung, H.; Shum, H.P. Human action recognition via skeletal and depth based feature fusion. In Proceedings of the 9th International Conference on Motion in Games, Burlingame, CA, USA, 10–12 October 2016; pp. 123–132. [Google Scholar]
- Yang, X.; Tian, Y. Effective 3D action recognition using eigenjoints. J. Vis. Commun. Image Represent. 2014, 25, 2–11. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 4489–4497. [Google Scholar]
- Wang, J.; Chen, Y.; Hao, S.; Peng, X.; Hu, L. Deep learning for sensor-based activity recognition: A survey. Pattern Recognit. Lett. 2019, 119, 3–11. [Google Scholar] [CrossRef]
- Vrigkas, M.; Nikou, C.; Kakadiaris, I.A. A review of human activity recognition methods. Front. Robot. AI 2015, 2, 28. [Google Scholar] [CrossRef]
- Vishwakarma, S.; Agrawal, A. A survey on activity recognition and behavior understanding in video surveillance. Vis. Comput. 2013, 29, 983–1009. [Google Scholar] [CrossRef]
- Ke, S.R.; Thuc, H.L.U.; Lee, Y.J.; Hwang, J.N.; Yoo, J.H.; Choi, K.H. A review on video-based human activity recognition. Computers 2013, 2, 88–131. [Google Scholar] [CrossRef]
- Zhu, Y.; Li, X.; Liu, C.; Zolfaghari, M.; Xiong, Y.; Wu, C.; Zhang, Z.; Tighe, J.; Manmatha, R.; Li, M. A comprehensive study of deep video action recognition. arXiv 2020, arXiv:2012.06567. [Google Scholar]
- Zhang, H.B.; Zhang, Y.X.; Zhong, B.; Lei, Q.; Yang, L.; Du, J.X.; Chen, D.S. A comprehensive survey of vision-based human action recognition methods. Sensors 2019, 19, 1005. [Google Scholar] [CrossRef]
- Kong, Y.; Fu, Y. Human action recognition and prediction: A survey. Int. J. Comput. Vis. 2022, 130, 1366–1401. [Google Scholar] [CrossRef]
- Ma, N.; Wu, Z.; Cheung, Y.m.; Guo, Y.; Gao, Y.; Li, J.; Jiang, B. A survey of human action recognition and posture prediction. Tsinghua Sci. Technol. 2022, 27, 973–1001. [Google Scholar] [CrossRef]
- Xing, Y.; Zhu, J. Deep learning-based action recognition with 3D skeleton: A survey. CAAI Trans. Intell. Technol. 2021, 6, 80–92. [Google Scholar] [CrossRef]
- Presti, L.L.; La Cascia, M. 3D skeleton-based human action classification: A survey. Pattern Recognit. 2016, 53, 130–147. [Google Scholar] [CrossRef]
- Subetha, T.; Chitrakala, S. A survey on human activity recognition from videos. In Proceedings of the 2016 International Conference on Information Communication and Embedded Systems (ICICES), Chennai, India, 25–26 February 2016; IEEE: New York, NY, USA, 2016; pp. 1–7. [Google Scholar]
- Feng, M.; Meunier, J. Skeleton graph-neural-network-based human action recognition: A survey. Sensors 2022, 22, 2091. [Google Scholar] [CrossRef]
- Feng, L.; Zhao, Y.; Zhao, W.; Tang, J. A comparative review of graph convolutional networks for human skeleton-based action recognition. Artif. Intell. Rev. 2022, 55, 4275–4305. [Google Scholar] [CrossRef]
- Gupta, P.; Thatipelli, A.; Aggarwal, A.; Maheshwari, S.; Trivedi, N.; Das, S.; Sarvadevabhatla, R.K. Quo vadis, skeleton action recognition? Int. J. Comput. Vis. 2021, 129, 2097–2112. [Google Scholar] [CrossRef]
- Song, L.; Yu, G.; Yuan, J.; Liu, Z. Human pose estimation and its application to action recognition: A survey. J. Vis. Commun. Image Represent. 2021, 76, 103055. [Google Scholar] [CrossRef]
- Shaikh, M.B.; Chai, D. RGB-D data-based action recognition: A review. Sensors 2021, 21, 4246. [Google Scholar] [CrossRef] [PubMed]
- Majumder, S.; Kehtarnavaz, N. Vision and inertial sensing fusion for human action recognition: A review. IEEE Sens. J. 2020, 21, 2454–2467. [Google Scholar] [CrossRef]
- Wang, L.; Huynh, D.Q.; Koniusz, P. A comparative review of recent kinect-based action recognition algorithms. IEEE Trans. Image Process. 2019, 29, 15–28. [Google Scholar] [CrossRef]
- Wang, C.; Yan, J. A comprehensive survey of rgb-based and skeleton-based human action recognition. IEEE Access 2023, 11, 53880–53898. [Google Scholar] [CrossRef]
- Sun, Z.; Ke, Q.; Rahmani, H.; Bennamoun, M.; Wang, G.; Liu, J. Human action recognition from various data modalities: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3200–3225. [Google Scholar] [CrossRef]
- Ullah, A.; Muhammad, K.; Haq, I.U.; Baik, S.W. Action recognition using optimized deep autoencoder and CNN for surveillance data streams of non-stationary environments. Future Gener. Comput. Syst. 2019, 96, 386–397. [Google Scholar] [CrossRef]
- Lan, Z.; Lin, M.; Li, X.; Hauptmann, A.G.; Raj, B. Beyond gaussian pyramid: Multi-skip feature stacking for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 204–212. [Google Scholar]
- Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 221–231. [Google Scholar] [CrossRef]
- Wang, L.; Qiao, Y.; Tang, X. Action recognition with trajectory-pooled deep-convolutional descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4305–4314. [Google Scholar]
- Sharma, S.; Kiros, R.; Salakhutdinov, R. Action recognition using visual attention. arXiv 2015, arXiv:1511.04119. [Google Scholar]
- Ijjina, E.P.; Chalavadi, K.M. Human action recognition using genetic algorithms and convolutional neural networks. Pattern Recognit. 2016, 59, 199–212. [Google Scholar] [CrossRef]
- Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1933–1941. [Google Scholar]
- Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 20–36. [Google Scholar]
- Akilan, T.; Wu, Q.J.; Safaei, A.; Jiang, W. A late fusion approach for harnessing multi-CNN model high-level features. In Proceedings of the 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Banff, AB, Canada, 5–8 October 2017; pp. 566–571. [Google Scholar]
- Shi, Y.; Tian, Y.; Wang, Y.; Huang, T. Sequential deep trajectory descriptor for action recognition with three-stream CNN. IEEE Trans. Multimed. 2017, 19, 1510–1520. [Google Scholar] [CrossRef]
- Ahsan, U.; Sun, C.; Essa, I. Discrimnet: Semi-supervised action recognition from videos using generative adversarial networks. arXiv 2018, arXiv:1801.07230. [Google Scholar]
- Tu, Z.; Xie, W.; Qin, Q.; Poppe, R.; Veltkamp, R.C.; Li, B.; Yuan, J. Multi-stream CNN: Learning representations based on human-related regions for action recognition. Pattern Recognit. 2018, 79, 32–43. [Google Scholar] [CrossRef]
- Zhou, Y.; Sun, X.; Zha, Z.J.; Zeng, W. Mict: Mixed 3d/2d convolutional tube for human action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18 June 2018; pp. 449–458. [Google Scholar]
- Jian, M.; Zhang, S.; Wu, L.; Zhang, S.; Wang, X.; He, Y. Deep key frame extraction for sport training. Neurocomputing 2019, 328, 147–156. [Google Scholar] [CrossRef]
- Gowda, S.; Rohrbach, M.; Sevilla-Lara, L. Smart frame selection for action recognition. arXiv 2020, arXiv:2012.10671. [Google Scholar] [CrossRef]
- Khan, M.A.; Javed, K.; Khan, S.A.; Saba, T.; Habib, U.; Khan, J.A.; Abbasi, A.A. Human action recognition using fusion of multiview and deep features: An application to video surveillance. Multimed. Tools Appl. 2020, 79, 27973–27995. [Google Scholar] [CrossRef]
- Ullah, A.; Muhammad, K.; Ding, W.; Palade, V.; Haq, I.U.; Baik, S.W. Efficient activity recognition using lightweight CNN and DS-GRU network for surveillance applications. Appl. Soft Comput. 2021, 103, 107102. [Google Scholar] [CrossRef]
- Wang, L.; Tong, Z.; Ji, B.; Wu, G. Tdn: Temporal difference networks for efficient action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20 June 2021; pp. 1895–1904. [Google Scholar]
- Wang, X.; Zhang, S.; Qing, Z.; Tang, M.; Zuo, Z.; Gao, C.; Jin, R.; Sang, N. Hybrid relation guided set matching for few-shot action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 19948–19957. [Google Scholar]
- Wensel, J.; Ullah, H.; Munir, A. Vit-ret: Vision and recurrent transformer neural networks for human activity recognition in videos. IEEE Access 2023, 11, 72227–72249. [Google Scholar] [CrossRef]
- Hassan, N.; Miah, A.S.M.; Shin, J. A Deep Bidirectional LSTM Model Enhanced by Transfer-Learning-Based Feature Extraction for Dynamic Human Activity Recognition. Appl. Sci. 2024, 14, 603. [Google Scholar] [CrossRef]
- Khan, M.H.; Javed, M.A.; Farid, M.S. Deep-learning-based ConvLSTM and LRCN networks for human activity recognition. J. Vis. Commun. Image Represent. 2025, 110, 104469. [Google Scholar] [CrossRef]
- Shah, H.; Holia, M.S. Hybrid Feature Extraction and Knowledge Distillation Based Deep Learning Model for Human Activity Recognition System. Signal Process. Image Commun. 2025, 137, 117308. [Google Scholar] [CrossRef]
- Caba Heilbron, F.; Escorcia, V.; Ghanem, B.; Carlos Niebles, J. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7 June 2015; pp. 961–970. [Google Scholar]
- Li, K.; Wang, Y.; He, Y.; Li, Y.; Wang, Y.; Wang, L.; Qiao, Y. Uniformerv2: Spatiotemporal learning by arming image vits with video uniformer. arXiv 2022, arXiv:2211.09552. [Google Scholar]
- Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The kinetics human action video dataset. arXiv 2017, arXiv:1705.06950. [Google Scholar]
- Carreira, J.; Noland, E.; Hillier, C.; Zisserman, A. A short note on the kinetics-700 human action dataset. arXiv 2019, arXiv:1907.06987. [Google Scholar]
- Wang, Y.; Li, K.; Li, X.; Yu, J.; He, Y.; Chen, G.; Pei, B.; Zheng, R.; Xu, J.; Wang, Z.; et al. Internvideo2: Scaling video foundation models for multimodal video understanding. arXiv 2024, arXiv:2403.15377. [Google Scholar]
- Gu, C.; Sun, C.; Ross, D.A.; Vondrick, C.; Pantofaru, C.; Li, Y.; Vijayanarasimhan, S.; Toderici, G.; Ricco, S.; Sukthankar, R.; et al. Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18 June 2018; pp. 6047–6056. [Google Scholar]
- Sheng, K.; Dong, W.; Ma, C.; Mei, X.; Huang, F.; Hu, B.G. Attention-based multi-patch aggregation for image aesthetic assessment. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 879–886. [Google Scholar]
- Monfort, M.; Andonian, A.; Zhou, B.; Ramakrishnan, K.; Bargal, S.A.; Yan, T.; Brown, L.; Fan, Q.; Gutfreund, D.; Vondrick, C.; et al. Moments in time dataset: One million videos for event understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 502–508. [Google Scholar] [CrossRef]
- Theodorakopoulos, I.; Kastaniotis, D.; Economou, G.; Fotopoulos, S. Pose-based human action recognition via sparse representation in dissimilarity space. J. Vis. Commun. Image Represent. 2014, 25, 12–23. [Google Scholar] [CrossRef]
- Zhou, Q.; Rasol, J.; Xu, Y.; Zhang, Z.; Hu, L. A high-performance gait recognition method based on n-fold Bernoulli theory. IEEE Access 2022, 10, 115744–115757. [Google Scholar] [CrossRef]
- Damen, D.; Doughty, H.; Farinella, G.M.; Fidler, S.; Furnari, A.; Kazakos, E.; Moltisanti, D.; Munro, J.; Perrett, T.; Price, W.; et al. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 720–736. [Google Scholar]
- Tian, Y.; Shi, J.; Li, B.; Duan, Z.; Xu, C. Audio-visual event localization in unconstrained videos. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 247–263. [Google Scholar]
- Miech, A.; Alayrac, J.B.; Laptev, I.; Sivic, J.; Zisserman, A. Rareact: A video dataset of unusual interactions. arXiv 2020, arXiv:2008.01018. [Google Scholar]
- Alayrac, J.B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
- Lin, W.; Liu, H.; Liu, S.; Li, Y.; Qian, R.; Wang, T.; Xu, N.; Xiong, H.; Qi, G.J.; Sebe, N. Human in events: A large-scale benchmark for human-centric video analysis in complex events. arXiv 2020, arXiv:2005.04490. [Google Scholar]
- Duan, X. Abnormal Behavior Recognition for Human Motion Based on Improved Deep Reinforcement Learning. Int. J. Image Graph. 2024, 24, 2550029. [Google Scholar] [CrossRef]
- Wang, J.; Liu, Z.; Wu, Y.; Yuan, J. Mining actionlet ensemble for action recognition with depth cameras. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1290–1297. [Google Scholar]
- Shahroudy, A.; Ng, T.T.; Gong, Y.; Wang, G. Deep multimodal feature analysis for action recognition in rgb+ d videos. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1045–1058. [Google Scholar] [CrossRef]
- Wang, J.; Nie, X.; Xia, Y.; Wu, Y.; Zhu, S.C. Cross-view action modeling, learning and recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2649–2656. [Google Scholar]
- Cheng, Q.; Cheng, J.; Liu, Z.; Ren, Z.; Liu, J. A Dense-Sparse Complementary Network for Human Action Recognition based on RGB and Skeleton Modalities. Expert Syst. Appl. 2024, 244, 123061. [Google Scholar] [CrossRef]
- Liu, A.A.; Su, Y.T.; Jia, P.P.; Gao, Z.; Hao, T.; Yang, Z.X. Multiple/single-view human action recognition via part-induced multitask structural learning. IEEE Trans. Cybern. 2014, 45, 1194–1208. [Google Scholar] [CrossRef]
- Chen, C.; Jafari, R.; Kehtarnavaz, N. UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 168–172. [Google Scholar]
- Liu, M.; Yuan, J. Recognizing human actions as the evolution of pose estimation maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1159–1168. [Google Scholar]
- Rahmani, H.; Mahmood, A.; Huynh, D.; Mian, A. Histogram of oriented principal components for cross-view action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 2430–2443. [Google Scholar] [CrossRef]
- Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. Ntu rgb+ d: A large scale dataset for 3d human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1010–1019. [Google Scholar]
- Liu, C.; Hu, Y.; Li, Y.; Song, S.; Liu, J. Pku-mmd: A large scale benchmark for continuous multi-modal human action understanding. arXiv 2017, arXiv:1703.07475. [Google Scholar]
- Li, T.; Fan, L.; Zhao, M.; Liu, Y.; Katabi, D. Making the invisible visible: Action recognition through walls and occlusions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 872–881. [Google Scholar]
- Kong, Y.; Fu, Y. Max-margin heterogeneous information machine for RGB-D action recognition. Int. J. Comput. Vis. 2017, 123, 350–371. [Google Scholar] [CrossRef]
- Carreira, J.; Noland, E.; Banki-Horvath, A.; Hillier, C.; Zisserman, A. A short note about kinetics-600. arXiv 2018, arXiv:1808.01340. [Google Scholar]
- Ji, Y.; Xu, F.; Yang, Y.; Shen, F.; Shen, H.T.; Zheng, W.S. A large-scale RGB-D database for arbitrary-view human action recognition. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 1510–1518. [Google Scholar]
- Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701. [Google Scholar] [CrossRef] [PubMed]
- Martin, M.; Roitberg, A.; Haurilet, M.; Horne, M.; Reiß, S.; Voit, M.; Stiefelhagen, R. Drive&act: A multi-modal dataset for fine-grained driver behavior recognition in autonomous vehicles. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2801–2810. [Google Scholar]
- Lin, D.; Lee, P.H.Y.; Li, Y.; Wang, R.; Yap, K.H.; Li, B.; Ngim, Y.S. Multi-modality action recognition based on dual feature shift in vehicle cabin monitoring. arXiv 2024, arXiv:2401.14838. [Google Scholar]
- Kong, Q.; Wu, Z.; Deng, Z.; Klinkigt, M.; Tong, B.; Murakami, T. Mmact: A large-scale dataset for cross modal human action understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8658–8667. [Google Scholar]
- Liu, Y.; Wang, K.; Li, G.; Lin, L. Semantics-aware adaptive knowledge distillation for sensor-to-vision action recognition. IEEE Trans. Image Process. 2021, 30, 5573–5588. [Google Scholar] [CrossRef]
- Das, S.; Dai, R.; Koperski, M.; Minciullo, L.; Garattoni, L.; Bremond, F.; Francesca, G. Toyota smarthome: Real-world activities of daily living. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 833–842. [Google Scholar]
- Ben-Shabat, Y.; Yu, X.; Saleh, F.; Campbell, D.; Rodriguez-Opazo, C.; Li, H.; Gould, S. The ikea asm dataset: Understanding people assembling furniture through actions, objects and pose. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 847–859. [Google Scholar]
- Jang, J.; Kim, D.; Park, C.; Jang, M.; Lee, J.; Kim, J. ETRI-activity3D: A large-scale RGB-D dataset for robots to recognize daily activities of the elderly. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 10990–10997. [Google Scholar]
- Dokkar, R.R.; Chaieb, F.; Drira, H.; Aberkane, A. ConViViT–A Deep Neural Network Combining Convolutions and Factorized Self-Attention for Human Activity Recognition. arXiv 2023, arXiv:2310.14416. [Google Scholar]
- Li, T.; Liu, J.; Zhang, W.; Ni, Y.; Wang, W.; Li, Z. Uav-human: A large benchmark for human behavior understanding with unmanned aerial vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 16266–16275. [Google Scholar]
- Xian, R.; Wang, X.; Kothandaraman, D.; Manocha, D. PMI Sampler: Patch similarity guided frame selection for Aerial Action Recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2024; pp. 6982–6991. [Google Scholar]
- Patel, C.I.; Garg, S.; Zaveri, T.; Banerjee, A.; Patel, R. Human action recognition using fusion of features for unconstrained video sequences. Comput. Electr. Eng. 2018, 70, 284–301. [Google Scholar] [CrossRef]
- Liu, J.; Kuipers, B.; Savarese, S. Recognizing human actions by attributes. In Proceedings of the CVPR 2011, Providence, RI, USA, 20–25 June 2011; pp. 3337–3344. [Google Scholar]
- Shi, Q.; Cheng, L.; Wang, L.; Smola, A. Human action segmentation and recognition using discriminative semi-markov models. Int. J. Comput. Vis. 2011, 93, 22–32. [Google Scholar] [CrossRef]
- Chen, C.; Jafari, R.; Kehtarnavaz, N. Action recognition from depth sequences using depth motion maps-based local binary patterns. In Proceedings of the 2015 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2015; pp. 1092–1099. [Google Scholar]
- Gan, L.; Chen, F. Human Action Recognition Using APJ3D and Random Forests. J. Softw. 2013, 8, 2238–2245. [Google Scholar] [CrossRef]
- Everts, I.; Van Gemert, J.C.; Gevers, T. Evaluation of color spatio-temporal interest points for human action recognition. IEEE Trans. Image Process. 2014, 23, 1569–1580. [Google Scholar] [CrossRef] [PubMed]
- Zhu, Y.; Chen, W.; Guo, G. Evaluating spatiotemporal interest point features for depth-based action recognition. Image Vis. Comput. 2014, 32, 453–464. [Google Scholar] [CrossRef]
- Liu, L.; Shao, L.; Li, X.; Lu, K. Learning spatio-temporal representations for action recognition: A genetic programming approach. IEEE Trans. Cybern. 2015, 46, 158–170. [Google Scholar] [CrossRef] [PubMed]
- Xu, D.; Xiao, X.; Wang, X.; Wang, J. Human action recognition based on Kinect and PSO-SVM by representing 3D skeletons as points in lie group. In Proceedings of the 2016 International Conference on Audio, Language and Image Processing (ICALIP), Shanghai, China, 11–12 July 2016; pp. 568–573. [Google Scholar]
- Vishwakarma, D.K.; Kapoor, R.; Dhiman, A. A proposed unified framework for the recognition of human activity by exploiting the characteristics of action dynamics. Robot. Auton. Syst. 2016, 77, 25–38. [Google Scholar] [CrossRef]
- Singh, D.; Mohan, C.K. Graph formulation of video activities for abnormal activity recognition. Pattern Recognit. 2017, 65, 265–272. [Google Scholar] [CrossRef]
- Jalal, A.; Kim, Y.H.; Kim, Y.J.; Kamal, S.; Kim, D. Robust human activity recognition from depth video using spatiotemporal multi-fused features. Pattern Recognit. 2017, 61, 295–308. [Google Scholar] [CrossRef]
- Nazir, S.; Yousaf, M.H.; Velastin, S.A. Evaluating a bag-of-visual features approach using spatio-temporal features for action recognition. Comput. Electr. Eng. 2018, 72, 660–669. [Google Scholar] [CrossRef]
- Ullah, S.; Bhatti, N.; Qasim, T.; Hassan, N.; Zia, M. Weakly-supervised action localization based on seed superpixels. Multimed. Tools Appl. 2021, 80, 6203–6220. [Google Scholar] [CrossRef]
- Al-Obaidi, S.; Al-Khafaji, H.; Abhayaratne, C. Making sense of neuromorphic event data for human action recognition. IEEE Access 2021, 9, 82686–82700. [Google Scholar] [CrossRef]
- Hejazi, S.M.; Abhayaratne, C. Handcrafted localized phase features for human action recognition. Image Vis. Comput. 2022, 123, 104465. [Google Scholar] [CrossRef]
- Zhang, C.; Xu, Y.; Xu, Z.; Huang, J.; Lu, J. Hybrid handcrafted and learned feature framework for human action recognition. Appl. Intell. 2022, 52, 12771–12787. [Google Scholar] [CrossRef]
- Fatima, T.; Rahman, H.; Jalal, A. A novel framework for human action recognition based on features fusion and decision tree. In Proceedings of the 2023 4th International Conference on Advancements in Computational Sciences (ICACS), Lahore, Pakistan, 20–22 February 2023; Volume 53. [Google Scholar]
- Bobick, A.F.; Davis, J.W. The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 257–267. [Google Scholar] [CrossRef]
- Zhang, Z.; Hu, Y.; Chan, S.; Chia, L.T. Motion context: A new representation for human action recognition. In Computer Vision–ECCV 2008, Proceedings of the10th European Conference on Computer Vision, Marseille, France, 12–18 October 2008; Proceedings, Part IV 10; Springer: Berlin/Heidelberg, Germany, 2008; pp. 817–829. [Google Scholar]
- Somasundaram, G.; Cherian, A.; Morellas, V.; Papanikolopoulos, N. Action recognition using global spatio-temporal features derived from sparse representations. Comput. Vis. Image Underst. 2014, 123, 1–13. [Google Scholar] [CrossRef]
- Wang, H.; Schmid, C. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3551–3558. [Google Scholar]
- Oreifej, O.; Liu, Z. Hon4d: Histogram of oriented 4d normals for activity recognition from depth sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 716–723. [Google Scholar]
- Patel, C.I.; Labana, D.; Pandya, S.; Modi, K.; Ghayvat, H.; Awais, M. Histogram of oriented gradient-based fusion of features for human action recognition in action video sequences. Sensors 2020, 20, 7299. [Google Scholar] [CrossRef]
- Tan, P.S.; Lim, K.M.; Lee, C.P. Human action recognition with sparse autoencoder and histogram of oriented gradients. In Proceedings of the 2020 IEEE 2nd International Conference on Artificial Intelligence in Engineering and Technology (IICAIET), Kota Kinabalu, Sabah, 26–27 September 2020; pp. 1–5. [Google Scholar]
- Wattanapanich, C.; Wei, H.; Xu, W. Analysis of Histogram of Oriented Gradients on Gait Recognition. In Proceedings of the 4th Mediterranean Conference on Pattern Recognition and Artificial Intelligence, MedPRAI 2020, Hammamet, Tunisia, 20–22 December 2020; Proceedings 4. Springer: Cham, Switzerland, 2021; pp. 86–97. [Google Scholar]
- Zuo, Z.; Yang, L.; Liu, Y.; Chao, F.; Song, R.; Qu, Y. Histogram of fuzzy local spatio-temporal descriptors for video action recognition. IEEE Trans. Ind. Inform. 2019, 16, 4059–4067. [Google Scholar] [CrossRef]
- Wang, H. Enhanced forest microexpression recognition based on optical flow direction histogram and deep multiview network. Math. Probl. Eng. 2020, 2020, 5675914. [Google Scholar] [CrossRef]
- Ullah, S.; Hassan, N.; Bhatti, N. Temporal Superpixels based Human Action Localization. In Proceedings of the 2018 14th International Conference on Emerging Technologies (ICET), Islamabad, Pakistan, 21–22 November 2018; pp. 1–6. [Google Scholar]
- Laptev, I.; Pérez, P. Retrieving actions in movies. In Proceedings of the 2007 IEEE 11th International Conference on Computer Vision, Rio de Janeiro, Brazil, 14–20 October 2007; pp. 1–8. [Google Scholar]
- Tran, D.; Sorokin, A. Human activity recognition with metric learning. In Proceedings of the Computer Vision–ECCV 2008, 10th European Conference on Computer Vision, Marseille, France, 12–18 October 2008; Proceedings, Part I 10. Springer: Berlin/Heidelberg, Germany, 2008; pp. 548–561. [Google Scholar]
- Morency, L.P.; Quattoni, A.; Darrell, T. Latent-dynamic discriminative models for continuous gesture recognition. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8. [Google Scholar]
- Wang, S.B.; Quattoni, A.; Morency, L.P.; Demirdjian, D.; Darrell, T. Hidden conditional random fields for gesture recognition. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; Volume 2, pp. 1521–1527. [Google Scholar]
- Wang, L.; Suter, D. Recognizing human activities from silhouettes: Motion subspace and factorial discriminative graphical model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 18–23 June 2007; pp. 1–8. [Google Scholar]
- Tang, K.; Fei-Fei, L.; Koller, D. Learning latent temporal structure for complex event detection. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1250–1257. [Google Scholar]
- Wang, Z.; Wang, J.; Xiao, J.; Lin, K.H.; Huang, T. Substructure and boundary modeling for continuous action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1330–1337. [Google Scholar]
- Luo, G.; Yang, S.; Tian, G.; Yuan, C.; Hu, W.; Maybank, S.J. Learning Depth from Monocular Videos using Deep Neural Networks. J. Comput. Vis. 2014, 10, 1–10. [Google Scholar]
- Yuan, C.; Hu, W.; Tian, G.; Yang, S.; Wang, H. Multi-task sparse learning with beta process prior for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 423–429. [Google Scholar]
- Kar, A.; Rai, N.; Sikka, K.; Sharma, G. Adascan: Adaptive scan pooling in deep convolutional neural networks for human action recognition in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3376–3385. [Google Scholar]
- Varol, G.; Laptev, I.; Schmid, C. Long-term temporal convolutions for action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1510–1517. [Google Scholar] [CrossRef]
- Feichtenhofer, C.; Pinz, A.; Wildes, R.P. Spatiotemporal multiplier networks for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4768–4777. [Google Scholar]
- Liu, J.; Shahroudy, A.; Xu, D.; Wang, G. Spatio-temporal lstm with trust gates for 3d human action recognition. In Computer Vision–ECCV 2016, Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part III 14; Springer: Cham, Switzerland, 2016; pp. 816–833. [Google Scholar]
- Shin, J.; Miah, A.S.M.; Akiba, Y.; Hirooka, K.; Hassan, N.; Hwang, Y.S. Korean sign language alphabet recognition through the integration of handcrafted and deep learning-based two-stream feature extraction approach. IEEE Access 2024, 12, 68303–68318. [Google Scholar] [CrossRef]
- Zhang, B.; Wang, L.; Wang, Z.; Qiao, Y.; Wang, H. Real-time action recognition with deeply transferred motion vector cnns. IEEE Trans. Image Process. 2018, 27, 2326–2339. [Google Scholar] [CrossRef]
- Hassan, N.; Miah, A.S.M.; Shin, J. Enhancing human action recognition in videos through dense-level features extraction and optimized long short-term memory. In Proceedings of the 2024 7th International Conference on Electronics, Communications, and Control Engineering (ICECC), Kuala Lumpur, Malaysia, 22–24 March 2024; pp. 19–23. [Google Scholar]
- Donahue, J.; Anne Hendricks, L.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2625–2634. [Google Scholar]
- Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. arXiv 2017, arXiv:1705.07750. [Google Scholar]
- Ng, J.Y.H.; Hausknecht, M.J.; Vijayanarasimhan, S.; Vinyals, O.; Monga, R.; Toderici, G. Beyond short snippets: Deep networks for video classification. arXiv 2015, arXiv:1503.08909. [Google Scholar]
- Song, S.; Lan, C.; Xing, J.; Zeng, W.; Liu, J. Spatio-temporal attention-based LSTM networks for 3D action recognition and detection. IEEE Trans. Image Process. 2018, 27, 3459–3471. [Google Scholar] [CrossRef] [PubMed]
- Horn, B.K.; Schunck, B.G. Determining optical flow. Artif. Intell. 1981, 17, 185–203. [Google Scholar] [CrossRef]
- Lan, Z.; Zhu, Y.; Hauptmann, A.G.; Newsam, S. Deep local video feature for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 1–7. [Google Scholar]
- Zhou, B.; Andonian, A.; Oliva, A.; Torralba, A. Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 803–818. [Google Scholar]
- Liu, M.; Liu, H.; Chen, C. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognit. 2017, 68, 346–362. [Google Scholar] [CrossRef]
- Qiu, Z.; Yao, T.; Mei, T. Learning spatio-temporal representation with pseudo-3d residual networks. arXiv 2017, arXiv:1711.10305. [Google Scholar]
- Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6450–6459. [Google Scholar]
- Zhao, Y.; Xiong, Y.; Lin, D. Trajectory convolution for action recognition. Adv. Neural Inf. Process. Syst. 2018, 31, 2208–2219. [Google Scholar]
- Wang, L.; Li, W.; Li, W.; Van Gool, L. Appearance-and-relation networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1430–1439. [Google Scholar]
- Xie, S.; Sun, C.; Huang, J.; Tu, Z.; Murphy, K. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 305–321. [Google Scholar]
- Tran, D.; Wang, H.; Torresani, L.; Feiszli, M. Video classification with channel-separated convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5552–5561. [Google Scholar]
- Feichtenhofer, C. X3d: Expanding architectures for efficient video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 203–213. [Google Scholar]
- Yang, C.; Xu, Y.; Shi, J.; Dai, B.; Zhou, B. Temporal pyramid network for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 591–600. [Google Scholar]
- Zhang, S.; Guo, S.; Huang, W.; Scott, M.R.; Wang, L. V4d: 4d convolutional neural networks for video-level representation learning. arXiv 2020, arXiv:2002.07442. [Google Scholar]
- Qin, Y.; Mo, L.; Xie, B. Feature fusion for human action recognition based on classical descriptors and 3D convolutional networks. In Proceedings of the 2017 Eleventh International Conference on Sensing Technology (ICST), Auckland, New Zealand, 28–30 November 2017; pp. 1–5. [Google Scholar]
- Diba, A.; Fayyaz, M.; Sharma, V.; Karami, A.H.; Arzani, M.M.; Yousefzadeh, R.; Van Gool, L. Temporal 3d convnets: New architecture and transfer learning for video classification. arXiv 2017, arXiv:1711.08200. [Google Scholar]
- Zhu, J.; Zhu, Z.; Zou, W. End-to-end video-level representation learning for action recognition. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 645–650. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the NIPS’17: 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Girdhar, R.; Carreira, J.; Doersch, C.; Zisserman, A. Video action transformer network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 244–253. [Google Scholar]
- Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. From CNNs to transformers in multimodal human action recognition: A survey. ACM Trans. Multimed. Comput. Commun. Appl. 2024, 20, 260. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Xiao, F.; Lee, Y.J.; Grauman, K.; Malik, J.; Feichtenhofer, C. Audiovisual slowfast networks for video recognition. arXiv 2020, arXiv:2001.08740. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 6836–6846. [Google Scholar]
- Yan, S.; Xiong, X.; Arnab, A.; Lu, Z.; Zhang, M.; Sun, C.; Schmid, C. Multiview transformers for video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 3333–3343. [Google Scholar]
- Li, K.; Wang, Y.; Zhang, J.; Gao, P.; Song, G.; Liu, Y.; Li, H.; Qiao, Y. Uniformer: Unifying convolution and self-attention for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12581–12600. [Google Scholar] [CrossRef]
- Tong, Z.; Song, Y.; Wang, J.; Wang, L. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Adv. Neural Inf. Process. Syst. 2022, 35, 10078–10093. [Google Scholar]
- Wang, Y.; Li, K.; Li, Y.; He, Y.; Huang, B.; Zhao, Z.; Zhang, H.; Xu, J.; Liu, Y.; Wang, Z.; et al. Internvideo: General video foundation models via generative and discriminative learning. arXiv 2022, arXiv:2212.03191. [Google Scholar]
- Han, J.; Shao, L.; Xu, D.; Shotton, J. Enhanced computer vision with microsoft kinect sensor: A review. IEEE Trans. Cybern. 2013, 43, 1318–1334. [Google Scholar]
- Fang, H.S.; Xie, S.; Tai, Y.W.; Lu, C. Rmpe: Regional multi-person pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2334–2343. [Google Scholar]
- Xiu, Y.; Li, J.; Wang, H.; Fang, Y.; Lu, C. Pose Flow: Efficient online pose tracking. arXiv 2018, arXiv:1802.00977. [Google Scholar]
- Yang, Y.; Ramanan, D. Articulated human detection with flexible mixtures of parts. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 2878–2890. [Google Scholar] [CrossRef]
- Chen, X.; Yuille, A.L. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
- Bulat, A.; Tzimiropoulos, G. Human pose estimation via convolutional part heatmap regression. In Computer Vision–ECCV 2016, Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VII 14; Springer: Cham, Switzerland, 2016; pp. 717–732. [Google Scholar]
- Toshev, A.; Szegedy, C. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660. [Google Scholar]
- Carreira, J.; Agrawal, P.; Fragkiadaki, K.; Malik, J. Human pose estimation with iterative error feedback. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4733–4742. [Google Scholar]
- Zhou, X.; Zhu, M.; Pavlakos, G.; Leonardos, S.; Derpanis, K.G.; Daniilidis, K. Monocap: Monocular human motion capture using a cnn coupled with a geometric prior. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 901–914. [Google Scholar] [CrossRef]
- Nunes, U.M.; Faria, D.R.; Peixoto, P. A human activity recognition framework using max-min features and key poses with differential evolution random forests classifier. Pattern Recognit. Lett. 2017, 99, 21–31. [Google Scholar] [CrossRef]
- Chen, Y. Reduced basis decomposition: A certified and fast lossy data compression algorithm. Comput. Math. Appl. 2015, 70, 2566–2574. [Google Scholar] [CrossRef]
- Veeriah, V.; Zhuang, N.; Qi, G.J. Differential recurrent neural networks for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4041–4049. [Google Scholar]
- Zhu, W.; Lan, C.; Xing, J.; Zeng, W.; Li, Y.; Shen, L.; Xie, X. Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30. [Google Scholar]
- Li, C.; Hou, Y.; Wang, P.; Li, W. Joint distance maps based action recognition with convolutional neural networks. IEEE Signal Process. Lett. 2017, 24, 624–628. [Google Scholar] [CrossRef]
- Soo Kim, T.; Reiter, A. Interpretable 3d human action analysis with temporal convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 20–28. [Google Scholar]
- Das, S.; Koperski, M.; Bremond, F.; Francesca, G. Deep-temporal lstm for daily living action recognition. In Proceedings of the 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018; pp. 1–6. [Google Scholar]
- Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1227–1236. [Google Scholar]
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12026–12035. [Google Scholar]
- Trelinski, J.; Kwolek, B. Ensemble of classifiers using CNN and hand-crafted features for depth-based action recognition. In Artificial Intelligence and Soft Computing: Proceedings of the 18th International Conference, ICAISC 2019, Zakopane, Poland, 16–20 June 2019; Proceedings, Part II 18; Springer: Cham, Switzerland, 2019; pp. 91–103. [Google Scholar]
- Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Actional-structural graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3595–3603. [Google Scholar]
- Huynh-The, T.; Hua, C.H.; Kim, D.S. Encoding pose features to images with data augmentation for 3-D action recognition. IEEE Trans. Ind. Inform. 2019, 16, 3100–3111. [Google Scholar] [CrossRef]
- Huynh-The, T.; Hua, C.H.; Ngo, T.T.; Kim, D.S. Image representation of pose-transition feature for 3D skeleton-based action recognition. Inf. Sci. 2020, 513, 112–126. [Google Scholar] [CrossRef]
- Naveenkumar, M.; Domnic, S. Deep ensemble network using distance maps and body part features for skeleton based action recognition. Pattern Recognit. 2020, 100, 107125. [Google Scholar]
- Plizzari, C.; Cannici, M.; Matteucci, M. Skeleton-based action recognition via spatial and temporal transformer networks. Comput. Vis. Image Underst. 2021, 208, 103219. [Google Scholar] [CrossRef]
- Snoun, A.; Jlidi, N.; Bouchrika, T.; Jemai, O.; Zaied, M. Towards a deep human activity recognition approach based on video to image transformation with skeleton data. Multimed. Tools Appl. 2021, 80, 29675–29698. [Google Scholar] [CrossRef]
- Duan, H.; Wang, J.; Chen, K.; Lin, D. Pyskl: Towards good practices for skeleton action recognition. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 7351–7354. [Google Scholar]
- Song, Y.F.; Zhang, Z.; Shan, C.; Wang, L. Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1474–1488. [Google Scholar] [CrossRef]
- Zhu, G.; Wan, C.; Cao, L.; Wang, X. Relation-mining self-attention network for skeleton-based human action recognition. Pattern Recognit. 2023, 135, 109098. [Google Scholar] [CrossRef]
- Zhang, G.; Wen, S.; Li, J.; Che, H. Fast 3D-graph convolutional networks for skeleton-based action recognition. Appl. Soft Comput. 2023, 145, 110575. [Google Scholar] [CrossRef]
- Liu, Y.; Zhang, H.; Li, Y.; He, K.; Xu, D. Skeleton-based human action recognition via large-kernel attention graph convolutional network. IEEE Trans. Vis. Comput. Graph. 2023, 29, 2575–2585. [Google Scholar] [CrossRef]
- Liang, C.; Yang, J.; Du, R.; Hu, W.; Hou, N. Temporal-Channel Attention and Convolution Fusion for Skeleton-Based Human Action Recognition. IEEE Access 2024, 12, 64937–64949. [Google Scholar] [CrossRef]
- Karthika, S.; Nancy Jane, Y.; Khanna Nehemiah, H. Spatio-temporal 3D skeleton kinematic joint point classification model for human activity recognition. J. Vis. Commun. Image Represent. 2025, 110, 104471. [Google Scholar] [CrossRef]
- Sun, T.; Lian, C.; Dong, F.; Shao, J.; Zhang, X.; Xiao, Q.; Ju, Z.; Zhao, Y. Skeletal joint image-based multi-channel fusion network for human activity recognition. Knowl.-Based Syst. 2025, 315, 113232. [Google Scholar] [CrossRef]
- Mehmood, F.; Guo, X.; Chen, E.; Akbar, M.A.; Khan, A.A.; Ullah, S. Extended multi-stream temporal-attention module for skeleton-based human action recognition (HAR). Comput. Hum. Behav. 2025, 163, 108482. [Google Scholar] [CrossRef]
- Shao, L.; Ji, L.; Liu, Y.; Zhang, J. Human action segmentation and recognition via motion and shape analysis. Pattern Recognit. Lett. 2012, 33, 438–445. [Google Scholar] [CrossRef]
- Yang, X.; Zhang, C.; Tian, Y. Recognizing actions using depth motion maps-based histograms of oriented gradients. In Proceedings of the 20th ACM International Conference on Multimedia, Nara, Japan, 29 October–2 November 2012; pp. 1057–1060. [Google Scholar]
- Chen, W.; Guo, G. TriViews: A general framework to use 3D depth data effectively for action recognition. J. Vis. Commun. Image Represent. 2015, 26, 182–191. [Google Scholar] [CrossRef]
- Miao, J.; Jia, X.; Mathew, R.; Xu, X.; Taubman, D.; Qing, C. Efficient action recognition from compressed depth maps. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 16–20. [Google Scholar]
- Shotton, J.; Sharp, T.; Kipman, A.; Fitzgibbon, A.; Finocchio, M.; Blake, A.; Cook, M.; Moore, R. Real-time human pose recognition in parts from single depth images. Commun. ACM 2013, 56, 116–124. [Google Scholar] [CrossRef]
- Xia, L.; Chen, C.C.; Aggarwal, J.K. View invariant human action recognition using histograms of 3d joints. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; pp. 20–27. [Google Scholar]
- Keceli, A.S.; Can, A.B. Recognition of basic human actions using depth information. Int. J. Pattern Recognit. Artif. Intell. 2014, 28, 1450004. [Google Scholar] [CrossRef]
- Pazhoumand-Dar, H.; Lam, C.P.; Masek, M. Joint movement similarities for robust 3D action recognition using skeletal data. J. Vis. Commun. Image Represent. 2015, 30, 10–21. [Google Scholar] [CrossRef]
- Zhang, Z. Microsoft kinect sensor and its effect. IEEE Multimed. 2012, 19, 4–10. [Google Scholar] [CrossRef]
- Ding, Z.; Wang, P.; Ogunbona, P.O.; Li, W. Investigation of different skeleton features for cnn-based 3D action recognition. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China, 10–14 July 2017; pp. 617–622. [Google Scholar]
- Caetano, C.; Sena, J.; Brémond, F.; Dos Santos, J.A.; Schwartz, W.R. Skelemotion: A new representation of skeleton joint sequences based on motion information for 3d action recognition. In Proceedings of the 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Taipei, Taiwan, 18–21 September 2019; pp. 1–8. [Google Scholar]
- Liu, H.; Tu, J.; Liu, M. Two-stream 3d convolutional neural network for skeleton-based action recognition. arXiv 2017, arXiv:1705.08106. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
- Ogiela, M.R.; Jain, L.C. Computational Intelligence Paradigms in Advanced Pattern Classification; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012; Volume 386. [Google Scholar]
- Du, Y.; Wang, W.; Wang, L. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1110–1118. [Google Scholar]
- Liu, J.; Wang, G.; Hu, P.; Duan, L.Y.; Kot, A.C. Global context-aware attention lstm networks for 3d action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1647–1656. [Google Scholar]
- Li, S.; Li, W.; Cook, C.; Zhu, C.; Gao, Y. Independently recurrent neural network (indrnn): Building a longer and deeper rnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5457–5466. [Google Scholar]
- Miah, A.S.M.; Hasan, M.A.M.; Shin, J. Dynamic Hand Gesture Recognition using Multi-Branch Attention Based Graph and General Deep Learning Model. IEEE Access 2023, 11, 4703–4716. [Google Scholar] [CrossRef]
- Shin, J.; Miah, A.S.M.; Suzuki, K.; Hirooka, K.; Hasan, M.A.M. Dynamic Korean Sign Language Recognition Using Pose Estimation Based and Attention-Based Neural Network. IEEE Access 2023, 11, 143501–143513. [Google Scholar] [CrossRef]
- Shin, J.; Kaneko, Y.; Miah, A.S.M.; Hassan, N.; Nishimura, S. Anomaly Detection in Weakly Supervised Videos Using Multistage Graphs and General Deep Learning Based Spatial-Temporal Feature Enhancement. IEEE Access 2024, 12, 65213–65227. [Google Scholar] [CrossRef]
- Shin, J.; Miah, A.S.M.; Egawa, R.; Hassan, N.; Hirooka, K.; Tomioka, Y. Multimodal Fall Detection Using Spatial–Temporal Attention and Bi-LSTM-Based Feature Fusion. Future Internet 2025, 17, 173. [Google Scholar] [CrossRef]
- Miah, A.S.M.; Hasan, M.A.M.; Nishimura, S.; Shin, J. Sign Language Recognition using Graph and General Deep Neural Network Based on Large Scale Dataset. IEEE Access 2024, 12, 34553–34569. [Google Scholar] [CrossRef]
- Miah, A.S.M.; Hasan, M.A.M.; Jang, S.W.; Lee, H.S.; Shin, J. Multi-Stream General and Graph-Based Deep Neural Networks for Skeleton-Based Sign Language Recognition. Electronics 2023, 12, 2841. [Google Scholar] [CrossRef]
- Gori, M.; Monfardini, G.; Scarselli, F. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July–4 August 2005; Volume 2, pp. 729–734. [Google Scholar]
- Li, R.; Tapaswi, M.; Liao, R.; Jia, J.; Urtasun, R.; Fidler, S. Situation recognition with graph neural networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4173–4182. [Google Scholar]
- Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907. [Google Scholar]
- Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
- Shiraki, K.; Hirakawa, T.; Yamashita, T.; Fujiyoshi, H. Spatial temporal attention graph convolutional networks with mechanics-stream for skeleton-based action recognition. In Proceedings of the Asian Conference on Computer Vision, Virtual, 30 November–4 December 2020. [Google Scholar]
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-based action recognition with multi-stream adaptive graph convolutional networks. IEEE Trans. Image Process. 2020, 29, 9532–9545. [Google Scholar] [CrossRef] [PubMed]
- Huang, J.; Huang, Z.; Xiang, X.; Gong, X.; Zhang, B. Long-short graph memory network for skeleton-based action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 7–10 January 2020; pp. 645–652. [Google Scholar]
- Cheng, K.; Zhang, Y.; He, X.; Chen, W.; Cheng, J.; Lu, H. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 183–192. [Google Scholar]
- Song, Y.F.; Zhang, Z.; Shan, C.; Wang, L. Stronger, faster and more explainable: A graph convolutional baseline for skeleton-based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1625–1633. [Google Scholar]
- Thakkar, K.; Narayanan, P. Part-based graph convolutional network for action recognition. arXiv 2018, arXiv:1809.04983. [Google Scholar]
- Li, B.; Li, X.; Zhang, Z.; Wu, F. Spatio-temporal graph routing for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8561–8568. [Google Scholar]
- Chi, H.G.; Ha, M.H.; Chi, S.; Lee, S.W.; Huang, Q.; Ramani, K. Infogcn: Representation learning for human skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 20186–20196. [Google Scholar]
- Zheng, Y.; Zhang, Y.; Qian, K.; Zhang, G.; Liu, Y.; Wu, C.; Yang, Z. Zero-effort cross-domain gesture recognition with Wi-Fi. In Proceedings of the 17th Annual International Conference on Mobile Systems, Applications, and Services, Seoul, Republic of Korea, 17–21 June 2019; pp. 313–325. [Google Scholar]
- Sanhudo, L.; Calvetti, D.; Martins, J.P.; Ramos, N.M.; Meda, P.; Goncalves, M.C.; Sousa, H. Activity classification using accelerometers and machine learning for complex construction worker activities. J. Build. Eng. 2021, 35, 102001. [Google Scholar] [CrossRef]
- Chen, K.; Zhang, D.; Yao, L.; Guo, B.; Yu, Z.; Liu, Y. Deep learning for sensor-based human activity recognition: Overview, challenges, and opportunities. ACM Comput. Surv. (CSUR) 2021, 54, 1–40. [Google Scholar] [CrossRef]
- Huan, R.; Jiang, C.; Ge, L.; Shu, J.; Zhan, Z.; Chen, P.; Chi, K.; Liang, R. Human complex activity recognition with sensor data using multiple features. IEEE Sens. J. 2021, 22, 757–775. [Google Scholar] [CrossRef]
- Nafea, O.; Abdul, W.; Muhammad, G.; Alsulaiman, M. Sensor-based human activity recognition with spatio-temporal deep learning. Sensors 2021, 21, 2141. [Google Scholar] [CrossRef]
- Kabir, M.H.; Mahmood, S.; Al Shiam, A.; Musa Miah, A.S.; Shin, J.; Molla, M.K.I. Investigating Feature Selection Techniques to Enhance the Performance of EEG-Based Motor Imagery Tasks Classification. Mathematics 2023, 11, 1921. [Google Scholar] [CrossRef]
- Al Farid, F.; Bari, A.; Mansor, S.; Uddin, J.; Kumaresan, S.P. A Structured and Methodological Review on Multi-View Human Activity Recognition for Ambient Assisted Living. J. Imaging 2025, 11, 182. [Google Scholar] [CrossRef]
- Stisen, A.; Blunck, H.; Bhattacharya, S.; Prentow, T.S.; Kjærgaard, M.B.; Dey, A.; Sonne, T.; Jensen, M.M. Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, Seoul, Republic of Korea, 1–4 November 2015; pp. 127–140. [Google Scholar]
- Abbas, S.; Alsubai, S.; Sampedro, G.A.; ul Haque, M.I.; Almadhor, A.; Al Hejaili, A.; Ivanochko, I. Active Machine Learning for Heterogeneity Activity Recognition Through Smartwatch Sensors. IEEE Access 2024, 12, 22595–22607. [Google Scholar] [CrossRef]
- Banos, O.; Garcia, R.; Holgado-Terriza, J.A.; Damas, M.; Pomares, H.; Rojas, I.; Saez, A.; Villalonga, C. mHealthDroid: A novel framework for agile development of mobile health applications. In Proceedings of the International Workshop on Ambient Assisted Living, Belfast, UK, 2–5 December 2014; pp. 2–5. [Google Scholar]
- El-Adawi, E.; Essa, E.; Handosa, M.; Elmougy, S. Wireless body area sensor networks based human activity recognition using deep learning. Sci. Rep. 2024, 14, 2702. [Google Scholar] [CrossRef]
- Chavarriaga, R.; Sagha, H.; Calatroni, A.; Digumarti, S.T.; Tröster, G.; Millán, J.d.R.; Roggen, D. The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition. Pattern Recognit. Lett. 2013, 34, 2033–2042. [Google Scholar] [CrossRef]
- Ye, X.; Wang, K.I.K. Deep Generative Domain Adaptation with Temporal Relation Knowledge for Cross-User Activity Recognition. arXiv 2024, arXiv:2403.14682. [Google Scholar]
- Kwapisz, J.R.; Weiss, G.M.; Moore, S.A. Activity recognition using cell phone accelerometers. ACM Sigkdd Explor. Newsl. 2011, 12, 74–82. [Google Scholar] [CrossRef]
- Kaya, Y.; Topuz, E.K. Human activity recognition from multiple sensors data using deep CNNs. Multimed. Tools Appl. 2024, 83, 10815–10838. [Google Scholar] [CrossRef]
- Anguita, D.; Ghio, A.; Oneto, L.; Parra, X.; Reyes-Ortiz, J.L. A public domain dataset for human activity recognition using smartphones. In Proceedings of the ESANN, Bruges, Belgium, 24–26 April 2013; Volume 3, p. 3. [Google Scholar]
- Reiss, A.; Stricker, D. Introducing a new benchmarked dataset for activity monitoring. In Proceedings of the 2012 16th International Symposium on Wearable Computers, Newcastle, UK, 18–22 June 2012; pp. 108–109. [Google Scholar]
- Zhu, Y.; Luo, H.; Chen, R.; Zhao, F. DiamondNet: A Neural-Network-Based Heterogeneous Sensor Attentive Fusion for Human Activity Recognition. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 15321–15331. [Google Scholar] [CrossRef]
- Altun, K.; Barshan, B.; Tunçel, O. Comparative study on classifying human activities with miniature inertial and magnetic sensors. Pattern Recognit. 2010, 43, 3605–3620. [Google Scholar] [CrossRef]
- Zhang, H.; Xu, L. Multi-STMT: Multi-level network for human activity recognition based on wearable sensors. IEEE Trans. Instrum. Meas. 2024, 73, 2508612. [Google Scholar] [CrossRef]
- Sztyler, T.; Stuckenschmidt, H. On-body localization of wearable devices: An investigation of position-aware activity recognition. In Proceedings of the 2016 IEEE International Conference on Pervasive Computing and Communications (PerCom), Sydney, Australia, 14–19 March 2016; pp. 1–9. [Google Scholar]
- Khan, D.; Al Mudawi, N.; Abdelhaq, M.; Alazeb, A.; Alotaibi, S.S.; Algarni, A.; Jalal, A. A wearable inertial sensor approach for locomotion and localization recognition on physical activity. Sensors 2024, 24, 735. [Google Scholar] [CrossRef]
- Cheng, H.T.; Sun, F.T.; Griss, M.; Davis, P.; Li, J.; You, D. Nuactiv: Recognizing unseen new activities using semantic attribute-based learning. In Proceedings of the 11th Annual International Conference on Mobile Systems, Applications, and Services, Taipei, Taiwan, 25–28 June 2013; pp. 361–374. [Google Scholar]
- Zolfaghari, P.; Rey, V.F.; Ray, L.; Kim, H.; Suh, S.; Lukowicz, P. Sensor Data Augmentation from Skeleton Pose Sequences for Improving Human Activity Recognition. arXiv 2024, arXiv:2406.16886. [Google Scholar]
- Shoaib, M.; Bosch, S.; Incel, O.D.; Scholten, H.; Havinga, P.J. Fusion of smartphone motion sensors for physical activity recognition. Sensors 2014, 14, 10146–10176. [Google Scholar] [CrossRef]
- Zhang, L.; Yu, J.; Gao, Z.; Ni, Q. A multi-channel hybrid deep learning framework for multi-sensor fusion enabled human activity recognition. Alex. Eng. J. 2024, 91, 472–485. [Google Scholar] [CrossRef]
- Huynh, T.; Fritz, M.; Schiele, B. Discovery of activity patterns using topic models. In Proceedings of the 10th International Conference on Ubiquitous Computing, Seoul, Republic of Korea, 21–24 September 2008; pp. 10–19. [Google Scholar]
- Micucci, D.; Mobilio, M.; Napoletano, P. Unimib shar: A dataset for human activity recognition using acceleration data from smartphones. Appl. Sci. 2017, 7, 1101. [Google Scholar] [CrossRef]
- Yao, M.; Zhang, L.; Cheng, D.; Qin, L.; Liu, X.; Fu, Z.; Wu, H.; Song, A. Revisiting Large-Kernel CNN Design via Structural Re-Parameterization for Sensor-Based Human Activity Recognition. IEEE Sens. J. 2024, 24, 12863–12876. [Google Scholar] [CrossRef]
- Zhang, M.; Sawchuk, A.A. USC-HAD: A daily activity dataset for ubiquitous activity recognition using wearable sensors. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing, Pittsburgh, PA, USA, 5–8 September 2012; pp. 1036–1043. [Google Scholar]
- Vavoulas, G.; Chatzaki, C.; Malliotakis, T.; Pediaditis, M.; Tsiknakis, M. The mobiact dataset: Recognition of activities of daily living using smartphones. In Proceedings of the International Conference on Information and Communication Technologies for Ageing Well and E-Health, Rome, Italy, 21–22 April 2016; SciTePress: Porto, Portugal, 2016; Volume 2, pp. 143–151. [Google Scholar]
- Khaertdinov, B.; Asteriadis, S. Explaining, Analyzing, and Probing Representations of Self-Supervised Learning Models for Sensor-based Human Activity Recognition. In Proceedings of the 2023 IEEE International Joint Conference on Biometrics (IJCB), Ljubljana, Slovenia, 25–28 September 2023; pp. 1–10. [Google Scholar]
- Malekzadeh, M.; Clegg, R.G.; Cavallaro, A.; Haddadi, H. Protecting sensory data against sensitive inferences. In Proceedings of the 1st Workshop on Privacy by Design in Distributed Systems, Porto, Portugal, 23–26 April 2018; pp. 1–6. [Google Scholar]
- Saha, U.; Saha, S.; Kabir, M.T.; Fattah, S.A.; Saquib, M. Decoding human activities: Analyzing wearable accelerometer and gyroscope data for activity recognition. IEEE Sens. Lett. 2024, 8, 7003904. [Google Scholar] [CrossRef]
- van Kasteren, T.L.; Englebienne, G.; Kröse, B.J. Human activity recognition from wireless sensor network data: Benchmark and software. In Activity Recognition in Pervasive Intelligent Environments; Springer: Berlin/Heidelberg, Germany, 2011; pp. 165–186. [Google Scholar]
- Cook, D.J.; Crandall, A.S.; Thomas, B.L.; Krishnan, N.C. CASAS: A smart home in a box. Computer 2012, 46, 62–69. [Google Scholar] [CrossRef]
- Kim, H.; Lee, D. CLAN: A Contrastive Learning based Novelty Detection Framework for Human Activity Recognition. arXiv 2024, arXiv:2401.10288. [Google Scholar]
- Zappi, P.; Lombriser, C.; Stiefmeier, T.; Farella, E.; Roggen, D.; Benini, L.; Tröster, G. Activity recognition from on-body sensors: Accuracy-power trade-off by dynamic sensor selection. In Wireless Sensor Networks, Proceedings of the 5th European Conference, EWSN 2008, Bologna, Italy, 30 January–1 February 2008; Proceedings; Springer: Berlin/Heidelberg, Germany, 2008; pp. 17–33. [Google Scholar]
- Zhang, Z.; Wang, W.; An, A.; Qin, Y.; Yang, F. A human activity recognition method using wearable sensors based on convtransformer model. Evol. Syst. 2023, 14, 939–955. [Google Scholar] [CrossRef]
- Chen, J.; Xu, X.; Wang, T.; Jeon, G.; Camacho, D. An AIoT Framework With Multi-modal Frequency Fusion for WiFi-Based Coarse and Fine Activity Recognition. IEEE Internet Things J. 2024, 11, 39020–39029. [Google Scholar] [CrossRef]
- Reyes-Ortiz, J.L.; Oneto, L.; Samà, A.; Parra, X.; Anguita, D. Transition-aware human activity recognition using smartphones. Neurocomputing 2016, 171, 754–767. [Google Scholar] [CrossRef]
- Jain, A.; Kanhangad, V. Human activity classification in smartphones using accelerometer and gyroscope sensors. IEEE Sens. J. 2017, 18, 1169–1177. [Google Scholar] [CrossRef]
- Ignatov, A. Real-time human activity recognition from accelerometer data using convolutional neural networks. Appl. Soft Comput. 2018, 62, 915–922. [Google Scholar] [CrossRef]
- Chen, K.; Yao, L.; Zhang, D.; Wang, X.; Chang, X.; Nie, F. A semisupervised recurrent convolutional attention model for human activity recognition. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 1747–1756. [Google Scholar] [CrossRef] [PubMed]
- Kavuncuoğlu, E.; Uzunhisarcıklı, E.; Barshan, B.; Özdemir, A.T. Investigating the performance of wearable motion sensors on recognizing falls and daily activities via machine learning. Digit. Signal Process. 2022, 126, 103365. [Google Scholar] [CrossRef]
- Lu, L.; Zhang, C.; Cao, K.; Deng, T.; Yang, Q. A multichannel CNN-GRU model for human activity recognition. IEEE Access 2022, 10, 66797–66810. [Google Scholar] [CrossRef]
- Kim, Y.W.; Cho, W.H.; Kim, K.S.; Lee, S. Oversampling technique-based data augmentation and 1D-CNN and bidirectional GRU ensemble model for human activity recognition. J. Mech. Med. Biol. 2022, 22, 2240048. [Google Scholar] [CrossRef]
- Lin, Y.; Wu, J. A novel multichannel dilated convolution neural network for human activity recognition. Math. Probl. Eng. 2020, 2020, 5426532. [Google Scholar] [CrossRef]
- Nadeem, A.; Jalal, A.; Kim, K. Automatic human posture estimation for sport activity recognition with robust body parts detection and entropy markov model. Multimed. Tools Appl. 2021, 80, 21465–21498. [Google Scholar] [CrossRef]
- Zhang, J.; Wu, F.; Wei, B.; Zhang, Q.; Huang, H.; Shah, S.W.; Cheng, J. Data augmentation and dense-LSTM for human activity recognition using WiFi signal. IEEE Internet Things J. 2020, 8, 4628–4641. [Google Scholar] [CrossRef]
- Alawneh, L.; Mohsen, B.; Al-Zinati, M.; Shatnawi, A.; Al-Ayyoub, M. A comparison of unidirectional and bidirectional lstm networks for human activity recognition. In Proceedings of the 2020 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops), Austin, TX, USA, 23–27 March 2020; pp. 1–6. [Google Scholar]
- Wei, X.; Wang, Z. TCN-attention-HAR: Human activity recognition based on attention mechanism time convolutional network. Sci. Rep. 2024, 14, 7414. [Google Scholar] [CrossRef]
- Sarkar, A.; Hossain, S.S.; Sarkar, R. Human activity recognition from sensor data using spatial attention-aided CNN with genetic algorithm. Neural Comput. Appl. 2023, 35, 5165–5191. [Google Scholar] [CrossRef]
- Semwal, V.B.; Jain, R.; Maheshwari, P.; Khatwani, S. Gait reference trajectory generation at different walking speeds using LSTM and CNN. Multimed. Tools Appl. 2023, 82, 33401–33419. [Google Scholar] [CrossRef]
- Liu, K.; Gao, C.; Li, B.; Liu, W. Human activity recognition through deep learning: Leveraging unique and common feature fusion in wearable multi-sensor systems. Appl. Soft Comput. 2024, 151, 111146. [Google Scholar] [CrossRef]
- Khan, S.I.; Dawood, H.; Khan, M.; Issa, G.F.; Hussain, A.; Alnfiai, M.M.; Adnan, K.M. Transition-aware human activity recognition using an ensemble deep learning framework. Comput. Hum. Behav. 2025, 162, 108435. [Google Scholar] [CrossRef]
- Sarakon, S.; Massagram, W.; Tamee, K. Multisource Data Fusion Using MLP for Human Activity Recognition. Comput. Mater. Contin. (CMC) 2025, 82, 2110–2136. [Google Scholar] [CrossRef]
- Yao, M.; Cheng, D.; Zhang, L.; Wang, L.; Wu, H.; Song, A. Long kernel distillation in human activity recognition. Knowl.-Based Syst. 2025, 316, 113397. [Google Scholar] [CrossRef]
- Thakur, D.; Dangi, S.; Lalwani, P. A novel hybrid deep learning approach with GWO–WOA optimization technique for human activity recognition. Biomed. Signal Process. Control 2025, 99, 106870. [Google Scholar] [CrossRef]
- Hu, L.; Zhao, K.; Ling, B.W.K.; Liang, S.; Wei, Y. Improving human activity recognition via graph attention network with linear discriminant analysis and residual learning. Biomed. Signal Process. Control 2025, 100, 107053. [Google Scholar] [CrossRef]
- Yu, X.; Al-qaness, M.A. ASK-HAR: Attention-Based Multi-Core Selective Kernel Convolution Network for Human Activity Recognition. Measurement 2025, 242, 115981. [Google Scholar] [CrossRef]
- Muralidharan, A.; Mahfuz, S. Human Activity Recognition Using Hybrid CNN-RNN Architecture. Procedia Comput. Sci. 2025, 257, 336–343. [Google Scholar] [CrossRef]
- Yang, Z.; Zhang, S.; Wei, Z.; Zhang, Y.; Zhang, L.; Li, H. Semi-supervised Human Activity Recognition with Individual Difference Alignment. Expert Syst. Appl. 2025, 275, 126976. [Google Scholar] [CrossRef]
- Sharen, H.; Anbarasi, L.J.; Rukmani, P.; Gandomi, A.H.; Neeraja, R.; Narendra, M. WISNet: A deep neural network based human activity recognition system. Expert Syst. Appl. 2024, 258, 124999. [Google Scholar] [CrossRef]
- Teng, Q.; Li, W.; Hu, G.; Shu, Y.; Liu, Y. Innovative Dual-Decoupling CNN With Layer-Wise Temporal-Spatial Attention for Sensor-Based Human Activity Recognition. IEEE J. Biomed. Health Inform. 2025, 29, 1035–1047. [Google Scholar] [CrossRef] [PubMed]
- Dahal, A.; Moulik, S.; Mukherjee, R. Stack-HAR: Complex Human Activity Recognition With Stacking-Based Ensemble Learning Framework. IEEE Sens. J. 2025, 25, 16373–16380. [Google Scholar] [CrossRef]
- Pitombeira-Neto, A.R.; de França, D.S.; Cruz, L.A.; da Silva, T.L.C.; de Macedo, J.A.F. An Ensemble Bayesian Dynamic Linear Model for Human Activity Recognition. IEEE Access 2025, 13, 30316–30333. [Google Scholar] [CrossRef]
- Latyshev, E. Sensor Data Preprocessing, Feature Engineering and Equipment Remaining Lifetime Forecasting for Predictive Maintenance. In Proceedings of the DAMDID/RCDL, Moscow, Russia, 9–12 October 2018; pp. 226–231. [Google Scholar]
- Joy, M.M.H.; Hasan, M.; Miah, A.S.M.; Ahmed, A.; Tohfa, S.A.; Bhuaiyan, M.F.I.; Zannat, A.; Rashid, M.M. Multiclass mi-task classification using logistic regression and filter bank common spatial patterns. In Proceedings of the International Conference on Computing Science, Communication and Security, Gandhinagar, India, 26–27 March 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 160–170. [Google Scholar]
- Miah, A.S.M.; Rahim, M.A.; Shin, J. Motor-imagery classification using riemannian geometry with median absolute deviation. Electronics 2020, 9, 1584. [Google Scholar] [CrossRef]
- Zobaed, T.; Ahmed, S.R.A.; Miah, A.S.M.; Binta, S.M.; Ahmed, M.R.A.; Rashid, M. Real time sleep onset detection from single channel EEG signal using block sample entropy. In Proceedings of the IOP Conference Series: Materials Science and Engineering, IOP Publishing, Dhaka, Bangladesh, 27–28 August 2020; Volume 928, p. 032021. [Google Scholar]
- Hussain, I.; Jany, R.; Boyer, R.; Azad, A.; Alyami, S.A.; Park, S.J.; Hasan, M.M.; Hossain, M.A. An explainable EEG-based human activity recognition model using machine-learning approach and LIME. Sensors 2023, 23, 7452. [Google Scholar] [CrossRef]
- Thakur, D.; Biswas, S.; Ho, E.S.; Chattopadhyay, S. Convae-lstm: Convolutional autoencoder long short-term memory network for smartphone-based human activity recognition. IEEE Access 2022, 10, 4137–4156. [Google Scholar] [CrossRef]
- Madsen, H. Time Series Analysis; Chapman and Hall/CRC: Boca Raton, FL, USA, 2007. [Google Scholar]
- Ye, X.; Wang, K.I.K. Cross-User Activity Recognition Using Deep Domain Adaptation With Temporal Dependency Information. IEEE Trans. Instrum. Meas. 2025, 74, 2520415. [Google Scholar] [CrossRef]
- Park, J.; Kim, D.W.; Lee, J. HT-AggNet: Hierarchical temporal aggregation network with near-zero-cost layer stacking for human activity recognition. Eng. Appl. Artif. Intell. 2025, 149, 110465. [Google Scholar] [CrossRef]
- Ordóñez, F.J.; Roggen, D. Deep convolutional and lstm recurrent neural networks for multimodal wearable activity recognition. Sensors 2016, 16, 115. [Google Scholar] [CrossRef]
- Murad, A.; Pyun, J.Y. Deep recurrent neural networks for human activity recognition. Sensors 2017, 17, 2556. [Google Scholar] [CrossRef]
- Gupta, S. Deep learning based human activity recognition (HAR) using wearable sensor data. Int. J. Inf. Manag. Data Insights 2021, 1, 100046. [Google Scholar] [CrossRef]
- Chen, Z.; Wu, M.; Cui, W.; Liu, C.; Li, X. An attention based CNN-LSTM approach for sleep-wake detection with heterogeneous sensors. IEEE J. Biomed. Health Inform. 2020, 25, 3270–3277. [Google Scholar] [CrossRef]
- Essa, E.; Abdelmaksoud, I.R. Temporal-channel convolution with self-attention network for human activity recognition using wearable sensors. Knowl.-Based Syst. 2023, 278, 110867. [Google Scholar] [CrossRef]
- Zhang, X.Y.; Shi, H.; Li, C.; Li, P. Multi-instance multi-label action recognition and localization based on spatio-temporal pre-trimming for untrimmed videos. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12886–12893. [Google Scholar]
- Cemiloglu, A.; Akay, B. Handling heterogeneity in Human Activity Recognition data by a compact Long Short Term Memory based deep learning approach. Eng. Appl. Artif. Intell. 2025, 153, 110788. [Google Scholar] [CrossRef]
- Di Domenico, S.; De Sanctis, M.; Cianca, E.; Bianchi, G. A trained-once crowd counting method using differential wifi channel state information. In Proceedings of the 3rd International on Workshop on Physical Analytics, Singapore, 26 June 2016; pp. 37–42. [Google Scholar]
- Liu, J.; Teng, G.; Hong, F. Human activity sensing with wireless signals: A survey. Sensors 2020, 20, 1210. [Google Scholar] [CrossRef]
- Jiang, W.; Miao, C.; Ma, F.; Yao, S.; Wang, Y.; Yuan, Y.; Xue, H.; Song, C.; Ma, X.; Koutsonikolas, D.; et al. Towards environment independent device free human activity recognition. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, New Delhi, India, 29 October–2 November 2018; pp. 289–304. [Google Scholar]
- Arshad, S.; Feng, C.; Liu, Y.; Hu, Y.; Yu, R.; Zhou, S.; Li, H. Wi-chase: A WiFi based human activity recognition system for sensorless environments. In Proceedings of the 2017 IEEE 18th International Symposium on A World of Wireless, Mobile and Multimedia Networks (WoWMoM), Macau, China, 12–15 June 2017; pp. 1–6. [Google Scholar]
- Li, C.; Cao, Z.; Liu, Y. Deep AI enabled ubiquitous wireless sensing: A survey. ACM Comput. Surv. (CSUR) 2021, 54, 1–35. [Google Scholar] [CrossRef]
- Ji, S.; Xie, Y.; Li, M. SiFall: Practical online fall detection with RF sensing. In Proceedings of the 20th ACM Conference on Embedded Networked Sensor Systems, Boston, MA, USA, 6–9 November 2022; pp. 563–577. [Google Scholar]
- Zhao, C.; Wang, L.; Xiong, F.; Chen, S.; Su, J.; Xu, H. RFID-Based Human Action Recognition Through Spatiotemporal Graph Convolutional Neural Network. IEEE Internet Things J. 2023, 10, 19898–19912. [Google Scholar] [CrossRef]
- Li, W.; Vishwakarma, S.; Tang, C.; Woodbridge, K.; Piechocki, R.J.; Chetty, C. Using RF Transmissions from IoT Devices for Occupancy Detection and Activity Recognition. IEEE Sens. J. 2022, 22, 2484–2495. [Google Scholar] [CrossRef]
- Muaaz, M.; Waqar, S.; Pätzold, M. Orientation-Independent Human Activity Recognition Using Complementary Radio Frequency Sensing. Sensors 2023, 23, 5810. [Google Scholar] [CrossRef] [PubMed]
- Ali, M.; Marsalek, R. The Human Activity Recognition Using Radio Frequency Signals. In Proceedings of the 2023 33rd International Conference Radioelektronika (RADIOELEKTRONIKA), Pardubice, Czech Republic, 19–20 April 2023. [Google Scholar] [CrossRef]
- Uysal, C.; Filik, T. A New RF Sensing Framework for Human Detection Through the Wall. IEEE Trans. Veh. Technol. 2023, 72, 3600–3610. [Google Scholar] [CrossRef]
- Saeed, U.; Shah, S.A.; Khan, M.Z.; Alotaibi, A.A.; Althobaiti, T.; Ramzan, N.; Imran, M.A.; Abbasi, Q.H. Software-Defined Radio-Based Contactless Localization for Diverse Human Activity Recognition. IEEE Sens. J. 2023, 23, 12041–12048. [Google Scholar] [CrossRef]
- Wang, Z.; Yang, C.; Mao, S. AIGC for RF-Based Human Activity Sensing. IEEE Internet Things J. 2025, 12, 3991–4005. [Google Scholar] [CrossRef]
- Chen, Z.; Cai, C.; Zheng, T.; Luo, J.; Xiong, J.; Wang, X. RF-Based Human Activity Recognition Using Signal Adapted Convolutional Neural Network. arXiv 2023, arXiv:2110.14307. [Google Scholar] [CrossRef]
- Yang, C.; Wang, X.; Mao, S. TARF: Technology-Agnostic RF Sensing for Human Activity Recognition. IEEE J. Biomed. Health Inform. 2023, 27, 636–647. [Google Scholar] [CrossRef]
- Guo, W.; Yamagishi, S.; Jing, L. Human Activity Recognition via Wi-Fi and Inertial Sensors With Machine Learning. IEEE Access 2024, 12, 18821–18836. [Google Scholar] [CrossRef]
- Mohtadifar, M.; Cheffena, M.; Pourafzal, A. Acoustic- and Radio-Frequency-Based Human Activity Recognition. Sensors 2022, 22, 3125. [Google Scholar] [CrossRef]
- Rani, S.S.; Naidu, G.A.; Shree, V.U. Kinematic joint descriptor and depth motion descriptor with convolutional neural networks for human action recognition. Mater. Today Proc. 2021, 37, 3164–3173. [Google Scholar] [CrossRef]
- Dhiman, C.; Vishwakarma, D.K. View-invariant deep architecture for human action recognition using two-stream motion and shape temporal dynamics. IEEE Trans. Image Process. 2020, 29, 3835–3844. [Google Scholar] [CrossRef]
- Wang, L.; Ding, Z.; Tao, Z.; Liu, Y.; Fu, Y. Generative multi-view human action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6212–6221. [Google Scholar]
- Rahmani, H.; Mahmood, A.; Huynh, D.Q.; Mian, A. Real time action recognition using histograms of depth gradients and random decision forests. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Steamboat Springs, CO, USA, 24–26 March 2014; pp. 626–633. [Google Scholar]
- Shin, J.; Miah, A.S.M.; Kaneko, Y.; Hassan, N.; Lee, H.S.; Jang, S.W. Multimodal Attention-Enhanced Feature Fusion-Based Weakly Supervised Anomaly Violence Detection. IEEE Open J. Comput. Soc. 2024, 6, 129–140. [Google Scholar] [CrossRef]
- Güler, R.A.; Neverova, N.; Kokkinos, I. Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7297–7306. [Google Scholar]
- Cao, Z.; Simon, T.; Wei, S.; Sheikh, Y. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299. [Google Scholar]
- Zaher, M.; Ghoneim, A.S.; Abdelhamid, L.; Atia, A. Fusing CNNs and attention-mechanisms to improve real-time indoor Human Activity Recognition for classifying home-based physical rehabilitation exercises. Comput. Biol. Med. 2025, 184, 109399. [Google Scholar] [CrossRef]
- Ko, J.E.; Kim, S.; Sul, J.H.; Kim, S.M. Data Reconstruction Methods in Multi-Feature Fusion CNN Model for Enhanced Human Activity Recognition. Sensors 2025, 25, 1184. [Google Scholar] [CrossRef]
- Zhao, Y.; Shao, J.; Lin, X.; Sun, T.; Li, J.; Lian, C.; Lyu, X.; Si, B.; Zhan, Z. CIR-DFENet: Incorporating cross-modal image representation and dual-stream feature enhanced network for activity recognition. Expert Syst. Appl. 2025, 266, 125912. [Google Scholar] [CrossRef]
- Romaissa, B.D.; Mourad, O.; Brahim, N. Vision-based multi-modal framework for action recognition. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 5859–5866. [Google Scholar]
- Ren, Z.; Zhang, Q.; Gao, X.; Hao, P.; Cheng, J. Multi-modality learning for human action recognition. Multimed. Tools Appl. 2021, 80, 16185–16203. [Google Scholar] [CrossRef]
- Chen, J.; Ho, C.M. MM-ViT: Multi-modal video transformer for compressed video action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2022; pp. 1910–1921. [Google Scholar]
- Khatun, M.A.; Yousuf, M.A.; Ahmed, S.; Uddin, M.Z.; Alyami, S.A.; Al-Ashhab, S.; Akhdar, H.F.; Khan, A.; Azad, A.; Moni, M.A. Deep CNN-LSTM with self-attention model for human activity recognition using wearable sensor. IEEE J. Transl. Eng. Health Med. 2022, 10, 1–16. [Google Scholar] [CrossRef]
- Bruce, X.; Liu, Y.; Zhang, X.; Zhong, S.H.; Chan, K.C. Mmnet: A model-based multimodal network for human action recognition in rgb-d videos. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3522–3538. [Google Scholar]
- Wang, L.; Koniusz, P. 3mformer: Multi-order multi-mode transformer for skeletal action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 5620–5631. [Google Scholar]
- Xu, H.; Gao, Y.; Hui, Z.; Li, J.; Gao, X. Language knowledge-assisted representation learning for skeleton-based action recognition. arXiv 2023, arXiv:2305.12398. [Google Scholar] [CrossRef]
- Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. Multimodal fusion for audio-image and video action recognition. Neural Comput. Appl. 2024, 36, 5499–5513. [Google Scholar] [CrossRef]
- Dai, C.; Lu, S.; Liu, C.; Guo, B. A light-weight skeleton human action recognition model with knowledge distillation for edge intelligent surveillance applications. Appl. Soft Comput. 2024, 151, 111166. [Google Scholar] [CrossRef]
- Zhao, X.; Tang, C.; Hu, H.; Wang, W.; Qiao, S.; Tong, A. Attention mechanism based multimodal feature fusion network for human action recognition. J. Vis. Commun. Image Represent. 2025, 110, 104459. [Google Scholar] [CrossRef]
- Liu, D.; Meng, F.; Mi, J.; Ye, M.; Li, Q.; Zhang, J. SAM-Net: Semantic-assisted multimodal network for action recognition in RGB-D videos. Pattern Recognit. 2025, 168, 111725. [Google Scholar] [CrossRef]
- Xefteris, V.R.; Syropoulou, A.C.; Pistola, T.; Kasnesis, P.; Poulios, I.; Tsanousa, A.; Symeonidis, S.; Diplaris, S.; Goulianas, K.; Chatzimisios, P.; et al. Multimodal fusion of inertial sensors and single RGB camera data for 3D human pose estimation based on a hybrid LSTM-Random forest fusion network. Internet Things 2025, 29, 101465. [Google Scholar] [CrossRef]
- Hu, J.F.; Zheng, W.S.; Lai, J.; Zhang, J. Jointly learning heterogeneous features for RGB-D activity recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5344–5352. [Google Scholar]
- Hu, J.F.; Zheng, W.S.; Pan, J.; Lai, J.; Zhang, J. Deep bilinear learning for rgb-d action recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 335–351. [Google Scholar]
- Khaire, P.; Kumar, P.; Imran, J. Combining CNN streams of RGB-D and skeletal data for human activity recognition. Pattern Recognit. Lett. 2018, 115, 107–116. [Google Scholar] [CrossRef]
- Cardenas, E.E.; Chavez, G.C. Multimodal human action recognition based on a fusion of dynamic images using cnn descriptors. In Proceedings of the 2018 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Paraná, Brazil, 29 October–1 November 2018; pp. 95–102. [Google Scholar]
- Khaire, P.; Imran, J.; Kumar, P. Human activity recognition by fusion of RGB, depth, and skeletal data. In Proceedings of the 2nd International Conference on Computer Vision & Image Processing (CVIP 2017), Roorkee, India, 9–12 September 2017; Springer: Berlin/Heidelberg, Germany, 2018; Volume 1, pp. 409–421. [Google Scholar]
- Hou, Y.; Li, Z.; Wang, P.; Li, W. Skeleton optical spectra-based action recognition using convolutional neural networks. IEEE Trans. Circuits Syst. Video Technol. 2016, 28, 807–811. [Google Scholar] [CrossRef]
- Liu, D.; Meng, F.; Xia, Q.; Ma, Z.; Mi, J.; Gan, Y.; Ye, M.; Zhang, J. Temporal cues enhanced multimodal learning for action recognition in RGB-D videos. Neurocomputing 2024, 594, 127882. [Google Scholar] [CrossRef]
- Franco, A.; Magnani, A.; Maio, D. A multimodal approach for human activity recognition based on skeleton and RGB data. Pattern Recognit. Lett. 2020, 131, 293–299. [Google Scholar] [CrossRef]
- Shah, K.; Shah, A.; Lau, C.P.; de Melo, C.M.; Chellappa, R. Multi-view action recognition using contrastive learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 3381–3391. [Google Scholar]
- Wu, Z.; Ding, Y.; Wan, L.; Li, T.; Nian, F. Local and global self-attention enhanced graph convolutional network for skeleton-based action recognition. Pattern Recognit. 2025, 159, 111106. [Google Scholar] [CrossRef]
- Wang, C.; Yang, H.; Meinel, C. Exploring multimodal video representation for action recognition. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 1924–1931. [Google Scholar]
- Kazakos, E.; Nagrani, A.; Zisserman, A.; Damen, D. Epic-fusion: Audio-visual temporal binding for egocentric action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5492–5501. [Google Scholar]
- Gao, R.; Oh, T.H.; Grauman, K.; Torresani, L. Listen to look: Action recognition by previewing audio. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10457–10467. [Google Scholar]
- Venkatachalam, K.; Yang, Z.; Trojovskỳ, P.; Bacanin, N.; Deveci, M.; Ding, W. Bimodal HAR-An efficient approach to human activity analysis and recognition using bimodal hybrid classifiers. Inf. Sci. 2023, 628, 542–557. [Google Scholar] [CrossRef]
- Yu, X.; Yang, H.; Chen, C.H. Human operators’ cognitive workload recognition with a dual attention-enabled multimodal fusion framework. Expert Syst. Appl. 2025, 280, 127418. [Google Scholar] [CrossRef]
- Keselman, L.; Iselin Woodfill, J.; Grunnet-Jepsen, A.; Bhowmik, A. Intel realsense stereoscopic depth cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 1–10. [Google Scholar]
- Drouin, M.A.; Seoud, L. Consumer-grade RGB-D cameras. In 3D Imaging, Analysis and Applications; Springer: London, UK, 2020; pp. 215–264. [Google Scholar]
- Grunnet-Jepsen, A.; Sweetser, J.N.; Woodfill, J. Best-Known-Methods for Tuning Intel® RealSense™ D400 Depth Cameras for Best Performance; Intel Corporation: Santa Clara, CA, USA, 2018; Volume 1. [Google Scholar]
- Zabatani, A.; Surazhsky, V.; Sperling, E.; Moshe, S.B.; Menashe, O.; Silver, D.H.; Karni, Z.; Bronstein, A.M.; Bronstein, M.M.; Kimmel, R. Intel® RealSense™ SR300 coded light depth camera. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2333–2345. [Google Scholar] [CrossRef]
- Li, T.; Zhang, R.; Li, Q. Multi scale temporal graph networks for skeleton-based action recognition. arXiv 2020, arXiv:2012.02970. [Google Scholar]
- Parsa, B.; Narayanan, A.; Dariush, B. Spatio-temporal pyramid graph convolutions for human action recognition and postural assessment. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Aspen, CO, USA, 1–5 March 2020; pp. 1080–1090. [Google Scholar]
- Zhu, G.; Zhang, L.; Li, H.; Shen, P.; Shah, S.A.A.; Bennamoun, M. Topology-learnable graph convolution for skeleton-based action recognition. Pattern Recognit. Lett. 2020, 135, 286–292. [Google Scholar] [CrossRef]
- Li, M.; Chen, S.; Chen, X.; Zhang, Y.; Wang, Y.; Tian, Q. Symbiotic graph neural networks for 3d skeleton-based human action recognition and motion prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3316–3333. [Google Scholar] [CrossRef] [PubMed]
- Weng, Y.; Wu, G.; Zheng, T.; Yang, Y.; Luo, J. Large Model for Small Data: Foundation Model for Cross-Modal RF Human Activity Recognition. In Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems (SenSys 2024), Hangzhou, China, 4–7 November 2024. [Google Scholar]
- Khan, M.Z.; Bilal, M.; Abbas, H.; Imran, M.; Abbasi, Q.H. A Novel Multimodal LLM-Driven RF Sensing Method for Human Activity Recognition. In Proceedings of the 2025 2nd International Conference on Microwave, Antennas & Circuits (ICMAC), Islamabad, Pakistan, 17–18 April 2025. [Google Scholar]
- Li, Y.; Li, Y.; Vasconcelos, N. Resound: Towards action recognition without representation bias. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 10–13 September 2018; pp. 513–528. [Google Scholar]
- Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13001–13008. [Google Scholar]
- Bowles, C.; Chen, L.; Guerrero, R.; Bentley, P.; Gunn, R.; Hammers, A.; Dickie, D.A.; Hernández, M.V.; Wardlaw, J.; Rueckert, D. Gan augmentation: Augmenting training data using generative adversarial networks. arXiv 2018, arXiv:1810.10863. [Google Scholar]
- Kang, G.; Dong, X.; Zheng, L.; Yang, Y. Patchshuffle regularization. arXiv 2017, arXiv:1707.07103. [Google Scholar]
- DeVries, T.; Taylor, G.W. Dataset augmentation in feature space. arXiv 2017, arXiv:1702.05538. [Google Scholar]
- Li, S.; Chen, Y.; Peng, Y.; Bai, L. Learning more robust features with adversarial training. arXiv 2018, arXiv:1804.07757. [Google Scholar]
- Real, E.; Moore, S.; Selle, A.; Saxena, S.; Suematsu, Y.L.; Tan, J.; Le, Q.V.; Kurakin, A. Large-scale evolution of image classifiers. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 2902–2911. [Google Scholar]
- Zou, Y.; Choi, J.; Wang, Q.; Huang, J.B. Learning representational invariances for data-efficient action recognition. Comput. Vis. Image Underst. 2023, 227, 103597. [Google Scholar] [CrossRef]
- Zhang, Y.; Jia, G.; Chen, L.; Zhang, M.; Yong, J. Self-paced video data augmentation by generative adversarial networks with insufficient samples. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1652–1660. [Google Scholar]
- Gowda, S.N.; Rohrbach, M.; Keller, F.; Sevilla-Lara, L. Learn2augment: Learning to composite videos for data augmentation in action recognition. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 242–259. [Google Scholar]
- Gabeur, V.; Sun, C.; Alahari, K.; Schmid, C. Multi-modal transformer for video retrieval. In Computer Vision–ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part IV 16; Springer: Cham, Switzerland, 2020; pp. 214–229. [Google Scholar]
- Piergiovanni, A.; Ryoo, M. Learning multimodal representations for unseen activities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 517–526. [Google Scholar]
- Lin, J.; Gan, C.; Han, S. Training kinetics in 15 minutes: Large-scale distributed training on videos. arXiv 2019, arXiv:1910.00932. [Google Scholar]
- Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
- Singh, A.; Chakraborty, O.; Varshney, A.; Panda, R.; Feris, R.; Saenko, K.; Das, A. Semi-supervised action recognition with temporal contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 10389–10399. [Google Scholar]
- Kong, Y.; Fu, Y. Human Action Recognition and Prediction: A Survey. arXiv 2018, arXiv:1806.11230. [Google Scholar]
Author | Year | Dataset Name | Modality | Method | Classifier | Accuracy [%] |
---|---|---|---|---|---|---|
Ji et al. [46] | 2012 | KTH | RGB | 3DCNN | - | 90.2
Wang et al. [47] | 2015 | UCF101 HMDB51 | RGB | 2-stream Convolution Network | Softmax | 91.5 65.9 |
Sharma et al. [48] | 2015 | UCF11 HMDB51 Hollywood2 | RGB | Stacked LSTM | Softmax | 84.96 41.31 43.91 |
Ijjina et al. [49] | 2016 | UCF50 | RGB | CNN-Genetic Algorithm | CNN | 99.98 |
Feichtenhofer et al. [50] | 2016 | UCF101 HMDB51 | RGB | CNN Two-Stream | Softmax | 92.5 65.4 |
Wang et al. [51] | 2016 | HMDB51 UCF101 | RGB | TSN | Softmax | 69.4 94.2 |
Akilan et al. [52] | 2017 | CIFAR100 Caltech101 CIFAR10 | RGB | ConvNets | Softmax | 75.87 95.54 91.83 |
Shi et al. [53] | 2017 | KTH UCF101 HMDB51 | RGB | 3-stream CNN | Softmax | 96.8 94.33 92.2 |
Ahsan et al. [54] | 2018 | UCF101 HMDB51 | RGB | GAN | Softmax | 47.2 41.40 |
Tu et al. [55] | 2018 | JHMDB HMDB51 UCF Sports UCF101 | RGB | Multi-Stream CNN | Softmax | 71.17 69.8 58.12 94.5 |
Zhou et al. [56] | 2018 | HMDB51 UCF101 | RGB | TMiCT-Net | CNN | 70.5 94.7 |
Jian et al. [57] | 2019 | Sport video | RGB | FCN | Softmax | 97.40 |
Ullah et al. [44] | 2019 | UCF50 UCF101 YouTube action HMDB51 | RGB | Deep autoencoder | SVM | 96.4 94.33 96.21 70.33 |
Gowda et al. [58] | 2020 | UCF101 HMDB51 FCVID ActivityNet | RGB | SMART | Softmax | 98.6 84.3 82.1 84.4 |
Khan et al. [59] | 2020 | HMDB51 UCF Sports YouTube IXMAS KTH | RGB | VGG19 CNN | Naive Bayes | 93.7 98.0 94.4 99.4 95.2 97.0 |
Ullah et al. [60] | 2021 | HMDB51 UCF101 UCF50 Hollywood2 YouTube Actions | RGB | DS-GRU | Softmax | 72.3 95.5 95.2 71.3 97.17 |
Wang et al. [61] | 2021 | SomethingV1 SomethingV2 Kinetics-400 | RGB | Temporal Difference Networks | TDN | 84.1 91.6 94.4 |
Wang et al. [62] | 2022 | UCF101 | RGB | HyRSM | - | 93.0 |
Wensel et al. [63] | 2023 | YouTube Action HMDB51 UCF50 UCF101 | RGB | ViT-ReT | Softmax | 92.4 78.4 97.1 94.7 |
Hassan et al. [64] | 2024 | UCF11 UCF Sports JHMDB | RGB | Deep Bi-LSTM | Softmax | 99.2 93.3 76.3 |
Khan et al. [65] | 2025 | UCF50 HMDB51 UCF101 | RGB | ConvLSTM and LRCN | Softmax | 97.42 73.63 95.70 |
Shah et al. [66] | 2025 | UCF101 HMDB51 | RGB | KD-GAN | Softmax | 98.50 79.21 |
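Most of the deep RGB pipelines summarized above share a common template: a 2D CNN backbone extracts per-frame appearance features and a recurrent head (LSTM/GRU) aggregates them over time before a softmax classifier. The following minimal PyTorch sketch illustrates that generic template only; the ResNet-18 backbone, hidden size, frame count, and class count are illustrative assumptions and do not reproduce the exact configuration of any cited method.

```python
import torch
import torch.nn as nn
from torchvision import models


class CNNLSTMActionRecognizer(nn.Module):
    """Generic per-frame CNN + LSTM video classifier (illustrative sketch only)."""

    def __init__(self, num_classes: int = 101, hidden_size: int = 256):
        super().__init__()
        # ResNet-18 as a per-frame feature extractor; in practice ImageNet-pretrained
        # weights would normally be loaded here.
        backbone = models.resnet18()
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the final FC layer
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, frames, 3, H, W)
        b, t, c, h, w = clip.shape
        feats = self.cnn(clip.reshape(b * t, c, h, w)).reshape(b, t, -1)  # (b, t, 512)
        _, (h_n, _) = self.lstm(feats)   # final hidden state summarizes the clip
        return self.classifier(h_n[-1])  # (b, num_classes) logits for a softmax loss


if __name__ == "__main__":
    model = CNNLSTMActionRecognizer(num_classes=101)
    dummy_clip = torch.randn(2, 16, 3, 224, 224)  # 2 clips of 16 RGB frames each
    print(model(dummy_clip).shape)                # torch.Size([2, 101])
```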
Author | Year | Dataset Name | Modality | Method | Classifier | Accuracy [%] |
---|---|---|---|---|---|---|
Gan et al. [112] | 2013 | UTKinect-Action | RGB | APJ3D | RF | 92.00
Everts et al. [113] | 2014 | UCF11 UCF50 | RGB | Multi-channel STIP | SVM | 78.6 72.9 |
Zhu et al. [114] | 2014 | MSRAction3D UTKinectAction CAD-60 MSRDailyActivity3D HMDB51 | RGB | STIP (HOG/HOF) | SVM | 94.3 91.9 87.5 80.0 |
Yang et al. [21] | 2014 | MSR Action3D | RGB | EigenJoints-based | NBNN | 97.8 |
Liu et al. [115] | 2015 | KTH HMDB51 UCF YouTube Hollywood2 | RGB | GP-learned descriptors | SVM | 95.0 48.4 82.3 46.8 |
Xu et al. [116] | 2016 | MSRAction3D UTKinectAction Florence 3D-Action | RGB | PSO-SVM | - | 93.75 97.45 91.20 |
Vishwakarma et al. [117] | 2016 | KTH Weizmann i3Dpost Ballet IXMAS | RGB | SDEG | SVM | 95.5 100 92.92 93.25 85.8 |
Singh et al. [118] | 2017 | UCSDped-1 UCSDped-2 UMN | RGB | Graph formulation | SVM | 97.14 90.13 95.24 |
Jalal et al. [119] | 2017 | IM-DailyDepthActivity MSRAction3D MSRDailyActivity3D | RGB | HOG-DDS | HMM | 72.86 93.3 97.9 |
Nazir et al. [120] | 2018 | KTH UCF Sports UCF11 Hollywood | RGB | D-STBoE | SVM | 91.82 94.00 94.00 68.10 |
Ullah et al. [121] | 2021 | UCF Sports UCF101 | RGB | Weakly supervised | SVM | 98.27 84.72
Al et al. [122] | 2021 | E-KTH E-UCF11 E-HMDB51 E-UCF50 R-UCF11 R-UCF50 N-Actions | RGB | Local and global feature extraction | QSVM | 93.14 94.43 87.61 69.45 82.61 68.96 61.94 |
Hejazi et al. [123] | 2022 | UCF101 Kinetics-400 Kinetics-700 | RGB | Optical flow based | KNN | 99.21 98.24 96.35 |
Zhang et al. [124] | 2022 | UCF 11 UCF 50 UCF 101 JHMDB51 UT-Interaction | RGB | FV+BoTF | SVM | 99.21 92.5 95.1 70.8 91.50 |
Fatima et al. [125] | 2023 | UT-Interaction | RGB | SIFT and ORB | Decision Tree | 94.6 |
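The handcrafted pipelines in the table above typically pair local descriptors (STIP, HOG/HOF, SIFT/ORB) aggregated over a clip with a shallow classifier such as an SVM. The sketch below illustrates that generic recipe with per-frame HOG descriptors averaged into a clip-level feature vector and a linear SVM; the descriptor parameters and the synthetic training data are assumptions for illustration rather than the setup of any cited study.

```python
import numpy as np
from skimage.feature import hog
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def clip_descriptor(frames: np.ndarray) -> np.ndarray:
    """Average per-frame HOG descriptors into one fixed-length clip feature.

    frames: (num_frames, H, W) grayscale clip.
    """
    per_frame = [
        hog(f, orientations=9, pixels_per_cell=(16, 16), cells_per_block=(2, 2))
        for f in frames
    ]
    return np.mean(per_frame, axis=0)


# Synthetic stand-in data: 20 clips of 8 frames each, 2 toy action classes.
rng = np.random.default_rng(0)
clips = rng.random((20, 8, 64, 64))
labels = rng.integers(0, 2, size=20)

X = np.stack([clip_descriptor(c) for c in clips])
clf = make_pipeline(StandardScaler(), LinearSVC())
clf.fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```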
Model | Innovation Point | Strengths | Limitations |
---|---|---|---|
ST-GCN [244] | Fixed skeleton graph structure with spatial–temporal convolution | Baseline for spatial–temporal modeling | Limited flexibility for unseen poses and graph variation |
2s-AGCN [201] | Data-driven topology learning and attention-based weighting | Improved adaptability and robustness | High computational complexity; sensitivity to sensor noise |
STA-GCN [245] | Attentional focus on action-relevant joints and frames | Enhanced interpretability; adaptive attention | Requires careful tuning of attention mechanisms |
Shift-GCN [246] | Spatial shift operations for efficient receptive field expansion | Lightweight and efficient; good for long actions | Less expressive than full GCN in small-scale motions |
InfoGCN [252] | Injects global semantics into GCN to improve feature learning | Improved generalization; handles complex scenes | May require large training data for stable learning |
EMS-TAGCN [217] | Multi-stream adaptive attention across space, time, and channels | High accuracy across datasets; modular attention mechanism | Increased complexity; scalability concerns without further tuning |
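All of the graph-based models compared above build on the same basic unit: a spatial graph convolution over the skeleton adjacency matrix followed by a temporal convolution along the frame axis. The minimal PyTorch block below shows this unit in its simplest fixed-topology form (closest in spirit to ST-GCN); the joint count, adjacency normalization, kernel sizes, and channel widths are illustrative assumptions, and adaptive-topology or attention-based variants (2s-AGCN, STA-GCN, EMS-TAGCN) replace or augment the fixed adjacency used here.

```python
import torch
import torch.nn as nn


class SpatialTemporalGCNBlock(nn.Module):
    """One fixed-topology spatial-temporal graph convolution block (sketch)."""

    def __init__(self, in_channels: int, out_channels: int, adjacency: torch.Tensor):
        super().__init__()
        # Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2.
        a_hat = adjacency + torch.eye(adjacency.size(0))
        d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))
        self.register_buffer("a_norm", d_inv_sqrt @ a_hat @ d_inv_sqrt)
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(9, 1), padding=(4, 0))
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, joints)
        x = torch.einsum("nctv,vw->nctw", x, self.a_norm)  # aggregate over joint neighbors
        x = self.spatial(x)                                # mix channels per joint
        x = self.temporal(x)                               # convolve along the time axis
        return self.relu(self.bn(x))


if __name__ == "__main__":
    num_joints = 25                                  # e.g., NTU RGB+D skeletons
    adj = torch.zeros(num_joints, num_joints)
    adj[0, 1] = adj[1, 0] = 1.0                      # toy edge; real graphs use the full bone list
    block = SpatialTemporalGCNBlock(3, 64, adj)
    out = block(torch.randn(2, 3, 30, num_joints))   # 2 clips, 30 frames, (x, y, z) per joint
    print(out.shape)                                 # torch.Size([2, 64, 30, 25])
```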
Dataset | Classifier | Methods | Data Set Type | Year | Reference | Accuracy [%] |
---|---|---|---|---|---|---|
NTU RGB+D (CS) NTU RGB+D (CV) | SVM | P-LSTM | RGB, Depth | 2016 | [91] | 62.93 70.27 |
UCI-HAD USC-HAD Opportunity Daphnet FOG Skoda | SVM KNN | DRNN | Sensors | 2017 | [330] | 96.7 97.8 92.5 94.1 92.6 |
Smartwatch | Softmax | Dilated CNN | Sensor | 2020 | [300] | 95.49
UTD-MHAD NTU RGB+D | Softmax | Vision-based | RGB, Depth, Skeleton | 2021 | [363] | 98.88 75.50
NTU RGB+D (CS) NTU RGB+D (CV) SYSU 3D HOI UWA3D II | Hierarchical score fusion | Multi Model | RGB, Depth | 2021 | [364] | 89.70 92.97 87.08
UCF-101 Something-Something-v2 Kinetics-600 | Softmax | MM-ViT | RGB | 2022 | [365] | 98.9 90.8 96.8 |
MHEALTH UCI-HAR | Softmax | CNN-LSTM | Sensor | 2022 | [366] | 98.76 93.11 |
UCI-HAR WISDM MHEALTH PAMAP2 HHAR | SVM | CNN with GA | Sensors | 2023 | [305] | 98.74 98.34 99.72 97.55 96.87 |
NTU RGB+D 60 NTU RGB+D120 PKU-MMD Northwestern-UCLA Multiview Toyota Smarthome | - | MMNet | RGB, Depth | 2023 | [367] | 98.0 90.5 98.0 93.3
NTU RGB+D 60 NTU RGB+D120 NW-UCLA | Softmax | InfoGCN | RGB, Depth | 2023 | [252] | 93.0 89.8 97.0 |
NTU RGB+D NTU RGB+D120 | Softmax | Two-stream Transformer | RGB, Depth | 2023 | [368] | 94.8 93.8 |
NTU RGB+D NTU RGB+D120 NW-UCLA | Softmax | Language knowledge-assisted | RGB, Depth | 2023 | [369] | 97.2 91.8 97.6 |
UCF51 Kinetics Sound | Softmax | MAIVAR-CH | RGB, audio | 2024 | [370] | 87.9 79.0 |
Drive&Act | - | Dual Feature Shift | RGB, Depth, Infrared | 2024 | [99] | 77.61
Florence3DAction UTKinect-Action3D 3DActionPairs NTURGB+D | Softmax | two-stream spatial–temporal architecture | RGB, Depth, Infrared | 2024 | [371] | 93.8 98.7 97.3 90.2 |
UI-PRMD KIMORE | Softmax | Fusing CNNs | RGB, Skeleton | 2025 | [360] | 89.80 95.33 |
Custom HAR | Softmax | Multi-Features Fusion CNN | Sensor | 2025 | [361] | 97.92 |
Custom Gymnastics Activity UCI-HAR | Softmax | CIR-DFENet | Sensor | 2025 | [362] | 99.40 98.07 |
OPPT PAMAP2 DSADS | Softmax | DTSDA | Sensor | 2025 | [327] | 99.00 81.55 51.59 |
NTU RGB+D UTD-MHAD | Softmax | AMFI-Net | RGB, Skeleton | 2025 | [372] | 88.97 93.21
NTU-60 PKU-MMD Northwestern UCLA | Softmax | SAM-Net | RGB, Skeleton, text | 2025 | [373] | 94.8 97.0 93.7 |
Algorithm | Dataset Name | Skeleton | RGB | Depth | Sensor Signal | Others | Modality Summary | Acc. | Prec. | Rec. |
---|---|---|---|---|---|---|---|---|---|---|
Khaire et al. [377] | NTU RGB+D | - | RGB | - | - | - | Single | 70.01% | - | -
Khaire et al. [377] | NTU RGB+D | Skeleton | - | - | - | - | Single | 69.90% | - | -
Khaire et al. [377] | NTU RGB+D | - | - | Depth | - | - | Single | 80.30% | - | -
Khaire et al. [377] | NTU RGB+D | - | RGB | Depth | - | - | Multi | 91.16% | - | -
Khaire et al. [377] | NTU RGB+D | Skeleton | RGB | - | - | - | Multi | 80.69% | - | -
Khaire et al. [377] | NTU RGB+D | Skeleton | - | Depth | - | - | Multi | 93.50% | - | -
Khaire et al. [377] | NTU RGB+D | Skeleton | RGB | Depth | - | - | Multi | 94.60% | - | -
Zhao et al. [372] | NTU RGB+D | - | RGB | - | - | - | Single | 95.22% | - | -
Zhao et al. [372] | NTU RGB+D | Skeleton | - | - | - | - | Single | 93.24% | - | -
Zhao et al. [372] | NTU RGB+D | Skeleton | RGB | - | - | - | Multi | 95.82% | - | -
Zhao et al. [372] | UTD-MHAD | Skeleton | RGB | - | - | - | Multi | 93.21% | - | -
Franco et al. [382] | CAD-60, CAD-120, OAD | - | RGB | - | - | - | Single | - | 92.5, 61.1, 85.8 | 89.4, 59.3, 85.9
Franco et al. [382] | CAD-60, CAD-120, OAD | Skeleton | - | - | - | - | Single | - | 95.0, 77.6, 80.6 | 95.0, 73.1, 80.5
Franco et al. [382] | CAD-60, CAD-120, OAD | Skeleton | RGB | - | - | - | Multi | - | 98.8, 85.4, 90.6 | 98.3, 83.3, 90.4
Shah et al. [383] | NTU-60, N-UCLA | - | RGB | - | - | - | Single | 98.0, 91.7 | - | -
Wu et al. [384] | NTU-60, N-UCLA | Skeleton | - | - | - | - | Single | 96.7, 96.8 | - | -
Liu et al. [381] | NTU-60, PKU-MMD, N-UCLA | Skeleton | RGB | - | - | - | Multi | 98.0, 98.0, 90.8 | - | -
Liu et al. [373] | NTU-60, PKU-MMD, N-UCLA | Skeleton | RGB | - | - | Text | Multi | 98.5, 98.4, 92.3 | - | -
Fusion Type | Fusion Stage | Key Advantages | Key Limitations |
---|---|---|---|
Early Fusion [377] | Input or low-level feature stage | Simple structure; captures low-level dependencies | Temporal misalignment; feature redundancy |
Late Fusion [378] | Output or decision stage | Robust to missing modalities; modular | Ignores feature interaction; limited synergy |
Hybrid Fusion [379] | Combines low- and high-level fusion | Improved flexibility; hierarchical modeling | Complex training; higher resource cost |
Attention-Based Fusion [367,368,386] | Feature-wise dynamic weighting | Learns modality importance adaptively; robust under occlusion | Overfitting risk; needs large labeled datasets |
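The trade-offs listed above can be made concrete with a two-modality toy example: early fusion concatenates modality features before a shared classifier, whereas late fusion trains one head per modality and averages the class scores. The PyTorch sketch below contrasts the two strategies under assumed feature dimensions (512-D RGB and 256-D skeleton embeddings); it illustrates the fusion stages only and does not reproduce any cited fusion network.

```python
import torch
import torch.nn as nn


class EarlyFusion(nn.Module):
    """Concatenate modality features, then classify jointly (feature-level fusion)."""

    def __init__(self, rgb_dim: int, skel_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(rgb_dim + skel_dim, 256), nn.ReLU(), nn.Linear(256, num_classes)
        )

    def forward(self, rgb_feat, skel_feat):
        return self.classifier(torch.cat([rgb_feat, skel_feat], dim=-1))


class LateFusion(nn.Module):
    """Classify each modality separately, then average the scores (decision-level fusion)."""

    def __init__(self, rgb_dim: int, skel_dim: int, num_classes: int):
        super().__init__()
        self.rgb_head = nn.Linear(rgb_dim, num_classes)
        self.skel_head = nn.Linear(skel_dim, num_classes)

    def forward(self, rgb_feat, skel_feat):
        # Either head can also be used alone, which is why late fusion tolerates
        # a missing modality better than early fusion.
        return 0.5 * (self.rgb_head(rgb_feat) + self.skel_head(skel_feat))


if __name__ == "__main__":
    rgb = torch.randn(4, 512)    # assumed per-clip RGB embedding
    skel = torch.randn(4, 256)   # assumed per-clip skeleton embedding
    print(EarlyFusion(512, 256, 60)(rgb, skel).shape)  # torch.Size([4, 60])
    print(LateFusion(512, 256, 60)(rgb, skel).shape)   # torch.Size([4, 60])
```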
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).