Human Behavior Recognition via Hierarchical Patches Descriptor and Approximate Locality-Constrained Linear Coding
Abstract
1. Introduction
- (1) Traditional hand-crafted, representation-based features are difficult to interpret intuitively and often suffer from low recognition accuracy;
- (2) Learning an effective dictionary-learning model is computationally expensive and time-consuming;
- (3) For small-scale datasets, improving recognition accuracy and robustness remains a challenge.
- (1) Five energy image species are utilized to describe human behavior in a global manner; these are statistical features based on motion information. In addition, an HPD is constructed to obtain detailed local feature descriptions for recognition. Combining local features with global features describes behavioral characteristics more completely, which improves recognition accuracy.
- (2) The proposed method is based on the ALLC algorithm for fast coding, which is computationally efficient because it has a closed-form analytical solution and does not require repeatedly solving a norm-minimization problem (a minimal coding sketch is given after this list).
- (3) We demonstrate the superior performance of the proposed method in comparison with state-of-the-art alternatives by conducting experiments on both the Weizmann and DHA datasets.
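To make the coding step concrete, the snippet below is a minimal sketch of locality-constrained coding for a single descriptor, following the standard closed-form approximated LLC solution (Wang et al., CVPR 2010). The exact ALLC variant proposed in this paper may differ in its regularization and normalization details, and the parameter values shown here are illustrative only.

```python
import numpy as np

def allc_encode(x, B, K=5, lam=1e-4):
    """Approximated locality-constrained coding of one descriptor.

    x : (d,) descriptor, B : (M, d) codebook.
    Returns an (M,) code with K non-zero entries.
    A minimal sketch of the closed-form approximated LLC solution; the ALLC
    variant used in the paper may differ in its details.
    """
    M, d = B.shape
    # 1. Select the K nearest codewords of x (locality constraint).
    dists = np.linalg.norm(B - x, axis=1)
    idx = np.argsort(dists)[:K]
    Bi = B[idx]                                # (K, d) local base
    # 2. Closed-form solution of the shifted least-squares problem.
    z = Bi - x                                 # shift the local base to the origin
    C = z @ z.T                                # local covariance (K, K)
    C += lam * np.trace(C) * np.eye(K)         # regularization
    w = np.linalg.solve(C, np.ones(K))
    w /= w.sum()                               # enforce the sum-to-one constraint
    # 3. Scatter the K weights back into a full-length code.
    code = np.zeros(M)
    code[idx] = w
    return code
```

Because the solution is obtained by solving one small K×K linear system per descriptor, no iterative norm optimization is needed, which is the source of the computational advantage claimed above.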
2. Related Work
2.1. Traditional Artificial-Feature-Based Approach
2.2. Learning-Feature-Based Approach
3. The Proposed Methods
3.1. Framework of the Proposed Human Behavior Recognition Approach
- (1) Human body segmentation. The input video sequences often contain a large amount of background information, which significantly reduces computational efficiency and interferes with human motion feature extraction. Segmentation is therefore an essential step to ensure that critical behavior information is retained while unnecessary background information is removed. In this paper, human behavior recognition targets whole-body behavior rather than the actions of specific body parts, so the human body silhouette is segmented from the background as the input data for the feature extraction step.
- (2) Human behavior feature extraction. To describe the human behavior information in detail, a combined strategy of global and local feature extraction is utilized in this paper. For each video sequence, several energy image species of the human body silhouette images are calculated as global feature descriptors of the human behavior. The advantage of this method is that it describes the global human behavior information well in a statistical manner using one image per video, which greatly reduces the computational load of local feature extraction in the following processes. However, it cannot express the local human behavior information well. Therefore, after calculating each energy image species, an HPD is constructed to describe the local feature information of the targeted human behavior, which contains three steps.
- (3) Behavior pattern recognition. After the human behavior features are extracted from the video sequences, the different human behaviors are learned individually from the training video sequences of each class by using the ALLC algorithm and max-pooling. Each testing video sequence is then attributed to a predefined class according to its corresponding feature. At this stage, the HPD feature vectors are encoded together by the ALLC algorithm, which is a simple yet effective fast coding algorithm (a sketch of this recognition stage follows the list).
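The sketch below illustrates the recognition stage under stated assumptions: it reuses the `allc_encode` helper from the coding sketch in the Introduction, max-pools the per-descriptor codes into one video-level vector, and uses a linear SVM (the classifier reported in Section 4.3.3) with illustrative hyper-parameters. The helper names and parameter values are not taken from the paper's code.

```python
import numpy as np
from sklearn.svm import LinearSVC

def video_feature(hpd_descriptors, codebook):
    """Encode all HPD descriptors of one video with ALLC and max-pool the codes.

    hpd_descriptors: (n, d) local descriptors; codebook: (M, d) codewords.
    Relies on allc_encode() from the coding sketch in the Introduction.
    """
    codes = np.stack([allc_encode(x, codebook) for x in hpd_descriptors])
    return codes.max(axis=0)              # max-pooling -> one (M,) vector per video

def train_and_classify(train_videos, train_labels, test_videos, codebook):
    """Hypothetical end-to-end recognition step: pooled ALLC features + linear SVM."""
    X_train = np.stack([video_feature(v, codebook) for v in train_videos])
    X_test = np.stack([video_feature(v, codebook) for v in test_videos])
    clf = LinearSVC(C=1.0, max_iter=10000).fit(X_train, train_labels)
    return clf.predict(X_test)
```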
3.2. Human Behavior Feature Extraction
3.2.1. Environmental Modelling and Human Body Segmentation
3.2.2. Calculation of the Energy Image Species
- (1) MEI and MHI: The binary MEI and MHI can be calculated by Equations (1) and (2), respectively.
- (2) AMEI, EMEI, and MEnI: For the whole motion sequence of N frames, the average value of the binary contour is calculated as the AMEI, as shown in Equation (3). (An illustrative computation of these energy image species is sketched after this list.)
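Since Equations (1)–(3) are not reproduced here, the following is a minimal sketch of how such energy image species can be computed from the segmented binary silhouettes, based on the standard MEI/MHI templates of Bobick and Davis and the gait-entropy-image idea of Bashir et al. cited in the references. The paper's exact definitions of EMEI and MEnI may differ, so the entropy-based variant below should be treated as an assumption.

```python
import numpy as np

def energy_image_species(silhouettes, tau=None):
    """Compute global energy images from the binary silhouettes of one video.

    silhouettes: (N, H, W) array of 0/1 foreground masks.
    Returns MEI, MHI, AMEI, and an entropy image (a stand-in for MEnI).
    """
    D = np.asarray(silhouettes, dtype=float)
    N = len(D)
    tau = tau or N
    mei = (D.sum(axis=0) > 0).astype(float)                  # MEI: union of all silhouettes
    mhi = np.zeros(D.shape[1:])
    for t in range(N):                                       # MHI: recency-weighted history
        mhi = np.where(D[t] > 0, float(tau), np.maximum(0.0, mhi - 1.0))
    amei = D.mean(axis=0)                                    # AMEI: average binary silhouette
    p = np.clip(amei, 1e-6, 1 - 1e-6)                        # avoid log(0)
    entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))   # entropy image (MEnI-like)
    return mei, mhi / tau, amei, entropy
```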
3.2.3. Construction of the Hierarchical Patches Descriptor (HPD)
Algorithm 1 Construction Process of HPD
Input: Energy image species MEI, MHI, MEnI, AMEI, and EMEI; Output: HPD feature vector.
Step 1: Obtain SIFT descriptors. For each energy image species, the SIFT descriptors of 31×31 patches calculated over a grid with a spacing of 16 pixels are extracted from each key point or patch as local features. This is realized by using a difference-of-Gaussian function, D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y).
Step 2: Generate a codebook with M channels by sparse coding [8]. To improve the computational efficiency, the K-means clustering method can be used to compute the cluster centers.
Step 3: Encode the descriptors. Each SIFT descriptor is encoded into a code vector with the codewords in the codebook, so that each descriptor is transferred to an M-dimensional code.
Step 4: Spatial feature pooling.
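The sketch below illustrates Algorithm 1 under stated assumptions: dense SIFT is extracted with OpenCV on a fixed grid, K-means provides the codebook (the faster option mentioned in Step 2), hard assignment to the nearest codeword stands in for the ALLC encoding of Step 3, and the (1, 2, 4) pyramid levels used for patch-wise pooling are illustrative rather than the paper's exact hierarchical layout.

```python
import cv2
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def dense_sift(energy_image, patch=31, step=16):
    """Step 1: dense SIFT on a grid; energy_image is a 2-D uint8 array."""
    sift = cv2.SIFT_create()
    h, w = energy_image.shape
    grid = [cv2.KeyPoint(float(x), float(y), float(patch))
            for y in range(patch // 2, h - patch // 2, step)
            for x in range(patch // 2, w - patch // 2, step)]
    kps, desc = sift.compute(energy_image, grid)
    pts = np.array([kp.pt for kp in kps])                 # (n, 2) keypoint locations
    return pts, desc                                      # desc: (n, 128)

def build_codebook(all_descriptors, M=1024):
    """Step 2: codebook with M channels (K-means clustering as the faster option)."""
    return KMeans(n_clusters=M, n_init=4, random_state=0).fit(all_descriptors).cluster_centers_

def hpd_feature(energy_image, codebook, levels=(1, 2, 4)):
    """Steps 3-4: encode each descriptor, then max-pool over hierarchical patches."""
    pts, desc = dense_sift(energy_image)
    M = len(codebook)
    # Step 3 (simplified): hard assignment to the nearest codeword.
    nearest = cdist(desc, codebook).argmin(axis=1)
    codes = np.zeros((len(desc), M))
    codes[np.arange(len(desc)), nearest] = 1.0
    # Step 4: max-pool the codes inside every cell of every pyramid level.
    h, w = energy_image.shape
    pooled = []
    for l in levels:
        cx = np.minimum((pts[:, 0] * l // w).astype(int), l - 1)
        cy = np.minimum((pts[:, 1] * l // h).astype(int), l - 1)
        for i in range(l):
            for j in range(l):
                in_cell = (cx == i) & (cy == j)
                pooled.append(codes[in_cell].max(axis=0) if in_cell.any() else np.zeros(M))
    return np.concatenate(pooled)                         # the HPD feature vector
```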
3.3. Human Behavior Recognition Scheme Based on LLC Algorithm
3.3.1. Problem Formulation
3.3.2. The LLC Algorithm
3.3.3. ALLC Algorithm for Fast Coding
3.3.4. Max-Pooling
4. Experimental Results
4.1. Experimental Settings and Descriptions
- (1) Weizmann dataset: The Weizmann dataset consists of 10 human behavior categories; each behavior was performed by nine performers in a similar environment, and each video sequence has a different length. Following the database instructions in the literature [7,41], nine behaviors were selected for MEI and MHI: bend, jump, jack, side, run, walk, skip, wave1 (one-hand wave), and wave2 (two-hand wave).
- (2) DHA dataset: The DHA dataset contains 23 categories of human behavior (e.g., bend, jump, pitch, and arm-swing), each performed by 21 performers (12 males and 9 females). The duration of the video sequences also varies. Following [10] and the database instructions, 14 behaviors were selected for MEnI, including bend, jump, jack, run, skip, walk, side, wave1, wave2, side-box, arm-swing, tai chi, and leg-kick, and 17 behaviors were selected for AMEI and EMEI, including bend, jump, jack, pjump, run, walk, skip, side, wave1, wave2, arm-swing, leg-kick, front-clap, side-box, rod-swing, and tai chi.
4.2. Parameter Selection
4.3. Experimental Results and Comparative Analysis on Weizmann Dataset
4.3.1. Comparison of Different Feature Combinations
4.3.2. Comparison of Feature-Coding Algorithms
4.3.3. Comparison with Other Behavior Recognition Approaches
4.4. Experimental Results and Comparative Analysis on DHA Dataset
4.4.1. Comparison of Different Feature Combinations
4.4.2. Comparison of Feature-Coding Algorithms
4.4.3. Comparison of Different Multi-Modality Fusion Methods
4.4.4. Confusion Matrix Analysis
- (1) The lowest correct recognition rate on the DHA dataset was 81% for both AMEI and EMEI; 10 and 11 out of the 17 types of behaviors achieved 100% recognition accuracy, respectively.
- (2) By analyzing the confusion matrices, we can observe that certain behaviors are similar and may be confused with each other, for example, wave1 and pjump; skip and jump; walk, skip, and run; wave2 and leg-kick; pjump and jump; arm-swing and tai chi; and side-box, jack, and pitch. For the side-box behavior in particular, owing to the different motion ranges, angles, and boxing directions of the different performers, the accuracy was only 81%.
- (3) For behaviors with high similarity that involve position change, such as run, pjump, front-clap, and side, the recognition results were worse than for the other behaviors. One possible reason is that these behaviors all contain leg and arm movements whose motion directions and positions may vary between image frames. Although the HPD was constructed on different energy image species to obtain detailed motion features, it could not describe depth information well. Therefore, it was difficult to identify these types of behaviors correctly.
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Khaire, P.; Kumar, P. Deep learning and RGB-D based human action, human-human and human-object interaction recognition: A survey. J. Vis. Commun. Image Represent. 2022, 86, 103531. [Google Scholar] [CrossRef]
- Wang, Z.H.; Zheng, Y.F.; Liu, Z.; Li, Y.J. A survey of video human behaviour recognition methodologies in the perspective of spatial-temporal. In Proceedings of the 2022 2nd International Conference on Intelligent Technology and Embedded Systems, Chengdu, China, 23–26 September 2022; pp. 138–147. [Google Scholar]
- Chen, A.T.; Morteza, B.A.; Wang, K.I. Investigating fast re-identification for multi-camera indoor person tracking. Comput. Electr. Eng. 2019, 77, 273–288. [Google Scholar] [CrossRef]
- Yue, R.J.; Tian, Z.Q.; Du, S.Y. Action recognition based on RGB and skeleton data sets: A survey. Neurocomputing 2022, 512, 287–306. [Google Scholar] [CrossRef]
- Yao, G.L.; Tao, L.; Zhong, J.D. A review of convolutional neural network based action recognition. Pattern Recogn. Lett. 2019, 118, 14–22. [Google Scholar]
- Kumar, D.; Kukreja, V. Early recognition of wheat powdery mildew disease based on mask RCNN. In Proceedings of the 2022 International Conference on Data Analytics for Business and Industry (ICDABI), Sakheer, Bahrain, 25–26 October 2022; pp. 542–546. [Google Scholar]
- Plizzari, C.; Cannici, M.; Matteucci, M. Skeleton-based action recognition via spatial and temporal transformer networks. Comput. Vis. Image Und. 2021, 208, 103219. [Google Scholar] [CrossRef]
- Lowe, D.G. Distinctive image features from scale-invariant key points. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Kumar, D.; Kukreja, V. MRISVM: An object detection and feature vector machine based network for brown mite variation in wheat plant. In Proceedings of the 2022 International Conference on Data Analytics for Business and Industry (ICDABI), Sakheer, Bahrain, 25–26 October 2022; pp. 707–711. [Google Scholar]
- Zeng, M.Y.; Wu, Z.M.; Chang, T.; Fu, Y.; Jie, F.R. Fusing appearance statistical features for person re-identification. J. Electron. Inf. Technol. 2014, 36, 1844–1851. [Google Scholar]
- Obaidi, S.A.; Abhayaratne, C. Temporal salience based human action recognition. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 2017–2021. [Google Scholar]
- Patel, C.I.; Labana, D.; Pandya, S. Histogram of oriented gradient-based fusion of features for human action recognition in action video sequences. Sensors 2020, 20, 7299. [Google Scholar] [CrossRef]
- Gao, Z.; Zhang, H.; Xu, G.P.; Xue, Y.B. Multi-perspective and multi-modality joint representation and recognition model for 3D action recognition. Neurocomputing 2015, 151, 554–564. [Google Scholar] [CrossRef]
- Chen, C.; Liu, M.Y.; Zhang, B.C. 3D action recognition using multi-temporal depth motion maps and fisher vector. In Proceedings of the 2016 International Conference on Artificial Intelligence, New York, NY, USA, 15 July 2016; pp. 3331–3337. [Google Scholar]
- Wright, J.; Yang, A.Y.; Ganesh, A.; Sastry, S.S.; Ma, Y. Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 210–227. [Google Scholar] [CrossRef]
- Yang, J.C.; Yu, K.; Gong, Y.H.; Huang, T.S. Linear spatial pyramid matching using sparse coding for image classification. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 22–25 June 2009; pp. 1794–1800. [Google Scholar]
- Kumar, D.; Kukreja, V. Application of PSPNET and fuzzy logic for wheat leaf rust disease and its severity. In Proceedings of the 2022 International Conference on Data Analytics for Business and Industry (ICDABI), Sakheer, Bahrain, 25–26 October 2022; pp. 547–551. [Google Scholar]
- Wang, J.J.; Yang, J.C.; Yu, K.; Lv, F.; Gong, Y. Locality-constrained linear coding for image classification. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, 13–18 June 2010; pp. 3360–3367. [Google Scholar]
- Wu, J.L.; Lin, Z.C.; Zheng, W.M.; Zha, H. Locality-constrained linear coding based bi-layer model for multi-view facial expression recognition. Neurocomputing 2017, 239, 143–152. [Google Scholar] [CrossRef]
- Wang, L.; Zhao, X.; Liu, Y.C. Skeleton feature based on multi-stream for action recognition. IEEE Access 2018, 6, 20788–20800. [Google Scholar] [CrossRef]
- Gao, Z.; Zhang, H.; Liu, A.A.; Xu, G.; Xue, Y. Human action recognition on depth dataset. Neural Comput. Appl. 2016, 27, 2047–2054. [Google Scholar] [CrossRef]
- Gao, Z.; Li, S.H.; Zhu, Y.J.; Wang, C.; Zhang, H. Collaborative sparse representation learning model for RGB-D action recognition. J. Vis. Commun. Image Represent. 2017, 48, 442–452. [Google Scholar] [CrossRef]
- Yan, Y.; Ricci, E.; Subramanian, R.; Liu, G.W.; Sebe, N. Multitask linear discriminant analysis for view invariant action recognition. IEEE Trans. Image Process. 2014, 23, 5599–5611. [Google Scholar] [CrossRef]
- Wang, P.C.; Li, W.Q.; Gao, Z.M. Action recognition from depth maps using deep convolutional neural networks. IEEE Trans. Hum.-Mach. Syst. 2016, 46, 498–509. [Google Scholar] [CrossRef]
- Sharif, M.; Akram, T.; Raza, M. Hand-crafted and deep convolutional neural network features fusion and selection strategy: An application to intelligent human action recognition. Appl. Soft Comput. 2020, 87, 105986. [Google Scholar]
- Bhatt, D.; Patel, C.I.; Talsania, H. CNN variants for computer vision: History, architecture, application, challenges and future scope. Electronics 2021, 10, 2470. [Google Scholar] [CrossRef]
- Patel, C.I.; Bhatt, D.; Sharma, U. DBGC: Dimension-based generic convolution block for object recognition. Sensors 2022, 22, 1780. [Google Scholar] [CrossRef]
- Xue, F.; Ji, H.B.; Zhang, W.B.; Cao, Y. Attention based spatial temporal hierarchical ConvLSTM network for action recognition in videos. IET Comput. Vis. 2019, 13, 708–718. [Google Scholar] [CrossRef]
- Rocha, A.; Lopes, S.I.; Abreu, C. A cost-effective infrared thermographic system for diabetic foot screening. In Proceedings of the 10th International Workshop on E-Health Pervasive Wireless Applications and Services, Thessaloniki, Greece, 10–12 October 2022; pp. 106–111. [Google Scholar]
- Kumar, D.; Kukreja, V. A symbiosis with panicle-SEG based CNN for count the number of wheat ears. In Proceedings of the 10th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Amity University, Noida, India, 13–14 October 2022; pp. 1–5. [Google Scholar]
- Bobick, A.F.; Davis, J.W. The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 257–267. [Google Scholar] [CrossRef]
- Bashir, K.; Xiang, T.; Gong, S. Gait recognition using gait entropy image. In Proceedings of the 3rd International Conference on Imaging for Crime Detection and Prevention (ICDP 2009), London, UK, 3 December 2009; pp. 1–5. [Google Scholar]
- Patel, C.I.; Garg, S.; Zaveri, T.; Banerjee, A.; Patel, R. Human action recognition using fusion of features for unconstrained video sequences. Comput. Electr. Eng. 2018, 70, 284–301. [Google Scholar] [CrossRef]
- Du, T.; Wang, H.; Torresani, L.; Ray, J.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–21 June 2018; pp. 6450–6459. [Google Scholar]
- Zhang, K.; Yang, K.; Li, S.Y.; Chen, H.B. A difference-based local contrast method for infrared small target detection under complex background. IEEE Access 2019, 7, 105503–105513. [Google Scholar] [CrossRef]
- Barnich, O.; Van Droogenbroeck, M. ViBe: A universal background subtraction algorithm for video sequences. IEEE Trans. Image Process. 2011, 20, 1709–1724. [Google Scholar] [CrossRef] [PubMed]
- Lazebnik, S.; Schmid, C.; Ponce, J. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the 2006 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New York, NY, USA, 17–22 June 2006; pp. 2169–2178. [Google Scholar]
- Yu, K.; Zhang, T.; Gong, Y. Nonlinear learning using local coordinate coding. In Proceedings of the 2009 International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 20–22 May 2009; pp. 2223–2231. [Google Scholar]
- Blank, M.; Gorelick, L.; Shechtman, E.; Irani, M.; Basri, R. Actions as space-time shapes. In Proceedings of the 2005 IEEE International Conference on Computer Vision (ICCV), Beijing, China, 17–21 October 2005; pp. 1395–1402. [Google Scholar]
- Lin, Y.C.; Hu, M.C.; Cheng, W.H.; Hsieh, Y.H.; Chen, H.M. Human action recognition and retrieval using sole depth information. In Proceedings of the 2012 ACM MM, Nara, Japan, 5–8 September 2012; pp. 1053–1056. [Google Scholar]
- Liu, L.N.; Ma, S.W.; Fu, Q. Human action recognition based on locality constrained linear coding and two-dimensional spatial-temporal templates. In Proceedings of the 2017 China Automation Conference (CAC), Jinan, China, 20–22 October 2017; pp. 1879–1883. [Google Scholar]
- Yang, M.; Zhang, L.; Feng, X.C.; Zhang, D. Fisher discrimination dictionary learning for sparse representation. In Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 543–550. [Google Scholar]
- Jiang, Z.L.; Lin, Z.; Davis, L.S. Label consistent K-SVD: Learning a discriminative dictionary for recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2651–2664. [Google Scholar] [CrossRef] [PubMed]
Features | Ours: M | Ours: K | Ours: c | Ours: L | Ours: λ1 | Ours: λ2 | FDDL: Dictionary Size | FDDL: Sparsity | LCKSVD: α | LCKSVD: β
---|---|---|---|---|---|---|---|---|---|---
MEI | 1024 | 5 | 7 | 2 | 0.05 | 0.5 | 60 | 8 | 0.05 | 0.001 |
MHI | 1024 | 3 | 13 | 2 | 0.05 | 0.5 | 60 | 8 | 0.01 | 0.001 |
MEnI | 1024 | 3 | 10 | 2 | 0.05 | 0.5 | 150 | 10 | 0.01 | 0.001 |
AMEI | 1024 | 3 | 13 | 2 | 0.005 | 0.05 | - | - | - | - |
EMEI | 1024 | 3 | 13 | 2 | 0.005 | 0.05 | - | - | - | - |
Features | Accuracy Rate (%) |
---|---|
MHI+BoW [7] | 90 |
MEI+PHOG [41] | 82.7 |
MHI+PHOG [41] | 92.6 |
MEI+R [41] | 86.4 |
MHI+R [41] | 81.5 |
Our MEI+HPD | 100 |
Our MHI+HPD | 98.77 |
Features | Feature-Coding Algorithms | Accuracy Rate (%) |
---|---|---|
MEI | LCKSVD1 | 92.6 |
MEI | LCKSVD2 | 95.07 |
MHI | LCKSVD1 | 93.83 |
MHI | LCKSVD2 | 96.3 |
MEI | FDDL | 96.3 |
MHI | FDDL | 95.06 |
MEI | Our ALLC | 95.06 |
MHI | Our ALLC | 93.83 |
Features | Classifiers | Accuracy Rate (%) |
---|---|---|
3D-SIFT [10] | KNN | 97.84 |
HOGS [11] | KNN | 99.65 |
HOG+CNN [24] | SVM | 99.4 |
Our MEI+HPD+ALLC | SVM | 100 |
Our MHI+HPD+ALLC | SVM | 98.77 |
Features | Accuracy Rate (%) |
---|---|
HOGS [11] | 99.39 |
DMPP+PHOG [13] | 95 |
DLRMPP+PHOG [13] | 95.6 |
DMPP+DLRMPP+PHOG [13] | 98.2 |
GIST+DSTIPs [21] | 93 |
HPM+TM [22] | 90.8 |
Our AMEI+HPD | 95.52 |
Our EMEI+HPD | 96.08 |
Our MEnI+HPD | 97.61 |
Features | Feature-Coding Algorithms | Accuracy Rate (%) |
---|---|---|
GIST+DSTIPs [21] | SRC | 93
HPM+TM [22] | SRC | 93
HPM+TM [22] | CSR | 98.6
AMEI | FDDL | 89.09 |
EMEI | FDDL | 91.32 |
MEnI | LCKSVD1 | 92.88 |
MEnI | LCKSVD2 | 94.58 |
Our AMEI+HPD | Our ALLC | 93.28 |
Our EMEI+HPD | Our ALLC | 94.68 |
Our MEnI+HPD | Our ALLC | 95.92 |
Data Modality | Features | Accuracy Rate (%) |
---|---|---|
RGB | HOGS [11] | 99.39 |
RGB | DLRMPP+PHOG [13] | 95.6 |
RGB | HPM+TM [22] | 91.9 |
RGB | Our AMEI+HPD | 95.52 |
RGB | Our EMEI+HPD | 96.08 |
RGB | Our MEnI+HPD | 97.61 |
Depth | DMPP+PHOG [13] | 95 |
Depth | GIST+DSTIPs [21] | 94 |
Depth | HPM+TM [22] | 90.8 |
RGB+Depth | DMPP+DLRMPP+PHOG [13] | 98.2 |
RGB+Depth | MMDJM+GIST+DSTIP [21] | 97 |
RGB+Depth | HPM+TM+CSR [22] | 98.6 |
RGB+Depth | HPM+TM+SRC [22] | 94.4 |