Self-Supervised Action Representation Learning Based on Asymmetric Skeleton Data Augmentation
Abstract
1. Introduction
- Each view describes a different type of skeleton data. We choose specific data augmentation methods for each view according to its characteristics and combine these methods.
- We propose a new data augmentation strategy for skeleton sequences: an asymmetric augmentation pipeline with left and right branches, where the two branches are composed of different combinations of data augmentation methods.
- We conduct extensive experiments on two large-scale 3D skeleton datasets (NTU RGB+D 60 and NTU RGB+D 120) to demonstrate the effectiveness of the proposed data augmentation strategy.
2. Related Work
2.1. Action Representation
2.2. Self-Supervised Action Representation
3. Method
3.1. Basic Framework for Action Representation Based on Asymmetric Augmentation
- Data augmentation: Each skeleton sequence $x$ is randomly transformed into two augmented samples $x_q$ and $x_k$, which form a pair of positive samples. Different data augmentation methods are combined in the left and right branches of the asymmetric augmentation pipeline, as shown in the yellow area in Figure 2.
- Feature encoding: $x_q$ and $x_k$ are embedded into the hidden space by the encoders $f_q(\cdot;\theta_q)$ and $f_k(\cdot;\theta_k)$: $h_q = f_q(x_q;\theta_q)$ and $h_k = f_k(x_k;\theta_k)$, where the key-encoder parameters $\theta_k$ are momentum-updated by Equation (1), $\theta_k \leftarrow m\,\theta_k + (1-m)\,\theta_q$, with momentum coefficient $m$.
- Nonlinear mapping: MLP projection heads $g_q$ and $g_k$ map the latent vectors $h_q$ and $h_k$ to a low-dimensional space: $z_q = g_q(h_q)$, $z_k = g_k(h_k)$.
- Queue update: A queue that stores a large number of negative samples is maintained to avoid redundant computation; it is iteratively updated with the key embeddings $z_k$ of each mini-batch (the current keys are enqueued and the oldest mini-batch of keys is dequeued).
- Contrastive loss: InfoNCE [39] is used to train the network (Equation (2)): $\mathcal{L} = -\log \frac{\exp(z_q \cdot z_k / \tau)}{\exp(z_q \cdot z_k / \tau) + \sum_{i=1}^{K} \exp(z_q \cdot m_i / \tau)}$, where $\tau$ is the temperature hyperparameter, $m_i$ are the negative keys stored in the queue, and $K$ is the queue size. A runnable sketch of one training step is given after Algorithm 1.
Algorithm 1. Main algorithm of the basic framework

Input: Temperature $\tau$, momentum coefficient $m$, mini-batch size $N$, query encoder $f_q$, key encoder $f_k$, queue size $K$
Output: The pre-trained encoder $f_q$.
# Initialization
Randomly initialize the parameters $\theta_q$ of $f_q$ and copy them to $f_k$ (parameters $\theta_k$)
Randomly initialize the negative keys in the queue
for a sampled mini-batch $\{x_i\}_{i=1}^{N}$ do
    for all $x_i$ do
        # Select asymmetric augmentation strategy to perform two random augments
        $x_q = \mathrm{aug}_q(x_i)$, $x_k = \mathrm{aug}_k(x_i)$
        # Feature encoding
        $h_q = f_q(x_q)$, $h_k = f_k(x_k)$
        # Nonlinear mapping
        $z_q = g_q(h_q)$, $z_k = g_k(h_k)$, detach $z_k$
    end for
    # Calculate contrastive loss $\mathcal{L}$ for the mini-batch and update the encoders
    Update $\theta_q$ to minimize $\mathcal{L}$
    Update $\theta_k$ with momentum: $\theta_k \leftarrow m\,\theta_k + (1-m)\,\theta_q$
    # Update queue
    Enqueue the keys $z_k$ of the current mini-batch
    Dequeue the oldest mini-batch of keys
end for
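To make the training loop above concrete, the following is a minimal PyTorch-style sketch of one step of the MoCo-based basic framework, assuming normalized embeddings, a queue stored as a (K, D) tensor, and placeholder functions `augment_left`/`augment_right` for the two branches of the asymmetric pipeline; the function and argument names are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def basic_framework_step(x, f_q, f_k, g_q, g_k, queue,
                         augment_left, augment_right, tau=0.07, m=0.999):
    """One MoCo-style contrastive step (sketch).
    x: mini-batch of skeleton sequences; queue: (K, D) negative keys."""
    # Asymmetric augmentation: different method combinations per branch.
    x_q, x_k = augment_left(x), augment_right(x)

    # Feature encoding + nonlinear mapping; the key branch carries no gradient.
    z_q = F.normalize(g_q(f_q(x_q)), dim=1)            # (N, D)
    with torch.no_grad():
        z_k = F.normalize(g_k(f_k(x_k)), dim=1)        # (N, D)

    # InfoNCE: the positive is the paired key, negatives come from the queue.
    l_pos = torch.einsum("nd,nd->n", z_q, z_k).unsqueeze(1)    # (N, 1)
    l_neg = z_q @ queue.t()                                    # (N, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    loss = F.cross_entropy(logits, labels)                     # Equation (2)

    # Momentum update of the key encoder (Equation (1)); in practice this is
    # applied after the optimizer step on the query encoder.
    with torch.no_grad():
        for p_q, p_k in zip(list(f_q.parameters()) + list(g_q.parameters()),
                            list(f_k.parameters()) + list(g_k.parameters())):
            p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

    # Queue update: enqueue the current keys, drop the oldest ones.
    queue = torch.cat([z_k, queue], dim=0)[: queue.size(0)]
    return loss, queue
```

The caller would back-propagate `loss`, step the optimizer for `f_q`/`g_q`, and keep the returned `queue` for the next iteration.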
3.2. Composite Framework for Action Representation Based on Asymmetric Augmentation
Algorithm 2. Main algorithm of the composite framework

Input: Temperature $\tau$, momentum coefficient $m$, mini-batch size $N$, query encoder $f_q$, key encoder $f_k$, and a negative-sample queue for every view, queue size $K$
Output: The pre-trained encoder $f_q$.
Randomly initialize the negative keys in each queue
for a sampled mini-batch $\{x_i\}$ do
    for all $x_i$ and every view do
        # Single-view contrastive learning representation
        Obtain the embeddings $z_q$, $z_k$ of this view as in Algorithm 1
        # Calculate the sample similarity
        Compute the similarities between $z_q$ and the negative keys in this view's queue
        # High-confidence knowledge mining
        Select the most similar negative keys of this view and pass their indices to the other views as additional positives
    end for
    # Calculate contrastive loss for the mini-batch (Equation (5)) and update the encoders and queues
end for
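A sketch of the knowledge-mining step may help: assuming each view keeps its own normalized embeddings and queue, the snippet below mines the indices of the most similar queue entries in one view and counts those entries as extra positives in another view's InfoNCE loss. The exact form of the paper's Equation (5) (e.g., its weighting of the mined terms) is not reproduced; the function names and the single-neighbor default are illustrative.

```python
import torch

def mine_similar_indices(z_q, queue, num_neighbors=1):
    """Indices of the queue entries most similar to each query embedding.
    z_q: (N, D) normalized queries of one view; queue: (K, D) normalized keys."""
    sim = z_q @ queue.t()                               # (N, K) similarities
    return sim.topk(num_neighbors, dim=1).indices       # (N, num_neighbors)

def cross_view_loss(z_q, z_k, queue, mined_idx, tau=0.07):
    """InfoNCE variant in which queue entries mined in another view are
    treated as additional positives in this view (sketch)."""
    pos = torch.exp(torch.einsum("nd,nd->n", z_q, z_k) / tau)   # (N,)
    neg = torch.exp(z_q @ queue.t() / tau)                      # (N, K)
    mined = torch.gather(neg, 1, mined_idx).sum(dim=1)          # (N,)
    loss = -torch.log((pos + mined) / (pos + neg.sum(dim=1)))
    return loss.mean()

# Example: mine neighbors in the joint view and reuse them in the motion view.
# idx = mine_similar_indices(z_q_joint, queue_joint)
# loss_motion = cross_view_loss(z_q_motion, z_k_motion, queue_motion, idx)
```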
3.3. Augmentation Methods in Asymmetric Strategy
- Crop. In image classification tasks, a crop randomly samples a part of the original image and then resizes this part back to the original image size; this method is often called random cropping. For skeleton sequences, the crop is applied along the temporal dimension: a number of frames are first padded onto the sequence symmetrically, and the padded sequence is then randomly cropped back to the original length. The padding length is defined as T/γ, where γ is the padding ratio; we set γ = 6 in this paper.
- Shear. Shear augmentation is a linear transformation in the spatial dimension. Each joint is displaced in a fixed direction, i.e., the 3D coordinates of the body joints are slanted by a random angle. The transformation matrix is defined as
  $$A = \begin{bmatrix} 1 & a_{12} & a_{13} \\ a_{21} & 1 & a_{23} \\ a_{31} & a_{32} & 1 \end{bmatrix},$$
  where the off-diagonal shear factors $a_{ij}$ are sampled uniformly from $[-\beta, \beta]$, with $\beta$ the shear factor (0.5 in our experiments), and the joint coordinates of the sequence are multiplied by $A$ on the channel dimension.
- Gaussian blur (GB). As an effective augmentation for reducing the level of detail and noise in images, Gaussian blur can also be applied to a skeleton sequence to smooth noisy joints and suppress fine action details. We randomly sample $\sigma \in [0.1, 2.0]$ for the Gaussian kernel, which acts as a temporal sliding window of length 15. The joint coordinates of the original sequence are blurred with a probability of 50% by the kernel below:
  $$K(i) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-i^2 / (2\sigma^2)}, \quad i \in \{-7, \dots, 7\}.$$
- Joint mask (JM). We apply a zero-mask to a number of body joints in selected skeleton frames (i.e., all of their coordinates are replaced by zeros), which encourages the model to learn from the remaining local regions, which probably contain crucial action patterns. More specifically, we randomly choose a certain number of body joints from a number of randomly selected frames of the original skeleton sequence and apply the zero-mask to them. A minimal implementation sketch of these four augmentations follows this list.
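The following NumPy sketch illustrates the four augmentations on a skeleton sequence stored as an array of shape (T, V, 3) (frames, joints, coordinates). The padding ratio, shear range, kernel length, and blur probability follow the text above; the mask sizes, the function names, and the left/right branch compositions at the end are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def crop(seq, gamma=6):
    """Pad T/gamma frames symmetrically, then randomly crop back to length T."""
    T = seq.shape[0]
    pad = max(1, int(np.ceil(T / gamma)))
    padded = np.concatenate([seq[:pad][::-1], seq, seq[-pad:][::-1]], axis=0)
    start = np.random.randint(0, padded.shape[0] - T + 1)
    return padded[start:start + T]

def shear(seq, beta=0.5):
    """Slant the 3D joint coordinates with a random shear matrix A."""
    a = np.random.uniform(-beta, beta, size=6)
    A = np.array([[1.0, a[0], a[1]],
                  [a[2], 1.0, a[3]],
                  [a[4], a[5], 1.0]])
    return seq @ A                      # (T, V, 3) x (3, 3)

def gaussian_blur(seq, kernel_size=15, p=0.5):
    """With probability p, smooth each coordinate along time with a Gaussian
    kernel whose sigma is sampled from U[0.1, 2.0]."""
    if np.random.rand() > p:
        return seq
    sigma = np.random.uniform(0.1, 2.0)
    r = kernel_size // 2
    k = np.exp(-np.arange(-r, r + 1) ** 2 / (2.0 * sigma ** 2))
    k /= k.sum()
    T, V, C = seq.shape
    flat = np.pad(seq.reshape(T, V * C), ((r, r), (0, 0)), mode="edge")
    out = np.stack([np.convolve(flat[:, j], k, mode="valid")
                    for j in range(V * C)], axis=1)
    return out.reshape(T, V, C)

def joint_mask(seq, num_joints=5, num_frames=10):
    """Zero-mask a random subset of joints in a random subset of frames
    (the subset sizes here are illustrative defaults)."""
    out = seq.copy()
    frames = np.random.choice(seq.shape[0], min(num_frames, seq.shape[0]), replace=False)
    joints = np.random.choice(seq.shape[1], min(num_joints, seq.shape[1]), replace=False)
    out[np.ix_(frames, joints)] = 0.0
    return out

# One possible asymmetric composition: the two branches use different methods.
def augment_left(seq):
    return gaussian_blur(shear(crop(seq)))

def augment_right(seq):
    return joint_mask(shear(crop(seq)))
```

Which methods are assigned to which branch is the design choice examined in Section 3.4 and the ablation study.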
3.4. Asymmetric Data Augmentation Strategy
4. Results
4.1. Dataset
4.2. Experimental Settings
4.3. Ablation Study
4.4. Comparison
5. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Zhang, Z. Microsoft kinect sensor and its effect. IEEE Multimed. 2012, 19, 4–10.
- Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. arXiv 2018, arXiv:1812.08008.
- Fang, H.S.; Xie, S.; Tai, Y.W.; Lu, C. RMPE: Regional multi-person pose estimation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2334–2343.
- Xu, J.; Yu, Z.; Ni, B.; Yang, J.; Yang, X.; Zhang, W. Deep kinematics analysis for monocular 3d human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 14–19 June 2020; pp. 899–908.
- Ke, Q.; Bennamoun, M.; An, S.; Sohel, F.; Boussaid, F. A new representation of skeleton sequences for 3d action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3288–3297.
- Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 7444–7452.
- Si, C.; Chen, W.; Wang, W.; Wang, L.; Tan, T. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 1227–1236.
- Liu, Z.; Zhang, H.; Chen, Z.; Wang, Z.; Ouyang, W. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 14–19 June 2020; pp. 143–152.
- Chen, Z.; Li, S.; Yang, B.; Li, Q.; Liu, H. Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; pp. 1113–1122.
- Zheng, N.; Wen, J.; Liu, R.; Long, L.; Dai, J.; Gong, Z. Unsupervised representation learning with long-term dynamics for skeleton based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
- Su, K.; Liu, X.; Shlizerman, E. Predict & cluster: Unsupervised skeleton based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 14–19 June 2020; pp. 9631–9640.
- Kundu, J.N.; Gor, M.; Uppala, P.K.; Radhakrishnan, V.B. Unsupervised feature learning of human actions as trajectories in pose embedding manifold. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 7–11 January 2019.
- Tian, Y.; Krishnan, D.; Isola, P. Contrastive multiview coding. In Proceedings of the European Conference on Computer Vision (ECCV), Virtual, 23–28 August 2020.
- Lin, L.; Song, S.; Yang, W.; Liu, J. Ms2l: Multi-task self-supervised learning for skeleton based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2490–2498.
- Rao, H.; Xu, S.; Hu, X.; Cheng, J.; Hu, B. Augmented skeleton based contrastive action learning with momentum lstm for unsupervised action recognition. Inf. Sci. 2021, 569, 90–109.
- Li, L.; Wang, M.; Ni, B.; Wang, H.; Yang, J.; Zhang, W. 3D human action representation learning via Cross-View consistency pursuit. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 4741–4750.
- He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 14–19 June 2020; pp. 9729–9738.
- Chen, X.; Fan, H.; Girshick, R.; He, K. Improved baselines with momentum contrastive learning. arXiv 2020, arXiv:2003.04297.
- Tian, Y.; Sun, C.; Poole, B.; Krishnan, D.; Schmid, C.; Isola, P. What makes for good views for contrastive learning? arXiv 2020, arXiv:2005.10243.
- Wang, X.; Qi, G.J. Contrastive Learning with Stronger Augmentations; IEEE: Piscataway, NJ, USA, 2022.
- Wang, J.; Liu, Z.; Wu, Y.; Yuan, J. Mining actionlet ensemble for action recognition with depth cameras. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 18–20 June 2012; pp. 1290–1297.
- Vemulapalli, R.; Arrate, F.; Chellappa, R. Human action recognition by representing 3d skeletons as points in a lie group. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 588–595.
- Vemulapalli, R.; Chellapa, R. Rolling rotations for recognizing human actions from 3d skeletal data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4471–4479.
- Du, Y.; Wang, W.; Wang, L. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1110–1118.
- Song, S.; Lan, C.; Xing, J.; Zeng, W.; Liu, J. Spatio-temporal attention-based lstm networks for 3d action recognition and detection. IEEE Trans. Image Process. 2018, 27, 3459–3471.
- Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; Zheng, N. View adaptive neural networks for high performance skeleton-based human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 1963–1978.
- Hochreiter, S.; Bengio, Y.; Frasconi, P.; Schmidhuber, J. Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies. In A Field Guide to Dynamical Recurrent Networks; Kremer, S.C., Kolen, J.F., Eds.; IEEE Press: Piscataway, NJ, USA, 2001; pp. 237–243.
- Du, Y.; Fu, Y.; Wang, L. Skeleton based action recognition with convolutional neural network. In Proceedings of the Asian Conference on Pattern Recognition (ACPR), Kuala Lumpur, Malaysia, 3–6 November 2015; pp. 579–583.
- Li, C.; Zhong, Q.; Xie, D.; Pu, S. Skeleton-based action recognition with convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Hong Kong, China, 10–14 July 2017; pp. 597–600.
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 12026–12035.
- Zhang, X.; Xu, C.; Tao, D. Context aware graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 14–19 June 2020; pp. 14333–14342.
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 12–18 July 2020; pp. 1597–1607.
- Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.H.; Buchatskaya, E. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. In Proceedings of the Neural Information Processing Systems (NeurIPS), Virtual, 6–12 December 2020.
- Li, Y.; Hu, P.; Liu, Z.; Peng, D.; Zhou, J.T.; Peng, X. Contrastive Clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021.
- Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021.
- Chen, X.; He, K. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021.
- Guo, T.; Liu, H.; Chen, Z.; Liu, M.; Wang, T.; Ding, R. Contrastive Learning from Extremely Augmented Skeleton Sequences for Self-supervised Action Recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022.
- Thoker, F.M.; Doughty, H.; Snoek, C.G. Skeleton-contrastive 3D action representation learning. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 18–20 May 2021; pp. 1655–1663.
- Oord, A.V.D.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748.
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Skeleton-based action recognition with directed graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 7912–7921.
- Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 1–48.
- Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A large scale dataset for 3d human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019.
- Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. NTU RGB+D 120: A large-scale benchmark for 3d human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701.
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of the Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019.
- Yang, S.; Liu, J.; Lu, S.; Er, M.H.; Kot, A.C. Skeleton cloud colorization for unsupervised 3D action representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 13423–13433.
Arguments | Value |
---|---|
sequence size | 50 frames |
batch size | 128 |
view | joint, motion, bone |
base encoder | st-gcn |
queue size | 32,768 |
momentum | 0.9 |
weight decay | 0.0001 |
epoch | 300 |
learning rate | 0.1 (before epoch 250), 0.01 (after epoch 250)
loss function | Equation (2) (before epoch 150), Equation (5) (after epoch 150)
knowledge mining | 1 |
padding ratio | 6 |
shear factor | 0.5 |
Asymm. Aug. | Augmentations | xsub (%) | xview (%) |
---|---|---|---|
× | Basic augmentation | 77.8 | 83.4 |
× | Extreme augmentation | 76.5 | 83.1 |
√ | No Crop | 77.7 | 82.3 |
√ | No Shear | 77.3 | 83.6 |
√ | No Rotation | 78.8 | 84.3 |
√ | No GN/GB/JM/CM | 76.3 | 81.9 |
√ | Ours | 79.0 | 84.6 |
Method | 100 ep | 150 ep | 200 ep | 300 ep |
---|---|---|---|---|
3s-SkeletonCLR [16] | 71.3 | 73.8 | 74.1 | 74.1 |
3s-CrosSCLR [16] | 70.0 | 72.8 | 76.0 | 77.2 |
ours | 74.1 | 76.0 | 77.9 | 79.0 |