MHAiR: A Dataset of Audio-Image Representations for Multimodal Human Actions
Abstract
1. Introduction
- We introduce MHAiR (Multimodal Audio-Image Representations), a new lightweight multimodal dataset.
- We build a new feature representation strategy to select the most informative candidate representations for audio-visual fusion.
- We achieve state-of-the-art or competitive results on standard public benchmarks, validating the generalizability of our proposed approach through extensive evaluation.
Value of Data
- It provides a significant reduction in dimensionality. The spectral centroid images represent the frequency content of the audio signal over time, yielding a much lower-dimensional representation than the original video data. This makes the data easier and faster to process and meaningful features easier to extract.
- It is robust against visual changes. The spectral centroid images are based on the audio signal, which is less affected by visual changes such as changes in lighting conditions or camera angles. This makes the dataset more robust to visual changes and can improve the accuracy of human action analysis.
- It offers standardization, as spectral centroid images can be standardized to a fixed size and format, making it easier to compare and combine data from diverse sources. This is useful for tasks such as cross-dataset validation and transfer learning. Hence, this dataset can serve as a standard benchmark for evaluating the performance of different machine learning algorithms for human action analysis based on audio signals.
- It is suitable for privacy-oriented applications such as surveillance or healthcare monitoring, which may require the analysis of human actions without capturing the original visual information. Spectral centroid images provide a privacy-preserving alternative that can still enable effective analysis in applications where audio can be fused and aligned with non-visual sensory datasets such as HH105 and HH125 (see footnote 1).
- The dataset's versatility facilitates the exploration of different approaches, the development of new techniques for various applications, and the extension of existing ones.
- Audio images, derived from sound data, when fused with visual data can enhance interpretation, improve noise reduction, augment AR/VR experiences, refine content-based multimedia retrieval, and assist in healthcare applications like telemedicine. However, effective fusion requires advanced algorithms and careful attention to challenges such as data alignment, synchronization, and fusion model selection.
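As a concrete illustration of the fusion mentioned in the last point, the sketch below shows one simple choice: late fusion by feature concatenation followed by a linear classifier. It is only a toy example on synthetic features, not the fusion model used for MHAiR; the array sizes, class count, and random features are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for pre-extracted features of the same clips
# (sizes and values are illustrative only).
rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(200, 512))    # e.g., CNN features of audio images
video_feats = rng.normal(size=(200, 1024))   # e.g., CNN features of video frames
labels = rng.integers(0, 5, size=200)        # 5 hypothetical action classes

# Late fusion: align features clip-wise, then concatenate them.
fused = np.concatenate([audio_feats, video_feats], axis=1)

# A linear classifier on the fused representation.
clf = LogisticRegression(max_iter=1000).fit(fused, labels)
print("training accuracy:", clf.score(fused, labels))
```

In practice, the quality of such a fusion depends heavily on how the audio and visual streams are aligned and synchronized, which is why the choice of fusion model is highlighted above as a key challenge.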
2. Related Works
2.1. Multimodal Recognition Methods
2.2. Audio-Image Representations
2.2.1. Waveplot
2.2.2. Spectral Centroid
2.2.3. Spectral Rolloff
2.2.4. Mel Frequency Cepstral Coefficients (MFCCs)
- Pre-emphasis: This step amplifies the high-frequency components of the signal.
- Framing: The continuous signal is divided into frames of N samples, with adjacent frames separated by M samples (typically M < N, so that consecutive frames overlap).
- Windowing: Each frame is multiplied by a window function (Hamming window, for instance).
- Fast Fourier Transform (FFT): This step is taken to convert each frame from the time domain to the frequency domain.
- Mel Filter Bank Processing: The power spectrum is then multiplied with a set of Mel filters to obtain a set of Mel-scaled spectra.
- Discrete Cosine Transform (DCT): Finally, the log Mel spectrum is transformed to the time domain using the DCT. The result is called the Mel Frequency Cepstral Coefficients.
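A minimal sketch of the pipeline above, assuming the librosa library; the file name and frame parameters are illustrative, not the settings used to build the dataset.

```python
import librosa

# Load a mono audio track (path and sampling rate are illustrative).
signal, sr = librosa.load("example_action_clip.wav", sr=22050)

# Pre-emphasis: boost high-frequency content.
emphasized = librosa.effects.preemphasis(signal, coef=0.97)

# Framing, windowing, FFT, Mel filter bank processing and the DCT are
# handled internally by librosa.feature.mfcc; n_fft and hop_length
# control the N-sample frames and the M-sample shift between frames.
mfccs = librosa.feature.mfcc(
    y=emphasized,
    sr=sr,
    n_mfcc=13,         # number of cepstral coefficients kept
    n_fft=2048,        # frame length N (samples)
    hop_length=512,    # frame shift M (samples)
    window="hamming",  # window function applied to each frame
)
print(mfccs.shape)     # (n_mfcc, number_of_frames)
```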
2.2.5. MFCC Feature Scaling
- Standardization: This technique scales the MFCC features so they have the properties of a standard normal distribution with a mean of zero and a standard deviation of one. This is achieved by subtracting the mean and then dividing by the standard deviation.
- MinMax Scaling: Also known as normalization, this technique rescales the features to a fixed range, usually 0 to 1 or −1 to 1. The scaler subtracts the minimum value of the feature and then divides by the range (maximum value minus minimum value).
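Both scalings can be applied per coefficient with scikit-learn; the sketch below uses a synthetic MFCC matrix as a stand-in for the output of an extractor such as librosa.feature.mfcc.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Illustrative MFCC matrix (13 coefficients x 100 frames).
mfccs = np.random.default_rng(0).normal(size=(13, 100))

# scikit-learn scales per column, so transpose to scale each
# coefficient across frames.
X = mfccs.T

# Standardization: zero mean, unit standard deviation per coefficient.
standardized = StandardScaler().fit_transform(X)

# MinMax scaling: rescale each coefficient to the range [0, 1].
minmax_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

print(standardized.mean(axis=0).round(3))        # ~0 for every coefficient
print(minmax_scaled.min(), minmax_scaled.max())  # 0.0 1.0
```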
2.2.6. Chromagram
- The audio signal is first converted into the frequency domain using Fourier Transform or a similar method.
- The resulting spectral information is then mapped onto the 12 pitch classes in an octave using a filter bank tuned to chroma frequencies.
- Over time, a 2D representation (time-pitch intensity) is obtained.
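In practice these steps reduce to a single call in librosa; a minimal sketch, with an illustrative file name and STFT parameters.

```python
import librosa

# Load a mono audio track (path is illustrative).
signal, sr = librosa.load("example_action_clip.wav", sr=22050)

# STFT-based chromagram: spectral energy folded onto the 12 pitch
# classes of an octave, giving a (12, n_frames) time-pitch matrix.
chroma = librosa.feature.chroma_stft(y=signal, sr=sr, n_fft=2048, hop_length=512)
print(chroma.shape)  # (12, number_of_frames)
```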
- Efficient representation: Spectral centroid-based images provide an efficient representation of the audio signal that can be easily processed by deep learning models. Unlike raw audio signals, which can be difficult to process due to their high dimensionality and variability, spectral centroid-based images provide a compact and informative representation that captures the temporal dynamics of the audio signal.
- Robustness to noise: Spectral centroid-based images are less sensitive to noise and distortions than other audio features, such as the raw audio signal or Mel-frequency cepstral coefficients (MFCCs). This is because spectral centroids capture the “center of gravity” of the frequency content, which is less affected by noise and distortions than the fine-grained details of the audio signal. This makes them suitable for noisy environments where other audio features might be unreliable.
- Spatial information: Spectral centroid-based images provide spatial information that can be used by deep learning models to recognize human actions. By converting the spectral centroid over time into an image, we can capture the spatial and temporal information of the frequency distribution of the audio signal, which can be interpreted by deep learning models to recognize different human actions.
- Transfer learning: Spectral centroid-based images can be used for transfer learning, where pre-trained models can be fine-tuned on a specific task. This is because spectral centroid-based images provide a standardized and efficient representation that can be used to compare and combine data from dissimilar sources. This can be useful for tasks such as cross-dataset validation and transfer learning, where models trained on one dataset can be applied to another dataset.
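One way to turn the spectral centroid curve into such an image is sketched below using librosa and matplotlib; the file names, figure size, and normalization are assumptions for illustration, not the exact settings used to generate MHAiR.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load a mono audio track (path is illustrative).
signal, sr = librosa.load("example_action_clip.wav", sr=22050)

# Spectral centroid per frame: the "center of gravity" of the spectrum.
centroid = librosa.feature.spectral_centroid(y=signal, sr=sr)[0]
times = librosa.frames_to_time(range(len(centroid)), sr=sr)

# Normalize the centroid curve, plot it over the waveform, and save the
# figure as a small fixed-size image suitable for a CNN.
norm_centroid = (centroid - centroid.min()) / (centroid.max() - centroid.min())

fig, ax = plt.subplots(figsize=(3, 3), dpi=112)  # small square figure
librosa.display.waveshow(signal, sr=sr, alpha=0.4, ax=ax)
ax.plot(times, norm_centroid, color="r")
ax.axis("off")
fig.savefig("spectral_centroid_image.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)
```

Such saved images can then be fed to a pre-trained image backbone and fine-tuned, which is the transfer-learning use case described in the last point above.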
3. Data Description
4. Methodology
5. Results
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
1. https://casas.wsu.edu/datasets/ (accessed on 19 January 2024)
References
- Shaikh, M.B.; Chai, D. RGB-D data-based action recognition: A review. Sensors 2021, 21, 4246. [Google Scholar] [CrossRef] [PubMed]
- Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. MAiVAR: Multimodal Audio-Image and Video Action Recognizer. In Proceedings of the International Conference on Visual Communications and Image Processing (VCIP), Suzhou, China, 13–16 December 2022; pp. 1–5. [Google Scholar] [CrossRef]
- Sudhakaran, S.; Escalera, S.; Lanz, O. Gate-shift networks for video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1102–1111. [Google Scholar] [CrossRef]
- Yang, G.; Yang, Y.; Lu, Z.; Yang, J.; Liu, D.; Zhou, C.; Fan, Z. STA-TSN: Spatial-Temporal Attention Temporal Segment Network for action recognition in video. PLoS ONE 2022, 17, e0265115. [Google Scholar] [CrossRef] [PubMed]
- Zhang, K.; Li, D.; Huang, J.; Chen, Y. Automated video behavior recognition of pigs using two-stream convolutional networks. Sensors 2020, 20, 1085. [Google Scholar] [CrossRef] [PubMed]
- Lei, J.; Li, L.; Zhou, L.; Gan, Z.; Berg, T.L.; Bansal, M.; Liu, J. Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 7331–7341. [Google Scholar] [CrossRef]
- Girdhar, R.; Ramanan, D.; Gupta, A.; Sivic, J.; Russell, B. ActionVLAD: Learning spatio-temporal aggregation for action classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 971–980. [Google Scholar] [CrossRef]
- Li, Y.; Li, W.; Mahadevan, V.; Vasconcelos, N. VLAD3: Encoding dynamics of deep features for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1951–1960. [Google Scholar] [CrossRef]
- Zhou, B.; Andonian, A.; Oliva, A.; Torralba, A. Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 803–818. [Google Scholar] [CrossRef]
- Kwon, H.; Kim, M.; Kwak, S.; Cho, M. Learning Self-Similarity in Space and Time As Generalized Motion for Video Action Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 13065–13075. [Google Scholar] [CrossRef]
- Mei, X.; Lee, H.C.; Diao, K.Y.; Huang, M.; Lin, B.; Liu, C.; Xie, Z.; Ma, Y.; Robson, P.M.; Chung, M. Artificial intelligence–enabled rapid diagnosis of patients with COVID-19. Nat. Med. 2020, 26, 1224–1228. [Google Scholar] [CrossRef] [PubMed]
- Gu, J.; Cai, H.; Dong, C.; Ren, J.S.; Timofte, R.; Gong, Y.; Lao, S.; Shi, S.; Wang, J.; Yang, S. NTIRE 2021 challenge on perceptual image quality assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 677–690. [Google Scholar] [CrossRef]
- Yan, C.; Teng, T.; Liu, Y.; Zhang, Y.; Wang, H.; Ji, X. Precise no-reference image quality evaluation based on distortion identification. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2021, 17, 1–21. [Google Scholar] [CrossRef]
- Liu, J.; Wang, X.; Wang, C.; Gao, Y.; Liu, M. Temporal decoupling graph convolutional network for skeleton-based gesture recognition. IEEE Trans. Multimed. 2023, 26, 811–823. [Google Scholar] [CrossRef]
- Giannakopoulos, T.; Pikrakis, A. (Eds.) Introduction. In Introduction to Audio Analysis; Academic Press: Oxford, UK, 2014. [Google Scholar] [CrossRef]
- Imtiaz, H.; Mahbub, U.; Schaefer, G.; Zhu, S.Y.; Ahad, M.A.R. Human Action Recognition based on Spectral Domain Features. Procedia Comput. Sci. 2015, 60, 430–437. [Google Scholar] [CrossRef]
- Peeters, G. A large set of audio features for sound description (similarity and classification) in the CUIDADO project. CUIDADO Ist Proj. Rep. 2004, 54, 1–25. [Google Scholar]
- Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402. [Google Scholar]
- Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. Multimodal Fusion for Audio-Image and Video Action Recognition. Neural Comput. Appl. 2024, 1–14. [Google Scholar] [CrossRef]
- Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 4278–4284. [Google Scholar]
- Gong, Y.; Chung, Y.A.; Glass, J. AST: Audio Spectrogram Transformer. In Proceedings of the Interspeech, Brno, Czech Republic, 30 August–3 September 2021; pp. 571–575. [Google Scholar] [CrossRef]
- Chen, T.; Zhai, X.; Ritter, M.; Lucic, M.; Houlsby, N. Self-supervised GANs via auxiliary rotation loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12154–12163. [Google Scholar] [CrossRef]
- Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and flexible image augmentations. Information 2020, 11, 125. [Google Scholar] [CrossRef]
- Takahashi, R.; Matsubara, T.; Uehara, K. Data augmentation using random image cropping and patching for deep CNNs. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 2917–2931. [Google Scholar] [CrossRef]
- Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 1–48. [Google Scholar] [CrossRef]
- Anoosheh, A.; Sattler, T.; Timofte, R.; Pollefeys, M.; Van Gool, L. Night-to-day image translation for retrieval-based localization. In Proceedings of the International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 5958–5964. [Google Scholar] [CrossRef]
- Alharbi, Y.; Wonka, P. Disentangled image generation through structured noise injection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5134–5142. [Google Scholar] [CrossRef]
- Liao, X.; Yu, Y.; Li, B.; Li, Z.; Qin, Z. A new payload partition strategy in color image steganography. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 685–696. [Google Scholar] [CrossRef]
- Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. Spectral Centroid Images for Multi-Class Human Action Analysis: A Benchmark Dataset. Mendeley Data. 2023. Available online: https://data.mendeley.com/datasets/yfvv3crnpy/1 (accessed on 29 October 2023).
- Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. PyMAiVAR: An open-source Python suite for audio-image representation in human action recognition. Softw. Impacts 2023, 17, 100544. [Google Scholar] [CrossRef]
- Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. Chroma-Actions Dataset: Acoustic Images. Mendeley Data. 2023. Available online: https://data.mendeley.com/datasets/r4r4m2vjvh/1 (accessed on 29 October 2023).
- Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. Waveplot-Based Dataset for Multi-Class Human Action Analysis. Mendeley Data. Available online: https://data.mendeley.com/datasets/3vsz7v53pn/1 (accessed on 29 October 2023).
- Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. Spectral Rolloff Images for Multi-class Human Action Analysis: A Benchmark Dataset. Mendeley Data. 2023. Available online: https://data.mendeley.com/datasets/nd5kftbhyj/1 (accessed on 29 October 2023).
- Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. MFFCs for Multi-Class Human Action Analysis: A Benchmark Dataset. Mendeley Data. 2023. Available online: https://data.mendeley.com/datasets/6ng2kgvnwk/1 (accessed on 29 October 2023).
- Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. MFCCs Feature Scaling Images for Multi-Class Human Action Analysis: A Benchmark Dataset. Mendeley Data. 2023. Available online: https://data.mendeley.com/datasets/6d8v9jmvgm/1 (accessed on 29 October 2023).
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar] [CrossRef]
- Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 20–36. [Google Scholar] [CrossRef]
- Takahashi, N.; Gygli, M.; Van Gool, L. AENet: Learning deep audio features for video analysis. IEEE Trans. Multimed. 2017, 20, 513–524. [Google Scholar] [CrossRef]
- Tian, Y.; Shi, J.; Li, B.; Duan, Z.; Xu, C. Audio-visual event localization in unconstrained videos. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 247–263. [Google Scholar] [CrossRef]
- Brousmiche, M.; Rouat, J.; Dupont, S. Multimodal Attentive Fusion Network for audio-visual event recognition. Inf. Fusion 2022, 85, 52–59. [Google Scholar] [CrossRef]
- Long, X.; De Melo, G.; He, D.; Li, F.; Chi, Z.; Wen, S.; Gan, C. Purely attention based local feature integration for video classification. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 2140–2154. [Google Scholar] [CrossRef] [PubMed]
- Gao, R.; Oh, T.H.; Grauman, K.; Torresani, L. Listen to Look: Action Recognition by Previewing Audio. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10457–10467. [Google Scholar] [CrossRef]
- Shaikh, M.B.; Chai, D.; Shamsul Islam, S.M.; Akhtar, N. MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers. In Proceedings of the 2023 11th European Workshop on Visual Information Processing (EUVIP), Gjovik, Norway, 11–14 September 2023; pp. 1–6. [Google Scholar] [CrossRef]
# | Category | Train | Test | # | Category | Train | Test |
---|---|---|---|---|---|---|---|
1 | HeadMassage | 106 | 41 | 27 | BandMarching | 112 | 43 |
2 | BoxingSpeedBag | 97 | 37 | 28 | CricketBowling | 103 | 36 |
3 | HandstandWalking | 77 | 34 | 29 | BasketballDunk | 90 | 37
4 | Rafting | 83 | 28 | 30 | PlayingDaf | 110 | 41 |
5 | ApplyLipstick | 82 | 32 | 31 | FrisbeeCatch | 89 | 37
6 | ParallelBars | 77 | 37 | 32 | BodyWeightSquats | 82 | 30
7 | Haircut | 97 | 33 | 33 | Hammering | 107 | 33 |
8 | Typing | 93 | 43 | 34 | SumoWrestling | 82 | 34 |
9 | BoxingPunchingBag | 114 | 49 | 35 | CuttingInKitchen | 77 | 33 |
10 | StillRings | 80 | 32 | 36 | Archery | 104 | 41 |
11 | CricketShot | 118 | 49 | 37 | MoppingFloor | 76 | 34
12 | SkyDiving | 79 | 31 | 38 | Shotput | 98 | 46 |
13 | WritingOnBoard | 107 | 45 | 39 | HammerThrow | 105 | 45 |
14 | BlowingCandles | 76 | 33 | 40 | CliffDiving | 99 | 39 |
15 | IceDancing | 112 | 46 | 41 | PlayingSitar | 113 | 44 |
16 | BalanceBeam | 77 | 31 | 42 | BrushingTeeth | 95 | 36 |
17 | ApplyEyeMakeup | 101 | 44 | 43 | WallPushups | 95 | 35
18 | TableTennisShot | 101 | 39 | 44 | Surfing | 93 | 33 |
19 | PlayingDhol | 115 | 49 | 45 | BabyCrawling | 97 | 35 |
20 | HandstandPushups | 96 | 28 | 46 | Bowling | 112 | 43
21 | UnevenBars | 76 | 28 | 47 | FrontCrawl | 100 | 37 |
22 | PlayingFlute | 107 | 48 | 48 | ShavingBeard | 118 | 43
23 | PlayingCello | 120 | 44 | 49 | LongJump | 92 | 39
24 | FloorGymnastics | 89 | 36 | 50 | FieldHockeyPenalty | 86 | 40
25 | BlowDryHair | 93 | 38 | 51 | Knitting | 89 | 34 |
26 | SoccerPenalty | 96 | 41 | | | | |
Representation | Audio-Only Accuracy | Yield (Accuracy Gain after Fusion) |
---|---|---|
Waveplot | 12.08% | +10.54% |
Spectral Centroids | 13.22% | +10.59% |
Spectral Rolloff | 16.46% | +10.33% |
MFCCs | 12.96% | +8.28% |
Ref. | Work | Venue |
---|---|---|
[19] | Multimodal Fusion for Audio-Image and Video Action Recognition | Neural Computing and Applications |
[2] | MAiVAR: Multimodal Audio-Image and Video Action Recognizer | International Conference on Visual Communications and Image Processing (VCIP) |
[30] | PyMAiVAR: An open-source Python suite for audio-image representation in human action recognition | Software Impacts
[29] | Spectral Centroid Images for Multi-class Human Action Analysis: A Benchmark Dataset | Mendeley Data
[31] | Chroma-Actions Dataset (CAD) | Mendeley Data
[32] | Waveplot-based Dataset for Multi-class Human Action Analysis | Mendeley Data |
[33] | Spectral Rolloff Images for Multi-class Human Action Analysis: A Benchmark Dataset | Mendeley Data |
[34] | MFFCs for Multi-class Human Action Analysis: A Benchmark Dataset | Mendeley Data |
[35] | MFCCs Feature Scaling Images for Multi-class Human Action Analysis: A Benchmark Dataset | Mendeley Data |
Year | Method | Accuracy (%) |
---|---|---|
2015 | C3D [36] | 82.23 |
2016 | TSN (RGB) [37] | 60.77 |
2017 | C3D + AENet [38] | 85.33 |
2018 | DMRN [39] | 81.04 |
2018 | DMRN [39] + [40] features | 82.93 |
2020 | Attention Cluster [41] | 84.79 |
2020 | IMGAUD2VID [42] | 81.10 |
2022 | STA-TSN (RGB) [4] | 82.1 |
2022 | MAFnet [40] | 86.72 |
2022 | MAiVAR-WP [2] | 86.21 |
2022 | MAiVAR-SC [2] | 86.26 |
2022 | MAiVAR-SR [2] | 86.00 |
2022 | MAiVAR-MFCC [2] | 83.95 |
2022 | MAiVAR-MFS [2] | 86.11 |
2022 | MAiVAR-CH [2] | 87.91 |
Ours | MAiVAR-T [43] | 91.2 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Shaikh, M.B.; Chai, D.; Islam, S.M.S.; Akhtar, N. MHAiR: A Dataset of Audio-Image Representations for Multimodal Human Actions. Data 2024, 9, 21. https://doi.org/10.3390/data9020021