Article

Audio-Visual Action Recognition Using Transformer Fusion Network

Department of Electronics and Electrical Engineering, Dongguk University, Seoul 04620, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(3), 1190; https://doi.org/10.3390/app14031190
Submission received: 8 January 2024 / Revised: 26 January 2024 / Accepted: 29 January 2024 / Published: 31 January 2024

Abstract

Our approach to action recognition is grounded in the intrinsic coexistence of and complementary relationship between audio and visual information in videos. Going beyond the traditional emphasis on visual features, we propose a transformer-based network that integrates both audio and visual data as inputs. This network is designed to accept and process spatial, temporal, and audio modalities. Features from each modality are extracted using a single Swin Transformer, originally devised for still images. Subsequently, these extracted features from spatial, temporal, and audio data are adeptly combined using a novel modal fusion module (MFM). Our transformer-based network effectively fuses these three modalities, resulting in a robust solution for action recognition.

1. Introduction

As the sources of video data vary across diverse fields such as the military [1], unmanned systems [2], surveillance systems [3], and personal security [4], the range of human actions depicted in videos also becomes highly diversified. As a result, recognizing these actions presents a significant challenge. This challenge necessitates the utilization of more sophisticated methods and additional information sources to accurately differentiate between these actions. We note that the traditional approaches [5,6,7,8,9,10] to action recognition in videos have predominantly relied solely on visual information.
With significant advancements in video-based deep learning techniques [7,8,9,10,11,12,13,14], visual features have become the primary clues for tasks related to video understanding. This predominance stems from the visual nature of videos, where actions, movements, and scenes are primarily conveyed through visual cues. Visual features, such as shapes, colors, and motion patterns, provide critical information for interpreting content and discerning different actions or events within a video. In the two-stream approach [7,8,11], spatial and temporal features are extracted from each spatial and temporal stream. For capturing temporal features, [7,11] utilized optical flow, and [8] used dynamic images. By extending input data along the temporal dimensions, [9,10,12,13,14] enabled the learning of spatio-temporal features. However, relying solely on visual information can be limiting, especially in scenarios where visual cues are ambiguous or obscured. We recognize that most videos inherently comprise both visual and audio components, which have the potential to complement each other. Audio can enrich video understanding by providing additional layers of information not always visible, such as dialogue, background sounds, or audio cues that correspond to off-screen activities. In challenging situations such as occlusions, poor lighting, or ambiguous actions, the audio component can provide indispensable contextual clues, thereby enhancing action recognition. For instance, the sound of footsteps or a vehicle in motion can offer vital insights into an action that is not visually discernible. Recent research [15,16,17] has begun exploring the potential of combined audio-visual data, emphasizing its significance in advancing video understanding. Wang et al. [15] proposed a framework to learn from video appearance, motion, and audio, investigating both early and late fusion. Arandjelovic et al. [16] utilized the correspondence between visual and audio information for training the network. Lastly, Xiao et al. [17] utilized a pathway that connects audio features to the layers learning visual features, thereby training a unified representation. This synergy between audio and visual elements allows for a more comprehensive approach to video analysis, leading to improved accuracy and robustness in action recognition tasks.
In previous CNN-based approaches, the fusion of different modalities was achieved through straightforward methods, such as early fusion and late fusion. Early fusion, also known as feature-level fusion, involves combining features from different modalities, like audio and visual, at the initial stage before feeding them into a learning model. This method can capitalize on the raw data’s synergy but may also introduce noise and complexity. On the other hand, late fusion, or decision-level fusion, occurs at a much later stage, where the decisions or predictions made from each individual modality are combined. While this approach maintains the purity of each modality’s data, it may fail to capture deeper, more complex inter-modal interactions. Meanwhile, with the advent of the transformer architecture, more sophisticated attention mechanisms, such as self-attention [18], have proven to be highly effective in capturing complex relationships within a single modality by enabling the model to assign varying degrees of importance to different elements in the input sequence. This powerful capability of self-attention can be harnessed to seamlessly integrate information from diverse modalities, like audio and visual data, by extending its application beyond a single modality. This enables the model to dynamically adjust its focus and allocate attention to the most relevant features from each modality, thereby facilitating the effective fusion of multi-modal information and enhancing the overall performance of tasks like action recognition.
Our approach is centered on two innovative methodologies aimed at enhancing the effectiveness and precision of multi-modal feature learning and fusion for action recognition. Firstly, we introduce a transformer model designed to learn and extract features from multi-modal data, specifically audio and visual information. This model processes individual modalities independently, capturing intricate relationships and patterns within each modality. It is adept at extracting high-level semantic features from both audio and visual data, ensuring a comprehensive understanding of each modality’s unique characteristics. Secondly, we present an attention transformer module that enables the effective integration of the extracted audio and visual features. This module is engineered to compute cross-modal attention between audio and visual elements, allowing the model to selectively emphasize the most relevant modality in a given context. Leveraging the attention mechanism, the model dynamically assigns importance weights to audio and visual features, leading to more robust and accurate action recognition by synthesizing the strengths of each modality.

2. Related Work

Audio-visual action recognition has attracted considerable attention in recent years as researchers strive to leverage the synergistic relationship between audio and visual information, thereby improving the performance of action recognition systems. One of the earliest approaches [19] to audio-visual action recognition focused on fusion techniques, including early fusion, late fusion, and intermediate fusion. These methods aimed to combine audio and visual features at different stages of the pipeline: at the input level (early fusion), at the decision stage (late fusion), or at intermediate points in between (intermediate fusion).
With the advent of the deep learning method, significant advancements have been made in audio-visual action recognition. Researchers have employed convolutional neural networks (CNNs) to learn spatial features from video frames and recurrent neural networks (RNNs) and long short-term memory (LSTM) to capture temporal patterns [20,21] in audio and visual data. Various strategies have also been proposed to fuse the features, including concatenation, element-wise summation, and attention mechanisms [22,23].
More recently, transformer-based models have gained popularity in audio-visual action recognition because of their ability to model long-range dependencies and capture complex interactions between audio and visual modalities. The Multimodal Bottleneck Transformer (MBT) [24] introduces a transformer architecture that constructs multi-modal bottleneck tokens to efficiently fuse video and audio features from image and audio transformers, surpassing traditional late-fusion methods.
In this paper, we deviate from using the Video Swin transformer [13] and opt for a single Swin Transformer [25] designed for still images. We apply this single transformer to all modalities—image, video, and audio—enabling greater resource efficiency without compromising competitive performance.

3. Method

In this section, we introduce our transformer-based architecture for audio-visual action recognition. As shown in Figure 1, the structure consists of three main components: data processing, feature extraction, and the modal fusion module (MFM). In the data processing part, three distinct operations are performed: randomly selecting a single frame (X_i) from the video sequence (X), selecting a set of T frames from X and passing them through the motion module, and converting the audio signal into a spectrogram. We refer readers to [26] for the frame selection methodology. In the feature extraction part, features are extracted from each input modality produced by the data processing phase. Given that the outputs of the data processing module for the three modalities share a common 2D image format with three channels, we can employ a single Swin transformer [25] for the subsequent feature extraction. Finally, within the modal fusion module, a transformer encoder structure is applied to perform the feature fusion.

3.1. Data Processing

The data processing module transforms spatial, temporal, and audio data into a standardized 2D image format (W × H × C), where W is the width and H is the height of the image in pixels and C is the number of channels, ready for subsequent feature extraction by the Swin transformer. For spatial data, a single frame (X_i) is selected randomly from the video sequence (X). This random selection exposes the model to diverse frames during training, thereby improving its generalization capability and increasing data variability. In the temporal stream, T frames, each of dimension W × H × C, are selected by uniform sampling. These T frames are then condensed into a single frame of size W × H × C by the S3D (Shallow 3D CNN) motion module [27]. Notably, the S3D module does not use fixed weights; its weights are initialized and updated during training, enabling a more adaptive and robust representation of motion features. Lastly, for the audio stream, the raw audio signal associated with the video is transformed into a log-mel spectrogram representation. Figure 2 illustrates this transformation by showing both the original waveform and the resulting log-mel spectrogram. The transformation first applies a short-time Fourier transform (STFT) to the audio signal to obtain a time-frequency representation, then maps the linear frequency bins onto the mel scale and applies logarithmic compression to the magnitudes. This allows the deep learning model to process the audio information alongside the visual features, capturing the frequency content of the sound over time.
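The sketch below illustrates this audio pre-processing under the settings reported in Section 4.1 (22.05 kHz sampling, an STFT window of 2048 samples with 50% overlap, and 256 mel bands). It assumes the torchaudio library; the file path, the three-channel replication, and the 224 × 224 target size are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F
import torchaudio

# Load the clip's audio track (path is a placeholder) and resample to 22.05 kHz.
waveform, sr = torchaudio.load("clip.wav")
waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=22050)

# STFT window of 2048 samples with 50% overlap (hop 1024) and 256 mel bands,
# followed by log (dB) compression -> log-mel spectrogram.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=2048, hop_length=1024, n_mels=256
)
log_mel = torchaudio.transforms.AmplitudeToDB()(mel_transform(waveform))

# Collapse to mono, replicate to three channels, and resize so the spectrogram
# matches the W x H x C image format consumed by the shared Swin backbone
# (the 224 x 224 size is an assumption, not stated in the paper).
log_mel = log_mel.mean(dim=0, keepdim=True)     # (1, 256, T)
image = log_mel.expand(3, -1, -1).unsqueeze(0)  # (1, 3, 256, T)
image = F.interpolate(image, size=(224, 224), mode="bilinear", align_corners=False)
print(image.shape)                              # torch.Size([1, 3, 224, 224])
```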

3.2. Audio-Visual Feature Extraction

The feature extraction module is designed to extract learned features from the three processed inputs of spatial, temporal, and audio data, all uniformly transformed into the standardized format of W × H × C. This standardization ensures consistency in the feature extraction process across all modalities. The Swin transformer is employed as a unified feature extraction network: a single Swin transformer with shared parameters learns and handles all modalities, rather than using a separate transformer for each. This approach ensures a consistent and integrated method for feature extraction across the different types of data. The extracted features, denoted as f_I (spatial), f_V (temporal), and f_A (audio), are not combined all at once. Instead, they are paired and directed into three distinct modal fusion modules (MFMs). Each MFM is responsible for fusing two modalities: the first fuses spatial and temporal features (f_I and f_V), the second spatial and audio features (f_I and f_A), and the third temporal and audio features (f_V and f_A). This pairwise fusion strategy, implemented in separate MFMs, allows for a more nuanced and effective integration of the multi-modal data. Through this approach, the model size is kept manageable, avoiding the complexity that would arise from employing a distinct feature extraction network for each modality.
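A minimal sketch of this shared-backbone idea follows, assuming the timm library and its ImageNet-pretrained Swin-B checkpoint (swin_base_patch4_window7_224); the 224 × 224 input size and the dummy tensors are illustrative assumptions.

```python
import torch
import timm

# One Swin-B backbone with shared weights extracts features from all three
# processed inputs (spatial frame, motion-module output, log-mel spectrogram),
# each formatted as a 3-channel image.
backbone = timm.create_model(
    "swin_base_patch4_window7_224", pretrained=True, num_classes=0
)
backbone.eval()

frame = torch.randn(2, 3, 224, 224)        # X_i, a randomly selected frame
motion = torch.randn(2, 3, 224, 224)       # condensed output of the S3D module
spectrogram = torch.randn(2, 3, 224, 224)  # log-mel spectrogram as an image

with torch.no_grad():
    f_i = backbone(frame)        # f_I: spatial features
    f_v = backbone(motion)       # f_V: temporal features
    f_a = backbone(spectrogram)  # f_A: audio features

print(f_i.shape)  # (2, 1024) pooled Swin-B features, one vector per clip
```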

3.3. Transformer-Based Feature Fusion

To integrate the features extracted by the feature extraction module, a transformer-based modal fusion module (MFM) is proposed, as shown in Figure 1. Our MFM is shown in Figure 3. The MFM fuses each feature through the process of co-attention with the transformer encoders. Rather than fusing all features simultaneously, it performs pairwise fusion. Specifically, the process involves the fusion of spatial and temporal features, audio and spatial features, and temporal and audio features, respectively. This methodical approach ensures that the unique characteristics of each modality are effectively combined, allowing for a more comprehensive understanding of the multi-modal data.
As shown in Figure 3, the MFM consists of two transformer encoders, each taking different modal features f_modal_1 and f_modal_2 as inputs. The transformer encoder, as illustrated in Figure 4, follows the same structural design as the encoder used in the vision transformer [28]. However, unlike the conventional approach where the Query, Key, and Value inputs are identical, our implementation feeds different modal features into these components.
One transformer encoder performs modal fusion by taking f_modal_1 as the Key and Value and f_modal_2 as the Query, while the other performs modal fusion by using f_modal_1 as the Query and f_modal_2 as the Key and Value. The fused attention output vectors f_modal_12 and f_modal_21 can be represented as follows:
f_modal_12 = Attention(f_modal_2, f_modal_1, f_modal_1) = Softmax(f_modal_2 f_modal_1^T / √d_k) f_modal_1,  (1)
f_modal_21 = Attention(f_modal_1, f_modal_2, f_modal_2) = Softmax(f_modal_1 f_modal_2^T / √d_k) f_modal_2,  (2)
where d_k is the dimensionality of the Key vector and T denotes the transpose operation. The attention function [18] in Equations (1) and (2) takes the Query, Key, and Value in that order. The two fused attention output vectors are combined using concatenation to produce the final output Y.
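As a sketch of Equations (1) and (2), the bidirectional cross-attention at the core of the MFM can be expressed with PyTorch's nn.MultiheadAttention, where the Query comes from one modality and the Key/Value from the other. Only the cross-attention core of each encoder is shown; the embedding dimension, head count, and token-sequence inputs are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ModalFusionModule(nn.Module):
    """Pairwise cross-attention fusion between two modal feature sequences."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Two encoders, one per fusion direction, as in Figure 3.
        self.attn_12 = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_21 = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, f_modal_1: torch.Tensor, f_modal_2: torch.Tensor) -> torch.Tensor:
        # f_modal_1, f_modal_2: (B, N, dim) token sequences from two modalities.
        f_12, _ = self.attn_12(query=f_modal_2, key=f_modal_1, value=f_modal_1)  # Eq. (1)
        f_21, _ = self.attn_21(query=f_modal_1, key=f_modal_2, value=f_modal_2)  # Eq. (2)
        return torch.cat([f_12, f_21], dim=-1)  # concatenated fusion output Y

# Example: fuse two (batch=2, tokens=49, dim=256) feature sequences.
mfm = ModalFusionModule(dim=256)
y = mfm(torch.randn(2, 49, 256), torch.randn(2, 49, 256))
print(y.shape)  # torch.Size([2, 49, 512])
```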
In the fusion of the spatial image and temporal video, image features (f_I) and video frame features (f_V) are fed into the MFM, yielding the fused features f_IV and f_VI. The fused attention output vectors are combined using concatenation to produce the final output Y_IV. The fused attention output vectors f_IV and f_VI and the final output Y_IV can be represented as follows:
f_IV = Attention(f_V, f_I, f_I) = Softmax(f_V f_I^T / √d_k) f_I  (3)
f_VI = Attention(f_I, f_V, f_V) = Softmax(f_I f_V^T / √d_k) f_V  (4)
Y_IV = Concat(f_IV, f_VI)  (5)
In the fusion of the spatial image and audio, image features (f_I) and audio features (f_A) are combined in the MFM, producing the fused features f_IA and f_AI. The attention output vectors from these fusion processes are then combined using concatenation to create the final output for the spatial-audio combination, denoted as Y_IA. The equations for the fused attention output vectors f_IA and f_AI and the final output Y_IA are as follows:
f_IA = Attention(f_I, f_A, f_A) = Softmax(f_I f_A^T / √d_k) f_A  (6)
f_AI = Attention(f_A, f_I, f_I) = Softmax(f_A f_I^T / √d_k) f_I  (7)
Y_IA = Concat(f_IA, f_AI)  (8)
In the fusion of the temporal video and audio, video frame features (f_V) and audio features (f_A) are combined in the MFM, producing the fused features f_VA and f_AV. The attention output vectors from these fusion processes are then combined using concatenation to create the final output for the temporal-audio combination, denoted as Y_VA. The equations for the fused attention output vectors f_VA and f_AV and the final output Y_VA are as follows:
f_VA = Attention(f_V, f_A, f_A) = Softmax(f_V f_A^T / √d_k) f_A  (9)
f_AV = Attention(f_A, f_V, f_V) = Softmax(f_A f_V^T / √d_k) f_V  (10)
Y_VA = Concat(f_VA, f_AV)  (11)
The final stage of our model integrates the outputs of the spatial-temporal (Y_IV), temporal-audio (Y_VA), and spatial-audio (Y_IA) fusions into one comprehensive output Y. This is achieved through concatenation, allowing the preservation and combination of distinct features from each modality. The equation for this final concatenation is
Y = Concat(Y_IV, Y_VA, Y_IA)  (12)
Once Y is obtained, it is fed into a final classification layer (MLP), which is responsible for the action recognition task. This layer, typically a fully connected neural network, interprets the rich, multi-modal feature set represented by Y to accurately classify and recognize various actions in the input video.
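Putting the pieces together, the sketch below wires a shared backbone, the three MFMs, and the final MLP classifier following Equation (12). It assumes the backbone returns token sequences of shape (B, N, dim) and reuses the ModalFusionModule sketched above; the layer sizes and the token-pooling choice are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class AudioVisualFusionNet(nn.Module):
    """Sketch of the overall pipeline: shared backbone, three MFMs, MLP head."""

    def __init__(self, backbone: nn.Module, dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                 # shared Swin feature extractor
        self.mfm_iv = ModalFusionModule(dim)     # spatial-temporal fusion
        self.mfm_ia = ModalFusionModule(dim)     # spatial-audio fusion
        self.mfm_va = ModalFusionModule(dim)     # temporal-audio fusion
        # Each MFM outputs 2*dim channels; concatenating three outputs gives 6*dim.
        self.head = nn.Sequential(nn.LayerNorm(6 * dim), nn.Linear(6 * dim, num_classes))

    def forward(self, frame, motion, spectrogram):
        f_i = self.backbone(frame)        # f_I: (B, N, dim)
        f_v = self.backbone(motion)       # f_V: (B, N, dim)
        f_a = self.backbone(spectrogram)  # f_A: (B, N, dim)
        y_iv = self.mfm_iv(f_i, f_v)      # Y_IV
        y_ia = self.mfm_ia(f_i, f_a)      # Y_IA
        y_va = self.mfm_va(f_v, f_a)      # Y_VA
        y = torch.cat([y_iv, y_va, y_ia], dim=-1)  # Eq. (12)
        return self.head(y.mean(dim=1))            # pool over tokens, then classify
```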

4. Experimental Results

4.1. Implementation

All our experiments were conducted on an Intel i7-4790 CPU and an NVIDIA Tesla V100 GPU, using PyTorch 1.12.1 and Ubuntu 20.04 LTS. A Swin transformer [25] pre-trained on ImageNet [29] was adopted, and the model was trained using AdamW [30] as the optimizer. The batch size was set to 16 with an initial learning rate of 5 × 10^-4, and a cosine annealing scheduler [31] was adopted.
For the spatial stream, during training, a single frame was randomly selected from each video to enhance the model’s ability to generalize. For testing, the central frame of each video was consistently chosen to ensure uniformity and comparability of results. For the temporal stream processing in the S3D module, we utilized uniform sampling to select 16 frames as inputs, ensuring diverse and representative coverage of the video content.
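A minimal sketch of the uniform temporal sampling described above (T = 16) is given below; the exact index formula is an assumption, since the paper does not spell it out.

```python
import numpy as np

def uniform_sample_indices(num_frames: int, t: int = 16) -> np.ndarray:
    """Select t frame indices spread uniformly over a clip of num_frames frames."""
    return np.linspace(0, num_frames - 1, num=t).round().astype(int)

# Example: a 120-frame clip yields 16 roughly evenly spaced indices.
print(uniform_sample_indices(120))
# [  0   8  16  24  32  40  48  56  63  71  79  87  95 103 111 119]
```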
The raw audio data were sampled at 22.05 kHz, and the input features were extracted using a short-time Fourier transform (STFT) with a window size of 2048 samples, an overlap of 50%, and 256 mel bands. We used the augmentation methods of random crop, random flip, color jittering, and AutoAugment [32]. The default version of the Swin transformer [25] is referred to as Swin-B. Additionally, there are the Swin-T, Swin-S, and Swin-L versions, whose model and computational complexities are approximately 0.25×, 0.5×, and 2× that of Swin-B, respectively. Following the original Swin Transformer configuration, we set the window size to M = 7 and the query dimension of each head to d = 32 in our models. The other hyper-parameters of each model are as follows:
  • Swin-T: C* = 96, L = {2, 2, 6, 2},
  • Swin-S: C* = 96, L = {2, 2, 18, 2},
  • Swin-B: C* = 128, L = {2, 2, 18, 2},
  • Swin-L: C* = 192, L = {2, 2, 18, 2},
where C* is the channel number of the hidden layers in the first stage and L is the number of layers in each stage. The metric used to evaluate model performance is accuracy.
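For reference, the following is a minimal sketch of the optimization setup described above (AdamW, initial learning rate 5 × 10^-4, batch size 16, cosine annealing). The stand-in model, the epoch count, and the omitted data loop are assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 50)   # stand-in for the full fusion network
num_epochs = 30               # assumed; the paper does not report the epoch count

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

for epoch in range(num_epochs):
    # ... iterate over mini-batches of size 16, compute the loss, and call
    # loss.backward() / optimizer.step() here ...
    scheduler.step()

print(scheduler.get_last_lr())  # learning rate after cosine annealing
```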

4.2. Datasets

For the performance evaluation of the proposed method, the UCF-sound [15] and Kinetics-sound [16,17] audio-visual datasets were adopted. UCF-sound [15] is derived from UCF-101 [33] by retaining the clips that have a valid soundtrack. It contains 6624 video clips across 50 classes. We divided the dataset into training and test sets, following the same split used in [15], resulting in 4733 samples for training and 1891 samples for testing. The classes in the UCF-sound dataset are diverse, covering a wide range of auditory activities, with each class incorporating both video and audio elements. These classes are listed below, with samples presented in Figure 5a.
  • applylipstick, archery, babycrawling, balancebeam, bandmarching, basketballdunk, blowdryhair, blowingcandles, bodyweightsquats, bowling, boxingpunchingbag, boxingspeedbag, brushingteeth, cliffdiving, cricketbowling, cricketshot, cuttinginkitchen, fieldhockeypenalty, floorgymnastics, frisbeecatch, frontcrawl, haircut, hammerthrow, hammering, handstandpushups, handstandwalking, headmassage, icedancing, knitting, longjump, moppingfloor, parallelbars, playingcello, playingdaf, playingdhol, playingflute, playingsitar, rafting, shavingbeard, shotput, skydiving, soccerpenalty, stillrings, sumowrestling, surfing, tabletennisshot, typing, unevenbars, wallpushups, writingonboard
The Kinetics-sound [16,17] dataset is derived from the Kinetics video dataset [34]. It originally contains 34 human action classes that clearly carry audio-visual information; however, some classes were removed, and only 32 classes were used in the experiments, as in [17]. We divided the dataset into training and test sets, resulting in 22,914 samples for training and 1585 samples for testing. The classes in the Kinetics-sound dataset are primarily centered around musical instruments and everyday actions, with each class incorporating both video and audio elements. These classes are listed below, with samples presented in Figure 5b.
  • blowingnose, blowingoutcandles, bowling, choppingwood, dribblingbasketball, laughing, mowinglawn, playingaccordion, playingbagpipes, playingbassguitar, playingclarinet, playingdrums, playingguitar, playingharmonica, playingkeyboard, playingorgan, playingpiano, playingviolin, playingxylophone, playingsaxophone, rippingpaper, shufflingcards, singing, stompinggrapes, strummingguitar, tapdancing, tappingguitar, tappingpen, tickling, playingtrumpet, playingtrombone
Figure 5. Examples of the actions in the UCF-sound and Kinetics-sound datasets: (a) is UCF-sound and (b) is Kinetics-sound.

4.3. Results

The performance of the proposed method was evaluated against several state-of-the-art approaches on the UCF-sound [15] and Kinetics-sound [17] datasets. The experiments were conducted across four distinct cases examining various combinations of audio-visual data: (1) all elements, including a single frame (spatial), T frames (temporal), and audio; (2) a single frame and T frames without audio; (3) a single frame and its corresponding audio; and (4) T frames and their corresponding audio. This facilitates a comprehensive evaluation of the interactions between the audio-visual configurations and their influence on the experimental outcomes.
For the UCF-sound dataset, Table 1 presents an ablation study comparing the performance of different versions of the Swin transformer for the four cases (1)–(4). The results show that case (1), which includes all elements (a single frame, T frames, and audio), consistently outperforms the other cases across all Swin Transformer variants, indicating the effectiveness of the multi-modal approach. Conversely, case (4), involving only T frames and their corresponding audio, shows the lowest performance, suggesting that the spatial data (single frame) contribute significantly to the model's accuracy. The performance differences between cases (2) and (3) further emphasize the individual contributions of the spatial and audio modalities.
Across all Swin Transformer variants, case (3) generally outperforms cases (2) and (4), while case (1) performs only slightly better than case (3). Consequently, while the inclusion of audio data does contribute to an overall improvement in performance, it is evident that visual data play a more crucial role in determining the effectiveness of the model on the UCF-sound dataset.
The results of the ablation study on the Kinetics-sound dataset are shown in Table 2. As with UCF-sound, case (1), which integrates all three modalities, achieves the best accuracy for every Swin Transformer variant. This consistency across datasets underscores the importance of integrating spatial, temporal, and audio data for optimal action recognition. Notably, in this dataset, case (4) (temporal frames and audio) outperforms case (2) (single frame and T frames without audio), suggesting that the combination of motion and audio features is more effective than spatial and temporal features alone. These variations between the two datasets highlight the impact of dataset-specific characteristics on the efficacy of different modality combinations.
Wang et al. [15] reported results obtained by either the early fusion (EF) or late fusion (LF) of spatial, temporal, and audio features, with predictions made by either neural networks (NNs) or an SVM (see Table 3). EF involves concatenating features or transforming the feature space at the feature level, while LF fuses the different modalities at the decision level. Our Swin-B model achieves an accuracy of 93.00%, approximately 11 percentage points higher than the best result of Wang et al. [15] (82.50% with LF-SVM).
As shown in Table 4, our proposed method (case (1) with Swin-B) achieves the highest accuracy of 89.34% on the Kinetics-sound dataset. This performance surpasses those of all other listed methods, including the Multi-level Attention Fusion Network (MAFnet) [35], the AVSlowFast models [17], and the MBT [24]. Notably, our method's accuracy is approximately 4 percentage points higher than that of the nearest competing methods (AVSlowFast [17] and MBT [24], both at 85.00%), marking a significant improvement.

4.4. Visual Interpretation with Grad-CAM

Figure 6 and Figure 7 present GRAD-CAM [37] visualizations for the UCF-sound and Kinetics-sound datasets. GRAD-CAM, which stands for gradient-weighted class activation mapping, leverages the gradient information from the last transformer layer of our feature extraction network. In each figure, alongside the original image, we provide two GRAD-CAM images. The first GRAD-CAM image is derived using only visual information, while the second incorporates data from all modalities. A notable observation is that the activations highlighted in the multi-modal model are more prominent and concentrated around areas of significant movement or sound generation. This contrast suggests that incorporating audio data alongside visual information enhances the model’s ability to focus on relevant parts of the scene, leading to more accurate and interpretable results. The comparison between these two sets of GRAD-CAM images underscores the added value of audio data in enriching the model’s understanding of the scene.
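Below is a minimal, generic Grad-CAM sketch in the spirit of this visualization: gradients of the target class score are averaged over the spatial dimensions and used to weight the chosen feature map. It assumes a feature module that outputs a (B, C, H, W) map; for Swin-style token features the tokens would first need to be reshaped into a spatial grid, and the layer choice is an assumption rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, feature_module, x, class_idx=None):
    """Compute Grad-CAM heatmaps for a batch x using the given feature layer.

    class_idx: optional LongTensor of target classes; defaults to the argmax.
    """
    feats, grads = {}, {}

    def fwd_hook(_, __, output):
        feats["map"] = output                       # save the forward feature map

    def bwd_hook(_, __, grad_output):
        grads["map"] = grad_output[0]               # save its gradient

    h1 = feature_module.register_forward_hook(fwd_hook)
    h2 = feature_module.register_full_backward_hook(bwd_hook)
    try:
        logits = model(x)
        if class_idx is None:
            class_idx = logits.argmax(dim=1)
        model.zero_grad()
        logits.gather(1, class_idx.view(-1, 1)).sum().backward()
    finally:
        h1.remove()
        h2.remove()

    fmap, grad = feats["map"], grads["map"]          # (B, C, H, W)
    weights = grad.mean(dim=(2, 3), keepdim=True)    # per-channel importance
    cam = F.relu((weights * fmap).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
    cam_min = cam.amin(dim=(2, 3), keepdim=True)
    cam_max = cam.amax(dim=(2, 3), keepdim=True)
    return (cam - cam_min) / (cam_max - cam_min + 1e-8)  # heatmaps in [0, 1]
```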

5. Discussion

In this study, we propose a novel approach for enhancing action recognition by integrating visual and audio information from videos using a transformer structure. While our model exhibits promising capabilities, there are inherent limitations and areas for future exploration. Firstly, the model employs a shared feature extraction network for both audio and visual data. This design was primarily driven by the aim to maintain network efficiency; however, it may impose limitations on the model’s ability to distinctly capture modality-specific features. For instance, separate feature extractors, such as 3D CNNs [9,38] for visual temporal features, and VGGish [39] or audio spectrogram transformers (ASTs) [40] for audio, can potentially yield better recognition performance. Secondly, our work focused on the fusion of visual and audio data, and we evaluated its performance. Recently, with the rise of multi-modal learning, various large-scale audio-visual datasets like Epic-sounds [41] have emerged. It will be beneficial to conduct future experiments on these large-scale datasets to verify our model’s performance. Moreover, there is a growing trend of attempting multi-modal learning with various modalities, such as text or data from different sensors. It is necessary to explore whether our structure can be beneficial in these contexts as well. Our structure, designed for visual and audio data, uses three modal fusion modules. However, with the addition of other modal data, additional modal fusion modules may be required. This can lead to an increase in model size or may not be optimal structurally. To address this, an improved modal fusion module capable of accommodating more modalities is needed.

6. Conclusions

In this study, we introduce a novel approach using a single Swin transformer across various modalities, including image, video, and audio. Our method simplifies the multi-modal fusion process with the integration of an attention transformer module in the modal fusion module (MFM), effectively fusing audio and visual features. This streamlined approach shows substantial performance improvements in robust action recognition compared to existing methods. The combined use of both visual and audio information is crucial for a comprehensive understanding of actions, which has significant implications for applications in areas such as surveillance, content analysis, and interactive media.

Author Contributions

Conceptualization, J.-H.K. and C.S.W.; methodology, J.-H.K. and C.S.W.; software, J.-H.K.; validation, J.-H.K.; formal analysis, J.-H.K.; investigation, J.-H.K.; resources, J.-H.K.; data curation, J.-H.K.; writing—original draft preparation, J.-H.K.; writing—review and editing, C.S.W.; visualization, J.-H.K.; supervision, C.S.W.; project administration, C.S.W.; funding acquisition, C.S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF2023R1A2C1003588).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These datasets can be found at the following locations: UCF-Sound: https://www.crcv.ucf.edu/research/data-sets/ucf101/, accessed on 8 January 2024; Kinetics-Sound: https://deepmind.com/research/open-source/kinetics, accessed on 8 January 2024. For information on how to obtain UCF-Sound and Kinetics-Sound from UCF-101 and Kinetics-400, go to https://github.com/kjunhwa/audiovisual_action, accessed on 8 January 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ukani, V.; Thakkar, P. A hybrid video based iot framework for military surveillance. Des. Eng. 2021, 5, 2050–2060. [Google Scholar]
  2. Zhang, Q.; Sun, H.; Wu, X.; Zhong, H. Edge video analytics for public safety: A review. Proc. IEEE 2019, 107, 1675–1696. [Google Scholar] [CrossRef]
  3. Kim, D.; Kim, H.; Mok, Y.; Paik, J. Real-time surveillance system for analyzing abnormal behavior of pedestrians. Appl. Sci. 2021, 11, 6153. [Google Scholar] [CrossRef]
  4. Prathaban, T.; Thean, W.; Sazali, M.I.S.M. A vision-based home security system using OpenCV on Raspberry Pi 3. AIP Conf. Proc. 2019, 2173, 020013. [Google Scholar]
  5. Ohn-Bar, E.; Trivedi, M. Joint angles similarities and HOG2 for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Portland, OR, USA, 23–28 June 2013; pp. 465–470. [Google Scholar]
  6. Wang, H.; Schmid, C. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; pp. 3551–3558. [Google Scholar]
  7. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems 2014, Montreal, QC, Canada, 8–13 December 2014; Volume 27. [Google Scholar]
  8. Bilen, H.; Fernando, B.; Gavves, E.; Vedaldi, A. Action Recognition with Dynamic Image Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 2799–2813. [Google Scholar] [CrossRef] [PubMed]
  9. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
  10. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
  11. Khan, S.; Hassan, A.; Hussain, F.; Perwaiz, A.; Riaz, F.; Alsabaan, M.; Abdul, W. Enhanced spatial stream of two-stream network using optical flow for human action recognition. Appl. Sci. 2023, 13, 8003. [Google Scholar] [CrossRef]
  12. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6202–6211. [Google Scholar]
  13. Liu, Z.; Ning, J.; Cao, Y.; Wei, Y.; Zhang, Z.; Lin, S.; Hu, H. Video swin transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3202–3211. [Google Scholar]
  14. Wang, H.; Zhang, W.; Liu, G. TSNet: Token Sparsification for Efficient Video Transformer. Appl. Sci. 2023, 13, 10633. [Google Scholar] [CrossRef]
  15. Wang, C.; Yang, H.; Meinel, C. Exploring multimodal video representation for action recognition. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 1924–1931. [Google Scholar]
  16. Arandjelovic, R.; Zisserman, A. Look, listen and learn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 609–617. [Google Scholar]
  17. Xiao, F.; Lee, Y.J.; Grauman, K.; Malik, J.; Feichtenhofer, C. Audiovisual slowfast networks for video recognition. arXiv 2020, arXiv:2001.08740. [Google Scholar]
  18. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  19. Atrey, P.K.; Hossain, M.A.; El Saddik, A.; Kankanhalli, M.S. Multimodal fusion for multimedia analysis: A survey. Multimed. Syst. 2010, 16, 345–379. [Google Scholar] [CrossRef]
  20. Wöllmer, M.; Kaiser, M.; Eyben, F.; Schuller, B.; Rigoll, G. LSTM-modeling of continuous emotions in an audiovisual affect recognition framework. Image Vis. Comput. 2013, 31, 153–163. [Google Scholar] [CrossRef]
  21. Gupta, M.V.; Vaikole, S.; Oza, A.D.; Patel, A.; Burduhos-Nergis, D.P.; Burduhos-Nergis, D.D. Audio-Visual Stress Classification Using Cascaded RNN-LSTM Networks. Bioengineering 2022, 9, 510. [Google Scholar] [CrossRef]
  22. Zhang, Y.; Wang, Z.-R.; Du, J. Deep fusion: An attention guided factorized bilinear pooling for audio-video emotion recognition. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–8. [Google Scholar]
  23. Duan, B.; Tang, H.; Wang, W.; Zong, Z.; Yang, G.; Yan, Y. Audio-visual event localization via recursive fusion by joint co-attention. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual Conference, 5–9 January 2021; pp. 4013–4022. [Google Scholar]
  24. Nagrani, A.; Yang, S.; Arnab, A.; Jansen, A.; Schmid, C.; Sun, C. Attention bottlenecks for multimodal fusion. Adv. Neural Inf. Process. Syst. 2021, 34, 14200–14213. [Google Scholar]
  25. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual Conference, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  26. Kim, J.-H.; Won, C.S. Action Recognition in Videos Using Pre-trained 2D Convolutional Neural Networks. IEEE Access 2020, 8, 60179–60188. [Google Scholar] [CrossRef]
  27. Kim, J.-H.; Kim, N.; Won, C.S. Deep edge computing for videos. IEEE Access 2021, 9, 123348–123357. [Google Scholar] [CrossRef]
  28. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  29. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  30. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  31. Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
  32. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. Autoaugment: Learning augmentation policies from data. arXiv 2018, arXiv:1805.09501. [Google Scholar]
  33. Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv 2012, arXiv:1212.0402. [Google Scholar]
  34. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The kinetics human action video dataset. arXiv 2017, arXiv:1705.06950. [Google Scholar]
  35. Brousmiche, M.; Rouat, J.; Dupont, S. Multi-level attention fusion network for audio-visual event recognition. arXiv 2021, arXiv:2106.06736. [Google Scholar]
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  37. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  38. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 221–231. [Google Scholar] [CrossRef] [PubMed]
  39. Hershey, S.; Chaudhuri, S.; Ellis, D.P.W.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN architectures for large-scale audio classification. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 131–135. [Google Scholar]
  40. Gong, Y.; Chung, Y.-A.; Glass, J. AST: Audio Spectrogram Transformer. arXiv 2021, arXiv:2104.01778. [Google Scholar]
  41. Huh, J.; Chalk, J.; Kazakos, E.; Damen, D.; Zisserman, A. Epic-Sounds: A Large-Scale Dataset of Actions that Sound. In Proceedings of the ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
Figure 1. The overall transformer-based architecture for audio-visual action recognition.
Figure 2. Visualization of audio transformation for the Kinetics-sound dataset. Top to bottom: raw audio waveforms followed by their respective log-mel spectrogram representations. Top-left: ‘Blowing Nose’, top-right: ‘Playing Accordion’, bottom-left: ‘Playing Trumpet’, bottom-right: ‘Tapping Pen’.
Figure 3. The general structure of the modal fusion module (MFM).
Figure 4. Transformer encoder.
Figure 6. GRAD-CAM visualization results of UCF-sound using Swin-L as a feature extractor. For each original image, its subsequent two GRAD-CAM images correspond to the outcomes using solely visual information and using all modalities, respectively.
Figure 7. GRAD-CAM visualization results of Kinetics-sound using Swin-L as a feature extractor. For each original image, its subsequent two GRAD-CAM images correspond to the outcomes using solely visual information and using all modalities, respectively.
Table 1. The ablation study for Swin transformer variants on the UCF-sound dataset: (1) all elements, including a single frame (spatial), T frames (temporal), and audio; (2) single frame and T frames without audio; (3) single frame and its corresponding audio; and (4) T frames and their corresponding audio. The bold numbers represent the best results for each Swin type.

Case    Swin-T    Swin-S    Swin-B    Swin-L
(1)     86.63%    91.05%    93.00%    92.53%
(2)     84.21%    87.79%    91.00%    91.95%
(3)     86.63%    90.16%    91.79%    91.84%
(4)     74.32%    79.26%    89.05%    90.84%
Table 2. The ablation study for the Swin transformer on the Kinetics-sound dataset: (1) all elements, including a single frame (spatial), T frames (temporal), and audio; (2) single frame and T frames without audio; (3) single frame and its corresponding audio; and (4) T frames and their corresponding audio. The bold numbers represent the best results for each Swin type.

Case    Swin-T    Swin-S    Swin-B    Swin-L
(1)     87.50%    88.25%    89.34%    89.28%
(2)     82.24%    84.43%    85.38%    86.00%
(3)     86.48%    87.80%    88.59%    89.05%
(4)     85.25%    86.20%    88.18%    87.36%
Table 3. Comparison of multi-modal results on the UCF-sound dataset. EF denotes early fusion, LF denotes late fusion, and NN denotes neural network. The bold numbers represent the best results.

Method                                Accuracy
EF-NN [15]                            80.06%
LF-NN [15]                            61.00%
EF-SVM [15]                           66.10%
LF-SVM [15]                           82.50%
Proposed method (Swin-B, case (1))    93.00%
Table 4. Comparison of multi-modal results on the Kinetics-sound dataset. R50 denotes ResNet-50 [36], and R101 denotes ResNet-101 [36]. MAFnet denotes the Multi-level Attention Fusion Network. The bold numbers represent the best results.

Method                                Accuracy
L3-Net [16]                           74.00%
SlowFast [17], R50                    80.50%
AVSlowFast [17], R50                  83.70%
SlowFast [17], R101                   82.70%
MAFnet [35]                           83.94%
AVSlowFast [17]                       85.00%
MBT [24]                              85.00%
Proposed method (Swin-B, case (1))    89.34%