Article

Temporal Feature Prediction in Audio–Visual Deepfake Detection

1 Department of Electronics and Communications Engineering, Beijing Electronic Science and Technology Institute, Beijing 100070, China
2 State Information Center, Beijing 100045, China
3 School of Telecommunications Engineering, Xidian University, Xi’an 710071, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(17), 3433; https://doi.org/10.3390/electronics13173433
Submission received: 5 August 2024 / Revised: 25 August 2024 / Accepted: 26 August 2024 / Published: 29 August 2024
(This article belongs to the Special Issue Applied Cryptography and Practical Cryptoanalysis for Web 3.0)

Abstract

The rapid growth of deepfake technology, which generates highly realistic manipulated media, poses a significant threat due to its potential for misuse, so effective detection methods are urgently needed. Current approaches often focus on a single modality or on the simple fusion of audio–visual signals, which limits their accuracy. To solve this problem, we propose a deepfake detection scheme based on bimodal temporal feature prediction, which introduces the idea of temporal feature prediction into the audio–video bimodal deepfake detection task in order to fully exploit the temporal regularities of the audio and visual modalities. First, pairs of adjacent audio–video sequence clips are used to construct input quadruples, and a dual-stream network is employed to extract temporal feature representations from video and audio, respectively. A video prediction module and an audio prediction module are designed to capture the temporal inconsistencies within each single modality by predicting future temporal features and comparing them with reference features. Then, a projection layer network is designed to align the audio–visual features, and a contrastive loss is used to maximize the audio–visual differences in fake videos while minimizing them in real videos. Experiments on the FakeAVCeleb dataset demonstrate superior performance, with an accuracy of 84.33% and an AUC of 89.91%, outperforming existing methods and confirming the effectiveness of our approach in deepfake detection.

1. Introduction

Currently, deepfake detection techniques mainly rely on anomalous features of visual modalities in videos. However, as deepfake generative systems continue to advance, visual fraud [1,2,3] and audio tampering [4,5] often coexist in forged videos. These manipulated videos often undergo compression and other disturbances that obscure forensic traces, rendering unimodal detection methods (e.g., hybrid boundaries [6,7], frequency anomalies [8,9,10,11], intra-modal inconsistencies [12,13], and facial or lip motion artifacts [14,15]) less effective. Thus, it is essential to develop robust multimodal detection approaches that leverage both audio and visual information. With the advantage of multimodal learning, the detector can utilize the inconsistency artifacts between different modalities to enhance its detection capability and obtain accurate and reliable results, which provides an effective path to solve the above problems.
Recent research has explored multimodal deepfake detection by integrating audio and visual signals. Muppalla et al. [16] enhanced detection granularity by categorizing samples using combined unimodal labels. However, the full potential of audio–visual correlations remains underexploited. While some works [17,18,19,20,21,22] have innovated in fusing these features, they fall short in addressing the fundamental differences between audio and video signals, leading to gaps in the modal distribution. Furthermore, current methods focusing on specific inconsistencies, like emotional cues [23], risk becoming ineffective if forgers adapt to conceal these anomalies.
A major limitation of current audio–video bimodal forgery detection methods is that their training strategies, whether based on contrastive learning or self-supervision, essentially learn only the inconsistency between the two modalities. This narrow framing limits the breadth and depth of these methods in deepfake detection. Previous research has made significant progress in exploiting intra-frame artifacts, temporal feature inconsistencies, and similar cues, but existing multimodal learning methods tend to ignore them, especially those present in the time series, which allows subtle traces of forgery to evade the detector. Therefore, introducing artifact features that have proven effective in unimodal detection into the bimodal detection task can fill this gap and better exploit both the intra-modal artifacts and the inter-modal inconsistencies of videos.
To address the issues raised earlier, this paper presents a deep forgery detection method based on bimodal temporal feature prediction, which makes significant contributions in several key areas:
  • A representation prediction module based on temporal feature prediction is proposed. It segments the input video into several video segments and infers longer-term global feature relationships from the sequence of segments. These long-term global features across multiple time steps help to mine visual and auditory regularities and predict future sequence representations.
  • A video prediction module and an audio prediction module are designed. By predicting future temporal features and comparing them with reference features, they capture temporal inconsistencies within each single modality.
  • A contrastive loss function is innovatively introduced as an important part of the overall objective. It is trained to maximize the distinction between real and fake videos.
  • A projection layer network is designed for the audio and video features, respectively, which is used to merge the feature representations of neighboring audio and video clips and align the bimodal features, thus solving the problem of dimensional inconsistency between the video stream features and the audio stream features.

2. Related Work

Recent advances in deepfake detection have utilized contrast and self-supervised learning methods to identify differences between audio and visual channels. Chugh et al. [24] argued that audio–video dissonance can be a strong basis for detecting deepfake video, as tampered faces have discontinuities in the mouth region that result in a mismatch with the voice. Therefore, the authors proposed to use a dual-stream network to process video features and speech features separately, and to use the modal dissonance score (MDS) to construct a contrast loss function to compute the degree of incongruence between audio–video features. However, as with the frame-based approach, simply training the deep network on the video in a binary classification manner may result in the model overfitting the seen artifacts. To address this problem, Haliassos et al. [25] used a self-supervised approach that exploits known correspondences between visual and auditory modalities in natural videos to capture all the shared information between the two modalities through a cross-modal teacher–student framework. However, this approach requires a two-phase training process and uses audio–video bimodality for cross-modal supervision only in the first phase. In the testing phase, it abandons the audio modality and considers only visual features, which undoubtedly sacrifices the model’s performance and limits its scalability. Yu et al. [26] similarly extracted natural audio–visual correspondences from real videos in a self-supervised manner, and then used the learned real correspondences as targets to guide the extraction of audio–visual inconsistencies in the detection phase. The first stage utilizes two augmented visual views paired with corresponding audio cues to better explore common audio–visual correspondences. The second phase then aligns the audio–visual features of the real video using the network model frozen in the first phase to enable the detection network to better learn the audio–visual inconsistencies of deepfake videos.
Cheng et al. [27] found that voice and facial identity representations in deepfake videos are often mismatched but homogeneous in real videos. Based on this, an effective multimodal matching framework was designed to pretrain models on a generic audio–visual dataset and fine-tune them on downstream deepfake data to detect the plausibility of real and fake videos. Cozzolino et al. [28] created a deepfake detector for persons of interest (POIs) by extracting the audio–visual features that characterize a person’s identity and using these features. They applied comparative learning to extract feature representations of the most unique dynamic face and audio clips for each identity. When a person’s video and audio are manipulated, their representation in the embedding space will become inconsistent with the true identity, allowing the reliable detection of deepfakes. The detector is trained only on real face videos and does not depend on any specific manipulation method, thus providing the highest generalization ability.
In the field of deepfake detection in audio and video, several related works provide valuable references. Umirzakova et al. [29] proposed an improved facial image parsing technique that significantly improves the segmentation accuracy of facial features by combining spatial and channel attention mechanisms. Dam et al. [30] introduced the AYDIV model, which utilizes a contextual vision converter for 3D object detection. This approach may be useful for analyzing spatial and temporal relationships in audio and video. Lee et al. [31] introduced the Watt-EffNet model, which balances accuracy and computational requirements, and is particularly valuable for real-time fake detection, providing insights for resource-efficient model development.

3. Methods

Figure 1 illustrates the detection framework based on bimodal temporal feature prediction, which consists of five parts: preprocessing, audio–visual feature extraction, representation prediction, contrastive learning, and classification. First, the input raw video is subjected to preprocessing operations, including speech extraction and sequence segmentation, to generate the input video stream and audio stream. Second, an audio–visual feature extraction module performs feature learning on the signals of the two modalities to extract the dense representation of the video channel and the coherent features of the audio channel. Then, the representation prediction module performs a prediction task on adjacent video clips based on the learned bimodal features to detect temporal inconsistencies within a single modality. Meanwhile, after the features pass through the projection layer, contrastive learning is applied between the two modalities to maximize the audio–visual differences in forged videos. Finally, the classification module outputs the prediction results to identify the video's authenticity. The following sections describe each of these processes in detail.

3.1. Preprocessing

Assume a training dataset of N videos D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i represents the input video and y_i represents the video label. Any given video x_i is first segmented into M one-second clip segments {x_1^i, x_2^i, ..., x_M^i}. Subsequently, the S3FD [32] model is used for face detection on each video segment x_j^i, and the detected facial regions are cropped to form the video stream input v_j^i. Additionally, we use ffmpeg [33] to extract the audio information a_j^i from each clip segment. Therefore, for a video x_i, the inputs for the video stream and audio stream are v^i = {v_1^i, v_2^i, ..., v_M^i} and a^i = {a_1^i, a_2^i, ..., a_M^i}, respectively. As shown in Figure 1, we process pairs of adjacent video clip segments at a time, i.e., {v_j^i, a_j^i, v_{j+1}^i, a_{j+1}^i}, where v_j^i and a_j^i are used to extract the source audio–visual features, and v_{j+1}^i and a_{j+1}^i are used to extract the target audio–visual features.
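The preprocessing step can be illustrated with the short sketch below. This is a minimal illustration rather than the authors' exact pipeline: it assumes OpenCV for frame reading and the system ffmpeg binary for audio extraction, leaves the S3FD face cropping as a placeholder comment, and the helper names (`extract_audio`, `split_into_clips`, `make_quadruples`) are ours.

```python
# Minimal preprocessing sketch (assumptions: OpenCV for frame reading, the
# system ffmpeg binary for audio extraction; S3FD face cropping is left as a
# placeholder because its API depends on the chosen implementation).
import subprocess
import cv2


def extract_audio(video_path: str, wav_path: str, sr: int = 48000) -> None:
    """Extract the audio track with ffmpeg at the assumed 48 kHz sampling rate."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ar", str(sr), "-ac", "1", wav_path],
        check=True,
    )


def split_into_clips(video_path: str, clip_seconds: int = 1):
    """Split a video into 1-second frame groups; drop the incomplete tail."""
    cap = cv2.VideoCapture(video_path)
    fps = int(round(cap.get(cv2.CAP_PROP_FPS)))
    frames, clips = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)  # face cropping with S3FD would happen here
        if len(frames) == fps * clip_seconds:
            clips.append(frames)
            frames = []
    cap.release()
    return clips  # segments shorter than one second are discarded


def make_quadruples(video_clips, audio_clips):
    """Pair adjacent clips into (v_j, a_j, v_{j+1}, a_{j+1}) input quadruples."""
    return [
        (video_clips[j], audio_clips[j], video_clips[j + 1], audio_clips[j + 1])
        for j in range(len(video_clips) - 1)
    ]
```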

3.2. Audio–Visual Feature Extraction Process

Audio–visual feature extraction is divided into two parts: video stream processing and audio stream processing. Video stream processing helps the model learn rich spatio-temporal information so as to capture key features of video clips. Audio stream processing, on the other hand, helps the model to more accurately analyze and understand the audio content in order to better identify forged cues in the audio modality.

3.2.1. Video Stream Processing

Channel-Separated Convolutional Networks (CSNs) [34] serve as the feature extraction network S^v for the video stream. Although CSN builds on 3D-ResNet, it makes some key improvements over the original 3D-ResNet to better model the spatio-temporal information of video data. First, CSN uses a channel-separated organizational structure. Within each residual block, the first 1 × 1 × 1 convolution acts only on the channel dimension, while the second 3 × 3 × 3 convolution captures both spatial and temporal information. This design reduces the number of parameters and allows more focused modeling of spatial and temporal patterns. Second, CSN introduces a tunable temporal sampling rate. By configuring the temporal stride of different stages, CSN can balance the fusion of the spatial and temporal domains and progressively aggregate temporal information. Finally, CSN applies lightweight temporal modeling. Compared to directly stacking a large number of 3D convolutions across the entire network, CSN can learn the temporal representation with only a small number of 3D convolution operations, reducing the computational complexity.
We made two improvements to the original CSN network structure. First, the temporal stride is set to 1 at all stages to prevent temporal resampling and preserve the complete temporal patterns. Second, at the end of the network, global average pooling is applied only to the spatial dimensions, producing a dense per-frame feature representation of the video clip rather than a single global feature vector, which facilitates the subsequent prediction module.
S^v maps the input source video clip v_j^i ∈ R^{T×C×H×W} and target video clip v_{j+1}^i ∈ R^{T×C×H×W} into the latent sequential representations E_j^v = {e_1^v, e_2^v, ..., e_T^v} ∈ R^{T×C'} and E_{j+1}^v = {e_{T+1}^v, e_{T+2}^v, ..., e_{2T}^v} ∈ R^{T×C'}, where T is the temporal length (number of frames), C is the number of input channels, and H and W are the spatial resolutions. The number of output feature channels C' depends on the network parameter settings and is set to 2048 here. Additionally, the visual embedding features keep the same temporal dimension as the input, preserving rich information on the dynamic patterns.
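The two modifications can be sketched as follows. The CSN backbone itself is assumed to come from an external implementation (the authors' own or a video-model library), so only the spatial-only pooling head is shown, operating on a feature map whose temporal stride has been kept at 1.

```python
import torch
import torch.nn as nn


class SpatialOnlyPoolHead(nn.Module):
    """Pools only H and W so the clip keeps its temporal axis (sketch).

    Input:  (B, C', T, H, W) feature map from a CSN/3D-CNN backbone whose
            temporal strides are all set to 1 (an assumed backbone config).
    Output: (B, T, C') dense per-frame representation for the predictor.
    """

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        pooled = feats.mean(dim=(3, 4))   # average over H and W only
        return pooled.permute(0, 2, 1)    # (B, T, C')


# Example: a backbone output with C' = 2048 channels and T = 30 frames.
dummy = torch.randn(2, 2048, 30, 7, 7)
print(SpatialOnlyPoolHead()(dummy).shape)  # torch.Size([2, 30, 2048])
```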

3.2.2. Audio Stream Processing

Mel Frequency Cepstrum Coefficients (MFCCs) [35] serve as the input for the audio stream. For each 1 s audio segment a_j^i with a sampling rate of 48,000 Hz, its MFCCs are computed and converted into a heatmap of size 13 × 99 × 1. Here, the height of 13 corresponds to the 13 MFCC coefficients, the width of 99 indicates that the audio segment is divided into 99 frames, and the number of channels is 1. The preprocessing flow of the audio modality is shown in Figure 2. Overall, the audio is encoded as a heatmap image representing the MFCC value for each time step and each Mel band, which is subsequently fed into the audio stream.
Obtaining the MFCC features in an audio clip consists of six main steps as shown in Figure 3.
For the MFCC heatmap representation of the audio signal, ResNet-18 [36] is utilized as the feature extraction network S^a, with the number of input channels of the first convolutional layer set to 1. The main feature of ResNet is its use of residual modules and skip connections, which make it possible to train deeper networks without gradient vanishing and to excel at tasks such as image classification. S^a maps the source audio feature a_j^i and the target audio feature a_{j+1}^i to consistent audio representations E_j^a ∈ R^{C×1} and E_{j+1}^a ∈ R^{C×1}, where C = 512.
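A possible realization of the audio branch is sketched below, assuming librosa for the MFCC computation and torchvision's ResNet-18 with its first convolution replaced by a single-channel one. The exact window and hop parameters that yield exactly 99 frames are not stated in the paper, so the librosa defaults used here are an assumption.

```python
import librosa
import torch
import torch.nn as nn
from torchvision.models import resnet18


def mfcc_heatmap(wav_path: str, sr: int = 48000, n_mfcc: int = 13) -> torch.Tensor:
    """Compute a 13 x F MFCC 'heatmap' for a 1 s clip (F depends on the hop
    length, an assumed parameter; the paper reports 99 frames)."""
    y, _ = librosa.load(wav_path, sr=sr, duration=1.0)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return torch.from_numpy(mfcc).float().unsqueeze(0)  # (1, 13, F)


class AudioEncoder(nn.Module):
    """ResNet-18 with a single-channel stem producing a 512-d audio embedding."""

    def __init__(self):
        super().__init__()
        net = resnet18(weights=None)
        net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        net.fc = nn.Identity()  # keep the 512-d pooled feature
        self.net = net

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)      # (B, 512)
```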

3.3. Representation Prediction Process

As deepfake technology continues to advance, the quality of generated fake videos keeps increasing. The obvious artifact features of low-visual-quality videos that earlier methods relied on are therefore severely weakened, and cross-dataset detection still suffers from poor generalization. To address this, a representation prediction module based on temporal feature prediction is proposed, which detects deepfake videos by predicting the temporal feature representations of future time blocks in the bimodal setting. Instead of inferring short-term, localized temporal information frame by frame, the input video is segmented into a number of S-second-long video segments, and longer-term global feature relationships are inferred from the sequence of segments. These long-term global features across multiple time steps help to mine visual and auditory regularities and predict future sequence representations.
For audio–visual bimodal features, two types of prediction models are designed for visual and audio features, respectively. The visual prediction model P v uses a 1-block Transformer [37] encoder structure, following the design of ViT [38] as shown in Figure 4. The source video feature E j v is processed through the predictor to infer the predicted feature E p r e v as follows:
E_{pre}^{v} = P^{v}(E_j^v) = \{p_1^v, p_2^v, \ldots, p_T^v\}
The input to the audio prediction model is the output of the audio stream processing module, which is a globally consistent feature representation of the audio segment and is therefore suited to a simple feed-forward predictor. Accordingly, the audio prediction model P^a consists of a 1 × 1 convolutional layer. The source audio feature E_j^a is processed through the predictor to infer the predicted feature E_{pre}^a as follows:
E_{pre}^{a} = P^{a}(E_j^a)
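Both predictors can be written compactly in PyTorch, as in the sketch below; the feed-forward width of the Transformer block is left at the library default since the paper specifies only the 2048-d tokens and the 8 attention heads.

```python
import torch
import torch.nn as nn


class VisualPredictor(nn.Module):
    """One Transformer encoder block predicting the next clip's dense features."""

    def __init__(self, dim: int = 2048, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, e_v: torch.Tensor) -> torch.Tensor:  # (B, T, 2048)
        return self.encoder(e_v)                            # (B, T, 2048)


class AudioPredictor(nn.Module):
    """1x1 convolution over the channel dimension of the audio embedding."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, e_a: torch.Tensor) -> torch.Tensor:   # (B, 512)
        return self.conv(e_a.unsqueeze(-1)).squeeze(-1)      # (B, 512)
```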
Therefore, in each processing cycle, for the input set \{v_j^i, a_j^i, v_{j+1}^i, a_{j+1}^i\}, we obtain the embedded feature representation set \{E_j^v, E_j^a, E_{j+1}^v, E_{j+1}^a\} and the predicted audio–visual feature set \{E_{pre}^v, E_{pre}^a\}. The representation prediction module optimizes the model using a correlation-based representation prediction loss. The correlation between the predicted dense visual features and the reference target video dense features is calculated as follows:
corr^{v} = \frac{1}{T} \sum_{t=1}^{T} \langle p_t^v, e_{T+t}^v \rangle
The calculation of the correlation between the predicted audio embedding and the referenced target audio embedding is as follows:
corr^{a} = \langle E_{pre}^{a}, E_{j+1}^{a} \rangle
where \langle e_1, e_2 \rangle denotes the correlation coefficient between two vectors.
Thus, the representation prediction loss can be described as follows:
L_{pre} = -\frac{1}{N} \sum_{i} \left[ y_i \ln\left(\sigma(corr^{v} + corr^{a})\right) + (1 - y_i) \ln\left(1 - \sigma(corr^{v} + corr^{a})\right) \right]
where \sigma(x) = 1/(1 + \exp(-x)) is the sigmoid function, and y_i represents the label of the i-th video clip segment.
Since adjacent segments of real videos are temporally smooth, their predicted and reference features are highly correlated. In contrast, the incoherent facial movements in deepfake videos degrade the prediction, resulting in a lower correlation. This allows genuine and fabricated videos to be distinguished.
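A batched sketch of the correlation-based prediction loss is given below, using the Pearson correlation coefficient for ⟨·,·⟩ and assuming the convention y = 1 for real clips, which matches the intuition above; small epsilons are added for numerical stability.

```python
import torch


def pearson(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Correlation coefficient along the last dimension (broadcasts over batch/time)."""
    a = a - a.mean(dim=-1, keepdim=True)
    b = b - b.mean(dim=-1, keepdim=True)
    return (a * b).sum(-1) / (a.norm(dim=-1) * b.norm(dim=-1) + 1e-8)


def prediction_loss(pred_v, ref_v, pred_a, ref_a, y):
    """L_pre over a batch. pred_v/ref_v: (B, T, C') dense visual features,
    pred_a/ref_a: (B, C) audio embeddings, y: (B,) with 1 = real (assumed)."""
    y = y.float()
    corr_v = pearson(pred_v, ref_v).mean(dim=1)   # average over the T time steps
    corr_a = pearson(pred_a, ref_a)
    p = torch.sigmoid(corr_v + corr_a)
    return -(y * torch.log(p + 1e-8) + (1 - y) * torch.log(1 - p + 1e-8)).mean()
```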

3.4. Contrastive Learning Process

The contrastive loss function amplifies the inconsistency between the different modalities in fake videos while reducing the differences between the modalities of real videos, making real and fake videos distinguishable.
Due to the inconsistency in dimensions between the video stream features and audio stream features extracted by the audio–visual feature extraction module, it is not feasible to directly use the contrastive loss function for computation. Therefore, projection layer networks have been designed separately for video and audio features to merge and align the audio–visual features of adjacent video segments as shown in Figure 5.
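Since the paper specifies the inputs (the features of two adjacent clips per modality) and the 512-d output of the visual projection layer (Section 4.1.3) but not the internal architecture, the sketch below assumes a simple concatenate-and-MLP design; the actual projection network in Figure 5 may differ.

```python
import torch
import torch.nn as nn


class ProjectionHead(nn.Module):
    """Merge the features of two adjacent clips and project to a shared 512-d space.
    The concatenate + MLP design is an assumption; only the 512-d output is stated."""

    def __init__(self, in_dim: int, out_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * in_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, e_j: torch.Tensor, e_j1: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([e_j, e_j1], dim=-1))


# Visual features are first mean-pooled over time to a single vector (assumption),
# so both modalities end up as (B, 512) vectors comparable by Euclidean distance.
proj_v = ProjectionHead(in_dim=2048)
proj_a = ProjectionHead(in_dim=512)
```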
The Euclidean distance is employed to quantify the disparity between the audio and visual features output by the projection layers:
d = \left\| E_{pro}^{v} - E_{pro}^{a} \right\|_{2}
Consequently, the contrastive loss function can be expressed as follows:
L_{con} = \frac{1}{N} \sum_{i=1}^{N} \left[ y_i \, d^2 + (1 - y_i) \max(margin - d, 0)^2 \right]
where d is the Euclidean distance between E_{pro}^{v} and E_{pro}^{a}, and margin is a hyperparameter set to 0.5.
By training with the contrastive loss function, inconsistencies between different modalities in fabricated videos can be amplified while reducing the differences between modalities in real videos, making it possible to distinguish between genuine and fake videos.
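The margin-based contrastive loss above translates directly into a few lines of code. The sketch assumes y = 1 labels real videos, so that real audio–visual pairs are pulled together and fake pairs are pushed at least margin apart.

```python
import torch


def contrastive_loss(e_pro_v, e_pro_a, y, margin: float = 0.5):
    """Margin contrastive loss over a batch of projected (B, 512) features.
    y: (B,) with 1 for real and 0 for fake (assumed label convention)."""
    y = y.float()
    d = torch.norm(e_pro_v - e_pro_a, dim=1)                       # Euclidean distance
    loss = y * d.pow(2) + (1 - y) * torch.clamp(margin - d, min=0).pow(2)
    return loss.mean()
```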

3.5. Loss Function

The model architecture designed in this paper is trained with three loss functions: the intra-modality representation prediction loss, the inter-modality contrastive loss, and the cross-entropy loss. In theory, two originally similar samples remain similar in the low-dimensional space after dimensionality reduction, while two dissimilar samples cannot become similar in the low-dimensional space. Experimentally, a faked video exhibits latent inconsistencies relative to a real video. Therefore, the representation prediction module and the contrastive learning module, together with their corresponding loss functions, are designed.
On the respective sub-networks for the visual and audio modalities, a binary classification model based on MLP is implemented, applying the cross-entropy loss function to enhance anomaly detection capabilities under a single modality as follows:
L_{cro} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
where \hat{y}_i represents the predicted label for the i-th video clip segment and y_i is its ground-truth label.
Thus, the total loss function is the weighted sum of the prediction loss function L p r e , the contrastive loss function L c o n , and the cross-entropy loss function L c r o , which can be expressed as:
L_{total} = \lambda_1 L_{pre} + \lambda_2 L_{con} + \lambda_3 L_{cro}
where L_{total} is the final cross-modal loss function used for manipulation detection, and \lambda_1, \lambda_2, and \lambda_3 are the weight parameters used to balance the loss terms, each set to 1/3.
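With equal weights λ1 = λ2 = λ3 = 1/3, the total objective can be assembled as in this sketch, where the cross-entropy term is computed from the sigmoid output ŷ of the MLP classifier (assumed to lie in (0, 1)).

```python
import torch
import torch.nn.functional as F


def total_loss(l_pre, l_con, y_hat, y, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the prediction, contrastive and cross-entropy losses."""
    l_cro = F.binary_cross_entropy(y_hat, y.float())
    return weights[0] * l_pre + weights[1] * l_con + weights[2] * l_cro
```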
The specific algorithmic flow is shown in Algorithm 1:
Algorithm 1 Algorithmic flow of the detection framework based on temporal feature prediction.
Input:
  • x_i \in D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}, 1 \le i \le N
  • learning rate: \alpha = 0.001
  • batch size: B = 2
  • epochs: num\_iter = 150
Output:
  • Trained models: \theta_{net}, \theta_{pre}, \theta_{pro}, \theta_{cro}; S^a, S^v, S_{pre}^v, S_{pre}^a, S_{pro}, S_{cls}
 1: for epoch = 1 to num\_iter do
 2:   for (v_j^i, a_j^i, v_{j+1}^i, a_{j+1}^i) in x_i do
 3:     E_j^v, E_{j+1}^v := S^v(v_j^i, v_{j+1}^i)
 4:     E_j^a, E_{j+1}^a := S^a(a_j^i, a_{j+1}^i)
 5:     E_{pre}^v(j) := S_{pre}^v(E_j^v)
 6:     E_{pre}^a(j) := S_{pre}^a(E_j^a)
 7:     E_{pro}^v(j) := S_{pro}^v(E_j^v, E_{j+1}^v)
 8:     E_{pro}^a(j) := S_{pro}^a(E_j^a, E_{j+1}^a)
 9:     \hat{y}_j^i := S_{cls}(E_{pro}^v(j), E_{pro}^a(j))
10:   end for
11:   g_{\theta_{pre}} \leftarrow \nabla_{\theta_{pre}} \frac{1}{B} \sum_{b=1}^{B} \sum_{j=1}^{len(x_i)-1} L_{pre}(E_{j+1}^v, E_{j+1}^a, E_{pre}^v(j), E_{pre}^a(j))
12:   g_{\theta_{pro}} \leftarrow \nabla_{\theta_{pro}} \frac{1}{B} \sum_{b=1}^{B} \sum_{j=1}^{len(x_i)-1} L_{con}(E_{pro}^v(j), E_{pro}^a(j))
13:   g_{\theta_{cro}} \leftarrow \nabla_{\theta_{cro}} \frac{1}{B} \sum_{b=1}^{B} \sum_{j=1}^{len(x_i)-1} L_{cro}(y_j^i, \hat{y}_j^i)
14:   \theta_{net} \leftarrow \theta_{net} - \alpha \cdot Adam(\theta_{net}, sum(g_{\theta_{pre}}, g_{\theta_{pro}}, g_{\theta_{cro}}))
15:   if \theta_{net} has converged then
16:     break
17:   end if
18: end for
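Algorithm 1 corresponds to a standard PyTorch training loop. The sketch below wires together the modules and loss functions sketched earlier in this section with the hyperparameters from Section 4.1.3; the data loader, the MLP classifier head, and the temporal mean-pooling before the visual projection are assumptions made to keep the example short.

```python
import itertools
import torch
import torch.nn as nn

# Assumed components from the sketches above: a CSN backbone with the
# SpatialOnlyPoolHead as `video_encoder`, `audio_encoder`, VisualPredictor
# `pred_v`, AudioPredictor `pred_a`, ProjectionHead instances `proj_v`/`proj_a`,
# the loss functions `prediction_loss`, `contrastive_loss`, `total_loss`, and a
# DataLoader `loader` yielding (quadruple, label) batches.
classifier = nn.Sequential(  # simple MLP head over the concatenated projections
    nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 1), nn.Sigmoid()
)
params = itertools.chain(
    video_encoder.parameters(), audio_encoder.parameters(),
    pred_v.parameters(), pred_a.parameters(),
    proj_v.parameters(), proj_a.parameters(), classifier.parameters(),
)
optimizer = torch.optim.Adam(params, lr=1e-3, betas=(0.9, 0.999), weight_decay=1e-4)

for epoch in range(150):                              # early stopping omitted here
    for (v_j, a_j, v_j1, a_j1), y in loader:          # adjacent-clip quadruples
        e_v, e_v1 = video_encoder(v_j), video_encoder(v_j1)   # (B, T, 2048)
        e_a, e_a1 = audio_encoder(a_j), audio_encoder(a_j1)   # (B, 512)
        l_pre = prediction_loss(pred_v(e_v), e_v1, pred_a(e_a), e_a1, y)
        # temporal mean pooling of visual features before projection (assumption)
        p_v = proj_v(e_v.mean(dim=1), e_v1.mean(dim=1))       # (B, 512)
        p_a = proj_a(e_a, e_a1)                               # (B, 512)
        l_con = contrastive_loss(p_v, p_a, y)
        y_hat = classifier(torch.cat([p_v, p_a], dim=1)).squeeze(1)
        loss = total_loss(l_pre, l_con, y_hat, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```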

4. Experiments

4.1. Experimental Setup

4.1.1. Dataset

Currently, most studies have chosen FakeAVCeleb [39] as the training and testing dataset, and this study also evaluates the performance of the proposed audio–video detector on the FakeAVCeleb dataset.
FakeAVCeleb, a multimodal deepfake dataset containing 500 real videos extracted from the VoxCeleb2 corpus, is used as a base set to generate about 20,000 deepfake videos by various deepfake generation methods. Deepfake video frames are generated using Faceswap and FSGAN, and deepfake audio is synthesized using Real-Time Voice Cloning (RTVC). Then, Wav2Lip is applied to synchronize the video frames with the audio. Therefore, the FakeAVCeleb dataset contains three different types of audio–visual forgery cases, namely FVFA (fake video fake audio), FVRA (fake video real audio) and RVFA (real video fake audio).
In the video preprocessing stage, the duration of each video clip segment is set to 1 s, and any segment shorter than 1 s at the end of a video is discarded. The final number of video clips obtained is shown in Table 1. For each category, the clips are divided into training and validation sets at a ratio of 7:3.

4.1.2. Evaluation Metrics

Accuracy (ACC), Receiver Operating Characteristic Curve (ROC), and Area Under Curve (AUC) are the evaluation metrics used.
ACC is the ratio of the number of samples correctly classified by the classifier to the total number of samples and is calculated as:
ACC = \frac{TP + TN}{TP + TN + FP + FN}
where TP represents True Positives, TN represents True Negatives, FP represents False Positives, and FN represents False Negatives. In the binary classification task, the prediction takes only the values 0 and 1, corresponding to the True and False categories. Accordingly, TP means both the label and the prediction are true; TN means both are false; FP means the label is false but the prediction is true; and FN means the label is true but the prediction is false.
AUC is the area under the ROC (Receiver Operating Characteristic) curve and is also commonly used to evaluate the performance of classification models. The ROC curve plots the True Positive Rate (TPR) on the vertical axis against the False Positive Rate (FPR) on the horizontal axis, showing the classifier's performance under different thresholds, where TPR = TP/(TP + FN) and FPR = FP/(FP + TN).
The value of AUC ranges from 0 to 1. The closer the ROC curve gets to the upper left corner, the closer the AUC value is to 1, indicating better classifier performance.
These metrics are calculated at the video level using the real/fake video labels, with each video's score obtained by averaging the scores of all segments from that video. The results of the other methods are quoted directly for comparison.
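Given per-segment scores, the video-level metrics can be computed as in the sketch below, which assumes scikit-learn for the AUC and a 0.5 decision threshold for the accuracy (the threshold is not stated in the paper).

```python
from collections import defaultdict

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score


def video_level_metrics(video_ids, segment_scores, video_labels):
    """Average segment scores per video, then compute ACC (at an assumed 0.5
    cutoff) and AUC against the video-level real/fake labels."""
    scores = defaultdict(list)
    for vid, s in zip(video_ids, segment_scores):
        scores[vid].append(s)
    vids = sorted(scores)
    y_score = np.array([np.mean(scores[v]) for v in vids])
    y_true = np.array([video_labels[v] for v in vids])
    acc = accuracy_score(y_true, (y_score >= 0.5).astype(int))
    return acc, roc_auc_score(y_true, y_score)
```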

4.1.3. Implementation Details

All experiments were conducted on an Ubuntu 20.04 system using PyTorch and an NVIDIA RTX 3090 with 24 GB of VRAM. Each video clip segment is of size 30 × 3 × 224 × 224. The output dimensions are 30 × 2048 for the video stream and 512 × 1 for the audio stream. Each input token of the video prediction model has a length of 2048, and its multi-head self-attention uses 8 heads, each with an embedding dimension of 256.
The output dimension of the visual projection layer is 512 × 1 . The proposed model is optimized using the Adam optimizer, with β 1 and β 2 set to 0.9 and 0.999 respectively, an initial learning rate of 0.001, and a weight decay of 0.0001. The batch size is 2, the number of epochs is 150, and early stopping is used to store the best model weights.
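Early stopping with best-weight bookkeeping can be implemented as a thin wrapper around the training loop, as in this sketch; the patience value and the AUC-based selection criterion are assumptions, since the paper only states that the best model weights are stored.

```python
import torch

# train_one_epoch, evaluate_on_validation, and model are assumed routines/objects
# provided by the surrounding training code; patience = 10 is an assumption.
best_auc, patience, bad_epochs = 0.0, 10, 0
for epoch in range(150):
    train_one_epoch()
    val_auc = evaluate_on_validation()
    if val_auc > best_auc:
        best_auc, bad_epochs = val_auc, 0
        torch.save(model.state_dict(), "best_model.pt")  # keep the best weights
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break                                         # stop early
```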

4.2. Comparative Experiment

To evaluate the performance of our method, we compare it with other unimodal and multimodal deepfake detection methods, and the experimental results are shown in Table 2.
First, compared with unimodal visual deepfake detection methods, the method proposed in this paper significantly outperforms Xception, with both ACC and AUC improved by more than 15%. Xception, a classical deepfake detection method, proves insufficient against the new type of audio–video forged videos despite its excellent performance in earlier studies. MAT employs a multi-attention network for fine-grained forgery classification, but its performance is still unsatisfactory, which indicates the limitations of unimodal detection methods when dealing with more complex forged videos. LipForensics detects forgery traces by analyzing high-level semantic irregularities in lip movements, but since it does not incorporate audio features, it cannot detect lip-movement abnormalities in forged videos whose lips have been synchronized with the audio, so its detection ability is limited.
Second, the method proposed in this paper also performs well against multimodal deepfake detection methods and achieves state-of-the-art performance. The Emotions method identifies authenticity by analyzing differences in the perceived emotions expressed by the audio and visual modalities, while AVFakeNet and AvoiD-DF acquire multimodal high-level features by designing different spatio-temporal encoding networks and inter-modal attention mechanisms to fuse these features and thus detect audio–visual discordance. However, these methods mainly focus on inter-modal incongruence, whereas the method in this paper considers not only inter-modal incongruence but also the inconsistency of intra-modal temporal features, integrating the two kinds of forgery anomalies to achieve the best detection results. By combining audio–visual discordance and temporal feature inconsistency, the proposed method demonstrates higher accuracy in deepfake detection and makes significant progress.

4.3. Ablation Experiment

We conducted ablation experiments on audio–visual modalities, considering three scenarios: only visual, only audio, and bimodal. In these, the models with only a single modality do not consider the contrastive loss during training (i.e., λ 1 = λ 3 = 1 / 2 , λ 2 = 0 ). The results are shown in Table 3, and the corresponding ROC curves are presented in Figure 6.
Based on the data in Table 3, the detection model containing only the video stream (Only-Visual) achieves an ACC and AUC of 79.87% and 80.54%, respectively, whereas the model containing only the audio stream (Only-Audio) achieves 71.35% and 72.26%. This indicates that, in the single-modality setting, the visual modality outperforms the audio modality. This is mainly because the visual modality uses a more complex 3D convolutional neural network and a Transformer-based predictor to capture temporal feature inconsistencies, whereas the audio feature extraction is comparatively simple. When the audio and video modalities are combined (Visual-Audio), the model achieves an ACC of 84.33% and an AUC of 89.91%, significantly better than either single modality. This shows that combining the two modalities improves deepfake detection performance; as single-modality detection becomes increasingly difficult for forged videos of higher visual quality, capturing and learning the inconsistency between the audio and video modalities is an effective way to improve detection accuracy.
To further investigate the effect of our proposed three loss functions, six different combinations were designed for ablation experiments. The results are shown in Table 4 and Figure 7. According to the results of ablation experiments, the impact of using different loss functions on the overall detection performance of the model is analyzed as follows:
(1) When the cross-entropy loss function is used singly, the detection performance of the model is relatively poor, and both ACC and AUC indexes are at a low level. This is because the cross-entropy loss function mainly supervises the classification task and may not be sensitive and fine enough for complex tasks such as deepfake video detection.
(2) When the contrast loss function is used alone, the model performance is improved; in particular, the AUC metrics are improved by nearly two percentage points compared to using only the cross-entropy loss. The contrast loss function can help the model better capture and learn the inconsistency between different modalities, thus improving the accuracy of detection.
(3) When the predictive loss function is used singly, the model performance gains further improvement, with the ACC and AUC metrics reaching 80.39% and 85.47%, respectively, which are significantly better than using the other two loss functions singly. The predictive loss function is able to better capture timing information by predicting the intrinsic relationship between current and future video clips, which is important for detecting video forgery.
(4) When combining the cross-entropy loss with either the contrastive loss or the prediction loss, the model performance improves, and the effect of the prediction loss is more significant. This indicates that the classification supervision provided by the cross-entropy loss complements the ability of the contrastive loss to capture cross-modal differences and of the prediction loss to capture temporal information.
(5) When using the three loss functions simultaneously, the model performance is the best, with an ACC of 84.33% and an AUC as high as 89.91%, which fully verifies that the effective fusion of the three loss functions can maximize their respective advantages and achieve the best results in detecting deeply forged videos.
Overall, the prediction loss is the most critical for capturing temporal information in the video, the cross-entropy loss provides supervision for the basic classification task, and the contrastive loss, which learns the differences between the audio and visual modalities, also plays a role; fusing the three so that they complement each other greatly improves deepfake detection performance.

5. Conclusions

In conclusion, this paper proposes a deep forgery detection scheme based on bimodal temporal feature prediction, which combines the audio and visual modalities and detects high-quality deepfake videos by exposing both the inter-modal discrepancies and the intra-modal temporal inconsistencies in forged videos. Compared with existing schemes, it addresses the challenge that single-modality detection is insufficient to cope with the wide variety of forged videos found online. However, the scheme also has shortcomings. The design of the audio stream is relatively simple: only MFCC features, which are common in speech detection, are considered as the audio input, so the subsequent prediction based on the audio information may not achieve the expected results. Another limitation is that the scheme does not achieve real-time detection. Subsequent research can therefore consider different audio input forms and try to retain their temporal integrity when designing the network, which can then be used for prediction tasks through the Transformer structure to enable more effective audio modality detection. In addition, future deepfake detection research can improve efficiency by simplifying models and optimizing algorithms to enhance the timeliness of detection.
Despite these limitations, our approach still has great potential for application in domains that require robust verification of the authenticity of digital media, such as social platforms, news dissemination, and content review and authentication in the judicial domain.

Author Contributions

Conceptualization, Y.G. and Y.Z.; methodology, P.Z.; software, X.W.; validation, Y.G., X.W. and Y.Z.; formal analysis, Y.M.; data curation, X.W.; writing—original draft preparation, X.W.; writing—review and editing, P.Z.; supervision, Y.G.; funding acquisition, Y.G. and Y.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fundamental Research Funds for the Central Universities (Grant No. 3282024052), the China Postdoctoral Science Foundation (Grant No. 2019M650608), and the National Social Science Foundation of China (Grant No. 19ZDA127).

Informed Consent Statement

Not applicable.

Data Availability Statement

This research employed publicly available datasets for its experimental studies. The FakeAVCeleb datasets can be obtained by visiting https://github.com/DASH-Lab/FakeAVCeleb (accessed on 2 March 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CSN    Channel-Separated Convolutional Networks
MFCC   Mel Frequency Cepstrum Coefficients
FFT    Fourier Transform
FVFA   fake video fake audio
FVRA   fake video real audio
RVFA   real video fake audio
ACC    Accuracy
ROC    Receiver Operating Characteristic Curve
AUC    Area Under Curve

References

  1. Thies, J.; Zollhöfer, M.; Nießner, M. Deferred neural rendering: Image synthesis using neural textures. ACM Trans. Graph. 2019, 38, 66. [Google Scholar] [CrossRef]
  2. Jiang, L.; Li, R.; Wu, W.; Qian, C.; Loy, C.C. Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 13–19 June 2020; pp. 2889–2898. [Google Scholar] [CrossRef]
  3. Deepfakes. Available online: https://github.com/deepfakes/faceswap (accessed on 2 September 2020).
  4. Liu, H.; Chen, Z.; Yuan, Y.; Mei, X.; Liu, X.; Mandic, D.; Plumbley, M.D. Audioldm: Text-to-audio generation with latent diffusion models. arXiv 2023, arXiv:2301.12503. [Google Scholar]
  5. Su, K.; Liu, X.; Shlizerman, E. Audeo: Audio generation for a silent performance video. Adv. Neural Inf. Process. Syst. 2020, 33, 3325–3337. [Google Scholar]
  6. Li, L.; Bao, J.; Zhang, T.; Yang, H.; Chen, D.; Wen, F.; Guo, B. Face X-ray for more general face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 13–19 June 2020; pp. 5001–5010. [Google Scholar]
  7. Zhao, T.; Xu, X.; Xu, M.; Ding, H.; Xiong, Y.; Xia, W. Learning self-consistency for deepfake detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, Montreal, BC, Canada, 11–17 October 2021; pp. 15023–15033. [Google Scholar]
  8. Qian, Y.; Yin, G.; Sheng, L.; Chen, Z.; Shao, J. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In Computer Vision—ECCV 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 86–103. [Google Scholar]
  9. Liu, H.; Li, X.; Zhou, W.; Chen, Y.; He, Y.; Xue, H.; Yu, N. Spatial-phase shallow learning: Rethinking face forgery detection in frequency domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Nashville, TN, USA, 20–25 June 2021; pp. 772–781. [Google Scholar]
  10. Li, J.; Xie, H.; Li, J.; Wang, Z.; Zhang, Y. Frequency-aware discriminative feature learning supervised by single-center loss for face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Nashville, TN, USA, 20–25 June 2021; pp. 6458–6467. [Google Scholar]
  11. Gu, Q.; Chen, S.; Yao, T.; Chen, Y.; Ding, S.; Yi, R. Exploiting fine-grained face forgery clues via progressive enhancement learning. Proc. Aaai Conf. Artif. Intell. 2022, 36, 735–743. [Google Scholar] [CrossRef]
  12. Chen, S.; Yao, T.; Chen, Y.; Ding, S.; Li, J.; Ji, R. Local relation learning for face forgery detection. Proc. Aaai Conf. Artif. Intell. 2021, 35, 1081–1088. [Google Scholar] [CrossRef]
  13. Yang, Z.; Liang, J.; Xu, Y.; Zhang, X.Y.; He, R. Masked relation learning for deepfake detection. IEEE Trans. Inf. Forensics Secur. 2023, 18, 1696–1708. [Google Scholar] [CrossRef]
  14. Haliassos, A.; Vougioukas, K.; Petridis, S.; Pantic, M. Lips don’t lie: A generalisable and robust approach to face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Nashville, TN, USA, 20–25 June 2021; pp. 5039–5049. [Google Scholar]
  15. Bai, W.; Liu, Y.; Zhang, Z.; Li, B.; Hu, W. Aunet: Learning relations between action units for face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 24709–24719. [Google Scholar]
  16. Muppalla, S.; Jia, S.; Lyu, S. Integrating audio-visual features for multimodal deepfake detection. arXiv 2023, arXiv:2310.03827. [Google Scholar]
  17. Salvi, D.; Liu, H.; Mandelli, S.; Bestagini, P.; Zhou, W.; Zhang, W.; Tubaro, S. A robust approach to multimodal deepfake detection. J. Imaging 2023, 9, 122. [Google Scholar] [CrossRef] [PubMed]
  18. Yang, W.; Zhou, X.; Chen, Z.; Guo, B.; Ba, Z.; Xia, Z.; Ren, K. Avoid-df: Audio-visual joint learning for detecting deepfake. IEEE Trans. Inf. Forensics Secur. 2023, 18, 2015–2029. [Google Scholar] [CrossRef]
  19. Raza, M.A.; Malik, K.M. Multimodaltrace: Deepfake detection using audiovisual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 993–1000. [Google Scholar]
  20. Katamneni, V.S.; Rattani, A. MIS-AVoiDD: Modality Invariant and Specific Representation for Audio-Visual Deepfake Detection. arXiv 2023, arXiv:2310.02234. [Google Scholar]
  21. Zhou, Y.; Lim, S.N. Joint audio-visual deepfake detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 14800–14809. [Google Scholar]
  22. Ilyas, H.; Javed, A.; Malik, K.M. AVFakeNet: A unified end-to-end Dense Swin Transformer deep learning model for audio-visual deepfakes detection. Appl. Soft Comput. 2023, 136, 110124. [Google Scholar]
  23. Mittal, T.; Bhattacharya, U.; Chandra, R.; Bera, A.; Manocha, D. Emotions don’t lie: An audio-visual deepfake detection method using affective cues. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2823–2832. [Google Scholar]
  24. Chugh, K.; Gupta, P.; Dhall, A.; Subramanian, R. Not made for each other-audio-visual dissonance-based deepfake detection and localization. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 439–447. [Google Scholar]
  25. Haliassos, A.; Mira, R.; Petridis, S.; Pantic, M. Leveraging real talking faces via self-supervision for robust forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 14950–14962. [Google Scholar]
  26. Yu, Y.; Liu, X.; Ni, R.; Yang, S.; Zhao, Y.; Kot, A.C. PVASS-MDD: Predictive visual-audio alignment self-supervision for multimodal deepfake detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 6926–6936. [Google Scholar] [CrossRef]
  27. Cheng, H.; Guo, Y.; Wang, T.; Li, Q.; Chang, X.; Nie, L. Voice-face homogeneity tells deepfake. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 20, 76. [Google Scholar] [CrossRef]
  28. Cozzolino, D.; Pianese, A.; Nießner, M.; Verdoliva, L. Audio-visual person-of-interest deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops 2023, Vancouver, BC, Canada, 18–22 June 2023; pp. 943–952. [Google Scholar]
  29. Umirzakova, S.; Whangbo, T.K. Detailed feature extraction network-based fine-grained face segmentation. Knowl.-Based Syst. 2022, 250, 109036. [Google Scholar] [CrossRef]
  30. Dam, T.; Dharavath, S.B.; Alam, S.; Lilith, N.; Chakraborty, S.; Feroskhan, M. AYDIV: Adaptable Yielding 3D Object Detection via Integrated Contextual Vision Transformer. arXiv 2024, arXiv:2402.07680. [Google Scholar]
  31. Lee, G.Y.; Dam, T.; Ferdaus, M.M.; Poenar, D.P.; Duong, V.N. Watt-effnet: A lightweight and accurate model for classifying aerial disaster images. IEEE Geosci. Remote. Sens. Lett. 2023, 20, 6005205. [Google Scholar] [CrossRef]
  32. Zhang, S.; Zhu, X.; Lei, Z.; Shi, H.; Wang, X.; Li, S.Z. S3fd: Single shot scale-invariant face detector. In Proceedings of the IEEE International Conference on Computer Vision 2017, Venice, Italy, 22–29 October 2017; pp. 192–201. [Google Scholar]
  33. Newmarch, J.; Newmarch, J. Ffmpeg/libav. In Linux Sound Programming; Springer: Berlin/Heidelberg, Germany, 2017; pp. 227–234. [Google Scholar]
  34. Tran, D.; Wang, H.; Torresani, L.; Feiszli, M. Video classification with channel-separated convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2019, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5552–5561. [Google Scholar]
  35. Muda, L.; Begam, M.; Elamvazuthi, I. Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv 2010, arXiv:1003.4083. [Google Scholar]
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  37. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
  38. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  39. Khalid, H.; Tariq, S.; Kim, M.; Woo, S.S. FakeAVCeleb: A novel audio-video multimodal deepfake dataset. arXiv 2021, arXiv:2108.05080. [Google Scholar]
  40. Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Nießner, M. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2019, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1–11. [Google Scholar]
  41. Zhao, H.; Zhou, W.; Chen, D.; Wei, T.; Zhang, W.; Yu, N. Multi-attentional deepfake detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Nashville, TN, USA, 20–25 June 2021; pp. 2185–2194. [Google Scholar]
Figure 1. An overview of the audio–visual joint deepfake detection model architecture based on temporal feature prediction.
Figure 2. Audio modal preprocessing process.
Figure 3. MFCC feature extraction process.
Figure 4. Illustration of the structure of the visual prediction module based on 1-block Transformer.
Figure 5. Illustration of the projection layer network aligning audio–visual features for contrastive learning.
Figure 6. Component ablation experiment ROC curve comparison plot.
Figure 7. Comparison of ROC curves for loss function ablation experiments.
Table 1. Statistics for the FakeAVCeleb dataset.

Categories    Number of Original Videos    Number of Edited Clips
RVRA          500                          3k
RVFA          1000                         6k
FVRA          9709                         58k
FVFA          10,857                       65k
Table 2. Comparative experimental results on the FakeAVCeleb dataset.

Methods              Modality       ACC (%)    AUC (%)
Xception [40]        visual         67.90      70.50
LipForensics [14]    visual         80.10      82.40
MAT [41]             visual         77.60      79.30
MDS [24]             audio–visual   82.80      86.50
Emotions [23]        audio–visual   78.10      79.80
AVFakeNet [22]       audio–visual   78.40      83.40
AvoiD-DF [18]        audio–visual   83.70      89.20
Ours                 audio–visual   84.33      89.91
The highest results are highlighted in bold.
Table 3. Ablation experimental results of different audio–visual modalities.

No.    Ablation Model    ACC (%)    AUC (%)
1      only visual       79.87      80.54
2      only audio        71.35      72.26
3      audio–visual      84.33      89.91
The highest results are highlighted in bold.
Table 4. Ablation experimental results with different loss functions.

No.    Loss Function      λ                                    ACC (%)    AUC (%)
1      only-L_cro         [λ1, λ2, λ3] = [0, 0, 1]             67.90      70.50
2      only-L_con         [λ1, λ2, λ3] = [0, 1, 0]             80.10      82.40
3      only-L_pre         [λ1, λ2, λ3] = [1, 0, 0]             77.60      79.30
4      L_cro + L_con      [λ1, λ2, λ3] = [0, 1/2, 1/2]         82.80      86.50
5      L_cro + L_pre      [λ1, λ2, λ3] = [1/2, 0, 1/2]         78.10      79.80
6      All                [λ1, λ2, λ3] = [1/3, 1/3, 1/3]       84.33      89.91
The highest results are highlighted in bold.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
