Article

SupCon-MPL-DP: Supervised Contrastive Learning with Meta Pseudo Labels for Deepfake Image Detection

by Kyeong-Hwan Moon, Soo-Yol Ok and Suk-Hwan Lee *
1 Department of Computer Engineering, Dong-A University, Busan 49315, Republic of Korea
2 Department of Information Convergence Engineering, Pusan National University, Busan 46241, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(8), 3249; https://doi.org/10.3390/app14083249
Submission received: 8 March 2024 / Revised: 6 April 2024 / Accepted: 6 April 2024 / Published: 12 April 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract: Recently, there has been considerable research on deepfake detection. However, most existing methods struggle to adapt to new generative models in unknown domains. In addition, the emergence of generative models capable of producing and editing high-quality images, such as diffusion, consistency, and LCM models, poses a challenge for traditional deepfake training models. These advancements highlight the need to adapt and evolve existing deepfake detection techniques to counter the threats posed by sophisticated image manipulation technologies. In this paper, our objective is to detect deepfake videos in unknown domains using unlabeled data. Specifically, our proposed approach combines Meta Pseudo Labels (MPL) with supervised contrastive learning, termed SupCon-MPL, allowing the model to be trained on unlabeled images. MPL trains a teacher model and a student model simultaneously, where the teacher model generates pseudo labels that are used to train the student model. This method aims to enhance the adaptability and robustness of deepfake detection systems against emerging unknown domains. Supervised contrastive learning utilizes labels to pull samples of the same class together in the feature space while pushing samples of dissimilar classes apart. This helps the model learn features from a diverse set of deepfake images and consequently improves deepfake detection in unknown domains. With ResNet50 as the backbone, SupCon-MPL improved accuracy by 1.58% over traditional MPL in known-domain detection, by 1.32% in same-generation unknown-domain detection, and by 8.74% in post-generation unknown-domain detection.

1. Introduction

Recently, with the advancement of generative artificial intelligence models [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], deepfakes have become increasingly similar to real images/videos, making them difficult to distinguish. Deepfakes can be broadly categorized into three generations based on the evolution of generative models. First-generation deepfake generative models [1,2,3,4,5] typically generate simple, low-resolution images/videos based on probability distributions or synthesize multiple images/videos by exploiting features in tasks such as Face2Face and FaceSwap. In particular, generative adversarial network (GAN)-based models such as CGAN [2], WGAN [4], and WGAN-GP [5], as well as autoencoder-based models like the VAE [6] and conditional VAE [7], have enabled the generation of various deepfake images/videos. However, first-generation deepfakes often exhibit noticeable artifacts that can be discerned by the human eye. With the transition to second-generation deepfake generative models [6,7,8], there has been progress in generating high-resolution deepfake images that are harder to distinguish than first-generation ones, along with performance improvements in various tasks. In particular, second-generation models like StyleGAN, proposed by T. Karras et al. [8], produce deepfake images that are difficult for the human eye to distinguish, aside from minor flaws such as hair artifacts. Second-generation deepfakes can be generated using tools such as DeepFaceLab [10], DeepSwap [11], Synthesia [12], and others. Finally, third-generation deepfake generative models [9,13,14,15] produce images/videos that are even more flexible and harder to distinguish than those of second-generation models across various tasks. In particular, Stable Diffusion, proposed by R. Rombach et al. [13], is currently used to generate various human and artwork images, raising copyright and human rights concerns. Furthermore, the consistency model proposed by Y. Song et al. [15] has enabled state-of-the-art generative model training at a lower cost by reducing the extensive iteration that previous diffusion models required to restore original images from noise. As deepfakes become increasingly difficult to distinguish from real images/videos, they are being exploited in various criminal activities.
To address issues caused by deepfakes, methods have been proposed that identify flaws in facial landmarks introduced when generative models create images, aiding the detection of deepfakes [16,17,18,19]. Meanwhile, recent advances in single-model deepfake detection have improved the detection of deepfake videos and images. D.A. Coccomini et al. [20] enhanced detection performance by combining EfficientNet [21] and Vision Transformer [22] in a single model. In other words, existing deepfake detection models [16,17,18,19,20,21,22,23] verify flaws in facial landmarks during preprocessing and construct large models for flexible predictions.
Previous studies have primarily focused on detection performance for labeled known-domain tasks. However, deepfake generation models are evolving rapidly, and diverse models keep emerging even within the same generation. Therefore, detecting deepfake images in unknown domains is also crucial. A few studies have proposed generalized deepfake detection models using techniques such as contrastive learning and meta learning [24,25,26,27,28,29,30,31,32,33,34,35,36].
In this paper, we propose SupCon-MPL, a combination of Meta Pseudo Labels (MPL) [37] with supervised contrastive learning (SupCon) [38] that further trains the model with unlabeled images/videos while enhancing the model's ability to generalize and distinguish deepfakes in unknown domains. The proposed SupCon-MPL uses the basic structure of MPL, in which two models, a teacher and a student, are trained simultaneously and influence each other during training. The teacher model constructs pseudo labels for unlabeled images and passes them to the student model. Through this approach, the student model learns from unlabeled data, making effective training possible with limited labeled data. Furthermore, during training, we apply the supervised contrastive loss (SupConLoss) [38] to the encoder of each model, enabling contrastive representation learning and thereby inducing generalized model training.
The performance evaluation consists of two parts: model validation experiments and a scenario-based deepfake detection experiment. In the model validation experiments, we utilized data from five domains within FaceForensics++ [39]; we evaluated detection performance in labeled known domains by combining the data in various ways and assessed generalized detection performance in unknown domains. The scenario-based experiment involves training the model with first-generation deepfake datasets (FaceForensics++ [39], DFDC [40], Celeb-DF [41]) and evaluating detection performance on first- and second-generation unknown deepfake datasets (NeuralTextures [39], StyleGAN [8]). The experimental results showed that SupCon-MPL achieved improvements of 1.58%, 1.32%, and 8.74% over the baseline MPL model in the respective evaluation scenarios. The main contributions of this paper are as follows:
(1)
The proposed method enables additional training with unlabeled data. In particular, while the two models are trained simultaneously, the student model infers information about unlabeled data from the teacher model and provides feedback, so the additional training is less biased toward a specific model. Ultimately, the proposed method enhances deepfake detection performance by enabling additional training with a large amount of unlabeled data.
(2)
Our method enables the training of a generalized deepfake detection model through contrastive learning. We improved the generalized detection performance on unknown data, which was previously low in the Meta Pseudo Labels-based deepfake detection model [24], by adding contrastive learning.
(3)
Our model exhibited higher deepfake detection performance than various generalized deepfake detection models [24,31,34,35] in our comparison. The experimental results demonstrate that our model outperforms existing deepfake detection models, showing robust detection across diverse labeled datasets and even unknown generational deepfakes.

2. Related Works

As deepfake generation models advance, various detection methods have also been researched. A common approach in deepfake detection is to look for flaws in facial images [16,17,18,19]. However, the continual development of new generative models means that detection models cannot be trained on data from every generative model. To address this problem, a few studies have explored training generalized deepfake detection models [24,25,26,27,28,29,30,31,32,33,34,35,36].

2.1. Generalized Deepfake Detection

The generalization of deepfake detection refers to the ability to detect deepfake videos generated not only by the models used during training but also by unseen or new generative models. In other words, as generative models progress from GANs and VAEs to diffusion and consistency models, the main goal of generalized deepfake detection techniques is to detect deepfakes generated by various new generative models simultaneously. Recently, research has addressed detecting unknown deepfake videos from both the data and the training perspectives.
From the data perspective, SBI [29] and OST [30] enhanced the generalization of deepfake detection by synthesizing additional training data, combining original images from each generative model with various other images and using them selectively. From the training perspective, A. Jain et al. [25] utilized datasets from Google and Jigsaw, FaceForensics++ [39], Celeb-DF [41], Deepfake-TIMIT [42], and their own database DF-Mobio to train a generalized deepfake detection model using contrastive representation learning across various domains. A. Nadimpalli et al. [26] proposed a hybrid technique combining supervised learning and reinforcement learning: during training, a reinforcement learning agent selects the top-k augmentations that most improve performance when training a convolutional neural network (CNN), and these are used for testing, enabling the training of a generalized deepfake detection model. In our previous work [24], we applied the meta-learning technique Meta Pseudo Labels to deepfake training after domain splitting of the data, which yielded higher performance for the same model, and we conducted experiments with various CNN-based image classification models, including EfficientNet [21], ResNet [50], ResNext [51], and WideResNet [52].

2.2. Contrastive Representation Learning in Deepfake Detection

Research is currently being conducted on applying contrastive representation learning (CRL) to train generalized deepfake detection models. CRL learns features in the feature space that bring a given image close to images from the same domain (positives) while separating it from images of different domains (negatives). C.-C. Hsu et al. [27] used contrastive loss [43] to train the encoder and then trained a classifier to generalize the discrimination of deepfake images generated by various GAN-based models. S. Fung et al. [36] trained the encoder with unsupervised CRL on image pairs formed by applying random augmentations to the same image, then trained the classifier on labeled images to build a generalized deepfake detection model. Y. Xu et al. [35] addressed the issue that conventional CRL techniques do not utilize the label information of deepfake images by applying supervised contrastive learning (SupCon) [38], which uses label information. However, CRL requires a large amount of data, especially a significant number of negative samples. In this paper, we perform CRL using SupCon [38] while combining the SupCon model with the meta-learning method MPL [37]. This also allows additional CRL training on unlabeled data, so that, even with the same labeled data, a more generalized deepfake detection model can be trained compared with conventional Meta Pseudo Labels.
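To make the pair construction concrete, the following minimal sketch (ours, not from the cited works; all names are illustrative) builds the two randomly augmented views of one image that form a positive pair in CRL:

```python
from torchvision import transforms

# Two independently sampled augmentations of the same image form a positive
# pair; differently labeled images in the batch act as negatives.
two_view = transforms.Compose([
    transforms.RandomResizedCrop(64),
    transforms.RandAugment(),   # random augmentation policy
    transforms.ToTensor(),
])

def make_views(pil_image):
    # Each call re-samples the augmentation, so the two views differ.
    return two_view(pil_image), two_view(pil_image)
```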

2.3. Meta Pseudo Labels

Meta Pseudo Labels (MPL) [37] trains a model using unlabeled images and, for the same model on an image classification task, has shown improved performance over conventional training. MPL gained significant attention by achieving over 90% top-1 accuracy on ImageNet [44] classification, a significant milestone. Figure 1a shows that MPL facilitates the learning of the teacher model through feedback from the student model, improving on conventional techniques such as knowledge distillation [45] or Noisy Student [46], where the teacher model simply passes information to the student model. This addresses the inadequate learning of the student model when the teacher model performs poorly. The training process of MPL is shown in Figure 1b. The student model learns from the pseudo labels inferred by the teacher model and then passes a feedback value about its learning back to the teacher model. The teacher model learns from the labeled loss on the labeled data, the UDA loss [47], and the MPL loss on the unlabeled data weighted by the student's feedback. However, MPL consumes substantial computing resources because the two models are trained simultaneously.
From the perspective of training a generalized deepfake detection model, MPL can enhance the performance of generalized deepfake detection by enabling additional training through unlabeled data, compared with models trained solely with labeled data. In this paper, we experiment with the enhancement of detection capabilities for unknown domains and post-generation deepfakes, using both MPL [37] and SupCon [38].
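As a concrete illustration of the pseudo-labeling step, the sketch below (our assumption, not the authors' code) shows how a teacher classifier typically produces hard pseudo labels with a confidence mask; the 0.95 threshold matches the setup described later in Section 4.1:

```python
import torch

@torch.no_grad()
def hard_pseudo_labels(teacher, x_u, threshold=0.95):
    """Infer hard pseudo labels for an unlabeled batch x_u; `teacher` is any
    classification network returning logits. The mask keeps only predictions
    above the confidence threshold."""
    probs = teacher(x_u).softmax(dim=1)
    confidence, y_hat = probs.max(dim=1)
    return y_hat, confidence >= threshold
```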

3. Proposed SupCon-MPL-Based Deepfake Detection

To detect deepfake videos in unknown domains, the proposed method introduces SupCon-MPL, a meta-learning model based on contrastive learning that utilizes unlabeled deepfake images. The notations used throughout the rest of the paper are summarized in Table A1.

3.1. Proposed Training Strategy

3.1.1. Known Domain and Unknown Domain in Deepfake

A deepfake domain can be defined as a collection of images and their features generated by a specific deepfake generative model. In this paper, we distinguish deepfake domains into the known domain ($K$) and the unknown domain ($U$). The known domain $K$ refers to a collection of deepfake images that are labeled when training models; the data in $K$ are labeled and can therefore be used directly for training. The unknown domain $U$ refers to data created by unknown deepfake generative models; the data in $U$ are not labeled, so it cannot be determined whether an image is real or fake, and since they are created by various generative models, they can involve various features. The known domain is defined as $K = \{K_1, K_2, \dots\}$, where $K_i$ is the $i$-th known deepfake generative model, and the deepfake dataset $D_K = \{D_{K_1}, D_{K_2}, \dots\}$ consists of datasets $D_{K_i} = \{x_i, y_i\}$ composed of deepfake images $x_i$ and labels $y_i$ generated by $K_i$. Similarly, the unknown domain is defined as $U = \{U_1, U_2, \dots\}$, where $U_i$ is the $i$-th unknown deepfake generative model, and the deepfake dataset $D_U = \{D_{U_1}, D_{U_2}, \dots\}$ consists of datasets $D_{U_i} = \{x_i\}$ composed of deepfake images $x_i$ generated by $U_i$.
In this paper, to address $U$, we first experiment by splitting $D_K$ into a labeled dataset ($D_L$) and an unlabeled dataset ($D_{UL}$), as shown in Figure 2a. Subsequently, to verify the influence of $D_U$ on the training process, we treat $D_U$ as $D_{UL}$ and perform experiments, as shown in Figure 2b. In the deepfake training scenario, from the perspective of generative-model generations, both $D_L$ and $D_{UL}$ are drawn from the first-generation $D_K$, and evaluation is conducted using the first- and second-generation $D_U$.
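For illustration, the domain definitions above could be organized in code as follows (a sketch with placeholder names; the paper does not prescribe an implementation):

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class DomainDataset:
    images: List                     # x_i
    labels: Optional[List] = None    # y_i; None for unknown-domain data

# D_K: labeled datasets from known generative models K_i
D_K: Dict[str, DomainDataset] = {
    "DF": DomainDataset(images=[], labels=[]),
    "F2F": DomainDataset(images=[], labels=[]),
}

# D_U: unlabeled datasets from unknown generative models U_i
D_U: Dict[str, DomainDataset] = {"U_1": DomainDataset(images=[])}

# Strategy of Figure 2b: D_L is drawn from D_K, D_UL from D_U
D_L = dict(D_K)
D_UL = {name: DomainDataset(images=d.images) for name, d in D_U.items()}
```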

3.1.2. Training Strategy for Deepfake Unknown Domain Detection

The training process is organized around a comparison between the base model and the student model of SupCon-MPL. After the base model is trained on the entire dataset $D_K$, it is employed as the teacher model to train the student model. In other words, we aim to verify performance improvement when training the model under the same conditions. If a performance gain is validated with this method, it suggests that better models can be trained under identical training conditions even when employing larger or state-of-the-art (SOTA) models.
The training images are constructed with the problems of existing deepfake detection in mind. While deepfakes from known generative models exist in $K$, deepfake images from unknown generative models also exist in $U$. Therefore, during the training phase, we enhance deepfake detection performance in $K$ using labeled data and contribute to the generalization of the model by using data from $D_K$ and $D_U$ as unlabeled data. Consistent with this approach, $D_L$ and $D_{UL}$ are drawn from $D_K$, and images from $D_U$ can also serve as $D_{UL}$.
In the proposed method, we combine data in three strategies to detect the unknown-domain dataset $D_U$. The first strategy uses data from $D_L$ and $D_{UL}$ of the same domain, aiming to verify whether unlabeled data from a specific $K$ contributes to improving model performance; Figure 2a illustrates this strategy of using $D_K$ as unlabeled data. The second strategy addresses the realistic deepfake problem by examining the impact of unlabeled data on the detection performance for the corresponding domain; Figure 2b illustrates the feasibility of improving model performance by employing dataset $D_K$ as the labeled data $D_L$ and dataset $D_U$ as the unlabeled data $D_{UL}$. Finally, in the deepfake scenario experiment, after training the model using the first-generation deepfake dataset as $D_L$ and $D_{UL}$, generalized deepfake detection is assessed on the first-generation $D_K$ and the first- and second-generation $D_U$.

3.2. SupCon-MPL: Supervised Contrastive Learning with Meta Pseudo Labels

In the proposed method, following the strategy in Figure 2b, the MPL model is trained to detect deepfakes in the unknown domain $U$. SupCon-MPL allows supplementary training with unlabeled videos and, with the aid of CRL, enhances deepfake detection in the feature space. Furthermore, it affords the flexibility to employ diverse encoder models during training and enables fine-tuning of the SupCon-MPL-trained model.
In particular, the limitations of deepfake detection with limited labeled data can be mitigated by using unlabeled data, and a generalized detection model can be trained through CRL. Another notable advantage is that concurrent learning via feedback from the student model is possible even when the performance of the teacher model $T$ is low. The details of the proposed method are shown in Figure 3. The most significant distinction from conventional MPL and SupCon training is that learning through unlabeled data not only resolves the data-scarcity problem of CRL but also enhances detection capability in both $K$ and $U$. Ultimately, the final goal is to enhance deepfake detection in unseen domains, especially as new deepfake models in $U$ continue to be developed.
SupCon-MPL consists of a teacher model $T$ and a student model $S$. The two models have the same structure but different parameter values. The teacher model $T$ starts from a pre-trained model, while the student model $S$ starts from its initial, untrained state. Each model consists of an identical encoder and classifier. The encoder is obtained by removing the classification layer of the backbone model and modifying it to extract a 128-dimensional feature, allowing it to learn the representations of labeled and unlabeled images. The classifier is a single linear layer that performs classification on the representation received from the encoder.
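A minimal PyTorch sketch of this structure, assuming a torchvision ResNet50 backbone (the exact backbone and head wiring are our assumptions, not the authors' code):

```python
import torch.nn as nn
from torchvision import models

class SupConModel(nn.Module):
    """Encoder (backbone with its classification head replaced by a
    128-dimensional output) plus a single linear classification layer."""
    def __init__(self, feat_dim=128, num_classes=2):
        super().__init__()
        backbone = models.resnet50(weights=None)
        in_dim = backbone.fc.in_features              # 2048 for ResNet50
        backbone.fc = nn.Linear(in_dim, feat_dim)     # ENC output: 128-d
        self.encoder = backbone                       # ENC
        self.classifier = nn.Linear(feat_dim, num_classes)  # CLF

    def forward(self, x):
        emb = self.encoder(x)          # representation used by SupConLoss
        return self.classifier(emb), emb

teacher, student = SupConModel(), SupConModel()  # same structure, separate weights
```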
SupCon-MPL is trained by first having the student model $S$ learn from unlabeled data through CRL, followed by fine-tuning with labeled data. In this process, the classifier of the teacher model $T$ learns through the feedback from $S$, while $S$ learns dependently on $T$.

3.3. SupCon-MPL Loss Function

SupCon-MPL, as shown in Figure 3, is composed of a teacher model ($T$) and a student model ($S$), each consisting of an encoder and a linear classifier. SupCon-MPL has two loss functions used to train the models sequentially. One involves the teacher model $T$ distilling knowledge to the student model $S$; the other involves the teacher model $T$ learning from the feedback factor provided by $S$ on the labeled data. The knowledge distilled by $T$ includes previously learned content about deepfakes.
In SupCon-MPL, let the parameters of the classifiers of $T$ and $S$ be $\theta_T$ and $\theta_S$, respectively; denote a batch of images and labels from the labeled data as $(x_l, y_l) \in D_K$, and a batch of images from the unlabeled data as $x_u \in D_U$. The goal of SupCon-MPL is to find the parameters $\theta_S^{PL}$ of the generalized deepfake detection model $S$:
$$\theta_S^{PL} = \operatorname*{argmin}_{\theta_S}\; \underbrace{\mathbb{E}_{x_u}\Big[\mathrm{CE}\big(T(x_u;\theta_T),\, S(x_u;\theta_S)\big)\Big]}_{\mathcal{L}_u(\theta_T,\,\theta_S)}$$
Hence, the objective function of SupCon-MPL is defined as follows.
$$\min_{\theta_T}\; \mathcal{L}_l\big(\theta_S^{PL}(\theta_T)\big), \qquad \text{where} \quad \theta_S^{PL}(\theta_T) = \operatorname*{argmin}_{\theta_S}\, \mathcal{L}_u(\theta_T, \theta_S).$$
For optimization, SupCon-MPL approximates $\theta_S^{PL}(\theta_T)$ with a single gradient step using the student learning rate $\eta_S$:
$$\theta_S^{PL}(\theta_T) \approx \theta_S - \eta_S\, \nabla_{\theta_S} \mathcal{L}_u(\theta_T, \theta_S),$$
which yields the final objective function:
$$\min_{\theta_T}\; \mathcal{L}_l\big(\theta_S - \eta_S\, \nabla_{\theta_S} \mathcal{L}_u(\theta_T, \theta_S)\big).$$
Both $T$ and $S$ consist of an encoder and a classifier and are trained according to their respective loss functions. The loss of $ENC_T$, the encoder of $T$, is the SupConLoss [38], and the loss of the classifier $CLF_T$ combines a labeled loss for the labeled data $x_l$ and an MPL loss that reflects the feedback from $S$. First, the gradient $g_{T,\mathrm{contrast}}^{(t)}$ of $ENC_T$ is computed from image pairs $\big(\mathrm{RandAugment}_a(x_l), \mathrm{RandAugment}_b(x_l), y_l\big)$, obtained by applying two different random augmentations to the same image $x_l$ with label $y_l$, and passing them through SupConLoss [38]. Because this training process plays a role similar to the UDA loss of the original MPL, the UDA loss is no longer used:
$$g_{T,\mathrm{contrast}}^{(t)} = \nabla_{\theta_T}\, \mathrm{SupConLoss}\big(\mathrm{RandAugment}_a(x_l),\, \mathrm{RandAugment}_b(x_l),\, y_l\big)\Big|_{\theta_T = \theta_T^{(t)}}$$
$ENC_T$ is updated immediately after computing $g_{T,\mathrm{contrast}}^{(t)}$:
$$\theta_{T,ENC}^{(t+1)} = \theta_{T,ENC}^{(t)} - \eta_S\, g_{T,\mathrm{contrast}}^{(t)}$$
The labeled loss of $CLF_T$, $g_{T,\mathrm{supervised}}^{(t)}$, measures the difference between $y_l$ and the label predicted by $T$ using the cross-entropy loss (CE loss). Here, $emb_l^T$ denotes the embedding obtained by passing the labeled data $x_l$ through $ENC_T$:
$$g_{T,\mathrm{supervised}}^{(t)} = \nabla_{\theta_T}\, \mathrm{CE}\big(y_l,\, CLF_T(emb_l^T;\theta_T)\big)\Big|_{\theta_T = \theta_T^{(t)}}$$
The MPL loss $g_T^{(t)}$ is computed from the difference between the hard pseudo label $\hat{y}_u$, obtained as the maximum of the pseudo labels generated by $T$ from $x_u$, and the corresponding logit. Here, $emb_u^T$ denotes the embedding obtained by passing the unlabeled data $x_u$ through $ENC_T$:
$$g_T^{(t)} = h \cdot \nabla_{\theta_T}\, \mathrm{CE}\big(\hat{y}_u,\, CLF_T(emb_u^T;\theta_T)\big)\Big|_{\theta_T = \theta_T^{(t)}}$$
The feedback factor $h$ of $S$ is calculated in the same way as in the original Meta Pseudo Labels [37], using a Taylor expansion to capture the difference before and after the training step of $S$. In the proposed method, we approximate $h$ as the difference between the CE loss of $S$ on the labeled data after and before its update. This allows the final loss value to converge as training progresses.
$$h = \mathrm{CE}\big(y_l,\, S(x_l;\theta_S^{(t+1)})\big) - \mathrm{CE}\big(y_l,\, S(x_l;\theta_S^{(t)})\big)$$
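As a sketch, the feedback factor can be computed by caching the student's labeled-data CE loss before its update and re-measuring it afterwards (illustrative code; `student` returns (logits, embedding) as in the model sketch above):

```python
import torch
import torch.nn.functional as F

def feedback_factor(student, x_l, y_l, loss_before):
    """h = CE(y_l, S(x_l; theta^(t+1))) - CE(y_l, S(x_l; theta^(t))).
    `loss_before` is the CE loss cached before the student update."""
    with torch.no_grad():
        logits_after, _ = student(x_l)
        loss_after = F.cross_entropy(logits_after, y_l)
    return (loss_after - loss_before).item()
```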
Finally, $CLF_T$ is updated with the sum of the two gradients:
$$\theta_T^{(t+1)} = \theta_T^{(t)} - \eta_S\,\big(g_T^{(t)} + g_{T,\mathrm{supervised}}^{(t)}\big)$$
$S$ is trained on unlabeled data. The loss of the student encoder $ENC_S$, denoted $g_{S,\mathrm{contrast}}^{(t)}$, uses SupConLoss [38], as for $ENC_T$; it leverages pairs $(x_u, \hat{y}_u)$ comprising an unlabeled image $x_u$ and the pseudo label $\hat{y}_u$ generated by $T$:
$$g_{S,\mathrm{contrast}}^{(t)} = \nabla_{\theta_S}\, \mathrm{SupConLoss}\big(\mathrm{RandAugment}_a(x_u),\, \mathrm{RandAugment}_b(x_u),\, \hat{y}_u\big)\Big|_{\theta_S = \theta_S^{(t)}}$$
$ENC_S$ is likewise updated immediately after computing $g_{S,\mathrm{contrast}}^{(t)}$:
$$\theta_{S,ENC}^{(t+1)} = \theta_{S,ENC}^{(t)} - \eta_S\, g_{S,\mathrm{contrast}}^{(t)}$$
The loss of $CLF_S$ is the CE loss between the hard pseudo label $\hat{y}_u$ produced by $T$ for $x_u$ and the prediction of $S$. Here, $emb_u^S$ denotes the embedding obtained by passing the unlabeled data $x_u$ through $ENC_S$:
$$\theta_S^{(t+1)} = \theta_S^{(t)} - \eta_S\, \nabla_{\theta_S}\, \mathrm{CE}\big(\hat{y}_u,\, CLF_S(emb_u^S;\theta_S)\big)\Big|_{\theta_S = \theta_S^{(t)}}$$
The SupConLoss used in $g_{T,\mathrm{contrast}}^{(t)}$ and $g_{S,\mathrm{contrast}}^{(t)}$ for the teacher model $T$ and student model $S$ is as follows:
$$\mathrm{SupConLoss} = \sum_{i \in I} \mathrm{SupConLoss}_i = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p/\tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a/\tau)}$$
Here, $i \in I \equiv \{1, \dots, 2N\}$ indexes the randomly augmented data, and $P(i) \equiv \{p \in A(i) : y_p = y_i\}$ is the set of indices of all positives in the batch (since $y_p$ and $y_i$ are labels of images randomly augmented from the same labeled image, they equal $y_l$). $z_i$ and $z_p$ are the embeddings of the randomly augmented images $\mathrm{RandAugment}_a(x_l)$ and $\mathrm{RandAugment}_b(x_l)$ passed through the encoder $ENC$, $A(i) \equiv I \setminus \{i\}$, and $\tau$ is the temperature parameter. In other words, the inner product between positive pairs ($i$ and $p$ share a class but are different samples) is maximized through $\exp(z_i \cdot z_p/\tau)$, and the inner product with negatives is penalized through the denominator $\sum_{a \in A(i)} \exp(z_i \cdot z_a/\tau)$, so that the SupConLoss is minimized.
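For reference, a compact implementation of this loss might look as follows (our sketch following Khosla et al. [38], not the authors' code); `z` stacks the embeddings of all $2N$ augmented views and `y` their labels:

```python
import torch
import torch.nn.functional as F

def supcon_loss(z, y, tau=0.07):
    """z: embeddings of the 2N augmented views, y: their labels."""
    z = F.normalize(z, dim=1)                        # put z_i on the unit sphere
    sim = z @ z.t() / tau                            # z_i . z_a / tau
    sim = sim - sim.max(dim=1, keepdim=True).values.detach()  # numerical stability
    self_mask = torch.eye(len(y), dtype=torch.bool, device=z.device)
    pos_mask = (y.unsqueeze(0) == y.unsqueeze(1)) & ~self_mask   # P(i)
    exp_sim = torch.exp(sim).masked_fill(self_mask, 0.0)         # a in A(i)
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True))
    mean_log_prob_pos = (pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)
    return -mean_log_prob_pos.mean()
```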
The training process of the proposed SupCon-MPL model for deepfake detection is shown in Algorithm 1.
Algorithm 1 The deepfake detection method based on SupCon-MPL (pseudocode)
Set the labeled and unlabeled data with domain splitting [37].
Input: labeled data $(x_l, y_l)$ and unlabeled data $x_u$.
Initialize $\theta_T^{(0)}$ and $\theta_S^{(0)}$.
Pretrain the teacher model with $(x_l, y_l)$.
For $t = 0$ to $N - 1$ do
  Sample an unlabeled example $x_u$ and a labeled example $(x_l, y_l)$.
  Sample a pseudo label $\hat{y}_u \sim P(\cdot \mid x_u; \theta_T)$.
  Compute the contrastive loss of the student encoder $ENC_S$ using the pseudo label $\hat{y}_u$:
    $g_{S,\mathrm{contrast}}^{(t)} = \nabla_{\theta_S}\,\mathrm{SupConLoss}\big(\mathrm{RandAugment}_a(x_u), \mathrm{RandAugment}_b(x_u), \hat{y}_u\big)\big|_{\theta_S = \theta_S^{(t)}}$
  Update the student encoder $ENC_S$:
    $\theta_{S,ENC}^{(t+1)} = \theta_{S,ENC}^{(t)} - \eta_S\, g_{S,\mathrm{contrast}}^{(t)}$
  Update the student classifier $CLF_S$ using the pseudo label $\hat{y}_u$:
    $\theta_S^{(t+1)} = \theta_S^{(t)} - \eta_S\, \nabla_{\theta_S}\,\mathrm{CE}\big(\hat{y}_u, CLF_S(emb_u^S;\theta_S)\big)\big|_{\theta_S = \theta_S^{(t)}}$
  Compute the contrastive loss of the teacher encoder $ENC_T$ using the labeled data $(x_l, y_l)$:
    $g_{T,\mathrm{contrast}}^{(t)} = \nabla_{\theta_T}\,\mathrm{SupConLoss}\big(\mathrm{RandAugment}_a(x_l), \mathrm{RandAugment}_b(x_l), y_l\big)\big|_{\theta_T = \theta_T^{(t)}}$
  Update the teacher encoder $ENC_T$:
    $\theta_{T,ENC}^{(t+1)} = \theta_{T,ENC}^{(t)} - \eta_S\, g_{T,\mathrm{contrast}}^{(t)}$
  Compute the gradient on the labeled data $(x_l, y_l)$:
    $g_{T,\mathrm{supervised}}^{(t)} = \nabla_{\theta_T}\,\mathrm{CE}\big(y_l, CLF_T(emb_l^T;\theta_T)\big)\big|_{\theta_T = \theta_T^{(t)}}$
  Compute the feedback factor $h$ from the student:
    $h = \mathrm{CE}\big(y_l, S(x_l;\theta_S^{(t+1)})\big) - \mathrm{CE}\big(y_l, S(x_l;\theta_S^{(t)})\big)$
  Compute the MPL loss from the unlabeled data $x_u$:
    $g_T^{(t)} = h\, \nabla_{\theta_T}\,\mathrm{CE}\big(\hat{y}_u, CLF_T(emb_u^T;\theta_T)\big)\big|_{\theta_T = \theta_T^{(t)}}$
  Update the teacher classifier $CLF_T$:
    $\theta_T^{(t+1)} = \theta_T^{(t)} - \eta_S\,\big(g_T^{(t)} + g_{T,\mathrm{supervised}}^{(t)}\big)$
end for
return $\theta_{S,ENC}^{(N)}, \theta_S^{(N)}$    ▹ Only the student encoder and classifier are returned for evaluation.
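The following PyTorch sketch condenses one iteration of Algorithm 1, reusing the SupConModel and supcon_loss sketches above. For brevity, the encoder and classifier updates of each network are merged into a single optimizer step, and optimizer construction, confidence masking, and scheduling are omitted; all names are illustrative, not the authors' code:

```python
import torch
import torch.nn.functional as F

def supcon_mpl_step(teacher, student, opt_t, opt_s, x_l, y_l, x_u, aug):
    """One SupCon-MPL iteration; `aug` applies a random augmentation to an
    image batch (playing the role of RandAugment_a / RandAugment_b)."""
    with torch.no_grad():
        y_hat = teacher(x_u)[0].argmax(dim=1)                # hard pseudo labels
        loss_before = F.cross_entropy(student(x_l)[0], y_l)  # student CE at step t

    # Student: SupConLoss on two views of x_u plus CE on the pseudo labels.
    _, e1 = student(aug(x_u))
    _, e2 = student(aug(x_u))
    s_loss = supcon_loss(torch.cat([e1, e2]), torch.cat([y_hat, y_hat]))
    s_loss = s_loss + F.cross_entropy(student(x_u)[0], y_hat)
    opt_s.zero_grad(); s_loss.backward(); opt_s.step()

    # Feedback factor h = CE(t+1) - CE(t) on the labeled batch.
    with torch.no_grad():
        h = F.cross_entropy(student(x_l)[0], y_l) - loss_before

    # Teacher: SupConLoss on two views of x_l, labeled CE, and the MPL loss.
    _, f1 = teacher(aug(x_l))
    _, f2 = teacher(aug(x_l))
    t_loss = supcon_loss(torch.cat([f1, f2]), torch.cat([y_l, y_l]))
    t_loss = t_loss + F.cross_entropy(teacher(x_l)[0], y_l)
    t_loss = t_loss + h * F.cross_entropy(teacher(x_u)[0], y_hat)
    opt_t.zero_grad(); t_loss.backward(); opt_t.step()
```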

4. Experiment

In this section, we first describe the experimental setup. We then present the experimental results for the strategies of Figure 2a,b in Section 4.2 and Section 4.3, respectively. Our main results are comparisons with the pretrained and SupCon models in Section 4.4 and with state-of-the-art models in Section 4.5.

4.1. Experiment Setup

The experiments were conducted on an NVIDIA Tesla V100 32 GB and an NVIDIA RTX 3090 (NVIDIA, Santa Clara, CA, USA) under Ubuntu 20.04 for reproducibility and stability. The single-domain and multi-domain experiments are existing outputs of the Meta Learning-based Deepfake Detection Project [24]. The pretrained model, Meta Pseudo Labels (MPL) model, SupCon model, and SupCon-MPL model were trained under the same conditions and hyperparameters. The training data consist of the Deepfakes (DF), Face2Face (F2F), FaceSwap (FS), and NeuralTextures (NT) videos and the real videos in FaceForensics++ [39], together with DFDC [40] and Celeb-DF [41]. In the scenario evaluation, the first-generation unknown-domain deepfakes are NeuralTextures (NT) [39] with real videos, and for the post-generation unknown domain, we selected StyleGAN [8] images with CelebA [48] images. We used MTCNN [49] to extract face images frame by frame from each video.
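A sketch of the per-frame face extraction, assuming the facenet-pytorch MTCNN implementation (the paper does not specify which MTCNN library or crop settings were used):

```python
import cv2
from PIL import Image
from facenet_pytorch import MTCNN

mtcnn = MTCNN(image_size=64, margin=0)   # aligned 64x64 face crops (assumed)

def extract_faces(video_path):
    cap, faces = cv2.VideoCapture(video_path), []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        face = mtcnn(rgb)                # face tensor, or None if no face found
        if face is not None:
            faces.append(face)
    cap.release()
    return faces
```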
In the single-domain experiment, 260,000 real and 340,000 fake images from each domain of FaceForensics++ [39] are used. During training, the validation set amounts to 20% of the training data, and the evaluation set uses 150,000 images per domain. In the multi-domain experiment, 200,000 labeled images are randomly extracted from four domains, and 180,000 unlabeled images are extracted from a single domain.
The generational deepfake scenario trains on 170,000 images each from the first-generation known-domain FaceForensics++ (DF, F2F, FS, Real) [39], DFDC [40], and Celeb-DF [41] data, and then evaluates on 51,200 images each of the first- and second-generation unknown-domain data. As the backbone in the scenario evaluation, we used ResNet50 [50] due to limited computational resources.
The hyperparameters are a learning rate of $10^{-4}$, an image size of 64, and a batch size of 512. In the MPL and SupCon-MPL models, the batch sizes of labeled and unlabeled images are 64 and 448, respectively. Finally, the confidence threshold is set to 0.95. The backbone models used are ResNet50 [50], ResNet101 [50], ResNext50 [51], EfficientNet-b5 [21], and WideResNet50 [52].
The training and evaluation data mix fake and real data. In the experiment of Section 4.2, video data from one domain are used as both labeled and unlabeled data; in the experiment of Section 4.3, videos from multiple domains are used as labeled data and one domain is used as unlabeled data.
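The setup above can be summarized as a configuration sketch (field names are illustrative, not from the authors' code):

```python
config = dict(
    learning_rate=1e-4,
    image_size=64,
    batch_size=512,              # single-model training
    labeled_batch=64,            # MPL / SupCon-MPL: labeled images per step
    unlabeled_batch=448,         # MPL / SupCon-MPL: unlabeled images per step
    confidence_threshold=0.95,
    backbones=["ResNet50", "ResNet101", "ResNext50",
               "EfficientNet-b5", "WideResNet50"],
)
```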

4.2. Single-Domain Experiment

In this section, the single-domain experiments evaluate detection performance for a single deepfake generation model, while Section 4.3 evaluates performance on new deepfakes after training on several deepfake generation models. Section 4.4 and Section 4.5 assess detection performance across diverse deepfake datasets and scenario-based deepfake detection, respectively.
In the experiment using only one domain, as shown in Table 1, performance on the known domain increased in most cases, and performance on unknown domains also increased in most cases. For EfficientNet-b5 [21] in Table 1, ACC and AUC in $K$ improved by an average of 4.47% and 4.53%, respectively; in $U$, ACC improved by an average of 0.20%, while AUC decreased by an average of 0.13%. Overall, of the 64 ACC and AUC validations recorded in Table 1, Table 2, Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8, 44 and 41 cases improved, respectively. This confirms that, when using the same data, the MPL model outperforms the pretrained model.

4.3. Multi-Domain Experiment

In the multi-domain experiment, the combination of data is configured to reflect a realistic deepfake situation: labeled deepfake data from multiple domains of $K$ are available, together with unlabeled data $D_{UL}$ from $U$. Evaluation is then conducted on the $U$ data used as $D_{UL}$. Thus, MPL trains with deepfake videos from multiple domains and, after training, is tested on its ability to detect deepfakes from the domain that supplied the unlabeled data. As a result, ACC and AUC increased by an average of 1.59% and 1.26% for the two models in Table 9 and Table 10. This confirms that the MPL model improved its ability to detect the unknown model's deepfakes by learning from unlabeled data.

4.4. SupCon-MPL Experiment

This experiment uses the same labeled data as the multi-domain experiment and evaluates each combination of known and unknown domains; Celeb-DF [41] is used as the unlabeled data when training the SupCon-MPL model. As shown in Table 11, when FS is evaluated as the unknown domain, ACC and AUC decreased compared with the SupCon model [35], but in all other validations the performance of SupCon-MPL was similar to or higher than that of the two compared models. This shows that SupCon-MPL is trained as a more generalized detection model than the existing deepfake detection models.

4.5. Deepfake Scenario Experiment

In this section, we construct a real-world training scenario for a deepfake detection model. The scenario involves training with first-generation deepfake data, then detecting first-generation deepfakes (NT) that were not included in training, as well as newly developed, unknown post-generation (second-generation) deepfakes (StyleGAN). Table 12 shows the results of various models under this scenario. SupCon-MPL achieved the highest performance in all scenario evaluations.

4.6. Limitations

The main goal of SupCon-MPL is to enhance deepfake detection performance in unknown domains using meta-learning. Therefore, in this paper, we conducted experiments by reconfiguring a limited set of deepfake datasets into scenarios.
The main limitation concerns computing resources. Since the training in Section 4.3 and Section 4.4 was conducted on an NVIDIA RTX 3090, only ResNet50 [50] could be used as the backbone of SupCon-MPL, which trains two models simultaneously. ResNet50 [50] stands out among image classification models for its stability and consistently decent performance, and it fits within the given resources for training SupCon-MPL. Subsequent experiments with various backbone models and larger image sizes will require more computing resources.
Another limitation is that we could not find a verified deepfake dataset for the third generation or higher. Further experiments with such datasets are needed in the future.

5. Conclusions

With the development of various deepfake generative models, it has become important to develop a generalized deepfake detection model that guarantees detection performance on unknown-domain deepfakes, not just on the domains used for training. The proposed SupCon-MPL trains on unlabeled data and performs contrastive learning to enhance the detection of unknown deepfakes. This enables contrastive learning with a diverse range of data, and, during training, the teacher and student models infer information and provide feedback to each other, enabling the student model to surpass the performance of the teacher model.
In the deepfake scenario evaluation, our model improved detection performance over the other generalized deepfake detection models on both known and unknown domains. A significant feature of SupCon-MPL is its ability to train models with a large amount of unlabeled data, providing a way to enhance performance using the countless images and videos available on the internet. Through this, our model enables the training of a generalized deepfake detection model and provides robust detection capability for real-world deepfakes. SupCon-MPL can contribute to detecting the outputs of newly developed deepfake generative models as they become increasingly difficult to distinguish from real images.
Future research will focus on improving detection performance on higher-generation deepfake images/videos using these methods. Additionally, studies on reducing the training cost of SupCon-MPL will be conducted, with the goal of developing a more efficient and economical deepfake detection model.

Author Contributions

Conceptualization, K.-H.M. and S.-H.L.; Methodology, K.-H.M., S.-Y.O. and S.-H.L.; Software, K.-H.M.; Validation, K.-H.M.; Resources, S.-Y.O.; Writing—original draft, K.-H.M.; Writing—review & editing, S.-H.L.; Supervision, S.-Y.O. and S.-H.L.; Project administration, K.-H.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Dong-A University research fund.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Summary of notations.

Symbol | Definition
CRL | Contrastive representation learning
MPL | Meta Pseudo Labels
GAN | Generative adversarial network
VAE | Variational autoencoder
$K$ | Known domain, a set of known deepfake generative models
$U$ | Unknown domain, a set of unknown deepfake generative models
$D_K$ | Deepfake datasets of the known domain (labeled)
$D_U$ | Deepfake datasets of the unknown domain (unlabeled)
$K_i$ | $i$-th known deepfake generative model in the known domain
$U_i$ | $i$-th unknown deepfake generative model in the unknown domain
$x_i$ | Images, real and deepfake
$y_i$ | Labels, real and deepfake
$D_L$ | Labeled dataset used in training
$D_{UL}$ | Unlabeled dataset used in training
$T$ | Teacher model of SupCon-MPL
$S$ | Student model of SupCon-MPL
$\theta_T$ | Parameters of the teacher model's classifier
$\theta_S$ | Parameters of the student model's classifier
$\theta_S^{PL}$ | Optimized student parameters in SupCon-MPL
$\theta_{T,ENC}^{(t)}$ | Parameters of the teacher encoder
$\theta_{S,ENC}^{(t)}$ | Parameters of the student encoder
$\eta_S$ | Learning rate used in training
$CLF_T$ | Classifier of the teacher model
$ENC_T$ | Encoder of the teacher model
$CLF_S$ | Classifier of the student model
$ENC_S$ | Encoder of the student model
$x_l$ | Labeled images used in training
$x_u$ | Unlabeled images used in training
$y_l$ | Labels used in training
$\hat{y}_u$ | Pseudo label generated by the teacher model
$emb_l^T$ | Embedding of the labeled data from the teacher encoder
$emb_u^S$ | Embedding of the unlabeled data from the student encoder
$h$ | Student model's feedback factor
$I$ | Index set of the randomly augmented data
$P(i)$ | Set of indices of all positives in the batch
$z_i$, $z_p$ | Embeddings of the randomly augmented images

References

  1. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 1–9. [Google Scholar] [CrossRef]
  2. Mirza, M.; Osindero, S. Conditional Generative Adversarial Nets. arXiv 2014, arXiv:1411.1784. [Google Scholar]
  3. Choi, Y.; Choi, M.; Kim, M.; Ha, J.W.; Kim, S.; Choo, J. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8789–8797. [Google Scholar] [CrossRef]
  4. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 214–223. [Google Scholar]
  5. Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A.C. Improved training of wasserstein gans. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
  6. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  7. Sohn, K.; Lee, H.; Yan, X. Learning structured output representation using deep conditional generative models. Adv. Neural Inf. Process. Syst. 2015, 28, 1–9. [Google Scholar]
  8. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410. [Google Scholar] [CrossRef]
  9. Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Sutskever, I. Zero-shot text-to-image generation. In Proceedings of the ICML 2021 Workshop on Unsupervised Reinforcement Learning, Virtual, 18–24 July 2021; pp. 8821–8831. [Google Scholar]
  10. DeepFaceLab. Available online: https://github.com/iperov/DeepFaceLab (accessed on 5 March 2024).
  11. Deepswap. Available online: https://deepfaceswap.ai/ (accessed on 5 March 2024).
  12. Synthesia. Available online: https://www.synthesia.io (accessed on 5 March 2024).
  13. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar] [CrossRef]
  14. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Norouzi, M. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 2022, 35, 36479–36494. [Google Scholar]
  15. Song, Y.; Dhariwal, P.; Chen, M.; Sutskever, I. Consistency models. arXiv 2023, arXiv:2303.01469. [Google Scholar]
  16. Li, Y.; Lyu, S. Exposing deepfake videos by detecting face warping artifacts. arXiv 2018, arXiv:1811.00656. [Google Scholar]
  17. Matern, F.; Riess, C.; Stamminger, M. Exploiting visual artifacts to expose deepfakes and face manipulations. In Proceedings of the 2019 IEEE Winter Applications of Computer Vision Workshops, Waikoloa, HI, USA, 7–11 June 2019; pp. 83–92. [Google Scholar] [CrossRef]
  18. Li, Y.; Chang, M.C.; Lyu, S. In ictu oculi: Exposing ai generated fake face videos by detecting eye blinking. arXiv 2018, arXiv:1806.02877. [Google Scholar]
  19. Ciftci, U.A.; Demir, I.; Yin, L. Fakecatcher: Detection of synthetic portrait videos using biological signals. IEEE Trans. Pattern Anal. Mach. Intell. 2020, early access. [CrossRef]
  20. Coccomini, D.A.; Messina, N.; Gennaro, C.; Falchi, F. Combining efficientnet and vision transformers for video deepfake detection. In Proceedings of the International Conference on Image Analysis and Processing, Lecce, Italy, 23–27 May 2022; pp. 219–229. [Google Scholar] [CrossRef]
  21. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Los Angeles, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  22. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  23. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar] [CrossRef]
  24. Moon, K.-H.; Ok, S.-Y.; Seo, J.; Lee, S.-H. Meta Pseudo Labels Based Deepfake Video Detection. J. Korea Multimed. Soc. 2024, 27, 9–21. [Google Scholar] [CrossRef]
  25. Jain, A.; Korshunov, P.; Marcel, S. Improving generalization of deepfake detection by training for attribution. In Proceedings of the 2021 IEEE 23rd International Workshop on Multimedia Signal Processing, Tampere, Finland, 6–8 October 2021; pp. 1–6. [Google Scholar] [CrossRef]
  26. Nadimpalli, A.V.; Rattani, A. On improving cross-dataset generalization of deepfake detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 91–99. [Google Scholar] [CrossRef]
  27. Hsu, C.C.; Lee, C.Y.; Zhuang, Y.X. Learning to detect fake face images in the wild. In Proceedings of the 2018 International Symposium on Computer, Consumer and Control, Taichung, Taiwan, 6–8 December 2018; pp. 388–391. [Google Scholar] [CrossRef]
  28. Dong, F.; Zou, X.; Wang, J.; Liu, X. Contrastive learning-based general Deepfake detection with multi-scale RGB frequency clues. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 90–99. [Google Scholar] [CrossRef]
  29. Shiohara, K.; Yamasaki, T. Detecting deepfakes with self-blended images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18720–18729. [Google Scholar] [CrossRef]
  30. Chen, L.; Zhang, Y.; Song, Y.; Wang, J.; Liu, L. Ost: Improving generalization of deepfake detection via one-shot test-time training. Adv. Neural Inf. Process. Syst. 2022, 35, 24597–24610. [Google Scholar]
  31. Aneja, S.; Nießner, M. Generalized zero and few-shot transfer for facial forgery detection. arXiv 2020, arXiv:2006.11863. [Google Scholar]
  32. Kim, M.; Tariq, S.; Woo, S.S. Fretal: Generalizing deepfake detection using knowledge distillation and representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1001–1012. [Google Scholar] [CrossRef]
  33. Qi, H.; Guo, Q.; Juefei-Xu, F.; Xie, X.; Ma, L.; Feng, W.; Zhao, J. Deeprhythm: Exposing deepfakes with attentional visual heartbeat rhythms. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 4318–4327. [Google Scholar] [CrossRef]
  34. Lee, S.; Tariq, S.; Kim, J.; Woo, S.S. Tar: Generalized forensic framework to detect deepfakes using weakly supervised learning. In Proceedings of the IFIP International Conference on ICT Systems Security and Privacy Protection, Oslo, Norway, 22–24 June 2021; pp. 351–366. [Google Scholar] [CrossRef]
  35. Xu, Y.; Raja, K.; Pedersen, M. Supervised contrastive learning for generalizable and explainable deepfakes detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 379–389. [Google Scholar]
  36. Fung, S.; Lu, X.; Zhang, C.; Li, C.T. DeepfakeUCL: Deepfake Detection via Unsupervised Contrastive Learning. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar] [CrossRef]
  37. Pham, H.; Dai, Z.; Xie, Q.; Le, Q.V. Meta pseudo labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11557–11568. [Google Scholar] [CrossRef]
  38. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Krishnan, D. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 18661–18673. [Google Scholar]
  39. Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Niessner, M. Faceforensics++: Learning to Detect Manipulated Facial Images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1–11. [Google Scholar] [CrossRef]
  40. Dolhansky, B.; Bitton, J.; Pflaum, B.; Lu, J.; Howes, R.; Wang, M.; Ferrer, C.C. The deepfake detection challenge (dfdc) dataset. arXiv 2020, arXiv:2006.07397. [Google Scholar]
  41. Li, Y.Z.; Yang, X.; Sun, P.; Qi, H.G.; Lyu, S. Celeb-DF: A Large-scale Challenging Dataset for DeepFake Forensics. In Proceedings of the Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 3207–3216. [Google Scholar] [CrossRef]
  42. Korshunov, P.; Marcel, S. Deepfakes: A new threat to face recognition? assessment and detection. arXiv 2018, arXiv:1812.08685. [Google Scholar]
  43. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality reduction by learning an invariant mapping. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22 June 2006; pp. 1735–1742. [Google Scholar] [CrossRef]
  44. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar] [CrossRef]
  45. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  46. Xie, Q.; Luong, M.T.; Hovy, E.; Le, Q.V. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10687–10698. [Google Scholar] [CrossRef]
  47. Xie, Q.; Dai, Z.; Hovy, E.; Luong, T.; Le, Q. Unsupervised data augmentation for consistency training. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; pp. 6256–6268. [Google Scholar]
  48. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3730–3738. [Google Scholar] [CrossRef]
  49. Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503. [Google Scholar] [CrossRef]
  50. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  51. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500. [Google Scholar] [CrossRef]
  52. Zagoruyko, S.; Komodakis, N. Wide residual networks. arXiv 2016, arXiv:1605.07146. [Google Scholar]
Figure 1. (a) Training overview and (b) training process with labeled/unlabeled datasets of Meta Pseudo Labels [37].
Figure 2. Deepfake image discrimination strategy targeting the (a) known domain and (b) unknown domain.
Figure 3. Modified Meta Pseudo Labels and loss functions for deepfake detection.
Table 1. The performance of the pretrained and MPL models on the known domain (EfficientNet-b5 [21]).

Baseline Model | Train Dataset | Test Dataset | Pretrained ACC / AUC / F1 | MPL ACC / AUC / F1
EfficientNet-b5 [21] | DF | DF | 89.35 / 89.35 / 89.19 | 90.08 / 90.08 / 90.36
 | F2F | F2F | 77.21 / 77.21 / 77.61 | 80.35 / 80.35 / 79.45
 | FS | FS | 84.52 / 84.31 / 83.20 | 87.90 / 87.43 / 85.98
 | NT | NT | 64.13 / 63.33 / 58.18 | 74.79 / 74.47 / 71.36
Table 2. The performance of the pretrained and MPL models on the unknown domain (EfficientNet-b5 [21]).

Baseline Model | Train Dataset | Test Dataset | Pretrained ACC / AUC / F1 | MPL ACC / AUC / F1
EfficientNet-b5 [21] | DF | F2F | 51.59 / 51.60 / 22.09 | 51.70 / 51.70 / 23.01
 | DF | FS | 55.97 / 52.06 / 23.23 | 56.45 / 52.65 / 25.01
 | DF | NT | 55.37 / 51.26 / 20.06 | 55.41 / 51.37 / 21.28
 | F2F | DF | 57.56 / 57.55 / 47.78 | 53.87 / 53.84 / 32.11
 | F2F | FS | 56.47 / 54.05 / 39.66 | 55.48 / 51.92 / 26.57
 | F2F | NT | 54.63 / 51.91 / 33.96 | 54.76 / 50.99 / 23.26
 | FS | DF | 57.25 / 57.22 / 39.84 | 57.61 / 57.88 / 36.54
 | FS | F2F | 51.76 / 51.77 / 25.78 | 51.98 / 51.99 / 20.16
 | FS | NT | 53.79 / 49.89 / 20.80 | 54.32 / 49.80 / 12.35
 | NT | DF | 58.72 / 58.71 / 52.73 | 61.03 / 61.00 / 52.78
 | NT | F2F | 54.52 / 54.53 / 45.23 | 55.88 / 55.89 / 43.68
 | NT | FS | 51.65 / 49.42 / 34.03 | 53.29 / 49.34 / 27.25
Table 3. The performance of the pretrained and MPL models on the known domain (ResNet50 [50]).

Baseline Model | Train Dataset | Test Dataset | Pretrained ACC / AUC / F1 | MPL ACC / AUC / F1
ResNet50 [50] | DF | DF | 91.16 / 91.16 / 90.91 | 91.29 / 91.30 / 90.97
 | F2F | F2F | 82.47 / 82.48 / 81.84 | 83.98 / 83.97 / 82.75
 | FS | FS | 87.55 / 87.21 / 86.01 | 88.31 / 88.09 / 87.06
 | NT | NT | 74.96 / 74.28 / 71.20 | 76.22 / 75.84 / 73.15
Table 4. The performance of the pretrained and MPL models on the unknown domain (ResNet50 [50]).

Baseline Model | Train Dataset | Test Dataset | Pretrained ACC / AUC / F1 | MPL ACC / AUC / F1
ResNet50 [50] | DF | F2F | 52.91 / 52.89 / 22.41 | 51.92 / 51.86 / 20.36
 | DF | FS | 56.80 / 52.68 / 21.41 | 57.02 / 53.00 / 22.63
 | DF | NT | 55.91 / 51.77 / 18.54 | 55.45 / 51.33 / 18.53
 | F2F | DF | 53.94 / 53.90 / 31.72 | 53.79 / 53.75 / 28.28
 | F2F | FS | 54.62 / 50.97 / 24.47 | 55.32 / 51.35 / 21.84
 | F2F | NT | 54.76 / 51.20 / 24.07 | 54.96 / 51.04 / 21.00
 | FS | DF | 56.34 / 56.30 / 33.89 | 59.57 / 59.53 / 41.50
 | FS | F2F | 51.16 / 51.10 / 20.08 | 51.57 / 51.52 / 21.01
 | FS | NT | 53.67 / 49.44 / 14.32 | 53.81 / 49.61 / 15.49
 | NT | DF | 60.45 / 60.43 / 49.93 | 60.17 / 60.15 / 50.85
 | NT | F2F | 54.74 / 54.70 / 38.88 | 53.56 / 53.51 / 36.77
 | NT | FS | 51.59 / 48.15 / 22.30 | 51.84 / 48.51 / 25.20
Table 5. The performance of the pretrained and MPL models on the known domain (ResNet101 [50]).

Baseline Model | Train Dataset | Test Dataset | Pretrained ACC / AUC / F1 | MPL ACC / AUC / F1
ResNet101 [50] | DF | DF | 91.16 / 91.16 / 90.84 | 91.13 / 91.13 / 90.85
 | F2F | F2F | 81.41 / 81.41 / 80.27 | 83.50 / 83.49 / 82.73
 | FS | FS | 87.59 / 87.37 / 86.13 | 87.75 / 87.66 / 86.39
 | NT | NT | 74.17 / 74.56 / 73.16 | 76.03 / 75.53 / 73.22
Table 6. The performance of the pretrained and MPL models on the unknown domain (ResNet101 [50]).

Baseline Model | Train Dataset | Test Dataset | Pretrained ACC / AUC / F1 | MPL ACC / AUC / F1
ResNet101 [50] | DF | F2F | 52.69 / 52.63 / 21.22 | 51.26 / 51.20 / 18.01
 | DF | FS | 56.90 / 52.73 / 20.42 | 56.02 / 51.84 / 18.71
 | DF | NT | 55.93 / 51.74 / 18.33 | 54.93 / 50.73 / 16.53
 | F2F | DF | 53.63 / 53.59 / 30.44 | 53.50 / 53.46 / 27.42
 | F2F | FS | 54.33 / 50.08 / 22.82 | 54.86 / 50.85 / 19.67
 | F2F | NT | 54.62 / 51.01 / 23.30 | 55.17 / 51.28 / 20.98
 | FS | DF | 57.04 / 57.70 / 37.00 | 59.55 / 59.51 / 42.55
 | FS | F2F | 51.43 / 51.37 / 20.63 | 51.84 / 51.79 / 24.60
 | FS | NT | 53.66 / 49.55 / 15.04 | 53.61 / 49.59 / 17.59
 | NT | DF | 62.66 / 62.65 / 59.08 | 60.16 / 60.14 / 49.99
 | NT | F2F | 52.69 / 52.63 / 47.55 | 51.26 / 51.20 / 37.97
 | NT | FS | 56.90 / 52.73 / 29.98 | 56.02 / 51.84 / 21.29
Table 7. The performance of the pretrained and MPL models on the known domain (ResNext50 [51]).

Baseline Model | Train Dataset | Test Dataset | Pretrained ACC / AUC / F1 | MPL ACC / AUC / F1
ResNext50 [51] | DF | DF | 90.29 / 90.29 / 90.01 | 91.14 / 91.14 / 91.02
 | F2F | F2F | 81.19 / 81.17 / 80.07 | 82.87 / 82.85 / 82.11
 | FS | FS | 87.04 / 86.77 / 85.44 | 87.36 / 87.36 / 86.36
 | NT | NT | 74.26 / 74.43 / 73.21 | 76.32 / 75.78 / 73.30
Table 8. The performance of the pretrained and MPL models on the unknown domain (ResNext50 [51]).

Baseline Model | Train Dataset | Test Dataset | Pretrained ACC / AUC / F1 | MPL ACC / AUC / F1
ResNext50 [51] | DF | F2F | 51.83 / 51.73 / 20.17 | 51.66 / 51.56 / 18.92
 | DF | FS | 56.57 / 52.53 / 21.48 | 57.24 / 53.21 / 23.08
 | DF | NT | 55.25 / 51.57 / 18.50 | 54.72 / 50.50 / 15.47
 | F2F | DF | 54.57 / 54.58 / 31.56 | 54.46 / 54.47 / 31.25
 | F2F | FS | 54.31 / 50.38 / 18.21 | 55.28 / 51.45 / 22.28
 | F2F | NT | 54.67 / 50.89 / 21.12 | 54.80 / 51.02 / 21.58
 | FS | DF | 59.23 / 59.23 / 41.25 | 60.33 / 60.33 / 46.22
 | FS | F2F | 52.02 / 51.93 / 23.43 | 52.98 / 52.89 / 27.91
 | FS | NT | 53.31 / 49.22 / 15.05 | 53.41 / 49.55 / 17.30
 | NT | DF | 60.58 / 60.58 / 43.46 | 58.50 / 58.51 / 44.82
 | NT | F2F | 54.16 / 54.11 / 30.69 | 52.46 / 52.38 / 31.00
 | NT | FS | 49.17 / 46.78 / 19.36 | 51.26 / 47.80 / 21.71
Table 9. The performance of the pretrained and MPL models on the targeted unknown domain (ResNext50 [51]).

Baseline Model | Train Dataset | Unlabeled Dataset | Pretrained ACC / AUC / F1 | MPL ACC / AUC / F1
ResNext50 [51] | F2F, FS, NT | DF | 63.33 / 63.32 / 70.02 | 66.23 / 66.22 / 70.35
 | DF, FS, NT | F2F | 57.35 / 57.39 / 60.69 | 58.04 / 58.08 / 61.64
 | DF, F2F, NT | FS | 50.69 / 50.53 / 46.69 | 51.81 / 51.54 / 47.22
 | DF, F2F, FS | NT | 53.63 / 53.05 / 47.65 | 53.89 / 53.06 / 46.28
Table 10. The performance of the pretrained and MPL models on the targeted unknown domain (WideResNet50 [52]).

Baseline Model | Train Dataset | Unlabeled Dataset | Pretrained ACC / AUC / F1 | MPL ACC / AUC / F1
WideResNet50 [52] | F2F, FS, NT | DF | 63.17 / 63.14 / 69.19 | 66.03 / 66.00 / 70.15
 | DF, FS, NT | F2F | 56.86 / 56.87 / 60.76 | 57.60 / 57.60 / 58.91
 | DF, F2F, NT | FS | 48.37 / 47.74 / 39.32 | 52.86 / 51.13 / 41.60
 | DF, F2F, FS | NT | 54.22 / 53.43 / 47.37 | 53.92 / 51.96 / 39.87
Table 11. The performance of SupCon-MPL compared with the baseline model and the SupCon model [35].

Baseline Model | Train Dataset | Test Dataset | Pretrained ACC / AUC / F1 | SupCon [35] ACC / AUC / F1 | SupCon-MPL (ours) ACC / AUC / F1
ResNet50 [50] | FF (without DF) | DF (unknown) | 64.24 / 64.27 / 65.49 | 62.88 / 62.84 / 58.55 | 64.60 / 64.55 / 61.60
 | | F2F + FS + NT (known) | 70.56 / 71.36 / 83.36 | 75.44 / 75.52 / 83.01 | 75.84 / 76.00 / 83.50
 | FF (without F2F) | F2F (unknown) | 55.76 / 55.64 / 40.63 | 56.61 / 56.64 / 49.47 | 58.74 / 58.77 / 51.52
 | | DF + FS + NT (known) | 77.26 / 76.89 / 81.66 | 75.54 / 75.54 / 84.13 | 78.11 / 78.28 / 85.84
 | FF (without FS) | FS (unknown) | 54.47 / 52.07 / 79.18 | 55.75 / 53.68 / 77.20 | 55.72 / 53.41 / 79.29
 | | DF + F2F + NT (known) | 75.99 / 75.87 / 81.66 | 76.02 / 75.88 / 84.84 | 77.22 / 77.12 / 85.84
 | FF (without NT) | NT (unknown) | 56.71 / 54.95 / 62.66 | 56.58 / 54.31 / 66.27 | 56.76 / 54.02 / 69.43
 | | DF + F2F + FS (known) | 77.39 / 77.41 / 81.66 | 79.09 / 79.26 / 84.13 | 81.22 / 81.32 / 85.84
Table 12. The performance of SupCon-MPL compared with other deepfake detection methods.

Model | Scenario Deepfakes (Known) | Current-Generation Deepfakes (Unknown) | Post-Generation Deepfakes (Unknown)
Tar [34] | 52.40 | 44.62 | 49.96
DDT [31] | 80.41 | 44.62 | 49.49
MPL [24] | 79.82 | 56.53 | 43.16
SupCon [35] | 79.01 | 56.66 | 47.77
SupCon-MPL (ours) | 81.40 | 57.85 | 51.90

