1. Introduction
Along with the rapid development of deep learning, artificial intelligence (AI) technologies used to create images and videos, which are a form of generative AI, have made significant strides over the past decade. It is extremely difficult for people to distinguish artificial digital images generated by these technologies from images captured by digital cameras. AI-based image generation holds potential for a wide range of applications, such as movie production, business advertising, and game creation.
On the other hand, AI-based image generation can be abused. For example, social media accounts uploading high-quality fake profile images could be used for fraudulent purposes. Generating fake images based on deep learning is called "deepfake". According to the classification in reference [1], deepfake approaches can be categorized into four types: (1) synthesis, (2) attribute editing, (3) identity swap, and (4) face reenactment. Synthesis (e.g., PGGAN [2], StyleGAN [3], and StyleGAN2 [4]) refers to creating a fake facial image of a person who does not exist in the real world. Attribute editing (e.g., STGAN [5], StarGAN [6], and StarGAN v2 [7]) refers to altering a person's facial features, such as their hairstyle or facial hair, or adding or removing signs of aging. Identity swap (e.g., FaceSwap [8]) replaces one person's face with another's. Finally, in face reenactment (e.g., Face2Face [9]), the person in the image remains the same, but his/her facial expression is changed.
In this paper, we mainly focus on detecting deepfakes in the form of synthesized images. Synthesis approaches can be classified into generative adversarial networks (GANs) and diffusion models. A GAN consists of two neural network models, a generator and a discriminator, which are trained alternately and adversarially: the generator is trained to generate high-quality images that the discriminator misclassifies as real, while the discriminator is trained to precisely distinguish the fake images generated by the generator from real images. The diffusion probabilistic model (DPM) [10] is another synthesis approach, which achieves high-quality image generation by dividing the process of generating an image from input noise into simpler steps that remove noise iteratively and gradually. Recently, Stable Diffusion [11], an extension of the DPM, has attracted considerable attention, even from non-expert users, because it enables users to control image generation via text prompts. To prevent the abuse of image generation techniques, it is important to develop techniques that allow the early detection of a wide variety of fake images.
In recent years, various deepfake detection methods using different clues have been investigated. Photo response non-uniformity (PRNU) is caused by sensitivity differences in an image sensor and can be treated as the fingerprint of a camera; PRNU-based deepfake detection was proposed in [12]. This approach can detect a deepfake even from a partially edited facial image. On the other hand, applying PRNU-based deepfake detection to rotated and/or scaled images is immensely challenging. Deepfake detection using deep neural networks (DNNs) [13,14,15,16] has also been explored.
If a fake image is created using an existing deepfake generator and the source of the image has been identified, then deepfake detection is relatively easy and achieves relatively high accuracy. However, there are challenges associated with achieving reliable and practical deepfake detection. References [17,18] address the issue of improving the generalization performance of deepfake detection models to achieve high detection accuracy even when the model used to generate the fake images is unknown. Reference [17] reported that data augmentation is effective at improving the generalization of deepfake detectors. Reference [18] proposed a deepfake detection method that extracts low-level visual cues applicable to a wider range of fake image sources by re-synthesizing input images before classification.
Existing synthesis detection approaches assume that real and synthesized images are inputted as they are and that the invisible and visible artifacts generated by a deepfake generator appear throughout the facial image. However, we also need to consider the scenario in which the artifacts are available only from a part of the fake image or are degraded. For example, parts of the face may be edited manually or through inpainting methods to partly change its appearance. Such editing can partly destroy the artifacts and make synthesized images more difficult to detect. In this case, we need to detect the deepfake from the areas that retain the artifacts generated by the deepfake generator. Analysis of a small area of the facial image therefore helps realize deepfake detection that is robust to manual editing and inpainting. On the other hand, it is difficult to extract artifacts completely separated from the scene content, such as the textures of the background and face, because the invisible artifacts are typically weak signals. In particular, detecting a synthesized image using only a small area of the facial image is a difficult challenge because only a small amount of contaminated artifacts can be obtained.
Because the eye, cornea, nose, and mouth are major facial components, detecting whether each face part is synthesized is useful for detecting synthesized images. In reference [19], inconsistent corneal specular highlights are used to identify a deepfake because the left and right corneas must have consistent specular highlights in real images under certain lighting conditions. Because this method uses only cornea regions, it is robust to the occlusion of other face parts. However, even real images may not have clear corneal specular highlights, and the accuracy is insufficient for images captured under various lighting conditions. Moreover, this technique cannot identify a deepfake from fake images in which the cornea is occluded by sunglasses. Part of a real face image may be hidden by sunglasses, masks, hands, etc. Therefore, it is also important to achieve face-part-based deepfake detection that is robust to partial occlusion and editing.
In this paper, we focus on the challenges presented by deepfake detection involving partly occluded or edited facial images. Our approach is based on an ensemble of convolutional neural network (CNN) models. Each CNN model is specialized for one facial part and can concentrate on artifact analysis of the assigned part. By limiting the type of scene content in the input images, we encourage each CNN model to extract more appropriate artifact features, even from small and/or low-resolution images. The CNN models achieve much higher accuracy than state-of-the-art methods for small facial part images and are useful for detecting deepfakes part by part. Moreover, an ensemble of these CNN models can further enhance the detection performance. Unlike a typical ensemble model that is generated in advance, our ensemble model is dynamically generated image by image according to the results of face part detection. Our approach is robust to occlusion because only visible face parts are used. Moreover, the proposed ensemble model can achieve highly accurate deepfake detection comparable to state-of-the-art methods for high-resolution images. The contributions of this paper are as follows:
We propose specialized CNN-based deepfake detectors for different face parts and analyze their effectiveness in terms of deepfake detection. In this paper, we construct deepfake detectors that are specialized for five different facial parts and evaluate their performance. As shown in Figure 1, we focus on the eye, cornea, nose, mouth, and face as facial parts.
Figure 1. Examples of facial images and their five types of face part images. The images of the first two and last two rows are from the FFHQ dataset [20] and StyleGAN2 dataset [21], respectively.
We evaluate various ensemble models consisting of different combinations of the CNN-based deepfake detectors and demonstrate that the ensemble technique achieves sufficiently high accuracy, comparable with that of deepfake detection that examines the whole face.
We develop a face-part detector based on YOLOv8 [22]. We achieve face-part detection that is robust to different face directions by training with data augmentation that mimics 3D face rotation using a homography technique.
We realize the dynamic selection of deepfake detectors that focus on different face parts based on the results of face-part detection.
We demonstrate that the proposed method achieves reliable deepfake detection that is more robust to partial occlusion and editing than the state-of-the-art methods.
Although this paper is an extended version of the conference paper presented in [23], the third to fifth contributions are new.
The rest of this paper is organized as follows: The proposed method is introduced in Section 2. We present an analysis of the effectiveness of each CNN model specialized for a single face part and of various ensembles of these CNN models in Section 3. We show the evaluation of face-part detection in Section 4. We compare our approach, which is based on dynamic ensemble selection, with the state-of-the-art methods in Section 5. We discuss the potential and challenges of deepfake detection based on face parts and their combination in Section 6. Finally, we present our conclusions in Section 7.
3. Evaluation of Deepfake Detection Using Face Parts
In this section, we build specialized CNN models for deepfake detection using individual face parts and analyze which parts, or combinations of parts, contribute to identifying real or fake faces.
3.1. Environment
We used PyTorch 1.11.0 on Python 3.7.13. We trained the CNN models on Google Colaboratory and evaluated them on a desktop PC with a 3.80 GHz Core i7 CPU, 48.0 GB of memory, and an NVIDIA RTX 3090 GPU.
3.2. Dataset Preparation for Training and Analysis
In this experiment, a face part dataset was generated using real images from Flickr-Faces-HQ (FFHQ) [20] and fake images from the StyleGAN2 [21] dataset. We used YOLOv5 [26], trained with only 101 annotated images, to roughly and automatically extract face-part patches from the real and fake images. The evaluation of the proposed face-part detector based on YOLOv8 is shown in Section 4.
The 5000 real images of FFHQ (00000-04999) and 5000 fake images of StyleGAN2 (000000-004999) were used for training and validation. We show the statistics of the extracted face-part patches for training and validation in Table 1. In addition, from the 5000 real images of FFHQ (05000-09999) and 5000 fake images of StyleGAN2 (005000-009999), we selected 618 real images and 618 fake images, in which all face parts were detected, for evaluation.
3.3. Evaluation Method
In this paper, fake (real) images were regarded as samples of the positive (negative) class. We evaluated the deepfake detection performance with recall, precision, F1-score, and accuracy, defined as follows:

Recall = TP / (TP + FN),
Precision = TP / (TP + FP),
F1-score = (2 × Precision × Recall) / (Precision + Recall),
Accuracy = (TP + TN) / (TP + TN + FP + FN),

where TP (TN) represents the number of fake (real) images that are correctly classified, and FP (FN) represents the number of real (fake) images that are falsely classified.
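As a quick reference, the four metrics can be computed directly from the confusion-matrix counts. The following is our own minimal sketch; the function name is illustrative, not from the paper:

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute recall, precision, F1-score, and accuracy.

    Fake images form the positive class: tp/fn count fake images
    classified as fake/real, and tn/fp count real images classified
    as real/fake.
    """
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return recall, precision, f1, accuracy
```

Note that recall and precision both use TP in the numerator but penalize missed fakes (FN) and false alarms (FP), respectively, while accuracy weights both classes equally.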
3.4. Generation of Single and Ensemble Models
Using the training dataset shown in Table 1, we trained a single ResNet-18 [25] model, whose input layer's size was 64 × 64 pixels, for each face part. The ResNet-18 models specialized for the eye, cornea, nose, mouth, and face are denoted by M_E, M_C, M_N, M_M, and M_F, respectively. The suffix of each single model represents the initial letter of the corresponding face part. Each single model is trained with the cross-entropy loss. Moreover, we generated ten ensemble models consisting of two single models, ten consisting of three single models, five consisting of four single models, and one consisting of all five single models. The type of each ensemble model is denoted using a suffix concatenating the suffixes of the corresponding face parts of the single models used. For example, M_ECN represents an ensemble model consisting of M_E, M_C, and M_N. We also trained another ResNet-18 model for the face, denoted by M_F', whose input layer is larger than 64 × 64 pixels, which enables analysis at a higher resolution.
In training, the cross-entropy loss was employed as the loss function, and Adam with a learning rate of 0.001 was used as the optimizer. The number of training epochs was set to 100, and the batch size was 32. The model that achieved the best validation accuracy during training was used for evaluation on the test dataset.
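Since each ensemble model combines its single models by simply averaging their predicted class probabilities (the simple-average choice is discussed further in Section 6), the fusion step can be sketched as follows. This is our own simplification: the probability values stand in for the softmax outputs of the trained ResNet-18 models, and all names are illustrative:

```python
def ensemble_fake_probability(part_probs):
    """Average the fake-class probabilities predicted by the single
    models that make up the ensemble (one value per face part)."""
    if not part_probs:
        raise ValueError("an ensemble needs at least one single model")
    return sum(part_probs) / len(part_probs)


def classify(part_probs, threshold=0.5):
    """Label the image fake when the averaged fake-class probability
    exceeds the decision threshold."""
    return "fake" if ensemble_fake_probability(part_probs) > threshold else "real"
```

For instance, an ensemble over the eye, cornea, and nose models averages the three per-part probabilities before thresholding, so a single uncertain part cannot dominate the decision.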
We show the F1-score, accuracy, and inference time for the five single models and the 26 ensemble models of different combinations of single models in Table 2. The inference time was measured with the batch size set to 256.
The first five rows of Table 2 represent the results of the five single models. They show that all face parts contribute significantly to deepfake detection. In this experiment, in particular, the single model specialized for the cornea, denoted by M_C, achieved the best F1-score. From the combined perspective of detection complexity, including the effort of annotation, and deepfake detection accuracy, the cornea was the most suitable part for deepfake detection, though there was room to improve the performance of the single models specialized for the other face parts. Because the cornea is the smallest facial part, it is difficult to detect in low-resolution images. Larger face parts, such as the nose and mouth, also achieve relatively high deepfake detection performance, and it is important to use these parts effectively.
Next, we analyzed the possibility of improving the deepfake detection performance by combining two single models. First, we focused on the cornea, whose single model achieved the best F1-score of 0.997; the ensembles of two models including the cornea slightly improved the F1-score to 0.998–0.999. Next, we focused on the mouth, whose single model achieved the worst F1-score of 0.977; the ensembles of two models including the mouth significantly improved the F1-score to 0.993–0.999. Similarly, every ensemble model achieved a higher F1-score than any of the single models it contains. Moreover, the ensembles of three (four) models achieved F1-scores of at least 0.997 (0.998). We can see that any combination of face parts has high potential to improve the deepfake detection performance.
Of all the single models, the cornea model and the face model achieved the fastest and slowest processing times, respectively. Although the inference times on the GPU were similar, the pre-processing required to resize each extracted patch to 64 × 64 pixels differed: the pre-processing for the face patch took more time because of its larger size.
We also compared the best ensemble models consisting of one to four single models with the single model M_F', which receives a higher-resolution face patch as input. The comparison results are shown in Table 3. As shown, the ensemble models were faster than M_F' and achieved comparable deepfake detection performance. These results show that using the entire face is not mandatory for deepfake detection. Even if only some parts of the face are used, the same level of detection accuracy can be achieved as when the entire face at a higher resolution is used. The ensemble model is also superior in terms of execution time and computational cost because the computational cost of each single model can be kept low.
In the dataset of Table 1, the number of real images differs from that of fake images for each face part. To relax the class imbalance problem of Table 1, we also trained the single models with the focal loss [27]. Let y ∈ {0, 1} specify the ground-truth class and p be the estimated probability for the class with y = 1. To simply represent the focal loss function, p_t is defined by

p_t = p if y = 1, and p_t = 1 − p otherwise.

The focal loss is defined by

FL(p_t) = −(1 − p_t)^γ log(p_t).

Based on the preliminary experiments, we set γ to 0.5, 0, 0.5, 1.0, and 2.0 for the eye, cornea, nose, mouth, and face, respectively.
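The focal loss above can be implemented in a few lines. The following sketch is our own; it reduces to the cross-entropy loss when γ = 0, which matches the setting used for the cornea:

```python
import math


def focal_loss(p, y, gamma):
    """Focal loss for binary classification.

    p: estimated probability of the class y = 1 (fake);
    y: ground-truth label (0 or 1);
    gamma: focusing parameter (gamma = 0 gives cross-entropy).
    """
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Per-part focusing parameters chosen in the preliminary experiments.
GAMMA = {"eye": 0.5, "cornea": 0.0, "nose": 0.5, "mouth": 1.0, "face": 2.0}
```

Because the modulating factor (1 − p_t)^γ shrinks as p_t grows, well-classified (easy) examples contribute less to the loss, which counteracts the class imbalance.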
Table 4 shows the evaluation results of the five single models trained with the focal loss and the 26 ensemble models. We can see that M_M, which suffers from serious class imbalance, improves its F1-score significantly by using the focal loss. The F1-scores for M_E and M_N also increase, while those for the cornea and face decrease. The trend is similar to that in Table 2. We can see that any face part can contribute to deepfake detection and that any combination of single models improves the F1-score.
5. Evaluation of Dynamic Ensemble Selection
We generated a face-part image dataset with which to train the ResNet-18 model specialized for each face part using the proposed face-part detector. To show that our approach can be widely applied, we used StyleGAN [29] images as the fake images instead of StyleGAN2. In Table 6, we show the number of training and validation images for each class. We used face images from 5000 images (00000-04999) of the FFHQ dataset and 5000 images (000000-004999) of the StyleGAN dataset. We also used test datasets consisting of 1894 images (06000-07999) from the FFHQ dataset and 1894 images (006000-007999) from the StyleGAN dataset.
First, we evaluated the performance of deepfake detection for a small area of the facial image. In this experiment, we used cornea and nose images that were extracted from the test images by the YOLOv8s model and scaled down to 64 × 64 pixels. Please note that the artifacts generated by StyleGAN are degraded by resizing. We compared the proposed method with four state-of-the-art methods [17,18,30,31]. The proposed method dynamically generates an ensemble model consisting of the models for the eye, cornea, mouth, and nose parts detected by the face-part detector with a threshold of 0.8. The method proposed in [17] employs a ResNet-50 model trained with data augmentation such as blurring and compression. The method in [18] uses a re-synthesizer to extract low-level and general visual cues. Reference [30] evaluated seven methods [13,17,32,33,34,35,36] and their variants and selected the best model: a modified ResNet-50 in which the downsampling of the first layer is removed. Reference [31] employed a static ensemble of EfficientNet-B4 models trained on orthogonal datasets; the different datasets include images depicting different semantic content as well as images post-processed and compressed in different ways.
Table 7 shows the F1-scores for the cornea and nose images. The existing methods are specialized for relatively high-resolution images and assume that almost the entire facial image is received. Although they achieve high deepfake detection accuracy for such images, they are not effective for deepfake detection using small facial parts. On the other hand, our approach identifies the face parts generated by a deepfake generator. Even for small face-part images of 64 × 64 pixels, we can extract artifacts useful for deepfake detection and analyze each face part precisely because each single model is specialized for extracting effective features from small-area, low-resolution, and degraded artifacts.
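The dynamic selection step can be sketched as follows: only the parts that the face-part detector reports with confidence at or above the threshold contribute their specialized single models to the per-image ensemble. The data layout below is our own simplification, not the paper's implementation:

```python
def dynamic_ensemble(part_results, conf_threshold=0.8):
    """Fuse per-part predictions for one image.

    part_results: mapping from face part name to a pair
    (detector confidence, fake-class probability predicted by that
    part's single model). Parts detected with confidence below the
    threshold are excluded from the ensemble.
    Returns (averaged fake probability or None, selected part names).
    """
    selected = [part for part, (conf, _) in part_results.items()
                if conf >= conf_threshold]
    if not selected:
        return None, []  # no reliably detected part: abstain
    probs = [part_results[part][1] for part in selected]
    return sum(probs) / len(probs), selected
```

An occluded or poorly detected part simply drops out of the ensemble, which is what makes the detector robust to partial occlusion.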
We also created four types of occluded test datasets from the original test dataset. Figure 6 shows examples of occluded images from each test dataset. The upper-left and lower-right coordinates of the occluded area were determined based on the bounding boxes of the detected parts. Figure 6a,b show test images with small eye occlusion (SEO) and large eye occlusion (LEO), respectively. In SEO, the right eye's upper-left (lower-right) coordinate is used as the upper-left (lower-right) coordinate of the occlusion region. LEO expands the occlusion region by multiplying the upper-left coordinates by 0.95 and the lower-right coordinates by 1.05. Figure 6c,d show test images with small mouth and nose occlusion (SMNO) and large mouth and nose occlusion (LMNO), respectively. SMNO (LMNO) uses the x-coordinate of the right eye and the y-coordinate of the nose for the upper-left corner, and the x-coordinate of the left eye and the y-coordinate of the mouth for the lower-right corner. LMNO further expands and shifts the occlusion region by multiplying the y-coordinates by 1.1. Moreover, we created a resized test dataset by scaling down all test images.
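The occlusion regions can be derived from the detected bounding boxes as in the following sketch; the (x1, y1, x2, y2) box format and the helper names are our own assumptions:

```python
def eye_occlusion(right_eye_box, large=False):
    """Occlusion region for the SEO/LEO test sets.

    right_eye_box: (x1, y1, x2, y2), the upper-left and lower-right
    coordinates of the detected right eye. LEO enlarges the region
    by scaling the upper-left corner by 0.95 and the lower-right
    corner by 1.05.
    """
    x1, y1, x2, y2 = right_eye_box
    if large:
        x1, y1, x2, y2 = x1 * 0.95, y1 * 0.95, x2 * 1.05, y2 * 1.05
    return x1, y1, x2, y2


def mouth_nose_occlusion(right_eye_box, left_eye_box, nose_box, mouth_box,
                         large=False):
    """Occlusion region for the SMNO/LMNO test sets: the x-range spans
    from the right eye to the left eye, and the y-range from the nose
    to the mouth. LMNO expands and shifts the region by scaling the
    y-coordinates by 1.1."""
    x1, y1 = right_eye_box[0], nose_box[1]
    x2, y2 = left_eye_box[2], mouth_box[3]
    if large:
        y1, y2 = y1 * 1.1, y2 * 1.1
    return x1, y1, x2, y2
```

Deriving the occlusion box from the detected parts keeps the occluded area proportional to the face size in each image.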
Next, Table 8 shows the F1-score of deepfake detection for non-occluded images. We also show the normalized F1-score, calculated as the F1-score for occluded/resized images divided by that for non-occluded images. Although the F1-score of the proposed method is slightly lower than that of [30], the proposed method not only achieves deepfake detection for small facial parts but also achieves the same level of accuracy as the state-of-the-art methods for the original face images. Based on the normalized F1-score, we analyzed how sensitive each method is to occlusion. For the method proposed in [17], occlusion decreased the normalized F1-score by 1.0–3.7%. Similarly, for the method proposed in [18], the normalized F1-score was reduced by 1.0–3.0%. In contrast, the reduction for the proposed method was only 0.30–0.50%. The proposed method appropriately selects face parts and realizes deepfake detection that is more robust against occlusion. The methods proposed in [30,31] also achieved high normalized F1-scores. For resized images, the methods proposed in [17,18,31] reduced the normalized F1-score significantly, whereas the proposed method and the method proposed in [30] maintained a high normalized F1-score. We show examples of deepfake detection results in Figure 7.
6. Discussion
There are many existing studies on deepfake detection based on deep learning, such as [13,14,15,16,17,18,30,31], which have shown that high accuracy can be achieved when the model used to generate the fake images is identified. However, whether these methods can be applied to cases in which parts of the image have been hidden or additionally edited has not been adequately discussed. In such cases, only part of the artifacts generated by a deepfake generator is available for deepfake detection. The analysis results in this paper show that the accuracy of most existing methods, which extract features from the entire face, can be greatly degraded by partial occlusion and editing such as resizing.
On the other hand, we demonstrated that information from the entire face is not necessary for highly accurate deepfake detection. Even small regions corresponding to facial parts, such as the cornea, eyes, nose, and mouth, contain sufficient information to discriminate fake images. The existing models are suitable for extracting effective features from a relatively wide area of the facial image. However, it is difficult for them to extract effective features from a small area because only a small amount of contaminated artifacts can be obtained. We proposed simple but powerful CNN models that are specialized for individual face parts. These CNN models can concentrate on extracting effective features from images of similar textures, and they are more robust to the contamination and degradation of artifacts. This enables us to detect a deepfake as long as the artifacts generated by the deepfake generator remain in a small area. We also showed that the accuracy can be further improved by using an ensemble of deepfake detectors that focus on different parts of the face. Based on these results, the proposed method achieves deepfake detection that is robust against partial occlusion and editing.
Another advantage of the proposed method is its ability to show users the class probability for each facial part in addition to that for the whole facial image, as shown in Figure 7. This is a new feature for a detector of synthesized images. If the artifacts remain only partially because of significant editing, there is a risk that a conventional method, or the proposed ensemble model evaluated on the entire face image, will falsely classify a fake image as real. The proposed method can provide class probabilities for the major parts of the face, thus alerting the user to the possibility that the image was generated by a GAN.
On the other hand, there is still room for improvement in the proposed method. Because the proposed method assumes that a 64 × 64 low-resolution image is input to each single model, facial parts must be scaled down before being fed to the single models, even when the input image has a higher resolution. This unavoidably degrades the artifacts of the high-resolution image. The images that failed to be classified in the experiment might be classified correctly by a CNN model that takes higher-resolution images as input. To achieve more reliable deepfake detection, it may be necessary to train CNN models specific to different resolutions and to adaptively select one based on the resolution of the input image. Additionally, deepfake detection fails when the bounding box estimation of the face-part detection is inaccurate. In this paper, we adopted a simple average for the proposed ensemble model because it is one of the simplest approaches and has a low computational cost, and our experimental results showed that even a simple average can improve the accuracy effectively. However, not all detected face parts are equally effective for deepfake detection. Therefore, the weight of each face part should be determined based on an overall judgment of the patch size, the bounding box accuracy, and the ease of artifact extraction.
Deepfake generalization performance, which is discussed in [17,18], is an important issue. This paper presents an in-depth exploration of the possibility of deepfake detection using face parts and does not address the improvement of generalization performance. However, the approaches for improving generalization performance via data augmentation and image resynthesis proposed in [17,18] can be introduced into the proposed method, and it would be interesting to analyze the synergistic effects of such combinations.